Ind. Eng. Chem. Res. 1995, 34, 4161-4165


Massively Parallel Self-Consistent-Field Calculations

Jeffrey L. Tilson†
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois 60439

The advent of supercomputers with many computational nodes each with its own independent memory makes possible very fast computations. Our work is focused on the development of electronic structure techniques for the solution of Grand Challenge-size molecules containing hundreds of atoms. Our efforts have resulted in a scalable direct-SCF (self-consistent-field) Fock matrix construction kernel that is portable and efficient. Good parallel performance is obtained by allowing asynchronous communications and using a distributed data model. These requirements are easily handled by using the Global Array (GA) software library developed for this project. This algorithm has been incorporated into the new program NWChem.

1. Introduction

Advances in theoretical chemistry over the past two decades have consistently improved the ability of electronic structure calculations to accurately predict from first principles the structure, spectra, and energetics of molecules. Such predictions permit theoretical determinations of both thermochemistry and kinetics, fundamental information for all chemical processes. As might be expected, the more accurate and computationally intensive methods are restricted to smaller molecular systems. But even for the simpler ab initio electronic structure techniques, there is a modest limit to the size of molecules that can be feasibly studied. Many practical problems in chemistry today require information on molecules too large for conventional electronic structure codes to feasibly handle. The fate of chlorofluorocarbon alternatives in the atmosphere, the properties of ligand substituents in polymerization catalysis, and the mechanism of enzymatic destruction of toxins are all examples of current academic and industrial research areas where important electronic structure applications are frequently too large to be feasibly done.

Improvements in theoretical chemistry have frequently exploited advances in computer hardware. At present, such an advance is occurring through the use of massively parallel processors (MPPs), high-speed networks of hundreds to thousands of computers. The economies of scale make such hardware the least expensive route to assembling large-scale computational resources. In recognition of this direction in computer architecture, theoretical chemists and computer scientists at Argonne National Laboratory, Pacific Northwest Laboratory, three major oil companies, and one major chemical company have formed a collaboration to adapt electronic structure methods to MPP architectures for rapid modeling of large molecular systems. This paper reports the results of one particular MPP adaptation, that of the self-consistent-field (SCF) [Roothaan (1951)] electronic structure method and, in particular, of the Fock matrix construction step. The results to date suggest that efficient coding for MPP technology can qualitatively change the size of molecule that can be treated by the SCF method.

† E-mail: [email protected]. Phone: 708-252-4470.

Efficient coding for the MPP architecture is not straightforward. When designing algorithms for these computers, important issues include avoiding replicated computation (computational efficiency), distributing data structures so as to avoid wasting memory (data distribution), distributing computations to processors so as to avoid idle time when one processor is busy and others are not (load balance), and minimizing the time spent sending and receiving messages (communication efficiency). A metric that integrates these different criteria is scalability: the extent to which an algorithm is able to solve larger problems as the number of processors is increased. A sound methodology when developing parallel algorithms is to begin by examining alternative algorithms at a theoretical level. Only after scalability has been established theoretically should effort be devoted to implementing the algorithm.

The SCF method, in addition to being important in its own right, is the starting point for more rigorous methods and typifies their use of large data structures and irregular data access patterns. Because of its importance, others have developed various parallel MPP SCF codes [see Harrison and Shepard (1994) and references therein]. However, the code reported here is (in our opinion) the most scalable SCF code currently available.

2. SCF Wave Functions

The SCF wave function is constructed from an antisymmetric product of single-particle functions. This is the molecular orbital (MO) approximation. These MOs represent the motion of an individual particle (electron) within the field of all the remaining electrons and a static (clamped) configuration of nuclei. This wave function form and the approximate Hamiltonian yield the SCF energy and wave function.

Solution of the SCF problem has been shown to be useful in the determination of the nuclear configuration. Estimates to within 0.1-0.2 Å for main-group elements are not uncommon. This makes the SCF solution useful for examining molecular geometries. Other quantities of interest are the binding energy and the orbital energies. The SCF gives reasonable binding energies for many molecules at their equilibrium geometry. It does not, however, describe the important structural correlations in molecules. This limitation, for example, prevents an accurate description of bond breaking and forming. The individual orbital energies may be used for a qualitative analysis of the electronic spectra. A very important aspect of the SCF equations is their role as a starting point for more accurate, higher order methods. These methods generally correct the set of SCF MOs, making the SCF technique an integral part of them.

2.1. Formalism. The total energy of a molecular system constructed from a set of occupied orthonormal



MOs (φ_i) is

energy = \sum_{ij}^{electrons} h_{ij} D_{ij} + \sum_{ijkl}^{electrons} g_{ijkl} d_{ijkl}    (1)

The terms h_{ij} and g_{ijkl} are one- and two-electron integrals, respectively, and are independent of the precise form of the wave function,

h_{ij} = \int \phi_i^*(1) \, h(1) \, \phi_j(1) \, d\tau_1

g_{ijkl} = \int\int \phi_i^*(1) \phi_j(1) \, r_{12}^{-1} \, \phi_k^*(2) \phi_l(2) \, d\tau_1 \, d\tau_2

h denotes the standard one-particle operator and r_{12} the interparticle distance; dτ_i is the differential volume element for particle i. The form of the wave function influences the structure of the one- and two-particle density matrices, D_{ij} and d_{ijkl}. Generally, the one-particle density matrix is simple to generate, becoming a delta function for the SCF, D_{ij} = δ_{ij}. The two-particle density matrix typically requires a substantial amount of effort for more accurate, highly correlated electronic structure methods (MCSCF, MRCI, full CI, etc.) and can be a substantial computational process. This matrix, however, takes on a particularly simple structure for SCF wave functions, becoming sums of products of the one-particle densities. The simplicity of the SCF two-particle density matrix shifts the computational burden onto the generation of the integrals themselves. The SCF total energy may be simplified by substituting into eq 1 the nonzero values for the density matrices. For closed-shell systems, the energy becomes

energy = 2 \sum_{i}^{occupied} h_{ii} + \sum_{ij}^{occupied} (2 g_{iijj} - g_{ijij})    (2)

This equation satisfies the requirements necessary for application of the variation principle. In essence, the best MOs will result in the lowest (best) SCF energy. Hence, one can find the best MOs by minimizing eq 2 subject to orbital orthonormality. The details of this derivation are widely available, and so only the results are presented here. This minimization results in the total energy expression

energy = \sum_{i} (h_{ii} + F_{ii})    (3)

where F_{ij} is the MO Fock matrix,

F_{ij} = h_{ij} + \sum_{k}^{occupied} (2 g_{ijkk} - g_{ikjk})    (4)

This energy can be determined exactly by using numerical techniques, but only for very small systems. This exact solution for the SCF wave function is called the Hartree-Fock solution. Modern implementations of the SCF procedure parametrize the MOs by using a finite set of basis functions, χ_μ, called atomic orbitals (AOs), with expansion coefficients, C:

\phi_i = \sum_{\mu} C_{\mu i} \chi_{\mu}    (5)

The basis functions are linearly independent functions with metric S_{μν}, selected to simplify the calculation of the two-particle integrals.

The (closed-shell) MO density matrix, when transformed to the AO basis and with the electron spin integrated out, becomes

D^{AO}_{\mu\nu} = 2 \sum_{k}^{occupied} C_{\mu k} C_{\nu k}    (6)

or, in matrix form,

D^{AO} = 2 C C^{\dagger}    (7)

Substitution of eqs 5 and 7 into eq 4 results in the canonical AO Fock matrix,

F^{AO}_{\mu\nu} = h_{\mu\nu} + \sum_{\lambda\sigma} D^{AO}_{\lambda\sigma} [g_{\mu\nu\lambda\sigma} - (1/2) g_{\mu\lambda\nu\sigma}]    (8)

where the integrals are now over the AO functions. The optimized MOs are determined by finding the optimal coefficients, C, that satisfy the nonorthogonal matrix eigenvalue problem

F^{AO} C = S C E    (9)

C is the matrix of MO coefficients (the wave function); S and E are the metric and the eigenvalues (orbital energies), respectively. In the limit of a complete basis, the true Hartree-Fock solution is attained. Equations 2-9 give us a prescription for solving the SCF problem:
1. Select a basis set, χ_μ.
2. Select an initial coefficient matrix, C, and generate the current density matrix, D^AO.
3. Construct the matrix F^AO using the current D^AO and generating the AO integrals.
4. Solve the generalized eigenvalue problem of eq 9 to obtain the new orbitals.
5. Check the new orbitals for self-consistency. If they have not converged, construct a new D^AO matrix and repeat.
Once the converged orbitals are found, eq 3 is solved, and the SCF calculation is finished.

2.2. Algorithm. Efficient, scalable SCF software requires a detailed understanding of the SCF algorithm. The principal operations in the SCF procedure consist of two computational kernels that are repeated until a self-consistent solution of eq 9 is obtained. These steps are the generation of the AO integrals to construct the AO Fock matrix and the diagonalization of the nonorthogonal SCF equation. The two-electron integrals used to construct the Fock matrix depend on four indices. These indices sample the space of AO basis functions; therefore, the number of integrals theoretically grows as O(N_basis^4), becoming large for even small problems. As an example, a small hydrocarbon might require 100 basis functions for an adequate representation of the electron field. This requirement results in O(10^8) bytes of memory to store all the integrals. This exorbitantly high growth in storage costs forces the algorithm either to off-load these integrals to disk or to recalculate them as needed.

AO Fock Construction
DO i = 1, N
  DO j = 1, i
    IF (i,j pair survives screening) THEN
      DO k = 1, i
        IF (k.EQ.i) lhi = j
        IF (k.NE.i) lhi = k
        DO l = 1, lhi
          IF (k,l pair survives screening) THEN
            EVALUATE I = (ij|kl)
            F_ij = F_ij + D_kl * I
            F_kl = F_kl + D_ij * I
            F_ik = F_ik - (1/2) D_jl * I
            F_il = F_il - (1/2) D_jk * I
            F_jl = F_jl - (1/2) D_ik * I
            F_jk = F_jk - (1/2) D_il * I
          ENDIF
        ENDDO
      ENDDO
    ENDIF
  ENDDO
ENDDO
Figure 1. Basic logic for Fock matrix construction.
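In code form, the basic logic of Figure 1 might look as follows. This is a serial, illustrative Python transcription, not the NWChem kernel; eri and significant are hypothetical stand-ins for the integral evaluator and the screening test, and, like the figure, it omits the special weight factors needed for the diagonal index cases.

import numpy as np

def fock_build(D, eri, significant, N):
    """Serial sketch of Figure 1: accumulate F from unique integrals.

    D           : (N, N) symmetric AO density matrix
    eri(i,j,k,l): returns the two-electron integral (ij|kl)  [assumed helper]
    significant : screening test for an index pair           [assumed helper]
    """
    F = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1):
            if not significant(i, j):
                continue
            for k in range(i + 1):
                lhi = j if k == i else k
                for l in range(lhi + 1):
                    if not significant(k, l):
                        continue
                    I = eri(i, j, k, l)
                    # Coulomb-like contributions
                    F[i, j] += D[k, l] * I
                    F[k, l] += D[i, j] * I
                    # exchange-like contributions
                    F[i, k] -= 0.5 * D[j, l] * I
                    F[i, l] -= 0.5 * D[j, k] * I
                    F[j, l] -= 0.5 * D[i, k] * I
                    F[j, k] -= 0.5 * D[i, l] * I
    # NOTE: as in Figure 1, the extra weights for i==j, k==l, or (ij)==(kl)
    # and the final symmetrization of F are omitted for clarity.
    return F

# Example wiring (toy): eri = lambda i, j, k, l: g[i, j, k, l]
#                       significant = lambda p, q: True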

The integrals are constructed from localized basis functions (AOs), which introduces a considerable amount of sparsity. This sparsity and the relatively poor I/O capabilities of many computers favor a recomputation strategy. This type of SCF algorithm is denoted the direct-SCF method [Almlof et al. (1982)] and is the method selected for our work. Each integral is independent, and, hence, integrals may be grouped (blocked) into any convenient order. The integrals, however, may differ in computational effort by O(10^2) arithmetic operations. In a sequential environment this is of little consequence, but it is an important issue in a parallel environment. Furthermore, the symmetry properties of the two-electron integrals and of the AO F and D matrices result in a given integral contributing to at most six elements of the AO F matrix and requiring at most six elements of the AO D matrix. A generic AO F construction algorithm is displayed in Figure 1.

The second kernel is the determination of the new coefficients, C. Equation 9 may be solved directly to obtain the optimum orbitals. Efficient parallel construction of the AO F matrix, however, will result in a sequential diagonalization becoming a bottleneck. In a separate paper we address alternatives in optimization techniques for parallel computers.
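The two kernels, and the five-step prescription of section 2.1, can be tied together in a small sketch. The following Python fragment is purely illustrative and is not part of NWChem; it assumes the one-electron matrices h and S and a full AO two-electron integral tensor g (chemist's notation) are already available for a closed-shell system with nocc doubly occupied orbitals, and it stores g densely, which is exactly what a direct-SCF code avoids.

import numpy as np
from scipy.linalg import eigh

def toy_scf(h, S, g, nocc, max_iter=50, tol=1e-8):
    """Dense, serial sketch of the SCF prescription (steps 1-5).

    h, S : (N, N) one-electron and overlap (metric) matrices
    g    : (N, N, N, N) AO two-electron integrals, chemist's notation
    nocc : number of doubly occupied MOs
    Returns the electronic energy (no nuclear repulsion) and the MO coefficients.
    """
    C = eigh(h, S)[1]                      # step 2: initial guess from the core Hamiltonian
    E_old = 0.0
    for it in range(max_iter):
        Cocc = C[:, :nocc]
        D = 2.0 * Cocc @ Cocc.T            # eq 7: D^AO = 2 C C^T
        J = np.einsum("mnls,ls->mn", g, D) # Coulomb contribution
        K = np.einsum("mlns,ls->mn", g, D) # exchange contribution
        F = h + J - 0.5 * K                # eq 8: AO Fock matrix
        E = 0.5 * np.sum(D * (h + F))      # AO-basis analogue of eq 3
        eps, C = eigh(F, S)                # step 4 / eq 9: F C = S C E
        if abs(E - E_old) < tol:           # step 5: self-consistency check
            return E, C
        E_old = E
    return E, C

A production code never stores the full g tensor; as discussed above and in section 3, the integrals are regenerated in screened, blocked batches on each iteration.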

3. Parallel SCF

In this section we describe the AO F construction algorithm and present preliminary results for our MPP-based SCF program named NWChem. A complete description of NWChem is available in a recent review [Harrison and Shepard (1994)]. Here we summarize the important points for the F construction step of our scalable distributed-data SCF algorithm.

For Grand Challenge-size problems this parallelism must address not only greatly reducing the time for solution but also efficiently managing the aggregate memory of the computer. For example, the number of integrals required for a problem of size N_basis = 100 is on the order of 10^8. If each integral requires on average 1000 arithmetic operations and the chosen CPU executes at 40 Mflop/s (millions of floating-point operations per second), the time to generate one integral is 25 μs. A calculation on decane with N_basis = 250 would require approximately 3.4 h per iteration. The parallel algorithm must also address the memory requirements of persistent matrix data required for the calculation.
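One way to reproduce the quoted 3.4 h figure is the simple count below (unique integrals taken as roughly N^4/8 by eightfold permutational symmetry, 1000 operations each, one 40 Mflop/s node); it is a back-of-the-envelope check, not a performance model.

N = 250                         # decane, N_basis = 250
unique_integrals = N**4 / 8     # eightfold permutational symmetry
ops_per_integral = 1000         # average arithmetic operations per integral
rate = 40e6                     # 40 Mflop/s on one node

seconds = unique_integrals * ops_per_integral / rate
print(f"{seconds / 3600:.1f} h per Fock build")   # ~3.4 h, matching the text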

The solution of a typical SCF problem requires the storage of O(10) persistent matrices, each of size N_basis × N_basis elements. The SCF solution of large molecular problems requires many matrices of dimension N_basis = O(10^2-10^4), stored as double-precision words. A typical direct-SCF calculation would then require aggregate memory of nearly O(10^8) bytes.

3.1. Replicated-Data Model. Several mature programs are now widely available on parallel architectures. The focus of these initial efforts was to use parallel computers to greatly decrease the turnaround time of a calculation. This was accomplished (most often) by using the direct-SCF technique and parallelizing the integral generation step. The density and Fock matrix data are replicated on all computational nodes. Batches of integrals are grouped into a computational task, which is allocated to a waiting node. These integrals are contracted with the locally available AO D matrix to create a partial AO F matrix. The partial AO F matrices are summed together onto one node for further processing of eq 9. The resulting orbitals are then replicated back onto all nodes. This technique constructs an F matrix efficiently, achieves good speedup on large numbers of nodes, and is fairly straightforward to implement in existing programs. The algorithms, however, are not inherently scalable, since memory storage is limited to that available on a single node.

3.2. Distributed-Data Model. Several models of scalable Fock matrix construction algorithms have been previously analyzed [Foster et al. (1995)]. The resulting initial program has been thoroughly discussed [Harrison et al. (1995), Harrison and Shepard (1994)]. We summarize the important parallel details here. The scalable construction of the AO F matrix requires that the integrals be allocated dynamically and that the matrix data be distributed about the memory of the MPP. Nonlocal data requirements must be satisfied without unduly synchronizing the computational process. These requirements are difficult to satisfy by using a traditional point-to-point communications scheme. The integrals vary greatly in their computational effort, and, equally important, for a given integral the actual amount of F and D data required depends upon the indices. Dynamic data caching increases the difficulty of data management.

We first partition all AO matrix data (D, F, S, etc.) into atomic blocks. These blocks are different-sized submatrices with indices that span all basis functions for a given atomic center. These blocks are then arbitrarily allocated to the different nodes of the computer. We generate integrals in large atomic blocks (each of the four indices spans all basis functions for the given atom) and dynamically allocate these blocks to nodes using a shared counter. When a node requests an integral block to compute, a check on sparsity is performed; then, the appropriate blocks of the AO D matrix are fetched. The resulting AO F matrix blocks are accumulated back to their assigned nodes. The simplicity of this algorithm is complicated by the varying data requirements of different integral blocks.

Our communications are performed with a new library of software functions that emulate a shared-memory model using the primitive message-passing capabilities of the MPP. These Global Arrays support a lightweight one-sided communications model, thereby greatly simplifying development of our scalable program [Harrison (1993), Nieplocha et al. (1994)].
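The essential control flow of this distributed-data scheme (a shared counter handing out atomic integral blocks, one-sided fetches of the needed D blocks, and one-sided accumulates of the resulting F blocks) can be emulated in a few lines of Python. The sketch below is a serial, self-contained toy; the block sizes, the dictionary-of-blocks storage, and the dummy compute_block routine are stand-ins and bear no relation to the actual Global Array or NWChem interfaces.

import itertools
import numpy as np

natoms = 4
bf_per_atom = [3, 5, 2, 4]                      # basis functions per atomic center
rng = np.random.default_rng(0)

# "Distributed" matrices, emulated here as dictionaries of atomic blocks.
D = {(a, b): rng.standard_normal((bf_per_atom[a], bf_per_atom[b]))
     for a in range(natoms) for b in range(natoms)}
F = {(a, b): np.zeros((bf_per_atom[a], bf_per_atom[b]))
     for a in range(natoms) for b in range(natoms)}

counter = itertools.count()                     # stands in for the shared counter
tasks = [(a, b, c, d) for a in range(natoms) for b in range(a + 1)
         for c in range(a + 1) for d in range(c + 1)]   # atom-block quartets

def compute_block(a, b, c, d, Dcd):
    """Dummy 'integral block' contraction: returns a fake partial F block."""
    return np.ones((bf_per_atom[a], bf_per_atom[b])) * Dcd.sum()

while True:
    t = next(counter)                           # dynamic load balancing
    if t >= len(tasks):
        break
    a, b, c, d = tasks[t]
    Dcd = D[(c, d)]                             # one-sided "get" of a D block
    F[(a, b)] += compute_block(a, b, c, d, Dcd) # one-sided "accumulate" into F
    # (a real task touches up to six D and six F blocks; one of each is shown)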

4164 Ind. Eng. Chem. Res., Vol. 34, No. 12, 1995

In developing software for a typical message-passing environment, data are transferred between nodes by coding explicit SEND and RECEIVE calls in the program. A program written in this way essentially blocks the progress of the calculation until both nodes have satisfied their respective communications. This approach, therefore, places an effective synchronization step into the program, severely degrading the performance of the direct-SCF. The Global Array library eliminates this explicit synchronization. It allows the programmer to simply insert into a code a REQUEST for data; no companion SEND need be made. This local request activates a mechanism that finds the data, interrupts the work on the node holding the data, and commands that node to participate in the data communication. The interrupted node then resumes its work. If the data are local to the requesting node, no messages are sent. The overhead associated with this type of communication is higher than that of a primitive message-passing function but is not inhibiting. A typical Global Array request on the Intel Delta has a latency of 300 μs. The much greater integral load balance obtained in this way and the use of appropriate data blocking greatly compensate for the slightly higher latency. The Global Arrays are capable of several kinds of one-sided communications (READ, SEND, ACCUMULATE, etc.) and also support all traditional point-to-point communications. The library is currently portable to several different parallel architectures.

The simplicity of using Global Arrays to write distributed-data applications does not obviate the need for algorithm modeling. The applications engineer must still consider the memory and network characteristics of the target computer and the usage of data within the algorithm for an efficient implementation, that is, effective data blocking and the appropriate chunking of the parallel tasks of the algorithm.

Once the F matrix is constructed, the optimum orbitals must be generated. We have the capability to perform the generalized eigenvalue analysis in parallel. The scalability, however, is worse than that of the F matrix construction because of the nature of the diagonalization algorithm [Littlefield and Maschhoff (1993)]. A manuscript of direct-SCF results including the optimization step is in preparation. Generally, the difficulty in diagonalization led us to investigate alternative schemes. These techniques are second-order convergence techniques that try to find the minimum SCF energy within the space of parameters, C. A recent paper [Wong and Harrison (1995)] compares various techniques for direct-SCF calculations. We are currently exploring the use of a simultaneous vector expansion method, as suggested by Shepard (1993), for overlapping computational effort. These second-order techniques can greatly accelerate the time for solution for some kinds of problems. They are strongly dependent, however, on the initial guess of C and so do not always exhibit quadratic convergence. The importance of these techniques is in their exposing highly scalable AO F constructions to the optimization scheme.
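The one-sided REQUEST semantics described above can be illustrated with a small thread-based emulation: the requester fetches data owned elsewhere without any matching SEND being coded on the owner's side. This is a plain-Python toy of the idea only; the real Global Array library implements it with interrupt-driven messages on the MPP itself and has a different interface.

import threading
import queue
import numpy as np

owned_data = {"D_block_7": np.arange(12.0).reshape(3, 4)}  # data living on the "owner" node
requests = queue.Queue()

def owner_service():
    """Runs on the owning node: services incoming get requests between its own work."""
    while True:
        name, reply = requests.get()
        if name is None:                      # shutdown signal
            return
        reply.put(owned_data[name].copy())    # satisfy the one-sided get

def one_sided_get(name):
    """Requester side: issue a request and wait for the data; no explicit SEND is coded."""
    reply = queue.Queue()
    requests.put((name, reply))
    return reply.get()

server = threading.Thread(target=owner_service)
server.start()
block = one_sided_get("D_block_7")            # one-sided fetch of a remote block
print(block.shape)                            # (3, 4)
requests.put((None, None))                    # stop the service loop
server.join()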

4. Benchmarks

A set of molecular problems has been assembled to demonstrate the performance of this algorithm. These problems include simple alkanes and transition-metal-containing species. Additional benchmarks have also been published [Harrison et al. (1994)]. The largest alkane problem, C20H44, represents the interaction of two decane molecules.

Table 1. Speedup Characteristics of NWChem on the Intel Touchstone Delta and IBM SP1 Computers^a

                          Intel Touchstone Delta              IBM SP1
molecule    NBF    time per AO F   nodes   speedup     time per AO F   nodes   speedup
C4H10       110        1140.51        1      1              218.43        1       1
            110         580.63        2      1.96           134.77        2       1.62
            110         293.65        4      3.88            63.2         4       3.45
            110          74.08       16     15.39            15.13       16      14.44
            110          37.31       32     30.57            ND          ND      ND
C20H44      520        1287.79       32     32             1060.39        8       8
            520         860.90       48     47.8            539.25       16      15.7
            520         647.73       64     63.6            272.04       32      31.2
            520         333.11      128    123.7            184.19       48      46.1
            520         188.51      256    218.6            141.11       64      60.1
cobalt      114         891.93        2      2              330.18        1       1
            114         453.05        4      3.9            247.75        2       1.33
            114         233.16        8      7.6            106.98        4       3.1
            114         121.47       16     14.7             50.33        8       6.6
            114          63.84       32     27.9             ND          ND      ND
biphenyl    324        2291.61       16     16              846.48        8       8
            324        1148.13       32     31.9            429.33       16      15.8
            324         575.86       64     63.7            260.85       32      25.96
            324         290.59      128    126.2            131.73       64      51.4
titanium    147        3008.09        4      4              713.26        4       4
            147         773.48       16     15.6            364.51        8       7.8
            147         400.13       32     30.1            186.64       16      15.3
            147         212.82       64     56.5             95.69       32      29.8
            147         119.3       128    100.8             50.55       64      56.4

^a All times are for one AO F matrix construction in seconds. Speedups for a given molecular species are relative to the time measured on the fewest number of nodes. ND = not done.

The three other benchmarks presented are (a) (C5H5)Co(NO)(CH3), designated cobalt; (b) ((Cp)2(CH2))TiCl2, designated titanium; and (c) 2,2'-bis(trifluoromethyl)biphenyl, named biphenyl. We note that all total energies have been verified by independent calculations.

The speedup is a measure of the efficiency with which parallelism has been implemented. If a program executes in time T(1) on a single node and in time T(P) on P nodes, a speedup (SU) may be defined as

SU(P) = T(1)/T(P)

If the program is perfectly parallelized, T(P) = T(1)/P and SU(P) becomes simply P. A parallel efficiency may be calculated as SU(P)/P × 100%.
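As a concrete check, the efficiency quoted below for the biphenyl benchmark can be recovered directly from the Table 1 timings (speedups in the table are referenced to the smallest node count, here 16 Delta nodes):

# Biphenyl on the Intel Delta, from Table 1: 2291.61 s on 16 nodes, 290.59 s on 128 nodes.
t_ref, p_ref = 2291.61, 16
t, p = 290.59, 128

su = p_ref * t_ref / t          # speedup referenced to the 16-node run
eff = su / p * 100.0            # parallel efficiency in percent

print(f"SU = {su:.1f}, efficiency = {eff:.1f}%")   # SU ~ 126.2, efficiency ~ 98.6%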

Table 1 lists the time to construct the AO F matrix on the IBM SP1 and Intel Touchstone Delta computers as a function of the number of nodes. Analytical performance models predict that the speedup will approach 90-95% of the ideal for very large problems. This is borne out in Table 1, where the biphenyl benchmark on the Delta computer reaches 98% of the ideal speedup.

These benchmarks were also analyzed by using an available program on the Cray YMP-C90 computer. This program is implemented using a shared-memory model. In this model the communications overhead is very cheap, since relatively few messages are sent. We observe a speedup on the 16-node YMP-C90 of close to 15. The times per AO F construction for these YMP-C90 tests are collected in Table 2. We find that a YMP-C90 node is generally 15-20 times faster than a node on the Intel Delta and 3-5 times faster than a node on the IBM SP1. These ratios are generally consistent with the peak floating-point rates of these computers. We find that calculations on the Delta and SP1 can be made to run faster than on the YMP-C90 by application of enough computational nodes. Recent results indicate that the shared-memory program executed on the Cray YMP-C90 is not the highest performance code available.

Table 2. Time for an AO F Matrix Construction on the Cray YMP-C90 Using a Commonly Available ab Initio Package

molecule    no. of nodes    time per AO F (s)
C4H10            1                78.24
C4H10            4                19.59
C20H44           1              2490.23
C20H44           2              1204.26
cobalt           1                64.36
titanium         1               382.47
biphenyl         8               474.5

We believe, however, that the limitation in performance is caused by an older integral generation algorithm. This algorithm is expected to be of similar performance to the current integral code within NWChem. Newer integral generation technology can compute integrals faster by a factor of 10. We expect that the inclusion of these faster integrals into NWChem (in progress) will not affect the parallel performance but will decrease the time for solution.

The scalability of NWChem is found to be reasonable for large molecular problems. The somewhat lessened performance for the smaller benchmarks is not an issue, since this MPP software is designed for the solution of massive problems that are not currently possible because of memory limitations. In particular, the SU begins to decrease only when the number of processors (P) becomes large relative to the size of the problem. Clearly, for large problems (N_basis = 1000) high performance is expected. This high performance stems primarily from the use of integral and data blocking and from the asynchronous communications made possible by Global Arrays.
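The quoted per-node speed ratios can be checked against Tables 1 and 2 by scaling a parallel timing back to a notional one-node time under the assumption of near-ideal speedup (which Table 1 supports for these runs); the C20H44 benchmark is used below.

# C20H44 Fock-build times: Cray YMP-C90 (Table 2) vs Delta and SP1 (Table 1).
t_ymp_1node = 2490.23            # 1 YMP-C90 node
t_delta_32  = 1287.79            # 32 Delta nodes
t_sp1_8     = 1060.39            # 8 SP1 nodes

# Scale the MPP runs to an equivalent single-node time (ideal-speedup assumption).
t_delta_1node = 32 * t_delta_32
t_sp1_1node   = 8 * t_sp1_8

print(f"YMP node vs Delta node: {t_delta_1node / t_ymp_1node:.1f}x")   # ~16.5x
print(f"YMP node vs SP1 node:   {t_sp1_1node / t_ymp_1node:.1f}x")     # ~3.4x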

5. Conclusion

Our efforts have resulted in a scalable and efficient direct-SCF program called NWChem. We have benchmarked and validated the program on two currently available MPP computers. Comparisons with a fully functional replicated-data direct-SCF code on the Cray YMP-C90 indicate that MPP performance is competitive with that of traditional vector supercomputers when appropriately designed scalable software is used.

This work has focused on the efficient use of multiple processors and a high-speed network. Future efforts must address utilization of all MPP resources, especially I/O. In the direct-SCF, the recomputation of integrals eliminates the potentially massive storage of integrals at the cost of additional arithmetic operations. This fortuitous trade-off is not necessarily applicable to other electronic structure algorithms nor to algorithms in general. We are now beginning to address the issues of parallel I/O and its application to remote data storage. This effort is not limited to the direct-SCF. Several parallel projects are currently in place, including MP2, SCF gradients, and MCSCF. These techniques will allow us to fully optimize SCF geometries and determine corrections to the SCF wave function.

Acknowledgment

This work is the result of efforts at Argonne National Laboratory and Pacific Northwest Laboratory including Drs. R. J. Harrison, R. A. Kendall, M. Minkoff, A. F. Wagner, and R. Shepard. This work was supported through the U.S. Department of Energy by the Mathematical, Information, and Computational Science Division of the Office of Computational and Technology Research, by the Chemical Sciences Division of the Office of Basic Energy Sciences, and by the Office of Health and Environmental Research, which funds the Pacific Northwest Laboratory Environmental Molecular Sciences Laboratory Project D-384. This work was performed under Contract W-31-109-Eng-38 (Argonne National Laboratory) and under Contract DE-AC06-76RLO 1830 with Battelle Memorial Institute (Pacific Northwest Laboratory). This research was performed in part using the Intel Touchstone Delta System operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by Argonne National Laboratory. The author gratefully acknowledges use of the Argonne High-Performance Computing Research Facility. The HPCRF is funded principally by the U.S. Department of Energy Office of Scientific Computing.

Literature Cited

Almlof, J.; Faegri, K.; Korsell, K. Principles for a Direct SCF Approach to LCAO-MO Ab-Initio Calculations. J. Comput. Chem. 1982, 3, 385-399.
Foster, I. T.; Tilson, J. L.; Shepard, R.; Wagner, A. F.; Harrison, R. J.; Kendall, R. A.; Littlefield, R. L. Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms. J. Comput. Chem. 1995, in press.
Harrison, R. Moving Beyond Message Passing: Experiments with a Distributed-Data Model. Theor. Chim. Acta 1993, 84, 363-375.
Harrison, R. J.; Shepard, R. Ab Initio Molecular Electronic Structure on Parallel Computers. Annu. Rev. Phys. Chem. 1994, 45, 623-658.
Harrison, R. J.; Guest, M. F.; Kendall, R. A.; Bernholdt, D. E.; Wong, A. T.; Stave, M.; Anchell, J.; Hess, A. C.; Littlefield, R. L.; Fann, G. L.; Nieplocha, J.; Thomas, G. S.; Elwood, D.; Tilson, J. L.; Shepard, R.; Wagner, A. F.; Foster, I. T.; Lusk, E.; Stevens, R. Toward High Performance Computational Chemistry: II. A Scalable SCF Program. J. Comput. Chem. 1995, in press.
Littlefield, R.; Maschhoff, K. Investigating the Performance of Parallel Eigensolvers for Large Processor Counts. Theor. Chim. Acta 1993, 84, 457-473.
Nieplocha, J.; Harrison, R. J.; Littlefield, R. J. Global Arrays: A Portable "Shared-Memory" Programming Model for Distributed Memory Computers. In Supercomputing '94; IEEE Computer Society: Los Alamitos, CA, 1994.
Roothaan, C. New Developments in Molecular Orbital Theory. Rev. Mod. Phys. 1951, 23, 69-89.
Shepard, R. Elimination of the Diagonalization Bottleneck in Parallel Direct-SCF Methods. Theor. Chim. Acta 1993, 84, 343-351.
Wong, A. T.; Harrison, R. J. Approaches to Large-Scale Parallel SCF. J. Chem. Phys. 1995, in press.

Received for review April 3, 1995
Revised manuscript received September 18, 1995
Accepted September 29, 1995

IE950225W

Abstract published in Advance ACS Abstracts, November 15, 1995.