Gaussian03 is the latest in
the Gaussian series of electronic structure programs. Designed to model
a broad range of molecular systems under a variety of conditions, it performs
its computations starting from the basic laws of quantum mechanics. Gaussian
can predict energies, molecular structures, and vibrational frequencies, along
with the numerous molecular properties that are derived from these three
basic computation types, for systems in the gas phase and in solution, and
it can model them in both their ground state and excited states. Chemists
apply these fundamental results to their own investigations, using Gaussian
to explore chemical phenomena like substituent effects, reaction mechanisms,
and electronic transitions.
There are three platforms on which Gaussian will run within the cluster:
32-bit Xeon nodes (p2800):
64-bit Opteron nodes (o2000, o2200, o2600, o2800):
On all nodes, including Firebolt, the wrapper script /usr/local/bin/g03 will set all environment variables and ensure the proper executables and scratch directory are used. See below for more info on running Gaussian.
Gaussian uses several scratch files in the course of its computation. These include the checkpoint file (*.chk), the read-write file (*.rwf), the two-electron integral file (*.int), and the two-electron integral derivative file (*.d2e). These files can become extremely large, and because the program is accessing them constantly, I/O speed is a factor in performance.
The choice of scratch directory can be critical to performance. The directory is set by defining the environment variable $GAUSS_SCRDIR, either immediately before execution or in the shell that runs the program.
The default directory for scratch files on the Biowulf cluster is /gaussian (on individual nodes). Disk space depends on the node and varies from 20 GB to 60 GB. Because /gaussian is local to the node, it provides the fastest I/O for Gaussian.
Scratch files remain on disk after Gaussian finishes. There is no automatic mechanism to remove files from /gaussian on the nodes, so unless the files are required for future runs, users are encouraged to include the Link0 command %NoSave at the end of input files.
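The following is a minimal sketch of checking a scratch location before a run. The helper name choose_scratch and the fallback to ${TMPDIR:-/tmp} are assumptions made for illustration. Whether the g03 wrapper honors a value of $GAUSS_SCRDIR set beforehand is not stated here, so treat this as a pre-flight check rather than a guaranteed override.

```shell
#!/bin/bash
# Pick a scratch directory for Gaussian and report its free space.
# /gaussian is the node-local default described above; falling back to
# ${TMPDIR:-/tmp} is an assumption for hosts where /gaussian is missing.
choose_scratch() {
    local preferred="$1"
    if [ -d "$preferred" ] && [ -w "$preferred" ]; then
        echo "$preferred"
    else
        echo "${TMPDIR:-/tmp}"
    fi
}

GAUSS_SCRDIR="$(choose_scratch /gaussian)"
export GAUSS_SCRDIR
echo "GAUSS_SCRDIR=$GAUSS_SCRDIR"
df -h "$GAUSS_SCRDIR"    # check free space before starting a large job
```

Running this at the top of a batch script records in the job log how much scratch space was free when the job started.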
When the Link0 command '%chk=<filename>' is given in a Gaussian command script and the stated checkpoint file already exists, Gaussian attempts to import the checkpoint file and begin running with data from the file. However, if the checkpoint file was generated on an architecture different from the current one (e.g., running Gaussian on a 64-bit node with a checkpoint file generated on a 32-bit node), Gaussian execution stops abruptly with no discernible error message.
To diagnose this, run the command gaussian_chkarch, which reports the architecture on which a checkpoint file was generated. It must be run separately from the g03 command. An incompatible checkpoint file can then be converted by using the Gaussian utilities formchk and unfchk.
Currently, this architecture check is only done on the Biowulf cluster nodes, and is limited to 32-bit to 64-bit files.
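As a rough sketch of the conversion, the helper below only prints the two commands involved; it does not run them. formchk writes a formatted (architecture-neutral) file and unfchk turns it back into a binary checkpoint file. The function name chk_convert_plan is hypothetical, and the sketch assumes the first command is run on the architecture that wrote the checkpoint file and the second on the target architecture.

```shell
#!/bin/bash
# Print the two conversion steps for a checkpoint file.
# Step 1 belongs on the architecture that wrote the file,
# step 2 on the architecture where the job will continue.
chk_convert_plan() {
    local chk="$1"
    local fchk="${chk%.chk}.fchk"    # test000.chk -> test000.fchk
    echo "formchk $chk $fchk"
    echo "unfchk $fchk $chk"
}

chk_convert_plan test000.chk
```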
The memory requirement for a Gaussian job depends on the job type, the number of basis functions, and the algorithms used for integrals. In general, it is best to use as much memory as is available, but not more. Allocating more memory with the %Mem command than is available will cause the node to swap data back and forth between disk and memory, badly degrading performance.
Because memory is shared among all CPUs during multiprocessor jobs, each processor has access to only a fraction (1/2 on single-core nodes, 1/4 on dual-core nodes) of the memory it would have in a single-processor run. Thus, the amount of memory allocated will need to be increased up to N-fold, with N equal to the number of processors available. For more information on how to calculate the amount of needed memory, see http://www.gaussian.com/g_ur/m_eff.htm.
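The N-fold rule can be sketched as simple arithmetic. The function name scaled_mem_mb and the example figures (512 MB for a single-processor run, 2 processors, a 4096 MB node) are illustrative assumptions, and the cap at the node's physical memory is a simplification that leaves no headroom for the operating system.

```shell
#!/bin/bash
# Estimate a %Mem value (in MB) for a multiprocessor run.
# Arguments: memory that sufficed for one processor (MB), number of
# processors, and physical memory on the node (MB). All are user-supplied.
scaled_mem_mb() {
    local single_mb=$1 nproc=$2 node_mb=$3
    local wanted=$(( single_mb * nproc ))
    # Never request more than the node has, to avoid swapping.
    if [ "$wanted" -gt "$node_mb" ]; then
        wanted=$node_mb
    fi
    echo "$wanted"
}

echo "%Mem=$(scaled_mem_mb 512 2 4096)MB"    # prints %Mem=1024MB
```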
Depending on the node, either the 32- or 64-bit version of Gaussian is run using the g03 command. However, there may be cases when checkpoint files require a single version. Because the 32-bit version of Gaussian will run on both the 32- and 64-bit nodes (although there are limitations to the 32-bit version, see above), giving the flag '-32' with the g03 command will force the 32-bit version of Gaussian to run.
Create a batch script, for example 'g03run', containing:
#!/bin/bash
#PBS -N g03
#PBS -e g03.err
#PBS -o g03.log
cd $PBS_O_WORKDIR
/usr/local/bin/g03 < test000.com > test000.log
Submit this job using the PBS 'qsub' command. Example:
qsub -l nodes=1 g03run
See here for more information about PBS.
Create a swarm command file containing one Gaussian command per line. For example, the file 'cmdfile' might contain:
g03 < test000.com > test000.log
g03 < test001.com > test001.log
g03 < test002.com > test002.log
g03 < test003.com > test003.log
g03 < test004.com > test004.log
g03 < test005.com > test005.log
...
Submit this swarm command file to the batch system with the command:
[biowulf]% swarm -f cmdfile
NOTE: Swarm will attempt to run one command per processor. If a Gaussian job uses multiple processors through the %NProcShared command, this will overload the node and make the job run much slower than it should. In that case, include the -n option to swarm. For example, with %NProcShared=2, include -n 1 on a single-core (2-CPU) node:
[biowulf]% swarm -n 1 -f cmdfile
In this case, swarm will run one command per node. See the Swarm documentation for more information.
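For a long series of inputs, the command file can be generated rather than typed. This is a small sketch that assumes the inputs follow the testNNN.com naming used in the example above; the function name make_swarm_cmds is hypothetical.

```shell
#!/bin/bash
# Print one g03 command per input file, numbered first..last (zero-padded).
make_swarm_cmds() {
    local first=$1 last=$2 i
    for (( i = first; i <= last; i++ )); do
        printf 'g03 < test%03d.com > test%03d.log\n' "$i" "$i"
    done
}

# Write the six commands shown above into 'cmdfile'.
make_swarm_cmds 0 5 > cmdfile
wc -l < cmdfile
```

The resulting file can then be passed to swarm with -f cmdfile, adding -n as needed for multiprocessor jobs.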
[biowulf]% qsub -I -l nodes=1
qsub: waiting for job 2011.biobos to start
qsub: job 2011.biobos ready
[p139]% g03 < test000.com > test000.log
[p139]% exit
logout
qsub: job 2011.biobos completed
[biowulf]%
To run an older version of Gaussian (e.g., D.01), include the option -D01 or -C02 on the command line.
[biowulf]% qsub -l nodes=1:o2200:m4096 g03run
The available nodes can be listed by typing the command 'freen'. Click here for more information about directing and monitoring jobs on the cluster.
While disk space and memory are not restricted for the 64-bit version of Gaussian, they are still limited by the cluster hardware. Using the Link0 command %Mem and the MaxDisk option in the Gaussian command file may be required to prevent memory swapping or running out of disk space. Click here for more information about memory and disk space requirements for Gaussian.
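As an illustration of where these two settings go, the sketch below writes a small input file with %Mem in the Link0 section and MaxDisk in the route section. The file name water_opt.com, the 1GB and 10GB limits, the method, and the approximate water geometry are all assumptions chosen for the example; suitable limits depend on the node the job lands on.

```shell
#!/bin/bash
# Write a small Gaussian input file that caps memory and scratch disk usage.
# The limits, method, and geometry below are illustrative only.
cat > water_opt.com <<'EOF'
%Mem=1GB
%NProcShared=2
#p b3lyp/6-31G* opt MaxDisk=10GB

Water optimization with explicit memory and disk limits

0 1
O   0.000000   0.000000   0.117300
H   0.000000   0.757200  -0.469200
H   0.000000  -0.757200  -0.469200

EOF

# Show the two lines that carry the limits.
grep -E '^%Mem|MaxDisk' water_opt.com
```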
Running the command /usr/local/bin/g03 will give, in addition to the Gaussian output, three additional pieces of information regarding the Gaussian job:
[biowulf]% head -4 test000.log
host = p1070
Running on 64-bit system
Current disk usage on p1070:
/dev/hda1 72G 2.1G 66G 4% /
The first line tells what node the Gaussian process is running on. The second line shows whether the 32-bit or 64-bit version is being run (whether the node has the x86-64 property or not). The third and fourth lines show how much scratch space is available for the node (in this case 66GB). This information is important in deciphering any problems the job may have encountered.
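When many log files need to be checked, the header can be read with a short script. The sketch below assumes the header has exactly the layout shown in the example above; the function names log_host and log_free_space and the file sample_header.log are made up for the demonstration.

```shell
#!/bin/bash
# Extract the host name and free scratch space from the wrapper's header.
log_host() {
    awk '$1 == "host" && $2 == "=" { print $3; exit }' "$1"
}
log_free_space() {
    # The available space is the fourth column of the df-style line.
    awk '$1 ~ /^\/dev\// { print $4; exit }' "$1"
}

# Sample header copied from the example above, for demonstration only.
cat > sample_header.log <<'EOF'
host = p1070
Running on 64-bit system
Current disk usage on p1070:
/dev/hda1 72G 2.1G 66G 4% /
EOF

echo "host: $(log_host sample_header.log), free: $(log_free_space sample_header.log)"
```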
Gaussian comes with a large set of test input scripts (/usr/local/gaussian03/g03/tests/com). All input scripts were run as both single- and dual-threaded jobs on 32-bit (p2800, Xeon) and 64-bit (o2200, Opteron) nodes, and the following tables summarize the output. The values in the upper right of each table show the average relative speedup (or slowdown, if the value is negative) on going from one job type (the first column) to another job type. The values in the lower left show the number of test runs compared. The results are partitioned into all test runs, those which ran for more than 60 seconds, more than 600 seconds, and more than 6000 seconds.
All test runs:

| From / To | 32-1p | 32-2p | 64-1p | 64-2p |
|---|---|---|---|---|
| 32-1p | | +48% | +44% | +132% |
| 32-2p | 630 | | -2% | +60% |
| 64-1p | 657 | 632 | | +63% |
| 64-2p | 657 | 632 | 661 | |

Test runs longer than 60 sec.:

| From / To | 32-1p | 32-2p | 64-1p | 64-2p |
|---|---|---|---|---|
| 32-1p | | +49% | +43% | +132% |
| 32-2p | 358 | | -4% | +58% |
| 64-1p | 303 | 282 | | +65% |
| 64-2p | 272 | 251 | 273 | |

Test runs longer than 600 sec.:

| From / To | 32-1p | 32-2p | 64-1p | 64-2p |
|---|---|---|---|---|
| 32-1p | | +52% | +42% | +136% |
| 32-2p | 98 | | -7% | +56% |
| 64-1p | 110 | 93 | | +66% |
| 64-2p | 83 | 71 | 86 | |

Test runs longer than 6000 sec.:

| From / To | 32-1p | 32-2p | 64-1p | 64-2p |
|---|---|---|---|---|
| 32-1p | | +59% | +41% | +150% |
| 32-2p | 20 | | -11% | +53% |
| 64-1p | 25 | 19 | | +74% |
| 64-2p | 12 | 12 | 14 | |

Overall, the speedup efficiency on going from a single-threaded job to a dual-threaded job on a 32-bit system was between 75% and 80%. For the 64-bit system, it was between 82% and 87%.
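The efficiency figures can be roughly reproduced from the speedups in the tables. The source does not state its formula, so the sketch below assumes the usual definition: the speedup factor (100% plus the relative speedup) divided by the number of processors. The function name parallel_efficiency is hypothetical.

```shell
#!/bin/bash
# Parallel efficiency in percent, assuming
#   efficiency = (100 + relative speedup in %) / number of processors
parallel_efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (100 + s) / n }'
}

parallel_efficiency 48 2    # 32-bit, all test runs: prints 74.0
parallel_efficiency 63 2    # 64-bit, all test runs: prints 81.5
```

Under this assumption, the +48% to +59% and +63% to +74% speedups in the tables give values close to the ranges quoted above.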
Several of the test runs failed for all systems (test284, test421, test598, test602, and test605). These test run input files were considered incorrectly written.
A number of test runs failed with stack overflow errors when run as dual-threaded jobs on 32-bit nodes (see text files below). This may be due to the memory limitations of the 32-bit executables (see above).
Many test runs ran slower as dual-threaded jobs than as single-threaded jobs (for example test410 and test559), with the greatest slowdown being 15% (64-bit, test559).
The full results of the test runs can be downloaded as text files:
Gaussian is a natively multithreaded application and in general scales to 2-4 CPUs. The default version of Gaussian (E.01) is compiled with TCP Linda, which allows Gaussian to run across multiple nodes. A small number of Links (L502, L703, L914, L1002, L1110) will scale up to 32 CPUs using TCP Linda.
However, not all calculation types parallelize well or at all. In fact, most run best as single-threaded processes.
The following table shows the best use of Gaussian with respect to the number of processors:

| Method | Energy | Gradient / Opt | Freq / Hessian |
|---|---|---|---|
| HF | 4 | 4 | 4 |
| HDFT | 4 | 4 | 4 |
| Pure DFT | 4 | 4 | 4 |
| MP2 | 4 | 3 | 1-2 |
| MP3 | 1 | 1 | |
| MP4 | 2-4 | | |
| MP5 | 1 | | |
| CCD | 1 | 1 | |
| CCSD | 1 | 1 | |
| CCSD(T) | 2-4 | | |
| CIS | 4 | 3 | |
| CISD | 1 | 1 | |
| AM1 | 1 | 1 | |

Only HF, DFT, CCSD(T), CIS, and MP2/MP4 jobs will benefit from running with %NProcShared=4 on dual-core nodes. All others should be run with %NProcShared=2 (or 1).
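The rule of thumb above can be written as a small lookup, for instance when generating many input files. This sketch encodes only the summary sentence, not the per-job-type detail of the table, and the function name recommended_nproc is hypothetical.

```shell
#!/bin/bash
# Suggest a %NProcShared value on a dual-core node from the method name,
# following the summary rule above: 4 for the methods that scale, else 2.
recommended_nproc() {
    local method
    method=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$method" in
        HF|DFT|HDFT|"CCSD(T)"|CIS|MP2|MP4) echo 4 ;;
        *) echo 2 ;;
    esac
}

echo "%NProcShared=$(recommended_nproc mp2)"     # prints %NProcShared=4
echo "%NProcShared=$(recommended_nproc ccsd)"    # prints %NProcShared=2
```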
Gaussian errors are not always straightforward to interpret. Something as simple as a "file not found" can seem baffling and cryptic. Here is a collection of errors and their translations:

| Gaussian Error | Translation to English |
|---|---|
| Error termination in NtrErr: ntran open failure returned to fopen. Segmentation fault | Can't open a file. |
| Internal consistency error detected in FileIO for unit 1 I= 4 J=0 I Fail= 1. | Gaussian is limited to 16 GB of scratch space on the 32-bit nodes. |
| Out-of-memory error in routine UFChkP (IEnd= 12292175 MxCore= 6291456) Use %Mem=12MW to provide the minimum amount of memory required to complete this step. Error termination via Lnk1e at Thu Feb 2 13:05:32 2006. | Default memory (6 MW, set in $GAUSS_MEMDEF) is too small for unfchk. |
| galloc: could not allocate memory.: Resource temporarily unavailable | Not enough memory. |
| Out-of-memory error in routine... | Not enough memory. |
| End of file in GetChg. Error termination via Lnk1e ... | Not enough memory. |
| IMax=3 JMax=2 DiffMx= 0.00D+00 Unable to allocate space to process matrices in G2DrvN: NAtomX= 58 NBasis= 762 NBas6D= 762 MDV1= 6291106 MinMem= 105955841. | Gaussian has 6 MW free memory (MDV1) but requires at least 106 MW (MinMem). |
| Estimate disk for full transformation -677255533 words. Semi-Direct transformation. Bad length for file. | MaxDisk has been set too low. |
| Error termination in NtrErr: NtrErr Called from FileIO. | The calculation has exceeded the maximum limit of maxcyc. |
| Erroneous read. Read 0 instead of 6258688. fd = 4 g_read | Disk quota or disk size exceeded. Could also be disk failure or NFS timeout. |
| Erroneous write. Write 8192 instead of 12288. fd = 4 orig len = 12288 left = 12288 g_write | Disk quota or disk size exceeded. Could also be disk failure or NFS timeout. |

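Several of these messages report memory in words (for example MxCore and MinMem). Gaussian words are 8 bytes, so a word count can be turned into a size in megabytes as sketched below. The function name words_to_mb is hypothetical, and the sketch assumes 1 MB = 1,048,576 bytes and rounds up.

```shell
#!/bin/bash
# Convert a Gaussian word count (8 bytes per word) to megabytes, rounded up.
# Assumes 1 MB = 1048576 bytes.
words_to_mb() {
    local words=$1
    echo $(( (words * 8 + 1048575) / 1048576 ))
}

words_to_mb 6291456      # MxCore from the unfchk error above: prints 48
words_to_mb 105955841    # MinMem from the G2DrvN error above: prints 809
```

A value a little above the second figure would be a reasonable starting point for %Mem in that case, provided the node has that much memory.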
Firebolt is an SGI Altix 350 with 32 Itanium 2 processors and 96GB of memory. It uses an SGI NUMAlink interconnect and runs Red Hat Enterprise Linux 3. It is managed as a "fat node" of the NIH Biowulf Cluster. Gaussian jobs which require > 4GB of memory should be run on Firebolt. All other jobs should be run on the cluster.
The default scratch space for Gaussian on Firebolt is /gaussian, a local 2.2TB filesystem. Gaussian jobs must be submitted to the Altix with the machine type (altix), number of processors (ncpus), and memory (mem) explicitly defined:
[biowulf]% qsub -l nodes=1:altix:ncpus=4,mem=4gb g03run
All other PBS commands are identical to those used for managing cluster jobs. The wrapper script /usr/local/bin/g03 also works the same way as on the cluster.
Firebolt is roughly equivalent to the fastest nodes on the cluster, and in general Gaussian jobs scale well to about 4 cpus. As with all Gaussian jobs, an increase in allocated memory will accelerate performance. Here is a plot of 2-cpu (%nproc=2) benchmark jobs as compared to Firebolt:

Gaussian03 version E.01 (64-bit only) on Biowulf is compiled using TCP Linda. This allows a small subset of job types to be distributed across multiple nodes on the cluster:
To use TCP Linda, include the Link0 command %NProcLinda=#, where # is the number of nodes on which to distribute the job. You must also include the %NProcShared=# command, where # is the number of CPUs per node (%NProc and %NProcShared are synonyms). Finally, request the matching number of nodes when submitting the job. For example, to distribute a Gaussian run onto 16 CPUs across 8 o2800 (single-core, 2 CPU/node) nodes, the Gaussian input file would look like
%NProcShared=2
%NProcLinda=8
#p b3lyp 6-31G* td(nstates=10) test
Gaussian Test Job 438:
...
and would be submitted like
qsub -l nodes=8:o2800 g03run
See above for details about submitting to the batch system.
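Because the Link0 lines and the qsub request have to agree, it can help to derive both from the same numbers. The sketch below only prints them; the function name linda_plan is hypothetical, and g03run is the batch script name used in the examples above.

```shell
#!/bin/bash
# Print matching Link0 lines and qsub command for a TCP Linda job.
# Arguments: number of nodes, CPUs per node, node type (e.g. o2800).
linda_plan() {
    local nodes=$1 cpus_per_node=$2 node_type=$3
    echo "%NProcShared=${cpus_per_node}"
    echo "%NProcLinda=${nodes}"
    echo "qsub -l nodes=${nodes}:${node_type} g03run"
    echo "total CPUs: $(( nodes * cpus_per_node ))"
}

linda_plan 8 2 o2800
```

For the example above, this prints the same two Link0 lines and qsub command, plus a total of 16 CPUs.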
Keep in mind that only the processor load is distributed. The master (mother superior) node in a multi-node Linda job must have the required amount of RAM, while the worker nodes can have less. Thus, in the case of the Gaussian input
%Mem=8GB
%NProcShared=2
%NProcLinda=8
#p opt freq=noraman scf=tight...
...
the master node must have at least 8GB of RAM. See here for a discussion of memory issues.