This tutorial provides a comprehensive guide on running the CRYSTAL program, highlighting its various executables, parallelization methods, plotting tools, and scripts to streamline execution and workspace management.
The CRYSTAL suite includes different executables designed for distinguished types of calculations:
Each of these has several versions:
To run the CRYSTAL program, an INPUT file is required (note: the file must be named INPUT). For a detailed explanation of the content and usage of the INPUT file, please refer to other tutorials or the CRYSTAL23 user’s manual.
The crystal executable can be run directly from command line using the following instruction:
./path/to/executables/crystal < INPUT
Each run of crystal generates a series of files, some of which are used during the calculation, while others serve as output. T hese files are typically named in the format “fort.XX”.
To maintain organization in the working directory, it is recommended to use scripts that automatically perform the calculations in a scratch directory, copying back only the output files upon completion. Examples of such scripts are provided here.
These scripts interpret file name extensions based on the following rules:
| .d12 | CRYSTAL input calculation (for crystal, Pcrystal, MPPcrystal programs) |
| .out | Printed output file for CRYSTAL (modifiable via the environment variable $OUTFILE) |
| .d3 | PROPERTIES calculation input (for properties, Pproperties programs) |
| .outp | Printed output file for PROPERTIES (modifiable via the environment variable) |
To see the usage for these scripts, simply run them without additional arguments to display information.
The scripts depend on several environment variables (details in the Installation Instructions). Users should configure these variables based on their installation directory structure.
To automate this process, it is possible source the “cry32.cshrc” or “cry23.bashrc” file in the “.cshrc” or “.bashrc”, respectively, after setting the appropriate paths. For example, to source the “cry23.bashrc” file, add this line to the .bashrc:
source path/to/crystal/utils/cry23.bashrc
Here are the key variables that should be set by user and their meaning:
CRY23_ROOT="$HOME/CRYSTAL23" CRSYTAL23 main directory CRY23_BIN="bin" name of directory with executables CRY23_ARCH="Linux-ifort" string defining compiler/platform VERSION="v1.0.1" string associated with binary version CRY23_SCRDIR="$HOME/tmp" scratch folder (to be created by user)
Optional variables include:
OUTFILE Suffix for output files (Default: out) CRY23_INP Directory for "crystal" input files (Default: $here)
The default directory structure is:
$HOME/CRYSTAL23/bin/Linux-ifort/v1.0.1
Using the runcry23 script, the CRYSTAL program is executed for wave function and total energy calculations (and for many other options if required). If the execution is successful and an input file with the same name as the CRYSTAL input file (but with a .d3 extension) is present, the PROPERTIES program is automatically executed.
Alternatively the scripts runprop23 can be used providing both the name of the input file
(.d3 extension) and the wave function file
(.f9 extension).
At the end of the calculation, all the available output files, if they exist, are saved in the current directory. Some of these files are reported in the following lists, along with their corresponding saved names, data, and associated keywords:
CRYSTAL Program:
| File Name | Saved As | Data Stored | Keyword |
| fort.9 | inpfilename.f9 | Binary wave function | Default, if SCF successful |
| fort.98 | inpfilename.f98 | Formatted wave function | Default, if SCF successful |
| fort.33 | inpfilename.xyz | Atom coordinates | COORPRT |
| fort.34 | inpfilename.gui | GUI-compatible crystal structure | EXTPRT |
| HESSOPT.DAT | inpfilename.hessopt | Hessian | OPTGEOM |
| OPTINFO.DAT | inpfilename.optinfo | Info for optimization restart | OPTGEOM |
| FREQINFO.DAT | inpfilename.freqinfo | Info for frequencies restart | FREQCALC |
| HESSFREQ.DAT | inpfilename.hessfreq | Hessian | FREQCALC |
| REFLECTANCE.DAT | inpfilename.reflectance | Reflectance data to plot | FREQCALC/REFLECTANCE |
| EOSINFO.DAT | inpfilename.eosinfo | Equation of State | EOS |
| ELASINFO.DAT | inpfilename.elasinfo | Elastic constants | ELASTCON |
PROPERTIES Program:
| File | Saved as | Data stored | Keyword |
| fort.98 | inpfilename.f98 | Formatted wave function | FMWF |
| fort.33 | inpfilename.xyz | Atoms coordinates - xyz format | COORPRT |
| fort.25 | inpfilename.f25 | Data for visualization | BAND,DOSS,ECHG,POTM |
| fort.31 | inpfilename_dat.prop3d | 3D charge (spin) density/potential | ECH3/POT3 |
| fort.32 | inpfilename_dat.info3d | Information on 3D grid | ECH3/POT3 or GRID3D |
| fort.34 | inpfilename.gui | GUI - crystal structure | EXTPRT |
| SYMMINFO.DAT | inpfilename.sym | Symmetry information | SYMMINFO |
| BAND.DAT | inpfilename_dat.BAND | Band structure | BAND |
| DOSS.DAT | inpfilename_dat.DOSS | Density of states | DOSS |
| RHOLINE.DAT | inpfilename_dat.RHOLINE | electron density profile | ECHG |
| POTC.DAT | inpfilename_dat.POTC | exact electrostatic potential | POTC (FIELD in geometry input) |
| EMDL.DAT | inpfilename_dat.EMDL | Electron Momentum distribution | EMDL |
| BIDIDI.DAT | inpfilename.BDIDI | B(r) | BIDIERD |
| GRED.DAT | inpfilename.GRED | BS, geom,direct lattice reducible data | CRYAPI_OUT |
| KRED.DAT | inpfilename.KRED | Eigenvectors in full BZ | CRYAPI_OUT |
Files with the .f9 extension are not modified by PROPERTIES
For a full list of output files, run the scripts runcry23 or runprop23 without any additional argument.
This section refers mainly to the use of the sequential executables.
As an example, an INPUT file for the alpha quartz is provided (quartz_test.d12). The CRYSTAL calculation can be executed using the following command:
runcry23 quartz_test
The output file provides crucial information, depending on the calculation options specified in the INPUT file. For example, the final cell energy of the system is reported at the end of the SCF procedure. This value can be easily located by searching for the string: “TOTAL ENERGY”.
As reference, the final energy of the example system should be:
TOTAL ENERGY(DFT)(AU)( 7) -1.7596024773958E+03 DE-2.8E-07 tester 6.5E-10
At this stage, a PROPERTIES calculation can be performed using a new INPUT file (quartz_band.d3) and the wave function obtained from the previous calculation (.f9 extension). As reference, in this case a BAND structure calculation is presented.
The properties calculation can then be executed with the following command:
runprop23 quartz_bands quartz_test
CRYSTAL provides many tools to visualize the results of your calculations, in this tutorial a couple of them will be featured:
CRYSPLOT is an online tool designed for quick and easy plotting of basic results. Users can access it here.
To use the tool, select the desired property from the menu in the top-left corner of the page. Typically, a .f25 or .DAT file is required as input. After uploading the file, the plot will be generated, with options available to customize its appearance.
A Python package offering advanced plotting capabilities using Matplotlib. It can be installed via pip or from the online GitHub repository.
As an example, a Jupyter notebook to plot the output of the previous band calculation is provided.
The parallel versions of CRYSTAL and PROPERTIES are based on the MPI (Message Passing Interface) framework. To run these executables, the “mpirun” command (or its equivalents) should be used:
mpirun -np <MPI_processes> ./path/to/executables/Pcrystal
No additional options are required for basic execution. However, to ensure that the desired computing cores are being used effectively, some printing options can be helpful:
mpirun --report-bindings -np <MPI_processes> ./path/to/executables/Pcrystal
mpirun -print-rank-map -np <MPI_processes> ./path/to/executables/Pcrystal
If the process bindings are not coherent with the computing architecture (e.g., suboptimal core usage or misaligned with the hardware topology), specific binding options provided by the MPI framework should be used. For detailed instructions on these options, refer to the MPI manuals of the implementation you are using.
When executing OpenMP-enabled versions of crystal or properties, users are advised to increase the stack size to unlimited. This can be done using the following command before execution:
export OMP_TACKSIZE=16M
To control how many threads each MPI process generate, use the “OMP_NUM_THREADS” environment variable. For example, to set the number of threads to “n”, use the command:
export OMP_NUM_THREADS=n
To maximize efficiency, it is recommended that the total number of hardware cores is at least equal to the product of the number of OpenMP threads and the number of MPI processes:
\[ \begin{equation} \text{hardware cores} \ge \text{OMP threads} \times \text{MPI processes} \end{equation} \]
For optimal performance, it is generally recommended to use between 2 to 8 OpenMP threads per MPI process. Exceeding this range may lead to diminishing returns due to increased overhead in thread management.
Running the MPP version of CRYSTAL follows the same procedure as the Parallel version. The executable can be launched using the “mpirun” command:
mpirun -np <MPI_processes> ./path/to/executables/MPPcrystal < INPUT
The MPP version utilizes the ScaLAPACK library from Intel MKL to distribute linear algebra operations across multiple MPI processes, making it highly suitable for extensive HPC environments that provide access to a large number of computing cores.
The computational cost of the different steps in the SCF procedure is influenced by multiple factors, with the most significant being the size of the system, in terms of the number of basis functions (or atomic orbitals) used.
In a simple test case, it is observed that the most computationally intensive tasks are the calculation of two-electron integrals (green) and the linear algebra operations (blue). A crossover point can be seen in their relative computational cost (red circle). This is because the calculation of two-electron integrals scales quasi-linearly, while the cost of linear algebra operations scales cubically.
As a result, the system size is a crucial factor to consider when deciding which executable to use for the calculation.
The primary distinction between the different CRYSTAL executable is in how they handle linear algebra operations. These are carried out in reciprocal space across a defined number of k points within the First Brillouin Zone. The most computationally expensive task in this process is the diagonalization of the Hamiltonian matrix. For each k point a single matrix with dimensions of N2 is generated, where N is the number of atomic orbitals used in the calculation.
CRYSTAL uses multiple levels of parallelism to optimize this process. The first level of parallelism takes advantage of the independence of k points, distributing the matrices to be diagonalized evenly among the processes. The second level applies when spin-unrestricted calculations are performed. In this case, the Hamiltonian matrix for each k point is divided into two matrices - one for each spin component (alpha and beta). These matrices are treated as separate points and are also evenly distributed among the processes.
This distribution is managed through MPI parallelism, which can lead to an increase in memory usage due to the replication of data across processes.
In the standard parallel version of the code, Pcrystal, a third level of parallelism is introduced through symmetry irreducible representations. This level is enabled by default and can significantly reduce computational costs. The Hamiltonian matrix, due to its block-diagonal structure from symmetry, is further divided into multiple smaller matrices of varying sizes. These smaller matrices are then distributed across different MPI processes for parallel computation.
This version of the code relies on internal libraries for linear algebra operations, with the Jacobi algorithm used for diagonalization. The Jacobi method is an iterative procedure, and results to be particularly time-consuming during the first diagonalization of the Hamiltonian matrix in the initial SCF cycle, as the matrix is densely populated. However, in subsequent SCF cycles, the matrix is almost diagonal, making the diagonalization process faster. This results in a noticeable difference in timing between the first SCF iteration and the following ones. However, this is the least memory-consuming diagonalization algorithm among all CRYSTAL executables, thus this code can be used to lower memory requirements for this specific operation.
This version of the code is best suited on relatively small computational clusters for small to medium-sized systems where the number of k points (or irreducible representations) exceeds the number of MPI processes. In such cases, the SCF steps show good scalability until the number of processes surpasses the available k points (or irreducible representations). At this stage some processes inevitably remain idle, reducing the overall efficiency.
For larger systems, where linear algebra operations become more computationally demanding, fewer k points are usually required, leading to reduced scalability. In such cases, this executable may not be the most efficient choice.
OpenMP parallelism has been integrated throughout the entire SCF procedure and force calculations. Using the same number of physical cores but reducing the number of MPI processes while increasing the number of threads per process, offers several advantages. These include significantly reduced memory usage due to less data replication, improved scalability by enabling multiple cores to work on a single matrix operation, and faster communication due to the reduced number of MPI processes.
To enable threading within linear algebra operations, this version of the code requires specific linear algebra libraries, such as BLAS and LAPACK, as implemented in Intel oneAPI MKL. The LAPACK diagonalizer, which uses the divide-and-conquer algorithm, ensures that the cost of the linear algebra step remains consistent across all SCF cycles, unlike in previous executables where the first cycle was more expensive.
The OpenMP-enabled version of the code shares similar considerations with the standard Pcrystal executable. It is best suited for small to medium-sized systems but can also be effectively used on larger hardware architectures without compromising scalability, up to the point where the number of MPI processes matches the number of k points (or irreducible representations). It is recommended not to exceed 4 to 8 threads per process to maintain optimal performance.
The massively parallel (MPP) version of the code is significantly different from the others. It employs the ScaLAPACK library (from Intel oneAPI MKL) to enable highly scalable linear algebra operations across many processes. This scalability makes the linear algebra step of the SCF procedure extremely efficient on large-scale hardware. However, due to the increased complexity of the algorithm, the layer of parallelism over symmetry irreducible representations, present in other versions, is not available.
The MPP code is not optimized for memory or time efficiency on small hardware architectures. However, by distributing memory among multiple processes, it reduces the memory-per-core requirement, enabling calculations on large and very large systems that would be challenging or impossible with the standard parallel version. This code is ideal for use in environments with extensive HPC resources.
Within the MPP code, there are two parallelization logics for handling k points during the SCF process: "allk" and "splitk."
This is the default execution mode, where k points are processed sequentially. All available MPI processes collaborate on operations for a single Hamiltonian matrix at a time.
In this mode, all k points are processed simultaneously, with subsets of processes working on different Hamiltonian matrices concurrently. This mode is automatically activated when the number of MPI processes matches a specific key number:
\[ \begin{equation} N \text{ processes} = (nR + 2 nC) \times \text{non-prime integer (4,6,8,9…)} \end{equation} \]
where nR and nC represent the number of real and complex k points, respectively.
The splitk mode is more efficient than allk, making it the preferred choice when possible. However, it requires an even larger number of processes.
The exact number needed can be determined by performing a testrun: the output will indicate the exact number of processes required for the splitk logic,search for the string: “parallelism”
For reference the output will be as this:
*******************************************************************************
*******************************************************************************
Please, consider that for this calculation a double level of parallelism
can be exploited by using a number of processors given by:
N_proc = 27 x a_non_prime_integer (as 4, or 6, or 8, or 9, ...)
For instance, N_proc = 108, or 162, or 216, or 243, or ...
N_proc = 1728, or 3456, or 6912, or 13824, or ...
N_proc = 27648, or 55296, or ...
*******************************************************************************
*******************************************************************************
The MPP code also supports OpenMP directives, with considerations similar to those outlined for the OpenMP-enabled versions of other executables.