How to run CRYSTAL23

Summary

Overview of CRYSTAL Executables

Scripts usage

The first run
Plotting Tools

CRYSPLOT
CRYSTALClear

Parallel run

OpenMP
Massive Parallel Process – MPP

Select the right executable for your needs

Pcrystal/Properties
Pcrystal/Pproperties OMP
MPPcrystal

Allk
Splitk

Overview of CRYSTAL Executables

This tutorial provides a comprehensive guide on running the CRYSTAL program, highlighting its various executables, parallelization methods, plotting tools, and scripts to streamline execution and workspace management.

The CRYSTAL suite includes different executables designed for distinguished types of calculations:

[P/MPP/..] crystal: computes the energy, analytical gradient and wave function for a given geometry, which can also be fully optimized, also computes a number of energy-based and response quantities, such as vibrational frequencies and spectra, dielectric response, and more.
[P/..] properties: computes one electron properties (electrostatic potential, charge density, etc.), analyses the wave function in direct and reciprocal space, transforms the Bloch functions (BF) into Wannier functions (localization of BF), and much more.

Each of these has several versions:

Sequential Version (no prefix): Independent of external libraries.
Parallel Version (P prefix): Requires MPI for parallelization.
MPP Version (MPP prefix, only for crystal): Requires ScaLAPACK library from intel MKL.
OpenMP Version (OMP suffix): Available from CRYSTAL23, includes OpenMP directives for further parallelism, on top of MPI. Also requires intel MKL (LAPACK and BLAS libraries) to achieve multi-threads parallelization in linear algebra operations.

To run the CRYSTAL program, an INPUT file is required (note: the file must be named INPUT). For a detailed explanation of the content and usage of the INPUT file, please refer to other tutorials or the CRYSTAL23 user’s manual.

The crystal executable can be run directly from command line using the following instruction:

./path/to/executables/crystal < INPUT

Each run of crystal generates a series of files, some of which are used during the calculation, while others serve as output. T hese files are typically named in the format “fort.XX”.

Scripts usage

To maintain organization in the working directory, it is recommended to use scripts that automatically perform the calculations in a scratch directory, copying back only the output files upon completion. Examples of such scripts are provided here.

These scripts interpret file name extensions based on the following rules:

.d12	CRYSTAL input calculation (for crystal, Pcrystal, MPPcrystal programs)
.out	Printed output file for CRYSTAL (modifiable via the environment variable $OUTFILE)
.d3	PROPERTIES calculation input (for properties, Pproperties programs)
.outp	Printed output file for PROPERTIES (modifiable via the environment variable)

To see the usage for these scripts, simply run them without additional arguments to display information.

The scripts depend on several environment variables (details in the Installation Instructions). Users should configure these variables based on their installation directory structure.

To automate this process, it is possible source the “cry32.cshrc” or “cry23.bashrc” file in the “.cshrc” or “.bashrc”, respectively, after setting the appropriate paths. For example, to source the “cry23.bashrc” file, add this line to the .bashrc:

source path/to/crystal/utils/cry23.bashrc

Here are the key variables that should be set by user and their meaning:

CRY23_ROOT="$HOME/CRYSTAL23"	CRSYTAL23 main directory
CRY23_BIN="bin"			name of directory with executables
CRY23_ARCH="Linux-ifort"	string defining compiler/platform
VERSION="v1.0.1"		string associated with binary version
CRY23_SCRDIR="$HOME/tmp"	scratch folder (to be created by user)

Optional variables include:

OUTFILE		Suffix for output files			    (Default: out)
CRY23_INP	Directory for "crystal" input files	    (Default: $here)

The default directory structure is:

$HOME/CRYSTAL23/bin/Linux-ifort/v1.0.1

Using the runcry23 script, the CRYSTAL program is executed for wave function and total energy calculations (and for many other options if required). If the execution is successful and an input file with the same name as the CRYSTAL input file (but with a .d3 extension) is present, the PROPERTIES program is automatically executed.

Alternatively the scripts runprop23 can be used providing both the name of the input file (.d3 extension) and the wave function file
(.f9 extension).

At the end of the calculation, all the available output files, if they exist, are saved in the current directory. Some of these files are reported in the following lists, along with their corresponding saved names, data, and associated keywords:

CRYSTAL Program:

File Name	Saved As	Data Stored	Keyword
fort.9	inpfilename.f9	Binary wave function	Default, if SCF successful
fort.98	inpfilename.f98	Formatted wave function	Default, if SCF successful
fort.33	inpfilename.xyz	Atom coordinates	COORPRT
fort.34	inpfilename.gui	GUI-compatible crystal structure	EXTPRT
HESSOPT.DAT	inpfilename.hessopt	Hessian	OPTGEOM
OPTINFO.DAT	inpfilename.optinfo	Info for optimization restart	OPTGEOM
FREQINFO.DAT	inpfilename.freqinfo	Info for frequencies restart	FREQCALC
HESSFREQ.DAT	inpfilename.hessfreq	Hessian	FREQCALC
REFLECTANCE.DAT	inpfilename.reflectance	Reflectance data to plot	FREQCALC/REFLECTANCE
EOSINFO.DAT	inpfilename.eosinfo	Equation of State	EOS
ELASINFO.DAT	inpfilename.elasinfo	Elastic constants	ELASTCON

PROPERTIES Program:

File	Saved as	Data stored	Keyword
fort.98	inpfilename.f98	Formatted wave function	FMWF
fort.33	inpfilename.xyz	Atoms coordinates - xyz format	COORPRT
fort.25	inpfilename.f25	Data for visualization	BAND,DOSS,ECHG,POTM
fort.31	inpfilename_dat.prop3d	3D charge (spin) density/potential	ECH3/POT3
fort.32	inpfilename_dat.info3d	Information on 3D grid	ECH3/POT3 or GRID3D
fort.34	inpfilename.gui	GUI - crystal structure	EXTPRT
SYMMINFO.DAT	inpfilename.sym	Symmetry information	SYMMINFO
BAND.DAT	inpfilename_dat.BAND	Band structure	BAND
DOSS.DAT	inpfilename_dat.DOSS	Density of states	DOSS
RHOLINE.DAT	inpfilename_dat.RHOLINE	electron density profile	ECHG
POTC.DAT	inpfilename_dat.POTC	exact electrostatic potential	POTC (FIELD in geometry input)
EMDL.DAT	inpfilename_dat.EMDL	Electron Momentum distribution	EMDL
BIDIDI.DAT	inpfilename.BDIDI	B(r)	BIDIERD
GRED.DAT	inpfilename.GRED	BS, geom,direct lattice reducible data	CRYAPI_OUT
KRED.DAT	inpfilename.KRED	Eigenvectors in full BZ	CRYAPI_OUT

Files with the .f9 extension are not modified by PROPERTIES

For a full list of output files, run the scripts runcry23 or runprop23 without any additional argument.

The first run

This section refers mainly to the use of the sequential executables.

As an example, an INPUT file for the alpha quartz is provided (quartz_test.d12). The CRYSTAL calculation can be executed using the following command:

runcry23 quartz_test

The output file provides crucial information, depending on the calculation options specified in the INPUT file. For example, the final cell energy of the system is reported at the end of the SCF procedure. This value can be easily located by searching for the string: “TOTAL ENERGY”.

As reference, the final energy of the example system should be:

TOTAL ENERGY(DFT)(AU)(  7) -1.7596024773958E+03 DE-2.8E-07 tester 6.5E-10

At this stage, a PROPERTIES calculation can be performed using a new INPUT file (quartz_band.d3) and the wave function obtained from the previous calculation (.f9 extension). As reference, in this case a BAND structure calculation is presented.

The properties calculation can then be executed with the following command:

runprop23 quartz_bands quartz_test

Plotting Tools

CRYSTAL provides many tools to visualize the results of your calculations, in this tutorial a couple of them will be featured:

CRYSPLOT

CRYSPLOT is an online tool designed for quick and easy plotting of basic results. Users can access it here.

To use the tool, select the desired property from the menu in the top-left corner of the page. Typically, a .f25 or .DAT file is required as input. After uploading the file, the plot will be generated, with options available to customize its appearance.

CRYSTALClear

A Python package offering advanced plotting capabilities using Matplotlib. It can be installed via pip or from the online GitHub repository.

As an example, a Jupyter notebook to plot the output of the previous band calculation is provided.

Parallel run

The parallel versions of CRYSTAL and PROPERTIES are based on the MPI (Message Passing Interface) framework. To run these executables, the “mpirun” command (or its equivalents) should be used:

 mpirun -np <MPI_processes> ./path/to/executables/Pcrystal

No additional options are required for basic execution. However, to ensure that the desired computing cores are being used effectively, some printing options can be helpful:

For **Open MPI**, use the `--report-bindings` option to display how processes are bound to cores:
```
 mpirun --report-bindings -np <MPI_processes> ./path/to/executables/Pcrystal 
```
For **Intel MPI**, use the `-print-rank-map` option to achieve a similar effect:
```
 mpirun -print-rank-map -np <MPI_processes> ./path/to/executables/Pcrystal 
```

If the process bindings are not coherent with the computing architecture (e.g., suboptimal core usage or misaligned with the hardware topology), specific binding options provided by the MPI framework should be used. For detailed instructions on these options, refer to the MPI manuals of the implementation you are using.

OpenMP

When executing OpenMP-enabled versions of crystal or properties, users are advised to increase the stack size to unlimited. This can be done using the following command before execution:

 export OMP_TACKSIZE=16M

To control how many threads each MPI process generate, use the “OMP_NUM_THREADS” environment variable. For example, to set the number of threads to “n”, use the command:

 export OMP_NUM_THREADS=n

To maximize efficiency, it is recommended that the total number of hardware cores is at least equal to the product of the number of OpenMP threads and the number of MPI processes:

\[ \begin{equation} \text{hardware cores} \ge \text{OMP threads} \times \text{MPI processes} \end{equation} \]

For optimal performance, it is generally recommended to use between 2 to 8 OpenMP threads per MPI process. Exceeding this range may lead to diminishing returns due to increased overhead in thread management.

Massive Parallel Process – MPP

Running the MPP version of CRYSTAL follows the same procedure as the Parallel version. The executable can be launched using the “mpirun” command:

 mpirun -np <MPI_processes> ./path/to/executables/MPPcrystal < INPUT

The MPP version utilizes the ScaLAPACK library from Intel MKL to distribute linear algebra operations across multiple MPI processes, making it highly suitable for extensive HPC environments that provide access to a large number of computing cores.

Select the right executable for your needs

The computational cost of the different steps in the SCF procedure is influenced by multiple factors, with the most significant being the size of the system, in terms of the number of basis functions (or atomic orbitals) used.

In a simple test case, it is observed that the most computationally intensive tasks are the calculation of two-electron integrals (green) and the linear algebra operations (blue). A crossover point can be seen in their relative computational cost (red circle). This is because the calculation of two-electron integrals scales quasi-linearly, while the cost of linear algebra operations scales cubically.

As a result, the system size is a crucial factor to consider when deciding which executable to use for the calculation.

The primary distinction between the different CRYSTAL executable is in how they handle linear algebra operations. These are carried out in reciprocal space across a defined number of k points within the First Brillouin Zone. The most computationally expensive task in this process is the diagonalization of the Hamiltonian matrix. For each k point a single matrix with dimensions of N² is generated, where N is the number of atomic orbitals used in the calculation.

CRYSTAL uses multiple levels of parallelism to optimize this process. The first level of parallelism takes advantage of the independence of k points, distributing the matrices to be diagonalized evenly among the processes. The second level applies when spin-unrestricted calculations are performed. In this case, the Hamiltonian matrix for each k point is divided into two matrices - one for each spin component (alpha and beta). These matrices are treated as separate points and are also evenly distributed among the processes.

This distribution is managed through MPI parallelism, which can lead to an increase in memory usage due to the replication of data across processes.

Pcrystal/Properties

In the standard parallel version of the code, Pcrystal, a third level of parallelism is introduced through symmetry irreducible representations. This level is enabled by default and can significantly reduce computational costs. The Hamiltonian matrix, due to its block-diagonal structure from symmetry, is further divided into multiple smaller matrices of varying sizes. These smaller matrices are then distributed across different MPI processes for parallel computation.

This version of the code relies on internal libraries for linear algebra operations, with the Jacobi algorithm used for diagonalization. The Jacobi method is an iterative procedure, and results to be particularly time-consuming during the first diagonalization of the Hamiltonian matrix in the initial SCF cycle, as the matrix is densely populated. However, in subsequent SCF cycles, the matrix is almost diagonal, making the diagonalization process faster. This results in a noticeable difference in timing between the first SCF iteration and the following ones. However, this is the least memory-consuming diagonalization algorithm among all CRYSTAL executables, thus this code can be used to lower memory requirements for this specific operation.

This version of the code is best suited on relatively small computational clusters for small to medium-sized systems where the number of k points (or irreducible representations) exceeds the number of MPI processes. In such cases, the SCF steps show good scalability until the number of processes surpasses the available k points (or irreducible representations). At this stage some processes inevitably remain idle, reducing the overall efficiency.

For larger systems, where linear algebra operations become more computationally demanding, fewer k points are usually required, leading to reduced scalability. In such cases, this executable may not be the most efficient choice.

Pcrystal/Pproperties OMP

OpenMP parallelism has been integrated throughout the entire SCF procedure and force calculations. Using the same number of physical cores but reducing the number of MPI processes while increasing the number of threads per process, offers several advantages. These include significantly reduced memory usage due to less data replication, improved scalability by enabling multiple cores to work on a single matrix operation, and faster communication due to the reduced number of MPI processes.

To enable threading within linear algebra operations, this version of the code requires specific linear algebra libraries, such as BLAS and LAPACK, as implemented in Intel oneAPI MKL. The LAPACK diagonalizer, which uses the divide-and-conquer algorithm, ensures that the cost of the linear algebra step remains consistent across all SCF cycles, unlike in previous executables where the first cycle was more expensive.

The OpenMP-enabled version of the code shares similar considerations with the standard Pcrystal executable. It is best suited for small to medium-sized systems but can also be effectively used on larger hardware architectures without compromising scalability, up to the point where the number of MPI processes matches the number of k points (or irreducible representations). It is recommended not to exceed 4 to 8 threads per process to maintain optimal performance.

MPPcrystal

The massively parallel (MPP) version of the code is significantly different from the others. It employs the ScaLAPACK library (from Intel oneAPI MKL) to enable highly scalable linear algebra operations across many processes. This scalability makes the linear algebra step of the SCF procedure extremely efficient on large-scale hardware. However, due to the increased complexity of the algorithm, the layer of parallelism over symmetry irreducible representations, present in other versions, is not available.

The MPP code is not optimized for memory or time efficiency on small hardware architectures. However, by distributing memory among multiple processes, it reduces the memory-per-core requirement, enabling calculations on large and very large systems that would be challenging or impossible with the standard parallel version. This code is ideal for use in environments with extensive HPC resources.

Within the MPP code, there are two parallelization logics for handling k points during the SCF process: "allk" and "splitk."

Allk

This is the default execution mode, where k points are processed sequentially. All available MPI processes collaborate on operations for a single Hamiltonian matrix at a time.

Splitk

In this mode, all k points are processed simultaneously, with subsets of processes working on different Hamiltonian matrices concurrently. This mode is automatically activated when the number of MPI processes matches a specific key number:

\[ \begin{equation} N \text{ processes} = (nR + 2 nC) \times \text{non-prime integer (4,6,8,9…)} \end{equation} \]

where nR and nC represent the number of real and complex k points, respectively.

The splitk mode is more efficient than allk, making it the preferred choice when possible. However, it requires an even larger number of processes.

The exact number needed can be determined by performing a testrun: the output will indicate the exact number of processes required for the splitk logic,search for the string: “parallelism”

For reference the output will be as this:

 
*******************************************************************************
*******************************************************************************
  Please, consider that for this calculation a double level of parallelism
  can be exploited by using a number of processors given by:

      N_proc =     27 x a_non_prime_integer (as 4, or 6, or 8, or 9, ...)

  For instance, N_proc =      108, or      162, or      216, or      243, or ...
                N_proc =     1728, or     3456, or     6912, or    13824, or ...
                N_proc =    27648, or    55296, or ...
*******************************************************************************
*******************************************************************************

The MPP code also supports OpenMP directives, with considerations similar to those outlined for the OpenMP-enabled versions of other executables.