Nat Commun. 2024 Dec 6;15(1):10570.
doi: 10.1038/s41467-024-54639-7.

Crystal structure generation with autoregressive large language modeling

Luis M Antunes et al. Nat Commun. 2024.

Abstract

The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective models of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Large language modeling of CIF files.
a Core concepts in training the model: A CIF file (left) is converted into a sequence of symbols, through tokenization. The sequence is processed by the model, which produces a list of probability distributions over the vocabulary, for each corresponding symbol in the input. The resulting predicted probability distributions are evaluated against the target distributions (which contain the entire probability mass on the correct subsequent token), using the cross-entropy loss metric. The target tokens are the input tokens shifted one spot to the left, as the objective is to predict the next token given a sequence of preceding tokens. The tokens are categorized as CIF tags (blue), atoms (green), numeric digits (gold), and punctuation (red). Output tokens (not actually sampled during training) represent the tokens assigned the highest probability by the model. Underlined tokens represent predicted distributions assigning a relatively low probability to the correct next token. b Generation of a CIF file: First, a prompt is constructed by concatenating the symbol data_ with the desired cell composition, which is then tokenized and processed by the model. Next, a token is sampled from the predicted distribution for the upcoming token in the sequence. Finally, the sampled token is added to the accumulating contents of the CIF file. This procedure continues iteratively until a predefined terminating condition is met (e.g., two consecutive newline tokens are sampled).
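The two panels above can be illustrated with a minimal sketch. It is not the authors' implementation: `shifted_pairs`, `cross_entropy`, `generate`, and `toy_model` are hypothetical names, the "model" is a fixed toy distribution, and the composition is treated as a single token for brevity (a real tokenizer would split it into atoms and digits).

```python
import math
import random

def shifted_pairs(tokens):
    """Pair each input token with its target: the same sequence shifted by one."""
    return list(zip(tokens[:-1], tokens[1:]))

def cross_entropy(predicted, target_token):
    """Loss at one position; the target puts all probability mass on one token."""
    return -math.log(predicted[target_token])

def generate(next_token_probs, composition, max_tokens=200, seed=0):
    """Sample tokens one at a time until two consecutive newlines appear."""
    rng = random.Random(seed)
    tokens = ["data_", composition]  # prompt: data_ followed by the cell composition
    while len(tokens) < max_tokens:
        probs = next_token_probs(tokens)  # dict mapping token -> probability
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
        if tokens[-2:] == ["\n", "\n"]:  # terminating condition from the caption
            break
    return "".join(tokens)

def toy_model(tokens):
    """Hypothetical stand-in for the trained model: ignores its input."""
    return {"\n": 0.5, "_cell_length_a ": 0.25, "3.778": 0.25}

print(shifted_pairs(["data_", "Na", "Cl", "\n"]))
print(repr(generate(toy_model, "NaCl")))
```

The `max_tokens` cap is an added safeguard for the sketch, so that sampling stops even if the terminating condition is never met.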
Fig. 2. Generated vs. true or implied cell parameters.
a The generated cell lengths for matching structures of the test set vs. the true cell lengths, when space group is included. b The generated cell volumes for matching structures of the test set vs. either the true cell volumes, or the cell volumes implied from the generated cell parameters, when space group is included. Source data are provided as a Source Data file.
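As a sketch of how a cell volume can be implied from cell lengths and angles, the standard triclinic volume formula is shown below. Whether the authors computed implied volumes exactly this way is an assumption, and `cell_volume` is a hypothetical helper name.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in Å^3 from lengths in Å and angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

# Cubic cell using the CsCuTePt parameters quoted in Fig. 3b.
print(round(cell_volume(7.153, 7.153, 7.153, 90.0, 90.0, 90.0), 2))
```

For a cubic cell the expression reduces to a³, and for a hexagonal cell (γ = 120°) it reduces to a²c·sin(120°).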
Fig. 3. The generated structures of various inorganic compounds.
a Ba2MnCr. Cell parameters: a, b: 3.778 Å, c: 27.503 Å, α, β: 90.0°, γ: 120.0°. Color scheme: Ba: green, Mn: purple, Cr: blue. b CsCuTePt. Cell parameters: a, b, c: 7.153 Å, α, β, γ: 90.0°. Color scheme: Cs: purple, Cu: blue, Te: gold, Pt: white. c YbMn6Sn6. Cell parameters: a, b: 5.488 Å, c: 8.832 Å, α, β: 90.0°, γ: 120.0°. ZrMn6Sn6, which is in the training set, has the same structure but with the following cell parameters: a, b: 5.364 Å, c: 8.933 Å, α, β: 90.0°, γ: 120.0°. Color scheme: Yb: green, Mn: magenta, Sn: gray. d AuO2. Cell parameters: a, b: 4.838 Å, c: 3.429 Å, α, β, γ: 90.0°. Color scheme: Au: yellow, O: red. e Sm2BS4. Cell parameters: a, b, c: 10.884 Å, α, β, γ: 90.0°. Color scheme: Sm: light green, B: green, S: yellow. f KRb2TiF6. Cell parameters: a, b, c: 8.688 Å, α, β, γ: 90.0°. Color scheme: K: white, Rb: purple, Ti: brown, F: green. g LiTa2NiSe5 (a: 3.517 Å, b: 13.362 Å, c: 15.156 Å), which resembles a recently reported structure. h Ta2NiSe5, seen in training. i NaSn2CuSe5, seen in training. Source data are provided as CIF files in the Source Data file.
Fig. 4. Pyrochlore case study results.
Generated vs. DFT-derived values of the cell parameter a for selected pyrochlores not in the training dataset. The y-coordinate of each point is the mean value of the a cell parameter across the three generation attempts (all of which resulted in the pyrochlore structure), and the error bars represent ± the standard deviation across those attempts. The inset shows the structure of the generated pyrochlore Pr2Mn2O7, with cell parameters a, b, c: 10.34 Å, α, β, γ: 90.0°. Color scheme: Pr: yellow, Mn: purple, O: red. Source data are provided as a Source Data file.
Fig. 5. The Monte Carlo Tree Search decoding procedure.
CIF files are generated as a tree is iteratively constructed, with each iteration guiding the generation of subsequent structures towards more desirable parameters (e.g., lower formation energy per atom). The nodes in the tree represent the cumulative contents of a CIF file at various points. a The Selection step involves descending the tree by choosing the most promising node at each level, using a variant of the PUCT algorithm. b During Expansion, an unexplored child node is randomly selected and added to the tree. If a node has only one highly probable child (represented as empty nodes), the child node bypasses the Rollout step. c The Rollout step involves prompting the model with the contents of the selected node, and sampling from the model until a terminal condition is met, so as to obtain a complete CIF file and an estimate of the value of a node. d The generated structure is validated and scored, incorporating the prediction of the structure’s formation energy per atom, as given by a pre-trained neural network. e Finally, the score is backpropagated through the selected nodes, which store the accumulated results of each iteration. The resulting generated CIF file, if valid, is returned.
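The Selection and backpropagation steps described above can be sketched as follows. The caption mentions only "a variant of the PUCT algorithm", so the textbook PUCT score used here is an assumption rather than the authors' exact rule, and `Node`, `puct_score`, `select`, `backpropagate`, and `c_puct` are hypothetical names. Expansion, Rollout, and scoring are omitted.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                # cumulative CIF contents at this node
    prior: float             # model probability of the token leading here
    visits: int = 0
    value_sum: float = 0.0   # accumulated scores from past iterations
    parent: "Node | None" = None
    children: list = field(default_factory=list)

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(child, c_puct=1.0):
    """Textbook PUCT: mean value plus a prior-weighted exploration bonus."""
    bonus = c_puct * child.prior * math.sqrt(child.parent.visits) / (1 + child.visits)
    return child.mean_value() + bonus

def select(root):
    """Descend the tree, taking the highest-scoring child at each level."""
    node = root
    while node.children:
        node = max(node.children, key=puct_score)
    return node

def backpropagate(node, score):
    """Add the score to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += score
        node = node.parent

root = Node("data_", prior=1.0)
a = Node("data_A", prior=0.7, parent=root)
b = Node("data_B", prior=0.3, parent=root)
root.children = [a, b]
backpropagate(a, 0.0)
backpropagate(a, 0.0)
print(select(root).text)
```

After two unrewarding visits to the higher-prior child, the exploration bonus shifts selection to its less-visited sibling, which is the balance between prior and experience that PUCT-style rules are designed to strike.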
Fig. 6. Unconditionally generated novel structures.
The four lowest-energy novel structures generated unconditionally by the large model. a Ba4Na2Ir2O11, Z = 2, Cm. Cell parameters: a: 10.308 Å, b: 5.995 Å, c: 10.269 Å, α, γ: 90.0°, β: 108.5°. Color scheme: Ba: green, Na: orange, Ir: white, O: red. Ehull = 0.00 eV/atom. b NaAlS2, Z = 16, P21. Cell parameters: a: 10.233 Å, b: 10.277 Å, c: 13.703 Å, α, γ: 90.0°, β: 100.9°. Color scheme: Na: orange, Al: gray, S: yellow. Ehull = 0.00 eV/atom. c Ca2YSbO6, Z = 2, P21/c. Cell parameters: a: 5.651 Å, b: 5.853 Å, c: 9.850 Å, α, γ: 90.0°, β: 125.0°. Color scheme: Ca: blue, Y: purple, Sb: bronze, O: red. Ehull = 0.00 eV/atom. d Li2FeSiO4, Z = 4, Pna21. Cell parameters: a: 10.988 Å, b: 6.278 Å, c: 5.026 Å, α, β, γ: 90.0°. Color scheme: Si: light blue, Fe: dark gray, Li: light green, O: red. Ehull = 0.02 eV/atom. Source data are provided as CIF files in the Source Data file.

