The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, 2009).
Meftahi, N. et al. Machine learning property prediction for organic photovoltaic devices. npj Comput. Mater. 6, 166 (2020).
Gupta, A., Chakraborty, S. & Ramakrishnan, R. Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules. Mach. Learn. Sci. Technol. 2, 035010 (2021).
Pinheiro, G. A. et al. Machine learning prediction of nine molecular properties based on the SMILES representation of the QM9 quantum-chemistry dataset. J. Phys. Chem. A 124, 9854–9866 (2020).
Guan, Y., Shree Sowndarya, S. V., Gallegos, L. C., St. John, P. C. & Paton, R. S. Real-time prediction of 1H and 13C chemical shifts with DFT accuracy using a 3D graph neural network. Chem. Sci. 12, 12012–12026 (2021).
Borlido, P. et al. Exchange–correlation functionals for band gaps of solids: benchmark, reparametrization and machine learning. npj Comput. Mater. 6, 96 (2020).
Ward, L. et al. matminer: an open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
Deringer, V. L., Caro, M. A. & Csányi, G. Machine learning interatomic potentials as emerging tools for materials science. Adv. Mater. 31, 1902765 (2019).
Grambow, C. A., Pattanaik, L. & Green, W. H. Deep learning of activation energies. J. Phys. Chem. Lett. 11, 2992–2997 (2020).
Wen, M., Blau, S. M., Spotte-Smith, E. W. C., Dwaraknath, S. & Persson, K. A. BonDNet: a graph neural network for the prediction of bond dissociation energies for charged molecules. Chem. Sci. 12, 1858–1868 (2021).
Griffiths, R.-R. & Hernández-Lobato, J. M. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chem. Sci. 11, 577–586 (2020).
Schweidtmann, A. M. et al. Machine learning meets continuous flow chemistry: automated optimization towards the Pareto front of multiple objectives. Chem. Eng. J. 352, 277–282 (2018).
Weston, L. et al. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 59, 3692–3702 (2019).
Huang, S. & Cole, J. M. BatteryDataExtractor: battery-aware text-mining software embedded with BERT models. Chem. Sci. 13, 11487–11495 (2022).
Musielewicz, J., Wang, X., Tian, T. & Ulissi, Z. FINETUNA: fine-tuning accelerated molecular simulations. Mach. Learn. Sci. Technol. 3, 03LT01 (2022).
Sultan, M. M. & Pande, V. S. Automated design of collective variables using supervised machine learning. J. Chem. Phys. 149, 094106 (2018).
Roch, L. M. et al. ChemOS: orchestrating autonomous experimentation. Sci. Robot. 3, eaat5559 (2018).
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
OpenAI et al. GPT-4 technical report. Preprint at (2024).
Yin, S. et al. A survey on multimodal large language models. Natl Sci. Rev. 11, nwae403 (2024).
Gemini Team et al. Gemini: a family of highly capable multimodal models. Preprint at (2024).
Honda, S., Shi, S. & Ueda, H. R. SMILES Transformer: pre-trained molecular fingerprint for low data drug discovery. Preprint at (2019).
MegaMolBART. GitHub (2022).
Sakano, K., Furui, K. & Ohue, M. NPGPT: natural product-like compound generation with GPT-based chemical language models. J. Supercomput. 81, 352 (2025).
Mazuz, E., Shtar, G., Shapira, B. & Rokach, L. Molecule generation using transformers and policy gradient reinforcement learning. Sci. Rep. 13, 8799 (2023).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 27730–27744 (Curran Associates, Inc., 2022).
Rafailov, R. et al. Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 53728–53741 (Curran Associates, Inc., 2023).
Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
Ruan, Y. et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat. Commun. 15, 10160 (2024).
McNaughton, A. D. et al. CACTUS: Chemistry Agent Connecting Tool-Usage to Science. ACS Omega 9, 46563–46573 (2024).
Hendrycks, D. et al. Measuring massive multitask language understanding. Preprint at (2021).
Templeton, A. et al. Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread (Anthropic, 2024).
Miret, S. & Krishnan, N. M. A. Are LLMs ready for real-world materials discovery? Preprint at (2024).
Geva, M. et al. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguist. 9, 346–361 (2021).
Gao, Y. et al. Retrieval-augmented generation for large language models: a survey. Preprint at (2024).
Lin, Y.-T. & Chen, Y.-N. LLM-Eval: unified multi-dimensional automatic evaluation for open-domain conversations with large language models. Preprint at (2023).
Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 59662–59688 (Curran Associates, Inc., 2023).
Mirza, A. et al. Are large language models superhuman chemists? Preprint at (2024).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Sainz, O. et al. NLP evaluation in trouble: on the need to measure LLM data contamination for each benchmark. Preprint at (2023).
Sharma, M. et al. Towards understanding sycophancy in language models. Preprint at (2023).
Ranaldi, L. & Pucci, G. When large language models contradict humans? Large language models’ sycophantic behaviour. Preprint at (2024).
Schoenegger, P. & Park, P. S. Large language model prediction capabilities: evidence from a real-world forecasting tournament. Preprint at (2023).
Liu, S., Chen, C., Qu, X., Tang, K. & Ong, Y.-S. Large language models as evolutionary optimizers. In 2024 IEEE Congress on Evolutionary Computation (CEC) 1–8 (IEEE, 2024).
Chiang, W.-L. et al. Chatbot Arena: an open platform for evaluating LLMs by human preference. Preprint at (2024).
Mucci, T. & Stryker, C. What Is Artificial Superintelligence? (IBM, 2023).
Brockman, G. et al. OpenAI Gym. Preprint at (2016).
Wang, J. et al. GTA: a benchmark for general tool agents. Preprint at (2024).
Qin, Y. et al. ToolLLM: facilitating large language models to master 16000+ real-world APIs. Preprint at (2023).
Patil, S. G., Zhang, T., Wang, X. & Gonzalez, J. E. Gorilla: large language model connected with massive APIs. In Advances in Neural Information Processing Systems (eds Globerson, A. et al.) Vol. 37, 126544–126565 (Curran Associates, Inc., 2024).
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S. & Kambhampati, S. PlanBench: an extensible benchmark for evaluating large language models on planning and reasoning about change. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 38975–38987 (Curran Associates, Inc., 2023).
Valmeekam, K., Marquez, M., Sreedharan, S. & Kambhampati, S. On the planning abilities of large language models—a critical investigation. Adv. Neural Inf. Process. Syst. 36, 75993–76005 (2023).
Skarlinski, M. D. et al. Language agents achieve superhuman synthesis of scientific knowledge. Preprint at (2024).
Si, C., Yang, D. & Hashimoto, T. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. Preprint at (2024).
HasAnyone (FutureHouse, 2024).
Zhou, Y., Liu, H., Srivastava, T., Mei, H. & Tan, C. Hypothesis generation with large language models. Preprint at (2024).
Wellawatte, G. P. & Schwaller, P. Extracting human interpretable structure–property relationships in chemistry using XAI and large language models. Preprint at (2023).
Learning to reason with LLMs (OpenAI, 2024).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 24824–24837 (Curran Associates, Inc., 2022).
Muralidharan, S. et al. Compact language models via pruning and knowledge distillation. In Advances in Neural Information Processing Systems (eds Globerson, A. et al.) Vol. 37, 41076–41102 (Curran Associates, Inc., 2024).
Sreenivas, S. T. et al. LLM pruning and distillation in practice: the Minitron approach. Preprint at (2024).
Rai, D., Zhou, Y., Feng, S., Saparov, A. & Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. Preprint at (2024).
Zhou, Z., Li, X. & Zare, R. N. Optimizing chemical reactions with deep reinforcement learning. ACS Cent. Sci. 3, 1337–1344 (2017).
Bowden, G. D., Pichler, B. J. & Maurer, A. A design of experiments (DoE) approach accelerates the optimization of copper-mediated 18F-fluorination reactions of arylstannanes. Sci. Rep. 9, 11370 (2019).
Cao, B. et al. How to optimize materials and devices via design of experiments and machine learning: demonstration using organic photovoltaics. ACS Nano 12, 7434–7444 (2018).
Reis, M. et al. Machine-learning-guided discovery of 19F MRI agents enabled by automated copolymer synthesis. J. Am. Chem. Soc. 143, 17677–17689 (2021).
Mahjour, B., Hoffstadt, J. & Cernak, T. Designing chemical reaction arrays using Phactor and ChatGPT. Org. Process Res. Dev. 27, 1510–1516 (2023).
Přichystal, J., Schug, K. A., Lemr, K., Novák, J. & Havlíček, V. Structural analysis of natural products. Anal. Chem. 88, 10338–10346 (2016).
Nature submission guidelines (Nature Medicine, 2025).
Yang, E. et al. Model merging in LLMs, MLLMs, and beyond: methods, theories, applications and opportunities. Preprint at (2024).
Christensen, M. et al. Automation isn’t automatic. Chem. Sci. 12, 15473–15490 (2021).
Arnold, C. Cloud labs: where robots do the research. Nature 606, 612–613 (2022).
Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 21558–21572 (Curran Associates, Inc., 2023).
O’Donoghue, O. et al. BioPlanner: automatic evaluation of LLMs on protocol planning in biology. Preprint at (2023).
Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at (2023).
Taylor, R. et al. Galactica: a large language model for science. Preprint at (2022).
Zheng, Z., Zhang, O., Borgs, C., Chayes, J. T. & Yaghi, O. M. ChatGPT Chemistry Assistant for text mining and prediction of MOF synthesis. J. Am. Chem. Soc. 145, 18048–18062 (2023).
Perplexity AI. www.perplexity.ai (2022).
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Comput. Mater. 6, 138 (2020).
Mateiu, P. & Groza, A. Ontology engineering with large language models. Preprint at (2023).
Babaei Giglou, H., D’Souza, J. & Auer, S. LLMs4OL: large language models for ontology learning. In The Semantic Web—ISWC 2023 (eds Payne, T. R. et al.) 408–427 (Springer Nature, 2023).
Ciatto, G., Agiollo, A., Magnini, M. & Omicini, A. Large language models as oracles for instantiating ontologies with domain-specific knowledge. Knowl.-Based Syst. 310, 112940 (2025).
Ye, Y. et al. Construction and application of materials knowledge graph in multidisciplinary materials science via large language model. In Advances in Neural Information Processing Systems (eds Globerson, A. et al.) Vol. 37, 56878–56897 (Curran Associates, Inc., 2024).
Pascazio, L. et al. Chemical species ontology for data integration and knowledge discovery. J. Chem. Inf. Model. 63, 6569–6586 (2023).
Gontier, N., Rodriguez, P., Laradji, I., Vazquez, D. & Pal, C. Language decision transformers with exponential tilt for interactive text environments. Preprint at (2023).
Wu, Y.-H., Wang, X. & Hamaya, M. Elastic decision transformer. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 18532–18550 (Curran Associates, Inc., 2023).
Xi, Z. et al. The rise and potential of large language model based agents: a survey. Sci. China Inf. Sci. 68, 121101 (2025).
DeepSeek-AI et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at (2025).
Wang, X. et al. Executable code actions elicit better LLM agents. Preprint at (2024).
Zhang, B. et al. Benchmarking the text-to-SQL capability of large language models: a comprehensive evaluation. Preprint at (2024).
Cheng, G. et al. Empowering large language models on robotic manipulation with affordance prompting. Preprint at (2024).
Reaxys (Elsevier, 2009).
SciFinder (CAS, 1995).
Swain, M. C. & Cole, J. M. ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
CVE-2023-36258 Detail. National Vulnerability Database (NIST, accessed 11 June 2024); https://nvd.nist.gov/vuln/detail/CVE-2023-36258
Ruan, Y. et al. Accelerated end-to-end chemical synthesis development with large language models. Preprint at (2024).
Tom, G. et al. Self-driving laboratories for chemistry and materials science. Chem. Rev. 124, 9633–9732 (2024).
Software Business: 14th International Conference, ICSOB 2023, Lahti, Finland, November 27–29, 2023, Proceedings (Springer Nature, 2024).
Favato, D., Ishitani, D., Oliveira, J. & Figueiredo, E. Linus’s law: more eyes fewer flaws in open source projects. In Proc. XVIII Brazilian Symposium on Software Quality 69–78 (ACM, 2019).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. Preprint at (2023).
Huang, Q., Vora, J., Liang, P. & Leskovec, J. MLAgentBench: evaluating language agents on machine learning experimentation. Preprint at (2023).
Liu, X. et al. AgentBench: evaluating LLMs as agents. Preprint at (2023).
Hasselgren, C. & Oprea, T. I. Artificial intelligence for drug discovery: are we there yet? Annu. Rev. Pharmacol. Toxicol. 64, 527–550 (2024).
Bordukova, M., Makarov, N., Rodriguez-Esteban, R., Schmich, F. & Menden, M. P. Generative artificial intelligence empowers digital twins in drug discovery and clinical trials. Expert Opin. Drug Discov. 19, 33–42 (2024).
AI’s potential to accelerate drug discovery needs a reality check. Nature 622, 217 (2023).
Zhang, Y. et al. Siren’s song in the AI ocean: a survey on hallucination in large language models. Preprint at (2023).
Li, J., Cheng, X., Zhao, X., Nie, J.-Y. & Wen, J.-R. HaluEval: a large-scale hallucination evaluation benchmark for large language models. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 6449–6464 (Association for Computational Linguistics, 2023).
Dhuliawala, S. et al. Chain-of-Verification reduces hallucination in large language models. Preprint at (2023).
Tonmoy, S. M. T. I. et al. A comprehensive survey of hallucination mitigation techniques in large language models. Preprint at (2024).
Liu, H. et al. A survey on hallucination in large vision-language models. Preprint at (2024).
Guan, X. et al. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proc. 38th AAAI Conference on Artificial Intelligence (eds Wooldridge, M. et al.) 18126–18134 (AAAI, 2024).
Xu, Z., Jain, S. & Kankanhalli, M. Hallucination is inevitable: an innate limitation of large language models. Preprint at (2024).
Li, J. et al. The dawn after the dark: an empirical study on factuality hallucination in large language models. Preprint at (2024).
Luo, J., Xiao, C. & Ma, F. Zero-resource hallucination prevention for large language models. Preprint at (2023).
Zhang, D. et al. ChemLLM: a chemical large language model. Preprint at (2024).
Yasunaga, M., Ren, H., Bosselut, A., Liang, P. & Leskovec, J. QA-GNN: reasoning with language models and knowledge graphs for question answering. Preprint at (2021).
Lu, L. et al. Physics-informed neural networks with hard constraints for inverse design. SIAM J. Sci. Comput. 43, B1105–B1132 (2021).
Han, S. et al. LLM multi-agent systems: challenges and open problems. Preprint at (2024).
Darvish, K. et al. ORGANA: a robotic assistant for automated chemistry experimentation and characterization. Matter 8, 101897 (2025).
Formica, M. et al. Catalytic enantioselective nucleophilic desymmetrization of phosphonate esters. Nat. Chem. 15, 714–721 (2023).
Ramos, M. C., Collison, C. J. & White, A. D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 16, 2514–2572 (2025).
Bran, A. M. & Schwaller, P. in Drug Development Supported by Informatics (eds Satoh, H. et al.) 143–163 (Springer Nature, 2024).
Pei, Q. et al. BioT5: enriching cross-modal integration in biology with chemical knowledge and natural language associations. Preprint at (2024).
Fang, J. et al. MolTC: towards molecular relational modeling in language models. Preprint at (2024).
Schuhmann, C. et al. LAION-5B: an open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 25278–25294 (Curran Associates, Inc., 2022).
Huang, J., Shao, H. & Chang, K. C.-C. Are large pre-trained language models leaking your personal information? Preprint at (2022).
Wahle, J. P., Ruas, T., Kirstein, F. & Gipp, B. How large language models are transforming machine-paraphrase plagiarism. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing 952–963 (Association for Computational Linguistics, 2022).
Karamolegkou, A., Li, J., Zhou, L. & Søgaard, A. Copyright violations and large language models. Preprint at (2023).
McDonald, J. et al. Great power, great responsibility: recommendations for reducing energy for training language models. In Findings of the Association for Computational Linguistics: NAACL 2022 1962–1970 (Association for Computational Linguistics, 2022).
Samsi, S. et al. From words to watts: benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) 1–9 (IEEE, 2023).
Patterson, D. et al. Carbon emissions and large neural network training. Preprint at (2021).
Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023).
The United States Artificial Intelligence Safety Institute: Vision, Mission, and Strategic Goals (NIST, 2024).
Canadian Artificial Intelligence Safety Institute (Government of Canada, accessed 30 January 2025); https://ised-isde.canada.ca/site/ised/en/canadian-artificial-intelligence-safety-institute
AISI Research (AI Security Institute, accessed 30 January 2025); https://www.aisi.gov.uk/category/research
Lee, S. & Manthiram, A. Can cobalt be eliminated from lithium-ion batteries? ACS Energy Lett. 7, 3058–3063 (2022).
Chung, C. et al. Decarbonizing the chemical industry: a systematic review of sociotechnical systems, technological innovations, and policy options. Energy Res. Soc. Sci. 96, 102955 (2023).
Xia, R., Overa, S. & Jiao, F. Emerging electrochemical processes to decarbonize the chemical industry. JACS Au 2, 1054–1070 (2022).
Amodei, D. Machines of Loving Grace (2024).
McNally, A., Prier, C. K. & MacMillan, D. W. C. Discovery of an α-amino C–H arylation reaction using the strategy of accelerated serendipity. Science 334, 1114–1117 (2011).
Ban, T. A. The role of serendipity in drug discovery. Dialogues Clin. Neurosci. 8, 335–344 (2006).