The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, 2009).
Meftahi, N. et al. Machine learning property prediction for organic photovoltaic devices. npj Comput. Mater. 6, 166 (2020).
Gupta, A., Chakraborty, S. & Ramakrishnan, R. Revving up 13C NMR shielding predictions across chemical space: benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules. Mach. Learn. Sci. Technol. 2, 035010 (2021).
Pinheiro, G. A. et al. Machine learning prediction of nine molecular properties based on the SMILES representation of the QM9 quantum-chemistry dataset. J. Phys. Chem. A 124, 9854–9866 (2020).
Guan, Y., Shree Sowndarya, S. V., Gallegos, L. C., St. John, P. C. & Paton, R. S. Real-time prediction of 1H and 13C chemical shifts with DFT accuracy using a 3D graph neural network. Chem. Sci. 12, 12012–12026 (2021).
Borlido, P. et al. Exchange–correlation functionals for band gaps of solids: benchmark, reparametrization and machine learning. npj Comput. Mater. 6, 96 (2020).
Ward, L. et al. matminer: an open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
Deringer, V. L., Caro, M. A. & Csányi, G. Machine learning interatomic potentials as emerging tools for materials science. Adv. Mater. 31, 1902765 (2019).
Grambow, C. A., Pattanaik, L. & Green, W. H. Deep learning of activation energies. J. Phys. Chem. Lett. 11, 2992–2997 (2020).
Wen, M., Blau, S. M., Spotte-Smith, E. W. C., Dwaraknath, S. & Persson, K. A. BonDNet: a graph neural network for the prediction of bond dissociation energies for charged molecules. Chem. Sci. 12, 1858–1868 (2021).
Griffiths, R.-R. & Hernández-Lobato, J. M. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chem. Sci. 11, 577–586 (2020).
Schweidtmann, A. M. et al. Machine learning meets continuous flow chemistry: automated optimization towards the Pareto front of multiple objectives. Chem. Eng. J. 352, 277–282 (2018).
Weston, L. et al. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 59, 3692–3702 (2019).
Huang, S. & Cole, J. M. BatteryDataExtractor: battery-aware text-mining software embedded with BERT models. Chem. Sci. 13, 11487–11495 (2022).
Musielewicz, J., Wang, X., Tian, T. & Ulissi, Z. FINETUNA: fine-tuning accelerated molecular simulations. Mach. Learn. Sci. Technol. 3, 03LT01 (2022).
Sultan, M. M. & Pande, V. S. Automated design of collective variables using supervised machine learning. J. Chem. Phys. 149, 094106 (2018).
Roch, L. M. et al. ChemOS: orchestrating autonomous experimentation. Sci. Robot. 3, eaat5559 (2018).
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
OpenAI et al. GPT-4 technical report. Preprint at (2024).
Yin, S. et al. A survey on multimodal large language models. Natl Sci. Rev. 11, nwae403 (2024).
Gemini Team et al. Gemini: a family of highly capable multimodal models. Preprint at (2024).
Honda, S., Shi, S. & Ueda, H. R. SMILES Transformer: pre-trained molecular fingerprint for low data drug discovery. Preprint at (2019).
MegaMolBART. GitHub (2022).
Sakano, K., Furui, K. & Ohue, M. NPGPT: natural product-like compound generation with GPT-based chemical language models. J. Supercomput. 81, 352 (2025).
Mazuz, E., Shtar, G., Shapira, B. & Rokach, L. Molecule generation using transformers and policy gradient reinforcement learning. Sci. Rep. 13, 8799 (2023).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 27730–27744 (Curran Associates, Inc., 2022).
Rafailov, R. et al. Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 53728–53741 (Curran Associates, Inc., 2023).
Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
Ruan, Y. et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat. Commun. 15, 10160 (2024).
McNaughton, A. D. et al. CACTUS: Chemistry Agent Connecting Tool-Usage to Science. ACS Omega 9, 46563–46573 (2024).
Hendrycks, D. et al. Measuring massive multitask language understanding. Preprint at (2021).
Templeton, A. et al. Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread (Anthropic, 2024).
Miret, S. & Krishnan, N. M. A. Are LLMs ready for real-world materials discovery? Preprint at (2024).
Geva, M. et al. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguist. 9, 346–361 (2021).
Gao, Y. et al. Retrieval-augmented generation for large language models: a survey. Preprint at (2024).
Lin, Y.-T. & Chen, Y.-N. LLM-Eval: unified multi-dimensional automatic evaluation for open-domain conversations with large language models. Preprint at (2023).
Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 59662–59688 (Curran Associates, Inc., 2023).
Mirza, A. et al. Are large language models superhuman chemists? Preprint at (2024).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Sainz, O. et al. NLP evaluation in trouble: on the need to measure LLM data contamination for each benchmark. Preprint at (2023).
Sharma, M. et al. Towards understanding sycophancy in language models. Preprint at (2023).
Ranaldi, L. & Pucci, G. When large language models contradict humans? Large language models’ sycophantic behaviour. Preprint at (2024).
Schoenegger, P. & Park, P. S. Large language model prediction capabilities: evidence from a real-world forecasting tournament. Preprint at (2023).
Liu, S., Chen, C., Qu, X., Tang, K. & Ong, Y.-S. Large language models as evolutionary optimizers. In 2024 IEEE Congress on Evolutionary Computation (CEC) 1–8 (IEEE, 2024).
Chiang, W.-L. et al. Chatbot Arena: an open platform for evaluating LLMs by human preference. Preprint at (2024).
Mucci, T. & Stryker, C. What Is Artificial Superintelligence? (IBM, 2023).
Brockman, G. et al. OpenAI Gym. Preprint at (2016).
Wang, J. et al. GTA: a benchmark for general tool agents. Preprint at (2024).
Qin, Y. et al. ToolLLM: facilitating large language models to master 16000+ real-world APIs. Preprint at (2023).
Patil, S. G., Zhang, T., Wang, X. & Gonzalez, J. E. Gorilla: large language model connected with massive APIs. In Advances in Neural Information Processing Systems (eds Globerson, A. et al.) Vol. 37, 126544–126565 (Curran Associates, Inc., 2024).
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S. & Kambhampati, S. PlanBench: an extensible benchmark for evaluating large language models on planning and reasoning about change. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 38975–38987 (Curran Associates, Inc., 2023).
Valmeekam, K., Marquez, M., Sreedharan, S. & Kambhampati, S. On the planning abilities of large language models—a critical investigation. Adv. Neural Inf. Process. Syst. 36, 75993–76005 (2023).
Skarlinski, M. D. et al. Language agents achieve superhuman synthesis of scientific knowledge. Preprint at (2024).
Si, C., Yang, D. & Hashimoto, T. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. Preprint at (2024).
HasAnyone (FutureHouse, 2024).
Zhou, Y., Liu, H., Srivastava, T., Mei, H. & Tan, C. Hypothesis generation with large language models. Preprint at (2024).
Wellawatte, G. P. & Schwaller, P. Extracting human interpretable structure–property relationships in chemistry using XAI and large language models. Preprint at (2023).
Learning to reason with LLMs (OpenAI, 2024).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 24824–24837 (Curran Associates, Inc., 2022).
Muralidharan, S. et al. Compact language models via pruning and knowledge distillation. In Advances in Neural Information Processing Systems (eds Globerson, A. et al.) Vol. 37, 41076–41102 (Curran Associates, Inc., 2024).
Sreenivas, S. T. et al. LLM pruning and distillation in practice: the Minitron approach. Preprint at (2024).
Rai, D., Zhou, Y., Feng, S., Saparov, A. & Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. Preprint at (2024).
Zhou, Z., Li, X. & Zare, R. N. Optimizing chemical reactions with deep reinforcement learning. ACS Cent. Sci. 3, 1337–1344 (2017).
Bowden, G. D., Pichler, B. J. & Maurer, A. A design of experiments (DoE) approach accelerates the optimization of copper-mediated 18F-fluorination reactions of arylstannanes. Sci. Rep. 9, 11370 (2019).
Cao, B. et al. How to optimize materials and devices via design of experiments and machine learning: demonstration using organic photovoltaics. ACS Nano 12, 7434–7444 (2018).
Reis, M. et al. Machine-learning-guided discovery of 19F MRI agents enabled by automated copolymer synthesis. J. Am. Chem. Soc. 143, 17677–17689 (2021).
Mahjour, B., Hoffstadt, J. & Cernak, T. Designing chemical reaction arrays using Phactor and ChatGPT. Org. Process Res. Dev. 27, 1510–1516 (2023).
Přichystal, J., Schug, K. A., Lemr, K., Novák, J. & Havlíček, V. Structural analysis of natural products. Anal. Chem. 88, 10338–10346 (2016).
Nature submission guidelines (Nature Medicine, 2025).
Yang, E. et al. Model merging in LLMs, MLLMs, and beyond: methods, theories, applications and opportunities. Preprint at (2024).
Christensen, M. et al. Automation isn’t automatic. Chem. Sci. 12, 15473–15490 (2021).
Arnold, C. Cloud labs: where robots do the research. Nature 606, 612–613 (2022).
Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 21558–21572 (Curran Associates, Inc., 2023).
O’Donoghue, O. et al. BioPlanner: automatic evaluation of LLMs on protocol planning in biology. Preprint at (2023).
Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at (2023).
Taylor, R. et al. Galactica: a large language model for science. Preprint at (2022).
Zheng, Z., Zhang, O., Borgs, C., Chayes, J. T. & Yaghi, O. M. ChatGPT Chemistry Assistant for text mining and prediction of MOF synthesis. J. Am. Chem. Soc. 145, 18048–18062 (2023).
Perplexity AI. www.perplexity.ai (2022).
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Comput. Mater. 6, 138 (2020).
Mateiu, P. & Groza, A. Ontology engineering with large language models. Preprint at (2023).
Babaei Giglou, H., D’Souza, J. & Auer, S. LLMs4OL: large language models for ontology learning. In The Semantic Web—ISWC 2023 (eds Payne, T. R. et al.) 408–427 (Springer Nature, 2023).
Ciatto, G., Agiollo, A., Magnini, M. & Omicini, A. Large language models as oracles for instantiating ontologies with domain-specific knowledge. Knowl.-Based Syst. 310, 112940 (2025).
Ye, Y. et al. Construction and application of materials knowledge graph in multidisciplinary materials science via large language model. In Advances in Neural Information Processing Systems (eds Globerson, A. et al.) Vol. 37, 56878–56897 (Curran Associates, Inc., 2024).
Pascazio, L. et al. Chemical species ontology for data integration and knowledge discovery. J. Chem. Inf. Model. 63, 6569–6586 (2023).
Gontier, N., Rodriguez, P., Laradji, I., Vazquez, D. & Pal, C. Language decision transformers with exponential tilt for interactive text environments. Preprint at (2023).
Wu, Y.-H., Wang, X. & Hamaya, M. Elastic decision transformer. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) Vol. 36, 18532–18550 (Curran Associates, Inc., 2023).
Xi, Z. et al. The rise and potential of large language model based agents: a survey. Sci. China Inf. Sci. 68, 121101 (2025).
DeepSeek-AI et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at (2025).
Wang, X. et al. Executable code actions elicit better LLM agents. Preprint at (2024).
Zhang, B. et al. Benchmarking the text-to-SQL capability of large language models: a comprehensive evaluation. Preprint at (2024).
Cheng, G. et al. Empowering large language models on robotic manipulation with affordance prompting. Preprint at (2024).
Reaxys (Elsevier, 2009).
SciFinder (CAS, 1995).
Swain, M. C. & Cole, J. M. ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
CVE-2023-36258 Detail. National Vulnerability Database (NIST, accessed 11 June 2024); https://nvd.nist.gov/vuln/detail/CVE-2023-36258
Ruan, Y. et al. Accelerated end-to-end chemical synthesis development with large language models. Preprint at (2024).
Tom, G. et al. Self-driving laboratories for chemistry and materials science. Chem. Rev. 124, 9633–9732 (2024).
Software Business: 14th International Conference, ICSOB 2023, Lahti, Finland, November 27–29, 2023, Proceedings (Springer Nature, 2024).
Favato, D., Ishitani, D., Oliveira, J. & Figueiredo, E. Linus’s law: more eyes fewer flaws in open source projects. In Proc. XVIII Brazilian Symposium on Software Quality 69–78 (ACM, 2019).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. Preprint at (2023).
Huang, Q., Vora, J., Liang, P. & Leskovec, J. MLAgentBench: evaluating language agents on machine learning experimentation. Preprint at (2023).
Liu, X. et al. AgentBench: evaluating LLMs as agents. Preprint at (2023).
Hasselgren, C. & Oprea, T. I. Artificial intelligence for drug discovery: are we there yet? Annu. Rev. Pharmacol. Toxicol. 64, 527–550 (2024).
Bordukova, M., Makarov, N., Rodriguez-Esteban, R., Schmich, F. & Menden, M. P. Generative artificial intelligence empowers digital twins in drug discovery and clinical trials. Expert Opin. Drug Discov. 19, 33–42 (2024).
AI’s potential to accelerate drug discovery needs a reality check. Nature 622, 217 (2023).
Zhang, Y. et al. Siren’s song in the AI ocean: a survey on hallucination in large language models. Preprint at (2023).
Li, J., Cheng, X., Zhao, X., Nie, J.-Y. & Wen, J.-R. HaluEval: a large-scale hallucination evaluation benchmark for large language models. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 6449–6464 (Association for Computational Linguistics, 2023).
Dhuliawala, S. et al. Chain-of-Verification reduces hallucination in large language models. Preprint at (2023).
Tonmoy, S. M. T. I. et al. A comprehensive survey of hallucination mitigation techniques in large language models. Preprint at (2024).
Liu, H. et al. A survey on hallucination in large vision-language models. Preprint at (2024).
Guan, X. et al. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proc. 38th AAAI Conference on Artificial Intelligence (eds Wooldridge, M. et al.) 18126–18134 (AAAI, 2024).
Xu, Z., Jain, S. & Kankanhalli, M. Hallucination is inevitable: an innate limitation of large language models. Preprint at (2024).
Li, J. et al. The dawn after the dark: an empirical study on factuality hallucination in large language models. Preprint at (2024).
Luo, J., Xiao, C. & Ma, F. Zero-resource hallucination prevention for large language models. Preprint at (2023).
Zhang, D. et al. ChemLLM: a chemical large language model. Preprint at (2024).
Yasunaga, M., Ren, H., Bosselut, A., Liang, P. & Leskovec, J. QA-GNN: reasoning with language models and knowledge graphs for question answering. Preprint at (2021).
Lu, L. et al. Physics-informed neural networks with hard constraints for inverse design. SIAM J. Sci. Comput. 43, B1105–B1132 (2021).
Han, S. et al. LLM multi-agent systems: challenges and open problems. Preprint at (2024).
Darvish, K. et al. ORGANA: a robotic assistant for automated chemistry experimentation and characterization. Matter 8, 101897 (2025).
Formica, M. et al. Catalytic enantioselective nucleophilic desymmetrization of phosphonate esters. Nat. Chem. 15, 714–721 (2023).
Ramos, M. C., Collison, C. J. & White, A. D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 16, 2514–2572 (2025).
Bran, A. M. & Schwaller, P. in Drug Development Supported by Informatics (eds Satoh, H. et al.) 143–163 (Springer Nature, 2024).
Pei, Q. et al. BioT5: enriching cross-modal integration in biology with chemical knowledge and natural language associations. Preprint at (2024).
Fang, J. et al. MolTC: towards molecular relational modeling in language models. Preprint at (2024).
Schuhmann, C. et al. LAION-5B: an open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 25278–25294 (Curran Associates, Inc., 2022).
Huang, J., Shao, H. & Chang, K. C.-C. Are large pre-trained language models leaking your personal information? Preprint at (2022).
Wahle, J. P., Ruas, T., Kirstein, F. & Gipp, B. How large language models are transforming machine-paraphrase plagiarism. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing 952–963 (Association for Computational Linguistics, 2022).
Karamolegkou, A., Li, J., Zhou, L. & Søgaard, A. Copyright violations and large language models. Preprint at (2023).
McDonald, J. et al. Great power, great responsibility: recommendations for reducing energy for training language models. In Findings of the Association for Computational Linguistics: NAACL 2022 1962–1970 (Association for Computational Linguistics, 2022).
Samsi, S. et al. From words to watts: benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) 1–9 (IEEE, 2023).
Patterson, D. et al. Carbon emissions and large neural network training. Preprint at (2021).
Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023).
The United States Artificial Intelligence Safety Institute: Vision, Mission, and Strategic Goals (NIST, 2024).
Canadian Artificial Intelligence Safety Institute (Government of Canada, accessed 30 January 2025); https://ised-isde.canada.ca/site/ised/en/canadian-artificial-intelligence-safety-institute
AISI Research (AI Security Institute, accessed 30 January 2025); https://www.aisi.gov.uk/category/research
Lee, S. & Manthiram, A. Can cobalt be eliminated from lithium-ion batteries? ACS Energy Lett. 7, 3058–3063 (2022).
Chung, C. et al. Decarbonizing the chemical industry: a systematic review of sociotechnical systems, technological innovations, and policy options. Energy Res. Soc. Sci. 96, 102955 (2023).
Xia, R., Overa, S. & Jiao, F. Emerging electrochemical processes to decarbonize the chemical industry. JACS Au 2, 1054–1070 (2022).
Amodei, D. Machines of Loving Grace (2024).
McNally, A., Prier, C. K. & MacMillan, D. W. C. Discovery of an α-amino C–H arylation reaction using the strategy of accelerated serendipity. Science 334, 1114–1117 (2011).
Ban, T. A. The role of serendipity in drug discovery. Dialogues Clin. Neurosci. 8, 335–344 (2006).