Probing the limitations of multimodal language models for chemistry and materials research

The MaCBench framework

Our benchmark design is guided by the observation that scientific work requires not only access to multiple modalities of information but also the ability to flexibly integrate them. To probe these capabilities of VLLMs meaningfully, rather than through artificial question–answer challenges, we focus on tasks that mirror real scientific workflows, from interpreting scientific literature to evaluating laboratory conditions and analyzing experimental data (see Fig. 1). This approach allows us to evaluate both the models’ ability to process different types of information and their capacity to use this information to support scientific discovery. To assess performance in a broad range of settings, we rely both on images mined from patents and on images we generated from scratch.

The benchmark is structured around three pillars that form the basis of many scientific workflows: information extraction, in silico or laboratory experiments, and data interpretation. Within each pillar, we include tasks spanning various scientific activities (see Fig. 2). The information extraction pillar assesses performance in parsing scientific literature, including extracting data from tables and plots, and interpreting chemical structures. The experiment execution pillar evaluates the models’ ability to identify laboratory equipment, assess laboratory safety conditions and understand crystal structures (as potential simulation artifacts). The data interpretation pillar tests the models’ capability to analyze various types of scientific data, from spectral analysis to electronic structure interpretation.

**Fig. 2: Distribution of tasks in the MaCBench dataset.**

Here, a task refers to a single prompt template containing multiple questions. A task can either be a multiple-choice question (MCQ) or a numeric-answer question. The current corpus has 779 MCQs and 374 numeric-answer questions. A topic is a collection of tasks on the same subject, and one topic can contain different types of task. For example, X-ray diffraction (XRD) includes tasks on identifying peak positions as well as a separate set of tasks on ordering peak positions in ascending or descending order. The three overarching focus areas are data extraction, data interpretation and experiments, each encompassing multiple topics.
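The two question types call for different scoring rules. The following is a minimal sketch, not the benchmark's actual scoring code: the `Question` class, the function names and the 1% relative tolerance for numeric answers are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass
class Question:
    kind: str                  # "mcq" or "numeric"
    target: Union[str, float]  # option letter, or reference value


def is_correct(question, answer, rel_tol=0.01):
    """Score one answer. rel_tol is an assumed tolerance, not taken from the source."""
    if question.kind == "mcq":
        # Compare option letters, ignoring case and surrounding whitespace.
        return str(answer).strip().upper() == str(question.target).strip().upper()
    if question.kind == "numeric":
        try:
            value = float(answer)
        except (TypeError, ValueError):
            return False  # unparseable answers count as wrong
        return math.isclose(value, float(question.target), rel_tol=rel_tol, abs_tol=1e-9)
    raise ValueError(f"unknown question kind: {question.kind!r}")


def accuracy(questions, answers):
    """Fraction of questions answered correctly."""
    scored = [is_correct(q, a) for q, a in zip(questions, answers)]
    return sum(scored) / len(scored)


# Hypothetical questions and model answers, for illustration only.
questions = [Question("mcq", "B"), Question("numeric", 2.0), Question("numeric", 31.5)]
answers = [" b ", "2.01", "not sure"]
score = accuracy(questions, answers)  # two of the three answers are accepted
```

Under this sketch, a numeric answer is accepted only if it parses as a number and falls within the chosen tolerance, so the tolerance directly affects reported accuracy.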

Performance landscape

There is considerable variation in model performance across different task types and modalities (Fig. 3; see Supplementary Table 1 for detailed descriptions of all tasks); however, when averaged over different tasks, Claude 3.5 Sonnet is the leading model on all three task families. Notably, the models do not fail at one specific part of the scientific process but struggle in all of them, suggesting that broader automation is not hindered by one bottleneck but requires advances on multiple fronts. Interestingly, even for a foundational pillar of the scientific process—that is, data extraction—some models do not perform much better than random guessing (for instance, Llama 3.2 90B Vision in Fig. 3). Current systems tend to perform best on MCQ-based perception tasks (for example, laboratory equipment and hand-drawn molecules in Fig. 3).

**Fig. 3: Performance of frontier VLLMs.**

Data extraction

Our analysis shows that the first step of the scientific workflow, that is, data extraction, already poses considerable challenges for the models we tested. This is particularly the case when extracting science-specific data, for instance, on organic reactions and molecules. Although the best models perform well at extracting information from reaction diagrams, they fail to correctly describe the relationship between isomers (see Supplementary Fig. 4). As discussed below, this is probably caused by models struggling with spatial reasoning. Furthermore, even the extraction of compositions from tables leaves room for improvement (average accuracy of 0.53 across the VLLMs we tested), with Llama 3.2 90B Vision performing indistinguishably from random guessing.

In silico and laboratory experiments

A similar variance in performance is observed for tasks related to the execution of laboratory or in silico experiments. Although models show good performance at recognizing laboratory equipment (average accuracy of 0.77), they perform poorly when reasoning about laboratory scenarios, for example, when comparing the safety hazards of two similar laboratory set-ups (average accuracy of 0.46).

The disparity between equipment identification and safety assessment performance suggests that although models can learn to recognize standard laboratory equipment, they still struggle with the more complex reasoning required for safe laboratory operations, which calls into question their ability to assist in real-world experiment planning and execution. This finding also implies that current models cannot bridge gaps in tacit knowledge frequently discussed in biosafety scenarios⁴⁴,⁴⁵.

Furthermore, in the interpretation of crystal structure renderings (a crucial step for in silico experiments), performance is indistinguishable from random guessing in some cases, for example, in the assignment of space groups (see Supplementary Fig. 3).

Data interpretation

Interpreting experimental results often proves challenging for all models, including Claude 3.5 Sonnet. Although most models can interpret capacity values (average accuracy of 0.59), compare Henry constants from metal–organic framework isotherms (average accuracy of 0.83), or interpret amorphous or crystalline systems from XRD with acceptable performance (average accuracy of 0.69), they struggle to interpret AFM images (average accuracy of 0.24) and often fail with tasks that involve measurements such as width and length (despite the presence of clear legends). They also fail to reliably interpret mass spectrometry and nuclear magnetic resonance spectra (average accuracy of 0.35), or to make inferences on the XRD pattern. In the XRD case, it is particularly striking that although some models perform very well at identifying the positions of the most intense reflections, they perform poorly in determining relative orderings, which is crucial for interpreting XRD patterns.

Understanding model limitations

We designed a comprehensive suite of ablation studies to further understand the failure modes of VLLMs. Our approach isolates specific aspects of scientific tasks, from the complexity of the reasoning required, to how the information is presented. We probe two distinct categories of limitations (Fig. 4): first, core reasoning limitations that seem fundamental to current model architectures or training approaches or datasets, and second, sensitivities to inference choices.

**Fig. 4: Ablation study results across four key dimensions of VLLMs performance in chemistry and materials science tasks.**

Core reasoning limitations

Some limitations seem to be intrinsic to current model architectures and are unlikely to be overcome regardless of how tasks are presented or prompted. These fundamental constraints manifest in three key areas.

Spatial reasoning

Although one might expect VLLMs to excel at processing spatial information, our results reveal substantial limitations in this capability. For example, although models achieve high performance in matching hand-drawn molecules to simplified molecular input line-entry system (SMILES) strings (average accuracy of 0.80, four times better than baseline), they perform almost indistinguishably from random guessing at naming the isomeric relationship between two compounds (for example, enantiomer or regioisomer; average accuracy of 0.24, which is only 0.1 higher than the baseline accuracy) and when assigning stereochemistry (average accuracy of 0.24, baseline of 0.22). Similarly, models perform well in simple perception tasks on crystal structures (for instance, when counting the number of different species, average accuracy of 0.85) but struggle at assigning the crystal system (average accuracy of 0.55) or space groups (average accuracy of 0.45).

These performance drops for tasks requiring spatial reasoning suggest that current VLLMs cannot reliably be used for any tasks requiring this capability—even though this might be one of the most intuitive use cases of these models.
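The comparisons with baseline quoted above can be made explicit with a little arithmetic. In the sketch below, the baselines of 0.20 and 0.14 are inferred from the wording ("four times better than baseline" and "0.1 higher than the baseline"), and the chance-corrected score is an illustrative normalization, not a metric reported in the source.

```python
def lift_over_baseline(accuracy, baseline):
    """Absolute improvement of a model over the baseline accuracy."""
    return accuracy - baseline


def chance_corrected(accuracy, baseline):
    """Rescale accuracy so that baseline maps to 0 and a perfect score to 1.

    This normalization is an assumption added for illustration.
    """
    return (accuracy - baseline) / (1.0 - baseline)


# (average accuracy, baseline); the first two baselines are inferred, see above.
tasks = {
    "hand-drawn molecule to SMILES": (0.80, 0.20),
    "isomeric relationship": (0.24, 0.14),
    "stereochemistry": (0.24, 0.22),
}

summary = {
    name: (round(lift_over_baseline(acc, base), 2), round(chance_corrected(acc, base), 2))
    for name, (acc, base) in tasks.items()
}

for name, (lift, corrected) in summary.items():
    print(f"{name}: lift={lift:.2f}, chance-corrected={corrected:.2f}")
```

Viewed this way, the two stereochemistry-related tasks recover only a small fraction of the headroom above baseline, whereas the SMILES matching task recovers most of it.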

Synthesis across modalities

Given that models consume visual and textual input in seemingly similar ways, one might expect that the same information is processed in the same way regardless of how it is presented to the model.

We presented identical text and image information to probe the ability of models to integrate information across modalities. In Fig. 4 we find that for all tasks in which we show the same information, the performance in the text modality is better than when the information is provided as an image. A striking example emerges when identifying the peak position in XRD. Models show a nearly 35% increase in performance when presented with the same peak positions as text versus showing the peaks visually. Even when calculating the volume of crystal structures, models differ in performance by four percentage points when presented with the structural information in visual (unit cell parameters shown in the image) and textual (unit cell parameters shown in text) forms. These results suggest that current models have not yet developed robust strategies for cross-modal information synthesis.
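When comparing modalities, it helps to keep relative changes and percentage-point differences apart, because the two can look very different for the same pair of accuracies. The sketch below uses hypothetical accuracies (0.54 for text, 0.40 for image) chosen only to illustrate the arithmetic; they are not values from the benchmark, and the function name is an assumption.

```python
def modality_gap(text_accuracy, image_accuracy):
    """Return the text-over-image gap as (percentage points, relative % increase)."""
    difference = text_accuracy - image_accuracy
    points = difference * 100
    relative = difference / image_accuracy * 100
    return points, relative


# Hypothetical accuracies, for illustration only.
points, relative = modality_gap(text_accuracy=0.54, image_accuracy=0.40)
print(f"{points:.0f} percentage points, {relative:.0f}% relative increase")
```

With these made-up numbers, a gap of 14 percentage points corresponds to a relative increase of 35%.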

Multi-step reasoning

Motivated by the observation that models tended to perform best on perception tasks, we designed experiments that probe, with the same inputs⁴⁶, performance on very similar tasks with varying numbers of reasoning steps (or different numbers of tool calls when implemented in an agentic framework).

Our analysis reveals consistent degradation in performance as tasks require more reasoning steps. Figure 4 shows that in all of our experiments, the tasks that require multiple steps perform substantially worse than those requiring only one. For instance, in XRD pattern analysis, models perform much better at identifying the highest peak than at ranking relative peak intensities (average accuracy of 0.74 for identification of the highest peak versus 0.28 for ranking). Similarly, for the interpretation of adsorption isotherms, the accuracy in finding the highest value notably exceeds the performance of ordering multiple values. This pattern suggests fundamental limitations in chaining logical steps—a crucial capability for scientific reasoning.
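The difference between the two XRD tasks can be illustrated on a toy pattern. The peak list below is made up, and the function names and the exact-match criterion for rankings are assumptions; the point is only that identifying the highest peak needs a single selection, whereas a ranking is correct only if every peak is placed in the right position.

```python
# Made-up pattern: (2-theta position in degrees, relative intensity).
pattern = [(28.4, 100.0), (47.3, 55.0), (56.1, 30.0), (69.1, 6.0), (76.4, 11.0)]


def highest_peak(peaks):
    """Single-step task: position of the most intense reflection."""
    return max(peaks, key=lambda peak: peak[1])[0]


def rank_by_intensity(peaks):
    """Multi-step task: positions ordered from most to least intense."""
    ordered = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    return [position for position, _ in ordered]


def ranking_is_correct(predicted, peaks):
    """Exact-match scoring: one misplaced peak makes the whole answer wrong."""
    return list(predicted) == rank_by_intensity(peaks)


print(highest_peak(pattern))
print(rank_by_intensity(pattern))
# Swapping the two weakest peaks is enough to fail the ranking task.
print(ranking_is_correct([28.4, 47.3, 56.1, 69.1, 76.4], pattern))
```

Under exact-match scoring, an answer that gets the strongest peak right can still fail the ranking task because of a single swap among weak reflections.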

Sensitivity to inference choices

Although addressing these core limitations will require novel training approaches, we also identified several factors that substantially influence model performance through inference choices rather than fundamental capabilities. Those factors present an actionable way to improve the performance of current systems directly without retraining them.

Scientific terminology

One might hypothesize that models struggle with some tasks because they are unfamiliar with the scientific terminology used in the questions. Figure 4 shows that removing scientific terminology improves performance across some tasks, including the analysis of adsorption isotherms of metal–organic frameworks and XRD pattern interpretation. Similarly, using International Union of Pure and Applied Chemistry (IUPAC) names instead of SMILES notation for chemical compound identification leads to better results. This suggests models might be overly sensitive to specific technical vocabularies rather than understanding underlying concepts. In fact, some models such as Gemini 1.5 Pro (and the surrounding refusal mechanisms) are very sensitive to the exact wording of the prompt. In Supplementary Section 6, we show that for some questions, large variations in performance can be due to apparently minor changes in prompt wording, such as replacing the word ‘image’ with ‘diagram’, ‘plot’, ‘figure’ or ‘photograph’, or even omitting it entirely.
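A wording ablation of this kind can be sketched as a small prompt generator. The template, the `{ref}` placeholder and the function names below are hypothetical and are not the prompts used in the benchmark; the list of nouns follows the wording changes described above.

```python
NOUNS = ["image", "diagram", "plot", "figure", "photograph"]


def prompt_variants(template):
    """Build one prompt per noun, plus one with the reference omitted.

    The template must contain a {ref} placeholder (an assumed convention).
    """
    variants = {noun: template.format(ref=f" in the {noun}") for noun in NOUNS}
    variants["(omitted)"] = template.format(ref="")
    return variants


def accuracy_spread(accuracy_by_variant):
    """Largest minus smallest accuracy across prompt variants."""
    values = list(accuracy_by_variant.values())
    return max(values) - min(values)


# Hypothetical question template, for illustration only.
variants = prompt_variants("Which peak{ref} has the highest intensity?")
for label, prompt in variants.items():
    print(f"{label}: {prompt}")
```

Evaluating a model on each variant and passing the resulting accuracies to `accuracy_spread` would give a single number summarizing how sensitive that model is to the wording.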

Guidance following

Given that chemists receive instructions on interpreting various experimental characterizations, we hypothesized that similar guidance might also help the models perform better on such tasks. Interestingly, adding step-by-step instructions improves performance for most models in spectral analysis, electronic structure interpretation and XRD pattern matching—with the notable exception of Claude 3.5 Sonnet, whose performance does not improve when provided with guidance. This variation in response to instruction suggests different underlying approaches to problem solving across models.

Performance as a function of frequency on the Internet

The varying impact of guidance across models led us to investigate whether models truly engage in scientific reasoning or primarily match patterns from their training data⁴⁶. To probe this question, we measured the number of Google search results for various crystal structures as a proxy for the frequency of those structures in the training corpus (Fig. 5).

**Fig. 5: VLLM performance as a function of number of search hits.**

Our analysis reveals a correlation between the prominence of crystal structures on the Internet and task performance. Figure 5 shows that for all cases in our benchmark, the structures for which the models solve the tasks are more prominent on the Internet. This suggests that models might rely more on pattern matching than genuine scientific reasoning. Interestingly, we observe this effect even for tasks that depend solely on perception, such as counting the number of distinct atomic species.
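One simple way to look at such a relationship is to compare the typical number of search hits for structures a model solves with that for structures it does not. The sketch below uses made-up records and compares medians on a log scale; the data, the function name and the choice of the median are assumptions and do not reproduce the analysis behind Fig. 5.

```python
import math
from statistics import median

# Made-up records: (number of search hits, whether the model solved the task).
records = [
    (1_200_000, True),
    (850_000, True),
    (40_000, True),
    (95_000, False),
    (12_000, False),
    (3_000, False),
]


def median_log_hits(records, solved):
    """Median of log10(search hits) over records with the given outcome."""
    logs = [math.log10(hits) for hits, ok in records if ok is solved]
    return median(logs)


solved_median = median_log_hits(records, solved=True)
unsolved_median = median_log_hits(records, solved=False)
print(f"solved: {solved_median:.2f}, unsolved: {unsolved_median:.2f}")
```

A higher median for the solved group would be consistent with the trend described above, although a comparison like this says nothing about causation.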

Toward robust multimodal assistants

Our analysis reveals the promise and limitations of state-of-the-art VLLMs in scientific tasks. Compared with text-only benchmarks such as that of Mirza and colleagues⁴⁰, we observe much higher performance variability across tasks, suggesting that multimodal systems are more fragile than LLMs. This fragility manifests in several ways: the striking performance gap between visual and textual representations of identical information indicates incomplete integration of modalities, whereas the strong correlation between model performance and the Internet presence of specific crystal structures raises questions about true reasoning capabilities versus pattern matching. The sensitivity to prompting choices (see Supplementary Section 6) and the counterintuitive finding that guidance can degrade performance for top models further underscore reliability concerns.

However, our findings also point to actionable paths forward. Many observed limitations, particularly in spatial reasoning, could be addressed through synthetic training data generation. When pursuing such approaches, we recommend incorporating generalization tests (for example, evaluating spatial reasoning on larger compounds than those in training⁴⁷) to ensure robust capability development. Furthermore, the substantial performance differences between modalities suggest opportunities for improved training strategies, such as incorporating modality transformation tasks (for example, automated conversion between spectral data representations). These targeted interventions could help bridge the gap between current capabilities and the needs of scientific workflows. Looking forward, it is also important to note that in future workflows with advanced data management⁴⁸ or self-driving laboratories⁴⁹, some of the tested multimodal integration abilities will be less important, as data will be directly available in a machine-actionable form instead of requiring parsing from an image.
