Generative AI lacks the human creativity to achieve scientific discovery from scratch

The physicist Richard Feynman described how scientific discovery begins27: "In general, we look for a new law by the following process. First, we guess it. Then we compute the consequences of the guess to see what would be implied if this law that we guessed is right. Then we compare the result of the computation to nature, with [an] experiment or experience, compare it directly with observation, to see if it works. If it disagrees with [the] experiment, it is wrong. In that simple statement is the key to science." Does ChatGPT4 exhibit a pattern similar to what Feynman described? The answer is no. We find that ChatGPT4 does not have the curiosity that humans do. Instead, it presents a tidy, predetermined picture of how to engage in a scientific discovery process. First, ChatGPT4 believes that the first step is generating a hypothesis rather than designing an experiment. Second, all the experiments it designs are hypothesis-guided. Third, ChatGPT4 shows high confidence in each experimental outcome, and the experimental results seem to be exactly what it expected. Hence, unlike the human discovery process shown in Fig. 1, ChatGPT4's discovery process is not cyclic. Instead, it is quite simple: ChatGPT4 proposed 5 hypotheses, designed 12 experiments, explained how the proposed hypotheses were confirmed by the experimental results, and then concluded with high confidence. Consequently, ChatGPT4 is left with the overconfident illusion of having made a completely successful discovery.

In the context of hypothesis: results on the origin of hypotheses

Formulating meaningful hypotheses from observations is central to scientific discovery and even more challenging than classical statistics-based pattern finding because it requires a creative spark to ask surprising and important questions30,34.

We find that humans, driven by curiosity, often conduct several experiments first to see what the outcomes are; they then formulate a hypothesis to explain the observed phenomenon in the discovery task. For humans, the hypothesis space is unknown. As the number of experiments increases, anomalies or surprising phenomena in the experimental outcomes spark human curiosity and, consequently, the generation of various alternative hypotheses, consistent with the process shown in Fig. 1. ChatGPT4, however, takes a different approach. It treats its pretrained data as constituting a known hypothesis space. It then uses a statistical and analogical reasoning approach to formulate hypotheses based on correlations between established scientific knowledge and the content of the discovery task. Because the discovery task involves E. coli and lactose, ChatGPT4 focused on the lac operon model in E. coli. ChatGPT4 noted that this operon is a well-studied model system in molecular biology that explains how bacteria adapt their enzyme production in response to the availability of different sugars. It therefore used this foundational knowledge to perform an analogical extension, tailoring the hypotheses to the current discovery task, and proposed five hypotheses at once.

Interestingly, ChatGPT4 showed high confidence in the hypotheses it proposed and believed that its formulation method ensures that the hypotheses are scientifically plausible and directly relevant to the discovery task. We find that its hypothesis generation process resembles searching a known hypothesis space formed from existing published works by humans and selecting the best hypotheses through statistical calculations, rather than a creative process guided by curiosity induced by phenomena or experimental results. Consequently, ChatGPT4 spent less time and proposed fewer hypotheses (5) and fewer types of relevant alternative hypotheses (2) than humans did (14 hypotheses and 7.78 types), indicating a smaller breadth of the searched hypothesis space. Hence, the quality of the hypotheses proposed by ChatGPT4 is also lower than that of the humans' hypotheses. In fact, only two (H1 on the I gene and H5 on the O gene) of the five hypotheses proposed by ChatGPT4 are relevant and critical to the discovery results (see ChatGPT4's response to prompt 9 in Part A of the Supplementary Information).
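To make this contrast concrete, the following minimal Python sketch (our own illustration, not ChatGPT4's actual mechanism) operationalizes hypothesis formulation as retrieval from a known hypothesis space: candidate hypotheses drawn from prior literature are scored against the task description and the top-scoring ones are selected, so no hypothesis outside the library can ever be produced. The library entries and the crude word-overlap score are illustrative assumptions.

```python
# A minimal sketch, assuming a toy hypothesis library and a crude word-overlap
# relevance score: hypothesis formulation as retrieval and ranking within a
# known hypothesis space, rather than creation of new hypotheses.

TASK = "how the I P and O genes control beta-gal production in E. coli with lactose"

# Illustrative candidate hypotheses "retrieved" from prior literature.
HYPOTHESIS_LIBRARY = [
    "the I gene encodes a repressor that inhibits beta-gal production",
    "the O gene is an operator site where the repressor binds",
    "the P gene is a promoter required for transcription",
    "lactose acts as an inducer that inactivates the repressor",
]

def overlap_score(hypothesis: str, task: str) -> int:
    """Crude relevance score: number of words shared with the task description."""
    return len(set(hypothesis.lower().split()) & set(task.lower().split()))

# Rank and select: only hypotheses already in the library can ever be proposed.
ranked = sorted(HYPOTHESIS_LIBRARY, key=lambda h: overlap_score(h, TASK), reverse=True)
print(ranked[:3])
```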

In the context of justification: results on experimental design

Evaluating scientific hypotheses through experimentation is critical to scientific discovery. As summarized in Table 1, we find that, on average, humans conduct more experiments (13.89) than ChatGPT4 (12). Additionally, humans propose experiments sequentially, whereas ChatGPT4 proposes all 12 experiments at once. Because humans face an unknown hypothesis space, their experimental space is also unknown. Hence, humans tend to search the experimental space more broadly (11.44) than ChatGPT4 (8) does and design various experiments to determine whether different unseen phenomena can occur to test these hypotheses. Consequently, humans develop more relevant alternative hypotheses (7.78) than ChatGPT4 (2). In contrast, ChatGPT4, operating within its known hypothesis space, seems to know exactly which experiments to design. In fact, we find that ChatGPT4 does not fully understand the discovery task and shifts its focus from finding how genes control others to understanding the role of lactose in E. coli regulation. Hence, it pays more attention to the well-established lac operon model in the molecular biology literature; moreover, the experiments it designs are extracted from the literature and adapted to the current discovery context. Consequently, it conducts a goal-guided search in the experimental space, proposes all experiments simultaneously, and designs experiments that are simple. Hence, to some extent, ChatGPT4 is more confident in its experimental selections than humans are. ChatGPT4 conducts only one round of 12 experiments and does not revise any hypotheses or propose and conduct new experiments. This finding indicates that the experimental space is also known to ChatGPT4. Thus, given ChatGPT4's fast processing speed and high search capability, we find that ChatGPT4 searches the experimental space more efficiently than humans do, with fewer experiments suggested (12 vs. 13.89). However, the effectiveness and quality of its search in the experimental space are lower than those of humans because of the smaller breadth of dimensions searched (8 vs. 11.44), the lower percentage of genes searched (30% vs. 41.67%), and the lower percentage of the amount of lactose searched (33.33% vs. 46.30%).

In determining whether the conducted experiments are key to identifying the mechanism in the discovery task, we find that humans tend to repeatedly perform logical reasoning and compare the results of several adjacent experiments to verify whether they support a proposed hypothesis before deciding which experiments are key. However, ChatGPT4 shows more confidence in its experiments and considers only haploid experiments to be key experiments. Therefore, we find that ChatGPT4 proposes and conducts considerably fewer key experiments (only 4 key experiments, 33.33% of all proposed experiments) than humans do (9.67 key experiments, 86.67% of all proposed experiments), while the opposite is true for non-key experiments. For example, the P gene in our discovery task plays no role at all; however, ChatGPT4 hypothesizes that the P gene is a promoter because the P gene in the lac operon model from the literature plays an activating role. Although the experimental results do not support its claim about the P gene, ChatGPT4 neither revises its hypothesis nor designs new experiments to further verify it. Notably, humans correctly find that the P gene plays no role. This difference indicates that ChatGPT4 is less creative and effective at identifying key experiments because it proposes fewer relevant hypotheses at the beginning of the process than humans do. Moreover, unlike humans, ChatGPT4 cannot find the types of key diploid experiments and can identify only two types of key haploid experiments, i.e., those involving I or O gene mutations (but not P gene mutations).

In the context of justification: experimental results interpretation

In the process of scientific discovery, scientists often arrive at explanations after observing phenomena related to the targeted problem35,36. Whether ChatGPT4 can draw valuable insights from an observed experimental result is therefore crucial for successful discovery. We find that, compared with humans, ChatGPT4 shows a much higher frequency of summarizing data (26 vs. 6.02) and of providing justifications using multiple (e.g., two) experiments (6 vs. 2.43). However, it shows a lower frequency of proposing alternative hypotheses, planning new experiments, and making predictions. For example, when hypotheses H3 and H4 are not supported, ChatGPT4 does not consider designing new experiments to further verify them. Instead, in its final conclusions, it claims that the hypotheses are consistent with the existing lac operon model and hence demonstrate how genetic mutations in these regulatory elements affect gene expression (see ChatGPT4's responses to prompts 14, 15 and 17 in Part A of the Supplementary Information). This behavior indicates that ChatGPT4 has a higher capability for information retrieval and processing but less creativity than humans.

Furthermore, many studies have shown that recognizing anomalies is crucial for successful discovery37,38. Scientists have long paid attention to unexpected phenomena or anomalies (e.g., how Fleming discovered penicillin), from which they identified problems and formulated theories to solve those problems and explain the phenomena39,40,41,42,43,44,45. For ChatGPT4 to develop this kind of attention or realization, it must build associative links between key experiments.

Based on ChatGPT4's answers to the prompts (see prompts 14, 18 and 19 in Part A of the Supplementary Information), we find that ChatGPT4 explains each experiment it proposes and clearly states which two experiments can be compared and how it reached its conclusions. However, there is no "aha" moment for ChatGPT4 because all the experimental results are expected and no anomalies are detected. More interestingly, even when the experimental results do not support some hypotheses, ChatGPT4 still shows high confidence in the proposed hypotheses and does not plan to revise them. Although ChatGPT4 clearly knows the correct procedure for making a scientific discovery as well as the steps for verifying hypotheses, it does not follow this process to revise any of the hypotheses, propose alternative hypotheses, or plan new experiments, indicating that it is stubborn and does not accept the new evidence presented in the experimental results. In contrast, humans experience an "aha" moment when observing a surprising experimental outcome after trying various combinations of experiments. Therefore, humans can break the shackles of existing knowledge given new information from experimental results, revise hypotheses, and design new experiments to further verify the new hypotheses.

Causes of differences in discovery between GenAI and humans

Both ChatGPT4's and humans' hypotheses about the mechanisms in the discovery task are rated on a 5-point scale. As shown in Table 1, the discovery scores are 1 for ChatGPT4 and 1.67 for humans, indicating lower overall performance for ChatGPT4. With respect to the key results of the discovery task, ChatGPT4 correctly finds only that the I gene is an inhibitor (repressor); it concludes that the O gene merely serves as the binding site for the repressor (rather than being an inhibitor itself) and incorrectly concludes that the P gene is a promoter, even though the P gene plays no role in gene regulation in this task. Furthermore, ChatGPT4 is unable to reveal the most important mechanism, namely, that the I gene is a chemical inhibitor and the O gene is a physical inhibitor of β-gal production. In contrast, on average, human subjects are more successful, with some of them correctly discovering that the P gene plays no role, that the I gene is a chemical inhibitor, and that the O gene is a physical inhibitor of β-gal production.

Furthermore, we find that humans outperform ChatGPT4 in the scientific discovery process with a higher quantity and quality of proposed hypotheses, more effective searches in the experimental space with more overall experiments and more key experiments conducted, more alternative hypotheses and new experiments designed, and more aha moments during the experiments. In contrast, ChatGPT4 demonstrates a higher speed in finishing the task and a greater ability to process information or analyze data with high confidence in its hypotheses, experimental results, and conclusions.

What are the reasons for these differences between ChatGPT4 and humans? We believe that these differences can be explained by the different capabilities and barriers of GenAI and humans with respect to scientific discovery. Specifically, humans face two barriers in the scientific discovery process: cognitive limitations, including the tendency to search in familiar knowledge domains, and limited information processing capability. However, humans have the advantages of curiosity and imagination. Despite facing both an unknown hypothesis space and an unknown experimental space, human subjects are curious about unknown phenomena and display imagination, which allow them to break free of the constraints of existing knowledge and engage in divergent thinking. That is, it is human curiosity and imagination that make it easier for humans than for GenAI to overcome the barriers to creating different hypotheses in scientific discovery. In other words, human beings can create something from scratch; that is, they can make fundamental discoveries.

The advantages and barriers of GenAI such as ChatGPT4 are exactly the opposite of those of humans. ChatGPT4 uses existing knowledge provided by humans (i.e., large amounts of training data) as its known hypothesis space and experimental space, so it can overcome the cognitive limitations of individual humans and has a superfast information processing speed. Because its hypotheses are generated from a known hypothesis space, which is formed from existing published works by humans, the efficiency and speed of its discovery are better than those of humans. However, for an unknown hypothesis space (i.e., one not in its pretraining data or a knowledge field unknown to humans) or an unknown experimental space, it is less creative than humans, and it cannot create completely new hypotheses or theories. For example, in our discovery task, ChatGPT4 successfully discovers the inhibitory role of the I gene, but it does not identify the lack of involvement of the P gene, the chemical mechanism of the I gene, or the physical mechanism of the O gene. Therefore, current GenAI can make only incremental discoveries and cannot achieve fundamental discoveries from scratch. The major reason is that ChatGPT4 does not possess curiosity and imagination and cannot escape the boundaries of its known hypothesis and experimental spaces to make truly fundamental discoveries. Specifically, current GenAI systems rely on pretrained large language models, and the breadth and scale of these learned models, coupled with the wide array of facts, concepts, and ideas that the system can access, far exceed what any single human could read or remember in a lifetime. Therefore, an anomaly for humans may not be unknown to GenAI systems such as ChatGPT4, and hence ChatGPT4 may not treat it as a source of discovery or an "aha" moment. Consequently, the threshold for detecting an anomaly is much higher for GenAI systems like ChatGPT4 than it is for humans.
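One way to make this point concrete is to model "surprise" as the negative log-probability of an observation under an agent's prior knowledge, with an anomaly flagged only when the surprise exceeds a threshold. The Python sketch below is a simplified illustration under that assumption (the probabilities and threshold are invented for the example): a system whose broad pretraining assigns high prior probability to an outcome never registers it as an anomaly and hence never experiences an "aha" moment.

```python
import math

# A minimal sketch, assuming anomaly detection can be modeled as thresholded
# surprisal (negative log-probability) under an agent's prior knowledge.
# All numbers below are illustrative, not measured values from the study.

def surprisal_bits(prior_probability: float) -> float:
    """Shannon surprisal of an observation, in bits."""
    return -math.log2(prior_probability)

ANOMALY_THRESHOLD = 5.0  # hypothetical threshold (bits) above which an outcome feels anomalous

# The same experimental outcome, assigned different prior probabilities:
priors = {
    "human (outcome not covered by personal knowledge)": 0.01,
    "GenAI (outcome well covered by pretraining data)": 0.40,
}

for agent, p in priors.items():
    s = surprisal_bits(p)
    verdict = "anomaly -> potential 'aha' moment" if s > ANOMALY_THRESHOLD else "expected, no anomaly"
    print(f"{agent}: surprisal = {s:.2f} bits ({verdict})")
```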

How and what types of scientific discovery current GenAI can make

With respect to whether and how GenAI such as ChatGPT4 can make a scientific discovery, we find that, unlike humans, GenAI can make only limited original discoveries. The discovery process, the "how" of GenAI, is entirely different from that of humans, given their very different capabilities and barriers.

What types of scientific discoveries can current GenAI make? Historically, scientific discoveries were made only by humans, and the cognitive process of how a new idea is created has been a persistent research question. The academic community generally believes that human curiosity, inspiration and creativity/imagination make new ideas and scientific laws possible. Thus, if we use GenAI to make scientific discoveries, we should find a way to realize the “curiosity”, “inspiration”, or “creativity/imagination” functions in GenAI such that it can exhibit the psychology required to make human scientific discoveries30.

At present, machine intelligence is achieved through computation. Therefore, based on current technologies, Table 3 displays the scientific discovery tasks that current GenAI systems can perform. First, the "curiosity", "inspiration", or "creativity/imagination" functions in GenAI must be realized through computable operations. Thus, a scientific discovery task must first be representable in a digital or symbolic format, without losing its inherent meaning, so that it can be accepted and processed by the machine. We denote this requirement "computable representation." As shown in Table 3, for the known world, there are four types of representations that can be used to describe a discovery task. For the unknown world, current AI systems are unable to make successful discoveries in domains outside the training dataset.
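As a concrete illustration of what a "computable representation" might look like for the gene-regulation discovery task, the Python sketch below encodes each candidate experiment as a symbolic tuple of per-gene mutation states plus a lactose level, turning the experimental space into an enumerable set that a machine can search exhaustively. The gene labels follow the task described above, but the mutation states and lactose amounts are illustrative assumptions rather than the study's exact protocol.

```python
from itertools import product

# A minimal sketch, assuming the discovery task can be encoded symbolically:
# each haploid experiment = (per-gene mutation state, lactose amount).
# Gene labels follow the task; states and lactose amounts are illustrative.

GENES = ("I", "P", "O")
STATES = ("wild-type", "mutant")
LACTOSE_AMOUNTS = (0, 100, 300)   # arbitrary units, hypothetical values

def experimental_space():
    """Enumerate every representable haploid experiment in the symbolic space."""
    for genotype in product(STATES, repeat=len(GENES)):
        for lactose in LACTOSE_AMOUNTS:
            yield dict(zip(GENES, genotype)), lactose

space = list(experimental_space())
print(f"{len(space)} representable haploid experiments")  # 2**3 * 3 = 24
print(space[0])  # ({'I': 'wild-type', 'P': 'wild-type', 'O': 'wild-type'}, 0)
```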

Table 3 What scientific discoveries current GenAI systems can make.

Next, the required domain/discipline knowledge is represented as a "searchable or computable knowledge space." Once the variables or elements in the task can be represented as symbols, numbers, vectors, network graphs, or state spaces, mathematical modeling, human problem-solving expressions, or deep learning-based graph methods can be deployed to find solutions or systematically enumerate all possible patterns. AI for synthetic organic chemistry is one such example: a large collection of known synthetic reactions is used to train AI systems that automatically extract transformation rules ("templates") from known chemical reactions. Therefore, in principle, such systems cannot suggest reactions that lie outside the existing known knowledge domain.
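The following toy Python sketch (not a real retrosynthesis engine; the "reactions" are plain-text placeholders rather than chemical structures) illustrates why such template-based systems stay within the known knowledge domain: transformation rules are memorized from known reactions and can only be re-applied, so any reactant pattern outside the template library yields no prediction at all.

```python
# A toy sketch of template-based reaction prediction: rules ("templates") are
# extracted from known reactions and can only be re-applied, never invented.
# The reaction strings below are illustrative placeholders, not real chemistry data.

KNOWN_REACTIONS = [
    ("alcohol + carboxylic acid", "ester"),
    ("alkene + H2", "alkane"),
    ("aldehyde + alcohol", "hemiacetal"),
]

# "Template extraction": memorize the mapping from reactant pattern to product.
TEMPLATES = dict(KNOWN_REACTIONS)

def predict(reactant_pattern):
    """Apply a memorized template if one matches; otherwise return None."""
    return TEMPLATES.get(reactant_pattern)

print(predict("alcohol + carboxylic acid"))  # 'ester' -> inside the known domain
print(predict("noble gas + enzyme"))         # None    -> outside the template library
```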

As current GenAI systems rely heavily on statistics and graph models that are insufficient to capture causal properties of data and the unknown world, they are not yet able to autonomously make original scientific discoveries with either an unknown conceptual space or a task that requires venturing beyond the domain knowledge space of human scientists. In contrast, GenAI performs well on scientific discovery tasks that provide either a known representation of domain knowledge in the known conceptual space or access to human scientists’ domain knowledge space. For example, in chemistry, if we use a node to represent a chemical element, then following the rules shown in Mendeleev’s Periodic Table of Elements, GenAI systems can easily generate various possible combinations of chemical elements to help discover new chemical materials.
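As a hedged sketch of this kind of combinatorial generation within a known conceptual space, the Python snippet below treats a handful of elements as nodes and enumerates pairwise combinations, keeping only those that pass a simple charge-balance filter. The element list, toy oxidation states, and filter are illustrative assumptions; real materials-discovery pipelines apply far richer rules.

```python
from itertools import combinations

# A toy sketch of generating candidate compounds by enumerating combinations
# of known elements and filtering with a simple rule. Oxidation states are
# illustrative; real systems apply much richer chemical constraints.

OXIDATION_STATE = {"Li": +1, "Na": +1, "Mg": +2, "O": -2, "Cl": -1}

def charge_balanced(pair):
    """Keep 1:1 pairs whose toy oxidation states cancel out."""
    a, b = pair
    return OXIDATION_STATE[a] + OXIDATION_STATE[b] == 0

candidates = [pair for pair in combinations(OXIDATION_STATE, 2) if charge_balanced(pair)]
print(candidates)  # [('Li', 'Cl'), ('Na', 'Cl'), ('Mg', 'O')]
```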

Note that, currently, people use silicon-based computation to enable a non-living machine to generate human-level intelligence. While powerful, this approach lacks intrinsic human curiosity and imagination. Human cognition is underpinned by fluid-based processes within neural circuits, characterized by neural plasticity, stochastic signaling, and a highly interconnected structure that supports creativity and rapid adaptation. Therefore, to address the limitations of current generative AI, we propose considering the following approaches:

(1) Neuromorphic systems with a new learning function. Currently, the "learning function" in machine learning is a statistical extraction of patterns from data, which is fundamentally different from human learning. A new, human-like learning method is needed. At a fundamental level, designing hardware that mimics the structure and function of biological neural networks could help machines realize the dynamic, parallel, and adaptive thought processes exhibited in human perception and cognition.

(2) Neuromorphic systems with quantum computing. Although still in its infancy, incorporating quantum states in neuromorphic systems may provide a way to establish machine awareness, enabling anomaly detection and curiosity generation.

(3) Continuous, real-world learning. Implementing frameworks that allow for real-time learning and adaptation, similar to human experiential learning, may help AI systems develop a "world" perception model for understanding the unknown and better detect and respond to unexpected anomalies.

These approaches could move AI closer to the fluid, adaptive perceptual and cognitive processes seen in human biological systems. Addressing these areas could help bridge the gap between current GenAI limitations and the more dynamic, creative processes of human scientific discovery.

In addition, the integration of GenAI into scientific discovery offers transformative potential but also raises several ethical and societal concerns that merit explicit discussion. For example, we may not be able to discern how hypotheses are generated if the GenAI system does not provide reasoning or justification procedures. Transparency is therefore essential when GenAI generates hypotheses or conclusions that inform critical decisions. Furthermore, there is a risk that overreliance on AI-generated hypotheses might lead to the undervaluation of human judgment, intuition, and expertise. Although GenAI systems can process large datasets and identify patterns that are not immediately apparent, they lack the nuanced understanding and ethical reasoning inherent in human cognition. It is crucial to maintain a balanced approach in which GenAI serves as a supportive tool that enhances diverse human thinking rather than replacing it entirely. Human oversight is particularly important, as GenAI systems often appear "stubborn" in maintaining hypotheses despite evidence that might suggest alternative explanations46.

Moreover, biases in training data can introduce systematic blind spots in scientific discovery. Since GenAI systems learn from historical and current data, any inherent biases—whether in research focus, methodology, or interpretation—can be perpetuated in AI-generated hypotheses and experimental designs. This may result in an overemphasis on established paradigms and a neglect of novel or unconventional ideas. To mitigate these issues, it is essential to diversify training datasets, enhance the curiosity capabilities of GenAI systems, implement robust bias detection and correction algorithms, add justification functions, and incorporate human oversight to ensure a balanced, inclusive approach to scientific discovery.
