A large-scale reaction dataset of mechanistic pathways of organic reactions
Reaction data
In this paper, we used reaction data extracted from USPTO granted patents collected by Lowe13, an organic reaction dataset used extensively to benchmark reaction prediction approaches. In particular, we demonstrate the results on the USPTO-50K dataset curated by Schneider et al.14 and atom-mapped by LocalMapper15. Each reaction in this dataset is represented in SMILES16 format. Since our approach only addresses two-electron arrow-pushing mechanisms, we removed organometallic and radical reactions based on their reaction templates. This pre-processing leaves 33,099 reactions, which we refer to as the USPTO-33K dataset in this paper.
Since a previous study reported that necessary reagents are frequently missing from the reactions recorded in the USPTO dataset15, we intentionally removed the reagent information from all reactions. For instance, approximately 50% of the Suzuki coupling reactions lack a Pd catalyst, and 40% of the Mitsunobu reactions include neither diethyl azodicarboxylate (DEAD) nor diisopropyl azodicarboxylate (DIAD). This inconsistency makes it difficult to treat reactions with varying data completeness in a uniform way. Hence, we designed MechFinder to automatically generate the reagents needed for mechanistic labeling (see the Mechanistic Template (MT) section).
Reaction Template (RT)
In our approach, we leverage reaction rules that are localized around specific atoms and bonds. This narrows the scope of deriving mechanistic labels to the atoms involved in the reaction. To obtain the reactivity information of a reaction dataset, we extract reaction templates (RTs)17 from the reactions in the dataset. We start by identifying the reaction centers, comparing the chemical environment of each atom before and after the reaction. However, in many cases the electron movement extends beyond the changed atoms, as in the nucleophilic acyl substitution reaction shown in Fig. 1. Therefore, we also include moieties that are π-conjugated to the changed atoms through double, triple, or aromatic bonds, as well as several mechanistically important special groups, such as the carbonyl and acetal groups. The resulting reaction template is simpler than the template extracted by RDChiral18 but more informative than the local reaction template described in LocalRetro19. Note, however, that our automated workflow does not distinguish certain mechanisms, such as neighboring group participation and SN2’ mechanisms, from more common reaction mechanisms, because the leading assumption in developing this method was that the mechanism occurs around the defined reaction center.
The overall template extraction is performed by the following four steps:
1. Compare the chemical environment of each atom before and after the reaction according to the atom-mapping. Atoms whose chemical environment changes are identified as “changed atoms”.
2. For each changed atom, identify the neighboring atoms connected to it in the reactants by a double, triple, or aromatic bond as “extended atoms”.
3. To further extend the scope of the RT for mechanistic labeling, we manually define a set of mechanistically important special groups. If any changed atom belongs to one of the special groups, all atoms in that group are added to the “extended atoms” list. After identifying the extended atoms in the reactants, we record the atoms sharing the same atom-map numbers in the product. The details of the RT extension process can be found in Fig. S1 and S2.
4. Using the RDKit Python package20, extract the chemical fragments in the reactants and products in SMARTS format based on the identified changed atoms and extended atoms, and connect the fragments with the reaction symbol “>>”.
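Steps 1 to 3 above can be illustrated with a minimal sketch. It is not the MechFinder implementation: it assumes a toy molecular graph keyed by atom-map number in place of an RDKit molecule, takes an atom's "chemical environment" to mean only its bonded neighbors and bond types, compares only atoms that are mapped on both sides of the reaction, and receives the special groups as already-matched sets of atom-map numbers. The function names are hypothetical. Step 4 (SMARTS fragment extraction) is omitted because it requires RDKit.

```python
from typing import Dict, Iterable, Set

# Toy molecular graph keyed by atom-map number (an assumption of this sketch):
# {atom_map: {neighbor_atom_map: bond_type}}.
Graph = Dict[int, Dict[int, str]]

PI_BONDS = {"double", "triple", "aromatic"}


def find_changed_atoms(reactants: Graph, products: Graph) -> Set[int]:
    """Step 1: atoms mapped on both sides whose bonded environment differs."""
    shared = reactants.keys() & products.keys()
    return {atom for atom in shared if reactants[atom] != products[atom]}


def find_extended_atoms(
    reactants: Graph,
    changed: Set[int],
    special_groups: Iterable[Set[int]] = (),
) -> Set[int]:
    """Steps 2 and 3: pi-bonded neighbors plus atoms of touched special groups."""
    extended: Set[int] = set()
    for atom in changed:
        for neighbor, bond_type in reactants[atom].items():
            if bond_type in PI_BONDS:
                extended.add(neighbor)
    for group in special_groups:
        group = set(group)
        if group & changed:  # a changed atom belongs to this special group
            extended |= group
    return extended - changed


# Acetyl chloride + methylamine -> N-methylacetamide.
# Atom maps: 1 = acyl CH3, 2 = carbonyl C, 3 = O, 4 = Cl, 5 = N, 6 = N-CH3.
reactants: Graph = {
    1: {2: "single"},
    2: {1: "single", 3: "double", 4: "single"},
    3: {2: "double"},
    4: {2: "single"},
    5: {6: "single"},
    6: {5: "single"},
}
products: Graph = {
    1: {2: "single"},
    2: {1: "single", 3: "double", 5: "single"},
    3: {2: "double"},
    5: {6: "single", 2: "single"},
    6: {5: "single"},
}

changed = find_changed_atoms(reactants, products)
extended = find_extended_atoms(reactants, changed)
print(sorted(changed), sorted(extended))
```

In this toy example the changed atoms are the carbonyl carbon and the nitrogen, and the carbonyl oxygen is picked up as an extended atom because it is double-bonded to a changed atom. This is the extra context that a nucleophilic acyl substitution mechanism needs.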
Mechanistic Template (MT)
Since RTs only capture the changes before and after the reaction, simply applying heuristic rules to RTs to generate a mechanistic pathway without any in-domain chemistry knowledge has clear limitations, as the example in Fig. 1 shows. Therefore, we additionally introduce the concepts of mechanistic classes (MCs) and mechanistic templates (MTs) to describe the actual reaction mechanism. An MC is defined as a group of reactions following the same reaction mechanism and may include one or multiple RTs. For a given MC, we then hand-code the MT, which incorporates chemistry knowledge by describing the direction of electron movement as a sequence of arrow-pushing diagrams that specify the attacking and electron-receiving moieties. In particular, the designed MTs can distinguish different mechanisms sharing the same RT (such as SN1 and SN2 reactions) based on chemically motivated criteria. In addition, the MTs recover the reagents necessary for deriving reaction mechanisms and include essential functional groups (such as electron-withdrawing groups, EWGs) in the mechanism labeling process. The proposed MTs categorize the arrows that illustrate the movement of electron pairs in organic reactions into four groups: lone pair to atom, lone pair to bond, bond to atom, and bond to bond. Technically, the lone pairs of atoms are annotated by their atom-map numbers, and the electron pairs of bonds are annotated by pairs of atom-map numbers.
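The arrow annotation described above can be made concrete with a short sketch. It assumes a hypothetical encoding in which a lone pair or a target atom is a single atom-map number (an int) and a bond is a pair of atom-map numbers (a 2-tuple); the actual MT file format is not specified here, and the function names are illustrative only.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: an int is an atom (or its lone pair),
# a 2-tuple of ints is a bond between two mapped atoms.
Site = Union[int, Tuple[int, int]]
Arrow = Tuple[Site, Site]  # (electron source, electron sink)


def site_kind(site: Site) -> str:
    """Return 'atom' for an atom-map number and 'bond' for a pair of them."""
    if isinstance(site, int):
        return "atom"
    if isinstance(site, tuple) and len(site) == 2:
        return "bond"
    raise ValueError(f"not an atom or bond annotation: {site!r}")


def classify_arrow(source: Site, sink: Site) -> str:
    """Assign an arrow to one of the four groups named in the text."""
    source_label = "lone pair" if site_kind(source) == "atom" else "bond"
    return f"{source_label} to {site_kind(sink)}"


# Addition-elimination at a carbonyl, written with template atom maps:
# 1 = nucleophile, 2 = carbonyl C, 3 = carbonyl O, 4 = leaving group.
acyl_substitution: List[List[Arrow]] = [
    [(1, 2), ((2, 3), 3)],       # Nu lone pair attacks C; C=O pi pair moves to O
    [(3, (2, 3)), ((2, 4), 4)],  # O lone pair reforms C=O; C-LG pair leaves with LG
]

labels = [[classify_arrow(s, t) for s, t in step] for step in acyl_substitution]
print(labels)
```

Because the encoding refers only to atom-map numbers and not to element types, the same sequence of arrows can be reused for any nucleophile and leaving group that fit the template.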
The proposed MT has four notable features:
1. Because atom types are specified in RTs but not in MTs, multiple RTs often share the same MT. For example, different nucleophiles in substitution reactions lead to different RTs but the same MT (Fig. 2a).
2. In some cases, a single RT can match different MTs depending on the specific chemical environment. In these cases, we design particular criteria to assign the correct MT to the obtained RT. For example, the choice between SN1 and SN2 depends on the alkyl group connected to the leaving group (Fig. 2b).
3. Many reactions can only occur when additional reagents are added, and the reaction mechanism can only be labeled if these reagents are present. For these reactions, we add the necessary reagents to the reactant set to complete the mechanism (Fig. 2c). We define missing reagents as compounds whose atoms, lone pairs, or bonds participate in the mechanism but do not appear in the major products. In our labeled dataset, we add the necessary reagent(s) for approximately 19,000 reactions (60%).
4. Since the mechanistic pathway labeled by this method is based on the movement of electron pairs, reaction mechanisms beyond this scope, such as organometallic or radical reactions, cannot be labeled by the current method (Fig. 2d). More examples of MTs can be found in the Supplementary Information.
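The first, second, and fourth features can be sketched as a lookup from RT to MT. Everything named here is hypothetical: the RT keys are placeholders rather than real SMARTS templates, and the SN1/SN2 rule (three carbon substituents on the carbon bearing the leaving group gives SN1, otherwise SN2) is a simplified stand-in for the criteria shown in Fig. 2b, not the authors' actual rule.

```python
from typing import Callable, Dict, Optional

# Context extracted from the reaction around the matched RT (hypothetical keys).
Context = Dict[str, int]
Chooser = Callable[[Context], str]


def substitution_rule(ctx: Context) -> str:
    """Assumed criterion: a tertiary carbon bearing the leaving group gives SN1."""
    return "SN1" if ctx["carbon_substituents"] >= 3 else "SN2"


def fixed(mt_name: str) -> Chooser:
    """An RT that always maps to the same MT, regardless of context."""
    return lambda ctx: mt_name


# Placeholder RT identifiers; real RTs would be reaction SMARTS strings.
RT_TO_MT: Dict[str, Chooser] = {
    "RT_acyl_substitution_amine": fixed("acyl_substitution"),    # two RTs sharing
    "RT_acyl_substitution_alcohol": fixed("acyl_substitution"),  # one MT
    "RT_alkyl_substitution": substitution_rule,  # one RT, MT chosen by a criterion
}


def assign_mt(rt_id: str, ctx: Context) -> Optional[str]:
    """Return the MT name, or None when the RT is outside the labeled scope."""
    chooser = RT_TO_MT.get(rt_id)
    return chooser(ctx) if chooser is not None else None


print(assign_mt("RT_alkyl_substitution", {"carbon_substituents": 3}))
print(assign_mt("RT_alkyl_substitution", {"carbon_substituents": 1}))
print(assign_mt("RT_radical_bromination", {}))
```

Returning None for an unknown RT mirrors the fourth feature: reactions without a matching electron-pair mechanism are simply left unlabeled in this sketch.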
Notably, deriving the mechanism for certain groups of reactions inevitably requires additional moieties beyond those present in the extracted RT. To address this limitation of locality, our method includes additional handling to capture these important mechanistic elements. The framework and examples can be found in the Supplementary Information.
Labeling reaction mechanism using MechFinder
In this paper, we introduce a dataset generated by MechFinder, a mechanism labeling framework that uses the reaction templates (RTs) and mechanistic templates (MTs) introduced in the previous subsections. The process of using MechFinder to label the reaction mechanisms in a reaction dataset is divided into two phases: the expert annotation phase and the automatic labeling phase, as shown in Fig. 3a,b.
During the expert annotation phase (Fig. 3a), we first extracted N unique RTs from all X reactions in the reaction dataset. For each RT, we sampled k representative reactions and manually labeled the mechanism in the three steps shown in Fig. 3c:
1. RT extraction. We extracted a reaction template focused on the reaction center, describing the local changes in atomic configuration upon a chemical transformation. The extraction process also yields an atom-map lookup table, recording the one-to-one atom-map correspondence between the input reaction and the extracted RT.
2. MT identification. Given the RT for the reaction, the MC and MT are identified by manual labeling in the expert annotation phase (once this mapping is established, the step is automated in the large-scale mechanism generation).
3. Mechanistic sequence acquisition. The mechanistic pathway for the input reaction is labeled by aligning the atom-map numbers from the MT to the input reaction according to the atom-map lookup table.
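The third step amounts to renumbering the arrows of an MT with the atom-map numbers of the input reaction. The sketch below illustrates this under stated assumptions: the lookup table is taken to map template atom-map numbers to reaction atom-map numbers, arrows use a hypothetical encoding (an int for a lone pair or atom, a 2-tuple for a bond), and the function names are not taken from MechFinder.

```python
from typing import Dict, List, Tuple, Union

Site = Union[int, Tuple[int, ...]]  # atom-map number, or a pair of them for a bond
Arrow = Tuple[Site, Site]           # (electron source, electron sink)
Mechanism = List[List[Arrow]]       # one list of arrows per elementary step


def remap_site(site: Site, lookup: Dict[int, int]) -> Site:
    """Translate one atom or bond from template numbering to reaction numbering."""
    if isinstance(site, tuple):
        return tuple(lookup[atom] for atom in site)
    return lookup[site]


def remap_mechanism(mt: Mechanism, lookup: Dict[int, int]) -> Mechanism:
    """Apply the atom-map lookup table to every arrow of every step."""
    return [
        [(remap_site(source, lookup), remap_site(sink, lookup)) for source, sink in step]
        for step in mt
    ]


# MT for addition-elimination at a carbonyl, in template numbering:
# 1 = nucleophile, 2 = carbonyl C, 3 = carbonyl O, 4 = leaving group.
mt: Mechanism = [
    [(1, 2), ((2, 3), 3)],
    [(3, (2, 3)), ((2, 4), 4)],
]

# Hypothetical lookup table produced during RT extraction (template -> reaction).
lookup = {1: 12, 2: 7, 3: 8, 4: 9}

labeled = remap_mechanism(mt, lookup)
print(labeled)
```

A missing key in the lookup table raises a KeyError here, which is one simple way to surface reactions whose RT does not cover every atom the MT refers to.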
The number of sampled reactions k in the expert annotation phase depends on the complexity of the RT. For simple reactions such as nucleophilic acyl substitution, we sample only one reaction to label the MT. For more complex reactions such as SNAr, we sample more reactions to cover cases where the electron-withdrawing groups (EWGs) are located at different positions (ortho or para) relative to the leaving group, so that the MT can be labeled with the corresponding criteria.
MechFinder’s approach is centered on these sampled reactions, which then enable the automated labeling of the entire dataset. During the expert annotation phase (Fig. 3a), kN reactions (k reactions from each of N templates) are directly inspected and labeled. These typically amount to a few hundred reactions and form the basis of our mechanistic template library. The automatic labeling phase (Fig. 3b) applies the developed mechanistic steps to the entire dataset (X reactions).
In the current dataset, we label 33,099 reactions (X = 33,099) with 100 unique RTs (N = 100), generally sampling fewer than 5 reactions for each RT (k < 5). In other words, the MechFinder in this work was developed by manually labeling fewer than 500 reactions, which is affordable for a group of chemists in a reasonable amount of time.