Pre-trained molecular representations enable antimicrobial discovery


Pre-training

Dataset: The MolE pre-training dataset was created by randomly sampling 100,000 unlabeled molecules from a collection of 10,000,000 unique structures originally collected by ChemBERTa25. This subset was then randomly split into training (90%) and validation (10%) sets.

Molecular graph construction: Each molecule is represented as a graph, where all atoms are nodes V and the chemical bonds between them are the edges E. The attributes encoded for each atom are the element it represents and its chirality type (tetrahedral CW, tetrahedral CCW, other). Likewise, each bond is embedded by its type (single, double, triple, aromatic), and its direction (none, end-upright, end-downright).
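For illustration, the construction just described could be sketched with RDKit roughly as follows; the feature vocabularies mirror the attributes listed above, but the helper name (mol_to_graph) and the exact encodings are ours and may differ from the MolE code base.

```python
# Illustrative sketch of molecular graph construction with RDKit.
# Vocabularies follow the attributes described above; an "unspecified" chirality
# category is added for atoms without chirality information (an assumption).
import torch
from rdkit import Chem

CHIRALITY = [Chem.rdchem.ChiralType.CHI_UNSPECIFIED,
             Chem.rdchem.ChiralType.CHI_TETRAHEDRAL_CW,
             Chem.rdchem.ChiralType.CHI_TETRAHEDRAL_CCW,
             Chem.rdchem.ChiralType.CHI_OTHER]
BOND_TYPES = [Chem.rdchem.BondType.SINGLE, Chem.rdchem.BondType.DOUBLE,
              Chem.rdchem.BondType.TRIPLE, Chem.rdchem.BondType.AROMATIC]
BOND_DIRS = [Chem.rdchem.BondDir.NONE, Chem.rdchem.BondDir.ENDUPRIGHT,
             Chem.rdchem.BondDir.ENDDOWNRIGHT]

def mol_to_graph(smiles):
    """Return node features, edge index, and edge features for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    # Node attributes: element (atomic number) and chirality type per atom.
    x = torch.tensor([[a.GetAtomicNum(), CHIRALITY.index(a.GetChiralTag())]
                      for a in mol.GetAtoms()], dtype=torch.long)
    edges, edge_attr = [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        feat = [BOND_TYPES.index(b.GetBondType()), BOND_DIRS.index(b.GetBondDir())]
        edges += [(i, j), (j, i)]          # each undirected bond becomes two directed edges
        edge_attr += [feat, feat]
    edge_index = torch.tensor(edges, dtype=torch.long).t()
    return x, edge_index, torch.tensor(edge_attr, dtype=torch.long)
```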

As shown in Fig. 1a, during pre-training two augmentations (YA and YB) are created by following the subgraph removal procedure first described in MolCLR27. Briefly, a seed atom is selected at random. This seed atom and its neighbors are masked, and then the neighbors of the neighbors are masked, until 25% of the atoms in the original molecule have been masked. The bonds between these masked atoms are removed, producing a subgraph of the original molecule. This subgraph removal procedure is performed independently for each augmentation. While this can result in very different subgraphs of the same compound entering the GNN backbone, we expect that by repeating this procedure over several epochs and for several molecules, similar representations are learned for compounds with similar structures (Fig. 2).
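The subgraph-removal augmentation can be sketched as a breadth-first expansion from a random seed atom; this is an illustrative outline, and the actual MolCLR/MolE implementation may differ in details such as tie-breaking and bond handling.

```python
# Illustrative sketch of the subgraph-removal augmentation (after MolCLR).
import random
from rdkit import Chem

def subgraph_removal(mol, fraction=0.25):
    """Return the set of atom indices to mask, grown outward from a random seed atom."""
    n_atoms = mol.GetNumAtoms()
    n_mask = max(1, int(fraction * n_atoms))
    seed = random.randrange(n_atoms)
    masked, frontier = {seed}, [seed]
    # Breadth-first expansion: mask neighbors, then neighbors of neighbors, ...
    while len(masked) < n_mask and frontier:
        nxt = []
        for idx in frontier:
            for nb in mol.GetAtomWithIdx(idx).GetNeighbors():
                if nb.GetIdx() not in masked:
                    masked.add(nb.GetIdx())
                    nxt.append(nb.GetIdx())
                if len(masked) >= n_mask:
                    break
            if len(masked) >= n_mask:
                break
        frontier = nxt
    # Bonds between the masked atoms would then be removed from the edge list.
    return masked

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
print(subgraph_removal(mol))
```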

Graph Neural Networks: We explored the ability of Graph Isomorphism Networks (GINs)51 and Graph Convolutional Networks (GCNs)11 to extract meaningful features from the molecular graphs constructed in the previous step. Both algorithms use an update function to learn an actualized representation of each node. In the case of GINs, this update is performed by an MLP head:

$${{{\bf{h}}}}_{v}^{(l)}={{\rm{MLP}}}^{(l)}\left({{{\bf{h}}}}_{v}^{(l-1)}+{\sum}_{u\in N(v)}\left({{{\bf{h}}}}_{u}^{(l-1)}+{{{\bf{e}}}}_{v,u}^{(l)}\right)\right)$$

(1)

where \({{{\bf{h}}}}_{v}^{(l)}\) is the updated vector representation of node v ∈ V at the l-th GIN layer, N(v) is the set of neighbors of v, and \({{{\bf{e}}}}_{v,u}^{(l)}\) represents the vector embedding of the attributes of the bond between v and its neighbor u.
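As a concrete illustration of Eq. (1), a GIN-style layer with bond embeddings can be written with pytorch-geometric's MessagePassing interface. The bond-embedding scheme and layer dimensions below are assumptions made for this sketch, not the exact MolE architecture.

```python
# Minimal sketch of a GIN-style update with bond embeddings (cf. Eq. (1)).
import torch
from torch import nn
from torch_geometric.nn import MessagePassing

class GINLayer(MessagePassing):
    def __init__(self, dim):
        super().__init__(aggr="add")               # sum over neighbors, as in Eq. (1)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))
        self.bond_emb = nn.Embedding(4 * 3, dim)   # 4 bond types x 3 directions (assumed)

    def forward(self, h, edge_index, edge_attr):
        # h: node embeddings from the previous layer (already embedded atom features)
        e = self.bond_emb(edge_attr[:, 0] * 3 + edge_attr[:, 1])
        agg = self.propagate(edge_index, h=h, e=e)  # sum over u of (h_u + e_{v,u})
        return self.mlp(h + agg)                    # MLP applied to h_v plus the aggregate

    def message(self, h_j, e):
        return h_j + e
```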

Molecular representation: To obtain a global vector representation of the molecular structure, we first gather a pooled, graph-level representation for each GNN layer (g(l)) by adding the node embeddings of all atoms in the molecule.

$${{{\bf{g}}}}^{(l)}={\sum}_{v\in V}{{{\bf{h}}}}_{v}^{(l)}$$

(2)

A final graph representation r is obtained by concatenating the graph-level representations of each layer into a single vector.

$${{\bf{r}}}=\,{{\rm{CONCAT}}}\,({{{\bf{g}}}}^{(l)}| l=1,2,…,L)$$

(3)

Here, L represents the total number of GNN layers used, and CONCAT is the concatenation operator. In our setup, g(l) is a 200-dimensional vector. Given that L = 5, r is therefore a 1000-dimensional vector. The graph representation learned for augmentation YA is denoted rA, and the representation learned for augmentation YB is denoted rB.
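Equations (2) and (3) amount to summing node embeddings per layer and concatenating the resulting graph-level vectors; a compact sketch (assuming a list of GIN-style layers such as the GINLayer above, and node embeddings as input) could look like this.

```python
# Sketch of graph-level pooling and layer-wise concatenation (Eqs. (2)-(3)),
# assuming five layers with 200-dimensional node embeddings, so r is 1000-dimensional.
import torch
from torch_geometric.nn import global_add_pool

def molecular_representation(layers, x0, edge_index, edge_attr, batch):
    """Return the concatenated representation r for a (batched) molecular graph.

    layers: list of GNN layers (e.g. five GINLayer modules)
    x0:     initial node embeddings (e.g. from an atom-embedding layer)
    batch:  vector assigning each node to its molecule
    """
    h, pooled = x0, []
    for layer in layers:
        h = layer(h, edge_index, edge_attr)
        pooled.append(global_add_pool(h, batch))   # g^(l): sum of node embeddings
    return torch.cat(pooled, dim=1)                # r = CONCAT(g^(1), ..., g^(L))
```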

Non-contrastive learning: Once the final graph representations rA and rB are produced, an MLP layer is used to obtain embeddings zA and zB, which are D-dimensional vectors. These embedding vectors are used to evaluate the \({{{\mathcal{L}}}}_{BT}\) objective function30. First, an empirical cross-correlation matrix \({{\mathcal{C}}}\) is computed between zA and zB:

$${{{\mathcal{C}}}}_{ij}=\frac{{\sum }_{b}{{{\bf{z}}}}_{b,i}^{A}{{{\bf{z}}}}_{b,j}^{B}}{\sqrt{{\sum }_{b}{\left({{{\bf{z}}}}_{b,i}^{A}\right)}^{2}}\sqrt{{\sum }_{b}{\left({{{\bf{z}}}}_{b,j}^{B}\right)}^{2}}}$$

(4)

Here, b indexes the batch samples and i and j index the vector dimensions of the embeddings. In practical terms, \({{\mathcal{C}}}\) is a D × D matrix that captures the empirical cross-correlation between embedding dimensions, computed over the samples of a batch. Finally, \({{\mathcal{C}}}\) is used to calculate the \({{{\mathcal{L}}}}_{BT}\) loss:

$${{{\mathcal{L}}}}_{BT}={\sum}_{i}{(1-{{{\mathcal{C}}}}_{ii})}^{2}+\lambda {\sum}_{i}{\sum}_{j\ne i}{{{\mathcal{C}}}}_{ij}^{2}\,.$$

(5)

In essence, by minimizing \({{{\mathcal{L}}}}_{BT}\) we minimize the difference between \({{\mathcal{C}}}\) and the identity matrix. In this way, embeddings obtained from augmentations of the same molecule are made to be invariant to distortions, while at the same time penalizing redundancy of information encoded by the components of the embedding. The positive constant λ controls the trade-off between these two terms.
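A minimal PyTorch sketch of Eqs. (4) and (5) is given below; here each embedding dimension is standardized across the batch, which plays the role of the normalization in Eq. (4), and lam corresponds to λ. This is an illustration rather than the reference implementation.

```python
# Minimal sketch of the Barlow Twins objective (Eqs. (4)-(5)).
import torch

def barlow_twins_loss(z_a, z_b, lam=1e-4):
    n = z_a.size(0)
    # Standardize each embedding dimension across the batch (normalization of Eq. (4)).
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / n                          # empirical D x D cross-correlation matrix
    diag = torch.diagonal(c)
    on_diag = (1 - diag).pow(2).sum()              # invariance term: push diagonal toward 1
    off_diag = c.pow(2).sum() - diag.pow(2).sum()  # redundancy-reduction term
    return on_diag + lam * off_diag

# Example with placeholder embeddings for a batch of 1000 molecules, D = 8000.
z_a, z_b = torch.randn(1000, 8000), torch.randn(1000, 8000)
print(barlow_twins_loss(z_a, z_b).item())
```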

Pre-training setup: For all experiments, MolE pre-training is done over 1000 epochs with a batch size of 1000 molecules. An Adam optimizer with weight decay 10−5 is used to minimize \({{{\mathcal{L}}}}_{BT}\). The first 10 epochs are performed with a learning rate of 5 × 10−4, after which cosine learning-rate decay is applied. The parameter configuration that minimizes the validation error is kept for downstream tasks.
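The optimization schedule can be outlined as follows; the model object is a stand-in, and the exact warm-up/scheduler wiring is an assumption based on the description above.

```python
# Sketch of the pre-training optimization setup (values taken from the text;
# the scheduler wiring is an assumption).
import torch

model = torch.nn.Linear(1000, 8000)            # stand-in for the GNN backbone + MLP head
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
# Cosine decay applied after a 10-epoch warm-up at the base learning rate.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000 - 10)

for epoch in range(1000):
    # ... one pass over the 100,000-molecule set in batches of 1000, minimizing L_BT ...
    if epoch >= 10:
        scheduler.step()
```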

Benchmarking

Datasets: We collected 11 benchmark datasets originally curated and made available by MoleculeNet14. This set included 6 binary classification tasks and 5 regression tasks. To make these benchmarking scenarios more challenging and realistic, we split each dataset into training (80%), validation (10%), and test (10%) sets following the scaffold splitting procedure described by Hu et al.36. Briefly, the Murcko scaffold of each molecule is determined, which identifies key topological landmarks in the overall structure52. Molecules that share the same scaffold are then collectively assigned to the same set. The aim of the scaffold splitting procedure is to create a realistic scenario in which the molecules seen during training are structurally dissimilar to those encountered when the model is applied to a novel chemical library.
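The scaffold splitting idea can be sketched with RDKit's Murcko scaffold utility: molecules are grouped by scaffold and whole groups are assigned to splits. The assignment heuristic below is a simplified illustration and may differ from the exact procedure of Hu et al.36.

```python
# Hedged sketch of scaffold splitting: group molecules by Murcko scaffold, then fill
# train/valid/test buckets group by group.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)
    # Larger scaffold groups first, so common cores stay in the training set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```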

Training and evaluation: In our benchmarking experiments, MolE was pre-trained on 100,000 molecules from PubChem29. Different dimensionalities of the embedding vectors (zA and zB) and values for the trade-off parameter λ were explored, while the representation r was kept as a 1000-dimensional vector. After pre-training, we evaluated the resulting representation in two ways: (i) the static vector representation r was used as input for machine-learning algorithms (termed MolEstatic), or (ii) the weights learned by the GIN layers were used to fine-tune a predictor for a specific task (termed MolEfinetune).

In the first case, the pre-trained representation MolEstatic was used as input molecular features for the Random Forest35 or XGBoost34 algorithms. Hyperparameter optimization was performed via random search (details in Supplementary Tables 4 and 5). Each model configuration was trained three times with different seed values. The model configuration with the largest mean ROC-AUC value on the validation set was then evaluated on the test set. For datasets with more than one task, the average performance per task is reported.

In the case of fine-tuning, an untrained MLP head is placed after the GNN layers. For all classification tasks, the softmax cross-entropy loss is optimized, while for regression, the L1 loss is optimized for the QM7 and QM8 tasks and the mean squared error for all other tasks. A random search is performed for hyperparameter optimization (Supplementary Table 6). The selected architecture was trained for 100 epochs using an Adam optimizer with a weight decay of 10−6. While both the GNN and the MLP are updated during training, different learning rates were chosen for the two parts. Each model configuration was trained three times.

Extended Connectivity Fingerprints (ECFP) were calculated using the functionality available in RDKit 2020.09.1.053. To obtain ECFP4 fingerprints, we set the relevant parameters fp_radius=2 and fp_bits=1024.
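For reference, ECFP4 fingerprints with these settings can be computed with RDKit's Morgan fingerprint function, where fp_radius and fp_bits correspond to the radius and nBits arguments; the wrapper function below is only illustrative.

```python
# Minimal sketch of the ECFP4 computation with RDKit (radius 2, 1024 bits).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp4(smiles, radius=2, n_bits=1024):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

print(ecfp4("CCO").sum())   # number of bits set for ethanol
```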

Other predictors: In Table 1 and Supplementary Table 1, performance metrics for GCN, GIN, SchNet, MGCN, D-MPNN, and Hu et al. are taken from the MolCLR publication27. The ROC-AUC values for HiMol were taken from the respective publication54, except for the HIV task, which was evaluated in the current study.

The MolCLRGIN27 model made available on their GitHub page was used for all benchmarks, following the instructions in the same repository. The molecular representation obtained after GNN feature extraction was used as the input for either XGBoost or Random Forest. The N-Gram55 and HiMol54 models were pre-trained and molecular features were extracted following the default instructions and parameters made available in their respective GitHub repositories.

Ablation study on the MolE framework

To identify the components of MolE that contribute to performance gains, we performed an ablation study shown in Supplementary Fig. 3.

GNN backbone and construction: We compared GINs and GCNs as the backbone of our graph feature extraction step, combined with two alternatives for constructing the molecular representation r. Concatenated (C) representations are obtained as described previously (Eq. (3)). Non-concatenated (NC) representations consist of the pooled output of the last GNN layer only. In experiments where no concatenation is performed, g(l) is set to a 1000-dimensional vector, so that the representation r remains 1000-dimensional. We observed that models perform best when trained on representations built with the concatenation strategy, independent of the GNN backbone used during pre-training (Supplementary Fig. 3a). This indicates that each GNN layer captures important information about the molecular structure. In the ClinTox task, both GIN-derived representations outperformed their GCN counterparts.

Embedding dimensionality, trade-off parameter, and dataset size: We observed increased performance on the ClinTox task when zA and zB were 8000-dimensional vectors (Supplementary Fig. 3b). We also found performance increases as the λ trade-off parameter approaches 10−4 (Supplementary Fig. 3c). Finally, we noted that increasing the size of the unlabeled dataset used during pre-training does not necessarily improve performance on most classification tasks (Supplementary Fig. 3d), although a higher ROC-AUC value can be observed in the ClinTox task when MolE’s representation is pre-trained on 200,000 molecules.

The Barlow-Twins non-contrastive objective: Overall, we note that performance using MolCLR’s representation learned from large-scale unlabeled data is on par with the performance obtained from our Barlow-Twins pre-training framework on our smaller set of unlabeled data. We see the largest performance improvement when concatenating the graph-level representations learned by each GIN layer. In our performance comparison, we denote the original representations as non-concatenated (NC) and the novel strategy as concatenated (C) (see Supplementary Fig. 3e for details).

With these observations, we decided to pursue the task of antimicrobial discovery using the MolEstatic representation obtained after pre-training on 100,000 molecules with λ = 10−4, \({{{\bf{z}}}}^{A,B}\in {{\mathbb{R}}}^{8000}\), using a GIN backbone and constructed by concatenating graph-layer representations.

Exploring the MolE representation

Representation similarity: The distance between two MolE representations for compounds i and j, ri and rj, was determined using the cosine distance:

$${d}_{{{\rm{cosine}}}}({{{\bf{r}}}}^{i},{{{\bf{r}}}}^{j})=1-\frac{{{{\bf{r}}}}^{i}\cdot {{{\bf{r}}}}^{j}}{{\left\Vert {{{\bf{r}}}}^{i}\right\Vert }_{2}{\left\Vert {{{\bf{r}}}}^{j}\right\Vert }_{2}}$$

(6)

The dissimilarity between two ECFP4 fingerprints, \({{{\bf{r}}}}_{\,{\mbox{ECFP4}}}^{i}\) and \({{{\bf{r}}}}_{\,{\mbox{ECFP4}}\,}^{j}\), was calculated using the Jaccard distance:

$${d}_{{{\rm{jaccard}}}}({{{\bf{r}}}}_{{\mbox{ECFP4}}}^{i},{{{\bf{r}}}}_{{\mbox{ECFP4}}}^{j})=1-\frac{| {{{\bf{r}}}}_{{\mbox{ECFP4}}}^{i}\cap {{{\bf{r}}}}_{{\mbox{ECFP4}}}^{j}| }{| {{{\bf{r}}}}_{{\mbox{ECFP4}}}^{i}\cup {{{\bf{r}}}}_{{\mbox{ECFP4}}}^{j}| }$$

(7)

Importantly, the Jaccard distance is equivalent to 1 – Tanimoto similarity (see Supplementary Fig. 2).
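Both distances can be computed directly with scipy, whose jaccard distance on binary vectors equals 1 − Tanimoto similarity; the random vectors below are placeholders for actual MolE representations and ECFP4 fingerprints.

```python
# Sketch of the two distance measures (Eqs. (6)-(7)) using scipy.
import numpy as np
from scipy.spatial.distance import cosine, jaccard

r_i, r_j = np.random.rand(1000), np.random.rand(1000)   # placeholder MolE representations
d_cos = cosine(r_i, r_j)                                 # Eq. (6)

fp_i = np.random.randint(0, 2, 1024).astype(bool)        # placeholder ECFP4 bit vectors
fp_j = np.random.randint(0, 2, 1024).astype(bool)
d_jac = jaccard(fp_i, fp_j)                              # Eq. (7): 1 - Tanimoto similarity

print(d_cos, d_jac)
```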

Uniform Manifold Approximation and Projection: A UMAP embedding based on the cosine distance between MolE representations was built using the umap-learn 0.5.3 Python module.
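A minimal usage sketch with umap-learn is shown below; apart from the cosine metric, all parameters are left at library defaults, and the input matrix is a placeholder.

```python
# Sketch of the UMAP embedding on MolE representations with the cosine metric.
import numpy as np
import umap

X = np.random.rand(500, 1000)                 # placeholder MolE representations
embedding = umap.UMAP(metric="cosine").fit_transform(X)
print(embedding.shape)                        # (500, 2) by default
```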

Predicting antimicrobial activity on human gut bacteria

Dataset: The adjusted p value table from Maier et al.24 was used to determine labels for the growth-inhibitory effects of the screened compounds. Compound-microbe combinations with an adjusted p value < 0.05 were considered to be examples of growth inhibition. The 1197 molecules were divided into training (80%), validation (10%), and test (10%) sets following the scaffold splitting procedure.

Molecular and bacterial strain representation: We used the SMILES string of a given molecule i to obtain the corresponding representation mi, which is a d-dimensional vector. In our work, m can be one of three possible representations: (i) ECFP4, in which case d = 1024, (ii) MolEstatic (d = 1000), and (iii) a set of explicit Chemical Descriptor features (d = 98), described by Algavi and Borenstein4, which were calculated using RDKit.

A given microbial strain j ∈ B (where B is the complete set of bacterial strains) is represented as a one-hot-encoded vector bj. Given that 40 strains are present in the dataset, bj is a 40-dimensional vector.

Compound-microbe predictions of antimicrobial activity: The compound and microbe vectors mi and bj are concatenated to form xij, which is a 40 + d-dimensional vector. This combination of molecular and microbe representations is the input to our classifier function f(). In our work, f() is an XGBoost model that, for each xij, outputs a probability (pij) that indicates the likelihood of compound i inhibiting the growth of microbe j.

$${x}_{ij}=\,{{\rm{CONCAT}}}\,({{\mathbf{m}}}_{i},{{\mathbf{b}}}_{j})$$

(8)

$${p}_{ij}=f({x}_{ij})\,{,}\,{p}_{ij}\in [0,1]$$

(9)
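The pairing of compound and strain vectors in Eqs. (8) and (9) can be illustrated as follows; the training data and hyperparameters are placeholders, and in the actual workflow the classifier is trained on the Maier et al.24 labels with tuned hyperparameters.

```python
# Illustrative sketch of Eqs. (8)-(9): concatenate a compound representation with a
# one-hot strain vector and score the pair with an XGBoost classifier.
import numpy as np
from xgboost import XGBClassifier

n_strains, d = 40, 1000
clf = XGBClassifier(n_estimators=100)

# Placeholder training data: rows are (m_i, b_j) pairs, labels mark growth inhibition.
X = np.random.rand(200, d + n_strains)
y = np.random.randint(0, 2, 200)
clf.fit(X, y)

def predict_pair(m_i, strain_idx):
    b_j = np.zeros(n_strains)
    b_j[strain_idx] = 1.0
    x_ij = np.concatenate([m_i, b_j])[None, :]          # Eq. (8)
    return clf.predict_proba(x_ij)[0, 1]                # Eq. (9): p_ij in [0, 1]

p = predict_pair(np.random.rand(d), strain_idx=3)
print(p)
```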

Ranking compounds with Antimicrobial Potential scores: We summarize the predicted spectrum of activity of compound i in two ways:

(i) The total number of strains predicted to be inhibited by the compound, Ki. This is achieved by thresholding the antimicrobial predictive probabilities (pij) into binary predictions of growth inhibition and adding up the number of strains predicted to be inhibited.

$${K}_{i}={\sum }_{j=1}^{40}{{\mbox{t}}}({p}_{ij}),\quad {K}_{i}\in \{0,1,2,\ldots ,40\}$$

(10)

where

$${{\mbox{t}}}({p}_{ij})=\left\{\begin{array}{ll}1,\quad &{p}_{ij}\ge s\\ 0,\quad &{p}_{ij} < s\end{array}\right.$$

(11)

Here, the function t() binarizes the output by determining whether pij exceeds a threshold s. We selected s to optimize the F1-Score metric obtained on the validation set for each model (Supplementary Fig. 14 and Supplementary Table 7). This optimized cutoff was 0.044, 0.068, and 0.209 for the model trained with MolE, Chemical Descriptors, and ECFP4 features, respectively.

(ii) We define the Antimicrobial Potential score Gi for compound i as the \({\log }_{2}\) of the geometric mean of the antimicrobial predictive probabilities pij across all j microbes:

$${G}_{i}={\log }_{2}\left({\left({\prod }_{j=1}^{40}{p}_{ij}\right)}^{\frac{1}{40}}\right)$$

(12)

Additionally, we consider the value of the Antimicrobial Potential score when calculated on the subsets of Gram-positive \({G}_{i}^{+}\) and Gram-negative \({G}_{i}^{-}\) microbes. In Eqs. (13), (14) we impose a fixed indexing on the set of taxa, where the first 22 indices represent all Gram-positive bacteria (j ∈ {1, 2, …, 22}) and the remaining 18 indices represent all Gram-negative bacteria (j ∈ {23, 24, …, 40}).

$${G}_{i}^{+}={\log }_{2}\left({\left({\prod }_{j=1}^{22}{p}_{ij}\right)}^{\frac{1}{22}}\right)$$

(13)

$${G}_{i}^{-}={\log }_{2}\left({\left({\prod }_{j=23}^{40}{p}_{ij}\right)}^{\frac{1}{18}}\right)$$

(14)
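The two ranking scores can be computed directly from the per-strain probabilities; the sketch below assumes the fixed strain ordering described above (first 22 Gram-positive, last 18 Gram-negative) and uses placeholder probabilities.

```python
# Sketch of the ranking scores: K_i (Eqs. (10)-(11)) and the Antimicrobial Potential
# score G_i (Eqs. (12)-(14)); the threshold s is model-specific (e.g. 0.044 for MolE).
import numpy as np

def k_score(p_i, s):
    """Number of strains predicted to be inhibited by compound i."""
    return int((p_i >= s).sum())

def g_score(p_i):
    """log2 of the geometric mean of the predicted probabilities."""
    return float(np.log2(p_i).mean())          # log of a geometric mean = mean of logs

p_i = np.random.uniform(0.01, 0.99, 40)        # placeholder probabilities for 40 strains
print(k_score(p_i, s=0.044), g_score(p_i))

# Gram-specific variants (Eqs. (13)-(14)): first 22 Gram-positive, last 18 Gram-negative.
g_pos, g_neg = g_score(p_i[:22]), g_score(p_i[22:])
```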

Model selection and evaluation: A random search over XGBoost hyperparameters was performed for each chemical representation. The model configuration with the highest PR-AUC on the validation set was then evaluated on the test set.

Predicting antimicrobial compounds in an orthogonal chemical library

Chemical library: A separate chemical library was constructed based on the FDA-approved Drugs, Human Endogenous Metabolite, and Food-Homology Compound libraries made available by MedChemExpress. The chemical structures for these compounds were gathered from PubChem29 using the pubchempy 1.0.4 Python module. SMILES were canonicalized and salts were removed using RDKit53.

Compound annotation: Information provided by MedChemExpress included descriptions of the chemicals in the library. The corresponding Anatomical-Therapeutic-Chemical (ATC) code was assigned to each compound by matching compound name strings against a complete collection of ATC codes. Overlap with chemicals in the library used by Maier et al.24 was determined by matching chemical names and ATC codes; chemicals present in both libraries were not considered for downstream prediction.

Prediction and evaluation: Molecules with Ki ≥ 10 were prioritized for further evaluation. We performed a literature search for articles available on PubMed56 that described the in-vitro and/or in-vivo antimicrobial activity of our prioritized compounds against any bacterial species. When recording MICs, we considered the lowest concentration at which no growth was observed for any Gram-positive or Gram-negative strain.

Experimental validation

Compound prioritization: In total, 6 compounds were selected for experimental validation. The criteria considered for compound selection were the following: (i) the compound was predicted to inhibit 10 strains or more, (ii) the compound was not an antibiotic, (iii) the compound did not have antifungal activity, (iv) no previous literature describing the antimicrobial activity of the compound was found, and (v) the compound could be purchased through an independent provider. Furthermore, we attempted to choose compounds with different biological functions that were structurally diverse from each other. Additional information about the chosen compounds can be found in Supplementary Table 8.

Bacterial strains and growth conditions: Before the experiments, all strains were cultured overnight in Lysogeny Broth (LB Lennox) adjusted to pH 7.5 at 37 °C. Detailed information about the strains used in this study can be found in Supplementary Table 9.

Measurement of Minimum Inhibitory Concentrations: All compounds were purchased from Biomol GmbH (Germany). Stock solutions were prepared in DMSO and stored at −20 °C until further use. MIC measurements were performed in 96-well plates with 100 μL bacterial cultures in Mueller Hinton (MH) broth using 1:2 serial dilutions of the tested compounds. Starting concentrations of 128 μg/mL were used for Cetrorelix, Opicapone, Thymidine, Visomitin, and Elvitegravir, and 64 μg/mL for Ebastine due to its low solubility. No-compound controls contained DMSO or water. Overnight cultures of three biological replicates of each bacterial strain were adjusted to OD600 = 0.1 and inoculated into the plates by pinning using a Singer Rotor (Singer Instruments, UK), achieving a 1:200 dilution. Plates were sealed with transparent breathable membranes (Breathe-Easy®, Sigma-Aldrich-Merck) and incubated at 37 °C in a Cytomat 2 incubator (Thermo Scientific) with continuous shaking at 800 rpm. OD600 was measured at regular 30-minute intervals for up to 12 h in a Synergy H1 plate reader (Agilent, USA). Additional information about the tested bacterial strains can be found in Supplementary Table 8.

Growth curve modeling: All growth curves were normalized by subtracting the minimum of the second, third, and fourth measurements taken. Afterward, each individual curve was modeled as a logistic curve with the Sicegar 0.2.4 R package. From these curves, parameters such as the maximum growth rate (μmax), the end of the lag phase, the start of the stationary phase, and the carrying capacity were extracted. The doubling time td at the maximum growth rate was estimated as:

$${t}_{{{\rm{d}}}}=\frac{{\mbox{ln}}2}{{\mu }_{{{\rm{max}}}}}$$

(15)

Software

The MolE pre-training framework was implemented using the pytorch-geometric 1.6.3 framework57 with Python 3.758. We used the RandomForestClassifier and RandomForestRegressor implementations available in scikit-learn 1.0.259 and the XGBClassifier and XGBRegressor objects from xgboost 1.6.260. The scikit-learn 1.0.259 module was also used to compute the ROC-AUC, PR-AUC, and F1 score metrics. The R 4.3.1 language was used with the ggplot2 3.4.2 package for plotting and the Sicegar 0.2.4 package for growth-curve modeling. ECFP4 fingerprints, chemical descriptors, and general SMILES processing were handled with the rdkit 2020.09.1.0 package.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
