Representation of molecular structures with persistent homology for machine learning applications in chemistry
ABSTRACT Machine learning and high-throughput computational screening have been valuable tools in accelerated first-principles screening for the discovery of the next generation of
functionalized molecules and materials. The application of machine learning for chemical applications requires the conversion of molecular structures to a machine-readable format known as a
molecular representation. The choice of such representations impacts the performance and outcomes of chemical machine learning methods. Herein, we present a new concise molecular
representation derived from persistent homology, an applied branch of mathematics. We have demonstrated its applicability in a high-throughput computational screening of a large molecular
database (GDB-9) with more than 133,000 organic molecules. Our target is to identify novel molecules that selectively interact with CO2. The methodology and performance of the novel
molecular fingerprinting method are presented, and the new chemically-driven persistence image representation is used to screen the GDB-9 database to suggest molecules and/or functional groups
with enhanced properties.
INTRODUCTION The increasing concentration of greenhouse gases has been identified as a primary factor of many facets of environmental degradation such as higher
global temperature, rising sea levels, increased ocean acidity, and more extreme weather-related events. CO2 is the most prominent greenhouse gas, and its atmospheric concentration has
exceeded 400 ppm, which is more than a 40% increase from pre-industrial conditions, potentially leading to a rise in global temperatures of more than 2 °C by the year 2100 (ref. 1). Lowering CO2
emissions is therefore mandatory to meet ambitions to limit temperature increases to 1.5 °C (ref. 2). Advancements in carbon capture and storage technology are desired for meeting these goals of
lowered atmospheric greenhouse gas emissions and reduced global temperature increases year to year. At an industrial level, liquid amine-based solvents are used for separation and capture of
CO2 via chemisorption, but the solvent regeneration step is an energy-intensive process. Membrane-based technologies offer an alternative, cost-effective process for CO2 separation. Unlike solvents,
where chemisorption involves a reaction with binding strengths exceeding 20 kcal mol−1 through the creation of chemical bonds between CO2 and solvent, membranes utilize much weaker
noncovalent interactions. Different types of materials have been suggested for the fabrication of permeable membranes including amorphous, non-porous polymeric membranes3,4,5,6, or
crystalline materials with permanent porosity such as metal-organic frameworks (MOFs) or zeolites7. The understanding of how the atomistic structure of materials affects the gas selectivities
is a crucial process for the development of more efficient carbon capture technologies. Most often, this involves the separation of CO2 from N2, which are two atmospheric gases with similar
kinetic diameters, making size-sieving a challenging task. The introduction of functional groups which selectively interact with CO2 has been a successful approach for increasing membrane
performance6,8. Such CO2-philic functional groups (usually Lewis bases) can be either introduced into the framework of porous crystalline materials (e.g., MOFs) or functionalized into the
repeat units of non-porous polymeric membranes. Electronic structure theory calculations between molecular units and the respective gases provide a quantification of these noncovalent
interactions, as well as elucidate their nature and properties9,10,11,12,13,14,15. However, the number of potential CO2-philic groups is intractably large, which makes an exhaustive study
of such systems with accurate ab initio methods impractical. In addition, the determination of gas interaction energies may require multiple calculations to evaluate competitive gas binding sites for
every structure, which further increases the computational cost and expert intervention. However, high-throughput computational screening can accelerate the discovery of new, functional
materials for rational synthesis through the circumvention of the expensive and time-demanding synthesis and testing process16,17. High-throughput computational design has shown great
success in identifying new molecules18,19 and materials20,21,22,23 with enhanced properties and advanced functionality. For many applications, first-principles studies are essential to
virtual screening, but the high computational cost of these methods makes the search of large parts of the chemical space cost-prohibitive. In recent years, machine learning (ML) has become
a valuable tool in reducing the cost of a systematic chemical space exploration by enhancing the search for structure-property relationships24,25,26, guiding molecular design27,28,29,30, and
predicting electronic structure properties31,32,33,34,35. ML algorithms are used for their ability to learn complicated relationships in data with high computational efficiency that can be
systematically improved through additional training data, but may require extensive training set sizes before predicting out-of-sample properties accurately. The efficiency of ML depends on
how these data are passed to the algorithm. For chemical applications, this occurs through molecular representations, which are the featurization of molecular compounds from their molecular
structure into a vector of values. The ML algorithm then infers the relation between the structure and the property of interest. Recent developments in the formulation of molecular
representations, particularly in the realm of quantum properties and structure-function relations, have increased the efficiency of ML for chemical applications36,37,38,39,40,41,42. In
addition, such representations generalize to more complex instances such as reaction barriers43. Despite the inherent ability of ML to extract important features, ML-model accuracy is
dependent on the molecular representation. A molecular representation reduces the dimensionality of a molecular structure into a chemically meaningful format that relays important chemical
information. For example, a chemical formula conveys a three-dimensional molecule as a string of characters but it is an ambiguous input for ML. Atomistic and molecular structure should be
converted into a machine-readable format that can be passed efficiently to ML algorithms as input features. One of the most prominent representations is the Coulomb matrix (CM) introduced by
Rupp et al., which is a square atom-by-atom matrix containing an approximate potential energy of the free atom along the diagonal and pair Coulombic potentials on the off-diagonal terms33.
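As a minimal illustration of the CM construction just described, the following Python sketch builds the matrix with NumPy. It assumes the commonly cited form of Rupp et al., with 0.5 Z_i^2.4 on the diagonal and Z_i Z_j / |R_i - R_j| off the diagonal; the function name `coulomb_matrix` and the input layout are illustrative choices and are not taken from the code released with this work.

```python
import numpy as np


def coulomb_matrix(charges, coords):
    """Return the atom-by-atom Coulomb matrix for one molecule.

    charges: nuclear charges Z_i, shape (n,)
    coords:  Cartesian coordinates R_i, shape (n, 3); the length unit is
             whatever the caller uses consistently (an assumption here).
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)

    # Pairwise interatomic distances |R_i - R_j|.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder so the division below is safe

    # Off-diagonal terms: pairwise Coulombic potential Z_i * Z_j / |R_i - R_j|.
    cm = np.outer(z, z) / dist

    # Diagonal terms: approximate free-atom potential energy, 0.5 * Z_i ** 2.4.
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm


if __name__ == "__main__":
    # Two atoms with charges 1 and 9 separated by 2.0 length units.
    cm = coulomb_matrix([1, 9], [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
    print(cm)  # off-diagonal entry: 1 * 9 / 2.0 = 4.5
```

Because the matrix has one row and column per atom, its size grows with the molecule, which is the property that the constant-size discussion later in this section contrasts with persistence images.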
An improvement over CM is typically observed using the Bag-of-Bonds (BoB) representation, where each atomic pair is placed in specific vectors (bags) based on the elemental pairs and sorted
by value34. Faber et al. have developed FCHL, a representation based on Gaussian distribution functions for the universal kernel ridge regression-based quantum machine-learning models42. In
addition, the smooth overlap of atomic positions (SOAP) representation44 calculates the local density of atoms around all atoms in a given chemical environment but suffers from an increased
computational cost over the pairwise CM and BoB representations. Herein, we present a new molecular representation scheme based on persistent homology, a branch of computational topology.
Application of persistent homology on molecules encodes three-dimensional structural data into two-dimensional persistence images. Since persistence images hold topological features of
chemical structures, we are suggesting them as alternative molecular fingerprints which are transposed into ML input and used to identify relationships in the data. A molecular
representation is introduced for encoding chemical structures, which is applied in the prediction of interaction energies of organic molecules with gas molecules. Persistence images offer a
similar-size ML vectorization regardless of system size, and a numerical example is given as a proof of this concept. Although CM and BoB do not provide a constant-size representation by this
definition, one is effectively achieved through padding empty cells. However, the input vector space takes the dimensions of the largest molecule in the data and requires significant padding
for smaller molecules, whereas the introduced method has a predefined vector size regardless of number of atoms. It has been suggested that using feature vectors with sizes independent of
system size may result in improved generalization between small and large systems45. The new method is used for screening a large database of organic molecules for the discovery of
CO2-philic functional groups. RESULTS PERSISTENCE DIAGRAMS The mapping of molecular structure to a chemically driven persistence image entails several steps. First, the molecular homological
features, which measure the connectedness, proximity, and the empty space among the atoms, are computed and stored. These homological features are summarized in a persistence diagram
(PD)46,47,48,49,50,51,52,53,54,55,56,57,58. A PD encodes molecular features such as bonds and rings. The PDs can then be vectorized into a persistence image59 (PI) for use as a molecular
representation. However, employing persistent homology and its derivative, PI, purely focuses on detecting topological attributes but lacks explicit incorporation of key chemical information
such as element identity, leading to limited applicability in molecular systems. Here, we describe the application of persistent homology with domain-specific knowledge for the generation
of persistence images based on atomic properties. The basic steps are presented in the following paragraphs. Anisole was selected as a representative example because it contains two distinct
functional units (phenyl- and methoxy groups). To construct a persistence diagram for a given molecule, spheres of a given radius centered at each atom are considered and, as the radius
increases, the spheres intersect and lead to the evolution of homological features, called connected components and holes. The connected components encode interatomic distances, while holes
describe molecular attributes such as rings and functional groups. The PDs hold information about the generation or birth and the lifetime length or persistence of connected components and
holes. The placement of a birth is denoted by its location along the _x_-axis of the PD, whereas the persistence is denoted by its location on the _y_-axis. Birth of connected components
occurs at 0, since every atom is given an initial sphere with radius 0 at the start of the algorithm (see anisole as an example in Fig. 1). The spheres are then systematically expanded (Fig.
1a, d, g, j, and m) until spherical intersections occur, which effectively generate a new connected component by merging older ones. The persistence of connected components is then recorded
on the associated PD. In some sense, a PD records the time in terms of spheres’ radii for the atoms to form a single cluster. For anisole, used here as an example, four different
types of connected components are generated. The first two appear at ~1.1 and correspond to the C–H bonds of the methoxy and phenyl groups, respectively. The other two appear at ~1.4 and
correspond to the C–O and C–C bonds, respectively. This means that the units of the two axes are given in angstroms (Å). When sphere intersections lead to the formation of connected atoms
(connected component) on a ring, for example when all the six spheres of the phenyl carbons have met, a hole is generated (Fig. 1l). The death of a hole occurs when all spheres that form a
given hole intersect, and its persistence is recorded on the persistence diagram (Fig. 1o). It is now evident that the connected components depend on the distances of neighboring atoms
and the holes correspond to topological features of the functional groups. Two holes are now formed for the anisole example of Fig. 1, which correspond to the phenyl and methoxy groups.
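The connected-component part of this construction can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions rather than the implementation used in this work: a component is taken to die when the filtration scale reaches the interatomic distance (consistent with the ~1.1 and ~1.4 Å values quoted above), merging is done with a simple union-find pass over sorted pairwise distances, and the function name `h0_persistence` is hypothetical. Holes require a full persistent homology computation and are not covered here.

```python
import numpy as np


def h0_persistence(coords):
    """Return the death scales of connected components.

    Every atom starts as its own component, born at 0. Pairs of atoms are
    visited in order of increasing distance; whenever a pair joins two
    previously separate components, one component dies at that distance.
    The last surviving component never dies and is not reported.
    """
    pts = np.asarray(coords, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

    # All unordered atom pairs, sorted by interatomic distance.
    pairs = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for d, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j   # merge the two components
            deaths.append(float(d))   # one component dies at scale d
    return deaths


if __name__ == "__main__":
    # Three collinear points with spacings of 1.1 and 1.4 length units.
    print(h0_persistence([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.5, 0.0, 0.0]]))
```

For n atoms the function returns n - 1 death values, each paired with a birth of 0, which is the connected-component column of points on the PD.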
These holes are unique for each respective unit and differentiate between different conformations of the same molecule and subtle differences in geometry60,61,62,63. CHEMICALLY DRIVEN
PERSISTENCE IMAGES AS MOLECULAR REPRESENTATIONS The PD is vectorized into a pixelized image, called persistence image (PI), which is a stable, computationally tractable representation59. PIs
are constructed by placing a Gaussian kernel centered at each point on the PD, as highlighted for anisole in Fig. 2, where the pixel intensity corresponds to the multiplicity (for
example, five C–H pairs in the phenyl group and three C–H pairs in the methyl group). Next, the surface is transferred into pixel values (Fig. 2b, c). The resulting image effectively encodes
the molecular geometry. The transformation of a PD to a PI may lead to inconsistencies if all atoms are treated identically, especially when molecular structures with the same geometries
but different atom types are encoded into a PI. For example, the diatomic molecules HBr and F2 have approximately the same bond distance (1.41 Å), and therefore generate the same PI (Fig.
2f, g, respectively). To ameliorate this shortcoming, we introduce atomistic information in the variance of the Gaussian kernels that yield a PI. The variance determines the spread of the
kernel, or how “smeared” each point on the PD is when placed onto the PI. This variance is chosen based on the atom type that created the point in the PD. Specifically, we define the
variance in the persistence images by the difference in electronegativity for connected components. Electronegativity differences are chosen because they provide a general description
of the nature of different bonds. For the example of HBr and F2, HBr has a very polar chemical bond, whereas molecular fluorine is nonpolar. Large variance is provided to atom pairs
with large electronegativity differences, which ultimately generates unique PIs (Fig. 2h, i, respectively). Our new chemically driven persistence image differentiates between molecules which
have similar geometric configurations but different atomic compositions. Another important feature is related to the dimensionality of the PI molecular representation which remains of the
same order with respect to the molecular size as we show empirically in Supplementary Figs. 2 and 3. This is what we call herein a similar-size representation. For example, for a small
molecule like anisole that is composed of 16 atoms, a 3 Å × 3 Å PD was generated. The equivalent PD of a medium-size molecule such as the _tert_-butylcalixarene (105 atoms) is of comparable
size (4 Å × 4 Å). Similarly, the PD of a large structure, the main protease of the new coronavirus identified as COVID-19 (ref. 64) in complex with an inhibitor N3 (2500 non-hydrogen atoms), has a size
of 6 Å × 6 Å. A detailed analysis is given in Supplementary Note 2. As mentioned in the introduction, such a similar-size representation is desirable for many chemical
applications45. PERFORMANCE OF PERSISTENCE IMAGES AS MOLECULAR REPRESENTATIONS Here, we demonstrate the performance of the chemically driven PIs on an application relevant to green
chemistry. Our aim is to screen a large molecular database in order to discover molecular groups that show a stronger affinity for CO2 interaction over N2. Such molecular groups can be
introduced in polymeric materials for the development of the next generation of functional gas separation membranes. Since it is desirable to avoid any density functional theory
(DFT)-optimized geometries as input for ML models, which introduce a significant computational bottleneck for the screening of large molecular databases, we resort to structures generated by
the OpenBabel65 software package (gen3d function). Our target is to train an ML model that maps low-cost geometries to accurate quantum chemical data, so it can provide reliable
interaction energies for molecular species with geometries generated on-the-fly. We tested the performance of PIs as alternative molecular representations that effectively encode chemical
structures. An initial subset of 100 organic molecules was used, generated by the procedure described in Supplementary Note 3. The interaction energies of each of these structures
with CO2 and N2 were computed by means of DFT. We also compared PIs with the widely utilized Coulomb matrix (CM), Bag-of-Bonds (BoB), FCHL, and Smooth Overlap of Atomic Positions
(SOAP) representations, since each of these is produced with little computational burden and is widely implemented in a number of programming libraries66,67,68. PI, CM, BoB, FCHL, and SOAP
representations were generated for each structure and the performance of each representation scheme was evaluated for the prediction of gas interaction energies. For each scheme, a variety
of machine-learning algorithms were tested, including random forest, Gaussian process regression, and kernel ridge regression. A detailed analysis of the optimization process is given in
Supplementary Note 6. Overall, two machine-learning models were trained per molecular representation scheme, one for CO2 and one for N2 interaction energies. The 10-fold cross-validation
root-mean-squared errors (RMSEs) for the five trained models on the CO2 energies are shown in Fig. 3. The error bars represent the standard deviation of the RMSE. Similar results were obtained
for N2 interaction energies (see Supplementary Note 7). Comparing the best learners for each representation, CM showed the highest deviation (RMSE of 0.63 kcal mol−1), followed by BoB and
FCHL (RMSE of 0.52 and 0.50 kcal mol−1, respectively). The most accurate models were PI with kernel ridge regression (Laplacian kernel, 0.44 kcal mol−1) and SOAP with kernel ridge regression
(linear kernel, 0.41 kcal mol−1), where PI showed a tighter variance, yielding a higher confidence in the predictions. SCREENING THE GDB-9 DATABASE High-throughput computational screening
using ML is an efficient method to survey molecules for numerous chemical applications. Here, we are applying the PI method for identifying molecules and functional groups that enhance CO2
interactions with little computational cost. ML models trained on DFT-quality data can estimate DFT-quality results for hundreds of thousands of systems within seconds, while the explicit
computation at the DFT level is a cost-prohibitive process. The GDB-9 database69 was screened, which includes 133,885 organic molecules containing no more than nine non-hydrogen atoms to
determine the most promising molecules for CO2 binding. The data from the initial 100 organic molecules discussed in the previous section do not adequately capture the properties of the
chemical space spanned by the GDB-9 database. The initial training set can be considered as biased since it contains largely N-containing heterocycles with small functionalizations on
aromatic carbons (see Supplementary Note 9). For surpassing this limitation and reliably screening the full GDB-9 space, we applied a methodology known as active learning. In active
learning, the training set is systematically expanded to capture the necessary missing physics to accurately predict for the targeted space. The top 40 molecules were selected with respect
to predicted CO2 interaction strength and further investigated by the MD/DFT scheme described in the computational details (Supplementary Note 4). Therefore, the training set was expanded to
better infer the relationship between the molecular representation and the chemical space spanning the GDB-9 database. We have repeated this process four times by considering different
yet optimized molecular representation methods (CM, BoB, SOAP, and PI). The individual steps that were followed are shown schematically in Fig. 4 and analyzed in the next paragraphs. Three
iterations were performed with each representation scheme together with the optimized machine-learning algorithm, as it is discussed in the previous section and in Supplementary Note 6.
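The loop described above can be sketched as follows. This is a minimal NumPy illustration, not the code used in this work: the Laplacian-kernel ridge regression is written out by hand, `oracle` is a hypothetical stand-in for the MD/DFT calculation of an interaction energy, and the hyperparameters `gamma` and `lam` are placeholder values. The defaults of three rounds and 40 molecules per round mirror the numbers in the text, so a 100-molecule starting set grows to 220.

```python
import numpy as np


def laplacian_kernel(a, b, gamma):
    """K[i, j] = exp(-gamma * ||a_i - b_j||_1)."""
    l1 = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * l1)


def krr_fit_predict(x_train, y_train, x_new, gamma=0.1, lam=1e-6):
    """Fit kernel ridge regression on the training data and predict x_new."""
    k = laplacian_kernel(x_train, x_train, gamma)
    alpha = np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)
    return laplacian_kernel(x_new, x_train, gamma) @ alpha


def active_learning(x_pool, oracle, start_idx, n_rounds=3, batch=40):
    """Grow the training set with the strongest predicted binders each round.

    x_pool:    feature vectors for every candidate molecule, shape (n, d)
    oracle:    callable index -> reference energy (stand-in for MD/DFT)
    start_idx: indices of the initial training molecules
    """
    labeled = [int(i) for i in start_idx]
    energy = {i: oracle(i) for i in labeled}

    for _ in range(n_rounds):
        # Candidates that have not been labeled yet.
        rest = np.array([i for i in range(len(x_pool)) if i not in energy])
        y_train = np.array([energy[i] for i in labeled])
        pred = krr_fit_predict(x_pool[labeled], y_train, x_pool[rest])

        # Most negative predicted energy = strongest predicted interaction.
        picked = rest[np.argsort(pred)[:batch]]
        for i in picked:
            energy[int(i)] = oracle(int(i))
            labeled.append(int(i))

    return labeled, energy
```

Selecting by predicted interaction strength, rather than at random, is what steers the added reference calculations toward the rare strongly binding molecules.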
Thus, the kernel ridge regression (Laplacian) was used for CM and PI, Gaussian process regression for BoB, and kernel ridge regression (linear) for SOAP. The active-learning process resulted
in a total of 220 data points, i.e., 220 molecular structures with their corresponding CO2 and N2 interaction energies computed by DFT per method. The distributions of the interaction energies
of these molecules for each method are visualized in Fig. 5. We set a mark at −6.0 kcal mol−1 for molecules with significantly strong CO2 interaction energy. For a detailed analysis of the
mean and median of each active-learning iteration per method, we refer the reader to the Supplementary Note 8. The first iteration contains only the original 100-molecule training set. By
expanding the training set with the 40 molecules from the first iteration, the next 40 best-predicted molecules from the model that utilizes the PI representation have significantly
improved. On the contrary, no significant changes were observed from the other three models. By the third iteration, the CO2 distribution from CM, BoB, and PI remained almost unchanged,
while a small shift toward stronger interaction energies was found for the SOAP model. Overall, PI showed the greatest performance since each respective iteration increased the number of
promising structures, from 10 (first iteration) to 43 (second iteration), and ultimately to 75 out of 120 molecules. Active learning with the PI molecular representation has systematically
expanded the training set to better represent the chemical space of the dataset, yielding more reliable predictions every round. Since promising structures are rare within the dataset, this
strategy allows the model to account for these rare instances within the training set in a way that would be impossible with a randomly chosen training set. In addition, the three top
candidates for CO2 separations were found to demonstrate stronger interaction energies than −6.50 kcal mol−1, which are shown in Fig. 4 (bottom, right). Our computational procedure allowed
us to discover new molecules with higher CO2 affinity that combine previously unknown binding motifs. In particular, we found that cooperative effects between N-containing heterocycles with
amino or hydroxo groups at the ortho position increase the CO2 interaction strength. The lone electron pair of nitrogen induces a dipole moment on CO2 that allows stronger interactions with hydrogen atoms
of the NH2− and/or OH− functional groups. After completing three iterations of active learning, the full dataset (220 molecules) is used to create an ML model for predictions on the whole
GDB-9 database. For comparison, the database was screened with the four different models, where each of them uses a different molecular representation method (CM, BoB, SOAP, and PI) and data
generated from the corresponding active-learning steps. The optimum learner for each method was used, as it was discussed in the previous section, except for BoB, where the kernel ridge
regression (linear) was applied (for a detailed discussion, see Supplementary Note 8). Figure 6 includes a plot for each model, where all predicted N2 and CO2 interaction energies are on the
_x_- and _y_-axis, respectively. Only the method that utilized the PI molecular representation was able to identify 4,5-diamino-1H-imidazol-2-ol as one of the molecules with the strongest
CO2 affinity as indicated with an orange dot on each plot of Fig. 6. 4,5-diamino-1H-imidazol-2-ol was part of the training set introduced in the second step of the active-learning process
(Fig. 4), and has a DFT CO2 interaction energy of −7.41 kcal mol−1. All methods agree that the majority of the molecular entries of the GDB-9 dataset have a mean CO2 interaction energy
centered between −3.0 and −4.0 kcal mol−1 and a N2 interaction energy at −2.0 kcal mol−1. Most of the molecules have predicted CO2 interaction energies between −3.0 and −5.0 kcal mol−1,
which emphasizes the difficulty in identifying new molecules with high CO2 affinity. However, CM and BoB were less effective in identifying rare instances. In contrast, results from SOAP
were significantly scattered, and predicted many cases with false CO2 interaction energies close to 0 kcal mol−1 (in a few cases, SOAP even predicted repulsive interaction energies,
Supplementary Note 8). Interestingly, training of machine-learning models with the CM, BoB or the SOAP representations that utilize the molecules identified by active learning with the PI
representation yielded more concise distributions, while all models identified the molecular species with the strongest CO2 interaction energy among the top candidates (Supplementary Fig. 7).
In other words, PI provided higher quality data for all methods. Therefore, the model that utilizes PIs for molecular representations provided the most consistent distributions for both CO2
and N2 interaction energies. The PI screening revealed a total of 44 of the 133,885 molecules with CO2 interaction energies exceeding −6.5 kcal mol−1. DFT calculations were performed for
verification of these results. It should also be mentioned that SOAP needed 88,287 s for screening the full GDB-9 database, while the screening with the novel PI representation was almost 40
times faster (only 2219 s). All screenings were performed on an Intel® i5-4278U processor. DISCUSSION A novel molecular representation utilizing persistence images with embedded chemical
bonding information has been introduced for predicting DFT-quality CO2/N2 interaction energies. From our investigation, this new chemically driven persistence image is a concise,
computationally efficient, and effective representation that generally outperforms other representations for prediction of CO2 interaction energies since the computational cost is low and
does not suffer from dimensionality problems. This representation accounts for underlying topological structure in the molecule, providing a method to control uncertainty due to differing
geometric configurations. The new methodology has been applied for the screening of the GDB-9 database to suggest new CO2-philic moieties. By using an active-learning approach, our ML-based
screening was able to identify many promising molecules in the GDB-9 database despite a very small training set (220 molecules). Specifically, 44 molecules were identified that exceed −6.5
kcal mol−1 CO2 interaction energy. In addition, candidates that may exhibit strong CO2 interactions while maintaining weak N2 interactions were examined, yielding a strategy for identifying
species with potentially strong gas separation capabilities. Ultimately, chemically driven persistence images are promising molecular representations for larger supermolecular systems due to
compact vectorization. Therefore, we believe that the chemically driven PI molecular representations can be applied in a plethora of chemical problems. The PI method described herein relies
on a topological representation of a molecular compound that allows a flexible summary of the diversity of the atomic geometries. Due to this flexibility in terms of topological
equivalence, the PI is robust and provides accurate predictions in contrast to other methods that need to learn rigid geometric representations. We are currently expanding the applicability
of the novel molecular fingerprinting method to high-throughput screening of molecular databases for catalysis and ligand-based lanthanide/actinide separations. For these types of chemical
applications, additional features are taken into consideration, such as intensity normalization when a PI is generated from the corresponding PD and predictability of properties of larger
molecules from data generated from smaller ones. DATA AVAILABILITY The data for the high-throughput computational screening are available in the Supplementary Information. CODE AVAILABILITY
The code for the numerical simulations is available at https://gitlab.com/voglab/PersistentImages_Chemistry.
CHANGE HISTORY 14 July 2020: An amendment to this paper has been published and can be accessed via a link at the top of the paper.
REFERENCES
1. Capellán-Pérez, I., Arto, I., Polanco-Martínez, J. M., González-Eguino, M. & Neumann, M. B. Likelihood of climate change pathways under uncertainty on fossil fuel resource availability. _Energy Environ. Sci._ 9, 2482–2496 (2016).
2. Hulme, M. 1.5 °C and climate research after the Paris Agreement. _Nat. Clim. Chang._ 6, 222–224 (2016).
3. Norahim, N., Yaisanga, P., Faungnawakij, K., Charinpanitkul, T. & Klaysom, C. Recent membrane developments for CO2 separation and capture. _Chem. Eng. Technol._ 41, 211–223 (2018).
4. Ahmad, J. et al. Recent advances in poly (amide-B-ethylene) based membranes for carbon dioxide (CO2) capture: a review. _Polym. Technol. Mater._ 58, 366–383 (2019).
5. Wang, Y. et al. Polymers of intrinsic microporosity for energy-intensive membrane-based gas separations. _Mater. Today Nano_ 3, 69–95 (2018).
6. Hong, T. et al. Impact of tuning CO2-philicity in polydimethylsiloxane-based membranes for carbon dioxide separation. _J. Memb. Sci._ 530, 213–219 (2017).
7. Sumida, K. et al. Carbon dioxide capture in metal organic frameworks. _Chem. Rev._ 112, 724–781 (2011).
8. Tian, Z., Dai, S. & Jiang, D.-e What can molecular simulation do for global warming? _Wiley Interdiscip. Rev. Comput. Mol. Sci._ 6, 173–197 (2016).
9. Vogiatzis, K. D., Mavrandonakis, A., Klopper, W. & Froudakis, G. E. Ab initio study of the interactions between CO2 and N-containing organic heterocycles. _ChemPhysChem_ 2, 374–383 (2009).
10. Tian, Z., Saito, T. & Jiang, D.-e Ab initio screening of CO2-philic groups. _J. Phys. Chem. A_ 119, 3848–3852 (2015).
11. Lee, H. M., Youn, I. S., Saleh, M., Lee, J. W. & Kim, K. S. Interactions of CO2 with various functional molecules. _Phys. Chem. Chem. Phys._ 17, 10925–10933 (2015).
12. Chen, L., Cao, F. & Sun, H. Ab initio study of the _π_–_π_ Interactions between CO2 and benzene, pyridine, and pyrrole. _Int. J. Quantum Chem._ 113, 2261–2266 (2013).
13. Hussain, M. A., Soujanya, Y. & Sastry, G. N. Evaluating the efficacy of amino acids as CO2 capturing agents: a first principles investigation. _Environ. Sci. Technol._ 45, 8582–8588 (2011).
14. Townsend, J., Braunscheidel, N. M. & Vogiatzis, K. D. Understanding the nature of weak interactions between functionalized boranes and N2/O2, promising functional groups for gas separations. _J. Phys. Chem. A_ 123, 3315–3325 (2019).
15. Hymel, J. H., Townsend, J. & Vogiatzis, K. D. CO2 capture on functionalized calixarenes: a computational study. _J. Phys. Chem. A_ 123, 10116–10122 (2019).
16. Kim, J., Abouelnasr, M., Lin, L. C. & Smit, B. Large-scale screening of zeolite structures for CO2 membrane separations. _J. Am. Chem. Soc._ 135, 7545–7552 (2013).
17. Haldoupis, E., Nair, S. & Sholl, D. S. Finding MOFs for highly selective CO2/N2 adsorption using materials screening based on efficient assignment of atomic point charges. _J. Am. Chem. Soc._ 134, 4313–4323 (2012).
18. Nandy, A., Duan, C., Janet, J. P., Gugler, S. & Kulik, H. J. Strategies and software for machine learning accelerated discovery in transition metal chemistry. _Ind. Eng. Chem. Res._ 57, 13973–13986 (2018).
19. Hachmann, J. et al. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. _J. Phys. Chem. Lett._ 2, 2241–2251 (2011).
20. Persson, K. A. et al. Commentary: The Materials Project: a materials genome approach to accelerating materials innovation. _APL Mater._ 1, 011002 (2013).
21. Levy, O. et al. The high-throughput highway to computational materials design. _Nat. Mater._ 12, 191–201 (2013).
22. Nørskov, J. K. & Bligaard, T. The catalyst genome. _Angew. Chem. Int. Ed._ 52, 776–777 (2013).
23. Collins, K. D., Gensch, T. & Glorius, F. Contemporary screening approaches to reaction discovery and development. _Nat. Chem._ 6, 859–871 (2014).
24. Ma, X., Li, Z., Achenie, L. E. & Xin, H. Machine-learning-augmented chemisorption model for CO2 electroreduction catalyst screening. _J. Phys. Chem. Lett._ 6, 3528–3533 (2015).
25. Janet, J. P. & Kulik, H. J. Predicting electronic structure properties of transition metal complexes with neural networks. _Chem. Sci._ 8, 5137–5152 (2017).
26. Li, Z., Ma, X. & Xin, H. Feature engineering of machine-learning chemisorption models for catalyst design. _Catal. Today_ 280, 232–238 (2017).
27. Gómez-Bombarelli, R.
et al. Automatic chemical design using a data-driven continuous representation of molecules. _ACS Cent. Sci._ 4, 268–276 (2018). PubMed PubMed Central Google Scholar * Olivecrona, M.,
Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. _J. Cheminform._ 9, 1–14 (2017). Google Scholar * Sanchez-Lengeling, B. &
Aspuru-Guzik, A. Inverse molecular design using machine learning:Generative models for matter engineering. _Science_ 361, 360–365 (2018). ADS CAS PubMed Google Scholar * Sturluson, A.,
Huynh, M. T., York, A. H. P. & Simon, C. M. Eigencages: learning a latent space of porous cage molecules. _ACS Cent. Sci._ 4, 1663–1676 (2018). CAS PubMed PubMed Central Google
Scholar * Hansen, K. et al. Assessment and validation of machine learning methods for predicting molecular atomization energies. _J. Chem. Theory Comput._ 9, 3404–3419 (2013). CAS PubMed
Google Scholar * Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Big data meets quantum chemistry approximations: the _Δ_-machine learning approach. _J. Chem. Theory
Comput._ 11, 2087–2096 (2015). CAS PubMed Google Scholar * Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies
with machine learning. _Phys. Rev. Lett._ 108, 058301 (2012). ADS PubMed Google Scholar * Hansen, K. et al. Machine learning predictions of molecular properties: accurate many-body
potentials and nonlocality in chemical space. _J. Phys. Chem. Lett._ 6, 2326–2331 (2015). CAS PubMed PubMed Central Google Scholar * De, S., Bartók, A. P., Csányi, G. & Ceriotti, M.
Comparing molecules and solids across structural and alchemical space. _Phys. Chem. Chem. Phys._ 18, 13754–13769 (2016). CAS PubMed Google Scholar * Faber, F., Lindmaa, A., von
Lilienfeld, O. A. & Armiento, R. Crystal structure representations for machine learning models of formation energies. _Int. J. Quantum Chem._ 115, 1094–1101 (2015). CAS Google Scholar
* Faber, F. A. et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. _J. Chem. Theory Comput._ 13, 5255–5264 (2017). CAS PubMed Google Scholar *
Huang, B. & von Lilienfeld, O. A. Communication: understanding molecular representations in machine learning: the role of uniqueness and target similarity. _J. Chem. Phys_. 145, 161102
(2016). ADS PubMed Google Scholar * Bereau, T., Andrienko, D. & von Lilienfeld, O. A. Transferable atomic multipole machine learning models for small organic molecules. _J. Chem.
Theory Comput._ 11, 3225–3233 (2015). CAS PubMed Google Scholar * Browning, N. J., Ramakrishnan, R., von Lilienfeld, O. A. & Roethlisberger, U. Genetic optimization of training sets
for improved machine learning models of molecular properties. _J. Phys. Chem. Lett._ 8, 1351–1359 (2017). CAS PubMed Google Scholar * Bartók, A. P. et al. Machine learning unifies the
modeling of materials and molecules. _Sci. Adv._ 3, e1701816 (2017). ADS PubMed PubMed Central Google Scholar * Faber, F. A., Christensen, A. S., Huang, B. & von Lilienfeld, O. A.
Alchemical and structural distribution based representation for improved QML. _J. Chem. Phys._ 148, 241717 (2018). ADS PubMed Google Scholar * Meyer, B., Sawatlon, B., Heinen, S., von
Lilienfeld, O. A. & Corminboeuf, C. Machine learning meets volcano plots: Computational discovery of cross-coupling catalysts. _Chem. Sci._ 9, 7069–7077 (2018). CAS PubMed PubMed
Central Google Scholar * Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. _Phys. Rev. B - Condens. Matter Mater. Phys._ 87, 1–16 (2013). Google Scholar *
Collins, C. R., Gordon, G. J., Von Lilienfeld, O. A. & Yaron, D. J. Constant size descriptors for accurate machine learning models of molecular properties. _J. Chem. Phys._ 148, 241718
(2018). ADS PubMed Google Scholar * Bendich, P., Marron, J. S., Miller, E., Pieloch, A. & Skwerer, S. Persistent homology analysis of brain artery trees. _Ann. Appl. Stat._ 10,
198–218 (2016). MathSciNet PubMed PubMed Central Google Scholar * Kramar, M., Goullet, A., Kondic, L. & Mischaikow, K. Persistence of force networks in compressed granular media.
_Phys. Rev. E_ 87, 042207 (2013). ADS CAS Google Scholar * Taylor, D. et al. Topological data analysis of contagion maps for examining spreading processes on networks. _Nat. Comm._ 6,
7723 (2015). ADS CAS Google Scholar * Takiyama, A., Teramoto, T., Suzuki, H., Yamashiro, K. & Tanaka, S. Persistent homology index as a robust quantitative measure of
immunohistochemical scoring. _Sci. Rep._ 7, 14002 (2017). ADS PubMed PubMed Central Google Scholar * Marchese, A. & Maroulas, V. Signal classification with a point process distance
on the space of persistence diagrams. _Adv. Data Anal. Classif._ 12, 657–682 (2018). MathSciNet MATH Google Scholar * Maroulas, V., Nasrin, F. & Oballe, C. A bayesian framework for
persistent homology. _SIAM J. Math. Data Sci._ 2, 48–74 (2020). MathSciNet Google Scholar * Maroulas, V., Mike, J. L. & Oballe, C. Nonparametric estimation of probability density
functions of random persistence diagrams. _J. Mach. Learn. Res._ 20, 1–49 (2019). MathSciNet MATH Google Scholar * Maroulas, V., Micucci, C. P. & Spannaus, A. stable cardinality
distance for topological classification. _Adv. Data Anal. Classi_. 1–18, https://link.springer.com/article/10.1007%2Fs11634-019-00378-3 (2019). * Cang, Z. & Wei, G. W. Analysis and
prediction of protein folding energy changes upon mutation by element specific persistent homology. _Bioinformatics_ 33, 3549–3557 (2017). CAS PubMed Google Scholar * Cang, Z. & Wei,
G.-W. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. _Int. J. Numer. Meth. Bio._ 34, e2914 (2018). Google Scholar *
Lee, Y. et al. Quantifying similarity of pore-geometry in nanoporous materials. _Nat. Comm._ 8, 15396 (2017). ADS CAS Google Scholar * Lee, Y. et al. High-throughput screening approach
for nanoporous materials genome using topological data analysis: application to zeolites. _J. Chem. Theory Comput._ 14, 4427–4437 (2018). CAS PubMed PubMed Central Google Scholar *
Kimura, M., Obayashi, I., Takeichi, Y., Murao, R. & Hiraoka, Y. Non-empirical identification of trigger sites in heterogeneous processes using persistent homology. _Sci. Rep._ 8, 1–9
(2018). Google Scholar * Adams, H. et al. Persistence images: a stable vector representation of persistent homology. _J. Mach. Learn. Res._ 18, 218–252 (2017). MathSciNet Google Scholar *
Zomorodian, A. & Carlsson, G. Computing persistent homology. _Discret., Comp. Geom._ 33, 249–274 (2005). MathSciNet MATH Google Scholar * Ghrist, R. Barcodes: the persistent topology
of data. _Bull. Am. Math. Soc._ 45, 61–75 (2008). MathSciNet MATH Google Scholar * Wasserman, L. Topological data analysis. _Annu. Rev. Stat. Appl._ 5, 501–532 (2018). MathSciNet Google
Scholar * Edelsbrunner, H. & Harer, J. _Computational Topology: an Introduction_ (American Mathematical Soc., 2010). * Jin, Z. et al. Structure of M_p__r__o_ from covid-19 virus
and discovery of its inhibitors. _Nature_. https://www.nature.com/articles/s41586-020-2223-y (2020). * Boyle, N. M. O. et al. Open Babel: an open chemical toolbox. _J. Cheminform._ 3, 1–14
(2011). Google Scholar * Himanen, L. et al. DScribe: library of descriptors for machine learning in materials science. _Comput. Phys. Commun._ 247, 106949 (2020). CAS Google Scholar *
Christensen, A. et al. _QML: A Python Toolkit for Quantum Machine Learning_. https://github.com/qmlcode/qml (2017). * Collins, C. R. MolML. https://github.com/crcollins/molml (2017). *
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. _Sci. Data_ 1, 1–7 (2014). Google Scholar Download
references ACKNOWLEDGEMENTS This material is based on work supported by the National Science Foundation under Grant CHE-1800237 (J.T., J.H.H., and K.D.V.), and by the ARO Grant #
W911NF-17-1-0313 and the NSF DMS-1821241 (C.P.M. and V.M.). This work used the Advanced Computer Facility (ACF) of the University of Tennessee. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS *
Department of Chemistry, University of Tennessee, Knoxville, TN, 37996-1600, USA Jacob Townsend, John H. Hymel & Konstantinos D. Vogiatzis * Department of Mathematics, University of
Tennessee, Knoxville, TN, 37996-1320, USA Cassie Putman Micucci & Vasileios Maroulas CONTRIBUTIONS V.M. and K.D.V. conceived the project. J.T. and C.P.M. wrote the code. J.T. and J.H.H. performed the calculations.
J.T., C.P.M., V.M., and K.D.V. wrote the paper. CORRESPONDING AUTHORS Correspondence to Vasileios Maroulas or Konstantinos D. Vogiatzis. ETHICS DECLARATIONS COMPETING INTERESTS The authors
declare no competing interests. ADDITIONAL INFORMATION PEER REVIEW INFORMATION _Nature Communications_ thanks Reinhard Maurer and the other, anonymous, reviewer(s) for their contribution to
the peer review of this work. Peer reviewer reports are available. PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional
affiliations. RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0
International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS ARTICLE Townsend, J., Micucci, C.P., Hymel, J.H. _et al._ Representation of molecular
structures with persistent homology for machine learning applications in chemistry. _Nat Commun_ 11, 3230 (2020). https://doi.org/10.1038/s41467-020-17035-5 * Received: 11 January 2020 * Accepted: 28 May 2020 * Published: 26 June 2020 * DOI: https://doi.org/10.1038/s41467-020-17035-5