Representation of molecular structures with persistent homology for machine learning applications in chemistry

Nature

Representation of molecular structures with persistent homology for machine learning applications in chemistry"


Play all audios:

    

ABSTRACT Machine learning and high-throughput computational screening have been valuable tools in accelerated first-principles screening for the discovery of the next generation of


functionalized molecules and materials. The application of machine learning for chemical applications requires the conversion of molecular structures to a machine-readable format known as a


molecular representation. The choice of such representations impacts the performance and outcomes of chemical machine learning methods. Herein, we present a new concise molecular


representation derived from persistent homology, an applied branch of mathematics. We have demonstrated its applicability in a high-throughput computational screening of a large molecular


database (GDB-9) with more than 133,000 organic molecules. Our target is to identify novel molecules that selectively interact with CO2. The methodology and performance of the novel


molecular fingerprinting method is presented and the new chemically-driven persistence image representation is used to screen the GDB-9 database to suggest molecules and/or functional groups


with enhanced properties. SIMILAR CONTENT BEING VIEWED BY OTHERS PERSISTENT DIRAC FOR MOLECULAR REPRESENTATION Article Open access 11 July 2023 RAPID PREDICTION OF MOLECULAR CRYSTAL


STRUCTURES USING SIMPLE TOPOLOGICAL AND PHYSICAL DESCRIPTORS Article Open access 11 November 2024 RAPID TRAVERSAL OF VAST CHEMICAL SPACE USING MACHINE LEARNING-GUIDED DOCKING SCREENS Article


Open access 13 March 2025 INTRODUCTION The increasing concentration of greenhouse gases has been identified as a primary factor of many facets of environmental degradation such as higher


global temperature, rising sea levels, increased ocean acidity, and more extreme weather-related events. CO2 is the most prominent greenhouse gas, and its atmospheric concentration has


exceeded 400 ppm, which is more than a 40% increase from pre-industrial conditions, potentially leading to a rise in global temperatures of more than 2 ∘C by the year 21001. Lowering CO2


emissions is therefore mandatory to meet ambitions to limit temperature increases to 1.5 ∘C2. Advancements in carbon capture and storage technology are desired for meeting these goals of


lowered atmospheric greenhouse gas emissions and reduced global temperature increases year to year. At an industrial level, liquid amine-based solvents are used for separation and capture of


CO2 via chemisorption, but the solvent regeneration step is an energy intensive process. Membrane-based technologies offer an alternative, cost-effective process for CO2. Unlike solvents,


where chemisorption involves a reaction with binding strengths exceeding 20 kcal mol−1 through the creation of chemical bonds between CO2 and solvent, membranes utilize much weaker


noncovalent interactions. Different types of materials have been suggested for the fabrication of permeable membranes including amorphous, non-porous polymeric membranes3,4,5,6, or


crystalline materials with permanent porosity such as metal-organic frameworks (MOFs) or zeolites7. The understanding of how the atomistic structure of materials affects the gas selectivies


is a crucial process for the development of more efficient carbon capture technologies. Most often, this involves the separation of CO2 from N2, which are two atmospheric gases with similar


kinetic diameters, making size-sieving a challenging task. The introduction of functional groups which selectively interact with CO2 has been a successful approach for increasing membrane


performance6,8. Such CO2-philic functional groups (usually Lewis bases) can be either introduced into the framework of porous crystalline materials (e.g., MOFs) or functionalized into the


repeat units of non-porous polymeric membranes. Electronic structure theory calculations between molecular units and the respective gases provide a quantification of these noncovalent


interactions, as well as elucidate their nature and properties9,10,11,12,13,14,15. However, the number of potential CO2-philic groups is intractably large, which leads to an excessive study


of such systems with accurate ab initio methods. In addition, the determination of gas interaction energies may require multiple calculations to evaluate competitive gas binding sites for


every structure, which further increases the computational cost and expert intervention. However, high-throughput computational screening can accelerate the discovery of new, functional


materials for rational synthesis through the circumvention of the expensive and time-demanding synthesis and testing process16,17. High-throughput computational design has shown great


success in identifying new molecules18,19 and materials20,21,22,23 with enhanced properties and advanced functionality. For many applications, first-principles studies are essential to


virtual screening, but the high computational cost of these methods makes the search of large parts of the chemical space cost-prohibitive. In recent years, machine learning (ML) has become


a valuable tool in reducing the cost of a systematic chemical space exploration by enhancing the search for structure-property relationships24,25,26, guiding molecular design27,28,29,30, and


predicting electronic structure properties31,32,33,34,35. ML algorithms are used for their ability to learn complicated relationships in data with high computational efficiency that can be


systematically improved through additional training data, but may require extensive training set sizes before predicting out-of-sample properties accurately. The efficiency of ML depends on


how these data are passed to the algorithm. For chemical applications, this occurs through molecular representations, which are the featurization of molecular compounds from their molecular


structure into a vector of values. The ML algorithm then infers the relation between the structure and the property of interest. Recent developments in the formulation of molecular


representations, particularly in the realm of quantum properties and structure-function relations, have increased the efficiency of ML for chemical applications36,37,38,39,40,41,42. In


addition, such representations generalize to more complex instances such as reaction barriers43. Despite the inherent ability of ML to extract important features, ML-model accuracy is


dependent on the molecular representation. A molecular representation reduces the dimensionality of a molecular structure into a chemically meaningful format that relays important chemical


information. For example, a chemical formula conveys a three-dimensional molecule as a string of characters but it is an ambiguous input for ML. Atomistic and molecular structure should be


converted into a machine-readable format that can be parsed efficiently to ML algorithms as input features. One of the most prominent representations is the Coulomb matrix (CM) introduced by


Rupp et al., which is a square atom-by-atom matrix containing an approximate potential energy of the free atom along the diagonal and pair Coulombic potentials on the off-diagonal terms33.


An improvement over CM is typically observed using the Bag-of-Bonds (BoB) representation, where each atomic pair is placed in specific vectors (bags) based off the elemental pairs and sorted


by value34. Faber et al. have developed FCHL, a representation based on Gaussian distribution functions for the universal kernel ridge regression-based quantum machine-learning models42. In


addition, the smooth overlap of atomic positions (SOAP) representation44 calculates the local density of atoms around all atoms in a given chemical environment but suffers from an increased


computational cost over the pairwise CM and BoB representations. Herein, we present a new molecular representation scheme based on persistent homology, a branch of computational topology.


Application of persistent homology on molecules encodes three-dimensional structural data into two-dimensional persistence images. Since persistence images hold topological features of


chemical structures, we are suggesting them as alternative molecular fingerprints which are transposed into ML input and used to identify relationships in the data. A molecular


representation is introduced for encoding chemical structures, which is applied in the prediction of interaction energies of organic molecules with gas molecules. Persistence images offer a


similar-size ML vectorization regardless of system size, and a numerical example is given as a proof of this concept. Whereas CM and BoB do not provide a constant-size representation by this


definition, it is effectively achieved through padding empty cells. However, the input vector space takes the dimensions of the largest molecule in the data and requires significant padding


for smaller molecules, whereas the introduced method has a predefined vector size regardless of number of atoms. It has been suggested that using feature vectors with sizes independent of


system size may result in improved generalization between small and large systems45. The new method is used for screening a large database of organic molecules for the discovery of


CO2-philic functional groups. RESULTS PERSISTENCE DIAGRAMS The mapping of molecular structure to a chemically driven persistence image entails several steps. First, the molecular homological


features, which measure the connectedness, proximity, and the empty space among the atoms, are computed and stored. These homological features are summarized in a persistence diagram


(PD)46,47,48,49,50,51,52,53,54,55,56,57,58. A PD encodes molecular features such as bonds and rings. The PDs can then be vectorized into a persistence image59 (PI) for use as a molecular


representation. However, employing persistent homology and its derivative, PI, purely focuses on detecting topological attributes but lacks explicit incorporation of key chemical information


such as element identity, leading to limited applicability in molecular systems. Here, we describe the application of persistent homology with domain-specific knowledge for the generation


of persistence images based on atomic properties. The basic steps are presented in the following paragraphs. Anisole was selected as a representative example because it contains two distinct


functional units (phenyl- and methoxy groups). To construct a persistence diagram for a given molecule, spheres of a given radius centered at each atom are considered and, as the radius


increases, the spheres intersect and lead to the evolution of homological features, called connected components and holes. The connected components encode interatomic distances, while holes


describe molecular attributes such as rings and functional groups. The PDs hold information about the generation or birth and the lifetime length or persistence of connected components and


holes. The placement of a birth is denoted by its location along the _x_-axis of the PD, whereas the persistence is denoted by its location on the _y_-axis. Birth of connected components


occurs at 0, since every atom is given an initial sphere with radius 0 at the start of the algorithm (see anisole as an example on Fig. 1). The spheres are then systematically expanded (Fig.


 1a, d, g, j, and m) until spherical intersections occur, which effectively generate a new connected component by merging older ones. The persistence of connected components is then recorded


on the associated PD. In some sense, a PD records the time in terms of spheres’ radii for the atoms to form a single cluster. For anisole that is used here as an example, four different


types of connected components are generated. The first two appear at ~1.1 and correspond to the C–H bonds of the methoxyl and phenyl groups, respectively. The other two appear at ~1.4 and


correspond to the C–O and C–C bonds, respectively. This means that the units of the two axes are given in angstroms (Å). When sphere intersections lead to the formation of connected atoms


(connected component) on a ring, for example when all the six spheres of the phenyl carbons have met, a hole is generated (Fig. 1l). The death of a hole occurs when all spheres that form a


given hole intersect, and its persistence is recorded on the persistence diagram (Fig. 1o). It becomes now evident that the connected components depend on the distances of neighboring atoms


and the holes correspond to topological features of the functional groups. Two holes are now formed for the anisole example of Fig. 1, which correspond to the phenyl and methoxy groups.


These holes are unique for each respective unit and differentiate between different conformations of the same molecule and subtle differences in geometry60,61,62,63. CHEMICALLY DRIVEN


PERSISTENCE IMAGES AS MOLECULAR REPRESENTATIONS The PD is vectorized into a pixelized image, called persistence image (PI), which is a stable, computationally tractable representation59. PIs


are constructed by placing a Gaussian kernel centered at each point on the PD as it is highlighted for anisole in Fig. 2, where the pixel intensity corresponds to the multiplicity. For


example, five C–H pairs in the phenyl group, three C–H pairs in the methyl group etc. Next, the surface is transferred into pixel values (Fig. 2b, c). The resulting image effectively encodes


the molecular geometry. The transformation of a PD to a PI may lead to inconsistencies if all atoms are treated identically, especially when molecular structures with the same geometries


but different atom types are encoded into a PI. For example, the diatomic molecules HBr and F2 have approximately the same bond distance (1.41 Å), and therefore generate the same PI (Fig. 


2f, g, respectively). To ameliorate this shortcoming, we introduce atomistic information in the variance of the Gaussian kernels that yielded a PI. The variance determines the spread of the


kernel, or how “smeared” each point on the PD is when placed onto the PI. This variance is chosen based on the atom type that created the point in the PD. Specifically, we define the


variance in the persistence images by the difference in electronegativity for connected components. Electronegativity differences are chosen because they provide a general description


between the nature of different bonds. For the example of HBr and F2, HBr has a very polar chemical bond, whereas molecular fluorine is nonpolar. Large variance is provided to atom pairs


with large electronegativity differences, which ultimately generates unique PIs (Fig. 2h, i, respectively). Our new chemically driven persistence image differentiates between molecules which


have similar geometric configurations but different atomic compositions. Another important feature is related to the dimensionality of the PI molecular representation which remains of the


same order with respect to the molecular size as we show empirically in Supplementary Figs. 2 and 3. This is what we call herein a similar-size representation. For example, for a small


molecule like anisole that is composed of 16 atoms, a 3 Å × 3 Å PD was generated. The equivalent PD of a medium-size molecule such as the _tert_-butylcalixarene (105 atoms) is of comparable


size (4 Å × 4 Å). Similarly, the PD of a large structure, the main protease of the new coronavirus identified as COVID-1964 in complex with an inhibitor N3 (2500 non-hydrogen atoms) has size


of 6 Å × 6 Å. A detailed analysis is given in Supplementary Note 2. As it was mentioned in the introduction, such a similar-size representation is desirable for many chemical


applications45. PERFORMANCE OF PERSISTENCE IMAGES AS MOLECULAR REPRESENTATIONS Here, we demonstrate the performance of the chemically driven PIs on an application relevant to green


chemistry. Our aim is to screen a large molecular database in order to discover molecular groups that show a stronger affinity for CO2 interaction over N2. Such molecular groups can be


introduced in polymeric materials for the development of the next generation of functional gas separation membranes. Since it is desirable to avoid any density functional theory


(DFT)-optimized geometries as input for ML models, which introduce a significant computational bottleneck for the screening of large molecular databases, we resort to structures generated by


the OpenBabel65 software package (gen3d function). Our target is to train a ML model that maps low-cost geometries with accurate quantum chemical data, so it can provide reliable


interaction energies for molecular species with geometries generated on-the-fly. We tested the performance of PI as alternative molecular representations that effectively encode chemical


structures. The initial subset of 100 organic molecules was used, generated based on the procedure described in Supplementary Note 3. The interaction energies of each of these structures


with CO2 and N2 were computed by means of DFT. We also wanted to compare PI with the widely utilized Coulomb matrices (CM), Bag-of-Bonds (BoB), FCHL, and Smooth Overlap of Atomic Positions


(SOAP) representations, since each of these are produced with little computational burden and widely implemented in a number of programming libraries66,67,68. PI, CMs, BoBs, FCHL, and SOAP


representations were generated for each structure and the performance of each representation scheme was evaluated for the prediction of gas interaction energies. For each scheme, a variety


of machine-learning algorithms were tested, including random forest, Gaussian process regression, and kernel ridge regression. A detailed analysis of the optimization process is given in


Supplementary Note 6. Overall, two machine-learning models were trained per molecular representation scheme, one for CO2 and one for N2 interaction energies. The 10-fold cross validation


root-mean-squared error (RMSE) for the five trained models on the CO2 energies are shown in Fig. 3. The error bars represent the standard deviation of the RMSE. Similar results were obtained


for N2 interaction energies (see Supplementary Note 7). Comparing the best learners for each representation, CM showed the highest deviation (RMSE of 0.63 kcal mol−1), followed by BoB and


FCHL (RMSE of 0.52 and 0.50 kcal mol−1, respectively). The most accurate models were PI with kernel ridge regression (Laplacian kernel, 0.44 kcal mol−1) and SOAP with kernel ridge regression


(linear kernel, 0.41 kcal mol−1), where PI showed a tighter variance, yielding a higher confidence in the predictions. SCREENING THE GDB-9 DATABASE High-throughput computational screening


using ML is an efficient method to survey molecules for numerous chemical applications. Here, we are applying the PI method for identifying molecules and functional groups that enhance CO2


interactions with little computational cost. ML models trained on DFT-quality data can estimate DFT-quality results for hundreds of thousands of systems within seconds, while the explicit


computation at the DFT level is a cost-prohibitive process. The GDB-9 database69 was screened, which includes 133,885 organic molecules containing no more than nine non-hydrogen atoms to


determine the most promising molecules for CO2 binding. The data from the initial 100 organic molecules discussed in the previous section do not adequately capture the properties of the


chemical space spanned by the GDB-9 database. The initial training set can be considered as biased since it contains largely N-containing heterocycles with small functionalizations on


aromatic carbons (see Supplementary Note 9). For surpassing this limitation and reliably screening the full GDB-9 space, we applied a methodology known as active learning. In active


learning, the training set is systematically expanded to capture the necessary missing physics to accurately predict for the targeted space. The top 40 molecules were selected with respect


to predicted CO2 interaction strength and further investigated by the MD/DFT scheme described in the computational details (Supplementary Note 4). Therefore, the training set was expanded to


better infer the relationship between the molecular representation and the chemical space spanning the GDB-9 database. We have repeated this processes four times by considering different


yet optimized molecular representation methods (CM, BoB, SOAP, and PI). The individual steps that were followed are shown schematically in Fig. 4 and analyzed in the next paragraphs. Three


iterations were performed with each representation scheme together with the optimized machine-learning algorithm, as it is discussed in the previous section and in Supplementary Note 6.


Thus, the kernel ridge regression (Laplacian) was used for CM and PI, Gaussian process regression for BoB, and kernel ridge regression (linear) for SOAP. The active-learning process resulted


a total of 220 data points, i.e., 220 molecular structures with their corresponding CO2 and N2 interaction energies computed by DFT per method. The distributions of the interaction energies


of these molecules for each method are visualized in Fig. 5. We set a mark at −6.0 kcal mol−1 for molecules with significantly strong CO2 interaction energy. For a detailed analysis of the


mean and median of each active-learning iteration per method, we refer the reader to the Supplementary Note 8. The first iteration contains only the original 100-molecule training set. By


expanding the training set with the 40 molecules from the first iteration, the next 40 best-predicted molecules from the model that utilizes the PI representation have significantly


improved. On the contrary, no significant changes were observed from the other three models. By the third iteration, the CO2 distribution from CM, BoB, and PI remained almost unchanged,


while a small shift toward stronger interaction energies was found for the SOAP model. Overall, PI showed the greatest performance since each respective iteration increased the number of


promising structures, from 10 (first iteration) to 43 (second iteration), and ultimately to 75 out of 120 molecules. Active learning with the PI molecular representation has systematically


expanded the training set to better represent the chemical space of the dataset, yielding more reliable predictions every round. Since promising structures are rare within the dataset, this


strategy allows the model to account for these rare instances within the training set in a way that would be impossible with a randomly chosen training set. In addition, the three top


candidates for CO2 separations were found to demonstrate stronger interaction energies than −6.50 kcal mol−1, which are shown in Fig. 4 (bottom, right). Our computational procedure allowed


us to discover new molecules with higher CO2 affinity that combine previously unknown binding motifs. In particular, we found that cooperative effects between N-containing heterocycles with


amino or hydroxo groups at ortho position increases the CO2 strength. The lone electron pair of nitrogen 6induces a dipole moment on CO2 that allows stronger interactions with hydrogen atoms


of the NH2− and/or OH− functional groups. After completing three iterations of active learning, the full dataset (220 molecules) is used to create a ML model for predictions on the whole


GDB-9 database. For comparison, the database was screened with the four different models, where each of them uses a different molecular representation method (CM, BoB, SOAP, and PI) and data


generated from the corresponding active-learning steps. The optimum learner for each method was used, as it was discussed in the previous section, except for BoB, where the kernel ridge


regression (linear) was applied (for a detailed discussion, see Supplementary Note 8). Figure 6 includes a plot for each model, where all predicted N2 and CO2 interaction energies are on the


_x_- and _y_-axis, respectively. Only the method that utilized the PI molecular representation was able to identify 4,5-diamino-1H-imidazol-2-ol as one of the molecules with the strongest


CO2 affinity as indicated with an orange dot on each plot of Fig. 6. 4,5-diamino-1H-imidazol-2-ol was part of the training set introduced in the second step of the active-learning process


(Fig. 4), and has a DFT CO2 interaction energy of −7.41 kcal mol−1. All methods agree that the majority of the molecular entries of the GDB-9 dataset have a mean CO2 interaction energy


centered between −3.0 and −4.0 kcal mol−1 and a N2 interaction energy at −2.0 kcal mol−1. Most of the molecules have predicted CO2 interaction energies between −3.0 and −5.0 kcal mol−1,


which emphasize the difficulty in determining new molecules with high CO2 affinity. However, CM and BoB were less effective in identifying rare instances. On the contrary, results from SOAP


were significantly scattered, and predicted many cases with false CO2 interaction energies close to 0 kcal mol−1 (in a few cases, SOAP even predicted repulsive interaction energies,


Supplementary Note 8). Interestingly, training of machine-learning models with the CM, BoB or the SOAP representations that utilize the molecules identified by active learning with the PI


representation yielded more concise distributions, while all models identified the molecular species with the stronger CO2 interaction energy among the top candidates (Supplementary Fig. 7).


In other words, PI provided higher quality data for all methods. Therefore, the model that utilizes PIs for molecular representations provided the most consistent distributions for both CO2


and N2 interaction energies. The PI screening revealed a total of 44 of the 133,885 molecules with CO2 interaction energies exceeding −6.5 kcal mol−1. DFT calculations were performed for


verification of these results. It should also be mentioned that SOAP needed 88,287 s for screening the full GDB-9 database, while the screening with the novel PI representation was almost 40


times faster (only 2219 s). All screenings were performed on an Intel® i5-4278U processor. DISCUSSION A novel molecular representation utilizing persistence images with embedded chemical


bonding information has been introduced for predicting DFT-quality CO2/N2 interaction energies. From our investigation, this new chemically driven persistence image is a concise,


computationally efficient, and effective representation that generally outperforms other representations for prediction of CO2 interaction energies since the computational cost is low and


does not suffer from dimensionality problems. This representation accounts for underlying topological structure in the molecule, providing a method to control uncertainty due to differing


geometric configurations. The new methodology has been applied for the screening of the GDB-9 database to suggest new CO2-philic moieties. By using an active-learning approach, our ML-based


screening was able to identify many promising molecules in the GDB-9 database despite a very small training set (220 molecules). Specifically, 44 molecules were identified that exceed −6.5 


kcal mol−1 CO2 interaction energy. In addition, candidates that may exhibit strong CO2 interactions while maintaining weak N2 interactions were examined, yielding a strategy for identifying


species with potentially strong gas separation capabilities. Ultimately, chemically driven persistence images are promising molecular representations for larger supermolecular systems due to


compact vectorization. Therefore, we believe that the chemically driven PI molecular representations can be applied in a plethora of chemical problems. The PI method described herein relies


on a topological representation of a molecular compound that allows a flexible summary of the diversity of the atomic geometries. Due to this flexibility in terms of topological


equivalence, the PI is robust and provides accurate predictions in contrast to other methods that need to learn rigid geometric representations. We are currently expanding the applicability


of the novel molecular fingerprinting method to high-throughput screening of molecular databases for catalysis and ligand-based lanthanide/actinide separations. For this type of chemical


applications, additional features are taken into consideration, such as intensity normalization when a PI is generated from the corresponding PD and predictability of properties of larger


molecules from data generated from smaller ones. DATA AVAILABILITY The data for the high-throughput computational screening are available in the Supplementary Information. CODE AVAILABILITY


The code for the numerical simulations is available at https://gitlab.com/voglab/PersistentImages_Chemistry. CHANGE HISTORY * _ 14 JULY 2020 An amendment to this paper has been published and


can be accessed via a link at the top of the paper. _ REFERENCES * Capellán-Pérez, I., Arto, I., Polanco-Martínez, J. M., González-Eguino, M. & Neumann, M. B. Likelihood of climate


change pathways under uncertainty on fossil fuel resource availability. _Energy Environ. Sci._ 9, 2482–2496 (2016). Google Scholar  * Hulme, M. 1.5 ∘C and climate research after the Paris


Agreement. _Nat. Clim. Chang._ 6, 222–224 (2016). ADS  Google Scholar  * Norahim, N., Yaisanga, P., Faungnawakij, K., Charinpanitkul, T. & Klaysom, C. Recent membrane developments for


CO2 separation and capture. _Chem. Eng. Technol._ 41, 211–223 (2018). CAS  Google Scholar  * Ahmad, J. et al. Recent advances in poly (amide-B-ethylene) based membranes for carbon dioxide


(CO2) capture: a review. _Polym. Technol. Mater._ 58, 366–383 (2019). CAS  Google Scholar  * Wang, Y. et al. Polymers of intrinsic microporosity for energy-intensive membrane-based gas


separations. _Mater. Today Nano_ 3, 69–95 (2018). Google Scholar  * Hong, T. et al. Impact of tuning CO2-philicity in polydimethylsiloxane-based membranes for carbon dioxide separation. _J.


Memb. Sci._ 530, 213–219 (2017). CAS  Google Scholar  * Sumida, K. et al. Carbon dioxide capture in metal organic frameworks. _Chem. Rev._ 112, 724–781 (2011). PubMed  Google Scholar  *


Tian, Z., Dai, S. & Jiang, D.-e What can molecular simulation do for global warming? _Wiley Interdiscip. Rev. Comput. Mol. Sci._ 6, 173–197 (2016). CAS  Google Scholar  * Vogiatzis, K.


D., Mavrandonakis, A., Klopper, W. & Froudakis, G. E. Ab initio study of the interactions between CO2 and N-containing organic heterocycles. _ChemPhysChem_ 2, 374–383 (2009). Google


Scholar  * Tian, Z., Saito, T. & Jiang, D.-e Ab initio screening of CO2-philic groups. _J. Phys. Chem. A_ 119, 3848–3852 (2015). CAS  PubMed  Google Scholar  * Lee, H. M., Youn, I. S.,


Saleh, M., Lee, J. W. & Kim, K. S. Interactions of CO2 with various functional molecules. _Phys. Chem. Chem. Phys._ 17, 10925–10933 (2015). CAS  PubMed  Google Scholar  * Chen, L., Cao,


F. & Sun, H. Ab initio study of the _π_–_π_ Interactions between CO2 and benzene, pyridine, and pyrrole. _Int. J. Quantum Chem._ 113, 2261–2266 (2013). CAS  Google Scholar  * Hussain, M.


A., Soujanya, Y. & Sastry, G. N. Evaluating the efficacy of amino acids as CO2 capturing agents: a first principles investigation. _Environ. Sci. Technol._ 45, 8582–8588 (2011). ADS 


CAS  PubMed  Google Scholar  * Townsend, J., Braunscheidel, N. M. & Vogiatzis, K. D. Understanding the nature of weak interactions between functionalized boranes and N2/O2, promising


functional groups for gas separations. _J. Phys. Chem. A_ 123, 3315–3325 (2019). CAS  PubMed  Google Scholar  * Hymel, J. H., Townsend, J. & Vogiatzis, K. D. CO2 capture on


functionalized calixarenes: a computational study. _J. Phys. Chem. A_ 123, 10116–10122 (2019). CAS  PubMed  Google Scholar  * Kim, J., Abouelnasr, M., Lin, L. C. & Smit, B. Large-scale


screening of zeolite structures for CO2 membrane separations. _J. Am. Chem. Soc._ 135, 7545–7552 (2013). CAS  PubMed  Google Scholar  * Haldoupis, E., Nair, S. & Sholl, D. S. Finding


MOFs for highly selective CO2/N2 adsorption using materials screening based on efficient assignment of atomic point charges. _J. Am. Chem. Soc._ 134, 4313–4323 (2012). CAS  PubMed  Google


Scholar  * Nandy, A., Duan, C., Janet, J. P., Gugler, S. & Kulik, H. J. Strategies and software for machine learning accelerated discovery in transition metal chemistry. _Ind. Eng. Chem.


Res._ 57, 13973–13986 (2018). CAS  Google Scholar  * Hachmann, J. et al. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the


world community grid. _J. Phys. Chem. Lett._ 2, 2241–2251 (2011). CAS  Google Scholar  * Persson, K. A. et al. Commentary: The Materials Project: a materials genome approach to accelerating


materials innovation. _APL Mater._ 1, 011002 (2013). ADS  Google Scholar  * Levy, O. et al. The high-throughput highway to computational materials design. _Nat. Mater._ 12, 191–201 (2013).


ADS  PubMed  Google Scholar  * Nørskov, J. K. & Bligaard, T. The catalyst genome. _Angew. Chem. Int. Ed._ 52, 776–777 (2013). Google Scholar  * Collins, K. D., Gensch, T. & Glorius,


F. Contemporary screening approaches to reaction discovery and development. _Nat. Chem._ 6, 859–871 (2014). CAS  PubMed  Google Scholar  * Ma, X., Li, Z., Achenie, L. E. & Xin, H.


Machine-learning-augmented chemisorption model for CO2 electroreduction catalyst screening. _J. Phys. Chem. Lett._ 6, 3528–3533 (2015). CAS  PubMed  Google Scholar  * Janet, J. P. &


Kulik, H. J. Predicting electronic structure properties of transition metal complexes with neural networks. _Chem. Sci._ 8, 5137–5152 (2017). CAS  PubMed  PubMed Central  Google Scholar  *


Li, Z., Ma, X. & Xin, H. Feature engineering of machine-learning chemisorption models for catalyst design. _Catal. Today_ 280, 232–238 (2017). CAS  Google Scholar  * Gómez-Bombarelli, R.


et al. Automatic chemical design using a data-driven continuous representation of molecules. _ACS Cent. Sci._ 4, 268–276 (2018). PubMed  PubMed Central  Google Scholar  * Olivecrona, M.,


Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. _J. Cheminform._ 9, 1–14 (2017). Google Scholar  * Sanchez-Lengeling, B. &


Aspuru-Guzik, A. Inverse molecular design using machine learning:Generative models for matter engineering. _Science_ 361, 360–365 (2018). ADS  CAS  PubMed  Google Scholar  * Sturluson, A.,


Huynh, M. T., York, A. H. P. & Simon, C. M. Eigencages: learning a latent space of porous cage molecules. _ACS Cent. Sci._ 4, 1663–1676 (2018). CAS  PubMed  PubMed Central  Google


Scholar  * Hansen, K. et al. Assessment and validation of machine learning methods for predicting molecular atomization energies. _J. Chem. Theory Comput._ 9, 3404–3419 (2013). CAS  PubMed 


Google Scholar  * Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Big data meets quantum chemistry approximations: the _Δ_-machine learning approach. _J. Chem. Theory


Comput._ 11, 2087–2096 (2015). CAS  PubMed  Google Scholar  * Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies


with machine learning. _Phys. Rev. Lett._ 108, 058301 (2012). ADS  PubMed  Google Scholar  * Hansen, K. et al. Machine learning predictions of molecular properties: accurate many-body


potentials and nonlocality in chemical space. _J. Phys. Chem. Lett._ 6, 2326–2331 (2015). CAS  PubMed  PubMed Central  Google Scholar  * De, S., Bartók, A. P., Csányi, G. & Ceriotti, M.


Comparing molecules and solids across structural and alchemical space. _Phys. Chem. Chem. Phys._ 18, 13754–13769 (2016). CAS  PubMed  Google Scholar  * Faber, F., Lindmaa, A., von


Lilienfeld, O. A. & Armiento, R. Crystal structure representations for machine learning models of formation energies. _Int. J. Quantum Chem._ 115, 1094–1101 (2015). CAS  Google Scholar 


* Faber, F. A. et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. _J. Chem. Theory Comput._ 13, 5255–5264 (2017). CAS  PubMed  Google Scholar  *


Huang, B. & von Lilienfeld, O. A. Communication: understanding molecular representations in machine learning: the role of uniqueness and target similarity. _J. Chem. Phys_. 145, 161102


(2016). ADS  PubMed  Google Scholar  * Bereau, T., Andrienko, D. & von Lilienfeld, O. A. Transferable atomic multipole machine learning models for small organic molecules. _J. Chem.


Theory Comput._ 11, 3225–3233 (2015). CAS  PubMed  Google Scholar  * Browning, N. J., Ramakrishnan, R., von Lilienfeld, O. A. & Roethlisberger, U. Genetic optimization of training sets


for improved machine learning models of molecular properties. _J. Phys. Chem. Lett._ 8, 1351–1359 (2017). CAS  PubMed  Google Scholar  * Bartók, A. P. et al. Machine learning unifies the


modeling of materials and molecules. _Sci. Adv._ 3, e1701816 (2017). ADS  PubMed  PubMed Central  Google Scholar  * Faber, F. A., Christensen, A. S., Huang, B. & von Lilienfeld, O. A.


Alchemical and structural distribution based representation for improved QML. _J. Chem. Phys._ 148, 241717 (2018). ADS  PubMed  Google Scholar  * Meyer, B., Sawatlon, B., Heinen, S., von


Lilienfeld, O. A. & Corminboeuf, C. Machine learning meets volcano plots: Computational discovery of cross-coupling catalysts. _Chem. Sci._ 9, 7069–7077 (2018). CAS  PubMed  PubMed


Central  Google Scholar  * Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. _Phys. Rev. B - Condens. Matter Mater. Phys._ 87, 1–16 (2013). Google Scholar  *


Collins, C. R., Gordon, G. J., Von Lilienfeld, O. A. & Yaron, D. J. Constant size descriptors for accurate machine learning models of molecular properties. _J. Chem. Phys._ 148, 241718


(2018). ADS  PubMed  Google Scholar  * Bendich, P., Marron, J. S., Miller, E., Pieloch, A. & Skwerer, S. Persistent homology analysis of brain artery trees. _Ann. Appl. Stat._ 10,


198–218 (2016). MathSciNet  PubMed  PubMed Central  Google Scholar  * Kramar, M., Goullet, A., Kondic, L. & Mischaikow, K. Persistence of force networks in compressed granular media.


_Phys. Rev. E_ 87, 042207 (2013). ADS  CAS  Google Scholar  * Taylor, D. et al. Topological data analysis of contagion maps for examining spreading processes on networks. _Nat. Comm._ 6,


7723 (2015). ADS  CAS  Google Scholar  * Takiyama, A., Teramoto, T., Suzuki, H., Yamashiro, K. & Tanaka, S. Persistent homology index as a robust quantitative measure of


immunohistochemical scoring. _Sci. Rep._ 7, 14002 (2017). ADS  PubMed  PubMed Central  Google Scholar  * Marchese, A. & Maroulas, V. Signal classification with a point process distance


on the space of persistence diagrams. _Adv. Data Anal. Classif._ 12, 657–682 (2018). MathSciNet  MATH  Google Scholar  * Maroulas, V., Nasrin, F. & Oballe, C. A bayesian framework for


persistent homology. _SIAM J. Math. Data Sci._ 2, 48–74 (2020). MathSciNet  Google Scholar  * Maroulas, V., Mike, J. L. & Oballe, C. Nonparametric estimation of probability density


functions of random persistence diagrams. _J. Mach. Learn. Res._ 20, 1–49 (2019). MathSciNet  MATH  Google Scholar  * Maroulas,  V., Micucci,  C. P. &  Spannaus,  A.  stable cardinality


distance for topological classification. _Adv. Data Anal. Classi_. 1–18, https://link.springer.com/article/10.1007%2Fs11634-019-00378-3 (2019). * Cang, Z. & Wei, G. W. Analysis and


prediction of protein folding energy changes upon mutation by element specific persistent homology. _Bioinformatics_ 33, 3549–3557 (2017). CAS  PubMed  Google Scholar  * Cang, Z. & Wei,


G.-W. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. _Int. J. Numer. Meth. Bio._ 34, e2914 (2018). Google Scholar  *


Lee, Y. et al. Quantifying similarity of pore-geometry in nanoporous materials. _Nat. Comm._ 8, 15396 (2017). ADS  CAS  Google Scholar  * Lee, Y. et al. High-throughput screening approach


for nanoporous materials genome using topological data analysis: application to zeolites. _J. Chem. Theory Comput._ 14, 4427–4437 (2018). CAS  PubMed  PubMed Central  Google Scholar  *


Kimura, M., Obayashi, I., Takeichi, Y., Murao, R. & Hiraoka, Y. Non-empirical identification of trigger sites in heterogeneous processes using persistent homology. _Sci. Rep._ 8, 1–9


(2018). Google Scholar  * Adams, H. et al. Persistence images: a stable vector representation of persistent homology. _J. Mach. Learn. Res._ 18, 218–252 (2017). MathSciNet  Google Scholar  *


Zomorodian, A. & Carlsson, G. Computing persistent homology. _Discret., Comp. Geom._ 33, 249–274 (2005). MathSciNet  MATH  Google Scholar  * Ghrist, R. Barcodes: the persistent topology


of data. _Bull. Am. Math. Soc._ 45, 61–75 (2008). MathSciNet  MATH  Google Scholar  * Wasserman, L. Topological data analysis. _Annu. Rev. Stat. Appl._ 5, 501–532 (2018). MathSciNet  Google


Scholar  * Edelsbrunner,  H. &  Harer,  J.  _Computational Topology: an Introduction_ (American Mathematical Soc., 2010). * Jin,  Z.  et al.  Structure of M_p__r__o_ from covid-19 virus


and discovery of its inhibitors. _Nature_. https://www.nature.com/articles/s41586-020-2223-y (2020). * Boyle, N. M. O. et al. Open Babel: an open chemical toolbox. _J. Cheminform._ 3, 1–14


(2011). Google Scholar  * Himanen, L. et al. DScribe: library of descriptors for machine learning in materials science. _Comput. Phys. Commun._ 247, 106949 (2020). CAS  Google Scholar  *


Christensen,  A.  et al.  _QML: A Python Toolkit for Quantum Machine Learning_. https://github.com/qmlcode/qml (2017). * Collins,  C. R.  MolML. https://github.com/crcollins/molml (2017). *


Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. _Sci. Data_ 1, 1–7 (2014). Google Scholar  Download


references ACKNOWLEDGEMENTS This material is based on work supported by the National Science Foundation under Grant CHE-1800237 (J.T., J.H.H., and K.D.V.), and by the ARO Grant #


W911NF-17-1-0313 and the NSF DMS-1821241 (C.P.M. and V.M.). This work used the Advanced Computer Facility (ACF) of the University of Tennessee. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS *


Department of Chemistry, University of Tennessee, Knoxville, TN, 37996-1600, USA Jacob Townsend, John H. Hymel & Konstantinos D. Vogiatzis * Department of Mathematics, University of


Tennessee, Knoxville, TN, 37996-1320, USA Cassie Putman Micucci & Vasileios Maroulas Authors * Jacob Townsend View author publications You can also search for this author inPubMed Google


Scholar * Cassie Putman Micucci View author publications You can also search for this author inPubMed Google Scholar * John H. Hymel View author publications You can also search for this


author inPubMed Google Scholar * Vasileios Maroulas View author publications You can also search for this author inPubMed Google Scholar * Konstantinos D. Vogiatzis View author publications


You can also search for this author inPubMed Google Scholar CONTRIBUTIONS V.M. and K.D.V. conceived the project. J.T. and C.P.M. wrote the code. J.T. and J.H.H. performed the calculations.


J.T., C.P.M., V.M., and K.D.V. wrote the paper. CORRESPONDING AUTHORS Correspondence to Vasileios Maroulas or Konstantinos D. Vogiatzis. ETHICS DECLARATIONS COMPETING INTERESTS The authors


declare no competing interests. ADDITIONAL INFORMATION PEER REVIEW INFORMATION _Nature Communications_ thanks Reinhard Maurer and the other, anonymous, reviewer(s) for their contribution to


the peer review of this work. Peer reviewer reports are available. PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional


affiliations. SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION PEER REVIEW FILE RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0


International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the


source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative


Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by


statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit


http://creativecommons.org/licenses/by/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Townsend, J., Micucci, C.P., Hymel, J.H. _et al._ Representation of molecular


structures with persistent homology for machine learning applications in chemistry. _Nat Commun_ 11, 3230 (2020). https://doi.org/10.1038/s41467-020-17035-5 Download citation * Received: 11


January 2020 * Accepted: 28 May 2020 * Published: 26 June 2020 * DOI: https://doi.org/10.1038/s41467-020-17035-5 SHARE THIS ARTICLE Anyone you share the following link with will be able to


read this content: Get shareable link Sorry, a shareable link is not currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing


initiative


Trending News

A monterey park teen set off on her bike to visit a relative. She never arrived

Police are searching for a 15-year-old Monterey Park girl who disappeared Tuesday after leaving home on her bicycle. Ali...

Motor racing-alfa romeo seek to step up in final f1 season with sauber

Motor racing-Alfa Romeo seek to step up in final F1 season with Sauber | WTVB | 1590 AM · 95.5 FM | The Voice of Branch ...

Hyderabad: hydra to get four additional commissioners among 169 officers

The Telangana government on Wednesday, September 25, issued orders sanctioning 169 posts in various categories for the H...

Son of former YouTube CEO found dead in UC Berkeley dorm

A UC Berkeley student found dead on campus last week was the son of former YouTube Chief Executive Susan Wojcicki, a fam...

Incredible moments from Bruce Springsteen at Anfield as Sir Paul McCartney joins him on stage - Liverpool Echo

Incredible moments from Bruce Springsteen at Anfield as Sir Paul McCartney joins him on stageIt was an incredible perfor...

Latests News

Representation of molecular structures with persistent homology for machine learning applications in chemistry

ABSTRACT Machine learning and high-throughput computational screening have been valuable tools in accelerated first-prin...

Both gene expression for orotate phosphoribosyltransferase and its ratio to dihydropyrimidine dehydrogenase influence outcome following fluoropyrimidi

Activation of 5-fluorouracil into its nucleotides requires phosphorylation by three pathways involving orotate phosphori...

On location: creating cadavers

Daniel Tirinnanzi, left, Barney Burman and Ian Von Cromer work on a clay sculpture at B2FX in North Hollywood. The sculp...

Postscript: a few more thoughts about 'a parallelogram'

_In “Postscript,” a critic returns to works already reviewed to respond to readers, touch on points not covered, share s...

Thursday night football draws 20. 8 million viewers to cbs/nfl network

CBS’ expensive foray into Thursday prime-time NFL games is off to a strong start in the ratings. The premiere of “Thursd...

Top