pKalculator: A pKa predictor for C–H bonds

Rasmus M. Borup, Nicolai Ree and Jan H. Jensen
Department of Chemistry, University of Copenhagen, Copenhagen, DK-2100, Denmark
Guest Editor: A. Schweidtmann
Beilstein J. Org. Chem. 2024, 20, 1614–1622. https://doi.org/10.3762/bjoc.20.144
Received 18 Mar 2024, Accepted 02 Jul 2024, Published 16 Jul 2024
A non-peer-reviewed version of this article has been posted as a preprint https://doi.org/10.26434/chemrxiv-2024-56h5h
Full Research Paper

Abstract

Determining the pKa values of various C–H sites in organic molecules offers valuable insights for synthetic chemists in predicting reaction sites. As molecular complexity increases, this task becomes more challenging. This paper introduces pKalculator, a quantum chemistry (QM)-based workflow for the automatic computation of C–H pKa values, which is used to generate a training dataset for a machine learning (ML) model. The QM workflow is benchmarked against 695 experimentally determined C–H pKa values in DMSO. The ML model is trained on a diverse dataset of 775 molecules with 3910 C–H sites. Our ML model predicts C–H pKa values with a mean absolute error (MAE) and a root mean squared error (RMSE) of 1.24 and 2.15 pKa units, respectively. Furthermore, we employ our model on 1043 pKa-dependent reactions (aldol, Claisen, and Michael) and successfully indicate the reaction sites with a Matthews correlation coefficient (MCC) of 0.82.

Introduction

Over the years, the ability to selectively break a C–H bond to create new connections has attracted increasing interest [1]. While past methods allowed for C–H transformations in simple molecules, recent synthetic protocols [2] enable selective C–H activation and diversification in larger molecules. This has, for example, attracted the pharmaceutical industry to implement such C–H transformations to diversify different types of molecules ranging from small drug-like molecules to intermediates and lead compounds. Especially late-stage functionalization is a promising emerging field that allows chemists to efficiently explore the chemical space in complex molecules by exchanging a C–H bond with different functional groups to modify the biological activity of drugs [2]. However, pinpointing which C–H bond is reacting can be challenging.

Grzybowski and co-workers recently addressed this challenge by predicting pKa values for C–H bonds in dimethyl sulfoxide (DMSO) using a graph convolutional neural network (GCNN) [3]. Using a mix of experimental and computed pKa data, they achieved a mean absolute error (MAE) of 2.1 pKa units. Lee and co-workers also addressed this problem by creating a general machine learning (ML) model using either a neural network or XGBoost. They trained on experimental pKa values in 39 solvents from the “internet Bond-energy Databank” (iBonD). Thus, they could predict the lowest pKa value for a wide range of molecules that contain bonds such as N–H, O–H, C–H, S–H, and P–H. However, they reported a scarcity of non-aqueous pKa values and achieved a MAE of 1.5 pKa units for the solvent DMSO using XGBoost [4,5]. Unfortunately, neither the Grzybowski group nor the Lee group has made its models generally available to other users.

Inspired by the efforts of the Grzybowski group and the Lee group, we have developed pKalculator, a quantum chemistry (QM)-based workflow for the automatic computation of C–H pKa values in DMSO. The computed C–H pKa values are then used to generate training data for an ML model using LightGBM [6]. The QM-based workflow and the ML model are freely available under the MIT license.

Methods

Datasets

We compile a dataset of 732 experimental pKa values in DMSO from two different sources, Bordwell [7] and iBonD [4]. The Bordwell dataset contains experimental C–H pKa values in DMSO from 419 molecules. For the iBonD database, we select experimental C–H pKa values in DMSO for 313 molecules. As the iBonD database only contains an image of each molecule, we employ the “Deep Learning for Chemical Image Recognition” software (DECIMER v. 2.0), developed by Rajan and co-workers [8-10]. While DECIMER converts molecular images into SMILES, manual intervention is required to ensure the SMILES string correctly represents the molecule. Finally, to mirror the dataset by Roszak et al. [3], we also incorporate 43 heterocycles without experimental pKa values from Shen et al., leaving us with a dataset of 775 compounds [11]. This dataset will be used to calculate QM pKa values using our QM workflow described in the next section.

We also create a dataset from Reaxys that contains 1043 pKa-controlled reactions. These reactions include 584 aldol, 408 Claisen, and 51 Michael reactions. This dataset is used as an out-of-sample dataset to see how well our ML model predicts the reaction site. Additionally, we use six pharmaceutical intermediates that undergo selective borylation to compare our QM workflow and ML model with experimentally determined reaction sites.

The quantum chemistry-based workflow

Following work by Ree et al. [12-15], we present a fully automated QM-based workflow for computing C–H pKa values. A given SMILES string is modified to produce a list of SMILES, one for each deprotonated C–H site. We generate min(1 + 3nrot, 20) conformers for each SMILES using RDKit (v. 2022.09.4) [16,17], where nrot is the number of rotatable bonds. Each conformer is optimized in dimethyl sulfoxide (DMSO, ε = 47.2) using GFN-FF-xTB [18] and the analytical linearized Poisson–Boltzmann (ALPB) equation [19] as the implicit solvation model. We then remove conformers with relative energies above 3 kcal/mol and select unique conformers by taking the centroids of a Butina clustering using pairwise heavy-atom root mean square deviation (RMSD) with a threshold of 0.5 Å [16,20]. For more information, refer to Supporting Information File 1, section “Selecting unique conformers”.
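
The conformer budget and the energy filter described above can be illustrated with a minimal Python sketch. The function names (n_conformers, within_energy_window) and the example energies are hypothetical and are not taken from the pKalculator code; the actual workflow relies on RDKit and xTB for conformer generation, optimization, and clustering, none of which is reproduced here.

```python
def n_conformers(n_rot: int, cap: int = 20) -> int:
    """Number of conformers to generate: min(1 + 3 * n_rot, cap)."""
    if n_rot < 0:
        raise ValueError("n_rot must be non-negative")
    return min(1 + 3 * n_rot, cap)


def within_energy_window(energies_kcal, window=3.0):
    """Indices of conformers whose energy relative to the minimum is <= window (kcal/mol)."""
    e_min = min(energies_kcal)
    return [i for i, e in enumerate(energies_kcal) if e - e_min <= window]


if __name__ == "__main__":
    # Conformer budget for molecules with 0, 2, 6, and 10 rotatable bonds.
    for n_rot in (0, 2, 6, 10):
        print(n_rot, n_conformers(n_rot))
    # Made-up conformer energies in kcal/mol; the third one lies 3.5 kcal/mol
    # above the minimum and is therefore dropped.
    print(within_energy_window([-10.0, -9.2, -6.5, -7.0]))
```

With this rule, the cap of 20 conformers is reached once a molecule has seven or more rotatable bonds.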

Subsequently, we re-optimize the remaining conformers in DMSO with GFN2-xTB [21] and the ALPB implicit solvation model to identify the lowest-energy conformer. We then conduct re-optimization in ORCA (v. 5.0.4) [22,23], using the dispersion D4-corrected DFT functional CAM-B3LYP [24,25], the Karlsruhe [26,27] triple-ζ basis set, def2-TZVPPD, and the conductor-like polarizable continuum model (CPCM) [28] as the implicit solvation model. CAM-B3LYP is chosen as the optimal functional based on a benchmark study that evaluates the accuracy of different levels of theory, ranging from semiempirical methods (xTB) [21] over composite electronic structure methods (r2SCAN-3c) [29] to DFT methods (CAM-B3LYP) [24,25]. All these methods are evaluated as single-point calculations or optimization and frequency calculations. For comprehensive details, refer to Supporting Information File 1, section “Benchmark study - computational methods”. Hereafter, we check the geometries for imaginary frequencies and use the total thermal energy at 298.15 K. Following the approach of the Grzybowski group [3], we compute the heterolytic dissociation energy through the direct deprotonation reaction, [Graphic 1]; see Equation 1.

[1860-5397-20-144-i2]
(1)

For each set of deprotonated C–H sites in a molecule, we determine the minimum heterolytic dissociation energy (ΔG°min). Hereafter, we assume a linear relationship between the experimental pKa values and ΔG°min as this assumption allows us to derive the empirical constants a and b and correct any systematic errors; see Equation 2, where ΔG° is replaced by ΔG°min. After retrieving the empirical constants a and b, we can determine the QM-computed pKa values for all deprotonated C–H sites using Equation 2:

[1860-5397-20-144-i3]
(2)
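
The linear calibration described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes the plain form pKa ≈ a · ΔG°min + b, so that any constant scaling of the free energy is absorbed into a, and the energies and pKa values in the example are invented. The helper names fit_line, min_delta_g, and to_pka are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ~ a * x + b; returns (a, b)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def min_delta_g(site_energies):
    """Lowest heterolytic dissociation energy among the C-H sites of one molecule."""
    return min(site_energies.values())


def to_pka(delta_g, a, b):
    """Convert a dissociation energy to a pKa value with the fitted constants."""
    return a * delta_g + b


if __name__ == "__main__":
    # Invented per-molecule data: site index -> dissociation energy (arbitrary units).
    molecules = [
        {0: 300.0, 1: 325.0},
        {0: 312.0, 1: 310.0, 2: 340.0},
        {0: 320.0},
        {0: 345.0, 1: 330.0},
    ]
    pka_exp = [10.2, 17.1, 24.3, 30.9]  # invented experimental values

    dg_min = [min_delta_g(m) for m in molecules]
    a, b = fit_line(dg_min, pka_exp)

    # Once a and b are known, every site (not only the minimum) gets a pKa value.
    for m in molecules:
        print({site: round(to_pka(dg, a, b), 1) for site, dg in m.items()})
```

Only the minimum-energy site of each molecule enters the fit, because the experimental value is assumed to belong to the most acidic C–H site; the fitted constants are then applied to all sites.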

Machine learning

The feature descriptor

Recent research shows that the atomic descriptors introduced by Finkelmann et al. [30,31], using charge model 5 (CM5) atomic charges [32], are a great representation of atoms in molecules that can be used in combination with an ML model to predict a variety of properties. These properties encompass the site of metabolism [31,33], the strengths of hydrogen bond donors and acceptors [34-36], and the regioselectivity of electrophilic aromatic substitution reactions [14]. Building on the methodology from Finkelmann et al. [30,31] and Ree et al. [14], we utilize the automated approach to compute CM5 atomic charges from semiempirical tight-binding (GFN1-xTB [37]) calculations. We modify the workflow to enhance the accuracy of the computed CM5 atomic charges. Instead of generating a single random conformer, we produce 20 random conformers from a SMILES string and optimize the structure with molecular mechanics force fields [38] using RDKit [16]. The CM5 atomic charges of the lowest-energy conformer are then used to generate atomic descriptors based on sorting the CM5 charges for a given atom of the input SMILES string. Furthermore, we adjust the shell radius from 5 to 6, improving the performance of the ML model to predict pKa values as detailed in Supporting Information File 1, section “The descriptor”.
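
To make the idea of a descriptor built from sorted atomic charges more concrete, the following Python sketch collects the charges of all atoms in each topological shell around a chosen atom, sorts them within the shell, and pads each shell to a fixed width. It is a simplified illustration only: the sort order, the padding scheme, and the names shell_descriptor, n_shells, width, and pad are assumptions, and the toy charges are invented rather than computed CM5 values.

```python
from collections import deque


def shell_descriptor(adjacency, charges, center, n_shells=6, width=4, pad=0.0):
    """Concatenate sorted per-shell charges around `center` into one flat vector.

    adjacency: dict atom index -> list of bonded atom indices
    charges:   dict atom index -> atomic charge
    """
    # Breadth-first search gives the bond distance of every atom from the center.
    dist = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if dist[atom] == n_shells:
            continue  # do not expand beyond the outermost shell
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)

    vector = []
    for d in range(n_shells + 1):
        shell = sorted((charges[a] for a, k in dist.items() if k == d), reverse=True)
        # Truncate or pad so that every shell contributes exactly `width` entries.
        shell = shell[:width] + [pad] * max(0, width - len(shell))
        vector.extend(shell)
    return vector


if __name__ == "__main__":
    # Toy graph: atom 1 is bonded to atoms 0, 2, and 3.
    adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
    charges = {0: -0.3, 1: 0.1, 2: 0.05, 3: -0.2}
    print(shell_descriptor(adjacency, charges, center=0, n_shells=2, width=2))
```

Sorting within each shell makes the vector independent of the atom ordering in the input, and the fixed width gives every atom a descriptor of the same length, which is what a tree-based model such as LightGBM requires.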

Data preparation and hyperparameter optimization

Building on the procedure outlined by Ree et al. [14], we employ the Optuna framework (v. 3.3.0) [39] to identify optimal hyperparameters for LightGBM regression and classification models [6]. Specifically, the Bayesian optimization technique utilizing the tree-structured Parzen estimator is applied for hyperparameter space exploration. For the regression task, the target values are the QM-computed pKa values. For the binary classification task, which aims to predict the site with the lowest QM-computed pKa value, labels are assigned in the following manner: ‘1’ for the lowest QM-computed pKa value (true site) and ‘0’ for all other QM-computed pKa values. As the difference between the lowest pKa value and the other pKa values is sometimes small, we also introduce a tolerance where a pKa value within +1 pKa unit or +2 pKa units of the lowest pKa value is accepted as ‘1’; see Supporting Information File 1, section “Machine learning models” for more information. Further, given the significant imbalance between the two classes (with ‘0’s far outnumbering ‘1’s), the hyperparameter scale_pos_weight is invoked during hyperparameter optimization. Finally, we establish a “null model” for the classification task, wherein all sites are predicted as ‘0’.
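
The labelling rule for the classification task can be written down compactly. The sketch below is a minimal illustration with invented pKa values; the function name label_sites and its tolerance argument are hypothetical and do not refer to the pKalculator code.

```python
def label_sites(pka_values, tolerance=0.0):
    """Label the C-H sites of one molecule: 1 if within `tolerance` of the lowest pKa, else 0."""
    lowest = min(pka_values)
    return [1 if p - lowest <= tolerance else 0 for p in pka_values]


if __name__ == "__main__":
    qm_pka = [16.4, 24.7, 17.1, 35.0]  # invented QM-computed values for four sites
    print(label_sites(qm_pka))                  # strict: only the lowest site
    print(label_sites(qm_pka, tolerance=1.0))   # sites within +1 pKa unit
    print(label_sites(qm_pka, tolerance=2.0))   # sites within +2 pKa units
```

Because each molecule contributes one or a few ‘1’ labels but many ‘0’ labels, the resulting label vector is strongly imbalanced, which is the reason a class-weighting hyperparameter is tuned.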

The dataset with QM-computed pKa values (775 compounds; 3910 pKa values) is initially split randomly by compound into a training set (80%; 620 compounds; 3121 pKa values) and a held-out test set (20%; 155 compounds; 789 pKa values). For each ML model, we carry out a fivefold randomly shuffled cross-validation. Within each fold, the original training set is further split randomly into a new training set (90% of the original training set) and a validation set (10% of the original training set). This allows us to evaluate different models and estimate their performance. Hereafter, each ML model is trained on our original training set and tested against the held-out test set. Finally, we select the best-performing ML model.
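
A split by compound, as opposed to a split by C–H site, keeps all sites of a molecule on the same side of the split. The following sketch shows one way to do this with the Python standard library. The function names and the seed are hypothetical, and splitting the inner 90/10 validation sets by compound is an assumption made here for illustration.

```python
import random


def split_by_compound(compound_ids, test_fraction=0.2, seed=0):
    """Randomly assign whole compounds to a training set or a held-out test set."""
    unique = sorted(set(compound_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return set(unique[n_test:]), set(unique[:n_test])


def shuffled_validation_splits(train_ids, n_splits=5, val_fraction=0.1, seed=0):
    """Return n_splits random (train, validation) partitions of the training compounds."""
    rng = random.Random(seed)
    ordered = sorted(train_ids)
    n_val = round(len(ordered) * val_fraction)
    splits = []
    for _ in range(n_splits):
        shuffled = ordered[:]
        rng.shuffle(shuffled)
        splits.append((set(shuffled[n_val:]), set(shuffled[:n_val])))
    return splits


if __name__ == "__main__":
    # One entry per C-H site; here every placeholder compound has three sites.
    site_compounds = [f"mol{i}" for i in range(775) for _ in range(3)]
    train, test = split_by_compound(site_compounds)
    print(len(train), len(test))
    for new_train, val in shuffled_validation_splits(train):
        print(len(new_train), len(val))
```

With 775 compounds, an 80/20 split yields 620 training and 155 test compounds, matching the numbers quoted above.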

Results and Discussion

Computing pKa values

From section “The quantum chemistry-based workflow” above, we can determine the empirical values a and b in Equation 2. For each set of deprotonated sites in a molecule, we extract the computed ΔG°min value and fit it against the experimental pKa values. Hereafter, we convert the computed ΔG°min to QM-computed pKa values using Equation 2. We then inspect outliers for which the experimental pKa value and the QM-computed pKa value differ by more than 5 pKa units in absolute terms. We choose this threshold of 5 pKa units to ensure that the deviation is well above the error that is to be expected at the level of theory we are using (CAM-B3LYP). The observed outliers typically result from one of the following reasons: (i) calculation errors concerning the expected minimum pKa site, (ii) discrepancies between literature structures and database structures, (iii) mislabeled experimental pKa values, or (iv) extrapolated pKa values. Notably, the extrapolated pKa values correspond to compounds beyond the scale measurable in DMSO (pKa ≥ 35) because of the autoprotolysis of DMSO (pKa(DMSO) = 35) [40,41]. For more information regarding finding and removing outliers, see Supporting Information File 1, section “Finding outliers”. After multiple iterations, we identified 695 molecules to have reliable experimental pKa values and computed ΔG°min values. The values for the computed ΔG°min are then fitted against the experimental pKa values, leaving us with empirical constants a and b; see Figure 1. We now use the derived linear regression to convert all computed ΔG° values into QM-computed pKa values for our whole dataset (775 compounds). These values are used as target values for the ML part.

[1860-5397-20-144-1]

Figure 1: Correlating computed ΔG°min values and experimental pKa values for 695 compounds. r: Pearson correlation coefficient; ρ: Spearman’s rank correlation coefficient; MAE: mean absolute error; RMSE: root mean squared error. QM calculations were carried out at the CAM-B3LYP/def2-TZVPPD CPCM(DMSO)//GFN2-xTB ALPB(DMSO) level of theory.

Machine learning models for predicting C–H pKa values

To learn and predict C–H pKa values, we train a LightGBM regression model with our generated dataset containing QM-computed pKa values (775 compounds; 3910 pKa values). Hereafter, we correlate and compare the ML-predicted pKa values and the QM-computed pKa values and achieve a MAE and a RMSE of 1.24 and 2.15 pKa units, respectively, for the held-out test set (155 compounds; 789 pKa values), as illustrated in Figure 2. When zooming in on the ML-predicted pKa values that are not correlating well with the QM-computed pKa values, we find C–H sites that are either bridgeheads or where the negative charge is stabilized by resonance. This may be due to the nature of the chosen descriptor vector based on sorted CM5 atomic charges as it may not take into account, for example, steric strain and charge delocalisation. We discuss this further in Supporting Information File 1, section “Outliers for the test set”.

[1860-5397-20-144-2]

Figure 2: ML-predicted pKa values vs QM-computed pKa values of the held-out test set (155 compounds; 789 pKa values). r: Pearson correlation coefficient; ρ: Spearman’s rank correlation coefficient; MAE: mean absolute error; RMSE: root mean squared error. All predictions were made using the best LightGBM regressor. All calculations were carried out at the CAM-B3LYP/def2-TZVPPD CPCM(DMSO)//GFN2-xTB ALPB(DMSO) level of theory.
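
For reference, the two error measures reported here can be computed as in the short sketch below. The example values are invented and the function names are only illustrative.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


if __name__ == "__main__":
    qm_pka = [18.0, 25.3, 31.2, 40.1]   # invented QM-computed targets
    ml_pka = [18.5, 24.9, 31.0, 35.1]   # invented ML predictions with one large miss
    print(round(mae(qm_pka, ml_pka), 2), round(rmse(qm_pka, ml_pka), 2))
```

The RMSE is never smaller than the MAE and grows faster when a few predictions are far off, so a gap such as 1.24 versus 2.15 pKa units is consistent with a small number of poorly predicted sites.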

We then compare our ML model with previously reported ML models for predicting pKa values, namely, the GCNN C–H pKa predictor by Roszak et al. [3] and the XGBoost pKa predictor by Yang et al. [5]. Roszak et al. [3] used a mix of experimental data (414 compounds) [7], manually curated DFT data (212 compounds), and previously reported DFT data (194 C–H sites) [11]; they obtained a MAE of 2.18 pKa units for their test set. Yang et al. [5] used filtered entries from the iBonD dataset, comprising 15338 compounds and 19397 pKa values across 39 solvents [5]. As they not only predict C–H pKa values, we cannot compare our result with their best ML model. However, they also report a holistic six-solvent (HM-6S) XGBoost model in DMSO (9.3% of the data), which most likely contains the majority of C–H pKa values. For this XGBoost model, they achieved MAE and RMSE values of 1.53 and 2.35 pKa units, respectively. A comparison between our ML model, the GCNN model of Roszak et al., and the model of Yang et al. is shown in Table 1. While a direct comparison with these studies is not feasible because of differing datasets, our model surpasses Roszak et al.’s GCNN model by a MAE of 0.94 pKa units and outperforms Yang et al.’s HM-6S model by a MAE of 0.29 pKa units.

Table 1: Comparing different ML models for predicting pKa values. Mean absolute error (MAE) and root mean squared error (RMSE) are provided in pKa units.

Method                       MAE    RMSE
LGBM (this work)             1.24   2.15
GCNN [3]                     2.18   –
XGBoost HM-6S (DMSO)a [5]    1.53   2.35

aHM-6S: Table 7 in their paper.

Predicting the lowest C–H pKa value

Now that we can fairly accurately predict pKa values with our LightGBM regressor, another use case is to identify the C–H site with the lowest pKa value to predict the site of reaction. For this purpose, we treat the task as a binary classification and train both a LightGBM classifier and a LightGBM regressor. As described earlier in section “Data preparation and hyperparameter optimization”, the QM-computed pKa values are translated into binary values, with ‘1’ representing the lowest QM-computed pKa value and ‘0’ representing other QM-computed pKa values. The performance metrics for the test set demonstrate that the regression model (MCC of 0.97) outperforms the classification model (MCC of 0.92) when used as a binary classifier, as seen in Table 2.

Table 2: Test set performance metrics: comparison between a LightGBM classifier and a LightGBM regressor for binary classification of the lowest pKa site. Reaxys performance metrics: comparison between a LightGBM classifier and a LightGBM regressor for binary classification of the reaction site in Reaxys. The best model is marked in bold.a

             Test set performance metrics          Reaxys performance metrics
method       ACC   MCC   PPV   TPR   TNR   NPV     ACC   MCC   PPV   TPR   TNR   NPV
null modelb  0.80  0     0     0     1.00  0.80    0.87  0     0     0     1.00  0.87
classifier   0.97  0.92  0.97  0.90  0.99  0.98    0.92  0.70  0.64  0.85  0.93  0.98
regressor    0.99  0.97  0.97  0.98  0.99  1.00    0.96  0.82  0.84  0.84  0.98  0.98

aACC: accuracy; MCC: Matthews correlation coefficient; PPV: precision/positive predictive value; TPR: recall/true-positive rate; TNR: specificity/true-negative rate; NPV: negative predictive value. bAll sites are predicted as “0” to highlight the imbalance of the dataset.
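
The sketch below illustrates how a regressor can be used as a binary classifier (the site with the lowest predicted pKa value is labelled ‘1’) and how accuracy and MCC follow from the confusion matrix. The data are invented, the function names are hypothetical, and returning an MCC of 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def lowest_site_labels(pred_pka):
    """Turn predicted pKa values of one molecule into labels: 1 for the lowest value, else 0."""
    lowest = min(pred_pka)
    return [1 if p == lowest else 0 for p in pred_pka]


def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 here when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    y_true = [1, 0, 0, 0, 0]                       # one molecule with five C-H sites
    ml_pka = [17.2, 24.0, 30.5, 33.1, 41.0]        # invented regressor output
    for name, y_pred in (("regressor", lowest_site_labels(ml_pka)),
                         ("null model", [0] * len(y_true))):
        counts = confusion(y_true, y_pred)
        print(name, accuracy(*counts), mcc(*counts))
```

The null model shows why accuracy alone is misleading for imbalanced labels: predicting ‘0’ everywhere still gives a high accuracy, whereas the MCC is zero.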

Now we train a LightGBM classifier and a LightGBM regressor on the entire dataset (775 compounds; 3910 pKa values) of QM-computed pKa values to assess the generalization capability of our ML models. We use an out-of-sample dataset of 1043 pKa-dependent reactions from Reaxys, containing 584 aldol, 408 Claisen, and 51 Michael reactions. These reactions are chosen because they all involve a deprotonation step, and the C–H site with the lowest pKa value is most likely the site of the reaction. We also use these reactions for comparison with Roszak et al. [3], who evaluated their GCNN model against 12873 pKa-controlled reactions, including aldol, Claisen and Michael reactions, and correctly predicted the reacting site with an accuracy of 90.5%. Our out-of-sample set is also used to see how well our ML models predict the site of reaction using the lowest ML-predicted pKa value.

To understand the result for the out-of-sample set, we show three different reactions in Scheme 1. The first step of the reaction shown in Scheme 1a is an aldol reaction where the deprotonation occurs at the least substituted C–H site next to the ketone (black arrow). Our ML model predicts a pKa value of 24.7 for the experimental site of reaction. However, our ML model predicts that the reaction site should be at the highlighted circle. For this site, the ML model predicts a pKa value of 16.4. It is generally accepted that the most substituted C–H site next to a ketone will form the more stable carbanion (thermodynamic anion), whereas the least substituted carbanion will be the least stable carbanion (kinetic anion). This can generally be controlled by the type of base used. For the reaction in Scheme 1a, n-BuLi is commonly used, which is known to lead to the kinetic anion. Because our ML model relies on the principle of lowest energy, it predicts the site with the lowest pKa value as the site of reaction (thermodynamic carbanion) and does not account for the type of base used.

[1860-5397-20-144-i1]

Scheme 1: Predicting the reaction site for three different reactions from the out-of-sample dataset from Reaxys. (a) Aldol reaction, Reaxys reaction ID: 9947221 [42]; (b) Claisen reaction, Reaxys reaction ID: 3402137 [43]; (c) Michael reaction, Reaxys reaction ID: 29819768 [44]. Arrow: experimental site; teal filled circle: ML-predicted lowest pKa.

Going to Scheme 1b, we look at a Claisen reaction where the experimental site of reaction occurs at the least substituted ketone. Our ML model predicts the pKa value here to be 20.5; however, the lowest ML-predicted pKa value is 4.2. Again, the ML model correctly predicts the most stable carbanion (lowest pKa value), but other factors come into play when synthesizing compounds.

Last, we have an example of the Michael reaction in Scheme 1c. Here, both the experimental site of reaction and the ML-predicted site of reaction match. Our ML model predicts the lowest pKa value to be 12.5, whereas the second lowest ML-predicted pKa value is 21.9 (the least substituted C–H next to a ketone). For more information, see Supporting Information File 1, section “Outliers for Reaxys”.

When we evaluate our ML models on the whole out-of-sample set, we again find that the regression model (MCC of 0.82) outperforms the classification model (MCC of 0.70) when used as a binary classifier as seen in Table 2. While a direct comparison cannot be made between Roszak et al.’s results [3] and ours, we find our result to outperform theirs with an accuracy of 0.96. In general, it is surprising that the LightGBM regressor outperforms our LightGBM classifier as Ree et al. [14] have shown the opposite to be true for electrophilic aromatic substitutions. However, our regression model serves a dual function, that is, it accurately predicts pKa values and identifies the reaction site.

Prediction of aryl C–H borylation sites

In the previous section, we showed that our ML model is able to predict the reaction site for pKa-dependent reactions. Now, we test the ML model on a more complex reaction type, namely, borylation reactions. Caldeweyher et al. [45] presented a workflow to predict the iridium-catalyzed borylation site of aryl C–H bonds (SoBo) [45] and experimentally validated their approach using six pharmaceutical intermediates from medicinal chemistry programs. In the article, they state that “Iridium catalysts ligated by bipyridine ligands catalyze the borylation of the aryl C–H bonds that are most acidic and least sterically hindered…” [45]. For this reason, we tested both our QM workflow and the ML model to see how well they identify the reaction site when only considering the lowest aromatic C–H pKa value; see Figure 3. For both methods, we identify a site as a possible site of reaction if its pKa value is within 1.5 pKa units of the lowest pKa value. This is slightly different from our previous approach. However, because of the higher complexity of the reaction and the similarity of aromatic C–H sites, we purposely allow the QM workflow and the ML model to assess more sites as ‘1’ or true site. With a tolerance of 1.5 pKa units, we also ensure that we are within the range of the uncertainty of the QM-computed pKa values, which have a MAE of 1.48, as discussed in section “Computing pKa values”.

[1860-5397-20-144-3]

Figure 3: Predicting the site of borylation for a set of six experimentally reported borylation reactions [45]. Arrow: major experimental site/prediction by SoBo; black ring: QM-computed lowest pKa + 1.5; teal filled circle: ML-predicted lowest pKa + 1.5.

For compound 1, the ML model predicts two low-pKa sites, indicated by filled circles, of which none corresponds to the experimentally observed site of borylation, indicated by the arrow. However, the QM workflow predicts the correct site as the black ring indicates. Overall, the QM workflow accurately predicts four of the six borylation sites, although, in the case of compounds 2 and 6, there are additional sites with nearly identical pKa values. In the case of compound 3, most chemists would expect the pKa of pyrazole C–H sites to be considerably lower than those on the benzene ring, suggesting that factors other than pKa determine the site of borylation for this compound. In the case of compound 5, the most likely explanation is that the site with the lowest QM-computed pKa value is sterically hindered compared to the experimentally observed site of borylation. The ML model predicts three borylation sites correctly, but, in the case of compound 5, there are two additional sites with low pKa values. One failure is for compound 3, where the QM workflow also fails; however, for compounds 1 and 4, the ML model fails, while the QM workflow accurately predicts the site of borylation. This indicates that these compounds are not well represented in the training set.

Conclusion

We introduce pKalculator, an automated QM-based workflow that computes C–H pKa values with a MAE of 1.48 and a RMSE of 1.81 when correlating with experimental pKa values. We use this method to generate training data for an atom-based regression model that delivers fast and relatively precise predictions with MAE and RMSE values of 1.24 and 2.15, respectively, when correlating with QM-computed pKa values. Both methods are freely available under the MIT license. Our workflow can function as a filtering tool for computer-aided synthesis planning for the synthesis of various pKa-dependent reactions (aldol, Michael, and Claisen), evidenced by its accurate predictions of reaction sites for 1043 reactions (MCC of 0.82). Looking ahead, we aim to explore more reactions that depend on C–H pKa values, further enhancing the utility of pKalculator for synthetic chemists. Future iterations will consider factors such as a more extensive and diverse training set, as well as steric hindrance and base reactivity, ensuring even more precise predictions for reaction sites.

Supporting Information

Supporting Information File 1: Additional methods data.

Funding

This work was funded by the Independent Research Foundation Denmark (DFF; grant number 1032-00129B).

Conflict of Interest

The authors declare that there are no competing interests.

Author Contributions

Rasmus M. Borup: data curation; formal analysis; investigation; methodology; software; visualization; writing – original draft; writing – review & editing. Nicolai Ree: software; supervision; validation; writing – review & editing. Jan H. Jensen: conceptualization; funding acquisition; project administration; supervision; writing – review & editing.

Data Availability Statement

All data that supports the findings of this study is available in the published article and/or the supporting information to this article. The code for the automated workflow and results of the analyzed data are available at https://github.com/jensengroup/pKalculator. Additional data is available at https://sid.erda.dk/sharelink/EyuyjllJdp. The internet Bond-energy Databank (iBonD) is accessible for non-profit academic use. Due to licensing restrictions for Reaxys, the Reaxys data cannot be shared. We have provided a list of reaction IDs together with our predictions.

References

  1. Bergman, R. G. Nature 2007, 446, 391–393. doi:10.1038/446391a
  2. Guillemard, L.; Kaplaneris, N.; Ackermann, L.; Johansson, M. J. Nat. Rev. Chem. 2021, 5, 522–545. doi:10.1038/s41570-021-00300-6
  3. Roszak, R.; Beker, W.; Molga, K.; Grzybowski, B. A. J. Am. Chem. Soc. 2019, 141, 17142–17149. doi:10.1021/jacs.9b05895
  4. iBonD. http://ibond.nankai.edu.cn/ (accessed Oct 27, 2023).
  5. Yang, Q.; Li, Y.; Yang, J.-D.; Liu, Y.; Zhang, L.; Luo, S.; Cheng, J.-P. Angew. Chem., Int. Ed. 2020, 59, 19282–19291. doi:10.1002/anie.202008528
  6. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Eds.; Curran Associates, Inc., 2017.
  7. Bordwell pKa Table. https://organicchemistrydata.org/hansreich/resources/pka/ (accessed Jan 11, 2023).
  8. Rajan, K.; Zielesny, A.; Steinbeck, C. J. Cheminf. 2020, 12, 65. doi:10.1186/s13321-020-00469-w
  9. Rajan, K.; Zielesny, A.; Steinbeck, C. J. Cheminf. 2021, 13, 61. doi:10.1186/s13321-021-00538-8
  10. Rajan, K.; Brinkhaus, H. O.; Agea, M. I.; Zielesny, A.; Steinbeck, C. Nat. Commun. 2023, 14, 5045. doi:10.1038/s41467-023-40782-0
  11. Shen, K.; Fu, Y.; Li, J.-N.; Liu, L.; Guo, Q.-X. Tetrahedron 2007, 63, 1568–1576. doi:10.1016/j.tet.2006.12.032
  12. Ree, N.; Göller, A. H.; Jensen, J. H. J. Cheminf. 2021, 13, 10. doi:10.1186/s13321-021-00490-7
  13. Ree, N.; Göller, A. H.; Jensen, J. H. ACS Omega 2022, 7, 45617–45623. doi:10.1021/acsomega.2c06378
  14. Ree, N.; Göller, A. H.; Jensen, J. H. Digital Discovery 2022, 1, 108–114. doi:10.1039/d1dd00032b
  15. Ree, N.; Göller, A. H.; Jensen, J. H. Digital Discovery 2024, 3, 347–354. doi:10.1039/d3dd00224a
  16. RDKit 2022_09_4 (Q3 2022) Release; Zenodo, 2023. doi:10.5281/zenodo.7541264
  17. Riniker, S.; Landrum, G. A. J. Chem. Inf. Model. 2015, 55, 2562–2574. doi:10.1021/acs.jcim.5b00654
  18. Spicher, S.; Grimme, S. Angew. Chem., Int. Ed. 2020, 59, 15665–15673. doi:10.1002/anie.202004239
  19. Sigalov, G.; Fenley, A.; Onufriev, A. J. Chem. Phys. 2006, 124, 124902. doi:10.1063/1.2177251
  20. Butina, D. J. Chem. Inf. Comput. Sci. 1999, 39, 747–750. doi:10.1021/ci9803381
  21. Bannwarth, C.; Ehlert, S.; Grimme, S. J. Chem. Theory Comput. 2019, 15, 1652–1671. doi:10.1021/acs.jctc.8b01176
  22. Neese, F.; Wennmohs, F.; Becker, U.; Riplinger, C. J. Chem. Phys. 2020, 152, 224108. doi:10.1063/5.0004608
  23. Neese, F. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 73–78. doi:10.1002/wcms.81
  24. Yanai, T.; Tew, D. P.; Handy, N. C. Chem. Phys. Lett. 2004, 393, 51–57. doi:10.1016/j.cplett.2004.06.011
  25. Caldeweyher, E.; Ehlert, S.; Hansen, A.; Neugebauer, H.; Spicher, S.; Bannwarth, C.; Grimme, S. J. Chem. Phys. 2019, 150, 154122. doi:10.1063/1.5090222
  26. Weigend, F.; Ahlrichs, R. Phys. Chem. Chem. Phys. 2005, 7, 3297. doi:10.1039/b508541a
  27. Rappoport, D.; Furche, F. J. Chem. Phys. 2010, 133, 134105. doi:10.1063/1.3484283
  28. Barone, V.; Cossi, M. J. Phys. Chem. A 1998, 102, 1995–2001. doi:10.1021/jp9716997
  29. Grimme, S.; Hansen, A.; Ehlert, S.; Mewes, J.-M. J. Chem. Phys. 2021, 154, 064103. doi:10.1063/5.0040021
  30. Finkelmann, A. R.; Göller, A. H.; Schneider, G. Chem. Commun. 2016, 52, 681–684. doi:10.1039/c5cc07887c
  31. Finkelmann, A. R.; Göller, A. H.; Schneider, G. ChemMedChem 2017, 12, 606–612. doi:10.1002/cmdc.201700097
  32. Marenich, A. V.; Jerome, S. V.; Cramer, C. J.; Truhlar, D. G. J. Chem. Theory Comput. 2012, 8, 527–541. doi:10.1021/ct200866d
  33. Finkelmann, A. R.; Goldmann, D.; Schneider, G.; Göller, A. H. ChemMedChem 2018, 13, 2281–2289. doi:10.1002/cmdc.201800309
  34. Bauer, C. A.; Schneider, G.; Göller, A. H. Mol. Inf. 2019, 38, 1800115. doi:10.1002/minf.201800115
  35. Bauer, C. A.; Schneider, G.; Göller, A. H. J. Cheminf. 2019, 11, 59. doi:10.1186/s13321-019-0381-4
  36. Kuhnke, L.; ter Laak, A.; Göller, A. H. J. Chem. Inf. Model. 2019, 59, 668–672. doi:10.1021/acs.jcim.8b00758
  37. Grimme, S.; Bannwarth, C.; Shushkov, P. J. Chem. Theory Comput. 2017, 13, 1989–2009. doi:10.1021/acs.jctc.7b00118
  38. Tosco, P.; Stiefl, N.; Landrum, G. J. Cheminf. 2014, 6, 37. doi:10.1186/s13321-014-0037-3
  39. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In KDD ’19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019; pp 2623–2631. doi:10.1145/3292500.3330701
  40. Matthews, W. S.; Bares, J. E.; Bartmess, J. E.; Bordwell, F. G.; Cornforth, F. J.; Drucker, G. E.; Margolin, Z.; McCallum, R. J.; McCollum, G. J.; Vanier, N. R. J. Am. Chem. Soc. 1975, 97, 7006–7014. doi:10.1021/ja00857a010
  41. Koppel, I. A.; Koppel, J.; Pihl, V.; Leito, I.; Mishima, M.; Vlasov, V. M.; Yagupolskii, L. M.; Taft, R. W. J. Chem. Soc., Perkin Trans. 2 2000, 1125–1133. doi:10.1039/b001792m
  42. Barbarow, J. E.; Miller, A. K.; Trauner, D. Org. Lett. 2005, 7, 2901–2903. doi:10.1021/ol050831f
  43. Hamama, W. S.; Hammouda, M.; Afsah, E. M. Z. Naturforsch., B: J. Chem. Sci. 1988, 43, 897–900. doi:10.1515/znb-1988-0716
  44. Bettati, M.; Cavanni, P.; Di Fabio, R.; Oliosi, B.; Perini, O.; Scheid, G.; Tedesco, G.; Zonzini, L.; Micheli, F. ChemMedChem 2010, 5, 361–366. doi:10.1002/cmdc.200900482
  45. Caldeweyher, E.; Elkin, M.; Gheibi, G.; Johansson, M.; Sköld, C.; Norrby, P.-O.; Hartwig, J. F. J. Am. Chem. Soc. 2023, 145, 17367–17376. doi:10.1021/jacs.3c04986