Search results

Search for "dataset" in Full Text gives 46 results in Beilstein Journal of Organic Chemistry.

Enhancing chemical synthesis planning: automated quantum mechanics-based regioselectivity prediction for C–H activation with directing groups

  • Julius Seumer,
  • Nicolai Ree and
  • Jan H. Jensen

Beilstein J. Org. Chem. 2025, 21, 1171–1182, doi:10.3762/bjoc.21.94

  • computational cost. Validation against a comprehensive dataset reveals that the workflow achieves high accuracy, significantly surpassing traditional models in both speed and predictive capability. This development promises substantial advancements in the design of new synthetic routes, offering rapid and
  • approach performs well on this dataset, it does not generalize well to other molecules since not all relevant DGs are covered in the work by Tomberg and colleagues [9]. This is evidenced by our analysis using a dataset curated from Reaxys. Using our implementation of the method presented by Tomberg et al
  • . [9] (for further details see section “Pattern matching”), we could only obtain correct predictions for four out of ten molecules, see section “Dataset curated from Reaxys”. This underscores the necessity for more robust and versatile predictive models that can adapt to the broad spectrum of organic
Full Research Paper
Published 16 Jun 2025

Supramolecular assembly of hypervalent iodine macrocycles and alkali metals

  • Krishna Pandey,
  • Lucas X. Orton,
  • Grayson Venus,
  • Waseem A. Hussain,
  • Toby Woods,
  • Lichang Wang and
  • Kyle N. Plunkett

Beilstein J. Org. Chem. 2025, 21, 1095–1103, doi:10.3762/bjoc.21.87

  • protocol, and the characterization data are consistent with the original dataset [18]. Through a series of crystallographic experiments, we demonstrate that HIMs coordinate with alkali metals through the periphery carbonyl oxygens via metal–oxygen bonding to form a higher order metal-coordinated
Full Research Paper
Published 30 May 2025

Data accessibility in the chemical sciences: an analysis of recent practice in organic chemistry journals

  • Sally Bloodworth,
  • Cerys Willoughby and
  • Simon J. Coles

Beilstein J. Org. Chem. 2025, 21, 864–876, doi:10.3762/bjoc.21.70

  • coding of responses, list of data types associated with each article, and the resulting main dataset from assessment of 240 research papers are available in our supporting data package. As all research articles include results based on original (raw) data, and include previously unreported chemical
  • landing page provides this link back to the main article. Find_3, ‘is the dataset assigned a unique, citable and persistent identifier?’ assesses if the data can be found over the long-term. We found poor compliance (24% of primary data) as only files uploaded to a formal repository had a DOI. Two
  • datasets uploaded to Zenodo [41], one data package deposited with an institutional repository, and MODEL data uploaded in native formats had a unique identifier assigned by the repository for the data or dataset. None of the SI PDFs containing MODEL data uploaded to Figshare can be considered to have a DOI
Full Research Paper
Published 02 May 2025

Emerging trends in the optimization of organic synthesis through high-throughput tools and machine learning

  • Pablo Quijano Velasco,
  • Kedar Hippalgaonkar and
  • Balamurugan Ramalingam

Beilstein J. Org. Chem. 2025, 21, 10–38, doi:10.3762/bjoc.21.3

  • conditions (i.e., temperature, time, solvent, catalyst, etc.) and the corresponding outcome values for the target optimization objectives (e.g., yield, purity, cost, etc.). The initial dataset is commonly obtained by sampling a combination of reaction variables from the parametric space, performing the
  • sampling, and centerpoint sampling methods. Alternatively, the initial dataset can be obtained from values previously reported in the literature. After that, one or various predictive models are fitted to the initial dataset to predict the expected values of the optimization objectives. The number of
  • optimization objectives. Finally, a set of the most promising suggestions is selected and tested experimentally. The dataset is then updated with the outcomes of the latest experimental parameters, and the process is repeated until the optimal conditions have been found. Depending on the number of objectives
Review
Published 06 Jan 2025
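
Illustrative sketch (not from the article): the excerpts above describe a loop of sampling an initial dataset, fitting a predictive model, testing the most promising suggestion, and updating the dataset. A minimal Python version of that loop might look like the following. The objective function run_experiment, the inverse-distance surrogate in predict, the purely greedy selection rule, and the temperature/time grid are all hypothetical stand-ins chosen for illustration; the review does not prescribe these choices.

```python
import random


def run_experiment(temp_c, time_h):
    """Hypothetical stand-in for a real experiment; returns a 'yield' in percent."""
    return 90.0 - 0.02 * (temp_c - 80) ** 2 - 1.5 * (time_h - 6) ** 2


def predict(dataset, point):
    """Toy surrogate model: inverse-distance-weighted mean of known outcomes."""
    num = den = 0.0
    for known, outcome in dataset.items():
        # Temperature is rescaled so both variables contribute comparably.
        d2 = ((known[0] - point[0]) / 10) ** 2 + (known[1] - point[1]) ** 2
        weight = 1.0 / (d2 + 1e-9)
        num += weight * outcome
        den += weight
    return num / den


def optimize(candidates, n_initial=4, n_rounds=10, seed=0):
    """Sample an initial dataset, then repeatedly test the best-predicted point."""
    rng = random.Random(seed)
    # Initial dataset: random sample from the parametric space.
    dataset = {p: run_experiment(*p) for p in rng.sample(candidates, n_initial)}
    for _ in range(n_rounds):
        untested = [p for p in candidates if p not in dataset]
        if not untested:
            break
        # Select the most promising suggestion according to the surrogate.
        suggestion = max(untested, key=lambda p: predict(dataset, p))
        # "Run" the experiment and update the dataset with its outcome.
        dataset[suggestion] = run_experiment(*suggestion)
    return dataset


if __name__ == "__main__":
    grid = [(t, h) for t in range(40, 121, 10) for h in range(2, 11)]
    results = optimize(grid)
    best = max(results, key=results.get)
    print("best conditions tested:", best, "yield:", results[best])
```

A greedy rule like this only exploits the current model; in practice a selection rule that also rewards exploring uncertain regions is a common alternative.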

Chemical structure metagenomics of microbial natural products: surveying nonribosomal peptides and beyond

  • Thomas Ma and
  • John Chu

Beilstein J. Org. Chem. 2024, 20, 3050–3060, doi:10.3762/bjoc.20.253

  • to the genetic code) [58]. Hundreds of known nonribosomal codes and their corresponding BBs can be extracted from natural products that have been characterized over the past several decades, generating a dataset to train NRP prediction algorithms (Figure 3b) [49][59][60]. A software suite called
  • failed to match a nonribosomal code in the algorithm training dataset and were deemed “unpredictable” [37]. It is also possible that the A domain in question aligned so poorly to prototypical A domains that prevented the proper identification of the nonribosomal code itself [70]. Regardless of the
  • scenario, these bioinformatically intractable A domains are distinct from known ones and point to enormous biosynthetic novelty that still awaits our exploration. Compiling a dataset for training A domain substrate prediction algorithms has never been the objective for natural product research in the past
Perspective
Published 20 Nov 2024

Structure and thermal stability of phosphorus-iodonium ylids

  • Andrew Greener,
  • Stephen P. Argent,
  • Coby J. Clarke and
  • Miriam L. O’Duill

Beilstein J. Org. Chem. 2024, 20, 2931–2939, doi:10.3762/bjoc.20.245

  • investigated in our study (Table 1). Thermal stability data The phosphorus-iodonium ylids were analysed by differential scanning calorimetry (DSC) and thermogravimetric analysis (TGA) [34][35], and results have been summarised in Table 2 and Figure 3. (The full dataset is available in Supporting Information
  • -iodonium ylids at 10 °C min−1 in N2 (full dataset in Supporting Information File 1). (b) First derivatives of TGA thermograms normalised to the intensity of the first peak. (c) Correlation of Tonset with the dihedral angle φ (between the R–I–X bond and the plane of the arene substituent). (d–g) DSC
Full Research Paper
Published 14 Nov 2024

Applications of microscopy and small angle scattering techniques for the characterisation of supramolecular gels

  • Connor R. M. MacDonald and
  • Emily R. Draper

Beilstein J. Org. Chem. 2024, 20, 2608–2634, doi:10.3762/bjoc.20.220

Review
Published 16 Oct 2024

Machine learning-guided strategies for reaction conditions design and optimization

  • Lung-Yi Chen and
  • Yi-Pei Li

Beilstein J. Org. Chem. 2024, 20, 2476–2492, doi:10.3762/bjoc.20.212

  • two types of reaction condition models based on their scope of applicability and dataset size: global and local models. The global models cover a wide range of reaction types and typically predict the experimental conditions based on a predefined list derived from literature data. However, this method
  • aspects of the dataset features and data preprocessing methods. Moreover, we introduce common algorithms and representative studies for developing both global and local models. We highlight representative studies that demonstrate the effectiveness and applicability of these algorithms in real-world
  • technique to construct a dataset of ≈693k reactions with detailed procedures and developed a sequence-to-sequence model to predict synthetic steps that are actionable and compatible with robotic platforms [70]. Guo et al. [71] conducted a continual pretraining scheme on the BERT model [72] to obtain a
Review
Published 04 Oct 2024

Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis

  • Stefan P. Schmid,
  • Leon Schlosser,
  • Frank Glorius and
  • Kjell Jorner

Beilstein J. Org. Chem. 2024, 20, 2280–2304, doi:10.3762/bjoc.20.196

  • quality of the underlying dataset will determine the model’s predictive capabilities. To obtain high predictive accuracy for a broad range of problems, a data set is sought which covers the problem space comprehensively. This does not only encompass the chemical diversity of the included molecules, but
  • features can vary based on the reactions that are contained in the training and test set. While descriptors can help in gaining mechanistic insight, it is important to not overinterpret the significance of single features to form a mechanistic hypothesis. Ideally, to overcome issues such as a high dataset
  • best of the authors’ knowledge, no HTE dataset has found widespread application in ML for organocatalysis, Denmark and co-workers published a data set comprising more than 1,000 organocatalytic transformations [67]. In their work, the authors demonstrated a data-driven workflow to study the
Review
Published 10 Sep 2024

Finding the most potent compounds using active learning on molecular pairs

  • Zachary Fralish and
  • Daniel Reker

Beilstein J. Org. Chem. 2024, 20, 2152–2162, doi:10.3762/bjoc.20.185

  • active learning, the machine learning model is trained on the available training data and the next compound to be added to the training dataset is selected based on which compound from the learning set has the highest predicted value [19] (Figure 1A). For ActiveDelta learning, training data is paired to
  • learn property differences between molecules [12]. Then, the next compound is selected based on which compound has the greatest predicted improvement from the most promising compound currently in the training dataset (Figure 1B). For the first time, we here present the ActiveDelta concept and evaluate
  • training and testing sets to simulate time-based splits. Datasets were split into training and test sets at an 80:20 ratio. Duplicate molecules were removed. For initial active learning training dataset formation, two random datapoints were selected from each original training dataset and the remaining
Full Research Paper
Published 27 Aug 2024
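
Illustrative sketch (not the authors' implementation): the excerpts above describe selecting the next compound by its predicted improvement over the most promising compound already in the training dataset, using a model trained on pairs of molecules. The Python below shows that selection step in its simplest form. It assumes each compound is described by a single numeric descriptor x, and it swaps in a one-parameter linear fit on pairwise differences for the machine learning model; both are hypothetical simplifications.

```python
from itertools import permutations


def fit_delta_slope(train):
    """Fit dy ~ slope * dx on all ordered pairs of training compounds.

    `train` is a list of (descriptor, potency) tuples.
    """
    sxy = sxx = 0.0
    for (xi, yi), (xj, yj) in permutations(train, 2):
        sxy += (xj - xi) * (yj - yi)
        sxx += (xj - xi) ** 2
    return sxy / sxx if sxx else 0.0


def select_next(train, learning_set):
    """Pick the candidate with the greatest predicted improvement over the best known compound."""
    slope = fit_delta_slope(train)
    best_x, _ = max(train, key=lambda xy: xy[1])
    return max(learning_set, key=lambda x: slope * (x - best_x))


if __name__ == "__main__":
    # Two starting data points, echoing the two-compound initial training sets.
    training = [(1.0, 2.0), (2.0, 4.0)]
    pool = [0.5, 3.0, 2.5]
    print("next compound descriptor:", select_next(training, pool))
```

Once the selected compound has been measured, it would be appended to the training list and the pairwise fit repeated.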

Computational toolbox for the analysis of protein–glycan interactions

  • Ferran Nieto-Fabregat,
  • Maria Pia Lenza,
  • Angela Marseglia,
  • Cristina Di Carluccio,
  • Antonio Molinaro,
  • Alba Silipo and
  • Roberta Marchetti

Beilstein J. Org. Chem. 2024, 20, 2084–2107, doi:10.3762/bjoc.20.180

Review
Published 22 Aug 2024

Hetero-polycyclic aromatic systems: A data-driven investigation of structure–property relationships

  • Sabyasachi Chakraborty,
  • Eduardo Mayo Yanes and
  • Renana Gershoni-Poranne

Beilstein J. Org. Chem. 2024, 20, 1817–1830, doi:10.3762/bjoc.20.160

  • -driven investigation of the newly generated COMPAS-2 dataset, which contains ~500k molecules consisting of 11 types of aromatic and antiaromatic rings and ranging in size from one to ten rings. Our analysis explores the effects of electron count, geometry, atomic composition, and heterocyclic composition
  • on a range of electronic molecular properties of PASs. Keywords: computational chemistry; database; dataset; π-conjugated; polycyclic aromatic hydrocarbons; polycyclic aromatic systems; Introduction Polycyclic aromatic systems (PASs) – molecules made up of fused aromatic rings – are among the most
  • constructing the COMPAS-2 dataset, we opted to maintain equal percentages of the different types of heterocycles (~10% of each type). This was done to avoid biasing the construction of molecules towards specific motifs. However, because there are multiple types of B-containing and N-containing heterocycles
Full Research Paper
Published 31 Jul 2024

Discovery of antimicrobial peptides clostrisin and cellulosin from Clostridium: insights into their structures, co-localized biosynthetic gene clusters, and antibiotic activity

  • Moisés Alejandro Alejo Hernandez,
  • Katia Pamela Villavicencio Sánchez,
  • Rosendo Sánchez Morales,
  • Karla Georgina Hernández-Magro Gil,
  • David Silverio Moreno-Gutiérrez,
  • Eddie Guillermo Sanchez-Rueda,
  • Yanet Teresa-Cruz,
  • Brian Choi,
  • Armando Hernández Garcia,
  • Alba Romero-Rodríguez,
  • Oscar Juárez,
  • Siseth Martínez-Caballero,
  • Mario Figueroa and
  • Corina-Diana Ceapă

Beilstein J. Org. Chem. 2024, 20, 1800–1816, doi:10.3762/bjoc.20.159

  • positions in the final dataset. The phylogenetic tree was built from the alignments using the Neighbor-Joining [43], and the evolutionary distances were calculated using the Poisson correction method. A LanL class IV lanthionine synthetase from the venezuelin cluster [44] was used as an outgroup. Square
Full Research Paper
Published 30 Jul 2024

pKalculator: A pKa predictor for C–H bonds

  • Rasmus M. Borup,
  • Nicolai Ree and
  • Jan H. Jensen

Beilstein J. Org. Chem. 2024, 20, 1614–1622, doi:10.3762/bjoc.20.144

  • . As molecular complexity increases, this task becomes more challenging. This paper introduces pKalculator, a quantum chemistry (QM)-based workflow for automatic computations of C–H pKa values, which is used to generate a training dataset for a machine learning (ML) model. The QM workflow is
  • benchmarked against 695 experimentally determined C–H pKa values in DMSO. The ML model is trained on a diverse dataset of 775 molecules with 3910 C–H sites. Our ML model predicts C–H pKa values with a mean absolute error (MAE) and a root mean squared error (RMSE) of 1.24 and 2.15 pKa units, respectively
  • compile a dataset of 732 experimental pKa values in DMSO from two different sources, Bordwell [7] and iBonD [4]. The Bordwell dataset contains experimental C–H pKa values in DMSO from 419 molecules. For the iBonD database, we select experimental C–H pKa values in DMSO for 313 molecules. As the iBonD
Full Research Paper
Published 16 Jul 2024

Generation of multimillion chemical space based on the parallel Groebke–Blackburn–Bienaymé reaction

  • Evgen V. Govor,
  • Vasyl Naumchyk,
  • Ihor Nestorak,
  • Dmytro S. Radchenko,
  • Dmytro Dudenko,
  • Yurii S. Moroz,
  • Olexiy D. Kachkovsky and
  • Oleksandr O. Grygorenko

Beilstein J. Org. Chem. 2024, 20, 1604–1613, doi:10.3762/bjoc.20.143

  • (due to the large size of the dataset, a preliminary clusterization was performed to achieve ca. 5-fold size reduction); C) ZINC15 drug-like compounds, and D) Enamine’s stock screening collection. Average T values are shown by dotted lines. t-Distributed stochastic neighbor embedding (t-SNE
Full Research Paper
Published 16 Jul 2024

Mining raw plant transcriptomic data for new cyclopeptide alkaloids

  • Draco Kriger,
  • Michael A. Pasquale,
  • Brigitte G. Ampolini and
  • Jonathan R. Chekan

Beilstein J. Org. Chem. 2024, 20, 1548–1559, doi:10.3762/bjoc.20.138

  • dataset used are listed in this Supporting Table File along with core peptide sequence alignments. Supporting Information File 11: The full cladogram with species names as a high-resolution pdf. Supporting Information File 12: The split burpitide precursor peptide HMM and output from all 700
Full Research Paper
Published 11 Jul 2024

Bioinformatic prediction of the stereoselectivity of modular polyketide synthase: an update of the sequence motifs in ketoreductase domain

  • Changjun Xiang,
  • Shunyu Yao,
  • Ruoyu Wang and
  • Lihan Zhang

Beilstein J. Org. Chem. 2024, 20, 1476–1485, doi:10.3762/bjoc.20.131

  • , deepening our understanding of the stereocontrol of PKSs. Results and Discussion Preparation of KR sequence dataset We first curated the amino acid sequences of KR domains from characterized bacterial cis-AT PKSs recorded in MIBiG database [20] and by manual literature review. In total, 1,762 KRs whose
  • that corresponded exactly to the results predicted by bioinformatics were also considered reliable. Compounds of which only the relative configurations were elucidated were excluded from the dataset. (c) Sequences for which it was impossible to infer the stereochemistry of KR product were removed, such
  • Figure 3. For full-length sequence logo, see Figure S5 in Supporting Information File 1. Summary of the updated fingerprints sorted by the taxonomic origin and the module type. Percentage numbers show KRs meeting the fingerprint description in our curated dataset. (a) Motifs useful for the stereochemical
Full Research Paper
Published 02 Jul 2024

Predicting bond dissociation energies of cyclic hypervalent halogen reagents using DFT calculations and graph attention network model

  • Yingbo Shao,
  • Zhiyuan Ren,
  • Zhihui Han,
  • Li Chen,
  • Yao Li and
  • Xiao-Song Xue

Beilstein J. Org. Chem. 2024, 20, 1444–1452, doi:10.3762/bjoc.20.127

  • 209 heterolytic BDE data points. Taking homolytic BDE datasets as an example (Figure 4a), the distribution of this dataset is illustrated with key bond energy values normalized using min–max scaling. This approach ensures both data consistency and improves training efficiency. We used the GAT model as
  • the core framework, incorporating ten selected atomic descriptors as local information within the graph structure. Effective molecular transformations into molecular graphs (Figure 4b) were achieved using the RDKit and Deep Graph Library [87]. The dataset was randomly divided into training and testing
  • superior predictive results by not distinguishing between halogen categories in the dataset. This approach is reliable and efficient in assisting chemists in estimating the bond energy ranges of novel cyclic hypervalent halogen reagents. We conducted additional tests with cyclic hypervalent halogen
Letter
Published 28 Jun 2024
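
Illustrative sketch (not the authors' code): the excerpts above mention min–max scaling of bond energy values and a random division into training and testing sets. A plain-Python version of those two preprocessing steps could look like this. The split ratio is not visible in the excerpt, so test_fraction is left as a required parameter; the function names, the seed, and the sample values are hypothetical.

```python
import random


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_split(items, test_fraction, seed=0):
    """Shuffle a copy of the items and return (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    energies = [10.0, 20.0, 30.0]  # made-up numbers, for illustration only
    print(min_max_scale(energies))
    train, test = random_split(range(10), test_fraction=0.2)
    print(len(train), "training items,", len(test), "testing items")
```

To avoid leaking information from the testing set, the minimum and maximum used for scaling would normally be taken from the training portion only.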

Synthesis of photo- and ionochromic N-acylated 2-(aminomethylene)benzo[b]thiophene-3(2H)-ones with a terminal phenanthroline group

  • Vladimir P. Rybalkin,
  • Sofiya Yu. Zmeeva,
  • Lidiya L. Popova,
  • Irina V. Dubonosova,
  • Olga Yu. Karlutova,
  • Oleg P. Demidov,
  • Alexander D. Dubonosov and
  • Vladimir A. Bren

Beilstein J. Org. Chem. 2024, 20, 552–560, doi:10.3762/bjoc.20.47

  • ™ Impact instrument (electrospray ionization). Melting points were determined on a Fisher–Johns melting point apparatus. X-ray diffraction study The X-ray diffraction dataset of compound 3b was recorded on an Agilent SuperNova diffractometer using a microfocus X-ray radiation source with copper anode and
  • dedicated CrysAlisPro software suite [34]. The structure was solved with the ShelXT program [35] and refined with the ShelXL program [36], and the graphics were rendered using the Olex2 software suite [37]. The complete X-ray structural dataset for compound 2a was deposited with the Cambridge
Full Research Paper
Published 11 Mar 2024

NMRium: Teaching nuclear magnetic resonance spectra interpretation in an online platform

  • Luc Patiny,
  • Hamed Musallam,
  • Alejandro Bolaños,
  • Michaël Zasso,
  • Julien Wist,
  • Metin Karayilan,
  • Eva Ziegler,
  • Johannes C. Liermann and
  • Nils E. Schlörer

Beilstein J. Org. Chem. 2024, 20, 25–31, doi:10.3762/bjoc.20.4

  • which can be embedded in other websites for more advanced teaching scenarios. Teaching NMR assignment Finally, teachers can also give their more advanced students assignments without the online quiz functionality, by providing spectra or a complete .nmrium dataset for the students to process, analyze
Perspective
Published 05 Jan 2024

GlAIcomics: a deep neural network classifier for spectroscopy-augmented mass spectrometric glycans data

  • Thomas Barillot,
  • Baptiste Schindler,
  • Baptiste Moge,
  • Elisa Fadda,
  • Franck Lépine and
  • Isabelle Compagnon

Beilstein J. Org. Chem. 2023, 19, 1825–1831, doi:10.3762/bjoc.19.134

  • the trained model. Results and Discussion Model classification accuracy Our GlAIcomics model shows a classification accuracy of 100% on the validation set and 99.98% on the test set (S.M : dataset 2 in Table 1). The 8000 synthetic spectra of set 2 were sorted by noise level, amplitude modulation
  • to identify false positives. However, a small number of negative results is expected, which makes it feasible to assess them systematically. False negatives could be identified manually, labelled correctly, and injected back to improve the model. The third dataset was used to evaluate the model
  • . For the test set (dataset 2), the accuracy is 99.91%, 99.61%, and 99.98%, respectively. When the accuracy of the prediction is further investigated as a function of the data augmentation parameters used to model experimental fluctuations, an advantage is found for GlAIcomics and RF over XGBoost
Full Research Paper
Published 05 Dec 2023

Discrimination of β-cyclodextrin/hazelnut (Corylus avellana L.) oil/flavonoid glycoside and flavonolignan ternary complexes by Fourier-transform infrared spectroscopy coupled with principal component analysis

  • Nicoleta G. Hădărugă,
  • Gabriela Popescu,
  • Dina Gligor (Pane),
  • Cristina L. Mitroi,
  • Sorin M. Stanciu and
  • Daniel Ioan Hădărugă

Beilstein J. Org. Chem. 2023, 19, 380–398, doi:10.3762/bjoc.19.30

  • samples and identification of the important FTIR variables for such classifications. PCA is a widely used multivariate statistical analysis technique that can extract valuable information from a large dataset. It is the case of FTIR data (both wavenumbers and intensities), where were assigned 20, 17, 34
  • give information about the influence of variables on the classification of cases. Only a few PCs will extract the useful information from the dataset. As a consequence, the large number of variables will be reduced to only 2–4 PCs that will explain the variance of the data. Discrimination of flavonoid
Full Research Paper
Published 28 Mar 2023

Navigating and expanding the roadmap of natural product genome mining tools

  • Friederike Biermann,
  • Sebastian L. Wenski and
  • Eric J. N. Helfrich

Beilstein J. Org. Chem. 2022, 18, 1656–1671, doi:10.3762/bjoc.18.178

  • research fields like image recognition [65][66][67][68]. Most ML-based tools utilize “supervised learning,” a strategy that employs a dataset with known classifications to train the algorithm [72]. Traditional ML algorithms include regression, decision tree-based classifiers, and support vector machines
  • sequence or sequence homology based, they heavily rely on the sequence space of known BGCs. The bias of hard-coded algorithms is embedded in the biosynthetic rules used for BGC detection and the dataset used to create pHMMs. The bias of ML-based algorithms results from their training sets that usually
Perspective
Published 06 Dec 2022

A systems-based framework to computationally describe putative transcription factors and signaling pathways regulating glycan biosynthesis

  • Theodore Groth,
  • Rudiyanto Gunawan and
  • Sriram Neelamegham

Beilstein J. Org. Chem. 2021, 17, 1712–1724, doi:10.3762/bjoc.17.119

  • –glycosylation pathway enrichments found here represent the starting point for wet-lab and orthogonal dataset validation. Such studies could enhance our fundamental understanding of glycosylation pathway regulation, and lead to novel ways to control the glycogenes and glycan structures during health and disease
  • article (Supporting Information File 3, Table S1). In total, the full dataset contained 45,238 TF-to-glycogene relationships, including relational data for 570 unique TFs found in the 29 cancer systems across all the glycogenes. Positive regulatory relationships between TFs and glycogenes were selected
Full Research Paper
Published 22 Jul 2021

Volatile emission and biosynthesis in endophytic fungi colonizing black poplar leaves

  • Christin Walther,
  • Pamela Baumann,
  • Katrin Luck,
  • Beate Rothe,
  • Peter H. W. Biedermann,
  • Jonathan Gershenzon,
  • Tobias G. Köllner and
  • Sybille B. Unsicker

Beilstein J. Org. Chem. 2021, 17, 1698–1711, doi:10.3762/bjoc.17.118

  • between calculated retention index and literature data were within ±5 points. Identified volatiles with a similarity hit above 90% and that were present in five out of seven replicates were included in this study, whereas VOCs which were also collected by blanks were removed from the final dataset. A
Full Research Paper
Published 22 Jul 2021