Search results

Search for "dataset" in Full Text gives 46 result(s) in Beilstein Journal of Organic Chemistry.

Enhancing chemical synthesis planning: automated quantum mechanics-based regioselectivity prediction for C–H activation with directing groups

Julius Seumer,
Nicolai Ree and
Jan H. Jensen

Beilstein J. Org. Chem. 2025, 21, 1171–1182, doi:10.3762/bjoc.21.94

computational cost. Validation against a comprehensive dataset reveals that the workflow achieves high accuracy, significantly surpassing traditional models in both speed and predictive capability. This development promises substantial advancements in the design of new synthetic routes, offering rapid and

approach performs well on this dataset, it does not generalize well to other molecules since not all relevant DGs are covered in the work by Tomberg and colleagues [9]. This is evidenced by our analysis using a dataset curated from Reaxys. Using our implementation of the method presented by Tomberg et al

. [9] (for further details see section “Pattern matching”), we could only obtain correct predictions for four out of ten molecules, see section “Dataset curated from Reaxys”. This underscores the necessity for more robust and versatile predictive models that can adapt to the broad spectrum of organic

Full Research Paper

Published 16 Jun 2025

Supramolecular assembly of hypervalent iodine macrocycles and alkali metals

Krishna Pandey,
Lucas X. Orton,
Grayson Venus,
Waseem A. Hussain,
Toby Woods,
Lichang Wang and
Kyle N. Plunkett

Beilstein J. Org. Chem. 2025, 21, 1095–1103, doi:10.3762/bjoc.21.87

protocol, and the characterization data are consistent with the original dataset [18]. Through a series of crystallographic experiments, we demonstrate that HIMs coordinate with alkali metals through the periphery carbonyl oxygens via metal–oxygen bonding to form a higher order metal-coordinated

Full Research Paper

Published 30 May 2025

Data accessibility in the chemical sciences: an analysis of recent practice in organic chemistry journals

Sally Bloodworth,
Cerys Willoughby and
Simon J. Coles

Beilstein J. Org. Chem. 2025, 21, 864–876, doi:10.3762/bjoc.21.70

coding of responses, list of data types associated with each article, and the resulting main dataset from assessment of 240 research papers are available in our supporting data package. As all research articles include results based on original (raw) data, and include previously unreported chemical

landing page provides this link back to the main article. Find_3, ‘is the dataset assigned a unique, citable and persistent identifier?’ assesses if the data can be found over the long-term. We found poor compliance (24% of primary data) as only files uploaded to a formal repository had a DOI. Two

datasets uploaded to Zenodo [41], one data package deposited with an institutional repository, and MODEL data uploaded in native formats had a unique identifier assigned by the repository for the data or dataset. None of the SI PDFs containing MODEL data uploaded to Figshare can be considered to have a DOI

Full Research Paper

Published 02 May 2025

Emerging trends in the optimization of organic synthesis through high-throughput tools and machine learning

Pablo Quijano Velasco,
Kedar Hippalgaonkar and
Balamurugan Ramalingam

Beilstein J. Org. Chem. 2025, 21, 10–38, doi:10.3762/bjoc.21.3

conditions (i.e., temperature, time, solvent, catalyst, etc.) and the corresponding outcome values for the target optimization objectives (e.g., yield, purity, cost, etc.). The initial dataset is commonly obtained by sampling a combination of reaction variables from the parametric space, performing the

sampling, and centerpoint sampling methods. Alternatively, the initial dataset can be obtained from values previously reported in the literature. After that, one or various predictive models are fitted to the initial dataset to predict the expected values of the optimization objectives. The number of

optimization objectives. Finally, a set of the most promising suggestions is selected and tested experimentally. The dataset is then updated with the outcomes of the latest experimental parameters, and the process is repeated until the optimal conditions have been found. Depending on the number of objectives

Review

Published 06 Jan 2025

Chemical structure metagenomics of microbial natural products: surveying nonribosomal peptides and beyond

Thomas Ma and
John Chu

Beilstein J. Org. Chem. 2024, 20, 3050–3060, doi:10.3762/bjoc.20.253

to the genetic code) [58]. Hundreds of known nonribosomal codes and their corresponding BBs can be extracted from natural products that have been characterized over the past several decades, generating a dataset to train NRP prediction algorithms (Figure 3b) [49][59][60]. A software suite called

failed to match a nonribosomal code in the algorithm training dataset and were deemed “unpredictable” [37]. It is also possible that the A domain in question aligned so poorly to prototypical A domains that prevented the proper identification of the nonribosomal code itself [70]. Regardless of the

scenario, these bioinformatically intractable A domains are distinct from known ones and point to enormous biosynthetic novelty that still awaits our exploration. Compiling a dataset for training A domain substrate prediction algorithms has never been the objective for natural product research in the past

Perspective

Published 20 Nov 2024

Structure and thermal stability of phosphorus-iodonium ylids

Andrew Greener,
Stephen P. Argent,
Coby J. Clarke and
Miriam L. O’Duill

Beilstein J. Org. Chem. 2024, 20, 2931–2939, doi:10.3762/bjoc.20.245

investigated in our study (Table 1). Thermal stability data The phosphorus-iodonium ylids were analysed by differential scanning calorimetry (DSC) and thermogravimetric analysis (TGA) [34][35], and results have been summarised in Table 2 and Figure 3. (The full dataset is available in Supporting Information

-iodonium ylids at 10 °C min−1 in N2 (full dataset in Supporting Information File 1). (b) First derivatives of TGA thermograms normalised to the intensity of the first peak. (c) Correlation of Tonset with the dihedral angle φ (between the R–I–X bond and the plane of the arene substituent). (d–g) DSC

Full Research Paper

Published 14 Nov 2024

Applications of microscopy and small angle scattering techniques for the characterisation of supramolecular gels

Connor R. M. MacDonald and
Emily R. Draper

Beilstein J. Org. Chem. 2024, 20, 2608–2634, doi:10.3762/bjoc.20.220

Review

Published 16 Oct 2024

Machine learning-guided strategies for reaction conditions design and optimization

Lung-Yi Chen and
Yi-Pei Li

Beilstein J. Org. Chem. 2024, 20, 2476–2492, doi:10.3762/bjoc.20.212

two types of reaction condition models based on their scope of applicability and dataset size: global and local models. The global models cover a wide range of reaction types and typically predict the experimental conditions based on a predefined list derived from literature data. However, this method

aspects of the dataset features and data preprocessing methods. Moreover, we introduce common algorithms and representative studies for developing both global and local models. We highlight representative studies that demonstrate the effectiveness and applicability of these algorithms in real-world

technique to construct a dataset of ≈693k reactions with detailed procedures and developed a sequence-to-sequence model to predict synthetic steps that are actionable and compatible with robotic platforms [70]. Guo et al. [71] conducted a continual pretraining scheme on the BERT model [72] to obtain a

Review

Published 04 Oct 2024

Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis

Stefan P. Schmid,
Leon Schlosser,
Frank Glorius and
Kjell Jorner

Beilstein J. Org. Chem. 2024, 20, 2280–2304, doi:10.3762/bjoc.20.196

quality of the underlying dataset will determine the model’s predictive capabilities. To obtain high predictive accuracy for a broad range of problems, a data set is sought which covers the problem space comprehensively. This does not only encompass the chemical diversity of the included molecules, but

features can vary based on the reactions that are contained in the training and test set. While descriptors can help in gaining mechanistic insight, it is important to not overinterpret the significance of single features to form a mechanistic hypothesis. Ideally, to overcome issues such as a high dataset

best of the authors’ knowledge, no HTE dataset has found widespread application in ML for organocatalysis, Denmark and co-workers published a data set comprising more than 1,000 organocatalytic transformations [67]. In their work, the authors demonstrated a data-driven workflow to study the

Review

Published 10 Sep 2024

Finding the most potent compounds using active learning on molecular pairs

Zachary Fralish and
Daniel Reker

Beilstein J. Org. Chem. 2024, 20, 2152–2162, doi:10.3762/bjoc.20.185

active learning, the machine learning model is trained on the available training data and the next compound to be added to the training dataset is selected based on which compound from the learning set has the highest predicted value [19] (Figure 1A). For ActiveDelta learning, training data is paired to

learn property differences between molecules [12]. Then, the next compound is selected based on which compound has the greatest predicted improvement from the most promising compound currently in the training dataset (Figure 1B). For the first time, we here present the ActiveDelta concept and evaluate

training and testing sets to simulate time-based splits. Datasets were split into training and test sets at an 80:20 ratio. Duplicate molecules were removed. For initial active learning training dataset formation, two random datapoints were selected from each original training dataset and the remaining

Full Research Paper

Published 27 Aug 2024

Computational toolbox for the analysis of protein–glycan interactions

Ferran Nieto-Fabregat,
Maria Pia Lenza,
Angela Marseglia,
Cristina Di Carluccio,
Antonio Molinaro,
Alba Silipo and
Roberta Marchetti

Beilstein J. Org. Chem. 2024, 20, 2084–2107, doi:10.3762/bjoc.20.180

Review

Published 22 Aug 2024

Hetero-polycyclic aromatic systems: A data-driven investigation of structure–property relationships

Sabyasachi Chakraborty,
Eduardo Mayo Yanes and
Renana Gershoni-Poranne

Beilstein J. Org. Chem. 2024, 20, 1817–1830, doi:10.3762/bjoc.20.160

-driven investigation of the newly generated COMPAS-2 dataset, which contains ~500k molecules consisting of 11 types of aromatic and antiaromatic rings and ranging in size from one to ten rings. Our analysis explores the effects of electron count, geometry, atomic composition, and heterocyclic composition

on a range of electronic molecular properties of PASs. Keywords: computational chemistry; database; dataset; π-conjugated; polycyclic aromatic hydrocarbons; polycyclic aromatic systems; Introduction Polycyclic aromatic systems (PASs) – molecules made up of fused aromatic rings – are among the most

constructing the COMPAS-2 dataset, we opted to maintain equal percentages of the different types of heterocycles (~10% of each type). This was done to avoid biasing the construction of molecules towards specific motifs. However, because there are multiple types of B-containing and N-containing heterocycles

Full Research Paper

Published 31 Jul 2024

Discovery of antimicrobial peptides clostrisin and cellulosin from Clostridium: insights into their structures, co-localized biosynthetic gene clusters, and antibiotic activity

Moisés Alejandro Alejo Hernandez,
Katia Pamela Villavicencio Sánchez,
Rosendo Sánchez Morales,
Karla Georgina Hernández-Magro Gil,
David Silverio Moreno-Gutiérrez,
Eddie Guillermo Sanchez-Rueda,
Yanet Teresa-Cruz,
Brian Choi,
Armando Hernández Garcia,
Alba Romero-Rodríguez,
Oscar Juárez,
Siseth Martínez-Caballero,
Mario Figueroa and
Corina-Diana Ceapă

Beilstein J. Org. Chem. 2024, 20, 1800–1816, doi:10.3762/bjoc.20.159

positions in the final dataset. The phylogenetic tree was built from the alignments using the Neighbor-Joining method [43], and the evolutionary distances were calculated using the Poisson correction method. A LanL class IV lanthionine synthetase from the venezuelin cluster [44] was used as an outgroup. Square

Full Research Paper

Published 30 Jul 2024

pKalculator: A pK_a predictor for C–H bonds

Rasmus M. Borup,
Nicolai Ree and
Jan H. Jensen

Beilstein J. Org. Chem. 2024, 20, 1614–1622, doi:10.3762/bjoc.20.144

. As molecular complexity increases, this task becomes more challenging. This paper introduces pKalculator, a quantum chemistry (QM)-based workflow for automatic computations of C–H pKa values, which is used to generate a training dataset for a machine learning (ML) model. The QM workflow is

benchmarked against 695 experimentally determined C–H pKa values in DMSO. The ML model is trained on a diverse dataset of 775 molecules with 3910 C–H sites. Our ML model predicts C–H pKa values with a mean absolute error (MAE) and a root mean squared error (RMSE) of 1.24 and 2.15 pKa units, respectively

compile a dataset of 732 experimental pKa values in DMSO from two different sources, Bordwell [7] and iBonD [4]. The Bordwell dataset contains experimental C–H pKa values in DMSO from 419 molecules. For the iBonD database, we select experimental C–H pKa values in DMSO for 313 molecules. As the iBonD

Full Research Paper

Published 16 Jul 2024

Generation of multimillion chemical space based on the parallel Groebke–Blackburn–Bienaymé reaction

Evgen V. Govor,
Vasyl Naumchyk,
Ihor Nestorak,
Dmytro S. Radchenko,
Dmytro Dudenko,
Yurii S. Moroz,
Olexiy D. Kachkovsky and
Oleksandr O. Grygorenko

Beilstein J. Org. Chem. 2024, 20, 1604–1613, doi:10.3762/bjoc.20.143

(due to the large size of the dataset, a preliminary clusterization was performed to achieve ca. 5-fold size reduction); C) ZINC15 drug-like compounds, and D) Enamine’s stock screening collection. Average T values are shown by dotted lines. t-Distributed stochastic neighbor embedding (t-SNE

Full Research Paper

Published 16 Jul 2024

Mining raw plant transcriptomic data for new cyclopeptide alkaloids

Draco Kriger,
Michael A. Pasquale,
Brigitte G. Ampolini and
Jonathan R. Chekan

Beilstein J. Org. Chem. 2024, 20, 1548–1559, doi:10.3762/bjoc.20.138

dataset used are listed in this Supporting Table File along with core peptide sequence alignments. Supporting Information File 11: The full cladogram with species names as a high-resolution pdf. Supporting Information File 12: The split burpitide precursor peptide HMM and output from all 700

Full Research Paper

Published 11 Jul 2024

Bioinformatic prediction of the stereoselectivity of modular polyketide synthase: an update of the sequence motifs in ketoreductase domain

Changjun Xiang,
Shunyu Yao,
Ruoyu Wang and
Lihan Zhang

Beilstein J. Org. Chem. 2024, 20, 1476–1485, doi:10.3762/bjoc.20.131

, deepening our understanding of the stereocontrol of PKSs. Results and Discussion Preparation of KR sequence dataset We first curated the amino acid sequences of KR domains from characterized bacterial cis-AT PKSs recorded in MIBiG database [20] and by manual literature review. In total, 1,762 KRs whose

that corresponded exactly to the results predicted by bioinformatics were also considered reliable. Compounds of which only the relative configurations were elucidated were excluded from the dataset. (c) Sequences for which it was impossible to infer the stereochemistry of KR product were removed, such

Figure 3. For full-length sequence logo, see Figure S5 in Supporting Information File 1. Summary of the updated fingerprints sorted by the taxonomic origin and the module type. Percentage numbers show KRs meeting the fingerprint description in our curated dataset. (a) Motifs useful for the stereochemical

Full Research Paper

Published 02 Jul 2024

Predicting bond dissociation energies of cyclic hypervalent halogen reagents using DFT calculations and graph attention network model

Yingbo Shao,
Zhiyuan Ren,
Zhihui Han,
Li Chen,
Yao Li and
Xiao-Song Xue

Beilstein J. Org. Chem. 2024, 20, 1444–1452, doi:10.3762/bjoc.20.127

209 heterolytic BDE data points. Taking homolytic BDE datasets as an example (Figure 4a), the distribution of this dataset is illustrated with key bond energy values normalized using min–max scaling. This approach ensures both data consistency and improves training efficiency. We used the GAT model as

the core framework, incorporating ten selected atomic descriptors as local information within the graph structure. Effective molecular transformations into molecular graphs (Figure 4b) were achieved using the RDKit and Deep Graph Library [87]. The dataset was randomly divided into training and testing

superior predictive results by not distinguishing between halogen categories in the dataset. This approach is reliable and efficient in assisting chemists in estimating the bond energy ranges of novel cyclic hypervalent halogen reagents. We conducted additional tests with cyclic hypervalent halogen

Letter

Published 28 Jun 2024

Synthesis of photo- and ionochromic N-acylated 2-(aminomethylene)benzo[b]thiophene-3(2Н)-ones with a terminal phenanthroline group

Vladimir P. Rybalkin,
Sofiya Yu. Zmeeva,
Lidiya L. Popova,
Irina V. Dubonosova,
Olga Yu. Karlutova,
Oleg P. Demidov,
Alexander D. Dubonosov and
Vladimir A. Bren

Beilstein J. Org. Chem. 2024, 20, 552–560, doi:10.3762/bjoc.20.47

™ Impact instrument (electrospray ionization). Melting points were determined on a Fisher–Johns melting point apparatus. X-ray diffraction study The X-ray diffraction dataset of compound 3b was recorded on an Agilent SuperNova diffractometer using a microfocus X-ray radiation source with copper anode and

dedicated CrysAlisPro software suite [34]. The structure was solved with the ShelXT program [35] and refined with the ShelXL program [36], and the graphics were rendered using the Olex2 software suite [37]. The complete X-ray structural dataset for compound 2a was deposited with the Cambridge

Full Research Paper

Published 11 Mar 2024

NMRium: Teaching nuclear magnetic resonance spectra interpretation in an online platform

Luc Patiny,
Hamed Musallam,
Alejandro Bolaños,
Michaël Zasso,
Julien Wist,
Metin Karayilan,
Eva Ziegler,
Johannes C. Liermann and
Nils E. Schlörer

Beilstein J. Org. Chem. 2024, 20, 25–31, doi:10.3762/bjoc.20.4

which can be embedded in other websites for more advanced teaching scenarios. Teaching NMR assignment Finally, teachers can also give their more advanced students assignments without the online quiz functionality, by providing spectra or a complete .nmrium dataset for the students to process, analyze

Perspective

Published 05 Jan 2024

GlAIcomics: a deep neural network classifier for spectroscopy-augmented mass spectrometric glycans data

Thomas Barillot,
Baptiste Schindler,
Baptiste Moge,
Elisa Fadda,
Franck Lépine and
Isabelle Compagnon

Beilstein J. Org. Chem. 2023, 19, 1825–1831, doi:10.3762/bjoc.19.134

the trained model. Results and Discussion Model classification accuracy Our GlAIcomics model shows a classification accuracy of 100% on the validation set and 99.98% on the test set (S.M : dataset 2 in Table 1). The 8000 synthetic spectra of set 2 were sorted by noise level, amplitude modulation

to identify false positives. However, a small number of negative results is expected, which makes it doable to assess them systematically. False negatives could be identified manually, labelled correctly, and injected back to improve the model. The third dataset was used to evaluate the model

. For the test set (dataset 2), the accuracy is 99.91%, 99.61%, and 99.98%, respectively. When the accuracy of the prediction is further investigated as a function of the data augmentation parameters used to model experimental fluctuations, an advantage is found for GlAIcomics and RF over XGBoost

Full Research Paper

Published 05 Dec 2023

Discrimination of β-cyclodextrin/hazelnut (Corylus avellana L.) oil/flavonoid glycoside and flavonolignan ternary complexes by Fourier-transform infrared spectroscopy coupled with principal component analysis

Nicoleta G. Hădărugă,
Gabriela Popescu,
Dina Gligor (Pane),
Cristina L. Mitroi,
Sorin M. Stanciu and
Daniel Ioan Hădărugă

Beilstein J. Org. Chem. 2023, 19, 380–398, doi:10.3762/bjoc.19.30

samples and identification of the important FTIR variables for such classifications. PCA is a widely used multivariate statistical analysis technique that can extract valuable information from a large dataset. It is the case of FTIR data (both wavenumbers and intensities), where were assigned 20, 17, 34

give information about the influence of variables to the classification of cases. Only few PCs will extract the useful information from the dataset. As a consequence, the large number of variables will be reduced to only 2–4 PCs that will explain the variance of the data. Discrimination of flavonoid

Full Research Paper

Published 28 Mar 2023

Navigating and expanding the roadmap of natural product genome mining tools

Friederike Biermann,
Sebastian L. Wenski and
Eric J. N. Helfrich

Beilstein J. Org. Chem. 2022, 18, 1656–1671, doi:10.3762/bjoc.18.178

research fields like image recognition [65][66][67][68]. Most ML-based tools utilize “supervised learning,” a strategy that employs a dataset with known classifications to train the algorithm [72]. Traditional ML algorithms include regression, decision tree-based classifiers, and support vector machines

sequence or sequence homology based, they heavily rely on the sequence space of known BGCs. The bias of hard-coded algorithms is embedded in the biosynthetic rules used for BGC detection and the dataset used to create pHMMs. The bias of ML-based algorithms results from their training sets that usually

Perspective

Published 06 Dec 2022

A systems-based framework to computationally describe putative transcription factors and signaling pathways regulating glycan biosynthesis

Theodore Groth,
Rudiyanto Gunawan and
Sriram Neelamegham

Beilstein J. Org. Chem. 2021, 17, 1712–1724, doi:10.3762/bjoc.17.119

–glycosylation pathway enrichments found here represent the starting point for wet-lab and orthogonal dataset validation. Such studies could enhance our fundamental understanding of glycosylation pathway regulation, and lead to novel ways to control the glycogenes and glycan structures during health and disease

article (Supporting Information File 3, Table S1). In total, the full dataset contained 45,238 TF-to-glycogene relationships, including relational data for 570 unique TFs found in the 29 cancer systems across all the glycogenes. Positive regulatory relationships between TFs and glycogenes were selected

Full Research Paper

Published 22 Jul 2021

Volatile emission and biosynthesis in endophytic fungi colonizing black poplar leaves

Christin Walther,
Pamela Baumann,
Katrin Luck,
Beate Rothe,
Peter H. W. Biedermann,
Jonathan Gershenzon,
Tobias G. Köllner and
Sybille B. Unsicker

Beilstein J. Org. Chem. 2021, 17, 1698–1711, doi:10.3762/bjoc.17.118

between calculated retention index and literature data were within ±5 points. Identified volatiles with a similarity hit above 90% and that were present in five out of seven replicates were included in this study, whereas VOCs which were also collected by blanks were removed from the final dataset. A

Full Research Paper

Published 22 Jul 2021

Query	Matches
aromatic	the word “aromatic”
aromatic aldehyde	the word “aromatic” OR “aldehyde”
+aromatic +aldehyde	both words “aromatic” AND “aldehyde”
+aromatic -aldehyde	the word “aromatic” but NOT “aldehyde”
“aromatic aldehyde”	the exact phrase “aromatic aldehyde”
benz*	words which begin with “benz”, such as “benzene” or “benzyl”
benz*yl	words that begin with “benz” and end with “yl”, such as “benzyl” or “benzoyl”
benzyl~	words that are close to the word “benzyl”, such as “benzoyl” (i.e., fuzzy search)
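
The two wildcard rows can be illustrated with a short sketch. It uses Python's shell-style `fnmatchcase` as a stand-in for the `*` operator, which is an assumption: how the journal's search engine actually evaluates wildcards is not described on this page. The helper name `matches` and the word list are hypothetical and serve only as an example.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, word: str) -> bool:
    """Return True if `word` fits `pattern`, where `*` stands for any run
    of characters (including none). Case is ignored, which is an assumption
    about the search behaviour, not a documented fact."""
    return fnmatchcase(word.lower(), pattern.lower())


# Hypothetical vocabulary used only to exercise the two wildcard forms.
words = ["benzene", "benzyl", "benzoyl", "aldehyde"]

# "benz*": words that begin with "benz".
prefix_hits = [w for w in words if matches("benz*", w)]

# "benz*yl": words that begin with "benz" and end with "yl".
infix_hits = [w for w in words if matches("benz*yl", w)]

print(prefix_hits)
print(infix_hits)
```

Under this stand-in, `benz*` selects all three "benz" words and `benz*yl` narrows the selection to "benzyl" and "benzoyl", in line with the examples given in the table. The fuzzy operator `~` is not covered, since its distance measure is not specified here.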