Search results

Search for "dataset" in Full Text gives 43 results in Beilstein Journal of Organic Chemistry.

Emerging trends in the optimization of organic synthesis through high-throughput tools and machine learning

  • Pablo Quijano Velasco,
  • Kedar Hippalgaonkar and
  • Balamurugan Ramalingam

Beilstein J. Org. Chem. 2025, 21, 10–38, doi:10.3762/bjoc.21.3

  • conditions (i.e., temperature, time, solvent, catalyst, etc.) and the corresponding outcome values for the target optimization objectives (e.g., yield, purity, cost, etc.). The initial dataset is commonly obtained by sampling a combination of reaction variables from the parametric space, performing the
  • sampling, and centerpoint sampling methods. Alternatively, the initial dataset can be obtained from values previously reported in the literature. After that, one or various predictive models are fitted to the initial dataset to predict the expected values of the optimization objectives. The number of
  • optimization objectives. Finally, a set of the most promising suggestions is selected and tested experimentally. The dataset is then updated with the outcomes of the latest experimental parameters, and the process is repeated until the optimal conditions have been found. Depending on the number of objectives
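
The loop these excerpts describe (sample an initial dataset, fit a predictive model, test the most promising suggestion, update the dataset, repeat) can be sketched roughly as below. This is only an illustration under stated assumptions: `run_experiment` is a hypothetical stand-in for a real reaction (a made-up yield curve peaking at 80 °C), the surrogate is a deliberately crude nearest-neighbour predictor rather than any model discussed in the review, and every name here is invented for the example.

```python
import random

def run_experiment(temp_c):
    """Hypothetical stand-in for a real experiment: toy yield (%) peaking at 80 °C."""
    return max(0.0, 90.0 - 0.05 * (temp_c - 80.0) ** 2)

def predict(dataset, temp_c):
    """Crude surrogate (an assumption): reuse the yield of the nearest tested condition."""
    nearest = min(dataset, key=lambda rec: abs(rec[0] - temp_c))
    return nearest[1]

def optimize(candidates, n_initial=3, n_rounds=10, seed=0):
    rng = random.Random(seed)
    # Initial dataset: randomly sampled conditions, each "tested experimentally".
    dataset = [(t, run_experiment(t)) for t in rng.sample(candidates, n_initial)]
    for _ in range(n_rounds):
        tested = {t for t, y in dataset}
        untested = [t for t in candidates if t not in tested]
        if not untested:
            break
        # Suggest the untested condition with the best predicted outcome.
        suggestion = max(untested, key=lambda t: predict(dataset, t))
        # Run it and update the dataset with the measured outcome.
        dataset.append((suggestion, run_experiment(suggestion)))
    # Report the best condition observed so far.
    return max(dataset, key=lambda rec: rec[1])
```

Calling `optimize(list(range(20, 141, 10)))` searches 13 candidate temperatures; with the default budget of 3 initial plus 10 suggested experiments every candidate ends up tested, so the toy optimum is always found. A smaller `n_rounds` shows the more realistic case where the result depends on the initial sample.
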
Review
Published 06 Jan 2025

Chemical structure metagenomics of microbial natural products: surveying nonribosomal peptides and beyond

  • Thomas Ma and
  • John Chu

Beilstein J. Org. Chem. 2024, 20, 3050–3060, doi:10.3762/bjoc.20.253

  • to the genetic code) [58]. Hundreds of known nonribosomal codes and their corresponding BBs can be extracted from natural products that have been characterized over the past several decades, generating a dataset to train NRP prediction algorithms (Figure 3b) [49][59][60]. A software suite called
  • failed to match a nonribosomal code in the algorithm training dataset and were deemed “unpredictable” [37]. It is also possible that the A domain in question aligned so poorly to prototypical A domains that it prevented the proper identification of the nonribosomal code itself [70]. Regardless of the
  • scenario, these bioinformatically intractable A domains are distinct from known ones and point to enormous biosynthetic novelty that still awaits our exploration. Compiling a dataset for training A domain substrate prediction algorithms has never been the objective for natural product research in the past
Perspective
Published 20 Nov 2024

Structure and thermal stability of phosphorus-iodonium ylids

  • Andrew Greener,
  • Stephen P. Argent,
  • Coby J. Clarke and
  • Miriam L. O’Duill

Beilstein J. Org. Chem. 2024, 20, 2931–2939, doi:10.3762/bjoc.20.245

  • investigated in our study (Table 1). Thermal stability data The phosphorus-iodonium ylids were analysed by differential scanning calorimetry (DSC) and thermogravimetric analysis (TGA) [34][35], and results have been summarised in Table 2 and Figure 3. (The full dataset is available in Supporting Information
  • -iodonium ylids at 10 °C min−1 in N2 (full dataset in Supporting Information File 1). (b) First derivatives of TGA thermograms normalised to the intensity of the first peak. (c) Correlation of Tonset with the dihedral angle φ (between the R–I–X bond and the plane of the arene substituent). (d–g) DSC
Full Research Paper
Published 14 Nov 2024

Applications of microscopy and small angle scattering techniques for the characterisation of supramolecular gels

  • Connor R. M. MacDonald and
  • Emily R. Draper

Beilstein J. Org. Chem. 2024, 20, 2608–2634, doi:10.3762/bjoc.20.220

Review
Published 16 Oct 2024

Machine learning-guided strategies for reaction conditions design and optimization

  • Lung-Yi Chen and
  • Yi-Pei Li

Beilstein J. Org. Chem. 2024, 20, 2476–2492, doi:10.3762/bjoc.20.212

  • two types of reaction condition models based on their scope of applicability and dataset size: global and local models. The global models cover a wide range of reaction types and typically predict the experimental conditions based on a predefined list derived from literature data. However, this method
  • aspects of the dataset features and data preprocessing methods. Moreover, we introduce common algorithms and representative studies for developing both global and local models. We highlight representative studies that demonstrate the effectiveness and applicability of these algorithms in real-world
  • technique to construct a dataset of ≈693k reactions with detailed procedures and developed a sequence-to-sequence model to predict synthetic steps that are actionable and compatible with robotic platforms [70]. Guo et al. [71] conducted a continual pretraining scheme on the BERT model [72] to obtain a
Review
Published 04 Oct 2024

Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis

  • Stefan P. Schmid,
  • Leon Schlosser,
  • Frank Glorius and
  • Kjell Jorner

Beilstein J. Org. Chem. 2024, 20, 2280–2304, doi:10.3762/bjoc.20.196

  • quality of the underlying dataset will determine the model’s predictive capabilities. To obtain high predictive accuracy for a broad range of problems, a data set is sought which covers the problem space comprehensively. This does not only encompass the chemical diversity of the included molecules, but
  • features can vary based on the reactions that are contained in the training and test set. While descriptors can help in gaining mechanistic insight, it is important to not overinterpret the significance of single features to form a mechanistic hypothesis. Ideally, to overcome issues such as a high dataset
  • best of the authors’ knowledge, no HTE dataset has found widespread application in ML for organocatalysis, Denmark and co-workers published a data set comprising more than 1,000 organocatalytic transformations [67]. In their work, the authors demonstrated a data-driven workflow to study the
Review
Published 10 Sep 2024

Finding the most potent compounds using active learning on molecular pairs

  • Zachary Fralish and
  • Daniel Reker

Beilstein J. Org. Chem. 2024, 20, 2152–2162, doi:10.3762/bjoc.20.185

  • active learning, the machine learning model is trained on the available training data and the next compound to be added to the training dataset is selected based on which compound from the learning set has the highest predicted value [19] (Figure 1A). For ActiveDelta learning, training data is paired to
  • learn property differences between molecules [12]. Then, the next compound is selected based on which compound has the greatest predicted improvement from the most promising compound currently in the training dataset (Figure 1B). For the first time, we here present the ActiveDelta concept and evaluate
  • training and testing sets to simulate time-based splits. Datasets were split into training and test sets at an 80:20 ratio. Duplicate molecules were removed. For initial active learning training dataset formation, two random datapoints were selected from each original training dataset and the remaining
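
The data preparation and greedy selection step mentioned in these excerpts can be illustrated with a short sketch. Several details are assumptions, not the authors' procedure: the split here is a plain random shuffle, duplicates are dropped before splitting, the molecules left over after seeding are treated as the learning pool, and `predict` is a hypothetical callable standing in for a trained model.

```python
import random

def prepare_splits(molecules, test_fraction=0.2, n_seed=2, seed=0):
    """Deduplicate, split 80:20, then seed the training set with two random points."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(molecules))  # drop duplicates, keep first occurrence
    rng.shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    test, train = unique[:n_test], unique[n_test:]
    # Two random datapoints start the training set; the rest form the learning pool.
    initial, learning = train[:n_seed], train[n_seed:]
    return initial, learning, test

def pick_next(learning, predict):
    """Greedy selection: the pool compound with the highest predicted value."""
    return max(learning, key=predict)
```

After each selection, the chosen compound would be moved from `learning` to the training set and the model refitted before calling `pick_next` again.
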
Full Research Paper
Published 27 Aug 2024

Computational toolbox for the analysis of protein–glycan interactions

  • Ferran Nieto-Fabregat,
  • Maria Pia Lenza,
  • Angela Marseglia,
  • Cristina Di Carluccio,
  • Antonio Molinaro,
  • Alba Silipo and
  • Roberta Marchetti

Beilstein J. Org. Chem. 2024, 20, 2084–2107, doi:10.3762/bjoc.20.180

Review
Published 22 Aug 2024

Hetero-polycyclic aromatic systems: A data-driven investigation of structure–property relationships

  • Sabyasachi Chakraborty,
  • Eduardo Mayo Yanes and
  • Renana Gershoni-Poranne

Beilstein J. Org. Chem. 2024, 20, 1817–1830, doi:10.3762/bjoc.20.160

  • -driven investigation of the newly generated COMPAS-2 dataset, which contains ~500k molecules consisting of 11 types of aromatic and antiaromatic rings and ranging in size from one to ten rings. Our analysis explores the effects of electron count, geometry, atomic composition, and heterocyclic composition
  • on a range of electronic molecular properties of PASs. Keywords: computational chemistry; database; dataset; π-conjugated; polycyclic aromatic hydrocarbons; polycyclic aromatic systems; Introduction Polycyclic aromatic systems (PASs) – molecules made up of fused aromatic rings – are among the most
  • constructing the COMPAS-2 dataset, we opted to maintain equal percentages of the different types of heterocycles (~10% of each type). This was done to avoid biasing the construction of molecules towards specific motifs. However, because there are multiple types of B-containing and N-containing heterocycles
Full Research Paper
Published 31 Jul 2024

Discovery of antimicrobial peptides clostrisin and cellulosin from Clostridium: insights into their structures, co-localized biosynthetic gene clusters, and antibiotic activity

  • Moisés Alejandro Alejo Hernandez,
  • Katia Pamela Villavicencio Sánchez,
  • Rosendo Sánchez Morales,
  • Karla Georgina Hernández-Magro Gil,
  • David Silverio Moreno-Gutiérrez,
  • Eddie Guillermo Sanchez-Rueda,
  • Yanet Teresa-Cruz,
  • Brian Choi,
  • Armando Hernández Garcia,
  • Alba Romero-Rodríguez,
  • Oscar Juárez,
  • Siseth Martínez-Caballero,
  • Mario Figueroa and
  • Corina-Diana Ceapă

Beilstein J. Org. Chem. 2024, 20, 1800–1816, doi:10.3762/bjoc.20.159

  • positions in the final dataset. The phylogenetic tree was built from the alignments using the Neighbor-Joining method [43], and the evolutionary distances were calculated using the Poisson correction method. A LanL class IV lanthionine synthetase from the venezuelin cluster [44] was used as an outgroup. Square
Full Research Paper
Published 30 Jul 2024

pKalculator: A pKa predictor for C–H bonds

  • Rasmus M. Borup,
  • Nicolai Ree and
  • Jan H. Jensen

Beilstein J. Org. Chem. 2024, 20, 1614–1622, doi:10.3762/bjoc.20.144

  • . As molecular complexity increases, this task becomes more challenging. This paper introduces pKalculator, a quantum chemistry (QM)-based workflow for automatic computations of C–H pKa values, which is used to generate a training dataset for a machine learning (ML) model. The QM workflow is
  • benchmarked against 695 experimentally determined C–H pKa values in DMSO. The ML model is trained on a diverse dataset of 775 molecules with 3910 C–H sites. Our ML model predicts C–H pKa values with a mean absolute error (MAE) and a root mean squared error (RMSE) of 1.24 and 2.15 pKa units, respectively
  • compile a dataset of 732 experimental pKa values in DMSO from two different sources, Bordwell [7] and iBonD [4]. The Bordwell dataset contains experimental C–H pKa values in DMSO from 419 molecules. For the iBonD database, we select experimental C–H pKa values in DMSO for 313 molecules. As the iBonD
Full Research Paper
Published 16 Jul 2024

Generation of multimillion chemical space based on the parallel Groebke–Blackburn–Bienaymé reaction

  • Evgen V. Govor,
  • Vasyl Naumchyk,
  • Ihor Nestorak,
  • Dmytro S. Radchenko,
  • Dmytro Dudenko,
  • Yurii S. Moroz,
  • Olexiy D. Kachkovsky and
  • Oleksandr O. Grygorenko

Beilstein J. Org. Chem. 2024, 20, 1604–1613, doi:10.3762/bjoc.20.143

  • (due to the large size of the dataset, a preliminary clusterization was performed to achieve ca. 5-fold size reduction); C) ZINC15 drug-like compounds, and D) Enamine’s stock screening collection. Average T values are shown by dotted lines. t-Distributed stochastic neighbor embedding (t-SNE
Full Research Paper
Published 16 Jul 2024

Mining raw plant transcriptomic data for new cyclopeptide alkaloids

  • Draco Kriger,
  • Michael A. Pasquale,
  • Brigitte G. Ampolini and
  • Jonathan R. Chekan

Beilstein J. Org. Chem. 2024, 20, 1548–1559, doi:10.3762/bjoc.20.138

  • dataset used are listed in this Supporting Table File along with core peptide sequence alignments. Supporting Information File 11: The full cladogram with species names as a high-resolution pdf. Supporting Information File 12: The split burpitide precursor peptide HMM and output from all 700
Full Research Paper
Published 11 Jul 2024

Bioinformatic prediction of the stereoselectivity of modular polyketide synthase: an update of the sequence motifs in ketoreductase domain

  • Changjun Xiang,
  • Shunyu Yao,
  • Ruoyu Wang and
  • Lihan Zhang

Beilstein J. Org. Chem. 2024, 20, 1476–1485, doi:10.3762/bjoc.20.131

  • , deepening our understanding of the stereocontrol of PKSs. Results and Discussion Preparation of KR sequence dataset We first curated the amino acid sequences of KR domains from characterized bacterial cis-AT PKSs recorded in MIBiG database [20] and by manual literature review. In total, 1,762 KRs whose
  • that corresponded exactly to the results predicted by bioinformatics were also considered reliable. Compounds of which only the relative configurations were elucidated were excluded from the dataset. (c) Sequences for which it was impossible to infer the stereochemistry of KR product were removed, such
  • Figure 3. For full-length sequence logo, see Figure S5 in Supporting Information File 1. Summary of the updated fingerprints sorted by the taxonomic origin and the module type. Percentage numbers show KRs meeting the fingerprint description in our curated dataset. (a) Motifs useful for the stereochemical
Full Research Paper
Published 02 Jul 2024

Predicting bond dissociation energies of cyclic hypervalent halogen reagents using DFT calculations and graph attention network model

  • Yingbo Shao,
  • Zhiyuan Ren,
  • Zhihui Han,
  • Li Chen,
  • Yao Li and
  • Xiao-Song Xue

Beilstein J. Org. Chem. 2024, 20, 1444–1452, doi:10.3762/bjoc.20.127

  • 209 heterolytic BDE data points. Taking homolytic BDE datasets as an example (Figure 4a), the distribution of this dataset is illustrated with key bond energy values normalized using min–max scaling. This approach ensures both data consistency and improves training efficiency. We used the GAT model as
  • the core framework, incorporating ten selected atomic descriptors as local information within the graph structure. Effective molecular transformations into molecular graphs (Figure 4b) were achieved using the RDKit and Deep Graph Library [87]. The dataset was randomly divided into training and testing
  • superior predictive results by not distinguishing between halogen categories in the dataset. This approach is reliable and efficient in assisting chemists in estimating the bond energy ranges of novel cyclic hypervalent halogen reagents. We conducted additional tests with cyclic hypervalent halogen
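
The min–max scaling mentioned in the first excerpt maps each value x to (x − min) / (max − min), so the scaled values fall in the range 0 to 1. A minimal sketch follows; the function name and the handling of a constant input are assumptions made for illustration, and the numbers in the usage note are placeholders, not BDE data from the paper.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid division by zero (returning zeros is a convention chosen here).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For instance, `min_max_scale([10.0, 20.0, 30.0])` returns `[0.0, 0.5, 1.0]`. To compare model predictions with the original energies, the same `lo` and `hi` would need to be stored and the transformation inverted.
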
Letter
Published 28 Jun 2024

Synthesis of photo- and ionochromic N-acylated 2-(aminomethylene)benzo[b]thiophene-3(2H)-ones with a terminal phenanthroline group

  • Vladimir P. Rybalkin,
  • Sofiya Yu. Zmeeva,
  • Lidiya L. Popova,
  • Irina V. Dubonosova,
  • Olga Yu. Karlutova,
  • Oleg P. Demidov,
  • Alexander D. Dubonosov and
  • Vladimir A. Bren

Beilstein J. Org. Chem. 2024, 20, 552–560, doi:10.3762/bjoc.20.47

  • ™ Impact instrument (electrospray ionization). Melting points were determined on a Fisher–Johns melting point apparatus. X-ray diffraction study The X-ray diffraction dataset of compound 3b was recorded on an Agilent SuperNova diffractometer using a microfocus X-ray radiation source with copper anode and
  • dedicated CrysAlisPro software suite [34]. The structure was solved with the ShelXT program [35] and refined with the ShelXL program [36], and the graphics were rendered using the Olex2 software suite [37]. The complete X-ray structural dataset for compound 2a was deposited with the Cambridge
Full Research Paper
Published 11 Mar 2024

NMRium: Teaching nuclear magnetic resonance spectra interpretation in an online platform

  • Luc Patiny,
  • Hamed Musallam,
  • Alejandro Bolaños,
  • Michaël Zasso,
  • Julien Wist,
  • Metin Karayilan,
  • Eva Ziegler,
  • Johannes C. Liermann and
  • Nils E. Schlörer

Beilstein J. Org. Chem. 2024, 20, 25–31, doi:10.3762/bjoc.20.4

  • which can be embedded in other websites for more advanced teaching scenarios. Teaching NMR assignment Finally, teachers can also give their more advanced students assignments without the online quiz functionality, by providing spectra or a complete .nmrium dataset for the students to process, analyze
Perspective
Published 05 Jan 2024

GlAIcomics: a deep neural network classifier for spectroscopy-augmented mass spectrometric glycans data

  • Thomas Barillot,
  • Baptiste Schindler,
  • Baptiste Moge,
  • Elisa Fadda,
  • Franck Lépine and
  • Isabelle Compagnon

Beilstein J. Org. Chem. 2023, 19, 1825–1831, doi:10.3762/bjoc.19.134

  • the trained model. Results and Discussion Model classification accuracy Our GlAIcomics model shows a classification accuracy of 100% on the validation set and 99.98% on the test set (S.M : dataset 2 in Table 1). The 8000 synthetic spectra of set 2 were sorted by noise level, amplitude modulation
  • to identify false positives. However, a small number of negative results is expected, which makes it feasible to assess them systematically. False negatives could be identified manually, labelled correctly, and injected back to improve the model. The third dataset was used to evaluate the model
  • . For the test set (dataset 2), the accuracy is 99.91%, 99.61%, and 99.98%, respectively. When the accuracy of the prediction is further investigated as a function of the data augmentation parameters used to model experimental fluctuations, an advantage is found for GlAIcomics and RF over XGBoost
Full Research Paper
Published 05 Dec 2023

Discrimination of β-cyclodextrin/hazelnut (Corylus avellana L.) oil/flavonoid glycoside and flavonolignan ternary complexes by Fourier-transform infrared spectroscopy coupled with principal component analysis

  • Nicoleta G. Hădărugă,
  • Gabriela Popescu,
  • Dina Gligor (Pane),
  • Cristina L. Mitroi,
  • Sorin M. Stanciu and
  • Daniel Ioan Hădărugă

Beilstein J. Org. Chem. 2023, 19, 380–398, doi:10.3762/bjoc.19.30

  • samples and identification of the important FTIR variables for such classifications. PCA is a widely used multivariate statistical analysis technique that can extract valuable information from a large dataset. This is the case for FTIR data (both wavenumbers and intensities), where were assigned 20, 17, 34
  • give information about the influence of variables on the classification of cases. Only a few PCs will extract the useful information from the dataset. As a consequence, the large number of variables will be reduced to only 2–4 PCs that will explain the variance of the data. Discrimination of flavonoid
Full Research Paper
Published 28 Mar 2023

Navigating and expanding the roadmap of natural product genome mining tools

  • Friederike Biermann,
  • Sebastian L. Wenski and
  • Eric J. N. Helfrich

Beilstein J. Org. Chem. 2022, 18, 1656–1671, doi:10.3762/bjoc.18.178

  • research fields like image recognition [65][66][67][68]. Most ML-based tools utilize “supervised learning,” a strategy that employs a dataset with known classifications to train the algorithm [72]. Traditional ML algorithms include regression, decision tree-based classifiers, and support vector machines
  • sequence or sequence homology based, they heavily rely on the sequence space of known BGCs. The bias of hard-coded algorithms is embedded in the biosynthetic rules used for BGC detection and the dataset used to create pHMMs. The bias of ML-based algorithms results from their training sets that usually
Perspective
Published 06 Dec 2022

A systems-based framework to computationally describe putative transcription factors and signaling pathways regulating glycan biosynthesis

  • Theodore Groth,
  • Rudiyanto Gunawan and
  • Sriram Neelamegham

Beilstein J. Org. Chem. 2021, 17, 1712–1724, doi:10.3762/bjoc.17.119

  • –glycosylation pathway enrichments found here represent the starting point for wet-lab and orthogonal dataset validation. Such studies could enhance our fundamental understanding of glycosylation pathway regulation, and lead to novel ways to control the glycogenes and glycan structures during health and disease
  • article (Supporting Information File 3, Table S1). In total, the full dataset contained 45,238 TF-to-glycogene relationships, including relational data for 570 unique TFs found in the 29 cancer systems across all the glycogenes. Positive regulatory relationships between TFs and glycogenes were selected
Full Research Paper
Published 22 Jul 2021

Volatile emission and biosynthesis in endophytic fungi colonizing black poplar leaves

  • Christin Walther,
  • Pamela Baumann,
  • Katrin Luck,
  • Beate Rothe,
  • Peter H. W. Biedermann,
  • Jonathan Gershenzon,
  • Tobias G. Köllner and
  • Sybille B. Unsicker

Beilstein J. Org. Chem. 2021, 17, 1698–1711, doi:10.3762/bjoc.17.118

  • between calculated retention index and literature data were within ±5 points. Identified volatiles with a similarity hit above 90% and that were present in five out of seven replicates were included in this study, whereas VOCs which were also collected by blanks were removed from the final dataset. A
Full Research Paper
Published 22 Jul 2021

Comparative ligand structural analytics illustrated on variably glycosylated MUC1 antigen–antibody binding

  • Christopher B. Barnett,
  • Tharindu Senapathi and
  • Kevin J. Naidoo

Beilstein J. Org. Chem. 2020, 16, 2540–2550, doi:10.3762/bjoc.16.206

  • how the sugar modulates binding. Methods The inputs, simulation scripts, Galaxy workflows (a series of tools and dataset actions that run in sequence), and data for these simulations are available at https://github.com/chrisbarnettster/bjoc-paper-2020-sm. Simulation There is an increasing number of
Full Research Paper
Published 13 Oct 2020

Tools for generating and analyzing glycan microarray data

  • Akul Y. Mehta,
  • Jamie Heimburg-Molinaro and
  • Richard D. Cummings

Beilstein J. Org. Chem. 2020, 16, 2260–2271, doi:10.3762/bjoc.16.187

  • the dataset. D) Structural information tools 1. GlyMDB: Status: Available. Address: http://www.glycanstructure.org/glymdb/. Description: GlyMDB is a web-based database which links glycan microarray binding data from the CFG database to protein structures (PDB) [51]. A user can select a dataset from
  • the CFG dataset available and set thresholds for binding versus nonbinding. The application can then show you motifs which make a significant binding contribution on the microarray. In addition it allows you to quickly search for PDB files with sequence identity matching to the protein sample put on
  • CFG glycan array data; (B) screenshot of an example of Imperial College microarray data online portal. A demonstration of glycan array data visualization with GLAD. The dataset used is from Byrd-Leotis et al. [16], and is provided as a GLAD session file in the manuscript. The dataset contains data on
Review
Published 10 Sep 2020

GlypNirO: An automated workflow for quantitative N- and O-linked glycoproteomic data analysis

  • Toan K. Phung,
  • Cassandra L. Pegg and
  • Benjamin L. Schulz

Beilstein J. Org. Chem. 2020, 16, 2127–2135, doi:10.3762/bjoc.16.180

  • statistical workflows. We used GlypNirO to analyse a published plasma glycoproteome dataset and identified changes in site-specific N- and O-glycosylation occupancy and structure associated with hepatocellular carcinoma as putative biomarkers of disease. Keywords: glycoproteomics; mass spectrometry; N
  • Fisher) and Byonic (Protein Metrics), to extract occupancy and glycoform abundance of all identified glycopeptides from LC–MS/MS datasets. We applied the workflow to a published dataset comparing the plasma glycoproteomes of liver cancer patients (hepatocellular carcinoma, HCC) and healthy controls [20
  • provide a proof-of-concept use of GlypNirO, we performed an exploratory reanalysis of a previously published dataset [20] obtained from the ProteomeXchange Consortium via the MassIVE repository (PXD003369, MSV000079426). This study performed glycoproteomic LC–MS/MS analysis of whole plasma or plasma
Full Research Paper
Published 01 Sep 2020