REVIEW | doi:10.20944/preprints201912.0332.v1
Online: 25 December 2019 (03:24:53 CET)
Natural products (NPs) have been the centre of attention of the scientific community in the last decades, and interest in them continues to grow incessantly. As a consequence, the last 20 years have seen a rapid multiplication of databases and collections serving as generalist or thematic resources for NP information. In this review, we establish a complete overview of these resources, and the numbers are overwhelming: over 120 different NP databases and collections have been published and re-used since 2000. Of these, 98 are still somehow accessible and only 50 are open access. The latter include not only databases but also large collections of NPs published as supplementary material in scientific publications, as well as collections backed up in the ZINC database of commercially available compounds. Some databases, even those published relatively recently, are already inaccessible, which leads to a dramatic loss of data on NPs. The data sources are presented in this manuscript, together with a comparison of the content of the open ones. With this review, we also compiled the open-access natural compounds into a single dataset, the COlleCtion of Open NatUral producTs (COCONUT), which is available on Zenodo and contains structures and sparse annotations for over 400,000 non-redundant NPs, making it the largest open collection of NPs available to date.
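The compilation step described above amounts to merging many collections and keeping only non-redundant structures. A minimal sketch of that idea, assuming records keyed by InChIKey (the field names and example keys are illustrative, not COCONUT's actual schema):

```python
# Illustrative sketch: merging natural-product collections while keeping
# only non-redundant structures, keyed here by (hypothetical) InChIKeys.

def merge_collections(*collections):
    """Merge compound records from several sources, deduplicating by InChIKey.

    Each record is a dict with at least an 'inchikey' field; sparse
    annotations from later duplicates are folded into the first occurrence.
    """
    merged = {}
    for collection in collections:
        for record in collection:
            key = record["inchikey"]
            if key not in merged:
                merged[key] = dict(record)
            else:
                # keep sparse annotations contributed by every source
                for field, value in record.items():
                    merged[key].setdefault(field, value)
    return list(merged.values())

db_a = [{"inchikey": "AAA", "name": "compound 1"},
        {"inchikey": "BBB", "name": "compound 2"}]
db_b = [{"inchikey": "BBB", "source": "plant X"},
        {"inchikey": "CCC", "name": "compound 3"}]

unique = merge_collections(db_a, db_b)
print(len(unique))  # 3 non-redundant structures
```

The same pattern scales to hundreds of thousands of records, since lookups in the dictionary are constant-time.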
COMMUNICATION | doi:10.20944/preprints202105.0701.v1
Subject: Chemistry, Analytical Chemistry Keywords: Natural products; databases; dereplication; taxonomy; NMR
Online: 28 May 2021 (12:59:37 CEST)
The recent revival of the study of organic natural products as renewable sources of medicinal drugs, cosmetics, dyes, and materials has motivated the creation of general-purpose structural databases. Dereplication, the efficient identification of already reported compounds, relies on the grouping of structural, taxonomic and spectroscopic databases that focus on a particular taxon (species, genus, family, order…). A set of freely available Python scripts, CNMRPredict, is proposed for quickly supplementing taxon-oriented search results from the LOTUS database (lotus.naturalproducts.net) with carbon-13 NMR data predicted by the ACD/Labs (acdlabs.com) CNMR predictor and DB software, providing easily searchable databases. The database construction process is illustrated using Brassica rapa as an example taxon.
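The supplementation step can be pictured as a join between taxon-filtered compound hits and per-compound predicted shifts, after which the table becomes searchable by chemical shift. The sketch below assumes hypothetical identifiers and field names, not the actual CNMRPredict or LOTUS interfaces:

```python
# Hypothetical sketch of the supplementation step: pair compounds returned by
# a taxon-oriented search with predicted 13C shifts, then search by shift.
# IDs, names and shift values are invented for illustration.

def build_search_table(taxon_hits, predicted_shifts):
    """taxon_hits: list of {'id', 'name'}; predicted_shifts: {id: [ppm, ...]}."""
    table = []
    for hit in taxon_hits:
        shifts = predicted_shifts.get(hit["id"])
        if shifts is not None:
            table.append({**hit, "c13_ppm": sorted(shifts)})
    return table

def find_by_shift(table, ppm, tol=1.0):
    """Return compounds with at least one predicted carbon within +/- tol ppm."""
    return [row["name"] for row in table
            if any(abs(s - ppm) <= tol for s in row["c13_ppm"])]

hits = [{"id": "NP-0001", "name": "sinapine"},
        {"id": "NP-0002", "name": "gluconapin"}]
shifts = {"NP-0001": [168.2, 130.5, 56.1], "NP-0002": [99.8, 78.3]}

table = build_search_table(hits, shifts)
print(find_by_shift(table, 130.0))  # ['sinapine']
```

A real dereplication query would match a full list of observed shifts against each candidate, but the lookup structure is the same.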
ARTICLE | doi:10.20944/preprints202103.0640.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Databases; database administration; database management systems; counting; storage; structure; search; No SQL; SQL; Oracle; relational databases; non-relational databases; magnetic tapes; punched tapes; relational model; Datamining; BigData; Datawarehouse
Online: 25 March 2021 (16:05:52 CET)
Databases are by far the most valuable asset of companies. The first databases were born when people needed not only to count but also to keep records of things such as crops, animals, money and property, records that could be consulted and modified as circumstances required. Such databases cannot remain disorganized: they must be managed and administered under established standards that make them understandable and manageable not only by their creators but also by the people who administer them afterwards. Databases and database management systems (DBMS) have an interesting evolutionary history that deserves analysis, and that is the objective of this document. Alongside databases and their management systems arises data mining: briefly, the task of finding common patterns across data sources and determining how they can be used to predict outcomes in various circumstances. We also cover Oracle Data Mining, which, roughly speaking, merges data mining with Oracle, making it a powerful tool for obtaining information and predicting results based on statistics. In this article we study and analyze the ideas, concepts and basic examples behind DBMS and data mining, and we go deeper into the use of decision techniques such as advanced statistical algorithms. We present a fictitious example of the application of these techniques: predicting which products will sell based on their relationships to others. We also give a brief explanation of association rules, the data mining cycle, the types of learning, and the evolution of data mining.
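The "which products sell together" scenario maps directly onto association rules. A minimal example of the two standard rule metrics, support and confidence, over a toy transaction list:

```python
# A minimal association-rule example: compute support and confidence for a
# candidate rule over a toy list of market-basket transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, how many also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: customers who buy bread also buy milk.
print(support({"bread"}))               # 0.75
print(confidence({"bread"}, {"milk"}))  # 2/3
```

Algorithms such as Apriori generate and prune candidate itemsets efficiently, but every rule they report is scored with exactly these two quantities.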
ARTICLE | doi:10.20944/preprints202108.0063.v1
Subject: Earth Sciences, Atmospheric Science Keywords: Bioeconomy, bibliographic databases, agricultural value chains, production.
Online: 2 August 2021 (23:07:58 CEST)
This work analyzes the visibility and scientific impact of publications related to agricultural value chains. The incidence of bibliometric indicators allows for the interpretation of bibliographic information generated worldwide. Objective: The objective of this research is to analyze the published literature and bibliometric indicators on agricultural value chains. The Web of Science database was used to extract value chain data, and the study analyzed articles published between 2010 and 2020. The keyword used was "agricultural value chains", and articles from journals or studies related to the subject were selected for bibliometric analysis and methodological review. The keyword search returned a total of 4,208 results, of which 1,669 records were considered for analysis. The bibliometric analysis reveals that Wageningen University (55) has the highest number of publications, followed by Chinese Acad Sci (26). The author Klerkx L (9) has the highest number of records, followed by Hellin J (7). The countries with the greatest contributions on the subject are the People's Republic of China, Germany, Italy, France and the United States. The study contributes to the analysis of bibliometrics and provides a methodological review of published journal articles on agricultural value chains. This bibliographic study presents the history of research development in agricultural value chains.
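The per-institution and per-author counts behind such indicators are simple frequency tallies over the exported records. A sketch, with a toy record list standing in for a real Web of Science export:

```python
# Sketch of the counting behind bibliometric indicators: tally records per
# affiliation and per author from a (hypothetical) bibliographic export.

from collections import Counter

records = [
    {"authors": ["Klerkx L", "Hellin J"], "affiliation": "Wageningen University"},
    {"authors": ["Klerkx L"], "affiliation": "Wageningen University"},
    {"authors": ["Hellin J"], "affiliation": "Chinese Acad Sci"},
]

by_affiliation = Counter(r["affiliation"] for r in records)
by_author = Counter(a for r in records for a in r["authors"])

print(by_affiliation.most_common(1))  # [('Wageningen University', 2)]
print(by_author["Klerkx L"], by_author["Hellin J"])  # 2 2
```

Real analyses add normalization steps first (author-name disambiguation, affiliation cleaning), which is usually where most of the effort goes.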
ARTICLE | doi:10.20944/preprints201806.0422.v1
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: stellar spectra; atomic and molecular data; databases
Online: 26 June 2018 (13:08:52 CEST)
The Vienna Atomic Line Database (VALD) contains data on atomic and molecular energy levels and the parameters of spectral lines required for stellar spectrum analysis. Hundreds of millions of lines for fine spectral synthesis and for opacity calculations are collected in the present version of VALD (VALD3). Critical evaluation of the data and a diversity of extraction tools support VALD's high popularity among users. The VALD3 data model incorporates obligatory links to the bibliography, making the database more attractive as a publishing platform for data producers. VALD's data quality and completeness are constantly improving, allowing better reproduction of stellar spectra. To illustrate the continuous evolution of the data content, we present a comparative analysis of recent experimental and theoretical atomic data for Fe-group elements, which will be included in the next VALD release. This release will also include the possibility of extracting line data with full isotopic and hyperfine structure.
ARTICLE | doi:10.20944/preprints201911.0186.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: crowdsensing; databases; smartphones; urban positioning; wi-fi fingerprinting
Online: 16 November 2019 (00:41:05 CET)
Wi-Fi fingerprinting positioning systems have long been deployed in location-based services for indoor environments. Combining mobile crowdsensing with Wi-Fi fingerprinting could reduce the high cost of collecting the necessary data, enabling the deployment of the resulting system for outdoor positioning in areas with dense Wi-Fi coverage. In this paper, we present the results attained in the design and evaluation of an urban fingerprinting positioning system based on crowdsensed Wi-Fi measurements. We first assess the quality of the collected measurements, highlighting the influence of received signal strength on data collection. We then evaluate the proposed system by comparing the influence of the crowdsensed fingerprints on overall positioning accuracy in different scenarios. The evaluation provides valuable insight into the design and deployment of urban Wi-Fi positioning systems, while also allowing the proposed system to match GPS-like accuracy under similar conditions.
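At its core, fingerprinting positioning matches an observed received-signal-strength (RSS) vector against a database of fingerprints with known locations. A common baseline is weighted k-nearest neighbours in signal space; the sketch below uses invented RSS values and is a generic baseline, not the paper's specific system:

```python
# Illustrative weighted kNN position estimate from Wi-Fi fingerprints: each
# fingerprint is an RSS vector (dBm per access point) with a known (x, y)
# location; an observed vector is matched in signal space.

import math

fingerprints = [
    ({"ap1": -40, "ap2": -70}, (0.0, 0.0)),
    ({"ap1": -70, "ap2": -40}, (10.0, 0.0)),
    ({"ap1": -55, "ap2": -55}, (5.0, 5.0)),
]

def rss_distance(a, b, missing=-100):
    """Euclidean distance in signal space; unheard APs default to a floor value."""
    aps = set(a) | set(b)
    return math.sqrt(sum((a.get(ap, missing) - b.get(ap, missing)) ** 2 for ap in aps))

def estimate_position(observed, k=2):
    """Inverse-distance-weighted average of the k closest fingerprint locations."""
    nearest = sorted(fingerprints, key=lambda fp: rss_distance(observed, fp[0]))[:k]
    weights = [1.0 / (rss_distance(observed, fp[0]) + 1e-9) for fp in nearest]
    total = sum(weights)
    x = sum(w * fp[1][0] for w, fp in zip(weights, nearest)) / total
    y = sum(w * fp[1][1] for w, fp in zip(weights, nearest)) / total
    return x, y

x, y = estimate_position({"ap1": -42, "ap2": -68})
print(round(x, 2), round(y, 2))
```

Crowdsensing changes the quality and density of the `fingerprints` list, not the matching step itself, which is why fingerprint quality dominates the achievable accuracy.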
REVIEW | doi:10.20944/preprints202007.0123.v1
Subject: Life Sciences, Other Keywords: causal interactions; databases; interoperability; biological pathway; logical modeling; computational biology
Online: 7 July 2020 (09:50:40 CEST)
Causal molecular interactions represent key building blocks used in computational modeling, where they facilitate the assembly of regulatory networks. These regulatory networks can then be used to predict biological and cellular behavior through system perturbations and in silico simulations. Today, broad sets of these interactions are being made available in a variety of biological knowledge resources. Moreover, different visions, based on distinct biological interests, have led to the development of multiple ways to describe and annotate causal molecular interactions. Data users can therefore find it challenging to efficiently explore causal interaction resources and to stay aware of the recorded contextual information that ensures valid use of the data. This manuscript presents a review of public resources collecting causal interactions and the different views they convey, together with a thorough description of the export formats established to store and retrieve these interactions. Our goal is to raise awareness amongst the targeted audience, i.e., logical modelers, but also any scientist interested in molecular causal interactions, about existing data resources and how to get familiar with them.
ARTICLE | doi:10.20944/preprints201709.0107.v2
Subject: Physical Sciences, Atomic & Molecular Physics Keywords: electron scattering; cross sections; Rosetta mission; atomic and molecular databases
Online: 21 October 2017 (15:30:59 CEST)
The emission of [O I] lines in the coma of comet 67P/Churyumov-Gerasimenko during the Rosetta mission has been explained by electron-impact dissociation of water rather than by photodissociation; this is direct evidence of the role of electron-induced processing on such a body. Analysis of other emission features is hampered by a lack of detailed knowledge of electron-impact cross sections, which highlights the need for a broad range of electron scattering data for the molecular systems detected on the comet. In this paper we present an overview of the electron scattering data needed to understand observations of the coma, the tenuous atmosphere and the surface of 67P/Churyumov-Gerasimenko during the Rosetta mission. The relevant observations for elucidating the role of electrons come from optical spectra, particle analysis using the ion and electron sensors, and mass spectrometry measurements. To model these processes, electron-impact data should be collated and reviewed in an electron scattering database; an example is given in BEAMD, which is part of a larger consortium, the Virtual Atomic and Molecular Data Centre (VAMDC).
REVIEW | doi:10.20944/preprints202208.0230.v1
Subject: Chemistry, General & Theoretical Chemistry Keywords: chemoinformatics; compound databases; chemical space; diversity; drug discovery; openscience; pseudo-natural product
Online: 12 August 2022 (08:39:40 CEST)
Natural products (NPs) are a rich source of structurally novel molecules, and the chemical space they encompass is far from fully explored. Throughout history, NPs have represented a significant source of bioactive molecules and have served as inspiration for developing many drugs on the market. Computer-aided drug design (CADD), meanwhile, has contributed to drug discovery research by mitigating costs and time. In this sense, compound databases represent a fundamental element of CADD. This work reviews progress toward developing compound databases of natural origin, particularly databases developed in Latin America, and their practical applications in drug discovery. We also survey the computational methods, emphasizing chemoinformatic approaches to profiling natural product databases.
REVIEW | doi:10.20944/preprints202105.0240.v1
Subject: Life Sciences, Biochemistry Keywords: LPI, lncRNA, ncRNA, protein, transcriptomics, molecular docking, machine learning, deep learning, databases
Online: 11 May 2021 (10:54:27 CEST)
Phenotypes are driven by regulated gene expression, which is in turn mediated by complex interactions between diverse biological molecules. Protein-DNA interactions such as histone and transcription factor binding are well studied, along with RNA-RNA interactions in short-RNA silencing of genes. In contrast, lncRNA-protein interaction (LPI) mechanisms are comparatively unknown, likely owing to the difficulties of studying LPI. However, LPIs are emerging as key interactions in epigenetic mechanisms, playing a role in development and disease. Their importance is further highlighted by their conservation across kingdoms, and interest in LPI research is increasing accordingly. We therefore review the current state of the art in lncRNA-protein interactions, specifically surveying recent computational methods and databases that researchers can exploit for LPI investigation. We found that algorithm development relies heavily on a few generic databases containing curated LPI information. We show that early methods, which predict LPI using molecular docking, have limited scope and are slow, creating a data-processing bottleneck. Recently, machine learning has become the strategy of choice in LPI prediction, likely due to the rapid growth in machine learning infrastructure and expertise. While many of these methods have notable limitations, machine learning is expected to be the basis of modern LPI prediction algorithms.
Subject: Life Sciences, Other Keywords: data science; reuse; sequencing data; genomics; bioinformatics; databases; computational biology; open science
Online: 16 July 2020 (12:39:43 CEST)
The 'big data revolution' has enabled novel types of analyses in the life sciences, facilitated by public sharing and reuse of datasets. Here, we review the prodigious potential of reusing publicly available datasets and the challenges, limitations and risks associated with it. Possible solutions and research integrity considerations are also discussed. Given the prominence, abundance and wide distribution of sequencing data, we focus on the reuse of publicly available sequence datasets. We define 'successful reuse' as the use of previously published data to enable novel scientific findings, and we use selected examples of such reuse from different disciplines to illustrate the enormous potential of the practice, while acknowledging their respective limitations and risks. A checklist to determine the reuse value and potential of a particular dataset is also provided. The open discussion of data reuse and the establishment of the practice as a norm have the potential to benefit all stakeholders in the life sciences.
REVIEW | doi:10.20944/preprints202107.0193.v1
Subject: Life Sciences, Biochemistry Keywords: metabolomics; plant biology; metabolomics databases; data analysis; metabolomics software tools; mass spectrometry; omics
Online: 8 July 2021 (10:46:55 CEST)
Metabolomics is now considered a wide-ranging, sensitive and practical approach for acquiring useful information on the composition of the metabolite pool present in any organism, including plants. Investigating metabolomic regulation in plants is essential to understanding their adaptation, acclimation and defense responses to environmental stresses through the production of numerous metabolites. Moreover, metabolomics can easily be applied to the phenotyping of plants and thus has great potential for use in molecular breeding and genome-editing programs to develop superior next-generation crops. This review describes the recent analytical tools and techniques available to study the plant metabolome, along with the significance of sample preparation in targeted and non-targeted methods. Advanced analytical tools, such as gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), capillary electrophoresis-mass spectrometry (CE-MS), Fourier transform ion cyclotron resonance-mass spectrometry (FTICR-MS) and matrix-assisted laser desorption/ionization (MALDI), have sped up metabolic profiling in plants. Further, we deliver a complete overview of the bioinformatics tools and plant metabolome databases that can be utilized to advance our knowledge of plant biology.
ARTICLE | doi:10.20944/preprints202102.0189.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: image quality assessment; image databases; superpixels; color image; color space; image quality measures
Online: 8 February 2021 (11:11:47 CET)
Objective Image Quality Assessment (IQA) measures play an increasingly important role in the evaluation of digital image quality. New IQA indices are expected to be strongly correlated with the subjective observer evaluations expressed by MOS/DMOS scores. One such recently proposed index is the SuperPixel-based SIMilarity (SPSIM) index, which uses superpixel patches instead of the rectangular pixel grid. In this paper, the authors propose three modifications of the SPSIM index: the color space used by SPSIM was changed, the way SPSIM determines similarity maps was modified using methods derived from the algorithm for computing the MDSI index, and the third modification combines the first two. These three new quality indices were used in the assessment process. The experimental results obtained on many color images from five image databases demonstrate the advantages of the proposed SPSIM modifications.
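Indices in this family are built from per-pixel (or per-region) similarity maps that are then pooled into a single score. The toy example below shows the generic form of such a map, s = (2ab + c)/(a² + b² + c); the constant and patch values are illustrative only, and the actual SPSIM maps are computed over superpixel regions and chromatic channels:

```python
# A toy similarity map of the kind combined inside SPSIM/MDSI-style indices,
# followed by mean pooling into a single quality score.

C = 130.0  # stabilising constant (illustrative value)

def similarity_map(img_a, img_b):
    """Per-pixel similarity s = (2ab + C) / (a^2 + b^2 + C) for two patches."""
    return [[(2 * a * b + C) / (a * a + b * b + C)
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

def pooled_score(sim_map):
    """Mean pooling of the similarity map into one scalar score."""
    values = [v for row in sim_map for v in row]
    return sum(values) / len(values)

patch = [[10, 20], [30, 40]]
identical = pooled_score(similarity_map(patch, patch))
distorted = pooled_score(similarity_map(patch, [[12, 22], [32, 42]]))
print(identical)              # 1.0 for identical patches
print(distorted < identical)  # distortion lowers the score
```

The modifications the paper studies correspond to changing which channels feed the maps (the color space) and how the maps are formed and combined (the MDSI-derived scheme).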
ARTICLE | doi:10.20944/preprints202101.0515.v1
Subject: Physical Sciences, Acoustics Keywords: Entropy; Black hole physics; Radiation mechanism: thermal; Relativity; Methods: analytical; Astronomical databases: miscellaneous
Online: 25 January 2021 (15:32:41 CET)
Is it possible to quantify, in General Relativity (GR), the entropy generated by supermassive black holes (BHs) during their evaporation time, given that the intrinsic Hawking radiation at infinity, although insignificant, matters for its effects on the thermal quantum atmosphere? Our purpose was to develop a formula for measuring the entropy generated during the evaporation time of different types of BHs: i. the remnant BHs of the binary black hole (BBH) mergers GW150914, GW151226 and LVT151012 detected by the Laser Interferometer Gravitational-Wave Observatory (LIGO); and ii. Schwarzschild, Reissner-Nordström, Kerr and Kerr-Newman BHs. We thus quantify in GR the "insignificant" quantum effects involved, in order to support the validity of the generalized second law (GSL), which directly links the laws of black hole mechanics to the ordinary laws of thermodynamics, as a starting point for unifying quantum effects with GR. This formula could also bear some relation to the imaging of the shadow of a BH's event horizon. The formula was developed by dimensional analysis, using the constants of nature and the possible evaporation time of a black hole, to quantify the entropy generated during that time. The energy-stress tensor was calculated for the four metrics to obtain the material content, and the proposed formula was applied. The entropy generated over the evaporation time of BHs proved to be insignificant, and their temperature barely exceeds absolute zero; nevertheless, calculating this type of entropy allows us to argue for the importance of the quantum effects of Hawking radiation noted by authors who have studied these effects with arguments fundamentally based on the presence of the black hole's surrounding thermal atmosphere.
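For reference, the standard semiclassical quantities that any such dimensional estimate builds on (these are the textbook expressions for a Schwarzschild black hole of mass M, not the paper's own formula):

```latex
% Bekenstein-Hawking entropy, Hawking temperature, and the standard
% semiclassical evaporation time of a Schwarzschild black hole of mass M:
S_{\mathrm{BH}} = \frac{k_B c^3 A}{4 G \hbar}, \qquad
T_H = \frac{\hbar c^3}{8 \pi G M k_B}, \qquad
t_{\mathrm{evap}} \simeq \frac{5120\,\pi G^2 M^3}{\hbar c^4}
```

The inverse dependence of T_H on M is why supermassive black holes sit barely above absolute zero, and the M³ scaling of the evaporation time is why the entropy released per unit time is so small.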
ARTICLE | doi:10.20944/preprints202012.0756.v1
Subject: Social Sciences, Education Studies Keywords: STEM Education; Energy Efficiency; CO₂ Emissions; APEC databases; Cross-Border classes; Sustainable Development
Online: 30 December 2020 (14:37:12 CET)
Early education is critical for improving energy efficiency. The purpose of this study is to explore the feasibility of Interactive Cross-Border Classes to increase awareness of energy efficiency among middle school students. We designed and tested an Interactive Cross-Border class between Chilean and Peruvian 8th-grade classes. The classes were synchronously connected and all students answered open-ended questions on an online platform. Some of the questions were designed to check conceptual understanding while others asked for suggestions of how to develop their economies while keeping CO₂ air concentration at acceptable levels. In real-time, the teacher reviewed the students’ written answers and the concept maps that were automatically generated based on their responses. Students peer-reviewed their classmates’ suggestions. This is part of an Asia-Pacific Economic Cooperation (APEC) STEM Education project on Energy Efficiency using APEC databases. We found high levels of student engagement, where students discussed not only the cross-cutting nature of energy, but also its relation to socioeconomic development and CO₂ emissions, and the need to work together to improve energy efficiency. In conclusion, Interactive Cross-Border classes are a feasible educational alternative, with potential as a scalable public policy strategy for improving awareness of energy efficiency among the population.
ARTICLE | doi:10.20944/preprints201911.0054.v1
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: atomic lifetime and oscillator strength determination; theoretical modeling and computational approaches; atomic databases and related topics
Online: 6 November 2019 (05:29:59 CET)
Orthogonal operators can be used successfully to calculate eigenvalues and eigenvector compositions in complex spectra. Orthogonality minimizes correlation between the operators and thereby gives the fit more stability, even for small interactions. The resulting eigenvectors are used to transform the pure transition matrix into realistic intermediate-coupling transition probabilities. Calculated transition probabilities for close-lying levels illustrate the power of the complete orthogonal operator approach.
ARTICLE | doi:10.20944/preprints201805.0210.v1
Subject: Chemistry, Analytical Chemistry Keywords: pomegranate; fruit juice; total antioxidant capacity; ABTS; iRAC; total phenols content; Folin-Ciocalteu; food composition; databases
Online: 15 May 2018 (08:34:06 CEST)
Pomegranate juice (PJ) has a total antioxidant capacity (TAC) reportedly higher than that of other common beverages. This short study aimed to evaluate the TAC of commercial PJ and pomegranate fruit in terms of a newly described iron(III) reducing antioxidant capacity (iRAC) assay and to compare it with ABTS free-radical quenching activity. Commercial PJ, freeze-dried pomegranate and oven-dried pomegranate were analyzed. Total phenols content (TPC) was also assessed by the Folin-Ciocalteu method. The calibration results for iRAC were comparable to the ABTS and Folin-Ciocalteu methods in terms of linearity (R² > 0.99), sensitivity and precision. The TAC of PJ, expressed as Trolox equivalent antioxidant capacity (TEAC), was 33.4 ± 0.5 mM with the iRAC method and 36.3 ± 2.1 mM with the ABTS method. For dried pomegranates, TAC was 89–110 mmol/100 g (iRAC) or 76.0 ± 4.3 mmol/100 g (ABTS). Freeze-dried pomegranate had 15% higher TAC than oven-dried pomegranate. In conclusion, pomegranate has a high TAC as evaluated by the iRAC and ABTS methods, though values vary with cultivar, geographic origin, processing and other factors. The study is relevant to attempts to refine food-composition data for pomegranate and other functional foods.
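A TEAC value of the kind reported above comes from a Trolox calibration curve: the assay response is fitted against known Trolox concentrations by least squares, and a sample's response is converted to a Trolox-equivalent concentration. The numbers below are invented for illustration, not the study's data:

```python
# Sketch of deriving a TEAC value from a Trolox calibration curve.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

trolox_mM = [0.0, 0.5, 1.0, 1.5, 2.0]     # standard concentrations (mM)
response = [0.02, 0.21, 0.40, 0.59, 0.78]  # assay response (illustrative)

slope, intercept = linear_fit(trolox_mM, response)

# Convert a sample's response into a Trolox-equivalent concentration.
sample_response = 0.50
teac_mM = (sample_response - intercept) / slope
print(round(teac_mM, 2))
```

The same calibration logic applies to iRAC, ABTS and Folin-Ciocalteu alike; only the measured response differs between assays.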
REVIEW | doi:10.20944/preprints202102.0529.v1
Subject: Chemistry, Analytical Chemistry Keywords: Caver Web; databases; libraries; microbial products; PredictSNPonco; molecular docking; molecular targets; mutations; treatment; virtual screening; web tools
Online: 23 February 2021 (15:59:19 CET)
The development of microbial products for cancer treatment has been in the spotlight in recent years. In order to accelerate the lengthy and expensive drug development process, in silico screening tools are systematically employed, especially during the initial discovery phase. Moreover, considering the steadily increasing number of molecules approved by authorities for commercial use, there is a demand for faster methods to repurpose such drugs. Here we present a review of virtual screening web tools, publicly available databases of molecular targets and libraries of ligands, with the aim of facilitating the discovery of potential anticancer drugs based on microbial products. We provide an entry-level step-by-step description of the workflow for virtual screening of microbial metabolites with known protein targets, as well as two practical examples using freely available web tools. The first case presents a virtual screening study of drugs developed from microbial products using Caver Web, a web tool that performs docking along a tunnel. The second case comprises a comparative analysis between healthy isocitrate dehydrogenase 1 and a cancer-associated mutant, using the recently developed web tool PredictSNPOnco. In summary, this review provides the basic and essential background information necessary for virtual screening experiments, which may accelerate the discovery of novel anticancer drugs.
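The post-docking step of such a workflow reduces to filtering a ligand library on simple property cutoffs and ranking the remainder by docking score (more negative meaning better predicted binding). All names, weights and scores below are invented for illustration:

```python
# Entry-level sketch of the filter-and-rank step after docking a library.

library = [
    {"name": "metabolite A", "mol_weight": 320, "score": -8.1},
    {"name": "metabolite B", "mol_weight": 610, "score": -9.5},  # too heavy
    {"name": "metabolite C", "mol_weight": 450, "score": -7.2},
    {"name": "metabolite D", "mol_weight": 280, "score": -8.9},
]

def screen(ligands, max_weight=500, top_n=2):
    """Drop ligands over the weight cutoff, then rank by docking score."""
    eligible = [l for l in ligands if l["mol_weight"] <= max_weight]
    ranked = sorted(eligible, key=lambda l: l["score"])  # most negative first
    return [l["name"] for l in ranked[:top_n]]

print(screen(library))  # ['metabolite D', 'metabolite A']
```

Real pipelines apply several property filters (full Lipinski-type rules, PAINS alerts) before docking, but the final prioritization is this same sort-and-truncate.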
REVIEW | doi:10.20944/preprints201807.0116.v1
Subject: Chemistry, Medicinal Chemistry Keywords: chemical space; chemoinformatics; data mining; databases; DNMT inhibitors; drug discovery; epi-informatics; molecular modeling; similarity searching; virtual screening
Online: 6 July 2018 (10:04:44 CEST)
Naturally occurring small molecules include a large variety of natural products from different sources with confirmed activity against epigenetic targets. In this work we review chemoinformatic, molecular modeling and other computational approaches that have been used to uncover natural products as inhibitors of DNA methyltransferases, a major family of epigenetic targets with significant potential for the treatment of cancer and several other diseases. Examples of these computational approaches include docking, similarity-based virtual screening and pharmacophore modeling. We also comment on the chemoinformatic exploration of the chemical space of naturally occurring compounds as epigenetic modulators, which may have significant implications for epigenetic drug discovery and nutriepigenetics.
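Similarity-based virtual screening of the kind mentioned above typically scores molecular fingerprints with the Tanimoto coefficient. A minimal example, representing binary fingerprints as Python sets of "on" bit positions (the bit patterns are invented):

```python
# Minimal similarity search: Tanimoto coefficient between binary
# fingerprints, here represented as sets of set-bit positions.

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |intersection| / |union| of the set bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

query = {1, 4, 7, 9}
candidates = {
    "NP-1": {1, 4, 7, 9, 12},
    "NP-2": {2, 3, 5},
    "NP-3": {1, 4, 8},
}

ranked = sorted(candidates, key=lambda n: tanimoto(query, candidates[n]), reverse=True)
print(ranked[0])  # NP-1 is the most similar to the query
```

In practice the fingerprints come from a chemoinformatics toolkit (e.g. hashed circular fingerprints), but the ranking step is exactly this comparison.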
ARTICLE | doi:10.20944/preprints201812.0016.v1
Subject: Social Sciences, Library & Information Science Keywords: corpus linguistics; language modeling; big data; language data; databases; monitor corpora; documentary analysis; nuclear power; government regulation; tobacco documents
Online: 3 December 2018 (09:16:14 CET)
With the influence of Big Data culture on qualitative data collection, acquisition, and processing, it is becoming increasingly important that social scientists understand the complexity underlying data collection and the resulting models and analyses. Systematic approaches for creating computationally tractable models need to be employed in order to create representative, specialized reference corpora subsampled from Big Language Data sources. Even more importantly, any such method must be tested and vetted for its reproducibility and consistency in generating a representative model of the population in question. This article considers and tests one such method for downsampling digitally accessible Big Language Data, to determine both how to operationalize this form of corpus model creation and whether the method is reproducible. Using the U.S. Nuclear Regulatory Commission's public documentation database as a test source, the sampling procedure was evaluated for variation in the rate at which documents were deemed fit for inclusion in or exclusion from the corpus across four iterations. The findings indicate that such a principled sampling method is viable, and they underscore the need for an approach to creating language-based models that accounts for both extralinguistic factors and the linguistic characteristics of documents.
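A downsampling pass of this kind can be made exactly reproducible by seeding the random draw, which also makes the inclusion/exclusion rate easy to compare across iterations. The identifiers and the inclusion test below are hypothetical stand-ins for a real screening procedure:

```python
# Sketch of a seeded corpus-downsampling iteration: draw a reproducible
# subsample of document IDs and measure the inclusion rate.

import random

documents = [f"DOC{n:06d}" for n in range(1000)]  # hypothetical document IDs

def sample_iteration(seed, sample_size=100):
    """Return the fraction of a seeded subsample deemed fit for inclusion."""
    rng = random.Random(seed)  # dedicated generator: fully reproducible
    sampled = rng.sample(documents, sample_size)
    # stand-in inclusion criterion (a real one would inspect the document text)
    included = [d for d in sampled if int(d[3:]) % 3 != 0]
    return len(included) / sample_size

rates = [sample_iteration(seed) for seed in range(4)]  # four iterations
print(all(0.0 <= r <= 1.0 for r in rates))
print(sample_iteration(0) == sample_iteration(0))  # seeded draws repeat exactly
```

Recording the seed alongside the resulting corpus is what allows a later researcher to regenerate the identical subsample and audit the inclusion decisions.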
ARTICLE | doi:10.20944/preprints202003.0012.v1
Subject: Engineering, Construction Keywords: Building Information Modelling (BIM); Life-Cycle Assessment (LCA); Building process; Level of Development (LOD); Embodied environmental impacts; Greenhouse Gas emissions (GHG); LCA databases; LCA values; LCA benchmarks; cost estimation structure
Online: 1 March 2020 (13:16:52 CET)
The building sector has a big potential to reduce the material resource demand needed for building construction and therefore, greenhouse gas (GHG) emissions. Digitalisation can help to make use of this potential and improve sustainability throughout the entire building’s life cycle. One way to address this potential is through the integration of Life-Cycle Assessment (LCA) into the building process by employing Building Information Modelling (BIM). BIM can reduce the effort needed to carry out an LCA and therefore facilitate the integration into the building process. A review of current industry practice and scientific literature shows two main approaches to address BIM-LCA integration. Either the LCA is performed in a simplified way at the beginning of the building process, or it is done at the very end when all the needed information is available, but it is too late for decision-making. One reason for this is the lack of methods, workflows and tools to implement BIM-LCA integration over the entire building process. Therefore, the main objective of this study is to develop an integrated BIM-LCA workflow implemented into a method for the whole building process using an existing structure for cost estimation. A tool is created and used in a case study in Switzerland to test the developed approach. The results of this study show that LCA can be performed continuously in each building phase over the entire building process using existing BIM modelling techniques. The main benefit of this approach is that the re-work caused by the need for re-entering data and the usage of many different software tools that characterise most of the current LCA practices is minimised. Furthermore, decision-making, both at the element and building levels, is supported.