ARTICLE | doi:10.20944/preprints202112.0286.v2
Subject: Engineering, Other Keywords: data integration; interoperability; harmonization; GeoBIM; metadata
Online: 7 June 2022 (11:10:07 CEST)
The reuse and integration of data offer significant opportunities, supported by the F.A.I.R. data principles. Seamless data integration from heterogeneous sources has long been of interest to the geospatial community. However, 3D city models, BIM and information supporting smart cities present higher semantic and geometric complexity, which poses new challenges never tackled by a comprehensive methodology. Building on previous theories and studies, this paper proposes an overarching workflow and framework for multisource (geo)spatial data integration. It starts from the definition of use case-based requirements for the integrated data and guides the analysis of the integrability of the involved datasets, suggesting actions to harmonise them, through to data merging and validation. The approach is finally tested and exemplified on a case study. It allows the development of consistent, well-documented and inclusive data integration workflows, for the sake of use-case automation in various geospatial domains and the production of Interoperable and Reusable data.
ARTICLE | doi:10.20944/preprints202206.0335.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: metadata; contextual data; harmonization; genomic surveillance; data management
Online: 24 June 2022 (08:46:04 CEST)
ARTICLE | doi:10.20944/preprints202202.0139.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: uncertainty; prognostic modeling; image biomarkers; radiomics; radiomics harmonization
Online: 9 February 2022 (11:50:10 CET)
Problem. Image biomarker analysis, also known as radiomics, is a tool for tissue characterization and treatment prognosis that relies on routinely acquired clinical images and delineations. Due to the uncertainty in image acquisition, processing, and segmentation (delineation) protocols, radiomics often lacks reproducibility. Radiomics harmonization techniques have been proposed as a solution to reduce these sources of uncertainty and/or their influence on prognostic model performance. A relevant question is how to estimate the protocol-induced uncertainty of a specific image biomarker, what its effect is on model performance, and how to optimize the model given the uncertainty. In this manuscript, we show how protocol uncertainty can drastically reduce prognostic model performance. We introduce an effect-size measure η that assesses the protocol-induced uncertainty versus the measurable effect. Methods. Two non-small cell lung cancer (NSCLC) cohorts, composed of 421 and 240 patients respectively, were used for training and testing. Per patient, a Monte Carlo algorithm was used to generate three hundred synthetic contours with a surface Dice tolerance measure of less than 1.18 mm with respect to the original GTV. These contours were subsequently used to derive 104 radiomic features, which were ranked by their relative sensitivity to contour perturbation, expressed in the parameter η. The top four (low η) and the bottom four (high η) features were selected for two models based on the Cox proportional hazards model. To investigate the influence of segmentation uncertainty on the prognostic model, we trained and tested the setup on 5000 augmented realizations (using a Monte Carlo sampling method); the log-rank test was used to assess stratification performance and stability under segmentation uncertainty. Results.
Although both the low and high η setups showed significant test-set log-rank p-values (p = 0.01) on the original GTV delineations (without segmentation uncertainty introduced), in the model with a high uncertainty-to-effect ratio only around 30% of the augmented realizations resulted in model performance with p < 0.05 on the test set. In contrast, the low η setup achieved log-rank p < 0.05 in 90% of the augmented realizations. Moreover, the high η setup's classification was uncertain for 50% of the subjects in the test set (at an 80% agreement rate), whereas the low η setup was uncertain in only 10% of the cases. The code and part of the data are available at https://github.com/Maastro-CDS-Imaging-Group/sure. Discussion. Estimating image biomarker model performance based only on the original GTV segmentation, without considering segmentation uncertainty, may be deceiving. The model might show significant stratification performance but be unstable under delineation variations, which are inherent to manual segmentation. Simulating segmentation uncertainty using the method described allows for more stable image biomarker estimation, selection, and model development. The segmentation uncertainty estimation method described here is universal and can be extended to estimate other protocol uncertainties (such as image acquisition and pre-processing).
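The abstract does not spell out how η is computed; the sketch below shows one plausible uncertainty-to-effect ratio in the same spirit, contrasting a robust (low η) and a fragile (high η) synthetic feature. All names, array shapes, and noise levels are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def eta_sensitivity(feature_values):
    """Illustrative uncertainty-to-effect ratio for one radiomic feature.

    feature_values: array of shape (n_patients, n_perturbations), holding the
    feature recomputed on each perturbed contour of each patient. Returns the
    mean within-patient (perturbation-induced) spread divided by the
    between-patient spread; low values indicate robustness to perturbation.
    """
    x = np.asarray(feature_values, dtype=float)
    within = x.std(axis=1, ddof=1).mean()   # perturbation-induced variability
    between = x.mean(axis=1).std(ddof=1)    # patient-to-patient variability
    return within / between

rng = np.random.default_rng(0)
true_vals = rng.normal(0.0, 1.0, size=50)                     # patient-level signal
robust = true_vals[:, None] + rng.normal(0, 0.1, (50, 300))   # low contour sensitivity
fragile = true_vals[:, None] + rng.normal(0, 2.0, (50, 300))  # high contour sensitivity
```

With these settings the robust feature yields a ratio well below one and the fragile one well above it, mirroring the low-η/high-η split used for model building in the study.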
ARTICLE | doi:10.20944/preprints201910.0275.v1
Subject: Earth Sciences, Geoinformatics Keywords: Landsat; Sentinel 2; harmonization; crop monitoring; Google Earth Engine
Online: 24 October 2019 (06:02:04 CEST)
Proper satellite-based crop monitoring applications at the farm level often require near-daily imagery at medium to high spatial resolution. Synthesizing data from the ongoing satellite missions of ESA (Sentinel 2) and NASA (Landsat 7/8) provides this unprecedented opportunity at a global scale; nonetheless, this is rarely implemented because these procedures are data demanding and computationally intensive. This study developed a complete processing stream in the Google Earth Engine cloud platform to generate harmonized surface reflectance images of the Landsat 7/8 and Sentinel 2 missions. The harmonized images were generated for two agriculture schemes in Bekaa (Lebanon) and Ninh Thuan (Vietnam) during the period 2018-2019. We evaluated the performance of several pre-processing steps needed for the harmonization, including image co-registration, BRDF correction, topographic correction, and band adjustment. This study found that the misregistration between Landsat 8 and Sentinel 2 images varied from 10 meters in Ninh Thuan, Vietnam to 32 meters in Bekaa, Lebanon, and, if not treated, greatly degraded the quality of the harmonized dataset. Analysis of a pair of overlapping L8-S2 images over the Bekaa region showed that after harmonization, all band-to-band spatial correlations were greatly improved, from (0.57, 0.64, 0.67, 0.75, 0.76, 0.75, 0.79) to (0.87, 0.91, 0.92, 0.94, 0.97, 0.97, 0.96) in the bands (blue, green, red, nir, swir1, swir2, ndvi) respectively. We demonstrated that the dense observations of the harmonized dataset can be very helpful for characterizing cropland in highly dynamic areas. We detected unimodal, bimodal and trimodal shapes in the temporal NDVI patterns (likely cycles of paddy rice) in Ninh Thuan province only during the year 2018. We fitted the temporal signatures of the NDVI time series using harmonic (Fourier) analysis.
Derived phase (angle from the starting point to the cycle's peak) and amplitude (the cycle's height) were combined with max-NDVI to generate an R-G-B image. This image highlighted croplands as colored pixels (high phase and amplitude) and other land types as grey/dark pixels (low phase/amplitude). The generated harmonized datasets, containing surface reflectance images (bands blue, green, red, nir, swir1, swir2, and ndvi at 30 meters) over the two study sites, are provided for public use and testing.
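The phase/amplitude extraction described above can be sketched with a first-order harmonic fit. The function below is an illustrative least-squares implementation, not the authors' Earth Engine code; the synthetic NDVI series and the 365-day period are assumptions for demonstration.

```python
import numpy as np

def harmonic_fit(doy, ndvi, period=365.0):
    """First-order harmonic (Fourier) fit of an NDVI time series.

    Fits ndvi ~ a0 + b*cos(2*pi*t/period) + c*sin(2*pi*t/period) by least
    squares and returns (amplitude, phase, max_ndvi): the three quantities
    combined into the R-G-B cropland composite described above.
    """
    t = 2.0 * np.pi * np.asarray(doy, dtype=float) / period
    A = np.column_stack([np.ones_like(t), np.cos(t), np.sin(t)])
    a0, b, c = np.linalg.lstsq(A, np.asarray(ndvi, dtype=float), rcond=None)[0]
    amplitude = np.hypot(b, c)                  # height of the seasonal cycle
    phase = np.arctan2(c, b) % (2.0 * np.pi)    # angle to the cycle's peak
    return amplitude, phase, float(np.max(ndvi))

# Synthetic unimodal crop signal peaking around day-of-year 180
doy = np.arange(0, 365, 5)
ndvi = 0.4 + 0.3 * np.cos(2 * np.pi * (doy - 180) / 365)
amp, ph, mx = harmonic_fit(doy, ndvi)   # amp ~ 0.3, phase points at day 180
```

A pixel with high amplitude and a phase aligned with the growing season would render as a colored (cropland) pixel in the composite, while a flat signal yields low amplitude and a dark pixel.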
Subject: Life Sciences, Other Keywords: hearing impairment; hearing loss; ontology; data harmonization; meta-analysis
Online: 19 September 2019 (11:37:08 CEST)
Hearing impairment (HI) is a common sensory disorder that is defined as the partial or complete inability to detect sound in one or both ears. This diverse pathology is associated with a myriad of phenotypic expressions and/or syndromes. HI can be caused by various intrinsic, environmental and/or unknown factors. Some ontologies capture relevant HI forms, phenotypes and syndromes, but there is no comprehensive knowledge portal which includes aspects specific to the HI disease state. This hampers inter-study comparability, integration and interoperability within and across disciplines. This work describes the HI Ontology (HIO), which was developed based on the Sickle Cell Disease Ontology (SCDO) model. This is a collaboratively developed resource built around the 'Hearing Impairment' concept by a group of experts in different aspects of HI and ontologies. HIO is the first comprehensive, standardized, hierarchical and logical representation of existing HI knowledge. HIO allows researchers and clinicians alike to readily access standardized HI-related knowledge in a single location and promotes collaboration and HI information sharing, including epidemiological, socio-environmental, biomedical, genetic and phenotypic information. Furthermore, this ontology illustrates the adaptability of the SCDO framework for use in developing a disease-specific ontology.
REVIEW | doi:10.20944/preprints202008.0133.v1
Subject: Life Sciences, Virology Keywords: epidemic; viral sequences; genomics; metadata; data harmonization; integration and search
Online: 5 August 2020 (10:58:27 CEST)
With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understanding and mitigating the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation, while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects.
ARTICLE | doi:10.20944/preprints202205.0251.v1
Subject: Life Sciences, Biotechnology Keywords: Recombinant DNA; scope of legislation; regulatory compliance; analytical capabilities; safety; global harmonization
Online: 19 May 2022 (04:55:19 CEST)
A large variety of fermentation products are used in food and feed production, but also in other industries, and many of these products are produced with genetically modified microorganisms (GMMs). In food and feed production, prominent examples are amino acids, vitamins, food and feed enzymes, colorants, non-caloric sweeteners and human milk oligosaccharides. From a regulatory perspective, fermentation products are typically produced under containment. This means that premises, equipment and work processes need to be designed to prevent, or at least minimize, the release of GMMs into the environment. The fermentation products themselves should not contain any live cells of the GMM. Over the past years, there have been concerning developments, particularly in the European Union, suggesting that the absence of recombinant DNA might also be interpreted as a regulatory requirement for fermentation products produced with GMMs. In this paper, we (i) attempt to place these developments in their historical context, (ii) sketch the potential negative repercussions for the food and feed industries, (iii) elaborate on the safety of recombinant DNA, and (iv) postulate that recombinant DNA should remain an integral part of the safety assessment of fermentation products but should not be misconstrued as a criterion for regulatory classification of products of biotechnology.
ARTICLE | doi:10.20944/preprints202301.0165.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Multiparametric MRI; image preprocessing; intensity harmonization; intensity standardization; high-grade glioma; radiomics signatures
Online: 10 January 2023 (01:20:12 CET)
Purpose: This study investigates the impact of different intensity normalization (IN) methods on the performance of overall survival (OS) radiomics models across MR sequences in primary (pHGG) and recurrent high-grade glioma (rHGG). Methods: MR scans acquired before radiotherapy were retrieved from two independent cohorts (rHGG C1: 197, pHGG C2: 141) acquired on multiple scanners (15, 14). The sequences are T1-weighted (T1w), contrast-enhanced T1w (T1wce), T2w, and T2w-FLAIR. Sequence-specific significant features (SF) associated with OS, extracted from the tumour volume, were derived after applying 15 different IN methods. Survival analyses were conducted using Cox proportional hazards (CPH) and Poisson regression (POI) models. A ranking score was assigned based on the 10-fold cross-validated (CV) concordance index (C-I), mean square error (MSE), and the Akaike information criterion (AIC), to evaluate the methods' performance. Results: Scatter plots of the 10-CV C-I and MSE against the AIC showed an impact of the IN methods and MR sequences on the survival predictions (C1/C2 C-I range: 0.62-0.71/0.61-0.72, MSE range: 0.20-0.42/0.13-0.22). WhiteStripe showed stable results for T1wce (C1/C2 C-I: 0.71/0.65, MSE: 0.21/0.14). ComBat (0.68/0.62, 0.22/0.15) and histogram matching (HM, 0.67/0.64, 0.22/0.15) showed consistent prediction results for T2w models. They were also the top-performing methods for T1w in C2 (ComBat: 0.67, 0.13; HM: 0.67, 0.13); however, only HM achieved high predictive performance in C1 (0.66, 0.22). After eliminating IN-impacted SF using Spearman's rank-order correlation coefficient, a mean decrease in the C-I and MSE of 0.05 and 0.03, respectively, was observed in all four sequences. Conclusion: The IN method impacted the predictive power of the survival models. Thus, performance is sequence-dependent.
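Of the IN families compared, histogram matching (HM) is straightforward to sketch. The quantile-based variant below is an illustrative stand-in, not the exact implementation benchmarked in the study; the two scanner intensity distributions are simulated.

```python
import numpy as np

def histogram_match(moving, reference, n_quantiles=100):
    """Quantile-based histogram matching for intensity normalization.

    Maps the intensities of `moving` so their empirical distribution
    approximates that of `reference`, via piecewise-linear interpolation
    between corresponding quantile landmarks of the two scans.
    """
    q = np.linspace(0, 100, n_quantiles)
    src_q = np.percentile(moving, q)      # quantile landmarks of the input scan
    ref_q = np.percentile(reference, q)   # landmarks of the reference scan
    return np.interp(moving, src_q, ref_q)

rng = np.random.default_rng(1)
scan_a = rng.normal(100, 15, 10_000)   # e.g. T2w intensities from scanner A
scan_b = rng.normal(300, 40, 10_000)   # same sequence, different scanner
matched = histogram_match(scan_a, scan_b)
```

After matching, the normalized scan shares the reference scan's intensity scale, which is the property that makes features extracted across scanners comparable.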