A Streamlined Workflow for Conversion, Peer-Review and Publication of Omics Metadata as Omics Data Papers

Background: Data papers have emerged as a powerful instrument for open data publishing, obtaining credit, and establishing priority for datasets generated in scientific experiments. Academic publishing improves data and metadata quality through peer review and increases the impact of datasets by enhancing their visibility, accessibility, and re-usability. Objective: We aimed to establish a new type of article structure and template for omics studies: the omics data paper. To improve data interoperability and further incentivise researchers to publish high-quality datasets, we created a workflow for streamlined import of omics metadata directly into a data paper manuscript. Methods: An omics data paper template was designed by defining key article sections which encourage the description of omics datasets and methodologies. The workflow was based on REpresentational State Transfer (REST) services and XPath to extract information from the European Nucleotide Archive, ArrayExpress and BioSamples databases, which follow community-agreed standards. Findings: The workflow for automatic import of standard-compliant metadata into an omics data paper manuscript facilitates the authoring process. It demonstrates the importance and potential of creating machine-readable and standard-compliant metadata. Conclusion: The omics data paper and the streamlined metadata import workflow provide a mechanism for incentivising the publication of high-quality, standard-compliant omics metadata.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 16 September 2020 | doi:10.20944/preprints202009.0357.v1


Introduction
The term "omics" refers to the study of biological systems through the comprehensive examination of their molecular constituents, such as genes, transcripts, proteins and metabolites. Despite the recent development of omics technologies and data generation, however, the publication of high-quality omics biodiversity data and its accompanying, standardised metadata is still neither harmonised nor interoperable [6]. Existing infrastructures in omics data science focus on the sequence or molecular data generated from omics studies. The databases of the International Nucleotide Sequence Database Collaboration (INSDC) [7,8] have provided a trusted archive for these data. In parallel, major infrastructures to handle higher-order biodiversity data (e.g. occurrences linked to taxa, specimen records) have emerged and include the Global Biodiversity Information Facility (GBIF) [9], the Integrated Digitized Biocollections (iDigBio) [10], the Distributed System of Scientific Collections (DiSSCo) [11], the Ocean Biogeographic Information System (OBIS) [12], the Global Genome Biodiversity Network (GGBN) [13] and others. Some of these infrastructures support data repositories which follow community-accepted metadata standards. GBIF uses the Darwin Core Standard (DwC) for biodiversity data, while the GGBN has developed its own GGBN Data Standard, which interoperates with DwC and the Access to Biological Collections Data (ABCD) schema for primary biodiversity data [14,15,16,17]. The INSDC cooperates with community standards initiatives such as the Genomics Standards Consortium (GSC), to implement their Minimum Information about any (x) Sequence (MIxS) standard for genomic, metagenomic and environmental metadata descriptors, and with the Global Microbial Identifier (GMI) group for pathogen sequence metadata [18,19,20]. MIxS consists of three checklists, each containing several packages for the description of the various environments from which genomic material could be sampled.
Other international data repositories, such as EMBL-EBI's ArrayExpress and the BioSamples database, implement standards such as the Minimum Information about a high-throughput nucleotide SEQuencing Experiment (MINSEQE), the Minimum Information About a Microarray Experiment (MIAME) and various MIxS environmental checklists [21,22,23].
The presence of a digital infrastructure and standards supporting a certain data class is just one of the necessary conditions for adequate data sharing and re-use, which is often impeded by the insufficient use of these standards and inadequate incentives for the stakeholders participating in the process [8]. The concept of Findable, Accessible, Interoperable and Re-usable (FAIR) data is a major step forward in building the foundation for sharing reusable data [24]. How can we, however, incentivise the data creators, holders, scientists and institutions to pursue the FAIRsharing of data, information, and knowledge exchange [25] in omic biodiversity science?
There are different ways scientists can publish their data, however, all can be attributed to two main routes: (1) data publishing through international trusted data repositories, such as INSDC [7], GBIF [9], and others, and (2) scholarly data publishing in the form of data papers or as data underpinning a research article [26,27,28,29,30]. While the first route focuses on data aggregation, standardisation and re-use, the second one augments the quality and reusability of data and metadata through peer reviewing and data auditing in the scholarly publishing process. Scholarly data publishing provides an opportunity to enhance the original metadata in the data paper narrative and to link it to the original dataset via stable identifiers, thus improving the reproducibility and findability of the data. Furthermore, it ensures a scientific record, crediting and acknowledgement for the data creators and scientists in the form of citable scholarly articles. Academic publishing involves dissemination of research through additional channels, such as journal distribution networks, and creates increased opportunities for open science collaboration [26].
While standards and infrastructures are crucial to the advancement of data sharing and reuse within the field of omics, we argue that incentivising authors to publish their data in the form of peer reviewed journal articles (data papers) creates the driving force towards a truly FAIR data world.
As more and more researchers want to deposit and share their datasets, standards, infrastructures and workflows become central to delivering FAIR data. Following the example set by Chavan and Penev [26], who introduced data papers in biodiversity science, we have established a concept for an omics data paper: a type of scholarly paper in which data generated in genomic or other omics experiments are described with extended and peer-reviewed metadata and linked to the corresponding dataset(s) deposited in an INSDC or other archive. To further incentivise authors to publish omics data papers and to demonstrate the importance of high-quality metadata, we propose a streamlined workflow for the conversion of European Nucleotide Archive (ENA) metadata directly into a data paper manuscript.
The aim of the present paper is to conceptualise the omics data paper, and to describe the streamlined workflow for automated import of metadata into an omics data paper manuscript. This workflow also accommodates the peer review and publication processes associated with the manuscript.

Approach
We took the following steps to approach the goal of establishing an omics data paper template and workflow:
1. Identify the high-level needs of the omics communities.
3. Synthesise the technical solutions and incorporate further functional needs to create the structure of the new type of data paper.
We created a template, defining article sections and subsections to map the article narrative to metadata associated with the dataset(s) described in an omics data paper.

Workflow for extracting relevant metadata from ENA XML files
We proceeded to develop a workflow for the automatic import of metadata into omics data paper manuscripts based on ENA's metadata structure, as well as the ArrayExpress [21] and BioSamples [22] databases. The workflow uses REST API requests and XPath to retrieve segments of information from XML files served by ENA, ArrayExpress and BioSamples [36]. It then imports them into our proposed data paper manuscript structure.
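The extraction step can be sketched as follows. This is an illustrative sketch, not the production workflow: the element names are modelled on ENA's study XML but should be treated as assumptions rather than a verbatim ENA record, and the XML is supplied inline instead of being fetched over REST.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a study XML as returned by the ENA API;
# the element names (STUDY_SET, DESCRIPTOR, ...) are assumptions
# modelled on ENA's study records.
study_xml = """
<STUDY_SET>
  <STUDY accession="PRJEB00000">
    <DESCRIPTOR>
      <STUDY_TITLE>Example soil metagenome study</STUDY_TITLE>
      <STUDY_ABSTRACT>Shotgun sequencing of soil samples.</STUDY_ABSTRACT>
    </DESCRIPTOR>
  </STUDY>
</STUDY_SET>
"""

root = ET.fromstring(study_xml)
# XPath-style expressions select the segments that map to manuscript sections
title = root.findtext(".//STUDY_TITLE")
abstract = root.findtext(".//STUDY_ABSTRACT")
accession = root.find(".//STUDY").get("accession")
```

Each extracted value (title, abstract, accession) would then be placed into the corresponding section of the manuscript structure.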
For demonstration, testing and reproducibility purposes, this workflow was implemented in an R Shiny app [37,38,39] which visualises metadata extracted from ENA inside the relevant sections of the proposed manuscript template within the application interface. During this prototyping stage, we tested the app with multiple ENA study identifiers to guide the improvement of the import algorithm. The R Shiny app is available both as open source code on GitHub [40] and as an interactive web app [41] deployed in an RStudio cloud environment [42] and hosted on a Shinyapps.io server [37,38]. The app was developed with R version 4.0.0 (2020-04-24, "Arbor Day") [39].

Integration of metadata extraction workflow with the ARPHA Writing Tool
The metadata extraction workflow was completed with a conversion tool working "inside" Pensoft's ARPHA Writing Tool [31]. A new type of publication and template, "OMICS data paper", was created, following the proposed data paper structure. Certain sections of the omics data paper template were made mandatory, such as the "Methods" section and the "Data resources" section.
An important component of the design and implementation of the omics data papers is the BioSamples Supplementary Table. For ENA metadata records that contain links to associated BioSamples metadata (MIxS checklists) [20,22], the linked sample metadata are imported into this supplementary table within the manuscript.

[Table excerpt: the "Methods" template section comprises the subsections "Sampling" (with "Environmental profile", "Geographic range" and "Technologies used"), "Sample processing" ("Technologies used") and "Data processing", mapped to the ArrayExpress Protocol XML elements protocol/type, protocol/text and protocol/hardware.]

This section is split into three major parts to describe how the physical material was collected, processed and transformed into a dataset. The "Sampling" section allows authors to outline the environmental and geographic characteristics of the locations where their material was collected. Authors are encouraged to share as much detail as they can (e.g. geographic coordinates, habitats, seasonal information, etc.). The sampling methods should be described in the "Technologies used" subsection. "Sample processing" should explain the laboratory procedures involved in the transition of the physical sample into its digital footprint. Finally, the "Data processing" subsection should mention the steps taken to transform the raw dataset into the one which was published (e.g. normalisation steps). None of the subsections is compulsory, and authors can write the Methods in a form outside these topics, but our template provides a detailed best-practice structure to follow. The template focuses on the value of the data, the methods used to generate it and the qualitative and quantitative characteristics of the dataset.

Omics metadata extraction workflow
Metadata describing the datasets were used to facilitate the creation and authoring of the data paper manuscript. By following ENA's metadata model [32], including its links to the ArrayExpress [21] and BioSamples [22] databases, we designed a workflow which orchestrates the extraction of relevant metadata from the various ENA XML files (Fig. 1). The Study XML and the Project XML are the starting points in the proposed workflow, as they integrate all other types of data and metadata available in ENA for a given scientific study. Each metadata object in the ENA metadata model is associated with a unique identifier, which can be used to retrieve its corresponding XML file via the ENA API [32].
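Retrieval by identifier can be sketched as below. The URL follows the pattern of EMBL-EBI's public ENA browser XML endpoint, but the exact endpoint should be verified against ENA's API documentation [32]; no network request is made here, only URL construction.

```python
# Build the request URL that returns the XML record for an ENA accession.
# Endpoint pattern assumed from the ENA browser API; check current docs.
ENA_XML_ENDPOINT = "https://www.ebi.ac.uk/ena/browser/api/xml/"

def ena_xml_url(accession: str) -> str:
    """Return the URL from which the XML for this accession can be fetched."""
    return ENA_XML_ENDPOINT + accession

study_url = ena_xml_url("PRJEB12345")
```

Because every object in the metadata model (study, project, sample, run) carries its own accession, the same function serves each step of the workflow as it follows links from the Study or Project XML outwards.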

Fig. 1 Metadata extraction workflow from ENA, ArrayExpress and BioSamples
As outlined in our proposed workflow (Fig. 1), we implemented the template and workflow in Pensoft's ARPHA Writing Tool [31], enabling the import of the extracted ENA metadata records into the omics data paper template (Table 1). Fig. 2 shows a diagram demonstrating the import functionality from the perspective of the user of the ARPHA Writing Tool.

R shiny app -deployment and reproducibility
The template and workflow were first prototyped in an R Shiny app [41], the code for which is open source and available on GitHub under the Apache 2.0 license [40], as outlined in the Methodology section of this paper. The R Shiny app is a web application emulating the functionality of the metadata import workflow in the ARPHA Writing Tool. The application runs in a virtual R environment [39,42] and is deployed and hosted on the web via Shinyapps.io [37,38], configured to allow up to 50 concurrent connections. The R Shiny app has one additional functionality, which is not present in the workflow implemented in the ARPHA Writing Tool: it transforms the imported metadata into a Journal Article Tag Suite (JATS) XML file [44], which can be downloaded by clicking the 'Download XML' button. We validated the XML against the latest JATS DTD version with the JATS4R validator [45]. The JATS XML is structured according to the Pensoft omics data paper template, so that most article section nodes are defined with the sec tag and a sec-type attribute is used to define the exact section name (e.g. the Methods section is marked in the XML as <sec sec-type="Methods">). A basic "skeleton" file of the JATS XML is available in the project GitHub repository [40].
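The section structure described above can be sketched with the standard library. This is a minimal fragment assuming only the <sec sec-type="..."> convention stated here; it is not a complete, DTD-valid JATS document and is not the R Shiny app's code.

```python
import xml.etree.ElementTree as ET

def make_section(sec_type: str, title: str, text: str) -> ET.Element:
    """Build a JATS-style <sec> node with a sec-type attribute."""
    sec = ET.Element("sec", {"sec-type": sec_type})
    ET.SubElement(sec, "title").text = title
    ET.SubElement(sec, "p").text = text
    return sec

methods = make_section("Methods", "Methods", "Samples were collected from soil.")
fragment = ET.tostring(methods, encoding="unicode")
```

Serialising each template section this way and nesting the results under a JATS body element would yield a file analogous to the "skeleton" in the repository, which downstream tools could validate with JATS4R before submission.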
Despite being tailored to the Pensoft omics data paper template, JATS XML files generated via the R shiny app can be used by other publishers or individuals to generate their own omics data paper manuscripts. Together with ENA's documentation about programmatic access to its resources [36], the codebase enables reproducibility of our workflow and creates the potential for it to be deployed by other journals or publishers.

The data and metadata publishing landscape
The concept of data papers is not new; in fact, they have been in existence for several decades. One of the first journals to implement this concept was the Ecological Society of America's Ecological Archives [46,47]. In 2011, Chavan and Penev envisioned metadata as a resource for authoring data papers for primary biodiversity data and identified both a lack of clear guidelines and good practices for authoring metadata (the "how") and a lack of incentives for authors to do so (the "why") [26]. They proposed data papers as a "mechanism to incentivise data publishing in biodiversity science" and introduced them to the biodiversity community through Pensoft's journals. To further simplify data paper authoring, Pensoft pioneered an integrated workflow for the automatic metadata-to-manuscript conversion of primary biodiversity datasets published through GBIF's Integrated Publishing Toolkit (IPT) [26,29,48].
This streamlined metadata conversion workflow was first introduced in several of Pensoft's biodiversity journals and then in journals by other publishers, such as Nature's Scientific Data, PLOS ONE, BMC Ecology and many others [49]. Since 2011, nearly 300 data papers have been published in Pensoft's journals and the average number of published data papers continues to grow (Fig. 3). The total number of data papers published by other publishers is in the thousands (Fig. 4; data courtesy of Schöpfel, leading author of [50]). Data papers are no longer an abstract idea but have already been practically implemented in multiple journals in different disciplines.

Comparison with other tools and workflows
Since 2011, Pensoft has developed other integrative ways to streamline metadata authoring and data paper publication by integrating different workflows into its collaborative online authoring tool, the ARPHA Writing Tool (AWT), and the associated Biodiversity Data Journal [51]. For instance, Ecological Metadata Language (EML) metadata files used in the IPT can be directly converted and imported into manuscripts in AWT "at the click of a button", then edited in the tool and submitted to the Biodiversity Data Journal [30,52]. This workflow closely resembles the one described in this paper, but it is focused on ecological data. A related AWT workflow accepts a single specimen record identifier and imports information about that record from several infrastructures (GBIF, Barcode of Life Data Systems (BOLD), iDigBio, or PlutoF) into manuscripts [52]. AWT also enables the conversion of an EML-formatted file into a biodiversity data paper [52], a functionality not covered by the present workflow, which only performs API requests.
Generation of extended metadata descriptors has been the focus of other tools, such as the Metadata Shiny Automated Resources and Knowledge (MetaShARK) tool [53] and Datascriptor [54], the latter of which is still under development. MetaShARK aims to facilitate the assembly of ecological metadata by providing a user-friendly workflow for metadata packaging [53]. Unlike the workflow described here, it is focused more on primary metadata generation than on metadata sharing and reuse [53]. Our workflow uses already generated metadata and provides a template for their extension into an enhanced metadata description converted to narrative. Datascriptor is more closely related to our workflow because it aims to transform metadata generated by following community standards into a data article [54]. To do so, its developers have envisioned the generation of a JATS XML file [54], which is what we have implemented in our R Shiny app demonstrating the workflow for the import of metadata into an omics data paper manuscript.

Data papers for the field of omics: rationale and benefits
Generation of omics data and metadata is one of the very first outputs of the research cycle, but not all of it is shared via research publications. Even when these data are published, the focus is usually on the interpretation of the data rather than on metadata quality or the FAIR properties of the dataset. Deposition of raw omics data, such as sequencing data, mass spectrometry (MS) proteomics data and RNA-sequencing data, into centralised databases has become a routine practice for studies involving omics experiments [55]. ENA provides the necessary infrastructure to share sequencing data in a structured format and enables machine-readability and interoperability through the use of identifiers, consistent schema models and APIs [32]. However, for the data to be communicated to and shared with other researchers, it needs to be described inside a human-readable narrative. With our proposed omics data paper and the automatic import workflow, we encapsulate all metadata about a study into a single piece of narrative, thus completing the scientific process.
Authoring omics data papers, despite being aided by the automated workflow, requires additional effort and time. Here we outline some of the benefits which make the process of creating such manuscripts worthwhile, as well as how data interoperability contributes to the FAIR data and metadata publishing landscape.

Omics data papers and underlying datasets undergo peer-review and data auditing
Prior to the peer review of submitted omics data paper manuscripts, all underlying datasets go through mandatory data auditing, cleaning and quality checks to ensure that they meet the journal standards for publication [56]. This is done by a data auditor, whose role is to technically evaluate the submitted datasets for compliance with a data quality checklist [57] and to provide authors with a detailed report, including recommendations for improving the dataset. Only after the authors revise the dataset according to the recommendations can it be approved for peer review. The introduction of data scientists into the publishing process ensures that submitted data and metadata are FAIR, consistent and of high quality. This double checking (first of the datasets by the data auditors and then of the whole manuscript by the reviewers) is a meticulous approach to enhancing the quality of datasets and, to the best of our knowledge, has not been adopted by any other publisher so far.

Publication of data papers improves metadata quality
Authoring metadata is a necessary step to publish omics data in an open repository; however, there is considerable variability when it comes to the quality of the published metadata [6,55]. At the same time, high-quality metadata enabling data sharing and re-use have become crucial [6].

Curation of metadata, currently implemented by ArrayExpress via their Annotare tool [58,59], is an adequate method for high-quality, standards-based metadata publishing. Most important, however, is that metadata authors learn to adopt and correctly use the existing standards in the process of describing their data. By directly observing the role of their metadata in creating the manuscript, they are made aware of its value and should be incentivised to improve the quality and quantity of the metadata they provide.
After importing metadata into their omics data paper manuscript, authors would still need to manually correct and fill in missing information, which defeats the main purpose of the workflow: to make the data better described through extended and detailed metadata in the form of peer-reviewed, widely accessible and citable data papers.

High-quality metadata enables data-driven discovery
Metadata which follows community-accepted standards is vital for data-driven discoveries, as it provides the necessary context to characterise the dataset it describes. Omics data papers not only improve the quality of the metadata but also constitute an enhanced metadata record themselves [60]. By giving further visibility to omics datasets through their publication in an omics data paper and by enhancing metadata through publication, we stimulate scientific research and data-driven discovery.

Data papers help to establish priority
Publishing data papers at early stages of the research process can provide an important benefit for authors: the opportunity to get the first scientific record for their effort in assembling a dataset and obtain feedback from the research community. It is well known that many authors are hesitant to publish datasets which they have not yet analysed or used for supporting any research findings for fear of someone else using the data and getting 'scooped'. By publishing a data paper, the authors are guaranteed that the described data can be reused in accordance with the Open Science principles, following all community accepted ethical norms for citation, priority and generating new knowledge through joint publications based on shared data.

Publishing omics data papers is a way to obtain credit for one's work
Science crediting further incentivises researchers to publish omics data papers because the impact of their work can be measured in a way familiar to authors of traditional research papers, adding to their researcher impact metrics. In addition, the data managers and scientists who generate the data are not always among the authors of traditional research articles, which focus on the data analysis and outcomes. Thus, data paper publishing can be a way for all actors involved in the process of gathering, curating and managing the data, be they early-stage researchers, technicians or data scientists, to obtain credit for their valuable work.

Limitations and future outlooks
The automated workflow for importing omics metadata into data paper manuscripts currently works only with ENA metadata records. While INSDC metadata is exchanged across all three databases in the consortium (ENA, GenBank and DDBJ) [7], it would be beneficial if users could import metadata from any of the three data repositories via their associated identifier. The reason for the current limitation is the need for additional integrations, owing to the variation in APIs and the differing metadata schemas. We decided to integrate the workflow with ENA as the first showcase of this novel method of creating data paper manuscripts because their metadata structure was the easiest to work with and because they share data with GenBank and DDBJ.
Currently, the streamlined metadata import workflow for the omics data paper is focused mostly on genomic data. In the future, we plan to expand the workflow to include other repositories and data types, such as metagenomics data and operational taxonomic unit (OTU) tables. This addition will integrate new data science solutions for efficiently and interoperably exchanging and storing sparse and high-dimensional contingency tables along with their associated sample and taxonomic metadata (e.g. the BIOM format [61]). Thus, we support the move away from the fragmentation of data and towards a single quantum of information for exchange, containing interoperable, accessible and transparent information. Making use of this advancement, future workflows for omics data paper creation may also support BIOM files for data provision, as outlined in an unpublished dissertation by Raissa Mayer (2020).
Integrations between existing infrastructures and data-driven initiatives are key to the FAIRness of data and metadata. The streamlined workflow for the import of metadata from ENA, ArrayExpress and BioSamples is another step in this direction. However, to make metadata truly FAIR, there should be a two-way link between the original data and metadata repository (e.g. ENA) and the enhanced metadata record (e.g. the omics data paper).

Conclusions
In conclusion, the new omics data paper, implemented in Pensoft's publishing process, provides a mechanism for incentivising omics data sharing and reuse through scholarly publishing. In addition, the workflow for the import of metadata into manuscripts encourages and incentivises authors to enhance data quality and completeness. The workflow also demonstrates the importance of linking data from different infrastructures using stable identifiers and thus sets an example for future integrations with other metadata and data repositories.

Availability of supporting source code and requirements
• Project name: Omics Data Paper Generator
• Project home page: https://github.com/pensoft/omics-data-paper-shinyapp
• Operating system(s): Platform independent