Ten Simple Rules for FAIR Sharing of Experimental and Clinical Data with the Modeling Community

Science continues to become more interdisciplinary and to involve increasingly complex data sets. Many projects in the biomedical and health-related sciences follow, or aim to follow, the principles of FAIR data sharing, which has been demonstrated to foster collaboration, lead to better research outcomes, and help ensure reproducibility of results. Data generated in the course of biomedical and health research present specific challenges for FAIR sharing: they are heterogeneous and highly sensitive to context and to the need for protection and privacy. Data sharing must respect these features without impeding timely dissemination of results, so that they can contribute to time-critical advances in medical therapy and treatment. Modeling and simulation of biomedical processes have become established tools, and a global community has been developing algorithms, methodologies, and standards for applying biomedical simulation models in clinical research.

Data provides the evidence for scientific knowledge, and science is built on data [1]. Data sharing improves the quality of scientific reporting and increases scientific outcome, collaboration, publication rates, and visibility. Hence, it benefits researchers, society, and funders. Data sharing is part of good scientific practice, enables data reuse, and fosters the scientific discovery process [2]. In addition, when cited properly, researchers get the credit they deserve for the data they generate [3]. Among the essential factors for catalyzing data sharing are (i) clear policies from funders, institutions, journals, and research communities; (ii) credit and incentives for data publication; (iii) explicit funding for data management, data sharing, and data publishing; (iv) practical help with organizing data, finding appropriate repositories, and simpler ways of sharing; and (v) training and education in research data management [4]. We focus on the last two points.

The fields of computational biology and medicine are highly dependent on the availability of research data. This requirement is independent of methodology, modeling domain, or complexity of the research object. Data sharing between experimentalists, clinicians, and modelers is an essential part of most investigations. Data is needed during model construction, in which a subset of the data is used to calibrate model parameters and model behavior, as well as during model evaluation, in which data is used to evaluate the model performance by comparing model predictions against test data. Despite the efforts of the scientific communities to provide guidelines and tools for open and reproducible science, most data is difficult for modelers to use.
One reason is a lack of data accessibility: researchers rarely make their data available in a manner directly accessible by modelers, despite widespread encouragement from funding agencies, scientific journals, and policy makers [2,5]. Another reason is a lack of data interoperability, with accessible data being difficult to integrate with computational models due to technical challenges with the shared data (for instance, poorly annotated data hidden in PDF documents [1]). Many challenges relate to poor data FAIRness, i.e., data not being findable, accessible, interoperable, or reusable [6]. However, simply making data FAIR does not guarantee that the data will be of high quality, nor that it can be used in computational modeling.

It is surprising that, despite the importance of data sharing, no guidelines or best practices for sharing research data with the modeling community exist. Members of the COMBINE community [7] discussed these problems during their annual meetings [8] and collected best practices. From these discussions, we have distilled ten simple rules (Figure 1) on how to share data for computational modeling. The rules address issues of data accessibility and quality, provide guidelines for the research community, and provide material for education.

Related "Simple Rules" cover best practices of data sharing [9,10], digital data storage [11], how to get the most value out of numerical data [12], how to enable multi-site collaborations through data sharing [13], and how to create a good data management plan [14].

The first rule of data sharing is to actually make an effort to share your data. Sharing data means making it accessible (online) to both humans and machines via a data repository. Humans should be able to access your data via a web browser, and machines via web services and/or persistent links. Use a repository that minimizes hurdles to data access, i.e., if possible avoid resources that require accounts or registration for accessing data or that are only accessible to a limited community (such as only within an institution). Remember that science is a global endeavor and data should be accessible to researchers worldwide, not only from one country or region. Providing open access to data is not only important for reuse and data mining but also to confirm that results presented in a publication are truly based on actual data [15]. Important criteria for choosing a data repository are long-term availability and acceptance within the community. The longer a repository exists, and the more users it has, the better the chances that it will remain available.

Equally important is the license under which you share your data. The Creative Commons license CC-0 hands your data over to the public domain and allows for the broadest reuse, for instance in a context like data aggregation. CC-0 has many benefits for individuals and society with minimal implications for authors [16]. Alternatively,

CC-BY provides the same openness but requires attribution, i.e., citation of the primary source, which is especially important for researchers. It is important to note that CC-0 does not mean that you waive the expectation of citation; it only means you allow reuse in contexts like data aggregation where attribution might be difficult. One of the most famous examples of open biomedical data is the human genome.

Equally important to the license is the information on how to correctly attribute the data creators. Depending on the data set, this may mean acknowledging others, citing the data set, or offering co-authorship.

License and attribution information must be distributed with the data. It is advised to provide a human-readable description as well as computer-readable metadata.

To make data useful, it should be provided in interoperable and open machine-readable formats. Formats should be easily parsable, not require any special software or license, and be supported by a wide range of tools and programming languages. Examples of open formats are JSON, CSV, YAML, XML, or HDF5. Interoperable data formats can easily be integrated into modeling workflows; e.g., a CSV file is generally much easier to integrate than a proprietary spreadsheet. Text-based formats also work well with version control, which makes changes on the data visible and trackable. Another important advantage of text-based formats is that they can easily be searched and indexed. A minimal requirement for data to be interoperable is that the data is both syntactically parsable and semantically understandable according to the respective standard. For example, you should ensure that there is no issue with the data files and, if possible, perform structural checks and content checks. Structural checks ensure that there are no empty rows, no blank headers, etc. Content checks ensure that the values are of the correct types (text, number, date, etc.), that their format is valid ("biological database identifiers must match a certain pattern"), and that constraints are respected ("age must be a number greater than 18"). Domain-specific formats often have associated validators, or simple mechanisms for validation (e.g., using XML or JSON schema files). Many communities develop their own domain-specific standards, which should be used whenever possible (see also rule 4).

It is best to use domain-specific repositories and data formats whenever possible, as this will greatly simplify integration of your data with other data sets, software tools, and modeling workflows.
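The structural and content checks described above can be sketched in a few lines of standard Python. This is a minimal illustration, not a full validator; the column names (`protein_id`, `age`) and the simplified identifier pattern are hypothetical:

```python
import csv
import io
import re

def validate_subjects_csv(text):
    """Structural and content checks for a hypothetical subjects.csv.

    Returns a list of human-readable problems (empty list = valid).
    """
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    # Structural checks: no blank headers, no empty rows.
    if any(h.strip() == "" for h in header):
        problems.append("blank column header")
    for i, row in enumerate(body, start=2):
        if all(cell.strip() == "" for cell in row):
            problems.append(f"line {i}: empty row")

    # Content checks: identifier pattern and a numeric constraint.
    idx = {name: k for k, name in enumerate(header)}
    protein_pattern = re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")  # simplified UniProt-like pattern
    for i, row in enumerate(body, start=2):
        if not protein_pattern.match(row[idx["protein_id"]]):
            problems.append(f"line {i}: invalid protein identifier")
        try:
            if float(row[idx["age"]]) <= 18:
                problems.append(f"line {i}: age must be greater than 18")
        except ValueError:
            problems.append(f"line {i}: age is not a number")
    return problems

good = "protein_id,age\nP12345,42\n"
bad = "protein_id,age\nXXX,seventeen\n"
print(validate_subjects_csv(good))  # []
print(validate_subjects_csv(bad))
```

For production use, a JSON Schema or a domain-specific validator is preferable to hand-rolled checks; the sketch only shows the kind of rules such a validator encodes.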
Examples of such domains are data of genomic sequences, proteins and protein structures, or metabolomic or transcriptomic data (see Table 1).

Domain-specific databases are highly relevant for the findability of your data set because they offer an entry point for data search. In addition, these databases are often integrated with other domain-specific tools and workflows. Libraries exist for working with these formats and databases. Compliance of submitted data with the relevant reporting standards promotes consistent and adequate data description, thorough data validation, data discoverability, data reproducibility, data interoperability, and (re)usability. General-purpose repositories such as BioStudies [17] are an option when no suitable domain-specific repository exists.
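As an illustration of why open, text-based domain formats are easy to work with, the widely used FASTA sequence format can be read with a short sketch in plain Python (no external library assumed; the sequence records below are made up):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # record header without the ">" marker
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span multiple lines
    return {h: "".join(parts) for h, parts in records.items()}

fasta = """>seq1 example protein
MKTAYIAKQR
QISFVKSHFS
>seq2
GATTACA
"""
print(parse_fasta(fasta))
```

In practice, established libraries for domain formats handle edge cases and should be preferred; the point is that an open text format remains usable even without them.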

Share all raw and processed data
Publish all relevant data as raw data, not just aggregated and highly processed data sets. For example, providing a figure is not the same as sharing the data points. For many modeling applications, pooled data is effectively the same as no data at all (e.g.

individual-based modeling or parameter fitting). To give a specific example, in the context of time-course data for ODE-based models, the time courses are very heterogeneous and the mean over individuals can be misleading. To make the data useful for modeling, raw and processed data must be shared for individual measurements and subjects, and figures and tables should contain individual data in addition to grouped or pooled data. Often, information crucial for modeling and data integration is not relevant for the primary publication and is never reported (e.g., body weight, age, sex). In most cases, the data is used in completely new contexts that the original data creator did not anticipate. Sharing as much as possible extends the possibilities of subsequent analysis and data integration and makes the data set much more valuable.
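A small numerical sketch shows how pooling heterogeneous time courses can be misleading. The two hypothetical subjects below have strong, out-of-phase oscillations; their subject-averaged time course is flat, so a modeler given only the pooled mean would see no dynamics at all:

```python
import math

# Two hypothetical subjects with out-of-phase oscillatory time courses
# (e.g. hormone concentrations): each individual shows strong dynamics.
times = [i * 0.5 for i in range(20)]
subject_a = [math.sin(t) for t in times]
subject_b = [math.sin(t + math.pi) for t in times]  # same rhythm, shifted phase

# Pooling across subjects: the averaged time course is (numerically) flat.
pooled_mean = [(a + b) / 2 for a, b in zip(subject_a, subject_b)]

print(max(abs(x) for x in subject_a))    # large individual amplitude (~1)
print(max(abs(x) for x in pooled_mean))  # essentially zero: dynamics cancel out
```

The example is deliberately extreme, but the same cancellation happens partially whenever individual dynamics differ in phase, timing, or rate.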

At the same time, it is important to be careful about which raw data to share and which not to share. In the context of biomedical research, the protection of patient-derived data is of the highest priority, and legal requirements must always be obeyed. For example, all sensitive data attributes must be removed from data sets, patient data must be anonymized or at least pseudonymized, and data that might allow re-identification of patients must be removed. This includes, for instance, genetic information or data about rare diseases.
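A minimal pseudonymization sketch, assuming a simple record layout with hypothetical field names, might look as follows. Note that this is illustration only: a salted hash is pseudonymization, not anonymization, the salt must be kept out of the shared data set, and real de-identification must follow the applicable law (e.g. GDPR/HIPAA) and institutional review:

```python
import hashlib
import secrets

# Direct identifiers to drop entirely (hypothetical field names).
DIRECT_IDENTIFIERS = {"name", "address", "birth_date"}

# Secret salt: generated once, stored securely, never shared with the data.
SALT = secrets.token_hex(16)

def pseudonymize(record):
    """Drop direct identifiers and replace the patient ID with a salted hash."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    pid = record["patient_id"]
    cleaned["patient_id"] = hashlib.sha256((SALT + pid).encode()).hexdigest()[:12]
    return cleaned

record = {"patient_id": "P-001", "name": "Jane Doe", "age": 54, "glucose_mmol_l": 5.6}
print(pseudonymize(record))
```

Even after such a step, quasi-identifiers (age, rare diagnoses, genetic data) may still allow re-identification, which is why a data protection review remains necessary.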

For recommended best practices see [27][28][29][30].

Metadata (data describing the data) puts the data into context by using biological, medical, or computational ontologies and by mapping information in the data set to database identifiers. Metadata adds a semantic layer (experimental, biological, environmental, etc.) and allows others and your future self to interpret your data.
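A machine-readable metadata sidecar can be as simple as a JSON file shipped next to the data. The sketch below is illustrative only; the field names are hypothetical rather than a formal standard, but it shows the kind of information (units, provenance, license) that makes a data set interpretable:

```python
import json

# Hypothetical metadata record for a measurement file; field names are
# illustrative. Units and provenance are made explicit and machine-readable.
metadata = {
    "file": "insulin_timecourse.csv",
    "description": "Plasma insulin time courses for individual subjects",
    "columns": {
        "time": {"unit": "min"},
        "insulin": {"unit": "pmol/ml"},   # SI-derived unit, not IU
        "subject_id": {"description": "pseudonymized subject identifier"},
    },
    "provenance": {
        "processing_script": "scripts/normalize.py",
        "software_versions": {"python": "3.11"},
    },
    "license": "CC-BY-4.0",
}

# Write the sidecar as pretty-printed JSON next to the data file.
print(json.dumps(metadata, indent=2))
```

Where a community metadata standard exists for your domain, use it instead of an ad-hoc layout like this one.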

Metadata improves findability, as semantic information can additionally be indexed and then used for search and filter functions. One example of a crucial metadata item for computational models is unit information. Units should be defined as SI units; e.g., providing an insulin concentration in pmol/ml is much more helpful than in international units (IU). Adding provenance information and information on the context under which a data set is valid/applicable can be very helpful.

However, to enable the reproduction of analysis results, one must be able to apply an identical analysis pipeline. Software tools, libraries, and workflows change over time, and this often leads to changes in results due to different parameters, new algorithms, or simple implementation errors. Your shared data should therefore also contain the code and workflows used in data processing, or at least clearly state the software used, its version, and the methods. An example is RNAseq data, with the raw data being the raw counts, whereas the processed and analyzed data is often something like differential gene expression between conditions. To make the analysis reproducible, the workflow for processing the data should be provided as code. The ideal case is when the complete code which generated the figures from the raw data is provided. This makes it easy to update the analysis pipeline and to reuse it if additional data sets are generated (which is often the case for validation of computational models).

An important aspect of findable and accessible data for use in computational models is to provide a standard identification mechanism by which data can be located. Archive your code and data in a separate third-party archiving site such as Zenodo, as well as in any long-term access repositories provided by your institution (e.g. data.caltech.edu for Caltech).
Note that archiving is not equivalent to making your code and data available in code-sharing sites such as GitHub. Get unique and stable identifiers for the data or data set. Even once shared, data is often lost due to resource decay and link decay. We highly recommend using a repository with resolvable identifiers [31].

So you shared your data with license and attribution information, people will be able to find it via metadata in their favorite repositories, and you made it easy to integrate data and processing into other people's computational modeling workflows because you provided easy-to-parse computer-readable formats and code. Congratulations, you improved the quality of scientific reporting and increased the chance for collaboration, publication rates, funding, citations, and visibility of your research. What's next? Lean back and enjoy your fame! You made an important contribution to scientific research, and computational models built using your data could answer important questions in biology and medicine. Thank you for making the effort. Your data matters.

Discussion

These ten simple rules address data sharing between biomedical and clinician scientists and the biomodeling and simulation community. We focus on mathematical models describing biomedical systems, such as disease progression, organ-level models, or biochemical processes leading to disorders.

Typical data that need to be shared include genomics, proteomics, and metabolomics data, but also patient-specific measurements. For any of these data types, it is important to understand that the data should remain understandable for both humans and machines. Formal representations and semantic annotations using domain-specific standards are key factors. Data can only be reused when it is equipped with a license and deposited in a findable repository. When talking about "data", this includes not only the raw and processed data from measurements, but also software code, scripts, documentation, and all relevant metadata such as provenance information. We would like to encourage the community to adhere to these recommendations, which fully respect the FAIR principles for data stewardship.