Ten Simple Rules for Sharing Experimental and Clinical Data with the Modeling Community

Science continues to become more interdisciplinary and to involve increasingly complex data sets. Many projects in the biomedical and health-related sciences adhere to the principles of FAIR data sharing, or aim to follow them. Data sharing has been shown to foster collaboration, to lead to better research outcomes, and to help ensure reproducibility of results. Data generated in biomedical and health research are distinctive in that they are heterogeneous, often large, and highly sensitive in terms of data protection needs and contextuality. Data sharing has to respect these features, but at the same time advances in medical therapy and treatment are time-critical. Modeling and simulation of biomedical processes have become an established tool, and a global community has been developing algorithms, methodologies, and standards for applying biomedical simulation models in clinical research. However, it can be difficult for clinician

Data provides the evidence for scientific knowledge and science is built on data [1]. Data sharing improves the quality of scientific reporting and increases scientific outcome, collaboration, publication rates, and visibility. Hence, it is beneficial to researchers, society, and funders. Data sharing is part of good scientific practice, enables data reuse, and fosters the scientific discovery process [2]. In addition, when data is cited properly, researchers get the credit they deserve for the data they generate [3]. Among the essential factors for catalyzing data sharing are (i) clear policies from funders, institutions, journals, and research communities; (ii) credit and incentives for data publication; (iii) explicit funding for data management, data sharing, and data publishing; (iv) practical help with organizing data, finding appropriate repositories and simpler ways of sharing; and (v) training and education in research data management [4]. Within this work we address the last two points.

Computational biology and medicine are fields that are highly dependent on the availability of research data. This requirement is independent of methodology, modeling domain, or complexity of the research object. Data sharing between experimentalists, clinicians, and modelers is an essential part of most investigations. Data is needed during model construction (parametrization), in which a subset of the data (training [...] tools for open and reproducible science, most data is difficult for modelers to use. One reason is a lack of data accessibility, with researchers rarely making their data available in a manner directly accessible by modelers, despite widespread support from funding agencies, scientific journals, and policy makers [2,5]. A further reason is a lack of data interoperability, with accessible data being difficult to integrate with computational models due to technical challenges with the shared data, for instance poorly annotated data hidden in PDF documents [1]. Many challenges relate to poor data FAIRness, i.e., data not being findable, accessible, interoperable, or reusable [6]. However, simply making data FAIR does not guarantee that the data is of high quality or that it can be used in computational modeling.

It is surprising that despite the importance of data sharing, no guideline or best practice for sharing research data with the modeling community exists. Members of the COMBINE community [7] discussed these problems during their annual meetings [8] and collected best practices. As a result, we provide ten simple rules (Figure 1) on how to share data for computational modeling, addressing issues of data accessibility and quality as well as providing guidelines for the research community and material for education.

The first rule of data sharing is to actually make an effort to share your data. Sharing data means making it accessible (online) to both humans and machines via a data repository. Humans should be able to access your data via a web browser and machines via web services and/or persistent links. Use a repository that minimizes hurdles to data access, i.e., if possible avoid resources that require accounts or registration for accessing data or are only accessible to a limited community (e.g., only within an institution). Remember that science is a global endeavor and data should be accessible for researchers worldwide, not only from one country or region. Providing open access to data is important not only for reuse and data mining but also to confirm that results presented in a publication are truly based on actual data [9]. Important criteria for choosing a data repository are long-term availability and acceptance within the community. The longer a repository exists, and the more users it has, the better its support and ecosystem of software tools. Several platforms support scientists in finding the best repository for their needs, including criteria such as supported data formats (e.g. https://fairsharing.org), archiving services, services for Digital Object Identifiers (DOI), and choice of licenses (e.g. https://www.re3data.org/). Note that articles that include a data availability statement containing a link to a repository are associated with up to 25% higher citation impact on average [3].

So what data should you share? As a rule of thumb, data sharing should be 'as open as possible, as closed as necessary'. For most data, this translates to sharing your data set openly without any restrictions. As a side note, you often share your data with your future self. Hence you should make accessible everything that you will need to reproduce and build on your results.

Equally important to the license is the information on how to correctly attribute the data creators. Depending on the data set, this may mean acknowledging others, citing the data set, or offering co-authorship.

License and attribution information must be distributed with the data. It is advised to provide a human-readable description as well as computer-readable metadata. This

Share all raw and processed data
Publish all relevant data as raw data, not just aggregated and highly processed data sets. For example, providing a figure is not the same as sharing the data points.

Sharing data for a plot means providing the underlying raw data and processed data sets. Pooled data is the same as no data at all for such applications (e.g., individual-based modeling or parameter fitting). In the context of time course data for ODE-based models, to give a specific example, the time courses are very heterogeneous and the mean from individuals can be misleading. To make the data useful for modeling, raw and processed data must be shared for individual measurements and subjects, and figures and tables should contain individual data in addition to grouped or pooled data.

Often crucial information for modeling and data integration is not relevant for the primary publication and never reported (e.g., body weight, age, sex). In most cases, the data is used in contexts completely different from what the original data creator anticipated. Sharing as much as possible extends the possibilities of subsequent analysis and data integration and makes the data set much more valuable.
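
The point about misleading means can be illustrated with a minimal sketch. It assumes a simple mono-exponential elimination model; the subject names, rate constants, and sampling times are made up for illustration and do not come from any real study. The pooled curve is not itself mono-exponential, so the rate constant a modeler would estimate from it depends on which time window is used and does not match the mean of the individual rate constants.

```python
import math


def concentration(t, k, c0=100.0):
    """Mono-exponential elimination: c(t) = c0 * exp(-k * t)."""
    return c0 * math.exp(-k * t)


# Hypothetical elimination rate constants (1/h) for three subjects.
rates = {"subject_1": 0.1, "subject_2": 0.5, "subject_3": 2.0}
times = [0.0, 1.0, 2.0, 4.0, 8.0]  # sampling times in hours

# Individual time courses: what a modeler needs for parameter fitting.
individual = {
    subject: [concentration(t, k) for t in times]
    for subject, k in rates.items()
}

# Pooled time course: mean over subjects at each sampling time.
pooled = [
    sum(curve[i] for curve in individual.values()) / len(individual)
    for i in range(len(times))
]


def apparent_rate(c_start, c_end, t_start, t_end):
    """Rate constant implied by two points, assuming mono-exponential decay."""
    return math.log(c_start / c_end) / (t_end - t_start)


# Rate constants recovered from the pooled curve in an early and a late window.
early = apparent_rate(pooled[0], pooled[1], times[0], times[1])
late = apparent_rate(pooled[3], pooled[4], times[3], times[4])
mean_rate = sum(rates.values()) / len(rates)

print(f"mean of individual rates: {mean_rate:.3f} 1/h")
print(f"apparent rate, pooled data, early window: {early:.3f} 1/h")
print(f"apparent rate, pooled data, late window:  {late:.3f} 1/h")
```

With individual time courses, each subject's rate constant can be fitted directly. With only the pooled curve, the late window is dominated by the slowest subject and the heterogeneity between subjects is lost.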

However, you should always consider carefully which raw data to share and which not to share. In the context of biomedical research, the protection of patient-derived data is of highest priority, and legal requirements must always be obeyed. For example, all sensitive data must be removed from data sets, patient data must be anonymized or at least pseudonymized, and data that would allow re-identification of patients must be removed from data sets. This includes, for instance, genetic information or data about rare diseases.

To publish FAIR data, it is necessary to clearly state what information is contained in each of the data items, e.g. what has been measured in a specific variable. Metadata (data describing the data) puts the data into context using biological, medical, or computational ontologies and mapping information in the data set to database identifiers. Metadata adds a semantic layer (experimental, biological, environmental, etc.) and allows others and your future self to interpret your data. Metadata improves findability as semantic information can additionally be indexed and then used for search and filter functions. One example of a crucial metadata item for computational models is unit information. Units should be defined as SI units, e.g., providing an insulin concentration in pmol/ml is much more helpful than in international units (IU).
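
As a minimal sketch of how unit information and variable descriptions can travel with a data set, the example below pairs a small table with a data dictionary and checks that every column is documented. The column names, values, and the dictionary layout are hypothetical and chosen only for illustration; they are not a community standard.

```python
import csv
import io

# Hypothetical measurement table; column names and values are made up.
DATA_CSV = """subject_id,time,insulin,body_weight
S01,0,0.055,72.5
S01,30,0.310,72.5
"""

# Hypothetical data dictionary: one entry per column, each with a
# description and an explicit unit (None for identifiers without a unit).
DATA_DICTIONARY = {
    "subject_id": {"description": "Pseudonymized subject identifier", "unit": None},
    "time": {"description": "Time after dosing", "unit": "min"},
    "insulin": {"description": "Plasma insulin concentration", "unit": "pmol/ml"},
    "body_weight": {"description": "Body weight at baseline", "unit": "kg"},
}


def undocumented_columns(csv_text, dictionary):
    """Return the columns that lack a description or an explicit unit entry."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = []
    for column in header:
        entry = dictionary.get(column)
        if entry is None or not entry.get("description") or "unit" not in entry:
            missing.append(column)
    return missing


print(undocumented_columns(DATA_CSV, DATA_DICTIONARY))
```

A check like this can run before every upload, so that a column without a stated unit is caught by the data creator rather than by the modeler.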

Adding provenance information and information on the context under which a data set is valid/applicable can be very helpful.

However, to enable reproducing analysis results, one must be able to apply an identical analysis pipeline. Software tools, libraries, and workflows change over time, and this [...] pipeline and reuse the pipeline if additional data sets are generated (which is often the case for validation of computational models).

8. Archive data and code
An important aspect of findable and accessible data for use in computational models is to provide a standard identification mechanism to make data locatable. Archive your code and data in a separate third-party archiving site such as Zenodo, as well as in any long-term access repositories that are provided by your institution (e.g., data.caltech.edu for Caltech). Note that archiving is not equivalent to making your code and data available on code-sharing sites such as GitHub. Get unique and stable identifiers for the data or data set. Even once shared, data is often lost due to either resource decay or link decay. We highly recommend using a repository with resolvable identifiers, such as DOIs, which are accessible and resolvable long term. In case of a dedicated database, these can be database identifiers, which should be uniquely resolvable (e.g., identifiers.org information). The journal Scientific Data maintains a good list of archives you can look at.

So you shared your data with license and attribution information, people will be able to find it via metadata in their favorite repositories, and you made it easy to integrate data and processing into other people's computational modeling workflows because you provided easy-to-parse computer-readable formats and code. What's next? Lean back and enjoy your fame: you made an important contribution to scientific research, and computational models using your data could answer important questions in biology and medicine. Thanks for your efforts. Your data matters.
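
The checklist above (license, attribution, resolvable identifier, computer-readable metadata) can be sketched as a small machine-readable record distributed with the data. This is only an illustration: the field names are an assumption rather than an established schema, the title and creator names are made up, and the DOI shown is a placeholder that does not refer to a real data set. The only fixed conventions used are the https://doi.org/ resolver prefix and the SPDX license identifier CC-BY-4.0.

```python
import json
import re

# Loose pattern for DOIs: "10.", a registrant code, a slash, and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_url(doi):
    """Return the resolvable https://doi.org/ URL for a DOI string."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a valid DOI: {doi!r}")
    return "https://doi.org/" + doi


def build_record(title, creators, license_id, doi, attribution):
    """Assemble a metadata record and refuse to return an incomplete one."""
    record = {
        "title": title,
        "creators": list(creators),
        "license": license_id,
        "identifier": doi_url(doi),
        "attribution": attribution,
    }
    missing = [key for key, value in record.items() if not value]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return record


# All values below are hypothetical placeholders.
record = build_record(
    title="Example time course data set",
    creators=["A. Researcher", "B. Clinician"],
    license_id="CC-BY-4.0",
    doi="10.5281/zenodo.0000000",
    attribution="Please cite the data set via its DOI.",
)

# Serialize to JSON so that machines can read what the README tells humans.
metadata_json = json.dumps(record, indent=2, sort_keys=True)
print(metadata_json)
```

In practice, the repository you deposit in will usually define its own metadata schema; a record like this is mainly useful as a local check that nothing essential is missing before upload.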

Publishing the data behind biomedical and clinical studies is good scientific practice, and it encourages scientific discourse. As a result, the data can be transparently checked and further reused, and scientific results can obtain a higher level of curation and trust. In this paper, we outline the recommended workflow for sharing data between biomedical and clinician scientists and the biomodeling and simulation community. We focus on mathematical models describing biomedical systems, such as disease progression, organ level models, or biochemical processes leading to disorders.

Typical data that need to be shared include genomics, proteomics, and metabolomics data, but also patient-specific measurements. For any of these data types, it is important to understand that the data should remain understandable for both humans and machines. Formal representations and semantic annotations using domain-specific standards are key factors. Data can only be reused when it is equipped with a license and deposited in a findable repository. When talking about 'data', this includes not only the raw and processed data from measurements, but also software code, scripts, documentation, and all relevant metadata such as provenance information. We would like to encourage the community to adhere to these recommendations, which fully respect the FAIR principles for data stewardship.

August 13, 2021