Supplementary Materials to “The Emerging Landscape of Epidemiological Research Based on Biobanks Linked to Electronic Health Records: Existing Resources, Analytic Challenges and Potential Opportunities”

In this section, we describe the methods we used to identify and classify recent literature based on major biobanks. A preliminary search of biobanks was conducted. First, a university-sponsored database was searched for papers on biobanks and papers published using biobank data. Second, we compiled a short list of biobanks and searched their websites for biobank-promoted research articles. These papers were read to identify various topics for search terms. We identified the following topic areas: feasibility and implementation, ethics and public perception, cohort characteristics, therapy, GWAS/PheWAS, and other analyses of biobank data. PubMed was the primary database used. We searched for various combinations of terms related to the topic areas we identified as well as the names of specific biobanks. Papers promoted on biobank websites were also included. Papers from these searches were included if they (a) analyzed data from a biobank (genetic or non-genetic), (b) were published about a specific biobank, or (c) were published about biobanks in general. Papers were excluded if they were not in English, but we placed no restrictions on date of publication or geographic region. A subsequent search was conducted focusing solely on papers published using biobank data (particularly biobanks linked with EHR) and performing a genetic analysis. Preference was given to studies where genotype data was analyzed (largely GWAS/PheWAS). The publication search was concluded June 1, 2018. We would like to emphasize that this is not intended to be an exhaustive list of all biobank-related literature. It was, however, intended to provide a good understanding of the state of biobank literature in general.

Biobanks linked to electronic health records (EHR) provide a rich data resource for health-related research. Biobanks, loosely defined, are biorepositories that accept, process, store and distribute biospecimens and/or associated data for use in research and clinical care. 1 The rise in the number and size of biobanks across the world in recent years can be explained by improvements in biospecimen analysis and the need for large datasets to address complex diseases and conditions. 1,2 Many types of biobanks exist, including commercial, single medical center, health system-based, and population-based biobanks. Some biobanks are disease- or organ-specific, while others encompass a large breadth of diseases.
The development and accessibility of large-scale biorepositories provide the opportunity to accelerate agnostic ("hypothesis-free") searches, new discoveries, and hypothesis-generating studies of disease-treatment, disease-exposure and disease-gene associations. Rather than spending time and money designing and implementing a single study, researchers can use biobanks' existing data-rich resources to answer scientific questions as quickly as they can analyze them. With the establishment of biobank infrastructure, the availability and utility of data from biobanks have dramatically increased over time, and scientific interest in biobank-based research has grown. Moreover, governments and institutions are investing in the establishment of large-scale biobanks such as the US National Institutes of Health's upcoming All of Us biobank 3 and the well-established, multi-institutional UK Biobank (UKB). 4,5 As more researchers become interested in using biobank data to explore a diverse spectrum of scientific questions, resources guiding the data access, design, and analysis of biobank-based studies will be crucial. Comprehensive resources describing the types of data available in major biobanks and comparing their patient populations and research emphases are still limited.
Recent reviews briefly discuss statistical and computational considerations for studies involving genetic data, 6 limitations of traditional study designs and identifying real-world phenotypes, 7,8 and EHR-based approaches and database linkages in making pharmacogenetic discoveries. 9 These reviews are limited in their discussion of statistical methods related to biobank and EHR-based research and in their exploration of critical concepts such as study design, sampling, missing data, and other analytic issues related to biobank research. In this paper, we complement and extend recent publications about biobank-based research with the ultimate goal of providing an extensive catalog of resources and some practical guidance to researchers pursuing biobank-based research. In Section 2, we characterize different types of biobanks and provide detailed descriptions of specific biobanks including their geographic location, size, data access and availability, data linkages and more. We also discuss the dominant health-related outcomes studied in biobank research to date. In Section 3, we describe different statistical approaches for genome- and phenome-wide association studies (GWAS/PheWAS), an area of particular interest in biobank research. In Section 4, we discuss general statistical issues related to biobank research including study design, sampling strategy, phenotype identification, and missing data. We illustrate some of these issues using data from three biobanks: the Michigan Genomics Initiative (MGI), 10,11 the UK Biobank (UKB), 4,5 and Genes for Good (GFG). In Section 5, we mention potential opportunities and promising future directions for expanded and improved biobank-based research through a discussion of novel and emerging uses of EHR data and the integration of EHR with external data sources.

Section 2: A Characterization of Major Biobanks
In this section, we describe the types of biobanks that are frequently discussed in the literature and provide detailed descriptions for many specific biobanks. We then discuss recent biobank-based literature and highlight specific topics receiving particular attention. The goal of this section is to provide a high-level overview of the kinds of research being conducted using biobank data and to provide practical resources describing specific biobanks.

Existing Biobanks
Table 1 describes some notable biobanks in terms of their size, location, type, and accessible data. This table extends the biobank descriptions in Wolford et al. (2018) to include additional information about data linkages and cohort characteristics, and it includes information for a broader set of biobanks. 6 Many of the biobanks listed in Table 1 provide access to data for outside researchers. These biobanks are often connected to EHR and contain genotype information for some of the patients. Some of these biobanks also have linkage to death registries and detailed prescription information. The biobanks in Table 1 often fall into two general categories: population-based biobanks and medical center/health care system-based biobanks.

Population-based biobanks
Population-based biobanks are large-scale biorepositories that aim to recruit subjects reasonably representative of the source population. Population-based biobanks recruit directly from the general population (e.g. China Kadoorie Biobank), and subjects are eligible for enrollment irrespective of their disease status or healthcare utilization. Estonia, 12,13 Denmark, 14 Sweden, 15 Saudi Arabia, 16 China, 17 the Republic of Korea, 18,19 Qatar, 20,21 and Taiwan 22,23 are just some of the countries that have invested in establishing population-based (or reasonably representative) biobanks.
Perhaps the most well-known population-based biobank that has been used for research is the UKB. 4 With over 500,000 subjects, it is one of the largest biobanks in the world. All residents aged 40-69 who lived within 25 miles of one of its 22 assessment centers (~9.2 million) were invited to participate. 5 UKB takes advantage of the UK National Health Service to obtain follow-up data (e.g. mortality, cancer registrations, hospital admissions, primary care data) and actively collects and verifies conditions that are typically under-reported (e.g. cognitive function, depression). 5

Medical Center and Health system-based biobanks
Another common type of biobank is based on a medical center or a particular health care system. In general, health system-based biobanks, such as Partners HealthCare Biobank, contain EHR and genotype data along with survey data. Some, like the large Kaiser Permanente Research Bank (KPRB), have additional linkages with detailed prescription information and feature-specific sub-cohorts (e.g. pregnancy and cancer cohorts in the case of KPRB). A notable health system-based biobank is the Million Veteran Program. With more than 600,000 participants already enrolled, it is one of the world's largest genomic biobanks, and it recruits participants from the US veteran population, allowing for the investigation of military-related diseases and conditions. Other such biobanks recruit patients from a distributed network of health centers throughout the country. Their sampling strategy may include active recruitment for particular subpopulations; for example, BioBank Japan 24 recruits patients with particular current or past disease status, and the upcoming NIH All of Us 25 program will feature the active recruitment of underrepresented minorities.
MGI (used in illustrative examples below) is an academic medical center-based biobank that started at the University of Michigan in 2012. It recruits surgical patients over the age of 18 based on opt-in consent (allowing for re-contact for future research purposes), collects and stores blood samples, genotypes DNA samples, collects brief survey data related to pain, and is linked to EHR. This biobank also links patient data to other data sources, including its cancer registry, prescription data, insurance claims and the national death index, and is undergoing an effort to implement a broad epidemiologic questionnaire designed to be comparable to other biobank survey data, namely that of the UKB. For some biobanks, select biobank descriptives are publicly available online without application; for example, MGI publishes summary counts for International Classification of Diseases (ICD) codes and some PheWAS results, 11 and DiscovEHR shares frequencies of various genetic variants. 26

Other types of biobanks
Initially planning to become the first nationwide biobank, deCODE Genetics is now a privately owned commercial biobank. Launched in 2007 and funded by the National Human Genome Research Institute (NHGRI), the Electronic Medical Records and Genomics (eMERGE) Network combines a network of DNA biorepositories linked with EHR as a resource for genetic analyses. Disease-specific biobanks are also common, and these biobanks may focus on rarer conditions. Two examples are PcBaSe Sweden, 27 a prostate cancer cohort, and the Mayo Clinic Biobank for bipolar disorder. 28 While disease-specific biobanks may be better powered than other biobank types to study certain diseases, they are typically smaller, may not be linked with EHR, and may not have genotype data readily available.
GFG (used in illustrative examples later on) is a subject-initiated biobank that started at the University of Michigan in 2015. It recruits participants over the age of 18 from all 50 US states through an online Facebook application, collecting survey data on health and behavior. As an incentive for continued participation and contribution of a saliva sample for genotyping, participants also receive ancestry analysis and the option to download their raw genetic data. At the time of publication, over 70,000 participants are enrolled and over 27,000 have been genotyped. Unlike many other biobanks in this paper, GFG is not linked to EHR data.

Recent Major Biobank-Based Literature
In order to characterize the current biobank literature, we conducted a brief literature search using PubMed to find papers about biobanks and biobanking and papers using biobank data. Details regarding the methodology used to identify publications can be found in Supplementary Section S1. We emphasize that this is not intended to be an exhaustive list of biobank-based literature. The papers published about biobanks or using biobank data can be roughly grouped based on their scientific goals as follows: (1) biobank study design and cohort characteristics, (2) ethics and public perception of biobanks, (3) feasibility and implementation, (4) exploration of treatments and therapies, (5) epidemiologic exploration focused on non-genetic data, and (6) epidemiologic exploration using genetic data. Below, we review papers in these six broad categories in more detail.

Study Design and Cohort Characteristics
Biobanks typically publish papers on study design, 24,29,30 cohort characteristics, 13,30-34 how the cohort differs from the rest of the country's population, 35 and characteristics of specific patient populations (e.g. clinical characteristics of colorectal 36 and prostate 37 cancer patients in the BioBank Japan cohort). This information is critical for determining generalizability of results obtained using biobank data.

Ethics and Public Perception
There has been a good deal of attention given to the ethics of biobanks, particularly ethical and legal concerns 2 with the use of broad consent (seeking consent for future unspecified research). Particular attention has been given to the use of opt-out consents in biobanks with plans for broad, long-term use. 38,39 Additionally, research has looked at the public perceptions of biobanks and biobanking, 40 identified areas of reluctance for potential subjects to consent, and gathered general thoughts on medical and epidemiological research. While hurdles exist (including concerns about privacy and confidentiality, benefit-sharing and commercialization, and internationalization), there is evidence from Germany 41 and China 42 that there is general public support for biobanks and large-scale cohort studies.

Feasibility and Implementation
Literature about biobanks explores feasibility and implementation for establishing biobanks, including business plans and models for facilitating biobank creation, 43 how to recruit and obtain consent (particularly among specific groups of patients such as cancer patients), 44-46 and the use of electronic consent in biobanking. 47 Increasingly, biobanks are augmenting their survey data with EHR database connections. The promise and utility of EHR data for secondary research use has been well-established. 48,49 Research into EHR data quality suggests a need for standardized methods of EHR data quality assessment 50 and awareness of underlying data collection processes. 51 Concerns around EHR data manipulation and analysis are discussed later.

Scientific Studies of Health-Related Outcomes
The vast majority of emerging biobank-based literature focuses on studying health-related outcomes. One area of exploration involves comparisons or characterizations of different treatments or therapies. For example, Ramirez et al. (2012) examined the impact of genetic variants in European-Americans and African-Americans on the response to different warfarin dosages. 52 EHR-linked biobank data is well positioned to explore treatment or therapy outcomes, treatment repurposing, and gene-by-treatment interactions.
67][68][69][70][71]54,[72][73][74][75][55][56][57][58][59][60][61] We group these papers published using biobank data into two coarse categories: genetic and non-genetic analyses. Two examples of non-genetic analyses are Song et al. (2018), which describes the protective nature of alcohol consumption on coronary artery disease risk in the Million Veteran Program, and Peters et al. (2018), which describes sex differences in the association between measures of general and central adiposity and risk of myocardial infarction in the UKB. 54,55 Pilling et al. (2017) is an example of a genetic study, in which the authors conducted a genome-wide association study of UKB data to identify 25 loci associated with human longevity. 76 In another example of a genetic analysis, Nielsen et al. (2018) used biobank data to explore the relationship between genetics and atrial fibrillation. 77 Figure 1 provides a distribution of included biobank-based publications falling into each of the above categories over time. The recent, rapid increase of biobank-based analyses, particularly the non-genetic and genetic-based analyses of health-related outcomes, is evident. The rise of genetic-based studies can be partly explained by the increase in the number of genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS). GWAS use genotype data, typically from a large number of individuals, to relate millions of genetic variants with a given disease/health condition, and biobanks often contain upwards of several hundred thousand individuals. Additionally, many biobanks have linked the genotype data to EHR, which allows for in-depth phenotyping and, thus, the feasibility of relating millions of genetic variants with hundreds of diagnoses and lab tests, leading to exploration of the genome x phenome landscape through PheWAS.
While the overall number of biobank-related papers has been increasing rapidly, it is worth exploring the number and types of publications produced by individual biobanks. The types of papers published for a particular biobank may depend on the kinds of data available and the willingness to share data externally. Table S1 provides additional details about the types of identified papers associated with several prominent biobanks. UKB is associated with a large number of publications and particularly papers involving genetic data. The large volume of publications can be explained by external data accessibility and the presence of high-quality genetic information on a large number of patients. In studies conducted using data from other biobanks, UKB data is often chosen as a validation dataset.

Common Outcomes in Biobank Research
While data in large biobanks allow researchers to examine a broad array of outcomes (and often many at once), psychiatric/neurologic outcomes, cardiovascular disease, obesity/diabetes, cancer, and pulmonary conditions dominate recent biobank-based research. 2][93] These outcomes are ascertained by either diagnosis codes or survey responses, and different definitions and thresholds are used in sensitivity analyses. Similarly, cardiovascular disease outcomes include coronary artery disease/coronary heart disease, 32,54,55,64,87,88,94-100 which are defined as a combination of more specific conditions including myocardial infarction. Related conditions such as stroke, atrial fibrillation 77,101-103 and calcific aortic valve stenosis 104 have also been explored in the literature. Obesity (and related measurements like BMI and waist-to-hip ratio) and diabetes have also been explored. 54,58,95,99,100,105-118 Colorectal, 53,119 breast, 57,120,121 lung, 72 pancreatic, 122 and skin 10 cancers as well as pulmonary conditions including smoking 32,59,60,75,123 and airflow obstruction 59,60,124 have been investigated, but to a lesser extent.
While psychiatric/neurologic conditions, cardiovascular disease, obesity, cancer, and pulmonary conditions are responsible for a significant portion of morbidity and mortality, the breadth and depth of EHR-linked biobank data offer a valuable resource to research many other rare and chronic diseases and conditions as well as risk factors and health behaviors. As such, there is great opportunity for future explorations into health outcomes using EHR-linked biobank data.

Section 3: Brief Summary of Statistical Approaches for GWAS/PheWAS
The combination of large-scale genotype and phenotype data provides new avenues for exploring scientific questions regarding the relationships between genotypes and phenotypes. First demonstrated in Denny et al. (2010), PheWAS explore the associations between a single variable of interest and many EHR-derived phenotypes. 125 PheWAS usually relate phenotypes to a single genetic variant or a polygenic risk score (e.g. Fritsche et al. 2018), 10 but it is worth noting that PheWAS can be conducted based on additional biomarker values/lab tests (e.g. Liao et al. 2017, Neuraz et al. 2013). 126,127 This provides a broad range of scientific questions that can be explored through GWAS and PheWAS-type analyses using biobank data. For both GWAS and PheWAS, phenotypes are often defined using ICD codes derived from the EHR (see the section on "Defining the Phenome" for more details).
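As a minimal sketch of the PheWAS idea described above, the following Python code relates one binary variable of interest (here, carrier status for a single hypothetical variant) to each of several binary phenotypes and applies a Bonferroni threshold across phenotypes. It is an illustration under simplifying assumptions, not the procedure used in the cited studies: it uses an unadjusted 2x2 Pearson chi-square test in place of the covariate-adjusted regression models typically used in practice, and all function names and phenotype labels are hypothetical.

```python
import math


def chi2_p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def two_by_two_test(carrier, case):
    """Return (odds ratio, p-value) for two equal-length 0/1 vectors.

    Returns None when a margin of the 2x2 table is empty, because the
    Pearson chi-square statistic is then undefined.
    """
    a = sum(1 for g, y in zip(carrier, case) if g and y)          # carrier, case
    b = sum(1 for g, y in zip(carrier, case) if g and not y)      # carrier, control
    c = sum(1 for g, y in zip(carrier, case) if not g and y)      # non-carrier, case
    d = sum(1 for g, y in zip(carrier, case) if not g and not y)  # non-carrier, control
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if min(margins) == 0:
        return None
    chi2 = n * (a * d - b * c) ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, chi2_p_value_1df(chi2)


def phewas_scan(carrier, phenotypes, alpha=0.05):
    """Test one variable of interest against every phenotype in a dict.

    `phenotypes` maps a (hypothetical) phenotype label to a 0/1 vector.
    Significance uses a Bonferroni threshold of alpha / number of phenotypes.
    """
    threshold = alpha / len(phenotypes)
    results = {}
    for name, case in phenotypes.items():
        test = two_by_two_test(carrier, case)
        if test is None:
            continue  # skip phenotypes with no cases or no controls
        odds_ratio, p_value = test
        results[name] = {
            "odds_ratio": odds_ratio,
            "p_value": p_value,
            "significant": p_value < threshold,
        }
    return results
```

The same loop structure applies when the variable of interest is a polygenic risk score or a lab value; only the per-phenotype test would change.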
The current GWAS/PheWAS literature features studies that fall into multiple different categories in terms of their analytic goals. A natural and common goal of GWAS and PheWAS is to study the associations between specific phenotypes and variants at a particular gene region. This analysis is often performed using linear or logistic regression (recently, Firth-corrected logistic regression) or using mixed linear model association (MLMA) analysis. 10,86,95,101,117 A discussion of MLMA and related issues can be found in Yang et al. (2014). 128 Dey et al. (2017) proposed a fast alternative to Firth-penalized regression to stabilize estimation for PheWAS studies using saddle-point approximation that is particularly useful for handling extremely unbalanced case-control data. 129 More recently, a saddle-point approximation approach for estimating mixed models (called SAIGE) was proposed for handling highly unbalanced case-control data with additional sample relatedness, which is typical for biobanks. 130 In the PheWAS setting, researchers may be further interested in studying the association between multiple phenotypes in terms of underlying genetic risk. In Fritsche et al. (2018), researchers approach this task by developing a polygenic risk score for a primary phenotype of interest and relating the polygenic risk score to other phenotypes. 10][133] Another common target for these studies is to identify the proportion of variation in a particular phenotype that can be attributed to genetic variation, called heritability. 5][136] Recently, Bastarache et al. (2018) developed a phenotype risk score-based method (called PheRS) to study rare genetic variants associated with Mendelian diseases. 137 Researchers have also used particular genetic variants as instrumental variables in PheWAS analyses, where loci related to a primary phenotype are selected and their associations with secondary phenotypes are evaluated. 101 Mendelian randomization analysis is then used to explore potential causal relationships between the genetic trait and the primary and secondary phenotypes. 138
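To make the polygenic risk score idea mentioned above concrete, the sketch below computes a score as a weighted sum of risk-allele dosages and standardizes scores across a cohort. This is the generic textbook construction and is assumed here for illustration; it is not necessarily the construction used in Fritsche et al. (2018). The weights and dosages are hypothetical; in practice the weights would come from effect estimates for the primary phenotype.

```python
from statistics import mean, pstdev


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant).

    `weights` holds one per-variant effect size (hypothetical values here),
    in the same variant order as `dosages`.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


def standardize(scores):
    """Center and scale scores so that effects are per standard deviation."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sd for s in scores]


# Hypothetical example: three variants, three subjects.
weights = [0.2, -0.1, 0.5]
cohort_dosages = [[2, 1, 0], [0, 0, 1], [1, 2, 2]]
raw_scores = [polygenic_risk_score(g, weights) for g in cohort_dosages]
z_scores = standardize(raw_scores)
```

The standardized score can then serve as the single variable of interest that is related to each secondary phenotype in a PheWAS.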

Study Design
A key issue to consider when performing a biobank-based study is study design. Design choices can have implications for the analysis and interpretation of the study results. In this section, we describe several common approaches for study design used in biobank research and describe some analytical and design-based strategies for dealing with common sources of bias. A critical part of study design is the mechanism by which patients are sampled from the population of interest. Two sampling mechanisms are at play: (1) the mechanism by which subjects are sampled from the population into the biobank and (2) the mechanism by which biobank subjects are sampled for inclusion in the study.

Sampling from the Population
Population-based biobanks like UKB and China Kadoorie Biobank sample patients from a network of health centers or administrative sites across each country. Compared to other types of biobanks, population-based biobanks are often thought to be more representative of the target population and often recruit a larger number of subjects. However, individual characteristics may still impact inclusion in a population-based biobank; for example, living near an assessment center (UKB) or living in a region with certain desirable characteristics (Kadoorie). Medical center-based biobanks (e.g. MGI, BioVU) and health system biobanks (e.g. KPRB, Partners) attempt to recruit all patients that meet certain criteria within the center/health system, often through selected clinics. Generally, participation in these biobanks requires subjects to use healthcare, which is indicative both of ability to access healthcare (e.g. barriers to access including transportation and insurance) and health (i.e. people with diseases and chronic conditions are more frequent users of healthcare). In the case of BioBank Japan, patients at participating health centers are identified if they have had or become diagnosed with one of 47 diseases. Compared to population-based biobanks, academic medical center-based biobanks tend to see more patients with rare or complicated diseases due to availability of specialized care and, thus, are often useful for investigating rare conditions. For example, MGI 10,11 is enriched for analyses of some cancer types, most notably melanoma of the skin, since Michigan Medicine is known for its skin cancer treatment and care. Disease-specific biobanks are used to examine specific conditions, and in some cases, disease-specific biobanks will also recruit controls (e.g. PcBaSe Sweden). Of note are biobank recruitment methods that recruit self-selected volunteers directly from the general population such as GFG (subject-initiated via Facebook). Many biobanks will further screen interested volunteers in addition to the sampling mechanisms described above (as is planned in the upcoming NIH All of Us biobank). In all cases, the study designs have the potential to introduce sampling and participation biases into the analysis. This can have implications for the generalizability of the results and can impact measures of association.
To demonstrate the impact of different sampling mechanisms, we consider prevalence rates for different phenotypes in three biobanks: MGI, UKB, and GFG. As mentioned previously, MGI is a biobank of over 60,000 patients treated at an academic medical center. Patients in MGI were recruited through the anaesthesiology department as patients were preparing to have a surgical procedure. The UKB is a population-based collection of over 500,000 patients. GFG is a self-initiated biobank recruiting subjects via Facebook. MGI and UKB are linked to EHR, while GFG obtains phenotype information via survey self-reporting. All three biobanks are described in Table 1, and Table S2 in Supplementary Section S4 provides comparisons of the patients in MGI, UKB, and GFG in terms of demographics. The three biobanks have very different sampling mechanisms into the biobank, and we expect the phenotype prevalences in MGI and GFG patients to be quite different both from the general US population and the subjects in the UKB. Phenotypes were defined for MGI and UKB using aggregated versions of ICD codes, called PheWAS codes or phecodes. 139 Phenotypes in GFG were defined based on survey responses. A description of the phenotype generation process can be found in Supplementary Section S2.
Table 2 presents prevalences of many commonly-studied diseases in MGI, UKB, and GFG along with published prevalences for their corresponding target populations. MGI often captures subjects with many conditions at a higher rate than is observed in the nationwide population. The UKB has higher case counts for several conditions due to its size, and it is often more representative of the rates observed in the population (at least for conditions common among ages 40-69, the age range included in UKB). We note that there are several diseases for which the ICD-derived phenotype classifications in UKB do not appear representative, particularly obesity. We discuss this issue in more detail in Supplementary Section S5. The high prevalence of depression in GFG is a result of the broad definition of the depression phenotype, which is obtained by asking subjects whether they have ever felt depressed. We note the small proportion of subjects diagnosed with breast cancer and prostate cancer in GFG compared to the US population. This may be a result of the differing age distributions, where GFG consists of generally younger people. Differences between GFG and the other biobanks may be a result of different sampling mechanisms or differing phenotyping procedures.
The difference in sampling mechanisms between MGI and UKB has an impact on observed disease prevalences for many types of diseases. Figure 2 shows the relative prevalence of various phenotype codes within different disease categories between MGI and UKB. We see that the majority of the prevalences are higher in MGI. In particular, prevalences for neoplasms, symptoms, endocrine/metabolic disorders, infectious diseases, and congenital anomalies are uniformly higher for MGI compared to UKB.
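A relative prevalence of the kind discussed above can be sketched as a simple ratio of crude prevalences, with an approximate Wald interval on the log scale. The code below is an illustrative assumption about how such a comparison could be computed, not the procedure used to produce Figure 2, and the counts in the usage note are hypothetical rather than taken from MGI or UKB.

```python
import math


def prevalence_ratio(cases_a, n_a, cases_b, n_b, z=1.96):
    """Crude prevalence ratio (cohort A over cohort B) with a Wald interval.

    The interval is computed on the log scale using the usual standard
    error for a ratio of two independent proportions.
    """
    if min(cases_a, cases_b) <= 0:
        raise ValueError("both cohorts need at least one case")
    if cases_a > n_a or cases_b > n_b:
        raise ValueError("case counts cannot exceed cohort sizes")
    ratio = (cases_a / n_a) / (cases_b / n_b)
    se_log = math.sqrt(1 / cases_a - 1 / n_a + 1 / cases_b - 1 / n_b)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper
```

For example, `prevalence_ratio(300, 1000, 100, 1000)` compares a hypothetical 30% prevalence with a hypothetical 10% prevalence. Because the cohorts also differ in age and other demographics, a crude ratio like this mixes sampling differences with demographic ones and would normally be read alongside stratified or standardized comparisons.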
The biobank sampling mechanism may also have implications for the use of EHR data. Population-based biobanks may be more likely to have access to a patient's primary care center EHR but might have to deal with heterogeneity both in terms of the EHR interface used to collect and store the data and differences in case/procedure/diagnosis reporting. 51,140 Some population-based biobanks may overcome or mitigate many of these issues if they operate in countries with universal healthcare (BioBank Japan) or publicly funded healthcare (UKB). Medical center-based biobanks may face complications related to patients coming to their centers for specialized treatment; for example, cancer surgery. While EHR data related to the observed surgery and treatment would often be robust, we might expect each patient's medical history to be shorter and less complete than in population-based biobanks, since many patients may return to their local health care provider for post-surgery treatment. Unlike biobanks in an academic medical center, we believe broad health system-based biobanks and population-based biobanks are likely to have more detailed and consistent data on biobank subjects and may have more complete EHR data for subjects with more common and easily-managed diseases.

Sampling from the Biobank
Within pre-existing biobanks, researchers then seek to sample patients for inclusion in a particular study. Such samples may be limited by data availability, where some subjects may not have, for example, genotype information or survey response information. A common study design involves phenotype-specific case-control sampling, where all observed cases for a particular phenotype are selected and some subset of (possibly matched) controls for that phenotype are sampled from the biobank (e.g. Fritsche et al. 2018, Abana et al. 2017). 10,120 An advantage of case-control sampling is that it does not require longitudinal information and instead relies on dichotomized phenotypes, but it is heavily dependent on the "case" and "control" definitions.
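The phenotype-specific case-control sampling described above can be sketched as follows: keep every case and draw a fixed number of controls per case within matching strata (frequency matching). This is one simple matching scheme chosen for illustration, not the scheme used in the cited studies, and the record fields (`case`, `sex`, `age_band`) are hypothetical.

```python
import random
from collections import defaultdict


def frequency_matched_sample(subjects, ratio=2, seed=0):
    """Select all cases plus up to `ratio` controls per case in each stratum.

    `subjects` is a list of dicts with hypothetical keys "case" (bool),
    "sex" and "age_band". Controls are drawn without replacement from
    non-cases sharing the same (sex, age_band) stratum; if a stratum has
    too few controls, all of its available controls are used.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    cases = [s for s in subjects if s["case"]]

    control_pool = defaultdict(list)
    for s in subjects:
        if not s["case"]:
            control_pool[(s["sex"], s["age_band"])].append(s)

    cases_per_stratum = defaultdict(int)
    for s in cases:
        cases_per_stratum[(s["sex"], s["age_band"])] += 1

    controls = []
    for stratum, n_cases in cases_per_stratum.items():
        candidates = control_pool.get(stratum, [])
        k = min(ratio * n_cases, len(candidates))
        controls.extend(rng.sample(candidates, k))
    return cases, controls
```

The result depends directly on how the `case` flag is defined, which reflects the dependence on case and control definitions noted above.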
Another common study design is cohort sampling, where all biobank subjects with available data are included in all analyses (e.g. Au Yeung et al. 2014, Hall et al. 2018). 86,141 Self-controlled designs in which each subject serves as his/her own control are emerging as an appealing design paradigm for some scientific problems (e.g. Kuhnert et al. 2011, Zhou et al. 2018). 142,143 Two variations of self-controlled designs are the self-controlled case series design and the cross-over design. Recently, Schuemie et al. (2016) developed an adapted self-controlled case series design that uses the notion of accumulated exposure to study long-term drug effects. 144 A detailed comparison of the two primary designs can be found in MacClure et al. (2012), 145 and additional exploration of self-controlled case series can be found in Petersen et al. (2016) and Simpson et al. (2013). 146,147 An advantage of self-controlled designs is that they control for confounding due to time-invariant variables. Unlike cohort and case-control designs, however, these designs require longitudinal data to be available for all subjects. Large-scale longitudinal observational databases, such as EHR-linked biobank databases with time-stamped diagnosis, procedure and therapy data, are readily accessible resources for many longitudinal outcomes. 146 However, self-controlled designs require adequate longitudinal data (in terms of number of visits and length of follow-up), which can be either missing or incomplete in some EHR-linked databases.

Impact of Sampling Mechanism on Inference
Madigan et al. (2013) compare effect estimates resulting from self-controlled case series, cohort, and case-control designs in a particular setting and demonstrate that the choice of study design can have a substantial impact on effect estimates. 148 These choices also impact the statistical power and generalizability of the results. Therefore, study design should be considered carefully. In addition to impacting power, the method by which the subjects are chosen may result in biased inference (with respect to the target population), called sampling bias. Haneuse et al. (2016) provide a general framework for exploring and dealing with selection/sampling bias for EHR-based analyses. 149 They focus on characterizing the mechanism by which subjects were included in the dataset by breaking it into smaller observation mechanisms. For example, a subject may be included in a study if 1) the subject is selected for inclusion in the biobank, 2) the subject consents, and 3) the subject is selected from the biobank by study researchers. Different factors may impact different selection mechanisms, and possible sources of selection bias arising from each individual step can be explored in detail in a sensitivity analysis framework.
The impact of the selection procedure on inference may depend on the analysis being performed. For example, case-control sampling from the biobank will result in biased estimates of the marginal probability of having a disease; however, this sampling design may be able to produce valid estimates of the association between disease status and a covariate. 52][153][154] There is a belief in the literature that GWAS/PheWAS study results may be less susceptible to bias resulting from the patient sampling mechanism, but bias due to genotype relationships with the sampling mechanism can still arise in certain settings. 155,156 Additional work may help clarify settings in which bias is and is not expected in GWAS and PheWAS studies. In general, issues of sampling bias are not unique to EHR data, and many authors have explored the impact of sampling on inference. However, additional characterizations of the mechanisms by which we can have sampling bias in biobank and EHR research may help guide study design in the future.
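To make the case-control point above concrete, the following minimal sketch works with expected counts in a hypothetical 2x2 population table. All counts and the 10% control-sampling fraction are illustrative assumptions, not values from any biobank. Keeping every case and an exposure-independent fraction of controls distorts the apparent disease prevalence but leaves the exposure-disease odds ratio unchanged in expectation.

```python
# Hypothetical population counts (illustrative only, not from any biobank).
population = {
    ("case", "exposed"): 300,
    ("case", "unexposed"): 700,
    ("control", "exposed"): 1800,
    ("control", "unexposed"): 7200,
}


def prevalence(table):
    """Fraction of subjects in the table who are cases."""
    cases = table[("case", "exposed")] + table[("case", "unexposed")]
    return cases / sum(table.values())


def odds_ratio(table):
    """Cross-product odds ratio for exposure and case status."""
    return (table[("case", "exposed")] * table[("control", "unexposed")]) / (
        table[("case", "unexposed")] * table[("control", "exposed")]
    )


def case_control_sample(table, control_fraction):
    """Expected counts when all cases and a random fraction of controls are kept."""
    return {
        (status, exposure): count * (control_fraction if status == "control" else 1.0)
        for (status, exposure), count in table.items()
    }


sample = case_control_sample(population, control_fraction=0.10)
print("prevalence:", prevalence(population), "vs sampled", prevalence(sample))
print("odds ratio:", odds_ratio(population), "vs sampled", odds_ratio(sample))
```

The invariance holds here only because controls are sampled independently of exposure; if selection depends on the covariate of interest, the odds ratio can be biased as well.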

Dealing with Confounding
In addition to sampling, measured and unmeasured confounding are common sources of bias in observational data. Careful use of existing analytical tools can help reduce or eliminate biases resulting from confounding. Here, we define a confounder as a variable that impacts both our outcome and our predictor(s). We exclude the situation where the variable is a mediator. Failure to adjust for the confounder may result in biased inference regarding the association between the predictor and the outcome. In a given dataset, sampling and confounding biases can both be present, and careful adjustment for one source of bias does not preclude the possibility of bias from the other source. Haneuse (2016b) details differences between sampling and confounding biases, where sampling bias resulting from the patient selection mechanism impacts external validity of the results, and confounding biases impact internal validity. 157 There are many analytical strategies in the statistical literature for dealing with confounding. A typical method is to adjust for confounders in a statistical model or stratify analyses by the potential confounders (e.g. Hall et al. 2018). 86 Techniques for reducing and eliminating confounding often assume that the potential confounders are measured. In the EHR setting, however, some confounders of interest (e.g. comorbidities) may often be unmeasured, crudely measured, or incomplete. ][160] Biobank data provides several design-based strategies for dealing with confounding as well. In a case-control sampling framework, controls can be matched to cases based on potential confounders such as age and gender, which can make the case and control populations more similar in terms of their age and gender distributions (e.g. Fritsche et al. 2018). 10 Rather than stratifying statistical analysis by a potential confounder, one could directly define the analytical sample within narrow windows of a particular confounder (e.g. subjects ages 60-65). With large biobank datasets, we can often still obtain an analytical sample of a substantial size with narrow inclusion constraints. As mentioned previously, self-controlled studies adjust for time-invariant confounders through the design, and additional statistical methods have been developed to further account for systematic differences between time periods. 161 In terms of methods designed for large-scale agnostic EHR-based studies such as GWAS or PheWAS, Schuemie et al. (2014) and Schuemie et al. (2018) propose a p-value calibration method that may be able to account for both random and systematic (e.g. confounding, sampling biases) sources of error using the distribution of effect estimates for associations believed to be truly null. 162,163

Additional Thoughts on Identifying the Study Sample
An additional concept to consider when defining the study sample is the independence between subjects. Longitudinal outcomes are expected to be correlated within patients, and outcomes may be correlated between patients due to relatedness, nesting within doctor or clinic, belonging to a common social network, or other reasons. The software KING (Kinship-based Inference for GWAS) uses genotype data to determine pairwise kinship between subjects. 164 We might then restrict the study sample to unrelated subjects and apply methods that rely on independence between subjects (e.g. Firth-corrected logistic regression in Fritsche et al. 2018). 10 Statistical modeling approaches such as mixed modeling can also be used to account for residual correlations between individuals. 130
Although not discussed in detail here, finite resources for collecting patient information present another sampling-related challenge. In particular, suppose we want to collect genotype information on some subset of our subjects. Whom do we test? This and related issues are explored in detail in Sun et al. (2017) 165 and Schildcrout et al. (2015) and (2018). 166,167

Defining the Phenome

Phenotypes from Structured Data
Previous PheWAS studies primarily rely on structured data to define the phenotypes. In particular, ICD9 and ICD10 diagnosis codes (International Classification of Diseases, revisions 9 and 10) are the most common source used for defining phenomes. 168 These codes are appealing due to their standard definitions (although perhaps with differential usage in practice) across institutions. These codes are often aggregated to conform to a standardized set of phenotype definitions, called "phecodes." However, there is a large amount of additional information in the EHR that can be used to define phenotypes. Figure 3 provides some examples of the types of structured and unstructured EHR information that can be used to construct phenotypes.
There are many challenges to incorporating additional structured EHR information to define the phenotypes. One challenge involves automation and computation. Suppose we define phenotypes based on structured EHR-based measures of many different types, e.g. binary, count, and continuous. For example, suppose we have many continuous lab values. We may be tempted to model all values using linear regressions, but pre-processing may be required to determine whether the linear regression assumptions are reasonably met. Such evaluation may be difficult to perform manually for a large number of phenotypes, and the development and use of automated algorithms is essential. 169 Another issue involves comparability of the phenotypes between institutions, where lab tests may be performed using different assays or with different rates of variability, and there may be differences in coding and billing norms.
An alternative strategy to phenome generation uses additional expert input (for example, through a consortium) to inform the phenotype definitions. However, establishing a well-accepted definition for a given phenotype requires time, careful thought, and discussion. The eMERGE Phenotype Knowledgebase 170 (PheKB) details existing phenotyping algorithms for individual phenotypes that incorporate additional patient information. Due to the complexity of these phenotyping algorithms, the simpler ICD-based phenotyping method is common for PheWAS studies, but incorporation of these external phenotyping resources may help improve phenotype definitions in the future.

Phenotypes from Unstructured Data
][173][174][175][176][177][178][179] Such methods can also be used to obtain patient measures such as smoking status. 171 Natural language processing methods mine free text such as narrative doctor's notes for words or phrases corresponding to a particular characteristic. The general goal is to develop a model combining structured and unstructured data to classify each patient as having or not having the phenotype of interest in such a way as to maximize predictive performance for the sample as a whole, perhaps measured using negative or positive predictive value. 171,172 Some challenges include dealing with misspellings, tenses, and alternative phrasing, and defining a trained dictionary of words and phrases that may correspond to a particular phenotype. Algorithms are usually trained using expert annotations, but recent methods have attempted to automate this step as well. 177,178 Additional machine learning methods have also been used to define phenotypes (e.g.][182] Generally, there is great potential for incorporating data of different types in order to define phenotypes used in EHR-based research. However, future work is needed to provide automated methods for incorporating data of different types for phenome generation.

Misclassification of ICD Codes
A common strategy when defining disease phenotypes is to list a subject as being a case if he/she has received a certain number of ICD9/ICD10 diagnosis codes (or composites, called phecodes) for a particular disease. This general strategy, however, only captures part of the story. This disease status determination is usually performed across subjects who have different amounts of follow-up time, who have different numbers of visits, and who are being seen in different types of medical clinics. These factors may all be related to the underlying disease status, and a person who would eventually develop the disease or had developed the disease prior to the follow-up window may not be captured. 183 Some statistical tools have been developed to try to deal with this and related issues, but computational restrictions may make these methods difficult to apply to large-scale biobank data (e.g. Bergeron et al. 2018 and Sinnott et al. 2014). 176,184 Additionally, symptoms occurring between visits may not always be reported, and the use of diagnostic guidelines and assessment of the phenotype may vary from doctor to doctor. 185,186 These underlying patient-specific properties are often ignored when classifying subjects as cases and controls for a particular disease, and this can lead to phenotype misclassification. Such misclassification can be viewed in the context of missing data as explored in the next section.
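The code-count rule described above can be sketched as follows. The function name, the record layout, and the default threshold of two codes are hypothetical choices for illustration; the text does not prescribe a particular threshold. Note that the rule looks only at code counts and, as cautioned above, ignores follow-up time, visit frequency, and clinic type.

```python
from collections import Counter


def observed_case_status(diagnosis_records, phecode, min_codes=2):
    """Flag subjects with at least `min_codes` records of `phecode`.

    diagnosis_records: iterable of (subject_id, phecode) pairs, one per coded visit.
    Returns a dict mapping every subject seen in the records to True or False.
    """
    records = list(diagnosis_records)  # allow generators to be scanned twice
    counts = Counter(sid for sid, code in records if code == phecode)
    subjects = {sid for sid, _ in records}
    return {sid: counts[sid] >= min_codes for sid in subjects}


# Hypothetical records: s1 has two codes for phenotype "A", s2 has one, s3 none.
records = [
    ("s1", "A"), ("s1", "A"), ("s1", "B"),
    ("s2", "A"),
    ("s3", "B"),
]
status = observed_case_status(records, phecode="A", min_codes=2)
print(status)
```

Subjects who never appear in the records (for example, biobank participants with no coded visits) are absent from the result, which is itself one way that observation depends on health care utilization.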
ICD-based phenotype misclassification is particularly common for psychiatric disorders, where diagnosis can be particularly challenging. 174,185 For diseases with burdensome treatments such as cancer, we may expect that all subjects receiving a cancer diagnosis truly do have cancer, and there may be only a few cancer cases without a corresponding ICD9 code. In contrast, ICD9 codes for psychiatric disorders such as anxiety may sometimes be assigned to subjects who do not meet the ICD9 definitions for the disorder. There may also be a tendency for patients to receive ICD classifications that result in reimbursement from the insurance provider. Some ICD codes, for example obesity, may not result in reimbursement and may be expected to have different patterns of misclassification. Additionally, disease ICD codes are sometimes assigned when a disease is suspected prior to further diagnostic testing, so it may be unclear whether a given ICD code refers to the final diagnosis. 8,103

Phenotype misclassification can result in bias ("information bias") and negatively impact the statistical power to detect associations with the disease of interest. The extent of misclassification can be described using quantities such as sensitivity, specificity, and negative and positive predictive values (provided a gold standard exists for comparison), but these quantities can vary from population to population and from phenotype to phenotype. 187 Therefore, it is difficult to detect the extent of phenotype misclassification in a particular population without performing further phenotype validation. 188 For example, Liao et al. (2017) estimated misclassification rates for particular phenotypes by sampling subsets of patients for manual chart review to verify the phenotype classification. 126 Recently, Huang et al. (2018) explored a method for accounting for phenotype misclassification in association studies using a likelihood-based method that integrates over unknown sensitivity and specificity parameters, placing less emphasis on previously-reported values for sensitivity and specificity from other populations. 188 Duffy et al. (2004) propose an alternative method for correcting logistic regression effect estimates under misclassification of the outcome. 189,190

Misclassification in Self-Reported Measures
Another source of phenotype misclassification results from reliance on self-reported measures. Spangler et al. (2015) reported a discrepancy between self-reported oral contraceptive use and filled prescription data in a population-based study, with filled prescriptions exceeding self-reported oral contraceptive use by 11-45% for the same time period. 193 Sensitive health issues may be particularly susceptible to being under- or over-reported, and studies involving such measures (e.g. those recruiting via social media like GFG) should carefully consider the potential impacts of under- or over-reporting on their results.
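Returning to the misclassification quantities discussed earlier in this section, the short sketch below illustrates why predictive values do not transfer between populations even when sensitivity and specificity are held fixed. The sensitivity, specificity, and prevalence values are hypothetical and chosen only for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by Bayes' rule for a binary phenotype."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical classifier applied to a high- and a low-prevalence population.
ppv_high, npv_high = predictive_values(0.80, 0.95, prevalence=0.10)
ppv_low, npv_low = predictive_values(0.80, 0.95, prevalence=0.01)
print(ppv_high, ppv_low)
```

With these inputs the positive predictive value drops sharply in the low-prevalence population, which is one reason validation in the population at hand (for example, by chart review) may be preferable to borrowing published values.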

Missing Data
Missing data is a common issue for biobank analyses, and data may be missing for a variety of reasons. A common source of missingness in GWAS/PheWAS studies is missingness in the genotypes. This is often handled by first excluding subjects with missingness rates above a particular threshold (say, 2%) and then imputing missing values for subjects with smaller missingness rates. 86,88 While many of these biobank analyses reported their treatment of missing genotype data, missing information in the phenotypes or demographics is rarely discussed. For example, when a phenotype is constructed using survey data, how is survey non-response handled? Additionally, many studies define their analytical sample based on some subset of biobank participants. However, it is sometimes unclear how these participants were chosen. A more transparent description of how the study sample was derived and the treatment of missing data may shed some light on the generalizability of study results.
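The first step of the genotype-missingness strategy described above (excluding subjects above a missingness threshold) might look like the following sketch. The data layout and function name are assumptions made for illustration, and the subsequent imputation step is deliberately not shown because it depends on the imputation software and reference panel used.

```python
def filter_by_missingness(genotypes, max_missing_rate=0.02):
    """Keep subjects whose fraction of missing genotype calls is at most the threshold.

    genotypes: dict mapping subject_id -> list of calls (0, 1, 2, or None if missing).
    """
    kept = {}
    for subject_id, calls in genotypes.items():
        missing_rate = sum(call is None for call in calls) / len(calls)
        if missing_rate <= max_missing_rate:
            kept[subject_id] = calls
    return kept


# Hypothetical data: 100 markers per subject; "a" misses 1%, "b" misses 5%.
example = {
    "a": [0] * 99 + [None],
    "b": [1] * 95 + [None] * 5,
}
retained = filter_by_missingness(example, max_missing_rate=0.02)
print(sorted(retained))
```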
5][196][197] These methods may also rely on natural language processing to obtain information from unstructured clinical notes. Such approaches can prove extremely valuable to EHR-based research, but implicit assumptions about the missingness mechanisms should be carefully considered.
A common type of "missing" data is the true phenotype state of each subject. We can view the sampling mechanism that gave rise to our study population and the mechanism behind phenotype misclassifications (which we might call the observation mechanism) in a missing data framework. The observed phenome in our sample is a function of the true phenome state (the "missing" data), the mechanism by which subjects are sampled, and the mechanism by which phenotypes are observed in the sample as shown in Figure 4, where arrows represent dependence.
The probability that a particular subject has an observed phenotype will be related to whether the subject truly has the phenotype, but it may also be related to other factors such as the number of visits to the health care provider, the length of follow-up, the types of health services they receive, and other predictors. These other factors may also be correlated with the true disease status of the subject. For example, a healthier subject may "drop out" of the biobank and may instead seek health care from a tertiary care center. Figures S3-S5 present descriptions of the length of follow-up, number of unique observed phecodes, and number of visits by gender and observed cancer status in MGI. These figures demonstrate a relationship between these variables and whether the subject ever received an ICD code for cancer during follow-up. The sampling and observation mechanisms and their relationships to underlying disease status and patient characteristics may impact study inference. Further work should be done to explore the impact of different sampling and phenotyping mechanisms on statistical inference.

Multiple Testing
GWAS/PheWAS studies and many other types of EHR-based research often involve the simultaneous testing of many hypotheses. Failure to account for multiple testing can result in inflated Type I error, and many statistical methods have been developed to control the Type I error in the multiple testing setting. Some commonly-used examples include Bonferroni adjustment, false discovery rate-controlling thresholds (e.g. Li et al. 2018), 101 and Benjamini-Hochberg thresholds (e.g. Liao et al. 2017). 126 However, many of these methods (in particular, the simple Bonferroni adjustment method) have been shown to be overly conservative when the many statistical tests are not independent. This is often the case in large-scale GWAS/PheWAS studies, where associations are explored between many different combinations of related characteristics. In this setting, the goal may be to control for the effective number of independent tests rather than the number of correlated tests being performed. Such an approach may improve statistical power to detect significant associations while still controlling the Type I error rate.
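As a small illustration of two of the adjustments named above, the sketch below implements the Bonferroni threshold and the Benjamini-Hochberg step-up procedure in plain Python. The p-values are made up for illustration, and neither function accounts for correlation between tests, which is the limitation discussed in this section.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypotheses whose p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that the k-th smallest p-value is <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_bonf = sum(bonferroni_reject(p_vals))
n_bh = sum(benjamini_hochberg_reject(p_vals))
print(n_bonf, n_bh)
```

In this example the Benjamini-Hochberg procedure rejects one more hypothesis than Bonferroni, reflecting its less stringent error criterion.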
Several methods have been proposed to estimate the effective number of tests (e.g. Li 2012) or control for correlated tests. Good (2005) describes resampling-based testing via permutation or bootstrap to correct the p-values for multiple testing. 198 Gao et al. (2008) propose the simpleM method to estimate the effective number of tests, which uses a combination of principal components analysis and Bonferroni correction. 199 For a PheWAS study presented in Ge et al. (2017), the effective number of tests is estimated using principal components analysis of a matrix of pairwise correlations between phenotypes. 135 Alternative methods adjust for multiple testing using multivariate normal assumptions for the correlated test statistics (e.g.1][202] In the context of correlated SNPs, some methods correct for multiple testing via analysis of the underlying linkage disequilibrium structure of the genetic data (e.g. Duggal et al. 2008). 203][206]

Heterogeneity between Biobanks
Researchers often attempt to validate statistical findings from their data analysis using an independent dataset, often from a different population. Differences between the population characteristics, however, could impact the generalizability of results between populations and impact our ability to replicate findings. Additional issues can arise when comparing inference across datasets with different sampling mechanisms. In the meta-analysis literature, heterogeneity between studies is broadly grouped into three categories: clinical heterogeneity (differences in patients, interventions, and effects), methodological heterogeneity (differences in study design and sampling), and statistical heterogeneity (when the observed effects are more variable across studies than we would expect from random chance). Statistical heterogeneity may be a result of clinical and/or methodological heterogeneity.
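One common way to quantify statistical heterogeneity as defined above is Cochran's Q statistic together with the I-squared summary. These diagnostics are standard in the meta-analysis literature but are not named in the text, so their use here is an illustrative assumption; the effect estimates and standard errors below are hypothetical.

```python
def heterogeneity(effects, std_errors):
    """Return (pooled fixed-effect estimate, Cochran's Q, I-squared).

    effects: per-study effect estimates (e.g. log odds ratios).
    std_errors: corresponding standard errors.
    """
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of variability beyond what chance alone would produce.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Two hypothetical biobank-specific log odds ratios with equal precision.
pooled, q, i_squared = heterogeneity([0.1, 0.3], [0.1, 0.1])
print(pooled, q, i_squared)
```

With only two studies such summaries are very imprecise, so they are best read as a rough descriptive check rather than a formal test.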

Methodological Heterogeneity
To demonstrate the impact of different sampling mechanisms across biobanks on statistical inference (an example of methodological heterogeneity), we consider phenotype co-occurrence rates and genotype-phenotype associations in MGI and UKB. Suppose we are interested in comparing the odds ratio for having a particular phenotype based on the status of another phenotype (a phenotype co-occurrence) in MGI and UKB. While prevalences will clearly be impacted by the different sampling designs between MGI and UKB (see Figure 2), it is not clear how the resulting phenotype associations will compare between datasets.
In Figure 5, we show the estimated log-odds ratios of having a phecode diagnosis of breast cancer based on other diagnoses in the phenome. See Supplementary Section S2 for more details on the phenotype generation procedure. We note that the estimated odds ratios from the UKB data tended to be larger in magnitude compared to the odds ratios in MGI. One possible explanation for this phenomenon is that in order for subjects to get a phecode in UKB, they must visit a health care provider, during which time they may get multiple codes. When we compare these subjects with UKB subjects who did not visit a health care provider or did not visit as often, we may obtain inflated odds ratios. The subjects in MGI are enriched with phecodes across the board, but subjects with and without a particular phenotype may have many opportunities to collect other diagnoses through their interactions with the health care provider. In this breast cancer example, the odds ratios for other neoplasms and genitourinary diseases did not exhibit the same differences in MGI and UKB as with other classes of diseases. This may be due to enhanced screening of these diseases after diagnosis of breast cancer in both MGI and UKB. Similar exploration for melanoma showed odds ratio inflation in UKB compared to MGI for all disease categories except neoplasms and dermatological conditions. The odds ratio inflation phenomenon seen for breast cancer and melanoma was present for chronic conditions such as hypothyroidism as well as emergent conditions such as concussions. While the size of the odds ratio estimates differed between the two biobanks, we note that, when both associations were significant at a p-value threshold of 0.05, the associations were largely in the same direction.
There are some associations that may not be appreciably impacted by the sampling mechanism. For example, suppose we are interested in studying associations between various SNPs and a phenotype in a GWAS study. It may be reasonable to believe that any given SNP alone is not appreciably related to selection into the sample or variables related to selection into the sample. In this case, we may believe that GWAS results will be reasonably representative of the population. In Figure 6, we compare GWAS results in MGI and UKB for several cancers. In this figure, points represent SNPs identified as being related to the corresponding phenotype in the NHGRI-EBI GWAS catalog. See Supplementary Section S3 for details. While MGI and UKB have very different sampling mechanisms, the GWAS results generally appear similar between MGI and UKB.
However, this may not always be the case. In Figure 7, we compare GWAS results in MGI with results in GFG. We also compare results in both biobanks to genotype-phenotype associations reported in the NHGRI-EBI GWAS catalog. We note that MGI has nearly double the sample size of GFG (N = 30,702 vs. 15,156), and we do not account for this when comparing effect estimates. We do see some differences when comparing GWAS results between MGI and GFG. There are many possible explanations for this. One explanation is that the phenotypes are defined differently in MGI and GFG. In MGI, phenotypes are derived based on ICD codes reported in the EHR during patient follow-up. In GFG, the breast cancer phenotype is derived from survey responses, which ask subjects whether they have ever had breast cancer. The difference in the sampling mechanism, both in terms of obtaining subjects and in terms of self-reporting of phenotypes in GFG, along with the low number of events in GFG, could also explain differences in the GWAS results. A comparison of the MGI GWAS results with the log-odds ratios reported in the NHGRI-EBI GWAS catalog shows a positive relationship, and this relationship is weaker when comparing GFG results with the GWAS catalog estimates.

Clinical Heterogeneity
In addition to differences in the sampling mechanism, differences in patient populations in terms of potential effect modifiers (e.g. age and race) could impact replicability of results across biobanks. For example, suppose we are interested in a particular genotype-phenotype association but that the association varies across genetic ancestries. This is an example of clinical heterogeneity. Such a difference in association could be driven by true biological heterogeneity or by different linkage disequilibrium properties between the populations. When comparing this association overall between two different populations, a failure to adjust for the genetic ancestry composition of the two populations could result in biased inference. Au Yeung et al. (2014) explore the association between ALDH2 and lung function in a southern Chinese population. 141 The authors discuss lack of consistency between their results and results from Western populations, which could be the result of different health attributes of the populations (e.g. different alcohol and smoking behaviors) and could be attributed to different rates of polymorphism between the two populations. An example of a potential effect modifier that differs between MGI and UKB is age, where MGI consists of patients aged 18 and up, while UKB consists of subjects aged 40-69. If the association of interest depends on age, we would have different marginal associations in MGI and UKB.

Statistical Methods for Dealing with Heterogeneity
In the presence of this heterogeneity between study populations, we may explore statistical methods to improve our ability to compare between different populations. 8][209][210] Heterogeneity is often handled in meta-analyses through mixed effects modeling. 2][213] Future work may explore resampling-based methods to make studies more comparable in the presence of heterogeneity with respect to the sampling mechanism.

Section 5: Emerging Uses of Electronic Health Record Data and Combination with External Data
Many of the existing large biobanks in the US are from academic institutions, which may only provide specialty care. Therefore, the EHR from single institutions or health systems may lack the data for some longitudinal analyses. There is a large opportunity to incorporate additional data sources or types to enrich the typical EHR data and enhance the scope of biobank research. For example, by linking cancer and death registry information to the EHR, we may be able to study survival and disease-related outcomes after clinical diagnoses. Local and national surgical registries offer opportunities for more granular health-related outcomes. When registry data is not available, claims data may also provide some insight for survival and disease-related outcomes-based research. 214 Recent work has developed methods for defining the exposome based on clinical narrative information in EHRs or based on additional subject-level measurements. 215,216][219][220][221][222] In addition to geo-coded and registry data, longitudinal data within the EHR and beyond offers many opportunities for research. The rise of mobile fitness tracking devices also provides an opportunity to incorporate longitudinal health metrics or even use text messages or game performance to define phenotypes. 223,224 2015) considers seasonal/calendar effects related to disease. 161,225,2268][229][230] Additional work leverages large-scale medical data to study potential new indications for existing drugs, called drug repurposing or repositioning. 2313][234][235] Machine learning methods have great potential for prediction based on EHR data. 236

When combining data from multiple disparate sources, several problems arise. Most notable are issues regarding patient privacy. Additionally, we must consider issues of data processing, rules for linking records for a single subject, etc. 8][239][240][241] Statistical methods have also been developed for combining data across distributed data sources where data from individual subjects is not accessible, called distributed regression analysis. These methods involve sharing sufficient statistics of the data (functions of the individual-level data) from which the individual-level data are not recoverable. 242,243 Yang et al. (2013) developed methods for performing meta-analysis based on sufficient statistics from existing GWAS, and similar methods should be developed for PheWAS studies in the future. 244

Large biobank datasets also provide an opportunity to study different treatment pathways observed for different patients and their corresponding outcomes. 245 Additional components such as treatment nonresponse and treatment adherence can also be explored. 173,246 While studies of treatment response and adherence are certainly not new, the wealth of information provided through EHRs provides opportunities to study treatment-related outcomes at scale. Additionally, these data sources provide a clearer look at treatment-related outcomes in practice, which may not always align with treatment-related outcomes under the more ideal settings of a clinical trial. Similarly, these data can be used to analyze and/or predict various outcomes to treatments, medications, and/or dosages for different diseases (sometimes stratified by patient characteristics, e.g. race). For example, Delaney et al. (2012) demonstrated clopidogrel resistance for genetic variants in ABCB1 and CYP2C19 using EHR-linked data from cardiac patients. 52,247 Similar analyses can be used for drug repurposing as well.
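The idea behind distributed regression analysis mentioned above can be illustrated with the simplest possible case: a single-predictor linear regression fitted from per-site sums rather than pooled individual-level records. This is a toy sketch under assumed names and made-up data, not the method of any cited paper; realistic applications involve more covariates, generalized linear models, and formal privacy safeguards.

```python
def site_summary(x, y):
    """Sufficient statistics for simple linear regression, computed locally at one site."""
    return {
        "n": len(x),
        "sx": sum(x),
        "sy": sum(y),
        "sxx": sum(v * v for v in x),
        "sxy": sum(a * b for a, b in zip(x, y)),
    }


def pooled_fit(summaries):
    """Least-squares intercept and slope recovered from the combined site summaries."""
    total = {key: sum(s[key] for s in summaries) for key in ("n", "sx", "sy", "sxx", "sxy")}
    slope = (total["n"] * total["sxy"] - total["sx"] * total["sy"]) / (
        total["n"] * total["sxx"] - total["sx"] ** 2
    )
    intercept = (total["sy"] - slope * total["sx"]) / total["n"]
    return intercept, slope


# Two hypothetical sites whose data follow y = 1 + 2x exactly.
site_a = site_summary([0, 1, 2], [1, 3, 5])
site_b = site_summary([3, 4], [7, 9])
intercept, slope = pooled_fit([site_a, site_b])
print(intercept, slope)
```

Because the least-squares solution depends on the data only through these sums, the combined fit matches what a pooled analysis would give. With very small sites, however, such summaries could still reveal individual values, so the non-recoverability property described above should not be taken for granted.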
Randomized controlled trials are often considered a gold standard for statistical inference. Researchers have explored approaches for obtaining results more similar to a randomized trial using observational data and, in particular, EHR data. 9][250][251] An exploration of the use of observational data instead of clinical trials for inference can be found in Franklin et al. (2017). 252

Section 6: Conclusion
Biobanks linked to electronic health records (EHR) provide a rich data resource for health-related research, and scientific interest in biobank-based research has grown dramatically in recent years. As more researchers become interested in using biobank data to explore a diverse spectrum of scientific questions, resources guiding the data access, design, and analysis of biobank-based studies will be crucial. This work serves to complement and extend recent publications about biobank-based research (e.g.][8][9] In this paper, we provide a detailed characterization of many of the major EHR-linked biobanks in an effort to facilitate researchers' ability to obtain and investigate research-quality biobank data with some understanding of the associated population, sampling mechanism, and data linkages. We also survey biobank-based papers that have been published. Papers using biobank data have focused on illnesses and conditions that cause a large portion of morbidity and mortality including cancer, cardiovascular disease, and obesity/diabetes. Future research can utilize increasingly large EHR-linked biobank cohorts to study a broader range of diseases. Biobank data also present an exciting opportunity to explore treatment and therapy schedules, drug repurposing, or gene-by-treatment interactions in the future. Such explorations can also be used to inform dynamic, patient-centric predictions for monitoring and treating future patients.
When using biobank data for health-related research, it is important that researchers understand the statistical and practical issues that accompany such analyses and have resources to address them. We describe many of the statistical challenges involved in biobank research and some current statistical methods. However, there is a great need for further statistical developments to address the many varied issues that go hand in hand with EHR-based research. One large challenge involves defining the phenome. Many methods have been developed to incorporate unstructured EHR data through natural language processing methods or image analytics, and some researchers have considered other issues of misclassification related to ICD9/10-based phenotype classification. Future work can expand on these methods and explore ways to incorporate a broader spectrum of EHR information into phenotype classification.
Missing data is another broad issue with EHR data. Data can be missing for a variety of reasons, and the mechanism generating the missingness can have large implications for inference. Statistical methods tailored to handling issues of missing data in EHR could prove extremely useful. Additional work regarding sampling mechanisms (e.g., into the biobank, into the study, consenting) is needed to clarify in which settings these sampling mechanisms will impact inference.
With an increase in the volume and variety of data becoming available, additional emphasis should be placed on methods for incorporating data from external sources and emerging data streams (for example, geo-coded data, longitudinal biomonitoring data, mobile data, registry data, genomics/metabolomics data, imaging data, ecologic data, etc.). Such analyses can widen the scope of scientific questions we can address, and they necessitate a new wave of related statistical methods.

GFG and GWAS Catalog
Potential Explanation for Differences: There were 2,025 female breast cancers in MGI out of 16,297 women (12.4%). However, there were only 115 female breast cancers in GFG out of 10,802 women (1.1%). This is explained by the different age distributions, since GFG subjects were younger on average, with a mean age of 36.9 years in GFG and 54.2 years in MGI. Therefore, many GFG subjects are not in the age window of susceptibility for breast cancer, which is a disease more common after age 50. Differences in the log-OR estimates can also be partly explained by differences in sample sizes, leading to GFG estimates that are much more variable than estimates in MGI or those reported in the GWAS Catalog. Additionally, differences in log-OR estimates may result from different phenotype definitions.
* Each point represents a SNP identified as being related to the corresponding phenotype in the NHGRI-EBI GWAS catalog and the corresponding estimated log-OR SNP-phenotype associations in MGI, GFG, or reported in the GWAS catalog. The two lines correspond to equality of the estimates and a fitted line to the points (excluding any outlying points with absolute log-OR greater than 0.6). "Spearman" indicates the Spearman correlation and "CCC" indicates Lin's concordance correlation coefficient, which is a measure of agreement (with 1 being perfect agreement).
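To make the distinction between the two summary measures in this footnote concrete, the following minimal sketch computes Spearman correlation and Lin's concordance correlation coefficient for two sets of log-OR estimates. The input values are hypothetical and are not taken from the figures; the function names are illustrative, and the rank computation assumes no tied values.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (1/n variance convention)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    sxy = np.mean((x - mx) * (y - my))
    return 2.0 * sxy / (x.var() + y.var() + (mx - my) ** 2)

def spearman_no_ties(x, y):
    """Spearman correlation as the Pearson correlation of ranks (no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical log-OR estimates: the second set is the first shifted by 0.1.
catalog_log_or = np.array([0.10, 0.20, 0.30, 0.40])
biobank_log_or = catalog_log_or + 0.10

rho = spearman_no_ties(catalog_log_or, biobank_log_or)
ccc = lins_ccc(catalog_log_or, biobank_log_or)
print(f"Spearman = {rho:.3f}, CCC = {ccc:.3f}")
```

Because the second set preserves the ordering of the first, the Spearman correlation is 1, while the systematic shift away from the line of equality pulls the CCC below 1. This is why the footnote describes CCC as a measure of agreement rather than of association.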

A central challenge for research involving EHRs is in defining phenotypes. The data available fall into two broad categories: structured and unstructured. Some examples of structured data are billing and procedure codes, numeric lab and test results, and prescription information (both what has been prescribed and what has been filled). Some examples of unstructured data are the narrative notes made by physicians/nurses and radiological/pathological notes and images. For a detailed review of phenotyping procedures, see Bush et al. (2016). 8 Johnson et al. (2010), Zhang et al. (2012) and Li et al. (2012) Noren et al. (2010) and Noren et al. (2013) use longitudinal health record data to discover temporal patterns, and Boland et al. (

Figures

Figure 1: Overall Distribution of Selected Biobank-Based Publications by Year and Type

Figure 2: Boxplots of Ratio of PheWAS Code Prevalence in MGI vs. UK Biobank Across Phenome

Figure 3: Potential Data Sources for Generating the Phenome

Figure 4: Relationship between True and Observed Phenome

Figure 5: Log-Odds Ratios of having Breast Cancer Diagnosis by Other Phenotype Diagnoses*

Figure 6: A Comparison of GWAS Results in MGI and UK Biobank (UKB) for Selected Cancer Phenotypes*

Figure 7: A Comparison of Breast Cancer GWAS Results in MGI with GFG*