The aforementioned issues, though not exhaustive, provide a good indication of the scale and diversity of the innovation challenge. However, in our opinion, it is the scientific issues that deserve immediate analysis and introspection. To do this, we can borrow from one of the numerous innovation frameworks that have been successfully applied to solve complex social, engineering and technology challenges. These include design thinking, computational thinking, analytical thinking, lateral thinking and first principles thinking. Arguably, the most successful is First Principles Thinking (FPT), an approach first practiced by Aristotle, who defined it as the search for “the fundamental basis from which a thing is known” [
32]. In practice, innovators are encouraged to critically question every assumption of the challenge and break it down into basic components and to search for solutions from “first principles”. From this, new learnings and opportunities are defined. This approach is distinct from reasoning by analogy (e.g. using competitive analysis), which builds knowledge and solves problems based on prior assumptions, beliefs and widely held ‘best practices’ approved by the majority of people.
Applying FPT to the drug discovery process, we have seen that much of preclinical research is built upon the inherent practicalities of model systems, as opposed to the functional realities of human biology. At the data level, such an approach has quantitative advantages, but not qualitative ones. Therefore, if we fail to expeditiously rebalance investment towards human-specific approaches, we risk applying new technologies to generate even more incongruent data. As a corollary, we can posit that current standards of preclinical research are frontloading the drug discovery process with data that lack the human specificity and technical reproducibility needed to routinely develop successful new drugs. This may ultimately be what is responsible for long development times and high attrition rates. Thus, pivoting discovery to a largely human-focused approach using human-specific data and in silico disease models (i.e. Human Data Driven Discovery: HD3) has the potential to provide radical improvements in drug research efficiency.
2.3. Human data as a driver for systems-based discovery
Today, the molecular characterization of human populations plays a critical role in target-driven discovery. In oncology, for example, the application of omics technologies has helped characterize the key molecular differences between treatment responders and non-responders, thereby enabling the development of efficacious therapies targeting specific driver mutations, such as imatinib [
33] and crizotinib [
34]. However, one of the primary advantages of model systems is that they also allow us to prospectively perturb the function of genes in the study of disease-specific targets and to perform
in vivo physiological screening with small molecule chemistries. While it is generally understood that such studies are unethical in patients, the reality is that doctors have for decades been performing physiological phenotypic studies on their patients with prescribed drugs. Once administered, the drug and/or its metabolites interact with targets in different cellular systems. The phenotypic readout is a consequence of the combined molecular interactions across the entire patient system (see
Figure 1B). From this perspective, we can view a side effect as a drug-induced disease phenotype, whose molecular etiology tells us something not only about the drug’s mode-of-action, but also about the targets/pathways associated with the observed phenotype. Thus, if we can accurately define a drug’s interaction partners/biotype (targets, off-targets, metabolizing enzymes, transporters) across all cellular systems, we can learn more about the human-specific molecular networks involved in human phenotypes. The advantage of this approach, in comparison to model systems, is that the clinical observations and connected molecular knowledge are directly defined by the human condition.
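The reasoning above can be sketched in a few lines of Python. This is only an illustrative sketch: the drug names, interaction partners and phenotypes are hypothetical placeholders, not real pharmacology, and the function name `partners_for_phenotype` is our own. For a given phenotype, the sketch counts how often each interaction partner appears in the biotypes of the drugs reported with that phenotype.

```python
from collections import Counter

# Hypothetical drug-to-biotype map (targets, off-targets, enzymes, transporters).
BIOTYPE = {
    "drug_A": {"TARGET_1", "ENZYME_X"},
    "drug_B": {"TARGET_1", "TARGET_2"},
    "drug_C": {"TARGET_3"},
}

# Hypothetical phenotypes (e.g. adverse events) observed per drug.
PHENOTYPES = {
    "drug_A": {"rash", "nausea"},
    "drug_B": {"rash"},
    "drug_C": {"nausea"},
}


def partners_for_phenotype(phenotype):
    """Count interaction partners across all drugs reported with a phenotype."""
    counts = Counter()
    for drug, observed in PHENOTYPES.items():
        if phenotype in observed:
            counts.update(BIOTYPE.get(drug, ()))
    return counts


rash_partners = partners_for_phenotype("rash")
```

In this toy data, `TARGET_1` is shared by both drugs reported with "rash", which makes it the first candidate to examine. A real analysis would additionally need to correct for reporting frequency and confounding, which this sketch does not attempt.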
An important facilitating factor in the pivot to a human-focused discovery paradigm is therefore the utility of treatment and outcome information from vast tomes of existing real-world data (RWD) and clinical trial results. RWD exists as a spectrum of different qualities, typically defined by both the context of assessment from which the data was derived, and the degree to which the data was generated to answer specific research questions (see
Figure 2). Data pertaining to treatments and clinical outcomes, whether positive (e.g. drug-induced disease remission) or negative (e.g. disease recurrence or adverse reactions), are of primary importance. Such data are widely available in research databases, large-scale clinical registries, EMR-linked sources, administrative/claims sources, facilitated networks and regulatory databases. Spontaneous Reporting System databases are particularly interesting: although redacted in terms of clinical narratives, they offer a highly valuable window into observed drug-induced adverse event (AE) phenotypes for millions of patients. Major sources include the FDA’s Adverse Event Reporting System (FAERS) [
35] and Sentinel Initiative [
36,
37], together with the European Medicines Agency EudraVigilance system [
38] and VigiBase, the global database of individual case safety reports (ICSRs) maintained by the World Health Organization’s (WHO) Uppsala Monitoring Center (UMC) [39]. The data contained in these databases are analogous to chemical phenotype screening data from model systems, only this time specific to humans. By providing insights into the phenotypic effects of drug-induced perturbation on targets within the patient system, they allow us to capitalize on publicly available treatment and outcome data for tens of millions of patients.
Figure 2.
Overview of key sources of clinical outcomes data. Beyond Randomized Controlled Trial (RCT) results, there are four types of outcome reports, otherwise known as “contexts of assessment”, associated with RWD: 1) clinician-reported outcomes (e.g. from Electronic Medical Records (EMRs)), 2) patient-reported outcomes, 3) observer-reported outcomes and 4) performance outcomes. While clinician-reported outcomes are the most reliable type of outcomes data, there is an ever-growing realization of the value of patient-reported outcomes. Nevertheless, for the purposes of the HD3 approach it is treatment outcomes data that provide the most valuable datapoints, with Spontaneous Reporting Systems such as FAERS and VigiBase providing a highly accessible source of this information for tens of millions of patients. We can also extend the concept of “outcome” to include the phenotypic consequences of genetic aberrations, with such data being available in disease-agnostic databases such as the Human Gene Mutation Database (HGMD) and OMIM, or disease-specific databases such as The Cancer Genome Atlas project (TCGA).
Treatment outcomes data can also be integrated with data emerging from the multi-omics characterization of human populations. A vast wealth of data resources and public genomics data initiatives have become available over the years, with general, organ-specific and disease-specific data now globally accessible (for a list of 86 different globally available resources, see
supplementary Table 1). Here, disease-associated genetic data from resources such as the Online Mendelian Inheritance in Man (OMIM) database [40] or the phenotype-associated NHGRI-EBI Genome Wide Association Study (GWAS) catalog [41] are of particular interest. Updated daily, the OMIM database catalogues information for more than 15,000 genes with a core emphasis on the molecular relationship between genetic and phenotypic diversity, especially around human disorders. The data also lends itself to organ-specific analyses (and modelling), with Parsa
et al., for example, compiling a list of the 258 OMIM genes responsible for kidney-related diseases, including renal hypoplasia, dysplasia or agenesis, end-stage renal disease and proteinuria [42]. By aligning this data with GWAS data from the CKDGen Consortium, they were further able to characterize the potential association between genetic polymorphisms and kidney function within the general population [43]. Although not reported by the authors, such data can also be further contextualized with kidney-specific disease pathway information, such as the Kidney and Urinary Pathway Knowledge Base (KUPKB) (
http://www.kupkb.org/) [
44] or the Chronic Kidney Disease database (CKDdb) (
http://www.padb.org/ckddb) [
45], or with data from drugs used to treat kidney diseases and outcomes data related to kidney-specific side effects from other drug treatments.
The power of such strategies was also demonstrated by results from the TCGA project, where molecular data and phenotypic information have been analyzed to decipher novel targets and prognostic classifiers. Analysis of the TCGA endometrial carcinoma dataset, for example, has brought important new insights into the molecular nature of this disease, including the discovery of a new classification system based on four prognostically significant subgroups [
46]. Indeed, it can be argued that most recent clinical advancements emerge from analysis of patient-derived molecular data. However, vast tomes of clinical information available throughout the healthcare system remain underutilized for discovery purposes.
Two other types of data resource are fundamental to systems-based discovery: a) network models of disease systems and signaling pathways, defined initially at the level of proteins (nodes) and their interactions (edges), and b) accurate and comprehensive drug-to-target knowledge. The network view provides a basic proteo-anatomical structure of the system, upon which additional molecular data sources can be added. The scale-free and redundant characteristics of these networks often permit perturbation without a complete loss of function, implying that multiple perturbations, at nodes and/or edges, are typically involved in the emergence of disease phenotypes. Recent computational work by Zhong
et al., for example, examined the system-level mutational features of heritable diseases and found that they were more likely to be caused by mutations at edges, as opposed to nodes [47]. Interestingly, edge-based perturbations, typically involving in-frame mutations of (near) full-length proteins, were more commonly observed across multiple diseases. Such mutations tend to abrogate interaction with one or more neighboring nodes. Moreover, different disease phenotypes may be caused by different mutations in a single edge. Nodal mutations, on the other hand, typically involved truncated proteins and did not necessarily affect the interaction with other protein nodes in a signaling network. Drug-to-target knowledge is also a key prerequisite in systems-based discovery endeavors: in addition to the aforementioned DrugBank [
48] and WOMBAT [
49] databases, the Therapeutic Targets Database (TTD) [
50], PubChem [51] and ChEBI [52] all provide critical knowledge for systems-based analyses of drug targets.
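The redundancy argument made above for network models (that a single perturbation often leaves function intact, whereas multiple perturbations do not) can be illustrated with a toy example. The sketch below is hypothetical: the four-protein network, its node names and the `reachable` helper are illustrative assumptions, and "function" is reduced to the simple question of whether a signal can still travel from a receptor `R` to an effector `E`.

```python
from collections import deque

# Hypothetical signaling network: receptor R reaches effector E via two routes.
EDGES = {("R", "K1"), ("R", "K2"), ("K1", "E"), ("K2", "E")}


def reachable(edges, source, target):
    """Breadth-first search over undirected interaction edges."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# One edge-level perturbation versus two.
single_hit = EDGES - {("K1", "E")}
double_hit = EDGES - {("K1", "E"), ("K2", "E")}

intact = reachable(EDGES, "R", "E")
after_one = reachable(single_hit, "R", "E")
after_two = reachable(double_hit, "R", "E")
```

Here the signal still reaches `E` after one interaction is lost, because the route via `K2` remains, but not after both are lost. Real disease networks are of course far larger and weighted by context, so this only conveys the principle.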
This brings us to the next level of challenge. How do we optimally structure this drug and network knowledge to facilitate drug discovery? Ontologies will certainly play a critical role in making data not just machine readable, but also machine actionable. A key element here will be ontology interoperability and robust ontology applications to help make data Findable, Accessible, Interoperable and Reusable (i.e., FAIR compliant) [
53]. Beyond the ontological challenges associated with knowledge modelling, the biological accuracy of systems-based models is critical, especially if we are aiming for whole patient models. Today, most pathway models represent an amalgamation of biochemical findings across a multitude of different cellular systems, under various physiological conditions. This leads to generic model representations that are likely inaccurate at the cell-type specific level. We must therefore aim for cell-type and tissue-specific representations of core biological mechanisms, which can then be mapped at the whole-patient level. Several existing resources can aid this endeavor, including important work by Jiang
et al., who reported a quantitative proteome map of the human body, with expression data for 12,000 genes across 32 normal human tissues [
54]. Other examples include The Human Protein Atlas (HPA), which presents the spatial distribution of proteins in 44 different human tissues and 20 cancer types [
55]. Organ-specific sources are also widely available, such as The Brain Protein Atlas [
56] and the Human Kidney and Urine Proteome Project (HKUPP) (
http://www.hkupp.org/) [
57]. The resultant networks facilitate the mapping of drugs to phenotypes across all levels of the patient system and provide a powerful basis for hypothesis generation and AI-driven discovery. They also allow us to add further data types (e.g. genomic, transcriptomic, drug-binding constants and target activity data) that may later facilitate more functional analyses using, for example, simulation algorithms. Such an approach would largely meet the requirements for developing an effective
in silico model system, since the human disease/systems and the
in silico models should be substantially congruent with respect to structure and composition.
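As a minimal sketch of how a generic pathway model might be narrowed to a tissue-specific representation, the example below keeps only those interactions whose two partners are both expressed in a given tissue. The pathway, protein names and expression calls are hypothetical placeholders rather than data from any of the resources cited above, and requiring both partners to be expressed is a simplifying assumption.

```python
# Hypothetical generic pathway, stored as protein-protein interaction edges.
GENERIC_PATHWAY = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]

# Hypothetical per-tissue sets of expressed proteins
# (in practice derived from proteome or expression atlases).
EXPRESSED = {
    "kidney": {"P1", "P2", "P3"},
    "brain": {"P2", "P3", "P4"},
}


def tissue_network(edges, expressed):
    """Keep only interactions whose two partners are both expressed."""
    return [(a, b) for a, b in edges if a in expressed and b in expressed]


kidney_net = tissue_network(GENERIC_PATHWAY, EXPRESSED["kidney"])
brain_net = tissue_network(GENERIC_PATHWAY, EXPRESSED["brain"])
```

The same generic pathway yields two different networks, sharing only the `P2`-`P3` interaction. A fuller treatment would use quantitative abundance rather than a binary expressed/not-expressed call.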
Finally, from a technical perspective, outcomes data such as spontaneous AE reports are typically stored in relational databases (RDBs), in multiple tables with information pertaining to the case demography, drugs (medications) given, reported AEs etc. Such a data structure allows easy integration of new reports, distribution and sharing of data and relatively straightforward retrieval of specific/individual information. However, the need to join a large number of tables to combine information for each outcome, AE, medication and co-medication, indication and demography can result in rather complex queries and long computation times. Moreover, a thorough understanding of the underlying data structure is required to avoid potential pitfalls, such as the multi-axiality of data, which can lead to erroneous results that are not always immediately apparent. These characteristics can make advanced analyses of data in an RDB structure cumbersome and impractical. One practical solution is to convert the RDB into a graph database for the purpose of data analysis. Graph databases (GDBs) store information in the form of nodes (entities) and their properties, connected by edges (relationships), instead of tables with rows and columns [
58]. As each node is directly connected to all other relevant nodes, queries to retrieve more ‘distant’ information are much simpler and faster, as there is no need to join multiple tables. More importantly, the GDB structure enables the use of efficient algorithms, from simple random walks to graph convolutional networks, to facilitate the discovery of hidden relationships between entities and the preparation/transformation of data (e.g. feature engineering) for machine learning/AI approaches. Several reports on the use of GDBs in the analysis of safety signals or the prediction of drug safety show that such approaches have the potential to outperform current approaches [
59,
60,
61,
62,
63]. Future development of knowledge graphs integrating full outcomes and spontaneous AE report databases, together with other information, will not only improve the performance of drug toxicity predictions but also help to uncover hitherto unknown interactions between drugs and co-morbidities. From this perspective, the approach may prove quite effective in target deconvolution and drug (re-)positioning studies.
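The contrast between table joins and graph traversal described above can be sketched as follows. This is an illustration under stated assumptions only: the three tables are a hypothetical, heavily simplified stand-in for a spontaneous report database (not the actual FAERS or VigiBase schema), and the "graph" is a plain in-memory Python dictionary rather than a real graph database. The sketch answers the same question, which events were reported in cases involving a given drug, once with a SQL join and once by walking from the drug node to its cases and on to their events.

```python
import sqlite3
from collections import defaultdict

# Hypothetical, simplified report tables (not a real pharmacovigilance schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cases (case_id INTEGER PRIMARY KEY, age INTEGER);
CREATE TABLE drugs (case_id INTEGER, drug TEXT);
CREATE TABLE events (case_id INTEGER, event TEXT);
INSERT INTO cases VALUES (1, 54), (2, 61);
INSERT INTO drugs VALUES (1, 'drug_A'), (1, 'drug_B'), (2, 'drug_A');
INSERT INTO events VALUES (1, 'rash'), (2, 'nausea');
""")

# Relational route: linking a drug to its events requires a join via the case.
rows = conn.execute(
    "SELECT DISTINCT e.event FROM drugs d "
    "JOIN events e ON e.case_id = d.case_id "
    "WHERE d.drug = ? ORDER BY e.event",
    ("drug_A",),
).fetchall()
events_via_join = [row[0] for row in rows]

# Graph route: load every row as an undirected edge between typed nodes.
graph = defaultdict(set)


def add_edge(a, b):
    graph[a].add(b)
    graph[b].add(a)


for case_id, drug in conn.execute("SELECT case_id, drug FROM drugs"):
    add_edge(("case", case_id), ("drug", drug))
for case_id, event in conn.execute("SELECT case_id, event FROM events"):
    add_edge(("case", case_id), ("event", event))


def events_for_drug(drug):
    """Two-hop traversal: drug -> case -> event."""
    found = set()
    for case in graph[("drug", drug)]:
        found.update(name for kind, name in graph[case] if kind == "event")
    return sorted(found)


events_via_graph = events_for_drug("drug_A")
```

Both routes return the same events for `drug_A`. The sketch makes no claim about relative speed; its point is that once the data are held as nodes and edges, longer paths (for example drug, case, co-medication, event) become further hops of the same traversal rather than additional joins.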