Expanding interactome analyses beyond model eukaryotes

Interactome analyses have traditionally been applied to yeast, human and other model organisms due to the availability of protein-protein interaction data for these species. Recently, these techniques have been applied to more diverse species using computational interaction prediction from genome sequence and other data types. This review describes the various types of computational interactome networks that can be created and how they have been used in diverse eukaryotic species, highlighting some of the key interactome studies in non-model organisms.

One of the fundamental goals of these analyses is verification of the cellular interactome [16,17]. The interactome is vital to understanding cellular biology [18] since many biological functions can only be understood as part of the spatio-temporal interactions of the cell [19,20]. Defining the interactome is not straightforward since the cell contains other molecules that interact with proteins [19,21,22]. The associations between genes/proteins can be functional rather than a direct physical interaction, for instance shared complex membership [23][24][25]. Here the interactome is defined loosely as "the entire complement of functional molecular associations that may occur in a cell" in order to encompass the range of networks discussed.

Genome sequence-based interactions
Several aspects of genomic sequence can be used to predict PPI [108]. Gene context (GC) can indicate interaction since interacting proteins have increased conservation of their gene order in comparison to non-interacting proteins [109][110][111]; this prediction method is most accurate in bacterial genomes due to their operons [112], although conservation of gene order has also been observed in mammals [113]. Gene fusion (GF) events can reveal protein interaction since proteins that are fused as a single entity in one species are likely to have a functional link in other species in which they are encoded separately [114][115][116]. The distribution of gene sequences across species, termed their phylogenetic profile, is also conserved for many interacting pairs [38,[117][118][119][120][121][122] since they evolve at the same rate [39]. Two genes that have similar profiles are likely to have co-evolved and their products may interact physically or have shared function [123][124][125][126][127].
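The phylogenetic profile comparison described above can be illustrated with a minimal sketch. It assumes each profile is a simple presence/absence set of species and uses Jaccard similarity with an arbitrary cutoff of 0.8; the gene names, species labels, similarity measure and cutoff are all hypothetical choices made for illustration, and published methods may use other measures.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sets of species)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


def candidate_pairs(profiles, min_similarity=0.8):
    """Return gene pairs whose profiles are at least min_similarity alike."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            score = jaccard(profiles[g1], profiles[g2])
            if score >= min_similarity:
                pairs.append((g1, g2, score))
    return pairs


# Hypothetical toy data: gene -> species in which a homolog is detected.
profiles = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3", "sp5"},
}

print(candidate_pairs(profiles))
```

In this toy case only geneA and geneB share a profile, so they are the only pair flagged as a candidate functional association.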

Integrated interactomes
The number of interactions common to different experimental datasets can be surprisingly low [33,[128][129][130][131] because each individual experiment only measures certain aspects of the cell's behaviour and the resulting datasets are incomplete [132,133]. A more complete view of the interactome can be produced by incorporating multiple sources of interaction evidence [134] (Figure 2 A). This approach reduces the impact of experimental noise in HTP datasets [23,[135][136][137] and reveals global properties not evident in a single data type [19]. Integrated networks have advanced our understanding of several areas including cellular biology [23,138], disease processes [14] and evolution [11,13,139].
Early interactome studies only linked proteins which had physical interactions (either binary or complex) [4,5]. In functional integrated networks, pairs of proteins/genes are linked if they have any type of association; links may represent the gene/proteins' involvement in the same cellular processes without direct interaction [20], for instance via genetic interactions [56], co-localisation [140,141], or co-citation [142][143][144][145]. The greater density of links provided by functional data provides a more informative basis for network analysis than physical interactions alone. Functional networks have been used to analyse data from yeast [146,147] and human [148], and to compare patterns of interaction across multiple species [149].

Probabilistic networks and machine learning
At the simplest level datasets can be combined naïvely into a network in which nodes represent genes or gene products, and edges represent any type of functional interaction between them [150][151][152][153] (Figure 2 A). Such networks are useful for the basic visualisation of integrated results but no attention is paid to the amount of evidence for each interaction.
A more useful network can use interaction weights to represent the number of lines of evidence for each interaction (Figure 2 B). This weighting provides a measure of confidence since interactions with several sources of evidence are more likely to be true interactions [44,112,136,151,[153][154][155][156][157][158][159]. Taking these evidence levels into account allows thresholding of networks to produce high quality sub-networks that are supported by multiple lines of evidence (Figure 2 D) [153,[160][161][162].
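As a rough illustration of evidence-count weighting and thresholding, the sketch below counts how many datasets support each undirected interaction and keeps those with at least two lines of evidence. The dataset names, protein identifiers and the cutoff of two are hypothetical.

```python
from collections import defaultdict


def integrate(datasets):
    """Count how many datasets support each undirected interaction."""
    support = defaultdict(set)
    for name, pairs in datasets.items():
        for a, b in pairs:
            if a != b:  # ignore self-interactions in this sketch
                support[frozenset((a, b))].add(name)
    return {edge: len(sources) for edge, sources in support.items()}


def threshold(weights, min_evidence):
    """Keep interactions supported by at least min_evidence datasets."""
    return {edge: w for edge, w in weights.items() if w >= min_evidence}


# Hypothetical toy datasets; pair order does not matter.
datasets = {
    "y2h": [("P1", "P2"), ("P2", "P3")],
    "coexpression": [("P2", "P1"), ("P3", "P4")],
    "cocitation": [("P1", "P2"), ("P3", "P2")],
}

weights = integrate(datasets)
high = threshold(weights, min_evidence=2)
print(sorted(tuple(sorted(edge)) for edge in high))
```

Storing each interaction as a `frozenset` makes (P1, P2) and (P2, P1) the same edge, so evidence from different datasets accumulates regardless of the order in which a pair was reported.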
However, the quality of different datasets, in terms of coverage of the genome and accuracy, depends upon the experimental technique used. Probabilistic functional integrated networks (PFINs) take data quality into account by assessing datasets' quality prior to integration (Figure 2 C) [23,25,147]. The confidence scores are produced by statistical comparison with a Gold Standard dataset [163][164][165]: a high-confidence set of interactions believed to be biologically correct [166,167]. This benchmarking reduces noise from HTP datasets, produces consistent integration of interactions from different studies, and allows the use of thresholding (Figure 2 E) and statistical algorithms that take these probabilities into account [168]. Probabilistic networks have been created for yeast and a number of other species using a variety of methods and Gold Standards. These networks can then be used to detect protein complexes [32,[169][170][171], annotate proteins [37,159,172] and predict missing interactions [173,174].
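The text does not specify a scoring scheme, so the following is only a sketch of one commonly used option, a log-likelihood score. Each dataset is scored by the odds of finding Gold Standard positive versus negative pairs among its interactions, relative to the prior odds in the Gold Standard. The negative set, the pseudocount, the summation of dataset scores and the cutoff of 1.0 are assumptions made for this example.

```python
import math


def edge(a, b):
    """Undirected interaction as an order-independent key."""
    return frozenset((a, b))


def log_likelihood_score(dataset, gold_pos, gold_neg, pseudocount=1.0):
    """ln of (odds of gold positives in the dataset / prior odds)."""
    tp = len(dataset & gold_pos)
    fp = len(dataset & gold_neg)
    dataset_odds = (tp + pseudocount) / (fp + pseudocount)
    prior_odds = len(gold_pos) / len(gold_neg)
    return math.log(dataset_odds / prior_odds)


def integrate(datasets, dataset_scores):
    """Sum the scores of every dataset reporting an edge (a simplification)."""
    confidence = {}
    for name, edges in datasets.items():
        for e in edges:
            confidence[e] = confidence.get(e, 0.0) + dataset_scores[name]
    return confidence


# Hypothetical Gold Standard and datasets.
gold_pos = {edge("A", "B"), edge("B", "C"), edge("C", "D")}
gold_neg = {edge("A", "D"), edge("B", "D"), edge("A", "C")}
datasets = {
    "reliable": {edge("A", "B"), edge("B", "C"), edge("C", "D"), edge("E", "F")},
    "noisy": {edge("A", "B"), edge("A", "D"), edge("B", "D"), edge("E", "F")},
}

dataset_scores = {
    name: log_likelihood_score(edges, gold_pos, gold_neg)
    for name, edges in datasets.items()
}
confidence = integrate(datasets, dataset_scores)
cutoff = 1.0  # arbitrary threshold for this example
high_confidence = {e for e, s in confidence.items() if s >= cutoff}
```

In this toy example the A-B interaction is reported by both datasets yet falls below the cutoff, because the noisy dataset scores negatively, whereas B-C and C-D pass with a single reliable source. Some integration schemes discard negatively scoring datasets or down-weight correlated evidence instead of summing as done here.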

Figure 2 (panels A-E; only the caption for panel E is recoverable). E. Interaction confidence scores can be filtered to select interactions with high confidence (here >=3.0). In this case the interaction between the pink and red proteins is lost since it does not pass the threshold despite having two lines of evidence in D.

Computational interactome networks of diverse species

Fungi
While the model budding yeast Saccharomyces cerevisiae was the first species to be used in high-throughput interaction screens [1,2,[4][5][6][7], the model fission yeast Schizosaccharomyces pombe has lagged behind, with few HTP datasets being produced until relatively recently [190][191][192]. Comparison of the stress-response interactome of S. pombe, StressNet, with the network from S. cerevisiae indicated that most stress-related interactions are not conserved and have undergone considerable rewiring during evolution [193]. Binary interactions of S. pombe were found to be better conserved
with humans than with budding yeast, further supporting this evolutionary rewiring [191]. Genetic interactions appear to have higher conservation between S. pombe and budding yeast [194], with ∼30% of synthetic lethal interactions found to be conserved [57]. Mass spectrometry has been used in S. cerevisiae, Candida albicans, and S. pombe to quantify the evolutionary rate of change in phosphorylation, indicating that kinases have a lower rate of conservation, and several of the S. pombe results were confirmed using E-MAPs [195]. A more recent E-MAP study suggested a hierarchical model for the evolution of genetic interactions [196]. Despite the evidence for evolutionary rewiring of S. pombe PPIs, it is possible to use S. cerevisiae and other data to computationally predict S. pombe interactions [188,197,198]. In particular, the Pombe Interactome was predicted using over 100 protein features by comparison with budding yeast, and several predicted interactions for the Cbf11 transcription factor were subsequently confirmed using AP-MS [199].
Candida albicans is a budding yeast and opportunistic pathogen found in the human gut [201]. Production of interaction data in this species has lagged behind other yeasts due to non-standard codon usage [202], which requires modified methodologies [203][204][205]. Schoeters and colleagues have provided a manually curated list of C. albicans protein-protein interaction data [200] and several other interactome datasets have been produced in this and other budding yeasts (Table 1).
Ascomycetes and other Fungi have also been investigated using computational network prediction (Tables 2 and 3). The PHI-Nets resource contains networks, produced using interologs and DDI mapping from S. cerevisiae and S. pombe, for fifteen Ascomycetes [206] including the pathogenic fungi Aspergillus fumigatus [207] and Fusarium graminearum [208]. The FPPI database is a resource for F. graminearum, which provides confidence-scored interologs covering ∼52% of the proteome [88]. In a later study, FPPI interactions were combined with gene expression data to predict a subnetwork of pathogenicity genes; two interconnected network modules were identified that were enriched in G-protein coupled receptors and MAPK signalling pathways, and which contained several known pathogenicity genes [209]. Finally, a large-scale study by Zitnik and colleagues produced computationally predicted networks for over 40 fungi [13].
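Many of the predicted networks in this section rely on interolog mapping, in which an interaction is predicted between two proteins of a target species when their orthologs interact in a source species. The sketch below is a deliberately minimal version of that idea. The protein identifiers and ortholog table are hypothetical, and the many-to-many expansion across paralogs with no confidence scoring is a simplifying assumption.

```python
def map_interologs(source_ppis, orthologs):
    """Transfer interactions from a source species to a target species.

    source_ppis: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs:   dict mapping a source protein to a set of target orthologs.
    """
    predicted = set()
    for a, b in source_ppis:
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                if target_a != target_b:  # skip self-interactions
                    predicted.add(frozenset((target_a, target_b)))
    return predicted


# Hypothetical source interactions and ortholog assignments.
source_ppis = [("Y1", "Y2"), ("Y2", "Y3"), ("Y3", "Y4")]
orthologs = {
    "Y1": {"T1"},
    "Y2": {"T2a", "T2b"},  # one source protein with two target paralogs
    "Y3": {"T3"},
    # Y4 has no ortholog, so Y3-Y4 cannot be transferred.
}

predicted = map_interologs(source_ppis, orthologs)
print(sorted(tuple(sorted(e)) for e in predicted))
```

The Y3-Y4 interaction is lost because Y4 has no ortholog in the target, which illustrates why this approach cannot recover interactions involving species-specific proteins.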
Combining networks from pathogenic yeasts and their hosts aids understanding of infection processes and can identify potential drug targets. Microarray data from C. albicans infected zebrafish were used to predict networks at different stages of infection and decipher the mechanisms underlying C. albicans pathogenicity [210]. Time-course C. albicans-zebrafish transcriptomics mapped to interolog networks indicated that redox status is crucial to infection in this species [211]. Similarly, comparison of the early and late stages of C. albicans infection identified important functional modules in both the pathogenic and defensive mechanisms [212]. Remmele and co-workers used interolog mapping to identify host-pathogen interactions of A. fumigatus during human and mouse infection, highlighting the roles of the PLB1 virulence factor and APP anti-fungal host protein [213]. Several network studies have also investigated yeast infection in plants [61,89,91,[214][215][216][217].

Plants
The plant Arabidopsis thaliana, thale cress, was chosen as the first 'model' plant due to its small genome and diploid nature that made genetic manipulation relatively simple [219]. A. thaliana has the most well-characterised interactome of the plants with several experimental [220][221][222][223][224][225][226][227] and computational datasets [13,149,187,188,198,[228][229][230][231][232][233][234][235][236][237][238], many of which form the basis of interactome studies in other plants [238][239][240][241][242][243]. Cross-species interactomes have also investigated infection in this species [215,244]. More recently, new plant models have been developed [219] and several interactome studies have been carried out in other plant species, many of which have commercial value (Table 4). Unlike A. thaliana, very little interaction data have been produced for rice, Oryza sativa, despite its economic importance as a staple food. The majority of interactome networks in this species were computationally predicted, although some experimental data have been produced [245][246][247]. Interolog mapping was used to create a proteome-wide interactome for rice, which correlated well with co-expression data [248]. RicePPINet used structural and functional information as machine learning inputs to predict the rice interactome and the resulting network was used to identify genes involved in disease resistance and drought tolerance [249]. PlaPPISite comprises 36,420 interologs predicted using several computational methods [238], while the predicted Rice Interactome Network, PRIN, contains 76,585 interologs [250,251]. RiceNet is a PFIN for rice produced using data from several model organisms [252]; an updated RiceNet network has 1,775,000 interactions between 25,765 genes [253]. BarleyNet is a PFIN for barley, Hordeum vulgare, based on orthology to Arabidopsis and rice [254]. Five computationally predicted interactomes were produced for Zea mays, maize [13,231,238,239,255].
In horse gram, Macrotyloma uniflorum, an interolog network of over 6,000 interactions has been produced [256]. Predicted interactomes have also been created for many other plant species (Table 4). For instance, PTIR is a tomato interolog network based on orthology to six model organisms containing over 12,000 high-confidence interactions, ten of which were experimentally verified [257]. A turnip (Brassica rapa) interolog network has been inferred from A. thaliana data [258] and a predicted interactome was produced for coffee, Coffea canephora, using orthology to model organisms including A. thaliana [259]. Predicted networks have also been built for the tea plant Camellia sinensis [92,260], cassava Manihot esculenta [261,262], orange Citrus sinensis [263], Thai basil, Ocimum tenuiflorum [264], the poplar tree Populus trichocarpa, and the moss Physcomitrella patens [187].
Computational methods have been used to compare networks between multiple plant species. For instance, PlaPPISite contains interolog networks for eleven other plant species inferred using structural and domain information, which provides accurate prediction of interaction sites in these species [238]. Ding and colleagues used functional interaction data to infer confidence-scored interolog networks for three species (A. thaliana, Glycine max (soybean) and Z. mays), many interactions of which were supported by literature-curated evidence [231]. Vandereyken and co-workers compared hub proteins between major plant interactome studies and found that many are involved in stress responses [265]. Zitnik and colleagues' large-scale interactome study created networks for sixteen plant species [13]. Finally, several studies have produced plant-pathogen networks [89,266]. Fungal-plant networks are discussed above and reviewed in [267].

Protozoa
The apicomplexan Plasmodium falciparum is the causative agent of malaria [275]. Interactome analysis is a powerful method for understanding this parasite and identifying potential targets for therapeutic intervention. Few large-scale interaction screens have been carried out in P. falciparum due to its AT-rich genome hampering classical experimental methodologies such as yeast two-hybrid [276]. A comparison of the P. falciparum interactome [277] with those of yeast, worm, fly and human revealed a marked difference in hub connectivity in the malarial network, with clustered interconnected hubs [278]. This divergence from other species has also been observed in P. falciparum's protein complex conservation [279]. Computational prediction has been applied to host-malaria infection. Dyer and co-workers integrated intra-species PPIs with protein-domain profiles to predict PPIs between P. falciparum and its human host [280]. Interolog mapping using eighteen other eukaryotic species was used to produce an interactome for P. falciparum and predict its interactions with human proteins [70]. A later interolog study based on multiple interaction datasets revealed that parasite proteins predominantly target hub proteins to take control of the human host cell [281]. An integrated interactome of predicted and experimental data was used to study the pathogenesis of cerebral malaria [282] and in the related species, P. vivax, machine learning was used to create a human-malaria interactome that was analysed to identify putative drug targets [283].
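Hub connectivity of the kind compared across species above can be examined with simple degree statistics. The sketch below is illustrative only: it defines hubs by an arbitrary degree cutoff and reports the fraction of hub-incident interactions that join two hubs. The network, the cutoff and this particular summary statistic are assumptions made here and are not taken from the cited studies.

```python
from collections import Counter


def degrees(edges):
    """Number of interaction partners per protein (edges assumed unique)."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def hubs(edges, min_degree):
    """Proteins with at least min_degree interaction partners."""
    return {node for node, d in degrees(edges).items() if d >= min_degree}


def hub_hub_fraction(edges, min_degree):
    """Fraction of hub-incident edges that connect two hubs."""
    hub_set = hubs(edges, min_degree)
    incident = [(a, b) for a, b in edges if a in hub_set or b in hub_set]
    if not incident:
        return 0.0
    both = sum(1 for a, b in incident if a in hub_set and b in hub_set)
    return both / len(incident)


# Hypothetical toy network with two directly connected hubs.
edges = [
    ("H1", "H2"), ("H1", "a"), ("H1", "b"),
    ("H2", "c"), ("H2", "d"), ("c", "d"),
]

print(sorted(hubs(edges, min_degree=3)))
print(hub_hub_fraction(edges, min_degree=3))
```

A higher fraction indicates that hubs tend to interact with each other directly, and comparing this value between networks gives one crude view of differences in hub organisation.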
Leishmania and Trypanosoma are groups of protozoan parasites that cause disease in humans (Leishmaniasis and Chagas disease/sleeping sickness, respectively) [284][285][286], for which there are several interactome datasets. Interactome networks for Leishmania braziliensis and Leishmania infantum were produced by Dos Santos Vasconce and colleagues based on structural data and machine learning [287]. These predicted networks were later enhanced by incorporating data from an interolog mapping study in which networks were produced for three Leishmania species, L. braziliensis, L. major and L. infantum, before confidence evaluation using gold standard data [288].
To identify potential drug targets for Chagas disease, the predicted secretomes of its causative agent, Trypanosoma cruzi, and of the insect pathogen Trypanosoma rangeli were computationally mapped to cellular pathways [289]. Expression data were mapped to interolog networks of T. brucei and its vector Glossina morsitans morsitans to identify genes and proteins involved in the response to infection [290]. In a later study, interolog mapping identified interactions between T. brucei and G. m. morsitans [291]. TrypsNetDB contains experimental and interolog interactions for several Leishmania and Trypanosoma species [292].
Several other interactome studies have been carried out in protozoans (Table 5). Date and colleagues created a probabilistic functional integrated network for P. falciparum and mapped the network to T. gondii and C. parvum to identify areas of commonality [293]. Twenty-three protozoan networks were produced in a study of 68 eukaryotic integrated interactomes [13]. Finally, Cuesta-Astroz and colleagues' large-scale study of fifteen parasite-host interolog networks included Plasmodium, Trypanosoma and Leishmania species and other apicomplexan parasites [294].

Mammals
The laboratory mouse, Mus musculus, is possibly the most important model species that has been used extensively in the study of genetics [299]. Due to its importance to the understanding of human disease, the mouse interactome has been a key research goal since completion of the mouse genome, and a wealth of experimental [89,[300][301][302][303][304][305][306][307][308] and computational [13,64,188,198,[309][310][311][312] interactome data have been produced, which along with human and yeast data, form the basis of interactome studies in many other mammals (Table 6). One of the largest, MouseNet, represents a large-scale PFIN produced via machine learning based on multiple types of functional interaction data including PPIs, expression and phenotypes. This network successfully predicts known human disease phenotypes, demonstrating the potential of interactomes in cross-species prediction [64,313]. In a comparative analysis, Shin and colleagues found that differing ortholog mapping algorithms have low overlap, and so produced an interolog network that combined the different results [310]. Yellaboina and co-workers combined interolog mapping with genome context and phylogenetic profiles to produce a network of over 40,000 mouse interactions [309]. MppDB is a predicted mouse interactome built using text-mined data followed by machine learning using several data types and mapping techniques [311].
Many mammalian studies have concerned species with commercial value, for example Bos taurus interactomes were used to investigate meat production [314] and infection [315]. Wang and colleagues predicted interactomes for cattle, dogs, horses and rabbits and demonstrated their reliability using subcellular localisation and comparison with randomised networks [316]. Of particular note is the evolutionary study of Zitnik and co-workers, who produced interactomes for 1,840 species, including 28 mammals. Interactomes were found to evolve to become more resilient to network failure, and in bacteria this resilience was correlated with the variability of the species' environment [13]. The FunCoup database contains PFINs for sixteen model species including four mammals, and provides an interactive interface for comparative interactomics between species [198]. The STRING database contains functional interaction data, including co-citation, co-expression and gene neighbourhood, for multiple species including several model mammals, which can be queried through an interactive server by protein name [317]. Finally, the BiomeNet server can be used to construct PFINs for any species based on a set of eighteen PFINs including human and mouse [318].

Fish
Zebrafish, Danio rerio, the best characterised of the fish, is a model for regeneration and development [323], and has been the subject of several experimental interactome studies [13,188,198,324,325]. Due to the economic importance of global fish production [326], many studies have aimed to understand the interactome of other fish (Table 7) and their responses to parasitic disease [210][211][212][327][328][329][330][331][332]. Carrera and colleagues created an interactome from the STRING database based on the combined proteome of fifteen different sarcoplasmic fish, revealing a core interactome involved in energy and metabolism [333].
Millan-Cubillo and co-workers also used STRING to produce interaction networks for two developmental stages in the seabream, Sparus aurata, based on expression data [334]. A later seabream study used expression data to mine STRING for interactions of the stress response [335]. Co-expression analysis was also used to create a genetic interactome for the Nile tilapia, Oreochromis niloticus [336]. Zitnik and co-workers' evolutionary study included seven fish and the coelacanth, Latimeria chalumnae [13].
Li and colleagues combined physical interaction data with interolog mapping to produce the Worm Interactome [376]. Simonis and co-workers extended this network using further Y2H screening to produce version 8 of the Worm Interactome [377]. By combining Y2H and protein-DNA interaction (PDI) mapping the C. elegans interactome was expanded to include more than 2000 transcription factor interactions [378]. Gunsalus and co-workers combined the Worm Interactome (version 5) [376] with expression and phenotypic data to produce an integrated network of early embryogenesis [379]. These datasets have formed the basis of many interactome analyses and comparisons in C. elegans and beyond [382,383].
Few interactome studies have been carried out in other nematodes (Table 9). Interolog mapping was used to produce and compare host-parasite interactomes for six parasites including the human and plant parasites Meloidogyne hapla and Meloidogyne incognita [384]. Comparison with a predicted interactome for C. elegans was then used to prioritise drug targets. Finally, Cuesta-Astroz and colleagues' large-scale study of fifteen parasite-host interaction networks included an interactome network for Trichinella spiralis [294].
The platyhelminths Schistosoma mansoni and Schistosoma japonicum are important parasitic blood flukes that cause schistosomiasis in humans [385,386]. Interolog mapping has been used to produce and compare host-parasite interactomes for six parasites including S. mansoni and S. japonicum [384]. White-Bear and colleagues used structural prediction, followed by extensive confidence filtering, to produce an interactome of over 1000 S. mansoni-human interactions [387]. A combination of Y2H and Co-IP produced 205 interactions involving the essential histone deacetylase 8 [388]. Co-IP was also used to identify the host-parasite interactome of S. mansoni and its mollusc host Biomphalaria glabrata [389]. In Cuesta-Astroz and colleagues' large-scale study of fifteen parasite-host interaction networks, a network of ∼700 interactions was produced for S. mansoni [294].
Castillo-Lara and co-workers used interolog mapping from a human reference interactome, gene expression data and machine learning to produce PlanNet, a predicted interactome for the planarian, Schmidtea mediterranea, with online visualisation and analysis tools [390]. This resource was later extended to allow exploration of the network using gene expression data [391]. Finally, a probabilistic functional integrated network of interologs was produced for the mouse bile duct tapeworm, Hymenolepis microstoma [74]. Although there are few interactome data for these species (Table 9), interactomics has the potential to expand our understanding of parasitism in the future.

Other Eukaryotes
Interactome analyses that have been applied to a variety of other species, including birds, amphibians, reptiles and crustaceans, are summarised in Table 10.

Future Perspective
Data from traditional model organisms, chosen in part for ease of their experimental study [399], are now being expanded by the addition of genomic data from closely-related species [400]. Diverse organisms have become the study species of choice for answering more obscure biological questions [401] and to test the long-held fundamental rules built from the original model species [402].
Comparative interactomics is now possible on a large scale, as demonstrated by Zitnik and colleagues' study of 1,840 predicted interactomes spanning the tree of life [13]. With over 8 million protein-protein interactions, this dataset allowed the authors to observe interaction rewiring that had only previously been seen on a smaller scale [191-193, 353, 378, 403].
Computationally predicted interactions can compensate for the lack of experimental data in many species but, like experimental methods, they have drawbacks such as noise and false positive interactions [293,404]. Conservation of PPI is unequal and accuracy of interolog mapping will vary between species [405]. This accuracy will also be dependent on evolutionary distance and careful selection of thresholds will be necessary [406]. It is clear from yeast interactomes that there can be low overlap between species; S. cerevisiae is well characterised but the C. albicans and S. pombe interactomes are different in a number of respects [191,193,204,407]. Combining different computational methods [86,105,106,310], and experimental data if available [232,235,376,408], can give a fuller picture of the interactome and mitigate the effect of data noise by using a probabilistic framework [74,198].
Interolog and DDI mapping can only detect interactions within conserved sequences or domains [68]; organism-specific proteins and interactions are missed. Filling in the gaps is vital to our understanding of biology and evolution. Interactome analyses can help to identify these gaps and target further analysis in non-model species. While there are parts of some interactomes that cannot currently be predicted, interactome accuracy will improve as coverage of diverse species increases. Non-model species have already been used to provide insights in a number of fields, including human disease [409], ageing and regeneration [410,411], and to expand our understanding of evolution [412], ecology [413,414] and biological diversity [415,416]. There are currently more than 13,000 entries from over 10,000 species in the NCBI Genome database, the majority of which are non-models. At the same time, de novo transcriptomes have also been produced for many species that lack genome sequence data [417][418][419][420]. The Sequence Read Archive contains more than 20,000 non-model transcriptomic datasets from over 7,000 species. These resources represent a huge amount of data for interactome generation in more diverse species and provide the basis to build interactomes that will potentially have impact across the field of biology.