Pathways with genomic variation in family DNA microarrays linked to autism risk inheritance

The genetic heterogeneity of autism has stymied the search for causes and cures. Whole-genomic studies on large numbers of families have helped identify combinations of inherited and de novo signal. In the present work, we re-analyze DNA microarrays using a novel strategy that takes prior knowledge of genetic relationships into account and that was designed to boost signal important to our understanding of the molecular basis of autism. Our strategy was designed to identify significant genomic variation within a priori defined biological concepts and improves signal detection while lessening the severity of multiple test correction seen in standard analysis of genome-wide association data. Upon application of our approach using 3,244 biological concepts, we detected genomic variation in 68 biological concepts with significant association to autism in comparison to family-based controls. These concepts clustered naturally into a total of 19 classes, principally including cell adhesion, cancer, and immune response. The top-ranking concepts contained high percentages of genes already suspected to play roles in autism or in a related neurological disorder. In addition, many of the sets associated with autism at the DNA level also proved to be predictive of changes in gene expression within a separate population of autistic cases, suggesting that the signature of genomic variation may also be detectable in blood-based transcriptional profiles. This cross-validation with gene expression data from individuals with autism coupled with the enrichment within autism-related neurological disorders supported the possibility that the mutations play important roles in the onset of autism and should be given priority for further study. Our work provides new leads into the genetic underpinnings of autism and highlights the importance of reanalysis of genomic studies of complex disease using prior knowledge of genetic organization.


Introduction
Autism is a complex neurological disorder characterized by restricted communication, impaired social interaction, and repetitive behavior. Diagnosis typically occurs by 4.3 years of age, with a current CDC estimate of approximately 1 in 59 children meeting criteria for autism spectrum disorder (ASD).
Heritability estimates of ~90% [1] for ASD imply a strong genetic component, making it one of the most highly heritable genetic disorders. Despite this high heritability, genetic association studies have been unable to detect variants explaining more than a negligible fraction of the phenotypic variability present in study populations.
Published genome-scale attempts to identify variants with significant association to ASD have been met with limited success. Weiss et al. [2] were not able to identify any individual single-nucleotide polymorphism (SNP) meeting genome-wide significance, but were able to detect a highly penetrant microdeletion/microduplication in the 16p11.2 region, present in approximately 1% of the cases studied. Examining many of the same families using an alternative genotyping platform, Wang et al. [3] uncovered a common variant in the region between two cell adhesion genes, CDH9 and CHD10, that met criteria for genome-wide significance after aggregating results from multiple studies. Despite these successes, our picture of autism-associated variation remains limited.
The inability to find signal associated with the genetic causes of autism may be partially attributable to the phenotypic heterogeneity of ASDs, as the behavioral variation is likely to be rooted in high genetic heterogeneity. It may also be due to technological bias as the genotyping platforms used were designed to detect common, rather than rare, variants. Additionally, purely data-driven approaches to the analysis of genotyping data tend to suffer from low power due to inadequate sample sizes and the 5 severity of multiple test correction. This can be more pronounced in complex disorders like autism where it is likely that many variants of weak to moderate effect, which require exponentially larger sample sizes to detect, contribute to the onset and progression of the disorder.
The success of GWAS has been debatable [4][5][6][7], with studies suffering from insufficient samples sizes leading to low statistical power, significant findings in gene deserts which provide little, if any, biological insight, and minimal reproducibility of significant findings. As a consequence, the field has been motivated to develop alternative approaches to GWAS analysis that can circumvent some of the signal detection problems associated with purely data-driven approaches while elucidating biologically meaningful variation. Borrowing from the field of microarray data analysis, which suffers similar problems of multi-dimensionality and low signal-to-noise ratio, methods classifiable as knowledgedriven strategies have emerged to improve detection of real biological signal among previously elusive GWAS data [3,[8][9][10][11][12][13][14][15]. Several of these approaches have enjoyed considerable success. The strategy employed by the majority of these methods begins by assigning a significance level to individual genes based on the p-values computed for the GWAS data. Once significance has been determined at the gene level, approaches assess pathway significance using a variety of approaches, including ranking genes and testing for overrepresentation in the top of the ordered list, building protein-protein interaction (PPI) networks, and testing the distribution of genes within a pathway that contain nominally significant SNPs. However, some strategies use the SNP data directly and look at distributions of nominally significant SNPs within pathways or annotation classes. In many cases these approaches are helping to filter the search space of genetic variants associated with various diseases and, in general, strategies like these are helping GWAS achieve some of the potential of their original promise.

6
In the present study, we reanalyzed data from two GWAS of autism families with a similar approach designed specifically to handle family-based data and to detect previously masked signal among a priori defined biological concepts. By doing so, we were able to identify SNPs significantly associated with autism at a whole-genomic scale that had been previously missed by standard analytical methods. These SNPs mapped to genes that were highly enriched for neurological disease candidates and that exhibited significant differential expression in blood profiles of an entirely different population of autistic individuals. Our results reveal significant genotypic variation in an autistic population that is likely to be linked to the molecular pathology of the disorder, and also highlight the important of using knowledgedriven approaches to study GWAS data that had previously yielded limited signal.

Results
We used two published genome-wide association studies (GWAS), Broad and CHOP, for our analyses (Table 1). From these, we acquired a total of 500,000 and 550,000 SNPs for 751 (2,883) and 943 (4,444) families, respectively ( Table 2). The overlap in genomic coverage between the Broad and CHOP studies was low with less than 68,000 SNPs in common between the two platforms, thus the union provided significantly higher resolution than either study independently. After calculating nominal p-values using the family-based association test (FBAT) [16], we mapped all SNPs from both the Broad and CHOP studies to the MSigDB C2 and C5 gene set collections. This mapping resulted in 3344 SNP sets that ranged in size from 1-57,525 SNPs (mean 1,315.8), representing anywhere from 1-1,873 (mean 59.3) genes ( Table 2). We then used these sets to search for significant differences in genotype frequencies associated with the population of autistic cases in our two samples.

Set-based analysis
7 After excluding 7 SNP sets with low expected values and controlling the FDR at = 0.05, 68 of the 3,344 (3.0%) sets tested achieved significance in either the Broad or CHOP data set (Table 3; Supplementary   Table 1).
To evaluate potential methodological biases, we looked for correlation between the level of significance (p-value) and the number of SNPs per set. The largest correlation found was r 2 = 0.03, with all other correlations were similarly near zero ( Table 2; Supplementary Figure 1). This lack of correlation indicated that differences in the sizes of our sets were not related to differences in the observed levels of significance, and eliminated the possibility of spurious results among the set of significant biological concepts.
Given the limited overlap of genomic coverage between the two studies (Table 1), we were not surprised to find no overlap in the sets identified as significant. The CHOP data yielded the highest number of significant sets with 53 passing our adjusted p-value threshold, while only 15 sets passed the p-value threshold using the Broad data set. Close inspection of these sets showed virtually no overlap in coverage of SNPS between the Broad and CHOP data sets ( Figure 1). In other words, when a set was identified as significant using the Broad Affymetrix array the SNPs in the set were not represented on the Illumina array, and vice versa. This suggested that the lack of agreement between the two data sets was due only to experimental design rather than systematic bias associated with our method.
To characterize redundancy of genes across pathways, we clustered the 68 significant sets according to genes contained in each set. By creating binary profiles of presence and absence of the constituent genes from each set we could compute a pairwise correlation matrix and generate a clustering for all 68 q 8 significant sets. Nineteen distinct clusters could be identified including several involved in cell adhesion, immune response, and cancer progression ( Figure 2).

mRNA Expression differences in autistic cases
To validate the significant gene sets and their potential importance to autism, we used published gene expression data from a study of blood-based transcriptional profiles in 17 early-onset autistic cases [17].
Our objective was to determine whether any of the 19 clusters ( Figure 2) contained genes with significantly different patterns of expression in autistic individuals when compared with controls. For this, we elected to examine the set in each of the 19 clusters containing the lowest average p value (Table 4). All but two processes contained at least some genes with significant (FDR-adjusted) differential expression, and several had over 90% of its constituent genes exhibiting significant differential expression in autistic cases when compared to controls. In sum, the biological concepts identified as significantly associated with autism genome-wide data appeared to be predictive of differential expression of the same genes in whole-blood samples from a different population of autistic individuals.

Neurological disease enrichment
As a second strategy for validation of the significant gene sets, we determined the extent to which each set was enriched for genes involved in autism-related disorders, following from previous work on crossdisease comparison of neurological disorders [18]. Autism spectrum disorders share behaviors and potentially genetic mechanisms with a host of other neurological disorders [18,19]. Using Autworks, a knowledge base of genetic associations for autism and related conditions, we computed a neurological disease enrichment (NDE) score to determine if our top ranked gene sets contained unusually high percentages of neurological disorder candidates (Table 5). Of the 37 neurological disorders present in 9 Autworks, 36 of them had at least one candidate gene that was a member of one or more of the topranking gene sets passing test correction (Table 5). All but three of the 19 representative clusters were significantly enriched (p < 0.05) for known neurological disorder candidate genes.

Discussion
The challenge of identifying and characterizing the genetic predisposition to autism has been exacerbated by the low signal-to-noise ratio inherent to genome-wide association studies. In complex disorders, where it is expected that many variants of mild effect induce the disease state, the sample sizes necessary to achieve sufficient statistical power is especially problematic. In fact, the most promising candidate variants in recently published GWAS of autism have detected variants with maximum odds ratios of 0.77 [20] and 1.19 [3], stressing the need for alternative analytical strategies.
Here we leveraged prior biological knowledge to supplement standard analysis of GWAS data in order to maximize signal relevant to the genetic basis of autism.
Through application of our knowledge-driven approach (Figure 3), we were able to detect 68 biological concepts significantly associated with autism cases in two studies, one from the Broad Institute (Broad) and the other from the Children's Hospital of Philadelphia (CHOP). Upon inspection of the genes represented in the 68 concepts of interest, it became apparent that many of the member genes functioned in multiple concepts. Due to this overlap, we were able to cluster the concepts into 19 groups ( Figure 2) and identify that the majority of these gene sets could be mapped to three major themes, cell adhesion, cancer response, and immune response.
In strong support of their roles in autism, all but five of our top ranking SNP collections proved to be predictive of significant changes in gene expression in a different study of autistic individuals [17], and all but three yielded significant neurological disease enrichments scores, indicating that they were strongly enriched with candidates linked to disorders already known to be related to autism [18]. These two independent avenues for validation of our sets strongly supported their role in the molecular pathology of autism and stressed the importance of follow-up studies to further refine our understanding of which pathways, and which SNPs in those pathways, may contribute to the manifestation of this disorder.
The 68 significant SNP sets naturally clustered into 19 groups that could be reliably mapped to three themes -cell adhesion, cancer progression, and immune system response. The enrichment of cell adhesion-related concepts in our analysis supports previous results implicating the role of improper cohesive states in the manifestation of autism, among the most recent being the identification of a genome-wide significant variant in the intergenic region between CDH9 and CDH10 [3]. Both CDH9/10 are members of the cadherin superfamily, which includes integral membrane proteins that mediate calcium-dependent cell-cell adhesion. The top scoring of our cell-adhesion concepts was HOMOPHILIC_CELL_ADHESION, a collection of 16 genes that function in the attachment of two identical adhesion molecules in adjacent cells. This set was highly enriched for neurological disorders (Table 4), including three member genes (CADM1, ROBO1, ROBO2) having previous associations with autism [21,22]. Furthermore, genes in these sets play known roles in central nervous system (CNS) related processes, including activated T cell proliferation [23], axon guidance [24], axon guidance receptor activity [25], axonal fasciculation [26], brain developmental processes [27][28][29], central nervous system development [25], myelination [26], nervous system development [30], neurite outgrowth [31], neuromuscular junction [32], and positive regulation of axonogenesis [26,33]. Interestingly, axonal fasciculation is known to be affected by MECP2 levels in Rett Syndrome [34], one of the broad-spectrum autistic disorders. In addition to involvement in the CNS, 6 of the cell adhesion genes identified by our analysis also function in the immune system, including in B cell differentiation [35], defense response [36], immune response [37], leukocyte cell-cell adhesion [35], positive regulation of cytokine secretion [23], positive regulation of immunoglobulin mediated immune response [38], positive regulation of natural killer cell mediated cytotoxicity [39,40], susceptibility to natural killer cell mediated cytotoxicity [23,40,41], susceptibility to T cell mediated cytotoxicity [42], T-cell activation [43,44], T cell mediated cytotoxicity [23], type 1 fibroblast growth factor receptor binding [45], and synapse development [46]. This intersection in gene function between cell adhesion, CNS development, and immune function provides tantalizing insight into the potential importance of these biological themes in the pathology of autism.
Little evidence has been found to support the role of cancer-related genes in ASDs. However, the presence of several cancer-related concepts among our top scoring and validated SNP sets warranted a deeper look at the function of constituent genes. The top scoring cancer response related concept was BRCA1_SW480_UP (Broad (ALL), mean p = 0.020), a collection of 25 genes affected by BRCA1 expression in breast and lung cancer cell lines that cause dysregulation of the cell cycle and DNA repair processes [47]. Of the 21 genes with SNPs in the Broad data, one (MET) has previously been listed as a likely contributor to autism [48] and 18 of the remaining 20 (90%) have been implicated in at least one other closely related neurological disorder.
MET encodes a proto-oncogene that is involved in cell-cell adhesion, CNS development, and neuron migration. Variants in the MET promoter region that interfere with normal gene transcription have previously been identified in subsets of autism cases [49,50]. Another gene member gene in this 12 concept, CTNNA1, functions with multiple proteins involved in construction and maintenance of the extracellular matrix. CTNNA1 binds CDH1, a member of the cadherin superfamily that also interacts with MET directly to provide support necessary to promote strong cell-cell adhesive states. The direct interaction of CTNNA1 and CDH1, especially in light of the knowledge that CTNNA1 has previously been linked to both Rett Syndrome and Schizophrenia, may be another mechanism promoting unhealthy cellcell adhesive states in autism. A third cancer-related gene that promotes healthy cohesive states is the tissue inhibitor of metallopeptidase 1 (TIMP1). The TIMP family of proteins regulates the activity of matrix metalloproteinases (MMPs), a family of peptidases involved in degradation of the extracellular matrix [51]. With extracellular degradation being essential in the routine maintenance and repair of tissue after injury, the appropriate regulation of MMP activity again emphasizes the importance of proper connective states in normal tissue function. Additionally, the balance of TIMP1/MMP9 levels is thought to be important in blood-brain barrier (BBB) function [52]. TIMP1 levels have previously been shown to be dysregulated in the serum and cerebrospinal fluid (CSF) of patients with neurological disorders related to autism [53][54][55][56][57], suggesting improper TIMP1/MMP9 levels and dysfunction of the BBB may contribute to the onset and progression of neurological disorders.
We have demonstrated that knowledge-driven approaches to signal detection can detect autism-related signal previously missed by alternative approaches. And, we have demonstrated that two of the largest published GWAS data sets on autism to-date contain an abundance of significant pathways with SNPs that are highly likely to play linked to the molecular pathology. These pathways could be largely verified in two ways, first by transcriptional profiles of autistic individuals, and second by their enrichment of neurological disease candidates. In the former, we were able to demonstrate that all but 5 of the 19 most significant pathways contained genes that are significantly differentially expressed in autistic individuals. In the latter, leveraging published studies on autism-related disorders, we were able to demonstrate that all but three of these 19 pathways contained significant percentages of genes linked to neurological disease.
Our reanalysis of autism genome-wide association data identified significant allelic differences among autistic individuals that correspond to a limited number of biological concepts. While it remains difficult to determine which SNPs among these concepts are the most informative to the genetic bases of autism, our results point to a highly promising set of candidates that are worthy of further investigation.

Conclusions
Our approach identified significant signal in two historical autism GWAS datasets. The genomic variants were enriched in autism-related neurological disorders and linked to genes that were significantly differentially expressed in autistic individuals when compared to controls. These genetic variants represent a high priority set of candidates worthy of deeper inspection in a larger cohort of autistic individuals. In sum, our work provides exciting new leads into the genetic underpinnings of autism and highlights the importance of reanalysis of genomic studies of complex disease using prior knowledge of genetic organization.

Ethics Statement
Our study (number: M18096-101) has been evaluated by the Institutional Review Board and identified exempt as defined under 45CFR46.102(f) and as meeting the conditions regarding coded biological specimens or data. As such, (a) the specimens/data were not collected specifically for the research through an interaction or intervention with a living person, and (b) the investigators cannot "readily ascertain" the identity of the individual who provided the specimen/data to whom any code pertains.

Genotype samples
Genotyping data were acquired from the Autism Genetic Resource Exchange (AGRE) [66], consisting of two previously described cohorts from the Broad Institute (Broad) [2] and the Children's Hospital of Philadelphia (CHOP) [44]. The Broad cohort consisted of 2,883 subjects from 751 families genotyped at 500K SNPs with probands diagnosed using the Autism Diagnostic Interview-Revised (ADI-R) [67], and the CHOP cohort consisted of 4,444 subjects from 943 families genotyped at 550K SNPs with probands diagnosed using both the ADI-R and the Autism Diagnostic Observation Schedule (ADOS) [68].
Standard quality control measures were applied using PLINK [69]. SNPs were excluded from the analysis if minor allele frequency < 0.05, call rate < 0.80, or Hardy-Weinberg Equilibrium (HWE) was not met (p < 0.01). Additionally, genotypes for all nuclear family members were set to "missing" for SNPs in violation of Mendelian rules of inheritance. Genotyping samples were removed where sex or affected status were unknown, or genotyping call rate < 0.80. Finally, subjects of European descent were selected for further analysis, leaving 727 families containing 3,322 subjects and 557 families containing 2,137 subjects for the CHOP and Broad cohorts, respectively (Table 1). After applying quality control measures and removing SNPs not required for the set-based analysis, 269,214 (ALL) and 112,403 (PRI) SNPs remained for the Broad cohort, and 264,543 (ALL) and 102,194 (PRI) SNPs remained for the CHOP cohort (Table 1).

SNP set construction
SNP identifiers were mapped to the 1,892 C2 and 1,454 C5 MSigDB [70] gene sets based on the SNP-togene associations specified in the Affymetrix 5.0 and Illumina HumanHap550 genotyping platform annotation files for cohorts from the Broad Institute [2] (Broad) and Children's Hospital of Philadelphia [3] (CHOP), respectively. All SNPs listed as associated with a gene were mapped to the set.

Set-based analysis
We devised an analytical strategy that incorporates a priori biological knowledge to collectively test for significant association of groups of SNPs with autism in the genome-wide association data utilized here.
The outline of our strategy is depicted in Figure 3 and described below.

Construction of SNP sets. SNP identifiers were mapped to the C2 (curated gene collections) and C5
(gene sets from the biological process ontology of Gene Ontology) MSigDB [70] gene sets based on the SNP-gene relationships specified in the respective genotyping platform annotation files to create SNP sets representing the biological concept of the original gene set.

Compute SNP observed p-values and set counts. For each SNP si, a p-value,
, was calculated following the procedure previously described [16] as appropriate for family-based cohorts; an additive effects model was used to code the genotypes. The number of SNPs in each SNP set SSj with , , was then counted and saved for later use. Finally, q-values [71] for each set were computed using the qvalue package in R and all SNP sets with were considered significant.

Clustering of significant sets
Once significant sets of SNPs were identified, we clustered the sets to reduce redundancy of the representative genes and improve our ability to identify which major biological themes were enriched among our autism cases. To do so, we created binary profiles of genes present or absent in each of the sets passing our significance cutoff. Then, we computed a Pearson correlation coefficient for all pairs of binary profiles to create a pairwise correlation matrix that could be used to generate a simple tree to visualize clustering among the sets.

Neurological disease enrichment (NDE) scores
To assess the representation of neurologically related genes in each of the biological concepts, we computed a neurological disease enrichment (NDE) scores as follows: where N = 23402, the number of human gene entries in Entrez gene (http://www.ncbi.nlm.nih.gov/gene) at the time of this writing, m = 3511, the total number of genes with known association to a neurological disorder, n is the number of genes in gene set j, and k is the number of neurologically related genes observed in gene set j. Intuitively, this represents the probability of a biological concept containing at least as many neurological disorder genes as was observed. Using Autworks , a knowledge base of genetic associations for autism and related conditions, we computed a neurological disease enrichment (NDE) score, to determine if our top ranked gene sets contained unusually high percentages of neurological disorder candidates (Table 5).

mRNA expression data processing
We downloaded GSE6575 [17,59]  to contain limited signal [18] and were excluded from the validation steps in this study. Preprocessing and expression analyses were done with the Bioinformatics Toolbox Version 2.6 (For Matlab R2007a+).
GCRMA was used for background adjustment and control probe intensities were used to estimate nonspecific binding [72]. Housekeeping genes, gene expression data with empty gene symbols, genes with very low absolute expression values, and genes with a small variance across samples were removed from the preprocessed dataset. To adjust for multiple testing, we used the q-value calculation [71], a measurement framed in terms of the false discovery rate [73], considering q < 0.05 an indication of significant differential expression.   SNP set, "Calcium Channel Activity," that achieved significance using the CHOP data (red) but not using the Broad data (blue). As depicted, the overlap in SNPs was limited, confirming that the lack of replication was due to experimental design and not biases associated with our enrichment technique.  After all p-values were calculated, the number of nominally significant SNPs per set was counted.

List of abbreviations
Disease labels were then randomly permuted for N=1000 iterations. Upon each shuffling of the labels, p-values were recomputed and the number of nominally significant SNPs per set was recounted. After the final shuffling of the disease labels, the expected number of nominally significant SNPs was computed for each set given the distribution of the respective counts in the permuted data. Finally, a 2x2 contingency table was constructed for each set and used to compute the Chi Square statistic and pvalue for the set.

23
Supplementary Figure 1. Gene set size versus p-value. Correlation coefficients were calculated and correlation plots generated to evaluate any potential bias arising from SNP set size. No substantial correlation was found between set size and p-value (maximum r 2 = 0.0349), even with larger sets having higher statistical power to detect subtle differences. Points for sets passing FDR correction are shown in red. We examined correlation for all SNPs annotated to a gene and only SNPs within the primary coding transcript (PRI) for both Broad and CHOP data sets.