Polygenic Selection, Polygenic Scores, Spatial Autocorrelation and Correlated Allele Frequencies. Can We Model Polygenic Selection on Intellectual Abilities?

Most of the polygenic selection signal among educational attainment GWAS hits is confined to a handful of SNPs within genomic regions replicated across GWAS publications. A polygenic score comprising 9 SNPs predicts population IQ (r=0.9), outperforming 99.9% of the polygenic scores obtained from sets of random SNPs. Its predictive power remains unaffected after controlling for spatial autocorrelation. Even random polygenic scores are moderate predictors of population IQ, and their predictive power increases logarithmically with the number of SNPs, indicating an exponential reduction in noise. Thus, the predictive power of polygenic scores has to be scaled in proportion to the number of SNPs composing them.


Introduction
Piffer [1] identified 9 genomic loci that were replicated across the three largest GWAS of educational attainment published to date [2][3][4]. The 9 loci contain GWAS-significant alleles that were found to be in strong LD (r>0.8). One locus was replicated across all three GWAS [2][3][4], and the same SNP (rs9320913) was found in two of them [2,4]. The population frequencies of the 9 pairs of alleles (one member belonging to each GWAS publication) were highly correlated (r=0.919), hence the SNPs published in [4] were used. This set of 9 SNPs was therefore considered the best candidate for analysis of natural selection on educational attainment and related phenotypes (e.g. general cognitive ability, or GCA). Another set of 7 SNPs that reached significance in the UK Biobank and another database was identified by [4]. Population IQ estimates were obtained from [1]. This paper has two aims: to test for the presence of correlated frequencies among GWAS hits, and to test the predictive power of polygenic scores (average frequencies of GWAS alleles with positive effect) independently of spatial autocorrelation. A null model will be built using a large set of random SNPs, and the polygenic selection model will be tested against it.

Methods and Results
A simulation was performed using a random dataset: a large sample (N=7369) of frequencies of random, unlinked, matched SNPs (minor alleles, downloaded from 1000 Genomes, phase 3; r<0.1 among EUR). Matching was carried out using SNPSNAP [5], by feeding it the 9 SNPs and setting LD r²<0.1. Correlations between all the variables (i.e. SNP frequencies) were computed on the entire dataset. This produced a very large correlation matrix (N=27,147,396 for the lower triangle). The average correlation coefficient was 0.058 (SD=0.537). The slightly positive value is likely due to the differential representation of minor alleles among populations. As the SD value for the smaller samples tended to be lower (0.45-0.49), the larger SD for the random set (0.537) was used to compute corrected Z scores. The same analysis was applied to the educational attainment GWAS hits (table 1). There is very little signal in the GWAS hits as a whole, and what signal there is seems to be concentrated within a small subset of SNPs, possibly the 9 replicated loci and the 7 cross-replicated hits (r=0.278 and 0.125, respectively). Similar null results (table 1) were obtained for the 600+ SNPs from the largest GWAS of human height [6]. Since there was some LD between the 9 quasi-replicated SNPs, only one SNP per chromosome was retained, yielding 6 unlinked SNPs. This gave a "pure" (LD-free) measure of correlation. The average correlation was slightly higher than for the 9 SNPs (r=0.343), implying that LD did not produce the correlation among the full set of 9 hits.
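The pairwise-correlation summary described above can be sketched as follows. This is a minimal illustration, not the original analysis code: the source does not state which software was used, and the function names, the populations-by-SNPs array layout, and the toy data are all assumptions made for this example.

```python
import numpy as np

def n_lower_triangle(n_snps):
    """Number of distinct SNP pairs, i.e. entries strictly below the diagonal."""
    return n_snps * (n_snps - 1) // 2

def pairwise_correlation_summary(freqs):
    """Summarise the SNP x SNP correlation matrix.

    freqs: array of shape (n_populations, n_snps) holding allele frequencies
           (hypothetical layout; one column per SNP).
    Returns (number of pairs, mean r, SD of r) over the lower triangle.
    """
    freqs = np.asarray(freqs, dtype=float)
    corr = np.corrcoef(freqs, rowvar=False)            # columns are the variables
    rows, cols = np.tril_indices(corr.shape[0], k=-1)  # strictly below the diagonal
    lower = corr[rows, cols]
    return lower.size, float(lower.mean()), float(lower.std(ddof=1))

# Toy data only: 26 hypothetical populations and 50 random "SNPs".
rng = np.random.default_rng(0)
toy_freqs = rng.uniform(0.05, 0.95, size=(26, 50))
n_pairs, mean_r, sd_r = pairwise_correlation_summary(toy_freqs)
print(n_pairs, round(mean_r, 3), round(sd_r, 3))

# The pair count reported in the text follows from N=7369 SNPs.
print(n_lower_triangle(7369))
```

With 7369 SNPs the pair count is 7369 × 7368 / 2 = 27,147,396, which matches the size of the lower triangle reported in the text.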

Correlation between polygenic scores and population IQ
The polygenic score computed using the 9 SNPs was highly correlated (r=0.9) with an estimate [6] of average population IQ (fig. 1). An empirical simulation was run using 819 polygenic scores (PS), each computed from a group of 9 SNPs taken from the random dataset. The average correlation between population IQ and the random polygenic scores was 0.22 (N=819). The slightly positive correlation can be interpreted as an effect of spatial/phylogenetic autocorrelation [1]. Indeed, the correlation between population IQ and the polygenic score of all the random SNPs (N=7369) was r=0.425, again suggesting the presence of phylogenetic autocorrelation. The increase in the correlation coefficients moving from a low (9) to a high (7k+) number of SNPs is due to the reduction in the noise associated with each SNP. Because the correlation coefficients were not normally distributed (fig. 2), z-score computation was not appropriate. Instead, the percentile corresponding to a correlation coefficient of r=0.9 was computed from the 819 random polygenic scores and found to be 99.9%, implying that the result is highly significant (produced only about 1 time out of 1000 using random sets of SNPs).
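The empirical null comparison above can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: the source does not say how the groups of 9 random SNPs were drawn, so disjoint groups from a shuffled SNP order are assumed here (leftover SNPs are dropped), and all function names and toy data are hypothetical.

```python
import numpy as np

def polygenic_score(freqs, snp_idx):
    """Polygenic score per population: mean frequency of the selected alleles."""
    return np.asarray(freqs, dtype=float)[:, snp_idx].mean(axis=1)

def random_score_correlations(freqs, iq, set_size=9, seed=0):
    """Correlate population IQ with scores built from disjoint random SNP sets.

    Assumption: SNP columns are shuffled and cut into consecutive groups of
    `set_size`; any incomplete final group is discarded.
    """
    freqs = np.asarray(freqs, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(freqs.shape[1])
    n_sets = len(order) // set_size
    rs = []
    for k in range(n_sets):
        idx = order[k * set_size:(k + 1) * set_size]
        rs.append(np.corrcoef(polygenic_score(freqs, idx), iq)[0, 1])
    return np.array(rs)

def empirical_percentile(null_rs, observed_r):
    """Percentage of null correlations strictly below the observed one."""
    return 100.0 * float(np.mean(np.asarray(null_rs) < observed_r))

# Toy data only: 26 hypothetical populations, 90 random SNPs, made-up IQ values.
rng = np.random.default_rng(1)
toy_freqs = rng.uniform(0.05, 0.95, size=(26, 90))
toy_iq = rng.normal(90.0, 10.0, size=26)
null_rs = random_score_correlations(toy_freqs, toy_iq)
print(len(null_rs), empirical_percentile(null_rs, 0.9))
```

A percentile is used instead of a z-score because it makes no assumption about the shape of the null distribution, which the text reports as non-normal.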

Partialling out spatial autocorrelation using multiple regression
Population IQ was regressed on the "random PS" (computed using the 7k+ random SNPs) and the 9 GWAS hits PS. The model was significant (F=43.06, p=5.649e-08, Adj. R²=0.793). The random PS had no predictive power (B=0.037), whereas the 9 GWAS hits PS had strong predictive power (Beta=0.884). A polygenic score computed from a higher number of SNPs should reduce the noise in the data. To test this, polygenic scores were created using different numbers of SNPs from the random SNPs dataset, and the correlation of each polygenic score with population IQ was computed. The data followed a logarithmic function (figure 3). The log regression model was compared to a linear model: the former had a much better fit to the data (Adjusted R-squared: 0.9388, F-statistic: 185.
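The two analyses in this section can be sketched as follows. The source does not give the software or the exact model specification, so this sketch assumes that the reported coefficients are standardized (z-scored variables) and that the log model is an ordinary least-squares fit of r on ln(number of SNPs). The function names and all data are hypothetical.

```python
import numpy as np

def standardized_betas(y, predictors):
    """OLS on z-scored variables; returns one standardized beta per predictor."""
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(y, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    # No intercept column is needed because every z-scored variable has mean 0.
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta

def adjusted_r2(y, x):
    """Adjusted R-squared of the one-predictor fit y = a + b*x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    r2 = 1.0 - resid.var() / y.var()
    n = len(y)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 2)

# Toy regression: y depends on the first predictor only.
rng = np.random.default_rng(2)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
y = 3.0 * x1 + 0.1 * rng.normal(size=30)
betas = standardized_betas(y, np.column_stack([x1, x2]))
print(betas)

# Toy log-versus-linear comparison with made-up correlations.
n_snps = np.array([9, 50, 100, 500, 1000, 5000], dtype=float)
toy_r = 0.1 + 0.05 * np.log(n_snps)
print(adjusted_r2(toy_r, np.log(n_snps)), adjusted_r2(toy_r, n_snps))
```

Putting both scores in the same model means each coefficient reflects what that score predicts once the other is held constant, which is how the random PS serves as a proxy for spatial autocorrelation.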

Figure 1. Correlation between population IQ and polygenic score.

Figure 2. Distribution of correlation coefficients (r population IQ x sets of nine random SNPs).

Figure 3. Relationship between number of SNPs and predictive power.