Preprint Communication Version 1 This version is not peer-reviewed

Unexplained Allele-Calling Errors May Account for Apparent Denisovan-Neanderthal F1 Genome

Version 1 : Received: 28 August 2018 / Approved: 29 August 2018 / Online: 29 August 2018 (04:50:36 CEST)

How to cite: Curtis, D. Unexplained Allele-Calling Errors May Account for Apparent Denisovan-Neanderthal F1 Genome. Preprints 2018, 2018080480 (doi: 10.20944/preprints201808.0480.v1).

Abstract

The recent report that DNA extracted from ancient bone must have come from the offspring of a female Neanderthal and a male Denisovan depends on the inference that the subject has a high level of heterozygosity for Neanderthal and Denisovan alleles across the genome. Here I point out that the relative frequencies of derived transversion polymorphisms vary markedly between the new specimen, Denisova 11, and two high-coverage Neanderthal genomes. In Denisova 11 the AC and CG polymorphisms are much commoner than the others and are almost twice as common as the AT polymorphism. In the high-coverage Neanderthal genomes the four types of transversion are about equally common, with the AT being slightly commoner than the others. These results suggest that allele-calling errors are frequent and that this may provide an alternative explanation for the observed heterozygosity.

Subject Areas

Neanderthal; Denisova; genome; allele

Comments (4)

Comment 1
Received: 30 August 2018
Commenter: Benjamin Vernot
Commenter's Conflict of Interests: I am an author on the manuscript critiqued in the preprint.
Comment: This preprint concludes, based on Table S5.2, that relative allele counts are different between Denisova 11 and the high coverage Neandertals - specifically that AC and GC polymorphisms are too frequent in Denisova 11 - and that this difference could erroneously cause us to identify Denisova 11 as a Neandertal/Denisovan F1. Unfortunately, this is a misinterpretation of this table, and of the evidence for the major conclusions regarding Denisova 11 made in the manuscript.

(1) The numbers in Table S5.2 are not raw allele counts, but rather maximum likelihood estimates of the frequency of certain genotypes in the 4.2Mb where Denisova 11 carries two alleles of Neandertal ancestry. For Denisova 11, these estimates are made on 2-fold coverage data, while for the high-coverage Neandertals they are made on >30-fold coverage data. Such estimates are naturally more accurate in high-coverage data than low-coverage data, so the fact that there are differences between the estimates in Denisova 11 and in the high-coverage archaics in Table S5.2 is not surprising. A comparison demonstrating this is done in Table S5.1, where Vindija 33.19 is down-sampled to 2x, and the results compared to the 30x genome. Indeed, *exactly* the same pattern is shown, where AC and GC polymorphisms become much more frequent in the 2x estimates than the 30x estimates. The conclusion is therefore simply that such maximum likelihood estimates are less accurate in low-coverage data than high-coverage data. We acknowledge this limitation in Supplementary Section 5 by writing that "estimated frequencies are only approximate". Additionally, these estimates are used only when discussing the heterozygosity of the 4.2Mb of homozygous Neandertal DNA in Denisova 11 - they are not used for the main (or any other) conclusions in the manuscript.

(2) The data presented in the table are restricted to just the 4.2Mb of the genome where Denisova 11 has homozygous Neandertal ancestry. Since substantial local variation in allele proportions is to be expected, these regions are not necessarily representative of genome-wide data. If we look at all sites on chromosome 1 where two sampled DNA fragments from Denisova 11 differ, and calculate the relative proportion of the six allele combinations, we arrive at:
0.014, 0.470, 0.015, 0.016, 0.470, 0.014 for AC, AG, AT, CG, CT and GT, respectively. You can see that if we expand our analysis beyond 4.2Mb of the genome, and consider the allele counts that were used for the main conclusions of the paper, the proportions of all transversion allele pairs are stable.

Further, even assuming that we incorrectly inflated two out of four classes of transversion polymorphisms by ~56%, this would only increase the heterozygosity estimate by ~28% - far lower than the 4x increase observed in Denisova 11.
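The ~28% figure follows from simple arithmetic. A minimal Python sketch, assuming (for illustration only) that the four transversion classes start out equally frequent and that transversion-based heterozygosity scales with their summed counts; the function name and its parameters are hypothetical and do not come from the manuscript:

```python
def relative_increase(n_classes: int, n_inflated: int, inflation: float) -> float:
    """Fractional increase in a summed count when some of several
    equally sized classes are inflated by a given fraction.

    Assumption: every class contributes the same baseline amount (1.0).
    """
    baseline = n_classes * 1.0
    inflated = n_inflated * (1.0 + inflation) + (n_classes - n_inflated) * 1.0
    return inflated / baseline - 1.0


# Two of the four transversion classes inflated by ~56%.
increase = relative_increase(n_classes=4, n_inflated=2, inflation=0.56)
print(f"{increase:.0%}")  # 28%
```

Under these assumptions the total rises by a factor of (2 × 1.56 + 2) / 4 = 1.28, which is well short of a four-fold increase.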

the authors of the Denisova 11 manuscript
Comment 2
Received: 30 August 2018
Commenter: David Curtis
The commenter has declared there is no conflict of interests.
Comment: Thanks Benjamin. Here is the reply which I have emailed:

Hi.

Thank you for your prompt response.

It seems like you have some unpublished results which would help people to assess the robustness of your findings.

The results for Vindija 33.19 do indeed show that the allele-calling from the 2-fold data is wrong. If I sample two alleles at random from the high-coverage genome, there is no reason why this should lead to a change in the relative proportions of the genotypes. But this seems to be what you observe, with a systematic increase in AC and CG in both the simulated calls and in Denisova 11. As you say, exactly the same pattern is shown. I would not describe these estimates as being "approximate" but as being systematically biased, in the way you describe. So far as I can see, this must be a problem with snpAD and the error model it is implementing. Perhaps the error model introduces these biases.
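The expectation stated here, that randomly sampling two alleles should not shift the relative genotype proportions, can be illustrated with a small simulation. This is only a sketch under idealised assumptions: reads are error-free, both alleles at a heterozygous site are equally likely to be sampled, and the "true" proportions below are hypothetical values chosen for illustration, not data from the paper. It does not model snpAD or ancient DNA damage.

```python
import random
from collections import Counter

# Hypothetical proportions of heterozygous genotypes in a high-coverage genome.
TRUE_PROPORTIONS = {
    "AC": 0.08, "AG": 0.34, "AT": 0.08,
    "CG": 0.08, "CT": 0.34, "GT": 0.08,
}


def sample_two_reads(true_proportions, n_sites, seed=0):
    """Draw two error-free reads per heterozygous site and return the
    genotype proportions among sites where the two reads differ."""
    rng = random.Random(seed)
    genotypes = list(true_proportions)
    weights = [true_proportions[g] for g in genotypes]
    observed = Counter()
    for genotype in rng.choices(genotypes, weights=weights, k=n_sites):
        first = rng.choice(genotype)   # each allele sampled with probability 1/2
        second = rng.choice(genotype)
        if first != second:            # heterozygosity is only visible if the reads differ
            observed[genotype] += 1
    total = sum(observed.values())
    return {g: observed[g] / total for g in genotypes}


sampled = sample_two_reads(TRUE_PROPORTIONS, n_sites=200_000)
for g, p in TRUE_PROPORTIONS.items():
    print(f"{g}: true {p:.3f}, sampled {sampled[g]:.3f}")
```

About half of the heterozygous sites are missed, but the loss is the same for every genotype, so the relative proportions are preserved up to sampling noise. Any systematic shift would therefore have to come from something this sketch leaves out, such as sequencing errors, damage patterns, or the genotype-calling model.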

If one knows that these calls are not to be relied upon, it doesn't make sense to me to present the comparison in Table 5.2 or to say in the text that they are "more similar to the frequency of transversion differences between the two high-coverage Neanderthal genomes".

I had studied Figure 5.2, which shows that in general the low coverage does not result in a marked increase in transversion calls. It possibly does for CG but certainly not for AC. So it's hard to know why these biases are occurring just in the Neanderthal ancestry regions. It would be easier to interpret these findings if there were more information on the actual numbers of sites they were based on, rather than just proportions. And confidence limits on the estimates might also be informative.

Thank you for informing me of the results for chromosome 1. You say that the proportions of transversion allele pairs are stable, by which I think you might mean that they are about equal. I'm not sure if these are based on actual counts or whether they are estimates generated by snpAD. For me, these results introduce a new problem, which is that fully 94% of the heterozygous calls are AG or CT. Put another way, these genotypes are each about 30 times commoner than the transversion genotypes. This seems a very striking discrepancy, although I appreciate that this is characteristic of ancient DNA. I am not sure how this result is compatible with the frequencies shown in Supplementary Figure 5.3, which seem to show that these genotypes are only about four times commoner than the others.
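The two figures quoted here (94% and roughly 30-fold) can be checked directly from the chromosome 1 proportions given in Comment 1. A minimal Python sketch; the variable names are illustrative, and the comparison uses the mean of the four transversion classes as the denominator, which is one reasonable reading of "about 30 times commoner":

```python
# Relative proportions of allele pairs on chromosome 1, as quoted in Comment 1.
chr1 = {"AC": 0.014, "AG": 0.470, "AT": 0.015, "CG": 0.016, "CT": 0.470, "GT": 0.014}

TRANSITIONS = ("AG", "CT")

# Share of all differing sites that are transitions (AG or CT).
transition_share = sum(chr1[g] for g in TRANSITIONS)

# Mean proportion of a single transversion class (AC, AT, CG, GT).
transversions = [p for g, p in chr1.items() if g not in TRANSITIONS]
transversion_mean = sum(transversions) / len(transversions)

# How much commoner one transition class is than an average transversion class.
ratio = chr1["AG"] / transversion_mean

print(f"transition share: {transition_share:.0%}")
print(f"AG vs mean transversion: {ratio:.1f}x")
```

With these inputs the transition share is 94% and each transition class is a little over 30 times as common as an average transversion class.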

I thus remain uncertain that the allele calls are sufficiently reliable. The possible magnitude of any problems is hard to judge. For example, if the change in genotype proportions reflects errors in the process, then one cannot rule out that there is also a bias towards increased heterozygosity for all genotypes.

Best wishes

- Dave
Comment 3
Received: 4 September 2018
Commenter: Benjamin Vernot
Commenter's Conflict of Interests: We are authors of the original paper.
Comment: We reiterate that:
- Genome-wide, all transversions occur at the same rate among alleles from randomly sampled reads. The conclusion that this individual is an F1 is based on these genome-wide transversion allele counts.
- You are conflating these raw counts with maximum likelihood estimates of the genotypes for the small part of the genome (4.2Mb) where Denisova 11 inherited Neandertal ancestry from both her mother and father.

We have made all data used in this paper public. With respect to your comment on "unpublished results": we assume you are referring to the allele count table we provided in our response to you. Such a table can easily be computed from the full data, available at:
https://www.ebi.ac.uk/ena/data/view/ERA1193240
Viviane, Fabrizio, Benjamin, Janet, Kay, Svante
Response 1 to Comment 3
Received: 5 September 2018
Commenter: David Curtis
The commenter has declared there is no conflict of interests.
Comment: Hi.

Yes, apologies if I wasn't clear. The "unpublished data" I was referring to were the allele combination frequencies you provided for chromosome 1. These show that AG and CT are thirty times as common as the others. However, in Figure 5.3, using the simulated low-coverage results for Vindija, the figure is more like ten times (not four times as I mistakenly wrote before). This seems like a pretty marked discrepancy and certainly makes me wonder how accurate the allele calls really are.

I don't see why there should be a systematic variation in the Neanderthal derived regions. I'm also a bit doubtful that the

As I said previously, it would be easier to judge this if we saw actual counts rather than just frequencies.

I'm sure it's easy to compute the table from the raw data if one has the relevant scripts. However, as far as I am aware these are not publicly available.

Best wishes

- Dave
