Guideline for genome transposon annotation derived from evaluation of popular TE identification tools

Background: Transposable elements (TEs) make up a large fraction of many eukaryotic genomes and display extreme diversity, with thousands of families. Given this abundance and diversity, TE discovery and annotation are challenging. At present, tools and databases build libraries to mask TEs in genomes using de novo-based and homology-based identification strategies, but no consensus criteria have been proposed on which tools should be used.
 
Results: For the de novo-based strategy, we compared the performance of TE libraries built by four commonly used tools, RepeatModeler, LTR_FINDER, LTRharvest, and MITE_Hunter, using a simulated genome as a standard control. The performance of RepeatModeler decreased when it was combined with either LTR_FINDER or LTRharvest. The combination of RepeatModeler and MITE_Hunter performed better than either tool alone. For the homology-based strategy, we evaluated library sources from a taxonomic point of view to build an accurate TE library. When a database library was selected to identify TEs in the Arabidopsis thaliana genome, a library from a genus genetically closer to Arabidopsis performed better than libraries from more distant genera. When the Arabidopsis library was excluded, the combination of the three genera closest to Arabidopsis performed better than the combination of all genera.
 
Conclusion: This study proposes a series of recommendations for accurate TE annotation: 1) for the de novo-based strategy, RepeatModeler and MITE_Hunter are suggested for building a TE library; 2) for the homology-based strategy, a library from a genus genetically close to the target species is recommended over a combined library from all genera.


These tools help non-specialists identify and annotate TEs easily, but studies have used different strategies and tools to identify TEs in new genomes. We collected 58 plant genome sequencing studies from 2019 (Table S1). Thirteen studies used only the de novo-based strategy to [...] (Table S1 and S2). In the homology-based strategy, nearly half of the studies (48%; 28/58) used all TE libraries, and eight studies used species- or genus-specific libraries from RepBase (Table S2). Taken together, no consensus criteria have been established for developing TE libraries. A guideline is needed for generating a high-quality TE library to accurately mask TEs in genomes.

In the de novo-based strategy, we evaluated the performance of the four tools most frequently used in the collected studies: RepeatModeler, LTR_FINDER, LTRharvest, and MITE_Hunter (Table S2). To evaluate the specificity and sensitivity of these tools, we developed a simulated genome with randomly inserted TEs. PILER was the fifth most frequently used tool, appearing in nine studies (Table S2). It was not included in our evaluation, since one of its dependencies, the PALS tool, is no longer

1%, 5%, and 10% of the total nucleotides for each copy that underwent random mutations. The deletion and insertion changes were set at the 1%, 5%, and 10% levels, similar to the single nucleotide changes. Different combinations of these mutation types were also generated (Table S3). A total of 3,780 TEs with mutations were generated for each copy type. These TEs were randomly inserted into the clean genome to form a simulated genome (Table S3; Figure 1a). Target
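The mutation and insertion steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `mutate` and `insert_copies`, the way mutated positions are sampled, and the half-open coordinate convention are all hypothetical choices made for the example.

```python
import random

BASES = "ACGT"

def mutate(seq, sub_rate=0.0, del_rate=0.0, ins_rate=0.0, rng=random):
    """Return a mutated copy of seq (assumed scheme, for illustration only).

    Each rate is the fraction of positions affected, e.g. 0.01, 0.05, or 0.10.
    A deleted position is dropped, a substituted position gets a different
    base, and an insertion adds one random base after the chosen position.
    """
    n = len(seq)
    positions = list(range(n))
    subs = set(rng.sample(positions, round(n * sub_rate)))
    dels = set(rng.sample(positions, round(n * del_rate)))
    inss = set(rng.sample(positions, round(n * ins_rate)))
    out = []
    for i, base in enumerate(seq):
        if i in dels:
            pass  # deletion takes precedence over substitution
        elif i in subs:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if i in inss:
            out.append(rng.choice(BASES))
    return "".join(out)

def insert_copies(genome, copies, rng=random):
    """Insert TE copies at random points of a TE-free genome.

    Returns the simulated genome and the true TE intervals as half-open
    (start, end) coordinates on the simulated genome, in the same order
    as `copies`.
    """
    points = sorted(rng.randrange(len(genome) + 1) for _ in copies)
    pieces, truth, prev, offset = [], [], 0, 0
    for point, te in zip(points, copies):
        pieces.append(genome[prev:point])
        offset += point - prev          # length of host sequence added so far
        truth.append((offset, offset + len(te)))
        pieces.append(te)
        offset += len(te)
        prev = point
    pieces.append(genome[prev:])
    return "".join(pieces), truth
```

Passing a seeded `random.Random` instance as `rng` makes a simulated genome reproducible, and the returned intervals serve as the ground truth against which predicted TE locations can be scored.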

Evaluation of tool performances

The tested tools were used to predict TE locations in the simulated genome (Figure 1a). We

The Ref-based method was also added to these tools (Figure S1). The Ref-based method identifies TEs using a TE library that consists of the initial TEs without any duplication. The Ref-based method alone showed a significant (p < 0.05) increase in precision and specificity from one to ten copy times. When it was combined with RepeatModeler and LTRharvest, precision and specificity decreased significantly (p < 0.05) from one to ten copy times (Figure S1c, d).
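To make the reported scores concrete, the sketch below computes base-level precision, specificity, sensitivity, accuracy, and false negative rate from true and predicted TE intervals. The formulas are the standard confusion-matrix definitions and are an assumption here, since the exact definitions used in the study are not given in this passage (the study reports sensitivity and TPR separately, so its definitions may differ). The function names and the half-open interval convention are hypothetical.

```python
def covered(intervals, length):
    """Mark every base covered by at least one half-open (start, end) interval."""
    mask = bytearray(length)
    for start, end in intervals:
        start, end = max(0, start), min(length, end)  # clip to the genome
        if end > start:
            mask[start:end] = b"\x01" * (end - start)
    return mask

def confusion(truth, predicted, length):
    """Count base-level true/false positives and negatives."""
    t, p = covered(truth, length), covered(predicted, length)
    tp = sum(1 for a, b in zip(t, p) if a and b)
    fp = sum(1 for a, b in zip(t, p) if not a and b)
    fn = sum(1 for a, b in zip(t, p) if a and not b)
    tn = length - tp - fp - fn
    return tp, fp, tn, fn

def scores(tp, fp, tn, fn):
    """Standard definitions (assumed); undefined ratios are returned as NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "FNR": ratio(fn, tp + fn),
    }
```

For example, a single true TE at bases 10 to 20 and a prediction at bases 15 to 25 on a 100-base genome give five true positives, five false positives, and five false negatives, so sensitivity and precision are both 0.5 while accuracy is 0.9. This illustrates why accuracy alone can look favorable when TEs occupy a small share of the sequence.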

Accuracy and sensitivity scores increased significantly (p < 0.05) from one to ten copy times in RepeatModeler, but accuracy decreased from ten to 25 copy times (Figure S2a, b). In MITE_Hunter, precision and specificity increased significantly (p < 0.05) from one to ten copy [...] RepeatModeler showed decreasing precision from ten to 25 copy times (Figure S2c).

MITEs and LTRs are two major families of TEs, and most tools were developed to identify these two families (Table S1). We evaluated the performance of RepeatModeler, LTR_FINDER, and LTRharvest for LTR detection, and of RepeatModeler and MITE_Hunter for MITE detection.

For detecting LTRs, RepeatModeler had the highest evaluation scores compared with LTR_FINDER and LTRharvest under the one and ten TE copy types (Figure 3a, b), while in the 25 TE copy type, [...] (Table S2). To clarify which strategy could achieve better prediction, we evaluated the performance of the TE library from each of 22 plant genera (Table S4) and used A. thaliana as the reference genome.

The Arabidopsis library showed the highest sensitivity and TPR and the lowest FNR. It outperformed the combined library of all genera (Figure 5a, b, c). Medicago, Triticum, Malus, and Solanum performed better than the other genera. Chlamydomonas displayed the worst performance. Gossypioides had a lower FNR than nine genera, but it had lower sensitivity and TPR than 20 other genera (Figure 5a, b, c).

To test the hypothesis that the closer the genetic background of a genus is to Arabidopsis, the better

Based on the performance of the four common TE identification tools, we synthesized two recommendations for the de novo-based strategy:

We evaluated the performance of four commonly used tools for identifying TEs in genomes: RepeatModeler, LTR_FINDER, LTRharvest, and MITE_Hunter. A simulated sequence with randomly inserted, mutated TEs was constructed as a reference for evaluating metrics such as precision and sensitivity for these tools. To build an accurate TE library for novel genomes using the homology-based method, we also evaluated different library sources from a taxonomic point of view. Based on the evaluation results, we provide a series of recommendations for accurate TE annotation and propose a guideline for developing a comprehensive TE library.
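As an illustration of the recommendation to combine de novo libraries, the following sketch merges the FASTA output of two tools into a single library. It is only one possible approach under stated assumptions: the function names, the `label|header` naming scheme, and the example headers in the usage note are hypothetical, and only exact duplicate sequences are removed. Near-identical or reverse-complemented consensus sequences would need a dedicated clustering step that is not shown here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def merge_libraries(sources):
    """Merge several TE libraries into one list of (header, sequence) records.

    `sources` maps a tool label to an iterable of FASTA lines (an open file
    works). Headers are prefixed with the label so the origin of each entry
    stays traceable; exact duplicates (case-insensitive) are kept only once.
    """
    seen, merged = set(), []
    for label, lines in sources.items():
        for header, seq in read_fasta(lines):
            key = seq.upper()
            if key in seen:
                continue
            seen.add(key)
            merged.append((f"{label}|{header}", seq))
    return merged
```

In practice, `sources` could be built from open file handles, for example `{"RepeatModeler": open(path_a), "MITE_Hunter": open(path_b)}`, where `path_a` and `path_b` are placeholders for the consensus files produced by each tool. The merged records can then be written out as one FASTA file and supplied to a masking tool as a custom library.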