LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
For LTR candidates found on the negative strand, the locus is now presented 5' -> 3', the same as for candidates found on the positive strand. For example, Chr1:7890..3456 indicates that the candidate is on the - strand. This information is shown in the first column of the pass.list file, the last column of the gff3 file, and the sequence names of the intact.fa file. If the element is on the - strand, its sequence in the intact.fa file is shown 5' -> 3' from the negative strand. For example, the sequence of Chr1:7890..3456 is the reverse complement of the sequence of Chr1:3456..7890. For candidates without strand information (i.e., those lacking coding sequence), the strand is assumed to be positive for convenience.
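The locus convention described above can be illustrated with a minimal Python sketch. It assumes 1-based inclusive coordinates and a `Chr:start..end` string. The names `Locus`, `parse_locus`, `reverse_complement`, and `extract` are hypothetical helpers written for illustration and are not part of LTR_retriever.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    seqid: str
    start: int   # smaller coordinate (1-based, inclusive; an assumption)
    end: int     # larger coordinate (1-based, inclusive; an assumption)
    strand: str  # "+" or "-"


# Matches strings such as "Chr1:7890..3456".
_LOCUS_RE = re.compile(r"^(?P<seqid>.+):(?P<a>\d+)\.\.(?P<b>\d+)$")

# Translation table for complementing DNA bases, including N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_locus(locus: str) -> Locus:
    """Infer the strand from coordinate order: first > second means '-'."""
    m = _LOCUS_RE.match(locus)
    if m is None:
        raise ValueError(f"unrecognized locus string: {locus!r}")
    a, b = int(m["a"]), int(m["b"])
    strand = "-" if a > b else "+"
    return Locus(m["seqid"], min(a, b), max(a, b), strand)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract(chrom_seq: str, locus: Locus) -> str:
    """Return the element 5' -> 3' on its own strand."""
    sub = chrom_seq[locus.start - 1:locus.end]
    return reverse_complement(sub) if locus.strand == "-" else sub


if __name__ == "__main__":
    print(parse_locus("Chr1:7890..3456"))
    print(parse_locus("Chr1:3456..7890"))
```

Under these assumptions, `Chr1:7890..3456` and `Chr1:3456..7890` cover the same interval and differ only in the inferred strand, so `extract` returns sequences that are reverse complements of each other.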
Update get_range.pl
Fix bug #153, introduced in v2.9.4 when adding TEsorter to classify LTR candidates.
Add TEsorter to help identify non-LTR sequences. A candidate LTR is classified as "false" if it contains non-LTR HMM profile matches, even if the candidate contains LTR/TSD and the TGCA motif. This purging removes a small number of structurally intact LTR candidates (5/2304 in rice). This implementation offers slight improvements over older versions, and the improvement should be more significant for larger genomes.
LTR_retriever-harvest_FINDER | sens | spec | accu | prec | FDR | F1 |
---|---|---|---|---|---|---|
retriever_v2.5 | 0.967 | 0.920 | 0.931 | 0.789 | 0.211 | 0.869 |
retriever_v2.6 | 0.963 | 0.931 | 0.939 | 0.811 | 0.189 | 0.881 |
retriever_v2.9.2 | 0.966 | 0.926 | 0.935 | 0.802 | 0.198 | 0.876 |
retriever_v2.9.4 | 0.967 | 0.928 | 0.937 | 0.804 | 0.196 | 0.878 |
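The columns in the table above are related to one another, which the following sketch makes explicit. It assumes the standard confusion-matrix definitions of each metric. The release notes do not state how the counts were obtained, so `classification_metrics` and `f1_from` are illustrative helpers, not the benchmarking code used for these numbers.

```python
def classification_metrics(tp: float, fp: float, tn: float, fn: float) -> dict:
    """Standard definitions (assumed) of the metrics named in the table."""
    sens = tp / (tp + fn)                    # sensitivity (recall)
    spec = tn / (tn + fp)                    # specificity
    accu = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    prec = tp / (tp + fp)                    # precision
    fdr = fp / (tp + fp)                     # false discovery rate = 1 - prec
    f1 = 2 * prec * sens / (prec + sens)     # harmonic mean of prec and sens
    return {"sens": sens, "spec": spec, "accu": accu,
            "prec": prec, "FDR": fdr, "F1": f1}


def f1_from(sens: float, prec: float) -> float:
    """F1 computed directly from sensitivity and precision."""
    return 2 * prec * sens / (prec + sens)


# (sens, prec, FDR, F1) copied from the table above.
TABLE = {
    "retriever_v2.5":   (0.967, 0.789, 0.211, 0.869),
    "retriever_v2.6":   (0.963, 0.811, 0.189, 0.881),
    "retriever_v2.9.2": (0.966, 0.802, 0.198, 0.876),
    "retriever_v2.9.4": (0.967, 0.804, 0.196, 0.878),
}

if __name__ == "__main__":
    for name, (sens, prec, fdr, f1) in TABLE.items():
        # Recompute FDR and F1 from the rounded sens/prec values.
        print(name, round(1 - prec, 3), round(f1_from(sens, prec), 3))
```

Because the table values are rounded to three decimals, the recomputed F1 can differ from the reported one in the last digit.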
Add more filtering parameters to identify solo LTRs, improve the solo-intact ratio calculation (#111, #110).
Resolve RMblast errors that occur when it attempts to overutilize CPUs (#137).
Andreas Wallberg, @shokusei, Evan Ernst, @xie-wei-hh, @with9, and users like YOU!
This version has many improvements in the downstream outputs including:
Reformat the GFF3 output of intact and whole-genome LTR sequences following the standard GFF3 guideline.
Change to use the env default Perl and make shebang lines more consistent. #68
Fix inconsistent total LTR summary. #66
Remove precompiled trf from the package.
I recently identified a bug that dropped intact LTR elements whose two LTRs differ in length by more than 15 bp due to InDels. After manual checks, I determined that these are still high-quality intact elements and now salvage them in the output. This marginally improves sensitivity, especially for genomes with limited LTR sequences (e.g., Arabidopsis, ~7%); the margin decreases for genomes with decent amounts of LTRs, such as rice (~25%) and maize (~75%), because the abundance of intact elements is already sufficient to construct a comprehensive library. However, the number of intact LTR elements could increase by 10-20% compared to the last version (v2.7), which has some positive effect on the calculation of LAI. Some benchmarking results:
Arabidopsis (TAIR10) | v1.x | v2.0 | v2.8 |
---|---|---|---|
Sensitivity | 90.70% | 90.90% | 95.04% |
Specificity | 99.00% | 99.00% | 98.88% |
Accuracy | 98.50% | 98.50% | 98.64% |
Precision | 86.60% | 86.50% | 84.99% |
Rice (MSUv7) | v1.x | v2.0 | v2.5 | v2.8 |
---|---|---|---|---|
Sensitivity | 95.00% | 95.30% | 96.30% | 96.71% |
Specificity | 95.00% | 94.60% | 94.00% | 93.87% |
Accuracy | 95.00% | 94.80% | 94.50% | 94.54% |
Precision | 85.40% | 84.50% | 83.10% | 83.09% |
I am excited to release this much faster version of LTR_retriever. The multithreading module had been slowing down the program, and I finally got the chance to improve it. This part of the update does not change the program's output, since it is just a more efficient implementation of parallel computation.
In a test on the 14.5 Gb bread wheat genome, a total of 941,338 raw LTR candidates were processed and a non-redundant library was generated. This process took only 8 days, 3 hours, and 31 minutes with the current version (v2.7) using 10 threads (-threads 10), whereas the last version (v2.6) would have required 3 weeks.
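The runtime figures above can be turned into a rough speedup and throughput estimate. This is a minimal sketch that takes the "3 weeks" figure for v2.6 at face value, so the results are approximate. The variable names are chosen for illustration only.

```python
from datetime import timedelta

# Runtimes quoted in the release notes for the bread wheat test.
runtime_v27 = timedelta(days=8, hours=3, minutes=31)
runtime_v26 = timedelta(weeks=3)  # approximate figure from the notes

raw_candidates = 941_338  # raw LTR candidates processed

# Dividing one timedelta by another yields a plain float ratio.
speedup = runtime_v26 / runtime_v27

# Throughput of v2.7 in candidates per hour.
hours_v27 = runtime_v27 / timedelta(hours=1)
candidates_per_hour = raw_candidates / hours_v27

if __name__ == "__main__":
    print(f"approximate speedup: {speedup:.2f}x")
    print(f"approximate throughput: {candidates_per_hour:.0f} candidates/hour")
```

With these inputs the speedup comes out at roughly 2.6-fold, which should be read as an order-of-magnitude guide rather than a benchmark.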
Three more programs are supported by LTR_retriever. Users need to convert candidates identified by these programs into the LTRharvest format with scripts located in the /bin folder:
convert_ltr_struc.pl
convert_MGEScan3.0.pl
convert_ltrdetector.pl
Then feed them to LTR_retriever with -inharvest. You may concatenate multiple LTRharvest format input files together.
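Concatenating several LTRharvest-format files can be done with any tool. The sketch below shows one way in Python. The helper `concatenate_inputs` and the file names in the usage example are hypothetical, and the only LTR_retriever option referred to is -inharvest, as described above.

```python
from pathlib import Path
from typing import Iterable


def concatenate_inputs(inputs: Iterable[str], output: str) -> int:
    """Write the contents of each input file to `output`, in order.

    Returns the number of input files written. A trailing newline is
    added to any non-empty file that lacks one, so that records from
    consecutive files never end up on the same line.
    """
    count = 0
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names: outputs of the conversion scripts plus
    # an existing LTRharvest result. Only files that exist are merged.
    candidates = ["genome.harvest.scn", "genome.ltr_struc.scn"]
    existing = [p for p in candidates if Path(p).exists()]
    n = concatenate_inputs(existing, "genome.combined.scn")
    print(f"merged {n} file(s) into genome.combined.scn")
```

The combined file can then be supplied to LTR_retriever with -inharvest.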
Note: You won't find many intact LTR elements from LTR_STRUC and LtrDetector outputs due to the fuzzy sequence boundaries these programs provide. So please use these two as supplements to other inputs.
bin/solo_finder.pl and bin/intact_finder_coarse.pl (#41)
Users can recover interrupted runs from a number of major checkpoints. This is particularly useful when a run of LTR_retriever on a huge genome (e.g., common wheat) gets interrupted (for example, when the job is killed due to a walltime limit). Use LTR_retriever -h for further information.
Previous versions would remove nested insertion of solo LTRs. However, when a full element is nested in a library sequence, the internal region of the nesting element won't be removed, causing sequence mosaics and library redundancy. In this update, a new module is developed to clean up composite sequences caused by full-element nesting. This update was inspired by Mr. Robert Hubley's report.
The current version shows a slight decrease in accuracy with a marginal gain in sensitivity. This is likely because the removal of nested sequences slightly shifts the annotation dynamics of RepeatMasker. Nevertheless, no extra sequence is added in this process; instead, it removes up to 60% of library sequences (e.g., in common wheat) that are redundant due to nested full-element insertions.
Rice (MSUv7) | v1.x | v2.0 | v2.5 |
---|---|---|---|
Sensitivity | 95.0% | 95.3% | 96.3% |
Specificity | 95.0% | 94.6% | 94.0% |
Accuracy | 95.0% | 94.8% | 94.5% |
Precision | 85.4% | 84.5% | 83.1% |