LTR_retriever is a highly accurate and sensitive program for the identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
In particular, v1.x and v2.0 compare as follows on two benchmark genomes:

Rice (MSUv7) | v1.x | v2.0 |
---|---|---|
Sensitivity | 95.0% | 95.3% |
Specificity | 95.0% | 94.6% |
Accuracy | 95.0% | 94.8% |
Precision | 85.4% | 84.5% |

Arabidopsis (TAIR10) | v1.x | v2.0 |
---|---|---|
Sensitivity | 90.7% | 90.9% |
Specificity | 99.0% | 99.0% |
Accuracy | 98.5% | 98.5% |
Precision | 86.6% | 86.5% |
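
For readers unfamiliar with the four metrics in the tables, the sketch below shows their standard confusion-matrix definitions. How the benchmark counts true and false positives (for example per base pair or per element) is not stated in this section, so the function name and the input counts are hypothetical and are not taken from the tables.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the four metrics as fractions, using the standard definitions."""
    return {
        "sensitivity": tp / (tp + fn),                # share of real positives found
        "specificity": tn / (tn + fp),                # share of real negatives rejected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # share of all calls that are right
        "precision": tp / (tp + fp),                  # share of positive calls that are right
    }


# Made-up counts for illustration only (not benchmark data).
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```
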
perl ./LTR_retriever/bin/fam_coverage.pl TE_lib RM_output genome_size_bp > TE_fam.size.list
perl ./LTR_retriever/bin/fam_summary.pl TE_fam.size.list genome_size_bp > TE_fam.sum.txt
perl ./LTR_retriever/bin/LTR_sum.pl -genome genome.fa -all genome.fa.RM.out [options]
perl ./LTR_retriever/bin/simulate_mutation.pl -g genome.fasta -u [0-1] > genome.mutated.fasta
perl ./LTR_retriever/bin/make_gff3.pl genome.fa.RepeatMasker.out > genome.fa.RepeatMasker.gff
New features
Bugs fixed
New feature: The LTR-RT Assembly Index (LAI) for evaluating genome assembly continuity.
Description: LTR retrotransposons (LTR-RTs) are very difficult to assemble because of their repetitive nature (they make up as much as 75% of a genome, e.g., maize) and their length (up to 20 kb). The LAI rests on a simple idea: a more continuous assembly should contain more intact LTR-RTs. This module calculates the LAI from the list of intact LTR-RTs and the whole-genome LTR-RT annotation produced by LTR_retriever (*.pass.list and *.out, respectively). A window-based calculation is implemented to estimate regional continuity. A manuscript describing this feature is in preparation.
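
The window-based idea can be sketched roughly as follows. This is a minimal illustration, not the module's code: it assumes the raw index is the percentage of LTR-RT sequence contributed by intact elements, it treats coordinates as 1-based inclusive intervals on a single sequence, and the function names, window size, and step size are all hypothetical. Any further correction the tool applies is not covered here.

```python
def covered_bp(intervals, lo, hi):
    """Base pairs of `intervals` (1-based, inclusive) that fall inside [lo, hi]."""
    total = 0
    for start, end in intervals:
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0:
            total += overlap
    return total


def windowed_raw_lai(intact, all_ltr, seq_len, window, step):
    """Return a list of (window_start, window_end, raw_lai) along one sequence.

    Assumption: raw LAI = 100 * intact LTR-RT bp / total LTR-RT bp.
    Intervals within each list are assumed not to overlap one another.
    """
    results = []
    for lo in range(1, seq_len + 1, step):
        hi = min(lo + window - 1, seq_len)
        total = covered_bp(all_ltr, lo, hi)
        # Windows without any LTR-RT sequence get a score of 0.
        score = 100.0 * covered_bp(intact, lo, hi) / total if total else 0.0
        results.append((lo, hi, score))
        if hi == seq_len:  # the last window reached the end of the sequence
            break
    return results


# Toy coordinates for illustration only.
intact = [(101, 200)]
all_ltr = [(101, 200), (301, 400), (601, 700), (901, 1000)]
windows = windowed_raw_lai(intact, all_ltr, seq_len=1000, window=500, step=250)
for lo, hi, score in windows:
    print(lo, hi, score)
```

In this toy case only the first window contains an intact element, so it is the only window with a non-zero score.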
Other features:
Provides a workaround for the BLAST bug (described in v1.3 and in #4 and #3) that occurs under high CPU usage or resource over-allocation. Each BLAST attempt is checked and rerun up to 100 times if it returns an error status. This is not a complete fix, but such errors should no longer appear.
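
The check-and-rerun pattern described above might look like the sketch below. It is an illustration of the general technique only: `retry_until_success` and `run_command` are hypothetical names, and the sketch assumes that a non-zero exit status is what signals a failed attempt.

```python
import subprocess


def retry_until_success(run_once, max_attempts=100):
    """Call `run_once` until it returns exit status 0.

    Returns the number of attempts used, or raises RuntimeError after
    `max_attempts` failed attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if run_once() == 0:
            return attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts")


def run_command(cmd):
    """Run an external command once and return its exit status."""
    return subprocess.run(cmd, capture_output=True).returncode
```

A caller would wrap one BLAST invocation, for example `retry_until_success(lambda: run_command(blast_cmd))`, where `blast_cmd` stands for whatever argument list the caller builds.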
Several bugs have been reported since the last release. Most of them were fixed in this release.
BLAST engine error: Warning: Sequence contains no data or Warning: [blastn] Subject_1 chr:from..to|chr:from..to: Subject sequence contains no data
Some candidate sequences appeared to be empty during sequence-structure analysis. This can happen when more CPUs are allocated to the program than the system can provide. Some users reported that it also occurred even when plenty of CPUs were available. So far I have not been able to reproduce the second situation, so I don't know how to fix it. The good news is that LTR candidates of this kind are usually problematic: they tend to be false LTRs and will be screened out anyway. So if these are the only errors you see, you should be fine, and the results are still reliable.
It's strongly suggested that users provide the genome file with short sequence names, i.e., <15 characters. One of the LTR_retriever dependencies, RepeatMasker, has a name-length limit of about 40 characters. Because LTR_retriever uses the actual coordinates of LTR sequences, together with RepeatMasker-format annotations, as the library ID (e.g., Chr1:12345..13545_LTR#LTR/Gypsy), the space left for the chromosome name (i.e., "Chr1", not the full library sequence name) drops to 15-20 characters. To avoid frustrating errors caused by long sequence names, this release adds a module that attempts to convert long sequence names to shorter but still unique names automatically. The converted genome is saved with the ending ".mod", and the original genome is left untouched. This is a workaround in case users forget to do the conversion themselves or don't know how.
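
For users who prefer to shorten names themselves before running the program, one possible scheme is sketched below. It is not the conversion module's algorithm: the function name, the truncate-then-number strategy, and the 15-character default are assumptions for illustration, and the input names are assumed to be non-empty and distinct.

```python
def shorten_names(names, max_len=15):
    """Map each sequence name to a unique name of at most `max_len` characters."""
    mapping = {}
    used = set()
    for name in names:
        base = name.split()[0]  # drop any description after the first whitespace
        candidate = base[:max_len]
        counter = 1
        # On a collision, make room for a numeric suffix and try again.
        while candidate in used:
            suffix = f"_{counter}"
            candidate = base[:max_len - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping


# Made-up sequence names for illustration only.
names = [
    "Chr1",
    "scaffold_0001_length_123456_cov_20",
    "scaffold_0001_length_654321_cov_18",
]
mapping = shorten_names(names)
for old, new in mapping.items():
    print(old, "->", new)
```

Keeping the old-to-new mapping makes it possible to translate coordinates in the results back to the original names.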
This version provides more detailed stdout with timestamps. The timestamps help users keep track of progress, and the stdout helps with debugging. If any error message is printed to stdout, please copy it together with the timestamped lines immediately before and after it, and paste them into a new issue. This information helps to pinpoint the problem quickly.
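
If the stdout was saved to a file, the lines around each error can be pulled out with a small helper like the one below. This is only a convenience sketch: the function name, the keyword, and the sample log lines are made up, and the real stdout format may differ.

```python
def error_context(lines, keyword="error", before=2, after=2):
    """Return every line containing `keyword` (case-insensitive) plus its neighbours."""
    keep = set()
    for i, line in enumerate(lines):
        if keyword.lower() in line.lower():
            first = max(0, i - before)
            last = min(len(lines), i + after + 1)
            keep.update(range(first, last))
    return [lines[i] for i in sorted(keep)]


# Made-up log lines for illustration only.
log = [
    "[t1] step A started",
    "[t2] step A finished",
    "[t3] step B started",
    "ERROR: something failed",
    "[t4] step B finished",
    "[t5] step C started",
]
excerpt = error_context(log, before=1, after=1)
print("\n".join(excerpt))
```
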