LTR Retriever Versions

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.

v2.0

5 years ago

This version fixes several minor bugs reported previously (i.e., #6, #7, #26, #28, #29).

In particular,

  1. Fix the bug 'substr sequences out of range' when a candidate is located at the boundary of a contig.
  2. Fix the bug that sometimes produced slightly different results when using both LTRharvest and LTR_FINDER inputs.
  3. Fix the bias toward identifying TGCA motifs over non-TGCA motifs and improve TSD identification.
  4. Improve detection/filtering sensitivity for LINE/DNA transposases and plant proteins.
  5. Remove short sequences (<100 bp) from the final library.
  6. Update README and citations.
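
Item 5 amounts to a length filter over the final library. A minimal sketch in Python, assuming a FASTA-formatted library held in memory; the helper names `read_fasta` and `drop_short` are hypothetical, and this is not the code LTR_retriever itself uses:

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_short(records, min_len=100):
    """Keep only records whose sequence is at least min_len bp long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Hypothetical two-entry library: one 150 bp entry and one 40 bp entry.
library = ">a\n" + "A" * 150 + "\n>b\n" + "C" * 40 + "\n"
kept = drop_short(read_fasta(library))
```

Only entry `a` survives the filter here, because entry `b` is shorter than 100 bp.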

LTR_retriever v2.0 performs similarly well compared with the v1.x versions.

Rice (MSUv7)            v1.x     v2.0
Sensitivity             95.0%    95.3%
Specificity             95.0%    94.6%
Accuracy                95.0%    94.8%
Precision               85.4%    84.5%

Arabidopsis (TAIR10)    v1.x     v2.0
Sensitivity             90.7%    90.9%
Specificity             99.0%    99.0%
Accuracy                98.5%    98.5%
Precision               86.6%    86.5%
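
For reference, the four metrics reported here follow the standard confusion-matrix definitions. The Python sketch below assumes counts of true/false positives and negatives are already available (for example, in bp of annotated sequence); how the benchmark derives those counts is not described in these notes, and the function name `metrics` is hypothetical.

```python
def metrics(tp, fp, tn, fn):
    """Return standard classification metrics as fractions in [0, 1]."""
    return {
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # overall correct rate
        "precision": tp / (tp + fp),                   # positive predictive value
    }
```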

v1.9

5 years ago

LTR_retriever

New features

  1. Add LTR_digest support
    • The *pass.list.gff3 becomes readable to LTR_digest.
    • You can also use /LTR_retriever/database/TEfam.hmm to feed LTR_digest.
  2. Improved gff3 output for intact LTR-RTs
    • Add strand info for each element.
    • The ones with '?' (unknown direction) in the *pass.list will remain '?'s in the *pass.list.gff3 file.
  3. Improve multi-threading efficiency
    • Use the Thread::Queue module to replace the Thread::Semaphore module
    • At least 100% more efficient
  4. Add Mac OS X support (High Sierra v10.13.3 tested)
  5. Add a script to summarize the genome % of each TE family using RepeatMasker .out files
    • Usage: perl ./LTR_retriever/bin/fam_coverage.pl TE_lib RM_output genome_size_bp > TE_fam.size.list
    • Works not only with LTRs but also with other TEs in the RM.out file.
  6. Add a script to summarize the genome % of each TE superfamily (TE summary table for genome publications)
    • Usage: perl ./LTR_retriever/bin/fam_summary.pl TE_fam.size.list genome_size_bp > TE_fam.sum.txt
    • Summary tables for LTR families and superfamilies are added to the output of LTR_retriever
  7. Add a script to calculate LTR distribution (Copia, Gypsy, and unknown) on chromosomes.
    • Usage: perl ./LTR_retriever/bin/LTR_sum.pl -genome genome.fa -all genome.fa.RM.out [options]
    • Options:
      -window [int]  bp size of the sliding window, default 3,000,000
      -step [int]    bp size of the moving step, default 300,000
      -intact        indicates that the -all file is an LTR_retriever .pass.list instead of a RepeatMasker .out file
    • The .out.LTR.distribution.txt file is generated by default.
  8. Add a script for whole-genome forward simulation (randomly add mutations to the genome)
    • Usage: perl ./LTR_retriever/bin/simulate_mutation.pl -g genome.fasta -u [0-1] > genome.mutated.fasta
    • -u specifies the mutation rate, e.g., -u 0.01 will randomly mutate 1% of the entire genome.
  9. Replace annotate_gff.pl with make_gff3_with_RMout.pl for better whole-genome LTR-RT annotation
    • Usage: perl ./LTR_retriever/bin/make_gff3.pl genome.fa.RepeatMasker.out > genome.fa.RepeatMasker.gff
    • Applied basic hit filtering: SW_score>=300, alignment length >= 80 bp
  10. Add more usage information to -h
  11. Update README
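
The forward simulation in item 8 can be pictured as per-base random substitution. The following Python sketch is one hedged interpretation, assuming each base is independently replaced by a different nucleotide with probability `u`; the function `mutate` and its `seed` argument are hypothetical, and simulate_mutation.pl may use a different scheme.

```python
import random

BASES = "ACGT"


def mutate(seq, u, seed=None):
    """Substitute each A/C/G/T with a different base with probability u."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must be between 0 and 1")
    rng = random.Random(seed)
    out = []
    for base in seq:
        upper = base.upper()
        if upper in BASES and rng.random() < u:
            # Pick one of the three other nucleotides (emitted in upper case).
            out.append(rng.choice([b for b in BASES if b != upper]))
        else:
            # Leave N's, other symbols, and unmutated bases untouched.
            out.append(base)
    return "".join(out)
```

With `u=0.01`, roughly 1% of the A/C/G/T positions are expected to change; the exact number varies from run to run unless a seed is fixed.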

Bugs fixed

  1. Program halted when nothing was masked in truncated candidates.
  2. Program halted when multiple LTR_retriever tasks simultaneously checked RepeatMasker in the same directory.
  3. 'substr sequences out of range' error when self-corrected reads were used as input.

LAI Version b2

  1. Rewrite LTR_calc.pl with more accurate and efficient algorithms.
    • Add the -step parameter for overlapping-sliding window scheme to estimate LAI
    • Output the size of the genome for genome LAI
    • Memory consumption of this script is approx. 2X the size of the input genome
  2. To control the boom and bust dynamic of LTR-RTs, adjust the raw LAI based on LTR identity.
    • Estimate mean identity of LTR sequences in a genome using all-versus-all blastn search
    • Add a quick estimation (-q) of genomic LTR identity based on a log-linear model with the slope estimated from three small subsets of LTRs
    • To avoid abnormal adjustment, if the estimated LTR identity is <= 92% or >= 96%, it is set to 92% or 96%, respectively
    • Use the -unlock parameter to release the restriction of LTR identity ([92, 96]) for good genomes with extreme LTR activities
    • Set LAI_adj=0 if raw LAI==0
    • The alignment identity cutoff (-iden) excludes hits with identity higher than this value from the LTR identity calculation. Default: 100 (%)
  3. Change the output naming of LAI to raw_LAI and LAI_adj to LAI for easier description.
  4. Add polyploid support.
    • If the input genome is a polyploid (a diploidized ancient polyploid does not count), then only one set of chromosomes (1x, a monoploid) should be used to estimate LAI; otherwise the LTR identity will be erroneously estimated to be very high, which substantially decreases the LAI.
    • Use the -mono parameter to provide a list of chromosome names of a monoploid, LAI will be calculated only on these sequences.
    • Users can run LAI multiple times with different monoploids specified to obtain the whole genome LAI estimation.
  5. Set prerequisites of LAI estimation
    • Set intact LTR-RT limit >= 0.01%
    • Set total LTR limit >= 5%
  6. Add the -totLTR parameter for customized total LTR content
  7. Add the -window parameter to control window size
  8. Add the rush mode (-qq) to quickly estimate raw LAI for version comparison. Raw LAI should not be used to compare between different species because the LTR dynamic is not controlled.
  9. Add status output of the LAI program. LAI is a default output of LTR_retriever. You should rerun LAI with the -mono parameter if the target genome is a polyploid.
  10. Add Mac OS X support (High Sierra v10.13.3 tested).
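
The guards described in items 2 and 5 can be summarised in a few lines. This Python sketch only illustrates the bounds and thresholds stated above; the function names are hypothetical, and the adjustment of raw LAI by LTR identity is deliberately left out because its formula is not spelled out in these notes.

```python
def bounded_identity(identity, unlock=False):
    """Clamp the estimated LTR identity (%) to [92, 96] unless unlocked."""
    if unlock:
        # Mirrors the idea of -unlock: keep the estimate as it is.
        return identity
    return min(max(identity, 92.0), 96.0)


def meets_prerequisites(intact_pct, total_ltr_pct):
    """Check the minimum intact LTR-RT (0.01%) and total LTR (5%) content."""
    return intact_pct >= 0.01 and total_ltr_pct >= 5.0
```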

v1.6

6 years ago

New features

  1. Add the citation for LTR_retriever. Please cite our program: S. Ou and N. Jiang (2017) LTR_retriever: a highly accurate and sensitive program for identification of long terminal-repeat retrotransposons. Plant Physiology, pp.01310.2017; DOI: 10.1104/pp.17.01310 http://www.plantphysiol.org/content/early/2017/12/12/pp.17.01310
  2. Retain the unreduced library (*.LTRlib.redundant.fa). Please use the non-redundant library unless you have a specific reason not to. Note that using the unreduced library may not improve annotation sensitivity; any gain is marginal, and it takes significantly more time.
  3. Remove the entire candidate if plant protein sequence is found dominant (70%) in either the LTR region or the internal region.
  4. Add module # in the status output.

Bugs fixed

  1. Remove space(s) at the end of each seq ID to avoid errors.
  2. Check if seq names are duplicated.
  3. Fix bugs in reading the window size parameter for LAI.
  4. Fix a program halt when nothing is masked in truncated candidates.
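
The first two checks in this list can be sketched as follows. This is an illustrative Python fragment, not the project's Perl code; `clean_ids` is a hypothetical helper that takes the sequence IDs as plain strings.

```python
from collections import Counter


def clean_ids(ids):
    """Strip trailing whitespace from IDs and report any duplicated names."""
    cleaned = [name.rstrip() for name in ids]
    duplicates = sorted(n for n, c in Counter(cleaned).items() if c > 1)
    return cleaned, duplicates
```

Stripping before counting matters: an ID with a trailing space would otherwise look distinct from the same ID without one.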

v1.5

6 years ago

New feature: The LTR-RT Assembly Index (LAI) for evaluation of genome assembly continuity

Description: LTR retrotransposons are very difficult to assemble because of their repetitive nature (up to 75% of a genome, e.g., maize) and long length (up to 20 kb). The simple idea that more intact LTR-RTs can be found in a more continuous assembly provides the theoretical basis of LAI. This module uses the list of intact LTR-RTs and the whole-genome LTR-RT annotation produced by LTR_retriever (*.pass.list and *.out, respectively) to calculate LAI. A window-based calculation is implemented to estimate regional continuity. A manuscript describing this feature is in preparation.
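
To make the window-based idea concrete, here is a hedged Python sketch. It assumes that raw LAI is the percentage of total LTR-RT sequence length that belongs to intact LTR-RTs, that coordinates are half-open (start, end) intervals on one sequence, and that intervals within each list do not overlap each other. The function names and the `step` argument are hypothetical; the LAI program's own algorithm may differ.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def window_lai(intact, all_ltr, seq_len, window, step=None):
    """Return a list of (window_start, raw LAI) pairs along one sequence."""
    step = step or window  # default: non-overlapping windows
    results = []
    for start in range(0, max(seq_len - window, 0) + 1, step):
        end = start + window
        intact_bp = sum(overlap(s, e, start, end) for s, e in intact)
        total_bp = sum(overlap(s, e, start, end) for s, e in all_ltr)
        # Report 0 when the window contains no LTR-RT sequence at all.
        lai = 100.0 * intact_bp / total_bp if total_bp else 0.0
        results.append((start, lai))
    return results
```

A window where half of the LTR-RT bases come from intact elements scores 50, while a window containing only fragmented LTR-RTs scores 0.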

Other features:

  1. Improve purging criteria. Introduce an identity cutoff for alignment hits (>=30%) and change the alignment length criterion to an identity-length criterion: identity-length = alignment length - mismatch >= 90 for a real hit.
  2. Add scripts to identify solo LTRs and complete LTRs, to estimate the solo-complete ratio for each family, and to count family size in the genome. This code was initially developed for this study: https://www.nature.com/articles/s41467-017-02546-5
  3. Control the length of internal regions (>=100 bp) on LTR candidates.
  4. Update the manual
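
The purging criteria in item 1 reduce to two comparisons per alignment hit. A minimal Python sketch, assuming the identity is given in percent and the alignment length and mismatch count in bp; the function name `is_real_hit` is hypothetical:

```python
def is_real_hit(identity_pct, aln_len, mismatches):
    """Apply the identity cutoff and the identity-length criterion."""
    identity_length = aln_len - mismatches
    return identity_pct >= 30.0 and identity_length >= 90
```

For example, a 120 bp alignment with 10 mismatches has an identity-length of 110 and passes, whereas a 95 bp alignment with 10 mismatches (identity-length 85) does not.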

v1.4

6 years ago

New features:

  1. Introduce fingerprints for databases to avoid accidentally deleting these files, especially when running multiple LTR_retriever instances in the same folder (e.g., #2). In other words, you can now run multiple LTR_retriever instances in the same folder.
  2. Add warnings if specified file(s) do not exist.
  3. Update the license to GNU GPL v3, i.e., LTR_retriever is open-source software.

Bugs fixed

Provide a workaround for the BLAST bug (described in v1.3 and #4, #3) that occurs under high CPU usage or resource over-allocation. Each BLAST attempt is checked and redone up to 100 times if it returns an error status. This is not a complete fix, but at least there should be no more such errors.
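
The check-and-redo workaround follows a common bounded-retry pattern. The Python sketch below illustrates that pattern only; `run_with_retries` and the `task` callable are hypothetical, and the real implementation lives in the LTR_retriever Perl code.

```python
def run_with_retries(task, max_attempts=100):
    """Call task() until it returns exit status 0 or attempts run out.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        if task() == 0:
            return attempt
    raise RuntimeError(f"task failed after {max_attempts} attempts")
```

In practice `task` could wrap an external command, for instance a function that returns `subprocess.run(cmd).returncode` for a BLAST command line.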

v1.3

6 years ago

Several bugs have been reported since the last release. Most of them were fixed in this release.

Fixed bugs

  1. Sorting a list of sequence coordinates failed when special characters occurred.
  2. RepeatMasker ran incorrectly if it was installed with HMMER as the default search engine. Added a check to make sure the blast+ engine is available. Reinstalling RepeatMasker is needed if users receive similar errors.
  3. Copy the database files to the working directory instead of working in the installed directory to avoid write error.
  4. Update the Manual. I didn't realize it was a commented version.
  5. Further steps were halted when there was no coding sequence contamination to clean.
  6. Tested lowest dependency versions:
    • CDHIT/4.5.6 or up
    • BLAST+/2.2.25 or up
    • RepeatMasker/3.3.0 or up
    • HMMER/3.1b1 or up

Unfixed bug

BLAST engine error: Warning: Sequence contains no data or Warning: [blastn] Subject_1 chr:from..to|chr:from..to: Subject sequence contians no data

Some candidate sequences appeared to be empty when sequence structures were analyzed. This can happen when more CPUs are allocated to the program than the system can provide. Some users said it also occurred even when plenty of CPUs were available. So far I could not reproduce the second situation, so I don't know how to fix it. The good news is that LTR candidates of this kind are usually problematic, which means that they are usually false LTRs and will be screened out anyway. So if you only have this kind of error you should be fine. The results are still reliable.

v1.2

6 years ago

Users are strongly advised to provide the genome file with short sequence names, i.e., <15 characters. One of the LTR_retriever dependencies, RepeatMasker, limits sequence names to about 40 characters. Because LTR_retriever uses the actual coordinates of LTR sequences plus RepeatMasker-format annotations as the library ID, e.g., Chr1:12345..13545_LTR#LTR/Gypsy, the space left for chromosome names (i.e., "Chr1", not the full library sequence name) drops to 15-20 characters. To avoid frustrating errors caused by long sequence names, this release adds a module that attempts to convert long sequence names to shorter but still unique names automatically. The converted genome is saved with the ending ".mod" and the original genome is left untouched. This is a workaround in case users forget to do the conversion or don't know how.
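
How the module derives the shorter names is not described in these notes. As an illustration only, the Python sketch below uses a hypothetical truncate-and-suffix scheme: it keeps the first whitespace-delimited word, truncates it to `max_len` characters, and appends a numeric suffix when two names collide. It assumes the input names are themselves unique.

```python
def shorten_names(names, max_len=15):
    """Map each sequence name to a unique name of at most max_len characters."""
    mapping, used = {}, set()
    for name in names:
        # Drop any description after the first whitespace; fall back to "seq".
        base = name.split(None, 1)[0] if name.strip() else "seq"
        short, n = base[:max_len], 0
        while short in used:
            # On collision, trade trailing characters for a numeric suffix.
            n += 1
            suffix = f"_{n}"
            short = base[: max_len - len(suffix)] + suffix
        used.add(short)
        mapping[name] = short
    return mapping
```

Keeping the mapping from original to shortened names makes it possible to translate results back to the original sequence names afterwards.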

v1.1

7 years ago

This version provides more detailed stdout with timestamps. The timestamps help users keep track of progress, and the stdout helps with debugging. If any error message is printed to stdout, please copy the error message together with the adjacent timestamps before and after it and paste them into a new issue. This kind of information helps to quickly pinpoint the problem.