Mortazavilab TALON Versions

Technology agnostic long read analysis pipeline for transcriptomes

v4.0

5 years ago

This release is written for Python 3.7 rather than Python 2.7. There are some small issues to be worked out in the post-TALON_tools section, but the schema and talon.py script are set.

  • Matching is done at the vertex level rather than the exon level (except for mono-exonic transcripts, where overlap-based matching is attempted first). Known starts and ends for the gene are prioritized when matching transcript start and end points.
  • Updated filtering: genomic transcripts are removed regardless of reproducibility
  • Schema changes to TALON database:
    • Added more information to the observed table, including start/end exons
    • In transcript table, the 'jn_path' column now omits the start and end exon. This information is stored in the start_exon and end_exon columns instead.
    • location table no longer includes strand. This has been moved to the edge table
    • gene table now includes strand
  • Reports type of novelty each time a new gene or transcript is identified. For genes, novelty types include antisense and intergenic. For transcripts, novelty types include incomplete splice match (ISM), ISM prefix, ISM suffix, novel in catalog (NIC), novel not in catalog (NNC), antisense, genomic, and intergenic.
  • Updated GTF utility to use a whitelist file rather than database filtering
  • Initialization step now assumes that provided GTF genes, transcripts, and exons are known unless specified otherwise in the GTF attributes. This is necessary because newer versions of the GENCODE annotation lack the 'gene_status' and 'transcript_status' fields.
  • Expanded testing suite
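
The "known unless specified otherwise" rule from the initialization step can be sketched as follows. This is a minimal illustration, not TALON's actual code: the function names and the "KNOWN" default string are assumptions, and only the 'gene_status' and 'transcript_status' field names come from the notes above.

```python
def parse_gtf_attributes(attr_field):
    """Parse the ninth GTF column into a dict of key -> value.

    Hypothetical helper for illustration. Repeated keys (such as
    multiple 'tag' entries) keep only the last value seen.
    """
    attributes = {}
    for item in attr_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def annotation_status(attributes, status_key):
    """Return the status recorded in the GTF, defaulting to known.

    Assumption: a missing status field is treated as "KNOWN", which
    mirrors the behavior described in the release notes.
    """
    return attributes.get(status_key, "KNOWN")


# A record without a status field is treated as known
newer = parse_gtf_attributes('gene_id "G1"; gene_name "X";')
print(annotation_status(newer, "gene_status"))

# An explicit status in the attributes takes precedence
older = parse_gtf_attributes('gene_id "G2"; gene_status "NOVEL";')
print(annotation_status(older, "gene_status"))
```

Defaulting on a missing key keeps older annotations, which still carry the status fields, working unchanged.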

v3.0-beta

5 years ago
  • Updates have been made to the database schema, so this version is not backwards-compatible with previous releases
  • Instead of separate observed 5' and observed 3' end tables, the schema now includes a single table called 'observed'. This table tracks 5' and 3' end differences as before, but also has additional attributes such as the original read name and length.
  • Fixed a bug in the testing suite that caused certain tests to crash when run on a computer other than the one they were written on
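
To illustrate the idea of a single 'observed' table, here is a rough sketch using Python's built-in sqlite3 module. Only the table name and the kinds of information stored (5' and 3' end differences, original read name, read length) come from the notes above. Every column name below is a hypothetical stand-in and may not match the real TALON schema.

```python
import sqlite3


def build_observed_demo():
    """Create an in-memory database with one combined 'observed' table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical columns: both end differences live in the same row
    # as the read name and length, rather than in two separate tables.
    conn.execute(
        """
        CREATE TABLE observed (
            obs_ID INTEGER PRIMARY KEY,
            read_name TEXT,
            dataset TEXT,
            read_length INTEGER,
            fiveprime_delta INTEGER,
            threeprime_delta INTEGER
        )
        """
    )
    rows = [
        ("read_001", "dataset_A", 1520, 12, -4),
        ("read_002", "dataset_A", 980, 0, 35),
    ]
    conn.executemany(
        "INSERT INTO observed "
        "(read_name, dataset, read_length, fiveprime_delta, threeprime_delta) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


conn = build_observed_demo()
for row in conn.execute(
    "SELECT read_name, read_length, fiveprime_delta, threeprime_delta "
    "FROM observed ORDER BY read_name"
):
    print(row)
```

With one row per read, both end differences can be retrieved in a single query without joining two tables.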

v2.1-beta

5 years ago
  • Includes a GTF utility that allows the user to output annotations from a TALON database in the GTF format, with or without transcript filtering
  • Includes an abundance file utility that runs on a TALON database to create a file recording transcript abundance in each dataset. Can be run with or without filtering.
  • Expanded and improved the testing suite
  • Fixed bugs affecting v2.0-beta
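
The shape of a per-dataset abundance file, with or without filtering, might look like the sketch below. It is an assumption-laden illustration rather than the utility itself: the function name, the input format (pairs of transcript ID and dataset name), and the output column header are all hypothetical.

```python
from collections import Counter


def abundance_table(observations, keep=None):
    """Build tab-delimited abundance lines, one row per transcript.

    observations: iterable of (transcript_id, dataset) pairs, one per read.
    keep: optional set of transcript IDs to retain; None means no filtering.
    """
    counts = Counter(
        (transcript, dataset)
        for transcript, dataset in observations
        if keep is None or transcript in keep
    )
    datasets = sorted({dataset for _, dataset in counts})
    transcripts = sorted({transcript for transcript, _ in counts})

    lines = ["\t".join(["transcript_ID"] + datasets)]
    for transcript in transcripts:
        # A Counter returns 0 for transcript/dataset pairs never observed
        row = [str(counts[(transcript, dataset)]) for dataset in datasets]
        lines.append("\t".join([transcript] + row))
    return lines


reads = [("T1", "d1"), ("T1", "d1"), ("T1", "d2"), ("T2", "d2")]
print("\n".join(abundance_table(reads)))
print("\n".join(abundance_table(reads, keep={"T2"})))
```

Passing a set of transcript IDs as `keep` stands in for the filtered mode; leaving it as `None` reports every transcript.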

v2.0-beta

5 years ago
  • Updated the TALON database schema (expanded the transcript table to include start_vertex, end_vertex, and n_exons)
  • Removed the abundance output file, which was prone to bugs when TALON was run repeatedly to add new datasets.
  • Added a post-TALON filtering utility to simplify downstream data analysis
  • Expanded and improved the testing suite

v1.2-beta

5 years ago
  • This release was used for my RNA Club talk on 10/18/18
  • Fixed some small bugs that mainly affected second runs on the database, including reversing the path for novel transcripts on the minus strand.

v1.1-beta

5 years ago
  • New output file with abundance by dataset, among other things
  • ENCODE mode for filtering novel transcripts
  • This version is the one I will present in lab meeting on 10/12/18

v1.0-beta

5 years ago
  • This release encodes GTF annotations in a SQLite database
  • Each time TALON is run on a new dataset, novel genes, transcripts, vertices, edges, and locations are added to the database.
  • Includes a limited set of test cases (run using Pytest)
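
As a rough picture of what storing GTF annotations in SQLite and adding to them on later runs can look like, consider the sketch below. The table layout and function name are hypothetical and far simpler than the genes, transcripts, vertices, edges, and locations that TALON tracks. The sample GTF lines are made up for the example.

```python
import sqlite3


def add_gtf_records(conn, gtf_lines):
    """Append GTF records to a hypothetical 'annotation' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation ("
        "chrom TEXT, feature TEXT, start_pos INTEGER, "
        "end_pos INTEGER, strand TEXT, attributes TEXT)"
    )
    for line in gtf_lines:
        # Skip header comments and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, _source, feature, start, end, _score, strand, _frame, attrs = fields
        conn.execute(
            "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
            (chrom, feature, int(start), int(end), strand, attrs),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")

# Initial annotation
add_gtf_records(conn, [
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
])

# A later run adds new records to the same database instead of rebuilding it
add_gtf_records(conn, [
    'chr2\tTALON\tgene\t50\t400\t.\t-\t.\tgene_id "novel_1";',
])

print(conn.execute("SELECT COUNT(*) FROM annotation").fetchone()[0])
```

Because the table is created only if it does not already exist, a second call extends the stored annotation, which is the property that lets novel features persist across runs.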

v0.1-alpha

6 years ago
  • This is the version I presented at the Mortazavi-Wold joint meeting on 4/20/2018
  • Pulls annotations straight from the GTF file and does not modify the annotation in a way that lasts beyond the run. This means that samples must be run at the same time in order to get corresponding novel IDs.
  • Writes results to a tab-delimited file that can be analyzed with the provided plotting script (which contains some hard-coded values)
  • Query transcripts lacking any exact junction matches to the annotation are not yet fully processed. They are assigned to a very broad 'novel' category that will be improved in the next release.
  • Permissive 3' and 5' end handling, which may be changed in a later release.