Kallisto Versions

Near-optimal RNA-Seq quantification

v0.50.1

6 months ago

Kallisto index version is now index 13 (kallisto v0.50.0 had index version 12)

New features (kallisto index):

  • Can input priors for the EM algorithm
  • D-list has an overhang option
  • D-list is now stored in a hash table rather than part of the graph
  • Fix some compilation issues in Bifrost
  • Can specify custom k-mers to be D-listed by having an empty fasta header
  • Can specify custom k-mers to be indexed by using --distinguish and assigning each input fasta entry a numerical ID (zero-indexed) in the fasta header
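To make the --distinguish input described above concrete, here is a minimal sketch that writes a FASTA file in which each entry carries a zero-indexed numerical ID as its header. The helper name `numbered_fasta` and the exact header layout (`>0`, `>1`, ...) are assumptions for illustration; the release notes only say that the ID goes in the FASTA header, so check the kallisto documentation for the precise syntax.

```python
def numbered_fasta(groups):
    """Build FASTA text where every sequence in group i gets the header '>i'.

    `groups` is a list of lists of sequences. Sequences in the same group
    share one numerical ID (assumed header layout, see the note above).
    """
    lines = []
    for target_id, sequences in enumerate(groups):  # IDs are zero-indexed
        for seq in sequences:
            lines.append(f">{target_id}")
            lines.append(seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two hypothetical groups: one sequence with ID 0, two sequences with ID 1.
    print(numbered_fasta([["ACGT"], ["GGCC", "TTAA"]]), end="")
```

The resulting text could be saved to a file and passed to kallisto index together with --distinguish.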

New features (kallisto bus technologies):

  • Kallisto technology string can now have format -x bc:umi:cdna%strand%parity
  • Split-seq defaults to --fr-stranded
  • STORM-seq and VASA-seq now supported as technology options

v0.50.0

10 months ago

kallisto index

The improved kallisto index reduces memory consumption for large FASTA files and features a d-list option to improve k-mer mapping specificity. Additionally, new input and output features have been added as well as support for sample barcodes (which can be recorded in addition to cell barcodes).

New features

  • kallisto quant-tcc: This new command can run the EM algorithm on a supplied transcripts-compatibility counts (TCC) matrix file, such as that generated by "bustools count", to generate transcript-level estimates. When a gene-mapping file is supplied, gene-level abundances will also be outputted. Effective length normalization will only be performed if a kallisto index is supplied and if fragment length information is provided.
  • New technologies were added to "kallisto bus": -x SmartSeq3 (--tag can be used to supply a 5′ tag sequence that identifies UMI-containing reads), -x BDWTA (BD Rhapsody), -x Visium (10x Visium), -x SPLIT-SEQ (SPLiT-seq preprocessing), and -x Bulk (for preprocessing non-demultiplexed Bulk RNA-seq files)
  • "kallisto bus" can be run with -x BULK specified: In this case, it will either process a batch file (supplied via --batch) like in the old "kallisto pseudo" or will process fastQ files supplied directly on the command line, treating each fastQ file or each pair of fastQ files (if --paired is specified) as an individual sample. This is useful for generating BUS files when each sample is in a separate fastQ file. Together with bustools and kallisto quant-tcc, this feature effectively deprecates the old "kallisto pseudo".
  • Strand-specificity is now enabled by default for 10X, SureCell, CelSeq, BD Rhapsody, and Smart-seq3 UMI technologies (unstranded is default for other technologies) and the user can override this by supplying --fr-stranded, --rf-stranded, and --unstranded options.
  • Various performance improvements (mostly in regards to data ingestion throughput)
  • A minimal form of the kallisto index is outputted in a file named index.saved and a file containing fragment length distributions (flens.txt) is outputted when "kallisto bus" is run on paired-end reads (which can be specified via the option --paired). This is so kallisto quant-tcc can perform effective length normalization should the need arise.
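As a rough illustration of what running the EM algorithm on transcript-compatibility counts involves, the sketch below distributes the reads of each equivalence class across its compatible transcripts in proportion to the current abundance estimates, optionally weighted by effective length. This is a textbook-style EM under stated assumptions, not kallisto's implementation; the function name `em_tcc`, the fixed iteration count, and the toy data are all hypothetical.

```python
def em_tcc(ecs, counts, n_transcripts, eff_lens=None, n_iter=200):
    """Estimate per-transcript read counts from equivalence-class counts.

    ecs      : list of tuples of transcript indices (one tuple per class)
    counts   : number of reads observed for each class
    eff_lens : optional effective lengths; without them no length
               normalization is applied (every length is treated as 1)
    """
    if eff_lens is None:
        eff_lens = [1.0] * n_transcripts
    total = float(sum(counts))
    alpha = [total / n_transcripts] * n_transcripts  # uniform starting point
    for _ in range(n_iter):
        updated = [0.0] * n_transcripts
        for ec, count in zip(ecs, counts):
            # Weight of each compatible transcript within this class.
            denom = sum(alpha[t] / eff_lens[t] for t in ec)
            if denom <= 0:
                continue
            for t in ec:
                updated[t] += count * (alpha[t] / eff_lens[t]) / denom
        alpha = updated
    return alpha


if __name__ == "__main__":
    # Toy data: 30 reads unique to transcript 0, 10 unique to transcript 1,
    # and 20 reads compatible with both.
    print(em_tcc([(0,), (1,), (0, 1)], [30, 10, 20], n_transcripts=2))
```

In the toy data, the 20 ambiguous reads end up split 3:1 in favor of transcript 0, following the ratio of the uniquely assigned reads.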

New index

  • A new index is used that is incompatible with the old index, and users should upgrade to this new index for kallisto v0.50.0
  • With the new index, users can set the minimizer length (--min-size), which can be used to tune indexing runtime and memory usage
  • --max-ec-size has been added so that users can cap the size of equivalence classes (i.e. the number of transcripts compatible with a given k-mer); k-mers that exceed this size aren't considered in the pseudoalignment. This can reduce memory usage and increase runtime performance (with some loss of information if --max-ec-size is too small).
  • --threads option now enabled for kallisto index to allow indices to be created in a multithreaded fashion (to improve runtime)
  • --d-list can be used to supply a FASTA file where distinguishing flanking k-mers will be extracted from (to act as a general k-mer filter for improving mapping specificity)
  • --distinguish option is added: no polyA trimming or similar preprocessing occurs, and each target is indexed as-is, with targets distinguished from one another by target name (e.g. two targets can have the same name and be indexed together as a single target)
  • kallisto inspect can output more information: minimizer length, number of unitigs, max EC size, number of ECs discarded (i.e. over the --max-ec-size threshold), and number of D-listed elements (DFKs)
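The effect of --max-ec-size can be pictured with a toy model: k-mers whose set of compatible transcripts is larger than the cap are dropped, and the number of drops is reported. The sketch below is only a conceptual illustration under that assumption; the function name `cap_equivalence_classes` and the dictionary representation are hypothetical and do not reflect how the kallisto index is stored.

```python
def cap_equivalence_classes(kmer_to_targets, max_ec_size):
    """Drop k-mers whose set of compatible targets exceeds max_ec_size.

    Returns the retained mapping and the number of discarded entries.
    """
    kept = {}
    discarded = 0
    for kmer, targets in kmer_to_targets.items():
        if len(targets) > max_ec_size:
            discarded += 1  # too ambiguous: ignored during pseudoalignment
        else:
            kept[kmer] = targets
    return kept, discarded


if __name__ == "__main__":
    toy = {"AAA": {0}, "AAC": {0, 1, 2}, "ACG": {1, 2}}
    kept, discarded = cap_equivalence_classes(toy, max_ec_size=2)
    print(sorted(kept), discarded)
```

A small cap saves memory but, as noted above, discards information carried by highly shared k-mers.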

New input features

  • --inleaved option added to kallisto bus to support reading in interleaved FASTQ input
  • Streaming FASTQ reads directly into kallisto bus is enabled by supplying - in lieu of FASTQ files
  • The -x technology string can read RX:Z: UMIs from FASTQ header comments by supplying something like 0,0,8:RX:1,0,0 (i.e. RX can be supplied in the UMI portion of the technology string)
  • --numReads can be set to terminate after a certain number of reads have been processed
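The technology string in the example above has three colon-separated parts (barcode, UMI, cDNA). The following sketch splits such a string into its parts, assuming each numeric part is a list of comma-separated triplets read as (file index, start, stop) and that the UMI part may instead be the literal RX. The triplet interpretation and the helper names `parse_segment` and `parse_technology` are assumptions for illustration, not taken from the release notes.

```python
def parse_segment(segment):
    """Parse one part of a technology string.

    Returns the literal "RX" unchanged, otherwise a list of integer
    triplets (assumed meaning: file index, start, stop).
    """
    if segment == "RX":
        return "RX"
    numbers = [int(x) for x in segment.split(",")]
    if len(numbers) % 3 != 0:
        raise ValueError(f"expected triplets in {segment!r}")
    return [tuple(numbers[i:i + 3]) for i in range(0, len(numbers), 3)]


def parse_technology(tech):
    """Split a 'barcode:umi:cdna' technology string into its three parts."""
    parts = tech.split(":")
    if len(parts) != 3:
        raise ValueError("expected three parts separated by ':'")
    barcode, umi, cdna = (parse_segment(p) for p in parts)
    return {"barcode": barcode, "umi": umi, "cdna": cdna}


if __name__ == "__main__":
    print(parse_technology("0,0,8:RX:1,0,0"))
```

This only validates the shape of the string; kallisto itself decides which values are legal.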

New sample barcode feature

  • --batch-barcodes in kallisto bus will encode the batch ID as a unique nucleotide sequence in the hidden metadata of the barcode column of the BUS file (i.e. serving as a sample barcode).
  • --batch in kallisto bus now allows a technology string to be supplied (if --batch-barcodes is not supplied, only the barcodes extracted from the technology string are stored in the BUS file [i.e. sample barcodes aren't recorded]; if -1 is supplied in the barcode part of the technology string, only the batch-specific barcodes [i.e. sample barcodes] are stored directly in the BUS file, not in the hidden metadata unless --batch-barcodes is supplied)
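To illustrate the idea of representing a batch ID as a unique nucleotide sequence, the sketch below maps an integer to a fixed-length string using two bits per base (A=0, C=1, G=2, T=3). The mapping, the default length of 16, and the function name `encode_batch_id` are assumptions chosen for illustration; the release notes do not specify how kallisto actually encodes sample barcodes.

```python
def encode_batch_id(batch_id, length=16):
    """Encode a non-negative integer as a nucleotide string of fixed length.

    Uses a hypothetical 2-bit mapping: A=0, C=1, G=2, T=3.
    """
    if batch_id < 0 or batch_id >= 4 ** length:
        raise ValueError("batch_id does not fit in the requested length")
    bases = []
    for _ in range(length):
        bases.append("ACGT"[batch_id & 3])  # lowest two bits pick the base
        batch_id >>= 2
    return "".join(reversed(bases))  # most significant base first


if __name__ == "__main__":
    for i in range(3):
        print(i, encode_batch_id(i, length=4))
```

Because the mapping is one-to-one, distinct batch IDs always yield distinct sequences.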

New output features

  • kallisto quant-tcc command can output exactly what “kallisto quant” does (including with bootstraps for sleuth) for each barcode into separate abundance.tsv files (if --matrix-to-files is specified) or into separate directories, each containing an abundance.tsv file (if --matrix-to-directories is specified). Also, h5ad will be produced if compiled with that option (unless --plaintext is supplied to quant-tcc).

Other new features

  • Progress is outputted every 1M reads
  • --aa option enabled in kallisto bus and kallisto index for amino acid mapping to nucleotide (functionalities to be described in a paper)

New compilation options

  • HTSLIB is no longer enabled by default; need to use cmake .. -DUSE_BAM=ON
  • Zlib is still compatible and used by default, but the better zlib-ng is included and can be used by supplying the cmake option -DZLIBNG=ON.
  • Compilation flags to enable all features are as follows: cmake .. -DZLIBNG=ON -DUSE_BAM=ON -DBUILD_FUNCTESTING=ON -DUSE_HDF5=ON

End of support for existing bulk RNAseq features

  • --bias, --fusion, --genomebam, and --pseudobam in kallisto quant and kallisto bus are no longer supported -- users should use v0.48.0 for use of these features.
  • --gfa, --gtf, and --bed options in kallisto inspect are no longer supported -- users should use v0.48.0 for use of these features.

v0.48.0

2 years ago

New features

  • kallisto quant-tcc: This new command can run the EM algorithm on a supplied transcripts-compatibility counts (TCC) matrix file, such as that generated by "bustools count", to generate transcript-level estimates. When a gene-mapping file is supplied, gene-level abundances will also be outputted. Effective length normalization will only be performed if a kallisto index is supplied and if fragment length information is provided.
  • New technologies were added to "kallisto bus": -x SmartSeq3 (--tag can be used to supply a 5′ tag sequence that identifies UMI-containing reads), -x BDWTA (BD Rhapsody), -x Visium (10x Visium), -x SPLIT-SEQ (SPLiT-seq preprocessing), and -x Bulk (for preprocessing non-demultiplexed Bulk RNA-seq files)
  • "kallisto bus" can be run with no technology specified: In this case, it will either process a batch file (supplied via --batch) like in the old "kallisto pseudo" or will process fastQ files supplied directly on the command line, treating each fastQ file or each pair of fastQ files (if --paired is specified) as an individual sample. This is useful for generating BUS files when each sample is in a separate fastQ file. Together with bustools and kallisto quant-tcc, this feature effectively deprecates the old "kallisto pseudo".
  • Strand-specificity is now enabled by default for 10X, SureCell, CelSeq, BD Rhapsody, and Smart-seq3 UMI technologies (unstranded is default for other technologies) and the user can override this by supplying --fr-stranded, --rf-stranded, and --unstranded options.
  • Various performance improvements (mostly in regards to data ingestion throughput)
  • A minimal form of the kallisto index is outputted in a file named index.saved and a file containing fragment length distributions (flens.txt) is outputted when "kallisto bus" is run on paired-end reads (which can be specified via the option --paired). This is so kallisto quant-tcc can perform effective length normalization should the need arise.

Deprecation

  • "kallisto pseudo" is now deprecated and will be removed in a future release; users should supply batch files of fastQ file names to "kallisto bus" instead

Fixes

  • Issue #319 : header import
  • Issue #272 : "kallisto quant" and "kallisto pseudo" inconsistency (now fixed)

v0.46.2

4 years ago

Phasing out HDF5

For this release, HDF5 is not a required dependency for running kallisto bus for single cell RNA-seq analysis. It is still required for compatibility with sleuth and other downstream tools. By default kallisto will not be built with HDF5 support; this can be enabled by running:

cmake  .. -DUSE_HDF5=ON

The binaries for this release are compiled with HDF5 built in, but we will switch away from using HDF5 in future versions (coordinated with sleuth).

When running kallisto quant without HDF5 support

  • quant without bootstrapping will create the same files as before, except for abundance.h5
  • quant with bootstrapping, -b, will not perform bootstrapping but displays the following warning: "Warning: kallisto was not compiled with HDF5 support so no bootstrapping will be performed. Run quant with --plaintext option or recompile with HDF5 support to obtain bootstrap estimates."
  • quant with -b k and --plaintext will create the bootstrap values in files bs_abundance_i.tsv for i=0..k-1
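The file naming described above (bs_abundance_i.tsv for i=0..k-1) can be checked with a short script. The sketch below lists the expected bootstrap files for a given k and reports any that are missing from an output directory; the function names and the directory name `out` are hypothetical.

```python
from pathlib import Path


def bootstrap_paths(output_dir, k):
    """Return the expected plaintext bootstrap files for k bootstraps."""
    return [Path(output_dir) / f"bs_abundance_{i}.tsv" for i in range(k)]


def missing_bootstraps(output_dir, k):
    """Return the expected bootstrap files that do not exist on disk."""
    return [p for p in bootstrap_paths(output_dir, k) if not p.exists()]


if __name__ == "__main__":
    # "out" is a placeholder for the directory given to kallisto quant.
    for path in bootstrap_paths("out", 3):
        print(path.name)
```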

For users relying on HDF5 support, we recommend compiling kallisto with HDF5 or downloading the kallisto binaries.

Over the next releases HDF5 will gradually be phased out and information on bootstraps will be replaced with a new format.

Changes

  • kallisto pseudo outputs a file of transcript ids
  • Fixes #240
  • kallisto bus allows having sequence split across more than one file, closes #226

v0.46.1

4 years ago

This release adds options for parsing the inDrops technology (versions 2 and 3 are new) as well as specifying input from BAM files rather than raw FASTQ files.

v0.46.0

4 years ago

This version adds the option of specifying an arbitrary single cell technology for the bus command in kallisto.

v0.45.1

5 years ago

This release adds 10xv3 as a technology option for the bus command.

Bug fixes

  • #201 Pseudobam was not being run unless bootstrap was also performed
  • #199 Error when reading UMI files for the pseudo mode.
  • -l flag for bus was inactive.

v0.45.0

5 years ago

Changes from v0.44.0

BUS

kallisto can now process raw FASTQ files for single cell RNA-Seq and create an output in BUS format, which can be further processed using bustools.

To process single cell data, run kallisto with the bus command. To see a list of supported technologies, run with the --list option:

> kallisto bus --list 
List of supported single cell technologies

short name       description
----------       -----------
10Xv1            10X chemistry version 1
10Xv2            10X chemistry verison 2
DropSeq          DropSeq
inDrop           inDrop
CELSeq           CEL-Seq
CELSeq2          CEL-Seq version 2
SCRBSeq          SCRB-Seq

v0.44.0

6 years ago

Changes from v0.43.1

BAM!

kallisto can now project pseudoalignments from transcripts down to genomic coordinates. This requires a GTF file corresponding to the transcriptome used to construct the index. The resulting BAM file is sorted by genomic coordinates and indexed.

  • --pseudobam option works as before in transcript coordinates, but creates a single output pseudoalignments.bam in the output folder. This mode no longer writes SAM format to standard output, but writes the binary BAM file directly. Multithreaded --pseudobam now works.
  • --genomebam option writes pseudoalignments to the file pseudoalignments.bam in sorted genomic coordinates; it requires the --gtf option and, optionally, the --chromosomes option.

quant mode

Adds a --single-overhang option that does not discard reads where the unobserved rest of the fragment is predicted to lie outside a transcript. This is mainly useful for mapping 3' biased reads from single cell experiments.

JSON output

Adds QC information to run_info.json in the output folder.

The added fields are:

  • n_pseudoaligned : number of fragments that could be pseudoaligned
  • p_pseudoaligned : percentage of fragments that could be pseudoaligned
  • n_unique : number of fragments that could be pseudoaligned to a unique target sequence
  • p_unique : percentage of fragments that could be pseudoaligned to a unique target sequence
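The QC fields listed above can be read back with a few lines of Python. The sketch below extracts the four fields from the JSON text and, if a total read count is available, recomputes the pseudoalignment percentage as a sanity check. The key `n_processed` is an assumption (it is not mentioned in these notes), as are the function name and the sample values.

```python
import json

QC_FIELDS = ("n_pseudoaligned", "p_pseudoaligned", "n_unique", "p_unique")


def summarize_run_info(text):
    """Extract the QC fields from the contents of a run_info.json file."""
    info = json.loads(text)
    summary = {field: info[field] for field in QC_FIELDS}
    n_processed = info.get("n_processed")  # assumed key, may be absent
    if n_processed:
        summary["p_pseudoaligned_recomputed"] = (
            100.0 * info["n_pseudoaligned"] / n_processed
        )
    return summary


if __name__ == "__main__":
    # Made-up values for illustration only.
    sample = (
        '{"n_processed": 1000, "n_pseudoaligned": 900, '
        '"p_pseudoaligned": 90.0, "n_unique": 450, "p_unique": 45.0}'
    )
    print(summarize_run_info(sample))
```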

v0.43.1

6 years ago

Changes from v0.43.0

fusions

kallisto can now find reads which span potential fusion breakpoints. The quant mode adds a --fusion flag which identifies read pairs involved in fusions and writes output to fusion.txt; this file is then processed by pizzly for downstream analysis.

quant mode:

Switched to a uniform starting point for the EM algorithm, which works better in highly ambiguous cases.

pseudobam fixes

Several fixes to the pseudobam output so that the resulting SAM/BAM file can be validated with picard.

Bug fixes

  • Updates the kseq library, which would loop indefinitely on CRC-corrupt gzipped files.
  • Adds a warning when no reads pseudoalign and fixes a crash (the resulting output file will contain nan for tpm values).