A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
Fix: Fixed how nascent/mature/ambiguous matrices are stored in anndata/loom files
Fix: The old lamanno workflow (for legacy purposes) should work now.
Anndata/loom files now have nascent/mature layers rather than unspliced/spliced layers.
--workflow=custom can take in multiple FASTA inputs
Allow --d-list to have comma-separated multiple FASTA files with URLs
Command-line options menu cleaned up a bit
Implements all the updates detailed in protocols paper: https://doi.org/10.1101/2023.11.21.568164
ngs-tools>=1.7.3.
ref
-n has been fully deprecated. (Thanks to @amcdavid for catching a bug)
count
--workflow kite:10xFB, where bustools project would be called before bustools correct (the order should be opposite). This fix required a bump to the ngs-tools dependency.
--workflow lamanno for -x smartseq3.
-i has been fully deprecated.
-w option) for bulk, smartseq2 and smartseq3 technologies.
-x 10XV3_ULTIMA.
-n option) will be deprecated in the next major release. It is now recommended to pass the --include-attribute and --exclude-attribute options (similar to Cellranger's mkref options, https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references) to kb ref to reduce index size and memory usage.
ref
fasta (genomic FASTA) and/or gtf (gene annotation GTF) arguments. Support from ngs_tools 1.5.13.
count
SMARTSEQ is now deprecated. All future uses should use BULK, SMARTSEQ2 or SMARTSEQ3.
gene_name column (or the adata.var_names if --gene-names is used).
--workflow lamanno for -x BULK and -x SMARTSEQ2 technologies.
compile command. See below for more information. (#139)
v0.48.0.
v0.41.0.
kb compile is suggested.
compile
Compiles the kallisto and/or bustools binary from source. At the most basic level, it downloads the latest release source distributions from the respective GitHub repositories, compiles them, and places them where kb can automatically detect them.
The target positional argument specifies which binary (or both) to compile. Possible values are kallisto, bustools and all.
The --url optional argument may be provided with a URL to a remote archive that will be used instead of the latest GitHub release. When this option is used, target may not be all.
The --ref optional argument may be provided with a commit hash or git tag. When this option is used, target may not be all.
The -o optional argument may be used to place the compiled binaries in a different directory. Note that if this option is used, the --kallisto and --bustools options will have to be set appropriately when running ref or count.
The --view option may be used to simply view what binaries (their locations and versions) will be used by kb.
The --remove option may be used to remove existing compiled binaries.
The --overwrite option may be used to overwrite existing compiled binaries.
kallisto compilation follows https://pachterlab.github.io/kallisto/source and has the same dependencies.
bustools compilation follows https://bustools.github.io/source and has the same dependencies.
The --cmake-arguments argument may be used to pass a string of additional arguments directly to the cmake command. For instance, to manually specify additional include directories: --cmake-arguments "-DCMAKE_CXX_FLAGS='-I /usr/include'"
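The argument rules for kb compile listed above can be sketched as a small helper that assembles the command line. This is only an illustration: the function name build_compile_command and its parameters are hypothetical and not part of kb, and the checks cover only the constraints stated in these notes (valid target values, and no --url or --ref together with the all target).

```python
from typing import List, Optional

# Target values listed in the release notes.
VALID_TARGETS = ("kallisto", "bustools", "all")


def build_compile_command(
    target: str,
    url: Optional[str] = None,
    ref: Optional[str] = None,
    out_dir: Optional[str] = None,
    cmake_arguments: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argv list for `kb compile`."""
    if target not in VALID_TARGETS:
        raise ValueError(f"target must be one of {VALID_TARGETS}, got {target!r}")
    # Per the notes, --url and --ref may not be combined with the `all` target.
    if target == "all" and (url is not None or ref is not None):
        raise ValueError("--url and --ref cannot be used when target is 'all'")

    cmd = ["kb", "compile", target]
    if url is not None:
        cmd += ["--url", url]
    if ref is not None:
        cmd += ["--ref", ref]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    if cmake_arguments is not None:
        # Passed as a single string, which kb forwards to cmake.
        cmd += ["--cmake-arguments", cmake_arguments]
    return cmd


if __name__ == "__main__":
    print(build_compile_command(
        "kallisto",
        cmake_arguments="-DCMAKE_CXX_FLAGS='-I /usr/include'",
    ))
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems with the nested quotes in the cmake flags.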
ref
Added --include-attribute and --exclude-attribute options, which can be used to include or exclude specific GTF entries based on their attributes. The argument to these options must be a key:value pair, where key is a GTF attribute name and value is the value of that attribute to include or exclude. Only one of these two options may be used at a time, but that option may be specified more than once. When multiple --include-attribute options are provided, GTF entries that have any one of the attributes will be processed. When multiple --exclude-attribute options are provided, GTF entries that have any one of the attributes will not be processed.
count
Added --filter-threshold option to specify the barcode filter threshold. This option may only be used when also providing --filter bustools and indicates the minimum number of times a barcode must appear to be retained from filtering. (#142)
Added --strand option to override the automatic strandedness setting used by kallisto bus. Available options are unstranded, forward, and reverse.
Changed the transcript_ids column to be a semicolon-delimited string instead of a list (only applicable when --tcc is provided) as a workaround for an issue with writing lists to h5ad with h5py>=3. #141
BULK and SMARTSEQ2 technologies. The two technologies behave identically. The FASTQs may be provided either directly via the command line (only for multiplexed samples), in which case kb will perform demultiplexing, or as a single batch definition text file (only for demultiplexed samples). See the batch.txt section of https://pachterlab.github.io/kallisto/manual for formatting. This batch text file may also contain remote URLs to FASTQ files, which will be streamed on supported operating systems. Additionally, added --parity, --fragment-l and --fragment-s options, which may only be provided for these technologies. The first must always be provided and indicates the parity of the reads (single, paired). The latter two may only be provided when --parity single is also provided, and specify the mean and the standard deviation of the fragment lengths.
The SMARTSEQ technology has been deprecated and will be removed in the next release. Instead, SMARTSEQ2 should be used. See previous point for more information.
SMARTSEQ3 technology.
--dry-run instead of an alias.
Added --umi-gene option, which deduplicates UMIs by gene. Cannot be used with smartseq or bulk technologies.
Added --em option, which estimates gene abundances using the EM algorithm. Cannot be used with smartseq or bulk technologies, or with --tcc.
-o option to bustools count already exists, but as a directory. For instance, counts_unfiltered/cells_x_genes. Such folders are removed before running the command.
Added --gene-names option, which may only be used with --h5ad or --loom and not --tcc. By specifying this option, the output h5ad or loom matrix will be aggregated by gene names instead of IDs.
BDWTA (BD Rhapsody), SPLIT-SEQ, Visium (10x).
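To picture what aggregating a matrix by gene names instead of IDs (the --gene-names behavior noted above) means, here is a minimal pure-Python sketch. It is not kb's implementation: the function name aggregate_by_gene_name, the toy gene IDs, the choice to sum counts, and the fallback to the gene ID when no name is known are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def aggregate_by_gene_name(
    counts: List[List[int]],
    gene_ids: List[str],
    id_to_name: Dict[str, str],
) -> Tuple[List[List[int]], List[str]]:
    """Sum the columns of a cells-by-genes matrix that share a gene name."""
    # Assumption of this sketch: fall back to the ID when no name is known.
    names = [id_to_name.get(gid, gid) for gid in gene_ids]
    # Preserve first-seen order of names for the output columns.
    unique_names = list(dict.fromkeys(names))
    col_of = {name: j for j, name in enumerate(unique_names)}

    aggregated = []
    for row in counts:
        new_row = [0] * len(unique_names)
        for value, name in zip(row, names):
            new_row[col_of[name]] += value
        aggregated.append(new_row)
    return aggregated, unique_names


# Toy data: two gene IDs map to the same name and are merged.
counts = [[1, 2, 3], [0, 5, 1]]
gene_ids = ["ID_A", "ID_B", "ID_C"]
id_to_name = {"ID_A": "GeneX", "ID_B": "GeneX", "ID_C": "GeneY"}

aggregated, names = aggregate_by_gene_name(counts, gene_ids, id_to_name)
print(names)       # ['GeneX', 'GeneY']
print(aggregated)  # [[3, 3], [5, 1]]
```

The output matrix has one column per distinct gene name, so it can have fewer columns than the ID-indexed matrix whenever several IDs share a name.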