Viral genomics analysis pipelines
New:
metagenomics.py taxlevel_summary to tabulate taxonomic abundance data from multiple Kraken-format summary files. [#792]
spikein to report spike-ins [#796]
contigs WDL workflow that runs depletion and SPAdes, and has a placeholder for future contig-based taxonomic classification steps [#796]
Changed:
assembly.scaffold: change name of final output file from {sample_name}.scaffold.fasta to {sample_name}.scaffolded_imputed.fasta [#796]
Fixed:
tbl_transfer_prealigned --oob_clip behavior. [#807]
tbl2asn spec [#787]
samples-*.txt files [#794]
Added/Upgraded:
New:
illumina.py illumina_demux now supports simplex runs: non-multiplexed flowcells that do not have a SampleSheet and have no B entries in their read_structure [#776, #759]. Support has also been added for iSeq/FireFly-style flowcell IDs. Simplex-basecalling functionality is not yet tested at the pipeline level (Snakemake or WDL).
Changed:
refine_2x_and_plot step now creates its final output filename based on the input reads bam instead of the input fasta (as in the pre-WDL dx workflows) and calls the final assembly samplename.fasta (instead of samplename.taxfilt.assembly1-trinity.scaffold.refine2.fasta). [#783]
final_assembly_fasta instead of refine2_assembly_fasta. [#783]
Fixed:
refine_2x_and_plot combined task (introduced in the last month as an optimization to reduce instance spin-up and staging time) was computing the assembly_length and assembly_length_unambiguous numbers incorrectly: it computed them on the input fasta (from scaffolding), not on the final output assembly. It was also accidentally invoking the plot_coverage python command twice. [#783]
deplete_human was randomly memory-starved and would fail with "sacrifice child" errors. We have increased the RAM limit for bmtagger (srprism) in the WDL invocation to 90% of RAM, up from 50% (since it never runs simultaneously with Java or anything else). [#781, #780]
rmdup_mvicuna_bam [#771]
taxon_filter.py deplete_human backwards compatibility wrapper around deplete [#775]
taxon_filter.py deplete_bwa_bam (was referring to non-existent argument --JVMmemory) [#778]
Added/upgraded:
nproc invocations of single-threaded Krona in the kraken WDL task. Addresses the observation that the Krona for loop was sometimes taking several times longer than Kraken itself. Also parallelizes the single-threaded tar/gzip calls after Krona. [#781]
New:
taxon_filter.py file now has a new command, deplete_bwa_bam, which uses bwa for depletion of sequence data provided in *.fasta format or pre-indexed bwa database format.
taxon_filter.py deplete via the --bwaDbs argument
Changed:
taxon_filter.py deplete_human command is now deprecated in favor of taxon_filter.py deplete. The deplete_human command will remain for the time being for compatibility.
align_and_plot workflow in WDL, Cromwell, DNAnexus.
Fixed:
Added/Upgraded:
viral-baseimage upgraded from 0.1.6 (zesty) to 0.1.8 (artful) with fixes for Spectre and Meltdown
Documentation:
This is a release with many changes, including new WDL pipelines, a distribution of viral-ngs on DNAnexus that will be updated in sync with the latest version of viral-ngs, the ability to provide multiple references for scaffolding, and several critical bug fixes. With this release, the Docker image for viral-ngs moves from Docker Hub to quay.io/broadinstitute/viral-ngs.
New:
pipes/WDL/ directory of viral-ngs. The pipelines can be executed locally or in the Google cloud via cromwell (on bioconda), or via the public distribution available on DNAnexus.
assembly.py::order_and_orient. Scaffolding is now performed against several references (in parallel); the reference that yields the most non-N bases is used for the scaffolded genome. For the positional argument, inReference, multiple FASTA files may now be provided, each containing one reference genome. Alternatively, multiple references may be given in a single file by also specifying the number of reference segments with the --nGenomeSegments parameter. If multiple references are given, they must all contain the same number of segments, listed in the same order.
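The selection rule described for order_and_orient can be illustrated with a minimal sketch. This is not the viral-ngs implementation: the function names and the dict-of-sequences input are hypothetical, and the sketch assumes each candidate scaffold is already available as a list of segment sequences keyed by the reference it was built against.

```python
def count_non_n_bases(segments):
    """Count bases other than N (case-insensitive) across all segments."""
    return sum(1 for seq in segments for base in seq.upper() if base != "N")


def pick_best_scaffold(candidates):
    """Return the reference name whose scaffold has the most non-N bases.

    `candidates` is a hypothetical mapping of reference name to the list of
    scaffolded segment sequences produced against that reference.
    """
    if not candidates:
        raise ValueError("at least one candidate scaffold is required")
    return max(candidates, key=lambda name: count_non_n_bases(candidates[name]))
```

Ties are resolved in favor of the first reference encountered, which is an arbitrary choice made for this sketch only.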
taxon_filter.py changes to deplete_bmtagger_bam and deplete_blastn_bam: can now accept blast/bmtagger databases as .tar.gz, .tar.lz4, .tar.bz2 bundles and also as unindexed fasta files (that will be indexed on the fly)
util.file.extract_tarball exposed on the CLI as read_utils.py::extract_tarball. Accepts stdin piped input.
Changed:
easy-deploy-viral-ngs.sh script
--threads argparse option now common and available across viral-ngs commands
illumina.py::illumina_demux
illumina.py::common_barcodes execution time has been reduced
easy-deploy-viral-ngs.sh, some messages have been moved from stdout to stderr
taxon_filter.py: clean up and optimization around blastn-based read depletion
Fixed:
reports.py::plot_coverage from removing the bam file provided as input if it is already sorted and duplicate removal is not being performed. In such cases the input bam is used directly and is now preserved.
diamond tests for accession taxonomy fixed: subprocess.PIPE replaced with named pipes to prevent deadlocks
taxon_filter.py::bmtagger_build_db default value for word_size is now 18, not 8
taxon_filter.py::deplete_bmtagger_bam and deplete_human
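As an illustration of the named-pipe approach mentioned in the diamond test fix above, the following sketch has a child process write to a FIFO on disk while the parent reads from it, rather than connecting the two through subprocess.PIPE. It is only a sketch under stated assumptions: the function name, the FIFO file name, and the inline child script are hypothetical, it requires a POSIX system (os.mkfifo), and it does not reproduce how the viral-ngs tests are actually wired.

```python
import os
import subprocess
import sys
import tempfile


def read_child_output_via_fifo(payload_lines):
    """Run a child process that writes lines to a named pipe and collect them."""
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo_path = os.path.join(tmpdir, "child_output.fifo")  # hypothetical name
        os.mkfifo(fifo_path)

        # Stand-in for a real tool: writes its extra arguments to the path in argv[1].
        child_code = (
            "import sys\n"
            "with open(sys.argv[1], 'w') as out:\n"
            "    for line in sys.argv[2:]:\n"
            "        out.write(line + '\\n')\n"
        )
        child = subprocess.Popen(
            [sys.executable, "-c", child_code, fifo_path] + list(payload_lines)
        )

        # Opening the FIFO for reading blocks until the child opens it for
        # writing; iteration ends when the child closes its end.
        with open(fifo_path) as fifo:
            received = [line.rstrip("\n") for line in fifo]

        returncode = child.wait()

    if returncode != 0:
        raise RuntimeError("child exited with status %d" % returncode)
    return received
```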
Added/Upgraded:
0.12.0.1 -> 0.13.0
1.5 -> 1.6
0.10.6_fork3 -> 1.0.0_fork3
131 added as requirement
2.3.4 added as requirement
2.5 added as requirement
Changed:
Fixed:
--notemp added to Snakemake call script to support usage of --immediate-submit as required by newer Snakemake versions
-l h_rt=hh:mm:ss spec now consistently using =
Assembly improvements including gap2seq and an alternative assembler (SPAdes). This will be replaced with formal release notes soon.
New:
s3://, gs://, sftp://) and if the system is preconfigured with credentials.
config.yaml file has been changed to include s3://* paths for pre-built databases, rather than Broad Institute-specific paths (and files listed are live and available for all!)
align_and_plot_coverage and read_utils.py::align_and_fix functions now expose an optional argument, --minScoreToFilter. When using bwa, this option calculates an alignment score for each query by summing the scores across the query's alignments, and keeps only the queries whose summed score is at least the specified threshold.
*.csv.gz format
VIRAL_NGS_TMP_DIRKEEP environment variable
rmdup_cdhit_bam. Note that this is not currently used in the pipeline by default.
gapfill_gap2seq. Note that this is not currently used in the pipeline by default.
--chunkSize in taxon_filter.deplete_human.
Changed:
-l h_rt hh:mm:ss be passed to schedule max runtime for each job
taxon_filter.deplete_human
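The filtering rule described for --minScoreToFilter under New above (sum each query's alignment scores, then keep queries whose total meets the threshold) can be sketched as follows. This is an assumption-laden illustration, not the viral-ngs code: the function name is hypothetical, and alignments are represented as plain (query_name, score) pairs rather than records parsed from bwa output.

```python
from collections import defaultdict


def filter_queries_by_total_score(alignments, min_score):
    """Keep alignments whose query has a summed score of at least min_score.

    `alignments` is a hypothetical list of (query_name, alignment_score) pairs;
    a query may appear more than once if it has several alignments.
    """
    totals = defaultdict(int)
    for query_name, score in alignments:
        totals[query_name] += score

    # Preserve the original order of the surviving alignments.
    return [
        (query_name, score)
        for query_name, score in alignments
        if totals[query_name] >= min_score
    ]
```

Because the threshold is applied to the per-query sum, a query with two moderate alignments can pass even when neither alignment alone would.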
Fixed:
\r\n for Windows vs. \n for Linux/Unix/macOS)
*.yaml files within *.yaml config files (for config includes/composition)
Upgraded:
719 -> 876
1.5
0.12.0.1
Fixed:
CheckIlluminaDirectory now creates symlinks only if necessary
plot_coverage for cases where an alignment has very high coverage depth
Changed:
s.locs file for cluster locations rather than per-tile location files. Picard's CheckIlluminaDirectory can create symlinks that take the place of per-tile *.locs files; however, these links can break when runs are moved between systems. The change in this release of viral-ngs allows broken links to be removed and corrected.
CheckIlluminaDirectory is now called on each call to illumina_demux to check run directories for validity prior to demultiplexing
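To make the broken-link cleanup above concrete, here is a minimal sketch of how dangling symlinks in a run directory could be detected and removed before they are recreated. The function names are hypothetical and this is not the viral-ngs or Picard logic; it relies only on the standard library, where a link is considered broken if os.path.islink is true but os.path.exists (which follows the link) is false.

```python
import os


def find_broken_symlinks(directory):
    """Return a sorted list of symlinks under `directory` whose targets are missing."""
    broken = []
    for root, dirnames, filenames in os.walk(directory):
        for name in dirnames + filenames:
            path = os.path.join(root, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


def remove_broken_symlinks(directory):
    """Delete every broken symlink under `directory` and return the removed paths."""
    removed = find_broken_symlinks(directory)
    for path in removed:
        os.unlink(path)
    return removed
```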