Viral Ngs Versions

Viral genomics analysis pipelines

v1.19.3

6 years ago

New:

  • WDL workflow for Genbank submission [#797, #793, #800]
  • metagenomics.py taxlevel_summary to tabulate taxonomic abundance data from multiple Kraken-format summary files. [#792]
  • WDL workflow spikein to report spike-ins [#796]
  • new contigs WDL workflow that runs depletion, SPAdes, and has a placeholder for future contig-based taxonomic classification steps to be added later [#796]
  • sample name is now reported in Krona reports as the "root"-level name [#785]
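
As a rough illustration of the kind of tabulation taxlevel_summary performs, the sketch below collects read counts at one taxonomic rank from several Kraken-format report files. It is not the metagenomics.py implementation: the function name tabulate_rank, its parameters, and the choice of the clade read count column are assumptions made for illustration. It also assumes the standard six-column, tab-separated Kraken report layout (percent, clade reads, direct reads, rank code, taxid, name).

```python
import csv
from collections import defaultdict


def tabulate_rank(report_paths, rank_code="S"):
    """Return {taxon_name: {report_path: clade_read_count}} for one rank.

    Hypothetical helper: assumes six tab-separated columns per line
    (percent, clade reads, direct reads, rank code, taxid, name).
    """
    table = defaultdict(dict)
    for path in report_paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 6 or row[3].strip() != rank_code:
                    continue  # skip malformed lines and other ranks
                name = row[5].strip()  # names are indented by tree depth
                table[name][path] = int(row[1])
    return dict(table)
```

Each taxon maps to a per-file dictionary, so a taxon missing from one report simply has no entry for that file.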

Changed:

  • WDL workflows for depletion now default to use an hg19 BWA database instead of an hg19 bmtagger database [#796]
  • WDL assembly.scaffold: change name of final output file from {sample_name}.scaffold.fasta to {sample_name}.scaffolded_imputed.fasta [#796]

Fixed:

  • Fixes and updates to tbl_transfer_prealigned --oob_clip behavior. [#807]
  • Fixes version mismatch for tbl2asn spec [#787]
  • Addresses edge cases of tbl2asn usage [#797]
  • In Snakemake workflow, skip blank lines in samples-*.txt files [#794]
  • Bugfix for the NCBI annotation step in the Snakemake workflow for Genbank submission prep [#786]
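
The blank-line fix for samples-*.txt files can be pictured with a minimal sketch like the one below. The function name read_sample_names is hypothetical and this is not the Snakemake workflow's own code; it only shows the behavior of ignoring empty or whitespace-only lines in a one-name-per-line file.

```python
def read_sample_names(path):
    """Return the non-blank lines of a sample list, stripped of whitespace."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```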

Added/Upgraded:

  • dxWDL version 0.59->0.60.2 [#797]
  • base image used for Docker container 0.1.8->0.1.9 to include security fixes from upstream [#802]

v1.19.2

6 years ago

New:

  • illumina.py illumina_demux now supports simplex runs: non-multiplexed flowcells that do not have a SampleSheet and have no B entries in their read_structure [#776, #759]. Support has also been added for iSeq/FireFly-style flowcell IDs. Simplex-basecalling functionality is not yet tested at the pipeline level (Snakemake or WDL).

Changed:

  • Filename cleanups: more assembly WDL tasks now strip off suffixes left by previous steps (e.g. "taxfilt" or "assembly1-trinity") before appending their new suffixes, resulting in cleaner filenames throughout the assembly workflow. The final refine_2x_and_plot step now creates its final output filename based on the input reads bam instead of the input fasta (as in the pre-WDL dx workflows) and calls the final assembly samplename.fasta (instead of samplename.taxfilt.assembly1-trinity.scaffold.refine2.fasta). [#783]
  • Final assembly output file variable is now final_assembly_fasta instead of refine2_assembly_fasta. [#783]
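
The suffix stripping described above could look roughly like the following. This is an illustrative Python sketch only (the WDL tasks themselves are not shown in these notes); the helper name clean_basename and the KNOWN_SUFFIXES list are assumptions built from the example filenames in this entry.

```python
import os

# Assumed list of intermediate-step suffixes, taken from the example above.
KNOWN_SUFFIXES = (".taxfilt", ".assembly1-trinity", ".scaffold", ".refine2")


def clean_basename(filename, new_suffix=""):
    """Strip the file extension and any known step suffixes, then append new_suffix."""
    base = os.path.basename(filename)
    for ext in (".fasta", ".bam"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    stripped = True
    while stripped:  # repeat until no known suffix remains at the end
        stripped = False
        for suffix in KNOWN_SUFFIXES:
            if base.endswith(suffix):
                base = base[: -len(suffix)]
                stripped = True
    return base + new_suffix
```

For example, clean_basename("samplename.taxfilt.assembly1-trinity.scaffold.refine2.fasta", ".fasta") yields "samplename.fasta".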

Fixed:

  • The WDL refine_2x_and_plot combined task (introduced in the last month as an optimization to reduce instance spin up and staging time) was computing the assembly_length and assembly_length_unambiguous numbers incorrectly: it was computing on the input fasta (from scaffolding), not the output / final assembly. It was also accidentally invoking the plot_coverage python command twice. [#783]
  • bmtagger in deplete_human was randomly memory-starved and would fail with "sacrifice child" errors. We have increased bmtagger (srprism)'s RAM limit in the WDL invocation to 90% RAM, up from 50% (since it never runs simultaneously with Java or anything else). [#781, #780]
  • Allow Picard SamToFastq more RAM during rmdup_mvicuna_bam [#771]
  • bugfix in taxon_filter.py deplete_human backwards compatibility wrapper around deplete [#775]
  • bugfix argparse setup for standalone command taxon_filter.py deplete_bwa_bam (was referring to non-existent argument --JVMmemory) [#778]
  • No longer silently swallow runtime errors during scaffolding due to a bug in how we parallelized it. [#772]
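
For context on the assembly_length and assembly_length_unambiguous fix, the sketch below shows one way such numbers can be computed from a FASTA file; the point of the fix is that they must be computed on the final assembly, not the scaffolding input. The function name is hypothetical, and treating only A, C, G and T as unambiguous is an assumption; the pipeline's own definition may differ.

```python
def assembly_lengths(fasta_path):
    """Return (total_bases, unambiguous_bases) for a FASTA file.

    Assumption: "unambiguous" means A, C, G or T (case-insensitive).
    """
    total = unambiguous = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # header line, not sequence
            seq = line.strip().upper()
            total += len(seq)
            unambiguous += sum(seq.count(base) for base in "ACGT")
    return total, unambiguous
```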

Added/Upgraded:

  • upgrade from mummer3 to mummer4, which fixes a few odd bugs, speeds up scaffolding, and increases the genome size we can scaffold, while producing otherwise identical output. [#772, #677]
  • bump Picard from 2.13 to 2.17.5 [#774]
  • bump Cromwell from v29 to v30.2 [#773]
  • bump dxWDL from 0.58.1 to 0.59 [#782]
  • bump wdltool from 0.14 to cromwell/womtool 30.2 [#782]
  • WDL pipeline: uses GNU Parallel to parallelize up to nproc invocations of single-threaded Krona in the kraken WDL task. Addresses the observation that the Krona for loop was sometimes taking several times longer than Kraken itself. Also parallelizes the single-threaded tar/gzip calls after Krona. [#781]
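
The Krona change above runs many single-threaded invocations side by side, up to the number of available processors. The WDL task does this with GNU Parallel; the sketch below shows the same idea in Python with concurrent.futures instead, purely as an illustration. The function name run_parallel and its arguments are hypothetical.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, max_workers=None):
    """Run each command (an argv list) concurrently; return exit codes in input order."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads only wait on child processes, so the work itself
        # happens in the separate single-threaded subprocesses.
        return list(pool.map(subprocess.call, commands))
```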

v1.19.1

6 years ago

New:

  • [#762] The taxon_filter.py file now has a new command, deplete_bwa_bam, which uses bwa for depletion of sequence data provided in *.fasta format or pre-indexed bwa database format.
  • [#762] bwa-based depletion is now available as an option in taxon_filter.py deplete via the --bwaDbs argument

Changed:

  • The taxon_filter.py deplete_human command is now deprecated in favor of taxon_filter.py deplete. The deplete_human command will remain for the time being for compatibility.
  • [#755, #766] Add a new align_and_plot workflow in WDL, Cromwell, DNAnexus.

Fixed:

  • [#761] Fix a tar extraction bug when running within the Docker container as root
  • [#765] Fix TruSight illumina indexes
  • [#760] Prevent ambiguous contig alignment during scaffolding from causing hard failures (warn and proceed with remaining contigs)
  • [#741] When scaffolding against multiple reference genomes, allow some to fail, as long as some succeed
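
The "allow some to fail, as long as some succeed" behavior for multiple reference genomes can be sketched as follows. This is not the assembly.py code: scaffold_against_all and the scaffold_one callback are hypothetical names, and the sketch assumes a failed scaffolding attempt surfaces as a Python exception.

```python
import logging


def scaffold_against_all(references, scaffold_one):
    """Apply scaffold_one to each reference, tolerating individual failures.

    Returns {reference: result}. Raises RuntimeError only if every
    reference failed.
    """
    results = {}
    for ref in references:
        try:
            results[ref] = scaffold_one(ref)
        except Exception as err:  # warn and continue with the remaining references
            logging.warning("scaffolding against %s failed: %s", ref, err)
    if not results:
        raise RuntimeError("scaffolding failed for all references")
    return results
```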

Added/Upgraded:

  • [#752, #750] DNAnexus workflows now include defaulted file parameters for various databases
  • [#751] MVicuna duplicate removal is now parallelized if multiple read groups exist in the input BAM
  • [#756, #767] Docker viral-baseimage upgraded from 0.1.6 (zesty) to 0.1.8 (artful) with fixes for Spectre and Meltdown

Documentation:

  • [#768] fixed explanation of manual conda installation
  • [#739] removed deprecated virtualized install from docs
  • [#746] explain flowcells.txt a bit more

v1.19.0

6 years ago

This is a release with many changes, including new WDL pipelines, a distribution of viral-ngs on DNAnexus that will be updated in sync with the latest version of viral-ngs, the ability to provide multiple references for scaffolding, and several critical bug fixes. With this release, the Docker image for viral-ngs moves from Docker Hub to quay.io/broadinstitute/viral-ngs.

New:

  • WDL pipelines have been added, inspired by the previous DNAnexus implementation of viral-ngs. The WDL files currently reside within the pipes/WDL/ directory of viral-ngs. The pipelines can be executed locally or in the Google cloud via Cromwell (on bioconda), or via the public distribution available on DNAnexus.
    • WDL workflows are tested locally on Travis via Cromwell
    • WDL workflows are compiled for DNAnexus via dxWDL, and tested on DNAnexus
  • a simple form of reference selection via assembly.py::order_and_orient. Scaffolding is now performed using several references (in parallel); the one that yields the most non-N bases is chosen to be used for the scaffolded genome. For the positional argument, inReference, multiple FASTA files may now be provided, each containing one reference genome. Alternatively, multiple references may be given by specifying a single filename, and giving the number of reference segments with the --nGenomeSegments parameter. If multiple references are given, they must all contain the same number of segments listed in the same order.
    • This has been included in the new WDL pipelines
  • New kraken execution strategy to process multiple inputs in one run
  • taxon_filter.py changes to deplete_bmtagger_bam and deplete_blastn_bam: can now accept blast/bmtagger databases as .tar.gz, .tar.lz4, .tar.bz2 bundles and also as unindexed fasta files (that will be indexed on the fly)
  • new internal function util.file.extract_tarball exposed on the CLI as read_utils.py::extract_tarball. Accepts stdin piped input.
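
The reference selection rule described for order_and_orient above (scaffold against several references, keep the result with the most non-N bases) reduces to a small comparison. The sketch below is illustrative rather than the assembly.py implementation; the function names and the dictionary input format are assumptions.

```python
def count_non_n(sequences):
    """Count bases that are not N (case-insensitive) across all segments."""
    return sum(len(seq) - seq.upper().count("N") for seq in sequences)


def pick_best_scaffold(scaffolds):
    """Given {reference_name: [segment sequences]}, return the name with the most non-N bases."""
    return max(scaffolds, key=lambda name: count_non_n(scaffolds[name]))
```

Ties go to whichever reference appears first in the dictionary, which is a property of this sketch and not a documented behavior of the tool.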

Changed:

  • various and extensive changes to how the viral-ngs Docker image is prepared and distributed:
    • Note: The Docker image is now available from quay.io/broadinstitute/viral-ngs, which is faster for staging than Docker Hub
    • the Docker image build process no longer relies on the easy-deploy-viral-ngs.sh script
  • --threads argparse option now common and available across viral-ngs commands
  • optimizations in illumina.py::illumina_demux
  • illumina.py::common_barcodes execution time has been reduced
  • in easy-deploy-viral-ngs.sh, some messages have been moved from stdout to stderr
  • taxon_filter.py: clean up and optimization around blastn-based read depletion
  • various development-related changes including:
    • travis cleanup re: pip package installs, conditionals, build matrix
    • Docker deployment bugfixes

Fixed:

  • prevent reports.py::plot_coverage from removing the bam file provided as input if it is already sorted and dupe removal is not being performed. In such cases the input bam is used directly and is now preserved.
  • diamond tests for accession taxonomy fixed: subprocess.PIPE replaced with named pipes to prevent deadlocks
  • taxon_filter.py::bmtagger_build_db default value for word_size is now 18, not 8
  • fixes the use of fasta databases for taxon_filter.py::deplete_bmtagger_bam and deplete_human

Added/Upgraded:

  • pysam 0.12.0.1 -> 0.13.0
  • samtools 1.5 -> 1.6
  • kraken 0.10.6_fork3 -> 1.0.0_fork3
  • lz4-bin 131 added as requirement
  • pigz 2.3.4 added as requirement
  • lbzip2 2.5 added as requirement

v1.18.2

6 years ago

Changed:

  • Demultiplexing from Illumina basecalls is now more permissive of varying input directory
  • [dev-related] conda package and Docker image are now built on each branch commit

Fixed:

  • [dev-related] package and docker build now optimized for more rapid build+test
  • --notemp added to Snakemake call script to support usage of --immediate-submit as required by newer Snakemake versions
  • Snakemake pipeline demux fixed
  • -l h_rt=hh:mm:ss spec now consistently using =

v1.18.1

6 years ago

Assembly improvements including gap2seq and an alternative assembler (SPAdes). This will be replaced with formal release notes soon.

v1.18.0

6 years ago

New:

  • The Snakemake pipeline can now source database files from S3, GS, or SFTP if given protocol-prefixed paths (s3://, gs://, sftp://) and if the system is preconfigured with credentials.
  • The config.yaml file has been changed to include s3://* paths for pre-built databases, rather than Broad Institute-specific paths (and files listed are live and available for all!)
  • Kraken is now enabled on OSX, though significant RAM is required to use it
  • The reports.py::align_and_plot_coverage and read_utils.py::align_and_fix functions now expose an optional argument, --minScoreToFilter. This adds an option—when using bwa—to calculate an alignment score for each query by summing the scores across the query's alignments, and keep only the queries whose score is at least the value of the specified threshold.
  • sample sheets can now be specified in *.csv.gz format
  • For debugging or more bespoke analysis, temp files can now be kept more easily by setting the VIRAL_NGS_TMP_DIRKEEP environment variable
  • The cd-hit-dup tool has been added as an alternative to mvicuna for removing duplicate reads, via a new CLI function read_utils.py::rmdup_cdhit_bam. Note that this is not currently used in the pipeline by default.
  • The Gap2Seq tool has been added for filling gaps between contigs. It is exposed via the new CLI command: assembly.py::gapfill_gap2seq. Note that this is not currently used in the pipeline by default.
  • The Spades assembler has been added as an alternative to Trinity for de novo assembly. Note that this is not currently used in the pipeline by default.
  • Expose blastn --chunkSize in taxon_filter.deplete_human.
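
The --minScoreToFilter behavior described above (sum the alignment scores across each query's alignments, keep queries whose total is at least the threshold) can be sketched as below. This is a simplified illustration, not the read_utils.py code: the function name and the input format of (query name, score) pairs are assumptions.

```python
from collections import defaultdict


def filter_by_total_score(alignments, min_score):
    """Return the set of query names whose summed alignment score is >= min_score.

    alignments: iterable of (query_name, alignment_score) pairs, with one
    pair per alignment, so a query may appear several times.
    """
    totals = defaultdict(int)
    for query, score in alignments:
        totals[query] += score
    return {query for query, total in totals.items() if total >= min_score}
```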

Changed:

  • metagenomics rules in the Snakemake pipeline now break out kraken files as separate targets
  • improvements to speed of automated tests
  • The source and binaries for mvicuna and v-phaser2 have been removed from this repository since they now reside in their own repositories
  • viral-ngs is no longer tested against or distributed for Python 3.4, from this release forward. This should not impact users since the package is typically installed in an isolated conda environment with Python 3.5 or 2.7.
  • The Snakemake rules and cluster-submitter have been updated to reflect changes to the UGER cluster system at the Broad Institute, which now requires that -l h_rt hh:mm:ss be passed to schedule max runtime for each job
  • performance improvements to lastal filtering
  • lastal database is now built automatically if not supplied pre-built
  • SPAdes wrapper more resilient to empty fastq inputs
  • Reimplement samtools.filterByCigarString using pysam instead of samtools
  • Kraken on OSX now exists on broad-viral: enable it in OSX git hooks and turn on all tests
  • Remove lastal optional outputs from taxon_filter.deplete_human

Fixed:

  • In the Snakemake pipeline, code that reads sample sheets and barcode files is now more tolerant of different formats, including files formatted with Windows-style newlines (\r\n for Windows vs. \n for Linux/Unix/macOS)
  • fixed handling of empty subtrees when importing *.yaml files within *.yaml config files (for config includes/composition)
  • fixed other edge cases related to config imports
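
Tolerance of Windows-style newlines in sample sheets can be achieved in several ways; one minimal approach in Python is to let the csv module handle line endings. The sketch below is not the pipeline's own parser: the function name, the tab delimiter default, and the column names used in the usage note are assumptions.

```python
import csv


def read_sample_sheet(path, delimiter="\t"):
    """Read a delimited sample sheet into a list of dicts.

    Opening with newline="" leaves record splitting to the csv reader,
    which accepts both \\r\\n and \\n line endings.
    """
    with open(path, newline="") as handle:
        rows = csv.DictReader(handle, delimiter=delimiter)
        return [
            {key: (value or "").strip() for key, value in row.items() if key is not None}
            for row in rows
        ]
```

A file with the hypothetical columns sample and barcode_1 parses to the same list of dicts whether it was saved with \r\n or \n line endings.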

Upgraded:

  • last 719 -> 876
  • Update samtools to 1.5
  • Update pysam to 0.12.0.1

v1.17.3

6 years ago

Fixed:

  • CheckIlluminaDirectory now creates symlinks only if necessary
  • Fix plot_coverage for cases where an alignment has very high coverage depth

v1.17.2

6 years ago

Changed:

  • Improved HiSeq X / HiSeq 4000 compatibility: broken symlinks are now removed from Illumina lane directories if present. This is helpful for HiSeq-X/4000 systems, which write out a single s.locs file for cluster locations rather than per-tile location files. Picard's CheckIlluminaDirectory can create symlinks that take the place of per-tile *.locs files; however, these links can break when runs are moved between systems. The change in this release of viral-ngs allows broken links to be removed and corrected.
  • CheckIlluminaDirectory is now called on each call to illumina_demux to check run directories for validity prior to demultiplexing
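
Detecting and removing broken symlinks, as described above for lane directories, takes only a few lines. The sketch below is a generic illustration with a hypothetical function name, not the illumina.py code; it removes any symlink in a directory whose target no longer exists.

```python
import os


def remove_broken_symlinks(directory):
    """Delete symlinks in `directory` whose targets no longer exist; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # islink() inspects the link itself; exists() follows it to the target.
        if os.path.islink(path) and not os.path.exists(path):
            os.unlink(path)
            removed.append(name)
    return removed
```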

v1.17.1

6 years ago

Fixed:

  • two issues corrected in using bwa mem for alignment (see commit)

Changed:

  • a few internal function calls (see commits)