Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
Fixed a bug introduced in v3.0.0 which caused the default path for translations to be incorrect. This affected only users who used --output-all without passing a custom path template via --output-translations. The new default path is nextclade.cds_translation.{cds}.fasta, where {cds} gets replaced with the name of the CDS, e.g. nextclade.cds_translation.S.fasta for SARS-CoV-2's spike protein.
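As a minimal illustration of how the {cds} placeholder expands into per-CDS file names, here is a Python sketch. The helper expand_translation_path is hypothetical and not part of Nextclade; it only mirrors the substitution described above.

```python
def expand_translation_path(template: str, cds_name: str) -> str:
    """Replace the {cds} placeholder with a CDS name (illustrative helper only)."""
    return template.replace("{cds}", cds_name)


# Default template quoted in the release notes.
DEFAULT_TEMPLATE = "nextclade.cds_translation.{cds}.fasta"

# One output path per CDS; "S" is the SARS-CoV-2 spike CDS, "N" is just a second example name.
for cds in ["S", "N"]:
    print(expand_translation_path(DEFAULT_TEMPLATE, cds))
```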
Fixed a bug where the nextclade dataset get command failed to download a dataset if the dataset had more than one released version.
The default path for translations in v3 is nextclade.cds_translation.{cds}.fasta. Before v3, the default path was nextclade_gene_{gene}.translation.fasta. You can emulate the old (default) behavior by passing --output-translations="nextclade_gene_{cds}.translation.fasta" to nextclade3.
Nextclade CLI can be downloaded from the links in the "Assets" section just below. Note the difference in operating systems and computer architectures.
Nextclade Web is available at https://clades.nextstrain.org
Docker images are available at DockerHub
To understand how it all works, make sure to read the Documentation
We are happy to present a major release of Nextclade, containing new features and bug fixes.
⚠️ This release contains breaking changes which may require your attention.
This section briefly lists breaking changes in Nextclade v3 compared to Nextclade v2. Please see Nextclade v3 migration guide (alternative link) for a detailed description of each breaking change and of possible migration paths.
The sections below list all changes, breaking and non-breaking. The breaking changes are denoted with the word [BREAKING].
If you encounter problems during migration, or breaking changes not mentioned in this document, please report them to the developers by opening a new GitHub issue.
The seed matching algorithm was rewritten to be more robust and handle sequences with higher diversity. For example, RSV-A can now be aligned against RSV-B.
Parameters minSeeds, seedLength, seedSpacing, minMatchRate, mismatchesAllowed, maxIndel no longer have any effect and are removed.
New parameters kmerLength, kmerDistance, minMatchLength, allowedMismatches, windowSize are added.
Default values should work for sequences with a diversity of up to X%. For sequences with higher diversity, the parameters may need to be adjusted.
For short sequences, the threshold length to use full-matrix alignment is now determined based on kmerLength instead of the removed seedLength. The coefficient is adjusted to roughly match the old final value.
Nextclade now treats genes only as containers for CDSes ("CDS" is coding sequence). CDSes are the main unit of translation and a basis for AA mutations now. A gene can contain multiple CDSes, but they are handled independently.
A CDS can consist of multiple fragments. These fragments are extracted from the full nucleotide genome independently and joined together (in the order provided in the genome annotation) to form the nucleotide sequence of the CDS. The CDS is then translated and the resulting polypeptides are analyzed (mutations are detected etc.). This implementation makes it possible to handle slippage (e.g. ORF1ab in coronaviruses) and splicing (e.g. tat and rev in HIV-1).
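The fragment-joining step can be sketched in Python as follows. This is a simplified illustration, not Nextclade's actual implementation (which is written in Rust): the function names, the toy genome and the fragment coordinates (0-based, end-exclusive) are all assumptions made for the example. The second fragment starts one nucleotide before the first one ends, which mimics a -1 slippage.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def extract_cds(genome: str, fragments: list[tuple[int, int]]) -> str:
    """Extract each fragment independently and join them in annotation order."""
    return "".join(genome[begin:end] for begin, end in fragments)


def translate(nuc: str) -> str:
    """Translate complete codons; unknown codons become X."""
    usable = len(nuc) - len(nuc) % 3
    return "".join(CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(0, usable, 3))


genome = "ATGAAACGTTAA"            # toy genome, hypothetical
fragments = [(0, 7), (6, 11)]      # position 6 is read twice (slippage-like)

cds = extract_cds(genome, fragments)
print(cds, translate(cds))
```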
If the genome annotation describes a CDS fragment as circular (wrapping around the origin), Nextclade splits it into multiple linear (non-wrapping) fragments. The translation and analysis are then performed as if the genome were linear.
Nextclade follows the GFF3 specification. Please refer to it for how to describe circular features.
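The splitting of a wrapping fragment can be illustrated with a short Python sketch. It assumes coordinates already converted to 0-based, end-exclusive form, and relies on the GFF3 convention that a feature crossing the origin has an end coordinate larger than the landmark length. The function name is hypothetical and the code is not taken from Nextclade.

```python
def split_wrapping_fragment(begin: int, end: int, genome_len: int) -> list[tuple[int, int]]:
    """Split a fragment that runs past the end of a circular genome into linear parts.

    Coordinates are 0-based and end-exclusive. A wrapping fragment is
    recognized by end > genome_len.
    """
    if not 0 <= begin < genome_len or end <= begin:
        raise ValueError("invalid fragment coordinates")
    if end <= genome_len:
        return [(begin, end)]                    # already linear
    # The tail up to the genome end, then the head starting at the origin.
    return [(begin, genome_len), (0, end - genome_len)]


print(split_wrapping_fragment(8, 13, 10))
print(split_wrapping_fragment(2, 5, 10))
```

The resulting parts can then be joined in order like any other multi-fragment CDS.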
The GFF3 file parser has been augmented to support all the types of genetic features necessary for Nextclade to operate. There are still feature types which Nextclade ignores. We can consider supporting more types as scientific need arises.
Nextclade v3 now has the ability to phylogenetically resolve relationships between input sequences, where v2 would only attach each query sequence independently to the reference tree. Nextclade v3 thus may produce trees that are different from the trees produced in Nextclade v2.
Please read the Phylogenetic placement section in the documentation for more details.
We no longer treat mutations to ambiguous nucleotides as reversions. Previously, if the attachment node carried a mutation with respect to the reference and the query sequence had an ambiguous nucleotide at that position, this was counted as a reversion. This change only affects the "private mutation" QC score and the classification of private mutations into "reversion substitution" and "unlabeled substitution".
Nextclade Web can now optionally suggest the most appropriate dataset(s) for user-provided input sequences. Drop your sequences and click "Suggest" to try out this feature.
Following changes in genome annotation handling, the genome annotations widget in Nextclade Web now shows CDS fragments instead of genes.
The gene selector dropdown in Nextclade Web's results table has been transformed into a more general genetic feature selector. It shows the hierarchy of genetic features if there are nested features. Otherwise, the list is flat, to save screen space. It shows the type of each genetic feature (gene, CDS or protein) as a colorful badge. The menu is searchable, which is useful for mpox and other large viruses with many genes. Only CDSes can be selected currently, but we may extend this in the future to more feature types.
Nucleotide sequence views (in the results table) now also show colored markers for ambiguous nucleotides (non-ACTGN).
The row of buttons, containing "Back", "Tree" and other buttons is removed. Instead, different sections of the web application are always accessible via the main navigation bar.
The "Export" ("Download") and "Settings" sections are moved to dedicated pages.
Due to changes in the dataset format and input files, the URL parameters have the following changes:
input-root-seq renamed to input-ref
input-gene-map renamed to input-annotation
input-pathogen-json added
input-qc-config removed
input-pcr-primers removed
input-virus-properties removed
dataset-reference removed
The nextclade.errors.csv and nextclade.insertions.csv files are removed and no longer appear in the "Export" dialog, nor are they included in the nextclade.zip archive of all outputs.
Errors and insertions are now included in the nextclade.csv and nextclade.tsv files.
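For bookmarked or generated Nextclade Web links, the URL parameter renames and removals listed above could be applied with a small Python script along these lines. This is only a sketch: the function name and the example URL are hypothetical, and it handles nothing beyond the parameters named in this section.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Renames and removals as listed in these release notes.
RENAMED = {"input-root-seq": "input-ref", "input-gene-map": "input-annotation"}
REMOVED = {"input-qc-config", "input-pcr-primers", "input-virus-properties", "dataset-reference"}


def migrate_url(url: str) -> str:
    """Rewrite v2-style URL parameters to their v3 names, dropping removed ones."""
    parts = urlsplit(url)
    params = [
        (RENAMED.get(key, key), value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in REMOVED
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical v2-style link; example.com stands in for a real file location.
old = "https://clades.nextstrain.org/?input-root-seq=https%3A%2F%2Fexample.com%2Fref.fasta&input-qc-config=x"
print(migrate_url(old))
```

Information that used to be passed through the removed parameters would need to be supplied via input-pathogen-json instead; the sketch does not attempt that conversion.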
The Auspice tree viewer component is updated from version 2.45.2 to 2.51.0. See the Auspice releases or changelog.
Nextalign CLI is no longer provided as a standalone application along with Nextclade CLI v3 because Nextclade now has all the features that distinguished Nextalign. This means there's only one set of command line arguments to remember. Nextclade CLI runs the same algorithms, accepts the same inputs and provides the same outputs as v2 Nextalign, plus some more. For most use-cases, the CLI interface and the input and output files should be the same or very similar.
Due to changes in the seed alignment algorithm, the following parameters are no longer used and the corresponding CLI arguments and JSON fields under alignmentParams in pathogen.json (previously virus_properties.json) were removed:
--seed-length
--seed-spacing
--max-indel
--min-match-rate
--min-seeds
--mismatches-allowed
The following new alignment parameters were added:
--allowed-mismatches
--kmer-distance
--kmer-length
--min-match-length
--min-seed-cover
--max-alignment-attempts
--max-band-area
--window-size
Due to changes in the dataset format the following CLI arguments were removed:
--input-virus-properties
--input-qc-config
--input-pcr-primers
in favor of --input-pathogen-json.
The arguments --output-errors and --output-insertions have been removed. Their information is now included in --output-csv and --output-tsv.
The argument --input-gene-map is renamed to --input-annotation. The short form -m remains unchanged.
The argument --genes is renamed to --cds-selection. The short form -g remains unchanged.
Nextclade can now also export the tree in Newick format via the --output-tree-nwk argument.
Most input files and files inside datasets are now optional. This simplifies dataset creation and maintenance and allows for step-by-step, incremental extension of them. You can start only with a reference sequence, which will only allow for alignment and very basic mutation calling in Nextclade, and later you can add more functionality. Optional input files also enable the removal of Nextalign CLI.
If you maintain a custom dataset or want to try creating one - refer to our Dataset curation guide. Community contributed datasets are welcome!
The old phylogenetic tree placement behavior can be restored by adding the --without-greedy-tree-builder flag.
dataset list command
The new argument --only-names prints a concise list of dataset names:
nextclade dataset list --only-names
The new argument --search allows searching datasets using a substring match against dataset name, dataset friendly name, reference name or reference accession:
nextclade dataset list --search=flu
The argument --json outputs a JSON object instead of the table. You can write it into a file or postprocess it:
nextclade dataset list --json > "dataset_list.json"
nextclade dataset list --json | jq '.[] | select(.path | startswith("nextstrain/sars-cov-2")) | .attributes'
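The same selection as the jq filter above can be sketched in Python. The structure assumed here (a top-level array of objects with path and attributes fields) is inferred only from that jq expression; the sample entries below are made up for illustration and the function name is hypothetical.

```python
import json


def select_attributes(dataset_list_json: str, path_prefix: str) -> list:
    """Return the attributes of every dataset whose path starts with the prefix."""
    datasets = json.loads(dataset_list_json)
    return [
        entry.get("attributes")
        for entry in datasets
        if str(entry.get("path", "")).startswith(path_prefix)
    ]


# Hypothetical, heavily trimmed sample; real output contains more fields.
sample = json.dumps([
    {"path": "nextstrain/sars-cov-2/example", "attributes": {"name": "example A"}},
    {"path": "nextstrain/mpox/example", "attributes": {"name": "example B"}},
])
print(select_attributes(sample, "nextstrain/sars-cov-2"))
```

To process real output, the contents of the dataset_list.json file written by the first command can be read and passed in place of the sample string.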
sort
The sort subcommand takes your sequences in FASTA format and outputs sequences grouped by dataset in the form of a directory tree. Each subdirectory corresponds to a dataset and contains an output FASTA file with only sequences that are detected to be similar to the reference sequence in this dataset.
Example usage:
nextclade sort --output-dir="out/sort/" --output-results-tsv="out/sort.tsv" "input.fasta"
This can be useful for splitting FASTA files containing sequences which belong to different pathogens, strains or segments, for example for separating flu HA and NA segments.
read-annotation
The read-annotation subcommand takes a GFF3 file and displays how features are arranged hierarchically as viewed by Nextclade. This is useful for Nextclade developers and dataset creators to verify (and debug) how Nextclade understands genetic features from a particular GFF3 file.
Example usage:
nextclade read-annotation genome_annotation.gff3
Type nextclade read-annotation --help for a description of arguments.
Nextclade Web now uses multithreading more effectively. This results in faster processing of large FASTA files on computers with more than one processor. The speedup is around 2x for 1000 SARS-CoV-2 sequences on a multi-core machine.
The new features caused changes in major internal data structures and made them more complex. We now generate JSON schema and TypeScript typings from Rust code. This helps to find mismatches between parts written in different languages, and to avoid bugs related to data types.
The change in genome annotation handling had significant consequences for the coordinate spaces Nextclade uses internally (e.g. alignment space vs reference space, nuc space vs aa space, global nuc space vs nuc space local to a CDS). In order to make coordinate transforms safer, we introduced new Position and Range types, different for each space. This prevents mixing up coordinates in different spaces.
Nextclade CLI & Nextalign CLI can be downloaded from the links in the "Assets" section just below. Click "Show all" at the bottom of the "Assets" section to show more download options. Note the difference between "nextalign" and "nextclade" files as well as differences in operating systems and computer architectures.
⚠️ This is a pre-release. It can contain bugs and significant changes which are not yet finalized. Changes may appear without notice. We recommend trying the pre-releases to learn about upcoming features. For important projects, use stable releases.
For changes compared to the previous final release version, please refer to the "Unreleased" section in CHANGELOG.md
For some viruses, genome sequencing is unreliable in specific parts of the genome or some regions should be ignored for other reasons when calculating distances between nodes for the purpose of placing query sequences on the reference tree. These distances are used to find the optimal (smallest distance) placement of the query sequence on the reference tree and sequence errors in these regions can lead to wrong placement.
Until now, to place query sequences on the reference tree, Nextclade counted all nucleotide differences between query and reference sequence. Moving forward, sequence regions to be ignored for reference tree placement can be defined in datasets' virus_properties.json. This is useful, for example, for SARS-CoV-2, where we will start ignoring the terminal parts of the untranslated regions. Another use case is mpox, where the terminal repeats are intrinsically constrained to be identical. Masking one of the two terminals will avoid double-counting of the same mutations.
PR #1128 adds this feature to Nextclade's algorithm.
Masked ranges are specified in the new field placementMaskRanges in datasets' virus_properties.json. For example, the terminal 50 nucleotides of SARS-CoV-2 can be ignored for tree placement by adding the following line (positions are 0-based and end-exclusive):
"placementMaskRanges":[{"begin":0,"end":50},{"begin":29850,"end":29902}],
The changes are backwards compatible: if the field does not exist, Nextclade defaults to the old behavior of counting all nucleotide differences.
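The effect of masking on the distance calculation can be sketched in Python as below. This is a deliberately simplified illustration that compares two equal-length strings directly; Nextclade's real placement logic works on mutations relative to tree nodes. The function name and the toy sequences are assumptions; only the begin/end field names and the 0-based, end-exclusive convention come from the text above.

```python
def count_unmasked_differences(query: str, ref: str, mask_ranges: list[dict]) -> int:
    """Count positions where query and ref differ, skipping masked positions.

    Each mask range is a dict with 0-based 'begin' and end-exclusive 'end',
    mirroring the placementMaskRanges entries shown above.
    """
    masked = set()
    for mask in mask_ranges:
        masked.update(range(mask["begin"], mask["end"]))
    return sum(
        1
        for pos, (q, r) in enumerate(zip(query, ref))
        if q != r and pos not in masked
    )


ref = "ACGTACGTAC"     # toy reference, hypothetical
query = "TCGTACCTAG"   # differs from ref at positions 0, 6 and 9
mask = [{"begin": 0, "end": 2}, {"begin": 8, "end": 10}]

print(count_unmasked_differences(query, ref, mask))   # only position 6 counts
print(count_unmasked_differences(query, ref, []))     # old behavior: all differences
```

Passing an empty list corresponds to the backwards-compatible case where the field is absent.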
We are planning to shortly release a new version of SARS-CoV-2 datasets making use of this feature. Only a small proportion of sequences (<1%) should be affected; where results do change, they should be slightly more accurate.
It was widely reported that users with long-persisting browser tabs, and also users who don't switch datasets often, sometimes do not receive new Nextclade dataset updates, which meant that these users would not get newly designated lineage and clade assignments.
Nextclade Web is a fully client-side, single-page application, which downloads the code and list of datasets once when first opening a tab. When users do not refresh the tab and don't change dataset, the same software and dataset version are used indefinitely. Without periodic page refresh and without periodic fetching of new dataset versions, users can run old code and use old data indefinitely, receiving obsolete or incomplete results.
In order to mitigate this problem, in this version, we add periodic background version checks in Nextclade Web. Every day or so, Nextclade Web will check whether the currently used version of the software is the latest, as well as periodically refresh the list of available datasets and their versions. Whenever a new version of the software or of a dataset is available, the user will receive an update notification. The update can be accepted or dismissed (until the next version is available). Additionally, one can always obtain the latest code and datasets by doing a simple page reload in the browser (no need to clear the cache).
Nextclade is a fast-moving project, where new features and bug fixes are added frequently. We emphasize the importance of using the latest versions of both software and datasets to receive the most accurate and up-to-date results.
Nextclade Web previously had a bug that caused incorrect sorting when the column to be sorted by contained empty values. Empty values are now treated as empty strings, fixing this issue.
The "Citation" modal is now more readable and translated to multiple languages. We also added missing translations for some of the sentences in Nextclade Web. We made the intro text on the main page of Nextclade Web more relevant.
Warnings related to translation of peptides now have verbosity level "info", down from "warning", to reduce clutter in logs. You can still find all errors and warnings in the "errors" and "warnings" columns of the CSV and TSV output files, as well as in the corresponding fields of JSON output files. If you want these warnings to be printed to the console, you can increase Nextclade CLI verbosity level to "info" by adding at least one occurrence of the --verbose (-v) flag or by explicitly setting --verbosity=info or a more verbose level. Type nextclade run --help for more details.
Until now, when there were multiple positions with equal numbers of mismatches between a query sequence and reference tree position, Nextclade always attached the query sequence to the reference tree node with the fewest ancestors. Due to the way recombinants are placed in the SARS-CoV-2 reference trees, this meant that partial sequences in particular were often attached to recombinants. With most recombinants being rare, this bias to attach to recombinants was often surprising.
In this version, we introduce a new feature that attaches sequences to the a priori most likely nodes, taking into account which positions on the reference tree are most commonly found in circulation. The information on the prior probability that a particular reference tree node is the best match for a random query sequence is contained in the placement_prior reference tree node attribute. This attribute is currently only present in the most recent SARS-CoV-2 reference trees. The calculation can be found in this nextclade_data_workflows pull request.
To give an example: a partial sequence may have as many mismatches when compared to BA.5 as it has to the recombinant XP. Based on sequences in public databases, we know that BA.5 is much more common than XP. Hence, the query sequence is attached to BA.5. Previously, the query sequence would have been attached to XP, because XP has fewer parent nodes in the reference tree.
The impact of the feature is biggest for partial and incomplete sequences.
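The tie-breaking idea can be sketched in Python as follows. This is an illustration under stated assumptions, not Nextclade's implementation: the candidate list, the mismatch counts and the prior values are invented, the function name is hypothetical, and the sketch assumes that a larger placement_prior value means a more likely node.

```python
def choose_attachment_node(candidates: list[dict]) -> str:
    """Pick the node with the fewest mismatches; break ties by the highest prior."""
    fewest = min(node["mismatches"] for node in candidates)
    tied = [node for node in candidates if node["mismatches"] == fewest]
    return max(tied, key=lambda node: node["placement_prior"])["name"]


# Invented numbers: BA.5 and XP tie on mismatches, BA.5 is far more common.
candidates = [
    {"name": "XP", "mismatches": 3, "placement_prior": -5.0},
    {"name": "BA.5", "mismatches": 3, "placement_prior": -1.0},
    {"name": "BA.2", "mismatches": 7, "placement_prior": -0.5},
]
print(choose_attachment_node(candidates))
```

Note that the prior is only consulted among nodes that are equally good matches; a node with more mismatches is never preferred just because it is common.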
When available in the dataset, the phenotype values (such as ace2_binding and immune_escape) were written into all output files except the Auspice tree JSON. This omission is now fixed, and these values are set as tree node attributes. This makes it possible to see the values and colorings for phenotype values on the tree page, and when loading the output tree JSON file into Auspice.
Nextclade Web was showing the right boundary of the unsequenced AA range on the 3' end of peptide sequences incorrectly: the range was longer than expected. The calculations used the length of a gene in nucleotides where the length in codons was required. This is now fixed.
The mutation badges in various places in Nextclade Web could show position "0", even though they are supposed to be 1-based. This was due to a programming mistake, which is now corrected.
input-pcr-primers and input-virus-properties URL params in Nextclade Web
The input-pcr-primers and input-virus-properties URL params were swapped in the code accidentally, so one was incorrectly setting the other. This is now fixed.
Due to an omission, Nextclade CLI and Nextalign CLI since v2 did not print sequence translation-related warnings to the console. This is now fixed.
We resolved warnings in Google Search Console: added canonical URL meta tag, and added noindex tag for non-release deployments. This should improve Nextclade appearance in Google Search.
This column's tooltip now also shows ranges of unsequenced regions, i.e. contiguous ranges of nucleotide characters absent at the 5' and 3' ends of the original query sequence, as compared to the reference sequence. To put it differently, these are the ranges to the left and right of the alignment range: from 0 to alignmentStart and from alignmentEnd to the length of the reference sequence. These regions may appear after the alignment step, where Nextclade or Nextalign might insert - characters on the 5' and 3' ends to fill the query sequence to the length of the reference sequence, just as it does with the characters that are absent from the inner parts of the query sequence (which we then call "deletions"). If found, the unsequenced regions are also shown as light-grey rectangles at either or both ends of the sequence in the sequence view column in Nextclade Web.
Unsequenced regions are not to be confused with the missing nucleotides, which are also shown in the same tooltip. Missing nucleotides are the N characters present in the original query sequence. They are not introduced or modified by Nextclade and Nextalign, and are only detected and counted.
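The distinction between the two kinds of ranges can be made concrete with a short Python sketch. It assumes 0-based, end-exclusive coordinates and uses hypothetical function names; it illustrates the definitions above and is not how Nextclade computes them internally.

```python
import re


def unsequenced_ranges(alignment_start: int, alignment_end: int, ref_len: int) -> list[tuple[int, int]]:
    """Ranges left and right of the alignment range, relative to the reference."""
    ranges = []
    if alignment_start > 0:
        ranges.append((0, alignment_start))
    if alignment_end < ref_len:
        ranges.append((alignment_end, ref_len))
    return ranges


def missing_ranges(query: str) -> list[tuple[int, int]]:
    """Contiguous runs of N characters that are present in the query itself."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", query)]


print(unsequenced_ranges(5, 95, 100))   # both ends unsequenced
print(missing_ranges("ACNNNGTNA"))      # two runs of N
```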
It seems that there is no consensus in the bioinformatics community about the notation and naming of either of these events (e.g. which character to use and what to call these ranges). Be thoughtful about these regions when working with the results of Nextclade and Nextalign, especially if you analyze:
[...] Ns and large deletions in the body)
[...] N or -, or even filling from a consensus genome)
If you find strange or inconsistent results, we encourage you to inspect the input and output sequences in an alignment viewer on a per-sequence basis and to contact the authors of individual sequences to clarify their conventions and intent.
In CSV and TSV outputs, the values in columns alignmentStart and alignmentEnd were emitted in 0-based numbering. This was unexpected - by convention, CSV and TSV files have all ranges in 1-based format. This is now fixed.
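For scripts that consumed the old values, the conversion between the two conventions can be sketched in Python. The sketch assumes the internal form is 0-based and end-exclusive and the reported form is 1-based and end-inclusive; whether Nextclade's CSV and TSV columns follow exactly this inclusive convention is an assumption, and the function name is hypothetical.

```python
def to_one_based_inclusive(begin: int, end: int) -> tuple[int, int]:
    """Convert a 0-based, end-exclusive range to a 1-based, end-inclusive one."""
    if not 0 <= begin < end:
        raise ValueError("expected a non-empty range with 0 <= begin < end")
    # The start shifts by one; the exclusive end equals the inclusive 1-based end.
    return begin + 1, end


print(to_one_based_inclusive(0, 50))
```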
We added new columns in CSV and TSV outputs:
unknownAaRanges - list of detected contiguous ranges of unknown amino acids (character X)
totalUnknownAa - total number of unknown amino acids (character X)
index column is written to CSV/TSV output files in case of error
The new column index was correctly written when the analysis of a sample succeeded. However, for analyses which ended with an error (e.g. "Unable to align") this column was mistakenly missing. In this version we fix this omission.
Gene map (genome annotation) was misaligned with sequence views (not matching their width). This has been fixed in this version.
We added a column with index of the row in the table. This is useful for visual search and counting of sorted and filtered results.
Not to be confused with the sequence index. Row indices always start with 0, are sorted in ascending order, and do not change their position when sorting or filtering the results.
These indices are not a part of output files. Nextclade CLI is not affected.
Errors due to failure of sequence alignment are reworded and are hopefully more complete and comprehensible now. Additionally, we improved the error message shown when the reference sequence cannot be read.
On smaller screens the "Download", "Tree" and other action buttons were not visible by default and horizontal scrolling was required to see them. We changed the layout such that the panel with buttons does not overflow along with the table, so the buttons are always visible. The table is still scrollable.
We improved text on main page as well as descriptions inside HTML markup, adding more concrete information and keywords. This should be more pleasant to read and might improve Nextclade ranking in search engines.