Python and C++ code for reading and writing genomics data.
Nucleus is a library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats such as SAM and VCF. Nucleus also integrates with the TensorFlow machine learning framework: anywhere a genomics file is consumed or produced, a TensorFlow TFRecords file may be used instead.
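As a rough illustration of that interchangeability, the sketch below reads a VCF file and writes the variants that clear a quality threshold to a TFRecords file. It assumes the `nucleus.io.vcf` module and its `VcfReader` and `VcfWriter` classes behave as shown; check the API documentation for your installed version. The helper names `passes_quality` and `filter_vcf`, the threshold, and the file paths are illustrative assumptions, not part of Nucleus.

```python
def passes_quality(quality, min_quality=3.01):
    """Return True if a quality score clears the (hypothetical) threshold."""
    return quality > min_quality


def filter_vcf(in_path, out_path, min_quality=3.01):
    """Copy variants above min_quality from in_path to out_path.

    Returns the number of variants written. Assumption: out_path may name
    a TFRecords file (for example, ending in .tfrecord) instead of a VCF.
    """
    # Imported lazily so this module still loads where Nucleus is absent.
    from nucleus.io import vcf

    written = 0
    with vcf.VcfReader(in_path) as reader:
        # Reuse the input header so the output carries the same metadata.
        with vcf.VcfWriter(out_path, header=reader.header) as writer:
            for variant in reader:
                if passes_quality(variant.quality, min_quality):
                    writer.write(variant)
                    written += 1
    return written
```

A call such as `filter_vcf("/tmp/example.vcf.gz", "/tmp/filtered.tfrecord")` would then produce a file that TensorFlow input pipelines can consume, assuming both paths are valid on your system.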
Please check out our tutorial on using Nucleus and TensorFlow for DNA sequencing error correction. It's a Python notebook that demonstrates how Nucleus integrates information from multiple file types (BAM, VCF and FASTA) and turns it into a form usable by TensorFlow.
Nucleus currently only works on modern Linux systems using Python 3. It must be installed using a version of pip less than 21. To determine the version of pip installed on your system, run
pip --version
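If you would rather check the pip requirement from a script, a minimal sketch follows. It assumes the usual `pip X.Y.Z from ...` output format; the function names are hypothetical helpers, not part of pip or Nucleus.

```python
import re
import subprocess
import sys


def pip_major_version(version_output):
    """Extract the major version from text like 'pip 20.3.4 from ...'."""
    match = re.match(r"pip\s+(\d+)", version_output.strip())
    if match is None:
        raise ValueError(
            "unrecognized pip version output: {!r}".format(version_output))
    return int(match.group(1))


def pip_is_compatible(version_output, max_major_exclusive=21):
    """Return True if the pip major version is below the stated limit."""
    return pip_major_version(version_output) < max_major_exclusive


def installed_pip_version_output():
    """Run `pip --version` for the current interpreter and return its output."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `pip_is_compatible(installed_pip_version_output())` reports whether the pip tied to the running interpreter meets the requirement.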
To install Nucleus, run
pip install --user google-nucleus
Note that each version of Nucleus works with a specific TensorFlow version. Check the releases page for specifics.
You can ignore any "Failed building wheel for google-nucleus" error messages -- these are expected and won't prevent Nucleus from installing successfully.
If you are using Python 2, instead run
pip install --user google-nucleus==0.3.2
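The choice between the two install commands can be sketched as a small helper for setup scripts. The function names below are hypothetical; only the package name and the 0.3.2 pin come from the instructions above.

```python
import sys


def nucleus_requirement(python_major=None):
    """Return the pip requirement string for a Python major version."""
    if python_major is None:
        python_major = sys.version_info[0]
    if python_major == 2:
        # Python 2 users are pinned to the 0.3.2 release.
        return "google-nucleus==0.3.2"
    if python_major == 3:
        return "google-nucleus"
    raise ValueError("unsupported Python major version: {}".format(python_major))


def install_command(python_major=None):
    """Build the pip argument list without running it."""
    return ["pip", "install", "--user", nucleus_requirement(python_major)]
```

The helper only builds the argument list; pass it to `subprocess.run` if you want the script to perform the installation.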
For Ubuntu 20, building from source is easy. Simply type
source install.sh
This will call build_clif.sh, which will build CLIF from scratch as well.
For all other systems, you will need to first install CLIF by following the instructions at https://github.com/google/clif#installation before running install.sh. You'll need to run this command with Python 3.8. If you don't want to build CLIF binaries on your own, you can consider using pre-built CLIF binaries (see an example here). Note that we don't plan to update these pre-built CLIF binaries, so we recommend building CLIF binaries from scratch.
Note that install.sh depends heavily on apt-get, so it is unlikely to run on non-Debian-based systems without substantial modification.
Nucleus depends on TensorFlow. By default, install.sh will install a CPU-only version of a stable TensorFlow release (currently 2.6). If that isn't what you want, there are several other options that can be enabled with a simple edit to install.sh.
Running install.sh will build all of Nucleus's programs and libraries. You can find the generated binaries under bazel-bin/nucleus. If in addition to building Nucleus you would like to run its tests, execute
bazel test -c opt $BAZEL_FLAGS nucleus/...
This is Nucleus 0.6.0. Nucleus follows semantic versioning.
New in 0.6.0:
New in 0.5.9:
New in 0.5.8:
util/vis.py to use updated channel names.
MED_DP (median DP) field for a VariantCall.
New in 0.5.7:
util/vis.py.
New in 0.5.6:
New in 0.5.5:
New in 0.5.4:
New in 0.5.3:
New in 0.5.2:
util/vis.py now supports saving images to Google Cloud Storage.
New in 0.5.1:
New in 0.5.0:
New in 0.4.1:
New in 0.4.0:
New in 0.3.0:
New in 0.2.3:
New in 0.2.2:
New in 0.2.1:
New in 0.2.0:
Nucleus is licensed under the terms of the Apache 2 license.
The Genomics team in Google Brain actively supports Nucleus and is always interested in improving its quality. If you run into an issue, please report the problem on our issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and fix it. We encourage including links to snippets of BAM, VCF, or other files that provoke the bug, if possible. Depending on the severity of the issue, we may patch Nucleus immediately with the fix or roll it into the next release.
Interested in contributing? See CONTRIBUTING.
Nucleus grew out of the DeepVariant project.
This is not an official Google product.