An open-source toolkit for large-scale genomic analysis
glow.register is no longer necessary if Glow is on the classpath when Spark is launched.
aggregate_by_index, CSV pipe transformer). Workarounds are provided in the documentation.
On a dataset with 1B left rows and 1M right rows and varying percentages of SNPs in the left table (tested with one 4-core executor due to quota):
Inner range join + left join, all SNP percentages: 4h
Glow join, 0% SNPs: 4h
Glow join, 50% SNPs: 2h9m
Glow join, 90% SNPs: 0h42m
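To put the timings above in relative terms, the short sketch below converts them to minutes and computes the speedup of the Glow join over the inner range join + left join baseline. The variable and function names are illustrative only and are not part of Glow; the numbers are copied from the list above.

```python
# Wall-clock times copied from the benchmark list, converted to minutes.
BASELINE_MINUTES = 4 * 60  # inner range join + left join, all SNP percentages

glow_join_minutes = {
    0: 4 * 60,       # Glow join, 0% SNPs: 4h
    50: 2 * 60 + 9,  # Glow join, 50% SNPs: 2h9m
    90: 42,          # Glow join, 90% SNPs: 0h42m
}


def speedup(baseline_minutes, candidate_minutes):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_minutes / candidate_minutes


# Speedup factor per SNP percentage, rounded to two decimals.
speedups = {
    pct: round(speedup(BASELINE_MINUTES, minutes), 2)
    for pct, minutes in glow_join_minutes.items()
}

for pct, factor in sorted(speedups.items()):
    print(f"{pct}% SNPs: {factor}x")
```

With these figures the Glow join matches the baseline at 0% SNPs, is about 1.86x faster at 50% SNPs, and about 5.71x faster at 90% SNPs.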
The Python source artifact is built from tag v2.0.0-conda in order to fix Glow's conda recipe.
Full Changelog: https://github.com/projectglow/glow/compare/v1.2.1...v2.0.0
v1.2.1 bumps Glow to Spark v3.2.1
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI. Docker containers projectglow/open-source-glow:1.2.1, projectglow/databricks-glow:1.2.1, projectglow/databricks-glow:10.4 and projectglow/databricks-hail:0.2.93 can be found in projectglow's Docker Hub. The Glow notebook continuous integration test now uses Databricks Runtime 10.4, which is on Spark 3.2.1 (workflow definition json)
Glow leverages private Catalyst APIs that changed between Spark 3.1 and Spark 3.2, so we wrote a shim to maintain backwards compatibility. However, Spark 2 is end of life (EoL), and Databricks, AWS EMR and Google Dataproc now depend on Hadoop 3.x, which is incompatible with Spark 2. We are therefore removing support for Spark 2, including the Spark 2 continuous integration (CI/CD) tests performed with CircleCI. Glow version 1.1.2 is the last release that supports Spark 2.
The Spark 3 CI/CD tests depend on Hail, and these were failing because Hail does not yet support Spark 3.2; Hail is waiting on Google's Dataproc and AWS EMR to upgrade from Spark 3.1. For now, we expect the Spark 3 CircleCI tests to keep failing until we can resolve the Hail tests. We moved forward with the new release anyway, as it is unclear when Dataproc or EMR will support Spark 3.2.
Thanks to Alex Barreto, Jasser Abidi, Cameron Smith, Marcus Henry, Karen Feng, Joseph Bradley, and William Brandler for their contributions to this release
Full Changelog: https://github.com/projectglow/glow/compare/v1.1.2...v1.2.1
v1.1.2
Glow incorporates new functionality for quarantining records with the Glow pipe transformer in v1.1.2.
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
Full Changelog: https://github.com/projectglow/glow/compare/v1.1.1...v1.1.2
v1.1.1
Glow v1.1.1 incorporates new functionality for sample masking in GWAS, which has been documented as a quickstart guide. Nightly notebook tests are now dockerized, making it easier to integrate Glow with other bioinformatics libraries. VEP schema changes fix a bug with indel parsing.
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
Credits: Alex Barreto, Boris Boutkov, Brian Cajes, Karen Feng, William Brandler, dim de grave
Full Changelog: https://github.com/projectglow/glow/compare/v1.1.0...v1.1.1
v1.1.0 bumps the Spark version of Glow to 3.1.2
Glow also now runs automated nightly testing of notebooks in the docs, making it easier for users to contribute code or algorithms to help others make use of Glow
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
Notable changes:
Minor changes include:
Credits: Brian Cajes, Karen Feng, William Brandler, dim de grave
v1.0.1 is a patch release
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
We are excited to announce the release of Glow 1.0.0. This release includes major scalability and usability improvements, particularly for GloWGR whole-genome regression and genome-wide association study regression tests. These improvements create a more performant GloWGR workflow with simpler APIs.
Major features and changes include:
linear_regression Python function, which can be used to perform GWAS linear regression tests for multiple phenotypes simultaneously. The function is optimized for performance through one-time calculation of intermediate matrices common across multiple phenotypes and genotypes. The function can also accept WGR terms as an offset parameter. This function is superior in performance to the existing SQL-based linear_regression_gwas function, which only works on a single phenotype.
logistic_regression Python function with the same properties mentioned above for linear regression. This function implements a fast multi-phenotype multi-genotype score test with fallback logic for significant variants indicated by the score test. The currently supported fallback test is the Approximate Firth method presented in REGENIE.
estimate_loco_offsets function was added to perform an end-to-end generation of LOCO predictors using a single command. In addition, GloWGR was revised to make its behavior regarding standardization of phenotypes and genotypes, and treatment of intercept, match the REGENIE algorithm.
Backwards-incompatible changes:
The register function no longer modifies the Spark session by default.
We are excited to announce the release of Glow 0.6.0. This release includes both Java/Scala and Python artifacts that can be found in Maven Central and PyPI, respectively. Please note that the name of the Maven artifacts has changed from glow to glow-spark3 and glow-spark2, as Glow is now released for both versions of Spark.
Notable additions/changes are:
transform_loco function for RidgeRegression, which applies the fitted model in a leave-one-chromosome-out fashion to get phenotype predictors for each chromosome
reshape_for_gwas convenience function to prepare the output of GloWGR for use in Glow GWAS functions
lift_over_variants transformer
This release features the initial release of GloWGR, a framework for distributed whole genome regression. For more information, see the blog post and user guide.
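As a rough illustration of the leave-one-chromosome-out (LOCO) idea behind transform_loco, the sketch below builds one predictor per chromosome by excluding that chromosome's own contribution. It is a toy in plain Python, not GloWGR's implementation: it assumes the predictor is a simple sum of per-chromosome score contributions, and the function name loco_predictions and the input layout are hypothetical.

```python
def loco_predictions(per_chrom_scores):
    """For each chromosome, sum the per-sample scores of all other chromosomes.

    per_chrom_scores maps a chromosome name to a list of per-sample score
    contributions (hypothetical input; same sample order in every list).
    """
    chroms = list(per_chrom_scores)
    n_samples = len(per_chrom_scores[chroms[0]])
    # Total over all chromosomes, computed once per sample.
    totals = [
        sum(per_chrom_scores[c][i] for c in chroms) for i in range(n_samples)
    ]
    # Leaving chromosome c out is the total minus c's own contribution.
    return {
        c: [totals[i] - per_chrom_scores[c][i] for i in range(n_samples)]
        for c in chroms
    }


# Two samples, three chromosomes, made-up integer scores.
scores = {
    "chr1": [1, 2],
    "chr2": [3, 4],
    "chr3": [5, 6],
}
loco = loco_predictions(scores)
print(loco["chr1"])  # predictor to use when testing variants on chr1
```

The predictor for a given chromosome never includes signal from that chromosome, which is why one predictor is produced per chromosome.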
Additional features:
#222: Accept non-string arguments in transformers
#213: Accept numpy ndarrays as literal arguments to GWAS functions
#228: Add a user guide for merging variant datasets with Glow
A small patch release to v0.4.0. This release includes both Java/Scala and Python artifacts that can be found in Maven Central and PyPI respectively.
#224: Fixes an issue where some Glow expressions cause an error during query execution if used after variant splitting.