Project Glow Releases

An open-source toolkit for large-scale genomic analysis

v2.0.0

2 months ago

What's Changed

Major changes

  • Support Spark 3.4 and 3.5
  • Add functions for left and left semi joins with overlap criteria accelerated by Databricks' range join optimization
  • Register SQL functions via the SQL extension service provider interface, so glow.register is no longer necessary if Glow is on the classpath when Spark is launched (see the sketch after this list)
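
As a quick illustration of the registration change, here is a minimal PySpark sketch. It assumes the Glow 2.0 jar is already on the Spark classpath (for example via --packages) and that the glow Python package is installed; the file path is a placeholder.

    # Minimal sketch: with the Glow jar on the classpath, the VCF data source and
    # Glow SQL functions resolve without an explicit glow.register(spark) call.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("glow-demo").getOrCreate()

    df = spark.read.format("vcf").load("/data/sample.vcf")
    df.selectExpr("contigName", "start", "genotype_states(genotypes) AS states").show(5)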

Other user facing changes

  • Remove Hail integration
  • Remove features that frequently cause incompatibilities between versions (aggregate_by_index, CSV pipe transformer). Workarounds are provided in the documentation.

Internal changes

  • Future proof for Spark 4.0 / Scala 2.13 / JDK 17
  • Migrate CI and release process to GitHub Actions

Overlap join benchmarks

On a dataset with 1B rows in the left table, 1M rows in the right table, and varying percentages of SNPs in the left table (tested with a single 4-core executor due to quota limits):

  • Inner range join + left join, all SNP percentages: 4h
  • Glow join, 0% SNPs: 4h
  • Glow join, 50% SNPs: 2h 9m
  • Glow join, 90% SNPs: 0h 42m

Other notes

The Python source artifact is built from tag v2.0.0-conda in order to fix Glow's conda recipe.

Full Changelog: https://github.com/projectglow/glow/compare/v1.2.1...v2.0.0

v1.2.1

2 years ago

v1.2.1 bumps Glow to Spark 3.2.1.

This release includes Java/Scala artifacts in Maven Central and Python artifacts in PyPI. Docker containers projectglow/open-source-glow:1.2.1, projectglow/databricks-glow:1.2.1, projectglow/databricks-glow:10.4, and projectglow/databricks-hail:0.2.93 can be found in projectglow's Docker Hub. The Glow notebook continuous integration test now uses Databricks Runtime 10.4, which runs Spark 3.2.1 (workflow definition JSON).

Glow leverages private Catalyst APIs that changed between Spark 3.1 and Spark 3.2, so we wrote a shim to maintain backwards compatibility. Spark 2, however, has reached end of life (EOL): Databricks, AWS EMR, and Google Dataproc now depend on Hadoop 3.x, which is incompatible with Spark 2. We are therefore removing support for Spark 2, including the Spark 2 continuous integration (CI/CD) tests run on CircleCI. Glow 1.1.2 is the last release that supports Spark 2.

The Spark 3 CI/CD tests depend on Hail, and they have been failing because Hail does not yet support Spark 3.2; Hail is waiting on Google Dataproc and AWS EMR to upgrade from Spark 3.1. For now we expect the Spark 3 CircleCI tests to keep failing until the Hail tests can be resolved. We nevertheless moved forward with this release, as it is unclear when Dataproc or EMR will support Spark 3.2.

Thanks to Alex Barreto, Jasser Abidi, Cameron Smith, Marcus Henry, Karen Feng, Joseph Bradley, and William Brandler for their contributions to this release.

Full Changelog: https://github.com/projectglow/glow/compare/v1.1.2...v1.2.1

v1.1.2

2 years ago

Glow v1.1.2 adds new functionality for quarantining records with the Glow pipe transformer.
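
As a rough illustration, here is a sketch of a pipe transformer call with quarantining enabled; the quarantine-related option names are assumptions rather than confirmed API, so check the documentation for your version.

    # Sketch of the pipe transformer with quarantining of failed records.
    # The quarantine_* options are assumptions; verify the exact names in the docs.
    import json
    import glow
    from pyspark.sql import SparkSession

    spark = glow.register(SparkSession.builder.getOrCreate())
    df = spark.read.format("vcf").load("/data/sample.vcf")

    piped = glow.transform(
        "pipe",
        df,
        cmd=json.dumps(["cat", "-"]),  # any command that reads and writes VCF on stdin/stdout
        input_formatter="vcf",
        output_formatter="vcf",
        in_vcf_header="infer",
        quarantine_table="default.pipe_quarantine",  # assumed option: destination for failed records
        quarantine_flavor="delta",                   # assumed option: storage format of that table
    )
    piped.show(5)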

This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.

Full Changelog: https://github.com/projectglow/glow/compare/v1.1.1...v1.1.2

v1.1.1

2 years ago

Glow v1.1.1 adds new functionality for sample masking in GWAS, which is documented in a quickstart guide. Nightly notebook tests are now dockerized, making it easier to integrate Glow with other bioinformatics libraries. VEP schema changes fix a bug with indel parsing.

This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.

Credits

Alex Barreto, Boris Boutkov, Brian Cajes, Karen Feng, William Brandler, dim de grave

Full Changelog: https://github.com/projectglow/glow/compare/v1.1.0...v1.1.1

v1.1.0

2 years ago

v1.1.0 bumps the Spark version of Glow to 3.1.2

Glow also now runs automated nightly testing of the notebooks in the docs, making it easier for users to contribute code or algorithms to help others make use of Glow.

This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.

Notable changes:

  • Upgrade Spark dependency from 3.0.0 to 3.1.2 #396
  • Create integration test script #373
  • Hail related enhancements #377
  • Remove typecheck for numpy arrays #366

Minor changes include:

  • Migrate from Bintray to Sonatype #367
  • Test changed notebooks in branches #380

Credits: Brian Cajes, Karen Feng, William Brandler, dim de grave

v1.0.1

3 years ago

v1.0.1 is a patch release

This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.

v1.0.0

3 years ago

We are excited to announce the release of Glow 1.0.0. This release includes major scalability and usability improvements, particularly for GloWGR whole-genome regression and genome-wide association study regression tests. These improvements create a more performant GloWGR workflow with simpler APIs.

Major features and changes include:

  • #302, #309: Pandas-based linear regression. Introduced the linear_regression Python function, which can perform GWAS linear regression tests for multiple phenotypes simultaneously. The function is optimized for performance through one-time calculation of intermediate matrices that are common across multiple phenotypes and genotypes, and it can accept WGR terms as an offset parameter. It significantly outperforms the existing SQL-based linear_regression_gwas function, which works on only a single phenotype (see the sketch after this list).
  • #316, #318, #319: Pandas-based logistic regression. Introduced the logistic_regression Python function with the same properties mentioned above for linear regression. This function implements a fast multi-phenotype multi-genotype score test with fallback logic for significant variants indicated by the score test. The currently supported fallback test is the Approximate Firth method presented in REGENIE.
  • #323: Improved the WGR API so that the user can provide all the input to a single class and run its functions without passing any arguments. An estimate_loco_offsets function was added to generate LOCO predictors end-to-end with a single command. In addition, GloWGR was revised so that its standardization of phenotypes and genotypes and its treatment of the intercept match the REGENIE algorithm.
  • #300: Conversion from Hail MatrixTables to Glow-compatible Spark DataFrames.
  • #274: Faster default VCF reader.
  • #294: Streamlined GloWGR between WGR and GWAS functions.
  • #282: Improved scalability of GloWGR.
  • #303: Added hard calling by default to the BGEN reader.
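
As a rough sketch of the new pandas-based API (module path, argument names, and file paths below are illustrative and based on my reading of the docs, not text from this release note):

    # Sketch: multi-phenotype linear regression with optional WGR offsets.
    import glow
    import glow.gwas as gwas
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = glow.register(SparkSession.builder.getOrCreate())

    genotype_df = (
        spark.read.format("vcf")
        .load("/data/genotypes.vcf")
        .selectExpr("contigName", "start", "genotype_states(genotypes) AS values")
    )
    phenotype_df = pd.read_csv("/data/phenotypes.csv", index_col="sample_id")
    covariate_df = pd.read_csv("/data/covariates.csv", index_col="sample_id")
    offset_df = pd.read_csv("/data/wgr_offsets.csv", index_col="sample_id")  # optional WGR predictions

    results = gwas.linear_regression(
        genotype_df,
        phenotype_df,
        covariate_df,
        offset_df=offset_df,
        values_column="values",  # column holding the per-sample genotype vector
    )
    results.show(5)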

Backwards-incompatible changes:

  • #326: Changed the Glow register function to not modify the Spark session by default (illustrated below).
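
In practice this means keeping the session returned by glow.register rather than relying on the session you passed in being modified; a minimal sketch:

    import glow
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # register returns a session with Glow's functions attached instead of
    # mutating the input session; use the returned value from here on.
    spark = glow.register(spark)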

v0.6.0

3 years ago

We are excited to announce the release of Glow 0.6.0. This release includes both Java/Scala and Python artifacts, which can be found in Maven Central and PyPI, respectively. Please note that the Maven artifact name has changed from glow to glow-spark3 and glow-spark2, as Glow is now released for both versions of Spark.

Notable additions/changes are:

  • #245 Added GloWGR for binary traits
  • #240 Input validation for GloWGR
  • #242 transform_loco function for RidgeRegression, which applies the fitted model in a leave-one-chromosome-out manner to get phenotype predictors for each chromosome (see the sketch after this list)
  • #243 reshape_for_gwas convenience function to prepare the output of GloWGR for use in Glow's GWAS functions
  • #285 Improved performance of lift_over_variants transformer
  • #249 Faster conversion from Python double arrays to Java arrays
  • #276 Added support for reading uncompressed or zstd compressed BGEN files
  • #254, #291 Cross-releasing for Spark 3 and Spark 2
  • #258 Fixed error in python literal conversion
  • #264 Fixed splitability state of non-compressed VCFs
  • #271, #281 Minor fixes to GloWGR
  • #247, #250, #252, #273, #275, #279, #287 Documentation, notebook, and blog improvements
  • Other minor fixes
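
For orientation, here is a rough sketch of where transform_loco and reshape_for_gwas fit in the WGR workflow. The import paths and call signatures are my best guess from the release notes and may differ in your installed version; the block/label inputs are assumed to come from the earlier GloWGR blocking steps.

    # Rough sketch of the GloWGR workflow around transform_loco and reshape_for_gwas.
    # Import paths and signatures are assumptions; check the docs for your version.
    import glow
    from glow.wgr.linear_model import RidgeRegression
    from glow.wgr.functions import reshape_for_gwas
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    glow.register(spark)

    # block_df, label_df, and sample_blocks are produced by the earlier GloWGR blocking steps.
    # model_df, cv_df = RidgeRegression().fit(block_df, label_df, sample_blocks)

    # transform_loco applies the fitted model leave-one-chromosome-out, returning
    # per-chromosome phenotype predictors:
    # loco_df = RidgeRegression().transform_loco(block_df, label_df, sample_blocks, model_df, cv_df)

    # reshape_for_gwas converts those predictors into a Spark DataFrame shaped for
    # Glow's GWAS functions:
    # offsets_df = reshape_for_gwas(spark, loco_df)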

v0.5.0

3 years ago

This release features the initial release of GloWGR, a framework for distributed whole genome regression. For more information, see the blog post and user guide.

Additional features:

  • #222: Accept non-string arguments in transformers
  • #213: Accept numpy ndarrays as literal arguments to GWAS functions (see the sketch after this list)
  • #228: Add a user guide for merging variant datasets with Glow
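
A rough sketch of passing numpy arrays as literals to a GWAS function (column names, the file path, and the exact Python bindings are assumptions based on my understanding of the functions API, not text from this release note):

    # Sketch: numpy ndarrays passed directly as literal arguments to linear_regression_gwas.
    import numpy as np
    import glow
    import glow.functions as fx
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    glow.register(spark)

    df = spark.read.format("vcf").load("/data/genotypes.vcf")

    n_samples = 100                           # must match the number of samples in the VCF
    phenotypes = np.random.random(n_samples)  # literal phenotype vector
    covariates = np.ones((n_samples, 1))      # intercept-only covariate matrix

    results = df.select(
        "contigName",
        "start",
        fx.expand_struct(
            fx.linear_regression_gwas(fx.genotype_states("genotypes"), phenotypes, covariates)
        ),
    )
    results.show(5)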

v0.4.1

3 years ago

A small patch release to v0.4.0. This release includes both Java/Scala and Python artifacts, which can be found in Maven Central and PyPI, respectively.

#224: Fixes an issue where some Glow expressions cause an error during query execution if used after variant splitting.