FunctionLab Selene Versions

a framework for training sequence-level deep learning networks

0.5.0

2 years ago

Version 0.5.0

New functionality

  • sampler.MultiSampler: MultiSampler accepts any Selene sampler for each of the train, validation, and test partitions where previously MultiFileSampler only accepted FileSamplers. We will deprecate MultiFileSampler in our next major release.
  • DataLoader: Parallel data loading based on PyTorch's DataLoader class, which can be used with Selene's MultiSampler and MultiFileSampler classes (see sampler.SamplerDataLoader and sampler.H5DataLoader).
  • To support parallelism via multiprocessing, the sampler that SamplerDataLoader uses must be picklable. To enable this, file-opening operations are deferred until a method that needs the file is called. The API is unchanged, and setting init_unpicklable=True in __init__ for Genome and all OnlineSampler classes fully reproduces the behavior of selene_sdk<=0.4.8.
  • sampler.RandomPositionsSampler: center_bin_to_predict now also accepts a list or tuple of two integers that specifies the region from which to query the targets. By default (center_bin_to_predict=<int>), targets are queried from a center bin of the given size; passing start and end integers instead selects a region that need not be at the center.
  • EvaluateModel: accepts a list of metrics (by default computing ROC AUC and average precision) with which to evaluate the test dataset.
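
The deferred file opening described above can be illustrated with a minimal sketch. This is not Selene's code: LazyFileReader is a hypothetical class, and its init_unpicklable flag only mimics the idea of the option named in these notes. The point is that an object holding no open file handle can be pickled and sent to a worker process, while one that opened its file in __init__ cannot.

```python
import os
import pickle
import tempfile


class LazyFileReader:
    """Hypothetical reader that opens its file on first use, not in __init__."""

    def __init__(self, path, init_unpicklable=False):
        self.path = path
        self._handle = None  # no open handle yet, so the object can be pickled
        if init_unpicklable:
            # Eager mode (assumption): open immediately, as an older design might.
            self._open()

    def _open(self):
        if self._handle is None:
            self._handle = open(self.path, "r")

    def read_line(self):
        self._open()  # the deferred open happens here, on first real use
        return self._handle.readline().rstrip("\n")

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None


# Small demonstration with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACGT\n")

reader = LazyFileReader(tmp.name)
clone = pickle.loads(pickle.dumps(reader))  # works: nothing has been opened yet
first_line = clone.read_line()              # the clone opens the file itself
clone.close()
os.remove(tmp.name)
```

With init_unpicklable=True the same pickle.dumps call raises a TypeError, because open file objects cannot be pickled.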

Usage

  • Command-line interface (CLI): You can now run the CLI directly with python -m selene_sdk. If you have cloned the repository, make sure that you have installed selene_sdk locally via python setup.py install, or that selene_sdk is in the same directory as your script or has been added to PYTHONPATH. Developers can copy the selene_sdk/cli.py script and use it the same way that selene_cli.py was used in earlier versions of Selene (python -u cli.py <config-yml> [--lr]).

Bug fixes

  • EvaluateModel: use_features_ord allows you to evaluate a trained model on only a subset of the chromatin features (targets) predicted by the model. If you are using a FileSampler for your test dataset, you now have the option to pass in a subsetted matrix. This matrix must be ordered the same way as features (the original target prediction ordering), not the way use_features_ord is ordered. The final model predictions and targets (test_predictions.npz and test_targets.npz), by contrast, are output according to the use_features_ord list and ordering.
  • MatFileSampler: Previously the MatFileSampler reset the pointer to the start of the matrix too early (going back to the first sample before we had finished sampling the whole matrix).
  • CLI learning rate: Edge cases (e.g. not specifying the learning rate via CLI or config) previously were not handled correctly and did not throw an informative error.
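
The two orderings in the use_features_ord note are easy to mix up, so here is a small sketch of the distinction. It uses plain Python lists rather than Selene's classes; the helper names and the feature labels are hypothetical. The subsetted input matrix keeps the column order of features, and only the output is rearranged to follow use_features_ord.

```python
def subset_columns_in_original_order(features, use_features_ord):
    """Columns a subsetted matrix is expected to have: the selected
    features, listed in the order they appear in `features`."""
    wanted = set(use_features_ord)
    return [name for name in features if name in wanted]


def reorder_to_use_features_ord(subset_rows, features, use_features_ord):
    """Rearrange the columns of a subsetted matrix (ordered like `features`)
    so that they follow the order given in `use_features_ord`."""
    subset_cols = subset_columns_in_original_order(features, use_features_ord)
    col_index = {name: i for i, name in enumerate(subset_cols)}
    order = [col_index[name] for name in use_features_ord]
    return [[row[i] for i in order] for row in subset_rows]


# Hypothetical feature labels, for illustration only.
features = ["CTCF", "DNase", "H3K4me3", "H3K27ac"]
use_features_ord = ["H3K27ac", "CTCF"]

# Input columns follow `features`: CTCF first, then H3K27ac.
input_columns = subset_columns_in_original_order(features, use_features_ord)
subset_rows = [[0, 1], [1, 0]]

# Output columns follow `use_features_ord`: H3K27ac first, then CTCF.
output_rows = reorder_to_use_features_ord(subset_rows, features, use_features_ord)
```

Here input_columns is ["CTCF", "H3K27ac"], while each row of output_rows has its two values swapped so that H3K27ac comes first.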

0.4.8

4 years ago

Enhancements

  • PyTorch now supports flexible state dict loading, which makes it easier to load models that were trained with older or newer versions of PyTorch. Selene has been updated to use it.
  • Added HeartENN model architecture ahead of publication.

0.4.7

4 years ago

Bug fixes

  • Use self.use_cuda in get_predict for raw sequence input in the AnalyzeSequences class.

0.4.6

4 years ago

Updates

  • Allow users to pass individual sequences to get_predictions in the AnalyzeSequences class and get the model prediction back directly (as opposed to having it written to an output file).

0.4.5

4 years ago

Updates

  • Specify upper & lower bounds for Selene's torch dependency
  • Add '.' as a valid delimiter for VCF multiallelic parsing
  • Allow users to evaluate on subsets of features in EvaluateModel

Bug fixes

  • BASES_ARR type consistency (specify as a list only) and resetting for Lua-trained vs. Selene-trained models.

0.4.4

4 years ago

Updates

  • Refactored variant effect prediction to simplify the code
  • Removed the contains_unk column from the output of get_predictions_from_fasta in the AnalyzeSequences class

Bug fixes

  • Fixed variant effect prediction handling for odd-length sequences

0.4.3

4 years ago

Updates

  • Add a column contains_unk to BED/VCF predictions. This boolean column indicates whether a sequence contains any unknown bases.
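
As a rough illustration of what such a boolean column encodes, the check below flags any sequence with a character outside A, C, G, T. This is a sketch under that assumption; the notes do not say how Selene defines an unknown base, and the function here is not Selene's implementation.

```python
# Assumption: anything other than A, C, G, T counts as an unknown base.
KNOWN_BASES = frozenset("ACGT")


def contains_unk(sequence):
    """Return True if the sequence has at least one base outside KNOWN_BASES."""
    return any(base not in KNOWN_BASES for base in sequence.upper())


flags = [contains_unk(seq) for seq in ["ACGTACGT", "ACGNNACG", "acgt"]]
```

The comparison is case-insensitive in this sketch, so lowercase (soft-masked) bases are treated as known.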

Bug fixes

  • MultiModelWrapper can be used with CUDA.

0.4.2

4 years ago

Updates

  • MultiModelWrapper for model evaluation

Bug fixes

  • Type check for GenomicFeatures is less strict (can accept int if threshold is 1) (#106)
  • Syntax error in EvaluateModel (#110)
  • RandomPositionsSampler sampling bounds (#114)
  • LR scheduler correctly tracks min loss now (#115)
  • Get predictions for BED file: fix edge case of a single-entry BED file (#118)

0.4.1

4 years ago

Updates

  • HDF5 support for in silico mutagenesis.
  • Add Google Groups link to documentation.

Bug fixes

  • Predicting on sequences: write predictions to file for input less than batch size.
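
The notes do not describe the cause of this bug. One common way this kind of problem arises is a loop that only emits results when a batch fills up, so an input shorter than the batch size never triggers a write. The generator below is a generic sketch of batching that also yields the final partial batch; iter_batches is a hypothetical helper, not a Selene function.

```python
def iter_batches(items, batch_size):
    """Yield lists of up to batch_size items, including a final partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Without this flush, inputs shorter than batch_size would yield nothing.
        yield batch


short_input = list(iter_batches(["seq1", "seq2", "seq3"], batch_size=64))
uneven_input = list(iter_batches(range(5), batch_size=2))
```

With three sequences and a batch size of 64, the generator still yields one batch containing all three.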

0.4.0

4 years ago

Updates

  • Variant effect prediction: adjustments made to variant centering and strand-specific sequence handling so that the sequence context fetched for a variant matches the implementation for code associated with DeepSEA and SeqWeaver (https://hb.flatironinstitute.org/asdbrowser/help, https://github.com/FunctionLab/expecto)
  • Predicting on sequences accepts BED file as input
  • Add compatibility with Lua-trained DeepSEA and SeqWeaver models (converted to PyTorch) - models themselves will be officially released through the ASD browser on HumanBase in the coming weeks.
  • Simplified the prediction handlers' output for variant effect prediction: sequences where the reference allele doesn't match the reference genome are no longer diverted to a separate file. Instead, a ref_match column has been added that indicates whether the reference allele matches the reference genome.
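
A minimal sketch of the kind of comparison a ref_match column records is shown below. It assumes the reference sequence is available as a plain string and that positions are 1-based, as in VCF; the function and the toy records are illustrative only and are not taken from Selene.

```python
def ref_match(reference_sequence, pos, ref):
    """Return True if the `ref` allele equals the reference sequence at `pos`.

    `pos` is 1-based, following the VCF convention.
    """
    start = pos - 1
    observed = reference_sequence[start:start + len(ref)]
    return observed.upper() == ref.upper()


# Toy reference and variants: (pos, ref, alt). All values are made up.
reference = "ACGTACGT"
variants = [(3, "G", "A"), (3, "T", "A"), (5, "ACG", "A")]

# Every variant stays in one output, annotated with a ref_match flag,
# rather than mismatches being written to a separate file.
rows = [(pos, ref, alt, ref_match(reference, pos, ref)) for pos, ref, alt in variants]
```

In this toy case the second variant is flagged False because the reference base at position 3 is G, not T.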

Bug fixes

  • Predicting on sequences: previously did not output anything if N < batch size