Chemprop Versions

Message Passing Neural Networks for Molecule Property Prediction

v1.4.0

2 years ago

Features

Spectra training

Introduces spectra as a new dataset type available for training, in which each target in a multitarget regression refers to a positive intensity value in one position of a spectrum. Training methods are consistent with https://github.com/gfm-collab/chemprop-IR. Default loss function is spectral information divergence (SID), but Wasserstein loss (earthmover distance) is also supported with --metric wasserstein --alternative_loss_function wasserstein. PR #197
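As a rough illustration of the default loss, the spectral information divergence between two positive spectra can be sketched as below. This is a minimal sketch of the general SID formula (normalize both spectra, then sum a symmetric KL-style divergence over positions); Chemprop's actual implementation may differ in details such as masking, thresholds, and batching.

```python
import math

def sid(model_spectrum, target_spectrum, eps=1e-8):
    """Spectral information divergence between two positive spectra.

    Sketch only: both spectra are normalized to sum to 1, values are
    floored at eps to avoid log(0), and the symmetric divergence is
    summed over spectrum positions.
    """
    p_total = sum(model_spectrum)
    q_total = sum(target_spectrum)
    p = [max(x / p_total, eps) for x in model_spectrum]
    q = [max(x / q_total, eps) for x in target_spectrum]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))
```

Identical spectra give a divergence of zero, and the value grows as the two intensity distributions diverge.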

Preloading model in predictions

Refactored make_predictions into smaller functions for better capability to use Chemprop functions as a Python library. The refactoring is specifically designed to allow a model to be loaded a single time using the function chemprop.train.load_model and then used for multiple instances of prediction by feeding that model as an argument to chemprop.train.make_predictions. PR #200

Improved hyperparameter optimization

Added several new features to hyperparameter optimization, many related to hyperparameter checkpoints saved in the location specified by --hyperopt_checkpoint_dir <dir_path>. The new functionalities:

  • Restarting failed hyperparameter optimization jobs by selecting the same checkpoint directory.
  • Parallelizing multiple instances of hyperparameter optimization by setting a shared checkpoint directory among instances.
  • Seeding hyperparameter optimizations with previously run jobs by indicating an old checkpoint directory and/or by specifying the save directories of relevant jobs trained with train.py using --manual_trial_dirs <list-of-directories>.
  • Manually setting the number of hyperparameter trials that use randomized parameters before directed TPE search begins using --startup_random_iters <int, default=10>. PR #208

Return results from all ensemble models

When making predictions from an ensemble of models, returns the mean prediction but also the individual predictions from the individual models when --individual_ensemble_predictions is specified. PR #190
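The behavior described above can be sketched in a few lines. This is illustrative only, not Chemprop's internal code: each inner list stands in for one ensemble member's predictions, and the function returns the mean alongside the per-model values that --individual_ensemble_predictions exposes.

```python
def ensemble_predict(individual_preds):
    """Return (mean_prediction, individual_predictions).

    individual_preds: one list of predictions per ensemble member,
    all of the same length. Sketch of the ensemble averaging idea.
    """
    n_models = len(individual_preds)
    mean = [sum(vals) / n_models for vals in zip(*individual_preds)]
    return mean, individual_preds
```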

Latent representations for ensembles and from FFN layers

Allows for the calculation of latent fingerprints from an ensemble of models by concatenating them together. Also allows for the return of either a latent representation from the MPNN output or from the next-to-last FFN layer using the argument --fingerprint_type <MPN or last_FFN>. PR #193
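The concatenation step can be sketched as follows. This is a stand-in, not Chemprop code: each input list represents one ensemble member's latent vector (e.g. the MPN output), and the ensemble fingerprint is simply their concatenation, so its length grows with ensemble size.

```python
def ensemble_fingerprint(latents):
    """Concatenate per-model latent vectors into one fingerprint.

    latents: list of equal-length lists of floats, one per ensemble
    member. Sketch of the concatenation described above.
    """
    fingerprint = []
    for vec in latents:
        fingerprint.extend(vec)
    return fingerprint
```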

Target imputation for sklearn multitask models

Sklearn multitask training cannot proceed when targets are missing from the data; previously, such datasets would have needed to be run as multiple singletask models. This PR introduces target imputation for missing data, allowing multitask sklearn training even when some data is missing. The argument --impute_mode <model/linear/median/mean/frequent> indicates which method to use for imputation. PR #210 Issue #211
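As a sketch of one of these modes, column-wise median imputation over a targets matrix with missing entries could look like the following. This mirrors the idea behind --impute_mode median only; it is not Chemprop's implementation, and the other modes (mean, frequent, linear, model) fill gaps differently.

```python
def impute_median(targets):
    """Fill None entries in a row-major targets matrix with the
    column median of the observed values. Illustrative sketch."""
    n_cols = len(targets[0])
    filled = [row[:] for row in targets]  # copy so input is untouched
    for col in range(n_cols):
        observed = sorted(row[col] for row in targets if row[col] is not None)
        mid = len(observed) // 2
        median = (observed[mid] if len(observed) % 2
                  else (observed[mid - 1] + observed[mid]) / 2)
        for row in filled:
            if row[col] is None:
                row[col] = median
    return filled
```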

Reaction balancing

Adds options in reaction training for how to handle situations where reactants and products are not balanced. The argument --reaction_mode now also has the options reac_diff_balance, prod_diff_balance, and reac_prod_balance (in addition to the current options reac_diff, prod_diff, and reac_prod). Also fixes an error where atomic numbers are incorrect when an atom is present in the products but not in the reactants. PR #212 Issue #204

Bug Fixes

Interactions with git repos

Resolves a problem with TAP (typed-argument-parser) where running Chemprop from inside a different git repo would trigger an error related to the generation of a reproducibility hash. In this situation the reproducibility hash is not generated, but it logs the issue and does not stop Chemprop from running. PR #195

Global features structure

Changes the way that global variables related to model construction and feature vector size are handled. Resolves a problem in pytest where these variables wouldn't reset between runs. PR #206

v1.3.1

2 years ago

Features

Resume training on multiple folds if interrupted

As training progresses through the folds of a multiple-fold model, the results of each individual fold are stored in a JSON file. If training is interrupted, the completed fold results will be read from the JSON file and training will resume at the first uncompleted fold when using the flag --resume_experiment. PR #164

Frozen layers for pre-training

Added functionality to freeze the MPN or FFN layers in a model being trained at the values of a previously trained model. Freezes MPN values using a model indicated with --checkpoint_frzn <path>. FFN layers will also be frozen if indicated with --frzn_ffn_layers <number-of-layers>. Models with multiple molecules can select to only freeze the first molecule MPN using --freeze_first_only. PR #170

tSNE functionality

Added HDBScan clustering to the tSNE script. PR #172

Weighted training by target and by datapoint

Added training weights for different targets and different datapoints, with normalization of weight values. Target weights indicated with the argument --target_weights <list-of-values>. Data weights supplied through an input file indicated with the argument --data_weights_path <path>. PR #173, #175, #189 Issue #145
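One plausible normalization of per-datapoint weights is to rescale them so they average to 1, which keeps the effective dataset size unchanged while preserving relative emphasis. This is a sketch of that idea under that assumption; Chemprop's exact normalization scheme is not specified here and may differ.

```python
def normalize_weights(weights):
    """Rescale positive per-datapoint weights so their mean is 1.

    Hypothetical normalization for illustration; preserves the ratio
    between weights while keeping the total weight equal to the
    number of datapoints.
    """
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]
```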

Bug Fixes

MPNN input

Providing SMILES or RDKit molecules to the MPN's forward function failed (only BatchMolGraph worked) following other changes. Now, SMILES and RDKit molecules can once again be used as input. PR #164

Backwards compatibility with old checkpoints

Restored backwards compatibility for features scaling when loading old checkpoints. PR #164 Issue #108

Updated readme

Added information to the README and documentation on pre-training, the treatment of missing values in multitask models, and caching. PR #165 Issue #156

Multiclass classification

Corrected error when using the metric accuracy with multiclass classification. PR #169

RDKit Compatibility

Bugfix for compatibility issues of RDKit 2021.03.01 with the interpretation script. PR #182 Issue #178

v1.3.0

3 years ago

New Features

Custom atom/bond features

Enabled custom input of atom and bond features, either in addition to or instead of the default features.

PR: https://github.com/chemprop/chemprop/pull/137

Epistemic uncertainty

Introduced the argument --ensemble_variance which calculates the epistemic uncertainty of predictions via an ensemble of models.

PR: https://github.com/chemprop/chemprop/pull/140

Reaction option

Introduced the CGR option: input of atom-mapped reaction SMILES instead of molecules. This creates a pseudo-molecule of the graph transition state between reactants and products and performs message passing on this pseudo-molecule.

PR: https://github.com/chemprop/chemprop/pull/152

Latent representation

Added new functionality that saves the latent representation of a molecule (the MPNN output) to file. It is used similarly to predicting with a given checkpoint file.

PR: https://github.com/chemprop/chemprop/pull/119

Preprocessing updates

Updates to the preprocessing, handling and saving of smiles strings. Removed redundant checks.

PR: https://github.com/chemprop/chemprop/pull/135

Resume experiments

Experiments with multiple folds can now be resumed using the --resume_experiment flag. Additionally, the test results of each fold are saved as a JSON file in the corresponding subfolder in save_dir.

PR: https://github.com/chemprop/chemprop/pull/164

Bug Fixes

Atom messages

Major bugfix for running Chemprop with the argument --atom_messages, where the wrong features were passed to the MPNN. This improves the performance of Chemprop in atom_messages mode, and causes backwards incompatibility with old checkpoint files if created in atom_messages mode. Since Chemprop is mainly used for directed message passing via bond messages, we hope not many users are affected.

Issue: https://github.com/chemprop/chemprop/issues/133 PR: https://github.com/chemprop/chemprop/pull/138

Backwards compatibility with old checkpoints

Backwards compatibility for correctly setting recently introduced training arguments for old models.

Issue: https://github.com/chemprop/chemprop/issues/148 and https://github.com/chemprop/chemprop/issues/108 PR: https://github.com/chemprop/chemprop/pull/149 and PR: https://github.com/chemprop/chemprop/pull/164

Sklearn scores

Bugfix in training sklearn models: Scores were not saved correctly previously.

PR: https://github.com/chemprop/chemprop/pull/162

Data split script

Bugfix in a standalone script to create data splits: Multi-molecule input had previously created incompatibilities with passing data to the scaffold split functionality. Update of docstring.

Issue: https://github.com/chemprop/chemprop/issues/158 PR: https://github.com/chemprop/chemprop/pull/159

MPNN sanity check

Bugfix for sanity checks for dimensions of batches within the MPNN forward pass: The introduction of multi-molecule input had previously caused an inconsistency in one of the checks.

Issue: https://github.com/chemprop/chemprop/issues/153 PR: https://github.com/chemprop/chemprop/pull/154

MPNN type annotations

Bugfix for type annotation in the MPNN forward pass + update of docstring.

PR: https://github.com/chemprop/chemprop/pull/151 and PR: https://github.com/chemprop/chemprop/pull/164

Tanimoto distance

Bugfix for calculating Tanimoto distances. The introduction of multi-molecule input had previously caused incompatibilities in the standalone script to find similar molecules in the training data.

Issue: https://github.com/chemprop/chemprop/issues/143 PR: https://github.com/chemprop/chemprop/pull/144

README typos

Fixed typos for a few arguments in the README.

PR: https://github.com/chemprop/chemprop/pull/139

Sanitize script

Bugfix in standalone script sanitize.py - open output file with write access.

RDKit molecule caching

Bugfix for creating RDKit molecules from smiles strings. Previously the molecules were recreated even though they were already cached.

PR: https://github.com/chemprop/chemprop/pull/152

Saving SMILES

Bugfix for error occurring when --save_smiles_splits is used in conjunction with --separate_test_path. Now, the data split csv files are still generated, but split_indices.pkl is not generated if there are multiple data points with the same SMILES or if some of the data comes from a separate data file.

Issue: https://github.com/chemprop/chemprop/issues/157 PR: https://github.com/chemprop/chemprop/pull/163

SMILES/mols as input to MPNN

Bugfix for SMILES or RDKit molecules as input to MPNN model instead of BatchMolGraph.

PR: https://github.com/chemprop/chemprop/pull/164

v1.2.0

3 years ago

Features

New split type

The split type --split_type cv already existed to perform k-fold cross-validation (where k is set by --num_folds). In each fold, 1/k of the data is put in the test set, 1/k of the data is put in the validation set, and the remaining (k-2)/k of the data is put in the training set.

Now, a new split type --split_type cv-no-test exists which is essentially identical except that it assigns no data to the test set on each fold (https://github.com/chemprop/chemprop/commit/b56ca9866b303036eab61cab93188cccbaa24af2). Instead, 1/k of the data is put in the validation set and (k-1)/k of the data is put in the training set with no test data. The purpose of this split type is to maximize the training data when training a model in cases where the test performance is already known (or is not important) and doesn't need to be determined. Note that the validation set is still necessary to perform early stopping.
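The fold assignment described above can be sketched as follows. This is illustrative only: the data is divided into num_folds pieces (a simple strided assignment here, standing in for however Chemprop actually partitions the data), and in each fold one piece is the validation set while all the others form the training set, with no test set.

```python
def cv_no_test_splits(n_data, num_folds):
    """Return (train_indices, val_indices) per fold for a
    cv-no-test style split. Sketch of the logic, not Chemprop code."""
    indices = list(range(n_data))
    pieces = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for i in range(num_folds):
        val = pieces[i]
        train = [idx for j, piece in enumerate(pieces) if j != i
                 for idx in piece]
        splits.append((train, val))
    return splits
```

Every datapoint appears in exactly one validation set across the folds, and each fold trains on the remaining (k-1)/k of the data.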

Dropping extra columns during prediction

Previously, when using predict.py, all the columns from the test_path file were copied to the preds_path file and then the predictions were added as additional columns at the end. Now there is an option called --drop_extra_columns which will not copy over these extraneous columns to preds_path (https://github.com/chemprop/chemprop/commit/83ea4c06dda4231902777ea6776da922aeba2ad3 and https://github.com/chemprop/chemprop/commit/061339568045863c30c9bd8c2a143b674a0082d8). When --drop_extra_columns is used, preds_path will only contain columns with the SMILES and with the prediction values.

Bug Fixes

Backward compatibility for load_checkpoint

Previously, newer versions of Chemprop incorrectly loaded checkpoints that were trained using older versions of Chemprop due to a change in the names of the parameters. Backward compatibility has now been added to allow this version of Chemprop to load checkpoints with either set of names (https://github.com/chemprop/chemprop/commit/5371b29e7c65e41fa8b83d9c76ba2bfdd400b139 and https://github.com/chemprop/chemprop/commit/206950c6ec92a3646800f95bc69ae6d8dc7ca646).
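A common way to implement this kind of compatibility shim is to map old parameter names to new ones when loading the checkpoint's state dictionary. The sketch below shows the pattern with hypothetical key names; Chemprop's actual parameter names and mapping are not reproduced here.

```python
def rename_state_dict_keys(state_dict, key_map):
    """Return a copy of state_dict with keys renamed per key_map;
    keys not in key_map are kept as-is. Illustrative pattern only."""
    return {key_map.get(k, k): v for k, v in state_dict.items()}

# Hypothetical old/new parameter names, for illustration only.
old = {"encoder.W_i.weight": [0.1], "ffn.1.bias": [0.2]}
new = rename_state_dict_keys(old, {"encoder.W_i.weight": "mpn.W_i.weight"})
```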

Saving SMILES splits

Due to new Chemprop features such as the ability to load multiple molecules, the feature --save_smiles_splits, which saves the SMILES corresponding to the train, validation, and test splits, had broken (https://github.com/chemprop/chemprop/issues/110). This was fixed in https://github.com/chemprop/chemprop/pull/117.

Fixing interpret.py

Similar to the issue with saving SMILES splits, interpret.py broke due to the Chemprop feature that enables multiple molecules to be used as input (https://github.com/chemprop/chemprop/issues/107 and https://github.com/chemprop/chemprop/issues/113). This was fixed in https://github.com/chemprop/chemprop/pull/128.

Updating Dockerfile

The Dockerfile has been updated to address https://github.com/chemprop/chemprop/issues/100 and https://github.com/chemprop/chemprop/issues/129. This was fixed in https://github.com/chemprop/chemprop/pull/131.

Fixing atom descriptors

The atom_descriptors feature did not work in predict.py (https://github.com/chemprop/chemprop/issues/120). This was fixed in https://github.com/chemprop/chemprop/pull/114.

Logging

Logging to the terminal and to files (quiet.log and verbose.log in the save_dir) broke for some OS systems (https://github.com/chemprop/chemprop/issues/106). This was fixed in https://github.com/chemprop/chemprop/pull/118.

README additions

Some of the relatively new features, like custom atomic features, were missing from the README (https://github.com/chemprop/chemprop/issues/121). This was fixed in https://github.com/chemprop/chemprop/pull/122.

Infrastructure Changes

Migrating from Travis CI to GitHub Actions

Chemprop previously used Travis CI to run automated tests upon pushing to master or creating a pull request, but Travis changed its pricing structure and no longer offers unlimited free testing. For this reason, Chemprop now uses GitHub Actions to run automated tests. The results of the test runs can be seen in the Actions tab of the repo.

v1.1.0

3 years ago

Features

Multiple Input Molecules

[PR] Use multiple molecules as input to Chemprop. The number of molecules is specified with the keyword number_of_molecules. Those molecules are embedded with separate D-MPNNs by default. The latent representations are concatenated prior to the FFN.

The keyword mpn_shared allows you to use a shared D-MPNN. Note that, since the latent representations are concatenated, the order of the input molecules is important. This method is not invariant to that order, and there are better ways to use multiple molecules with a shared D-MPNN, which will be implemented in the next release.

Custom Atom Features

[PR] Implemented custom atomic features as a counterpart to the custom molecular features in Chemprop. The new feature allows users to provide additional atomic features for each node in a given molecule. To use the feature, use the keyword atom_descriptors. The custom atom features can be employed in two modes. In the first mode, --atom_descriptors feature, custom features are used as normal node features and are concatenated to the default node vector before the D-MPNN block. In the second mode, --atom_descriptors descriptor, custom atom features do not enter the model until the atom feature vector has been updated through the D-MPNN block; this way, the model disturbs the extra custom atom features as little as possible and preserves their information to the maximum extent.

The extra custom descriptors can be supplied to Chemprop via a pickle file (.pkl, .pickle, .pckl), a NumPy save file (.npz), or an .sdf file.

.pkl format

The .pkl file must store a Pandas DataFrame with SMILES as the index and descriptors as the columns. Each descriptor must be a 1D or 2D numpy array. For example:

One custom atomic feature for each atom, provided as a 1D array:

smiles                          descriptors
CCOc1ccc2nc(S(N)(=O)=O)sc2c1    [0.637781931055927, 0.7075571757878132, 0.7339...
CCN1C(=O)NC(c2ccccc2)C1=O       [0.09588231301387817, 0.6521911050735447, 0.45...

Multiple atomic features for each atom, provided as multiple 1D arrays:

smiles                         desc1                                        desc2
CCOc1ccc2nc(S(N)(=O)=O)sc2c1  [0.637781931055927, 0.7075571757878132...    [0.8266363223032338, 0.89641156703512 ...
CCN1C(=O)NC(c2ccccc2)C1=O     [0.09588231301387817, 0.6521911050735447...  [0.2847367042611851, 0.8410454963208516...

Note: mixing 1D and 2D arrays across different columns is not allowed.
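As a sketch, a file in this format can be written with pandas as below. The SMILES strings and descriptor values are dummy data, and the file name and location are arbitrary; the point is the structure: a DataFrame indexed by SMILES, one column per descriptor, each cell a 1D array with one value per atom.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Dummy data: two molecules, one descriptor column, one value per atom.
smiles = ["CCO", "CCN"]
desc1 = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]
df = pd.DataFrame({"desc1": desc1}, index=smiles)

# Write in the .pkl format described above.
df.to_pickle(os.path.join(tempfile.gettempdir(), "descriptors.pkl"))
```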

.npz file

Atomic descriptors for each molecule must be saved as one independent 2D numpy array ([number of atoms x number of descriptors]) in the .npz file, for example by:

np.savez('descriptors.npz', *descriptors)

where descriptors is a list of 2D atomic descriptor arrays in the same order as the molecules in the training/prediction data file.
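Expanding the save call above into a self-contained sketch (with dummy zero-valued descriptors and an arbitrary temporary path): np.savez stores the positional arrays under the names arr_0, arr_1, ... in order, which is how they map back to the molecules in the data file.

```python
import os
import tempfile

import numpy as np

# Dummy descriptors: two molecules with 3 and 5 atoms, 4 descriptors
# per atom. Arrays must be in the same order as the molecules in the
# training/prediction data file.
descriptors = [np.zeros((3, 4)), np.zeros((5, 4))]

path = os.path.join(tempfile.gettempdir(), "descriptors.npz")
np.savez(path, *descriptors)

# Positional arrays round-trip as arr_0, arr_1, ... in order.
loaded = np.load(path)
```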

.sdf file

Each molecule is represented as a mol block in the .sdf file. Descriptors should be saved as entries for each mol block, formatted as comma-separated values. Each molecule must have an entry named SMILES that stores the SMILES string. For example:

CHEMBL1308_loner5
     RDKit          3D

  6  6  0  0  1  0  0  0  0  0999 V2000
   -0.7579   -0.5337   -2.8744 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2229   -1.3763   -1.7558 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0046   -1.0089   -0.4029 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4824   -2.0104    0.3280 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.5806   -3.0317   -0.5484 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.1735   -2.6999   -1.8031 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  6  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  5  6  1  0
M  END
>  <desc1>  (1) 
-8.568031e-05,0.0001865207,-0.0002012379,-5.054658e-05,0.0002148434,-0.0003503839,1.970448e-05,3.081137e-05,2.997883e-05,9.446278e-05,-7.194711e-05,0.0001527364

>  <desc2>  (1) 
5.462954e-05,-2.415399e-06,0.0001044788,-2.274438e-05,0.0001698836,5.206409e-06,4.5825e-06,-8.882181e-07,-1.08787e-05,2.993307e-05,-4.069051e-06,1.338413e-05

>  <SMILES>  (1) 
Cc1cnnHc1

$$$$

where the names of the descriptor entries (desc1, desc2) can be arbitrary.

When using this feature, users are responsible for all atomic feature preprocessing, including feature normalization and expansion.

Note: this feature was developed for small-to-medium sized training datasets, where extra QM descriptors have been demonstrated to be powerful and to mitigate the performance downgrade otherwise seen with limited data.

Options for Aggregation Function

[PR] By default, at the end of message passing, the D-MPNN aggregates atom hidden representations into a single hidden representation for the whole molecule by taking the mean of the atom representations. Now, this aggregation function can be changed by using --aggregate <mode>, which currently supports “mean” (the default), “sum”, and “norm” (which is equivalent to “sum” with normalization by the constant specified by --aggregation_norm).
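The three aggregation modes can be sketched over plain lists of per-atom hidden vectors. This is illustrative only, not Chemprop's implementation: "mean" averages over atoms, "sum" adds them, and "norm" divides the sum by the fixed constant set via --aggregation_norm.

```python
def aggregate(atom_hiddens, mode="mean", aggregation_norm=100):
    """Combine per-atom hidden vectors (lists of floats, all the same
    length) into one molecule-level vector. Sketch of the mean/sum/norm
    aggregation modes described above."""
    n_atoms = len(atom_hiddens)
    summed = [sum(col) for col in zip(*atom_hiddens)]
    if mode == "mean":
        return [s / n_atoms for s in summed]
    if mode == "sum":
        return summed
    if mode == "norm":
        return [s / aggregation_norm for s in summed]
    raise ValueError(f"unknown aggregation mode: {mode}")
```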

Cross-Validation

[commit] The default split type (i.e., --split_type random) randomly samples data into the train, validation, and test sets on each of the num_folds folds independently. This means that the same molecule can end up in the test split on more than one fold. The advantage of this method is that it can be used easily with an arbitrary number of folds, but the downside is that it does not perform strict cross-validation.

The new split type cv (--split_type cv) performs true cross-validation. The data is broken down into num_folds pieces, each of size len(data) / num_folds, and each piece serves as the test split once, the validation split once, and part of the train split on all other folds. The benefit of this method is that it is true cross-validation, but the downside is that the size of each split is dependent on the number of folds, meaning less flexibility (e.g., --num_folds 3 will result in train, validation, and test splits each with 33.3% of the data, which is perhaps too small for the train split and too large for the test split). --num_folds 10 is recommended.

Saving Test Predictions

[commit] The --save_preds option will save predictions on the test split of each fold in a file called “test_preds.csv” in the save_dir.

Multiple Metrics

[commit] The --metric argument still works as before and this is still the metric that is used for early stopping (i.e., selecting the model which performs best on the validation split), but now there is an additional --extra_metrics argument where additional metrics can be specified and will be recorded. The metrics should be space separated (e.g., --extra_metrics mae rmse r2).
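Computing several metrics in one pass can be sketched as follows, with two common regression metrics standing in for whatever is requested via --metric and --extra_metrics. This is an illustrative sketch, not Chemprop's evaluation code.

```python
import math

def evaluate(preds, targets, metrics=("mae", "rmse")):
    """Return a dict of metric name -> value for the requested
    metrics. Only mae and rmse are sketched here."""
    errors = [p - t for p, t in zip(preds, targets)]
    results = {}
    if "mae" in metrics:
        results["mae"] = sum(abs(e) for e in errors) / len(errors)
    if "rmse" in metrics:
        results["rmse"] = math.sqrt(sum(e * e for e in errors) / len(errors))
    return results
```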

Saving Test Scores

[commit] Scores on the test splits are now saved to file in the save_dir under the name “test_scores.csv”.

Fixes and Improvements

Undefined Rows

[commit] Rows in the input data file with target values that are all undefined are now correctly skipped. This is especially relevant when the row may contain some defined target values, but none of those targets are included in target_columns.
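The skipping rule can be sketched as a filter that keeps a row only if at least one of the selected target columns holds a defined value. Rows are represented as dicts here purely for illustration; this is not Chemprop's data-loading code.

```python
def filter_rows(rows, target_columns):
    """Keep rows with at least one non-None value among
    target_columns; drop rows whose selected targets are all
    undefined. Sketch of the skipping behavior described above."""
    return [row for row in rows
            if any(row.get(col) is not None for col in target_columns)]
```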

Data Loading

[commit] Data is now only loaded once to decrease training time.

Tests

[tests] Added more comprehensive tests to ensure correct functionality.

Train Loss

[commit] Fixed incorrect averaging of the train loss, which affects the train loss that is printed to screen and saved in tensorboard.

v_1.0.2

3 years ago

Since descriptastorus isn't on PyPi, it can't be installed automatically via pip install chemprop. Instead, it must be installed separately via pip install git+https://github.com/bp-kelley/descriptastorus.

v_1.0.1

3 years ago

Fixing an issue with PyPi installation and updating relevant documentation.

v_1.0.0

3 years ago

Chemprop is now available on PyPi: https://pypi.org/project/chemprop. Installation instructions are below.

  1. conda create -n chemprop python=3.8
  2. conda activate chemprop
  3. conda install -c conda-forge rdkit
  4. pip install git+https://github.com/bp-kelley/descriptastorus
  5. pip install chemprop

After installing through PyPi, training and predicting are available via the chemprop_train and chemprop_predict commands, which are equivalent to python train.py and python predict.py. All the command line arguments for training and predicting apply as usual. Please see the README for more details.

v0.0.2

5 years ago