Message Passing Neural Networks for Molecule Property Prediction
Introduces spectra as a new dataset type available for training, in which each target in a multitarget regression refers to a positive intensity value at one position of a spectrum. Training methods are consistent with https://github.com/gfm-collab/chemprop-IR. The default loss function is spectral information divergence (SID), but Wasserstein loss (earth mover's distance) is also supported with --metric wasserstein --alternative_loss_function wasserstein.
PR #197
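To make the two spectral losses concrete, here is a minimal pure-Python sketch of SID and of the earth mover's distance between two spectra. It assumes both spectra are sampled on the same unit-spaced grid and are normalized to sum to 1; the function names and the eps clamp are illustrative choices, not Chemprop's implementation, which may differ in its details.

```python
import math
from itertools import accumulate


def normalize(spectrum):
    """Scale positive intensities so that they sum to 1."""
    total = sum(spectrum)
    return [x / total for x in spectrum]


def sid(pred, target, eps=1e-8):
    """Spectral information divergence: the symmetric KL divergence
    between two normalized spectra (eps avoids log(0); an assumption)."""
    p = [max(x, eps) for x in normalize(pred)]
    q = [max(x, eps) for x in normalize(target)]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def wasserstein(pred, target):
    """Earth mover's distance between two normalized 1D spectra,
    computed as the summed absolute difference of their cumulative sums."""
    p_cdf = accumulate(normalize(pred))
    q_cdf = accumulate(normalize(target))
    return sum(abs(a - b) for a, b in zip(p_cdf, q_cdf))
```

SID compares intensities position by position, whereas the Wasserstein distance also accounts for how far intensity has to be moved along the spectrum.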
Refactored make_predictions into smaller functions so that Chemprop functions are easier to use as a Python library. The refactoring is specifically designed to allow a model to be loaded a single time with chemprop.train.load_model and then used for multiple rounds of prediction by passing that model as an argument to chemprop.train.make_predictions.
PR #200
Added several new features to hyperparameter optimization, many related to hyperparameter checkpoints saved in the location specified by --hyperopt_checkpoint_dir <dir_path>. The new functionalities:
train.py using --manual_trial_dirs <list-of-directories>.
--startup_random_iters <int, default=10>.
PR #208
When making predictions from an ensemble of models, returns the individual predictions from each model in addition to the mean prediction when --individual_ensemble_predictions is specified.
PR #190
Allows latent fingerprints to be calculated from an ensemble of models by concatenating them. Also allows the latent representation to be taken either from the MPNN output or from the next-to-last FFN layer using the argument --fingerprint_type <MPN or last_FFN>.
PR #193
Sklearn multitask training cannot proceed when targets are missing from the data; previously, such data had to be run as multiple single-task models. This PR introduces target imputation so that multitask sklearn training can proceed even when some data is missing, with the argument --impute_mode <model/linear/median/mean/frequent> indicating which imputation method to use.
PR #210
Issue #211
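As an illustration of the simpler imputation modes, the sketch below fills missing entries of a single target column with the mean, median, or most frequent known value. The function name and the use of None for a missing target are assumptions made for this example; the model and linear modes are not covered here.

```python
from statistics import mean, median, mode


def impute_column(values, impute_mode="mean"):
    """Fill None entries of one target column from the known entries.

    Covers only the mean/median/frequent modes; a hypothetical helper,
    not Chemprop's own code.
    """
    known = [v for v in values if v is not None]
    fill_functions = {"mean": mean, "median": median, "frequent": mode}
    fill = fill_functions[impute_mode](known)
    return [fill if v is None else v for v in values]
```

For example, `impute_column([1.0, None, 3.0], "mean")` fills the gap with 2.0, so every molecule has a value for every task during training.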
Adds options in reaction training for how to handle situations where reactants and products are not balanced. The argument --reaction_mode now also has the options reac_diff_balance, prod_diff_balance, and reac_prod_balance (in addition to the current options reac_diff, prod_diff, and reac_prod). Also fixes an error where atomic numbers are incorrect when an atom is present in the products but not in the reactants.
PR #212
Issue #204
Resolves a problem with TAP (typed-argument-parser) where running Chemprop from inside a different git repo would trigger an error related to the generation of a reproducibility hash. In this situation the reproducibility hash is not generated, but it logs the issue and does not stop Chemprop from running. PR #195
Changes the way that global variables related to model construction and feature vector size are handled. Resolves a problem in pytest where these variables wouldn't reset between runs. PR #206
As training progresses through the folds of a multiple-fold model, the results of each individual fold are stored in a JSON file. If training is interrupted and restarted with the flag --resume_experiment, the completed fold results are read from the JSON file and training resumes on the first uncompleted fold.
PR #164
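The resume logic can be pictured with a small sketch: each fold writes its scores to a JSON file, and a restarted run reuses any file it finds instead of retraining. The directory layout, the file name test_scores.json, and the train_fold callback are hypothetical names chosen for this example.

```python
import json
import os
import tempfile


def run_folds(save_dir, num_folds, train_fold):
    """Train each fold, reusing saved JSON results for completed folds."""
    scores = []
    for fold in range(num_folds):
        path = os.path.join(save_dir, f"fold_{fold}", "test_scores.json")
        if os.path.exists(path):
            # Fold already finished in an earlier run: load, do not retrain.
            with open(path) as f:
                scores.append(json.load(f))
            continue
        result = train_fold(fold)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(result, f)
        scores.append(result)
    return scores


# Demo: a run that stops after two folds, then a resumed three-fold run.
calls = []


def fake_train(fold):
    calls.append(fold)
    return {"rmse": float(fold)}


with tempfile.TemporaryDirectory() as save_dir:
    run_folds(save_dir, 2, fake_train)           # "interrupted" after 2 folds
    scores = run_folds(save_dir, 3, fake_train)  # resumed run trains fold 2 only
```

In the demo, `calls` records that each fold was trained exactly once across both runs.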
Added functionality to freeze the MPN or FFN layers in a model being trained at the values of a previously trained model. Freezes MPN values using a model indicated with --checkpoint_frzn <path>. FFN layers will also be frozen if indicated with --frzn_ffn_layers <number-of-layers>. Models with multiple molecules can choose to freeze only the first molecule's MPN using --freeze_first_only.
PR #170
Added HDBScan clustering to the tSNE script. PR #172
Added training weights for different targets and different datapoints, with normalization of weight values. Target weights are indicated with the argument --target_weights <list-of-values>. Data weights are supplied through an input file indicated with the argument --data_weights_path <path>.
PR #173, #175, #189
Issue #145
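A minimal sketch of how target and data weights might enter a loss is shown below. It assumes that "normalization" means rescaling each set of weights to an average of 1, which is one common convention but is not spelled out in these notes, and it uses a plain mean squared error for illustration; the function names are hypothetical.

```python
def normalize_weights(weights):
    """Rescale weights so that their average is 1 (assumed convention)."""
    avg = sum(weights) / len(weights)
    return [w / avg for w in weights]


def weighted_mse(preds, targets, target_weights, data_weights):
    """Mean squared error with a weight per target and per datapoint.

    preds[i][t] and targets[i][t] hold the value for datapoint i, target t.
    """
    tw = normalize_weights(target_weights)
    dw = normalize_weights(data_weights)
    total, count = 0.0, 0
    for row_pred, row_true, d in zip(preds, targets, dw):
        for p, t, w in zip(row_pred, row_true, tw):
            total += d * w * (p - t) ** 2
            count += 1
    return total / count
```

With all weights equal, this reduces to the ordinary mean squared error, so weighting only shifts emphasis between targets and datapoints.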
Providing SMILES or RDKit molecules to the MPN's forward function failed (only BatchMolGraph worked) following other changes. Now, SMILES and RDKit molecules can once again be used as input.
PR #164
Backwards compatibility for features scaling PR #164 Issue #108
Added information to the readme and documentation of pre-training, treatment of missing values in multitask models and caching. PR #165 Issue #156
Corrected an error when using the metric accuracy with multiclass classification.
PR #169
Bugfix for compatibility issues of RDKit 2021.03.01 with the interpretation script. PR #182 Issue #178
Enabled custom input of atom and bond features, either in addition to or instead of the default features.
PR: https://github.com/chemprop/chemprop/pull/137
Introduced the argument --ensemble_variance, which calculates the epistemic uncertainty of predictions via an ensemble of models.
PR: https://github.com/chemprop/chemprop/pull/140
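The idea behind an ensemble-based uncertainty estimate can be sketched as follows: each model predicts every molecule, and the spread of those predictions is reported alongside their mean. The function name is illustrative, and the use of the population variance (rather than the sample variance) is an assumption.

```python
from statistics import mean, pvariance


def ensemble_mean_and_variance(model_predictions):
    """model_predictions[m][i] is model m's prediction for molecule i.

    Returns per-molecule means and population variances across models.
    """
    per_molecule = list(zip(*model_predictions))
    means = [mean(p) for p in per_molecule]
    variances = [pvariance(p) for p in per_molecule]
    return means, variances
```

A variance of zero means all models agree on a molecule; larger values flag predictions the ensemble is less sure about.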
Introduced the CGR option: input of atom-mapped reaction SMILES instead of molecules. This creates a pseudo-molecule of the graph transition state between reactants and products, and performs message passing on this pseudo-molecule.
PR: https://github.com/chemprop/chemprop/pull/152
Added new functionality that computes the latent representation of a molecule (the MPNN output) from a given checkpoint file, in much the same way as predicting, and saves it to file.
PR: https://github.com/chemprop/chemprop/pull/119
Updates to the preprocessing, handling and saving of smiles strings. Removed redundant checks.
PR: https://github.com/chemprop/chemprop/pull/135
Experiments with multiple folds can now be resumed using the --resume_experiment flag. Additionally, the test results of each fold are saved as a JSON file in the corresponding subfolder in save_dir.
PR: https://github.com/chemprop/chemprop/pull/164
Major bugfix for running Chemprop with the argument --atom_messages, where the wrong features were passed to the MPNN. This improves the performance of Chemprop in atom_messages mode, and causes backwards incompatibility with old checkpoint files if created in atom_messages mode. Since Chemprop is mainly used for directed message passing via bond messages, we hope not many users are affected.
Issue: https://github.com/chemprop/chemprop/issues/133 PR: https://github.com/chemprop/chemprop/pull/138
Backwards compatibility for correctly setting recently introduced training arguments for old models.
Issue: https://github.com/chemprop/chemprop/issues/148 and https://github.com/chemprop/chemprop/issues/108 PR: https://github.com/chemprop/chemprop/pull/149 and PR: https://github.com/chemprop/chemprop/pull/164
Bugfix in training sklearn models: Scores were not saved correctly previously.
PR: https://github.com/chemprop/chemprop/pull/162
Bugfix in a standalone script to create data splits: Multi-molecule input had previously created incompatibilities with passing data to the scaffold split functionality. Update of docstring.
Issue: https://github.com/chemprop/chemprop/issues/158 PR: https://github.com/chemprop/chemprop/pull/159
Bugfix for sanity checks for dimensions of batches within the MPNN forward pass: The introduction of multi-molecule input had previously caused an inconsistency in one of the checks.
Issue: https://github.com/chemprop/chemprop/issues/153 PR: https://github.com/chemprop/chemprop/pull/154
Bugfix for type annotation in the MPNN forward pass + update of docstring.
PR: https://github.com/chemprop/chemprop/pull/151 and PR: https://github.com/chemprop/chemprop/pull/164
Bugfix for calculating Tanimoto distances. The introduction of multi-molecule input had previously caused incompatibilities in the standalone script to find similar molecules in the training data.
Issue: https://github.com/chemprop/chemprop/issues/143 PR: https://github.com/chemprop/chemprop/pull/144
Fixed typos for a few arguments in the README
PR: https://github.com/chemprop/chemprop/pull/139
Bugfix in standalone script sanitize.py - open output file with write access.
Bugfix for creating RDKit molecules from smiles strings. Previously the molecules were recreated even though they were already cached.
PR: https://github.com/chemprop/chemprop/pull/152
Bugfix for an error occurring when --save_smiles_splits is used in conjunction with --separate_test_path. Now, the data split csv files are still generated, but split_indices.pkl is not generated if there are multiple data points with the same SMILES or if some of the data comes from a separate data file.
Issue: https://github.com/chemprop/chemprop/issues/157 PR: https://github.com/chemprop/chemprop/pull/163
Bugfix for SMILES or RDKit molecules as input to MPNN model instead of BatchMolGraph.
The split type --split_type cv already existed to perform k-fold cross-validation (where k is set by --num_folds). In each fold, 1/k of the data is put in the test set, 1/k of the data is put in the validation set, and the remaining (k-2)/k of the data is put in the training set.
Now, a new split type --split_type cv-no-test exists which is essentially identical except that it assigns no data to the test set on each fold (https://github.com/chemprop/chemprop/commit/b56ca9866b303036eab61cab93188cccbaa24af2). Instead, 1/k of the data is put in the validation set and (k-1)/k of the data is put in the training set with no test data. The purpose of this split type is to maximize the training data when training a model in cases where the test performance is already known (or is not important) and doesn't need to be determined. Note that the validation set is still necessary to perform early stopping.
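The difference between the two split types can be sketched in a few lines. The sketch assumes the data is shuffled once and cut into k pieces, and that the piece following the test piece serves as validation; that choice, along with the function and parameter names, is an assumption for illustration rather than a description of Chemprop's internals.

```python
import random


def cv_split(n_points, num_folds, fold, no_test=False, seed=0):
    """Return (train, val, test) index lists for one fold of k-fold CV.

    With no_test=True the would-be test piece is added to the training set.
    """
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    pieces = [indices[i::num_folds] for i in range(num_folds)]
    val_piece = (fold + 1) % num_folds  # assumed choice of validation piece
    train, val, test = [], [], []
    for i, piece in enumerate(pieces):
        if i == fold and not no_test:
            test = piece
        elif i == val_piece:
            val = piece
        else:
            train.extend(piece)
    return train, val, test
```

With 100 data points and 10 folds, the cv behaviour gives an 80/10/10 split on each fold, while the cv-no-test behaviour gives 90/10/0.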
Previously, when using predict.py, all the columns from the test_path file were copied to the preds_path file and then the predictions were added as additional columns at the end. Now there is an option called --drop_extra_columns which will not copy over these extraneous columns to preds_path (https://github.com/chemprop/chemprop/commit/83ea4c06dda4231902777ea6776da922aeba2ad3 and https://github.com/chemprop/chemprop/commit/061339568045863c30c9bd8c2a143b674a0082d8). When --drop_extra_columns is used, preds_path will only contain columns with the SMILES and with the prediction values.
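The effect of the option on the output rows can be sketched with plain dictionaries standing in for CSV rows. The function and parameter names here are hypothetical and only illustrate which columns survive.

```python
def build_pred_rows(test_rows, preds, smiles_column, pred_names,
                    drop_extra_columns=False):
    """Combine input rows with predictions, optionally keeping only SMILES.

    test_rows: list of dicts (one per input row); preds: list of lists.
    """
    out = []
    for row, pred in zip(test_rows, preds):
        if drop_extra_columns:
            new_row = {smiles_column: row[smiles_column]}
        else:
            new_row = dict(row)  # copy every input column
        new_row.update(zip(pred_names, pred))
        out.append(new_row)
    return out
```

For an input row `{"smiles": "CCO", "id": "7"}`, the default keeps the id column next to the prediction, while `drop_extra_columns=True` leaves only the SMILES and the predicted values.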
load_checkpoint
Previously, newer versions of Chemprop incorrectly loaded checkpoints that were trained using older versions of Chemprop due to a change in the names of the parameters. Backward compatibility has now been added to allow this version of Chemprop to load checkpoints with either set of names (https://github.com/chemprop/chemprop/commit/5371b29e7c65e41fa8b83d9c76ba2bfdd400b139 and https://github.com/chemprop/chemprop/commit/206950c6ec92a3646800f95bc69ae6d8dc7ca646).
Due to new Chemprop features such as the ability to load multiple molecules, the feature --save_smiles_splits, which saves the SMILES corresponding to the train, validation, and test splits, had broken (https://github.com/chemprop/chemprop/issues/110). This was fixed in https://github.com/chemprop/chemprop/pull/117.
interpret.py
Similar to the issue with saving SMILES splits, interpret.py broke due to the Chemprop feature that enables multiple molecules to be used as input (https://github.com/chemprop/chemprop/issues/107 and https://github.com/chemprop/chemprop/issues/113). This was fixed in https://github.com/chemprop/chemprop/pull/128.
The Dockerfile has been updated to address https://github.com/chemprop/chemprop/issues/100 and https://github.com/chemprop/chemprop/issues/129. This was fixed in https://github.com/chemprop/chemprop/pull/131.
The atom_descriptors feature did not work in predict.py (https://github.com/chemprop/chemprop/issues/120). This was fixed in https://github.com/chemprop/chemprop/pull/114.
Logging to the terminal and to files (quiet.log and verbose.log in the save_dir) broke on some operating systems (https://github.com/chemprop/chemprop/issues/106). This was fixed in https://github.com/chemprop/chemprop/pull/118.
Some of the relatively new features, like custom atomic features, were missing from the README (https://github.com/chemprop/chemprop/issues/121). This was fixed in https://github.com/chemprop/chemprop/pull/122.
Chemprop previously used Travis CI to run automated tests upon pushing to master or creating a pull request, but Travis changed its pricing structure and no longer offers unlimited free testing. For this reason, Chemprop now uses GitHub Actions to run automated tests. The results of the test runs can be seen in the Actions tab of the repo.
[PR] Use multiple molecules as an input to Chemprop. The number of molecules is specified with the keyword number_of_molecules. By default, each molecule is embedded with a separate D-MPNN. The latent representations are concatenated prior to the FFN.
The keyword mpn_shared allows you to use a shared D-MPNN. Note that, since the latent representations are concatenated, the order of the input molecules is important. This method is not invariant to that order, and there are better ways to use multiple molecules with a shared D-MPNN, which will be implemented for the next release.
[PR] Implemented custom atomic features as a counterpart of the custom molecular features in ChemProp. The new feature allows users to provide additional atomic features for each node in a given molecule. To use it, pass the keyword atom_descriptors. The custom atom features can be employed in two modes. In the first mode, --atom_descriptors feature, custom features are used as normal node features, which are concatenated to the default node vector before the D-MPNN block. In the second mode, --atom_descriptors descriptor, custom atom features do not participate in the model until the atom feature vector has been updated through the D-MPNN block. That is, the --atom_descriptors descriptor mode leaves the extra custom atom features largely undisturbed and preserves their information to the maximum extent.
The extra custom descriptors can be passed to ChemProp through a variety of pickle files (.pkl, .pickle, .pckl), a NumPy save file (.npz), or a .sdf file.
.pkl format
The .pkl file must store a Pandas DataFrame with smiles as index and columns as descriptors. All descriptors must be a 1D numpy array or 2D numpy array. For example:
1 custom atomic feature for each atom provided in a 1D array
smiles descriptors
CCOc1ccc2nc(S(N)(=O)=O)sc2c1 [0.637781931055927, 0.7075571757878132, 0.7339...
CCN1C(=O)NC(c2ccccc2)C1=O [0.09588231301387817, 0.6521911050735447, 0.45...
Multiple atomic features for each atom provided in multiple 1D arrays
smiles desc1 desc2
CCOc1ccc2nc(S(N)(=O)=O)sc2c1 [0.637781931055927, 0.7075571757878132... [0.8266363223032338, 0.89641156703512 ...
CCN1C(=O)NC(c2ccccc2)C1=O [0.09588231301387817, 0.6521911050735447... [0.2847367042611851, 0.8410454963208516...
Note: mixing 1D arrays and 2D arrays across different columns is not allowed.
.npz file
Atomic descriptors for each molecule must be saved as one independent 2D numpy array ([number of atoms x number of descriptors]) in the .npz file, for example by:
np.savez('descriptors.npz', *descriptors)
where descriptors is a list of 2D arrays of atomic descriptors, in the order of the molecules in the training/prediction data file.
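The following sketch writes such a file for two made-up molecules and reads it back. It relies on NumPy's behaviour of naming positional arrays arr_0, arr_1, and so on; the array values and the temporary path are placeholders for illustration.

```python
import os
import tempfile

import numpy as np

# Two hypothetical molecules with 3 and 5 atoms, 2 descriptors per atom.
descriptors = [
    np.arange(6, dtype=float).reshape(3, 2),
    np.arange(10, dtype=float).reshape(5, 2),
]

path = os.path.join(tempfile.mkdtemp(), "descriptors.npz")
np.savez(path, *descriptors)

# Positional arrays are stored under the keys arr_0, arr_1, ...
with np.load(path) as data:
    loaded = [data[f"arr_{i}"] for i in range(len(data.files))]
```

Because molecules differ in atom count, the arrays have different first dimensions, which is why each molecule is stored as its own array rather than in one stacked matrix.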
.sdf file
Each molecule is presented as a mol block in the .sdf file. Descriptors should be saved as entries for each mol block in the format of comma-separated values. Each molecule must have an entry named SMILES that stores the SMILES string. For example:
CHEMBL1308_loner5
RDKit 3D
6 6 0 0 1 0 0 0 0 0999 V2000
-0.7579 -0.5337 -2.8744 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2229 -1.3763 -1.7558 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0046 -1.0089 -0.4029 C 0 0 0 0 0 0 0 0 0 0 0 0
0.4824 -2.0104 0.3280 N 0 0 0 0 0 0 0 0 0 0 0 0
0.5806 -3.0317 -0.5484 N 0 0 0 0 0 0 0 0 0 0 0 0
0.1735 -2.6999 -1.8031 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 6 2 0
2 3 1 0
3 4 2 0
4 5 1 0
5 6 1 0
M END
> <desc1> (1)
-8.568031e-05,0.0001865207,-0.0002012379,-5.054658e-05,0.0002148434,-0.0003503839,1.970448e-05,3.081137e-05,2.997883e-05,9.446278e-05,-7.194711e-05,0.0001527364
> <desc2> (1)
5.462954e-05,-2.415399e-06,0.0001044788,-2.274438e-05,0.0001698836,5.206409e-06,4.5825e-06,-8.882181e-07,-1.08787e-05,2.993307e-05,-4.069051e-06,1.338413e-05
> <SMILES> (1)
Cc1cnnHc1
$$$$
where the names of the descriptor entries (desc1, desc2) can be arbitrary.
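To show how the data entries of one record map to per-atom values, here is a standard-library-only sketch that pulls the "> <name>" entries out of a record and converts the comma-separated descriptor entries to floats. In practice a cheminformatics toolkit would normally read the file; the helper names below are hypothetical, and the parser assumes each entry's value sits on a single line.

```python
def parse_sdf_fields(record):
    """Return {entry name: raw value} for the '> <name>' items of one record."""
    fields, name = {}, None
    for line in record.splitlines():
        line = line.strip()
        if line.startswith(">") and "<" in line:
            start = line.index("<")
            name = line[start + 1:line.index(">", start)]
        elif name is not None and line and line != "$$$$":
            fields[name] = line  # single-line value assumed
            name = None
    return fields


def descriptor_vectors(fields):
    """Convert every entry except SMILES into a list of floats."""
    return {key: [float(x) for x in value.split(",")]
            for key, value in fields.items() if key != "SMILES"}


record = "> <desc1> (1)\n0.1,0.2\n\n> <SMILES> (1)\nCC\n\n$$$$"
fields = parse_sdf_fields(record)
vectors = descriptor_vectors(fields)
```

Here `fields` holds the raw strings for desc1 and SMILES, and `vectors` holds only the numeric descriptor entries.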
When using this feature, users are responsible for all atomic feature preprocessing, including feature normalization and expansion.
Note: This feature was developed for small- to medium-sized training datasets, where extra QM descriptors have been shown to be powerful and to slow the degradation of model performance.
[PR] By default, at the end of message passing, the D-MPNN aggregates atom hidden representations into a single hidden representation for the whole molecule by taking the mean of the atom representations. Now, this aggregation function can be changed by using --aggregation <mode>, which currently supports “mean” (the default), “sum”, and “norm” (which is equivalent to “sum” with normalization by the constant specified by --aggregation_norm).
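The three aggregation modes differ only in what the summed atom vectors are divided by, as the sketch below shows. The function name and the default constant of 100 are illustrative assumptions.

```python
def aggregate(atom_hiddens, mode="mean", aggregation_norm=100):
    """Combine per-atom hidden vectors into one molecule vector.

    atom_hiddens: list of equal-length lists, one per atom.
    """
    summed = [sum(column) for column in zip(*atom_hiddens)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(atom_hiddens) for s in summed]
    if mode == "norm":
        return [s / aggregation_norm for s in summed]
    raise ValueError(f"Unknown aggregation mode: {mode}")
```

Unlike "mean", both "sum" and "norm" keep the result sensitive to the number of atoms, with "norm" merely rescaling the sum by a fixed constant.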
[commit] The default split type (i.e., --split_type random) randomly samples data into the train, validation, and test sets on each of the num_folds folds independently. This means that the same molecule can end up in the test split on more than one fold. The advantage of this method is that it can be used easily with an arbitrary number of folds, but the downside is that it does not perform strict cross-validation.
The new split type cv (--split_type cv) performs true cross-validation. The data is broken down into num_folds pieces, each of size len(data) / num_folds, and each piece serves as the test split once, the validation split once, and part of the train split on all other folds. The benefit of this method is that it is true cross-validation, but the downside is that the size of each split is dependent on the number of folds, meaning less flexibility (e.g., --num_folds 3 will result in train, validation, and test splits each with 33.3% of the data, which is perhaps too small for the train split and too large for the test split). --num_folds 10 is recommended.
[commit] The --save_preds option will save predictions on the test split of each fold in a file called “test_preds.csv” in the save_dir.
[commit] The --metric argument still works as before and this is still the metric that is used for early stopping (i.e., selecting the model which performs best on the validation split), but now there is an additional --extra_metrics argument where additional metrics can be specified and will be recorded. The metrics should be space separated (e.g., --extra_metrics mae rmse r2).
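For reference, the three metrics named in the example can be computed as in the sketch below, which evaluates a primary metric together with any extra metrics in one pass. The function names and the METRICS lookup table are illustrative, not Chemprop's own.

```python
import math


def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def rmse(preds, targets):
    """Root mean squared error."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


def r2(preds, targets):
    """Coefficient of determination."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1 - ss_res / ss_tot


METRICS = {"mae": mae, "rmse": rmse, "r2": r2}


def evaluate(preds, targets, metric, extra_metrics=()):
    """Score predictions with the primary metric and any extra metrics."""
    return {name: METRICS[name](preds, targets)
            for name in [metric, *extra_metrics]}
```

Only the primary metric would drive early stopping; the extra metrics are simply recorded next to it.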
[commit] Scores on the test splits are now saved to file in the save_dir under the name “test_scores.csv”.
[commit] Rows in the input data file with target values that are all undefined are now correctly skipped. This is especially relevant when the row may contain some defined target values, but none of those targets are included in target_columns.
[commit] Data is now only loaded once to decrease training time.
[tests] Added more comprehensive tests to ensure correct functionality.
[commit] Fixed incorrect averaging of the train loss, which affects the train loss that is printed to screen and saved in tensorboard.
Since descriptastorus isn't on PyPi, it can't be installed automatically via pip install chemprop. Instead, it must be installed separately via pip install git+https://github.com/bp-kelley/descriptastorus.
Fixing an issue with PyPi installation and updating relevant documentation.
Chemprop is now available on PyPi: https://pypi.org/project/chemprop. Installation instructions are below.
conda create -n chemprop python=3.8
conda activate chemprop
conda install -c conda-forge rdkit
pip install git+https://github.com/bp-kelley/descriptastorus
pip install chemprop
After installing through PyPi, training and predicting are available via the chemprop_train and chemprop_predict commands, which are equivalent to python train.py and python predict.py. All the command line arguments for training and predicting apply as usual. Please see the README for more details.