Training#
To train a model, run:
chemprop train --data-path <input_path> --task-type <task> --output-dir <dir>
where <input_path> is the path to a CSV file containing a dataset, <task> is the type of modeling task, and <dir> is the directory where model checkpoints will be saved.
For example:
chemprop train --data-path tests/data/regression.csv \
--task-type regression \
--output-dir solubility_checkpoints
The following modeling tasks are supported:
regression
regression-mve
regression-evidential
regression-quantile
classification
classification-dirichlet
multiclass
multiclass-dirichlet
spectral
A full list of available command-line arguments can be found in CLI Reference.
Input Data#
In order to train a model, you must provide training data containing molecules (as SMILES strings) and known target values. Targets can either be real numbers, if performing regression, or binary (i.e. 0s and 1s), if performing classification. Target values which are unknown can be left as blanks. A model can be trained as either single- or multi-task.
The data file must be a CSV file with a header row. For example:
smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
...
By default, it is assumed that the SMILES are in the first column and the targets are in the remaining columns. However, the specific columns containing the SMILES and targets can be specified using the --smiles-columns <column> and --target-columns <column_1> <column_2> ... flags, respectively. To train on multiple molecules simultaneously (such as a solute and a solvent), supply two column headers to --smiles-columns, e.g. --smiles-columns <column_1> <column_2>.
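As an illustration of the layout described above, the following minimal sketch writes a small multi-task CSV with the SMILES in the first column and unknown targets left blank. The helper name to_csv_text is hypothetical, and the rows are a three-target subset of the example file shown earlier.

```python
import csv
import io

# SMILES first, then one column per target.
# None marks an unknown target and is written as an empty cell.
header = ["smiles", "NR-AR", "NR-AhR", "NR-Aromatase"]
rows = [
    ("CCOc1ccc2nc(S(N)(=O)=O)sc2c1", 0, 1, None),
    ("CCN1C(=O)NC(c2ccccc2)C1=O", 0, 0, 0),
]


def to_csv_text(header, rows):
    """Serialize rows to CSV text, leaving unknown targets blank."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for smiles, *targets in rows:
        writer.writerow([smiles] + ["" if t is None else t for t in targets])
    return buffer.getvalue()


csv_text = to_csv_text(header, rows)
print(csv_text)
```

Writing csv_text to a file gives something that can be passed to --data-path. The default column assumptions then apply without any extra flags.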
Train/Validation/Test Splits#
Our code supports several methods of splitting data into train, validation, and test sets.
Random: By default, the data will be split randomly into train, validation, and test sets.
Scaffold: Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding --split-type scaffold_balanced.
User Specified Splits: Custom splits can be specified in two ways, --splits-column and --splits-file, examples of which are shown below.
chemprop train --splits-column split -i data.csv -t regression
| smiles | property | split |
|---|---|---|
| C | 1.0 | train |
| CC | 2.0 | train |
| CCC | 3.0 | test |
| CCCC | 4.0 | val |
| CCCCC | 5.0 | val |
| CCCCCC | 6.0 | test |
chemprop train --splits-file splits.json -i data.csv -t regression
Note
Use zero-indexing when assigning data indices to different sets. Note also that ranges have inclusive ends (i.e. [0,1] / “0-1” / “0,1” are equivalent).
[
{"train": [0, 1], "val": "2-3", "test": "4,5"},
{"val": [0, 1], "test": "2-3", "train": "4,5"}
]
Note
To train without a validation set or test set, the respective key can be omitted from the JSON file.
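The index notation used in the splits file can be illustrated with a short sketch. The helper expand_indices is hypothetical and is not Chemprop's own parser. It only shows how a list, an inclusive "a-b" range, and a comma-separated string map to the same zero-based indices, and that json.dumps produces a valid file body.

```python
import json


def expand_indices(spec):
    """Expand a list, an inclusive "a-b" range, or an "a,b" string to indices."""
    if isinstance(spec, list):
        return [int(i) for i in spec]
    indices = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            indices.extend(range(start, end + 1))  # ranges have inclusive ends
        else:
            indices.append(int(part))
    return indices


# Two sets of splits over six datapoints (zero-indexed).
splits = [
    {"train": [0, 1], "val": "2-3", "test": "4,5"},
    {"val": [0, 1], "test": "2-3", "train": "4,5"},
]

splits_json = json.dumps(splits, indent=2)  # text to store in splits.json
expanded = [{name: expand_indices(spec) for name, spec in s.items()} for s in splits]

print(splits_json)
print(expanded[0])
```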
Note
By default, both random and scaffold split the data into 80% train, 10% validation, and 10% test. This can be changed with --split-sizes <train_frac> <val_frac> <test_frac>. The default setting is --split-sizes 0.8 0.1 0.1. Both splits also involve a random component that can be seeded with --data-seed <seed>. The default setting is --data-seed 0.
Other supported splitting methods include random_with_repeated_smiles, kennard_stone, and kmeans.
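To make the roles of --split-sizes and --data-seed concrete, here is a minimal sketch of a seeded random split. It is an illustration only and not Chemprop's implementation. The function name random_split and the rounding rule (truncating the train and validation counts and giving the remainder to test) are assumptions.

```python
import random


def random_split(n, sizes=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and cut them by fraction."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed makes the split reproducible
    n_train = int(sizes[0] * n)
    n_val = int(sizes[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = random_split(20)
print(len(train), len(val), len(test))
```

Calling random_split again with the same seed returns the same three lists, which is the behavior a fixed data seed is meant to provide.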
Supplying Separate Train/Val/Test Files
You can also pass in train/val/test files separately through 2 or 3 paths to the training command via --data-path (or -i):
# 3 files: use them as train / val / test respectively
chemprop train --data-path train.csv val.csv test.csv -t regression
# 2 files: first is split into train/val, second is used as test
chemprop train --data-path trainval.csv test.csv -t regression
Note
When two paths are provided, the first file is split into train/val using the configured split method, and the second file is taken as the test set as-is. When three paths are provided, they map directly to train/val/test in order.
Note
Passing separate train/val/test files is not supported for extra features and descriptors .npz files.
Saving Splits
Each training run saves the train/val/test splits used as a splits.json file with the same format as above. Additionally, the flag --save-data-splits can be used to automatically split the input data files into separate train/val/test files in the output directory.
Replicates#
Repeated random trials (i.e. replicates) can be run by specifying --num-replicates <n> (default 1, i.e. no replicates).
This is analogous to the ‘outer loop’ of nested cross validation but at a lower cost, suitable for deep learning applications.
Ensembling#
To train an ensemble, specify the number of models in the ensemble with --ensemble-size <n> (default 1).
Hyperparameters#
Model performance is often highly dependent on the hyperparameters used. Below is a list of common hyperparameters (see CLI Reference for a full list):
--batch-size: Batch size (default 64)
--message-hidden-dim <n>: Hidden dimension of the messages in the MPNN (default 300)
--depth <n>: Number of message-passing steps (default 3)
--dropout <n>: Dropout probability in the MPNN & FFN layers (default 0)
--activation <activation_type>: The activation function used in the MPNN and FFN layers. Run chemprop train -h to see the full list of activation functions supported via CLI.
--epochs <n>: How many epochs to train over (default 50)
--warmup-epochs <n>: The number of epochs during which the learning rate is linearly incremented from init_lr to max_lr (default 2)
--init-lr <n>: Initial learning rate (default 0.0001)
--max-lr <n>: Maximum learning rate (default 0.001)
--final-lr <n>: Final learning rate (default 0.0001)
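The interaction of the four learning-rate options can be sketched as a function of the epoch. The linear warmup from init_lr to max_lr follows the description above. The shape of the decay after warmup is not stated here, so the exponential decay from max_lr to final_lr in this sketch is an assumption, as is evaluating the schedule per epoch rather than per optimizer step.

```python
def learning_rate(epoch, epochs=50, warmup_epochs=2,
                  init_lr=1e-4, max_lr=1e-3, final_lr=1e-4):
    """Learning rate at a given epoch, using the default values listed above."""
    if epoch < warmup_epochs:
        # Linear warmup from init_lr to max_lr.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    decay_epochs = epochs - warmup_epochs
    if decay_epochs <= 0:
        return max_lr
    # ASSUMPTION: exponential decay from max_lr down to final_lr.
    gamma = (final_lr / max_lr) ** (1 / decay_epochs)
    return max_lr * gamma ** (epoch - warmup_epochs)


schedule = [learning_rate(e) for e in range(51)]
print(schedule[0], schedule[2], schedule[50])
```

With the defaults, the rate starts at init_lr, peaks at max_lr when warmup ends, and returns to final_lr at the last epoch.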
Loss Functions#
The loss function can be specified using the --loss-function <function> keyword, where <function> is one of the following:
Regression:
mse: Mean squared error (default)
bounded-mse: Bounded mean squared error
mve: Mean-variance estimation
evidential: Evidential; if used, --evidential-regularization can be specified to modify the regularization, and --eps to modify epsilon.
quantile-point: Point-based pinball (quantile) loss
Classification:
bce: Binary cross-entropy (default)
binary-mcc: Binary Matthews correlation coefficient
dirichlet: Dirichlet
Multiclass:
ce: Cross-entropy (default)
multiclass-mcc: Multiclass Matthews correlation coefficient
dirichlet: Dirichlet
Spectral:
sid: Spectral information divergence (default)
earthmovers: Earth mover’s distance (or first-order Wasserstein distance)
wasserstein: Same as earthmovers (see above).
Evaluation Metrics#
The following evaluation metrics are supported during training:
Regression:
rmse: Root mean squared error
mae: Mean absolute error
mse: Mean squared error (default)
bounded-mae: Bounded mean absolute error
bounded-mse: Bounded mean squared error
bounded-rmse: Bounded root mean squared error
r2: R squared metric
Classification:
roc: Receiver operating characteristic (default)
prc: Precision-recall curve
accuracy: Accuracy
f1: F1 score
bce: Binary cross-entropy
binary-mcc: Binary Matthews correlation coefficient
Multiclass:
ce: Cross-entropy (default)
multiclass-mcc: Multiclass Matthews correlation coefficient
Spectral:
sid: Spectral information divergence (default)
wasserstein: Earth mover’s distance (or first-order Wasserstein distance)
Advanced Training Methods#
Pretraining and Transfer Learning#
An existing model, for example one trained on a larger, lower-quality dataset, can be used to initialize the parameters of a new model by providing a checkpoint of the existing model using --checkpoint <path>, where <path> is the location of the checkpoint(s) or model file(s). It can be a path to either a single pretrained model checkpoint (.ckpt) or a single pretrained model file (.pt), a directory that contains these files, or a list of paths and directories.
When training the new model, its architecture must resemble that of the old model. Depending on the similarity of the tasks and datasets, as well as the quality of the old model, the new model might require fewer epochs to achieve optimal performance compared to training from scratch.
It is also possible to freeze the weights of a loaded Chemprop model during training, such as for transfer learning applications. To do so, first load a pre-trained model by specifying its checkpoint file using --checkpoint <path>. After loading the model, the MPNN weights can be frozen via --freeze-encoder. You can control how many FFN layers are frozen with the --frzn-ffn-layers <n> flag, which freezes the first n layers of the FFN. By default, n is set to 0, meaning all FFN layers are trainable unless specified otherwise.
Finetuning Foundation Models#
Finetuning re-uses the representation learned by a model pretrained on an unrelated task to improve predictions on a new task. This is particularly helpful on small datasets, because the model does not need to re-learn the basic facets of molecular representation.
Unlike Transfer Learning, this does not require that the downstream task’s FFN has the same architecture as the pretrained model. When finetuning, the Message Passing (depth, hidden size, activation function, etc.) and Aggregation configurations are fixed to whatever they were during pretraining, but the FFN is initialized from scratch according to the user’s request and then trained.
Users can access pretrained foundation models by using the --from-foundation <name> command line argument. Currently, the following foundation models are available in Chemprop:
CheMeleon: Mordred-descriptor-based foundation model pretrained on 1M molecules from PubChem, suitable for many tasks and especially small datasets. See the CheMeleon GitHub repository for more information.
<your-model>.pt: Specify a filepath for a Chemprop model trained via the CLI; its Message Passing will be re-used with a new FFN.
The first time a given model is requested it will automatically be downloaded for you and saved to a directory called .chemprop in your home directory (except for your own models).
Performant Training#
By default, graph featurization occurs a single time at the beginning of the training run, and the results are cached for use during each training epoch. This saves time but requires more memory. This behavior can be turned off by specifying --no-cache. In either case, graph featurization can be sped up by using more CPU cores, specified via --num-workers. This will also convert SMILES strings to Chem.Mol objects in parallel and compute any molecule features specified with --molecule-featurizers in parallel.
Note
Setting --num-workers to a value greater than 0 can cause hangs on Windows and macOS.
Training can be further accelerated using a molecular featurizer package called cuik-molmaker. This package is not installed by default, but can be installed using the script check_and_install_cuik_molmaker.py. To enable the accelerated featurizer, use the --use-cuikmolmaker-featurization flag. This featurizer also performs on-the-fly featurization of molecules and reduces memory usage, which is particularly useful for large datasets.
Training on Reactions#
Chemprop can also process atom-mapped reaction SMILES (see Daylight manual for details), which consist of three parts denoting reactants, agents, and products, each separated by “>”. For example, the reaction of methanol to formaldehyde can be written as the atom-mapped reaction SMILES [CH3:1][OH:2]>>[CH2:1]=[O:2] without hydrogens, or as [C:1]([H:3])([H:4])([H:5])[O:2][H:6]>>[C:1]([H:3])([H:4])=[O:2].[H:5][H:6] with hydrogens. The reactions do not need to be balanced and can thus contain unmapped parts, for example leaving groups, if necessary.
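The three-part structure of a reaction SMILES can be checked with plain string handling. The sketch below is not how Chemprop parses reactions. The helper split_reaction_smiles is hypothetical and only separates the reactant, agent, and product parts at “>” and the individual species at “.”.

```python
def split_reaction_smiles(rxn_smiles):
    """Split 'reactants>agents>products' into lists of species SMILES."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators: {rxn_smiles!r}")
    names = ("reactants", "agents", "products")
    # An empty part (e.g. no agents) becomes an empty list.
    return {name: part.split(".") if part else [] for name, part in zip(names, parts)}


without_h = split_reaction_smiles("[CH3:1][OH:2]>>[CH2:1]=[O:2]")
with_h = split_reaction_smiles(
    "[C:1]([H:3])([H:4])([H:5])[O:2][H:6]>>[C:1]([H:3])([H:4])=[O:2].[H:5][H:6]"
)
print(without_h)
print(with_h["products"])
```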
Specify columns in the input file with reaction SMILES using the option --reaction-columns to enable this, which transforms the reactants and products to the corresponding condensed graph of reaction, and changes the initial atom and bond features depending on the argument provided to --rxn-mode <feature_type>:
reac_diff: Featurize with the reactant and the difference upon reaction (default)
reac_prod: Featurize with both the reactant and product
prod_diff: Featurize with the product and the difference upon reaction
Each of these arguments can be modified to balance imbalanced reactions by appending _balance, e.g. reac_diff_balance.
In reaction mode, Chemprop concatenates information to each atomic and bond feature vector. For example, using --reaction-mode reac_prod, each atomic feature vector holds information on the state of the atom in the reactant (similar to default Chemprop), and concatenates information on the state of the atom in the product. Agents are featurized with but not connected to the reactants. Functions incompatible with a reaction as input (scaffold splitting and feature generation) are carried out on the reactants only.
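The concatenation performed by the three reaction modes can be sketched on toy feature vectors. This is an illustration under stated assumptions rather than Chemprop's featurizer. The function combine_features and the two-element atom features are hypothetical, the difference is assumed to be product minus reactant, and the _balance variants are not modeled.

```python
def combine_features(f_reac, f_prod, mode="reac_diff"):
    """Concatenate reactant/product features according to the reaction mode."""
    # ASSUMPTION: the difference upon reaction is product minus reactant.
    f_diff = [p - r for r, p in zip(f_reac, f_prod)]
    if mode == "reac_diff":
        return f_reac + f_diff
    if mode == "reac_prod":
        return f_reac + f_prod
    if mode == "prod_diff":
        return f_prod + f_diff
    raise ValueError(f"unknown reaction mode: {mode}")


# Toy features for the carbon atom: [number of hydrogens, has double bond].
carbon_in_methanol = [3, 0]
carbon_in_formaldehyde = [2, 1]

for mode in ("reac_diff", "reac_prod", "prod_diff"):
    print(mode, combine_features(carbon_in_methanol, carbon_in_formaldehyde, mode))
```

In every mode the combined vector is twice as long as the single-molecule feature vector, which is why the initial atom and bond feature dimensions change when reaction columns are used.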
If the atom-mapped reaction SMILES contain mapped hydrogens, enable explicit hydrogens via --keep-h.
For further details and benchmarking, as well as a citable reference, please see DOI 10.1021/acs.jcim.1c00975.
Training Reactions with Molecules (e.g. Solvents, Reagents)#
Both reaction and molecule SMILES can be associated with a target (e.g. a reaction rate in a solvent). To do so, use both --smiles-columns and --reaction-columns.
The reaction and molecule SMILES columns can be ordered in any way. However, the same column ordering as used in the training must be used for the prediction. For more information on atom-mapped reaction SMILES, please refer to Training on Reactions.
Training on Spectra#
Spectra training differs from other data types because it considers the predictions of all targets together. Each target should be the value of the spectrum at a specific position in the spectrum. Spectra predictions are configured to return only positive values and are normalized so that each spectrum sums to 1. Spectral predictions are still in beta and will be updated in the future.
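The effect of the positivity and normalization steps can be sketched in a few lines. How Chemprop enforces positive values is not stated here, so the exponential transform below is an assumption, and normalize_spectrum is a hypothetical name.

```python
import math


def normalize_spectrum(raw_outputs):
    """Map raw model outputs to a positive spectrum that sums to 1."""
    # ASSUMPTION: positivity is enforced with an exponential transform.
    positive = [math.exp(x) for x in raw_outputs]
    total = sum(positive)
    return [p / total for p in positive]


spectrum = normalize_spectrum([-1.0, 0.0, 2.0, 0.5])
print(spectrum, sum(spectrum))
```

Because every position is divided by the same total, the targets of one spectrum cannot be predicted independently of each other.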
Additional Features#
While the model works very well on its own, especially after hyperparameter optimization, additional features and descriptors may further improve performance on certain datasets. Features are used before message passing while descriptors are used after message passing. The additional features/descriptors can be added at the atom-, bond-, or molecule-level. Molecule-level features can be either automatically generated by RDKit or custom features provided by the user and are concatenated to the learned descriptors generated by Chemprop during message passing (i.e. used as extra descriptors).
Atom-Level Features/Descriptors#
You can provide additional atom features via --atom-features-path /path/to/atom/features.npz as a numpy .npz file. This command concatenates the features to each atomic feature vector before the D-MPNN, so that they are used during message-passing. This file can be saved using np.savez("atom_features.npz", *V_fs), where V_fs is a list containing the atom features V_f for each molecule, where V_f is a 2D array with a shape of number of atoms by number of atom features in the exact same order as the SMILES strings in your data file.
Similarly, you can provide additional atom descriptors via --atom-descriptors-path /path/to/atom/descriptors.npz as a numpy .npz file. This command concatenates the new features to the embedded atomic features after the D-MPNN with an additional linear layer. This file can be saved using np.savez("atom_descriptors.npz", *V_ds), where V_ds has the same format as V_fs above.
The order of the atom features and atom descriptors for each atom per molecule must match the ordering of atoms in the RDKit molecule object.
The atom-level features and descriptors are scaled by default. This can be disabled with the option --no-atom-feature-scaling or --no-atom-descriptor-scaling.
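A minimal sketch of building and saving an atom feature file with np.savez is shown below. The feature values are made up, the two molecules ("C" and "CC") are assumed to have implicit hydrogens so that they contain one and two atoms, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

# One 2D array per molecule, in the same order as the rows of the data file.
# Each array has shape (number of atoms, number of extra atom features).
V_fs = [
    np.array([[0.1, 1.0]]),               # "C": 1 atom, 2 features
    np.array([[0.2, 0.0], [0.3, 0.0]]),   # "CC": 2 atoms, 2 features
]

path = os.path.join(tempfile.mkdtemp(), "atom_features.npz")
np.savez(path, *V_fs)  # positional arrays are stored as arr_0, arr_1, ...

# Read the file back to confirm the per-molecule order was preserved.
with np.load(path) as data:
    loaded = [data[f"arr_{i}"] for i in range(len(data.files))]

print([a.shape for a in loaded])
```

The same pattern applies to atom descriptors and bond features, with the first dimension of each array being the number of atoms or bonds in that molecule.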
Bond-Level Features#
Bond-level features can be provided using the option --bond-features-path /path/to/bond/features.npz as a numpy .npz file. This command concatenates the features to each bond feature vector before the D-MPNN, so that they are used during message-passing. This file can be saved using np.savez("bond_features.npz", *E_fs), where E_fs is a list containing the bond features E_f for each molecule, where E_f is a 2D array with a shape of number of bonds by number of bond features in the exact same order as the SMILES strings in your data file.
The order of the bond features for each molecule must match the bond ordering in the RDKit molecule object.
Note that bond descriptors are not currently supported because the post message passing readout function aggregates atom descriptors.
The bond-level features are scaled by default. This can be disabled with the option --no-bond-features-scaling.
Extra Datapoint Descriptors#
Additional datapoint descriptors can be concatenated to the learned representation after aggregation. These extra descriptors could be molecule-level features. If you install from source, you can modify the code to load custom descriptors as follows:
Generate features: If you want to generate molecular features in code, you can write a custom features generator function using the default featurizers in chemprop/featurizers/. This also works for custom atom and bond features.
Load features: Additional descriptors can be provided using --descriptors-path /path/to/descriptors.npz where the descriptors are saved as a numpy .npz file. This file can be saved using np.savez("/path/to/descriptors.npz", X_d), where X_d is a 2D array with a shape of number of datapoints by number of additional descriptors. Note that the descriptors must be in the same order as the SMILES strings in your data file. The extra descriptors are scaled by default. This can be disabled with the option --no-descriptor-scaling.
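Unlike the per-molecule atom and bond files, the extra descriptors are stored as a single 2D array. The sketch below uses made-up descriptor values for three datapoints and writes to a temporary directory.

```python
import os
import tempfile

import numpy as np

# One row per datapoint (same order as the data file), one column per descriptor.
X_d = np.array([
    [16.04, 0.0],
    [30.07, 1.0],
    [44.10, 2.0],
])

path = os.path.join(tempfile.mkdtemp(), "descriptors.npz")
np.savez(path, X_d)  # a single positional array is stored under the key arr_0

with np.load(path) as data:
    keys = list(data.files)
    X_loaded = data["arr_0"]

print(keys, X_loaded.shape)
```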
Molecule-Level 2D Features#
Chemprop provides several molecule featurizers that automatically calculate molecular features and use them as extra datapoint descriptors. These are specified using --molecule-featurizers followed by one or more of the following:
morgan_binary: Binary Morgan fingerprints, radius 2 and 2048 bits
morgan_count: Count-based Morgan fingerprints, radius 2 and 2048 bits
rdkit_2d: RDKit 2D features
v1_rdkit_2d: The RDKit 2D features used in Chemprop v1
v1_rdkit_2d_normalized: The normalized RDKit 2D features used in Chemprop v1
Note
The Morgan fingerprints should not be scaled. Use --no-descriptor-scaling to ensure this.
The RDKit 2D features are not normalized. The StandardScaler used in the CLI to normalize is non-optimal for some of the RDKit features. It is recommended to precompute and scale these features outside of the CLI using an appropriate scaler and then provide them using --descriptors-path and --no-descriptor-scaling as described above.
In Chemprop v1, descriptastorus was used to calculate RDKit 2D features. This package offers normalization of the features, with the normalizations fit to a set of molecules randomly selected from ChEMBL. Several descriptors that have since been added to RDKit are not included in descriptastorus, including ‘AvgIpc’, ‘BCUT2D_CHGHI’, ‘BCUT2D_CHGLO’, ‘BCUT2D_LOGPHI’, ‘BCUT2D_LOGPLOW’, ‘BCUT2D_MRHI’, ‘BCUT2D_MRLOW’, ‘BCUT2D_MWHI’, ‘BCUT2D_MWLOW’, and ‘SPS’.
Missing Target Values#
When training multitask models (models which predict more than one target simultaneously), sometimes not all target values are known for all molecules in the dataset. Chemprop automatically handles missing entries in the dataset by masking out the respective values in the loss function, so that partial data can be utilized.
The loss function is rescaled according to all non-missing values, and missing values do not contribute to validation or test errors. Training on partial data is therefore possible and encouraged (versus taking out datapoints with missing target entries). No keyword is needed for this behavior; it is the default.
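The masking idea can be sketched with a plain-Python mean squared error in which missing targets are represented by None. This illustrates the principle only. Chemprop's actual loss operates on tensors, and the function name masked_mse is hypothetical.

```python
def masked_mse(preds, targets):
    """Mean squared error over the known targets only (None = missing)."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(preds, targets):
        for pred, target in zip(pred_row, target_row):
            if target is None:
                continue  # missing entries contribute nothing to the loss
            total += (pred - target) ** 2
            count += 1
    # Rescale by the number of non-missing values, not by the table size.
    return total / count if count else 0.0


preds = [[1.0, 2.0], [3.0, 4.0]]
targets = [[1.0, None], [1.0, 4.0]]  # second task unknown for the first molecule
loss = masked_mse(preds, targets)
print(loss)
```

Here three of the four entries are known, so the squared errors (0, 4, and 0) are averaged over three values rather than four.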
TensorBoard#
During training, TensorBoard logs are automatically saved to the output directory under model_{i}/trainer_logs/version_0/.
To view TensorBoard logs, run tensorboard --logdir=<dir> where <dir> is the path to the checkpoint directory. Then navigate to http://localhost:6006.