Scikit-Learn Models
In addition to message passing neural networks, Chemprop also enables training and predicting with scikit-learn Random Forest and Support Vector Machine models applied to Morgan fingerprints.
Scikit-Learn Train
chemprop.sklearn_train.py contains functions for training scikit-learn
models.
- chemprop.sklearn_train.impute_sklearn(model: RandomForestRegressor | RandomForestClassifier | SVR | SVC, train_data: MoleculeDataset, args: SklearnTrainArgs, logger: Logger | None = None, threshold: float = 0.5) -> List[float]
Trains a single-task scikit-learn model, meaning a separate model is trained for each task.
This is necessary if some tasks have None (unknown) values.
- Parameters:
model – The scikit-learn model to train.
train_data – The training data.
args – A SklearnTrainArgs object containing arguments for training the scikit-learn model.
logger – A logger to record output.
threshold – Threshold for classification tasks.
- Returns:
A list of lists of target values.
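The function name and return value suggest that unknown targets are filled in task by task. The following is a minimal sketch of that idea, not Chemprop's implementation: `impute_missing_targets` is a hypothetical helper, the binary feature rows are random stand-ins for Morgan fingerprints, and only the regression case is shown. For classification tasks, the `threshold` argument would presumably be used to turn a predicted score into a 0/1 label.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def impute_missing_targets(features, targets, n_estimators=10, seed=0):
    """Fill None targets task by task with a model fit on the known rows.

    Hypothetical helper for illustration; not part of Chemprop.
    """
    imputed = [list(row) for row in targets]  # copy so the input is untouched
    num_tasks = len(targets[0])
    for task in range(num_tasks):
        known = [i for i, row in enumerate(targets) if row[task] is not None]
        missing = [i for i, row in enumerate(targets) if row[task] is None]
        if not missing:
            continue  # nothing to impute for this task
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(features[known], [targets[i][task] for i in known])
        for i, value in zip(missing, model.predict(features[missing])):
            imputed[i][task] = float(value)
    return imputed


# Synthetic stand-in data: 8 "molecules", 16-bit fingerprints, 2 tasks.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(8, 16)).astype(float)
targets = [[float(i), float(-i)] for i in range(8)]
targets[2][0] = None  # unknown value in task 0
targets[5][1] = None  # unknown value in task 1

imputed = impute_missing_targets(features, targets)
```

Every row of `imputed` now has a numeric value for each task, while the original `targets` list still contains its `None` entries.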
- chemprop.sklearn_train.multi_task_sklearn(model: RandomForestRegressor | RandomForestClassifier | SVR | SVC, train_data: MoleculeDataset, test_data: MoleculeDataset, metrics: List[str], args: SklearnTrainArgs, logger: Logger | None = None) -> Dict[str, List[float]]
Trains a multi-task scikit-learn model, meaning one model is trained simultaneously on all tasks.
This is only possible if none of the tasks have None (unknown) values.
- Parameters:
model – The scikit-learn model to train.
train_data – The training data.
test_data – The test data.
metrics – A list of names of metric functions.
args – A SklearnTrainArgs object containing arguments for training the scikit-learn model.
logger – A logger to record output.
- Returns:
A dictionary mapping each metric in metrics to a list of values for each task.
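As an illustration of what multi-task training means in scikit-learn terms, the sketch below fits one random forest on a two-column target matrix and then scores each task separately. It does not call Chemprop: the data are synthetic stand-ins for fingerprints, and the metric names `"mse"` and `"mae"` are chosen for the example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in data: binary "fingerprints" and two fully known tasks.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(40, 32)).astype(float)
X_test = rng.integers(0, 2, size=(10, 32)).astype(float)
y_train = np.column_stack([X_train[:, :4].sum(axis=1), X_train[:, 4:8].sum(axis=1)])
y_test = np.column_stack([X_test[:, :4].sum(axis=1), X_test[:, 4:8].sum(axis=1)])

# One model is fit on all tasks at once; this requires a target matrix
# without missing entries.
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)  # shape: (num_test_rows, num_tasks)

# Map each metric name to one score per task.
metric_funcs = {"mse": mean_squared_error, "mae": mean_absolute_error}
scores = {
    name: [float(func(y_test[:, task], preds[:, task])) for task in range(y_test.shape[1])]
    for name, func in metric_funcs.items()
}
```

The resulting `scores` dictionary has the same shape as the documented return value: one list entry per task under each metric name.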
- chemprop.sklearn_train.predict(model: RandomForestRegressor | RandomForestClassifier | SVR | SVC, model_type: str, dataset_type: str, features: List[ndarray]) -> List[List[float]]
Predicts using a scikit-learn model.
- Parameters:
model – The trained scikit-learn model to make predictions with.
model_type – The type of model.
dataset_type – The type of dataset.
features – The data features used as input for the model.
- Returns:
A list of lists of floats containing the predicted values.
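One plausible reason for the model_type and dataset_type arguments is that the four supported estimators expose their outputs through different methods. The sketch below shows such a dispatch for the single-task case. It is an assumption, not Chemprop's code: `predict_as_lists` is a hypothetical helper, and the string values `"random_forest"`, `"svm"`, `"regression"`, and `"classification"` are assumed for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, SVR


def predict_as_lists(model, model_type, dataset_type, features):
    """Return one list of floats per input row (single-task sketch)."""
    features = np.asarray(features)
    if dataset_type == "regression":
        preds = np.asarray(model.predict(features))
        if preds.ndim == 1:
            preds = preds[:, None]  # one column per task
        return preds.tolist()
    if dataset_type == "classification":
        if model_type == "random_forest":
            # Column 1 of predict_proba is the probability of the positive class.
            return [[float(p)] for p in model.predict_proba(features)[:, 1]]
        if model_type == "svm":
            # SVC yields a signed distance to the decision boundary.
            return [[float(s)] for s in model.decision_function(features)]
    raise ValueError(f"Unsupported combination: {model_type!r}, {dataset_type!r}")


# Synthetic stand-in data with both classes present.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 16)).astype(float)
y_cls = np.arange(20) % 2
y_reg = X.sum(axis=1)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_cls)
cls_preds = predict_as_lists(rf, "random_forest", "classification", X[:5])
svm_preds = predict_as_lists(SVC().fit(X, y_cls), "svm", "classification", X[:5])
reg_preds = predict_as_lists(SVR().fit(X, y_reg), "svm", "regression", X[:5])
```

In all three cases the result is a list with one inner list per input row, matching the documented `List[List[float]]` return type.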
- chemprop.sklearn_train.run_sklearn(args: SklearnTrainArgs, data: MoleculeDataset, logger: Logger | None = None) -> Dict[str, List[float]]
Loads data, trains a scikit-learn model, and returns test scores for the model checkpoint with the highest validation score.
- Parameters:
args – A SklearnTrainArgs object containing arguments for loading data and training the scikit-learn model.
data – A MoleculeDataset containing the data.
logger – A logger to record output.
- Returns:
A dictionary mapping each metric in metrics to a list of values for each task.
- chemprop.sklearn_train.single_task_sklearn(model: RandomForestRegressor | RandomForestClassifier | SVR | SVC, train_data: MoleculeDataset, test_data: MoleculeDataset, metrics: List[str], args: SklearnTrainArgs, logger: Logger | None = None) -> List[float]
Trains a single-task scikit-learn model, meaning a separate model is trained for each task.
This is necessary if some tasks have None (unknown) values.
- Parameters:
model – The scikit-learn model to train.
train_data – The training data.
test_data – The test data.
metrics – A list of names of metric functions.
args – A SklearnTrainArgs object containing arguments for training the scikit-learn model.
logger – A logger to record output.
- Returns:
A dictionary mapping each metric in metrics to a list of values for each task.
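The single-task approach can be sketched as a loop that, for each task, keeps only the rows whose target is known, fits a fresh model on them, and scores it on the known test rows. This is an illustration under assumptions rather than Chemprop's code: `single_task_scores` is a hypothetical helper, the data are synthetic, and mean absolute error stands in for whatever metrics are requested.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def single_task_scores(X_train, y_train, X_test, y_test, seed=0):
    """Fit one model per task, skipping rows whose target is None."""
    num_tasks = len(y_train[0])
    scores = {"mae": []}
    for task in range(num_tasks):
        train_rows = [i for i, row in enumerate(y_train) if row[task] is not None]
        test_rows = [i for i, row in enumerate(y_test) if row[task] is not None]
        model = RandomForestRegressor(n_estimators=20, random_state=seed)
        model.fit(X_train[train_rows], [y_train[i][task] for i in train_rows])
        preds = model.predict(X_test[test_rows])
        truth = [y_test[i][task] for i in test_rows]
        scores["mae"].append(float(mean_absolute_error(truth, preds)))
    return scores


# Synthetic stand-in data: two tasks, with a few unknown training targets.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(30, 16)).astype(float)
X_test = rng.integers(0, 2, size=(8, 16)).astype(float)
y_train = [[float(x[:4].sum()), float(x[4:8].sum())] for x in X_train]
y_test = [[float(x[:4].sum()), float(x[4:8].sum())] for x in X_test]
y_train[0][0] = None
y_train[3][1] = None

scores = single_task_scores(X_train, y_train, X_test, y_test)
```

Because each task selects its own rows, a molecule with an unknown value for one task still contributes to the models for the other tasks.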
Scikit-Learn Predict
chemprop.sklearn_predict.py contains functions for predicting with scikit-learn
models.
- chemprop.sklearn_predict.predict_sklearn(args: SklearnPredictArgs) -> None
Loads data and a trained scikit-learn model and uses the model to make predictions on the data.
- Parameters:
args – A SklearnPredictArgs object containing arguments for loading data, loading a trained scikit-learn model, and making predictions with the model.
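The load-then-predict flow can be sketched with plain scikit-learn. Everything here is an assumption made for illustration: the text above does not say how the model is stored or where predictions go, so the sketch uses pickle for the model and a CSV file for the output (a return type of None suggests the results are written somewhere rather than returned). `predict_from_file` and the file names are hypothetical.

```python
import csv
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVR

# Train and save a small model on synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 16)).astype(float)
y = X.sum(axis=1)

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model.pkl")  # hypothetical file name
preds_path = os.path.join(workdir, "preds.csv")  # hypothetical file name

with open(model_path, "wb") as f:
    pickle.dump(SVR().fit(X, y), f)


def predict_from_file(model_path, features, preds_path):
    """Load a pickled model, predict, and write one CSV row per input."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    preds = model.predict(features)
    with open(preds_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "prediction"])
        for i, pred in enumerate(preds):
            writer.writerow([i, float(pred)])


predict_from_file(model_path, X[:5], preds_path)
```

Only unpickle model files from a trusted source, since loading a pickle can execute arbitrary code.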