Features

chemprop.features contains functions for featurizing molecules. This includes both atom/bond features used in message passing and additional molecule-level features appended after message passing.

Featurization

Classes and functions from chemprop.features.featurization.py. Featurization specifically includes computation of the atom and bond features used in message passing.

class chemprop.features.featurization.BatchMolGraph(mol_graphs: List[MolGraph])[source]

A BatchMolGraph represents the graph structure and featurization of a batch of molecules.

A BatchMolGraph contains the attributes of a MolGraph plus:

atom_fdim: The dimensionality of the atom feature vector.
bond_fdim: The dimensionality of the bond feature vector (technically the combined atom/bond features).
a_scope: A list of tuples indicating the start and end atom indices for each molecule.
b_scope: A list of tuples indicating the start and end bond indices for each molecule.
max_num_bonds: The maximum number of bonds neighboring an atom in this batch.
b2b: (Optional) A mapping from a bond index to incoming bond indices.
a2a: (Optional) A mapping from an atom index to neighboring atom indices.
b2br: (Optional) A mapping from f_bonds to the real bonds in the molecule, as recorded in targets.

Parameters: mol_graphs – A list of MolGraphs from which to construct the BatchMolGraph.

get_a2a() → Tensor[source]

Computes (if necessary) and returns a mapping from each atom index to all neighboring atom indices.

Returns: A PyTorch tensor containing the mapping from each atom index to all the neighboring atom indices.
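The relationship between the atom-to-atom mapping and the bond mappings can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: it uses nested lists instead of padded tensors, and the helper name neighbors_from_bond_maps is hypothetical. It relies on the definitions of a2b (incoming directed bonds per atom) and b2a (origin atom per directed bond) given for MolGraph.

```python
from typing import List


def neighbors_from_bond_maps(a2b: List[List[int]], b2a: List[int]) -> List[List[int]]:
    """For each atom, collect the origin atom of every incoming directed bond.

    Hypothetical helper: the origin atoms of an atom's incoming bonds are
    exactly that atom's neighbors.
    """
    return [[b2a[bond] for bond in incoming] for incoming in a2b]


# A three-atom chain 0-1-2 with two directed bonds per chemical bond:
# bond 0: 0->1, bond 1: 1->0, bond 2: 1->2, bond 3: 2->1
a2b = [[1], [0, 3], [2]]  # incoming directed bond indices for each atom
b2a = [0, 1, 1, 2]        # origin atom of each directed bond

a2a = neighbors_from_bond_maps(a2b, b2a)
print(a2a)
```

In this toy chain the middle atom ends up with two neighbors and each end atom with one.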

get_b2b() → Tensor[source]

Computes (if necessary) and returns a mapping from each bond index to all the incoming bond indices.

Returns: A PyTorch tensor containing the mapping from each bond index to all the incoming bond indices.

get_b2br() → Tensor[source]

Computes (if necessary) and returns a mapping from f_bonds to the real bonds in the molecule, as recorded in targets.

Returns: A PyTorch tensor containing the mapping from f_bonds to the real bonds in the molecule, as recorded in targets.

get_components(atom_messages: bool = False) → Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, List[Tuple[int, int]], List[Tuple[int, int]]][source]

Returns the components of the BatchMolGraph.

The returned components are, in order:

f_atoms
f_bonds
a2b
b2a
b2revb
a_scope
b_scope

Parameters: atom_messages – Whether to use atom messages instead of bond messages. This changes the bond feature vector to contain only bond features rather than both atom and bond features.
Returns: A tuple containing PyTorch tensors with the atom features, bond features, graph structure, and scope of the atoms and bonds (i.e., the indices of the molecules they belong to).

class chemprop.features.featurization.Featurization_parameters[source]: A class holding molecule featurization parameters as attributes.

class chemprop.features.featurization.MolGraph(mol: str | Mol | Tuple[Mol, Mol], atom_features_extra: ndarray | None = None, bond_features_extra: ndarray | None = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)[source]

A MolGraph represents the graph structure and featurization of a single molecule.

A MolGraph computes the following attributes:

n_atoms: The number of atoms in the molecule.
n_bonds: The number of bonds in the molecule.
f_atoms: A mapping from an atom index to a list of atom features.
f_bonds: A mapping from a bond index to a list of bond features.
a2b: A mapping from an atom index to a list of incoming bond indices.
b2a: A mapping from a bond index to the index of the atom the bond originates from.
b2revb: A mapping from a bond index to the index of the reverse bond.
overwrite_default_atom_features: A boolean to overwrite default atom descriptors.
overwrite_default_bond_features: A boolean to overwrite default bond descriptors.
is_mol: A boolean indicating whether the input is a molecule.
is_reaction: A boolean indicating whether the input is a reaction.
is_explicit_h: A boolean indicating whether to retain explicit Hs (for reaction mode).
is_adding_hs: A boolean indicating whether to add explicit Hs (not for reaction mode).
reaction_mode: The reaction mode used to construct atom and bond feature vectors.
b2br: A mapping from f_bonds to the real bonds in the molecule, as recorded in targets.
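The structural attributes a2b, b2a, and b2revb can be illustrated with a small pure-Python sketch. It assumes, as the attribute descriptions suggest, that every chemical bond is stored as two directed bonds. The function name build_directed_graph and the toy edge list are hypothetical, and the real class also computes feature vectors, which are omitted here.

```python
from typing import List, Tuple


def build_directed_graph(
    n_atoms: int, bonds: List[Tuple[int, int]]
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Build a2b, b2a, and b2revb from an undirected bond list (illustrative only)."""
    a2b: List[List[int]] = [[] for _ in range(n_atoms)]  # atom -> incoming directed bonds
    b2a: List[int] = []     # directed bond -> atom it originates from
    b2revb: List[int] = []  # directed bond -> index of its reverse bond

    for a1, a2 in bonds:
        b1 = len(b2a)  # directed bond a1 -> a2
        b2 = b1 + 1    # directed bond a2 -> a1
        b2a.extend([a1, a2])
        a2b[a2].append(b1)  # b1 arrives at a2
        a2b[a1].append(b2)  # b2 arrives at a1
        b2revb.extend([b2, b1])

    return a2b, b2a, b2revb


# Three-atom chain 0-1-2 (for example, the heavy atoms of propane).
a2b, b2a, b2revb = build_directed_graph(3, [(0, 1), (1, 2)])
print(a2b, b2a, b2revb)
```

Following b2revb twice returns the starting bond, which is a quick consistency check on any graph built this way.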

Parameters:

mol – A SMILES or an RDKit molecule.
atom_features_extra – A 2D numpy array containing additional atom features to featurize the molecule.
bond_features_extra – A 2D numpy array containing additional bond features to featurize the molecule.
overwrite_default_atom_features – Whether to replace the default atom features with atom_features_extra instead of concatenating them.
overwrite_default_bond_features – Whether to replace the default bond features with bond_features_extra instead of concatenating them.

chemprop.features.featurization.atom_features(atom: Atom, functional_groups: List[int] | None = None) → List[bool | int | float][source]

Builds a feature vector for an atom.

Parameters:

atom – An RDKit atom.
functional_groups – A k-hot vector indicating the functional groups the atom belongs to.

Returns:

A list containing the atom features.

chemprop.features.featurization.atom_features_zeros(atom: Atom) → List[bool | int | float][source]

Builds a feature vector for an atom containing only the atomic number information.

Parameters: atom – An RDKit atom.
Returns: A list containing the atom features.

chemprop.features.featurization.bond_features(bond: Bond) → List[bool | int | float][source]

Builds a feature vector for a bond.

Parameters: bond – An RDKit bond.
Returns: A list containing the bond features.

chemprop.features.featurization.get_atom_fdim(overwrite_default_atom: bool = False, is_reaction: bool = False) → int[source]

Gets the dimensionality of the atom feature vector.

Parameters:

overwrite_default_atom – Whether to overwrite the default atom descriptors.
is_reaction – Whether to add EXTRA_ATOM_FDIM for reaction input when REACTION_MODE is not None.

Returns:

The dimensionality of the atom feature vector.

chemprop.features.featurization.get_bond_fdim(atom_messages: bool = False, overwrite_default_bond: bool = False, overwrite_default_atom: bool = False, is_reaction: bool = False) → int[source]

Gets the dimensionality of the bond feature vector.

Parameters:

atom_messages – Whether atom messages are being used. If atom messages are used, then the bond feature vector only contains bond features. Otherwise it contains both atom and bond features.
overwrite_default_bond – Whether to overwrite the default bond descriptors.
overwrite_default_atom – Whether to overwrite the default atom descriptors.
is_reaction – Whether to add EXTRA_BOND_FDIM for reaction input when REACTION_MODE is not None.

Returns:

The dimensionality of the bond feature vector.
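The effect of atom_messages on the bond feature dimensionality can be sketched as simple arithmetic. This is a simplified illustration, not the library's code: the function name combined_bond_fdim and the two sizes below are hypothetical, and the overwrite and reaction options are left out.

```python
def combined_bond_fdim(bond_only_fdim: int, atom_fdim: int, atom_messages: bool = False) -> int:
    """Return the bond feature vector size under the described convention.

    With atom messages, the vector holds bond features only; otherwise the
    features of the originating atom are concatenated with the bond features.
    """
    if atom_messages:
        return bond_only_fdim
    return bond_only_fdim + atom_fdim


# Hypothetical sizes chosen only for illustration.
HYPOTHETICAL_ATOM_FDIM = 100
HYPOTHETICAL_BOND_FDIM = 10

with_bond_messages = combined_bond_fdim(HYPOTHETICAL_BOND_FDIM, HYPOTHETICAL_ATOM_FDIM)
with_atom_messages = combined_bond_fdim(
    HYPOTHETICAL_BOND_FDIM, HYPOTHETICAL_ATOM_FDIM, atom_messages=True
)
print(with_bond_messages, with_atom_messages)
```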

chemprop.features.featurization.is_adding_hs(is_mol: bool = True) → bool[source]: Returns whether to add explicit Hs to the mol (not for reactions)

chemprop.features.featurization.is_explicit_h(is_mol: bool = True) → bool[source]: Returns whether to retain explicit Hs (for reactions only)

chemprop.features.featurization.is_keeping_atom_map(is_mol: bool = True) → bool[source]: Returns whether to keep the original atom mapping (not for reactions)

chemprop.features.featurization.is_mol(mol: str | Mol | Tuple[Mol, Mol]) → bool[source]

Checks whether an input is a molecule or a reaction.

Parameters: mol – A string, an RDKit molecule, or a tuple of molecules.
Returns: Whether the supplied input corresponds to a single molecule.

chemprop.features.featurization.is_reaction(is_mol: bool = True) → bool[source]: Returns whether to use reactions as input

chemprop.features.featurization.map_reac_to_prod(mol_reac: Mol, mol_prod: Mol)[source]

Builds a dictionary mapping atom indices in the reactants to the corresponding atom indices in the products.

Parameters:

mol_reac – An RDKit molecule of the reactants.
mol_prod – An RDKit molecule of the products.

Returns:

A dictionary of corresponding reactant and product atom indices.

chemprop.features.featurization.mol2graph(mols: List[str] | List[Mol] | List[Tuple[Mol, Mol]], atom_features_batch: List[array] = (None,), bond_features_batch: List[array] = (None,), overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False) → BatchMolGraph[source]

Converts a list of SMILES or RDKit molecules to a BatchMolGraph containing the batch of molecular graphs.

Parameters:

mols – A list of SMILES or a list of RDKit molecules.
atom_features_batch – A list of 2D numpy arrays containing additional atom features to featurize the molecules.
bond_features_batch – A list of 2D numpy arrays containing additional bond features to featurize the molecules.
overwrite_default_atom_features – Whether to replace the default atom features with those in atom_features_batch instead of concatenating them.
overwrite_default_bond_features – Whether to replace the default bond features with those in bond_features_batch instead of concatenating them.

Returns:

A BatchMolGraph containing the combined molecular graph for the molecules.

chemprop.features.featurization.onek_encoding_unk(value: int, choices: List[int]) → List[int][source]

Creates a one-hot encoding with an extra category for uncommon values.

Parameters:

value – The value for which the encoding should be one.
choices – A list of possible values.

Returns:

A one-hot encoding of the value in a list of length len(choices) + 1. If value is not in choices, then the final element in the encoding is 1.
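The encoding described above can be sketched in a few lines of plain Python. This is a minimal re-implementation written from the description alone, under a hypothetical name (one_hot_with_unknown), and is not claimed to match the library's source line for line.

```python
from typing import List


def one_hot_with_unknown(value: int, choices: List[int]) -> List[int]:
    """One-hot encode value over choices, with a final slot for unlisted values."""
    encoding = [0] * (len(choices) + 1)
    # Index -1 addresses the extra final element reserved for unknown values.
    index = choices.index(value) if value in choices else -1
    encoding[index] = 1
    return encoding


known = one_hot_with_unknown(7, [6, 7, 8])     # 7 is the second choice
unknown = one_hot_with_unknown(99, [6, 7, 8])  # 99 is not in choices
print(known, unknown)
```

The extra slot keeps the vector length fixed regardless of the input, so rare values do not require enlarging the list of choices.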

chemprop.features.featurization.reaction_mode() → str[source]: Returns the reaction mode

chemprop.features.featurization.reset_featurization_parameters(logger: Logger | None = None) → None[source]: Resets the featurization parameter values to their defaults by replacing the parameters instance.

chemprop.features.featurization.set_adding_hs(adding_hs: bool) → None[source]

Sets whether RDKit molecules will be constructed with Hs added to them.

Parameters: adding_hs – Whether to add Hs to the molecule.

chemprop.features.featurization.set_explicit_h(explicit_h: bool) → None[source]

Sets whether RDKit molecules will be constructed with explicit Hs.

Parameters: explicit_h – Whether to keep explicit Hs from the input.

chemprop.features.featurization.set_extra_atom_fdim(extra)[source]: Change the dimensionality of the atom feature vector.

chemprop.features.featurization.set_extra_bond_fdim(extra)[source]: Change the dimensionality of the bond feature vector.

chemprop.features.featurization.set_keeping_atom_map(keeping_atom_map: bool) → None[source]

Sets whether RDKit molecules keep the original atom mapping.

Parameters: keeping_atom_map – Whether to keep the original atom mapping.

chemprop.features.featurization.set_reaction(reaction: bool, mode: str) → None[source]

Sets whether to use a reaction or molecule as input and adapts feature dimensions.

Parameters:

reaction – Whether to accept reactions as input.
mode – Reaction mode to construct atom and bond feature vectors.

Features Generators

Classes and functions from chemprop.features.features_generators.py. Features generators are used for computing additional molecule-level features that are appended after message passing.

chemprop.features.features_generators.get_available_features_generators() → List[str][source]: Returns a list of names of available features generators.

chemprop.features.features_generators.get_features_generator(features_generator_name: str) → Callable[[str | Mol], ndarray][source]

Gets a registered features generator by name.

Parameters: features_generator_name – The name of the features generator.
Returns: The desired features generator.

chemprop.features.features_generators.morgan_binary_features_generator(mol: str | Mol, radius: int = 2, num_bits: int = 2048) → ndarray[source]

Generates a binary Morgan fingerprint for a molecule.

Parameters:

mol – A molecule (i.e., either a SMILES or an RDKit molecule).
radius – Morgan fingerprint radius.
num_bits – Number of bits in Morgan fingerprint.

Returns:

A 1D numpy array containing the binary Morgan fingerprint.

chemprop.features.features_generators.morgan_counts_features_generator(mol: str | Mol, radius: int = 2, num_bits: int = 2048) → ndarray[source]

Generates a counts-based Morgan fingerprint for a molecule.

Parameters:

mol – A molecule (i.e., either a SMILES or an RDKit molecule).
radius – Morgan fingerprint radius.
num_bits – Number of bits in Morgan fingerprint.

Returns:

A 1D numpy array containing the counts-based Morgan fingerprint.
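The difference between the binary and counts-based variants can be shown with a toy sketch. It does not call RDKit or chemprop: the substructure identifiers are made-up integers, the function name fold_fingerprint is hypothetical, and folding identifiers into num_bits positions by modulo is an assumption made for illustration.

```python
from typing import Iterable, List


def fold_fingerprint(substructure_ids: Iterable[int], num_bits: int, use_counts: bool) -> List[int]:
    """Fold integer identifiers into a fixed-length vector (illustrative only)."""
    fingerprint = [0] * num_bits
    for identifier in substructure_ids:
        position = identifier % num_bits
        if use_counts:
            fingerprint[position] += 1  # counts: record every occurrence
        else:
            fingerprint[position] = 1   # binary: record presence only
    return fingerprint


# Hypothetical identifiers; 3 occurs twice and 11 collides with it when num_bits is 8.
ids = [3, 11, 3, 20]
binary = fold_fingerprint(ids, num_bits=8, use_counts=False)
counts = fold_fingerprint(ids, num_bits=8, use_counts=True)
print(binary)
print(counts)
```

Both vectors have the same length and the same nonzero positions; only the counts variant keeps how often each position was hit.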

chemprop.features.features_generators.rdkit_2d_features_generator(mol: str | Mol) → ndarray[source]

Generates RDKit 2D features for a molecule.

Parameters: mol – A molecule (i.e., either a SMILES or an RDKit molecule).
Returns: A 1D numpy array containing the RDKit 2D features.

chemprop.features.features_generators.rdkit_2d_normalized_features_generator(mol: str | Mol) → ndarray[source]

Generates RDKit 2D normalized features for a molecule.

Parameters: mol – A molecule (i.e., either a SMILES or an RDKit molecule).
Returns: A 1D numpy array containing the RDKit 2D normalized features.

chemprop.features.features_generators.register_features_generator(features_generator_name: str) → Callable[[Callable[[str | Mol], ndarray]], Callable[[str | Mol], ndarray]][source]

Creates a decorator which registers a features generator in a global dictionary to enable access by name.

Parameters: features_generator_name – The name to use to access the features generator.
Returns: A decorator which will add a features generator to the registry using the specified name.

Utils

Classes and functions from chemprop.features.utils.py.

chemprop.features.utils.load_features(path: str) → ndarray[source]

Loads features saved in a variety of formats.

Supported formats:

.npz compressed (assumes features are saved with name “features”)
.npy
.csv / .txt (assumes comma-separated features with a header and with one line per molecule)
.pkl / .pckl / .pickle containing a sparse numpy array

Note

All formats assume that the SMILES loaded elsewhere in the code are in the same order as the features loaded here.

Parameters: path – Path to a file containing features.
Returns: A 2D numpy array of size (num_molecules, features_size) containing the features.
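The .csv layout assumed above (a header row, then one comma-separated line of features per molecule) can be illustrated with a standard-library sketch. This is not the library's loader: the function name read_feature_rows and the column names are hypothetical, and it returns nested lists rather than a numpy array.

```python
import csv
import io
from typing import List, TextIO


def read_feature_rows(handle: TextIO) -> List[List[float]]:
    """Read comma-separated features, skipping the header row."""
    reader = csv.reader(handle)
    next(reader)  # the first line is a header, not a molecule
    return [[float(value) for value in row] for row in reader if row]


# Two molecules with two features each; an in-memory file stands in for a path.
example = io.StringIO("feat_a,feat_b\n0.5,1.0\n2.0,3.5\n")
rows = read_feature_rows(example)
print(rows)
```

Because the rows carry no molecule identifier, their order has to match the order of the SMILES loaded elsewhere, as the note above states.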

chemprop.features.utils.load_valid_atom_or_bond_features(path: str, smiles: List[str]) → List[ndarray][source]

Loads features saved in a variety of formats.

Supported formats:

.npz where descriptors are saved as a 2D array for each molecule, in the same order as the molecules in data.csv
.pkl / .pckl / .pickle containing a pandas dataframe with smiles as index and numpy arrays of descriptors as columns
.sdf containing all mol blocks with descriptors as entries

Parameters: path – Path to a file containing atomwise features.
Returns: A list of 2D arrays.

chemprop.features.utils.save_features(path: str, features: List[ndarray]) → None[source]

Saves features to a compressed .npz file with array name “features”.

Parameters:

path – Path to a .npz file where the features will be saved.
features – A list of 1D numpy arrays containing the features for molecules.