Features
chemprop.features contains functions for featurizing molecules. This includes both atom/bond features used in message passing and additional molecule-level features appended after message passing.
Featurization
Classes and functions from chemprop.features.featurization.py. Featurization specifically includes computation of the atom and bond features used in message passing.
- class chemprop.features.featurization.BatchMolGraph(mol_graphs: List[MolGraph])[source]
A BatchMolGraph represents the graph structure and featurization of a batch of molecules.
A BatchMolGraph contains the attributes of a MolGraph plus:
atom_fdim: The dimensionality of the atom feature vector.
bond_fdim: The dimensionality of the bond feature vector (technically the combined atom/bond features).
a_scope: A list of tuples indicating the start and end atom indices for each molecule.
b_scope: A list of tuples indicating the start and end bond indices for each molecule.
max_num_bonds: The maximum number of bonds neighboring an atom in this batch.
b2b: (Optional) A mapping from a bond index to incoming bond indices.
a2a: (Optional) A mapping from an atom index to neighboring atom indices.
b2br: (Optional) A mapping from f_bonds to real bonds in molecule recorded in targets.
- Parameters:
mol_graphs – A list of MolGraphs from which to construct the BatchMolGraph.
- get_a2a() Tensor [source]
Computes (if necessary) and returns a mapping from each atom index to all neighboring atom indices.
- Returns:
A PyTorch tensor containing the mapping from each atom index to all the neighboring atom indices.
- get_b2b() Tensor [source]
Computes (if necessary) and returns a mapping from each bond index to all the incoming bond indices.
- Returns:
A PyTorch tensor containing the mapping from each bond index to all the incoming bond indices.
- get_b2br() Tensor [source]
Computes (if necessary) and returns a mapping from f_bonds to real bonds in molecule recorded in targets.
- Returns:
A PyTorch tensor containing the mapping from f_bonds to real bonds in molecule recorded in targets.
- get_components(atom_messages: bool = False) Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, List[Tuple[int, int]], List[Tuple[int, int]]] [source]
Returns the components of the BatchMolGraph.
The returned components are, in order:
f_atoms
f_bonds
a2b
b2a
b2revb
a_scope
b_scope
- Parameters:
atom_messages – Whether to use atom messages instead of bond messages. This changes the bond feature vector to contain only bond features rather than both atom and bond features.
- Returns:
A tuple containing PyTorch tensors with the atom features, bond features, graph structure, and scope of the atoms and bonds (i.e., the indices of the molecules they belong to).
- class chemprop.features.featurization.Featurization_parameters[source]
A class holding molecule featurization parameters as attributes.
- class chemprop.features.featurization.MolGraph(mol: str | Mol | Tuple[Mol, Mol], atom_features_extra: ndarray | None = None, bond_features_extra: ndarray | None = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)[source]
A MolGraph represents the graph structure and featurization of a single molecule.
A MolGraph computes the following attributes:
n_atoms: The number of atoms in the molecule.
n_bonds: The number of bonds in the molecule.
f_atoms: A mapping from an atom index to a list of atom features.
f_bonds: A mapping from a bond index to a list of bond features.
a2b: A mapping from an atom index to a list of incoming bond indices.
b2a: A mapping from a bond index to the index of the atom the bond originates from.
b2revb: A mapping from a bond index to the index of the reverse bond.
overwrite_default_atom_features: A boolean to overwrite default atom descriptors.
overwrite_default_bond_features: A boolean to overwrite default bond descriptors.
is_mol: A boolean whether the input is a molecule.
is_reaction: A boolean whether the molecule is a reaction.
is_explicit_h: A boolean whether to retain explicit Hs (for reaction mode).
is_adding_hs: A boolean whether to add explicit Hs (not for reaction mode).
reaction_mode: Reaction mode to construct atom and bond feature vectors.
b2br: A mapping from f_bonds to real bonds in molecule recorded in targets.
- Parameters:
mol – A SMILES or an RDKit molecule.
atom_features_extra – A 2D numpy array containing additional atom features to featurize the molecule.
bond_features_extra – A 2D numpy array containing additional bond features to featurize the molecule.
overwrite_default_atom_features – Boolean to overwrite default atom features by atom_features instead of concatenating.
overwrite_default_bond_features – Boolean to overwrite default bond features by bond_features instead of concatenating.
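The a2b, b2a, and b2revb mappings described above can be illustrated without RDKit. The following is a minimal sketch, not chemprop's implementation: it assumes each undirected bond is stored as two directed bonds with consecutive indices, and the helper name build_directed_bond_maps is hypothetical.

```python
from typing import List, Tuple


def build_directed_bond_maps(
    n_atoms: int, bonds: List[Tuple[int, int]]
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Build a2b, b2a, and b2revb from a list of undirected bonds.

    Assumption for this sketch: each undirected bond (a1, a2) becomes two
    directed bonds, a1 -> a2 followed by a2 -> a1.
    """
    a2b: List[List[int]] = [[] for _ in range(n_atoms)]
    b2a: List[int] = []
    b2revb: List[int] = []

    for a1, a2 in bonds:
        forward = len(b2a)       # index of the directed bond a1 -> a2
        backward = forward + 1   # index of the directed bond a2 -> a1

        a2b[a2].append(forward)   # a1 -> a2 is incoming to a2
        b2a.append(a1)            # and originates from a1
        a2b[a1].append(backward)  # a2 -> a1 is incoming to a1
        b2a.append(a2)            # and originates from a2

        b2revb.append(backward)   # the reverse of forward
        b2revb.append(forward)    # the reverse of backward

    return a2b, b2a, b2revb


# A three-atom chain 0-1-2.
a2b, b2a, b2revb = build_directed_bond_maps(3, [(0, 1), (1, 2)])
print(a2b)     # [[1], [0, 3], [2]]
print(b2a)     # [0, 1, 1, 2]
print(b2revb)  # [1, 0, 3, 2]
```

In this layout, applying b2revb twice returns the original bond index, which lets a message-passing step find the reverse of any directed bond with a single lookup.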
- chemprop.features.featurization.atom_features(atom: Atom, functional_groups: List[int] | None = None) List[bool | int | float] [source]
Builds a feature vector for an atom.
- Parameters:
atom – An RDKit atom.
functional_groups – A k-hot vector indicating the functional groups the atom belongs to.
- Returns:
A list containing the atom features.
- chemprop.features.featurization.atom_features_zeros(atom: Atom) List[bool | int | float] [source]
Builds a feature vector for an atom containing only the atom number information.
- Parameters:
atom – An RDKit atom.
- Returns:
A list containing the atom features.
- chemprop.features.featurization.bond_features(bond: Bond) List[bool | int | float] [source]
Builds a feature vector for a bond.
- Parameters:
bond – An RDKit bond.
- Returns:
A list containing the bond features.
- chemprop.features.featurization.get_atom_fdim(overwrite_default_atom: bool = False, is_reaction: bool = False) int [source]
Gets the dimensionality of the atom feature vector.
- Parameters:
overwrite_default_atom – Whether to overwrite the default atom descriptors.
is_reaction – Whether to add EXTRA_ATOM_FDIM for reaction input when REACTION_MODE is not None.
- Returns:
The dimensionality of the atom feature vector.
- chemprop.features.featurization.get_bond_fdim(atom_messages: bool = False, overwrite_default_bond: bool = False, overwrite_default_atom: bool = False, is_reaction: bool = False) int [source]
Gets the dimensionality of the bond feature vector.
- Parameters:
atom_messages – Whether atom messages are being used. If atom messages are used, then the bond feature vector only contains bond features. Otherwise it contains both atom and bond features.
overwrite_default_bond – Whether to overwrite the default bond descriptors.
overwrite_default_atom – Whether to overwrite the default atom descriptors.
is_reaction – Whether to add EXTRA_BOND_FDIM for reaction input when REACTION_MODE is not None.
- Returns:
The dimensionality of the bond feature vector.
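The effect of atom_messages on the bond feature dimensionality can be sketched as simple arithmetic. This is an illustration under stated assumptions, not the library code: the function name combined_bond_fdim and the two sizes below are hypothetical, and the reaction and overwrite options are ignored.

```python
def combined_bond_fdim(
    raw_bond_fdim: int, atom_fdim: int, atom_messages: bool = False
) -> int:
    """Return the bond feature vector length for the chosen message type."""
    if atom_messages:
        # Atom messages: the bond vector holds bond features only.
        return raw_bond_fdim
    # Bond messages: atom features and bond features are combined.
    return raw_bond_fdim + atom_fdim


# Hypothetical sizes chosen only for illustration.
RAW_BOND_FDIM = 14
ATOM_FDIM = 133

print(combined_bond_fdim(RAW_BOND_FDIM, ATOM_FDIM, atom_messages=False))  # 147
print(combined_bond_fdim(RAW_BOND_FDIM, ATOM_FDIM, atom_messages=True))   # 14
```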
- chemprop.features.featurization.is_adding_hs(is_mol: bool = True) bool [source]
Returns whether to add explicit Hs to the mol (not for reactions).
- chemprop.features.featurization.is_explicit_h(is_mol: bool = True) bool [source]
Returns whether to retain explicit Hs (for reactions only).
- chemprop.features.featurization.is_keeping_atom_map(is_mol: bool = True) bool [source]
Returns whether to keep the original atom mapping (not for reactions).
- chemprop.features.featurization.is_mol(mol: str | Mol | Tuple[Mol, Mol]) bool [source]
Checks whether an input is a molecule or a reaction.
- Parameters:
mol – A string, an RDKit molecule, or a tuple of RDKit molecules.
- Returns:
Whether the supplied input corresponds to a single molecule.
- chemprop.features.featurization.is_reaction(is_mol: bool = True) bool [source]
Returns whether to use reactions as input.
- chemprop.features.featurization.map_reac_to_prod(mol_reac: Mol, mol_prod: Mol)[source]
Builds a dictionary mapping atom indices in the reactants to atom indices in the products.
- Parameters:
mol_reac – An RDKit molecule of the reactants.
mol_prod – An RDKit molecule of the products.
- Returns:
A dictionary of corresponding reactant and product atom indices.
- chemprop.features.featurization.mol2graph(mols: List[str] | List[Mol] | List[Tuple[Mol, Mol]], atom_features_batch: List[array] = (None,), bond_features_batch: List[array] = (None,), overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False) BatchMolGraph [source]
Converts a list of SMILES or RDKit molecules to a BatchMolGraph containing the batch of molecular graphs.
- Parameters:
mols – A list of SMILES or a list of RDKit molecules.
atom_features_batch – A list of 2D numpy arrays containing additional atom features to featurize the molecules.
bond_features_batch – A list of 2D numpy arrays containing additional bond features to featurize the molecules.
overwrite_default_atom_features – Boolean to overwrite default atom descriptors by atom_descriptors instead of concatenating.
overwrite_default_bond_features – Boolean to overwrite default bond descriptors by bond_descriptors instead of concatenating.
- Returns:
A BatchMolGraph containing the combined molecular graph for the molecules.
- chemprop.features.featurization.onek_encoding_unk(value: int, choices: List[int]) List[int] [source]
Creates a one-hot encoding with an extra category for uncommon values.
- Parameters:
value – The value for which the encoding should be one.
choices – A list of possible values.
- Returns:
A one-hot encoding of the value in a list of length len(choices) + 1. If value is not in choices, then the final element in the encoding is 1.
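The encoding described above can be sketched in a few lines of plain Python. This is a standalone illustration of the behavior stated in the docstring, not the library source; the function name one_hot_with_unknown is hypothetical.

```python
from typing import List


def one_hot_with_unknown(value: int, choices: List[int]) -> List[int]:
    """One-hot encode value over choices, with a final slot for unknowns."""
    encoding = [0] * (len(choices) + 1)
    # Values missing from choices fall into the extra last position.
    index = choices.index(value) if value in choices else len(choices)
    encoding[index] = 1
    return encoding


print(one_hot_with_unknown(6, [1, 6, 7, 8]))   # [0, 1, 0, 0, 0]
print(one_hot_with_unknown(35, [1, 6, 7, 8]))  # [0, 0, 0, 0, 1]
```

The extra slot keeps the vector length fixed at len(choices) + 1, so rare values never change the feature dimensionality.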
- chemprop.features.featurization.reset_featurization_parameters(logger: Logger | None = None) None [source]
Resets the featurization parameter values to their defaults by replacing the parameters instance.
- chemprop.features.featurization.set_adding_hs(adding_hs: bool) None [source]
Sets whether Hs are added to RDKit molecules when they are constructed.
- Parameters:
adding_hs – Boolean whether to add Hs to the molecule.
- chemprop.features.featurization.set_explicit_h(explicit_h: bool) None [source]
Sets whether RDKit molecules will be constructed with explicit Hs.
- Parameters:
explicit_h – Boolean whether to keep explicit Hs from input.
- chemprop.features.featurization.set_extra_atom_fdim(extra)[source]
Change the dimensionality of the atom feature vector.
- chemprop.features.featurization.set_extra_bond_fdim(extra)[source]
Change the dimensionality of the bond feature vector.
- chemprop.features.featurization.set_keeping_atom_map(keeping_atom_map: bool) None [source]
Sets whether RDKit molecules keep the original atom mapping.
- Parameters:
keeping_atom_map – Boolean whether to keep the original atom mapping.
- chemprop.features.featurization.set_reaction(reaction: bool, mode: str) None [source]
Sets whether to use a reaction or molecule as input and adapts feature dimensions.
- Parameters:
reaction – Boolean whether to accept reactions as input.
mode – Reaction mode to construct atom and bond feature vectors.
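The setter and reset functions above all act on a shared parameters object. The pattern can be sketched as follows; this is an assumption-laden illustration, not chemprop's code. The class SketchParameters, its field names, the sketch_ functions, and the mode string "example_mode" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchParameters:
    """Hypothetical parameter holder; the field names are assumptions."""
    explicit_h: bool = False
    adding_hs: bool = False
    keeping_atom_map: bool = False
    reaction: bool = False
    reaction_mode: Optional[str] = None


# Module-level instance that the setters mutate.
_params = SketchParameters()


def sketch_set_explicit_h(explicit_h: bool) -> None:
    _params.explicit_h = explicit_h


def sketch_set_reaction(reaction: bool, mode: str) -> None:
    _params.reaction = reaction
    if reaction:
        _params.reaction_mode = mode


def sketch_reset() -> None:
    """Restore defaults by replacing the instance rather than each field."""
    global _params
    _params = SketchParameters()


def current_parameters() -> SketchParameters:
    return _params


sketch_set_explicit_h(True)
sketch_set_reaction(True, "example_mode")  # placeholder mode name
print(current_parameters().reaction_mode)  # example_mode
sketch_reset()
print(current_parameters() == SketchParameters())  # True
```

Replacing the whole instance on reset means a newly added field automatically returns to its default without the reset function needing to know about it.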
Features Generators
Classes and functions from chemprop.features.features_generators.py. Features generators are used for computing additional molecule-level features that are appended after message passing.
- chemprop.features.features_generators.get_available_features_generators() List[str] [source]
Returns a list of names of available features generators.
- chemprop.features.features_generators.get_features_generator(features_generator_name: str) Callable[[str | Mol], ndarray] [source]
Gets a registered features generator by name.
- Parameters:
features_generator_name – The name of the features generator.
- Returns:
The desired features generator.
- chemprop.features.features_generators.morgan_binary_features_generator(mol: str | Mol, radius: int = 2, num_bits: int = 2048) ndarray [source]
Generates a binary Morgan fingerprint for a molecule.
- Parameters:
mol – A molecule (i.e., either a SMILES or an RDKit molecule).
radius – Morgan fingerprint radius.
num_bits – Number of bits in Morgan fingerprint.
- Returns:
A 1D numpy array containing the binary Morgan fingerprint.
- chemprop.features.features_generators.morgan_counts_features_generator(mol: str | Mol, radius: int = 2, num_bits: int = 2048) ndarray [source]
Generates a counts-based Morgan fingerprint for a molecule.
- Parameters:
mol – A molecule (i.e., either a SMILES or an RDKit molecule).
radius – Morgan fingerprint radius.
num_bits – Number of bits in Morgan fingerprint.
- Returns:
A 1D numpy array containing the counts-based Morgan fingerprint.
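The difference between the binary and counts-based variants can be shown with a toy example. This sketch does not compute real Morgan environments: it assumes substructure identifiers are already available as integers and folds them into num_bits positions with a modulo, which is an illustrative choice. The function names and identifiers are hypothetical.

```python
from typing import Iterable, List


def fold_binary(identifiers: Iterable[int], num_bits: int = 2048) -> List[int]:
    """Set a position to 1 if any identifier folds onto it."""
    bits = [0] * num_bits
    for identifier in identifiers:
        bits[identifier % num_bits] = 1
    return bits


def fold_counts(identifiers: Iterable[int], num_bits: int = 2048) -> List[int]:
    """Count how many identifiers fold onto each position."""
    counts = [0] * num_bits
    for identifier in identifiers:
        counts[identifier % num_bits] += 1
    return counts


# Hypothetical substructure identifiers; 11 occurs three times.
identifiers = [11, 42, 11, 11, 7]
binary = fold_binary(identifiers, num_bits=16)
counts = fold_counts(identifiers, num_bits=16)
print(binary[11], counts[11])    # 1 3
print(sum(binary), sum(counts))  # 3 5
```

Both vectors have the same length; the counts variant additionally records how often a substructure occurs.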
- chemprop.features.features_generators.rdkit_2d_features_generator(mol: str | Mol) ndarray [source]
Generates RDKit 2D features for a molecule.
- Parameters:
mol – A molecule (i.e., either a SMILES or an RDKit molecule).
- Returns:
A 1D numpy array containing the RDKit 2D features.
- chemprop.features.features_generators.rdkit_2d_normalized_features_generator(mol: str | Mol) ndarray [source]
Generates RDKit 2D normalized features for a molecule.
- Parameters:
mol – A molecule (i.e., either a SMILES or an RDKit molecule).
- Returns:
A 1D numpy array containing the RDKit 2D normalized features.
- chemprop.features.features_generators.register_features_generator(features_generator_name: str) Callable[[Callable[[str | Mol], ndarray]], Callable[[str | Mol], ndarray]] [source]
Creates a decorator which registers a features generator in a global dictionary to enable access by name.
- Parameters:
features_generator_name – The name to use to access the features generator.
- Returns:
A decorator which will add a features generator to the registry using the specified name.
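The registry pattern behind these three functions can be sketched in plain Python. This is a standalone illustration rather than the library source: the names register, get_generator, available, and the toy generator smiles_length_generator are hypothetical, and the generator takes only a SMILES string and returns a list instead of a numpy array.

```python
from typing import Callable, Dict, List

FeaturesGenerator = Callable[[str], List[float]]

# Global dictionary mapping a name to its generator function.
_REGISTRY: Dict[str, FeaturesGenerator] = {}


def register(name: str) -> Callable[[FeaturesGenerator], FeaturesGenerator]:
    """Create a decorator that stores a generator under the given name."""
    def decorator(generator: FeaturesGenerator) -> FeaturesGenerator:
        _REGISTRY[name] = generator
        return generator  # leave the decorated function usable as-is
    return decorator


def get_generator(name: str) -> FeaturesGenerator:
    """Look up a registered generator by name."""
    if name not in _REGISTRY:
        raise ValueError(f'Features generator "{name}" is not registered.')
    return _REGISTRY[name]


def available() -> List[str]:
    """List the names of all registered generators."""
    return list(_REGISTRY.keys())


@register("smiles_length")
def smiles_length_generator(smiles: str) -> List[float]:
    """Toy generator: a single feature equal to the SMILES string length."""
    return [float(len(smiles))]


print(available())                            # ['smiles_length']
print(get_generator("smiles_length")("CCO"))  # [3.0]
```

A custom generator for chemprop would follow the same shape: decorate a function that accepts a molecule and returns a 1D array with register_features_generator and the name under which it should be retrievable.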
Utils
Classes and functions from chemprop.features.utils.py.
- chemprop.features.utils.load_features(path: str) ndarray [source]
Loads features saved in a variety of formats.
Supported formats:
.npz compressed (assumes features are saved with name “features”)
.npy
.csv / .txt (assumes comma-separated features with a header and with one line per molecule)
.pkl / .pckl / .pickle containing a sparse numpy array
Note
All formats assume that the SMILES loaded elsewhere in the code are in the same order as the features loaded here.
- Parameters:
path – Path to a file containing features.
- Returns:
A 2D numpy array of size
(num_molecules, features_size)
containing the features.
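The .csv layout described above (a header row, then one comma-separated line per molecule) can be sketched with the standard library. This is an illustration of the file format only, not how load_features is implemented; the reader function and column names are hypothetical, and it returns nested lists rather than a numpy array.

```python
import csv
import io
from typing import List, TextIO


def read_features_csv(handle: TextIO) -> List[List[float]]:
    """Read a features CSV: one header row, then one row per molecule."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return [[float(x) for x in row] for row in reader if row]


# Two molecules with three features each; the column names are made up.
csv_text = "feat_a,feat_b,feat_c\n0.1,1.0,3.5\n0.2,0.0,2.5\n"
features = read_features_csv(io.StringIO(csv_text))
print(len(features), len(features[0]))  # 2 3
```

The rows carry no molecule identifier, which is why the order must match the order of the SMILES loaded elsewhere.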
- chemprop.features.utils.load_valid_atom_or_bond_features(path: str, smiles: List[str]) List[ndarray] [source]
Loads features saved in a variety of formats.
Supported formats:
.npz (descriptors are saved as a 2D array for each molecule, in the same order as in the data.csv)
.pkl / .pckl / .pickle containing a pandas dataframe with smiles as index and numpy arrays of descriptors as columns
.sdf containing all mol blocks with descriptors as entries
- Parameters:
path – Path to file containing atomwise features.
- Returns:
A list of 2D arrays.
- chemprop.features.utils.save_features(path: str, features: List[ndarray]) None [source]
Saves features to a compressed .npz file with array name “features”.
- Parameters:
path – Path to a .npz file where the features will be saved.
features – A list of 1D numpy arrays containing the features for molecules.