chemprop.featurizers

Package Contents#

Classes#

Featurizer

A Featurizer featurizes inputs of type S into outputs of type T.

VectorFeaturizer

A Featurizer featurizes inputs of type S into outputs of type T.

GraphFeaturizer

A Featurizer featurizes inputs of type S into outputs of type T.

MultiHotAtomFeaturizer

A MultiHotAtomFeaturizer uses a multi-hot encoding to featurize atoms.

AtomFeatureMode

The mode of an atom is used for featurization into a MolGraph

MultiHotBondFeaturizer

A MultiHotBondFeaturizer featurizes bonds based on their null-ity, bond type, conjugation, ring membership, and stereochemistry.

MolGraphCacheFacade

A MolGraphCacheFacade provides an interface for caching MolGraphs.

MolGraphCache

A MolGraphCache precomputes the corresponding MolGraphs and caches them in memory.

MolGraphCacheOnTheFly

A MolGraphCacheOnTheFly computes the corresponding MolGraphs as they are requested.

SimpleMoleculeMolGraphFeaturizer

A SimpleMoleculeMolGraphFeaturizer is the default implementation of a MoleculeMolGraphFeaturizer.

CondensedGraphOfReactionFeaturizer

A CondensedGraphOfReactionFeaturizer featurizes reactions using the condensed reaction graph method.

RxnMode

The mode by which a reaction should be featurized into a MolGraph

MorganFeaturizerMixin

BinaryFeaturizerMixin

CountFeaturizerMixin

MorganBinaryFeaturizer

MorganCountFeaturizer

Functions#

get_multi_hot_atom_featurizer(mode)

Build the corresponding multi-hot atom featurizer.

Attributes#

S

T

CGRFeaturizer

MoleculeFeaturizerRegistry

class chemprop.featurizers.Featurizer[source]#

Bases: Generic[S, T]

A Featurizer featurizes inputs of type S into outputs of type T.

abstract __call__(input, *args, **kwargs)[source]#

featurize an input

Parameters:

input (S)

Return type:

T
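
The generic contract above can be sketched in plain Python. This is a minimal illustration and not chemprop's source: `ToyFeaturizer` and `LengthFeaturizer` are hypothetical names, and the only behavior taken from this page is an abstract `__call__` that maps an input of type S to an output of type T.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

S = TypeVar("S")
T = TypeVar("T")


class ToyFeaturizer(ABC, Generic[S, T]):
    """Hypothetical stand-in for an abstract featurizer from S to T."""

    @abstractmethod
    def __call__(self, input: S, *args, **kwargs) -> T:
        """Featurize an input."""


class LengthFeaturizer(ToyFeaturizer[str, int]):
    """Toy subclass: featurizes a string as its length."""

    def __call__(self, input: str, *args, **kwargs) -> int:
        return len(input)


featurizer = LengthFeaturizer()
n_chars = featurizer("CCO")  # S = str, T = int
```

Because `__call__` is abstract, the base class itself cannot be instantiated; only concrete subclasses that fix S and T can.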

chemprop.featurizers.S#
chemprop.featurizers.T#
class chemprop.featurizers.VectorFeaturizer[source]#

Bases: Featurizer[S, numpy.ndarray], collections.abc.Sized

A Featurizer featurizes inputs of type S into outputs of type T.

class chemprop.featurizers.GraphFeaturizer[source]#

Bases: Featurizer[S, chemprop.data.molgraph.MolGraph]

A Featurizer featurizes inputs of type S into outputs of type T.

abstract property shape: tuple[int, int]#
Return type:

tuple[int, int]

class chemprop.featurizers.MultiHotAtomFeaturizer(atomic_nums, degrees, formal_charges, chiral_tags, num_Hs, hybridizations)[source]#

Bases: chemprop.featurizers.base.VectorFeaturizer[rdkit.Chem.rdchem.Atom]

A MultiHotAtomFeaturizer uses a multi-hot encoding to featurize atoms.

See also

The class provides three default parameterization schemes: v1(), v2(), and organic().

The generated atom features are ordered as follows:

  • atomic number

  • degree

  • formal charge

  • chiral tag

  • number of hydrogens

  • hybridization

  • aromaticity

  • mass

Important

Each feature, except for aromaticity and mass, includes a pad for unknown values.

Parameters:
  • atomic_nums (Sequence[int]) – the choices for atom type denoted by atomic number. Ex: [6, 7, 8] for C, N and O.

  • degrees (Sequence[int]) – the choices for number of bonds an atom is engaged in.

  • formal_charges (Sequence[int]) – the choices for integer electronic charge assigned to an atom.

  • chiral_tags (Sequence[int]) – the choices for an atom’s chiral tag. See rdkit.Chem.rdchem.ChiralType for possible integer values.

  • num_Hs (Sequence[int]) – the choices for number of bonded hydrogen atoms.

  • hybridizations (Sequence[int]) – the choices for an atom’s hybridization type. See rdkit.Chem.rdchem.HybridizationType for possible integer values.
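
The padded multi-hot scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: it covers only two of the padded sub-features (atomic number and degree), the helper names `one_hot_with_pad` and `toy_atom_features` are hypothetical, and the mass is appended unscaled.

```python
def one_hot_with_pad(value, choices):
    """One-hot list of length len(choices) + 1; the last bit marks unknown values."""
    bits = [0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    bits[idx] = 1
    return bits


def toy_atom_features(atomic_num, degree, is_aromatic, mass, atomic_nums, degrees):
    """Concatenate padded one-hot blocks, then append unpadded aromaticity and mass."""
    features = []
    features += one_hot_with_pad(atomic_num, atomic_nums)
    features += one_hot_with_pad(degree, degrees)
    features.append(int(is_aromatic))  # single bit, no unknown pad
    features.append(mass)              # single value, no unknown pad
    return features


atomic_nums = [6, 7, 8]      # C, N, O
degrees = [0, 1, 2, 3, 4]

carbon = toy_atom_features(6, 4, False, 12.011, atomic_nums, degrees)
sulfur = toy_atom_features(16, 2, False, 32.06, atomic_nums, degrees)  # 16 is not a listed choice
```

Here the sulfur atom sets the final bit of the atomic number block, since 16 is not among the configured choices.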

__len__()[source]#
Return type:

int

__call__(a)[source]#
Parameters:

a (rdkit.Chem.rdchem.Atom | None)

Return type:

numpy.ndarray

num_only(a)[source]#

featurize the atom by setting only the atomic number bit

Parameters:

a (rdkit.Chem.rdchem.Atom)

Return type:

numpy.ndarray

classmethod v1(max_atomic_num=100)[source]#

The original implementation used in Chemprop V1 [1], [2].

Parameters:

max_atomic_num (int, default=100) – Include a bit for all atomic numbers in the interval \([1, \mathtt{max\_atomic\_num}]\).

References

[1] Kelley, B.; Mathea, M.; Palmer, A. “Analyzing Learned Molecular Representations for Property Prediction.” J. Chem. Inf. Model. 2019, 59 (8), 3370–3388. https://doi.org/10.1021/acs.jcim.9b00237

[2] Heid, E.; Greenman, K.P.; Chung, Y.; Li, S.C.; Graff, D.E.; Vermeire, F.H.; Wu, H.; Green, W.H.; McGill, C.J. “Chemprop: A machine learning package for chemical property prediction.” J. Chem. Inf. Model. 2024, 64 (1), 9–17. https://doi.org/10.1021/acs.jcim.3c01250

classmethod v2()[source]#

An implementation that includes an atom type bit for all elements in the first four rows of the periodic table plus iodine.

classmethod organic()[source]#

A specific parameterization intended for use with organic or drug-like molecules.

This parameterization features:
  1. an atomic number bit only for H, B, C, N, O, F, Si, P, S, Cl, Br, and I atoms

  2. a hybridization bit for \(s, sp, sp^2\) and \(sp^3\) hybridizations.

class chemprop.featurizers.AtomFeatureMode[source]#

Bases: chemprop.utils.utils.EnumMapping

The mode of an atom is used for featurization into a MolGraph

V1#
V2#
ORGANIC#
chemprop.featurizers.get_multi_hot_atom_featurizer(mode)[source]#

Build the corresponding multi-hot atom featurizer.

Parameters:

mode (str | AtomFeatureMode)

Return type:

MultiHotAtomFeaturizer
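
Accepting either a string or an enum member, as the `mode` parameter does, is commonly handled by normalizing to the enum first. The sketch below uses the standard-library `Enum` rather than chemprop's `EnumMapping`; `ToyAtomFeatureMode`, `resolve_mode`, and `toy_factory_name` are hypothetical names, and case-insensitive string lookup is an assumption of this sketch.

```python
from enum import Enum


class ToyAtomFeatureMode(Enum):
    """Hypothetical mirror of the three modes; values name the matching classmethod."""

    V1 = "v1"
    V2 = "v2"
    ORGANIC = "organic"


def resolve_mode(mode):
    """Normalize a string or enum member to an enum member."""
    if isinstance(mode, ToyAtomFeatureMode):
        return mode
    try:
        return ToyAtomFeatureMode[mode.upper()]  # assumption: names match case-insensitively
    except KeyError:
        names = ", ".join(m.name for m in ToyAtomFeatureMode)
        raise ValueError(f"Unknown mode {mode!r}; expected one of: {names}") from None


def toy_factory_name(mode):
    """Return the name of the parameterization classmethod the mode would select."""
    return resolve_mode(mode).value


from_string = toy_factory_name("v2")
from_enum = toy_factory_name(ToyAtomFeatureMode.ORGANIC)
```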

class chemprop.featurizers.MultiHotBondFeaturizer(bond_types=None, stereos=None)[source]#

Bases: chemprop.featurizers.base.VectorFeaturizer[rdkit.Chem.rdchem.Bond]

A MultiHotBondFeaturizer featurizes bonds based on the following attributes:

  • null-ity (i.e., is the bond None?)

  • bond type

  • conjugated?

  • in ring?

  • stereochemistry

The feature vectors produced by this featurizer have the following (general) signature:

slice [start, stop)    subfeature         unknown pad?
0-1                    null?              N
1-5                    bond type          N
5-6                    conjugated?        N
6-7                    in ring?           N
7-14                   stereochemistry    Y

NOTE: the above signature applies only to the default arguments, as the bond type and stereochemistry slices can increase in size depending on the input arguments.
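
The slice boundaries can be derived from the width of each sub-feature. The sketch below is an illustration, not the library's code: `bond_feature_slices` is a hypothetical helper, and it assumes one bit each for null-ity, conjugation, and ring membership, one bit per bond type, and one bit per stereochemistry choice plus an unknown pad.

```python
def bond_feature_slices(n_bond_types=4, n_stereos=6):
    """Return {subfeature: (start, stop)} with half-open [start, stop) slices."""
    sizes = [
        ("null?", 1),
        ("bond type", n_bond_types),
        ("conjugated?", 1),
        ("in ring?", 1),
        ("stereochemistry", n_stereos + 1),  # +1 for the unknown pad
    ]
    slices = {}
    start = 0
    for name, size in sizes:
        slices[name] = (start, start + size)
        start += size
    return slices


default_slices = bond_feature_slices()
total_length = default_slices["stereochemistry"][1]

# Adding a fifth bond type shifts every later slice by one position.
wider_slices = bond_feature_slices(n_bond_types=5)
```

With the default arguments the vector has 14 entries, and the stereochemistry block occupies positions 7 through 13.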

Parameters:
  • bond_types (Sequence[BondType] | None, default=[SINGLE, DOUBLE, TRIPLE, AROMATIC]) – the known bond types

  • stereos (Sequence[int] | None, default=[0, 1, 2, 3, 4, 5]) – the known bond stereochemistries. See rdkit.Chem.rdchem.BondStereo for possible integer values.

__len__()[source]#
__call__(b)[source]#
Parameters:

b (rdkit.Chem.rdchem.Bond)

Return type:

numpy.ndarray

classmethod one_hot_index(x, xs)[source]#

Returns a tuple of the index of x in xs and len(xs) + 1 if x is in xs. Otherwise, returns a tuple with len(xs) and len(xs) + 1.

Parameters:

xs (Sequence)

Return type:

tuple[int, int]
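
The behavior described for one_hot_index can be written as a standalone function. This is a sketch based only on the description above (the real method is a classmethod), so treat it as an illustration and not the library's source.

```python
from typing import Sequence


def one_hot_index(x, xs: Sequence) -> tuple[int, int]:
    """Return (index of x in xs, len(xs) + 1), or (len(xs), len(xs) + 1) if x is absent."""
    n = len(xs)
    return (xs.index(x), n + 1) if x in xs else (n, n + 1)


known = one_hot_index("b", ["a", "b", "c"])
unknown = one_hot_index("z", ["a", "b", "c"])
```

The second element is the size of the one-hot block including the unknown pad, so an unseen value always maps to the last position of that block.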

class chemprop.featurizers.MolGraphCacheFacade(inputs, V_fs, E_fs, featurizer)[source]#

Bases: collections.abc.Sequence[chemprop.data.molgraph.MolGraph], Generic[chemprop.featurizers.base.S]

A MolGraphCacheFacade provides an interface for caching MolGraphs.

Note

This class only provides a facade for a cached dataset, but it does not guarantee whether the underlying data is truly cached.

Parameters:
  • inputs (Iterable[S]) – The inputs to be featurized.

  • V_fs (Iterable[np.ndarray]) – The node features for each input.

  • E_fs (Iterable[np.ndarray]) – The edge features for each input.

  • featurizer (Featurizer[S, MolGraph]) – The featurizer with which to generate the MolGraphs.

class chemprop.featurizers.MolGraphCache(inputs, V_fs, E_fs, featurizer)[source]#

Bases: MolGraphCacheFacade

A MolGraphCache precomputes the corresponding MolGraphs and caches them in memory.

__len__()[source]#
Return type:

int

__getitem__(index)[source]#
Parameters:

index (int)

Return type:

chemprop.data.molgraph.MolGraph

class chemprop.featurizers.MolGraphCacheOnTheFly(inputs, V_fs, E_fs, featurizer)[source]#

Bases: MolGraphCacheFacade

A MolGraphCacheOnTheFly computes the corresponding MolGraphs as they are requested.

__len__()[source]#
Return type:

int

__getitem__(index)[source]#
Parameters:

index (int)

Return type:

chemprop.data.molgraph.MolGraph
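
The difference between precomputing and computing on request can be sketched with two small `Sequence` subclasses. These are hypothetical stand-ins (`EagerCache`, `LazyCache`), not the chemprop classes: they omit the `V_fs` and `E_fs` arguments and use a toy string featurizer in place of a MolGraph featurizer.

```python
from collections.abc import Sequence


class EagerCache(Sequence):
    """Featurizes every input once at construction and stores the results."""

    def __init__(self, inputs, featurizer):
        self._items = [featurizer(x) for x in inputs]

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]


class LazyCache(Sequence):
    """Stores the inputs and featurizes each one every time it is requested."""

    def __init__(self, inputs, featurizer):
        self._inputs = list(inputs)
        self._featurizer = featurizer

    def __len__(self):
        return len(self._inputs)

    def __getitem__(self, index):
        return self._featurizer(self._inputs[index])


calls = []


def counting_featurizer(x):
    """Toy featurizer that records each call."""
    calls.append(x)
    return x.upper()


eager = EagerCache(["c", "cc"], counting_featurizer)
calls_after_eager_init = len(calls)  # both inputs featurized up front

lazy = LazyCache(["c", "cc"], counting_featurizer)
calls_after_lazy_init = len(calls)   # unchanged: nothing featurized yet

first = lazy[0]                      # triggers one featurizer call
```

The eager variant trades memory for faster repeated access, while the lazy variant keeps memory low at the cost of recomputing on every lookup.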

class chemprop.featurizers.SimpleMoleculeMolGraphFeaturizer[source]#

Bases: chemprop.featurizers.molgraph.mixins._MolGraphFeaturizerMixin, chemprop.featurizers.base.GraphFeaturizer[rdkit.Chem.Mol]

A SimpleMoleculeMolGraphFeaturizer is the default implementation of a MoleculeMolGraphFeaturizer

Parameters:
  • atom_featurizer (AtomFeaturizer, default=MultiHotAtomFeaturizer()) – the featurizer with which to calculate feature representations of the atoms in a given molecule

  • bond_featurizer (BondFeaturizer, default=MultiHotBondFeaturizer()) – the featurizer with which to calculate feature representations of the bonds in a given molecule

  • extra_atom_fdim (int, default=0) – the dimension of the additional features that will be concatenated onto the calculated features of each atom

  • extra_bond_fdim (int, default=0) – the dimension of the additional features that will be concatenated onto the calculated features of each bond

extra_atom_fdim: dataclasses.InitVar[int] = 0#
extra_bond_fdim: dataclasses.InitVar[int] = 0#
__post_init__(extra_atom_fdim=0, extra_bond_fdim=0)[source]#
Parameters:
  • extra_atom_fdim (int)

  • extra_bond_fdim (int)

__call__(mol, atom_features_extra=None, bond_features_extra=None)[source]#
Parameters:
  • mol (rdkit.Chem.Mol)

  • atom_features_extra (numpy.ndarray | None)

  • bond_features_extra (numpy.ndarray | None)

Return type:

chemprop.data.molgraph.MolGraph
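
The role of `extra_atom_fdim` can be illustrated with a small concatenation helper. This is a sketch under assumptions, not the featurizer's code: `concat_extra` is a hypothetical function, features are plain lists instead of NumPy arrays, and the shape checks shown are an assumption about reasonable validation.

```python
def concat_extra(features, extras, extra_fdim):
    """Append per-row extra features of width extra_fdim to each calculated feature row."""
    if extras is None:
        return [list(row) for row in features]
    if len(extras) != len(features):
        raise ValueError("expected one row of extra features per atom")
    combined = []
    for row, extra in zip(features, extras):
        if len(extra) != extra_fdim:
            raise ValueError(f"expected {extra_fdim} extra features, got {len(extra)}")
        combined.append(list(row) + list(extra))
    return combined


atom_features = [[1, 0], [0, 1]]     # two atoms, two calculated features each
atom_features_extra = [[0.5], [0.25]]  # one extra feature per atom

combined = concat_extra(atom_features, atom_features_extra, extra_fdim=1)
unchanged = concat_extra(atom_features, None, extra_fdim=0)
```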

class chemprop.featurizers.CondensedGraphOfReactionFeaturizer[source]#

Bases: chemprop.featurizers.molgraph.mixins._MolGraphFeaturizerMixin, chemprop.featurizers.base.GraphFeaturizer[chemprop.types.Rxn]

A CondensedGraphOfReactionFeaturizer featurizes reactions using the condensed reaction graph method utilized in [1].

NOTE: This class does not accept an arbitrary atom featurizer. It requires the num_only() method, which is implemented only in the concrete MultiHotAtomFeaturizer class.

Parameters:
  • atom_featurizer (AtomFeaturizer, default=AtomFeaturizer()) – the featurizer with which to calculate feature representations of the atoms in a given molecule

  • bond_featurizer (BondFeaturizerBase, default=BondFeaturizer()) – the featurizer with which to calculate feature representations of the bonds in a given molecule

  • mode (str | RxnMode, default=RxnMode.REAC_DIFF) – the mode by which to featurize the reaction, given as either the string code or the enum value

References

property mode: RxnMode#
Return type:

RxnMode

mode_: dataclasses.InitVar[str | RxnMode]#
__post_init__(mode_)[source]#
Parameters:

mode_ (str | RxnMode)

__call__(rxn, atom_features_extra=None, bond_features_extra=None)[source]#

Featurize the input reaction into a molecular graph

Parameters:
  • rxn (Rxn) – a 2-tuple of atom-mapped rdkit molecules, where the 0th element is the reactant and the 1st element is the product

  • atom_features_extra (np.ndarray | None, default=None) – UNSUPPORTED; retained only for parity with the method signature of the MoleculeFeaturizer

  • bond_features_extra (np.ndarray | None, default=None) – UNSUPPORTED; retained only for parity with the method signature of the MoleculeFeaturizer

Returns:

the molecular graph of the reaction

Return type:

MolGraph

classmethod map_reac_to_prod(reacs, pdts)[source]#

Map atom indices between corresponding atoms in the reactant and product molecules

Parameters:
  • reacs (Chem.Mol) – An RDKit molecule of the reactants

  • pdts (Chem.Mol) – An RDKit molecule of the products

Returns:

  • ri2pi (dict[int, int]) – A dictionary of corresponding atom indices from reactant atoms to product atoms

  • pdt_idxs (list[int]) – atom indices of product atoms

  • rct_idxs (list[int]) – atom indices of reactant atoms

Return type:

tuple[dict[int, int], list[int], list[int]]
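
The idea of pairing reactant and product atoms can be sketched without RDKit by working on lists of atom-map numbers. Everything here is an assumption made for illustration: `toy_map_reac_to_prod` is a hypothetical function, atoms are matched by equal nonzero map numbers, a map number of 0 means unmapped, and the two index lists are taken to hold the atoms that have no partner on the other side.

```python
def toy_map_reac_to_prod(reac_map_nums, pdt_map_nums):
    """Pair atoms by map number; inputs are map numbers indexed by atom index."""
    map_num_to_pdt_idx = {m: i for i, m in enumerate(pdt_map_nums) if m > 0}
    reac_map_set = {m for m in reac_map_nums if m > 0}

    ri2pi = {}
    rct_idxs = []
    for i, m in enumerate(reac_map_nums):
        if m > 0 and m in map_num_to_pdt_idx:
            ri2pi[i] = map_num_to_pdt_idx[m]  # reactant index -> product index
        else:
            rct_idxs.append(i)                # reactant atom without a partner

    # Product atoms that are unmapped or whose map number is absent from the reactants.
    pdt_idxs = [i for i, m in enumerate(pdt_map_nums) if m == 0 or m not in reac_map_set]
    return ri2pi, pdt_idxs, rct_idxs


ri2pi, pdt_idxs, rct_idxs = toy_map_reac_to_prod([1, 2, 3], [3, 1, 0])
```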

chemprop.featurizers.CGRFeaturizer: TypeAlias#
class chemprop.featurizers.RxnMode[source]#

Bases: chemprop.utils.utils.EnumMapping

The mode by which a reaction should be featurized into a MolGraph

REAC_PROD#

concatenates the reactant features with the product features

REAC_PROD_BALANCE#

concatenates the reactant features with the product features and balances imbalanced reactions

REAC_DIFF#

concatenates the reactant features with the difference in features between reactants and products

REAC_DIFF_BALANCE#

concatenates the reactant features with the difference in features between reactants and products and balances imbalanced reactions

PROD_DIFF#

concatenates the product features with the difference in features between reactants and products

PROD_DIFF_BALANCE#

concatenates the product features with the difference in features between reactants and products and balances imbalanced reactions
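
The three unbalanced modes differ only in which two vectors are concatenated. The sketch below shows this on toy per-atom feature lists; `combine_features` is a hypothetical helper, the difference is taken as product minus reactant (an assumption, since the sign convention is not stated here), and the `_BALANCE` variants are not covered.

```python
def combine_features(reactant, product, mode):
    """Concatenate two equal-length feature lists according to the mode name."""
    diff = [p - r for r, p in zip(reactant, product)]  # assumption: product minus reactant
    if mode == "REAC_PROD":
        return reactant + product
    if mode == "REAC_DIFF":
        return reactant + diff
    if mode == "PROD_DIFF":
        return product + diff
    raise ValueError(f"unsupported mode in this sketch: {mode!r}")


reactant = [1, 0, 2]
product = [0, 1, 2]

reac_prod = combine_features(reactant, product, "REAC_PROD")
reac_diff = combine_features(reactant, product, "REAC_DIFF")
prod_diff = combine_features(reactant, product, "PROD_DIFF")
```

In every mode the output is twice the length of a single feature vector, so downstream dimensions do not depend on the mode chosen.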

class chemprop.featurizers.MorganFeaturizerMixin(radius=2, length=2048, include_chirality=True)[source]#
Parameters:
  • radius (int)

  • length (int)

  • include_chirality (bool)

__len__()[source]#
Return type:

int

class chemprop.featurizers.BinaryFeaturizerMixin[source]#
__call__(mol)[source]#
Parameters:

mol (rdkit.Chem.Mol)

Return type:

numpy.ndarray

class chemprop.featurizers.CountFeaturizerMixin[source]#
__call__(mol)[source]#
Parameters:

mol (rdkit.Chem.Mol)

Return type:

numpy.ndarray

class chemprop.featurizers.MorganBinaryFeaturizer(radius=2, length=2048, include_chirality=True)[source]#

Bases: MorganFeaturizerMixin, BinaryFeaturizerMixin, chemprop.featurizers.base.VectorFeaturizer[rdkit.Chem.Mol]

Parameters:
  • radius (int)

  • length (int)

  • include_chirality (bool)

class chemprop.featurizers.MorganCountFeaturizer(radius=2, length=2048, include_chirality=True)[source]#

Bases: MorganFeaturizerMixin, CountFeaturizerMixin, chemprop.featurizers.base.VectorFeaturizer[rdkit.Chem.Mol]

Parameters:
  • radius (int)

  • length (int)

  • include_chirality (bool)
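
The distinction between the binary and count variants can be sketched on made-up substructure identifiers. This is a conceptual illustration only: the fingerprints in this package are presumably generated through RDKit and not by this code, the helper names `fold_binary` and `fold_count` are hypothetical, and folding by modulo into a fixed `length` is an assumption of the sketch.

```python
def fold_binary(identifiers, length):
    """Set a bit to 1 if any identifier folds onto that position."""
    bits = [0] * length
    for ident in identifiers:
        bits[ident % length] = 1
    return bits


def fold_count(identifiers, length):
    """Count how many identifiers fold onto each position."""
    counts = [0] * length
    for ident in identifiers:
        counts[ident % length] += 1
    return counts


# Made-up identifiers; 3 occurs twice and 11 collides with it when length is 8.
identifiers = [3, 11, 3, 8]

binary_fp = fold_binary(identifiers, length=8)
count_fp = fold_count(identifiers, length=8)
```

Both vectors have the same length, which is what `__len__` reports; the count variant additionally preserves how often each position was hit.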

chemprop.featurizers.MoleculeFeaturizerRegistry#