Chemistry API
Edit Distance
Compute graph edit distance between molecular structures.
- simba.core.chemistry.edit_distance.edit_distance.create_input_df(smiles, indexes_0, indexes_1)[source]
- simba.core.chemistry.edit_distance.edit_distance.compute_ed_or_mces(smiles: list[str], sampled_index: int64, batch_size: int, identifier: int, random_sampling: bool, config: Config, fps: list[ExplicitBitVect], mols: list[Mol], use_edit_distance: bool) → ndarray[source]
Compute the edit distance or MCES for a batch of molecule pairs.
- Parameters:
smiles (List[str]) – List of SMILES strings.
sampled_index (np.int64) – Index to sample from the smiles list.
batch_size (int) – The size of the batch to process.
identifier (int) – An identifier for the batch (used for random seed).
random_sampling (bool) – Whether to use random sampling of pairs.
config (Config) – Configuration object containing parameters.
fps (List[ExplicitBitVect]) – List of fingerprints corresponding to the smiles.
mols (List[Mol]) – List of RDKit Mol objects corresponding to the smiles.
use_edit_distance (bool) – Whether to compute edit distance (True) or MCES (False).
- Returns:
A 2D numpy array with each row containing (index1, index2, distance).
- Return type:
np.ndarray
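The row layout of the returned array can be illustrated with a small, library-free sketch. Everything in it is hypothetical: `toy_distance` stands in for the real edit distance or MCES computation, and the pairing logic (seeding a sampler from `identifier`, pairing `sampled_index` with a batch of partners) is an assumption about how such a batch might be built, not the SIMBA implementation.

```python
import random


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for the real distance: difference in string length."""
    return float(abs(len(a) - len(b)))


def toy_batch(smiles, sampled_index, batch_size, identifier, random_sampling):
    """Build rows of (index1, index2, distance) for one batch of pairs."""
    rng = random.Random(identifier)  # assumption: identifier seeds the sampler
    n = len(smiles)
    if random_sampling:
        partners = [rng.randrange(n) for _ in range(batch_size)]
    else:
        # assumption: deterministic partners that follow sampled_index
        partners = [(sampled_index + 1 + k) % n for k in range(batch_size)]
    return [
        (sampled_index, j, toy_distance(smiles[sampled_index], smiles[j]))
        for j in partners
    ]


rows = toy_batch(
    ["CCO", "CCCO", "CCN", "c1ccccc1"],
    sampled_index=0,
    batch_size=3,
    identifier=7,
    random_sampling=False,
)
for row in rows:
    print(row)  # each row is (index1, index2, distance)
```

In the real function the rows are returned as a 2D NumPy array rather than a list of tuples.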
- simba.core.chemistry.edit_distance.edit_distance.get_number_of_modification_edges(mol, substructure)[source]
- simba.core.chemistry.edit_distance.edit_distance.get_edit_distance_from_smiles(smiles1, smiles2, return_nans=True)[source]
- simba.core.chemistry.edit_distance.edit_distance.simba_get_edit_distance(mol1, mol2, return_nans=True)[source]
Calculate the edit distance between two molecules.
- Parameters:
mol1 (Mol) – First molecule.
mol2 (Mol) – Second molecule.
return_nans (bool, optional) – Whether to return NaN for dissimilar molecules (default: True).
- Returns:
Edit distance between mol1 and mol2.
- Return type:
float or np.nan
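Because the return value may be NaN, callers usually need to filter results before aggregating them. The sketch below uses hand-written placeholder distances rather than real output from `simba_get_edit_distance`.

```python
import math

# Hypothetical distances for three molecule pairs; NaN marks a pair for which
# no finite distance was returned.
distances = [2.0, float("nan"), 5.0]

# NaN never compares equal to itself, so test with math.isnan rather than ==.
valid = [d for d in distances if not math.isnan(d)]
skipped = len(distances) - len(valid)

print(valid)
print(skipped)
```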
- simba.core.chemistry.edit_distance.edit_distance.simba_solve_pair_edit_distance(s0, s1, fp0, fp1, mol0, mol1)[source]
Molecular Utilities
GNPS Utils - Molecule Utils
This module contains utility functions for working with molecules and molecule modifications, based on the RDKit library.
- simba.core.chemistry.edit_distance.mol_utils.get_modification_nodes(mol1, mol2, in_mol1=True)[source]
Calculates the modification sites between two molecules when one molecule is a substructure of the other.
- Input:
- mol1:
first molecule
- mol2:
second molecule
- in_mol1:
bool; if True, the modification sites are given as atoms of mol1, if False, as atoms of mol2
- Output:
list of modification sites
- simba.core.chemistry.edit_distance.mol_utils.get_modification_edges(mol1, mol2, only_outward_edges=False)[source]
Calculates the modification edges between two molecules when one molecule is a substructure of the other.
- Input:
- mol1:
first molecule
- mol2:
second molecule
- only_outward_edges:
bool, if True, only the modification edges that go from atoms in the substructure to atoms outside the substructure are returned
- Output:
list of the modification edges in the parent molecule, as tuples of atom indices
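The effect of `only_outward_edges` can be sketched on plain index pairs, without RDKit. This is a toy model, not the library's implementation: `toy_modification_edges`, its argument layout, and the `match` tuple (the parent atom index matched by each substructure atom) are assumptions made for illustration.

```python
def toy_modification_edges(parent_bonds, sub_bonds, match, only_outward_edges=False):
    """
    parent_bonds: (i, j) atom-index pairs in the parent molecule.
    sub_bonds:    (a, b) atom-index pairs in the substructure.
    match:        match[a] is the parent atom matched by substructure atom a.
    """
    matched_atoms = set(match)
    matched_bonds = {frozenset((match[a], match[b])) for a, b in sub_bonds}
    edges = []
    for i, j in parent_bonds:
        endpoints_inside = (i in matched_atoms) + (j in matched_atoms)
        if endpoints_inside == 1:
            # Bond leaves the matched region: always a modification edge.
            edges.append((i, j))
        elif (
            endpoints_inside == 2
            and frozenset((i, j)) not in matched_bonds
            and not only_outward_edges
        ):
            # Extra bond between two matched atoms (for example a ring closure).
            edges.append((i, j))
    return edges


# Parent: chain 0-1-2-3 plus an extra bond 0-2; substructure: chain of 3 atoms.
parent = [(0, 1), (1, 2), (2, 3), (0, 2)]
sub = [(0, 1), (1, 2)]
all_edges = toy_modification_edges(parent, sub, match=(0, 1, 2))
outward = toy_modification_edges(parent, sub, match=(0, 1, 2), only_outward_edges=True)
print(all_edges)
print(outward)
```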
- simba.core.chemistry.edit_distance.mol_utils.get_edit_distance(mol1, mol2)[source]
Calculates the edit distance between mol1 and mol2.
- Input:
- mol1:
first molecule
- mol2:
second molecule
- Output:
- edit_distance:
edit distance between mol1 and mol2
- simba.core.chemistry.edit_distance.mol_utils.get_edit_distance_detailed(mol1, mol2, mcs=None)[source]
Calculates the edit distance between mol1 and mol2 and returns the removed and added modification edges.
- Input:
- mol1:
first molecule
- mol2:
second molecule
- mcs:
the maximum common substructure between mol1 and mol2
- Output:
- removed edges:
the removed modification edges
- added edges:
the added modification edges
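One way to read the two outputs is that the scalar edit distance is the total number of modification edges on both sides of the common substructure. The sketch below encodes that reading on placeholder edge lists; the relationship is an assumption, and `toy_edit_distance` is a hypothetical helper rather than part of the library.

```python
def toy_edit_distance(removed_edges, added_edges):
    """Assumed relationship: distance = number of removed + number of added edges."""
    return len(removed_edges) + len(added_edges)


# Hypothetical edges given as tuples of atom indices.
removed = [(2, 3)]          # present in mol1, outside the common substructure
added = [(1, 4), (4, 5)]    # present in mol2, outside the common substructure

distance = toy_edit_distance(removed, added)
print(distance)
```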
- simba.core.chemistry.edit_distance.mol_utils.get_transition(input1, input2)[source]
Calculates the transition from the first molecule (input1, called mol1 in the output description) to the second molecule (input2, called mol2).
- Input:
- input1:
first molecule
- input2:
second molecule
- Output:
- result:
a dictionary with the following keys:
- ‘merged_mol’: the merged molecule
- ‘common_bonds’: the common bonds between mol1 and mol2
- ‘common_atoms’: the common atoms between mol1 and mol2
- ‘removed_atoms’: the removed atoms from mol1
- ‘added_atoms’: the added atoms from mol2
- ‘modified_added_edges_inside’: the added edges inside the common substructure
- ‘modified_added_edges_bridge’: the added edges between the common substructure and the added atoms
- ‘modified_removed_edges_inside’: the removed edges inside the common substructure
- ‘modified_removed_edges_bridge’: the removed edges between the common substructure and the removed atoms
- ‘added_edges’: the added edges that are not modification edges
- ‘removed_edges’: the removed edges that are not modification edges
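A consumer of this dictionary might summarise it as follows. The dictionary here is built by hand: the key names come from the description above, but the value types (lists of atom indices and index pairs) are assumptions, and ‘merged_mol’ is omitted because it would be an RDKit object.

```python
# Hand-written stand-in for a get_transition result (value types are assumed).
transition = {
    "common_atoms": [0, 1, 2],
    "common_bonds": [(0, 1), (1, 2)],
    "removed_atoms": [3],
    "added_atoms": [4, 5],
    "modified_added_edges_inside": [],
    "modified_added_edges_bridge": [(2, 4)],
    "modified_removed_edges_inside": [],
    "modified_removed_edges_bridge": [(2, 3)],
    "added_edges": [(4, 5)],
    "removed_edges": [],
}


def count_modification_edges(result):
    """Sum the sizes of all entries whose key starts with 'modified_'."""
    return sum(len(v) for k, v in result.items() if k.startswith("modified_"))


n_modification_edges = count_modification_edges(transition)
print(n_modification_edges)
```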
- simba.core.chemistry.edit_distance.mol_utils.get_modification_graph(main_struct, sub_struct)[source]
Calculates the substructure difference between main_struct and sub_struct when there is exactly one modification edge. If there are multiple modification edges, one of the modifications is returned at random.
- Input:
- main_struct:
main molecule
- sub_struct:
substructure molecule
- Output:
- all_modifications:
a list of the modification structures; each modification is a tuple of:
1. the modified subgraph molecule (as an RDKit editable molecule)
2. a dictionary that maps the wildcard atom indices in the subgraph to their true indices in the main molecule
3. the SMARTS representation of the modification
- simba.core.chemistry.edit_distance.mol_utils.attach_mols(main_mol, attachment_mol, attach_location_main, attach_location_attachment, bond_type)[source]
Attaches the attachment structure to the main molecule by adding a bond of type bond_type between atom attach_location_main of the main molecule and atom attach_location_attachment of the attachment molecule.
- Input:
- main_mol:
rdkit molecule of the main molecule
- attachment_mol:
rdkit molecule of the attachment molecule
- attach_location_main:
the index of the atom in the main molecule where the attachment should be done
- attach_location_attachment:
the index of the atom in the attachment molecule where the attachment should be done
- bond_type:
the type of the bond between the main molecule and the attachment molecule
- Output:
- new_mol:
the new molecule after attachment
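The index bookkeeping behind such an attachment can be sketched with plain dictionaries instead of RDKit molecules. This toy model is not the library's implementation; `toy_attach` and its dictionary layout are hypothetical. It shows why `attach_location_attachment` has to be shifted: once the two atom lists are concatenated, the attachment's atoms are renumbered after those of the main molecule, which is also how RDKit's `Chem.CombineMols` numbers the atoms of its second argument.

```python
def toy_attach(main, attachment, attach_location_main, attach_location_attachment, bond_type):
    """main/attachment: {'atoms': [symbol, ...], 'bonds': [(i, j, type), ...]}."""
    offset = len(main["atoms"])  # attachment atoms are renumbered after the main ones
    atoms = main["atoms"] + attachment["atoms"]
    bonds = list(main["bonds"])
    bonds += [(i + offset, j + offset, t) for i, j, t in attachment["bonds"]]
    # The new bond joins the two fragments at the requested atoms.
    bonds.append((attach_location_main, attach_location_attachment + offset, bond_type))
    return {"atoms": atoms, "bonds": bonds}


ethane = {"atoms": ["C", "C"], "bonds": [(0, 1, "SINGLE")]}
hydroxyl = {"atoms": ["O"], "bonds": []}
ethanol = toy_attach(ethane, hydroxyl, 1, 0, "SINGLE")
print(ethanol)
```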
- simba.core.chemistry.edit_distance.mol_utils.generate_possible_stuctures(main_struct, sub_struct)[source]
Generates all possible structures after attaching sub_struct to main_struct.
- Input:
- main_struct:
main molecule
- sub_struct:
substructure molecule
- Output:
- list of possible_structures:
all possible structures after attachment, each with the index of the atom at which the attachment was made
MCES (Maximum Common Edge Substructure)
- class simba.core.chemistry.mces_loader.load_mces.LoadMCES[source]
Bases: object
- static find_file(directory_path, prefix)[source]
Searches for a .pkl file in the given directory and returns the path of the first one found.
Args: directory_path (str): The path of the directory to search in.
Returns: str: The path of the first .pkl file found, or None if no such file exists.
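A minimal sketch of such a lookup, using only the standard library. The `prefix` argument is not described above, so this sketch assumes it filters on the start of the file name. Sorting the candidates to make "first" deterministic is also a choice made here, and `toy_find_file` is a hypothetical name rather than the library's code.

```python
import tempfile
from pathlib import Path


def toy_find_file(directory_path, prefix):
    """Return the path of the first '<prefix>*.pkl' file, or None if there is none."""
    candidates = sorted(Path(directory_path).glob(f"{prefix}*.pkl"))
    return str(candidates[0]) if candidates else None


# Demonstrate on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ed_0.pkl", "ed_1.pkl", "notes.txt"):
        (Path(tmp) / name).touch()
    found = toy_find_file(tmp, "ed_")
    missing = toy_find_file(tmp, "mces_")
    found_name = Path(found).name

print(found_name)
print(missing)
```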
- static load_raw_data(directory_path, prefix, partitions=10000000)[source]
Load data for inspection purposes.
- static merge_numpy_arrays_mces(directory_path, prefix, remove_percentage=0.9)[source]
Load NumPy arrays containing the data and apply normalization for training.
- static merge_numpy_arrays_edit_distance(directory_path, prefix, remove_percentage=0.9)[source]
Load NumPy arrays containing the data and apply normalization.
- static merge_numpy_arrays_multitask(directory_path, prefix, remove_percentage=0.0, add_high_similarity_pairs=False, normalize_mces=True, normalize_ed=True)[source]
Load NumPy arrays containing the data and apply normalization.
- static merge_numpy_arrays(directory_path, prefix, use_edit_distance, use_multitask=False, add_high_similarity_pairs=False, remove_percentage=0, normalize_mces=True, normalize_ed=True)[source]
Load NumPy arrays containing the data and apply normalization.