Chemistry API

Edit Distance

Compute graph edit distance between molecular structures.

simba.core.chemistry.edit_distance.edit_distance.create_input_df(smiles, indexes_0, indexes_1)[source]
simba.core.chemistry.edit_distance.edit_distance.compute_ed_or_mces(smiles: list[str], sampled_index: int64, batch_size: int, identifier: int, random_sampling: bool, config: Config, fps: list[ExplicitBitVect], mols: list[Mol], use_edit_distance: bool) → ndarray[source]

Compute the edit distance or MCES for a batch of molecule pairs.

Parameters:
  • smiles (List[str]) – List of SMILES strings.

  • sampled_index (np.int64) – Index to sample from the smiles list.

  • batch_size (int) – The size of the batch to process.

  • identifier (int) – An identifier for the batch (used for random seed).

  • random_sampling (bool) – Whether to use random sampling of pairs.

  • config (Config) – Configuration object containing parameters.

  • fps (List[ExplicitBitVect]) – List of fingerprints corresponding to the smiles.

  • mols (List[Mol]) – List of RDKit Mol objects corresponding to the smiles.

  • use_edit_distance (bool) – Whether to compute edit distance (True) or MCES (False).

Returns:

A 2D numpy array with each row containing (index1, index2, distance).

Return type:

np.ndarray
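
To illustrate the documented return layout, the sketch below builds rows of (index1, index2, distance) for one batch. It does not call SIMBA: toy_distance, build_batch_rows, and the pairing strategy (one fixed index paired with randomly drawn partners, seeded by identifier) are assumptions made for this example, and a list of tuples stands in for the 2D NumPy array.

```python
import random


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for an edit-distance or MCES solver."""
    return float(abs(len(a) - len(b)))


def build_batch_rows(smiles, sampled_index, batch_size, identifier):
    """Return rows shaped like (index1, index2, distance)."""
    rng = random.Random(identifier)  # assumption: identifier doubles as the seed
    rows = []
    for _ in range(batch_size):
        partner = rng.randrange(len(smiles))  # assumed random pairing
        dist = toy_distance(smiles[sampled_index], smiles[partner])
        rows.append((sampled_index, partner, dist))
    return rows


rows = build_batch_rows(
    ["CCO", "CCCO", "c1ccccc1"], sampled_index=0, batch_size=4, identifier=7
)
```

Seeding from the identifier makes a batch reproducible, which is presumably why the parameter is described as being used for the random seed.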

simba.core.chemistry.edit_distance.edit_distance.get_number_of_modification_edges(mol, substructure)[source]
simba.core.chemistry.edit_distance.edit_distance.get_edit_distance_from_smiles(smiles1, smiles2, return_nans=True)[source]
simba.core.chemistry.edit_distance.edit_distance.simba_get_edit_distance(mol1, mol2, return_nans=True)[source]

Calculate the edit distance between two molecules.

Parameters:
  • mol1 (Mol) – First molecule.

  • mol2 (Mol) – Second molecule.

  • return_nans (bool, optional) – Whether to return NaN for dissimilar molecules (default: True).

Returns:

Edit distance between mol1 and mol2.

Return type:

float or np.nan
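
Because the result may be NaN, downstream code usually has to filter or mask it before aggregating. The sketch below shows one way to do that on rows of (index1, index2, distance); drop_nan_distances is a hypothetical helper, not part of SIMBA, and the sample rows are made up for illustration. Note that NaN never compares equal to itself, so math.isnan is used rather than ==.

```python
import math


def drop_nan_distances(rows):
    """Keep only (i, j, distance) rows whose distance is a real number."""
    return [row for row in rows if not math.isnan(row[2])]


# Illustrative rows; the NaN marks a pair for which no distance was returned.
rows = [(0, 1, 2.0), (0, 2, float("nan")), (1, 2, 4.0)]
clean = drop_nan_distances(rows)
```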

simba.core.chemistry.edit_distance.edit_distance.return_mol(smiles)[source]
simba.core.chemistry.edit_distance.edit_distance.simba_solve_pair_edit_distance(s0, s1, fp0, fp1, mol0, mol1)[source]
simba.core.chemistry.edit_distance.edit_distance.simba_solve_pair_mces(s0, s1, fp0, fp1, mol0, mol1, threshold, TIME_LIMIT=2)[source]
simba.core.chemistry.edit_distance.edit_distance.get_data(data, index, batch_count)[source]

Molecular Utilities

GNPS Utils - Molecule Utils

This module contains utility functions for molecules and molecule modifications, based on the RDKit library.

simba.core.chemistry.edit_distance.mol_utils.get_modification_nodes(mol1, mol2, in_mol1=True)[source]

Calculates the modification sites between two molecules when one molecule is a substructure of the other molecule

Input:
mol1:

first molecule

mol2:

second molecule

in_mol1:

bool; if True, the modification sites are reported in mol1; if False, they are reported in mol2

Output:

list of modification sites

simba.core.chemistry.edit_distance.mol_utils.get_modification_edges(mol1, mol2, only_outward_edges=False)[source]

Calculates the modification edges between two molecules when one molecule is a substructure of the other molecule

Input:
mol1:

first molecule

mol2:

second molecule

only_outward_edges:

bool, if True, only the modification edges that go from atoms in the substructure to atoms outside the substructure are returned

Output:

list of the modification edges in the parent molecule, as tuples of atom indices
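
The idea of a modification edge can be sketched on plain bond lists, without RDKit. In this sketch, which is one plausible reading of the description and not the library's implementation, an edge of the parent is a modification edge if it leaves the matched substructure (an outward edge) or if it joins two matched atoms but has no counterpart in the substructure. The function name and the match argument (match[i] is the parent atom matched to substructure atom i) are assumptions for illustration.

```python
def modification_edges(parent_bonds, sub_bonds, match, only_outward_edges=False):
    """
    parent_bonds: (a, b) atom-index pairs in the parent molecule.
    sub_bonds: (a, b) pairs in the substructure's own numbering.
    match: match[i] is the parent atom index matched to substructure atom i.
    """
    matched = set(match)
    # Substructure bonds translated into the parent's numbering.
    mapped = {frozenset((match[a], match[b])) for a, b in sub_bonds}
    outward, inside = [], []
    for a, b in parent_bonds:
        if a in matched and b in matched:
            if frozenset((a, b)) not in mapped:
                inside.append((a, b))  # extra bond between matched atoms
        elif a in matched or b in matched:
            outward.append((a, b))  # bond leaving the substructure
    return outward if only_outward_edges else outward + inside


parent = [(0, 1), (1, 2), (2, 3), (0, 2)]
sub = [(0, 1), (1, 2)]
all_edges = modification_edges(parent, sub, match=(0, 1, 2))
# [(2, 3), (0, 2)]
outward_only = modification_edges(parent, sub, match=(0, 1, 2), only_outward_edges=True)
# [(2, 3)]
```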

simba.core.chemistry.edit_distance.mol_utils.get_edit_distance(mol1, mol2)[source]

Calculates the edit distance between mol1 and mol2.

Input:
mol1:

first molecule

mol2:

second molecule

Output:
edit_distance:

edit distance between mol1 and mol2

simba.core.chemistry.edit_distance.mol_utils.get_edit_distance_detailed(mol1, mol2, mcs=None)[source]

Calculates the edit distance between mol1 and mol2 in detail, returning the removed and added modification edges.

Input:
mol1:

first molecule

mol2:

second molecule

mcs:

the maximum common substructure between mol1 and mol2

Output:
removed edges:

the removed modification edges

added edges:

the added modification edges
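
A minimal sketch of how removed and added edges relate to a scalar edit distance, assuming both molecules are already expressed in a shared atom numbering (which stands in for the maximum-common-substructure alignment) and assuming the distance is the number of removed edges plus the number of added edges. The function name is hypothetical and the code operates on plain bond lists, not RDKit molecules.

```python
def removed_and_added_edges(bonds1, bonds2):
    """Return (removed, added) for two bond lists over a shared atom numbering."""
    e1 = {frozenset(b) for b in bonds1}
    e2 = {frozenset(b) for b in bonds2}
    removed = sorted(tuple(sorted(e)) for e in e1 - e2)  # only in molecule 1
    added = sorted(tuple(sorted(e)) for e in e2 - e1)    # only in molecule 2
    return removed, added


removed, added = removed_and_added_edges(
    [(0, 1), (1, 2), (2, 3)],
    [(0, 1), (1, 2), (2, 4), (4, 5)],
)
distance = len(removed) + len(added)  # 1 removed + 2 added = 3
```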

simba.core.chemistry.edit_distance.mol_utils.get_transition(input1, input2)[source]

Calculates the transition between the two input molecules (referred to as mol1 and mol2 in the output description).

Input:
input1:

first molecule

input2:

second molecule

Output:
result:

a dictionary with the following keys:
  • 'merged_mol': the merged molecule
  • 'common_bonds': the common bonds between mol1 and mol2
  • 'common_atoms': the common atoms between mol1 and mol2
  • 'removed_atoms': the removed atoms from mol1
  • 'added_atoms': the added atoms from mol2
  • 'modified_added_edges_inside': the added edges inside the common substructure
  • 'modified_added_edges_bridge': the added edges between the common substructure and the added atoms
  • 'modified_removed_edges_inside': the removed edges inside the common substructure
  • 'modified_removed_edges_bridge': the removed edges between the common substructure and the removed atoms
  • 'added_edges': the added edges that are not modification edges
  • 'removed_edges': the removed edges that are not modification edges

simba.core.chemistry.edit_distance.mol_utils.get_modification_graph(main_struct, sub_struct)[source]

Calculates the substructure difference between main_struct and sub_struct when there is exactly one modification edge. If there are multiple modification edges, one of the modifications is returned at random.

Input:
main_struct:

main molecule

sub_struct:

substructure molecule

Output:
all_modifications:

a list of the modification structures; each modification is a tuple of:
  1. the modified subgraph molecule (as an RDKit editable molecule)
  2. a dictionary that maps the wildcard atom indices in the subgraph to their true indices in the main molecule
  3. the SMARTS representation of the modification

simba.core.chemistry.edit_distance.mol_utils.attach_mols(main_mol, attachment_mol, attach_location_main, attach_location_attachment, bond_type)[source]

Attaches the attachment structure to the main molecule by bonding atom attach_location_main of the main molecule to atom attach_location_attachment of the attachment molecule with a bond of type bond_type.

Input:
main_mol:

rdkit molecule of the main molecule

attachment_mol:

rdkit molecule of the attachment molecule

attach_location_main:

the index of the atom in the main molecule where the attachment should be done

attach_location_attachment:

the index of the atom in the attachment molecule where the attachment should be done

bond_type:

the type of the bond between the main molecule and the attachment molecule

Output:
new_mol:

the new molecule after attachment
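
The index bookkeeping behind such an attachment can be sketched on plain atom and bond lists. This is an illustration under stated assumptions, not the library's code: attach is a hypothetical helper, bonds are (begin, end, type) tuples, and the attachment's atoms are assumed to be renumbered to follow the main molecule's atoms, which is also how RDKit's Chem.CombineMols orders atoms before a bond is added with RWMol.AddBond.

```python
def attach(main_atoms, main_bonds, att_atoms, att_bonds, loc_main, loc_att, bond_type):
    """Join two molecular graphs with one new bond and return (atoms, bonds)."""
    offset = len(main_atoms)  # attachment atoms are renumbered after the main atoms
    atoms = list(main_atoms) + list(att_atoms)
    bonds = list(main_bonds)
    bonds += [(a + offset, b + offset, t) for a, b, t in att_bonds]
    # The new bond links the chosen atom in each fragment.
    bonds.append((loc_main, loc_att + offset, bond_type))
    return atoms, bonds


# Attach an oxygen to the second carbon of a two-carbon chain.
atoms, bonds = attach(["C", "C"], [(0, 1, "SINGLE")], ["O"], [], 1, 0, "SINGLE")
# atoms == ["C", "C", "O"]; bonds == [(0, 1, "SINGLE"), (1, 2, "SINGLE")]
```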

simba.core.chemistry.edit_distance.mol_utils.generate_possible_stuctures(main_struct, sub_struct)[source]

Generates all possible structures after attaching sub_struct to main_struct.

Input:
main_struct:

main molecule

sub_struct:

substructure molecule

Output:
list of possible_structures:

all possible structures after attachment with the index of the atom

MCES (Maximum Common Edge Substructure)

class simba.core.chemistry.mces_loader.load_mces.LoadMCES[source]

Bases: object

static find_file(directory_path, prefix)[source]

Searches for a .pkl file in the given directory and returns the path of the first one found.

Args: directory_path (str): The path of the directory to search in.

Returns: str: The path of the first .pkl file found, or None if no such file exists.
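
A minimal standard-library sketch of this kind of lookup follows. The role of prefix is not documented above, so the sketch assumes it filters on the start of the file name; the function name find_first_pkl and the alphabetical ordering of candidates are likewise assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def find_first_pkl(directory_path: str, prefix: str = "") -> Optional[str]:
    """Return the first .pkl file whose name starts with prefix, or None."""
    # Sorting makes "first" deterministic; the real ordering is not documented.
    candidates = sorted(Path(directory_path).glob(f"{prefix}*.pkl"))
    return str(candidates[0]) if candidates else None
```

Returning None rather than raising lets the caller decide whether a missing file is an error.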

static load_raw_data(directory_path, prefix, partitions=10000000)[source]

Load the data for inspection purposes.

static merge_numpy_arrays_mces(directory_path, prefix, remove_percentage=0.9)[source]

Load the NumPy arrays containing the data and apply normalization for training.

static add_high_similarity_pairs_edit_distance(merged_array)[source]
static merge_numpy_arrays_edit_distance(directory_path, prefix, remove_percentage=0.9)[source]

Load the NumPy arrays containing the data and apply normalization.

static merge_numpy_arrays_multitask(directory_path, prefix, remove_percentage=0.0, add_high_similarity_pairs=False, normalize_mces=True, normalize_ed=True)[source]

Load the NumPy arrays containing the data and apply normalization.

static merge_numpy_arrays(directory_path, prefix, use_edit_distance, use_multitask=False, add_high_similarity_pairs=False, remove_percentage=0, normalize_mces=True, normalize_ed=True)[source]

Load the NumPy arrays containing the data and apply normalization.

static remove_excess_low_pairs(indexes_tani, remove_percentage=0.95, max_value=5, target_column=2)[source]

Remove a fraction of the low pairs, given by remove_percentage (default 0.95), to reduce the amount of data loaded.
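
The downsampling idea can be sketched with the standard library alone. Several points here are assumptions, not documented behavior: "low pairs" is read as low-similarity pairs, meaning rows whose target column is at or above max_value; all other rows are kept; and the seed parameter is a hypothetical addition for reproducibility. The helper name thin_low_pairs is made up so that it is not mistaken for the library function, and lists of tuples stand in for NumPy arrays.

```python
import random


def thin_low_pairs(rows, remove_percentage=0.95, max_value=5, target_column=2, seed=0):
    """Keep every row below max_value and a random share of the remaining rows."""
    high = [r for r in rows if r[target_column] < max_value]
    low = [r for r in rows if r[target_column] >= max_value]
    keep = len(low) - int(remove_percentage * len(low))  # rows of `low` to retain
    rng = random.Random(seed)  # hypothetical seed, for a repeatable sample
    return high + rng.sample(low, keep)


# 100 illustrative rows with distances cycling through 0..9.
rows = [(i, i + 1, float(i % 10)) for i in range(100)]
thinned = thin_low_pairs(rows)
```

With these inputs, 50 rows fall below max_value and are all kept, while 3 of the other 50 survive the 95% removal.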

static normalize_ed(ed, max_ed=5)[source]
static normalize_mces20(mcs20, max_value, remove_negative_values=True)[source]
static load_mces_20_data(directory_path, prefix, number_folders)[source]

Load the MCES data computed with threshold 20 from multiple folders.