Data API

Data Loading

class simba.core.data.loaders.LoadData[source]

Bases: object

get_spectra(scan_nrs: Sequence[int] = None, compute_classes=False, config=None, use_gnps_format=True, use_only_protonized_adducts=True) → Iterator[SpectrumExt][source]

Get the MS/MS spectra from the given MGF file, optionally filtering by scan number.

Parameters:

source (Union[IO, str]) – The MGF source (file name or open file object) from which the spectra are read.
scan_nrs (Sequence[int]) – Only read spectra with the given scan numbers. If None, no filtering on scan number is performed.
compute_classes (bool) – Whether to compute chemical superclass, class and subclass of the molecules using Classyfire.
config (Config) – Configuration object containing parameters for preprocessing.
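use_gnps_format (bool) – Whether the MGF file follows the GNPS format. If False, it is assumed to follow the Janssen format.
use_only_protonized_adducts (bool) – Whether to filter spectra to only include those with protonated adducts ([M+H]+).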

Returns:

An iterator over the requested spectra in the given file.

Return type:

Iterator[SpectrumExt]
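
The scan-number filtering described above can be sketched as a generator. This is an illustration only, not the library's implementation: ToySpectrum and filter_by_scan are hypothetical names, and the stand-in class carries only the fields the sketch needs.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Sequence


@dataclass
class ToySpectrum:
    """Hypothetical stand-in for SpectrumExt with only the fields used here."""
    scan_nr: int
    precursor_mz: float


def filter_by_scan(
    spectra: Iterable[ToySpectrum],
    scan_nrs: Optional[Sequence[int]] = None,
) -> Iterator[ToySpectrum]:
    """Yield spectra lazily; scan_nrs=None disables the filter."""
    wanted = None if scan_nrs is None else set(scan_nrs)
    for spectrum in spectra:
        if wanted is None or spectrum.scan_nr in wanted:
            yield spectrum


spectra = [ToySpectrum(1, 100.0), ToySpectrum(2, 200.0), ToySpectrum(3, 300.0)]
selected = list(filter_by_scan(spectra, scan_nrs=[1, 3]))
unfiltered = list(filter_by_scan(spectra))
```

Converting scan_nrs to a set keeps each membership check constant-time, which matters when the requested list is long.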

static get_precursor_mz(spectrum)[source]

static default_filters(spectrum: SpectrumExt, config: Config)[source]

static is_valid_spectrum_janssen(spectrum: SpectrumExt, config: Config)[source]

static is_valid_spectrum_gnps(spectrum: SpectrumExt, config: Config)[source]

get_all_spectra_mgf(num_samples: int = -1, compute_classes: bool = False, use_tqdm: bool = True, config=None, use_gnps_format: bool = True, use_only_protonized_adducts=True) → list[SpectrumExt][source]

Get the MS/MS spectra from the given MGF file, optionally limiting the number of spectra read.

Parameters:

file (Union[IO, str]) – The MGF file (file name or open file object) from which the spectra are read.
num_samples (int) – The maximum number of spectra to read. If -1, all spectra are read.
compute_classes (bool) – Whether to compute chemical superclass, class and subclass of the molecules using Classyfire.
use_tqdm (bool) – Whether to display a progress bar using tqdm.
config (Config) – Configuration object containing parameters for preprocessing.
use_gnps_format (bool) – Whether the MGF file follows the GNPS format. If False, it is assumed to follow the Janssen format.
use_only_protonized_adducts (bool) – Whether to filter spectra to only include those with protonated adducts ([M+H]+).

Returns:

A list of the parsed spectra.

Return type:

List[SpectrumExt]
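
As a rough sketch of how num_samples and use_only_protonized_adducts could interact, the example below filters plain dictionaries and then caps the result. The record layout, the "adduct" key, the helper names, and the filter-then-cap order are all assumptions made for illustration; the library may apply these steps differently.

```python
from itertools import islice
from typing import Iterable, Iterator


def only_protonated(records: Iterable[dict]) -> Iterator[dict]:
    """Keep records whose adduct is [M+H]+ (the 'adduct' key is hypothetical)."""
    return (r for r in records if r.get("adduct") == "[M+H]+")


def take(records: Iterable[dict], num_samples: int = -1) -> list:
    """Return at most num_samples records; -1 means no limit."""
    limit = None if num_samples == -1 else num_samples
    return list(islice(records, limit))


records = [
    {"id": "a", "adduct": "[M+H]+"},
    {"id": "b", "adduct": "[M+Na]+"},
    {"id": "c", "adduct": "[M+H]+"},
    {"id": "d", "adduct": "[M+H]+"},
]

first_two = take(only_protonated(records), num_samples=2)
everything = take(only_protonated(records), num_samples=-1)
```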

get_all_spectra_nist(num_samples=10, compute_classes=False, use_tqdm=True, config=None, initial_line_number=0)[source]

Get the MS/MS spectra from the given MGF file, optionally filtering by scan number.

Parameters:

source (Union[IO, str]) – The MGF source (file name or open file object) from which the spectra are read.
scan_nrs (Sequence[int]) – Only read spectra with the given scan numbers. If None, no filtering on scan number is performed.

Returns:

An iterator over the requested spectra in the given file.

Return type:

Iterator[SpectrumExt]

get_all_spectra_casmi(num_samples=10, compute_classes=False, use_tqdm=True, config=None, initial_line_number=0)[source]

get_all_spectra(num_samples: int = 10, compute_classes: bool = False, use_tqdm: bool = True, use_nist: bool = False, config: Config = None, use_janssen: bool = False) → list[SpectrumExt][source]

Get the MS/MS spectra from the given MGF or NIST file, optionally limiting the number of spectra read.

Parameters:

file (Union[IO, str]) – The MGF or NIST file (file name or open file object) from which the spectra are read.
num_samples (int) – The maximum number of spectra to read. If -1, all spectra are read.
compute_classes (bool) – Whether to compute chemical superclass, class and subclass of the molecules using Classyfire.
use_tqdm (bool) – Whether to display a progress bar using tqdm.
use_nist (bool) – Whether the file is a NIST file. If False, it is assumed to be an MGF file.
config (Config) – Configuration object containing parameters for preprocessing.
use_janssen (bool) – Whether the MGF file follows the Janssen format. If False, it is assumed to follow the GNPS format.

Returns:

A list of the parsed spectra.

Return type:

List[SpectrumExt]

Data Preprocessing

class simba.core.data.preprocessing.PreprocessingUtils[source]

Bases: object

static is_centroid(intensity)[source]

static order_by_charge(spectra: list[SpectrumExt]) → dict[int, list[SpectrumExt]][source]

Order spectra by their precursor charge.

Parameters:

spectra (List[SpectrumExt]) – List of SpectrumExt objects to be ordered.

Returns:

A dictionary where keys are precursor charges and values are lists of SpectrumExt objects with that charge.

Return type:

Dict[int, List[SpectrumExt]]
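
The return shape described above (charge as key, list of spectra as value) can be sketched with a defaultdict. ToySpectrum and group_by_charge are hypothetical names used only for this example; the real method operates on SpectrumExt objects.

```python
from collections import defaultdict
from typing import NamedTuple


class ToySpectrum(NamedTuple):
    """Hypothetical stand-in for SpectrumExt."""
    identifier: str
    precursor_charge: int


def group_by_charge(spectra):
    """Map each precursor charge to the spectra carrying that charge."""
    groups = defaultdict(list)
    for spectrum in spectra:
        groups[spectrum.precursor_charge].append(spectrum)
    return dict(groups)


spectra = [ToySpectrum("s1", 1), ToySpectrum("s2", 2), ToySpectrum("s3", 1)]
by_charge = group_by_charge(spectra)
```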

static order_spectra_by_mz(spectra: list[SpectrumExt]) → list[SpectrumExt][source]

Order spectra by their precursor m/z.

Parameters:

spectra (List[SpectrumExt]) – List of SpectrumExt objects to be ordered.

Returns:

A list of SpectrumExt objects ordered by their precursor m/z.

Return type:

List[SpectrumExt]
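
Ordering by precursor m/z amounts to a keyed sort. The sketch below assumes ascending order and uses a hypothetical ToySpectrum in place of SpectrumExt; whether the library sorts in place or returns a new list is not stated in this reference.

```python
from typing import NamedTuple


class ToySpectrum(NamedTuple):
    """Hypothetical stand-in for SpectrumExt."""
    identifier: str
    precursor_mz: float


spectra = [
    ToySpectrum("s1", 412.2),
    ToySpectrum("s2", 180.1),
    ToySpectrum("s3", 305.7),
]

# sorted() returns a new list and leaves the input untouched.
ordered = sorted(spectra, key=lambda s: s.precursor_mz)
```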

get_class(smiles: str) → tuple[str | None, str | None, str | None][source]

Get the superclass, class and subclass of a molecule using Classyfire. Either InChI or SMILES can be used as input.

Parameters:

inchi (str) – The InChI string of the molecule.
smiles (str) – The SMILES string of the molecule.

Returns:

A tuple (superclass, class, subclass) if successful, (None, None, None) otherwise.

Return type:

tuple
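
Because a failed lookup returns (None, None, None) rather than raising, callers need to check the tuple before using it. The sketch below shows one way to do that; toy_get_class and its lookup table are hypothetical placeholders for the Classyfire call, and the label strings are dummy values, not real classifications.

```python
from typing import Optional, Tuple

Classes = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical table standing in for the Classyfire service; labels are dummies.
_FAKE_LOOKUP = {
    "CCO": ("superclass-x", "class-x", "subclass-x"),
}


def toy_get_class(smiles: str) -> Classes:
    """Return (superclass, class, subclass), or three Nones when unknown."""
    return _FAKE_LOOKUP.get(smiles, (None, None, None))


def is_classified(classes: Classes) -> bool:
    """True only when every level of the classification is present."""
    return all(level is not None for level in classes)


found = toy_get_class("CCO")
missing = toy_get_class("not-a-molecule")
```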

class simba.core.data.preprocessing_simba.PreprocessingSimba[source]

Bases: object

load_spectra(config: Config, min_peaks: int = 6, n_samples: int = 500000, use_gnps_format: bool = False, use_only_protonized_adducts: bool = True) → list[SpectrumExt][source]

Load and preprocess spectra from a file.

Parameters:

file_name (str) – The path to the file containing the spectra.
config (Config) – Configuration object containing parameters.
min_peaks (int, optional) – The minimum number of peaks a spectrum must have to be included, by default 6.
n_samples (int, optional) – The number of samples to load, by default 500000.
use_gnps_format (bool, optional) – Whether to use GNPS format for loading, by default False.
use_only_protonized_adducts (bool, optional) – Whether to use only protonized adducts, by default True.

Returns:

A list of preprocessed SpectrumExt objects.

Return type:

List[SpectrumExt]
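
The min_peaks criterion can be illustrated with a simple length check. This sketch assumes a spectrum's peak count equals the number of m/z values it holds and that the threshold is inclusive; has_enough_peaks is a hypothetical helper, not part of the library.

```python
def has_enough_peaks(mz_values, min_peaks=6):
    """Return True when the spectrum has at least min_peaks fragment peaks."""
    return len(mz_values) >= min_peaks


# Toy m/z arrays keyed by a made-up spectrum name.
spectra = {
    "sparse": [100.0, 150.0, 200.0],
    "rich": [50.0 + 10.0 * i for i in range(8)],
}

kept = [name for name, mz in spectra.items() if has_enough_peaks(mz)]
```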

Spectrum Representation

class simba.core.data.spectrum.SpectrumExt(identifier: str, precursor_mz: float, precursor_charge: int, mz: ndarray | Iterable, intensity: ndarray | Iterable, retention_time: float, params: dict, library: str, inchi: str, smiles: str, ionmode: str, adduct_mass: float, ce: float, ion_activation: str, ionization_method: str, bms: str, superclass: str, classe: str, subclass: str, inchi_key: str = None, spectrum_hash: str = None)[source]

Bases: MsmsSpectrum

Extended spectrum class that incorporates the binned vector.

__init__(identifier: str, precursor_mz: float, precursor_charge: int, mz: ndarray | Iterable, intensity: ndarray | Iterable, retention_time: float, params: dict, library: str, inchi: str, smiles: str, ionmode: str, adduct_mass: float, ce: float, ion_activation: str, ionization_method: str, bms: str, superclass: str, classe: str, subclass: str, inchi_key: str = None, spectrum_hash: str = None)[source]

Instantiate a new MsmsSpectrum consisting of fragment peaks.

Parameters:

identifier (str) – Spectrum identifier. It is recommended to use a unique and interpretable identifier, such as a Universal Spectrum Identifier (USI) as defined by the Proteomics Standards Initiative.
precursor_mz (float) – Precursor ion m/z.
precursor_charge (int) – Precursor ion charge.
mz (array_like) – M/z values of the fragment peaks.
intensity (array_like) – Intensities of the corresponding fragment peaks in mz.
retention_time (float, optional) – Retention time at which the spectrum was acquired (the default is np.nan, which indicates that retention time is unspecified/unknown).

set_params(params)[source]

set_spectrum_vector(spectrum_vector)[source]

set_murcko_scaffold(murcko_scaffold)[source]

set_smiles(smiles)[source]

set_max_peak(max_peak)[source]: set the maximum amplitude in the spectrum

Ground Truth

class simba.core.data.ground_truth.GroundTruth[source]

Bases: object

compute_edit_distance(spectra1, max_value=5)[source]

compute_mces(spectra1, threshold=20)[source]

compute_tanimoto(spectra1)[source]

Dataset Classes

Base Dataset

class simba.core.models.transformers.CustomDatasetEncoder.CustomDatasetEncoder(data)[source]

Bases: Dataset

__init__(data)[source]

Unique Dataset

class simba.core.models.transformers.CustomDatasetUnique.CustomDatasetUnique(your_dict, training=False, prob_aug=0.1, mz=None, intensity=None, precursor_mass=None, precursor_charge=None, df_smiles=None)[source]

Bases: Dataset

__init__(your_dict, training=False, prob_aug=0.1, mz=None, intensity=None, precursor_mass=None, precursor_charge=None, df_smiles=None)[source]

get_original_dictionary(max_num_peaks=100)[source]: get a dictionary containing the spectrums mapped

Multitasking Dataset

class simba.core.models.transformers.CustomDatasetMultitasking.CustomDatasetMultitasking(your_dict, training=False, prob_aug=1.0, mz=None, intensity=None, precursor_mass=None, precursor_charge=None, df_smiles=None, use_fingerprints=False, fingerprint_0=None, max_num_peaks=None, use_adduct=False, ionmode=None, adduct_mass=None, use_ce=False, ce=None)[source]

Bases: Dataset

__init__(your_dict, training=False, prob_aug=1.0, mz=None, intensity=None, precursor_mass=None, precursor_charge=None, df_smiles=None, use_fingerprints=False, fingerprint_0=None, max_num_peaks=None, use_adduct=False, ionmode=None, adduct_mass=None, use_ce=False, ce=None)[source]

get_original_dictionary(max_num_peaks=100)[source]: get a dictionary containing the spectrums mapped