Quick Start

This guide will get you up and running with SIMBA.

Overview

SIMBA provides a model pretrained on spectra from MassSpecGym. The model operates on positive ionization mode spectra of protonated adducts ([M+H]+).

A typical SIMBA workflow consists of:

Computing Structural Similarities: Predict the edit distance and MCES (maximum common edge subgraph) between spectra
Analog Discovery: Find structurally similar molecules in a reference library
Training Custom Models: Train SIMBA on your own MS/MS data (optional)

Computing Structural Similarities

Follow the Run Inference Notebook for a comprehensive tutorial:

Runtime: < 10 minutes (including model/data download)
Example data: data folder
Supported format: .mgf
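
Since .mgf is the supported input format, it can be useful to confirm that a file parses cleanly before running the notebook. The following is a minimal standard-library sketch of a reader for the usual MGF layout (BEGIN IONS, KEY=VALUE header lines, whitespace-separated m/z and intensity pairs, END IONS). It is not SIMBA's own loader; the name read_mgf and the example spectrum are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Spectrum = Tuple[Dict[str, str], List[Tuple[float, float]]]


def read_mgf(lines: Iterable[str]) -> Iterator[Spectrum]:
    """Yield (headers, peaks) for each BEGIN IONS ... END IONS block.

    Hypothetical helper for sanity-checking input; not part of SIMBA.
    """
    headers: Dict[str, str] = {}
    peaks: List[Tuple[float, float]] = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line == "BEGIN IONS":
            headers, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield headers, peaks
            inside = False
        elif inside and "=" in line:
            # Header line; split on the first "=" only so values such as
            # SMILES strings containing "=" stay intact.
            key, value = line.split("=", 1)
            headers[key.strip().upper()] = value.strip()
        elif inside:
            parts = line.split()
            if len(parts) >= 2:
                peaks.append((float(parts[0]), float(parts[1])))


# Made-up example spectrum, for illustration only.
example_text = """BEGIN IONS
TITLE=example
PEPMASS=180.0634
SMILES=CC(=O)O
89.0 10.0
163.1 100.0
END IONS
"""

spectra = list(read_mgf(example_text.splitlines()))
print(f"{len(spectra)} spectrum parsed, {len(spectra[0][1])} peaks")
```

For a real file, pass an open file handle instead of the in-memory lines, for example read_mgf(open(path)).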

Performance

Measured on an Apple M3 Pro (36 GB RAM):

Embedding computation: ~100,000 spectra in ~1 minute
Similarity computation: 1 query vs. 100,000 spectra in ~10 seconds

SIMBA caches computed embeddings, which significantly speeds up repeated searches against the same library.
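
As a rough planning aid, the figures above can be turned into a back-of-envelope estimate. The sketch below assumes that both steps scale linearly with library size and number of queries, and that cached embeddings make the embedding cost zero on later runs. Both are simplifying assumptions, not measured behavior, and the rates only apply to hardware comparable to the machine quoted above. The function name is hypothetical.

```python
# Rates derived from the figures quoted above (Apple M3 Pro).
EMBED_SPECTRA_PER_SEC = 100_000 / 60     # ~100,000 spectra in ~1 minute
SIM_COMPARISONS_PER_SEC = 100_000 / 10   # 1 query vs. 100,000 spectra in ~10 s


def estimate_seconds(library_size: int, n_queries: int,
                     embeddings_cached: bool = False) -> float:
    """Rough runtime estimate, assuming linear scaling (an assumption).

    The cost of embedding the query spectra themselves is ignored.
    """
    embed = 0.0 if embeddings_cached else library_size / EMBED_SPECTRA_PER_SEC
    search = n_queries * library_size / SIM_COMPARISONS_PER_SEC
    return embed + search


first_run = estimate_seconds(100_000, n_queries=1)
cached_run = estimate_seconds(100_000, n_queries=1, embeddings_cached=True)
print(f"first run: ~{first_run:.0f} s, with cached embeddings: ~{cached_run:.0f} s")
```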

Analog Discovery

Run analog discovery to find molecules in a reference library that are structurally similar to a query spectrum:

simba analog-discovery \
  --model-path /path/to/model.ckpt \
  --query-spectra /path/to/query.mgf \
  --reference-spectra /path/to/reference_library.mgf \
  --output-dir /path/to/output \
  --query-index 0 \
  --top-k 10 \
  --device cpu \
  --compute-ground-truth

Parameters:

--model-path: Path to trained SIMBA model checkpoint (.ckpt file)
--query-spectra: Path to query spectra file (.mgf or .pkl format)
--reference-spectra: Path to reference library spectra file (.mgf or .pkl format)
--output-dir: Directory where results will be saved
--query-index: Index of the query spectrum to analyze (default: 0)
--top-k: Number of top matches to return (default: 10)
--device: Hardware device to run on, either cpu or gpu (default: cpu)
--batch-size: Batch size for processing (default: 32)
--cache-embeddings / --no-cache-embeddings: Cache embeddings for faster repeated searches (default: True)
--use-gnps-format / --no-use-gnps-format: Whether spectra files use GNPS format (default: False)
--compute-ground-truth: Compute ground truth edit distance and MCES for validation
--save-rankings: Save complete ranking matrix to file
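
When the command is launched from a script, for example once per query index, it may be convenient to assemble the argument list in Python. The sketch below uses only the flags documented above. The function name build_analog_discovery_cmd is hypothetical, and the example stops at printing the command rather than running it.

```python
import shlex
from typing import List


def build_analog_discovery_cmd(
    model_path: str,
    query_spectra: str,
    reference_spectra: str,
    output_dir: str,
    query_index: int = 0,
    top_k: int = 10,
    device: str = "cpu",
    compute_ground_truth: bool = False,
    save_rankings: bool = False,
) -> List[str]:
    """Build the argv list for `simba analog-discovery` (hypothetical helper)."""
    if device not in ("cpu", "gpu"):
        raise ValueError("device must be 'cpu' or 'gpu'")
    cmd = [
        "simba", "analog-discovery",
        "--model-path", model_path,
        "--query-spectra", query_spectra,
        "--reference-spectra", reference_spectra,
        "--output-dir", output_dir,
        "--query-index", str(query_index),
        "--top-k", str(top_k),
        "--device", device,
    ]
    # Boolean switches are appended only when requested.
    if compute_ground_truth:
        cmd.append("--compute-ground-truth")
    if save_rankings:
        cmd.append("--save-rankings")
    return cmd


cmd = build_analog_discovery_cmd(
    "/path/to/model.ckpt",
    "/path/to/query.mgf",
    "/path/to/reference_library.mgf",
    "/path/to/output",
    compute_ground_truth=True,
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```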

Output:

The command generates several files in the output directory:

results.json: Summary of top matches with predictions and ground truth
matches.csv: Detailed table of all matches
query_molecule.png: Structure of the query molecule
match_N_molecule.png: Structures of matched molecules
mirror_plot_match_N.png: Mirror plots comparing query and matched spectra
rankings.npy: Complete ranking matrix (if --save-rankings is used)
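
After a run, a quick check of the output directory can confirm which of the files listed above were written. This is a minimal sketch: the glob patterns assume that N in the per-match file names is replaced by a match number, and summarize_outputs is a hypothetical helper. The demo uses a temporary directory with empty placeholder files, not real SIMBA output.

```python
import tempfile
from pathlib import Path

FIXED_OUTPUTS = ["results.json", "matches.csv", "query_molecule.png"]
# Assumption: "N" in the documented names is a match number.
PER_MATCH_PATTERNS = ["match_*_molecule.png", "mirror_plot_match_*.png"]


def summarize_outputs(output_dir) -> dict:
    """Report which expected output files exist in output_dir."""
    out = Path(output_dir)
    present = [name for name in FIXED_OUTPUTS if (out / name).is_file()]
    missing = [name for name in FIXED_OUTPUTS if name not in present]
    per_match = {
        pattern: sorted(p.name for p in out.glob(pattern))
        for pattern in PER_MATCH_PATTERNS
    }
    return {
        "present": present,
        "missing": missing,
        "per_match": per_match,
        # Only written when --save-rankings is used.
        "rankings_saved": (out / "rankings.npy").is_file(),
    }


# Demo on placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("results.json", "matches.csv", "match_1_molecule.png"):
        (Path(tmp) / name).touch()
    demo = summarize_outputs(tmp)

print(demo)
```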

For interactive exploration, use the Run Analog Discovery Notebook.

Training Custom Models

Step 1: Preprocess Data

simba preprocess \
  --spectra-path /path/to/your/spectra.mgf \
  --workspace /path/to/preprocessed_data \
  --max-spectra-train 10000 \
  --mapping-file-name mapping_unique_smiles.pkl \
  --num-workers 0

Step 2: Train Model

simba train \
  --checkpoint-dir checkpoints/ \
  --preprocessing-dir preprocessing/ \
  --preprocessing-pickle preprocessed_data.pkl \
  --epochs 50 \
  --accelerator gpu \
  --batch-size 64

Step 3: Run Inference

simba inference \
  --checkpoint-dir checkpoints/ \
  --preprocessing-dir preprocessing/ \
  --preprocessing-pickle test_data.pkl \
  --batch-size 128 \
  --accelerator gpu
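
The three steps above can also be chained from a single script that stops at the first failing command. The following is a sketch under that assumption: it reuses the exact commands and placeholder paths shown above, run_steps is a hypothetical helper, and the demo uses a dry-run runner that records the commands instead of executing them.

```python
import subprocess
from typing import Callable, List, Sequence

STEPS: List[List[str]] = [
    ["simba", "preprocess",
     "--spectra-path", "/path/to/your/spectra.mgf",
     "--workspace", "/path/to/preprocessed_data",
     "--max-spectra-train", "10000",
     "--mapping-file-name", "mapping_unique_smiles.pkl",
     "--num-workers", "0"],
    ["simba", "train",
     "--checkpoint-dir", "checkpoints/",
     "--preprocessing-dir", "preprocessing/",
     "--preprocessing-pickle", "preprocessed_data.pkl",
     "--epochs", "50",
     "--accelerator", "gpu",
     "--batch-size", "64"],
    ["simba", "inference",
     "--checkpoint-dir", "checkpoints/",
     "--preprocessing-dir", "preprocessing/",
     "--preprocessing-pickle", "test_data.pkl",
     "--batch-size", "128",
     "--accelerator", "gpu"],
]


def run_steps(steps: Sequence[Sequence[str]],
              runner: Callable = subprocess.run) -> int:
    """Run commands in order; stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for argv in steps:
        result = runner(list(argv))
        if result.returncode != 0:
            break
        completed += 1
    return completed


class _DryRunResult:
    returncode = 0


recorded: List[List[str]] = []


def dry_runner(argv: List[str]) -> _DryRunResult:
    """Record the command instead of executing it."""
    recorded.append(argv)
    return _DryRunResult()


done = run_steps(STEPS, runner=dry_runner)
print(f"{done} of {len(STEPS)} steps would run")
```

Calling run_steps(STEPS) without the runner argument executes the commands with subprocess.run, which requires the simba command to be installed and on the PATH.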