Quick Start
This guide will get you up and running with SIMBA.
Overview
SIMBA ships with a model pretrained on spectra from MassSpecGym. The model operates in positive ionization mode on protonated adducts ([M+H]+).
A typical SIMBA workflow consists of:
Computing Structural Similarities: Predict the edit distance and MCES between the molecules underlying pairs of spectra
Analog Discovery: Find structurally similar molecules in a reference library
Training Custom Models: Train SIMBA on your own MS/MS data (optional)
Computing Structural Similarities
Follow the Run Inference Notebook for a comprehensive tutorial:
Runtime: < 10 minutes (including model/data download)
Example data: data folder
Supported format: .mgf
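For orientation, an MGF file is plain text in which each spectrum sits between BEGIN IONS and END IONS, with KEY=VALUE metadata lines followed by m/z and intensity pairs. The sketch below is a minimal standard-library reader, not SIMBA's own loader. The metadata keys in the sample (PEPMASS, CHARGE, SMILES) and the peak values are illustrative assumptions; which fields SIMBA actually requires is not stated in this guide.

```python
from typing import Dict, List, Tuple

# Illustrative spectrum only; keys and peak values are assumptions of this sketch.
SAMPLE_MGF = """\
BEGIN IONS
PEPMASS=195.0877
CHARGE=1+
SMILES=CN1C=NC2=C1C(=O)N(C(=O)N2C)C
110.0713 35.0
138.0662 100.0
195.0877 60.0
END IONS
"""

Spectrum = Tuple[Dict[str, str], List[Tuple[float, float]]]


def parse_mgf(text: str) -> List[Spectrum]:
    """Parse MGF text into (metadata, peaks) pairs, one per spectrum."""
    spectra: List[Spectrum] = []
    meta: Dict[str, str] = {}
    peaks: List[Tuple[float, float]] = []
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            meta, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                spectra.append((meta, peaks))
            inside = False
        elif inside and "=" in line and not line[0].isdigit():
            # Metadata line: split on the first "=" only, since values
            # such as SMILES strings may themselves contain "=".
            key, value = line.split("=", 1)
            meta[key.strip().upper()] = value.strip()
        elif inside:
            # Peak line: m/z and intensity separated by whitespace.
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))
    return spectra


spectra = parse_mgf(SAMPLE_MGF)
meta, peaks = spectra[0]
print(len(spectra), meta["PEPMASS"], len(peaks))
```

A reader like this is handy for sanity-checking a file (number of spectra, missing metadata) before handing it to SIMBA.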
Performance
Measured on an Apple M3 Pro (36 GB RAM):
Embedding computation: ~100,000 spectra in ~1 minute
Similarity computation: 1 query vs. 100,000 spectra in ~10 seconds
SIMBA caches computed embeddings, which significantly speeds up repeated searches against the same library.
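As a rough planning aid, the figures above can be extrapolated to other library sizes. The sketch below assumes that both embedding and similarity computation scale linearly with the number of spectra, which is an assumption of this example rather than a measured property; real timings depend on hardware, batch size, and the spectra themselves. The function name and its `cached` flag are hypothetical.

```python
# Reference figures quoted above (Apple M3 Pro, 36 GB RAM).
EMBED_SPECTRA_PER_SECOND = 100_000 / 60.0  # ~100,000 spectra in ~1 minute
COMPARISONS_PER_SECOND = 100_000 / 10.0    # 1 query vs. 100,000 spectra in ~10 s


def estimate_seconds(n_queries: int, n_reference: int, cached: bool = False) -> float:
    """Estimate wall-clock seconds for a library search under linear scaling.

    If `cached` is True, reference embeddings are assumed to be available
    already, so only the query spectra need to be embedded.
    """
    to_embed = n_queries + (0 if cached else n_reference)
    embed_time = to_embed / EMBED_SPECTRA_PER_SECOND
    compare_time = (n_queries * n_reference) / COMPARISONS_PER_SECOND
    return embed_time + compare_time


first_run = estimate_seconds(n_queries=10, n_reference=500_000)
repeat_run = estimate_seconds(n_queries=10, n_reference=500_000, cached=True)
print(f"first run: ~{first_run / 60:.1f} min, repeat run: ~{repeat_run / 60:.1f} min")
```

The gap between the two estimates is the embedding time for the reference library, which is the part the cache is meant to save.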
Analog Discovery
Run analog discovery to find the molecules in a reference library that are most structurally similar to a query spectrum:
simba analog-discovery \
--model-path /path/to/model.ckpt \
--query-spectra /path/to/query.mgf \
--reference-spectra /path/to/reference_library.mgf \
--output-dir /path/to/output \
--query-index 0 \
--top-k 10 \
--device cpu \
--compute-ground-truth
Parameters:
--model-path: Path to trained SIMBA model checkpoint (.ckpt file)
--query-spectra: Path to query spectra file (.mgf or .pkl format)
--reference-spectra: Path to reference library spectra file (.mgf or .pkl format)
--output-dir: Directory where results will be saved
--query-index: Index of the query spectrum to analyze (default: 0)
--top-k: Number of top matches to return (default: 10)
--device: Hardware device: cpu or gpu (default: cpu)
--batch-size: Batch size for processing (default: 32)
--cache-embeddings/--no-cache-embeddings: Cache embeddings for faster repeated searches (default: True)
--use-gnps-format/--no-use-gnps-format: Whether spectra files use GNPS format (default: False)
--compute-ground-truth: Compute ground truth edit distance and MCES for validation
--save-rankings: Save complete ranking matrix to file
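Because --query-index selects a single spectrum, analyzing several queries means invoking the command once per index. The sketch below builds those invocations from Python using only the flags documented above. The helper name, the per-query output subdirectories, and the placeholder paths are choices of this sketch, not part of SIMBA. It prints the commands and leaves execution to the caller.

```python
import shlex
from pathlib import Path
from typing import List


def analog_discovery_cmd(model: str, query: str, reference: str,
                         output_dir: str, query_index: int,
                         top_k: int = 10, device: str = "cpu") -> List[str]:
    """Return the argv list for one `simba analog-discovery` call."""
    return [
        "simba", "analog-discovery",
        "--model-path", model,
        "--query-spectra", query,
        "--reference-spectra", reference,
        "--output-dir", output_dir,
        "--query-index", str(query_index),
        "--top-k", str(top_k),
        "--device", device,
    ]


# One command per query index, each writing to its own subdirectory
# (an assumption of this sketch) so results are not overwritten.
commands = [
    analog_discovery_cmd(
        model="/path/to/model.ckpt",
        query="/path/to/query.mgf",
        reference="/path/to/reference_library.mgf",
        output_dir=str(Path("/path/to/output") / f"query_{i}"),
        query_index=i,
    )
    for i in range(3)
]

for cmd in commands:
    print(shlex.join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.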
Output:
The command generates several files in the output directory:
results.json: Summary of top matches with predictions and ground truth
matches.csv: Detailed table of all matches
query_molecule.png: Structure of the query molecule
match_N_molecule.png: Structures of matched molecules
mirror_plot_match_N.png: Mirror plots comparing query and matched spectra
rankings.npy: Complete ranking matrix (if --save-rankings is used)
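To check a finished run, the output directory can be inspected with the standard library. The sketch below groups files by the names listed above. It assumes that N in match_N_molecule.png and mirror_plot_match_N.png is an integer, and it assumes nothing about the contents of results.json or matches.csv, whose schemas are not described here. The demo populates a temporary directory with empty placeholder files purely to show the report.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

FIXED_FILES = ["results.json", "matches.csv", "query_molecule.png", "rankings.npy"]
PER_MATCH_PATTERNS = {
    "molecule": re.compile(r"match_(\d+)_molecule\.png"),
    "mirror_plot": re.compile(r"mirror_plot_match_(\d+)\.png"),
}


def inventory(output_dir: Path) -> Dict[str, object]:
    """Report which expected files exist and which match indices have images."""
    names = sorted(p.name for p in output_dir.iterdir() if p.is_file())
    found: Dict[str, object] = {name: name in names for name in FIXED_FILES}
    for label, pattern in PER_MATCH_PATTERNS.items():
        indices: List[int] = sorted(
            int(m.group(1)) for m in map(pattern.fullmatch, names) if m
        )
        found[label] = indices
    return found


# Demo on placeholder files; point `inventory` at a real --output-dir instead.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for name in ["results.json", "matches.csv", "query_molecule.png",
                 "match_1_molecule.png", "mirror_plot_match_1.png"]:
        (out / name).touch()
    report = inventory(out)

print(report)
```

In the demo, rankings.npy is reported as missing, which is what to expect from a run without --save-rankings.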
For interactive exploration, use the Run Analog Discovery Notebook.
Training Custom Models
Step 1: Preprocess Data
simba preprocess \
--spectra-path /path/to/your/spectra.mgf \
--workspace /path/to/preprocessed_data \
--max-spectra-train 10000 \
--mapping-file-name mapping_unique_smiles.pkl \
--num-workers 0
Step 2: Train Model
simba train \
--checkpoint-dir checkpoints/ \
--preprocessing-dir preprocessing/ \
--preprocessing-pickle preprocessed_data.pkl \
--epochs 50 \
--accelerator gpu \
--batch-size 64
Step 3: Run Inference
simba inference \
--checkpoint-dir checkpoints/ \
--preprocessing-dir preprocessing/ \
--preprocessing-pickle test_data.pkl \
--batch-size 128 \
--accelerator gpu
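The three steps can be chained from a single script so that a failure stops the run early. The sketch below reuses the commands shown above verbatim and only prints them by default. The `run_pipeline` helper and its `dry_run` flag are hypothetical names introduced here, and the sketch makes no attempt to reconcile directory or file names between steps, so adjust the paths to your own layout.

```python
import shlex
import subprocess
from typing import List

STEPS: List[List[str]] = [
    ["simba", "preprocess",
     "--spectra-path", "/path/to/your/spectra.mgf",
     "--workspace", "/path/to/preprocessed_data",
     "--max-spectra-train", "10000",
     "--mapping-file-name", "mapping_unique_smiles.pkl",
     "--num-workers", "0"],
    ["simba", "train",
     "--checkpoint-dir", "checkpoints/",
     "--preprocessing-dir", "preprocessing/",
     "--preprocessing-pickle", "preprocessed_data.pkl",
     "--epochs", "50",
     "--accelerator", "gpu",
     "--batch-size", "64"],
    ["simba", "inference",
     "--checkpoint-dir", "checkpoints/",
     "--preprocessing-dir", "preprocessing/",
     "--preprocessing-pickle", "test_data.pkl",
     "--batch-size", "128",
     "--accelerator", "gpu"],
]


def run_pipeline(steps: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; when dry_run is False, also execute it in order.

    subprocess.run(check=True) raises CalledProcessError on a non-zero exit
    status, so later steps do not run after a failed one.
    """
    shown: List[str] = []
    for argv in steps:
        line = shlex.join(argv)
        print(line)
        shown.append(line)
        if not dry_run:
            subprocess.run(argv, check=True)
    return shown


printed = run_pipeline(STEPS)  # dry run: prints the commands only
```

Calling `run_pipeline(STEPS, dry_run=False)` would execute the steps for real, provided the `simba` command is installed and the paths exist.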