# Quick Start

This guide will get you up and running with SIMBA.

## Overview

SIMBA provides a pretrained model trained on spectra from **MassSpecGym**. The model operates in positive ionization mode for protonated adducts.

A typical SIMBA workflow consists of:

1. **Computing Structural Similarities**: Predict edit distance and MCES between spectra
2. **Analog Discovery**: Find structurally similar molecules in a reference library
3. **Training Custom Models**: Train SIMBA on your own MS/MS data (optional)

## Computing Structural Similarities

Follow the [Run Inference Notebook](https://github.com/bittremieux-lab/simba/tree/main/notebooks/final_tutorials/run_inference.ipynb) for a comprehensive tutorial:

- **Runtime:** < 10 minutes (including model/data download)
- **Example data:** data folder
- **Supported format:** `.mgf`

### Performance

Using an Apple M3 Pro (36 GB RAM):

- **Embedding computation:** ~100,000 spectra in ~1 minute
- **Similarity computation:** 1 query vs. 100,000 spectra in ~10 seconds

SIMBA caches computed embeddings, significantly speeding repeated library searches.

## Analog Discovery

Perform analog discovery to find structurally similar molecules:

```bash
simba analog-discovery \
  --model-path /path/to/model.ckpt \
  --query-spectra /path/to/query.mgf \
  --reference-spectra /path/to/reference_library.mgf \
  --output-dir /path/to/output \
  --query-index 0 \
  --top-k 10 \
  --device cpu \
  --compute-ground-truth
```

**Parameters:**

- `--model-path`: Path to trained SIMBA model checkpoint (.ckpt file)
- `--query-spectra`: Path to query spectra file (.mgf or .pkl format)
- `--reference-spectra`: Path to reference library spectra file (.mgf or .pkl format)
- `--output-dir`: Directory where results will be saved
- `--query-index`: Index of the query spectrum to analyze (default: 0)
- `--top-k`: Number of top matches to return (default: 10)
- `--device`: Hardware device: `cpu` or `gpu` (default: cpu)
- `--batch-size`: Batch size for processing (default: 32)
- `--cache-embeddings` / `--no-cache-embeddings`: Cache embeddings for faster repeated searches (default: True)
- `--use-gnps-format` / `--no-use-gnps-format`: Whether spectra files use GNPS format (default: False)
- `--compute-ground-truth`: Compute ground truth edit distance and MCES for validation
- `--save-rankings`: Save complete ranking matrix to file

**Output:**

The command generates several files in the output directory:

- `results.json`: Summary of top matches with predictions and ground truth
- `matches.csv`: Detailed table of all matches
- `query_molecule.png`: Structure of the query molecule
- `match_N_molecule.png`: Structures of matched molecules
- `mirror_plot_match_N.png`: Mirror plots comparing query and matched spectra
- `rankings.npy`: Complete ranking matrix (if `--save-rankings` is used)

For interactive exploration, use the [Run Analog Discovery Notebook](https://github.com/bittremieux-lab/simba/tree/main/notebooks/final_tutorials/run_analog_discovery.ipynb).

## Training Custom Models

### Step 1: Preprocess Data

```bash
simba preprocess \
  --spectra-path /path/to/your/spectra.mgf \
  --workspace /path/to/preprocessed_data \
  --max-spectra-train 10000 \
  --mapping-file-name mapping_unique_smiles.pkl \
  --num-workers 0
```

### Step 2: Train Model

```bash
simba train \
  --checkpoint-dir checkpoints/ \
  --preprocessing-dir preprocessing/ \
  --preprocessing-pickle preprocessed_data.pkl \
  --epochs 50 \
  --accelerator gpu \
  --batch-size 64
```

### Step 3: Run Inference

```bash
simba inference \
  --checkpoint-dir checkpoints/ \
  --preprocessing-dir preprocessing/ \
  --preprocessing-pickle test_data.pkl \
  --batch-size 128 \
  --accelerator gpu
```