I Fine-Tuned My Own LLM in a Weekend - Here's Everything I Learned
I've been building Osprey, an open-source transaction monitoring engine, for a while now. It evaluates financial transactions against FATF rules and typologies and outputs structured JSON - rule scores, typology matches, and an ALRT/NALT decision.
The problem? That JSON output is useless to a compliance analyst who needs to write a SAR (Suspicious Activity Report). They stare at rule IDs and scores, then spend hours translating that into a human-readable narrative for regulators and law enforcement.
So I asked: what if a fine-tuned LLM could do that translation instantly?
I built one over a weekend. Here's how.
The Goal
Input: Osprey evaluation JSON (12 FATF rules, 6 typologies, transaction details, scores, decision)
Output: A structured SAR-style compliance narrative with risk assessment, FATF references, and recommended actions
Constraints: Train a small model (4B parameters) that can run locally on a laptop. No cloud inference dependency.
Step 1: Building the Training Data
The first challenge: there are no public datasets of SAR narratives (for obvious regulatory reasons). So I built a synthetic data pipeline.
Three Python modules work together:
- `evaluation_generator.py` - Generates realistic Osprey evaluation JSONs with all 12 FATF rules and 6 typologies. Covers 5 scenario types: clean transactions (40%), single rule triggers (15%), multi-rule below threshold (10%), typology matches (20%), and critical alerts (15%).
- `narrative_templates.py` - For each evaluation, generates a structured narrative with 7 sections: Alert Summary, Transaction Details, Risk Assessment, Rules Triggered, Typology Analysis, Narrative, and Recommended Actions. Uses 5-8 phrasing variants per rule to prevent memorization.
- `dataset_builder.py` - Combines them into chat-format JSONL for training.
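To make the chat format concrete, here's a minimal sketch of what `dataset_builder.py` could emit. The field names and the system prompt text are illustrative assumptions, not the actual pipeline code:

```python
import json

# Hypothetical system prompt; the real one lives in the training pipeline.
SYSTEM = "You are a compliance narrator. Convert Osprey evaluation JSON into a SAR-style narrative."

def to_chat_example(evaluation: dict, narrative: str) -> dict:
    """Pair one evaluation JSON with its generated narrative in chat format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": json.dumps(evaluation)},
            {"role": "assistant", "content": narrative},
        ]
    }

def write_jsonl(examples: list[dict], path: str) -> None:
    """One JSON object per line - the format most SFT trainers expect."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

The one-object-per-line layout is what lets the training framework stream examples without loading the whole dataset into memory.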
Osprey's rules were originally developed and benchmarked against the PaySim dataset. The synthetic training data for the narrator uses those same rule definitions and thresholds.
I generated 3,750 examples: 3,000 for training, 375 for validation, 375 for testing. The whole pipeline runs in about 30 seconds.
```shell
make data  # generates data/train.jsonl, data/valid.jsonl, data/test.jsonl
```
Step 2: Choosing the Base Model
I went with Qwen3-4B-Instruct-2507 for a few reasons:
- 4B parameters is small enough to fine-tune on free hardware
- Already instruction-tuned, so it understands chat format
- Qwen3 has strong multilingual and reasoning capabilities for its size
- Available in 4-bit quantized form for memory-constrained training
Step 3: Training
I first tried training locally on my M3 Pro MacBook (18GB RAM) using MLX. The loss went to NaN and the model came out malformed. So I moved to Google Colab's free tier - an NVIDIA T4 GPU with 16GB VRAM.
The training setup:
- Framework: Unsloth (2x faster than standard HuggingFace training)
- Method: LoRA (Low-Rank Adaptation) - only trains 16.5M parameters out of 4B (0.41%)
- LoRA Config: Rank 8, Alpha 16, targeting all attention + MLP layers
- Batch Size: 2 (effective 8 with gradient accumulation)
- Learning Rate: 2e-5 with cosine schedule
- Training Time: ~70 minutes on a free T4 GPU
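The setup above maps to roughly this Unsloth + TRL configuration. This is a sketch reconstructed from the hyperparameters listed, not the exact training script, and it needs a CUDA GPU (like the Colab T4) to actually run:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",  # assumed Unsloth mirror of the base model
    max_seq_length=4096,
    load_in_4bit=True,  # 4-bit base weights to fit the T4's 16GB VRAM
)

# LoRA: rank 8, alpha 16, on all attention + MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # the chat-format JSONL, loaded via `datasets`
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size 8
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
    ),
)
trainer.train()
```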
What is LoRA?
Instead of updating all 4 billion parameters, LoRA freezes the original weights and injects small trainable matrices into each layer. You're basically adding a thin overlay on top of the existing model rather than rewriting the whole thing. You only train ~0.4% of the parameters and the output quality is surprisingly close to full fine-tuning.
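The math is simple enough to sketch in NumPy: for a frozen weight matrix W of shape (d, d), LoRA learns two thin matrices B (d×r) and A (r×d) and adds their scaled product to the layer's output. The dimensions below are a single toy layer at Qwen-ish scale, not the model's actual shapes:

```python
import numpy as np

d, r = 4096, 8  # layer width and LoRA rank

W = np.random.randn(d, d)          # frozen pretrained weight - never updated
A = np.random.randn(r, d) * 0.01   # trainable, initialized small
B = np.zeros((d, r))               # trainable, initialized zero so the delta starts at 0

def lora_forward(x, alpha=16):
    # Original path plus the scaled low-rank update (alpha/r is the LoRA scaling factor)
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full_params = W.size           # what a full fine-tune of this layer would touch
lora_params = A.size + B.size  # what LoRA actually trains
print(f"{lora_params / full_params:.2%} of this layer's parameters are trainable")  # → 0.39%
```

Because B starts at zero, the adapter initially contributes nothing and the model behaves exactly like the base model; training gradually shapes the delta.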
Results
- Train Loss - how wrong the model is on data it's learning from. Lower means it's learning well.
- Test Loss - same thing but on data it has never seen.
- Perplexity - roughly the effective number of tokens the model is choosing between when predicting the next one. A perplexity of 2.65 means it's uncertain between fewer than three plausible tokens at each step, on average.
| Metric | Value | Target |
|---|---|---|
| Test Perplexity | 2.65 | < 5.0 |
| Test Loss | 0.976 | - |
| Final Train Loss | 0.538 | - |
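The perplexity figure isn't an independent measurement - for a cross-entropy loss it's just exp(loss), which makes for a one-line sanity check:

```python
import math

test_loss = 0.976
perplexity = math.exp(test_loss)
print(round(perplexity, 2))  # → 2.65
```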
Loss went down steadily with no divergence:
| Step | Train Loss | Validation Loss |
|---|---|---|
| 100 | 0.640 | 0.624 |
| 200 | 0.560 | 0.553 |
| 300 | 0.539 | 0.538 |
No NaN explosions, no overfitting. Just worked.
Step 4: Testing the Output
Here's what the model produces when given a real Osprey evaluation (wire transfer of $147,500 from UAE to Nigeria, 7 rules triggered, 2 typologies matched):
**Alert Summary:** Evaluation eval-7a3b2c1d generated an ALRT decision with a composite score of 724.50, significantly exceeding the threshold of 500.0. Seven of twelve FATF rules were triggered, and two typology patterns were matched.

**Risk Assessment:** HIGH risk. The composite score of 724.50 exceeds the alert threshold by 44.9%, driven primarily by High-Value Transaction (0.875) and Geographic Risk (0.810).

**Recommended Actions:**

- Escalate to senior compliance officer for SAR filing review
- Request enhanced due diligence on both counterparties
- Investigate related transactions within the past 90 days
That's hours of analyst work done in a few seconds.
Step 5: Deployment
I published the model in two places so people can actually use it:
Ollama (run locally)
I merged the LoRA adapter into the base model, converted to GGUF format, and quantized to Q4_K_M (2.5 GB). Now anyone can run it:
```shell
ollama run josephgoksu/osprey-narrator
```
HuggingFace (for integration)
The LoRA adapter is published for anyone who wants to load it with HuggingFace Transformers or Unsloth:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="josephgoksu/osprey-narrator-v0.1",
    max_seq_length=4096,
    load_in_4bit=True,
)
```
What Surprised Me
You don't need a massive GPU. A free T4 on Google Colab was enough. Training cost: $0.
3,000 examples is enough for a focused task. General-purpose models need millions of examples. Domain-specific models with a narrow, well-defined output format need far fewer.
LoRA is remarkable. Training 0.41% of parameters and getting this quality of output? I was genuinely surprised.
The data pipeline is the hard part. Training itself took 70 minutes. Building the synthetic data generator took most of the weekend.
System prompts are just the first message. The model file contains only weights, no instructions. The system prompt gets injected at inference time as the first conversation message. No special architecture for it.
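Concretely, the request that reaches the model at inference time is just a list of messages, with the system prompt as the first entry. This is a sketch of a standard chat payload; the actual system prompt text is an assumption:

```python
# The "system prompt" is nothing more than the first message in the conversation;
# the chat template flattens this list into the token stream the model sees.
messages = [
    {"role": "system", "content": "You are a compliance narrator..."},
    {"role": "user", "content": '{"evaluation_id": "eval-7a3b2c1d", "decision": "ALRT"}'},
]
```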
What's Next (v0.2)
- Better training data: Use Claude to generate and refine training narratives (distillation approach) instead of template-based generation
- More examples: Scale from 3K to 10-20K training examples
- Native MLX training: Train directly on Apple Silicon instead of using Colab
- Multiple quantizations: Offer Q8_0 (higher quality) alongside Q4_K_M (smaller)
Links
- Model (HuggingFace): josephgoksu/osprey-narrator-v0.1
- Model (Ollama): josephgoksu/osprey-narrator
- Osprey Engine: opensource-finance/osprey
- opensource.finance: opensource.finance
Built with Qwen3-4B, Unsloth, LoRA, and a free Google Colab GPU. The training pipeline is open source.