I Fine-Tuned My Own LLM in a Weekend - Here's Everything I Learned
I've been building Osprey, an open-source transaction monitoring engine, for a while now. It evaluates financial transactions against FATF rules and typologies and outputs structured JSON - rule scores, typology matches, and an ALRT/NALT decision.
The problem? That JSON output is useless to a compliance analyst who needs to write a SAR (Suspicious Activity Report). They stare at rule IDs and scores, then spend hours translating that into a human-readable narrative for regulators and law enforcement.
So I asked: what if a fine-tuned LLM could do that translation instantly?
I built one over a weekend. Here's how.
The Goal
Input: Osprey evaluation JSON (12 FATF rules, 6 typologies, transaction details, scores, decision)
Output: A structured SAR-style compliance narrative with risk assessment, FATF references, and recommended actions
Constraints: Train a small model (4B parameters) that can run locally on a laptop. No cloud inference dependency.
Step 1: Building the Training Data
The first challenge: there are no public datasets of SAR narratives (for obvious regulatory reasons). So I built a synthetic data pipeline.
Three Python modules work together:
- `evaluation_generator.py` - Generates realistic Osprey evaluation JSONs with all 12 FATF rules and 6 typologies. Covers 5 scenario types: clean transactions (40%), single rule triggers (15%), multi-rule below threshold (10%), typology matches (20%), and critical alerts (15%).
- `narrative_templates.py` - For each evaluation, generates a structured narrative with 7 sections: Alert Summary, Transaction Details, Risk Assessment, Rules Triggered, Typology Analysis, Narrative, and Recommended Actions. Uses 5-8 phrasing variants per rule to prevent memorization.
- `dataset_builder.py` - Combines them into chat-format JSONL for training.
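To make the chat format concrete, here's a minimal sketch of what `dataset_builder.py` could emit. The field names and the system prompt text are illustrative assumptions, not the actual pipeline code:

```python
import json

# Hypothetical system prompt; the real one lives in the training pipeline.
SYSTEM = "You are a compliance narrator. Convert Osprey evaluation JSON into a SAR-style narrative."

def to_chat_example(evaluation: dict, narrative: str) -> dict:
    """Pair one evaluation JSON with its generated narrative in chat format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": json.dumps(evaluation)},
            {"role": "assistant", "content": narrative},
        ]
    }

def write_jsonl(examples: list[dict], path: str) -> None:
    """One JSON object per line - the format most SFT trainers expect."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

The one-object-per-line layout is what lets the training framework stream examples without loading the whole dataset into memory.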
Osprey's rules were originally developed and benchmarked against the PaySim dataset. The synthetic training data for the narrator uses those same rule definitions and thresholds.
I generated 3,750 examples: 3,000 for training, 375 for validation, 375 for testing. The whole pipeline runs in about 30 seconds.
```shell
make data  # generates data/train.jsonl, data/valid.jsonl, data/test.jsonl
```
Step 2: Choosing the Base Model
I went with Qwen3-4B-Instruct-2507 for a few reasons:
- 4B parameters is small enough to fine-tune on free hardware
- Already instruction-tuned, so it understands chat format
- Qwen3 has strong multilingual and reasoning capabilities for its size
- Available in 4-bit quantized form for memory-constrained training
Step 3: Training
I first tried training locally on my M3 Pro MacBook (18GB RAM) using MLX. The loss went to NaN and the model came out malformed. So I moved to Google Colab's free tier - an NVIDIA T4 GPU with 16GB VRAM.
The training setup:
- Framework: Unsloth (2x faster than standard HuggingFace training)
- Method: LoRA (Low-Rank Adaptation) - only trains 16.5M parameters out of 4B (0.41%)
- LoRA Config: Rank 8, Alpha 16, targeting all attention + MLP layers
- Batch Size: 2 (effective 8 with gradient accumulation)
- Learning Rate: 2e-5 with cosine schedule
- Training Time: ~70 minutes on a free T4 GPU
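The setup above maps to roughly this Unsloth + TRL configuration. This is a sketch reconstructed from the hyperparameters listed, not the exact training script, and it needs a CUDA GPU (like the Colab T4) to actually run:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",  # assumed Unsloth mirror of the base model
    max_seq_length=4096,
    load_in_4bit=True,  # 4-bit base weights to fit the T4's 16GB VRAM
)

# LoRA: rank 8, alpha 16, on all attention + MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # the chat-format JSONL, loaded via `datasets`
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size 8
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
    ),
)
trainer.train()
```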
What is LoRA?
Instead of updating all 4 billion parameters, LoRA freezes the original weights and injects small trainable matrices into each layer. You're basically adding a thin overlay on top of the existing model rather than rewriting the whole thing. You only train ~0.4% of the parameters and the output quality is surprisingly close to full fine-tuning.
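The math is simple enough to sketch in NumPy: for a frozen weight matrix W of shape (d, d), LoRA learns two thin matrices B (d×r) and A (r×d) and adds their scaled product to the layer's output. The dimensions below are a single toy layer at Qwen-ish scale, not the model's actual shapes:

```python
import numpy as np

d, r = 4096, 8  # layer width and LoRA rank

W = np.random.randn(d, d)          # frozen pretrained weight - never updated
A = np.random.randn(r, d) * 0.01   # trainable, initialized small
B = np.zeros((d, r))               # trainable, initialized zero so the delta starts at 0

def lora_forward(x, alpha=16):
    # Original path plus the scaled low-rank update (alpha/r is the LoRA scaling factor)
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full_params = W.size           # what a full fine-tune of this layer would touch
lora_params = A.size + B.size  # what LoRA actually trains
print(f"{lora_params / full_params:.2%} of this layer's parameters are trainable")  # → 0.39%
```

Because B starts at zero, the adapter initially contributes nothing and the model behaves exactly like the base model; training gradually shapes the delta.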
Results
- Train Loss - how wrong the model is on data it's learning from. Lower means it's learning well.
- Test Loss - same thing but on data it has never seen.
- Perplexity - roughly the effective number of tokens the model is choosing between when predicting the next one. A perplexity of 2.65 means it's uncertain between fewer than three plausible tokens at each step, on average.
| Metric | Value | Target |
|---|---|---|
| Test Perplexity | 2.65 | < 5.0 |
| Test Loss | 0.976 | - |
| Final Train Loss | 0.538 | - |
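The perplexity figure isn't an independent measurement - for a cross-entropy loss it's just exp(loss), which makes for a one-line sanity check:

```python
import math

test_loss = 0.976
perplexity = math.exp(test_loss)
print(round(perplexity, 2))  # → 2.65
```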
Loss went down steadily with no divergence:
| Step | Train Loss | Validation Loss |
|---|---|---|
| 100 | 0.640 | 0.624 |
| 200 | 0.560 | 0.553 |
| 300 | 0.539 | 0.538 |
No NaN explosions, no overfitting. Just worked.
Step 4: Testing the Output
Here's what the model produces when given a real Osprey evaluation (wire transfer of $147,500 from UAE to Nigeria, 7 rules triggered, 2 typologies matched):
**Alert Summary:** Evaluation eval-7a3b2c1d generated an ALRT decision with a composite score of 724.50, significantly exceeding the threshold of 500.0. Seven of twelve FATF rules were triggered, and two typology patterns were matched.

**Risk Assessment:** HIGH risk. The composite score of 724.50 exceeds the alert threshold by 44.9%, driven primarily by High-Value Transaction (0.875) and Geographic Risk (0.810).

**Recommended Actions:**

- Escalate to senior compliance officer for SAR filing review
- Request enhanced due diligence on both counterparties
- Investigate related transactions within the past 90 days
That's hours of analyst work done in a few seconds.
Step 5: Deployment
I published the model in two places so people can actually use it:
Ollama (run locally)
I merged the LoRA adapter into the base model, converted to GGUF format, and quantized to Q4_K_M (2.5 GB). Now anyone can run it:
```shell
ollama run josephgoksu/osprey-narrator
```
HuggingFace (for integration)
The LoRA adapter is published for anyone who wants to load it with HuggingFace Transformers or Unsloth:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="josephgoksu/osprey-narrator-v0.1",
    max_seq_length=4096,
    load_in_4bit=True,
)
```
What Surprised Me
You don't need a massive GPU. A free T4 on Google Colab was enough. Training cost: $0.
3,000 examples is enough for a focused task. General-purpose models need millions of examples. Domain-specific models with a narrow, well-defined output format need far fewer.
LoRA is remarkable. Training 0.41% of parameters and getting this quality of output? I was genuinely surprised.
The data pipeline is the hard part. Training itself took 70 minutes. Building the synthetic data generator took most of the weekend.
System prompts are just the first message. The model file contains only weights, no instructions. The system prompt gets injected at inference time as the first conversation message. No special architecture for it.
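Concretely, the request that reaches the model at inference time is just a list of messages, with the system prompt as the first entry. This is a sketch of a standard chat payload; the actual system prompt text is an assumption:

```python
# The "system prompt" is nothing more than the first message in the conversation;
# the chat template flattens this list into the token stream the model sees.
messages = [
    {"role": "system", "content": "You are a compliance narrator..."},
    {"role": "user", "content": '{"evaluation_id": "eval-7a3b2c1d", "decision": "ALRT"}'},
]
```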
What's Next (v0.2)
- Better training data: Use Claude to generate and refine training narratives (distillation approach) instead of template-based generation
- More examples: Scale from 3K to 10-20K training examples
- Native MLX training: Train directly on Apple Silicon instead of using Colab
- Multiple quantizations: Offer Q8_0 (higher quality) alongside Q4_K_M (smaller)
Links
- Model (HuggingFace): josephgoksu/osprey-narrator-v0.1
- Model (Ollama): josephgoksu/osprey-narrator
- Osprey Engine: opensource-finance/osprey
- opensource.finance: opensource.finance
Built with Qwen3-4B, Unsloth, LoRA, and a free Google Colab GPU. The training pipeline is open source.