
Fraud Detection ML System

Production-grade fraud detection pipeline with model monitoring and explainability.

Python · Scikit-learn · MLflow · SHAP · Arize AI

Building a Production-Ready Fraud Detection ML System: From Data to Deployment

Credit card fraud costs businesses $32 billion annually — and the difference between catching fraud in milliseconds versus hours isn’t just convenience, it’s revenue. In this post, I’ll walk through how I built a complete, production-ready ML system for fraud detection, covering every layer from raw data ingestion to a monitored, containerized API.

GitHub: github.com/bzddbz/fraud_detection


The Problem: Why Fraud Detection Is Hard

Fraud detection isn’t a standard classification problem. Two challenges make it particularly tricky:

Extreme class imbalance: Only 0.17% of transactions are fraudulent. A model that predicts “legitimate” for every transaction achieves 99.83% accuracy — and catches zero fraud.

Asymmetric costs: Missing a fraudulent transaction costs thousands in losses. Falsely declining a legitimate one costs a customer relationship. These errors are not equally bad.

Standard ML pipelines fail here unless you deliberately design around both.


Data & Preprocessing

The dataset is the Kaggle Credit Card Fraud dataset — 284,807 real transactions with PCA-anonymized features V1–V28, plus Time and Amount.

The preprocessing pipeline does three things:

  1. Standardization: StandardScaler normalizes Amount and Time to prevent scale dominance
  2. SMOTE oversampling: Synthetic Minority Oversampling Technique generates synthetic fraud examples to balance training data
  3. Stratified splitting: Ensures fraud samples are proportionally represented across train/test splits

SMOTE is applied only to training data — never to the test set, which must reflect real-world distribution.


Model Choice: Why Gradient Boosting Over Deep Learning

I evaluated several models but chose Gradient Boosting (scikit-learn’s GradientBoostingClassifier) for three reasons:

| Criterion | Gradient Boosting | Deep Learning |
|---|---|---|
| Inference latency | ~8ms | ~30-100ms |
| Interpretability | High (SHAP-compatible) | Low (black box) |
| Data size needed | Medium | Large |

For a real financial system, regulators need to understand why a transaction was flagged. Deep learning makes that nearly impossible. Gradient Boosting, combined with SHAP, gives you both performance and explainability.


The Right Metric: F2-Score

Optimizing for accuracy or even F1-score would have been a mistake here. I used F2-score, which weights recall twice as heavily as precision (β=2).

The logic is simple: a missed fraud (false negative) costs orders of magnitude more than a false decline (false positive). F2-score encodes that business reality directly into model optimization.
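A tiny example with made-up labels shows how F2 rewards recall. Model A catches all fraud with some false positives; Model B has perfect precision but misses half the fraud:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Model A: recall 1.0, precision 0.67 (two false positives)
pred_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# Model B: recall 0.5, precision 1.0 (two missed frauds)
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

f2_a = fbeta_score(y_true, pred_a, beta=2)  # ~0.91
f2_b = fbeta_score(y_true, pred_b, beta=2)  # ~0.56
```

Under F1 these two models would look much closer; under F2, the recall-heavy model clearly wins, which matches the cost structure of fraud.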

Final model performance:

| Metric | Value |
|---|---|
| ROC-AUC | 0.97 |
| F2-Score | 0.87 |
| Recall (Fraud) | 0.92 |
| Precision (Fraud) | 0.78 |
| Optimal Threshold | 0.30 |

The optimal decision threshold of 0.30 (instead of the default 0.50) was found through threshold optimization — lowering it improves recall at a controlled precision cost, which is exactly what fraud detection needs.
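Threshold optimization is just a sweep over candidate cutoffs on held-out predicted probabilities, keeping whichever maximizes F2. A sketch on synthetic data (the real pipeline does this on a validation split of the Kaggle data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
probs = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds and keep the one with the best F2 on validation data
thresholds = np.arange(0.05, 0.95, 0.05)
scores = [fbeta_score(y_val, probs >= t, beta=2) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
```

Because F2 favors recall, the winning threshold tends to land below the 0.50 default, consistent with the 0.30 found for the production model.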


MLOps: Experiment Tracking with MLflow

Every training run is tracked in MLflow: hyperparameters, metrics, confusion matrices, and model artifacts are all versioned automatically. This means:

  • You can reproduce any past model exactly
  • You can compare experiments side by side
  • Model promotion to production is one API call

The MLflow UI runs as a Docker service alongside the API, so the full stack spins up with a single docker-compose up.


Explainability: SHAP Values for Regulatory Compliance

GDPR Article 22 restricts decision-making based solely on automated processing; in practice, automated decisions that significantly affect individuals must be explainable. SHAP (SHapley Additive exPlanations) makes this possible.

For every prediction, SHAP computes each feature’s contribution to the output — both globally (which features matter most overall) and locally (why this specific transaction was flagged). This satisfies both technical audits and business stakeholders who need to explain decisions to customers.


Serving: FastAPI + Pydantic

The model is served via a FastAPI REST endpoint with full Pydantic input validation. A typical prediction response looks like:

```json
{
  "prediction": 0,
  "fraud_probability": 0.12,
  "confidence": "high",
  "processing_time_ms": 8
}
```

Key design decisions:

  • Pydantic schemas validate all 30 input features — malformed requests are rejected before they reach the model
  • Processing time is returned in every response for SLA monitoring
  • Structured logging makes log aggregation trivial

Latency benchmarks: p50 = 8ms, p99 = 25ms, throughput = 1,000 req/sec.
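The validation layer can be sketched with Pydantic alone. The field names below are abbreviated stand-ins; the real schema declares all 30 features:

```python
from pydantic import BaseModel, Field, ValidationError

class Transaction(BaseModel):
    """Abbreviated request schema; the real service declares all 30 features."""
    amount: float = Field(ge=0)  # negative amounts are rejected outright
    time: float = Field(ge=0)
    v1: float
    v2: float

# A well-formed request parses cleanly
valid = Transaction(amount=42.50, time=1000.0, v1=-1.2, v2=0.7)

# A malformed request raises before the model is ever invoked
try:
    Transaction(amount=-5.0, time=1000.0, v1=0.0, v2=0.0)
except ValidationError:
    rejected = True
```

FastAPI applies exactly this validation automatically when the schema is used as an endpoint's request body, returning a 422 for malformed input.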


Observability: Prometheus + Grafana

A fraud detection system that you can’t monitor isn’t production-ready. Every prediction endpoint exposes Prometheus metrics:

  • Request count and latency histograms
  • Fraud rate over time
  • Model confidence distributions
  • Error rates

These feed into Grafana dashboards for real-time visibility. Alerts can trigger if fraud rate spikes abnormally — a potential indicator of model drift or an active attack.
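Instrumentation with prometheus_client looks roughly like this; the metric names are hypothetical, not the service's actual ones:

```python
from prometheus_client import Counter, Histogram, REGISTRY

# Counter tracks prediction volume and fraud rate; Histogram tracks latency
PREDICTIONS = Counter("fraud_predictions", "Total predictions served", ["label"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

def record_prediction(is_fraud: bool, seconds: float) -> None:
    """Called once per prediction by the serving endpoint."""
    PREDICTIONS.labels(label="fraud" if is_fraud else "legit").inc()
    LATENCY.observe(seconds)

record_prediction(False, 0.008)
record_prediction(True, 0.012)
```

Prometheus scrapes these from a `/metrics` endpoint, and the fraud rate is then just the ratio of the two counter labels over a time window, which is what the drift alert watches.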


CI/CD Pipeline

The GitHub Actions pipeline runs on every push and pull request:

  1. ruff linting + black formatting checks
  2. pytest test suite with 85% coverage
  3. Docker image build validation

Tests cover the API endpoints, preprocessing logic, and model performance thresholds — so a model regression would break the CI pipeline before it could reach production.


Key Learnings

  • Class imbalance requires deliberate design — accuracy is a misleading metric when fraud is 0.17% of data
  • Business metrics beat academic metrics — F2-score maps directly to real cost structure
  • Explainability is an engineering requirement, not an afterthought — SHAP integration should be planned from the start
  • MLOps from day one — adding experiment tracking and monitoring retroactively is painful; building it in is fast
  • Gradient Boosting for structured tabular fraud data outperforms deep learning on interpretability, latency, and data efficiency

What’s Next

  • Kafka integration for real-time stream processing
  • Evidently AI for automated data drift detection
  • Kubernetes deployment via Helm charts
  • Automated retraining pipeline triggered by drift alerts

The full source code, Docker setup, and documentation are on GitHub. Feedback and contributions are welcome.

Enjoyed this project?

If you find this project helpful and would like to support the creation of more resources like this, consider buying me a coffee. Your support helps me continue building and sharing projects with the community.