
Fraud Detection ML System

Production-grade fraud detection pipeline with model monitoring and explainability.

Python · Scikit-learn · MLflow · SHAP · Arize AI

Building a Production-Ready Fraud Detection ML System: From Data to Deployment

Credit card fraud costs businesses $32 billion annually — and the difference between catching fraud in milliseconds versus hours isn’t just convenience, it’s revenue. In this post, I’ll walk through how I built a complete, production-ready ML system for fraud detection, covering every layer from raw data ingestion to a monitored, containerized API.

GitHub: github.com/bzddbz/fraud_detection


The Problem: Why Fraud Detection Is Hard

Fraud detection isn’t a standard classification problem. Two challenges make it particularly tricky:

Extreme class imbalance: Only 0.17% of transactions are fraudulent. A model that predicts “legitimate” for every transaction achieves 99.83% accuracy — and catches zero fraud.

Asymmetric costs: Missing a fraudulent transaction costs thousands in losses. Falsely declining a legitimate one costs a customer relationship. These errors are not equally bad.

Standard ML pipelines fail here unless you deliberately design around both.


Data & Preprocessing

The dataset is the Kaggle Credit Card Fraud dataset — 284,807 real transactions with PCA-anonymized features V1–V28, plus Time and Amount.

The preprocessing pipeline does three things:

  1. Standardization: StandardScaler normalizes Amount and Time to prevent scale dominance
  2. SMOTE oversampling: Synthetic Minority Oversampling Technique generates synthetic fraud examples to balance training data
  3. Stratified splitting: Ensures fraud samples are proportionally represented across train/test splits

SMOTE is applied only to training data — never to the test set, which must reflect real-world distribution.


Model Choice: Why Gradient Boosting Over Deep Learning

I evaluated several models but chose Gradient Boosting (scikit-learn’s GradientBoostingClassifier) for three reasons:

| Criterion | Gradient Boosting | Deep Learning |
|---|---|---|
| Inference latency | ~8ms | ~30-100ms |
| Interpretability | High (SHAP-compatible) | Low (black box) |
| Data size needed | Medium | Large |

For a real financial system, regulators need to understand why a transaction was flagged. Deep learning makes that nearly impossible. Gradient Boosting, combined with SHAP, gives you both performance and explainability.


The Right Metric: F2-Score

Optimizing for accuracy or even F1-score would have been a mistake here. I used F2-score, which weights recall twice as heavily as precision (β=2).

The logic is simple: a missed fraud (false negative) costs orders of magnitude more than a false decline (false positive). F2-score encodes that business reality directly into model optimization.
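A tiny example with made-up labels shows how F2 rewards recall. Model A catches all fraud with some false positives; Model B has perfect precision but misses half the fraud:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Model A: recall 1.0, precision 0.67 (two false positives)
pred_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# Model B: recall 0.5, precision 1.0 (two missed frauds)
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

f2_a = fbeta_score(y_true, pred_a, beta=2)  # ~0.91
f2_b = fbeta_score(y_true, pred_b, beta=2)  # ~0.56
```

Under F1 these two models would look much closer; under F2, the recall-heavy model clearly wins, which matches the cost structure of fraud.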

Final model performance:

| Metric | Value |
|---|---|
| ROC-AUC | 0.97 |
| F2-Score | 0.87 |
| Recall (Fraud) | 0.92 |
| Precision (Fraud) | 0.78 |
| Optimal Threshold | 0.30 |

The optimal decision threshold of 0.30 (instead of the default 0.50) was found through threshold optimization — lowering it improves recall at a controlled precision cost, which is exactly what fraud detection needs.
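Threshold optimization is just a sweep over candidate cutoffs on held-out predicted probabilities, keeping whichever maximizes F2. A sketch on synthetic data (the real pipeline does this on a validation split of the Kaggle data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
probs = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds and keep the one with the best F2 on validation data
thresholds = np.arange(0.05, 0.95, 0.05)
scores = [fbeta_score(y_val, probs >= t, beta=2) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
```

Because F2 favors recall, the winning threshold tends to land below the 0.50 default, consistent with the 0.30 found for the production model.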


MLOps: Experiment Tracking with MLflow

Every training run is tracked in MLflow: hyperparameters, metrics, confusion matrices, and model artifacts are all versioned automatically. This means:

  • You can reproduce any past model exactly
  • You can compare experiments side by side
  • Model promotion to production is one API call

The MLflow UI runs as a Docker service alongside the API, so the full stack spins up with a single docker-compose up.


Explainability: SHAP Values for Regulatory Compliance

GDPR Article 22 restricts decision-making based solely on automated processing; in practice, automated decisions that significantly affect individuals must be explainable. SHAP (SHapley Additive exPlanations) makes this possible.

For every prediction, SHAP computes each feature’s contribution to the output — both globally (which features matter most overall) and locally (why this specific transaction was flagged). This satisfies both technical audits and business stakeholders who need to explain decisions to customers.


Serving: FastAPI + Pydantic

The model is served via a FastAPI REST endpoint with full Pydantic input validation. A typical prediction response looks like:

```json
{
  "prediction": 0,
  "fraud_probability": 0.12,
  "confidence": "high",
  "processing_time_ms": 8
}
```

Key design decisions:

  • Pydantic schemas validate all 30 input features — malformed requests are rejected before they reach the model
  • Processing time is returned in every response for SLA monitoring
  • Structured logging makes log aggregation trivial

Latency benchmarks: p50 = 8ms, p99 = 25ms, throughput = 1,000 req/sec.
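The validation layer can be sketched with Pydantic alone. The field names below are abbreviated stand-ins; the real schema declares all 30 features:

```python
from pydantic import BaseModel, Field, ValidationError

class Transaction(BaseModel):
    """Abbreviated request schema; the real service declares all 30 features."""
    amount: float = Field(ge=0)  # negative amounts are rejected outright
    time: float = Field(ge=0)
    v1: float
    v2: float

# A well-formed request parses cleanly
valid = Transaction(amount=42.50, time=1000.0, v1=-1.2, v2=0.7)

# A malformed request raises before the model is ever invoked
try:
    Transaction(amount=-5.0, time=1000.0, v1=0.0, v2=0.0)
except ValidationError:
    rejected = True
```

FastAPI applies exactly this validation automatically when the schema is used as an endpoint's request body, returning a 422 for malformed input.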


Observability: Prometheus + Grafana

A fraud detection system that you can’t monitor isn’t production-ready. Every prediction endpoint exposes Prometheus metrics:

  • Request count and latency histograms
  • Fraud rate over time
  • Model confidence distributions
  • Error rates

These feed into Grafana dashboards for real-time visibility. Alerts can trigger if fraud rate spikes abnormally — a potential indicator of model drift or an active attack.
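Instrumentation with prometheus_client looks roughly like this; the metric names are hypothetical, not the service's actual ones:

```python
from prometheus_client import Counter, Histogram, REGISTRY

# Counter tracks prediction volume and fraud rate; Histogram tracks latency
PREDICTIONS = Counter("fraud_predictions", "Total predictions served", ["label"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

def record_prediction(is_fraud: bool, seconds: float) -> None:
    """Called once per prediction by the serving endpoint."""
    PREDICTIONS.labels(label="fraud" if is_fraud else "legit").inc()
    LATENCY.observe(seconds)

record_prediction(False, 0.008)
record_prediction(True, 0.012)
```

Prometheus scrapes these from a `/metrics` endpoint, and the fraud rate is then just the ratio of the two counter labels over a time window, which is what the drift alert watches.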


CI/CD Pipeline

The GitHub Actions pipeline runs on every push and pull request:

  1. ruff linting + black formatting checks
  2. pytest test suite with 85% coverage
  3. Docker image build validation

Tests cover the API endpoints, preprocessing logic, and model performance thresholds — so a model regression would break the CI pipeline before it could reach production.


Key Learnings

  • Class imbalance requires deliberate design — accuracy is a misleading metric when fraud is 0.17% of data
  • Business metrics beat academic metrics — F2-score maps directly to real cost structure
  • Explainability is an engineering requirement, not an afterthought — SHAP integration should be planned from the start
  • MLOps from day one — adding experiment tracking and monitoring retroactively is painful; building it in is fast
  • Gradient Boosting for structured tabular fraud data outperforms deep learning on interpretability, latency, and data efficiency

What’s Next

  • Kafka integration for real-time stream processing
  • Evidently AI for automated data drift detection
  • Kubernetes deployment via Helm charts
  • Automated retraining pipeline triggered by drift alerts

The full source code, Docker setup, and documentation are on GitHub. Feedback and contributions are welcome.

Enjoyed this project?

If you find this project helpful and would like to support the creation of more resources like this, consider buying me a coffee. Your support helps me continue building and sharing projects with the community.