
Phase 5: Monitoring & Observability

In this phase, we learn that a model’s journey doesn’t end at deployment. We must detect and fix “model decay” — the gradual loss of predictive performance over time.


🟢 Level 1: Standard Software Metrics

The basics of system health.

  • Latency: How long does a prediction take?
  • Throughput: Requests per second.
  • Error Rate: HTTP 500s or 400s.
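All three metrics can be computed from a plain request log. A minimal stdlib-only sketch (the `RequestRecord` structure and field names are illustrative, not from any particular framework):

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestRecord:
    latency_ms: float   # time taken to produce the prediction
    status: int         # HTTP status code returned

def summarize(records: list[RequestRecord], window_seconds: float) -> dict:
    """Compute latency percentiles, throughput, and error rate for one window."""
    latencies = sorted(r.latency_ms for r in records)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    errors = sum(1 for r in records if r.status >= 400)  # count 4xx and 5xx
    return {
        "p50_latency_ms": statistics.median(latencies),
        "p95_latency_ms": p95,
        "throughput_rps": len(records) / window_seconds,
        "error_rate": errors / len(records),
    }
```

In production these numbers would typically be scraped by a metrics system (e.g. Prometheus) rather than computed by hand, but the definitions are the same.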

🟡 Level 2: Data & Concept Drift

The challenge unique to ML: models fail not because the code changes, but because the world does.

1. Data Drift (Feature Drift)

The distribution of input data changes.

  • Example: You trained on young users, but your production users are older.

2. Concept Drift

The relationship between input and output changes.

  • Example: A “Luxury” brand in 2010 might not be considered “Luxury” in 2024.
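Concept drift usually cannot be seen in the inputs alone; it shows up as a slow decline in accuracy once labels arrive. A stdlib-only sketch that flags a sustained drop against the model’s validation baseline (the window size and tolerance are illustrative choices, not a standard algorithm):

```python
from collections import deque

class ConceptDriftMonitor:
    """Flag concept drift when rolling accuracy falls well below baseline."""

    def __init__(self, baseline_accuracy: float, window: int = 100,
                 tolerance: float = 0.10):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, prediction, actual) -> bool:
        """Record one labeled outcome; return True if drift is suspected."""
        self.outcomes.append(1 if prediction == actual else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rolling = sum(self.outcomes) / len(self.outcomes)
        return rolling < self.baseline - self.tolerance
```

The key design point: this monitor needs ground truth, which is exactly what Level 3 below is about.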

🔴 Level 3: Closing the Feedback Loop

3. Ground Truth & Accuracy

How do we know if the prediction was right?

  • Immediate: If the user clicks the recommendation.
  • Delayed: If the user pays their loan 6 months later.

4. Alerting Strategies

Use Evidently AI or Deepchecks to monitor drift.

  • Rule: If the KS test signals drift (p-value < 0.05), trigger a Slack alert and an automatic retraining job.
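A rule like this can be encoded as a pure decision function, keeping the side effects (the Slack webhook call, submitting the retraining job) separate and testable. The action names below are illustrative; the semantics assume the KS-test convention, where a small p-value signals drift:

```python
def drift_actions(ks_p_value: float, threshold: float = 0.05) -> list[str]:
    """Map a KS-test p-value for a feature to a list of actions.

    With the KS test, a p-value *below* the threshold means the
    production distribution differs significantly from training.
    """
    if ks_p_value < threshold:
        # In production these would post to a Slack webhook and
        # kick off a retraining job in the CI/CD pipeline.
        return ["slack_alert", "trigger_retraining"]
    return []
```

Separating the decision from the side effects makes the alerting logic unit-testable without a Slack workspace or a training cluster.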