MLOps Tracking (MLflow & DVC)
MLOps is the practice of applying DevOps principles to Machine Learning. This phase focuses on experiment tracking and data versioning.
1. Why Track Experiments?
- Reproducibility: Can you re-run the same experiment with the same results?
- Comparison: Which hyperparameters produced the best model?
- Governance: Who trained this model, and what data was used?
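
To make the three questions above concrete, here is a minimal sketch of what a tracked run might record, using only the Python standard library. The RunRecord class, its field names, and the helper functions are hypothetical and chosen for illustration; they are not part of MLflow or DVC.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    """Hypothetical record of one training run (not an MLflow type)."""
    params: dict       # reproducibility: hyperparameters used
    metrics: dict      # comparison: results to rank runs by
    data_sha256: str   # governance: fingerprint of the training data
    user: str          # governance: who trained the model
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact bytes trained on."""
    return hashlib.sha256(data).hexdigest()


def best_run(runs, metric):
    """Pick the run with the highest value of the given metric."""
    return max(runs, key=lambda r: r.metrics[metric])


data = b"feature,label\n1.0,0\n2.0,1\n"  # stand-in for a real dataset
runs = [
    RunRecord({"learning_rate": 0.1}, {"accuracy": 0.91}, fingerprint(data), "alice"),
    RunRecord({"learning_rate": 0.01}, {"accuracy": 0.95}, fingerprint(data), "alice"),
]
winner = best_run(runs, "accuracy")
print(json.dumps(asdict(winner), indent=2))
```

A tracking tool stores records like these for every run, so the comparison and governance questions become simple queries rather than guesswork.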
2. MLflow: The Standard for Tracking
MLflow allows you to log parameters, metrics, and models.
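import mlflow
import mlflow.sklearn

# `model` is assumed to be an already fitted scikit-learn estimator

# Start a tracking run; it ends automatically when the block exits
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
    # Log the model artifact
    mlflow.sklearn.log_model(model, "model")

3. DVC: Data Version Control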
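Git is poorly suited to versioning large datasets: every revision of a large file is kept in the repository history, which bloats clones. DVC handles this by versioning small metadata (pointer) files in Git while storing the actual data in a remote such as S3, GCS, or Azure Blob Storage.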
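DVC Workflow:
- dvc add data.csv (Creates the pointer file data.csv.dvc and adds data.csv to .gitignore).
- dvc push (Uploads the actual data to the configured remote, e.g. S3).
- git add data.csv.dvc .gitignore, then git commit (Commits the pointer to Git).
- dvc pull (Anyone who checks out that commit can download the matching version of the data).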
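
The idea behind this workflow can be sketched in plain Python: the data is stored under its content hash, and only a small pointer file needs to live in Git. This is a conceptual illustration, not DVC's actual implementation. The function names add and pull, the .pointer.json file name, the local directory standing in for a remote, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add(data_path: Path, store_dir: Path) -> Path:
    """Copy the data into a content-addressed store and write a small pointer file."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    store_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, store_dir / digest)
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}))
    return pointer  # this small file is what would be committed to Git


def pull(pointer: Path, store_dir: Path) -> Path:
    """Restore the exact data version that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store_dir / meta["sha256"], target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workspace, remote = Path(tmp) / "repo", Path(tmp) / "remote"
    workspace.mkdir()
    data = workspace / "data.csv"
    data.write_text("id,value\n1,10\n")

    pointer = add(data, remote)           # roughly: dvc add + dvc push
    data.write_text("id,value\n1,999\n")  # the local copy is later changed
    restored = pull(pointer, remote)      # roughly: dvc pull for that pointer
    restored_text = restored.read_text()
    pointer_size = pointer.stat().st_size
```

Because the pointer stays tiny regardless of how large the dataset is, Git history stays small while each commit still identifies one exact version of the data.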
4. Model Registry & Deployment
Once a model is trained, it is registered in a Model Registry. This allows you to:
- Tag models as Staging, Production, or Archived.
- Version your model artifacts.
- Automatically trigger deployment pipelines when a model is tagged as Production.
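
The three capabilities above can be sketched as a toy in-memory registry. Everything here is hypothetical: the ToyRegistry class, the on_production hook, the example model name and artifact URIs, and the rule that promoting a new version archives the previous Production version are illustrative assumptions, not the behavior of any specific registry product.

```python
from dataclasses import dataclass, field
from typing import Callable

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    stage: str = "None"


@dataclass
class ToyRegistry:
    """Hypothetical in-memory registry; a real one persists this state."""
    on_production: Callable[[str, ModelVersion], None]
    models: dict = field(default_factory=dict)

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        """Store a new version; version numbers increase by one per model name."""
        versions = self.models.setdefault(name, [])
        mv = ModelVersion(version=len(versions) + 1, artifact_uri=artifact_uri)
        versions.append(mv)
        return mv

    def set_stage(self, name: str, version: int, stage: str) -> None:
        """Tag a version; promoting to Production archives the previous one."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            for mv in self.models[name]:
                if mv.stage == "Production":
                    mv.stage = "Archived"
        target = self.models[name][version - 1]
        target.stage = stage
        if stage == "Production":
            self.on_production(name, target)  # stand-in for a deployment trigger


deployed = []
registry = ToyRegistry(
    on_production=lambda name, mv: deployed.append((name, mv.version))
)
registry.register("churn-model", "s3://example-bucket/models/churn/1")
registry.register("churn-model", "s3://example-bucket/models/churn/2")
registry.set_stage("churn-model", 1, "Production")
registry.set_stage("churn-model", 2, "Production")
print(deployed)
```

In practice the hook would call a CI/CD system or deployment service rather than append to a list; the point is that the stage change, not a manual step, is what starts the deployment.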