🛡️ Data Quality (Great Expectations & Pandera)
In Data Engineering, “Bad data in = Bad decisions out.” Data Quality is the practice of validating your data at every step of the pipeline.
🏗️ 1. Why Data Quality?
- Schema Drift: When the source database changes a column type.
- Null Values: When a mandatory field arrives empty.
- Range Violations: When a “Percentage” field arrives with a value of 150.
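Each of these failure modes can be caught with a few lines of plain pandas before bad rows propagate downstream. A minimal sketch (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@x.com", None, "c@x.com"],   # null in a mandatory field
    "completion_pct": [85.0, 42.0, 150.0],   # range violation
})

# Schema drift: fail fast if a column's dtype changed at the source
assert df["completion_pct"].dtype == "float64", "schema drift on completion_pct"

# Null values: count empties in mandatory fields
null_emails = int(df["email"].isna().sum())

# Range violations: a percentage must stay within [0, 100]
out_of_range = df[(df["completion_pct"] < 0) | (df["completion_pct"] > 100)]
```

Hand-rolled checks like these work, but they scatter quickly across a codebase, which is exactly the problem the libraries below solve.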
🚀 2. Pandera: The Developer’s Choice
Pandera is a lightweight validation library for Pandas and Polars.
```python
import pandas as pd
import pandera as pa

# Define a schema: one Column per field, with type and value checks
schema = pa.DataFrameSchema({
    "age": pa.Column(int, pa.Check.in_range(0, 120)),
    "email": pa.Column(str, pa.Check.str_matches(r".+@.+\..+")),
    "salary": pa.Column(float, pa.Check.greater_than(0)),
})

# Validate your DataFrame; raises pa.errors.SchemaError on failure
df = pd.DataFrame({"age": [29], "email": ["a@b.com"], "salary": [50000.0]})
validated_df = schema.validate(df)
```

📦 3. Great Expectations: The Enterprise Solution
Great Expectations is a robust framework for documenting and validating data. It creates interactive HTML reports (“Data Docs”) for your stakeholders.
Key Concepts:
- Expectations: Assertions like `expect_column_values_to_not_be_null`.
- Checkpoints: A specific run of a suite of expectations against a batch of data.
- Validation Results: The outcome of the check (Pass/Fail).
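The relationship between these three concepts can be sketched in plain Python. Note this mirrors the ideas only, not the actual Great Expectations API; all names below are illustrative:

```python
import pandas as pd

# An "expectation" as a plain function: one assertion, returning its validation result
def expect_column_values_to_not_be_null(df, column):
    failed = int(df[column].isna().sum())
    return {"expectation": f"not_null({column})", "success": failed == 0, "failed_rows": failed}

def expect_column_values_to_be_between(df, column, low, high):
    failed = int((~df[column].between(low, high)).sum())
    return {"expectation": f"between({column})", "success": failed == 0, "failed_rows": failed}

# A "checkpoint": run a suite of expectations against one batch of data
def run_checkpoint(batch, suite):
    results = [check(batch) for check in suite]
    # The "validation result": overall Pass/Fail plus per-expectation detail
    return {"success": all(r["success"] for r in results), "results": results}

batch = pd.DataFrame({"age": [29, None, 45], "pct": [85.0, 42.0, 150.0]})
suite = [
    lambda df: expect_column_values_to_not_be_null(df, "age"),
    lambda df: expect_column_values_to_be_between(df, "pct", 0, 100),
]
validation_result = run_checkpoint(batch, suite)
```

In the real framework, the per-expectation results are also rendered into the HTML "Data Docs" mentioned above.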
🚦 4. Data Quality Best Practices
- Unit Testing: Write tests for your transformation logic.
- Integration Testing: Run checks on a sample of production data before full execution.
- Alerting: Integrate failed quality checks with Slack or PagerDuty.
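As a sketch of the alerting step, here is a helper that turns failed quality checks into a Slack webhook payload. The webhook URL, pipeline name, and check names are placeholders, and the HTTP call is commented out so the sketch has no network dependency:

```python
import json
# from urllib.request import urlopen, Request  # used for the actual POST

# Placeholder: in practice, load this from an environment variable or secret store
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def build_alert(pipeline, failed_checks):
    """Format failed quality checks as a Slack-style message payload."""
    lines = "\n".join(f"• {name}: {detail}" for name, detail in failed_checks)
    return {"text": f":rotating_light: Data quality failure in `{pipeline}`\n{lines}"}

def send_alert(payload):
    body = json.dumps(payload)
    # Real send, omitted here:
    # req = Request(SLACK_WEBHOOK_URL, data=body.encode(),
    #               headers={"Content-Type": "application/json"})
    # urlopen(req)
    return body

failed = [("age_in_range", "3 rows outside [0, 120]"),
          ("email_not_null", "12 null values")]
alert = build_alert("daily_users_load", failed)
```

Wiring this into the pipeline means calling `send_alert` whenever a validation result comes back with `success == False`, rather than letting the run fail silently.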