Data Modeling: Star Schema and Beyond
📐 Data Modeling: Star Schema and Beyond
Data modeling is the process of defining how data is stored and how different data points relate to each other. In analytics, the goal is to optimize for read performance and usability.
🏛️ 1. Dimensional Modeling (Kimball)
The industry standard for data warehousing. It focuses on two types of tables:
A. Fact Tables
Store the quantitative data about a business process (e.g., a sale, a click).
- Contains: Foreign keys to dimensions and numerical measures (Amount, Quantity).
- Shape: Usually very tall (billions of rows) but narrow.
B. Dimension Tables
Store the descriptive context (e.g., Product Name, Customer City, Date).
- Contains: Primary keys and text descriptions.
- Shape: Usually short but wide (many columns).
🏗️ 2. The Star Schema
The simplest and most efficient model for analytical queries.
- One central Fact Table.
- Directly connected to multiple Dimension Tables.
- ✅ Pros: Extremely fast joins; easy for BI tools and users to understand.
🏗️ 3. The Snowflake Schema
A variation of the Star Schema where dimension tables are normalized.
- Example: A
Productdimension links to aCategorydimension. - ❌ Cons: More complex joins; slower query performance.
- ✅ Pros: Saves storage space (rarely an issue in modern warehouses).
⚡ 4. Advanced: Data Vault & One Big Table (OBT)
- Data Vault: A highly scalable model for agile, enterprise-wide warehousing. Focuses on Hubs, Links, and Satellites.
- OBT: Denormalizing everything into a single, massive table. Used in modern cloud warehouses (like BigQuery) to eliminate joins entirely.
🧪 5. Top Interview Questions
- What is the difference between a Fact and a Dimension?
- What is an “Additive” vs. “Non-additive” measure?
- How do you handle “Slowly Changing Dimensions” (SCD)?
🏁 Summary: Best Practices
- Star Schema by Default: For 90% of BI use cases, the Star Schema is the correct choice.
- Surrogate Keys: Always use a meaningless integer or hash as a primary key instead of using business keys (like email).
- Time Dimension: Always have a dedicated Date/Time dimension table to handle complex holiday or fiscal calendar logic.