DuckDB for High-Performance ETL
🦆 DuckDB: The ETL Powerhouse
DuckDB is an in-process SQL OLAP database management system. It is designed for fast analytical queries and is often called the “SQLite for Analytics.” In Data Engineering, it is a game-changer for building High-Performance ETL pipelines that run locally or in serverless environments (like AWS Lambda).
🟢 Phase 1: Foundations (The “Zero-Copy” Engine)
1. Why DuckDB?
- In-Process: No server to manage. Just `import duckdb`.
- Columnar Engine: Optimized for analytical queries (aggregations, joins).
- Zero-Copy Integration: Directly query Pandas, Polars, and Arrow data without copying it into DuckDB.
```python
import duckdb
import pandas as pd

# 1. Query a Pandas DataFrame directly
df = pd.DataFrame({"id": [1, 2], "val": [10, 20]})
result = duckdb.query("SELECT * FROM df WHERE val > 15").to_df()
```

🟡 Phase 2: Intermediate (Extract & Transform)
2. Extracting from Files (Parquet, CSV, JSON)
DuckDB can query files directly on disk. It automatically detects schemas and handles compression.
```python
# Query a Parquet file directly
duckdb.query("SELECT * FROM 'data/*.parquet' LIMIT 10")

# Query a CSV with automatic header detection
duckdb.query("SELECT * FROM read_csv_auto('data.csv')")
```

3. Transformation with SQL
DuckDB supports a modern, developer-friendly SQL dialect (e.g., `SELECT * EXCLUDE (...)`, `FROM` before `SELECT`).
```python
# Modern SQL: drop sensitive columns and add a normalized score
transformed = duckdb.sql("""
    FROM 'raw_data.parquet'
    SELECT
        * EXCLUDE (ssn, phone),
        (score / 100) AS normalized_score
    WHERE status = 'active'
""")
```

🟠 Phase 3: Expert (The “Out-of-Core” Pipeline)
4. Handling Datasets Larger than RAM
DuckDB’s execution engine is streaming-native. It can process datasets much larger than your available memory by “spilling to disk” when necessary.
```python
# Set memory limit to 4GB even if the file is 100GB
duckdb.execute("SET memory_limit = '4GB'")

# DuckDB will stream this file and process it in chunks
duckdb.execute("""
    COPY (
        SELECT category, SUM(value)
        FROM read_csv_auto('huge_dataset.csv')
        GROUP BY 1
    ) TO 'summary.parquet' (FORMAT PARQUET)
""")
```

5. Connecting to External Databases (Postgres/MySQL)
You can use DuckDB as a “bridge” between different databases.
```python
# Attach a Postgres DB
duckdb.execute("INSTALL postgres; LOAD postgres;")
duckdb.execute("ATTACH 'host=localhost dbname=mydb' AS my_pg (TYPE postgres)")

# Copy data from Postgres to a local Parquet file
duckdb.execute("COPY (SELECT * FROM my_pg.users) TO 'users.parquet'")
```

🔴 Phase 4: Senior Architect (The Complete ETL Pattern)
6. The “DuckDB ETL” Script
Here is a complete, production-ready ETL pattern:
```python
import duckdb

def run_etl():
    # 1. Initialize Connection
    con = duckdb.connect("prod_warehouse.db")

    # 2. Extract & Clean (Using SQL for speed)
    # Reading from s3:// requires the httpfs extension (plus S3 credentials)
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute("""
        CREATE OR REPLACE VIEW cleaned_data AS
        SELECT
            user_id,
            UPPER(country) AS country,
            CAST(timestamp AS TIMESTAMP) AS event_time
        FROM 's3://my-bucket/events/*.parquet'
        WHERE event_type = 'purchase'
    """)

    # 3. Transform (Aggregations)
    con.execute("""
        CREATE OR REPLACE TABLE daily_sales AS
        SELECT
            event_time::DATE AS date,
            country,
            COUNT(*) AS total_orders
        FROM cleaned_data
        GROUP BY ALL
    """)

    # 4. Load (Export to shared storage)
    con.execute("COPY daily_sales TO 'outputs/daily_sales.parquet' (FORMAT PARQUET)")
    print("ETL Job Complete!")

if __name__ == "__main__":
    run_etl()
```

🚦 Summary Toolset
- S3 Support: Install the `httpfs` extension to query S3 directly.
- Python Integration: Use `.to_df()`, `.to_arrow_table()`, or `.pl()` (Polars) to export results.
- Efficiency: Use `SUMMARIZE` to quickly profile your data: `duckdb.sql("SUMMARIZE my_data").show()`.
🚀 Hands-on Recipe: CSV-to-CSV Aggregation Pipeline
This is a common Data Engineering pattern: Extracting raw data from a CSV, performing heavy aggregations, and exporting the summary back to a CSV for a BI tool or another team.
1. The Scenario
We have a 10GB sales_data.csv with columns: order_id, product_id, category, price, and order_date. We want to calculate the total revenue and order count per category and save it to category_summary.csv.
2. The Solution (Python + DuckDB SQL)
```python
import duckdb
import time

def run_csv_to_csv_etl(input_path: str, output_path: str):
    start_time = time.time()

    # 1. Connect (using ':memory:' or a file-based DB)
    con = duckdb.connect()
    print(f"🚀 Starting ETL for {input_path}...")

    # 2. Extract, Transform, and Load in ONE streaming operation
    # DuckDB will stream the CSV, aggregate in memory, and stream out to the new CSV
    con.execute(f"""
        COPY (
            SELECT
                category,
                COUNT(order_id) AS total_orders,
                SUM(price) AS total_revenue,
                AVG(price) AS avg_order_value,
                MIN(order_date) AS first_order,
                MAX(order_date) AS last_order
            FROM read_csv_auto('{input_path}')
            GROUP BY category
            ORDER BY total_revenue DESC
        ) TO '{output_path}' (FORMAT CSV, HEADER);
    """)

    end_time = time.time()
    print(f"✅ ETL Complete in {end_time - start_time:.2f} seconds.")
    print(f"📂 Result saved to: {output_path}")

if __name__ == "__main__":
    # Example usage
    run_csv_to_csv_etl("raw_sales.csv", "category_summary.csv")
```

3. Why this is superior to Pandas:
- Memory Efficiency: DuckDB streams the CSV. It does not load the whole 10GB file into RAM. It only keeps the unique categories and their running totals in memory.
- Type Inference: `read_csv_auto` is faster and more accurate at inferring column types than Pandas.
- Parallelism: DuckDB will automatically use all your CPU cores to parse the CSV and aggregate the data.
- SQL Simplicity: Complex aggregations are often more readable in SQL than in multi-line Pandas method chains.
🛠️ Advanced Optimization: “The Parquet Bridge”
If you plan to run multiple aggregations, convert the CSV to Parquet first. Parquet is columnar and 10-50x faster to read than CSV.
```python
# Convert CSV to Parquet once
con.execute("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)")

# Run multiple aggregations on the Parquet file
con.sql("SELECT category, SUM(price) FROM 'data.parquet' GROUP BY 1").show()
```