DuckDB for High-Performance ETL

🦆 DuckDB: The ETL Powerhouse

DuckDB is an in-process SQL OLAP database management system. It is designed for fast analytical queries and is often called the “SQLite for Analytics.” In Data Engineering, it is a game-changer for building High-Performance ETL pipelines that run locally or in serverless environments (like AWS Lambda).


🟢 Phase 1: Foundations (The “Zero-Copy” Engine)

1. Why DuckDB?

  • In-Process: No server to manage. Just import duckdb.
  • Columnar Engine: Optimized for analytical queries (aggregations, joins).
  • Zero-Copy Integration: Directly query Pandas, Polars, and Arrow data without copying it into DuckDB.
import duckdb
import pandas as pd

# 1. Query a Pandas DataFrame directly
df = pd.DataFrame({"id": [1, 2], "val": [10, 20]})
result = duckdb.query("SELECT * FROM df WHERE val > 15").to_df()

🟡 Phase 2: Intermediate (Extract & Transform)

2. Extracting from Files (Parquet, CSV, JSON)

DuckDB can query files directly on disk. It automatically detects schemas and handles compression.

# Query a Parquet file directly
duckdb.query("SELECT * FROM 'data/*.parquet' LIMIT 10")

# Query a CSV with automatic header detection
duckdb.query("SELECT * FROM read_csv_auto('data.csv')")

3. Transformation with SQL

DuckDB supports a modern, developer-friendly SQL dialect (e.g., SELECT * EXCLUDE, FROM before SELECT).

# Modern SQL: Calculate average score excluding sensitive columns
transformed = duckdb.sql("""
    FROM 'raw_data.parquet'
    SELECT 
        * EXCLUDE (ssn, phone),
        (score / 100) AS normalized_score
    WHERE status = 'active'
""")

🟠 Phase 3: Expert (The “Out-of-Core” Pipeline)

4. Handling Datasets Larger than RAM

DuckDB’s execution engine is streaming-native. It can process datasets much larger than your available memory by “spilling to disk” when necessary.

# Set memory limit to 4GB even if the file is 100GB
duckdb.execute("SET memory_limit = '4GB'")

# DuckDB will stream this file and process it in chunks
duckdb.execute("""
    COPY (
        SELECT category, SUM(value) 
        FROM read_csv_auto('huge_dataset.csv')
        GROUP BY 1
    ) TO 'summary.parquet' (FORMAT PARQUET)
""")

5. Connecting to External Databases (Postgres/MySQL)

You can use DuckDB as a “bridge” between different databases.

# Attach a Postgres DB
duckdb.execute("INSTALL postgres; LOAD postgres;")
duckdb.execute("ATTACH 'host=localhost dbname=mydb' AS my_pg (TYPE postgres)")

# Copy data from Postgres to a local Parquet file
duckdb.execute("COPY (SELECT * FROM my_pg.users) TO 'users.parquet'")

🔴 Phase 4: Senior Architect (The Complete ETL Pattern)

6. The “DuckDB ETL” Script

Here is a complete, production-ready ETL pattern:

import duckdb

def run_etl():
    # 1. Initialize Connection
    con = duckdb.connect("prod_warehouse.db")
    
    # Reading from S3 requires the httpfs extension
    con.execute("INSTALL httpfs; LOAD httpfs;")
    
    # 2. Extract & Clean (Using SQL for speed)
    con.execute("""
        CREATE OR REPLACE VIEW cleaned_data AS
        SELECT 
            user_id,
            UPPER(country) as country,
            CAST("timestamp" AS TIMESTAMP) as event_time
        FROM 's3://my-bucket/events/*.parquet'
        WHERE event_type = 'purchase'
    """)
    
    # 3. Transform (Aggregations)
    con.execute("""
        CREATE TABLE daily_sales AS
        SELECT 
            event_time::DATE as date,
            country,
            COUNT(*) as total_orders
        FROM cleaned_data
        GROUP BY ALL
    """)
    
    # 4. Load (Export to shared storage)
    con.execute("COPY daily_sales TO 'outputs/daily_sales.parquet' (FORMAT PARQUET)")
    con.close()
    
    print("ETL Job Complete!")

if __name__ == "__main__":
    run_etl()

🚦 Summary Toolset

  • S3 Support: Install httpfs extension to query S3 directly.
  • Python Integration: Use .to_df(), .to_arrow_table(), or .pl() (Polars) to export results.
  • Efficiency: Use SUMMARIZE to quickly profile your data: duckdb.sql("SUMMARIZE my_data").show().

🚀 Hands-on Recipe: CSV-to-CSV Aggregation Pipeline

This is a common Data Engineering pattern: extract raw data from a CSV, perform heavy aggregations, and export the summary back to a CSV for a BI tool or another team.

1. The Scenario

We have a 10GB sales_data.csv with columns: order_id, product_id, category, price, and order_date. We want to calculate the total revenue and order count per category and save it to category_summary.csv.

2. The Solution (Python + DuckDB SQL)

import duckdb
import time

def run_csv_to_csv_etl(input_path: str, output_path: str):
    start_time = time.time()
    
    # 1. Connect (using ':memory:' or a file-based DB)
    con = duckdb.connect()
    
    print(f"🚀 Starting ETL for {input_path}...")

    # 2. Extract, Transform, and Load in ONE streaming operation
    # DuckDB will stream the CSV, aggregate in memory, and stream out to the new CSV
    con.execute(f"""
        COPY (
            SELECT 
                category,
                COUNT(order_id) as total_orders,
                SUM(price) as total_revenue,
                AVG(price) as avg_order_value,
                MIN(order_date) as first_order,
                MAX(order_date) as last_order
            FROM read_csv_auto('{input_path}')
            GROUP BY category
            ORDER BY total_revenue DESC
        ) TO '{output_path}' (FORMAT CSV, HEADER);
    """)
    
    end_time = time.time()
    print(f"✅ ETL Complete in {end_time - start_time:.2f} seconds.")
    print(f"📂 Result saved to: {output_path}")

if __name__ == "__main__":
    # Example usage
    run_csv_to_csv_etl("sales_data.csv", "category_summary.csv")

3. Why this is superior to Pandas:

  1. Memory Efficiency: DuckDB streams the CSV. It does not load the whole 10GB file into RAM. It only keeps the unique categories and their running totals in memory.
  2. Type Inference: DuckDB's parallel CSV sniffer is generally faster and more robust at inferring column types than the Pandas reader.
  3. Parallelism: DuckDB will automatically use all your CPU cores to parse the CSV and aggregate the data.
  4. SQL Simplicity: Complex aggregations are often more readable in SQL than in multi-line Pandas method chains.

🛠️ Advanced Optimization: “The Parquet Bridge”

If you plan to run multiple aggregations, convert the CSV to Parquet first. Parquet is columnar and compressed, and is typically 10-50x faster to scan than CSV.

import duckdb

con = duckdb.connect()

# Convert CSV to Parquet once
con.execute("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)")

# Run multiple aggregations on the Parquet file
con.sql("SELECT category, SUM(price) FROM 'data.parquet' GROUP BY 1").show()
</con.sql>