
Step 2: Python for ETL

Data engineers use Python to “extract” data from APIs or flat files, “transform” it into a usable format, and “load” it into a storage system: the three steps that give ETL its name.


🛠️ Code Example: API to CSV

This script fetches data from a public API, cleans it, and saves it to a local CSV file.

import requests
import pandas as pd

def fetch_and_save_data():
    # 1. Extract: pull raw JSON from the API
    url = "https://jsonplaceholder.typicode.com/posts"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    data = response.json()

    # 2. Transform: keep only the columns we need and normalize the text
    df = pd.DataFrame(data)
    df = df[['id', 'userId', 'title']]
    df['title'] = df['title'].str.upper()

    # 3. Load: write the cleaned data to a local CSV
    df.to_csv('cleaned_posts.csv', index=False)
    print(f"Successfully saved {len(df)} rows to CSV.")

if __name__ == "__main__":
    fetch_and_save_data()

🏗️ Core Libraries

  1. Requests: For fetching data from the web.
  2. Pandas: The “Excel for Python” – used for cleaning and filtering data.
  3. Pydantic: Used for ensuring the data fits a specific schema (Validation).
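As a quick sketch of what Pydantic validation might look like for the posts payload used above — the `Post` model and its fields are illustrative assumptions, not part of the original script:

```python
from pydantic import BaseModel, ValidationError

class Post(BaseModel):
    id: int
    userId: int
    title: str

# A well-formed record passes validation and becomes a typed object.
record = {"id": 1, "userId": 1, "title": "hello world"}
post = Post(**record)
print(post.title)

# A malformed record raises ValidationError instead of silently flowing downstream.
try:
    Post(id="not-a-number", userId=1, title="x")
except ValidationError:
    print("Validation failed")
```

Validating each record at the “transform” step means schema problems surface immediately, rather than as corrupt rows in your CSV.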

🥅 Your Goal

  • Create a virtual environment using uv venv.
  • Write a script that reads a JSON file and filters out rows with null values.
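One possible shape for that script — the file names and sample data here are stand-ins, so substitute your own JSON file:

```python
import json
import pandas as pd

# Hypothetical sample input; in practice you would start from an existing file.
sample = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "carol"},
]
with open("data.json", "w") as f:
    json.dump(sample, f)

# Read the JSON file, drop rows containing any null value, and save the result.
df = pd.read_json("data.json")
clean = df.dropna()
clean.to_json("clean_data.json", orient="records")
print(f"Kept {len(clean)} of {len(df)} rows.")
```

`dropna()` removes a row if any of its columns is null; pass `subset=[...]` to only check specific columns.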