Step 2: Python for ETL
Data Engineers use Python to “extract” data from APIs or flat files, “transform” it into a usable format, and “load” it into a storage system.
🛠️ Code Example: API to CSV
This script fetches data from a public API, cleans it, and saves it to a local CSV file.
import requests
import pandas as pd


def fetch_and_save_data():
    # 1. Extract: download the posts as JSON
    url = "https://jsonplaceholder.typicode.com/posts"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error status
    data = response.json()

    # 2. Transform
    df = pd.DataFrame(data)
    # Keep only the needed columns and upper-case the titles
    df = df[['id', 'userId', 'title']].copy()
    df['title'] = df['title'].str.upper()

    # 3. Load
    df.to_csv('cleaned_posts.csv', index=False)
    print(f"Successfully saved {len(df)} rows to CSV.")


if __name__ == "__main__":
    fetch_and_save_data()

🏗️ Core Libraries
- Requests: For fetching data from the web.
- Pandas: The “Excel for Python” – used for cleaning and filtering data.
- Pydantic: Used for ensuring the data fits a specific schema (Validation).
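Since Pydantic does not appear in the script above, here is a minimal sketch of what schema validation could look like. The `Post` model and the `validate_records` helper are illustrative names, not part of the original example; the fields simply mirror the three columns kept in the transform step.

```python
from pydantic import BaseModel, ValidationError


class Post(BaseModel):
    # Hypothetical schema: mirrors the columns kept in the API-to-CSV example
    id: int
    userId: int
    title: str


def validate_records(records):
    """Split raw dicts into validated Post objects and rejected records."""
    valid, rejected = [], []
    for record in records:
        try:
            valid.append(Post(**record))
        except ValidationError:
            # The record is missing a field or has a value of the wrong type
            rejected.append(record)
    return valid, rejected


if __name__ == "__main__":
    sample = [
        {"id": 1, "userId": 1, "title": "first post"},
        {"id": "not-a-number", "userId": 2, "title": "bad id"},
        {"id": 3, "title": "missing userId"},
    ]
    valid, rejected = validate_records(sample)
    print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Collecting rejected records instead of raising on the first failure is one possible design choice; it lets a pipeline load the good rows and inspect the bad ones later.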
🥅 Your Goal
- Create a virtual environment using uv venv.
- Write a script that reads a JSON file and filters out rows with null values.
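As a possible starting point for the second goal, the sketch below uses Pandas to read a JSON file and drop incomplete rows. It assumes the file holds a JSON array of flat objects (for example `[{"id": 1, "title": "a"}, {"id": 2, "title": null}]`); the function name and file paths are placeholders, so adapt them to your own data.

```python
import pandas as pd


def drop_null_rows(input_path, output_path):
    """Read a JSON array of records, drop rows containing nulls, save as CSV."""
    df = pd.read_json(input_path)
    # dropna() removes every row that has at least one null/NaN value
    cleaned = df.dropna()
    cleaned.to_csv(output_path, index=False)
    return cleaned
```

Calling `drop_null_rows("posts.json", "posts_clean.csv")` would write only the complete rows to the CSV. If you only care about nulls in certain columns, `dropna(subset=[...])` narrows the check to those columns.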