
🧠 Memory Profiling & Leaks

For long-running data pipelines, monitoring memory usage is essential to prevent out-of-memory (OOM) errors.


๐Ÿ—๏ธ 1. Why Memory Leaks Happen in Python

Despite garbage collection, memory leaks can occur due to:

  • Global Variables: Holding references to large objects.
  • Unclosed Resources: Open file handles or network connections.
  • Circular References: Objects that reference each other; reference counting alone cannot reclaim them, so they linger until the cyclic garbage collector runs.
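The circular-reference case can be demonstrated with the standard gc module; a minimal sketch (the Node class and make_cycle function are illustrative):

```python
import gc

class Node:
    def __init__(self):
        self.other = None  # will hold a reference to another Node

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a -> b -> a: a reference cycle
    # Both objects become unreachable when this function returns,
    # but reference counting alone never frees them.

make_cycle()
collected = gc.collect()  # the cyclic collector reclaims the pair
print(f"unreachable objects collected: {collected}")
```

Reference counting handles most deallocation immediately; the cyclic collector exists precisely for patterns like this one.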

๐Ÿ› ๏ธ 2. Memory Profiling Tools

memory_profiler

Tracks memory usage line by line. Decorate the function you want to inspect with @profile, then run the script with python -m memory_profiler your_script.py.

from memory_profiler import profile

@profile
def heavy_processing():
    data = [i for i in range(1_000_000)]
    return data

tracemalloc

Built-in Python library to track memory allocations.

import tracemalloc

tracemalloc.start()
# Your data processing here...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:5]:
    print(stat)
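Beyond a single snapshot, tracemalloc can diff two snapshots to show where memory grew between two points in a program, which is a practical way to localize a leak. A sketch, with the list comprehension standing in for real pipeline work:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Suspect code: simulate growing memory usage
leaked = [list(range(1_000)) for _ in range(100)]

after = tracemalloc.take_snapshot()
# Entries with a positive size_diff show where allocations grew
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

Running the before/after diff repeatedly (e.g. once per pipeline batch) makes steadily growing lines stand out.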

💾 3. Handling Large Datasets

When data doesn't fit in RAM:

  • Chunking: Process files in small chunks (e.g., pd.read_csv(..., chunksize=1000)).
  • Generators: Use yield to iterate through rows without loading all of them into memory.
  • On-Disk Processing: Use Polars LazyFrames or DuckDB for out-of-core processing.
# Streaming generator to save memory
def read_large_file(file_path):
    with open(file_path) as f:
        for line in f:
            yield line.strip()
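Chunking does not require pandas; the same pattern works with only the standard library. A sketch that aggregates a large CSV in fixed-size chunks (process_in_chunks and chunk_size are illustrative names):

```python
import csv
from itertools import islice

def process_in_chunks(file_path, chunk_size=1000):
    """Count data rows of a CSV without loading the whole file."""
    total_rows = 0
    with open(file_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            total_rows += len(chunk)  # replace with real per-chunk work
    return total_rows
```

Peak memory stays proportional to chunk_size rather than to the file size, which is the whole point of the technique.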

🧹 4. Best Practices for Memory Efficiency

  1. Delete Large Objects: del my_large_df drops your reference; the memory is actually freed only once no other references remain.
  2. Clear Cache: If using @lru_cache, clear it periodically if it grows too large.
  3. Use Typed Containers: Prefer NumPy arrays or array.array over lists of objects.
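The typed-container advice can be checked with sys.getsizeof; a sketch comparing a list of boxed ints to an array.array holding the same values:

```python
import array
import sys

n = 100_000
as_list = list(range(n))               # one heap object per element
as_array = array.array("q", range(n))  # raw 64-bit ints in one buffer

# For the list, count the container plus every boxed int it references
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in as_list)
array_bytes = sys.getsizeof(as_array)

print(f"list : {list_bytes:,} bytes")
print(f"array: {array_bytes:,} bytes")  # several times smaller
```

NumPy arrays give the same contiguous-buffer saving plus vectorized operations, which is why they are the default choice in data pipelines.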