
🧠 Memory Profiling & Leaks

For long-running data pipelines, monitoring memory usage is essential to prevent out-of-memory (OOM) errors.


๐Ÿ—๏ธ 1. Why Memory Leaks Happen in Python

Despite garbage collection, memory leaks can occur due to:

  • Global Variables: Holding references to large objects.
  • Unclosed Resources: Open file handles or network connections.
  • Circular References: Objects that reference each other; reference counting alone cannot reclaim them, so they linger until the cyclic garbage collector runs.
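The circular-reference case can be demonstrated with the standard gc module; a minimal sketch (the Node class and make_cycle function are illustrative):

```python
import gc

class Node:
    def __init__(self):
        self.other = None  # will hold a reference to another Node

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a -> b -> a: a reference cycle
    # Both objects become unreachable when this function returns,
    # but reference counting alone never frees them.

make_cycle()
collected = gc.collect()  # the cyclic collector reclaims the pair
print(f"unreachable objects collected: {collected}")
```

Reference counting handles most deallocation immediately; the cyclic collector exists precisely for patterns like this one.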

๐Ÿ› ๏ธ 2. Memory Profiling Tools

memory_profiler

Tracks memory usage line by line. Decorate the function you want to inspect with @profile, then run the script with python -m memory_profiler your_script.py.

from memory_profiler import profile

@profile
def heavy_processing():
    data = [i for i in range(1_000_000)]
    return data

tracemalloc

Built-in Python library to track memory allocations.

import tracemalloc

tracemalloc.start()
# Your data processing here...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:5]:
    print(stat)
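Beyond a single snapshot, tracemalloc can diff two snapshots to show where memory grew between two points in a program, which is a practical way to localize a leak. A sketch, with the list comprehension standing in for real pipeline work:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Suspect code: simulate growing memory usage
leaked = [list(range(1_000)) for _ in range(100)]

after = tracemalloc.take_snapshot()
# Entries with a positive size_diff show where allocations grew
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

Running the before/after diff repeatedly (e.g. once per pipeline batch) makes steadily growing lines stand out.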

💾 3. Handling Large Datasets

When data doesn't fit in RAM:

  • Chunking: Process files in small chunks (e.g., pd.read_csv(..., chunksize=1000)).
  • Generators: Use yield to iterate through rows without loading all of them into memory.
  • On-Disk Processing: Use Polars LazyFrames or DuckDB for out-of-core processing.
# Streaming generator to save memory
def read_large_file(file_path):
    with open(file_path) as f:
        for line in f:
            yield line.strip()
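Chunking does not require pandas; the same pattern works with only the standard library. A sketch that aggregates a large CSV in fixed-size chunks (process_in_chunks and chunk_size are illustrative names):

```python
import csv
from itertools import islice

def process_in_chunks(file_path, chunk_size=1000):
    """Count data rows of a CSV without loading the whole file."""
    total_rows = 0
    with open(file_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            total_rows += len(chunk)  # replace with real per-chunk work
    return total_rows
```

Peak memory stays proportional to chunk_size rather than to the file size, which is the whole point of the technique.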

🧹 4. Best Practices for Memory Efficiency

  1. Delete Large Objects: del my_large_df drops your reference; the memory is actually freed only once no other references remain.
  2. Clear Cache: If using @lru_cache, clear it periodically if it grows too large.
  3. Use Typed Containers: Prefer NumPy arrays or array.array over lists of objects.
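The typed-container advice can be checked with sys.getsizeof; a sketch comparing a list of boxed ints to an array.array holding the same values:

```python
import array
import sys

n = 100_000
as_list = list(range(n))               # one heap object per element
as_array = array.array("q", range(n))  # raw 64-bit ints in one buffer

# For the list, count the container plus every boxed int it references
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in as_list)
array_bytes = sys.getsizeof(as_array)

print(f"list : {list_bytes:,} bytes")
print(f"array: {array_bytes:,} bytes")  # several times smaller
```

NumPy arrays give the same contiguous-buffer saving plus vectorized operations, which is why they are the default choice in data pipelines.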