# 🧠 Memory Profiling & Leaks
For long-running Data Pipelines, monitoring memory usage is essential to prevent Out Of Memory (OOM) errors.
## 🏗️ 1. Why Memory Leaks Happen in Python
Despite garbage collection, memory leaks can occur due to:
- Global Variables: Holding references to large objects.
- Unclosed Resources: Open file handles or network connections.
- Circular References: Objects that reference each other, which reference counting alone cannot reclaim (Python's cyclic garbage collector must step in).
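The circular-reference case can be seen directly with the built-in `gc` module (a minimal sketch; the `Node` class is hypothetical):

```python
import gc

class Node:
    """Two nodes that reference each other form a cycle."""
    def __init__(self):
        self.partner = None

a, b = Node(), Node()
a.partner, b.partner = b, a  # circular reference
del a, b                     # refcounts never drop to zero on their own

# The cyclic garbage collector finds and reclaims the unreachable pair;
# collect() returns the number of unreachable objects it discovered.
collected = gc.collect()
```

If the cycle collector is disabled (`gc.disable()`) or the objects hold resources that must be released promptly, cycles like this accumulate as a leak.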
## 🛠️ 2. Memory Profiling Tools
### memory_profiler

Allows line-by-line memory usage tracking.

```python
from memory_profiler import profile

@profile
def heavy_processing():
    data = [i for i in range(1_000_000)]
    return data
```

### tracemalloc

Built-in Python library to track memory allocations.
```python
import tracemalloc

tracemalloc.start()

# Your data processing here...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:5]:
    print(stat)
```

## 💾 3. Handling Large Datasets
When data doesn't fit in RAM:

- Chunking: Process files in small chunks (e.g., `pd.read_csv(..., chunksize=1000)`).
- Generators: Use `yield` to iterate through rows without loading all of them into memory.
- On-Disk Processing: Use Polars LazyFrames or DuckDB for out-of-core processing.
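Chunking can also be done with just the standard library. A rough sketch that sums the first column of a CSV while keeping at most `chunk_size` lines in memory (the file path and chunk size here are illustrative):

```python
from itertools import islice

def sum_column_in_chunks(file_path, chunk_size=1000):
    """Sum the first CSV column without loading the whole file."""
    total = 0
    with open(file_path) as f:
        while True:
            # Pull at most chunk_size lines at a time from the file iterator
            chunk = list(islice(f, chunk_size))
            if not chunk:
                break
            total += sum(int(line.split(",")[0]) for line in chunk)
    return total
```

The same pattern underlies `pd.read_csv(..., chunksize=...)`: each iteration yields a bounded DataFrame instead of a bounded list of lines.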
```python
# Streaming generator to save memory
def read_large_file(file_path):
    with open(file_path) as f:
        for line in f:
            yield line.strip()
```

## 🧹 4. Best Practices for Memory Efficiency
- Delete Large Objects: Use `del my_large_df` to free references manually.
- Clear Cache: If using `@lru_cache`, clear it periodically (`my_func.cache_clear()`) if it grows too large.
- Use Typed Containers: Prefer NumPy arrays or `array.array` over lists of objects.
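A quick way to see why typed containers help is to compare per-element overhead with `sys.getsizeof` (a rough sketch; exact byte counts vary by Python build):

```python
import sys
from array import array

n = 100_000
as_list = list(range(n))         # list of pointers to boxed Python ints
as_array = array("q", range(n))  # contiguous buffer of 8-byte signed ints

# A list's size excludes its elements, so add each boxed int's footprint
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in as_list)
array_bytes = sys.getsizeof(as_array)

print(f"list:  {list_bytes:,} bytes")
print(f"array: {array_bytes:,} bytes")
```

The typed array stores raw machine integers back to back, while the list pays for a pointer plus a full Python object per element; NumPy arrays give the same layout advantage with vectorized operations on top.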