Advanced Stream Processing with Apache Flink

While Kafka is the β€œStorage” for streams, Apache Flink is the β€œCompute.” It is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.


πŸ—οΈ Core Concept: Stateful Streams

Unlike Spark (which traditionally uses micro-batching), Flink processes every event as it arrives (true streaming). It maintains "state" locally (in memory or on disk via RocksDB), allowing you to perform calculations that span events over time.

Why is State important?

  • Example: Calculating a moving average of stock prices over the last 5 minutes.
  • Example: Detecting if a credit card is used in two different cities within 10 minutes (Fraud Detection).
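The fraud-detection example can be sketched without any framework: the "state" is simply the last city and timestamp seen per card. In real Flink code this would live in keyed state (e.g., `ValueState` inside a `KeyedProcessFunction`); here a plain dict plays that role, and all names are illustrative:

```python
# Framework-free sketch of keyed state for fraud detection.
# In Flink, `last_seen` would be ValueState scoped to the card_id key.

TEN_MINUTES = 10 * 60  # seconds

def detect_fraud(events):
    """Flag a card used in two different cities within 10 minutes."""
    last_seen = {}  # card_id -> (city, timestamp): the per-key "state"
    alerts = []
    for card_id, city, ts in events:
        prev = last_seen.get(card_id)
        if prev and prev[0] != city and ts - prev[1] <= TEN_MINUTES:
            alerts.append((card_id, prev[0], city))
        last_seen[card_id] = (city, ts)
    return alerts

events = [
    ("card-1", "Berlin", 0),
    ("card-2", "Paris", 60),
    ("card-1", "Madrid", 300),  # Berlin -> Madrid in 5 min: suspicious
    ("card-2", "Paris", 900),   # same city: fine
]
print(detect_fraud(events))  # [('card-1', 'Berlin', 'Madrid')]
```

The point of the sketch: without state that outlives a single event, the Madrid swipe looks perfectly normal on its own.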

πŸš€ Key Features

1. Event Time vs. Processing Time

  • Event Time: The time the action actually happened (e.g., in the mobile app). Flink uses Watermarks to handle data that arrives late or out of order.
  • Processing Time: The time the event arrived at the Flink cluster.
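The watermark mechanics can be modeled in a few lines: a bounded-out-of-orderness watermark trails the largest event time seen so far by a fixed delay, and an event counts as "late" if its timestamp is already behind the watermark. This is an illustrative model of the concept, not Flink's actual API:

```python
# Illustrative model of a bounded-out-of-orderness watermark:
# watermark = (max event time seen so far) - (allowed lateness).

class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness):
        self.delay = max_out_of_orderness
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        """Advance the watermark; return (watermark, is_late)."""
        is_late = event_ts < self.max_ts - self.delay
        self.max_ts = max(self.max_ts, event_ts)
        return self.max_ts - self.delay, is_late

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness=5)
print(wm.on_event(100))  # (95, False)
print(wm.on_event(103))  # (98, False)  slightly out of order is fine
print(wm.on_event(92))   # (98, True)   older than the watermark: late
```

Flink uses the watermark to decide when an event-time window is "complete" and can fire, even though stragglers may still be in flight.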

2. Exactly-Once Guarantees

Flink uses a distributed snapshot algorithm (asynchronous barrier snapshotting, a variant of Chandy-Lamport) to take consistent checkpoints of the entire pipeline's state. If a server crashes, Flink restores the last checkpoint and replays the source from that point, guaranteeing that every event affects the state exactly once.
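In PyFlink this guarantee is enabled on the execution environment. The snippet below is a minimal configuration sketch; the 10-second interval is a placeholder value, not a recommendation:

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Take a distributed snapshot every 10 seconds (placeholder interval)
env.enable_checkpointing(10000)  # milliseconds
env.get_checkpoint_config().set_checkpointing_mode(
    CheckpointingMode.EXACTLY_ONCE)
```

Note that end-to-end exactly-once also depends on the sink (e.g., transactional Kafka writes); checkpointing alone covers Flink's internal state.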

3. Windowing

  • Tumbling Windows: Fixed-size, non-overlapping (e.g., Every 1 minute).
  • Sliding Windows: Overlapping (e.g., Every 1 minute, but calculated every 10 seconds).
  • Session Windows: Grouped by activity (e.g., a user session starts when they log in and ends after 30 minutes of inactivity).
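The difference between tumbling and sliding windows comes down to how a timestamp maps to window start times: a tumbling window contains each event exactly once, while a sliding window of length `size` advancing every `slide` contains it `size / slide` times. A framework-free sketch of that assignment (Flink's window assigners do the equivalent internally):

```python
def tumbling_window(ts, size):
    """The single window [start, start + size) containing ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """All windows of length `size`, starting every `slide`, containing ts."""
    last_start = ts - ts % slide
    return [(s, s + size)
            for s in range(last_start, ts - size, -slide)][::-1]

print(tumbling_window(13, 10))     # (10, 20)
print(sliding_windows(13, 10, 5))  # [(5, 15), (10, 20)]
```

With `size=10` and `slide=5`, every event lands in two overlapping windows, which is exactly what the sensor example below relies on.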

This example shows how to calculate the average temperature of a sensor over a 10-second sliding window, updated every 5 seconds.

```python
from pyflink.common import WatermarkStrategy, Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Define the source (e.g., Kafka); `my_kafka_source` is assumed to be a
# pre-built KafkaSource
ds = env.from_source(
    source=my_kafka_source,
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="Kafka Source"
)

# Processing logic: accumulate a sum and a count inside the window, then
# divide once per window result. (A naive reduce that averages two values
# at a time would weight earlier readings incorrectly.)
ds.key_by(lambda x: x['device_id']) \
  .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))) \
  .reduce(lambda a, b: {'device_id': a['device_id'],
                        'temp': a['temp'] + b['temp'],
                        'count': a.get('count', 1) + b.get('count', 1)}) \
  .map(lambda r: {'device_id': r['device_id'],
                  'avg_temp': r['temp'] / r.get('count', 1)}) \
  .print()

env.execute("Sensor Sliding Window")
```

| Feature          | Apache Flink               | Spark Streaming               |
| ---------------- | -------------------------- | ----------------------------- |
| Latency          | Milliseconds (real-time)   | Seconds (micro-batch)         |
| State Management | Native, high-performance   | External (checkpoints)        |
| Order Handling   | Excellent (watermarks)     | Limited                       |
| Best Use Case    | Fraud, low-latency alerts  | ETL, batch/stream unification |

πŸ’‘ Engineering Takeaway

Use Flink when every millisecond matters or when you have complex logic that depends on the history of the stream (Stateful logic).