Advanced Stream Processing with Apache Flink

While Kafka is the β€œStorage” for streams, Apache Flink is the β€œCompute.” It is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.


πŸ—οΈ Core Concept: Stateful Streams

Unlike Spark (which traditionally uses micro-batching), Flink processes every event as it arrives (true streaming). It maintains "state" locally (in memory or on disk via RocksDB), allowing you to perform calculations that span events over time.

Why is State important?

  • Example: Calculating a moving average of stock prices over the last 5 minutes.
  • Example: Detecting if a credit card is used in two different cities within 10 minutes (Fraud Detection).
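The fraud-detection example can be sketched without any framework: the "state" is simply the last city and timestamp seen per card. In real Flink code this would live in keyed state (e.g., `ValueState` inside a `KeyedProcessFunction`); here a plain dict plays that role, and all names are illustrative:

```python
# Framework-free sketch of keyed state for fraud detection.
# In Flink, `last_seen` would be ValueState scoped to the card_id key.

TEN_MINUTES = 10 * 60  # seconds

def detect_fraud(events):
    """Flag a card used in two different cities within 10 minutes."""
    last_seen = {}  # card_id -> (city, timestamp): the per-key "state"
    alerts = []
    for card_id, city, ts in events:
        prev = last_seen.get(card_id)
        if prev and prev[0] != city and ts - prev[1] <= TEN_MINUTES:
            alerts.append((card_id, prev[0], city))
        last_seen[card_id] = (city, ts)
    return alerts

events = [
    ("card-1", "Berlin", 0),
    ("card-2", "Paris", 60),
    ("card-1", "Madrid", 300),  # Berlin -> Madrid in 5 min: suspicious
    ("card-2", "Paris", 900),   # same city: fine
]
print(detect_fraud(events))  # [('card-1', 'Berlin', 'Madrid')]
```

The point of the sketch: without state that outlives a single event, the Madrid swipe looks perfectly normal on its own.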

πŸš€ Key Features

1. Event Time vs. Processing Time

  • Event Time: The time the action actually happened (e.g., in the mobile app). Flink uses Watermarks to handle data that arrives late or out of order.
  • Processing Time: The time the event arrived at the Flink cluster.
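The watermark mechanics can be modeled in a few lines: a bounded-out-of-orderness watermark trails the largest event time seen so far by a fixed delay, and an event counts as "late" if its timestamp is already behind the watermark. This is an illustrative model of the concept, not Flink's actual API:

```python
# Illustrative model of a bounded-out-of-orderness watermark:
# watermark = (max event time seen so far) - (allowed lateness).

class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness):
        self.delay = max_out_of_orderness
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        """Advance the watermark; return (watermark, is_late)."""
        is_late = event_ts < self.max_ts - self.delay
        self.max_ts = max(self.max_ts, event_ts)
        return self.max_ts - self.delay, is_late

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness=5)
print(wm.on_event(100))  # (95, False)
print(wm.on_event(103))  # (98, False)  slightly out of order is fine
print(wm.on_event(92))   # (98, True)   older than the watermark: late
```

Flink uses the watermark to decide when an event-time window is "complete" and can fire, even though stragglers may still be in flight.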

2. Exactly-Once Guarantees

Flink uses a distributed snapshot algorithm (asynchronous barrier snapshotting, a variant of Chandy-Lamport) to take consistent checkpoints of the entire pipeline's state. If a server crashes, Flink restores the last checkpoint and replays the source from that point, guaranteeing that every event affects the state exactly once.
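In PyFlink this guarantee is enabled on the execution environment. The snippet below is a minimal configuration sketch; the 10-second interval is a placeholder value, not a recommendation:

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Take a distributed snapshot every 10 seconds (placeholder interval)
env.enable_checkpointing(10000)  # milliseconds
env.get_checkpoint_config().set_checkpointing_mode(
    CheckpointingMode.EXACTLY_ONCE)
```

Note that end-to-end exactly-once also depends on the sink (e.g., transactional Kafka writes); checkpointing alone covers Flink's internal state.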

3. Windowing

  • Tumbling Windows: Fixed-size, non-overlapping (e.g., Every 1 minute).
  • Sliding Windows: Overlapping (e.g., Every 1 minute, but calculated every 10 seconds).
  • Session Windows: Grouped by activity (e.g., a user session starts when they log in and ends after 30 minutes of inactivity).
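The difference between tumbling and sliding windows comes down to how a timestamp maps to window start times: a tumbling window contains each event exactly once, while a sliding window of length `size` advancing every `slide` contains it `size / slide` times. A framework-free sketch of that assignment (Flink's window assigners do the equivalent internally):

```python
def tumbling_window(ts, size):
    """The single window [start, start + size) containing ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """All windows of length `size`, starting every `slide`, containing ts."""
    last_start = ts - ts % slide
    return [(s, s + size)
            for s in range(last_start, ts - size, -slide)][::-1]

print(tumbling_window(13, 10))     # (10, 20)
print(sliding_windows(13, 10, 5))  # [(5, 15), (10, 20)]
```

With `size=10` and `slide=5`, every event lands in two overlapping windows, which is exactly what the sensor example below relies on.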

This example shows how to calculate the average temperature of a sensor over a 10-second sliding window, updated every 5 seconds.

```python
from pyflink.common import WatermarkStrategy, Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Define the source (e.g., Kafka); `my_kafka_source` is assumed to be a
# pre-built KafkaSource
ds = env.from_source(
    source=my_kafka_source,
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="Kafka Source"
)

# Processing logic: accumulate a sum and a count inside the window, then
# divide once per window result. (A naive reduce that averages two values
# at a time would weight earlier readings incorrectly.)
ds.key_by(lambda x: x['device_id']) \
  .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))) \
  .reduce(lambda a, b: {'device_id': a['device_id'],
                        'temp': a['temp'] + b['temp'],
                        'count': a.get('count', 1) + b.get('count', 1)}) \
  .map(lambda r: {'device_id': r['device_id'],
                  'avg_temp': r['temp'] / r.get('count', 1)}) \
  .print()

env.execute("Sensor Sliding Window")
```

| Feature          | Apache Flink               | Spark Streaming               |
| ---------------- | -------------------------- | ----------------------------- |
| Latency          | Milliseconds (real-time)   | Seconds (micro-batch)         |
| State Management | Native, high-performance   | External (checkpoints)        |
| Order Handling   | Excellent (watermarks)     | Limited                       |
| Best Use Case    | Fraud, low-latency alerts  | ETL, batch/stream unification |

πŸ’‘ Engineering Takeaway

Use Flink when every millisecond matters or when you have complex logic that depends on the history of the stream (Stateful logic).