Advanced Stream Processing with Apache Flink
While Kafka is the "Storage" for streams, Apache Flink is the "Compute." It is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
Core Concept: Stateful Streams
Unlike Spark (which is traditionally micro-batch), Flink processes every event immediately (true streaming). It maintains "State" in memory, allowing you to perform calculations over time.
Why is State important?
- Example: Calculating a moving average of stock prices over the last 5 minutes.
- Example: Detecting if a credit card is used in two different cities within 10 minutes (Fraud Detection).
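The fraud-detection example above can be sketched in plain Python (not Flink; in a real job this state would live in Flink's keyed state backend). The class name and field layout here are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Minimal sketch of the stateful fraud check: remember the last seen
# city per card, and flag a new city appearing within a 10-minute window.
class FraudDetector:
    WINDOW = timedelta(minutes=10)

    def __init__(self):
        # card_id -> (city, timestamp): this dict plays the role of
        # Flink's per-key managed state.
        self.last_seen = {}

    def process(self, card_id, city, ts):
        prev = self.last_seen.get(card_id)
        self.last_seen[card_id] = (city, ts)
        if prev is not None:
            prev_city, prev_ts = prev
            if prev_city != city and ts - prev_ts <= self.WINDOW:
                return True  # two different cities within 10 minutes
        return False

detector = FraudDetector()
t0 = datetime(2024, 1, 1, 12, 0)
detector.process("card-1", "Berlin", t0)
alert = detector.process("card-1", "Madrid", t0 + timedelta(minutes=5))
```

The key point is that the decision for the current event depends on history, which is exactly what "state" buys you over a stateless filter.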
π Key Features
1. Event Time vs. Processing Time
- Event Time: The time the action actually happened (e.g., in the mobile app). Flink uses Watermarks to handle data that arrives late or out of order.
- Processing Time: The time the event arrived at the Flink cluster.
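The watermark idea can be sketched in a few lines of plain Python. This is a toy model (class and method names are my own, not PyFlink's API): the watermark trails the maximum event time seen so far by a fixed out-of-orderness bound, and events at or below the watermark are treated as complete.

```python
# Toy bounded-out-of-orderness watermark generator.
class BoundedOutOfOrderness:
    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_event_time = float("-inf")

    def on_event(self, event_time_ms):
        # Track the largest event timestamp observed so far.
        self.max_event_time = max(self.max_event_time, event_time_ms)

    def current_watermark(self):
        # Declare "no event older than this should still arrive."
        return self.max_event_time - self.max_delay_ms

wm = BoundedOutOfOrderness(max_delay_ms=2000)
for t in [1000, 3000, 2500]:  # 2500 arrives out of order, within bound
    wm.on_event(t)
watermark = wm.current_watermark()  # 3000 - 2000 = 1000
```

An event-time window only fires once the watermark passes the window's end, which is how Flink trades a little latency for correctness on late data.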
2. Exactly-Once Guarantees
Flink uses a distributed snapshot algorithm (a variant of Chandy-Lamport) to ensure that even if a server crashes, the system can recover and guarantee that every event affects the state exactly once.
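The recovery side of this guarantee can be sketched in plain Python (this illustrates checkpoint-and-replay, not the barrier-alignment protocol itself): an operator snapshots its state, and after a crash it restores the snapshot and replays events from that point, so each event is reflected in the state exactly once.

```python
import copy

# Toy operator whose only state is a running count of events.
class CountingOperator:
    def __init__(self):
        self.count = 0

    def process(self, _event):
        self.count += 1

    def snapshot(self):
        return copy.deepcopy(self.count)

    def restore(self, snap):
        self.count = snap

op = CountingOperator()
events = ["a", "b", "c", "d"]
for e in events[:2]:
    op.process(e)
checkpoint = op.snapshot()   # state after 2 events
op.process(events[2])        # progress lost in the simulated crash
op.restore(checkpoint)       # roll back to the last checkpoint
for e in events[2:]:         # replay from the checkpointed offset
    op.process(e)
final_count = op.count       # each event counted exactly once
```

The piece this sketch omits is coordination: Flink injects checkpoint barriers into the stream so that all operators snapshot at a consistent point, and it relies on a replayable source (like Kafka offsets) for the rewind.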
3. Windowing
- Tumbling Windows: Fixed-size, non-overlapping (e.g., every 1 minute).
- Sliding Windows: Overlapping (e.g., a 1-minute window, recalculated every 10 seconds).
- Session Windows: Grouped by activity (e.g., a user session starts when they log in and ends after 30 minutes of inactivity).
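The difference between tumbling and sliding assignment comes down to how many windows an event's timestamp falls into. A small plain-Python sketch (assuming millisecond timestamps, with half-open windows `[start, start + size)`):

```python
def tumbling_window(ts, size):
    # Each timestamp belongs to exactly one aligned, non-overlapping window.
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, slide):
    # Each timestamp belongs to every window of length `size` that covers
    # it, with a new window starting every `slide` milliseconds.
    last_start = ts - (ts % slide)
    starts = range(last_start, ts - size, -slide)
    return [(s, s + size) for s in starts]

win = tumbling_window(7_500, 5_000)           # one window: (5000, 10000)
wins = sliding_windows(7_500, 10_000, 5_000)  # two overlapping windows
```

With a 10-second window sliding every 5 seconds, each event is counted in two windows, which is why sliding aggregates cost more state than tumbling ones.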
Code Example: PyFlink Sliding Window
This example shows how to calculate the average temperature of a sensor over a 10-second sliding window, updated every 5 seconds.
```python
from pyflink.common import Time, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Define the source (e.g., a KafkaSource built elsewhere)
ds = env.from_source(
    source=my_kafka_source,
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="Kafka Source"
)

# Processing logic: carry a running sum and count so each window emits a
# true average (pairwise-averaging two averages would skew the result).
ds.map(lambda x: {'device_id': x['device_id'], 'sum': x['temp'], 'count': 1}) \
    .key_by(lambda x: x['device_id']) \
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))) \
    .reduce(lambda a, b: {'device_id': a['device_id'],
                          'sum': a['sum'] + b['sum'],
                          'count': a['count'] + b['count']}) \
    .map(lambda x: {'device_id': x['device_id'], 'avg_temp': x['sum'] / x['count']}) \
    .print()

env.execute("Sensor Sliding Window")
```

Flink vs. Spark Streaming
| Feature | Apache Flink | Spark Streaming |
|---|---|---|
| Latency | Milliseconds (Real-time) | Seconds (Micro-batch) |
| State Management | Native, high-performance | External (Checkpoints) |
| Order Handling | Excellent (Watermarks) | Limited |
| Best Use Case | Fraud, Low-latency alerts | ETL, Batch/Stream unification |
Engineering Takeaway
Use Flink when every millisecond matters or when you have complex logic that depends on the history of the stream (Stateful logic).