
🚀 Online vs. Batch Serving

The architecture for delivering a model depends on whether you need a result in milliseconds or in hours.


🟢 Level 1: Online Serving (Real-Time)

The model is exposed as a REST or gRPC API.

  • Latency: < 100ms.
  • Tools: FastAPI, BentoML, TorchServe.
  • Workflow:
    1. Client sends JSON request.
    2. Server performs preprocessing.
    3. Server runs inference.
    4. Server returns JSON response.
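The four-step request/response loop above can be sketched framework-agnostically; in practice a FastAPI or BentoML route would wrap `handle_request`, and the `predict` stub is a hypothetical placeholder for a real model:

```python
import json

def predict(features):
    """Hypothetical model stub standing in for real inference."""
    return sum(features)

def handle_request(body: str) -> str:
    payload = json.loads(body)                          # 1. parse the client's JSON request
    features = [float(x) for x in payload["features"]]  # 2. preprocessing
    score = predict(features)                           # 3. inference
    return json.dumps({"score": score})                 # 4. JSON response
```

A FastAPI route would simply call `handle_request` (or its parts) inside an endpoint function, letting the framework handle HTTP parsing and serialization.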

High-Speed Preprocessing

Standard Python can be slow for preprocessing (e.g., text tokenization). For high traffic, consider:

  • Rust/Go Sidecar: Handle data cleaning in a fast language.
  • Triton Inference Server: Optimized C++ engine for model execution.
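Triton speaks the KServe v2 HTTP inference protocol, so a client only needs to build a small JSON payload. A minimal sketch of that payload construction, where the server URL, model name, and input tensor name are all hypothetical:

```python
def build_infer_request(server: str, model: str, data: list):
    """Build a (url, payload) pair for Triton's /v2 inference endpoint.

    The tensor name "input__0" and FP32 datatype are assumptions; they
    must match the model's actual config on the server.
    """
    url = f"{server}/v2/models/{model}/infer"
    payload = {
        "inputs": [{
            "name": "input__0",        # hypothetical input tensor name
            "shape": [1, len(data)],   # batch of one
            "datatype": "FP32",
            "data": data,
        }]
    }
    return url, payload
```

The payload would then be POSTed to `url` with any HTTP client; Triton returns an `outputs` array in the same JSON shape.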

🟡 Level 2: Batch Serving (Asynchronous)

The model processes a large dataset all at once.

  • Latency: not a concern (minutes to hours).
  • Throughput: massive (millions of rows).
  • Tools: Spark, Airflow, Dask.
  • Workflow:
    1. Scheduler triggers job at 2 AM.
    2. Load 10M rows from Snowflake/Parquet.
    3. Distribute inference across 100 worker nodes.
    4. Write results back to the database.
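The core of the batch workflow is scoring fixed-size chunks so no single worker loads the full dataset. A minimal sketch, where a plain generator stands in for the Snowflake/Parquet reader and the Spark/Dask fan-out:

```python
def score_in_batches(rows, model_fn, batch_size=1000):
    """Yield model_fn's predictions over rows, one batch at a time.

    In production, `rows` would stream from Snowflake or Parquet and each
    batch would be dispatched to a worker node; here everything runs
    locally to show the chunking logic.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield model_fn(batch)   # score a full chunk
            batch = []
    if batch:                       # flush the final partial chunk
        yield model_fn(batch)
```

Each yielded result can be written back to the database as it completes, keeping memory bounded regardless of total row count.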

🔴 Level 3: The Hybrid (Request-Response Batch)

Used when requests arrive at high volume but results don't need to be instant.

Streaming Inference

  • Tools: Kafka, Flink.
  • Workflow:
    1. Request is pushed to a Kafka topic.
    2. Model service consumes the topic and performs inference.
    3. Result is pushed to an “Output” topic.
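The consume → infer → produce loop above is just a pipeline over messages. A minimal sketch of that loop, where plain callables stand in for the Kafka client; with a real broker, `consume` would be a `KafkaConsumer` iterating a topic and `produce` a `KafkaProducer.send` to the output topic:

```python
def run_stream(consume, predict, produce):
    """Drain messages from an input source, score them, publish results.

    consume: iterable of incoming messages (stands in for a Kafka consumer)
    predict: inference function applied to each message
    produce: callback that publishes a result (stands in for a producer)
    """
    for msg in consume:
        produce(predict(msg))
```

Because the loop is decoupled from the broker, the same logic can be unit-tested with in-memory lists and deployed unchanged behind Kafka or Flink.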