Most production data platforms use both batch and streaming approaches rather than choosing exclusively between them. The decision hinges on specific use case requirements rather than universal superiority.
Two Mental Models
Batch: Collect Everything, Analyze Thoroughly
Key Advantages:
- Complete Picture — All data exists before processing begins; no late arrivals or ordering concerns; straightforward cross-dataset joins
- Simple Retry Logic — Failed jobs can be rerun, with full recomputation as a viable option; idempotent by design
- Mature Ecosystem — Decades of established tooling (Airflow, dbt, Spark); widespread SQL knowledge; proven debugging patterns
- Predictable Cost — Efficient resource usage through on-demand compute; no always-on infrastructure for sporadic workloads
Streaming: Process As It Happens
Key Advantages:
- Millisecond Latency — React to events immediately: fraud detection, real-time pricing, live dashboards
- Event Sourcing — Event logs serve as the source of truth; tables and caches become derived views; replay capability for state rebuilding (see the sketch after this list)
- Unified Processing — Single codebase for real-time and batch via replay; Kappa architecture eliminates dual-system complexity
- Decoupled Systems — Producers and consumers evolve independently; new consumers don't require producer modifications; event-driven microservices
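To make the replay idea concrete, here is a minimal event-sourcing sketch in plain Python: the append-only log is the source of truth, and the balance "table" is a derived view rebuilt by replaying events. The Event shape and the rebuild_balances helper are illustrative, not tied to any particular library.

# Minimal event-sourcing sketch: the log is the source of truth,
# and the balances dict is a derived view that can always be rebuilt by replay.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    account_id: str
    kind: str      # "deposit" or "withdrawal"
    amount: float

def rebuild_balances(event_log: list[Event]) -> dict[str, float]:
    """Replay the full log to derive current state from scratch."""
    balances: dict[str, float] = {}
    for e in event_log:
        delta = e.amount if e.kind == "deposit" else -e.amount
        balances[e.account_id] = balances.get(e.account_id, 0.0) + delta
    return balances

log = [
    Event("acct-1", "deposit", 100.0),
    Event("acct-1", "withdrawal", 30.0),
    Event("acct-2", "deposit", 50.0),
]
print(rebuild_balances(log))  # {'acct-1': 70.0, 'acct-2': 50.0}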
Architecture Patterns
Modern Batch Architecture
Sources → Airbyte (Extract & Load) → Apache Iceberg + Trino (Storage) → dbt (Transform) → BI Tools / Reverse ETL
Provides hourly/daily freshness with simplified operations.
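One hedged way to wire this flow together, assuming Airflow 2.4+ as the orchestrator: a daily DAG that triggers the extract-and-load step and then runs dbt, with retries as the failure-handling strategy. The trigger_airbyte_sync callable, the project directory, and the target name are placeholders, not a real Airbyte integration.

# Sketch of an Airflow DAG for the batch stack: EL step, then dbt transform.
# trigger_airbyte_sync is a placeholder for however ingestion is kicked off.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def trigger_airbyte_sync(**context):
    # Placeholder: call the Airbyte API/CLI for the relevant connection here.
    pass

with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract_load = PythonOperator(
        task_id="airbyte_sync",
        python_callable=trigger_airbyte_sync,
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --target prod",  # paths are placeholders
    )
    extract_load >> transform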
Streaming Event-Driven Stack
Sources → Kafka/Redpanda (Capture) → Apache Flink (Transform) → ClickHouse/Redis (Serve)
Delivers sub-second latency with continuous processing.
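On the capture side, producing events usually amounts to a few lines of client code. A minimal sketch using the confluent-kafka Python client follows; the broker address, topic name, and event fields are illustrative.

# Minimal event producer for the streaming stack (confluent-kafka client).
# Broker address, topic name, and event schema are illustrative.
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to surface delivery failures.
    if err is not None:
        print(f"Delivery failed: {err}")

event = {
    "txn_id": "t-123",
    "user_id": "u-42",
    "amount": 19.99,
    "category": "books",
    "order_time": int(time.time() * 1000),
}
producer.produce(
    "orders",
    key=event["user_id"],
    value=json.dumps(event),
    on_delivery=delivery_report,
)
producer.flush()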
Key Components Comparison
| Component | Batch Stack | Streaming Stack |
|---|---|---|
| Ingestion | Airbyte, Apache NiFi, custom scripts | Debezium CDC, direct producers |
| Storage | Apache Iceberg, ClickHouse, Trino | Kafka (event log), ClickHouse (OLAP) |
| Transformation | dbt, Spark, SQL | Flink, ksqlDB, Spark Structured Streaming |
| Orchestration | Airflow, Dagster, Prefect | Always-on (Kubernetes, managed Flink) |
| Serving | Direct warehouse queries, caching | Pre-computed views in Redis/ClickHouse |
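For the CDC-based ingestion row, here is a sketch of registering a Debezium Postgres connector through the Kafka Connect REST API using Python and requests. The hostnames, credentials, and connector name are placeholders, and the property names assume Debezium 2.x.

# Sketch: register a Debezium Postgres CDC connector via the Kafka Connect REST API.
# All hosts, credentials, and names below are placeholders.
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "change-me",
        "database.dbname": "shop",
        "topic.prefix": "shop",                 # Debezium 2.x-style topic naming
        "table.include.list": "public.orders",
    },
}

resp = requests.post(
    "http://kafka-connect.internal:8083/connectors",  # placeholder Connect endpoint
    json=connector,
    timeout=30,
)
resp.raise_for_status()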
Best-in-Class Technology
Batch Stack
Streaming Stack
Code Examples
Example 1: Daily Revenue by Category
dbt (Batch)
-- Batch: Simple, readable, testable
SELECT
    DATE(order_timestamp) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM {{ ref('orders') }}
WHERE order_timestamp >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2
Flink SQL (Streaming)
-- Streaming: Continuous, requires windowing
SELECT
    TUMBLE_START(order_time, INTERVAL '1' DAY) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM orders
GROUP BY
    TUMBLE(order_time, INTERVAL '1' DAY),
    category
Batch wins due to simplicity, testability via dbt tests, and trivial historical recompute capabilities—unless real-time revenue updates are necessary.
Example 2: Fraud Detection (Velocity Check)
PySpark (Batch)
# Batch: Run hourly - detects fraud after the fact
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# haversine_udf is assumed to be registered already (see the sketch below)
user_window = Window.partitionBy("user_id").orderBy("timestamp")

flagged = (
    transactions
    .withColumn("prev_lat", F.lag("lat").over(user_window))
    .withColumn("prev_lon", F.lag("lon").over(user_window))
    .withColumn("prev_time", F.lag("timestamp").over(user_window))
    .withColumn("distance_km", haversine_udf("lat", "lon", "prev_lat", "prev_lon"))
    .withColumn(
        "time_diff_min",
        (F.col("timestamp").cast("long") - F.col("prev_time").cast("long")) / 60,
    )
    .filter((F.col("distance_km") > 500) & (F.col("time_diff_min") < 30))
)
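The job above assumes a haversine_udf helper is already registered. A minimal sketch of one possible implementation follows; the function name matches the snippet above, and the formula is the standard haversine great-circle distance.

# One possible implementation of the haversine_udf assumed above.
import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def _haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; returns None for missing coordinates."""
    if None in (lat1, lon1, lat2, lon2):
        return None
    r = 6371.0  # Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

haversine_udf = F.udf(_haversine_km, DoubleType())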
Flink SQL (Streaming)
-- Streaming: Detect in real-time, block before damage
CREATE TABLE payments (
    txn_id STRING,
    user_id STRING,
    amount DECIMAL(10, 2),
    lat DOUBLE,
    lon DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);
-- Detect impossible velocity: >500km in <30 min
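-- NOTE: ST_Point / ST_Distance are not built-in Flink SQL functions; this sketch
-- assumes a registered geospatial UDF library that provides them.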
INSERT INTO fraud_alerts
SELECT p1.user_id, p1.txn_id, p2.txn_id,
       ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) / 1000 AS km
FROM payments p1, payments p2
WHERE p1.user_id = p2.user_id
  AND p1.event_time < p2.event_time
  AND p2.event_time < p1.event_time + INTERVAL '30' MINUTE
  AND ST_Distance(...) > 500000;
Streaming excels here—batch detects fraud hours post-incident while streaming blocks transactions within milliseconds, preventing significant financial losses.
Monitoring & Visibility
Batch Monitoring
- Jobs produce binary outcomes (success/failure)
- dbt tests catch data quality issues
- Airflow displays DAG execution history
- Clear audit trail exists per run
Streaming Monitoring
- Continuous metrics track consumer lag, end-to-end latency, and checkpoint success rates (a lag-check sketch follows the comparison table below)
- Requires always-on monitoring
- Distributed debugging presents greater challenges
| Metric | Batch | Streaming |
|---|---|---|
| Data freshness | "Last run: 2 hours ago" | "Consumer lag: 50ms" |
| Pipeline health | Job success/failure rate | Checkpoint success, backpressure |
| Data quality | dbt tests, Great Expectations | Schema registry, runtime validation |
| Lineage | dbt docs, DataHub (mature) | Event correlation IDs (manual) |
| Debugging | Re-run failed job, inspect logs | Distributed tracing, state inspection |
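To make the consumer-lag metric in the table concrete, here is a minimal sketch using the confluent-kafka client that compares each partition's committed offset against the broker's high watermark. The broker address, group id, and topic name are placeholders.

# Sketch: compute per-partition consumer lag (committed offset vs. high watermark).
# Broker, group id, and topic names are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-alerts-consumer",
    "enable.auto.commit": False,
})

topic = "payments"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

committed = consumer.committed(partitions, timeout=10)
for tp in committed:
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    current = tp.offset if tp.offset >= 0 else low  # no commit yet -> assume earliest
    print(f"partition {tp.partition}: lag = {high - current}")

consumer.close()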
Operational Complexity
Batch Operations
- Simple retry logic — Failed jobs rerun with full recomputation as viable fallback; no state corruption risk
- Mature scheduling — Airflow, Dagster, Prefect offer well-understood dependency management
- Delayed detection — Issues discovered only when the next job executes, hours or days after occurrence
- DAG dependency hell — Complex pipelines create cascading failures; slow jobs block downstream processes
Streaming Operations
- Immediate detection — Consumer lag spikes visible instantly without waiting for scheduled runs
- Always-on processing — No scheduling management; no batch windows; continuous data flow
- State management complexity — Checkpoints, savepoints, and state backends require management; corrupted state may necessitate a full rebuild from Kafka (see the sketch after this list)
- Upgrade complexity — Stateful job upgrades demand savepoints; schema changes need careful coordination
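As a small illustration of that state-management surface, here is a hedged sketch of enabling checkpointing in a PyFlink job (assuming PyFlink 1.x); the interval and timeouts are illustrative, and the same settings can equally live in flink-conf.yaml.

# Sketch: enabling checkpointing for a PyFlink job (values are illustrative).
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Take a checkpoint every 60 seconds with exactly-once guarantees.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

cfg = env.get_checkpoint_config()
cfg.set_min_pause_between_checkpoints(30_000)  # breathing room between checkpoints
cfg.set_checkpoint_timeout(120_000)            # fail checkpoints that take too long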
Cost Economics
When Batch is Cheaper
- Infrequent processing (daily/weekly schedules)
- Serverless warehouses (pay-per-query models)
- Small data volumes (<1TB/day)
- Ad-hoc analytics workloads
- No always-on infrastructure required
When Streaming is Cheaper
- High-volume continuous processing
- Replacing multiple redundant batch jobs
- Eliminating intermediate storage requirements
- Real-time requirements where the batch alternative would mean heavily over-provisioned infrastructure
Decision Framework
Clear Wins for Batch
Historical Analytics
- Ad-hoc queries spanning years of data
- Complex joins across large datasets
- Columnar storage optimization
ML Training
- Feature engineering leverages historical data
- Model training gains nothing from real-time processing
- Simpler implementation
Compliance Reporting
- Audit reports, regulatory filings, point-in-time snapshots
- Full reproducibility required
Clear Wins for Streaming
Fraud Detection
- Milliseconds are critical
- Block fraudulent transactions before completion
- Real-time processing is essential
Operational Monitoring
- System health tracking, alerting, live dashboards
- Stale data becomes useless
- Continuous processing required
Event-Driven Systems
- Microservices choreography
- Real-time notifications
- User-facing features requiring instant responsiveness
Risks & Failure Modes
Batch Risks
- Stale data — Decisions based on outdated information
- Delayed detection — Problems discovered hours later
- Cascade failures — One slow job blocks everything downstream
- Resource contention — Batch windows competing for compute resources
Streaming Risks
- Silent data loss — Misconfigured consumers drop events undetected
- State corruption — Requires full rebuild from Kafka
- Backpressure cascade — Slow consumer affects entire pipeline
- Schema breaks — Producer changes break downstream consumers
Mitigation Strategies
| Risk | Batch Mitigation | Streaming Mitigation |
|---|---|---|
| Data quality issues | dbt tests, Great Expectations, post-run validation | Schema registry, runtime validation, dead letter queues |
| Pipeline failures | Retry policies, alerting on failure, SLAs with buffer | Checkpointing, automatic restart, consumer lag alerting |
| Data loss | Idempotent jobs, backup before overwrite | Kafka retention, exactly-once semantics, idempotent sinks |
| Schema evolution | dbt schema tests, migration scripts | Schema registry compatibility rules, versioned topics |
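As one concrete mitigation from the table, here is a sketch of the dead-letter-queue pattern with the confluent-kafka client: events that fail validation are routed to a separate topic instead of being dropped silently. Topic names and the validation rule are illustrative.

# Sketch: dead-letter-queue pattern - invalid events go to a DLQ topic
# instead of being dropped. Topic names and validation logic are illustrative.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-validator",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])

def is_valid(event: dict) -> bool:
    # Minimal field check; a schema registry would enforce this upstream.
    return {"txn_id", "user_id", "amount"}.issubset(event)

while True:  # run until interrupted
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if not is_valid(event):
            raise ValueError("missing required fields")
        # ... normal processing goes here ...
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        producer.produce(
            "payments.dlq",
            value=msg.value(),
            headers=[("error", str(exc).encode())],
        )
        producer.flush()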
The Bottom Line
Neither paradigm is universally superior. They optimize for different outcomes:
- Batch optimizes for simplicity, completeness, and cost-effectiveness for periodic workloads
- Streaming optimizes for latency, reactivity, and continuous processing
Most modern data platforms employ both paradigms. Stream for operational use cases prioritizing latency. Batch for analytical use cases emphasizing completeness. Let requirements drive architecture rather than ideology.