Why Your Kafka Pipeline Will Break When You Add an LLM to It

Most teams wire LLM calls directly into Kafka consumers. Here's why that fails in production and what to do instead.


The Failure Scenario

2 AM. PagerDuty goes off.

You roll over, grab your laptop, pull up the Grafana dashboard. Consumer group lag for customer-interactions isn't just rising — it's vertical. Ten minutes ago the cluster was healthy. Now you've got 40M records backed up and the gap is widening every second.

Logs look fine. No errors, no exceptions. Consumer's up, network's fine, brokers are healthy. But records consumed per second has flatlined near zero.

So you dig into the code that deployed yesterday — a "simple" feature to add sentiment analysis to the support ticket stream using an LLM. Passed staging. Passed integration tests. But in production, that LLM call isn't returning in 10ms like your Postgres lookups. It's taking 800ms to 2 seconds. Your consumers, configured for high-throughput database enrichment, are spending 99% of their time blocked on I/O waiting for a model that doesn't care about your max.poll.interval.ms.

You're processing one record every two seconds per thread. Pipeline's broken.

I've seen this exact failure mode three times in the past month. Which is wild, because the fix isn't exactly secret.

Why Kafka Consumers Are the Wrong Place for LLM Calls

Here's the fundamental problem: Kafka consumers and LLM APIs have completely different assumptions about how the world works.

We're used to synchronous enrichment — database lookups, key-value store checks — that execute in single-digit milliseconds. LLMs don't play by those rules.

Latency mismatch: A standard Kafka consumer poll loop is built for speed. Poll, process, commit. Drop a 1-3 second LLM call inside that loop and you stop polling. You aren't just slowing down processing — you're holding a partition hostage. While that thread waits, the partition stalls. Scale this to a high-volume topic and you create a backlog that cannot recover without scaling your consumer count to something absurd.
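The arithmetic is brutal. A back-of-envelope sketch, with illustrative numbers taken from the scenario above (10ms database lookups vs. a 2-second LLM call):

```java
public class ThroughputMath {

    // In a sequential poll-process-commit loop, per-thread throughput is
    // simply 1 / latency of the enrichment call.
    static double recordsPerSecond(double callLatencyMs) {
        return 1000.0 / callLatencyMs;
    }

    public static void main(String[] args) {
        double db = recordsPerSecond(10);     // 10ms Postgres lookup: ~100 rec/s
        double llm = recordsPerSecond(2000);  // 2s LLM call: 0.5 rec/s
        System.out.printf("db=%.1f/s llm=%.1f/s slowdown=%.0fx%n",
                db, llm, db / llm);
    }
}
```

A 200x per-thread slowdown is why the lag graph goes vertical rather than sloping up: the consumer group would need hundreds of times the thread count just to stand still.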

No native retry contract: When a database query fails, the driver fails fast. Exception, retry, dead-letter. LLM APIs? They hang. They time out inconsistently. They return HTTP 200 with rate-limit errors in the JSON body. Kafka consumers aren't built to parse application-layer backpressure from an external REST API. You end up with ProcessingException loops or, worse, silent data loss if you catch the exception and commit the offset anyway.
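Any consumer-side handling ends up needing a classifier that looks past the status code. A minimal sketch; the `"rate_limit"` marker is a hypothetical stand-in for whatever error shape your provider actually returns:

```java
import java.util.Locale;

public class LlmResponseClassifier {

    enum Outcome { SUCCESS, RETRY, DEAD_LETTER }

    // Status code alone is not enough: some providers return HTTP 200 with a
    // rate-limit error inside the JSON body, so the body must be inspected too.
    static Outcome classify(int status, String body) {
        String normalized = body.toLowerCase(Locale.ROOT);
        if (status == 429 || normalized.contains("rate_limit")) {
            return Outcome.RETRY;           // transient: back off and retry
        }
        if (status >= 500) {
            return Outcome.RETRY;           // transient provider failure
        }
        if (status != 200) {
            return Outcome.DEAD_LETTER;     // permanent: route to DLQ
        }
        return Outcome.SUCCESS;
    }
}
```

The point isn't this particular decision table; it's that none of this logic has anywhere natural to live inside a poll loop that assumes enrichment either succeeds fast or throws.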

Backpressure blindness: Your consumer group has no idea what's happening with the LLM provider. If OpenAI or Anthropic has a latency spike, your consumers don't know to slow down. They keep fetching from the broker because the poll() loop is technically open. But the internal processing queue backs up instantly. You're pulling data out of Kafka and parking it in memory, risking OOM, while your consumer group coordinator thinks everything's fine because heartbeats are still flowing.
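A toy simulation makes the memory risk concrete. The fetch and drain rates here are made up, but the shape is the failure mode described above: poll() keeps fetching while the slow LLM stage barely drains, so the in-memory buffer grows without bound.

```java
import java.util.ArrayDeque;

public class BackpressureSim {

    // Each tick, the consumer fetches a batch while the LLM stage drains far
    // fewer records; whatever is left over sits in heap.
    static int bufferedAfter(int ticks, int fetchedPerTick, int drainedPerTick) {
        ArrayDeque<Integer> buffer = new ArrayDeque<>();
        int nextOffset = 0;
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < fetchedPerTick; i++) {
                buffer.add(nextOffset++);        // poll() keeps delivering
            }
            for (int i = 0; i < drainedPerTick && !buffer.isEmpty(); i++) {
                buffer.poll();                   // slow enrichment stage
            }
        }
        return buffer.size();
    }
}
```

Sixty ticks at 500 fetched and 1 drained per tick leaves roughly 30,000 records parked in memory, and heartbeats stay green the whole time.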

What Teams Try (And Why It Doesn't Work)

First instinct is usually to optimize the consumer code: force-fit the async nature of LLM calls into the sync model of Kafka consumers.

Async threads inside the consumer: I see this a lot. Engineers spin up a thread pool or use CompletableFuture inside the consumer to make LLM calls non-blocking. On paper, throughput goes up. In practice, you destroy ordering guarantees. Kafka guarantees order per partition, but only if you process records sequentially. Go async and record B might finish before record A. If a rebalance happens mid-flight, you've got uncommitted offsets and in-flight requests that might complete and write to a sink after the partition's already been reassigned. Data consistency breaks.
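The ordering break is easy to demonstrate with two CompletableFutures completed out of order, no real LLM needed. Completion order drives callback order, regardless of the order records were consumed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class OrderingDemo {

    static List<String> run() {
        List<String> completionOrder = new ArrayList<>();
        // Record A was consumed from the partition before record B.
        CompletableFuture<String> recordA = new CompletableFuture<>();
        CompletableFuture<String> recordB = new CompletableFuture<>();
        recordA.thenAccept(completionOrder::add);
        recordB.thenAccept(completionOrder::add);

        // But record B's "LLM call" returns first, so its callback fires
        // first. Downstream now sees B before A.
        recordB.complete("B");
        recordA.complete("A");
        return completionOrder;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [B, A]
    }
}
```

This is exactly the inversion that Flink's ordered async operator (below) is built to absorb for you.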

Side-car microservice: Some teams offload the LLM call to a separate HTTP service. Consumer hits the service, service handles the LLM. This isolates the logic but you've just invented a poor man's queue. Two failure modes now: the service goes down, or the LLM goes down. No clean backpressure signal from the HTTP service back to the consumer. If the service backs up, the consumer just sees latency. You haven't solved the problem — you've moved it behind a REST API and added network hops.

The Actual Fix: Flink as the Enrichment Layer

Stop hacking the consumer. Move the logic to a stream processor designed for this.

Apache Flink — self-hosted or managed through Confluent Cloud — has native Async I/O support designed for exactly this: calling high-latency external systems without stalling the pipeline.

Flink's Async I/O operator: This is the game changer. Flink lets you define an AsyncFunction that talks to external systems (like an LLM) without blocking the main pipeline. The operator maintains a queue of in-flight requests. Sends a request to the LLM, registers a callback, immediately moves to the next record. When the LLM responds, the result gets emitted downstream. You can configure timeouts, retry strategies, and queue capacity right in the Flink API.

Event time + watermarking: If an LLM response comes back late, Flink's event time processing handles it. You can define watermarks that allow out-of-order arrival, so a slow LLM response doesn't corrupt the temporal logic of your stream. Kafka consumers can't do this — they process in ingestion order only.

Kafka stays clean: With this architecture, Kafka goes back to being a transport layer. Read raw topic, Flink handles the async enrichment and retries, write to an enriched topic. The LLM logic is decoupled from your ingestion layer.

Confluent Cloud Flink: If you're already running Kafka at scale, managing a self-hosted Flink cluster is probably overhead you don't need. Confluent Cloud Flink gives you the compute layer with native Kafka integration. It scales automatically and handles checkpointing and state backend for you, so you can focus on enrichment logic rather than JVM tuning.

Here's the simplified pattern:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;

// Stream from Kafka
DataStream<String> rawStream = env
    .fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "Raw Topic");

// AsyncFunction handles the LLM call
AsyncFunction<String, EnrichedRecord> llmEnrichment = new LLMAsyncClient();

// Non-blocking enrichment with a per-request timeout and a bounded
// queue of in-flight requests
DataStream<EnrichedRecord> enrichedStream = AsyncDataStream.orderedWait(
    rawStream,
    llmEnrichment,
    5000,                   // Timeout per async request, in ms
    TimeUnit.MILLISECONDS,
    100                     // Capacity: max concurrent in-flight requests
);

enrichedStream.sinkTo(kafkaSink);

The key insight: orderedWait emits results in the order records arrived (with a Kafka source, that preserves per-partition order) while the LLM calls themselves run concurrently. You get the throughput of async without breaking the ordering contract. If downstream ordering doesn't matter, AsyncDataStream.unorderedWait trades it away for even lower latency.
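For reference, the LLMAsyncClient used above would be roughly this shape. This is a sketch against Flink's RichAsyncFunction API, assuming a hypothetical EnrichedRecord type and leaving the actual provider call as a placeholder:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class LLMAsyncClient extends RichAsyncFunction<String, EnrichedRecord> {

    @Override
    public void asyncInvoke(String ticket, ResultFuture<EnrichedRecord> resultFuture) {
        // Fire the request without blocking the operator thread; Flink tracks
        // it in the in-flight queue and keeps pulling records.
        CompletableFuture
            .supplyAsync(() -> callLlm(ticket))   // placeholder for the real client
            .thenAccept(sentiment -> resultFuture.complete(
                Collections.singleton(new EnrichedRecord(ticket, sentiment))));
    }

    @Override
    public void timeout(String ticket, ResultFuture<EnrichedRecord> resultFuture) {
        // Complete timed-out calls with a marker instead of failing the job,
        // so they can be routed to a retry topic downstream.
        resultFuture.complete(
            Collections.singleton(EnrichedRecord.timedOut(ticket)));
    }

    private String callLlm(String ticket) {
        // Hypothetical: wire in your LLM provider's SDK or HTTP client here.
        throw new UnsupportedOperationException("LLM client not implemented");
    }
}
```

Overriding timeout matters: by default an expired async request throws, which fails the job — rarely what you want for a flaky external API.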

Before You Refactor: Quick Checklist

If you're seeing lag spikes or stability issues with your "AI-enabled" pipelines, check these five things:

  • [ ] Are LLM calls inside a consumer poll loop or @KafkaListener method?
  • [ ] Is max.poll.interval.ms tuned for database latency (default 5m) or LLM latency (which might need way longer or different handling)?
  • [ ] Do you have per-partition lag alerts specifically for AI-enriched topics, or are they buried in aggregate metrics?
  • [ ] Is there a dead-letter topic for timed-out LLM calls, separate from generic processing errors?
  • [ ] Can your pipeline degrade gracefully? If the LLM endpoint is down, does everything halt or can you route to a "retry later" topic?
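The last checklist item can start as something as simple as a routing function that never drops the raw record. Topic names here are hypothetical:

```java
public class FailureRouter {

    // Every outcome lands somewhere durable: timed-out calls go to a retry
    // topic for replay once the provider recovers, hard failures go to a
    // dedicated DLQ, and successes flow on to the enriched topic.
    static String targetTopic(boolean timedOut, boolean permanentFailure) {
        if (timedOut) {
            return "tickets.llm.retry";
        }
        if (permanentFailure) {
            return "tickets.llm.dlq";
        }
        return "tickets.enriched";
    }
}
```

Keeping the retry topic separate from the generic DLQ is what lets the pipeline degrade gracefully: a provider outage becomes a backlog you replay later, not an incident.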

The pipeline you built for database enrichment won't survive an LLM integration without rethinking the async layer. The good news is the streaming primitives to fix this already exist — most teams just haven't connected the dots yet.


I help early-stage startups connect their data pipelines to AI without hiring a full team. If you're hitting this wall, let's talk.
