The short version
When you chat with Claude or ChatGPT, the response flows in, word by word. That's streaming. Without it, you'd stare at a blank screen for 10-30 seconds, then the entire response would appear at once.
This isn't just a cosmetic choice. Streaming makes AI feel faster (even though the total time is the same), lets you stop a bad response early, and reduces the risk of timeouts on long generations.
How it works
LLMs generate text one token at a time, each token predicted from everything that came before it. Without streaming, the server waits until the entire response is complete, then sends it all in one HTTP response. With streaming, the server sends each token as it's generated, typically over a web standard called Server-Sent Events (SSE).
The flow:
- Your app sends a request to the AI API with stream: true
- The server starts generating tokens
- As each token is produced, it's sent immediately over the open connection
- Your app receives and displays each token in real time
- The server sends a final signal when generation is complete
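In code, that loop is short. Here's a minimal TypeScript sketch using fetch and the web streams API; the endpoint URL, headers, and request body are hypothetical placeholders, not any specific provider's contract:

```ts
// Hypothetical streaming request: the URL, headers, and body shape
// are placeholders. Adapt them to your provider's API.
const response = await fetch("https://api.example.com/v1/messages", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.API_KEY}`,
  },
  body: JSON.stringify({
    model: "some-model",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true, // ask the server to send tokens as they're generated
  }),
});

// The body arrives as a raw byte stream; decode and display each chunk
// the moment it lands instead of waiting for the whole response.
const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break; // connection closed: generation is complete
  process.stdout.write(decoder.decode(value, { stream: true }));
}
```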
From the API side, the response looks like a series of small chunks:
data: {"type": "content_block_delta", "delta": {"text": "Hello"}}
data: {"type": "content_block_delta", "delta": {"text": " there"}}
data: {"type": "content_block_delta", "delta": {"text": "."}}
data: {"type": "message_stop"}
For builders, streaming adds complexity. You need to handle partial responses, manage the connection, and assemble the final text from chunks. Libraries like the Vercel AI SDK abstract most of this into a simple useChat hook or streamText function.
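With the AI SDK, for instance, the chunk handling above collapses to an async loop. This is a sketch assuming the ai and @ai-sdk/anthropic packages; the model ID is illustrative and exact signatures vary by SDK version:

```ts
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// The SDK manages the connection and chunk parsing; you just
// iterate over clean text fragments.
const result = streamText({
  model: anthropic("claude-3-5-sonnet-latest"),
  prompt: "Explain streaming in one paragraph.",
});

for await (const textPart of result.textStream) {
  process.stdout.write(textPart); // render each fragment as it arrives
}
```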
Why it matters
If you're building anything that calls an LLM and shows the result to a user, streaming is the expected experience. Without it, long responses feel broken. With it, the app feels responsive even when the model takes 20 seconds to generate a full answer. Most AI SDKs support streaming out of the box.