---
name: llm-streaming-response-handler
description: Build production LLM streaming UIs with Server-Sent Events, real-time token display, cancellation, error recovery. Handles OpenAI/Anthropic/Claude streaming APIs. Use for chatbots, AI assistants, real-time text generation. Activate on "LLM streaming", "SSE", "token stream", "chat UI", "real-time AI". NOT for batch processing, non-streaming APIs, or WebSocket bidirectional chat.
allowed-tools: Read,Write,Edit,Bash(npm:*)
metadata:
  category: AI & Machine Learning
  tags:
    - llm
    - streaming
    - response
    - llm-streaming
    - sse
  pairs-with:
    - skill: chatbot-analytics
      reason: Streaming metrics (time-to-first-token, throughput) are key analytics for chatbot performance
    - skill: react-performance-optimizer
      reason: Real-time token rendering requires React performance optimization for smooth display
    - skill: error-handling-patterns
      reason: Stream interruption recovery and retry logic are critical error handling concerns
    - skill: websocket-streaming
      reason: WebSocket transport is an alternative to SSE for bidirectional LLM streaming
---

# LLM Streaming Response Handler

Expert in building production-grade streaming interfaces for LLM responses that feel instant and responsive.

## When to Use

✅ **Use for**:
- Chat interfaces with typing animation
- Real-time AI assistants
- Code generation with live preview
- Document summarization with progressive display
- Any UI where users expect immediate feedback from LLMs

❌ **NOT for**:
- Batch document processing (no user watching)
- APIs that don't support streaming
- WebSocket-based bidirectional chat (use Socket.IO)
- Simple request/response (fetch is fine)

## Quick Decision Tree

```
Does your LLM interaction:
├── Need immediate visual feedback? → Streaming
├── Display long-form content (>100 words)? → Streaming
├── User expects typewriter effect? → Streaming
├── Short response (<50 words)? → Regular fetch
└── Background processing? → Regular fetch
```

---

## Technology Selection

### Server-Sent Events (SSE) - Recommended

**Why SSE over WebSockets for LLM streaming**:
- **Simplicity**: HTTP-based, works with existing infrastructure
- **Auto-reconnect**: Built-in reconnection logic
- **Firewall-friendly**: Easier than WebSockets through proxies
- **One-way perfect**: LLMs only stream server → client

**Timeline**:
- 2015-2020: WebSockets for everything
- 2020: SSE adoption for streaming APIs
- 2023+: SSE standard for LLM streaming (OpenAI, Anthropic)
- 2024: Vercel AI SDK popularizes SSE patterns

### Streaming APIs

| Provider | Streaming Method | Response Format |
|----------|------------------|-----------------|
| OpenAI | SSE | `data: {"choices":[{"delta":{"content":"token"}}]}` |
| Anthropic | SSE | `data: {"type":"content_block_delta","delta":{"text":"token"}}` |
| Claude (API) | SSE | `data: {"delta":{"text":"token"}}` |
| Vercel AI SDK | SSE | Normalized across providers |
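Because each provider wraps tokens in a different delta shape, a thin normalization layer keeps the rest of the UI provider-agnostic. A minimal sketch based on the formats in the table above; the exact field access is an assumption to verify against the events your SDK version actually emits:

```typescript
// Normalize provider-specific SSE payloads into plain text tokens
type Provider = 'openai' | 'anthropic';

function extractToken(provider: Provider, payload: any): string | null {
  if (provider === 'openai') {
    // data: {"choices":[{"delta":{"content":"token"}}]}
    return payload.choices?.[0]?.delta?.content ?? null;
  }
  // data: {"type":"content_block_delta","delta":{"text":"token"}}
  if (payload.type === 'content_block_delta') {
    return payload.delta?.text ?? null;
  }
  return null;
}
```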
---

## Common Anti-Patterns

### Anti-Pattern 1: Buffering Before Display

**Novice thinking**: "Collect all tokens, then show complete response"

**Problem**: Defeats the entire purpose of streaming.

**Wrong approach**:
```typescript
// ❌ Waits for entire response before showing anything
const response = await fetch('/api/chat', { method: 'POST', body: prompt });
const fullText = await response.text();
setMessage(fullText); // User sees nothing until done
```

**Correct approach**:
```typescript
// ✅ Display tokens as they arrive
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } keeps multi-byte characters intact across chunk boundaries
  const chunk = decoder.decode(value, { stream: true });
  const lines = chunk.split('\n').filter(line => line.trim());

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      setMessage(prev => prev + data.content); // Update immediately
    }
  }
}
```

**Timeline**:
- Pre-2023: Many apps buffered entire response
- 2023+: Token-by-token display expected

---

### Anti-Pattern 2: No Stream Cancellation

**Problem**: User can't stop generation, wasting tokens and money.

**Symptom**: "Stop" button doesn't work or doesn't exist.

**Correct approach**:
```typescript
// ✅ AbortController for cancellation
const [abortController, setAbortController] = useState<AbortController | null>(null);

const streamResponse = async () => {
  const controller = new AbortController();
  setAbortController(controller);

  try {
    const response = await fetch('/api/chat', {
      signal: controller.signal,
      method: 'POST',
      body: JSON.stringify({ prompt })
    });
    // Stream handling...
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log('Stream cancelled by user');
    }
  } finally {
    setAbortController(null);
  }
};

const cancelStream = () => {
  abortController?.abort();
};

return (
  <button onClick={cancelStream}>Stop</button>
);
```
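Cancellation also has a server-side half: aborting the fetch stops the client read, but the route keeps consuming (and paying for) provider tokens unless the abort is propagated upstream. A minimal sketch for a Next.js route handler, assuming the runtime surfaces client disconnects on `req.signal` and that your OpenAI SDK version accepts a `signal` request option; the SSE conversion mirrors Pattern 3 later in this document:

```typescript
// Forward the client's abort to the provider request
import { OpenAI } from 'openai';

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  // If the browser aborts the fetch, req.signal fires and the SDK cancels
  // the upstream request, so no further tokens are generated or billed.
  const stream = await openai.chat.completions.create(
    { model: 'gpt-4', messages: [{ role: 'user', content: prompt }], stream: true },
    { signal: req.signal }
  );

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const content = chunk.choices[0]?.delta?.content;
          if (content) {
            controller.enqueue(encoder.encode(`data: ${JSON.stringify({ content })}\n\n`));
          }
        }
        controller.close();
      } catch (error) {
        controller.error(error); // includes aborts surfaced by the SDK
      }
    }
  });

  return new Response(readable, { headers: { 'Content-Type': 'text/event-stream' } });
}
```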
---

### Anti-Pattern 3: No Error Recovery

**Problem**: Stream fails mid-response, user sees partial text with no indication of failure.

**Correct approach**:
```typescript
// ✅ Error states and recovery
const [streamState, setStreamState] = useState<'idle' | 'streaming' | 'error' | 'complete'>('idle');
const [errorMessage, setErrorMessage] = useState<string | null>(null);

try {
  setStreamState('streaming');
  // Streaming logic...
  setStreamState('complete');
} catch (error) {
  setStreamState('error');

  if (error.name === 'AbortError') {
    setErrorMessage('Generation stopped');
  } else if (error.message.includes('429')) {
    setErrorMessage('Rate limit exceeded. Try again in a moment.');
  } else {
    setErrorMessage('Something went wrong. Please retry.');
  }
}

// UI feedback
{streamState === 'error' && (
  <div className="error-message">
    {errorMessage}
  </div>
)}
```
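For transient failures such as dropped connections or 429s, the production checklist below also calls for automatic retry. A minimal sketch of a backoff wrapper, where `startStream` is a hypothetical name for whatever function runs your fetch-and-read loop:

```typescript
// Retry a failed stream with exponential backoff; never retry user aborts
async function streamWithRetry(
  startStream: () => Promise<void>,
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await startStream();
      return;
    } catch (error: any) {
      if (error?.name === 'AbortError' || attempt === maxAttempts) throw error;
      // Wait 1s, 2s, 4s... before the next attempt
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    }
  }
}
```

Note that retrying after tokens have already been displayed means either clearing the partial output or resuming server-side; the sketch assumes a restart from scratch.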
---

### Anti-Pattern 4: Memory Leaks from Unclosed Streams

**Problem**: Streams not cleaned up, causing memory leaks.

**Symptom**: Browser slows down after multiple requests.

**Correct approach**:
```typescript
// ✅ Cleanup with useEffect
useEffect(() => {
  let reader: ReadableStreamDefaultReader | null = null;

  const streamResponse = async () => {
    const response = await fetch('/api/chat', { ... });
    reader = response.body.getReader();
    // Streaming...
  };

  streamResponse();

  // Cleanup on unmount
  return () => {
    reader?.cancel();
  };
}, [prompt]);
```

---

### Anti-Pattern 5: No Typing Indicator Between Tokens

**Problem**: UI feels frozen between slow tokens.

**Correct approach**:
```typescript
// ✅ Animated cursor during generation
<div className="message">
  {content}
  {isStreaming && <span className="typing-cursor">▊</span>}
</div>
```

```css
.typing-cursor {
  animation: blink 1s step-end infinite;
}

@keyframes blink {
  50% { opacity: 0; }
}
```

---

## Implementation Patterns

### Pattern 1: Basic SSE Stream Handler

```typescript
async function* streamCompletion(prompt: string) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    const lines = chunk.split('\n');

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));
        if (data.content) {
          yield data.content;
        }
        if (data.done) {
          return;
        }
      }
    }
  }
}

// Usage
for await (const token of streamCompletion('Hello')) {
  console.log(token);
}
```
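One caveat with the parsing above: `chunk.split('\n')` assumes every read ends on a line boundary, but a `data:` line can straddle two network chunks and make `JSON.parse` throw. A minimal line-buffered variant (a sketch; adapt the yielded payloads to your format):

```typescript
// Line-buffered SSE parsing: carry the trailing partial line into the next read
async function* parseSSE(body: ReadableStream<Uint8Array>) {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the incomplete trailing line

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        yield JSON.parse(line.slice(6));
      }
    }
  }
}
```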
### Pattern 2: React Hook for Streaming

```typescript
import { useState, useCallback } from 'react';

interface UseStreamingOptions {
  onToken?: (token: string) => void;
  onComplete?: (fullText: string) => void;
  onError?: (error: Error) => void;
}

export function useStreaming(options: UseStreamingOptions = {}) {
  const [content, setContent] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const [error, setError] = useState<Error | null>(null);
  const [abortController, setAbortController] = useState<AbortController | null>(null);

  const stream = useCallback(async (prompt: string) => {
    const controller = new AbortController();
    setAbortController(controller);
    setIsStreaming(true);
    setError(null);
    setContent('');

    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        signal: controller.signal,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt })
      });

      const reader = response.body!.getReader();
      const decoder = new TextDecoder();
      let accumulated = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        const chunk = decoder.decode(value, { stream: true });
        const lines = chunk.split('\n').filter(line => line.trim());

        for (const line of lines) {
          if (line.startsWith('data: ')) {
            const data = JSON.parse(line.slice(6));
            if (data.content) {
              accumulated += data.content;
              setContent(accumulated);
              options.onToken?.(data.content);
            }
          }
        }
      }

      options.onComplete?.(accumulated);
    } catch (err) {
      if (err.name !== 'AbortError') {
        setError(err as Error);
        options.onError?.(err as Error);
      }
    } finally {
      setIsStreaming(false);
      setAbortController(null);
    }
  }, [options]);

  const cancel = useCallback(() => {
    abortController?.abort();
  }, [abortController]);

  return { content, isStreaming, error, stream, cancel };
}

// Usage in component
function ChatInterface() {
  const { content, isStreaming, stream, cancel } = useStreaming({
    onToken: (token) => console.log('New token:', token),
    onComplete: (text) => console.log('Done:', text)
  });

  // Call stream(prompt) from your input/submit handler to start generation

  return (
    <div>
      <div className="message">
        {content}
        {isStreaming && <span className="typing-cursor">▊</span>}
      </div>
      {isStreaming && <button onClick={cancel}>Stop</button>}
    </div>
  );
}
```
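One checklist item worth wiring in alongside this hook is screen-reader support: sighted users watch tokens appear, but assistive technology needs an announcement. A minimal sketch reusing the class names above; the "polite" announcement policy is an assumption to tune per product (some apps announce only once streaming completes):

```typescript
// Screen-reader-friendly message container
function StreamedMessage({ content, isStreaming }: { content: string; isStreaming: boolean }) {
  return (
    <div className="message" aria-live="polite" aria-busy={isStreaming}>
      {content}
      {isStreaming && <span className="typing-cursor" aria-hidden="true">▊</span>}
    </div>
  );
}
```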
### Pattern 3: Server-Side Streaming (Next.js)

```typescript
// app/api/chat/route.ts
import { OpenAI } from 'openai';

export const runtime = 'edge'; // Edge runtime keeps time-to-first-token low (Node runtime also supports streaming)

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
    stream: true
  });

  // Convert OpenAI stream to SSE format
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const content = chunk.choices[0]?.delta?.content;
          if (content) {
            const sseMessage = `data: ${JSON.stringify({ content })}\n\n`;
            controller.enqueue(encoder.encode(sseMessage));
          }
        }
        // Send completion signal
        controller.enqueue(encoder.encode('data: {"done":true}\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    }
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive'
    }
  });
}
```

---

## Production Checklist

```
□ AbortController for cancellation
□ Error states with retry capability
□ Typing indicator during generation
□ Cleanup on component unmount
□ Rate limiting on API route
□ Token usage tracking
□ Streaming fallback (if API fails)
□ Accessibility (screen reader announces updates)
□ Mobile-friendly (touch targets for stop button)
□ Network error recovery (auto-retry on disconnect)
□ Max response length enforcement
□ Cost estimation before generation
```

---

## When to Use vs Avoid

| Scenario | Use Streaming? |
|----------|----------------|
| Chat interface | ✅ Yes |
| Long-form content generation | ✅ Yes |
| Code generation with preview | ✅ Yes |
| Short completions (<50 words) | ❌ No - regular fetch |
| Background jobs | ❌ No - use job queue |
| Bidirectional chat | ⚠️ Use WebSockets instead |

---

## Technology Comparison

| Feature | SSE | WebSockets | Long Polling |
|---------|-----|-----------|--------------|
| Complexity | Low | Medium | High |
| Auto-reconnect | ✅ | ❌ | ❌ |
| Bidirectional | ❌ | ✅ | ❌ |
| Firewall-friendly | ✅ | ⚠️ | ✅ |
| Browser support | ✅ All modern | ✅ All modern | ✅ Universal |
| LLM API support | ✅ Standard | ❌ Rare | ❌ Not used |

---

## References

- `/references/sse-protocol.md` - Server-Sent Events specification details
- `/references/vercel-ai-sdk.md` - Vercel AI SDK integration patterns
- `/references/error-recovery.md` - Stream error handling strategies

## Scripts

- `scripts/stream_tester.ts` - Test SSE endpoints locally
- `scripts/token_counter.ts` - Estimate costs before generation

---

**This skill guides**: LLM streaming implementation | SSE protocol | Real-time UI updates | Cancellation | Error recovery | Token-by-token display