Designing Scalable LLM APIs for Production
Learn how to design robust APIs for Large Language Model integrations, covering request queuing, rate limiting, streaming responses, and the error handling patterns used in production systems.
Introduction
Building production-ready APIs for LLM integrations requires careful consideration of latency, cost, and reliability. This article explores proven patterns for designing scalable LLM APIs that can handle high traffic while maintaining performance and cost efficiency.
Architecture Overview
A well-designed LLM API architecture separates concerns and handles the unique challenges of AI model inference:
LLM API Architecture
flowchart TB
subgraph "Client Layer"
A[Web Application]
B[Mobile App]
C[Third-party Services]
end
subgraph "API Gateway"
D[Load Balancer]
E[Rate Limiter]
F[Request Queue]
end
subgraph "API Services"
G[LLM API Service]
H[Streaming Handler]
I[Response Cache]
end
subgraph "LLM Providers"
J[OpenAI API]
K[Anthropic API]
L[Self-hosted Model]
end
subgraph "Infrastructure"
M[(Redis Cache)]
N[(PostgreSQL)]
O[Monitoring]
end
A --> D
B --> D
C --> D
D --> E
E --> F
F --> G
G --> H
G --> I
H --> J
H --> K
H --> L
I --> M
G --> N
G --> O
style D fill:#3b82f6
style G fill:#8b5cf6
style H fill:#10b981
style I fill:#f59e0b
Request Queuing Strategy
LLM inference can be slow and expensive. Implementing a queue helps manage load and prioritize requests:
interface QueuedRequest {
  id: string;
  prompt: string;
  priority: 'high' | 'normal' | 'low';
  userId: string;
  timestamp: number;
  retries: number;
}
class LLMRequestQueue {
  private queue: QueuedRequest[] = [];
  private processing = false;
  private maxConcurrent = 5;

  async enqueue(request: QueuedRequest): Promise<string> {
    // Insert ahead of the first lower-priority request so the queue
    // stays ordered high -> normal -> low
    const insertIndex = this.queue.findIndex(
      r => this.getPriorityValue(r.priority) < this.getPriorityValue(request.priority)
    );
    if (insertIndex === -1) {
      this.queue.push(request);
    } else {
      this.queue.splice(insertIndex, 0, request);
    }

    // Kick off processing without awaiting it; enqueue returns immediately
    void this.processQueue();
    return request.id;
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) return;
    this.processing = true;

    while (this.queue.length > 0) {
      // Drain the queue in batches of up to maxConcurrent parallel requests
      const batch = this.queue.splice(0, this.maxConcurrent);
      await Promise.all(batch.map(req => this.processRequest(req)));
    }

    this.processing = false;
  }

  private async processRequest(request: QueuedRequest): Promise<void> {
    // Dispatch to the LLM provider here (for example with the
    // callLLMWithRetry helper shown later) and deliver the result to the caller
  }

  private getPriorityValue(priority: string): number {
    const values = { high: 3, normal: 2, low: 1 };
    return values[priority as keyof typeof values] || 0;
  }
}
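A minimal usage sketch, assuming Node's crypto.randomUUID for request IDs; the prompt, user, and priority values are illustrative:
import { randomUUID } from 'crypto';

const requestQueue = new LLMRequestQueue();

// Paying customers get 'high' priority; background jobs can go in as 'low'
async function submitCompletion(userId: string, prompt: string): Promise<string> {
  return requestQueue.enqueue({
    id: randomUUID(),
    prompt,
    priority: 'high',
    userId,
    timestamp: Date.now(),
    retries: 0,
  });
}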
Rate Limiting Patterns
Implement multiple rate limiting strategies to protect your API:
Token-Based Rate Limiting
class TokenBucketRateLimiter {
  private buckets: Map<string, { tokens: number; lastRefill: number }> = new Map();

  constructor(
    private capacity: number,
    private refillRate: number // tokens per second
  ) {}

  async checkLimit(userId: string, tokens: number): Promise<boolean> {
    const bucket = this.buckets.get(userId) || {
      tokens: this.capacity,
      lastRefill: Date.now()
    };

    // Refill the bucket based on the time elapsed since the last check
    const now = Date.now();
    const elapsed = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsed * this.refillRate
    );
    bucket.lastRefill = now;
    this.buckets.set(userId, bucket);

    // Allow the request only if enough tokens remain, then deduct them
    if (bucket.tokens >= tokens) {
      bucket.tokens -= tokens;
      return true;
    }
    return false;
  }
}
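For example, a budget of roughly 100,000 tokens per user per hour can be enforced before a request is queued. The sketch below uses a crude characters-per-token heuristic for the estimate; a real deployment would use a tokenizer or the provider's reported usage:
// Roughly 100,000 tokens per user per hour: capacity 100_000, refilled at ~28 tokens/second
const limiter = new TokenBucketRateLimiter(100_000, 100_000 / 3600);

async function handleCompletion(userId: string, prompt: string): Promise<void> {
  // Crude estimate: ~4 characters per token for English, plus headroom for the reply
  const estimatedTokens = Math.ceil(prompt.length / 4) + 500;

  if (!(await limiter.checkLimit(userId, estimatedTokens))) {
    // Surface this as HTTP 429 with a Retry-After header in your API layer
    throw new Error('Rate limit exceeded');
  }

  // ...enqueue the request or call the model here
}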
Cost-Based Rate Limiting
Limit based on estimated API costs:
class CostBasedLimiter {
  async checkCostLimit(userId: string, estimatedCost: number): Promise<boolean> {
    // getDailySpend and getUserLimit are backed by your usage store;
    // a Redis-based sketch follows below
    const dailySpend = await this.getDailySpend(userId);
    const limit = await this.getUserLimit(userId);
    return (dailySpend + estimatedCost) <= limit;
  }
}
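A minimal sketch of that usage store, assuming an ioredis client and illustrative per-1K-token prices that you would keep in config and update as providers change their pricing:
import Redis from 'ioredis';

// Illustrative prices in USD per 1K tokens; not authoritative
const PRICE_PER_1K_TOKENS: Record<string, number> = {
  'gpt-4': 0.03,
  'gpt-3.5-turbo': 0.0015,
};

function estimateCost(model: string, promptTokens: number, maxOutputTokens: number): number {
  const price = PRICE_PER_1K_TOKENS[model] ?? 0.03; // assume the expensive case when unknown
  return ((promptTokens + maxOutputTokens) / 1000) * price;
}

class RedisUsageStore {
  constructor(private redis: Redis) {}

  // One spend counter per user per UTC day
  private key(userId: string): string {
    const day = new Date().toISOString().slice(0, 10);
    return `llm:spend:${userId}:${day}`;
  }

  async getDailySpend(userId: string): Promise<number> {
    const value = await this.redis.get(this.key(userId));
    return value ? parseFloat(value) : 0;
  }

  async recordSpend(userId: string, cost: number): Promise<void> {
    const key = this.key(userId);
    await this.redis.incrbyfloat(key, cost);
    await this.redis.expire(key, 60 * 60 * 48); // keep roughly two days of history
  }
}
The per-user limit itself would typically come from the user's plan, stored in PostgreSQL as shown in the architecture above.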
Streaming Response Handling
LLM APIs often support streaming for better user experience. Here’s how to handle it:
async function streamLLMResponse(
  prompt: string,
  onChunk: (chunk: string) => void
): Promise<void> {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });

  if (!response.ok) {
    throw new Error(`LLM request failed: ${response.status}`);
  }

  const reader = response.body?.getReader();
  const decoder = new TextDecoder();
  if (!reader) throw new Error('No response body');

  // Buffer partial lines: a network chunk can end mid-way through an SSE event
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') return;
        try {
          const json = JSON.parse(data);
          const content = json.choices[0]?.delta?.content || '';
          if (content) onChunk(content);
        } catch (e) {
          console.error('Error parsing chunk:', e);
        }
      }
    }
  }
}
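To relay those chunks to clients, the streaming handler from the architecture diagram can forward them over Server-Sent Events. A minimal sketch assuming an Express app; the route path and event payload shape are illustrative:
import express from 'express';

const app = express();
app.use(express.json());

app.post('/v1/completions/stream', async (req, res) => {
  // Standard SSE headers so browsers and proxies keep the connection open
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    await streamLLMResponse(req.body.prompt, (chunk) => {
      // Forward each token to the client as its own SSE event
      res.write(`data: ${JSON.stringify({ content: chunk })}\n\n`);
    });
    res.write('data: [DONE]\n\n');
  } catch {
    res.write(`data: ${JSON.stringify({ error: 'generation_failed' })}\n\n`);
  } finally {
    res.end();
  }
});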
Error Handling and Retries
Implement robust error handling with exponential backoff:
Error Handling Flow
sequenceDiagram
participant Client
participant API
participant LLM Provider
participant Cache
Client->>API: Request
API->>Cache: Check cache
Cache-->>API: Cache miss
API->>LLM Provider: Request
LLM Provider-->>API: Error (rate limit)
Note over API: Exponential backoff
API->>LLM Provider: Retry request
LLM Provider-->>API: Success
API->>Cache: Store result
API-->>Client: Stream response
alt Error persists
API-->>Client: Error response
Note over Client: Fallback to cached/default
end
async function callLLMWithRetry(
  prompt: string,
  maxRetries = 3
): Promise<string> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      // callLLMAPI wraps your provider SDK or HTTP call
      return await callLLMAPI(prompt);
    } catch (error: any) {
      lastError = error;

      // Don't retry on client errors (4xx), except 429 rate limits,
      // which are worth retrying after a backoff
      if (error.status >= 400 && error.status < 500 && error.status !== 429) {
        throw error;
      }

      // Exponential backoff: 1s, 2s, 4s, ... capped at 10s
      const delay = Math.min(1000 * Math.pow(2, attempt), 10000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError || new Error('Max retries exceeded');
}
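The sequence diagram's final branch, falling back when the error persists, can be sketched as a thin wrapper around the retry helper. This assumes the LLMResponseCache shown in the next section and a caller-supplied default message:
async function callLLMWithFallback(
  prompt: string,
  cache: LLMResponseCache,
  defaultMessage = 'The assistant is temporarily unavailable. Please try again shortly.'
): Promise<string> {
  const model = 'gpt-4'; // illustrative; use whichever model the request targets

  try {
    const result = await callLLMWithRetry(prompt);
    // Best-effort cache write; a failure here should not fail the request
    await cache.set(prompt, model, result).catch(() => {});
    return result;
  } catch {
    // Retries exhausted: serve a previously cached answer if one exists,
    // otherwise a safe default so the client always gets a response
    const cached = await cache.get(prompt, model);
    return cached ?? defaultMessage;
  }
}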
Response Caching
Cache responses to reduce costs and improve latency:
import { createHash } from 'crypto';
import Redis from 'ioredis'; // assuming an ioredis-style client; adapt to your Redis library

class LLMResponseCache {
  constructor(private redis: Redis) {}

  async get(prompt: string, model: string): Promise<string | null> {
    const key = this.getCacheKey(prompt, model);
    return await this.redis.get(key);
  }

  async set(prompt: string, model: string, response: string, ttl = 3600): Promise<void> {
    const key = this.getCacheKey(prompt, model);
    // SETEX stores the value with a TTL (in seconds) in a single round trip
    await this.redis.setex(key, ttl, response);
  }

  private getCacheKey(prompt: string, model: string): string {
    // Hash model + prompt so arbitrarily long prompts map to fixed-size keys
    const hash = createHash('sha256')
      .update(`${model}:${prompt}`)
      .digest('hex');
    return `llm:cache:${hash}`;
  }
}
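A read-through wrapper ties the cache into the request path; the model name and TTL here are illustrative:
const cache = new LLMResponseCache(new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379'));

async function completeWithCache(prompt: string, model = 'gpt-4'): Promise<string> {
  // Serve identical prompts straight from Redis and skip the provider call
  const cached = await cache.get(prompt, model);
  if (cached) return cached;

  const response = await callLLMWithRetry(prompt);
  // Cache for an hour; tune the TTL to how quickly your answers go stale
  await cache.set(prompt, model, response, 3600);
  return response;
}
Exact-match caching only pays off when identical prompts recur (FAQ-style or system-generated queries), so treat it as a cheap first step rather than a guarantee of high hit rates.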
Monitoring and Observability
Track key metrics for LLM APIs:
// Shape of the metrics backend; adapt to StatsD, Datadog, OpenTelemetry, etc.
interface MetricsClient {
  record(metric: {
    name: string;
    tags: Record<string, string>;
    values: Record<string, number>;
  }): Promise<void>;
}

class LLMAPIMetrics {
  constructor(private metricsClient: MetricsClient) {}

  async recordRequest(metrics: {
    userId: string;
    model: string;
    tokens: number;
    latency: number;
    cost: number;
    success: boolean;
  }): Promise<void> {
    // Record to your metrics system
    await this.metricsClient.record({
      name: 'llm.api.request',
      tags: {
        model: metrics.model,
        user: metrics.userId,
        success: metrics.success.toString(),
      },
      values: {
        tokens: metrics.tokens,
        latency: metrics.latency,
        cost: metrics.cost,
      },
    });
  }
}
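One way to wire this in is a small wrapper that times each provider call and records the outcome. It reuses the retry helper and the estimateCost sketch from earlier, and the character-based token count is only a rough stand-in for the usage figures the provider reports:
async function completeWithMetrics(
  apiMetrics: LLMAPIMetrics,
  userId: string,
  prompt: string
): Promise<string> {
  const model = 'gpt-4'; // or the output of a model router
  const start = Date.now();
  let success = false;
  let response = '';

  try {
    response = await callLLMWithRetry(prompt);
    success = true;
    return response;
  } finally {
    // Rough accounting; prefer the token usage the provider returns in its response
    const tokens = Math.ceil((prompt.length + response.length) / 4);
    await apiMetrics.recordRequest({
      userId,
      model,
      tokens,
      latency: Date.now() - start,
      cost: estimateCost(model, tokens, 0),
      success,
    });
  }
}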
Cost Optimization Strategies
Model Selection
Choose the right model for each use case; a simple routing sketch follows the list:
- GPT-4: Complex reasoning, high accuracy
- GPT-3.5-turbo: General purpose, cost-effective
- Claude: Long context, document analysis
- Self-hosted: High volume, data privacy
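A minimal router based on the use cases above; the task labels and model identifiers are illustrative and will drift as providers update their lineups:
type TaskType = 'complex-reasoning' | 'general' | 'long-context' | 'high-volume';

// Map each task profile onto the tiers described above; adjust the model
// identifiers to whatever your providers currently offer
function selectModel(task: TaskType): string {
  switch (task) {
    case 'complex-reasoning':
      return 'gpt-4';               // complex reasoning, high accuracy
    case 'long-context':
      return 'claude-3-sonnet';     // long context, document analysis
    case 'high-volume':
      return 'self-hosted-llama';   // high volume, data stays in-house
    default:
      return 'gpt-3.5-turbo';       // general purpose, cost-effective
  }
}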
Prompt Optimization
Reduce token usage through prompt engineering:
// Bad: Verbose prompt
const badPrompt = `
Please analyze the following user query and provide a comprehensive response
that addresses all aspects of their question in detail...
`;
// Good: Concise prompt
const goodPrompt = `Analyze: ${userQuery}`;
Batch Processing
Process multiple requests together when possible:
async function batchLLMRequests(requests: string[]): Promise<string[]> {
  // Some providers support batch API calls; the endpoint and payload
  // shape here are illustrative and vary by provider
  const response = await fetch('/v1/batch', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompts: requests }),
  });
  return response.json();
}
Security Considerations
Input Validation
function validatePrompt(prompt: string): void {
  if (prompt.length > 10000) {
    throw new Error('Prompt too long');
  }

  // Check for injection attempts
  const suspiciousPatterns = [
    /ignore previous instructions/i,
    /system prompt/i,
    /\[INST\]/i,
  ];

  for (const pattern of suspiciousPatterns) {
    if (pattern.test(prompt)) {
      throw new Error('Suspicious prompt detected');
    }
  }
}
Output Filtering
Filter sensitive information from responses:
function filterResponse(response: string): string {
  // Remove API keys, tokens, etc.
  return response.replace(
    /(?:api[_-]?key|token|password|secret)[\s:=]+[\w-]+/gi,
    '[REDACTED]'
  );
}
Conclusion
Designing production-ready LLM APIs requires careful attention to queuing, rate limiting, streaming, caching, and error handling. By implementing these patterns, you can build scalable APIs that handle high traffic while maintaining performance and controlling costs.
Key Takeaways
- Queue requests to manage load and prioritize important requests
- Implement multiple rate limiting strategies (token-based, cost-based)
- Support streaming for better user experience
- Cache responses to reduce costs and improve latency
- Handle errors gracefully with retries and fallbacks
- Monitor costs and performance to optimize usage
- Validate inputs and filter outputs for security