Designing Scalable LLM APIs for Production
Learn how to design robust APIs for Large Language Model integrations, covering request queuing, rate limiting, streaming responses, and the error handling patterns used in production systems.
Introduction
Building production-ready APIs for LLM integrations requires careful consideration of latency, cost, and reliability. This article explores proven patterns for designing scalable LLM APIs that can handle high traffic while maintaining performance and cost efficiency.
Architecture Overview
A well-designed LLM API architecture separates concerns and handles the unique challenges of AI model inference:
LLM API Architecture
flowchart TB
subgraph "Client Layer"
A[Web Application]
B[Mobile App]
C[Third-party Services]
end
subgraph "API Gateway"
D[Load Balancer]
E[Rate Limiter]
F[Request Queue]
end
subgraph "API Services"
G[LLM API Service]
H[Streaming Handler]
I[Response Cache]
end
subgraph "LLM Providers"
J[OpenAI API]
K[Anthropic API]
L[Self-hosted Model]
end
subgraph "Infrastructure"
M[(Redis Cache)]
N[(PostgreSQL)]
O[Monitoring]
end
A --> D
B --> D
C --> D
D --> E
E --> F
F --> G
G --> H
G --> I
H --> J
H --> K
H --> L
I --> M
G --> N
G --> O
style D fill:#3b82f6
style G fill:#8b5cf6
style H fill:#10b981
style I fill:#f59e0b
Request Queuing Strategy
LLM inference can be slow and expensive. Implementing a queue helps manage load and prioritize requests:
interface QueuedRequest {
  id: string;
  prompt: string;
  priority: 'high' | 'normal' | 'low';
  userId: string;
  timestamp: number;
  retries: number;
}
class LLMRequestQueue {
  private queue: QueuedRequest[] = [];
  private processing = false;
  private maxConcurrent = 5;

  async enqueue(request: QueuedRequest): Promise<string> {
    // Insert ahead of the first lower-priority request so the queue
    // stays ordered high -> normal -> low
    const insertIndex = this.queue.findIndex(
      r => this.getPriorityValue(r.priority) < this.getPriorityValue(request.priority)
    );
    if (insertIndex === -1) {
      this.queue.push(request);
    } else {
      this.queue.splice(insertIndex, 0, request);
    }

    // Kick off processing without awaiting it; enqueue returns immediately
    void this.processQueue();
    return request.id;
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) return;
    this.processing = true;

    while (this.queue.length > 0) {
      // Drain the queue in batches of up to maxConcurrent parallel requests
      const batch = this.queue.splice(0, this.maxConcurrent);
      await Promise.all(batch.map(req => this.processRequest(req)));
    }

    this.processing = false;
  }

  private async processRequest(request: QueuedRequest): Promise<void> {
    // Dispatch to the LLM provider here (for example with the
    // callLLMWithRetry helper shown later) and deliver the result to the caller
  }

  private getPriorityValue(priority: string): number {
    const values = { high: 3, normal: 2, low: 1 };
    return values[priority as keyof typeof values] || 0;
  }
}
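A minimal usage sketch, assuming Node's crypto.randomUUID for request IDs; the prompt, user, and priority values are illustrative:
import { randomUUID } from 'crypto';

const requestQueue = new LLMRequestQueue();

// Paying customers get 'high' priority; background jobs can go in as 'low'
async function submitCompletion(userId: string, prompt: string): Promise<string> {
  return requestQueue.enqueue({
    id: randomUUID(),
    prompt,
    priority: 'high',
    userId,
    timestamp: Date.now(),
    retries: 0,
  });
}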
Rate Limiting Patterns
Implement multiple rate limiting strategies to protect your API:
Token-Based Rate Limiting
class TokenBucketRateLimiter {
  private buckets: Map<string, { tokens: number; lastRefill: number }> = new Map();

  constructor(
    private capacity: number,
    private refillRate: number // tokens per second
  ) {}

  async checkLimit(userId: string, tokens: number): Promise<boolean> {
    const bucket = this.buckets.get(userId) || {
      tokens: this.capacity,
      lastRefill: Date.now()
    };

    // Refill the bucket based on the time elapsed since the last check
    const now = Date.now();
    const elapsed = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsed * this.refillRate
    );
    bucket.lastRefill = now;
    this.buckets.set(userId, bucket);

    // Allow the request only if enough tokens remain, then deduct them
    if (bucket.tokens >= tokens) {
      bucket.tokens -= tokens;
      return true;
    }
    return false;
  }
}
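For example, a budget of roughly 100,000 tokens per user per hour can be enforced before a request is queued. The sketch below uses a crude characters-per-token heuristic for the estimate; a real deployment would use a tokenizer or the provider's reported usage:
// Roughly 100,000 tokens per user per hour: capacity 100_000, refilled at ~28 tokens/second
const limiter = new TokenBucketRateLimiter(100_000, 100_000 / 3600);

async function handleCompletion(userId: string, prompt: string): Promise<void> {
  // Crude estimate: ~4 characters per token for English, plus headroom for the reply
  const estimatedTokens = Math.ceil(prompt.length / 4) + 500;

  if (!(await limiter.checkLimit(userId, estimatedTokens))) {
    // Surface this as HTTP 429 with a Retry-After header in your API layer
    throw new Error('Rate limit exceeded');
  }

  // ...enqueue the request or call the model here
}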
Cost-Based Rate Limiting
Limit based on estimated API costs:
class CostBasedLimiter {
  async checkCostLimit(userId: string, estimatedCost: number): Promise<boolean> {
    // getDailySpend and getUserLimit are backed by your usage store;
    // a Redis-based sketch follows below
    const dailySpend = await this.getDailySpend(userId);
    const limit = await this.getUserLimit(userId);
    return (dailySpend + estimatedCost) <= limit;
  }
}
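A minimal sketch of that usage store, assuming an ioredis client and illustrative per-1K-token prices that you would keep in config and update as providers change their pricing:
import Redis from 'ioredis';

// Illustrative prices in USD per 1K tokens; not authoritative
const PRICE_PER_1K_TOKENS: Record<string, number> = {
  'gpt-4': 0.03,
  'gpt-3.5-turbo': 0.0015,
};

function estimateCost(model: string, promptTokens: number, maxOutputTokens: number): number {
  const price = PRICE_PER_1K_TOKENS[model] ?? 0.03; // assume the expensive case when unknown
  return ((promptTokens + maxOutputTokens) / 1000) * price;
}

class RedisUsageStore {
  constructor(private redis: Redis) {}

  // One spend counter per user per UTC day
  private key(userId: string): string {
    const day = new Date().toISOString().slice(0, 10);
    return `llm:spend:${userId}:${day}`;
  }

  async getDailySpend(userId: string): Promise<number> {
    const value = await this.redis.get(this.key(userId));
    return value ? parseFloat(value) : 0;
  }

  async recordSpend(userId: string, cost: number): Promise<void> {
    const key = this.key(userId);
    await this.redis.incrbyfloat(key, cost);
    await this.redis.expire(key, 60 * 60 * 48); // keep roughly two days of history
  }
}
The per-user limit itself would typically come from the user's plan, stored in PostgreSQL as shown in the architecture above.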
Streaming Response Handling
LLM APIs often support streaming for better user experience. Here’s how to handle it:
async function streamLLMResponse(
  prompt: string,
  onChunk: (chunk: string) => void
): Promise<void> {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });

  if (!response.ok) {
    throw new Error(`LLM request failed: ${response.status}`);
  }

  const reader = response.body?.getReader();
  const decoder = new TextDecoder();
  if (!reader) throw new Error('No response body');

  // Buffer partial lines: a network chunk can end mid-way through an SSE event
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') return;
        try {
          const json = JSON.parse(data);
          const content = json.choices[0]?.delta?.content || '';
          if (content) onChunk(content);
        } catch (e) {
          console.error('Error parsing chunk:', e);
        }
      }
    }
  }
}
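To relay those chunks to clients, the streaming handler from the architecture diagram can forward them over Server-Sent Events. A minimal sketch assuming an Express app; the route path and event payload shape are illustrative:
import express from 'express';

const app = express();
app.use(express.json());

app.post('/v1/completions/stream', async (req, res) => {
  // Standard SSE headers so browsers and proxies keep the connection open
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    await streamLLMResponse(req.body.prompt, (chunk) => {
      // Forward each token to the client as its own SSE event
      res.write(`data: ${JSON.stringify({ content: chunk })}\n\n`);
    });
    res.write('data: [DONE]\n\n');
  } catch {
    res.write(`data: ${JSON.stringify({ error: 'generation_failed' })}\n\n`);
  } finally {
    res.end();
  }
});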
Error Handling and Retries
Implement robust error handling with exponential backoff:
Error Handling Flow
sequenceDiagram
participant Client
participant API
participant LLM Provider
participant Cache
Client->>API: Request
API->>Cache: Check cache
Cache-->>API: Cache miss
API->>LLM Provider: Request
LLM Provider-->>API: Error (rate limit)
Note over API: Exponential backoff
API->>LLM Provider: Retry request
LLM Provider-->>API: Success
API->>Cache: Store result
API-->>Client: Stream response
alt Error persists
API-->>Client: Error response
Note over Client: Fallback to cached/default
end
async function callLLMWithRetry(
  prompt: string,
  maxRetries = 3
): Promise<string> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      // callLLMAPI wraps your provider SDK or HTTP call
      return await callLLMAPI(prompt);
    } catch (error: any) {
      lastError = error;

      // Don't retry on client errors (4xx), except 429 rate limits,
      // which are worth retrying after a backoff
      if (error.status >= 400 && error.status < 500 && error.status !== 429) {
        throw error;
      }

      // Exponential backoff: 1s, 2s, 4s, ... capped at 10s
      const delay = Math.min(1000 * Math.pow(2, attempt), 10000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError || new Error('Max retries exceeded');
}
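The sequence diagram's final branch, falling back when the error persists, can be sketched as a thin wrapper around the retry helper. This assumes the LLMResponseCache shown in the next section and a caller-supplied default message:
async function callLLMWithFallback(
  prompt: string,
  cache: LLMResponseCache,
  defaultMessage = 'The assistant is temporarily unavailable. Please try again shortly.'
): Promise<string> {
  const model = 'gpt-4'; // illustrative; use whichever model the request targets

  try {
    const result = await callLLMWithRetry(prompt);
    // Best-effort cache write; a failure here should not fail the request
    await cache.set(prompt, model, result).catch(() => {});
    return result;
  } catch {
    // Retries exhausted: serve a previously cached answer if one exists,
    // otherwise a safe default so the client always gets a response
    const cached = await cache.get(prompt, model);
    return cached ?? defaultMessage;
  }
}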
Response Caching
Cache responses to reduce costs and improve latency:
import { createHash } from 'crypto';
import Redis from 'ioredis'; // assuming an ioredis-style client; adapt to your Redis library

class LLMResponseCache {
  constructor(private redis: Redis) {}

  async get(prompt: string, model: string): Promise<string | null> {
    const key = this.getCacheKey(prompt, model);
    return await this.redis.get(key);
  }

  async set(prompt: string, model: string, response: string, ttl = 3600): Promise<void> {
    const key = this.getCacheKey(prompt, model);
    // SETEX stores the value with a TTL (in seconds) in a single round trip
    await this.redis.setex(key, ttl, response);
  }

  private getCacheKey(prompt: string, model: string): string {
    // Hash model + prompt so arbitrarily long prompts map to fixed-size keys
    const hash = createHash('sha256')
      .update(`${model}:${prompt}`)
      .digest('hex');
    return `llm:cache:${hash}`;
  }
}
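A read-through wrapper ties the cache into the request path; the model name and TTL here are illustrative:
const cache = new LLMResponseCache(new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379'));

async function completeWithCache(prompt: string, model = 'gpt-4'): Promise<string> {
  // Serve identical prompts straight from Redis and skip the provider call
  const cached = await cache.get(prompt, model);
  if (cached) return cached;

  const response = await callLLMWithRetry(prompt);
  // Cache for an hour; tune the TTL to how quickly your answers go stale
  await cache.set(prompt, model, response, 3600);
  return response;
}
Exact-match caching only pays off when identical prompts recur (FAQ-style or system-generated queries), so treat it as a cheap first step rather than a guarantee of high hit rates.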
Monitoring and Observability
Track key metrics for LLM APIs:
// Shape of the metrics backend; adapt to StatsD, Datadog, OpenTelemetry, etc.
interface MetricsClient {
  record(metric: {
    name: string;
    tags: Record<string, string>;
    values: Record<string, number>;
  }): Promise<void>;
}

class LLMAPIMetrics {
  constructor(private metricsClient: MetricsClient) {}

  async recordRequest(metrics: {
    userId: string;
    model: string;
    tokens: number;
    latency: number;
    cost: number;
    success: boolean;
  }): Promise<void> {
    // Record to your metrics system
    await this.metricsClient.record({
      name: 'llm.api.request',
      tags: {
        model: metrics.model,
        user: metrics.userId,
        success: metrics.success.toString(),
      },
      values: {
        tokens: metrics.tokens,
        latency: metrics.latency,
        cost: metrics.cost,
      },
    });
  }
}
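One way to wire this in is a small wrapper that times each provider call and records the outcome. It reuses the retry helper and the estimateCost sketch from earlier, and the character-based token count is only a rough stand-in for the usage figures the provider reports:
async function completeWithMetrics(
  apiMetrics: LLMAPIMetrics,
  userId: string,
  prompt: string
): Promise<string> {
  const model = 'gpt-4'; // or the output of a model router
  const start = Date.now();
  let success = false;
  let response = '';

  try {
    response = await callLLMWithRetry(prompt);
    success = true;
    return response;
  } finally {
    // Rough accounting; prefer the token usage the provider returns in its response
    const tokens = Math.ceil((prompt.length + response.length) / 4);
    await apiMetrics.recordRequest({
      userId,
      model,
      tokens,
      latency: Date.now() - start,
      cost: estimateCost(model, tokens, 0),
      success,
    });
  }
}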
Cost Optimization Strategies
Model Selection
Choose the right model for each use case; a simple routing sketch follows the list:
- GPT-4: Complex reasoning, high accuracy
- GPT-3.5-turbo: General purpose, cost-effective
- Claude: Long context, document analysis
- Self-hosted: High volume, data privacy
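A minimal router based on the use cases above; the task labels and model identifiers are illustrative and will drift as providers update their lineups:
type TaskType = 'complex-reasoning' | 'general' | 'long-context' | 'high-volume';

// Map each task profile onto the tiers described above; adjust the model
// identifiers to whatever your providers currently offer
function selectModel(task: TaskType): string {
  switch (task) {
    case 'complex-reasoning':
      return 'gpt-4';               // complex reasoning, high accuracy
    case 'long-context':
      return 'claude-3-sonnet';     // long context, document analysis
    case 'high-volume':
      return 'self-hosted-llama';   // high volume, data stays in-house
    default:
      return 'gpt-3.5-turbo';       // general purpose, cost-effective
  }
}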
Prompt Optimization
Reduce token usage through prompt engineering:
// Bad: Verbose prompt
const badPrompt = `
Please analyze the following user query and provide a comprehensive response
that addresses all aspects of their question in detail...
`;
// Good: Concise prompt
const goodPrompt = `Analyze: ${userQuery}`;
Batch Processing
Process multiple requests together when possible:
async function batchLLMRequests(requests: string[]): Promise<string[]> {
  // Some providers support batch API calls; the endpoint and payload
  // shape here are illustrative and vary by provider
  const response = await fetch('/v1/batch', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompts: requests }),
  });
  return response.json();
}
Security Considerations
Input Validation
function validatePrompt(prompt: string): void {
  if (prompt.length > 10000) {
    throw new Error('Prompt too long');
  }

  // Check for injection attempts
  const suspiciousPatterns = [
    /ignore previous instructions/i,
    /system prompt/i,
    /\[INST\]/i,
  ];

  for (const pattern of suspiciousPatterns) {
    if (pattern.test(prompt)) {
      throw new Error('Suspicious prompt detected');
    }
  }
}
Output Filtering
Filter sensitive information from responses:
function filterResponse(response: string): string {
  // Remove API keys, tokens, etc.
  return response.replace(
    /(?:api[_-]?key|token|password|secret)[\s:=]+[\w-]+/gi,
    '[REDACTED]'
  );
}
Conclusion
Designing production-ready LLM APIs requires careful attention to queuing, rate limiting, streaming, caching, and error handling. By implementing these patterns, you can build scalable APIs that handle high traffic while maintaining performance and controlling costs.
Key Takeaways
- Queue requests to manage load and prioritize important requests
- Implement multiple rate limiting strategies (token-based, cost-based)
- Support streaming for better user experience
- Cache responses to reduce costs and improve latency
- Handle errors gracefully with retries and fallbacks
- Monitor costs and performance to optimize usage
- Validate inputs and filter outputs for security