LiteLLM: One Interface for Every LLM Provider

Published March 9, 2026

Every major LLM provider (OpenAI, Anthropic, Google, Mistral, Cohere, Azure, AWS Bedrock) ships its own SDK with its own request format, authentication scheme, error types, and streaming protocol. Building a production system that uses more than one means rewriting the same logic for each provider, and switching providers means refactoring your entire inference layer.

LiteLLM solves this with a single, OpenAI-compatible interface that routes to 100+ models across all major providers. You write your code once against the OpenAI chat/completions format, and LiteLLM transparently handles the translation, retries, fallbacks, cost tracking, and observability.

1. Core Usage: Drop-in Replacement

The simplest LiteLLM usage requires almost no setup. Install the package, set your API keys as environment variables, and call litellm.completion() with any model string. The call format is identical to the OpenAI SDK; the only thing that changes is the model name.

import litellm
import os

# API keys picked up from environment variables:
# OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.

messages = [{"role": "user", "content": "Explain transformers in one paragraph."}]

# ── OpenAI ───────────────────────────────────────────────────────────────────
response = litellm.completion(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)

# ── Anthropic ────────────────────────────────────────────────────────────────
response = litellm.completion(model="claude-sonnet-4-6", messages=messages)
print(response.choices[0].message.content)

# ── Google Gemini ────────────────────────────────────────────────────────────
response = litellm.completion(model="gemini/gemini-2.0-flash", messages=messages)
print(response.choices[0].message.content)

# ── AWS Bedrock ──────────────────────────────────────────────────────────────
response = litellm.completion(
    model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=messages,
)
print(response.choices[0].message.content)

The same interface is available asynchronously through litellm.acompletion(). Passing stream=True returns the response as an async iterator of chunks:

import asyncio
import litellm

async def stream_response(model: str, prompt: str):
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(stream_response("claude-sonnet-4-6", "Write a haiku about RAG pipelines."))
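
When the full text is needed after streaming (for logging or caching, say), the deltas have to be joined by the caller. The following is a minimal sketch of that step, not part of LiteLLM itself: collect_stream and fake_stream are hypothetical helpers, and fake_stream only imitates the chunk shape used above (chunk.choices[0].delta.content) so the example runs without an API key.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator, Iterable, Optional


async def collect_stream(chunks: AsyncIterator) -> str:
    """Join the text deltas of an OpenAI-style chunk stream into one string."""
    parts = []
    async for chunk in chunks:
        # A chunk may carry no text (content is None), so fall back to "".
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


async def fake_stream(pieces: Iterable[Optional[str]]):
    """Hypothetical stand-in for the object returned by acompletion(stream=True)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


text = asyncio.run(collect_stream(fake_stream(["Retrieve, ", "then generate.", None])))
print(text)
```

In real use, the iterator returned by litellm.acompletion(..., stream=True) would be passed to collect_stream in place of fake_stream.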

2. Fallbacks and Retries

One of LiteLLM's most valuable production features is automatic fallbacks: if a model is unavailable or rate-limited, LiteLLM can transparently retry with an alternative model, without any change to application code. Fallbacks can be declared per call or globally in configuration.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is RAG?"}],
    # If gpt-4o fails, try these in order:
    fallbacks=["claude-sonnet-4-6", "gemini/gemini-2.0-flash"],
    # Retry on rate-limit or server errors:
    num_retries=3,
    # Hard timeout per attempt in seconds:
    timeout=30,
)

print(response.choices[0].message.content)
print(f"Model actually used: {response.model}")
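
To make the behaviour concrete, here is a rough sketch of the retry-then-fall-back pattern in plain Python. It is an illustration under stated assumptions, not LiteLLM's internal implementation: call_with_fallbacks and flaky_call are hypothetical names, and the exact order in which LiteLLM interleaves retries and fallbacks may differ.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def call_with_fallbacks(
    call: Callable[[str], str],
    models: Sequence[str],
    num_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> Tuple[str, str]:
    """Try each model in order, retrying each one before moving to the next.

    Returns (model_used, result). Raises RuntimeError if every model fails.
    """
    last_error: Optional[Exception] = None
    for model in models:
        # One initial attempt plus num_retries retries per model.
        for attempt in range(1 + num_retries):
            try:
                return model, call(model)
            except Exception as err:  # real code would catch only retryable errors
                last_error = err
                time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all models failed") from last_error


def flaky_call(model: str) -> str:
    """Hypothetical provider call: the primary model is always rate-limited."""
    if model == "gpt-4o":
        raise TimeoutError("rate limited")
    return f"answer from {model}"


used, answer = call_with_fallbacks(
    flaky_call, ["gpt-4o", "claude-sonnet-4-6"], num_retries=2
)
print(used, "->", answer)
```

The caller sees one successful result and can inspect which model produced it, which is the same information response.model exposes in the LiteLLM example above.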

3. Cost and Token Tracking

LiteLLM ships with a built-in cost database covering token prices for all major providers. Every response includes a _hidden_params field with the computed cost, and you can also compute the cost yourself with litellm.completion_cost() or litellm.cost_per_token(). This makes per-request cost attribution straightforward without a third-party billing tool.

import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise the history of databases."}],
)

# Token usage is always present on the response object
usage = response.usage
print(f"Prompt tokens:     {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens:      {usage.total_tokens}")

# Cost in USD for this specific call
cost = litellm.completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

# Or compute cost directly from token counts + model name
cost2 = litellm.cost_per_token(
    model="gpt-4o",
    prompt_tokens=usage.prompt_tokens,
    completion_tokens=usage.completion_tokens,
)
print(f"Prompt cost: ${cost2[0]:.6f}  |  Completion cost: ${cost2[1]:.6f}")
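
The arithmetic behind a per-token cost lookup is simple: each side of the call is its token count multiplied by a per-token price. The sketch below works through it with made-up numbers. EXAMPLE_PRICES and cost_breakdown are hypothetical, and the prices are illustrative only; real values come from LiteLLM's bundled cost map and change over time.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per token, chosen only for illustration.
EXAMPLE_PRICES: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}


def cost_breakdown(
    model: str, prompt_tokens: int, completion_tokens: int
) -> Tuple[float, float]:
    """Return (prompt_cost, completion_cost) in USD for one call."""
    price = EXAMPLE_PRICES[model]
    return prompt_tokens * price["input"], completion_tokens * price["output"]


prompt_cost, completion_cost = cost_breakdown("example-model", 1_200, 300)
total = prompt_cost + completion_cost
print(f"Prompt: ${prompt_cost:.6f}  Completion: ${completion_cost:.6f}  Total: ${total:.6f}")
```

With these example prices, 1,200 prompt tokens and 300 completion tokens each cost $0.003, which shows why output-heavy workloads can dominate spend even when prompts are much longer.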

4. LiteLLM Proxy: The Self-Hosted LLM Gateway

Beyond the Python SDK, LiteLLM ships a self-hosted proxy server that exposes an OpenAI-compatible REST API in front of all your models. Any client that speaks the OpenAI API (LangChain, LlamaIndex, your own HTTP client) can point at your LiteLLM proxy instead of OpenAI directly, without changing a single line of client code beyond the base URL and API key.

The proxy handles:

- Routing: map virtual model names to real providers in a central config.

- Auth: issue API keys scoped to teams or services, and rate-limit them independently.

- Observability: stream logs and traces to Langfuse, Helicone, or any OpenTelemetry backend.

- Budget enforcement: set hard spend limits per key, per team, or per model.

- Load balancing: distribute requests across multiple deployments of the same model.

model_list:
  # Virtual name → real provider model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

  # Load balance across two Azure deployments of the same model
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4
      api_base: os.environ/AZURE_API_BASE_1
      api_key: os.environ/AZURE_API_KEY_1

  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4
      api_base: os.environ/AZURE_API_BASE_2
      api_key: os.environ/AZURE_API_KEY_2

litellm_settings:
  drop_params: true        # silently ignore unsupported params per model
  success_callback: ["langfuse"]   # stream all traces to Langfuse

general_settings:
  master_key: sk-my-master-key  # required to create virtual keys

# ── Start the proxy ──────────────────────────────────────────────────────────
pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000

# ── Call it from any OpenAI-compatible client ────────────────────────────────
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-master-key" \
  -d '{
    "model": "claude-sonnet",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from openai import OpenAI

# Point the official openai SDK at your LiteLLM proxy
client = OpenAI(
    api_key="sk-my-master-key",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="claude-sonnet",   # virtual name defined in config.yaml
    messages=[{"role": "user", "content": "What is LiteLLM?"}],
)
print(response.choices[0].message.content)
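
The routing and load-balancing ideas in the config above can be sketched in a few lines: entries that share a model_name form one pool, and each request picks a deployment from that pool. This is a simplified illustration, not the proxy's actual router. MODEL_LIST, build_routing_table, and pick_deployment are hypothetical, and uniform random selection is an assumption; LiteLLM's router offers several strategies.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List

# Mirrors the shape of model_list: several entries may share one model_name.
MODEL_LIST: List[Dict[str, Any]] = [
    {"model_name": "claude-sonnet", "model": "anthropic/claude-sonnet-4-6"},
    {"model_name": "azure-gpt4", "model": "azure/gpt-4", "api_base": "AZURE_API_BASE_1"},
    {"model_name": "azure-gpt4", "model": "azure/gpt-4", "api_base": "AZURE_API_BASE_2"},
]


def build_routing_table(model_list: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group deployments by the virtual name that clients request."""
    table: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for deployment in model_list:
        table[deployment["model_name"]].append(deployment)
    return dict(table)


def pick_deployment(
    table: Dict[str, List[Dict[str, Any]]], model_name: str, rng: random.Random
) -> Dict[str, Any]:
    """Pick one deployment for a virtual model name (uniform random choice)."""
    deployments = table.get(model_name)
    if not deployments:
        raise KeyError(f"unknown model name: {model_name}")
    return rng.choice(deployments)


table = build_routing_table(MODEL_LIST)
rng = random.Random(0)  # seeded so repeated runs give the same picks
picks = [pick_deployment(table, "azure-gpt4", rng)["api_base"] for _ in range(1000)]
print({base: picks.count(base) for base in sorted(set(picks))})
```

Because clients only ever see the virtual name, deployments can be added to or removed from a pool without any client-side change.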

5. Observability: Langfuse Integration

LiteLLM integrates natively with Langfuse for full trace visibility: every request, token count, latency, cost, and model used is logged automatically. In the Python SDK, setting the Langfuse credentials as environment variables and registering the callback is enough to activate it.

import os
import litellm

# Set Langfuse credentials and register the callbacks below. No other code changes.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"]       = "https://cloud.langfuse.com"  # or self-hosted

litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# All calls below are automatically traced in Langfuse
response = litellm.completion(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What are the benefits of RAG?"}],
    metadata={
        "trace_name": "rag-query",
        "tags": ["production", "rag"],
        "user_id": "user-123",
    },
)
print(response.choices[0].message.content)
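
If no external tracing backend is available, the same hook can point at a plain Python function. The sketch below assumes the four-argument callback form (kwargs, completion_response, start_time, end_time) and the "response_cost" key described in LiteLLM's custom callback documentation; verify both against the version you run. TRACE_LOG and record_success are hypothetical names, and the call at the bottom is simulated so the example runs without an API key.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

TRACE_LOG: List[Dict[str, Any]] = []  # hypothetical in-memory sink for traces


def record_success(kwargs, completion_response, start_time, end_time) -> None:
    """Store the model, latency, and cost of one successful call."""
    TRACE_LOG.append(
        {
            "model": kwargs.get("model"),
            "latency_s": (end_time - start_time).total_seconds(),
            "cost_usd": kwargs.get("response_cost"),  # assumed key, may be absent
        }
    )


# Registration with LiteLLM (assumed usage, not executed here):
#   litellm.success_callback = [record_success]

# Simulated invocation with made-up values, standing in for a real completion.
start = datetime(2026, 1, 1, 12, 0, 0)
record_success(
    {"model": "claude-sonnet-4-6", "response_cost": 0.0012},
    completion_response=None,
    start_time=start,
    end_time=start + timedelta(milliseconds=850),
)
print(TRACE_LOG[0])
```

A function like this is useful in tests or local development, where sending traces to a hosted service is unnecessary.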

When to use LiteLLM, and when not to

Use it when	Skip it when
You use or plan to use more than one LLM provider	You are 100% committed to a single provider long-term
You need provider fallbacks for reliability	You need ultra-low latency and every extra hop matters
You want a single place to enforce rate limits and budgets	Your team already has an internal gateway with these features
You want observability without vendor lock-in	You use a framework (e.g. LangChain) that already wraps providers
You are building a multi-tenant LLM platform	Your use case is a simple script or one-off experiment

LiteLLM is not a model; it is a routing and abstraction layer. Its value compounds as your system grows: more providers, more teams, more models, more need for visibility. For teams building serious LLM infrastructure, it is one of the clearest wins in the stack: low setup cost, immediate portability, and built-in safeguards that would otherwise require significant custom engineering.
