LiteLLM: One Interface for Every LLM Provider
Every major LLM provider (OpenAI, Anthropic, Google, Mistral, Cohere, Azure, AWS Bedrock) ships its own SDK with its own request format, authentication scheme, error types, and streaming protocol. Building a production system that uses more than one means rewriting the same logic for each provider, and switching providers means refactoring your entire inference layer.
LiteLLM solves this with a single, OpenAI-compatible interface that routes to 100+ models across all major providers. You write your code once against the OpenAI chat/completions format, and LiteLLM transparently handles the translation, retries, fallbacks, cost tracking, and observability.
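To make "translation" concrete, here is a minimal sketch of the kind of mapping involved. It is not LiteLLM's actual code. It assumes the Anthropic Messages API convention that the system prompt is a top-level `system` field rather than a message and that `max_tokens` is required; the function name `to_anthropic_request` and the default of 1024 are hypothetical.

```python
from typing import Any


def to_anthropic_request(
    model: str,
    messages: list[dict[str, str]],
    max_tokens: int = 1024,  # hypothetical default; the field itself is required
) -> dict[str, Any]:
    """Convert an OpenAI-style chat request into an Anthropic-style payload."""
    # OpenAI carries system prompts inside the message list; Anthropic expects
    # them in a separate top-level "system" field.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat_turns = [m for m in messages if m["role"] != "system"]

    payload: dict[str, Any] = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": chat_turns,
    }
    if system_parts:
        payload["system"] = "\n".join(system_parts)
    return payload


openai_style = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "Explain transformers in one paragraph."},
]
request = to_anthropic_request("claude-sonnet-4-6", openai_style)
print(request["system"])         # Answer briefly.
print(len(request["messages"]))  # 1
```

A real translation layer also has to map authentication, error types, and streaming chunks, which is the repetitive work a shared interface removes.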
1. Core Usage, Drop-in Replacement
The simplest LiteLLM usage requires almost no setup. Install the package, set your API keys as environment variables, and call litellm.completion() with any model string. The format is identical to the OpenAI SDK; the only thing that changes is the model name.
```python
import litellm

# API keys are picked up from environment variables:
# OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
messages = [{"role": "user", "content": "Explain transformers in one paragraph."}]

# ── OpenAI ───────────────────────────────────────────────────────────────────
response = litellm.completion(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)

# ── Anthropic ────────────────────────────────────────────────────────────────
response = litellm.completion(model="claude-sonnet-4-6", messages=messages)
print(response.choices[0].message.content)

# ── Google Gemini ────────────────────────────────────────────────────────────
response = litellm.completion(model="gemini/gemini-2.0-flash", messages=messages)
print(response.choices[0].message.content)

# ── AWS Bedrock ──────────────────────────────────────────────────────────────
response = litellm.completion(
    model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=messages,
)
print(response.choices[0].message.content)
```

Streaming uses the same call shape through the async API, with stream=True:

```python
import asyncio

import litellm


async def stream_response(model: str, prompt: str):
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        # Some chunks carry no text, hence the fallback to an empty string.
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()


asyncio.run(stream_response("claude-sonnet-4-6", "Write a haiku about RAG pipelines."))
```

2. Fallbacks and Retries
One of LiteLLM's most valuable production features is automatic fallbacks: if a model is unavailable or rate-limited, LiteLLM can transparently retry with an alternative model, without any change to application code. Fallbacks are declared either per call or globally in config.
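Conceptually, this behaviour is a nested loop: retry the current model a few times, then move to the next one. The sketch below illustrates that logic in plain Python and does not reflect LiteLLM's internals. The names `complete_with_fallbacks` and `fake_call` are hypothetical, and retrying each model before falling back is a simplifying assumption.

```python
import time
from typing import Callable, Optional, Sequence


def complete_with_fallbacks(
    call_model: Callable[[str], str],
    models: Sequence[str],
    num_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try each model in order; return (model_used, result) on first success."""
    last_error: Optional[Exception] = None
    for model in models:
        # One initial attempt plus num_retries retries per model.
        for attempt in range(1 + num_retries):
            try:
                return model, call_model(model)
            except Exception as exc:  # real code would catch only retryable errors
                last_error = exc
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError("all models failed") from last_error


# Hypothetical stand-in for a provider call: the first model is always down.
def fake_call(model: str) -> str:
    if model == "gpt-4o":
        raise TimeoutError("simulated outage")
    return f"answer from {model}"


used, answer = complete_with_fallbacks(
    fake_call,
    ["gpt-4o", "claude-sonnet-4-6", "gemini/gemini-2.0-flash"],
    num_retries=1,
)
print(used)  # claude-sonnet-4-6
```

With LiteLLM, the same behaviour is requested through arguments on the call itself: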
```python
import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is RAG?"}],
    # If gpt-4o fails, try these in order:
    fallbacks=["claude-sonnet-4-6", "gemini/gemini-2.0-flash"],
    # Retry on rate-limit or server errors:
    num_retries=3,
    # Hard timeout per attempt in seconds:
    timeout=30,
)
print(response.choices[0].message.content)
print(f"Model actually used: {response.model}")
```

3. Cost and Token Tracking
LiteLLM ships with a built-in cost database covering token prices for all major providers. Every response includes a _hidden_params field with the computed cost, and you can also query it directly. This makes per-request cost attribution trivial without needing a third-party billing tool.
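The arithmetic behind such a cost lookup is simple: tokens multiplied by a per-token price, computed separately for prompt and completion. The sketch below shows it with a made-up price table; the model name `example-model`, its prices, and the helper `estimate_cost` are all hypothetical and are not taken from LiteLLM's database.

```python
from collections import defaultdict

# Hypothetical per-token prices in USD (quoted here per million tokens).
PRICES_PER_TOKEN = {
    "example-model": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> tuple[float, float]:
    """Return (prompt_cost, completion_cost) in USD for one request."""
    price = PRICES_PER_TOKEN[model]
    return prompt_tokens * price["input"], completion_tokens * price["output"]


prompt_cost, completion_cost = estimate_cost("example-model", 1_000, 500)
print(f"Total: ${prompt_cost + completion_cost:.6f}")  # Total: $0.007500

# Per-request attribution is then a matter of summing by whatever key you need.
spend_by_user = defaultdict(float)
spend_by_user["user-123"] += prompt_cost + completion_cost
```

LiteLLM performs the equivalent lookup against its own price table: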
```python
import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise the history of databases."}],
)

# Token usage is always present on the response object
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")

# Cost in USD for this specific call
cost = litellm.completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")

# Or compute cost directly from token counts + model name
prompt_cost, completion_cost = litellm.cost_per_token(
    model="gpt-4o",
    prompt_tokens=usage.prompt_tokens,
    completion_tokens=usage.completion_tokens,
)
print(f"Prompt cost: ${prompt_cost:.6f} | Completion cost: ${completion_cost:.6f}")
```

4. LiteLLM Proxy, The Self-Hosted LLM Gateway
Beyond the Python SDK, LiteLLM ships a self-hosted proxy server that exposes an OpenAI-compatible REST API in front of all your models. Any client that speaks the OpenAI API (LangChain, LlamaIndex, your own HTTP client) can point at your LiteLLM proxy instead of OpenAI directly, with only the base URL and API key changed.
The proxy handles:
- Routing: map virtual model names to real providers in a central config.
- Auth: issue API keys scoped to teams or services and rate-limit them independently.
- Observability: stream logs and traces to Langfuse, Helicone, or any OpenTelemetry backend.
- Budget enforcement: set hard spend limits per key, per team, or per model.
- Load balancing: distribute requests across multiple deployments of the same model.
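Routing and load balancing can be pictured as a lookup table plus a rotation. The following sketch is an illustration only: the class `RoundRobinRouter`, the deployment table, and the `example.invalid` URLs are hypothetical, and round-robin is chosen for simplicity; the proxy's own strategy may differ.

```python
import itertools
from collections import defaultdict

# Hypothetical deployment table: one virtual name may map to several backends.
DEPLOYMENTS = [
    {"model_name": "claude-sonnet", "model": "anthropic/claude-sonnet-4-6"},
    {"model_name": "azure-gpt4", "model": "azure/gpt-4", "api_base": "https://eu.example.invalid"},
    {"model_name": "azure-gpt4", "model": "azure/gpt-4", "api_base": "https://us.example.invalid"},
]


class RoundRobinRouter:
    """Resolve a virtual model name to one of its deployments, in rotation."""

    def __init__(self, deployments):
        grouped = defaultdict(list)
        for deployment in deployments:
            grouped[deployment["model_name"]].append(deployment)
        # One endless iterator per virtual name.
        self._cycles = {name: itertools.cycle(group) for name, group in grouped.items()}

    def pick(self, model_name: str) -> dict:
        if model_name not in self._cycles:
            raise KeyError(f"unknown virtual model: {model_name}")
        return next(self._cycles[model_name])


router = RoundRobinRouter(DEPLOYMENTS)
picks = [router.pick("azure-gpt4")["api_base"] for _ in range(3)]
print(picks)  # alternates between the two Azure deployments
```

In the proxy, the equivalent mapping lives in a YAML config, where repeating a model_name registers several deployments under it: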
```yaml
model_list:
  # Virtual name → real provider model
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: fast-model
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
  # Load balance across two Azure deployments of the same model
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4
      api_base: os.environ/AZURE_API_BASE_1
      api_key: os.environ/AZURE_API_KEY_1
  - model_name: azure-gpt4
    litellm_params:
      model: azure/gpt-4
      api_base: os.environ/AZURE_API_BASE_2
      api_key: os.environ/AZURE_API_KEY_2

litellm_settings:
  drop_params: true               # silently ignore unsupported params per model
  success_callback: ["langfuse"]  # stream all traces to Langfuse

general_settings:
  master_key: sk-my-master-key    # required to create virtual keys
```

```shell
# ── Start the proxy ──────────────────────────────────────────────────────────
pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000

# ── Call it from any OpenAI-compatible client ────────────────────────────────
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my-master-key" \
  -d '{
    "model": "claude-sonnet",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
from openai import OpenAI

# Point the official openai SDK at your LiteLLM proxy
client = OpenAI(
    api_key="sk-my-master-key",
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="claude-sonnet",  # virtual name defined in config.yaml
    messages=[{"role": "user", "content": "What is LiteLLM?"}],
)
print(response.choices[0].message.content)
```

5. Observability, Langfuse Integration
LiteLLM integrates natively with Langfuse for full trace visibility: every request is logged automatically, along with its token count, latency, cost, and the model used. In the Python SDK, setting the Langfuse credentials as environment variables and registering the callbacks is enough to activate it.
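To show what "logged automatically" amounts to, here is a sketch of a hand-written logger that collects the same kind of fields. It assumes the four-argument custom-callback signature described in LiteLLM's documentation (`kwargs, completion_response, start_time, end_time`); the function `log_request`, the `request_log` list, and the stand-in response object are hypothetical.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace

request_log = []


def log_request(kwargs, completion_response, start_time, end_time):
    """Collect the fields an observability backend typically stores per call."""
    usage = completion_response.usage
    request_log.append(
        {
            "model": kwargs.get("model"),
            "latency_s": (end_time - start_time).total_seconds(),
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
        }
    )


# With LiteLLM installed, a function like this could be registered via
#   litellm.success_callback = [log_request]
# (assumed usage). Here it is exercised with a hand-built stand-in response.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=48)
)
started = datetime(2024, 1, 1, 12, 0, 0)
log_request(
    {"model": "claude-sonnet-4-6"},
    fake_response,
    started,
    started + timedelta(seconds=1.5),
)
print(request_log[0]["latency_s"])  # 1.5
```

The Langfuse integration does this collection for you and ships the records to Langfuse: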
```python
import os

import litellm

# Set Langfuse credentials and register the callbacks; no other code changes are needed.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or self-hosted

litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# All calls below are automatically traced in Langfuse
response = litellm.completion(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What are the benefits of RAG?"}],
    metadata={
        "trace_name": "rag-query",
        "tags": ["production", "rag"],
        "user_id": "user-123",
    },
)
print(response.choices[0].message.content)
```

When to use LiteLLM, and when not to
| Use it when | Skip it when |
|---|---|
| You use or plan to use more than one LLM provider | You are 100% committed to a single provider long-term |
| You need provider fallbacks for reliability | You need ultra-low latency and every extra hop matters |
| You want a single place to enforce rate limits and budgets | Your team already has an internal gateway with these features |
| You want observability without vendor lock-in | You use a framework (e.g. LangChain) that already wraps providers |
| You are building a multi-tenant LLM platform | Your use case is a simple script or one-off experiment |
LiteLLM is not a model; it is a routing and abstraction layer. Its value compounds as your system grows: more providers, more teams, more models, more need for visibility. For teams building serious LLM infrastructure, it is one of the clearest wins in the stack: low setup cost, immediate portability, and built-in safeguards that would otherwise require significant custom engineering.