Proxy - Fallbacks, Retries
- Quick Start load balancing
- Quick Start client side fallbacks
Quick Start - Load Balancing​
Step 1 - Set deployments on config​
Example config below. Here, requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo.
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/<your-deployment-name>
      api_base: <your-azure-endpoint>
      api_key: <your-azure-api-key>
      rpm: 6 # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key: <your-azure-api-key>
      rpm: 1440

router_settings:
  routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing"], default="simple-shuffle"
  model_group_alias: {"gpt-4": "gpt-3.5-turbo"} # all requests with `gpt-4` will be routed to models with `gpt-3.5-turbo`
  num_retries: 2
  timeout: 30 # 30 seconds
  redis_host: <your redis host> # set this when using multiple litellm proxy deployments, load balancing state stored in redis
  redis_password: <your redis password>
  redis_port: 1992
Detailed information about routing strategies can be found here
Step 2: Start Proxy with config​
$ litellm --config /path/to/config.yaml
Test - Simple Call​
Here requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo
👉 Key Change: model="gpt-3.5-turbo"
Check the model_id in the response headers to make sure the requests are being load balanced (see the header-check sketch after the examples below).
- OpenAI Python v1.0.0+
- Curl Request
- Langchain
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
]
)
print(response)
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}'
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
ChatPromptTemplate,
HumanMessagePromptTemplate,
SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
import os
os.environ["OPENAI_API_KEY"] = "anything"
chat = ChatOpenAI(
openai_api_base="http://0.0.0.0:4000",
model="gpt-3.5-turbo",
)
messages = [
SystemMessage(
content="You are a helpful assistant that im using to make a test request to."
),
HumanMessage(
content="test from litellm. tell me why it's amazing in 1 sentence"
),
]
response = chat(messages)
print(response)
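To confirm the load balancing from Python, you can read the x-litellm-model-id response header returned by the proxy (the same header used later to validate fallbacks). A minimal sketch using the OpenAI SDK's with_raw_response interface; repeated calls should surface the IDs of different deployments:

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# with_raw_response exposes the proxy's HTTP response headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "this is a test request, write a short poem"}]
)

print(raw.headers.get("x-litellm-model-id"))   # deployment that served this request
print(raw.parse().choices[0].message.content)  # the parsed chat completion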
Test - Loadbalancing​
In this request, the following will occur:
- A rate limit exception will be raised
- LiteLLM proxy will retry the request on other deployments in the model group (default num_retries is 3).
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Hi there!"}
],
"mock_testing_rate_limit_error": true
}'
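The same mock rate-limit test can be run from the OpenAI SDK by passing the flag through extra_body; a minimal sketch, assuming the proxy from the config above is running on port 4000:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

# mock a 429 on the first deployment; the proxy retries on the other
# deployments in the gpt-3.5-turbo model group
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi there!"}],
    extra_body={"mock_testing_rate_limit_error": True}
)
print(response)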
Test - Client Side Fallbacks​
In this request, the following will occur:
- The request to model="zephyr-beta" will fail; the LiteLLM proxy will loop through all the model groups specified in fallbacks=["gpt-3.5-turbo"].
- The request to model="gpt-3.5-turbo" will succeed, and the client making the request will get a response from gpt-3.5-turbo.
👉 Key Change: "fallbacks": ["gpt-3.5-turbo"]
- OpenAI Python v1.0.0+
- Curl Request
- Langchain
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="zephyr-beta",
messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
],
extra_body={
"fallbacks": ["gpt-3.5-turbo"]
}
)
print(response)
Pass fallbacks as part of the request body.
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "zephyr-beta"",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
],
"fallbacks": ["gpt-3.5-turbo"]
}'
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
ChatPromptTemplate,
HumanMessagePromptTemplate,
SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
import os
os.environ["OPENAI_API_KEY"] = "anything"
chat = ChatOpenAI(
openai_api_base="http://0.0.0.0:4000",
model="zephyr-beta",
extra_body={
"fallbacks": ["gpt-3.5-turbo"]
}
)
messages = [
SystemMessage(
content="You are a helpful assistant that im using to make a test request to."
),
HumanMessage(
content="test from litellm. tell me why it's amazing in 1 sentence"
),
]
response = chat(messages)
print(response)
Advanced​
Fallbacks + Retries + Timeouts + Cooldowns​
To set fallbacks, just do:
litellm_settings:
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}]
Covers all errors (429, 500, etc.)
Set via config
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: <my-openai-key>
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: <my-openai-key>

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise Timeout error if call takes longer than 10s. Sets litellm.request_timeout
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails num_retries
  allowed_fails: 3 # cooldown the model if it fails > 3 calls in a minute
  cooldown_time: 30 # how long (in seconds) to cooldown a model if fails/min > allowed_fails
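To exercise this config end-to-end without waiting for real failures, you can force the fallback path with the mock_testing_fallbacks flag used in the test sections below; a minimal sketch:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

# simulate failures on zephyr-beta so the proxy falls back to gpt-3.5-turbo
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"mock_testing_fallbacks": True}
)
print(response)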
Fallback to Specific Model ID​
If all models in a group are in cooldown (e.g. rate limited), LiteLLM will fall back to the model with the specific model ID.
This skips any cooldown check for the fallback model.
- Specify the model ID in model_info
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
    model_info:
      id: my-specific-model-id # 👈 KEY CHANGE
  - model_name: gpt-4
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
  - model_name: anthropic-claude
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY
Note: This will only fall back to the model with the specific model ID. If you want to fall back to another model group, you can set fallbacks=[{"gpt-4": ["anthropic-claude"]}]
- Set fallbacks in config
litellm_settings:
  fallbacks: [{"gpt-4": ["my-specific-model-id"]}]
- Test it!
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gpt-4",
"messages": [
{
"role": "user",
"content": "ping"
}
],
"mock_testing_fallbacks": true
}'
Validate it works by checking the response header x-litellm-model-id
x-litellm-model-id: my-specific-model-id
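A minimal sketch of that header check from Python, assuming the proxy with the config above is running on port 4000:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

raw = client.chat.completions.with_raw_response.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"mock_testing_fallbacks": True}
)
# expect the specific deployment id configured in model_info above
assert raw.headers.get("x-litellm-model-id") == "my-specific-model-id"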
Test Fallbacks!​
Check if your fallbacks are working as expected.
Regular Fallbacks​
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "my-bad-model",
"messages": [
{
"role": "user",
"content": "ping"
}
],
"mock_testing_fallbacks": true # 👈 KEY CHANGE
}
'
Content Policy Fallbacks​
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "my-bad-model",
"messages": [
{
"role": "user",
"content": "ping"
}
],
"mock_testing_content_policy_fallbacks": true # 👈 KEY CHANGE
}
'
Context Window Fallbacks​
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "my-bad-model",
"messages": [
{
"role": "user",
"content": "ping"
}
],
"mock_testing_context_window_fallbacks": true # 👈 KEY CHANGE
}
'
Context Window Fallbacks (Pre-Call Checks + Fallbacks)​
Before a call is made, check that the request fits within the model's context window by setting enable_pre_call_checks: true.
1. Setup config
For Azure deployments, set the base model. Pick the base model from this list; all the Azure models start with azure/.
- Same Group
- Context Window Fallbacks (Different Groups)
Filter older instances of a model (e.g. gpt-3.5-turbo) with smaller context windows
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (azure-only) SET BASE MODEL
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
2. Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
3. Test it!
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
text = "What is the meaning of 42?" * 5000
# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages = [
{"role": "system", "content": text},
{"role": "user", "content": "Who was Alexander?"},
],
)
print(response)
Fall back to larger models if the current model's context window is too small.
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview # 2. 👈 (azure-only) SET BASE MODEL
  - model_name: gpt-3.5-turbo-large
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  context_window_fallbacks: [{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large", "claude-opus"]}]
2. Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
3. Test it!
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
text = "What is the meaning of 42?" * 5000
# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages = [
{"role": "system", "content": text},
{"role": "user", "content": "Who was Alexander?"},
],
)
print(response)
Content Policy Fallbacks​
Fallback across providers (e.g. from Azure OpenAI to Anthropic) if you hit content policy violation errors.
model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  content_policy_fallbacks: [{"gpt-3.5-turbo-small": ["claude-opus"]}]
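To verify the cross-provider fallback without triggering a real content policy violation, you can reuse the mock flag from the test section above; a minimal sketch:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

# simulate a content policy error on the Azure deployment;
# the proxy should retry the request on claude-opus
response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"mock_testing_content_policy_fallbacks": True}
)
print(response)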
Default Fallbacks​
You can also set default_fallbacks, in case a specific model group is misconfigured or unavailable.
model_list:
  - model_name: gpt-3.5-turbo-small
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
  - model_name: claude-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  default_fallbacks: ["claude-opus"]
This will default to claude-opus in case any model fails.
Model-specific fallbacks (e.g. {"gpt-3.5-turbo-small": ["claude-opus"]}) override the default fallbacks.
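A quick way to see the default fallback kick in is to force a failure with the mock_testing_fallbacks flag; a minimal sketch:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

# gpt-3.5-turbo-small has no model-specific fallbacks configured here,
# so a (mocked) failure should route to the default fallback: claude-opus
response = client.chat.completions.create(
    model="gpt-3.5-turbo-small",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"mock_testing_fallbacks": True}
)
print(response)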
EU-Region Filtering (Pre-Call Checks)​
Before a call is made, check that the deployment is in an allowed region by setting enable_pre_call_checks: true.
Set the region_name of your deployment.
Note: LiteLLM can automatically infer region_name for Vertex AI, Bedrock, and IBM WatsonxAI based on your litellm params. For Azure, set litellm.enable_preview = True.
1. Set Config
router_settings:
  enable_pre_call_checks: true # 1. Enable pre-call checks

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
      region_name: "eu" # 👈 SET EU-REGION
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo-1106
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gemini-pro
    litellm_params:
      model: vertex_ai/gemini-pro-1.5
      vertex_project: adroit-crow-1234
      vertex_location: us-east1 # 👈 AUTOMATICALLY INFERS 'region_name'
2. Start proxy
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
3. Test it!
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.with_raw_response.create(
model="gpt-3.5-turbo",
messages = [{"role": "user", "content": "Who was Alexander?"}]
)
print(response)
print(f"response.headers.get('x-litellm-model-api-base')")
Custom Timeouts, Stream Timeouts - Per Model​
For each model you can set timeout & stream_timeout under litellm_params.
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1 # timeout in seconds
      stream_timeout: 0.01 # timeout for stream requests (seconds)
      max_retries: 5
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-key>
      timeout: 0.1 # timeout in seconds
      stream_timeout: 0.01 # timeout for stream requests (seconds)
      max_retries: 5
Start Proxy​
$ litellm --config /path/to/config.yaml
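With timeout: 0.1 most completions will not finish in time, so the call should surface a timeout error. A minimal sketch using the OpenAI SDK; catching the SDK's generic openai.APIError here is an assumption about how the proxy's timeout surfaces client-side:

import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "write a short poem"}]
    )
    print(response)
except openai.APIError as e:
    # with timeout: 0.1 the upstream call is expected to time out
    print(f"request failed: {e}")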
Setting Dynamic Timeouts - Per Request​
LiteLLM Proxy supports setting a timeout per request.
Example Usage
- Curl Request
- OpenAI v1.0.0+
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "what color is red"}
],
"logit_bias": {12481: 100},
"timeout": 1
}'
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "what color is red"}
],
logit_bias={12481: 100},
timeout=1
)
print(response)
Setting Fallbacks for Wildcard Models​
You can set fallbacks for wildcard models (e.g. azure/*) in your config file.
- Setup config
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "azure/*"
    litellm_params:
      model: "azure/*"
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE

litellm_settings:
  fallbacks: [{"gpt-4o": ["azure/gpt-4o"]}]
- Start Proxy
litellm --config /path/to/config.yaml
- Test it!
curl -L -X POST 'http://0.0.0.0:4000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "what color is red"
}
]
}
],
"max_tokens": 300,
"mock_testing_fallbacks": true
}'
Disable Fallbacks per key​
You can disable fallbacks per key by setting disable_fallbacks: true in your key metadata.
curl -L -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
"metadata": {
"disable_fallbacks": true
}
}'
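The /key/generate response includes the new virtual key; requests made with that key skip fallback logic even if fallbacks are set in the config or in the request body. A minimal sketch (the key value below is hypothetical):

import openai

# use the key returned by /key/generate (hypothetical value shown here)
client = openai.OpenAI(
    api_key="sk-generated-key",
    base_url="http://0.0.0.0:4000"
)

# even though "fallbacks" is passed, this key has disable_fallbacks: true,
# so the proxy will not fall back if zephyr-beta fails
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={"fallbacks": ["gpt-3.5-turbo"]}
)
print(response)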