2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -46,7 +46,7 @@ jobs:
enable-cache: true

- name: Install the project
- run: uv sync --extra dev-gpu
+ run: uv sync --extra dev-gpu --extra litellm

- name: Ensure cache directories exist
run: mkdir -p cache/models cache/datasets
1 change: 1 addition & 0 deletions .gitignore
@@ -183,3 +183,4 @@ prod_env
logs
_logs
outputs
.litellm_cache/
2 changes: 1 addition & 1 deletion docs/source/installation.mdx
@@ -38,7 +38,7 @@ Lighteval provides several optional dependency groups that you can install based
|-------|-------------|--------------|
| `vllm` | Use VLLM as backend for high-performance inference | vllm>=0.10.0, ray, more_itertools |
| `tgi` | Use Text Generation Inference API | text-generation>=0.6.0 |
- | `litellm` | Use LiteLLM for unified API access | litellm, diskcache |
+ | `litellm` | Use LiteLLM for unified API access (generative + loglikelihood for completion-capable models) | litellm, diskcache |
| `optimum` | Use Optimum for optimized models | optimum==1.12.0 |
| `quantization` | Evaluate quantized models | bitsandbytes>=0.41.0, auto-gptq>=0.4.2 |
| `adapters` | Evaluate adapter models (PEFT, Delta) | peft==0.3.0 |
1 change: 1 addition & 0 deletions docs/source/package_reference/models.mdx
@@ -38,6 +38,7 @@ set in the `model-args` or in the model yaml file (see example

### Litellm Model
[[autodoc]] models.endpoints.litellm_model.LiteLLMModelConfig
[[autodoc]] models.endpoints.litellm_model.LiteLLMClient

## Custom Model
[[autodoc]] models.custom.custom_model.CustomModelConfig
113 changes: 105 additions & 8 deletions docs/source/use-litellm-as-backend.mdx
@@ -7,6 +7,49 @@ OpenAI, Groq, and many others.
> [!TIP]
> Documentation for available APIs and compatible endpoints can be found [here](https://docs.litellm.ai/docs/).

## Supported Evaluation Modes

The LiteLLM backend supports two evaluation modes depending on the model and provider:

| Mode | Method | Benchmarks | Provider requirement |
|------|--------|-----------|----------------------|
| **Generative** | `greedy_until` | GSM8K, HLE, IFEval, … | Any chat-completion provider |
| **Log-likelihood** | `loglikelihood` / `loglikelihood_rolling` | MMLU, ARC, HellaSwag, … | `/v1/completions` endpoint with `echo=True` |

### Generative evaluation — all providers

Generative benchmarks route through `POST /v1/chat/completions`, so any model
served by a chat-completion provider works.

```bash
lighteval endpoint litellm \
"provider=openai,model_name=gpt-4o" \
gsm8k
```

### Log-likelihood evaluation — completion-capable models only

MCQ benchmarks (MMLU, ARC, HellaSwag, …) and perplexity benchmarks require
`POST /v1/completions` with `echo=True` and `logprobs=1`. **Chat-only models
such as gpt-4o, Claude, or Gemini do not expose this endpoint** and will
produce all-`-inf` results with a warning.

Supported models:
- **OpenAI**: `gpt-3.5-turbo-instruct`
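- **Local servers**: OpenAI-compatible servers that serve `/v1/completions` with `echo=True`, such as llama.cpp and `vllm serve` (Ollama is listed under Supported Providers as generative only)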

```bash
# Run MCQ benchmarks (requires completion endpoint)
lighteval endpoint litellm \
examples/model_configs/litellm_completion_model.yaml \
"mmlu|0" "arc|0" "hellaswag|0"
```
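
As a rough illustration of why `echo=True` matters: an echoed response carries one log-probability per prompt token, and scoring a choice amounts to summing the entries that fall inside the continuation. The sketch below assumes the legacy OpenAI completions `logprobs` layout (parallel `token_logprobs` and `text_offset` lists, with character offsets). The function name is hypothetical and this is not necessarily how lighteval implements it.

```python
from typing import Optional, Sequence


def continuation_logprob(
    token_logprobs: Sequence[Optional[float]],
    text_offset: Sequence[int],
    context: str,
    continuation: str,
) -> float:
    """Sum the log-probabilities of tokens that start inside `continuation`.

    `token_logprobs` and `text_offset` are parallel lists describing the echoed
    prompt (context + continuation) plus any generated tokens.
    """
    start = len(context)
    end = start + len(continuation)
    total = 0.0
    scored = False
    for logprob, offset in zip(token_logprobs, text_offset):
        if not (start <= offset < end):
            continue  # context token, or a token generated after the prompt
        if logprob is None:
            return float("-inf")  # the API returned no score for this token
        total += logprob
        scored = True
    # No scored token at all (e.g. a chat-only endpoint): report -inf.
    return total if scored else float("-inf")
```

A token that begins in the context and spills over into the continuation is ignored here; a real implementation would need to decide how to handle that boundary case.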

> [!WARNING]
> Lighteval detects chat-only models via `litellm.get_model_info()` and emits a
> WARNING before the evaluation starts. The run still proceeds, but every choice
> scores `-inf`, so metrics sit at chance level.
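
The check could look roughly like the sketch below. It assumes the mapping returned by `litellm.get_model_info()` carries a `mode` key whose value is `"chat"` for chat-only models. The helper name `warn_if_chat_only` is hypothetical, and the metadata is passed in as a plain mapping so the example runs without LiteLLM installed.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger(__name__)


def warn_if_chat_only(model_name: str, model_info: Optional[Mapping[str, Any]]) -> bool:
    """Return True and log a warning when the metadata marks the model as chat-only.

    `model_info` is None when no metadata is available, for example for a local
    server that the registry does not know about. Nothing is concluded then.
    """
    if model_info is None:
        return False
    if model_info.get("mode") == "chat":  # assumed key and value
        logger.warning(
            "%s looks chat-only; loglikelihood requests will return -inf.",
            model_name,
        )
        return True
    return False
```

Treating unknown models as "no warning" keeps local OpenAI-compatible servers usable, at the cost of missing chat-only models that have no metadata entry.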

## Basic Usage

```bash
@@ -18,9 +61,9 @@ lighteval endpoint litellm \
## Using a Configuration File

LiteLLM allows generation with any OpenAI-compatible endpoint. For example, you
- can evaluate a model running on a local VLLM server.
+ can evaluate a model running on a local vLLM server.

- To do so, you will need to use a configuration file like this:
+ **Generative tasks** (`examples/model_configs/litellm_model.yaml`):

```yaml
model_parameters:
@@ -37,25 +80,39 @@ model_parameters:
frequency_penalty: 0.0
```

**Log-likelihood / MCQ tasks** (`examples/model_configs/litellm_completion_model.yaml`):

```yaml
model_parameters:
model_name: "gpt-3.5-turbo-instruct"
provider: "openai"
concurrent_requests: 10
generation_parameters:
seed: 42
```

## Supported Providers

LiteLLM supports a wide range of LLM providers:

### Cloud Providers

- all cloud providers can be found in the [litellm documentation](https://docs.litellm.ai/docs/providers).
+ All cloud providers can be found in the [litellm documentation](https://docs.litellm.ai/docs/providers).

### Local/On-Premise
- - **VLLM**: Local VLLM servers
- - **Hugging Face**: Local Hugging Face models
+ - **vLLM**: Local vLLM servers (supports both generative and log-likelihood)
+ - **llama.cpp**: OpenAI-compatible server (supports both generative and log-likelihood)
+ - **Ollama**: OpenAI-compatible endpoint (generative only)
- **Custom endpoints**: Any OpenAI-compatible API

## Using with Local Models

- ### VLLM Server
- To use with a local VLLM server:
+ ### vLLM Server (generative + log-likelihood)
+
+ Local vLLM servers expose both `/v1/chat/completions` and `/v1/completions`, so
+ they support all evaluation modes.

- 1. Start your VLLM server:
+ 1. Start your vLLM server:
```bash
vllm serve HuggingFaceH4/zephyr-7b-beta --host 0.0.0.0 --port 8000
```
@@ -67,6 +124,46 @@ model_parameters:
model_name: "hosted_vllm/HuggingFaceH4/zephyr-7b-beta"
base_url: "http://localhost:8000/v1"
api_key: ""
generation_parameters:
seed: 42
```

3. Run any benchmark:
```bash
# Generative
lighteval endpoint litellm my_config.yaml "gsm8k|0"

# MCQ (log-likelihood)
lighteval endpoint litellm my_config.yaml "mmlu|0" "arc|0"
```

### llama.cpp Server (generative + log-likelihood)

```bash
./llama-server -m model.gguf --port 8080
```

```yaml
model_parameters:
model_name: "openai/local"
base_url: "http://localhost:8080/v1"
api_key: "none"
```

## Generation Parameters

Parameters in `generation_parameters` are forwarded to the underlying API call
when the target endpoint accepts them. In log-likelihood mode some values are
fixed or dropped:

| Parameter | Generative (`/chat/completions`) | Log-likelihood (`/completions`) |
|-----------|----------------------------------|----------------------------------|
| `seed` | ✅ | ✅ |
| `temperature` | ✅ | hardcoded `0.0` (deterministic scoring) |
| `max_new_tokens` | ✅ | hardcoded `1` (only logprobs needed) |
| `stop_tokens` | ✅ | ✅ |
| `top_p` | ✅ | ✅ |
| `frequency_penalty` | ✅ | ✅ |
| `presence_penalty` | ✅ | ✅ |
| `repetition_penalty` | ✅ | ❌ (not part of `/v1/completions`) |
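
The table can be read as a small mapping function. The sketch below is one possible encoding of it, not lighteval's code: the helper name `build_request_kwargs` is hypothetical, and renaming `stop_tokens` to `stop` and `max_new_tokens` to `max_tokens` is an assumption based on the OpenAI-style request fields.

```python
from typing import Any, Dict

# Parameters the table marks as forwarded unchanged in both modes.
SHARED = ("seed", "top_p", "frequency_penalty", "presence_penalty")


def build_request_kwargs(generation_parameters: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Translate `generation_parameters` into request kwargs for one mode."""
    # Unset values (None) are never sent to the API.
    params = {k: v for k, v in generation_parameters.items() if v is not None}
    kwargs: Dict[str, Any] = {name: params[name] for name in SHARED if name in params}
    if "stop_tokens" in params:
        kwargs["stop"] = params["stop_tokens"]  # assumed field name

    if mode == "generative":
        for name in ("temperature", "repetition_penalty"):
            if name in params:
                kwargs[name] = params[name]
        if "max_new_tokens" in params:
            kwargs["max_tokens"] = params["max_new_tokens"]  # assumed field name
    elif mode == "loglikelihood":
        # Fixed values from the table, plus the echo/logprobs flags scoring needs.
        # repetition_penalty is dropped: it is not part of /v1/completions.
        kwargs.update(temperature=0.0, max_tokens=1, echo=True, logprobs=1)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return kwargs
```

For example, `build_request_kwargs({"seed": 42, "temperature": 0.7}, "loglikelihood")` keeps the seed but replaces the temperature with `0.0`.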

For more detailed error handling and debugging, refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/).
26 changes: 26 additions & 0 deletions examples/model_configs/litellm_completion_model.yaml
@@ -0,0 +1,26 @@
# LiteLLM configuration for a model that supports loglikelihood evaluation.
#
# loglikelihood() and loglikelihood_rolling() require the /v1/completions
# endpoint with echo=True and logprobs=1. Only "completion-style" models
# expose this endpoint. Examples:
# - OpenAI: gpt-3.5-turbo-instruct
# - Local (llama.cpp): openai/local-model (with base_url pointing to server)
# - Local (vLLM serve): openai/local-model (with base_url pointing to server)
#
# Chat-only models (gpt-4o, claude-*, gemini-*) do NOT support this endpoint
# and will produce -inf loglikelihoods. Use the standard litellm_model.yaml
# for generative (greedy_until) evaluations with those models.
#
# Usage:
# lighteval endpoint litellm examples/model_configs/litellm_completion_model.yaml \
# "mmlu|0" "arc|0" "hellaswag|0"

model_parameters:
model_name: "gpt-3.5-turbo-instruct"
provider: "openai"
concurrent_requests: 10
api_max_retry: 8
api_retry_sleep: 1.0
api_retry_multiplier: 2.0
generation_parameters:
seed: 42
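
The three `api_retry_*` fields above suggest an exponential backoff between attempts. How lighteval combines them is not shown in this diff, so the sketch below assumes the simplest reading: the first wait is `api_retry_sleep` seconds and each later wait is multiplied by `api_retry_multiplier`. The function name is hypothetical.

```python
from typing import List


def retry_delays(
    api_max_retry: int,
    api_retry_sleep: float,
    api_retry_multiplier: float,
) -> List[float]:
    """Return the wait (in seconds) before each retry under exponential backoff."""
    delays: List[float] = []
    delay = api_retry_sleep
    for _ in range(api_max_retry):
        delays.append(delay)
        delay *= api_retry_multiplier  # grow the wait for the next attempt
    return delays
```

Under that assumption, the values in this file (8 retries, 1.0 s initial sleep, multiplier 2.0) would wait 1, 2, 4, ... 128 seconds, for a worst case of 255 seconds per request.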
28 changes: 23 additions & 5 deletions src/lighteval/models/endpoints/inference_providers_model.py
@@ -258,12 +258,30 @@ def max_length(self) -> int:

     @cached(SamplingMethod.LOGPROBS)
     def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
-        """Tokenize the context and continuation and compute the log likelihood of those
-        tokenized sequences.
+        """Not supported for HuggingFace Inference Providers.
+
+        The HF Inference Providers API exposes only ``/v1/chat/completions``.
+        That endpoint does not support ``echo=True`` or per-prompt token
+        log-probabilities, which are required for loglikelihood evaluation
+        (MCQ benchmarks such as MMLU, ARC, HellaSwag).
+
+        Use the LiteLLM backend (``lighteval endpoint litellm``) with a
+        model that supports the ``/v1/completions`` endpoint — for example
+        ``gpt-3.5-turbo-instruct`` or any OpenAI-compatible local server —
+        to run loglikelihood evaluations over a remote API.
         """
-        raise NotImplementedError
+        raise NotImplementedError(
+            "loglikelihood is not supported for the HuggingFace Inference Providers backend. "
+            "The provider API exposes only /v1/chat/completions, which does not return "
+            "per-prompt token log-probabilities. "
+            "Use `lighteval endpoint litellm` with a completion-capable model instead "
+            "(e.g. gpt-3.5-turbo-instruct or a local OpenAI-compatible server)."
+        )

     @cached(SamplingMethod.PERPLEXITY)
     def loglikelihood_rolling(self, docs: list[Doc]) -> list[ModelResponse]:
-        """This function is used to compute the log likelihood of the context for perplexity metrics."""
-        raise NotImplementedError
+        """Not supported for HuggingFace Inference Providers — see ``loglikelihood`` for details."""
+        raise NotImplementedError(
+            "loglikelihood_rolling is not supported for the HuggingFace Inference Providers backend. "
+            "See loglikelihood() for the full explanation."
+        )