huggingface · ALI-AL-MARJANI · May 21, 2026 · May 21, 2026
diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml
@@ -46,7 +46,7 @@ jobs:
           enable-cache: true
 
       - name: Install the project
-        run: uv sync --extra dev-gpu
+        run: uv sync --extra dev-gpu --extra litellm
 
       - name: Ensure cache directories exist
         run: mkdir -p cache/models cache/datasets

diff --git a/.gitignore b/.gitignore
@@ -183,3 +183,4 @@ prod_env
 logs
 _logs
 outputs
+.litellm_cache/
diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
@@ -38,7 +38,7 @@ Lighteval provides several optional dependency groups that you can install based
 |-------|-------------|--------------|
 | `vllm` | Use VLLM as backend for high-performance inference | vllm>=0.10.0, ray, more_itertools |
 | `tgi` | Use Text Generation Inference API | text-generation>=0.6.0 |
-| `litellm` | Use LiteLLM for unified API access | litellm, diskcache |
+| `litellm` | Use LiteLLM for unified API access (generative + loglikelihood for completion-capable models) | litellm, diskcache |
 | `optimum` | Use Optimum for optimized models | optimum==1.12.0 |
 | `quantization` | Evaluate quantized models | bitsandbytes>=0.41.0, auto-gptq>=0.4.2 |
 | `adapters` | Evaluate adapter models (PEFT, Delta) | peft==0.3.0 |

diff --git a/docs/source/package_reference/models.mdx b/docs/source/package_reference/models.mdx
@@ -38,6 +38,7 @@ set in the `model-args` or in the model yaml file (see example
 
 ### Litellm Model
 [[autodoc]] models.endpoints.litellm_model.LiteLLMModelConfig
+[[autodoc]] models.endpoints.litellm_model.LiteLLMClient
 
 ## Custom Model
 [[autodoc]] models.custom.custom_model.CustomModelConfig
diff --git a/docs/source/use-litellm-as-backend.mdx b/docs/source/use-litellm-as-backend.mdx
@@ -7,6 +7,49 @@ OpenAI, Groq, and many others.
 > [!TIP]
 > Documentation for available APIs and compatible endpoints can be found [here](https://docs.litellm.ai/docs/).
 
+## Supported Evaluation Modes
+
+The LiteLLM backend supports two evaluation modes depending on the model and provider:
+
+| Mode | Method | Benchmarks | Provider requirement |
+|------|--------|-----------|----------------------|
+| **Generative** | `greedy_until` | GSM8K, HLE, IFEval, … | Any chat-completion provider |
+| **Log-likelihood** | `loglikelihood` / `loglikelihood_rolling` | MMLU, ARC, HellaSwag, … | `/v1/completions` endpoint with `echo=True` |
+
+### Generative evaluation — all providers
+
+Generative benchmarks route through `POST /v1/chat/completions`, so any model
+served by a chat-completion provider works.
+
+```bash
+lighteval endpoint litellm \
+    "provider=openai,model_name=gpt-4o" \
+    gsm8k
+```
+
+### Log-likelihood evaluation — completion-capable models only
+
+MCQ benchmarks (MMLU, ARC, HellaSwag, …) and perplexity benchmarks require
+`POST /v1/completions` with `echo=True` and `logprobs=1`. **Chat-only models
+such as gpt-4o, Claude, or Gemini do not expose this endpoint**; they emit a
+warning and return `-inf` for every log-likelihood.
+
+Supported models:
+- **OpenAI**: `gpt-3.5-turbo-instruct`
+- **Local servers**: OpenAI-compatible servers that implement `/v1/completions` with `echo`, e.g. llama.cpp or `vllm serve` (Ollama is listed as generative only under Supported Providers)
+
+```bash
+# Run MCQ benchmarks (requires completion endpoint)
+lighteval endpoint litellm \
+    examples/model_configs/litellm_completion_model.yaml \
+    "mmlu|0" "arc|0" "hellaswag|0"
+```
+
+> [!WARNING]
+> Lighteval automatically detects unsupported chat-only models via
+> `litellm.get_model_info()` and emits a WARNING before the evaluation starts.
+> The run still proceeds: every choice scores `-inf`, so metrics fall to chance level.
+
 ## Basic Usage
 
 ```bash
@@ -18,9 +61,9 @@ lighteval endpoint litellm \
 ## Using a Configuration File
 
 LiteLLM allows generation with any OpenAI-compatible endpoint. For example, you
-can evaluate a model running on a local VLLM server.
+can evaluate a model running on a local vLLM server.
 
-To do so, you will need to use a configuration file like this:
+**Generative tasks** (`examples/model_configs/litellm_model.yaml`):
 
 ```yaml
 model_parameters:
@@ -37,25 +80,39 @@ model_parameters:
       frequency_penalty: 0.0
 ```
 
+**Log-likelihood / MCQ tasks** (`examples/model_configs/litellm_completion_model.yaml`):
+
+```yaml
+model_parameters:
+    model_name: "gpt-3.5-turbo-instruct"
+    provider: "openai"
+    concurrent_requests: 10
+    generation_parameters:
+      seed: 42
+```
+
 ## Supported Providers
 
 LiteLLM supports a wide range of LLM providers:
 
 ### Cloud Providers
 
-all cloud providers can be found in the [litellm documentation](https://docs.litellm.ai/docs/providers).
+All cloud providers can be found in the [litellm documentation](https://docs.litellm.ai/docs/providers).
 
 ### Local/On-Premise
-- **VLLM**: Local VLLM servers
-- **Hugging Face**: Local Hugging Face models
+- **vLLM**: Local vLLM servers (supports both generative and log-likelihood)
+- **llama.cpp**: OpenAI-compatible server (supports both generative and log-likelihood)
+- **Ollama**: OpenAI-compatible endpoint (generative only)
 - **Custom endpoints**: Any OpenAI-compatible API
 
 ## Using with Local Models
 
-### VLLM Server
-To use with a local VLLM server:
+### vLLM Server (generative + log-likelihood)
+
+Local vLLM servers expose both `/v1/chat/completions` and `/v1/completions`, so
+they support all evaluation modes.
 
-1. Start your VLLM server:
+1. Start your vLLM server:
 ```bash
 vllm serve HuggingFaceH4/zephyr-7b-beta --host 0.0.0.0 --port 8000
 ```
@@ -67,6 +124,46 @@ model_parameters:
     model_name: "hosted_vllm/HuggingFaceH4/zephyr-7b-beta"
     base_url: "http://localhost:8000/v1"
     api_key: ""
+    generation_parameters:
+      seed: 42
+```
+
+3. Run any benchmark:
+```bash
+# Generative
+lighteval endpoint litellm my_config.yaml "gsm8k|0"
+
+# MCQ (log-likelihood)
+lighteval endpoint litellm my_config.yaml "mmlu|0" "arc|0"
 ```
 
+### llama.cpp Server (generative + log-likelihood)
+
+```bash
+./llama-server -m model.gguf --port 8080
+```
+
+```yaml
+model_parameters:
+    model_name: "openai/local"
+    base_url: "http://localhost:8080/v1"
+    api_key: "none"
+```
+
+## Generation Parameters
+
+Parameters in `generation_parameters` are forwarded to the underlying API call,
+except where log-likelihood requests override or drop them:
+
+| Parameter | Generative (`/chat/completions`) | Log-likelihood (`/completions`) |
+|-----------|----------------------------------|----------------------------------|
+| `seed` | ✅ | ✅ |
+| `temperature` | ✅ | hardcoded `0.0` (deterministic scoring) |
+| `max_new_tokens` | ✅ | hardcoded `1` (only logprobs needed) |
+| `stop_tokens` | ✅ | ✅ |
+| `top_p` | ✅ | ✅ |
+| `frequency_penalty` | ✅ | ✅ |
+| `presence_penalty` | ✅ | ✅ |
+| `repetition_penalty` | ✅ | ❌ (not part of `/v1/completions`) |
+
 For more detailed error handling and debugging, refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/).
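
Review note: the parameter table in the hunk above can be read as a small mapping function. The sketch below is an illustration only, not the code in this PR; `completion_scoring_kwargs` and `PASSTHROUGH` are hypothetical names, and the `stop` / `max_tokens` keys assume the usual OpenAI-style `/v1/completions` request fields.

```python
from typing import Any

# Hypothetical helper (not lighteval's actual code): turns user-supplied
# generation_parameters into keyword arguments for a scoring request.
PASSTHROUGH = ("seed", "top_p", "frequency_penalty", "presence_penalty")


def completion_scoring_kwargs(generation_parameters: dict[str, Any]) -> dict[str, Any]:
    # Forward only the parameters the table marks as supported, skipping unset ones.
    kwargs = {
        key: generation_parameters[key]
        for key in PASSTHROUGH
        if generation_parameters.get(key) is not None
    }
    # stop_tokens is assumed to map onto the OpenAI-style "stop" field.
    if generation_parameters.get("stop_tokens"):
        kwargs["stop"] = generation_parameters["stop_tokens"]
    # Overrides from the table: deterministic scoring, one generated token,
    # and echoed prompt tokens with their log-probabilities.
    kwargs.update(temperature=0.0, max_tokens=1, echo=True, logprobs=1)
    # repetition_penalty is never copied: it is not part of /v1/completions.
    return kwargs


out = completion_scoring_kwargs(
    {
        "seed": 42,
        "temperature": 0.7,
        "max_new_tokens": 256,
        "repetition_penalty": 1.1,
        "top_p": None,
        "stop_tokens": ["\n"],
    }
)
print(out)
```

User-supplied `temperature` and `max_new_tokens` are ignored here on purpose, which matches the "hardcoded" cells in the table.
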
diff --git a/examples/model_configs/litellm_completion_model.yaml b/examples/model_configs/litellm_completion_model.yaml
@@ -0,0 +1,26 @@
+# LiteLLM configuration for a model that supports loglikelihood evaluation.
+#
+# loglikelihood() and loglikelihood_rolling() require the /v1/completions
+# endpoint with echo=True and logprobs=1.  Only "completion-style" models
+# expose this endpoint.  Examples:
+#   - OpenAI:               gpt-3.5-turbo-instruct
+#   - Local (llama.cpp):    openai/local-model  (with base_url pointing to server)
+#   - Local (vLLM serve):   openai/local-model  (with base_url pointing to server)
+#
+# Chat-only models (gpt-4o, claude-*, gemini-*) do NOT support this endpoint
+# and will produce -inf loglikelihoods.  Use the standard litellm_model.yaml
+# for generative (greedy_until) evaluations with those models.
+#
+# Usage:
+#   lighteval endpoint litellm examples/model_configs/litellm_completion_model.yaml \
+#     "mmlu|0" "arc|0" "hellaswag|0"
+
+model_parameters:
+  model_name: "gpt-3.5-turbo-instruct"
+  provider: "openai"
+  concurrent_requests: 10
+  api_max_retry: 8
+  api_retry_sleep: 1.0
+  api_retry_multiplier: 2.0
+  generation_parameters:
+    seed: 42
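
Review note: the diff does not show how `api_max_retry`, `api_retry_sleep`, and `api_retry_multiplier` combine. Assuming plain, uncapped exponential backoff (delay before retry n equals `api_retry_sleep * api_retry_multiplier**n`), the values in this config would give the schedule sketched below. `retry_delays` is a hypothetical helper, and the formula is an assumption rather than the client's confirmed behaviour.

```python
def retry_delays(max_retry: int, sleep: float, multiplier: float) -> list[float]:
    """Delay in seconds before each retry, assuming sleep * multiplier**attempt."""
    return [sleep * multiplier**attempt for attempt in range(max_retry)]


# Values taken from litellm_completion_model.yaml in this PR.
delays = retry_delays(max_retry=8, sleep=1.0, multiplier=2.0)
total_wait = sum(delays)
print(delays, total_wait)
```

Under that assumption a request that fails every attempt waits a little over four minutes in total; if the client caps individual delays, the total is lower.
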
diff --git a/src/lighteval/models/endpoints/inference_providers_model.py b/src/lighteval/models/endpoints/inference_providers_model.py
@@ -258,12 +258,30 @@ def max_length(self) -> int:
 
     @cached(SamplingMethod.LOGPROBS)
     def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
-        """Tokenize the context and continuation and compute the log likelihood of those
-        tokenized sequences.
+        """Not supported for HuggingFace Inference Providers.
+
+        The HF Inference Providers API exposes only ``/v1/chat/completions``.
+        That endpoint does not support ``echo=True`` or per-prompt token
+        log-probabilities, which are required for loglikelihood evaluation
+        (MCQ benchmarks such as MMLU, ARC, HellaSwag).
+
+        Use the LiteLLM backend (``lighteval endpoint litellm``) with a
+        model that supports the ``/v1/completions`` endpoint — for example
+        ``gpt-3.5-turbo-instruct`` or any OpenAI-compatible local server —
+        to run loglikelihood evaluations over a remote API.
         """
-        raise NotImplementedError
+        raise NotImplementedError(
+            "loglikelihood is not supported for the HuggingFace Inference Providers backend. "
+            "The provider API exposes only /v1/chat/completions, which does not return "
+            "per-prompt token log-probabilities. "
+            "Use `lighteval endpoint litellm` with a completion-capable model instead "
+            "(e.g. gpt-3.5-turbo-instruct or a local OpenAI-compatible server)."
+        )
 
     @cached(SamplingMethod.PERPLEXITY)
     def loglikelihood_rolling(self, docs: list[Doc]) -> list[ModelResponse]:
-        """This function is used to compute the log likelihood of the context for perplexity metrics."""
-        raise NotImplementedError
+        """Not supported for HuggingFace Inference Providers — see ``loglikelihood`` for details."""
+        raise NotImplementedError(
+            "loglikelihood_rolling is not supported for the HuggingFace Inference Providers backend. "
+            "See loglikelihood() for the full explanation."
+        )
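
Review note: the docstring above says log-likelihood scoring needs `echo=True` and per-prompt token log-probabilities. As a minimal sketch of why, the function below sums the log-probabilities of the continuation tokens from an echoed response. It assumes the legacy OpenAI `/v1/completions` `logprobs` layout (`token_logprobs` and `text_offset` lists, character offsets, `None` for the first token); `continuation_logprob` and the sample numbers are hypothetical and are not taken from this PR.

```python
from typing import Optional


def continuation_logprob(
    token_logprobs: list[Optional[float]],
    text_offset: list[int],
    context: str,
    continuation: str,
) -> float:
    """Sum log-probabilities of tokens that start inside the continuation.

    The prompt sent to the API is context + continuation. With echo=True the
    response lists every prompt token, followed by any generated tokens.
    """
    start = len(context)
    end = start + len(continuation)
    total = 0.0
    for logprob, offset in zip(token_logprobs, text_offset):
        # Skip context tokens (offset < start) and generated tokens (offset >= end).
        if start <= offset < end and logprob is not None:
            total += logprob
    return total


# Made-up response for the prompt "The capital of France is Paris",
# plus one generated token "." that must not be counted.
context = "The capital of France is"
continuation = " Paris"
token_logprobs = [None, -2.5, -0.5, -1.0, -0.25, -0.75, -1.5]
text_offset = [0, 3, 11, 14, 21, 24, 30]

score = continuation_logprob(token_logprobs, text_offset, context, continuation)
print(score)
```

A chat-completions response has no echoed prompt tokens, so there is nothing to sum over, which is the limitation the new error message describes. The sketch also ignores tokens that straddle the context/continuation boundary; a real implementation would need a policy for those.
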