Fix localizer multi-GPU failure (model.to vs device_map) by KartikP · Pull Request #404 · brain-score/language

KartikP · 2026-06-03T08:53:22Z

On multi-GPU scoring tiers, HuggingfaceSubject loads models with device_map='auto' (sharded across GPUs). The localizer's extract_representations then called model.to(device), which tries to consolidate the sharded model onto a single GPU:

8B (fp32 ~32GB): OOMs trying to move the whole model onto one ~22GB A10G.
4B: partial consolidation leaves layers split, then the forward pass raises RuntimeError: Expected all tensors to be on the same device (cuda:0 and cuda:1).

Both surfaced as scoring failures on the 4-GPU medium tier (Jenkins run 190). The 0.6B and 1.7B models were unaffected because they land on the single-GPU small tier, where no device_map is used and the .to() is a no-op.

Fix: skip model.to(device) when the model is already dispatched via device_map (hf_device_map set). Inputs are already sent to self.device (the input-embedding shard), so the sharded forward works and 8B can use all GPUs instead of OOMing on one. Single-GPU loads are unchanged.
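A minimal sketch of the guard described above, not the actual diff. The helper names is_dispatched and place_model and the _StubModel class are hypothetical and exist only for illustration. The one thing taken from the PR is the check on hf_device_map, which transformers sets on models loaded with a device_map. The stub stands in for a real model so the snippet runs without GPUs or torch.

```python
def is_dispatched(model) -> bool:
    """Return True if the model carries a non-empty hf_device_map,
    i.e. it was loaded with device_map=... and is already placed."""
    return bool(getattr(model, "hf_device_map", None))


def place_model(model, device):
    """Move the model to `device` unless it is already dispatched.

    Hypothetical helper: calling .to(device) on a sharded model would try
    to pull every shard onto one device, so that call is skipped here.
    """
    if is_dispatched(model):
        return model  # leave the shards where they were placed at load time
    return model.to(device)


class _StubModel:
    """Stand-in for a torch.nn.Module; records the last .to() target."""

    def __init__(self, device_map=None):
        if device_map is not None:
            self.hf_device_map = device_map
        self.moved_to = None

    def to(self, device):
        self.moved_to = device
        return self


# Single-GPU load: no hf_device_map, so the model is moved as before.
single = place_model(_StubModel(), "cuda:0")

# Sharded load: hf_device_map is set, so .to() is never called.
sharded = place_model(_StubModel({"model.embed_tokens": 0, "lm_head": 1}), "cuda:0")
```

With a guard of this shape, only the inputs need to be moved to the device holding the input embeddings; the model's shards stay where device_map placed them.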

skip localizer model.to(device) when model is device_map-sharded

86a4c70

KartikP merged commit abcc58a into main Jun 3, 2026
11 checks passed

KartikP merged 1 commit into main from fix-localizer-device-map
