Fatterbox is built on rsxdalv's optimized Chatterbox implementation, exposing both Wyoming protocol and OpenAPI endpoints with streaming support for minimal time-to-first-word latency. The streaming architecture splits text into sentence chunks and generates audio progressively, allowing playback to begin before the entire text is synthesized.
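
To illustrate the idea of sentence-level streaming, here is a minimal sketch in Python. It is not Fatterbox's actual code: `split_sentences`, `stream_speech`, and the `synthesize` callback are hypothetical names, and the regex splitter is deliberately naive.

```python
import re
from typing import Callable, Iterator

# Split after '.', '!' or '?' when followed by whitespace. Deliberately naive:
# abbreviations such as "e.g." will also trigger a split.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentence chunks of `text`, in order."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def stream_speech(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield audio for one sentence at a time.

    `synthesize` is a hypothetical stand-in for the TTS model call. Because
    this is a generator, the caller receives the first chunk before the
    remaining sentences have been synthesized.
    """
    for sentence in split_sentences(text):
        yield synthesize(sentence)


if __name__ == "__main__":
    # Placeholder "model" that just encodes the text; it produces no real audio.
    def fake_tts(sentence: str) -> bytes:
        return sentence.encode("utf-8")

    for chunk in stream_speech("Hello there. How are you? Fine!", fake_tts):
        print(chunk)
```

Since chunks are produced lazily, time-to-first-word depends on the length of the first sentence rather than on the length of the whole input.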
- Docker with NVIDIA GPU support (install nvidia-container-toolkit)
- NVIDIA GPU with CUDA capability
- Voice reference files (.wav format)

1. Prepare voice files: Place `.wav` files in a `voices` directory. Each file becomes a voice (e.g., `Jake.wav` → voice name "Jake").
2. Pull the prebuilt image (or build your own with `docker build -t fatterbox .`):

       docker pull docker.io/justinlime/fatterbox:v0.1.0

3. Run the container:

       docker run --gpus all \
         -v ./voices:/chatter/voices \
         -p 10200:10200 \
         -p 8000:8000 \
         docker.io/justinlime/fatterbox:v0.1.0

Two servers run simultaneously:
- Wyoming protocol: `tcp://0.0.0.0:10200` (Home Assistant integration)
- OpenAPI REST: `http://0.0.0.0:8000` (OpenAI-compatible)
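
The file-to-voice naming rule from the setup steps (the file stem becomes the voice name) can be sketched as follows. This is an illustration only: `discover_voices` is a hypothetical helper, and matching the extension case-insensitively is an assumption, not confirmed behavior.

```python
from pathlib import Path


def discover_voices(voices_dir: str) -> dict[str, Path]:
    """Map voice names to .wav files, e.g. Jake.wav -> "Jake".

    Files with other extensions and subdirectories are ignored.
    Assumption: the extension check is case-insensitive.
    """
    root = Path(voices_dir)
    return {
        path.stem: path
        for path in sorted(root.iterdir())
        if path.is_file() and path.suffix.lower() == ".wav"
    }


if __name__ == "__main__":
    # Prints the voice names found in ./voices, if that directory exists.
    if Path("./voices").is_dir():
        print(sorted(discover_voices("./voices")))
```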
Configure via environment variables (all prefixed with FATTERBOX_):

    FATTERBOX_WYOMING_HOST=0.0.0.0
    FATTERBOX_WYOMING_PORT=10200
    FATTERBOX_OPENAPI_HOST=0.0.0.0
    FATTERBOX_OPENAPI_PORT=8000
    FATTERBOX_VOICES_DIR=./voices

    FATTERBOX_DEVICE=cuda                # cuda or cpu
    FATTERBOX_DTYPE=bf16                 # float32, fp16, bf16 (bf16 recommended)
    FATTERBOX_BACKEND=cudagraphs-manual  # fastest option

Minimum VRAM Required:
- FP32: ~4.5 GB
- FP16/BF16: ~3.5 GB (recommended)

These estimates are based on my tests with an RTX 3090. Memory may spike when generating long sentences without punctuation. BF16 offers the best balance of speed and memory efficiency on modern GPUs.
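
For readers wiring these settings into their own tooling, the sketch below shows one way to read the `FATTERBOX_`-prefixed server and model variables in Python. It is not Fatterbox's actual loader: `ServerConfig` and `load_config` are hypothetical names, and using the values listed above as fallbacks is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_DEVICES = {"cuda", "cpu"}
VALID_DTYPES = {"float32", "fp16", "bf16"}


@dataclass(frozen=True)
class ServerConfig:
    wyoming_host: str
    wyoming_port: int
    openapi_host: str
    openapi_port: int
    voices_dir: str
    device: str
    dtype: str
    backend: str


def load_config(env: Mapping[str, str] = os.environ) -> ServerConfig:
    """Build a ServerConfig from FATTERBOX_* variables found in `env`."""

    def get(name: str, fallback: str) -> str:
        # Assumption: the documented example values double as fallbacks.
        return env.get(f"FATTERBOX_{name}", fallback)

    device = get("DEVICE", "cuda")
    dtype = get("DTYPE", "bf16")
    if device not in VALID_DEVICES:
        raise ValueError(f"FATTERBOX_DEVICE must be one of {sorted(VALID_DEVICES)}")
    if dtype not in VALID_DTYPES:
        raise ValueError(f"FATTERBOX_DTYPE must be one of {sorted(VALID_DTYPES)}")

    return ServerConfig(
        wyoming_host=get("WYOMING_HOST", "0.0.0.0"),
        wyoming_port=int(get("WYOMING_PORT", "10200")),
        openapi_host=get("OPENAPI_HOST", "0.0.0.0"),
        openapi_port=int(get("OPENAPI_PORT", "8000")),
        voices_dir=get("VOICES_DIR", "./voices"),
        device=device,
        dtype=dtype,
        backend=get("BACKEND", "cudagraphs-manual"),
    )


if __name__ == "__main__":
    print(load_config())
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of the real process environment.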

Generation parameters:

    FATTERBOX_EXAGGERATION=0.5     # Emotional expressiveness (0.0-2.0)
    FATTERBOX_CFG_WEIGHT=0.5       # Voice adherence (0.0-1.0)
    FATTERBOX_TEMPERATURE=0.8      # Randomness (0.05-5.0)
    FATTERBOX_SEED=0               # Random seed (0=random)
    FATTERBOX_TOP_P=1.0            # Nucleus sampling (0.0-1.0)
    FATTERBOX_MIN_P=0.0            # Min probability (0.0-1.0)
    FATTERBOX_MAX_NEW_TOKENS=4096  # Max audio tokens (~25 per second)
    FATTERBOX_N_TIMESTEPS=10       # Diffusion steps
    FATTERBOX_FLOW_CFG_SCALE=1.0   # Mel decoder CFG scale
    FATTERBOX_DEBUG=false          # Enable debug logging

Example with custom settings:

    docker run --gpus all \
      -v ./voices:/chatter/voices \
      -p 10200:10200 \
      -p 8000:8000 \
      -e FATTERBOX_DTYPE=bf16 \
      -e FATTERBOX_EXAGGERATION=0.7 \
      -e FATTERBOX_CFG_WEIGHT=0.4 \
      fatterbox

Generate speech through the OpenAI-compatible endpoint:

    curl -X POST http://localhost:8000/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "input": "Hello, this is a test.",
        "voice": "Jake"
      }' \
      --output speech.wav

List available voices:

    curl http://localhost:8000/v1/voices

Compatible with Home Assistant's Wyoming protocol. Configure in Home Assistant using:
- Host: `<docker-host-ip>`
- Port: `10200`
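
For the REST side, the `curl` request to `/v1/audio/speech` shown earlier can also be made from Python using only the standard library. This is a minimal sketch: `build_speech_request` and `save_speech` are hypothetical helper names, and treating the response body as WAV data simply mirrors the `--output speech.wav` in the `curl` example.

```python
import json
import urllib.request


def build_speech_request(
    text: str, voice: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = json.dumps({"input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(
    text: str,
    voice: str,
    out_path: str = "speech.wav",
    base_url: str = "http://localhost:8000",
) -> None:
    """Send the request and write the response body to `out_path` as it arrives."""
    request = build_speech_request(text, voice, base_url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        # Read in small blocks so a streamed response is written progressively.
        while chunk := response.read(8192):
            out.write(chunk)


if __name__ == "__main__":
    # Requires a running container and a voice named "Jake".
    save_speech("Hello, this is a test.", "Jake")
```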

Tips:

- Use the `bf16` dtype (recommended) for the best balance of speed and VRAM efficiency
- RTX 30xx/40xx GPUs have native BF16 support for optimal performance
- Use the `cudagraphs-manual` backend (default) for fastest generation
- Raise `EXAGGERATION` and lower `CFG_WEIGHT` for more expressive speech
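
When tuning `FATTERBOX_MAX_NEW_TOKENS`, the "~25 tokens per second" figure gives a rough conversion between the token cap and audio length. The sketch below is only that arithmetic; the function names are hypothetical, and whether the cap applies per request or per sentence chunk is not stated here.

```python
import math

# Approximate rate quoted next to FATTERBOX_MAX_NEW_TOKENS.
TOKENS_PER_SECOND = 25


def max_audio_seconds(max_new_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Approximate audio duration covered by a token cap."""
    return max_new_tokens / tokens_per_second


def tokens_for_seconds(seconds: float, tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Approximate token cap needed for `seconds` of audio, rounded up."""
    return math.ceil(seconds * tokens_per_second)


if __name__ == "__main__":
    print(max_audio_seconds(4096))  # 163.84, i.e. a bit under three minutes
    print(tokens_for_seconds(30))   # 750
```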