Configuration
VoiceLayer is configured entirely via environment variables. Every setting is optional, so basic usage requires no configuration.
Environment Variables
STT (Speech-to-Text)
| Variable | Default | Description |
|---|---|---|
| `QA_VOICE_STT_BACKEND` | `auto` | Backend selection: `whisper`, `wispr`, or `auto` |
| `QA_VOICE_WHISPER_MODEL` | auto-detected | Absolute path to a whisper.cpp GGML model file |
| `QA_VOICE_WISPR_KEY` | (none) | Wispr Flow API key (cloud fallback only) |
In `auto` mode, VoiceLayer checks for whisper.cpp first and falls back to Wispr Flow if `QA_VOICE_WISPR_KEY` is set.
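The selection rule can be sketched in a few lines of Python. This is only an illustration of the documented order, not VoiceLayer's actual code: the function name `select_stt_backend` and its parameters are hypothetical, and returning `None` when neither backend is usable is an assumption, since the behavior in that case is not described here.

```python
from typing import Optional


def select_stt_backend(
    setting: str,
    whisper_available: bool,
    wispr_key: Optional[str],
) -> Optional[str]:
    """Resolve a QA_VOICE_STT_BACKEND value to a concrete backend name."""
    if setting != "auto":
        # An explicit choice ("whisper" or "wispr") is used as given.
        return setting
    if whisper_available:
        # Local whisper.cpp is checked first in auto mode.
        return "whisper"
    if wispr_key:
        # Cloud fallback applies only when QA_VOICE_WISPR_KEY is set.
        return "wispr"
    # Assumption: no backend is usable; the real behavior is not documented.
    return None
```

For example, `select_stt_backend("auto", False, "my-key")` resolves to `"wispr"`, while the same call with whisper.cpp available resolves to `"whisper"`.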
Model auto-detection scans ~/.cache/whisper/ for GGML files in this order:
1. `ggml-large-v3-turbo.bin`
2. `ggml-large-v3-turbo-q5_0.bin`
3. `ggml-base.en.bin`
4. `ggml-base.bin`
5. `ggml-small.en.bin`
6. `ggml-small.bin`
7. Any other `ggml-*.bin` file
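As a rough sketch of that search order, the Python function below walks the preferred file names and then falls back to any other `ggml-*.bin` file. The function name `find_whisper_model` is hypothetical, and sorting the fallback matches alphabetically is an assumption made only to keep the result deterministic; VoiceLayer's own tie-breaking is not documented here.

```python
from pathlib import Path
from typing import Optional

# Preference order taken from the list above.
PREFERRED_MODELS = [
    "ggml-large-v3-turbo.bin",
    "ggml-large-v3-turbo-q5_0.bin",
    "ggml-base.en.bin",
    "ggml-base.bin",
    "ggml-small.en.bin",
    "ggml-small.bin",
]


def find_whisper_model(cache_dir: Path) -> Optional[Path]:
    """Return the first model found in preference order, or None."""
    for name in PREFERRED_MODELS:
        candidate = cache_dir / name
        if candidate.is_file():
            return candidate
    # Fall back to any other ggml-*.bin file (sorted: an assumption).
    others = sorted(p for p in cache_dir.glob("ggml-*.bin") if p.is_file())
    return others[0] if others else None
```

A typical call would be `find_whisper_model(Path.home() / ".cache" / "whisper")`.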
TTS (Text-to-Speech)
| Variable | Default | Description |
|---|---|---|
| `QA_VOICE_TTS_VOICE` | `en-US-JennyNeural` | Microsoft edge-tts voice ID |
| `QA_VOICE_TTS_RATE` | `+0%` | Base speech rate (per-mode defaults layer on top) |
Available voices: run `edge-tts --list-voices` for the full list. Popular choices:
| Voice | Language | Style |
|---|---|---|
| `en-US-JennyNeural` | English (US) | Default, clear female |
| `en-US-GuyNeural` | English (US) | Male |
| `en-GB-SoniaNeural` | English (UK) | British female |
| `en-US-AriaNeural` | English (US) | Expressive female |
Recording
Recording uses Silero VAD, a neural-network voice activity detector, for speech detection. The device's native sample rate is auto-detected, so no configuration is needed for any microphone (built-in, AirPods, USB, etc.).
Silence detection is configured per-call via the silence_mode parameter on voice_ask:
| Mode | Silence Duration | Use Case |
|---|---|---|
| `quick` | 0.5s | Fast responses, short answers |
| `standard` | 1.5s | Normal conversation |
| `thoughtful` | 2.5s (default) | User pauses to think |
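To make the effect of each mode concrete, here is a small Python sketch of end-of-speech detection over per-frame speech flags, such as a VAD might produce. It is not VoiceLayer's implementation: the function name `stop_frame`, the fixed frame length, and the rule that only silence after the first detected speech counts are all assumptions for illustration.

```python
from typing import Optional, Sequence

# Silence durations in milliseconds, from the table above.
SILENCE_MS = {"quick": 500, "standard": 1500, "thoughtful": 2500}


def stop_frame(
    is_speech: Sequence[bool],
    frame_ms: int,
    silence_mode: str = "thoughtful",
) -> Optional[int]:
    """Return the index of the frame at which recording would stop, or None."""
    needed = SILENCE_MS[silence_mode]
    heard_speech = False
    silent_ms = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            # Assumption: silence before the first speech is ignored.
            silent_ms += frame_ms
            if silent_ms >= needed:
                return i
    return None
```

With 100 ms frames, two frames of speech followed by five silent frames would end a `quick` recording but not a `standard` or `thoughtful` one.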
Output
| Variable | Default | Description |
|---|---|---|
| `QA_VOICE_THINK_FILE` | `/tmp/voicelayer-thinking.md` | Path for the think mode markdown log |
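For reference, the documented variables and their defaults can be collected in one place. The Python sketch below shows one way a wrapper script might resolve them; `load_settings` and `DEFAULTS` are hypothetical names, and VoiceLayer itself may read these values differently.

```python
import os
from typing import Mapping

# Defaults copied from the tables above; None means "no default".
DEFAULTS = {
    "QA_VOICE_STT_BACKEND": "auto",
    "QA_VOICE_WHISPER_MODEL": None,  # auto-detected when unset
    "QA_VOICE_WISPR_KEY": None,
    "QA_VOICE_TTS_VOICE": "en-US-JennyNeural",
    "QA_VOICE_TTS_RATE": "+0%",
    "QA_VOICE_THINK_FILE": "/tmp/voicelayer-thinking.md",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Merge environment values over the documented defaults."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}
```

Passing a plain dictionary instead of `os.environ` makes it easy to preview what a given set of overrides would resolve to.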
Per-Mode Speech Rates
Each voice mode has a default rate that balances speed with clarity:
| Mode | Default Rate | Rationale |
|---|---|---|
| `announce` | +10% | Quick status updates: snappy delivery |
| `brief` | -10% | Long explanations: slower for digestion |
| `consult` | +5% | Checkpoints: slightly fast, user may respond |
| `converse` | +0% | Conversational: natural speed |
Rates are auto-adjusted for long text:
| Text Length | Adjustment |
|---|---|
| < 300 chars | No change |
| 300-599 chars | -5% |
| 600-999 chars | -10% |
| 1000+ chars | -15% |
You can override the rate per call by passing the `rate` parameter to any TTS tool.
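The two tables above can be combined into a single calculation. The Python sketch below assumes that the base rate, the per-mode default, and the length adjustment simply add together, and that a per-call `rate` replaces the computed value entirely; neither detail is spelled out here, so treat both as assumptions. The function names are hypothetical.

```python
from typing import Optional

# Per-mode defaults in percent, from the table above.
MODE_RATES = {"announce": 10, "brief": -10, "consult": 5, "converse": 0}


def parse_rate(rate: str) -> int:
    """Convert a rate string such as "+10%" or "-5%" to an integer."""
    return int(rate.rstrip("%"))


def length_adjustment(text: str) -> int:
    """Return the adjustment in percent for the given text length."""
    n = len(text)
    if n < 300:
        return 0
    if n < 600:
        return -5
    if n < 1000:
        return -10
    return -15


def effective_rate(
    mode: str,
    text: str,
    base: str = "+0%",
    override: Optional[str] = None,
) -> str:
    """Compute the rate string for one utterance."""
    if override is not None:
        # Assumption: a per-call rate replaces the computed value.
        return override
    # Assumption: the three components are additive.
    total = parse_rate(base) + MODE_RATES[mode] + length_adjustment(text)
    return f"{total:+d}%"
```

Under these assumptions, a 700-character `brief` message would be spoken at `-20%` (the -10% mode default plus the -10% length adjustment).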
MCP Server Configuration
Basic Setup
With Environment Overrides
```json
{
  "mcpServers": {
    "voicelayer": {
      "command": "bunx",
      "args": ["voicelayer-mcp"],
      "env": {
        "QA_VOICE_TTS_VOICE": "en-GB-SoniaNeural",
        "QA_VOICE_STT_BACKEND": "whisper"
      }
    }
  }
}
```
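If you generate this file from a script, the entry can be built programmatically. The Python sketch below reproduces the structure shown above; `mcp_config` is a hypothetical helper, and omitting the `env` key when there are no overrides is an assumption about what a basic entry looks like.

```python
import json
from typing import Dict, Optional


def mcp_config(env: Optional[Dict[str, str]] = None) -> str:
    """Build an mcpServers entry for VoiceLayer as a JSON string."""
    server: dict = {"command": "bunx", "args": ["voicelayer-mcp"]}
    if env:
        # Only include "env" when there is something to override.
        server["env"] = dict(env)
    return json.dumps({"mcpServers": {"voicelayer": server}}, indent=2)
```

Calling `mcp_config({"QA_VOICE_STT_BACKEND": "whisper"})` yields an entry with a single override, ready to paste into your MCP client's configuration.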
File Paths
VoiceLayer uses /tmp for all runtime files:
| File | Purpose |
|---|---|
| `/tmp/voicelayer-session.lock` | Session booking lockfile |
| `/tmp/voicelayer-stop` | User stop signal (touch to end) |
| `/tmp/voicelayer-tts-*.mp3` | Temporary TTS audio (auto-cleaned) |
| `/tmp/voicelayer-recording-*.wav` | Temporary recording (auto-cleaned) |
| `/tmp/voicelayer-thinking.md` | Think mode log (persistent until cleared) |
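The stop signal is just a file that gets created, so it can be raised from any script as well as with `touch` in a shell. A minimal Python sketch, where `request_stop` is a hypothetical helper name and the path is the one listed in the table:

```python
from pathlib import Path

# Stop-signal path from the table above.
STOP_FILE = Path("/tmp/voicelayer-stop")


def request_stop(stop_file: Path = STOP_FILE) -> None:
    """Create the stop-signal file, like `touch /tmp/voicelayer-stop`."""
    stop_file.touch()
```

The path is a parameter only so the helper can be tried against a scratch location; how quickly VoiceLayer notices the file, and whether it removes it afterwards, is not covered here.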