Converse Mode¶
Full bidirectional voice Q&A. The agent speaks a question, records the user's voice response via microphone, transcribes it with whisper.cpp (or Wispr Flow), and returns the text.
This is the only blocking mode — the tool call doesn't return until the user finishes speaking or the timeout expires.
When to Use¶
- Interactive Q&A sessions: "What did you think of the prototype?"
- Drilling sessions: "Walk me through how the auth flow works"
- Discovery calls: "What are the main pain points with the current system?"
- QA testing: "How does the checkout page look on your screen?"
MCP Tool¶
Name: qa_voice_converse
Alias: qa_voice_ask
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
message |
string | Yes | — | Question or prompt to speak aloud (non-empty) |
timeout_seconds |
number | No | 300 |
Max wait time for response (clamped to 10-3600) |
Returns¶
On success — the user's transcribed text:
On timeout / no speech:
{
"content": [{ "type": "text", "text": "[converse] No response received within 300 seconds. The user may have stepped away." }]
}
On busy (another session has the mic):
{
"content": [{ "type": "text", "text": "[converse] Line is busy — voice session owned by mcp-12345 (PID 12345) since 2026-02-21T10:00:00Z. Fall back to text input, or wait for the other session to finish." }],
"isError": true
}
Errors¶
| Error | Cause | Fix |
|---|---|---|
| Line busy | Another session has the mic | Wait or fall back to text |
| sox not installed | rec command not found |
brew install sox |
| Mic permission denied | Terminal not authorized | macOS: System Settings > Privacy > Microphone |
| No STT backend | Neither whisper.cpp nor Wispr available | Install whisper-cpp or set QA_VOICE_WISPR_KEY |
Behavior¶
Full Flow¶
- Session booking — auto-books if not already booked (lockfile check)
- Clear state — removes leftover input/stop signals
- Speak question — edge-tts synthesizes and plays the question
- Record mic — sox starts recording 16kHz 16-bit mono PCM
- Wait for stop — user touches
/tmp/voicelayer-stop(primary) or silence detected (fallback) - Transcribe — whisper.cpp or Wispr Flow converts audio to text
- Return — transcribed text returned to the agent
Stop Methods (in priority order)¶
- User stop signal (PRIMARY):
touch /tmp/voicelayer-stop - Silence detection (FALLBACK): 5 seconds of silence after speech is detected
- Timeout (SAFETY NET):
timeout_secondsparameter (default 300s)
Why user-controlled stop is primary
Silence detection can misfire — background noise, thinking pauses, or mic sensitivity issues cause premature cutoff. The touch-file approach gives the user explicit control over when they're done speaking.
Session Booking¶
Converse mode requires exclusive mic access. The first converse call auto-books a session:
- Lockfile:
/tmp/voicelayer-session.lock - Contains: PID, session ID, start timestamp
- Stale lock cleanup: dead PIDs are auto-detected and removed
- Race condition safe: uses atomic exclusive file creation (
wxflag)
Other Claude Code sessions that try to use converse see "line busy" and should fall back to text input.
The lock is released when the MCP server process exits (SIGTERM/SIGINT/exit handlers).
Recording Details¶
| Setting | Value |
|---|---|
| Sample rate | 16,000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit signed |
| Format | Raw PCM -> WAV |
| Silence threshold | RMS 500 (configurable) |
| Silence duration | 5 seconds (converse-specific) |
Audio is recorded as raw PCM via sox, wrapped in a WAV header, then passed to the STT backend.
Speech Rate¶
Default: +0% (natural conversational pace).
No auto-slowdown applied — converse questions are typically short.