Skip to content

Architecture Overview

VoiceLayer is a lightweight MCP server that bridges Claude Code with your microphone and speakers. It's built with Bun and TypeScript, using system-level tools for audio I/O.

System Architecture

Claude Code Session
    │  MCP (JSON-RPC over stdio)
┌────────────────────────────┐
│     VoiceLayer MCP Server  │
│     (mcp-server.ts)        │
│                            │
│  ┌─────────┐  ┌─────────┐ │
│  │  TTS    │  │  Input  │ │
│  │ (tts.ts)│  │(input.ts)│ │
│  └────┬────┘  └────┬────┘ │
│       │            │       │
│  ┌────▼────┐  ┌────▼────┐ │
│  │edge-tts │  │   sox   │ │
│  │(python3)│  │  (rec)  │ │
│  └────┬────┘  └────┬────┘ │
│       │            │       │
│  ┌────▼────┐  ┌────▼────┐ │
│  │ afplay/ │  │   STT   │ │
│  │  mpv    │  │(stt.ts) │ │
│  └─────────┘  └────┬────┘ │
│                    │       │
│          ┌─────────┴──┐    │
│          │whisper.cpp │    │
│          │or Wispr API│    │
│          └────────────┘    │
│                            │
│  ┌──────────────────────┐  │
│  │  Session Booking     │  │
│  │  (session-booking.ts)│  │
│  └──────────────────────┘  │
└────────────────────────────┘

Module Responsibilities

Module File Responsibility
MCP Server mcp-server.ts Tool definitions, request routing, argument validation
TTS tts.ts Text-to-speech via edge-tts, audio playback, rate adjustment
Input input.ts Mic recording via sox, silence detection, PCM/WAV handling
STT stt.ts Backend abstraction — whisper.cpp (local) or Wispr Flow (cloud)
Session Booking session-booking.ts Lockfile mutex for mic access, stale lock cleanup
Paths paths.ts Centralized /tmp path constants
Audio Utils audio-utils.ts Shared audio utilities (RMS calculation)
Session session.ts Session lifecycle management (save/load/generate)
Report report.ts QA report rendering (JSON -> markdown)
Brief brief.ts Discovery brief rendering (JSON -> markdown)
Schemas schemas/ QA and discovery category/checklist definitions

Data Flow: Converse Mode

The most complex mode — full round-trip voice Q&A:

1. Agent calls qa_voice_converse("How does it look?")
2. Session booking check (lockfile)
3. edge-tts synthesizes → /tmp/voicelayer-tts-PID-N.mp3
4. afplay speaks the question (stop-signal polling at 300ms)
5. sox starts recording → raw 16kHz 16-bit mono PCM to stdout
6. PCM streamed in 1-second chunks, RMS calculated per chunk
7. Stop condition met:
   - User: touch /tmp/voicelayer-stop
   - Silence: 5 consecutive chunks below threshold
   - Timeout: configurable (default 300s)
8. PCM wrapped in WAV header → /tmp/voicelayer-recording-PID-TS.wav
9. whisper.cpp transcribes WAV → text
10. Text returned to agent, temp files cleaned up

External Dependencies

VoiceLayer delegates audio I/O to battle-tested system tools rather than bundling audio libraries:

Tool Purpose Install
python3 + edge-tts Neural TTS (Microsoft, free) pip3 install edge-tts
sox/rec Mic recording (16kHz mono) brew install sox
afplay (macOS) Audio playback Built-in
mpv/ffplay/mpg123 (Linux) Audio playback Package manager
whisper.cpp Local STT brew install whisper-cpp

This approach keeps VoiceLayer lightweight (~500 lines of TypeScript) while leveraging mature, well-tested audio tools.

File-Based IPC

VoiceLayer uses the filesystem for inter-process communication:

File Purpose Pattern
/tmp/voicelayer-session.lock Mic mutex Atomic wx create, read JSON
/tmp/voicelayer-stop Stop signal Touch to create, poll existence
/tmp/voicelayer-thinking.md Think log Append-only markdown

This is intentional — MCP servers can't push UI updates to Claude Code. File-based signaling is the only reliable cross-process communication pattern available.