Architecture Overview

VoiceLayer is a lightweight MCP server that bridges Claude Code with your microphone and speakers. It's built with Bun and TypeScript, using system-level tools for audio I/O.

System Architecture

Claude Code Session
    │
    │  MCP (JSON-RPC over stdio)
    │
    ▼
┌────────────────────────────┐
│     VoiceLayer MCP Server  │
│     (mcp-server.ts)        │
│                            │
│  ┌─────────┐  ┌─────────┐ │
│  │  TTS    │  │  Input  │ │
│  │ (tts.ts)│  │(input.ts)│ │
│  └────┬────┘  └────┬────┘ │
│       │            │       │
│  ┌────▼────┐  ┌────▼────┐ │
│  │edge-tts │  │   sox   │ │
│  │(python3)│  │  (rec)  │ │
│  └────┬────┘  └────┬────┘ │
│       │            │       │
│  ┌────▼────┐  ┌────▼────┐ │
│  │ afplay/ │  │   STT   │ │
│  │  mpv    │  │(stt.ts) │ │
│  └─────────┘  └────┬────┘ │
│                    │       │
│          ┌─────────┴──┐    │
│          │whisper.cpp │    │
│          │or Wispr API│    │
│          └────────────┘    │
│                            │
│  ┌──────────────────────┐  │
│  │  Session Booking     │  │
│  │  (session-booking.ts)│  │
│  └──────────────────────┘  │
└────────────────────────────┘
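
To make the top arrow in the diagram concrete, here is a minimal sketch of what a tool call might look like on the wire. The `tools/call` method and the JSON-RPC 2.0 envelope come from the MCP specification; the `message` argument name and the helper names `buildVoiceAskRequest` and `frame` are assumptions for illustration, since the real tool schema is defined in `mcp-server.ts`.

```typescript
// Shape of an MCP tool call expressed as a JSON-RPC 2.0 request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: the `message` argument name is an assumption,
// not VoiceLayer's confirmed schema.
function buildVoiceAskRequest(id: number, message: string): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "voice_ask", arguments: { message } },
  };
}

// The stdio transport carries one JSON message per line.
function frame(request: ToolCallRequest): string {
  return JSON.stringify(request) + "\n";
}
```

The client writes such a line to the server's stdin and reads the matching response, correlated by `id`, from its stdout.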

Module Responsibilities

| Module | File | Responsibility |
|---|---|---|
| MCP Server | `mcp-server.ts` | Tool definitions, request routing, argument validation |
| TTS | `tts.ts` | Text-to-speech via edge-tts, audio playback, rate adjustment |
| Input | `input.ts` | Mic recording via sox (native rate), Silero VAD speech detection, PCM resampling/WAV handling |
| STT | `stt.ts` | Backend abstraction: whisper.cpp (local) or Wispr Flow (cloud) |
| Session Booking | `session-booking.ts` | Lockfile mutex for mic access, stale lock cleanup |
| Paths | `paths.ts` | Centralized `/tmp` path constants |
| Audio Utils | `audio-utils.ts` | Shared audio utilities (RMS calculation, native rate detection, PCM resampling) |
| Session | `session.ts` | Session lifecycle management (save/load/generate) |
| Report | `report.ts` | QA report rendering (JSON -> markdown) |
| Brief | `brief.ts` | Discovery brief rendering (JSON -> markdown) |
| Schemas | `schemas/` | QA and discovery category/checklist definitions |
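
As an illustration of the kind of helper the Audio Utils row describes, here is a minimal sketch of RMS calculation and PCM resampling for 16-bit mono audio. The function names `rms` and `resampleLinear`, and the use of plain linear interpolation, are assumptions for illustration; the actual `audio-utils.ts` may use a different method.

```typescript
// Hypothetical helpers; names and the interpolation method are assumptions.

/** Root-mean-square level of 16-bit PCM samples, normalized to the range 0..1. */
function rms(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const normalized = samples[i] / 32768;
    sumSquares += normalized * normalized;
  }
  return Math.sqrt(sumSquares / samples.length);
}

/**
 * Resample mono 16-bit PCM from `fromRate` to `toRate` by linear interpolation.
 * Note: no low-pass filter is applied, so downsampling can alias; acceptable
 * for a sketch, but a production resampler would usually filter first.
 */
function resampleLinear(input: Int16Array, fromRate: number, toRate: number): Int16Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const output = new Int16Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    output[i] = Math.round(input[left] * (1 - frac) + input[right] * frac);
  }
  return output;
}
```

For example, a 32 ms chunk captured at 48 kHz holds 1536 samples, and resampling it to 16 kHz yields 512 samples.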

Data Flow: Converse Mode

Converse mode is the most complex mode, a full round-trip voice Q&A:

1. Agent calls voice_ask("How does it look?")
                │
2. Session booking check (lockfile)
                │
3. Wait for prior voice_speak playback to finish (if any)
                │
4. edge-tts synthesizes → /tmp/voicelayer-tts-PID-N.mp3
                │
5. afplay speaks the question (blocking — waits for completion)
                │
6. Detect device native sample rate (e.g., 24kHz AirPods, 48kHz built-in)
                │
7. sox records at native rate → raw PCM to stdout (no sox resampling)
                │
8. Each 32ms chunk resampled to 16kHz, fed to Silero VAD + accumulated
                │
9. Stop condition met:
   - User: touch /tmp/voicelayer-stop
   - Silero VAD: configurable silence after speech (0.5-2.5s)
   - Pre-speech timeout: 15s of no speech detected
   - Timeout: configurable (default 300s)
                │
10. 16kHz PCM wrapped in WAV header → /tmp/voicelayer-recording-PID-TS.wav
                │
11. STT backend (whisper.cpp or Wispr Flow) transcribes WAV → text
                │
12. Text returned to agent, temp files cleaned up
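
Step 10 above (wrapping 16kHz PCM in a WAV header) can be sketched as follows. This is a minimal illustration of the standard 44-byte RIFF/WAVE header for 16-bit mono PCM; the function name `pcmToWav` is hypothetical and not taken from `input.ts`.

```typescript
// Hypothetical helper: wraps raw 16-bit mono PCM in a 44-byte RIFF/WAVE header.
function pcmToWav(pcm: Int16Array, sampleRate: number = 16000): Uint8Array {
  const numChannels = 1;
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // All multi-byte fields in a WAV file are little-endian (third argument `true`).
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // total file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * numChannels * bytesPerSample, true); // byte rate
  view.setUint16(32, numChannels * bytesPerSample, true); // block align
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Sample data follows the header.
  for (let i = 0; i < pcm.length; i++) {
    view.setInt16(44 + i * bytesPerSample, pcm[i], true);
  }
  return new Uint8Array(buffer);
}
```

Because the header only needs the final data size, the recorder can accumulate resampled chunks in memory and write the file once, after a stop condition fires.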

External Dependencies

VoiceLayer delegates audio I/O to battle-tested system tools rather than bundling audio libraries:

| Tool | Purpose | Install |
|---|---|---|
| python3 + edge-tts | Neural TTS (Microsoft, free) | `pip3 install edge-tts` |
| sox/rec | Mic recording (native rate mono) | `brew install sox` |
| afplay (macOS) | Audio playback | Built-in |
| mpv/ffplay/mpg123 (Linux) | Audio playback | Package manager |
| whisper.cpp | Local STT | `brew install whisper-cpp` |

This approach keeps VoiceLayer lightweight (~500 lines of TypeScript) while leveraging mature, well-tested audio tools.
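
One way the per-platform playback choice in the table could be expressed is sketched below. The names `PLAYERS` and `pickPlayer`, and the preference order among the Linux players, are assumptions for illustration rather than VoiceLayer's actual logic.

```typescript
type Platform = "darwin" | "linux";

// Candidate players per platform, in preference order (the order is an assumption).
const PLAYERS: Record<Platform, string[]> = {
  darwin: ["afplay"],
  linux: ["mpv", "ffplay", "mpg123"],
};

// Hypothetical helper: `isInstalled` is injected so the selection logic can be
// exercised without touching the real PATH.
function pickPlayer(
  platform: Platform,
  isInstalled: (command: string) => boolean,
): string | null {
  const found = PLAYERS[platform].find(isInstalled);
  return found === undefined ? null : found;
}
```

Under Bun, `isInstalled` could plausibly be `(command) => Bun.which(command) !== null`. A `null` result would be the point at which to tell the user which package to install.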

File-Based IPC

VoiceLayer uses the filesystem for inter-process communication:

| File | Purpose | Pattern |
|---|---|---|
| `/tmp/voicelayer-session.lock` | Mic mutex | Atomic `wx` create, read JSON |
| `/tmp/voicelayer-stop` | Stop signal | Touch to create, poll existence |
| `/tmp/voicelayer-thinking.md` | Think log | Append-only markdown |

This is intentional — MCP servers can't push UI updates to Claude Code. File-based signaling is the only reliable cross-process communication pattern available.
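
The lockfile and stop-signal patterns from the table can be sketched with Node-compatible `fs` calls, which Bun also provides. This is a minimal illustration under stated assumptions: the lock's JSON fields (`pid`, `startedAt`), the PID liveness check used for stale lock cleanup, and all function names are hypothetical and not taken from `session-booking.ts`.

```typescript
import { existsSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";

// Assumed lock contents; the real file's fields may differ.
interface LockInfo { pid: number; startedAt: number }

// Signal 0 delivers nothing; it only checks whether the process exists.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

function removeIfPresent(path: string): void {
  try { unlinkSync(path); } catch { /* already gone */ }
}

// Returns true if the caller now holds the lock.
function acquireLock(lockPath: string, pid: number = process.pid): boolean {
  const info: LockInfo = { pid, startedAt: Date.now() };
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" fails with EEXIST if the file exists, so check-and-create is atomic.
      writeFileSync(lockPath, JSON.stringify(info), { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
    // The lock exists: respect a live owner, clear a stale or unreadable lock.
    let ownerPid: number | null = null;
    try {
      ownerPid = (JSON.parse(readFileSync(lockPath, "utf8")) as LockInfo).pid;
    } catch { /* unreadable or half-written lock is treated as stale */ }
    if (typeof ownerPid === "number" && isAlive(ownerPid)) return false;
    removeIfPresent(lockPath); // then retry the atomic create once
  }
  return false;
}

function releaseLock(lockPath: string): void {
  removeIfPresent(lockPath);
}

// Stop signal: the recorder polls for the file's existence between audio chunks.
function stopRequested(stopPath: string): boolean {
  return existsSync(stopPath);
}
```

The retry loop runs at most twice, so if another process claims the lock between the stale cleanup and the second create, the caller simply reports the mic as busy instead of looping.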