Voice Cloning
VoiceLayer supports zero-shot voice cloning using Qwen3-TTS. Clone any voice from YouTube samples — no model training required.
How It Works
YouTube URL → yt-dlp (WAV 48kHz) → Silero VAD segmentation → FFmpeg normalization
→ voicelayer clone (select best 3 clips) → profile.yaml → Qwen3-TTS daemon
Three-tier TTS routing at runtime:
- Qwen3-TTS — cloned voice (if daemon running + profile exists)
- edge-tts — Microsoft neural voice (free, always available)
- Text-only — fallback when no audio output possible
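The three tiers above can be read as a priority order. The following is a minimal sketch of that decision, not VoiceLayer's actual routing code; the Engine enum, the choose_engine function, and its three boolean inputs are hypothetical names introduced for illustration.

```python
from enum import Enum


class Engine(Enum):
    """Hypothetical labels for the three tiers described above."""
    QWEN3_TTS = "qwen3-tts"
    EDGE_TTS = "edge-tts"
    TEXT_ONLY = "text-only"


def choose_engine(daemon_running: bool, profile_exists: bool, audio_available: bool) -> Engine:
    """Pick the highest tier whose preconditions hold."""
    if not audio_available:
        # No audio output possible: nothing to synthesize for.
        return Engine.TEXT_ONLY
    if daemon_running and profile_exists:
        # Cloned voice needs both the daemon and a profile.
        return Engine.QWEN3_TTS
    # Otherwise use the Microsoft neural voice.
    return Engine.EDGE_TTS
```

Keeping the decision in one pure function makes the fallback behavior easy to test without a running daemon.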
Prerequisites
# Required
brew install yt-dlp ffmpeg
pip3 install silero-vad torch soundfile
# Optional (better quality)
pip3 install demucs # Source separation (removes background music)
pip3 install pyannote.audio # Speaker diarization (multi-speaker videos)
# TTS daemon
pip3 install mlx-audio fastapi uvicorn
Step 1: Extract Voice Samples
Pull clean voice clips from YouTube videos.
The extraction pipeline:
- Downloads audio via yt-dlp (WAV 48kHz)
- Optionally runs Demucs source separation (--demucs)
- Segments speech using Silero VAD
- Optionally diarizes speakers (--diarize)
- Normalizes audio (highpass 80Hz, loudnorm -16 LUFS)
- Saves clips to ~/.voicelayer/voices/{name}/samples/
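The normalization step can be illustrated with a small sketch that builds an FFmpeg command line. The exact filter chain VoiceLayer runs is not shown in this page, so treat the arguments below as an assumption that matches the stated settings (80Hz highpass, -16 LUFS loudnorm, 48kHz output); normalize_command is a hypothetical helper name.

```python
from pathlib import Path


def normalize_command(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg argv that applies a highpass and loudness normalization."""
    # highpass=f=80 attenuates rumble below 80 Hz;
    # loudnorm=I=-16 targets an integrated loudness of -16 LUFS.
    audio_filter = "highpass=f=80,loudnorm=I=-16"
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it exists
        "-i", str(src),      # input clip
        "-af", audio_filter, # audio filter chain
        "-ar", "48000",      # keep the 48 kHz sample rate (assumed to match the download)
        str(dst),
    ]
```

The command could then be executed with subprocess.run(normalize_command(src, dst), check=True). Returning a list rather than a shell string avoids quoting problems with unusual file names.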
Options
| Flag | Default | Description |
|---|---|---|
| --source | required | YouTube URL (channel, video, or playlist) |
| --name | required | Voice profile name |
| --count | 20 | Number of clips to extract |
| --demucs | off | Enable source separation |
| --diarize | off | Enable speaker diarization |
Step 2: Build a Voice Profile
Select the best reference clips and generate a profile with the voicelayer clone command.
This command:
- Reads all samples from ~/.voicelayer/voices/{name}/samples/
- Analyzes audio quality (duration, RMS level, SNR estimate)
- Selects the best 3 clips (~18.5s total; the Qwen3-TTS sweet spot is 3-30s)
- Generates transcripts via whisper.cpp
- Writes ~/.voicelayer/voices/{name}/profile.yaml
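The clip selection step could look roughly like the sketch below. The real scoring used by voicelayer clone is not documented here, so everything in it is an assumption: the Clip type, the score heuristic (reward SNR, penalize distance from a -20 dBFS target level), and the reading of the 30s figure as a cap on the combined reference length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical per-clip measurements (duration, RMS level, SNR estimate)."""
    path: str
    duration_s: float
    rms_dbfs: float
    snr_db: float


def score(clip: Clip, target_rms_dbfs: float = -20.0) -> float:
    # Assumed heuristic: higher SNR is better, and levels far from the target are penalized.
    return clip.snr_db - abs(clip.rms_dbfs - target_rms_dbfs)


def select_reference_clips(clips: list[Clip], count: int = 3, max_total_s: float = 30.0) -> list[Clip]:
    """Greedily take the best-scoring clips while the total stays within max_total_s."""
    chosen: list[Clip] = []
    total = 0.0
    for clip in sorted(clips, key=score, reverse=True):
        if total + clip.duration_s > max_total_s:
            continue  # this clip would push the reference audio past the cap
        chosen.append(clip)
        total += clip.duration_s
        if len(chosen) == count:
            break
    return chosen
```

A greedy pass is simple and predictable, but it does not guarantee the best possible combination; a real implementation might weigh transcript quality or speaker consistency as well.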
Profile Format
name: speaker-name
engine: qwen3-tts
model_path: ~/.voicelayer/models/qwen3-tts-4bit
reference_clips:
  - path: ~/.voicelayer/voices/speaker-name/samples/clip_007.wav
    text: "Transcribed text of this clip..."
  - path: ~/.voicelayer/voices/speaker-name/samples/clip_012.wav
    text: "Another transcribed clip..."
  - path: ~/.voicelayer/voices/speaker-name/samples/clip_003.wav
    text: "Third reference clip..."
reference_clip: ~/.voicelayer/voices/speaker-name/samples/clip_007.wav
fallback: en-US-GuyNeural
created: "2026-02-27"
source: "https://youtube.com/@channel"
Step 3: Run the TTS Daemon
Start the Qwen3-TTS daemon for runtime synthesis with the voicelayer daemon command.
The daemon:
- Loads the Qwen3-TTS 4-bit quantized model into Metal/MPS memory
- Serves a /synthesize HTTP endpoint
- Inference latency: 200-500ms per call on Apple Silicon
- Model location: ~/.voicelayer/models/qwen3-tts-4bit/
- Creates or reuses ~/.voicelayer/daemon.secret with mode 0600
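The "creates or reuses" behavior for the token file can be sketched as follows. This is an illustration of the general pattern, not the daemon's actual code; load_or_create_secret and the token length are assumptions.

```python
import os
import secrets
from pathlib import Path


def load_or_create_secret(path: Path) -> str:
    """Return the token stored at path, creating the file with mode 0600 if absent."""
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # O_EXCL makes creation atomic: the call fails if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        # Reuse the existing token.
        return path.read_text(encoding="utf-8").strip()
    token = secrets.token_urlsafe(32)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(token)
    # Set the mode explicitly in case the process umask removed owner bits.
    os.chmod(path, 0o600)
    return token
```

Creating the file with restrictive permissions from the start, rather than writing it and tightening the mode afterwards, avoids a window in which other local users could read the token.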
The TypeScript bridge and the Python daemon both default to the same token
file. If you need a custom launcher path, set
VOICELAYER_TTS_DAEMON_SECRET_FILE=/path/to/token,
VOICELAYER_TTS_AUTH_TOKEN_FILE=/path/to/token, or start the daemon with
voicelayer daemon --daemon-secret-file /path/to/token.
The daemon requires Authorization: Bearer ... on every endpoint, only accepts
Host: 127.0.0.1:8880 / Host: localhost:8880, rejects non-local Origin
headers, and only accepts reference_wav paths that resolve inside
~/.voicelayer/voices/.
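The request checks described above can be sketched with the standard library alone. This is a hedged illustration of the three rules (bearer token, Host allow-list, path confinement), not the daemon's implementation; the function names and the ALLOWED_HOSTS constant are hypothetical.

```python
import hmac
from pathlib import Path

# Host values taken from the description above.
ALLOWED_HOSTS = {"127.0.0.1:8880", "localhost:8880"}


def token_matches(authorization_header: str, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header in constant time."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected_token.encode("utf-8"))


def host_allowed(host_header: str) -> bool:
    """Accept only the local host names listed above."""
    return host_header.strip().lower() in ALLOWED_HOSTS


def reference_path_allowed(reference_wav: str, voices_root: Path) -> bool:
    """Accept only paths that resolve to a location inside voices_root."""
    root = voices_root.expanduser().resolve()
    # resolve() collapses '..' segments and follows symlinks before the comparison.
    candidate = Path(reference_wav).expanduser().resolve()
    return candidate.is_relative_to(root)
```

Resolving the path before comparing it is what blocks traversal attempts such as a reference_wav containing .. segments, and the Host check limits exposure to DNS rebinding from a browser.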
Testing
AUTH_TOKEN="$(cat ~/.voicelayer/daemon.secret)"
# Health check
curl http://127.0.0.1:8880/health \
-H "Authorization: Bearer ${AUTH_TOKEN}"
# Synthesize
curl -X POST http://127.0.0.1:8880/synthesize \
-H "Authorization: Bearer ${AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello world",
"reference_wav": "'"$HOME"'/.voicelayer/voices/speaker-name/samples/clip_007.wav",
"reference_text": "Transcribed text of this clip..."
}'
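The same synthesize call can be made from Python. The sketch below only builds the request, using the endpoint, headers, and JSON fields from the curl example above; build_synthesize_request and DAEMON_URL are hypothetical names, and the response format is not documented here, so handling of the returned bytes is left to the caller.

```python
import json
import urllib.request

# Address taken from the curl examples above.
DAEMON_URL = "http://127.0.0.1:8880"


def build_synthesize_request(
    token: str, text: str, reference_wav: str, reference_text: str
) -> urllib.request.Request:
    """Build a POST /synthesize request with the bearer token attached."""
    payload = {
        "text": text,
        "reference_wav": reference_wav,
        "reference_text": reference_text,
    }
    return urllib.request.Request(
        f"{DAEMON_URL}/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, read the token from ~/.voicelayer/daemon.secret, pass the resulting request to urllib.request.urlopen, and read the response body.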
Using Cloned Voices
Once the daemon is running and a profile exists, VoiceLayer automatically routes through Qwen3-TTS.
If the daemon is unavailable, VoiceLayer falls back to edge-tts using the fallback voice from the profile.
File Locations
| Path | Purpose |
|---|---|
| ~/.voicelayer/voices/{name}/samples/ | Extracted voice clips |
| ~/.voicelayer/voices/{name}/profile.yaml | Voice profile |
| ~/.voicelayer/models/qwen3-tts-4bit/ | Quantized model weights |