Voice Cloning

VoiceLayer supports zero-shot voice cloning using Qwen3-TTS. It can clone a voice from YouTube samples, with no model training required.

How It Works

YouTube URL → yt-dlp (WAV 48kHz) → Silero VAD segmentation → FFmpeg normalization
→ voicelayer clone (select best 3 clips) → profile.yaml → Qwen3-TTS daemon

Three-tier TTS routing at runtime:

1. Qwen3-TTS: cloned voice (if daemon running + profile exists)
2. edge-tts: Microsoft neural voice (free, always available)
3. Text-only: fallback when no audio output is possible
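
The routing order above can be sketched as a small decision function. This is an illustration only, not VoiceLayer's actual code: the `Engine` enum, `pick_engine`, and its three boolean inputs are hypothetical names, and treating "no audio output" as the first check is an assumption.

```python
from enum import Enum


class Engine(Enum):
    QWEN3_TTS = "qwen3-tts"
    EDGE_TTS = "edge-tts"
    TEXT_ONLY = "text-only"


def pick_engine(daemon_running: bool, profile_exists: bool, audio_available: bool) -> Engine:
    """Return the first usable tier, checked in priority order."""
    if not audio_available:
        # Assumption: without any audio output there is nothing to synthesize for.
        return Engine.TEXT_ONLY
    if daemon_running and profile_exists:
        # Tier 1 needs both the daemon and a voice profile.
        return Engine.QWEN3_TTS
    # Tier 2 is the default whenever the cloned voice is not usable.
    return Engine.EDGE_TTS
```

For example, `pick_engine(False, True, True)` returns `Engine.EDGE_TTS`, because a profile alone is not enough when the daemon is down.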

Prerequisites

# Required
brew install yt-dlp ffmpeg
pip3 install silero-vad torch soundfile

# Optional (better quality)
pip3 install demucs            # Source separation (removes background music)
pip3 install pyannote.audio    # Speaker diarization (multi-speaker videos)

# TTS daemon
pip3 install mlx-audio fastapi uvicorn

Step 1: Extract Voice Samples

Pull clean voice clips from YouTube videos:

voicelayer extract \
  --source "https://youtube.com/@channel" \
  --name "speaker-name" \
  --count 20

The extraction pipeline:

1. Downloads audio via yt-dlp (WAV 48kHz)
2. Optionally runs Demucs source separation (--demucs)
3. Segments speech using Silero VAD
4. Optionally diarizes speakers (--diarize)
5. Normalizes audio (highpass 80Hz, loudnorm -16 LUFS)
6. Saves clips to ~/.voicelayer/voices/{name}/samples/
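
To make the normalization step concrete, the sketch below builds an FFmpeg command with the `highpass` and `loudnorm` filters set to the values listed above. The helper names (`build_normalize_cmd`, `normalize_clip`) are hypothetical, and pinning the output to 48 kHz is an assumption made to match the download format. VoiceLayer's real invocation may differ.

```python
import subprocess
from pathlib import Path


def build_normalize_cmd(src: Path, dst: Path, highpass_hz: int = 80,
                        target_lufs: float = -16.0) -> list[str]:
    """Build an FFmpeg command that high-passes and loudness-normalizes one clip."""
    # highpass and loudnorm are standard FFmpeg audio filters;
    # loudnorm's I option sets the integrated loudness target in LUFS.
    filters = f"highpass=f={highpass_hz},loudnorm=I={target_lufs}"
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(src),
        "-af", filters,
        "-ar", "48000",        # assumption: keep the 48 kHz rate of the download
        str(dst),
    ]


def normalize_clip(src: Path, dst: Path) -> None:
    """Run the command; raises CalledProcessError if FFmpeg exits non-zero."""
    subprocess.run(build_normalize_cmd(src, dst), check=True)
```

Keeping command construction separate from execution makes it easy to print or log the exact FFmpeg call before running it on a batch of clips.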

Options

| Flag | Default | Description |
| --- | --- | --- |
| `--source` | required | YouTube URL (channel, video, or playlist) |
| `--name` | required | Voice profile name |
| `--count` | `20` | Number of clips to extract |
| `--demucs` | off | Enable source separation |
| `--diarize` | off | Enable speaker diarization |

Step 2: Build a Voice Profile

Select the best reference clips and generate a profile:

voicelayer clone --name "speaker-name"

This command:

1. Reads all samples from ~/.voicelayer/voices/{name}/samples/
2. Analyzes audio quality (duration, RMS level, SNR estimate)
3. Selects the best 3 clips (~18.5s total; Qwen3-TTS sweet spot: 3-30s)
4. Generates transcripts via whisper.cpp
5. Writes ~/.voicelayer/voices/{name}/profile.yaml
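
As a rough illustration of the quality analysis and selection steps, the sketch below ranks clips by a simple score and keeps the top three. The scoring rule, the `ClipStats` fields, the quiet-clip threshold, and applying the 3-30s window per clip are all assumptions made for the example. The actual heuristics used by `voicelayer clone` are not documented here.

```python
import math
from dataclasses import dataclass


@dataclass
class ClipStats:
    path: str
    duration_s: float
    rms: float      # linear RMS level, 0.0-1.0
    snr_db: float   # estimated signal-to-noise ratio in dB


def rms_level(samples: list[float]) -> float:
    """Root-mean-square level of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def score(clip: ClipStats) -> float:
    """Hypothetical quality score: favor high SNR, penalize very quiet clips."""
    quiet_penalty = 10.0 if clip.rms < 0.02 else 0.0
    return clip.snr_db - quiet_penalty


def select_reference_clips(clips: list[ClipStats], count: int = 3,
                           min_s: float = 3.0, max_s: float = 30.0) -> list[ClipStats]:
    """Keep clips inside the duration window, then take the top `count` by score."""
    usable = [c for c in clips if min_s <= c.duration_s <= max_s]
    return sorted(usable, key=score, reverse=True)[:count]
```

A clip that is too short is dropped before scoring, so a 2-second clip with excellent SNR never displaces a usable 6-second one.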

Profile Format

name: speaker-name
engine: qwen3-tts
model_path: ~/.voicelayer/models/qwen3-tts-4bit
reference_clips:
  - path: ~/.voicelayer/voices/speaker-name/samples/clip_007.wav
    text: "Transcribed text of this clip..."
  - path: ~/.voicelayer/voices/speaker-name/samples/clip_012.wav
    text: "Another transcribed clip..."
  - path: ~/.voicelayer/voices/speaker-name/samples/clip_003.wav
    text: "Third reference clip..."
reference_clip: ~/.voicelayer/voices/speaker-name/samples/clip_007.wav
fallback: en-US-GuyNeural
created: "2026-02-27"
source: "https://youtube.com/@channel"
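
A profile like the one above can be sanity-checked after parsing. The sketch below works on an already-parsed dictionary (for example, the result of a YAML loader) and reports missing fields. Which keys are treated as required is an assumption based on the example, and `validate_profile` and `expand_clip_paths` are hypothetical helpers, not part of VoiceLayer.

```python
from pathlib import Path

# Assumption: these keys are needed for synthesis; "created" and "source" are informational.
REQUIRED_KEYS = ("name", "engine", "model_path", "reference_clips", "fallback")


def validate_profile(profile: dict) -> list[str]:
    """Return a list of problems; an empty list means the profile looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in profile]
    for i, clip in enumerate(profile.get("reference_clips", [])):
        if not clip.get("path"):
            problems.append(f"reference_clips[{i}] has no path")
        if not clip.get("text"):
            problems.append(f"reference_clips[{i}] has no text")
    return problems


def expand_clip_paths(profile: dict) -> list[Path]:
    """Expand the leading ~ in each clip path, since YAML stores it as a literal string."""
    return [Path(clip["path"]).expanduser() for clip in profile.get("reference_clips", [])]
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in a hand-edited profile at once.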

Step 3: Run the TTS Daemon

Start the Qwen3-TTS daemon for runtime synthesis:

voicelayer daemon --port 8880

The daemon:

- Loads the Qwen3-TTS 4-bit quantized model into Metal/MPS memory
- Serves a /synthesize HTTP endpoint (plus /health for status checks)
- Has an inference latency of 200-500ms per call on Apple Silicon
- Reads model weights from ~/.voicelayer/models/qwen3-tts-4bit/

Testing

# Health check
curl http://127.0.0.1:8880/health

# Synthesize
curl -X POST http://127.0.0.1:8880/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "speaker-name"}'
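
The same synthesize call can be made from Python using only the standard library. This sketch mirrors the curl request above. The response format is not specified in this guide, so `synthesize` simply returns the raw body, and both function names are hypothetical.

```python
import json
import urllib.request


def build_synthesize_request(text: str, voice: str,
                             base_url: str = "http://127.0.0.1:8880") -> urllib.request.Request:
    """Build the same POST /synthesize request as the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice: str, timeout_s: float = 30.0) -> bytes:
    """Send the request and return the raw response body (format not documented here)."""
    request = build_synthesize_request(text, voice)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.read()
```

With the daemon running, `synthesize("Hello world", "speaker-name")` sends the same payload as the curl command.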

Using Cloned Voices

Once the daemon is running and a profile exists, VoiceLayer automatically routes through Qwen3-TTS:

voice_speak("Your cloned voice speaks this text")

If the daemon is unavailable, VoiceLayer falls back to edge-tts using the fallback voice from the profile.
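
One way to picture this fallback is a health probe followed by a voice choice. The sketch below assumes the `/health` endpoint answers with HTTP 200 when the daemon is ready. The helpers `daemon_available` and `voice_for` are hypothetical and only illustrate the decision, not VoiceLayer's internal logic.

```python
import http.client
import urllib.request


def daemon_available(base_url: str = "http://127.0.0.1:8880", timeout_s: float = 0.5) -> bool:
    """Return True if GET /health answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or malformed reply:
        # treat the daemon as down.
        return False


def voice_for(profile: dict, base_url: str = "http://127.0.0.1:8880") -> tuple[str, str]:
    """Pick (engine, voice): the cloned voice if the daemon is up, else the profile's fallback."""
    if daemon_available(base_url):
        return ("qwen3-tts", profile["name"])
    return ("edge-tts", profile["fallback"])
```

A short timeout keeps the probe from delaying speech noticeably when the daemon is not running.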

File Locations

| Path | Purpose |
| --- | --- |
| `~/.voicelayer/voices/{name}/samples/` | Extracted voice clips |
| `~/.voicelayer/voices/{name}/profile.yaml` | Voice profile |
| `~/.voicelayer/models/qwen3-tts-4bit/` | Quantized model weights |