# Chatterbox Voice Clone Kit
### Clone Any Voice — End to End

Give it a YouTube URL. It downloads the audio, finds the cleanest solo speech, extracts a reference clip, and generates cloned speech. Fully automated.

---

## For AI Agents / Automated Setup

If you are an AI agent helping set up this kit, follow these steps exactly in order.

### Step 1: Environment
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
This installs Chatterbox TTS and all of its dependencies. The first run automatically downloads ~1.5GB of model weights from HuggingFace.

You also need these system tools (you may already have them installed):
```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg yt-dlp

# pip fallback for yt-dlp (any platform)
pip install yt-dlp
```

### Step 2: Set API Key (for transcription during clip extraction)
```bash
export OPENAI_API_KEY=sk-your-key-here
```
Only needed for the extraction step (Whisper transcription to find clean speech). Generation itself needs no API key.
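
Before running extraction, a quick pre-flight check can confirm that the tools from Step 1 and the API key are in place. The sketch below is illustrative and not part of the kit: `missing_prereqs` is a hypothetical helper name, and it only checks that the executables are on `PATH` and that the variable is non-empty.

```python
import os
import shutil


def missing_prereqs(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # ffmpeg and yt-dlp are the system tools named in Step 1.
    for tool in ("ffmpeg", "yt-dlp"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # The key is only used by the extraction step, not by generation.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (needed for extraction only)")
    return problems
```

Calling `missing_prereqs()` with no arguments inspects the real environment. The `env` and `which` parameters exist only so the check can be exercised without touching the system.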

### Step 3: Extract a Reference Clip from YouTube
```bash
python extract_reference.py --url "https://youtube.com/watch?v=VIDEO_ID" --name voice_name
```

This will:
1. Download the full audio
2. Sample 8 segments across the video
3. Transcribe each with Whisper
4. Score them for solo speech quality (penalizes music, multiple speakers, silence)
5. Save the best one to `voices/voice_name_reference.wav`
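
The sample-then-rank idea in the steps above can be sketched as follows. This is a minimal illustration, not the code in `extract_reference.py`: the 15-second clip length, the function names, and the shape of the scores are all assumptions, and the real script may sample and score differently.

```python
def sample_offsets(duration_s, n_segments=8, clip_s=15.0):
    """Return evenly spaced (start, end) windows, in seconds, across the audio."""
    if duration_s <= clip_s:
        # Audio shorter than one clip: use all of it.
        return [(0.0, float(duration_s))]
    # Spread window starts evenly over the range where a full clip still fits.
    usable = duration_s - clip_s
    step = usable / (n_segments - 1) if n_segments > 1 else 0.0
    return [(i * step, i * step + clip_s) for i in range(n_segments)]


def pick_best(windows, scores, top=1):
    """Pair each window with its score and return the highest-scoring ones."""
    ranked = sorted(zip(scores, windows), reverse=True)
    return [window for _, window in ranked[:top]]
```

Here `scores` stands in for whatever quality score the transcription step produces, with higher meaning cleaner solo speech. Passing `top=3` mirrors the idea of keeping several candidates instead of one.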

To get multiple candidates to pick from:
```bash
python extract_reference.py --url "https://..." --name voice_name --top 3
```

You can also point it at a local file:
```bash
python extract_reference.py --file podcast_episode.mp3 --name voice_name
```

### Step 4: Generate Speech
```bash
# Single line
python gen_voice.py --voice voices/voice_name_reference.wav "Your text here."

# From a text file
python gen_voice.py --voice voices/voice_name_reference.wav --file script.txt

# Long text — use --chunk to process one sentence at a time (saves VRAM)
python gen_voice.py --voice voices/voice_name_reference.wav --file script.txt --chunk
```

Output lands in `output/` as WAV (24kHz mono).
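
To show what "one sentence at a time" means in practice, here is a rough sketch of a sentence splitter. It is an assumption about the general approach, not the splitter `gen_voice.py` actually uses, and `split_sentences` is a hypothetical name.

```python
import re


def split_sentences(text):
    """Split text into sentences on ., !, or ? followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each returned sentence would be generated separately and the audio joined afterwards, which keeps peak memory tied to the longest sentence instead of the whole script. A splitter this simple will also break after abbreviations such as "Dr.", so scripts containing them may need rewording.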

### That Is the Entire Flow
```
YouTube URL → extract_reference.py → reference clip → gen_voice.py → cloned speech
```
Two commands. No manual trimming. No audio editing. No guesswork.

---

## Common Errors and Fixes

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Text too long for a single generation pass | Add `--chunk` flag |
| `RuntimeError: No CUDA GPUs` | No NVIDIA GPU | Use an NVIDIA GPU with CUDA. CPU works but is 10-20x slower |
| `FileNotFoundError` on voice | Wrong path | Check `--voice` path points to actual file in `voices/` |
| `HTTP 401` downloading model | HuggingFace rejected the request (missing or invalid token) | `export HF_TOKEN=your_token` and retry |
| Audio sounds wrong | Bad reference clip | Run `extract_reference.py` with `--top 3` and try each |
| Words mispronounced | TTS does not know word | Spell phonetically in input ("Neefai" not "Nephi") |
| `yt-dlp` not found | Not installed | `pip install yt-dlp` |
| `ffmpeg` not found | Not installed | `sudo apt install ffmpeg` or `brew install ffmpeg` |

## Architecture Notes for Integration

- **Model**: Chatterbox Turbo (350M params) by Resemble AI — Apache 2.0 license
- **Input**: Plain text + path to reference WAV/MP3
- **Output**: WAV (24kHz, mono, float32)
- **VRAM**: ~4GB loaded, ~6-8GB during generation
- **Speed**: ~15-30s per sentence on RTX 3060 Ti, ~5-10s on RTX 4090
- **Python API** (three calls, plus imports):

```python
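from chatterbox.tts_turbo import ChatterboxTurboTTS
import torchaudio

# Load the model onto the GPU (weights download on first use)
model = ChatterboxTurboTTS.from_pretrained(device="cuda")
# Generate speech in the voice of the reference clip
wav = model.generate("Your text.", audio_prompt_path="voices/reference.wav")
# Save at the model's own sample rate
torchaudio.save("output.wav", wav, model.sr)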
```
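
The per-sentence speed figures above make it easy to budget time for a full script. The helper below is a back-of-the-envelope sketch that assumes generation time scales linearly with sentence count; `estimate_generation_time` is a hypothetical name, and the default range is the RTX 3060 Ti figure quoted above.

```python
def estimate_generation_time(n_sentences, seconds_per_sentence=(15, 30)):
    """Return (low, high) total seconds for a script of n_sentences."""
    low, high = seconds_per_sentence
    return n_sentences * low, n_sentences * high
```

For example, a 40-sentence script comes out at 600 to 1200 seconds (10 to 20 minutes) with the defaults, or 200 to 400 seconds when passing `(5, 10)` for the RTX 4090 figure.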

---

## What's Inside

```
chatterbox-voice-kit/
├── README.md               ← You're here
├── extract_reference.py    ← YouTube URL → best reference clip (automated)
├── gen_voice.py            ← Reference clip + text → cloned speech
├── requirements.txt        ← Python dependencies
├── voices/                 ← Reference clips land here
└── output/                 ← Generated audio lands here
```

## Multiple Voices

Clone as many voices as you want. Each one is just a reference clip:

```bash
# Extract different voices
python extract_reference.py --url "https://..." --name morgan_freeman
python extract_reference.py --url "https://..." --name jack_mccoy
python extract_reference.py --file my_recording.wav --name my_voice

# Generate with any of them
python gen_voice.py --voice voices/morgan_freeman_reference.wav "Text here"
python gen_voice.py --voice voices/jack_mccoy_reference.wav "Objection."
python gen_voice.py --voice voices/my_voice_reference.wav "Clone of me."

# List all available voices
python gen_voice.py --list
```
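
Because every voice is a file named `<name>_reference.wav` in `voices/`, listing voices from your own Python code amounts to a directory scan. The sketch below assumes that naming convention holds for every clip; it is not the implementation behind `--list`, and `list_voices` is a hypothetical name.

```python
from pathlib import Path


def list_voices(voices_dir="voices"):
    """Return voice names derived from *_reference.wav files, sorted."""
    suffix = "_reference.wav"
    return sorted(
        # Strip the suffix so "my_voice_reference.wav" becomes "my_voice".
        p.name[: -len(suffix)]
        for p in Path(voices_dir).glob(f"*{suffix}")
    )
```

Files in `voices/` that do not end in `_reference.wav` are ignored.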

## Tips for Best Results

- **Reference clip quality is everything**: Clean solo speech > everything else
- **Sentence length matters**: Under 200 characters per sentence for best output
- **Punctuation controls pacing**: Periods = full stop. Commas = pause. Ellipses = long pause.
- **First person sounds natural**: "I believe this" clones better than "One might consider"
- **Test short first**: Try one sentence before generating a full script
- **One clip per voice**: You only need ONE good reference. Use it for everything.
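
The 200-character guideline can be checked before generating anything. The following is a minimal sketch, not part of the kit: `long_sentences` is a hypothetical helper, and it splits sentences naively on end punctuation followed by whitespace.

```python
import re


def long_sentences(text, limit=200):
    """Return (length, sentence) pairs for sentences longer than limit characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(len(s), s) for s in sentences if len(s) > limit]
```

Running it over a script and rewording or splitting anything it flags is a cheap step to take before a long generation run.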

---

Powered by [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) (open source, Apache 2.0)

Built by TRED Technologies
