You No Longer Need ElevenLabs

For years, professional voice cloning came with three conditions attached: pay a monthly fee, upload your audio to someone else's server, and trust the platform not to abuse it. ElevenLabs became the default because it was genuinely good. The voice quality was hard to argue with, and the interface was clean enough that non-technical users could get results in minutes.

That was the moat. Not the technology itself. Convenience wrapped in a subscription.

Voicebox just removed the wrapper.

WHAT VOICEBOX SHIPS WITH
  • 🎙️ Instant Voice Cloning: Upload or record a few seconds of audio. Get a near-accurate voice clone immediately, no setup required.
  • 🎛️ Multi-Track DAW Editor: Arrange multiple voice clips on a timeline, layer different voices, and fine-tune timing like a pro audio tool.
  • 🖥️ System Audio Capture: Record directly from videos, meetings, or streams playing on your machine and clone voices on the spot.
  • 📝 Whisper Transcription: Auto-transcribes audio using OpenAI's Whisper model, running fully offline alongside the synthesis engine.
  • Voice Prompt Caching: Smart caching lets you regenerate speech segments instantly without reprocessing the entire voice profile.
  • 🔓 100% MIT Licensed: Fully open source. Free to use commercially, modify, fork, and build on. No usage restrictions, ever.

📦 v0.1.12 | ⭐ 3,000+ GitHub Stars | ⬇️ 60,000+ Downloads | macOS & Windows

What Voicebox Actually Does

Voicebox is a desktop application for voice cloning and speech synthesis. You feed it a few seconds of recorded audio, and it produces a near-accurate voice clone using the Qwen3-TTS model from Alibaba's Qwen team. The whole process happens on your local hardware. Nothing is sent to a server. Nothing is queued in the cloud.

The multi-track timeline editor is the part most people do not expect. You can arrange voice clips across a timeline, layer different cloned voices, and export finished audio from inside the same interface. It functions like a lightweight DAW, which makes Voicebox useful not just for generating single lines of speech, but for producing entire multi-speaker conversations or full narration scripts.

The system audio capture feature rounds this out. You can grab audio directly from anything playing on your machine, including videos, recordings, and conference calls, and clone that voice without an extra recording step. Alongside that, the built-in Whisper transcription runs locally and converts audio to editable text, so you can revise and regenerate specific lines without re-recording everything.
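The voice prompt caching mentioned in the feature list is easy to picture in code. The sketch below is not Voicebox's implementation: `VoicePromptCache` and `fake_extract` are hypothetical names, and the stand-in extractor only represents whatever expensive analysis the real engine runs on a reference clip. It shows the general pattern of keying the analysed voice by a hash of the audio, so that revising and regenerating a single line can reuse earlier work.

```python
import hashlib
from typing import Callable, Dict


class VoicePromptCache:
    """Cache an expensive per-voice analysis step, keyed by the audio's hash.

    Hypothetical illustration only; not taken from the Voicebox codebase.
    """

    def __init__(self, extract: Callable[[bytes], object]) -> None:
        self._extract = extract            # the expensive step to avoid repeating
        self._store: Dict[str, object] = {}
        self.misses = 0                    # how often the expensive step actually ran

    def get(self, reference_audio: bytes) -> object:
        # Identical audio bytes always produce the same key.
        key = hashlib.sha256(reference_audio).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract(reference_audio)
        return self._store[key]


def fake_extract(audio: bytes) -> dict:
    """Stand-in for the real voice analysis (assumption for this sketch)."""
    return {"num_bytes": len(audio)}


cache = VoicePromptCache(fake_extract)
sample = b"\x00\x01" * 1000      # placeholder for a short reference clip

first = cache.get(sample)        # runs the extractor
second = cache.get(sample)       # served from the cache
print(cache.misses)              # 1
```

Regenerating ten edited lines against the same reference clip would call the extractor once under this scheme, which is the behaviour the feature description implies.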

One technical detail worth noting: Voicebox is built with Tauri and a Rust backend rather than the Electron framework that many desktop AI apps use. Tauri renders the interface through the operating system's built-in webview instead of bundling its own copy of Chromium, which is why the app can be up to ten times smaller and noticeably more responsive on consumer hardware, including mid-range laptops.

The Model That Powers It

QWEN3-TTS AT A GLANCE
Mean Opinion Score (MOS): > 4.3 / 5.0
Synthesis Speed: Up to 5× faster than real-time
Language Support: 10 languages
End-to-end Latency: As low as 97ms
License: Open Source (Apache 2.0)

Qwen3-TTS was released by Alibaba in January 2026 and it is the engine that makes Voicebox viable as a professional tool. A Mean Opinion Score above 4.3 out of 5 places it firmly in broadcast-quality territory. Most open-source voice models have struggled to break 3.8 in controlled listening tests. Qwen3-TTS does it cleanly.

The model handles long passages without losing prosody, manages out-of-vocabulary words reliably, and can mimic varied speaking styles with the kind of accuracy that previously required fine-tuned proprietary models. It supports ten languages and produces the first audio packet within 97ms of receiving input, which means it is usable in real-time applications as well as batch generation workflows.
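To put the speed figure in practical terms, here is a small back-of-the-envelope calculation. It assumes the "up to 5× faster than real-time" number holds for an entire job, which is a best case; the 0.5× figure for slower hardware is purely an illustrative assumption, not a measurement. `render_seconds` is a helper name made up for this example.

```python
def render_seconds(audio_seconds: float, speedup: float) -> float:
    """Seconds needed to synthesize `audio_seconds` of speech.

    `speedup` is the real-time multiple: 5.0 means five seconds of audio
    are produced per second of compute, 0.5 means half a second.
    """
    if audio_seconds < 0 or speedup <= 0:
        raise ValueError("audio_seconds must be >= 0 and speedup must be > 0")
    return audio_seconds / speedup


narration = 30 * 60  # a 30-minute narration, in seconds

best_case = render_seconds(narration, 5.0)   # quoted best-case speed
slow_case = render_seconds(narration, 0.5)   # assumed slower machine

print(f"best case: {best_case / 60:.0f} min")   # best case: 6 min
print(f"slow case: {slow_case / 60:.0f} min")   # slow case: 60 min
```

The 97ms latency figure is a separate matter: it describes how quickly the first audio arrives, not how long a whole passage takes to render.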

The Real Cost of ElevenLabs

ELEVENLABS PRICING (2026)
Plan       Price/mo   Characters/mo
Free       $0         10,000
Starter    $5         30,000
Creator    $22        100,000
Pro        $99        500,000
Scale      $330       2,000,000
Business   $1,320     11,000,000
⚡ Voicebox: $0/month. Unlimited characters. Zero quotas.

ElevenLabs' free tier allows 10,000 characters per month, which translates to roughly ten minutes of finished audio at a typical narration pace. The Creator plan at $22 per month gives you 100,000 characters. Once you are publishing multiple pieces of content weekly, running a podcast, or building anything voice-heavy, you hit the Pro tier at $99 or the Scale tier at $330 without much effort.

Those numbers are not unreasonable for what ElevenLabs offers. But they represent a cost that now has a direct, quality-comparable alternative at zero dollars. Voicebox runs on your local GPU or CPU, and the only ceiling on how much audio you produce is how long you are willing to wait for your hardware to render it.
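The tier arithmetic can be checked with a few lines of code. This sketch uses only the prices and quotas from the table above and makes one simplifying assumption: you buy the smallest plan whose quota covers your monthly volume, with no overage or pay-as-you-go billing. `cheapest_plan` and `cost_per_1k_chars` are names invented for the example.

```python
from typing import Optional, Tuple

# (plan, price in USD per month, characters per month), from the table above
PLANS = [
    ("Free", 0, 10_000),
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
    ("Scale", 330, 2_000_000),
    ("Business", 1320, 11_000_000),
]


def cheapest_plan(chars_per_month: int) -> Optional[Tuple[str, int]]:
    """Return (plan, price) for the smallest quota that fits, or None."""
    for name, price, quota in PLANS:   # PLANS is ordered by quota
        if chars_per_month <= quota:
            return name, price
    return None


def cost_per_1k_chars(price: int, quota: int) -> float:
    """Effective cost per 1,000 characters if the full quota is used."""
    return price * 1000 / quota


# Example: four podcast episodes a month at an assumed 40,000 characters each.
monthly_chars = 4 * 40_000
print(cheapest_plan(monthly_chars))                  # ('Pro', 99)
print(round(cost_per_1k_chars(22, 100_000), 2))      # 0.22
```

At that assumed volume the Creator quota is already too small, which is the jump to the Pro tier described above.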

Why Privacy Is Not a Small Point

Voice data is biometric. It is as personally identifying as a fingerprint, and unlike a password, you cannot change it. When you upload voice samples to any cloud platform, you are transferring that data to servers you do not control, under terms of service that most people do not read carefully.

Many platforms include clauses about using submitted content to improve their models, and few are explicit about how long voice samples are retained or how they are protected. For journalists recording sensitive sources, legal professionals handling confidential calls, or anyone building products around real clients' voices, this is not an abstract concern.

Voicebox keeps everything on the machine it runs on. The voice clone stays local. Generated audio stays local. There is no data in transit and no external server involved at any point in the process.

Who Should Be Using This Right Now

BEST FIT USE CASES
  • 🎙️ Podcast producers — Generate filler takes, intros, and ad reads without re-recording sessions
  • 🛠️ Developers — Build voice-powered apps locally before committing to a cloud API dependency
  • 📧 Newsletter creators — Produce audio versions of issues for subscriber retention and accessibility
  • 🎓 Course creators — Record narration tracks and revise individual lines without full re-records
  • 🔒 Privacy-conscious teams — Handle voice workflows involving sensitive or confidential audio locally

Voicebox is available at voicebox.sh for macOS and Windows now, with Linux support listed on the public roadmap. On first launch, the app walks you through downloading the Qwen3-TTS and Whisper models to your machine. After that, you drag in an audio clip, type your script, and the speech generates locally. The project is MIT licensed and actively maintained, and a developer API for third-party integrations is confirmed on the roadmap.

The pattern here is consistent. Open-source tooling closes the gap on proprietary platforms faster than expected, then surpasses them on certain dimensions entirely. It happened with image generation. It happened with language models. It is happening with voice now.

The question is not whether tools like Voicebox will become the standard for local voice work. They already are for the people paying attention. The question is how quickly the rest of the market catches up.

If you want a weekly breakdown of tools like this before they go mainstream, our newsletter covers exactly this kind of shift, without the noise.

Stay sharp,
Better Every Day

📬 Building something unique? Hit reply. I'm tracking tools and approaches for a future breakdown.
