The face model is a new approach: it uses a multimodal LLM to generate Face-Motion tokens and is designed to run in real time at low cost. The LLM has semantic understanding (part of why Orpheus was so good), which is why the face model can generate, and will keep getting better at generating, complex expressions that show a deeper understanding of the speech and text it is saying. We also generate all of the speech through a fine-tuned Orpheus; however, the face actor we fine-tuned on did not have a good mic and was not that strong at English, so the audio sounds bad. We will find a native-speaking actor, collect data from them, and the voice will be good. Owning the entire generation stack, face plus voice, at SOTA level is a very big advantage, and there are a lot of synergies we can exploit.
[happy] I grabbed a coffee this morning and it actually made the whole commute feel less stressful. <laugh>
[sad] By the time I finished the laundry, I realized I'd been humming the same song on repeat for an hour.
[happy] I almost missed the bus because I couldn't find my headphones under the couch, and it felt ridiculous. <laugh>
[sad] I tried calling a friend tonight, but they didn't pick up and it left me feeling a bit low.
[happy] I spent half the afternoon reorganizing my desk, and for once it actually stayed tidy.
[sad] I watched the program crash again at 3 a.m., <sigh> and wondered if I'd ever see it succeed.
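The demo lines above follow a simple convention: a leading [emotion] tag plus inline paralinguistic cues like <laugh> or <sigh>. As an illustration only, the parser below is a hypothetical sketch (not part of any released tooling) of how such prompts could be split into their parts:

```python
import re

# Hypothetical helper: split a demo prompt of the form
# "[emotion] text with <paralinguistic> cues" into its pieces.
TAG_RE = re.compile(r"^\[(?P<emotion>\w+)\]\s*(?P<body>.*)$")
CUE_RE = re.compile(r"<(\w+(?:\s\w+)*)>")

def parse_prompt(prompt: str):
    m = TAG_RE.match(prompt.strip())
    if m is None:
        raise ValueError("prompt must start with an [emotion] tag")
    body = m.group("body")
    cues = CUE_RE.findall(body)          # e.g. ["laugh"] or ["sigh"]
    text = CUE_RE.sub("", body).strip()  # body with the cues stripped out
    return m.group("emotion"), cues, text

emotion, cues, text = parse_prompt(
    "[happy] I grabbed a coffee this morning. <laugh>"
)
```

Running it on the first demo line yields the emotion "happy", the cue list ["laugh"], and the plain sentence text.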
Orpheus Nano is a 135M-parameter LLM backbone (over 20x smaller than most other models) plus a 50M-parameter decoder. We have developed the first model of this size that can behave like a bigger model: different voices, customisable through fine-tuning, emotional tags, stable and expressive, and so on. The hope is that it can run much, much cheaper, making it a very compelling option for very high-volume voice applications and ultimately changing the economics of TTS models.
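A quick back-of-the-envelope check of the size claim; the 3B comparator is purely an assumption for illustration, since "most other models" is not pinned down above:

```python
# Back-of-the-envelope check of the "over 20x smaller" claim.
backbone = 135_000_000      # Orpheus Nano LLM backbone parameters
decoder = 50_000_000        # decoder parameters
total = backbone + decoder  # 185M parameters end to end

comparator = 3_000_000_000  # ASSUMED typical TTS backbone size (~3B)
ratio = comparator / backbone

print(f"total Orpheus Nano params: {total:,}")
print(f"backbone is {ratio:.1f}x smaller than an assumed 3B model")
```

Against a 3B comparator, the 135M backbone comes out roughly 22x smaller, consistent with the "over 20x" figure.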
Hey there, my name is Ashley, <starts laughing> and I'm a speech generation model that can sound like a person.
I've also been taught to understand and produce paralinguistic things like sighing, or chuckling, or yawning.
I live in San Francisco, and have, barely any parameters. I'm really, REALLY small!
You can, uh, increase the generation parameters like temperature, or top P to make me sound even more or less expressive.
One of my main strengths is that I can handle alphanumeric strings like 1-F-4-G-J-Z-7-0-3-B, really well.
I can pronounce complex words like, I flew to Reykjavik, went to Timbuktu, and stopped at Ljubljana, seven days ago.
These are just prototype weights so over the next couple of weeks before launch, I'll be tweaked and made even better.
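One of the demo lines mentions raising temperature or top-p to make the voice more or less expressive. As a minimal illustrative sketch (not the model's actual sampler), these parameters typically reshape the next-token distribution like this:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Minimal temperature + top-p (nucleus) sampling sketch.

    Higher temperature flattens the distribution (more varied output);
    lower top_p restricts sampling to the smallest set of tokens whose
    cumulative probability reaches top_p (more conservative output).
    """
    rng = rng or random.Random()
    # Temperature: scale logits before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the most probable tokens until mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalise over the kept nucleus and draw a sample.
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

token = sample_next_token([2.0, 1.0, 0.1], temperature=0.8, top_p=0.95,
                          rng=random.Random(0))
```

With a very small top_p the nucleus collapses to the single most probable token, which is why low settings sound flatter and more deterministic.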