PersonaPlex-7B: full-duplex voice model that listens and talks at the same time
- #conversational AI
- #real-time
- #speech-to-speech
- PersonaPlex is a real-time speech-to-speech conversational model that performs streaming speech understanding and generation.
- It operates on continuous audio encoded with a neural codec, predicting both text and audio tokens autoregressively.
- Supports natural conversational dynamics such as interruptions, barge-ins, overlapping speech, and rapid turn-taking.
- Runs in a dual-stream configuration allowing concurrent listening and speaking.
- Conditioned on voice and text prompts to define vocal characteristics, speaking style, and persona attributes.
- Ready for commercial use and global deployment.
- Architecture: a 7B-parameter Transformer based on Moshi.
- Inputs: text prompts and audio (24 kHz sample rate); outputs: text and audio responses.
- Trained on the Fisher English corpus (under 10,000 hours of audio) and evaluated on the FullDuplexBench benchmark.
- Outperforms comparable conversational AI systems on conversational dynamics, response latency, and task adherence.
- Ethical considerations include bias, explainability, safety, security, and privacy.
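The dual-stream, autoregressive design above can be illustrated with a toy loop. This is a conceptual sketch only, not the real PersonaPlex API: the hypothetical `toy_model_step` stands in for one autoregressive step that consumes an incoming codec token while simultaneously emitting a text token and an outgoing audio token, so listening and speaking proceed concurrently.

```python
# Toy illustration of full-duplex token interleaving. All names here are
# hypothetical stand-ins; the real model predicts codec and text tokens
# with a 7B Transformer, not string tags.

def toy_model_step(incoming_token, state):
    """One pretend autoregressive step: ingest a user audio token,
    emit a (text_token, audio_token) pair, and update model state."""
    state = state + [incoming_token]
    text_token = f"txt:{incoming_token}"
    audio_token = f"aud:{incoming_token}"
    return text_token, audio_token, state

def full_duplex_loop(incoming_stream):
    state = []
    out_text, out_audio = [], []
    for tok in incoming_stream:      # user audio arrives frame by frame
        t, a, state = toy_model_step(tok, state)
        out_text.append(t)           # streamed text output
        out_audio.append(a)          # streamed speech output
    return out_text, out_audio

text, audio = full_duplex_loop([0, 1, 2])
```

Because input consumption and output emission happen in the same step, there is no explicit turn boundary: the model can keep "hearing" new tokens (e.g. a barge-in) while it is mid-response.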
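Since the model expects 24 kHz input audio, capture at other rates (commonly 16 kHz or 44.1 kHz) must be resampled first. A minimal sketch using plain linear interpolation, assuming samples are a flat list of floats; a real pipeline would use a proper band-limited resampler such as `torchaudio` or `soxr`:

```python
# Minimal linear-interpolation resampler (illustrative only; linear
# interpolation aliases and is not production quality).

def resample_linear(samples, src_rate, dst_rate):
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

audio_16k = [0.0, 1.0, 0.0, -1.0] * 4     # toy 16 kHz signal
audio_24k = resample_linear(audio_16k, 16_000, 24_000)
```

Here 16 input samples at 16 kHz become 24 samples at 24 kHz, matching the model's expected input rate.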