A single sentence is enough for free generation, Alibaba Tongyi Speech Dual Model Fun-CosyVoice3.5 and Fun-AudioGen-VD released
Tech Home, March 2: Alibaba Tongyi Lab's speech team today released two models supporting FreeStyle instruction-driven generation: Fun-CosyVoice3.5 and Fun-AudioGen-VD.
According to the announcement, both fine-grained control of vocal expression and the design of voices and scenes from scratch can be driven directly by natural-language instructions. Both models support natural-language control of speech synthesis, but their applications differ:
Fun-CosyVoice3.5: Multilingual replication + Fine-grained expression control
Fun-AudioGen-VD: Sound design + Scene-based audio generation
Fun-CosyVoice3.5
Fun-CosyVoice3.5 strengthens its Instruct-TTS capability with FreeStyle instruction control: a single sentence of natural language is enough to freely direct speech synthesis.
Users can describe the desired expressive style directly in natural language, for example: “Make the tone more assertive,” “Slightly lower the pitch and slow down the speech,” or “Add some emotional variation.” The model understands the instruction and generates the corresponding expression.
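To illustrate how such an instruct-TTS request might be structured, the sketch below keeps the natural-language style instruction separate from the text to be spoken. Note that the endpoint shape, field names, and model identifier here are assumptions for illustration only, not Alibaba's documented API:

```python
# Hypothetical instruct-TTS request payload. The field names and the
# model identifier are illustrative assumptions, not a documented API.
def build_instruct_tts_request(text, instruction, voice="default"):
    """Separate what the voice says (text) from how it should sound (instruction)."""
    return {
        "model": "fun-cosyvoice-3.5",    # assumed model identifier
        "input": {
            "text": text,                # content to be spoken
            "instruction": instruction,  # FreeStyle natural-language style command
        },
        "parameters": {"voice": voice, "format": "wav"},
    }

req = build_instruct_tts_request(
    "Welcome back, everyone.",
    "Slightly lower the pitch and slow down the speech.",
)
```

Keeping the instruction in its own field mirrors the article's distinction between the synthesized content and the free-form description of its delivery.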
Fun-CosyVoice3.5 adds support for Thai, Indonesian, Portuguese, and Vietnamese, and maintains industry-leading objective metrics such as WER and speaker similarity (SpkSim) across 13 languages.
Targeted optimizations for rare characters and complex, easily misread sentences have cut the rare-character mispronunciation rate from 15.2% to 5.3%, and made the reading of complex and long texts more stable and fluent.
Additionally, Fun-CosyVoice3.5 has been fine-tuned with reinforcement learning to improve the listening experience, making the overall sound more natural and the expression more layered. On the performance side, the tokenizer frame rate has been halved and first-packet latency reduced by 35%, enabling faster responses in real-time interactive scenarios.
Fun-AudioGen-VD
Fun-AudioGen-VD generates target voices, emotional expression, and complete auditory scenes from natural-language descriptions, achieving integrated “character + scene” sound generation. Its voice control dimensions include:
Basic attributes: gender, age, accent, pitch, speech rate
Sound quality features: hoarse, clear, deep, magnetic…
Emotional expressions: anger, sadness, excitement, determination…
Role simulation: customer service, veteran, child, AI, announcer…
Complex psychological states: supports nuanced expression (e.g., “calm on the surface but trembling inside”)
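The control dimensions listed above can be combined into a single natural-language voice description. A minimal sketch of that composition (the attribute names come from the list above; the composition logic itself is an assumption, not the model's documented behavior):

```python
# Compose a natural-language voice description from the control dimensions
# listed in the article. The joining logic is an illustrative assumption.
def compose_voice_description(gender=None, age=None, timbre=None,
                              emotion=None, role=None, state=None):
    """Join the non-empty control dimensions into one descriptive prompt."""
    parts = [p for p in (gender, age, timbre, emotion, role, state) if p]
    if not parts:
        return "A neutral voice."
    return "A " + ", ".join(parts) + " voice."

desc = compose_voice_description(
    gender="male", age="middle-aged", timbre="deep, magnetic",
    emotion="determined", role="veteran",
    state="calm on the surface but trembling inside",
)
```

The “complex psychological state” slot shows why free text beats fixed enums here: a nuance like “calm on the surface but trembling inside” does not fit a dropdown.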
Fun-AudioGen-VD can generate not only sounds but also the “world” surrounding the sound, creating immersive auditory scenes.
Background environment sounds: overlay city noise, café background, battlefield rumbles, etc.
Spatial reverberation effects: simulate echoes of cathedrals, metal cells, underwater spaces.
Device listening filters: reproduce the characteristic sound of vintage radio broadcasts, walkie-talkies, breathing masks, and other devices.
Dynamic environmental interaction: supports real-time effects such as wind-noise interruption, echo variation, and hoarseness.
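The “character + scene” layering described above might be expressed in one request as a character voice plus optional scene layers. All field names and the model identifier in this sketch are illustrative assumptions rather than a documented interface:

```python
# Hypothetical "character + scene" request: a voice description plus the
# scene layers named in the article. Field names are assumptions.
def build_scene_request(voice_description, text,
                        background=None, reverb=None, device_filter=None):
    """Bundle the character voice with optional auditory-scene layers."""
    scene = {}
    if background:
        scene["background"] = background        # e.g. café noise, battlefield rumble
    if reverb:
        scene["reverb"] = reverb                # e.g. cathedral, underwater
    if device_filter:
        scene["device_filter"] = device_filter  # e.g. walkie-talkie, vintage radio
    return {
        "model": "fun-audiogen-vd",  # assumed model identifier
        "voice": voice_description,
        "text": text,
        "scene": scene,
    }

req = build_scene_request(
    "a hoarse, weary veteran",
    "Hold the line.",
    background="battlefield rumble",
    device_filter="walkie-talkie",
)
```

Making each scene layer optional matches the article's framing: the background, reverberation, and device filter are independent dimensions that can be stacked or omitted.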
Official API links for these models are provided below:
https://help.aliyun.com/zh/model-studio/text-to-speech?spm=a2c4g.11186623.help-menu-2400256.d_0_3_2_0.d5536a31V2tEJP