VibeVoice-ASR
VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords and over 50 languages.
Model: VibeVoice-ASR-7B
Demo: VibeVoice-ASR-Demo
Report: VibeVoice-ASR-Report
Finetune: finetune-guide
vLLM: vLLM-asr
🔥 Key Features
- 🕒 **60-minute Single-Pass Processing**: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to 60 minutes of continuous audio input within a 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
- 👤 **Customized Hotwords**: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
- 📝 **Rich Transcription (Who, When, What)**: The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.
- 🌍 **Multilingual & Code-Switching Support**: It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found here.
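The "Who, When, What" output described above can be pictured as a list of speaker-attributed, timestamped segments. A minimal sketch of that idea in plain Python (the `Segment` class and its field names are illustrative assumptions, not the model's actual output schema):

```python
from dataclasses import dataclass

# Hypothetical representation of one structured transcript segment:
# speaker label, start/end timestamps in seconds, and recognized text.
@dataclass
class Segment:
    speaker: str
    start: float
    end: float
    text: str

def render(segments):
    """Format segments as '[start-end] speaker: text' lines."""
    return "\n".join(
        f"[{s.start:.1f}-{s.end:.1f}] {s.speaker}: {s.text}" for s in segments
    )

segments = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the meeting."),
    Segment("Speaker 2", 4.5, 7.9, "Thanks, glad to be here."),
]
print(render(segments))
```

Diarization assigns the speaker labels, timestamping supplies the intervals, and ASR fills in the text; the model emits all three jointly in a single pass.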
🏗️ Model Architecture
Demo
Evaluation


Installation
We recommend using the NVIDIA Deep Learning Container to manage the CUDA environment.
1. Launch docker

```bash
# NVIDIA PyTorch Container 24.07 ~ 25.12 verified; previous versions are also compatible.
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:25.12-py3

# If flash attention is not included in your docker environment, install it manually.
# Refer to https://github.com/Dao-AILab/flash-attention for installation instructions.
pip install flash-attn --no-build-isolation
```
2. Install from GitHub
```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .
```

Usages
Usage 1: Launch Gradio demo

```bash
apt update && apt install ffmpeg -y  # for demo
python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --share
```

Usage 2: Inference from files directly

```bash
python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files [add an audio path here]
```

Finetuning
LoRA (Low-Rank Adaptation) fine-tuning is supported. See Finetuning for a detailed guide.
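For intuition, LoRA keeps the base weight W frozen and learns a small low-rank update, so the adapted weight is W' = W + (alpha / r) · B · A, where A is r×in and B is out×r. A minimal numeric sketch of that update rule in plain Python (the matrices and alpha below are toy values, not the actual finetuning configuration):

```python
# LoRA update sketch: W' = W + (alpha / r) * B @ A.
# A (r x in) and B (out x r) are the small trainable matrices.
def matmul(X, Y):
    """Plain-Python matrix multiply over nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_delta(A, B, alpha):
    r = len(A)  # rank = number of rows in A
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2)
A = [[0.1, 0.2]]              # r=1, in=2
B = [[1.0], [2.0]]            # out=2, r=1
delta = lora_delta(A, B, alpha=2.0)
W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
print(W_adapted)
```

Because only A and B are trained (r is much smaller than the weight dimensions), the number of trainable parameters stays a small fraction of the full model.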
Results
Multilingual
| Dataset | Language | DER | cpWER | tcpWER | WER |
|---|---|---|---|---|---|
| MLC-Challenge | English | 4.28 | 11.48 | 13.02 | 7.99 |
| MLC-Challenge | French | 3.80 | 18.80 | 19.64 | 15.21 |
| MLC-Challenge | German | 1.04 | 17.10 | 17.26 | 16.30 |
| MLC-Challenge | Italian | 2.08 | 15.76 | 15.91 | 13.91 |
| MLC-Challenge | Japanese | 0.82 | 15.33 | 15.41 | 14.69 |
| MLC-Challenge | Korean | 4.52 | 15.35 | 16.07 | 9.65 |
| MLC-Challenge | Portuguese | 7.98 | 29.91 | 31.65 | 21.54 |
| MLC-Challenge | Russian | 0.90 | 12.94 | 12.98 | 12.40 |
| MLC-Challenge | Spanish | 2.67 | 10.51 | 11.71 | 8.04 |
| MLC-Challenge | Thai | 4.09 | 14.91 | 15.57 | 13.61 |
| MLC-Challenge | Vietnamese | 0.16 | 14.57 | 14.57 | 14.43 |
| MLC-Challenge | Average | 3.42 | 14.81 | 15.66 | 12.07 |

Meeting

| Dataset | Language | DER | cpWER | tcpWER | WER |
|---|---|---|---|---|---|
| AISHELL-4 | Chinese | 6.77 | 24.99 | 25.35 | 21.40 |
| AMI-IHM | English | 11.92 | 20.41 | 20.82 | 18.81 |
| AMI-SDM | English | 13.43 | 28.82 | 29.80 | 24.65 |
| AliMeeting | Chinese | 10.92 | 29.33 | 29.51 | 27.40 |
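The WER columns in the tables above are word error rates: the word-level edit distance (substitutions + deletions + insertions) between the hypothesis and the reference, divided by the number of reference words. A minimal sketch of that computation:

```python
# Word error rate via word-level edit distance (Levenshtein DP).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

cpWER and tcpWER are speaker-attributed variants (errors are counted after matching hypothesis speakers to reference speakers, with tcpWER additionally time-constrained), and DER measures diarization error rather than word errors.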
Language Distribution
📄 License
This project is licensed under the MIT License.
Last edited by: Ddd4j · Updated: 2026-02-27 09:40