Matcha-TTS

Matcha-TTS: A fast TTS architecture with conditional flow matching

https://github.com/shivammehta25/Matcha-TTS

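Matcha-TTS is a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. The method:

Is probabilistic
Has a compact memory footprint
Sounds highly natural
Is very fast to synthesize from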

Installation

1. Create an environment (suggested but optional)

conda create -n matcha-tts python=3.10 -y
conda activate matcha-tts

2. Install Matcha-TTS using pip or from source

pip install matcha-tts

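From source

pip install git+https://github.com/shivammehta25/Matcha-TTS.git

or, for an editable install, clone the repository first:

git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
pip install -e .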

3. Run the CLI / gradio app / jupyter notebook

# This will download the required models
matcha-tts --text "<INPUT TEXT>"

or

matcha-tts-app

or open synthesis.ipynb in Jupyter Notebook

CLI arguments

To synthesize from given text, run:

matcha-tts --text "<INPUT TEXT>"

To synthesize from a file, run:

matcha-tts --file <PATH TO FILE>

To batch synthesize from a file, run:

matcha-tts --file <PATH TO FILE> --batched

Other parameters

Speaking rate

matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0

Sampling temperature

matcha-tts --text "<INPUT TEXT>" --temperature 0.667

Number of Euler ODE solver steps

matcha-tts --text "<INPUT TEXT>" --steps 10
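The three options above can be combined in a single call. When driving the CLI from a script, one way is to assemble the argument list in Python. This is a minimal sketch: build_matcha_command is a hypothetical helper, not part of Matcha-TTS, and its default values only mirror the example values shown above, which may differ from the CLI's own defaults.

```python
import shlex


def build_matcha_command(text, speaking_rate=1.0, temperature=0.667, steps=10):
    """Return the matcha-tts argument list for one utterance (hypothetical helper)."""
    return [
        "matcha-tts",
        "--text", text,
        "--speaking_rate", str(speaking_rate),
        "--temperature", str(temperature),
        "--steps", str(steps),
    ]


# Print a copy-pasteable command line; pass `cmd` to subprocess.run to execute it.
cmd = build_matcha_command("Hello world", speaking_rate=0.9, steps=4)
print(shlex.join(cmd))
```

Passing the arguments as a list keeps the input text as one argument, so it needs no manual quoting.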

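Train with your own dataset

Let's assume we are training with LJ Speech.

1. Download the LJ Speech dataset, extract it to data/LJSpeech-1.1, and prepare the filelists to point to the extracted data, as in item 5 of the setup in the NVIDIA Tacotron 2 repo.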
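Pointing the filelists at the extracted data amounts to rewriting the path at the start of each "path|text" line. The sketch below assumes the filelists use a DUMMY directory placeholder, as the ones shipped with the Tacotron 2 repo do. rewrite_filelist_paths is a hypothetical helper and the sample line is made up for illustration.

```python
def rewrite_filelist_paths(lines, wav_dir, placeholder="DUMMY"):
    """Replace the placeholder directory at the start of each 'path|text' line."""
    fixed = []
    for line in lines:
        # Split on the first '|' only, so transcripts containing '|' stay intact.
        path, sep, text = line.rstrip("\n").partition("|")
        if path.startswith(placeholder + "/"):
            path = wav_dir.rstrip("/") + path[len(placeholder):]
        fixed.append(path + sep + text)
    return fixed


# Illustrative input; real filelists contain one line per utterance.
sample = ["DUMMY/LJ001-0001.wav|example transcript"]
for row in rewrite_filelist_paths(sample, "data/LJSpeech-1.1/wavs"):
    print(row)
```

Lines that do not start with the placeholder are returned unchanged, so running it twice is harmless.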

2. Clone and enter the Matcha-TTS repository

git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS

3. Install the package from source

pip install -e .

4. Go to configs/data/ljspeech.yaml and change the filelist paths:

train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt

5. Generate normalization statistics using the dataset configuration yaml file

matcha-data-stats -i ljspeech.yaml
# Output:
#{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}

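Update these values in configs/data/ljspeech.yaml under the data_statistics key.

data_statistics:  # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101

Check that train_filelist_path and valid_filelist_path in the same file point to your training and validation filelists before training.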
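To avoid copy errors when moving the printed statistics into the config, a small helper can render them in the same shape as the data_statistics entry. This is only a sketch: format_data_statistics is a hypothetical helper, not part of Matcha-TTS, and it assumes six decimal places are enough, as in the example values above.

```python
def format_data_statistics(stats, decimals=6):
    """Render mel statistics as a data_statistics YAML fragment (hypothetical helper)."""
    lines = ["data_statistics:"]
    for key in ("mel_mean", "mel_std"):
        lines.append(f"  {key}: {stats[key]:.{decimals}f}")
    return "\n".join(lines)


# Values as printed by matcha-data-stats in the example above.
printed = {"mel_mean": -5.53662231756592, "mel_std": 2.1161014277038574}
print(format_data_statistics(printed))
```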

6. Run the training script

make train-ljspeech

or

python matcha/train.py experiment=ljspeech

For a minimum-memory run

python matcha/train.py experiment=ljspeech_min_memory

For multi-GPU training, run

python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
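Some shells (zsh, for example) treat the square brackets in trainer.devices=[0,1] as a glob pattern unless the argument is quoted. When launching training from a script, building the argument list in Python sidesteps shell parsing altogether. A minimal sketch follows; build_train_command is a hypothetical helper that uses only the overrides shown above.

```python
def build_train_command(experiment, devices=None):
    """Return the training argument list, optionally with a device list override."""
    args = ["python", "matcha/train.py", f"experiment={experiment}"]
    if devices:
        device_list = ",".join(str(d) for d in devices)
        args.append(f"trainer.devices=[{device_list}]")
    return args


# Pass the resulting list to subprocess.run to start training.
print(build_train_command("ljspeech", devices=[0, 1]))
```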

Synthesize from the custom trained model

matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>

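ONNX support

Special thanks to @mush42 for implementing ONNX export and inference support.

Matcha checkpoints can be exported to ONNX, and inference can then be run on the exported ONNX graph.

ONNX export

To export a checkpoint to ONNX, first install ONNX with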

pip install onnx

Then run the following:

python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5

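Optionally, the ONNX exporter accepts vocoder-name and vocoder-checkpoint arguments. This lets you embed the vocoder in the exported graph and generate waveforms in a single run (similar to end-to-end TTS systems).

Note that n_timesteps is treated as a hyperparameter rather than a model input. This means you should specify it during export, not during inference. If not specified, n_timesteps is set to 5.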
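To see what n_timesteps controls, it helps to look at a fixed-step Euler solver: the number of steps sets how many times the vector field is evaluated. The sketch below is a toy scalar illustration only, not the Matcha-TTS or exporter code. euler_solve and the toy field are hypothetical stand-ins for the model's learned vector field.

```python
def euler_solve(vector_field, x0, n_timesteps=5):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    step = 1.0 / n_timesteps
    x, t = x0, 0.0
    for _ in range(n_timesteps):
        x = x + step * vector_field(x, t)
        t += step
    return x


# Toy field dx/dt = x, whose exact solution at t=1 is e * x0.
# More steps give a closer approximation at the cost of more evaluations.
for n in (1, 5, 50):
    print(n, euler_solve(lambda x, t: x, 1.0, n))
```

Because the loop length is fixed before the solver runs, it is plausible that the step count is baked into the exported graph rather than supplied at inference time.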

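Important: for now, export requires torch>=2.1.0, because the scaled_dot_product_attention operator is not exportable in older versions. Until the final version is released, those who want to export their models must manually install torch>=2.1.0 as a pre-release.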

ONNX inference

To run inference on the exported model, first install onnxruntime using

pip install onnxruntime
pip install onnxruntime-gpu  # for GPU inference

Then use the following:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs

You can also control the synthesis parameters:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0

To run inference on a GPU, make sure the onnxruntime-gpu package is installed, then pass --gpu to the inference command:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --gpu

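If you exported only Matcha to ONNX, this writes the mel-spectrograms as plots and numpy arrays to the output directory. If you embedded the vocoder in the exported graph, this writes .wav audio files to the output directory.

If you exported only Matcha to ONNX and want to run the full TTS pipeline, you can pass the path to a vocoder model in ONNX format: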

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx

This writes .wav audio files to the output directory.
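The inference variants above differ only in the optional flags, and the output type depends on whether a vocoder is involved. The sketch below captures both points. build_onnx_infer_command and writes_wav are hypothetical helpers that use only the flags shown in this section.

```python
def build_onnx_infer_command(model, text, output_dir, vocoder=None, gpu=False):
    """Return the matcha.onnx.infer argument list for one utterance."""
    args = ["python3", "-m", "matcha.onnx.infer", model,
            "--text", text, "--output-dir", output_dir]
    if vocoder:
        args += ["--vocoder", vocoder]  # external vocoder model in ONNX format
    if gpu:
        args.append("--gpu")  # requires the onnxruntime-gpu package
    return args


def writes_wav(vocoder_embedded, vocoder=None):
    """True if the run should write .wav files rather than mel-spectrogram outputs."""
    return bool(vocoder_embedded or vocoder)


print(build_onnx_infer_command("model.onnx", "hey", "./outputs",
                               vocoder="hifigan.small.onnx"))
print(writes_wav(vocoder_embedded=False, vocoder="hifigan.small.onnx"))
```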

Extract phoneme alignments from Matcha-TTS

If the dataset is structured as

data/
└── LJSpeech-1.1
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs

then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:

python  matcha/utils/get_durations_from_trained_model.py -i dataset_yaml -c <checkpoint>

Example:

python  matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt

or simply:

matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt
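A phoneme-level alignment can be read as a list of per-phoneme durations measured in mel frames. The sketch below converts such a list into start and end times in seconds. It makes several assumptions: the durations are frame counts, and the hop length of 256 samples and sample rate of 22050 Hz are typical LJ Speech settings that should be checked against your dataset config. durations_to_intervals is a hypothetical helper, and the on-disk format written by the extraction script is not covered here.

```python
def durations_to_intervals(durations, hop_length=256, sample_rate=22050):
    """Convert per-phoneme frame counts into (start, end) times in seconds."""
    seconds_per_frame = hop_length / sample_rate
    intervals, start_frame = [], 0
    for frames in durations:
        end_frame = start_frame + frames
        intervals.append((start_frame * seconds_per_frame,
                          end_frame * seconds_per_frame))
        start_frame = end_frame
    return intervals


# Made-up durations for three phonemes, in frames.
for start, end in durations_to_intervals([4, 7, 3]):
    print(f"{start:.3f}s -> {end:.3f}s")
```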

Train using extracted alignments

In the dataset config, turn on loading of durations. For example, in ljspeech.yaml:

load_durations: True

Or see an example in configs/experiment/ljspeech_from_durations.yaml

Author: Jeebiz. Created: 2024-07-13 22:58
Last edited by: Jeebiz. Updated: 2025-05-12 09:20