Published 2021-04-07 · Updated 2021-07-12 · Smart Home

Installing and Training DeepSpeech on Ubuntu

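I followed the official documentation from start to finish and hit only a few shallow pitfalls. Think of this post as a summary, translation, and annotation of that documentation.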

Prerequisites

Python 3.6

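Stick to this version strictly. In my testing, 3.8 produced a long list of errors.
Mac or Linux environment

I recommend skipping Mac: it is missing far more dependencies than Linux and causes more problems. I gave up on Mac halfway through and switched to Ubuntu. You can also create an Ubuntu environment by enabling the WSL feature in Windows 10, and it can use all of the machine's hardware.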
CUDA 10.0 / CuDNN v7.6 per Dockerfile.

This one is optional. I am a beginner, so I trained directly on the CPU.
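
The first two prerequisites can be checked before going any further. Below is a minimal sketch, not part of DeepSpeech: the function name `check_prerequisites` and the two constants are my own, and the platform check only looks at `sys.platform`.

```python
import sys

REQUIRED_PYTHON = (3, 6)                    # the version this post was tested with
SUPPORTED_PLATFORMS = ("linux", "darwin")   # Linux (including WSL) or macOS


def check_prerequisites(version_info=None, platform=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version_info = sys.version_info if version_info is None else version_info
    platform = sys.platform if platform is None else platform

    problems = []
    major_minor = tuple(version_info[:2])
    if major_minor != REQUIRED_PYTHON:
        problems.append("Python %d.%d found, but 3.6 is expected" % major_minor)
    if not platform.startswith(SUPPORTED_PLATFORMS):
        problems.append("unsupported platform: %s" % platform)
    return problems


if __name__ == "__main__":
    # Only prints warnings; it never aborts.
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it with the interpreter you plan to use for training. No output means both checks passed.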

Download the DeepSpeech source code

git clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech

Create a Python virtual environment

Create it with the built-in venv module:

python3 -m venv ~/tmp/deepspeech-train-venv/

Activate the virtual environment

source ~/tmp/deepspeech-train-venv/bin/activate

Every step below should be run inside this virtual environment.
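
It is easy to forget the activation step in a fresh terminal. One way to confirm the interpreter really belongs to a venv is to compare `sys.prefix` with `sys.base_prefix`, which differ only inside a venv-created environment. The helper name `in_virtualenv` is just an illustration:

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """True when the interpreter runs inside an environment created by venv."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Outside a venv both values point at the same installation.
    return prefix != base_prefix


if __name__ == "__main__":
    print("inside a virtual environment:", in_virtualenv())
```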

Install the DeepSpeech dependencies

cd DeepSpeech
pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
pip3 install --upgrade -e .

If you update the DeepSpeech source later, rerun the last pip3 install command above so the dependencies stay in sync.

This step produces an error:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorflow 1.15.4 requires numpy<1.19.0,>=1.16.0, but you'll have numpy 1.19.5 which is incompatible.

As the message suggests, uninstall numpy and then install a pinned version:

pip3 uninstall numpy
pip3 install numpy==1.16.0
pip3 install tensorflow==1.15.4
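
The conflict comes down to a version range: the message asks for numpy `>=1.16.0` and `<1.19.0`. A small sketch of that comparison follows. It assumes plain dotted numeric versions (no `rc` or `post` suffixes), and both function names are made up for this post.

```python
def parse_version(text):
    """Turn a plain dotted version such as '1.16.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def numpy_ok_for_tensorflow(numpy_version, lower="1.16.0", upper="1.19.0"):
    """Check lower <= numpy_version < upper, the range quoted in the pip message."""
    version = parse_version(numpy_version)
    return parse_version(lower) <= version < parse_version(upper)
```

Inside the virtual environment you could pass `numpy.__version__` to `numpy_ok_for_tensorflow` after reinstalling, to confirm that the pinned version landed in the allowed range.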

Then install a system package that webrtcvad depends on:

sudo apt-get install python3-dev

Run a simple training with DeepSpeech's bundled script

Change to the DeepSpeech root directory and run the following script:

./bin/run-ldc93s1.sh

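The script downloads the corpus, converts it into a format DeepSpeech can read, and then trains on it. Both the downloaded corpus and the processed data are saved under DeepSpeech/data/ldc93s1/.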
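
The processed data is a CSV manifest pointing at the audio. As far as I understand the format, each row carries `wav_filename`, `wav_filesize`, and `transcript` columns. Treat those column names as an assumption and compare them with the header of your own data/ldc93s1/ldc93s1.csv. A minimal reader, with a made-up name `read_manifest`, could look like this:

```python
import csv

# Column names as I understand the DeepSpeech CSV format (an assumption).
EXPECTED_COLUMNS = ["wav_filename", "wav_filesize", "transcript"]


def read_manifest(csv_path):
    """Read a DeepSpeech-style CSV and return its rows as dictionaries."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_COLUMNS if name not in header]
        if missing:
            raise ValueError("missing columns: %s" % ", ".join(missing))
        return list(reader)
```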

DeepSpeech requires 16-bit, mono audio for both training and recognition, and the audio used for training and for recognition must share the same sample rate.
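
These audio constraints can be verified with Python's standard `wave` module before feeding files to DeepSpeech. The sketch below uses two helper names of my own, and `expected_rate` is whatever sample rate you decided on for both training and recognition; the post does not prescribe a value.

```python
import wave


def describe_wav(path):
    """Return (channels, bit_depth, sample_rate) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() is in bytes per sample, so multiply by 8 for bits.
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_usable(path, expected_rate):
    """True for 16-bit mono audio recorded at expected_rate Hz."""
    channels, bit_depth, rate = describe_wav(path)
    return channels == 1 and bit_depth == 16 and rate == expected_rate
```

Running `is_usable` over every file in a dataset with the same `expected_rate` is a cheap way to catch a stray stereo or resampled clip before a long training run.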

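Training finishes in a few minutes. However, the script does not save the trained model to a file. Open the script and add one argument to the DeepSpeech.py call at the very end: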

# The --export_dir line is the new addition; it saves the model to the given directory.
python -u DeepSpeech.py --noshow_progressbar \
  --train_files data/ldc93s1/ldc93s1.csv \
  --test_files data/ldc93s1/ldc93s1.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --checkpoint_dir "$checkpoint_dir" \
  --export_dir /mnt/g/ \
  "$@"

An output_graph.pb model file is generated in the /mnt/g/ directory.

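Using the datasets provided by Mozilla

Mozilla has collected datasets for many languages, including Mandarin. The download is a tar archive; extracting it yields a set of clips/*.mp3 files and several tsv files.

Run the bin/import_cv2.py script: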

bin/import_cv2.py /path/to/extracted/language/archive

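This creates a wav file for each mp3 in the clips directory, along with several csv files. These are inputs DeepSpeech can read. Following the way DeepSpeech.py is called in run-ldc93s1.sh above, you can now train on these datasets.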
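
To make that last step concrete, here is a rough sketch that assembles a DeepSpeech.py command line for an imported dataset. Several things are assumptions of mine rather than statements from this post: the file names `train.csv`, `dev.csv`, and `test.csv` (check what the importer actually wrote into your clips directory), the `--dev_files` flag, and the helper name `build_train_command`.

```python
import os


def build_train_command(clips_dir, checkpoint_dir, export_dir=None):
    """Assemble an argument list for DeepSpeech.py from an imported dataset."""
    command = [
        "python", "-u", "DeepSpeech.py",
        # CSV names below are assumed; verify them in your clips directory.
        "--train_files", os.path.join(clips_dir, "train.csv"),
        "--dev_files", os.path.join(clips_dir, "dev.csv"),
        "--test_files", os.path.join(clips_dir, "test.csv"),
        "--checkpoint_dir", checkpoint_dir,
    ]
    if export_dir is not None:
        # Same flag as added to run-ldc93s1.sh earlier in this post.
        command += ["--export_dir", export_dir]
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the DeepSpeech root directory, with the virtual environment active.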

#DeepSpeech