利用 Whisper + DeepSeek + ChatTTS 构建语音对话机器人 - 文章 - 开发者社区

点击下方卡片，关注“ 慢慢学AIGC ”

picture.image

由 DALL·E 3 生成，prompt：A person and a machine are engaged in two-way communication through a microphone and speakers. The person, standing on the left, speaks into the microphone while the machine on the right, resembling a sleek, futuristic robot, responds through speakers. The setting is a modern, well-lit room with a professional atmosphere. The person looks focused and engaged, and the machine's digital display shows sound waves indicating speech.

语音交互系统简介

语音交互系统主要由自动语音识别（Automatic Speech Recognition, 简称 ASR）、自然语言处理（Natural Language Processing, 简称 NLP）和文本到语音合成（Text to Speech，简称 TTS）三个环节构成。ASR 相当于人的听觉系统，NLP 相当于人的大脑语言区域，TTS 相当于人的发声系统。

picture.image

如何构建语音对话机器人

本文将完全利用开源方案构建语音对话机器人。

ASR 采用 OpenAI Whisper，同时支持中、英文。更多技术细节可以看这篇《跟着 Whisper 学说正宗河南话》；
NLP 采用 DeepSeek v2，由于本地运行所需的 GPU 资源不足，我们调用云端 API 实现这一步；
TTS 采用 ChatTTS，它是专门为对话场景设计的文本转语音模型，支持英文和中文两种语言。

本文基于 Gradio 实现的交互界面如图：

picture.image

你可以基于系统麦克风采集音频，通过 Whisper 转录为文本，调用 DeepSeek v2 API 后，再将对话输出经过 ChatTTS 合成为语音，点击播放即可听到来自机器人的声音。

硬件环境：RTX 3060， 12GB 显存

软件环境信息（Miniconda3 + Python 3.8.19）：


          
pip list
          
Package                       Version
          
----------------------------- --------------
          
absl-py                       2.0.0
          
accelerate                    0.25.0
          
aiofiles                      23.2.1
          
aiohttp                       3.8.6
          
aiosignal                     1.3.1
          
altair                        5.1.2
          
annotated-types               0.6.0
          
antlr4-python3-runtime        4.9.3
          
anyio                         4.0.0
          
argon2-cffi                   23.1.0
          
argon2-cffi-bindings          21.2.0
          
arrow                         1.3.0
          
asttokens                     2.4.1
          
astunparse                    1.6.3
          
async-lru                     2.0.4
          
async-timeout                 4.0.3
          
attrs                         23.1.0
          
audioread                     3.0.1
          
Babel                         2.15.0
          
backcall                      0.2.0
          
backports.zoneinfo            0.2.1
          
beautifulsoup4                4.12.3
          
bitarray                      2.8.2
          
bitsandbytes                  0.41.1
          
bleach                        6.1.0
          
blinker                       1.6.3
          
cachetools                    5.3.1
          
cdifflib                      1.2.6
          
certifi                       2023.7.22
          
cffi                          1.16.0
          
charset-normalizer            2.1.1
          
click                         8.1.7
          
colorama                      0.4.6
          
comm                          0.2.2
          
contourpy                     1.1.1
          
cpm-kernels                   1.0.11
          
cycler                        0.12.1
          
Cython                        3.0.3
          
debugpy                       1.8.1
          
decorator                     5.1.1
          
defusedxml                    0.7.1
          
distro                        1.9.0
          
dlib                          19.24.2
          
edge-tts                      6.1.8
          
editdistance                  0.8.1
          
einops                        0.8.0
          
einx                          0.2.2
          
encodec                       0.1.1
          
exceptiongroup                1.1.3
          
executing                     2.0.1
          
face-alignment                1.4.1
          
fairseq                       0.12.2
          
faiss-cpu                     1.7.4
          
fastapi                       0.108.0
          
fastjsonschema                2.19.1
          
ffmpeg                        1.4
          
ffmpeg-python                 0.2.0
          
ffmpy                         0.3.1
          
filelock                      3.12.4
          
Flask                         2.1.2
          
Flask-Cors                    3.0.10
          
flatbuffers                   23.5.26
          
fonttools                     4.43.1
          
fqdn                          1.5.1
          
frozendict                    2.4.4
          
frozenlist                    1.4.0
          
fsspec                        2023.9.2
          
future                        0.18.3
          
gast                          0.4.0
          
gitdb                         4.0.10
          
GitPython                     3.1.37
          
google-auth                   2.23.3
          
google-auth-oauthlib          1.0.0
          
google-pasta                  0.2.0
          
gradio                        4.32.2
          
gradio_client                 0.17.0
          
grpcio                        1.59.0
          
h11                           0.14.0
          
h5py                          3.10.0
          
httpcore                      0.18.0
          
httpx                         0.25.0
          
huggingface-hub               0.23.2
          
hydra-core                    1.0.7
          
idna                          3.4
          
imageio                       2.31.5
          
importlib-metadata            6.8.0
          
importlib-resources           6.1.0
          
inflect                       7.2.1
          
ipykernel                     6.29.4
          
ipython                       8.12.3
          
ipywidgets                    8.1.3
          
isoduration                   20.11.0
          
itsdangerous                  2.1.2
          
jedi                          0.19.1
          
Jinja2                        3.1.2
          
joblib                        1.3.2
          
json5                         0.9.25
          
jsonpointer                   2.4
          
jsonschema                    4.19.1
          
jsonschema-specifications     2023.7.1
          
jupyter                       1.0.0
          
jupyter_client                8.6.2
          
jupyter-console               6.6.3
          
jupyter_core                  5.7.2
          
jupyter-events                0.10.0
          
jupyter-lsp                   2.2.5
          
jupyter_server                2.14.1
          
jupyter_server_terminals      0.5.3
          
jupyterlab                    4.2.1
          
jupyterlab_pygments           0.3.0
          
jupyterlab_server             2.27.2
          
jupyterlab_widgets            3.0.11
          
keras                         2.13.1
          
kiwisolver                    1.4.5
          
langdetect                    1.0.9
          
latex2mathml                  3.77.0
          
lazy_loader                   0.3
          
libclang                      16.0.6
          
librosa                       0.9.1
          
llvmlite                      0.41.0
          
loguru                        0.7.2
          
lxml                          4.9.3
          
Markdown                      3.5
          
markdown-it-py                3.0.0
          
MarkupSafe                    2.1.3
          
matplotlib                    3.7.3
          
matplotlib-inline             0.1.7
          
mdtex2html                    1.2.0
          
mdurl                         0.1.2
          
mistune                       3.0.2
          
more-itertools                10.1.0
          
mpmath                        1.3.0
          
multidict                     6.0.4
          
nbclient                      0.10.0
          
nbconvert                     7.16.4
          
nbformat                      5.10.4
          
nemo_text_processing          1.0.2
          
nest-asyncio                  1.6.0
          
networkx                      3.1
          
notebook                      7.2.0
          
notebook_shim                 0.2.4
          
numba                         0.58.0
          
numpy                         1.22.4
          
oauthlib                      3.2.2
          
omegaconf                     2.3.0
          
onnx                          1.14.1
          
onnxoptimizer                 0.3.13
          
onnxsim                       0.4.33
          
openai                        1.6.1
          
openai-whisper                20230918
          
opencv-python                 4.8.1.78
          
opt-einsum                    3.3.0
          
orjson                        3.9.9
          
overrides                     7.7.0
          
packaging                     23.2
          
pandas                        2.0.3
          
pandocfilters                 1.5.1
          
parso                         0.8.4
          
peft                          0.7.1
          
pickleshare                   0.7.5
          
Pillow                        10.0.1
          
pip                           24.0
          
pkgutil_resolve_name          1.3.10
          
platformdirs                  3.11.0
          
playsound                     1.3.0
          
pooch                         1.7.0
          
portalocker                   2.8.2
          
praat-parselmouth             0.4.3
          
prometheus_client             0.20.0
          
prompt_toolkit                3.0.45
          
protobuf                      4.25.1
          
psutil                        5.9.5
          
pure-eval                     0.2.2
          
pyarrow                       13.0.0
          
pyasn1                        0.5.0
          
pyasn1-modules                0.3.0
          
PyAudio                       0.2.12
          
pycparser                     2.21
          
pydantic                      2.5.3
          
pydantic_core                 2.14.6
          
pydeck                        0.8.1b0
          
pydub                         0.25.1
          
Pygments                      2.16.1
          
pynini                        2.1.5
          
pynvml                        11.5.0
          
PyOpenGL                      3.1.7
          
pyparsing                     3.1.1
          
python-dateutil               2.8.2
          
python-json-logger            2.0.7
          
python-multipart              0.0.9
          
pytz                          2023.3.post1
          
PyWavelets                    1.4.1
          
pywin32                       306
          
pywinpty                      2.0.13
          
pyworld                       0.3.0
          
PyYAML                        6.0.1
          
pyzmq                         26.0.3
          
qtconsole                     5.5.2
          
QtPy                          2.4.1
          
referencing                   0.30.2
          
regex                         2023.10.3
          
requests                      2.32.3
          
requests-oauthlib             1.3.1
          
resampy                       0.4.2
          
rfc3339-validator             0.1.4
          
rfc3986-validator             0.1.1
          
rich                          13.6.0
          
rpds-py                       0.10.4
          
rsa                           4.9
          
ruff                          0.4.7
          
sacrebleu                     2.3.1
          
sacremoses                    0.1.1
          
safetensors                   0.4.3
          
scikit-image                  0.18.1
          
scikit-learn                  1.3.1
          
scikit-maad                   1.3.12
          
scipy                         1.7.3
          
semantic-version              2.10.0
          
Send2Trash                    1.8.3
          
sentencepiece                 0.1.99
          
setuptools                    69.5.1
          
shellingham                   1.5.4
          
six                           1.16.0
          
smmap                         5.0.1
          
sniffio                       1.3.0
          
sounddevice                   0.4.5
          
SoundFile                     0.10.3.post1
          
soupsieve                     2.5
          
sse-starlette                 1.8.2
          
stack-data                    0.6.3
          
starlette                     0.32.0.post1
          
streamlit                     1.29.0
          
sympy                         1.12
          
tabulate                      0.9.0
          
tenacity                      8.2.3
          
tensorboard                   2.13.0
          
tensorboard-data-server       0.7.1
          
tensorboardX                  2.6.2.2
          
tensorflow                    2.13.0
          
tensorflow-estimator          2.13.0
          
tensorflow-intel              2.13.0
          
tensorflow-io-gcs-filesystem  0.31.0
          
termcolor                     2.3.0
          
terminado                     0.18.1
          
threadpoolctl                 3.2.0
          
tifffile                      2023.7.10
          
tiktoken                      0.3.3
          
timm                          0.9.12
          
tinycss2                      1.3.0
          
tokenizers                    0.19.1
          
toml                          0.10.2
          
tomli                         2.0.1
          
tomlkit                       0.12.0
          
toolz                         0.12.0
          
torch                         2.1.0+cu121
          
torchaudio                    2.1.0+cu121
          
torchcrepe                    0.0.22
          
torchvision                   0.16.0+cu121
          
tornado                       6.3.3
          
tqdm                          4.63.0
          
traitlets                     5.14.3
          
transformers                  4.41.2
          
transformers-stream-generator 0.0.4
          
trimesh                       4.0.0
          
typeguard                     4.3.0
          
typer                         0.12.3
          
types-python-dateutil         2.9.0.20240316
          
typing_extensions             4.12.0
          
tzdata                        2023.3
          
tzlocal                       5.1
          
uri-template                  1.3.0
          
urllib3                       2.2.1
          
uvicorn                       0.25.0
          
validators                    0.22.0
          
vector_quantize_pytorch       1.14.8
          
vocos                         0.1.0
          
watchdog                      3.0.0
          
wcwidth                       0.2.13
          
webcolors                     1.13
          
webencodings                  0.5.1
          
websocket-client              1.8.0
          
websockets                    11.0.3
          
Werkzeug                      3.0.0
          
WeTextProcessing              0.1.12
          
wget                          3.2
          
wheel                         0.43.0
          
widgetsnbextension            4.0.11
          
win32-setctime                1.1.0
          
wrapt                         1.15.0
          
yarl                          1.9.2
          
zipp                          3.17.0

WebUI 代码如下（目前只是演示基本功能，比较简陋）：


          
import gradio as gr
          
from transformers import pipeline
          
import numpy as np
          

          
from ChatTTS.experimental.llm import llm_api
          
import ChatTTS
          

          
chat = ChatTTS.Chat()
          
chat.load_models(compile=False) # 设置为True以获得更快速度
          

          

          
API_KEY = 'sk-xxxxxxxx' # 需要自行到 https://platform.deepseek.com/api_keys 申请
          
client = llm_api(api_key=API_KEY,
          
        base_url="https://api.deepseek.com",
          
        model="deepseek-chat")
          

          

          
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")
          

          
def transcribe(audio):
          
    sr, y = audio
          
    y = y.astype(np.float32)
          
    y /= np.max(np.abs(y))
          
    user_question = transcriber({"sampling_rate": sr, "raw": y})["text"]
          
    text = client.call(user_question, prompt_version = 'deepseek')
          
    wav = chat.infer(text, use_decoder=True)
          
    audio_data = np.array(wav[0]).flatten()
          
    sample_rate = 24000
          

          
    return (sample_rate, audio_data)
          

          

          
demo = gr.Interface(
          
    transcribe,
          
    gr.Audio(sources=["microphone"]),
          
    "audio",
          
)
          

          
demo.launch()

在此基础上，可以增加更多功能：

ASR 模型这里只使用 openai/whisper-base，可以在页面上选择多种模型；
DeepSeek v2 API 使用了默认参数配置，可以在页面上增加一些额外参数，如 temperature 和 system prompt 等；
ChatTTS 可以增加如 speaker 身份，打断和笑声控制，实现更丰富的输出；
支持流式对话，像 GPT-4o 那样自然打断；

如果环境搭建遇到困难，可以私信获取完整项目。

点击下方卡片，关注“ 慢慢学AIGC ”