{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 10 Environmental Robustness\n", "\n", "一个鲁棒的语音识别系统,即使在环境与训练数据失配的情况下,性能也不会下降太多。本章将从麦克风、回声消除、语音信号增强等几方面讲解如何提高系统对环境变化的鲁棒性。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.1 The Acoustical Environment\n", "\n", "**Def:** the set of transformations that affect the speech signal from the time it leaves the speaker’s mouth until it is in digital form.\n", "\n", "主要包含两部分:\n", "* **加性噪声**\n", " 1. **稳态噪声**,e.g. 风扇声、汽车发动机的噪声,对于稳态噪声,数字信号处理的滤波方法就可以很好的处理掉。\n", " 2. **非稳态噪声**,e.g. 关门声、说话声、歌声等,若它与一个已知的信号相关,那么使用 AEC(adaptive echo-canceling) 可以很好的处理。\n", " 3. 噪声唤醒下还可能会影响的人的说话方式,比如人会不自觉的提高音量和音调,这种现象称之为**Lombard effect**(a phenomenon by which a speaker increases his vocal effort in the presence of background noise)。\n", " \n", " \n", "* 声道失真\n", " 1. **混响**, 实际中,由于墙等障碍物的遮挡,会产生回声,因此麦克风会在不同的时间点收到说话人的声音和回声,从而产生混响,除非人和麦克风都处于消音室或空旷的地方。**减小系统的混响,可以提高系统的鲁棒性。**\n", " 2. 麦克风的频率特性\n", " 3. AD 电路的采样失真" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.2 Acoustical Transducers\n", "\n", "Acoustical Transducers: 声学信号和电信号之间互相转换的器件。e.g. 麦克风、扩音器\n", "\n", "* 电容型麦克风:电容的两个极板中,有一个是活动的,根据声压的不同,极板间的距离会相应的进行改变,从而改变电容的容量,配合相应的电路,从而得到不同的电压输出。\n", "\n", "* 麦克风的指向性:\n", " 1. 全向麦克风(omnidirectional Microphones)。麦克风的响应与声波的方向无关。\n", " 2. 双向麦克风(Bidirectional Microphones)。只接收两个方向的声压信号。\n", " 3. 单向麦克风(Unidirectional Microphones)。只接收单个方向的声压信号。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 10.3 Adaptive Echo Cancellation(AEC)\n", "\n", "![](./images/10_1.png)\n", "\n", "* **AEC的作用**:降噪,消掉麦克风中拾取到的系统本身产生的声音,如上图所示,\\\\(r(n)\\\\)是麦克风拾取的信号,\\\\(d(n)\\\\)是扬声器播放的声音,\\\\(s(n)\\\\)是用户的声音,\\\\(v(n)\\\\)是噪声,\\\\(\\hat{d}(n)\\\\)是滤波器估计的系统声音。\n", "* **性能评价指标**:echo-return loss enhancement(ERLE),越大越好,计算方法如下:\n", "\n", "$$ERLE(db) = 10 lg_{10} \\frac{E\\{d^2(n)\\}}{E\\{(d(n) - \\hat{d}(n))^2\\}}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10.3.1 The LMS Algorithm\n", "Least Mean Square(LMS)算法非常简单,它假设系统的信号\\\\(d(n)\\\\)可以麦克风的输入信号通过加权得到:\n", "\n", "$$d(n) = \\sum_{k=0}^{L-1}g_k x(n-k) + u(n) = G^T X(n) + u(n)$$\n", "\n", "其中\\\\(u(n)\\\\)为噪声信号,它与\\\\(x(n)\\\\)相互独立。\n", "\n", "因此我们可以用相同的方法来估计滤波器的输出:\n", "\n", "$$y(n) = \\sum_{k=0}^{L-1}w_k x(n-k) = W^T X(n)$$\n", "\n", "所以估计信号与系统信号的误差为\n", "\n", "$$e(n) = d(n) - y(n) = d(n) - W^T X(n) $$\n", "\n", "LMS 算法就是通过不断的迭代,逐步的减小误差,采用梯度下降法,得到权值的更新公式:\n", "\n", "$$W(n+1) = W(n) + \\epsilon e(n)X(n) $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10.3.2 Convergence Properties of the LMS Algorithm\n", "\n", "\\\\(\\epsilon\\\\)的选择至关重要:过小,算法收敛慢,过大,则算法误差大。\n", "\n", "**\\\\(\\epsilon\\\\)的取值范围**:\n", "\n", "$$0 < \\epsilon <\\frac{1}{\\lambda}$$\n", "\n", "其中\\\\(\\lambda\\\\)为输入\\\\(X(n)\\\\)的最大特征值。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10.3.3 Normalized LMS Algorithm\n", "\n", "归一化LMS 算法:动态更新学习步长\\\\(\\epsilon\\\\)\n", "\n", "$$\\epsilon(n) = \\frac{\\epsilon}{\\delta + L\\hat{\\sigma}_x^2(n)}$$\n", "\n", "其中\\\\(\\delta\\\\)只是为了防止分母为0,\\\\(\\hat{\\sigma}_x^2(n)\\\\)为输入信号的能量估计,它通过指数窗进行更新:\n", "\n", "$$\\hat{\\sigma}_x^2(n) = (1-\\beta)\\hat{\\sigma}_x^2(n-1) + x^2(n)$$\n", "\n", "或矩形窗进行更新:\n", "\n", "$$\\hat{\\sigma}_x^2(n) = \\hat{\\sigma}_x^2(n-1) + \\frac{1}{N}(x^2(n) - x^2(n-N))$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.4 Multimicrophone Speech Enhancement\n", "\n", "### 10.4.1 Microphone Arrays\n", "\n", "使用麦克风阵列的目的:\n", "* 声源定位\n", "* 提高信号的 SNR\n", "\n", "方法:beamforming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10.4.2 Blind Source Separation(盲源分离)\n", "\n", "**Blind source separation(BSS)** is a set of techniques that assume no information about the mixing process or the\n", "sources, apart from their mutual statistical independence, hence is termed blind.\n", "\n", "**方法:Independent component analysisICA**,a set of techniques to solve the BSS problem that estimate a set of linear filters to separate the mixed signals under the **assumption that the original sources are statistically independent**.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.5 Environment compensation preprocessing\n", "针对单麦环境,介绍几种对于加性噪声和声道失真的补偿措施。\n", "\n", "* **Spectral substraction**\n", " * **前提假设**:信号\\\\(y(m)\\\\)由语音信号和加性噪声信号混叠而成:\n", "\n", " $$y(n) = x(m) + n(m)$$\n", "\n", " 其中\\\\(x(m)\\\\)和\\\\(n(m)\\\\)相互独立,因此在频域中有\n", "\n", " $$|Y(f)|^2 \\approx |X(f)|^2 + |N(f)|^2$$\n", "\n", " * **算法过程**:\n", " 1. **噪声估计**。在无语音信号的时候,在\\\\(M\\\\)帧的时长上估计噪声的功率谱。\n", " \n", " $$|\\hat{N}(f)|^2 = \\frac{1}{M}\\sum_{i=0}^{M-1}|Y_i{f}|^2$$\n", " \n", " 2. **计算语音信号的功率谱**\n", " \n", " $$|X(f)|^2 = |Y(f)|^2 - |\\hat{N}(f)|^2 = |Y(f)|^2(1-\\frac{1}{SNR(f)})$$\n", " \n", " 其中\n", " \n", " $$SNR(f) = \\frac{|Y(f)|^2}{|N(f)|^2}$$\n", " \n", " 上述过程会造成信号的失真,若减小失真,可以牺牲较小的噪声衰减为代价,对 SNR 进行平滑:\n", " \n", " $$SNR(f, t) = \\gamma SNR(f, t-1) + (1- \\gamma)\\frac{|Y(f)|^2}{|N(f)|^2}$$\n", "\n", " **Smoothing over both time and frequency can be done to obtain more accurate SNR measurements and thus less distortion.**\n", "\n", "\n", "* **Frequency-Domain MMSE from Stereo Data**\n", "* **Wiener Filtering**\n", "* **Cepstral Mean Normalization (CMN)**\n", " \n", " 对于给定信号\\\\(x(n)\\\\),基于短时倒谱分析得到长度为\\\\(T\\\\)的倒谱向量序列\\\\(X={x_0, x_1, ..., x_{T-1}}\\\\),它的均值为\n", " \n", " $$\\bar{x} = \\frac{1}{T}\\sum_{t=0}^{T-1}x_t$$\n", " \n", " CMN 其实就是计算归一化的倒谱向量\\\\(\\hat{x}_t\\\\),由下式计算得到:\n", " \n", " $$\\hat{x}_t = x_t - \\bar{x}$$\n", " \n", " 假设信号\\\\(y(n)\\\\)由输入信号\\\\(x(n)\\\\)经过滤波器\\\\(h(n)\\\\)得到,那么可以计算的到\\\\(y(n)\\\\)的倒谱序列\\\\(Y={y_0, y_1, ..., y_{T-1}}\\\\),根据梅尔倒谱的计算过程,可以定义向量\\\\({h}\\\\)为\n", " \n", " $${h} = C(\\ln{|H(\\omega_0)|^2} \\cdots \\ln{|H(\\omega_M)|^2)}$$\n", " \n", " 其中\\\\(C\\\\)为 DCT 变换矩阵。所以有\n", " \n", " $$y_t = x_t + h$$\n", " \n", " 因此可以计算得到倒谱序列\\\\(Y\\\\)的均值为\n", " \n", " $$\\bar{y} = \\frac{1}{T}\\sum_{t=0}^{T-1}y_t=\\frac{1}{T}\\sum_{t=0}^{T-1}(x_t + h) = \\bar{x} + h$$\n", " \n", " 从而得到它的归一化向量\n", " \n", " $$\\bar{y}_t = y_t - \\bar{y} = \\hat{x}_t$$\n", " \n", " **特点**:\n", " 1. Use of CMN to the cepstrum vectors does not modify the delta or delta-delta cepstrum.\n", " 2. It has been found that this procedure does not degrade the recognition rate on utterances from the same acoustical environment, as long as they are longer than 2–4 seconds.\n", " 3. When a system is trained on one microphone and tested on another, CMN can provide significant robustness.\n", " 4. One drawback of CMN is it does not discriminate silence and voice in computing the utterance mean.\n", " \n", "* **Real-Time Cepstral Normalization**\n", " \n", " 动态更新均值估计:\n", " \n", " $$\\bar{x}_t = (1-\\alpha) \\bar{x}_{t-1} + (\\alpha)x_t$$\n", " \n", "* **The Use of Gaussian Mixture Models**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.6 Environmental Model Adaptation\n", "\n", "* **Retraining on Corrupted Speech**\n", " * 方法:\n", " 1. 在对应的噪声环境中,收集大量的训练数据,重新训练模型(成本过高)\n", " 2. 采集少量的噪声,对已有的训练集加噪,重新训练模型,结果如下图:\n", " 3. 在线采集数据、在线训练(只适合小词汇量系统)\n", " 4. 使用不同的噪声、不同 SNR 的训练数据,重新训练模型,使模型对噪声环境更鲁棒。\n", " \n", " ![](./images/10_2.png)\n", " \n", "* **Model Adaptation**\n", "\n", " 1. MAP is an unstructured method, it can offer results similar to those of matched conditions, but it requires a significant amount of adaptation data. \n", " 2. MLLR can achieve reasonable performance with about a minute of speech for minor mismatches. But for severe mismatches, MLLR also requires a large number of transformations, which, in turn, require a larger amount of adaptation data\n", " \n", " \n", "* **Parallel Model Combination**\n", " \n", " 处理方法如下图:\n", " ![](./images/10_3.png)\n", " \n", "* **Vector Taylor Series**\n", "* **Retraining on Compensated Features**\n", " \n", " 信号经前端处理(如 CMN)后,再重新训练模型,性能可以得到进一步提升,如下图:\n", " ![](./images/10_4.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.7 Modeling Nonstationary Noise\n", "\n", "之前的处理都是针对稳态噪声,但对于非稳态噪声,特别是某些固定的噪声,如鼠标、键盘的敲击声,电脑的风扇声、关门声等,理论上对其进行建模可以很好的处理掉上述非稳态噪声。\n", "\n", "训练过程如下图:\n", "![](./images/10_5.png)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }