{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 多层感知机\n",
"\n",
"我们已经介绍了包括线性回归和softmax回归在内的单层神经网络。然而深度学习主要关注多层模型。在本节中,我们将以多层感知机(multilayer perceptron,MLP)为例,介绍多层神经网络的概念。\n",
"\n",
"## 隐藏层\n",
"\n",
"多层感知机在单层神经网络的基础上引入了一到多个隐藏层(hidden layer)。隐藏层位于输入层和输出层之间。图3.3展示了一个多层感知机的神经网络图。\n",
"\n",
"\n",
"\n",
"在图3.3所示的多层感知机中,输入和输出个数分别为4和3,中间的隐藏层中包含了5个隐藏单元(hidden unit)。由于输入层不涉及计算,图3.3中的多层感知机的层数为2。由图3.3可见,隐藏层中的神经元和输入层中各个输入完全连接,输出层中的神经元和隐藏层中的各个神经元也完全连接。因此,多层感知机中的隐藏层和输出层都是全连接层。\n",
"\n",
"\n",
"具体来说,给定一个小批量样本$\\boldsymbol{X} \\in \\mathbb{R}^{n \\times d}$,其批量大小为$n$,输入个数为$d$。假设多层感知机只有一个隐藏层,其中隐藏单元个数为$h$。记隐藏层的输出(也称为隐藏层变量或隐藏变量)为$\\boldsymbol{H}$,有$\\boldsymbol{H} \\in \\mathbb{R}^{n \\times h}$。因为隐藏层和输出层均是全连接层,可以设隐藏层的权重参数和偏差参数分别为$\\boldsymbol{W}_h \\in \\mathbb{R}^{d \\times h}$和 $\\boldsymbol{b}_h \\in \\mathbb{R}^{1 \\times h}$,输出层的权重和偏差参数分别为$\\boldsymbol{W}_o \\in \\mathbb{R}^{h \\times q}$和$\\boldsymbol{b}_o \\in \\mathbb{R}^{1 \\times q}$。\n",
"\n",
"我们先来看一种含单隐藏层的多层感知机的设计。其输出$\\boldsymbol{O} \\in \\mathbb{R}^{n \\times q}$的计算为\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\boldsymbol{H} &= \\boldsymbol{X} \\boldsymbol{W}_h + \\boldsymbol{b}_h,\\\\\n",
"\\boldsymbol{O} &= \\boldsymbol{H} \\boldsymbol{W}_o + \\boldsymbol{b}_o,\n",
"\\end{aligned} \n",
"$$\n",
"\n",
"也就是将隐藏层的输出直接作为输出层的输入。如果将以上两个式子联立起来,可以得到\n",
"\n",
"$$\n",
"\\boldsymbol{O} = (\\boldsymbol{X} \\boldsymbol{W}_h + \\boldsymbol{b}_h)\\boldsymbol{W}_o + \\boldsymbol{b}_o = \\boldsymbol{X} \\boldsymbol{W}_h\\boldsymbol{W}_o + \\boldsymbol{b}_h \\boldsymbol{W}_o + \\boldsymbol{b}_o.\n",
"$$\n",
"\n",
"从联立后的式子可以看出,虽然神经网络引入了隐藏层,却依然等价于一个单层神经网络:其中输出层权重参数为$\\boldsymbol{W}_h\\boldsymbol{W}_o$,偏差参数为$\\boldsymbol{b}_h \\boldsymbol{W}_o + \\boldsymbol{b}_o$。不难发现,即便再添加更多的隐藏层,以上设计依然只能与仅含输出层的单层神经网络等价。\n",
"\n",
"\n",
"## 激活函数\n",
"\n",
"上述问题的根源在于全连接层只是对数据做仿射变换(affine transformation),而多个仿射变换的叠加仍然是一个仿射变换。解决问题的一个方法是引入非线性变换,例如对隐藏变量使用按元素运算的非线性函数进行变换,然后再作为下一个全连接层的输入。这个非线性函数被称为激活函数(activation function)。下面我们介绍几个常用的激活函数。\n",
"\n",
"### ReLU函数\n",
"\n",
"ReLU(rectified linear unit)函数提供了一个很简单的非线性变换。给定元素$x$,该函数定义为\n",
"\n",
"$$\\text{ReLU}(x) = \\max(x, 0).$$\n",
"\n",
"可以看出,ReLU函数只保留正数元素,并将负数元素清零。为了直观地观察这一非线性变换,我们先定义一个绘图函数`xyplot`。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import d2ltorch as d2lt\n",
"import torch\n",
"from torch import autograd\n",
"\n",
"def xyplot(x_vals, y_vals, name):\n",
"    d2lt.set_figsize(figsize=(5, 2.5))\n",
"    d2lt.plt.plot(x_vals.detach().numpy(), y_vals.detach().numpy())\n",
"    d2lt.plt.xlabel('x')\n",
"    d2lt.plt.ylabel(name + '(x)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们接下来通过`tensor`提供的`relu`函数来绘制ReLU函数。可以看到,该激活函数是一个两段线性函数。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"x = torch.arange(-8.0, 8.0, 0.1)\n",
"x.requires_grad_()\n",
"y = x.relu()\n",
"xyplot(x, y, 'relu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"显然,当输入为负数时,ReLU函数的导数为0;当输入为正数时,ReLU函数的导数为1。尽管输入为0时ReLU函数不可导,但是我们可以取此处的导数为0。下面绘制ReLU函数的导数。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"y.backward(torch.ones_like(y))\n",
"xyplot(x, x.grad, 'grad of relu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sigmoid函数\n",
"\n",
"sigmoid函数可以将元素的值变换到0和1之间:\n",
"\n",
"$$\\text{sigmoid}(x) = \\frac{1}{1 + \\exp(-x)}.$$\n",
"\n",
"sigmoid函数在早期的神经网络中较为普遍,但它目前逐渐被更简单的ReLU函数取代。在后面“循环神经网络”一章中我们会介绍如何利用它值域在0到1之间这一特性来控制信息在神经网络中的流动。下面绘制了sigmoid函数。当输入接近0时,sigmoid函数接近线性变换。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# 清空x的梯度\n",
"x.grad.data.zero_()\n",
"y = x.sigmoid()\n",
"xyplot(x, y, 'sigmoid')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"依据链式法则,sigmoid函数的导数\n",
"\n",
"$$\\text{sigmoid}'(x) = \\text{sigmoid}(x)\\left(1-\\text{sigmoid}(x)\\right).$$\n",
"\n",
"\n",
"下面绘制了sigmoid函数的导数。当输入为0时,sigmoid函数的导数达到最大值0.25;当输入越偏离0时,sigmoid函数的导数越接近0。"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"y.backward(torch.ones_like(y))\n",
"xyplot(x, x.grad, 'grad of sigmoid')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### tanh函数\n",
"\n",
"tanh(双曲正切)函数可以将元素的值变换到-1和1之间:\n",
"\n",
"$$\\text{tanh}(x) = \\frac{1 - \\exp(-2x)}{1 + \\exp(-2x)}.$$\n",
"\n",
"我们接着绘制tanh函数。当输入接近0时,tanh函数接近线性变换。虽然该函数的形状和sigmoid函数的形状很像,但tanh函数在坐标系的原点上对称。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"x.grad.data.zero_()\n",
"y = x.tanh()\n",
"xyplot(x, y, 'tanh')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"依据链式法则,tanh函数的导数\n",
"\n",
"$$\\text{tanh}'(x) = 1 - \\text{tanh}^2(x).$$\n",
"\n",
"下面绘制了tanh函数的导数。当输入为0时,tanh函数的导数达到最大值1;当输入越偏离0时,tanh函数的导数越接近0。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"y.backward(torch.ones_like(y))\n",
"xyplot(x, x.grad, 'grad of tanh')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 多层感知机\n",
"\n",
"多层感知机就是含有至少一个隐藏层的由全连接层组成的神经网络,且每个隐藏层的输出通过激活函数进行变换。多层感知机的层数和各隐藏层中隐藏单元个数都是超参数。以单隐藏层为例并沿用本节之前定义的符号,多层感知机按以下方式计算输出:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\boldsymbol{H} &= \\phi(\\boldsymbol{X} \\boldsymbol{W}_h + \\boldsymbol{b}_h),\\\\\n",
"\\boldsymbol{O} &= \\boldsymbol{H} \\boldsymbol{W}_o + \\boldsymbol{b}_o,\n",
"\\end{aligned}\n",
"$$\n",
" \n",
"其中$\\phi$表示激活函数。在分类问题中,我们可以对输出$\\boldsymbol{O}$做softmax运算,并使用softmax回归中的交叉熵损失函数。\n",
"在回归问题中,我们将输出层的输出个数设为1,并将输出$\\boldsymbol{O}$直接提供给线性回归中使用的平方损失函数。\n",
"\n",
"\n",
"\n",
"## 小结\n",
"\n",
"* 多层感知机在输出层与输入层之间加入了一个或多个全连接隐藏层,并通过激活函数对隐藏层输出进行变换。\n",
"* 常用的激活函数包括ReLU函数、sigmoid函数和tanh函数。\n",
"\n",
"\n",
"## 练习\n",
"\n",
"* 应用链式法则,推导出sigmoid函数和tanh函数的导数的数学表达式。\n",
"* 查阅资料,了解其他的激活函数。\n",
"\n",
"\n",
"\n",
"\n",
"## 扫码直达[讨论区](https://discuss.gluon.ai/t/topic/6447)\n",
"\n",
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:pytorch]",
"language": "python",
"name": "conda-env-pytorch-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}