{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Theano Example: Softmax Regression" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading and Loading the MNIST Dataset" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The [MNIST dataset](http://yann.lecun.com/exdb/mnist/) is a collection of handwritten digits that now serves as a standard benchmark for evaluating machine-learning algorithms.\n", "\n", "Here is a script that downloads and decompresses the data:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting download_mnist.py\n" ] } ], "source": [ "%%file download_mnist.py\n", "import os\n", "import os.path\n", "import urllib\n", "import gzip\n", "import shutil\n", "\n", "if not os.path.exists('mnist'):\n", "    os.mkdir('mnist')\n", "\n", "def download_and_gzip(name):\n", "    # Download the gzipped file if it is missing, then decompress it.\n", "    if not os.path.exists(name + '.gz'):\n", "        urllib.urlretrieve('http://yann.lecun.com/exdb/' + name + '.gz', name + '.gz')\n", "    if not os.path.exists(name):\n", "        with gzip.open(name + '.gz', 'rb') as f_in, open(name, 'wb') as f_out:\n", "            shutil.copyfileobj(f_in, f_out)\n", "\n", "download_and_gzip('mnist/train-images-idx3-ubyte')\n", "download_and_gzip('mnist/train-labels-idx1-ubyte')\n", "download_and_gzip('mnist/t10k-images-idx3-ubyte')\n", "download_and_gzip('mnist/t10k-labels-idx1-ubyte')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Run this script to download and decompress the data:" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%run download_mnist.py" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Use the following script to load the MNIST data; it is taken from:\n", "\n", "https://github.com/Newmu/Theano-Tutorials/blob/master/load.py" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting load.py\n" ] } ], "source": [ "%%file load.py\n", "import numpy as np\n", "import os\n", "\n", "datasets_dir = './'\n", "\n", "def one_hot(x,n):\n", "\tif type(x) == list:\n", "\t\tx = np.array(x)\n", "\tx = x.flatten()\n", "\to_h = np.zeros((len(x),n))\n", "\to_h[np.arange(len(x)),x] = 1\n", "\treturn o_h\n", "\n", "def mnist(ntrain=60000,ntest=10000,onehot=True):\n", "\tdata_dir = os.path.join(datasets_dir,'mnist/')\n", "\tfd = open(os.path.join(data_dir,'train-images-idx3-ubyte'))\n", "\tloaded = np.fromfile(file=fd,dtype=np.uint8)\n", "\ttrX = loaded[16:].reshape((60000,28*28)).astype(float)\n", "\n", "\tfd = open(os.path.join(data_dir,'train-labels-idx1-ubyte'))\n", "\tloaded = np.fromfile(file=fd,dtype=np.uint8)\n", "\ttrY = loaded[8:].reshape((60000))\n", "\n", "\tfd = open(os.path.join(data_dir,'t10k-images-idx3-ubyte'))\n", "\tloaded = np.fromfile(file=fd,dtype=np.uint8)\n", "\tteX = loaded[16:].reshape((10000,28*28)).astype(float)\n", "\n", "\tfd = open(os.path.join(data_dir,'t10k-labels-idx1-ubyte'))\n", "\tloaded = np.fromfile(file=fd,dtype=np.uint8)\n", "\tteY = loaded[8:].reshape((10000))\n", "\n", "\ttrX = trX/255.\n", "\tteX = teX/255.\n", "\n", "\ttrX = trX[:ntrain]\n", "\ttrY = trY[:ntrain]\n", "\n", "\tteX = teX[:ntest]\n", "\tteY = teY[:ntest]\n", "\n", "\tif onehot:\n", "\t\ttrY = one_hot(trY, 10)\n", "\t\tteY = one_hot(teY, 10)\n", "\telse:\n", "\t\ttrY = np.asarray(trY)\n", "\t\tteY = np.asarray(teY)\n", "\n", "\treturn trX,teX,trY,teY" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Softmax Regression" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`Softmax` regression is a generalization of `Logistic` regression: `Logistic` regression handles two-class problems, while `Softmax` regression handles `N`-class problems.\n", "\n", "`Logistic` regression outputs the probability that the label is 1 (which also determines the probability that it is 0); correspondingly, for an N-class problem `Softmax` outputs one probability per class.\n", "\n", "For the details, see the `UFLDL` tutorial:\n", "\n", "http://ufldl.stanford.edu/wiki/index.php/Softmax%E5%9B%9E%E5%BD%92" ] },
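{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick reminder of the standard definition the code below relies on (this cell only restates textbook math, nothing specific to this tutorial): given a row of scores $z$, the `Softmax` function maps it to class probabilities\n", "\n", "$$\\mathrm{softmax}(z)_i = \\frac{e^{z_i}}{\\sum_{j=1}^{N} e^{z_j}},$$\n", "\n", "so every entry is positive and each row sums to 1, which is what allows the outputs to be read as probabilities." ] },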
"`Logistic` 回归输出的是标签为 1 的概率(标签为 0 的概率也就知道了),对应地,对 N 类问题 `Softmax` 输出的是每个类对应的概率。\n", "\n", "具体的内容,可以参考 `UFLDL` 教程:\n", "\n", "http://ufldl.stanford.edu/wiki/index.php/Softmax%E5%9B%9E%E5%BD%92" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using gpu device 1: Tesla C2075 (CNMeM is disabled)\n" ] } ], "source": [ "import theano\n", "from theano import tensor as T\n", "import numpy as np\n", "from load import mnist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们来看它具体的实现。\n", "\n", "这两个函数一个是将数据转化为 `GPU` 计算的类型,另一个是初始化权重:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def floatX(X):\n", " return np.asarray(X, dtype=theano.config.floatX)\n", "\n", "def init_weights(shape):\n", " return theano.shared(floatX(np.random.randn(*shape) * 0.01))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Softmax` 的模型在 `theano` 中已经实现好了:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 4)\n", "[ 1.00000012 1. 1. ]\n" ] } ], "source": [ "A = T.matrix()\n", "\n", "B = T.nnet.softmax(A)\n", "\n", "test_softmax = theano.function([A], B)\n", "\n", "a = floatX(np.random.rand(3, 4))\n", "\n", "b = test_softmax(a)\n", "\n", "print b.shape\n", "\n", "# 行和\n", "print b.sum(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`softmax` 函数会按照行对矩阵进行 `Softmax` 归一化。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "所以我们的模型为:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def model(X, w):\n", " return T.nnet.softmax(T.dot(X, w))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "导入数据:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trX, teX, trY, teY = mnist(onehot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "定义变量,并初始化权重:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = T.fmatrix()\n", "Y = T.fmatrix()\n", "\n", "w = init_weights((784, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "定义模型输出和预测:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "py_x = model(X, w)\n", "y_pred = T.argmax(py_x, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "损失函数为多类的交叉熵,这个在 `theano` 中也被定义好了:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))\n", "gradient = T.grad(cost=cost, wrt=w)\n", "update = [[w, w - gradient * 0.05]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "编译 `train` 和 `predict` 函数:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)\n", "predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "迭代 100 次,测试集正确率为 0.925:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "000 0.8862\n", "001 
0.8985\n", "002 0.9042\n", "003 0.9084\n", "004 0.9104\n", "005 0.9121\n", "006 0.9121\n", "007 0.9142\n", "008 0.9158\n", "009 0.9163\n", "010 0.9162\n", "011 0.9166\n", "012 0.9171\n", "013 0.9176\n", "014 0.9182\n", "015 0.9182\n", "016 0.9184\n", "017 0.9188\n", "018 0.919\n", "019 0.919\n", "020 0.9194\n", "021 0.9201\n", "022 0.9204\n", "023 0.9203\n", "024 0.9205\n", "025 0.9207\n", "026 0.9207\n", "027 0.9209\n", "028 0.9214\n", "029 0.9213\n", "030 0.9212\n", "031 0.9211\n", "032 0.9217\n", "033 0.9217\n", "034 0.9217\n", "035 0.922\n", "036 0.9222\n", "037 0.922\n", "038 0.922\n", "039 0.9218\n", "040 0.9219\n", "041 0.9223\n", "042 0.9225\n", "043 0.9226\n", "044 0.9227\n", "045 0.9225\n", "046 0.9227\n", "047 0.9231\n", "048 0.9231\n", "049 0.9231\n", "050 0.9232\n", "051 0.9232\n", "052 0.9231\n", "053 0.9231\n", "054 0.9233\n", "055 0.9233\n", "056 0.9237\n", "057 0.9239\n", "058 0.9239\n", "059 0.9239\n", "060 0.924\n", "061 0.9242\n", "062 0.9242\n", "063 0.9243\n", "064 0.9243\n", "065 0.9244\n", "066 0.9244\n", "067 0.9244\n", "068 0.9245\n", "069 0.9244\n", "070 0.9244\n", "071 0.9245\n", "072 0.9244\n", "073 0.9243\n", "074 0.9243\n", "075 0.9244\n", "076 0.9243\n", "077 0.9242\n", "078 0.9244\n", "079 0.9244\n", "080 0.9243\n", "081 0.9242\n", "082 0.9239\n", "083 0.9241\n", "084 0.9242\n", "085 0.9243\n", "086 0.9244\n", "087 0.9243\n", "088 0.9243\n", "089 0.9244\n", "090 0.9246\n", "091 0.9246\n", "092 0.9246\n", "093 0.9247\n", "094 0.9246\n", "095 0.9246\n", "096 0.9246\n", "097 0.9246\n", "098 0.9246\n", "099 0.9248\n" ] } ], "source": [ "for i in range(100):\n", " for start, end in zip(range(0, len(trX), 128), range(128, len(trX), 128)):\n", " cost = train(trX[start:end], trY[start:end])\n", " print \"{0:03d}\".format(i), np.mean(np.argmax(teY, axis=1) == predict(teX))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }