{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ch9: Use Keras Models with Scikit-Learn for General Machine Learning\n", "\n", "Keras 模型和Python Scikit-Learn Library 结合,学习如何\n", "\n", "- 如何将Keras模型和 scikit-learn 库结合\n", "- 利用scikit-learn 库中的交叉验证(cross validation)来评估 Keras 模型\n", "- 利用scikit-learn 库中的格点搜索(grid search)调节Keras模型超参数" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.前言\n", "\n", "在Python中,Keras是用于深度学习的库,虽然受欢迎,但是Keras只关注深度学习。Keras追求极简,只关注我们快速使用、定义并建立深度学习模型。\n", "\n", "Scikit-learn 是Python的另一个库,建立在用于有效数值计算的SciPy基础之上,常用于机器学习。\n", "\n", "Keras中为深度学习模型能在scikit-learn中用于分类和回归提供了便捷的封装器(Wrapper),如`KearsClassifier`wrapper(用于在Keras中建立的神经网络分类器)。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.利用交叉验证评估模型" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"uniform\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"uniform\")`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "accuracy: 67.0454549068 %\n" ] } ], "source": [ "# 通过sklearn交叉验证评估用于糖尿病预测的神经网络模型\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.model_selection import cross_val_score\n", "\n", "import numpy as np\n", "\n", "import urllib\n", "\n", "# 函数用于建立模型,KerasClassifier需要的函数\n", "def create_model():\n", " # 建立模型\n", " model = Sequential()\n", " model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))\n", " model.add(Dense(8, init='uniform', activation='relu'))\n", " model.add(Dense(1, init='uniform', activation='sigmoid'))\n", " # 编译模型\n", " model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", " return model\n", "\n", "# 随机数设置,便于产生相同的随机数\n", "seed = 42\n", "np.random.seed(seed)\n", "\n", "# 加载数据\n", "\n", "# 使用url来获取 diabetes dataset\n", "url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data\"\n", "# 打开文件\n", "raw_data = urllib.urlopen(url)\n", "# 下载CSV文件 保存为np matrix格式\n", "dataset = np.loadtxt(raw_data, delimiter=',')\n", "# 数据特征与标签分开\n", "X = dataset[:, 0:8]\n", "y = dataset[:, 8]\n", "\n", "# 建立模型\n", "model = KerasClassifier(build_fn=create_model, nb_epoch=150, batch_size=10, verbose=0)\n", "\n", "# 利用10-fold 交叉验证建立的模型\n", "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", "results = cross_val_score(model, X, y, cv=kfold)\n", "print(\"accuracy: {0} %\".format(results.mean()*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对比上一篇[Python深度学习实战03-模型性能评估](https://anifacc.github.io/deeplearning/machinelearning/python/2017/08/09/dlwp-setimate-model-performance/)中的交叉验证,这里的更加快捷。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.格点搜索调节深度学习模型参数\n", "\n", "\n", 
"从上面我们知道`Keras`中就配有与`scikit-learn`库结合使用的功能(封装包,wrapper),如`KerasClassifier`,和`scikit-learn`结合使用,这个非常棒。另外我们还可以使用`scikit-learn`中的`grid search`来找到较佳的神经网络参数,不过这些参数仅包括:\n", "\n", "- Optimizers:用于搜索神经网络权重的优化器\n", "- Initializers: 初始化神经网络权重的方法\n", "- Epochs:模型训练时迭代次数\n", "- Batch_Size:用于更新网络权重的样本数目。\n", "\n", "在这里,我们没有 search 神经网络的拓扑结构参数:网络层数,中间隐含层神经元(节点)个数。神经网络拓扑结构确定是一个难点。\n", "\n", "下面的代码展示了利用`scikit-learn`格点搜索技术调节深度学习模型的参数。\n", "\n", "利用 `GridSearchCV` 功能,我们可以看到神经网络可调节参数为:\n", "\n", "```\n", "optimizers = ['rmsprop', 'adam']\n", "init = ['glorot_uniform', 'normal', 'uniform']\n", "epochs = np.array([50, 100, 150])\n", "batches = np.array([5, 10, 20])\n", "```\n", "\n", "这样就有`2x3x3x3=54`中不同参数的神经网络。如果数据集量非常大,那么search起来,计算时间相对样本量小的长很多。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"glorot_uniform\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"glorot_uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"glorot_uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"normal\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"normal\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"normal\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"uniform\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"uniform\")`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.697917 using {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.636719 (0.024910) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}\n", "0.665365 (0.021236) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}\n", "0.643229 (0.030647) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}\n", "0.670573 (0.020752) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}\n", "0.644531 (0.020915) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}\n", "0.651042 (0.027498) with: {'init': 
'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.678385 (0.027126) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}\n", "0.675781 (0.020915) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}\n", "0.682292 (0.004872) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}\n", "0.688802 (0.013279) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}\n", "0.670573 (0.015733) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}\n", "0.697917 (0.007366) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.661458 (0.031466) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}\n", "0.694010 (0.025780) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}\n", "0.628906 (0.048265) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}\n", "0.669271 (0.009207) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}\n", "0.677083 (0.020505) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}\n", "0.684896 (0.019225) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.526042 (0.110685) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10}\n", "0.583333 (0.086547) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 10}\n", "0.467448 (0.162891) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 10}\n", "0.529948 (0.145472) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 10}\n", "0.552083 (0.123869) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 10}\n", "0.640625 (0.022326) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 10}\n", "0.654948 (0.027126) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10}\n", "0.680990 (0.014382) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 10}\n", "0.667969 (0.034499) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 10}\n", "0.670573 (0.012890) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 10}\n", "0.675781 (0.011049) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 10}\n", "0.666667 (0.009207) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 10}\n", "0.673177 (0.023073) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10}\n", "0.678385 (0.003683) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 10}\n", "0.680990 (0.008027) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 10}\n", "0.674479 (0.030145) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 10}\n", "0.664063 (0.009568) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 10}\n", "0.671875 (0.013902) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 10}\n", "0.671875 (0.011500) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20}\n", "0.635417 (0.043537) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 
'batch_size': 20}\n", "0.636719 (0.022999) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 20}\n", "0.648438 (0.011500) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 20}\n", "0.615885 (0.035132) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 20}\n", "0.651042 (0.027866) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 20}\n", "0.662760 (0.024774) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20}\n", "0.651042 (0.027866) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 20}\n", "0.671875 (0.000000) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 20}\n", "0.669271 (0.012890) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 20}\n", "0.671875 (0.022326) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 20}\n", "0.647135 (0.013279) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 20}\n", "0.651042 (0.025780) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20}\n", "0.654948 (0.019488) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 20}\n", "0.656250 (0.020915) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 20}\n", "0.645833 (0.027126) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 20}\n", "0.656250 (0.027621) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 20}\n", "0.656250 (0.028348) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 20}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\sklearn\\model_selection\\_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. 
The grid_scores_ attribute will not be available from 0.20\n", " DeprecationWarning)\n" ] } ], "source": [ "# Tune the hyperparameters of the diabetes-prediction neural network with scikit-learn grid search\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "import numpy as np\n", "\n", "import urllib\n", "\n", "# Function that builds the model, required by KerasClassifier\n", "## defaults: rmsprop optimizer, glorot_uniform weight initialization\n", "def create_model(optimizer='rmsprop', init='glorot_uniform'):\n", "    # define the model\n", "    model = Sequential()\n", "    model.add(Dense(12, input_dim=8, init=init, activation='relu'))\n", "    model.add(Dense(8, init=init, activation='relu'))\n", "    model.add(Dense(1, init=init, activation='sigmoid'))\n", "    # compile the model\n", "    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])\n", "    return model\n", "\n", "# fix the random seed for reproducibility\n", "seed = 7\n", "np.random.seed(seed)\n", "\n", "# load the data\n", "## fetch the Pima Indians diabetes dataset from the UCI repository\n", "url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data\"\n", "## open the URL\n", "raw_data = urllib.urlopen(url)\n", "## load the CSV data into a NumPy array\n", "dataset = np.loadtxt(raw_data, delimiter=',')\n", "## split the data into features and labels\n", "X = dataset[:, 0:8]\n", "Y = dataset[:, 8]\n", "\n", "# build the wrapped model\n", "model = KerasClassifier(build_fn=create_model, verbose=0)\n", "# grid search settings: optimizer, weight initializer, epochs, and batch size\n", "optimizers = ['rmsprop', 'adam']\n", "init = ['glorot_uniform', 'normal', 'uniform']\n", "epochs = np.array([50, 100, 150])\n", "batches = np.array([5, 10, 20])\n", "param_grid = dict(optimizer=optimizers, nb_epoch=epochs, batch_size=batches, init=init)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid)\n", "grid_result = grid.fit(X, Y)\n", "\n", "# summarize the results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "\n", "# note: grid_scores_ was deprecated in scikit-learn 0.18 and removed in 0.20; cv_results_ is the replacement\n", "for params, mean_score, scores in grid_result.grid_scores_:\n", "    print(\"%f (%f) with: %r\" % (scores.mean(), scores.std(), params))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Best: 0.697917 using {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "\n", "The best classification accuracy is about 69.79%, achieved with the parameter settings shown above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }