{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ch9: Use Keras Models with Scikit-Learn for General Machine Learning\n", "\n", "Keras 模型和Python Scikit-Learn Library 结合,学习如何\n", "\n", "- 如何将Keras模型和 scikit-learn 库结合\n", "- 利用scikit-learn 库中的交叉验证(cross validation)来评估 Keras 模型\n", "- 利用scikit-learn 库中的格点搜索(grid search)调节Keras模型超参数" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.前言\n", "\n", "在Python中,Keras是用于深度学习的库,虽然受欢迎,但是Keras只关注深度学习。Keras追求极简,只关注我们快速使用、定义并建立深度学习模型。\n", "\n", "Scikit-learn 是Python的另一个库,建立在用于有效数值计算的SciPy基础之上,常用于机器学习。\n", "\n", "Keras中为深度学习模型能在scikit-learn中用于分类和回归提供了便捷的封装器(Wrapper),如`KearsClassifier`wrapper(用于在Keras中建立的神经网络分类器)。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.利用交叉验证评估模型" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"uniform\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"uniform\")`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "accuracy: 67.0454549068 %\n" ] } ], "source": [ "# 通过sklearn交叉验证评估用于糖尿病预测的神经网络模型\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.model_selection import cross_val_score\n", "\n", "import numpy as np\n", "\n", "import urllib\n", "\n", "# 函数用于建立模型,KerasClassifier需要的函数\n", "def create_model():\n", " # 建立模型\n", " model = Sequential()\n", " model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))\n", " model.add(Dense(8, init='uniform', activation='relu'))\n", " model.add(Dense(1, init='uniform', activation='sigmoid'))\n", " # 编译模型\n", " model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", " return model\n", "\n", "# 随机数设置,便于产生相同的随机数\n", "seed = 42\n", "np.random.seed(seed)\n", "\n", "# 加载数据\n", "\n", "# 使用url来获取 diabetes dataset\n", "url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data\"\n", "# 打开文件\n", "raw_data = urllib.urlopen(url)\n", "# 下载CSV文件 保存为np matrix格式\n", "dataset = np.loadtxt(raw_data, delimiter=',')\n", "# 数据特征与标签分开\n", "X = dataset[:, 0:8]\n", "y = dataset[:, 8]\n", "\n", "# 建立模型\n", "model = KerasClassifier(build_fn=create_model, nb_epoch=150, batch_size=10, verbose=0)\n", "\n", "# 利用10-fold 交叉验证建立的模型\n", "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", "results = cross_val_score(model, X, y, cv=kfold)\n", "print(\"accuracy: {0} %\".format(results.mean()*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对比上一篇[Python深度学习实战03-模型性能评估](https://anifacc.github.io/deeplearning/machinelearning/python/2017/08/09/dlwp-setimate-model-performance/)中的交叉验证,这里的更加快捷。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.格点搜索调节深度学习模型参数\n", "\n", "\n", 
"从上面我们知道`Keras`中就配有与`scikit-learn`库结合使用的功能(封装包,wrapper),如`KerasClassifier`,和`scikit-learn`结合使用,这个非常棒。另外我们还可以使用`scikit-learn`中的`grid search`来找到较佳的神经网络参数,不过这些参数仅包括:\n", "\n", "- Optimizers:用于搜索神经网络权重的优化器\n", "- Initializers: 初始化神经网络权重的方法\n", "- Epochs:模型训练时迭代次数\n", "- Batch_Size:用于更新网络权重的样本数目。\n", "\n", "在这里,我们没有 search 神经网络的拓扑结构参数:网络层数,中间隐含层神经元(节点)个数。神经网络拓扑结构确定是一个难点。\n", "\n", "下面的代码展示了利用`scikit-learn`格点搜索技术调节深度学习模型的参数。\n", "\n", "利用 `GridSearchCV` 功能,我们可以看到神经网络可调节参数为:\n", "\n", "```\n", "optimizers = ['rmsprop', 'adam']\n", "init = ['glorot_uniform', 'normal', 'uniform']\n", "epochs = np.array([50, 100, 150])\n", "batches = np.array([5, 10, 20])\n", "```\n", "\n", "这样就有`2x3x3x3=54`中不同参数的神经网络。如果数据集量非常大,那么search起来,计算时间相对样本量小的长很多。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"glorot_uniform\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"glorot_uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"glorot_uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"normal\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"normal\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"normal\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:17: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(12, activation=\"relu\", kernel_initializer=\"uniform\", input_dim=8)`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:18: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(8, activation=\"relu\", kernel_initializer=\"uniform\")`\n", "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\ipykernel\\__main__.py:19: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, activation=\"sigmoid\", kernel_initializer=\"uniform\")`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.697917 using {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.636719 (0.024910) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}\n", "0.665365 (0.021236) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}\n", "0.643229 (0.030647) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}\n", "0.670573 (0.020752) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}\n", "0.644531 (0.020915) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}\n", "0.651042 (0.027498) with: {'init': 
'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.678385 (0.027126) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}\n", "0.675781 (0.020915) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}\n", "0.682292 (0.004872) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}\n", "0.688802 (0.013279) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}\n", "0.670573 (0.015733) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}\n", "0.697917 (0.007366) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.661458 (0.031466) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 5}\n", "0.694010 (0.025780) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 5}\n", "0.628906 (0.048265) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 5}\n", "0.669271 (0.009207) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 5}\n", "0.677083 (0.020505) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 5}\n", "0.684896 (0.019225) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "0.526042 (0.110685) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10}\n", "0.583333 (0.086547) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 10}\n", "0.467448 (0.162891) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 10}\n", "0.529948 (0.145472) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 10}\n", "0.552083 (0.123869) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 10}\n", "0.640625 (0.022326) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 10}\n", "0.654948 (0.027126) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10}\n", "0.680990 (0.014382) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 10}\n", "0.667969 (0.034499) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 10}\n", "0.670573 (0.012890) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 10}\n", "0.675781 (0.011049) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 10}\n", "0.666667 (0.009207) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 10}\n", "0.673177 (0.023073) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10}\n", "0.678385 (0.003683) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 10}\n", "0.680990 (0.008027) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 10}\n", "0.674479 (0.030145) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 10}\n", "0.664063 (0.009568) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 10}\n", "0.671875 (0.013902) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 10}\n", "0.671875 (0.011500) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20}\n", "0.635417 (0.043537) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 50, 
'batch_size': 20}\n", "0.636719 (0.022999) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 20}\n", "0.648438 (0.011500) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 20}\n", "0.615885 (0.035132) with: {'init': 'glorot_uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 20}\n", "0.651042 (0.027866) with: {'init': 'glorot_uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 20}\n", "0.662760 (0.024774) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20}\n", "0.651042 (0.027866) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 20}\n", "0.671875 (0.000000) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 20}\n", "0.669271 (0.012890) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 20}\n", "0.671875 (0.022326) with: {'init': 'normal', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 20}\n", "0.647135 (0.013279) with: {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 20}\n", "0.651042 (0.025780) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20}\n", "0.654948 (0.019488) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 50, 'batch_size': 20}\n", "0.656250 (0.020915) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 100, 'batch_size': 20}\n", "0.645833 (0.027126) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 100, 'batch_size': 20}\n", "0.656250 (0.027621) with: {'init': 'uniform', 'optimizer': 'rmsprop', 'nb_epoch': 150, 'batch_size': 20}\n", "0.656250 (0.028348) with: {'init': 'uniform', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 20}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\ProgramData\\Anaconda2\\lib\\site-packages\\sklearn\\model_selection\\_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. 
The grid_scores_ attribute will not be available from 0.20\n", " DeprecationWarning)\n" ] } ], "source": [ "# Tune the hyperparameters of the diabetes-prediction neural network with scikit-learn grid search\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "import numpy as np\n", "\n", "import urllib\n", "\n", "# Function that builds the model, required by KerasClassifier\n", "## defaults: rmsprop optimizer, glorot_uniform weight initialization\n", "def create_model(optimizer='rmsprop', init='glorot_uniform'):\n", "    # define the model\n", "    model = Sequential()\n", "    model.add(Dense(12, input_dim=8, init=init, activation='relu'))\n", "    model.add(Dense(8, init=init, activation='relu'))\n", "    model.add(Dense(1, init=init, activation='sigmoid'))\n", "    # compile the model\n", "    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])\n", "    return model\n", "\n", "# fix the random seed for reproducibility\n", "seed = 7\n", "np.random.seed(seed)\n", "\n", "# load the data\n", "## fetch the Pima Indians diabetes dataset from the UCI repository\n", "url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data\"\n", "## open the URL\n", "raw_data = urllib.urlopen(url)\n", "## load the CSV data into a NumPy array\n", "dataset = np.loadtxt(raw_data, delimiter=',')\n", "## split the data into features and labels\n", "X = dataset[:, 0:8]\n", "Y = dataset[:, 8]\n", "\n", "# build the wrapped model\n", "model = KerasClassifier(build_fn=create_model, verbose=0)\n", "# grid search settings: optimizer, weight initializer, epochs, and batch size\n", "optimizers = ['rmsprop', 'adam']\n", "init = ['glorot_uniform', 'normal', 'uniform']\n", "epochs = np.array([50, 100, 150])\n", "batches = np.array([5, 10, 20])\n", "param_grid = dict(optimizer=optimizers, nb_epoch=epochs, batch_size=batches, init=init)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid)\n", "grid_result = grid.fit(X, Y)\n", "\n", "# summarize the results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "\n", "# note: grid_scores_ was deprecated in scikit-learn 0.18 and removed in 0.20; cv_results_ is the replacement\n", "for params, mean_score, scores in grid_result.grid_scores_:\n", "    print(\"%f (%f) with: %r\" % (scores.mean(), scores.std(), params))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Best: 0.697917 using {'init': 'normal', 'optimizer': 'adam', 'nb_epoch': 150, 'batch_size': 5}\n", "\n", "The best classification accuracy is about 69.79%, achieved with the parameter settings shown above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }