{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn.apionly as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   sepal_length  sepal_width  petal_length  petal_width species\n",
      "0           5.1          3.5           1.4          0.2  setosa\n",
      "1           4.9          3.0           1.4          0.2  setosa\n",
      "2           4.7          3.2           1.3          0.2  setosa\n",
      "3           4.6          3.1           1.5          0.2  setosa\n",
      "4           5.0          3.6           1.4          0.2  setosa\n"
     ]
    }
   ],
   "source": [
    "iris = sns.load_dataset('iris')\n",
    "print iris.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['setosa', 'versicolor', 'virginica'], dtype=object)"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.species.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`tolist()`: np.array $\\to$ list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['setosa', 'versicolor', 'virginica']"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.species.unique().tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'setosa'"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pctg = {\n",
    "    'setosa':0.1, \n",
    "    'versicolor':0.2, \n",
    "    'virginica':0.3\n",
    "}\n",
    "pctg.keys()[0]\n",
    "# pctg.values()[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在讨论分层抽样的问题。其实这个问题之前已经讨论清楚了，就是设置`sample`里面的比例。\n",
    "这里直接构建一个`dict`，\n",
    "\n",
    "- 在`subsample`中的`filter`使用`dict`的`keys()`，\n",
    "- 在`sample`使用`dict`的`values()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = {}\n",
    "\n",
    "#for i in iris.species.unique().tolist():\n",
    "for i in range(0,3):   \n",
    "    name = pctg.keys()[i]\n",
    "    subsample = iris[iris.species == name]\n",
    "    ratio = pctg.values()[i]\n",
    "    subsample_size = subsample.species.size\n",
    "    number = int(ratio*subsample_size)\n",
    "    data[name] = subsample.sample(number)\n",
    "type(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因此解决了批量合并`data.frame`的方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "            sepal_length  sepal_width  petal_length  petal_width\n",
      "species                                                         \n",
      "setosa                50           50            50           50\n",
      "versicolor            50           50            50           50\n",
      "virginica             50           50            50           50\n",
      "            sepal_length  sepal_width  petal_length  petal_width\n",
      "species                                                         \n",
      "setosa                 5            5             5            5\n",
      "versicolor            10           10            10           10\n",
      "virginica             15           15            15           15\n"
     ]
    }
   ],
   "source": [
    "print iris.groupby(\"species\").count()\n",
    "print pd.concat(data).groupby(\"species\").count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "检验结果，发现的确是按照每个group id，分层抽样以此递增。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我想到了一个办法，就是我可以导出为`.md`格式，也是可以deploy的。\n",
    "但是导出方式一定要是`print`的。\n",
    "`blogdown`可以调用`.md`文档。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}