{ "cells": [ { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "   sepal_length  sepal_width  petal_length  petal_width species\n", "0           5.1          3.5           1.4          0.2  setosa\n", "1           4.9          3.0           1.4          0.2  setosa\n", "2           4.7          3.2           1.3          0.2  setosa\n", "3           4.6          3.1           1.5          0.2  setosa\n", "4           5.0          3.6           1.4          0.2  setosa\n" ] } ], "source": [ "iris = sns.load_dataset('iris')\n", "print(iris.head())" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'], dtype=object)" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.species.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`tolist()` converts a NumPy array to a Python list: np.array $\\to$ list" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['setosa', 'versicolor', 'virginica']" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.species.unique().tolist()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'setosa'" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sampling ratio for each species\n", "pctg = {\n", "    'setosa': 0.1,\n", "    'versicolor': 0.2,\n", "    'virginica': 0.3\n", "}\n", "# list() makes the key view indexable on Python 3 as well as Python 2\n", "list(pctg.keys())[0]\n", "# list(pctg.values())[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for stratified sampling. This was already worked out earlier: it comes down to setting the ratio passed to `sample`.\n", "Here we build a `dict` directly:\n", "\n", "- the `keys()` of the `dict` drive the filter that selects each `subsample`,\n", "- the `values()` of the `dict` give the ratio used in `sample`." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "data = {}\n", "\n", "# For each species, draw a random sample sized by its ratio in pctg\n", "for name, ratio in pctg.items():\n", "    subsample = iris[iris.species == name]\n", "    number = int(ratio * len(subsample))\n", "    data[name] = subsample.sample(number)\n", "type(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This settles how to merge the per-group `DataFrame`s in bulk: collect them in a `dict` and pass it to `pd.concat`." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "            sepal_length  sepal_width  petal_length  petal_width\n", "species                                                         \n", "setosa                50           50            50           50\n", "versicolor            50           50            50           50\n", "virginica             50           50            50           50\n", "            sepal_length  sepal_width  petal_length  petal_width\n", "species                                                         \n", "setosa                 5            5             5            5\n", "versicolor            10           10            10           10\n", "virginica             15           15            15           15\n" ] } ], "source": [ "print(iris.groupby(\"species\").count())\n", "print(pd.concat(data).groupby(\"species\").count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking the result confirms that the sampling is stratified by group: the sample sizes increase across the groups (5, 10, 15), matching the ratios 0.1, 0.2 and 0.3 applied to 50 rows each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One idea: I can export the notebook to `.md`, which can also be deployed.\n", "The outputs must be produced with `print`, though.\n", "`blogdown` can use `.md` documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }