{ "cells": [ { "cell_type": "code", "execution_count": 52, "id": "initial_id", "metadata": { "collapsed": true, "ExecuteTime": { "end_time": "2024-01-09T23:29:26.707212Z", "start_time": "2024-01-09T23:29:26.670908300Z" } }, "outputs": [ { "data": { "text/plain": " Age Attrition BusinessTravel DailyRate Department \\\n0 21 0.0 Travel_Rarely 391 Research_Development \n1 19 1.0 Travel_Rarely 528 Sales \n2 18 1.0 Travel_Rarely 230 Research_Development \n3 18 0.0 Travel_Rarely 812 Sales \n4 18 1.0 Travel_Frequently 1306 Sales \n\n DistanceFromHome Education EducationField EnvironmentSatisfaction \\\n0 15 College Life_Sciences High \n1 22 Below_College Marketing Very_High \n2 3 Bachelor Life_Sciences High \n3 10 Bachelor Medical Very_High \n4 5 Bachelor Marketing Medium \n\n Gender ... PerformanceRating RelationshipSatisfaction StockOptionLevel \\\n0 Male ... Excellent Very_High 0 \n1 Male ... Excellent Very_High 0 \n2 Male ... Excellent High 0 \n3 Female ... Excellent Low 0 \n4 Male ... Excellent Very_High 0 \n\n TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany \\\n0 0 6 Better 0 \n1 0 2 Good 0 \n2 0 2 Better 0 \n3 0 2 Better 0 \n4 0 3 Better 0 \n\n YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager \n0 0 0 0 \n1 0 0 0 \n2 0 0 0 \n3 0 0 0 \n4 0 0 0 \n\n[5 rows x 31 columns]", "text/html": "
" }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import random as rd\n", "import seaborn as sns\n", "rd.seed(42)\n", "df = pd.read_feather(\"../datasets/attrition.feather\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 28, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1470, 31)\n" ] }, { "data": { "text/plain": "" }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(df.shape)\n", "df[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:35.462202600Z", "start_time": "2024-01-09T18:17:35.288372100Z" } }, "id": "6bc0db10e43f819" }, { "cell_type": "markdown", "source": [ "# Simple sampling" ], "metadata": { "collapsed": false }, "id": "a1c5e970018f62d9" }, { "cell_type": "code", "execution_count": 29, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_simple_samp = df.sample(n=70, random_state=42)\n", "df_simple_samp[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:35.583059900Z", "start_time": "2024-01-09T18:17:35.419232300Z" } }, "id": "15798369a2dca49e" }, { "cell_type": "markdown", "source": [ "# Systematic sampling\n", "Systematic sampling has a problem: if the data has been sorted, or there is some sort of pattern or meaning behind the row order, then the resulting sample may not be representative of the whole population. The problem can be solved by shuffling the rows, but then systematic sampling is equivalent to simple random sampling." ], "metadata": { "collapsed": false }, "id": "c4414501ea754011" }, { "cell_type": "code", "execution_count": 30, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sample_size = 70\n", "pop_size = len(df)\n", "interval = pop_size // sample_size\n", "df_sys_samp = df.iloc[::interval]\n", "df_sys_samp[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:35.828413100Z", "start_time": "2024-01-09T18:17:35.578496200Z" } }, "id": "f9d422e5ed6431b3" }, { "cell_type": "markdown", "source": [ "# Proportional stratified sampling\n", "\n", "* Split the population into subgroups\n", "* Use simple random sampling on every subgroup\n", "\n", "The proportions of each category or subgroup will be similar between the population and sampling data." ], "metadata": { "collapsed": false }, "id": "75658aae51baf9e8" }, { "cell_type": "code", "execution_count": 31, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_strat_samp = df.groupby(\"Department\", observed=False).sample(frac=0.1, random_state=42)\n", "df_strat_samp[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:35.995234500Z", "start_time": "2024-01-09T18:17:35.823517300Z" } }, "id": "359d88c37f86ab5a" }, { "cell_type": "code", "execution_count": 32, "outputs": [ { "data": { "text/plain": "Department\nResearch_Development 0.653741\nSales 0.303401\nHuman_Resources 0.042857\nName: proportion, dtype: float64" }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Department\"].value_counts(normalize=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:35.997690300Z", "start_time": "2024-01-09T18:17:35.988534100Z" } }, "id": "bdb40c09cedb8939" }, { "cell_type": "code", "execution_count": 33, "outputs": [ { "data": { "text/plain": "Department\nResearch_Development 0.653061\nSales 0.306122\nHuman_Resources 0.040816\nName: proportion, dtype: float64" }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_strat_samp[\"Department\"].value_counts(normalize=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.008678100Z", "start_time": "2024-01-09T18:17:35.997690300Z" } }, "id": "5a04b4bfebc60408" }, { "cell_type": "markdown", "source": [ "# Equal counts stratified sampling\n", "The sampling will extract n rows of each category" ], "metadata": { "collapsed": false }, "id": "a8e08efe798af0d5" }, { "cell_type": "code", "execution_count": 34, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_eq_strat_samp = df.groupby(\"Department\", observed=False).sample(n=15, random_state=42)\n", "df_eq_strat_samp[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.195207100Z", "start_time": "2024-01-09T18:17:36.007290100Z" } }, "id": "8fabd3002b70ea61" }, { "cell_type": "code", "execution_count": 35, "outputs": [ { "data": { "text/plain": "Department\nResearch_Development 0.653741\nSales 0.303401\nHuman_Resources 0.042857\nName: proportion, dtype: float64" }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Department\"].value_counts(normalize=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.196275400Z", "start_time": "2024-01-09T18:17:36.189687100Z" } }, "id": "4faaeaee6597546d" }, { "cell_type": "code", "execution_count": 36, "outputs": [ { "data": { "text/plain": "Department\nHuman_Resources 0.333333\nResearch_Development 0.333333\nSales 0.333333\nName: proportion, dtype: float64" }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_eq_strat_samp[\"Department\"].value_counts(normalize=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.207147600Z", "start_time": "2024-01-09T18:17:36.196275400Z" } }, "id": "f8d7b48586a49856" }, { "cell_type": "markdown", "source": [ "# Weighted random sampling\n", "Specify weights to adjust the relative probability of a row being sampled." ], "metadata": { "collapsed": false }, "id": "1cfc1ca86ba620d3" }, { "cell_type": "code", "execution_count": 37, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_weight = df\n", "condition = df_weight[\"Department\"] == \"Sales\"\n", "df_weight[\"weight\"] = np.where(condition, 2, 1) # weight 2 if match - 1 don't match => 2 times the chance of beign picked\n", "df_weight = df.sample(frac=0.1, weights=\"weight\", random_state=42)\n", "df_weight[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.407116800Z", "start_time": "2024-01-09T18:17:36.203767500Z" } }, "id": "ba3b8a14c558df8c" }, { "cell_type": "code", "execution_count": 38, "outputs": [ { "data": { "text/plain": "Department\nResearch_Development 0.653741\nSales 0.303401\nHuman_Resources 0.042857\nName: proportion, dtype: float64" }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.value_counts(\"Department\", normalize=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.420165900Z", "start_time": "2024-01-09T18:17:36.402905900Z" } }, "id": "10f01d89bbf6b641" }, { "cell_type": "code", "execution_count": 39, "outputs": [ { "data": { "text/plain": "Department\nResearch_Development 0.537415\nSales 0.435374\nHuman_Resources 0.027211\nName: proportion, dtype: float64" }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_weight.value_counts(\"Department\", normalize=True)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T18:17:36.421218600Z", "start_time": "2024-01-09T18:17:36.410843Z" } }, "id": "bf432e27e6cb2952" }, { "cell_type": "markdown", "source": [ "# Cluster sampling\n", "* Use simple random sampling to pick some subgroups\n", "* Use simple random sampling on only those subgroups" ], "metadata": { "collapsed": false }, "id": "2db075dd5cba8831" }, { "cell_type": "code", "execution_count": 45, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"C:\\Users\\b061995\\AppData\\Local\\Temp\\ipykernel_21152\\4183575958.py:5: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df_filtered[\"JobRole\"] = df_filtered[\"JobRole\"].cat.remove_unused_categories()\n", "C:\\Users\\b061995\\AppData\\Local\\Temp\\ipykernel_21152\\4183575958.py:6: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n", " df_clust_samp = df_filtered.groupby(\"JobRole\").sample(n=10, random_state=42)\n" ] }, { "data": { "text/plain": "" }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "job_roles = list(df[\"JobRole\"].unique())\n", "job_roles_samp = rd.sample(job_roles, k=4)\n", "condition = df[\"JobRole\"].isin(job_roles_samp)\n", "df_filtered = df[condition]\n", "df_filtered[\"JobRole\"] = df_filtered[\"JobRole\"].cat.remove_unused_categories()\n", "df_clust_samp = df_filtered.groupby(\"JobRole\").sample(n=10, random_state=42)\n", "df_clust_samp[\"Age\"].hist()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T19:08:30.584141Z", "start_time": "2024-01-09T19:08:30.403512700Z" } }, "id": "b59f600b56c026db" }, { "cell_type": "markdown", "source": [ "# Relative error\n", "\n", "$$\n", "RE=100*\\frac{|\\text{population mean} - \\text{sample mean}|}{\\text{population mean}}\n", "$$" ], "metadata": { "collapsed": false }, "id": "3e951798a9ae4a13" }, { "cell_type": "code", "execution_count": 50, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "24.05063291139242\n" ] } ], "source": [ "attrition_srs100 = df.sample(n=100, random_state=42)\n", "mean_attrition_srs100 = attrition_srs100[\"Attrition\"].mean()\n", "rel_error_pct100 = 100 * abs(df[\"Attrition\"].mean() - mean_attrition_srs100) / df[\"Attrition\"].mean()\n", "print(rel_error_pct100)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T22:10:26.772175100Z", "start_time": "2024-01-09T22:10:26.764408100Z" } }, "id": "9cc55bcb181b6fb4" }, { "cell_type": "markdown", "source": [], "metadata": { "collapsed": false }, "id": "3c1dfca1fd820eb7" }, { "cell_type": "markdown", "source": [ "# Bootstrapping\n", "\n", "*Sampling*: going from a population to a smaller sample.\n", "\n", "*Bootstraping*: building up a theorical population from the sample.\n", "\n", "**Process:**\n", "\n", "1. Make a resample of the same size as the original sample.\n", "2. Calculate the statistic of interest for this bootstrap sample.\n", "3. 
Repeat steps 1 and 2 many times; the collected statistics form the bootstrap distribution.\n", "\n", "**Bootstrap distribution mean:**\n", "\n", "* Usually close to the sample mean.\n", "* May not be a good estimate of the population mean (for that, use the sampling distribution).\n", "* Bootstrapping cannot correct bias introduced when the original sample was drawn.\n", "\n", "**Standard error:** the standard deviation of the statistic of interest.\n", "\n", "* The estimated standard error is the standard deviation of the bootstrap distribution of a sample statistic.\n", "* For the mean, standard error multiplied by sqrt(sample_size) approximates the population standard deviation.\n", "* This can be a good estimate of the population standard deviation.\n", "* This estimate is less affected by sampling bias than the bootstrap mean is.\n" ], "metadata": { "collapsed": false }, "id": "df8c9191b9599079" }, { "cell_type": "code", "execution_count": 54, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_resample = df.sample(frac=1, replace=True)\n", "means = []\n", "for i in range(1000):\n", " means.append(np.mean(df.sample(frac=1, replace=True)[\"Age\"]))\n", "sns.histplot(means)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-09T23:29:37.068843300Z", "start_time": "2024-01-09T23:29:36.518302100Z" } }, "id": "5a5f9704be95cd7c" }, { "cell_type": "markdown", "source": [ "# Confidence " ], "metadata": { "collapsed": false }, "id": "d301591b3248bbb1" }, { "cell_type": "code", "execution_count": 58, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": "
", "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.histplot(df[\"Age\"])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-10T00:07:45.479965600Z", "start_time": "2024-01-10T00:07:45.326828900Z" } }, "id": "bfe7a7d694218129" }, { "cell_type": "markdown", "source": [ "Ways to calculate:\n", "\n", "1. Mean plus or minus one standard deviation:" ], "metadata": { "collapsed": false }, "id": "ca908d93de1725dd" }, { "cell_type": "code", "execution_count": 55, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[27.78843603467279, 46.05918301294626]\n" ] } ], "source": [ "mean = np.mean(df[\"Age\"])\n", "c1 = mean - np.std(df[\"Age\"], ddof=1)\n", "c2 = mean + np.std(df[\"Age\"], ddof=1)\n", "print([c1, c2])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-10T00:05:56.985439200Z", "start_time": "2024-01-10T00:05:56.976144100Z" } }, "id": "ce7cabb03b505084" }, { "cell_type": "markdown", "source": [ "2. Quantile method for confidence intervals" ], "metadata": { "collapsed": false }, "id": "b8b9d30bb876311" }, { "cell_type": "code", "execution_count": 57, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[21.0, 56.0]\n" ] } ], "source": [ "q1 = np.quantile(df[\"Age\"], 0.025)\n", "q2 = np.quantile(df[\"Age\"], 0.975)\n", "print([q1, q2])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-10T00:07:02.886095800Z", "start_time": "2024-01-10T00:07:02.876254600Z" } }, "id": "8ad909dac3468ff" }, { "cell_type": "markdown", "source": [ "3. 
Standard error method for confidence intervals, using the inverse cumulative distribution function of the normal distribution" ], "metadata": { "collapsed": false }, "id": "f2c01ee8e89f647c" }, { "cell_type": "code", "execution_count": 61, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[19.018806499779515, 54.82881254783953]\n" ] } ], "source": [ "from scipy.stats import norm\n", "point_estimate = np.mean(df[\"Age\"])\n", "std_error = np.std(df[\"Age\"], ddof=1) # NOTE: this is the standard deviation of Age, not a standard error; the method calls for the std of the bootstrap distribution\n", "lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)\n", "upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)\n", "print([lower, upper])" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-01-10T00:12:11.673899Z", "start_time": "2024-01-10T00:12:11.668428Z" } }, "id": "47c22c76acb0d459" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }