{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n",
    "\n",
    "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). Translated and edited by [Sergey Isaev](https://www.linkedin.com/in/isvforall/), [Artem Trunov](https://www.linkedin.com/in/datamove/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> Assignment #1 (demo). Solution\n",
    "## <center>  Exploratory data analysis with Pandas\n",
    "\n",
    "\n",
    "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a1-demo-pandas-and-uci-adult-dataset) + [solution](https://www.kaggle.com/kashnitsky/a1-demo-pandas-and-uci-adult-dataset-solution).**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**In this task you should use Pandas to answer a few questions about the [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset. (You don't have to download the data – it's already  in the repository). Choose the answers in the [web-form](https://docs.google.com/forms/d/1uY7MpI2trKx6FLWZte0uVh3ULV4Cm_tDud0VDFGCOKg).**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unique values of features (for more information please see the link above):\n",
    "- `age`: continuous.\n",
    "- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n",
    "- `fnlwgt`: continuous.\n",
    "- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n",
    "- `education-num`: continuous.\n",
    "- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n",
    "- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n",
    "- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n",
    "- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n",
    "- `sex`: Female, Male.\n",
    "- `capital-gain`: continuous.\n",
    "- `capital-loss`: continuous.\n",
    "- `hours-per-week`: continuous.\n",
    "- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   \n",
    "- `salary`: >50K,<=50K"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "pd.set_option(\"display.max.columns\", 100)\n",
    "# to draw pictures in jupyter notebook\n",
    "%matplotlib inline\n",
    "# we don't like warnings\n",
    "# you can comment the following 2 lines if you'd like to\n",
    "import warnings\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"../../data/adult.data.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1. How many men and women (*sex* feature) are represented in this dataset?** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"sex\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. What is the average age (*age* feature) of women?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[data[\"sex\"] == \"Female\"][\"age\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. What is the proportion of German citizens (*native-country* feature)?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "float((data[\"native-country\"] == \"Germany\").sum()) / data.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4-5. What are mean value and standard deviation of the age of those who recieve more than 50K per year (*salary* feature) and those who receive less than 50K per year? **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ages1 = data[data[\"salary\"] == \">50K\"][\"age\"]\n",
    "ages2 = data[data[\"salary\"] == \"<=50K\"][\"age\"]\n",
    "print(\n",
    "    \"The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.\".format(\n",
    "        round(ages1.mean()),\n",
    "        round(ages1.std(), 1),\n",
    "        round(ages2.mean()),\n",
    "        round(ages2.std(), 1),\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**6. Is it true that people who receive more than 50k have at least high school education? (*education - Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate* feature)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[data[\"salary\"] == \">50K\"][\"education\"].unique()  # No"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7. Display statistics of age for each race (*race* feature) and each gender. Use *groupby()* and *describe()*. Find the maximum age of men of *Amer-Indian-Eskimo* race.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for (race, sex), sub_df in data.groupby([\"race\", \"sex\"]):\n",
    "    print(\"Race: {0}, sex: {1}\".format(race, sex))\n",
    "    print(sub_df[\"age\"].describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8. Among whom the proportion of those who earn a lot(>50K) is more: among married or single men (*marital-status* feature)? Consider married those who have a *marital-status* starting with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered single.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# married men\n",
    "data[(data[\"sex\"] == \"Male\") \n",
    "     & (data[\"marital-status\"].str.startswith(\"Married\"))][\n",
    "    \"salary\"\n",
    "].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# single men\n",
    "data[\n",
    "    (data[\"sex\"] == \"Male\")\n",
    "    & ~(data[\"marital-status\"].str.startswith(\"Married\"))\n",
    "][\"salary\"].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's good to be married :)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**9. What is the maximum number of hours a person works per week (*hours-per-week* feature)? How many people work such a number of hours and what is the percentage of those who earn a lot among them?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_load = data[\"hours-per-week\"].max()\n",
    "print(\"Max time - {0} hours./week.\".format(max_load))\n",
    "\n",
    "num_workaholics = data[data[\"hours-per-week\"] == max_load].shape[0]\n",
    "print(\"Total number of such hard workers {0}\".format(num_workaholics))\n",
    "\n",
    "rich_share = (\n",
    "    float(\n",
    "        data[(data[\"hours-per-week\"] == max_load) & (data[\"salary\"] == \">50K\")].shape[0]\n",
    "    )\n",
    "    / num_workaholics\n",
    ")\n",
    "print(\"Percentage of rich among them {0}%\".format(int(100 * rich_share)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10. Count the average time of work (*hours-per-week*) those who earning a little and a lot (*salary*) for each country (*native-country*).**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Simple method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for (country, salary), sub_df in data.groupby([\"native-country\", \"salary\"]):\n",
    "    print(country, salary, round(sub_df[\"hours-per-week\"].mean(), 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Elegant method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(\n",
    "    data[\"native-country\"],\n",
    "    data[\"salary\"],\n",
    "    values=data[\"hours-per-week\"],\n",
    "    aggfunc=np.mean,\n",
    ").T"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}