\n",
"\n",
" \n",
"## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
"\n",
"Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
Topic 1. Exploratory data analysis with Pandas\n",
"##
Practice. Analyzing \"Titanic\" passengers. Solution\n",
"\n",
"**Fill in the missing code (\"Your code here\") and choose answers in a [web-form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"pd.set_option(\"display.precision\", 2)\n",
"from matplotlib import pyplot as plt\n",
"\n",
"# Graphics in SVG format are more sharp and legible\n",
"%config InlineBackend.figure_format = 'svg'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read data into a Pandas DataFrame**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**First 5 rows**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's select those passengers who embarked in Cherbourg (Embarked=C) and paid > 200 pounds for their ticker (fare > 200).**\n",
"\n",
"Make sure you understand how actually this construction works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[(data[\"Embarked\"] == \"C\") & (data.Fare > 200)].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We can sort these people by Fare in descending order.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[(data[\"Embarked\"] == \"C\") & (data[\"Fare\"] > 200)].sort_values(\n",
" by=\"Fare\", ascending=False\n",
").head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's create a new feature.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def age_category(age):\n",
" \"\"\"\n",
" < 30 -> 1\n",
" >= 30, <55 -> 2\n",
" >= 55 -> 3\n",
" \"\"\"\n",
" if age < 30:\n",
" return 1\n",
" elif age < 55:\n",
" return 2\n",
" elif age >= 55:\n",
" return 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age_categories = [age_category(age) for age in data.Age]\n",
"data[\"Age_category\"] = age_categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Another way is to do it with `apply`.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[\"Age_category\"] = data[\"Age\"].apply(age_category)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. How many men/women were there onboard?**\n",
"- 412 men and 479 women\n",
"- 314 men и 577 women\n",
"- 479 men и 412 women\n",
"- **577 men и 314 women [+]**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(data[\"Sex\"] == \"male\").sum(), (data[\"Sex\"] == \"female\").sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Easier:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[\"Sex\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. Print the distribution of the `Pclass` feature. Then the same, but for men and women separately. How many men from second class were there onboard?**\n",
"- 104\n",
"- **108 [+]**\n",
"- 112\n",
"- 125"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.crosstab(data[\"Pclass\"], data[\"Sex\"], margins=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can plot a picture as well, though it's not necessary here. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[\"Pclass\"].hist(label=\"all\")\n",
"data[data[\"Sex\"] == \"male\"][\"Pclass\"].hist(color=\"green\", label=\"male\")\n",
"data[data[\"Sex\"] == \"female\"][\"Pclass\"].hist(color=\"yellow\", label=\"female\")\n",
"plt.title(\"Distribution by class and gender.\")\n",
"plt.xlabel(\"Pclass\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.legend(loc=\"upper left\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. What are median and standard deviation of `Fare`?. Round to two decimals.**\n",
"- **median is 14.45, standard deviation is 49.69 [+]**\n",
"- median is 15.1, standard deviation is 12.15\n",
"- median is 13.15, standard deviation is 35.3\n",
"- median is 17.43, standard deviation is 39.1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Median fare: \", round(data[\"Fare\"].median(), 2))\n",
"print(\"Fare std: \", round(data[\"Fare\"].std(), 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. Is that true that the mean age of survived people is higher than that of passengers who eventually died?**\n",
"- Yes\n",
"- **No [+]**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[data[\"Survived\"] == 1][\"Age\"].hist(\n",
" color=\"green\", label=\"Survived\", alpha=0.5, density=True\n",
")\n",
"data[data[\"Survived\"] == 0][\"Age\"].hist(\n",
" color=\"red\", label=\"Died\", alpha=0.5, density=True\n",
")\n",
"plt.title(\"Age for survived and died\")\n",
"plt.xlabel(\"Years\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.legend();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install seaborn\n",
"import seaborn as sns\n",
"\n",
"sns.set()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(data[\"Survived\"], data[\"Age\"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can't see the difference through eye-balling only. Let's calculate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.groupby(\"Survived\")[\"Age\"].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5. Is that true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? What are shares of survived people among young and old people?**\n",
"- 22.7% among young and 40.6% among old\n",
"- **40.6% among young and 22.7% among old [+]**\n",
"- 35.3% among young and 27.4% among old\n",
"- 27.4% among young and 35.3% among old"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"young_survived = data.loc[data[\"Age\"] < 30, \"Survived\"]\n",
"old_survived = data.loc[data[\"Age\"] > 60, \"Survived\"]\n",
"\n",
"print(\n",
" \"Shares of survived people: \\n\\t among young {}%, \\n\\t among old {}%.\".format(\n",
" round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. Is that true that women survived more frequently than men? What are shares of survived people among men and women?**\n",
"- 30.2% among men and 46.2% among women\n",
"- 35.7% among men and 74.2% among women\n",
"- 21.1% among men and 46.2% among women\n",
"- **18.9% among men and 74.2% among women [+]**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"male_survived = data[data[\"Sex\"] == \"male\"][\"Survived\"]\n",
"female_survived = data[data[\"Sex\"] == \"female\"][\"Survived\"]\n",
"\n",
"\n",
"print(\n",
" \"Shares of survived people: \\n\\t among women {}%, \\n\\t among men {}%\".format(\n",
" round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7. What's the most popular first name among male passengers?**\n",
"- Charles\n",
"- Thomas\n",
"- **William [+]**\n",
"- John"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[\"Name\"].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.loc[1, \"Name\"].split(\",\")[1].split()[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"first_names = data.loc[data[\"Sex\"] == \"male\", \"Name\"].apply(\n",
" lambda full_name: full_name.split(\",\")[1].split()[1]\n",
")\n",
"first_names.value_counts().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**8. How is average age for men/women dependent on `Pclass`? Choose all correct statements:**\n",
"- **On average, men of 1 class are older than 40 [+]**\n",
"- On average, women of 1 class are older than 40\n",
"- **Men of all classes are on average older than women of the same class [+]**\n",
"- ** On average, passengers ofthe first class are older than those of the 2nd class who are older than passengers of the 3rd class [+]**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for cl in data[\"Pclass\"].unique():\n",
" for sex in data[\"Sex\"].unique():\n",
" print(\n",
" \"Average age for {0} and class {1}: {2}\".format(\n",
" sex,\n",
" cl,\n",
" round(\n",
" data[(data[\"Sex\"] == sex) & (data[\"Pclass\"] == cl)][\"Age\"].mean(), 2\n",
" ),\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nicer:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for (cl, sex), sub_df in data.groupby([\"Pclass\", \"Sex\"]):\n",
" print(\n",
" \"Average age for {0} and class {1}: {2}\".format(\n",
" sex, cl, round(sub_df[\"Age\"].mean(), 2)\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And even nicer:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.crosstab(data[\"Pclass\"], data[\"Sex\"], values=data[\"Age\"], aggfunc=np.mean)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(data[\"Pclass\"], data[\"Age\"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Useful resources\n",
"* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)\n",
"* Topic 1 \"Exploratory Data Analysis with Pandas\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)\n",
"* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"name": "seminar02_practice_pandas_titanic.ipynb"
},
"nbformat": 4,
"nbformat_minor": 1
}