\n",
"\n",
" \n",
"## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
"\n",
"Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Topic 2. Visual data analysis\n",
"## Practice. Analyzing \"Titanic\" passengers. Solution\n",
"\n",
"**Fill in the missing code (\"Your code here\"). No need to select answers in a webform.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Kaggle competition \"Titanic: Machine Learning from Disaster\".**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"sns.set()\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# SVG graphics are sharper and more legible\n",
"%config InlineBackend.figure_format = 'svg'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read data**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.read_csv(\"../../data/titanic_train.csv\", index_col=\"PassengerId\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.describe(include=\"all\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's drop `Cabin`, and then drop all rows with missing values.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df = train_df.drop(\"Cabin\", axis=1).dropna()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. Build a single figure showing scatter plots for each pair of the features `Age`, `Fare`, `SibSp`, `Parch` and `Survived` (use `scatter_matrix` from Pandas or `pairplot` from Seaborn).**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(train_df[[\"Survived\", \"Age\", \"Fare\", \"SibSp\", \"Parch\"]]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. How does ticket price (`Fare`) depend on `Pclass`? Build a boxplot.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x=\"Pclass\", y=\"Fare\", data=train_df);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. Let's build the same plot, but restrict `Fare` to values below the 95% quantile of the original column (to drop outliers that make the plot less clear).**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(\n",
" x=\"Pclass\",\n",
" y=\"Fare\",\n",
" data=train_df[train_df[\"Fare\"] < train_df[\"Fare\"].quantile(0.95)],\n",
");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4. How does the percentage of surviving passengers depend on the passengers' gender? Depict it with `Seaborn.countplot` using the `hue` argument.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.crosstab(train_df[\"Sex\"], train_df[\"Survived\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.countplot(x=\"Sex\", hue=\"Survived\", data=train_df);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's do the same for `Survived` and `Pclass`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.countplot(x=\"Pclass\", hue=\"Survived\", data=train_df);"
]
},
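{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count plots show absolute numbers, while the question asks about percentages. Below is a minimal sketch of one way to get per-group shares, using `pd.crosstab` with `normalize=\"index\"`. The frame `toy_df` is a hypothetical stand-in for `train_df`, not real Titanic data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for train_df (not the real dataset)\n",
"toy_df = pd.DataFrame(\n",
"    {\n",
"        \"Sex\": [\"male\", \"male\", \"male\", \"male\", \"female\", \"female\"],\n",
"        \"Survived\": [0, 0, 0, 1, 1, 0],\n",
"    }\n",
")\n",
"\n",
"# normalize=\"index\" divides each row by its total, giving shares within each group\n",
"survival_share = pd.crosstab(toy_df[\"Sex\"], toy_df[\"Survived\"], normalize=\"index\")\n",
"\n",
"# In this notebook, the same call would take train_df[\"Sex\"] and train_df[\"Survived\"]\n",
"survival_share"
]
},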
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5. How does the distribution of ticket prices differ between those who survived and those who didn't? Depict it with `Seaborn.boxplot`.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sad truth: those who survived had typically paid much more for their tickets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x=\"Survived\", y=\"Fare\", data=train_df[train_df[\"Fare\"] < 500]);"
]
},
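{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put numbers on the fare comparison, one option is to summarize `Fare` per survival group with `groupby`. This is only a sketch: `toy_df` and its values are hypothetical and stand in for `train_df`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for train_df (not the real dataset)\n",
"toy_df = pd.DataFrame(\n",
"    {\n",
"        \"Survived\": [0, 0, 0, 1, 1, 1],\n",
"        \"Fare\": [7.0, 8.0, 13.0, 10.0, 26.0, 80.0],\n",
"    }\n",
")\n",
"\n",
"# The median is robust to the extreme fares that stretch the boxplot\n",
"fare_summary = toy_df.groupby(\"Survived\")[\"Fare\"].agg([\"median\", \"mean\", \"count\"])\n",
"\n",
"# In this notebook, the same groupby would be applied to train_df\n",
"fare_summary"
]
},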
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. How does survival depend on passengers' age? Verify (graphically) an assumption that youngsters (< 30 y.o.) survived more frequently than old people (> 55 y.o.).**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first build a boxplot of `Age` by `Survived`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x=\"Survived\", y=\"Age\", data=train_df);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can further split by classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x=\"Survived\", hue=\"Pclass\", y=\"Age\", data=train_df);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hmm, it's hard to conclude anything from these plots. Let's try another way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df[\"age_cat\"] = train_df[\"Age\"].apply(\n",
" lambda age: 1 if age < 30 else 3 if age > 55 else 2\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.crosstab(train_df[\"age_cat\"], train_df[\"Survived\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.countplot(x=\"age_cat\", hue=\"Survived\", data=train_df);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one might have guessed, the fraction of surviving passengers is lower among old people. However, there are too few observations to draw firm conclusions."
]
},
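{
"cell_type": "markdown",
"metadata": {},
"source": [
"The caveat about sample size is easier to judge with the survival rate and the group size side by side. Below is a minimal sketch that uses the same age thresholds as above, with `np.select` swapped in for the lambda. The frame `toy_df` and its values are hypothetical and stand in for `train_df`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for train_df (not the real dataset)\n",
"toy_df = pd.DataFrame(\n",
"    {\n",
"        \"Age\": [5, 22, 29, 30, 41, 55, 56, 70],\n",
"        \"Survived\": [1, 1, 0, 1, 0, 0, 0, 1],\n",
"    }\n",
")\n",
"\n",
"# Same thresholds as the lambda: 1 = under 30, 3 = over 55, 2 = everyone else\n",
"toy_df[\"age_cat\"] = np.select(\n",
"    [toy_df[\"Age\"] < 30, toy_df[\"Age\"] > 55], [1, 3], default=2\n",
")\n",
"\n",
"# The mean of a 0/1 column is the survival rate; count shows how small each group is\n",
"age_summary = toy_df.groupby(\"age_cat\")[\"Survived\"].agg(\n",
"    survival_rate=\"mean\", n_passengers=\"count\"\n",
")\n",
"age_summary"
]
},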
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Useful resources\n",
"* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-practice-solution)\n",
"* Topic 2 \"Visual data analysis in Python\" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-visual-data-analysis-in-python)\n",
"* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"name": "seminar13_optional_practice_trees_titanic.ipynb"
},
"nbformat": 4,
"nbformat_minor": 1
}