{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Data Science

\n", "

Lesson 9

\n", "

Exploratory Analysis

\n", "\n", "
\n", "\n", "
Representing Data
\n", "\n", "
\n", "\n", "
***Original Tutorial:***
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "OVERVIEW\n", "
\n", "\n", "
\n", "Now that our data is cleaned, we will explore it with descriptive and graphical statistics to describe and summarize our variables. In this stage, you will find yourself classifying features and determining their correlation with the target variable and with each other.\n", "
\n", "Most of this lesson is about understanding what your data looks like and how it can be represented in different formats. This is especially useful for drawing conclusions from the data and for feeding the appropriate data into our machine learning algorithms later.\n", "
\n", "Please be sure to understand every diagram and what it is trying to portray.\n", "
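
As a rough illustration of what describing and summarizing variables can look like, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are assumptions for illustration only; they are not the lesson's `data1`, although the column names mirror it.

```python
import pandas as pd

# Hypothetical toy data (not the lesson's data1); columns mirror Titanic features.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 1, 0],
    'Pclass':   [1, 3, 1, 3, 2, 3],
    'Fare':     [80.0, 7.5, 60.0, 8.0, 20.0, 7.0],
})

# Descriptive statistics: count, mean, std, min, quartiles and max per column.
summary = toy.describe()

# Pearson correlation of each feature with the target variable.
target_corr = toy.corr()['Survived'].drop('Survived')

print(summary)
print(target_corr)
```

In this toy table, `Pclass` correlates negatively with `Survived` and `Fare` correlates positively. The plots later in the lesson make the same kind of relationship visible on the real data.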
\n", "\n", "
\n", "\n", "
\n", "REPRESENTING DATA\n", "
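
One common way to represent the relationship between a discrete feature and a 0/1 target is a grouped mean, with a crosstab showing the counts behind it. The sketch below uses a made-up `toy` DataFrame (an assumption for illustration, not the lesson's `data1`) with the same `groupby` and `pd.crosstab` calls this section relies on.

```python
import pandas as pd

# Hypothetical toy data (not the lesson's data1).
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 1, 0, 0, 1],
})

# Survived is coded 0/1, so its mean within each group is that group's survival rate.
rate_by_sex = toy[['Sex', 'Survived']].groupby('Sex', as_index=False).mean()

# The crosstab shows the raw passenger counts behind those rates.
counts = pd.crosstab(toy['Sex'], toy['Survived'])

print(rate_by_sex)
print(counts)
```

Reading the two outputs side by side helps guard against over-interpreting a rate that is based on only a handful of rows.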
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is to get all the code from before in this notebook \n", "# so you can continue without typing everything out again. \n", "# requirements.py should have been provided with this lesson.\n", "%run requirements.py\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Discrete Variable Correlation by Survival using\n", "#group by aka pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html\n", "for x in data1_x:\n", " if data1[x].dtype != 'float64' :\n", " print('Survival Correlation by:', x)\n", " print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())\n", " print('-'*10, '\\n')\n", " \n", "\n", "#using crosstabs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html\n", "print(pd.crosstab(data1['Title'],data1[Target[0]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Explore\"/\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#IMPORTANT: Intentionally plotted different ways for learning purposes only. \n", "\n", "#optional plotting w/pandas: https://pandas.pydata.org/pandas-docs/stable/visualization.html\n", "\n", "#we will use matplotlib.pyplot: https://matplotlib.org/api/pyplot_api.html\n", "\n", "#to organize our graphics will use figure: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure\n", "#subplot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot\n", "#and subplotS: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=matplotlib%20pyplot%20subplots#matplotlib.pyplot.subplots\n", "\n", "#graph distribution of quantitative data\n", "plt.figure(figsize=[16,12])\n", "\n", "plt.subplot(231)\n", "plt.boxplot(x=data1['Fare'], showmeans = True, meanline = True)\n", "plt.title('Fare Boxplot')\n", "plt.ylabel('Fare ($)')\n", "\n", "plt.subplot(232)\n", "plt.boxplot(data1['Age'], showmeans = True, meanline = True)\n", "plt.title('Age Boxplot')\n", "plt.ylabel('Age (Years)')\n", "\n", "plt.subplot(233)\n", "plt.boxplot(data1['FamilySize'], showmeans = True, meanline = True)\n", "plt.title('Family Size Boxplot')\n", "plt.ylabel('Family Size (#)')\n", "\n", "plt.subplot(234)\n", "plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']], \n", " stacked=True, color = ['g','r'],label = ['Survived','Dead'])\n", "plt.title('Fare Histogram by Survival')\n", "plt.xlabel('Fare ($)')\n", "plt.ylabel('# of Passengers')\n", "plt.legend()\n", "\n", "plt.subplot(235)\n", "plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']], \n", " stacked=True, color = ['g','r'],label = ['Survived','Dead'])\n", "plt.title('Age Histogram by Survival')\n", "plt.xlabel('Age (Years)')\n", "plt.ylabel('# of Passengers')\n", "plt.legend()\n", "\n", 
"plt.subplot(236)\n", "plt.hist(x = [data1[data1['Survived']==1]['FamilySize'], data1[data1['Survived']==0]['FamilySize']], \n", " stacked=True, color = ['g','r'],label = ['Survived','Dead'])\n", "plt.title('Family Size Histogram by Survival')\n", "plt.xlabel('Family Size (#)')\n", "plt.ylabel('# of Passengers')\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#we will use seaborn graphics for multi-variable comparison: https://seaborn.pydata.org/api.html\n", "\n", "#graph individual features by survival\n", "fig, saxis = plt.subplots(2, 3,figsize=(16,12))\n", "\n", "sns.barplot(x = 'Embarked', y = 'Survived', data=data1, ax = saxis[0,0])\n", "sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=data1, ax = saxis[0,1])\n", "sns.barplot(x = 'IsAlone', y = 'Survived', order=[1,0], data=data1, ax = saxis[0,2])\n", "\n", "sns.pointplot(x = 'FareBin', y = 'Survived', data=data1, ax = saxis[1,0])\n", "sns.pointplot(x = 'AgeBin', y = 'Survived', data=data1, ax = saxis[1,1])\n", "sns.pointplot(x = 'FamilySize', y = 'Survived', data=data1, ax = saxis[1,2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#graph distribution of qualitative data: Pclass\n", "#we know class mattered in survival, now let's compare class and a 2nd feature\n", "fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(14,12))\n", "\n", "sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = data1, ax = axis1)\n", "axis1.set_title('Pclass vs Fare Survival Comparison')\n", "\n", "# Try running this piece of code with split as False and ovserve what the resulting graph looks like.\n", "sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = data1, split = True, ax = axis2)\n", "axis2.set_title('Pclass vs Age Survival Comparison')\n", "\n", "sns.boxplot(x = 'Pclass', y ='FamilySize', hue = 'Survived', data = data1, ax = axis3)\n", "axis3.set_title('Pclass vs Family 
Size Survival Comparison')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#graph distribution of qualitative data: Sex\n", "#we know sex mattered in survival, now let's compare sex and a 2nd feature\n", "fig, qaxis = plt.subplots(1,3,figsize=(14,12))\n", "\n", "#set each title on the axes that was just drawn (qaxis[i]), not on axis1 from the previous cell\n", "sns.barplot(x = 'Sex', y = 'Survived', hue = 'Embarked', data=data1, ax = qaxis[0])\n", "qaxis[0].set_title('Sex vs Embarked Survival Comparison')\n", "\n", "sns.barplot(x = 'Sex', y = 'Survived', hue = 'Pclass', data=data1, ax = qaxis[1])\n", "qaxis[1].set_title('Sex vs Pclass Survival Comparison')\n", "\n", "sns.barplot(x = 'Sex', y = 'Survived', hue = 'IsAlone', data=data1, ax = qaxis[2])\n", "qaxis[2].set_title('Sex vs IsAlone Survival Comparison')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#more side-by-side comparisons\n", "fig, (maxis1, maxis2) = plt.subplots(1, 2,figsize=(14,12))\n", "\n", "#how does family size factor with sex & survival compare\n", "sns.pointplot(x=\"FamilySize\", y=\"Survived\", hue=\"Sex\", data=data1,\n", " palette={\"male\": \"blue\", \"female\": \"pink\"},\n", " markers=[\"*\", \"o\"], linestyles=[\"-\", \"--\"], ax = maxis1)\n", "\n", "#how does class factor with sex & survival compare\n", "sns.pointplot(x=\"Pclass\", y=\"Survived\", hue=\"Sex\", data=data1,\n", " palette={\"male\": \"blue\", \"female\": \"pink\"},\n", " markers=[\"*\", \"o\"], linestyles=[\"-\", \"--\"], ax = maxis2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#how does embark port factor with class, sex, and survival compare\n", "#facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html\n", "e = sns.FacetGrid(data1, col = 'Embarked')\n", "e.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette = 'deep')\n", "e.add_legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#plot 
distributions of age of passengers who survived or did not survive\n", "a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )\n", "a.map(sns.kdeplot, 'Age', shade= True )\n", "a.set(xlim=(0 , data1['Age'].max()))\n", "a.add_legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#histogram comparison of sex, class, and age by survival\n", "h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')\n", "h.map(plt.hist, 'Age', alpha = .75)\n", "h.add_legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#pair plots of entire dataset\n", "pp = sns.pairplot(data1, hue = 'Survived', palette = 'deep', size=1.2, diag_kind = 'kde', diag_kws=dict(shade=True), plot_kws=dict(s=10) )\n", "pp.set(xticklabels=[])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#correlation heatmap of dataset\n", "def correlation_heatmap(df):\n", "    _ , ax = plt.subplots(figsize =(14, 12))\n", "    colormap = sns.diverging_palette(220, 10, as_cmap = True)\n", "\n", "    _ = sns.heatmap(\n", "        df.corr(),\n", "        cmap = colormap,\n", "        square=True,\n", "        cbar_kws={'shrink':.9 },\n", "        ax=ax,\n", "        annot=True,\n", "        linewidths=0.1,vmax=1.0, linecolor='white',\n", "        annot_kws={'fontsize':12 }\n", "    )\n", "\n", "    plt.title('Pearson Correlation of Features', y=1.05, size=15)\n", "\n", "correlation_heatmap(data1)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "
\n", "STOP\n", "
\n", ">- Please be sure you understand ALL THE ABOVE DIAGRAMS AND CODE and what each step does.\n", ">- Read the comments, and if you still don't understand, run each line of code individually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }