{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis and Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of __exploratory data analysis (EDA)__ is to explore attributes across multiple entities to decide what statistical or machine learning techniques to apply to the data. Visualizations are used to assist in understanding the data." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in /opt/conda/lib/python3.8/site-packages (1.2.2)\n", "Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas) (2.8.1)\n", "Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.8/site-packages (from pandas) (2021.1)\n", "Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.8/site-packages (from pandas) (1.19.5)\n", "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Unnamed: 0 | Unnamed: 0.1 | Form | State | Security_Grade | Area_Number | Terrain_Description | Favorable_Influences | Detrimental_Influences | INHABITANTS_Type | ... | max_annual_income | terrain_rolling | white_collar | mixture_or_jewish | professional | business_or_executive | laborer | clerks | mechanics | industrial |
| 0 | 0 | 1 | NS FORM-8 6-1-37 | Maryland | A | 1 | undulating | Very nicely planned residential area of medium... | No | executives professional men | ... | 5000.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | NS FORM-8 6-1-37 | Maryland | A | 2 | rolling | Fairly new suburban area of homogeneous charac... | No | substantial middle class | ... | 5000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 2 | NS FORM-8 6-1-37 | Maryland | A | 3 | rolling | Good residential area. Well planned. | Distance to City | executives professional men | ... | 7000.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
\n", "

3 rows × 43 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 Unnamed: 0.1 Form State Security_Grade \\\n", "0 0 1 NS FORM-8 6-1-37 Maryland A \n", "1 1 0 NS FORM-8 6-1-37 Maryland A \n", "2 2 2 NS FORM-8 6-1-37 Maryland A \n", "\n", " Area_Number Terrain_Description \\\n", "0 1 undulating \n", "1 2 rolling \n", "2 3 rolling \n", "\n", " Favorable_Influences Detrimental_Influences \\\n", "0 Very nicely planned residential area of medium... No \n", "1 Fairly new suburban area of homogeneous charac... No \n", "2 Good residential area. Well planned. Distance to City \n", "\n", " INHABITANTS_Type ... max_annual_income terrain_rolling \\\n", "0 executives professional men ... 5000.0 1 \n", "1 substantial middle class ... 5000.0 1 \n", "2 executives professional men ... 7000.0 1 \n", "\n", " white_collar mixture_or_jewish professional business_or_executive laborer \\\n", "0 0 0 1 1 0 \n", "1 0 0 0 0 0 \n", "2 0 0 1 1 0 \n", "\n", " clerks mechanics industrial \n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "\n", "[3 rows x 43 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# loads the pandas library \n", "import pandas as pd\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning) # Ignore Pandas future warnings\n", "\n", "# creates data frame named df by reading in the Baltimore csv\n", "df = pd.read_csv(\"manipulated_baltimore_data.csv\")\n", "df.head(n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.describe()` function summarizes a data frame column. Since the data type of `max_building_age` is currently type 'object', which in python is an indcator of type 'string', we have to first convert this attribute into a numeric value." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "count 46.000000\n", "mean 30.086957\n", "std 16.497577\n", "min 10.000000\n", "25% 20.000000\n", "50% 25.000000\n", "75% 40.000000\n", "max 65.000000\n", "Name: max_building_age, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['max_building_age'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that `max_building_age` is a numeric type, we see that `describe()` provides __summary statistics__ on this attribute." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "count 46.000000\n", "mean 30.086957\n", "std 16.497577\n", "min 10.000000\n", "25% 20.000000\n", "50% 25.000000\n", "75% 40.000000\n", "max 65.000000\n", "Name: max_building_age, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# converts max_building_age to a numeric type\n", "df[\"max_building_age\"] = pd.to_numeric(df[\"max_building_age\"])\n", "df['max_building_age'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply the same operations to `max_annual_income`."
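, "\n", "\n", "When several columns need the same treatment, the conversion can also be applied to all of them in one step. The sketch below is one possible approach, not the method used in this notebook: `toy` and `numeric_cols` are hypothetical names with made-up values, and `DataFrame.apply` calls `pd.to_numeric` on each selected column.

```python
import pandas as pd

# Hypothetical toy data standing in for the real columns.
toy = pd.DataFrame({
    'max_building_age': ['10', '25', '40'],
    'max_annual_income': ['1000', '2750', '10000'],
    'State': ['Maryland', 'Maryland', 'Maryland'],
})

# Hypothetical list of the columns that should be numeric.
numeric_cols = ['max_building_age', 'max_annual_income']

# apply() passes each selected column to pd.to_numeric in turn.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric)

# describe() on the numeric columns gives one summary column per attribute.
summary = toy[numeric_cols].describe()
print(summary)
```

The text column `State` is left untouched because it is not listed in `numeric_cols`.
"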
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "count 46.000000\n", "mean 3139.130435\n", "std 2009.806874\n", "min 1000.000000\n", "25% 1850.000000\n", "50% 2750.000000\n", "75% 4000.000000\n", "max 10000.000000\n", "Name: max_annual_income, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['max_annual_income'].describe()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "count 46.000000\n", "mean 3139.130435\n", "std 2009.806874\n", "min 1000.000000\n", "25% 1850.000000\n", "50% 2750.000000\n", "75% 4000.000000\n", "max 10000.000000\n", "Name: max_annual_income, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['max_annual_income'] = pd.to_numeric(df['max_annual_income'])\n", "df['max_annual_income'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we create some plots of our data. A __scatter plot__ and a __bar chart__ are shown below." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 1. Hover over different points and explore their additional characteristics. __Note__:`INHABITANTS_F/N` should be multiplied by 100 to be a percent.\n", "2. The different points are clustered by grades. Which clusters have the most variation?\n", "3. How does `BUILDINGS_Construction` vary across the different points?\n", "4. Can you identify a trend overall? " ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excercise 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 1. Recall the preperations done to the INHABITANTS_Foreignborn, how might this have influenced these outcomes?\n", "2. What can you learn from this graph?\n", "3. What do you learn about the different grades?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 1. What can you learn from this graph?\n", "2. What are some explanations for the outcomes?\n", "3. What can you learn about the different grades?\n", "4. Compare to the previous graph, what are the similarities and differences?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }