{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataframes 10/7/2019 & 10/25/2019" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.A recap from last time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall we previously had talked about pandas series. Dataframes include columns that are like individual pandas series. They resemble tabular data in Microsoft Excel. Keep in mind it has a row and column index. Let's construct a Dataframe object. A common way with using a dict of equal length lists or numpy arrays. Each column can then be accessed as a series object." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationStatesYear
01.5Ohio2000
11.7Ohio2001
23.6Ohio2002
32.4Nevada2001
42.9Nevada2002
\n", "
" ], "text/plain": [ " Population States Year\n", "0 1.5 Ohio 2000\n", "1 1.7 Ohio 2001\n", "2 3.6 Ohio 2002\n", "3 2.4 Nevada 2001\n", "4 2.9 Nevada 2002" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "states = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']\n", "year = [2000, 2001, 2002, 2001, 2002]\n", "pop = [1.5, 1.7, 3.6, 2.4, 2.9]\n", "data = {'States': states, 'Year': year, 'Population': pop}\n", "dataframe = pd.DataFrame(data)\n", "dataframe #May have to put print(dataframe) in your text editor" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StatesPopulationYear
oneOhio1.52000
twoOhio1.72001
threeOhio3.62002
fourNevada2.42001
fiveNevada2.92002
\n", "
" ], "text/plain": [ " States Population Year\n", "one Ohio 1.5 2000\n", "two Ohio 1.7 2001\n", "three Ohio 3.6 2002\n", "four Nevada 2.4 2001\n", "five Nevada 2.9 2002" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Specify columns, index\n", "dataframe2 = pd.DataFrame(data, columns = ['States', 'Population', 'Year'],\n", " index = ['one', 'two', 'three', 'four', 'five'])\n", "\n", "dataframe2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting columns will return a series object. Use the syntax below:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one 1.5\n", "two 1.7\n", "three 3.6\n", "four 2.4\n", "five 2.9\n", "Name: Population, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataframe2['Population']\n", "#dataframe2[['Population']] <- this turns into a dataframe object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this in mind, we can create new columns in a similar matter:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StatesPopulationYearWest
oneOhio1.52000False
twoOhio1.72001False
threeOhio3.62002False
fourNevada2.42001True
fiveNevada2.92002True
\n", "
" ], "text/plain": [ " States Population Year West\n", "one Ohio 1.5 2000 False\n", "two Ohio 1.7 2001 False\n", "three Ohio 3.6 2002 False\n", "four Nevada 2.4 2001 True\n", "five Nevada 2.9 2002 True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataframe2['West'] = ['False', 'False', 'False', 'True', 'True']\n", "dataframe2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, we may choose to retrieve one or more columns and possibly assign it to some variable. Most of the time, we can select pieces of the dataframe and analyze them during exploratory data analysis, the phase before machine learning. " ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": true }, "outputs": [], "source": [ "salaries = pd.read_csv(\"Position_Salaries.csv\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['Position', 'Level', 'Salary'], dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.columns # < - Printing columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve the level and salary columns" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LevelSalary
0145000
1250000
2360000
3480000
45110000
56150000
67200000
78300000
89500000
9101000000
\n", "
" ], "text/plain": [ " Level Salary\n", "0 1 45000\n", "1 2 50000\n", "2 3 60000\n", "3 4 80000\n", "4 5 110000\n", "5 6 150000\n", "6 7 200000\n", "7 8 300000\n", "8 9 500000\n", "9 10 1000000" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries[['Level', 'Salary']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print first five rows" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionLevelSalary
0Business Analyst145000
1Junior Consultant250000
2Senior Consultant360000
3Manager480000
4Country Manager5110000
\n", "
" ], "text/plain": [ " Position Level Salary\n", "0 Business Analyst 1 45000\n", "1 Junior Consultant 2 50000\n", "2 Senior Consultant 3 60000\n", "3 Manager 4 80000\n", "4 Country Manager 5 110000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2a.Boolean Indexing" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionLevelSalary
4Country Manager5110000
5Region Manager6150000
6Partner7200000
7Senior Partner8300000
8C-level9500000
9CEO101000000
\n", "
" ], "text/plain": [ " Position Level Salary\n", "4 Country Manager 5 110000\n", "5 Region Manager 6 150000\n", "6 Partner 7 200000\n", "7 Senior Partner 8 300000\n", "8 C-level 9 500000\n", "9 CEO 10 1000000" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries[salaries['Salary'] > 100000]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4 110000\n", "5 150000\n", "6 200000\n", "7 300000\n", "8 500000\n", "9 1000000\n", "Name: Salary, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries['Salary'][salaries['Salary'] > 100000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2b. iloc/loc" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Position Business Analyst\n", "Level 1\n", "Salary 45000\n", "Name: 0, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.iloc[0] #prints first row, returns series object" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionLevel
0Business Analyst1
1Junior Consultant2
2Senior Consultant3
3Manager4
4Country Manager5
5Region Manager6
6Partner7
7Senior Partner8
8C-level9
9CEO10
\n", "
" ], "text/plain": [ " Position Level\n", "0 Business Analyst 1\n", "1 Junior Consultant 2\n", "2 Senior Consultant 3\n", "3 Manager 4\n", "4 Country Manager 5\n", "5 Region Manager 6\n", "6 Partner 7\n", "7 Senior Partner 8\n", "8 C-level 9\n", "9 CEO 10" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(salaries.iloc[:, [0,1]]) #prints all rows, first two columns; We choose column 1 and column 2 (indexed at 0 & 1)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionLevelSalary
4Country Manager5110000
5Region Manager6150000
6Partner7200000
\n", "
" ], "text/plain": [ " Position Level Salary\n", "4 Country Manager 5 110000\n", "5 Region Manager 6 150000\n", "6 Partner 7 200000" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.iloc[[4,5,6], :] #prints 4th, 5th, 6th rows, all columns " ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionSalary
1Junior Consultant50000
2Senior Consultant60000
6Partner200000
7Senior Partner300000
8C-level500000
\n", "
" ], "text/plain": [ " Position Salary\n", "1 Junior Consultant 50000\n", "2 Senior Consultant 60000\n", "6 Partner 200000\n", "7 Senior Partner 300000\n", "8 C-level 500000" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Slightly bit more complicated:\n", "import numpy as np\n", "salaries.iloc[np.r_[1:3, 6:8, 8], np.r_[0:1, 2:3]] #last index is exclusive unless specifically stated" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionLevelSalary
0Business Analyst145000
1Junior Consultant250000
2Senior Consultant360000
3Manager480000
4Country Manager88888110000
5Region Manager6150000
6Partner7200000
7Senior Partner8300000
8C-level9500000
9CEO101000000
\n", "
" ], "text/plain": [ " Position Level Salary\n", "0 Business Analyst 1 45000\n", "1 Junior Consultant 2 50000\n", "2 Senior Consultant 3 60000\n", "3 Manager 4 80000\n", "4 Country Manager 88888 110000\n", "5 Region Manager 6 150000\n", "6 Partner 7 200000\n", "7 Senior Partner 8 300000\n", "8 C-level 9 500000\n", "9 CEO 10 1000000" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.iloc[4, 1] = 88888 # Change individual value\n", "salaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LOC is fairly similar; instead of numbers, use the row/column names. for this specific example, let's change the index of the dataframe to the positions" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LevelSalary
Position
Business Analyst145000
Junior Consultant250000
Senior Consultant360000
Manager480000
Country Manager88888110000
Region Manager6150000
Partner7200000
Senior Partner8300000
C-level9500000
CEO101000000
\n", "
" ], "text/plain": [ " Level Salary\n", "Position \n", "Business Analyst 1 45000\n", "Junior Consultant 2 50000\n", "Senior Consultant 3 60000\n", "Manager 4 80000\n", "Country Manager 88888 110000\n", "Region Manager 6 150000\n", "Partner 7 200000\n", "Senior Partner 8 300000\n", "C-level 9 500000\n", "CEO 10 1000000" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.set_index(['Position'], inplace = True)\n", "salaries" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LevelSalary
Position
Business Analyst145000
Junior Consultant250000
\n", "
" ], "text/plain": [ " Level Salary\n", "Position \n", "Business Analyst 1 45000\n", "Junior Consultant 2 50000" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.loc[['Business Analyst', 'Junior Consultant'], :] #prints first two rows, all columns" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LevelSalary
Position
Business Analyst145000
Junior Consultant250000
Senior Consultant360000
Manager480000
Country Manager88888110000
Region Manager6150000
Partner7200000
Senior Partner8300000
C-level9500000
CEO101000000
\n", "
" ], "text/plain": [ " Level Salary\n", "Position \n", "Business Analyst 1 45000\n", "Junior Consultant 2 50000\n", "Senior Consultant 3 60000\n", "Manager 4 80000\n", "Country Manager 88888 110000\n", "Region Manager 6 150000\n", "Partner 7 200000\n", "Senior Partner 8 300000\n", "C-level 9 500000\n", "CEO 10 1000000" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "salaries.loc[:, ['Level', 'Salary']] #prints all rows, second/third column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the 'any' function to determine if any of the values in the Salary column contain values less than $10,000." ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(salaries.Salary < 10000).values.any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Arithmetic" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionLevelSalary
0Business Analyst190000
1Junior Consultant2100000
2Senior Consultant3120000
3Manager4160000
4Country Manager88888220000
5Region Manager6300000
6Partner7400000
7Senior Partner8600000
8C-level91000000
9CEO102000000
\n", "
" ], "text/plain": [ " Position Level Salary\n", "0 Business Analyst 1 90000\n", "1 Junior Consultant 2 100000\n", "2 Senior Consultant 3 120000\n", "3 Manager 4 160000\n", "4 Country Manager 88888 220000\n", "5 Region Manager 6 300000\n", "6 Partner 7 400000\n", "7 Senior Partner 8 600000\n", "8 C-level 9 1000000\n", "9 CEO 10 2000000" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's reset the index:\n", "# multiply the salaries by 2.\n", "salaries.reset_index(inplace = True)\n", "salaries['Salary'] = salaries['Salary'] * 2\n", "salaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use other operations when dealing and maniuplating with individual columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filling Misisng Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be importing the wine dataset. This dataset contains null values (missing values). 5 columns contain some missing data, while 22 rows contain some missing data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fixed acidity False\n", "volatile acidity False\n", "citric acid True\n", "residual sugar False\n", "chlorides True\n", "free sulfur dioxide False\n", "total sulfur dioxide False\n", "density False\n", "pH True\n", "sulphates True\n", "alcohol True\n", "quality False\n", "dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine = pd.read_csv(\"winequality-red.csv\")\n", "wine.isnull().any(axis = 0) #add the .sum operator" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholquality
07.40.700.001.90.07611.034.00.99783.510.569.45
17.80.880.002.60.09825.067.00.99683.200.689.85
27.80.760.042.30.09215.054.00.99703.260.659.85
311.20.280.561.90.07517.060.00.99803.160.589.86
47.40.700.001.90.07611.034.00.99783.510.569.45
\n", "
" ], "text/plain": [ " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", "0 7.4 0.70 0.00 1.9 0.076 \n", "1 7.8 0.88 0.00 2.6 0.098 \n", "2 7.8 0.76 0.04 2.3 0.092 \n", "3 11.2 0.28 0.56 1.9 0.075 \n", "4 7.4 0.70 0.00 1.9 0.076 \n", "\n", " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", "0 11.0 34.0 0.9978 3.51 0.56 \n", "1 25.0 67.0 0.9968 3.20 0.68 \n", "2 15.0 54.0 0.9970 3.26 0.65 \n", "3 17.0 60.0 0.9980 3.16 0.58 \n", "4 11.0 34.0 0.9978 3.51 0.56 \n", "\n", " alcohol quality \n", "0 9.4 5 \n", "1 9.8 5 \n", "2 9.8 5 \n", "3 9.8 6 \n", "4 9.4 5 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check the first few rows of the data frame. \n", "wine.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#wine.dropna(axis=0, inplace = True)\n", "#wine.dropna(axis = 1, inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, we may want to fill in the missing data with a specific value. Keep in mind that dropping null values or replacing them with a particular value may ultimately affect the accuracy of your model. It's important to utilize certain strategies on how to deal with null values depending on the context of the problem. This is referred to as data imputation. Here is an article on some approaches used: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fixed acidity False\n", "volatile acidity False\n", "citric acid False\n", "residual sugar False\n", "chlorides False\n", "free sulfur dioxide False\n", "total sulfur dioxide False\n", "density False\n", "pH False\n", "sulphates False\n", "alcohol False\n", "quality False\n", "dtype: bool" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Fill all missing data with zeros\n", "wine.fillna(0, inplace = True)\n", "wine.isnull().any(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can fill columns with missing data by specific values corresponding to their column." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fixed acidity False\n", "volatile acidity False\n", "citric acid False\n", "residual sugar False\n", "chlorides False\n", "free sulfur dioxide False\n", "total sulfur dioxide False\n", "density False\n", "pH True\n", "sulphates True\n", "alcohol True\n", "quality False\n", "dtype: bool" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine.fillna({wine.columns[2]:0, wine.columns[4]:2}, inplace = True) #Columns 2 and 4 values have been replaced. {name of dataframe}.columns provides a list of column names\n", "wine.isnull().any(axis = 0)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholquality
07.40.700.001.90.07611.034.00.99783.510.569.45
17.80.880.002.60.09825.067.00.99683.200.689.85
27.80.760.042.30.09215.054.00.99703.260.659.85
311.20.280.561.90.07517.060.00.99803.160.589.86
47.40.700.001.90.07611.034.00.99783.510.569.45
\n", "
" ], "text/plain": [ " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", "0 7.4 0.70 0.00 1.9 0.076 \n", "1 7.8 0.88 0.00 2.6 0.098 \n", "2 7.8 0.76 0.04 2.3 0.092 \n", "3 11.2 0.28 0.56 1.9 0.075 \n", "4 7.4 0.70 0.00 1.9 0.076 \n", "\n", " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", "0 11.0 34.0 0.9978 3.51 0.56 \n", "1 25.0 67.0 0.9968 3.20 0.68 \n", "2 15.0 54.0 0.9970 3.26 0.65 \n", "3 17.0 60.0 0.9980 3.16 0.58 \n", "4 11.0 34.0 0.9978 3.51 0.56 \n", "\n", " alcohol quality \n", "0 9.4 5 \n", "1 9.8 5 \n", "2 9.8 5 \n", "3 9.8 6 \n", "4 9.4 5 " ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine.head()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" } }, "nbformat": 4, "nbformat_minor": 1 }