{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataframes 10/7/2019 & 10/25/2019"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.A recap from last time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall we previously had talked about pandas series. Dataframes include columns that are like individual pandas series. They resemble tabular data in Microsoft Excel. Keep in mind it has a row and column index. Let's construct a Dataframe object. A common way with using a dict of equal length lists or numpy arrays. Each column can then be accessed as a series object."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" States | \n",
" Year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.5 | \n",
" Ohio | \n",
" 2000 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.7 | \n",
" Ohio | \n",
" 2001 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.6 | \n",
" Ohio | \n",
" 2002 | \n",
"
\n",
" \n",
" 3 | \n",
" 2.4 | \n",
" Nevada | \n",
" 2001 | \n",
"
\n",
" \n",
" 4 | \n",
" 2.9 | \n",
" Nevada | \n",
" 2002 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population States Year\n",
"0 1.5 Ohio 2000\n",
"1 1.7 Ohio 2001\n",
"2 3.6 Ohio 2002\n",
"3 2.4 Nevada 2001\n",
"4 2.9 Nevada 2002"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"states = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']\n",
"year = [2000, 2001, 2002, 2001, 2002]\n",
"pop = [1.5, 1.7, 3.6, 2.4, 2.9]\n",
"data = {'States': states, 'Year': year, 'Population': pop}\n",
"dataframe = pd.DataFrame(data)\n",
"dataframe #May have to put print(dataframe) in your text editor"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" States | \n",
" Population | \n",
" Year | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" Ohio | \n",
" 1.5 | \n",
" 2000 | \n",
"
\n",
" \n",
" two | \n",
" Ohio | \n",
" 1.7 | \n",
" 2001 | \n",
"
\n",
" \n",
" three | \n",
" Ohio | \n",
" 3.6 | \n",
" 2002 | \n",
"
\n",
" \n",
" four | \n",
" Nevada | \n",
" 2.4 | \n",
" 2001 | \n",
"
\n",
" \n",
" five | \n",
" Nevada | \n",
" 2.9 | \n",
" 2002 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" States Population Year\n",
"one Ohio 1.5 2000\n",
"two Ohio 1.7 2001\n",
"three Ohio 3.6 2002\n",
"four Nevada 2.4 2001\n",
"five Nevada 2.9 2002"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Specify columns, index\n",
"dataframe2 = pd.DataFrame(data, columns = ['States', 'Population', 'Year'],\n",
" index = ['one', 'two', 'three', 'four', 'five'])\n",
"\n",
"dataframe2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting columns will return a series object. Use the syntax below:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one 1.5\n",
"two 1.7\n",
"three 3.6\n",
"four 2.4\n",
"five 2.9\n",
"Name: Population, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataframe2['Population']\n",
"#dataframe2[['Population']] <- this turns into a dataframe object"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this in mind, we can create new columns in a similar matter:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" States | \n",
" Population | \n",
" Year | \n",
" West | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" Ohio | \n",
" 1.5 | \n",
" 2000 | \n",
" False | \n",
"
\n",
" \n",
" two | \n",
" Ohio | \n",
" 1.7 | \n",
" 2001 | \n",
" False | \n",
"
\n",
" \n",
" three | \n",
" Ohio | \n",
" 3.6 | \n",
" 2002 | \n",
" False | \n",
"
\n",
" \n",
" four | \n",
" Nevada | \n",
" 2.4 | \n",
" 2001 | \n",
" True | \n",
"
\n",
" \n",
" five | \n",
" Nevada | \n",
" 2.9 | \n",
" 2002 | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" States Population Year West\n",
"one Ohio 1.5 2000 False\n",
"two Ohio 1.7 2001 False\n",
"three Ohio 3.6 2002 False\n",
"four Nevada 2.4 2001 True\n",
"five Nevada 2.9 2002 True"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataframe2['West'] = ['False', 'False', 'False', 'True', 'True']\n",
"dataframe2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.Indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, we may choose to retrieve one or more columns and possibly assign it to some variable. Most of the time, we can select pieces of the dataframe and analyze them during exploratory data analysis, the phase before machine learning. "
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"salaries = pd.read_csv(\"Position_Salaries.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Position', 'Level', 'Salary'], dtype='object')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.columns # < - Printing columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieve the level and salary columns"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 45000 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 50000 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 60000 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 80000 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 110000 | \n",
"
\n",
" \n",
" 5 | \n",
" 6 | \n",
" 150000 | \n",
"
\n",
" \n",
" 6 | \n",
" 7 | \n",
" 200000 | \n",
"
\n",
" \n",
" 7 | \n",
" 8 | \n",
" 300000 | \n",
"
\n",
" \n",
" 8 | \n",
" 9 | \n",
" 500000 | \n",
"
\n",
" \n",
" 9 | \n",
" 10 | \n",
" 1000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Level Salary\n",
"0 1 45000\n",
"1 2 50000\n",
"2 3 60000\n",
"3 4 80000\n",
"4 5 110000\n",
"5 6 150000\n",
"6 7 200000\n",
"7 8 300000\n",
"8 9 500000\n",
"9 10 1000000"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries[['Level', 'Salary']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print first five rows"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Business Analyst | \n",
" 1 | \n",
" 45000 | \n",
"
\n",
" \n",
" 1 | \n",
" Junior Consultant | \n",
" 2 | \n",
" 50000 | \n",
"
\n",
" \n",
" 2 | \n",
" Senior Consultant | \n",
" 3 | \n",
" 60000 | \n",
"
\n",
" \n",
" 3 | \n",
" Manager | \n",
" 4 | \n",
" 80000 | \n",
"
\n",
" \n",
" 4 | \n",
" Country Manager | \n",
" 5 | \n",
" 110000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Level Salary\n",
"0 Business Analyst 1 45000\n",
"1 Junior Consultant 2 50000\n",
"2 Senior Consultant 3 60000\n",
"3 Manager 4 80000\n",
"4 Country Manager 5 110000"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2a.Boolean Indexing"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" Country Manager | \n",
" 5 | \n",
" 110000 | \n",
"
\n",
" \n",
" 5 | \n",
" Region Manager | \n",
" 6 | \n",
" 150000 | \n",
"
\n",
" \n",
" 6 | \n",
" Partner | \n",
" 7 | \n",
" 200000 | \n",
"
\n",
" \n",
" 7 | \n",
" Senior Partner | \n",
" 8 | \n",
" 300000 | \n",
"
\n",
" \n",
" 8 | \n",
" C-level | \n",
" 9 | \n",
" 500000 | \n",
"
\n",
" \n",
" 9 | \n",
" CEO | \n",
" 10 | \n",
" 1000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Level Salary\n",
"4 Country Manager 5 110000\n",
"5 Region Manager 6 150000\n",
"6 Partner 7 200000\n",
"7 Senior Partner 8 300000\n",
"8 C-level 9 500000\n",
"9 CEO 10 1000000"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries[salaries['Salary'] > 100000]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4 110000\n",
"5 150000\n",
"6 200000\n",
"7 300000\n",
"8 500000\n",
"9 1000000\n",
"Name: Salary, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries['Salary'][salaries['Salary'] > 100000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2b. iloc/loc"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Position Business Analyst\n",
"Level 1\n",
"Salary 45000\n",
"Name: 0, dtype: object"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.iloc[0] #prints first row, returns series object"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Level | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Business Analyst | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" Junior Consultant | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" Senior Consultant | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" Manager | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" Country Manager | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" Region Manager | \n",
" 6 | \n",
"
\n",
" \n",
" 6 | \n",
" Partner | \n",
" 7 | \n",
"
\n",
" \n",
" 7 | \n",
" Senior Partner | \n",
" 8 | \n",
"
\n",
" \n",
" 8 | \n",
" C-level | \n",
" 9 | \n",
"
\n",
" \n",
" 9 | \n",
" CEO | \n",
" 10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Level\n",
"0 Business Analyst 1\n",
"1 Junior Consultant 2\n",
"2 Senior Consultant 3\n",
"3 Manager 4\n",
"4 Country Manager 5\n",
"5 Region Manager 6\n",
"6 Partner 7\n",
"7 Senior Partner 8\n",
"8 C-level 9\n",
"9 CEO 10"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(salaries.iloc[:, [0,1]]) #prints all rows, first two columns; We choose column 1 and column 2 (indexed at 0 & 1)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" Country Manager | \n",
" 5 | \n",
" 110000 | \n",
"
\n",
" \n",
" 5 | \n",
" Region Manager | \n",
" 6 | \n",
" 150000 | \n",
"
\n",
" \n",
" 6 | \n",
" Partner | \n",
" 7 | \n",
" 200000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Level Salary\n",
"4 Country Manager 5 110000\n",
"5 Region Manager 6 150000\n",
"6 Partner 7 200000"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.iloc[[4,5,6], :] #prints 4th, 5th, 6th rows, all columns "
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Junior Consultant | \n",
" 50000 | \n",
"
\n",
" \n",
" 2 | \n",
" Senior Consultant | \n",
" 60000 | \n",
"
\n",
" \n",
" 6 | \n",
" Partner | \n",
" 200000 | \n",
"
\n",
" \n",
" 7 | \n",
" Senior Partner | \n",
" 300000 | \n",
"
\n",
" \n",
" 8 | \n",
" C-level | \n",
" 500000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Salary\n",
"1 Junior Consultant 50000\n",
"2 Senior Consultant 60000\n",
"6 Partner 200000\n",
"7 Senior Partner 300000\n",
"8 C-level 500000"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Slightly bit more complicated:\n",
"import numpy as np\n",
"salaries.iloc[np.r_[1:3, 6:8, 8], np.r_[0:1, 2:3]] #last index is exclusive unless specifically stated"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Business Analyst | \n",
" 1 | \n",
" 45000 | \n",
"
\n",
" \n",
" 1 | \n",
" Junior Consultant | \n",
" 2 | \n",
" 50000 | \n",
"
\n",
" \n",
" 2 | \n",
" Senior Consultant | \n",
" 3 | \n",
" 60000 | \n",
"
\n",
" \n",
" 3 | \n",
" Manager | \n",
" 4 | \n",
" 80000 | \n",
"
\n",
" \n",
" 4 | \n",
" Country Manager | \n",
" 88888 | \n",
" 110000 | \n",
"
\n",
" \n",
" 5 | \n",
" Region Manager | \n",
" 6 | \n",
" 150000 | \n",
"
\n",
" \n",
" 6 | \n",
" Partner | \n",
" 7 | \n",
" 200000 | \n",
"
\n",
" \n",
" 7 | \n",
" Senior Partner | \n",
" 8 | \n",
" 300000 | \n",
"
\n",
" \n",
" 8 | \n",
" C-level | \n",
" 9 | \n",
" 500000 | \n",
"
\n",
" \n",
" 9 | \n",
" CEO | \n",
" 10 | \n",
" 1000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Level Salary\n",
"0 Business Analyst 1 45000\n",
"1 Junior Consultant 2 50000\n",
"2 Senior Consultant 3 60000\n",
"3 Manager 4 80000\n",
"4 Country Manager 88888 110000\n",
"5 Region Manager 6 150000\n",
"6 Partner 7 200000\n",
"7 Senior Partner 8 300000\n",
"8 C-level 9 500000\n",
"9 CEO 10 1000000"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.iloc[4, 1] = 88888 # Change individual value\n",
"salaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"LOC is fairly similar; instead of numbers, use the row/column names. for this specific example, let's change the index of the dataframe to the positions"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" Position | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Business Analyst | \n",
" 1 | \n",
" 45000 | \n",
"
\n",
" \n",
" Junior Consultant | \n",
" 2 | \n",
" 50000 | \n",
"
\n",
" \n",
" Senior Consultant | \n",
" 3 | \n",
" 60000 | \n",
"
\n",
" \n",
" Manager | \n",
" 4 | \n",
" 80000 | \n",
"
\n",
" \n",
" Country Manager | \n",
" 88888 | \n",
" 110000 | \n",
"
\n",
" \n",
" Region Manager | \n",
" 6 | \n",
" 150000 | \n",
"
\n",
" \n",
" Partner | \n",
" 7 | \n",
" 200000 | \n",
"
\n",
" \n",
" Senior Partner | \n",
" 8 | \n",
" 300000 | \n",
"
\n",
" \n",
" C-level | \n",
" 9 | \n",
" 500000 | \n",
"
\n",
" \n",
" CEO | \n",
" 10 | \n",
" 1000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Level Salary\n",
"Position \n",
"Business Analyst 1 45000\n",
"Junior Consultant 2 50000\n",
"Senior Consultant 3 60000\n",
"Manager 4 80000\n",
"Country Manager 88888 110000\n",
"Region Manager 6 150000\n",
"Partner 7 200000\n",
"Senior Partner 8 300000\n",
"C-level 9 500000\n",
"CEO 10 1000000"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.set_index(['Position'], inplace = True)\n",
"salaries"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" Position | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Business Analyst | \n",
" 1 | \n",
" 45000 | \n",
"
\n",
" \n",
" Junior Consultant | \n",
" 2 | \n",
" 50000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Level Salary\n",
"Position \n",
"Business Analyst 1 45000\n",
"Junior Consultant 2 50000"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.loc[['Business Analyst', 'Junior Consultant'], :] #prints first two rows, all columns"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" Position | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Business Analyst | \n",
" 1 | \n",
" 45000 | \n",
"
\n",
" \n",
" Junior Consultant | \n",
" 2 | \n",
" 50000 | \n",
"
\n",
" \n",
" Senior Consultant | \n",
" 3 | \n",
" 60000 | \n",
"
\n",
" \n",
" Manager | \n",
" 4 | \n",
" 80000 | \n",
"
\n",
" \n",
" Country Manager | \n",
" 88888 | \n",
" 110000 | \n",
"
\n",
" \n",
" Region Manager | \n",
" 6 | \n",
" 150000 | \n",
"
\n",
" \n",
" Partner | \n",
" 7 | \n",
" 200000 | \n",
"
\n",
" \n",
" Senior Partner | \n",
" 8 | \n",
" 300000 | \n",
"
\n",
" \n",
" C-level | \n",
" 9 | \n",
" 500000 | \n",
"
\n",
" \n",
" CEO | \n",
" 10 | \n",
" 1000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Level Salary\n",
"Position \n",
"Business Analyst 1 45000\n",
"Junior Consultant 2 50000\n",
"Senior Consultant 3 60000\n",
"Manager 4 80000\n",
"Country Manager 88888 110000\n",
"Region Manager 6 150000\n",
"Partner 7 200000\n",
"Senior Partner 8 300000\n",
"C-level 9 500000\n",
"CEO 10 1000000"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"salaries.loc[:, ['Level', 'Salary']] #prints all rows, second/third column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the 'any' function to determine if any of the values in the Salary column contain values less than $10,000."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(salaries.Salary < 10000).values.any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Arithmetic"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Position | \n",
" Level | \n",
" Salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Business Analyst | \n",
" 1 | \n",
" 90000 | \n",
"
\n",
" \n",
" 1 | \n",
" Junior Consultant | \n",
" 2 | \n",
" 100000 | \n",
"
\n",
" \n",
" 2 | \n",
" Senior Consultant | \n",
" 3 | \n",
" 120000 | \n",
"
\n",
" \n",
" 3 | \n",
" Manager | \n",
" 4 | \n",
" 160000 | \n",
"
\n",
" \n",
" 4 | \n",
" Country Manager | \n",
" 88888 | \n",
" 220000 | \n",
"
\n",
" \n",
" 5 | \n",
" Region Manager | \n",
" 6 | \n",
" 300000 | \n",
"
\n",
" \n",
" 6 | \n",
" Partner | \n",
" 7 | \n",
" 400000 | \n",
"
\n",
" \n",
" 7 | \n",
" Senior Partner | \n",
" 8 | \n",
" 600000 | \n",
"
\n",
" \n",
" 8 | \n",
" C-level | \n",
" 9 | \n",
" 1000000 | \n",
"
\n",
" \n",
" 9 | \n",
" CEO | \n",
" 10 | \n",
" 2000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Position Level Salary\n",
"0 Business Analyst 1 90000\n",
"1 Junior Consultant 2 100000\n",
"2 Senior Consultant 3 120000\n",
"3 Manager 4 160000\n",
"4 Country Manager 88888 220000\n",
"5 Region Manager 6 300000\n",
"6 Partner 7 400000\n",
"7 Senior Partner 8 600000\n",
"8 C-level 9 1000000\n",
"9 CEO 10 2000000"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's reset the index:\n",
"# multiply the salaries by 2.\n",
"salaries.reset_index(inplace = True)\n",
"salaries['Salary'] = salaries['Salary'] * 2\n",
"salaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use other operations when dealing and maniuplating with individual columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filling Misisng Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will be importing the wine dataset. This dataset contains null values (missing values). 5 columns contain some missing data, while 22 rows contain some missing data."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"fixed acidity False\n",
"volatile acidity False\n",
"citric acid True\n",
"residual sugar False\n",
"chlorides True\n",
"free sulfur dioxide False\n",
"total sulfur dioxide False\n",
"density False\n",
"pH True\n",
"sulphates True\n",
"alcohol True\n",
"quality False\n",
"dtype: bool"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine = pd.read_csv(\"winequality-red.csv\")\n",
"wine.isnull().any(axis = 0) #add the .sum operator"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fixed acidity | \n",
" volatile acidity | \n",
" citric acid | \n",
" residual sugar | \n",
" chlorides | \n",
" free sulfur dioxide | \n",
" total sulfur dioxide | \n",
" density | \n",
" pH | \n",
" sulphates | \n",
" alcohol | \n",
" quality | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 7.8 | \n",
" 0.88 | \n",
" 0.00 | \n",
" 2.6 | \n",
" 0.098 | \n",
" 25.0 | \n",
" 67.0 | \n",
" 0.9968 | \n",
" 3.20 | \n",
" 0.68 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 7.8 | \n",
" 0.76 | \n",
" 0.04 | \n",
" 2.3 | \n",
" 0.092 | \n",
" 15.0 | \n",
" 54.0 | \n",
" 0.9970 | \n",
" 3.26 | \n",
" 0.65 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 11.2 | \n",
" 0.28 | \n",
" 0.56 | \n",
" 1.9 | \n",
" 0.075 | \n",
" 17.0 | \n",
" 60.0 | \n",
" 0.9980 | \n",
" 3.16 | \n",
" 0.58 | \n",
" 9.8 | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
"0 7.4 0.70 0.00 1.9 0.076 \n",
"1 7.8 0.88 0.00 2.6 0.098 \n",
"2 7.8 0.76 0.04 2.3 0.092 \n",
"3 11.2 0.28 0.56 1.9 0.075 \n",
"4 7.4 0.70 0.00 1.9 0.076 \n",
"\n",
" free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
"0 11.0 34.0 0.9978 3.51 0.56 \n",
"1 25.0 67.0 0.9968 3.20 0.68 \n",
"2 15.0 54.0 0.9970 3.26 0.65 \n",
"3 17.0 60.0 0.9980 3.16 0.58 \n",
"4 11.0 34.0 0.9978 3.51 0.56 \n",
"\n",
" alcohol quality \n",
"0 9.4 5 \n",
"1 9.8 5 \n",
"2 9.8 5 \n",
"3 9.8 6 \n",
"4 9.4 5 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check the first few rows of the data frame. \n",
"wine.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#wine.dropna(axis=0, inplace = True)\n",
"#wine.dropna(axis = 1, inplace = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, we may want to fill in the missing data with a specific value. Keep in mind that dropping null values or replacing them with a particular value may ultimately affect the accuracy of your model. It's important to utilize certain strategies on how to deal with null values depending on the context of the problem. This is referred to as data imputation. Here is an article on some approaches used: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"fixed acidity False\n",
"volatile acidity False\n",
"citric acid False\n",
"residual sugar False\n",
"chlorides False\n",
"free sulfur dioxide False\n",
"total sulfur dioxide False\n",
"density False\n",
"pH False\n",
"sulphates False\n",
"alcohol False\n",
"quality False\n",
"dtype: bool"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Fill all missing data with zeros\n",
"wine.fillna(0, inplace = True)\n",
"wine.isnull().any(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can fill columns with missing data by specific values corresponding to their column."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"fixed acidity False\n",
"volatile acidity False\n",
"citric acid False\n",
"residual sugar False\n",
"chlorides False\n",
"free sulfur dioxide False\n",
"total sulfur dioxide False\n",
"density False\n",
"pH True\n",
"sulphates True\n",
"alcohol True\n",
"quality False\n",
"dtype: bool"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine.fillna({wine.columns[2]:0, wine.columns[4]:2}, inplace = True) #Columns 2 and 4 values have been replaced. {name of dataframe}.columns provides a list of column names\n",
"wine.isnull().any(axis = 0)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fixed acidity | \n",
" volatile acidity | \n",
" citric acid | \n",
" residual sugar | \n",
" chlorides | \n",
" free sulfur dioxide | \n",
" total sulfur dioxide | \n",
" density | \n",
" pH | \n",
" sulphates | \n",
" alcohol | \n",
" quality | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 7.8 | \n",
" 0.88 | \n",
" 0.00 | \n",
" 2.6 | \n",
" 0.098 | \n",
" 25.0 | \n",
" 67.0 | \n",
" 0.9968 | \n",
" 3.20 | \n",
" 0.68 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 7.8 | \n",
" 0.76 | \n",
" 0.04 | \n",
" 2.3 | \n",
" 0.092 | \n",
" 15.0 | \n",
" 54.0 | \n",
" 0.9970 | \n",
" 3.26 | \n",
" 0.65 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 11.2 | \n",
" 0.28 | \n",
" 0.56 | \n",
" 1.9 | \n",
" 0.075 | \n",
" 17.0 | \n",
" 60.0 | \n",
" 0.9980 | \n",
" 3.16 | \n",
" 0.58 | \n",
" 9.8 | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
"0 7.4 0.70 0.00 1.9 0.076 \n",
"1 7.8 0.88 0.00 2.6 0.098 \n",
"2 7.8 0.76 0.04 2.3 0.092 \n",
"3 11.2 0.28 0.56 1.9 0.075 \n",
"4 7.4 0.70 0.00 1.9 0.076 \n",
"\n",
" free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
"0 11.0 34.0 0.9978 3.51 0.56 \n",
"1 25.0 67.0 0.9968 3.20 0.68 \n",
"2 15.0 54.0 0.9970 3.26 0.65 \n",
"3 17.0 60.0 0.9980 3.16 0.58 \n",
"4 11.0 34.0 0.9978 3.51 0.56 \n",
"\n",
" alcohol quality \n",
"0 9.4 5 \n",
"1 9.8 5 \n",
"2 9.8 5 \n",
"3 9.8 6 \n",
"4 9.4 5 "
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine.head()"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}