{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
\n",
"\n",
" | \n",
"Introduction to Python for Data Sciences | \n",
"Franck Iutzeler | \n",
"
\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
\n",
"\n",
"Chap. 3 - Data Handling with Pandas \n",
"\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1- Pandas\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In a previous chapter, we explored some features of NumPy and notably its arrays. Here we will take a look at the data structures provided by the **Pandas** library.\n",
"\n",
"Pandas is a newer package built on top of NumPy which provides an efficient implementation of **DataFrames**. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations.\n",
"\n",
"\n",
"\n",
"Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Pandas Series\n",
"\n",
"\n",
"A Pandas `Series` is a one-dimensional array of indexed data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.25\n",
"1 0.50\n",
"2 0.75\n",
"3 1.00\n",
"dtype: float64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([0.25, 0.5, 0.75, 1.0])\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The contents can be accessed in the same way as for NumPy arrays, to the difference that when more than one value is selected, the type remains a Pandas ``Series``."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.25 \n"
]
}
],
"source": [
"print(data[0],type(data[0]))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2 0.75\n",
"3 1.00\n",
"dtype: float64 \n"
]
}
],
"source": [
"print(data[2:],type(data[2:]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The type ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.\n",
"\n",
"* ``values`` are the contents of the series as a NumPy array"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.25 0.5 0.75 1. ] \n"
]
}
],
"source": [
"print(data.values,type(data.values))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* ``index`` are the indices of the series"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RangeIndex(start=0, stop=4, step=1) \n"
]
}
],
"source": [
"print(data.index,type(data.index))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Series Indices\n",
"\n",
"The main difference between NumPy arrays and Pandas Series is the presence of this index field. By default, it is set (as in NumPy arrays) as 0,1,..,size_of_the_series but a Series index can be explicitly defined. The indices may be numbers but also strings. Then, the contents of the series *have to* be accessed using these defined indices."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a 0.25\n",
"b 0.50\n",
"c 0.75\n",
"d 1.00\n",
"dtype: float64\n"
]
}
],
"source": [
"data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])\n",
"print(data)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.75\n"
]
}
],
"source": [
"print(data['c'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 0.25\n",
"3 0.50\n",
"4 0.75\n",
"2 1.00\n",
"dtype: float64\n"
]
}
],
"source": [
"data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])\n",
"print(data)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0\n"
]
}
],
"source": [
"print(data[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Series and Python Dictionaries [\\*] \n",
"\n",
"Pandas Series and Python Dictionaries are close semantically: mappping keys to values. However, the implementation of Pandas series is usually more efficient than dictionaries in the context of data science. Naturally, Series can be contructed from dictionaries."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135} \n",
"California 38332521\n",
"Texas 26448193\n",
"New York 19651127\n",
"Florida 19552860\n",
"Illinois 12882135\n",
"dtype: int64 \n"
]
}
],
"source": [
"population_dict = {'California': 38332521,\n",
" 'Texas': 26448193,\n",
" 'New York': 19651127,\n",
" 'Florida': 19552860,\n",
" 'Illinois': 12882135}\n",
"population = pd.Series(population_dict)\n",
"print(population_dict,type(population_dict))\n",
"print(population,type(population))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"38332521"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"population['California']"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California 38332521\n",
"Texas 26448193\n",
"New York 19651127\n",
"Florida 19552860\n",
"Illinois 12882135\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"population['California':'Illinois']"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Pandas DataFrames\n",
"\n",
"DataFrames is a fundamental object of Pandas that mimicks what can be found in `R` for instance. Dataframes can be seen as an array of Series: to each `index` (corresponding to an individual for instance or a line in a table), a Dataframe maps multiples values; these values corresponds to the `columns` of the DataFrame which each have a name (as a string). \n",
"\n",
"\n",
"In the following example, we will construct a Dataframe from two Series with common indices. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})\n",
"population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297\n",
"Florida 19552860 170312\n",
"Illinois 12882135 149995 \n"
]
}
],
"source": [
"states = pd.DataFrame({'Population': population, 'Area': area})\n",
"print(states,type(states))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Jupyter notebooks, DataFrames are displayed in a fancier way when the name of the dataframe is typed (instead of using print)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" California | \n",
" 38332521 | \n",
" 423967 | \n",
"
\n",
" \n",
" Texas | \n",
" 26448193 | \n",
" 695662 | \n",
"
\n",
" \n",
" New York | \n",
" 19651127 | \n",
" 141297 | \n",
"
\n",
" \n",
" Florida | \n",
" 19552860 | \n",
" 170312 | \n",
"
\n",
" \n",
" Illinois | \n",
" 12882135 | \n",
" 149995 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297\n",
"Florida 19552860 170312\n",
"Illinois 12882135 149995"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrames have \n",
"* index that are the defined indices as in Series\n",
"* columns that are the columns names\n",
"* values that return a (2D) NumPy array with the contents"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')\n",
"Index(['Population', 'Area'], dtype='object')\n",
"[[38332521 423967]\n",
" [26448193 695662]\n",
" [19651127 141297]\n",
" [19552860 170312]\n",
" [12882135 149995]] (5, 2)\n"
]
}
],
"source": [
"print(states.index)\n",
"print(states.columns)\n",
"print(states.values,type(states.values),states.values.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Warning:* When accessing a Dataframe, `dataframe_name[column_name]` return the corresponding column as a Series. `dataframe_name[index_name]` returns an error! We will see later how to access a specific index."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"California 423967\n",
"Texas 695662\n",
"New York 141297\n",
"Florida 170312\n",
"Illinois 149995\n",
"Name: Area, dtype: int64 \n"
]
}
],
"source": [
"print(states['Area'],type(states['Area']))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"KeyError: 'California'\n"
]
}
],
"source": [
"try:\n",
" print(states['California'])\n",
"except KeyError as error: \n",
" print(\"KeyError: \",error)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataframe creation\n",
"\n",
"To create DataFrames, the main methods are:\n",
"* from Series (as above)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"California 38332521\n",
"Texas 26448193\n",
"New York 19651127\n",
"Florida 19552860\n",
"Illinois 12882135\n",
"dtype: int64 \n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" California | \n",
" 38332521 | \n",
" 423967 | \n",
"
\n",
" \n",
" Texas | \n",
" 26448193 | \n",
" 695662 | \n",
"
\n",
" \n",
" New York | \n",
" 19651127 | \n",
" 141297 | \n",
"
\n",
" \n",
" Florida | \n",
" 19552860 | \n",
" 170312 | \n",
"
\n",
" \n",
" Illinois | \n",
" 12882135 | \n",
" 149995 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297\n",
"Florida 19552860 170312\n",
"Illinois 12882135 149995"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(population,type(population))\n",
"states = pd.DataFrame({'Population': population, 'Area': area})\n",
"states"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* from NumPy arrays (the columns and indices are taken as the array's ones)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-1.47908983 0.55834675 -0.68109792]\n",
" [ 1.18023681 1.82871481 0.0944462 ]\n",
" [-0.22391784 0.26061809 0.68857944]\n",
" [-1.75644104 0.74439857 -0.45926716]\n",
" [-0.90534641 -1.57246221 2.28871663]] \n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -1.479090 | \n",
" 0.558347 | \n",
" -0.681098 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.180237 | \n",
" 1.828715 | \n",
" 0.094446 | \n",
"
\n",
" \n",
" 2 | \n",
" -0.223918 | \n",
" 0.260618 | \n",
" 0.688579 | \n",
"
\n",
" \n",
" 3 | \n",
" -1.756441 | \n",
" 0.744399 | \n",
" -0.459267 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.905346 | \n",
" -1.572462 | \n",
" 2.288717 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 -1.479090 0.558347 -0.681098\n",
"1 1.180237 1.828715 0.094446\n",
"2 -0.223918 0.260618 0.688579\n",
"3 -1.756441 0.744399 -0.459267\n",
"4 -0.905346 -1.572462 2.288717"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = np.random.randn(5,3)\n",
"print(A,type(A))\n",
"dfA = pd.DataFrame(A)\n",
"dfA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* from a *list* of *dictionaries*. Be careful, each element of the list is an example (corresponding to an automatic index 0,1,...) while each key of the dictonary corresponds to a column."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}] \n",
"{'a': 0, 'b': 0} \n"
]
}
],
"source": [
"data = [{'a': i, 'b': 2 * i} for i in range(3)]\n",
"print(data,type(data))\n",
"print(data[0],type(data[0]))"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b\n",
"0 0 0\n",
"1 1 2\n",
"2 2 4"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(data)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* from a *file* , typically a csv file (for comma separated values), eventually with the names of the columns as a first line.\n",
"\n",
"\n",
" col_1_name,col_2_name,col_3_name\n",
" col_1_v1,col_2_v1,col_3_v1\n",
" col_1_v2,col_2_v2,col_3_v2\n",
" ...\n",
" \n",
"For other files types (MS Excel, libSVM, any other separator) see this [part of the doc](https://pandas.pydata.org/pandas-docs/stable/api.html#input-output)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"order,name,height(cm)\r\n",
"1,George Washington,189\r\n",
"2,John Adams,170\r\n",
"3,Thomas Jefferson,189\r\n"
]
}
],
"source": [
"!head -4 data/president_heights.csv # Jupyter bash command to see the first 4 lines of the file"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order | \n",
" name | \n",
" height(cm) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" George Washington | \n",
" 189 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" John Adams | \n",
" 170 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Thomas Jefferson | \n",
" 189 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" James Madison | \n",
" 163 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" James Monroe | \n",
" 183 | \n",
"
\n",
" \n",
" 5 | \n",
" 6 | \n",
" John Quincy Adams | \n",
" 171 | \n",
"
\n",
" \n",
" 6 | \n",
" 7 | \n",
" Andrew Jackson | \n",
" 185 | \n",
"
\n",
" \n",
" 7 | \n",
" 8 | \n",
" Martin Van Buren | \n",
" 168 | \n",
"
\n",
" \n",
" 8 | \n",
" 9 | \n",
" William Henry Harrison | \n",
" 173 | \n",
"
\n",
" \n",
" 9 | \n",
" 10 | \n",
" John Tyler | \n",
" 183 | \n",
"
\n",
" \n",
" 10 | \n",
" 11 | \n",
" James K. Polk | \n",
" 173 | \n",
"
\n",
" \n",
" 11 | \n",
" 12 | \n",
" Zachary Taylor | \n",
" 173 | \n",
"
\n",
" \n",
" 12 | \n",
" 13 | \n",
" Millard Fillmore | \n",
" 175 | \n",
"
\n",
" \n",
" 13 | \n",
" 14 | \n",
" Franklin Pierce | \n",
" 178 | \n",
"
\n",
" \n",
" 14 | \n",
" 15 | \n",
" James Buchanan | \n",
" 183 | \n",
"
\n",
" \n",
" 15 | \n",
" 16 | \n",
" Abraham Lincoln | \n",
" 193 | \n",
"
\n",
" \n",
" 16 | \n",
" 17 | \n",
" Andrew Johnson | \n",
" 178 | \n",
"
\n",
" \n",
" 17 | \n",
" 18 | \n",
" Ulysses S. Grant | \n",
" 173 | \n",
"
\n",
" \n",
" 18 | \n",
" 19 | \n",
" Rutherford B. Hayes | \n",
" 174 | \n",
"
\n",
" \n",
" 19 | \n",
" 20 | \n",
" James A. Garfield | \n",
" 183 | \n",
"
\n",
" \n",
" 20 | \n",
" 21 | \n",
" Chester A. Arthur | \n",
" 183 | \n",
"
\n",
" \n",
" 21 | \n",
" 23 | \n",
" Benjamin Harrison | \n",
" 168 | \n",
"
\n",
" \n",
" 22 | \n",
" 25 | \n",
" William McKinley | \n",
" 170 | \n",
"
\n",
" \n",
" 23 | \n",
" 26 | \n",
" Theodore Roosevelt | \n",
" 178 | \n",
"
\n",
" \n",
" 24 | \n",
" 27 | \n",
" William Howard Taft | \n",
" 182 | \n",
"
\n",
" \n",
" 25 | \n",
" 28 | \n",
" Woodrow Wilson | \n",
" 180 | \n",
"
\n",
" \n",
" 26 | \n",
" 29 | \n",
" Warren G. Harding | \n",
" 183 | \n",
"
\n",
" \n",
" 27 | \n",
" 30 | \n",
" Calvin Coolidge | \n",
" 178 | \n",
"
\n",
" \n",
" 28 | \n",
" 31 | \n",
" Herbert Hoover | \n",
" 182 | \n",
"
\n",
" \n",
" 29 | \n",
" 32 | \n",
" Franklin D. Roosevelt | \n",
" 188 | \n",
"
\n",
" \n",
" 30 | \n",
" 33 | \n",
" Harry S. Truman | \n",
" 175 | \n",
"
\n",
" \n",
" 31 | \n",
" 34 | \n",
" Dwight D. Eisenhower | \n",
" 179 | \n",
"
\n",
" \n",
" 32 | \n",
" 35 | \n",
" John F. Kennedy | \n",
" 183 | \n",
"
\n",
" \n",
" 33 | \n",
" 36 | \n",
" Lyndon B. Johnson | \n",
" 193 | \n",
"
\n",
" \n",
" 34 | \n",
" 37 | \n",
" Richard Nixon | \n",
" 182 | \n",
"
\n",
" \n",
" 35 | \n",
" 38 | \n",
" Gerald Ford | \n",
" 183 | \n",
"
\n",
" \n",
" 36 | \n",
" 39 | \n",
" Jimmy Carter | \n",
" 177 | \n",
"
\n",
" \n",
" 37 | \n",
" 40 | \n",
" Ronald Reagan | \n",
" 185 | \n",
"
\n",
" \n",
" 38 | \n",
" 41 | \n",
" George H. W. Bush | \n",
" 188 | \n",
"
\n",
" \n",
" 39 | \n",
" 42 | \n",
" Bill Clinton | \n",
" 188 | \n",
"
\n",
" \n",
" 40 | \n",
" 43 | \n",
" George W. Bush | \n",
" 182 | \n",
"
\n",
" \n",
" 41 | \n",
" 44 | \n",
" Barack Obama | \n",
" 185 | \n",
"
\n",
" \n",
" 42 | \n",
" 45 | \n",
" Donald Trump | \n",
" 188 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order name height(cm)\n",
"0 1 George Washington 189\n",
"1 2 John Adams 170\n",
"2 3 Thomas Jefferson 189\n",
"3 4 James Madison 163\n",
"4 5 James Monroe 183\n",
"5 6 John Quincy Adams 171\n",
"6 7 Andrew Jackson 185\n",
"7 8 Martin Van Buren 168\n",
"8 9 William Henry Harrison 173\n",
"9 10 John Tyler 183\n",
"10 11 James K. Polk 173\n",
"11 12 Zachary Taylor 173\n",
"12 13 Millard Fillmore 175\n",
"13 14 Franklin Pierce 178\n",
"14 15 James Buchanan 183\n",
"15 16 Abraham Lincoln 193\n",
"16 17 Andrew Johnson 178\n",
"17 18 Ulysses S. Grant 173\n",
"18 19 Rutherford B. Hayes 174\n",
"19 20 James A. Garfield 183\n",
"20 21 Chester A. Arthur 183\n",
"21 23 Benjamin Harrison 168\n",
"22 25 William McKinley 170\n",
"23 26 Theodore Roosevelt 178\n",
"24 27 William Howard Taft 182\n",
"25 28 Woodrow Wilson 180\n",
"26 29 Warren G. Harding 183\n",
"27 30 Calvin Coolidge 178\n",
"28 31 Herbert Hoover 182\n",
"29 32 Franklin D. Roosevelt 188\n",
"30 33 Harry S. Truman 175\n",
"31 34 Dwight D. Eisenhower 179\n",
"32 35 John F. Kennedy 183\n",
"33 36 Lyndon B. Johnson 193\n",
"34 37 Richard Nixon 182\n",
"35 38 Gerald Ford 183\n",
"36 39 Jimmy Carter 177\n",
"37 40 Ronald Reagan 185\n",
"38 41 George H. W. Bush 188\n",
"39 42 Bill Clinton 188\n",
"40 43 George W. Bush 182\n",
"41 44 Barack Obama 185\n",
"42 45 Donald Trump 188"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.read_csv('data/president_heights.csv')\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Names and Values\n",
"\n",
"Notice there can be missing values in DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" 3 | \n",
" 4.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c\n",
"0 1.0 2 NaN\n",
"1 NaN 3 4.0"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can set indices and columns names *a posteriori*"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" -1.479090 | \n",
" 0.558347 | \n",
" -0.681098 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.180237 | \n",
" 1.828715 | \n",
" 0.094446 | \n",
"
\n",
" \n",
" 9 | \n",
" -0.223918 | \n",
" 0.260618 | \n",
" 0.688579 | \n",
"
\n",
" \n",
" 16 | \n",
" -1.756441 | \n",
" 0.744399 | \n",
" -0.459267 | \n",
"
\n",
" \n",
" 25 | \n",
" -0.905346 | \n",
" -1.572462 | \n",
" 2.288717 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c\n",
"1 -1.479090 0.558347 -0.681098\n",
"4 1.180237 1.828715 0.094446\n",
"9 -0.223918 0.260618 0.688579\n",
"16 -1.756441 0.744399 -0.459267\n",
"25 -0.905346 -1.572462 2.288717"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfA.columns = ['a','b','c']\n",
"dfA.index = [i**2 for i in range(1,6) ]\n",
"dfA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" California | \n",
" 38332521 | \n",
" 423967 | \n",
"
\n",
" \n",
" Texas | \n",
" 26448193 | \n",
" 695662 | \n",
"
\n",
" \n",
" New York | \n",
" 19651127 | \n",
" 141297 | \n",
"
\n",
" \n",
" Florida | \n",
" 19552860 | \n",
" 170312 | \n",
"
\n",
" \n",
" Illinois | \n",
" 12882135 | \n",
" 149995 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297\n",
"Florida 19552860 170312\n",
"Illinois 12882135 149995"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})\n",
"population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})\n",
"states = pd.DataFrame({'Population': population, 'Area': area})\n",
"states"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may access columns directly with names, *then* you can access individuals with their index. "
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California 423967\n",
"Texas 695662\n",
"New York 141297\n",
"Florida 170312\n",
"Illinois 149995\n",
"Name: Area, dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states['Area']"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"695662"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states['Area']['Texas']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To ease the access, Pandas offers dedicated methods:\n",
"* iloc enables to access subparts of the dataframe as if it was a NumPy array."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" California | \n",
" 38332521 | \n",
" 423967 | \n",
"
\n",
" \n",
" Texas | \n",
" 26448193 | \n",
" 695662 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states.iloc[:2]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"California 38332521\n",
"Texas 26448193\n",
"Name: Population, dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states.iloc[:2,0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* loc does the same but with the explicit names (the last one is included)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" California | \n",
" 38332521 | \n",
" 423967 | \n",
"
\n",
" \n",
" Texas | \n",
" 26448193 | \n",
" 695662 | \n",
"
\n",
" \n",
" New York | \n",
" 19651127 | \n",
" 141297 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states.loc[:'New York']"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" California | \n",
" 38332521 | \n",
" 423967 | \n",
"
\n",
" \n",
" Texas | \n",
" 26448193 | \n",
" 695662 | \n",
"
\n",
" \n",
" New York | \n",
" 19651127 | \n",
" 141297 | \n",
"
\n",
" \n",
" Florida | \n",
" 19552860 | \n",
" 170312 | \n",
"
\n",
" \n",
" Illinois | \n",
" 12882135 | \n",
" 149995 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Population Area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297\n",
"Florida 19552860 170312\n",
"Illinois 12882135 149995"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states.loc[:,'Population':]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}