{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "

Introduction to Python for Data Sciences

Franck Iutzeler
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

\n", "\n", "
Chap. 3 - Data Handling with Pandas
\n", "\n", "

\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1- Pandas\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In a previous chapter, we explored some features of NumPy and notably its arrays. Here we will take a look at the data structures provided by the **Pandas** library.\n", "\n", "Pandas is a newer package built on top of NumPy which provides an efficient implementation of **DataFrames**. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations.\n", "\n", "\n", "\n", "Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Pandas Series\n", "\n", "\n", "A Pandas `Series` is a one-dimensional array of indexed data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.25\n", "1 0.50\n", "2 0.75\n", "3 1.00\n", "dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([0.25, 0.5, 0.75, 1.0])\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The contents can be accessed in the same way as for NumPy arrays, to the difference that when more than one value is selected, the type remains a Pandas ``Series``." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.25 \n" ] } ], "source": [ "print(data[0],type(data[0]))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 0.75\n", "3 1.00\n", "dtype: float64 \n" ] } ], "source": [ "print(data[2:],type(data[2:]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.\n", "\n", "* ``values`` are the contents of the series as a NumPy array" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.25 0.5 0.75 1. ] \n" ] } ], "source": [ "print(data.values,type(data.values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* ``index`` are the indices of the series" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RangeIndex(start=0, stop=4, step=1) \n" ] } ], "source": [ "print(data.index,type(data.index))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series Indices\n", "\n", "The main difference between NumPy arrays and Pandas Series is the presence of this index field. By default, it is set (as in NumPy arrays) as 0,1,..,size_of_the_series but a Series index can be explicitly defined. The indices may be numbers but also strings. Then, the contents of the series *have to* be accessed using these defined indices." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 0.25\n", "b 0.50\n", "c 0.75\n", "d 1.00\n", "dtype: float64\n" ] } ], "source": [ "data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.75\n" ] } ], "source": [ "print(data['c'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 0.25\n", "3 0.50\n", "4 0.75\n", "2 1.00\n", "dtype: float64\n" ] } ], "source": [ "data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0\n" ] } ], "source": [ "print(data[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series and Python Dictionaries [\\*] \n", "\n", "Pandas Series and Python Dictionaries are close semantically: mappping keys to values. However, the implementation of Pandas series is usually more efficient than dictionaries in the context of data science. Naturally, Series can be contructed from dictionaries." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135} \n", "California 38332521\n", "Texas 26448193\n", "New York 19651127\n", "Florida 19552860\n", "Illinois 12882135\n", "dtype: int64 \n" ] } ], "source": [ "population_dict = {'California': 38332521,\n", " 'Texas': 26448193,\n", " 'New York': 19651127,\n", " 'Florida': 19552860,\n", " 'Illinois': 12882135}\n", "population = pd.Series(population_dict)\n", "print(population_dict,type(population_dict))\n", "print(population,type(population))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "38332521" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "population['California']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California 38332521\n", "Texas 26448193\n", "New York 19651127\n", "Florida 19552860\n", "Illinois 12882135\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "population['California':'Illinois']" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Pandas DataFrames\n", "\n", "DataFrames is a fundamental object of Pandas that mimicks what can be found in `R` for instance. Dataframes can be seen as an array of Series: to each `index` (corresponding to an individual for instance or a line in a table), a Dataframe maps multiples values; these values corresponds to the `columns` of the DataFrame which each have a name (as a string). \n", "\n", "\n", "In the following example, we will construct a Dataframe from two Series with common indices. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})\n", "population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297\n", "Florida 19552860 170312\n", "Illinois 12882135 149995 \n" ] } ], "source": [ "states = pd.DataFrame({'Population': population, 'Area': area})\n", "print(states,type(states))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Jupyter notebooks, DataFrames are displayed in a fancier way when the name of the dataframe is typed (instead of using print)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
California38332521423967
Texas26448193695662
New York19651127141297
Florida19552860170312
Illinois12882135149995
\n", "
" ], "text/plain": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297\n", "Florida 19552860 170312\n", "Illinois 12882135 149995" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrames have \n", "* index that are the defined indices as in Series\n", "* columns that are the columns names\n", "* values that return a (2D) NumPy array with the contents" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')\n", "Index(['Population', 'Area'], dtype='object')\n", "[[38332521 423967]\n", " [26448193 695662]\n", " [19651127 141297]\n", " [19552860 170312]\n", " [12882135 149995]] (5, 2)\n" ] } ], "source": [ "print(states.index)\n", "print(states.columns)\n", "print(states.values,type(states.values),states.values.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Warning:* When accessing a Dataframe, `dataframe_name[column_name]` return the corresponding column as a Series. `dataframe_name[index_name]` returns an error! We will see later how to access a specific index." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "California 423967\n", "Texas 695662\n", "New York 141297\n", "Florida 170312\n", "Illinois 149995\n", "Name: Area, dtype: int64 \n" ] } ], "source": [ "print(states['Area'],type(states['Area']))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KeyError: 'California'\n" ] } ], "source": [ "try:\n", " print(states['California'])\n", "except KeyError as error: \n", " print(\"KeyError: \",error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataframe creation\n", "\n", "To create DataFrames, the main methods are:\n", "* from Series (as above)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "California 38332521\n", "Texas 26448193\n", "New York 19651127\n", "Florida 19552860\n", "Illinois 12882135\n", "dtype: int64 \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
California38332521423967
Texas26448193695662
New York19651127141297
Florida19552860170312
Illinois12882135149995
\n", "
" ], "text/plain": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297\n", "Florida 19552860 170312\n", "Illinois 12882135 149995" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(population,type(population))\n", "states = pd.DataFrame({'Population': population, 'Area': area})\n", "states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* from NumPy arrays (the columns and indices are taken as the array's ones)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-1.47908983 0.55834675 -0.68109792]\n", " [ 1.18023681 1.82871481 0.0944462 ]\n", " [-0.22391784 0.26061809 0.68857944]\n", " [-1.75644104 0.74439857 -0.45926716]\n", " [-0.90534641 -1.57246221 2.28871663]] \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.4790900.558347-0.681098
11.1802371.8287150.094446
2-0.2239180.2606180.688579
3-1.7564410.744399-0.459267
4-0.905346-1.5724622.288717
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.479090 0.558347 -0.681098\n", "1 1.180237 1.828715 0.094446\n", "2 -0.223918 0.260618 0.688579\n", "3 -1.756441 0.744399 -0.459267\n", "4 -0.905346 -1.572462 2.288717" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = np.random.randn(5,3)\n", "print(A,type(A))\n", "dfA = pd.DataFrame(A)\n", "dfA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* from a *list* of *dictionaries*. Be careful, each element of the list is an example (corresponding to an automatic index 0,1,...) while each key of the dictonary corresponds to a column." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}] \n", "{'a': 0, 'b': 0} \n" ] } ], "source": [ "data = [{'a': i, 'b': 2 * i} for i in range(3)]\n", "print(data,type(data))\n", "print(data[0],type(data[0]))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
000
112
224
\n", "
" ], "text/plain": [ " a b\n", "0 0 0\n", "1 1 2\n", "2 2 4" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(data)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* from a *file* , typically a csv file (for comma separated values), eventually with the names of the columns as a first line.\n", "\n", "\n", " col_1_name,col_2_name,col_3_name\n", " col_1_v1,col_2_v1,col_3_v1\n", " col_1_v2,col_2_v2,col_3_v2\n", " ...\n", " \n", "For other files types (MS Excel, libSVM, any other separator) see this [part of the doc](https://pandas.pydata.org/pandas-docs/stable/api.html#input-output)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "order,name,height(cm)\r\n", "1,George Washington,189\r\n", "2,John Adams,170\r\n", "3,Thomas Jefferson,189\r\n" ] } ], "source": [ "!head -4 data/president_heights.csv # Jupyter bash command to see the first 4 lines of the file" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ordernameheight(cm)
01George Washington189
12John Adams170
23Thomas Jefferson189
34James Madison163
45James Monroe183
56John Quincy Adams171
67Andrew Jackson185
78Martin Van Buren168
89William Henry Harrison173
910John Tyler183
1011James K. Polk173
1112Zachary Taylor173
1213Millard Fillmore175
1314Franklin Pierce178
1415James Buchanan183
1516Abraham Lincoln193
1617Andrew Johnson178
1718Ulysses S. Grant173
1819Rutherford B. Hayes174
1920James A. Garfield183
2021Chester A. Arthur183
2123Benjamin Harrison168
2225William McKinley170
2326Theodore Roosevelt178
2427William Howard Taft182
2528Woodrow Wilson180
2629Warren G. Harding183
2730Calvin Coolidge178
2831Herbert Hoover182
2932Franklin D. Roosevelt188
3033Harry S. Truman175
3134Dwight D. Eisenhower179
3235John F. Kennedy183
3336Lyndon B. Johnson193
3437Richard Nixon182
3538Gerald Ford183
3639Jimmy Carter177
3740Ronald Reagan185
3841George H. W. Bush188
3942Bill Clinton188
4043George W. Bush182
4144Barack Obama185
4245Donald Trump188
\n", "
" ], "text/plain": [ " order name height(cm)\n", "0 1 George Washington 189\n", "1 2 John Adams 170\n", "2 3 Thomas Jefferson 189\n", "3 4 James Madison 163\n", "4 5 James Monroe 183\n", "5 6 John Quincy Adams 171\n", "6 7 Andrew Jackson 185\n", "7 8 Martin Van Buren 168\n", "8 9 William Henry Harrison 173\n", "9 10 John Tyler 183\n", "10 11 James K. Polk 173\n", "11 12 Zachary Taylor 173\n", "12 13 Millard Fillmore 175\n", "13 14 Franklin Pierce 178\n", "14 15 James Buchanan 183\n", "15 16 Abraham Lincoln 193\n", "16 17 Andrew Johnson 178\n", "17 18 Ulysses S. Grant 173\n", "18 19 Rutherford B. Hayes 174\n", "19 20 James A. Garfield 183\n", "20 21 Chester A. Arthur 183\n", "21 23 Benjamin Harrison 168\n", "22 25 William McKinley 170\n", "23 26 Theodore Roosevelt 178\n", "24 27 William Howard Taft 182\n", "25 28 Woodrow Wilson 180\n", "26 29 Warren G. Harding 183\n", "27 30 Calvin Coolidge 178\n", "28 31 Herbert Hoover 182\n", "29 32 Franklin D. Roosevelt 188\n", "30 33 Harry S. Truman 175\n", "31 34 Dwight D. Eisenhower 179\n", "32 35 John F. Kennedy 183\n", "33 36 Lyndon B. Johnson 193\n", "34 37 Richard Nixon 182\n", "35 38 Gerald Ford 183\n", "36 39 Jimmy Carter 177\n", "37 40 Ronald Reagan 185\n", "38 41 George H. W. Bush 188\n", "39 42 Bill Clinton 188\n", "40 43 George W. Bush 182\n", "41 44 Barack Obama 185\n", "42 45 Donald Trump 188" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('data/president_heights.csv')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Names and Values\n", "\n", "Notice there can be missing values in DataFrames." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
01.02NaN
1NaN34.0
\n", "
" ], "text/plain": [ " a b c\n", "0 1.0 2 NaN\n", "1 NaN 3 4.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can set indices and columns names *a posteriori*" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1-1.4790900.558347-0.681098
41.1802371.8287150.094446
9-0.2239180.2606180.688579
16-1.7564410.744399-0.459267
25-0.905346-1.5724622.288717
\n", "
" ], "text/plain": [ " a b c\n", "1 -1.479090 0.558347 -0.681098\n", "4 1.180237 1.828715 0.094446\n", "9 -0.223918 0.260618 0.688579\n", "16 -1.756441 0.744399 -0.459267\n", "25 -0.905346 -1.572462 2.288717" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfA.columns = ['a','b','c']\n", "dfA.index = [i**2 for i in range(1,6) ]\n", "dfA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
California38332521423967
Texas26448193695662
New York19651127141297
Florida19552860170312
Illinois12882135149995
\n", "
" ], "text/plain": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297\n", "Florida 19552860 170312\n", "Illinois 12882135 149995" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})\n", "population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})\n", "states = pd.DataFrame({'Population': population, 'Area': area})\n", "states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may access columns directly with names, *then* you can access individuals with their index. " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California 423967\n", "Texas 695662\n", "New York 141297\n", "Florida 170312\n", "Illinois 149995\n", "Name: Area, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states['Area']" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "695662" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states['Area']['Texas']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To ease the access, Pandas offers dedicated methods:\n", "* iloc enables to access subparts of the dataframe as if it was a NumPy array." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
California38332521423967
Texas26448193695662
\n", "
" ], "text/plain": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states.iloc[:2]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "California 38332521\n", "Texas 26448193\n", "Name: Population, dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states.iloc[:2,0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* loc does the same but with the explicit names (the last one is included)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
California38332521423967
Texas26448193695662
New York19651127141297
\n", "
" ], "text/plain": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states.loc[:'New York']" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
California38332521423967
Texas26448193695662
New York19651127141297
Florida19552860170312
Illinois12882135149995
\n", "
" ], "text/plain": [ " Population Area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297\n", "Florida 19552860 170312\n", "Illinois 12882135 149995" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states.loc[:,'Population':]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }