{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "ChEn-1070: Introduction to Chemical Engineering Spring 2019 UMass Lowell; Profs. Manohar and de Almeida **26Mar2019**\n", "\n", "# 04. Tabular Numerics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Table of Contents\n", "* [Introduction](#introduction)\n", " + Importing the `pandas` package\n", "* [Series](#series)\n", "* [Data Frame](#data-frame)\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "A significant `Python` package that supports tabular calculations (spreadsheet) is called: `Pandas`\n", "\n", " + [Home page](https://pandas.pydata.org/),\n", " + [Documentation](http://pandas.pydata.org/pandas-docs/stable/),\n", " \n", "we will use `Pandas` to build tabular data and perform statistical analysis. We will access `Pandas` with the `import` keyword\n", "\n", " + `import pandas`\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "'''Import pandas'''\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series (1-Dimensional)\n", "\n", "It is a one-dimensional *labeled* array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. For example, note how to create a `pandas` series:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.101340\n", "1 2.292110\n", "2 3.383457\n", "3 4.678454\n", "dtype: float64\n" ] } ], "source": [ "'''Create basic Pandas series'''\n", "\n", "data = [1.10134,2.29211,3.383457,4.678454]\n", "s0 = pd.Series(data)\n", "\n", "print(s0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the overall type is `float64` (*i.e.* floating point with double precision), and the default labels are added: 0,1,2,3" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.10e+00\n", "1 2.29e+00\n", "2 3.38e+00\n", "3 4.68e+00\n", "dtype: float64\n" ] } ], "source": [ "'''Display number of significant digits'''\n", "\n", "pd.options.display.float_format = '{:.2e}'.format\n", "print(s0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are various [statistics](http://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#descriptive-statistics) in `pandas`: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "s0 mean = 2.86384025\n", "s0 sum = 11.455361\n", "s0 std = 1.5271004374941803\n", "s0 var = 2.3320357461949173\n", "s0 sem = 0.7635502187470902\n" ] } ], "source": [ "'''Statistics'''\n", "\n", "print('s0 mean =',s0.mean())\n", "print('s0 sum =',s0.sum())\n", "print('s0 std =',s0.std())\n", "print('s0 var =',s0.var())\n", "print('s0 sem =',s0.sem())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "|Function\t|Description|\n", "|-----------|-----------|\n", "|count\t| Number of non-NA observations |\n", "|sum\t| Sum of values |\n", "|mean\t| Mean of values |\n", "|mad\t| Mean absolute deviation |\n", "|median\t| Arithmetic median of values |\n", "|min\t| Minimum |\n", "|max\t| Maximum |\n", "|mode\t| Mode |\n", "|abs\t| Absolute Value |\n", "|prod\t| Product of values |\n", "|std\t| Bessel-corrected sample standard deviation |\n", "|var\t| Unbiased variance |\n", "|sem\t| Standard error of the mean |\n", "|skew\t| Sample skewness (3rd moment) |\n", "|kurt\t| Sample kurtosis (4th moment) |\n", "|quantile\t| Sample quantile (value at %) |\n", "|cumsum\t| Cumulative sum |\n", "|cumprod\t| Cumulative product |\n", "|cummax\t| Cumulative maximum |\n", "|cummin\t| Cumulative minimum |\n", "\n", "\n", "There is a convenient `describe()` function which computes a variety of summary statistics:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 4.00e+00\n", "mean 2.86e+00\n", "std 1.53e+00\n", "min 1.10e+00\n", "25% 1.99e+00\n", "50% 2.84e+00\n", "75% 3.71e+00\n", "max 4.68e+00\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Describe statistics summary of data'''\n", "\n", "s0.describe()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "s0[3] = 4.678454\n" ] } ], "source": [ "'''Data access with index operator'''\n", "\n", "print(\"s0[3] = \",s0[3])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.10e+00\n", "1 2.29e+00\n", "2 hello\n", "3 4.68e+00\n", "dtype: object\n" ] } ], "source": [ "'''Data assignment with index operator'''\n", "\n", "s0[2] = 'hello'\n", "print(s0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the overall type is now `object` because of the mixed data type." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.678454" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Fast data access with integer index'''\n", "\n", "s0.iat[-1]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Slower than iat[], more generic (see below in Data Frames) access with integer index'''\n", "\n", "s0.iloc[-2]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 1\n", "b 2\n", "c 3\n", "d 4\n", "dtype: int64\n" ] } ], "source": [ "'''Create a Pandas series w/ a dictionary'''\n", "\n", "data = {'a':1,'b':2,'c':3,'d':4}\n", "s1 = pd.Series(data)\n", "\n", "print(s1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The overall type is `int64` (*i.e* integer with double precision), but notice the labels are taken from the data dictionary (the keys are made as labels)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 8.00e+00\n", "b 8.00e+00\n", "c 8.00e+00\n", "d 8.00e+00\n", "dtype: float64\n" ] } ], "source": [ "'''Create a Pandas series w/ the same value'''\n", "\n", "s2 = pd.Series( 8.0, index=['a', 'b', 'c', 'd'] )\n", "\n", "print(s2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data can be a repeated value (a scalar)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "s1['b'] = 2\n", "s2['d'] = 8.0\n" ] } ], "source": [ "'''Data access as a dictionary using the index operator'''\n", "\n", "print(\"s1['b'] = \",s1['b'])\n", "print(\"s2['d'] = \",s2['d'])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Fast data access with integer index'''\n", "\n", "s2.iat[-1]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "symbol H\n", "atomic number 1\n", "group 1\n", "period 1\n", "mass number 1\n", "dtype: object\n", "\n", "symbol D\n", "atomic number 1\n", "group 1\n", "period 1\n", "mass number 2\n", "dtype: object\n", "\n", "symbol Be-9\n", "atomic number 4\n", "group 2\n", "period 2\n", "mass number 9\n", "dtype: object\n", "\n", "symbol Be-10\n", "atomic number 4\n", "group 2\n", "period 2\n", "mass number 10\n", "dtype: object\n" ] } ], "source": [ "'''Create a Pandas series using lists'''\n", "\n", "import pandas as pd\n", "\n", "labels = ['symbol','atomic number','group','period','mass number']\n", "\n", "data = ['H',1,1,1,1] # hydrogen\n", "h1 = pd.Series( data, index=labels)\n", "print(h1)\n", "\n", "print('')\n", "\n", "data = ['D',1,1,1,2] # deuterium\n", "h2 = pd.Series( data, index=labels)\n", "print(h2)\n", "\n", "print('')\n", "\n", "data = ['Be-9',4,2,2,9] # beryllium\n", "be_9 = pd.Series( data, index=labels)\n", "print(be_9)\n", "\n", "print('')\n", "\n", "data = ['Be-10',4,2,2,10] # beryllium\n", "be_10 = pd.Series( data, index=labels)\n", "print(be_10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The overall type is `object` because the data types are mixed. Notice the index labels are more elaborate and provided as a list of `str`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Frame (2-Dimensional)\n", "\n", "It is a two-dimensional (table) *labeled* data structure with various data types. The **row** labels are collectively referred to as the index. The **column** labels appear as a new structure. For example, note how to create a `pandas` data frame:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S1 S2\n", "a 1 8.00e+00\n", "b 2 8.00e+00\n", "c 3 8.00e+00\n", "d 4 8.00e+00\n" ] } ], "source": [ "'''Using existing series as columns'''\n", "\n", "data = {'S1':s1,'S2':s2}\n", "\n", "df1 = pd.DataFrame( data )\n", "\n", "print(df1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['S1', 'S2'], dtype='object')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.columns" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x y\n", "a 1 8.00e+00\n", "b 2 8.00e+00\n", "c 3 8.00e+00\n", "d 4 8.00e+00\n" ] } ], "source": [ "'''Change column labels'''\n", "\n", "df1.columns = ['x','y']\n", "print(df1)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "'''Using a dictionary to create a table'''\n", "\n", "hydrogen = {'symbol':'H', 'atomic_number':1, 'group':1, 'period':1, 'isotopes':[1,2,3]}\n", "helium = {'symbol':'He', 'atomic_number':2, 'group':18, 'period':1, 'isotopes':[3,4]}\n", "lithium = {'symbol':'Li', 'atomic_number':3, 'group':1, 'period':2, 'isotopes':[6,7]}\n", "beryllium = {'symbol':'Be', 'atomic_number':4, 'group':2, 'period':2, 'isotopes':[9,10]}\n", "boron = {'symbol':'B', 'atomic_number':5, 'group':13, 'period':2, 'isotopes':[10,11]}\n", "\n", "data = {'hydrogen':hydrogen,'helium':helium,'lithium':lithium,'beryllium':beryllium,'boron':boron}\n", "\n", "df2 = pd.DataFrame(data)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hydrogenheliumlithiumberylliumboron
atomic_number12345
group1181213
isotopes[1, 2, 3][3, 4][6, 7][9, 10][10, 11]
period11222
symbolHHeLiBeB
\n", "
" ], "text/plain": [ " hydrogen helium lithium beryllium boron\n", "atomic_number 1 2 3 4 5\n", "group 1 18 1 2 13\n", "isotopes [1, 2, 3] [3, 4] [6, 7] [9, 10] [10, 11]\n", "period 1 1 2 2 2\n", "symbol H He Li Be B" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Column selection'''\n", "\n", "type(df2['helium'])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "group 18\n", "isotopes [3, 4]\n", "period 1\n", "Name: helium, dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Rows of a column'''\n", "\n", "df2['helium'][1:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some data access operation:\n", "\n", "|Operation |\tSyntax |\tResult|\n", "|----------|-----------|----------|\n", "|Select column |\tdf[col]\t| Series\n", "|Select row by label |\tdf.loc[label]\t| Series\n", "|Select row by integer | location\tdf.iloc[loc]\t| Series\n", "|Slice rows\tdf[5:10]\t| DataFrame\n", "|Select rows by boolean vector |\tdf[bool_vec]\t| DataFrame\n", "\n", "There is more on data access and [slicing](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "hydrogen [1, 2, 3]\n", "helium [3, 4]\n", "lithium [6, 7]\n", "beryllium [9, 10]\n", "boron [10, 11]\n", "Name: isotopes, dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Access of \"isotopes\" row by label or also called index'''\n", "\n", "df2.loc['isotopes']" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[10, 11]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Access of a table entry using row and column labels'''\n", "\n", "df2.loc['isotopes','boron']" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "hydrogen [1, 2, 3]\n", "helium [3, 4]\n", "lithium [6, 7]\n", "beryllium [9, 10]\n", "boron [10, 11]\n", "Name: isotopes, dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Access of \"isotopes\" row by index'''\n", "\n", "df2.iloc[2]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hydrogenheliumlithiumberylliumboron
period11222
symbolHHeLiBeB
\n", "
" ], "text/plain": [ " hydrogen helium lithium beryllium boron\n", "period 1 1 2 2 2\n", "symbol H He Li Be B" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Access the last two rows by index'''\n", "\n", "df2.iloc[-2:]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Note that iat will not work'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''Note that iat will not work'''\n", "\n", "#df2.iat[-2:]\n", "#df2.iat[-2]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''This pure index access will work'''\n", "\n", "df2.iat[1,3] # row index 1, column index 3, that is: second row, fourth column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 2 }