{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section1_2-Introduction-to-Pandas.ipynb)\n", "\n", "# Introduction to Pandas\n", "\n", "**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with *relational* or *labeled* data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python. \n", "\n", "pandas is well suited for:\n", "\n", "- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet\n", "- Ordered and unordered (not necessarily fixed-frequency) time series data.\n", "- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels\n", "- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure\n", "\n", "\n", "Key features:\n", " \n", "- Easy handling of **missing data**\n", "- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects\n", "- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically\n", "- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets\n", "- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets\n", "- Intuitive **merging and joining** data sets\n", "- Flexible **reshaping and pivoting** of data sets\n", "- **Hierarchical labeling** of axes\n", "- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5\n", "- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "DATA_PATH = 'https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Data Structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series\n", "\n", "A **Series** is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "counts = pd.Series([632, 1638, 569, 115])\n", "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "counts.values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "counts.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can assign meaningful labels to the index, if they are available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria = pd.Series([632, 1638, 569, 115], \n", " index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])\n", "\n", "bacteria" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These labels can be used to refer to the values in the `Series`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria['Actinobacteria']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria[[name.endswith('bacteria') for name in bacteria.index]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[name.endswith('bacteria') for name in bacteria.index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the indexing operation preserved the association between the values and the corresponding indices.\n", "\n", "We can still use positional indexing if we wish." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can give both the array of values and the index meaningful labels themselves:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria.name = 'counts'\n", "bacteria.index.name = 'phylum'\n", "bacteria" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy's math functions and other operations can be applied to Series without losing the data structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.log(bacteria)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also filter according to the values in the `Series`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria[bacteria>1000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria_dict = {'Proteobacteria': 1638, 'Actinobacteria': 569,'Firmicutes': 632, \n", " 'Bacteroidetes': 115}\n", "pd.Series(bacteria_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corrsponding values as missing. Pandas uses the `NaN` (not a number) type for missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria2 = pd.Series(bacteria_dict, \n", " index=['Cyanobacteria','Firmicutes',\n", " 'Proteobacteria','Actinobacteria'])\n", "bacteria2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria2.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Critically, the labels are used to **align data** when used in operations with other Series objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria + bacteria2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Contrast this with NumPy arrays, where arrays of the same length will combine values element-wise; adding Series combined values with the same label in the resulting series. Notice also that the missing values were propogated by addition." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame\n", "\n", "Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.\n", "\n", "A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],\n", " 'patient':[1, 1, 1, 1, 2, 2, 2, 2],\n", " 'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', \n", " 'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change the order by indexing them in the order we desire:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[['phylum','value','patient']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataFrame` has a second index, representing the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `dtypes` attribute reveals the data type for each column in our DataFrame. \n", "\n", "- `int64` is numeric integer values \n", "- `object` strings (letters and numbers)\n", "- `float64` floating-point values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wish to access columns, we can do so either by dict-like indexing or by attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['patient']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.patient" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(data.value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[['value']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice this is different than with `Series`, where dict-like indexing retrieved a particular element (row). \n", "\n", "If we want access to a row in a `DataFrame`, we index its `loc` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.loc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Try out these commands to see what they return:\n", "\n", "- `data.head()`\n", "- `data.tail(3)`\n", "- `data.shape`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative way of initializing a `DataFrame` is with a list of dicts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame([{'patient': 1, 'phylum': 'Firmicutes', 'value': 632},\n", " {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},\n", " {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},\n", " {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},\n", " {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},\n", " {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},\n", " {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},\n", " {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Its important to note that the Series returned when a DataFrame is indexed is merely a **view** on the DataFrame, and not a copy of the data itself. So you must be cautious when manipulating this data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vals = data.value\n", "vals" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vals[5] = 0\n", "vals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now look at the original `DataFrame`!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we plan on modifying an extracted Series, its a good idea to make a copy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vals = data.value.copy()\n", "vals[5] = 1000\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create or modify columns by assignment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.value[[3,4,6]] = [14, 21, 5]\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['year'] = 2013\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But note, we cannot use the attribute indexing method to add a new column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.treatment = 1\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.treatment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "From the `data` table above, create an index to return all rows for which the phylum name ends in \"bacteria\" and the value is greater than 1000." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[data.phylum.str.endswith('bacteria') & (data.value > 1000)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specifying a `Series` as a new columns cause its values to be added according to the `DataFrame`'s index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "treatment = pd.Series([0]*4 + [1]*2)\n", "treatment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['treatment'] = treatment\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "month = ['Jan', 'Feb', 'Mar', 'Apr']\n", "data['month'] = month" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['month'] = ['Jan']*len(data)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `drop` method to remove rows or columns, which by default drops rows. We can be explicit by using the `axis` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.drop('month', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that because of the mix of string and integer (and `NaN`) values, the dtype of the array is `object`. The dtype will automatically be chosen to be as general as needed to accomodate all the columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})\n", "df.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas uses a custom data structure to represent the indices of Series and DataFrames." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Index objects are immutable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.index[0] = 15" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is so that Index objects can be shared between data structures without fear that they will be changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria2.index = bacteria.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bacteria2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:\n", "\n", " genes = np.loadtxt(\"genes.csv\", delimiter=\",\", dtype=[('gene', '|S10'), ('value', '