{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [Fundamental Python Data Science Libraries: A Cheatsheet (Part 2/4)](https://hackernoon.com/fundamental-python-data-science-libraries-a-cheatsheet-part-2-4-fcf5fab9cdf1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "by [Lauren Glass](https://www.linkedin.com/in/laurenjglass/), [Hackernoon](https://hackernoon.com/), Jan. 17, 2018" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pandas\n", "\n", "This library is built on top of NumPy. It allows you to store & manipulate data in a relational table structure.\n", "\n", "This library focuses on two objects: the Series (1D) and the DataFrame (2D). Each allow you to set:\n", "\n", "- an index: that lets you find and manipulate certain rows\n", "- column names: that lets you find and manipulate certain columns\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Series\n", "future_array1 = [1,2,3,4,5,6]\n", "array1 = np.array(future_array1)\n", "s = pd.Series(array1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DataFrame\n", "future_array2 = [2,4,6,8,10,12]\n", "array2 = np.array(future_array2)\n", "df = pd.DataFrame([future_array1, future_array2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Series from a dictionary\n", "future_series = {0: 'A', 1: 'B', 2: 'C'}\n", "s = pd.Series(future_series)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DataFrame from dictionary\n", "dict = {'Normal': ['A', 'B', 'C'], 'Reverse': ['Z', 'Y', 'X']}\n", "df = pd.DataFrame(dict)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# upload data from file: The keyword argument, index_col, is where you \n", "# can specify which column in your CSV should be the index in the DataFrame\n", "#uploaded_data = pd.read_csv(\"filename.csv\", index_col=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the Index\n", "dates = pd.date_range(\"20160101\", periods=6)\n", "data = np.random.random((6,3))\n", "column_names = ['Column1', 'Column2', 'Column3']\n", "df = pd.DataFrame(data, index=dates, columns=column_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Indexing a column\n", "df['Column2'] # use the column name's string" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Indexing a row\n", "df[0:2] # use the standard indexing technique" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['20160101':'20160102'] # use the index's strings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Indexing multiple axes — names\n", "df.loc['20160101':'20160102',['Column1','Column3']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Indexing multiple axes — numbers\n", "df.iloc[3:5, 0:2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View Your Data\n", "df.head(2) # first 2 rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.tail(2) # last 2 rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View summary statistics\n", "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Control Your Data\n", "# Pandas brings the flexibility of SQL into Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sort\n", "df.sort_index(axis=0, ascending=False) # sort using the index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sort_values(by='Column2') # sort using a column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Join\n", "dates1 = pd.date_range(\"20160101\", periods=6)\n", "data1 = np.random.random((6,2))\n", "column_names1 = ['ColumnA', 'ColumnB']\n", "dates2 = pd.date_range(\"20160101\", periods=7)\n", "data2 = np.random.random((7,2))\n", "column_names2 = ['ColumnC', 'ColumnD']\n", "df1 = pd.DataFrame(data1, index=dates1, columns=column_names1)\n", "df2 = pd.DataFrame(data2, index=dates2, columns=column_names2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df1.join(df2) # joins on the index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group by\n", "df3 = df1.join(df2)\n", "# add a column to df to group on\n", "df3['ProfitLoss'] = pd.Series(['Profit', 'Loss', 'Profit', 'Profit', 'Profit', 'Loss'], index=dates)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df3.groupby('ProfitLoss').mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Accessing Attributes\n", "\n", "# Access the Index\n", "df3.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Access the Values\n", "df3.values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Access the Columns\n", "df3.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }