{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Fundamental Python Data Science Libraries: A Cheatsheet (Part 2/4)](https://hackernoon.com/fundamental-python-data-science-libraries-a-cheatsheet-part-2-4-fcf5fab9cdf1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "by [Lauren Glass](https://www.linkedin.com/in/laurenjglass/), [Hackernoon](https://hackernoon.com/), Jan. 17, 2018"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## pandas\n",
    "\n",
    "This library is built on top of NumPy. It allows you to store & manipulate data in a relational table structure.\n",
    "\n",
    "This library focuses on two objects: the Series (1D) and the DataFrame (2D). Each allow you to set:\n",
    "\n",
    "- an index: that lets you find and manipulate certain rows\n",
    "- column names: that lets you find and manipulate certain columns\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Series\n",
    "future_array1 = [1,2,3,4,5,6]\n",
    "array1 = np.array(future_array1)\n",
    "s = pd.Series(array1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# DataFrame\n",
    "future_array2 = [2,4,6,8,10,12]\n",
    "array2 = np.array(future_array2)\n",
    "df = pd.DataFrame([future_array1, future_array2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Series from a dictionary\n",
    "future_series = {0: 'A', 1: 'B', 2: 'C'}\n",
    "s = pd.Series(future_series)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# DataFrame from dictionary\n",
    "dict = {'Normal': ['A', 'B', 'C'], 'Reverse': ['Z', 'Y', 'X']}\n",
    "df = pd.DataFrame(dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# upload data from file: The keyword argument, index_col, is where you \n",
    "# can specify which column in your CSV should be the index in the DataFrame\n",
    "#uploaded_data = pd.read_csv(\"filename.csv\", index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the Index\n",
    "dates = pd.date_range(\"20160101\", periods=6)\n",
    "data = np.random.random((6,3))\n",
    "column_names = ['Column1', 'Column2', 'Column3']\n",
    "df = pd.DataFrame(data, index=dates, columns=column_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Indexing a column\n",
    "df['Column2'] # use the column name's string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Indexing a row\n",
    "df[0:2] # use the standard indexing technique"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['20160101':'20160102'] # use the index's strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Indexing multiple axes — names\n",
    "df.loc['20160101':'20160102',['Column1','Column3']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Indexing multiple axes — numbers\n",
    "df.iloc[3:5, 0:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View Your Data\n",
    "df.head(2) # first 2 rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.tail(2) # last 2 rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View summary statistics\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Control Your Data\n",
    "# Pandas brings the flexibility of SQL into Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort\n",
    "df.sort_index(axis=0, ascending=False) # sort using the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sort_values(by='Column2') # sort using a column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join\n",
    "dates1 = pd.date_range(\"20160101\", periods=6)\n",
    "data1 = np.random.random((6,2))\n",
    "column_names1 = ['ColumnA', 'ColumnB']\n",
    "dates2 = pd.date_range(\"20160101\", periods=7)\n",
    "data2 = np.random.random((7,2))\n",
    "column_names2 = ['ColumnC', 'ColumnD']\n",
    "df1 = pd.DataFrame(data1, index=dates1, columns=column_names1)\n",
    "df2 = pd.DataFrame(data2, index=dates2, columns=column_names2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1.join(df2) # joins on the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group by\n",
    "df3 = df1.join(df2)\n",
    "# add a column to df to group on\n",
    "df3['ProfitLoss'] = pd.Series(['Profit', 'Loss', 'Profit', 'Profit', 'Profit', 'Loss'], index=dates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df3.groupby('ProfitLoss').mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Accessing Attributes\n",
    "\n",
    "# Access the Index\n",
    "df3.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access the Values\n",
    "df3.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access the Columns\n",
    "df3.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}