{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "
\n", " \"Open\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*This notebook was created by [Jean de Dieu Nyandwi](https://twitter.com/jeande_d) for the love of machine learning community. For any feedback, errors or suggestion, he can be reached on email (johnjw7084 at gmail dot com), [Twitter](https://twitter.com/jeande_d), or [LinkedIn](https://linkedin.com/in/nyandwi).*" ] }, { "cell_type": "markdown", "metadata": { "id": "ErLffIn4mD93" }, "source": [ "\n", "# Data Manipulation with Pandas\n", "\n", " In this lab, you will learn how to manipulate data with Pandas. Here is an overview:\n", "\n", "* [1. Basics of Pandas for data manipulation:](#1)\n", "\n", " * [A. Series and DataFrames](#1-1)\n", " * [B. Data Indexing and Selection, and Iteration](#1-2)\n", " * [C. Dealing with Missing data](#1-3)\n", " * [D. Basic operations and Functions](#1-4)\n", " * [E. Aggregation Methods](#1-5)\n", " * [F. Groupby](#1-6)\n", " * [G. Merging, Joining and Concatenate](#1-7)\n", " * [H. Beyond Dataframes: Working with CSV, and Excel](#1-8)\n", " \n", "* [2. Real World Exploratory Data Analysis (EDA)](#2)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "6cK48ySE4SB0" }, "source": [ "\n", "## 1. Basics of Pandas for data manipulation\n", "\n", "\n", "### A. Series and DataFrames\n", "\n", "Both series and DataFrames are Pandas Data structures. \n", "\n", "Series is like one dimensional NumPy array with axis labels.\n", "\n", "DataFrame is multidimensional NumPy array with labels on rows and columns. \n", "\n", "Working with NumPy, we saw that it supports numeric type data. Pandas on other hand supports whole range of data types, from numeric to strings, etc.." ] }, { "cell_type": "markdown", "metadata": { "id": "6FckciW165U4" }, "source": [ "Since we are using python notebook, we do not need to install Pandas. We only just have to import it. \n", "\n", "```\n", "import pandas as pd\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "hl7wXbFo4ZLP" }, "outputs": [], "source": [ "# importing numpy and pandas\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "id": "r1MZe-Th7jf-" }, "source": [ "#### Creating Series\n", "\n", "Series can be created from a Python list, dictionary, and NumPy array. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ous_6XUS7lLa", "outputId": "83ada779-3f99-4536-8958-abb525d3409b" }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 4\n", "4 5\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating the series from a Python list\n", "\n", "num_list = [1,2,3,4,5]\n", "\n", "pd.Series(num_list)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "E6aPp0JJ9MI9", "outputId": "0e2c3111-226d-4c5f-b2c2-fec67493c18a" }, "outputs": [ { "data": { "text/plain": [ "a Mon\n", "b Tues\n", "c Wed\n", "d Thur\n", "e Fri\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "week_days = ['Mon','Tues','Wed','Thur','Fri']\n", "\n", "pd.Series(week_days, index=[\"a\", \"b\", \"c\", \"d\", \"e\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "0gtsxrby9fHM" }, "source": [ "Note the data types `int64` and `object`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MHWhzhTt9mev", "outputId": "1093b1e6-cdc8-41e2-d8d0-dbbe81d041bd" }, "outputs": [ { "data": { "text/plain": [ "1 United States\n", "91 India\n", "49 Germany\n", "86 China\n", "250 Rwanda\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating the Series from dictionary \n", "\n", "countries_code = { 1:\"United States\",\n", " 91:\"India\",\n", " 49:\"Germany\",\n", " 86:\"China\",\n", " 250:\"Rwanda\"}\n", "\n", "pd.Series(countries_code)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hCBCCRHZ-i_1", "outputId": "420e17e5-3dcb-48d8-8293-5bda93306c4d" }, "outputs": [ { "data": { "text/plain": [ "1 a\n", "2 b\n", "3 c\n", "4 d\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = {1:'a', 2:'b', 3:'c', 4:'d'}\n", "pd.Series(d)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lr0P6iSO_Ngl", "outputId": "69df04f1-e231-4d16-9917-cf04461a11c3" }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 4\n", "4 5\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating the Series from NumPy array\n", "# We peovide the list of indexes\n", "# if we don't provide the indexes, the default indexes are numbers...starts from 0,1,2..\n", "\n", "arr = np.array ([1, 2, 3, 4, 5])\n", "pd.Series(arr)\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vBV3WXje__5m", "outputId": "08444786-b5d7-454e-bb4f-3e279d8a898a" }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "d 4\n", "e 5\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])" ] }, { "cell_type": "markdown", "metadata": { "id": "YYpqI7l4AvOG" }, "source": [ "#### Creating DataFrames\n", "\n", "DataFrames are the most used Pandas data structure. It can be created from a dictionary, 2D array, and Series. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "cThhEpprAyLr", "outputId": "9f09b720-0a54-451d-9005-e70c91479031" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
0USA1
1India91
2German49
3Rwanda250
\n", "
" ], "text/plain": [ " Name Codes\n", "0 USA 1\n", "1 India 91\n", "2 German 49\n", "3 Rwanda 250" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating DataFrame from a dictionary\n", "\n", "countries = {'Name': ['USA', 'India', 'German', 'Rwanda'], \n", " \n", " 'Codes':[1, 91, 49, 250] }\n", "\n", "pd.DataFrame(countries)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "fwdwsY9uGdG0", "outputId": "945f8126-a85b-46f5-9cd8-f649c7301fcf" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0123
1456
2789
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1 2 3\n", "1 4 5 6\n", "2 7 8 9" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a dataframe from a 2D array\n", "# You pass the list of columns\n", "\n", "array_2d = np.array ([[1,2,3], [4,5,6], [7,8,9]])\n", "\n", "pd.DataFrame(array_2d, columns = ['column 1', 'column 2', 'column 3'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "Y7ezd5g4HrHq", "outputId": "34eeaa06-92f3-40d9-c0f7-998d43f39931" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Codes
United States1
India91
Germany49
China86
Rwanda250
\n", "
" ], "text/plain": [ " Codes\n", "United States 1\n", "India 91\n", "Germany 49\n", "China 86\n", "Rwanda 250" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a dataframe from Pandas series \n", "# Pass the columns in a list\n", "\n", "countries_code = { \"United States\": 1,\n", " \"India\": 91,\n", " \"Germany\": 49,\n", " \"China\": 86,\n", " \"Rwanda\":250}\n", "\n", "pd_series = pd.Series (countries_code)\n", "\n", "pd.Series(countries_code)\n", "\n", "df = pd.DataFrame(pd_series, columns = ['Codes'])\n", "df" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "F1JPrkTizQHM", "outputId": "bc3d53bc-19d8-402c-fd8d-78ade1623a90" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CodesPopulation
United States1100
India91450
Germany49575
China865885
Rwanda250533
\n", "
" ], "text/plain": [ " Codes Population\n", "United States 1 100\n", "India 91 450\n", "Germany 49 575\n", "China 86 5885\n", "Rwanda 250 533" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Adding a column\n", "# Number in population are pretty random\n", "\n", "df ['Population'] = [100, 450, 575, 5885, 533]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "pympKqSv8ZNp", "outputId": "bf04e8f8-7d0e-474e-d287-95a326128461" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Codes
United States1
India91
Germany49
China86
Rwanda250
\n", "
" ], "text/plain": [ " Codes\n", "United States 1\n", "India 91\n", "Germany 49\n", "China 86\n", "Rwanda 250" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Removing a column \n", "\n", "df.drop('Population', axis =1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Hz2u0p008lDR", "outputId": "927211e6-3536-43ec-9932-54b5298a45d6" }, "outputs": [ { "data": { "text/plain": [ "Index(['Codes', 'Population'], dtype='object')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "x0mYfsLt8oyC", "outputId": "033adeef-b0a6-407d-e58b-b293bc8c22d8" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.keys" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DcGWsgwP8tuo", "outputId": "d9ab7016-b67e-4b59-f959-60d542ca7901" }, "outputs": [ { "data": { "text/plain": [ "Index(['United States', 'India', 'Germany', 'China', 'Rwanda'], dtype='object')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": { "id": "8S1rS2IqI5OQ" }, "source": [ "\n", "### B. Data Indexing, Selection and Iteration\n", "\n", "Indexing and selection works in both Series and Dataframe.\n", "\n", "Because DataFrame is made of Series, let's focus on how to select data in DataFrame. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "xO-xCw1uI9Yt", "outputId": "008a4242-556f-42af-c946-55ff41f0cbe3" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
bIndia91
cGerman49
dRwanda250
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1\n", "b India 91\n", "c German 49\n", "d Rwanda 250" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating DataFrame from a dictionary\n", "\n", "countries = {'Name': ['USA', 'India', 'German', 'Rwanda'], \n", " \n", " 'Codes':[1, 91, 49, 250] }\n", "\n", "df = pd.DataFrame(countries, index=['a', 'b', 'c', 'd'])\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BhcNuJo1sBff", "outputId": "a89548bf-e083-496e-d7a2-93bc7378bce3" }, "outputs": [ { "data": { "text/plain": [ "a USA\n", "b India\n", "c German\n", "d Rwanda\n", "Name: Name, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Name']" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oBhhzCWyy_OJ", "outputId": "22207d29-0349-4a06-cb29-902bc389eb06" }, "outputs": [ { "data": { "text/plain": [ "a USA\n", "b India\n", "c German\n", "d Rwanda\n", "Name: Name, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Name" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vRUbK_YLsQ0h", "outputId": "517f0652-6c83-4293-9b84-236263e6f856" }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 91\n", "c 49\n", "d 250\n", "Name: Codes, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df ['Codes']" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "BGdQngaxsTbM", "outputId": "ce70d45c-1b3d-400b-f278-f986920ec16f" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
bIndia91
cGerman49
dRwanda250
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1\n", "b India 91\n", "c German 49\n", "d Rwanda 250" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## When you have many columns, columns in list will be selected\n", "\n", "df [['Name', 'Codes']]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "hm5NIMGksb71", "outputId": "9333881c-eb66-4543-908a-a5b676d82d8c" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
bIndia91
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1\n", "b India 91" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This will return the first two rows\n", "df [0:2]" ] }, { "cell_type": "markdown", "metadata": { "id": "7hsBOkVLwev1" }, "source": [ "You can also use `loc` to select data by the label indexes and `iloc` to select by default integer index (or by the position of the row)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "r0kBMm4Dwmyw", "outputId": "5cca5a84-43be-43ee-948f-a05f706dbdc1" }, "outputs": [ { "data": { "text/plain": [ "Name USA\n", "Codes 1\n", "Name: a, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['a']" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "-S2cSr5Ax97c", "outputId": "efb0c4e8-7fa0-4fb7-9a12-ea2851ecd890" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
bIndia91
cGerman49
dRwanda250
\n", "
" ], "text/plain": [ " Name Codes\n", "b India 91\n", "c German 49\n", "d Rwanda 250" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['b':'d']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "RictqM1hyUYm", "outputId": "0d962413-d6c5-4cdd-800a-3eb99f98b597" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
bIndia91
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1\n", "b India 91" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df [:'b']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "m2Tauaenw6WQ", "outputId": "4cf58b8a-15e3-47a4-d849-cc4cde79c139" }, "outputs": [ { "data": { "text/plain": [ "Name German\n", "Codes 49\n", "Name: c, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "e8YeUoXZx4KA", "outputId": "e7b5c924-6755-44b7-8405-4cb8198113bf" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
bIndia91
cGerman49
\n", "
" ], "text/plain": [ " Name Codes\n", "b India 91\n", "c German 49" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1:3]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "wGaDmJRGyYhy", "outputId": "ba0e9cbb-95d5-4fb3-b799-90ebd3c296f6" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
cGerman49
dRwanda250
\n", "
" ], "text/plain": [ " Name Codes\n", "c German 49\n", "d Rwanda 250" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:]" ] }, { "cell_type": "markdown", "metadata": { "id": "RCc2oXwszDDB" }, "source": [ "### Conditional Selection" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "iNxAW-NTzHAS", "outputId": "d66e4c9a-dfbc-486a-d7ae-d27487ed5722" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
bIndia91
cGerman49
dRwanda250
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1\n", "b India 91\n", "c German 49\n", "d Rwanda 250" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "4F3mr2Fszagx", "outputId": "878b4c87-ae97-47eb-df25-bb98ee575056" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
cGerman49
\n", "
" ], "text/plain": [ " Name Codes\n", "c German 49" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Let's select a country with code 49\n", "\n", "df [df['Codes'] ==49 ]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "xLSnNlqc0XMp", "outputId": "d006ddff-24e3-409e-aa5c-57522194fceb" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
bIndia91
cGerman49
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1\n", "b India 91\n", "c German 49" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df [df['Codes'] < 250 ]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "_3g-C-UKzyq6", "outputId": "a8eecb7b-b35c-4de9-cabb-86e548183318" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aUSA1
\n", "
" ], "text/plain": [ " Name Codes\n", "a USA 1" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df [df['Name'] =='USA' ]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "EqqEQfhoz9IG", "outputId": "ac0c9c58-53fa-4228-d5f7-2742199a7471" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
bIndia91
\n", "
" ], "text/plain": [ " Name Codes\n", "b India 91" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can use and (&) or (|) for more than conditions\n", "#df [(condition 1) & (condition 2)]\n", "\n", "df [(df['Codes'] == 91 ) & (df['Name'] == 'India') ]" ] }, { "cell_type": "markdown", "metadata": { "id": "ipe0NTjf2Db8" }, "source": [ "You can also use `isin()` and `where()` to select data in a series or dataframe." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "CJIqyvOC2P5A", "outputId": "8cf04e2e-40a4-4518-d8d7-2b0e8e5149cf" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aTrueTrue
bTrueFalse
cFalseFalse
dFalseTrue
\n", "
" ], "text/plain": [ " Name Codes\n", "a True True\n", "b True False\n", "c False False\n", "d False True" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# isin() return false or true when provided value is included in dataframe\n", "sample_codes_names=[1,3,250, 'USA', 'India', 'England']\n", "\n", "df.isin(sample_codes_names)" ] }, { "cell_type": "markdown", "metadata": { "id": "HsqhvWzo3B2N" }, "source": [ "As you can see, it returned `True` wherever a country code or name was found. Otherwise, `False`. You can use a dictinary to match search by columns. A key must be a column and values are passed in list." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "fnx_IAnS3Wq8", "outputId": "e4f51707-8352-4aec-9e12-f2937a4804aa" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCodes
aTrueTrue
bTrueFalse
cFalseFalse
dFalseTrue
\n", "
" ], "text/plain": [ " Name Codes\n", "a True True\n", "b True False\n", "c False False\n", "d False True" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_codes_names = {'Codes':[1,3,250], 'Name':['USA', 'India', 'England']}\n", "\n", "df.isin(sample_codes_names)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "PGv1D9K54knp", "outputId": "48868abb-3dcc-4344-db4d-5e9c63f1a1da" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0123
1456
2789
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1 2 3\n", "1 4 5 6\n", "2 7 8 9" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame(np.array ([[1,2,3], [4,5,6], [7,8,9]]), \n", " columns = ['column 1', 'column 2', 'column 3'])\n", "\n", "df2" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "BGU1CUQ142CW", "outputId": "9caa3f57-cfd5-423e-eba1-a9fdb13866cd" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0FalseFalseTrue
1TrueTrueFalse
2TrueFalseFalse
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 False False True\n", "1 True True False\n", "2 True False False" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.isin([0,3,4,5,7])" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "NsFfB57Y5ePv", "outputId": "ced02843-254f-40f5-af21-1a005677ccb5" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0NaNNaNNaN
1NaN5.06.0
27.08.09.0
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 NaN NaN NaN\n", "1 NaN 5.0 6.0\n", "2 7.0 8.0 9.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 [df2 > 4]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "HTTed_RS5E9S", "outputId": "fe5b3755-85de-41de-87a9-3a4f190a2bc9" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0NaNNaNNaN
1NaN5.06.0
27.08.09.0
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 NaN NaN NaN\n", "1 NaN 5.0 6.0\n", "2 7.0 8.0 9.0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.where(df2 > 4)" ] }, { "cell_type": "markdown", "metadata": { "id": "4cGg8isa5waQ" }, "source": [ "`where` allows you to replace the values that doesn't meet the provided condition with any other value. So, if we do `df2.where(df2 > 4, 0)` as follows, all values less than `4` will be replaced by `0`." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "jcIdwzbI5LBd", "outputId": "4e043655-b16a-40ac-bb54-ad1365cf3815" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0000
1056
2789
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 0 0 0\n", "1 0 5 6\n", "2 7 8 9" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.where(df2 > 4, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can achieve the above by..." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "sJdjZySr5Uaz", "outputId": "476cd0a7-4228-4812-c65e-022f099c6cb4", "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0000
1056
2789
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 0 0 0\n", "1 0 5 6\n", "2 7 8 9" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 [df2 <= 4] = 0\n", "df2" ] }, { "cell_type": "markdown", "metadata": { "id": "sT3tBawQHIxh" }, "source": [ "### Iteration\n", "\n", "```\n", "df.items() #Iterate over (column name, Series) pairs.\n", "df.iteritems() Iterate over (column name, Series) pairs.\n", "DataFrame.iterrows() Iterate over DataFrame rows as (index, Series) pairs.\n", "DataFrame.itertuples([index, name]) Iterate over DataFrame rows as namedtuples.\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "column 1\n", "0 0\n", "1 0\n", "2 7\n", "Name: column 1, dtype: int64\n", "column 2\n", "0 0\n", "1 5\n", "2 8\n", "Name: column 2, dtype: int64\n", "column 3\n", "0 0\n", "1 6\n", "2 9\n", "Name: column 3, dtype: int64\n" ] } ], "source": [ "# Iterate over (column name, Series) pairs.\n", "\n", "for col_name, content in df2.items():\n", " print(col_name)\n", " print(content)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "column 1\n", "0 0\n", "1 0\n", "2 7\n", "Name: column 1, dtype: int64\n", "column 2\n", "0 0\n", "1 5\n", "2 8\n", "Name: column 2, dtype: int64\n", "column 3\n", "0 0\n", "1 6\n", "2 9\n", "Name: column 3, dtype: int64\n" ] } ], "source": [ "# Iterate over (column name, Series) pairs.\n", "# Same as df.items()\n", "\n", "for col_name, content in df2.iteritems():\n", " print(col_name)\n", " print(content)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, column 1 0\n", "column 2 0\n", "column 3 0\n", "Name: 0, dtype: int64)\n", "(1, column 1 0\n", "column 2 5\n", "column 3 6\n", "Name: 1, dtype: int64)\n", "(2, column 1 7\n", "column 2 8\n", "column 3 9\n", "Name: 2, dtype: int64)\n" ] } ], "source": [ "# Iterate over DataFrame rows as (index, Series) pairs\n", "\n", "for row in df2.iterrows():\n", " print(row)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas(Index=0, _1=0, _2=0, _3=0)\n", "Pandas(Index=1, _1=0, _2=5, _3=6)\n", "Pandas(Index=2, _1=7, _2=8, _3=9)\n" ] } ], "source": [ "# Iterate over DataFrame rows as namedtuples\n", "\n", "for row in df2.itertuples():\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notes: Thanks to [Prit Kalariya](https://twitter.com/pritkalariya) for Contributing the Iteration part!" ] }, { "cell_type": "markdown", "metadata": { "id": "5L0C1_d5840L" }, "source": [ "\n", "### C. Dealing with Missing data\n", "\n", "Real world datasets are messy, often with missing values. Pandas replace NaN with missing values by default. NaN stands for not a number. \n", "\n", "Missing values can either be ignored, droped or filled. " ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "xWN2SE4K87Cn", "outputId": "8b639662-3a57-4d62-95d0-a1447c9c2bf0" }, "outputs": [], "source": [ "# Creating a dataframe\n", "\n", "df3 = pd.DataFrame(np.array ([[1,2,3], [4,np.nan,6], [7,np.nan,np.nan]]), \n", " columns = ['column 1', 'column 2', 'column 3'])" ] }, { "cell_type": "markdown", "metadata": { "id": "gofEv_rhTxzz" }, "source": [ "#### Checking Missing values" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "EQP7y0YaN5Jb", "outputId": "ce985690-efa5-4ad2-9258-91a0f3fd4141" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0FalseFalseFalse
1FalseTrueFalse
2FalseTrueTrue
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 False False False\n", "1 False True False\n", "2 False True True" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Recognizing the missing values\n", "\n", "df3.isnull()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BQSUMOXeOe5Z", "outputId": "9233ea74-868a-4f26-b275-96b3f8f46dbd" }, "outputs": [ { "data": { "text/plain": [ "column 1 0\n", "column 2 2\n", "column 3 1\n", "dtype: int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculating number of the missing values in each feature\n", "\n", "df3.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "je6Rqhq0Opgg", "outputId": "967f225a-6b2f-494a-8e16-eb0ab558e622" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
0TrueTrueTrue
1TrueFalseTrue
2TrueFalseFalse
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 True True True\n", "1 True False True\n", "2 True False False" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Recognizng non missig values\n", "\n", "df3.notna()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9oJFnRwNQkMb", "outputId": "06373d27-24f4-4be3-e665-c3a055300b1d" }, "outputs": [ { "data": { "text/plain": [ "column 1 3\n", "column 2 1\n", "column 3 2\n", "dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3.notna().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "OrrsYADLT2J7" }, "source": [ "#### Removing the missing values" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "BtYAGZ3GQm7-", "outputId": "c1ef0bc6-09d0-4063-db4b-f6092fc30cc8" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Dropping missing values \n", "\n", "df3.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "M03nac6CRfWQ" }, "source": [ "All rows are deleted because dropna() will remove each row which have missing value. " ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6qdjI9S9Rfvk", "outputId": "b469ed24-0d8e-4ad6-f766-229e4151e06e" }, "outputs": [ { "data": { "text/plain": [ "0 3.0\n", "1 6.0\n", "Name: column 3, dtype: float64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can drop NaNs in specific column(s)\n", "\n", "df3['column 3'].dropna()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "EWfHlBf7R0b1", "outputId": "ddbbf4c1-0eca-4a4c-b38a-654c3d79da66" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1
01.0
14.0
27.0
\n", "
" ], "text/plain": [ " column 1\n", "0 1.0\n", "1 4.0\n", "2 7.0" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can drop data by axis \n", "# Axis = 1...drop all columns with Nans\n", "# df3.dropna(axis='columns')\n", "\n", "df3.dropna(axis=1)\n", "\n" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "wTFG5MeBTG9y", "outputId": "dd8f4d08-5d82-498f-c3a3-fc35f6a8c8ba" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# axis = 0...drop all rows with Nans\n", "# df3.dropna(axis='rows') is same \n", "\n", "df3.dropna(axis=0)" ] }, { "cell_type": "markdown", "metadata": { "id": "daRUQLvNT7xT" }, "source": [ "#### Filling the missing values" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "keazo2zoTKIm", "outputId": "0d3ebd6f-497e-4ad1-d00f-723e8a0f1386" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
14.010.06.0
27.010.010.0
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0\n", "1 4.0 10.0 6.0\n", "2 7.0 10.0 10.0" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filling Missing values\n", "\n", "df3.fillna(10)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "WI6t4b68V65Z", "outputId": "a2d6b6f3-9d6d-4e7b-a8e5-084b7d634ca7" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
14.0fillme6.0
27.0fillmefillme
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0\n", "1 4.0 fillme 6.0\n", "2 7.0 fillme fillme" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "df3.fillna('fillme')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "ZBFjHR0JWEzK", "outputId": "98d074dd-2060-4094-f5bb-f9d2b8b651d3" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
14.02.06.0
27.02.06.0
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0\n", "1 4.0 2.0 6.0\n", "2 7.0 2.0 6.0" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can forward fill (ffill) or backward fill(bfill)\n", "# Or fill a current value with previous or next value\n", "\n", "df3.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "azOA4S_PWB-8", "outputId": "b9a3a1fb-0cb2-4160-85de-5ef2df307c37" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
14.0NaN6.0
27.0NaNNaN
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0\n", "1 4.0 NaN 6.0\n", "2 7.0 NaN NaN" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Won't change it because the last values are NaNs, so it backward it\n", "\n", "df3.fillna(method='bfill')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "uVkg9e9AWYvK", "outputId": "14ff8c58-6ac7-4a6e-e5f9-bccd7ebc1592" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column 1column 2column 3
01.02.03.0
14.06.06.0
27.0NaNNaN
\n", "
" ], "text/plain": [ " column 1 column 2 column 3\n", "0 1.0 2.0 3.0\n", "1 4.0 6.0 6.0\n", "2 7.0 NaN NaN" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If we change the axis to columns, you can see that Nans at row 2 and col 2 is backfilled with 6\n", "\n", "df3.fillna(method='bfill', axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "id": "k3MilbA7XSPo" }, "source": [ "\n", "### D. More Operations and Functions\n", "\n", "This section will show the more and most useful functions of Pandas." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "id": "NWubkTA6XUpl" }, "outputs": [], "source": [ "df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag'], \n", " 'Order Number':[45,56,64], \n", " 'Total Quantity':[10,5,9]}, \n", " columns = ['Product Name', 'Order Number', 'Total Quantity'])" ] }, { "cell_type": "markdown", "metadata": { "id": "uHXUH7KdH-Mp" }, "source": [ "#### Retrieving basic info about the Dataframe" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TziPr2P2IFNG", "outputId": "8d7526ca-9b60-48d7-d292-9c8158921e53" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3 entries, 0 to 2\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Product Name 3 non-null object\n", " 1 Order Number 3 non-null int64 \n", " 2 Total Quantity 3 non-null int64 \n", "dtypes: int64(2), object(1)\n", "memory usage: 200.0+ bytes\n" ] } ], "source": [ "# Return a summary about the dataframe\n", "\n", "df4.info()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "39diMDz1IQyv", "outputId": "08be59cf-6bd2-4386-d0a5-454c423d072d" }, "outputs": [ { "data": { "text/plain": [ "Index(['Product Name', 'Order Number', 'Total Quantity'], dtype='object')" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return dataframe columns\n", "\n", "df4.columns" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-BESb6FxIk32", "outputId": "46ecc116-190d-49a0-be2c-54d966e64960" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return dataframe data\n", "\n", "df4.keys" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "k5v_SKHlIn2C", "outputId": "6cfdcec6-76ab-42ff-de77-2d514a70ba63" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product NameOrder NumberTotal Quantity
0Shirt4510
\n", "
" ], "text/plain": [ " Product Name Order Number Total Quantity\n", "0 Shirt 45 10" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the head of the dataframe ....could make sense if you have long frame\n", "# Choose how many rows you want in head()\n", "\n", "df4.head(1)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "S5ZiB2wqI3rs", "outputId": "a95b716c-5a9b-4fb4-c0b3-8b0812cc0dce" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product NameOrder NumberTotal Quantity
2Bag649
\n", "
" ], "text/plain": [ " Product Name Order Number Total Quantity\n", "2 Bag 64 9" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the tail of the dataframe\n", "\n", "df4.tail(1)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IuVfuOuXK6KW", "outputId": "f31732e5-ad72-4c34-c5a9-2808b47e3b40" }, "outputs": [ { "data": { "text/plain": [ "array([['Shirt', 45, 10],\n", " ['Boot', 56, 5],\n", " ['Bag', 64, 9]], dtype=object)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return NumPy array of the dataframe\n", "\n", "df4.values" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GMSw7p-3LAvn", "outputId": "b30e41d5-efe3-47d0-8aae-a7c23771850f" }, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the size or number of elements in a dataframe\n", "\n", "df4.size" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-twlx7EFLI68", "outputId": "9d611a45-c218-4f73-c7ba-4cf36b0de15e" }, "outputs": [ { "data": { "text/plain": [ "(3, 3)" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the shape\n", "\n", "df4.shape" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AMIZ2GJCLU3h", "outputId": "59126c5d-5252-4508-b35d-1e0af8e8ca24" }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the length of the dataframe/the number of rows in a dataframe\n", "\n", "df4.shape[0]" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the length of the dataframe/the number of columns in a dataframe\n", "\n", "df4.shape[1]" ] }, { "cell_type": "markdown", "metadata": { "id": "NPsgE9ogLchY" }, "source": [ "#### Unique Values" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gpVCVrAELeGz", "outputId": "a16a2c3c-4197-4c63-f6be-85a770891ac7" }, "outputs": [ { "data": { "text/plain": [ "array(['Shirt', 'Boot', 'Bag'], dtype=object)" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return unique values in a given column \n", "\n", "df4['Product Name'].unique()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vc6k80jULlJ2", "outputId": "304b4b99-23b0-44f6-d48f-b8c15c33ee66" }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return a number of unique values\n", "df4['Product Name'].nunique()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4E5tnSieL3q1", "outputId": "3245bb8c-473b-4eee-8b1c-1c520362b09b" }, "outputs": [ { "data": { "text/plain": [ "Shirt 1\n", "Boot 1\n", "Bag 1\n", "Name: Product Name, dtype: int64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Counting the occurence of each value in a column \n", "\n", "df4['Product Name'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "iKHhUk07Qsfr" }, "source": [ "#### Applying a Function to Dataframe" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "id": "JW0yon8ZQuRk" }, "outputs": [], "source": [ "# Double the quantity product\n", "\n", "def double_quantity(x):\n", " return x * x" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AYAD97GwRrD4", "outputId": "70ccdfdd-a574-47a6-9022-1a437c5ebc68" }, "outputs": [ { "data": { "text/plain": [ "0 100\n", "1 25\n", "2 81\n", "Name: Total Quantity, dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4['Total Quantity'].apply(double_quantity)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "arYX52diU25C", "outputId": "a167184b-2eed-4d4c-b6bd-00ecfc505250" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2
014
11625
\n", "
" ], "text/plain": [ " col1 col2\n", "0 1 4\n", "1 16 25" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can also apply an anonymous function to a dataframe\n", "# Squaring each value in dataframe\n", "\n", "df5 = pd.DataFrame([[1,2], [4,5]], columns=['col1', 'col2'])\n", "\n", "df5.applymap(lambda x: x**2)" ] }, { "cell_type": "markdown", "metadata": { "id": "KQgkBB6jW0mb" }, "source": [ "#### Sorting values in dataframe" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "oVQUpEWpW3eh", "outputId": "08573bf2-357c-4ecb-99e9-8e55423ed85e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product NameOrder NumberTotal Quantity
0Shirt4510
1Boot565
2Bag649
\n", "
" ], "text/plain": [ " Product Name Order Number Total Quantity\n", "0 Shirt 45 10\n", "1 Boot 56 5\n", "2 Bag 64 9" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sort the df4 by the order number\n", "\n", "df4.sort_values(['Order Number'])" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "SXHVgJOIXrYM", "outputId": "16c2c64a-8a3a-41b0-f21d-f94ba83e5d5d" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product NameOrder NumberTotal Quantity
2Bag649
1Boot565
0Shirt4510
\n", "
" ], "text/plain": [ " Product Name Order Number Total Quantity\n", "2 Bag 64 9\n", "1 Boot 56 5\n", "0 Shirt 45 10" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.sort_values(['Order Number'], ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "Nf96M45vX6WI" }, "source": [ "\n", "### E. Aggregation Methods" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "YoqCRTKNYPsX", "outputId": "59bc6d83-8bac-4144-9dfa-02c5230fc09f" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product NameOrder NumberTotal Quantity
0Shirt4510
1Boot565
2Bag649
\n", "
" ], "text/plain": [ " Product Name Order Number Total Quantity\n", "0 Shirt 45 10\n", "1 Boot 56 5\n", "2 Bag 64 9" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "sGMYqGLDOW7y", "outputId": "39f39243-3153-4630-be08-6eb0614129e0" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order NumberTotal Quantity
count3.0000003.000000
mean55.0000008.000000
std9.5393922.645751
min45.0000005.000000
25%50.5000007.000000
50%56.0000009.000000
75%60.0000009.500000
max64.00000010.000000
\n", "
" ], "text/plain": [ " Order Number Total Quantity\n", "count 3.000000 3.000000\n", "mean 55.000000 8.000000\n", "std 9.539392 2.645751\n", "min 45.000000 5.000000\n", "25% 50.500000 7.000000\n", "50% 56.000000 9.000000\n", "75% 60.000000 9.500000\n", "max 64.000000 10.000000" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# summary statistics\n", "\n", "df4.describe()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "1fhz845VOdb_", "outputId": "d96f93de-619a-426c-ffc6-4865bbc928c3" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
Order Number3.055.09.53939245.050.556.060.064.0
Total Quantity3.08.02.6457515.07.09.09.510.0
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "Order Number 3.0 55.0 9.539392 45.0 50.5 56.0 60.0 64.0\n", "Total Quantity 3.0 8.0 2.645751 5.0 7.0 9.0 9.5 10.0" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "df4.describe().transpose()" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e9RKm4TIHoui", "outputId": "23e93644-03f3-457f-e0bc-cea6b1f117bd" }, "outputs": [ { "data": { "text/plain": [ "0 5\n", "1 9\n", "2 10\n", "dtype: int64" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Mode of the dataframe\n", "# Mode is the most recurring values\n", "\n", "df4['Total Quantity'].mode()" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IzRRvXcOIbzC", "outputId": "938e4d3a-caa0-413e-d17a-a9551ff91afd" }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The maximum value\n", "\n", "df4['Total Quantity'].max()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "T6x7YvCxJCFg", "outputId": "59f4672f-180a-43fb-8185-417550064ae7" }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The minimum value\n", "\n", "df4['Total Quantity'].min()" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Jq8PKQIcJIpO", "outputId": "f6804d4e-22ba-462a-dbcc-b415bf510b6c" }, "outputs": [ { "data": { "text/plain": [ "8.0" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The mean\n", "\n", "df4['Total Quantity'].mean()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kxl8cn4jJYDg", "outputId": "943e05b4-da1a-4322-e594-9e3e16fbc1e5" }, "outputs": [ { "data": { "text/plain": [ "9.0" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The median value in a dataframe\n", "\n", "df4['Total Quantity'].median()" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7sAHCt3rJg0P", "outputId": "542668fc-cdee-45a8-d062-27744fc04fb0" }, "outputs": [ { "data": { "text/plain": [ "2.6457513110645907" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Standard deviation\n", "\n", "df4['Total Quantity'].std()" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VryxdVidJy3q", "outputId": "7cecf363-c5fe-4994-961e-6696e7cf62aa" }, "outputs": [ { "data": { "text/plain": [ "7.0" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Variance \n", "\n", "df4['Total Quantity'].var()" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1BuyPI-zJ69A", "outputId": "26dc9cf7-aa05-44ff-94ea-38153a7e1ef2" }, "outputs": [ { "data": { "text/plain": [ "24" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sum of all values in a column\n", "\n", "df4['Total Quantity'].sum()" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EhHTxpRrKD7I", "outputId": "5d80593c-5d5f-499b-8122-a36c281439ff" }, "outputs": [ { "data": { "text/plain": [ "450" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Product of all values in dataframe\n", "\n", "df4['Total Quantity'].prod()" ] }, { "cell_type": "markdown", "metadata": { "id": "viZf9yCpK4xO" }, "source": [ "\n", "### F. Groupby\n", "\n", "`Group by` involves splitting data into groups, applying function to each group, and combining the results. " ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "id": "AmKHc2IjMZdE" }, "outputs": [], "source": [ "df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag', 'Ankle', 'Pullover', 'Boot', 'Ankle', 'Tshirt', 'Shirt'], \n", " 'Order Number':[45,56,64, 34, 67, 56, 34, 89, 45], \n", " 'Total Quantity':[10,5,9, 11, 11, 8, 14, 23, 10]}, \n", " columns = ['Product Name', 'Order Number', 'Total Quantity'])" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "id": "lIhEZ-cZMzUh", "outputId": "0ec73ad8-626b-490f-ff37-cdbf99c2db83" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product NameOrder NumberTotal Quantity
0Shirt4510
1Boot565
2Bag649
3Ankle3411
4Pullover6711
5Boot568
6Ankle3414
7Tshirt8923
8Shirt4510
\n", "
" ], "text/plain": [ " Product Name Order Number Total Quantity\n", "0 Shirt 45 10\n", "1 Boot 56 5\n", "2 Bag 64 9\n", "3 Ankle 34 11\n", "4 Pullover 67 11\n", "5 Boot 56 8\n", "6 Ankle 34 14\n", "7 Tshirt 89 23\n", "8 Shirt 45 10" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 266 }, "id": "jNLj7ehAM02_", "outputId": "06525190-572e-47f2-ac9c-4eb31a33c885" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order NumberTotal Quantity
Product Name
Ankle34.012.5
Bag64.09.0
Boot56.06.5
Pullover67.011.0
Shirt45.010.0
Tshirt89.023.0
\n", "
" ], "text/plain": [ " Order Number Total Quantity\n", "Product Name \n", "Ankle 34.0 12.5\n", "Bag 64.0 9.0\n", "Boot 56.0 6.5\n", "Pullover 67.0 11.0\n", "Shirt 45.0 10.0\n", "Tshirt 89.0 23.0" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's group the df by product name\n", "\n", "df4.groupby('Product Name').mean()" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 266 }, "id": "gHaQZBIXNR-s", "outputId": "aaae2325-9114-4e52-fb15-ac34ba4ec847" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order NumberTotal Quantity
Product Name
Ankle6825
Bag649
Boot11213
Pullover6711
Shirt9020
Tshirt8923
\n", "
" ], "text/plain": [ " Order Number Total Quantity\n", "Product Name \n", "Ankle 68 25\n", "Bag 64 9\n", "Boot 112 13\n", "Pullover 67 11\n", "Shirt 90 20\n", "Tshirt 89 23" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.groupby('Product Name').sum()" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 266 }, "id": "tgJA87lTN3dJ", "outputId": "13bc0ceb-09be-45df-b066-c95ec81a760a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order NumberTotal Quantity
Product Name
Ankle3411
Bag649
Boot565
Pullover6711
Shirt4510
Tshirt8923
\n", "
" ], "text/plain": [ " Order Number Total Quantity\n", "Product Name \n", "Ankle 34 11\n", "Bag 64 9\n", "Boot 56 5\n", "Pullover 67 11\n", "Shirt 45 10\n", "Tshirt 89 23" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.groupby('Product Name').min()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 266 }, "id": "1hPLubKZN83R", "outputId": "b0550960-111c-4e6c-acf1-a84f747f0035" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order NumberTotal Quantity
Product Name
Ankle3414
Bag649
Boot568
Pullover6711
Shirt4510
Tshirt8923
\n", "
" ], "text/plain": [ " Order Number Total Quantity\n", "Product Name \n", "Ankle 34 14\n", "Bag 64 9\n", "Boot 56 8\n", "Pullover 67 11\n", "Shirt 45 10\n", "Tshirt 89 23" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.groupby('Product Name').max()" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 266 }, "id": "lUvqB_Y-P-K8", "outputId": "08dd63ea-7002-4cfa-daad-3c6e1725fe07" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Total Quantity
Product NameOrder Number
Ankle3414
Bag649
Boot568
Pullover6711
Shirt4510
Tshirt8923
\n", "
" ], "text/plain": [ " Total Quantity\n", "Product Name Order Number \n", "Ankle 34 14\n", "Bag 64 9\n", "Boot 56 8\n", "Pullover 67 11\n", "Shirt 45 10\n", "Tshirt 89 23" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.groupby(['Product Name', 'Order Number']).max()" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 266 }, "id": "kk8w4IMCQKvu", "outputId": "52c5e63c-566e-4dae-b0c0-d5c8515fbfae" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Total Quantity
Product NameOrder Number
Ankle3425
Bag649
Boot5613
Pullover6711
Shirt4520
Tshirt8923
\n", "
" ], "text/plain": [ " Total Quantity\n", "Product Name Order Number \n", "Ankle 34 25\n", "Bag 64 9\n", "Boot 56 13\n", "Pullover 67 11\n", "Shirt 45 20\n", "Tshirt 89 23" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.groupby(['Product Name', 'Order Number']).sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "zvqtUO0lO2HS" }, "source": [ "You can also use `aggregation()` after groupby. " ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "i6JlX-yrO2iP", "outputId": "78690759-45b2-468e-f148-bdebb08014fd" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Order NumberTotal Quantity
minmaxsumminmaxsum
Product Name
Ankle343468111425
Bag646464999
Boot56561125813
Pullover676767111111
Shirt454590101020
Tshirt898989232323
\n", "
" ], "text/plain": [ " Order Number Total Quantity \n", " min max sum min max sum\n", "Product Name \n", "Ankle 34 34 68 11 14 25\n", "Bag 64 64 64 9 9 9\n", "Boot 56 56 112 5 8 13\n", "Pullover 67 67 67 11 11 11\n", "Shirt 45 45 90 10 10 20\n", "Tshirt 89 89 89 23 23 23" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4.groupby('Product Name').aggregate(['min', 'max', 'sum'])" ] }, { "cell_type": "markdown", "metadata": { "id": "iaDkcuplQfD7" }, "source": [ "\n", "### G. Combining Datasets: Concatenating, Joining and Merging\n", "\n", "#### Concatenation" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "id": "7Sas3WNFQgf6" }, "outputs": [], "source": [ "# Creating dataframes\n", "\n", "df1 = pd.DataFrame({'Col1':['A','B','C'],\n", " 'Col2':[1,2,3]}, \n", " index=['a','b','c'])\n", "\n", "df2 = pd.DataFrame({'Col1':['D','E','F'],\n", " 'Col2':[4,5,6]}, \n", " index=['d','e','f'])\n", "\n", "\n", "df3 = pd.DataFrame({'Col1':['G','I','J'],\n", " 'Col2':[7,8,9]}, \n", " index=['g', 'i','j'])" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "DWCL_KbiV00-", "outputId": "c16c316a-3a6c-48c3-98dc-6538bc717ca2" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2
aA1
bB2
cC3
\n", "
" ], "text/plain": [ " Col1 Col2\n", "a A 1\n", "b B 2\n", "c C 3" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "mCmFFGDcXBlx", "outputId": "a63840b8-26a5-4c04-9088-87824453b90e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2
dD4
eE5
fF6
\n", "
" ], "text/plain": [ " Col1 Col2\n", "d D 4\n", "e E 5\n", "f F 6" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "HftwMeb1XEI6", "outputId": "d8ac4302-d7df-49c6-c251-609a1525cc76" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2
gG7
iI8
jJ9
\n", "
" ], "text/plain": [ " Col1 Col2\n", "g G 7\n", "i I 8\n", "j J 9" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "id": "GJ4UHZakXFiM", "outputId": "d3af8f06-e924-48db-98db-a96cbdc6e7ca" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2
aA1
bB2
cC3
dD4
eE5
fF6
gG7
iI8
jJ9
\n", "
" ], "text/plain": [ " Col1 Col2\n", "a A 1\n", "b B 2\n", "c C 3\n", "d D 4\n", "e E 5\n", "f F 6\n", "g G 7\n", "i I 8\n", "j J 9" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Concatenating: Adding one dataset to another\n", "\n", "pd.concat([df1, df2, df3])" ] }, { "cell_type": "markdown", "metadata": { "id": "A8NR3BpsM_Et" }, "source": [ "The default axis is `0`. This is how the combined dataframes will look like if we change the `axis to 1`." ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "id": "n2QCVHToNYIT", "outputId": "96e1fefb-2e32-4838-e8e4-48858d8bd93c" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2Col1Col2Col1Col2
aA1.0NaNNaNNaNNaN
bB2.0NaNNaNNaNNaN
cC3.0NaNNaNNaNNaN
dNaNNaND4.0NaNNaN
eNaNNaNE5.0NaNNaN
fNaNNaNF6.0NaNNaN
gNaNNaNNaNNaNG7.0
iNaNNaNNaNNaNI8.0
jNaNNaNNaNNaNJ9.0
\n", "
" ], "text/plain": [ " Col1 Col2 Col1 Col2 Col1 Col2\n", "a A 1.0 NaN NaN NaN NaN\n", "b B 2.0 NaN NaN NaN NaN\n", "c C 3.0 NaN NaN NaN NaN\n", "d NaN NaN D 4.0 NaN NaN\n", "e NaN NaN E 5.0 NaN NaN\n", "f NaN NaN F 6.0 NaN NaN\n", "g NaN NaN NaN NaN G 7.0\n", "i NaN NaN NaN NaN I 8.0\n", "j NaN NaN NaN NaN J 9.0" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2, df3], axis=1)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "id": "VL4FtD33Xdln", "outputId": "1bacc555-8d28-45a0-ddcc-56f2221a4b66" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2
aA1
bB2
cC3
dD4
eE5
fF6
gG7
iI8
jJ9
\n", "
" ], "text/plain": [ " Col1 Col2\n", "a A 1\n", "b B 2\n", "c C 3\n", "d D 4\n", "e E 5\n", "f F 6\n", "g G 7\n", "i I 8\n", "j J 9" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can also use append()\n", "\n", "df1.append([df2, df3])" ] }, { "cell_type": "markdown", "metadata": { "id": "G5RJbmpqN5C1" }, "source": [ "#### Merging\n", "\n", "If you have worked with SQL, what `pd.merge()` does may be familiar. It links data from different sources (different features) and you have a control on the structure of the combined dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "DvlwrPIsaLho" }, "source": [ "*Pandas Merge method(`how`): SQL Join Name : Description*\n", "\n", "\n", "\n", "```\n", "* left : LEFT OUTER JOIN : Use keys or columns from left frame only\n", "\n", "* right : RIGHT OUTER JOIN : Use keys or columns from right frame only\n", "\n", "* outer : FULL OUTER JOIN : Use union of keys or columns from both frames\n", "\n", "* inner : INNER JOIN : Use intersection of keys or columns from both frames\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "id": "xJvZbm8ZN1SB" }, "outputs": [], "source": [ "df1 = pd.DataFrame({'Name': ['Joe', 'Joshua', 'Jeanne', 'David'],\n", " 'Role': ['Manager', 'Developer', 'Engineer', 'Scientist']})\n", "\n", "df2 = pd.DataFrame({'Name': ['David', 'Joshua', 'Joe', 'Jeanne'],\n", "'Year Hired': [2018, 2017, 2020, 2018]})\n", "\n", "df3 = pd.DataFrame({'Name': ['David', 'Joshua', 'Joe', 'Jeanne'],\n", "'No of Leaves': [15, 3, 10, 12]})" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "fRK8jNJ2SqxM", "outputId": "a3782332-f882-4496-892a-6fb175af4d52" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameRole
0JoeManager
1JoshuaDeveloper
2JeanneEngineer
3DavidScientist
\n", "
" ], "text/plain": [ " Name Role\n", "0 Joe Manager\n", "1 Joshua Developer\n", "2 Jeanne Engineer\n", "3 David Scientist" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "LlzW52vJStN7", "outputId": "cbc480a2-0ede-4b5e-e55f-bac0a5b1f827" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameYear Hired
0David2018
1Joshua2017
2Joe2020
3Jeanne2018
\n", "
" ], "text/plain": [ " Name Year Hired\n", "0 David 2018\n", "1 Joshua 2017\n", "2 Joe 2020\n", "3 Jeanne 2018" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "JroYckVMSv1i", "outputId": "d4cdb9a9-9054-4249-b809-ccd8ff8d5f07" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameRoleYear Hired
0JoeManager2020
1JoshuaDeveloper2017
2JeanneEngineer2018
3DavidScientist2018
\n", "
" ], "text/plain": [ " Name Role Year Hired\n", "0 Joe Manager 2020\n", "1 Joshua Developer 2017\n", "2 Jeanne Engineer 2018\n", "3 David Scientist 2018" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "Qso0a2AMS-qg", "outputId": "56bcee2b-4548-45fa-f04e-5f7a588dd8f4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameRoleYear Hired
0JoeManager2020
1JoshuaDeveloper2017
2JeanneEngineer2018
3DavidScientist2018
\n", "
" ], "text/plain": [ " Name Role Year Hired\n", "0 Joe Manager 2020\n", "1 Joshua Developer 2017\n", "2 Jeanne Engineer 2018\n", "3 David Scientist 2018" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Let's merge on Role being a key \n", "\n", "pd.merge(df1, df2, how='inner', on=\"Name\")" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "id": "CvSLfcCZWrN8" }, "outputs": [], "source": [ "df1 = pd.DataFrame({'col1': ['K0', 'K0', 'K1', 'K2'],\n", " 'col2': ['K0', 'K1', 'K0', 'K1'],\n", " 'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3']})\n", " \n", "df2 = pd.DataFrame({'col1': ['K0', 'K1', 'K1', 'K2'],\n", " 'col2': ['K0', 'K0', 'K0', 'K0'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']})" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "fNuBZxJ3Y2tR", "outputId": "04fead01-62f2-4bf6-9610-03aae17f130b" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2AB
0K0K0A0B0
1K0K1A1B1
2K1K0A2B2
3K2K1A3B3
\n", "
" ], "text/plain": [ " col1 col2 A B\n", "0 K0 K0 A0 B0\n", "1 K0 K1 A1 B1\n", "2 K1 K0 A2 B2\n", "3 K2 K1 A3 B3" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "BnH1xC4WY31H", "outputId": "899584c7-05ed-4504-9b08-6e6858c5f822" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2CD
0K0K0C0D0
1K1K0C1D1
2K1K0C2D2
3K2K0C3D3
\n", "
" ], "text/plain": [ " col1 col2 C D\n", "0 K0 K0 C0 D0\n", "1 K1 K0 C1 D1\n", "2 K1 K0 C2 D2\n", "3 K2 K0 C3 D3" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "bQ_chqv-VpxK", "outputId": "79b127fd-90ea-43b7-ccc9-1d601ad4e439" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2
\n", "
" ], "text/plain": [ " col1 col2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K1 K0 A2 B2 C1 D1\n", "2 K1 K0 A2 B2 C2 D2" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='inner', on=['col1', 'col2'])" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "AA0B0vnOZBe-", "outputId": "804bc5a4-135b-418c-e214-adcca9c76df5" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN
5K2K0NaNNaNC3D3
\n", "
" ], "text/plain": [ " col1 col2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K0 K1 A1 B1 NaN NaN\n", "2 K1 K0 A2 B2 C1 D1\n", "3 K1 K0 A2 B2 C2 D2\n", "4 K2 K1 A3 B3 NaN NaN\n", "5 K2 K0 NaN NaN C3 D3" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='outer', on=['col1', 'col2'])" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "jqLLLuMwZYSS", "outputId": "ad49df22-ff36-47dc-d9f4-224da1fbcd61" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN
\n", "
" ], "text/plain": [ " col1 col2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K0 K1 A1 B1 NaN NaN\n", "2 K1 K0 A2 B2 C1 D1\n", "3 K1 K0 A2 B2 C2 D2\n", "4 K2 K1 A3 B3 NaN NaN" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='left')" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "92MAb163Z0IM", "outputId": "bf72f4bb-b61f-4149-d42a-ec542b29ca0e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2
3K2K0NaNNaNC3D3
\n", "
" ], "text/plain": [ " col1 col2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K1 K0 A2 B2 C1 D1\n", "2 K1 K0 A2 B2 C2 D2\n", "3 K2 K0 NaN NaN C3 D3" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='right')" ] }, { "cell_type": "markdown", "metadata": { "id": "8eBHN57CcYhI" }, "source": [ "#### Joining\n", "\n", "Joining is a simple way to combine columns of two dataframes with different indexes. \n" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "id": "cc9WCJHzc5Lt" }, "outputs": [], "source": [ "df1 = pd.DataFrame({'Col1': ['A', 'B', 'C'],\n", " 'Col2': [11, 12, 13]},\n", " index=['a', 'b', 'c']) \n", "\n", "df2 = pd.DataFrame({'Col3': ['D', 'E', 'F'],\n", " 'Col4': [14, 14, 16]},\n", " index=['a', 'c', 'd'])" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "-2vPNNXcd532", "outputId": "314fc545-bcba-418d-ecf9-e236df1de66a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2
aA11
bB12
cC13
\n", "
" ], "text/plain": [ " Col1 Col2\n", "a A 11\n", "b B 12\n", "c C 13" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "fPRLEeLod7gW", "outputId": "ab506d97-75c6-4c15-a740-5148eec89acc" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col3Col4
aD14
cE14
dF16
\n", "
" ], "text/plain": [ " Col3 Col4\n", "a D 14\n", "c E 14\n", "d F 16" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "9vjxh0O-d9e2", "outputId": "02cb3f80-887d-447a-b2cb-b21e0402f85e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2Col3Col4
aA11D14.0
bB12NaNNaN
cC13E14.0
\n", "
" ], "text/plain": [ " Col1 Col2 Col3 Col4\n", "a A 11 D 14.0\n", "b B 12 NaN NaN\n", "c C 13 E 14.0" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.join(df2)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "RwuSaXD5eBPN", "outputId": "0ced8f45-3247-4b21-bac5-e1909b3a9b60" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col3Col4Col1Col2
aD14A11.0
cE14C13.0
dF16NaNNaN
\n", "
" ], "text/plain": [ " Col3 Col4 Col1 Col2\n", "a D 14 A 11.0\n", "c E 14 C 13.0\n", "d F 16 NaN NaN" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.join(df1)" ] }, { "cell_type": "markdown", "metadata": { "id": "5ApI4u3DeSk1" }, "source": [ "You can see that with `df.join()`, the alignment of data is on indexes. " ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 111 }, "id": "_i-_dSy5efHn", "outputId": "f1329d33-4263-493b-f416-1dca2e60206a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2Col3Col4
aA11D14
cC13E14
\n", "
" ], "text/plain": [ " Col1 Col2 Col3 Col4\n", "a A 11 D 14\n", "c C 13 E 14" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.join(df2, how='inner')" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "id": "88PjCW2ofM-y", "outputId": "e98270bc-8965-41fc-accb-5d32991d39c2" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2Col3Col4
aA11.0D14.0
bB12.0NaNNaN
cC13.0E14.0
dNaNNaNF16.0
\n", "
" ], "text/plain": [ " Col1 Col2 Col3 Col4\n", "a A 11.0 D 14.0\n", "b B 12.0 NaN NaN\n", "c C 13.0 E 14.0\n", "d NaN NaN F 16.0" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.join(df2, how='outer')" ] }, { "cell_type": "markdown", "metadata": { "id": "rLBMe6CINpzu" }, "source": [ "Learn more about Merging, Joining, and Concatenating the Pandas Dataframes [here](https://pandas.pydata.org/docs/user_guide/merging.html). " ] }, { "cell_type": "markdown", "metadata": { "id": "I86Tf0VYfTJT" }, "source": [ "\n", "### H. Beyond Dataframes: Working with CSV and Excel\n", "\n", "In this last section of Pandas' fundamentals, we will see how to read real world data with different formats: CSV and Excel" ] }, { "cell_type": "markdown", "metadata": { "id": "BQD9appAk64O" }, "source": [ "#### CSV and Excel\n", "\n", "Let's use california housing dataset. " ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "id": "K7rfiyx6ivZ_" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 1390k 100 1390k 0 0 409k 0 0:00:03 0:00:03 --:--:-- 409k\n" ] } ], "source": [ "# Let's download the data \n", "\n", "!curl -O https://raw.githubusercontent.com/nyandwi/public_datasets/master/housing.csv" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('housing.csv')" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "bbASBwqilYCZ", "outputId": "6d0a180e-2887-4346-a55e-8ebf02d0a793" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VetH2qu5mnFM", "outputId": "0321f018-e42a-4976-82db-5889461d1f3f" }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(data)" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "id": "Ca5qySmRmsz0" }, "outputs": [], "source": [ "## Exporting dataframe back to csv\n", "\n", "data.to_csv('housing_dataset', index=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "LmqPzh9-nWgQ" }, "source": [ "If you look into the folder sidebar, you can see `Housing Dataset`. " ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "id": "vseWxDRYm1iA" }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'openpyxl'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/9x/fscj3yx566q3y3y1kf5yh9m40000gn/T/ipykernel_1348/1131967869.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m## Exporting CSV to Excel\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_excel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'housing_excel.xlsx'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mto_excel\u001b[0;34m(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)\u001b[0m\n\u001b[1;32m 2282\u001b[0m \u001b[0minf_rep\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minf_rep\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2283\u001b[0m )\n\u001b[0;32m-> 2284\u001b[0;31m formatter.write(\n\u001b[0m\u001b[1;32m 2285\u001b[0m \u001b[0mexcel_writer\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2286\u001b[0m \u001b[0msheet_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msheet_name\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/io/formats/excel.py\u001b[0m in \u001b[0;36mwrite\u001b[0;34m(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)\u001b[0m\n\u001b[1;32m 832\u001b[0m \u001b[0;31m# error: Cannot instantiate abstract class 'ExcelWriter' with abstract\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 833\u001b[0m \u001b[0;31m# attributes 'engine', 'save', 'supported_extensions' and 'write_cells'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 834\u001b[0;31m writer = ExcelWriter( # type: ignore[abstract]\n\u001b[0m\u001b[1;32m 835\u001b[0m \u001b[0mwriter\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstorage_options\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstorage_options\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 836\u001b[0m )\n", "\u001b[0;32m~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)\u001b[0m\n\u001b[1;32m 46\u001b[0m ):\n\u001b[1;32m 47\u001b[0m \u001b[0;31m# Use the openpyxl module as the Excel writer.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 48\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mopenpyxl\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mworkbook\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mWorkbook\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 49\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 50\u001b[0m \u001b[0mengine_kwargs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcombine_kwargs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mengine_kwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'openpyxl'" ] } ], "source": [ "## Exporting CSV to Excel\n", "\n", "data.to_excel('housing_excel.xlsx', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0DinnjCfnt3M" }, "outputs": [], "source": [ "## Reading the Excel file back\n", "\n", "excel_data = pd.read_excel('housing_excel.xlsx')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "id": "mIKA-KwYiFgZ", "outputId": "cdbf9eed-a457-400c-a7d8-5620bc275001" }, "outputs": [], "source": [ "\n", "excel_data.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "HK8qIyAEjRsM" }, "source": [ "\n", "## Real World: Exploratory Data Analysis (EDA)\n", "\n", "All above was the basics. Let us apply some of these techniques to the real world dataset, `Red wine quality`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YEWQ8U-FjTVo", "outputId": "cb81a5a4-43d4-4821-e29a-84e243e40aff" }, "outputs": [], "source": [ "!curl -O https://raw.githubusercontent.com/nyandwi/public_datasets/master/winequality-red.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EBXDmKTlmy0X" }, "outputs": [], "source": [ "wine_data = pd.read_csv('winequality-red.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 221 }, "id": "Gr93h3Xgm8vF", "outputId": "2510e3ee-7ff4-46fc-e19a-5bf67b50edd8" }, "outputs": [], "source": [ "# Displaying the head of the dataset\n", "\n", "wine_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 221 }, "id": "BrBYEf9Dqg9K", "outputId": "b240cb52-ba5b-4286-a42a-8ea7feafe135" }, "outputs": [], "source": [ "# Displaying the tail of the dataset\n", "\n", "wine_data.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 421 }, "id": "qauq0ujDnocm", "outputId": "23d06fc6-738b-4ad8-8582-3fd6e916cf18" }, "outputs": [], "source": [ "# Displaying summary statistics\n", "\n", "wine_data.describe().transpose()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qPCZ3X8WqN4j", "outputId": "85edec22-5d4e-4f79-c0aa-af7ac9f0700e" }, "outputs": [], "source": [ "# Displaying quick information about the dataset \n", "\n", "wine_data.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LWL8QJ_XqTCJ", "outputId": "17558d3d-ac51-443d-d7f4-08c0a5d7f016" }, "outputs": [], "source": [ "# Checking missing values\n", "\n", "wine_data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8SnzW2ibsmYp", "outputId": "41953dab-5aa4-4470-e412-7442db0bf2c8" }, "outputs": [], "source": [ "# wine quality range from 0 to 10. The higher the quality value, the good wine is\n", "\n", "wine_data['quality'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 484 }, "id": "ij8yGjqnt2HD", "outputId": "9a237e6a-69da-4937-d612-3b8e01cf3309" }, "outputs": [], "source": [ "wine_data.groupby(['fixed acidity', 'volatile acidity', 'citric acid']).sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 484 }, "id": "8hl4EQpNt680", "outputId": "d637c1ad-f4d8-4f38-e5ba-f63a24ccdd47" }, "outputs": [], "source": [ "wine_data.groupby(['free sulfur dioxide', 'total sulfur dioxide']).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the end of the lab that was about using Pandas to manipulate data. Alot of things will make sense when we start to prepare data for machine learning models in the next notebooks. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### [BACK TO TOP](#0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "2. Data Analysis & Manipulation with Pandas.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }