{ "cells": [ { "cell_type": "markdown", "id": "4a87b5ef", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "ab0dc25c", "metadata": {}, "source": [ "

Lecture 3.11 (Pandas-03)

" ] }, { "cell_type": "markdown", "id": "19f82705", "metadata": {}, "source": [ "\n", "\n", "## _Overview of Pandas Dataframe Data Structure_" ] }, { "cell_type": "markdown", "id": "4c4a18fd", "metadata": {}, "source": [ "#### Read about Pandas Data Structures: https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro" ] }, { "cell_type": "code", "execution_count": null, "id": "31fc4ed9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9727124d", "metadata": {}, "source": [ "## Learning agenda of this notebook\n", "1. Anatomy of a Dataframe\n", "2. Creating a Dataframe\n", " - An empty dataframe\n", " - Two-Dimensional NumPy Array\n", " - Dictionary of Python Lists\n", " - Dictionary of Pandas Series\n", "3. Attributes of a Dataframe\n", "4. Bonus" ] }, { "cell_type": "code", "execution_count": null, "id": "c2831c56", "metadata": {}, "outputs": [], "source": [ "# To install pandas from within a Jupyter notebook, uncomment the following two lines\n", "#import sys\n", "#!{sys.executable} -m pip install pandas" ] }, { "cell_type": "code", "execution_count": 1, "id": "dba905d0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('1.3.4',\n", " ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas'])" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.__version__ , pd.__path__" ] }, { "cell_type": "code", "execution_count": null, "id": "88a306ac", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "99a7442e", "metadata": {}, "source": [ "\n", "\n", "## 1. Creating a Dataframe\n", "\n", ">**A Pandas Dataframe is a two-dimensional labeled data structure (like an SQL table) with heterogeneously typed columns, having both a row index and a column index.**\n", "\n", "\n", "\n", "**```pd.DataFrame(data=None, index=None, columns=None, dtype=None)```**\n", "- Where,\n", " - `data`: It can be a 2-D NumPy array, a dictionary of Python lists, or a dictionary of Pandas Series. (You can also create a dataframe from a file in CSV, Excel, JSON, or HTML format, or from a database table.)\n", " - `index`: The row labels. Defaults to RangeIndex (0, 1, 2, ..., n-1) if the `index` argument is not passed and the input data carries no index information.\n", " - `columns`: The column labels. Defaults to RangeIndex (0, 1, 2, ..., n-1) if the `columns` argument is not passed and the input data carries no column labels.\n", " - `dtype`: Data type to force. Only a single dtype is allowed. If None, the type is inferred." ] }, { "cell_type": "code", "execution_count": null, "id": "ff6ca294", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d7273559", "metadata": {}, "source": [ "### a. Creating an Empty Dataframe" ] }, { "cell_type": "code", "execution_count": 2, "id": "7b9f2409", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Empty DataFrame\n", "Columns: []\n", "Index: []\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "df = pd.DataFrame()\n", "print(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "adf6151f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8c5d39df", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bc846b98", "metadata": {}, "source": [ "### b. 
Creating a Dataframe from a 2-D NumPy Array" ] }, { "cell_type": "code", "execution_count": 4, "id": "c002856d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numpy Array:\n", " [[51 10 67 20 67]\n", " [55 24 35 57 95]\n", " [47 74 25 37 10]\n", " [96 64 19 21 86]\n", " [10 49 10 50 45]\n", " [94 33 76 66 37]]\n", "Pandas Dataframe:\n", " 0 1 2 3 4\n", "0 51 10 67 20 67\n", "1 55 24 35 57 95\n", "2 47 74 25 37 10\n", "3 96 64 19 21 86\n", "4 10 49 10 50 45\n", "5 94 33 76 66 37\n" ] } ], "source": [ "arr = np.random.randint(10, 100, size=(6, 5))\n", "print(\"Numpy Array:\\n\",arr)\n", "\n", "df = pd.DataFrame(data=arr)\n", "print(\"Pandas Dataframe:\\n\",df)" ] }, { "cell_type": "markdown", "id": "6810f5a0", "metadata": {}, "source": [ "- Note that both the row indices and the column labels are implicitly set to integers from 0 to n-1, since neither was provided when the dataframe object was created. These labels are not part of the data stored in the dataframe.\n", "- In the majority of cases, the row labels are left at their defaults, i.e., 0, 1, 2, 3, ... The column labels, however, are usually changed from 0, 1, 2, 3, ... to meaningful names." 
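, "\n", "\n", "
The two notes above can be checked directly. The sketch below is illustrative only: it uses a small fixed array instead of random numbers so that the result is reproducible, and the column names `height` and `width` are made up for the example.

```python
import numpy as np
import pandas as pd

# A small fixed array (an assumption for reproducibility; the cell above uses random values)
arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(data=arr)

# With no labels supplied, both axes receive a default RangeIndex starting at 0
print(df.index)    # RangeIndex(start=0, stop=3, step=1)
print(df.columns)  # RangeIndex(start=0, stop=2, step=1)

# The labels are not part of the data: the shape counts only the stored values
print(df.shape)    # (3, 2)

# rename() returns a new dataframe with meaningful (here hypothetical) column names
df_named = df.rename(columns={0: 'height', 1: 'width'})
print(df_named.columns.tolist())  # ['height', 'width']
```

Because `rename()` returns a new object, the original `df` keeps its default integer labels.
"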
] }, { "cell_type": "code", "execution_count": null, "id": "354845ed", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e30fb823", "metadata": {}, "outputs": [], "source": [ "# Set column labels of our choice while creating the dataframe\n", "col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']\n", "df = pd.DataFrame(data=arr, columns=col_labels)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "f362c1df", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "faace9e7", "metadata": {}, "outputs": [], "source": [ "# Set row labels of our choice while creating the dataframe\n", "df = pd.DataFrame(data=arr, index=['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5'])\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "1c9c67c7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 5, "id": "1ea2973f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col0Col1Col2Col3Col4
Row05110672067
Row15524355795
Row24774253710
Row39664192186
Row41049105045
Row59433766637
\n", "
" ], "text/plain": [ " Col0 Col1 Col2 Col3 Col4\n", "Row0 51 10 67 20 67\n", "Row1 55 24 35 57 95\n", "Row2 47 74 25 37 10\n", "Row3 96 64 19 21 86\n", "Row4 10 49 10 50 45\n", "Row5 94 33 76 66 37" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let us name the both row labels and column labels to strings of our choice, while creating it\n", "row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']\n", "col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']\n", "df = pd.DataFrame(data=arr, index=row_labels, columns=col_labels)\n", "df" ] }, { "cell_type": "code", "execution_count": 6, "id": "2968c77c", "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "'Row0'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3360\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", 
"\u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 'Row0'", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/1t/g3ylw8h50cjdqmk5d6jh1qmm0000gn/T/ipykernel_28361/1613391253.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Row0'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3456\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3457\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3458\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3459\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3460\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3361\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3365\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 'Row0'" ] } ], "source": [ "df['Row0']" ] }, { "cell_type": "markdown", "id": "3b79c790", "metadata": {}, "source": [ "- You can do this later as well, i.e., after the dataframe has been created with default indices.\n", "- This is done by assigning a list of labels/values to `index` and `columns` attributes of a dataframe object." ] }, { "cell_type": "code", "execution_count": null, "id": "ef4c1820", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2805dd6e", "metadata": {}, "outputs": [], "source": [ "arr = np.random.randint(10,100, size= (6,5))\n", "df = pd.DataFrame(data=arr)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "5009a4b0", "metadata": {}, "outputs": [], "source": [ "row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']\n", "col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']\n", "\n", "df.columns = col_labels\n", "df.index = row_labels\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "14a7ea81", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4d78c91b", "metadata": {}, "source": [ "### c. Creating a Dataframe from a Dictionary of Python Lists\n", "- You can create a dataframe object from a dictionary of Python Lists \n", " - The dictionary `Keys` become the column names, and \n", " - The dictionary `Values` are lists/arrays containing data for the respective columns." 
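, "\n", "\n", "
As a rough illustration of the mapping described above, the sketch below builds a small dataframe from a dictionary of lists. The names `scores`, `student` and `marks` are invented for this example. It also shows what happens when the lists differ in length: pandas raises a `ValueError` rather than padding the shorter list.

```python
import pandas as pd

# Hypothetical data: each key becomes a column label, each list the column data
scores = {
    'student': ['Ali', 'Sara', 'Omar'],
    'marks': [81, 67, 90],
}
df_scores = pd.DataFrame(data=scores)
print(df_scores.columns.tolist())   # ['student', 'marks']
print(df_scores['marks'].tolist())  # [81, 67, 90]

# Lists of unequal length cannot be lined up into rows, so construction fails
length_error_raised = False
try:
    pd.DataFrame(data={'student': ['Ali', 'Sara', 'Omar'], 'marks': [81, 67]})
except ValueError as err:
    length_error_raised = True
    print('ValueError:', err)
```
"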
] }, { "cell_type": "code", "execution_count": null, "id": "4f4c0cc7", "metadata": {}, "outputs": [], "source": [ "people = {\n", " \"name\" : [\"Rauf\", \"Arif\", \"Maaz\", \"Hadeed\", \"Mujahid\", \"Mohid\"],\n", " \"age\" : [52, 51, 26, 22, 18, 17],\n", " \"address\": [\"Lahore\", \"Karachi\", \"Lahore\", \"Islamabad\", \"Kakul\", \"Karachi\"],\n", " \"cell\" : [\"321-123\", \"320-431\", \"321-478\", \"324-446\", \"321-967\", \"320-678\"],\n", " \"bg\": [\"B+\", \"A-\", \"B+\", \"O-\", \"A-\", \"B+\"]\n", "}\n", "people" ] }, { "cell_type": "code", "execution_count": null, "id": "daea6fce", "metadata": {}, "outputs": [], "source": [ "# Pass this dictionary of Python lists to pd.DataFrame()\n", "df_people = pd.DataFrame(data=people)\n", "df_people" ] }, { "cell_type": "markdown", "id": "c7ddb8c1", "metadata": {}, "source": [ "- Note that the column labels are taken from the keys of the dictionary object, while the row labels are set to the default integer values.\n", "- You can set the row labels while creating the dataframe by passing the `index` argument to `pd.DataFrame()`, or later by assigning new values to the `index` attribute of the dataframe object (the `columns` attribute works the same way for column labels)." ] }, { "cell_type": "code", "execution_count": null, "id": "933c79a6", "metadata": {}, "outputs": [], "source": [ "# Let us change the row labels of the above dataframe\n", "row_labels = ['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06']\n", "df_people.index = row_labels\n", "df_people" ] }, { "cell_type": "code", "execution_count": null, "id": "7d5591a0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "dfdad357", "metadata": {}, "source": [ "### d. Creating a Dataframe from a Dictionary of Pandas Series\n", "One can think of a dataframe as a dictionary of Pandas Series: \n", "- `Keys` are column names, and \n", "- `Values` are Series objects for the respective columns." 
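, "\n", "\n", "
One way to see the dictionary analogy, sketched here with made-up data: selecting a column with square brackets returns a Series, much like looking up a key in a dictionary, and every column shares the row index of the dataframe. The names `columns_dict` and `df_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical two-column example: keys are column labels, values are Series
columns_dict = {
    'name': pd.Series(['Arif', 'Hadeed']),
    'age': pd.Series([50, 22]),
}
df_demo = pd.DataFrame(data=columns_dict)

# Looking up a column label returns the Series stored under that label
age = df_demo['age']
print(type(age).__name__)               # Series
print(age.name)                         # age
print(age.index.equals(df_demo.index))  # True

# keys() returns the column labels, mirroring dict.keys()
print(list(df_demo.keys()))             # ['name', 'age']
```
"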
] }, { "cell_type": "code", "execution_count": null, "id": "9fa00f68", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Named series_dict rather than dict, to avoid shadowing Python's built-in dict type\n", "series_dict = {\n", " \"name\": pd.Series(['Arif', 'Hadeed', 'Mujahid']),\n", " \"age\": pd.Series([50, 22, 18]),\n", " \"addr\": pd.Series(['Lahore', 'Islamabad','Karachi']),\n", "}\n", "df = pd.DataFrame(data=series_dict)\n", "df" ] }, { "cell_type": "markdown", "id": "c9d92bf4", "metadata": {}, "source": [ ">Note from the above output that every series object becomes the data of the corresponding column. Moreover, the keys of the dictionary become the column labels." ] }, { "cell_type": "code", "execution_count": null, "id": "fbd41d29", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3984364d", "metadata": {}, "outputs": [], "source": [ "series_dict = {\n", " \"name\": pd.Series(data=['Arif', 'Hadeed', 'Mujahid', 'Maaz'], index=['a','b','c', 'd']),\n", " \"age\": pd.Series(data=[50, 22,np.nan, 18], index=['a','b','c','d']),\n", " \"addr\": pd.Series(data=['Lahore', '', 'Peshawer','Karachi'], index=['a','b','c', 'd']),\n", "}\n", "df = pd.DataFrame(series_dict)\n", "df" ] }, { "cell_type": "markdown", "id": "62175e2b", "metadata": {}, "source": [ ">- In the above code and its output, note that every series object has four data values and four corresponding indices.\n", ">- Also note that the `age` series contains a NaN value, and the `addr` series contains an empty string.\n", ">- Another point to note is that the row indices of the three series match exactly, in number as well as in sequence and value.\n", ">- A question arises: what if the indices of the series are different? See the following code to understand this concept." 
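, "\n", "\n", "
Before the notebook's own example, here is a smaller sketch of the same situation, using two hypothetical series `s1` and `s2` whose indices only partly overlap. When the indices differ, pandas aligns the series on their labels: the row index of the result is the union of all labels, and any position for which a series has no value is filled with NaN.

```python
import pandas as pd

# Two hypothetical series; only the label 'b' is common to both
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10.5, 20.5], index=['b', 'z'])
df_aligned = pd.DataFrame(data={'s1': s1, 's2': s2})
print(df_aligned)

# The row index is the union of both indices
print(df_aligned.index.tolist())            # ['a', 'b', 'c', 'z']

# Missing positions become NaN: one in s1 (row 'z'), two in s2 (rows 'a' and 'c')
print(int(df_aligned['s1'].isna().sum()))   # 1
print(int(df_aligned['s2'].isna().sum()))   # 2

# The integer column is converted to float so that it can hold NaN
print(df_aligned['s1'].dtype)               # float64
```
"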
] }, { "cell_type": "code", "execution_count": null, "id": "a0d5332b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 4, "id": "558777d2", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 5, "id": "4c16eeab", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageaddr
aArif50.0Lahore
bHadeedNaNNaN
cMujahidNaNNaN
dMaaz18.0
xNaN22.0Karachi
yNaNNaNNaN
\n", "
" ], "text/plain": [ " name age addr\n", "a Arif 50.0 Lahore\n", "b Hadeed NaN NaN\n", "c Mujahid NaN NaN\n", "d Maaz 18.0 \n", "x NaN 22.0 Karachi\n", "y NaN NaN NaN" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The three series below have different indices; the dataframe aligns them on their labels\n", "series_dict = {\n", " \"name\": pd.Series(['Arif', 'Hadeed', 'Mujahid', 'Maaz'], index=['a','b','c', 'd']),\n", " \"age\": pd.Series([50, 22,np.nan, 18], index=['a','x','y','d']),\n", " \"addr\": pd.Series(['Lahore', '','Karachi'], index=['a', 'd', 'x']),\n", "}\n", "df = pd.DataFrame(series_dict)\n", "df" ] }, { "cell_type": "code", "execution_count": 18, "id": "53b75b68", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageaddr
aArif50.0Lahore
bHadeedNaNNaN
\n", "
" ], "text/plain": [ " name age addr\n", "a Arif 50.0 Lahore\n", "b Hadeed NaN NaN" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With a negative n, head() returns all rows except the last |n| rows\n", "df.head(-4)" ] }, { "cell_type": "code", "execution_count": 19, "id": "b96f5268", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageaddr
xNaN22.0Karachi
yNaNNaNNaN
\n", "
" ], "text/plain": [ " name age addr\n", "x NaN 22.0 Karachi\n", "y NaN NaN NaN" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With a negative n, tail() returns all rows except the first |n| rows\n", "df.tail(-4)" ] }, { "cell_type": "code", "execution_count": null, "id": "6c132815", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7fb33e69", "metadata": {}, "source": [ ">- In the above code and its output, note that the first series object has four data values and four corresponding indices. Similarly, the second series object has four data values (one of them `np.nan`) and four corresponding indices, which differ partly from those of the first series. The third series has three data values (one of them an empty string) and three indices.\n", ">- Note that the resulting dataframe has six rows and three columns, because its row index is the union of the indices of the three series ('a', 'b', 'c', 'd', 'x', 'y').\n", " - For index 'a' we have a value in all three series objects, i.e., in all three columns.\n", " - For index 'b' we have a value in the first series object and NaN in the second and third columns, since the second and third series objects have no value corresponding to row index 'b'." ] }, { "cell_type": "code", "execution_count": null, "id": "2e500525", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5f9a4c39", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b77fae30", "metadata": {}, "source": [ "## 3. Attributes of Pandas Dataframe\n", "- Like a Series, a dataframe exposes its properties/attributes through dot `.` notation." ] }, { "cell_type": "code", "execution_count": 6, "id": "18688a13", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageaddresscellbg
MS01Rauf52Lahore321-123B+
MS02Arif51Karachi320-431A-
MS03Maaz26Lahore321-478B+
MS04Hadeed22Islamabad324-446O-
MS05Mujahid18Kakul321-967A-
MS06Mohid17Karachi320-678B+
\n", "
" ], "text/plain": [ " name age address cell bg\n", "MS01 Rauf 52 Lahore 321-123 B+\n", "MS02 Arif 51 Karachi 320-431 A-\n", "MS03 Maaz 26 Lahore 321-478 B+\n", "MS04 Hadeed 22 Islamabad 324-446 O-\n", "MS05 Mujahid 18 Kakul 321-967 A-\n", "MS06 Mohid 17 Karachi 320-678 B+" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "people = {\n", " \"name\" : [\"Rauf\", \"Arif\", \"Maaz\", \"Hadeed\", \"Mujahid\", \"Mohid\"],\n", " \"age\" : [52, 51, 26, 22, 18, 17],\n", " \"address\": [\"Lahore\", \"Karachi\", \"Lahore\", \"Islamabad\", \"Kakul\", \"Karachi\"],\n", " \"cell\" : [\"321-123\", \"320-431\", \"321-478\", \"324-446\", \"321-967\", \"320-678\"],\n", " \"bg\": [\"B+\", \"A-\", \"B+\", \"O-\", \"A-\", \"B+\"]\n", "}\n", "\n", "df_people = pd.DataFrame(data=people, index=['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'])\n", "df_people" ] }, { "cell_type": "code", "execution_count": 7, "id": "ee17855b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6, 5)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `shape` attribute of a dataframe object returns a two-value tuple: (number of rows, number of columns)\n", "# Note that the row count does not include the column labels, and the column count does not include the row index\n", "df_people.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "2a6e7edd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 8, "id": "53fe6d82", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `ndim` attribute of a dataframe object returns the number of dimensions (always 2 for a dataframe)\n", "df_people.ndim" ] }, { "cell_type": "code", "execution_count": null, "id": "4b807f01", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 9, "id": "f4716fdf", "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "30" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `size` attribute of a dataframe object returns the number of elements (rows x columns)\n", "df_people.size" ] }, { "cell_type": "code", "execution_count": null, "id": "6b28b509", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 10, "id": "b943c0ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'], dtype='object')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `index` attribute of a dataframe object returns the row labels as an Index object, along with their dtype\n", "df_people.index" ] }, { "cell_type": "code", "execution_count": null, "id": "3f879f9a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 11, "id": "b8a7fc1b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['name', 'age', 'address', 'cell', 'bg'], dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `columns` attribute of a dataframe object returns the column labels as an Index object, along with their dtype\n", "df_people.columns" ] }, { "cell_type": "code", "execution_count": 12, "id": "05476544", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Index(['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'], dtype='object'),\n", " Index(['name', 'age', 'address', 'cell', 'bg'], dtype='object')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `axes` attribute returns a list holding both the row index and the column labels\n", "df_people.axes" ] }, { "cell_type": "code", "execution_count": 13, "id": "3806bafa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([['Rauf', 52, 'Lahore', '321-123', 'B+'],\n", " ['Arif', 51, 'Karachi', '320-431', 'A-'],\n", " ['Maaz', 26, 'Lahore', '321-478', 
'B+'],\n", " ['Hadeed', 22, 'Islamabad', '324-446', 'O-'],\n", " ['Mujahid', 18, 'Kakul', '321-967', 'A-'],\n", " ['Mohid', 17, 'Karachi', '320-678', 'B+']], dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `values` attribute of a dataframe object returns a 2-D NumPy array holding all the values in the dataframe,\n", "# without the row indices and column labels (the pandas documentation recommends `to_numpy()` for new code)\n", "df_people.values" ] }, { "cell_type": "code", "execution_count": 14, "id": "5ca5006b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `empty` attribute returns True if the dataframe has no elements, and False otherwise\n", "df_people.empty" ] }, { "cell_type": "code", "execution_count": 15, "id": "899d087f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name object\n", "age int64\n", "address object\n", "cell object\n", "bg object\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `dtypes` attribute of a dataframe object returns the data type of each column in the dataframe\n", "df_people.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "61d0403a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "63c3d167", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# `count()` is a method rather than an attribute; it returns the number of non-NA values in each column\n", "df_people.count()" ] }, { "cell_type": "code", "execution_count": null, "id": "e1b64d0e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2011f372", "metadata": {}, "source": [ "# Bonus" ] }, { "cell_type": "markdown", "id": "506b8217", "metadata": {}, "source": [ "#### The `df.info()` Method" ] }, { "cell_type": "code", "execution_count": null, "id": "92f3c6a1", "metadata": {}, "outputs": [], "source": [ "# `info()` prints a concise summary of a dataframe: the row index range, the column labels,\n", "# the non-null count and dtype of each column, and the memory usage\n", "df_people.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "3cd86298", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c3cfb3f0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }