{ "cells": [ { "cell_type": "markdown", "id": "1bc72fe3", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "3d522b27", "metadata": {}, "source": [ "

Lecture 3.12 (Pandas-04)

" ] }, { "cell_type": "markdown", "id": "b852d0bd", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "7de11ce0", "metadata": {}, "source": [ "## _IO with CSV EXCEL and JSON Files_\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "7b992d9d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "441aebbc", "metadata": {}, "source": [ "#### Read Pandas Documentation:\n", "- General Info: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\n", "\n", "\n", "- For `read_csv`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv\n", "\n", "\n", "- For `read_excel`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html?highlight=read_excel#pandas.read_excel\n", "\n", "\n", "- For `read_json`:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.read_json.html?highlight=pandas%20read_json#pandas.io.json.read_json\n", "\n", "\n", "- For `to_csv`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv\n", "\n", "\n", "\n", "- For `to_excel`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html?highlight=to_excel#pandas.DataFrame.to_excel\n", "\n", "\n", "- For `to_json`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html?highlight=to_json" ] }, { "cell_type": "markdown", "id": "22dcd278", "metadata": {}, "source": [ "## Learning agenda of this notebook\n", "[Pandas](https://pandas.pydata.org/) provides helper functions to read data from various file formats like CSV, EXCEL, JSON, HTML, SQL table, and many more.\n", "1. Reading data from a CSV/TSV File\n", "2. Reading a CSV file from a Remote System\n", "3. Writing Contents of Dataframe to a CSV File\n", "4. Reading data from an EXCEL File\n", "5. Writing Contents of Dataframe to an EXCEL File\n", "6. Reading data from a JSON File\n", "7. Writing Contents of Dataframe to a JSON File" ] }, { "cell_type": "code", "execution_count": null, "id": "d85d4c7a", "metadata": {}, "outputs": [], "source": [ "# To install this library in Jupyter notebook\n", "import sys\n", "!{sys.executable} -m pip install pandas --quiet" ] }, { "cell_type": "code", "execution_count": 1, "id": "b96c7461", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('1.3.4',\n", " ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas'])" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.__version__ , pd.__path__" ] }, { "cell_type": "code", "execution_count": null, "id": "2c74b937", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bb4e943a", "metadata": {}, "source": [ "## 1. Reading from CSV/TSV Files\n", ">**CSV**: A text file in which the values are separated by a comma or a tab character is called a CSV or a TSV file. Each line of the file is a data record and each record consists of one or more fields, separated by a specific character called separator. A CSV/TSV file is typically used to store tabular data (numbers and text), in which each line will have the same number of fields." ] }, { "cell_type": "markdown", "id": "586d81fe", "metadata": {}, "source": [ "### a. 
Reading a Simple CSV File\n", "The `pd.read_csv()` method is used to read a comma-separated file into a DataFrame.\n", "```\n", "pd.read_csv(fname, delimiter=None, header='infer', skiprows=None , nrows=None , usecols=None, footer='',...)\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "id": "36f4f1fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rollno,gender,group,age,math,english,urdu\r\n", "MS01,female,group B,28.0,72.0,72,74.0\r\n", "MS02,female,group C,33.0,69.0,90,88.0\r\n", "MS03,female,group B,21.0,,95,93.0\r\n", "MS04,male,group A,44.0,47.0,57,44.0\r\n", "MS05,male,group C,54.0,76.0,78,\r\n", "MS06,female,group B,,71.0,83,78.0\r\n", "MS07,female,group B,47.0,88.0,95,92.0\r\n", "MS08,male,group B,33.0,40.0,43,39.0\r\n", "MS09,male,group D,27.0,64.0,64,67.0\r\n", "MS10,female,group B,33.0,38.0,60,50.0\r\n", "MS11,male,group C,46.0,58.0,54,52.0\r\n", "MS12,male,group D,53.0,40.0,52,43.0\r\n", "MS13,female,group B,25.0,65.0,81,73.0\r\n", "MS14,male,group A,26.0,78.0,72,70.0\r\n", "MS15,female,group A,25.0,50.0,53,58.0\r\n", "MS16,female,group C,29.0,69.0,75,78.0\r\n", "MS17,male,group C,22.0,88.0,89,86.0\r\n", "MS18,female,group B,31.0,18.0,32,28.0\r\n", "MS19,male,group C,33.0,46.0,42,46.0\r\n", "MS20,female,group C,36.0,54.0,58,\r\n", "MS21,male,group D,34.0,66.0,69,63.0\r\n", "MS22,female,group B,32.0,65.0,75,70.0\r\n", "MS23,male,group D,31.0,44.0,54,53.0\r\n", "MS24,female,group C,37.0,,73,73.0\r\n", "MS25,male,group D,33.0,74.0,71,80.0\r\n", "MS26,male,group A,39.0,73.0,74,72.0\r\n", "MS27,male,group B,,69.0,54,55.0\r\n", "MS28,female,group C,47.0,67.0,69,75.0\r\n", "MS29,male,group C,44.0,70.0,70,65.0\r\n", "MS30,female,group D,42.0,62.0,70,75.0\r\n", "MS31,female,group D,41.0,69.0,74,74.0\r\n", "MS32,female,group B,,63.0,65,61.0\r\n", "MS33,female,group E,29.0,56.0,72,65.0\r\n", "MS34,male,group D,34.0,40.0,42,38.0\r\n", "MS35,male,group E,45.0,97.0,87,82.0\r\n", "MS36,male,group E,31.0,81.0,81,79.0\r\n", "MS37,female,group D,28.0,74.0,81,\r\n", "MS38,female,group D,26.0,50.0,64,59.0\r\n", "MS39,female,group D,36.0,75.0,90,88.0\r\n", "MS40,male,group B,33.0,57.0,56,57.0\r\n", "MS41,male,group C,35.0,55.0,61,54.0\r\n", "MS42,female,group C,22.0,58.0,73,68.0\r\n", "MS43,female,group B,38.0,53.0,58,65.0\r\n", "MS44,male,group B,31.0,59.0,65,66.0\r\n", "MS45,female,group E,28.0,,56,54.0\r\n", "MS46,male,group B,26.0,65.0,54,57.0\r\n", "MS47,female,group A,36.0,55.0,65,62.0\r\n", "MS48,female,group C,30.0,66.0,71,76.0\r\n", "MS49,female,group D,40.0,57.0,74,76.0\r\n", "MS50,male,group C,37.0,66.0,78,81.0\r\n" ] } ], "source": [ "! cat datasets/classmarks.csv" ] }, { "cell_type": "code", "execution_count": 4, "id": "fef96838", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50, 7)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 3, "id": "e5863995", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnogendergroupagemathenglishurdu
0MS01femalegroup B28.072.07274.0
1MS02femalegroup C33.069.09088.0
2MS03femalegroup B21.0NaN9593.0
3MS04malegroup A44.047.05744.0
4MS05malegroup C54.076.078NaN
5MS06femalegroup BNaN71.08378.0
6MS07femalegroup B47.088.09592.0
7MS08malegroup B33.040.04339.0
8MS09malegroup D27.064.06467.0
9MS10femalegroup B33.038.06050.0
10MS11malegroup C46.058.05452.0
11MS12malegroup D53.040.05243.0
12MS13femalegroup B25.065.08173.0
13MS14malegroup A26.078.07270.0
14MS15femalegroup A25.050.05358.0
15MS16femalegroup C29.069.07578.0
16MS17malegroup C22.088.08986.0
17MS18femalegroup B31.018.03228.0
18MS19malegroup C33.046.04246.0
19MS20femalegroup C36.054.058NaN
20MS21malegroup D34.066.06963.0
21MS22femalegroup B32.065.07570.0
22MS23malegroup D31.044.05453.0
23MS24femalegroup C37.0NaN7373.0
24MS25malegroup D33.074.07180.0
25MS26malegroup A39.073.07472.0
26MS27malegroup BNaN69.05455.0
27MS28femalegroup C47.067.06975.0
28MS29malegroup C44.070.07065.0
29MS30femalegroup D42.062.07075.0
30MS31femalegroup D41.069.07474.0
31MS32femalegroup BNaN63.06561.0
32MS33femalegroup E29.056.07265.0
33MS34malegroup D34.040.04238.0
34MS35malegroup E45.097.08782.0
35MS36malegroup E31.081.08179.0
36MS37femalegroup D28.074.081NaN
37MS38femalegroup D26.050.06459.0
38MS39femalegroup D36.075.09088.0
39MS40malegroup B33.057.05657.0
40MS41malegroup C35.055.06154.0
41MS42femalegroup C22.058.07368.0
42MS43femalegroup B38.053.05865.0
43MS44malegroup B31.059.06566.0
44MS45femalegroup E28.0NaN5654.0
45MS46malegroup B26.065.05457.0
46MS47femalegroup A36.055.06562.0
47MS48femalegroup C30.066.07176.0
48MS49femalegroup D40.057.07476.0
49MS50malegroup C37.066.07881.0
\n", "
" ], "text/plain": [ " rollno gender group age math english urdu\n", "0 MS01 female group B 28.0 72.0 72 74.0\n", "1 MS02 female group C 33.0 69.0 90 88.0\n", "2 MS03 female group B 21.0 NaN 95 93.0\n", "3 MS04 male group A 44.0 47.0 57 44.0\n", "4 MS05 male group C 54.0 76.0 78 NaN\n", "5 MS06 female group B NaN 71.0 83 78.0\n", "6 MS07 female group B 47.0 88.0 95 92.0\n", "7 MS08 male group B 33.0 40.0 43 39.0\n", "8 MS09 male group D 27.0 64.0 64 67.0\n", "9 MS10 female group B 33.0 38.0 60 50.0\n", "10 MS11 male group C 46.0 58.0 54 52.0\n", "11 MS12 male group D 53.0 40.0 52 43.0\n", "12 MS13 female group B 25.0 65.0 81 73.0\n", "13 MS14 male group A 26.0 78.0 72 70.0\n", "14 MS15 female group A 25.0 50.0 53 58.0\n", "15 MS16 female group C 29.0 69.0 75 78.0\n", "16 MS17 male group C 22.0 88.0 89 86.0\n", "17 MS18 female group B 31.0 18.0 32 28.0\n", "18 MS19 male group C 33.0 46.0 42 46.0\n", "19 MS20 female group C 36.0 54.0 58 NaN\n", "20 MS21 male group D 34.0 66.0 69 63.0\n", "21 MS22 female group B 32.0 65.0 75 70.0\n", "22 MS23 male group D 31.0 44.0 54 53.0\n", "23 MS24 female group C 37.0 NaN 73 73.0\n", "24 MS25 male group D 33.0 74.0 71 80.0\n", "25 MS26 male group A 39.0 73.0 74 72.0\n", "26 MS27 male group B NaN 69.0 54 55.0\n", "27 MS28 female group C 47.0 67.0 69 75.0\n", "28 MS29 male group C 44.0 70.0 70 65.0\n", "29 MS30 female group D 42.0 62.0 70 75.0\n", "30 MS31 female group D 41.0 69.0 74 74.0\n", "31 MS32 female group B NaN 63.0 65 61.0\n", "32 MS33 female group E 29.0 56.0 72 65.0\n", "33 MS34 male group D 34.0 40.0 42 38.0\n", "34 MS35 male group E 45.0 97.0 87 82.0\n", "35 MS36 male group E 31.0 81.0 81 79.0\n", "36 MS37 female group D 28.0 74.0 81 NaN\n", "37 MS38 female group D 26.0 50.0 64 59.0\n", "38 MS39 female group D 36.0 75.0 90 88.0\n", "39 MS40 male group B 33.0 57.0 56 57.0\n", "40 MS41 male group C 35.0 55.0 61 54.0\n", "41 MS42 female group C 22.0 58.0 73 68.0\n", "42 MS43 female group B 38.0 53.0 58 65.0\n", "43 MS44 male group B 31.0 59.0 65 66.0\n", "44 MS45 female group E 28.0 NaN 56 54.0\n", "45 MS46 male group B 26.0 65.0 54 57.0\n", "46 MS47 female group A 36.0 55.0 65 62.0\n", "47 MS48 female group C 30.0 66.0 71 76.0\n", "48 MS49 female group D 40.0 57.0 74 76.0\n", "49 MS50 male group C 37.0 66.0 78 81.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#The `read_csv`, by default assumes that the file contains comma separated values, \n", "# and the first row of the file conatins names of columns, which will be taken as column labels\n", "df = pd.read_csv('datasets/classmarks.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "8a0945b2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5a190256", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d8e68f05", "metadata": {}, "source": [ "**The `df.head(N)` method is used to select/display first `N` rows, based on `position`, i.e., the integer value corresponding to the position of the row (from 0 to n-1). 
The default value of `N` is 5.**" ] }, { "cell_type": "code", "execution_count": null, "id": "5ab86207", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b06b026d", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "544e5255", "metadata": {}, "outputs": [], "source": [ "# For negative values of n, this method returns all rows except the last `n` rows, equivalent to df[:-n].\n", "# The df has a total of 50 rows, so the following will return the first 2 rows\n", "df.head(-48)" ] }, { "cell_type": "code", "execution_count": null, "id": "e8592e87", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9a20bcc0", "metadata": {}, "source": [ "**The `df.tail(N)` method is used to select/display the last `N` rows, based on `position`, i.e., the integer value corresponding to the position of the row (from 0 to n-1). The default value of `N` is 5.**" ] }, { "cell_type": "code", "execution_count": null, "id": "b3ad4272", "metadata": {}, "outputs": [], "source": [ "# The tail() method is useful for quickly verifying data after sorting or appending rows.\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": null, "id": "03574cd4", "metadata": {}, "outputs": [], "source": [ "df.tail(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "494b01d9", "metadata": {}, "outputs": [], "source": [ "# For negative values of `n`, this function returns all rows except the first `n` rows, equivalent to df[n:]\n", "# The df has a total of 50 rows, so the following will return the last 4 rows\n", "df.tail(-46)" ] }, { "cell_type": "code", "execution_count": null, "id": "5d805f4a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "aa28bae8", "metadata": {}, "source": [ "### b. Reading a CSV File having a Delimiter other than Comma\n", "- By default, `read_csv()` expects a comma as the separator. If the CSV file uses some other separator or delimiter (such as a semicolon or a tab), the values will not be split into separate columns correctly and you may even get a parsing error.\n", "- To handle this, pass the appropriate character to the `delimiter` (or `sep`) argument of the `read_csv()` method." ] }, { "cell_type": "code", "execution_count": null, "id": "f5650685", "metadata": {}, "outputs": [], "source": [ "! cat datasets/classmarkswithtab.csv" ] }, { "cell_type": "code", "execution_count": 5, "id": "4e3d8e55", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollno\\tgender\\tgroup\\tage\\tmath\\tenglish\\turdu
0MS01\\tfemale\\tgroup B\\t28.0\\t72.0\\t72\\t74.0
1MS02\\tfemale\\tgroup C\\t33.0\\t69.0\\t90\\t88.0
2MS03\\tfemale\\tgroup B\\t21.0\\t\\t95\\t93.0
3MS04\\tmale\\tgroup A\\t44.0\\t47.0\\t57\\t44.0
4MS05\\tmale\\tgroup C\\t54.0\\t76.0\\t78\\t
\n", "
" ], "text/plain": [ " rollno\\tgender\\tgroup\\tage\\tmath\\tenglish\\turdu\n", "0 MS01\\tfemale\\tgroup B\\t28.0\\t72.0\\t72\\t74.0\n", "1 MS02\\tfemale\\tgroup C\\t33.0\\t69.0\\t90\\t88.0\n", "2 MS03\\tfemale\\tgroup B\\t21.0\\t\\t95\\t93.0\n", "3 MS04\\tmale\\tgroup A\\t44.0\\t47.0\\t57\\t44.0\n", "4 MS05\\tmale\\tgroup C\\t54.0\\t76.0\\t78\\t" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('datasets/classmarkswithtab.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 13, "id": "0b427ca0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
0rollnogendergroupagemathenglishurdu
1MS01femalegroup B28.072.07274.0
2MS02femalegroup C33.069.09088.0
3MS03femalegroup B21.0NaN9593.0
4MS04malegroup A44.047.05744.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "0 rollno gender group age math english urdu\n", "1 MS01 female group B 28.0 72.0 72 74.0\n", "2 MS02 female group C 33.0 69.0 90 88.0\n", "3 MS03 female group B 21.0 NaN 95 93.0\n", "4 MS04 male group A 44.0 47.0 57 44.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('datasets/classmarkswithtab.csv', delimiter='\\t', header=None)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b4d94b7e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d7228943", "metadata": {}, "source": [ "### c. Reading a CSV File not having Column Labels\n", "- By default the `read_csv()` method assume the first row of the file will contain column labels\n", "- If this is not the case, i.e., the file do not contain column labels rather data, it will be dealt as column label\n", "- Understand this in following example" ] }, { "cell_type": "code", "execution_count": null, "id": "92b3aaaf", "metadata": {}, "outputs": [], "source": [ "! cat datasets/classmarkswithoutcollabels.csv" ] }, { "cell_type": "code", "execution_count": 7, "id": "be736d7d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MS01femalegroup B287272.174
0MS02femalegroup C33.069.09088.0
1MS03femalegroup B21.0NaN9593.0
2MS04malegroup A44.047.05744.0
3MS05malegroup C54.076.078NaN
4MS06femalegroup BNaN71.08378.0
\n", "
" ], "text/plain": [ " MS01 female group B 28 72 72.1 74\n", "0 MS02 female group C 33.0 69.0 90 88.0\n", "1 MS03 female group B 21.0 NaN 95 93.0\n", "2 MS04 male group A 44.0 47.0 57 44.0\n", "3 MS05 male group C 54.0 76.0 78 NaN\n", "4 MS06 female group B NaN 71.0 83 78.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('datasets/classmarkswithoutcollabels.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "id": "e4a78813", "metadata": {}, "source": [ "**To read such files, you have to pass the parameter `header=None` to the `read_csv()` method as shown below**" ] }, { "cell_type": "code", "execution_count": 8, "id": "8d228941", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
0MS01femalegroup B28.072.07274.0
1MS02femalegroup C33.069.09088.0
2MS03femalegroup B21.0NaN9593.0
3MS04malegroup A44.047.05744.0
4MS05malegroup C54.076.078NaN
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "0 MS01 female group B 28.0 72.0 72 74.0\n", "1 MS02 female group C 33.0 69.0 90 88.0\n", "2 MS03 female group B 21.0 NaN 95 93.0\n", "3 MS04 male group A 44.0 47.0 57 44.0\n", "4 MS05 male group C 54.0 76.0 78 NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('datasets/classmarkswithoutcollabels.csv', header=None)\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bb895f0e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b9783830", "metadata": {}, "source": [ "**Now if you want to assign new column labels to make them more understandable, you can assign the list of column labels to the `columns` attribute of the dataframe object**" ] }, { "cell_type": "code", "execution_count": 9, "id": "404229bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnogendergroupagemathenglishurdu
0MS01femalegroup B28.072.07274.0
1MS02femalegroup C33.069.09088.0
2MS03femalegroup B21.0NaN9593.0
3MS04malegroup A44.047.05744.0
4MS05malegroup C54.076.078NaN
\n", "
" ], "text/plain": [ " rollno gender group age math english urdu\n", "0 MS01 female group B 28.0 72.0 72 74.0\n", "1 MS02 female group C 33.0 69.0 90 88.0\n", "2 MS03 female group B 21.0 NaN 95 93.0\n", "3 MS04 male group A 44.0 47.0 57 44.0\n", "4 MS05 male group C 54.0 76.0 78 NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col_names = ['rollno', 'gender', 'group', 'age', 'math', 'english', 'urdu']\n", "df.columns = col_names\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "045f907c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "eedf135e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "736ff36e", "metadata": {}, "source": [ "### d. Reading a CSV File having Comments in the beginning\n", "- You may get an error while reading a CSV file because someone may have added few comments on the top of the file. In pandas we can still read the data set by skipping few rows from the top.\n", "- To deal with the ParseError, open the csv file in the text editor and check if you have some comments on the top.\n", "- If yes, then count the number of rows to skip.\n", "- While reading file, pass the parameter **skiprows = n** (number of rows in the beginninghaving comments to skip)\n", "- While reading file, pass the parameter **skipfooter = n** (number of rows at the end having comments to skip)" ] }, { "cell_type": "code", "execution_count": 10, "id": "4ca9ed6b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "These are comments,,,,,,\r", "\r\n", "These are second line of comments,,,,,,\r", "\r\n", "These are third line of comments,,,,,,\r", "\r\n", "rollno,gender,group,age,math,english,urdu\r", "\r\n", "MS01,female,group B,28,72,72,74\r", "\r\n", "MS02,female,group C,33,69,90,88\r", "\r\n", "MS03,female,group B,21,,95,93\r", "\r\n", "MS04,male,group A,44,47,57,44\r", "\r\n", "MS05,male,group C,54,76,78,\r", "\r\n", "MS06,female,group B,,71,83,78\r", "\r\n", "MS07,female,group B,47,88,95,92\r", "\r\n", "MS08,male,group B,33,40,43,39\r", "\r\n", "MS09,male,group D,27,64,64,67\r", "\r\n", "MS10,female,group B,33,38,60,50\r", "\r\n", "MS11,male,group C,46,58,54,52\r", "\r\n", "MS12,male,group D,53,40,52,43\r", "\r\n", "MS13,female,group B,25,65,81,73\r", "\r\n", "MS14,male,group A,26,78,72,70\r", "\r\n", "MS15,female,group A,25,50,53,58\r", "\r\n", "MS16,female,group C,29,69,75,78\r", "\r\n", "MS17,male,group C,22,88,89,86\r", "\r\n", "MS18,female,group B,31,18,32,28\r", "\r\n", "MS19,male,group C,33,46,42,46\r", "\r\n", "MS20,female,group C,36,54,58,\r", "\r\n", "MS21,male,group D,34,66,69,63\r", "\r\n", "MS22,female,group B,32,65,75,70\r", "\r\n", "MS23,male,group D,31,44,54,53\r", "\r\n", "MS24,female,group C,37,,73,73\r", "\r\n", "MS25,male,group D,33,74,71,80\r", "\r\n", "MS26,male,group A,39,73,74,72\r", "\r\n", "MS27,male,group B,,69,54,55\r", "\r\n", "MS28,female,group C,47,67,69,75\r", "\r\n", "MS29,male,group C,44,70,70,65\r", "\r\n", "MS30,female,group D,42,62,70,75\r", "\r\n", "MS31,female,group D,41,69,74,74\r", "\r\n", "MS32,female,group B,,63,65,61\r", "\r\n", "MS33,female,group E,29,56,72,65\r", "\r\n", "MS34,male,group D,34,40,42,38\r", "\r\n", "MS35,male,group E,45,97,87,82\r", "\r\n", "MS36,male,group E,31,81,81,79\r", "\r\n", "MS37,female,group D,28,74,81,\r", "\r\n", "MS38,female,group D,26,50,64,59\r", "\r\n", "MS39,female,group D,36,75,90,88\r", "\r\n", "MS40,male,group 
B,33,57,56,57\r", "\r\n", "MS41,male,group C,35,55,61,54\r", "\r\n", "MS42,female,group C,22,58,73,68\r", "\r\n", "MS43,female,group B,38,53,58,65\r", "\r\n", "MS44,male,group B,31,59,65,66\r", "\r\n", "MS45,female,group E,28,,56,54\r", "\r\n", "MS46,male,group B,26,65,54,57\r", "\r\n", "MS47,female,group A,36,55,65,62\r", "\r\n", "MS48,female,group C,30,66,71,76\r", "\r\n", "MS49,female,group D,40,57,74,76\r", "\r\n", "MS50,male,group C,37,66,78,81" ] } ], "source": [ "! cat datasets/classmarkswithtopcomments.csv" ] }, { "cell_type": "code", "execution_count": 11, "id": "c60f6550", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
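{ "cell_type": "markdown", "id": "c3d40e51", "metadata": {}, "source": [ "**The `skipfooter` parameter listed above is not demonstrated on this dataset, whose comments sit only at the top. A hedged sketch, assuming a hypothetical file `classmarkswithbottomcomments.csv` that ends with 2 comment lines; note that `skipfooter` requires the slower Python parsing engine:**" ] }, { "cell_type": "code", "execution_count": null, "id": "c3d40e52", "metadata": {}, "outputs": [], "source": [ "# Sketch only: 'classmarkswithbottomcomments.csv' is a hypothetical file with 2 trailing comment lines\n", "df_footer = pd.read_csv('datasets/classmarkswithbottomcomments.csv', skipfooter=2, engine='python')\n", "df_footer.tail()" ] },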
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
These are commentsUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6
0These are second line of commentsNaNNaNNaNNaNNaNNaN
1These are third line of commentsNaNNaNNaNNaNNaNNaN
2rollnogendergroupagemathenglishurdu
3MS01femalegroup B28727274
4MS02femalegroup C33699088
\n", "
" ], "text/plain": [ " These are comments Unnamed: 1 Unnamed: 2 Unnamed: 3 \\\n", "0 These are second line of comments NaN NaN NaN \n", "1 These are third line of comments NaN NaN NaN \n", "2 rollno gender group age \n", "3 MS01 female group B 28 \n", "4 MS02 female group C 33 \n", "\n", " Unnamed: 4 Unnamed: 5 Unnamed: 6 \n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 math english urdu \n", "3 72 72 74 \n", "4 69 90 88 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Try reading a csv file having 3 comments lines in the beginning.\n", "df = pd.read_csv('datasets/classmarkswithtopcomments.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "dc580e77", "metadata": {}, "outputs": [], "source": [ "# Try reading a csv file having 3 comments lines in the beginning.\n", "df = pd.read_csv('datasets/classmarkswithtopcomments.csv', skiprows=3)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "f6ad9283", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "255294ba", "metadata": {}, "source": [ "### e. Reading a portion of CSV File in a Dataframe\n", "- Suppose the dataset inside the csv file is too big and you don't want to spend that much time for reading that data\n", "- Or might be your system crashes, when you try to load that much data\n", "- Solution is read\n", " - Specific number of rows by passing `nrows` parameter to `read_csv()` method\n", " - Specific number of columns by passing `usecols` parameter to `read_csv()` method" ] }, { "cell_type": "code", "execution_count": 12, "id": "347ed4a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10, 7)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read just 10 rows from the csv file by passing the number of rows to read to `nrows` argument\n", "df = pd.read_csv('datasets/classmarks.csv', nrows=10)\n", "df.shape\n", "#df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "ac7ad23b", "metadata": {}, "outputs": [], "source": [ "# Read specific columns from the csv file by passing a list of column names to `usecols` argument\n", "df = pd.read_csv('datasets/classmarks.csv', usecols= ['rollno', 'group','english'])\n", "df.shape\n", "#df.head()" ] }, { "cell_type": "code", "execution_count": 13, "id": "08ad1ccf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7, 3)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ofcourse you can use both the parameters at the same time\n", "df = pd.read_csv('datasets/classmarks.csv', nrows= 7, usecols= ['rollno', 'group','english'])\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 14, "id": "ab2470e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnogroupenglish
0MS01group B72
1MS02group C90
2MS03group B95
3MS04group A57
4MS05group C78
5MS06group B83
6MS07group B95
\n", "
" ], "text/plain": [ " rollno group english\n", "0 MS01 group B 72\n", "1 MS02 group C 90\n", "2 MS03 group B 95\n", "3 MS04 group A 57\n", "4 MS05 group C 78\n", "5 MS06 group B 83\n", "6 MS07 group B 95" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "aee98254", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "37a24ec5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "aff77179", "metadata": {}, "source": [ "## 2. Reading a CSV File from a Remote System" ] }, { "cell_type": "code", "execution_count": 15, "id": "5140b250", "metadata": {}, "outputs": [], "source": [ "# To avoid URLError: \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamesAgeAddr
0Arif50Lahore
1Rauf52Islamabad
2Maaz27Peshawer
3Hadeed22Islamabad
4Mujahid18Karachi
\n", "" ], "text/plain": [ " Names Age Addr\n", "0 Arif 50 Lahore\n", "1 Rauf 52 Islamabad\n", "2 Maaz 27 Peshawer\n", "3 Hadeed 22 Islamabad\n", "4 Mujahid 18 Karachi" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "myurl = 'https://bit.ly/31eYGTx'\n", "df = pd.read_csv(myurl)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "055a3906", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f312831d", "metadata": {}, "source": [ "### b. Reading a CSV file from a Google Docs" ] }, { "cell_type": "markdown", "id": "278b2d6e", "metadata": {}, "source": [ "- Google Sheet URL: https://docs.google.com/spreadsheets/d/1H9ZTGVRXN3zuyP3cbJQnbX7JlqhqFNMOE_WwDP-PRTE/edit#gid=2084742287\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "60c5a6ca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnogendergroupagemathenglishurdu
0MS01femalegroup B28.072.07274.0
1MS02femalegroup C33.069.09088.0
2MS03femalegroup B21.0NaN9593.0
3MS04malegroup A44.047.05744.0
4MS05malegroup C54.076.078NaN
\n", "
" ], "text/plain": [ " rollno gender group age math english urdu\n", "0 MS01 female group B 28.0 72.0 72 74.0\n", "1 MS02 female group C 33.0 69.0 90 88.0\n", "2 MS03 female group B 21.0 NaN 95 93.0\n", "3 MS04 male group A 44.0 47.0 57 44.0\n", "4 MS05 male group C 54.0 76.0 78 NaN" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sheetID = '1H9ZTGVRXN3zuyP3cbJQnbX7JlqhqFNMOE_WwDP-PRTE'\n", "sheetName = 'sheet1'\n", "URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(sheetID, sheetName)\n", "\n", "df = pd.read_csv(URL)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 18, "id": "3043328a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://docs.google.com/spreadsheets/d/1H9ZTGVRXN3zuyP3cbJQnbX7JlqhqFNMOE_WwDP-PRTE/gviz/tq?tqx=out:csv&sheet=sheet1'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "URL" ] }, { "cell_type": "code", "execution_count": null, "id": "da021170", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3bec93fe", "metadata": {}, "source": [ "## 3. Writing Contents of Dataframe to a CSV File\n", "- The `pd.to_csv()` method is used to write the contents of a dataframe (with indices) to a CSV file.\n", "- The only required argument is the file path.\n", "- For details see help page or python documents (link given above)" ] }, { "cell_type": "code", "execution_count": null, "id": "87862682", "metadata": {}, "outputs": [], "source": [ "df_class = pd.read_csv('datasets/classmarks.csv')\n", "df_class.head(7)" ] }, { "cell_type": "markdown", "id": "176a8515", "metadata": {}, "source": [ ">- Let us create a new dataframe from above dataframe containing records of only group B" ] }, { "cell_type": "code", "execution_count": null, "id": "05d35cf1", "metadata": {}, "outputs": [], "source": [ "mask = (df_class['group'] == 'group B')\n", "mask.head(7)" ] }, { "cell_type": "code", "execution_count": null, "id": "519347c0", "metadata": {}, "outputs": [], "source": [ "df_class_groupB = df_class.loc[mask]\n", "df_class_groupB" ] }, { "cell_type": "code", "execution_count": null, "id": "9d92b6ba", "metadata": {}, "outputs": [], "source": [ "df_class_groupB.to_csv('datasets/classmarksgroupB.csv')" ] }, { "cell_type": "code", "execution_count": null, "id": "5a4da09b", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/classmarksgroupB.csv')\n", "df" ] }, { "cell_type": "markdown", "id": "6c378d68", "metadata": {}, "source": [ ">To avoid writing the row indices column inside the file pass `index=False` argument to `to_csv()` method" ] }, { "cell_type": "code", "execution_count": null, "id": "99b286a7", "metadata": {}, "outputs": [], "source": [ "df_class_groupB.to_csv('datasets/classmarksgroupB.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "e6d69405", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/classmarksgroupB.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "7eb77253", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "dc33cd11", "metadata": {}, "source": [ "## 4. I/O with EXCEL Files\n", ">**XLSX**: XLSX is a Microsoft Excel Open XML file format. It also comes under the Spreadsheet file format. It is an XML-based file format created by Microsoft Excel. In XLSX data is organized under the cells and columns in a sheet. Each XLSX file may contain one or more sheets. 
So a workbook can contain multiple sheets" ] }, { "cell_type": "code", "execution_count": null, "id": "ae3b2500", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install xlrd xlwt openpyxl" ] }, { "cell_type": "markdown", "id": "b3d1d092", "metadata": {}, "source": [ "### a. Reading a Simple Excel File" ] }, { "cell_type": "code", "execution_count": 19, "id": "3cf3d333", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnogendergroupagemathenglishurdu
0MS01femalegroup B28.072.07274.0
1MS02femalegroup C33.069.09088.0
2MS03femalegroup B21.0NaN9593.0
3MS04malegroup A44.047.05744.0
4MS05malegroup C54.076.078NaN
\n", "
" ], "text/plain": [ " rollno gender group age math english urdu\n", "0 MS01 female group B 28.0 72.0 72 74.0\n", "1 MS02 female group C 33.0 69.0 90 88.0\n", "2 MS03 female group B 21.0 NaN 95 93.0\n", "3 MS04 male group A 44.0 47.0 57 44.0\n", "4 MS05 male group C 54.0 76.0 78 NaN" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel(io='datasets/classmarks.xlsx')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "3430bd49", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a08102cc", "metadata": {}, "source": [ "### b. Reading an Excel File having Comments in the beginning\n", "- You may get an error while reading an Excel file because someone may have added few comments on the top of the file. In pandas we can still read the data set by skipping few rows from the top.\n", "- To deal with the ParseError, open the Excel file in MS EXCEL and check if you have some comments on the top.\n", "- If yes, then count the number of rows to skip.\n", "- While reading file, pass the parameter **skiprows = n** (number of rows in the beginning having comments to skip)\n", "- While reading file, pass the parameter **skipfooter = n** (number of rows at the end having comments to skip)" ] }, { "cell_type": "code", "execution_count": null, "id": "6cccdcd0", "metadata": {}, "outputs": [], "source": [ "# The following file has three lines of comments in the beginning of the file.\n", "df = pd.read_excel(io='datasets/classmarkswithcomments.xlsx',skiprows=3)\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "8edcc587", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1ff5b811", "metadata": {}, "source": [ "### c. Reading Excel Workbook with Multiple Sheets\n", "- By default `pd.read_excel()` function read only the first sheet.\n", "- What if we want to read an Excel file having multiple sheets.\n", "- The `big_mart_sales_with_multiple_sheets.xlsx` is a workbook that contains three sheets for different years data. 
The sheet names are 1985, 1987, and 1997" ] }, { "cell_type": "code", "execution_count": 20, "id": "7e857867", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1463, 12)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx')\n", "# if you check/view the data you can see, it only contains the data of first excel sheet (for the year 1985)\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 21, "id": "9b7966ff", "metadata": {}, "outputs": [], "source": [ "df_1985 = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx',sheet_name='1985')\n", "df_1987 = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx',sheet_name='1987')\n", "df_1997 = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx',sheet_name='1997')" ] }, { "cell_type": "code", "execution_count": 22, "id": "47ce9c66", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1463, 12)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1985.shape" ] }, { "cell_type": "code", "execution_count": 23, "id": "939cf540", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(932, 12)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1987.shape" ] }, { "cell_type": "code", "execution_count": 24, "id": "687a6a4c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(930, 12)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1997.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "bdcdad44", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "229e0221", "metadata": {}, "source": [ "## 5. Writing Contents of Dataframe to an EXCEL File\n", "- The `pd.to_excel()` method is used to write the contents of a dataframe (with indices) to an Excel file.\n", "- The only required argument is the file path.\n", "- For details see help page or python documents (link given above)" ] }, { "cell_type": "markdown", "id": "2b30e9d9", "metadata": {}, "source": [ ">- Let us create a new single dataframe after concatenating all the above three dataframes using `pd.concat()` method" ] }, { "cell_type": "code", "execution_count": 25, "id": "6a30fd89", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3325, 12)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_concatenated = pd.concat(objs=[df_1985, df_1987, df_1997])\n", "\n", "df_concatenated.shape" ] }, { "cell_type": "markdown", "id": "f28d1c9f", "metadata": {}, "source": [ "**Note the total number of rows in this dataframe equals to `1463+932+930 = 3325`**" ] }, { "cell_type": "code", "execution_count": null, "id": "1b40bb0f", "metadata": {}, "outputs": [], "source": [ "df_concatenated.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "a895a92a", "metadata": {}, "outputs": [], "source": [ "# you can store the concatenated data inside your dataframe into a single Excel file\n", "# You can mention the argument `index= false` for not storing row indices (0, 1,2,3,... 
in the Excel file.\n", "\n", "df_concatenated.to_excel(excel_writer='temp.xlsx', index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "ec95f340", "metadata": {}, "outputs": [], "source": [ "# Let us verify\n", "data = pd.read_excel(io='temp.xlsx')\n", "\n", "data.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "28a845bd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7f84a0f6", "metadata": {}, "source": [ "## 6. I/O with JSON Files" ] }, { "cell_type": "markdown", "id": "6219e7d6", "metadata": {}, "source": [ ">**JSON**: JavaScript Object Notation is a text-based open standard file format that uses human-readable text consisting of attribute–value pairs and arrays. It is a data interchange format that is used to store and transfer the data via Internet, primarily between a web client and a server." ] }, { "cell_type": "markdown", "id": "bc7f1cf8", "metadata": {}, "source": [ "### a. Reading a Simple JSON File" ] }, { "cell_type": "code", "execution_count": null, "id": "d59f6291", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install SQLAlchemy psycopg2-binary" ] }, { "cell_type": "code", "execution_count": 26, "id": "6900634f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{\"name\": \"Kamal\", \"age\": 12, \"grade\": \"A\"}, {\"name\": \"Hashim\", \"age\": 18, \"grade\": \"B\"}, {\"name\": \"Salman\", \"age\": 11, \"grade\": \"A\"}, {\"name\": \"Mazhar\", \"age\": 12, \"grade\": \"C\"}, {\"name\": \"Eisha\", \"age\": 13, \"grade\": \"B\"}, {\"name\": \"Farhan\", \"age\": 22, \"grade\": \"C\"}, {\"name\": \"Mohsin\", \"age\": 11, \"grade\": \"A\"}, {\"name\": \"Bilal\", \"age\": 19, \"grade\": \"A\"}, {\"name\": \"Ishaan\", \"age\": 10, \"grade\": \"D\"}, {\"name\": \"Zalaid\", \"age\": 9, \"grade\": \"B\"}]" ] } ], "source": [ "! cat datasets/simple.json" ] }, { "cell_type": "code", "execution_count": 27, "id": "f7bf34ec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegrade
0Kamal12A
1Hashim18B
2Salman11A
3Mazhar12C
4Eisha13B
5Farhan22C
6Mohsin11A
7Bilal19A
8Ishaan10D
9Zalaid9B
\n", "
" ], "text/plain": [ " name age grade\n", "0 Kamal 12 A\n", "1 Hashim 18 B\n", "2 Salman 11 A\n", "3 Mazhar 12 C\n", "4 Eisha 13 B\n", "5 Farhan 22 C\n", "6 Mohsin 11 A\n", "7 Bilal 19 A\n", "8 Ishaan 10 D\n", "9 Zalaid 9 B" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the json file using read_json method of pandas library\n", "df = pd.read_json('datasets/simple.json')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "a5bc074c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "55cdd441", "metadata": {}, "source": [ "### b. Reading JSON File having each record in a separate line\n", "- Some of the json files are written as records i.e each json line is a separate json object. For example:\n", "```\n", "{ 'name' : 'Ahsan', 'roll_no' : '100' } # line 1\n", "{ 'name' : 'Ayesha' , 'roll_no' : '101' } # line 2\n", "```" ] }, { "cell_type": "code", "execution_count": 28, "id": "673fc206", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"name\": \"Kamal\", \"age\": 12, \"grade\": \"A\"}\r\n", "{\"name\": \"Hashim\", \"age\": 18, \"grade\": \"B\"}\r\n", "{\"name\": \"Salman\", \"age\": 11, \"grade\": \"A\"}\r\n", "{\"name\": \"Mazhar\", \"age\": 12, \"grade\": \"C\"}\r\n", "{\"name\": \"Eisha\", \"age\": 13, \"grade\": \"B\"}\r\n", "{\"name\": \"Farhan\", \"age\": 22, \"grade\": \"C\"}\r\n", "{\"name\": \"Mohsin\", \"age\": 11, \"grade\": \"A\"}\r\n", "{\"name\": \"Bilal\", \"age\": 19, \"grade\": \"A\"}\r\n", "{\"name\": \"Ishaan\", \"age\": 10, \"grade\": \"D\"}\r\n", "{\"name\": \"Zalaid\", \"age\": 9, \"grade\": \"B\"}\r\n" ] } ], "source": [ "! cat datasets/simple_records.json" ] }, { "cell_type": "code", "execution_count": 30, "id": "be13ceb0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegrade
0Kamal12A
1Hashim18B
2Salman11A
3Mazhar12C
4Eisha13B
5Farhan22C
6Mohsin11A
7Bilal19A
8Ishaan10D
9Zalaid9B
\n", "
" ], "text/plain": [ " name age grade\n", "0 Kamal 12 A\n", "1 Hashim 18 B\n", "2 Salman 11 A\n", "3 Mazhar 12 C\n", "4 Eisha 13 B\n", "5 Farhan 22 C\n", "6 Mohsin 11 A\n", "7 Bilal 19 A\n", "8 Ishaan 10 D\n", "9 Zalaid 9 B" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To read such file you need to pass `lines=True` to the `read_json()` method of dataframe\n", "df = pd.read_json('datasets/simple_records.json',lines=True)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "ff918dd0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4abc6869", "metadata": {}, "source": [ "## 7. Writing Contents of Dataframe to a JSON File" ] }, { "cell_type": "code", "execution_count": null, "id": "683f9b20", "metadata": {}, "outputs": [], "source": [ "df.to_json('datasets/temp.json')" ] }, { "cell_type": "code", "execution_count": null, "id": "3a566ec5", "metadata": {}, "outputs": [], "source": [ "df = pd.read_json('datasets/temp.json')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "3d2aa6ae", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }