{
"cells": [
{
"cell_type": "markdown",
"id": "4a87b5ef",
"metadata": {},
"source": [
"--- \n",
" \n",
"\n",
"
Department of Data Science
\n",
"Course: Tools and Techniques for Data Science
\n",
"\n",
"---\n",
"Instructor: Muhammad Arif Butt, Ph.D.
"
]
},
{
"cell_type": "markdown",
"id": "ab0dc25c",
"metadata": {},
"source": [
"Lecture 3.17 (Pandas-09)
"
]
},
{
"cell_type": "markdown",
"id": "19f82705",
"metadata": {},
"source": [
"## _Aggregating and Grouping Dataframes.ipynb_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07d1d815",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "0acb63e8",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "87d7b36c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f284fcde",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "9f676117",
"metadata": {},
"source": [
"Instructor: Muhammad Arif Butt, Ph.D.
\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2a42950",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a434dac",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c736f024",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4a53918",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "36ad40d6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "12db95e1",
"metadata": {},
"source": [
"## Learning agenda of this notebook\n",
"1. Overview of Aggregation Functions and the `agg()` method\n",
" - Applying a Built-in Aggregation Function on Entire Dataframe Object\n",
" - Applying a Built-in Aggregation Function on a Series Object\n",
" - Applying a User-Defined/Lambda Function on a Series Object
\n",
"2. Computing the Minimum Temperature of each City using **hard way**
\n",
"3. Computing the Minimum Temperature of each City using **`groupby`**
\n",
"4. Practice GroupBy on Stack Overflow Survey Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db17aec4",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e54b5d6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ab0f3d7",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "63d2431e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0a2d89b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "1664cd8f",
"metadata": {},
"source": [
"## 1. Overview of Aggregation Functions and the `agg()` Method\n",
"- An aggregation function is one which takes multiple individual values and returns a result."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "298d74ce",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
" humidity | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01/01/2022 | \n",
" lahore | \n",
" 8 | \n",
" 60 | \n",
"
\n",
" \n",
" 1 | \n",
" 02/01/2022 | \n",
" lahore | \n",
" 10 | \n",
" 58 | \n",
"
\n",
" \n",
" 2 | \n",
" 03/01/2022 | \n",
" lahore | \n",
" 5 | \n",
" 51 | \n",
"
\n",
" \n",
" 3 | \n",
" 04/01/2022 | \n",
" lahore | \n",
" 6 | \n",
" 49 | \n",
"
\n",
" \n",
" 4 | \n",
" 05/01/2022 | \n",
" lahore | \n",
" 12 | \n",
" 54 | \n",
"
\n",
" \n",
" 5 | \n",
" 01/01/2022 | \n",
" karachi | \n",
" 18 | \n",
" 74 | \n",
"
\n",
" \n",
" 6 | \n",
" 02/01/2022 | \n",
" karachi | \n",
" 10 | \n",
" 71 | \n",
"
\n",
" \n",
" 7 | \n",
" 03/01/2022 | \n",
" karachi | \n",
" 12 | \n",
" 78 | \n",
"
\n",
" \n",
" 8 | \n",
" 04/01/2022 | \n",
" karachi | \n",
" 15 | \n",
" 76 | \n",
"
\n",
" \n",
" 9 | \n",
" 05/01/2022 | \n",
" karachi | \n",
" 16 | \n",
" 70 | \n",
"
\n",
" \n",
" 10 | \n",
" 01/01/2022 | \n",
" murree | \n",
" -5 | \n",
" 61 | \n",
"
\n",
" \n",
" 11 | \n",
" 02/01/2022 | \n",
" murree | \n",
" -3 | \n",
" 68 | \n",
"
\n",
" \n",
" 12 | \n",
" 03/01/2022 | \n",
" murree | \n",
" -4 | \n",
" 69 | \n",
"
\n",
" \n",
" 13 | \n",
" 04/01/2022 | \n",
" murree | \n",
" -1 | \n",
" 63 | \n",
"
\n",
" \n",
" 14 | \n",
" 05/01/2022 | \n",
" murree | \n",
" -7 | \n",
" 72 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature humidity\n",
"0 01/01/2022 lahore 8 60\n",
"1 02/01/2022 lahore 10 58\n",
"2 03/01/2022 lahore 5 51\n",
"3 04/01/2022 lahore 6 49\n",
"4 05/01/2022 lahore 12 54\n",
"5 01/01/2022 karachi 18 74\n",
"6 02/01/2022 karachi 10 71\n",
"7 03/01/2022 karachi 12 78\n",
"8 04/01/2022 karachi 15 76\n",
"9 05/01/2022 karachi 16 70\n",
"10 01/01/2022 murree -5 61\n",
"11 02/01/2022 murree -3 68\n",
"12 03/01/2022 murree -4 69\n",
"13 04/01/2022 murree -1 63\n",
"14 05/01/2022 murree -7 72"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('datasets/groupbydata2.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "71f119a8",
"metadata": {},
"source": [
"### a. Applying a Built-in Aggregation Function on Entire Dataframe Object"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3e9650a9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"date 01/01/2022\n",
"city karachi\n",
"temperature -7\n",
"humidity 49\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.min()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7e5e6958",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"date 15\n",
"city 15\n",
"temperature 15\n",
"humidity 15\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.count()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "31839411",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/1t/g3ylw8h50cjdqmk5d6jh1qmm0000gn/T/ipykernel_3651/2026153972.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.\n",
" df.median()\n"
]
},
{
"data": {
"text/plain": [
"temperature 8.0\n",
"humidity 68.0\n",
"dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Should be applied to numeric columns only, may raise a warning\n",
"df.median()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "fa0eb1f9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"temperature 8.0\n",
"humidity 68.0\n",
"dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.median(numeric_only=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c41a2003",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd9f3067",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "b29bf1c2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3655614e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "8409ebd7",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "0a80e737",
"metadata": {},
"source": [
"> We can call the `agg()` method on the dataframe to apply multiple aggregation functions at a time, by passing the `agg()` function a list of aggregation functions as strings."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "875e29de",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
" humidity | \n",
"
\n",
" \n",
" \n",
" \n",
" min | \n",
" 01/01/2022 | \n",
" karachi | \n",
" -7 | \n",
" 49 | \n",
"
\n",
" \n",
" max | \n",
" 05/01/2022 | \n",
" murree | \n",
" 18 | \n",
" 78 | \n",
"
\n",
" \n",
" count | \n",
" 15 | \n",
" 15 | \n",
" 15 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature humidity\n",
"min 01/01/2022 karachi -7 49\n",
"max 05/01/2022 murree 18 78\n",
"count 15 15 15 15"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.agg(['min', 'max', 'count'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab3e6286",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58a2e02",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "f7624ba3",
"metadata": {},
"source": [
"> We can call the `describe()` method on the dataframe to get descriptive statistical measures on all its numeric columns."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a26b0208",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" temperature | \n",
" humidity | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 15.000000 | \n",
" 15.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 6.133333 | \n",
" 64.933333 | \n",
"
\n",
" \n",
" std | \n",
" 8.253715 | \n",
" 9.153194 | \n",
"
\n",
" \n",
" min | \n",
" -7.000000 | \n",
" 49.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" -2.000000 | \n",
" 59.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 8.000000 | \n",
" 68.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 12.000000 | \n",
" 71.500000 | \n",
"
\n",
" \n",
" max | \n",
" 18.000000 | \n",
" 78.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" temperature humidity\n",
"count 15.000000 15.000000\n",
"mean 6.133333 64.933333\n",
"std 8.253715 9.153194\n",
"min -7.000000 49.000000\n",
"25% -2.000000 59.000000\n",
"50% 8.000000 68.000000\n",
"75% 12.000000 71.500000\n",
"max 18.000000 78.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98f30e56",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6ceb80e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "264cd891",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b6edfe8",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "29818e96",
"metadata": {},
"source": [
"### b. Applying a Built-in Aggregation Function on a Series Object"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "111bef4b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-7"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['temperature'].min()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "22606fba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"18"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['temperature'].max()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "709429fa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6.133333333333334"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['temperature'].mean()"
]
},
{
"cell_type": "markdown",
"id": "b131b9b5",
"metadata": {},
"source": [
"> We can call the `agg()` method on a series to apply multiple aggregation functions at a time, by passing the `agg()` function a list of aggregation functions as strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bea6bc9",
"metadata": {},
"outputs": [],
"source": [
"df['temperature'].agg(['min', 'max', 'mean', 'count'])"
]
},
{
"cell_type": "markdown",
"id": "07ad4b62",
"metadata": {},
"source": [
"> We can call the `describe()` method on the dataframe to get descriptive statistical measures on all its numeric columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80d38d82",
"metadata": {},
"outputs": [],
"source": [
"df['temperature'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e72c885",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "86a71e02",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "39a48d79",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d6f7d77",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "4d354a5b",
"metadata": {},
"source": [
"### c. Applying a User-Defined/Lambda Function on a Series Object using the `apply()` Method\n",
"- We have used this `apply()` method before as well that is used to invoke function on values of Series and return a resulting series."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8647021a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 8\n",
"1 10\n",
"2 5\n",
"3 6\n",
"4 12\n",
"5 18\n",
"6 10\n",
"7 12\n",
"8 15\n",
"9 16\n",
"10 -5\n",
"11 -3\n",
"12 -4\n",
"13 -1\n",
"14 -7\n",
"Name: temperature, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.temperature"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ffd978e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 46.4\n",
"1 50.0\n",
"2 41.0\n",
"3 42.8\n",
"4 53.6\n",
"5 64.4\n",
"6 50.0\n",
"7 53.6\n",
"8 59.0\n",
"9 60.8\n",
"10 23.0\n",
"11 26.6\n",
"12 24.8\n",
"13 30.2\n",
"14 19.4\n",
"Name: temperature, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def ctof(x):\n",
" return x*9/5+32\n",
"\n",
"df.temperature.apply(ctof)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec10b7f9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "e97d6d8f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "253bc061",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a73e2af",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4c6948e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b427ebcd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 46.4\n",
"1 50.0\n",
"2 41.0\n",
"3 42.8\n",
"4 53.6\n",
"5 64.4\n",
"6 50.0\n",
"7 53.6\n",
"8 59.0\n",
"9 60.8\n",
"10 23.0\n",
"11 26.6\n",
"12 24.8\n",
"13 30.2\n",
"14 19.4\n",
"Name: temperature, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.temperature.apply(lambda x: x*9/5+32)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7d0bbd8",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a89c671b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb0141d9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "afd6b8ba",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9851993",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "1253cf51",
"metadata": {},
"source": [
"# How to Compute the Minimum Temperature of Each City?"
]
},
{
"cell_type": "markdown",
"id": "c21a3258",
"metadata": {},
"source": [
"## 2. Doing it the Hard Way\n",
""
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "443db1c5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01/01/2022 | \n",
" lahore | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 02/01/2022 | \n",
" lahore | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 03/01/2022 | \n",
" lahore | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 04/01/2022 | \n",
" lahore | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 05/01/2022 | \n",
" lahore | \n",
" 12 | \n",
"
\n",
" \n",
" 5 | \n",
" 01/01/2022 | \n",
" karachi | \n",
" 18 | \n",
"
\n",
" \n",
" 6 | \n",
" 02/01/2022 | \n",
" karachi | \n",
" 10 | \n",
"
\n",
" \n",
" 7 | \n",
" 03/01/2022 | \n",
" karachi | \n",
" 12 | \n",
"
\n",
" \n",
" 8 | \n",
" 04/01/2022 | \n",
" karachi | \n",
" 15 | \n",
"
\n",
" \n",
" 9 | \n",
" 05/01/2022 | \n",
" karachi | \n",
" 16 | \n",
"
\n",
" \n",
" 10 | \n",
" 01/01/2022 | \n",
" murree | \n",
" -5 | \n",
"
\n",
" \n",
" 11 | \n",
" 02/01/2022 | \n",
" murree | \n",
" -3 | \n",
"
\n",
" \n",
" 12 | \n",
" 03/01/2022 | \n",
" murree | \n",
" -4 | \n",
"
\n",
" \n",
" 13 | \n",
" 04/01/2022 | \n",
" murree | \n",
" -1 | \n",
"
\n",
" \n",
" 14 | \n",
" 05/01/2022 | \n",
" murree | \n",
" -7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"0 01/01/2022 lahore 8\n",
"1 02/01/2022 lahore 10\n",
"2 03/01/2022 lahore 5\n",
"3 04/01/2022 lahore 6\n",
"4 05/01/2022 lahore 12\n",
"5 01/01/2022 karachi 18\n",
"6 02/01/2022 karachi 10\n",
"7 03/01/2022 karachi 12\n",
"8 04/01/2022 karachi 15\n",
"9 05/01/2022 karachi 16\n",
"10 01/01/2022 murree -5\n",
"11 02/01/2022 murree -3\n",
"12 03/01/2022 murree -4\n",
"13 04/01/2022 murree -1\n",
"14 05/01/2022 murree -7"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('datasets/groupbydata1.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "eb304630",
"metadata": {},
"source": [
"### a. Splitting the Dataframe\n",
"- We need to use conditional selection technique, in which we pass a Boolean mask for the appropriate city column to be selected. Can do it using two ways:\n",
" - Using `df[]` subscript operator\n",
" - Using `df.loc` method"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "8bc2737e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 5 | \n",
" 01/01/2022 | \n",
" karachi | \n",
" 18 | \n",
"
\n",
" \n",
" 6 | \n",
" 02/01/2022 | \n",
" karachi | \n",
" 10 | \n",
"
\n",
" \n",
" 7 | \n",
" 03/01/2022 | \n",
" karachi | \n",
" 12 | \n",
"
\n",
" \n",
" 8 | \n",
" 04/01/2022 | \n",
" karachi | \n",
" 15 | \n",
"
\n",
" \n",
" 9 | \n",
" 05/01/2022 | \n",
" karachi | \n",
" 16 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"5 01/01/2022 karachi 18\n",
"6 02/01/2022 karachi 10\n",
"7 03/01/2022 karachi 12\n",
"8 04/01/2022 karachi 15\n",
"9 05/01/2022 karachi 16"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['city']=='karachi']"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9d4410d9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01/01/2022 | \n",
" lahore | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 02/01/2022 | \n",
" lahore | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 03/01/2022 | \n",
" lahore | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 04/01/2022 | \n",
" lahore | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 05/01/2022 | \n",
" lahore | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"0 01/01/2022 lahore 8\n",
"1 02/01/2022 lahore 10\n",
"2 03/01/2022 lahore 5\n",
"3 04/01/2022 lahore 6\n",
"4 05/01/2022 lahore 12"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['city']=='lahore']\n",
"df.loc[df.city=='lahore', :]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "43a90d9a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 5 | \n",
" 01/01/2022 | \n",
" karachi | \n",
" 18 | \n",
"
\n",
" \n",
" 6 | \n",
" 02/01/2022 | \n",
" karachi | \n",
" 10 | \n",
"
\n",
" \n",
" 7 | \n",
" 03/01/2022 | \n",
" karachi | \n",
" 12 | \n",
"
\n",
" \n",
" 8 | \n",
" 04/01/2022 | \n",
" karachi | \n",
" 15 | \n",
"
\n",
" \n",
" 9 | \n",
" 05/01/2022 | \n",
" karachi | \n",
" 16 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"5 01/01/2022 karachi 18\n",
"6 02/01/2022 karachi 10\n",
"7 03/01/2022 karachi 12\n",
"8 04/01/2022 karachi 15\n",
"9 05/01/2022 karachi 16"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['city']=='karachi']\n",
"df.loc[df.city=='karachi', :]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "2fe134f3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" 01/01/2022 | \n",
" murree | \n",
" -5 | \n",
"
\n",
" \n",
" 11 | \n",
" 02/01/2022 | \n",
" murree | \n",
" -3 | \n",
"
\n",
" \n",
" 12 | \n",
" 03/01/2022 | \n",
" murree | \n",
" -4 | \n",
"
\n",
" \n",
" 13 | \n",
" 04/01/2022 | \n",
" murree | \n",
" -1 | \n",
"
\n",
" \n",
" 14 | \n",
" 05/01/2022 | \n",
" murree | \n",
" -7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"10 01/01/2022 murree -5\n",
"11 02/01/2022 murree -3\n",
"12 03/01/2022 murree -4\n",
"13 04/01/2022 murree -1\n",
"14 05/01/2022 murree -7"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['city']=='murree']\n",
"df.loc[df.city=='murree', :]"
]
},
{
"cell_type": "markdown",
"id": "531be356",
"metadata": {},
"source": [
">**Limitation:**\n",
">- We have to repeat this process for every city separately.\n",
">- What if there are over 100 cities in the dataset?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c11acbb",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2510868f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a552b89a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0bf46fe",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b691295",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "aafece6c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "e6a35e53",
"metadata": {},
"source": [
"### b. Applying the `min()` Function\n",
"- We need to apply the `min()` function on the temperature column of all of the above dataframes separately"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98d13252",
"metadata": {},
"outputs": [],
"source": [
"df.loc[df.city=='lahore', :].temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a9b7261",
"metadata": {},
"outputs": [],
"source": [
"df.loc[df.city=='lahore', :].temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f36327e",
"metadata": {},
"outputs": [],
"source": [
"df.loc[df.city=='karachi', :].temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e76a7dd",
"metadata": {},
"outputs": [],
"source": [
"df.loc[df.city=='murree', :].temperature.min()"
]
},
{
"cell_type": "markdown",
"id": "6573946e",
"metadata": {},
"source": [
">**Limitation:**\n",
">- We have to repeat this process for every city separately.\n",
">- What if there are over 100 cities in the dataset?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "078a3e5c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "adc56aef",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "49071bbc",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5647b406",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "84b892aa",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3bc149d5",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "8d980d9c",
"metadata": {},
"source": [
"### c. Combining the Result\n",
"- Since, we have got the minimum temperature of all the cities, we need to combine them to an appropriate series object to be used for later processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67ff4a19",
"metadata": {},
"outputs": [],
"source": [
"lhr = df.loc[df.city=='lahore', :].temperature.min()\n",
"kci = df.loc[df.city=='karachi', :].temperature.min()\n",
"murree = df.loc[df.city=='murree', :].temperature.min()\n",
"\n",
"s = pd.Series(data=[lhr, kci, murree], index=['L_min', 'K_min', 'M_min'] )\n",
"s.name= 'Min Temperatures'\n",
"s"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59a794d2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f857f7dd",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "30bc4bdc",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "8eca9f7d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "440a8b2f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c14e695f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2ba09af",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "3f7f9891",
"metadata": {},
"source": [
"# How to Compute the Minimum Temperature of Each City?"
]
},
{
"cell_type": "markdown",
"id": "62fb733d",
"metadata": {},
"source": [
"## 3. An Elegant Way\n",
""
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "08b3cae2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01/01/2022 | \n",
" lahore | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 02/01/2022 | \n",
" lahore | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 03/01/2022 | \n",
" lahore | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 04/01/2022 | \n",
" lahore | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 05/01/2022 | \n",
" lahore | \n",
" 12 | \n",
"
\n",
" \n",
" 5 | \n",
" 01/01/2022 | \n",
" karachi | \n",
" 18 | \n",
"
\n",
" \n",
" 6 | \n",
" 02/01/2022 | \n",
" karachi | \n",
" 10 | \n",
"
\n",
" \n",
" 7 | \n",
" 03/01/2022 | \n",
" karachi | \n",
" 12 | \n",
"
\n",
" \n",
" 8 | \n",
" 04/01/2022 | \n",
" karachi | \n",
" 15 | \n",
"
\n",
" \n",
" 9 | \n",
" 05/01/2022 | \n",
" karachi | \n",
" 16 | \n",
"
\n",
" \n",
" 10 | \n",
" 01/01/2022 | \n",
" murree | \n",
" -5 | \n",
"
\n",
" \n",
" 11 | \n",
" 02/01/2022 | \n",
" murree | \n",
" -3 | \n",
"
\n",
" \n",
" 12 | \n",
" 03/01/2022 | \n",
" murree | \n",
" -4 | \n",
"
\n",
" \n",
" 13 | \n",
" 04/01/2022 | \n",
" murree | \n",
" -1 | \n",
"
\n",
" \n",
" 14 | \n",
" 05/01/2022 | \n",
" murree | \n",
" -7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"0 01/01/2022 lahore 8\n",
"1 02/01/2022 lahore 10\n",
"2 03/01/2022 lahore 5\n",
"3 04/01/2022 lahore 6\n",
"4 05/01/2022 lahore 12\n",
"5 01/01/2022 karachi 18\n",
"6 02/01/2022 karachi 10\n",
"7 03/01/2022 karachi 12\n",
"8 04/01/2022 karachi 15\n",
"9 05/01/2022 karachi 16\n",
"10 01/01/2022 murree -5\n",
"11 02/01/2022 murree -3\n",
"12 03/01/2022 murree -4\n",
"13 04/01/2022 murree -1\n",
"14 05/01/2022 murree -7"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('datasets/groupbydata1.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "93f02e73",
"metadata": {},
"source": [
"### a. Step 1: Split Step\n",
"- In the split step we divide the data inside the dataframe into multiple groups\n",
"- Since we need to calculate the minimum temperature of each city, therefore, we will use `groupby()` method on the `city` column of the dataframe.\n",
"- This will result a DataFrameGroupBy object, which is an iterable containing multiple small dataframes based on the `by` argument passed to the `groupby()` method"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "b0ea5d79",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfgb = df.groupby('city')\n",
"dfgb"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52a0e61a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "46dd2666",
"metadata": {},
"source": [
">- Since this is an iterable, so let us iterate :)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "cc1b1019",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('karachi', date city temperature\n",
"5 01/01/2022 karachi 18\n",
"6 02/01/2022 karachi 10\n",
"7 03/01/2022 karachi 12\n",
"8 04/01/2022 karachi 15\n",
"9 05/01/2022 karachi 16)\n",
"('lahore', date city temperature\n",
"0 01/01/2022 lahore 8\n",
"1 02/01/2022 lahore 10\n",
"2 03/01/2022 lahore 5\n",
"3 04/01/2022 lahore 6\n",
"4 05/01/2022 lahore 12)\n",
"('murree', date city temperature\n",
"10 01/01/2022 murree -5\n",
"11 02/01/2022 murree -3\n",
"12 03/01/2022 murree -4\n",
"13 04/01/2022 murree -1\n",
"14 05/01/2022 murree -7)\n"
]
}
],
"source": [
"for mydf in dfgb:\n",
" print(mydf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bdce1713",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f00f923a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c824665",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2429cb1",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "da049f81",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "6b81eb53",
"metadata": {},
"source": [
">- To display indices of every group in the dataframe, use `groups` attribute of `DataFrameGroupBy` object.\n",
">- Returns a Dictionary object (PrettyDict) with keys as the group value and value as list of corresponding row indices"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "1de745f0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'karachi': [5, 6, 7, 8, 9], 'lahore': [0, 1, 2, 3, 4], 'murree': [10, 11, 12, 13, 14]}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfgb.groups # df.groupby('city').groups"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00e3974a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "64e34bf9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "48f40627",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "444b9ae2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fd7fc5e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "d4c2271f",
"metadata": {},
"source": [
">- To display records of a specific group, use `get_group()` method on `DataFrameGroupBy` object.\n",
">- Construct and return DataFrame from `DataFrameGroupBy` object with provided name."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "cbfa0a16",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" 01/01/2022 | \n",
" murree | \n",
" -5 | \n",
"
\n",
" \n",
" 11 | \n",
" 02/01/2022 | \n",
" murree | \n",
" -3 | \n",
"
\n",
" \n",
" 12 | \n",
" 03/01/2022 | \n",
" murree | \n",
" -4 | \n",
"
\n",
" \n",
" 13 | \n",
" 04/01/2022 | \n",
" murree | \n",
" -1 | \n",
"
\n",
" \n",
" 14 | \n",
" 05/01/2022 | \n",
" murree | \n",
" -7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"10 01/01/2022 murree -5\n",
"11 02/01/2022 murree -3\n",
"12 03/01/2022 murree -4\n",
"13 04/01/2022 murree -1\n",
"14 05/01/2022 murree -7"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display DataFrame of a specific group from groupby object by providing the specific group value\n",
"dfgb.get_group('murree') # df.groupby('city').get_group('karachi') "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a93eea5d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "22bdd270",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "06f58584",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b1f8d1c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "1c325989",
"metadata": {},
"source": [
">- To find the size of each group, use `size()` method of DataFrameGroupBy object.\n",
">- It return a series containing number of rows in each each group of the DataFrameGroupBy object as a Series"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "fb875cd9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"city\n",
"karachi 5\n",
"lahore 5\n",
"murree 5\n",
"dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfgb.size() #df.groupby('city').size()"
]
},
{
"cell_type": "markdown",
"id": "03042da0",
"metadata": {},
"source": [
"> After understanding the `groupby()` method let us move to step 2, and that is `Applying a Function`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f6f80ed",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcfc7a1d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f2df584",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3dd730b5",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac3e2b74",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2030779b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "38c9df60",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "95bc3a96",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "9519e655",
"metadata": {},
"source": [
"### b. Step 2: Apply Step\n",
"- Now second step is that we apply appropriate aggregate function on all the groups inside the DataFrameGroupBy object"
]
},
{
"cell_type": "markdown",
"id": "729e7c98",
"metadata": {},
"source": [
"**Let us first apply aggregate function on a specific column of `DataFrameGroupBy` object, which is a `SeriesGroupBy` object**"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "4b906481",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01/01/2022 | \n",
" lahore | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 02/01/2022 | \n",
" lahore | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 03/01/2022 | \n",
" lahore | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 04/01/2022 | \n",
" lahore | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 05/01/2022 | \n",
" lahore | \n",
" 12 | \n",
"
\n",
" \n",
" 5 | \n",
" 01/01/2022 | \n",
" karachi | \n",
" 18 | \n",
"
\n",
" \n",
" 6 | \n",
" 02/01/2022 | \n",
" karachi | \n",
" 10 | \n",
"
\n",
" \n",
" 7 | \n",
" 03/01/2022 | \n",
" karachi | \n",
" 12 | \n",
"
\n",
" \n",
" 8 | \n",
" 04/01/2022 | \n",
" karachi | \n",
" 15 | \n",
"
\n",
" \n",
" 9 | \n",
" 05/01/2022 | \n",
" karachi | \n",
" 16 | \n",
"
\n",
" \n",
" 10 | \n",
" 01/01/2022 | \n",
" murree | \n",
" -5 | \n",
"
\n",
" \n",
" 11 | \n",
" 02/01/2022 | \n",
" murree | \n",
" -3 | \n",
"
\n",
" \n",
" 12 | \n",
" 03/01/2022 | \n",
" murree | \n",
" -4 | \n",
"
\n",
" \n",
" 13 | \n",
" 04/01/2022 | \n",
" murree | \n",
" -1 | \n",
"
\n",
" \n",
" 14 | \n",
" 05/01/2022 | \n",
" murree | \n",
" -7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"0 01/01/2022 lahore 8\n",
"1 02/01/2022 lahore 10\n",
"2 03/01/2022 lahore 5\n",
"3 04/01/2022 lahore 6\n",
"4 05/01/2022 lahore 12\n",
"5 01/01/2022 karachi 18\n",
"6 02/01/2022 karachi 10\n",
"7 03/01/2022 karachi 12\n",
"8 04/01/2022 karachi 15\n",
"9 05/01/2022 karachi 16\n",
"10 01/01/2022 murree -5\n",
"11 02/01/2022 murree -3\n",
"12 03/01/2022 murree -4\n",
"13 04/01/2022 murree -1\n",
"14 05/01/2022 murree -7"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85bfea29",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 32,
"id": "6f3ffb6d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('city')"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "4175e21e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" city | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01/01/2022 | \n",
" lahore | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 02/01/2022 | \n",
" lahore | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 03/01/2022 | \n",
" lahore | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 04/01/2022 | \n",
" lahore | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 05/01/2022 | \n",
" lahore | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date city temperature\n",
"0 01/01/2022 lahore 8\n",
"1 02/01/2022 lahore 10\n",
"2 03/01/2022 lahore 5\n",
"3 04/01/2022 lahore 6\n",
"4 05/01/2022 lahore 12"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('city').get_group('lahore')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "c7820a7d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('city').get_group('lahore').temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "ff32ca65",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('city').get_group('lahore').temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "e1744064",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('city').get_group('karachi').temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "c4aab17c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-7"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('city').get_group('murree').temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe8e892f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f2dec82",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f275823",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "a49a9ba0",
"metadata": {},
"source": [
"### b. Step 3: Combine Step\n",
"- Now we have got minimum temperature of all the three cities, let us combine the result into a series object"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "e37d344a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"K_min 10\n",
"L_min 5\n",
"M_min -7\n",
"Name: Min Temperatures, dtype: int64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kci = df.groupby('city').get_group('karachi').temperature.min()\n",
"lhr = df.groupby('city').get_group('lahore').temperature.min()\n",
"murree = df.groupby('city').get_group('murree').temperature.min()\n",
"\n",
"s1 = pd.Series(data=[kci, lhr, murree], index=['K_min', 'L_min', 'M_min'] )\n",
"s1.name= 'Min Temperatures'\n",
"s1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db772042",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "f07d918e",
"metadata": {},
"source": [
">- **Let us perform the `apply + combine` steps in one go, by applying the `min()` function on the temperature series of all the dataframes inside the DataFrameGroupBy object.**\n",
">- **This saves us from the hassle of applying `min()` method explicitly as done above**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77db0e35",
"metadata": {},
"outputs": [],
"source": [
"df.groupby('city')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d014f0d",
"metadata": {},
"outputs": [],
"source": [
"df.groupby('city').temperature"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58604a90",
"metadata": {},
"outputs": [],
"source": [
"df.groupby('city').temperature.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "214d5d3a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f362632d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3325138",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "104e0f3c",
"metadata": {},
"source": [
">- **We can also apply `agg()` method on the temperature series of all the dataframes inside the DataFrameGroupBy object**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd047429",
"metadata": {},
"outputs": [],
"source": [
"df.groupby('city').temperature.agg(['min', 'max', 'sum', 'mean'])"
]
},
{
"cell_type": "markdown",
"id": "908e2c72",
"metadata": {},
"source": [
">-Note that we have got a dataframe this time"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab453a05",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5affd9eb",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a20735b4",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "da368894",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "842b740b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "58697cde",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "94101a3b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "b5f491f4",
"metadata": {},
"source": [
"## 4. Practice GroupBy on Stack Overflow Survey Dataset\n",
"Visit to Download Data: https://insights.stackoverflow.com/survey/"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12fdafcf",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "5f02fc6d",
"metadata": {},
"source": [
"### a. Understand the Data Set"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "2313a7da",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(88883, 9)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('datasets/so_survey_subset.csv', index_col='Respondent')\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "adb21c6b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" MainBranch | \n",
" Hobbyist | \n",
" Country | \n",
" YearsCode | \n",
" ConvertedComp | \n",
" LanguageWorkedWith | \n",
" SocialMedia | \n",
" Age | \n",
" Gender | \n",
"
\n",
" \n",
" Respondent | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" I am a student who is learning to code | \n",
" Yes | \n",
" United Kingdom | \n",
" 4 | \n",
" NaN | \n",
" HTML/CSS;Java;JavaScript;Python | \n",
" Twitter | \n",
" 14.0 | \n",
" Man | \n",
"
\n",
" \n",
" 2 | \n",
" I am a student who is learning to code | \n",
" No | \n",
" Bosnia and Herzegovina | \n",
" NaN | \n",
" NaN | \n",
" C++;HTML/CSS;Python | \n",
" Instagram | \n",
" 19.0 | \n",
" Man | \n",
"
\n",
" \n",
" 3 | \n",
" I am not primarily a developer, but I write co... | \n",
" Yes | \n",
" Thailand | \n",
" 3 | \n",
" 8820.0 | \n",
" HTML/CSS | \n",
" Reddit | \n",
" 28.0 | \n",
" Man | \n",
"
\n",
" \n",
" 4 | \n",
" I am a developer by profession | \n",
" No | \n",
" United States | \n",
" 3 | \n",
" 61000.0 | \n",
" C;C++;C#;Python;SQL | \n",
" Reddit | \n",
" 22.0 | \n",
" Man | \n",
"
\n",
" \n",
" 5 | \n",
" I am a developer by profession | \n",
" Yes | \n",
" Ukraine | \n",
" 16 | \n",
" NaN | \n",
" C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | \n",
" Facebook | \n",
" 30.0 | \n",
" Man | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" MainBranch Hobbyist \\\n",
"Respondent \n",
"1 I am a student who is learning to code Yes \n",
"2 I am a student who is learning to code No \n",
"3 I am not primarily a developer, but I write co... Yes \n",
"4 I am a developer by profession No \n",
"5 I am a developer by profession Yes \n",
"\n",
" Country YearsCode ConvertedComp \\\n",
"Respondent \n",
"1 United Kingdom 4 NaN \n",
"2 Bosnia and Herzegovina NaN NaN \n",
"3 Thailand 3 8820.0 \n",
"4 United States 3 61000.0 \n",
"5 Ukraine 16 NaN \n",
"\n",
" LanguageWorkedWith SocialMedia Age \\\n",
"Respondent \n",
"1 HTML/CSS;Java;JavaScript;Python Twitter 14.0 \n",
"2 C++;HTML/CSS;Python Instagram 19.0 \n",
"3 HTML/CSS Reddit 28.0 \n",
"4 C;C++;C#;Python;SQL Reddit 22.0 \n",
"5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA Facebook 30.0 \n",
"\n",
" Gender \n",
"Respondent \n",
"1 Man \n",
"2 Man \n",
"3 Man \n",
"4 Man \n",
"5 Man "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aed11320",
"metadata": {},
"outputs": [],
"source": [
"df.loc[df['Country']=='Pakistan', :]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46bd92ff",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "8999fd3d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 41,
"id": "600d2e20",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" QuestionText | \n",
"
\n",
" \n",
" Column | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Respondent | \n",
" Randomized respondent ID number (not in order ... | \n",
"
\n",
" \n",
" MainBranch | \n",
" Which of the following options best describes ... | \n",
"
\n",
" \n",
" Hobbyist | \n",
" Do you code as a hobby? | \n",
"
\n",
" \n",
" Country | \n",
" In which country do you currently reside? | \n",
"
\n",
" \n",
" YearsCode | \n",
" Including any education, how many years have y... | \n",
"
\n",
" \n",
" ConvertedComp | \n",
" Salary converted to annual USD salaries using ... | \n",
"
\n",
" \n",
" LanguageWorkedWith | \n",
" Which of the following programming, scripting,... | \n",
"
\n",
" \n",
" SocialMedia | \n",
" What social media site do you use the most? | \n",
"
\n",
" \n",
" Age | \n",
" What is your age (in years)? If you prefer not... | \n",
"
\n",
" \n",
" Gender | \n",
" Which of the following do you currently identi... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" QuestionText\n",
"Column \n",
"Respondent Randomized respondent ID number (not in order ...\n",
"MainBranch Which of the following options best describes ...\n",
"Hobbyist Do you code as a hobby?\n",
"Country In which country do you currently reside?\n",
"YearsCode Including any education, how many years have y...\n",
"ConvertedComp Salary converted to annual USD salaries using ...\n",
"LanguageWorkedWith Which of the following programming, scripting,...\n",
"SocialMedia What social media site do you use the most?\n",
"Age What is your age (in years)? If you prefer not...\n",
"Gender Which of the following do you currently identi..."
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"schema = pd.read_csv('datasets/so_survey_subset_schema.csv', index_col='Column')\n",
"schema"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5429f2d0",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 42,
"id": "146eb97a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"QuestionText Do you code as a hobby?\n",
"Name: Hobbyist, dtype: object"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema.loc['Hobbyist']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf6d798f",
"metadata": {},
"outputs": [],
"source": [
"df['Hobbyist']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28946f82",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "54ddb6cf",
"metadata": {},
"outputs": [],
"source": [
"schema.loc['Country']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55a8ee40",
"metadata": {},
"outputs": [],
"source": [
"df['Country']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c98c33e1",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ad3de2b",
"metadata": {},
"outputs": [],
"source": [
"schema.loc['ConvertedComp']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40df4193",
"metadata": {},
"outputs": [],
"source": [
"df['ConvertedComp']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2c3819d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f1899a1",
"metadata": {},
"outputs": [],
"source": [
"schema.loc['LanguageWorkedWith']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ef88604",
"metadata": {},
"outputs": [],
"source": [
"!cat datasets/so_survey_subset_schema.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fa61427",
"metadata": {},
"outputs": [],
"source": [
"df['LanguageWorkedWith']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbd881f9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bd2f021",
"metadata": {},
"outputs": [],
"source": [
"schema.loc['SocialMedia']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb7d90c9",
"metadata": {},
"outputs": [],
"source": [
"df['SocialMedia']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38ee282a",
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "574e3847",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "da4a81c4",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "bc9d4cfb",
"metadata": {},
"source": [
"##### Let us perform some basic statistical analysis on the Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2096bf5e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "35f48c5d",
"metadata": {},
"outputs": [],
"source": [
"# Returns the count of non-NA values for a series object.\n",
"df['Hobbyist'].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe766e4c",
"metadata": {},
"outputs": [],
"source": [
"# Returns a Series containing counts of unique rows.\n",
"df['Hobbyist'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e791cf10",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "39ba7c9b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "abac191b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "650bd519",
"metadata": {},
"outputs": [],
"source": [
"# Returns the count of non-NA values for a series object.\n",
"df['Country'].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4aaea5b1",
"metadata": {},
"outputs": [],
"source": [
"# Returns a Series containing counts of unique rows.\n",
"df['Country'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6b91f3cf",
"metadata": {},
"outputs": [],
"source": [
"# To get the count of countries whose developers participated in the survey\n",
"df['Country'].value_counts().count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c3ae775",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "4de9edd6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa4e378e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc48a1f7",
"metadata": {},
"outputs": [],
"source": [
"# Returns the count of non-NA values for a series object.\n",
"df['ConvertedComp'].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb34078c",
"metadata": {},
"outputs": [],
"source": [
"# Returns a Series containing counts of unique rows.\n",
"df['ConvertedComp'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34dde858",
"metadata": {},
"outputs": [],
"source": [
"df['ConvertedComp'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f91267c0",
"metadata": {},
"outputs": [],
"source": [
"df['ConvertedComp'].median()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f48e6608",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "755b50b5",
"metadata": {},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8b48433",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6516ee56",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "46da67ef",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fe690a2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "94e974b1",
"metadata": {},
"source": [
"Let us try answering certain Questions
"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f561cf90",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "1e75db36",
"metadata": {},
"source": [
"## Question 1: \n",
">**List the most popular SocialMedia web site for every Country**"
]
},
{
"cell_type": "markdown",
"id": "557059b6",
"metadata": {},
"source": [
"**Let us first do the easy task, and that is to list the most popular SocialMedia website of a single country (lets say Pakistan)**"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "def1798a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" MainBranch | \n",
" Hobbyist | \n",
" Country | \n",
" YearsCode | \n",
" ConvertedComp | \n",
" LanguageWorkedWith | \n",
" SocialMedia | \n",
" Age | \n",
" Gender | \n",
"
\n",
" \n",
" Respondent | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" I am a student who is learning to code | \n",
" Yes | \n",
" United Kingdom | \n",
" 4 | \n",
" NaN | \n",
" HTML/CSS;Java;JavaScript;Python | \n",
" Twitter | \n",
" 14.0 | \n",
" Man | \n",
"
\n",
" \n",
" 2 | \n",
" I am a student who is learning to code | \n",
" No | \n",
" Bosnia and Herzegovina | \n",
" NaN | \n",
" NaN | \n",
" C++;HTML/CSS;Python | \n",
" Instagram | \n",
" 19.0 | \n",
" Man | \n",
"
\n",
" \n",
" 3 | \n",
" I am not primarily a developer, but I write co... | \n",
" Yes | \n",
" Thailand | \n",
" 3 | \n",
" 8820.0 | \n",
" HTML/CSS | \n",
" Reddit | \n",
" 28.0 | \n",
" Man | \n",
"
\n",
" \n",
" 4 | \n",
" I am a developer by profession | \n",
" No | \n",
" United States | \n",
" 3 | \n",
" 61000.0 | \n",
" C;C++;C#;Python;SQL | \n",
" Reddit | \n",
" 22.0 | \n",
" Man | \n",
"
\n",
" \n",
" 5 | \n",
" I am a developer by profession | \n",
" Yes | \n",
" Ukraine | \n",
" 16 | \n",
" NaN | \n",
" C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | \n",
" Facebook | \n",
" 30.0 | \n",
" Man | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 88377 | \n",
" NaN | \n",
" Yes | \n",
" Canada | \n",
" NaN | \n",
" NaN | \n",
" HTML/CSS;JavaScript;Other(s): | \n",
" YouTube | \n",
" NaN | \n",
" Man | \n",
"
\n",
" \n",
" 88601 | \n",
" NaN | \n",
" No | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88802 | \n",
" NaN | \n",
" No | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88816 | \n",
" NaN | \n",
" No | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88863 | \n",
" NaN | \n",
" Yes | \n",
" Spain | \n",
" 8 | \n",
" NaN | \n",
" Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... | \n",
" WhatsApp | \n",
" 18.0 | \n",
" Man | \n",
"
\n",
" \n",
"
\n",
"
88883 rows × 9 columns
\n",
"
"
],
"text/plain": [
" MainBranch Hobbyist \\\n",
"Respondent \n",
"1 I am a student who is learning to code Yes \n",
"2 I am a student who is learning to code No \n",
"3 I am not primarily a developer, but I write co... Yes \n",
"4 I am a developer by profession No \n",
"5 I am a developer by profession Yes \n",
"... ... ... \n",
"88377 NaN Yes \n",
"88601 NaN No \n",
"88802 NaN No \n",
"88816 NaN No \n",
"88863 NaN Yes \n",
"\n",
" Country YearsCode ConvertedComp \\\n",
"Respondent \n",
"1 United Kingdom 4 NaN \n",
"2 Bosnia and Herzegovina NaN NaN \n",
"3 Thailand 3 8820.0 \n",
"4 United States 3 61000.0 \n",
"5 Ukraine 16 NaN \n",
"... ... ... ... \n",
"88377 Canada NaN NaN \n",
"88601 NaN NaN NaN \n",
"88802 NaN NaN NaN \n",
"88816 NaN NaN NaN \n",
"88863 Spain 8 NaN \n",
"\n",
" LanguageWorkedWith SocialMedia \\\n",
"Respondent \n",
"1 HTML/CSS;Java;JavaScript;Python Twitter \n",
"2 C++;HTML/CSS;Python Instagram \n",
"3 HTML/CSS Reddit \n",
"4 C;C++;C#;Python;SQL Reddit \n",
"5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA Facebook \n",
"... ... ... \n",
"88377 HTML/CSS;JavaScript;Other(s): YouTube \n",
"88601 NaN NaN \n",
"88802 NaN NaN \n",
"88816 NaN NaN \n",
"88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... WhatsApp \n",
"\n",
" Age Gender \n",
"Respondent \n",
"1 14.0 Man \n",
"2 19.0 Man \n",
"3 28.0 Man \n",
"4 22.0 Man \n",
"5 30.0 Man \n",
"... ... ... \n",
"88377 NaN Man \n",
"88601 NaN NaN \n",
"88802 NaN NaN \n",
"88816 NaN NaN \n",
"88863 18.0 Man \n",
"\n",
"[88883 rows x 9 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "df9b825e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"WhatsApp 266\n",
"Facebook 232\n",
"YouTube 182\n",
"LinkedIn 71\n",
"Twitter 58\n",
"Instagram 41\n",
"Reddit 28\n",
"I don't use social media 23\n",
"Snapchat 5\n",
"Hello 1\n",
"VK ВКонта́кте 1\n",
"Name: SocialMedia, dtype: int64"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d54d37cb",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "846ab59e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 44,
"id": "d65ad730",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"WeChat 微信 403\n",
"YouTube 53\n",
"Weibo 新浪微博 42\n",
"I don't use social media 27\n",
"Twitter 27\n",
"Reddit 12\n",
"LinkedIn 11\n",
"Facebook 8\n",
"Instagram 7\n",
"Youku Tudou 优酷 7\n",
"WhatsApp 3\n",
"VK ВКонта́кте 1\n",
"Name: SocialMedia, dtype: int64"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df.Country =='Pakistan', :]\n",
"df.loc[df.Country =='Pakistan', 'SocialMedia'].head(10)\n",
"df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()\n",
"df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts(normalize=True)\n",
"df.loc[df.Country =='China', 'SocialMedia'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "440d617b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cf572c2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ba1fa80",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "95b681b2",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 48,
"id": "54413fbf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country')"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "a7cf5d90",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" MainBranch | \n",
" Hobbyist | \n",
" Country | \n",
" YearsCode | \n",
" ConvertedComp | \n",
" LanguageWorkedWith | \n",
" SocialMedia | \n",
" Age | \n",
" Gender | \n",
"
\n",
" \n",
" Respondent | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 84 | \n",
" I am a developer by profession | \n",
" No | \n",
" Pakistan | \n",
" 3 | \n",
" 3468.0 | \n",
" C;C++;C#;Java;Kotlin;PHP;SQL | \n",
" WhatsApp | \n",
" 26.0 | \n",
" Man | \n",
"
\n",
" \n",
" 119 | \n",
" I am a developer by profession | \n",
" No | \n",
" Pakistan | \n",
" 10 | \n",
" NaN | \n",
" C;C++;C#;HTML/CSS;Java;JavaScript;SQL | \n",
" Facebook | \n",
" 28.0 | \n",
" Man | \n",
"
\n",
" \n",
" 298 | \n",
" I am a developer by profession | \n",
" Yes | \n",
" Pakistan | \n",
" 4 | \n",
" NaN | \n",
" HTML/CSS;JavaScript;PHP;SQL;Other(s): | \n",
" LinkedIn | \n",
" 23.0 | \n",
" Man | \n",
"
\n",
" \n",
" 299 | \n",
" I am a developer by profession | \n",
" Yes | \n",
" Pakistan | \n",
" 19 | \n",
" NaN | \n",
" Assembly;C;C++;Java;Python;SQL | \n",
" Facebook | \n",
" 25.0 | \n",
" Man | \n",
"
\n",
" \n",
" 311 | \n",
" I am a developer by profession | \n",
" No | \n",
" Pakistan | \n",
" 5 | \n",
" 2600.0 | \n",
" Assembly;C;C++;C#;HTML/CSS;Java;Python;Scala;SQL | \n",
" LinkedIn | \n",
" 24.0 | \n",
" Man | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 88862 | \n",
" I am a student who is learning to code | \n",
" Yes | \n",
" Pakistan | \n",
" 3 | \n",
" NaN | \n",
" Java | \n",
" WhatsApp | \n",
" 21.0 | \n",
" Man | \n",
"
\n",
" \n",
" 5439 | \n",
" NaN | \n",
" Yes | \n",
" Pakistan | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" Instagram | \n",
" 24.0 | \n",
" Woman | \n",
"
\n",
" \n",
" 39117 | \n",
" NaN | \n",
" Yes | \n",
" Pakistan | \n",
" 4 | \n",
" NaN | \n",
" C;C++;C#;HTML/CSS;Java;JavaScript;SQL | \n",
" WhatsApp | \n",
" 22.0 | \n",
" Man | \n",
"
\n",
" \n",
" 60066 | \n",
" NaN | \n",
" Yes | \n",
" Pakistan | \n",
" 4 | \n",
" NaN | \n",
" Assembly;C++;C#;HTML/CSS;Java;PHP;Python;SQL | \n",
" YouTube | \n",
" 20.0 | \n",
" Man | \n",
"
\n",
" \n",
" 88182 | \n",
" NaN | \n",
" Yes | \n",
" Pakistan | \n",
" 1 | \n",
" NaN | \n",
" HTML/CSS;Java;JavaScript | \n",
" Twitter | \n",
" NaN | \n",
" Man | \n",
"
\n",
" \n",
"
\n",
"
923 rows × 9 columns
\n",
"
"
],
"text/plain": [
" MainBranch Hobbyist Country \\\n",
"Respondent \n",
"84 I am a developer by profession No Pakistan \n",
"119 I am a developer by profession No Pakistan \n",
"298 I am a developer by profession Yes Pakistan \n",
"299 I am a developer by profession Yes Pakistan \n",
"311 I am a developer by profession No Pakistan \n",
"... ... ... ... \n",
"88862 I am a student who is learning to code Yes Pakistan \n",
"5439 NaN Yes Pakistan \n",
"39117 NaN Yes Pakistan \n",
"60066 NaN Yes Pakistan \n",
"88182 NaN Yes Pakistan \n",
"\n",
" YearsCode ConvertedComp \\\n",
"Respondent \n",
"84 3 3468.0 \n",
"119 10 NaN \n",
"298 4 NaN \n",
"299 19 NaN \n",
"311 5 2600.0 \n",
"... ... ... \n",
"88862 3 NaN \n",
"5439 2 NaN \n",
"39117 4 NaN \n",
"60066 4 NaN \n",
"88182 1 NaN \n",
"\n",
" LanguageWorkedWith SocialMedia \\\n",
"Respondent \n",
"84 C;C++;C#;Java;Kotlin;PHP;SQL WhatsApp \n",
"119 C;C++;C#;HTML/CSS;Java;JavaScript;SQL Facebook \n",
"298 HTML/CSS;JavaScript;PHP;SQL;Other(s): LinkedIn \n",
"299 Assembly;C;C++;Java;Python;SQL Facebook \n",
"311 Assembly;C;C++;C#;HTML/CSS;Java;Python;Scala;SQL LinkedIn \n",
"... ... ... \n",
"88862 Java WhatsApp \n",
"5439 NaN Instagram \n",
"39117 C;C++;C#;HTML/CSS;Java;JavaScript;SQL WhatsApp \n",
"60066 Assembly;C++;C#;HTML/CSS;Java;PHP;Python;SQL YouTube \n",
"88182 HTML/CSS;Java;JavaScript Twitter \n",
"\n",
" Age Gender \n",
"Respondent \n",
"84 26.0 Man \n",
"119 28.0 Man \n",
"298 23.0 Man \n",
"299 25.0 Man \n",
"311 24.0 Man \n",
"... ... ... \n",
"88862 21.0 Man \n",
"5439 24.0 Woman \n",
"39117 22.0 Man \n",
"60066 20.0 Man \n",
"88182 NaN Man \n",
"\n",
"[923 rows x 9 columns]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country').get_group(\"Pakistan\")"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "8bc1753e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Respondent\n",
"84 WhatsApp\n",
"119 Facebook\n",
"298 LinkedIn\n",
"299 Facebook\n",
"311 LinkedIn\n",
" ... \n",
"88862 WhatsApp\n",
"5439 Instagram\n",
"39117 WhatsApp\n",
"60066 YouTube\n",
"88182 Twitter\n",
"Name: SocialMedia, Length: 923, dtype: object"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country').get_group(\"Pakistan\").loc[:, 'SocialMedia']"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "caa9a03c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"WhatsApp 266\n",
"Facebook 232\n",
"YouTube 182\n",
"LinkedIn 71\n",
"Twitter 58\n",
"Instagram 41\n",
"Reddit 28\n",
"I don't use social media 23\n",
"Snapchat 5\n",
"Hello 1\n",
"VK ВКонта́кте 1\n",
"Name: SocialMedia, dtype: int64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country').get_group(\"Pakistan\").loc[:, 'SocialMedia'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "083d3d5f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country SocialMedia \n",
"Afghanistan Facebook 15\n",
" YouTube 9\n",
" I don't use social media 6\n",
" WhatsApp 4\n",
" Instagram 1\n",
" LinkedIn 1\n",
" Twitter 1\n",
"Albania WhatsApp 18\n",
" Facebook 16\n",
" Instagram 13\n",
" YouTube 10\n",
" Twitter 8\n",
" LinkedIn 7\n",
" Reddit 6\n",
" I don't use social media 4\n",
" Snapchat 1\n",
" WeChat 微信 1\n",
"Algeria YouTube 42\n",
" Facebook 41\n",
" Twitter 14\n",
" LinkedIn 9\n",
" I don't use social media 7\n",
" Instagram 7\n",
" Reddit 2\n",
" WhatsApp 2\n",
"Andorra Facebook 2\n",
" I don't use social media 2\n",
" Reddit 1\n",
" Snapchat 1\n",
" YouTube 1\n",
"Angola Twitter 2\n",
" Facebook 1\n",
" YouTube 1\n",
"Antigua and Barbuda YouTube 2\n",
" Facebook 1\n",
" Instagram 1\n",
" Twitter 1\n",
" VK ВКонта́кте 1\n",
"Argentina WhatsApp 202\n",
" Twitter 80\n",
" YouTube 62\n",
" Instagram 53\n",
" Reddit 50\n",
" Facebook 45\n",
" I don't use social media 22\n",
" LinkedIn 16\n",
"Armenia Facebook 38\n",
" YouTube 23\n",
" LinkedIn 5\n",
" Instagram 3\n",
" Reddit 3\n",
" Twitter 3\n",
" WhatsApp 3\n",
" VK ВКонта́кте 2\n",
" I don't use social media 1\n",
"Australia Reddit 407\n",
" Facebook 320\n",
" Twitter 283\n",
" YouTube 242\n",
" I don't use social media 156\n",
"Name: SocialMedia, dtype: int64"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country')['SocialMedia'].value_counts().head(60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8e521de",
"metadata": {},
"outputs": [],
"source": [
"df.groupby('Country')['SocialMedia'].value_counts().head(50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4640ef13",
"metadata": {},
"outputs": [],
"source": [
"df.groupby('Country')['SocialMedia'].value_counts().head(50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fefb673d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8f5efc5",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a98c714e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "df115e7a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "45db2fba",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "f18c4077",
"metadata": {},
"source": [
"## Question 2: \n",
">**What percentage of people in each country knows Python programming?**"
]
},
{
"cell_type": "markdown",
"id": "128cbeb8",
"metadata": {},
"source": [
"**tc** = Total count of people from each country who participated in the survey?\n",
"\n",
"**pc** = Python People: Count of people from each country who knows Python"
]
},
{
"cell_type": "markdown",
"id": "045b2e32",
"metadata": {},
"source": [
"**tc (option 1):**"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "fef318d9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" MainBranch | \n",
" Hobbyist | \n",
" Country | \n",
" YearsCode | \n",
" ConvertedComp | \n",
" LanguageWorkedWith | \n",
" SocialMedia | \n",
" Age | \n",
" Gender | \n",
"
\n",
" \n",
" Respondent | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" I am a student who is learning to code | \n",
" Yes | \n",
" United Kingdom | \n",
" 4 | \n",
" NaN | \n",
" HTML/CSS;Java;JavaScript;Python | \n",
" Twitter | \n",
" 14.0 | \n",
" Man | \n",
"
\n",
" \n",
" 2 | \n",
" I am a student who is learning to code | \n",
" No | \n",
" Bosnia and Herzegovina | \n",
" NaN | \n",
" NaN | \n",
" C++;HTML/CSS;Python | \n",
" Instagram | \n",
" 19.0 | \n",
" Man | \n",
"
\n",
" \n",
" 3 | \n",
" I am not primarily a developer, but I write co... | \n",
" Yes | \n",
" Thailand | \n",
" 3 | \n",
" 8820.0 | \n",
" HTML/CSS | \n",
" Reddit | \n",
" 28.0 | \n",
" Man | \n",
"
\n",
" \n",
" 4 | \n",
" I am a developer by profession | \n",
" No | \n",
" United States | \n",
" 3 | \n",
" 61000.0 | \n",
" C;C++;C#;Python;SQL | \n",
" Reddit | \n",
" 22.0 | \n",
" Man | \n",
"
\n",
" \n",
" 5 | \n",
" I am a developer by profession | \n",
" Yes | \n",
" Ukraine | \n",
" 16 | \n",
" NaN | \n",
" C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | \n",
" Facebook | \n",
" 30.0 | \n",
" Man | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 88377 | \n",
" NaN | \n",
" Yes | \n",
" Canada | \n",
" NaN | \n",
" NaN | \n",
" HTML/CSS;JavaScript;Other(s): | \n",
" YouTube | \n",
" NaN | \n",
" Man | \n",
"
\n",
" \n",
" 88601 | \n",
" NaN | \n",
" No | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88802 | \n",
" NaN | \n",
" No | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88816 | \n",
" NaN | \n",
" No | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88863 | \n",
" NaN | \n",
" Yes | \n",
" Spain | \n",
" 8 | \n",
" NaN | \n",
" Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... | \n",
" WhatsApp | \n",
" 18.0 | \n",
" Man | \n",
"
\n",
" \n",
"
\n",
"
88883 rows × 9 columns
\n",
"
"
],
"text/plain": [
" MainBranch Hobbyist \\\n",
"Respondent \n",
"1 I am a student who is learning to code Yes \n",
"2 I am a student who is learning to code No \n",
"3 I am not primarily a developer, but I write co... Yes \n",
"4 I am a developer by profession No \n",
"5 I am a developer by profession Yes \n",
"... ... ... \n",
"88377 NaN Yes \n",
"88601 NaN No \n",
"88802 NaN No \n",
"88816 NaN No \n",
"88863 NaN Yes \n",
"\n",
" Country YearsCode ConvertedComp \\\n",
"Respondent \n",
"1 United Kingdom 4 NaN \n",
"2 Bosnia and Herzegovina NaN NaN \n",
"3 Thailand 3 8820.0 \n",
"4 United States 3 61000.0 \n",
"5 Ukraine 16 NaN \n",
"... ... ... ... \n",
"88377 Canada NaN NaN \n",
"88601 NaN NaN NaN \n",
"88802 NaN NaN NaN \n",
"88816 NaN NaN NaN \n",
"88863 Spain 8 NaN \n",
"\n",
" LanguageWorkedWith SocialMedia \\\n",
"Respondent \n",
"1 HTML/CSS;Java;JavaScript;Python Twitter \n",
"2 C++;HTML/CSS;Python Instagram \n",
"3 HTML/CSS Reddit \n",
"4 C;C++;C#;Python;SQL Reddit \n",
"5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA Facebook \n",
"... ... ... \n",
"88377 HTML/CSS;JavaScript;Other(s): YouTube \n",
"88601 NaN NaN \n",
"88802 NaN NaN \n",
"88816 NaN NaN \n",
"88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... WhatsApp \n",
"\n",
" Age Gender \n",
"Respondent \n",
"1 14.0 Man \n",
"2 19.0 Man \n",
"3 28.0 Man \n",
"4 22.0 Man \n",
"5 30.0 Man \n",
"... ... ... \n",
"88377 NaN Man \n",
"88601 NaN NaN \n",
"88802 NaN NaN \n",
"88816 NaN NaN \n",
"88863 18.0 Man \n",
"\n",
"[88883 rows x 9 columns]"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "b61430a3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Respondent\n",
"1 United Kingdom\n",
"2 Bosnia and Herzegovina\n",
"3 Thailand\n",
"4 United States\n",
"5 Ukraine\n",
" ... \n",
"88377 Canada\n",
"88601 NaN\n",
"88802 NaN\n",
"88816 NaN\n",
"88863 Spain\n",
"Name: Country, Length: 88883, dtype: object"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[:, 'Country']"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "a61e0e18",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"United States 20949\n",
"India 9061\n",
"Germany 5866\n",
"United Kingdom 5737\n",
"Canada 3395\n",
" ... \n",
"Tonga 1\n",
"Timor-Leste 1\n",
"North Korea 1\n",
"Brunei Darussalam 1\n",
"Chad 1\n",
"Name: Total, Length: 179, dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tc = df['Country'].value_counts()\n",
"tc.name = 'Total'\n",
"tc"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00634605",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3112db44",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "36b6d5e0",
"metadata": {},
"source": [
"**tc (option 2):**"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "64cf9c86",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfgb = df.groupby('Country')\n",
"dfgb"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "b2fca694",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country')['Country']"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "9f6901aa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country \n",
"Afghanistan Afghanistan 44\n",
"Albania Albania 86\n",
"Algeria Algeria 134\n",
"Andorra Andorra 7\n",
"Angola Angola 5\n",
" ... \n",
"Venezuela, Bolivarian Republic of... Venezuela, Bolivarian Republic of... 88\n",
"Viet Nam Viet Nam 231\n",
"Yemen Yemen 19\n",
"Zambia Zambia 12\n",
"Zimbabwe Zimbabwe 39\n",
"Name: Country, Length: 179, dtype: int64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country')['Country'].apply(lambda x: x.value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acbcaa82",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "684d3e07",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "aab00ae8",
"metadata": {},
"source": [
"**pc:**"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "d0066ac5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Respondent\n",
"1 HTML/CSS;Java;JavaScript;Python\n",
"2 C++;HTML/CSS;Python\n",
"3 HTML/CSS\n",
"4 C;C++;C#;Python;SQL\n",
"5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA\n",
" ... \n",
"88377 HTML/CSS;JavaScript;Other(s):\n",
"88601 NaN\n",
"88802 NaN\n",
"88816 NaN\n",
"88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...\n",
"Name: LanguageWorkedWith, Length: 88883, dtype: object"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[:, 'LanguageWorkedWith']"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "db814708",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country')['LanguageWorkedWith']"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "61ef5d9b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Respondent\n",
"1 True\n",
"2 True\n",
"3 False\n",
"4 True\n",
"5 True\n",
" ... \n",
"88182 False\n",
"88212 True\n",
"88282 False\n",
"88377 False\n",
"88863 False\n",
"Name: LanguageWorkedWith, Length: 88751, dtype: object"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python'))"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "9b3e1b22",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan 8\n",
"Albania 23\n",
"Algeria 40\n",
"Andorra 0\n",
"Angola 2\n",
" ..\n",
"Venezuela, Bolivarian Republic of... 28\n",
"Viet Nam 78\n",
"Yemen 3\n",
"Zambia 4\n",
"Zimbabwe 14\n",
"Name: LanguageWorkedWith, Length: 179, dtype: int64"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp = df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())\n",
"pp"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "26205a53",
"metadata": {},
"outputs": [],
"source": [
"pp.name = 'Knows Python'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09886e0f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6954a136",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "ced1a4b0",
"metadata": {},
"source": [
"**Create a Dataframe of two series tc and pp**"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "989aec12",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Total | \n",
" Knows Python | \n",
"
\n",
" \n",
" \n",
" \n",
" United States | \n",
" 20949 | \n",
" 10083 | \n",
"
\n",
" \n",
" India | \n",
" 9061 | \n",
" 3105 | \n",
"
\n",
" \n",
" Germany | \n",
" 5866 | \n",
" 2451 | \n",
"
\n",
" \n",
" United Kingdom | \n",
" 5737 | \n",
" 2384 | \n",
"
\n",
" \n",
" Canada | \n",
" 3395 | \n",
" 1558 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" Tonga | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" Timor-Leste | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" North Korea | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" Brunei Darussalam | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" Chad | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
179 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Total Knows Python\n",
"United States 20949 10083\n",
"India 9061 3105\n",
"Germany 5866 2451\n",
"United Kingdom 5737 2384\n",
"Canada 3395 1558\n",
"... ... ...\n",
"Tonga 1 0\n",
"Timor-Leste 1 1\n",
"North Korea 1 0\n",
"Brunei Darussalam 1 0\n",
"Chad 1 0\n",
"\n",
"[179 rows x 2 columns]"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resultdf = pd.concat([tc, pp], axis=1)\n",
"resultdf"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "df17703c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Total 923\n",
"Knows Python 251\n",
"Name: Pakistan, dtype: int64"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resultdf.loc['Pakistan']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c670b7e9",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2891419b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "e96094bb",
"metadata": {},
"source": [
"**Percentage of people in each country knows Python?**"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "d09f7a77",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Total | \n",
" Knows Python | \n",
" Percentage | \n",
"
\n",
" \n",
" \n",
" \n",
" United States | \n",
" 20949 | \n",
" 10083 | \n",
" 48.131176 | \n",
"
\n",
" \n",
" India | \n",
" 9061 | \n",
" 3105 | \n",
" 34.267741 | \n",
"
\n",
" \n",
" Germany | \n",
" 5866 | \n",
" 2451 | \n",
" 41.783157 | \n",
"
\n",
" \n",
" United Kingdom | \n",
" 5737 | \n",
" 2384 | \n",
" 41.554820 | \n",
"
\n",
" \n",
" Canada | \n",
" 3395 | \n",
" 1558 | \n",
" 45.891016 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" Tonga | \n",
" 1 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" Timor-Leste | \n",
" 1 | \n",
" 1 | \n",
" 100.000000 | \n",
"
\n",
" \n",
" North Korea | \n",
" 1 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" Brunei Darussalam | \n",
" 1 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" Chad | \n",
" 1 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
"
\n",
"
179 rows × 3 columns
\n",
"
"
],
"text/plain": [
" Total Knows Python Percentage\n",
"United States 20949 10083 48.131176\n",
"India 9061 3105 34.267741\n",
"Germany 5866 2451 41.783157\n",
"United Kingdom 5737 2384 41.554820\n",
"Canada 3395 1558 45.891016\n",
"... ... ... ...\n",
"Tonga 1 0 0.000000\n",
"Timor-Leste 1 1 100.000000\n",
"North Korea 1 0 0.000000\n",
"Brunei Darussalam 1 0 0.000000\n",
"Chad 1 0 0.000000\n",
"\n",
"[179 rows x 3 columns]"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resultdf['Percentage'] = (resultdf['Knows Python'] / resultdf['Total']) * 100\n",
"resultdf"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7530f928",
"metadata": {},
"outputs": [],
"source": [
"resultdf.loc['Pakistan']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b72302eb",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}