{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/groda/big_data/blob/master/generate_data_with_Faker.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hwLHeIOc69kD"
   },
   "source": [
    "<a href=\"https://github.com/groda/big_data\"><div><img src=\"https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true\" align=right width=\"90\"></div></a>\n",
    "\n",
    "# Data Generation and Aggregation with Python's Faker Library and PySpark\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "Explore the capabilities of the Python Faker library (https://faker.readthedocs.io/) for dynamic data generation!\n",
    "\n",
    "Whether you're a data scientist, engineer, or analyst, this tutorial will guide you through the process of creating realistic and diverse datasets using Faker and then harnessing the distributed computing capabilities of PySpark to aggregate and analyze the generated data. Throughout this guide, you will explore effective techniques for data generation that enhance performance and optimize resource usage. Whether you're working with large datasets or simply seeking to streamline your data generation process, this tutorial offers valuable insights to elevate your skills.\n",
    "\n",
    "**Note:** This is not _synthetic_ data, as it is generated using straightforward methods and is unlikely to conform to any real-life distribution.  Still, it serves as a valuable resource for testing purposes when authentic data is unavailable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BW6yZdpy7run"
   },
   "source": [
    "# Install Faker\n",
    "\n",
    "The Python `faker` module needs to be installed. Note that on Google Colab you can use `!pip` as well as just `pip` (no exclamation mark)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:10.436469Z",
     "iopub.status.busy": "2024-10-13T21:33:10.435903Z",
     "iopub.status.idle": "2024-10-13T21:33:12.688188Z",
     "shell.execute_reply": "2024-10-13T21:33:12.687491Z"
    },
    "id": "7EOLVOTe7KnP",
    "outputId": "f768294b-b3f8-4df7-9837-47b2dd1f4cdb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting faker\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Downloading Faker-30.3.0-py3-none-any.whl.metadata (15 kB)\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: python-dateutil>=2.4 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from faker) (2.9.0.post0)\r\n",
      "Requirement already satisfied: typing-extensions in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from faker) (4.12.2)\r\n",
      "Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.4->faker) (1.16.0)\r\n",
      "Downloading Faker-30.3.0-py3-none-any.whl (1.8 MB)\r\n",
      "\u001b[?25l   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m31.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n",
      "\u001b[?25h"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing collected packages: faker\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully installed faker-30.3.0\r\n"
     ]
    }
   ],
   "source": [
    "!pip install faker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UYLa4gsrB7aS"
   },
   "source": [
    "# Generate a Pandas dataframe with fake data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yKviUSgj7nHP"
   },
   "source": [
    "Import `Faker` and set a random seed ($42$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:12.691166Z",
     "iopub.status.busy": "2024-10-13T21:33:12.690755Z",
     "iopub.status.idle": "2024-10-13T21:33:12.742178Z",
     "shell.execute_reply": "2024-10-13T21:33:12.741512Z"
    },
    "id": "7fbGCBoq69kF"
   },
   "outputs": [],
   "source": [
    "from faker import Faker\n",
    "# Set the seed value of the shared `random.Random` object\n",
    "# across all internal generators that will ever be created\n",
    "Faker.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yUEes2gl69kF"
   },
   "source": [
    "`fake` is a fake data generator with `DE_de` locale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:12.744622Z",
     "iopub.status.busy": "2024-10-13T21:33:12.744255Z",
     "iopub.status.idle": "2024-10-13T21:33:12.772887Z",
     "shell.execute_reply": "2024-10-13T21:33:12.772244Z"
    },
    "id": "Bf7UTx_r69kG"
   },
   "outputs": [],
   "source": [
    "fake = Faker('de_DE')\n",
    "fake.seed_locale('de_DE', 42)\n",
    "# Creates and seeds a unique `random.Random` object for\n",
    "# each internal generator of this `Faker` instance\n",
    "fake.seed_instance(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eZnvde79ljUB"
   },
   "source": [
    "With `fake` you can generate fake data, such as name, email, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:12.775243Z",
     "iopub.status.busy": "2024-10-13T21:33:12.774870Z",
     "iopub.status.idle": "2024-10-13T21:33:12.778427Z",
     "shell.execute_reply": "2024-10-13T21:33:12.777796Z"
    },
    "id": "81-ltoA0lp0v",
    "outputId": "c7e97d64-3184-44b3-aac1-d77008e6a0d8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A fake name: Aleksandr Weihmann\n",
      "A fake email: ioannis32@example.net\n"
     ]
    }
   ],
   "source": [
    "print(f\"A fake name: {fake.name()}\")\n",
    "print(f\"A fake email: {fake.email()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0bSXBUI469kG"
   },
   "source": [
    "Import Pandas to save data into a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:12.780603Z",
     "iopub.status.busy": "2024-10-13T21:33:12.780258Z",
     "iopub.status.idle": "2024-10-13T21:33:17.535019Z",
     "shell.execute_reply": "2024-10-13T21:33:17.534318Z"
    },
    "id": "PI_YqfM169kG"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting pandas==1.5.3\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)\r\n",
      "Requirement already satisfied: python-dateutil>=2.8.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (2.9.0.post0)\r\n",
      "Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (2024.2)\r\n",
      "Requirement already satisfied: numpy>=1.20.3 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (1.24.4)\r\n",
      "Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.3) (1.16.0)\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)\r\n",
      "\u001b[?25l   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/12.2 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.2/12.2 MB\u001b[0m \u001b[31m107.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n",
      "\u001b[?25h"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing collected packages: pandas\r\n",
      "  Attempting uninstall: pandas\r\n",
      "    Found existing installation: pandas 2.0.3\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    Uninstalling pandas-2.0.3:\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      Successfully uninstalled pandas-2.0.3\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully installed pandas-1.5.3\r\n"
     ]
    }
   ],
   "source": [
    "# true if running on Google Colab\n",
    "import sys\n",
    "IN_COLAB = 'google.colab' in sys.modules\n",
    "if not IN_COLAB:\n",
    " !pip install pandas==1.5.3\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LjCsGikw69kG"
   },
   "source": [
    "The function `create_row_faker` creates one row of fake data. Here we choose to generate a row containing the following fields:\n",
    " - `fake.name()`\n",
    " - `fake.postcode()`\n",
    " - `fake.email()`\n",
    " - `fake.country()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:17.537889Z",
     "iopub.status.busy": "2024-10-13T21:33:17.537412Z",
     "iopub.status.idle": "2024-10-13T21:33:17.542322Z",
     "shell.execute_reply": "2024-10-13T21:33:17.541794Z"
    },
    "id": "f1MiSZl069kG"
   },
   "outputs": [],
   "source": [
    "def create_row_faker(num=1):\n",
    "    fake = Faker('de_DE')\n",
    "    fake.seed_locale('de_DE', 42)\n",
    "    fake.seed_instance(42)\n",
    "    output = [{\"name\": fake.name(),\n",
    "               \"age\": fake.random_int(0, 100),\n",
    "               \"postcode\": fake.postcode(),\n",
    "               \"email\": fake.email(),\n",
    "               \"nationality\": fake.country(),\n",
    "              } for x in range(num)]\n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TeuZISIh69kH"
   },
   "source": [
    "Generate a single row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:17.544672Z",
     "iopub.status.busy": "2024-10-13T21:33:17.544292Z",
     "iopub.status.idle": "2024-10-13T21:33:17.551952Z",
     "shell.execute_reply": "2024-10-13T21:33:17.551413Z"
    },
    "id": "wXP-5uSg69kH",
    "outputId": "cb47d550-12a3-4b38-d21e-f4d976e5585f"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Aleksandr Weihmann',\n",
       "  'age': 35,\n",
       "  'postcode': '32181',\n",
       "  'email': 'bbeckmann@example.org',\n",
       "  'nationality': 'Fidschi'}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "create_row_faker()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vu3dIuuFmSU0"
   },
   "source": [
    "Generate `n=3` rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:17.554137Z",
     "iopub.status.busy": "2024-10-13T21:33:17.553754Z",
     "iopub.status.idle": "2024-10-13T21:33:17.559448Z",
     "shell.execute_reply": "2024-10-13T21:33:17.558908Z"
    },
    "id": "fS0Bv6QPmV1k",
    "outputId": "3020ff6b-ff6b-46d4-cf7b-f55b12957d14"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Aleksandr Weihmann',\n",
       "  'age': 35,\n",
       "  'postcode': '32181',\n",
       "  'email': 'bbeckmann@example.org',\n",
       "  'nationality': 'Fidschi'},\n",
       " {'name': 'Prof. Kurt Bauer B.A.',\n",
       "  'age': 91,\n",
       "  'postcode': '37940',\n",
       "  'email': 'hildaloechel@example.com',\n",
       "  'nationality': 'Guatemala'},\n",
       " {'name': 'Ekkehart Wiek-Kallert',\n",
       "  'age': 13,\n",
       "  'postcode': '61559',\n",
       "  'email': 'maja07@example.net',\n",
       "  'nationality': 'Brasilien'}]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "create_row_faker(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4zIjkEOw69kI"
   },
   "source": [
    "Generate a dataframe `df_fake` of 5000 rows using `create_row_faker`.\n",
    "\n",
    "We're using the _cell magic_ `%%time` to time the operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:17.561647Z",
     "iopub.status.busy": "2024-10-13T21:33:17.561268Z",
     "iopub.status.idle": "2024-10-13T21:33:17.847077Z",
     "shell.execute_reply": "2024-10-13T21:33:17.846450Z"
    },
    "id": "JtRWDEsT69kI",
    "outputId": "72f6326d-10cd-41d6-f1bd-4a81e6d63d18"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 276 ms, sys: 6.14 ms, total: 282 ms\n",
      "Wall time: 282 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "df_fake = pd.DataFrame(create_row_faker(5000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "meT16YZy69kI"
   },
   "source": [
    "View dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 424
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:17.849407Z",
     "iopub.status.busy": "2024-10-13T21:33:17.849015Z",
     "iopub.status.idle": "2024-10-13T21:33:17.859311Z",
     "shell.execute_reply": "2024-10-13T21:33:17.858758Z"
    },
    "id": "RJK93FxW69kI",
    "outputId": "0c2b4462-43a1-40ef-967a-dd1e9d7359dc"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>age</th>\n",
       "      <th>postcode</th>\n",
       "      <th>email</th>\n",
       "      <th>nationality</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Aleksandr Weihmann</td>\n",
       "      <td>35</td>\n",
       "      <td>32181</td>\n",
       "      <td>bbeckmann@example.org</td>\n",
       "      <td>Fidschi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Prof. Kurt Bauer B.A.</td>\n",
       "      <td>91</td>\n",
       "      <td>37940</td>\n",
       "      <td>hildaloechel@example.com</td>\n",
       "      <td>Guatemala</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Ekkehart Wiek-Kallert</td>\n",
       "      <td>13</td>\n",
       "      <td>61559</td>\n",
       "      <td>maja07@example.net</td>\n",
       "      <td>Brasilien</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Annelise Rohleder-Hornig</td>\n",
       "      <td>80</td>\n",
       "      <td>93103</td>\n",
       "      <td>daniel31@example.com</td>\n",
       "      <td>Guatemala</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Magrit Knappe B.A.</td>\n",
       "      <td>47</td>\n",
       "      <td>34192</td>\n",
       "      <td>gottliebmisicher@example.com</td>\n",
       "      <td>Guadeloupe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4995</th>\n",
       "      <td>Hanno Jopich-Rädel</td>\n",
       "      <td>99</td>\n",
       "      <td>13333</td>\n",
       "      <td>keudelstanislaus@example.org</td>\n",
       "      <td>Syrien</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4996</th>\n",
       "      <td>Herr Arno Ebert B.A.</td>\n",
       "      <td>63</td>\n",
       "      <td>36790</td>\n",
       "      <td>josefaebert@example.org</td>\n",
       "      <td>Slowenien</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4997</th>\n",
       "      <td>Miroslawa Schüler</td>\n",
       "      <td>22</td>\n",
       "      <td>11118</td>\n",
       "      <td>ruppersbergerbetina@example.org</td>\n",
       "      <td>Republik Moldau</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4998</th>\n",
       "      <td>Janusz Nerger</td>\n",
       "      <td>74</td>\n",
       "      <td>33091</td>\n",
       "      <td>ann-kathrinseip@example.net</td>\n",
       "      <td>Belarus</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4999</th>\n",
       "      <td>Frau Cathleen Bähr</td>\n",
       "      <td>97</td>\n",
       "      <td>89681</td>\n",
       "      <td>hethurhubertus@example.org</td>\n",
       "      <td>St. Barthélemy</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5000 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                          name  age postcode                            email  \\\n",
       "0           Aleksandr Weihmann   35    32181            bbeckmann@example.org   \n",
       "1        Prof. Kurt Bauer B.A.   91    37940         hildaloechel@example.com   \n",
       "2        Ekkehart Wiek-Kallert   13    61559               maja07@example.net   \n",
       "3     Annelise Rohleder-Hornig   80    93103             daniel31@example.com   \n",
       "4           Magrit Knappe B.A.   47    34192     gottliebmisicher@example.com   \n",
       "...                        ...  ...      ...                              ...   \n",
       "4995        Hanno Jopich-Rädel   99    13333     keudelstanislaus@example.org   \n",
       "4996      Herr Arno Ebert B.A.   63    36790          josefaebert@example.org   \n",
       "4997         Miroslawa Schüler   22    11118  ruppersbergerbetina@example.org   \n",
       "4998             Janusz Nerger   74    33091      ann-kathrinseip@example.net   \n",
       "4999        Frau Cathleen Bähr   97    89681       hethurhubertus@example.org   \n",
       "\n",
       "          nationality  \n",
       "0             Fidschi  \n",
       "1           Guatemala  \n",
       "2           Brasilien  \n",
       "3           Guatemala  \n",
       "4          Guadeloupe  \n",
       "...               ...  \n",
       "4995           Syrien  \n",
       "4996        Slowenien  \n",
       "4997  Republik Moldau  \n",
       "4998          Belarus  \n",
       "4999   St. Barthélemy  \n",
       "\n",
       "[5000 rows x 5 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "86GX2vpJ69kI"
   },
   "source": [
    "For more fake data generators see Faker's [standard providers](https://faker.readthedocs.io/en/master/providers.html#standard-providers) as well as [community providers](https://faker.readthedocs.io/en/master/communityproviders.html#community-providers)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6Yv5FGq0CH6D"
   },
   "source": [
    "# Generate PySpark dataframe with fake data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qw0OUfBE8Aom"
   },
   "source": [
    "Install PySpark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:17.861561Z",
     "iopub.status.busy": "2024-10-13T21:33:17.861168Z",
     "iopub.status.idle": "2024-10-13T21:33:43.234453Z",
     "shell.execute_reply": "2024-10-13T21:33:43.233743Z"
    },
    "id": "PRKNcWR87h7P",
    "outputId": "2f039839-b528-4c88-bd38-c13c938b3202"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting pyspark\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Downloading pyspark-3.5.3.tar.gz (317.3 MB)\r\n",
      "\u001b[?25l     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/317.3 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m19.1/317.3 MB\u001b[0m \u001b[31m126.1 MB/s\u001b[0m eta \u001b[36m0:00:03\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.5/317.3 MB\u001b[0m \u001b[31m151.3 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m87.3/317.3 MB\u001b[0m \u001b[31m157.3 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m122.2/317.3 MB\u001b[0m \u001b[31m161.3 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m157.0/317.3 MB\u001b[0m \u001b[31m163.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━\u001b[0m \u001b[32m191.6/317.3 MB\u001b[0m \u001b[31m164.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━\u001b[0m \u001b[32m226.5/317.3 MB\u001b[0m \u001b[31m166.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━\u001b[0m \u001b[32m260.8/317.3 MB\u001b[0m \u001b[31m166.4 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m296.0/317.3 MB\u001b[0m \u001b[31m171.4 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m317.2/317.3 MB\u001b[0m \u001b[31m171.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m317.2/317.3 MB\u001b[0m \u001b[31m171.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m317.2/317.3 MB\u001b[0m \u001b[31m171.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m317.2/317.3 MB\u001b[0m \u001b[31m171.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m317.2/317.3 MB\u001b[0m \u001b[31m171.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m317.3/317.3 MB\u001b[0m \u001b[31m107.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n",
      "\u001b[?25h"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Installing build dependencies ... \u001b[?25l-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \bdone\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[?25h  Getting requirements to build wheel ... \u001b[?25l-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \bdone\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[?25h  Preparing metadata (pyproject.toml) ... \u001b[?25l-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \bdone\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[?25hCollecting py4j==0.10.9.7 (from pyspark)\r\n",
      "  Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)\r\n",
      "Building wheels for collected packages: pyspark\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Building wheel for pyspark (pyproject.toml) ... \u001b[?25l-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b/"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b-"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\b \b\\\b \bdone\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[?25h  Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840626 sha256=c569321f8b984ed317b59e5511ef6bc2d2e0ca78caa0b3fc06c77628726af6b5\r\n",
      "  Stored in directory: /home/runner/.cache/pip/wheels/94/3e/42/5eee4ed6246b61022f0335dcf22bb1a4a3915c45c0135cdc6f\r\n",
      "Successfully built pyspark\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing collected packages: py4j, pyspark\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully installed py4j-0.10.9.7 pyspark-3.5.3\r\n"
     ]
    }
   ],
   "source": [
    "!pip install pyspark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:43.237214Z",
     "iopub.status.busy": "2024-10-13T21:33:43.236802Z",
     "iopub.status.idle": "2024-10-13T21:33:46.354936Z",
     "shell.execute_reply": "2024-10-13T21:33:46.354092Z"
    },
    "id": "9drMVtgJ69kI"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
      "24/10/13 21:33:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession \\\n",
    "    .builder \\\n",
    "    .appName(\"Faker demo\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:46.357629Z",
     "iopub.status.busy": "2024-10-13T21:33:46.357230Z",
     "iopub.status.idle": "2024-10-13T21:33:48.306363Z",
     "shell.execute_reply": "2024-10-13T21:33:48.305484Z"
    },
    "id": "SYIUxAj569kI"
   },
   "outputs": [],
   "source": [
    "df = spark.createDataFrame(create_row_faker(5000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QbBAHBLF69kJ"
   },
   "source": [
    "To avoid getting the warning, either use [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row) and let Spark infer datatypes or create a schema for the dataframe specifying the datatypes of all fields (here's the list of all [datatypes](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=types#module-pyspark.sql.types))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:48.310042Z",
     "iopub.status.busy": "2024-10-13T21:33:48.309469Z",
     "iopub.status.idle": "2024-10-13T21:33:48.314619Z",
     "shell.execute_reply": "2024-10-13T21:33:48.313841Z"
    },
    "id": "qwaQQ2Ud69kJ"
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.types import *\n",
    "schema = StructType([StructField('name', StringType()),\n",
    "                     StructField('age',IntegerType()),\n",
    "                     StructField('postcode',StringType()),\n",
    "                     StructField('email', StringType()),\n",
    "                     StructField('nationality',StringType())])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:48.317122Z",
     "iopub.status.busy": "2024-10-13T21:33:48.316854Z",
     "iopub.status.idle": "2024-10-13T21:33:48.720797Z",
     "shell.execute_reply": "2024-10-13T21:33:48.720084Z"
    },
    "id": "NM4OqVld69kJ"
   },
   "outputs": [],
   "source": [
    "df = spark.createDataFrame(create_row_faker(5000), schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:48.723478Z",
     "iopub.status.busy": "2024-10-13T21:33:48.723069Z",
     "iopub.status.idle": "2024-10-13T21:33:48.740009Z",
     "shell.execute_reply": "2024-10-13T21:33:48.739411Z"
    },
    "id": "NDPqZqnl69kJ",
    "outputId": "92fc7f4b-16ae-4177-a17e-50df27431450"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- name: string (nullable = true)\n",
      " |-- age: integer (nullable = true)\n",
      " |-- postcode: string (nullable = true)\n",
      " |-- email: string (nullable = true)\n",
      " |-- nationality: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HwsIhJxO69kJ"
   },
   "source": [
    "Let's generate some more data (dataframe with $5\\cdot10^4$ rows). The file will be partitioned by Spark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:48.742330Z",
     "iopub.status.busy": "2024-10-13T21:33:48.741935Z",
     "iopub.status.idle": "2024-10-13T21:33:51.788787Z",
     "shell.execute_reply": "2024-10-13T21:33:51.788072Z"
    },
    "id": "KZtRuVeR69kJ",
    "outputId": "391fe926-80ac-4e72-8a3a-a8ba061bf415"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3.01 s, sys: 8.67 ms, total: 3.02 s\n",
      "Wall time: 3.04 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "n = 5*10**4\n",
    "df = spark.createDataFrame(create_row_faker(n), schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:51.791294Z",
     "iopub.status.busy": "2024-10-13T21:33:51.790893Z",
     "iopub.status.idle": "2024-10-13T21:33:53.989183Z",
     "shell.execute_reply": "2024-10-13T21:33:53.988393Z"
    },
    "id": "QvtXJzvZkzbC",
    "outputId": "0158df1a-55e6-4b71-ba2e-46a90f966717"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 0:>                                                          (0 + 1) / 1]\r"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------------------------+---+--------+----------------------------+----------------------+\n",
      "|name                             |age|postcode|email                       |nationality           |\n",
      "+---------------------------------+---+--------+----------------------------+----------------------+\n",
      "|Aleksandr Weihmann               |35 |32181   |bbeckmann@example.org       |Fidschi               |\n",
      "|Prof. Kurt Bauer B.A.            |91 |37940   |hildaloechel@example.com    |Guatemala             |\n",
      "|Ekkehart Wiek-Kallert            |13 |61559   |maja07@example.net          |Brasilien             |\n",
      "|Annelise Rohleder-Hornig         |80 |93103   |daniel31@example.com        |Guatemala             |\n",
      "|Magrit Knappe B.A.               |47 |34192   |gottliebmisicher@example.com|Guadeloupe            |\n",
      "|Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413   |heini76@example.net         |Litauen               |\n",
      "|Franjo Etzold-Hentschel          |95 |96965   |frederikpechel@example.com  |Belize                |\n",
      "|Steffen Dörschner                |19 |69166   |qraedel@example.net         |Tunesien              |\n",
      "|Milos Ullmann                    |14 |51462   |uadler@example.net          |Griechenland          |\n",
      "|Prof. Urban Döring               |80 |89325   |augustewulff@example.net    |Vereinigtes Königreich|\n",
      "+---------------------------------+---+--------+----------------------------+----------------------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(10, truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "l0ZiQJKp69kJ"
   },
   "source": [
    "It took a long time (~4 sec. for 50000 rows)!\n",
    "\n",
    "Can we do better?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6xI7IjQ669kK"
   },
   "source": [
    "The function `create_row_faker()` returns a list. This is not efficient, what we need is a _generator_ instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:53.992360Z",
     "iopub.status.busy": "2024-10-13T21:33:53.992074Z",
     "iopub.status.idle": "2024-10-13T21:33:54.001002Z",
     "shell.execute_reply": "2024-10-13T21:33:54.000222Z"
    },
    "id": "Bt2HncFZ69kK",
    "outputId": "24208475-bf91-401d-8a5d-81dbb28d807e"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = create_row_faker(5)\n",
    "# what type is d?\n",
    "type(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EC_IgTtY69kK"
   },
   "source": [
    "Let us turn `d` into a generator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:54.003872Z",
     "iopub.status.busy": "2024-10-13T21:33:54.003385Z",
     "iopub.status.idle": "2024-10-13T21:33:54.009952Z",
     "shell.execute_reply": "2024-10-13T21:33:54.009210Z"
    },
    "id": "9YjvEOTV69kK",
    "outputId": "3f3ecb0a-bb0a-4a1b-a128-1cfe661b1be0"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "generator"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = ({\"name\": fake.name(),\n",
    "      \"age\": fake.random_int(0, 100),\n",
    "      \"postcode\": fake.postcode(),\n",
    "      \"email\": fake.email(),\n",
    "      \"nationality\": fake.country()} for i in range(5))\n",
    "# what type is d?\n",
    "type(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:54.012771Z",
     "iopub.status.busy": "2024-10-13T21:33:54.012194Z",
     "iopub.status.idle": "2024-10-13T21:33:57.168001Z",
     "shell.execute_reply": "2024-10-13T21:33:57.167101Z"
    },
    "id": "BBdz-PJz69kK",
    "outputId": "5099dfc3-16bd-4e50-c009-f9a617622062"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3.1 s, sys: 25.7 ms, total: 3.13 s\n",
      "Wall time: 3.15 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "n = 5*10**4\n",
    "fake = Faker('de_DE')\n",
    "fake.seed_locale('de_DE', 42)\n",
    "fake.seed_instance(42)\n",
    "d = ({\"name\": fake.name(),\n",
    "      \"age\": fake.random_int(0, 100),\n",
    "      \"postcode\": fake.postcode(),\n",
    "      \"email\": fake.email(),\n",
    "      \"nationality\": fake.country()}\n",
    "     for i in range(n))\n",
    "df = spark.createDataFrame(d, schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.170954Z",
     "iopub.status.busy": "2024-10-13T21:33:57.170693Z",
     "iopub.status.idle": "2024-10-13T21:33:57.291327Z",
     "shell.execute_reply": "2024-10-13T21:33:57.290584Z"
    },
    "id": "tRnJDPSGlBBO",
    "outputId": "aa6d7ff7-9228-4c80-c1f8-65751d189f4e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------------------------+---+--------+----------------------------+----------------------+\n",
      "|name                             |age|postcode|email                       |nationality           |\n",
      "+---------------------------------+---+--------+----------------------------+----------------------+\n",
      "|Aleksandr Weihmann               |35 |32181   |bbeckmann@example.org       |Fidschi               |\n",
      "|Prof. Kurt Bauer B.A.            |91 |37940   |hildaloechel@example.com    |Guatemala             |\n",
      "|Ekkehart Wiek-Kallert            |13 |61559   |maja07@example.net          |Brasilien             |\n",
      "|Annelise Rohleder-Hornig         |80 |93103   |daniel31@example.com        |Guatemala             |\n",
      "|Magrit Knappe B.A.               |47 |34192   |gottliebmisicher@example.com|Guadeloupe            |\n",
      "|Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413   |heini76@example.net         |Litauen               |\n",
      "|Franjo Etzold-Hentschel          |95 |96965   |frederikpechel@example.com  |Belize                |\n",
      "|Steffen Dörschner                |19 |69166   |qraedel@example.net         |Tunesien              |\n",
      "|Milos Ullmann                    |14 |51462   |uadler@example.net          |Griechenland          |\n",
      "|Prof. Urban Döring               |80 |89325   |augustewulff@example.net    |Vereinigtes Königreich|\n",
      "+---------------------------------+---+--------+----------------------------+----------------------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(10, truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bd-jNkqk69kK"
   },
   "source": [
    "This wasn't faster.\n",
    "\n",
    "Let us look at how one can leverage Hadoop's parallelism to generate dataframes and speed up the process."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LhL22_84gF5D"
   },
   "source": [
    "## A more efficient way to generate a large amount of records\n",
    "\n",
    "We are going to use Spark's RDD and the function `parallelize`. In order to do this, we are going to need to extract the Spark _context_ from the current session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 197
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.294204Z",
     "iopub.status.busy": "2024-10-13T21:33:57.293752Z",
     "iopub.status.idle": "2024-10-13T21:33:57.301104Z",
     "shell.execute_reply": "2024-10-13T21:33:57.300536Z"
    },
    "id": "zoe9UgSOdVwQ",
    "outputId": "ffdc2700-e9e4-4bd7-bd1e-0c09c3dee416"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <div>\n",
       "            <p><b>SparkContext</b></p>\n",
       "\n",
       "            <p><a href=\"http://fv-az732-973:4040\">Spark UI</a></p>\n",
       "\n",
       "            <dl>\n",
       "              <dt>Version</dt>\n",
       "                <dd><code>v3.5.3</code></dd>\n",
       "              <dt>Master</dt>\n",
       "                <dd><code>local[*]</code></dd>\n",
       "              <dt>AppName</dt>\n",
       "                <dd><code>Faker demo</code></dd>\n",
       "            </dl>\n",
       "        </div>\n",
       "        "
      ],
      "text/plain": [
       "<SparkContext master=local[*] appName=Faker demo>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sc = spark.sparkContext\n",
    "sc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N-AIFOUWiJvv"
   },
   "source": [
    "In order to decide on the number of partitions, we are going to look at the number of (virtual) CPU's on the local machine. If you have a cluster you can have a larger number of CPUs across multiple nodes but this is not the case here.\n",
    "\n",
    "The standard Google Colab virtual machine has $2$ virtual CPUs (one CPU with two threads), so that is the maximum parallelization that you can achieve.\n",
    "\n",
    "**Note:**\n",
    "\n",
    "CPUs = threads per core × cores per socket × sockets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.303336Z",
     "iopub.status.busy": "2024-10-13T21:33:57.303125Z",
     "iopub.status.idle": "2024-10-13T21:33:57.436063Z",
     "shell.execute_reply": "2024-10-13T21:33:57.435239Z"
    },
    "id": "Z6-ygv_lg8hd",
    "outputId": "94239f4e-7469-4093-f898-a87e865a2d4e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU(s):                             4\r\n",
      "Thread(s) per core:                 2\r\n",
      "Core(s) per socket:                 2\r\n",
      "Socket(s):                          1\r\n"
     ]
    }
   ],
   "source": [
    "!lscpu | grep -E '^Thread|^Core|^Socket|^CPU\\('"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6b1aPJRxoug7"
   },
   "source": [
    "Due to the limited number of CPUs on this machine, we'll use only $2$ partitions. Even so, the data generation timing has improved dramatically!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.438818Z",
     "iopub.status.busy": "2024-10-13T21:33:57.438267Z",
     "iopub.status.idle": "2024-10-13T21:33:57.653914Z",
     "shell.execute_reply": "2024-10-13T21:33:57.653060Z"
    },
    "id": "A9Oz8rLpdxmx",
    "outputId": "c5ee39db-2bd2-4124-ca2f-61cda50a4b0a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 10.7 ms, sys: 0 ns, total: 10.7 ms\n",
      "Wall time: 208 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "n = 5*10**4\n",
    "num_partitions = 2\n",
    "# Create an empty RDD with the specified number of partitions\n",
    "empty_rdd = sc.parallelize(range(num_partitions), num_partitions)\n",
    "# Define a function that will run on each partition to generate the fake data\n",
    "def generate_fake_data(_):\n",
    "    fake = Faker('de_DE') # Create a new Faker instance per partition\n",
    "    fake.seed_locale('de_DE', 42)\n",
    "    fake.seed_instance(42)\n",
    "    for _ in range(n // num_partitions):  # Divide work across partitions\n",
    "        yield {\n",
    "            \"name\": fake.name(),\n",
    "            \"age\": fake.random_int(0, 100),\n",
    "            \"postcode\": fake.postcode(),\n",
    "            \"email\": fake.email(),\n",
    "            \"nationality\": fake.country()\n",
    "        }\n",
    "\n",
    "# Use mapPartitions to generate fake data for each partition\n",
    "rdd = empty_rdd.mapPartitions(generate_fake_data)\n",
    "# Convert the RDD to a DataFrame\n",
    "df = rdd.toDF()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FieKeZmPnUMb"
   },
   "source": [
    "I'm convinced that the reason everyone always looks at the first $5$ rows in Spark's RDDs is an homage to the classic jazz piece 🎷🎶."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.657119Z",
     "iopub.status.busy": "2024-10-13T21:33:57.656842Z",
     "iopub.status.idle": "2024-10-13T21:33:57.740762Z",
     "shell.execute_reply": "2024-10-13T21:33:57.739928Z"
    },
    "id": "ye4IU7uqe8D8",
    "outputId": "5de616a3-2a41-4e28-f074-00b148d01e32"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Aleksandr Weihmann',\n",
       "  'age': 35,\n",
       "  'postcode': '32181',\n",
       "  'email': 'bbeckmann@example.org',\n",
       "  'nationality': 'Fidschi'},\n",
       " {'name': 'Prof. Kurt Bauer B.A.',\n",
       "  'age': 91,\n",
       "  'postcode': '37940',\n",
       "  'email': 'hildaloechel@example.com',\n",
       "  'nationality': 'Guatemala'},\n",
       " {'name': 'Ekkehart Wiek-Kallert',\n",
       "  'age': 13,\n",
       "  'postcode': '61559',\n",
       "  'email': 'maja07@example.net',\n",
       "  'nationality': 'Brasilien'},\n",
       " {'name': 'Annelise Rohleder-Hornig',\n",
       "  'age': 80,\n",
       "  'postcode': '93103',\n",
       "  'email': 'daniel31@example.com',\n",
       "  'nationality': 'Guatemala'},\n",
       " {'name': 'Magrit Knappe B.A.',\n",
       "  'age': 47,\n",
       "  'postcode': '34192',\n",
       "  'email': 'gottliebmisicher@example.com',\n",
       "  'nationality': 'Guadeloupe'}]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rdd.take(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.744932Z",
     "iopub.status.busy": "2024-10-13T21:33:57.744465Z",
     "iopub.status.idle": "2024-10-13T21:33:57.973580Z",
     "shell.execute_reply": "2024-10-13T21:33:57.972762Z"
    },
    "id": "Bd-pFrpffVxP",
    "outputId": "ba92981f-afa6-4d5e-dbd4-a3558aa48916"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+--------------------+--------------------+--------------------+--------+\n",
      "|age|               email|                name|         nationality|postcode|\n",
      "+---+--------------------+--------------------+--------------------+--------+\n",
      "| 35|bbeckmann@example...|  Aleksandr Weihmann|             Fidschi|   32181|\n",
      "| 91|hildaloechel@exam...|Prof. Kurt Bauer ...|           Guatemala|   37940|\n",
      "| 13|  maja07@example.net|Ekkehart Wiek-Kal...|           Brasilien|   61559|\n",
      "| 80|daniel31@example.com|Annelise Rohleder...|           Guatemala|   93103|\n",
      "| 47|gottliebmisicher@...|  Magrit Knappe B.A.|          Guadeloupe|   34192|\n",
      "| 29| heini76@example.net|Univ.Prof. Gotthi...|             Litauen|   56413|\n",
      "| 95|frederikpechel@ex...|Franjo Etzold-Hen...|              Belize|   96965|\n",
      "| 19| qraedel@example.net|   Steffen Dörschner|            Tunesien|   69166|\n",
      "| 14|  uadler@example.net|       Milos Ullmann|        Griechenland|   51462|\n",
      "| 80|augustewulff@exam...|  Prof. Urban Döring|Vereinigtes König...|   89325|\n",
      "| 62|polinarosenow@exa...|    Krzysztof Junitz|             Belarus|   15430|\n",
      "| 16|   ihahn@example.net|Frau Zita Wesack ...|               Samoa|   82489|\n",
      "| 85|carlokambs@exampl...|    Olaf Jockel MBA.|      Nordmazedonien|   78713|\n",
      "| 90|karl-heinrichstau...|Prof. Emil Albers...|      Falklandinseln|   31051|\n",
      "| 60|  bklapp@example.com|Otfried Rudolph-Rust|          Madagaskar|   76311|\n",
      "| 82|hans-hermannreisi...|      Michail Söding|           Bulgarien|   06513|\n",
      "| 31|davidssusan@examp...|   Dr. Erna Misicher|       Côte d’Ivoire|   78108|\n",
      "| 51| bhoerle@example.net|Dipl.-Ing. Jana H...|    Äußeres Ozeanien|   26064|\n",
      "|  7|bjoernpechel@exam...|Dr. Cordula Hübel...| Trinidad und Tobago|   50097|\n",
      "| 86|scholldarius@exam...|Herr Konstantinos...|      Republik Korea|   61939|\n",
      "+---+--------------------+--------------------+--------------------+--------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "t_NNqOXPCbj3"
   },
   "source": [
    "# Filter and aggregate with PySpark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "c5X9Bz7g69kL"
   },
   "source": [
    "Show the first five records in the dataframe of fake data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:57.976956Z",
     "iopub.status.busy": "2024-10-13T21:33:57.976412Z",
     "iopub.status.idle": "2024-10-13T21:33:58.271821Z",
     "shell.execute_reply": "2024-10-13T21:33:58.271095Z"
    },
    "id": "0RYAfYsx69kL",
    "outputId": "8a758790-2da9-4221-965f-af799d46e8d2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+----------------------------+------------------------+-----------+--------+\n",
      "|age|email                       |name                    |nationality|postcode|\n",
      "+---+----------------------------+------------------------+-----------+--------+\n",
      "|35 |bbeckmann@example.org       |Aleksandr Weihmann      |Fidschi    |32181   |\n",
      "|91 |hildaloechel@example.com    |Prof. Kurt Bauer B.A.   |Guatemala  |37940   |\n",
      "|13 |maja07@example.net          |Ekkehart Wiek-Kallert   |Brasilien  |61559   |\n",
      "|80 |daniel31@example.com        |Annelise Rohleder-Hornig|Guatemala  |93103   |\n",
      "|47 |gottliebmisicher@example.com|Magrit Knappe B.A.      |Guadeloupe |34192   |\n",
      "+---+----------------------------+------------------------+-----------+--------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(n=5, truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ikvl-9Mm69kL"
   },
   "source": [
    "Do some data aggregation:\n",
    " - group by postcode\n",
    " - count the number of persons and the average age for each postcode\n",
    " - filter out postcodes with less than 4 persons\n",
    " - sort by average age descending\n",
    " - show the first 5 entries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:33:58.275375Z",
     "iopub.status.busy": "2024-10-13T21:33:58.275057Z",
     "iopub.status.idle": "2024-10-13T21:34:01.294542Z",
     "shell.execute_reply": "2024-10-13T21:34:01.293624Z"
    },
    "id": "1lr4WKxt69kL",
    "outputId": "862786b3-edbc-4d7d-eb15-d9c00df4f726"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 6:>                                                          (0 + 2) / 2]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------+-----+-----------+\n",
      "|postcode|Count|Average age|\n",
      "+--------+-----+-----------+\n",
      "|   60653|    4|       98.5|\n",
      "|   59679|    4|       98.5|\n",
      "|   37287|    4|       98.5|\n",
      "|   63287|    4|       98.0|\n",
      "|   37841|    4|       97.5|\n",
      "+--------+-----+-----------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "import pyspark.sql.functions as F\n",
    "df.groupBy('postcode') \\\n",
    "  .agg(F.count('postcode').alias('Count'), F.round(F.avg('age'), 2).alias('Average age')) \\\n",
    "  .filter('Count>3') \\\n",
    "  .orderBy('Average age', ascending=False) \\\n",
    "  .show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PP4TeBbh69kL"
   },
   "source": [
    "Postcode $18029$ has the highest average age ($91.75$). Show all entries for postcode $18029$ using `filter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:01.297957Z",
     "iopub.status.busy": "2024-10-13T21:34:01.297661Z",
     "iopub.status.idle": "2024-10-13T21:34:04.537405Z",
     "shell.execute_reply": "2024-10-13T21:34:04.536566Z"
    },
    "id": "_6NsPWUt69kL",
    "outputId": "5da2140d-8d6a-4983-8d41-025c7d640d33"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 9:>                                                          (0 + 1) / 1]\r"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 10:>                                                         (0 + 1) / 1]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+-----+----+-----------+--------+\n",
      "|age|email|name|nationality|postcode|\n",
      "+---+-----+----+-----------+--------+\n",
      "+---+-----+----+-----------+--------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df.filter('postcode==18029').show(truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_Ib139jd69kL"
   },
   "source": [
    "# Another example with multiple locales and weights\n",
    "\n",
    "We are going to use multiple locales with weights (following the [examples](https://faker.readthedocs.io/en/master/fakerclass.html#examples) in the documentation).\n",
    "\n",
    "Here's the [list of all available locales](https://faker.readthedocs.io/en/master/locales.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.540690Z",
     "iopub.status.busy": "2024-10-13T21:34:04.540370Z",
     "iopub.status.idle": "2024-10-13T21:34:04.544411Z",
     "shell.execute_reply": "2024-10-13T21:34:04.543594Z"
    },
    "id": "eXBHg-x269kL"
   },
   "outputs": [],
   "source": [
    "from faker import Faker\n",
    "# set a seed for the random generator\n",
    "Faker.seed(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ssmNP9zR_Ldb"
   },
   "source": [
    "Generate data with locales `de_DE` and `de_AT` with weights respectively $5$ and $2$.\n",
    "\n",
    "The distribution of locales will be:\n",
    " - `de_DE` - $71.43\\%$ of the time ($5 / (5+2)$)\n",
    " - `de_AT` - $28.57\\%$ of the time ($2 / (5+2)$)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.547639Z",
     "iopub.status.busy": "2024-10-13T21:34:04.546990Z",
     "iopub.status.idle": "2024-10-13T21:34:04.586486Z",
     "shell.execute_reply": "2024-10-13T21:34:04.585866Z"
    },
    "id": "j0XFMtOf69kT",
    "outputId": "f3ab1489-fc5d-4bea-b82a-5c7d24d83808"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['de_DE', 'de_AT']"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import OrderedDict\n",
    "locales = OrderedDict([\n",
    "    ('de_DE', 5),\n",
    "    ('de_AT', 2),\n",
    "])\n",
    "fake = Faker(locales)\n",
    "fake.seed_instance(42)\n",
    "fake.locales"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.588637Z",
     "iopub.status.busy": "2024-10-13T21:34:04.588270Z",
     "iopub.status.idle": "2024-10-13T21:34:04.591437Z",
     "shell.execute_reply": "2024-10-13T21:34:04.590863Z"
    },
    "id": "LHCsMtbR69kU"
   },
   "outputs": [],
   "source": [
    "fake.seed_locale('de_DE', 0)\n",
    "fake.seed_locale('de_AT', 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.593505Z",
     "iopub.status.busy": "2024-10-13T21:34:04.593129Z",
     "iopub.status.idle": "2024-10-13T21:34:04.598354Z",
     "shell.execute_reply": "2024-10-13T21:34:04.597810Z"
    },
    "id": "gUS9h0A669kU",
    "outputId": "93e24217-e2e1-46ba-d554-402414a0ab00"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'current_location': (Decimal('26.547114'), Decimal('-10.243190')),\n",
       " 'blood_group': 'B-',\n",
       " 'name': 'Axel Jung',\n",
       " 'sex': 'M',\n",
       " 'mail': 'claragollner@gmail.com',\n",
       " 'birthdate': datetime.date(2004, 1, 27)}"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',\n",
    "                     'mail', 'current_location'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.600195Z",
     "iopub.status.busy": "2024-10-13T21:34:04.599999Z",
     "iopub.status.idle": "2024-10-13T21:34:04.604258Z",
     "shell.execute_reply": "2024-10-13T21:34:04.603695Z"
    },
    "id": "PiSKooLQ69kU"
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.types import *\n",
    "location = StructField('current_location',\n",
    "                       StructType([StructField('lat', DecimalType()),\n",
    "                                   StructField('lon', DecimalType())])\n",
    "                      )\n",
    "schema = StructType([StructField('name', StringType()),\n",
    "                     StructField('birthdate', DateType()),\n",
    "                     StructField('sex', StringType()),\n",
    "                     StructField('blood_group', StringType()),\n",
    "                     StructField('mail', StringType()),\n",
    "                     location\n",
    "                     ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.606177Z",
     "iopub.status.busy": "2024-10-13T21:34:04.605975Z",
     "iopub.status.idle": "2024-10-13T21:34:04.610792Z",
     "shell.execute_reply": "2024-10-13T21:34:04.610231Z"
    },
    "id": "8j35IGPd69kU",
    "outputId": "fc5b3894-1fa5-4fa2-8ed9-034b03754668"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'current_location': (Decimal('79.153888'), Decimal('-0.003034')),\n",
       " 'blood_group': 'B-',\n",
       " 'name': 'Dr. Anita Suppan',\n",
       " 'sex': 'F',\n",
       " 'mail': 'schauerbenedict@kabsi.at',\n",
       " 'birthdate': datetime.date(1980, 10, 9)}"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',\n",
    "                     'mail', 'current_location'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.612738Z",
     "iopub.status.busy": "2024-10-13T21:34:04.612538Z",
     "iopub.status.idle": "2024-10-13T21:34:04.620060Z",
     "shell.execute_reply": "2024-10-13T21:34:04.619227Z"
    },
    "id": "mBHUAB8Z69kU"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/10/13 21:34:04 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession \\\n",
    "    .builder \\\n",
    "    .appName(\"Faker demo - part 2\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6gjkL0XM69kU"
   },
   "source": [
    "Create dataframe with $5\\cdot10^3$ rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:04.622662Z",
     "iopub.status.busy": "2024-10-13T21:34:04.622072Z",
     "iopub.status.idle": "2024-10-13T21:34:06.302690Z",
     "shell.execute_reply": "2024-10-13T21:34:06.301965Z"
    },
    "id": "pKxvxwDK69kU",
    "outputId": "e1c56b4e-081e-4ca9-cea5-19ac4514a565"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.6 s, sys: 0 ns, total: 1.6 s\n",
      "Wall time: 1.68 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "n = 5*10**3\n",
    "d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',\n",
    "                          'mail', 'current_location'])\n",
    "     for i in range(n))\n",
    "df = spark.createDataFrame(d, schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:06.304908Z",
     "iopub.status.busy": "2024-10-13T21:34:06.304529Z",
     "iopub.status.idle": "2024-10-13T21:34:06.308672Z",
     "shell.execute_reply": "2024-10-13T21:34:06.308104Z"
    },
    "id": "XbsEgC1d69kY",
    "outputId": "27bf6fec-1525-4a1c-f9a3-eb9beea60fef"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- name: string (nullable = true)\n",
      " |-- birthdate: date (nullable = true)\n",
      " |-- sex: string (nullable = true)\n",
      " |-- blood_group: string (nullable = true)\n",
      " |-- mail: string (nullable = true)\n",
      " |-- current_location: struct (nullable = true)\n",
      " |    |-- lat: decimal(10,0) (nullable = true)\n",
      " |    |-- lon: decimal(10,0) (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f-SQ-dsb69kY"
   },
   "source": [
    "Note how `location` represents a _tuple_ data structure (a `StructType` of `StructField`s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:06.310862Z",
     "iopub.status.busy": "2024-10-13T21:34:06.310497Z",
     "iopub.status.idle": "2024-10-13T21:34:06.439731Z",
     "shell.execute_reply": "2024-10-13T21:34:06.438874Z"
    },
    "id": "Rut1uSF069kY",
    "outputId": "5ac6fe42-d4c3-418d-dcd4-bb862ff888d6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------------------+----------+---+-----------+-------------------------+----------------+\n",
      "|name                       |birthdate |sex|blood_group|mail                     |current_location|\n",
      "+---------------------------+----------+---+-----------+-------------------------+----------------+\n",
      "|Prof. Valentine Niemeier   |1979-11-12|F  |B-         |maricagotthard@aol.de    |{74, 164}       |\n",
      "|Magrit Graf                |1943-09-15|F  |A-         |hartungclaudio@web.de    |{-86, -34}      |\n",
      "|Harriet Weiß-Liebelt       |1960-07-24|F  |AB+        |heserhilma@gmail.com     |{20, 126}       |\n",
      "|Marisa Heser               |1919-08-27|F  |B-         |meinhard55@web.de        |{73, 169}       |\n",
      "|Alexa Loidl-Schönberger    |1934-08-30|F  |O-         |hannafroehlich@gmail.com |{-23, -117}     |\n",
      "|Rosa-Maria Schwital B.Sc.  |1928-02-08|F  |O-         |johannessauer@yahoo.de   |{2, -113}       |\n",
      "|Herr Roland Caspar B.Sc.   |1932-09-09|M  |O-         |weinholdslawomir@yahoo.de|{24, 100}       |\n",
      "|Bernard Mude               |1943-03-01|M  |O-         |stollinka@hotmail.de     |{22, -104}      |\n",
      "|Prof. Violetta Eberl       |1913-09-09|F  |B+         |lars24@chello.at         |{80, -135}      |\n",
      "|Alexandre Oestrovsky B.Eng.|1925-03-29|M  |A-         |schleichholger@yahoo.de  |{-62, 43}       |\n",
      "+---------------------------+----------+---+-----------+-------------------------+----------------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(n=10, truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KSCL83BVC3xu"
   },
   "source": [
    "# Save to Parquet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fw7FSGff69kY"
   },
   "source": [
    "[Write to parquet](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=write#pyspark.sql.DataFrameWriter.parquet) file ([Parquet](http://parquet.apache.org/) is a compressed, efficient columnar data representation compatible with all frameworks in the Hadoop ecosystem):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:06.442784Z",
     "iopub.status.busy": "2024-10-13T21:34:06.442236Z",
     "iopub.status.idle": "2024-10-13T21:34:07.695350Z",
     "shell.execute_reply": "2024-10-13T21:34:07.694507Z"
    },
    "id": "VcTsZQXI69kZ"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 12:>                                                         (0 + 4) / 4]\r"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df.write.mode(\"overwrite\").parquet(\"fakedata.parquet\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2LjMN4zy69kZ"
   },
   "source": [
    "Check the size of parquet file (it is actually a directory containing the partitions):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:07.697783Z",
     "iopub.status.busy": "2024-10-13T21:34:07.697543Z",
     "iopub.status.idle": "2024-10-13T21:34:07.834293Z",
     "shell.execute_reply": "2024-10-13T21:34:07.833553Z"
    },
    "id": "Q3mJjr6q69kZ",
    "outputId": "5468f39d-01d9-4312-9b18-dee315b094ec"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "212K\tfakedata.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!du -h fakedata.parquet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:07.836844Z",
     "iopub.status.busy": "2024-10-13T21:34:07.836428Z",
     "iopub.status.idle": "2024-10-13T21:34:07.966042Z",
     "shell.execute_reply": "2024-10-13T21:34:07.965362Z"
    },
    "id": "Pju4GyKL-DAn",
    "outputId": "38e2c7dd-8187-45fd-ce90-7f6cd40072e7"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 188K\r\n",
      "-rw-r--r-- 1 runner docker   0 Oct 13 21:34 _SUCCESS\r\n",
      "-rw-r--r-- 1 runner docker 39K Oct 13 21:34 part-00000-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet\r\n",
      "-rw-r--r-- 1 runner docker 39K Oct 13 21:34 part-00001-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet\r\n",
      "-rw-r--r-- 1 runner docker 39K Oct 13 21:34 part-00002-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet\r\n",
      "-rw-r--r-- 1 runner docker 67K Oct 13 21:34 part-00003-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lh fakedata.parquet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JAwKdWmcC-ib"
   },
   "source": [
    "# Stop Spark session"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JBG-CEbV69kZ"
   },
   "source": [
    "Don't forget to close the Spark session when you're done!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5taVyX_Pp-O1"
   },
   "source": [
    "## Why you should stop your Spark session\n",
    "\n",
    "Even when no jobs are running, the Spark session holds memory resources, that get released only when the session is properly stopped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:07.968728Z",
     "iopub.status.busy": "2024-10-13T21:34:07.968278Z",
     "iopub.status.idle": "2024-10-13T21:34:07.972892Z",
     "shell.execute_reply": "2024-10-13T21:34:07.972346Z"
    },
    "id": "TkierRjOqWyS"
   },
   "outputs": [],
   "source": [
    "# Function to check memory usage\n",
    "import subprocess\n",
    "\n",
    "def get_memory_usage_ratio():\n",
    "    # Run the 'free -h' command\n",
    "    result = subprocess.run(['free', '-h'], stdout=subprocess.PIPE, text=True)\n",
    "\n",
    "    # Parse the output\n",
    "    lines = result.stdout.splitlines()\n",
    "\n",
    "    # Initialize used and total memory\n",
    "    used_memory = None\n",
    "    total_memory = None\n",
    "\n",
    "    # The second line contains the memory information\n",
    "    if len(lines) > 1:\n",
    "        # Split the line into parts\n",
    "        memory_parts = lines[1].split()\n",
    "        total_memory = memory_parts[1]  # Total memory\n",
    "        used_memory = memory_parts[2]   # Used memory\n",
    "\n",
    "    return used_memory, total_memory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dshcBr81wL-y"
   },
   "source": [
    "Stop the session and compare."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:07.975006Z",
     "iopub.status.busy": "2024-10-13T21:34:07.974614Z",
     "iopub.status.idle": "2024-10-13T21:34:07.987259Z",
     "shell.execute_reply": "2024-10-13T21:34:07.986599Z"
    },
    "id": "e4OeDOgis_pz",
    "outputId": "c79e0dc4-bc70-462c-a08b-af116fca89d9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Memory used before stopping Spark session: 1.6Gi\n",
      "Total Memory: 15Gi\n"
     ]
    }
   ],
   "source": [
    "# Check memory usage before stopping the Spark session\n",
    "used_memory, total_memory = get_memory_usage_ratio()\n",
    "print(f\"Memory used before stopping Spark session: {used_memory}\")\n",
    "print(f\"Total Memory: {total_memory}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-10-13T21:34:07.989270Z",
     "iopub.status.busy": "2024-10-13T21:34:07.989070Z",
     "iopub.status.idle": "2024-10-13T21:34:08.681232Z",
     "shell.execute_reply": "2024-10-13T21:34:08.680529Z"
    },
    "id": "dxTCAI2TsHVU",
    "outputId": "cef31f7b-e03d-4183-c525-cf05d6d69280"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Memory used after stopping Spark session: 1.5Gi\n",
      "Total Memory: 15Gi\n"
     ]
    }
   ],
   "source": [
    "# Stop the Spark session\n",
    "spark.stop()\n",
    "\n",
    "# Check memory usage after stopping the Spark session\n",
    "used_memory, total_memory = get_memory_usage_ratio()\n",
    "print(f\"Memory used after stopping Spark session: {used_memory}\")\n",
    "print(f\"Total Memory: {total_memory}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KqtdcXcLvCfS"
   },
   "source": [
    "The amount of memory released may not be impressive in this case, but holding onto unnecessary resources is inefficient. Also, memory waste can add up quickly when multiple sessions are running."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}