{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hwLHeIOc69kD"
},
"source": [
"# Data Generation and Aggregation with Python's Faker Library and PySpark\n",
"\n",
"Explore the capabilities of the Python Faker library (https://faker.readthedocs.io/) for dynamic data generation!\n",
"\n",
"Whether you're a data scientist, engineer, or analyst, this tutorial shows how to create realistic, varied datasets with Faker and then aggregate and analyze them with PySpark's distributed computing capabilities. Along the way, it covers data generation techniques that improve performance and reduce resource usage.\n",
"\n",
"**Note:** This is not _synthetic_ data in the strict sense: it is generated with simple methods and is unlikely to follow any real-life distribution. It is still useful for testing when authentic data is unavailable."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BW6yZdpy7run"
},
"source": [
"# Install Faker\n",
"\n",
"The Python `faker` module needs to be installed. Note that on Google Colab you can use `!pip` as well as just `pip` (no exclamation mark)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2024-10-13T21:33:10.436469Z",
"iopub.status.busy": "2024-10-13T21:33:10.435903Z",
"iopub.status.idle": "2024-10-13T21:33:12.688188Z",
"shell.execute_reply": "2024-10-13T21:33:12.687491Z"
},
"id": "7EOLVOTe7KnP",
"outputId": "f768294b-b3f8-4df7-9837-47b2dd1f4cdb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting faker\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Downloading Faker-30.3.0-py3-none-any.whl.metadata (15 kB)\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: python-dateutil>=2.4 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from faker) (2.9.0.post0)\r\n",
"Requirement already satisfied: typing-extensions in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from faker) (4.12.2)\r\n",
"Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.4->faker) (1.16.0)\r\n",
"Downloading Faker-30.3.0-py3-none-any.whl (1.8 MB)\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Installing collected packages: faker\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Successfully installed faker-30.3.0\r\n"
]
}
],
"source": [
"!pip install faker"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UYLa4gsrB7aS"
},
"source": [
"# Generate a Pandas dataframe with fake data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yKviUSgj7nHP"
},
"source": [
"Import `Faker` and set a random seed ($42$)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-10-13T21:33:12.691166Z",
"iopub.status.busy": "2024-10-13T21:33:12.690755Z",
"iopub.status.idle": "2024-10-13T21:33:12.742178Z",
"shell.execute_reply": "2024-10-13T21:33:12.741512Z"
},
"id": "7fbGCBoq69kF"
},
"outputs": [],
"source": [
"from faker import Faker\n",
"# Set the seed value of the shared `random.Random` object\n",
"# across all internal generators that will ever be created\n",
"Faker.seed(42)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yUEes2gl69kF"
},
"source": [
"`fake` is a fake data generator with the `de_DE` locale."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-10-13T21:33:12.744622Z",
"iopub.status.busy": "2024-10-13T21:33:12.744255Z",
"iopub.status.idle": "2024-10-13T21:33:12.772887Z",
"shell.execute_reply": "2024-10-13T21:33:12.772244Z"
},
"id": "Bf7UTx_r69kG"
},
"outputs": [],
"source": [
"fake = Faker('de_DE')\n",
"fake.seed_locale('de_DE', 42)\n",
"# Creates and seeds a unique `random.Random` object for\n",
"# each internal generator of this `Faker` instance\n",
"fake.seed_instance(42)"
]
},
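{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two seeding calls above differ in scope: `Faker.seed` seeds the shared random generator, while `seed_instance` gives one `Faker` instance its own. As a minimal sketch (the names `a`, `b`, `names_a` and `names_b` are illustrative and not part of the original notebook, and both instances are assumed to use the same Faker version and locale), two instances seeded identically with `seed_instance` should produce the same sequence of values:\n",
"\n",
"```python\n",
"from faker import Faker\n",
"\n",
"# Two independent generators with the same locale\n",
"a = Faker('de_DE')\n",
"b = Faker('de_DE')\n",
"\n",
"# Give each instance its own random generator, seeded identically\n",
"a.seed_instance(42)\n",
"b.seed_instance(42)\n",
"\n",
"# Draw a few names from each instance and compare the two sequences\n",
"names_a = [a.name() for _ in range(3)]\n",
"names_b = [b.name() for _ in range(3)]\n",
"print(names_a == names_b)\n",
"```\n",
"\n",
"Re-seeding an instance restarts its sequence, which helps when a test needs reproducible data."
]
},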
{
"cell_type": "markdown",
"metadata": {
"id": "eZnvde79ljUB"
},
"source": [
"With `fake` you can generate fake data such as names and email addresses."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2024-10-13T21:33:12.775243Z",
"iopub.status.busy": "2024-10-13T21:33:12.774870Z",
"iopub.status.idle": "2024-10-13T21:33:12.778427Z",
"shell.execute_reply": "2024-10-13T21:33:12.777796Z"
},
"id": "81-ltoA0lp0v",
"outputId": "c7e97d64-3184-44b3-aac1-d77008e6a0d8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A fake name: Aleksandr Weihmann\n",
"A fake email: ioannis32@example.net\n"
]
}
],
"source": [
"print(f\"A fake name: {fake.name()}\")\n",
"print(f\"A fake email: {fake.email()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0bSXBUI469kG"
},
"source": [
"Import Pandas to store the data in a dataframe. Outside Google Colab, the cell first installs the pinned version `pandas==1.5.3`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2024-10-13T21:33:12.780603Z",
"iopub.status.busy": "2024-10-13T21:33:12.780258Z",
"iopub.status.idle": "2024-10-13T21:33:17.535019Z",
"shell.execute_reply": "2024-10-13T21:33:17.534318Z"
},
"id": "PI_YqfM169kG"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting pandas==1.5.3\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)\r\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (2.9.0.post0)\r\n",
"Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (2024.2)\r\n",
"Requirement already satisfied: numpy>=1.20.3 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (1.24.4)\r\n",
"Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.3) (1.16.0)\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Installing collected packages: pandas\r\n",
" Attempting uninstall: pandas\r\n",
" Found existing installation: pandas 2.0.3\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Uninstalling pandas-2.0.3:\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Successfully uninstalled pandas-2.0.3\r\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Successfully installed pandas-1.5.3\r\n"
]
}
],
"source": [
"# IN_COLAB is True when running on Google Colab\n",
"import sys\n",
"IN_COLAB = 'google.colab' in sys.modules\n",
"if not IN_COLAB:\n",
"    !pip install pandas==1.5.3\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LjCsGikw69kG"
},
"source": [
"The function `create_row_faker` creates `num` rows of fake data (one row by default). Each row contains the following fields:\n",
" - `name`: `fake.name()`\n",
" - `age`: `fake.random_int(0, 100)`\n",
" - `postcode`: `fake.postcode()`\n",
" - `email`: `fake.email()`\n",
" - `nationality`: `fake.country()`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2024-10-13T21:33:17.537889Z",
"iopub.status.busy": "2024-10-13T21:33:17.537412Z",
"iopub.status.idle": "2024-10-13T21:33:17.542322Z",
"shell.execute_reply": "2024-10-13T21:33:17.541794Z"
},
"id": "f1MiSZl069kG"
},
"outputs": [],
"source": [
"def create_row_faker(num=1):\n",
"    # Create a German-locale generator and seed it for reproducible output\n",
"    fake = Faker('de_DE')\n",
"    fake.seed_locale('de_DE', 42)\n",
"    fake.seed_instance(42)\n",
"    # Build `num` rows, each a dict with one key per field\n",
"    output = [{\"name\": fake.name(),\n",
"               \"age\": fake.random_int(0, 100),\n",
"               \"postcode\": fake.postcode(),\n",
"               \"email\": fake.email(),\n",
"               \"nationality\": fake.country(),\n",
"               } for _ in range(num)]\n",
"    return output"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TeuZISIh69kH"
},
"source": [
"Generate a single row"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2024-10-13T21:33:17.544672Z",
"iopub.status.busy": "2024-10-13T21:33:17.544292Z",
"iopub.status.idle": "2024-10-13T21:33:17.551952Z",
"shell.execute_reply": "2024-10-13T21:33:17.551413Z"
},
"id": "wXP-5uSg69kH",
"outputId": "cb47d550-12a3-4b38-d21e-f4d976e5585f"
},
"outputs": [
{
"data": {
"text/plain": [
"[{'name': 'Aleksandr Weihmann',\n",
" 'age': 35,\n",
" 'postcode': '32181',\n",
" 'email': 'bbeckmann@example.org',\n",
" 'nationality': 'Fidschi'}]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"create_row_faker()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vu3dIuuFmSU0"
},
"source": [
"Generate three rows (`num=3`)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2024-10-13T21:33:17.554137Z",
"iopub.status.busy": "2024-10-13T21:33:17.553754Z",
"iopub.status.idle": "2024-10-13T21:33:17.559448Z",
"shell.execute_reply": "2024-10-13T21:33:17.558908Z"
},
"id": "fS0Bv6QPmV1k",
"outputId": "3020ff6b-ff6b-46d4-cf7b-f55b12957d14"
},
"outputs": [
{
"data": {
"text/plain": [
"[{'name': 'Aleksandr Weihmann',\n",
" 'age': 35,\n",
" 'postcode': '32181',\n",
" 'email': 'bbeckmann@example.org',\n",
" 'nationality': 'Fidschi'},\n",
" {'name': 'Prof. Kurt Bauer B.A.',\n",
" 'age': 91,\n",
" 'postcode': '37940',\n",
" 'email': 'hildaloechel@example.com',\n",
" 'nationality': 'Guatemala'},\n",
" {'name': 'Ekkehart Wiek-Kallert',\n",
" 'age': 13,\n",
" 'postcode': '61559',\n",
" 'email': 'maja07@example.net',\n",
" 'nationality': 'Brasilien'}]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"create_row_faker(3)"
]
},
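{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `create_row_faker` re-seeds its generator with 42 on every call, repeated calls return the same rows, which is why the first row is identical in both outputs above. The following is a hedged sketch of a variant that lets the caller choose the seed. The function name `create_rows_seeded` and its `seed` parameter are hypothetical and not part of the original notebook:\n",
"\n",
"```python\n",
"from faker import Faker\n",
"\n",
"def create_rows_seeded(num=1, seed=42):\n",
"    # Hypothetical variant of create_row_faker: the seed is a parameter\n",
"    fake = Faker('de_DE')\n",
"    fake.seed_instance(seed)\n",
"    return [{'name': fake.name(),\n",
"             'age': fake.random_int(0, 100),\n",
"             'postcode': fake.postcode(),\n",
"             'email': fake.email(),\n",
"             'nationality': fake.country()} for _ in range(num)]\n",
"\n",
"# Compare two batches built with the same seed, then with different seeds\n",
"print(create_rows_seeded(2, seed=1) == create_rows_seeded(2, seed=1))\n",
"print(create_rows_seeded(2, seed=1) == create_rows_seeded(2, seed=2))\n",
"```\n",
"\n",
"Passing a distinct seed per batch is one way to avoid duplicated rows when several batches are generated independently."
]
},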
{
"cell_type": "markdown",
"metadata": {
"id": "4zIjkEOw69kI"
},
"source": [
"Generate a dataframe `df_fake` of 5000 rows using `create_row_faker`.\n",
"\n",
"We're using the _cell magic_ `%%time` to time the operation."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2024-10-13T21:33:17.561647Z",
"iopub.status.busy": "2024-10-13T21:33:17.561268Z",
"iopub.status.idle": "2024-10-13T21:33:17.847077Z",
"shell.execute_reply": "2024-10-13T21:33:17.846450Z"
},
"id": "JtRWDEsT69kI",
"outputId": "72f6326d-10cd-41d6-f1bd-4a81e6d63d18"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 276 ms, sys: 6.14 ms, total: 282 ms\n",
"Wall time: 282 ms\n"
]
}
],
"source": [
"%%time\n",
"df_fake = pd.DataFrame(create_row_faker(5000))"
]
},
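{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `%%time` cell magic is only available in IPython and Jupyter. As a rough sketch of the same measurement in a plain Python script, wall time can be taken with `time.perf_counter`. The helper name `timed` and the stand-in workload are assumptions for illustration:\n",
"\n",
"```python\n",
"import time\n",
"\n",
"def timed(func, *args, **kwargs):\n",
"    # Hypothetical helper: run func and return (result, elapsed seconds)\n",
"    start = time.perf_counter()\n",
"    result = func(*args, **kwargs)\n",
"    elapsed = time.perf_counter() - start\n",
"    return result, elapsed\n",
"\n",
"# Stand-in workload; in this notebook it could be create_row_faker(5000)\n",
"rows, seconds = timed(lambda n: [i * i for i in range(n)], 5000)\n",
"print(f'Generated {len(rows)} items in {seconds:.4f} s')\n",
"```\n",
"\n",
"Unlike `%%time`, this reports wall time only, not the separate user and system CPU times."
]
},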
{
"cell_type": "markdown",
"metadata": {
"id": "meT16YZy69kI"
},
"source": [
"View dataframe"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"execution": {
"iopub.execute_input": "2024-10-13T21:33:17.849407Z",
"iopub.status.busy": "2024-10-13T21:33:17.849015Z",
"iopub.status.idle": "2024-10-13T21:33:17.859311Z",
"shell.execute_reply": "2024-10-13T21:33:17.858758Z"
},
"id": "RJK93FxW69kI",
"outputId": "0c2b4462-43a1-40ef-967a-dd1e9d7359dc"
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>name</th><th>age</th><th>postcode</th><th>email</th><th>nationality</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Aleksandr Weihmann</td><td>35</td><td>32181</td><td>bbeckmann@example.org</td><td>Fidschi</td></tr>\n",
"    <tr><th>1</th><td>Prof. Kurt Bauer B.A.</td><td>91</td><td>37940</td><td>hildaloechel@example.com</td><td>Guatemala</td></tr>\n",
"    <tr><th>2</th><td>Ekkehart Wiek-Kallert</td><td>13</td><td>61559</td><td>maja07@example.net</td><td>Brasilien</td></tr>\n",
"    <tr><th>3</th><td>Annelise Rohleder-Hornig</td><td>80</td><td>93103</td><td>daniel31@example.com</td><td>Guatemala</td></tr>\n",
"    <tr><th>4</th><td>Magrit Knappe B.A.</td><td>47</td><td>34192</td><td>gottliebmisicher@example.com</td><td>Guadeloupe</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>4995</th><td>Hanno Jopich-Rädel</td><td>99</td><td>13333</td><td>keudelstanislaus@example.org</td><td>Syrien</td></tr>\n",
"    <tr><th>4996</th><td>Herr Arno Ebert B.A.</td><td>63</td><td>36790</td><td>josefaebert@example.org</td><td>Slowenien</td></tr>\n",
"    <tr><th>4997</th><td>Miroslawa Schüler</td><td>22</td><td>11118</td><td>ruppersbergerbetina@example.org</td><td>Republik Moldau</td></tr>\n",
"    <tr><th>4998</th><td>Janusz Nerger</td><td>74</td><td>33091</td><td>ann-kathrinseip@example.net</td><td>Belarus</td></tr>\n",
"    <tr><th>4999</th><td>Frau Cathleen Bähr</td><td>97</td><td>89681</td><td>hethurhubertus@example.org</td><td>St. Barthélemy</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5000 rows × 5 columns</p>\n",
"<p><b>SparkContext</b>: version v3.5.3, master local[*], app name Faker demo</p>"