{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas - Optimize Memory and Speed Operation\n", "\n", "- Watch [Other Interesting Data Science Topics](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA/videos)\n", "- Created By: **Aakash Goel**\n", "- Connect on [LinkedIn](https://www.linkedin.com/in/aakash-goel-587a7385/)\n", "- Subscribe on [YouTube](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA?sub_confirmation=1)\n", "- Created on: 17-FEB-2020\n", "- Last Updated on: 06-JUL-2020" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of contents\n", "\n", "1. See Hardware and System Information \n", " 1.1. System, CPU, Memory, Swap, Disk, GPU Information \n", "2. Reduce DataFrame size \n", " 2.1. Change in int datatype \n", " 2.2. Change in float datatype \n", " 2.3. Change from object to category datatype \n", " 2.4. Convert to Sparse DataFrame \n", "\n", "3. Reduce DataFrame Loading Time \n", " 3.1. Don't Load all columns \n", " 3.2. Read Large CSV File (Avoid Memory error and fast reading) \n", "\n", "4. Fast Pandas Operation \n", " 4.1. Loop and Vectorization \n", " 4.2. Indexing \n", " 4.3. Parallelization \n", " \n", "5. References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](datatypeMemory.PNG \"segment\")\n", "\n", " [1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](memory_nos.png \"segment\")\n", " [8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. See Hardware and System Information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is very important to know about your machine on which you are coding.\n", "Hardware and System Information helps you to take proper decision like how much `cores` you should allocate to your program.\n", "Code for same is taken from here." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 System, CPU, Memory, Swap, Disk, GPU Information" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================== Python Version ========================================\n", "3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]\n", "sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)\n", "======================================== System Information ========================================\n", "System: Windows\n", "Release: 10\n", "Version: 10.0.18362\n", "Machine: AMD64\n", "Processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel\n", "======================================== CPU Info ========================================\n", "Physical cores: 2\n", "Total cores: 4\n", "Max Frequency: 2000.00Mhz\n", "Min Frequency: 0.00Mhz\n", "Current Frequency: 2000.00Mhz\n", "CPU Usage Per Core:\n", "Core 0: 56.9%\n", "Core 1: 25.0%\n", "Core 2: 43.8%\n", "Core 3: 42.2%\n", "Total CPU Usage: 45.2%\n", "======================================== Memory Information ========================================\n", "Total: 7.91GB\n", "Available: 3.85GB\n", "Used: 4.06GB\n", "Percentage: 51.3%\n", "======================================== SWAP ========================================\n", "Total: 19.26GB\n", "Free: 13.41GB\n", "Used: 5.85GB\n", "Percentage: 30.4%\n", "======================================== Disk Information ========================================\n", "Partitions and Usage:\n", "=== Device: C:\\ ===\n", " Mountpoint: C:\\\n", " File system type: NTFS\n", " Total Size: 465.13GB\n", " Used: 65.29GB\n", " Free: 399.84GB\n", " Percentage: 14.0%\n", "Total read: 540.46GB\n", "Total write: 62.73GB\n", "======================================== GPU Details ========================================\n", "id name load free memory used memory total memory temperature uuid\n", "---- ------ ------ ------------- ------------- -------------- ------------- ------\n" ] } ], "source": [ "# ! pip install tabulate\n", "# ! 
pip install gputil\n", "import psutil\n", "import platform\n", "import sys\n", "from datetime import datetime\n", "import GPUtil\n", "from tabulate import tabulate\n", "\n", "def get_size(bytes, suffix=\"B\"):\n", " \"\"\"\n", " Scale bytes to its proper format\n", " e.g:\n", " 1253656 => '1.20MB'\n", " 1253656678 => '1.17GB'\n", " \"\"\"\n", " factor = 1024\n", " for unit in [\"\", \"K\", \"M\", \"G\", \"T\", \"P\"]:\n", " if bytes < factor:\n", " return f\"{bytes:.2f}{unit}{suffix}\"\n", " bytes /= factor\n", "\n", "#### CODE -- https://www.thepythoncode.com/article/get-hardware-system-information-python ####\n", "\n", "print(\"=\"*40, \"Python Version\", \"=\"*40)\n", "print (sys.version)\n", "print (sys.version_info)\n", "\n", "print(\"=\"*40, \"System Information\", \"=\"*40)\n", "uname = platform.uname()\n", "print(f\"System: {uname.system}\")\n", "print(f\"Release: {uname.release}\")\n", "print(f\"Version: {uname.version}\")\n", "print(f\"Machine: {uname.machine}\")\n", "print(f\"Processor: {uname.processor}\")\n", "\n", "# let's print CPU information\n", "print(\"=\"*40, \"CPU Info\", \"=\"*40)\n", "# number of cores\n", "print(\"Physical cores:\", psutil.cpu_count(logical=False))\n", "print(\"Total cores:\", psutil.cpu_count(logical=True))\n", "# CPU frequencies\n", "cpufreq = psutil.cpu_freq()\n", "print(f\"Max Frequency: {cpufreq.max:.2f}Mhz\")\n", "print(f\"Min Frequency: {cpufreq.min:.2f}Mhz\")\n", "print(f\"Current Frequency: {cpufreq.current:.2f}Mhz\")\n", "# CPU usage\n", "print(\"CPU Usage Per Core:\")\n", "for i, percentage in enumerate(psutil.cpu_percent(percpu=True, interval=1)):\n", " print(f\"Core {i}: {percentage}%\")\n", "print(f\"Total CPU Usage: {psutil.cpu_percent()}%\")\n", "\n", "# Memory Information\n", "print(\"=\"*40, \"Memory Information\", \"=\"*40)\n", "# get the memory details\n", "svmem = psutil.virtual_memory()\n", "print(f\"Total: {get_size(svmem.total)}\")\n", "print(f\"Available: {get_size(svmem.available)}\")\n", "print(f\"Used: {get_size(svmem.used)}\")\n", "print(f\"Percentage: {svmem.percent}%\")\n", "print(\"=\"*40, \"SWAP\", \"=\"*40)\n", "# get the swap memory details (if exists)\n", "swap = psutil.swap_memory()\n", "print(f\"Total: {get_size(swap.total)}\")\n", "print(f\"Free: {get_size(swap.free)}\")\n", "print(f\"Used: {get_size(swap.used)}\")\n", "print(f\"Percentage: {swap.percent}%\")\n", "\n", "# Disk Information\n", "print(\"=\"*40, \"Disk Information\", \"=\"*40)\n", "print(\"Partitions and Usage:\")\n", "# get all disk partitions\n", "partitions = psutil.disk_partitions()\n", "for partition in partitions:\n", " print(f\"=== Device: {partition.device} ===\")\n", " print(f\" Mountpoint: {partition.mountpoint}\")\n", " print(f\" File system type: {partition.fstype}\")\n", " try:\n", " partition_usage = psutil.disk_usage(partition.mountpoint)\n", " except PermissionError:\n", " # this can be catched due to the disk that\n", " # isn't ready\n", " continue\n", " print(f\" Total Size: {get_size(partition_usage.total)}\")\n", " print(f\" Used: {get_size(partition_usage.used)}\")\n", " print(f\" Free: {get_size(partition_usage.free)}\")\n", " print(f\" Percentage: {partition_usage.percent}%\")\n", "# get IO statistics since boot\n", "disk_io = psutil.disk_io_counters()\n", "print(f\"Total read: {get_size(disk_io.read_bytes)}\")\n", "print(f\"Total write: {get_size(disk_io.write_bytes)}\")\n", "\n", "# GPU information\n", "print(\"=\"*40, \"GPU Details\", \"=\"*40)\n", "gpus = GPUtil.getGPUs()\n", "list_gpus = []\n", "for gpu in gpus:\n", " # 
get the GPU id\n", " gpu_id = gpu.id\n", " # name of GPU\n", " gpu_name = gpu.name\n", " # get % percentage of GPU usage of that GPU\n", " gpu_load = f\"{gpu.load*100}%\"\n", " # get free memory in MB format\n", " gpu_free_memory = f\"{gpu.memoryFree}MB\"\n", " # get used memory\n", " gpu_used_memory = f\"{gpu.memoryUsed}MB\"\n", " # get total memory\n", " gpu_total_memory = f\"{gpu.memoryTotal}MB\"\n", " # get GPU temperature in Celsius\n", " gpu_temperature = f\"{gpu.temperature} °C\"\n", " gpu_uuid = gpu.uuid\n", " list_gpus.append((\n", " gpu_id, gpu_name, gpu_load, gpu_free_memory, gpu_used_memory,\n", " gpu_total_memory, gpu_temperature, gpu_uuid\n", " ))\n", "\n", "print(tabulate(list_gpus, headers=(\"id\", \"name\", \"load\", \"free memory\", \"used memory\", \"total memory\",\n", " \"temperature\", \"uuid\")))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import nbconvert\n", "from time import time\n", "import gc\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Reduce DataFrame size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Change in int datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `Age` column having `minimum value 1 and maximum value 150`, with `10 million` total rows in dataframe \n", "**Task**: Reduce Memory Usage of `Age` column given above constraints \n", "**Action**: Change of original dtype from `int32` to `uint8` \n", "**Result**: Drop from `38.1 MB to 9.5 MB` in Memory usage i.e. `75%` reduction" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "## Initializing minimum and maximum value of age\n", "min_age_value , max_age_value = 1,150\n", "## Number of rows in dataframe\n", "nrows = int(np.power(10,7))\n", "## creation of Age dataframe\n", "df_age = pd.DataFrame({'Age':np.random.randint(low=1,high=100,size=nrows)})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age int32\n", "dtypes: int32(1)\n", "memory usage: 38.1 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_age.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iinfo(min=0, max=255, dtype=uint8)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Range of \"uint8\"; satisfies range constraint of Age column \n", "np.iinfo('uint8')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"int32\" to \"uint8\"\n", "converted_df_age = df_age.astype(np.uint8)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age uint8\n", "dtypes: uint8(1)\n", "memory usage: 9.5 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_age.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Change in float datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let 
's say you have `50,000 search queries` and `5,000 documents`, and you have computed the `cosine similarity` of each search query with every document, i.e. a matrix of `dimension 50,000 X 5,000`. All similarity values are between `0 and 1` and need at least `2 decimal precision` \n", "**Task**: Reduce Memory Usage of the cosine similarity dataframe given the above constraints \n", "**Action**: Change of original dtype from `float64` to `float16` \n", "**Result**: Drop from `1.9 GB to 476.8 MB or 0.46 GB` in Memory usage i.e. `75%` reduction " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of search queries: 50000 and No. of documents: 5000\n" ] } ], "source": [ "## no. of documents\n", "ncols = int(5*np.power(10,3))\n", "## no. of search queries\n", "nrows = int(5*np.power(10,4))\n", "## creation of cosine similarity dataframe\n", "df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))\n", "print(\"No. of search queries: {} and No. of documents: {}\".format(df_query_doc.shape[0],df_query_doc.shape[1]))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float64(5000)\n", "memory usage: 1.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"float64\" to \"float16\"\n", "converted_df_query_doc = df_query_doc.astype('float16')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float16(5000)\n", "memory usage: 476.8 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Change from object to category datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let's say you have a `Day of Week` column with `7 unique` values and `49 million` total rows in the dataframe \n", "**Task**: Reduce Memory Usage of the `Day of Week` column given that only 7 unique values exist \n", "**Action**: Change of dtype from `object` to `category` as the ratio of unique values to number of rows is almost zero \n", "**Result**: Drop from `2.9 GB to 46.7 MB or 0.045 GB` in Memory usage i.e. 
`98%` reduction " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No of rows in days of week dataframe 49000000\n" ] } ], "source": [ "## unique values of \"days of week\"\n", "day_of_week = [\"monday\",\"tuesday\",\"wednesday\",\"thursday\",\"friday\",\"saturday\",\"sunday\"]\n", "## Number of times day_of_week repeats\n", "repeat_times = 7*np.power(10,6)\n", "## creation of days of week dataframe\n", "df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})\n", "print(\"No of rows in days of week dataframe {}\".format(df_day_of_week.shape[0]))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week object\n", "dtypes: object(1)\n", "memory usage: 2.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_day_of_week.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"object\" to \"category\"\n", "converted_df_day_of_week = df_day_of_week.astype('category')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week category\n", "dtypes: category(1)\n", "memory usage: 46.7 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_day_of_week.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
day_of_week
0monday
1monday
\n", "
" ], "text/plain": [ " day_of_week\n", "0 monday\n", "1 monday" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check first two rows of dataframe\n", "converted_df_day_of_week.head(2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "dtype: int8" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check how mapping of day_of_week is created in category dtype\n", "converted_df_day_of_week.head(2)['day_of_week'].cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Convert to Sparse DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have dataframe having `large count of zero or missing values (66%)` usually happens in lot of `NLP task` like Count/TF-IDF encoding, Recommender Systems [2] \n", "**Task**: Reduce Memory Usage of dataframe \n", "**Action**: Change of DataFrame type to `SparseDataFrame` as Percentage of Non-Zero Non-NaN values is very less in number \n", "**Result**: Drop from `228.9 MB to 152.6 MB` in Memory usage i.e. `33%` reduction " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "## number of rows in dataframe\n", "nrows = np.power(10,7)\n", "## creation of dataframe\n", "df_dense =pd.DataFrame([[0,0.23,np.nan]]*nrows)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 int64\n", "1 float64\n", "2 float64\n", "dtypes: float64(2), int64(1)\n", "memory usage: 228.9 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_dense.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Percentage of Non-Zero Non-NaN values in dataframe 33.33 %\n" ] } ], "source": [ "## Percentage of Non-zero and Non-NaN values in dataframe\n", "non_zero_non_nan = np.count_nonzero((df_dense)) - df_dense.isnull().sum().sum()\n", "non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)\n", "print(\"Percentage of Non-Zero Non-NaN values in dataframe {} %\".format(non_zero_non_nan_percentage))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "## Action: Change of DataFrame type to SparseDataFrame\n", "df_sparse = df_dense.to_sparse()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 Sparse[int64, nan]\n", "1 Sparse[float64, nan]\n", "2 Sparse[float64, nan]\n", "dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)\n", "memory usage: 152.6 MB\n" ] } ], "source": [ "## check memory usage after action\n", "df_sparse.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Reduce DataFrame Loading Time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Don't Load all columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let's say you have a csv file with `100 columns` but you only need `10` of them for processing \n", "**Task**: Reduce `pd.read_csv()` Loading time \n", "**Action**: Load only the required columns explicitly, i.e. 10 columns instead of the default of all 100 \n", "**Result**: Drop from `2.8 seconds to 0.8 seconds` in Loading time i.e. `71%` reduction " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Loading time 2.868640422821045 seconds\n" ] } ], "source": [ "t0 = time()\n", "ufo = pd.read_csv('http://bit.ly/uforeports')\n", "t1 = time()\n", "print(\"Default Loading time {} seconds\".format(t1-t0))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading time with usecols 0.8735167980194092 seconds\n" ] } ], "source": [ "t0 = time()\n", "ufo = pd.read_csv('http://bit.ly/uforeports',usecols=['City','State'])\n", "t1 = time()\n", "print(\"Loading time with usecols {} seconds\".format(t1-t0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Read Large CSV File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let's say you have a large csv file that takes too much time to read, or gives a memory error, when read in one go \n", "**Task**: Avoid Memory Error or reduce file loading time \n", "**Action 1**: Instead of loading the file in one go, load it in chunks, i.e. use the `chunksize` parameter of `read_csv`. Useful when you don't need all the data in memory at once (see the sketch after this list) \n", "**Action 2**: See if you can save memory with the following operations:\n", "- Explicitly pass appropriate `dtypes` for each column while loading the file to reduce the volume of the data. Example - sometimes `64-bit` precision is overkill and `32-bit` is enough; the 2x memory saving is worth it.\n", "- Use the `usecols` parameter of `read_csv`, i.e. only load the required columns\n", "- `Data compression` - Here, I don't mean zip or gzip files. Say your status column has two values: \"AVAILABLE\" and \"UNAVAILABLE\". Instead of storing them as strings (~10 bytes per entry), store them as booleans: True or False (1 byte per entry)\n", "\n", "**Action 3**: For faster data loading, serialize data (pickle) on the first run and load the serialized data instead of the csv files in subsequent runs. [15]\n", "\n", "**Action 4**: Use another library with similar functionality instead of pandas:\n", "- Dask - Dask provides multi-core execution on larger-than-memory datasets.\n", "\n", "\n", "- Modin - Modin transparently distributes the data and computation, providing speed-ups of up to 4x on a laptop with 4 physical cores. Pandas uses only one core at a time for most computations; with Modin, you can use all of the CPU cores on your machine. 
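\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of **Action 1** (added for illustration, not part of the original benchmarks): reading a CSV in chunks with the `chunksize` parameter and combining only the small per-chunk results. It reuses the ufo-reports CSV from section 3.1; the chunk size and the `value_counts` aggregation are arbitrary choices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Sketch: process a CSV in chunks instead of in one go\n", "## read_csv(chunksize=...) returns an iterator of DataFrames; nothing is read until you loop over it\n", "chunk_iter = pd.read_csv('http://bit.ly/uforeports', usecols=['City','State'], chunksize=5000)\n", "## aggregate each chunk separately, then combine the (small) partial results\n", "state_counts = pd.concat([chunk['State'].value_counts() for chunk in chunk_iter])\n", "state_counts = state_counts.groupby(level=0).sum()\n", "state_counts.head()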
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](read_csv_benchmark.png \"segment\")\n", " [9]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](lib_comparison.png \"segment\")\n", " [11]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# # pip install modin[ray]\n", "# import ray\n", "# ray.init(num_cpus=4)\n", "# import modin.pandas as pd\n", "# export MODIN_OUT_OF_CORE=true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](modin_vs_pandas.png \"segment\")\n", " [13]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Fast Pandas Operation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Loop and Vectorization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](pandas_operation.png \"segment\")\n", " [16]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](pandas_optimization.png \"segment\")\n", " [17]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you want to join two dataframes based on some common column \n", "**Task**: Reduce `pd.merge()` time \n", "**Action**: During `merge` operation, use indexing \n", "**Result**: Drop from `10 second to 2 second` in Loading time i.e. `80%` reduction \n", "\n", "\n", "**General Advice**: Use indexing as much as possible for merging and value lookup in dataframe [18] " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time Taken in merge before indexing 9.495047569274902\n", "Time Taken in merge after indexing 2.741480827331543\n" ] } ], "source": [ "df1 = pd.DataFrame({'A':list(range(10000000)),'B':list(range(0,20000000,2))})\n", "df2 = pd.DataFrame({'A':list(range(10000000)),'C':list(range(0,40000000,4))})\n", "\n", "t0 = time()\n", "df = pd.merge(df1,df2,how='left',on='A')\n", "t1 = time()\n", "print(\"Time Taken in merge before indexing {}\".format(t1-t0))\n", "\n", "gc.collect()\n", "\n", "t0 = time()\n", "df1_ = df1.set_index('A')\n", "df2_ = df2.set_index('A')\n", "df_ = pd.merge(df1_,df2_, left_index=True, right_index=True)\n", "t1 = time()\n", "print(\"Time Taken in merge after indexing {}\".format(t1-t0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Parallelization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Can we parallelize `pandas apply or groupby` function without much change in code ?\n", "\n", "\n", "**Answer**: Yes, there are lot of libraries which can help in parallelizing pandas apply function without much change in code.\n", "\n", "**Libraries**\n", "- pandarallel\n", "\n", "![SegmentLocal](standard_vs_parallel_4_cores.png \"segment\")\n", " [19]\n", "\n", "**General Advice:** Parallelization is efficient only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallelization is not always worth it.\n", "\n", "- Swifter - It applies any function to a pandas dataframe or series in the fastest available manner (Vectorize OR Dask Parallel Processing OR Pandas apply). This notebook contains speed comparison for swifter, Please check." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "#########################################\n", "#### PANDARALLEL CODE #################\n", "#########################################\n", "# from pandarallel import pandarallel\n", "# pandarallel.initialize(nb_workers = _NO_CORES_USED_FOR_PARALLELIZATION,progress_bar = True)\n", "# df.parallel_apply(func)\n", "# df.groupby(args).parallel_apply(func)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "#########################################\n", "#### SWIFTER CODE #################\n", "#########################################\n", "# import pandas as pd\n", "# import swifter\n", "# df['Col'].swifter.apply(anyfunction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. References\n", "\n", "1) [https://www.dataquest.io/blog/pandas-big-data/](https://www.dataquest.io/blog/pandas-big-data/) \n", "2) [https://machinelearningmastery.com/sparse-matrices-for-machine-learning/](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) \n", "3) [https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) \n", "4) [https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) \n", "5) [https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/) \n", "6) [https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) \n", "7) [https://pythonspeed.com/articles/pandas-load-less-data/](https://pythonspeed.com/articles/pandas-load-less-data/) \n", "8) [https://numpy.org/devdocs/user/basics.types.html](https://numpy.org/devdocs/user/basics.types.html) \n", "9) [https://modin.readthedocs.io/en/latest/](https://modin.readthedocs.io/en/latest/) \n", "10) **DASK** - [https://docs.dask.org/en/latest/best-practices.html](https://docs.dask.org/en/latest/best-practices.html)\n", "\n", "11) [https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c](https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c) \n", "12) [https://pythonspeed.com/datascience/](https://pythonspeed.com/datascience/) \n", "13) **Modin Vs Pandas** [https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html](https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html) \n", "14) [https://www.csvexplorer.com/blog/open-big-csv/](https://www.csvexplorer.com/blog/open-big-csv/) \n", "15) [https://www.kaggle.com/c/zillow-prize-1/discussion/37261](https://www.kaggle.com/c/zillow-prize-1/discussion/37261) \n", "16) [https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) \n", "17) [https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf](https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf) \n", "18) 
[https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2](https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2) \n", "19) [https://github.com/nalepae/pandarallel](https://github.com/nalepae/pandarallel) \n", "20) [https://www.thepythoncode.com/article/get-hardware-system-information-python](https://www.thepythoncode.com/article/get-hardware-system-information-python) \n", "\n", "\n", "**NOTE:** I acknowledge the work in the above references; this notebook uses some of the code mentioned in them." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }