{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Drop Memory Usage Tricks\n", "\n", "- Watch [Other Interesting Data Science Topics](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA/videos)\n", "- Created By: **Aakash Goel**\n", "- Connect on [LinkedIn](https://www.linkedin.com/in/aakash-goel-587a7385/)\n", "- Subscribe on [YouTube](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA?sub_confirmation=1)\n", "- Created on: 17-FEB-2020\n", "- Last Updated on: 17-FEB-2020" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of contents\n", "\n", "1. Reduce DataFrame size \n", " 1.1. Change in int datatype \n", " 1.2. Change in float datatype \n", " 1.3. Change from object to category datatype \n", " 1.4. Convert to Sparse DataFrame \n", "\n", "2. References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](datatypeMemory.PNG \"segment\")\n", "\n", " [1]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import nbconvert\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reduce DataFrame size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Change in int datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `Age` column having `minimum value 1 and maximum value 150`, with `10 million` total rows in dataframe \n", "**Task**: Reduce Memory Usage of `Age` column given above constraints \n", "**Action**: Change of original dtype from `int32` to `uint8` \n", "**Result**: Drop from `38.1 MB to 9.5 MB` in Memory usage i.e. `75%` reduction" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## Initializing minimum and maximum value of age\n", "min_age_value , max_age_value = 1,150\n", "## Number of rows in dataframe\n", "nrows = int(np.power(10,7))\n", "## creation of Age dataframe\n", "df_age = pd.DataFrame({'Age':np.random.randint(low=1,high=100,size=nrows)})" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age int32\n", "dtypes: int32(1)\n", "memory usage: 38.1 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_age.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iinfo(min=0, max=255, dtype=uint8)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Range of \"uint8\"; satisfies range constraint of Age column \n", "np.iinfo('uint8')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"int32\" to \"uint8\"\n", "converted_df_age = df_age.astype(np.uint8)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age uint8\n", "dtypes: uint8(1)\n", "memory usage: 9.5 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_age.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Change in float datatype" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `50,000 search queries` and `5,000 documents` and computed `cosine similarity` for each search query with all documents i.e. `dimension 50,000 X 5,000`. All similarity values are between `0 and 1` and should have atleast `2 decimal precision` \n", "**Task**: Reduce Memory Usage of cosine smilarity dataframe given above constraints \n", "**Action**: Change of original dtype from `float64` to `float16` \n", "**Result**: Drop from `1.9 GB to 476.8 MB or 0.46 GB` in Memory usage i.e. `75%` reduction " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of search queries: 50000 and No. of documents: 5000\n" ] } ], "source": [ "## no. of documents\n", "ncols = int(5*np.power(10,3))\n", "## no. of search queries\n", "nrows = int(5*np.power(10,4))\n", "## creation of cosine similarity dataframe\n", "df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))\n", "print(\"No. of search queries: {} and No. of documents: {}\".format(df_query_doc.shape[0],df_query_doc.shape[1]))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float64(5000)\n", "memory usage: 1.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"float64\" to \"float16\"\n", "converted_df_query_doc = df_query_doc.astype('float16')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float16(5000)\n", "memory usage: 476.8 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Change from object to category datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `Day of Week` column having `7 unique` values, with `4.9 million` total rows in dataframe \n", "**Task**: Reduce Memory Usage of `Day of Week` column given only 7 unique value exist \n", "**Action**: Change of dtype from `object` to `category` as ratio of unique values to no. of rows is almost zero \n", "**Result**: Drop from `2.9 GB to 46.7 MB or 0.045 GB` in Memory usage i.e. 
"cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Change from object to category datatype" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Suppose you have a `Day of Week` column with only `7 unique` values, in a dataframe of `49 million` rows \n", "**Task**: Reduce the memory usage of the `Day of Week` column, given that only 7 unique values exist \n", "**Action**: Change the dtype from `object` to `category`, since the ratio of unique values to rows is almost zero \n", "**Result**: Memory usage drops from `2.9 GB to 46.7 MB (0.045 GB)`, i.e. a `98%` reduction " ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of rows in day_of_week dataframe: 49000000\n" ] } ], "source": [ "## unique values of \"day of week\"\n", "day_of_week = [\"monday\",\"tuesday\",\"wednesday\",\"thursday\",\"friday\",\"saturday\",\"sunday\"]\n", "## number of times each day_of_week value repeats\n", "repeat_times = 7*np.power(10,6)\n", "## create the day-of-week dataframe\n", "df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})\n", "print(\"No. of rows in day_of_week dataframe: {}\".format(df_day_of_week.shape[0]))" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week    object\n", "dtypes: object(1)\n", "memory usage: 2.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_day_of_week.info(memory_usage='deep')" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "## Action: convert dtype from \"object\" to \"category\"\n", "converted_df_day_of_week = df_day_of_week.astype('category')" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week    category\n", "dtypes: category(1)\n", "memory usage: 46.7 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_day_of_week.info(memory_usage='deep')" ] }, {
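"cell_type": "markdown", "metadata": {}, "source": [ "A common rule of thumb [1]: `category` pays off only when the number of unique values is small relative to the number of rows, because each distinct value is stored once and every row keeps just a small integer code. A minimal sketch of that check (the `0.5` threshold is an assumption, not a hard rule, and `df_dow_cat` is an illustrative name):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## hedged heuristic: convert object -> category only when cardinality is low\n", "unique_ratio = df_day_of_week['day_of_week'].nunique() / len(df_day_of_week)\n", "print('Unique-to-row ratio: {:.8f}'.format(unique_ratio))\n", "if unique_ratio < 0.5:  ## assumed threshold; tune for your data\n", "    df_dow_cat = df_day_of_week['day_of_week'].astype('category')" ] }, {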
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
day_of_week
0monday
1monday
\n", "
" ], "text/plain": [ " day_of_week\n", "0 monday\n", "1 monday" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check first two rows of dataframe\n", "converted_df_day_of_week.head(2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "dtype: int8" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check how mapping of day_of_week is created in category dtype\n", "converted_df_day_of_week.head(2)['day_of_week'].cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 Convert to Sparse DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have dataframe having `large count of zero or missing values (66%)` usually happens in lot of `NLP task` like Count/TF-IDF encoding, Recommender Systems [2] \n", "**Task**: Reduce Memory Usage of dataframe \n", "**Action**: Change of DataFrame type to `SparseDataFrame` as Percentage of Non-Zero Non-NaN values is very less in number \n", "**Result**: Drop from `228.9 MB to 152.6 MB` in Memory usage i.e. `33%` reduction " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "## number of rows in dataframe\n", "nrows = np.power(10,7)\n", "## creation of dataframe\n", "df_dense =pd.DataFrame([[0,0.23,np.nan]]*nrows)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 int64\n", "1 float64\n", "2 float64\n", "dtypes: float64(2), int64(1)\n", "memory usage: 228.9 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_dense.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Percentage of Non-Zero Non-NaN values in dataframe 33.33 %\n" ] } ], "source": [ "## Percentage of Non-zero and Non-NaN values in dataframe\n", "non_zero_non_nan = np.count_nonzero((df_dense)) - df_dense.isnull().sum().sum()\n", "non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)\n", "print(\"Percentage of Non-Zero Non-NaN values in dataframe {} %\".format(non_zero_non_nan_percentage))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "## Action: Change of DataFrame type to SparseDataFrame\n", "df_sparse = df_dense.to_sparse()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 Sparse[int64, nan]\n", "1 Sparse[float64, nan]\n", "2 Sparse[float64, nan]\n", "dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)\n", "memory usage: 152.6 MB\n" ] } ], "source": [ "## check memory usage after action\n", "df_sparse.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
References\n", "\n", "1) [https://www.dataquest.io/blog/pandas-big-data/](https://www.dataquest.io/blog/pandas-big-data/) \n", "2) [https://machinelearningmastery.com/sparse-matrices-for-machine-learning/](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) \n", "3) [https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) \n", "4) [https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) \n", "5) [https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/) \n", "6) [https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) \n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }