{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Drop Memory Usage Tricks\n", "\n", "- Watch [Other Interesting Data Science Topics](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA/videos)\n", "- Created By: **Aakash Goel**\n", "- Connect on [LinkedIn](https://www.linkedin.com/in/aakash-goel-587a7385/)\n", "- Subscribe on [YouTube](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA?sub_confirmation=1)\n", "- Created on: 17-FEB-2020\n", "- Last Updated on: 17-FEB-2020" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of contents\n", "\n", "1. Reduce DataFrame size \n", " 1.1. Change in int datatype \n", " 1.2. Change in float datatype \n", " 1.3. Change from object to category datatype \n", " 1.4. Convert to Sparse DataFrame \n", "\n", "2. References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![SegmentLocal](datatypeMemory.PNG \"segment\")\n", "\n", " [1]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import nbconvert\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reduce DataFrame size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Change in int datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `Age` column having `minimum value 1 and maximum value 150`, with `10 million` total rows in dataframe \n", "**Task**: Reduce Memory Usage of `Age` column given above constraints \n", "**Action**: Change of original dtype from `int32` to `uint8` \n", "**Result**: Drop from `38.1 MB to 9.5 MB` in Memory usage i.e. `75%` reduction" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## Initializing minimum and maximum value of age\n", "min_age_value , max_age_value = 1,150\n", "## Number of rows in dataframe\n", "nrows = int(np.power(10,7))\n", "## creation of Age dataframe\n", "df_age = pd.DataFrame({'Age':np.random.randint(low=1,high=100,size=nrows)})" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age int32\n", "dtypes: int32(1)\n", "memory usage: 38.1 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_age.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iinfo(min=0, max=255, dtype=uint8)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Range of \"uint8\"; satisfies range constraint of Age column \n", "np.iinfo('uint8')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"int32\" to \"uint8\"\n", "converted_df_age = df_age.astype(np.uint8)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 1 columns):\n", "Age uint8\n", "dtypes: uint8(1)\n", "memory usage: 9.5 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_age.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Change in float datatype" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `50,000 search queries` and `5,000 documents` and computed `cosine similarity` for each search query with all documents i.e. `dimension 50,000 X 5,000`. All similarity values are between `0 and 1` and should have atleast `2 decimal precision` \n", "**Task**: Reduce Memory Usage of cosine smilarity dataframe given above constraints \n", "**Action**: Change of original dtype from `float64` to `float16` \n", "**Result**: Drop from `1.9 GB to 476.8 MB or 0.46 GB` in Memory usage i.e. `75%` reduction " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of search queries: 50000 and No. of documents: 5000\n" ] } ], "source": [ "## no. of documents\n", "ncols = int(5*np.power(10,3))\n", "## no. of search queries\n", "nrows = int(5*np.power(10,4))\n", "## creation of cosine similarity dataframe\n", "df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))\n", "print(\"No. of search queries: {} and No. of documents: {}\".format(df_query_doc.shape[0],df_query_doc.shape[1]))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float64(5000)\n", "memory usage: 1.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "## Action: conversion of dtype from \"float64\" to \"float16\"\n", "converted_df_query_doc = df_query_doc.astype('float16')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Columns: 5000 entries, 0 to 4999\n", "dtypes: float16(5000)\n", "memory usage: 476.8 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_query_doc.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Change from object to category datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have `Day of Week` column having `7 unique` values, with `4.9 million` total rows in dataframe \n", "**Task**: Reduce Memory Usage of `Day of Week` column given only 7 unique value exist \n", "**Action**: Change of dtype from `object` to `category` as ratio of unique values to no. of rows is almost zero \n", "**Result**: Drop from `2.9 GB to 46.7 MB or 0.045 GB` in Memory usage i.e. 
"cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Change from object to category datatype" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Suppose you have a `Day of Week` column with only `7 unique` values, in a dataframe of `49 million` rows \n", "**Task**: Reduce the memory usage of the `Day of Week` column, given that only 7 unique values exist \n", "**Action**: Change the dtype from `object` to `category`, since the ratio of unique values to rows is almost zero \n", "**Result**: Memory usage drops from `2.9 GB to 46.7 MB (0.045 GB)`, i.e. a `98%` reduction " ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of rows in day_of_week dataframe: 49000000\n" ] } ], "source": [ "## unique values of \"day of week\"\n", "day_of_week = [\"monday\",\"tuesday\",\"wednesday\",\"thursday\",\"friday\",\"saturday\",\"sunday\"]\n", "## number of times each day_of_week value repeats\n", "repeat_times = 7*np.power(10,6)\n", "## create the day-of-week dataframe\n", "df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})\n", "print(\"No. of rows in day_of_week dataframe: {}\".format(df_day_of_week.shape[0]))" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week    object\n", "dtypes: object(1)\n", "memory usage: 2.9 GB\n" ] } ], "source": [ "## check memory usage before action\n", "df_day_of_week.info(memory_usage='deep')" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "## Action: convert dtype from \"object\" to \"category\"\n", "converted_df_day_of_week = df_day_of_week.astype('category')" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 49000000 entries, 0 to 48999999\n", "Data columns (total 1 columns):\n", "day_of_week    category\n", "dtypes: category(1)\n", "memory usage: 46.7 MB\n" ] } ], "source": [ "## check memory usage after action\n", "converted_df_day_of_week.info(memory_usage='deep')" ] }, {
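"cell_type": "markdown", "metadata": {}, "source": [ "A common rule of thumb [1]: `category` pays off only when the number of unique values is small relative to the number of rows, because each distinct value is stored once and every row keeps just a small integer code. A minimal sketch of that check (the `0.5` threshold is an assumption, not a hard rule, and `df_dow_cat` is an illustrative name):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## hedged heuristic: convert object -> category only when cardinality is low\n", "unique_ratio = df_day_of_week['day_of_week'].nunique() / len(df_day_of_week)\n", "print('Unique-to-row ratio: {:.8f}'.format(unique_ratio))\n", "if unique_ratio < 0.5:  ## assumed threshold; tune for your data\n", "    df_dow_cat = df_day_of_week['day_of_week'].astype('category')" ] }, {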
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
day_of_week
0monday
1monday
\n", "
" ], "text/plain": [ " day_of_week\n", "0 monday\n", "1 monday" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check first two rows of dataframe\n", "converted_df_day_of_week.head(2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "dtype: int8" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## check how mapping of day_of_week is created in category dtype\n", "converted_df_day_of_week.head(2)['day_of_week'].cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 Convert to Sparse DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Situation**: Let say, you have dataframe having `large count of zero or missing values (66%)` usually happens in lot of `NLP task` like Count/TF-IDF encoding, Recommender Systems [2] \n", "**Task**: Reduce Memory Usage of dataframe \n", "**Action**: Change of DataFrame type to `SparseDataFrame` as Percentage of Non-Zero Non-NaN values is very less in number \n", "**Result**: Drop from `228.9 MB to 152.6 MB` in Memory usage i.e. `33%` reduction " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "## number of rows in dataframe\n", "nrows = np.power(10,7)\n", "## creation of dataframe\n", "df_dense =pd.DataFrame([[0,0.23,np.nan]]*nrows)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 int64\n", "1 float64\n", "2 float64\n", "dtypes: float64(2), int64(1)\n", "memory usage: 228.9 MB\n" ] } ], "source": [ "## check memory usage before action\n", "df_dense.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Percentage of Non-Zero Non-NaN values in dataframe 33.33 %\n" ] } ], "source": [ "## Percentage of Non-zero and Non-NaN values in dataframe\n", "non_zero_non_nan = np.count_nonzero((df_dense)) - df_dense.isnull().sum().sum()\n", "non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)\n", "print(\"Percentage of Non-Zero Non-NaN values in dataframe {} %\".format(non_zero_non_nan_percentage))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "## Action: Change of DataFrame type to SparseDataFrame\n", "df_sparse = df_dense.to_sparse()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 3 columns):\n", "0 Sparse[int64, nan]\n", "1 Sparse[float64, nan]\n", "2 Sparse[float64, nan]\n", "dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)\n", "memory usage: 152.6 MB\n" ] } ], "source": [ "## check memory usage after action\n", "df_sparse.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
References\n", "\n", "1) [https://www.dataquest.io/blog/pandas-big-data/](https://www.dataquest.io/blog/pandas-big-data/) \n", "2) [https://machinelearningmastery.com/sparse-matrices-for-machine-learning/](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) \n", "3) [https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) \n", "4) [https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) \n", "5) [https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/) \n", "6) [https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) \n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }