{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Drop Memory Usage Tricks\n",
"\n",
"- Watch [Other Interesting Data Science Topics](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA/videos)\n",
"- Created By: **Aakash Goel**\n",
"- Connect on [LinkedIn](https://www.linkedin.com/in/aakash-goel-587a7385/)\n",
"- Subscribe on [YouTube](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA?sub_confirmation=1)\n",
"- Created on: 17-FEB-2020\n",
"- Last Updated on: 17-FEB-2020"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"1. Reduce DataFrame size \n",
" 1.1. Change in int datatype \n",
" 1.2. Change in float datatype \n",
" 1.3. Change from object to category datatype \n",
" 1.4. Convert to Sparse DataFrame \n",
"\n",
"2. References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](datatypeMemory.PNG \"segment\")\n",
"\n",
" [1]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import nbconvert\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Reduce DataFrame size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Change in int datatype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let's say you have an `Age` column with `minimum value 1 and maximum value 150` in a dataframe with `10 million` rows  \n",
"**Task**: Reduce the memory usage of the `Age` column given the above constraints  \n",
"**Action**: Change the dtype from `int32` to `uint8`  \n",
"**Result**: Memory usage drops from `38.1 MB to 9.5 MB`, i.e. a `75%` reduction"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"## Initialize minimum and maximum age values\n",
"min_age_value, max_age_value = 1, 150\n",
"## Number of rows in dataframe\n",
"nrows = int(np.power(10, 7))\n",
"## Create the Age dataframe (randint's high bound is exclusive, so add 1 to include max_age_value)\n",
"df_age = pd.DataFrame({'Age': np.random.randint(low=min_age_value, high=max_age_value + 1, size=nrows)})"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 1 columns):\n",
"Age int32\n",
"dtypes: int32(1)\n",
"memory usage: 38.1 MB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_age.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"iinfo(min=0, max=255, dtype=uint8)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Range of \"uint8\"; satisfies range constraint of Age column \n",
"np.iinfo('uint8')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"## Action: conversion of dtype from \"int32\" to \"uint8\"\n",
"converted_df_age = df_age.astype(np.uint8)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 1 columns):\n",
"Age uint8\n",
"dtypes: uint8(1)\n",
"memory usage: 9.5 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"converted_df_age.info(memory_usage='deep')"
]
},
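{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative sketch (not part of the original workflow): `pd.to_numeric` with the `downcast` argument lets pandas pick the smallest safe dtype automatically, so the manual `np.iinfo` range check can be skipped:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Alternative sketch: let pandas choose the smallest unsigned integer dtype per column\n",
"auto_converted_age = df_age.apply(pd.to_numeric, downcast='unsigned')\n",
"auto_converted_age.dtypes"
]
},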
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Change in float datatype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let's say you have `50,000 search queries` and `5,000 documents` and have computed the `cosine similarity` of each query with every document, i.e. a `50,000 X 5,000` matrix. All similarity values lie between `0 and 1` and need at least `2 decimal digits of precision`  \n",
"**Task**: Reduce the memory usage of the cosine similarity dataframe given the above constraints  \n",
"**Action**: Change the dtype from `float64` to `float16`  \n",
"**Result**: Memory usage drops from `1.9 GB to 476.8 MB (0.46 GB)`, i.e. a `75%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No. of search queries: 50000 and No. of documents: 5000\n"
]
}
],
"source": [
"## no. of documents\n",
"ncols = int(5*np.power(10,3))\n",
"## no. of search queries\n",
"nrows = int(5*np.power(10,4))\n",
"## creation of cosine similarity dataframe\n",
"df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))\n",
"print(\"No. of search queries: {} and No. of documents: {}\".format(df_query_doc.shape[0],df_query_doc.shape[1]))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 50000 entries, 0 to 49999\n",
"Columns: 5000 entries, 0 to 4999\n",
"dtypes: float64(5000)\n",
"memory usage: 1.9 GB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_query_doc.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"## Action: conversion of dtype from \"float64\" to \"float16\"\n",
"converted_df_query_doc = df_query_doc.astype('float16')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 50000 entries, 0 to 49999\n",
"Columns: 5000 entries, 0 to 4999\n",
"dtypes: float16(5000)\n",
"memory usage: 476.8 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"converted_df_query_doc.info(memory_usage='deep')"
]
},
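{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before converting, it is worth confirming that `float16` actually meets the precision constraint; `np.finfo` reports roughly 3 decimal digits of precision for `float16`, which covers the 2-decimal requirement:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Range and precision of \"float16\"; ~3 decimal digits satisfies the 2-decimal constraint\n",
"np.finfo('float16')"
]
},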
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Change from object to category datatype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let's say you have a `Day of Week` column with `7 unique` values in a dataframe with `49 million` rows  \n",
"**Task**: Reduce the memory usage of the `Day of Week` column given that only 7 unique values exist  \n",
"**Action**: Change the dtype from `object` to `category`, since the ratio of unique values to rows is close to zero  \n",
"**Result**: Memory usage drops from `2.9 GB to 46.7 MB (0.045 GB)`, i.e. a `98%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No of rows in days of week dataframe 49000000\n"
]
}
],
"source": [
"## unique values of \"days of week\"\n",
"day_of_week = [\"monday\",\"tuesday\",\"wednesday\",\"thursday\",\"friday\",\"saturday\",\"sunday\"]\n",
"## Number of times day_of_week repeats\n",
"repeat_times = 7*np.power(10,6)\n",
"## creation of days of week dataframe\n",
"df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})\n",
"print(\"No of rows in days of week dataframe {}\".format(df_day_of_week.shape[0]))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 49000000 entries, 0 to 48999999\n",
"Data columns (total 1 columns):\n",
"day_of_week object\n",
"dtypes: object(1)\n",
"memory usage: 2.9 GB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_day_of_week.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"## Action: conversion of dtype from \"object\" to \"category\"\n",
"converted_df_day_of_week = df_day_of_week.astype('category')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 49000000 entries, 0 to 48999999\n",
"Data columns (total 1 columns):\n",
"day_of_week category\n",
"dtypes: category(1)\n",
"memory usage: 46.7 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"converted_df_day_of_week.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" day_of_week\n",
"0 monday\n",
"1 monday"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## check first two rows of dataframe\n",
"converted_df_day_of_week.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 1\n",
"dtype: int8"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## check how mapping of day_of_week is created in category dtype\n",
"converted_df_day_of_week.head(2)['day_of_week'].cat.codes"
]
},
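{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick heuristic before converting (a sketch, not part of the original notebook): `category` pays off only when the number of unique values is small relative to the number of rows; with high cardinality it can even increase memory usage:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Heuristic sketch: convert to category only when this ratio is close to zero\n",
"unique_ratio = df_day_of_week['day_of_week'].nunique() / len(df_day_of_week)\n",
"print(\"Unique-to-rows ratio: {:.8f}\".format(unique_ratio))"
]
},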
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4 Convert to Sparse DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let's say you have a dataframe in which a `large share of values (66%) are zero or missing`, as often happens in `NLP` tasks such as Count/TF-IDF encoding and in Recommender Systems [2]  \n",
"**Task**: Reduce the memory usage of the dataframe  \n",
"**Action**: Convert the columns to `sparse` dtypes, since the percentage of non-zero, non-NaN values is small  \n",
"**Result**: Memory usage drops from `228.9 MB to 152.6 MB`, i.e. a `33%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"## Number of rows in dataframe\n",
"nrows = int(np.power(10, 7))\n",
"## Create the dense dataframe\n",
"df_dense = pd.DataFrame([[0, 0.23, np.nan]]*nrows)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 3 columns):\n",
"0 int64\n",
"1 float64\n",
"2 float64\n",
"dtypes: float64(2), int64(1)\n",
"memory usage: 228.9 MB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_dense.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Percentage of Non-Zero Non-NaN values in dataframe 33.33 %\n"
]
}
],
"source": [
"## Percentage of non-zero, non-NaN values in dataframe\n",
"## np.count_nonzero treats NaN as non-zero, so subtract the NaN count\n",
"non_zero_non_nan = np.count_nonzero(df_dense) - df_dense.isnull().sum().sum()\n",
"non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100, 2)\n",
"print(\"Percentage of Non-Zero Non-NaN values in dataframe {} %\".format(non_zero_non_nan_percentage))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"## Action: convert columns to sparse dtypes; DataFrame.to_sparse() was removed in pandas 1.0\n",
"df_sparse = df_dense.astype(pd.SparseDtype(\"float64\", np.nan))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 3 columns):\n",
"0 Sparse[int64, nan]\n",
"1 Sparse[float64, nan]\n",
"2 Sparse[float64, nan]\n",
"dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)\n",
"memory usage: 152.6 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"df_sparse.info(memory_usage='deep')"
]
},
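{
"cell_type": "markdown",
"metadata": {},
"source": [
"For heavy numeric work on mostly-zero data, a `scipy` sparse matrix can go further than sparse pandas dtypes, since it stores only the explicitly non-zero entries [5]. A minimal sketch, assuming `scipy` is installed and treating NaN as zero:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Sketch: store only the non-zero entries in a scipy CSR matrix (NaN treated as zero)\n",
"from scipy import sparse\n",
"csr = sparse.csr_matrix(df_dense.fillna(0).values)\n",
"print(\"Explicitly stored values: {}\".format(csr.nnz))"
]
},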
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. References\n",
"\n",
"1) [https://www.dataquest.io/blog/pandas-big-data/](https://www.dataquest.io/blog/pandas-big-data/) \n",
"2) [https://machinelearningmastery.com/sparse-matrices-for-machine-learning/](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) \n",
"3) [https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) \n",
"4) [https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) \n",
"5) [https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/) \n",
"6) [https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}