{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas - Optimize Memory and Speed Operation\n",
"\n",
"- Watch [Other Interesting Data Science Topics](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA/videos)\n",
"- Created By: **Aakash Goel**\n",
"- Connect on [LinkedIn](https://www.linkedin.com/in/aakash-goel-587a7385/)\n",
"- Subscribe on [YouTube](https://www.youtube.com/channel/UC4yh4xPxRP0-bLG_ldnLCHA?sub_confirmation=1)\n",
"- Created on: 17-FEB-2020\n",
"- Last Updated on: 06-JUL-2020"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"1. See Hardware and System Information \n",
" 1.1. System, CPU, Memory, Swap, Disk, GPU Information \n",
"2. Reduce DataFrame size \n",
" 2.1. Change in int datatype \n",
" 2.2. Change in float datatype \n",
" 2.3. Change from object to category datatype \n",
" 2.4. Convert to Sparse DataFrame \n",
"\n",
"3. Reduce DataFrame Loading Time \n",
" 3.1. Don't Load all columns \n",
" 3.2. Read Large CSV File (Avoid Memory error and fast reading) \n",
"\n",
"4. Fast Pandas Operation \n",
" 4.1. Loop and Vectorization \n",
" 4.2. Indexing \n",
" 4.3. Parallelization \n",
" \n",
"5. References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](datatypeMemory.PNG \"segment\")\n",
"\n",
" [1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](memory_nos.png \"segment\")\n",
" [8]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. See Hardware and System Information"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is very important to know about your machine on which you are coding.\n",
"Hardware and System Information helps you to take proper decision like how much `cores` you should allocate to your program.\n",
"Code for same is taken from here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 System, CPU, Memory, Swap, Disk, GPU Information"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"======================================== Python Version ========================================\n",
"3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]\n",
"sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)\n",
"======================================== System Information ========================================\n",
"System: Windows\n",
"Release: 10\n",
"Version: 10.0.18362\n",
"Machine: AMD64\n",
"Processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel\n",
"======================================== CPU Info ========================================\n",
"Physical cores: 2\n",
"Total cores: 4\n",
"Max Frequency: 2000.00Mhz\n",
"Min Frequency: 0.00Mhz\n",
"Current Frequency: 2000.00Mhz\n",
"CPU Usage Per Core:\n",
"Core 0: 56.9%\n",
"Core 1: 25.0%\n",
"Core 2: 43.8%\n",
"Core 3: 42.2%\n",
"Total CPU Usage: 45.2%\n",
"======================================== Memory Information ========================================\n",
"Total: 7.91GB\n",
"Available: 3.85GB\n",
"Used: 4.06GB\n",
"Percentage: 51.3%\n",
"======================================== SWAP ========================================\n",
"Total: 19.26GB\n",
"Free: 13.41GB\n",
"Used: 5.85GB\n",
"Percentage: 30.4%\n",
"======================================== Disk Information ========================================\n",
"Partitions and Usage:\n",
"=== Device: C:\\ ===\n",
" Mountpoint: C:\\\n",
" File system type: NTFS\n",
" Total Size: 465.13GB\n",
" Used: 65.29GB\n",
" Free: 399.84GB\n",
" Percentage: 14.0%\n",
"Total read: 540.46GB\n",
"Total write: 62.73GB\n",
"======================================== GPU Details ========================================\n",
"id name load free memory used memory total memory temperature uuid\n",
"---- ------ ------ ------------- ------------- -------------- ------------- ------\n"
]
}
],
"source": [
"# ! pip install tabulate\n",
"# ! pip install gputil\n",
"import psutil\n",
"import platform\n",
"import sys\n",
"from datetime import datetime\n",
"import GPUtil\n",
"from tabulate import tabulate\n",
"\n",
"def get_size(bytes, suffix=\"B\"):\n",
" \"\"\"\n",
" Scale bytes to its proper format\n",
" e.g:\n",
" 1253656 => '1.20MB'\n",
" 1253656678 => '1.17GB'\n",
" \"\"\"\n",
" factor = 1024\n",
" for unit in [\"\", \"K\", \"M\", \"G\", \"T\", \"P\"]:\n",
" if bytes < factor:\n",
" return f\"{bytes:.2f}{unit}{suffix}\"\n",
" bytes /= factor\n",
"\n",
"#### CODE -- https://www.thepythoncode.com/article/get-hardware-system-information-python ####\n",
"\n",
"print(\"=\"*40, \"Python Version\", \"=\"*40)\n",
"print (sys.version)\n",
"print (sys.version_info)\n",
"\n",
"print(\"=\"*40, \"System Information\", \"=\"*40)\n",
"uname = platform.uname()\n",
"print(f\"System: {uname.system}\")\n",
"print(f\"Release: {uname.release}\")\n",
"print(f\"Version: {uname.version}\")\n",
"print(f\"Machine: {uname.machine}\")\n",
"print(f\"Processor: {uname.processor}\")\n",
"\n",
"# let's print CPU information\n",
"print(\"=\"*40, \"CPU Info\", \"=\"*40)\n",
"# number of cores\n",
"print(\"Physical cores:\", psutil.cpu_count(logical=False))\n",
"print(\"Total cores:\", psutil.cpu_count(logical=True))\n",
"# CPU frequencies\n",
"cpufreq = psutil.cpu_freq()\n",
"print(f\"Max Frequency: {cpufreq.max:.2f}Mhz\")\n",
"print(f\"Min Frequency: {cpufreq.min:.2f}Mhz\")\n",
"print(f\"Current Frequency: {cpufreq.current:.2f}Mhz\")\n",
"# CPU usage\n",
"print(\"CPU Usage Per Core:\")\n",
"for i, percentage in enumerate(psutil.cpu_percent(percpu=True, interval=1)):\n",
" print(f\"Core {i}: {percentage}%\")\n",
"print(f\"Total CPU Usage: {psutil.cpu_percent()}%\")\n",
"\n",
"# Memory Information\n",
"print(\"=\"*40, \"Memory Information\", \"=\"*40)\n",
"# get the memory details\n",
"svmem = psutil.virtual_memory()\n",
"print(f\"Total: {get_size(svmem.total)}\")\n",
"print(f\"Available: {get_size(svmem.available)}\")\n",
"print(f\"Used: {get_size(svmem.used)}\")\n",
"print(f\"Percentage: {svmem.percent}%\")\n",
"print(\"=\"*40, \"SWAP\", \"=\"*40)\n",
"# get the swap memory details (if exists)\n",
"swap = psutil.swap_memory()\n",
"print(f\"Total: {get_size(swap.total)}\")\n",
"print(f\"Free: {get_size(swap.free)}\")\n",
"print(f\"Used: {get_size(swap.used)}\")\n",
"print(f\"Percentage: {swap.percent}%\")\n",
"\n",
"# Disk Information\n",
"print(\"=\"*40, \"Disk Information\", \"=\"*40)\n",
"print(\"Partitions and Usage:\")\n",
"# get all disk partitions\n",
"partitions = psutil.disk_partitions()\n",
"for partition in partitions:\n",
" print(f\"=== Device: {partition.device} ===\")\n",
" print(f\" Mountpoint: {partition.mountpoint}\")\n",
" print(f\" File system type: {partition.fstype}\")\n",
" try:\n",
" partition_usage = psutil.disk_usage(partition.mountpoint)\n",
" except PermissionError:\n",
" # this can be catched due to the disk that\n",
" # isn't ready\n",
" continue\n",
" print(f\" Total Size: {get_size(partition_usage.total)}\")\n",
" print(f\" Used: {get_size(partition_usage.used)}\")\n",
" print(f\" Free: {get_size(partition_usage.free)}\")\n",
" print(f\" Percentage: {partition_usage.percent}%\")\n",
"# get IO statistics since boot\n",
"disk_io = psutil.disk_io_counters()\n",
"print(f\"Total read: {get_size(disk_io.read_bytes)}\")\n",
"print(f\"Total write: {get_size(disk_io.write_bytes)}\")\n",
"\n",
"# GPU information\n",
"print(\"=\"*40, \"GPU Details\", \"=\"*40)\n",
"gpus = GPUtil.getGPUs()\n",
"list_gpus = []\n",
"for gpu in gpus:\n",
" # get the GPU id\n",
" gpu_id = gpu.id\n",
" # name of GPU\n",
" gpu_name = gpu.name\n",
" # get % percentage of GPU usage of that GPU\n",
" gpu_load = f\"{gpu.load*100}%\"\n",
" # get free memory in MB format\n",
" gpu_free_memory = f\"{gpu.memoryFree}MB\"\n",
" # get used memory\n",
" gpu_used_memory = f\"{gpu.memoryUsed}MB\"\n",
" # get total memory\n",
" gpu_total_memory = f\"{gpu.memoryTotal}MB\"\n",
" # get GPU temperature in Celsius\n",
" gpu_temperature = f\"{gpu.temperature} °C\"\n",
" gpu_uuid = gpu.uuid\n",
" list_gpus.append((\n",
" gpu_id, gpu_name, gpu_load, gpu_free_memory, gpu_used_memory,\n",
" gpu_total_memory, gpu_temperature, gpu_uuid\n",
" ))\n",
"\n",
"print(tabulate(list_gpus, headers=(\"id\", \"name\", \"load\", \"free memory\", \"used memory\", \"total memory\",\n",
" \"temperature\", \"uuid\")))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import nbconvert\n",
"from time import time\n",
"import gc\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Reduce DataFrame size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Change in int datatype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you have `Age` column having `minimum value 1 and maximum value 150`, with `10 million` total rows in dataframe \n",
"**Task**: Reduce Memory Usage of `Age` column given above constraints \n",
"**Action**: Change of original dtype from `int32` to `uint8` \n",
"**Result**: Drop from `38.1 MB to 9.5 MB` in Memory usage i.e. `75%` reduction"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"## Initializing minimum and maximum value of age\n",
"min_age_value , max_age_value = 1,150\n",
"## Number of rows in dataframe\n",
"nrows = int(np.power(10,7))\n",
"## creation of Age dataframe\n",
"df_age = pd.DataFrame({'Age':np.random.randint(low=1,high=100,size=nrows)})"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 1 columns):\n",
"Age int32\n",
"dtypes: int32(1)\n",
"memory usage: 38.1 MB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_age.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"iinfo(min=0, max=255, dtype=uint8)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Range of \"uint8\"; satisfies range constraint of Age column \n",
"np.iinfo('uint8')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"## Action: conversion of dtype from \"int32\" to \"uint8\"\n",
"converted_df_age = df_age.astype(np.uint8)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 1 columns):\n",
"Age uint8\n",
"dtypes: uint8(1)\n",
"memory usage: 9.5 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"converted_df_age.info(memory_usage='deep')"
]
},
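  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative to hand-picking `uint8`, `pd.to_numeric` with its `downcast` parameter selects the smallest sufficient dtype automatically. A minimal sketch, reusing `df_age` from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Let pandas pick the smallest sufficient dtype automatically;\n",
    "## with values in [1, 150], downcast='unsigned' yields uint8\n",
    "auto_converted_df_age = df_age.copy()\n",
    "auto_converted_df_age['Age'] = pd.to_numeric(auto_converted_df_age['Age'], downcast='unsigned')\n",
    "auto_converted_df_age.info(memory_usage='deep')"
   ]
  },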
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Change in float datatype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you have `50,000 search queries` and `5,000 documents` and computed `cosine similarity` for each search query with all documents i.e. `dimension 50,000 X 5,000`. All similarity values are between `0 and 1` and should have atleast `2 decimal precision` \n",
"**Task**: Reduce Memory Usage of cosine smilarity dataframe given above constraints \n",
"**Action**: Change of original dtype from `float64` to `float16` \n",
"**Result**: Drop from `1.9 GB to 476.8 MB or 0.46 GB` in Memory usage i.e. `75%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No. of search queries: 50000 and No. of documents: 5000\n"
]
}
],
"source": [
"## no. of documents\n",
"ncols = int(5*np.power(10,3))\n",
"## no. of search queries\n",
"nrows = int(5*np.power(10,4))\n",
"## creation of cosine similarity dataframe\n",
"df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))\n",
"print(\"No. of search queries: {} and No. of documents: {}\".format(df_query_doc.shape[0],df_query_doc.shape[1]))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 50000 entries, 0 to 49999\n",
"Columns: 5000 entries, 0 to 4999\n",
"dtypes: float64(5000)\n",
"memory usage: 1.9 GB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_query_doc.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"## Action: conversion of dtype from \"float64\" to \"float16\"\n",
"converted_df_query_doc = df_query_doc.astype('float16')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 50000 entries, 0 to 49999\n",
"Columns: 5000 entries, 0 to 4999\n",
"dtypes: float16(5000)\n",
"memory usage: 476.8 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"converted_df_query_doc.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Change from object to category datatype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you have `Day of Week` column having `7 unique` values, with `4.9 million` total rows in dataframe \n",
"**Task**: Reduce Memory Usage of `Day of Week` column given only 7 unique value exist \n",
"**Action**: Change of dtype from `object` to `category` as ratio of unique values to no. of rows is almost zero \n",
"**Result**: Drop from `2.9 GB to 46.7 MB or 0.045 GB` in Memory usage i.e. `98%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No of rows in days of week dataframe 49000000\n"
]
}
],
"source": [
"## unique values of \"days of week\"\n",
"day_of_week = [\"monday\",\"tuesday\",\"wednesday\",\"thursday\",\"friday\",\"saturday\",\"sunday\"]\n",
"## Number of times day_of_week repeats\n",
"repeat_times = 7*np.power(10,6)\n",
"## creation of days of week dataframe\n",
"df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})\n",
"print(\"No of rows in days of week dataframe {}\".format(df_day_of_week.shape[0]))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 49000000 entries, 0 to 48999999\n",
"Data columns (total 1 columns):\n",
"day_of_week object\n",
"dtypes: object(1)\n",
"memory usage: 2.9 GB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_day_of_week.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"## Action: conversion of dtype from \"object\" to \"category\"\n",
"converted_df_day_of_week = df_day_of_week.astype('category')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 49000000 entries, 0 to 48999999\n",
"Data columns (total 1 columns):\n",
"day_of_week category\n",
"dtypes: category(1)\n",
"memory usage: 46.7 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"converted_df_day_of_week.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" day_of_week | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" monday | \n",
"
\n",
" \n",
" 1 | \n",
" monday | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" day_of_week\n",
"0 monday\n",
"1 monday"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## check first two rows of dataframe\n",
"converted_df_day_of_week.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 1\n",
"dtype: int8"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## check how mapping of day_of_week is created in category dtype\n",
"converted_df_day_of_week.head(2)['day_of_week'].cat.codes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Convert to Sparse DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you have dataframe having `large count of zero or missing values (66%)` usually happens in lot of `NLP task` like Count/TF-IDF encoding, Recommender Systems [2] \n",
"**Task**: Reduce Memory Usage of dataframe \n",
"**Action**: Change of DataFrame type to `SparseDataFrame` as Percentage of Non-Zero Non-NaN values is very less in number \n",
"**Result**: Drop from `228.9 MB to 152.6 MB` in Memory usage i.e. `33%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"## number of rows in dataframe\n",
"nrows = np.power(10,7)\n",
"## creation of dataframe\n",
"df_dense =pd.DataFrame([[0,0.23,np.nan]]*nrows)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 3 columns):\n",
"0 int64\n",
"1 float64\n",
"2 float64\n",
"dtypes: float64(2), int64(1)\n",
"memory usage: 228.9 MB\n"
]
}
],
"source": [
"## check memory usage before action\n",
"df_dense.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Percentage of Non-Zero Non-NaN values in dataframe 33.33 %\n"
]
}
],
"source": [
"## Percentage of Non-zero and Non-NaN values in dataframe\n",
"non_zero_non_nan = np.count_nonzero((df_dense)) - df_dense.isnull().sum().sum()\n",
"non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)\n",
"print(\"Percentage of Non-Zero Non-NaN values in dataframe {} %\".format(non_zero_non_nan_percentage))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"## Action: Change of DataFrame type to SparseDataFrame\n",
"df_sparse = df_dense.to_sparse()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 10000000 entries, 0 to 9999999\n",
"Data columns (total 3 columns):\n",
"0 Sparse[int64, nan]\n",
"1 Sparse[float64, nan]\n",
"2 Sparse[float64, nan]\n",
"dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)\n",
"memory usage: 152.6 MB\n"
]
}
],
"source": [
"## check memory usage after action\n",
"df_sparse.info(memory_usage='deep')"
]
},
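  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: `to_sparse()` no longer exists on newer pandas. A minimal sketch of the equivalent conversion, assuming pandas >= 1.0, kept commented out since this notebook runs an older version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Modern equivalent on pandas >= 1.0: cast columns to a SparseDtype.\n",
    "## With fill_value=np.nan, only non-NaN entries are stored\n",
    "## (note: this casts the int64 column to float64 as well)\n",
    "# df_sparse = df_dense.astype(pd.SparseDtype(\"float64\", np.nan))\n",
    "# df_sparse.info(memory_usage='deep')"
   ]
  },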
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Reduce DataFrame Loading Time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Don't Load all columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you have csv file which have `100 columns` and you only need `10` columns for processing \n",
"**Task**: Reduce `pd.read_csv()` Loading time \n",
"**Action**: Only Load required columns explicitly i.e. only 10 columns not 100 columns which is default \n",
"**Result**: Drop from `2.8 second to 0.8 second` in Loading time i.e. `71%` reduction "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Default Loading time 2.868640422821045 seconds\n"
]
}
],
"source": [
"t0 = time()\n",
"ufo = pd.read_csv('http://bit.ly/uforeports')\n",
"t1 = time()\n",
"print(\"Default Loading time {} seconds\".format(t1-t0))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Default Loading time 0.8735167980194092 seconds\n"
]
}
],
"source": [
"t0 = time()\n",
"ufo = pd.read_csv('http://bit.ly/uforeports',usecols=['City','State'])\n",
"t1 = time()\n",
"print(\"Default Loading time {} seconds\".format(t1-t0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Read Large CSV File"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you have large csv file which is taking too much time in reading or giving memory error while reading in one go \n",
"**Task**: Avoid Memory Error or reduce file loading time \n",
"**Action 1**: Instead of loading file in one go, load file in chunks i.e. use `chunksize` parameter in read_csv. Useful when you don't need to load all data into memory at once \n",
"**Action 2**: See if you can do memory saving by doing following operations:\n",
"- Explicit pass appropriate `dtypes`for each column while loading file and reduce the volume of the data. Example - Sometimes `64-bit` precision is overkill and can use `32-bit`, 2x memory savings are worth it.\n",
"- `Usecols` parameter in `read_csv` i.e. only load required columns\n",
"- `Data compression` - Here, I don't mean by zip or gzip file. Say your status column has two values: \"AVAILABLE\" and \"UNAVAILABLE\". Instead of storing them as string (~ 10 bytes per entry), store as boolean: True or False (1 byte per entry)\n",
"\n",
"**Action 3**: For faster data loading, serialize data (pickle) on the first run and load the serialized data instead of the csv files in subsequent runs.[15]\n",
"\n",
"**Action 4**: Use of other Library instead of pandas but with similar functionality:\n",
"- Dask - Dask provides multi-core execution on larger-than-memory datasets.\n",
"\n",
"\n",
"- Modin - Modin transparently distributes the data and computation. Modin provides speed-ups of up to 4x on a laptop with 4 physical cores. Pandas able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. \n"
]
},
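  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of Actions 1-3, assuming a hypothetical `large_file.csv` with a numeric `value` column and a two-valued `status` column; kept commented out since the file does not exist here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ## Action 1: read in chunks of 1 million rows instead of one go\n",
    "# ## Action 2: load only the needed columns, with an explicit smaller dtype\n",
    "# chunks = pd.read_csv('large_file.csv',\n",
    "#                      usecols=['value', 'status'],\n",
    "#                      dtype={'value': 'float32'},\n",
    "#                      chunksize=10**6)\n",
    "# processed = []\n",
    "# for chunk in chunks:\n",
    "#     ## Action 2: compress a two-valued string column to boolean (1 byte per entry)\n",
    "#     chunk['status'] = (chunk['status'] == 'AVAILABLE')\n",
    "#     processed.append(chunk)\n",
    "# df = pd.concat(processed, ignore_index=True)\n",
    "# ## Action 3: serialize once, then load the pickle in later runs (much faster than csv)\n",
    "# df.to_pickle('large_file.pkl')\n",
    "# df = pd.read_pickle('large_file.pkl')"
   ]
  },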
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](read_csv_benchmark.png \"segment\")\n",
" [9]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](lib_comparison.png \"segment\")\n",
" [11]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# # pip install modin[ray]\n",
"# import ray\n",
"# ray.init(num_cpus=4)\n",
"# import modin.pandas as pd\n",
"# export MODIN_OUT_OF_CORE=true"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](modin_vs_pandas.png \"segment\")\n",
" [13]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Fast Pandas Operation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 Loop and Vectorization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](pandas_operation.png \"segment\")\n",
" [16]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![SegmentLocal](pandas_optimization.png \"segment\")\n",
" [17]"
]
},
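  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of the idea behind the charts above: the same column addition computed three ways, from a Python-level loop down to a vectorized operation. Exact timings vary by machine; the ordering (loop slowest, vectorized fastest) is the point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Same task three ways, slowest to fastest: add two columns element-wise\n",
    "df_vec = pd.DataFrame({'a': np.random.rand(10**5), 'b': np.random.rand(10**5)})\n",
    "\n",
    "## 1) Python loop over rows (iterrows) -- slowest\n",
    "t0 = time()\n",
    "result_loop = [row['a'] + row['b'] for _, row in df_vec.iterrows()]\n",
    "print(\"Loop (iterrows): {:.3f} seconds\".format(time() - t0))\n",
    "\n",
    "## 2) Row-wise apply with a Python function -- faster, still Python-level\n",
    "t0 = time()\n",
    "result_apply = df_vec.apply(lambda row: row['a'] + row['b'], axis=1)\n",
    "print(\"apply: {:.3f} seconds\".format(time() - t0))\n",
    "\n",
    "## 3) Vectorized column arithmetic -- fastest, runs in compiled code\n",
    "t0 = time()\n",
    "result_vec = df_vec['a'] + df_vec['b']\n",
    "print(\"Vectorized: {:.3f} seconds\".format(time() - t0))"
   ]
  },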
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 Indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Situation**: Let say, you want to join two dataframes based on some common column \n",
"**Task**: Reduce `pd.merge()` time \n",
"**Action**: During `merge` operation, use indexing \n",
"**Result**: Drop from `10 second to 2 second` in Loading time i.e. `80%` reduction \n",
"\n",
"\n",
"**General Advice**: Use indexing as much as possible for merging and value lookup in dataframe [18] "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Time Taken in merge before indexing 9.495047569274902\n",
"Time Taken in merge after indexing 2.741480827331543\n"
]
}
],
"source": [
"df1 = pd.DataFrame({'A':list(range(10000000)),'B':list(range(0,20000000,2))})\n",
"df2 = pd.DataFrame({'A':list(range(10000000)),'C':list(range(0,40000000,4))})\n",
"\n",
"t0 = time()\n",
"df = pd.merge(df1,df2,how='left',on='A')\n",
"t1 = time()\n",
"print(\"Time Taken in merge before indexing {}\".format(t1-t0))\n",
"\n",
"gc.collect()\n",
"\n",
"t0 = time()\n",
"df1_ = df1.set_index('A')\n",
"df2_ = df2.set_index('A')\n",
"df_ = pd.merge(df1_,df2_, left_index=True, right_index=True)\n",
"t1 = time()\n",
"print(\"Time Taken in merge after indexing {}\".format(t1-t0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3 Parallelization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question**: Can we parallelize `pandas apply or groupby` function without much change in code ?\n",
"\n",
"\n",
"**Answer**: Yes, there are lot of libraries which can help in parallelizing pandas apply function without much change in code.\n",
"\n",
"**Libraries**\n",
"- pandarallel\n",
"\n",
"![SegmentLocal](standard_vs_parallel_4_cores.png \"segment\")\n",
" [19]\n",
"\n",
"**General Advice:** Parallelization is efficient only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallelization is not always worth it.\n",
"\n",
"- Swifter - It applies any function to a pandas dataframe or series in the fastest available manner (Vectorize OR Dask Parallel Processing OR Pandas apply). This notebook contains speed comparison for swifter, Please check."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"#########################################\n",
"#### PANDARALLEL CODE #################\n",
"#########################################\n",
"# from pandarallel import pandarallel\n",
"# pandarallel.initialize(nb_workers = _NO_CORES_USED_FOR_PARALLELIZATION,progress_bar = True)\n",
"# df.parallel_apply(func)\n",
"# df.groupby(args).parallel_apply(func)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"#########################################\n",
"#### SWIFTER CODE #################\n",
"#########################################\n",
"# import pandas as pd\n",
"# import swifter\n",
"# df['Col'].swifter.apply(anyfunction)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. References\n",
"\n",
"1) [https://www.dataquest.io/blog/pandas-big-data/](https://www.dataquest.io/blog/pandas-big-data/) \n",
"2) [https://machinelearningmastery.com/sparse-matrices-for-machine-learning/](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) \n",
"3) [https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) \n",
"4) [https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) \n",
"5) [https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/](https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/) \n",
"6) [https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) \n",
"7) [https://pythonspeed.com/articles/pandas-load-less-data/](https://pythonspeed.com/articles/pandas-load-less-data/) \n",
"8) [https://numpy.org/devdocs/user/basics.types.html](https://numpy.org/devdocs/user/basics.types.html) \n",
"9) [https://modin.readthedocs.io/en/latest/](https://modin.readthedocs.io/en/latest/) \n",
"10) **DASK** - [https://docs.dask.org/en/latest/best-practices.html](https://docs.dask.org/en/latest/best-practices.html)\n",
"\n",
"11) [https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c](https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c) \n",
"12) [https://pythonspeed.com/datascience/](https://pythonspeed.com/datascience/) \n",
"13) **Modin Vs Pandas** [https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html](https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html) \n",
"14) [https://www.csvexplorer.com/blog/open-big-csv/](https://www.csvexplorer.com/blog/open-big-csv/) \n",
"15) [https://www.kaggle.com/c/zillow-prize-1/discussion/37261](https://www.kaggle.com/c/zillow-prize-1/discussion/37261) \n",
"16) [https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) \n",
"17) [https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf](https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf) \n",
"18) [https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2](https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2) \n",
"19) [https://github.com/nalepae/pandarallel](https://github.com/nalepae/pandarallel) \n",
"20) [https://www.thepythoncode.com/article/get-hardware-system-information-python](https://www.thepythoncode.com/article/get-hardware-system-information-python) \n",
"\n",
"\n",
"**NOTE:** I acknowledge the work of above references and this notebook uses some of code mentioned in references."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}