{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "While KMeans is a good algorithm, the time complexity is very poor. Kmeans works in O(n⋅K⋅I⋅f) Where n is number of records, K is number of clusters, I is number of iterations, f is number of features in particular record. Clearly, the algorithm will take forever to complete on a dataset of > 100,000 data points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Minibatch KMeans\n", "Main features of Minibatch KMeans are:\n", "\n", "* Instead of using the entire dataset at once, it operates in batches.\n", "* Uses Gradient Descent update, which is way more faster than what KMeans does.\n", "\n", "#### How it works \n", "* It takes batches of datasets and finds the centroids for the smaller dataset (minibatch) \n", "* Then for the next batch, it uses the centroid found in previous batch and updates it using Gradient Descent. \n", "* This simple method makes it faster by a magnitude of the input size." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import time" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | CustomerID | \n", "Genre | \n", "Age | \n", "Annual Income (k$) | \n", "Spending Score (1-100) | \n", "
|---|---|---|---|---|---|
| 0 | \n", "1 | \n", "Male | \n", "19 | \n", "15 | \n", "39 | \n", "
| 1 | \n", "2 | \n", "Male | \n", "21 | \n", "15 | \n", "81 | \n", "
| 2 | \n", "3 | \n", "Female | \n", "20 | \n", "16 | \n", "6 | \n", "
| 3 | \n", "4 | \n", "Female | \n", "23 | \n", "16 | \n", "77 | \n", "
| 4 | \n", "5 | \n", "Female | \n", "31 | \n", "17 | \n", "40 | \n", "