{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Cloud Price Comparison\n", "How do cloud providers stack up?\n", " \n", "Making apples to apples comparisons between different cloud providers is very difficult, because each one offers instances with varying vCPUs, RAM, SSD space and HDD space. To further obfuscate matters, slightly different billing systems, promises of arcane discounting, only providing pricing in USD, and inconsistent naming conventions are sprinkled throughout.\n", "\n", "As an attempt to provide a clearer price comparison, I'll be using [multiple linear regression](https://en.wikipedia.org/wiki/Linear_regression) to \"[normalise](http://bit.ly/2xIUM5C)\" the pricing of general purpose compute instances across different cloud providers.\n", "\n", "In essence, **If every cloud provider offered the same size compute instances, how expensive would they be?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Importing libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "from sklearn.linear_model import LinearRegression\n", "import matplotlib.pyplot as plt\n", "import requests\n", "import json\n", "import re\n", "from bs4 import BeautifulSoup\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dataset\n", "\n", "I'll be taking the price tables of:\n", "* Google Cloud - [Predefined machine types](https://cloud.google.com/compute/pricing#predefined_machine_types)\n", "* AWS - [On demand instances](https://aws.amazon.com/ec2/pricing/on-demand/)\n", "* Azure - [Linux virtual machines](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)\n", "\n", "and converting them into the instance sizes offered by [Catalyst Cloud](https://catalystcloud.nz/services/iaas/compute/#prices). You can find the datasets and their sources [here](https://github.com/catalyst-cloud/catalystcloud-price-comparison)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialisation\n", "Assuring dataset directories exist for later data storage." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def create_dir_if_not_exists(rel_dir_path):\n", " if not os.path.exists(rel_dir_path):\n", " os.makedirs(rel_dir_path)\n", " \n", "create_dir_if_not_exists('dataset')\n", "create_dir_if_not_exists('predicted-dataset')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exchange rate\n", "Getting the USD to NZD exchange rate" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Current exchange rate from USD to NZD is 1.6772901111\n" ] } ], "source": [ "usd_to_nzd_exchange_rate_url = 'https://api.exchangeratesapi.io/latest?base=USD&symbols=NZD'\n", "\n", "usd_to_nzd_exchange_rate_json = requests.get(usd_to_nzd_exchange_rate_url).json()\n", "usd_to_nzd_exchange_rate = float(usd_to_nzd_exchange_rate_json['rates']['NZD'])\n", "print(f\"Current exchange rate from USD to NZD is {usd_to_nzd_exchange_rate}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Catalyst Cloud prices\n", "Scraping the Catalyst Cloud compute page for prices. As Catalyst Cloud currently only offers general purpose compute instances, we'll be scraping all of them." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Variables\n", "\n", "catalyst_url = 'https://catalystcloud.nz/services/iaas/compute'\n", "catalyst_data_location = 'dataset/catalyst_price_data.csv'\n", "catalyst_gst_exclusive = True" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded Catalyst prices, with 45 items.\n" ] } ], "source": [ "catalyst_price_page_html = requests.get(catalyst_url).text \n", "catalyst_price_page = BeautifulSoup(catalyst_price_page_html, 'html.parser')\n", "\n", "catalyst_price_table = catalyst_price_page.find(attrs={'class': 'service-price-table'}).tbody\n", "catalyst_price_rows = catalyst_price_table.find_all('tr')\n", "\n", "catalyst_prices_list = []\n", "\n", "for row in catalyst_price_rows:\n", " catalyst_price_cells = list(row.stripped_strings)\n", " \n", " catalyst_prices_list.append({\n", " 'Name': catalyst_price_cells[0],\n", " 'vCPU': float(catalyst_price_cells[1]),\n", " 'RAM, GB': float(catalyst_price_cells[2]),\n", " 'Price per hour, NZD (ex GST)': float(catalyst_price_cells[3].strip('$')),\n", " 'SSD storage, GB': .0,\n", " 'HDD storage, GB': .0\n", " })\n", "\n", "# Convert to csv\n", "catalyst_dataframe = pd.DataFrame(catalyst_prices_list)\n", "catalyst_dataframe.to_csv(catalyst_data_location)\n", "\n", "print('Downloaded Catalyst prices, with {} items.'.format(catalyst_dataframe.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AWS prices\n", "\n", "Pulling price list from AWS JSON API." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Variables\n", "\n", "aws_url = 'https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-southeast-2/index.json'\n", "aws_raw_location = 'dataset/raw_aws_bulk.json'\n", "aws_acceptable_instance_families = [\n", " 'General purpose',\n", " 'Micro instances'\n", "]\n", "aws_data_location = 'dataset/aws_price_data.csv'\n", "aws_gst_exclusive = True" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [], "source": [ "aws_bulk_json_request = requests.get(aws_url)\n", "aws_bulk_json = aws_bulk_json_request.json()\n", "with open(aws_raw_location, 'w') as aws_raw_file:\n", " json.dump(aws_bulk_json, aws_raw_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting the specific relevant prices from the raw AWS file, and putting them in a consistant, usable format. In this case, we'll only be using the General purpose and micro size instances." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Getting the instance products\n", "\n", "with open(aws_raw_location, 'r') as aws_raw_file:\n", " aws_raw_json = json.load(aws_raw_file)\n", " \n", " aws_instances_list = []\n", " \n", " for product in aws_raw_json['products']:\n", " \n", " productFamily = aws_raw_json['products'][product]['productFamily']\n", " \n", " # Check product is compute instance\n", " if productFamily == 'Compute Instance':\n", " \n", " # Check if instance is appropriate\n", " instanceFamily = aws_raw_json['products'][product]['attributes']['instanceFamily']\n", " is_current_gen = aws_raw_json['products'][product]['attributes']['currentGeneration'] == 'Yes'\n", " is_linux = aws_raw_json['products'][product]['attributes']['operatingSystem'] == 'Linux'\n", " no_preInstalledSw = aws_raw_json['products'][product]['attributes']['preInstalledSw'] == 'NA'\n", " is_shared_instance = aws_raw_json['products'][product]['attributes']['tenancy'] == 'Shared'\n", "\n", " if instanceFamily in aws_acceptable_instance_families and is_current_gen \\\n", " and is_linux and no_preInstalledSw and is_shared_instance:\n", " \n", " # Append if appropriate\n", " aws_instances_list.append(product)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded AWS prices, with 192 items.\n" ] } ], "source": [ "with open(aws_raw_location, 'r') as aws_raw_file:\n", " \n", " aws_prices_list = []\n", " \n", " for instance_key in aws_instances_list:\n", "\n", " attributes = aws_raw_json['products'][instance_key]['attributes']\n", " \n", " # Get vCPU and RAM\n", " vCPU = float(attributes['vcpu'].replace(',',''))\n", " RAM = float(attributes['memory'].split(' ')[0].replace(',',''))\n", "\n", " # Break storage spec into array\n", " storage_strings = attributes['storage'].split(' ')\n", "\n", " # Find where the numbers end (200 x 1), and the description of the storage type (SSD) starts.\n", " final_num_index = None\n", " for word in storage_strings[::-1]:\n", " try:\n", " float(word.replace(',', ''))\n", " final_num_index = storage_strings.index(word)\n", " break\n", " except:\n", " foo = None\n", "\n", " # If there are no numbers in the storage spec, there is no storage included\n", " if final_num_index == None:\n", "\n", " total_ssd = .0\n", " total_hdd = .0\n", "\n", " # Else...\n", " else:\n", "\n", " # Perform the math to figure out how many GB of storage is included\n", " storage_calcs = storage_strings[0:final_num_index+1]\n", " storage_volume = eval(' '.join(['*' if x=='x' else x.replace(',', '') for x in storage_calcs]))\n", "\n", " # discern the type of storage\n", " if 'HDD' in storage_strings: \n", " total_ssd = .0\n", " total_hdd = float(storage_volume)\n", "\n", " elif 'SSD' in storage_strings: \n", " total_ssd = float(storage_volume)\n", " total_hdd = .0\n", " else: \n", " total_ssd = float(storage_volume)\n", " total_hdd = .0\n", "\n", "\n", " # Get the price per USD\n", " terms = aws_raw_json['terms']['OnDemand'][instance_key]\n", " usd_price = None\n", " for specific_term in terms:\n", " for dimension_key in terms[specific_term]['priceDimensions']:\n", " dimension = terms[specific_term]['priceDimensions'][dimension_key]\n", " if dimension['unit'] != 'Hrs': raise ValueError(\"This price isn't in hours\")\n", " usd_price = float(dimension['pricePerUnit']['USD'])\n", " \n", " if not usd_price:\n", " continue\n", "\n", " # Convert to NZD\n", 
" nzd_price = usd_price * usd_to_nzd_exchange_rate\n", " \n", " # Append to list of prices\n", " aws_prices_list.append({\n", " 'Name': attributes['instanceType'],\n", " 'vCPU': vCPU,\n", " 'RAM, GB': RAM,\n", " 'Price per hour, NZD (ex GST)': nzd_price,\n", " 'Price per hour, USD (ex GST)': usd_price,\n", " 'SSD storage, GB': total_ssd,\n", " 'HDD storage, GB': total_hdd,\n", " })\n", "\n", "# Convert to CSV\n", "aws_dataframe = pd.DataFrame(aws_prices_list)\n", "aws_dataframe.to_csv(aws_data_location)\n", "\n", "print('Downloaded AWS prices, with {} items.'.format(aws_dataframe.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Google Cloud prices\n", "\n", "Scraping Google Cloud's documentation for prices of instance sizes." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Variables\n", "\n", "google_prices_url = 'https://cloud-dot-devsite-v2-prod.appspot.com/compute/all-pricing_f139b6960aafba0f57015e855bf5748ad892aa7e94f9dee00434116c29ca229a.frame'\n", "google_instances_url = 'https://cloud-dot-devsite-v2-prod.appspot.com/compute/all-pricing_70247bb78d85862a2b290545ac82cd3c0f4e0e7aa5ea1092e8dcba180b24ab80.frame'\n", "google_price_type = 'syd-hourly'\n", "google_acceptable_instance_families = [\n", " 'standard_machine_types'\n", "]\n", "google_data_location = 'dataset/google_price_data.csv'\n", "google_gst_exclusive = True" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "google_price_page_html = requests.get(google_prices_url).text\n", "google_price_page = BeautifulSoup(google_price_page_html, 'html.parser')\n", "google_instance_page_html = requests.get(google_instances_url).text\n", "google_instance_page = BeautifulSoup(google_instance_page_html, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scraping the custom compute sizes, for easier pricing later." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Extract the USD price per vCPU and per GB RAM\n", "google_custom_compute_price_table = google_price_page.find('table')\n", "google_custom_compute_rows = google_custom_compute_price_table.find_all('tr')[1:]\n", "\n", "google_per_vcpu_usd = float(google_custom_compute_rows[0].find_all('td')[1][google_price_type].split()[0].strip('$'))\n", "google_per_ram_usd = float(google_custom_compute_rows[1].find_all('td')[1][google_price_type].split()[0].strip('$'))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def most_freq_num(text):\n", " number_list = re.findall('\\d*\\.?\\d+', text)\n", " most_frequent_num = max(set(number_list), key=number_list.count)\n", " return float(most_frequent_num)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scraping the predefined instance size price table. In this case, we'll only be scraping the standard machine types." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded Google prices, with 7 items.\n" ] } ], "source": [ "google_prices_list = []\n", "\n", "for instance_type in google_acceptable_instance_families:\n", " \n", " google_price_table = google_instance_page.find('table')\n", " google_rows = google_price_table.find_all('tr')[1:-1]\n", " \n", " for row in google_rows:\n", " \n", " # Extract number of vCPUs and GB of RAM\n", " try:\n", " cells = row.find_all('td')\n", " name = cells[0].get_text().strip()\n", " # Ignore if has lake in name (to remove skylake instances)\n", " if 'lake' in name:\n", " continue\n", " cpu_val = most_freq_num(str(cells[1]))\n", " ram_val = most_freq_num(str(cells[2]))\n", " except:\n", " foo='bar'\n", " \n", " # Calcluate NZD price\n", " usd_price = (google_per_ram_usd * ram_val) + (google_per_vcpu_usd * cpu_val)\n", " nzd_price = usd_price * usd_to_nzd_exchange_rate\n", " \n", " try:\n", " google_prices_list.append({\n", " 'Name': name,\n", " 'vCPU': cpu_val,\n", " 'RAM, GB': ram_val,\n", " 'Price per hour, NZD (ex GST)': nzd_price,\n", " 'Price per hour, USD (ex GST)': usd_price,\n", " 'SSD storage, GB': .0,\n", " 'HDD storage, GB': .0,\n", " })\n", " except:\n", " continue\n", "\n", "google_dataframe = pd.DataFrame(google_prices_list)\n", "google_dataframe.to_csv(google_data_location)\n", "\n", "print('Downloaded Google prices, with {} items.'.format(google_dataframe.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Azure cloud prices\n", "\n", "Scraping the Azure on demand price pages. In this case, we'll only be scraping those listed as general purpose." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Variables\n", "azure_url = 'https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/'\n", "azure_data_location = 'dataset/azure_price_data.csv'\n", "azure_acceptable_categories = [\n", " 'general-purpose-filter'\n", "]\n", "azure_acceptable_regions = [\n", " 'australia-central',\n", " 'australia-central-2',\n", " 'australia-east',\n", " 'australia-southeast'\n", "]\n", "azure_ssd_temp_disk_vms = [\n", " r'A\\d.v2', # Regex for Av2-series\n", " r'D\\d{1,2}.v2', # Regex for Dv2 series\n", " r'D\\d{1,2}.v3' # Regex for Dv3 series\n", "] \n", "azure_gst_exclusive = True" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "azure_price_page_html = requests.get(azure_url).text\n", "azure_price_page = BeautifulSoup(azure_price_page_html, 'html.parser')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded Azure prices, with 41 items.\n" ] } ], "source": [ "# Find all appropriate price tables\n", "appropriate_price_tables = []\n", "for category in azure_acceptable_categories:\n", " category_section = azure_price_page.find_all(attrs={'data-filter': category})[0]\n", " appropriate_price_tables += category_section.find_all('table')\n", "\n", "# Extract data from tables\n", "azure_instance_list = []\n", "for price_table in appropriate_price_tables:\n", " table_rows = price_table.tbody.find_all('tr')\n", " \n", " # Work through each row\n", " for row in table_rows:\n", " \n", " cells = row.find_all('td')\n", " \n", " # Find a price by checking if each of the acceptable regions have a price\n", " usd_price = None\n", " 
price_json = json.loads(cells[5].span['data-amount'])['regional']\n", "        for region in azure_acceptable_regions:\n", "            if region in price_json:\n", "                usd_price = price_json[region]\n", "                break\n", "\n", "        # If none of the regions have it, skip this row\n", "        if usd_price is None:\n", "            continue\n", "\n", "        # Get NZD price\n", "        nzd_price = usd_price * usd_to_nzd_exchange_rate\n", "\n", "        # Get name\n", "        name = cells[1].string\n", "\n", "        # Get CPU value\n", "        cpu_val = float(cells[2].string)\n", "\n", "        # Get RAM value\n", "        ram_string = cells[3].string\n", "        ram_string = ram_string.strip(' GiB')\n", "        ram_val = float(ram_string)\n", "\n", "        # Get storage value\n", "        storage_string = cells[4].string\n", "        storage_string = storage_string.strip(' GiB')\n", "        storage_string = storage_string.replace(',', '')\n", "        storage_val = float(storage_string)\n", "\n", "        # Get storage type: assume HDD unless the series is in the SSD temp disk list\n", "        storage_type = 'HDD'\n", "        for regex_string in azure_ssd_temp_disk_vms:\n", "            pattern = re.compile(regex_string)\n", "            if pattern.match(name):\n", "                storage_type = 'SSD'\n", "                break\n", "\n", "        if storage_type == 'HDD':\n", "            azure_instance_list.append({\n", "                'Name': name,\n", "                'vCPU': cpu_val,\n", "                'RAM, GB': ram_val,\n", "                'Price per hour, NZD (ex GST)': nzd_price,\n", "                'Price per hour, USD (ex GST)': usd_price,\n", "                'SSD storage, GB': .0,\n", "                'HDD storage, GB': storage_val\n", "            })\n", "        elif storage_type == 'SSD':\n", "            azure_instance_list.append({\n", "                'Name': name,\n", "                'vCPU': cpu_val,\n", "                'RAM, GB': ram_val,\n", "                'Price per hour, NZD (ex GST)': nzd_price,\n", "                'Price per hour, USD (ex GST)': usd_price,\n", "                'SSD storage, GB': storage_val,\n", "                'HDD storage, GB': .0\n", "            })\n", "        else:\n", "            raise ValueError('Unknown storage type.')\n", "\n", "azure_dataframe = pd.DataFrame(azure_instance_list)\n", "azure_dataframe.to_csv(azure_data_location)\n", "\n", "print('Downloaded Azure prices, with {} items.'.format(azure_dataframe.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ingesting the datasets\n", "\n", "Importing the datasets from their saved locations." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "catalyst_dataset = pd.read_csv(catalyst_data_location, index_col=0)\n", "google_dataset = pd.read_csv(google_data_location, index_col=0)\n", "aws_dataset = pd.read_csv(aws_data_location, index_col=0)\n", "azure_dataset = pd.read_csv(azure_data_location, index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Previewing the datasets." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Name | \n", "vCPU | \n", "RAM, GB | \n", "Price per hour, NZD (ex GST) | \n", "SSD storage, GB | \n", "HDD storage, GB | \n", "
---|---|---|---|---|---|---|
0 | \n", "c1.c1r05 | \n", "1.0 | \n", "0.5 | \n", "0.017 | \n", "0.0 | \n", "0.0 | \n", "
1 | \n", "c1.c1r1 | \n", "1.0 | \n", "1.0 | \n", "0.039 | \n", "0.0 | \n", "0.0 | \n", "
2 | \n", "c1.c1r2 | \n", "1.0 | \n", "2.0 | \n", "0.062 | \n", "0.0 | \n", "0.0 | \n", "
3 | \n", "c1.c1r4 | \n", "1.0 | \n", "4.0 | \n", "0.098 | \n", "0.0 | \n", "0.0 | \n", "
4 | \n", "c1.c2r1 | \n", "2.0 | \n", "1.0 | \n", "0.070 | \n", "0.0 | \n", "0.0 | \n", "
5 | \n", "c1.c2r2 | \n", "2.0 | \n", "2.0 | \n", "0.088 | \n", "0.0 | \n", "0.0 | \n", "