{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5-final" }, "orig_nbformat": 2, "kernelspec": { "name": "Python 3.8.5 64-bit ('bigquery': conda)", "display_name": "Python 3.8.5 64-bit ('bigquery': conda)", "metadata": { "interpreter": { "hash": "8e6f8fd53d913fe50345f9e659ed342277121f637d2311273da0eef260503de3" } } } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Correlation and covariance from scratch" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "In this post we take a closer look at covariance and correlation.\n", "\n", "We will use them to examine the relationship between Ethereum transaction value and gas price.\n", "\n", "Again, most of the time we break the steps down into standard Python data types and operations (i.e. we use numpy mostly to verify our results)." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Libraries and data load" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "We pull the data from Google's public datasets with BigQuery, use pandas and numpy to manipulate it, and altair to plot the relationship between the two variables."
], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"]=os.path.expanduser(\"~/.credentials/Notebook bigquery-c422e406404b.json\")\n", "\n", "from google.cloud import bigquery\n", "client = bigquery.Client()\n", "\n", "import altair as alt\n", "alt.data_transformers.disable_max_rows()\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "query =\"\"\"\n", "SELECT\n", " EXTRACT(DATE FROM block_timestamp) AS date,\n", " AVG(value) AS value,\n", " AVG(gas_price) AS gas_price, \n", "FROM `bigquery-public-data.ethereum_blockchain.transactions`\n", "WHERE\n", " EXTRACT(YEAR FROM block_timestamp) = 2019\n", "GROUP BY date\n", "ORDER BY date\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": " date value gas_price\n0 2019-01-01 3.719103e+18 1.431514e+10\n1 2019-01-02 4.649915e+18 1.349952e+10\n2 2019-01-03 4.188781e+18 1.269504e+10\n3 2019-01-04 6.958368e+18 1.418197e+10\n4 2019-01-05 8.167590e+18 2.410475e+10", "text/html": "
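<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr><th></th><th>date</th><th>value</th><th>gas_price</th></tr>\n </thead>\n <tbody>\n <tr><th>0</th><td>2019-01-01</td><td>3.719103e+18</td><td>1.431514e+10</td></tr>\n <tr><th>1</th><td>2019-01-02</td><td>4.649915e+18</td><td>1.349952e+10</td></tr>\n <tr><th>2</th><td>2019-01-03</td><td>4.188781e+18</td><td>1.269504e+10</td></tr>\n <tr><th>3</th><td>2019-01-04</td><td>6.958368e+18</td><td>1.418197e+10</td></tr>\n <tr><th>4</th><td>2019-01-05</td><td>8.167590e+18</td><td>2.410475e+10</td></tr>\n </tbody>\n</table>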
\n
" }, "metadata": {}, "execution_count": 3 } ], "source": [ "transactions = client.query(query).to_dataframe(dtypes={'value': float, 'gas_price': float}, date_as_object=False)\n", "transactions.head()" ] }, { "source": [ "There are a few days when gas prices were exceptionally high, so we remove the days whose average gas price lies three or more standard deviations above the mean." ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Outliers" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/html": "\n
\n", "text/plain": "alt.LayerChart(...)" }, "metadata": {}, "execution_count": 5 } ], "source": [ "labelx = alt.selection_single(\n", "    encodings=['x'],\n", "    on='mouseover',\n", "    empty='none'\n", ")\n", "\n", "labely = alt.selection_single(\n", "    encodings=['y'],\n", "    on='mouseover',\n", "    empty='none'\n", ")\n", "\n", "ruler = alt.Chart().mark_rule(color='darkgray')\n", "\n", "chart = alt.Chart().mark_point().encode(\n", "    alt.X('value', axis=alt.Axis(format=(',.2e'))),\n", "    alt.Y('gas_price', axis=alt.Axis(format=(',.2e'))),\n", "    alt.Tooltip(['value', 'gas_price', 'date'])\n", ").properties(width=600, height=400, title='Transaction values and gas prices').add_selection(labelx).add_selection(labely)\n", "\n", "alt.layer(\n", "    chart,\n", "    ruler.encode(x='value:Q').transform_filter(labelx),\n", "    ruler.encode(y='gas_price:Q').transform_filter(labely),\n", "    data=transactions\n", ").interactive()" ] }, { "source": [ "transactions = transactions[~(transactions['gas_price'] >= transactions['gas_price'].mean() + 3 * transactions['gas_price'].std())]" ], "cell_type": "code", "metadata": {}, "execution_count": 4, "outputs": [] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "values = transactions['value']\n", "gas_prices = transactions['gas_price']" ] }, { "source": [ "Because we stick to standard Python operations, we first define a few helper functions that the covariance and correlation calculations build on."
], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Helper functions" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from typing import List\n", "\n", "Vector = List[float]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def dot(vector1: Vector, vector2: Vector) -> float:\n", "    # Sum of the element-wise products of two equal-length vectors\n", "    assert len(vector1) == len(vector2)\n", "\n", "    return sum(v1 * v2 for v1, v2 in zip(vector1, vector2))\n", "\n", "assert dot([1, 2, 3], [4, 5, 6]) == 32\n", "\n", "\n", "def mean(x: Vector) -> float:\n", "    return sum(x) / len(x)\n", "\n", "assert mean([1, 2, 3, 4]) == 2.5\n", "\n", "\n", "def de_mean(xs: Vector) -> Vector:\n", "    # Shift the values so that their mean becomes 0\n", "    x_mean = mean(xs)\n", "    return [x - x_mean for x in xs]\n", "\n", "assert de_mean([4, 5, 6, 7, 8]) == [-2, -1, 0, 1, 2]\n", "\n", "def sum_of_squares(xs: Vector) -> float:\n", "    return dot(xs, xs)\n", "\n", "assert sum_of_squares([1, 2, 3]) == 14\n", "\n", "def variance(xs: Vector) -> float:\n", "    # Sample variance: divide by n - 1 instead of n\n", "    return sum_of_squares(de_mean(xs)) / (len(xs) - 1)\n", "\n", "assert variance([1, 2, 3]) == 1\n", "\n", "import math as m\n", "\n", "def standard_deviation(xs: Vector) -> float:\n", "    return m.sqrt(variance(xs))\n", "\n", "assert standard_deviation([4, 5, 6]) == 1" ] }, { "source": [ "Covariance measures the degree to which two variables 'move together'.\n", "\n", "\n", "First, for each observation, it multiplies the two variables' deviations from their respective means. This produces a series of products that are large in magnitude for observations where both variables deviate a lot. Furthermore, a product is positive when the two variables deviate in the same direction and negative otherwise.\n", "\n", "Then, it calculates the mean of these products. 
However, because we are calculating the sample covariance, we divide their sum by $n - 1$ rather than $n$ (where $n$ is the number of observations).\n", "\n", "$ \\text{Cov} = \\frac { \\sum_{i=1}^n (x_i-\\bar{x}) (y_i-\\bar{y})} {n - 1} $" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Covariance" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "def covariance(xs: Vector, ys: Vector) -> float:\n", "    assert len(xs) == len(ys)\n", "\n", "    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)\n", "\n", "assert covariance([1, 2, 3], [4, 5, 6]) == 1" ], "cell_type": "code", "metadata": {}, "execution_count": 9, "outputs": [] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "1.5856518696847875e+26" }, "metadata": {}, "execution_count": 10 } ], "source": [ "covariance(values, gas_prices)" ] }, { "source": [ "There is also an alternative way to calculate covariance, using the variables' expected values (which here are the means):\n", "\n", "$ \\text{Cov} = E[\\vec{x}\\vec{y}] - E[\\vec{x}]E[\\vec{y}] $\n", "\n", "This version is simpler to compute. However, as we are again dealing with sample data, we need to adjust for that:\n", "\n", "$ \\text{Cov}_s = \\frac {n} {n - 1} (E[\\vec{x}\\vec{y}] - E[\\vec{x}]E[\\vec{y}]) $" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def covariance_2(xs: Vector, ys: Vector) -> float:\n", "    xsys = [x * y for x, y in zip(xs, ys)]\n", "    # n / (n - 1) converts the population covariance into the sample covariance\n", "    return (mean(xsys) - mean(xs) * mean(ys)) * len(xs) / (len(xs) - 1)\n", "\n", "assert np.isclose(covariance_2([1, 2, 3], [4, 5, 6]), 1)" ] }, { "source": [ "We also verify our results with numpy."
], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "1.5856518696847882e+26" }, "metadata": {}, "execution_count": 13 } ], "source": [ "np.cov(values, gas_prices)[0, 1]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "1.585651869684888e+26" }, "metadata": {}, "execution_count": 12 } ], "source": [ "covariance_2(values, gas_prices)" ] }, { "source": [ "Because the value of covariance depends on the units of the variables, it is often hard to interpret and to compare with other covariances.\n", "\n", "This is why correlation is often preferred: it divides the covariance by the product of the variables' standard deviations. As a result, the value is bounded to the $[-1, 1]$ interval, which makes it comparable with other correlation values.\n", "\n", "$ \\text{Corr}(x, y) = \\frac { \\text{Cov}(x, y) } {\\text{Std}(x) \\, \\text{Std}(y)} $" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Correlation" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def correlation(xs: Vector, ys: Vector) -> float:\n", "    return covariance(xs, ys) / (standard_deviation(xs) * standard_deviation(ys))\n", "\n", "\n", "assert np.isclose(correlation([.1, .2, .3], [400, 500, 600]), 1)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "0.035069533929694634" }, "metadata": {}, "execution_count": 16 } ], "source": [ "correlation(values, gas_prices)" ] }, { "source": [ "Finally, we verify the result with pandas."
], "cell_type": "markdown", "metadata": {} }, { "source": [ "values.corr(gas_prices)" ], "cell_type": "code", "metadata": {}, "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "0.03506953392969465" }, "metadata": {}, "execution_count": 17 } ] } ] }