{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/lhagen-isb/Community-Notebooks/blob/master/Notebooks/How_to_perform_Nearest_Centroid_Classification_with_BigQuery.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "nqKijP9S1_c8"
   },
   "source": [
    "# ISB-CGC Community Notebooks\n",
    "\n",
    "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
    "\n",
    "```\n",
    "Title:   How to Perform Nearest Centroid Classification with_BigQuery\n",
    "Author:  Lauren Hagen\n",
    "Created: 2019-12-17\n",
    "URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_preform_Nearest_Centroid_Classification_with_BigQuery.ipynb\n",
    "Purpose: Demonnstrate using BigQuery to categorize patients based on gene expression using the Nearest Centroid Classification.\n",
    "Notes: \n",
    "```\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zzZa1dNF4ZYl"
   },
   "source": [
    "#  Introduction\n",
    "\n",
    "## Overview\n",
    "\n",
    "This notebook is to demonstrate how to use BigQuery to categorize patients based on gene expression using the Nearest Centroid Classifier. We will be using the Kidney renal papillary cell carcinoma (KIRP) study from The Cancer Genome Atlas (TCGA) as an example dataset and will use the RNA Sequence Gene Expression Data.\n",
    "\n",
    "## What is the Nearest Centroid Classifier?\n",
    "\n",
    "Nearest Centroid Classifier assigns a label based on the nearest mean (centroid) of the training samples that the observation is closest to<sup>1</sup>. This classifier is an example of supervised learning and does not create a model for use later<sup>2</sup>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "JT8xA5m-fEFn"
   },
   "source": [
    "Before we get started, we will need to load the BigQuery module, authenticate ourselves, create a client variable, and load necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "K50vTz7JbqX-"
   },
   "outputs": [],
   "source": [
    "# Load the BigQuery Module\n",
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 326
    },
    "colab_type": "code",
    "id": "ivQo4rAFfPLV",
    "outputId": "9cc8e8d0-1b89-45bc-bc5c-f3a078e596a3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Go to the following link in your browser:\n",
      "\n",
      "    https://accounts.google.com/o/oauth2/auth?code_challenge=vczbWqjPWlXINDmstsMXSC_dgiQ7OUkYLzS4LUn6f5o&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth\n",
      "\n",
      "\n",
      "Enter verification code: 4/vwEmwq_ebAQ_L2OTt_nQjSnrLyC1fYuTMR1PH5CClYdxCP9qLhKUIdA\n",
      "\u001b[1;33mWARNING:\u001b[0m Cannot find a project to insert into application default credentials (ADC) as a quota project.\n",
      "Run $gcloud auth application-default set-quota-project to insert a quota project to ADC.\n",
      "\n",
      "Credentials saved to file: [/content/.config/application_default_credentials.json]\n",
      "\n",
      "These credentials will be used by any library that requests Application Default Credentials (ADC).\n",
      "\n",
      "\n",
      "To take a quick anonymous survey, run:\n",
      "  $ gcloud survey\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Authenticate ourselves\n",
    "!gcloud auth application-default login"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Uq2GiSBlfQ2N"
   },
   "outputs": [],
   "source": [
    "# Create a variable for which client to use with BigQuery\n",
    "project_num = 'your_project_number'# Replace with your project ID\n",
    "if project_num == 'your_project_number':\n",
    "    print('Please update the project number with your Google Cloud Project')\n",
    "else:\n",
    "    client = bigquery.Client(project_num)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "rT9_z_iL_LIU"
   },
   "outputs": [],
   "source": [
    "# Required Libraries\n",
    "import pandas\n",
    "import pandas_confusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6OoHQUga6rxr"
   },
   "source": [
    "# Select Genes\n",
    "\n",
    "We will be using the top three genes with the highest mean expression as the genes we will use to categorize the clinical stage for each case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "lOlplmLd6ZPU"
   },
   "outputs": [],
   "source": [
    "# Find top 20 genes by mean with training set\n",
    "top_20_highest_mean_gene_exp = \"\"\"SELECT gene_name, AVG(HTSeq__FPKM_UQ) as m\n",
    "FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`\n",
    "WHERE project_short_name = 'TCGA-KIRP'\n",
    "GROUP BY gene_name\n",
    "ORDER BY m DESC\n",
    "LIMIT 20\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 381
    },
    "colab_type": "code",
    "id": "zKD6peSmxVVX",
    "outputId": "c4a710e9-0f92-47b6-bcbb-cd4deb6c0870"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   gene_name             m\n",
      "0     MT-CO3  5.275677e+08\n",
      "1     MT-CO1  4.638104e+08\n",
      "2     MT-CO2  4.374221e+08\n",
      "3     MT-ND4  4.188905e+08\n",
      "4    MT-RNR2  3.383655e+08\n",
      "5    MT-ATP6  2.715383e+08\n",
      "6     MT-ND3  2.477042e+08\n",
      "7     MT-CYB  2.397069e+08\n",
      "8        FTL  1.631893e+08\n",
      "9     MT-ND2  1.626764e+08\n",
      "10    MT-ND1  1.602064e+08\n",
      "11   MT-ND4L  1.299254e+08\n",
      "12   MT-ATP8  1.064019e+08\n",
      "13    MT-ND6  1.018257e+08\n",
      "14    MT-ND5  8.410137e+07\n",
      "15   MT-RNR1  8.201815e+07\n",
      "16    TMSB10  7.268314e+07\n",
      "17     MT-TP  6.806825e+07\n",
      "18      SPP1  6.308363e+07\n",
      "19  MTATP6P1  4.023022e+07\n"
     ]
    }
   ],
   "source": [
    "# Run the query\n",
    "top_20 = client.query(top_20_highest_mean_gene_exp).to_dataframe()\n",
    "print(top_20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MgUUpD1X7A1j"
   },
   "source": [
    "# Nearest Centroid in BigQuery\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "nh2xB6wg0f86"
   },
   "source": [
    "## What are we going to use Nearest Centroids for?\n",
    "We will be attempting to predict the clinical stage for a case based on a specific subset of genes and their expression levels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-5G0CkAAtVqz"
   },
   "source": [
    "First, we will filter the TCGA clinical table for our target disease (KIRP) and not include the cases where the clinical stage is missing. We need there to be no missing clinical stage features due to the nature of k nearest algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "hz4HphF3nsjp"
   },
   "outputs": [],
   "source": [
    "filtered_clin = \"\"\"\n",
    "WITH\n",
    "  clin AS (\n",
    "  SELECT\n",
    "    case_barcode,\n",
    "    clinical_stage    \n",
    "  FROM\n",
    "    `isb-cgc.TCGA_bioclin_v0.Clinical`\n",
    "  WHERE\n",
    "    disease_code = 'KIRP'\n",
    "    AND clinical_stage <> 'NULL'\n",
    "  ), \"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "k6DChjGPwyVP"
   },
   "source": [
    "Next, we will want to get the gene expressions for the genes that we are going to use to attempt to identify the clinical stage. We will filter for the 3 genes that we want, then we will label each case with whether it will be in the training or testing group. For more information on randomization in BigQuery, please see the [How to Create a Random Sample in BigQuery Notebook](https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_a_random_sample_in_bigquery.ipynb) in the [ISB-CGC Community Notebook Repository](https://github.com/isb-cgc/Community-Notebooks)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pUX2_ivypaST"
   },
   "outputs": [],
   "source": [
    "filtered_expr = \"\"\"\n",
    "expr AS (\n",
    "  SELECT\n",
    "    case_barcode,\n",
    "    IF ( MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) > 1, 'TRAIN', 'TEST') as class,\n",
    "    gene_name,\n",
    "    HTSeq__FPKM_UQ\n",
    "  FROM\n",
    "    `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`\n",
    "  WHERE\n",
    "    project_short_name = 'TCGA-KIRP'\n",
    "    AND gene_name IN ('MT-CO3','MT-CO1','MT-CO2')\n",
    "    AND (\n",
    "    case_barcode IN (select case_barcode from clin)\n",
    "    )\n",
    "  GROUP BY case_barcode, gene_name, HTSeq__FPKM_UQ\n",
    "  ),\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AzwB6NkqxIE-"
   },
   "source": [
    "We will then find the centroids or means for each clinical stage within the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "C6QPXBBFpWgi"
   },
   "outputs": [],
   "source": [
    "centroids = \"\"\"\n",
    " mean AS (\n",
    "  SELECT\n",
    "    expr.class,\n",
    "    clin.clinical_stage,\n",
    "    AVG(CASE\n",
    "        WHEN gene_name = 'MT-CO3' THEN HTSeq__FPKM_UQ\n",
    "    END\n",
    "      ) AS gene1,\n",
    "    AVG(CASE\n",
    "        WHEN gene_name = 'MT-CO1' THEN HTSeq__FPKM_UQ\n",
    "    END\n",
    "      ) AS gene2,\n",
    "    AVG(CASE\n",
    "        WHEN gene_name = 'MT-CO2' THEN HTSeq__FPKM_UQ\n",
    "    END\n",
    "      ) AS gene3\n",
    "  FROM\n",
    "    expr\n",
    "  JOIN\n",
    "    clin\n",
    "  ON\n",
    "    expr.case_barcode = clin.case_barcode\n",
    "  WHERE\n",
    "    expr.class = 'TRAIN'\n",
    "  GROUP BY\n",
    "    expr.class,\n",
    "    clin.clinical_stage),\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "napCg0BGxU5H"
   },
   "source": [
    "We will also need the testing data to have the genes that we are going to use in the classifier in separate columns for the test cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "sQwKwnPrpKqS"
   },
   "outputs": [],
   "source": [
    "test = \"\"\"\n",
    "test AS (\n",
    "  SELECT\n",
    "    case_barcode,\n",
    "    class,\n",
    "    MAX(CASE WHEN gene_name = 'MT-CO3' THEN HTSeq__FPKM_UQ END) AS gene1,\n",
    "    MAX(CASE WHEN gene_name = 'MT-CO1' THEN HTSeq__FPKM_UQ END) AS gene2,\n",
    "    MAX(CASE WHEN gene_name = 'MT-CO2' THEN HTSeq__FPKM_UQ END) AS gene3\n",
    "  FROM\n",
    "    expr\n",
    "  WHERE\n",
    "    class = 'TEST'\n",
    "  GROUP BY case_barcode, class),\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "IMjQsor1NB-M"
   },
   "source": [
    "This next section of the query will find the euclidean distance for each case in the testing data set<sup>3</sup>. The euclidean distance can be found by the following equation<sup>4</sup>:\n",
    "\n",
    "$d(q,p) = \\sqrt{(q_1-p_1)^2+(q_2-p_2)^2+...+(q_n-p_n)^2}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "bB0ONe9cpGwJ"
   },
   "outputs": [],
   "source": [
    "dist = \"\"\"\n",
    "distance AS (SELECT\n",
    "  case_barcode,\n",
    "  gene1,\n",
    "  gene2,\n",
    "  SQRT(POWER((SELECT gene1 FROM mean WHERE clinical_stage = 'Stage I')-gene1, 2) + POWER((SELECT gene2 FROM mean WHERE clinical_stage = 'Stage I')-gene2, 2) + POWER((SELECT gene3 FROM mean WHERE clinical_stage = 'Stage I')-gene3, 2)) as stage1,\n",
    "  SQRT(POWER((SELECT gene1 FROM mean WHERE clinical_stage = 'Stage II')- gene1, 2) + POWER((SELECT gene2 FROM mean WHERE clinical_stage = 'Stage II')-gene2, 2) + POWER((SELECT gene3 FROM mean WHERE clinical_stage = 'Stage II')-gene3, 2)) as stage2,\n",
    "  SQRT(POWER((SELECT gene1 FROM mean WHERE clinical_stage = 'Stage III')-gene1, 2) + POWER((SELECT gene2 FROM mean WHERE clinical_stage = 'Stage III')-gene2, 2) + POWER((SELECT gene3 FROM mean WHERE clinical_stage = 'Stage III')-gene3, 2)) as stage3,\n",
    "  SQRT(POWER((SELECT gene1 FROM mean WHERE clinical_stage = 'Stage IV')-gene1, 2) + POWER((SELECT gene2 FROM mean WHERE clinical_stage = 'Stage IV')-gene2, 2) + POWER((SELECT gene3 FROM mean WHERE clinical_stage = 'Stage IV')-gene3, 2)) as stage4,\n",
    "FROM test),\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oKDupClkPUJ6"
   },
   "source": [
    "Finally, we will calculate which category or stage has the smallest distance from the centroid and assign that stage to the case. We will also include the case barcode and actual clinical stage as columns in our final result to assist with calculating the accuracy of the classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "gwQedjOFpDwK"
   },
   "outputs": [],
   "source": [
    "pred = \"\"\"\n",
    "pred AS (SELECT case_barcode,\n",
    "  (CASE WHEN stage1<stage2 AND stage1<stage3 AND stage1<stage4 THEN 'Stage I'\n",
    "  WHEN stage2<stage1 AND stage2<stage3 AND stage2<stage4 THEN 'Stage II'\n",
    "  WHEN stage3<stage1 AND stage3<stage2 AND stage3<stage4 THEN 'Stage III'\n",
    "  ELSE 'Stage IV' END) AS prediction\n",
    "FROM distance)\n",
    "\n",
    "SELECT a.case_barcode, a.prediction, b.clinical_stage\n",
    "FROM pred AS a\n",
    "JOIN clin AS b\n",
    "ON a.case_barcode = b.case_barcode\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0pvPvxZhQuBj"
   },
   "source": [
    "Now that we have our full query laid out, we can combine all of the sub-queries into the full string and query the data in BigQuery."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "colab_type": "code",
    "id": "P2S-084aRnGu",
    "outputId": "ddbfd065-cf48-4017-a804-345a1259a9c7"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>case_barcode</th>\n",
       "      <th>prediction</th>\n",
       "      <th>clinical_stage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>TCGA-BQ-7055</td>\n",
       "      <td>Stage I</td>\n",
       "      <td>Stage I</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TCGA-2Z-A9JK</td>\n",
       "      <td>Stage IV</td>\n",
       "      <td>Stage III</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>TCGA-EV-5903</td>\n",
       "      <td>Stage IV</td>\n",
       "      <td>Stage I</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>TCGA-2Z-A9JE</td>\n",
       "      <td>Stage I</td>\n",
       "      <td>Stage I</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>TCGA-G7-6797</td>\n",
       "      <td>Stage IV</td>\n",
       "      <td>Stage III</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   case_barcode prediction clinical_stage\n",
       "0  TCGA-BQ-7055    Stage I        Stage I\n",
       "1  TCGA-2Z-A9JK   Stage IV      Stage III\n",
       "2  TCGA-EV-5903   Stage IV        Stage I\n",
       "3  TCGA-2Z-A9JE    Stage I        Stage I\n",
       "4  TCGA-G7-6797   Stage IV      Stage III"
      ]
     },
     "execution_count": 15,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Build the query with all of the subqueries\n",
    "final_query = filtered_clin + filtered_expr + centroids + test + dist + pred\n",
    "\n",
    "# Run the query\n",
    "cn = client.query(final_query).to_dataframe()\n",
    "cn.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Cw2iW_Hf7a1L"
   },
   "source": [
    "## Find Accuracy of the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Qj5PrTvS4I5h"
   },
   "source": [
    "To find the accuracy of the model, we weil use a confusion matrix and then will calculate the accuracy with the following formula<sup>3</sup>:\n",
    "\n",
    "$Accuracy = \\frac{M_{ii}}{\\sum_{j}M_{ji}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 119
    },
    "colab_type": "code",
    "id": "eInd23imZN2S",
    "outputId": "3d4c8615-f4da-4357-a219-43e6f08b1caf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "clinical_stage  Stage I  Stage II  Stage III  Stage IV\n",
      "prediction                                            \n",
      "Stage I              12         0          1         2\n",
      "Stage II              3         2          1         0\n",
      "Stage III             1         0          0         0\n",
      "Stage IV             13         2          3         0\n"
     ]
    }
   ],
   "source": [
    "# Create a confusion matrix for the clinical predictions\n",
    "confusion = pandas.crosstab(cn['prediction'], cn['clinical_stage'])\n",
    "print(confusion)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 102
    },
    "colab_type": "code",
    "id": "6LZs3Mjl2dA_",
    "outputId": "75f1df71-e24f-4167-bc5f-f8f8deb77321"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overall Accuracy is 0.35\n",
      "Stage 1 Accuracy is 0.41\n",
      "Stage 2 Accuracy is 0.5\n",
      "Stage 3 Accuracy is 0.0\n",
      "Stage 4 Accuracy is 0.0\n"
     ]
    }
   ],
   "source": [
    "# Calculate the accuracy of the model\n",
    "print(\"Overall Accuracy is\", (confusion['Stage I'][0] + \n",
    "                              confusion['Stage II'][1] + \n",
    "                              confusion['Stage III'][2] + \n",
    "                              confusion['Stage IV'][3]) / cn.count()[0])\n",
    "print(\"Stage 1 Accuracy is\", round(confusion['Stage I'][0]/sum(confusion['Stage I']), 2))\n",
    "print(\"Stage 2 Accuracy is\", round(confusion['Stage II'][1]/sum(confusion['Stage II']), 2))\n",
    "print(\"Stage 3 Accuracy is\", round(confusion['Stage III'][2]/sum(confusion['Stage III']), 2))\n",
    "print(\"Stage 4 Accuracy is\", round(confusion['Stage IV'][3]/sum(confusion['Stage IV']), 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "u7MkqBvD7Csk"
   },
   "source": [
    "It seems that this model has an overall accuracy of 35% and was able to get close to 41% accuracy for Stage 1 but it was not able to accurately predict any of the other classes. We can tell that this is not a good model for this data set though it could easily be updated to different genes or data sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "W8mOQHpl71Jl"
   },
   "source": [
    "# References\n",
    "\n",
    "<sup>1</sup>Statistical classification - Wikipedia. n.d. <https://en.wikipedia.org/wiki/Statistical_classification>.\n",
    "\n",
    "<sup>2</sup>Euclidean distance - Wikipedia. n.d. <https://en.wikipedia.org/wiki/Euclidean_distance>.\n",
    "\n",
    "<sup>3</sup>Finding Nearest Neighbors in SQL | Sisense. n.d. <https://www.sisense.com/blog/finding-nearest-neighbors-in-sql/>.\n",
    "\n",
    "<sup>4</sup>Machine Learning with Python: Confusion Matrix in Machine Learning with Python. n.d. <https://www.python-course.eu/confusion_matrix.php>.\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "name": "How_to_perform_Nearest_Centroid_Classification_with_BigQuery.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}