{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimensionality reduction on Dirac\n", "#### Device: Dirac-1\n", "\n", "\n", "## Introduction\n", "In machine learning problems, we often have to start with a large number of features. We can use a dimensionality reduction approach to reduce the number of features, especially when the features are used for an unsupervised algorithm such as clustering. In what follows, we present a tutorial on using QCi's technology to implement a QUBO based Singular Value Decomposition (SVD) algorithm. Slightly differently from most other Dirac tutorials, this tutorial does not focus on solving a computationally hard (NP-hard for example) problem. Singular value decompositions can effectively be performed in a similar method to what we used here but using matrix diagonalization routines (for example, those available in [LAPACK](https://www.netlib.org/lapack/) or [ARPACK](https://sites.google.com/view/libarpack/home?authuser=0) which, can be used through SciPy, MATLAB, and Mathematica) instead of QUBO solvers. The motivation here is to show the versatility of our hardware and the underlying QUBO model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importance \n", "\n", "Singular value decompositions, and the highly related task of matrix diagonalization, are a key to many applications. One of the key applications is the one studied here, principal component analysis, which provides a method to extract the most important variables to explain data. The first component is effectively the direction in a high dimensional space over which the data vary the most. Informally, this can be thought of as the combination of variables that are able to provide the most information about the data points. The second explains it the second most, et cetera. Dimensionality reduction is thus the task of building a subset of quantities from the original data that provide the most information, but can be much smaller than the total information contained in each data point. While we will not discuss it here, this can be thought of as taking a subspace of a much higher dimensional space, by effectively fitting to a high-dimensional ellipse and taking the directions where the ellipse is the largest. This process is easily visualized in two dimensions (see for example [the wikipedia article on principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)) but cannot be in the higher dimensions that are of practical interest." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applications\n", "\n", "As an unsupervised method, principal component analysis is good for exploring data without prior knowledge of the structure (a review can be [found here](https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202)). The example here shows a problem related to analyzing prescription data, but there are many other potential applications for principal component analysis. For example, this technique is extensively used in the [chemical analysis of food samples](https://link.springer.com/article/10.1007/s10068-023-01509-5), where the number of chemical tests should be reduced to contain only the most relevant ones. Another example is in [portfolio optimization](https://ir.canterbury.ac.nz/server/api/core/bitstreams/a7b99f2d-5d9b-4212-a947-31c1d162d1c4/content), where diversification can be achieved by choosing stocks that contribute strongly to different principal components. 
A very different application is in [atmospheric science](https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://jiscmail.ac.uk/cgi-bin/filearea.cgi%3FLMGT1%3DENVSTAT%26a%3Dget%26f%3D/jolliffe_0205_article.pdf&ved=2ahUKEwiruPrinc-GAxVkXEEAHWBtJhIQFnoECC0QAQ&usg=AOvVaw0s9wYiSVfs1Cw8aLI4XRPb), where the most important modes of atmospheric movement (weather patterns) can be identified by finding principal components corresponding to sets of regions where quantities such as pressure are highly correlated. However, [care must be taken](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2020ea001275) in how these methods are applied." ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "## Methodology\n", "\n", "Principal Component Analysis (PCA) is a dimensionality reduction\n", "technique based on a Singular Value Decomposition (SVD) of the\n", "data matrix. Consider a dataset with $N$ samples and $d$ features,\n", "represented by an $N \times d$ matrix $F$. We can decompose $F$ as\n", "\n", "$F = U D V^T$\n", "\n", "where $U$ is an $N \times d$ matrix with orthonormal columns, $D$ is a\n", "diagonal $d \times d$ matrix, and $V$ is a $d \times d$ orthogonal matrix,\n", "that is, its columns and rows are orthonormal. Note that the columns of $U$ form an\n", "orthonormal basis ${\{\bf{u_k}\}}_{k \in \{0, 1,..., d-1\}}$ for the\n", "vector space spanned by the columns of $F$. It can be shown that the\n", "larger the $k$-th diagonal element of $D$, the larger the contribution\n", "of the basis element ${\bf{u_k}}$ to $F$. Thus, one can rank the basis\n", "elements ${\bf{u_k}}$ based on the value of $D_{kk}$, and choose a\n", "subset of the most important basis elements; this yields a lower\n", "dimensional representation of the data matrix $F$. Assuming that\n", "$D_{00} > D_{11} > ... > D_{d-1,d-1}$, ${\bf{u_0}}$ is said to be the first\n", "principal component of the original matrix $F$.\n", "\n", "It can be shown that\n", "\n", "${\bf{u_k}} = \frac{1}{\sqrt{\lambda_k}} F {\bf{v_k}}$\n", "\n", "where $\lambda_k$ and ${\bf{v_k}}$ denote the $k$-th eigenvalue and\n", "normalized eigenvector of the symmetric $d \times d$ matrix $G := F^T F$.\n", "Note too that ${\bf{v_k}}$ is in fact the $k$-th column of matrix\n", "$V$ and that $\sqrt{\lambda_k}$ is the $k$-th diagonal element of\n", "matrix $D$. Finding the first principal\n", "component of $F$ is then equivalent to finding the eigenvector of $G$\n", "corresponding to its maximum eigenvalue, that is, the eigenvector of\n", "$-G$ corresponding to its minimum eigenvalue. An approximation to the\n", "first principal component of $F$ can then be obtained by solving a\n", "QUBO,\n", "\n", "$\min_{\bf{q}} {-\bf{q}^{T}} G {\bf{q}}$\n", "\n", "where ${\bf{q}}$ is a $d$-dimensional binary vector. The approximate first principal component of $F$\n", "is then\n", "\n", "${\bf{u_0}} = \frac{1}{\sqrt{\lambda_0}} F {\bf{v_0}}$\n", "\n", "where ${\bf{v_0}} = \frac{{\bf{q}}}{||{\bf{q}}||}$ and $\lambda_0 = {\bf{v_0}}^T G {\bf{v_0}}$. One can then replace each column of\n", "$F$ with its component orthogonal to ${\bf{u_0}}$, and repeat the\n", "above-mentioned process to get the second principal component\n", "${\bf{u_1}}$. This can be repeated to get any desired number of\n", "principal components of $F$, say ${\bf{u_0}}, {\bf{u_1}},..., {\bf{u_{d^\prime}}}$, where $d^\prime < d$.\n",
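"\n", "As a small worked illustration of the QUBO step (for exposition only, not part of the workflow below): for $d = 2$, the minimization runs over the four binary vectors ${\bf{q}} \in \{(0,0), (1,0), (0,1), (1,1)\}$, for which the objective $-{\bf{q}}^{T} G {\bf{q}}$ takes the values $0$, $-G_{00}$, $-G_{11}$, and $-(G_{00} + 2 G_{01} + G_{11})$, respectively; the minimizer indicates whether the dominant direction lies closer to one of the two feature axes or to their diagonal. For larger $d$, Dirac-1 searches this exponentially large set of candidate directions by sampling low-energy solutions of the QUBO.\n", "\n",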
"To remove a principal component ${\bf{u}}$ from a feature matrix $F$, we have\n", "\n", "$F^{new} = F - {\bf{u}} {\bf{u}^{T}} F$\n", "\n", "Note that this procedure is an approximation of a full PCA, because each\n", "direction is obtained from a binary vector ${\bf{q}}$ rather than a\n", "continuous one. The approximation is equivalent to quantizing the\n", "directions in feature space, so that PCA chooses from a potentially\n", "large but finite set of candidate directions, namely those that\n", "correspond to $d$-dimensional binary vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Medicare Prescription Data\n", "\n", "We implement this approach using a publicly available dataset on opioid prescriptions in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean data\n", "\n", "We start by cleaning the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Input \n", "INP_FILE = \"Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv\"\n", "OUT_FILE = \"cleaned_medicare_data.csv\"\n", "\n", "CON_VARS = [\n", " \"total_claim_count\",\n", " \"total_30_day_fill_count\",\n", " \"total_drug_cost\",\n", " \"total_day_supply\",\n", " \"bene_count\",\n", " \"total_claim_count_ge65\",\n", " \"total_30_day_fill_count_ge65\",\n", " \"total_drug_cost_ge65\",\n", " \"total_day_supply_ge65\",\n", " \"bene_count_ge65\",\n", " \"brand_claim_count\",\n", " \"brand_drug_cost\",\n", " \"generic_claim_count\",\n", " \"generic_drug_cost\",\n", " \"other_claim_count\",\n", " \"other_drug_cost\",\n", " \"mapd_claim_count\",\n", " \"mapd_drug_cost\",\n", " \"pdp_claim_count\",\n", " \"pdp_drug_cost\",\n", " \"lis_claim_count\",\n", " \"lis_drug_cost\",\n", " \"nonlis_claim_count\",\n", " \"nonlis_drug_cost\",\n", " \"opioid_claim_count\",\n", " \"opioid_drug_cost\",\n", " \"opioid_day_supply\",\n", " \"opioid_bene_count\",\n", " \"opioid_prescriber_rate\",\n", " \"antibiotic_claim_count\",\n", " \"antibiotic_drug_cost\",\n", " \"antibiotic_bene_count\",\n", " \"hrm_claim_count_ge65\",\n", " \"hrm_drug_cost_ge65\",\n", " \"hrm_bene_count_ge65\",\n", " \"antipsych_claim_count_ge65\",\n", " \"antipsych_drug_cost_ge65\",\n", " \"antipsych_bene_count_ge65\",\n", " \"average_age_of_beneficiaries\",\n", " \"beneficiary_age_less_65_count\",\n", " \"beneficiary_age_65_74_count\",\n", " \"beneficiary_age_75_84_count\",\n", " \"beneficiary_age_greater_84_count\",\n", " \"beneficiary_female_count\",\n", " \"beneficiary_male_count\",\n", " \"beneficiary_race_white_count\",\n", " \"beneficiary_race_black_count\",\n", " \"beneficiary_race_asian_pi_count\",\n", " \"beneficiary_race_hispanic_count\",\n", " \"beneficiary_race_nat_ind_count\",\n", " \"beneficiary_race_other_count\",\n", " \"beneficiary_nondual_count\",\n", " \"beneficiary_dual_count\",\n", " \"beneficiary_average_risk_score\",\n", "]\n", "\n", "VALID_PROVIDER_MI = [\n", " \"A\",\n", " \"M\",\n", " \"J\",\n", " \"L\",\n", " \"R\",\n", " \"S\",\n", " \"E\",\n", " \"D\",\n", " \"C\",\n", " \"B\",\n", " \"K\",\n", " \"P\",\n", " \"W\",\n", " \"H\",\n", " \"T\",\n", " \"G\",\n", " \"F\",\n", " \"N\",\n", " \"V\",\n", " \"I\",\n", " \"O\",\n", " \"Y\",\n", " \"Z\",\n", " \"U\",\n", " \"Q\",\n", " \"X\",\n", "]\n", "\n", "VALID_GEN = [\"F\", \"M\",
\"Other\", \"Unknown\"]\n", "\n", "VALID_ENTITIES = [\"I\", \"O\"]\n", "\n", "VALID_DESC_FLAGS = [\"S\", \"T\"]\n", "\n", "VALID_ENROLLS = [\"E\", \"N\", \"O\"]\n", "\n", "# Some utilities \n", "def convert_to_float(x):\n", " try:\n", " return float(x)\n", " except:\n", " return None\n", " \n", "def convert_to_int(x):\n", " try:\n", " return int(float(x))\n", " except:\n", " return None\n", "\n", "# Read data \n", "df = pd.read_csv(INP_FILE, on_bad_lines = \"skip\", low_memory=False)\n", " \n", "# Clean categorical variables \n", "df[\"nppes_provider_mi\"] = df[\"nppes_provider_mi\"].fillna(\"Unknown\")\n", "df[\"nppes_provider_mi\"] = df[\"nppes_provider_mi\"].apply(\n", " lambda x: x if x in VALID_PROVIDER_MI else \"Unknown\"\n", ")\n", "\n", "df[\"nppes_credentials\"] = df[\"nppes_credentials\"].fillna(\"Unknown\")\n", "df[\"nppes_credentials\"] = df[\"nppes_credentials\"].apply(\n", " lambda x: str(x).replace(\".\", \"\")\n", ")\n", "\n", "cred_hash = {\n", " \"MEDICAL DOCTOR\": \"MD\",\n", " \"NURSE PRACTITIONER\": \"NP\",\n", "}\n", "df[\"nppes_credentials\"] = df[\"nppes_credentials\"].apply(\n", " lambda x: cred_hash[x] if x in cred_hash else x,\n", ")\n", "\n", "df[\"nppes_provider_gender\"] = df[\"nppes_provider_gender\"].fillna(\"Unknown\")\n", "df[\"nppes_provider_gender\"] = df[\"nppes_provider_gender\"].apply(\n", " lambda x: x if x in VALID_GEN else \"Other\",\n", ")\n", "\n", "df[\"nppes_entity_code\"] = df[\"nppes_entity_code\"].apply(\n", " lambda x: x if x in VALID_ENTITIES else \"Unknown\",\n", ")\n", "\n", "df[\"nppes_provider_zip5\"] = df[\"nppes_provider_zip5\"].fillna(\"Unknown\")\n", "\n", "df[\"nppes_provider_country\"] = df[\"nppes_provider_country\"].apply(\n", " lambda x: \"US\" if x == \"US\" else \"Other\",\n", ")\n", "\n", "df[\"description_flag\"] = df[\"description_flag\"].apply(\n", " lambda x: x if x in VALID_DESC_FLAGS else \"Unknown\",\n", ")\n", "\n", "df[\"medicare_prvdr_enroll_status\"] = df[\"medicare_prvdr_enroll_status\"].apply(\n", " lambda x: x if x in VALID_ENROLLS else \"Unknown\",\n", ")\n", "\n", "\n", "# Treat missing beneficiary count as it cannot be zero \n", "df[\"bene_count\"] = df[\"bene_count\"].apply(\n", " convert_to_int\n", ").fillna(-1)\n", "\n", "tmp_df = df.groupby(\n", " \"specialty_description\", as_index=False,\n", ")[\"bene_count\"].mean()\n", "\n", "bene_count_hash = dict(\n", " zip(\n", " tmp_df[\"specialty_description\"],\n", " tmp_df[\"bene_count\"],\n", " )\n", ")\n", "df[\"bene_count\"] = df.apply(\n", " lambda x: x[\"bene_count\"] if x[\n", " \"bene_count\"\n", " ] > 0 else bene_count_hash[\n", " x[\"specialty_description\"]\n", " ],\n", " axis=1,\n", ")\n", "\n", "# Treat continuous variables \n", "for item in CON_VARS:\n", " df[item] = df[item].apply(\n", " convert_to_float\n", " ).fillna(0.0)\n", "\n", "# Filter out invalid states \n", "df = df[\n", " ~df[\"nppes_provider_state\"].isin(\n", " [\"XX\", \"E\", \"N\", \"S\"]\n", " )\n", "]\n", "\n", "# Output \n", "df.to_csv(OUT_FILE, index=False)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate features\n", "\n", "We then generate features. The categorical features are encoded using the average value of a few important variables in each category." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Input \n", "INP_FILE = \"cleaned_medicare_data.csv\"\n", "OUT_FILE = \"medicare_features.csv\"\n", "\n", "# Set some parameters \n", "CAT_VARS = [\n", " \"nppes_provider_mi\",\n", " #\"nppes_credentials\", # This is rather messy, so ignoring it. \n", " \"nppes_provider_gender\",\n", " \"nppes_entity_code\",\n", " \"nppes_provider_city\",\n", " \"nppes_provider_zip5\",\n", " #\"nppes_provider_country\", # Almost all cases are US \n", " \"specialty_description\",\n", " \"medicare_prvdr_enroll_status\",\n", " \"nppes_provider_state\",\n", "]\n", "\n", "CON_VARS = [\n", " \"total_claim_count\",\n", " \"total_30_day_fill_count\",\n", " \"total_drug_cost\",\n", " \"total_day_supply\",\n", " \"bene_count\",\n", " \"total_claim_count_ge65\",\n", " \"total_30_day_fill_count_ge65\",\n", " \"total_drug_cost_ge65\",\n", " \"total_day_supply_ge65\",\n", " \"bene_count_ge65\",\n", " \"brand_claim_count\",\n", " \"brand_drug_cost\",\n", " \"generic_claim_count\",\n", " \"generic_drug_cost\",\n", " \"other_claim_count\",\n", " \"other_drug_cost\",\n", " \"mapd_claim_count\",\n", " \"mapd_drug_cost\",\n", " \"pdp_claim_count\",\n", " \"pdp_drug_cost\",\n", " \"lis_claim_count\",\n", " \"lis_drug_cost\",\n", " \"nonlis_claim_count\",\n", " \"nonlis_drug_cost\",\n", " \"opioid_claim_count\",\n", " \"opioid_drug_cost\",\n", " \"opioid_day_supply\",\n", " \"opioid_bene_count\",\n", " \"antibiotic_claim_count\",\n", " \"antibiotic_drug_cost\",\n", " \"antibiotic_bene_count\",\n", " \"hrm_claim_count_ge65\",\n", " \"hrm_drug_cost_ge65\",\n", " \"hrm_bene_count_ge65\",\n", " \"antipsych_claim_count_ge65\",\n", " \"antipsych_drug_cost_ge65\",\n", " \"antipsych_bene_count_ge65\",\n", " \"average_age_of_beneficiaries\",\n", " \"beneficiary_age_less_65_count\",\n", " \"beneficiary_age_65_74_count\",\n", " \"beneficiary_age_75_84_count\",\n", " \"beneficiary_age_greater_84_count\",\n", " \"beneficiary_female_count\",\n", " \"beneficiary_male_count\",\n", " \"beneficiary_race_white_count\",\n", " \"beneficiary_race_black_count\",\n", " \"beneficiary_race_asian_pi_count\",\n", " \"beneficiary_race_hispanic_count\",\n", " \"beneficiary_race_nat_ind_count\",\n", " \"beneficiary_race_other_count\",\n", " \"beneficiary_nondual_count\",\n", " \"beneficiary_dual_count\",\n", " \"beneficiary_average_risk_score\",\n", "]\n", "\n", "# Read and clean data \n", "df = pd.read_csv(INP_FILE, low_memory=False)\n", "\n", "# Embed categorical features \n", "embedded_cat_features = []\n", "for item in CAT_VARS:\n", " tmp_df = df.groupby(item, as_index=False).agg(\n", " {\n", " \"opioid_claim_count\": \"mean\",\n", " \"opioid_drug_cost\": \"mean\",\n", " \"opioid_day_supply\": \"mean\",\n", " \"opioid_bene_count\": \"mean\",\n", " \"opioid_prescriber_rate\": \"mean\",\n", " }\n", " ).rename(\n", " columns={\n", " \"opioid_claim_count\": \"%s_opioid_claim_count\" % item,\n", " \"opioid_drug_cost\": \"%s_opioid_drug_cost\" % item,\n", " \"opioid_day_supply\": \"%s_opioid_day_supply\" % item,\n", " \"opioid_bene_count\": \"%s_opioid_bene_count\" % item,\n", " \"opioid_prescriber_rate\": \"%s_opioid_prescriber_rate\" % item,\n", " }\n", " )\n", "\n", " df = df.merge(tmp_df, how=\"left\", on=item)\n", "\n", " embedded_cat_features += [\n", " \"%s_opioid_claim_count\" % item,\n", " \"%s_opioid_drug_cost\" % item,\n", " \"%s_opioid_day_supply\" % item,\n", " \"%s_opioid_bene_count\" % item,\n", " 
\"%s_opioid_prescriber_rate\" % item,\n", " ]\n", "\n", "# Drop unembedded categorical variables and some others \n", "df = df[[\"npi\"] + CON_VARS + embedded_cat_features]\n", "\n", "# Write features file \n", "df.to_csv(OUT_FILE, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dimensionality Reduction\n", "\n", "Once the features are generated, we can implement the above-mentioned SVD algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Import libs\n", "import sys\n", "import os\n", "import time\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from qci_client import QciClient\n", "\n", "# Define some parameters\n", "FEATURES_FILE = \"medicare_features.csv\"\n", "REDUCED_DIM = 3\n", "\n", "# Read features\n", "df = pd.read_csv(FEATURES_FILE, low_memory=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now print the feature names and get the total count of features in the dataset," ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original dimension is 93; reduced dimension will be 3\n" ] } ], "source": [ "feature_names = list(set(df.columns) - {\"npi\"})\n", "\n", "orig_dim = len(feature_names)\n", "\n", "print(\n", " \"Original dimension is %d; reduced dimension will be %d\" % (\n", " orig_dim,\n", " REDUCED_DIM,\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now define a function that gets the first principal component of features by solving a QUBO, " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "token = \"your_token\"\n", "api_url = \"https://api.qci-prod.com\"\n", "qci = QciClient(api_token=token, url=api_url)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def get_first_principal_comp(F):\n", " \n", " qubo = -np.matmul(F.transpose(), F)\n", "\n", " # Make sure matrix is symmetric to machine precision\n", " qubo = 0.5 * (qubo + qubo.transpose())\n", "\n", " # Create json objects \n", " qubo_json = {\n", " \"file_name\": \"qubo_tutorial.json\",\n", " \"file_config\": {\n", " \"qubo\": {\"data\": qubo, \"num_variables\": orig_dim},\n", " } \n", " }\n", " \n", " # Solve the optimizzation problem\n", " #qci = QciClient()\n", "\n", " response_json = qci.upload_file(file=qubo_json)\n", " qubo_file_id = response_json[\"file_id\"]\n", "\n", " # Setup job json\n", " job_params = {\n", " \"device_type\": \"dirac-1\", \n", " \"alpha\": 1.0, \n", " \"num_samples\": 20,\n", " }\n", " job_json = qci.build_job_body(\n", " job_type=\"sample-qubo\", \n", " job_params=job_params,\n", " qubo_file_id=qubo_file_id,\n", " job_name=\"tutorial_eqc1\",\n", " job_tags=[\"tutorial_eqc1\"],\n", " )\n", " print(job_json)\n", "\n", " # Run the job\n", " job_response_json = qci.process_job(\n", " job_body=job_json,\n", " )\n", "\n", " print(job_response_json)\n", "\n", " results = job_response_json[\"results\"]\n", " energies = results[\"energies\"]\n", " samples = results[\"solutions\"]\n", "\n", " if True:\n", " print(\"Energies:\", energies) \n", "\n", " q = np.array(samples[0])\n", "\n", " assert len(q) == orig_dim, \"Inconsistent solution size!\"\n", "\n", " fct = np.linalg.norm(q)\n", " if fct > 0:\n", " fct = 1.0 / fct\n", "\n", " v0 = fct * q\n", " v0 = 
v0.reshape((v0.shape[0], 1))\n", "\n", " lambda0 = np.matmul(\n", " np.matmul(v0.transpose(), -qubo),\n", " v0\n", " )[0][0]\n", "\n", " assert lambda0 >= 0, \"Unexpected negative eigenvalue!\"\n", "\n", " fct = np.sqrt(lambda0)\n", " if fct > 0:\n", " fct = 1.0 / fct\n", "\n", " u0 = fct * np.matmul(F, v0)\n", " u0 = u0.reshape(-1)\n", "\n", " fct = np.linalg.norm(u0)\n", " if fct > 0:\n", " fct = 1.0 / fct\n", "\n", " u0 = fct * u0\n", "\n", " return u0 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can get the first REDUCED_DIM components by applying the above function recursively," ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba43298263204a36574d0'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}\n", "2024-05-08 09:11:30 - Dirac allocation balance = 0 s (unmetered)\n", "2024-05-08 09:11:30 - Job submitted: job_id='663ba432d448b017e54f94a8'\n", "2024-05-08 09:11:31 - QUEUED\n", "2024-05-08 09:11:33 - RUNNING\n", "2024-05-08 09:16:54 - COMPLETED\n", "2024-05-08 09:16:57 - Dirac allocation balance = 0 s (unmetered)\n", "{'job_info': {'job_id': '663ba432d448b017e54f94a8', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba43298263204a36574d0'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:11:30.979Z', 'queued_at_rfc3339nano': '2024-05-08T16:11:30.98Z', 'running_at_rfc3339nano': '2024-05-08T16:11:31.644Z', 'completed_at_rfc3339nano': '2024-05-08T16:16:52.435Z'}, 'job_result': {'file_id': '663ba57498263204a36574d2', 'device_usage_s': 281}}, 'status': 'COMPLETED', 'results': {'counts': [19, 1], 'energies': [-2802848646149886000, -2802846722004537300], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}\n", "Energies: [-2802848646149886000, -2802846722004537300]\n", "{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba5a298263204a36574d4'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}\n", "2024-05-08 09:17:38 - Dirac allocation balance = 0 s (unmetered)\n", "2024-05-08 09:17:38 - Job submitted: job_id='663ba5a2d448b017e54f94a9'\n", "2024-05-08 09:17:38 - QUEUED\n", "2024-05-08 09:17:41 - RUNNING\n", "2024-05-08 09:23:02 - COMPLETED\n", "2024-05-08 09:23:04 - Dirac allocation balance = 0 s (unmetered)\n", "{'job_info': {'job_id': '663ba5a2d448b017e54f94a9', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba5a298263204a36574d4'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 
'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:17:38.906Z', 'queued_at_rfc3339nano': '2024-05-08T16:17:38.909Z', 'running_at_rfc3339nano': '2024-05-08T16:17:39.763Z', 'completed_at_rfc3339nano': '2024-05-08T16:22:59.746Z'}, 'job_result': {'file_id': '663ba6e398263204a36574d8', 'device_usage_s': 279}}, 'status': 'COMPLETED', 'results': {'counts': [13, 4, 2, 1], 'energies': [-60631112719794140, -60631112719794140, -60631108424826850, -60631052590252000], 'solutions': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]}}\n", "Energies: [-60631112719794140, -60631112719794140, -60631108424826850, -60631052590252000]\n", "{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba70d98263204a36574da'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}\n", "2024-05-08 09:23:41 - Dirac allocation balance = 0 s (unmetered)\n", "2024-05-08 09:23:41 - Job submitted: job_id='663ba70dd448b017e54f94ab'\n", "2024-05-08 09:23:41 - QUEUED\n", "2024-05-08 09:37:48 - RUNNING\n", "2024-05-08 09:43:10 - COMPLETED\n", "2024-05-08 09:43:12 - Dirac allocation balance = 0 s (unmetered)\n", "{'job_info': {'job_id': '663ba70dd448b017e54f94ab', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba70d98263204a36574da'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:23:41.935Z', 'queued_at_rfc3339nano': '2024-05-08T16:23:41.936Z', 'running_at_rfc3339nano': '2024-05-08T16:37:48.722Z', 'completed_at_rfc3339nano': '2024-05-08T16:43:08.614Z'}, 'job_result': {'file_id': '663bab9c98263204a36574de', 'device_usage_s': 279}}, 'status': 'COMPLETED', 'results': {'counts': [12, 8], 'energies': [-31525704197734400, -31525704197734400], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0]]}}\n", "Energies: [-31525704197734400, -31525704197734400]\n" ] } ], "source": [ "F = np.array(df[feature_names])\n", "\n", "U = []\n", "for i in range(min(REDUCED_DIM, F.shape[1])):\n", "\n", " u = get_first_principal_comp(F)\n", " U.append(u)\n", " u = u.reshape((u.shape[0], 1))\n", " F = F - np.matmul(\n", " u,\n", " np.matmul(u.transpose(), F),\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save the results and echo some information," ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 1022499)\n" ] } ], "source": [ "U = np.array(U)\n", "print(U.shape)\n", "np.save(\"U.npy\", U)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, we can compute a classical PCA using scikit-learn, " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import normalize\n", "\n", "F = np.array(df[feature_names])\n", "pca_classical = PCA(n_components=REDUCED_DIM)\n", "U_classical = pca_classical.fit_transform(F)\n", "U_classical = normalize(U_classical, axis=0, norm=\"l2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and compare the two approaches by taking the absolute values of the dot products between corresponding unit-norm components; values close to 1 indicate that the QUBO-based components closely match the classical ones," ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 1022499)\n", "(1022499, 3)\n", "[0.9101745 0.93294887 0.82830476]\n" ] } ], "source": [ "U = np.load(\"U.npy\")\n", "#U = U_classical.transpose()\n", "print(U.shape)\n", "print(U_classical.shape)\n", "print(\n", " abs(\n", " np.diag(\n", " np.matmul(U, U_classical)\n", " )\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "In this tutorial, we have demonstrated the versatility of our Dirac-1 machine and its underlying QUBO model. Rather than solving an NP-hard optimization problem, we showed that our device can be used for dimensionality reduction through principal component analysis, a powerful matrix-based data-science method. For more traditional QUBO applications, see [here](https://quantumcomputinginc.com/learn/tutorials-and-use-cases/qubo-on-dirac). These include a QUBO-based machine learning method known as [QBoost](https://quantumcomputinginc.com/learn/tutorials-and-use-cases/qboost-for-qubo). You could also try out these devices on your own problems.\n", "\n", "Like [portfolio optimization](https://quantumcomputinginc.com/learn/tutorials-and-use-cases/portfolio-optimization-on-dirac), [QBoost](https://quantumcomputinginc.com/learn/tutorials-and-use-cases/qboost-for-qubo), and [feature selection](https://quantumcomputinginc.com/learn/tutorials-and-use-cases/feature-selection-on-dirac), this tutorial is a variation on the theme of taking advantage of the correlation structure of an underlying dataset. Such problems keep arising both because they have important applications and because they are naturally expressed as QUBOs, given the importance of two-body correlations. Trying one of these tutorials is a natural next step."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }