{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimensionality reduction on Dirac\n", "#### Device: Dirac-1\n", "\n", "\n", "## Introduction\n", "In machine learning problems, we often have to start with a large number of features. We can use a dimensionality reduction approach to reduce the number of features, especially when the features are used for an unsupervised algorithm such as clustering. In what follows we present a tutorial on using QCi's technology to implement a QUBO based Singular Value Decomposition (SVD) algorithm." ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "## Methodology\n", "\n", "Principal Component Analysis (PCA) is a dimensionality reduction\n", "technique which is based on a Singular Value Decomposition (SVD) of the\n", "data matrix. Let us have a dataset with $N$ samples and $d$ features,\n", "represented by a $N \\times d$ matrix $F$. We can decompose $F$ as,\n", "\n", "$F = U D V^T$\n", "\n", "where $U$ is an $N \\times d$ matrix with orthonormal columns, $D$ is a\n", "diagonal square $d \\times d$ matrix, and $V$ is an orthogonal matrix\n", "with orthonormal columns and rows. Note that columns of $U$ form an\n", "orthonormal basis ${\\{\\bf{u_k}\\}}_{k \\in \\{1, 2,..., d\\}}$ for the\n", "vector space that is spanned by columns $F$. It can be shown that the\n", "larger the $k$-th diagonal element of $D$, the more the contribution\n", "of the basis element ${\\bf{u_k}}$. Thus, one can rank the basis\n", "elements ${\\bf{u_k}}$ based on the value of $D_{kk}$, and choose a\n", "subset of most important basis elements; this yields a lower\n", "dimensional representation of the data matrix $F$. Assuming that\n", "$D_{00} > D_{11} > ... > D_{dd}$, ${\\bf{u_0}}$ is said to be the first\n", "principal component of the original matrix $F$.\n", "\n", "It can be shown that\n", "\n", "${\\bf{u_k}} = \\frac{1}{\\sqrt{\\lambda_k}} F {\\bf{v_k}}$\n", "\n", "where $\\lambda_k$ and ${\\bf{v_k}}$ denote the $k$-th eigenvalue and\n", "normalized eigenvector of the orthogonal $d \\times d$ matrix $G := F^T \n", "F$. Note too that ${\\bf{v_k}}$ is in fact the $k$-th column of matrix\n", "$V$ and that $\\sqrt{\\lambda_k}$ is the $k$-th diagonal element of\n", "matrix $D$. Finding the first principal\n", "component of $F$ is then equivalent to finding the eigenvector of $G$\n", "corresponding to its maximum eigenvalue, that is the eigenvector of\n", "$-G$ corresponding to its minimum eigenvalue. An estimation to the\n", "first principal component of $F$ can then be obtained by solving a\n", "QUBO as,\n", "\n", "$\\min_{\\bf{q}} {-\\bf{q}^{T}} G {\\bf{q}}$\n", "\n", "where ${\\bf{q}}$ is a $d$-dimensional binary vector. The first principal component of $F$\n", "is,\n", "\n", "${\\bf{u_0}} = \\frac{1}{\\sqrt{\\lambda_0}} F {\\bf{q}}$\n", "\n", "where $\\lambda_0 = {\\bf{v_0}}^T G {\\bf{v_0}}$, and ${\\bf{v_0}} = \n", "\\frac{{\\bf{q}}}{||{\\bf{q}}||}$. One can then replace each column of\n", "$F$ with its component orthogonal to ${\\bf{u_0}}$, and repeat the\n", "above-mentioned process to get the second principal component\n", "${\\bf{u_1}}$. This can be repeated to get any desired number of\n", "principal components of $F$, say ${\\bf{u_0}}, {\\bf{u_1}},..., \n", "{\\bf{u_{d^\\prime}}}$, where $d^\\prime < d$. To remove a principal\n", "component ${\\bf{u}}$ from a feature matrix $F$, we have,\n", "\n", "$F^{new} = F - {\\bf{u}} {\\bf{u}^{T}} F$\n", "\n", "Note that this is an estimation to a full PCA as the vector $q$ in\n", "the above equation is a binary vector. This\n", "estimation is equivalent to quantizing the directions in feature space\n", "such that a potentially large but finite set of directions can be\n", "chosen by PCA. These are the directions that correspond to the\n", "$d$-dimensional binary vector $q$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Medicare Prescription Data\n", "\n", "We implemented this approach using a publically available set of data on prescription of opioids in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean data\n", "\n", "We start by cleaning the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Input \n", "INP_FILE = \"Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv\"\n", "OUT_FILE = \"cleaned_medicare_data.csv\"\n", "\n", "CON_VARS = [\n", " \"total_claim_count\",\n", " \"total_30_day_fill_count\",\n", " \"total_drug_cost\",\n", " \"total_day_supply\",\n", " \"bene_count\",\n", " \"total_claim_count_ge65\",\n", " \"total_30_day_fill_count_ge65\",\n", " \"total_drug_cost_ge65\",\n", " \"total_day_supply_ge65\",\n", " \"bene_count_ge65\",\n", " \"brand_claim_count\",\n", " \"brand_drug_cost\",\n", " \"generic_claim_count\",\n", " \"generic_drug_cost\",\n", " \"other_claim_count\",\n", " \"other_drug_cost\",\n", " \"mapd_claim_count\",\n", " \"mapd_drug_cost\",\n", " \"pdp_claim_count\",\n", " \"pdp_drug_cost\",\n", " \"lis_claim_count\",\n", " \"lis_drug_cost\",\n", " \"nonlis_claim_count\",\n", " \"nonlis_drug_cost\",\n", " \"opioid_claim_count\",\n", " \"opioid_drug_cost\",\n", " \"opioid_day_supply\",\n", " \"opioid_bene_count\",\n", " \"opioid_prescriber_rate\",\n", " \"antibiotic_claim_count\",\n", " \"antibiotic_drug_cost\",\n", " \"antibiotic_bene_count\",\n", " \"hrm_claim_count_ge65\",\n", " \"hrm_drug_cost_ge65\",\n", " \"hrm_bene_count_ge65\",\n", " \"antipsych_claim_count_ge65\",\n", " \"antipsych_drug_cost_ge65\",\n", " \"antipsych_bene_count_ge65\",\n", " \"average_age_of_beneficiaries\",\n", " \"beneficiary_age_less_65_count\",\n", " \"beneficiary_age_65_74_count\",\n", " \"beneficiary_age_75_84_count\",\n", " \"beneficiary_age_greater_84_count\",\n", " \"beneficiary_female_count\",\n", " \"beneficiary_male_count\",\n", " \"beneficiary_race_white_count\",\n", " \"beneficiary_race_black_count\",\n", " \"beneficiary_race_asian_pi_count\",\n", " \"beneficiary_race_hispanic_count\",\n", " \"beneficiary_race_nat_ind_count\",\n", " \"beneficiary_race_other_count\",\n", " \"beneficiary_nondual_count\",\n", " \"beneficiary_dual_count\",\n", " \"beneficiary_average_risk_score\",\n", "]\n", "\n", "VALID_PROVIDER_MI = [\n", " \"A\",\n", " \"M\",\n", " \"J\",\n", " \"L\",\n", " \"R\",\n", " \"S\",\n", " \"E\",\n", " \"D\",\n", " \"C\",\n", " \"B\",\n", " \"K\",\n", " \"P\",\n", " \"W\",\n", " \"H\",\n", " \"T\",\n", " \"G\",\n", " \"F\",\n", " \"N\",\n", " \"V\",\n", " \"I\",\n", " \"O\",\n", " \"Y\",\n", " \"Z\",\n", " \"U\", \n", " \"Q\",\n", " \"X\",\n", "]\n", "\n", "VALID_GEN = [\"F\", \"M\", \"Other\", \"Unknown\"]\n", "\n", "VALID_ENTITIES = [\"I\", \"O\"]\n", "\n", "VALID_DESC_FLAGS = [\"S\", \"T\"]\n", "\n", "VALID_ENROLLS = [\"E\", \"N\", \"O\"]\n", "\n", "# Some utilities \n", "def convert_to_float(x):\n", " try:\n", " return float(x)\n", " except:\n", " return None\n", " \n", "def convert_to_int(x):\n", " try:\n", " return int(float(x))\n", " except:\n", " return None\n", "\n", "# Read data \n", "df = pd.read_csv(INP_FILE, on_bad_lines = \"skip\", low_memory=False)\n", " \n", "# Clean categorical variables \n", "df[\"nppes_provider_mi\"] = df[\"nppes_provider_mi\"].fillna(\"Unknown\")\n", "df[\"nppes_provider_mi\"] = df[\"nppes_provider_mi\"].apply(\n", " lambda x: x if x in VALID_PROVIDER_MI else \"Unknown\"\n", ")\n", "\n", "df[\"nppes_credentials\"] = df[\"nppes_credentials\"].fillna(\"Unknown\")\n", "df[\"nppes_credentials\"] = df[\"nppes_credentials\"].apply(\n", " lambda x: str(x).replace(\".\", \"\")\n", ")\n", "\n", "cred_hash = {\n", " \"MEDICAL DOCTOR\": \"MD\",\n", " \"NURSE PRACTITIONER\": \"NP\",\n", "}\n", "df[\"nppes_credentials\"] = df[\"nppes_credentials\"].apply(\n", " lambda x: cred_hash[x] if x in cred_hash else x,\n", ")\n", "\n", "df[\"nppes_provider_gender\"] = df[\"nppes_provider_gender\"].fillna(\"Unknown\")\n", "df[\"nppes_provider_gender\"] = df[\"nppes_provider_gender\"].apply(\n", " lambda x: x if x in VALID_GEN else \"Other\",\n", ")\n", "\n", "df[\"nppes_entity_code\"] = df[\"nppes_entity_code\"].apply(\n", " lambda x: x if x in VALID_ENTITIES else \"Unknown\",\n", ")\n", "\n", "df[\"nppes_provider_zip5\"] = df[\"nppes_provider_zip5\"].fillna(\"Unknown\")\n", "\n", "df[\"nppes_provider_country\"] = df[\"nppes_provider_country\"].apply(\n", " lambda x: \"US\" if x == \"US\" else \"Other\",\n", ")\n", "\n", "df[\"description_flag\"] = df[\"description_flag\"].apply(\n", " lambda x: x if x in VALID_DESC_FLAGS else \"Unknown\",\n", ")\n", "\n", "df[\"medicare_prvdr_enroll_status\"] = df[\"medicare_prvdr_enroll_status\"].apply(\n", " lambda x: x if x in VALID_ENROLLS else \"Unknown\",\n", ")\n", "\n", "\n", "# Treat missing beneficiary count as it cannot be zero \n", "df[\"bene_count\"] = df[\"bene_count\"].apply(\n", " convert_to_int\n", ").fillna(-1)\n", "\n", "tmp_df = df.groupby(\n", " \"specialty_description\", as_index=False,\n", ")[\"bene_count\"].mean()\n", "\n", "bene_count_hash = dict(\n", " zip(\n", " tmp_df[\"specialty_description\"],\n", " tmp_df[\"bene_count\"],\n", " )\n", ")\n", "df[\"bene_count\"] = df.apply(\n", " lambda x: x[\"bene_count\"] if x[\n", " \"bene_count\"\n", " ] > 0 else bene_count_hash[\n", " x[\"specialty_description\"]\n", " ],\n", " axis=1,\n", ")\n", "\n", "# Treat continuous variables \n", "for item in CON_VARS:\n", " df[item] = df[item].apply(\n", " convert_to_float\n", " ).fillna(0.0)\n", "\n", "# Filter out invalid states \n", "df = df[\n", " ~df[\"nppes_provider_state\"].isin(\n", " [\"XX\", \"E\", \"N\", \"S\"]\n", " )\n", "]\n", "\n", "# Output \n", "df.to_csv(OUT_FILE, index=False)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate features\n", "\n", "We then generate features. The categorical features are encoded using the average value of a few important variables in each category." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Input \n", "INP_FILE = \"cleaned_medicare_data.csv\"\n", "OUT_FILE = \"medicare_features.csv\"\n", "\n", "# Set some parameters \n", "CAT_VARS = [\n", " \"nppes_provider_mi\",\n", " #\"nppes_credentials\", # This is rather messy, so ignoring it. \n", " \"nppes_provider_gender\",\n", " \"nppes_entity_code\",\n", " \"nppes_provider_city\",\n", " \"nppes_provider_zip5\",\n", " #\"nppes_provider_country\", # Almost all cases are US \n", " \"specialty_description\",\n", " \"medicare_prvdr_enroll_status\",\n", " \"nppes_provider_state\",\n", "]\n", "\n", "CON_VARS = [\n", " \"total_claim_count\",\n", " \"total_30_day_fill_count\",\n", " \"total_drug_cost\",\n", " \"total_day_supply\",\n", " \"bene_count\",\n", " \"total_claim_count_ge65\",\n", " \"total_30_day_fill_count_ge65\",\n", " \"total_drug_cost_ge65\",\n", " \"total_day_supply_ge65\",\n", " \"bene_count_ge65\",\n", " \"brand_claim_count\",\n", " \"brand_drug_cost\",\n", " \"generic_claim_count\",\n", " \"generic_drug_cost\",\n", " \"other_claim_count\",\n", " \"other_drug_cost\",\n", " \"mapd_claim_count\",\n", " \"mapd_drug_cost\",\n", " \"pdp_claim_count\",\n", " \"pdp_drug_cost\",\n", " \"lis_claim_count\",\n", " \"lis_drug_cost\",\n", " \"nonlis_claim_count\",\n", " \"nonlis_drug_cost\",\n", " \"opioid_claim_count\",\n", " \"opioid_drug_cost\",\n", " \"opioid_day_supply\",\n", " \"opioid_bene_count\",\n", " \"antibiotic_claim_count\",\n", " \"antibiotic_drug_cost\",\n", " \"antibiotic_bene_count\",\n", " \"hrm_claim_count_ge65\",\n", " \"hrm_drug_cost_ge65\",\n", " \"hrm_bene_count_ge65\",\n", " \"antipsych_claim_count_ge65\",\n", " \"antipsych_drug_cost_ge65\",\n", " \"antipsych_bene_count_ge65\",\n", " \"average_age_of_beneficiaries\",\n", " \"beneficiary_age_less_65_count\",\n", " \"beneficiary_age_65_74_count\",\n", " \"beneficiary_age_75_84_count\",\n", " \"beneficiary_age_greater_84_count\",\n", " \"beneficiary_female_count\",\n", " \"beneficiary_male_count\",\n", " \"beneficiary_race_white_count\",\n", " \"beneficiary_race_black_count\",\n", " \"beneficiary_race_asian_pi_count\",\n", " \"beneficiary_race_hispanic_count\",\n", " \"beneficiary_race_nat_ind_count\",\n", " \"beneficiary_race_other_count\",\n", " \"beneficiary_nondual_count\",\n", " \"beneficiary_dual_count\",\n", " \"beneficiary_average_risk_score\",\n", "]\n", "\n", "# Read and clean data \n", "df = pd.read_csv(INP_FILE, low_memory=False)\n", "\n", "# Embed categorical features \n", "embedded_cat_features = []\n", "for item in CAT_VARS:\n", " tmp_df = df.groupby(item, as_index=False).agg(\n", " {\n", " \"opioid_claim_count\": \"mean\",\n", " \"opioid_drug_cost\": \"mean\",\n", " \"opioid_day_supply\": \"mean\",\n", " \"opioid_bene_count\": \"mean\",\n", " \"opioid_prescriber_rate\": \"mean\",\n", " }\n", " ).rename(\n", " columns={\n", " \"opioid_claim_count\": \"%s_opioid_claim_count\" % item,\n", " \"opioid_drug_cost\": \"%s_opioid_drug_cost\" % item,\n", " \"opioid_day_supply\": \"%s_opioid_day_supply\" % item,\n", " \"opioid_bene_count\": \"%s_opioid_bene_count\" % item,\n", " \"opioid_prescriber_rate\": \"%s_opioid_prescriber_rate\" % item,\n", " }\n", " )\n", "\n", " df = df.merge(tmp_df, how=\"left\", on=item)\n", "\n", " embedded_cat_features += [\n", " \"%s_opioid_claim_count\" % item,\n", " \"%s_opioid_drug_cost\" % item,\n", " \"%s_opioid_day_supply\" % item,\n", " \"%s_opioid_bene_count\" % item,\n", " \"%s_opioid_prescriber_rate\" % item,\n", " ]\n", "\n", "# Drop unembedded categorical variables and some others \n", "df = df[[\"npi\"] + CON_VARS + embedded_cat_features]\n", "\n", "# Write features file \n", "df.to_csv(OUT_FILE, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dimensionality Reduction\n", "\n", "Once the features are generated, we can implement the above-mentioned SVD algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Import libs\n", "import sys\n", "import os\n", "import time\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from qci_client import QciClient\n", "\n", "# Define some parameters\n", "FEATURES_FILE = \"medicare_features.csv\"\n", "REDUCED_DIM = 3\n", "\n", "# Read features\n", "df = pd.read_csv(FEATURES_FILE, low_memory=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now print the feature names and get the total count of features in the dataset," ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original dimension is 93; reduced dimension will be 3\n" ] } ], "source": [ "feature_names = list(set(df.columns) - {\"npi\"})\n", "\n", "orig_dim = len(feature_names)\n", "\n", "print(\n", " \"Original dimension is %d; reduced dimension will be %d\" % (\n", " orig_dim,\n", " REDUCED_DIM,\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now define a function that gets the first principal component of features by solving a QUBO, " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "token = \"your_token\"\n", "api_url = \"https://api.qci-prod.com\"\n", "qci = QciClient(api_token=token, url=api_url)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def get_first_principal_comp(F):\n", " \n", " qubo = -np.matmul(F.transpose(), F)\n", "\n", " # Make sure matrix is symmetric to machine precision\n", " qubo = 0.5 * (qubo + qubo.transpose())\n", "\n", " # Create json objects \n", " qubo_json = {\n", " \"file_name\": \"qubo_tutorial.json\",\n", " \"file_config\": {\n", " \"qubo\": {\"data\": qubo, \"num_variables\": orig_dim},\n", " } \n", " }\n", " \n", " # Solve the optimizzation problem\n", " #qci = QciClient()\n", "\n", " response_json = qci.upload_file(file=qubo_json)\n", " qubo_file_id = response_json[\"file_id\"]\n", "\n", " # Setup job json\n", " job_params = {\n", " \"device_type\": \"dirac-1\", \n", " \"alpha\": 1.0, \n", " \"num_samples\": 20,\n", " }\n", " job_json = qci.build_job_body(\n", " job_type=\"sample-qubo\", \n", " job_params=job_params,\n", " qubo_file_id=qubo_file_id,\n", " job_name=\"tutorial_eqc1\",\n", " job_tags=[\"tutorial_eqc1\"],\n", " )\n", " print(job_json)\n", "\n", " # Run the job\n", " job_response_json = qci.process_job(\n", " job_body=job_json,\n", " )\n", "\n", " print(job_response_json)\n", "\n", " results = job_response_json[\"results\"]\n", " energies = results[\"energies\"]\n", " samples = results[\"solutions\"]\n", "\n", " if True:\n", " print(\"Energies:\", energies) \n", "\n", " q = np.array(samples[0])\n", "\n", " assert len(q) == orig_dim, \"Inconsistent solution size!\"\n", "\n", " fct = np.linalg.norm(q)\n", " if fct > 0:\n", " fct = 1.0 / fct\n", "\n", " v0 = fct * q\n", " v0 = v0.reshape((v0.shape[0], 1))\n", "\n", " lambda0 = np.matmul(\n", " np.matmul(v0.transpose(), -qubo),\n", " v0\n", " )[0][0]\n", "\n", " assert lambda0 >= 0, \"Unexpected negative eigenvalue!\"\n", "\n", " fct = np.sqrt(lambda0)\n", " if fct > 0:\n", " fct = 1.0 / fct\n", "\n", " u0 = fct * np.matmul(F, v0)\n", " u0 = u0.reshape(-1)\n", "\n", " fct = np.linalg.norm(u0)\n", " if fct > 0:\n", " fct = 1.0 / fct\n", "\n", " u0 = fct * u0\n", "\n", " return u0 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can get the first REDUCED_DIM components by applying the above function recursively," ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba43298263204a36574d0'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}\n", "2024-05-08 09:11:30 - Dirac allocation balance = 0 s (unmetered)\n", "2024-05-08 09:11:30 - Job submitted: job_id='663ba432d448b017e54f94a8'\n", "2024-05-08 09:11:31 - QUEUED\n", "2024-05-08 09:11:33 - RUNNING\n", "2024-05-08 09:16:54 - COMPLETED\n", "2024-05-08 09:16:57 - Dirac allocation balance = 0 s (unmetered)\n", "{'job_info': {'job_id': '663ba432d448b017e54f94a8', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba43298263204a36574d0'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:11:30.979Z', 'queued_at_rfc3339nano': '2024-05-08T16:11:30.98Z', 'running_at_rfc3339nano': '2024-05-08T16:11:31.644Z', 'completed_at_rfc3339nano': '2024-05-08T16:16:52.435Z'}, 'job_result': {'file_id': '663ba57498263204a36574d2', 'device_usage_s': 281}}, 'status': 'COMPLETED', 'results': {'counts': [19, 1], 'energies': [-2802848646149886000, -2802846722004537300], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}\n", "Energies: [-2802848646149886000, -2802846722004537300]\n", "{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba5a298263204a36574d4'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}\n", "2024-05-08 09:17:38 - Dirac allocation balance = 0 s (unmetered)\n", "2024-05-08 09:17:38 - Job submitted: job_id='663ba5a2d448b017e54f94a9'\n", "2024-05-08 09:17:38 - QUEUED\n", "2024-05-08 09:17:41 - RUNNING\n", "2024-05-08 09:23:02 - COMPLETED\n", "2024-05-08 09:23:04 - Dirac allocation balance = 0 s (unmetered)\n", "{'job_info': {'job_id': '663ba5a2d448b017e54f94a9', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba5a298263204a36574d4'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:17:38.906Z', 'queued_at_rfc3339nano': '2024-05-08T16:17:38.909Z', 'running_at_rfc3339nano': '2024-05-08T16:17:39.763Z', 'completed_at_rfc3339nano': '2024-05-08T16:22:59.746Z'}, 'job_result': {'file_id': '663ba6e398263204a36574d8', 'device_usage_s': 279}}, 'status': 'COMPLETED', 'results': {'counts': [13, 4, 2, 1], 'energies': [-60631112719794140, -60631112719794140, -60631108424826850, -60631052590252000], 'solutions': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]}}\n", "Energies: [-60631112719794140, -60631112719794140, -60631108424826850, -60631052590252000]\n", "{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba70d98263204a36574da'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}\n", "2024-05-08 09:23:41 - Dirac allocation balance = 0 s (unmetered)\n", "2024-05-08 09:23:41 - Job submitted: job_id='663ba70dd448b017e54f94ab'\n", "2024-05-08 09:23:41 - QUEUED\n", "2024-05-08 09:37:48 - RUNNING\n", "2024-05-08 09:43:10 - COMPLETED\n", "2024-05-08 09:43:12 - Dirac allocation balance = 0 s (unmetered)\n", "{'job_info': {'job_id': '663ba70dd448b017e54f94ab', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba70d98263204a36574da'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:23:41.935Z', 'queued_at_rfc3339nano': '2024-05-08T16:23:41.936Z', 'running_at_rfc3339nano': '2024-05-08T16:37:48.722Z', 'completed_at_rfc3339nano': '2024-05-08T16:43:08.614Z'}, 'job_result': {'file_id': '663bab9c98263204a36574de', 'device_usage_s': 279}}, 'status': 'COMPLETED', 'results': {'counts': [12, 8], 'energies': [-31525704197734400, -31525704197734400], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}\n", "Energies: [-31525704197734400, -31525704197734400]\n" ] } ], "source": [ "F = np.array(df[feature_names])\n", "\n", "U = []\n", "for i in range(min(REDUCED_DIM, F.shape[1])):\n", "\n", " u = get_first_principal_comp(F)\n", " U.append(u)\n", " u = u.reshape((u.shape[0], 1))\n", " F = F - np.matmul(\n", " \tu,\n", " np.matmul(u.transpose(), F),\n", "\t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save the results and echo some information," ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 1022499)\n" ] } ], "source": [ "U = np.array(U)\n", "print(U.shape)\n", "np.save(\"U.npy\", U)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do a classical SVD, " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import normalize\n", "\n", "F = np.array(df[feature_names])\n", "pca_classical = PCA(n_components=REDUCED_DIM)\n", "U_classical = pca_classical.fit_transform(F)\n", "U_classical = normalize(U_classical, axis=0, norm=\"l2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and compare the results to those of the above-mentioned approach," ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 1022499)\n", "(1022499, 3)\n", "[0.9101745 0.93294887 0.82830476]\n" ] } ], "source": [ "U = np.load(\"U.npy\")\n", "#U = U_classical.transpose()\n", "print(U.shape)\n", "print(U_classical.shape)\n", "print(\n", " abs(\n", " np.diag(\n", " \tnp.matmul(U, U_classical)\n", " )\n", " )\n", ")" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" } }, "nbformat": 4, "nbformat_minor": 4 }