{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Feature Embedding with Feathr Feature Store\n",
    "\n",
    "Feature embedding is a way to translate a high-dimensional feature vector to a lower-dimensional vector, where the embedding can be learned and reused across models. In this example, we show how one can define feature embeddings in Feathr Feature Store via **UDF (User Defined Function).**\n",
    "\n",
    "We use a sample hotel review dataset downloaded from [Azure-Samples repository](https://github.com/Azure-Samples/azure-search-sample-data). The original dataset can be found [here](https://www.kaggle.com/datasets/datafiniti/hotel-reviews).\n",
    "\n",
    "For the embedding, a pre-trained [HuggingFace Transformer model](https://huggingface.co/sentence-transformers) is used to encode texts into numerical values. The text embeddings can be used for many NLP problems such as detecting fake reviews, sentiment analysis, and finding similar hotels, but building such models is out of scope and thus we don't cover that in this notebook.\n",
    "\n",
    "## Prerequisite\n",
    "* Databricks: In this notebook, we use Databricks as the target Spark platform.\n",
    "    - You may use Azure Synapse Spark pool too by following [this](https://github.com/feathr-ai/feathr/blob/main/docs/quickstart_synapse.md) instructions. Note, you'll need to install a `sentence-transformers` pip package to your Spark pool to use the embedding example.\n",
    "* Feature registry: We showcase using feature registry later in this notebook. You may use [ARM-template](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html) to deploy the necessary resources.\n",
    "\n",
    "First, install Feathr and other necessary packages to run this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install feathr from the latest codes in the repo. You may use `pip install \"feathr[notebook]\"` as well.\n",
    "#%pip install \"git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project&egg=feathr[notebook]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "79bd243c-f78e-4184-82b8-94eb8bea361f",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "from copy import deepcopy\n",
    "import json\n",
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "from pyspark.sql import DataFrame\n",
    "\n",
    "import feathr\n",
    "from feathr import (\n",
    "    # dtype\n",
    "    FLOAT_VECTOR, ValueType,\n",
    "    # source\n",
    "    HdfsSource,\n",
    "    # client\n",
    "    FeathrClient,\n",
    "    # feature\n",
    "    Feature,\n",
    "    # anchor\n",
    "    FeatureAnchor,\n",
    "    # typed_key\n",
    "    TypedKey,\n",
    "    # query_feature_list\n",
    "    FeatureQuery,\n",
    "    # settings\n",
    "    ObservationSettings,\n",
    "    # feathr_configurations\n",
    "    SparkExecutionConfiguration,\n",
    ")\n",
    "from feathr.datasets.constants import HOTEL_REVIEWS_URL\n",
    "from feathr.datasets.utils import maybe_download\n",
    "from feathr.utils.config import DEFAULT_DATABRICKS_CLUSTER_CONFIG, generate_config\n",
    "from feathr.utils.job_utils import get_result_df\n",
    "from feathr.utils.platform import is_jupyter, is_databricks\n",
    "\n",
    "print(f\"Feathr version: {feathr.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notebook parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "dc33b9b9-d7a2-4fc0-a6c6-fb8a60da3de4",
     "showTitle": false,
     "title": ""
    },
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "RESOURCE_PREFIX = \"\"  # TODO fill the value\n",
    "PROJECT_NAME = \"hotel_reviews_embedding\"\n",
    "\n",
    "REGISTRY_ENDPOINT = f\"https://{RESOURCE_PREFIX}webapp.azurewebsites.net/api/v1\"\n",
    "\n",
    "# TODO fill values to the following variables to use databricks cluster:\n",
    "DATABRICKS_CLUSTER_ID = None             # Set Databricks cluster id to use an existing cluster\n",
    "if is_databricks():\n",
    "    # If this notebook is running on Databricks, its context can be used to retrieve token and instance URL\n",
    "    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()\n",
    "    DATABRICKS_WORKSPACE_TOKEN_VALUE = ctx.apiToken().get()\n",
    "    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = f\"https://{ctx.tags().get('browserHostName').get()}\"\n",
    "else:\n",
    "    # TODO change the values if necessary\n",
    "    DATABRICKS_WORKSPACE_TOKEN_VALUE = os.environ.get(\"DATABRICKS_WORKSPACE_TOKEN_VALUE\")\n",
    "    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = os.environ.get(\"SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL\")\n",
    "\n",
    "# TODO Change the value if necessary\n",
    "DATABRICKS_NODE_SIZE = \"Standard_DS3_v2\"\n",
    "\n",
    "# We'll need an authentication credential to access Azure resources and register features \n",
    "USE_CLI_AUTH = False  # Set True to use interactive authentication\n",
    "\n",
    "# If set True, register the features to Feathr registry.\n",
    "REGISTER_FEATURES = False\n",
    "\n",
    "# TODO fill the values to use EnvironmentCredential for authentication. (e.g. to run this notebook on DataBricks.)\n",
    "AZURE_TENANT_ID = None\n",
    "AZURE_CLIENT_ID = None\n",
    "AZURE_CLIENT_SECRET = None\n",
    "\n",
    "# Set True to delete the project output files at the end of this notebook.\n",
    "CLEAN_UP = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get an authentication credential to access Azure resources and register features\n",
    "if USE_CLI_AUTH:\n",
    "    # Use AZ CLI interactive browser authentication\n",
    "    !az login --use-device-code\n",
    "    from azure.identity import AzureCliCredential\n",
    "    credential = AzureCliCredential(additionally_allowed_tenants=['*'],)\n",
    "elif AZURE_TENANT_ID and AZURE_CLIENT_ID and AZURE_CLIENT_SECRET:\n",
    "    # Use Environment variable secret\n",
    "    from azure.identity import EnvironmentCredential\n",
    "    os.environ[\"AZURE_TENANT_ID\"] = AZURE_TENANT_ID\n",
    "    os.environ[\"AZURE_CLIENT_ID\"] = AZURE_CLIENT_ID\n",
    "    os.environ[\"AZURE_CLIENT_SECRET\"] = AZURE_CLIENT_SECRET\n",
    "    credential = EnvironmentCredential()\n",
    "else:\n",
    "    # Try to use the default credential\n",
    "    from azure.identity import DefaultAzureCredential\n",
    "    credential = DefaultAzureCredential(\n",
    "        exclude_interactive_browser_credential=False,\n",
    "        additionally_allowed_tenants=['*'],\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "b91b6f48-87a6-4788-9c09-b8aeb4406c54",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Prepare Dataset\n",
    "\n",
    "First, prepare the hotel review dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use dbfs if the notebook is running on Databricks\n",
    "if is_databricks():\n",
    "    WORKING_DIR = f\"/dbfs/{PROJECT_NAME}\"\n",
    "else:\n",
    "    WORKING_DIR = PROJECT_NAME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "a10a4625-6f98-42cb-9967-3d5d0b75fb7a",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "data_filepath = f\"{WORKING_DIR}/hotel_reviews_with_id.csv\"\n",
    "maybe_download(src_url=HOTEL_REVIEWS_URL, dst_filepath=data_filepath)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the review IDs are not included in our sample dataset, we set incremantal numbers to the ID column so that we can use them for feature joinining later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(data_filepath)\n",
    "df['reviews_id'] = df.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "22e27778-3472-44b7-90e0-aca7d78dbbdc",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "# Verify the data\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the updated data back to file so that we can use it later in this sample notebook.\n",
    "df.to_csv(data_filepath, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "45c08e6e-a2f7-4ae7-9c3f-81edc1adcf48",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Initialize Feathr Client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "a8da762c-d245-4f90-abe8-42d4f6a4ea80",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "databricks_cluster_config = deepcopy(DEFAULT_DATABRICKS_CLUSTER_CONFIG)\n",
    "databricks_cluster_config[\"node_type_id\"] = DATABRICKS_NODE_SIZE\n",
    "\n",
    "databricks_config = {\n",
    "    \"run_name\": \"FEATHR_FILL_IN\",\n",
    "    \"libraries\": [\n",
    "        {\"jar\": \"FEATHR_FILL_IN\"},\n",
    "        # sentence-transformers pip package\n",
    "        {\"pypi\": {\"package\": \"sentence-transformers\"}},\n",
    "    ],\n",
    "    \"spark_jar_task\": {\n",
    "        \"main_class_name\": \"FEATHR_FILL_IN\",\n",
    "        \"parameters\": [\"FEATHR_FILL_IN\"],\n",
    "    },\n",
    "    \"new_cluster\": databricks_cluster_config,\n",
    "}\n",
    "os.environ[\"SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE\"] = json.dumps(databricks_config)\n",
    "\n",
    "# Note, config arguments maybe overridden by the environment variables. \n",
    "config_path = generate_config(\n",
    "    resource_prefix=RESOURCE_PREFIX,\n",
    "    project_name=PROJECT_NAME,\n",
    "    spark_config__spark_cluster=\"databricks\",\n",
    "    # You may set an existing cluster id here, but Databricks recommend to use new clusters for greater reliability.\n",
    "    databricks_cluster_id=None,  # Set None to create a new job cluster\n",
    "    databricks_workspace_token_value=DATABRICKS_WORKSPACE_TOKEN_VALUE,\n",
    "    spark_config__databricks__work_dir=f\"dbfs:/{PROJECT_NAME}\",\n",
    "    spark_config__databricks__workspace_instance_url=SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL,\n",
    "    feature_registry__api_endpoint=REGISTRY_ENDPOINT,\n",
    ")\n",
    "\n",
    "with open(config_path, \"r\") as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "a35d5b78-542d-4c9e-a64c-76d045a8f587",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "client = FeathrClient(\n",
    "    config_path=config_path,\n",
    "    credential=credential,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "352bd8b2-1626-4aee-9b00-58750ac18086",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Feature Creator Scenario\n",
    "\n",
    "With the feature creator's point of view, we implement a feature embedding UDF, define the embedding output as a feature, and register the feature to Feathr registry.    \n",
    "\n",
    "### Create Features\n",
    "\n",
    "First, we set the data source path that our feature definition will use. This path will be used from the **Feature Consumer Scenario** later in this notebook when extracting the feature vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if client.spark_runtime != \"databricks\":\n",
    "    raise ValueError(\"To run this notebook, you must use Databricks as a target Spark cluster.\\\n",
    "        To use other platforms, you'll need to install `sentence-transformers` pip package to your Spark cluster.\")\n",
    "\n",
    "# If the notebook is running on Databricks, convert to spark path format\n",
    "if is_databricks():\n",
    "    data_source_path = data_filepath.replace(\"/dbfs\", \"dbfs:\")\n",
    "# Otherwise, upload the local file to the cloud storage (either dbfs or adls).\n",
    "else:\n",
    "    data_source_path = client.feathr_spark_launcher.upload_or_get_cloud_path(data_filepath)\n",
    "\n",
    "data_source_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create feature embedding UDF. Here, we will use a [pretrained Transformer model from HuggingFace](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "cbf14644-fd42-49a2-9199-6471b719e03e",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "def sentence_embedding(df: DataFrame) -> DataFrame:\n",
    "    \"\"\"Feathr data source UDF to generate sentence embeddings.\n",
    "\n",
    "    Args:\n",
    "        df: A Spark DataFrame with a column named \"reviews_text\" of type string.\n",
    "    \n",
    "    Returns:\n",
    "        A Spark DataFrame with a column named \"reviews_text_embedding\" of type array<float>.\n",
    "    \"\"\"\n",
    "    import pandas as pd\n",
    "    from pyspark.sql.functions import col, pandas_udf\n",
    "    from pyspark.sql.types import ArrayType, FloatType\n",
    "    from sentence_transformers import SentenceTransformer\n",
    "    \n",
    "    @pandas_udf(ArrayType(FloatType()))\n",
    "    def predict_batch_udf(data: pd.Series) -> pd.Series:\n",
    "        \"\"\"Pandas UDF transforming a pandas.Series of text into a pandas.Series of embeddings.\n",
    "        You may use iterator input and output instead, e.g. Iterator[pd.Series] -> Iterator[pd.Series]\n",
    "        \"\"\"\n",
    "        model = SentenceTransformer('paraphrase-MiniLM-L6-v2')\n",
    "        embedding = model.encode(data.to_list())\n",
    "        return pd.Series(embedding.tolist())\n",
    "\n",
    "    return df.withColumn(\"reviews_text_embedding\", predict_batch_udf(col(\"reviews_text\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "d570545a-ba3e-4562-9893-a0de8d06e467",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "hdfs_source = HdfsSource(\n",
    "    name=\"hotel_reviews\",\n",
    "    path=data_source_path,\n",
    "    preprocessing=sentence_embedding,\n",
    ")\n",
    "\n",
    "# key is required for the features from non-INPUT_CONTEXT source\n",
    "key = TypedKey(\n",
    "    key_column=\"reviews_id\",\n",
    "    key_column_type=ValueType.INT64,\n",
    "    description=\"Reviews ID\",\n",
    "    full_name=f\"{PROJECT_NAME}.review_id\",\n",
    ")\n",
    "\n",
    "# The column 'reviews_text_embedding' will be generated by our UDF `sentence_embedding`.\n",
    "# We use the column as the feature. \n",
    "features = [\n",
    "    Feature(\n",
    "        name=\"f_reviews_text_embedding\",\n",
    "        key=key,\n",
    "        feature_type=FLOAT_VECTOR,\n",
    "        transform=\"reviews_text_embedding\",\n",
    "    ),\n",
    "]\n",
    "\n",
    "feature_anchor = FeatureAnchor(\n",
    "    name=\"feature_anchor\",\n",
    "    source=hdfs_source,\n",
    "    features=features,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "75ad69ff-0c94-4cc7-be9e-3cf8f372ecf2",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "client.build_features(\n",
    "    anchor_list=[feature_anchor],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "d71dd42f-57b3-4ff5-a79f-f154efd3d806",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "### Register the Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "be389daa-3762-445b-a16a-38f30eb7d7bb",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "if REGISTER_FEATURES:\n",
    "    try:\n",
    "        client.register_features()\n",
    "    except Exception as e:\n",
    "        print(e)\n",
    "\n",
    "    print(client.list_registered_features(project_name=PROJECT_NAME))\n",
    "    # You can get the actual features too by calling client.get_features_from_registry(PROJECT_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "195a2a99-98f7-43a5-bd4a-2d65772c93da",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Feature Consumer Scenario\n",
    "\n",
    "From the feature consumer point of view, we first get the registered feature and then extract the feature vectors by using the feature definition.\n",
    "\n",
    "### Get Registered Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "13a20076-1b24-4537-8d07-a5bf5b440cf0",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "if REGISTER_FEATURES:\n",
    "    registered_features = client.get_features_from_registry(project_name=PROJECT_NAME)\n",
    "else:\n",
    "    # Assume we get the registered features. This is for a notebook unit-test w/o the actual registration.\n",
    "    registered_features = {feat.name: feat for feat in features}\n",
    "\n",
    "print(\"Features:\")\n",
    "for f_name, f in registered_features.items():\n",
    "    print(f\"\\t{f_name} (key: {f.key[0].key_column})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "7ca62c78-281a-4a84-a8a0-1879ea441e9d",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "### Extract the Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "c92708e6-ca44-48b6-ae47-30db88e39277",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "feature_name = \"f_reviews_text_embedding\"\n",
    "feature_key = registered_features[feature_name].key[0]\n",
    "output_filepath = f\"dbfs:/{PROJECT_NAME}/feature_embeddings.parquet\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "d9dfe7f6-67d0-407b-aaac-5ac65f9dde3e",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "query = FeatureQuery(\n",
    "    feature_list=[feature_name],\n",
    "    key=feature_key,\n",
    ")\n",
    "\n",
    "settings = ObservationSettings(\n",
    "    observation_path=data_source_path,\n",
    ")\n",
    "\n",
    "client.get_offline_features(\n",
    "    observation_settings=settings,\n",
    "    feature_query=query,\n",
    "    # For more details, see https://feathr-ai.github.io/feathr/how-to-guides/feathr-job-configuration.html\n",
    "    execution_configurations=SparkExecutionConfiguration({\n",
    "        \"spark.feathr.outputFormat\": \"parquet\",\n",
    "        \"spark.sql.execution.arrow.enabled\": \"true\",\n",
    "    }),\n",
    "    output_path=output_filepath,\n",
    ")\n",
    "\n",
    "client.wait_job_to_finish(timeout_sec=5000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "a8be8d73-df8e-40f5-b21a-163e2da4b1c6",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "result_df = get_result_df(client=client, res_url=output_filepath, data_format=\"parquet\")\n",
    "result_df[[\"name\", \"reviews_text\", feature_name]].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's visualize the feature values. Here, we use TSNE (T-distributed Stochastic Neighbor Embedding) using [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to plot the vectors in 2D space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "c03e4c41-00d7-4163-bdab-b5cf3e22ca30",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import plotly.graph_objs as go\n",
    "from sklearn.manifold import TSNE\n",
    "\n",
    "\n",
    "X = np.stack(result_df[feature_name], axis=0)\n",
    "result = TSNE(\n",
    "    n_components=2,\n",
    "    init='random',\n",
    "    perplexity=10,\n",
    ").fit_transform(X)\n",
    "\n",
    "result[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "20a2fe88-3b74-45ad-9b4f-2e63e9171ee1",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "names = set(result_df['name'])\n",
    "names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "25b798da-d0fa-4d37-98a9-a9614c47eb53",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "fig = go.Figure()\n",
    "\n",
    "for name in names:\n",
    "    mask = result_df['name']==name\n",
    "    \n",
    "    fig.add_trace(go.Scatter(\n",
    "        x=result[mask, 0],\n",
    "        y=result[mask, 1],\n",
    "        name=name,\n",
    "        textposition='top center',\n",
    "        mode='markers+text',\n",
    "        marker={\n",
    "            'size': 8,\n",
    "            'opacity': 0.8,\n",
    "        },\n",
    "    ))\n",
    "\n",
    "fig.update_layout(\n",
    "    margin={'l': 0, 'r': 0, 'b': 0, 't': 0},\n",
    "    showlegend=True,\n",
    "    autosize=False,\n",
    "    width=1000,\n",
    "    height=500,\n",
    ")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleanup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cleaning up the output files. CAUTION: this maybe dangerous if you \"reused\" the project name.\n",
    "if CLEAN_UP:\n",
    "    import shutil\n",
    "    shutil.rmtree(WORKING_DIR, ignore_errors=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "dashboards": [],
   "language": "python",
   "notebookMetadata": {
    "pythonIndentUnit": 4,
    "widgetLayout": []
   },
   "notebookName": "embedding",
   "notebookOrigID": 2956141409782062,
   "widgets": {}
  },
  "kernelspec": {
   "display_name": "feathr",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  },
  "vscode": {
   "interpreter": {
    "hash": "ddb0e38f168d5afaa0b8ab4851ddd8c14364f1d087c15de6ff2ee5a559aec1f2"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}