{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "2021-07-24-storing-ml-features-using-feast.ipynb",
"provenance": [],
"collapsed_sections": [],
"mount_file_id": "1RgDRfPhQXOZgC9VL5HSYgDziqFQHeD_v",
"authorship_tag": "ABX9TyMwUwNK3PdMDMsGzmAUDBo5"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Jf0aHUhtsgPE"
},
"source": [
"# Storing ML features using Feast\n",
"> Storing features in Feast (a feature store system), tried out on the MovieLens and ad-click datasets\n",
"\n",
"- toc: true\n",
"- badges: true\n",
"- comments: true\n",
"- categories: [FeatureStore]\n",
"- image:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "47p4Kt6jPZbE"
},
"source": [
"## Feast"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NwwcuVvSPgV0"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V7mxBOaBPn1_"
},
"source": [
"Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production.\n",
"\n",
"[Git](https://github.com/feast-dev/feast)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0xKlcDKOP2BY"
},
"source": [
""
]
},
{
"cell_type": "code",
"metadata": {
"id": "mQCHBlg-PSee",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "46b232e2-b5eb-4eec-e8cc-12057210d518"
},
"source": [
"!pip install -q feast"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 190 kB 5.1 MB/s \n",
"\u001b[K |████████████████████████████████| 100 kB 4.3 MB/s \n",
"\u001b[K |████████████████████████████████| 269 kB 9.8 MB/s \n",
"\u001b[K |████████████████████████████████| 10.1 MB 10.2 MB/s \n",
"\u001b[K |████████████████████████████████| 50 kB 5.5 MB/s \n",
"\u001b[K |████████████████████████████████| 2.3 MB 33.8 MB/s \n",
"\u001b[?25h Building wheel for pandavro (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for PyYAML (setup.py) ... \u001b[?25l\u001b[?25hdone\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bnTSti6YU6ib"
},
"source": [
"A feature repository is a directory that contains the configuration of the feature store and of the individual features. This configuration is written as code (Python/YAML), and it is highly recommended that teams track it centrally using Git."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YfYconoLVO36"
},
"source": [
"To change feature definitions, edit the example definitions in `example.py` and run `feast apply` again."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ljhwODVAW9iX"
},
"source": [
"Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources, both when building training datasets and when materializing features into an online store.\n",
"\n",
"Below is an example data source with a single entity (`driver`) and two features (`trips_today` and `rating`)."
]
},
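{
"cell_type": "markdown",
"metadata": {},
"source": [
"The time-series data model can be sketched with the Python standard library alone. The `driver`, `trips_today` and `rating` names come from the description above; the row values are made up for illustration, and the list-of-dicts layout is an assumption rather than how Feast stores data. Each row carries an entity key, an event timestamp, and the feature values observed at that time."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from collections import defaultdict\n",
"from datetime import datetime, timezone\n",
"\n",
"# Illustrative rows (made-up values): entity key + event timestamp + feature values.\n",
"rows = [\n",
"    {'driver': 1001, 'event_timestamp': datetime(2021, 7, 1, 9, tzinfo=timezone.utc), 'trips_today': 3, 'rating': 4.6},\n",
"    {'driver': 1002, 'event_timestamp': datetime(2021, 7, 1, 9, tzinfo=timezone.utc), 'trips_today': 7, 'rating': 4.1},\n",
"    {'driver': 1001, 'event_timestamp': datetime(2021, 7, 1, 17, tzinfo=timezone.utc), 'trips_today': 12, 'rating': 4.7},\n",
"]\n",
"\n",
"# Group the rows into one chronologically ordered series per entity.\n",
"series = defaultdict(list)\n",
"for row in sorted(rows, key=lambda r: r['event_timestamp']):\n",
"    series[row['driver']].append((row['event_timestamp'], row['trips_today'], row['rating']))\n",
"\n",
"for driver, history in series.items():\n",
"    print(driver, history)"
],
"execution_count": null,
"outputs": []
},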
{
"cell_type": "markdown",
"metadata": {
"id": "8WH6lGaHW_rA"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ArwtR5Wx_J7o"
},
"source": [
"## Movielens"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1du3HIsPNDVx",
"outputId": "c6958224-77aa-4f70-f05e-cd4d53de07ee"
},
"source": [
"!pip install -q git+https://github.com/sparsh-ai/recochef.git\n",
"from recochef.datasets.movielens import MovieLens"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing wheel metadata ... \u001b[?25l\u001b[?25hdone\n",
"\u001b[K |████████████████████████████████| 4.3MB 6.0MB/s \n",
"\u001b[?25h Building wheel for recochef (PEP 517) ... \u001b[?25l\u001b[?25hdone\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GGBxfgV1CqYv"
},
"source": [
"### Load"
]
},
{
"cell_type": "code",
"metadata": {
"id": "4famy0k3Cpac"
},
"source": [
"ml = MovieLens()\n",
"df = ml.load_interactions()\n",
"df.head()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xRnxWKjzCl9-"
},
"source": [
"### Transform"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PELmaKuICkVw"
},
"source": [
"from recochef.preprocessing import encode, split\n",
"train, test = split.chrono_split(df)\n",
"train, umap = encode.label_encode(train, col='USERID')\n",
"train, imap = encode.label_encode(train, col='ITEMID')\n",
"test = encode.label_encode(test, col='USERID', maps=umap)\n",
"test = encode.label_encode(test, col='ITEMID', maps=imap)\n",
"train.head()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "DOaCtwiKCtRb"
},
"source": [
"### Create a feature repository"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5cLD9ScIQFUP",
"outputId": "cfbb34c5-6531-41fb-a4b8-906c0b14dee1"
},
"source": [
"!feast init my_movielens_repo\n",
"%cd my_movielens_repo"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"Creating a new Feast repository in \u001b[1m\u001b[32m/content/my_movielens_repo\u001b[0m.\n",
"\n",
"/content/my_movielens_repo\n"
],
"name": "stdout"
}
]
},
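{
"cell_type": "code",
"metadata": {
"id": "C4PtfmVlVrdG"
},
"source": [
"import pandas as pd\n",
"# example.py reads event times from `datetime` and creation times from `created`, so add both columns (TIMESTAMP holds Unix seconds).\n",
"for name, frame in [(\"train\", train), (\"test\", test)]:\n",
"    frame = frame.assign(datetime=pd.to_datetime(frame[\"TIMESTAMP\"], unit=\"s\", utc=True), created=pd.Timestamp.now(tz=\"UTC\"))\n",
"    frame.to_parquet(f\"./data/movielens_{name}.parquet\")"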
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "61DYdzh1Vh9V",
"outputId": "f75424bf-5615-4993-be90-6589f5d332f2"
},
"source": [
"%%writefile example.py\n",
"from google.protobuf.duration_pb2 import Duration\n",
"\n",
"from feast import Entity, Feature, FeatureView, ValueType\n",
"from feast.data_source import FileSource\n",
"\n",
"\n",
"# Parquet files written in the previous cell.\n",
"movielens_train = FileSource(\n",
"    path=\"/content/my_movielens_repo/data/movielens_train.parquet\",\n",
"    event_timestamp_column=\"datetime\",\n",
"    created_timestamp_column=\"created\",\n",
")\n",
"\n",
"movielens_test = FileSource(\n",
"    path=\"/content/my_movielens_repo/data/movielens_test.parquet\",\n",
"    event_timestamp_column=\"datetime\",\n",
"    created_timestamp_column=\"created\",\n",
")\n",
"\n",
"\n",
"itemid = Entity(name=\"ITEMID\", value_type=ValueType.INT64, description=\"movie id\")\n",
"userid = Entity(name=\"USERID\", value_type=ValueType.INT64, description=\"user id\")\n",
"\n",
"\n",
"# The entity references below must match the Entity names above (they are case-sensitive).\n",
"movielens_train_view = FeatureView(\n",
"    name=\"movielens_train\",\n",
"    entities=[\"ITEMID\", \"USERID\"],\n",
"    ttl=Duration(seconds=86400 * 1),\n",
"    features=[\n",
"        Feature(name=\"RATING\", dtype=ValueType.FLOAT),\n",
"        Feature(name=\"TIMESTAMP\", dtype=ValueType.INT64),  # Unix seconds, stored as integers\n",
"    ],\n",
"    online=True,\n",
"    input=movielens_train,\n",
"    tags={},\n",
")\n",
"\n",
"movielens_test_view = FeatureView(\n",
"    name=\"movielens_test\",\n",
"    entities=[\"ITEMID\", \"USERID\"],\n",
"    ttl=Duration(seconds=86400 * 1),\n",
"    features=[\n",
"        Feature(name=\"RATING\", dtype=ValueType.FLOAT),\n",
"        Feature(name=\"TIMESTAMP\", dtype=ValueType.INT64),  # Unix seconds, stored as integers\n",
"    ],\n",
"    online=True,\n",
"    input=movielens_test,\n",
"    tags={},\n",
")"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Overwriting example.py\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LHI_B14wVj8a"
},
"source": [
"Register your feature definitions and set up your feature store"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "R2ifKusVQKZH",
"outputId": "2f91b0f9-18b5-4f00-a87f-4490e94a2224"
},
"source": [
"!feast apply"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Registered entity \u001b[1m\u001b[32mITEMID\u001b[0m\n",
"Registered entity \u001b[1m\u001b[32mUSERID\u001b[0m\n",
"Registered feature view \u001b[1m\u001b[32mmovielens_test\u001b[0m\n",
"Registered feature view \u001b[1m\u001b[32mmovielens_train\u001b[0m\n",
"Deploying infrastructure for \u001b[1m\u001b[32mmovielens_test\u001b[0m\n",
"Deploying infrastructure for \u001b[1m\u001b[32mmovielens_train\u001b[0m\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZV56zvD-Qm1x"
},
"source": [
"### Build a training dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CrxmVUTKbiwQ"
},
"source": [
"Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe."
]
},
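{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the point-in-time join concrete, here is a minimal standard-library sketch. It is not Feast's implementation: `point_in_time_join` is a hypothetical helper, the data is made up, and the TTL handling is an assumption about the intended semantics. For every entity row it attaches the latest feature values observed at or before that row's timestamp, and returns `None` when the newest value is older than the TTL."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from datetime import datetime, timedelta, timezone\n",
"\n",
"\n",
"def point_in_time_join(entity_rows, feature_rows, key, feature_names, ttl):\n",
"    # For each entity row, attach the latest feature values observed at or\n",
"    # before the row's timestamp, ignoring values older than ttl.\n",
"    joined = []\n",
"    for entity in entity_rows:\n",
"        as_of = entity['event_timestamp']\n",
"        candidates = [\n",
"            f for f in feature_rows\n",
"            if f[key] == entity[key] and as_of - ttl <= f['event_timestamp'] <= as_of\n",
"        ]\n",
"        latest = max(candidates, key=lambda f: f['event_timestamp'], default=None)\n",
"        values = {name: (latest[name] if latest else None) for name in feature_names}\n",
"        joined.append({**entity, **values})\n",
"    return joined\n",
"\n",
"\n",
"# Made-up feature history for a single driver.\n",
"t0 = datetime(2021, 4, 12, 8, 0, tzinfo=timezone.utc)\n",
"feature_rows = [\n",
"    {'driver_id': 1001, 'event_timestamp': t0, 'conv_rate': 0.2},\n",
"    {'driver_id': 1001, 'event_timestamp': t0 + timedelta(hours=2), 'conv_rate': 0.5},\n",
"]\n",
"entity_rows = [\n",
"    {'driver_id': 1001, 'event_timestamp': t0 + timedelta(hours=1)},  # only the t0 value is visible\n",
"    {'driver_id': 1001, 'event_timestamp': t0 + timedelta(days=3)},   # every value is older than the ttl\n",
"]\n",
"\n",
"result = point_in_time_join(entity_rows, feature_rows, 'driver_id', ['conv_rate'], ttl=timedelta(days=1))\n",
"for row in result:\n",
"    print(row)"
],
"execution_count": null,
"outputs": []
},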
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "xgniqdUWa56m",
"outputId": "55bf0004-a386-43c6-8717-0eac6b4c98a9"
},
"source": [
"train.sample(2)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>USERID</th><th>ITEMID</th><th>RATING</th><th>TIMESTAMP</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2408</th><td>212</td><td>408</td><td>5.0</td><td>878955409</td></tr>\n",
"    <tr><th>18737</th><td>390</td><td>53</td><td>5.0</td><td>877399659</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" USERID ITEMID RATING TIMESTAMP\n",
"2408 212 408 5.0 878955409\n",
"18737 390 53 5.0 877399659"
]
},
"metadata": {
"tags": []
},
"execution_count": 27
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Ho7ddXxddSay",
"outputId": "c0ab1d17-895e-4d2f-8db7-a95ff956f789"
},
"source": [
"present_time"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Timestamp('2021-07-09 08:49:27.205924+0000', tz='UTC')"
]
},
"metadata": {
"tags": []
},
"execution_count": 31
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 578
},
"id": "j9p3IDrGa5I-",
"collapsed": true,
"outputId": "30d0c398-4321-4711-e03f-0225e5721e07"
},
"source": [
"from feast import FeatureStore\n",
"import pandas as pd\n",
"\n",
"# Timezone-aware current time in UTC, used as the point in time for both entity rows.\n",
"present_time = pd.Timestamp.now(tz=\"UTC\")\n",
"\n",
"# Column names must match the entity names registered in example.py (USERID, ITEMID).\n",
"entity_df = pd.DataFrame.from_dict({\n",
"    \"USERID\": [212, 390],\n",
"    \"ITEMID\": [408, 53],\n",
"    \"datetime\": [present_time, present_time],\n",
"})\n",
"\n",
"store = FeatureStore(repo_path=\".\")\n",
"\n",
"training_df = store.get_historical_features(\n",
"    entity_df=entity_df,\n",
"    feature_refs=[\n",
"        \"movielens_train:RATING\",\n",
"        # \"movielens_train:TIMESTAMP\",\n",
"    ],\n",
").to_df()\n",
"\n",
"training_df"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Using datetime as the event timestamp. To specify a column explicitly, please name it event_timestamp.\n"
],
"name": "stdout"
},
{
"output_type": "error",
"ename": "KeyError",
"evalue": "ignored",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2897\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2898\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2899\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'datetime'",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 17\u001b[0m \u001b[0mentity_df\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mentity_df\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 18\u001b[0m feature_refs = [\n\u001b[0;32m---> 19\u001b[0;31m \u001b[0;34m'movielens_train:RATING'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 20\u001b[0m \u001b[0;31m# 'movielens_train:TIMESTAMP',\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 21\u001b[0m ],\n",
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/feast/infra/offline_stores/file.py\u001b[0m in \u001b[0;36mto_df\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 36\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mto_df\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 37\u001b[0m \u001b[0;31m# Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 38\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mevaluation_function\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 39\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 40\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/feast/infra/offline_stores/file.py\u001b[0m in \u001b[0;36mevaluate_historical_retrieval\u001b[0;34m()\u001b[0m\n\u001b[1;32m 112\u001b[0m \u001b[0;31m# Make sure all timestamp fields are tz-aware. We default tz-naive fields to UTC\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m df_to_join[event_timestamp_column] = df_to_join[\n\u001b[0;32m--> 114\u001b[0;31m \u001b[0mevent_timestamp_column\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 115\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 116\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mx\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtzinfo\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreplace\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtzinfo\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpytz\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2905\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2906\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2907\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2908\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2898\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2899\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2900\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2901\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2902\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'datetime'"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "PMAjXPbvaYnY"
},
"source": [
"from feast import FeatureStore\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"\n",
"entity_df = pd.DataFrame.from_dict({\n",
" \"driver_id\": [1001, 1002, 1003, 1004],\n",
" \"event_timestamp\": [\n",
" datetime(2021, 4, 12, 10, 59, 42),\n",
" datetime(2021, 4, 12, 8, 12, 10),\n",
" datetime(2021, 4, 12, 16, 40, 26),\n",
" datetime(2021, 4, 12, 15, 1 , 12)\n",
" ]\n",
"})\n",
"\n",
"store = FeatureStore(repo_path=\".\")\n",
"\n",
"training_df = store.get_historical_features(\n",
" entity_df=entity_df, \n",
" feature_refs = [\n",
" 'driver_hourly_stats:conv_rate',\n",
" 'driver_hourly_stats:acc_rate',\n",
" 'driver_hourly_stats:avg_daily_trips'\n",
" ],\n",
").to_df()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 261
},
"id": "iqMZM-xQQ6vC",
"outputId": "f784eabd-b3af-4dcb-a363-650efe45eb9b"
},
"source": [
"training_df.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>event_timestamp</th><th>driver_id</th><th>driver_hourly_stats__conv_rate</th><th>driver_hourly_stats__acc_rate</th><th>driver_hourly_stats__avg_daily_trips</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2021-04-12 08:12:10+00:00</td><td>1002</td><td>0.385016</td><td>0.913230</td><td>803</td></tr>\n",
"    <tr><th>1</th><td>2021-04-12 10:59:42+00:00</td><td>1001</td><td>0.192806</td><td>0.492017</td><td>973</td></tr>\n",
"    <tr><th>2</th><td>2021-04-12 15:01:12+00:00</td><td>1004</td><td>0.371372</td><td>0.788611</td><td>837</td></tr>\n",
"    <tr><th>3</th><td>2021-04-12 16:40:26+00:00</td><td>1003</td><td>0.828210</td><td>0.315526</td><td>205</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" event_timestamp ... driver_hourly_stats__avg_daily_trips\n",
"0 2021-04-12 08:12:10+00:00 ... 803\n",
"1 2021-04-12 10:59:42+00:00 ... 973\n",
"2 2021-04-12 15:01:12+00:00 ... 837\n",
"3 2021-04-12 16:40:26+00:00 ... 205\n",
"\n",
"[4 rows x 5 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GQAaubc5RmnR"
},
"source": [
"### Load feature values into your online store"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "s8OrKWPvRz-0",
"outputId": "63ff1864-5bf4-4476-8aaa-f6d5bf505819"
},
"source": [
"%%sh\n",
"CURRENT_TIME=$(date -u +\"%Y-%m-%dT%H:%M:%S\")\n",
"feast materialize-incremental $CURRENT_TIME"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Materializing \u001b[1m\u001b[32m1\u001b[0m feature views to \u001b[1m\u001b[32m2021-07-09 08:01:35+00:00\u001b[0m into the \u001b[1m\u001b[32msqlite\u001b[0m online store.\n",
"\n",
"\u001b[1m\u001b[32mdriver_hourly_stats\u001b[0m from \u001b[1m\u001b[32m2021-07-08 08:01:36+00:00\u001b[0m to \u001b[1m\u001b[32m2021-07-09 08:01:35+00:00\u001b[0m:\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"\r 0%| | 0/5 [00:00, ?it/s]\r100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 455.73it/s]\n"
],
"name": "stderr"
}
]
},
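{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, materialization copies the most recent feature values per entity from the offline source into a key-value online store. The sketch below illustrates that idea with a plain dictionary. It is an assumption-laden illustration, not Feast's code: `materialize_latest` and `utc` are hypothetical helpers and the rows are made up. The time window mirrors the 24-hour range reported in the log above."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from datetime import datetime, timezone\n",
"\n",
"\n",
"def materialize_latest(feature_rows, key, start, end):\n",
"    # Keep, per entity key, the most recent row with start <= event_timestamp <= end.\n",
"    online_store = {}\n",
"    for row in feature_rows:\n",
"        ts = row['event_timestamp']\n",
"        if not (start <= ts <= end):\n",
"            continue\n",
"        current = online_store.get(row[key])\n",
"        if current is None or ts > current['event_timestamp']:\n",
"            online_store[row[key]] = row\n",
"    return online_store\n",
"\n",
"\n",
"def utc(day, hour):\n",
"    # Hypothetical helper: a UTC timestamp in July 2021.\n",
"    return datetime(2021, 7, day, hour, tzinfo=timezone.utc)\n",
"\n",
"\n",
"# Made-up offline feature rows.\n",
"feature_rows = [\n",
"    {'driver_id': 1001, 'event_timestamp': utc(8, 9), 'conv_rate': 0.31},\n",
"    {'driver_id': 1001, 'event_timestamp': utc(8, 21), 'conv_rate': 0.80},\n",
"    {'driver_id': 1002, 'event_timestamp': utc(7, 6), 'conv_rate': 0.55},  # falls before the window\n",
"]\n",
"\n",
"online_store = materialize_latest(feature_rows, 'driver_id', start=utc(8, 8), end=utc(9, 8))\n",
"\n",
"# An online read is then a single key lookup.\n",
"print(online_store[1001]['conv_rate'])\n",
"print(1002 in online_store)"
],
"execution_count": null,
"outputs": []
},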
{
"cell_type": "markdown",
"metadata": {
"id": "CIJCDiEMSV4m"
},
"source": [
"### Read online features at low latency"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "szaiNCXURJgt",
"outputId": "2b877525-0e8e-4c42-cbbf-1e0ea4372a51"
},
"source": [
"from pprint import pprint\n",
"from feast import FeatureStore\n",
"\n",
"store = FeatureStore(repo_path=\".\")\n",
"\n",
"feature_vector = store.get_online_features(\n",
" feature_refs=[\n",
" 'driver_hourly_stats:conv_rate',\n",
" 'driver_hourly_stats:acc_rate',\n",
" 'driver_hourly_stats:avg_daily_trips'\n",
" ],\n",
" entity_rows=[{\"driver_id\": 1001}]\n",
").to_dict()\n",
"\n",
"pprint(feature_vector) "
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"{'driver_hourly_stats__acc_rate': [0.410092294216156],\n",
" 'driver_hourly_stats__avg_daily_trips': [870],\n",
" 'driver_hourly_stats__conv_rate': [0.8009825944900513],\n",
" 'driver_id': [1001]}\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eHMPt2cfAAEu"
},
"source": [
"## Ad-click dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r2MjWFcbSF-7"
},
"source": [
"### Download the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x1n_POn5AG9L"
},
"source": [
""
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"collapsed": true,
"id": "moZd0KLNAL9l",
"outputId": "bc89560e-976e-49af-a8b2-9067931ed381"
},
"source": [
"!pip install -q -U kaggle\n",
"!pip install --upgrade --force-reinstall --no-deps kaggle\n",
"!mkdir ~/.kaggle\n",
"!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/\n",
"!chmod 600 ~/.kaggle/kaggle.json\n",
"!kaggle datasets download -d arashnic/ctrtest\n",
"!unzip ctrtest"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Collecting kaggle\n",
" Downloading kaggle-1.5.12.tar.gz (58 kB)\n",
"\u001b[?25l\r\u001b[K |█████▋ | 10 kB 26.8 MB/s eta 0:00:01\r\u001b[K |███████████▏ | 20 kB 29.3 MB/s eta 0:00:01\r\u001b[K |████████████████▊ | 30 kB 15.3 MB/s eta 0:00:01\r\u001b[K |██████████████████████▎ | 40 kB 11.0 MB/s eta 0:00:01\r\u001b[K |███████████████████████████▉ | 51 kB 5.3 MB/s eta 0:00:01\r\u001b[K |████████████████████████████████| 58 kB 2.5 MB/s \n",
"\u001b[?25hBuilding wheels for collected packages: kaggle\n",
" Building wheel for kaggle (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73052 sha256=ca2a0e775793c592ad857ea4c9e276cd1e6bef3d47d45182fbb83bb35cc8159c\n",
" Stored in directory: /root/.cache/pip/wheels/62/d6/58/5853130f941e75b2177d281eb7e44b4a98ed46dd155f556dc5\n",
"Successfully built kaggle\n",
"Installing collected packages: kaggle\n",
" Attempting uninstall: kaggle\n",
" Found existing installation: kaggle 1.5.12\n",
" Uninstalling kaggle-1.5.12:\n",
" Successfully uninstalled kaggle-1.5.12\n",
"Successfully installed kaggle-1.5.12\n",
"Downloading ctrtest.zip to /content\n",
" 69% 25.0M/36.1M [00:00<00:00, 118MB/s]\n",
"100% 36.1M/36.1M [00:00<00:00, 146MB/s]\n",
"Archive: ctrtest.zip\n",
" inflating: sample_submission/sample_submission.csv \n",
" inflating: test_ctr/test.csv \n",
" inflating: train_adc/item_data.csv \n",
" inflating: train_adc/train.csv \n",
" inflating: train_adc/view_log.csv \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "p7MKxMvJBYRz"
},
"source": [
"import os\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"from feast import FeatureStore\n",
"from feast import Entity, ValueType, Feature, FeatureView\n",
"from feast.data_format import ParquetFormat\n",
"from feast.data_source import FileSource\n",
"from google.protobuf.duration_pb2 import Duration"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "MX6Bv1x6C1DV"
},
"source": [
"### Initializing the feature store"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "k46fEozUC2nP",
"outputId": "2ba334fb-942d-4470-a810-7ff85e8945bb"
},
"source": [
"!feast init click_data\n",
"%cd click_data"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"Creating a new Feast repository in \u001b[1m\u001b[32m/content/click_data\u001b[0m.\n",
"\n",
"/content/click_data\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xpoDsN6bCzAc"
},
"source": [
"### ETL"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "WJ_g0usFB7R2",
"outputId": "d6187d44-5f92-4d2d-a84a-7559aac3868c"
},
"source": [
"data = pd.read_csv(\"/content/train_adc/train.csv\")\n",
"# Convert impression_time to datetime before writing to Parquet.\n",
"data['impression_time'] = pd.to_datetime(data['impression_time'])\n",
"data.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>impression_id</th><th>impression_time</th><th>user_id</th><th>app_code</th><th>os_version</th><th>is_4G</th><th>is_click</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>c4ca4238a0b923820dcc509a6f75849b</td><td>2018-11-15 00:00:00</td><td>87862</td><td>422</td><td>old</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>45c48cce2e2d7fbdea1afc51c7c6ad26</td><td>2018-11-15 00:01:00</td><td>63410</td><td>467</td><td>latest</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>70efdf2ec9b086079795c442636b55fb</td><td>2018-11-15 00:02:00</td><td>71748</td><td>259</td><td>intermediate</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>8e296a067a37563370ded05f5a3bf3ec</td><td>2018-11-15 00:02:00</td><td>69209</td><td>244</td><td>latest</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>182be0c5cdcd5072bb1864cdee4d3d6e</td><td>2018-11-15 00:02:00</td><td>62873</td><td>473</td><td>latest</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" impression_id impression_time ... is_4G is_click\n",
"0 c4ca4238a0b923820dcc509a6f75849b 2018-11-15 00:00:00 ... 0 0\n",
"1 45c48cce2e2d7fbdea1afc51c7c6ad26 2018-11-15 00:01:00 ... 1 1\n",
"2 70efdf2ec9b086079795c442636b55fb 2018-11-15 00:02:00 ... 1 0\n",
"3 8e296a067a37563370ded05f5a3bf3ec 2018-11-15 00:02:00 ... 1 0\n",
"4 182be0c5cdcd5072bb1864cdee4d3d6e 2018-11-15 00:02:00 ... 0 0\n",
"\n",
"[5 rows x 7 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "v22kUx4OCDHg"
},
"source": [
"data.to_parquet(\"./data/train.parquet\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "Qa1AeoakCGY1",
"outputId": "d55f6cf0-0d4b-4405-bfee-81d86fbce675"
},
"source": [
"item = pd.read_csv(\"/content/train_adc/item_data.csv\")\n",
"item.to_parquet(\"./data/item_data.parquet\")\n",
"item.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>item_id</th><th>item_price</th><th>category_1</th><th>category_2</th><th>category_3</th><th>product_type</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>26880</td><td>4602</td><td>11</td><td>35</td><td>20</td><td>3040</td></tr>\n",
"    <tr><th>1</th><td>54939</td><td>3513</td><td>12</td><td>57</td><td>85</td><td>6822</td></tr>\n",
"    <tr><th>2</th><td>40383</td><td>825</td><td>17</td><td>8</td><td>279</td><td>1619</td></tr>\n",
"    <tr><th>3</th><td>8777</td><td>2355</td><td>13</td><td>58</td><td>189</td><td>5264</td></tr>\n",
"    <tr><th>4</th><td>113705</td><td>1267</td><td>17</td><td>39</td><td>151</td><td>10239</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" item_id item_price category_1 category_2 category_3 product_type\n",
"0 26880 4602 11 35 20 3040\n",
"1 54939 3513 12 57 85 6822\n",
"2 40383 825 17 8 279 1619\n",
"3 8777 2355 13 58 189 5264\n",
"4 113705 1267 17 39 151 10239"
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "tBlU-8KICGVi",
"outputId": "4eecc6f8-9425-4aa2-c089-fb71d3c7d6cc"
},
"source": [
"view_log = pd.read_csv(\"/content/train_adc/view_log.csv\")\n",
"view_log['server_time'] = pd.to_datetime(view_log['server_time'])\n",
"view_log.to_parquet(\"./data/view_log.parquet\")\n",
"view_log.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>server_time</th><th>device_type</th><th>session_id</th><th>user_id</th><th>item_id</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2018-10-15 08:58:00</td><td>android</td><td>112333</td><td>4557</td><td>32970</td></tr>\n",
"    <tr><th>1</th><td>2018-10-15 08:58:00</td><td>android</td><td>503590</td><td>74788</td><td>7640</td></tr>\n",
"    <tr><th>2</th><td>2018-10-15 08:58:00</td><td>android</td><td>573960</td><td>23628</td><td>128855</td></tr>\n",
"    <tr><th>3</th><td>2018-10-15 08:58:00</td><td>android</td><td>121691</td><td>2430</td><td>12774</td></tr>\n",
"    <tr><th>4</th><td>2018-10-15 08:58:00</td><td>android</td><td>218564</td><td>19227</td><td>28296</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" server_time device_type session_id user_id item_id\n",
"0 2018-10-15 08:58:00 android 112333 4557 32970\n",
"1 2018-10-15 08:58:00 android 503590 74788 7640\n",
"2 2018-10-15 08:58:00 android 573960 23628 128855\n",
"3 2018-10-15 08:58:00 android 121691 2430 12774\n",
"4 2018-10-15 08:58:00 android 218564 19227 28296"
]
},
"metadata": {
"tags": []
},
"execution_count": 18
}
]
},
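{
"cell_type": "code",
"metadata": {
"id": "p4FrToWQBYU2"
},
"source": [
"# NOTE: TRAIN_DATA currently points at the view-log file written above, which has no\n",
"# `impression_time` column; point it at the training-data parquet before reading train_table features.\n",
"os.environ[\"TRAIN_DATA\"] = \"./data/view_log.parquet\"\n",
"os.environ[\"ITEM_DATA\"] = \"./data/item_data.parquet\"\n",
"os.environ[\"VIEW_LOG_DATA\"] = \"./data/view_log.parquet\""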
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "u8YXLod8AfcG"
},
"source": [
"### Rewrite the FeatureTable definitions as FeatureViews for the feature store"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Idy4E5rGAQ8v"
},
"source": [
"class ContextAdClickData:\n",
"    \"\"\"Feast sources, entities and feature views for the ad-click dataset.\"\"\"\n",
"\n",
"    def __init__(self) -> None:\n",
"        # Cache of Feature objects, keyed by feature name.\n",
"        self.features = {}\n",
"\n",
"    def train_view_source(self):\n",
"        return FileSource(\n",
"            event_timestamp_column=\"impression_time\",\n",
"            # created_timestamp_column=\"created\",\n",
"            file_format=ParquetFormat(),\n",
"            path=os.environ.get(\"TRAIN_DATA\"),\n",
"        )\n",
"\n",
"    def item_data_view_source(self):\n",
"        return FileSource(\n",
"            file_format=ParquetFormat(),\n",
"            path=os.environ.get(\"ITEM_DATA\")\n",
"            # path=\"s3://{bucket_name}/data/item_data.parquet\"\n",
"        )\n",
"\n",
"    def view_log_data_view_source(self):\n",
"        return FileSource(\n",
"            event_timestamp_column=\"server_time\",\n",
"            file_format=ParquetFormat(),\n",
"            path=os.environ.get(\"VIEW_LOG_DATA\")\n",
"        )\n",
"\n",
"    def trainView(self):\n",
"        \"\"\"Define the feature view over the train (impression/click) data.\"\"\"\n",
"        name = \"train_table\"\n",
"        return FeatureView(\n",
"            name=name,\n",
"            entities=[self.train_entity().name],\n",
"            ttl=Duration(seconds=86400 * 1),\n",
"            features=[\n",
"                self.feature_create(\"user_id\", ValueType.STRING),\n",
"                self.feature_create(\"impression_id\", ValueType.STRING),\n",
"                self.feature_create(\"app_code\", ValueType.INT32),\n",
"                self.feature_create(\"os_version\", ValueType.STRING),\n",
"                self.feature_create(\"is_4G\", ValueType.INT32),\n",
"                self.feature_create(\"is_click\", ValueType.INT32),\n",
"            ],\n",
"            online=True,\n",
"            input=self.train_view_source(),\n",
"            tags={}\n",
"        )\n",
"\n",
"    def viewLogView(self):\n",
"        \"\"\"Define the feature view over the view-log data.\"\"\"\n",
"        name = \"view_log_table\"\n",
"        return FeatureView(\n",
"            name=name,\n",
"            entities=[self.view_log_entity().name],\n",
"            ttl=Duration(seconds=86400 * 1),\n",
"            features=[\n",
"                # self.feature_create(\"server_time\", ValueType.UNIX_TIMESTAMP),\n",
"                self.feature_create(\"device_type\", ValueType.STRING),\n",
"                # self.feature_create(\"session_id\", ValueType.INT32),\n",
"                self.feature_create(\"user_id\", ValueType.INT64),\n",
"                self.feature_create(\"item_id\", ValueType.INT64)\n",
"            ],\n",
"            online=True,\n",
"            input=self.view_log_data_view_source(),\n",
"            tags={}\n",
"        )\n",
"\n",
"    def itemDataView(self):\n",
"        \"\"\"Define the feature view over the item data.\"\"\"\n",
"        name = \"item_data_table\"\n",
"        feature_table = FeatureView(\n",
"            name=name,\n",
"            entities=[self.item_data_entity().name],\n",
"            ttl=Duration(seconds=86400 * 1),\n",
"            features=[\n",
"                self.feature_create(\"item_id\", ValueType.INT32),\n",
"                self.feature_create(\"item_price\", ValueType.INT32),\n",
"                self.feature_create(\"category_1\", ValueType.INT32),\n",
"                self.feature_create(\"category_2\", ValueType.INT32),\n",
"                self.feature_create(\"category_3\", ValueType.INT32),\n",
"                self.feature_create(\"product_type\", ValueType.INT32)\n",
"            ],\n",
"            online=True,\n",
"            input=self.item_data_view_source(),\n",
"            tags={}\n",
"        )\n",
"        return feature_table\n",
"\n",
"    def train_entity(self):\n",
"        name = \"impression_id\"\n",
"        return Entity(name, value_type=ValueType.INT32, description=\"Impression logs with click details\")\n",
"\n",
"    def view_log_entity(self):\n",
"        name = \"session_id\"\n",
"        # TODO: Check how to merge the user_id in this entity and the user_id in the click entity.\n",
"        return Entity(name=name, value_type=ValueType.INT64, description=\"View log containing user_id and item_id being viewed\")\n",
"\n",
"    def item_data_entity(self):\n",
"        name = \"item_id\"\n",
"        return Entity(name=name, value_type=ValueType.INT32, description=\"Item data\")\n",
"\n",
"    def feature_create(self, name, value):\n",
"        \"\"\"Create a Feature with the given name and value type, cache it, and return it.\"\"\"\n",
"        self.features[name] = Feature(name, dtype=value)\n",
"        assert name in self.features\n",
"        return self.features[name]"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Pa6mY1iuA5g7"
},
"source": [
"ad_click = ContextAdClickData()\n",
"\n",
"en_train = ad_click.train_entity()\n",
"en_item = ad_click.item_data_entity()\n",
"en_view_log = ad_click.view_log_entity()\n",
"\n",
"x = ad_click.trainView()\n",
"y = ad_click.itemDataView()\n",
"z = ad_click.viewLogView()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "JerMdLxBIq4Q"
},
"source": [
"### Registering the features in the local feature store"
]
},
{
"cell_type": "code",
"metadata": {
"id": "T3GekxbrCGSH"
},
"source": [
"store = FeatureStore(repo_path=\".\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "0sspvdoXGuQH"
},
"source": [
"store.apply([x,en_train])\n",
"# store.apply([y,en_item])\n",
"store.apply([z,en_view_log])"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "DCKG3RWeIj8S"
},
"source": [
"### Retrieving some features from the local store"
]
},
{
"cell_type": "code",
"metadata": {
"id": "H07v33CHCGOP"
},
"source": [
"entity_df = pd.DataFrame.from_dict({\n",
"    \"session_id\": [218564],\n",
"    \"event_timestamp\": [datetime(2018, 10, 15, 8, 58, 0)],\n",
"})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "hWFHO2lEGk5P",
"outputId": "e9113179-84f3-4a4e-9e7d-1e4e2b46abdb"
},
"source": [
"data_df = store.get_historical_features(feature_refs=[\"view_log_table:device_type\"], entity_df=entity_df)\n",
"ex_data = data_df.to_df()\n",
"ex_data.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" event_timestamp session_id view_log_table__device_type\n",
"0 2018-10-15 08:58:00+00:00 218564 android"
]
},
"metadata": {
"tags": []
},
"execution_count": 38
}
]
}
]
}