{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![KTS logo](https://raw.githubusercontent.com/konodyuk/kts/master/docs/static/banner_alpha.png)\n", "# Feature Engineering Guide" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[KTS dashboard widget — "features": the feature constructors registered in this notebook, each listed with its name and source as defined in the cells below: simple_feature; generic features interactions (left="Pclass", right="SibSp"), num_aggs (col="Parch", description: "Descriptions are also supported."), and tfidf (col='Name', requirements: sklearn==0.20.2). "helpers": "You've got no helpers so far."]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "np.random.seed(0)\n", "\n", "import kts\n", "from kts import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('../input/train.csv', index_col='PassengerId')\n", "test = pd.read_csv('../input/test.csv', index_col='PassengerId')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `kts.save` to put objects or dataframes to user cache:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "kts.save(train, 'train')\n", "kts.save(test, 'test')" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## Modular Feature Engineering in 30 seconds\n", "\n", "Instead of sequentially adding new columns to one dataframe, you define functions called feature blocks, which take a raw dataframe as input and produce a new dataframe containing only new columns. Then these blocks are collected into feature sets. Such encapsulation enables your features to be computed in parallel, cached, and automatically applied during inference stage, making your experiments executable end-to-end out of the box.\n", "\n", "
\n", "\n", "
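In code, the loop above condenses to a few lines; the sketch below is a minimal, illustrative version of the cells that follow (`family_size` is an extra example block, not one defined later in this notebook):

```python
# Minimal sketch of the modular workflow (condensed from the cells below).
@feature
def is_male(df):
    res = stl.empty_like(df)                  # empty frame sharing df's index
    res['is_male'] = (df.Sex == 'male') + 0
    return res

@feature
def family_size(df):                          # illustrative extra block
    res = stl.empty_like(df)
    res['family_size'] = df['SibSp'] + df['Parch'] + 1
    return res

# Blocks are computed in parallel, cached, and re-applied at inference time
# once they are collected into a feature set.
fs = FeatureSet([is_male, family_size], train_frame=train)
fs[:5]   # slicing previews the assembled feature set
```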
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature block is defined as a function taking one dataframe as an argument and returning a dataframe, too. Indices of input and output should be identical:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "1 a\n", "2 a" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "3 a\n", "4 a\n", "5 a" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def dummy_feature_a(df):\n", " res = pd.DataFrame(index=df.index)\n", " res['a'] = 'a'\n", " return res\n", "\n", "dummy_feature_a(train[:2])\n", "dummy_feature_a(train[2:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`@preview(frame, size_1, size_2, ...)` does almost the same thing as above: it runs your feature constructor on `frame.head(size_1), frame.head(size_2), ...`.\n", "\n", "\n", "*In addition, you can test out parallel execution. By default all of your features will be parallel, but if you want to change this behavior, use `parallel=False`.*" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "1 a\n", "2 a" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "1 a\n", "2 a\n", "3 a\n", "4 a\n", "5 a" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 2, 5, parallel=True)\n", "def dummy_feature_a(df):\n", " res = stl.empty_like(df) # kts.stl is a standard library of feature constructors. Now you need to know\n", " res['a'] = 'a' # only that stl.empty_like(df) is identical to pd.DataFrame(index=df.index)\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature blocks usually consist of more than one feature:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Age mean\n", "PassengerId \n", "1 22.0 28.666667\n", "2 38.0 28.666667\n", "3 26.0 28.666667" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Age mean\n", "PassengerId \n", "1 22.0 31.2\n", "2 38.0 31.2\n", "3 26.0 31.2\n", "4 35.0 31.2\n", "5 35.0 31.2\n", "6 NaN 31.2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 3, 6)\n", "def dummy_feature_age_mean(df):\n", " res = stl.empty_like(df)\n", " res['Age'] = df['Age']\n", " res['mean'] = df['Age'].mean()\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions are registered and converted into feature constructors using `@feature` decorator:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "@feature\n", "def dummy_feature_a(df):\n", " res = stl.empty_like(df)\n", " res['a'] = 'a'\n", " return res\n", "\n", "@feature\n", "def dummy_feature_bcd(df):\n", " res = stl.empty_like(df)\n", " res['b'] = 'b'\n", " res['c'] = 'c'\n", " res['d'] = 'd'\n", " return res\n", "\n", "@feature\n", "def dummy_feature_age_mean(df):\n", " res = stl.empty_like(df)\n", " res['mean'] = df['Age'].mean()\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then a feature set is defined by a list of feature constructors. Use slicing syntax to preview it:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d mean\n", "PassengerId \n", "31 a b c d 44.666667\n", "32 a b c d 44.666667\n", "33 a b c d 44.666667\n", "34 a b c d 44.666667\n", "35 a b c d 44.666667" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_fs = FeatureSet([dummy_feature_a, dummy_feature_bcd, dummy_feature_age_mean], train_frame=train)\n", "dummy_fs[30:35]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's clean up our namespace a bit:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "delete(dummy_feature_a, force=True)\n", "delete(dummy_feature_bcd, force=True)\n", "delete(dummy_feature_age_mean, force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get to the real things. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decorators\n", "\n", "\n", "Almost all of the functions that you'll use have rich docstrings with examples. \n", "Although it is not necessary, I'll demonstrate them throughout this tutorial.\n", "\n", "Let's first take a closer look at the decorators that you have already seen. \n", "Don't be confused if you can't understand something, as it will be better explained in the [Feature Types](#Feature-Types) section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### @preview" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
PREVIEW DOCS

signature
    preview(frame, sizes, parallel, train)

description
    Runs a feature constructor several times to let you make sure it works correctly.
    Sequentially passes frame.head(size) to your feature constructor for each provided size.
    Generic features can also be previewed; in this case they'll be initialized using their default arguments.

params
    frame     -- a dataframe to be used for testing your feature
    *sizes    -- one or more ints, sizes of input dataframes
    parallel  -- whether to preview as a parallel feature constructor
    train     -- df.train flag value to be passed to the feature constructor

examples
    >>> @preview(train, 2, 3, parallel=False)
    ... def some_feature(df):
    ...     res = stl.empty_like(df)
    ...     res['col'] = ...
    ...     return res

    >>> @preview(train, 200)
    ... def some_feature(df):
    ...     return stl.mean_encode(['Parch', 'Embarked'], 'Survived')(df)

    >>> @preview(train, 100)
    ... @generic(left="Age", right="SibSp")
    ... def numeric_interactions(df):
    ...     res = stl.empty_like(df)
    ...     res[f"{left}_add_{right}"] = df[left] + df[right]
    ...     res[f"{left}_sub_{right}"] = df[left] - df[right]
    ...     res[f"{left}_mul_{right}"] = df[left] * df[right]
    ...     return res
" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### @feature" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FEATURE DOCS

signature
    feature(args, cache, parallel, verbose)

description
    Registers a function as a feature constructor and saves it.
    Can be used both with and without flags.
    Note that generic feature constructors should be additionally registered using this decorator.

params
    cache    -- whether to cache calls and avoid recomputing
    parallel -- whether to run in parallel with other parallel FCs
    verbose  -- whether to print logs and show progress

returns
    A feature constructor.

examples
    >>> @feature(parallel=False, verbose=False)
    ... def some_feature(df):
    ...     ...

    >>> @feature
    ... def some_feature(df):
    ...     ...

    >>> @feature
    ... @generic(param='default')
    ... def generic_feature(df):
    ...     ...
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### @generic" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GENERIC DOCS

signature
    generic(kwargs)

description
    Creates a generic feature constructor.
    Generic features are parametrized feature constructors.
    Note that this decorator does not register your function and you should add @feature to save it.

params
    **kwargs -- arguments and their default values

returns
    A generic feature constructor.

examples
    >>> @feature
    ... @generic(left="Age", right="SibSp")
    ... def numeric_interactions(df):
    ...     res = stl.empty_like(df)
    ...     res[f"{left}_add_{right}"] = df[left] + df[right]
    ...     res[f"{left}_sub_{right}"] = df[left] - df[right]
    ...     res[f"{left}_mul_{right}"] = df[left] * df[right]
    ...     return res

    >>> from itertools import combinations
    >>> fs = FeatureSet([
    ...     numeric_interactions(left, right)
    ...     for left, right in combinations(['Parch', 'SibSp', 'Age'], r=2)
    ... ], ...)
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "generic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### delete" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DELETE DOCS

signature
    delete(feature_or_helper, force)

description
    Deletes given feature or helper from lists and clears cache.
    Feature constructors are deleted along with their cache.
    Generic feature constructors are also fully deleted.
    As some STL features produce cache, you can also remove it by passing an STL feature as an argument. The STL feature itself won't be removed.

params
    feature_or_helper -- an instance to be removed
    force             -- force deletion without any warnings and confirmations

examples
    >>> delete(incorrect_feature)
    >>> delete(old_helper)
    >>> delete(stl.mean_encode('Embarked', 'Survived'))
    >>> delete(generic_feature)
" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "delete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Types\n", "\n", "### Regular Features\n", "\n", "This type of FCs should already look quite familiar:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " is_male\n", "PassengerId \n", "1 1\n", "2 0\n", "3 0\n", "4 0\n", "5 1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def simple_feature(df):\n", " res = stl.empty_like(df)\n", " res['is_male'] = (df.Sex == 'male') + 0\n", " return res" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "@feature\n", "def simple_feature(df):\n", " res = stl.empty_like(df)\n", " res['is_male'] = (df.Sex == 'male') + 0\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature constructors can print anything to stdout and it will be shown in your report in real time, even if your features are computed in separate processes:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[progress report for feature_with_stdout — captured stdout: "[17:34:13.126] some logs"]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "1 a\n", "2 a" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 2)\n", "def feature_with_stdout(df):\n", " res = stl.empty_like(df)\n", " res['a'] = 'a'\n", " print('some logs')\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `kts.pbar` to track progress of long-running features:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[progress report for feature_with_pbar — a single kts.pbar progress bar, 3s total]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "1 a\n", "2 a" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import time\n", "\n", "@preview(train, 2)\n", "def feature_with_pbar(df):\n", " res = stl.empty_like(df)\n", " res['a'] = 'a'\n", " for i in pbar(['a', 'b', 'c']):\n", " time.sleep(1)\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "They can also be nested and titled:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[progress report for feature_with_nested_pbar — outer bar (9s) with nested bars titled a, b and c (3s each)]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " a\n", "PassengerId \n", "1 a\n", "2 a" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 2)\n", "def feature_with_nested_pbar(df):\n", " res = stl.empty_like(df)\n", " res['a'] = 'a'\n", " for i in pbar(['a', 'b', 'c']):\n", " for j in pbar(range(6), title=i):\n", " time.sleep(0.5)\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Features Using External Frames\n", "\n", "Sometimes datasets consist of more than one dataframe. To get an external dataframe into you feature constructor's scope, you need to save it with `kts.save()` and then use the following syntax:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[progress report for feature_using_external — captured stdout: "[17:34:25.902] DataFrame"]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Pclass somefeat\n", "PassengerId \n", "1 3 6\n", "2 1 4\n", "3 3 6\n", "4 1 4\n", "5 3 6\n", "6 3 6\n", "7 1 4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "external = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})\n", "kts.save(external, 'external')\n", "\n", "@preview(train, 7)\n", "def feature_using_external(df, somename='external'):\n", " \"\"\"\n", " To get an external dataframe, you should set its name in user cache as a default value.\n", " Inside it will look like a usual dataframe.\n", " \"\"\"\n", " print(somename.__class__.__name__)\n", " time.sleep(1) # a short delay to receive stdout\n", " res = stl.empty_like(df)\n", " res['Pclass'] = df['Pclass']\n", " res['somefeat'] = somename.set_index('a').loc[df['Pclass']]['b'].values\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stateful Features\n", "\n", "Some features may need their state to be saved between training and inference stages. In this case you can use `df.train` or `df._train` to identify which stage it is and `df.state` or `df._state` as a dictionary to write and read the state:\n", "\n", "*Unfortunately, so far you can preview only training stage using @preview. Later we'll add @preview_train_test to emulate both stages.*" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[progress report for stateful_feature — captured stdout: "[17:34:27.039] this is a training stage"]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Age age_std\n", "PassengerId \n", "1 22.0 -0.294872\n", "2 38.0 0.217949\n", "3 26.0 -0.166667\n", "4 35.0 0.121795\n", "5 35.0 0.121795" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def stateful_feature(df):\n", " \"\"\"A simple standardizer\"\"\"\n", " res = stl.empty_like(df)\n", " if df.train:\n", " print('this is a training stage')\n", " df.state['mean'] = df['Age'].mean()\n", " df.state['std'] = df['Age'].std()\n", " mean = df.state['mean']\n", " std = df.state['mean']\n", " res['Age'] = df['Age']\n", " res['age_std'] = (df['Age'] - mean) / std\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generic Features\n", "\n", "You can also create reusable functions with `@generic(arg1=default, arg2=default, ...)`. For preview, default arguments are used." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Pclass_add_SibSp Pclass_sub_SibSp Pclass_mul_SibSp\n", "PassengerId \n", "1 4 2 3\n", "2 2 0 1\n", "3 3 3 0\n", "4 2 0 1\n", "5 3 3 0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "@generic(left=\"Pclass\", right=\"SibSp\")\n", "def interactions(df):\n", " res = stl.empty_like(df)\n", " res[f\"{left}_add_{right}\"] = df[left] + df[right]\n", " res[f\"{left}_sub_{right}\"] = df[left] - df[right]\n", " res[f\"{left}_mul_{right}\"] = df[left] * df[right]\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's register a couple of generic features:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "@feature\n", "@generic(left=\"Pclass\", right=\"SibSp\")\n", "def interactions(df):\n", " res = stl.empty_like(df)\n", " res[f\"{left}_add_{right}\"] = df[left] + df[right]\n", " res[f\"{left}_sub_{right}\"] = df[left] - df[right]\n", " res[f\"{left}_mul_{right}\"] = df[left] * df[right]\n", " return res" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "@feature\n", "@generic(col=\"Parch\")\n", "def num_aggs(df):\n", " \"\"\"Descriptions are also supported.\"\"\"\n", " res = pd.DataFrame(index=df.index)\n", " mean = df[col].mean()\n", " std = df[col].std()\n", " res[f\"{col}_div_mean\"] = df[col] / mean\n", " res[f\"{col}_sub_div_mean\"] = (df[col] - mean) / mean\n", " res[f\"{col}_div_std\"] = df[col] / std\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A combination of generic and stateful feature. It also returns a numpy array instead of dataframe. In this case, KTS will attach input index to result dataframe automatically." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " tfidf__Name_0 tfidf__Name_1 tfidf__Name_2 tfidf__Name_3 \\\n", "PassengerId \n", "1 0.508281 0.338854 0.185575 0.742300 \n", "2 0.593616 0.197872 0.433463 0.541828 \n", "3 0.464173 0.464173 0.508413 0.000000 \n", "4 0.603771 0.301886 0.661317 0.220439 \n", "5 0.631088 0.420725 0.460825 0.460825 \n", "6 0.508984 0.508984 0.278748 0.557496 \n", "7 0.779844 0.259948 0.000000 0.569447 \n", "8 0.395067 0.526756 0.288481 0.288481 \n", "9 0.605911 0.302956 0.442440 0.331830 \n", "10 0.449865 0.449865 0.492741 0.246371 \n", "\n", " tfidf__Name_4 \n", "PassengerId \n", "1 0.203426 \n", "2 0.356369 \n", "3 0.557318 \n", "4 0.241644 \n", "5 0.000000 \n", "6 0.305561 \n", "7 0.000000 \n", "8 0.632461 \n", "9 0.485000 \n", "10 0.540139 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "@preview(train, 10)\n", "@generic(col='Name')\n", "def tfidf(df):\n", " if df.train:\n", " enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)\n", " res = enc.fit_transform(df[col])\n", " df.state['enc'] = enc\n", " else:\n", " enc = df.state['enc']\n", " res = enc.transform(df[col])\n", " return res.todense()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't forget to change `@preview` to `@feature` to register generics:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "@feature\n", "@generic(col='Name')\n", "def tfidf(df):\n", " if df.train:\n", " enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)\n", " res = enc.fit_transform(df[col])\n", " df.state['enc'] = enc\n", " else:\n", " enc = df.state['enc']\n", " res = enc.transform(df[col])\n", " return res.todense()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[GENERIC FEATURE widget for tfidf — source identical to the @feature-decorated definition in the cell above; requirements: sklearn==0.20.2]
" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that KTS added sklearn to dependencies. Right now it is not very useful, but later it may be used to dockerize experiments automatically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standard Library\n", "\n", "KTS provides the most essential feature constructors as a standard library, i.e. `kts.stl` submodule. All of the STL features have rich docstrings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.empty_like" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
EMPTY_LIKE DOCS

description
    Returns an empty dataframe, preserving only index

examples
    >>> @feature
    ... def some_feature(df):
    ...     res = stl.empty_like(df)
    ...     res['col'] = ...
    ...     return res
" ], "text/plain": [ "stl.empty_like" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.empty_like" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ "Empty KTSFrame\n", "Columns: []\n", "Index: [1, 2, 3, 4, 5]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def preview_stl(df):\n", " return stl.empty_like(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.identity" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
IDENTITY DOCS

description
    Returns its input

examples
    >>> fs = FeatureSet([stl.identity, one_feature, another_feature], ...)
    >>> assert all((stl.identity & ['a', 'b'])(df) == stl.select(['a', 'b'])(df))
    >>> assert all((stl.identity - ['a', 'b'])(df) == stl.drop(['a', 'b'])(df))
" ], "text/plain": [ "stl.identity" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.identity" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def preview_stl(df):\n", " return stl.identity(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.select" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SELECT DOCS

signature
    select(columns)

description
    Selects columns from a dataframe. Identical to df[columns]

params
    columns -- columns to select

returns
    A feature constructor selecting given columns from input dataframe.

examples
    >>> assert all(stl.select(['a', 'b'])(df) == df[['a', 'b']])
" ], "text/plain": [ " kts.core.feature_constructor.base.Selector>" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.select" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Name Sex\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male\n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female\n", "3 Heikkinen, Miss. Laina female\n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female\n", "5 Allen, Mr. William Henry male" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def preview_stl(df):\n", " return stl.select(['Name', 'Sex'])(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.drop" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DROP DOCS

signature
    drop(columns)

description
    Drops columns from a dataframe. Identical to df.drop(columns, axis=1)

params
    columns -- columns to drop

returns
    A feature constructor dropping given columns from input dataframe.

examples
    >>> assert all(stl.drop(['a', 'b'])(df) == df.drop(['a', 'b'], axis=1))
" ], "text/plain": [ " kts.core.feature_constructor.base.Dropper>" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.drop" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Pclass Name \\\n", "PassengerId \n", "1 3 Braund, Mr. Owen Harris \n", "2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... \n", "3 3 Heikkinen, Miss. Laina \n", "4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n", "5 3 Allen, Mr. William Henry \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin \\\n", "PassengerId \n", "1 male 22.0 1 0 A/5 21171 7.2500 NaN \n", "2 female 38.0 1 0 PC 17599 71.2833 C85 \n", "3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN \n", "4 female 35.0 1 0 113803 53.1000 C123 \n", "5 male 35.0 0 0 373450 8.0500 NaN \n", "\n", " Embarked \n", "PassengerId \n", "1 S \n", "2 C \n", "3 S \n", "4 S \n", "5 S " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def preview_stl(df):\n", " return stl.drop(['Survived'])(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.concat" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
CONCAT DOCS

signature
    concat(feature_constructors)

description
    Concatenates feature constructors

params
    feature_constructors -- list of feature constructors

returns
    A single feature constructor whose output contains columns from each of the given features.

examples
    >>> from category_encoders import WOEEncoder, CatBoostEncoder
    >>> stl.concat([
    ...     stl.select(['Age']),
    ...     stl.category_encode(WOEEncoder(), ['Sex', 'Embarked'], 'Survived'),
    ...     stl.category_encode(CatBoostEncoder(), ['Sex', 'Embarked'], 'Survived'),
    ... ])
" ], "text/plain": [ " kts.stl.backend.Concat>" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.concat" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Name Sex \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female \n", "3 Heikkinen, Miss. Laina female \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female \n", "5 Allen, Mr. William Henry male \n", "\n", " is_male tfidf__Name_0 tfidf__Name_1 tfidf__Name_2 \\\n", "PassengerId \n", "1 1 0.497477 0.331651 0.165826 \n", "2 0 0.610662 0.203554 0.407108 \n", "3 0 0.546402 0.546402 0.546402 \n", "4 0 0.544245 0.272122 0.544245 \n", "5 1 0.447424 0.298283 0.298283 \n", "\n", " tfidf__Name_3 tfidf__Name_4 \n", "PassengerId \n", "1 0.000000 0.784236 \n", "2 0.240666 0.601665 \n", "3 0.323011 0.000000 \n", "4 0.536227 0.214491 \n", "5 0.705332 0.352666 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 5)\n", "def preview_stl(df):\n", " res = stl.concat([\n", " stl.select(['Sex', 'Name']),\n", " simple_feature,\n", " tfidf('Name')\n", " ])(df)\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.apply" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
APPLY DOCS

signature
    apply(df, func, parts, optimize, verbose)

description
    Applies a function row-wise in parallel. Identical to df.apply(func, axis=1)

params
    df       -- input dataframe
    func     -- function taking a pd.Series as input and returning a single value
    parts    -- number of parts to split the dataframe into. May be greater than the number of cores
    optimize -- if set to True, then the dataframe won't be partitioned if its size is less than 100
    verbose  -- whether to show a progress bar for each process

returns
    A dataframe whose only column contains the result of calling func for each row.

examples
    >>> def func(row):
    ...     if row.Embarked == 'S':
    ...         return row.SibSp
    ...     return row.Age
    >>> stl.apply(df, func, parts=7, verbose=True)
" ], "text/plain": [ " pandas.core.frame.DataFrame>" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.apply" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[progress report for preview_stl (19s) — seven parallel workers stl_apply_0_100 … stl_apply_600_700, each with its own progress bar (~10s)]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " col\n", "PassengerId \n", "1 1.0\n", "2 38.0\n", "3 0.0\n", "4 1.0\n", "5 0.0\n", "... ...\n", "696 0.0\n", "697 0.0\n", "698 NaN\n", "699 49.0\n", "700 0.0\n", "\n", "[700 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 700, parallel=True)\n", "def preview_stl(df):\n", " def func(row):\n", " \"\"\"A regular row-wise function with any logic.\"\"\"\n", " time.sleep(0.1)\n", " if row.Embarked == 'S':\n", " return row.SibSp\n", " return row.Age\n", " res = stl.empty_like(df)\n", " res['col'] = stl.apply(df, func, parts=7, verbose=True)\n", " return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.category_encode" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
CATEGORY_ENCODE DOCS

signature
    category_encode(encoder, columns, targets)

description
    Encodes categorical features in parallel.
    Performs both simple category encoding, such as one-hot, and various target encoding techniques.
    If target columns are provided, each pair (encoded column, target column) from the cartesian product of both lists is encoded using the encoder.
    Encoders returning one column (e.g. TargetEncoder, WOEEncoder) or a fixed number of columns (HashingEncoder, BaseNEncoder) are run in parallel, whereas encoders whose number of output columns depends on the count of unique values (HelmertEncoder, OneHotEncoder) are run in the main process to avoid result serialization overhead.

params
    encoder -- an instance of an encoder from the category_encoders package with predefined parameters
    columns -- list of encoded columns. Treats a string as a list of length 1
    targets -- list of target columns. Should be provided if the encoder uses a target. Treats a string as a list of length 1

returns
    A feature constructor returning a concatenation of the resulting columns.

examples
    >>> from category_encoders import WOEEncoder, TargetEncoder
    >>> stl.category_encode(WOEEncoder(), ['Sex', 'Embarked'], 'Survived')
    >>> stl.category_encode(TargetEncoder(smoothing=3), ['Sex', 'Embarked'], ['Survived', 'Age'])
    >>> stl.category_encode(WOEEncoder(sigma=0.1, regularization=0.5), 'Sex', 'Survived')
" ], "text/plain": [ " kts.stl.backend.CategoryEncoder>" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.category_encode" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Cabin_ce_Survived_CatBoostEncoder_random_state_0_sigma_3 \\\n", "PassengerId \n", "1 2.579784 \n", "2 0.902193 \n", "3 0.806924 \n", "4 3.166299 \n", "5 3.103257 \n", "... ... \n", "96 1.096301 \n", "97 0.422915 \n", "98 2.606621 \n", "99 0.479063 \n", "100 0.783394 \n", "\n", " Embarked_ce_Survived_CatBoostEncoder_random_state_0_sigma_3 \n", "PassengerId \n", "1 2.579784 \n", "2 0.902193 \n", "3 0.806924 \n", "4 3.629659 \n", "5 3.978111 \n", "... ... \n", "96 1.090039 \n", "97 0.485321 \n", "98 2.848815 \n", "99 0.475339 \n", "100 0.780401 \n", "\n", "[100 rows x 2 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from category_encoders import CatBoostEncoder, WOEEncoder, TargetEncoder\n", "\n", "@preview(train, 100)\n", "def preview_stl(df):\n", " encoder = CatBoostEncoder(sigma=3, random_state=0)\n", " return stl.category_encode(encoder, columns=['Cabin', 'Embarked'], targets='Survived')(df)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Cabin \\\n", "PassengerId \n", "1 0 NaN \n", "2 1 C85 \n", "3 1 NaN \n", "4 1 C123 \n", "5 0 NaN \n", "... ... ... \n", "96 0 NaN \n", "97 0 A5 \n", "98 1 D10 D12 \n", "99 1 NaN \n", "100 0 NaN \n", "\n", " Cabin_ce_Survived_CatBoostEncoder_random_state_0 \\\n", "PassengerId \n", "1 0.410000 \n", "2 0.410000 \n", "3 0.205000 \n", "4 0.410000 \n", "5 0.470000 \n", "... ... \n", "96 0.351410 \n", "97 0.410000 \n", "98 0.410000 \n", "99 0.346962 \n", "100 0.355125 \n", "\n", " Cabin_ce_Survived_WOEEncoder Cabin_ce_Survived_TargetEncoder \n", "PassengerId \n", "1 -0.253322 0.35 \n", "2 0.000000 0.41 \n", "3 -0.253322 0.35 \n", "4 0.000000 0.41 \n", "5 -0.253322 0.35 \n", "... ... ... \n", "96 -0.253322 0.35 \n", "97 0.000000 0.41 \n", "98 0.000000 0.41 \n", "99 -0.253322 0.35 \n", "100 -0.253322 0.35 \n", "\n", "[100 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 100)\n", "def preview_stl(df):\n", " return stl.concat([\n", " stl.select(['Cabin', 'Survived']),\n", " stl.category_encode(CatBoostEncoder(random_state=0), columns='Cabin', targets='Survived'),\n", " stl.category_encode(WOEEncoder(), columns='Cabin', targets='Survived'),\n", " stl.category_encode(TargetEncoder(), columns='Cabin', targets='Survived'),\n", " ])(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.mean_encode" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
MEAN_ENCODE DOCS
\n", "
signature
\n", "
mean_encode(columns, targets, smoothing, min_samples_leaf)\n",
       "
\n", "
description
\n", "
Performs mean target encoding in parallel

An alias for stl.category_encode(TargetEncoder(smoothing, min_samples_leaf), columns, targets).
\n", "
params
\n", "
columns
list of columns to encode. A string is treated as a list of length 1
\n", "
targets
list of target columns. Should be provided if the encoder uses a target. A string is treated as a list of length 1
\n", "
smoothing
smoothing effect to balance the categorical average against the prior.
A higher value means stronger regularization.
The value must be strictly greater than 0.
\n", "
min_samples_leaf
minimum number of samples required to take the category average into account.
\n", "
\n", "
returns
\n", "
A feature constructor performing mean encoding for each (column, target) pair and returning their concatenation.
\n", "
examples
\n", "
>>> stl.mean_encode(['Sex', 'Embarked'], ['Survived', 'Age'])\n",
">>> stl.mean_encode(['Sex', 'Embarked'], 'Survived', smoothing=1.5, min_samples_leaf=5)\n",
       "
" ], "text/plain": [ " kts.stl.backend.CategoryEncoder>" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.mean_encode" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
COMPUTING FEATURES
feature
progress
\n", "
preview_stl
\n", "
0s
\n", "
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Cabin_ce_Survived_TargetEncoder_smoothing_3.0
PassengerId
10.35
20.41
30.35
40.41
50.35
......
960.35
970.41
980.41
990.35
1000.35
\n", "

100 rows × 1 columns

\n", "
" ], "text/plain": [ " Cabin_ce_Survived_TargetEncoder_smoothing_3.0\n", "PassengerId \n", "1 0.35\n", "2 0.41\n", "3 0.35\n", "4 0.41\n", "5 0.35\n", "... ...\n", "96 0.35\n", "97 0.41\n", "98 0.41\n", "99 0.35\n", "100 0.35\n", "\n", "[100 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 100)\n", "def preview_stl(df):\n", " \"\"\"An alias for stl.category_encode(TargetEncoder())\"\"\"\n", " return stl.mean_encode('Cabin', 'Survived', smoothing=3)(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### stl.one_hot_encode" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ONE_HOT_ENCODE DOCS
\n", "
signature
\n", "
one_hot_encode(columns)\n",
       "
\n", "
description
\n", "
Performs simple one-hot encoding

An alias for stl.category_encode(OneHotEncoder(), columns).
\n", "
params
\n", "
columns
list of columns to encode. A string is treated as a list of length 1
\n", "
\n", "
returns
\n", "
A feature constructor returning the concatenation of the one-hot encodings of each column.
\n", "
examples
\n", "
>>> stl.one_hot_encode(['Sex', 'Embarked'])\n",
       ">>> stl.one_hot_encode('Embarked')\n",
       "
" ], "text/plain": [ " kts.stl.backend.CategoryEncoder>" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stl.one_hot_encode" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
COMPUTING FEATURES
feature
progress
\n", "
preview_stl
\n", "
0s
\n", "
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Embarked_ce_OneHotEncoder_0Embarked_ce_OneHotEncoder_1Embarked_ce_OneHotEncoder_2Embarked_ce_OneHotEncoder_3
PassengerId
11000
20100
31000
41000
51000
...............
961000
970100
980100
991000
1001000
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " Embarked_ce_OneHotEncoder_0 Embarked_ce_OneHotEncoder_1 \\\n", "PassengerId \n", "1 1 0 \n", "2 0 1 \n", "3 1 0 \n", "4 1 0 \n", "5 1 0 \n", "... ... ... \n", "96 1 0 \n", "97 0 1 \n", "98 0 1 \n", "99 1 0 \n", "100 1 0 \n", "\n", " Embarked_ce_OneHotEncoder_2 Embarked_ce_OneHotEncoder_3 \n", "PassengerId \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 \n", "5 0 0 \n", "... ... ... \n", "96 0 0 \n", "97 0 0 \n", "98 0 0 \n", "99 0 0 \n", "100 0 0 \n", "\n", "[100 rows x 4 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "@preview(train, 100, parallel=False) # One hot encoder produces a lot of columns, but is computationally cheap, that's why we don't compute it in parallel\n", "def preview_stl(df):\n", " \"\"\"An alias for stl.category_encode(OneHotEncoder())\"\"\"\n", " return stl.one_hot_encode('Embarked')(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Set" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FEATURESET DOCS
\n", "
signature
\n", "
FeatureSet(before_split, after_split, train_frame, test_frame, targets, auxiliary, description)\n",
       "
\n", "
description
\n", "
Collects and computes feature constructors
\n", "
params
\n", "
before_split
list of regular features
\n", "
after_split
list of stateful features which may leak the target if computed before the split.
They are run in Single Validation mode, i.e. for each fold they are fit on the training objects
and then applied to the validation objects in inference mode.
\n", "
train_frame
a dataframe to train on. Each object should have a unique index.
\n", "
targets
list of target columns in case of a multilabel task, or a single string otherwise.
Target columns may be computed; in that case the corresponding feature constructors
should be passed in the before_split list.
\n", "
auxiliary
list of auxiliary columns, such as datetime, groups or whatever else can be used
for setting up your validation. These columns can be utilized by overriding Validator.
Like targets, auxiliary columns may be computed.
\n", "
description
any notes about this feature set.
\n", "
\n", "
examples
\n", "
>>> fs = FeatureSet([feature_1, feature_2], [single_validation_feature],\n",
       "...                  train_frame=train, targets='Survived')\n",
       "\n",
       ">>> fs = FeatureSet([feature_1, feature_2], [single_validation_feature],\n",
       "...                  train_frame=train,\n",
       "...                  targets=['Target1', 'Target2'], auxiliary=['date', 'metric_group'])\n",
       "\n",
       ">>> fs = FeatureSet([stl.select(['Age', 'Fare'])], [stl.mean_encode(['Embarked', 'Parch'], 'Survived')],\n",
       "...                  train_frame=train, targets='Survived')\n",
       "
" ], "text/plain": [ "kts.core.feature_set.FeatureSet" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FeatureSet" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "fs = FeatureSet([simple_feature, interactions('Pclass', 'Age'), num_aggs('Fare'), tfidf('Name')], \n", " [stl.category_encode(TargetEncoder(), 'Embarked', 'Survived'), \n", " stl.category_encode(WOEEncoder(), 'Embarked', 'Survived')],\n", " train_frame=train,\n", " targets='Survived')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each feature set is given a unique identifier. It also contains source code of all the features right in its repr:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FEATURE SET
\n", "
name
\n", "
FSBWBXEK
\n", "
features
\n", "
\n", "
\n", "
\n", "
FEATURE CONSTRUCTOR
\n", " \n", "
\n", "
name
\n", "
simple_feature
\n", "
source
\n", "
@feature\n",
       "def simple_feature(df):\n",
       "    res = stl.empty_like(df)\n",
       "    res['is_male'] = (df.Sex == 'male') + 0\n",
       "    return res\n",
       "
\n", "
\n", "\n", "
\n", "
\n", "
FEATURE CONSTRUCTOR
\n", " \n", "
\n", "
name
\n", "
interactions('Pclass', 'Age')
\n", "
description
\n", "
An instance of generic feature constructor interactions
\n", "
source
\n", "
interactions('Pclass', 'Age')\n",
       "
\n", "
additional source
\n", "
@feature\n",
       "@generic(left="Pclass", right="SibSp")\n",
       "def interactions(df):\n",
       "    res = stl.empty_like(df)\n",
       "    res[f"{left}_add_{right}"] = df[left] + df[right]\n",
       "    res[f"{left}_sub_{right}"] = df[left] - df[right]\n",
       "    res[f"{left}_mul_{right}"] = df[left] * df[right]\n",
       "    return res\n",
       "
\n", "
\n", "\n", "
\n", "
\n", "
FEATURE CONSTRUCTOR
\n", " \n", "
\n", "
name
\n", "
num_aggs('Fare')
\n", "
description
\n", "
An instance of generic feature constructor num_aggs
\n", "
source
\n", "
num_aggs('Fare')\n",
       "
\n", "
additional source
\n", "
@feature\n",
       "@generic(col="Parch")\n",
       "def num_aggs(df):\n",
       "    """Descriptions are also supported."""\n",
       "    res = pd.DataFrame(index=df.index)\n",
       "    mean = df[col].mean()\n",
       "    std = df[col].std()\n",
       "    res[f"{col}_div_mean"] = df[col] / mean\n",
       "    res[f"{col}_sub_div_mean"] = (df[col] - mean) / mean\n",
       "    res[f"{col}_div_std"] = df[col] / std\n",
       "    return res\n",
       "
\n", "
\n", "\n", "
\n", "
\n", "
FEATURE CONSTRUCTOR
\n", " \n", "
\n", "
name
\n", "
tfidf('Name')
\n", "
description
\n", "
An instance of generic feature constructor tfidf
\n", "
source
\n", "
tfidf('Name')\n",
       "
\n", "
additional source
\n", "
@feature\n",
       "@generic(col='Name')\n",
       "def tfidf(df):\n",
       "    if df.train:\n",
       "        enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)\n",
       "        res = enc.fit_transform(df[col])\n",
       "        df.state['enc'] = enc\n",
       "    else:\n",
       "        enc = df.state['enc']\n",
       "        res = enc.transform(df[col])\n",
       "    return res.todense()\n",
       "
\n", "
requirements
\n", "
sklearn==0.20.2
\n", "
\n", "\n", "
\n", "
\n", "
FEATURE CONSTRUCTOR
\n", " \n", "
\n", "
name
\n", "
stl.category_encode(TargetEncoder(), ['Embarked'], ['Survived'])
\n", "
source
\n", "
stl.category_encode(TargetEncoder(), ['Embarked'], ['Survived'])\n",
       "
\n", "
\n", "\n", "
\n", "
\n", "
FEATURE CONSTRUCTOR
\n", " \n", "
\n", "
name
\n", "
stl.category_encode(WOEEncoder(), ['Embarked'], ['Survived'])
\n", "
source
\n", "
stl.category_encode(WOEEncoder(), ['Embarked'], ['Survived'])\n",
       "
\n", "
\n", "
source
\n", "
FeatureSet([simple_feature,\n",
       "            interactions('Pclass', 'Age'),\n",
       "            num_aggs('Fare'),\n",
       "            tfidf('Name')],\n",
       "           [stl.category_encode(TargetEncoder(), ['Embarked'], ['Survived']),\n",
       "            stl.category_encode(WOEEncoder(), ['Embarked'], ['Survived'])],\n",
       "           targets=['Survived'],\n",
       "           auxiliary=[])\n",
       "
\n", "
requirements
\n", "
sklearn==0.20.2
" ], "text/plain": [ "" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use slicing to preview your feature sets. Slicing calls are not cached and do not leak dataframes to IPython namespace, so you can run them as many times as you need. For stateful features, slicing calls always trigger a training stage." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
COMPUTING FEATURES
feature
progress
\n", "
simple_feature
\n", "
0s
\n", "
\n", "
interactions__Pclass_Age
\n", "
0s
\n", "
\n", "
num_aggs__Fare
\n", "
0s
\n", "
\n", "
tfidf__Name
\n", "
0s
\n", "
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
is_malePclass_add_AgePclass_sub_AgePclass_mul_AgeFare_div_meanFare_sub_div_meanFare_div_stdtfidf__Name_0tfidf__Name_1tfidf__Name_2tfidf__Name_3tfidf__Name_4Embarked_ce_Survived_TargetEncoderEmbarked_ce_Survived_WOEEncoder
PassengerId
1125.0-19.066.00.268312-0.7316880.3071780.5082810.3388540.1855750.7423000.2034260.428748-0.223144
2039.0-37.038.02.6380881.6380883.0202310.5936160.1978720.4334630.5418280.3563690.8655291.098612
3029.0-23.078.00.293292-0.7067080.3357780.4641730.4641730.5084130.0000000.5573180.428748-0.223144
4036.0-34.035.01.9651510.9651512.2498150.6037710.3018860.6613170.2204390.2416440.428748-0.223144
5138.0-32.0105.00.297918-0.7020820.3410740.6310880.4207250.4608250.4608250.0000000.428748-0.223144
61NaNNaNNaN0.313029-0.6869710.3583730.5089840.5089840.2787480.5574960.3055610.5000000.000000
7155.0-53.054.01.9193530.9193532.1973830.7798440.2599480.0000000.5694470.0000000.428748-0.223144
815.01.06.00.779954-0.2200460.8929350.3950670.5267560.2884810.2884810.6324610.428748-0.223144
9030.0-24.081.00.412027-0.5879730.4717110.6059110.3029560.4424400.3318300.4850000.428748-0.223144
10016.0-12.028.01.1128750.1128751.2740820.4498650.4498650.4927410.2463710.5401390.8655291.098612
\n", "
" ], "text/plain": [ " is_male Pclass_add_Age Pclass_sub_Age Pclass_mul_Age \\\n", "PassengerId \n", "1 1 25.0 -19.0 66.0 \n", "2 0 39.0 -37.0 38.0 \n", "3 0 29.0 -23.0 78.0 \n", "4 0 36.0 -34.0 35.0 \n", "5 1 38.0 -32.0 105.0 \n", "6 1 NaN NaN NaN \n", "7 1 55.0 -53.0 54.0 \n", "8 1 5.0 1.0 6.0 \n", "9 0 30.0 -24.0 81.0 \n", "10 0 16.0 -12.0 28.0 \n", "\n", " Fare_div_mean Fare_sub_div_mean Fare_div_std tfidf__Name_0 \\\n", "PassengerId \n", "1 0.268312 -0.731688 0.307178 0.508281 \n", "2 2.638088 1.638088 3.020231 0.593616 \n", "3 0.293292 -0.706708 0.335778 0.464173 \n", "4 1.965151 0.965151 2.249815 0.603771 \n", "5 0.297918 -0.702082 0.341074 0.631088 \n", "6 0.313029 -0.686971 0.358373 0.508984 \n", "7 1.919353 0.919353 2.197383 0.779844 \n", "8 0.779954 -0.220046 0.892935 0.395067 \n", "9 0.412027 -0.587973 0.471711 0.605911 \n", "10 1.112875 0.112875 1.274082 0.449865 \n", "\n", " tfidf__Name_1 tfidf__Name_2 tfidf__Name_3 tfidf__Name_4 \\\n", "PassengerId \n", "1 0.338854 0.185575 0.742300 0.203426 \n", "2 0.197872 0.433463 0.541828 0.356369 \n", "3 0.464173 0.508413 0.000000 0.557318 \n", "4 0.301886 0.661317 0.220439 0.241644 \n", "5 0.420725 0.460825 0.460825 0.000000 \n", "6 0.508984 0.278748 0.557496 0.305561 \n", "7 0.259948 0.000000 0.569447 0.000000 \n", "8 0.526756 0.288481 0.288481 0.632461 \n", "9 0.302956 0.442440 0.331830 0.485000 \n", "10 0.449865 0.492741 0.246371 0.540139 \n", "\n", " Embarked_ce_Survived_TargetEncoder \\\n", "PassengerId \n", "1 0.428748 \n", "2 0.865529 \n", "3 0.428748 \n", "4 0.428748 \n", "5 0.428748 \n", "6 0.500000 \n", "7 0.428748 \n", "8 0.428748 \n", "9 0.428748 \n", "10 0.865529 \n", "\n", " Embarked_ce_Survived_WOEEncoder \n", "PassengerId \n", "1 -0.223144 \n", "2 1.098612 \n", "3 -0.223144 \n", "4 -0.223144 \n", "5 -0.223144 \n", "6 0.000000 \n", "7 -0.223144 \n", "8 -0.223144 \n", "9 -0.223144 \n", "10 1.098612 " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs[:10]" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }