{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Rulefit Boston Housing Demo\n",
"\n",
"Rulefit algorithm aims for a compromise between interpretability and complexity of the resulting model. While simpler ML algorithms usually miss interaction effects or require advanced methods to uncover interaction effects, rulefit learns a sparse linear model that include automatically detected interaction effects in the form of decision rules. After that, new features are created in the form of decision rules and a transparent model is built using these features.\n",
"\n",
"Example: IF the number of rooms > 2 AND the age of the house < 15 THEN 1 ELSE 0 (lower than medium)\n",
"\n",
"The general algorithm flow:\n",
"\n",
"1. Algorithm fits a tree ensemble to the data, builds a rule ensemble by traversing each tree. This results in many rules but majority of them not informative.\n",
"2. After that, it evaluates the rules on the data to build a rule feature set and fits a sparse linear model (LASSO) to the rule feature set joined with the original feature set, to select the best ones. "
]
},
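{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch in the notation of Friedman and Popescu (the symbols below are introduced here for illustration and are not used elsewhere in this notebook), the example rule is a product of indicator functions, and the fitted model is linear in the rules $r_k$ and the original features $x_j$:\n",
"\n",
"$$r(x) = I(\\text{rooms} > 2) \\cdot I(\\text{age} < 15)$$\n",
"\n",
"$$F(x) = a_0 + \\sum_{k} a_k \\, r_k(x) + \\sum_{j} b_j \\, x_j$$\n",
"\n",
"The LASSO penalty $\\lambda \\left( \\sum_k |a_k| + \\sum_j |b_j| \\right)$ drives most coefficients to exactly zero, which is how the uninformative rules are discarded. Details such as feature scaling are implementation-specific and are left out of this sketch."
]
},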
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boston house prices dataset:\n",
"\n",
"The response variable is the price of the houses and the goal is to produce a model with significant variables that can predict the house price using the given explanatory variables. Each record describes a Boston suburb or town. The data was created from the Boston Standard Metropolitan Statistical Area (SMSA) in the 70s. The attributes are defined as follows:\n",
"\n",
"- CRIM: per capita crime rate by town\n",
"- ZN: proportion of residential land zoned for lots over 25000 sq. ft.\n",
"- INDUS: proportion of non retail business acres per town\n",
"- CHAS: Charles River dummy var (= 1 if tract bounds rivers; 0 othervise)\n",
"- NOX: nitric oxides concentration (parts per 10 milion)\n",
"- RM: average number of rooms per dweling\n",
"- AGE: proportion of owner-occupied units built prior to 1940\n",
"- DIS: weighed distances to five Boston employment centers\n",
"- RAD: index of accesibility to radial highways\n",
"- TAX: full value property=tax rate per 10000 USD\n",
"- PTRATIO: pupil-teacher ratio by town\n",
"- B: 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town\n",
"- LSTAT: % lower status of the population\n",
"- MEDV: Median value of owner-occupied homes in 1000s USD"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"versionFromGradle='3.36.1',projectVersion='3.36.1.99999',branch='zuzana/rulefit_boston_demo',lastCommitHash='49ae4e81f2be4ece461ddccfb5ed1ec923cb4415',gitDescribe='jenkins-3.36.1.2-7-g49ae4e81f2-dirty',compiledOn='2022-06-02 14:19:03',compiledBy='zuzanaolajcova'\n",
"Checking whether there is an H2O instance running at http://localhost:54321 . connected.\n"
]
}
],
"source": [
"drf = H2ORandomForestEstimator(ntrees=50, seed=123, max_depth=3)\n",
"drf.train(y=y,x=x,training_frame=train)\n",
"print(\"RMSE: \")\n",
"print(drf.training_model_metrics()['RMSE'])\n",
"drf.varimp_plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, RM and LSTAT are the most significant features affecting the house price. \n",
"Now, let's create a rulefit model to a create custom rules for more explainability and interpretability:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rulefit Model Build progress: |██████████████████████████████████████████████████| (done) 100%\n",
"RMSE: \n",
"4.816685889071532\n",
"\n",
"Rule Importance: \n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
variable
\n",
"
coefficient
\n",
"
support
\n",
"
rule
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
\n",
"
linear.rm
\n",
"
3.555456
\n",
"
1.000000
\n",
"
\n",
"
\n",
"
\n",
"
1
\n",
"
\n",
"
linear.chas
\n",
"
1.750286
\n",
"
1.000000
\n",
"
\n",
"
\n",
"
\n",
"
2
\n",
"
\n",
"
M0T41N18
\n",
"
1.455534
\n",
"
0.132411
\n",
"
(age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or ...
\n",
"
\n",
"
\n",
"
3
\n",
"
\n",
"
M0T42N16
\n",
"
1.200294
\n",
"
0.084980
\n",
"
(crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220...
\n",
"
\n",
"
\n",
"
4
\n",
"
\n",
"
linear.ptratio
\n",
"
-0.632223
\n",
"
1.000000
\n",
"
\n",
"
\n",
"
\n",
"
5
\n",
"
\n",
"
linear.lstat
\n",
"
-0.520512
\n",
"
1.000000
\n",
"
\n",
"
\n",
"
\n",
"
6
\n",
"
\n",
"
M0T31N21
\n",
"
-0.395165
\n",
"
0.264822
\n",
"
(dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.90244102...
\n",
"
\n",
"
\n",
"
7
\n",
"
\n",
"
M0T20N21
\n",
"
-0.277795
\n",
"
0.199605
\n",
"
(indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.278124809...
\n",
"
\n",
"
\n",
"
8
\n",
"
\n",
"
M0T27N22
\n",
"
-0.210425
\n",
"
0.199605
\n",
"
(indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.702812671...
\n",
"
\n",
"
\n",
"
9
\n",
"
\n",
"
M0T22N14
\n",
"
0.163021
\n",
"
0.092885
\n",
"
(crim < 6.696437835693359 or crim is NA) & (ptratio < 18.346485137...
\n",
"
\n",
"
\n",
"
10
\n",
"
\n",
"
linear.dis
\n",
"
-0.083196
\n",
"
1.000000
\n",
"
\n",
"
\n",
"
\n",
"
11
\n",
"
\n",
"
linear.crim
\n",
"
-0.025105
\n",
"
1.000000
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" variable coefficient support \\\n",
"0 linear.rm 3.555456 1.000000 \n",
"1 linear.chas 1.750286 1.000000 \n",
"2 M0T41N18 1.455534 0.132411 \n",
"3 M0T42N16 1.200294 0.084980 \n",
"4 linear.ptratio -0.632223 1.000000 \n",
"5 linear.lstat -0.520512 1.000000 \n",
"6 M0T31N21 -0.395165 0.264822 \n",
"7 M0T20N21 -0.277795 0.199605 \n",
"8 M0T27N22 -0.210425 0.199605 \n",
"9 M0T22N14 0.163021 0.092885 \n",
"10 linear.dis -0.083196 1.000000 \n",
"11 linear.crim -0.025105 1.000000 \n",
"\n",
" rule \n",
"0 \n",
"1 \n",
"2 (age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or ... \n",
"3 (crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220... \n",
"4 \n",
"5 \n",
"6 (dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.90244102... \n",
"7 (indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.278124809... \n",
"8 (indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.702812671... \n",
"9 (crim < 6.696437835693359 or crim is NA) & (ptratio < 18.346485137... \n",
"10 \n",
"11 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from h2o.estimators.rulefit import H2ORuleFitEstimator\n",
"rf = H2ORuleFitEstimator(seed=123, lambda_=.5 )#, model_type=\"rules\")\n",
"rf.train(y=y, x=x, training_frame=train)\n",
"print(\"RMSE: \")\n",
"print(rf.training_model_metrics()['RMSE'])\n",
"rf.rule_importance()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rulefits's rule importance table shows all the predictors with non-zero LASSO coefficients. Predictors can be linear (variable name prefixed with \"linear.\") or a rule (rule identificator as a variable name, e.g. M0T43N16 means this rule comes from the tree based model n.0, tree n.43, node n.16). This table holds values of LASSO coefficient and support of undrlying predictor as a factors of predictor importances. In case of rule predictor, also it's language representation is present.\n",
"\n",
"Let's look closer on resulting predictors. From the table we can see that the rules selected by rulefit as most important have LSTAT or RM variables present. Those which were by far the top 2 most important variables of previous H2ORandomForestEstimator model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to interpret significant rules:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full text of M0T41N18 is:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'(age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or nox is NA) & (rm >= 6.940098762512207)'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf.rule_importance()[4][2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which means that the most expensive houses have 7 or more rooms and are in areas with low nitric oxides concentration and with possibly high proportion of historical buildings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full text of M0T42N16 is:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'(crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220459 or lstat is NA) & (rm >= 6.940098762512207)'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf.rule_importance()[4][3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which means that big houses (7 or more rooms) in areas with very-low-to-zero crime rate and very-low-to-zero percentage of lower status of the population are more expensive to live in."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full text of M0T31N21 is:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'(dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.902441024780273) & (rm >= 5.864699363708496 or rm is NA)'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf.rule_importance()[4][6]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Which can be interpreted as: \"with increasing distance from Boston employment centers and increasing pupil-teacher ratio, the price of majority of the house degrades\" (since majority of the houses in our dataset have 6 or more rooms)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also find out at which specific rows the certain rules apply or not:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
M0T31N21
M0T42N16
linear.rm
\n",
"\n",
"\n",
"
0
0
1
\n",
"
0
0
1
\n",
"
0
1
1
\n",
"
0
1
1
\n",
"
0
0
1
\n",
"
0
0
1
\n",
"
0
0
1
\n",
"
0
0
1
\n",
"
0
0
1
\n",
"
0
0
1
\n",
"\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predicted_rules = rf.predict_rules(train, [\"M0T31N21\", \"M0T42N16\", \"linear.rm\"])\n",
"predicted_rules.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please note, that linear predictor applies to all the observations and that col-wise sums of this output divided by the number of observations represents a support of each rule."
]
},
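{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols (a sketch; $N$ and $r_k$ are notation introduced here for illustration), the support of rule $r_k$ over the $N$ training rows is\n",
"\n",
"$$s_k = \\frac{1}{N} \\sum_{i=1}^{N} r_k(x_i)$$\n",
"\n",
"The supports in the rule importance table are consistent with $N = 506$ rows, assuming the model was trained on the full Boston dataset: for example, $67 / 506 \\approx 0.132411$ matches the support reported for M0T41N18."
]
},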
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apart of that, Friedman and Popescu defines (https://arxiv.org/abs/0811.1679) the rulefit-specific global measure of predictors importance as follows:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def FPimportance(data, lasso_coef, support, is_rule):\n",
" if is_rule:\n",
" import math\n",
" return abs(lasso_coef) * math.sqrt(support * (1 - support))\n",
" else:\n",
" return abs(lasso_coef) * data.sd()[0]"
]
},
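{
"cell_type": "markdown",
"metadata": {},
"source": [
"Written out, the two cases implemented above are as follows (a sketch with notation introduced here: $a_k$ is the LASSO coefficient and $s_k$ the support of rule $k$, and $b_j$ is the coefficient of the linear term $x_j$):\n",
"\n",
"$$I_k = |a_k| \\sqrt{s_k (1 - s_k)}, \\qquad I_j = |b_j| \\cdot \\operatorname{std}(x_j)$$\n",
"\n",
"For example, M0T41N18 has coefficient 1.455534 and support 0.132411 in the rule importance table, so $I_k \\approx 1.455534 \\cdot \\sqrt{0.132411 \\cdot 0.867589} \\approx 0.4933$."
]
},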
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hence, we can get global importances calculated out of combined importance factors like:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"def calculate_FPimportance(input, data):\n",
" result = dict()\n",
" for x in range(len(input[1])):\n",
" if input[1][x].startswith('linear.'):\n",
" result[input[1][x]] = FPimportance(data[input[1][x][len(\"linear.\"):]], input[2][x], input[3][x], False)\n",
" else:\n",
" result[input[1][x]] = FPimportance(None, input[2][x], input[3][x], True)\n",
" return result "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"FPimportances = calculate_FPimportance(rf.rule_importance(), train)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import operator\n",
"sorted_FPimportances = sorted(FPimportances.items(), key=operator.itemgetter(1), reverse=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which gives us a slightly reordered list of importances:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('linear.lstat', 3.7170067016740407),\n",
" ('linear.rm', 2.498124117997926),\n",
" ('linear.ptratio', 1.368729326964665),\n",
" ('M0T41N18', 0.49333452353437895),\n",
" ('linear.chas', 0.4445622162226629),\n",
" ('M0T42N16', 0.33470481934298546),\n",
" ('linear.crim', 0.21593853846749478),\n",
" ('linear.dis', 0.1751857632048295),\n",
" ('M0T31N21', 0.1743620600544106),\n",
" ('M0T20N21', 0.11103558646592059),\n",
" ('M0T27N22', 0.0841076264935881),\n",
" ('M0T22N14', 0.04732040851755857)]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted_FPimportances"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stay tuned for the future improvements, mainly rulefit-specific tools for importance and interaction examination!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "h2o3pyenv",
"language": "python",
"name": "h2o3pyenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}