{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formula\n", "\n", "The models fit by functions such as `ols` and `lasso` are specified in a compact symbolic form. The tilde operator `~` is fundamental to forming such models. An expression of the form `y ~ model` specifies that the response `y` is modelled by a linear predictor described symbolically by `model`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import $ivy.`com.github.haifengl::smile-scala:3.1.0`\n", "import $ivy.`org.slf4j:slf4j-simple:2.0.13` \n", "\n", "import scala.language.postfixOps\n", "import smile._\n", "import smile.data._\n", "import smile.data.formula._\n", "import smile.data.formula.Terms.$\n", "import smile.data.`type`._\n", "import smile.data.measure._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the simplest case, the right-hand side of `~` can be empty, which means that all variables except the response variable are used as predictors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val f = \"class\" ~" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where `class` is the response variable. When your data is already prepared, such a simple model is usually sufficient. For feature engineering and selection, however, you may want to be more specific about the features. In those cases, a model consists of a series of terms separated by `+` operators. The terms themselves consist of variable and factor names separated by `::` operators. Such a term is interpreted as the interaction of all the variables and factors appearing in it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val f = \"class\" ~ \"salary\" + (\"gender\"::\"race\") + \"age\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to create a formula without a response variable, as in unsupervised learning. In such cases, the formula is used only to generate features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val f = ~ \"salary\" + \"gender\" + \"age\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to `+` and `::`, a number of other operators are useful in model formulae. The `&&` operator denotes factor crossing: `a && b` is interpreted as `a + b + a::b`. The `-` operator removes the specified terms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"salary\" ~ \".\" + (\"a\" && \"b\" && \"c\") - \"d\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This formula includes all the cross terms of `a`, `b`, and `c`, and removes the term `d`. Here, `.` means all other variables in the data that do not otherwise appear in the formula. Most mathematical functions can be applied to terms. For example," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"salary\" ~ \".\" + log(\"age\") + \"gender\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we have defined several abstract formulas. We may bind a formula to a schema, which associates the formula's variables with the schema's columns. The output schema is inferred from the input data types." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val inputSchema = DataTypes.struct(\n", " new StructField(\"water\", DataTypes.ByteType, new NominalScale(\"dry\", \"wet\")),\n", " new StructField(\"sowing_density\", DataTypes.ByteType, new NominalScale(\"low\", \"high\")),\n", " new StructField(\"wind\", DataTypes.ByteType, new NominalScale(\"weak\", \"strong\"))\n", ")\n", "\n", "val formula = ~ \"water\" + \"sowing_density\" + \"wind\" + (\"water\" :: \"sowing_density\" :: \"wind\")\n", "\n", "val outputSchema = formula.bind(inputSchema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's apply a formula to a data frame. In a program or the Scala REPL, we should be able to use the `$` method directly. However, the `$` sign has a special meaning in the notebook, so we use the fully qualified `Terms.$` here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val iris = read.arff(\"../data/weka/iris.arff\")\n", "val formula = log(\"petallength\") ~ sin(exp(\"petalwidth\")) + (Terms.$(\"sepalwidth\") + Terms.$(\"sepallength\")) + \".\" - \"class\"\n", "formula.frame(iris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also train a random forest model with a formula." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smile.classification.randomForest(\"class\" ~ \".\", iris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we apply a formula with factor crossing to the weather data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val weather = read.arff(\"../data/weka/weather.nominal.arff\")\n", "val formula = \"class\" ~ \"outlook\" + (\"temperature\" && \"humidity\" && \"windy\")\n", "formula.frame(weather)" ] } ], "metadata": { "kernelspec": { "display_name": "Scala (2.13)", "language": "scala", "name": "scala213" }, "language_info": { "codemirror_mode": "text/x-scala", "file_extension": ".scala", "mimetype": "text/x-scala", "name": "scala", "nbconvert_exporter": "script", "version": "2.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }