{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formula\n",
    "\n",
    "The models fit by functions such as `ols` and `lasso` are specified in a compact symbolic form. The tilde operator `~` is the basis of such model specifications. An expression of the form `y ~ model` specifies that the response `y` is modelled by a linear predictor described symbolically by `model`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import $ivy.`com.github.haifengl::smile-scala:3.1.0`\n",
    "import $ivy.`org.slf4j:slf4j-simple:2.0.13`  \n",
    "\n",
    "import scala.language.postfixOps\n",
    "import smile._\n",
    "import smile.data._\n",
    "import smile.data.formula._\n",
    "import smile.data.formula.Terms.$\n",
    "import smile.data.`type`._\n",
    "import smile.data.measure._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the simplest case, the right-hand side of `~` can be empty. All variables except the response variable are then used as predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val f = \"class\" ~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, `class` is the response variable. When your data is already prepared, such a simple model is usually sufficient. For feature engineering and selection, however, you may want to be more specific about the features. In those cases, a model consists of a series of terms separated by `+` operators. The terms themselves consist of variable and factor names separated by `::` operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val f = \"class\" ~ \"salary\" + (\"gender\"::\"race\") + \"age\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also possible to create a formula without a response variable, for example in unsupervised learning. In such cases, the formula is used only to generate features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val f = ~ \"salary\" + \"gender\" + \"age\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to `+` and `::`, a number of other operators are useful in model formulae. The `&&` operator denotes factor crossing: `a && b` is interpreted as `a + b + a::b`. The `-` operator removes the specified terms."
   ]
  },
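  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the crossing rule concrete, here is a minimal sketch in plain Scala that lists the terms implied by crossing a set of factors. It does not call Smile at all: the helper `crossing` is hypothetical and exists only for illustration, and it assumes that crossing more than two factors generalizes the two-factor rule above to all main effects plus all interactions among them.\n",
    "\n",
    "```scala\n",
    "// Hypothetical helper, not part of Smile: enumerate the terms implied by\n",
    "// crossing the given factors, i.e. every non-empty combination of them.\n",
    "def crossing(factors: Seq[String]): Seq[String] =\n",
    "  (1 to factors.length).flatMap { order =>\n",
    "    // All interactions of the given order, written with the :: notation.\n",
    "    factors.combinations(order).map(_.mkString(\"::\"))\n",
    "  }\n",
    "\n",
    "val terms = crossing(Seq(\"a\", \"b\", \"c\"))\n",
    "println(terms.mkString(\" + \"))\n",
    "// prints: a + b + c + a::b + a::c + b::c + a::b::c\n",
    "```\n",
    "\n",
    "With two factors the helper yields `a + b + a::b`, which matches the definition of `a && b` given above."
   ]
  },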
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"salary\" ~ \".\" + (\"a\" && \"b\" && \"c\") - \"d\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This formula includes all the cross terms of `a`, `b`, and `c`, and removes the term `d`. Here, `.` stands for all variables in the data that do not otherwise appear in the formula. Most mathematical functions can be applied to terms. For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"salary\" ~ \".\" + log(\"age\") + \"gender\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we have defined several abstract formulas. We may bind a formula to a schema, which associates the formula's variables with the schema's columns. The output schema is inferred from the input data types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val inputSchema = DataTypes.struct(\n",
    "  new StructField(\"water\", DataTypes.ByteType, new NominalScale(\"dry\", \"wet\")),\n",
    "  new StructField(\"sowing_density\", DataTypes.ByteType, new NominalScale(\"low\", \"high\")),\n",
    "  new StructField(\"wind\", DataTypes.ByteType, new NominalScale(\"weak\", \"strong\"))\n",
    ")\n",
    "\n",
    "val formula = ~ \"water\" + \"sowing_density\" + \"wind\" + (\"water\" :: \"sowing_density\" :: \"wind\")\n",
    "\n",
    "val outputSchema = formula.bind(inputSchema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's apply a formula to a data frame. In a program or the Scala REPL, we should be able to use the `$` method directly. In the notebook, however, the `$` sign has a special meaning, so we use the qualified name `Terms.$` here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val iris = read.arff(\"../data/weka/iris.arff\")\n",
    "val formula = log(\"petallength\") ~ sin(exp(\"petalwidth\")) + (Terms.$(\"sepalwidth\") + Terms.$(\"sepallength\")) + \".\" - \"class\"\n",
    "formula.frame(iris)"
   ]
  },
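  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides building a whole data frame, a formula can also be used to pull out the predictors and the response separately. The sketch below assumes that `Formula` in this Smile version exposes `x`, `y`, and `matrix` methods that take a data frame. Treat those three calls as assumptions and check the API documentation of your Smile version before relying on them. The name `classFormula` is chosen here only for illustration.\n",
    "\n",
    "```scala\n",
    "import smile._\n",
    "import smile.data.formula._\n",
    "\n",
    "// Same data set as in the cell above.\n",
    "val iris = read.arff(\"../data/weka/iris.arff\")\n",
    "\n",
    "// Response `class`, all remaining columns as predictors.\n",
    "val classFormula = \"class\" ~ \".\"\n",
    "\n",
    "// Predictors only (assumed API: Formula.x).\n",
    "val x = classFormula.x(iris)\n",
    "\n",
    "// Response only (assumed API: Formula.y).\n",
    "val y = classFormula.y(iris)\n",
    "\n",
    "// Numeric design matrix built from the predictors (assumed API: Formula.matrix).\n",
    "val design = classFormula.matrix(iris)\n",
    "```\n",
    "\n",
    "Splitting predictors from the response this way is convenient when an algorithm expects arrays or matrices rather than a formula and a data frame."
   ]
  },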
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also train a random forest model with a formula."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "smile.classification.randomForest(\"class\" ~ \".\", iris)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, we apply a formula with factor crossing to the weather data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val weather = read.arff(\"../data/weka/weather.nominal.arff\")\n",
    "val formula = \"class\" ~ \"outlook\" + (\"temperature\" && \"humidity\" && \"windy\")\n",
    "formula.frame(weather)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Scala (2.13)",
   "language": "scala",
   "name": "scala213"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "mimetype": "text/x-scala",
   "name": "scala",
   "nbconvert_exporter": "script",
   "version": "2.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}