# Formula

Models fitted by functions such as `ols` and `lasso` are specified in a compact symbolic form. The tilde operator `~` is the basic building block of such models: an expression of the form `y ~ model` specifies that the response `y` is modelled by a linear predictor described symbolically by `model`.

In [None]:
import $ivy.`com.github.haifengl::smile-scala:3.1.0`
import $ivy.`org.slf4j:slf4j-simple:2.0.13` 

import scala.language.postfixOps
import smile._
import smile.data._
import smile.data.formula._
import smile.data.formula.Terms.$
import smile.data.`type`._
import smile.data.measure._

In the simplest case, the right-hand side of `~` can be empty. All the variables except the response variable are then used as predictors.

In [None]:
val f = "class" ~

where `class` is the response variable. When your data is already prepared, such a simple model is usually sufficient. For feature engineering and selection, however, you may want to be more specific about the features. In those cases, a model consists of a series of terms separated by `+` operators. The terms themselves consist of variable and factor names separated by `::` operators. Such a term is interpreted as the interaction of all the variables and factors appearing in it.

In [None]:
val f = "class" ~ "salary" + ("gender"::"race") + "age"

It is also possible to create a formula without a response variable (for example, in unsupervised learning). In such cases, the formula is used only to generate features.

In [None]:
val f = ~ "salary" + "gender" + "age"

In addition to `+` and `::`, a number of other operators are useful in model formulae. The `&&` operator denotes factor crossing: `a && b` is interpreted as `a + b + a::b`. The `-` operator removes the specified terms.
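The crossing rule can be pictured as enumerating every non-empty combination of the factors. The following is a minimal sketch in plain Scala, independent of Smile; the helper name `crossTerms` is hypothetical, and extending the two-factor rule to three or more factors as a full crossing is an assumption made for illustration.

```scala
// Illustrative only: plain Scala, not Smile's implementation.
// Expands factor names into main effects plus all interactions,
// generalizing the rule a && b = a + b + a::b (assumed full crossing).
def crossTerms(factors: Seq[String]): Seq[String] =
  (1 to factors.length).flatMap { k =>
    // all k-way combinations, each joined into one interaction term
    factors.combinations(k).map(_.mkString("::"))
  }

val terms = crossTerms(Seq("a", "b", "c"))
println(terms.mkString(" + "))
// prints: a + b + c + a::b + a::c + b::c + a::b::c
```

With two factors the same helper reproduces the rule stated above: `crossTerms(Seq("a", "b"))` yields `a`, `b`, and `a::b`.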

In [None]:
"salary" ~ "." + ("a" && "b" && "c") - "d"

This formula includes all the cross terms of `a`, `b`, and `c`, and removes the term `d`. Here, `.` stands for all other variables in the data that do not otherwise appear in the formula. Most mathematical functions can also be applied to terms. For example,

In [None]:
"salary" ~ "." + log("age") + "gender"
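The meaning of `.` in the earlier formula `"salary" ~ "." + ("a" && "b" && "c") - "d"` can be thought of as a filter over column names. The sketch below is plain Scala and does not call Smile; the helper `expandDot` and the column names are hypothetical, chosen only to illustrate the rule that `.` picks up variables not otherwise named in the formula.

```scala
// Illustrative only: plain Scala, not Smile's implementation.
// Resolves "." to every column that is neither the response
// nor explicitly mentioned elsewhere in the formula.
def expandDot(columns: Seq[String], response: String, mentioned: Set[String]): Seq[String] =
  columns.filter(c => c != response && !mentioned.contains(c))

// Hypothetical columns of a data set whose response is "salary".
val columns = Seq("salary", "a", "b", "c", "d", "age", "gender")

// "a", "b", "c" appear in the crossing and "d" in the removal.
val rest = expandDot(columns, "salary", Set("a", "b", "c", "d"))
println(rest.mkString(", "))
// prints: age, gender
```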

So far, we have defined several abstract formulas. We may bind a formula to a schema, which associates the formula's variables with the schema's columns. The output schema is inferred from the input data types.

In [None]:
val inputSchema = DataTypes.struct(
 new StructField("water", DataTypes.ByteType, new NominalScale("dry", "wet")),
 new StructField("sowing_density", DataTypes.ByteType, new NominalScale("low", "high")),
 new StructField("wind", DataTypes.ByteType, new NominalScale("weak", "strong"))
)

val formula = ~ "water" + "sowing_density" + "wind" + ("water" :: "sowing_density" :: "wind")

val outputSchema = formula.bind(inputSchema)
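To get a feel for what the three-way interaction term contributes, it may help to enumerate the level combinations of the three nominal factors. This is a sketch in plain Scala under a stated assumption: that an interaction of nominal factors behaves like a single nominal variable with one level per combination of input levels. The helper `interactionLevels`, the `:` separator, and the ordering are hypothetical and may not match what `bind` actually reports.

```scala
// Illustrative only: plain Scala, not Smile's implementation.
// ASSUMPTION: the interaction of nominal factors has one level per
// combination of the input levels. Separator and order are hypothetical.
def interactionLevels(factors: Seq[Seq[String]]): Seq[String] =
  factors
    .foldLeft(Seq(Seq.empty[String])) { (acc, levels) =>
      // extend every partial combination with each level of the next factor
      for (prefix <- acc; level <- levels) yield prefix :+ level
    }
    .map(_.mkString(":"))

val combined = interactionLevels(Seq(
  Seq("dry", "wet"),     // water
  Seq("low", "high"),    // sowing_density
  Seq("weak", "strong")  // wind
))
combined.foreach(println)
```

Three binary factors give 2 × 2 × 2 = 8 combinations, from `dry:low:weak` to `wet:high:strong`.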

Now let's apply a formula to a data frame. In a program or the Scala REPL, we can use the `$` method directly. However, the `$` sign has a special meaning in the notebook, so we use the fully qualified `Terms.$` here.

In [None]:
val iris = read.arff("../data/weka/iris.arff")
val formula = log("petallength") ~ sin(exp("petalwidth")) + (Terms.$("sepalwidth") + Terms.$("sepallength")) + "." - "class"
formula.frame(iris)

We can also train a random forest model with a formula.

In [None]:
smile.classification.randomForest("class" ~ ".", iris)

Lastly, we apply a formula with factor crossing to the weather data.

In [None]:
val weather = read.arff("../data/weka/weather.nominal.arff")
val formula = "class" ~ "outlook" + ("temperature" && "humidity" && "windy")
formula.frame(weather)