\n",
"_while we haven't used polynomials, there's still a balance for our models between simplicity and feature dependence_\n",
"\n",
"#### Why?\n",
"We should aim to keep our models as simple as possible in order to attribute the most gain. \n",
"Simple models are much easier to understand as well"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How do we reduce the number of features in our data?\n",
"\n",
"There's a number of techniques available in sklearn that automate these processes for us:\n",
"\n",
"sklearn_helper | technique\n",
"---------------|----------\n",
"`VarianceThreshold` | Remove features with low variance, based on a tolerance level\n",
"`SelectKBest` | Select the best group of correlated features using `feature_selection` tools. K (as usual) is something you search for and define.\n",
"`L1 and Trees` | using fit_transform on any supervised learning algorithm that has it can drop features with low coefficients or importances.\n",
"\n",
"While SKlearn also has a `pipeline` module to _further_ automate this process for you, it is more recommended to explore the data first to get a sense of what you are working with. There's no magic button that says \"solve my problem,\" but if you are interested in automating a model fit (say, a nightly procedue on a deployed model with constantly updated data), then it might be something worth exploring. \n",
"\n",
"For each below we'll work through Iris and notice how it picks out the best features for us. We'll use iris because the data is well scaled (which otherwise requires finetuning) and relatively predictive (we know there are features more predictive than others).\n",
"\n",
"For each code sample below:\n",
"\n",
"1. Review what the code is doing. Consider opening up the help function or reading the documentation on sklearn.\n",
"2. find the `.shape` of the new array returned and compare to the original dataset. What columns did it end up keeping, vs removing?\n",
"3. Adjust the parameters. Do results change?\n",
"4. ** \\* **These are all considered data preprocessing steps. In your final project, what and where might you consider adding one of these processes?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"def make_irisdf():\n",
" from sklearn.datasets import load_iris\n",
" from pandas import DataFrame\n",
" iris = load_iris()\n",
" df = DataFrame(iris.data, columns=iris.feature_names)\n",
" df['target'] = iris.target\n",
" return df\n",
"\n",
"iris = make_irisdf()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn import feature_selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `VarianceThreshold`\n",
"\n",
"Goals:\n",
"\n",
"1. What is variance?\n",
"2. How does changing the threshold change the fit_transform?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) 0.685694\n",
"sepal width (cm) 0.188004\n",
"petal length (cm) 3.113179\n",
"petal width (cm) 0.582414\n",
"dtype: float64\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"0 5.1 3.5 1.4 0.2\n",
"1 4.9 3.0 1.4 0.2\n",
"2 4.7 3.2 1.3 0.2\n",
"3 4.6 3.1 1.5 0.2\n",
"4 5.0 3.6 1.4 0.2\n",
"[[ 5.1 1.4]\n",
" [ 4.9 1.4]\n",
" [ 4.7 1.3]\n",
" [ 4.6 1.5]\n",
" [ 5. 1.4]]\n"
]
}
],
"source": [
"print iris.ix[:,:4].apply(lambda x: x.var())\n",
"print iris.ix[:,:4].head()\n",
"print feature_selection.VarianceThreshold(threshold=.6).fit_transform(iris.ix[:,:4])[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `SelectKBest`\n",
"Goals:\n",
"\n",
"1. while f test and chi2 are different tests, are the results the same?\n",
"2. How might you solve for k?"
]
},
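{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how you might compare the two scoring functions (the choice of `k=2` here is just an assumption for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: run SelectKBest twice with different scoring functions\n",
"X, y = iris.iloc[:, :4], iris['target']\n",
"\n",
"f_best = feature_selection.SelectKBest(feature_selection.f_classif, k=2).fit_transform(X, y)\n",
"chi_best = feature_selection.SelectKBest(feature_selection.chi2, k=2).fit_transform(X, y)\n",
"\n",
"# Compare the first rows: did the F-test and chi2 keep the same columns?\n",
"print(f_best[:5])\n",
"print(chi_best[:5])"
]
},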
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_math sidebar:_\n",
"\n",
"$X^2 = \\dfrac{(O-E)^2}{E}$
\n",
"
\n",
"Manhattan is built on a grid system, with the exception of a couple key points:
\n", "If we needed to get from Harold Square to Eataly, what is easier to explain?
\n", "Why is that one easier to explain?
\n", "
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How does it work?\n",
"\n",
"Recall that variance is a 1-dimensional metric describing the average distance from the mean. **Covariance** is a representation of variance with respect to other features.\n",
"\n",
"If variance is a summary of one metric, and a correlation matrix is a square (the relationships of features against each other), what is our expected shape of the covariance matrix?\n",
"\n",
"