{"nbformat_minor": 0, "cells": [{"source": "## Feature Selection with scikit-learn (sklearn)\nJaganadh Gopinadhan\nhttp://jaganadhg.in ", "cell_type": "markdown", "metadata": {"raw_mimetype": "text/latex"}}, {"source": "Feature selection is one of the essential steps in Data Science/Machine Learning and Data Mining exercises. Effective use of feature selection techniques helps a Data Scientist build a better model. This note is intended to give a brief overview of feature selection with scikit-learn (sklearn). The goal of a feature selection exercise is to find the most important and descriptive features in a given data set.\n\n## Note\nThe code is meant only for getting familiar with the utilities.", "cell_type": "markdown", "metadata": {}}, {"source": "#### Find K-Best features for classification and regression\nThe first method we are going to explore is selecting the K best features using the SelectKBest utility in sklearn. We will use the famous two-class IRIS data set.\n\nThe first example we are going to look at is feature selection for classification.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 52, "cell_type": "code", "source": "import pandas as pd\nfrom sklearn.feature_selection import SelectKBest, f_classif\n\ndef select_kbest_clf(data_frame, target, k=2):\n \"\"\"\n Select the K best features for classification\n :param data_frame: A pandas DataFrame with the training data\n :param target: target variable name in DataFrame\n :param k: desired number of features from the data\n :returns feature_scores: scores for each feature in the data as \n pandas DataFrame\n \"\"\"\n feat_selector = SelectKBest(f_classif, k=k)\n _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])\n \n feat_scores = pd.DataFrame()\n feat_scores[\"F Score\"] = feat_selector.scores_\n feat_scores[\"P Value\"] = feat_selector.pvalues_\n feat_scores[\"Support\"] = feat_selector.get_support()\n feat_scores[\"Attribute\"] = data_frame.drop(target, 
axis=1).columns\n \n return feat_scores\n\niris_data = pd.read_csv(\"/resources/iris.csv\")\n\nkbest_feat = select_kbest_clf(iris_data, \"Class\", k=2)\n# Rank the features from highest to lowest F score\nkbest_feat = kbest_feat.sort_values([\"F Score\", \"P Value\"], ascending=[False, False])\nkbest_feat\n", "outputs": [{"execution_count": 52, "output_type": "execute_result", "data": {"text/plain": " F Score P Value Support Attribute\n2 2498.618817 1.504801e-71 True petal-length\n3 1830.624469 3.230375e-65 True petal-width\n0 236.735022 6.892546e-28 False sepal-length\n1 41.607003 4.246355e-09 False sepal-width\n\n[4 rows x 4 columns]", "text/html": "
|   | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 2 | 2498.618817 | 1.504801e-71 | True | petal-length |
| 3 | 1830.624469 | 3.230375e-65 | True | petal-width |
| 0 | 236.735022 | 6.892546e-28 | False | sepal-length |
| 1 | 41.607003 | 4.246355e-09 | False | sepal-width |

4 rows × 4 columns
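
The F Score column above comes from `f_classif`, which computes a one-way ANOVA F statistic for each feature. The following is a minimal pure-Python sketch of that statistic for a single feature, not sklearn's implementation. The helper name `anova_f_score` and the toy values are assumptions made up for illustration; they are not taken from the IRIS file.

```python
def anova_f_score(groups):
    """One-way ANOVA F statistic for one feature.

    `groups` holds one list of feature values per class
    (hypothetical helper, for illustration only).
    """
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total

    # Variation of the class means around the overall mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of the values around their own class mean
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups for v in g
    )

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Toy data: the first feature separates the two classes well, the second does not
well_separated = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
poorly_separated = [[1.0, 5.0, 3.0], [2.0, 4.0, 6.0]]

print(anova_f_score(well_separated))    # 13.5
print(anova_f_score(poorly_separated))  # 0.375
```

A feature whose class means are far apart relative to the spread inside each class gets a large F score, which is why the petal measurements rank above the sepal measurements in the table.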

|   | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.846740 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.900260e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.591480 | 7.065042e-24 | False | 4 |
| 0 | 88.151242 | 2.083550e-19 | False | 0 |
| 8 | 85.914278 | 5.465933e-19 | False | 8 |
| 6 | 83.477459 | 1.569982e-18 | False | 6 |
| 1 | 75.257642 | 5.713584e-17 | False | 1 |
| 11 | 63.054229 | 1.318113e-14 | False | 11 |
| 7 | 33.579570 | 1.206612e-08 | False | 7 |
| 3 | 15.971512 | 7.390623e-05 | False | 3 |

13 rows × 4 columns
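
The Support column in the table above is consistent with keeping the five highest F scores. The sketch below shows that top-k rule in plain Python, using the scores from the table. The helper name `top_k_support` is hypothetical, and the sketch assumes the table came from a top-k selection; it does not reproduce how sklearn handles tied scores.

```python
# F scores copied from the table above, keyed by attribute index
f_scores = {
    12: 601.617871, 5: 471.846740, 10: 175.105543, 2: 153.954883,
    9: 141.761357, 4: 112.591480, 0: 88.151242, 8: 85.914278,
    6: 83.477459, 1: 75.257642, 11: 63.054229, 7: 33.579570,
    3: 15.971512,
}


def top_k_support(scores, k):
    """Return {attribute: True/False}, True for the k highest scores.

    Hypothetical helper for illustration; ties are not treated specially.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = set(ranked[:k])
    return {name: name in keep for name in scores}


support = top_k_support(f_scores, k=5)
selected = sorted(name for name, kept in support.items() if kept)
print(selected)  # [2, 5, 9, 10, 12]
```

Raising `k` simply extends the selection down the ranked list, so `k=6` would additionally mark attribute 4 as supported.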

|   | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.846740 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.900260e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.591480 | 7.065042e-24 | True | 4 |
| 0 | 88.151242 | 2.083550e-19 | False | 0 |
| 8 | 85.914278 | 5.465933e-19 | False | 8 |
| 6 | 83.477459 | 1.569982e-18 | False | 6 |
| 1 | 75.257642 | 5.713584e-17 | False | 1 |
| 11 | 63.054229 | 1.318113e-14 | False | 11 |
| 7 | 33.579570 | 1.206612e-08 | False | 7 |
| 3 | 15.971512 | 7.390623e-05 | False | 3 |

13 rows × 4 columns

|   | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.846740 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.900260e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.591480 | 7.065042e-24 | True | 4 |
| 0 | 88.151242 | 2.083550e-19 | True | 0 |
| 8 | 85.914278 | 5.465933e-19 | True | 8 |
| 6 | 83.477459 | 1.569982e-18 | True | 6 |
| 1 | 75.257642 | 5.713584e-17 | True | 1 |
| 11 | 63.054229 | 1.318113e-14 | True | 11 |
| 7 | 33.579570 | 1.206612e-08 | True | 7 |
| 3 | 15.971512 | 7.390623e-05 | False | 3 |

13 rows × 4 columns

|   | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.846740 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.900260e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.591480 | 7.065042e-24 | True | 4 |
| 0 | 88.151242 | 2.083550e-19 | True | 0 |
| 8 | 85.914278 | 5.465933e-19 | True | 8 |
| 6 | 83.477459 | 1.569982e-18 | True | 6 |
| 1 | 75.257642 | 5.713584e-17 | True | 1 |
| 11 | 63.054229 | 1.318113e-14 | True | 11 |
| 7 | 33.579570 | 1.206612e-08 | True | 7 |
| 3 | 15.971512 | 7.390623e-05 | True | 3 |

13 rows × 4 columns

|   | Attribute Name | Ranking | Support |
|---|---|---|---|
| 9 | 9 | 8 | False |
| 8 | 8 | 7 | False |
| 11 | 11 | 6 | False |
| 6 | 6 | 5 | False |
| 1 | 1 | 4 | False |
| 2 | 2 | 3 | False |
| 0 | 0 | 2 | False |
| 12 | 12 | 1 | True |
| 10 | 10 | 1 | True |
| 7 | 7 | 1 | True |
| 5 | 5 | 1 | True |
| 4 | 4 | 1 | True |
| 3 | 3 | 1 | True |

13 rows × 3 columns

|   | Attribute Name | Ranking | Support |
|---|---|---|---|
| 9 | 9 | 4 | False |
| 8 | 8 | 3 | False |
| 11 | 11 | 2 | False |
| 12 | 12 | 1 | True |
| 10 | 10 | 1 | True |
| 7 | 7 | 1 | True |
| 6 | 6 | 1 | True |
| 5 | 5 | 1 | True |
| 4 | 4 | 1 | True |
| 3 | 3 | 1 | True |
| 2 | 2 | 1 | True |
| 1 | 1 | 1 | True |
| 0 | 0 | 1 | True |

13 rows × 3 columns

|   | 0 | 1 | 2 | target |
|---|---|---|---|---|
| 0 | -2.237799 | -0.296785 | 5.6 | 1 |
| 1 | 2.346082 | -0.109259 | 1.6 | 0 |
| 2 | -2.877376 | 0.472073 | 6.0 | 1 |
| 3 | 2.374624 | 0.205149 | 1.4 | 0 |
| 4 | -2.583883 | 0.029046 | 5.6 | 1 |

5 rows × 4 columns