[![Open in Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justmarkham/scikit-learn-tips/master?filepath=notebooks%2F34_feature_selection.ipynb)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justmarkham/scikit-learn-tips/blob/master/notebooks/34_feature_selection.ipynb)

# 🤖⚡ scikit-learn tip #34 ([video](https://www.youtube.com/watch?v=BMBVwV8iarc&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=34))

It's simple to add feature selection to a Pipeline:

1. Use SelectPercentile to keep the highest scoring features
2. Add feature selection after preprocessing but before model building

See example 👇

P.S. Make sure to tune the percentile value!

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain')

In [2]:
X = df['Name']
y = df['Survived']

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

### Pipeline without feature selection

In [4]:
vect = CountVectorizer()
clf = LogisticRegression()

In [5]:
pipe = make_pipeline(vect, clf)
cross_val_score(pipe, X, y, scoring='accuracy').mean()

0.7957190383528967

### Pipeline with feature selection

In [6]:
from sklearn.feature_selection import SelectPercentile, chi2

In [7]:
# keep 50% of features with the best chi-squared scores
selection = SelectPercentile(chi2, percentile=50)
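
To see what this selector does on its own, here is a minimal sketch on a tiny made-up corpus (the `names` and `survived` lists are hypothetical, not the Titanic data). It fits the vectorizer, applies `SelectPercentile` with `chi2`, and lists which terms survive:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# hypothetical toy data for illustration only
names = ['Mr. Smith', 'Mrs. Smith', 'Mr. Jones',
         'Miss Jones', 'Mr. Brown', 'Mrs. Brown']
survived = [0, 1, 0, 1, 0, 1]

# build the document-term matrix (one column per token)
vect = CountVectorizer()
dtm = vect.fit_transform(names)

# keep the 50% of columns with the best chi-squared scores
selection = SelectPercentile(chi2, percentile=50)
dtm_selected = selection.fit_transform(dtm, survived)

# map the boolean support mask back to the vocabulary terms
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
kept = [term for term, keep in zip(terms, selection.get_support()) if keep]

print(dtm.shape, dtm_selected.shape)
print(kept)
```

In this toy data the titles vary with the target while the surnames appear equally in both classes, so the titles are the terms you would expect to be kept.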

In [8]:
pipe = make_pipeline(vect, selection, clf)
cross_val_score(pipe, X, y, scoring='accuracy').mean()

0.8147824995292197
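
One way to tune the percentile value is a grid search over the pipeline. The sketch below is an assumption about how you might set that up, not part of the original notebook: it uses a small synthetic set of names (hypothetical data) so that it runs without a download, and the candidate percentiles are arbitrary. The step name `selectpercentile` is the lowercased class name that `make_pipeline` assigns automatically.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# hypothetical synthetic data: every surname/title combination, repeated twice
titles = ['Mr.', 'Mrs.', 'Miss', 'Master']
surnames = ['Smith', 'Jones', 'Brown', 'Taylor', 'Wilson']
X_toy = [f'{surname}, {title}' for surname in surnames for title in titles] * 2
y_toy = [1 if 'Mrs.' in name or 'Miss' in name else 0 for name in X_toy]

# same pipeline shape as above: vectorize, select features, then model
pipe = make_pipeline(CountVectorizer(),
                     SelectPercentile(chi2),
                     LogisticRegression())

# parameter names follow the pattern <step name>__<parameter name>
params = {'selectpercentile__percentile': [25, 50, 75, 100]}

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X_toy, y_toy)

print(grid.best_params_)
print(grid.best_score_)
```

With the real data, you would pass `X` and `y` from above to `grid.fit` and choose candidate percentiles that suit the size of your vocabulary.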

### Want more tips? [View all tips on GitHub](https://github.com/justmarkham/scikit-learn-tips) or [Sign up to receive 2 tips by email every week](https://scikit-learn.tips) 💌

© 2020 [Data School](https://www.dataschool.io). All rights reserved.