[![Open in Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justmarkham/scikit-learn-tips/master?filepath=notebooks%2F14_handle_missing_values.ipynb)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justmarkham/scikit-learn-tips/blob/master/notebooks/14_handle_missing_values.ipynb)

# 🤖⚡ scikit-learn tip #14 ([video](https://www.youtube.com/watch?v=jbc6BPQEM3o&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=14))

Four options for handling missing values (NaNs):

1. Drop rows containing NaNs
2. Drop columns containing NaNs
3. Fill NaNs with imputed values
4. Use a model that natively handles NaNs (NEW!)
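
As a rough illustration of the first three options, here is a minimal sketch on a small made-up DataFrame (the values are hypothetical, not taken from the Titanic data used in this notebook). It assumes pandas' `dropna` for options 1 and 2 and scikit-learn's `SimpleImputer` with a mean strategy for option 3; other imputation strategies would work just as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data, for illustration only
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, np.nan, 53.10],
    'Pclass': [3, 1, 3, 1],
})

# option 1: drop rows containing NaNs (keeps only the complete rows)
rows_dropped = df.dropna(axis='index')

# option 2: drop columns containing NaNs (keeps only the complete columns)
cols_dropped = df.dropna(axis='columns')

# option 3: fill NaNs with imputed values (here, the mean of each column)
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(rows_dropped.shape)
print(list(cols_dropped.columns))
print(imputed.isna().sum().sum())
```

Options 1 and 2 discard data, which can be costly when many rows or columns are affected, while option 3 keeps every row and column at the price of introducing estimated values.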

See example 👇

In [1]:
import pandas as pd
train = pd.read_csv('http://bit.ly/kaggletrain')
test = pd.read_csv('http://bit.ly/kaggletest', nrows=175)

In [2]:
train = train[['Survived', 'Age', 'Fare', 'Pclass']]
test = test[['Age', 'Fare', 'Pclass']]

In [3]:
# count the number of NaNs in each column
train.isna().sum()

Survived      0
Age         177
Fare          0
Pclass        0
dtype: int64

In [4]:
test.isna().sum()

Age       36
Fare       1
Pclass     0
dtype: int64

In [5]:
label = train.pop('Survived')

In [6]:
# new in 0.22: this estimator (experimental) has native support for NaNs
# (since scikit-learn 1.0 it is no longer experimental, so the enable_ import below is not needed)
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

In [7]:
clf = HistGradientBoostingClassifier()

In [8]:
# no errors, despite NaNs in train and test!
clf.fit(train, label)
clf.predict(test)

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
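
For contrast, most other estimators do not accept NaNs and need option 3 (imputation) first. The sketch below uses synthetic data rather than the Titanic files, and the choice of `LogisticRegression` with a median `SimpleImputer` in a pipeline is only an assumption for illustration, not part of the original tip.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# blank out roughly 10% of the feature values
X[rng.random(X.shape) < 0.1] = np.nan

# a model without native NaN support raises a ValueError
try:
    LogisticRegression().fit(X, y)
    raised = False
except ValueError:
    raised = True

# imputing inside a pipeline lets the same model fit and predict
pipe = make_pipeline(SimpleImputer(strategy='median'), LogisticRegression())
pipe.fit(X, y)
preds = pipe.predict(X)

print(raised)
print(preds.shape)
```

Putting the imputer inside the pipeline means the fill values are learned from the training data and then reused at prediction time.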

### Want more tips? [View all tips on GitHub](https://github.com/justmarkham/scikit-learn-tips) or [Sign up to receive 2 tips by email every week](https://scikit-learn.tips) 💌

© 2020 [Data School](https://www.dataschool.io). All rights reserved.