[![Open in Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justmarkham/scikit-learn-tips/master?filepath=notebooks%2F50_simple_ml_pattern.ipynb)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justmarkham/scikit-learn-tips/blob/master/notebooks/50_simple_ml_pattern.ipynb)

# 🤖⚡ scikit-learn tip #50 ([video](https://www.youtube.com/watch?v=gd-TZut-oto&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=50))

Here's a simple pattern that can be adapted to solve many ML problems. It has plenty of shortcomings, but can work surprisingly well as-is!

Check it out 👇

Shortcomings include:

- Assumes all columns have proper data types
- May include irrelevant or improper features
- Does not handle text or date columns well
- Does not include feature engineering
- Ordinal encoding may be better
- Other imputation strategies may be better
- Numeric features may not need scaling
- A different model may be better
- And so on...
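
To make the first shortcoming concrete, here is a minimal sketch of what happens when a numeric column has the wrong data type. The `toy` DataFrame is hypothetical (it is not the Titanic data), and its `Fare` column is deliberately stored as strings. Because columns are selected by dtype, that column is routed to the categorical branch until it is converted.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy data: 'Fare' holds numbers stored as strings
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Fare': ['7.25', '71.28', '7.92'],
    'Sex': ['male', 'female', 'female'],
})

# Same selectors as in the pattern: split columns by data type
num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')

numeric_before = num_cols(toy)      # ['Age']
categorical_before = cat_cols(toy)  # ['Fare', 'Sex']

# Converting the column to a numeric dtype fixes the selection
toy['Fare'] = pd.to_numeric(toy['Fare'])
numeric_after = num_cols(toy)       # ['Age', 'Fare']
```

Checking `X.dtypes` before building the pipeline is a cheap way to catch this kind of problem.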

In [1]:
import pandas as pd

In [2]:
cols = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

In [3]:
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

In [4]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]

In [5]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [6]:
# set up preprocessing for numeric columns
imp_median = SimpleImputer(strategy='median', add_indicator=True)
scaler = StandardScaler()
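
As a rough illustration of what `add_indicator=True` does, the sketch below imputes a single hypothetical column (the `ages` array is made up for this example). The imputer fills the missing value with the median and appends an extra column that flags which rows were originally missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing value
ages = np.array([[20.0], [np.nan], [40.0], [60.0]])

imp_median = SimpleImputer(strategy='median', add_indicator=True)
imputed = imp_median.fit_transform(ages)

# Column 0: original values, with NaN replaced by the median (40.0)
# Column 1: missing indicator (1.0 where the value was imputed)
print(imputed)
```

The indicator column lets the model learn from the fact that a value was missing, which can carry signal of its own.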

In [7]:
# set up preprocessing for categorical columns
imp_constant = SimpleImputer(strategy='constant')
ohe = OneHotEncoder(handle_unknown='ignore')
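
The sketch below shows how these two steps behave together, using a hypothetical one-column array rather than the Titanic data. With `strategy='constant'` and no `fill_value`, string data is filled with `'missing_value'`, which then becomes its own category. With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# Hypothetical categorical column with one missing value
train = np.array([['S'], ['C'], [np.nan], ['S']], dtype=object)
# New data containing a category ('Q') that was never seen in training
new = np.array([['Q']], dtype=object)

cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder(handle_unknown='ignore'))
cat_pipe.fit(train)

# Categories learned by the encoder, including the imputed placeholder
categories = cat_pipe[-1].categories_[0].tolist()

# The unseen category is encoded as a row of zeros
encoded_new = cat_pipe.transform(new).toarray()
```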

In [8]:
# select columns by data type
num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')

In [9]:
# do all preprocessing
preprocessor = make_column_transformer(
    (make_pipeline(imp_median, scaler), num_cols),
    (make_pipeline(imp_constant, ohe), cat_cols))

In [10]:
# create a pipeline
pipe = make_pipeline(preprocessor, LogisticRegression())

In [11]:
# cross-validate the pipeline (defaults: 5-fold cross-validation, scored by accuracy)
cross_val_score(pipe, X, y).mean()

0.8035904839620865
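
The score above relies on the defaults of `cross_val_score`. As a sketch, assuming a classifier and integer `cv`, those defaults can be spelled out explicitly: 5 stratified folds without shuffling, scored by accuracy. The synthetic data from `make_classification` is a stand-in so the example runs without downloading anything.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the Titanic dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression()

# Default call, as used in the pattern
default_scores = cross_val_score(model, X_toy, y_toy)

# Equivalent call with the defaults written out
explicit_scores = cross_val_score(
    model, X_toy, y_toy,
    cv=StratifiedKFold(n_splits=5),
    scoring='accuracy')
```

Writing the arguments out makes it easier to swap in a different metric or splitting strategy later.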

In [12]:
# fit the pipeline and make predictions
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

### Want more tips? [View all tips on GitHub](https://github.com/justmarkham/scikit-learn-tips) or [Sign up to receive 2 tips by email every week](https://scikit-learn.tips) 💌

© 2020 [Data School](https://www.dataschool.io). All rights reserved.