{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification\n", "> A summary of the lecture \"Supervised Learning with scikit-learn\", via DataCamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/digits.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# plt.style.use('ggplot')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supervised learning\n", "- What is machine learning?\n", " - The art and science of:\n", " - Giving computers the ability to learn to make decisions from data\n", " - without being explicitly programmed\n", " - Examples:\n", " - Learning to predict whether an email is spam or not\n", " - Clustering Wikipedia entries into different categories\n", " - Supervised learning : Uses labeled data\n", " - Unsupervised learning : Uses unlabeled data \n", " \n", "- Unsupervised learning\n", " - Uncovering hidden patterns from unlabeled data\n", " - Example:\n", " - Grouping customers into distinct categories (Clustering)\n", "\n", "- Reinforcement learning\n", " - Software agents interact with an environment\n", " - Learn how to optimize their behavior\n", " - Given a system of rewards and punishments\n", " - Draws inspiration from behavioral psychology\n", " - Applications\n", " - Economics\n", " - Genetics\n", " - Game playing\n", " \n", "- Supervised learning\n", " - Predictor variables / features and a target variable\n", " - Automate time-consuming or expensive manual tasks\n", " - Doctor's diagnosis\n", " - Make predictions about the future\n", " - Will a customer click on an ad or not?\n", " - Need labeled data\n", " - Historical data with labels\n", " - Experiments to get labeled data\n", " - Crowd-sourcing labeled 
data\n", " \n", "- Naming Conventions\n", " - Features = predictor variables = independent variables\n", " - Target variable = dependent variable = response variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory data analysis\n", "- Iris dataset\n", " - Features\n", " - Petal length\n", " - Petal width\n", " - Sepal length\n", " - Sepal width\n", " - Target variable : Species \n", " - Versicolor\n", " - Virginica\n", " - Setosa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numerical EDA\n", "In this chapter, you'll be working with a dataset obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocessing" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | party | infants | water | budget | physician | salvador | religious | satellite | aid | missile | immigration | synfuels | education | superfund | crime | duty_free_exports | eaa_rsa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | republican | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | democrat | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | democrat | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4 | democrat | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
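
A table of 0/1 votes like the one above could come from a preprocessing step along the following lines. This is a minimal sketch, not the notebook's actual code: it assumes the raw UCI file encodes votes as `y`, `n`, and `?` (missing), it reads a small inline sample (hypothetical rows, three vote columns only) instead of the real file, and it treats a missing vote as `n`, which is only one of several possible choices.

```python
import io
import pandas as pd

# Hypothetical excerpt in the raw layout: party first, then one letter per vote.
raw = io.StringIO(
    "republican,n,y,n\n"
    "democrat,?,y,y\n"
    "democrat,y,?,y\n"
)
columns = ["party", "infants", "water", "budget"]
df = pd.read_csv(raw, header=None, names=columns)

vote_cols = columns[1:]
# Assumption: a missing vote ('?') counts as 'n'; then encode n -> 0 and y -> 1.
encoded = df[vote_cols].replace("?", "n").apply(lambda s: s.map({"n": 0, "y": 1}))
df[vote_cols] = encoded

print(df.head())
```

Filling `?` with `n` keeps every row but biases the columns toward 0; dropping rows or imputing the most frequent value are alternatives with different trade-offs.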

| | infants | water | budget | physician | salvador | religious | satellite | aid | missile | immigration | synfuels | education | superfund | crime | duty_free_exports | eaa_rsa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 | 435.000000 |
| mean | 0.429885 | 0.448276 | 0.581609 | 0.406897 | 0.487356 | 0.625287 | 0.549425 | 0.556322 | 0.475862 | 0.496552 | 0.344828 | 0.393103 | 0.480460 | 0.570115 | 0.400000 | 0.618391 |
| std | 0.495630 | 0.497890 | 0.493863 | 0.491821 | 0.500416 | 0.484606 | 0.498124 | 0.497390 | 0.499992 | 0.500564 | 0.475859 | 0.489002 | 0.500193 | 0.495630 | 0.490462 | 0.486341 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
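
Because every vote column holds only 0 and 1, the `mean` row of a summary like the one above can be read as the share of 1s in that column. The sketch below illustrates this with made-up data; the five rows and the two vote columns are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Made-up 0/1 votes for five hypothetical representatives.
df = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat", "democrat"],
    "infants": [0, 0, 0, 0, 1],
    "water": [1, 1, 1, 1, 1],
})

# describe() summarizes numeric columns only by default, so 'party' is skipped.
summary = df.describe()

# For a binary column, the mean equals the fraction of rows equal to 1.
yes_share = df[["infants", "water"]].mean()

print(summary.loc["mean"])
print(yes_share)

# Counting the target values gives a quick look at class balance.
print(df["party"].value_counts())
```

The same reading explains the quartile rows: for a 0/1 column, the median is 1 only when more than half of the entries are 1.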