{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "848fcc94-3480-439c-b565-b8dc6072268a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys \n", "!{sys.executable} -m pip install --quiet pandas numpy matplotlib jupyterlab_myst ipython imblearn\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "b0926c24", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "\n", "# Introduction to classification\n", "\n", "In these four sections, you will explore a fundamental focus of classic machine learning: _classification_. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!" ] }, { "cell_type": "markdown", "id": "4cc6fb13", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/pinch.png\n", "---\n", "name: 'Celebrate pan-Asian cuisines in these lessons!'\n", "width: 90%\n", "---\n", "Image by [Jen Looper](https://twitter.com/jenlooper)\n", ":::" ] }, { "cell_type": "markdown", "id": "21bdf1d7", "metadata": {}, "source": [ "Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. 
If machine learning is all about using datasets to predict values or assign names to things, then classification generally falls into two groups: _binary classification_ and _multiclass classification_." ] }, { "cell_type": "code", "execution_count": 2, "id": "4b39b77c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "hide-input", "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", "A demo of Neural Network Playground. [source]\n", "
\n" ], "text/plain": [ "\n", "\n", "A demo of Neural Network Playground. [source]\n", "
\n", "\"\"\"\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "44c95f32", "metadata": {}, "source": [ "## Introduction\n", "\n", "Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value (\"is this email spam or not?\") to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.\n", "\n", "To state the process in a more scientific way, your classification method creates a predictive model that enables you to map input variables to output variables.\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/binary-multiclass.png\n", "---\n", "name: 'binary vs. multiclass classification'\n", "width: 90%\n", "---\n", "Infographic by [Jen Looper](https://twitter.com/jenlooper)\n", ":::\n" ] }, { "cell_type": "markdown", "id": "752b6fbd", "metadata": {}, "source": [ "Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.\n", "\n", "Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features such as `smoker`, `weight`, and `age` to determine the _likelihood of developing X disease_. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled, and the ML algorithms use those labels to learn how to predict the class of each record in a dataset and assign it to a group or outcome.\n", "\n", ":::{note}\n", "Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? 
What if you wanted to see if, given a present of a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?\n", ":::" ] }, { "cell_type": "code", "execution_count": 4, "id": "f7f31899", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "hide-input", "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "\n", "| \n", " | Unnamed: 0 | \n", "cuisine | \n", "almond | \n", "angelica | \n", "anise | \n", "anise_seed | \n", "apple | \n", "apple_brandy | \n", "apricot | \n", "armagnac | \n", "... | \n", "whiskey | \n", "white_bread | \n", "white_wine | \n", "whole_grain_wheat_flour | \n", "wine | \n", "wood | \n", "yam | \n", "yeast | \n", "yogurt | \n", "zucchini | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "65 | \n", "indian | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 1 | \n", "66 | \n", "indian | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 2 | \n", "67 | \n", "indian | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 3 | \n", "68 | \n", "indian | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 4 | \n", "69 | \n", "indian | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "
5 rows × 385 columns
\n", "| \n", " | almond | \n", "angelica | \n", "anise | \n", "anise_seed | \n", "apple | \n", "apple_brandy | \n", "apricot | \n", "armagnac | \n", "artemisia | \n", "artichoke | \n", "... | \n", "whiskey | \n", "white_bread | \n", "white_wine | \n", "whole_grain_wheat_flour | \n", "wine | \n", "wood | \n", "yam | \n", "yeast | \n", "yogurt | \n", "zucchini | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 1 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 2 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 3 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 4 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "
5 rows × 380 columns
\n", "