{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SciPy 2018 Scikit-learn Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Machine Learning in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Machine Learning?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine learning is the process of extracting knowledge from data automatically, usually with the goal of making predictions on new, unseen data. A classical example is a spam filter, for which the user keeps labeling incoming emails as either spam or not spam. A machine learning algorithm then \"learns\" a predictive model from this data that distinguishes spam from normal emails and can predict whether new emails are spam or not. \n", "\n", "Central to machine learning is the concept of **automating decision making** from data **without the user specifying explicit rules** for how this decision should be made.\n", "\n", "In the case of emails, the user doesn't provide a list of words or characteristics that make an email spam. Instead, the user provides examples of spam and non-spam emails that are labeled as such.\n", "\n", "The second central concept is **generalization**. The goal of a machine learning model is to predict on new, previously unseen data. In a real-world application, we are not interested in marking an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is usually presented to the algorithm as a two-dimensional array (or matrix) of numbers. 
Each data point (also known as a *sample* or *training instance*) that we want to either learn from or make a decision on is represented as a list of numbers, a so-called *feature vector*, whose features represent the properties of that point. \n", "\n", "Later, we will work with a popular dataset called *Iris*, among many other datasets. Iris, a classic benchmark dataset in the field of machine learning, contains the measurements of 150 iris flowers from 3 different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica. \n", "\n", "\n", "\n", "
Species | Image\n", "---|---\n", "Iris Setosa | (image not shown)\n", "Iris Versicolor | (image not shown)\n", "Iris Virginica | (image not shown)