{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.core.display import HTML\n", "css_file = 'stylesheets/custom.css'\n", "HTML(open(css_file, \"r\").read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", "

Optimization and Machine

\n", "

Learning with Python

\n", "\n", "




" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "This tutorial is prepared by ACM/SIAM Student Chapter of King Abdullah University of Science and Technology (KAUST).\n", "

\n", "\n", "This part of the advanced tutorial focuses more on some specific packages.\n", "
\n", "The topics covered in this part are:\n", "\n", "
\n", "\n", "Presented on 2nd March, 2017 by Yiannis Hadjimichael and Uchenna Akujuobi.\n", "

\n", "\n", "**Prerequisites:** *Basic and Intermediate tutorials or some experience with Python, Numpy, Scipy, Matplotlib (optional). Basic knowledge in machine learning and optimization*.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Optimization in Python with Scipy and APMonitor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part, we'll learn how to solve optimization problems using the **Scipy** and **Advanced Process Monitor (APMonitor)** package for Python.\n", "\n", "**APMonitor Python** is designed for large-scale optimization and accesses solvers of constrained, unconstrained, continuous, and discrete problems. It can solve a variety of problems in linear programming, quadratic programming, integer programming, nonlinear optimization, systems of dynamic nonlinear equations, and multiobjective optimization.\n", "It is free for academic and commercial use.\n", "\n", "The current APMonitor Python version with examples can be found at\n", "http://apmonitor.com/wiki/index.php/Main/PythonApp.\n", "\n", "For this tutorial we have already included the source python file (apm.py) in the git reposidory, as well as the .amp files\n", "for needed in the various examples.\n", "\n", "APMonitor works both with Python and Matlab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization formulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The general formulation of an optimization problem is to \n", "An optimization problem can be expressed in the following way:\n", "\n", "- Given: a function $f : A \\to R$ from some set $A$ to the set of real numbers\n", "- Want to find a solution $x_0 \\in A$ such that $f(x_0)$ is the minimum (or maximum) of $f(x)$ for all $x \\in A$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example the standard form of a continuous optimization problem is:\n", "\n", "$\n", " \\text{max (or min) } \\mathbf{f(x)}\n", "$\n", "\n", "$\n", " \\text{subject to } \\\\\n", "$\n", "\n", "$ \n", " \\mathbf{lb} \\le \\mathbf{x} \\le \\mathbf{ub} \\\\\n", " \\mathbf{A x} \\le \\mathbf{b} \\\\\n", " \\mathbf{A}_\\mathbf{eq} \\mathbf{x} = \\mathbf{b}_\\mathbf{eq} \\\\\n", " \\mathbf{c_i(x) \\ge 0} \\quad \\forall i=0,...,I \\\\\n", " \\mathbf{h_j(x) = 0} \\quad \\forall i=0,...,J \\\\\n", " \\mathbf{x \\in R^n}\n", "$\n", "\n", "where $\\mathbf{f, c_i, h_j} :R^n \\to R$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the type of the objective function $\\mathbf{f(x)}$ and constraints the optimization problem is either linear, quadratic, nonlinear, unconstrained, constrained etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How the APMonitor works\n", "\n", "First, the package is called by\n", "\n", "> from apm import *\n", "\n", "An online server and an application name must be then defined. e.g.\n", "\n", "> server = 'http://xps.apmonitor.com'\n", "\n", "> app = 'my_app'\n", "\n", "The `.apm` file is then loaded which contains the setup of the problem in a similar fashion is defined mathematically.\n", "Therefore the `.apm` file contains the objective function, optimization and control variables, fixed parameters and a set of\n", "equations describing equality/inequality constraints or differential algebraic equations.\n", "\n", "We load the setup file by\n", "\n", "> apm_load(s,a,'setup.apm')\n", "\n", "Then depending on the nature of the optimization problem we can set variable options using the `apm_option` command.\n", "\n", "Finally we solve the problem by\n", "\n", "> apm(server,app,'solve')\n", "\n", "A detailed report is provided consisting of the optimization variables, equations used, and optimization solver, number of iterations, terminating conditions and tolerances, objective functions and optimizers and flags whether the optimiations was succesful or not.\n", "\n", "The solution can be retrieved by \n", "\n", "> solution = apm_sol(server,app)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best way to see how it works is to go through some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 : Linear problem\n", "\n", "A linear optimization problem has the following form:\n", "\n", "$\n", " \\text{max (or min) } \\mathbf{c}^T\\mathbf{x}\n", "$\n", "\n", "$\n", " \\text{subject to } \\\\\n", "$\n", "\n", "$ \n", " \\mathbf{lb} \\le \\mathbf{x} \\le \\mathbf{ub} \\\\\n", " \\mathbf{A x} \\le \\mathbf{b} \\\\\n", " \\mathbf{A}_\\mathbf{eq} \\mathbf{x} = \\mathbf{b}_\\mathbf{eq} \\\\\n", " \\mathbf{x \\in R^n}\n", "$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the following problem:\n", "\n", "$\n", " \\text{min} (15x_1 + 8x_2 + 80x_3)\n", "$\n", "\n", "$\n", " \\text{subject to}\n", "$\n", "\n", "$ \n", " x_1 + 2x_2 + 3x_3 <= 15 \\\\\n", " 8x_1 + 15x_2 + 80x_3 <= 80 \\\\\n", " 8x_1 + 80x_2 + 15x_3 <= 150 \\\\\n", " -100x_1 - 10x_2 - x_3 <= -800 \\\\\n", " 80x_1 + 8x_2 + 15x_3 = 750 \\\\\n", " x_1 + 10x_2 + 100x_3 = 80 \n", "$\n", "\n", "$\\text{with } 4 <= x_1, -80 <= x_2 <= -8.$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `lp.apm` file contains the descriprion of the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load scripts/session5/lp.apm\n", "Model\n", "\n", " Variables\n", " x[1] = 4, >=4\n", " x[2] = -80, >=-80, <=-8\n", " x[3] = 0 >=-1e10, <=1e10\n", " End Variables\n", "\n", " Equations\n", " x[1] + 2*x[2] + 3*x[3] <= 15\n", " 8*x[1] + 15*x[2] + 80*x[3] <= 80\n", " 8*x[1] + 80*x[2] + 15*x[3] <= 150\n", " -100*x[1] - 10*x[2] - x[3] <= -800\n", " 80*x[1] + 8*x[2] + 15*x[3] = 750\n", " x[1] + 10*x[2] + 100*x[3] = 80 \n", "\n", " minimize 15*x[1] + 8*x[2] + 80*x[3]\n", " End Equations\n", "\n", "End Model\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We import the necessary packages and append to the path the location of `apm.py` in order to import the APM package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from random import randint\n", "import sys" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sys.path.append(\"scripts/session5/\")\n", "from apm import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we choose the server and application name, load the model and option and solve it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# server\n", "server = 'http://xps.apmonitor.com'\n", "\n", "# application + random number\n", "app = 'lp' + str(randint(2,999))\n", "\n", "# clear server\n", "apm(server,app,'clear all')\n", "\n", "# load model\n", "apm_load(server,app,'scripts/session5/lp.apm')\n", "\n", "# choose the solver\n", "# 1=APOPT (active-set), 2=BPOPT (interior-point), 3=IPOPT (interior-point)\n", "apm_option(server,app,'nlc.solver',1)\n", "\n", "# solve\n", "output = apm(server,app,'solve')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# retrieve solution\n", "solution = apm_sol(server,app)\n", "print(solution) # should print [ 9.89355041 -8. 1.5010645 ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solving the problem with `scipy`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy.optimize import linprog\n", "\n", "# objective function\n", "f = np.array([15, 8, 80])\n", "\n", "# inequalities\n", "A = np.array([[1, 2, 3], [8, 15, 80], [8, 80, 15], [-100, -10, -1]]) \n", "b = np.array([15, 80, 150, -800]) \n", "\n", "# equalities\n", "Aeq = np.array([[80, 8, 15], [1, 10, 100]]) \n", "beq = np.array([750, 80])\n", "\n", "# bounds\n", "x0_bounds = (4, -80, None)\n", "x1_bounds = (None, -8, None)\n", "\n", "# solve\n", "sol = linprog(f, A_ub=A, b_ub=b, A_eq=Aeq, b_eq=beq, bounds=list(zip(x0_bounds,x1_bounds)))\n", "\n", "print('objFunValue: {:.16f}'.format(sol.fun)) # should print 204.48841578\n", "print('x_opt: {}'.format(sol.x)) # should print [ 9.89355041 -8. 1.5010645 ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily verify that solution satisfies the constraints:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# error for equalitity constraints\n", "print(np.linalg.norm(np.dot(Aeq,sol.x)- beq))\n", "\n", "# inequalities \n", "print(np.dot(A,sol.x)- b <= 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: Least square problem\n", "\n", "Consider some 2D data. We would like to find a model to the measured data that minimizes the discrepancy between model predictions and the data. This can be formulated by minimizing the sum of the squares of the difference between values of model and measured data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First create some data and plot them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n = 200\n", "x = np.linspace(0,2,n)\n", "m = lambda x : 0.8*np.exp(-1.5*x)+1.2*np.exp(-0.8*x)\n", "perturb = 0.5*np.random.uniform(0,1,n)\n", "y = m(x)*(1.0+perturb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(12,8), dpi=100) # changing figure's shape\n", "plt.plot(x,y,'.')\n", "plt.xlabel('x',fontsize=16) # horizontal axis name\n", "plt.ylabel('y(x)',fontsize=16) # vertical axis name\n", "plt.xticks(fontsize=14)\n", "plt.yticks(fontsize=14)\n", "plt.title('Sample data',fontsize=18) # title \n", "plt.grid(True) # enabling grid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's solve the least square problem:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\"\"\"\n", "In this example we will search for z = (a, b, c, d) \n", "that minimizes Sum_i || F(X_i;z) - Y_i ||^2\n", "for the function\n", "F(x) = a*exp(-b*x) + c*exp(-d*x)\n", "\"\"\"\n", "from scipy.optimize import least_squares\n", "\n", "# function for computing residuals\n", "def fun(z, x, y):\n", " return z[0]*np.exp(-z[1]*x) + z[2]*np.exp(-z[3]*x) - y\n", "\n", "# initial estimate\n", "x0 = np.array([0.2, 0.9, 1.9, 0.9])\n", "\n", "sol_lsq = least_squares(fun, x0, args=(x, y))\n", "\n", "# solution parameters\n", "z = sol_lsq.x\n", "\n", "residual = sum(sol_lsq.fun**2)\n", "\n", "print('solution: '+str(z)+'\\n||residuals||^2 = '+str(residual))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def fun(z, x, y):\n", " return z[0]*np.exp(-z[1]*x) + z[2]*np.exp(-z[3]*x)\n", "\n", "plt.figure(figsize=(15,10), dpi=100)\n", "plt.plot(x,y,'.')\n", "plt.plot(x,fun(z,x,y),'r-',linewidth=1.5)\n", "plt.xlabel('x',fontsize=16)\n", "plt.ylabel('y',fontsize=16)\n", "plt.title('Sample data',fontsize=18)\n", "plt.grid(True) # enabling grid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3: Nonlinear problem\n", "\n", "Consider the following example of a non-linear problem (from [http://apmonitor.com/wiki/index.php/Main/PythonApp](http://apmonitor.com/wiki/index.php/Main/PythonApp))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load scripts/session5/nlp.apm\n", "Model\n", " Variables\n", " x[1] = 1, >=1, <=5\n", " x[2] = 5, >=1, <=5\n", " x[3] = 5, >=1, <=5\n", " x[4] = 1, >=1, <=5\n", " End Variables\n", "\n", " Equations\n", " x[1] * x[2] * x[3] * x[4] > 25\n", " x[1]^2 + x[2]^2 + x[3]^2 + x[4]^2 = 40\n", "\n", " minimize x[1] * x[4] * (x[1]+x[2]+x[3]) + x[3]\n", " End Equations\n", "End Model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# server\n", "s = 'http://xps.apmonitor.com'\n", "\n", "# application + random number\n", "a = 'app' + str(randint(2,999))\n", "\n", "# clear server\n", "apm(s,a,'clear all')\n", "\n", "# load model\n", "apm_load(s,a,'scripts/session5/nlp.apm')\n", "\n", "# choose solver\n", "# 1=APOPT (active-set), 2=BPOPT (interior-point), 3=IPOPT (interior-point)\n", "apm_option(s,a,'nlc.solver',3)\n", "\n", "# solve\n", "output = apm(s,a,'solve')\n", "print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "solution = apm_sol(s,a)\n", "print(solution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also see the results in the browser:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "apm_web_var(s,a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's visualize the results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a mesh for x_2 and x_4\n", "x = np.arange(1.0, 5.0, 0.1)\n", "y = np.arange(1.0, 5.0, 0.1)\n", "x2, x4 = np.meshgrid(x,y)\n", "\n", "# since we can plot only against two variables we choose the optimal values for x_1 and x_3 \n", "x1 = solution['x[1]']\n", "x3 = solution['x[3]']\n", "\n", "# compute the objective function, the inequality and equality equation\n", "obj = x1*x4*(x1+x2+x3) + x3\n", "ic = x1*x2*x3*x4\n", "eq = x1**2 + x2**2 + x3**2 + x4**2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's make a contour plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "plt.title('NLP Contour Plot',fontsize=20)\n", "\n", "CS = plt.contour(x2,x4,obj)\n", "plt.clabel(CS,fontsize=14)\n", "plt.xlabel(r'$x_2$',fontsize=18)\n", "plt.ylabel(r'$x_4$',fontsize=18)\n", "plt.xticks(fontsize=14)\n", "plt.yticks(fontsize=14)\n", "\n", "CS = plt.contour(x2,x4,ic,[25.,26.,27.,28.],colors='b',linewidts=[40.,3.,2.,1.])\n", "plt.clabel(CS,inline=True,fontsize='x-large')\n", "CS = plt.contour(x2,x4,eq,[40.],colors='r',linewidts=[4.])\n", "plt.clabel(CS,inline=True,fontsize='x-large')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily get the same solution using `scipy`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy.optimize import minimize\n", "\n", "# objective function\n", "def fun(x):\n", " return x[0] * x[3] * (x[0]+x[1]+x[2]) + x[2]\n", "# or inline-function:\n", "# fun = lambda x: x[0] * x[3] * (x[0]+x[1]+x[2]) + x[2]\n", "\n", "# constraints\n", "cons = ({'type':'ineq','fun':lambda x: x[0]*x[1]*x[2]*x[3]-25},{'type':'eq','fun':lambda x: x[0]**2+x[1]**2+x[2]**2+x[3]**2-40})\n", "\n", "# bounds\n", "x0_bounds = (1,1,1,1)\n", "x1_bounds = (5,5,5,5)\n", "\n", "# initial value\n", "x0 = np.array([1,5,5,1])\n", "\n", "# solve\n", "sol = minimize(fun, x0, method='SLSQP', bounds=list(zip(x0_bounds,x1_bounds)), constraints=cons)\n", "\n", "print('objFunValue: {:.16f}'.format(sol.fun))\n", "print('x_opt: {}'.format(sol.x))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# solution with APMonitor\n", "print(solution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 4: Dynamic Programming\n", "\n", "Consider now a reactor that converts a compound A to intermediate compound B and a final species C. The objective is to adjust the temperature (T) to maximize the concentration of compound B at the final time.\n", "\n", "This example is taken from [Dynamic Optimization Benchmarks](http://apmonitor.com/do/index.php/Main/DynamicOptimizationBenchmarks).\n", "\n", "The problem is described as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\text{max } x_2(t_f)$\n", "\n", "subject to\n", "\n", "$\\dfrac{d x_1}{dt} = -k_1x_1^2 \\\\ \\dfrac{d x_2}{dt} = k_1x_1^2 - k_2x_2$\n", "\n", "with\n", "\n", "$k_1 = 4000 \\exp\\left(\\frac{-2500}{T}\\right), \\quad k_2 = 6.3\\times 10^5 \\exp\\left(\\frac{-5000}{T}\\right),$\n", "\n", "and\n", "\n", "\n", "$x(0) = [1, 0], \\quad 298 <= T <= 398, \\quad t_f = 1.$ " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load scripts/session5/dae.apm\n", "Model\n", "\n", " Parameters\n", " p\n", " t >=298 <=398\n", " End Parameters\n", "\n", " Variables\n", " x1 = 1\n", " x2 = 0\n", " End Variables\n", "\n", " Intermediates\n", " k1 = 4000 * exp(-2500/t)\n", " k2 = 6.2e5 * exp(-5000/t)\n", " End Intermediates\n", "\n", " Equations\n", " maximize p * x2\n", "\n", " $x1 = -k1 * x1^2\n", " $x2 = k1*x1^2 - k2*x2\n", "\n", " ! note: the $ denotes time differential\n", " End Equations\n", "\n", "End Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %load scripts/session5/dae.csv\n", "time,p\n", "0,0\n", "0.001,0\n", "0.1,0\n", "0.2,0\n", "0.3,0\n", "0.4,0\n", "0.5,0\n", "0.6,0\n", "0.7,0\n", "0.8,0\n", "0.9,0\n", "1.0,1\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# specify server and application name\n", "s = 'http://byu.apmonitor.com'\n", "a = 'dynopt'\n", "\n", "apm(s,a,'clear all')\n", "apm_load(s,a,'scripts/session5/dae.apm')\n", "csv_load(s,a,'scripts/session5/dae.csv')\n", "\n", "apm_option(s,a,'nlc.nodes',4)\n", "apm_option(s,a,'nlc.solver',1)\n", "apm_option(s,a,'nlc.imode',6)\n", "apm_option(s,a,'nlc.mv_type',1)\n", "\n", "apm_info(s,a,'MV','t')\n", "apm_option(s,a,'t.status',1)\n", "apm_option(s,a,'t.dcost',0)\n", "\n", "output = apm(s,a,'solve')\n", "print(output)\n", "y = apm_sol(s,a)\n", "\n", "print('Optimal Solution: ' + str(y['x2'][-1]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(12,12))\n", "\n", "plt.subplot(211)\n", "plt.plot(y['time'][1:],y['t'][1:],'r-s')\n", "plt.legend('T',fontsize=18)\n", "plt.ylabel('Manipulated',fontsize=18)\n", "plt.xticks(fontsize=14)\n", "plt.yticks(fontsize=14)\n", "\n", "plt.subplot(212)\n", "plt.plot(y['time'],y['x1'],'b-s')\n", "plt.plot(y['time'],y['x2'],'g-o')\n", "plt.legend(['x1','x2'],fontsize=18)\n", "plt.ylabel('Variables',fontsize=18)\n", "plt.xlabel('Time',fontsize=18)\n", "plt.xticks(fontsize=16)\n", "plt.yticks(fontsize=16)\n", "\n", "plt.show() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning in Python\n", "Machine learning is a hot and popular topic nowadays. It is about building programs with tunable parameters that are adjusted automatically by adapting to previously seen data. It intervenes literally every sphere of our life ranging from science, engineering, and industry to economics, sociology, and even [personal relationships](http://www.wired.com/wiredscience/2014/01/how-to-hack-okcupid/all/). So if it is applicable everywhere, we need to have handy tools for doing machine learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "# Start pylab inline mode, so figures will appear in the notebook\n", "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Quick Overview of Scikit-Learn\n", "*Adapted from http://scikit-learn.org/stable/tutorial/basic/tutorial.html*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load a dataset already available in scikit-learn\n", "from sklearn import datasets\n", "digits = datasets.load_digits()\n", "type(digits)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "digits.data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "digits.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import svm\n", "clf = svm.SVC(gamma=0.001, C=100.)\n", "clf.fit(digits.data[:-1], digits.target[:-1])\n", "\n", "print (\"Original label:\", digits.target[7])\n", "print (\"Predicted label:\", clf.predict(digits.data[7].reshape(1,-1)))\n", "\n", "plt.figure(figsize=(10, 10))\n", "for i in range(10):\n", " plt.subplot(2, 5, i + 1)\n", " plt.imshow(digits.images[i], interpolation='nearest',vmax=16, cmap=plt.cm.binary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Data representation in Scikit-learn\n", "*Adapted from https://github.com/jakevdp/sklearn_scipy2013*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data in scikit-learn, with very few exceptions, is assumed to be stored as a **two-dimensional array**, of size `[n_samples, n_features]`, and most of machine learning algorithms implemented in scikit-learn expect data to be stored this way.\n", "\n", "- **n_samples:** The number of samples: each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n", "- **n_features:** The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.\n", "\n", "The number of features must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.1. Example: `iris dataset`\n", "To understand these simple concepts better, let's work with an example: `iris-dataset`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Don't know what iris is? Run this code to generate some pictures.\n", "from IPython.core.display import Image, display\n", "display(Image(filename='figures/session5/iris_setosa.jpg'))\n", "print (\"Iris Setosa\\n\")\n", "\n", "display(Image(filename='figures/session5/iris_versicolor.jpg'))\n", "print (\"Iris Versicolor\\n\")\n", "\n", "display(Image(filename='figures/session5/iris_virginica.jpg'))\n", "print (\"Iris Virginica\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load the dataset from scikit-learn\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "type(iris) # The resulting dataset is a Bunch object" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print (iris.DESCR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Number of samples and features\n", "iris.data.shape\n", "print (\"Number of samples:\", iris.data.shape[0])\n", "print (\"Number of features:\", iris.data.shape[1])\n", "print (iris.data[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Target (label) names\n", "print (iris.target_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Visualize the data\n", "x_index = 0\n", "y_index = 1\n", "\n", "# this formatter will label the colorbar with the correct target names\n", "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", "\n", "plt.figure(figsize=(7, 5))\n", "plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)\n", "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", "plt.xlabel(iris.feature_names[x_index])\n", "plt.ylabel(iris.feature_names[y_index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Quick Exercise:**\n", "Change `x_index` and `y_index` and find the 2D projection of the data which maximally separate all three classes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# your code goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.2. Other available datasets\n", "Scikit-learn makes available a host of datasets for testing learning algorithms. They come in three flavors:\n", "\n", "- **Packaged Data**: these small datasets are packaged with the scikit-learn installation, and can be downloaded using the tools in `sklearn.datasets.load_*`\n", "- **Downloadable Data**: these larger datasets are available for download, and scikit-learn includes tools which streamline this process. These tools can be found in `sklearn.datasets.fetch_*`\n", "- **Generated Data**: there are several datasets which are generated from models based on a random seed. These are available in the `sklearn.datasets.make_*`\n", "\n", "visit http://scikit-learn.org/stable/datasets/ to learn more about the available Scikit-learn datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Where the downloaded datasets via fetch_scripts are stored\n", "from sklearn.datasets import get_data_home\n", "get_data_home()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# How to fetch an external dataset\n", "from sklearn import datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can type `datasets.fetch_` and a list of available datasets will pop up." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise:\n", "Fetch the dataset `olivetti_faces`. Visualize various 2D projections for the data. Find the best in terms of separability.\n", "\n", "A sample preview of the faces can be found at http://www.cl.cam.ac.uk/research/dtg/attarchive/facesataglance.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_olivetti_faces" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# your code goes here\n", "# fetch the faces data\n", "# Use a script like above to plot the faces image data.\n", "# Hint: plt.cm.bone is a good colormap for this data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. Linear regression and classification\n", "In this section, we explain how to use scikit-learn library for regression and classification. We discuss `Estimator` class - the base class for many of the scikit-learn models - and work through some examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_boston\n", "data = load_boston()\n", "print (data.DESCR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Plot histogram for real estate prices\n", "plt.hist(data.target)\n", "plt.xlabel('price ($1000s)')\n", "plt.ylabel('count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Quick Exercise:** Try sctter-plot several 2D projections of the data. What you can infer from the plots?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Visualize the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every algorithm is exposed in scikit-learn via an `Estimator` object. For instance a linear regression is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "clf = LinearRegression()\n", "clf.fit(data.data, data.target)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predicted = clf.predict(data.data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Plot the predicted price vs. true price\n", "plt.scatter(data.target, predicted)\n", "plt.plot([0, 50], [0, 50], '--k')\n", "plt.axis('tight')\n", "plt.xlabel('True price ($1000s)')\n", "plt.ylabel('Predicted price ($1000s)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examples of regression type problems in machine learning:\n", "\n", "- **Sales:** given consumer data, predict how much they will spend\n", "- **Advertising:** given information about a user, predict the click-through rate for a web ad.\n", "- **Collaborative Filtering:** given a collection of user-ratings for movies, predict preferences for other movies & users.\n", "- **Astronomy:** given observations of galaxies, predict their mass or redshift" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example of regularized linear regression on a synthetic data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "\n", "rng = np.random.RandomState(0)\n", "x = 2 * rng.rand(100) - 1\n", "\n", "f = lambda t: 1.2 * t ** 2 + .1 * t ** 3 - .4 * t ** 5 - .5 * t ** 9\n", "y = f(x) + .4 * rng.normal(size=100)\n", "\n", "plt.figure(figsize=(7, 5))\n", "plt.scatter(x, y, s=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x_test = np.linspace(-1, 1, 100) #generate numbers from -1 to 1\n", "\n", "for k in [4, 9]:\n", " X = np.array([x**i for i in range(k)]).T\n", " X_test = np.array([x_test**i for i in range(k)]).T\n", " order4 = LinearRegression()\n", " order4.fit(X,y)\n", " plt.plot(x_test, order4.predict(X_test), label='%i-th order'%(k))\n", "\n", "plt.legend(loc='best')\n", "plt.axis('tight')\n", "plt.title('Fitting a 4th and a 9th order polynomial')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's look at the ground truth\n", "plt.figure()\n", "plt.scatter(x, y, s=4)\n", "plt.plot(x_test, f(x_test), label=\"truth\")\n", "plt.axis('tight')\n", "plt.title('Ground truth (9th order polynomial)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use [Ridge regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) (with l1-norm regularization)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "for k in [20]:\n", " X = np.array([x**i for i in range(k)]).T\n", " X_test = np.array([x_test**i for i in range(k)]).T\n", " order4 = Ridge(alpha=0.3)\n", " order4.fit(X, y)\n", " plt.plot(x_test, order4.predict(X_test), label='%i-th order'%(k))\n", "plt.plot(x_test, f(x_test), label=\"truth\")\n", "plt.scatter(x, y, s=4)\n", "plt.legend(loc='best')\n", "plt.axis('tight')\n", "plt.title('Fitting a 4th and a 9th order polynomial')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.ylim?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise:\n", "For this exercise, you need to classify iris data with nearest neighbors classifier and with linear support vector classifier (LinearSVC).\n", "In order to measure the performance, calculate the accuracy or just use `clf.score` function built-in into the `Estimator` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import division # turn off division truncation -- division will be floating point by default\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "\n", "# Splitting the data into train and test sets\n", "indices = np.random.permutation(range(iris.data.shape[0]))\n", "train_sz = int(floor(0.8*iris.data.shape[0]))\n", "X, y = iris.data[indices[:train_sz],:], iris.target[indices[:train_sz]]\n", "Xt, yt = iris.data[indices[train_sz:],:], iris.target[indices[train_sz:]]\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# your code goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4. Approaches to validation and testing\n", "It is usually an arduous and time consuming task to write cross-validation or other testing wrappers for your models. Scikit-learn allows you validate your models out-of-the-box. Let's have a look at how to do this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import model_selection\n", "X, Xt, y, yt = model_selection.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)\n", "print (X.shape, y.shape)\n", "print (Xt.shape, yt.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Cross-validating your models in one line\n", "svc = LinearSVC()\n", "scores = model_selection.cross_val_score(svc, iris.data, iris.target, cv=5,)\n", "print (scores)\n", "print (scores.mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model_selection.cross_val_score?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Calculate F1-measure (or any other score)\n", "from sklearn import metrics\n", "f1_scores = model_selection.cross_val_score(svc, iris.data, iris.target, cv=5, scoring='f1_macro')\n", "print (f1_scores)\n", "print (f1_scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Quick Exercise:** Do the same CV-evaluation of the KNN classifier. Find the best number of nearest neighbors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.5. What to learn next?\n", "If you would like to start using Scikit-learn for your machine learning tasks, the most right way is to jump into it and refer to [the documentation](http://scikit-learn.org/stable/documentation.html) from time to time. When you acquire enough skills in it, you can go and check out the following resources:\n", "\n", "- [A tutorial from PyCon 2013](https://github.com/jakevdp/sklearn_pycon2013)\n", "- [A tutorial from SciPy 2013 conference](https://github.com/jakevdp/sklearn_scipy2013) (check out the advanced track!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Scikit-learn algorithm cheatsheet](http://scikit-learn.org/dev/_static/ml_map.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Copyright 2017, ACM/SIAM Student Chapter of King Abdullah University of Science and Technology*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 0 }