{ "metadata": { "language": "Julia", "name": "", "signature": "sha256:e555c308a4f434c7dbfd18eb7231c498456237ab8c824dfa0ee83c086d8a62cc" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](files/img/julia_logo.png)\n", "# for Data Science\n", "\n", "@BenSadeghi\n", "\n", "[Twitter, GitHub, Linkedin]\n", "\n", "Based on presentations by John Myles White[[1]](http://nbviewer.ipython.org/github/johnmyleswhite/UCDavis.jl/blob/master/Julia.ipynb)[[2]](http://nbviewer.ipython.org/github/johnmyleswhite/DCStats.jl/blob/master/Base%20Julia.ipynb)[[3]](http://nbviewer.ipython.org/github/johnmyleswhite/DCStats.jl/blob/master/Statistical%20Programming%20in%20Julia.ipynb), Stefan Karpinski and others" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Contents\n", "* **Background**\n", "* **Why use Julia?**\n", "* **Language Basics**\n", " * Types\n", " * Linear Algebra\n", " * Functions, Multiple Dispatch\n", " * Programming Styles\n", "* **Package Manager**\n", "* **Statistics in Julia**\n", "* **Tabular Data**\n", "* **Data Visualization**\n", "* **Machine Learning Algorithms**\n", " * Unsupervised\n", " * Supervised\n", "* **Resources**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Background\n", "\n", "Julia is a high-level dynamic programming language designed to address the requirements of high-performance numerical and scientific computing while also being effective for general purpose programming.\n", "\n", "Julia's core is implemented in C and C++, its parser in Scheme, and the LLVM compiler framework is used for just-in-time generation of machine code for x86(-64).\n", "\n", "### Designed By\n", "* Jeff Bezanson\n", "* Stefan Karpinski\n", "* Viral B. 
Shah\n", "* Alan Edelman (MIT supervisor)\n", "\n", "Development began in 2009, open-sourced in February 2012\n", "\n", "Currently has 250+ contributors to the language, 400+ overall\n", "\n", "Stable release: v0.2.1 (2014/02/11), pre-release: v0.3 (nightly build)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# World of Julia\n", "![](files/img/julia_world.png)\n", "Source: GitHub, 2014/06/30 - [Source code](http://nbviewer.ipython.org/github/jiahao/ijulia-notebooks/blob/master/2014-06-30-world-of-julia.ipynb) by Jiahao Chen" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Why Use Julia?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Language Features\n", "* Multiple dispatch\n", "* Dynamic type system\n", "* Performance approaching that of statically-compiled languages like C\n", "* Built-in package manager\n", "* Lisp-like macros and other metaprogramming facilities\n", "* Call C functions directly: no wrappers or special APIs\n", "* Shell-like capabilities for managing other processes\n", "* Designed for parallelism and distributed computation\n", "* User-defined types are as fast and compact as built-ins\n", "* Efficient support for Unicode, including but not limited to UTF-8\n", "* Familiar Matlab/NumPy-like syntax" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](files/img/benchmarks.svg)\n", "[Source code](http://nbviewer.ipython.org/url/julialang.org/benchmarks.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Why You Shouldn't Use Julia\n", "\n", "* Julia is very young\n", "* The Julia package ecosystem is even younger\n", "* Breaking changes are still coming in core and will be quite frequent outside of core Julia\n", "* Language features are still being added: your 
favorite may not exist yet\n", "* Code quality for packages varies from reasonably well tested, to never tested, to broken" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# So... Should You Use Julia?\n", "That depends on your use case:\n", "\n", "* If you tend to build lots of tools from scratch, Julia is usable, but a little rough\n", "* If you tend to build upon lots of other packages, Julia isn't ready for you yet" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Hierarchical Built-In Types" ] }, { "cell_type": "code", "collapsed": false, "input": [ "subtypes(Number)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "2-element Array{Any,1}:\n", " Complex{T<:Real}\n", " Real " ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "subtypes(Real)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "4-element Array{Any,1}:\n", " FloatingPoint \n", " Integer \n", " MathConst{sym} \n", " Rational{T<:Integer}" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "subtypes(Integer)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "5-element Array{Any,1}:\n", " BigInt \n", " Bool \n", " Char \n", " Signed \n", " Unsigned" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# Floating Point\n", "@show 5/3\n", "\n", "# Mathematical Constant\n", "@show pi\n", "\n", "# Rational\n", "@show 2//3 + 1\n", "\n", "# BigInt\n", "@show big(2) ^ 1000 ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" 
} }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5 / 3 => 1.6666666666666667" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "pi => \u03c0 = 3.1415926535897...\n", "2 // 3 + 1 => 5//3\n", "big(2)^1000 => 10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376\n" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "subtypes(String)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "8-element Array{Any,1}:\n", " DirectIndexString \n", " GenericString \n", " RepString \n", " RevString{T<:String}\n", " RopeString \n", " SubString{T<:String}\n", " UTF16String \n", " UTF8String " ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "s = \"Hello World\"\n", "\n", "@show typeof(s)\n", "@show s[7] ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "typeof(s) => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "ASCIIString\n", "s[7] => 'W'\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# Unicode Names and Values\n", "\n", "\u4f60\u597d = \"(\uff61\u25d5_\u25d5\uff61)\uff89 \"\n", "\n", "@show typeof(\u4f60\u597d)\n", "@show \u4f60\u597d ^ 3 ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "typeof(\u4f60\u597d) => UTF8String\n", "\u4f60\u597d^3 => \"(\uff61\u25d5_\u25d5\uff61)\uff89 (\uff61\u25d5_\u25d5\uff61)\uff89 
(\uff61\u25d5_\u25d5\uff61)\uff89 \"\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# User-Defined Types" ] }, { "cell_type": "code", "collapsed": false, "input": [ "type NewType\n", " i::Integer\n", " s::String\n", "end\n", "\n", "new_t = NewType(33, \"this is a NewType\")\n", "\n", "@show new_t.i\n", "@show new_t.s ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "new_t.i => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "33\n", "new_t.s => \"this is a NewType\"\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Linear Algebra" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Vectors\n", "\n", "v = [1, 1]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "2-element Array{Int64,1}:\n", " 1\n", " 1" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# Vector Operations\n", "\n", "@show v + [2, 0] # vector addition\n", "@show v + 1 # same as v + [1,1]\n", "@show 5*v # scalar multiplication" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "v + [2,0] => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "[3,1]\n", "v + 1 => [2,2]\n", "5v => [5,5]\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "2-element Array{Int64,1}:\n", " 5\n", " 5" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "println( \"Dot Product : \", dot(v, v) )\n", "println( \"Norm : \", norm(v) )" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } 
}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Dot Product : 2\n", "Norm : 1" ] }, { "output_type": "stream", "stream": "stdout", "text": [ ".4142135623730951\n" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# Matrices\n", "\n", "M = [1 1 ; 0 1]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "2x2 Array{Int64,2}:\n", " 1 1\n", " 0 1" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "# Matrix Addition\n", "\n", "M + 1 ,\n", "M + [0 0 ; 5 5]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "(\n", "2x2 Array{Int64,2}:\n", " 2 2\n", " 1 2,\n", "\n", "2x2 Array{Int64,2}:\n", " 1 1\n", " 5 6)" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# Matrix Multiplication\n", "\n", "2M ,\n", "M ^ 2 ,\n", "M * v" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "(\n", "2x2 Array{Int64,2}:\n", " 2 2\n", " 0 2,\n", "\n", "2x2 Array{Int64,2}:\n", " 1 2\n", " 0 1,\n", "\n", "[2,1])" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "# Gaussian Elimination\n", "\n", "b = M * v\n", "\n", "M \\ b # solve back for v" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "2-element Array{Float64,1}:\n", " 1.0\n", " 1.0" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Functions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Named functions\n", "\n", 
"f(x) = 10x\n", "\n", "function g(x)\n", " return x * 10\n", "end\n", "\n", "@show f(5)\n", "@show g(5) ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "f(5) => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "50\n", "g(5) => 50\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# Anonymous functions assigned to variables\n", "\n", "h = x -> x * 10\n", "\n", "i = function(x)\n", " x * 10\n", "end\n", "\n", "@show h(5)\n", "@show i(5) ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "h(5) => 50\n", "i(5) => 50\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# Operators are functions\n", "\n", "+(4,5)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "9" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "p = +\n", "\n", "p(2,3)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "5" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Multiple Dispatch" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bar(x::String) = println(\"You entered the string: $x\")\n", "bar(x::Integer) = x * 10\n", "bar(x::NewType) = println(x.s)\n", "\n", "methods(bar)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "html": [ "3 methods for generic function bar:" ], "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "# 3 methods for generic function \"bar\":\n", 
"bar(x::String) at In[20]:1\n", "bar(x::Integer) at In[20]:2\n", "bar(x::NewType) at In[20]:3" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "bar(\"Hello\")\n", "bar(new_t)\n", "bar(5)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "You entered the string: Hello" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "this is a NewType\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "50" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "# Adding strings\n", "\n", "\"Hello\" + \"World\"" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "ename": "LoadError", "evalue": "`+` has no method matching +(::ASCIIString, ::ASCIIString)\nwhile loading In[22], in expression starting on line 3", "output_type": "pyerr", "traceback": [ "`+` has no method matching +(::ASCIIString, ::ASCIIString)\nwhile loading In[22], in expression starting on line 3" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "# But the addition operator is a function, so we can apply multi-dispatch\n", "\n", "+(a::String, b::String) = a * b\n", "\n", "\"Hello\" + \"World\"" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "\"HelloWorld\"" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "+(a::Number, b::String) = string(a) + b\n", "+(a::String, b::Number) = a + string(b)\n", "\n", "99 + \"bottles\"" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "\"99bottles\"" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", 
"metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Object-Oriented Programming" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Method Overloading\n", "\n", "type SimpleObject\n", " data::Union(Integer, String)\n", " set::Function\n", "\n", " function SimpleObject()\n", " this = new()\n", " this.data = \"\"\n", "\n", " function setter(x::Integer)\n", " println(\"Setting an integer\")\n", " this.data = x\n", " end\n", " function setter(x::String)\n", " println(\"Setting a string\")\n", " this.data = x\n", " end\n", " this.set = setter\n", "\n", " return this\n", " end\n", "end\n", "\n", "obj = SimpleObject()\n", "obj.set(99)\n", "obj.set(\"hello\")" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Setting an integer" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Setting a string\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "\"hello\"" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Functional Programming" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sum of odd integers between 1 and 5\n", "\n", "values = 1:5\n", "\n", "myMapper = x -> x\n", "myFilter = x -> x % 2 == 1\n", "myReducer = (x,y) -> x + y\n", "\n", "mapped = map( myMapper, values )\n", "filtered = filter( myFilter, mapped )\n", "reduced = reduce( myReducer, filtered )" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ "9" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Metaprogramming" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Code Generation\n", "# Functions for exponentiating to the powers of 1 to 
5\n", "\n", "for n in 1:5\n", " s = \"power$n(x) = x ^ $n\"\n", " println(s)\n", " expression = parse(s)\n", " eval(expression) \n", "end\n", "\n", "power5( 2 )" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "power1(x) = x ^ 1" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "power2(x) = x ^ 2\n", "power3(x) = x ^ 3\n", "power4(x) = x ^ 4\n", "power5(x) = x ^ 5\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "32" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "# Macros: Crude Timer Example\n", "\n", "macro timeit(expression)\n", " quote\n", " t = time()\n", " result = $expression # evaluation\n", " elapsed = time() - t\n", " println( \"elapsed time: \", elapsed )\n", " return result\n", " end\n", "end\n", "\n", "@timeit cos(2pi)\n", "@timeit cos(2pi)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "elapsed time: 0.005074977874755859\n", "elapsed time: 4.0531158447265625e-6\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "1.0" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Package Manager\n", "Julia has a built-in package management system. 
All packages are git repositories, mostly hosted on GitHub.\n", "\n", "### Installing a new package\n", "Pkg.add(\"PackageName\")\n", "\n", "### Start using it\n", "using PackageName\n", "\n", "### Import a function to overload\n", "import PackageName.FunctionName\n", "\n", "### Update packages\n", "Pkg.update()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Basic Statistics" ] }, { "cell_type": "code", "collapsed": false, "input": [ "using StatsBase\n", "\n", "x = rand(100) # uniform distribution [0,1)\n", "\n", "println( \"mean: \", mean(x) )\n", "println( \"variance: \", var(x) )\n", "println( \"skewness: \", skewness(x) )\n", "println( \"kurtosis: \", kurtosis(x) )" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "mean: " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "0.5260291483830464\n", "variance: 0.09076375466564988\n", "skewness: 0.007890698485229702\n", "kurtosis: -1.3205549430417554\n" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "describe(x)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Summary Stats:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Mean: 0.526029\n", "Minimum: 0.002137\n", "1st Quartile: 0.287252\n", "Median: 0.495243\n", "3rd Quartile: 0.809833\n", "Maximum: 0.995268\n" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Probability Distributions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "using Distributions\n", "\n", "distr = Normal(0, 2)\n", "\n", "println( \"pdf @ origin = \", pdf(distr, 0.0) )\n", "println( \"cdf @ origin = \", cdf(distr, 0.0) )" ], "language": "python", "metadata": { 
"slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "pdf @ origin = " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "0.19947114020071635\n", "cdf @ origin = 0.5\n" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "x = rand(distr, 1000)\n", "\n", "fit_mle(Normal, x)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 32, "text": [ "Normal( \u03bc=-0.010365910392224809 \u03c3=2.0295200120189767 )" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tabular Data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "using DataFrames\n", "\n", "df = DataFrame(\n", " A = [6, 3, 4],\n", " B = [\"a\", \"b\", \"c\"],\n", " C = [1//2, 3//4, 5//6],\n", " D = [true, true, false]\n", ")\n", "\n", "df[:C][2] = NA\n", "df" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "html": [ "
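<table><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr>
<tr><th>1</th><td>6</td><td>a</td><td>1//2</td><td>true</td></tr>
<tr><th>2</th><td>3</td><td>b</td><td>NA</td><td>true</td></tr>
<tr><th>3</th><td>4</td><td>c</td><td>5//6</td><td>false</td></tr></table>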
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "3x4 DataFrame\n", "|-------|---|-----|------|-------|\n", "| Row # | A | B | C | D |\n", "| 1 | 6 | \"a\" | 1//2 | true |\n", "| 2 | 3 | \"b\" | NA | true |\n", "| 3 | 4 | \"c\" | 5//6 | false |" ] } ], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "# Joins\n", "\n", "names = DataFrame(ID = [5, 4], Name = [\"Jack\", \"Jill\"])\n", "jobs = DataFrame(ID = [5, 4], Job = [\"Lawyer\", \"Doctor\"])\n", "\n", "full = join(names, jobs, on = :ID)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "html": [ "
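<table><tr><th></th><th>ID</th><th>Name</th><th>Job</th></tr>
<tr><th>1</th><td>4</td><td>Jill</td><td>Doctor</td></tr>
<tr><th>2</th><td>5</td><td>Jack</td><td>Lawyer</td></tr></table>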
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ "2x3 DataFrame\n", "|-------|----|--------|----------|\n", "| Row # | ID | Name | Job |\n", "| 1 | 4 | \"Jill\" | \"Doctor\" |\n", "| 2 | 5 | \"Jack\" | \"Lawyer\" |" ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "using RDatasets\n", "\n", "iris = dataset(\"datasets\", \"iris\")\n", "head(iris)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "html": [ "
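<table><tr><th></th><th>SepalLength</th><th>SepalWidth</th><th>PetalLength</th><th>PetalWidth</th><th>Species</th></tr>
<tr><th>1</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><th>2</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><th>3</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>setosa</td></tr>
<tr><th>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>setosa</td></tr>
<tr><th>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><th>6</th><td>5.4</td><td>3.9</td><td>1.7</td><td>0.4</td><td>setosa</td></tr></table>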
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ "6x5 DataFrame\n", "|-------|-------------|------------|-------------|------------|----------|\n", "| Row # | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |\n", "| 1 | 5.1 | 3.5 | 1.4 | 0.2 | \"setosa\" |\n", "| 2 | 4.9 | 3.0 | 1.4 | 0.2 | \"setosa\" |\n", "| 3 | 4.7 | 3.2 | 1.3 | 0.2 | \"setosa\" |\n", "| 4 | 4.6 | 3.1 | 1.5 | 0.2 | \"setosa\" |\n", "| 5 | 5.0 | 3.6 | 1.4 | 0.2 | \"setosa\" |\n", "| 6 | 5.4 | 3.9 | 1.7 | 0.4 | \"setosa\" |" ] } ], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "# Group by Species, then compute mean of PetalLength per group\n", "\n", "by( iris, :Species, df -> mean(df[:PetalLength]) )" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "html": [ "
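<table><tr><th></th><th>Species</th><th>x1</th></tr>
<tr><th>1</th><td>setosa</td><td>1.462</td></tr>
<tr><th>2</th><td>versicolor</td><td>4.26</td></tr>
<tr><th>3</th><td>virginica</td><td>5.552</td></tr></table>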
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "3x2 DataFrame\n", "|-------|--------------|-------|\n", "| Row # | Species | x1 |\n", "| 1 | \"setosa\" | 1.462 |\n", "| 2 | \"versicolor\" | 4.26 |\n", "| 3 | \"virginica\" | 5.552 |" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Data Visualization" ] }, { "cell_type": "code", "collapsed": false, "input": [ "using ASCIIPlots\n", "\n", "x = iris[:PetalLength]\n", "y = iris[:PetalWidth]\n", "\n", "scatterplot(x, y)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "\n", "\t-------------------------------------------------------------\n", "\t| ^ ^ | 2.50\n", "\t| ^ ^ |\n", "\t| ^ ^^^ ^ ^^ ^|\n", "\t| ^ ^ ^ |\n", "\t| ^^ ^ ^^ ^ ^ ^^ ^ |\n", "\t| ^ ^ ^ |\n", "\t| ^^^ ^ ^ ^ ^ |\n", "\t| ^ ^ |\n", "\t| ^ ^ ^ ^^ ^ |\n", "\t| ^ ^^ ^^ ^ |\n", "\t| ^ ^ ^^^^ |\n", "\t| ^ ^ ^ ^ ^ |\n", "\t| ^ ^ ^ ^^ ^ |\n", "\t| |\n", "\t| |\n", "\t| |\n", "\t| ^ |\n", "\t| ^ ^^ ^ |\n", "\t| ^ ^^ |\n", "\t|^^ ^ ^^ ^ | 0.10\n", "\t-------------------------------------------------------------\n", "\t1.00 6.90\n" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "using Winston\n", "\n", "scatter(x, y, \".\")\n", "\n", "xlabel(\"PetalLength\")\n", "ylabel(\"PetalWidth\")" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "png": "", "prompt_number": 38, "text": [ "FramedPlot(...)" ] } ], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "using Gadfly\n", "\n", "set_default_plot_size(20cm, 12cm)\n", "plot(iris, x = \"PetalLength\", y = \"PetalWidth\", color = \"Species\", Geom.point)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": 
[ { "html": [ "[Figure: Gadfly scatter plot of PetalLength (x-axis, 0 to 8) vs. PetalWidth (y-axis, 0.0 to 2.5), points colored by Species (versicolor, virginica, setosa)]\n" ], "metadata": {}, "output_type": "pyout", "png": "", "prompt_number": 39, "svg": [ "[Figure: static SVG rendering of the same scatter plot]\n" ], "text": [ "Plot(...)" ] } ], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# ML Algorithms\n", "## Unsupervised Learning" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# K-means Clustering\n", "\n", "using 
Clustering\n", "\n", "# 4x150 matrix, one column per observation (use matrix() on Julia v0.2)\n", "features = array(iris[:, 1:4])'\n", "result = kmeans(features, 3) # partition the observations into 3 clusters\n", "\n", "plot(iris, x = \"PetalLength\", y = \"PetalWidth\", color = result.assignments, Geom.point)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " " ] }, { "output_type": "stream", "stream": "stderr", "text": [ "Warning: using DataFrames.describe in module Main conflicts with an existing identifier.\n", "Warning: could not import StatsBase.bandwidth into Stat\n", "Warning: could not import StatsBase.kde into Stat\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " Iters objv objv-change | affected \n", "-------------------------------------------------------------\n", " 1 8.200215e+01 -6.279785e+01 | 2\n", " 2 8.108093e+01 -9.212131e-01 | 2\n", " 3 7.987358e+01 -1.207354e+00 | 2\n", " 4 7.934436e+01 -5.292157e-01 | 2\n", " 5 7.892131e+01 -4.230544e-01 | 2\n", " 6 7.885567e+01 -6.564390e-02 | 0\n", " 7 7.885567e+01 0.000000e+00 | 0\n", "K-means converged with 7 iterations (objv = 78.85566582597716)\n" ] }, { "html": [ "\n", " PetalLength\n", " 
\n", " 1.5\n", " 1.0\n", " 3.0\n", " 2.0\n", " 2.5\n", " Color\n", " PetalWidth\n", "\n" ], "metadata": {}, "output_type": "pyout", "png": "", "prompt_number": 40, "svg": [ "\n", " PetalLength\n", " 0\n", " 2\n", " 4\n", " 6\n", " 8\n", " 1.5\n", " 1.0\n", " 3.0\n", " 2.0\n", " 2.5\n", " 
\n", " Color\n", " 0.0\n", " 0.5\n", " 1.0\n", " 1.5\n", " 2.0\n", " 2.5\n", " PetalWidth\n", "\n" ], "text": [ "Plot(...)" ] } ], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "# Principal Component Analysis\n", "\n", "using MultivariateStats\n", "\n", "# Fit a PCA model that keeps at most 2 principal components\n", "pc = fit(PCA, features; maxoutdim = 2)\n", "# Project the observations onto those components\n", "reduced = transform(pc, features)\n", "@show size(reduced)\n", "\n", "plot(iris, x = reduced[1,:], y = reduced[2,:], color = \"Species\", Geom.point)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { 
"output_type": "stream", "stream": "stdout", "text": [ "size(reduced) => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "(2,150)\n" ] }, { "html": [ "\n", "\n", "\n", " \n", " x\n", " \n", " \n", " -14\n", " -12\n", " -10\n", " -8\n", " -6\n", " -4\n", " -2\n", " 0\n", " 2\n", " 4\n", " 6\n", " 8\n", " 10\n", " 12\n", " 14\n", " -12.0\n", " -11.5\n", " -11.0\n", " -10.5\n", " -10.0\n", " -9.5\n", " -9.0\n", " -8.5\n", " -8.0\n", " -7.5\n", " -7.0\n", " -6.5\n", " -6.0\n", " -5.5\n", " -5.0\n", " -4.5\n", " -4.0\n", " -3.5\n", " -3.0\n", " -2.5\n", " -2.0\n", " -1.5\n", " -1.0\n", " -0.5\n", " 0.0\n", " 0.5\n", " 1.0\n", " 1.5\n", " 2.0\n", " 2.5\n", " 3.0\n", " 3.5\n", " 4.0\n", " 4.5\n", " 5.0\n", " 5.5\n", " 6.0\n", " 6.5\n", " 7.0\n", " 7.5\n", " 8.0\n", " 8.5\n", " 9.0\n", " 9.5\n", " 10.0\n", " 10.5\n", " 11.0\n", " 11.5\n", " 12.0\n", " -20\n", " -10\n", " 0\n", " 10\n", " 20\n", " -12.0\n", " -11.5\n", " -11.0\n", " -10.5\n", " -10.0\n", " -9.5\n", " -9.0\n", " -8.5\n", " -8.0\n", " -7.5\n", " -7.0\n", " -6.5\n", " -6.0\n", " -5.5\n", " -5.0\n", " -4.5\n", " -4.0\n", " -3.5\n", " -3.0\n", " -2.5\n", " -2.0\n", " -1.5\n", " -1.0\n", " -0.5\n", " 0.0\n", " 0.5\n", " 1.0\n", " 1.5\n", " 2.0\n", " 2.5\n", " 3.0\n", " 3.5\n", " 4.0\n", " 4.5\n", " 5.0\n", " 5.5\n", " 6.0\n", " 6.5\n", " 7.0\n", " 7.5\n", " 8.0\n", " 8.5\n", " 9.0\n", " 9.5\n", " 10.0\n", " 10.5\n", " 11.0\n", " 11.5\n", " 12.0\n", " \n", " \n", " \n", " versicolor\n", " virginica\n", " setosa\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " Species\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " -7\n", " -6\n", " -5\n", " -4\n", " -3\n", " -2\n", " -1\n", " 0\n", " 1\n", " 2\n", " 3\n", " 4\n", " 5\n", " 6\n", " 7\n", " -6.0\n", " -5.8\n", " -5.6\n", " -5.4\n", " -5.2\n", " -5.0\n", " -4.8\n", " -4.6\n", " -4.4\n", " -4.2\n", " -4.0\n", " -3.8\n", " -3.6\n", " -3.4\n", " -3.2\n", " -3.0\n", " -2.8\n", " -2.6\n", " -2.4\n", " -2.2\n", " -2.0\n", " -1.8\n", " -1.6\n", " -1.4\n", " -1.2\n", " -1.0\n", " -0.8\n", " -0.6\n", " -0.4\n", " -0.2\n", " 0.0\n", " 0.2\n", " 0.4\n", " 0.6\n", " 0.8\n", " 1.0\n", " 1.2\n", " 1.4\n", " 1.6\n", " 1.8\n", " 2.0\n", " 2.2\n", " 2.4\n", " 2.6\n", " 2.8\n", " 3.0\n", " 3.2\n", " 3.4\n", " 3.6\n", " 3.8\n", " 4.0\n", " 4.2\n", " 4.4\n", " 4.6\n", " 4.8\n", " 5.0\n", " 5.2\n", " 5.4\n", " 5.6\n", " 5.8\n", " 6.0\n", " -6\n", " -3\n", " 0\n", " 3\n", " 6\n", " -6.0\n", " -5.5\n", " -5.0\n", " -4.5\n", " -4.0\n", " -3.5\n", " -3.0\n", " -2.5\n", " -2.0\n", " -1.5\n", " -1.0\n", " -0.5\n", " 0.0\n", " 0.5\n", " 1.0\n", " 1.5\n", " 2.0\n", " 2.5\n", " 3.0\n", " 3.5\n", " 4.0\n", " 4.5\n", " 5.0\n", " 5.5\n", " 6.0\n", " \n", " \n", " y\n", " \n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n" ], "metadata": {}, "output_type": "pyout", "png": "", "prompt_number": 41, "svg": [ "\n", "\n", "\n", " \n", " x\n", " \n", " \n", " -4\n", " -2\n", " 0\n", " 2\n", " 4\n", " \n", " \n", " \n", " versicolor\n", " virginica\n", " setosa\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " Species\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " -2\n", " -1\n", " 0\n", " 1\n", " 2\n", " y\n", "\n" ], "text": [ "Plot(...)" ] } ], "prompt_number": 41 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# ML Algorithms\n", "## Supervised Learning - Regression" ] }, { "cell_type": "code", "collapsed": false, "input": [ "using MultivariateStats\n", "\n", "# Generate a noisy linear system\n", "features = rand(1000, 3) # feature matrix\n", "coeffs = rand(3) # ground-truth weights\n", "targets = features * coeffs + 0.1 * randn(1000) # response with Gaussian noise (std 0.1)\n", "\n", "# Linear Least Squares Regression\n", "coeffs_llsq = llsq(features, targets; bias=false)\n", "\n", "# Ridge Regression\n", "coeffs_ridge = ridge(features, targets, 0.1; bias=false) # regularization coefficient = 0.1\n", "\n", 
"@show coeffs\n", "@show coeffs_llsq\n", "@show coeffs_ridge ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "coeffs => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "[0.909725259136879,0.09457886909449087,0.5497737690044144]\n", "coeffs_llsq => [0.9136892428062334,0.09038032773513839,0.5584636569649853]\n", "coeffs_ridge => [0.9131715345035267,0.09081871314618502,0.5583822928501584]\n" ] } ], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "# Cross Validation: K-Fold Example\n", "\n", "using MLBase, MultivariateStats\n", "\n", "n = length(targets)\n", "\n", "# Define training and error evaluation functions\n", "function training(inds)\n", " coeffs = ridge(features[inds, :], targets[inds], 0.1; bias=false)\n", " return coeffs\n", "end\n", "\n", "function error_evaluation(coeffs, inds)\n", " y = features[inds, :] * coeffs \n", " rms_error = sqrt(mean(abs2(targets[inds] .- y)))\n", " return rms_error\n", "end\n", "\n", "# Cross validate\n", "scores = cross_validate(\n", " inds -> training(inds),\n", " (coeffs, inds) -> error_evaluation(coeffs, inds),\n", " n, # total number of samples\n", " Kfold(n, 3)) # cross validation plan: 3-fold\n", "\n", "# Get the mean and std of scores\n", "@show scores\n", "@show mean_and_std(scores) ;" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "scores => " ] }, { "output_type": "stream", "stream": "stderr", "text": [ "Warning: using MLBase.transform in module Main conflicts with an existing identifier.\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "[0.09242034479832754,0.10521231254886307,0.10191139455606114]\n", "mean_and_std(scores) => (0.0998480173010839,0.006640915148151889)\n" ] } ], "prompt_number": 43 }, { "cell_type": "code", "collapsed": false, 
"input": [ "# Model Tuning: Grid Search\n", "\n", "using MLBase, MultivariateStats\n", "\n", "# Hold out 20% of records for testing\n", "n_test = int(length(targets) * 0.2)\n", "train_rows = shuffle([1:length(targets)] .> n_test)\n", "features_train, features_test = features[train_rows, :], features[!train_rows, :]\n", "targets_train, targets_test = targets[train_rows], targets[!train_rows]\n", "\n", "# Define estimation function\n", "function estfun(regcoef, bias)\n", " coeffs = ridge(features_train, targets_train, regcoef; bias=bias)\n", " return bias ? (coeffs[1:end-1], coeffs[end]) : (coeffs, 0.0)\n", "end\n", "\n", "# Define error evaluation function as mean squared deviation\n", "evalfun(coeffs) = msd(features_test * coeffs[1] + coeffs[2], targets_test)\n", "\n", "result = gridtune(estfun, evalfun,\n", " (\"regcoef\", [0.01, 0.1, 1.0]),\n", " (\"bias\", [true, false]);\n", " ord=Reverse, # smaller msd value indicates better model\n", " verbose=true) # show progress information\n", "\n", "best_model, best_config, best_score = result\n", "\n", "# Print results\n", "coeffs, bias = best_model\n", "println(\"Best model:\")\n", "println(\" coeffs = $(coeffs')\"),\n", "println(\" bias = $bias\")\n", "println(\"Best config: regcoef = $(best_config[1]), bias = $(best_config[2])\")\n", "println(\"Best score: $(best_score)\")" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "regcoef=0.01, bias=true] => 0.011858694727414574\n", "[regcoef=0.1, bias=true] => 0.011850442117052462\n", "[regcoef=1.0, bias=true] => 0.011787552735182201\n", "[regcoef=0.01, bias=false] => 0.011804804972495204\n", "[regcoef=0.1, bias=false] => 0.011801611222973227\n", "[regcoef=1.0, bias=false] => 0.011777881905009186\n", "Best model:\n", " coeffs = [0.9133644779181795 0.09101416771441581 0.5528567682069362]" ] }, { 
"output_type": "stream", "stream": "stdout", "text": [ "\n", " bias = 0.0\n", "Best config: regcoef = 1.0, bias = false\n", "Best score: 0.011777881905009186\n" ] } ], "prompt_number": 44 }, { "cell_type": "code", "collapsed": false, "input": [ "# Regression Tree\n", "\n", "using DecisionTree\n", "\n", "# Train model, make predictions on test records\n", "model = build_tree(targets_train, features_train)\n", "predictions = apply_tree(model, features_test)\n", "\n", "@show cor(targets_test, predictions)\n", "@show R2(targets_test, predictions)\n", "\n", "scatter(targets_test, predictions, \".\")\n", "xlabel(\"actual\"); ylabel(\"predicted\")" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "cor(targets_test,predictions) => " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "0.9062575173707162\n", "R2(targets_test,predictions) => 0.8095402909108701\n" ] }, { "metadata": {}, "output_type": "pyout", "png": "", "prompt_number": 45, "text": [ "FramedPlot(...)" ] } ], "prompt_number": 45 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# ML Algorithms\n", "## Supervised Learning - Classification" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Support Vector Machine\n", "\n", "using LIBSVM\n", "\n", "features = array(iris[:, 1:4])\n", "labels = array(iris[:Species])\n", "\n", "# Hold out 20% of records for testing\n", "n_test = int(length(labels) * 0.2)\n", "train_rows = shuffle([1:length(labels)] .> n_test)\n", "features_train, features_test = features[train_rows, :], features[!train_rows, :]\n", "labels_train, labels_test = labels[train_rows], labels[!train_rows]\n", "\n", "model = svmtrain(labels_train, features_train')\n", "(predictions, decision_values) = svmpredict(model, features_test')\n", "\n", "confusion_matrix(labels_test, predictions)" ], "language": "python", "metadata": { 
"slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 46, "text": [ "Classes: ASCIIString[\"setosa\",\"versicolor\",\"virginica\"]\n", "Matrix: 3x3 Array{Int64,2}:\n", " 15 0 0\n", " 0 9 1\n", " 0 0 5\n", "Accuracy: 0.9666666666666667\n", "Kappa: 0.9459459459459458" ] } ], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "# Random Forest\n", "\n", "using DecisionTree\n", "\n", "# Train forest using 2 random features per split and 10 trees\n", "model = build_forest(labels_train, features_train, 2, 10)\n", "predictions = apply_forest(model, features_test)\n", "\n", "# Pretty print of one tree in forest\n", "print_tree(model.trees[1])\n", "\n", "confusion_matrix(labels_test, predictions)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Feature 4, Threshold 1.0" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "L-> setosa : 25/25\n", "R-> Feature 1, Threshold 6.3\n", " L-> Feature 3, Threshold 4.8\n", " L-> versicolor : 25/25\n", " R-> Feature 2, Threshold 2.7\n", " L-> virginica : 2/2\n", " R-> Feature 4, Threshold 1.8\n", " L-> versicolor : 2/2\n", " R-> Feature 3, Threshold 4.9\n", " L-> Feature 1, Threshold 6.2\n", " L-> versicolor : 1/1\n", " R-> virginica : 1/1\n", " R-> virginica : 3/3\n", " R-> Feature 1, Threshold 6.9\n", " L-> Feature 3, Threshold 5.0\n", " L-> versicolor : 6/6\n", " R-> virginica : 10/10\n", " R-> virginica : 9/9\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 47, "text": [ "Classes: {\"setosa\",\"versicolor\",\"virginica\"}\n", "Matrix: 3x3 Array{Int64,2}:\n", " 15 0 0\n", " 0 8 2\n", " 0 0 5\n", "Accuracy: 0.9333333333333333\n", "Kappa: 0.8928571428571429" ] } ], "prompt_number": 47 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Other Statistics & ML 
Packages\n", "* [GLM](https://github.com/JuliaStats/GLM.jl) - Generalized Linear Models\n", "* [Orchestra](https://github.com/svs14/Orchestra.jl) - Heterogeneous ensemble learning package\n", "* [MCMC](https://github.com/JuliaStats/MCMC.jl) - Markov Chain Monte Carlo methods\n", "* [HypothesisTests](https://github.com/JuliaStats/HypothesisTests.jl) - Hypothesis testing toolkit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Resources\n", "### Documentation\n", "* [Manual](http://docs.julialang.org/en/latest/#manual)\n", "* [Standard Library](http://docs.julialang.org/en/latest/stdlib/base/)\n", "\n", "### GitHub Groups\n", "* [Julia Statistics](https://github.com/JuliaStats)\n", "* [JuliaOpt: Optimization-related projects](http://www.juliaopt.org/)\n", "\n", "### Discussion Forums / Mailing Lists (Google Groups)\n", "* [Julia-Users](https://groups.google.com/forum/#!forum/julia-users)\n", "* [Julia-Dev](https://groups.google.com/forum/#!forum/julia-dev)\n", "* [Julia-Stats](https://groups.google.com/forum/#!forum/julia-stats)\n", "\n", "### Blogs / Curations\n", "* [JuliaBloggers](http://www.juliabloggers.com/)\n", "* [Curated decibans of Julia](http://svaksha.github.io/Julia.jl/)\n", "\n", "### Crash Courses\n", "* [Julia by Example](http://www.scolvin.com/juliabyexample/)\n", "* [Learn Julia in Y Minutes](http://learnxinyminutes.com/docs/julia/)\n", "* [The Julia Express](http://bogumilkaminski.pl/files/julia_express.pdf)\n", "\n", "### Cheat Sheets\n", "* [For MATLAB, Python NumPy, R, and Julia](http://sebastianraschka.com/Articles/2014_matrix_cheatsheet.html)\n", "* [Julia and IJulia](http://math.mit.edu/~stevenj/Julia-cheatsheet.pdf)" ] } ], "metadata": {} } ] }