{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistics with Julia from the ground up\n", "#### A [JuliaCon 2021](https://juliacon.org/2021/) workshop by [Yoni Nazarathy](https://yoninazarathy.com/)\n", "\n", "Many of the code examples for this workshop are adapted from [Statistics with Julia:\n", "Fundamentals for Data Science, Machine Learning and Artificial Intelligence by Yoni Nazarathy and Hayden Klok](https://statisticswithjulia.org/). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also related (and recommended in this JuliaCon): \n", "* [Dataframes tutorial](https://github.com/bkamins/JuliaCon2021-DataFrames-Tutorial) or [here](https://pretalx.com/juliacon2021/talk/FXZXMB/) by Bogumił Kamiński.\n", "* [Introduction to Bayesian Data Analysis](https://pretalx.com/juliacon2021/talk/J7BFBM/) by Kusti Skytén.\n", "* Dozens of other very exciting talks..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Table of Contents\n", "\n", "1. [Why Julia?](#why-julia)\n", "1. [What do you `mean`?](#what-do-you-mean)\n", "1. [Something `rand`.](#something-rand)\n", "1. [Do you still miss R? So Just `RCall`.](#just-rcall)\n", "1. [Some `Plots`.](#some-plots)\n", "1. [Your favorite `Distribution`.](#favorite-distribution)\n", "1. [We love `DataFrames`.](#love-dataframes)\n", "1. [Gotta have some basic inference.](#inference)\n", "1. [Linear models at our core.](#linear-models)\n", "1. [Basic Machine learning.](#basic-ml)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Before we start" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tutorial was developed and tested under Julia 1.6.0. It is best to run it with the `Project.toml` and `Manifest.toml` files present in the working directory of the notebook. 
It also uses the following data files: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "8-element Vector{String}:\n", " \"L1L2data.csv\"\n", " \"fertilizer.csv\"\n", " \"machine1.csv\"\n", " \"machine2.csv\"\n", " \"machine3.csv\"\n", " \"purchaseData.csv\"\n", " \"temperatures.csv\"\n", " \"weightHeight.csv\"" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "readdir(\"./data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will find all these files and this notebook in the [GitHub repo for this workshop](https://github.com/yoninazarathy/JuliaCon2021-StatisticsWithJuliaFromTheGroundUp). You can either \"clone\" the repo or download a zip file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a [Jupyter Notebook](https://jupyter.org/). Here is a [quick reference sheet](edureka.co/blog/wp-content/uploads/2018/10/Jupyter_Notebook_CheatSheet_Edureka.pdf). There are many other resources on the web for Jupyter. Many of them use Python rather than Julia, but Jupyter itself works the same way. BTW, the \"J\" in \"Jupyter\" is for \"Julia\"." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the packages we will use:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m\u001b[1m Activating\u001b[22m\u001b[39m environment at `~/git/mine/JuliaCon2021-StatisticsWithJuliaFromTheGroundUp/Project.toml`\n" ] } ], "source": [ "using Pkg\n", "Pkg.activate(\".\")\n", "Pkg.instantiate()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m\u001b[1m Status\u001b[22m\u001b[39m `~/git/mine/JuliaCon2021-StatisticsWithJuliaFromTheGroundUp/Project.toml`\n", " \u001b[90m [336ed68f] \u001b[39m\u001b[37mCSV v0.8.5\u001b[39m\n", " \u001b[90m [aaaa29a8] \u001b[39m\u001b[37mClustering v0.14.2\u001b[39m\n", " \u001b[90m [861a8166] \u001b[39m\u001b[37mCombinatorics v1.0.2\u001b[39m\n", " \u001b[90m [a93c6f00] \u001b[39m\u001b[37mDataFrames v1.2.0\u001b[39m\n", " \u001b[90m [31c24e10] \u001b[39m\u001b[37mDistributions v0.25.11\u001b[39m\n", " \u001b[90m [587475ba] \u001b[39m\u001b[37mFlux v0.12.5\u001b[39m\n", " \u001b[90m [38e38edf] \u001b[39m\u001b[37mGLM v1.5.1\u001b[39m\n", " \u001b[90m [09f84164] \u001b[39m\u001b[37mHypothesisTests v0.10.4\u001b[39m\n", " \u001b[90m [5ab0869b] \u001b[39m\u001b[37mKernelDensity v0.6.3\u001b[39m\n", " \u001b[90m [b964fa9f] \u001b[39m\u001b[37mLaTeXStrings v1.2.1\u001b[39m\n", " \u001b[90m [b4fcebef] \u001b[39m\u001b[37mLasso v0.6.2\u001b[39m\n", " \u001b[90m [eb30cadb] \u001b[39m\u001b[37mMLDatasets v0.5.7\u001b[39m\n", " \u001b[90m [442fdcdd] \u001b[39m\u001b[37mMeasures v0.3.1\u001b[39m\n", " \u001b[90m [dbeba491] \u001b[39m\u001b[37mMetalhead v0.5.3\u001b[39m\n", " \u001b[90m [6f286f6a] \u001b[39m\u001b[37mMultivariateStats v0.8.0\u001b[39m\n", " \u001b[90m [91a5bcdd] \u001b[39m\u001b[37mPlots v1.19.2\u001b[39m\n", " \u001b[90m [ce6b1742] 
\u001b[39m\u001b[37mRDatasets v0.7.5\u001b[39m\n", " \u001b[90m [f2b01f46] \u001b[39m\u001b[37mRoots v1.0.10\u001b[39m\n", " \u001b[90m [276daf66] \u001b[39m\u001b[37mSpecialFunctions v1.5.1\u001b[39m\n", " \u001b[90m [2913bbd2] \u001b[39m\u001b[37mStatsBase v0.33.8\u001b[39m\n", " \u001b[90m [f3b207a7] \u001b[39m\u001b[37mStatsPlots v0.14.25\u001b[39m\n" ] } ], "source": [ "Pkg.status()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [], "source": [ "using Random, Statistics, LinearAlgebra, Dates #Shipped with Julia\n", "using Distributions, StatsBase #Core statistics\n", "using CSV, DataFrames #Basic Data\n", "using Plots, StatsPlots, LaTeXStrings, Measures #Plotting and Output\n", "using HypothesisTests, KernelDensity, GLM, Lasso, Clustering, MultivariateStats #Statistical/ML methods\n", "using Flux, Metalhead #Deep learning \n", "using Combinatorics, SpecialFunctions, Roots #Mathematical misc.\n", "using RDatasets, MLDatasets #Example datasets\n", "#uncomment if using R: using RCall #Interface with R" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "fix_seed! 
(generic function with 1 method)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We run this before many examples for reproducibility\n", "fix_seed!() = Random.seed!(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n", "# Why Julia?\n", "[home](#home)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Julia Curve](img/julia_curve.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some ways to run Julia\n", "\n", "* REPL\n", " - As an application\n", " - From your shell \n", " - As part of an IDE\n", "* Jupyter (IJulia)\n", " - In your web browser\n", " - JupyterLab\n", "* Google Colab\n", "* Pluto\n", "* Visual Studio Code\n", "* Legacy: Atom (Juno)\n", "* JuliaHub\n", "* In RMarkdown with IJulia (e.g. in RStudio)\n", "* ... " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key resources (My favorites)\n", "\n", "* Main Julia Page: https://julialang.org/\n", "* Docs: https://docs.julialang.org/\n", "* Julia Express: https://github.com/bkamins/The-Julia-Express \n", "* Think Julia: https://www.oreilly.com/library/view/think-julia/9781492045021/\n", "* MIT Course, computational thinking: https://computationalthinking.mit.edu/Spring21/ \n", "* A University of Queensland Course: https://courses.smp.uq.edu.au/MATH2504/\n", "* Statistics with Julia: https://statisticswithjulia.org/ (use the image gallery) \n", "* Package documentation: Searching for the package, e.g. `Plots.jl`, typically gets you to GitHub. From there find the docs.\n", "* Julia Discourse: https://discourse.julialang.org/\n", "* Julia Slack: https://julialang.org/slack/ \n", "* Your local Julia \"club\": E.g. 
in my area: https://www.meetup.com/en-AU/brisbane-julia-language-meetup/ \n", "* YouTube...\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n", "# What do you `mean`?\n", "[home](#home)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "5-element Vector{Float64}:\n", " 0.6791074260357777\n", " 0.8284134829000359\n", " -0.3530074003005963\n", " -0.13485387193052173\n", " 0.5866170746331097" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fix_seed!()\n", "data = rand(Normal(),5)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = length(data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(data)/n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "+(data...)/n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "? 
+" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "my_sum" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "This is a function that takes in data and returns its sum.\n", "\"\"\"\n", "function my_sum(data)\n", " s = 0.0\n", " for d in data\n", " s += d\n", " end\n", " return s\n", "end" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: \u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1my\u001b[22m\u001b[0m\u001b[1m_\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1mu\u001b[22m\u001b[0m\u001b[1mm\u001b[22m\n", "\n" ] }, { "data": { "text/latex": [ "This is a function that takes in data and returns its sum.\n", "\n" ], "text/markdown": [ "This is a function that takes in data and returns its sum.\n" ], "text/plain": [ " This is a function that takes in data and returns its sum." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "? 
my_sum" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_sum(data)/length(data)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean(data)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data'*ones(n)/n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dot(data,ones(n))/n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "? 
dot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Doing it a little differently with the \"running mean\" formula" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\overline{X}_i = \\frac{1}{i} X_i + \\frac{i-1}{i} \\overline{X}_{i-1}.\n", "$$" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.32125534226756114" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mn = 0\n", "for i in 1:length(data)\n", " global mn = (1/i)*data[i] + (i-1)/i*mn\n", "end\n", "mn" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.32125534226756114" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mn = 0\n", "for (i,d) in enumerate(data)\n", " global mn = (1/i)*d + (i-1)/i*mn #Note: `global` isn't needed here in Jupyter or in the REPL (Julia 1.5+), but it is needed in a script file.\n", "end\n", "mn" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "my_mean (generic function with 1 method)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function my_mean(data)\n", " mn = 0\n", " for (i,d) in enumerate(data)\n", " mn = (1/i)*d + (i-1)/i*mn\n", " end\n", " return mn\n", "end" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.32125534226756114" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_mean(data)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "5-element Vector{ComplexF64}:\n", " 0.6791074260357777 + 0.29733585084941616im\n", " 0.8284134829000359 + 0.06494754854834232im\n", " -0.3530074003005963 - 
0.10901738508171745im\n", " -0.13485387193052173 - 0.514210390833322im\n", " 0.5866170746331097 + 1.5743302021369892im" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fix_seed!()\n", "data = rand(Normal(),n) + im*rand(Normal(),n)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.3212553422675611 + 0.2626771651239416im" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "methods(mean)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "mean(A::AbstractArray; dims) in Statistics at /Applications/Julia-1.6.app/Contents/Resources/julia/share/julia/stdlib/v1.6/Statistics/src/Statistics.jl:164" ], "text/plain": [ "mean(A::AbstractArray; dims) in Statistics at /Applications/Julia-1.6.app/Contents/Resources/julia/share/julia/stdlib/v1.6/Statistics/src/Statistics.jl:164" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which mean(data)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "# 1 method for generic function my_mean:
777 rows × 5 columns
| | Year | Month | Day | Brisbane | GoldCoast |
---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64 | Float64 |
1 | 2015 | 1 | 1 | 31.3 | 30.9 |
2 | 2015 | 1 | 2 | 30.5 | 30.1 |
3 | 2015 | 1 | 3 | 28.9 | 30.1 |
4 | 2015 | 1 | 4 | 30.2 | 30.1 |
5 | 2015 | 1 | 5 | 28.1 | 28.0 |
6 | 2015 | 1 | 6 | 29.5 | 29.3 |
7 | 2015 | 1 | 7 | 27.1 | 26.4 |
8 | 2015 | 1 | 8 | 28.4 | 27.7 |
9 | 2015 | 1 | 9 | 29.5 | 29.6 |
10 | 2015 | 1 | 10 | 30.2 | 29.5 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |