{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# BIOSTAT M280: Statistical Computing\n", "\n", "* Mon 12pm-1:50pm @ CHS 43-105A \n", " Wed 1pm-1:50pm @ CHS 43-105A \n", "* Instructor: Dr. Hua Zhou, \n", "* Multi-listed as BIOSTAT M280, BIOMATH M280, and STAT M230 \n", "* For Biostatistics doctoral students, this course satisfies the requirement for BIOSTAT 257." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What is statistics?\n", "\n", "* Statistics, the science of *data analysis*, is the applied mathematics of the 21st century. \n", "\n", "* People (scientists, government, health professionals, companies) collect data in order to answer certain questions. A statistician's job is to help them extract knowledge and insights from data. \n", "\n", "* Must-read for (bio)statistics students: \n", " - [_50 years of data science_](http://hua-zhou.github.io/teaching/biostatm280-2018spring/readings/Donoho15FiftyYearsDataScience.pdf), by David Donoho.\n", "\n", "* If existing software tools readily solve the problem, all the better. \n", "\n", "* Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming). \n", "\n", "* This entails at least two essential skills: **programming** and fundamental knowledge of **algorithms**. \n", "\n", "* Two examples: How Gauss became famous and Marc Coram deciphering a jail message." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# How Gauss became famous\n", "\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 1801\n", "\n", "* **Dr. 
Carl Friedrich Gauss**, 24; proved the [Fundamental Theorem of Algebra](https://en.wikipedia.org/wiki/Fundamental_theorem_of_algebra); wrote the book [_Disquisitiones Arithmeticae_](https://en.wikipedia.org/wiki/Disquisitiones_Arithmeticae), which is still being studied today.\n", "\n", "* Jan 1-Feb 11 (41 days), astronomer Piazzi observed [Ceres (a dwarf planet)](https://en.wikipedia.org/wiki/Ceres_(dwarf_planet)), which was then lost behind the Sun.\n", "\n", "* Aug-Sep, futile search by top astronomers; Laplace claimed the problem was unsolvable.\n", "\n", "* Oct-Nov, Gauss did calculations by the **method of least squares**. \n", "\n", "* Dec 31, astronomer von Zach re-located Ceres according to Gauss' calculation." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Aftermath\n", "\n", "* 1802, [_Summarische Übersicht der zur Bestimmung der Bahnen der beiden neuen Hauptplaneten angewandten Methoden_](https://books.google.com/books?id=ZPtICAAAQBAJ&pg=PT65&lpg=PT65&dq=Summarische+Übersicht+der+zur+Bestimmung+der+Bahnen+der+beiden+neuen+Hauptplaneten+angewandten+Methoden&source=bl&ots=Nr9xIdDDHx&sig=wAUfDTHZqoDo3WKJvPnE-2793QA&hl=en&sa=X&ved=0ahUKEwij157U2P7SAhWiv1QKHa3zAvoQ6AEIITAB#v=onepage&q=Summarische%20Übersicht%20der%20zur%20Bestimmung%20der%20Bahnen%20der%20beiden%20neuen%20Hauptplaneten%20angewandten%20Methoden&f=false) (Summary survey of the methods used for the determination of the orbits of the two new main planets), considered the origin of **linear algebra**.\n", "\n", "* 1807, **Professor of Astronomy** and (the first) Director of Göttingen Observatory for the remainder of his life.\n", "\n", "* 1809, [_Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium_](http://www.cambridge.org/us/academic/subjects/mathematics/historical-mathematical-texts/theoria-motus-corporum-coelestium-sectionibus-conicis-solem-ambientium?format=PB&isbn=9781108143110) (Theory of motion of the celestial bodies moving in 
conic sections around the Sun); birth of the **Gaussian (normal) distribution**, as an attempt to rationalize the method of least squares.\n", "\n", "* 1810, Laplace consolidated the importance of the Gaussian distribution by proving the central limit theorem.\n", "\n", "* 1829, **Gauss-Markov Theorem**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## For more history\n", "\n", "* Webpage: [The Discovery of Ceres](http://www.keplersdiscovery.com/Asteroid.html)\n", "\n", "* Article: [_The Discovery of Ceres: How Gauss Became Famous_](https://www.jstor.org/stable/2690592) by Teets and Whitehead (1999). \n", "\n", "* Stephen Stigler gives a more comprehensive account of the origin of the method of least squares in his book [_The History of Statistics_](http://www.hup.harvard.edu/catalog.php?isbn=9780674403413)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Lessons\n", "\n", "* Motivated by real data and a real problem (data science!).\n", "\n", "* Heuristic solution first: method of least squares.\n", "\n", "* **Algorithm development**: linear algebra, Gaussian elimination, FFT (fast Fourier transform). \n", "\n", "* Solution readily verifiable: Ceres was re-discovered.\n", "\n", "* Theoretical justification (Gaussian distribution, Gauss-Markov theorem) came much later." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Marc Coram and a jail message\n", "\n", "\n", "\n", "* A consulting project by Marc Coram (then a graduate student in statistics at Stanford); the customer was a professor of political science who wanted to understand a cryptic message circulating in a state prison.\n", "* Marc modeled the letter sequence by a Markov chain ($26 \\times 26$ transition matrix) and estimated transition probabilities from _War and Peace_.\n", "* Each candidate mapping $\\sigma$ from message symbols to letters then yields a likelihood $f(\\sigma)$ of the symbol sequence.\n", "* Find the $\\sigma$ that maximizes $f$. The search space has at least $26! \\approx 4.0329 \\times 10^{26}$ mappings. Combinatorial optimization -- hard!\n", "* **Metropolis algorithm**: At each iteration:\n", " - generate a new $\\sigma'$ by random transposition of two letters\n", " - accept $\\sigma'$ with probability $\\min \\left\\{\\frac{f(\\sigma')}{f(\\sigma)}, 1\\right\\}$; otherwise stay at $\\sigma$\n", "\n", "## Lessons\n", "\n", "* Motivated by a real problem (data science!).\n", "\n", "* Solution readily verifiable: we can read it!\n", "\n", "* **Algorithm development**: The Metropolis sampler is one of the top 10 algorithms of the 20th century.\n", "\n", "* See the article [_The Markov chain Monte Carlo revolution_](http://www.ams.org/journals/bull/2009-46-02/S0273-0979-08-01238-X/) by Persi Diaconis (2009) for more details.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What is this course about?\n", "\n", "* **Not** a course on statistical packages. It does not answer questions such as _How to fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?_\n", "\n", "* **Not** a pure programming course, although programming is important and we do homework in Julia. \n", "The new BIOSTAT 203A (Data Management) in fall quarter focuses on programming in R and SAS.\n", "\n", "* **Not** a course on data science. 
The new [BIOSTAT 203B (Introduction to Data Science)](http://hua-zhou.github.io/teaching/biostatm280-2018winter/schedule.html) in winter quarter focuses on software tools for data scientists.\n", "\n", "* This course focuses on **algorithms**, or numerical methods in statistics. \n", "\n", "* To quote [James Gentle](https://books.google.com/books?id=Pbz3D7Tg5eoC&pg=PR9&lpg=PR9&dq=The+form+of+a+mathematical+expression+and+the+way+the+expression+should+be+evaluated+in+actual+practice+may+be+quite+different.&source=bl&ots=MYABVAwDtC&sig=MGuPY_171sZFZLMCuewlOjV-Cl4&hl=en&sa=X&ved=0ahUKEwjkv_u34v7SAhUJrlQKHfT6DjAQ6AEIITAB#v=onepage&q=The%20form%20of%20a%20mathematical%20expression%20and%20the%20way%20the%20expression%20should%20be%20evaluated%20in%20actual%20practice%20may%20be%20quite%20different.&f=false)\n", "> The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.\n", "\n", "* For a common numerical task in statistics, say solving the least squares problem \n", "$$\n", " \\widehat \\beta = ({\\bf X}^T {\\bf X})^{-1} {\\bf X}^T {\\bf y},\n", "$$\n", "we need to know which methods/algorithms are out there and what their advantages and disadvantages are. You will **fail** this course if you use\n", "```julia\n", "inv(X'X) * X' * y\n", "```\n", "Using `X \\ y` (or `lm(y ~ X)` in R) is correct but not the purpose of this course. We want to understand what the computer is doing when calling `X \\ y`.\n", "\n", "* For biostat students, this course satisfies the requirement of BIOSTAT 257 in the new curriculum. Ask Ms. Roxy Naranjo for the paperwork." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Course logistics\n", "\n", "* Course webpage: [http://hua-zhou.github.io/teaching/biostatm280-2018spring/](http://hua-zhou.github.io/teaching/biostatm280-2018spring/).\n", "\n", "* [Syllabus](http://hua-zhou.github.io/teaching/biostatm280-2018spring/syllabus.html).\n", "\n", "* Check the [Schedule](http://hua-zhou.github.io/teaching/biostatm280-2018spring/schedule.html) and [Announcements](http://hua-zhou.github.io/teaching/biostatm280-2018spring/announcement.html) pages frequently. \n", "\n", "* Slides will be posted before each lecture." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "livereveal": { "scroll": true, "start_slideshow_at": "selected" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "213px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }