{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Graduate Introduction to Probability and Statistics for Scientists and Engineers\n", "\n", "## [Philip B. Stark](http://www.stat.berkeley.edu/~stark), Department of Statistics, University of California, Berkeley\n", "\n", "## First offering: a 10-hour short course at University of Tokyo, August 2015\n", "\n", "## Software requirements\n", "+ Jupyter: http://continuum.io/downloads and Python 2 kernel for Jupyter; see https://ipython.org/install.html\n", "\n", "## Supplemental Texts\n", "+ Stark, P.B., 1997–2015. [_SticiGui: Statistical Tools for Internet and Classroom Instruction with a Graphical User Interface_](http://www.stat.berkeley.edu/~stark/SticiGui/index.htm).\n", "+ Stark, P.B., 1990–2010. Lecture notes for Nonparametrics, [Statistics 240](https://www.stat.berkeley.edu/~stark/Teach/S240/Notes/index.htm)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Index\n", "**These notes are in draft form, with large gaps.**\n", "I'm happy to hear about any errors, and I hope eventually to fill in some of the missing pieces.\n", "\n", "1. [Overview](overview.ipynb)\n", "1. [Introduction to Jupyter and Python](jupyter.ipynb)\n", "1. [Sets, Combinatorics, & Probability](prob.ipynb)\n", "1. [Theories of Probability](probTheory.ipynb)\n", "1. [Random Variables, Expectation, Random Vectors, and Stochastic Processes](rv.ipynb)\n", "1. [Probability Inequalities](ineq.ipynb)\n", "1. [Inference](inference.ipynb)\n", "1. [Confidence Sets](conf.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rough Syllabus for Tokyo Short Course\n", "\n", "## [Preamble: Introduction to Jupyter and Python](jupyter.ipynb)\n", "1. Jupyter notebook\n", " + Cells, markdown, MathJax\n", "1. Less Python than you need\n", "\n", "## [Lecture 1: Probability](prob.ipynb)\n", "1. What's the difference between Probability and Statistics?\n", "1. Counting and combinatorics\n", " + Sets: unions, intersections, partitions\n", " + De Morgan's Laws\n", " + The Inclusion-Exclusion principle\n", " + The Fundamental Rule of Counting\n", " + Combinations\n", " + Permutations\n", " + Strategies for counting\n", "\n", "2. Axiomatic Probability\n", " + Outcome space and events, events as sets\n", " + Kolmogorov's axioms (finite and countable)\n", " + Analogies between probability and area or mass\n", " + Consequences of the axioms\n", " - Probabilities of unions and intersections\n", " - Bounds on probabilities\n", " - Bonferroni's inequality\n", " - The inclusion-exclusion rule for probabilities\n", " + Conditional probability\n", " - The Multiplication Rule\n", " - Independence\n", " - Bayes Rule\n", "## Lecture 2: Probability, continued\n", "3. Theories of probability\n", " + Equally likely outcomes\n", " + Frequency Theory\n", " + Subjective Theory\n", " + Shortcomings of the theories\n", " + Rates versus probabilities\n", " + Measurement error\n", " + Where does probability come from in physical problems?\n", " + Making sense of geophysical probabilities\n", " - Earthquake probabilities\n", " - Probability of magnetic reversals\n", " - Probability that Earth is more than 5B years old\n", "4. 
Random variables.\n", " + Probability distributions of real-valued random variables\n", " + Cumulative distribution functions\n", " + Discrete random variables\n", " - Probability mass functions\n", " - The uniform distribution on a finite set\n", " - Bernoulli random variables\n", " - Random variables derived from the Bernoulli\n", " * Binomial random variables\n", " * Geometric\n", " * Negative binomial\n", " - Hypergeometric random variables\n", " - Poisson random variables: countably infinite outcome spaces\n", "\n", "## Lecture 3: Random variables, contd.\n", "5. Random variables, continued\n", " + Continuous and \"mixed\" random variables\n", " + Probability densities\n", " - The uniform distribution on an interval\n", " - The Gaussian distribution\n", " + The CDF of discrete, continuous, and mixed distributions\n", " + Distribution of measurement errors\n", " - The box model for random error\n", " - Systematic and stochastic error\n", "6. Independence of random variables\n", " + Events derived from random variables\n", " + Definitions of independence\n", " + Independence and \"informativeness\"\n", " + Examples of independent and dependent random variables\n", " + IID random variables\n", " + Exchangeability of random variables\n", "7. Marginal distributions\n", "8. Point processes\n", " + Poisson processes\n", " - Homogeneous and inhomogeneous Poisson processes\n", " - Spatially heterogeneous, temporally homogenous Poisson processes as a model for seismicity\n", " - The conditional distribution of Poisson processes given N\n", " + Marked point processes\n", " + Inter-arrival times and inter-arrival distributions\n", " + Branching processes\n", " - ETAS\n", "\n", "## Lecture 4: Expectation, Probability Inequalities, and Simulation\n", "9. Expectation\n", " + The Law of Large Numbers\n", " + The Expected Value\n", " - Expected value of a discrete univariate distribution\n", " * Special cases: Bernoulli, Binomial, Geometric, Hypergeometric, Poisson\n", " - Expected value of a continuous univariate distribution\n", " * Special cases: uniform, exponential, normal\n", " - Expected value of a multivariate distribution\n", " + Standard Error and Variance.\n", " - Discrete examples\n", " - Continuous examples\n", " - The square-root law\n", " - Standardization and Studentization\n", " - The Central Limit Theorem\n", " + The tail-sum formula for the expected value\n", " + Conditional expectation\n", " - The conditional expectation is a random variable\n", " - The expectation of the conditional expectation is the unconditional expectation\n", " + Useful probability inequalities\n", " - Markov's Inequality\n", " - Chebychev's Inequality\n", " - Hoeffding's Inequality\n", " - Jensen's inequality\n", "10. Simulation\n", " + Pseudo-random number generation\n", " - Importance of the PRNG. Period, DIEHARD\n", " + Assumptions\n", " + Uncertainties\n", " + Sampling distributions\n", "\n", "## Lecture 5: Testing\n", "11. 
Hypothesis tests\n", " + Null and alternative hypotheses, \"omnibus\" hypotheses\n", " + Type I and Type II errors\n", " + Significance level and power\n", " + Approximate, exact, and conservative tests\n", " + Families of tests\n", " + P-values\n", " - Estimating P-values by simulation\n", " + Test statistics\n", " - Selecting a test statistic\n", " - The null distribution of a test statistic\n", " - One-sided and two-sided tests\n", " + Null hypotheses involving actual, hypothetical, and counterfactual randomness\n", " + Multiplicity\n", " - Per-comparison error rate (PCER)\n", " - Familywise error rate (FWER)\n", " - The False Discovery Rate (FDR)\n", "\n", "## Lecture 6: Tests and Confidence sets\n", "12. Tests, continued\n", " + Parametric and nonparametric tests\n", " - The Kolmogorov-Smirnov test and the MDKW inequality\n", " - Example: Testing for uniformity\n", " - Conditional test for Poisson behavior\n", " + Permutation and randomization tests\n", " - Invariances of distributions\n", " - Exchangeability\n", " - The permutation distribution of test statistics\n", " - Approximating permutation distributions by simulation\n", " - The two-sample problem\n", " + Testing when there are nuisance parameters\n", "13. Confidence sets\n", " + Definition\n", " + Interpretation\n", " + Duality between hypothesis tests and confidence sets\n", " + Tests and confidence sets for Binomial p\n", " + Pivoting\n", " - Confidence sets for a normal mean\n", " * known variance\n", " * unknown variance; Student's t distribution\n", " + Approximate confidence intervals using the normal approximation\n", " - Empirical coverage\n", " - Failures\n", " + Nonparametric confidence bounds for the mean of a nonnegative population\n", " + Multiplicity\n", " - Simultaneous coverage\n", " - Selective coverage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rough Syllabus for complete 45-hour course\n", "\n", "---\n", "### Descriptive Statistics\n", "\n", "1. Summarizing data.\n", " 1. Types of data: categorical, ordinal, quantitative\n", " 1. Univariate data.\n", " 1. Measures of location and spread: mean, median, mode, quantiles, inter-quartile range, range, standard deviation, RMS\n", " 1. Markov's and Chebychev's inequalities for quantitative lists\n", " 1. Ranks and ordinal categorical data\n", " 1. Frequency tables and histograms\n", " 1. Bar charts\n", " 1. Multivariate data\n", " 1. Scatterplots\n", " 1. Measures of association: Pearson and Spearman correlation coefficients\n", " 1. Linear regression\n", " 1. The Least Squares principle\n", " 1. The Projection Theorem\n", " 1. The Normal Equations\n", " 1. Numerical solution of the normal equations\n", " 1. Numerical linear algebra is not the same as abstract linear algebra\n", " 1. Condition number\n", " 1. Do not invert matrices to solve linear systems: use backsubstitution or factorization\n", " 1. Errors in regression: RMS error of linear regression\n", " 1. Least Absolute Value regression\n", " 1. Principal components and approximation by subspaces: another application of the Projection Theorem\n", " 1. Clustering\n", " 1. Distance functions\n", " 1. Hierarchical methods, tree-based methods\n", " 1. Centroid methods: K-means\n", " 1. Density-based clustering: kernel methods, DBSCAN\n", "\n", "---\n", "### Probability\n", "\n", "1. Counting and combinatorics\n", " 1. Sets: unions, intersections, partitions\n", " 1. De Morgan's Laws\n", " 1. The Inclusion-Exclusion principle.\n", " 1. The Fundamental Rule of Counting\n", " 1. Combinations. 
Application (using the Inclusion-Exclusion Principle): counting derangements\n", " 1. Permutations\n", " 1. Strategies for complex counting problems\n", "\n", "1. Theories of probability\n", " 1. Equally likely outcomes\n", " 1. Frequency Theory\n", " 1. Subjective Theory\n", " 1. Shortcomings of the theories\n", "\n", "1. Axiomatic Probability\n", " 1. Outcome space and events, events as sets\n", " 1. Kolmogorov's axioms (finite and countable)\n", " 1. Analogies between probability and area or mass\n", " 1. Consequences of the axioms\n", " 1. Probabilities of unions and intersections\n", " 1. Bounds on probabilities\n", " 1. Bonferroni's inequality\n", " 1. The inclusion-exclusion rule for probabilities\n", " 1. Conditional probability\n", " 1. The Multiplication Rule\n", " 1. Independence\n", " 1. Bayes Rule\n", "\n", "1. Random variables.\n", " 1. Probability distributions\n", " 1. Cumulative distribution functions for real-valued random variables\n", " 1. Discrete random variables\n", " 1. Probability mass functions\n", " 1. The uniform distribution on a finite set\n", " 1. Bernoulli random variables\n", " 1. Random variables derived from the Bernoulli\n", " 1. Binomial random variables\n", " 1. Geometric\n", " 1. Negative binomial\n", " 1. Poisson random variables: countably infinite outcome spaces\n", " 1. Hypergeometric random variables\n", " 1. Examples of other discrete random variables\n", " 1. Continuous and \"mixed\" random variables\n", " 1. Probability densities\n", " 1. The uniform distribution on an interval\n", " 1. The exponential distribution and double-exponential distributions\n", " 1. The Gaussian distribution\n", " 1. The CDF of discrete, continuous, and mixed distributions\n", " 1. Survival functions and hazard functions\n", " 1. Counting processes\n", " 1. Joint distributions of collections of random variables, random vectors\n", " 1. The multivariate uniform distribution\n", " 1. The multivariate normal distribution\n", " 1. Independence of random variables\n", " 1. Events derived from random variables\n", " 1. Definitions of independence\n", " 1. Marginal distributions\n", " 1. Conditional distributions\n", " 1. The \"memoryless property\" of the exponential distribution\n", " 1. The Central Limit Theorem\n", " 1. Stochastic processes\n", " 1. Point processes\n", " 1. Intensity functions and conditional intensity functions\n", " 1. Poisson processes\n", " 1. Homogeneous and inhomogeneous Poisson processes\n", " 1. The conditional distribution of Poisson processes given N\n", " 1. Marked point processes\n", " 1. Inter-arrival times and inter-arrival distributions\n", " 1. The conditional distribution of a Poisson process\n", " 1. Random walks\n", " 1. Markov chains\n", " 1. Brownian motion\n", "\n", "1. Expectation\n", " 1. The Law of Large Numbers\n", " 1. The Expected Value\n", " 1. Expected value of a discrete univariate distribution\n", " 1. Special cases: Bernoulli, Binomial, Geometric, Hypergeometric, Poisson\n", " 1. Expected value of a continuous univariate distribution\n", " 1. Special cases: uniform, exponential, normal\n", " 1. (Aside: measurability, Lebesgue integration, and the CDF as a measure)\n", " 1. Expected value of a multivariate distribution\n", " 1. Expected values of functions of a random variable\n", " 1. Change-of-variables formulas for probability mass functions and densities\n", " 1. Standard Error and Variance.\n", " 1. Discrete examples\n", " 1. Continuous examples\n", " 1. The square-root law\n", " 1. 
The tail-sum formula for the expected value\n", " 1. Conditional expectation\n", " 1. The expectation of the conditional expectation is the unconditional expectation\n", " 1. Useful probability inequalities\n", " 1. Markov's Inequality\n", " 1. Chebychev's Inequality\n", " 1. Hoeffding's Inequality\n", "\n", "---\n", "### Sampling\n", "\n", "1. Empirical distributions\n", " 1. The ECDF for univariate distributions\n", " 1. The Kolmogorov-Smirnov statistic and the Massart-Dvoretzky-Kiefer-Wolfowitz inequality\n", " 1. Inference: inverting the MDKW inequality\n", " 1. Q-Q plots\n", "\n", "1. Random sampling.\n", " 1. Types of samples\n", " 1. Samples of convenience\n", " 1. Quota sampling\n", " 1. Systematic sampling\n", " 1. The importance of random sampling: stirring the soup.\n", " 1. Systematic random sampling\n", " 1. Random sampling with replacement\n", " 1. Simple random sampling\n", " 1. Stratified random sampling.\n", " 1. Cluster sampling\n", " 1. Multistage sampling\n", " 1. Weighted random samples\n", " 1. Sampling with probability proportional to size\n", " 1. Sampling frames\n", " 1. Nonresponse and missing data\n", " 1. Sampling bias\n", "\n", "1. Simulation\n", " 1. Pseudo-random number generators\n", " 1. Why the PRNG matters\n", " 1. Uniformity, period, independence\n", " 1. Assessing PRNGs. DIEHARD and other tests\n", " 1. Linear congruential PRNGs, including the Wichmann-Hill. Group-induced patterns\n", " 1. Statistically \"adequate\" PRNGs, including the Mersenne Twister\n", " 1. Cryptographic-quality PRNGs, including cryptographic hashes\n", " 1. Generating pseudorandom permutations\n", " 1. Taking pseudorandom samples\n", " 1. Simulating sampling distributions\n", "\n", "---\n", "### Estimation and Inference\n", "\n", "1. Estimating parameters using random samples\n", " 1. Sampling distributions\n", " 1. The Central Limit Theorem\n", " 1. Measures of accuracy: mean squared error, median absolute deviation, etc.\n", " 1. Maximum likelihood\n", " 1. Loss functions, risk, and decision theory\n", " 1. Minimax estimates\n", " 1. Bayes estimates\n", " 1. The Bootstrap\n", " 1. Shrinkage and regularization\n", "\n", "1. Inference\n", " 1. Hypothesis tests\n", " 1. Null and alternative hypotheses, \"omnibus\" hypotheses\n", " 1. Type I and Type II errors\n", " 1. Significance level and power\n", " 1. Approximate, exact, and conservative tests\n", " 1. Families of tests\n", " 1. P-values\n", " 1. Estimating P-values by simulation\n", " 1. Test statistics\n", " 1. Selecting a test statistic\n", " 1. The null distribution of a test statistic\n", " 1. One-sided and two-sided tests\n", " 1. Null hypotheses involving actual, hypothetical, and counterfactual randomness\n", " 1. Multiplicity\n", " 1. Per-comparison error rate\n", " 1. Familywise error rate\n", " 1. The False Discovery Rate\n", " 1. Approaches to testing\n", " 1. Parametric and nonparametric tests\n", " 1. Likelihood ratio tests\n", " 1. Permutation and randomization tests\n", " 1. Invariances of distributions\n", " 1. Exchangeability\n", " 1. Other symmetries\n", " 1. The permutation distribution of test statistics\n", " 1. Approximating permutation distributions by simulation\n", " 1. Confidence sets\n", " 1. Duality between hypothesis tests and confidence sets\n", " 1. Conditional tests, conditional and unconditional significance levels\n", "\n", "1. Tests of particular hypotheses\n", " 1. The Neyman model of a randomized experiment.\n", " 1. Strong and weak null hypotheses\n", " 1. 
Testing the strong null hypothesis\n", " 1. The distribution of a test statistic under the strong null\n", " 1. \"Interference\"\n", " 1. Blocking and other designs\n", " 1. Ensuring that the null hypothesis matches the experiment\n", " 1. Tests for Binomial p\n", " 1. The Sign test\n", " 1. The sign test for the median; tests for other quantiles\n", " 1. The sign test for a difference in medians\n", " 1. Tests based on the normal approximation\n", " 1. The Z statistic and the Z test\n", " 1. The t statistic and the t test\n", " 1. 2-sample problems, paired and unpaired tests\n", " 1. Tests based on ranks\n", " 1. The Wilcoxon test\n", " 1. The Wilcoxon signed rank test\n", " 1. Tests using actual values\n", " 1. Tests of association\n", " 1. The hypothesis of exchangeability\n", " 1. The Spearman test\n", " 1. The permutation distribution of the Pearson correlation\n", " 1. Tests of randomness and independence\n", " 1. The runs test\n", " 1. Tests of symmetry\n", " 1. Tests of exchangeability\n", " 1. Tests of spherical symmetry\n", " 1. The two-sample problem\n", " 1. Selecting the test statistic: what's the alternative?\n", " 1. Mean, sum, Student t\n", " 1. Smirnov statistic\n", " 1. Other choices\n", " 1. The permutation distribution of the test statistic\n", " 1. The two-sample problem for complex data\n", " 1. Test statistics\n", " 1. The k-sample problem\n", " 1. Stratified permutation tests\n", " 1. Fisher's Exact Test\n", " 1. Tests of homogeneity and ANOVA\n", " 1. The F statistic\n", " 1. The permutation distribution of the F statistic\n", " 1. Other statistics\n", " 1. Ordered alternatives\n", " 1. Tests based on the distribution function: The Kolmogorov-Smirnov Test\n", " 1. The universality of the null distribution for continuous variables\n", " 1. Using the K-S test to test for Poisson behavior\n", " 1. Sequential tests and Wald's SPRT\n", " 1. Random walks and Gambler's ruin\n", " 1. Wald's Theorem\n", "\n", "1. Confidence intervals for particular parameters\n", " 1. Confidence intervals for a shift in the Neyman model\n", " 1. Confidence intervals for Binomial p\n", " 1. Application: confidence bounds for P-values estimated by simulation\n", " 1. Application: intervals for quantiles by inverting binomial tests\n", " 1. Confidence intervals for a Normal mean using the Z and t distributions\n", " 1. Confidence intervals for the mean\n", " 1. Nonparametric confidence bounds for a population mean\n", " 1. The need for a priori bounds\n", " 1. Nonnegative random variables\n", " 1. Bounded random variables\n", " 1. Confidence sets for multivariate parameters\n", "\n", "1. Density estimation\n", " 1. Histogram estimates\n", " 1. Kernel estimates\n", " 1. Confidence bounds for monotone and shape-restricted densities\n", " 1. Lower confidence bounds on the number of modes\n", "\n", "1. Function estimation\n", " 1. Splines and penalized splines\n", " 1. Polynomial splines\n", " 1. Periodic splines\n", " 1. Smoothing splines as least-squares\n", " 1. B-splines\n", " 1. L1 splines\n", " 1. Constraints\n", " 1. Balls and ellipsoids\n", " 1. Smoothness and norms\n", " 1. Lipschitz conditions\n", " 1. Sobolev conditions\n", " 1. Cones\n", " 1. Nonnegativity\n", " 1. Shape restrictions\n", " 1. Monotonicity\n", " 1. Convexity\n", " 1. Star-shaped constraints\n", " 1. Sparsity and minimum L1 methods\n", "\n", "---\n", "### *Sketchy from here down*\n", "### Experiments\n", "\n", "1. Experiments versus observational studies\n", " 1. Controls and the Method of Comparison\n", " 1. 
Randomization\n", " 1. Blinding\n", "\n", "1. Experimental design\n", " 1. Blocking\n", " 1. Orthogonal designs\n", " 1. Latin hypercube design\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "//anaconda/lib/python2.7/site-packages/IPython/core/formatters.py:827: FormatterWarning: JSON expects JSONable list/dict containers, not JSON strings\n", " FormatterWarning)\n" ] }, { "data": { "application/json": { "Software versions": [ { "module": "Python", "version": "2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]" }, { "module": "IPython", "version": "3.2.1" }, { "module": "OS", "version": "Darwin 14.5.0 x86_64 i386 64bit" }, { "module": "scipy", "version": "0.14.0" }, { "module": "numpy", "version": "1.9.2" }, { "module": "pandas", "version": "0.14.1" }, { "module": "matplotlib", "version": "1.4.3" } ] }, "text/html": [ "
<table><tr><th>Software</th><th>Version</th></tr><tr><td>Python</td><td>2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]</td></tr><tr><td>IPython</td><td>3.2.1</td></tr><tr><td>OS</td><td>Darwin 14.5.0 x86_64 i386 64bit</td></tr><tr><td>scipy</td><td>0.14.0</td></tr><tr><td>numpy</td><td>1.9.2</td></tr><tr><td>pandas</td><td>0.14.1</td></tr><tr><td>matplotlib</td><td>1.4.3</td></tr><tr><td colspan='2'>Sun Aug 23 17:00:13 2015 PDT</td></tr></table>
" ], "text/latex": [ "\\begin{tabular}{|l|l|}\\hline\n", "{\\bf Software} & {\\bf Version} \\\\ \\hline\\hline\n", "Python & 2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)] \\\\ \\hline\n", "IPython & 3.2.1 \\\\ \\hline\n", "OS & Darwin 14.5.0 x86\\letterunderscore{}64 i386 64bit \\\\ \\hline\n", "scipy & 0.14.0 \\\\ \\hline\n", "numpy & 1.9.2 \\\\ \\hline\n", "pandas & 0.14.1 \\\\ \\hline\n", "matplotlib & 1.4.3 \\\\ \\hline\n", "\\hline \\multicolumn{2}{|l|}{Sun Aug 23 17:00:13 2015 PDT} \\\\ \\hline\n", "\\end{tabular}\n" ], "text/plain": [ "Software versions\n", "Python 2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]\n", "IPython 3.2.1\n", "OS Darwin 14.5.0 x86_64 i386 64bit\n", "scipy 0.14.0\n", "numpy 1.9.2\n", "pandas 0.14.1\n", "matplotlib 1.4.3\n", "Sun Aug 23 17:00:13 2015 PDT" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Version information\n", "%load_ext version_information\n", "%version_information scipy, numpy, pandas, matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }