{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "09fe0040-1786-42c3-a866-447e283fa1ab", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "import sys\n", "\n", "!{sys.executable} -m pip install --quiet pandas numpy matplotlib jupyterlab_myst pygments\n", "\n", "# Import the libraries once they are installed\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "103f46e6", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "e8feb576", "metadata": { "tags": [] }, "source": [ "# Introduction to statistics and probability\n", "\n", "Statistics and probability theory are two closely related areas of mathematics that are highly relevant to data science. It is possible to work with data without deep knowledge of mathematics, but it still helps to know at least some basic concepts. This short introduction will help you get started." ] }, { "cell_type": "code", "execution_count": null, "id": "574d8e32-a00e-4a44-bc3f-d5691b3c9364", "metadata": { "attributes": { "classes": [ "seealso" ], "id": "" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "from IPython.display import HTML, display\n", "\n", "display(\n", " HTML(\n", " \"\"\"\n", "
\n", " \n", "
\n", "\"\"\"\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "a8fb9e1e", "metadata": { "tags": [ "output-scoll", "hide-input" ] }, "source": [ "execute:\n", " exclude_patterns:\n", " - 'assignments/*'\n", " - 'assignments/**/*'\n", " - 'deep-learning/*'\n", " - 'slides/*'\n", " - 'slides/**/*'\n", " - 'ml-fundamentals/classification/introduction-to-classification.md'\n", " - 'ml-fundamentals/build-a-web-app-to-use-a-machine-learning-model.md'\n", " - 'ml-fundamentals/regression/managing-data.md'\n", " - 'ml-fundamentals/parameter-optimization/loss-function.md'\n", " - 'ml-advanced/ensemble-learning/random-forest.md'\n", " - 'ml-advanced/clustering/introduction-to-clustering.md'\n", " - 'ml-advanced/clustering/k-means-clustering.md'\n", " - 'ml-advanced/unsupervised-learning.md'\n", " - 'data-science/data-visualization/visualization-distributions.md'" ] }, { "cell_type": "markdown", "id": "578cecfa", "metadata": { "tags": [] }, "source": [ "## Probability and random variables\n", "\n", "**Probability** is a number between 0 and 1 that expresses how likely an **event** is. It is defined as the number of favorable outcomes (those that lead to the event) divided by the total number of outcomes, given that all outcomes are equally probable. For example, when we roll a die, the probability that we get an even number is $3/6 = 0.5$.\n", "\n", "When we talk about events, we use **random variables**. For example, the random variable that represents the number obtained when rolling a die takes values from $1$ to $6$. The set of numbers from $1$ to $6$ is called the **sample space**. We can talk about the probability of a random variable taking a certain value, for example, $P(X=3)=1/6$.\n", "\n", "The random variable in the previous example is called **discrete** because it has a countable sample space, i.e. there are separate values that can be enumerated. There are cases when the sample space is a range of real numbers or the whole set of real numbers. 
Such variables are called **continuous**. A good example is the time at which a bus arrives.\n", "\n", "\n", "\n", "## Probability distribution\n", "\n", "In the case of discrete random variables, it is easy to describe the probability of each event by a function $P(X)$. For each value $s$ in the sample space $S$, it gives a number from $0$ to $1$, such that the values of $P(X=s)$ sum to $1$ over all $s$ in $S$.\n", "\n", "The best-known discrete distribution is the **uniform distribution**, in which the sample space has $N$ elements, each with an equal probability of $1/N$.\n", "\n", "It is more difficult to describe the probability distribution of a continuous variable, with values drawn from some interval $[a,b]$, or the whole set of real numbers $\\mathbb{R}$. Consider the case of bus arrival time. In fact, for each exact arrival time $t$, the probability of a bus arriving at exactly that time is $0$!" ] }, { "cell_type": "markdown", "id": "ef9ef875", "metadata": { "attributes": { "classes": [ "note" ], "id": "" } }, "source": [ ":::{note}\n", "Now you know that events with $0$ probability happen, and very often! At least every time a bus arrives!\n", ":::" ] }, { "cell_type": "markdown", "id": "ac87900d", "metadata": { "tags": [] }, "source": [ "We can only talk about the probability of a variable falling in a given interval of values, e.g. $P(t_1 \\le X < t_2)$. In this case, the probability distribution is described by a **probability density function** $p(x)$, such that\n", "\n", "$$P(t_1\\le X