{ "cells": [ { "cell_type": "markdown", "id": "e6af5417", "metadata": {}, "source": [ "##### --- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "1cd89b70", "metadata": {}, "source": [ "

Lecture 4.2 (Inferential Statistics)


\n", "\"Open" ] }, { "cell_type": "markdown", "id": "bb37886f", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "markdown", "id": "d4a5143b", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "8ca0e576", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "5e5625e4", "metadata": {}, "outputs": [], "source": [ "# Unlike the other modules, we have been working so far, you have to download and install...\n", "# To install this library in Jupyter notebook\n", "import sys\n", "!{sys.executable} -m pip install -q --upgrade pip\n", "!{sys.executable} -m pip install -q statistics statsmodels scipy" ] }, { "cell_type": "code", "execution_count": null, "id": "e4a28038", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "import math as m\n", "import statistics\n", "import scipy.stats as st\n", "import statsmodels as sm" ] }, { "cell_type": "markdown", "id": "3de03bdf", "metadata": {}, "source": [ "## Learning agenda of this notebook\n", "\n", "**Section 1: (Overview of Probability for Machine Learning)**\n", "1. Overview of Probability\n", " - Probability Basics\n", " - Joint Probability\n", " - Combinatorics\n", " - The Law of Large Numbers\n", " - Practical Implementation in Python\n", "2. Conditional Probability, Bayes’ Theorem and Naive Bayes' Classifier\n", "3. How Probability Relates with Statistics\n", "4. Probability Distributions\n", " - Continuous Probability Distributions (Normal, Standard Normal, Log Normal, Exponential, Student’s T, Chi-square)\n", " - Discrete Probability Distributions (Binomial, Bernoulli, Multinomial, Poisson)\n", "5. Central Limit Theorem\n", "\n", "**Section 2: (Hypothesis Testing)**\n", "1. 
Overview of Hypothesis Testing\n", " - How to formulate a Hypothesis?\n", " - Types of Hypothesis Tests.\n", " - Related Terminologies (Rejection region, significance level, confidence intervals, test scores, p-value, and error types)\n", "2. Z-Test vs T-Test\n", "3. Student's Single Sample T-Test\n", "4. Student's Two Samples T-Test\n", " - Independent Samples\n", " - Paired Samples\n", "\n", "**Section 3: (Journey from Variance -> Covariance -> Correlation -> Regression)**\n", "1. A Recap\n", " - Variance and Standard Deviation\n", " - Covariance and Covariance Matrix\n", " - Correlation and Correlation Matrix\n", " - Regression\n", "2. Regression Analysis\n", "3. Linear Regression\n", " - Fitting a Line using Gradient Descent\n", " - Fitting a Line using Linear Least Squares (with one feature)" ] }, { "cell_type": "code", "execution_count": null, "id": "88e73197", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5813177d", "metadata": {}, "source": [ "# Section 1: (Overview of Probability for Machine Learning) " ] }, { "cell_type": "markdown", "id": "4a4dd136", "metadata": {}, "source": [ "## 1. A Recap of Basic Probability Concepts" ] }, { "cell_type": "markdown", "id": "09ecad8e", "metadata": {}, "source": [ "\n", "\n", "### a. Probability Basics\n", "- Probability theory is a branch of mathematics concerned with the analysis of random phenomena, and defines the `likelihood of occurrence of an event`. \n", "- When all outcomes are equally likely, the probability of an event is the ratio of the number of favorable outcomes to the total number of outcomes in the sample space.\n", "\n", "$$ P(\\text{event}) = \\frac{\\text{# of outcomes of event}}{\\text{# of outcomes in }\\Omega} $$\n", "\n", "- The probability of an event always lies between 0 and 1, because the number of favorable outcomes can never exceed the total number of outcomes. 
\n", "- To understand Probability, we normally start by predicting the outcomes for the `tossing of coins`, `rolling of dice`, or `drawing a card from a pack of playing cards`. Later we apply the same concepts in the domains of Artificial Intelligence and Machine Learning.\n", "\n", "- **Independent Event** is an event that does not impact the other. For example, rolling a die twice and getting two sixes in a sequence (1/6 times 1/6)\n", "- **Dependent event** is an event that impacts the other. For example, drawing two cards from a pack of playing cards without replacement and getting two queens in a sequence. (4/52 times 3/51)\n", "\n", " " ] }, { "cell_type": "markdown", "id": "bc967892", "metadata": {}, "source": [ "- **Example:**\n", "If we're only flipping the coin once, then there are only two possible outcomes in the sample space $\\Omega$: it will either be H or T (using set notation, we could write this as $\\Omega$ = {H, T}).\n", "Therefore: $$ P(H) = \\frac{1}{2} = 0.5 $$\n", "Equally: $$ P(T) = \\frac{1}{2} = 0.5 $$" ] }, { "cell_type": "markdown", "id": "f77636cb", "metadata": {}, "source": [ "- **Example:** Consider drawing a single card from a standard deck of 52 playing cards. In this case, the number of possible outcomes in the sample space $\\Omega$ is 52. 
\n", "There is only one `ace of spades` in the deck, so the probability of drawing it is: $$ P(\\text{ace of spades}) = \\frac{1}{52} \\approx 0.019 $$\n", "In contrast there are four `aces`, so the probability of drawing an ace is: $$ P(\\text{ace}) = \\frac{4}{52} \\approx 0.077 $$" ] }, { "cell_type": "markdown", "id": "ecb55d51", "metadata": {}, "source": [ "- **More Examples:**\n", "$$ P(\\text{spade}) = \\frac{13}{52} = 0.25 $$\n", "$$ P(\\text{ace OR spade}) = \\frac{16}{52} \\approx 0.308 $$\n", "$$ P(\\text{card}) = \\frac{52}{52} = 1 $$\n", "$$ P(\\text{apple}) = \\frac{0}{52} = 0 $$" ] }, { "cell_type": "code", "execution_count": null, "id": "09942d9e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5069b09a", "metadata": {}, "source": [ "### b. Joint Probability (Independent Events)\n", "- Joint probability is the probability of two events happening at the same time.\n", "- If two events A and B are independent, and we want to find the probability of the intersection of these events (i.e., probability of both A and B happening together), we can use the \"Probability rule of Product\":\n", "#####
`P (A ∩ B) = P(A) . P(B)`
\n", "\n", "- Some real-life examples of independent events are: Scoring good marks in an exam has no effect on what the neighbors are up to. Similarly, taking a cab to market has no effect on finding your favorite movie on YouTube.\n", "\n", "**Note:** To calculate the probability of intersection of dependent events, we use conditional probability (later).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8275e462", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e8d031f2", "metadata": {}, "source": [ "- **Example:** The probability of getting `two consecutive heads` from two tosses ($\\Omega$ = {HH, HT, TH, TT}) is: $$ P(\\text{HH}) = \\frac{1}{4} = 0.25 $$\n", "- **Example:** The probability of getting `three consecutive heads` from three tosses ($\\Omega$ = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}) is: $$ P(\\text{HHH}) = \\frac{1}{8} = 0.125 $$\n", "- **Example:** The probability of getting `exactly two heads` from three tosses ($\\Omega$ = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}) is: $$ P(\\text{event}) = \\frac{3}{8} = 0.375 $$\n", "- **Example:** The probability of getting `at least two heads` from three tosses ($\\Omega$ = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}) is: $$ P(\\text{event}) = \\frac{4}{8} = 0.5 $$" ] }, { "cell_type": "code", "execution_count": null, "id": "366cc012", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9dd3a69e", "metadata": {}, "source": [ "- **Example:** To calculate the probability of throwing three consecutive heads from three tosses. 
We can take the product of the probabilities of getting a single head from a single toss and getting two heads from two tosses: $$ P(\\text{HHH}) = P(\\text{H}) \\times P(\\text{HH}) = \\frac{1}{2} \\times \\frac{1}{4} = \\frac{1}{8} \\approx 0.125 $$" ] }, { "cell_type": "markdown", "id": "2918cd63", "metadata": {}, "source": [ "- **Example:** So to calculate the probability of throwing five consecutive heads from five tosses, without creating the sample set of 32 possible events. We take the product of probabilities we've already calculated: $$ P(\\text{HHHHH}) = P(\\text{HH}) \\times P(\\text{HHH}) = \\frac{1}{4} \\times \\frac{1}{8} = \\frac{1}{32} \\approx 0.031 $$\n", "- **Example:** Similarly to calculate the probability of throwing 10 consecutive heads from ten tosses, without creating the sample set of 1024 possible events. We take the product of probabilities we've already calculated: $$ P(\\text{HHHHHHHHHH}) = P(\\text{HHHHH}) \\times P(\\text{HHHHH}) = \\frac{1}{32} \\times \\frac{1}{32} = \\frac{1}{1024} \\approx 0.000976 $$" ] }, { "cell_type": "markdown", "id": "74a38301", "metadata": {}, "source": [ ">**What is the probability of getting 5 consecutive heads from 100 tosses?**\n", "> - I know the total sample space will be 2^100\n", "> - Can you tell me the count of favourable outcomes?" ] }, { "cell_type": "code", "execution_count": null, "id": "f6cd8d67", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7aeda477", "metadata": { "id": "w-wlHpI05ChM" }, "source": [ "### c. Combinatorics" ] }, { "cell_type": "markdown", "id": "162c7da8", "metadata": { "id": "1ckSVU3p5ChM" }, "source": [ "- Combinatorics is a field of mathematics devoted to counting that can be helpful to studying probabilities. We can use `factorials` (e.g., $4! 
= 4 \\times 3 \\times 2 \\times 1 = 24$), which feature prominently in combinatorics, to calculate probabilities instead of painstakingly determining all of the members of the sample space $\\Omega$ and counting subsets within $\\Omega$. " ] }, { "cell_type": "markdown", "id": "a2c82dca", "metadata": { "id": "zYL4ODP75ChN" }, "source": [ "- More specifically, we can calculate the number of outcomes of an event using the \"number of combinations\" equation: $$ {n \\choose k} = \\frac{n!}{k!(n - k)!} $$" ] }, { "cell_type": "markdown", "id": "95493e03", "metadata": { "id": "kQhpdNEL5ChN" }, "source": [ "- The left-hand side of the equation is read \"$n$ choose $k$\". Consider three coin flips, $n = 3$ and if we're interested in the number of ways to get exactly two heads (may not be consecutive), $k = 2$. We would read this as \"3 choose 2\" and calculate it as:\n", "$$ {n \\choose k} = {3 \\choose 2} = \\frac{3!}{2!(3 - 2)!} = \\frac{3!}{(2!)(1!)} = \\frac{3 \\times 2 \\times 1}{(2 \\times 1)(1)} = \\frac{6}{(2)(1)} = \\frac{6}{2} = 3 $$" ] }, { "cell_type": "markdown", "id": "fd08a5c8", "metadata": { "id": "PoPfNW275ChN" }, "source": [ "This provide us with the numerator for event-probability equation from above: $$ P(\\text{event}) = \\frac{\\text{# of outcomes of event}}{\\text{# of outcomes in }\\Omega} $$" ] }, { "cell_type": "markdown", "id": "01e25d39", "metadata": { "id": "ThoyCpl35ChO" }, "source": [ "In the case of coin-flipping (or any binary process with equally probable outcomes), the denominator can be calculated with $2^n$ (where $n$ is again the number of coin flips), so: $$ \\frac{\\text{# of outcomes of event}}{\\text{# of outcomes in }\\Omega} = \\frac{3}{2^n} = \\frac{3}{2^3} = \\frac{3}{8} = 0.375 $$" ] }, { "cell_type": "markdown", "id": "acba5fa4", "metadata": { "id": "_ZzjXjHy5ChO" }, "source": [ "**Example**: Use $n \\choose k$ to calculate the probability of throwing exactly three heads in five coin tosses.\n", "\n", "$$ {n \\choose k} = {5 
\\choose 3} = \\frac{5!}{3!(5 - 3)!} = \\frac{5!}{(3!)(2!)} = \\frac{5 \\times 4 \\times 3 \\times 2 \\times 1}{(3 \\times 2 \\times 1)(2 \\times 1)} = \\frac{120}{(6)(2)} = \\frac{120}{12} = 10 $$\n", "\n", "$$P = \\frac{10}{2^n} = \\frac{10}{2^5} = \\frac{10}{32} = 0.3125 $$" ] }, { "cell_type": "code", "execution_count": null, "id": "2cb2d728", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c1b4d20e", "metadata": {}, "source": [ "### d. Practice above Concepts with Python" ] }, { "cell_type": "code", "execution_count": null, "id": "04e5c055", "metadata": { "id": "MgdCyK805ChP" }, "outputs": [], "source": [ "def prob_calc(n, k):\n", "    # Probability of getting exactly k heads in n tosses of a fair coin\n", "    # Equivalent: m.factorial(n)/(m.factorial(k) * m.factorial(n-k))\n", "    num = m.comb(n, k)   # number of favorable outcomes (n choose k)\n", "    den = 2**n           # size of the sample space\n", "    rv = num/den\n", "    return rv" ] }, { "cell_type": "code", "execution_count": null, "id": "3cf9dff0", "metadata": {}, "outputs": [], "source": [ "# In ten tosses what is the probability of getting exactly two heads\n", "print(\"prob_calc(10, 2): \", prob_calc(10, 2))\n", "# In ten tosses what is the probability of getting exactly five heads\n", "print(\"prob_calc(10, 5): \", prob_calc(10, 5))" ] }, { "cell_type": "code", "execution_count": null, "id": "1620f675", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e570fcb9", "metadata": {}, "outputs": [], "source": [ "# Probabilities of getting 0,1,2,3,...10 heads in ten tosses\n", "for i in range(0,11):\n", "    print(prob_calc(10,i), end=', ')" ] }, { "cell_type": "code", "execution_count": null, "id": "345963a4", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6UO8q5N45ChQ", "outputId": "d91d5573-86e0-42de-cef5-6738e757483c" }, "outputs": [], "source": [ "# Use list comprehension to iterate this function on values 0 to 10\n", "probabilities = [prob_calc(10, h) for h in range(11)]\n", "print(probabilities)" ] }, { "cell_type": "markdown", "id": "a2adb5e3", "metadata": {}, 
"source": [ ">- This shows that in a ten coin toss experiment, there is less than 1% chance that we'll get all heads, while the probability of getting five heads is 24.6%" ] }, { "cell_type": "markdown", "id": "f7a002cd", "metadata": { "id": "gjSkHJ8r5ChQ" }, "source": [ "### 2. The Law of Large Numbers" ] }, { "cell_type": "markdown", "id": "c64bfb3e", "metadata": { "id": "7rfVQfP55ChR" }, "source": [ "The **law of large numbers** states that the more experiments we run, the closer we will tend to get to the expected probability. " ] }, { "cell_type": "code", "execution_count": null, "id": "56d91954", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xRQh-0iG5ChS", "outputId": "609ca698-64fb-4b35-b2fa-fa8abfe3b79f" }, "outputs": [], "source": [ "# The binomial() method is used to draw samples from a binomial distribution\n", "# where n is the number of flips and p is the probability.\n", "#np.random.seed(54) # for reproducibility\n", "np.random.binomial(n=4, p=0.5)" ] }, { "cell_type": "code", "execution_count": null, "id": "614b4d71", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "df020007", "metadata": {}, "outputs": [], "source": [ "# Let me increase the count of experiments to ten thousands\n", "np.random.binomial(n=10000, p=0.5)" ] }, { "cell_type": "code", "execution_count": null, "id": "6960afb6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "68a1b9f8", "metadata": { "id": "ymsouctT5ChR" }, "outputs": [], "source": [ "ne = np.array([1, 2, 3, 4, 5, 6, 7, 2**15, 2**16, 2**17, 2**18]) \n", "\n", "heads_count = [np.random.binomial(n=a, p=0.5) for a in ne]\n", "\n", "proportion_heads = heads_count/ne\n", "print(proportion_heads)" ] }, { "cell_type": "markdown", "id": "233b6075", "metadata": {}, "source": [ "> **Note:** Above are the probabilities. 
This agrees with the Law of Large Numbers, which says that as we increase the number of coin tosses, the proportion of heads gets closer and closer to 50%. " ] }, { "cell_type": "markdown", "id": "f8f31510", "metadata": { "id": "RvrI0woM5ChS" }, "source": [ "- **Gambler's Fallacy**. It is a common misconception that the law of large numbers dictates that if, say, five heads have been flipped in a row, then the probability of tails is higher on the sixth flip. In fact, probability theory holds that each coin flip is completely independent of all others. Thus, every single flip of a fair coin has a 50% chance of being heads, no matter what happened on preceding flips." ] }, { "cell_type": "code", "execution_count": null, "id": "61f7a753", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3ad3994d", "metadata": {}, "source": [ "## 3. Conditional Probability and Bayes’ Theorem" ] }, { "cell_type": "markdown", "id": "869a5da7", "metadata": {}, "source": [ "### a. Conditional Probability\n", "- **Marginal Probability:** The probability of an event irrespective of the outcomes of other random variables, e.g. $P(A)$.\n", "\n", "\n", "- **Joint Probability:** Joint probability measures the probability of two or more events occurring together and at the same time. The joint probability is also called the intersection of two or more events. We can represent this relation using a Venn diagram as well. For two events A and B, it is denoted by P(A ∩ B) or P(A, B). For independent events:\n", "\n", "\\begin{equation}\n", " P(A \\cap B) \\hspace{0.5cm} = \\hspace{0.5cm} P(A) \\times P(B) \\hspace{0.5cm}\n", "\\end{equation}\n", " \n", " - where the probability rule of product is used to find the probability of intersection of events. An important requirement of the rule of product is that the events should be independent. 
If one were to calculate the probability of an intersection of dependent events, then a different approach involving conditional probability would be needed.\n", "\n", "\n", "- **Conditional probability:** is the probability of an event A given that event B has already happened. This is formally written as P(A|B), which reads as: the probability of A given B. It can be calculated with the following formula:\n", "\n", "\\begin{equation}\n", " P(A \\mid B) = \\frac{P(A \\cap B)}{P(B)}, \\hspace{0.5cm} P(B) \\neq 0\n", "\\end{equation}\n", "\n", "- If A and B are independent, then $P(A \\cap B) = P(A) \\times P(B)$ and the formula reduces to $P(A \\mid B) = P(A)$." ] }, { "cell_type": "code", "execution_count": null, "id": "d0112ef0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c8a65528", "metadata": {}, "source": [ "**Example 1:**\n", "- Consider the following three events:\n", " - Event-A: Student getting A-Grade in exam\n", " - Event-B: Student studied 5 hours daily for entire semester\n", " - Event-C: Student studied only 1 hour before the exam\n", "- Suppose \\begin{equation} P(A) = 0.15\\end{equation}\n", "\n", "- Now it is natural that:\n", "\n", "\\begin{equation}\n", " P(A \\mid B) \\geq 0.15 \\hspace{0.5cm} \\text{and} \\hspace{0.5cm} P(A \\mid C) \\leq 0.15\n", "\\end{equation}" ] }, { "cell_type": "code", "execution_count": null, "id": "9f5cc3f7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "546d92cc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5a5050af", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ca8905e3", "metadata": {}, "source": [ "**Example 2:**\n", "- Consider the following two events:\n", " - Event-B: The first card drawn from a deck of cards is a face card\n", " - Event-A: Without replacement, the second card drawn from the same deck of cards is a face card\n", "- The events are dependent, so the joint probability is obtained by counting: 12 of the 52 cards are face cards for the first draw, and 11 of the remaining 51 for the second.\n", "\\begin{equation} P(B) = \\frac{12}{52} \\end{equation}\n", "\\begin{equation} P(A \\cap B) = \\frac{12}{52} \\times \\frac{11}{51} \\end{equation}\n", "\n", "\\begin{equation}\n", " P(A \\mid B) = \\frac{P(A \\cap B)}{P(B)}\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", "\\hspace{0.5cm} P(A \\mid B) = \\frac{\\frac{12}{52} \\times \\frac{11}{51}}{\\frac{12}{52}} = \\frac{11}{51}\n", "\\end{equation}\n", "\\begin{equation}\n", "\\hspace{0.5cm} P(A \\mid B) \\approx 0.216 \n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "814e52ae", "metadata": {}, "source": [ "- **Example 3:** Similarly, your probability of getting a parking space is connected to the time of day you park, where you park, and what conventions are going on at any time. " ] }, { "cell_type": "code", "execution_count": null, "id": "24e5fa58", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4617edb0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e41d96af", "metadata": {}, "source": [ "### b. 
Bayes’ Theorem\n", "- **Bayes' Theorem** (by Thomas Bayes), is a way of calculating a conditional probability without the joint probability\n", "- To calculate the probability that event A occurs, given that event B has already occurred, we can use the following **Conditional Probability** formula:\n", "\n", "\\begin{equation}\n", " P(A \\mid B) = \\frac{P(A \\cap B)}{P(B)} \\hspace{0.5cm} ------(i)\n", "\\end{equation}\n", "\n", "- Similarly, to calculate the probability that event B occurs, given that event A has already occurred, we can use the same formula, only this time changing out the denominator as follows:\n", "\\begin{equation}\n", " P(B \\mid A) = \\frac{P(B \\cap A)}{P(A)} \\hspace{0.9cm} OR \\hspace{0.9cm} P(B \\mid A) = \\frac{P(A \\cap B)}{P(A)} \\hspace{0.5cm}------(ii)\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "e548ba4a", "metadata": {}, "source": [ "- Multiplying both sides of equation $(i)$ by $P(B)$ gives us\n", "\\begin{equation}\n", " P(A \\mid B) * P(B) = P(A \\cap B) \\hspace{0.5cm} ------(iii)\n", "\\end{equation}\n", "\n", "\n", "- Similarly, multiplying both sides of equation $(ii)$ by $P(A)$ gives us\n", "\\begin{equation}\n", " P(B \\mid A) * P(A) = P(A \\cap B) \\hspace{0.5cm} ------(iv)\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "daef46b4", "metadata": {}, "source": [ "- Equating equations $(iii)$ and $(iv)$, we get\n", "\\begin{equation}\n", "P(A \\mid B) * P(B) \\hspace{0.5cm} = \\hspace{0.5cm} P(B \\mid A) * P(A) \\hspace{0.5cm}\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " P(A \\mid B) \\hspace{0.5cm} = \\hspace{0.5cm} \\frac{P(B \\mid A) * P(A)}{P(B)}, \\hspace{0.5cm} P(B)\\neq 0 \\hspace{0.5cm} ------(v)\n", "\\end{equation}\n", "\n", "- Where:\n", " - **P(A|B) is Posterior Probability:** Probability of an event that is calculated after all the information related to the event has been accounted for. 
(Also known as conditional probability).\n", " - **P(B|A) is Likelihood:** Probability of observing B given that A has occurred.\n", " - **P(A) is Prior Probability:** Probability of an event that is calculated before considering the new information obtained.\n", " - **P(B) is Evidence:** Also known as the normalization constant.\n", "\n", "- Note that the only difference between equation $(v)$ that defines Bayes' Theorem and equation $(i)$ that defines conditional probability, is the numerator on the right-hand side of the equation. In conditional probability, we see $P(A∩B)$ in the numerator, whereas in Bayes' theorem, we see $P(B|A)∗P(A)$ in the numerator.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "13f4dd1b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "01274efe", "metadata": {}, "source": [ "### Examples of Bayes' Theorem" ] }, { "cell_type": "markdown", "id": "1ce420f2", "metadata": {}, "source": [ "- **Example 1:** What is the probability that a card drawn from a deck of playing cards is a Queen, given that it is a Spade?\n", "\n", " - **Q:** Card is a Queen\n", " - **S:** Card is a Spade\n", "\n", "\\begin{equation}\n", " P(Q \\mid S) \\hspace{0.5cm} = \\hspace{0.5cm} \\frac{P(S \\mid Q) * P(Q)}{P(S)}\\hspace{0.5cm}\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " P(Q \\mid S) \\hspace{0.5cm} = \\hspace{0.5cm} \\frac{\\frac{1}{4}*\\frac{4}{52}}{\\frac{13}{52}}\\hspace{0.5cm} = \\hspace{0.5cm}\\frac{1}{13}\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " P(Q \\mid S) \\hspace{0.5cm} = \\hspace{0.5cm} \\frac{1}{13} \\approx 0.077\n", "\\end{equation}\n" ] }, { "cell_type": "markdown", "id": "4e18fc30", "metadata": {}, "source": [ "- **Example 2:** Consider the given data of men and women, some of them exercise and some don't. 
What is the probability that a person selected at random is a man, given that the person exercises?\n", "\n", "\n", "\n", "- Step 1: Event A is that the selected person is a man. There are 100 men and 100 women, so P(A) = 100/200 = 0.5\n", "- Step 2: Event B is that the selected person exercises. A total of 39 people exercise, so P(B) = 39/200 = 0.195\n", "- Step 3: P(B|A) is the probability that a person exercises given that the person is a man, which is 22/100 = 0.22\n", "- Step 4: Apply Bayes' theorem to find the probability that a randomly chosen exerciser is a man\n", "\n", "\n", "######
P(A|B) = P(B|A) * P(A) / P(B) = (0.22 * 0.5)/0.195 = 0.564 or 56.4%
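
The calculation above can be checked numerically. This is a minimal sketch that assumes only the counts stated in the steps (200 people, 100 of them men, 39 exercisers, 22 of whom are men); the variable names are illustrative and are not used elsewhere in this notebook.

```python
# Counts taken from the steps above (variable names are illustrative only)
total_people = 200
men = 100
exercisers = 39
men_who_exercise = 22

p_man = men / total_people                      # P(A), the prior
p_exerciser = exercisers / total_people         # P(B), the evidence
p_exerciser_given_man = men_who_exercise / men  # P(B|A), the likelihood

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_man_given_exerciser = p_exerciser_given_man * p_man / p_exerciser
print(round(p_man_given_exerciser, 3))  # prints 0.564
```

The same value follows directly from the counts, since 22 of the 39 exercisers are men (22/39).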
\n", "\n", "\n", "- Hence, the probability of an exerciser being a man is 56.4%." ] }, { "cell_type": "code", "execution_count": null, "id": "17dc1e75", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "42ad1a05", "metadata": {}, "source": [ "### Applications of Naive Bayes' Algorithm\n", "- Classification of SPAM emails\n", "- Sentiment Analysis\n", "- Recommendation Systems\n", "- Article Categorization\n", "- Search Engines\n", "- Detection of inappropriate comments" ] }, { "cell_type": "markdown", "id": "f92475d4", "metadata": {}, "source": [ "\n", " \n", "\n", "### Naive Bayes' Classifier for Filtering SPAM Emails\n", "- H: HAM\n", "- S: SPAM\n", "- D: DEAR\n", "- F: FREE\n", "\n", "\n", "\\begin{equation}\n", " P(H \\mid D,F) \\hspace{0.5cm} = \\hspace{0.5cm} \\frac{P(D,F \\mid H) * P(H)}{P(D,F)}\\hspace{0.5cm}\n", "\\end{equation}\n", "\n", "\n", "\\begin{equation}\n", " P(S \\mid D,F) \\hspace{0.5cm} = \\hspace{0.5cm} \\frac{P(D,F \\mid S) * P(S)}{P(D,F)}\\hspace{0.5cm}\n", "\\end{equation}\n" ] }, { "cell_type": "code", "execution_count": null, "id": "65dfe221", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "59155881", "metadata": {}, "source": [ ">Since both equations above have the same denominator, it can be ignored when comparing the two classes. The naive assumption is that the words are conditionally independent given the class, so the joint likelihood is the product of the per-word likelihoods." ] }, { "cell_type": "markdown", "id": "6aa5620d", "metadata": {}, "source": [ "\\begin{equation}\n", " P(H \\mid D,F) \\hspace{0.5cm} \\propto \\hspace{0.5cm} P(D,F \\mid H) . P(H)\\hspace{0.5cm}=\\hspace{0.5cm} P(D \\mid H) . P(F \\mid H) . P(H)\\hspace{0.5cm}=\\hspace{0.5cm} \\frac{5}{30}. \\frac{3}{30}. \\frac{20}{40}\\hspace{0.5cm}=\\hspace{0.5cm} \\frac{1}{120}\\hspace{0.5cm}=\\hspace{0.5cm} 0.0083\n", "\\end{equation}\n", "\n", "\n", "\\begin{equation}\n", " P(S \\mid D,F) \\hspace{0.5cm} \\propto \\hspace{0.5cm} P(D,F \\mid S) . P(S)\\hspace{0.5cm}=\\hspace{0.5cm} P(D \\mid S) . P(F \\mid S) . P(S)\\hspace{0.5cm}=\\hspace{0.5cm} \\frac{3}{30}. \\frac{1}{30}. 
\\frac{20}{40}\\hspace{0.5cm}=\\hspace{0.5cm} \\frac{1}{600}\\hspace{0.5cm}=\\hspace{0.5cm} 0.0017\n", "\\end{equation}\n" ] }, { "cell_type": "markdown", "id": "e20419ec", "metadata": {}, "source": [ "> **The given message is a HAM**" ] }, { "cell_type": "code", "execution_count": null, "id": "7332842b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a94e6baa", "metadata": { "id": "hKciO43C5ChT" }, "source": [ "## 4. How Probability Relates with Statistics?" ] }, { "cell_type": "markdown", "id": "060094fa", "metadata": { "id": "WqitDIkk5ChT" }, "source": [ "- Probability deals with the prediction of future events.\n", "- Statistics is used to analyze the frequency of past events. \n", "- Probability is the theoretical branch of mathematics, while statistics is an applied branch of mathematics. \n", "- The field of statistics applies probability theory to make inferences with a quantifiable degree of confidence. " ] }, { "cell_type": "markdown", "id": "b8edc176", "metadata": {}, "source": [ "**Example:** Suppose you have five coins and you flip them one hundred times. What is the count of heads that you have got in each of the hundred experiments?" 
] }, { "cell_type": "code", "execution_count": null, "id": "9b4963be", "metadata": { "id": "Nvc9rCnq5ChT" }, "outputs": [], "source": [ "heads_count = np.random.binomial(n=5, p=0.5, size=100)\n", "heads_count" ] }, { "cell_type": "code", "execution_count": null, "id": "824ede68", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b2ec42da", "metadata": {}, "source": [ ">Let us calculate the frequency of getting 0,1,2,3,4, and 5 heads" ] }, { "cell_type": "code", "execution_count": null, "id": "ee6db785", "metadata": {}, "outputs": [], "source": [ "# With only one required argument returns the unique elements of an array passed\n", "# With return_counts=True also returns the number of times each unique item appears in `ar`\n", "unique_values, frequency = np.unique(ar=heads_count, return_counts=True)\n", "unique_values, frequency" ] }, { "cell_type": "code", "execution_count": null, "id": "5b1c8d8d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0168b5b0", "metadata": { "id": "siW-zTqm5ChT" }, "source": [ ">**A probability distribution is a statistical function that describes all the possible values and likelihoods (probabilities) that a random variable can take within a given range.**" ] }, { "cell_type": "markdown", "id": "22ef35be", "metadata": {}, "source": [ "**Example:** Let us draw a graph showing the probability distribution of getting heads as a continuation of above example." 
] }, { "cell_type": "code", "execution_count": null, "id": "d27b9472", "metadata": { "id": "FRiZSpwy5ChT" }, "outputs": [], "source": [ "# Relative frequency of each head count over the 100 experiments\n", "prob = frequency/100\n", "prob" ] }, { "cell_type": "code", "execution_count": null, "id": "5edb75dd", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "ecU6IMfE5ChU", "outputId": "d53a581c-65a5-41d1-987c-0417ba767169" }, "outputs": [], "source": [ "plt.bar(unique_values, prob)\n", "plt.grid(True)\n", "plt.xlabel('Number of heads (out of 5 tosses)')\n", "plt.ylabel('Event probability');" ] }, { "cell_type": "markdown", "id": "fa977810", "metadata": { "id": "KwhO9Zn75ChU" }, "source": [ ">The graph above shows the probability distribution of the number of heads when you toss five coins a hundred times. The probability of throwing two or three heads in five tosses is much higher than the probability of tossing zero or five heads." ] }, { "cell_type": "code", "execution_count": null, "id": "069e36a7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5f5b1d29", "metadata": {}, "source": [ "## 5. Probability Distributions\n", "A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range.\n", "- Probability Distributions for Continuous Random Variables. \n", "- Probability Distributions for Discrete Random Variables." ] }, { "cell_type": "markdown", "id": "297db67d", "metadata": {}, "source": [ "### a. Probability Distributions for Discrete Random Variables:\n", "\n", "- A discrete probability distribution models the probabilities of random variables whose outcomes are discrete values, which are counted, not measured.\n", "- For example, when a coin is tossed twice, the number of heads can be {0, 1, 2}, and not any value from 0 to 2 like 0.1 or 1.6. 
\n", "- Discrete probability distributions are usually described by a formula called the Probability Mass Function, abbreviated as **pmf**, or by a frequency distribution table, which tells you the number of times each observation occurs in the data. For example, in {1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9}, the frequency of the number 9 is 5 (because it occurs 5 times)\n", "- The graph of a discrete probability distribution consists of bars, with the height of each bar representing the probability of that specific value. \n", "\n", " \n", "\n", "**Discrete Probability Distribution Graph:**\n", "- This graph shows the probability distribution of the number of heads when you toss five coins a hundred times. The probability of throwing two or three heads in five tosses is much higher than the probability of tossing zero or five heads.\n", "\n", "**Discrete Probability Distribution Types:**\n", "\n", " - Binomial distribution\n", " - Bernoulli distribution\n", " - Multinomial distribution\n", " - Poisson distribution" ] }, { "cell_type": "code", "execution_count": null, "id": "4312748d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e2158d0b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "75895d26", "metadata": {}, "source": [ "### b. Probability Distributions for Continuous Random Variables:\n", "\n", "- A continuous probability distribution models the probabilities of random variables whose outcomes are continuous values, which are measured, not counted.\n", "- For example, the random variable X that represents weights of citizens in a town can take any value like 34.5, 47.7, etc. \n", "- Continuous probability distributions are usually described by a formula called the Probability Density Function, abbreviated as **pdf**.\n", "- The graph of a continuous probability distribution is a curve. 
\n", "- Probability is represented by area under the curve.\n", "- The area under the curve up to a given value x, i.e., P(X ≤ x), is given by a different function called the Cumulative Distribution Function, abbreviated as **cdf**.\n", "\n", " \n", "\n", "**Continuous Probability Distribution Graph:**\n", "- This graph shows the probability distribution for a continuous variable `amount of tip` given by customers in a restaurant, which ranges from 0 to 10.0. \n", "\n", "**Continuous Probability Distribution Types:**\n", "\n", " - Normal distribution\n", " - Standard Normal distribution\n", " - Log Normal distribution\n", " - Exponential distribution\n", " - Student’s T distribution\n", " - Chi-square distribution\n", "\n", "Note: \n", "- The entire area under the curve and above the x-axis is equal to one.\n", "- Probability is found for intervals of x values rather than for individual x values.\n", "- P(c < x < d) is the probability that the random variable X is in the interval between the values c and d. \n", "- P(x = c) = 0: the probability that x takes on any single individual value is zero. \n", "- P(c < x < d) is the same as P(c ≤ x ≤ d), because probability is equal to area and a single point has zero area.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6a640702", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "249ddcdf", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d53c8f25", "metadata": {}, "source": [ "\n", "\n", "### c. Gaussian/Normal Distribution\n", "- The `Gaussian` or `Normal` Distribution is named after Carl Friedrich Gauss and is also known as the `Bell Curve`.\n", "- The Gaussian or Normal distribution is denoted as $\\mathcal{N}(\\mu, \\sigma^2)$. \n", "- A normal distribution is known as the bell curve because it looks like a bell.\n", "- The density curve is symmetrical about the mean.\n", "- Normal distribution is defined by its mean and standard deviation. 
\n", "- Normal distribution is centered about its mean, with standard deviation indicating its spread. \n", "- At point x, the height of the density curve can be calculated with the following formula:\n", "\n", "$$ f(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}} e^{-\\frac{1}{2}\\left(\\frac{x-\\mu}{\\sigma}\\right)^2} $$\n", "\n", "Where,\n", "\n", " - μ = Mean\n", " - σ = Standard deviation\n", " - x = Normal random variable\n", " - If mean (μ) = 0 and standard deviation (σ) = 1, then the distribution is known as the standard normal distribution.\n", "\n", "- For normally distributed data:\n", "    - About 68.3% of observations are within 1 standard deviation of the mean (μ ± 1σ).\n", "    - About 95.4% of observations are within 2 standard deviations of the mean (μ ± 2σ).\n", "    - About 99.7% of observations are within 3 standard deviations of the mean (μ ± 3σ). \n", "\n", "\n", "- Real World Examples (approximately normal):\n", "    - Height of the population of the world\n", "    - Height of adult women\n", "    - Height of adult men\n", "    - Income distribution in a country's economy (often cited, although real income data are usually right-skewed)\n", "    - Shoe sizes of women\n", "    - Birth weight of newborn babies\n", "    - Average grades of students based on their performance " ] }, { "cell_type": "markdown", "id": "1986727d", "metadata": {}, "source": [ "**Example:**" ] }, { "cell_type": "code", "execution_count": 2, "id": "e4b2751c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean: 24.98988377299562\n", "Standard Deviation: 2.9722754908420788\n" ] } ], "source": [ "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# Draw 10,000 values from a normal distribution with mean 25 and standard deviation 3\n", "x = np.random.normal(loc=25, scale=3, size=10000)\n", "print(\"Mean: \",np.mean(x))\n", "print(\"Standard Deviation: \",np.std(x))\n", "sns.displot(x)\n", "# Mark the mean (orange) and 1, 2 and 3 standard deviations on either side of it\n", "plt.axvline(np.mean(x), color='orange')\n", "plt.axvline(np.mean(x) - np.std(x), color='purple')\n", "plt.axvline(np.mean(x) + np.std(x), color='purple')\n", "plt.axvline(np.mean(x) - 2*np.std(x), color='green')\n", "plt.axvline(np.mean(x) + 2*np.std(x), color='green')\n", "plt.axvline(np.mean(x) - 3*np.std(x), color='red')\n", "plt.axvline(np.mean(x) + 3*np.std(x), color='red')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "8c6e9ba5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0a26b6f9", "metadata": {}, "source": [ "\n", "\n", "### d. Standard Normal Distribution or Z-Distribution\n", "- The **Standard Normal Distribution**, also called the **Z-Distribution**, is a special Normal Distribution where the mean ($\\mu$) is 0 and the standard deviation ($\\sigma$) is 1. \n", "- Normal distributions can be denoted as $\\mathcal{N}(\\mu, \\sigma^2)$, thus the standard normal distribution can be denoted as $\\mathcal{N}(0, 1)$. \n", "- Any normal distribution can be standardized by converting its values into z-scores. Z-scores tell you how many standard deviations from the mean each value lies.\n", "- To convert normally distributed data into a standard normal distribution, we subtract the mean from every data value and then divide the result by the standard deviation.\n", "\n", "$$ Z = \\frac{x-\\mu}{\\sigma} $$\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "bb24a86e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e6e1362a", "metadata": {}, "source": [ "**Example 1:** Suppose there are 10000 students in my Data Science class. Generate random results for the students with a mean of 60 and a standard deviation of 10. You have secured 85% marks. Calculate your z-score and find out how many students are above you. 
Also draw the graph of the normal distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "ffbeae8b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "de32f985", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4c6ec5a0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b44009bb", "metadata": {}, "outputs": [], "source": [ "# Let us first generate the random marks of ten thousand students, with a mean of 60 and standard deviation of 10\n", "mu = 60\n", "sigma = 10\n", "np.random.seed(54)\n", "x = np.random.normal(mu, sigma, 10000)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "id": "a7961422", "metadata": {}, "outputs": [], "source": [ "# Let us verify whether the mean and std dev of the above distribution `x` are close to 60 and 10 respectively\n", "print(\"np.mean(x): \", np.mean(x))\n", "print(\"np.std(x): \", np.std(x))" ] }, { "cell_type": "code", "execution_count": null, "id": "cb9ab515", "metadata": {}, "outputs": [], "source": [ "# You have secured 85 marks, let us now compute your z-score\n", "xi = 85\n", "# Empirical alternative based on the simulated data: (xi - np.mean(x))/np.std(x)\n", "z = (xi - mu)/sigma  # z-score from the population parameters\n", "print(\"Your z-score: \", z)" ] }, { "cell_type": "code", "execution_count": null, "id": "81d6c8ff", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "79872757", "metadata": {}, "source": [ "> Your z-score is 2.5, i.e., your marks are 2.5 standard deviations above the mean." 
] }, { "cell_type": "code", "execution_count": null, "id": "ff465cf2", "metadata": {}, "outputs": [], "source": [ "# Let us calculate the number of students above you out of ten thousand and visualize this by drawing a graph.\n", "a = len(np.where(x > 85)[0])\n", "a" ] }, { "cell_type": "markdown", "id": "15bc1b51", "metadata": {}, "source": [ ">Out of 10000 students, only 59 are above you :)" ] }, { "cell_type": "code", "execution_count": null, "id": "9c545576", "metadata": { "id": "nCN9Lrc5Jy-H" }, "outputs": [], "source": [ "sns.displot(x, color='green')\n", "#ax.set_xlim(0, 100)\n", "plt.axvline(mu, color='orange')\n", "for i in [-3, -2, -1, 1, 2, 3]:\n", " plt.axvline(mu+i*sigma, color='red')\n", "plt.axvline(xi, color='purple')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "cdaf7a48", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b2bddf08", "metadata": {}, "source": [ "**Example 2:** This is continuation of above example. Suppose your marks in Data Science are still 85%, but this time the mean of the overall result has increased to 90% with a standard deviation of 2. Calculate your Z-score and see how many students are above you? 
Also draw the graph of the normal distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "4706c09a", "metadata": {}, "outputs": [], "source": [ "# Let us first generate the random marks of ten thousand students, with a mean of 90 and standard deviation of 2\n", "mu = 90\n", "sigma = 2\n", "np.random.seed(54)\n", "x = np.random.normal(mu, sigma, 10000)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "id": "b3be29c0", "metadata": {}, "outputs": [], "source": [ "# Let us verify whether the mean and std dev of the above distribution x are close to 90 and 2 respectively\n", "print(\"np.mean(x): \", np.mean(x))\n", "print(\"np.std(x): \", np.std(x))" ] }, { "cell_type": "code", "execution_count": null, "id": "f6fdae65", "metadata": {}, "outputs": [], "source": [ "# You have secured 85 marks, let us now compute your z-score\n", "xi = 85\n", "# Empirical alternative based on the simulated data: (xi - np.mean(x))/np.std(x)\n", "z = (xi - mu)/sigma  # z-score from the population parameters\n", "print(\"Your z-score: \", z)" ] }, { "cell_type": "markdown", "id": "4f2f05eb", "metadata": { "id": "fZj22bDfJy-J" }, "source": [ ">Your z-score is -2.5, i.e., your marks are 2.5 standard deviations below the mean." 
] }, { "cell_type": "code", "execution_count": null, "id": "024b0a7e", "metadata": {}, "outputs": [], "source": [ "# Let us calculate the number of students above you out of ten thousand and visualize this by drawing a graph.\n", "a = len(np.where(x > 85)[0])\n", "a" ] }, { "cell_type": "markdown", "id": "97d68a76", "metadata": {}, "source": [ ">Out of 10000 students, 9934 are above you :(" ] }, { "cell_type": "code", "execution_count": null, "id": "9d75a269", "metadata": { "id": "6M0zc3PKJy-K" }, "outputs": [], "source": [ "sns.displot(x, color='gray')\n", "#ax.set_xlim(0, 100)\n", "plt.axvline(mu, color='orange')\n", "for i in [-3, -2, -1, 1, 2, 3]:\n", "    plt.axvline(mu+i*sigma, color='red')\n", "plt.axvline(xi, color='purple')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "587e9c74", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ed08efbc", "metadata": {}, "source": [ "### The `z-score` vs the `p-value`\n", "\n", "" ] }, { "cell_type": "markdown", "id": "4de38523", "metadata": { "id": "PNB9TnHgJy-P" }, "source": [ "- The `p-value` quantifies the **probability** that an observation at least as extreme as the given one would occur by chance alone. \n", "- For example, in the two examples above, in which we simulated exam results of 10K students, only `59` students attained a `z-score` above 2.5 (first example) and only `66` attained a `z-score` below -2.5 (second example). \n", "- Thus, if we randomly select one out of the 10K exam results, we would expect it to be more than 2.5 standard deviations away from the mean (in either direction) only about 1.25% of the time: \n", "$$ \\text{p-value} = \\frac{59+66}{10000} = 0.0125 = 1.25\\% $$\n", "\n", "**Compute p-value corresponding to Z-Score:** https://www.statology.org/calculators/" ] }, { "cell_type": "code", "execution_count": null, "id": "cdde1bae", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e5d48cf0", "metadata": {}, "source": [ "## 6. 
Central Limit Theorem\n", "\n", "- The `Central Limit Theorem` states:\n", "\n", "#### 
If you have a population with mean μ and standard deviation σ and you take sufficiently large random samples from that population with replacement, then the distribution of the sample means will be approximately normally distributed.
\n", "- This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large.\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "id": "7905b59e", "metadata": {}, "source": [ "**Law of Large Numbers vs Central Limit Theorem:**\n", "- The central limit theorem is often confused with the law of large numbers by beginners. \n", "    - The law of large numbers states: \"As a sample size grows, the sample mean gets closer to the population mean\". \n", "    - Central limit theorem (CLT) states two things:\n", "        - As the sample size grows, the shape of the distribution of the sample means resembles a bell shape. \n", "        - As the sample size grows, the center of the distribution of the sample means becomes very close to the population mean (which is essentially the law of large numbers)." ] }, { "cell_type": "code", "execution_count": null, "id": "da6ca442", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bb337f66", "metadata": {}, "source": [ "### Central Limit Theorem in Practice" ] }, { "cell_type": "markdown", "id": "30d9db7c", "metadata": {}, "source": [ "#### Example 1:\n", "> **Drawing samples from a normal distribution**" ] }, { "cell_type": "code", "execution_count": null, "id": "60ebe329", "metadata": {}, "outputs": [], "source": [ "# Draw a random sample of size 10,000 from a standard normal distribution.\n", "x1 = np.random.normal(size=10000)\n", "sns.displot(x1);" ] }, { "cell_type": "code", "execution_count": null, "id": "4120d169", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6f5d16d8", "metadata": {}, "source": [ "**Trial 1:** Generate a random sample of size **10** from the above normal distribution `x1` (without replacement). Draw the corresponding histogram." 
] }, { "cell_type": "code", "execution_count": null, "id": "cb4eeb38", "metadata": {}, "outputs": [], "source": [ "sample = np.random.choice(a=x1, size=10, replace=False)\n", "sns.displot(sample, color='green')\n", "plt.xlim(-1.5, 1.5)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f66bf72f", "metadata": {}, "source": [ "With a single small sample (10 values in this case), the histogram is scattered all over and does not have a definite pattern. However, if we draw many such samples and plot their means, the distribution of the sample means starts to resemble a normal distribution. This is the \"Central Limit Theorem\"." ] }, { "cell_type": "code", "execution_count": null, "id": "151b6e91", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a56251ab", "metadata": {}, "outputs": [], "source": [ "# Function is passed the population distribution from which to draw samples\n", "# n is the number of samples to be drawn, while size is the size of each sample\n", "# Function returns a list of means of all the samples\n", "def myfunc(distr, n, size):\n", "    sample_means = []\n", "    for i in range(n):\n", "        # Draw one random sample (without replacement) from the given 1-D array\n", "        sample = np.random.choice(distr, size=size, replace=False)\n", "        sample_means.append(sample.mean())\n", "    return sample_means" ] }, { "cell_type": "code", "execution_count": null, "id": "6a5943c3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0f80efdc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bb330008", "metadata": {}, "source": [ "**Trial 2:** Generate **10** random samples of size **10** each from the above normal distribution `x1` (without replacement). Draw a histogram of the means of those ten samples." 
] }, { "cell_type": "code", "execution_count": null, "id": "eb94b921", "metadata": {}, "outputs": [], "source": [ "# Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x1, n=10, size=10), color='green')\n", "plt.xlim(-1.5, 1.5)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "5c8f11bc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "705f3cd0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ef0954ef", "metadata": {}, "source": [ "**Trial 3:** Generate **100** random samples of size **10** each from the above normal distribution `x1` (without replacement). Draw a histogram of the means of those hundred samples." ] }, { "cell_type": "code", "execution_count": null, "id": "263212e3", "metadata": {}, "outputs": [], "source": [ "# Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x1, n=100, size=10), color='green')\n", "plt.xlim(-1.5, 1.5)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f1667a26", "metadata": {}, "source": [ "**Trial 4:** Generate **1000** random samples of size **10** each from the above normal distribution `x1` (without replacement). Draw a histogram of the means of those thousand samples." 
] }, { "cell_type": "code", "execution_count": null, "id": "5778f4f3", "metadata": {}, "outputs": [], "source": [ "# Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x1, n=1000, size=10), color='green')\n", "plt.xlim(-1.5, 1.5)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "43cc87a9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1e3c73ca", "metadata": {}, "source": [ "#### Example 2:\n", "> **Drawing samples from a skewed distribution**" ] }, { "cell_type": "code", "execution_count": null, "id": "e4cd48f4", "metadata": {}, "outputs": [], "source": [ "# Draw a skewed distribution; the sign of the first parameter decides whether the skew is positive or negative\n", "x2 = st.skewnorm.rvs(10, size=10000)\n", "sns.displot(x2);" ] }, { "cell_type": "markdown", "id": "63d03b2c", "metadata": {}, "source": [ "**Trial 1:** Generate **10** random samples of size **10** each from the above skewed distribution `x2` (without replacement). Draw a histogram of the means of those ten samples." ] }, { "cell_type": "code", "execution_count": null, "id": "310877b3", "metadata": {}, "outputs": [], "source": [ "# Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x2, n=10, size=10), color='green')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "247dd0ad", "metadata": {}, "source": [ "**Trial 2:** Generate **10,000** random samples of size **10** each from the above skewed distribution `x2` (without replacement). Draw a histogram of the means of those ten thousand samples." 
] }, { "cell_type": "code", "execution_count": null, "id": "a83843e9", "metadata": {}, "outputs": [], "source": [ "# Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x2, n=10000, size=10), color='green', kde=True)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "b801561a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "939e1d2c", "metadata": {}, "source": [ "#### Example 3:\n", "> **Drawing samples from a multimodal distribution**" ] }, { "cell_type": "code", "execution_count": null, "id": "b88cbbb8", "metadata": {}, "outputs": [], "source": [ "# Combine two normal distributions centered at 0 and 4 to get a bimodal distribution\n", "x3 = np.concatenate((np.random.normal(size=5000), np.random.normal(loc = 4.0, size=5000)))\n", "sns.displot(x3);" ] }, { "cell_type": "markdown", "id": "bbad06c3", "metadata": {}, "source": [ "**Trial 1:** Generate **10** random samples of size **10** each from the above multimodal distribution `x3` (without replacement). Draw a histogram of the means of those ten samples." ] }, { "cell_type": "code", "execution_count": null, "id": "ce011598", "metadata": {}, "outputs": [], "source": [ "# Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x3, n=10, size=10), color='green')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "a2d9af0f", "metadata": {}, "source": [ "**Trial 2:** Generate **10,000** random samples of size **10** each from the above multimodal distribution `x3` (without replacement). Draw a histogram of the means of those ten thousand samples." 
] }, { "cell_type": "code", "execution_count": null, "id": "75820259", "metadata": {}, "outputs": [], "source": [ "#Pass the displot method the list of means returned by myfunc()\n", "sns.displot(myfunc(distr=x3, n=10000, size=10), color='green', kde=True)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "8b1e0091", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "335a4e8c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ad10cb95", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "import math as m\n", "import statistics\n", "import scipy.stats as st\n", "import statsmodels as sm" ] }, { "cell_type": "markdown", "id": "ecbbfa10", "metadata": {}, "source": [ "# Section 2: (Hypothesis Testing) " ] }, { "cell_type": "markdown", "id": "cc9b328b", "metadata": {}, "source": [ "### 1. Overview of Hypothesis Testing\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "######
“A hypothesis is an idea that can be tested.”
\n", "\n", "######
“Hypothesis Testing is the process of evaluating two mutually exclusive statements about a population using sample data.”
" ] }, { "cell_type": "markdown", "id": "53b22dc5", "metadata": {}, "source": [ "\n", "\n", "**1. Formulate the hypothesis**\n", "\n", "**2. Gather random sample data from target population**\n", "\n", "**3. Select appropriate Test**\n", "\n", "**4. Execute/Perform the Test**\n", "\n", "**5. Make a Decision**\n", "\n", "**6. Check Errors**" ] }, { "cell_type": "code", "execution_count": null, "id": "ba70d6b9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4335b22f", "metadata": {}, "source": [ "### a. How to Formulate a Hypothesis?\n", " " ] }, { "cell_type": "markdown", "id": "3e84d8f1", "metadata": {}, "source": [ "- The **Null Hypothesis ($H_0$)** is the status quo. (Innocent until proven guilty)\n", "- The **Alternative Hypothesis ($H_A$)** is also called the research hypothesis, which is the claim to be tested. A researcher proposes the Alternative hypothesis and then tries to reject the Null hypothesis." ] }, { "cell_type": "code", "execution_count": null, "id": "0a93425c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7b76b9dd", "metadata": {}, "source": [ " \n", "\n", "**Example 1:** A teacher believes that a certain teaching technique will influence the mean score of his students in semester exams. But he is not sure whether it will increase or decrease the mean score, which is currently 65%. To test this, he applies the new teaching technique to his class of students. 
After the semester result, he applies the appropriate type of hypothesis test and sees if he can reject the Null Hypothesis.\n", "\n", "- Null Hypothesis ($H_0$): New teaching technique will have no effect on the mean score of 65\n", "\n", "\n", "- Alternative Hypothesis ($H_A$): New teaching technique will cause the mean score to be different from 65\n", "\n", "\n", "- $H_0: \\mu = 65$ $\\hspace{0.5cm}$ and $\\hspace{0.5cm}$ $H_A: \\mu \\neq 65 $\n", "\n", "\n", "- Since the alternative hypothesis has a not-equal-to sign, the test will be a two-tailed (non-directional) test, because the teacher is testing for an effect in either direction." ] }, { "cell_type": "code", "execution_count": null, "id": "3e4c2628", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c0d382c6", "metadata": {}, "source": [ " \n", "\n", "**Example 2:** A biologist believes that a certain fertilizer will cause the plants to grow more in one month than they normally do (status-quo). The average growth rate is currently less than or equal to 5 centimeters.\n", "\n", "- Null Hypothesis ($H_0$): Fertilizer has no effect on the mean plant growth\n", "\n", "\n", "- Alternative Hypothesis ($H_A$): Fertilizer will cause mean plant growth to increase\n", "\n", "\n", "- $H_0: \\mu$ ≤ 5 cm $\\hspace{0.5cm}$ and $\\hspace{0.5cm}$ $H_A: \\mu > 5$ cm\n", "\n", "\n", "- Since the alternative hypothesis has a greater-than sign, the test will be a one-tailed (directional) test, specifically a right-tailed test." ] }, { "cell_type": "code", "execution_count": null, "id": "42d3e77e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "86fe2d9e", "metadata": {}, "source": [ " \n", "\n", "**Example 3:** A doctor believes that a new drug will be able to reduce the blood pressure in patients. The average blood pressure is currently greater than or equal to 150. 
To test this, he/she gives that drug to a group of patients, monitors their blood pressure for one month, and then performs the test.\n", "\n", "- Null Hypothesis ($H_0$): The mean blood pressure will remain the same after using the drug\n", "\n", "\n", "- Alternative Hypothesis ($H_A$): The mean blood pressure will reduce after using the drug\n", "\n", "\n", "- $H_0: \\mu \\geq 150$ $\\hspace{0.5cm}$ and $\\hspace{0.5cm}$ $H_A: \\mu < 150 $\n", "\n", "\n", "- Since the alternative hypothesis has a less-than sign, the test will be a one-tailed (directional) test, specifically a left-tailed test." ] }, { "cell_type": "markdown", "id": "fcf2ca25", "metadata": {}, "source": [ "#### For a two-tailed test, we define `H0: µ1 = µ2 and Ha: µ1 ≠ µ2`\n", "\n", "#### For a one-tailed test, we can define `H0: µ1 ≤ µ2 and Ha: µ1 > µ2` or\n", "\n", "#### For a one-tailed test, we can define `H0: µ1 ≥ µ2 and Ha: µ1 < µ2`\n" ] }, { "cell_type": "code", "execution_count": null, "id": "45d7e225", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7a603c07", "metadata": {}, "source": [ "**Example:** \n", "- Let us suppose that I give you a coin and ask you to check if it is a fair coin. How should you proceed?\n", "    - First of all you should assume that the coin is fair (null hypothesis). 
\n", "    - Then you test the coin by flipping it six times.\n", "    - Let us suppose that it comes up heads all six times or it comes up tails all six times.\n", "    - For a fair coin, the probability of throwing six heads is $0.5^6 = 0.015625$, and similarly the probability of throwing six tails is 0.015625.\n", "    - Therefore, the probability of throwing six heads or six tails in a six coin flip experiment is 0.03125.\n", "    - This observation would suggest that you should *reject the null hypothesis*, because chance alone would produce such an observation less than 5% of the time, i.e., $p < .05$.\n", "    - So finally you will conclude that this is not a fair coin, and being a statistician you will say \"**This is a statistically significant observation**\"" ] }, { "cell_type": "code", "execution_count": null, "id": "a79ebe98", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b4fc2a7a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1f6abdb4", "metadata": {}, "source": [ "### b. Types of Hypothesis Tests" ] }, { "cell_type": "markdown", "id": "5b65ca43", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "01fa67ba", "metadata": {}, "source": [ "**Note:**\n", "- If the sample data sets are drawn from a population that is normally distributed, we use parametric tests.\n", "\n", "\n", "- Non-parametric tests are also known as distribution-free tests, and you can apply them to sample data sets that are drawn from normally distributed as well as non-normally distributed populations.\n", "\n", "\n", "- One-sample tests allow us to compare the mean of a sample against a reference mean. 
Two-sample tests allow us to compare the means of two separate samples, which can be independent or paired.\n", "\n", "\n", "- Z-Test is used to compare means when the population standard deviation is known and the sample size is large (> 30).\n", "\n", "\n", "- T-Test is used to compare means when the population standard deviation is NOT known and the sample size is small (< 30)." ] }, { "cell_type": "code", "execution_count": null, "id": "35377784", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2e4cacee", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "311b3bba", "metadata": {}, "source": [ "### c. Some Terminologies related to Hypothesis Testing\n", "#### Two-sided problems\n", "\n", "\n", "\n", "A size $\\alpha$ test for $H_0: \\mu = \\mu_0$ versus $H_A: \\mu \\neq \\mu_0$\n", "\n", "Reject $H_0$ if the test statistic is in the **rejection region**\n", "$$R = [t: \\mid t \\mid > t_{\\alpha/2, n-1}]$$\n", "and fail to reject $H_0$ if it is in the **acceptance region**\n", "$$A = [t: \\mid t \\mid \\leq t_{\\alpha/2, n-1}]$$\n", "\n", "\n", "#### One-sided problems\n", "A size $\\alpha$ test for $H_0: \\mu \\geq \\mu_0$ versus $H_A: \\mu < \\mu_0$\n", "\n", "Reject $H_0$ if the test statistic is in the **rejection region**\n", "$$R = [t: t < - t_{\\alpha, n-1}]$$\n", "and fail to reject $H_0$ if it is in the **acceptance region**\n", "$$A = [t: t \\geq - t_{\\alpha, n-1}]$$" ] }, { "cell_type": "markdown", "id": "50c224a9", "metadata": {}, "source": [ "\n", "\n", "### Types of errors\n", "Because the decision in a hypothesis test is based on a sample, it can be wrong in two ways:\n", "\n", "- Type I error: An error committed by rejecting the null hypothesis when it is true.\n", "- Type II error: An error committed by failing to reject the null hypothesis when it is false." 
] }, { "cell_type": "code", "execution_count": null, "id": "137132ef", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9304597c", "metadata": {}, "source": [ "## 2. Student's Single Sample T-Test" ] }, { "cell_type": "markdown", "id": "c07a7959", "metadata": {}, "source": [ "- A **z-score** applies to an individual value only. The formula for calculating a **z-score**: \n", "$$ z = \\frac{x_i-\\mu}{\\sigma} $$\n", "- The Student's Single Sample T-Test helps us compare a sample of multiple values to some reference mean and is used on `continuous variables`. \n", "- William Gosset published the paper on the T-test under the pseudonym \"Student\".\n", "- Following are the assumptions of this test:\n", "    - Population distribution is normal.\n", "    - Samples are random and independent.\n", "    - The sample size is small (typically less than 30) and the population standard deviation is unknown.\n", "- The outcome of the t-test is the t-value or t-score." ] }, { "cell_type": "markdown", "id": "4e973a24", "metadata": { "id": "BoXBaK4WJy-Y" }, "source": [ "- The **single-sample *t*-test** is a variation of the above formula and is defined by: \n", "$$ t = \\frac{\\bar{x} - \\mu_0}{s_{\\bar{x}}} $$\n", "Where: \n", " * $\\bar{x}$ is the sample mean\n", " * $\\mu_0$ is a reference mean, e.g., known population mean or \"null hypothesis\" mean\n", " * $s_{\\bar{x}}$ is the standard error of the sample mean = $\\frac{s}{\\sqrt n}$, where $s$ is the sample standard deviation and $n$ is the sample size\n", " * Use degrees of freedom = n-1" ] }, { "cell_type": "code", "execution_count": null, "id": "ab520a68", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8f1b7b75", "metadata": {}, "source": [ "### Example: Single Sample T-Test" ] }, { "cell_type": "markdown", "id": "50f30417", "metadata": {}, "source": [ "Suppose we want to know whether the mean weight of a certain species of penguins in Pakistan is equal to 310 pounds. 
Test this hypothesis using a single-sample T-Test at a 95% confidence level (significance level α = 0.05)." ] }, { "cell_type": "markdown", "id": "ddb74491", "metadata": {}, "source": [ "**Step 1: Formulate the Hypothesis**\n", "- $H_0: \\mu = 310 $ $\\hspace{0.5cm}$ and $\\hspace{0.5cm}$ $H_A: \\mu \\neq 310 $" ] }, { "cell_type": "code", "execution_count": null, "id": "d5303cde", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c954eefd", "metadata": {}, "source": [ "**Step 2: Data gathering**\n", "\n", "Since we cannot check the weight of all the penguins in Pakistan, we take a random sample of 40 penguins and weigh them. The data is as under:" ] }, { "cell_type": "code", "execution_count": null, "id": "eb194347", "metadata": {}, "outputs": [], "source": [ "n = 40 # sample size\n", "mu = 310. # mean of population (reference mean)\n", "xbar = 300 # mean of the sample\n", "s = 18.5 # standard deviation of sample" ] }, { "cell_type": "code", "execution_count": null, "id": "e17f1510", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8c7f8544", "metadata": {}, "source": [ "**Step 3: Compute T-Score**" ] }, { "cell_type": "code", "execution_count": null, "id": "178ab8f4", "metadata": {}, "outputs": [], "source": [ "# calculate the standard error of the sample mean: s / sqrt(n)\n", "sx = s/(n**.5)" ] }, { "cell_type": "code", "execution_count": null, "id": "0058d13d", "metadata": {}, "outputs": [], "source": [ "# calculate the t-score: (sample mean - reference mean) / standard error\n", "t_value = (xbar - mu)/sx\n", "t_value" ] }, { "cell_type": "code", "execution_count": null, "id": "037252b7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1500187c", "metadata": {}, "source": [ "**Step 4: Compute p-value corresponding to T-Score:** https://www.statology.org/t-score-p-value-calculator/\n", "\n", "- Use degrees of freedom = n-1 = 39\n", "- Resulting two-tailed p-value = 0.00149" ] }, { "cell_type": "code", "execution_count": null, "id": "8d2cabba", "metadata": {}, "outputs": [], "source": [] 
}, { "cell_type": "markdown", "id": "aecb079b", "metadata": {}, "source": [ "\n", "\n", "**Step 5: Compare p-value with α (alpha)**\n", "- We have to check this at a 95% confidence level, which means α = 5% or 0.05. The p-value from Step 4 is already two-tailed, so it is compared with α = 0.05 directly (α is split into 0.025 per tail only when the t-score is compared against critical values): " ] }, { "cell_type": "code", "execution_count": null, "id": "a1fe51fe", "metadata": {}, "outputs": [], "source": [ "p_value = 0.00149 # two-tailed p-value from Step 4\n", "alpha = 0.05 # significance level for a 95% confidence level\n", "if p_value < alpha:\n", "    print(\"The result is significant, so we Reject Null Hypothesis\")\n", "else:\n", "    print(\"The result is not significant, so we Fail to Reject Null Hypothesis\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7e27bbc1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5d799eb7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a5d183de", "metadata": {}, "source": [ "## 3. Student's Two Independent Samples T-Test\n", "- The Student's Two Sample T-Test is used to determine whether or not two population means are equal. The formula for computing the t-score for Student's Two Sample T-Test (with pooled variance) is given below:\n", "\n", "$$ t = \\frac{\\bar{x}_1 - \\bar{x}_2}{s_p \\sqrt{\\frac{1}{n_1} + \\frac{1}{n_2}}} \\hspace{0.5cm} \\text{where} \\hspace{0.5cm} s_p = \\sqrt{\\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} $$" ] }, { "cell_type": "markdown", "id": "c9c02eba", "metadata": {}, "source": [ "### Example: Two Independent Samples T-Test" ] }, { "cell_type": "markdown", "id": "72f6a5f4", "metadata": {}, "source": [ "Suppose we want to know whether or not the mean weight of two species of penguins (Adelie and Gentoo) in Pakistan is equal. 
Test this hypothesis using a two-sample T-Test at a 95% confidence level (significance level α = 0.05)." ] }, { "cell_type": "markdown", "id": "2d41dd47", "metadata": {}, "source": [ "**Step 1: Formulate the Hypothesis**\n", "- $H_0: \\mu_1 = \\mu_2 $ $\\hspace{0.5cm}$ and $\\hspace{0.5cm}$ $H_A: \\mu_1 \\neq \\mu_2$" ] }, { "cell_type": "markdown", "id": "80f258b9", "metadata": {}, "source": [ "**Step 2: Data gathering**\n", "- Since there are thousands of penguins of these two species in Pakistan, it would be too time consuming and costly to go around and weigh each individual penguin. \n", "- So we take a random sample from each population of species (40 Adelie and 38 Gentoo penguins) and use the mean weight in each sample to determine if the mean weight is equal between the two populations. \n", "- The mean weights of the two samples will almost certainly be a little different. The question is whether or not this difference is statistically significant.\n", "- The data is as under: \n", "\n", "**Note:** The two samples should have approximately the same variance. If this assumption is not met, you should use Welch's T-Test instead." ] }, { "cell_type": "code", "execution_count": null, "id": "9eff8399", "metadata": {}, "outputs": [], "source": [ "alpha = 0.05 # 95% confidence level; a two-tailed p-value is compared with alpha directly (each tail holds .05/2)\n", "\n", "# Data about sample of Adelie species\n", "x1bar = 300 # sample mean\n", "n1 = 40 # sample size\n", "s1 = 18.5 # standard deviation of sample\n", "\n", "# Data about sample of Gentoo species\n", "x2bar = 305 # sample mean\n", "n2 = 38 # sample size\n", "s2 = 16.7 # standard deviation of sample" ] }, { "cell_type": "markdown", "id": "358b0118", "metadata": {}, "source": [ "**Step 3: Compute T-Score**\n", "- You do this by taking the difference in the two sample means and dividing by either the pooled or unpooled estimated standard error. 
The estimated standard error is an aggregate measure of the amount of variation in both groups." ] }, { "cell_type": "code", "execution_count": null, "id": "8c7ef7f2", "metadata": {}, "outputs": [], "source": [ "# pooled standard deviation (weighted combination of the two sample variances)\n", "sp = ((((n1-1)*(s1**2)) + ((n2-1)*(s2**2)))/(n1+n2-2))**0.5\n", "sp" ] }, { "cell_type": "code", "execution_count": null, "id": "ec339c31", "metadata": {}, "outputs": [], "source": [ "# t-score: difference in sample means divided by the pooled standard error\n", "t_value = (x1bar-x2bar)/(sp * (((1/n1) + (1/n2))**0.5))\n", "t_value" ] }, { "cell_type": "markdown", "id": "17c5ceab", "metadata": {}, "source": [ "**Step 4: Compute p-value corresponding to T-Score:** https://www.statology.org/t-score-p-value-calculator/\n", "- Use degrees of freedom = n1 + n2 - 2 = 76\n", "- Resulting two-tailed p-value = 0.21485" ] }, { "cell_type": "markdown", "id": "9aa2f023", "metadata": {}, "source": [ "**Step 5: Compare p-value with α (alpha)**\n", "- We have to check this at a 95% confidence level, which means α = 5% or 0.05. The p-value from Step 4 is already two-tailed, so it is compared with α = 0.05 directly (α is split into 0.025 per tail only when the t-score is compared against critical values): " ] }, { "cell_type": "code", "execution_count": null, "id": "cfbb3463", "metadata": {}, "outputs": [], "source": [ "p_value = 0.21485 # two-tailed p-value from Step 4\n", "alpha = 0.05 # significance level for a 95% confidence level\n", "if p_value < alpha:\n", "    print(\"The result is significant, so we Reject Null Hypothesis\")\n", "else:\n", "    print(\"The result is not significant, so we Fail to Reject Null Hypothesis\")" ] }, { "cell_type": "markdown", "id": "c8ad38b3", "metadata": {}, "source": [ ">Therefore, we do not have sufficient evidence to say that the mean weights of the two species of penguins are different." ] }, { "cell_type": "code", "execution_count": null, "id": "9e533e5c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b257f3be", "metadata": {}, "source": [ "## 4. 
Student's Paired Sample T-Test" ] }, { "cell_type": "markdown", "id": "080325c6", "metadata": { "id": "kaomPLqyJy-e" }, "source": [ "Student's **paired-sample** (a.k.a. **dependent**) *t*-test compares two related sets of measurements, e.g., the same subjects measured twice: \n", "$$ t = \\frac{\\bar{d} - \\mu_0}{s_{\\bar{d}}} $$ \n", "Where: \n", "* $d$ is a vector of the differences between paired samples $x$ and $y$\n", "* $\\bar{d}$ is the mean of the differences\n", "* $\\mu_0$ will typically be zero, meaning the null hypothesis is that there is no difference between $x$ and $y$\n", "* $s_{\\bar{d}}$ is the standard error of the differences" ] }, { "cell_type": "markdown", "id": "bbea9d24", "metadata": { "id": "mRknH0rPJy-e" }, "source": [ "Here's an example: " ] }, { "cell_type": "code", "execution_count": null, "id": "decd8630", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "56ef16fb", "metadata": {}, "source": [ "### Example: Paired Sample T-Test" ] }, { "cell_type": "markdown", "id": "3f273304", "metadata": {}, "source": [ "- Suppose you have a dataset `exercise` containing the heart rate (pulse) of 30 persons.\n", "- Their pulse is taken at three different time points in an experiment (i.e., after 1, 15, and 30 minutes of resting, walking or running). 
\n", "- Ten people were assigned to each of the three activity groups.\n", "- Check whether the mean heart rate varies significantly after one minute of walking relative to after 15 minutes." ] }, { "cell_type": "markdown", "id": "615ee84a", "metadata": {}, "source": [ "**Exploring the dataset**\n", "- This time we have a dataset containing the related observations" ] }, { "cell_type": "code", "execution_count": null, "id": "bca07732", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "I5OLpqPvJy-f", "outputId": "df5f319c-4d95-4bc9-f56b-e9c45a7328d2" }, "outputs": [], "source": [ "df = sns.load_dataset('exercise')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "29ab7dbc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "40aeb84c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0L3Ez96QJy-g", "outputId": "ad7d5fc3-d4f9-4971-d3aa-65d23180ca5b" }, "outputs": [], "source": [ "np.unique(df.diet, return_counts=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "a1923c76", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8ZmmIMPEJy-f", "outputId": "40048b38-22fd-4096-851c-d5ca45329b13" }, "outputs": [], "source": [ "np.unique(df.time, return_counts=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "1e8bf6e8", "metadata": {}, "outputs": [], "source": [ "np.unique(df.kind, return_counts=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "4f863b77", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "288fad78", "metadata": {}, "outputs": [], "source": [ "rest = df[df.kind == 'rest']\n", "rest" ] }, { "cell_type": "code", "execution_count": null, "id": "cc5fe908", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "213f8d29", "metadata": {}, "outputs": [], "source": [ 
"run = df[df.kind == 'running']\n", "run" ] }, { "cell_type": "code", "execution_count": null, "id": "b3d5bcde", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "fcba3046", "metadata": {}, "outputs": [], "source": [ "sns.boxplot(x='time', y='pulse', data=rest);" ] }, { "cell_type": "code", "execution_count": null, "id": "8e529323", "metadata": {}, "outputs": [], "source": [ "sns.boxplot(x='time', y='pulse', data=run);" ] }, { "cell_type": "code", "execution_count": null, "id": "a6edaa5f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "764d5bcb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1ffdc79e", "metadata": {}, "source": [ "- Check whether the mean heart rate varies significantly after one minute of walking relative to after 15 minutes." ] }, { "cell_type": "markdown", "id": "afc3bce8", "metadata": {}, "source": [ "**Step 1: Formulate the Hypothesis**\n", "- $H_0: \\hspace{0.5cm}$ Mean heart rate after one minute of walking DOES NOT differ from the mean heart rate after 15 minutes (mean difference is zero)\n", "- $H_A: \\hspace{0.5cm}$ Mean heart rate after one minute of walking differs from the mean heart rate after 15 minutes (mean difference is not zero)" ] }, { "cell_type": "markdown", "id": "cb3d3ebc", "metadata": {}, "source": [ "**Step 2: Data gathering**" ] }, { "cell_type": "code", "execution_count": null, "id": "9c7e77af", "metadata": {}, "outputs": [], "source": [ "# For simplicity, let's only consider one of the six experimental groups, say the walking, no-fat dieters: \n", "walk_no = df[(df.diet == 'no fat') & (df.kind == 'walking')]\n", "walk_no" ] }, { "cell_type": "markdown", "id": "0069bbce", "metadata": {}, "source": [ "(Note how participant 16 has a relatively low heart rate at all three timepoints, whereas participant 20 has a relatively high heart rate at all three timepoints.)" ] }, { "cell_type": "code", "execution_count": null, "id": "bb0cd7f6", 
"metadata": {}, "outputs": [], "source": [ "sns.boxplot(x='time', y='pulse', data=walk_no);" ] }, { "cell_type": "code", "execution_count": null, "id": "90d4c075", "metadata": {}, "outputs": [], "source": [ "walk_no[walk_no.time == '1 min']" ] }, { "cell_type": "code", "execution_count": null, "id": "d2476202", "metadata": {}, "outputs": [], "source": [ "walk_no[walk_no.time == '1 min']['pulse']" ] }, { "cell_type": "code", "execution_count": null, "id": "395abc09", "metadata": {}, "outputs": [], "source": [ "min1 = walk_no[walk_no.time == '1 min']['pulse'].to_numpy()\n", "min1" ] }, { "cell_type": "code", "execution_count": null, "id": "ed4bd3dc", "metadata": {}, "outputs": [], "source": [ "min15 = walk_no[walk_no.time == '15 min']['pulse'].to_numpy()\n", "min15" ] }, { "cell_type": "code", "execution_count": null, "id": "ff84fc86", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6c8fb383", "metadata": {}, "source": [ "(With paired samples, we can plot the values in a scatterplot, which wouldn't make any sense for independent samples, e.g.:)" ] }, { "cell_type": "code", "execution_count": null, "id": "83bd0ec4", "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(x=min1, y=min15)\n", "plt.title('Heart rate of no-fat dieters (beats per minute)')\n", "plt.xlabel('After 1 minute walking')\n", "plt.ylabel('After 15 minutes walking');" ] }, { "cell_type": "markdown", "id": "77f44d52", "metadata": {}, "source": [ "**Step 3+4: Compute T-Score and p-value using the `ttest_rel()` function of `scipy.stats`**" ] }, { "cell_type": "code", "execution_count": null, "id": "2bf37f20", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7SZqWeDSJy-j", "outputId": "14b8234b-9751-4e11-dfcb-8e7ebf7a354d" }, "outputs": [], "source": [ "st.ttest_rel(min1, min15)" ] }, { "cell_type": "markdown", "id": "584fb3e2", "metadata": {}, "source": [ "**Step 5: Compare p-value with α = 0.05 (alpha)**" ] }, { "cell_type": "code", 
"execution_count": null, "id": "bbe55340", "metadata": {}, "outputs": [], "source": [ "p_value = 0.02846\n", "alpha = 0.05 # ttest_rel returns a two-sided p-value by default, so compare it with alpha directly\n", "if p_value < alpha:\n", "    print(\"The result is significant, so we Reject Null Hypothesis\")\n", "else:\n", "    print(\"The result is not significant, so we Fail to Reject Null Hypothesis\")" ] }, { "cell_type": "markdown", "id": "6861b092", "metadata": { "id": "3Qq6CwH2Jy-k" }, "source": [ "#### Applications of T-Tests in Machine Learning" ] }, { "cell_type": "markdown", "id": "68ae40af", "metadata": { "id": "QKsOq0Q2Jy-k" }, "source": [ "* Single-sample: Does my stochastic model tend to be more accurate than an established benchmark? \n", "* Independent samples: Does my model have unwanted bias in it, e.g., do white men score higher than other demographic groups with an HR model? \n", "* Paired samples: Is the new TensorFlow.js model significantly faster? (paired by browser / device)" ] }, { "cell_type": "code", "execution_count": null, "id": "3b0294f0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "125f0c4e", "metadata": {}, "source": [ "# Section 3: (A Journey from Variance, Covariance, Correlation to Regression) " ] }, { "cell_type": "code", "execution_count": null, "id": "b8eb6a49", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "import math as m\n", "import statistics\n", "import scipy.stats as st\n", "import statsmodels as sm" ] }, { "cell_type": "markdown", "id": "4ac76981", "metadata": {}, "source": [ "## 1. A Quick Recap" ] }, { "cell_type": "markdown", "id": "f3860975", "metadata": {}, "source": [ "### a. Variance and Standard Deviation" ] }, { "cell_type": "markdown", "id": "203fa9dc", "metadata": {}, "source": [ "- `Variance` or `standard deviation` tells us how much a (single) quantity varies w.r.t. 
its mean, or equivalently how spread out the data is around the center of the distribution (the mean). \n", "- Variance is expressed in squared units, which makes it hard to interpret, so we take its square root to get the `Standard Deviation`, which is in the same units as the data. \n", "- It also gives you an idea of where, percentage wise, a certain value falls. For example, let’s say you took a test and the scores were normally distributed (shaped like a bell). You score one standard deviation above the mean. That tells you that you scored higher than about 84% of test takers, i.e., you are in the top 16%.\n", "- Low standard deviation implies that most values are close to the mean. High standard deviation suggests that the values are more broadly spread out.\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "id": "7f1ff516", "metadata": {}, "source": [ "$$ \\text{var}(x) = \\frac{\\sum_{i=1}^n (x_i-\\bar{x})^2}{n} $$" ] }, { "cell_type": "markdown", "id": "4522abb8", "metadata": {}, "source": [ "**Example 1:**" ] }, { "cell_type": "code", "execution_count": null, "id": "f8f9959f", "metadata": {}, "outputs": [], "source": [ "data1 = [2, 4, 6]" ] }, { "cell_type": "code", "execution_count": null, "id": "41ab4a1f", "metadata": {}, "outputs": [], "source": [ "print(\"Population Variance: \", np.var(data1, ddof=0)) # divide by n\n", "print(\"Sample Variance: \", np.var(data1, ddof=1)) # divide by n-1 (degrees of freedom)" ] }, { "cell_type": "code", "execution_count": null, "id": "7bb86fbc", "metadata": {}, "outputs": [], 
"source": [ "print(\"Population Std Dev: \", np.std(data1, ddof=0)) #((4+0+4)/3)**.5 # divide by n\n", "print(\"Sample Std Dev: \", np.std(data1, ddof=1)) #((4+0+4)/2)**.5 # divide by n-1 (degrees of freedom)" ] }, { "cell_type": "code", "execution_count": null, "id": "1796481b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "23add379", "metadata": {}, "source": [ "**Example 2:**" ] }, { "cell_type": "code", "execution_count": null, "id": "ac1c4ee2", "metadata": { "scrolled": true }, "outputs": [], "source": [ "data2 = [75, 69, 80, 70, 60, 63, 64, 69, 71]\n", "print(\"Mean: \", np.mean(data2))\n", "print(\"Std Dev: \", np.std(data2))" ] }, { "cell_type": "markdown", "id": "581a5efa", "metadata": {}, "source": [ ">**Low standard deviation implies that most values are close to the mean.**" ] }, { "cell_type": "code", "execution_count": null, "id": "33d645d0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b61629b1", "metadata": {}, "source": [ "**Example 3:**" ] }, { "cell_type": 
"code", "execution_count": null, "id": "1a064f1c", "metadata": {}, "outputs": [], "source": [ "data3 = [44, 95, 25, 60, 76, 81, 93, 84, 71, 33, 85, 81]\n", "print(\"Mean: \", np.mean(data3))\n", "print(\"Std Dev: \", np.std(data3))" ] }, { "cell_type": "markdown", "id": "f7124af1", "metadata": {}, "source": [ ">**Note: High standard deviation implies that the values are more broadly spread out.**" ] }, { "cell_type": "code", "execution_count": null, "id": "e8f283c1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b5752f29", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "675ef903", "metadata": {}, "source": [ "### b. Covariance\n", "- If we have two vectors of the same length, $x$ and $y$, where each element of $x$ is paired with the corresponding element of $y$, **covariance** provides a measure of how related the variables are to each other (the sample covariance used in the examples below divides by $n-1$ instead of $n$):\n", "$$ \\text{cov}(x, y) = \\frac{\\sum_{i=1}^n (x_i - \\bar{x})(y_i - \\bar{y}) }{n} $$\n", "\n", "- Covariance measures how two random variables vary together around their means, for example the height and weight of people in a population. \n", "\n", "\n", "\n", " \n", "- The covariance value of two variables x and y can lie anywhere between negative infinity and positive infinity.\n", "- Covariance measures the direction of the relationship between two variables. \n", "    - `Positive covariance`: Indicates that two variables tend to move in the same direction.\n", "    - `Negative covariance`: Reveals that two variables tend to move in inverse directions.\n", "    - `Zero covariance`: Indicates that two variables have no linear relationship with each other.\n", "- **Covariance Matrix:** For multi-dimensional data, covariance generalizes to a covariance matrix. 
The covariance matrix is also known as the variance-covariance matrix, as the diagonal values of the covariance matrix show variances and the other values are the covariances. The covariance matrix for two variables is a square matrix which can be written as follows:\n", "\n", "$$ \\Sigma = \\begin{bmatrix} \\text{var}(x) & \\text{cov}(x, y) \\\\\\\\ \\text{cov}(y, x) & \\text{var}(y) \\end{bmatrix} $$" ] }, { "cell_type": "code", "execution_count": null, "id": "de86da2b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2113fca3", "metadata": {}, "source": [ "**Example 1: Positive Covariance**" ] }, { "cell_type": "code", "execution_count": null, "id": "4954bfc8", "metadata": {}, "outputs": [], 
"source": [ "weight = [61,62,73,74,82,86]\n", "height = [157,168,170,181,191,185]\n", "np.cov(height, weight, ddof=1) # sample variances along the diagonal, covariances off the diagonal" ] }, { "cell_type": "code", "execution_count": null, "id": "82579483", "metadata": {}, "outputs": [], "source": [ "print(\"Variance of height: \", np.var(height, ddof=1)) # sample variance\n", "print(\"Variance of weight: \", np.var(weight, ddof=1)) # sample variance\n", "\n", "# statistics.covariance (sample covariance) is available in Python 3.10+\n", "print(\"Covariance of weight, height: \", statistics.covariance(weight, height))\n", "print(\"Covariance of height, weight: \", statistics.covariance(height, weight))" ] }, { "cell_type": "markdown", "id": "cf9ddeb3", "metadata": {}, "source": [ ">- **Since the covariance is positive, the two variables have a positive relationship**" ] }, { "cell_type": "code", "execution_count": null, "id": "49566000", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3f35c877", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c5b6ed61", "metadata": {}, "source": [ "**Example 2: Negative Covariance**" ] }, { "cell_type": "code", "execution_count": null, "id": "1f76edb5", "metadata": {}, "outputs": [], "source": [ "drug = np.array([0, 1, 2, 3, 4, 5, 6, 7.]) # Drug dosage in ml\n", "forgetness = np.array([1.86, 1.31, .62, .33, .09, -.67, -1.23, -1.37]) # Level of forgetfulness\n", "np.cov(drug, forgetness)" ] }, { "cell_type": "code", "execution_count": 
null, "id": "90ad29bb", "metadata": {}, "outputs": [], "source": [ "print(\"Variance of Drug: \", np.var(drug, ddof=1)) # sample variance\n", "print(\"Variance of Forgetfulness: \", np.var(forgetness, ddof=1)) # sample variance\n", "\n", "print(\"Covariance of drug, forgetfulness: \", statistics.covariance(drug, forgetness))\n", "print(\"Covariance of forgetfulness, drug: \", statistics.covariance(forgetness, drug))" ] }, { "cell_type": "markdown", "id": "f93f9972", "metadata": {}, "source": [ ">- **Since the covariance is negative, the two variables have a negative relationship**" ] }, { "cell_type": "code", "execution_count": null, "id": "fd60a839", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "32003eee", "metadata": {}, "source": [ "### c. Correlation and Pearson's Correlation Coefficient\n", "- The value of covariance ranges from negative infinity to positive infinity, so it tells us the direction but not the magnitude of the relationship between two random variables. On the other hand, the value of the correlation coefficient ranges from -1 to +1, so it tells us the direction as well as the magnitude of the relationship between the two variables.\n", "\n", "- Another difference between the two is that covariance is affected by a change in scale. If all the values of one variable are multiplied by a constant and all the values of the other variable are multiplied by the same or a different constant, then the covariance changes. Correlation, on the other hand, is not affected by a change in scale.\n", "\n", "- For electricity generation using a windmill, if the speed of the wind turbine increases, the generation output will increase accordingly. Thus, the variables speed and electricity output have a positive correlation here.\n", "\n", "- The correlation coefficient is used to measure the strength of the relationship between two quantitative variables.\n", "\n", "- There are several types of correlation coefficients. 
The correlation coefficient (developed by Karl Pearson) denoted with $r$ or $\\rho$ is the most common and is defined by: \n", "$$ \\rho_{x,y} = \\frac{\\text{cov}(x,y)}{\\sigma_x \\sigma_y} $$\n", "\n", "- The following correlation graphs show the examples of different range of values for a correlation coefficient\n", "\n", "\n", "\n", "\n", "- **Correlation Matrix:** is a table showing correlation coefficients between various variables. The rows and columns contain the value of the variables, and each cell shows the correlation coefficient.\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "6b2fc97c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4ea064f1", "metadata": {}, "source": [ "**Example 1: Strong Positive Correlation**" ] }, { "cell_type": "code", "execution_count": null, "id": "95412adf", "metadata": {}, "outputs": [], "source": [ "study = np.array([1, 2, 3, 4, 5, 6, 7.]) # study hours \n", "gpa = np.array([1.0, 1.3, 2.5, 2.6, 3.5, 3.7, 4.0]) # gpa\n", "\n", "print(\"Correlation Matrix: \\n\", np.corrcoef(study, gpa))\n", "\n", "sns.scatterplot(x=study, y=gpa)\n", "plt.title(\"Study Hours vs GPA of Students\")\n", "plt.xlabel(\"Study Hours\")\n", "plt.ylabel(\"GPA\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "48a00233", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "43bd7350", "metadata": {}, "source": [ "**Example 2: Strong Negative Correlation**" ] }, { "cell_type": "code", "execution_count": null, "id": "b6636d6a", "metadata": {}, "outputs": [], "source": [ "drug = np.array([0, 1, 2, 3, 4, 5, 6, 7.]) # Drug dosage in ml\n", "forgetness = np.array([1.86, 1.31, .62, .33, .09, -.67, -1.23, -1.37]) # Level of forgetfullness\n", "\n", "print(\"Correlation Matrix: \\n\", np.corrcoef(drug, forgetness))\n", "\n", "\n", "sns.scatterplot(x=drug, y=forgetness)\n", "plt.title(\"Clinical Trial\")\n", "plt.xlabel(\"Drug dosage (mL)\")\n", 
"plt.ylabel(\"Forgetfulness\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "eb80e7a6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "24921f29", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d1a6b227", "metadata": {}, "source": [ "### d. Regression" ] }, { "cell_type": "markdown", "id": "7a37ff9a", "metadata": {}, "source": [ "\n", "\n", "**Correlation** lets us know the magnitude and direction of relationship between two random variables ‘x’ and ‘y’, while **Regression**, on the other hand, predicts the value of the dependent variable based on the known value of the independent variable, assuming that the average mathematical relationship between two or more variables.\n", "\n", "1. Regression can differentiate between `dependent` and `independent` variables, while correlation cannot. \n", " - The variables study hours and drug dosage, that we plotted along horizontal axis are the independent variables or feature variables, or predictor variables.\n", " - The variables GPA and forgetfullness, that we plotted along vertical axis are the dependent variables or the outcome variables.\n", " \n", "\n", "2. Regression can measure `causality` as well, while correlation just tell the direction of movement between two variables. \n", " - Causality tells us how one variable can impact or influance another variable. \n", " - For example, the more drug you provide the less is the forgetfullness. We give the drug first and then check the forgetfulness. So there is a causal relationship from drug dosage to forgetfullness.\n", "\n", "\n", "\n", "3. Regression can measure the strength and direction of both linear as well as non-linear relationship between two variables, while correlation can do this for linear relationship only.\n", "\n", "\n", "4. The reg(x,y) and reg(y,x) are completely different, while the corr(x,y) and corr(y,x) are identical. 
\n", "\n", "\n", "5. A correlation is summarized by a single number, while a linear regression is visualized as a line (correlation produces a single statistic, while regression produces a statistical model)." ] }, { "cell_type": "code", "execution_count": null, "id": "84eeaa54", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "affd23b7", "metadata": {}, "source": [ "**Example 1:**" ] }, { "cell_type": "code", "execution_count": null, "id": "9e46f594", "metadata": {}, "outputs": [], "source": [ "study = np.array([1, 2, 3, 4, 5, 6, 7.]) # study hours \n", "gpa = np.array([1.0, 1.3, 2.5, 2.6, 3.5, 3.7, 4.0]) # gpa\n", "# plot the data with a fitted first-order (linear) regression line using regplot\n", "sns.regplot(x=study, y=gpa, ci=None, order=1)\n", "plt.title(\"Study Hours vs GPA of Students\")\n", "plt.xlabel(\"Study Hours\")\n", "plt.ylabel(\"GPA\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "0437b75c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "eb87c039", "metadata": {}, "source": [ "**Example 2:**" ] }, { "cell_type": "code", "execution_count": null, "id": "e5c2ff46", "metadata": {}, "outputs": [], "source": [ "drug = np.array([0, 1, 2, 3, 4, 5, 6, 7.]) # Drug dosage in mL\n", "forgetness = np.array([1.86, 1.31, .62, .33, .09, -.67, -1.23, -1.37]) # Level of forgetfulness\n", "# plot the data with a fitted first-order (linear) regression line using regplot\n", "sns.regplot(x=drug, y=forgetness, ci=None, order=1)\n", "plt.title(\"Clinical Trial\")\n", "plt.xlabel(\"Drug dosage (mL)\")\n", "plt.ylabel(\"Forgetfulness\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "1c2f560d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f5bdcc3b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7a6b789e", "metadata": {}, "source": [ "## 2. 
Regression Analysis\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a39cb77c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "025f4e2a", "metadata": {}, "source": [ "## 3. What is Linear Regression?" ] }, { "cell_type": "markdown", "id": "dbb551c8", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "522c39d6", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "21f27e48", "metadata": {}, "source": [ "###
`A simple linear regression model is a mathematical equation that allows us to predict a response for a given predictor value.`
" ] }, { "cell_type": "markdown", "id": "13059f0e", "metadata": {}, "source": [ "- The relationship between two variables can be represented by a simple equation called the regression equation. \n", "- The regression equation tells you how much the dependent/outcome variable `y` changes for a given change in the independent/feature/predictor variable `x`.\n", "- The regression equation can be used to construct a regression line on a scatter diagram, and in the simplest case this is assumed to be a straight line. \n", "- The direction in which the line slopes depends on whether the correlation is positive or negative. " ] }, { "cell_type": "code", "execution_count": null, "id": "a991cc20", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "96a03922", "metadata": {}, "source": [ "### a. How do we Fit a Line?\n", "(i) Let us try to understand the idea behind fitting the line. The linear regression algorithm starts with a random line and moves it, step by step, closer to the points until the line fits them well.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ba24b9c9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a1e359e4", "metadata": {}, "source": [ "(ii) Let us say the algorithm draws a random line and then asks the points, one by one, \"What do you want?\". It picks a random point, which says \"come closer\". So the algorithm listens and moves the line a bit closer to that point. 
\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d0bcac6b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "42249bb1", "metadata": {}, "source": [ "(iii) It repeats this process for every point, many times over, until we get the line that best fits the data points.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "094d5fa1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8a03e502", "metadata": {}, "source": [ "\n", "\n", "### b. How do we Move a Line Close to a Point?\n", "- Let us say we have a point and a line plotted on a graph, and the point says \"come closer\". Two parameters define the position of the line:\n", "    - `Slope (m)`: $m$ = `rise` / `run` = `vertical distance` / `horizontal distance`\n", "    - `y-intercept (c)`: the height at which the line crosses the y-axis (vertical distance) \n", "- Thus, to move the line closer to the point we need to adjust the slope and the y-intercept, and each of them can either be increased or decreased." ] }, { "cell_type": "code", "execution_count": null, "id": "87ab3f52", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "10d0e870", "metadata": {}, "source": [ "### c. Rotation and Translation\n", "\n", "#### (i) `Rotation`\n", "- Rotation means changing the `slope`. To increase the slope, rotate the line `counter-clockwise`. To decrease the slope, rotate the line `clockwise`. The point around which the line rotates is the point where the line intersects the y-axis, and it is called the `Pivot`.\n", "\n", "\n", "#### (ii) `Translation`\n", "- Translation means changing the `y-intercept`. 
To move the line upward, `increase the y-intercept`, and to move the line downward, `decrease the y-intercept`.\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "3c2561d2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "edc5e53a", "metadata": {}, "source": [ "### d. Correct Combination of Rotation and Translation\n", "- To move a line towards a point in any of four different regions, we need the correct combination of Rotation and Translation\n", "- So we have four cases:\n", "\n", "| Position of the point | Rotation | Translation |\n", "|-----------|----------|-------------|\n", "| Above the line, right of the y-axis | Counter-clockwise (Increased Slope) | Translate line up (Increased y-intercept) |\n", "| Above the line, left of the y-axis | Clockwise (Decreased Slope) | Translate line up (Increased y-intercept) |\n", "| Below the line, right of the y-axis | Clockwise (Decreased Slope) | Translate line down (Decreased y-intercept) |\n", "| Below the line, left of the y-axis | Counter-clockwise (Increased Slope) | Translate line down (Decreased y-intercept) |\n" ] }, { "cell_type": "code", "execution_count": null, "id": "981967ce", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bcf1380f", "metadata": {}, "source": [ "### e. How to Move the Line using Mathematical Model (Linear equation)\n", "\n", "\n", "\n", "- Let us say we have a point and a random line, and the point says \"come closer\". Let us say the slope of the line is `m = 2` and the y-intercept is `c = 3`. Thus the linear equation of the line will be:\n", "\n", "\n", "###
`y = 2x + 3`
\n", " \n", " \n", "- Now here is a very important rule of machine learning: never take a drastic step. This means we move the line by a very tiny amount each time, because we are going to move it many times, maybe thousands of times. So we pick a very small number that sets the size of each adjustment, which is called the `Learning Rate`.\n", "\n", "\n", "\n", "- In machine learning and statistics, the learning rate is a tuning parameter of an optimization algorithm that determines the step size at each iteration.\n", "\n", "\n", "\n", "- Let us say we pick a very small number, 0.01. The second step is then to:\n", "    - add/subtract the learning rate to/from the slope (depending on the rotation (counter-clockwise / clockwise) )\n", "    - add/subtract the learning rate to/from the y-intercept (depending on the translation (up/down) )" ] }, { "cell_type": "markdown", "id": "4dc6b20b", "metadata": {}, "source": [ "### f. Linear Regression Algorithm `(Version 1)`\n", "- `Step 1`: Start with a random line; we then perform several steps to fit this line better and better.\n", "\n", "\n", "- `Step 2`: In machine learning, we repeat a process many times to improve the result, so the second step is to pick a large number of repetitions or epochs (an epoch is one complete pass of the learning algorithm over the entire training dataset; in this simple algorithm each repetition uses a single random point). So let us say we pick 10000.\n", "\n", "\n", "\n", "- `Step 3`: Pick a small number as the Learning Rate, let us say 0.01.\n", "\n", "\n", "- `Step 4`: Now repeat the process 10000 times (loop). 
We are going to repeat the following process:\n", "\n", "    - Pick a random point\n", "    - Move the line towards the point using this method:\n", " \n", "        - If the point is above the line and to the right of the y-axis\n", "            - Rotate counter-clockwise and translate up, which means\n", "                - add 0.01 to the slope\n", "                - add 0.01 to the y-intercept\n", " \n", " \n", " \n", "        - If the point is above the line and to the left of the y-axis\n", "            - Rotate clockwise and translate up, which means\n", "                - subtract 0.01 from the slope\n", "                - add 0.01 to the y-intercept\n", " \n", " \n", "        - If the point is below the line and to the right of the y-axis\n", "            - Rotate clockwise and translate down, which means\n", "                - subtract 0.01 from the slope\n", "                - subtract 0.01 from the y-intercept\n", " \n", " \n", "        - If the point is below the line and to the left of the y-axis\n", "            - Rotate counter-clockwise and translate down, which means\n", "                - add 0.01 to the slope\n", "                - subtract 0.01 from the y-intercept\n", "\n", "- `Step 5`: Enjoy the Best-fitted line ☺" ] }, { "cell_type": "code", "execution_count": null, "id": "1984b633", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 1, "id": "d8f5635e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.92" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "5-.08" ] }, { "cell_type": "markdown", "id": "7d8fc2fa", "metadata": {}, "source": [ "### g. Improved Linear Regression Algorithm `(Version 2)`\n", "\n", "\n", "\n", "In the above algorithm, we have four conditions that decide whether to add to or subtract from the slope and y-intercept of the line. 
This algorithm avoids those four if...else conditions.\n", "\n", "- `Step 1`: Start with a random line; we then perform several steps to fit this line better and better.\n", "\n", "\n", "- `Step 2`: In machine learning, we repeat a process many times to improve the result, so the second step is to pick a large number of repetitions or epochs. 
(An epoch is one complete pass of the learning algorithm over the entire training dataset.) So let us say we pick 1000.\n", "\n", "\n", "\n", "- `Step 3`: Pick a small number as the Learning Rate, let us say 0.01.\n", "\n", "\n", "- `Step 4`: Now repeat the process 1000 times (loop). We are going to repeat the following process:\n", "    - Pick a random point\n", "    - Move the line towards the point using this method:\n", "        - Add (learning rate) x (vertical distance) x (horizontal distance) to the slope\n", "        - Add (learning rate) x (vertical distance) to the y-intercept\n", "    - Both distances are signed: the vertical distance is the point's y value minus the line's prediction (positive when the point is above the line), and the horizontal distance is the point's x value (positive to the right of the y-axis). The signs reproduce the four cases of Version 1 automatically.\n", "\n", "- `Step 5`: Enjoy the Best-fitted line ☺" ] }, { "cell_type": "markdown", "id": "5432ef8b", "metadata": {}, "source": [ "Note: This algorithm is the basic idea behind traditional algorithms, such as gradient descent on the squared error, that are commonly used for performing linear regression in machine learning." ] }, { "cell_type": "code", "execution_count": null, "id": "c613cb02", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 2, "id": "38419143", "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAczElEQVR4nO3de5hcdZ3n8fenobEhF+MkDQESaREUN6CBabmKEy7OBojghWUA5bariAMDDLg64/qIzMzus+O4LLeRGIHhIgIRkEFuio8ioFzshHCJYd2AYYkJSScQkg40BPq7f5zTsbpT1dW305Wq3+f1PPWkzv17qtL1qfM7p85PEYGZmaWrqdYFmJlZbTkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yCwIZP0oKQv1LoOG5ikD0p6UtIGSefWYPvXSfqnsd6uDZ2DoAFJ+pik30h6TdIrkn4t6aP5tNMlPVKjuspuW9IySUfWoqYiSDpR0uOSNkpanT//a0nKp18n6S1JXfn784CkvUqWf5+kHknfHWEpXwUejIgJEXF5mTpnSPqZpFclrZO0QNLR+bRZkpaPcPujYmuqpVE5CBqMpInA3cAVwJ8BuwIXA2/Wsq5akLRtDbZ5IXAZ8C/AVGAn4CzgEGC7klm/HRHjgWnAauC6kmmnAq8CJ0p61wjK2Q1YPMD0nwAP5DXuCJwLrB/B9qxeRYQfDfQA2oF1FaZ9COgG3gG6eucDHgS+UDLf6cAjJcOfAJ4DXgOuBH4FfAF4F/AKsE/JvDsCbwCtZbbfZ70l45cBR+bPm4BvAC+SfUDeALw7nzYLWD7Ast8CbgN+QPaB9gVgf6AjH14FXFLhtVkCzCkZ3hZYA+wHtOTrXAusA34L7FRmHe8GNgKfrfIeXQf8U8nwMUBXyfDzwJfzeo+vsq5jyT7s1+Xv44fy8b/I3+fu/L3+QL/lpgABTCqzznH5e9iTL9sF7FKm7j7vB7AvsBDYANwK3NJv/jnAorzW3wAf7vc+fgV4Ov9/dmv+uleqZVDvqx+De/iIoPH8HnhH0vWSjpL0nt4JEbGE7NvpoxExPiImVVuZpCnA7WQfzlPIPqQOydf3Jtkf++dLFjkJ+HlEdA6z/tPzx2HA7sB4svAZrOPIwmAScBPZt/PLImIi8H5gfoXlbiarvdd/BNZExELgNLIP+enAZLLX8I0y6ziILBz/fbDFShoPfA54Mh8+lOwo4Za81lMHWPYDed3nA63AvcBPJG0XEYcDDwPn5O/17/stvhZYCvxA0qck7dQ7ISI2AkcBK/Jlx0fEiir7sR1wJ3Aj2ZHoj4DPlkzfD7gW+BLZa/g94K5+RzwnALOB9wEfBk4foJbBvq82CA6CBhMR64GPkX3b+z7QKemu0j/0IToa+F1E3BYRm4BLgZdLpl8PnCyp9//SKWQfBpUcmLdHb34A7y2Z/jmyb3cvREQX8PdkTSSDbeZ5NCLujIieiHgD2ATsIWlKRHRFxGMVlvshcKykHfLhk/Nx5OuYDOwREe9ExIL8de5vCll4vN07Ij9Xs07SG5I+XjLvV/J9X0oWdqfn408D7ouIV/PtHyVpxwo1/xVwT0Q8kL833wG2Bw6uMP9mkX0NP4zsm/j/AlZKekjSntWWreBAoBm4NCI2RcRtZEdOvb4IfC8iHs9fw+vJmisPLJnn8ohYERGvkDVbzRxge4N9X20QHAQNKCKWRMTpETEN2JvsUPrSYa5uF+ClknVHv+HHyZpD/iI/4bkHcNcA63ssIiaVPoD/1297L5YMv0jWTDPYIHup3/B/AT4APCfpt5LmlFsoIpaSNQ99Mg+DY/lTENwI/BS4RdIKSd+W1FxmNWuBKaWhFREH5/u4lr5/b9/J939qRBwbEc9L2h74T2RHMkTEo2SvzckV9rXPaxURPfn+71ph/v77vDwizomI95OdT9hI1hQ3HLsAf8z/f/QqfR93Ay7s9wVger5cr9IvGK+TBWQlg3pfbXAcBA0uIp4
ja9vdu3dUmdk2AjuUDE8teb6S7A8WgPzKl+n0dT1Z89ApwG0R0T2CkleQfWj0ei/wNlk7cJ86JW1D1iRSqs/+RcT/jYiTyM5d/DNwm6RxFbbd2zx0HNlR0NJ8HZsi4uKI+A9k37bnUL7J5lGyb7nHDWI/y/k0MBH4rqSXJb1M9qFeqXmoz2tV8t78cagbjoiXgH9lZP9Pdu29MipXeqT3EvDf+30J2CEibh5MeWXqHcr7alU4CBqMpL0kXShpWj48nezDrffQeRUwLW/T7bUI+IykHSTtQfZtq9c9wAxJn8m/6Z5L3w8AyL4xf5osDIb7jbLXzcDf5pdQjgf+B3Br3tzye6BF0jH5N/JvkLXJVyTp85Ja82/L6/LR71SY/RbgL8lO1PYeDSDpMEn75MGznqxZYot1RMQ6siu0vivpeEnjJTVJmkl20rOa08ja0fchaxaZSXY+ZqakfcrMPx84RtIR+etxIVkQ/abahiS9R9LFkvbIa5wC/Gf6/j+ZLOndJYstAo6W9GeSppKdm+j1KFlgnytpW0mfITuh2+v7wFmSDlBmXP4+TqhWa7lahvi+WhUOgsazATgAeFzSRrI/7GfJPiQgu5pkMfCypDX5uP8NvEX2B3c9edMEQESsIWuu+J9kzRt7Ar8u3WBELCe7WiTITlCOxLVkwfIQ8Aeyq17+Jt/Oa8BfA1eTfevdCFS7vnw2sFhSF9kJxhMrHbFExEqyD7SDya5a6TWV7AT0erLmo1+RXUVUbh3fBi4gu4Z/Ndlr+j3gawzwAS1pV+AIsjb2l0seC4D7yUKi/7b+D1n4XkF2hdMngU9GxFuVtlPiLaAN+Hm+X8+Shcjp+bqfIwvlF/KmnF3I3penyM4r/IyS1yjf5mfy5V8lO39xR8n0DrLzBFfm05fyp/MiA6pQy6DfV6tOfZv0zIZH0rVkV3Z8o9a1mNnQjPkPbqzxSGoj+za4b41LMbNhcNOQjYikfyRrVviXiPhDresxs6Fz05CZWeJ8RGBmlri6O0cwZcqUaGtrq3UZZmZ1ZcGCBWsiov/vboA6DIK2tjY6OjpqXYaZWV2R9GKlaW4aMjNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLXOFBIGkbSU9KurvMNEm6XNJSSU/nvRiZmVmJnp7ghc4uHn1+DS90dtHTM7o/BB6Ly0fPI7tj48Qy044iu5vlnmR3zLwq/9fMzMhC4P7FL3PB/EV0b+qhpbmJS06YyewZU2lqUvUVDEKhRwT5PfGPIbttcDnHATdE5jFgkqSdi6zJzKyeLFu7cXMIAHRv6uGC+YtYtnbjqG2j6KahS8nuy95TYfqu9O1acDllutmTdKakDkkdnZ3D7RPdzKz+rFrfvTkEenVv6mH1htHrfqGwIMj7EF2dd6xRcbYy48p1SzcvItojor21tewvpM3MGtJOE1toae77Ud3S3MSOE1pGbRtFHhEcAhwraRlZF4CHS+rfq9Ny+vZ/O42sH1YzMwPaJo/jkhNmbg6D3nMEbZNHr4vmMbkNtaRZwFciYk6/8ccA5wBHk50kvjwi9t9iBSXa29vD9xoys5T09ATL1m5k9YZudpzQQtvkcUM+USxpQUS0l5s25jedk3QWQETMBe4lC4GlwOvAGWNdj5nZ1q6pSezeOp7dW8cXsv4xCYKIeBB4MH8+t2R8AGePRQ1mZlaef1lsZpY4B4GZWeIcBGZmiXMQmJklzkFgZpY4B4GZWeIcBGZmiXMQmJklzkFgZpY4B4GZWeIcBGZmiRvzm86ZmY2F3jt2rlrfzU4Th3fHzlQ4CMys4YxFP7+NxE1DZtZwxqKf30biIDCzhjMW/fw2EgeBmTWcsejnt5E4CMys4YxFP7+NxCeLzazhNDWJ2TOmste5h46on99UOAjMrCEV3c9vI3HTkJlZ4hwEZmaJcxCYmSXOQWBmljgHgZlZ4goLAkktkp6Q9JSkxZIuLjPPLEmvSVqUP75ZVD1mVl1PT/BCZxePPr+GFzq
76OmJWpdkY6DIy0ffBA6PiC5JzcAjku6LiMf6zfdwRMwpsA4zGwTfqC1dhR0RRKYrH2zOH/56YbaV8o3a0lXoOQJJ20haBKwGHoiIx8vMdlDefHSfpBkV1nOmpA5JHZ2dnUWWbJYs36gtXYUGQUS8ExEzgWnA/pL27jfLQmC3iPgIcAVwZ4X1zIuI9ohob21tLbJks2T5Rm3pGpOrhiJiHfAgMLvf+PW9zUcRcS/QLGnKWNRkZn35Rm3pKuxksaRWYFNErJO0PXAk8M/95pkKrIqIkLQ/WTCtLaomM6vMN2pLV5FXDe0MXC9pG7IP+PkRcbekswAiYi5wPPBlSW8DbwAnRoRPKJvViG/UlqbCgiAingb2LTN+bsnzK4Eri6rBzMyq8y+LzcwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxBUWBJJaJD0h6SlJiyVdXGYeSbpc0lJJT0var6h6zMysvG0LXPebwOER0SWpGXhE0n0R8VjJPEcBe+aPA4Cr8n/NzGyMFHZEEJmufLA5f0S/2Y4DbsjnfQyYJGnnomoyM7MtFXqOQNI2khYBq4EHIuLxfrPsCrxUMrw8H9d/PWdK6pDU0dnZWVi9ZmYpKjQIIuKdiJgJTAP2l7R3v1lUbrEy65kXEe0R0d7a2lpApWZm6RqTq4YiYh3wIDC736TlwPSS4WnAirGoyczMMkVeNdQqaVL+fHvgSOC5frPdBZyaXz10IPBaRKwsqiYzM9tSkVcN7QxcL2kbssCZHxF3SzoLICLmAvcCRwNLgdeBMwqsx8zMyigsCCLiaWDfMuPnljwP4OyiajAzs+r8y2Izs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8QV2VWlWTJ6eoJlazeyan03O01soW3yOJqaVOuyzAbFQWA2Qj09wf2LX+aC+Yvo3tRDS3MTl5wwk9kzpjoMrC64achshJat3bg5BAC6N/VwwfxFLFu7scaVmQ3OkINA0jhJn5d0TxEFmdWbVeu7N4dAr+5NPaze0F2jisyGZlBBIGk7SZ+SNB9YCRwBzC20MrM6sdPEFlqa+/4ptTQ3seOElhpVZDY0AwaBpE9Iuhb4A3A8cCPwSkScERE/GYsCzbZ2bZPHcckJMzeHQe85grbJ42pcmdngVDtZ/FPgYeBjEfEHAEmXFV6VWR1pahKzZ0xlr3MPZfWGbnac4KuGrL5UC4I/B04Efi7pBeAWYJvBrFjSdOAGYCrQA8yLiMv6zTML+HeyIw6AOyLiHwZbvNW3RrrksqlJ7N46nt1bx9e6FLMhGzAIIuJJ4Enga5IOAU4CtpN0H/DjiJg3wOJvAxdGxEJJE4AFkh6IiN/1m+/hiJgzgn2wOuRLLs22HlVPFkvaV9LxZOcGzgF2BS4FDhpouYhYGREL8+cbgCX5sma+5NJsK1LtZPE3gVuBzwL3SPpiRPRExE8j4ozBbkRSG7Av8HiZyQdJekrSfZJmVFj+TEkdkjo6OzsHu1nbivmSS7OtR7Ujgr8CZkbEScBHgTOHugFJ44HbgfMjYn2/yQuB3SLiI8AVwJ3l1hER8yKiPSLaW1tbh1qCbYV8yaXZ1qNaEHRHxOsAEbF2EPP3IamZLARuiog7+k+PiPUR0ZU/vxdoljRlKNuw+uRLLs22HtWuGnq/pLvy5yoZFhARcWylBSUJuAZYEhGXVJhnKrAqIkLS/mRBs3aoO2H1x5dcmm09qgXBcf2GvwNE/rzaX+whwCnAM5IW5eO+DrwXICLmkv1I7cuS3gbeAE6MiCizLmtAvuTSbOt
QLQgmAdMi4l8BJD0BtJKFwdcGWjAiHqFKWETElcCVgy3WzMxGX7U2/68Cd5UMbwe0A7OAswqqyczMxlC1I4LtIuKlkuFH8pPGayX5rJ6ZWQOodkTwntKB/AdlvXwdp5lZA6gWBI9L+mL/kZK+BDxRTElmZjaWqjUN/S1wp6STyX78BdmN6N4FfKrAuszMbIxUu+ncauBgSYcDvbd/uCciflF4ZWZmNiYG1Xl9/sHvD38zswbkzuvNzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBJXWBBImi7pl5KWSFos6bwy80jS5ZKWSnpa0n5F1WNmZuUNqqvKYXobuDAiFkqaACyQ9EBE/K5knqOAPfPHAcBV+b9mZjZGCjsiiIiVEbEwf74BWALs2m+244AbIvMYMEnSzkXVZGZmWxqTcwSS2oB9gcf7TdoVeKlkeDlbhgWSzpTUIamjs7OzsDrNzFJUeBBIGg/cDpwfEev7Ty6zSGwxImJeRLRHRHtra2sRZZqZJavQIJDUTBYCN0XEHWVmWQ5MLxmeBqwosiYzM+uryKuGBFwDLImISyrMdhdwan710IHAaxGxsqiazMxsS0VeNXQIcArwjKRF+bivA+8FiIi5wL3A0cBS4HXgjALrMTOzMgoLgoh4hPLnAErnCeDsomowM7Pq/MtiM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscYUFgaRrJa2W9GyF6bMkvSZpUf74ZlG1mJlZZdsWuO7rgCuBGwaY5+GImFNgDWZmVkVhRwQR8RDwSlHrNzOz0VHrcwQHSXpK0n2SZlSaSdKZkjokdXR2do5lfWZmDa+WQbAQ2C0iPgJcAdxZacaImBcR7RHR3traOlb1mZkloWZBEBHrI6Irf34v0CxpSq3qMTNLVc2CQNJUScqf75/XsrZW9ZiZpaqwq4Yk3QzMAqZIWg5cBDQDRMRc4Hjgy5LeBt4AToyIKKoeMzMrr7AgiIiTqky/kuzyUjMzq6FaXzVkZmY15iAwM0ucg8DMLHEOAjOzxBV5ryErQE9PsGztRlat72aniS20TR5HU5NqXZaZ1TEHQR3p6QnuX/wyF8xfRPemHlqam7jkhJnMnjHVYWBmw+amoTqybO3GzSEA0L2phwvmL2LZ2o01rszM6pmDoI6sWt+9OQR6dW/qYfWG7hpVZGaNwEFQR3aa2EJLc9+3rKW5iR0ntNSoIjNrBA6COtI2eRyXnDBzcxj0niNomzyuxpWZWT3zyeI60tQkZs+Yyl7nHsrqDd3sOMFXDZnZyDkI6kxTk9i9dTy7t46vdSlm1iDcNGRmljgHgZlZ4hwEZmaJcxCYmSXOQWBmljgHgZlZ4hwEZmaJcxCYmSXOQWBmljgHgZlZ4hwEZmaJcxCYmSWusCCQdK2k1ZKerTBdki6XtFTS05L2K6qWnp7ghc4uHn1+DS90dtHTE0Vtysys7hR599HrgCuBGypMPwrYM38cAFyV/zuq3M+vmdnACjsiiIiHgFcGmOU44IbIPAZMkrTzaNfhfn7NzAZWy3MEuwIvlQwvz8dtQdKZkjokdXR2dg5pI+7n18xsYLUMgnLtMmUb7yNiXkS0R0R7a2vrkDbifn7NzAZWyyBYDkwvGZ4GrBjtjbifXzOzgdWyq8q7gHMk3UJ2kvi1iFg52htxP79mZgMrLAgk3QzMAqZIWg5cBDQDRMRc4F7gaGAp8DpwRlG1uJ9fM7PKCguCiDipyvQAzi5q+2ZmNjj+ZbGZWeIcBGZmiXMQmJklzkFgZpY4Zeds64ekTuDFYS4
+BVgziuXUkvdl69Qo+9Io+wHel167RUTZX+TWXRCMhKSOiGivdR2jwfuydWqUfWmU/QDvy2C4acjMLHEOAjOzxKUWBPNqXcAo8r5snRplXxplP8D7UlVS5wjMzGxLqR0RmJlZPw4CM7PEJREEkq6VtFrSs7WuZaQkTZf0S0lLJC2WdF6taxoOSS2SnpD0VL4fF9e6ppGStI2kJyXdXetaRkLSMknPSFokqaPW9YyEpEmSbpP0XP43c1CtaxoqSR/M34vex3pJ54/qNlI4RyDp40AXWR/Je9e6npHI+3XeOSIWSpoALAA+FRG/q3FpQyJJwLiI6JLUDDwCnJf3X12XJF0AtAMTI2JOresZLknLgPaIqPsfYUm6Hng4Iq6WtB2wQ0Ssq3FZwyZpG+CPwAERMdwf1m4hiSOCiHgIeKXWdYyGiFgZEQvz5xuAJVTo63lrFpmufLA5f9TttxJJ04BjgKtrXYtlJE0EPg5cAxARb9VzCOSOAJ4fzRCARIKgUUlqA/YFHq9xKcOSN6UsAlYDD0REXe5H7lLgq0BPjesYDQH8TNICSWfWupgR2B3oBP4tb7K7WlK991F7InDzaK/UQVCnJI0HbgfOj4j1ta5nOCLinYiYSdZf9f6S6rLZTtIcYHVELKh1LaPkkIjYDzgKODtvWq1H2wL7AVdFxL7ARuDvalvS8OVNW8cCPxrtdTsI6lDepn47cFNE3FHrekYqP1x/EJhd20qG7RDg2Lxt/RbgcEk/qG1JwxcRK/J/VwM/BvavbUXDthxYXnKkeRtZMNSro4CFEbFqtFfsIKgz+UnWa4AlEXFJresZLkmtkiblz7cHjgSeq2lRwxQRfx8R0yKijezQ/RcR8fkalzUsksblFyGQN6P8JVCXV9tFxMvAS5I+mI86Aqiriyr6OYkCmoWgwD6LtyaSbgZmAVMkLQcuiohralvVsB0CnAI8k7evA3w9Iu6tXUnDsjNwfX4VRBMwPyLq+rLLBrET8OPs+wbbAj+MiPtrW9KI/A1wU96s8gJwRo3rGRZJOwCfAL5UyPpTuHzUzMwqc9OQmVniHARmZolzEJiZJc5BYGaWOAeBmVniHATWkCT9t/yupk/nd2w8IB9/fn4p3lDX11V9rs3ztvW/062kb0n6ylC3azYWkvgdgaUlv9XwHGC/iHhT0hRgu3zy+cAPgNdrVN6ISNo2It6udR3WWHxEYI1oZ2BNRLwJEBFrImKFpHOBXYBfSvol9P2mL+l4Sdflz98n6VFJv5X0jyXz3CjpuJLhmyQdO5TiJM2U9Fh+tPJjSe/Jxz8oqT1/PiW/ZQWSTpf0I0k/IbsZ3M6SHsqPdJ6VdOhwXiSzXg4Ca0Q/A6ZL+r2k70r6C4CIuBxYARwWEYdVWcdlZDcr+yjwcsn4q8l/nSrp3cDBQLlfdb+/tDMR4KySaTcAX4uIDwPPABcNYp8OAk6LiMOBk4Gf5jfs+wiwaBDLm1XkILCGk/dz8OfAmWS3Ib5V0ulDXM0h/Om+LjeWrPtXwB6SdiS798vtFZpqno+Imb0PYC5sDo9J+XoArie7Z341D0REb58avwXOkPQtYJ+8XwqzYXMQWEPKb3H9YERcBJwDfLbSrCXPWwaYVupG4HNkRwb/NqJC+3qbP/1N9q9l4+aiso6WPk7WU9WNkk4dxRosQQ4Cazh5H697loyaCfT26LQBmFAybZWkD0lqAj5dMv7XZHcShexDv9R1ZCediYjFQ6ktIl4DXi1p1z8F6D06WEZ2JANwfKV1SNqNrP+D75Pdibaeb61sWwFfNWSNaDxwRX6b67eBpWTNRADzgPskrczPE/wdcDfwEtntlsfn850H/FDSeWR9P2wWEaskLQHuHGZ9pwFz88tYS++I+R1gvqRTgF8MsPws4L9K2kTWF7ePCGxEfPdRsyHKP8CfIbs89bVa12M2Um4aMhsCSb0d6FzhELBG4SMCM7PE+YjAzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxx/x9
wEefqxgdefwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "X = np.array([1, 2, 3, 4, 5, 6, 7.]) # study hours \n", "Y = np.array([1.0, 1.3, 2.5, 2.6, 3.5, 3.7, 4.0]) # gpa\n", "# plotting the regression line using regplot\n", "sns.scatterplot(x=X, y=Y)\n", "plt.title(\"Study Hours vs GPA of Students\")\n", "plt.xlabel(\"Study Hours\")\n", "plt.ylabel(\"GPA\")\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1eb322f4", "metadata": {}, "outputs": [], "source": [ "# Initially let m = 0 and c = 0. Let L be our learning rate. \n", "#This controls how much the value of m changes with each step. L could be a small value like 0.0001 for good accuracy.\n", "m = 0\n", "c = 0\n", "L = 0.0001 # The learning Rate\n", "epochs = 10000 # The number of iterations to perform gradient descent\n", "n = float(len(X)) # Number of elements in X\n", "\n", "# Performing Gradient Descent \n", "# Calculate the partial derivative of the loss function with respect to m, \n", "# and plug in the current values of x, y, m and c in it to obtain the derivative value D.\n", "# finally update the current value of m and c \n", "for i in range(epochs): \n", " Y_pred = m*X + c # The current predicted value of Y\n", " D_m = (-2/n) * sum(X * (Y - Y_pred)) # Derivative wrt m\n", " D_c = (-2/n) * sum(Y - Y_pred) # Derivative wrt c\n", " m = m - L * D_m # Update m\n", " c = c - L * D_c # Update c\n", " \n", "print (m, c)" ] }, { "cell_type": "code", "execution_count": null, "id": "24f3030c", "metadata": {}, "outputs": [], "source": [ "# Draw the regression line\n", "Y_pred = m*X + c\n", "\n", "plt.scatter(X, Y) \n", "plt.plot([min(X), max(X)], [min(Y_pred), max(Y_pred)], color='red') # regression line\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "c0c46b08", "metadata": {}, "outputs": [], "source": 
[] }, { "cell_type": "markdown", "id": "f85ab50b", "metadata": {}, "source": [ "## 4. Linear Regression using `Linear Least Squares` Method\n", "- Linear Least Squares is a very basic method used to fit a line if you have only one feature or predictor or independent variable.\n", "- Ordinary Least Square is a specific LLS method, in which works with more than one features or predictors or independent variables" ] }, { "cell_type": "markdown", "id": "7f63e7b5", "metadata": { "id": "RxcUrV5EJy-w" }, "source": [ "### a. Example 1 (LLS):" ] }, { "cell_type": "code", "execution_count": null, "id": "37689599", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from scipy import stats as st\n", "import statistics\n", "import statsmodels as sm\n", "\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 3, "id": "cf096f1b", "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAczElEQVR4nO3de5hcdZ3n8fenobEhF+MkDQESaREUN6CBabmKEy7OBojghWUA5bariAMDDLg64/qIzMzus+O4LLeRGIHhIgIRkEFuio8ioFzshHCJYd2AYYkJSScQkg40BPq7f5zTsbpT1dW305Wq3+f1PPWkzv17qtL1qfM7p85PEYGZmaWrqdYFmJlZbTkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yCwIZP0oKQv1LoOG5ikD0p6UtIGSefWYPvXSfqnsd6uDZ2DoAFJ+pik30h6TdIrkn4t6aP5tNMlPVKjuspuW9IySUfWoqYiSDpR0uOSNkpanT//a0nKp18n6S1JXfn784CkvUqWf5+kHknfHWEpXwUejIgJEXF5mTpnSPqZpFclrZO0QNLR+bRZkpaPcPujYmuqpVE5CBqMpInA3cAVwJ8BuwIXA2/Wsq5akLRtDbZ5IXAZ8C/AVGAn4CzgEGC7klm/HRHjgWnAauC6kmmnAq8CJ0p61wjK2Q1YPMD0nwAP5DXuCJwLrB/B9qxeRYQfDfQA2oF1FaZ9COgG3gG6eucDHgS+UDLf6cAjJcOfAJ4DXgOuBH4FfAF4F/AKsE/JvDsCbwCtZbbfZ70l45cBR+bPm4BvAC+SfUDeALw7nzYLWD7Ast8CbgN+QPaB9gVgf6AjH14FXFLhtVkCzCkZ3hZYA+wHtOTrXAusA34L7FRmHe8GNgKfrfIeXQf8U8nwMUBXyfDzwJfzeo+vsq5jyT7s1+Xv44fy8b/I3+fu/L3+QL/lpgABTCqzznH5e9iTL9sF7FKm7j7vB7AvsBDYANwK3NJv/jnAorzW3wA
f7vc+fgV4Ov9/dmv+uleqZVDvqx+De/iIoPH8HnhH0vWSjpL0nt4JEbGE7NvpoxExPiImVVuZpCnA7WQfzlPIPqQOydf3Jtkf++dLFjkJ+HlEdA6z/tPzx2HA7sB4svAZrOPIwmAScBPZt/PLImIi8H5gfoXlbiarvdd/BNZExELgNLIP+enAZLLX8I0y6ziILBz/fbDFShoPfA54Mh8+lOwo4Za81lMHWPYDed3nA63AvcBPJG0XEYcDDwPn5O/17/stvhZYCvxA0qck7dQ7ISI2AkcBK/Jlx0fEiir7sR1wJ3Aj2ZHoj4DPlkzfD7gW+BLZa/g94K5+RzwnALOB9wEfBk4foJbBvq82CA6CBhMR64GPkX3b+z7QKemu0j/0IToa+F1E3BYRm4BLgZdLpl8PnCyp9//SKWQfBpUcmLdHb34A7y2Z/jmyb3cvREQX8PdkTSSDbeZ5NCLujIieiHgD2ATsIWlKRHRFxGMVlvshcKykHfLhk/Nx5OuYDOwREe9ExIL8de5vCll4vN07Ij9Xs07SG5I+XjLvV/J9X0oWdqfn408D7ouIV/PtHyVpxwo1/xVwT0Q8kL833wG2Bw6uMP9mkX0NP4zsm/j/AlZKekjSntWWreBAoBm4NCI2RcRtZEdOvb4IfC8iHs9fw+vJmisPLJnn8ohYERGvkDVbzRxge4N9X20QHAQNKCKWRMTpETEN2JvsUPrSYa5uF+ClknVHv+HHyZpD/iI/4bkHcNcA63ssIiaVPoD/1297L5YMv0jWTDPYIHup3/B/AT4APCfpt5LmlFsoIpaSNQ99Mg+DY/lTENwI/BS4RdIKSd+W1FxmNWuBKaWhFREH5/u4lr5/b9/J939qRBwbEc9L2h74T2RHMkTEo2SvzckV9rXPaxURPfn+71ph/v77vDwizomI95OdT9hI1hQ3HLsAf8z/f/QqfR93Ay7s9wVger5cr9IvGK+TBWQlg3pfbXAcBA0uIp4ja9vdu3dUmdk2AjuUDE8teb6S7A8WgPzKl+n0dT1Z89ApwG0R0T2CkleQfWj0ei/wNlk7cJ86JW1D1iRSqs/+RcT/jYiTyM5d/DNwm6RxFbbd2zx0HNlR0NJ8HZsi4uKI+A9k37bnUL7J5lGyb7nHDWI/y/k0MBH4rqSXJb1M9qFeqXmoz2tV8t78cagbjoiXgH9lZP9Pdu29MipXeqT3EvDf+30J2CEibh5MeWXqHcr7alU4CBqMpL0kXShpWj48nezDrffQeRUwLW/T7bUI+IykHSTtQfZtq9c9wAxJn8m/6Z5L3w8AyL4xf5osDIb7jbLXzcDf5pdQjgf+B3Br3tzye6BF0jH5N/JvkLXJVyTp85Ja82/L6/LR71SY/RbgL8lO1PYeDSDpMEn75MGznqxZYot1RMQ6siu0vivpeEnjJTVJmkl20rOa08ja0fchaxaZSXY+ZqakfcrMPx84RtIR+etxIVkQ/abahiS9R9LFkvbIa5wC/Gf6/j+ZLOndJYstAo6W9GeSppKdm+j1KFlgnytpW0mfITuh2+v7wFmSDlBmXP4+TqhWa7lahvi+WhUOgsazATgAeFzSRrI/7GfJPiQgu5pkMfCypDX5uP8NvEX2B3c9edMEQESsIWuu+J9kzRt7Ar8u3WBELCe7WiTITlCOxLVkwfIQ8Aeyq17+Jt/Oa8BfA1eTfevdCFS7vnw2sFhSF9kJxhMrHbFExEqyD7SDya5a6TWV7AT0erLmo1+RXUVUbh3fBi4gu4Z/Ndlr+j3gawzwAS1pV+AIsjb2l0seC4D7yUKi/7b+D1n4XkF2hdMngU9GxFuVtlPiLaAN+Hm+X8+Shcjp+bqfIwvlF/KmnF3I3penyM4r/IyS1yjf5mfy5V8lO39xR8n0DrLzBFfm05fyp/MiA6pQy6DfV6tOfZv0zIZH0rVkV3Z8o9a1mNnQjPkPbqzxSGoj+za4b41LMbNhcNOQjYikfyRrVviXiPhDresxs6Fz05CZWeJ
8RGBmlri6O0cwZcqUaGtrq3UZZmZ1ZcGCBWsiov/vboA6DIK2tjY6OjpqXYaZWV2R9GKlaW4aMjNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLXOFBIGkbSU9KurvMNEm6XNJSSU/nvRiZmVmJnp7ghc4uHn1+DS90dtHTM7o/BB6Ly0fPI7tj48Qy044iu5vlnmR3zLwq/9fMzMhC4P7FL3PB/EV0b+qhpbmJS06YyewZU2lqUvUVDEKhRwT5PfGPIbttcDnHATdE5jFgkqSdi6zJzKyeLFu7cXMIAHRv6uGC+YtYtnbjqG2j6KahS8nuy95TYfqu9O1acDllutmTdKakDkkdnZ3D7RPdzKz+rFrfvTkEenVv6mH1htHrfqGwIMj7EF2dd6xRcbYy48p1SzcvItojor21tewvpM3MGtJOE1toae77Ud3S3MSOE1pGbRtFHhEcAhwraRlZF4CHS+rfq9Ny+vZ/O42sH1YzMwPaJo/jkhNmbg6D3nMEbZNHr4vmMbkNtaRZwFciYk6/8ccA5wBHk50kvjwi9t9iBSXa29vD9xoys5T09ATL1m5k9YZudpzQQtvkcUM+USxpQUS0l5s25jedk3QWQETMBe4lC4GlwOvAGWNdj5nZ1q6pSezeOp7dW8cXsv4xCYKIeBB4MH8+t2R8AGePRQ1mZlaef1lsZpY4B4GZWeIcBGZmiXMQmJklzkFgZpY4B4GZWeIcBGZmiXMQmJklzkFgZpY4B4GZWeIcBGZmiRvzm86ZmY2F3jt2rlrfzU4Th3fHzlQ4CMys4YxFP7+NxE1DZtZwxqKf30biIDCzhjMW/fw2EgeBmTWcsejnt5E4CMys4YxFP7+NxCeLzazhNDWJ2TOmste5h46on99UOAjMrCEV3c9vI3HTkJlZ4hwEZmaJcxCYmSXOQWBmljgHgZlZ4goLAkktkp6Q9JSkxZIuLjPPLEmvSVqUP75ZVD1mVl1PT/BCZxePPr+GFzq76OmJWpdkY6DIy0ffBA6PiC5JzcAjku6LiMf6zfdwRMwpsA4zGwTfqC1dhR0RRKYrH2zOH/56YbaV8o3a0lXoOQJJ20haBKwGHoiIx8vMdlDefHSfpBkV1nOmpA5JHZ2dnUWWbJYs36gtXYUGQUS8ExEzgWnA/pL27jfLQmC3iPgIcAVwZ4X1zIuI9ohob21tLbJks2T5Rm3pGpOrhiJiHfAgMLvf+PW9zUcRcS/QLGnKWNRkZn35Rm3pKuxksaRWYFNErJO0PXAk8M/95pkKrIqIkLQ/WTCtLaomM6vMN2pLV5FXDe0MXC9pG7IP+PkRcbekswAiYi5wPPBlSW8DbwAnRoRPKJvViG/UlqbCgiAingb2LTN+bsnzK4Eri6rBzMyq8y+LzcwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxDkIzMwS5yAwM0ucg8DMLHEOAjOzxBUWBJJaJD0h6SlJiyVdXGYeSbpc0lJJT0var6h6zMysvG0LXPebwOER0SWpGXhE0n0R8VjJPEcBe+aPA4Cr8n/NzGyMFHZEEJmufLA5f0S/2Y4DbsjnfQyYJGnnomoyM7MtFXqOQNI2khYBq4EHIuLxfrPsCrxUMrw8H9d/PWdK6pDU0dnZWVi9ZmYpKjQIIuKdiJgJTAP2l7R3v1lUbrEy65kXEe0R0d7a2lpApWZm6RqTq4YiYh3wIDC736TlwPSS4WnAirGoyczMMkVeNdQqaVL+fHvgSOC5frPdBZyaXz10IPBaRKwsqiYzM9tSkVcN7QxcL2kbssCZHxF3SzoLICLmAvcCRwNLgdeBMwqsx8zMyigsCCLiaWDfMuPnljwP4OyiajAzs+r8y2Izs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLnIP
AzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxxDgIzs8QV2VWlWTJ6eoJlazeyan03O01soW3yOJqaVOuyzAbFQWA2Qj09wf2LX+aC+Yvo3tRDS3MTl5wwk9kzpjoMrC64achshJat3bg5BAC6N/VwwfxFLFu7scaVmQ3OkINA0jhJn5d0TxEFmdWbVeu7N4dAr+5NPaze0F2jisyGZlBBIGk7SZ+SNB9YCRwBzC20MrM6sdPEFlqa+/4ptTQ3seOElhpVZDY0AwaBpE9Iuhb4A3A8cCPwSkScERE/GYsCzbZ2bZPHcckJMzeHQe85grbJ42pcmdngVDtZ/FPgYeBjEfEHAEmXFV6VWR1pahKzZ0xlr3MPZfWGbnac4KuGrL5UC4I/B04Efi7pBeAWYJvBrFjSdOAGYCrQA8yLiMv6zTML+HeyIw6AOyLiHwZbvNW3RrrksqlJ7N46nt1bx9e6FLMhGzAIIuJJ4Enga5IOAU4CtpN0H/DjiJg3wOJvAxdGxEJJE4AFkh6IiN/1m+/hiJgzgn2wOuRLLs22HlVPFkvaV9LxZOcGzgF2BS4FDhpouYhYGREL8+cbgCX5sma+5NJsK1LtZPE3gVuBzwL3SPpiRPRExE8j4ozBbkRSG7Av8HiZyQdJekrSfZJmVFj+TEkdkjo6OzsHu1nbivmSS7OtR7Ujgr8CZkbEScBHgTOHugFJ44HbgfMjYn2/yQuB3SLiI8AVwJ3l1hER8yKiPSLaW1tbh1qCbYV8yaXZ1qNaEHRHxOsAEbF2EPP3IamZLARuiog7+k+PiPUR0ZU/vxdoljRlKNuw+uRLLs22HtWuGnq/pLvy5yoZFhARcWylBSUJuAZYEhGXVJhnKrAqIkLS/mRBs3aoO2H1x5dcmm09qgXBcf2GvwNE/rzaX+whwCnAM5IW5eO+DrwXICLmkv1I7cuS3gbeAE6MiCizLmtAvuTSbOtQLQgmAdMi4l8BJD0BtJKFwdcGWjAiHqFKWETElcCVgy3WzMxGX7U2/68Cd5UMbwe0A7OAswqqyczMxlC1I4LtIuKlkuFH8pPGayX5rJ6ZWQOodkTwntKB/AdlvXwdp5lZA6gWBI9L+mL/kZK+BDxRTElmZjaWqjUN/S1wp6STyX78BdmN6N4FfKrAuszMbIxUu+ncauBgSYcDvbd/uCciflF4ZWZmNiYG1Xl9/sHvD38zswbkzuvNzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBJXWBBImi7pl5KWSFos6bwy80jS5ZKWSnpa0n5F1WNmZuUNqqvKYXobuDAiFkqaACyQ9EBE/K5knqOAPfPHAcBV+b9mZjZGCjsiiIiVEbEwf74BWALs2m+244AbIvMYMEnSzkXVZGZmWxqTcwSS2oB9gcf7TdoVeKlkeDlbhgWSzpTUIamjs7OzsDrNzFJUeBBIGg/cDpwfEev7Ty6zSGwxImJeRLRHRHtra2sRZZqZJavQIJDUTBYCN0XEHWVmWQ5MLxmeBqwosiYzM+uryKuGBFwDLImISyrMdhdwan710IHAaxGxsqiazMxsS0VeNXQIcArwjKRF+bivA+8FiIi5wL3A0cBS4HXgjALrMTOzMgoLgoh4hPLnAErnCeDsomowM7Pq/MtiM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscQ4CM7PEOQjMzBLnIDAzS5yDwMwscYUFgaRrJa2W9GyF6bMkvSZpUf74ZlG1mJlZZdsWuO7rgCuBGwaY5+GImFNgDWZmVkVhRwQR8RDwSlHrNzOz0VHrcwQ
HSXpK0n2SZlSaSdKZkjokdXR2do5lfWZmDa+WQbAQ2C0iPgJcAdxZacaImBcR7RHR3traOlb1mZkloWZBEBHrI6Irf34v0CxpSq3qMTNLVc2CQNJUScqf75/XsrZW9ZiZpaqwq4Yk3QzMAqZIWg5cBDQDRMRc4Hjgy5LeBt4AToyIKKoeMzMrr7AgiIiTqky/kuzyUjMzq6FaXzVkZmY15iAwM0ucg8DMLHEOAjOzxBV5ryErQE9PsGztRlat72aniS20TR5HU5NqXZaZ1TEHQR3p6QnuX/wyF8xfRPemHlqam7jkhJnMnjHVYWBmw+amoTqybO3GzSEA0L2phwvmL2LZ2o01rszM6pmDoI6sWt+9OQR6dW/qYfWG7hpVZGaNwEFQR3aa2EJLc9+3rKW5iR0ntNSoIjNrBA6COtI2eRyXnDBzcxj0niNomzyuxpWZWT3zyeI60tQkZs+Yyl7nHsrqDd3sOMFXDZnZyDkI6kxTk9i9dTy7t46vdSlm1iDcNGRmljgHgZlZ4hwEZmaJcxCYmSXOQWBmljgHgZlZ4hwEZmaJcxCYmSXOQWBmljgHgZlZ4hwEZmaJcxCYmSWusCCQdK2k1ZKerTBdki6XtFTS05L2K6qWnp7ghc4uHn1+DS90dtHTE0Vtysys7hR599HrgCuBGypMPwrYM38cAFyV/zuq3M+vmdnACjsiiIiHgFcGmOU44IbIPAZMkrTzaNfhfn7NzAZWy3MEuwIvlQwvz8dtQdKZkjokdXR2dg5pI+7n18xsYLUMgnLtMmUb7yNiXkS0R0R7a2vrkDbifn7NzAZWyyBYDkwvGZ4GrBjtjbifXzOzgdWyq8q7gHMk3UJ2kvi1iFg52htxP79mZgMrLAgk3QzMAqZIWg5cBDQDRMRc4F7gaGAp8DpwRlG1uJ9fM7PKCguCiDipyvQAzi5q+2ZmNjj+ZbGZWeIcBGZmiXMQmJklzkFgZpY4Zeds64ekTuDFYS4+BVgziuXUkvdl69Qo+9Io+wHel167RUTZX+TWXRCMhKSOiGivdR2jwfuydWqUfWmU/QDvy2C4acjMLHEOAjOzxKUWBPNqXcAo8r5snRplXxplP8D7UlVS5wjMzGxLqR0RmJlZPw4CM7PEJREEkq6VtFrSs7WuZaQkTZf0S0lLJC2WdF6taxoOSS2SnpD0VL4fF9e6ppGStI2kJyXdXetaRkLSMknPSFokqaPW9YyEpEmSbpP0XP43c1CtaxoqSR/M34vex3pJ54/qNlI4RyDp40AXWR/Je9e6npHI+3XeOSIWSpoALAA+FRG/q3FpQyJJwLiI6JLUDDwCnJf3X12XJF0AtAMTI2JOresZLknLgPaIqPsfYUm6Hng4Iq6WtB2wQ0Ssq3FZwyZpG+CPwAERMdwf1m4hiSOCiHgIeKXWdYyGiFgZEQvz5xuAJVTo63lrFpmufLA5f9TttxJJ04BjgKtrXYtlJE0EPg5cAxARb9VzCOSOAJ4fzRCARIKgUUlqA/YFHq9xKcOSN6UsAlYDD0REXe5H7lLgq0BPjesYDQH8TNICSWfWupgR2B3oBP4tb7K7WlK991F7InDzaK/UQVCnJI0HbgfOj4j1ta5nOCLinYiYSdZf9f6S6rLZTtIcYHVELKh1LaPkkIjYDzgKODtvWq1H2wL7AVdFxL7ARuDvalvS8OVNW8cCPxrtdTsI6lDepn47cFNE3FHrekYqP1x/EJhd20qG7RDg2Lxt/RbgcEk/qG1JwxcRK/J/VwM/BvavbUXDthxYXnKkeRtZMNSro4CFEbFqtFfsIKgz+UnWa4AlEXFJresZLkmtkiblz7cHjgSeq2lRwxQRfx8R0yKijezQ/RcR8fkalzUsksblFyGQN6P8JVCXV9tFxMvAS5I+mI86Aqiriyr6OYkCmoWgwD6LtyaSbgZmAVMkLQcuiohralvVsB0CnAI8k7evA3w9Iu6tXUnDsjNwfX4VRBMwPyLq+rLLBrET8OPs+wbbAj+
MiPtrW9KI/A1wU96s8gJwRo3rGRZJOwCfAL5UyPpTuHzUzMwqc9OQmVniHARmZolzEJiZJc5BYGaWOAeBmVniHATWkCT9t/yupk/nd2w8IB9/fn4p3lDX11V9rs3ztvW/062kb0n6ylC3azYWkvgdgaUlv9XwHGC/iHhT0hRgu3zy+cAPgNdrVN6ISNo2It6udR3WWHxEYI1oZ2BNRLwJEBFrImKFpHOBXYBfSvol9P2mL+l4Sdflz98n6VFJv5X0jyXz3CjpuJLhmyQdO5TiJM2U9Fh+tPJjSe/Jxz8oqT1/PiW/ZQWSTpf0I0k/IbsZ3M6SHsqPdJ6VdOhwXiSzXg4Ca0Q/A6ZL+r2k70r6C4CIuBxYARwWEYdVWcdlZDcr+yjwcsn4q8l/nSrp3cDBQLlfdb+/tDMR4KySaTcAX4uIDwPPABcNYp8OAk6LiMOBk4Gf5jfs+wiwaBDLm1XkILCGk/dz8OfAmWS3Ib5V0ulDXM0h/Om+LjeWrPtXwB6SdiS798vtFZpqno+Imb0PYC5sDo9J+XoArie7Z341D0REb58avwXOkPQtYJ+8XwqzYXMQWEPKb3H9YERcBJwDfLbSrCXPWwaYVupG4HNkRwb/NqJC+3qbP/1N9q9l4+aiso6WPk7WU9WNkk4dxRosQQ4Cazh5H697loyaCfT26LQBmFAybZWkD0lqAj5dMv7XZHcShexDv9R1ZCediYjFQ6ktIl4DXi1p1z8F6D06WEZ2JANwfKV1SNqNrP+D75Pdibaeb61sWwFfNWSNaDxwRX6b67eBpWTNRADzgPskrczPE/wdcDfwEtntlsfn850H/FDSeWR9P2wWEaskLQHuHGZ9pwFz88tYS++I+R1gvqRTgF8MsPws4L9K2kTWF7ePCGxEfPdRsyHKP8CfIbs89bVa12M2Um4aMhsCSb0d6FzhELBG4SMCM7PE+YjAzCxxDgIzs8Q5CMzMEucgMDNLnIPAzCxx/x9wEefqxgdefwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = np.array([1, 2, 3, 4, 5, 6, 7.]) # study hours \n", "y = np.array([1.0, 1.3, 2.5, 2.6, 3.5, 3.7, 4.0]) # gpa\n", "# plotting the regression line using regplot\n", "sns.scatterplot(x=x, y=y)\n", "plt.title(\"Study Hours vs GPA of Students\")\n", "plt.xlabel(\"Study Hours\")\n", "plt.ylabel(\"GPA\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f5f47081", "metadata": {}, "source": [ "**Regression Equation for this scenario:**" ] }, { "cell_type": "markdown", "id": "79dfee89", "metadata": { "id": "_JFRDiSHJy-x" }, "source": [ "- Let us fit our regression line to these data points. Note we have just one feature/independent/predictor variable and that is the `study hours` and the outcome/dependent/response variable is `gpa`. For this scenario, the line equation is:\n", " \n", "$$ y = c + mx $$\n", "\n", "- In above equation `y` is the outcome/dependent/response variable and `x` is the only feature/independent/predictor variable. `m` is the slope of the line and `c` is the y-intercept. \n", "- In this scenario, there are only two model parameters (`m` is the slope of the line and `c` is the y-intercept)\n", "- In Machine Learning it is a convention to represent the model parameters with Greek letters Baita. So we represent the model parameter y-intercept as Baita-zero and the model parameter slope as Baita-one:\n", " - (i) The y-intercept ($\\beta_0$)\n", " - (ii) Slope of line ($\\beta_1$)\n", "\n", "\n", "- So the above, equation can be re-written as regression equation as follows: \n", "\n", "$$ y = \\beta_0 + \\beta_1 x + \\epsilon $$\n", "\n", "- The $\\epsilon$ term denotes **error**. For a given instance $i$, $\\epsilon_i$ is a measure of the difference between the true $y_i$ and the model's estimate, $\\hat{y}_i$. If the model predicts $y_i$ perfectly, then $\\epsilon_i = \\hat{y}_i - y_i = 0$. 
\n", "\n", "- Together, $\\beta_0$ and $\\beta_1$ are called the **model coefficients**. To create a model, we must \"learn\" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict! Our objective is to find the parameters $\\beta_0$ and $\\beta_1$ that minimize $\\epsilon$ across all the available data points. " ] }, { "cell_type": "code", "execution_count": null, "id": "3b9b2e3f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5cc919bd", "metadata": {}, "source": [ "**Calculate $\\beta_1$:**" ] }, { "cell_type": "markdown", "id": "4f2da781", "metadata": { "id": "9O_ZEib_Jy-x" }, "source": [ "In the case of a model with a single predictor $x$, there is a fairly straightforward **linear least squares** formula we can use to estimate $\\beta_1$: \n", "$$ \\hat{\\beta}_1 = \\frac{\\text{cov}(x,y)}{\\sigma^2_x} $$" ] }, { "cell_type": "code", "execution_count": null, "id": "a2495e91", "metadata": {}, "outputs": [], "source": [ "# To find cov(x,y) and var(x), use the covariance matrix\n", "cov_mat = np.cov(x, y)\n", "cov_mat" ] }, { "cell_type": "code", "execution_count": null, "id": "bbdcd0fd", "metadata": {}, "outputs": [], "source": [ "beta1 = cov_mat[0,1]/cov_mat[0,0]\n", "beta1" ] }, { "cell_type": "code", "execution_count": null, "id": "f7129ec8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1bc3e1de", "metadata": {}, "source": [ "**Calculate $\\beta_0$:**" ] }, { "cell_type": "markdown", "id": "337fa8e4", "metadata": {}, "source": [ "With $\\hat{\\beta}_1$ in hand, we can then rearrange the line equation ($y = \\beta_0 + \\beta_1 x$) to estimate $\\beta_0$:\n", "$$ y = \\beta_0 + \\beta_1 x $$\n", "$$ \\beta_0 = y - \\beta_1 x $$\n", "\n", "We can use the mean of x and the mean of y for calculating $\\hat{\\beta_0}$ for all the data points:\n", "$$ \\hat{\\beta_0} = \\bar{y} - \\hat{\\beta_1} \\bar{x} $$" ] }, { "cell_type": "code", 
"execution_count": null, "id": "2597643f", "metadata": {}, "outputs": [], "source": [ "beta0 = np.mean(y) - beta1 * np.mean(x)\n", "beta0" ] }, { "cell_type": "markdown", "id": "c321517a", "metadata": {}, "source": [ "**Fit the Line:**" ] }, { "cell_type": "code", "execution_count": null, "id": "3ccf8b0a", "metadata": {}, "outputs": [], "source": [ "x = np.array([1, 2, 3, 4, 5, 6, 7.]) # study hours \n", "y = np.array([1.0, 1.3, 2.5, 2.6, 3.5, 3.7, 4.0]) # gpa\n", "sns.scatterplot(x=x, y=y)\n", "\n", "xpointsofline = np.linspace(0, 8, 1000)\n", "ypointsofline = beta0 + beta1 * xpointsofline\n", "sns.lineplot(x=xpointsofline, y=ypointsofline)\n", "\n", "\n", "plt.title(\"Study Hours vs GPA of Students\")\n", "plt.xlabel(\"Study Hours\")\n", "plt.ylabel(\"GPA\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "005aa2af", "metadata": { "id": "8XzKSflaJy-y" }, "source": [ "**Using the Model for Prediction:**\n", "- In regression model terms, if we were provided with `study hours` we can now use the parameter estimates $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ to predict the `GPA` of a student:\n", "$$ \\hat{y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 x_i $$\n", "\n", "\n", "\n", "- Let us suppose that a student has studied for one and a half hour per day in the entire semester. Can you predict his/her GPA? 
" ] }, { "cell_type": "code", "execution_count": null, "id": "49cded42", "metadata": { "id": "hnFtPTnHJy-z" }, "outputs": [], "source": [ "x_i = 1.5" ] }, { "cell_type": "code", "execution_count": null, "id": "9f11594f", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5eUHYRQnJy-z", "outputId": "c6a54aca-0ed9-452a-98b1-cabcb408e7c8" }, "outputs": [], "source": [ "y_i = beta0 + beta1*x_i\n", "y_i" ] }, { "cell_type": "code", "execution_count": null, "id": "92cc443d", "metadata": {}, "outputs": [], "source": [ "x = np.array([1, 2, 3, 4, 5, 6, 7.]) # study hours \n", "y = np.array([1.0, 1.3, 2.5, 2.6, 3.5, 3.7, 4.0]) # gpa\n", "sns.scatterplot(x=x, y=y)\n", "\n", "\n", "\n", "xpointsofline = np.linspace(0, 8, 1000)\n", "ypointsofline = beta0 + beta1 * xpointsofline\n", "sns.lineplot(x=xpointsofline, y=ypointsofline)\n", "\n", "plt.scatter(x_i, y_i, marker='o',s=100, color='black')\n", "\n", "plt.title(\"Study Hours vs GPA of Students\")\n", "plt.xlabel(\"Study Hours\")\n", "plt.ylabel(\"GPA\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "08acf380", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b3d66594", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "daf1d66f", "metadata": {}, "source": [ "### b. 
Example 2 (LLS):" ] }, { "cell_type": "code", "execution_count": null, "id": "0bcf6390", "metadata": {}, "outputs": [], "source": [ "x = np.array([0, 1, 2, 3, 4, 5, 6, 7.]) # Drug dosage in mL\n", "y = np.array([1.86, 1.31, .62, .33, .09, -.67, -1.23, -1.37]) # Level of forgetfulness\n", "sns.scatterplot(x=x, y=y)\n", "plt.title(\"Clinical Trial\")\n", "plt.xlabel(\"Drug dosage (mL)\")\n", "plt.ylabel(\"Forgetfulness\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "25b54d34", "metadata": {}, "source": [ "**Regression Equation for this scenario:**\n", "\n", "Here $x$ is the drug dosage (mL) and $y$ is the level of forgetfulness:\n", "$$ y = \\beta_0 + \\beta_1 x + \\epsilon $$" ] }, { "cell_type": "markdown", "id": "8c0baf5b", "metadata": {}, "source": [ "**Calculate $\\beta_1$:**" ] }, { "cell_type": "markdown", "id": "66c59c3f", "metadata": {}, "source": [ "In the case of a model with a single predictor $x$, there is a fairly straightforward **linear least squares** formula we can use to estimate $\\beta_1$: \n", "$$ \\hat{\\beta}_1 = \\frac{\\text{cov}(x,y)}{\\sigma^2_x} $$" ] }, { "cell_type": "code", "execution_count": null, "id": "18b8c645", "metadata": {}, "outputs": [], "source": [ "# To find cov(x,y) and var(x), use the covariance matrix\n", "cov_mat = np.cov(x, y)\n", "cov_mat" ] }, { "cell_type": "code", "execution_count": null, "id": "f01dd1d5", "metadata": {}, "outputs": [], "source": [ "beta1 = cov_mat[0,1]/cov_mat[0,0]\n", "beta1" ] }, { "cell_type": "markdown", "id": "558d67c2", "metadata": {}, "source": [ "**Calculate $\\beta_0$:**" ] }, { "cell_type": "code", "execution_count": null, "id": "0c8149a0", "metadata": {}, "outputs": [], "source": [ "beta0 = np.mean(y) - beta1 * np.mean(x)\n", "beta0" ] }, { "cell_type": "code", "execution_count": null, "id": "efc103bd", "metadata": {}, "outputs": [], "source": [ "x = np.array([0, 1, 2, 3, 4, 5, 6, 7.]) # Drug dosage in mL\n", "y = np.array([1.86, 1.31, .62, .33, .09, -.67, -1.23, -1.37]) # Level of forgetfulness\n", "sns.scatterplot(x=x, y=y)\n", "\n", "xpointsofline = np.linspace(0, 7, 1000)\n", 
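"# Optional sanity check (an added sketch, not part of the original lecture code).\n", "# Assumption: np.polyfit with degree 1 fits the same least-squares line, so its\n", "# slope and intercept should agree with beta1 and beta0 computed above.\n", "# slope_check and intercept_check are illustrative names introduced here.\n", "slope_check, intercept_check = np.polyfit(x, y, 1)\n", "assert np.isclose(slope_check, beta1) and np.isclose(intercept_check, beta0)\n", 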
"ypointsofline = beta0 + beta1 * xpointsofline\n", "sns.lineplot(x=xpointsofline, y=ypointsofline)\n", "\n", "\n", "plt.title(\"Clinical Trial\")\n", "plt.xlabel(\"Drug dosage (mL)\")\n", "plt.ylabel(\"Forgetfulness\")\n", "plt.show()\n", "\n" ] }, { "cell_type": "markdown", "id": "2bcd0f94", "metadata": {}, "source": [ "In regression model terms, if we were provided with `drug dosage` we can now use the parameter estimates $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ to predict the `forgetfullness` of a patient:\n", "$$ \\hat{y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 x_i $$" ] }, { "cell_type": "code", "execution_count": null, "id": "b51c2c2c", "metadata": {}, "outputs": [], "source": [ "x_i = 4" ] }, { "cell_type": "code", "execution_count": null, "id": "0954c282", "metadata": {}, "outputs": [], "source": [ "y_i = beta0 + beta1*x_i\n", "y_i" ] }, { "cell_type": "code", "execution_count": null, "id": "6c48a580", "metadata": {}, "outputs": [], "source": [ "x = np.array([0, 1, 2, 3, 4, 5, 6, 7.]) # Drug dosage in ml\n", "y = np.array([1.86, 1.31, .62, .33, .09, -.67, -1.23, -1.37]) # Level of forgetfullness\n", "sns.scatterplot(x=x, y=y)\n", "\n", "xpointsofline = np.linspace(0, 7, 1000)\n", "ypointsofline = beta0 + beta1 * xpointsofline\n", "sns.lineplot(x=xpointsofline, y=ypointsofline)\n", "\n", "plt.scatter(x_i, y_i, marker='o',s=100, color='black')\n", "\n", "plt.title(\"Clinical Trial\")\n", "plt.xlabel(\"Drug dosage (mL)\")\n", "plt.ylabel(\"Forgetfulness\")\n", "plt.grid(True)\n", "plt.show()\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7e93ac46", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2e56761f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e8d34e19", "metadata": {}, "source": [ "### b. Example 3 (LLS):\n", "With data from female Adélie penguins, create a linear least squares model that predicts body mass with flipper length. 
Predict the mass of a female Adélie penguin that has a flipper length of 197mm." ] }, { "cell_type": "code", "execution_count": null, "id": "2583fc2b", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "bAHMnL66lFqC", "outputId": "eef47f1a-f7a3-43d6-83ed-b5749107ec58" }, "outputs": [], "source": [ "penguins = sns.load_dataset('penguins')\n", "penguins.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "c043f970", "metadata": {}, "outputs": [], "source": [ "np.unique(penguins.species, return_counts=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "0431b26a", "metadata": {}, "outputs": [], "source": [ "adelie = penguins[penguins.species == 'Adelie']\n", "adelie.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "f0de76ec", "metadata": { "id": "ez3YcuanhuHI" }, "outputs": [], "source": [ "x = adelie[adelie.sex == 'Female']['flipper_length_mm'].to_numpy()\n", "y = adelie[adelie.sex == 'Female']['body_mass_g'].to_numpy()/1000" ] }, { "cell_type": "code", "execution_count": null, "id": "a9d008e7", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "U7piwPvujy_S", "outputId": "ec4f0be4-4e5a-476c-c8f3-69efeff98458" }, "outputs": [],
"source": [ "sns.scatterplot(x=x, y=y)\n", "plt.title(\"Female Adélie Penguins\")\n", "plt.xlabel(\"Flipper Length (mm)\")\n", "plt.ylabel(\"Body Mass (kg)\")\n", "plt.show();" ] }, { "cell_type": "markdown", "id": "c51e8cdb", "metadata": {}, "source": [ "**Calculate $\\beta_1$:**" ] }, { "cell_type": "markdown", "id": "01e9102e", "metadata": {}, "source": [ "In the case of a model with a single predictor $x$, there is a fairly straightforward **linear least squares** formula we can use to estimate $\\beta_1$: \n", "$$ \\hat{\\beta}_1 = \\frac{\\text{cov}(x,y)}{\\sigma^2_x} $$" ] }, { "cell_type": "code", "execution_count": null, "id": "0b0e4522", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gqUW4gVailR5", "outputId": "f21df053-5529-433c-c992-1ea7bf676f52" }, "outputs": [], "source": [ "cov_mat = np.cov(x, y)\n", "cov_mat" ] }, { "cell_type": 
"code", "execution_count": null, "id": "5ded2439", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UyX0jsOVjD6V", "outputId": "0b41cf5e-9b19-44a8-d0a7-c4a96269cb29" }, "outputs": [], "source": [ "beta1 = cov_mat[0,1]/cov_mat[0,0]\n", "beta1" ] }, { "cell_type": "markdown", "id": "847f8a5b", "metadata": {}, "source": [ "**Calculate $\\beta_0$:**" ] }, { "cell_type": "markdown", "id": "c398ddc5", "metadata": {}, "source": [ "With $\\hat{\\beta}_1$ in hand, we can then rearrange the line equation ($y = \\beta_0 + \\beta_1 x$) to estimate $\\beta_0$:\n", "$$ y = \\beta_0 + \\beta_1 x $$\n", "$$ \\beta_0 = y - \\beta_1 x $$\n", "\n", "We can use the mean of x and the mean of y for calculating $\\hat{\\beta_0}$ for all the data points:\n", "$$ \\hat{\\beta_0} = \\bar{y} - \\hat{\\beta_1} \\bar{x} $$" ] }, { "cell_type": "code", "execution_count": null, "id": "4b2ed59c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0JO7HmUgjEYd", "outputId": "b3823402-ea3a-4154-9a3f-23c1fa9a3f89" }, "outputs": [], "source": [ "beta0 = y.mean() - beta1*x.mean()\n", "beta0" ] }, { "cell_type": "markdown", "id": "fcfeb524", "metadata": {}, "source": [ "**Fit the Line:**" ] }, { "cell_type": "code", "execution_count": null, "id": "e5be0551", "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(x=x, y=y)\n", "\n", "xline = np.linspace(170, 205, 1000)\n", "yline = beta0 + beta1*xline\n", "sns.lineplot(x=xline, y=yline, color='orange')\n", "\n", "plt.title(\"Female Adélie Penguins\")\n", "plt.xlabel(\"Flipper Length (mm)\")\n", "plt.ylabel(\"Body Mass (kg)\")\n", "plt.show();" ] }, { "cell_type": "markdown", "id": "5f1dfdfb", "metadata": {}, "source": [ "**Using the Model for Prediction:**\n", "- In regression model terms, if we were provided with `flipper length` we can now use the parameter estimates $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ to predict the `body mass` of a penguin:\n", "$$ \\hat{y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 
x_i $$\n", "\n", "\n", "\n", "- Let us suppose that the `flipper length` of a penguin is 175 mm. Can you predict its `body mass`? " ] }, { "cell_type": "code", "execution_count": null, "id": "8bbbe859", "metadata": { "id": "o2ZfBlXUnFrL" }, "outputs": [], "source": [ "x_i = 175" ] }, { "cell_type": "code", "execution_count": null, "id": "cb58b651", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iOEfmJ7Um9Zz", "outputId": "3ee0a90c-14c0-4d53-86f3-00d3cc160fc5" }, "outputs": [], "source": [ "y_i = beta0 + beta1*x_i\n", "y_i" ] }, { "cell_type": "code", "execution_count": null, "id": "711a66e7", "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(x=x, y=y)\n", "\n", "xline = np.linspace(170, 205, 1000)\n", "yline = beta0 + beta1*xline\n", "sns.lineplot(x=xline, y=yline, color='orange')\n", "\n", "plt.title(\"Female Adélie Penguins\")\n", "plt.xlabel(\"Flipper Length (mm)\")\n", "plt.ylabel(\"Body Mass (kg)\")\n", "\n", "\n", "plt.scatter(x_i, y_i, marker='o', s=100, color='purple');\n", "plt.grid(True)\n", "plt.show();" ] }, { "cell_type": "code", "execution_count": null, "id": "64ea204a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }