{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "650e0648",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Statistical Analysis with Python\"\n",
    "description: \"Learn statistical analysis using Python's scipy and pandas libraries with real survey data\"\n",
    "date: 2025-01-27\n",
    "lastmod: 2025-01-27\n",
    "author: \"Zer0-Mistakes Team\"\n",
    "layout: notebook\n",
    "difficulty: intermediate\n",
    "tags: [python, statistics, scipy, data-analysis, surveys]\n",
    "categories: [Notebooks, Tutorials]\n",
    "toc: true\n",
    "comments: true\n",
    "---\n",
    "\n",
    "# Statistical Analysis with Python\n",
    "\n",
    "Learn to perform statistical analysis using Python's powerful libraries. This tutorial covers descriptive statistics, hypothesis testing, correlation analysis, and more using real survey response data.\n",
    "\n",
    "**What you'll learn:**\n",
    "- Descriptive statistics (mean, median, mode, variance)\n",
    "- Correlation analysis\n",
    "- Hypothesis testing (t-tests, chi-square)\n",
    "- Normal distribution and normality testing\n",
    "- Confidence intervals"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4aa8976a",
   "metadata": {},
   "source": [
    "## Setup and Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "25a1e67b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Libraries imported successfully!\n",
      "Pandas: 3.0.0\n",
      "NumPy: 2.4.2\n",
      "SciPy: 1.17.0\n"
     ]
    }
   ],
   "source": [
    "# Import statistical and data libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import scipy\n",
    "from scipy import stats\n",
    "from scipy.stats import ttest_ind, chi2_contingency, pearsonr, spearmanr\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print(\"✅ Libraries imported successfully!\")\n",
    "print(f\"Pandas: {pd.__version__}\")\n",
    "print(f\"NumPy: {np.__version__}\")\n",
    "print(f\"SciPy: {scipy.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63c12e25",
   "metadata": {},
   "source": [
    "## Load Survey Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5bd6d2fb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📊 Survey Data Preview:\n",
      "Shape: 75 respondents × 13 questions\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>respondent_id</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "      <th>education</th>\n",
       "      <th>employment</th>\n",
       "      <th>income_bracket</th>\n",
       "      <th>product_satisfaction</th>\n",
       "      <th>service_rating</th>\n",
       "      <th>would_recommend</th>\n",
       "      <th>purchase_frequency</th>\n",
       "      <th>category_preference</th>\n",
       "      <th>feedback_length</th>\n",
       "      <th>response_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>28</td>\n",
       "      <td>Female</td>\n",
       "      <td>Bachelor's</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>50000-75000</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Monthly</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>142</td>\n",
       "      <td>2025-01-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>35</td>\n",
       "      <td>Male</td>\n",
       "      <td>Master's</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>75000-100000</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Weekly</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>89</td>\n",
       "      <td>2025-01-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>42</td>\n",
       "      <td>Female</td>\n",
       "      <td>Bachelor's</td>\n",
       "      <td>Part-time</td>\n",
       "      <td>25000-50000</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>Maybe</td>\n",
       "      <td>Quarterly</td>\n",
       "      <td>Furniture</td>\n",
       "      <td>156</td>\n",
       "      <td>2025-01-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>Male</td>\n",
       "      <td>High School</td>\n",
       "      <td>Student</td>\n",
       "      <td>Under 25000</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Monthly</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>45</td>\n",
       "      <td>2025-01-18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>51</td>\n",
       "      <td>Female</td>\n",
       "      <td>Doctorate</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>100000+</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Weekly</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>203</td>\n",
       "      <td>2025-01-19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>31</td>\n",
       "      <td>Non-binary</td>\n",
       "      <td>Bachelor's</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>50000-75000</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Monthly</td>\n",
       "      <td>Furniture</td>\n",
       "      <td>78</td>\n",
       "      <td>2025-01-20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7</td>\n",
       "      <td>45</td>\n",
       "      <td>Male</td>\n",
       "      <td>Master's</td>\n",
       "      <td>Self-employed</td>\n",
       "      <td>75000-100000</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>No</td>\n",
       "      <td>Rarely</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>312</td>\n",
       "      <td>2025-01-21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8</td>\n",
       "      <td>27</td>\n",
       "      <td>Female</td>\n",
       "      <td>Bachelor's</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>50000-75000</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Monthly</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>67</td>\n",
       "      <td>2025-01-22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>9</td>\n",
       "      <td>38</td>\n",
       "      <td>Male</td>\n",
       "      <td>Bachelor's</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>75000-100000</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Quarterly</td>\n",
       "      <td>Furniture</td>\n",
       "      <td>95</td>\n",
       "      <td>2025-01-23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>10</td>\n",
       "      <td>56</td>\n",
       "      <td>Female</td>\n",
       "      <td>High School</td>\n",
       "      <td>Retired</td>\n",
       "      <td>25000-50000</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Monthly</td>\n",
       "      <td>Furniture</td>\n",
       "      <td>124</td>\n",
       "      <td>2025-01-24</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   respondent_id  age      gender    education     employment income_bracket  \\\n",
       "0              1   28      Female   Bachelor's      Full-time    50000-75000   \n",
       "1              2   35        Male     Master's      Full-time   75000-100000   \n",
       "2              3   42      Female   Bachelor's      Part-time    25000-50000   \n",
       "3              4   23        Male  High School        Student    Under 25000   \n",
       "4              5   51      Female    Doctorate      Full-time        100000+   \n",
       "5              6   31  Non-binary   Bachelor's      Full-time    50000-75000   \n",
       "6              7   45        Male     Master's  Self-employed   75000-100000   \n",
       "7              8   27      Female   Bachelor's      Full-time    50000-75000   \n",
       "8              9   38        Male   Bachelor's      Full-time   75000-100000   \n",
       "9             10   56      Female  High School        Retired    25000-50000   \n",
       "\n",
       "   product_satisfaction  service_rating would_recommend purchase_frequency  \\\n",
       "0                     4               5             Yes            Monthly   \n",
       "1                     5               4             Yes             Weekly   \n",
       "2                     3               3           Maybe          Quarterly   \n",
       "3                     4               4             Yes            Monthly   \n",
       "4                     5               5             Yes             Weekly   \n",
       "5                     4               4             Yes            Monthly   \n",
       "6                     3               2              No             Rarely   \n",
       "7                     5               5             Yes            Monthly   \n",
       "8                     4               4             Yes          Quarterly   \n",
       "9                     4               5             Yes            Monthly   \n",
       "\n",
       "  category_preference  feedback_length response_date  \n",
       "0         Electronics              142    2025-01-15  \n",
       "1         Electronics               89    2025-01-16  \n",
       "2           Furniture              156    2025-01-17  \n",
       "3         Electronics               45    2025-01-18  \n",
       "4         Electronics              203    2025-01-19  \n",
       "5           Furniture               78    2025-01-20  \n",
       "6         Electronics              312    2025-01-21  \n",
       "7         Electronics               67    2025-01-22  \n",
       "8           Furniture               95    2025-01-23  \n",
       "9           Furniture              124    2025-01-24  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the survey response dataset\n",
    "survey = pd.read_csv('/Users/bamr87/github/zer0-mistakes/assets/data/notebooks/survey_responses.csv')\n",
    "\n",
    "print(\"📊 Survey Data Preview:\")\n",
    "print(f\"Shape: {survey.shape[0]} respondents × {survey.shape[1]} questions\\n\")\n",
    "survey.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58bc6f8d",
   "metadata": {},
   "source": [
    "## Descriptive Statistics\n",
    "\n",
    "Let's calculate key descriptive statistics for our numerical columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7417d28e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📈 Descriptive Statistics for Survey Responses:\n",
      "======================================================================\n",
      "\n",
      "AGE:\n",
      "  Mean:     38.28\n",
      "  Median:   37.00\n",
      "  Mode:     26\n",
      "  Std Dev:  11.24\n",
      "  Variance: 126.31\n",
      "  Range:    20 - 62\n",
      "  IQR:      18.00\n",
      "\n",
      "PRODUCT SATISFACTION:\n",
      "  Mean:     4.16\n",
      "  Median:   4.00\n",
      "  Mode:     4\n",
      "  Std Dev:  0.74\n",
      "  Variance: 0.54\n",
      "  Range:    2 - 5\n",
      "  IQR:      1.00\n",
      "\n",
      "SERVICE RATING:\n",
      "  Mean:     4.07\n",
      "  Median:   4.00\n",
      "  Mode:     4\n",
      "  Std Dev:  0.86\n",
      "  Variance: 0.74\n",
      "  Range:    2 - 5\n",
      "  IQR:      1.00\n",
      "\n",
      "FEEDBACK LENGTH:\n",
      "  Mean:     130.56\n",
      "  Median:   112.00\n",
      "  Mode:     98\n",
      "  Std Dev:  71.89\n",
      "  Variance: 5167.84\n",
      "  Range:    32 - 312\n",
      "  IQR:      97.50\n"
     ]
    }
   ],
   "source": [
    "# Calculate comprehensive descriptive statistics\n",
    "numeric_cols = ['age', 'product_satisfaction', 'service_rating', 'feedback_length']\n",
    "\n",
    "print(\"📈 Descriptive Statistics for Survey Responses:\")\n",
    "print(\"=\" * 70)\n",
    "\n",
    "for col in numeric_cols:\n",
    "    data = survey[col]\n",
    "    print(f\"\\n{col.upper().replace('_', ' ')}:\")\n",
    "    print(f\"  Mean:     {data.mean():.2f}\")\n",
    "    print(f\"  Median:   {data.median():.2f}\")\n",
    "    print(f\"  Mode:     {data.mode().values[0]}\")\n",
    "    print(f\"  Std Dev:  {data.std():.2f}\")\n",
    "    print(f\"  Variance: {data.var():.2f}\")\n",
    "    print(f\"  Range:    {data.min()} - {data.max()}\")\n",
    "    print(f\"  IQR:      {data.quantile(0.75) - data.quantile(0.25):.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e9cfbad",
   "metadata": {},
   "source": [
    "## Correlation Analysis\n",
    "\n",
    "Examine relationships between satisfaction metrics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f49742d6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔗 Correlation Matrix (Pearson):\n",
      "                      product_satisfaction  service_rating  feedback_length  \\\n",
      "product_satisfaction                 1.000           0.709           -0.309   \n",
      "service_rating                       0.709           1.000           -0.233   \n",
      "feedback_length                     -0.309          -0.233            1.000   \n",
      "age                                 -0.218          -0.005            0.647   \n",
      "\n",
      "                        age  \n",
      "product_satisfaction -0.218  \n",
      "service_rating       -0.005  \n",
      "feedback_length       0.647  \n",
      "age                   1.000  \n",
      "\n",
      "\n",
      "📊 Detailed Correlation Analysis:\n",
      "============================================================\n",
      "\n",
      "product_satisfaction vs service_rating:\n",
      "  Pearson r = 0.7093 (***)\n",
      "  p-value   = 0.0000\n",
      "  → Strong positive correlation\n",
      "\n",
      "age vs product_satisfaction:\n",
      "  Pearson r = -0.2179 (ns)\n",
      "  p-value   = 0.0604\n",
      "  → Weak negative correlation\n",
      "\n",
      "age vs service_rating:\n",
      "  Pearson r = -0.0048 (ns)\n",
      "  p-value   = 0.9677\n",
      "  → Weak negative correlation\n",
      "\n",
      "feedback_length vs product_satisfaction:\n",
      "  Pearson r = -0.3092 (**)\n",
      "  p-value   = 0.0069\n",
      "  → Weak negative correlation\n"
     ]
    }
   ],
   "source": [
    "# Calculate correlation matrix for satisfaction metrics\n",
    "satisfaction_cols = ['product_satisfaction', 'service_rating', 'feedback_length', 'age']\n",
    "correlation_matrix = survey[satisfaction_cols].corr()\n",
    "\n",
    "print(\"🔗 Correlation Matrix (Pearson):\")\n",
    "print(correlation_matrix.round(3))\n",
    "\n",
    "# Detailed pairwise correlations with significance\n",
    "print(\"\\n\\n📊 Detailed Correlation Analysis:\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "pairs = [\n",
    "    ('product_satisfaction', 'service_rating'),\n",
    "    ('age', 'product_satisfaction'),\n",
    "    ('age', 'service_rating'),\n",
    "    ('feedback_length', 'product_satisfaction')\n",
    "]\n",
    "\n",
    "for col1, col2 in pairs:\n",
    "    r, p_value = pearsonr(survey[col1], survey[col2])\n",
    "    significance = \"***\" if p_value < 0.001 else \"**\" if p_value < 0.01 else \"*\" if p_value < 0.05 else \"ns\"\n",
    "    print(f\"\\n{col1} vs {col2}:\")\n",
    "    print(f\"  Pearson r = {r:.4f} ({significance})\")\n",
    "    print(f\"  p-value   = {p_value:.4f}\")\n",
    "    if abs(r) >= 0.7:\n",
    "        strength = \"strong\"\n",
    "    elif abs(r) >= 0.4:\n",
    "        strength = \"moderate\"\n",
    "    else:\n",
    "        strength = \"weak\"\n",
    "    direction = \"positive\" if r > 0 else \"negative\"\n",
    "    print(f\"  → {strength.capitalize()} {direction} correlation\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30248e44",
   "metadata": {},
   "source": [
    "## Hypothesis Testing: T-Tests\n",
    "\n",
    "Compare satisfaction scores between different groups:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e273a147",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🧪 Independent Samples T-Test: Product Satisfaction by Gender\n",
      "============================================================\n",
      "\n",
      "Group Statistics:\n",
      "  Male   (n=35):   M = 4.00, SD = 0.77\n",
      "  Female (n=36): M = 4.28, SD = 0.70\n",
      "\n",
      "Test Results:\n",
      "  t-statistic = -1.5932\n",
      "  p-value     = 0.1157\n",
      "\n",
      "Conclusion at α = 0.05:\n",
      "  ✗ FAIL TO REJECT null hypothesis - no significant difference\n"
     ]
    }
   ],
   "source": [
    "# Independent samples t-test: Compare satisfaction between genders\n",
    "male_satisfaction = survey[survey['gender'] == 'Male']['product_satisfaction']\n",
    "female_satisfaction = survey[survey['gender'] == 'Female']['product_satisfaction']\n",
    "\n",
    "t_stat, p_value = ttest_ind(male_satisfaction, female_satisfaction)\n",
    "\n",
    "print(\"🧪 Independent Samples T-Test: Product Satisfaction by Gender\")\n",
    "print(\"=\" * 60)\n",
    "print(f\"\\nGroup Statistics:\")\n",
    "print(f\"  Male   (n={len(male_satisfaction)}):   M = {male_satisfaction.mean():.2f}, SD = {male_satisfaction.std():.2f}\")\n",
    "print(f\"  Female (n={len(female_satisfaction)}): M = {female_satisfaction.mean():.2f}, SD = {female_satisfaction.std():.2f}\")\n",
    "print(f\"\\nTest Results:\")\n",
    "print(f\"  t-statistic = {t_stat:.4f}\")\n",
    "print(f\"  p-value     = {p_value:.4f}\")\n",
    "print(f\"\\nConclusion at α = 0.05:\")\n",
    "if p_value < 0.05:\n",
    "    print(\"  ✓ REJECT null hypothesis - significant difference exists\")\n",
    "else:\n",
    "    print(\"  ✗ FAIL TO REJECT null hypothesis - no significant difference\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef914e6c",
   "metadata": {},
   "source": [
    "## Chi-Square Test\n",
    "\n",
    "Test for association between categorical variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1bcd7356",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📋 Contingency Table: Purchase Frequency × Category Preference\n",
      "category_preference  Electronics  Furniture\n",
      "purchase_frequency                         \n",
      "Monthly                       23         19\n",
      "Quarterly                      4          8\n",
      "Rarely                         3          4\n",
      "Weekly                        14          0\n",
      "\n",
      "\n",
      "🧪 Chi-Square Test of Independence\n",
      "============================================================\n",
      "\n",
      "Results:\n",
      "  Chi-square statistic = 14.0252\n",
      "  Degrees of freedom   = 3\n",
      "  p-value             = 0.0029\n",
      "\n",
      "Conclusion at α = 0.05:\n",
      "  ✓ REJECT null hypothesis - variables are DEPENDENT\n",
      "  → Purchase frequency IS associated with category preference\n"
     ]
    }
   ],
   "source": [
    "# Chi-square test: Association between purchase frequency and category preference\n",
    "contingency_table = pd.crosstab(survey['purchase_frequency'], survey['category_preference'])\n",
    "\n",
    "print(\"📋 Contingency Table: Purchase Frequency × Category Preference\")\n",
    "print(contingency_table)\n",
    "\n",
    "chi2, p_value, dof, expected = chi2_contingency(contingency_table)\n",
    "\n",
    "print(\"\\n\\n🧪 Chi-Square Test of Independence\")\n",
    "print(\"=\" * 60)\n",
    "print(f\"\\nResults:\")\n",
    "print(f\"  Chi-square statistic = {chi2:.4f}\")\n",
    "print(f\"  Degrees of freedom   = {dof}\")\n",
    "print(f\"  p-value             = {p_value:.4f}\")\n",
    "print(f\"\\nConclusion at α = 0.05:\")\n",
    "if p_value < 0.05:\n",
    "    print(\"  ✓ REJECT null hypothesis - variables are DEPENDENT\")\n",
    "    print(\"  → Purchase frequency IS associated with category preference\")\n",
    "else:\n",
    "    print(\"  ✗ FAIL TO REJECT null hypothesis - variables are INDEPENDENT\")\n",
    "    print(\"  → No significant association between purchase frequency and category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b578c79",
   "metadata": {},
   "source": [
    "## Normality Testing\n",
    "\n",
    "Check if satisfaction scores follow a normal distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "10b6a01f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📐 Normality Tests for Product Satisfaction Scores\n",
      "============================================================\n",
      "\n",
      "1. Shapiro-Wilk Test:\n",
      "   W-statistic = 0.8159\n",
      "   p-value     = 0.0000\n",
      "\n",
      "2. D'Agostino-Pearson Test:\n",
      "   K² statistic = 3.1266\n",
      "   p-value      = 0.2094\n",
      "\n",
      "3. Distribution Shape:\n",
      "   Skewness = -0.4623 (left-skewed)\n",
      "   Kurtosis = -0.3638 (platykurtic)\n",
      "\n",
      "📊 Conclusion:\n",
      "   Data deviates significantly from normal distribution (p < 0.05)\n"
     ]
    }
   ],
   "source": [
    "# Test normality of satisfaction scores using multiple methods\n",
    "data = survey['product_satisfaction']\n",
    "\n",
    "print(\"📐 Normality Tests for Product Satisfaction Scores\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "# Shapiro-Wilk Test (best for n < 5000)\n",
    "shapiro_stat, shapiro_p = stats.shapiro(data)\n",
    "print(f\"\\n1. Shapiro-Wilk Test:\")\n",
    "print(f\"   W-statistic = {shapiro_stat:.4f}\")\n",
    "print(f\"   p-value     = {shapiro_p:.4f}\")\n",
    "\n",
    "# D'Agostino's K-squared Test\n",
    "dagostino_stat, dagostino_p = stats.normaltest(data)\n",
    "print(f\"\\n2. D'Agostino-Pearson Test:\")\n",
    "print(f\"   K² statistic = {dagostino_stat:.4f}\")\n",
    "print(f\"   p-value      = {dagostino_p:.4f}\")\n",
    "\n",
    "# Skewness and Kurtosis\n",
    "skew = stats.skew(data)\n",
    "kurt = stats.kurtosis(data)\n",
    "print(f\"\\n3. Distribution Shape:\")\n",
    "print(f\"   Skewness = {skew:.4f} ({'right-skewed' if skew > 0 else 'left-skewed' if skew < 0 else 'symmetric'})\")\n",
    "print(f\"   Kurtosis = {kurt:.4f} ({'leptokurtic' if kurt > 0 else 'platykurtic' if kurt < 0 else 'mesokurtic'})\")\n",
    "\n",
    "print(f\"\\n📊 Conclusion:\")\n",
    "if shapiro_p > 0.05:\n",
    "    print(\"   Data appears to be normally distributed (p > 0.05)\")\n",
    "else:\n",
    "    print(\"   Data deviates significantly from normal distribution (p < 0.05)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c71ceffe",
   "metadata": {},
   "source": [
    "## Confidence Intervals\n",
    "\n",
    "Calculate confidence intervals for key metrics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "20826442",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📏 95% Confidence Intervals\n",
      "============================================================\n",
      "\n",
      "Product Satisfaction:\n",
      "  Sample Mean: 4.16\n",
      "  95% CI: [3.99, 4.33]\n",
      "  → We are 95% confident the true population mean\n",
      "    falls between 3.99 and 4.33\n",
      "\n",
      "Service Rating:\n",
      "  Sample Mean: 4.07\n",
      "  95% CI: [3.87, 4.26]\n",
      "  → We are 95% confident the true population mean\n",
      "    falls between 3.87 and 4.26\n",
      "\n",
      "Feedback Length:\n",
      "  Sample Mean: 130.56\n",
      "  95% CI: [114.02, 147.10]\n",
      "  → We are 95% confident the true population mean\n",
      "    falls between 114.02 and 147.10\n",
      "\n",
      "Age:\n",
      "  Sample Mean: 38.28\n",
      "  95% CI: [35.69, 40.87]\n",
      "  → We are 95% confident the true population mean\n",
      "    falls between 35.69 and 40.87\n"
     ]
    }
   ],
   "source": [
    "# Calculate 95% confidence intervals for satisfaction metrics\n",
    "def confidence_interval(data, confidence=0.95):\n",
    "    \"\"\"Calculate confidence interval for mean\"\"\"\n",
    "    n = len(data)\n",
    "    mean = np.mean(data)\n",
    "    se = stats.sem(data)  # Standard error of the mean\n",
    "    h = se * stats.t.ppf((1 + confidence) / 2, n - 1)  # Margin of error\n",
    "    return mean, mean - h, mean + h\n",
    "\n",
    "print(\"📏 95% Confidence Intervals\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "metrics = {\n",
    "    'Product Satisfaction': survey['product_satisfaction'],\n",
    "    'Service Rating': survey['service_rating'],\n",
    "    'Feedback Length': survey['feedback_length'],\n",
    "    'Age': survey['age']\n",
    "}\n",
    "\n",
    "for name, data in metrics.items():\n",
    "    mean, lower, upper = confidence_interval(data)\n",
    "    print(f\"\\n{name}:\")\n",
    "    print(f\"  Sample Mean: {mean:.2f}\")\n",
    "    print(f\"  95% CI: [{lower:.2f}, {upper:.2f}]\")\n",
    "    print(f\"  → We are 95% confident the true population mean\")\n",
    "    print(f\"    falls between {lower:.2f} and {upper:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4085a33",
   "metadata": {},
   "source": [
    "## Summary Statistics by Group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e135c1e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📊 SURVEY ANALYSIS SUMMARY\n",
      "======================================================================\n",
      "\n",
      "📋 Dataset Overview:\n",
      "   Total Respondents: 75\n",
      "   Average Age: 38.3 years\n",
      "   Gender Distribution: {'Female': np.int64(36), 'Male': np.int64(35), 'Non-binary': np.int64(4)}\n",
      "\n",
      "🎯 Key Satisfaction Metrics:\n",
      "   Product Satisfaction: 4.16/5\n",
      "   Service Rating: 4.07/5\n",
      "   Would Recommend: 58/75 (77.3%)\n",
      "\n",
      "💻 Category Preferences:\n",
      "   Electronics: 44 (58.7%)\n",
      "   Furniture: 31 (41.3%)\n",
      "\n",
      "⏱️ Purchase Frequency:\n",
      "   Monthly: 42 (56.0%)\n",
      "   Weekly: 14 (18.7%)\n",
      "   Quarterly: 12 (16.0%)\n",
      "   Rarely: 7 (9.3%)\n",
      "\n",
      "======================================================================\n"
     ]
    }
   ],
   "source": [
    "# Generate comprehensive summary statistics by demographic groups\n",
    "print(\"📊 SURVEY ANALYSIS SUMMARY\")\n",
    "print(\"=\" * 70)\n",
    "\n",
    "# Overall statistics\n",
    "print(f\"\\n📋 Dataset Overview:\")\n",
    "print(f\"   Total Respondents: {len(survey)}\")\n",
    "print(f\"   Average Age: {survey['age'].mean():.1f} years\")\n",
    "print(f\"   Gender Distribution: {dict(survey['gender'].value_counts())}\")\n",
    "\n",
    "# Key findings\n",
    "print(f\"\\n🎯 Key Satisfaction Metrics:\")\n",
    "print(f\"   Product Satisfaction: {survey['product_satisfaction'].mean():.2f}/5\")\n",
    "print(f\"   Service Rating: {survey['service_rating'].mean():.2f}/5\")\n",
    "recommend_yes = (survey['would_recommend'] == 'Yes').sum()\n",
    "print(f\"   Would Recommend: {recommend_yes}/{len(survey)} ({100*recommend_yes/len(survey):.1f}%)\")\n",
    "\n",
    "# Category preferences\n",
    "print(f\"\\n💻 Category Preferences:\")\n",
    "category_counts = survey['category_preference'].value_counts()\n",
    "for category, count in category_counts.items():\n",
    "    pct = (count / len(survey)) * 100\n",
    "    print(f\"   {category}: {count} ({pct:.1f}%)\")\n",
    "\n",
    "# Purchase patterns\n",
    "print(f\"\\n⏱️ Purchase Frequency:\")\n",
    "frequency_counts = survey['purchase_frequency'].value_counts()\n",
    "for freq, count in frequency_counts.items():\n",
    "    pct = (count / len(survey)) * 100\n",
    "    print(f\"   {freq}: {count} ({pct:.1f}%)\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 70)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee08aedd",
   "metadata": {},
   "source": [
    "## Next Steps\n",
    "\n",
    "This tutorial covered the fundamentals of statistical analysis with Python. To continue learning:\n",
    "\n",
    "1. **Visualize your statistics** - Check out the [Matplotlib Visualization](/notebooks/matplotlib-visualization/) tutorial\n",
    "2. **Analyze larger datasets** - See the [Pandas Data Analysis](/notebooks/pandas-data-analysis/) tutorial\n",
    "3. **Fetch external data** - Learn about APIs in the [API Requests](/notebooks/api-requests/) tutorial\n",
    "\n",
    "**Key Takeaways:**\n",
    "- Use `describe()` for quick descriptive statistics\n",
    "- `scipy.stats` provides comprehensive hypothesis testing tools\n",
    "- Always check assumptions (normality, equal variances) before parametric tests\n",
    "- Correlation ≠ causation - always interpret results carefully\n",
    "- Report confidence intervals alongside point estimates"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}