{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solution for Exercise 01\n",
    "\n",
    "The goal of is to compare the performance of our classifier to some baseline classifier that  would ignore the input data and instead make constant predictions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"https://www.openml.org/data/get_csv/1595261/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = df[target_name].to_numpy()\n",
    "data = df.drop(columns=[target_name, \"fnlwgt\"])\n",
    "numerical_columns = [\n",
    "    c for c in data.columns if data[c].dtype.kind in [\"i\", \"f\"]]\n",
    "data_numeric = data[numerical_columns]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "high_revenue_clf = DummyClassifier(strategy=\"constant\",\n",
    "                                   constant=\" >50K\")\n",
    "scores = cross_val_score(high_revenue_clf, data_numeric, target)\n",
    "print(f\"{scores.mean():.3f} +/- {scores.std():.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "low_revenue_clf = DummyClassifier(strategy=\"constant\",\n",
    "                                  constant=\" <=50K\")\n",
    "scores = cross_val_score(low_revenue_clf, data_numeric, target)\n",
    "print(f\"{scores.mean():.3f} +/- {scores.std():.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "most_freq_revenue_clf = DummyClassifier(strategy=\"most_frequent\")\n",
    "scores = cross_val_score(most_freq_revenue_clf, data_numeric, target)\n",
    "print(f\"{scores.mean():.3f} +/- {scores.std():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So 81% accuracy is significantly better than 76% which is the score of a baseline model that would always predict the most frequent class which is the low revenue class: `\" <=50K\"`.\n",
    "\n",
    "In this dataset, we can see that the target classes are imbalanced: almost 3/4 of the records are people with a revenue below 50K:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"class\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(target == \" <=50K\").mean()"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "python_scripts//py:percent,notebooks//ipynb"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}