{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 144,
   "id": "31af659a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt #visualisation\n",
    "import seaborn as sns #visualisation\n",
    "import numpy as np "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ab38a44",
   "metadata": {},
   "source": [
    "Here I have loaded the dataset. To save myself from typing 'aps_failure.csv' every single time I have given the dataset a simplfied name 'afs'. Line 1 below tells the program where the data is while line 2 renames it for ease of use. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "id": "0ed5ef06",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('aps_failure_set.csv')\n",
    "afs=pd.read_csv('aps_failure_set.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91500a52",
   "metadata": {},
   "source": [
    "Exploratory Analysis. \n",
    "I am gathering some very basic information on my datset so I know what I'm dealing with. I start this process with gathering basic information "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "id": "b10e9b31",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(60000, 171)"
      ]
     },
     "execution_count": 146,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4edc7e5",
   "metadata": {},
   "source": [
    "The afs.shape above has told me I am dealing with a datset that has 171 columns and 60,000 rows. I will now use the afs.describe(include=object) function to provide me with some basic statistics on the data. This is useful for the following reasons:\n",
    "\n",
    "-Count shows me that\n",
    "-Unique showes me that\n",
    "-Top shows me that\n",
    "-Freq shows me that"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "id": "7944cac4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>ab_000</th>\n",
       "      <th>ac_000</th>\n",
       "      <th>ad_000</th>\n",
       "      <th>ae_000</th>\n",
       "      <th>af_000</th>\n",
       "      <th>ag_000</th>\n",
       "      <th>ag_001</th>\n",
       "      <th>ag_002</th>\n",
       "      <th>ag_003</th>\n",
       "      <th>...</th>\n",
       "      <th>ee_002</th>\n",
       "      <th>ee_003</th>\n",
       "      <th>ee_004</th>\n",
       "      <th>ee_005</th>\n",
       "      <th>ee_006</th>\n",
       "      <th>ee_007</th>\n",
       "      <th>ee_008</th>\n",
       "      <th>ee_009</th>\n",
       "      <th>ef_000</th>\n",
       "      <th>eg_000</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>...</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "      <td>60000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>2</td>\n",
       "      <td>30</td>\n",
       "      <td>2062</td>\n",
       "      <td>1887</td>\n",
       "      <td>334</td>\n",
       "      <td>419</td>\n",
       "      <td>155</td>\n",
       "      <td>618</td>\n",
       "      <td>2423</td>\n",
       "      <td>7880</td>\n",
       "      <td>...</td>\n",
       "      <td>34489</td>\n",
       "      <td>31712</td>\n",
       "      <td>35189</td>\n",
       "      <td>36289</td>\n",
       "      <td>31796</td>\n",
       "      <td>30470</td>\n",
       "      <td>24214</td>\n",
       "      <td>9725</td>\n",
       "      <td>29</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>neg</td>\n",
       "      <td>na</td>\n",
       "      <td>0</td>\n",
       "      <td>na</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>59000</td>\n",
       "      <td>46329</td>\n",
       "      <td>8752</td>\n",
       "      <td>14861</td>\n",
       "      <td>55543</td>\n",
       "      <td>55476</td>\n",
       "      <td>59133</td>\n",
       "      <td>58587</td>\n",
       "      <td>56181</td>\n",
       "      <td>46894</td>\n",
       "      <td>...</td>\n",
       "      <td>1364</td>\n",
       "      <td>1557</td>\n",
       "      <td>1797</td>\n",
       "      <td>2814</td>\n",
       "      <td>4458</td>\n",
       "      <td>7898</td>\n",
       "      <td>17280</td>\n",
       "      <td>31863</td>\n",
       "      <td>57021</td>\n",
       "      <td>56794</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4 rows × 170 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        class ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ag_003  \\\n",
       "count   60000  60000  60000  60000  60000  60000  60000  60000  60000  60000   \n",
       "unique      2     30   2062   1887    334    419    155    618   2423   7880   \n",
       "top       neg     na      0     na      0      0      0      0      0      0   \n",
       "freq    59000  46329   8752  14861  55543  55476  59133  58587  56181  46894   \n",
       "\n",
       "        ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000  \\\n",
       "count   ...  60000  60000  60000  60000  60000  60000  60000  60000  60000   \n",
       "unique  ...  34489  31712  35189  36289  31796  30470  24214   9725     29   \n",
       "top     ...      0      0      0      0      0      0      0      0      0   \n",
       "freq    ...   1364   1557   1797   2814   4458   7898  17280  31863  57021   \n",
       "\n",
       "       eg_000  \n",
       "count   60000  \n",
       "unique     50  \n",
       "top         0  \n",
       "freq    56794  \n",
       "\n",
       "[4 rows x 170 columns]"
      ]
     },
     "execution_count": 147,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs.describe(include=object)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfb0b758",
   "metadata": {},
   "source": [
    "Now I begin to view the data. data.head(10) gives me the first 10 rows of the data.\n",
    "\n",
    "This allows me to get an understanding of what I am actually dealing with. It is a good way"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "id": "aee8d2fb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>aa_000</th>\n",
       "      <th>ab_000</th>\n",
       "      <th>ac_000</th>\n",
       "      <th>ad_000</th>\n",
       "      <th>ae_000</th>\n",
       "      <th>af_000</th>\n",
       "      <th>ag_000</th>\n",
       "      <th>ag_001</th>\n",
       "      <th>ag_002</th>\n",
       "      <th>...</th>\n",
       "      <th>ee_002</th>\n",
       "      <th>ee_003</th>\n",
       "      <th>ee_004</th>\n",
       "      <th>ee_005</th>\n",
       "      <th>ee_006</th>\n",
       "      <th>ee_007</th>\n",
       "      <th>ee_008</th>\n",
       "      <th>ee_009</th>\n",
       "      <th>ef_000</th>\n",
       "      <th>eg_000</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>neg</td>\n",
       "      <td>76698</td>\n",
       "      <td>na</td>\n",
       "      <td>2130706438</td>\n",
       "      <td>280</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1240520</td>\n",
       "      <td>493384</td>\n",
       "      <td>721044</td>\n",
       "      <td>469792</td>\n",
       "      <td>339156</td>\n",
       "      <td>157956</td>\n",
       "      <td>73224</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>neg</td>\n",
       "      <td>33058</td>\n",
       "      <td>na</td>\n",
       "      <td>0</td>\n",
       "      <td>na</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>421400</td>\n",
       "      <td>178064</td>\n",
       "      <td>293306</td>\n",
       "      <td>245416</td>\n",
       "      <td>133654</td>\n",
       "      <td>81140</td>\n",
       "      <td>97576</td>\n",
       "      <td>1500</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>neg</td>\n",
       "      <td>41040</td>\n",
       "      <td>na</td>\n",
       "      <td>228</td>\n",
       "      <td>100</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>277378</td>\n",
       "      <td>159812</td>\n",
       "      <td>423992</td>\n",
       "      <td>409564</td>\n",
       "      <td>320746</td>\n",
       "      <td>158022</td>\n",
       "      <td>95128</td>\n",
       "      <td>514</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>neg</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "      <td>70</td>\n",
       "      <td>66</td>\n",
       "      <td>0</td>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>240</td>\n",
       "      <td>46</td>\n",
       "      <td>58</td>\n",
       "      <td>44</td>\n",
       "      <td>10</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>neg</td>\n",
       "      <td>60874</td>\n",
       "      <td>na</td>\n",
       "      <td>1368</td>\n",
       "      <td>458</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>622012</td>\n",
       "      <td>229790</td>\n",
       "      <td>405298</td>\n",
       "      <td>347188</td>\n",
       "      <td>286954</td>\n",
       "      <td>311560</td>\n",
       "      <td>433954</td>\n",
       "      <td>1218</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 171 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  class  aa_000 ab_000      ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002  \\\n",
       "0   neg   76698     na  2130706438    280      0      0      0      0      0   \n",
       "1   neg   33058     na           0     na      0      0      0      0      0   \n",
       "2   neg   41040     na         228    100      0      0      0      0      0   \n",
       "3   neg      12      0          70     66      0     10      0      0      0   \n",
       "4   neg   60874     na        1368    458      0      0      0      0      0   \n",
       "\n",
       "   ...   ee_002  ee_003  ee_004  ee_005  ee_006  ee_007  ee_008 ee_009 ef_000  \\\n",
       "0  ...  1240520  493384  721044  469792  339156  157956   73224      0      0   \n",
       "1  ...   421400  178064  293306  245416  133654   81140   97576   1500      0   \n",
       "2  ...   277378  159812  423992  409564  320746  158022   95128    514      0   \n",
       "3  ...      240      46      58      44      10       0       0      0      4   \n",
       "4  ...   622012  229790  405298  347188  286954  311560  433954   1218      0   \n",
       "\n",
       "  eg_000  \n",
       "0      0  \n",
       "1      0  \n",
       "2      0  \n",
       "3     32  \n",
       "4      0  \n",
       "\n",
       "[5 rows x 171 columns]"
      ]
     },
     "execution_count": 148,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the first 5 records\n",
    "afs.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "id": "445668ec",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>aa_000</th>\n",
       "      <th>ab_000</th>\n",
       "      <th>ac_000</th>\n",
       "      <th>ad_000</th>\n",
       "      <th>ae_000</th>\n",
       "      <th>af_000</th>\n",
       "      <th>ag_000</th>\n",
       "      <th>ag_001</th>\n",
       "      <th>ag_002</th>\n",
       "      <th>...</th>\n",
       "      <th>ee_002</th>\n",
       "      <th>ee_003</th>\n",
       "      <th>ee_004</th>\n",
       "      <th>ee_005</th>\n",
       "      <th>ee_006</th>\n",
       "      <th>ee_007</th>\n",
       "      <th>ee_008</th>\n",
       "      <th>ee_009</th>\n",
       "      <th>ef_000</th>\n",
       "      <th>eg_000</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>59995</th>\n",
       "      <td>neg</td>\n",
       "      <td>153002</td>\n",
       "      <td>na</td>\n",
       "      <td>664</td>\n",
       "      <td>186</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>998500</td>\n",
       "      <td>566884</td>\n",
       "      <td>1290398</td>\n",
       "      <td>1218244</td>\n",
       "      <td>1019768</td>\n",
       "      <td>717762</td>\n",
       "      <td>898642</td>\n",
       "      <td>28588</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59996</th>\n",
       "      <td>neg</td>\n",
       "      <td>2286</td>\n",
       "      <td>na</td>\n",
       "      <td>2130706538</td>\n",
       "      <td>224</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>10578</td>\n",
       "      <td>6760</td>\n",
       "      <td>21126</td>\n",
       "      <td>68424</td>\n",
       "      <td>136</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59997</th>\n",
       "      <td>neg</td>\n",
       "      <td>112</td>\n",
       "      <td>0</td>\n",
       "      <td>2130706432</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>792</td>\n",
       "      <td>386</td>\n",
       "      <td>452</td>\n",
       "      <td>144</td>\n",
       "      <td>146</td>\n",
       "      <td>2622</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59998</th>\n",
       "      <td>neg</td>\n",
       "      <td>80292</td>\n",
       "      <td>na</td>\n",
       "      <td>2130706432</td>\n",
       "      <td>494</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>699352</td>\n",
       "      <td>222654</td>\n",
       "      <td>347378</td>\n",
       "      <td>225724</td>\n",
       "      <td>194440</td>\n",
       "      <td>165070</td>\n",
       "      <td>802280</td>\n",
       "      <td>388422</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59999</th>\n",
       "      <td>neg</td>\n",
       "      <td>40222</td>\n",
       "      <td>na</td>\n",
       "      <td>698</td>\n",
       "      <td>628</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>440066</td>\n",
       "      <td>183200</td>\n",
       "      <td>344546</td>\n",
       "      <td>254068</td>\n",
       "      <td>225148</td>\n",
       "      <td>158304</td>\n",
       "      <td>170384</td>\n",
       "      <td>158</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 171 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      class  aa_000 ab_000      ac_000 ad_000 ae_000 af_000 ag_000 ag_001  \\\n",
       "59995   neg  153002     na         664    186      0      0      0      0   \n",
       "59996   neg    2286     na  2130706538    224      0      0      0      0   \n",
       "59997   neg     112      0  2130706432     18      0      0      0      0   \n",
       "59998   neg   80292     na  2130706432    494      0      0      0      0   \n",
       "59999   neg   40222     na         698    628      0      0      0      0   \n",
       "\n",
       "      ag_002  ...  ee_002  ee_003   ee_004   ee_005   ee_006  ee_007  ee_008  \\\n",
       "59995      0  ...  998500  566884  1290398  1218244  1019768  717762  898642   \n",
       "59996      0  ...   10578    6760    21126    68424      136       0       0   \n",
       "59997      0  ...     792     386      452      144      146    2622       0   \n",
       "59998      0  ...  699352  222654   347378   225724   194440  165070  802280   \n",
       "59999      0  ...  440066  183200   344546   254068   225148  158304  170384   \n",
       "\n",
       "       ee_009 ef_000 eg_000  \n",
       "59995   28588      0      0  \n",
       "59996       0      0      0  \n",
       "59997       0      0      0  \n",
       "59998  388422      0      0  \n",
       "59999     158      0      0  \n",
       "\n",
       "[5 rows x 171 columns]"
      ]
     },
     "execution_count": 149,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs.tail(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "id": "c818ebc0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',\n",
      "       'ag_000', 'ag_001', 'ag_002',\n",
      "       ...\n",
      "       'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',\n",
      "       'ee_009', 'ef_000', 'eg_000'],\n",
      "      dtype='object', length=171)\n"
     ]
    }
   ],
   "source": [
    "print(afs.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02f71493",
   "metadata": {},
   "source": [
    "Checking the data type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "id": "def2df33",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method DataFrame.info of       class  aa_000 ab_000      ac_000 ad_000 ae_000 af_000 ag_000 ag_001  \\\n",
       "0       neg   76698     na  2130706438    280      0      0      0      0   \n",
       "1       neg   33058     na           0     na      0      0      0      0   \n",
       "2       neg   41040     na         228    100      0      0      0      0   \n",
       "3       neg      12      0          70     66      0     10      0      0   \n",
       "4       neg   60874     na        1368    458      0      0      0      0   \n",
       "...     ...     ...    ...         ...    ...    ...    ...    ...    ...   \n",
       "59995   neg  153002     na         664    186      0      0      0      0   \n",
       "59996   neg    2286     na  2130706538    224      0      0      0      0   \n",
       "59997   neg     112      0  2130706432     18      0      0      0      0   \n",
       "59998   neg   80292     na  2130706432    494      0      0      0      0   \n",
       "59999   neg   40222     na         698    628      0      0      0      0   \n",
       "\n",
       "      ag_002  ...   ee_002  ee_003   ee_004   ee_005   ee_006  ee_007  ee_008  \\\n",
       "0          0  ...  1240520  493384   721044   469792   339156  157956   73224   \n",
       "1          0  ...   421400  178064   293306   245416   133654   81140   97576   \n",
       "2          0  ...   277378  159812   423992   409564   320746  158022   95128   \n",
       "3          0  ...      240      46       58       44       10       0       0   \n",
       "4          0  ...   622012  229790   405298   347188   286954  311560  433954   \n",
       "...      ...  ...      ...     ...      ...      ...      ...     ...     ...   \n",
       "59995      0  ...   998500  566884  1290398  1218244  1019768  717762  898642   \n",
       "59996      0  ...    10578    6760    21126    68424      136       0       0   \n",
       "59997      0  ...      792     386      452      144      146    2622       0   \n",
       "59998      0  ...   699352  222654   347378   225724   194440  165070  802280   \n",
       "59999      0  ...   440066  183200   344546   254068   225148  158304  170384   \n",
       "\n",
       "       ee_009 ef_000 eg_000  \n",
       "0           0      0      0  \n",
       "1        1500      0      0  \n",
       "2         514      0      0  \n",
       "3           0      4     32  \n",
       "4        1218      0      0  \n",
       "...       ...    ...    ...  \n",
       "59995   28588      0      0  \n",
       "59996       0      0      0  \n",
       "59997       0      0      0  \n",
       "59998  388422      0      0  \n",
       "59999     158      0      0  \n",
       "\n",
       "[60000 rows x 171 columns]>"
      ]
     },
     "execution_count": 151,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs.info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "287f4907",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "bbc1a9e4",
   "metadata": {},
   "source": [
    "As the 'neg' column is not applicable to this project, I will remove them from the data set before I explore any further. \"The dataset’s  positive class consists of component failures for a specific component of the APS system.\n",
    "The negative class consists of trucks with failures for components not related to the APS.\" This data is unrelated and therefore not useful for my project. Firstly, I will change the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "321d9aab",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "d281eb49",
   "metadata": {},
   "source": [
    "COoe source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7440acd0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "id": "939f42f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "afs = afs.drop(afs[afs['class'] == 'neg'].index)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "id": "b9f9e6f0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "id": "86f8a139",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Negative count: 0\n"
     ]
    }
   ],
   "source": [
    "print(\"Negative count:\", neg_count)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a727193",
   "metadata": {},
   "source": [
    "This confirms that all 'neg' values have been dropped. Source: - See method 2 'Using the drop function' https://saturncloud.io/blog/how-to-remove-rows-with-specific-values-in-pandas-dataframe/#:~:text=Another%20method%20to%20remove%20rows,value%20we%20want%20to%20remove"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "id": "500d2301",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      class  aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000  ag_001  \\\n",
      "9       pos  153204      0    182     na      0      0      0       0   \n",
      "23      pos  453236     na   2926     na      0      0      0       0   \n",
      "60      pos   72504     na   1594   1052      0      0      0     244   \n",
      "115     pos  762958     na     na     na     na     na    776  281128   \n",
      "135     pos  695994     na     na     na     na     na      0       0   \n",
      "...     ...     ...    ...    ...    ...    ...    ...    ...     ...   \n",
      "59484   pos  895178     na     na     na     na     na      0       0   \n",
      "59601   pos  862134     na     na     na     na     na      0   38834   \n",
      "59692   pos  186856     na     na     na      0      0      0       0   \n",
      "59742   pos  605092     na     na     na     na     na      0   44320   \n",
      "59769   pos  331704     na   1484   1142      0      0      0  267100   \n",
      "\n",
      "        ag_002  ...   ee_002   ee_003   ee_004   ee_005   ee_006    ee_007  \\\n",
      "9            0  ...   129862    26872    34044    22472    34362         0   \n",
      "23         222  ...  7908038  3026002  5025350  2025766  1160638    533834   \n",
      "60      178226  ...  1432098   372252   527514   358274   332818    284178   \n",
      "115    2186308  ...       na       na       na       na       na        na   \n",
      "135          0  ...  1397742   495544   361646    28610     5130       212   \n",
      "...        ...  ...      ...      ...      ...      ...      ...       ...   \n",
      "59484        0  ...  9116224  4276644  8701496  8082264  5827284   2057354   \n",
      "59601  1227952  ...  3456564  1793170  4159190  5847384  8364506  12875424   \n",
      "59692     4300  ...  2713108   800182   322322    71638    34662      7304   \n",
      "59742  1048970  ...  3940400  1865730  3698692  3271958  9831898   3755392   \n",
      "59769  1384372  ...  3738648  1425312  3381954  4346910  2166330    296580   \n",
      "\n",
      "        ee_008 ee_009 ef_000 eg_000  \n",
      "9            0      0      0      0  \n",
      "23      493800   6914      0      0  \n",
      "60        3742      0      0      0  \n",
      "115         na     na     na     na  \n",
      "135          0      0     na     na  \n",
      "...        ...    ...    ...    ...  \n",
      "59484  1662302  10790     na     na  \n",
      "59601   661442   2458     na     na  \n",
      "59692     2538      0      0      0  \n",
      "59742    65610      0     na     na  \n",
      "59769    15434      0      0      0  \n",
      "\n",
      "[1000 rows x 171 columns]\n"
     ]
    }
   ],
   "source": [
    "afs.describe(include=object)\n",
    "print(afs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94f2d4b2",
   "metadata": {},
   "source": [
    "Above I have ran the neg_count function to ensure that the negitive values were dropped. I then ran the describe function to confirm that the value of \"class\" is now \"1\" instead of two. Source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21778292",
   "metadata": {},
   "source": [
    "I need to get rid of the n/a in the ab_00 column. Below I will experiment with different strategies to do this. Forst, I will explore the classification and regression of the dataset. For this project, I will use multiclass classification. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "id": "3060f0cf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 171)"
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "id": "7eb4144c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 60000 entries, 0 to 59999\n",
      "Columns: 171 entries, class to eg_000\n",
      "dtypes: int64(1), object(170)\n",
      "memory usage: 78.3+ MB\n"
     ]
    }
   ],
   "source": [
    "#Requesting basic info on the dataset\n",
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8aa3159b",
   "metadata": {},
   "source": [
    "Basic Statistical Information on the dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "id": "7ddd6f13",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>aa_000</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>6.000000e+04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>5.933650e+04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.454301e+05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000e+00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>8.340000e+02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.077600e+04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>4.866800e+04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2.746564e+06</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             aa_000\n",
       "count  6.000000e+04\n",
       "mean   5.933650e+04\n",
       "std    1.454301e+05\n",
       "min    0.000000e+00\n",
       "25%    8.340000e+02\n",
       "50%    3.077600e+04\n",
       "75%    4.866800e+04\n",
       "max    2.746564e+06"
      ]
     },
     "execution_count": 159,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52eb81c9",
   "metadata": {},
   "source": [
    "I am checking the code for blank data. Note for myself- add in why this is important from lecture notes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "id": "59ceb170",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "class     0\n",
       "aa_000    0\n",
       "ab_000    0\n",
       "ac_000    0\n",
       "ad_000    0\n",
       "         ..\n",
       "ee_007    0\n",
       "ee_008    0\n",
       "ee_009    0\n",
       "ef_000    0\n",
       "eg_000    0\n",
       "Length: 171, dtype: int64"
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "id": "3e90e256",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False\n"
     ]
    }
   ],
   "source": [
    "print(afs.isnull().values.any())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "id": "ff98ff8b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos    1000\n",
       "Name: class, dtype: int64"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "afs[\"class\"].value_counts().sort_index()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33f178d1",
   "metadata": {},
   "source": [
    "I have notcied my first issue with the data. I have 59,000 data points for negative and only 1000 for posiitve. The code column contains two attributes, negative (neg) and positive (pos). I will change these to numercal values so I can count them. neg=0, pos=1... Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48c3ce58",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5069335d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "id": "d03b20fb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0    neg\n",
       " 1    neg\n",
       " 2    neg\n",
       " 3    neg\n",
       " 4    neg\n",
       " Name: class, dtype: object,\n",
       " neg    0.983333\n",
       " pos    0.016667\n",
       " Name: class, dtype: float64)"
      ]
     },
     "execution_count": 163,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Displaying the first few rows of the 'class' column and its distribution\n",
    "(data['class'].head(), data['class'].value_counts(normalize=True))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02754a15",
   "metadata": {},
   "source": [
    "This indicates that my dataset has no missing values or invalid data types. This is a good sign as my data is 'complete' and no further action is required. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4a32eca",
   "metadata": {},
   "source": [
    "Now I will begin to visualise my data using seaborne. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc0fc3fa",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}