{
"cells": [
{
"cell_type": "code",
"execution_count": 34,
"id": "31af659a",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt #visualisation\n",
"import seaborn as sns #visualisation\n",
"import numpy as np "
]
},
{
"cell_type": "markdown",
"id": "3ab38a44",
"metadata": {},
"source": [
"Step One- Data Wrangling\n",
"\n",
"1.1 Data Loading\n",
"\n",
"Here I have loaded the dataset. To save myself from typing 'aps_failure.csv' every single time I have given the dataset a simplfied name 'afs'. Line 1 below tells the program where the data is while line 2 renames it for ease of use. "
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "0ed5ef06",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('aps_failure_set.csv')\n",
"afs=pd.read_csv('aps_failure_set.csv')"
]
},
{
"cell_type": "markdown",
"id": "ec8ba2e7",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "91500a52",
"metadata": {},
"source": [
"Exploratory Analysis. \n",
"I am gathering some very basic information on my datset so I know what I'm dealing with. I start this process with gathering basic information "
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "b10e9b31",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(60000, 171)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.shape"
]
},
{
"cell_type": "markdown",
"id": "e4edc7e5",
"metadata": {},
"source": [
"The afs.shape above has told me I am dealing with a datset that has 171 columns and 60,000 rows. I will now use the afs.describe(include=object) function to provide me with some basic statistics on the data. This is useful for the following reasons:\n",
"\n",
"-Count shows me that\n",
"-Unique showes me that\n",
"-Top shows me that\n",
"-Freq shows me that"
]
},
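{
"cell_type": "markdown",
"id": "ed1a0001",
"metadata": {},
"source": [
"As a quick illustration of those four statistics, describe() can also be called on a single object column; this is only a sketch, and the full afs.describe(include=object) output appears further down the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0002",
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: count / unique / top / freq for one object column\n",
"afs['class'].describe()"
]
},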
{
"cell_type": "code",
"execution_count": 37,
"id": "7944cac4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" class | \n",
" aa_000 | \n",
" ab_000 | \n",
" ac_000 | \n",
" ad_000 | \n",
" ae_000 | \n",
" af_000 | \n",
" ag_000 | \n",
" ag_001 | \n",
" ag_002 | \n",
" ... | \n",
" ee_002 | \n",
" ee_003 | \n",
" ee_004 | \n",
" ee_005 | \n",
" ee_006 | \n",
" ee_007 | \n",
" ee_008 | \n",
" ee_009 | \n",
" ef_000 | \n",
" eg_000 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" neg | \n",
" 76698 | \n",
" na | \n",
" 2130706438 | \n",
" 280 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 1240520 | \n",
" 493384 | \n",
" 721044 | \n",
" 469792 | \n",
" 339156 | \n",
" 157956 | \n",
" 73224 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" neg | \n",
" 33058 | \n",
" na | \n",
" 0 | \n",
" na | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 421400 | \n",
" 178064 | \n",
" 293306 | \n",
" 245416 | \n",
" 133654 | \n",
" 81140 | \n",
" 97576 | \n",
" 1500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" neg | \n",
" 41040 | \n",
" na | \n",
" 228 | \n",
" 100 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 277378 | \n",
" 159812 | \n",
" 423992 | \n",
" 409564 | \n",
" 320746 | \n",
" 158022 | \n",
" 95128 | \n",
" 514 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" neg | \n",
" 12 | \n",
" 0 | \n",
" 70 | \n",
" 66 | \n",
" 0 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 240 | \n",
" 46 | \n",
" 58 | \n",
" 44 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 32 | \n",
"
\n",
" \n",
" 4 | \n",
" neg | \n",
" 60874 | \n",
" na | \n",
" 1368 | \n",
" 458 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 622012 | \n",
" 229790 | \n",
" 405298 | \n",
" 347188 | \n",
" 286954 | \n",
" 311560 | \n",
" 433954 | \n",
" 1218 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 171 columns
\n",
"
"
],
"text/plain": [
" class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 \\\n",
"0 neg 76698 na 2130706438 280 0 0 0 0 0 \n",
"1 neg 33058 na 0 na 0 0 0 0 0 \n",
"2 neg 41040 na 228 100 0 0 0 0 0 \n",
"3 neg 12 0 70 66 0 10 0 0 0 \n",
"4 neg 60874 na 1368 458 0 0 0 0 0 \n",
"\n",
" ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 \\\n",
"0 ... 1240520 493384 721044 469792 339156 157956 73224 0 0 \n",
"1 ... 421400 178064 293306 245416 133654 81140 97576 1500 0 \n",
"2 ... 277378 159812 423992 409564 320746 158022 95128 514 0 \n",
"3 ... 240 46 58 44 10 0 0 0 4 \n",
"4 ... 622012 229790 405298 347188 286954 311560 433954 1218 0 \n",
"\n",
" eg_000 \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 32 \n",
"4 0 \n",
"\n",
"[5 rows x 171 columns]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.head(5)\n"
]
},
{
"cell_type": "markdown",
"id": "b4bb9952",
"metadata": {},
"source": [
"Now I begin to view the data. data.head(10) gives me the first 10 rows of the data.\n",
"\n",
"This allows me to get an understanding of what I am actually dealing with. It is a good way"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "aee8d2fb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" class | \n",
" aa_000 | \n",
" ab_000 | \n",
" ac_000 | \n",
" ad_000 | \n",
" ae_000 | \n",
" af_000 | \n",
" ag_000 | \n",
" ag_001 | \n",
" ag_002 | \n",
" ... | \n",
" ee_002 | \n",
" ee_003 | \n",
" ee_004 | \n",
" ee_005 | \n",
" ee_006 | \n",
" ee_007 | \n",
" ee_008 | \n",
" ee_009 | \n",
" ef_000 | \n",
" eg_000 | \n",
"
\n",
" \n",
" \n",
" \n",
" 59995 | \n",
" neg | \n",
" 153002 | \n",
" na | \n",
" 664 | \n",
" 186 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 998500 | \n",
" 566884 | \n",
" 1290398 | \n",
" 1218244 | \n",
" 1019768 | \n",
" 717762 | \n",
" 898642 | \n",
" 28588 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 59996 | \n",
" neg | \n",
" 2286 | \n",
" na | \n",
" 2130706538 | \n",
" 224 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 10578 | \n",
" 6760 | \n",
" 21126 | \n",
" 68424 | \n",
" 136 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 59997 | \n",
" neg | \n",
" 112 | \n",
" 0 | \n",
" 2130706432 | \n",
" 18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 792 | \n",
" 386 | \n",
" 452 | \n",
" 144 | \n",
" 146 | \n",
" 2622 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 59998 | \n",
" neg | \n",
" 80292 | \n",
" na | \n",
" 2130706432 | \n",
" 494 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 699352 | \n",
" 222654 | \n",
" 347378 | \n",
" 225724 | \n",
" 194440 | \n",
" 165070 | \n",
" 802280 | \n",
" 388422 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 59999 | \n",
" neg | \n",
" 40222 | \n",
" na | \n",
" 698 | \n",
" 628 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 440066 | \n",
" 183200 | \n",
" 344546 | \n",
" 254068 | \n",
" 225148 | \n",
" 158304 | \n",
" 170384 | \n",
" 158 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 171 columns
\n",
"
"
],
"text/plain": [
" class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \\\n",
"59995 neg 153002 na 664 186 0 0 0 0 \n",
"59996 neg 2286 na 2130706538 224 0 0 0 0 \n",
"59997 neg 112 0 2130706432 18 0 0 0 0 \n",
"59998 neg 80292 na 2130706432 494 0 0 0 0 \n",
"59999 neg 40222 na 698 628 0 0 0 0 \n",
"\n",
" ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \\\n",
"59995 0 ... 998500 566884 1290398 1218244 1019768 717762 898642 \n",
"59996 0 ... 10578 6760 21126 68424 136 0 0 \n",
"59997 0 ... 792 386 452 144 146 2622 0 \n",
"59998 0 ... 699352 222654 347378 225724 194440 165070 802280 \n",
"59999 0 ... 440066 183200 344546 254068 225148 158304 170384 \n",
"\n",
" ee_009 ef_000 eg_000 \n",
"59995 28588 0 0 \n",
"59996 0 0 0 \n",
"59997 0 0 0 \n",
"59998 388422 0 0 \n",
"59999 158 0 0 \n",
"\n",
"[5 rows x 171 columns]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.tail(5)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "445668ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(60000, 171)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.shape"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "6ef57477",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.info"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "274d6e61",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',\n",
" 'ag_000', 'ag_001', 'ag_002',\n",
" ...\n",
" 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',\n",
" 'ee_009', 'ef_000', 'eg_000'],\n",
" dtype='object', length=171)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.columns"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "19a55131",
"metadata": {},
"outputs": [],
"source": [
"null_values = afs.isnull().sum()"
]
},
{
"cell_type": "markdown",
"id": "02f71493",
"metadata": {},
"source": [
"Checking the data type"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "def2df33",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 43,
"id": "8bdc2593",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.describe"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "db6b5067",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \\\n",
"0 False False False False False False False False False \n",
"1 False False False False False False False False False \n",
"2 False False False False False False False False False \n",
"3 False False False False False False False False False \n",
"4 False False False False False False False False False \n",
"... ... ... ... ... ... ... ... ... ... \n",
"59995 False False False False False False False False False \n",
"59996 False False False False False False False False False \n",
"59997 False False False False False False False False False \n",
"59998 False False False False False False False False False \n",
"59999 False False False False False False False False False \n",
"\n",
" ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \\\n",
"0 False ... False False False False False False False \n",
"1 False ... False False False False False False False \n",
"2 False ... False False False False False False False \n",
"3 False ... False False False False False False False \n",
"4 False ... False False False False False False False \n",
"... ... ... ... ... ... ... ... ... ... \n",
"59995 False ... False False False False False False False \n",
"59996 False ... False False False False False False False \n",
"59997 False ... False False False False False False False \n",
"59998 False ... False False False False False False False \n",
"59999 False ... False False False False False False False \n",
"\n",
" ee_009 ef_000 eg_000 \n",
"0 False False False \n",
"1 False False False \n",
"2 False False False \n",
"3 False False False \n",
"4 False False False \n",
"... ... ... ... \n",
"59995 False False False \n",
"59996 False False False \n",
"59997 False False False \n",
"59998 False False False \n",
"59999 False False False \n",
"\n",
"[60000 rows x 171 columns]\n"
]
}
],
"source": [
"nan_afs=afs.isna()\n",
"print(nan_afs)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "6d68e46c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"59995 False\n",
"59996 False\n",
"59997 False\n",
"59998 False\n",
"59999 False\n",
"Length: 60000, dtype: bool\n"
]
}
],
"source": [
"nan_rows = afs.isna().any(axis=1)\n",
"print(nan_rows)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec0a92d3",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 46,
"id": "61577671",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \\\n",
"0 neg 76698 na 2130706438 280 0 0 0 0 \n",
"1 neg 33058 na 0 na 0 0 0 0 \n",
"2 neg 41040 na 228 100 0 0 0 0 \n",
"3 neg 12 0 70 66 0 10 0 0 \n",
"4 neg 60874 na 1368 458 0 0 0 0 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"59995 neg 153002 na 664 186 0 0 0 0 \n",
"59996 neg 2286 na 2130706538 224 0 0 0 0 \n",
"59997 neg 112 0 2130706432 18 0 0 0 0 \n",
"59998 neg 80292 na 2130706432 494 0 0 0 0 \n",
"59999 neg 40222 na 698 628 0 0 0 0 \n",
"\n",
" ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \\\n",
"0 0 ... 1240520 493384 721044 469792 339156 157956 73224 \n",
"1 0 ... 421400 178064 293306 245416 133654 81140 97576 \n",
"2 0 ... 277378 159812 423992 409564 320746 158022 95128 \n",
"3 0 ... 240 46 58 44 10 0 0 \n",
"4 0 ... 622012 229790 405298 347188 286954 311560 433954 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"59995 0 ... 998500 566884 1290398 1218244 1019768 717762 898642 \n",
"59996 0 ... 10578 6760 21126 68424 136 0 0 \n",
"59997 0 ... 792 386 452 144 146 2622 0 \n",
"59998 0 ... 699352 222654 347378 225724 194440 165070 802280 \n",
"59999 0 ... 440066 183200 344546 254068 225148 158304 170384 \n",
"\n",
" ee_009 ef_000 eg_000 \n",
"0 0 0 0 \n",
"1 1500 0 0 \n",
"2 514 0 0 \n",
"3 0 4 32 \n",
"4 1218 0 0 \n",
"... ... ... ... \n",
"59995 28588 0 0 \n",
"59996 0 0 0 \n",
"59997 0 0 0 \n",
"59998 388422 0 0 \n",
"59999 158 0 0 \n",
"\n",
"[60000 rows x 171 columns]\n"
]
}
],
"source": [
"clean_afs = afs.dropna()\n",
"print(clean_afs)"
]
},
{
"cell_type": "markdown",
"id": "3431a9da",
"metadata": {},
"source": [
"I have noticed some data points say Na. I have asked how many are like this using code I found at this source: https://saturncloud.io/blog/how-to-find-all-rows-with-nan-values-in-python-pandas/#:~:text=To%20find%20all%20rows%20with%20NaN%20values%20in%20a%20Pandas,where%20NaN%20values%20are%20present."
]
},
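{
"cell_type": "markdown",
"id": "ed1a0003",
"metadata": {},
"source": [
"Below is a minimal sketch of one way to turn the 'na' strings into real NaN values so that isna(), dropna() and numeric summaries behave as expected. afs_numeric and feature_cols are illustrative names and are not used by the rest of the notebook; the same effect could also be achieved at load time with pd.read_csv('aps_failure_set.csv', na_values='na')."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0004",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: replace the 'na' placeholder strings with real NaN and make the feature columns numeric.\n",
"# afs_numeric is a working copy; the original afs is left untouched.\n",
"afs_numeric = afs.replace('na', np.nan)\n",
"feature_cols = afs_numeric.columns.drop('class')\n",
"afs_numeric[feature_cols] = afs_numeric[feature_cols].apply(pd.to_numeric, errors='coerce')\n",
"# The columns with the most missing values should now be visible to isna()\n",
"afs_numeric.isna().sum().sort_values(ascending=False).head(10)"
]
},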
{
"cell_type": "markdown",
"id": "67efc496",
"metadata": {},
"source": [
"Adding new column names as the current naming of the sensor columns is confusing. Source: "
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "af25578c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',\n",
" 'ag_000', 'ag_001', 'ag_002',\n",
" ...\n",
" 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',\n",
" 'ee_009', 'ef_000', 'eg_000'],\n",
" dtype='object', length=171)\n"
]
}
],
"source": [
"print(afs.columns)\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "3d649710",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class object\n",
"aa_000 int64\n",
"ab_000 object\n",
"ac_000 object\n",
"ad_000 object\n",
" ... \n",
"ee_007 object\n",
"ee_008 object\n",
"ee_009 object\n",
"ef_000 object\n",
"eg_000 object\n",
"Length: 171, dtype: object"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.dtypes"
]
},
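{
"cell_type": "markdown",
"id": "ed1a0005",
"metadata": {},
"source": [
"Almost every column is reported as object rather than a numeric type, which is a direct consequence of the 'na' strings. A small sketch summarising the dtypes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0006",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: summarise how many columns pandas parsed as each dtype\n",
"afs.dtypes.value_counts()"
]
},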
{
"cell_type": "code",
"execution_count": 58,
"id": "8e1b75c4",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Your selected dataframe has 171 columns.\n",
"There are 169 columns that have missing values.\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Missing Values | \n",
" % of Total Values | \n",
"
\n",
" \n",
" \n",
" \n",
" br_000 | \n",
" 49264 | \n",
" 82.1 | \n",
"
\n",
" \n",
" bq_000 | \n",
" 48722 | \n",
" 81.2 | \n",
"
\n",
" \n",
" bp_000 | \n",
" 47740 | \n",
" 79.6 | \n",
"
\n",
" \n",
" bo_000 | \n",
" 46333 | \n",
" 77.2 | \n",
"
\n",
" \n",
" ab_000 | \n",
" 46329 | \n",
" 77.2 | \n",
"
\n",
" \n",
" cr_000 | \n",
" 46329 | \n",
" 77.2 | \n",
"
\n",
" \n",
" bn_000 | \n",
" 44009 | \n",
" 73.3 | \n",
"
\n",
" \n",
" bm_000 | \n",
" 39549 | \n",
" 65.9 | \n",
"
\n",
" \n",
" bl_000 | \n",
" 27277 | \n",
" 45.5 | \n",
"
\n",
" \n",
" bk_000 | \n",
" 23034 | \n",
" 38.4 | \n",
"
\n",
" \n",
" ch_000 | \n",
" 14861 | \n",
" 24.8 | \n",
"
\n",
" \n",
" co_000 | \n",
" 14861 | \n",
" 24.8 | \n",
"
\n",
" \n",
" cf_000 | \n",
" 14861 | \n",
" 24.8 | \n",
"
\n",
" \n",
" cg_000 | \n",
" 14861 | \n",
" 24.8 | \n",
"
\n",
" \n",
" ad_000 | \n",
" 14861 | \n",
" 24.8 | \n",
"
\n",
" \n",
" db_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" ct_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" cu_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" cv_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" cx_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" cy_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" cz_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" da_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" dc_000 | \n",
" 13808 | \n",
" 23.0 | \n",
"
\n",
" \n",
" ec_00 | \n",
" 10239 | \n",
" 17.1 | \n",
"
\n",
" \n",
" cm_000 | \n",
" 9877 | \n",
" 16.5 | \n",
"
\n",
" \n",
" ed_000 | \n",
" 9553 | \n",
" 15.9 | \n",
"
\n",
" \n",
" cl_000 | \n",
" 9553 | \n",
" 15.9 | \n",
"
\n",
" \n",
" ak_000 | \n",
" 4400 | \n",
" 7.3 | \n",
"
\n",
" \n",
" ca_000 | \n",
" 4356 | \n",
" 7.3 | \n",
"
\n",
" \n",
" dm_000 | \n",
" 4009 | \n",
" 6.7 | \n",
"
\n",
" \n",
" dh_000 | \n",
" 4008 | \n",
" 6.7 | \n",
"
\n",
" \n",
" dg_000 | \n",
" 4008 | \n",
" 6.7 | \n",
"
\n",
" \n",
" df_000 | \n",
" 4008 | \n",
" 6.7 | \n",
"
\n",
" \n",
" dl_000 | \n",
" 4008 | \n",
" 6.7 | \n",
"
\n",
" \n",
" dj_000 | \n",
" 4007 | \n",
" 6.7 | \n",
"
\n",
" \n",
" dk_000 | \n",
" 4007 | \n",
" 6.7 | \n",
"
\n",
" \n",
" eb_000 | \n",
" 4007 | \n",
" 6.7 | \n",
"
\n",
" \n",
" di_000 | \n",
" 4006 | \n",
" 6.7 | \n",
"
\n",
" \n",
" ac_000 | \n",
" 3335 | \n",
" 5.6 | \n",
"
\n",
" \n",
" bx_000 | \n",
" 3257 | \n",
" 5.4 | \n",
"
\n",
" \n",
" cc_000 | \n",
" 3255 | \n",
" 5.4 | \n",
"
\n",
" \n",
" ds_000 | \n",
" 2727 | \n",
" 4.5 | \n",
"
\n",
" \n",
" bd_000 | \n",
" 2727 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dt_000 | \n",
" 2727 | \n",
" 4.5 | \n",
"
\n",
" \n",
" du_000 | \n",
" 2726 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dv_000 | \n",
" 2726 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dp_000 | \n",
" 2726 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dq_000 | \n",
" 2726 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dr_000 | \n",
" 2726 | \n",
" 4.5 | \n",
"
\n",
" \n",
" bc_000 | \n",
" 2725 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dy_000 | \n",
" 2724 | \n",
" 4.5 | \n",
"
\n",
" \n",
" do_000 | \n",
" 2724 | \n",
" 4.5 | \n",
"
\n",
" \n",
" de_000 | \n",
" 2724 | \n",
" 4.5 | \n",
"
\n",
" \n",
" ef_000 | \n",
" 2724 | \n",
" 4.5 | \n",
"
\n",
" \n",
" cp_000 | \n",
" 2724 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dx_000 | \n",
" 2723 | \n",
" 4.5 | \n",
"
\n",
" \n",
" dz_000 | \n",
" 2723 | \n",
" 4.5 | \n",
"
\n",
" \n",
" ea_000 | \n",
" 2723 | \n",
" 4.5 | \n",
"
\n",
" \n",
" bz_000 | \n",
" 2723 | \n",
" 4.5 | \n",
"
\n",
" \n",
" eg_000 | \n",
" 2723 | \n",
" 4.5 | \n",
"
\n",
" \n",
" ar_000 | \n",
" 2723 | \n",
" 4.5 | \n",
"
\n",
" \n",
" be_000 | \n",
" 2503 | \n",
" 4.2 | \n",
"
\n",
" \n",
" dd_000 | \n",
" 2503 | \n",
" 4.2 | \n",
"
\n",
" \n",
" ce_000 | \n",
" 2502 | \n",
" 4.2 | \n",
"
\n",
" \n",
" ax_000 | \n",
" 2501 | \n",
" 4.2 | \n",
"
\n",
" \n",
" ae_000 | \n",
" 2500 | \n",
" 4.2 | \n",
"
\n",
" \n",
" af_000 | \n",
" 2500 | \n",
" 4.2 | \n",
"
\n",
" \n",
" bf_000 | \n",
" 2500 | \n",
" 4.2 | \n",
"
\n",
" \n",
" av_000 | \n",
" 2500 | \n",
" 4.2 | \n",
"
\n",
" \n",
" bs_000 | \n",
" 726 | \n",
" 1.2 | \n",
"
\n",
" \n",
" cb_000 | \n",
" 726 | \n",
" 1.2 | \n",
"
\n",
" \n",
" dn_000 | \n",
" 691 | \n",
" 1.2 | \n",
"
\n",
" \n",
" cq_000 | \n",
" 691 | \n",
" 1.2 | \n",
"
\n",
" \n",
" bv_000 | \n",
" 691 | \n",
" 1.2 | \n",
"
\n",
" \n",
" bu_000 | \n",
" 691 | \n",
" 1.2 | \n",
"
\n",
" \n",
" ba_009 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_003 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_007 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_000 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_004 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_006 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_005 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_002 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_001 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ba_008 | \n",
" 688 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_007 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_006 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_005 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_008 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_004 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_003 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_001 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_000 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_002 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cn_009 | \n",
" 687 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cd_000 | \n",
" 676 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_004 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_005 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_006 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_003 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_008 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_009 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_000 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_001 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_002 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_003 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_007 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_009 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_002 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_001 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ag_000 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_000 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_001 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_002 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_003 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_004 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_005 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_006 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_007 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_008 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ee_009 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_004 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_005 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_006 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_003 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_007 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_008 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ay_009 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_000 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_001 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_002 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_004 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_005 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_006 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_007 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" az_008 | \n",
" 671 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_009 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_000 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_001 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_003 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_004 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_002 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_008 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_007 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_006 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" cs_005 | \n",
" 669 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ah_000 | \n",
" 645 | \n",
" 1.1 | \n",
"
\n",
" \n",
" bb_000 | \n",
" 645 | \n",
" 1.1 | \n",
"
\n",
" \n",
" ap_000 | \n",
" 642 | \n",
" 1.1 | \n",
"
\n",
" \n",
" bh_000 | \n",
" 642 | \n",
" 1.1 | \n",
"
\n",
" \n",
" al_000 | \n",
" 642 | \n",
" 1.1 | \n",
"
\n",
" \n",
" bg_000 | \n",
" 642 | \n",
" 1.1 | \n",
"
\n",
" \n",
" an_000 | \n",
" 642 | \n",
" 1.1 | \n",
"
\n",
" \n",
" at_000 | \n",
" 629 | \n",
" 1.0 | \n",
"
\n",
" \n",
" as_000 | \n",
" 629 | \n",
" 1.0 | \n",
"
\n",
" \n",
" aj_000 | \n",
" 629 | \n",
" 1.0 | \n",
"
\n",
" \n",
" ai_000 | \n",
" 629 | \n",
" 1.0 | \n",
"
\n",
" \n",
" am_0 | \n",
" 629 | \n",
" 1.0 | \n",
"
\n",
" \n",
" au_000 | \n",
" 629 | \n",
" 1.0 | \n",
"
\n",
" \n",
" bj_000 | \n",
" 589 | \n",
" 1.0 | \n",
"
\n",
" \n",
" aq_000 | \n",
" 589 | \n",
" 1.0 | \n",
"
\n",
" \n",
" ao_000 | \n",
" 589 | \n",
" 1.0 | \n",
"
\n",
" \n",
" bi_000 | \n",
" 589 | \n",
" 1.0 | \n",
"
\n",
" \n",
" by_000 | \n",
" 473 | \n",
" 0.8 | \n",
"
\n",
" \n",
" ci_000 | \n",
" 338 | \n",
" 0.6 | \n",
"
\n",
" \n",
" cj_000 | \n",
" 338 | \n",
" 0.6 | \n",
"
\n",
" \n",
" ck_000 | \n",
" 338 | \n",
" 0.6 | \n",
"
\n",
" \n",
" bt_000 | \n",
" 167 | \n",
" 0.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Missing Values % of Total Values\n",
"br_000 49264 82.1\n",
"bq_000 48722 81.2\n",
"bp_000 47740 79.6\n",
"bo_000 46333 77.2\n",
"ab_000 46329 77.2\n",
"cr_000 46329 77.2\n",
"bn_000 44009 73.3\n",
"bm_000 39549 65.9\n",
"bl_000 27277 45.5\n",
"bk_000 23034 38.4\n",
"ch_000 14861 24.8\n",
"co_000 14861 24.8\n",
"cf_000 14861 24.8\n",
"cg_000 14861 24.8\n",
"ad_000 14861 24.8\n",
"db_000 13808 23.0\n",
"ct_000 13808 23.0\n",
"cu_000 13808 23.0\n",
"cv_000 13808 23.0\n",
"cx_000 13808 23.0\n",
"cy_000 13808 23.0\n",
"cz_000 13808 23.0\n",
"da_000 13808 23.0\n",
"dc_000 13808 23.0\n",
"ec_00 10239 17.1\n",
"cm_000 9877 16.5\n",
"ed_000 9553 15.9\n",
"cl_000 9553 15.9\n",
"ak_000 4400 7.3\n",
"ca_000 4356 7.3\n",
"dm_000 4009 6.7\n",
"dh_000 4008 6.7\n",
"dg_000 4008 6.7\n",
"df_000 4008 6.7\n",
"dl_000 4008 6.7\n",
"dj_000 4007 6.7\n",
"dk_000 4007 6.7\n",
"eb_000 4007 6.7\n",
"di_000 4006 6.7\n",
"ac_000 3335 5.6\n",
"bx_000 3257 5.4\n",
"cc_000 3255 5.4\n",
"ds_000 2727 4.5\n",
"bd_000 2727 4.5\n",
"dt_000 2727 4.5\n",
"du_000 2726 4.5\n",
"dv_000 2726 4.5\n",
"dp_000 2726 4.5\n",
"dq_000 2726 4.5\n",
"dr_000 2726 4.5\n",
"bc_000 2725 4.5\n",
"dy_000 2724 4.5\n",
"do_000 2724 4.5\n",
"de_000 2724 4.5\n",
"ef_000 2724 4.5\n",
"cp_000 2724 4.5\n",
"dx_000 2723 4.5\n",
"dz_000 2723 4.5\n",
"ea_000 2723 4.5\n",
"bz_000 2723 4.5\n",
"eg_000 2723 4.5\n",
"ar_000 2723 4.5\n",
"be_000 2503 4.2\n",
"dd_000 2503 4.2\n",
"ce_000 2502 4.2\n",
"ax_000 2501 4.2\n",
"ae_000 2500 4.2\n",
"af_000 2500 4.2\n",
"bf_000 2500 4.2\n",
"av_000 2500 4.2\n",
"bs_000 726 1.2\n",
"cb_000 726 1.2\n",
"dn_000 691 1.2\n",
"cq_000 691 1.2\n",
"bv_000 691 1.2\n",
"bu_000 691 1.2\n",
"ba_009 688 1.1\n",
"ba_003 688 1.1\n",
"ba_007 688 1.1\n",
"ba_000 688 1.1\n",
"ba_004 688 1.1\n",
"ba_006 688 1.1\n",
"ba_005 688 1.1\n",
"ba_002 688 1.1\n",
"ba_001 688 1.1\n",
"ba_008 688 1.1\n",
"cn_007 687 1.1\n",
"cn_006 687 1.1\n",
"cn_005 687 1.1\n",
"cn_008 687 1.1\n",
"cn_004 687 1.1\n",
"cn_003 687 1.1\n",
"cn_001 687 1.1\n",
"cn_000 687 1.1\n",
"cn_002 687 1.1\n",
"cn_009 687 1.1\n",
"cd_000 676 1.1\n",
"ag_004 671 1.1\n",
"ag_005 671 1.1\n",
"ag_006 671 1.1\n",
"ag_003 671 1.1\n",
"ag_008 671 1.1\n",
"ag_009 671 1.1\n",
"ay_000 671 1.1\n",
"ay_001 671 1.1\n",
"ay_002 671 1.1\n",
"ay_003 671 1.1\n",
"ag_007 671 1.1\n",
"az_009 671 1.1\n",
"ag_002 671 1.1\n",
"ag_001 671 1.1\n",
"ag_000 671 1.1\n",
"ee_000 671 1.1\n",
"ee_001 671 1.1\n",
"ee_002 671 1.1\n",
"ee_003 671 1.1\n",
"ee_004 671 1.1\n",
"ee_005 671 1.1\n",
"ee_006 671 1.1\n",
"ee_007 671 1.1\n",
"ee_008 671 1.1\n",
"ee_009 671 1.1\n",
"ay_004 671 1.1\n",
"ay_005 671 1.1\n",
"ay_006 671 1.1\n",
"az_003 671 1.1\n",
"ay_007 671 1.1\n",
"ay_008 671 1.1\n",
"ay_009 671 1.1\n",
"az_000 671 1.1\n",
"az_001 671 1.1\n",
"az_002 671 1.1\n",
"az_004 671 1.1\n",
"az_005 671 1.1\n",
"az_006 671 1.1\n",
"az_007 671 1.1\n",
"az_008 671 1.1\n",
"cs_009 669 1.1\n",
"cs_000 669 1.1\n",
"cs_001 669 1.1\n",
"cs_003 669 1.1\n",
"cs_004 669 1.1\n",
"cs_002 669 1.1\n",
"cs_008 669 1.1\n",
"cs_007 669 1.1\n",
"cs_006 669 1.1\n",
"cs_005 669 1.1\n",
"ah_000 645 1.1\n",
"bb_000 645 1.1\n",
"ap_000 642 1.1\n",
"bh_000 642 1.1\n",
"al_000 642 1.1\n",
"bg_000 642 1.1\n",
"an_000 642 1.1\n",
"at_000 629 1.0\n",
"as_000 629 1.0\n",
"aj_000 629 1.0\n",
"ai_000 629 1.0\n",
"am_0 629 1.0\n",
"au_000 629 1.0\n",
"bj_000 589 1.0\n",
"aq_000 589 1.0\n",
"ao_000 589 1.0\n",
"bi_000 589 1.0\n",
"by_000 473 0.8\n",
"ci_000 338 0.6\n",
"cj_000 338 0.6\n",
"ck_000 338 0.6\n",
"bt_000 167 0.3"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def missing_values_table(afs):\n",
" mis_val = afs.apply(lambda x: x[x == 'na'].count(), axis=0)\n",
" mis_val_percent = 100 * mis_val / len(afs)\n",
" mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)\n",
" mis_val_table_ren_columns = mis_val_table.rename(\n",
" columns = {0 : 'Missing Values', 1 : '% of Total Values'}\n",
" )\n",
" mis_val_table_ren_columns = mis_val_table_ren_columns[\n",
" mis_val_table_ren_columns.iloc[:,1] != 0\n",
" ].sort_values('% of Total Values', ascending=False).round(1)\n",
" print(\"Your selected dataframe has \" + str(afs.shape[1]) + \" columns.\\n\" \n",
" \"There are \" + str(mis_val_table_ren_columns.shape[0]) +\n",
" \" columns that have missing values.\")\n",
" return mis_val_table_ren_columns\n",
"missing_values_table(afs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34557d64",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "13570714",
"metadata": {},
"source": [
"Add in source for the above:\n",
"\n",
"I have decided that anything with less then 60% readable data is not useable so I will cut anything that is under 60% usuable data. This includes: \n",
"\n",
"br_000\n",
"bq_000\n",
"bp_000\n",
"bo_000\n",
"ab_000\n",
"cr_000\n",
"bn_000\n",
"bm_000\n",
"bl_000\n",
"bk_000"
]
},
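{
"cell_type": "markdown",
"id": "ed1a0007",
"metadata": {},
"source": [
"A sketch of how a similar drop list could be derived programmatically, reusing the missing_values_table() function defined earlier. The 40% missing threshold is an assumption that mirrors the 'under 60% usable' rule; note that bk_000 (38.4% missing) sits just under it but is still included in the hand-picked list below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0008",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pick drop candidates by threshold instead of listing them by hand\n",
"mis_val_table = missing_values_table(afs)\n",
"cols_to_drop = mis_val_table[mis_val_table['% of Total Values'] > 40].index.tolist()\n",
"print(cols_to_drop)"
]
},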
{
"cell_type": "code",
"execution_count": 60,
"id": "6fc62974",
"metadata": {},
"outputs": [],
"source": [
"afs.drop(columns=['br_000', 'bq_000', 'bp_000', 'bo_000', 'ab_000', 'cr_000', 'bn_000', 'bm_000', 'bl_000', 'bk_000'], inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "2e92c5f5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['class', 'aa_000', 'ac_000', 'ad_000', 'ae_000', 'af_000', 'ag_000',\n",
" 'ag_001', 'ag_002', 'ag_003',\n",
" ...\n",
" 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',\n",
" 'ee_009', 'ef_000', 'eg_000'],\n",
" dtype='object', length=161)\n"
]
}
],
"source": [
"print(afs.columns)"
]
},
{
"cell_type": "markdown",
"id": "ae5943e4",
"metadata": {},
"source": [
"Source for drop columns formula: https://www.freecodecamp.org/news/dataframe-drop-column-in-pandas-how-to-remove-columns-from-dataframes/#:~:text=drop()%20Method%20in%20Pandas,the%20inplace%20parameter%20to%20True%20\n",
"\n",
"Source for "
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "58b65e7f",
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_rows', None)\n"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "c0c83021",
"metadata": {},
"outputs": [],
"source": [
"new_column_names = {\n",
" \"aa_000\": \"sensor1_reading\",\n",
" \"ab_000\": \"sensor2_reading\",\n",
" \"ac_000\": \"sensor3_reading\",\n",
" \"ad_000\": \"sensor4_reading\",\n",
" \"ae_000\": \"sensor5_reading\",\n",
" \"af_000\": \"sensor6_reading\",\n",
" \"ag_000\": \"sensor7_reading\",\n",
" \"ag_001\": \"sensor8_reading\",\n",
" \"ag_002\": \"sensor9_reading\",\n",
" \"ee_002\": \"sensor10_reading\",\n",
" \"ee_003\": \"sensor11_reading\",\n",
" \"ee_004\": \"sensor12_reading\",\n",
" \"ee_005\": \"sensor13_reading\",\n",
" \"ee_006\": \"sensor14_reading\",\n",
" \"ee_007\": \"sensor15_reading\",\n",
" \"ee_008\": \"sensor16_reading\",\n",
" \"ee_009\": \"sensor17_reading\",\n",
" \"ef_000\": \"sensor18_reading\",\n",
" \"eg_000\": \"sensor19_reading\"\n",
"}\n"
]
},
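{
"cell_type": "markdown",
"id": "ed1a0009",
"metadata": {},
"source": [
"The mapping above is only defined at this point; a minimal sketch of applying it is shown below. rename() ignores labels that are not present, so the entry for ab_000 (already dropped) is harmless. afs_renamed is an illustrative copy so that the original column names stay available for the cells that follow."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0010",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: apply the rename mapping on a copy, leaving afs itself unchanged\n",
"afs_renamed = afs.rename(columns=new_column_names)\n",
"afs_renamed.columns[:12]"
]
},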
{
"cell_type": "code",
"execution_count": null,
"id": "87c4243d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c9f020c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3350c13c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5438907",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "97a2eaac",
"metadata": {},
"source": [
"I have notcied my first issue with the data. I have 59,000 data points for negative and only 1000 for posiitve. The code column contains two attributes, negative (neg) and positive (pos). I will change these to numercal values so I can count them. neg=0, pos=1... Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.htmlAs the 'neg' column is not applicable to this project, I will remove them from the data set before I explore any further. \"The dataset’s positive class consists of component failures for a specific component of the APS system.\n",
"The negative class consists of trucks with failures for components not related to the APS.\" This data is unrelated and therefore not useful for my project. Firstly, I will change the data"
]
},
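{
"cell_type": "markdown",
"id": "ed1a0011",
"metadata": {},
"source": [
"A minimal sketch of the numerical encoding described above (neg=0, pos=1). It works on a copy of the column, so the original 'pos'/'neg' strings stay available for the filtering and describe() calls that follow."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0012",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: encode the class label numerically (neg=0, pos=1) on a copy of the column\n",
"class_encoded = afs['class'].map({'neg': 0, 'pos': 1})\n",
"class_encoded.value_counts()"
]
},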
{
"cell_type": "markdown",
"id": "de956434",
"metadata": {},
"source": [
"Code source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "dcf7776d",
"metadata": {},
"outputs": [],
"source": [
"afs = afs.drop(afs[afs['class'] == 'neg'].index)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "611c9b8c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" class | \n",
" ab_000 | \n",
" ac_000 | \n",
" ad_000 | \n",
" ae_000 | \n",
" af_000 | \n",
" ag_000 | \n",
" ag_001 | \n",
" ag_002 | \n",
" ag_003 | \n",
" ... | \n",
" ee_002 | \n",
" ee_003 | \n",
" ee_004 | \n",
" ee_005 | \n",
" ee_006 | \n",
" ee_007 | \n",
" ee_008 | \n",
" ee_009 | \n",
" ef_000 | \n",
" eg_000 | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" ... | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
" 1000 | \n",
"
\n",
" \n",
" unique | \n",
" 1 | \n",
" 11 | \n",
" 340 | \n",
" 265 | \n",
" 16 | \n",
" 16 | \n",
" 84 | \n",
" 349 | \n",
" 492 | \n",
" 694 | \n",
" ... | \n",
" 873 | \n",
" 871 | \n",
" 871 | \n",
" 866 | \n",
" 865 | \n",
" 859 | \n",
" 845 | \n",
" 370 | \n",
" 9 | \n",
" 10 | \n",
"
\n",
" \n",
" top | \n",
" pos | \n",
" na | \n",
" na | \n",
" na | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" ... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" freq | \n",
" 1000 | \n",
" 771 | \n",
" 462 | \n",
" 645 | \n",
" 645 | \n",
" 645 | \n",
" 899 | \n",
" 621 | \n",
" 461 | \n",
" 232 | \n",
" ... | \n",
" 44 | \n",
" 46 | \n",
" 46 | \n",
" 47 | \n",
" 48 | \n",
" 52 | \n",
" 61 | \n",
" 482 | \n",
" 614 | \n",
" 612 | \n",
"
\n",
" \n",
"
\n",
"
4 rows × 170 columns
\n",
"
"
],
"text/plain": [
" class ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ag_003 \\\n",
"count 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 \n",
"unique 1 11 340 265 16 16 84 349 492 694 \n",
"top pos na na na 0 0 0 0 0 0 \n",
"freq 1000 771 462 645 645 645 899 621 461 232 \n",
"\n",
" ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 \\\n",
"count ... 1000 1000 1000 1000 1000 1000 1000 1000 1000 \n",
"unique ... 873 871 871 866 865 859 845 370 9 \n",
"top ... 0 0 0 0 0 0 0 0 0 \n",
"freq ... 44 46 46 47 48 52 61 482 614 \n",
"\n",
" eg_000 \n",
"count 1000 \n",
"unique 10 \n",
"top 0 \n",
"freq 612 \n",
"\n",
"[4 rows x 170 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.describe(include=object)"
]
},
{
"cell_type": "markdown",
"id": "30dd2288",
"metadata": {},
"source": [
"This confirms that all 'neg' values have been dropped. Source: - See method 2 'Using the drop function' https://saturncloud.io/blog/how-to-remove-rows-with-specific-values-in-pandas-dataframe/#:~:text=Another%20method%20to%20remove%20rows,value%20we%20want%20to%20remove"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "0a9279e1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \\\n",
"9 pos 153204 0 182 na 0 0 0 0 \n",
"23 pos 453236 na 2926 na 0 0 0 0 \n",
"60 pos 72504 na 1594 1052 0 0 0 244 \n",
"115 pos 762958 na na na na na 776 281128 \n",
"135 pos 695994 na na na na na 0 0 \n",
"... ... ... ... ... ... ... ... ... ... \n",
"59484 pos 895178 na na na na na 0 0 \n",
"59601 pos 862134 na na na na na 0 38834 \n",
"59692 pos 186856 na na na 0 0 0 0 \n",
"59742 pos 605092 na na na na na 0 44320 \n",
"59769 pos 331704 na 1484 1142 0 0 0 267100 \n",
"\n",
" ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 \\\n",
"9 0 ... 129862 26872 34044 22472 34362 0 \n",
"23 222 ... 7908038 3026002 5025350 2025766 1160638 533834 \n",
"60 178226 ... 1432098 372252 527514 358274 332818 284178 \n",
"115 2186308 ... na na na na na na \n",
"135 0 ... 1397742 495544 361646 28610 5130 212 \n",
"... ... ... ... ... ... ... ... ... \n",
"59484 0 ... 9116224 4276644 8701496 8082264 5827284 2057354 \n",
"59601 1227952 ... 3456564 1793170 4159190 5847384 8364506 12875424 \n",
"59692 4300 ... 2713108 800182 322322 71638 34662 7304 \n",
"59742 1048970 ... 3940400 1865730 3698692 3271958 9831898 3755392 \n",
"59769 1384372 ... 3738648 1425312 3381954 4346910 2166330 296580 \n",
"\n",
" ee_008 ee_009 ef_000 eg_000 \n",
"9 0 0 0 0 \n",
"23 493800 6914 0 0 \n",
"60 3742 0 0 0 \n",
"115 na na na na \n",
"135 0 0 na na \n",
"... ... ... ... ... \n",
"59484 1662302 10790 na na \n",
"59601 661442 2458 na na \n",
"59692 2538 0 0 0 \n",
"59742 65610 0 na na \n",
"59769 15434 0 0 0 \n",
"\n",
"[1000 rows x 171 columns]\n"
]
}
],
"source": [
"afs.describe(include=object)\n",
"print(afs)"
]
},
{
"cell_type": "markdown",
"id": "9b786a0b",
"metadata": {},
"source": [
"Above I have ran the neg_count function to ensure that the negitive values were dropped. I then ran the describe function to confirm that the value of \"class\" is now \"1\" instead of two. Source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "b45cd7a8",
"metadata": {},
"source": [
"I need to get rid of the n/a in the ab_00 column. Below I will experiment with different strategies to do this. Forst, I will explore the classification and regression of the dataset. For this project, I will use multiclass classification. "
]
},
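{
"cell_type": "markdown",
"id": "ed1a0013",
"metadata": {},
"source": [
"One possible strategy is sketched below, under the assumption that the 'na' strings should become NaN and then be filled with each column's median (a common, robust default). afs_imputed and feature_cols are illustrative names; mean, constant or model-based imputation could be compared in the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0014",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: convert 'na' to NaN, make the feature columns numeric, then median-impute\n",
"feature_cols = afs.columns.drop('class')\n",
"afs_imputed = afs.copy()\n",
"afs_imputed[feature_cols] = (\n",
"    afs_imputed[feature_cols]\n",
"    .replace('na', np.nan)\n",
"    .apply(pd.to_numeric, errors='coerce')\n",
")\n",
"afs_imputed[feature_cols] = afs_imputed[feature_cols].fillna(afs_imputed[feature_cols].median())\n",
"# Should print 0 if every gap has been filled\n",
"print(afs_imputed.isna().sum().sum())"
]
},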
{
"cell_type": "code",
"execution_count": 45,
"id": "e8bd4620",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1000, 171)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.shape"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "7eb4144c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 60000 entries, 0 to 59999\n",
"Columns: 171 entries, class to eg_000\n",
"dtypes: int64(1), object(170)\n",
"memory usage: 78.3+ MB\n"
]
}
],
"source": [
"#Requesting basic info on the dataset\n",
"data.info()"
]
},
{
"cell_type": "markdown",
"id": "8aa3159b",
"metadata": {},
"source": [
"Basic Statistical Information on the dataset\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "7ddd6f13",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" aa_000 | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 6.000000e+04 | \n",
"
\n",
" \n",
" mean | \n",
" 5.933650e+04 | \n",
"
\n",
" \n",
" std | \n",
" 1.454301e+05 | \n",
"
\n",
" \n",
" min | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 25% | \n",
" 8.340000e+02 | \n",
"
\n",
" \n",
" 50% | \n",
" 3.077600e+04 | \n",
"
\n",
" \n",
" 75% | \n",
" 4.866800e+04 | \n",
"
\n",
" \n",
" max | \n",
" 2.746564e+06 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" aa_000\n",
"count 6.000000e+04\n",
"mean 5.933650e+04\n",
"std 1.454301e+05\n",
"min 0.000000e+00\n",
"25% 8.340000e+02\n",
"50% 3.077600e+04\n",
"75% 4.866800e+04\n",
"max 2.746564e+06"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe()"
]
},
{
"cell_type": "markdown",
"id": "52eb81c9",
"metadata": {},
"source": [
"I am checking the code for blank data. Note for myself- add in why this is important from lecture notes"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "59ceb170",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class 0\n",
"aa_000 0\n",
"ab_000 0\n",
"ac_000 0\n",
"ad_000 0\n",
" ..\n",
"ee_007 0\n",
"ee_008 0\n",
"ee_009 0\n",
"ef_000 0\n",
"eg_000 0\n",
"Length: 171, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "3e90e256",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"print(afs.isnull().values.any())"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "ff98ff8b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pos 1000\n",
"Name: class, dtype: int64"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"afs[\"class\"].value_counts().sort_index()\n"
]
},
{
"cell_type": "markdown",
"id": "33f178d1",
"metadata": {},
"source": [
"I have notcied my first issue with the data. I have 59,000 data points for negative and only 1000 for posiitve. The code column contains two attributes, negative (neg) and positive (pos). I will change these to numercal values so I can count them. neg=0, pos=1... Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48c3ce58",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5069335d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 51,
"id": "d03b20fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0 neg\n",
" 1 neg\n",
" 2 neg\n",
" 3 neg\n",
" 4 neg\n",
" Name: class, dtype: object,\n",
" neg 0.983333\n",
" pos 0.016667\n",
" Name: class, dtype: float64)"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Displaying the first few rows of the 'class' column and its distribution\n",
"(data['class'].head(), data['class'].value_counts(normalize=True))\n"
]
},
{
"cell_type": "markdown",
"id": "02754a15",
"metadata": {},
"source": [
"This indicates that my dataset has no missing values or invalid data types. This is a good sign as my data is 'complete' and no further action is required. "
]
},
{
"cell_type": "markdown",
"id": "a4a32eca",
"metadata": {},
"source": [
"Now I will begin to visualise my data using seaborne. "
]
},
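{
"cell_type": "markdown",
"id": "ed1a0015",
"metadata": {},
"source": [
"As a first visualisation, here is a minimal sketch of a seaborn count plot of the class balance. It uses the unmodified data frame read at the top of the notebook, so both classes are still present."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed1a0016",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: visualise the class imbalance in the raw dataset\n",
"sns.countplot(x='class', data=data)\n",
"plt.title('Class distribution (neg vs pos)')\n",
"plt.show()"
]
},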
{
"cell_type": "code",
"execution_count": null,
"id": "fc0fc3fa",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}