{ "cells": [ { "cell_type": "code", "execution_count": 144, "id": "31af659a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt # visualisation\n", "import seaborn as sns # visualisation\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "3ab38a44", "metadata": {}, "source": [ "Here I load the dataset. To save myself from typing 'aps_failure_set.csv' every single time, I give the dataset the simplified name 'afs'. Line 1 below reads the CSV file into a DataFrame called 'data'; line 2 reads the same file again into a second DataFrame called 'afs', which is the one I modify later. " ] }, { "cell_type": "code", "execution_count": 145, "id": "0ed5ef06", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('aps_failure_set.csv')\n", "afs = pd.read_csv('aps_failure_set.csv')" ] }, { "cell_type": "markdown", "id": "91500a52", "metadata": {}, "source": [ "Exploratory Analysis. \n", "I start by gathering some very basic information on my dataset so I know what I'm dealing with, beginning with its shape. " ] }, { "cell_type": "code", "execution_count": 146, "id": "b10e9b31", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(60000, 171)" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs.shape" ] }, { "cell_type": "markdown", "id": "e4edc7e5", "metadata": {}, "source": [ "afs.shape above tells me I am dealing with a dataset that has 60,000 rows and 171 columns. I will now use afs.describe(include=object) to get some basic statistics on the columns stored as text (object dtype). This is useful for the following reasons:\n", "\n", "- Count shows me how many non-missing entries each column has\n", "- Unique shows me how many distinct values each column contains\n", "- Top shows me the most frequent value in each column\n", "- Freq shows me how many times that most frequent value appears" ] }, { "cell_type": "code", "execution_count": 147, "id": "7944cac4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ag_003 \\\n", "count 60000 60000 60000 60000 60000 60000 60000 60000 60000 60000 \n", "unique 2 30 2062 1887 334 419 155 618 2423 7880 \n", "top neg na 0 na 0 0 0 0 0 0 \n", "freq 59000 46329 8752 14861 55543 55476 59133 58587 56181 46894 \n", "\n", " ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 \\\n", "count ... 60000 60000 60000 60000 60000 60000 60000 60000 60000 \n", "unique ... 34489 31712 35189 36289 31796 30470 24214 9725 29 \n", "top ... 0 0 0 0 0 0 0 0 0 \n", "freq ... 1364 1557 1797 2814 4458 7898 17280 31863 57021 \n", "\n", " eg_000 \n", "count 60000 \n", "unique 50 \n", "top 0 \n", "freq 56794 \n", "\n", "[4 rows x 170 columns]" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs.describe(include=object)" ] }, { "cell_type": "markdown", "id": "cfb0b758", "metadata": {}, "source": [ "Now I begin to view the data. afs.head(5) gives me the first 5 rows of the data and afs.tail(5) gives me the last 5.\n", "\n", "This is a good way to get an understanding of what I am actually dealing with." ] }, { "cell_type": "code", "execution_count": 148, "id": "aee8d2fb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 \\\n", "0 neg 76698 na 2130706438 280 0 0 0 0 0 \n", "1 neg 33058 na 0 na 0 0 0 0 0 \n", "2 neg 41040 na 228 100 0 0 0 0 0 \n", "3 neg 12 0 70 66 0 10 0 0 0 \n", "4 neg 60874 na 1368 458 0 0 0 0 0 \n", "\n", " ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 \\\n", "0 ... 1240520 493384 721044 469792 339156 157956 73224 0 0 \n", "1 ... 421400 178064 293306 245416 133654 81140 97576 1500 0 \n", "2 ... 277378 159812 423992 409564 320746 158022 95128 514 0 \n", "3 ... 240 46 58 44 10 0 0 0 4 \n", "4 ... 622012 229790 405298 347188 286954 311560 433954 1218 0 \n", "\n", " eg_000 \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 32 \n", "4 0 \n", "\n", "[5 rows x 171 columns]" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the first 5 records\n", "afs.head(5)" ] }, { "cell_type": "code", "execution_count": 149, "id": "445668ec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \\\n", "59995 neg 153002 na 664 186 0 0 0 0 \n", "59996 neg 2286 na 2130706538 224 0 0 0 0 \n", "59997 neg 112 0 2130706432 18 0 0 0 0 \n", "59998 neg 80292 na 2130706432 494 0 0 0 0 \n", "59999 neg 40222 na 698 628 0 0 0 0 \n", "\n", " ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \\\n", "59995 0 ... 998500 566884 1290398 1218244 1019768 717762 898642 \n", "59996 0 ... 10578 6760 21126 68424 136 0 0 \n", "59997 0 ... 792 386 452 144 146 2622 0 \n", "59998 0 ... 699352 222654 347378 225724 194440 165070 802280 \n", "59999 0 ... 440066 183200 344546 254068 225148 158304 170384 \n", "\n", " ee_009 ef_000 eg_000 \n", "59995 28588 0 0 \n", "59996 0 0 0 \n", "59997 0 0 0 \n", "59998 388422 0 0 \n", "59999 158 0 0 \n", "\n", "[5 rows x 171 columns]" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs.tail(5)" ] }, { "cell_type": "code", "execution_count": 150, "id": "c818ebc0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',\n", " 'ag_000', 'ag_001', 'ag_002',\n", " ...\n", " 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',\n", " 'ee_009', 'ef_000', 'eg_000'],\n", " dtype='object', length=171)\n" ] } ], "source": [ "print(afs.columns)" ] }, { "cell_type": "markdown", "id": "02f71493", "metadata": {}, "source": [ "Checking the data type" ] }, { "cell_type": "code", "execution_count": 151, "id": "def2df33", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# info is a method, so it needs parentheses to print the summary\n", "afs.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "287f4907", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bbc1a9e4", "metadata": {}, "source": [ "As the 'neg' class is not applicable to 
this project, I will remove those rows from the dataset before I explore any further. \"The dataset’s positive class consists of component failures for a specific component of the APS system.\n", "The negative class consists of trucks with failures for components not related to the APS.\" This data is unrelated and therefore not useful for my project. First, I will change the data." ] }, { "cell_type": "code", "execution_count": null, "id": "321d9aab", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d281eb49", "metadata": {}, "source": [ "Code source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html" ] }, { "cell_type": "code", "execution_count": null, "id": "7440acd0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 164, "id": "939f42f9", "metadata": {}, "outputs": [], "source": [ "# Drop every row whose class is 'neg', keeping only the 'pos' rows\n", "afs = afs.drop(afs[afs['class'] == 'neg'].index)\n" ] }, { "cell_type": "code", "execution_count": 165, "id": "b9f9e6f0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 168, "id": "86f8a139", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Negative count: 0\n" ] } ], "source": [ "# Count the rows still labelled 'neg' (should be 0 after the drop)\n", "neg_count = (afs['class'] == 'neg').sum()\n", "print(\"Negative count:\", neg_count)\n" ] }, { "cell_type": "markdown", "id": "9a727193", "metadata": {}, "source": [ "This confirms that all 'neg' rows have been dropped. 
Source: - See method 2 'Using the drop function' https://saturncloud.io/blog/how-to-remove-rows-with-specific-values-in-pandas-dataframe/#:~:text=Another%20method%20to%20remove%20rows,value%20we%20want%20to%20remove" ] }, { "cell_type": "code", "execution_count": 156, "id": "500d2301", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \\\n", "9 pos 153204 0 182 na 0 0 0 0 \n", "23 pos 453236 na 2926 na 0 0 0 0 \n", "60 pos 72504 na 1594 1052 0 0 0 244 \n", "115 pos 762958 na na na na na 776 281128 \n", "135 pos 695994 na na na na na 0 0 \n", "... ... ... ... ... ... ... ... ... ... \n", "59484 pos 895178 na na na na na 0 0 \n", "59601 pos 862134 na na na na na 0 38834 \n", "59692 pos 186856 na na na 0 0 0 0 \n", "59742 pos 605092 na na na na na 0 44320 \n", "59769 pos 331704 na 1484 1142 0 0 0 267100 \n", "\n", " ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 \\\n", "9 0 ... 129862 26872 34044 22472 34362 0 \n", "23 222 ... 7908038 3026002 5025350 2025766 1160638 533834 \n", "60 178226 ... 1432098 372252 527514 358274 332818 284178 \n", "115 2186308 ... na na na na na na \n", "135 0 ... 1397742 495544 361646 28610 5130 212 \n", "... ... ... ... ... ... ... ... ... \n", "59484 0 ... 9116224 4276644 8701496 8082264 5827284 2057354 \n", "59601 1227952 ... 3456564 1793170 4159190 5847384 8364506 12875424 \n", "59692 4300 ... 2713108 800182 322322 71638 34662 7304 \n", "59742 1048970 ... 3940400 1865730 3698692 3271958 9831898 3755392 \n", "59769 1384372 ... 3738648 1425312 3381954 4346910 2166330 296580 \n", "\n", " ee_008 ee_009 ef_000 eg_000 \n", "9 0 0 0 0 \n", "23 493800 6914 0 0 \n", "60 3742 0 0 0 \n", "115 na na na na \n", "135 0 0 na na \n", "... ... ... ... ... 
\n", "59484 1662302 10790 na na \n", "59601 661442 2458 na na \n", "59692 2538 0 0 0 \n", "59742 65610 0 na na \n", "59769 15434 0 0 0 \n", "\n", "[1000 rows x 171 columns]\n" ] } ], "source": [ "afs.describe(include=object)\n", "print(afs)" ] }, { "cell_type": "markdown", "id": "94f2d4b2", "metadata": {}, "source": [ "Above, I printed neg_count to check that the negative rows were dropped. I then called describe and printed the DataFrame: 1000 rows remain, and \"class\" now holds a single value ('pos') instead of two. Source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html\n", "\n" ] }, { "cell_type": "markdown", "id": "21778292", "metadata": {}, "source": [ "I need to deal with the 'na' entries in the ab_000 column. Below I will experiment with different strategies to do this. First, I will explore the classification and regression of the dataset. For this project, I will use multiclass classification. " ] }, { "cell_type": "code", "execution_count": 157, "id": "3060f0cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 171)" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs.shape" ] }, { "cell_type": "code", "execution_count": 158, "id": "7eb4144c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 60000 entries, 0 to 59999\n", "Columns: 171 entries, class to eg_000\n", "dtypes: int64(1), object(170)\n", "memory usage: 78.3+ MB\n" ] } ], "source": [ "# Requesting basic info on the original dataset (before any rows were dropped)\n", "data.info()" ] }, { "cell_type": "markdown", "id": "8aa3159b", "metadata": {}, "source": [ "Basic Statistical Information on the dataset\n" ] }, { "cell_type": "code", "execution_count": 159, "id": "7ddd6f13", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " aa_000\n", "count 6.000000e+04\n", "mean 5.933650e+04\n", "std 1.454301e+05\n", "min 0.000000e+00\n", "25% 8.340000e+02\n", "50% 3.077600e+04\n", "75% 4.866800e+04\n", "max 2.746564e+06" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "id": "52eb81c9", "metadata": {}, "source": [ "I am checking the data for blank (missing) values. Note to self: add why this is important, from the lecture notes." ] }, { "cell_type": "code", "execution_count": 160, "id": "59ceb170", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "class 0\n", "aa_000 0\n", "ab_000 0\n", "ac_000 0\n", "ad_000 0\n", " ..\n", "ee_007 0\n", "ee_008 0\n", "ee_009 0\n", "ef_000 0\n", "eg_000 0\n", "Length: 171, dtype: int64" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 161, "id": "3e90e256", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "print(afs.isnull().values.any())" ] }, { "cell_type": "code", "execution_count": 162, "id": "ff98ff8b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pos 1000\n", "Name: class, dtype: int64" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs[\"class\"].value_counts().sort_index()\n" ] }, { "cell_type": "markdown", "id": "33f178d1", "metadata": {}, "source": [ "I have noticed my first issue with the data: it is imbalanced. The original dataset has 59,000 data points for negative and only 1,000 for positive. The class column contains two values, negative (neg) and positive (pos). I will change these to numerical values so I can count them: neg=0, pos=1. 
Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.html" ] }, { "cell_type": "code", "execution_count": null, "id": "48c3ce58", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5069335d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 163, "id": "d03b20fb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0 neg\n", " 1 neg\n", " 2 neg\n", " 3 neg\n", " 4 neg\n", " Name: class, dtype: object,\n", " neg 0.983333\n", " pos 0.016667\n", " Name: class, dtype: float64)" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Displaying the first few rows of the 'class' column and its distribution\n", "(data['class'].head(), data['class'].value_counts(normalize=True))\n" ] }, { "cell_type": "markdown", "id": "02754a15", "metadata": {}, "source": [ "The isnull check above found no NaN values. However, the describe output shows that missing entries are stored as the string 'na' (for example, 'na' is the most frequent value in ab_000, appearing 46,329 times in the original 60,000 rows), so the data is not yet 'complete' and these entries still need to be handled. " ] }, { "cell_type": "markdown", "id": "a4a32eca", "metadata": {}, "source": [ "Now I will begin to visualise my data using seaborn. " ] }, { "cell_type": "code", "execution_count": null, "id": "fc0fc3fa", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }