{ "cells": [ { "cell_type": "code", "execution_count": 144, "id": "31af659a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt #visualisation\n", "import seaborn as sns #visualisation\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "3ab38a44", "metadata": {}, "source": [ "Here I load the dataset. To avoid typing 'aps_failure_set.csv' every time, I work with a DataFrame under the shorter name 'afs'. Line 1 below reads the CSV file into 'data', and line 2 reads the same file into 'afs', which is the name I use from here on." ] }, { "cell_type": "code", "execution_count": 145, "id": "0ed5ef06", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('aps_failure_set.csv')\n", "afs = pd.read_csv('aps_failure_set.csv')" ] }, { "cell_type": "markdown", "id": "91500a52", "metadata": {}, "source": [ "Exploratory Analysis. \n", "I start by gathering some basic information about the dataset, such as its size and summary statistics, so I know what I'm dealing with." ] }, { "cell_type": "code", "execution_count": 146, "id": "b10e9b31", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(60000, 171)" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "afs.shape" ] }, { "cell_type": "markdown", "id": "e4edc7e5", "metadata": {}, "source": [ "The afs.shape output above tells me the dataset has 60,000 rows and 171 columns. Next I use afs.describe(include=object) to get basic statistics for the columns that pandas stores as text (object dtype). It reports four things for each column:\n", "\n", "- Count shows me how many non-null entries the column has.\n", "- Unique shows me how many distinct values the column has.\n", "- Top shows me the most common value in the column.\n", "- Freq shows me how many times that most common value occurs." ] }, { "cell_type": "code", "execution_count": 147, "id": "7944cac4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | class | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ag_003 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 |\n", 
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", 
"| count | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | ... | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 |\n", 
"| unique | 2 | 30 | 2062 | 1887 | 334 | 419 | 155 | 618 | 2423 | 7880 | ... | 34489 | 31712 | 35189 | 36289 | 31796 | 30470 | 24214 | 9725 | 29 | 50 |\n", 
"| top | neg | na | 0 | na | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", 
"| freq | 59000 | 46329 | 8752 | 14861 | 55543 | 55476 | 59133 | 58587 | 56181 | 46894 | ... | 1364 | 1557 | 1797 | 2814 | 4458 | 7898 | 17280 | 31863 | 57021 | 56794 |\n", 
"\n", 
"4 rows × 170 columns
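
To make the four statistics concrete, here is a minimal sketch on a small made-up DataFrame (the values in `toy` are hypothetical and are not taken from the real file). It also illustrates something the table above suggests: every column reports a count of 60000, yet the most common value in ab_000 is the text 'na', so missing readings appear to be stored as ordinary strings and are not counted as missing. I am assuming here that 'na' is this dataset's marker for a missing value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; these values are made up for illustration.
# dtype=object mirrors how the text columns are stored in the notebook.
toy = pd.DataFrame(
    {
        "class": ["neg", "neg", "pos", "neg"],
        "ab_000": ["na", "0", "na", "2"],
    },
    dtype=object,
)

# count / unique / top / freq for every object (text) column.
summary = toy.describe(include=object)
print(summary)

# 'na' is plain text, so pandas does not see it as a missing value.
missing_before = int(toy["ab_000"].isna().sum())

# Assumption: 'na' marks a missing reading. Swap it for NaN, then convert
# the remaining strings to numbers.
ab_numeric = pd.to_numeric(toy["ab_000"].replace("na", np.nan))
missing_after = int(ab_numeric.isna().sum())

print(missing_before, missing_after)
```

If that assumption holds for the real file, the same idea could be applied at load time by passing `na_values='na'` to `pd.read_csv`, after which describe() and isna() would report the gaps directly.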
\n", "\n", "| | class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 |\n", 
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", 
"| 0 | neg | 76698 | na | 2130706438 | 280 | 0 | 0 | 0 | 0 | 0 | ... | 1240520 | 493384 | 721044 | 469792 | 339156 | 157956 | 73224 | 0 | 0 | 0 |\n", 
"| 1 | neg | 33058 | na | 0 | na | 0 | 0 | 0 | 0 | 0 | ... | 421400 | 178064 | 293306 | 245416 | 133654 | 81140 | 97576 | 1500 | 0 | 0 |\n", 
"| 2 | neg | 41040 | na | 228 | 100 | 0 | 0 | 0 | 0 | 0 | ... | 277378 | 159812 | 423992 | 409564 | 320746 | 158022 | 95128 | 514 | 0 | 0 |\n", 
"| 3 | neg | 12 | 0 | 70 | 66 | 0 | 10 | 0 | 0 | 0 | ... | 240 | 46 | 58 | 44 | 10 | 0 | 0 | 0 | 4 | 32 |\n", 
"| 4 | neg | 60874 | na | 1368 | 458 | 0 | 0 | 0 | 0 | 0 | ... | 622012 | 229790 | 405298 | 347188 | 286954 | 311560 | 433954 | 1218 | 0 | 0 |\n", 
"\n", 
"5 rows × 171 columns
\n", "\n", "| | class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 |\n", 
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", 
"| 59995 | neg | 153002 | na | 664 | 186 | 0 | 0 | 0 | 0 | 0 | ... | 998500 | 566884 | 1290398 | 1218244 | 1019768 | 717762 | 898642 | 28588 | 0 | 0 |\n", 
"| 59996 | neg | 2286 | na | 2130706538 | 224 | 0 | 0 | 0 | 0 | 0 | ... | 10578 | 6760 | 21126 | 68424 | 136 | 0 | 0 | 0 | 0 | 0 |\n", 
"| 59997 | neg | 112 | 0 | 2130706432 | 18 | 0 | 0 | 0 | 0 | 0 | ... | 792 | 386 | 452 | 144 | 146 | 2622 | 0 | 0 | 0 | 0 |\n", 
"| 59998 | neg | 80292 | na | 2130706432 | 494 | 0 | 0 | 0 | 0 | 0 | ... | 699352 | 222654 | 347378 | 225724 | 194440 | 165070 | 802280 | 388422 | 0 | 0 |\n", 
"| 59999 | neg | 40222 | na | 698 | 628 | 0 | 0 | 0 | 0 | 0 | ... | 440066 | 183200 | 344546 | 254068 | 225148 | 158304 | 170384 | 158 | 0 | 0 |\n", 
"\n", 
"5 rows × 171 columns
\n", "\n", "| | aa_000 |\n", 
"|---|---|\n", 
"| count | 6.000000e+04 |\n", 
"| mean | 5.933650e+04 |\n", 
"| std | 1.454301e+05 |\n", 
"| min | 0.000000e+00 |\n", 
"| 25% | 8.340000e+02 |\n", 
"| 50% | 3.077600e+04 |\n", 
"| 75% | 4.866800e+04 |\n", 
"| max | 2.746564e+06 |\n", "