{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Data Cleaning using Python with Pandas Library.ipynb",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GCC1HfHqtO-l",
"colab_type": "text"
},
"source": [
"# Data Cleaning using Python with Pandas Library."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2mDYF115tTvt",
"colab_type": "text"
},
"source": [
"**According to this [article](https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/), data scientists spend most of their time cleaning and organizing data, and 57% of them rate it as the least enjoyable part of their work.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ovOTt6B5tdnw",
"colab_type": "text"
},
"source": [
"![alt text](https://whatsthebigdata.files.wordpress.com/2016/05/least-enjoyable.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x1a2c89qtux3",
"colab_type": "text"
},
"source": [
"## The entire data cleaning process is divided into sub-tasks as shown below."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dy0jTpw1twjy",
"colab_type": "text"
},
"source": [
"\n",
"\n",
"1. Importing the required libraries.\n",
"\n",
"2. Getting the dataset from an external source (Kaggle) and displaying it.\n",
"\n",
"3. Removing the unused or irrelevant columns.\n",
"\n",
"4. Renaming the column names as per our convenience.\n",
"\n",
"5. Replacing the values in the rows to make them more meaningful.\n",
"\n"
]
},
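{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch (assuming the Kaggle heart-disease CSV used throughout this tutorial), the five steps can be chained into one small pipeline, with each step explained in detail in the sections that follow:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# A minimal sketch of the whole cleaning pipeline (assumes heart.csv is present).\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"heart.csv\")                            # Step 2: load the dataset\n",
"df = df.drop(columns=['cp', 'fbs', 'restecg', 'thalach',  # Step 3: drop unused columns\n",
"                      'exang', 'oldpeak', 'slope', 'thal',\n",
"                      'target', 'ca'])\n",
"df = df.rename(columns={'age': 'Age', 'sex': 'Sex',       # Step 4: rename columns\n",
"                        'trestbps': 'Bps', 'chol': 'Cholesterol'})\n",
"df['Sex'] = df['Sex'].replace({0: 'F', 1: 'M'})           # Step 5: recode values\n",
"df.head()"
],
"execution_count": 0,
"outputs": []
},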
{
"cell_type": "markdown",
"metadata": {
"id": "FbxWi6q6uDYU",
"colab_type": "text"
},
"source": [
"Even though this tutorial is small, it is a good way to start with small things and get our hands dirty later on. I will make sure that anyone with no prior experience in Python programming, or who doesn't know what data science or data cleaning is, can easily understand this tutorial. I'm not very good at Python in the first place, so even for me this was a good place to start. One thing about Python is that the code is largely self-explanatory: your focus should not be on what the code does, because the code pretty much says what it does. Rather, you should explain why you chose to do it; the “why” factor is more important than the “what” factor."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TMCYcC0qufuS",
"colab_type": "text"
},
"source": [
"## Step 1: Importing the required libraries."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KktBTkD8uhWA",
"colab_type": "text"
},
"source": [
"This step involves importing the required libraries, [pandas](https://pandas.pydata.org/) and [numpy](https://www.numpy.org/). These are essential libraries for any data science work."
]
},
{
"cell_type": "code",
"metadata": {
"id": "oSAXK8aOtHMB",
"colab_type": "code",
"colab": {}
},
"source": [
"# Importing the necessary libraries.\n",
"import pandas as pd\n",
"import numpy as np"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2uaY_ZVYxGLi",
"colab_type": "text"
},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NHcoGHhtu4dV",
"colab_type": "text"
},
"source": [
"## Step 2: Getting the dataset from an external source and displaying it."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7AObb4PKu6CB",
"colab_type": "text"
},
"source": [
"This step involves getting the dataset from an external source, and the download link is provided below.\n",
"\n",
"[Dataset Download](https://www.kaggle.com/ronitf/heart-disease-uci)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "qUC9GuAmu4H3",
"colab_type": "code",
"outputId": "0f85eb1c-7fca-462b-d4d9-0feee16930b2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
}
},
"source": [
"# Reading a CSV file\n",
"df = pd.read_csv(\"heart.csv\")\n",
"df.head(5)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" age sex cp trestbps chol fbs ... exang oldpeak slope ca thal target\n",
"0 63 1 3 145 233 1 ... 0 2.3 0 0 1 1\n",
"1 37 1 2 130 250 0 ... 0 3.5 0 0 2 1\n",
"2 41 0 1 130 204 0 ... 0 1.4 2 0 2 1\n",
"3 56 1 1 120 236 0 ... 0 0.8 2 0 2 1\n",
"4 57 0 0 120 354 0 ... 1 0.6 2 0 2 1\n",
"\n",
"[5 rows x 14 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w1JAicnAvR_a",
"colab_type": "text"
},
"source": [
"Note: If you are using [Jupyter Notebook](https://jupyter.org/) to follow this tutorial, there should be no problem reading the CSV file. But if, like me, you are a Google fan and use [Google Colab](https://colab.research.google.com/notebooks/welcome.ipynb), which in my opinion is the best environment for practicing data science, you must follow a few extra steps to load or read the CSV file. This [article](https://towardsdatascience.com/mastering-the-features-of-google-colaboratory-92850e75701), which I wrote, helps you solve this issue, and I recommend everybody go through it.\n",
"\n",
"\n",
"\n",
"\n",
"---\n",
"\n"
]
},
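{
"cell_type": "markdown",
"metadata": {},
"source": [
"One minimal way to get heart.csv into Colab (just a sketch; the article linked above covers other options, such as mounting Google Drive) is the upload widget from google.colab:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Colab only: upload heart.csv through a browser file picker, then read it as usual.\n",
"try:\n",
"    from google.colab import files   # available only inside Colab\n",
"    files.upload()                   # the chosen file lands in the working directory\n",
"except ImportError:\n",
"    pass                             # running locally (e.g. Jupyter): nothing to do\n",
"\n",
"df = pd.read_csv(\"heart.csv\")"
],
"execution_count": 0,
"outputs": []
},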
{
"cell_type": "markdown",
"metadata": {
"id": "BiTiQZXWv-oA",
"colab_type": "text"
},
"source": [
"## Step 3: Removing the unused or irrelevant columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oZ1D1CcLv_mn",
"colab_type": "text"
},
"source": [
"This step involves removing columns that are irrelevant to our analysis, such as cp, fbs, thalach, and several more. The code is pretty much self-explanatory.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "rEJ-TNdivPpX",
"colab_type": "code",
"outputId": "e4e8f3c1-9c7e-4d4a-9f7b-7f23bace5993",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
}
},
"source": [
"# Dropping unused columns.\n",
"to_drop = ['cp',\n",
" 'fbs',\n",
" 'restecg',\n",
" 'thalach',\n",
" 'exang',\n",
" 'oldpeak',\n",
" 'slope',\n",
" 'thal',\n",
" 'target', \n",
" 'ca']\n",
"\n",
"df.drop(to_drop, inplace = True, axis = 1)\n",
"df.head(5)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" age sex trestbps chol\n",
"0 63 1 145 233\n",
"1 37 1 130 250\n",
"2 41 0 130 204\n",
"3 56 1 120 236\n",
"4 57 0 120 354"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iP8tQ6T3xDyC",
"colab_type": "text"
},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ky6rMC0UwI-u",
"colab_type": "text"
},
"source": [
"## Step 4: Renaming the column names as per our convenience."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yMRAnFNMwJyQ",
"colab_type": "text"
},
"source": [
"This step involves renaming the columns, because many of the original names are abbreviations that are confusing and hard to understand."
]
},
{
"cell_type": "code",
"metadata": {
"id": "go-MAwLNwMKK",
"colab_type": "code",
"outputId": "e7f0ec67-793e-4491-a2c6-0ef93848d3c2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
}
},
"source": [
"# Renaming the column names\n",
"new_name = {'age': 'Age',\n",
" 'sex': 'Sex',\n",
" 'trestbps': 'Bps',\n",
" 'chol': 'Cholesterol'\n",
" }\n",
"\n",
"df.rename(columns = new_name, inplace = True)\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Age Sex Bps Cholesterol\n",
"0 63 1 145 233\n",
"1 37 1 130 250\n",
"2 41 0 130 204\n",
"3 56 1 120 236\n",
"4 57 0 120 354"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R5InYZWOoVqD",
"colab_type": "text"
},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FSB1FlnAwT9G",
"colab_type": "text"
},
"source": [
"## Step 5: Replacing the value of the rows if necessary."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JJEkaeDSwUuo",
"colab_type": "text"
},
"source": [
"This step involves replacing incomplete values or making values more readable. Here, for example, the Sex column holds the values 1 and 0, where 1 means male and 0 means female; this is ambiguous to a third-party reader, so changing the values to something understandable is a good idea."
]
},
{
"cell_type": "code",
"metadata": {
"id": "rafp7vU5wWjI",
"colab_type": "code",
"outputId": "970cb32e-4d32-4314-f19c-9024154c40f7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
}
},
"source": [
"# Replacing the values in the row\n",
"replace_values = {0: 'F', 1: 'M'}\n",
"\n",
"df = df.replace({\"Sex\": replace_values}) \n",
"\n",
"df.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Age Sex Bps Cholesterol\n",
"0 63 M 145 233\n",
"1 37 M 130 250\n",
"2 41 F 130 204\n",
"3 56 M 120 236\n",
"4 57 F 120 354"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
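{
"cell_type": "markdown",
"metadata": {},
"source": [
"The replacement above is about readability. For genuinely incomplete (missing) values, which this particular dataset happens not to have, the usual pandas tools are isnull, dropna, and fillna; the snippet below is only an illustrative sketch:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sketch: this dataset has no missing values, but if it did...\n",
"df.isnull().sum()                 # count the missing entries in each column\n",
"\n",
"# Option 1: drop the incomplete rows entirely.\n",
"# df = df.dropna()\n",
"\n",
"# Option 2: fill the gaps with a column statistic, e.g. the median.\n",
"# df['Cholesterol'] = df['Cholesterol'].fillna(df['Cholesterol'].median())"
],
"execution_count": 0,
"outputs": []
},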
{
"cell_type": "markdown",
"metadata": {
"id": "TypPRLmewfru",
"colab_type": "text"
},
"source": [
"The above is a simple, end-to-end data cleaning process. Obviously this is not how cleaning is done at an industry level, but it is a good start: let's begin small and then move on to huge datasets, which involve a more elaborate cleaning process. This was just to give an idea of what data cleaning looks like from a beginner's perspective. Thank you for spending your time reading my article, and stay tuned for more updates. Let me know your opinion of this tutorial in the comment section below; also, if you have any doubts about the code, the comment section is all yours. Moreover, I am an author on [towardsdatascience.com](https://medium.com/@tanunprabhu95) and you can find this [article](https://towardsdatascience.com/data-cleaning-with-python-using-pandas-library-c6f4a68ea8eb) there. Have a nice day."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-7dpOTB0w6mc",
"colab_type": "text"
},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
}
]
}