{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Software Engineering Practices, Part 1\n",
"> A Summary of lecture in AWS ML Foundations Course, via Udacity\n",
"\n",
"- toc: true \n",
"- badges: true\n",
"- comments: true\n",
"- categories: [Software Engineering, Udacity]\n",
"- image: images/logo.png"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean and Modular Code\n",
"- **PRODUCTION CODE**: software running on production servers to handle live users and data of the intended audience. Note,this is different from *production quality code*, which describes code that meets expectations in reliability, efficiency, etc., for production. Ideally, all code in production meets these expectations, but this is not always the case.\n",
"\n",
"\n",
"- **CLEAN**: readable, simple, and concise. A characteristic of production quality code that is crucial for collaboration and maintainability in software development.\n",
"\n",
"\n",
"- **MODULAR**:logically broken up into functions and modules. Also an important characteristics of production quality code that makes your code more organized, efficient, and reusable.\n",
"\n",
"\n",
"- **MODULE**: a file. Modules allow code to be resued by encapsulating them into files that can be into other files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Refactoring Code\n",
"- **REFACTORING**: restructuring your code to improve its internal structure, without changing its external functionality. This gives you a chance to clean and modularize your program after you've got it working.\n",
"\n",
"\n",
"- Since it isn't easy to write your best code while you're still trying to just get it working, allocating time to do this is essential to producing high quality code. Despite the initial time and effort required, this really pays off by speeding up your development time in the long run.\n",
"\n",
"\n",
"- You become a much stronger programmer when you're constantly looking to improve your code. The more you refactor, the easier it will be to structure and write good code the first time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing Clean Code\n",
"### Meaningful Names\n",
"> **Tip: Use meaningful names**\n",
"\n",
"- **Be descriptive and imply type** : E.g. for booleans, you can prefix with `is_` or `has_` to make it clear it is a condition. You can also use part of speech to imply types, like verbs for functions and nouns for variables.\n",
"\n",
"- **Be consistent but clearly differentiate** : E.g. `age_list` and `age` is easier to differentiate than `ages` and `age`.\n",
"\n",
"- **Avoid abbreviations and especially single letters** : (Exception: counters and common math variables) Choosing when these exceptions can be made can be determined based on the audience for your code. If you work with other data scientists, certain variables may be common knowledge. While if you work with full stack engineers, it might be necessary to provide more descriptive names in these cases as well.\n",
"\n",
"- **Long names != descriptive names** : You should be descriptive, but only with relevant information. E.g. good functions names describe what they do well without including details about implementation or highly specific uses.\n",
"\n",
"Try testing how effective your names are by asking a fellow programmer to guess the purpose of a functino or variable based on its name, without looking at your code. Coming up with meaningful names often requires effort to get right.\n",
"\n",
"### Nice Whitespace\n",
"> **Tip: Use whitespace properly**\n",
"\n",
"- Organize your code with consistent indentation - the standard is to use 4 spaces for each indent. You can make this a default in your text editor.\n",
"\n",
"\n",
"- Separate sections with blank lines to keep your code well organized and readable.\n",
"\n",
"\n",
"- Try to limit your lines to around 79 characters, which is the guideline given in the PEP 8 style guide. In many good text editors, there is a setting to display a subtle line that indicates where the 79 character limit is.\n",
"\n",
"For more guidelines, check out the code layout section of PEP 8 in the notes below.\n",
"\n",
"[PEP 8 guidelines for code layout](https://www.python.org/dev/peps/pep-0008/?#code-lay-out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing Modular Code\n",
"> **Tip: DRY (Don't Repeat Yourself)**\n",
"\n",
"- Don't repeat yourself! Modularization allows you to reuse parts of your code. Generalize and consolidate repeated code in functions or loops.\n",
"\n",
"\n",
"> **Tip: Abstract out logic to improve readability**\n",
"\n",
"- Abstracting out code into a function not only makes it less repetitive, but also improves readability with descriptive function names. Although your code can become more readable when you abstract out logic into functions, it is possible to over-engineer this and have way too many modules, so use your judgement.\n",
"\n",
"\n",
"> **Tip: Minimize the number of entities (functions, classes, modules, etc.)**\n",
"\n",
"- There are tradeoffs to having function calls instead of inline logic. If you have broken up your code into an unnecessary amount of functions and modules, you'll have to jump around everywhere if you want to view the implementation details for something that may be too small to be worth it. Creating more modules doesn't necessarily result in effective modularization.\n",
"\n",
"> **Tip: Functions should do one thing**\n",
"\n",
"- Each function you write should be focused on doing one thing. If a function is doing multiple things, it becomes more difficult to generalize and reuse. Generally, if there's an \"and\" in your function name, consider refactoring.\n",
"\n",
"> **Tip: Arbitrary variable names can be more effective in certain functions**\n",
"\n",
"- Arbitrary variable names in general functions can actually make the code more readable.\n",
"\n",
"> **Tip: Try to use fewer than three arguments per function**\n",
"\n",
"- Try to use no more than three arguments when possible. This is not a hard rule and there are times it is more appropriate to use many parameters. But in many cases, it's more effective to use fewer arguments. Remember we are modularizing to simplify our code and make it more efficient to work with. If your function has a lot of parameters, you may want to rethink how you are splitting this up."
]
},
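{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to cut down on arguments is to group related values into a single object. The sketch below assumes a hypothetical `WineSample` dataclass and `describe_wine` function; neither comes from the lecture.\n",
"\n",
"```python\n",
"from dataclasses import dataclass\n",
"\n",
"# Hypothetical example: grouping related parameters into one object.\n",
"\n",
"# Many positional arguments are easy to pass in the wrong order:\n",
"#   describe_wine(7.4, 0.7, 3.51, 9.4, 5)\n",
"\n",
"@dataclass\n",
"class WineSample:\n",
"    fixed_acidity: float\n",
"    volatile_acidity: float\n",
"    pH: float\n",
"    alcohol: float\n",
"    quality: int\n",
"\n",
"def describe_wine(sample):\n",
"    # A single argument carries all related fields by name.\n",
"    return f'alcohol={sample.alcohol}, pH={sample.pH}, quality={sample.quality}'\n",
"\n",
"sample = WineSample(fixed_acidity=7.4, volatile_acidity=0.7, pH=3.51,\n",
"                    alcohol=9.4, quality=5)\n",
"summary = describe_wine(sample)\n",
"```"
]
},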
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Refactor: Wine Quality Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, you'll refactor code that analyzes a wine quality dataset taken from the UCI Machine Learning Repository [here](https://archive.ics.uci.edu/ml/datasets/wine+quality). Each row contains data on a wine sample, including several physicochemical properties gathered from tests, as well as a quality rating evaluated by wine experts.\n",
"\n",
"The code in this notebook first renames the columns of the dataset and then calculates some statistics on how some features may be related to quality ratings. Can you refactor this code to make it more clean and modular?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fixed acidity | \n",
" volatile acidity | \n",
" citric acid | \n",
" residual sugar | \n",
" chlorides | \n",
" free sulfur dioxide | \n",
" total sulfur dioxide | \n",
" density | \n",
" pH | \n",
" sulphates | \n",
" alcohol | \n",
" quality | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 7.8 | \n",
" 0.88 | \n",
" 0.00 | \n",
" 2.6 | \n",
" 0.098 | \n",
" 25.0 | \n",
" 67.0 | \n",
" 0.9968 | \n",
" 3.20 | \n",
" 0.68 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 7.8 | \n",
" 0.76 | \n",
" 0.04 | \n",
" 2.3 | \n",
" 0.092 | \n",
" 15.0 | \n",
" 54.0 | \n",
" 0.9970 | \n",
" 3.26 | \n",
" 0.65 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 11.2 | \n",
" 0.28 | \n",
" 0.56 | \n",
" 1.9 | \n",
" 0.075 | \n",
" 17.0 | \n",
" 60.0 | \n",
" 0.9980 | \n",
" 3.16 | \n",
" 0.58 | \n",
" 9.8 | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
"0 7.4 0.70 0.00 1.9 0.076 \n",
"1 7.8 0.88 0.00 2.6 0.098 \n",
"2 7.8 0.76 0.04 2.3 0.092 \n",
"3 11.2 0.28 0.56 1.9 0.075 \n",
"4 7.4 0.70 0.00 1.9 0.076 \n",
"\n",
" free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
"0 11.0 34.0 0.9978 3.51 0.56 \n",
"1 25.0 67.0 0.9968 3.20 0.68 \n",
"2 15.0 54.0 0.9970 3.26 0.65 \n",
"3 17.0 60.0 0.9980 3.16 0.58 \n",
"4 11.0 34.0 0.9978 3.51 0.56 \n",
"\n",
" alcohol quality \n",
"0 9.4 5 \n",
"1 9.8 5 \n",
"2 9.8 5 \n",
"3 9.8 6 \n",
"4 9.4 5 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./dataset/winequality-red.csv', sep=';')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Renaming Columns\n",
"You want to replace the spaces in the column labels with underscores to be able to reference columns with dot notation. Here's one way you could've done it."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fixed_acidity | \n",
" volatile_acidity | \n",
" citric_acid | \n",
" residual_sugar | \n",
" chlorides | \n",
" free_sulfur_dioxide | \n",
" total_sulfur_dioxide | \n",
" density | \n",
" pH | \n",
" sulphates | \n",
" alcohol | \n",
" quality | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 7.8 | \n",
" 0.88 | \n",
" 0.00 | \n",
" 2.6 | \n",
" 0.098 | \n",
" 25.0 | \n",
" 67.0 | \n",
" 0.9968 | \n",
" 3.20 | \n",
" 0.68 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 7.8 | \n",
" 0.76 | \n",
" 0.04 | \n",
" 2.3 | \n",
" 0.092 | \n",
" 15.0 | \n",
" 54.0 | \n",
" 0.9970 | \n",
" 3.26 | \n",
" 0.65 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 11.2 | \n",
" 0.28 | \n",
" 0.56 | \n",
" 1.9 | \n",
" 0.075 | \n",
" 17.0 | \n",
" 60.0 | \n",
" 0.9980 | \n",
" 3.16 | \n",
" 0.58 | \n",
" 9.8 | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fixed_acidity volatile_acidity citric_acid residual_sugar chlorides \\\n",
"0 7.4 0.70 0.00 1.9 0.076 \n",
"1 7.8 0.88 0.00 2.6 0.098 \n",
"2 7.8 0.76 0.04 2.3 0.092 \n",
"3 11.2 0.28 0.56 1.9 0.075 \n",
"4 7.4 0.70 0.00 1.9 0.076 \n",
"\n",
" free_sulfur_dioxide total_sulfur_dioxide density pH sulphates \\\n",
"0 11.0 34.0 0.9978 3.51 0.56 \n",
"1 25.0 67.0 0.9968 3.20 0.68 \n",
"2 15.0 54.0 0.9970 3.26 0.65 \n",
"3 17.0 60.0 0.9980 3.16 0.58 \n",
"4 11.0 34.0 0.9978 3.51 0.56 \n",
"\n",
" alcohol quality \n",
"0 9.4 5 \n",
"1 9.8 5 \n",
"2 9.8 5 \n",
"3 9.8 6 \n",
"4 9.4 5 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_df = df.rename(columns={'fixed acidity': 'fixed_acidity',\n",
" 'volatile acidity': 'volatile_acidity',\n",
" 'citric acid': 'citric_acid',\n",
" 'residual sugar': 'residual_sugar',\n",
" 'free sulfur dioxide': 'free_sulfur_dioxide',\n",
" 'total sulfur dioxide': 'total_sulfur_dioxide'\n",
" })\n",
"new_df.head() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here's a slightly better way you could do it. You can avoid making naming errors due to typos caused by manual typing. However, this looks a little repetitive. Can you make it better?\n",
"```python\n",
"labels = list(df.columns)\n",
"labels[0] = labels[0].replace(' ', '_')\n",
"labels[1] = labels[1].replace(' ', '_')\n",
"labels[2] = labels[2].replace(' ', '_')\n",
"labels[3] = labels[3].replace(' ', '_')\n",
"labels[5] = labels[5].replace(' ', '_')\n",
"labels[6] = labels[6].replace(' ', '_')\n",
"df.columns = labels\n",
"\n",
"df.head()\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fixed_acidity | \n",
" volatile_acidity | \n",
" citric_acid | \n",
" residual_sugar | \n",
" chlorides | \n",
" free_sulfur_dioxide | \n",
" total_sulfur_dioxide | \n",
" density | \n",
" pH | \n",
" sulphates | \n",
" alcohol | \n",
" quality | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 7.8 | \n",
" 0.88 | \n",
" 0.00 | \n",
" 2.6 | \n",
" 0.098 | \n",
" 25.0 | \n",
" 67.0 | \n",
" 0.9968 | \n",
" 3.20 | \n",
" 0.68 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 7.8 | \n",
" 0.76 | \n",
" 0.04 | \n",
" 2.3 | \n",
" 0.092 | \n",
" 15.0 | \n",
" 54.0 | \n",
" 0.9970 | \n",
" 3.26 | \n",
" 0.65 | \n",
" 9.8 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 11.2 | \n",
" 0.28 | \n",
" 0.56 | \n",
" 1.9 | \n",
" 0.075 | \n",
" 17.0 | \n",
" 60.0 | \n",
" 0.9980 | \n",
" 3.16 | \n",
" 0.58 | \n",
" 9.8 | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" 7.4 | \n",
" 0.70 | \n",
" 0.00 | \n",
" 1.9 | \n",
" 0.076 | \n",
" 11.0 | \n",
" 34.0 | \n",
" 0.9978 | \n",
" 3.51 | \n",
" 0.56 | \n",
" 9.4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fixed_acidity volatile_acidity citric_acid residual_sugar chlorides \\\n",
"0 7.4 0.70 0.00 1.9 0.076 \n",
"1 7.8 0.88 0.00 2.6 0.098 \n",
"2 7.8 0.76 0.04 2.3 0.092 \n",
"3 11.2 0.28 0.56 1.9 0.075 \n",
"4 7.4 0.70 0.00 1.9 0.076 \n",
"\n",
" free_sulfur_dioxide total_sulfur_dioxide density pH sulphates \\\n",
"0 11.0 34.0 0.9978 3.51 0.56 \n",
"1 25.0 67.0 0.9968 3.20 0.68 \n",
"2 15.0 54.0 0.9970 3.26 0.65 \n",
"3 17.0 60.0 0.9980 3.16 0.58 \n",
"4 11.0 34.0 0.9978 3.51 0.56 \n",
"\n",
" alcohol quality \n",
"0 9.4 5 \n",
"1 9.8 5 \n",
"2 9.8 5 \n",
"3 9.8 6 \n",
"4 9.4 5 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def replace_labels(columns):\n",
" labels = list(columns)\n",
" \n",
" for i in range(len(labels)):\n",
" labels[i] = labels[i].replace(' ', '_')\n",
" return labels\n",
"\n",
"df.columns = replace_labels(df.columns)\n",
"\n",
"df.head()"
]
},
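{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a possible further simplification, the loop can be written as a list comprehension. This is only a sketch of the same idea, with sample labels taken from the dataset; pandas also offers `df.columns.str.replace(' ', '_')` for this kind of renaming.\n",
"\n",
"```python\n",
"# Same behavior as the loop version, written as a list comprehension.\n",
"# Works on any iterable of strings, not only a pandas Index.\n",
"def replace_labels(columns):\n",
"    return [label.replace(' ', '_') for label in columns]\n",
"\n",
"labels = replace_labels(['fixed acidity', 'pH', 'free sulfur dioxide'])\n",
"```"
]
},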
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyzing Features\n",
"Now that your columns are ready, you want to see how different features of this dataset relate to the quality rating of the wine. A very simple way you could do this is by observing the mean quality rating for the top and bottom half of each feature. The code below does this for four features. It looks pretty repetitive right now. Can you make this more concise? \n",
"\n",
"You might challenge yourself to figure out how to make this code more efficient! But you don't need to worry too much about efficiency right now - we will cover that more in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"df = pd.read_csv('./dataset/winequality-red.csv', sep=';')\n",
"median_alcohol = df.alcohol.median()\n",
"for i, alcohol in enumerate(df.alcohol):\n",
" if alcohol >= median_alcohol:\n",
" df.loc[i, 'alcohol'] = 'high'\n",
" else:\n",
" df.loc[i, 'alcohol'] = 'low'\n",
"print(df.groupby('alcohol').quality.mean())\n",
"\n",
"median_pH = df.pH.median()\n",
"for i, pH in enumerate(df.pH):\n",
" if pH >= median_pH:\n",
" df.loc[i, 'pH'] = 'high'\n",
" else:\n",
" df.loc[i, 'pH'] = 'low'\n",
"print(df.groupby('pH').quality.mean())\n",
"\n",
"median_sugar = df['residual sugar'].median()\n",
"for i, sugar in enumerate(df['residual sugar']):\n",
" if sugar >= median_sugar:\n",
" df.loc[i, 'residual sugar'] = 'high'\n",
" else:\n",
" df.loc[i, 'residual sugar'] = 'low'\n",
"print(df.groupby('residual sugar').quality.mean())\n",
"\n",
"median_citric_acid = df['citric acid'].median()\n",
"for i, citric_acid in enumerate(df['citric acid']):\n",
" if citric_acid >= median_citric_acid:\n",
" df.loc[i, 'citric acid'] = 'high'\n",
" else:\n",
" df.loc[i, 'citric acid'] = 'low'\n",
"print(df.groupby('citric acid').quality.mean())\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def wine_quality(column):\n",
" median = df[column].median()\n",
" for i, col in enumerate(df[column]):\n",
" if col >= median:\n",
" df.loc[i, column] = 'high'\n",
" else:\n",
" df.loc[i, column] = 'low'\n",
" \n",
" return df.groupby(column).quality.mean()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"alcohol\n",
"high 5.958904\n",
"low 5.310302\n",
"Name: quality, dtype: float64\n",
"pH\n",
"high 5.598039\n",
"low 5.675607\n",
"Name: quality, dtype: float64\n",
"residual_sugar\n",
"high 5.665880\n",
"low 5.602394\n",
"Name: quality, dtype: float64\n",
"citric_acid\n",
"high 5.822360\n",
"low 5.447103\n",
"Name: quality, dtype: float64\n"
]
}
],
"source": [
"features = ['alcohol', 'pH', 'residual_sugar', 'citric_acid']\n",
"\n",
"for feature in features:\n",
" print(wine_quality(feature))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}