{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 8 - Group project\n", "\n", "> In this lesson we introduce the group project, its evaluation criteria and a submission example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lewtun/dslectures/master?urlpath=lab/tree/notebooks%2Flesson08_group-project.ipynb) \n", "[![slides](https://img.shields.io/static/v1?label=slides&message=lesson08_group-project.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=1F8QEELcst-lPPIwhEVQYy8UGFOdlv1gA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A guide to structuring your group projects\n", "\n", "_Summary:_ In your group projects you will solve a data science problem end-to-end, pretending to be recently hired data scientists in a company. To help you get started, we've prepared a checklist to guide you through the project. Here are the main steps that you will go through:\n", "\n", "1. Frame the problem and look at the big picture\n", "2. Get the data\n", "3. Explore and visualise the data to gain insights\n", "4. Prepare the data to better expose the underlying data patterns to machine learning algorithms\n", "5. Explore many different models and short-list the best ones\n", "6. Fine-tune your models\n", "7. Present your solution\n", "\n", "In each step we list a set of questions that one should have in mind when undertaking a data science project. The list is not meant to be exhaustive, but does contain a selection of the most important questions to ask. 
We will be available to provide assistance with each of the steps, and will allocate some part of each lesson towards working on the projects.\n", "\n", "### Our expectations\n", "\n", "To streamline the grading, your group must submit a _**single**_ Jupyter notebook, structured in terms of the first 6 sections listed in this guide (the seventh will be presented to the class). You are welcome to adapt code from the web (e.g. Kaggle kernels), but you **_must_** reference the original source in your notebook.\n", "\n", "In addition to _clean, well-documented code_ (e.g. functions with docstrings), your notebook will be judged according to how well each step is explained (using Markdown). The main goal is to simulate what it is like to work as a data scientist, where communication is arguably as important as the ability to extract insights from data.\n", "\n", "The analysis in the Jupyter notebook will be evaluated according to a rubric similar to the assignments:\n", "\n", "| Critical Task | Needs Improvement | Basic | Surpassed |\n", "| :--- | :--- | :--- | :--- |\n", "| **Computation:** Perform computations | Computations contain errors and extraneous code | Computations are correct but contain extraneous/unnecessary computations | Computations are correct and properly identified and labeled |\n", "| **Analysis:** Choose and carry out analysis appropriate for data and context | Choice of analysis is overly simplistic, irrelevant, or missing key components | Analysis is appropriate, but incomplete, or important features and assumptions are not made explicit | Analysis is appropriate, complete, advanced, relevant, and informative |\n", "| **Synthesis:** Identify key features of the analysis, and interpret results (including context) | Conclusions are missing, incorrect, or not based on the results of the analysis | Conclusions are reasonable, but only partially correct or partially complete | Conclusions are relevant and explicitly connected to the analysis and to the context |\n", "| **Visual 
presentation:** Communicate findings graphically clearly, precisely, and concisely | Inappropriate choice of plots; poorly labeled plots; plots missing | Plots convey information correctly but lack context for interpretation | Plots convey information correctly with adequate/appropriate reference information |\n", "| **Written:** Communicate findings clearly, precisely, and concisely | Explanation is illogical, incorrect, or incoherent | Explanation is partially correct but incomplete or unconvincing | Explanation is correct, complete, and convincing |\n", "\n", "**Grading split:** The group project accounts for 50% of the final grade and is split equally between the notebook (25%) and the presentation (25%).\n", "\n", "**Submission deadline:** Thursday, May 28, 2020 before 23:59:59 CEST (Notebook + presentation recording)\n", "\n", "**Presentation date:** Thursday, June 4, 2020 (Discussion of group projects with questions)\n", "\n", "### Deliverables\n", "Each team has to submit two deliverables before the submission deadline: 1) a notebook and 2) a presentation video.\n", "\n", "#### Notebook\n", "The notebook contains all the code needed to explore the dataset and train the final model, and documents each step clearly. If code is copied from another source, such as GitHub or Stack Overflow, it **_must_** be properly referenced.\n", "\n", "#### Presentation\n", "The presentation video should be 15 minutes long and should highlight the problem you are solving, the interesting things you found in the data and the steps involved in building up your model. On the presentation date we will discuss the presentation and ask questions about your project and submissions.\n", "\n", "### Some examples\n", "The Kaggle competitions [page](https://www.kaggle.com/competitions) has hundreds of examples where people have applied machine learning to solve a variety of problems. 
Below are a few examples that you might find useful:\n", "\n", "* Exploratory data analysis\n", " * Regression: [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)\n", "* Model building\n", " * Regression: [A study on Regression applied to the Ames dataset](https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset)\n", " \n", "There is also the excellent [Kaggle Learn](https://www.kaggle.com/learn/overview) resource, which you might find useful too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Frame the problem and look at the big picture\n", "1. Define the objective in business terms\n", "2. How should you frame the problem (supervised/unsupervised etc.)?\n", "3. How should performance be measured?\n", "4. How would you solve the problem manually?\n", "5. List the assumptions you and your team have made so far\n", "6. Verify your assumptions if possible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Get the data\n", "1. Find and document where you can get the data from\n", "2. Create a workspace. We like to structure our project folders as follows:\n", "\n", "```\n", "my-awesome-project\n", "├── data\n", "│   ├── external <- Data from third party sources.\n", "│   ├── interim <- Intermediate data that has been transformed.\n", "│   ├── processed <- The final, canonical data sets for modeling.\n", "│   └── raw <- The original, immutable data dump.\n", "│\n", "├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),\n", "│ the creator's initials, and a short \"-\" delimited description, e.g.\n", "│ 1.0-ltu-initial-data-exploration.\n", "│\n", "├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.\n", "│   └── figures <- Generated graphics and figures to be used in reporting\n", "│\n", "└── requirements.txt <- The requirements file for reproducing the analysis environment.\n", "```\n", "\n", "3. 
Once you and your team have agreed on the folder structure, we suggest creating a new virtual environment in the root of `my-awesome-project`.\n", "4. Get the data\n", "5. Check the size and type of data (time series, geographical etc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Explore the data\n", "1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary)\n", "2. Create a Jupyter notebook to keep a record of your data exploration\n", "3. Study each feature and its characteristics:\n", " * Name\n", " * Type (categorical, int/float, bounded/unbounded, text, structured, etc)\n", " * Percentage of missing values\n", " * Check for outliers, rounding errors etc\n", "4. For supervised learning tasks, identify the target(s)\n", "5. Visualise the data\n", "6. Study the correlations between features\n", "7. Study how you would solve the problem manually\n", "8. Identify the promising transformations you may want to apply (e.g. convert skewed targets to normal via a log transformation)\n", "9. Document what you have learned" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Prepare the data\n", "Notes:\n", "* Work on copies of the data (keep the original dataset intact).\n", "* Write functions for all data transformations you apply, for three reasons:\n", " * So you can easily prepare the data the next time you run your code\n", " * So you can apply these transformations in future projects\n", " * To clean and prepare the test set\n", " \n", " \n", "1. Data cleaning:\n", " * Fix or remove outliers (optional)\n", " * Fill in missing values (e.g. with zero, mean, median, ...) or drop their rows (or columns)\n", "2. Feature selection (optional):\n", " * Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).\n", "3. 
Feature engineering, where appropriate:\n", " * Discretize continuous features\n", " * Add promising transformations of features (e.g. $\\log(x)$, $\\sqrt{x}$, $x^2$, etc)\n", " * Aggregate features into promising new features\n", "4. Feature scaling: standardise or normalise features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Short-list promising models\n", "We expect you to do some additional research and train at **least one model per team member**. You can use a Random Forest model, but each team member has to investigate one additional model. So you may want to investigate the following alternatives for regression: \n", "- Linear Regression\n", "- Extra Trees\n", "- Histogram-based Gradient Boosting Regression Tree\n", "- Multi-layer Perceptron regressor\n", "- Elastic-Net\n", "\n", "These additional models don't need to contribute to your final submission, but the section for each individual model should be annotated with the name of the student who created it. Each student should understand their model to the point where they can explain how it works and answer simple questions about it. \n", "\n", "1. Train many quick and dirty models from different categories (e.g. linear, SVM, Random Forests etc) using default parameters\n", "2. Measure and compare their performance\n", "3. Analyse the most significant variables for each algorithm\n", "4. Analyse the types of errors the models make\n", "5. Have a quick round of feature selection and engineering\n", "6. Have one or two more quick iterations of the five previous steps\n", "7. Short-list the top three to five most promising models, preferring models that make different types of errors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Fine-tune the system\n", "1. Fine-tune the hyperparameters\n", "2. 
Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Present your solution\n", "1. Document what you have done\n", "2. Create a nice 15-minute video presentation with slides\n", " * Make sure you highlight the big picture first\n", "3. Explain why your solution achieves the business objective\n", "4. Don't forget to present interesting points you noticed along the way:\n", " * Describe what worked and what did not\n", " * List your assumptions and your model's limitations\n", "5. Ensure your key findings are communicated through nice visualisations or easy-to-remember statements (e.g. \"the median income is the number-one predictor of housing prices\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "* _Hands-On Machine Learning with Scikit-Learn and TensorFlow_, Appendix B, A. Géron \n", "* [Cookiecutter Data Science](http://drivendata.github.io/cookiecutter-data-science/#cookiecutter-data-science)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The competition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "The dataset for the competition poses a regression problem for car pricing. The data was scraped from eBay and contains several features, such as the age of the cars and the model type. In total there are 19 features in the dataset. The target for the competition is to predict the car price from these features. In total, the dataset contains 371'528 samples which are split into train (2/3) and test (1/3) sets. The training data contains the feature and label columns whereas the test set only contains the feature columns. The goal of the competition is to train a model on the training set and then use this model to predict the labels on the test set. 
A more detailed description of the data is available on the competition website.\n", "\n", "### Kaggle\n", "\n", "\n", "The competition is organized as a Kaggle challenge. The data is available on the Kaggle page and you have to upload your model predictions on the Kaggle page. Your results will automatically be evaluated and you will see your scores as well as the scores of the other teams on the leaderboard. Note that the test set is split into two parts: 70% is used to evaluate your predictions every time you upload them. The remaining 30% of the test set is not evaluated until the competition finishes and is used to calculate the final score. This split aims to prevent teams from overfitting their models to the test set score. The number of team submissions per day is limited to **20**. Therefore, make sure you distribute the work such that you can evaluate all your ideas during the competition.\n", "\n", "### Signup\n", "Go to the Kaggle website and click on `Register` to create an account. Once you have set up your account, go to the competition page and click on `Teams`. Invite your fellow team members and name your team according to the MS Teams names.\n", "\n", "\n", "### Submission\n", "\n", "To upload your predictions, store them as a csv file (see the `sample_submission.csv` for reference) and click on the `Submit Predictions` button on the competition page. Upload your submission in the dialog and add a short description of the steps that led to these particular predictions. After a few minutes you should see your score under `My submissions` and, if it's your best run, also on the `Leaderboard`." 
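, "\n", "\n",
"Before uploading, it can help to check that your file has the same layout as `sample_submission.csv`. The snippet below is a minimal sketch and not an official validator: `check_submission` is a hypothetical helper, and the two small DataFrames are toy stand-ins for the real files (only the `Id` and `Predicted` column names are taken from the sample submission).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def check_submission(submission, sample):\n",
"    # Hypothetical helper: raise a ValueError if the layout differs from the sample file\n",
"    if list(submission.columns) != list(sample.columns):\n",
"        raise ValueError(f'Expected columns {list(sample.columns)}, got {list(submission.columns)}')\n",
"    if not submission['Id'].equals(sample['Id']):\n",
"        raise ValueError('The Id column does not match the sample submission')\n",
"    if submission['Predicted'].isna().any():\n",
"        raise ValueError('Predictions contain missing values')\n",
"\n",
"\n",
"# Toy stand-ins for sample_submission.csv and for your own predictions\n",
"sample = pd.DataFrame({'Id': [248923, 248924, 248925], 'Predicted': [0, 0, 0]})\n",
"submission = sample.copy()\n",
"submission['Predicted'] = [2950.0, 3100.0, 1200.0]\n",
"\n",
"check_submission(submission, sample)  # passes silently if the layout is fine\n",
"```\n",
"\n",
"With the real data you would load both files with `pd.read_csv` and run the same check before spending one of your daily submissions."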
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: The median regressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pathlib import Path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datapath = Path('../data/')\n", "train = pd.read_csv(datapath/'train.csv')\n", "test = pd.read_csv(datapath/'test.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((248923, 21), (122541, 20))" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.shape, test.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " carId dateCrawled \\\n", "0 0 2016-04-01 18:39:16 \n", "1 1 2016-03-17 18:39:35 \n", "2 2 2016-03-16 12:50:34 \n", "3 3 2016-03-25 15:55:43 \n", "4 4 2016-03-12 22:40:32 \n", "\n", " name seller offerType price \\\n", "0 Volvo_S80_2.4 privat Angebot 999 \n", "1 BMW_Kombi privat Angebot 900 \n", "2 BMW_E39_525d_Exclusive_Automa_Vollausstattung_... privat Angebot 6700 \n", "3 BMW_320d_DPF_Touring_Aut. privat Angebot 5500 \n", "4 Schrott_Auto_Seat_Marbella_Altmetall_Schrottau... privat Angebot 16 \n", "\n", " abtest vehicleType yearOfRegistration gearbox ... model kilometer \\\n", "0 control limousine 1999 manuell ... andere 150000 \n", "1 control NaN 2017 manuell ... 3er 150000 \n", "2 test kombi 2004 automatik ... 5er 150000 \n", "3 test kombi 2006 automatik ... 3er 150000 \n", "4 control kleinwagen 1989 manuell ... andere 100000 \n", "\n", " monthOfRegistration fuelType brand notRepairedDamage \\\n", "0 9 benzin volvo ja \n", "1 2 NaN bmw nein \n", "2 2 diesel bmw nein \n", "3 10 diesel bmw nein \n", "4 10 benzin seat nein \n", "\n", " dateCreated nrOfPictures postalCode lastSeen \n", "0 2016-04-01 00:00:00 0 26810 2016-04-05 14:16:56 \n", "1 2016-03-17 00:00:00 0 27804 2016-03-23 07:17:16 \n", "2 2016-03-16 00:00:00 0 12435 2016-03-22 12:50:21 \n", "3 2016-03-25 00:00:00 0 4158 2016-04-06 20:16:52 \n", "4 2016-03-12 00:00:00 0 27755 2016-03-12 22:40:32 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make predictions\n", "In this example we build a simple model that just consists of the training set median of the target variable. This corresponds to the step where you train your model with `.fit(X, y)`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "median = train['price'].median()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2950.0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "median" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict on the test set\n", "Normally you would then use the trained model to predict the target on the test set with `.predict(X)`. Our simple model does not depend on the features so we can just assign the median value to all test samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "submission = pd.read_csv(datapath/'sample_submission.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id Predicted\n", "0 248923 0\n", "1 248924 0\n", "2 248925 0\n", "3 248926 0\n", "4 248927 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submission.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "submission['Predicted'] = median" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id Predicted\n", "0 248923 2950.0\n", "1 248924 2950.0\n", "2 248925 2950.0\n", "3 248926 2950.0\n", "4 248927 2950.0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submission.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save submission\n", "Finally, we take the predictions on the test set and save them in a submission file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "submission.to_csv(datapath/'median_submission.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now upload this file on the Kaggle competition webpage under `Submit Predictions`.\n", "The general procedure to fit a `scikit-learn` model and make a submission looks as follows:\n", "\n", "```Python\n", "# train model (my_model stands for any scikit-learn estimator or pipeline)\n", "X_train, y_train = train.drop('price', axis=1), train['price']\n", "my_model.fit(X_train, y_train)\n", "\n", "# predict on test set\n", "y_pred = my_model.predict(test)\n", "\n", "# create submission (the column name must match sample_submission.csv)\n", "submission['Predicted'] = y_pred\n", "submission.to_csv(datapath/'my_model_submission.csv', index=False)\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }