{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is a Data Science Process?\n",
"\n",
"Data Science projects are often complex, with many stakeholders, data sources, and goals. Because of this, the Data Science community has created several methodologies for helping organize and structure Data Science Projects. In this lesson, we will explore 3 of the most popular methodologies--**_CRISP-DM_**, **_KDD_**, and **_OSEMN_**, and explore how we can make use of them to keep our own projects well-structured and organized. \n",
"\n",
"## CRoss-Industry Standard Process for Data Mining (CRISP-DM)\n",
"\n",
"
\n",
"\n",
"**_CRISP-DM_** is probably the most popular Data Science Process in the Data Science world right now. Take a look at the visualization above to get a feel for CRISP-DM. Notice that CRISP-DM is an iterative process!\n",
"\n",
"\n",
"Let's take a look at the individual steps involved in CRISP-DM.\n",
"\n",
"**_Business Understanding:_** This stage is all about gathering facts and requirements. Who will be using the model you build? How will they be using it? How will this help the goals of the business or organization overall? Data Science projects are complex, with many moving parts and stakeholders. They're also time intensive to complete or modify. Because of this, its very important that the Data Scientist team working on the project has a deep understanding of what the problem is, and how the solution will be used. Consider the fact that many stakeholders involved in the project may not have technical backgrounds, and may not even be from the same organization. Stakeholders from one part of the organization may have wildly different expectations about the project than stakeholders from a different part of the organization--for instance, the sales team may be under the impression that a recommendation system project is meant to increase sales by recommending upells to current customers, while the marketing team may be under the impression that the project is meant to help generate new leads by personalizing product recommendations in a marketing email. These are two very interpretations of a recommendation system project, and it's understandable that both departments would immediately assume that the primary goal of the project is one that helps their organization. As a Data Scientist, it's up to you clarify the requirements, and make sure that everyone involved understands what the project is and isn't. \n",
"\n",
"During this stage, the goal is to get everyone on the same page, and to provide clarity on scope of the project for everyone involved, not just the Data Science team. Generate and answer as many context questions as you can about the project. \n",
"\n",
"Good questions for this stage include:\n",
"\n",
"* Who are the stakeholders in this project? Who will be directly affected by the creation of this project?\n",
"* What business problem(s) will this Data Science project solve for the organization? \n",
"* What problems are inside the scope of this project?\n",
"* What problems are outside the scope of this project?\n",
"* What data sources are available to us?\n",
"\n",
"* What is the expected timeline for this project? Are there hard deadlines (e.g. \"must be live before holiday season shopping\") or is this an ongoing project?\n",
"* Do stakeholders from different parts of the company or organization all have the exact same understanding about what this project is and isn't?\n",
"\n",
"**_Data Understanding:_**\n",
"\n",
"Once we have a solid understanding of the business implications for this project, we move on to understanding our data. During this stage, we'll aim to get a solid understanding of the data needed to complete the project. This step includes both understanding where our data is coming from, as well as the information contained within the data. \n",
"\n",
"Consider the following questions when working through this stage:\n",
"\n",
"* What data is available to us? Where does it live? Do we have the data, or can we scrape/buy/source the data from somewhere else?\n",
"* Who controls the data sources, and what steps are needed to get access to the data?\n",
"* What is our target?\n",
"* What predictors are available to us?\n",
"* What data types are the predictors we'll be working with?\n",
"* What is the distribution of our data?\n",
"* How many observations does our dataset contain? Do we have a lot of data? Only a little? \n",
"* Do we have enough data to build a model? Will we need to use resampling methods?\n",
"* How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong?\n",
"\n",
"**_Data Preparation:_**\n",
"\n",
"Once we have a strong understanding of our data, we can move onto preparing the data for our modeling steps. \n",
"\n",
"During this stage, we'll want to handle the following issues:\n",
"\n",
"* Detecting and dealing with missing values\n",
"* Data type conversions (e.g. numeric data mistakenly encoded as strings)\n",
"* Checking for and removing multicollinearity (correlated predictors)\n",
"* Normalizing our numeric data\n",
"* Converting categorical data to numeric format through one-hot encoding\n",
"\n",
"**_Modeling:_**\n",
"\n",
"Once we have clean data, we can begin modeling! Remember, modeling, as with any of these other steps, is an iterative process. During this stage, we'll try to build and tune models to get the highest performance possible on our task. \n",
"\n",
"Consider the following questions during the modeling step:\n",
"\n",
"* Is this a classification task? A regression task? Something else?\n",
"* What models will we try?\n",
"* How do we deal with overfitting?\n",
"* Do we need to use regularization or not?\n",
"* What sort of validation strategy will we be using to check that our model works well on unseen data?\n",
"* What loss functions will we use?\n",
"* What threshold of performance do we consider as successful?\n",
"\n",
"**_Evaluation:_**\n",
"\n",
"During this step, we'll evaluate the results of our modeling effors. Does our model solve the problems that we outlined all the way back during step 1? Why or why not? Often times, evaluating the results of our modeling step will raise new questions, or will cause us to consider changing our approach to the problem. Notice from the CRISP-DM diagram above, that the \"Evaluation\" step is unique in that it points to both _Business Understanding_ and _Deployment_. As we mentioned before, Data Science is an iterative process--that means that given the new information our model has provided, we'll often want to start over with another iteration, armed with our newfound knowledge! Perhaps the results of our model showed us something important that we had originally failed to consider about the goal of the project, or the scope. Perhaps we learned that the model can't be successful without more data, or different data. Perhaps our evaluation shows us that we should reconsider our approach to cleaning and structuring the data, or how we frame the project as a whole (e.g. realizing we should treat the problem as a classification rather than a regression task). In any of these cases, it is totally encouraged to revisit the earlier steps. \n",
"\n",
"Of course, if the results are satisfactory, then we instead move onto Deployment!\n",
"\n",
"**_Deployment:_**\n",
"\n",
"During this stage, we'll focus on moving our model into production and automating as much as possible. Everything before this serves as a proof-of-concept or an investigation. If the project has proved successful, then you'll work with stakeholders to determine the best way to productionize our results. This means taking all of our learnings from the entire process and using it to automate or assign important tasks whenever possible. We may set up an an automated ETL (Extract-Transform-Load) process to get the data from the production database and reformat it so that it is ready for modeling. During the deployment step, you'll actively work to determine the best course of action for getting the results of your project into the wild, and you'll often be involved with building everything needed to put the software into production. \n",
"\n",
"This is one of the most rewarding steps of the entire Data Science Process--getting to see your work go live!\n",
"\n",
"## Knowledge Discovery in Databases\n",
"\n",
"
\n",
"\n",
"**_Knowledge Discovery in Databases_**, or **_KDD_** is considered the oldest Data Science Process. The creation of this process is credited to Gregory Piatetsky-Shapiro, who also runs the ever-popular Data Science blog, [kdnuggets](https://www.kdnuggets.com/). When you have extra time, you're encourages to read the original white paper on KDD, which can be found [here](https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1992.pdf)!\n",
"\n",
"The KDD process is quite similar to the CRISP-DM process. The diagram above illustrates every step of the KDD process, as well as the expected output at each stage. \n",
"\n",
"\n",
"**_Selection_**:\n",
"\n",
"During this stage, you'll focus on selecting your problem, and the data that will help you answer it. This stage works much like the first stage of CRISP-DM--you begin by focusing on developing an understanding of the domain the problem resides in (e.g. marketing, finance, increasing customer sales, etc), the previous work done in this domain, and the goals of the stakeholders involved with the process. \n",
"\n",
"Once you've developed a strong understanding of the goals and the domain, you'll work to establish where your data is coming from, and which data will be useful to you. Organizations and companies usually have a ton of data, and only some of it will be relevant to the problem you're trying to solve. During this stage, you'll focus on examing the data sources available to you and gathering the data that you deem useful for the project. \n",
"\n",
"The output of this stage is the dataset you'll be using for the Data Science project. \n",
"\n",
"**_Preprocessing_**:\n",
"\n",
"The preprocessing stage is pretty straightforward--the goal of this stage is to \"clean\" the data by preprocessing it. With things like text data, this may include things like tokenization. You'll also identify and deal with issues like outliers and/or missing data in this stage. \n",
"\n",
"In practice, this stage often blurs with the _Transformation_ stage. \n",
"\n",
"The output of this stage is preprocessed data that is more \"clean\" that it was at the start of this stage--although the dataset is not quite ready for modeling yet. \n",
"**_Transformation_**:\n",
"\n",
"During this stage, you'll take your preprocessed data and transform it in a way that makes it more ideal for modeling. This may include steps like feature engineering, and dimensionality reduction. At this stage, you'll also deal with things like checking for and removing multicollinearity from the dataset. Categorical data should also be converted to numeric format through one-hot encoding during this step.\n",
"\n",
"The output of this stage is a dataset that is now ready for modeling. All null values and outliers are removed, categorical data has been converted to a format that a model can work with, and the dataset is generally ready for experimentation with modeling. \n",
"\n",
"**_Data Mining_**:\n",
"\n",
"The Data Mining stage refers to using different modeling techniques to try and build a model that solves the problem we're after--often, this is a classification or regression task. During this stage, you'll also define your parameters for given models, as well as your overall criteria for measuring the performance of a model. \n",
"\n",
"You may be wondering what Data Mining is, and how it relates to Data Science. In practice, it's just an older term that essentially means the same thing as Data Science. Dr. Piatetsky-Shapiro defines Data Mining as \"the non-trivial extraction of implicit, previously unknown and potentially useful information from data.\" Making of things such as Machine Learning algorithms to find insights in large datasets that aren't immediately obvious without these algorithms is at the heart of the concept of Data Mining, just as it is in Data Science. In a pragmatic sense, this is why the terms Data Mining and Data Science are typically used interchangeably, although the term Data Mining is considered an older term that isn't used as often nowadays. \n",
"\n",
"The output of this stage is results from a fit to the data for the problem we're trying to solve. \n",
"\n",
"**_Interpretation/Evaluation_**:\n",
"\n",
"During this final stage of KDD, we focus on interpreting the \"patterns\" discovered in the previous step to help us make generalizations or predictions that help us answer our original question. During this stage, you'll consolidate everything you've learned use it or present it to stakeholders for guiding future actions. Your output may be a presentation that you use to communicate to non-technical managers or executives (never discount the importance of knowing PowerPoint as a data scientist!). Your conclusions for a project may range from \"this approach didn't work\" or \"we need more data about {X}\" to \"this is ready for production, let's build it!\". \n",
"\n",
"## OSEMiN\n",
"\n",
"
\n",
"