{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Data Science\n", "## Part I. - What is Data Science?\n", "\n", "## Table of contents\n", "\n", "- ##### Administration\n", " - Administration\n", "\n", "- ##### Data Science intro\n", " - Intro\n", " - Taxonomy\n", " - Basic workflow\n", "\n", "- ##### Pipelines\n", " - Pipelines\n", "\n", "---\n", "\n", "## Administration\n", "\n", "### Curriculum:\n", "- Overview, technical basics, pipelines\n", "- Data Discovery, Naive linear classifiers\n", "- Data Transformation, Decision trees\n", "- Dimensionality Reduction, SVMs\n", "- Text mining, Neural networks\n", "- Model Evaluation, Hyperparameter optimization, Clustering\n", "- Regression and Embedding pipelines\n", "\n", "### Requirements:\n", "\n", "- Weekly Assignments\n", "- A data science project\n", "\n", "---\n", "\n", "## Intro\n", "\n", "### WTF is Data Science?\n", "\n", "According to a random venn diagram:\n", "\n", "\n", "
\n", "from kdnuggets\n", "\n", "As a metro map: \n", "\n", "\n", "
\n", "from pragmatic perspectives\n", "\n", "### At the end of the day:\n", "\n", "It's just a fancier name for Data Mining. Maybe throw some more hacking skill to the mix.\n", "\n", "\n", "### Who is a Data Scientist then?\n", "\n", "- _\"A data scientist is a statistician who lives in San Francisco\"_\n", "- _\"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.\"_\n", "\n", "\n", "### Thanks, much clearer now. (NOT) Can you please tell me at least what does he do? \n", "#### A.k.a: the typical workflow - The KDD (Knowledge Discovery in Databases) Process\n", "\n", "\n", "
\n", "from data flair\n", "\n", "\n", "## Basic taxonomy of data science methods\n", "\n", "There is a lot of \"implicit\" information in the data which humans can't directly observe, but can be extracted by statistical methods (a.k.a. _analytics_). Our goal is exactly this. Basically, there are two main types of analytics:\n", "\n", "#### Descriptive analytics\n", " \n", "
\n", "from data science central\n", "\n", "**Goal:** To extract valuable information from a given dataset. Answer the question: _\"What has happened?\"_ \n", "**Example:** Describe the relation between the students' math grade in high school and their achieved points in the university statistics course's tests.\n", "\n", "#### Predictive analytics\n", " \n", "
\n", "from Philippe Fournier-Viger\n", "\n", "**Goal:** Being able to make predictions on missing information based on previous knowledge. Answer the question: _\"What could happen?\"_ \n", "**Example #1:** When you apply for a loan, the bank gets your data, and puts it into its model for predicting the probability of you repaying that loan. Depending on this prediction it can choose to grant you the loan you asked for or not. \n", "**Example #2:** A store has some information on its customers, and from that information it can determine what type of people visit its stores (like students, retirees, etc.). This way it can adjust the stores open hours to fit the need of the different group of customers it serves. (This is called clustering.)\n", "\n", "---\n", " \n", "There is another way of categorizing the statistical/machine learning/data mining methods: **supervised** and **unsupervised** learning.\n", "\n", "#### Supervised learning\n", "
\n", "**Supervised learning** is based on data that is already 'labeled'. In other words we have data for which we know what the correct output is. We train our model on this dataset, and after this our model can predict the output of any input we give it (eg. is a picture shows a cat or a dog). The simplest supervised learning method is the linear regression.\n", "\n", "#### Unsupervised learning\n", "
\n", "With **unsupervised learning** we don't know what the correct output should be - we try to detect a hidden structure in the data. The simplest example for this is the above mentioned clustering example.\n", "\n", "### Validation\n", "\n", "How can we validate our model/output? In the case of unsupervised learning, we can't. With supervised learning, however the basic idea is pretty straightforward. We split our dataset into two parts: training and test set. We train our model _only on the training set_, and then compare the model's output on the test set to the known good output on it.\n", "\n", "
\n", "\n", "---\n", "\n", "## Basic workflow with scikit-learn\n", "\n", "\n", "
\n", "\n", "To introduce the basic workflow we'll try to answer a simple question: _\"Will I survive the sinking of the Titanic?\"_ \n", "This is a __classification problem__ which is a __prediction task__. We'll choose a familiar method to solve this problem: _logistic regression_. It is a __supervised method__ which we'll use to predict if a passenger survives the titanic catastrophe." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import random\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.metrics import confusion_matrix" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 1. Read and transform data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('data/titanic_full.csv', index_col='PassengerId').dropna(subset=['Embarked'])\n", "test_mask = pd.read_csv('data/titanic.csv', index_col='PassengerId')\n", "test_mask = test_mask['Survived'].isnull()\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sex = LabelEncoder()\n", "embark = LabelEncoder()\n", "\n", "data['Sex'] = sex.fit_transform(data['Sex'])\n", "data['Embarked'] = embark.fit_transform(data['Embarked'])\n", "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "input_cols = [col for col in data.columns\n", " if col not in ('Name', 'Ticket', 'Cabin', 'Survived')]\n", "target_col = 'Survived'\n", "\n", "train = data.loc[~test_mask]\n", "test = data.loc[test_mask]\n", "\n", "X_train = train[input_cols].fillna(-1)\n", "y_train = train[target_col]\n", "\n", "X_test = test[input_cols].fillna(-1)\n", "y_test = test[target_col]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Train models\n", "\n", "### Introducing pipelines\n", "\n", "Since we only want a logistic regression in our model, we could simply use the LogisticRegression() function we imported from sklearn's linear_model module. However, there is a useful concept called **pipeline**, which really comes in handy when dealing with more complicated models.\n", "\n", "When dealing with data, we may first want to transform our data to make it more digestible to our estimators (e.g. getting rid of some attributes). There can be multiple transformation steps involved in our process, and each transformation may have multiple parameters that can be tweaked independently. Pipelines provide a wrapping for these steps which makes working with these transformations easier and more conscise.\n", "\n", "- Create the pipline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logistic_regression = LogisticRegression()\n", "pipe = Pipeline(steps=[\n", " ('logistic', logistic_regression)\n", "])\n", "pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- fit the pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = pipe.fit(X_train, y_train)\n", "estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Validation\n", "- Validation accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = estimator.predict(X_test)\n", "print(\"Prediction accuracy: {:.2f}%\".format(np.sum(y_pred == y_test) / len(y_pred) * 100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Confusion matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cnf_matrix = confusion_matrix(y_test, y_pred)\n", "sns.heatmap(cnf_matrix, annot=True, fmt=\"d\", cmap=plt.cm.Blues);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Use the validated model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_pclass = 2 # 1st, 2nd or 3rd class\n", "my_sex = sex.transform(['male'])\n", "my_age = 40\n", "my_sibsp = 1 # Number of siblings/spouses aboard\n", "my_parch = 1 # Number of parents/children aboard\n", "my_fare = data.loc[data['Pclass'] == my_pclass, 'Fare'].mean() # the average fare for my_pclass\n", "my_embarked = embark.transform([random.choice('CQS')]) # Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)\n", "\n", "me = pd.DataFrame([{\n", " 'Pclass': my_pclass,\n", " 'Sex': my_sex[0],\n", " 'Age': my_age,\n", " 'SibSp': my_sibsp,\n", " 'Parch': my_parch,\n", " 'Fare': my_fare,\n", " 'Embarked': my_embarked[0]\n", "}])\n", "me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Drumroll\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.predict(me)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "szisz_ds_2025", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 1 }