{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Unit 2.3 Extracting Information from Data, Pandas\n", "> Data connections, trends, and correlation. Pandas is introduced as it could be valuable for PBL, data validation, as well as understanding College Board Topics.\n", "- toc: true\n", "- image: /images/python.png\n", "- categories: []\n", "- type: ap\n", "- week: 25" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Files To Get\n", "\n", "Save this file to your **_notebooks** folder\n", "\n", "wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/2023-03-09-AP-unit2-3.ipynb\n", "\n", "Save these files into a subfolder named **files** in your **_notebooks** folder\n", "\n", "wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/files/data.csv\n", "\n", "wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/files/grade.json\n", "\n", "Save this image into a subfolder named **images** in your **_notebooks** folder\n", "\n", "wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/images/table_dataframe.png\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas and DataFrames\n", "> In this lesson we will be exploring data analysis using Pandas. \n", "\n", "- College Board talks about ideas like \n", " - Tools. \"the ability to process data depends on users capabilities and their tools\"\n", " - Combining Data. \"combine county data sets\"\n", " - Status on Data\"determining the artist with the greatest attendance during a particular month\"\n", " - Data poses challenge. \"the need to clean data\", \"incomplete data\"\n", "\n", "\n", "- [From Pandas Overview](https://pandas.pydata.org/docs/getting_started/index.html) -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.\n", "\n", "\n", "![DataFrame](images/table_dataframe.png)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "'''Pandas is used to gather data sets through its DataFrames implementation'''\n", "import pandas as pd" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning Data\n", "\n", "When looking at a data set, check to see what data needs to be cleaned. Examples include:\n", "- Missing Data Points\n", "- Invalid Data\n", "- Inaccurate Data\n", "\n", "Run the following code to see what needs to be cleaned" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Student ID Year in School GPA\n", "0 123 12 3.57\n", "1 246 10 4.00\n", "2 578 12 2.78\n", "3 469 11 3.45\n", "4 324 Junior 4.75\n", "5 313 20 3.33\n", "6 145 12 2.95\n", "7 167 10 3.90\n", "8 235 9th Grade 3.15\n", "9 nil 9 2.80\n", "10 469 11 3.45\n", "11 456 10 2.75\n" ] } ], "source": [ "# reads the JSON file and converts it to a Pandas DataFrame\n", "df = pd.read_json('files/grade.json')\n", "\n", "print(df)\n", "# What part of the data set needs to be cleaned?\n", "# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Extracting Info\n", "\n", "Take a look at some features that the Pandas library has that extracts info from the dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Extract Column" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " GPA\n", "0 3.57\n", "1 4.00\n", "2 2.78\n", "3 3.45\n", "4 4.75\n", "5 3.33\n", "6 2.95\n", "7 3.90\n", "8 3.15\n", "9 2.80\n", "10 3.45\n", "11 2.75\n", "\n", "Student ID GPA\n", " 123 3.57\n", " 246 4.00\n", " 578 2.78\n", " 469 3.45\n", " 324 4.75\n", " 313 3.33\n", " 145 2.95\n", " 167 3.90\n", " 235 3.15\n", " nil 2.80\n", " 469 3.45\n", " 456 2.75\n" ] } ], "source": [ "#print the values in the points column with column header\n", "print(df[['GPA']])\n", "\n", "print()\n", "\n", "#try two columns and remove the index from print statement\n", "print(df[['Student ID','GPA']].to_string(index=False))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Sort" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Student ID Year in School GPA\n", "11 456 10 2.75\n", "2 578 12 2.78\n", "9 nil 9 2.80\n", "6 145 12 2.95\n", "8 235 9th Grade 3.15\n", "5 313 20 3.33\n", "3 469 11 3.45\n", "10 469 11 3.45\n", "0 123 12 3.57\n", "7 167 10 3.90\n", "1 246 10 4.00\n", "4 324 Junior 4.75\n", "\n", " Student ID Year in School GPA\n", "4 324 Junior 4.75\n", "1 246 10 4.00\n", "7 167 10 3.90\n", "0 123 12 3.57\n", "3 469 11 3.45\n", "10 469 11 3.45\n", "5 313 20 3.33\n", "8 235 9th Grade 3.15\n", "6 145 12 2.95\n", "9 nil 9 2.80\n", "2 578 12 2.78\n", "11 456 10 2.75\n" ] } ], "source": [ "#sort values\n", "print(df.sort_values(by=['GPA']))\n", "\n", "print()\n", "\n", "#sort the values in reverse order\n", "print(df.sort_values(by=['GPA'], ascending=False))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Selection or Filter" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Student ID Year in School GPA\n", "0 123 12 3.57\n", "1 246 10 4.00\n", "3 469 11 3.45\n", "4 324 Junior 4.75\n", "5 313 20 3.33\n", "7 167 10 3.90\n", "8 235 9th Grade 3.15\n", "10 469 11 3.45\n" ] } ], "source": [ "#print only values with a specific criteria \n", "print(df[df.GPA > 3.00])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Selection Max and Min" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Student ID Year in School GPA\n", "4 324 Junior 4.75\n", "\n", " Student ID Year in School GPA\n", "11 456 10 2.75\n" ] } ], "source": [ "print(df[df.GPA == df.GPA.max()])\n", "print()\n", "print(df[df.GPA == df.GPA.min()])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Create your own DataFrame\n", "\n", "Using Pandas allows you to create your own DataFrame in Python." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Python Dictionary to Pandas DataFrame" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------------Dict_to_DF------------------\n", " calories duration\n", "0 420 50\n", "1 380 40\n", "2 390 45\n", "----------Dict_to_DF_labels--------------\n", " calories duration\n", "day1 420 50\n", "day2 380 40\n", "day3 390 45\n" ] } ], "source": [ "import pandas as pd\n", "\n", "#the data can be stored as a python dictionary\n", "dict = {\n", " \"calories\": [420, 380, 390],\n", " \"duration\": [50, 40, 45]\n", "}\n", "#stores the data in a data frame\n", "print(\"-------------Dict_to_DF------------------\")\n", "df = pd.DataFrame(dict)\n", "print(df)\n", "\n", "print(\"----------Dict_to_DF_labels--------------\")\n", "\n", "#or with the index argument, you can label rows.\n", "df = pd.DataFrame(dict, index = [\"day1\", \"day2\", \"day3\"])\n", "print(df)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Examine DataFrame Rows" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------Examine Selected Rows---------\n", " calories duration\n", "day1 420 50\n", "day3 390 45\n", "--------Examine Single Row-----------\n", "calories 420\n", "duration 50\n", "Name: day1, dtype: int64\n" ] } ], "source": [ "print(\"-------Examine Selected Rows---------\")\n", "#use a list for multiple labels:\n", "print(df.loc[[\"day1\", \"day3\"]])\n", "\n", "#refer to the row index:\n", "print(\"--------Examine Single Row-----------\")\n", "print(df.loc[\"day1\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas DataFrame Information" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 3 entries, day1 to day3\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 calories 3 non-null int64\n", " 1 duration 3 non-null int64\n", "dtypes: int64(2)\n", "memory usage: 180.0+ bytes\n", "None\n" ] } ], "source": [ "#print info about the data set\n", "print(df.info())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Example of larger data set\n", "\n", "Pandas can read CSV and many other types of files, run the following code to see more features with a larger data set" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--Duration Top 10---------\n", " Duration Pulse Maxpulse Calories\n", "69 300 108 143 1500.2\n", "79 270 100 131 1729.0\n", "109 210 137 184 1860.4\n", "60 210 108 160 1376.0\n", "106 180 90 120 800.3\n", "90 180 101 127 600.1\n", "65 180 90 130 800.4\n", "61 160 110 137 1034.4\n", "62 160 109 135 853.0\n", "67 150 107 130 816.0\n", "--Duration Bottom 10------\n", " Duration Pulse Maxpulse Calories\n", "68 20 106 136 110.4\n", "100 20 95 112 77.7\n", "89 20 83 107 50.3\n", "135 20 136 156 189.0\n", "94 20 150 171 127.4\n", "95 20 151 168 229.4\n", "139 20 141 162 222.4\n", "64 20 110 130 131.4\n", "112 15 124 139 124.2\n", "93 15 80 100 50.5\n" ] } ], "source": [ "import pandas as pd\n", "\n", "#read csv and sort 'Duration' largest to smallest\n", "df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)\n", "\n", "print(\"--Duration Top 10---------\")\n", "print(df.head(10))\n", "\n", "print(\"--Duration Bottom 10------\")\n", "print(df.tail(10))\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# APIs are a Source for Writing Programs with Data\n", "> 3rd Party APIs are a great source for creating Pandas Data Frames. \n", "- Data can be fetched and resulting json can be placed into a Data Frame\n", "- Observe output, this looks very similar to a Database" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " country_name cases deaths\n", "0 USA 82,649,779 1,018,316\n", "1 India 43,057,545 522,193\n", "2 Brazil 30,345,654 662,663\n", "3 France 28,244,977 145,020\n", "4 Germany 24,109,433 134,624\n", "5 UK 21,933,206 173,352\n" ] } ], "source": [ "'''Pandas can be used to analyze data'''\n", "import pandas as pd\n", "import requests\n", "\n", "def fetch():\n", " '''Obtain data from an endpoint'''\n", " url = \"https://flask.nighthawkcodingsociety.com/api/covid/\"\n", " fetch = requests.get(url)\n", " json = fetch.json()\n", "\n", " # filter data for requirement\n", " df = pd.DataFrame(json['countries_stat']) # filter endpoint for country stats\n", " print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths\n", " \n", "fetch()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Hacks\n", "> Early Seed award\n", "- Add this Blog to you own Blogging site.\n", "- Have all lecture files saved to your files directory before Tech Talk starts. Have data.csv open in vscode. Don't tell anyone. Show to Teacher.\n", "\n", "> AP Prep\n", "- Add this Blog to you own Blogging site. In the Blog add notes and observations on each code cell.\n", "- In blog add College Board practice problems for 2.3.\n", "\n", "> The next 4 weeks, Teachers want you to improve your understanding of data. Look at the blog and others on Unit 2. Your intention is to find some things to differentiate your individual College Board project.\n", "\n", "- Create or Find your own dataset. The suggestion is to use a JSON file, integrating with your PBL project would be ***Amazing***.\n", "\n", "- When choosing a data set, think about the following...\n", "\n", " - Does it have a good sample size?\n", " - Is there bias in the data?\n", " - Does the data set need to be cleaned?\n", " - What is the purpose of the data set?\n", " - ...\n", "\n", "- Continue this Blog using Pandas extract info from that dataset (ex. max, min, mean, median, mode, etc.)\n", "\n", "## Hack Helpers\n", "> Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?\n", "\n", "- There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop. \n", "\n", " Here are some suggestions...\n", "\n", " - Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.\n", "\n", " - Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.\n", "\n", " - NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.\n", "\n", " - Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.\n", "\n", " - COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.\n", "\n", " - World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.\n", "\n", "\n", "- These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.\n", "\n", "\n", "> Follow up question, \"where can I find Titanic data set?\"\n", "\n", "- The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset...\n", "\n", " - Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.\n", "\n", " - UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.\n", "\n", " - Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:\n", "\n", " ```python\n", " import seaborn as sns\n", " titanic_data = sns.load_dataset('titanic')\n", " ```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Titanic Data\n", "> Look at a sample of data." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Titanic Data\n", "Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',\n", " 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',\n", " 'alive', 'alone'],\n", " dtype='object')\n", " survived pclass sex age sibsp parch class fare \\\n", "0 0 3 male 22.0 1 0 Third 7.2500 \n", "1 1 1 female 38.0 1 0 First 71.2833 \n", "2 1 3 female 26.0 0 0 Third 7.9250 \n", "3 1 1 female 35.0 1 0 First 53.1000 \n", "4 0 3 male 35.0 0 0 Third 8.0500 \n", ".. ... ... ... ... ... ... ... ... \n", "886 0 2 male 27.0 0 0 Second 13.0000 \n", "887 1 1 female 19.0 0 0 First 30.0000 \n", "888 0 3 female NaN 1 2 Third 23.4500 \n", "889 1 1 male 26.0 0 0 First 30.0000 \n", "890 0 3 male 32.0 0 0 Third 7.7500 \n", "\n", " embark_town \n", "0 Southampton \n", "1 Cherbourg \n", "2 Southampton \n", "3 Southampton \n", "4 Southampton \n", ".. ... \n", "886 Southampton \n", "887 Southampton \n", "888 Southampton \n", "889 Cherbourg \n", "890 Queenstown \n", "\n", "[891 rows x 9 columns]\n" ] } ], "source": [ "import seaborn as sns\n", "\n", "# Load the titanic dataset\n", "titanic_data = sns.load_dataset('titanic')\n", "\n", "print(\"Titanic Data\")\n", "\n", "\n", "print(titanic_data.columns) # titanic data set\n", "\n", "print(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town']]) # look at selected columns" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> Use Pandas to clean the data. Most analysis, like Machine Learning or even Pandas in general like data to be in standardized format. This is called 'Training' or 'Cleaning' data." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " survived pclass sex age sibsp parch fare alone embarked_C \\\n", "0 0 3 1 22.0 1 0 7.2500 0 0.0 \n", "1 1 1 0 38.0 1 0 71.2833 0 1.0 \n", "2 1 3 0 26.0 0 0 7.9250 1 0.0 \n", "3 1 1 0 35.0 1 0 53.1000 0 0.0 \n", "4 0 3 1 35.0 0 0 8.0500 1 0.0 \n", ".. ... ... ... ... ... ... ... ... ... \n", "705 0 2 1 39.0 0 0 26.0000 1 0.0 \n", "706 1 2 0 45.0 0 0 13.5000 1 0.0 \n", "707 1 1 1 42.0 0 0 26.2875 1 0.0 \n", "708 1 1 0 22.0 0 0 151.5500 1 0.0 \n", "710 1 1 0 24.0 0 0 49.5042 1 1.0 \n", "\n", " embarked_Q embarked_S \n", "0 0.0 1.0 \n", "1 0.0 0.0 \n", "2 0.0 1.0 \n", "3 0.0 1.0 \n", "4 0.0 1.0 \n", ".. ... ... \n", "705 0.0 1.0 \n", "706 0.0 1.0 \n", "707 1.0 0.0 \n", "708 0.0 1.0 \n", "710 0.0 0.0 \n", "\n", "[564 rows x 11 columns]\n" ] } ], "source": [ "\n", "# Preprocess the data\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "\n", "td = titanic_data\n", "td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)\n", "td.dropna(inplace=True)\n", "td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0)\n", "td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0)\n", "\n", "# Encode categorical variables\n", "enc = OneHotEncoder(handle_unknown='ignore')\n", "enc.fit(td[['embarked']])\n", "onehot = enc.transform(td[['embarked']]).toarray()\n", "cols = ['embarked_' + val for val in enc.categories_[0]]\n", "td[cols] = pd.DataFrame(onehot)\n", "td.drop(['embarked'], axis=1, inplace=True)\n", "td.dropna(inplace=True)\n", "\n", "print(td)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> The result of 'Training' data is making it easier to analyze or make conclusions. In looking at the Titanic, as you clean you would probably want to make assumptions on likely chance of survival.\n", "\n", "This would involve analyzing various factors (such as age, gender, class, etc.) that may have affected a person's chances of survival, and using that information to make predictions about whether an individual would have survived or not. \n", "\n", "- Data description:\n", " - Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file.\n", " - Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)\n", " - Name - Name\n", " - Sex - Sex\n", " - Age - Age\n", " - Sibsp - Number of Siblings/Spouses Aboard\n", " - Parch - Number of Parents/Children Aboard\n", " - Ticket - Ticket Number\n", " - Fare - Passenger Fare\n", " - Cabin - Cabin\n", " - Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)\n", "\n", "- Perished Mean/Average" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "survived 0.000000\n", "pclass 2.464072\n", "sex 0.844311\n", "age 31.073353\n", "sibsp 0.562874\n", "parch 0.398204\n", "fare 24.835902\n", "alone 0.616766\n", "embarked_C 0.185629\n", "embarked_Q 0.038922\n", "embarked_S 0.775449\n", "dtype: float64\n" ] } ], "source": [ "print(titanic_data.query(\"survived == 0\").mean())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "- Survived Mean/Average" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "survived 1.000000\n", "pclass 1.878261\n", "sex 0.326087\n", "age 28.481522\n", "sibsp 0.504348\n", "parch 0.508696\n", "fare 50.188806\n", "alone 0.456522\n", "embarked_C 0.152174\n", "embarked_Q 0.034783\n", "embarked_S 0.813043\n", "dtype: float64\n" ] } ], "source": [ "print(td.query(\"survived == 1\").mean())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> Survived Max and Min Stats" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "survived 1.0000\n", "pclass 3.0000\n", "sex 1.0000\n", "age 80.0000\n", "sibsp 4.0000\n", "parch 5.0000\n", "fare 512.3292\n", "alone 1.0000\n", "embarked_C 1.0000\n", "embarked_Q 1.0000\n", "embarked_S 1.0000\n", "dtype: float64\n", "survived 1.00\n", "pclass 1.00\n", "sex 0.00\n", "age 0.75\n", "sibsp 0.00\n", "parch 0.00\n", "fare 0.00\n", "alone 0.00\n", "embarked_C 0.00\n", "embarked_Q 0.00\n", "embarked_S 0.00\n", "dtype: float64\n" ] } ], "source": [ "print(td.query(\"survived == 1\").max())\n", "print(td.query(\"survived == 1\").min())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning Visit Tutorials Point\n", "> Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistence interface in Python.\n", "\n", "- Description from ChatGPT. The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training......After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before......For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct....Chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.\n", "\n", "- Code Below prepares data for further analysis and provides an Accuracy. IMO, you would insert a new passenger and predict survival. Datasets could be used on various factors like prediction if a player will hit a Home Run, or a Stock will go up or down.\n", " - [Decision Trees](https://scikit-learn.org/stable/modules/tree.html#tree), prediction by a piecewise constant approximation.\n", " \n", " - [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), the probabilities describing the possible outcomes." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DecisionTreeClassifier Accuracy: 0.7705882352941177\n", "LogisticRegression Accuracy: 0.788235294117647\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/johnmortensen/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " n_iter_i = _check_optimize_result(\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Split arrays or matrices into random train and test subsets.\n", "X = td.drop('survived', axis=1)\n", "y = td['survived']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n", "\n", "# Train a decision tree classifier\n", "dt = DecisionTreeClassifier()\n", "dt.fit(X_train, y_train)\n", "\n", "# Test the model\n", "y_pred = dt.predict(X_test)\n", "accuracy = accuracy_score(y_test, y_pred)\n", "print('DecisionTreeClassifier Accuracy:', accuracy)\n", "\n", "# Train a logistic regression model\n", "logreg = LogisticRegression()\n", "logreg.fit(X_train, y_train)\n", "\n", "# Test the model\n", "y_pred = logreg.predict(X_test)\n", "accuracy = accuracy_score(y_test, y_pred)\n", "print('LogisticRegression Accuracy:', accuracy)\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "65f6bdf080211a4261ca30203f2967d5d410cd9d47d7b7e5694003092334a949" } } }, "nbformat": 4, "nbformat_minor": 2 }