{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# School Budgeting with Machine Learning in Python\n", "> A Summary of lecture \"Case Study- School Budgeting with Machine Learning in Python\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing the challenge\n", "- Budgets for schools are huge, complex, and not standardize.\n", " - Hundreds of hours each year are spent manually labelling\n", "- Goal: Build a machine learning algorithm that can automate the process\n", "- Supervised Learning problem\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Due to the size of dataset, it is not included in this repository, however, you can download it through [kaggle repo](https://www.kaggle.com/jeromeblanchet/drivendatas-boxplots-for-education-dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the data\n", "Now it's time to check out the dataset! You'll use pandas (which has been pre-imported as pd) to load your data into a DataFrame and then do some Exploratory Data Analysis (EDA) of it.\n", "\n", "Some of the column names correspond to **features** - descriptions of the budget items - such as the ```Job_Title_Description``` column. The values in this column tell us if a budget item is for a teacher, custodian, or other employee.\n", "\n", "Some columns correspond to the budget item **labels** you will be trying to predict with your model. For example, the ```Object_Type``` column describes whether the budget item is related classroom supplies, salary, travel expenses, etc." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Function | \n", "Use | \n", "Sharing | \n", "Reporting | \n", "Student_Type | \n", "Position_Type | \n", "Object_Type | \n", "Pre_K | \n", "Operating_Status | \n", "Object_Description | \n", "... | \n", "Sub_Object_Description | \n", "Location_Description | \n", "FTE | \n", "Function_Description | \n", "Facility_or_Department | \n", "Position_Extra | \n", "Total | \n", "Program_Description | \n", "Fund_Description | \n", "Text_1 | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
134338 | \n", "Teacher Compensation | \n", "Instruction | \n", "School Reported | \n", "School | \n", "NO_LABEL | \n", "Teacher | \n", "NO_LABEL | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "1.0 | \n", "NaN | \n", "NaN | \n", "KINDERGARTEN | \n", "50471.810 | \n", "KINDERGARTEN | \n", "General Fund | \n", "NaN | \n", "
206341 | \n", "NO_LABEL | \n", "NO_LABEL | \n", "NO_LABEL | \n", "NO_LABEL | \n", "NO_LABEL | \n", "NO_LABEL | \n", "NO_LABEL | \n", "NO_LABEL | \n", "Non-Operating | \n", "CONTRACTOR SERVICES | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "RGN GOB | \n", "NaN | \n", "UNDESIGNATED | \n", "3477.860 | \n", "BUILDING IMPROVEMENT SERVICES | \n", "NaN | \n", "BUILDING IMPROVEMENT SERVICES | \n", "
326408 | \n", "Teacher Compensation | \n", "Instruction | \n", "School Reported | \n", "School | \n", "Unspecified | \n", "Teacher | \n", "Base Salary/Compensation | \n", "Non PreK | \n", "PreK-12 Operating | \n", "Personal Services - Teachers | \n", "... | \n", "NaN | \n", "NaN | \n", "1.0 | \n", "NaN | \n", "NaN | \n", "TEACHER | \n", "62237.130 | \n", "Instruction - Regular | \n", "General Purpose School | \n", "NaN | \n", "
364634 | \n", "Substitute Compensation | \n", "Instruction | \n", "School Reported | \n", "School | \n", "Unspecified | \n", "Substitute | \n", "Benefits | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "EMPLOYEE BENEFITS | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "UNALLOC BUDGETS/SCHOOLS | \n", "NaN | \n", "PROFESSIONAL-INSTRUCTIONAL | \n", "22.300 | \n", "GENERAL MIDDLE/JUNIOR HIGH SCH | \n", "NaN | \n", "REGULAR INSTRUCTION | \n", "
47683 | \n", "Substitute Compensation | \n", "Instruction | \n", "School Reported | \n", "School | \n", "Unspecified | \n", "Teacher | \n", "Substitute Compensation | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "TEACHER COVERAGE FOR TEACHER | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NON-PROJECT | \n", "NaN | \n", "PROFESSIONAL-INSTRUCTIONAL | \n", "54.166 | \n", "GENERAL HIGH SCHOOL EDUCATION | \n", "NaN | \n", "REGULAR INSTRUCTION | \n", "
5 rows × 25 columns
\n", "\n", " | Function | \n", "Use | \n", "Sharing | \n", "Reporting | \n", "Student_Type | \n", "Position_Type | \n", "Object_Type | \n", "Pre_K | \n", "Operating_Status | \n", "Object_Description | \n", "... | \n", "Sub_Object_Description | \n", "Location_Description | \n", "FTE | \n", "Function_Description | \n", "Facility_or_Department | \n", "Position_Extra | \n", "Total | \n", "Program_Description | \n", "Fund_Description | \n", "Text_1 | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
109283 | \n", "Professional Development | \n", "ISPD | \n", "Shared Services | \n", "Non-School | \n", "Unspecified | \n", "Instructional Coach | \n", "Other Compensation/Stipend | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "WORKSHOP PARTICIPANT | \n", "... | \n", "NaN | \n", "STAFF DEV AND INSTR MEDIA | \n", "NaN | \n", "INST STAFF TRAINING SVCS | \n", "NaN | \n", "NaN | \n", "48.620000 | \n", "NaN | \n", "GENERAL FUND | \n", "STAFF DEV AND INSTR MEDIA | \n", "
102430 | \n", "Substitute Compensation | \n", "Instruction | \n", "School Reported | \n", "School | \n", "Unspecified | \n", "Substitute | \n", "Base Salary/Compensation | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "SALARIES OF PART TIME EMPLOYEE | \n", "... | \n", "NaN | \n", "NaN | \n", "0.00431 | \n", "TITLE II,D | \n", "NaN | \n", "PROFESSIONAL-INSTRUCTIONAL | \n", "128.824985 | \n", "INSTRUCTIONAL STAFF TRAINING | \n", "NaN | \n", "INSTRUCTIONAL STAFF | \n", "
413949 | \n", "Parent & Community Relations | \n", "NO_LABEL | \n", "School Reported | \n", "School | \n", "NO_LABEL | \n", "Other | \n", "NO_LABEL | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "1.00000 | \n", "NaN | \n", "NaN | \n", "PARENT/TITLE I | \n", "4902.290000 | \n", "Misc | \n", "Schoolwide Schools | \n", "NaN | \n", "
433672 | \n", "Library & Media | \n", "Instruction | \n", "School on Central Budgets | \n", "Non-School | \n", "Unspecified | \n", "Librarian | \n", "Benefits | \n", "NO_LABEL | \n", "PreK-12 Operating | \n", "EMPLOYEE BENEFITS | \n", "... | \n", "NaN | \n", "ED RESOURCE SERVICES | \n", "NaN | \n", "NON-PROJECT | \n", "NaN | \n", "OFFICE/ADMINISTRATIVE SUPPORT | \n", "4020.290000 | \n", "MEDIA SUPPORT SERVICES | \n", "NaN | \n", "INSTRUCTIONAL STAFF | \n", "
415831 | \n", "Substitute Compensation | \n", "Instruction | \n", "School Reported | \n", "School | \n", "Poverty | \n", "Substitute | \n", "Substitute Compensation | \n", "Non PreK | \n", "PreK-12 Operating | \n", "Salaries And Wages For Substitute Professionals | \n", "... | \n", "Inservice Substitute Teachers Grant Funded | \n", "School | \n", "NaN | \n", "Instruction | \n", "Instruction And Curriculum | \n", "CERTIFIED SUBSTITUTE | \n", "46.530000 | \n", "Accelerated Education | \n", "\"Title Part A Improving Basic Programs\" | \n", "MISCELLANEOUS | \n", "
5 rows × 25 columns
\n", "\n", " | FTE | \n", "Total | \n", "
---|---|---|
count | \n", "126071.000000 | \n", "3.957220e+05 | \n", "
mean | \n", "0.426794 | \n", "1.310586e+04 | \n", "
std | \n", "0.573576 | \n", "3.682254e+05 | \n", "
min | \n", "-0.087551 | \n", "-8.746631e+07 | \n", "
25% | \n", "0.000792 | \n", "7.379770e+01 | \n", "
50% | \n", "0.130927 | \n", "4.612300e+02 | \n", "
75% | \n", "1.000000 | \n", "3.652662e+03 | \n", "
max | \n", "46.800000 | \n", "1.297000e+08 | \n", "