{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset contains 1295 records of American colleges and their properties, collected by the [US Department of Education](https://collegescorecard.ed.gov/data/documentation/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import lux" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../data/college.csv\")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that `ACTMedian` and `SATAverage` are very strongly correlated, which means we can keep just one of the two columns and still retain about the same information. So let's drop the `ACTMedian` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.drop(columns=[\"ACTMedian\"])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the Category tab, we see that there are few records where `PredominantDegree` is \"Certificate\". In addition, few colleges have \"Private For-Profit\" as their `FundingModel`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a look at this by inspecting the `Series` corresponding to the column `PredominantDegree`. Note that Lux not only helps with visualizing dataframes, but also displays visualizations of `Series` objects." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"PredominantDegree\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df[\"PredominantDegree\"]==\"Certificate\"].to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upon inspection, there is only a single record for \"Certificate\". Looking at the [webpage for programs offered at Cleveland State Community College](http://catalog.clevelandstatecc.edu/content.php?catoid=2&navoid=90), we see that the college offers a large number of associate degrees as well as certificates. So we decide that this record is more appropriately labelled as \"Associate\" in the `PredominantDegree` field." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[df[\"PredominantDegree\"]==\"Certificate\",\"PredominantDegree\"] = \"Associate\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspecting the subset of 9 colleges that are \"Private For-Profit\", we do not find any commonalities among them, so we leave the data as-is for now." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df[\"FundingModel\"]==\"Private For-Profit\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Back to looking at the entire dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are interested in picking a college to attend and want to understand the `AverageCost` of attending different colleges and how that relates to other information in the dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.intent = [\"AverageCost\"]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that a large number of colleges cost around $20000 per year. We also see that Bachelor's degree colleges and colleges in New England and large cities tend to have a higher `AverageCost` than their counterparts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are interested in the trend of `AverageCost` vs. `SATAverage`, since there is a rough upward relationship above an `AverageCost` of $30000, but below that the trend is less clear." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.intent = [\"AverageCost\",\"SATAverage\"]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By adding `FundingModel`, we see that the cluster of points on the left can clearly be attributed to public colleges, whereas private colleges more or less follow a trend in which colleges with a higher `SATAverage` tend to have a higher `AverageCost`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }