{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DSI Summer Workshops Series\n",
"\n",
"## June 28, 2018\n",
"\n",
"Peggy Lindner
\n",
"Center for Advanced Computing & Data Science (CACDS)
\n",
"Data Science Institute (DSI)
\n",
"University of Houston \n",
"plindner@uh.edu \n",
"\n",
"\n",
"Please make sure you have Jupyterhub running with support for R and all the required packages installed. Data for this and other tutorials can be found in the github repsoitory for the Summer 2018 DSI Workshops https://github.com/peggylind/Materials_Summer2018\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How Much Money Should Machines Earn? *\n",
"### - A journey into computerization (jobs that will be taken over by machines)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's learn some R by creating an interactive visualization of some open data because you will train many important skills of a data scientist: \n",
"* loading, \n",
"* transforming and \n",
"* combinig data, \n",
"* cleaning and\n",
"* performing a suitable visualization. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Datasets used\n",
"\n",
"1. The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in [this paper](https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf) in 2013.\n",
"\n",
"2. Statistics of jobs from (employments, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"R needs some additional packages to do the work ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load libraries\n",
"library(dplyr)\n",
"library(tabulizer)\n",
"library(rlist)\n",
"library(readxl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data (Down)Loading"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#############################################################################################\n",
"# Download and parse data about probability of computerisation\n",
"#############################################################################################\n",
"\n",
"# set some variables to be used for download\n",
"urlfile <- \"https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf\"\n",
"file <- \"The_Future_of_Employment.pdf\"\n",
"\n",
"# download the pdf file (if we haven't done so already)\n",
"if (!file.exists(file)) {\n",
" download.file(urlfile, destfile = file, mode = 'wb')\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Extracting data from a pdf file\n",
"\n",
"using [Tabula](https://tabula.technology/) from within R"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract tables using tabulizer - that works a little bit like magic( and it takes some time)\n",
"out <- extract_tables(file, encoding=\"UTF-8\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# let's have a look at the \"thing\" that we just got\n",
"#out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data Transformation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We are not interested in first two tables - so let's remove them\n",
"list.remove(out, c(1:2)) -> tables\n",
"\n",
"# now let's look what we got\n",
"tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Parse table into something that can be used in the next step"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First we create a placefolder \n",
"prob_comput_df=data.frame()\n",
"\n",
"# Now we go over each of the tables\n",
"for (i in 1:length(tables))\n",
"{\n",
" # We keep just SOC Code, rank and probability of computerisation\n",
" # We also remove first to lines of each element of table since they are non interesting\n",
" tables[[i]][-c(1,2),c(1,2,4)] %>% \n",
" as.data.frame(stringsAsFactors = FALSE) %>% \n",
" rbind(prob_comput_df) -> prob_comput_df\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's check what we got\n",
"prob_comput_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's give this thing some proper column names\n",
"colnames(prob_comput_df) = c(\"rank\", \"probability\", \"soc\")\n",
"\n",
"prob_comput_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#### Data Cleaning"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# what does R think it is looking at?\n",
"str(prob_comput_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prob_comput_df %>% \n",
" # convert things that look like numbers into numbers\n",
" mutate(rank=gsub(\"\\\\.\",\"\", rank) %>% as.numeric()) %>% \n",
" #let's get rid of missing data\n",
" na.omit() -> prob_comput_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"str(prob_comput_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# finally let's delete the file that we just downloaded\n",
"file.remove(file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data (Down)Loading "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#############################################################################################\n",
"# Download job statistics\n",
"#############################################################################################\n",
"\n",
"# set some variables to be used for download\n",
"urlfile <- \"https://www.bls.gov/emp/ind-occ-matrix/occupation.xlsx\"\n",
"file <- \"occupation.xlsx\"\n",
"# Download xlsx file \n",
"if (!file.exists(file)) {\n",
" download.file(urlfile, destfile = file, mode = 'wb')\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read excel file into R\n",
"job_stats_df <- read_excel(file, \n",
" sheet=\"Table 1.7\", \n",
" skip=3,\n",
" col_names = c(\"job_title\",\n",
" \"soc\",\n",
" \"occupation_type\",\n",
" \"employment_2016\",\n",
" \"employment_2026\",\n",
" \"employment_change_2016_26_nu\",\n",
" \"employment_change_2016_26_pe\",\n",
" \"self_employed_2016_pe\",\n",
" \"occupational_openings_2016_26_av\",\n",
" \"median_annual_wage_2017\",\n",
" \"typical_education_entry\",\n",
" \"work_experience_related_occ\",\t\n",
" \"typical_training_needed\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# now we can remove the downloaded file\n",
"file.remove(file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# let's look what we got here\n",
"job_stats_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data Transformation & Cleaning\n",
"\n",
"We are going to merge (join) the 2 data sets and keep only the columns that we need."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#############################################################################################\n",
"# Join data frames\n",
"#############################################################################################\n",
"results = prob_comput_df %>% \n",
" inner_join(job_stats_df, by = \"soc\") %>% \n",
" select(job_title, \n",
" probability, \n",
" employment_2016, \n",
" median_annual_wage_2017, \n",
" typical_education_entry) %>% \n",
" mutate(probability=as.numeric(probability),\n",
" median_annual_wage_2017=as.numeric(median_annual_wage_2017),\n",
" typical_education_entry=iconv(typical_education_entry, \"latin1\", \"ASCII\")) %>% \n",
" # get rid of missing data\n",
" na.omit()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Aehmm, can we do that a little slower?\n",
"#first, we join using the soc column\n",
"first_step <- prob_comput_df %>% \n",
" inner_join(job_stats_df, by = \"soc\")\n",
"\n",
"first_step"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#second, we select only columns that we want\n",
"second_step <- first_step %>%\n",
" select(job_title, \n",
" probability, \n",
" employment_2016, \n",
" median_annual_wage_2017, \n",
" typical_education_entry)\n",
"\n",
"second_step"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#third, we create 2 new columns using the existing columns\n",
"\n",
"third_step <- second_step %>% \n",
" mutate(probability=as.numeric(probability),\n",
" median_annual_wage_2017=as.numeric(median_annual_wage_2017),\n",
" typical_education_entry=iconv(typical_education_entry, \"latin1\", \"ASCII\")) \n",
"\n",
"third_step\n",
"\n",
"#that looks the same to me, but internally we change some data types\n",
"str(second_step)\n",
"str(third_step)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#do we have some missing data points?\n",
"is.na(third_step)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#show me the rows with missing data\n",
"third_step[!complete.cases(third_step),]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# and last but not least we remove the rows with missing data\n",
"results <- third_step %>%\n",
" na.omit()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#what did we get?\n",
"results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Finally, let's create a visualization\n",
"\n",
"We are ging to use [Highcharter](http://jkunst.com/highcharter/index.html) which is just one of many ways to create interactive visualizations in R."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#we need some more packages \n",
"#highcharter works with current ggplot only in dev version\n",
"#devtools::install_github(\"jbkunst/highcharter\")\n",
"#source(\"https://install-github.me/jbkunst/highcharter\")\n",
"library(highcharter)\n",
"library(htmlwidgets)\n",
"library(IRdisplay)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#let's create an object that is actually a visual\n",
"x=hchart(results, \n",
" \"scatter\", \n",
" hcaes(x = probability*100, \n",
" y = median_annual_wage_2017, \n",
" group=typical_education_entry, \n",
" size=employment_2016)) %>% \n",
" hc_title(text = \"How Much Money Should Machines Earn?\") %>%\n",
" hc_subtitle(text = \"Probability of Computerisation and Wages by Job\") %>% \n",
" hc_credits(enabled = TRUE, text = \"Source: Oxford Martin School and US Department of Labor\") %>% \n",
" hc_xAxis(title = list(text = \"Probability of Computerisation\"), labels = list(format = \"{value}%\")) %>% \n",
" hc_yAxis(title = list(text = \"Median Annual Wage 2017\"), labels = list(format = \"{value}$\")) %>% \n",
" hc_plotOptions(bubble = list(minSize = 3, maxSize = 35)) %>% \n",
" hc_tooltip(formatter = JS(\"function(){\n",
" return (''+ this.point.job_title + '
'+\n",
" 'Probability of computerisation: '+ Highcharts.numberFormat(this.x, 0)+'%' + \n",
" '
Median annual wage 2017 ($): '+ Highcharts.numberFormat(this.y, 0) + \n",
" '
Employment 2016 (000s): '+ Highcharts.numberFormat(this.point.size, 0) )}\")) %>% \n",
" hc_chart(zoomType = \"xy\") %>%\n",
" hc_exporting(enabled = TRUE)\n",
"\n",
"# it's an object!\n",
"#str(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# and now let's get this object showing up in our jupyter notebook\n",
"saveWidget(x, 'demox.html', selfcontained = FALSE)\n",
"display_html('')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A full size version of the visualization can be found [here](https://fronkonstin.com/wp-content/uploads/2018/06/machines_wage.html)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And thanks again to the person who wrote the [original post](https://fronkonstin.com/2018/06/17/how-much-money-should-machines-earn/)!\n",
"\n",
"These are some insights:\n",
"\n",
"* There is a moderate negative correlation between wages and probability of computerisation.\n",
"* Around 45% of US employments are threatened by machines (have a computerisation probability higher than 80%): half of them do not require formal education to entry.\n",
"* In fact, 78% of jobs which do not require formal education to entry are threatened by machines: 0% which require a master’s degree are.\n",
"* Teachers are absolutely irreplaceable (0% are threatened by machines) but they earn a 2.2% less then the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).\n",
"* Don’t study for librarian or archivist: it seems a bad way to invest your time\n",
"* Mathematicians will survive to machines\n",
"\n",
"##### What do you see there?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.5.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}