{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DSI Summer Workshops Series\n", "\n", "## June 28, 2018\n", "\n", "Peggy Lindner
\n", "Center for Advanced Computing & Data Science (CACDS)
\n", "Data Science Institute (DSI)
\n", "University of Houston \n", "plindner@uh.edu \n", "\n", "\n", "Please make sure you have Jupyterhub running with support for R and all the required packages installed. Data for this and other tutorials can be found in the github repsoitory for the Summer 2018 DSI Workshops https://github.com/peggylind/Materials_Summer2018\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How Much Money Should Machines Earn? *\n", "### - A journey into computerization (jobs that will be taken over by machines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's learn some R by creating an interactive visualization of some open data because you will train many important skills of a data scientist: \n", "* loading, \n", "* transforming and \n", "* combinig data, \n", "* cleaning and\n", "* performing a suitable visualization. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Datasets used\n", "\n", "1. The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in [this paper](https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf) in 2013.\n", "\n", "2. Statistics of jobs from (employments, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "R needs some additional packages to do the work ..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load libraries\n", "library(dplyr)\n", "library(tabulizer)\n", "library(rlist)\n", "library(readxl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data (Down)Loading" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#############################################################################################\n", "# Download and parse data about probability of computerisation\n", "#############################################################################################\n", "\n", "# set some variables to be used for download\n", "urlfile <- \"https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf\"\n", "file <- \"The_Future_of_Employment.pdf\"\n", "\n", "# download the pdf file (if we haven't done so already)\n", "if (!file.exists(file)) {\n", " download.file(urlfile, destfile = file, mode = 'wb')\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extracting data from a pdf file\n", "\n", "using [Tabula](https://tabula.technology/) from within R" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract tables using tabulizer - that works a little bit like magic( and it takes some time)\n", "out <- extract_tables(file, encoding=\"UTF-8\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's have a look at the \"thing\" that we just got\n", "#out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Transformation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We are not interested in first two tables - so let's remove them\n", "list.remove(out, c(1:2)) -> tables\n", "\n", "# now let's look what we got\n", "tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 
Parse the tables into something that can be used in the next step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First we create a placeholder \n", "prob_comput_df=data.frame()\n", "\n", "# Now we go over each of the tables\n", "for (i in 1:length(tables))\n", "{\n", " # We keep just the rank, probability of computerisation and SOC code\n", " # We also remove the first two lines of each table since they are not interesting\n", " tables[[i]][-c(1,2),c(1,2,4)] %>% \n", " as.data.frame(stringsAsFactors = FALSE) %>% \n", " rbind(prob_comput_df) -> prob_comput_df\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's check what we got\n", "prob_comput_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's give this thing some proper column names\n", "colnames(prob_comput_df) = c(\"rank\", \"probability\", \"soc\")\n", "\n", "prob_comput_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Cleaning" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# what does R think it is looking at?\n", "str(prob_comput_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prob_comput_df %>% \n", " # convert things that look like numbers into numbers\n", " mutate(rank=gsub(\"\\\\.\",\"\", rank) %>% as.numeric()) %>% \n", " #let's get rid of missing data\n", " na.omit() -> prob_comput_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str(prob_comput_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# finally let's delete the file that we just downloaded\n", "file.remove(file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data (Down)Loading " ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#############################################################################################\n", "# Download job statistics\n", "#############################################################################################\n", "\n", "# set some variables to be used for download\n", "urlfile <- \"https://www.bls.gov/emp/ind-occ-matrix/occupation.xlsx\"\n", "file <- \"occupation.xlsx\"\n", "# Download xlsx file \n", "if (!file.exists(file)) {\n", " download.file(urlfile, destfile = file, mode = 'wb')\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read excel file into R\n", "job_stats_df <- read_excel(file, \n", " sheet=\"Table 1.7\", \n", " skip=3,\n", " col_names = c(\"job_title\",\n", " \"soc\",\n", " \"occupation_type\",\n", " \"employment_2016\",\n", " \"employment_2026\",\n", " \"employment_change_2016_26_nu\",\n", " \"employment_change_2016_26_pe\",\n", " \"self_employed_2016_pe\",\n", " \"occupational_openings_2016_26_av\",\n", " \"median_annual_wage_2017\",\n", " \"typical_education_entry\",\n", " \"work_experience_related_occ\",\t\n", " \"typical_training_needed\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now we can remove the downloaded file\n", "file.remove(file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's look what we got here\n", "job_stats_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Transformation & Cleaning\n", "\n", "We are going to merge (join) the 2 data sets and keep only the columns that we need." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#############################################################################################\n", "# Join data frames\n", "#############################################################################################\n", "results = prob_comput_df %>% \n", " inner_join(job_stats_df, by = \"soc\") %>% \n", " select(job_title, \n", " probability, \n", " employment_2016, \n", " median_annual_wage_2017, \n", " typical_education_entry) %>% \n", " mutate(probability=as.numeric(probability),\n", " median_annual_wage_2017=as.numeric(median_annual_wage_2017),\n", " typical_education_entry=iconv(typical_education_entry, \"latin1\", \"ASCII\")) %>% \n", " # get rid of missing data\n", " na.omit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aehmm, can we do that a little slower?\n", "#first, we join using the soc column\n", "first_step <- prob_comput_df %>% \n", " inner_join(job_stats_df, by = \"soc\")\n", "\n", "first_step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#second, we select only columns that we want\n", "second_step <- first_step %>%\n", " select(job_title, \n", " probability, \n", " employment_2016, \n", " median_annual_wage_2017, \n", " typical_education_entry)\n", "\n", "second_step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#third, we create 2 new columns using the existing columns\n", "\n", "third_step <- second_step %>% \n", " mutate(probability=as.numeric(probability),\n", " median_annual_wage_2017=as.numeric(median_annual_wage_2017),\n", " typical_education_entry=iconv(typical_education_entry, \"latin1\", \"ASCII\")) \n", "\n", "third_step\n", "\n", "#that looks the same to me, but internally we change some data types\n", "str(second_step)\n", "str(third_step)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "#do we have some missing data points?\n", "is.na(third_step)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#show me the rows with missing data\n", "third_step[!complete.cases(third_step),]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and last but not least we remove the rows with missing data\n", "results <- third_step %>%\n", " na.omit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#what did we get?\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Finally, let's create a visualization\n", "\n", "We are ging to use [Highcharter](http://jkunst.com/highcharter/index.html) which is just one of many ways to create interactive visualizations in R." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#we need some more packages \n", "#highcharter works with current ggplot only in dev version\n", "#devtools::install_github(\"jbkunst/highcharter\")\n", "#source(\"https://install-github.me/jbkunst/highcharter\")\n", "library(highcharter)\n", "library(htmlwidgets)\n", "library(IRdisplay)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#let's create an object that is actually a visual\n", "x=hchart(results, \n", " \"scatter\", \n", " hcaes(x = probability*100, \n", " y = median_annual_wage_2017, \n", " group=typical_education_entry, \n", " size=employment_2016)) %>% \n", " hc_title(text = \"How Much Money Should Machines Earn?\") %>%\n", " hc_subtitle(text = \"Probability of Computerisation and Wages by Job\") %>% \n", " hc_credits(enabled = TRUE, text = \"Source: Oxford Martin School and US Department of Labor\") %>% \n", " hc_xAxis(title = list(text = \"Probability of Computerisation\"), labels = list(format = \"{value}%\")) 
%>% \n", " hc_yAxis(title = list(text = \"Median Annual Wage 2017\"), labels = list(format = \"{value}$\")) %>% \n", " hc_plotOptions(bubble = list(minSize = 3, maxSize = 35)) %>% \n", " hc_tooltip(formatter = JS(\"function(){\n", " return (''+ this.point.job_title + '
'+\n", " 'Probability of computerisation: '+ Highcharts.numberFormat(this.x, 0)+'%' + \n", " '
Median annual wage 2017 ($): '+ Highcharts.numberFormat(this.y, 0) + \n", " '
Employment 2016 (000s): '+ Highcharts.numberFormat(this.point.size, 0) )}\")) %>% \n", " hc_chart(zoomType = \"xy\") %>%\n", " hc_exporting(enabled = TRUE)\n", "\n", "# it's an object!\n", "#str(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and now let's get this object showing up in our jupyter notebook\n", "saveWidget(x, 'demox.html', selfcontained = FALSE)\n", "display_html('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A full size version of the visualization can be found [here](https://fronkonstin.com/wp-content/uploads/2018/06/machines_wage.html)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And thanks again to the person who wrote the [original post](https://fronkonstin.com/2018/06/17/how-much-money-should-machines-earn/)!\n", "\n", "These are some insights:\n", "\n", "* There is a moderate negative correlation between wages and probability of computerisation.\n", "* Around 45% of US employments are threatened by machines (have a computerisation probability higher than 80%): half of them do not require formal education to entry.\n", "* In fact, 78% of jobs which do not require formal education to entry are threatened by machines: 0% which require a master’s degree are.\n", "* Teachers are absolutely irreplaceable (0% are threatened by machines) but they earn a 2.2% less then the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).\n", "* Don’t study for librarian or archivist: it seems a bad way to invest your time\n", "* Mathematicians will survive to machines\n", "\n", "##### What do you see there?" 
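] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The claims above can be spot-checked directly. This is a minimal sketch, assuming the `results` data frame from the join step is still in memory (shares are counted over occupations, not weighted by employment):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# correlation between probability of computerisation and wages\n", "cor(results$probability, results$median_annual_wage_2017)\n", "\n", "# share of occupations with a computerisation probability higher than 80%\n", "mean(results$probability > 0.8)\n", "\n", "# share of threatened occupations by typical education needed for entry\n", "results %>%\n", " group_by(typical_education_entry) %>%\n", " summarise(threatened = mean(probability > 0.8))"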
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 2 }