{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DSI Summer Workshops Series\n",
    "\n",
    "## June 28, 2018\n",
    "\n",
    "Peggy Lindner<br>\n",
    "Center for Advanced Computing & Data Science (CACDS)<br>\n",
    "Data Science Institute (DSI)<br>\n",
    "University of Houston  \n",
    "plindner@uh.edu \n",
    "\n",
    "\n",
    "Please make sure you have Jupyterhub running with support for R and all the required packages installed. Data for this and other tutorials can be found in the github repsoitory for the Summer 2018 DSI Workshops https://github.com/peggylind/Materials_Summer2018\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How Much Money Should Machines Earn? *\n",
    "### - A journey into computerization (jobs that will be taken over by machines)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's learn some R by creating an interactive visualization of some open data because you will train many important skills of a data scientist: \n",
    "* loading,  \n",
    "* transforming and \n",
    "* combinig data, \n",
    "* cleaning and\n",
    "* performing a suitable visualization. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datasets used\n",
    "\n",
    "1. The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in [this paper](https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf) in 2013.\n",
    "\n",
    "2. Statistics of jobs from (employments, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "R needs some additional packages to do the work ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "library(dplyr)\n",
    "library(tabulizer)\n",
    "library(rlist)\n",
    "library(readxl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data (Down)Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#############################################################################################\n",
    "# Download and parse data about probability of computerisation\n",
    "#############################################################################################\n",
    "\n",
    "# set some variables to be used for download\n",
    "urlfile <- \"https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf\"\n",
    "file <- \"The_Future_of_Employment.pdf\"\n",
    "\n",
    "# download the pdf file (if we haven't done so already)\n",
    "if (!file.exists(file)) {\n",
    "    download.file(urlfile, destfile = file, mode = 'wb')\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Extracting data from a pdf file\n",
    "\n",
    "using [Tabula](https://tabula.technology/) from within R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract tables using tabulizer - that works a little bit like magic( and it takes some time)\n",
    "out <- extract_tables(file, encoding=\"UTF-8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# let's have a look at the \"thing\" that we just got\n",
    "#out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Transformation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We are not interested in first two tables - so let's remove them\n",
    "list.remove(out, c(1:2)) -> tables\n",
    "\n",
    "# now let's look what we got\n",
    "tables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Parse table into something that can be used in the next step"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we create a placefolder \n",
    "prob_comput_df=data.frame()\n",
    "\n",
    "# Now we go over each of the tables\n",
    "for (i in 1:length(tables))\n",
    "{\n",
    "  # We keep just SOC Code, rank and probability of computerisation\n",
    "  # We also remove first to lines of each element of table since they are non interesting\n",
    "  tables[[i]][-c(1,2),c(1,2,4)] %>% \n",
    "    as.data.frame(stringsAsFactors = FALSE) %>% \n",
    "    rbind(prob_comput_df) -> prob_comput_df\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's check what we got\n",
    "prob_comput_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's give this thing some proper column names\n",
    "colnames(prob_comput_df) = c(\"rank\", \"probability\", \"soc\")\n",
    "\n",
    "prob_comput_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#### Data Cleaning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# what does R think it is looking at?\n",
    "str(prob_comput_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prob_comput_df %>% \n",
    "  # convert things that look like numbers into numbers\n",
    "  mutate(rank=gsub(\"\\\\.\",\"\", rank) %>% as.numeric()) %>% \n",
    "  #let's get rid of missing data\n",
    "  na.omit() -> prob_comput_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "str(prob_comput_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# finally let's delete the file that we just downloaded\n",
    "file.remove(file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data (Down)Loading "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#############################################################################################\n",
    "# Download job statistics\n",
    "#############################################################################################\n",
    "\n",
    "# set some variables to be used for download\n",
    "urlfile <- \"https://www.bls.gov/emp/ind-occ-matrix/occupation.xlsx\"\n",
    "file <- \"occupation.xlsx\"\n",
    "# Download xlsx file \n",
    "if (!file.exists(file)) {\n",
    "    download.file(urlfile, destfile = file, mode = 'wb')\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read excel file into R\n",
    "job_stats_df <- read_excel(file, \n",
    "                           sheet=\"Table 1.7\", \n",
    "                           skip=3,\n",
    "                           col_names = c(\"job_title\",\n",
    "                                         \"soc\",\n",
    "                                         \"occupation_type\",\n",
    "                                         \"employment_2016\",\n",
    "                                         \"employment_2026\",\n",
    "                                         \"employment_change_2016_26_nu\",\n",
    "                                         \"employment_change_2016_26_pe\",\n",
    "                                         \"self_employed_2016_pe\",\n",
    "                                         \"occupational_openings_2016_26_av\",\n",
    "                                         \"median_annual_wage_2017\",\n",
    "                                         \"typical_education_entry\",\n",
    "                                         \"work_experience_related_occ\",\t\n",
    "                                         \"typical_training_needed\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now we can remove the downloaded file\n",
    "file.remove(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# let's look what we got here\n",
    "job_stats_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Transformation & Cleaning\n",
    "\n",
    "We are going to merge (join) the 2 data sets and keep only the columns that we need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#############################################################################################\n",
    "# Join data frames\n",
    "#############################################################################################\n",
    "results = prob_comput_df %>% \n",
    "  inner_join(job_stats_df, by = \"soc\") %>% \n",
    "  select(job_title, \n",
    "         probability, \n",
    "         employment_2016, \n",
    "         median_annual_wage_2017, \n",
    "         typical_education_entry) %>% \n",
    "  mutate(probability=as.numeric(probability),\n",
    "         median_annual_wage_2017=as.numeric(median_annual_wage_2017),\n",
    "         typical_education_entry=iconv(typical_education_entry, \"latin1\", \"ASCII\")) %>% \n",
    "  # get rid of missing data\n",
    "  na.omit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Aehmm, can we do that a little slower?\n",
    "#first, we join using the soc column\n",
    "first_step <- prob_comput_df %>% \n",
    "     inner_join(job_stats_df, by = \"soc\")\n",
    "\n",
    "first_step"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#second, we select only columns that we want\n",
    "second_step <- first_step %>%\n",
    "  select(job_title, \n",
    "         probability, \n",
    "         employment_2016, \n",
    "         median_annual_wage_2017, \n",
    "         typical_education_entry)\n",
    "\n",
    "second_step"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#third, we create 2 new columns using the existing columns\n",
    "\n",
    "third_step <- second_step %>% \n",
    "  mutate(probability=as.numeric(probability),\n",
    "         median_annual_wage_2017=as.numeric(median_annual_wage_2017),\n",
    "         typical_education_entry=iconv(typical_education_entry, \"latin1\", \"ASCII\")) \n",
    "\n",
    "third_step\n",
    "\n",
    "#that looks the same to me, but internally we change some data types\n",
    "str(second_step)\n",
    "str(third_step)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#do we have some missing data points?\n",
    "is.na(third_step)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#show me the rows with missing data\n",
    "third_step[!complete.cases(third_step),]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# and last but not least we remove the rows with missing data\n",
    "results <- third_step %>%\n",
    "    na.omit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#what did we get?\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Finally, let's create a visualization\n",
    "\n",
    "We are ging to use [Highcharter](http://jkunst.com/highcharter/index.html) which is just one of many ways to create interactive visualizations in R."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#we need some more packages \n",
    "#highcharter works with current ggplot only in dev version\n",
    "#devtools::install_github(\"jbkunst/highcharter\")\n",
    "#source(\"https://install-github.me/jbkunst/highcharter\")\n",
    "library(highcharter)\n",
    "library(htmlwidgets)\n",
    "library(IRdisplay)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#let's create an object that is actually a visual\n",
    "x=hchart(results, \n",
    "       \"scatter\", \n",
    "       hcaes(x = probability*100, \n",
    "             y = median_annual_wage_2017, \n",
    "             group=typical_education_entry, \n",
    "             size=employment_2016)) %>% \n",
    "  hc_title(text = \"How Much Money Should Machines Earn?\") %>%\n",
    "  hc_subtitle(text = \"Probability of Computerisation and Wages by Job\") %>% \n",
    "  hc_credits(enabled = TRUE, text = \"Source: Oxford Martin School and US Department of Labor\") %>% \n",
    "  hc_xAxis(title = list(text = \"Probability of Computerisation\"), labels = list(format = \"{value}%\")) %>% \n",
    "  hc_yAxis(title = list(text = \"Median Annual Wage 2017\"), labels = list(format = \"{value}$\")) %>% \n",
    "  hc_plotOptions(bubble = list(minSize = 3, maxSize = 35)) %>% \n",
    "  hc_tooltip(formatter = JS(\"function(){\n",
    "                            return ('<b>'+ this.point.job_title + '</b><br>'+\n",
    "                            'Probability of computerisation: '+ Highcharts.numberFormat(this.x, 0)+'%' + \n",
    "                            '<br>Median annual wage 2017 ($): '+ Highcharts.numberFormat(this.y, 0) + \n",
    "                            '<br>Employment 2016 (000s): '+ Highcharts.numberFormat(this.point.size, 0) )}\")) %>% \n",
    "  hc_chart(zoomType = \"xy\") %>%\n",
    "  hc_exporting(enabled = TRUE)\n",
    "\n",
    "# it's an object!\n",
    "#str(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# and now let's get this object showing up in our jupyter notebook\n",
    "saveWidget(x, 'demox.html', selfcontained = FALSE)\n",
    "display_html('<iframe src=\"demox.html\", width = 900, height = 500 ></iframe>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A full size version of the visualization can be found [here](https://fronkonstin.com/wp-content/uploads/2018/06/machines_wage.html)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And thanks again to the person who wrote the [original post](https://fronkonstin.com/2018/06/17/how-much-money-should-machines-earn/)!\n",
    "\n",
    "These are some insights:\n",
    "\n",
    "* There is a moderate negative correlation between wages and probability of computerisation.\n",
    "* Around 45% of US employments are threatened by machines (have a computerisation probability higher than 80%): half of them do not require formal education to entry.\n",
    "* In fact, 78% of jobs which do not require formal education to entry are threatened by machines: 0% which require a master’s degree are.\n",
    "* Teachers are absolutely irreplaceable (0% are threatened by machines) but they earn a 2.2% less then the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).\n",
    "* Don’t study for librarian or archivist: it seems a bad way to invest your time\n",
    "* Mathematicians will survive to machines\n",
    "\n",
    "##### What do you see there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}