{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "+ title: Natural Language Processing of German texts - Part 1: Using machine-learning to predict ratings\n", "+ date: 2020-04-10\n", "+ tags: python, NLP, sklearn, classification, machine-learning, tf-idf\n", "+ Slug: german-nlp-binary-text-classification-of-reviews-part1\n", "+ Category: Python\n", "+ Authors: MC\n", "+ Summary: Using a unique German data set containing ratings and comments on doctors, we build a Binary Text Classifier. To do so, we implement a complete machine learning work flow that predicts ratings from comments. In this first part, we start with basic methods. We go through text pre processing, feature creation (TF-IDF), classification and model optimization. Finally, we evaluate our model's ability to predict the sentiment of comments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Motivation\n", "\n", "The domain of Natural Language Processing (NLP) has seen a rapid advancement during the last few years. In particular, 2019 has been a remarkable year. New state of the art results on all relevant benchmarks have been established on a regular basis. In this multi part post we apply and compare several methods of Binary Text Classification, starting with traditional, basic methods and leading to state of the art models. For this, we use a novel German data set that contains hundreds of thousands of comments. They were submitted by patients on an online platform for reviewing doctors. The data also contains ratings so that it can be used to train supervised models in order to predict the rating / sentiment of the comment. \n", "Most data sets for NLP are in English and the majority of learning resources also focus on English. Hence, the unique data at our hands is valuable for applying NLP methods to German text. Because of its many observations and the fact that it contains labels, it is ideal for applying machine-learning methods. \n", "Let's take a look at a basic NLP workflow:\n", "\n", "
\n", "\n", "\"Chart\n", "\n", "
A typical NLP machine-learning workflow (own illustration)
\n", "
\n", "\n", "In the following notebook, we will go through this process step by step. In this first part, we apply frequency based methods for feature creation and compare several classification models. Moreover, we optimize the parameters of our process using a grid search. We'll see that these \"traditional\" methods already achieve adequate results on our data, that we'll use as a baseline. In follow up posts, we will apply more advanced methods and try to improve on our results in the search for the best performing model. During this journey, we will compare the different methods and discuss their pros and cons. \n", "\n", "You can download this notebook or follow along in an interactive version of it on \"Open and \n", " \"Open." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The data set\n", "\n", "You can take a look at the data on [data.world](https://data.world/mc51/german-language-reviews-of-doctors-by-patients) or directly download it from [here](https://query.data.world/s/v5xl53bs2rgq476vqy7cg7xx2db55y):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# store current path and download data there\n", "CURR_PATH = !pwd\n", "!wget -O reviews.zip https://query.data.world/s/v5xl53bs2rgq476vqy7cg7xx2db55y\n", "!unzip reviews.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " After unzipping you'll get a csv file. \n", "Before we open it, let's setup our notebook by loading all relevant modules and setting some options:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", " var JS_MIME_TYPE = 'application/javascript';\n", " var HTML_MIME_TYPE = 'text/html';\n", " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " var CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " var script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " var cell = handle.cell;\n", "\n", " var id = cell.output_area._bokeh_element_id;\n", " var server_id = cell.output_area._bokeh_server_id;\n", " // Clean up Bokeh references\n", " if (id != null && id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd, {\n", " iopub: {\n", " output: function(msg) {\n", " var id = msg.content.text.trim();\n", " if (id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " var output_area = handle.output_area;\n", " var output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " var bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " var script_attrs = bk_div.children[0].attributes;\n", " for (var i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " var toinsert = 
this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " var events = require('base/js/events');\n", " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", "\n", " \n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " var el = document.getElementById(\"1220\");\n", " if (el != null) {\n", " el.textContent = \"BokehJS is loading...\";\n", " }\n", " if (root.Bokeh !== undefined) {\n", " if (el != null) {\n", " el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n", " }\n", " } else if (Date.now() < root._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", "\n", " function run_callbacks() {\n", " try {\n", " root._bokeh_onload_callbacks.forEach(function(callback) {\n", " if (callback != null)\n", " callback();\n", " });\n", " } finally {\n", " delete root._bokeh_onload_callbacks\n", " }\n", " console.debug(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(css_urls, js_urls, callback) {\n", " if (css_urls == null) css_urls = [];\n", " if (js_urls == null) js_urls = [];\n", "\n", " root._bokeh_onload_callbacks.push(callback);\n", " if (root._bokeh_is_loading > 0) {\n", " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", "\n", " function on_load() {\n", " root._bokeh_is_loading--;\n", " if (root._bokeh_is_loading === 0) {\n", " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", " run_callbacks()\n", " }\n", " }\n", "\n", " function on_error() {\n", " console.error(\"failed to load \" + url);\n", " }\n", "\n", " for (var i = 0; i < css_urls.length; i++) {\n", " var url = css_urls[i];\n", " const element = document.createElement(\"link\");\n", " element.onload = on_load;\n", " element.onerror = on_error;\n", " element.rel = \"stylesheet\";\n", " element.type = \"text/css\";\n", " element.href = url;\n", " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", " document.body.appendChild(element);\n", " }\n", "\n", " const hashes = {\"https://cdn.bokeh.org/bokeh/release/bokeh-2.0.1.min.js\": \"JpP8FXbgAZLkfur7LiK3j9AGBhHNIvF742meBJrjO2ShJDhCG2I1uVvW+0DUtrmc\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.0.1.min.js\": \"xZlADit0Q04ISQEdKg2k3L4W9AwQBAuDs9nJL9fM/WwzL1tEU9VPNezOFX0nLEAz\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.0.1.min.js\": \"4BuPRZkdMKSnj3zoxiNrQ86XgNw0rYmBOxe7nshquXwwcauupgBF2DHLVG1WuZlV\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-2.0.1.min.js\": \"Dv1SQ87hmDqK6S5OhBf0bCuwAEvL5QYL0PuR/F1SPVhCS/r/abjkbpKDYL2zeM19\"};\n", "\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var element = document.createElement('script');\n", " element.onload = on_load;\n", " element.onerror = on_error;\n", " element.async = false;\n", " element.src = url;\n", " if (url in hashes) {\n", " element.crossOrigin = \"anonymous\";\n", " element.integrity = \"sha384-\" + hashes[url];\n", " }\n", " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.head.appendChild(element);\n", " }\n", " };var element = document.getElementById(\"1220\");\n", " if (element == null) {\n", " console.error(\"Bokeh: ERROR: autoload.js configured with elementid '1220' but no matching script tag was found. 
\")\n", " return false;\n", " }\n", "\n", " function inject_raw_css(css) {\n", " const element = document.createElement(\"style\");\n", " element.appendChild(document.createTextNode(css));\n", " document.body.appendChild(element);\n", " }\n", "\n", " \n", " var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-2.0.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.0.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.0.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-2.0.1.min.js\", \"https://unpkg.com/@holoviz/panel@^0.9.5/dist/panel.min.js\"];\n", " var css_urls = [];\n", " \n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " inject_raw_css(\".json-formatter-row {\\n font-family: monospace;\\n}\\n.json-formatter-row,\\n.json-formatter-row a,\\n.json-formatter-row a:hover {\\n color: black;\\n text-decoration: none;\\n}\\n.json-formatter-row .json-formatter-row {\\n margin-left: 1rem;\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty {\\n opacity: 0.5;\\n margin-left: 1rem;\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty:after {\\n display: none;\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-object:after {\\n content: \\\"No properties\\\";\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-array:after {\\n content: \\\"[]\\\";\\n}\\n.json-formatter-row .json-formatter-string,\\n.json-formatter-row .json-formatter-stringifiable {\\n color: green;\\n white-space: pre;\\n word-wrap: break-word;\\n}\\n.json-formatter-row .json-formatter-number {\\n color: blue;\\n}\\n.json-formatter-row .json-formatter-boolean {\\n color: red;\\n}\\n.json-formatter-row .json-formatter-null {\\n color: #855A00;\\n}\\n.json-formatter-row .json-formatter-undefined {\\n color: #ca0b69;\\n}\\n.json-formatter-row .json-formatter-function {\\n color: #FF20ED;\\n}\\n.json-formatter-row .json-formatter-date {\\n background-color: rgba(0, 0, 0, 0.05);\\n}\\n.json-formatter-row .json-formatter-url {\\n text-decoration: underline;\\n color: blue;\\n cursor: pointer;\\n}\\n.json-formatter-row .json-formatter-bracket {\\n color: blue;\\n}\\n.json-formatter-row .json-formatter-key {\\n color: #00008B;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-row .json-formatter-toggler-link {\\n cursor: pointer;\\n}\\n.json-formatter-row .json-formatter-toggler {\\n line-height: 1.2rem;\\n font-size: 0.7rem;\\n vertical-align: middle;\\n opacity: 0.6;\\n cursor: pointer;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-row .json-formatter-toggler:after {\\n display: inline-block;\\n transition: transform 100ms ease-in;\\n content: \\\"\\\\25BA\\\";\\n}\\n.json-formatter-row > a > .json-formatter-preview-text {\\n opacity: 0;\\n transition: opacity 0.15s ease-in;\\n font-style: italic;\\n}\\n.json-formatter-row:hover > a > .json-formatter-preview-text {\\n opacity: 0.6;\\n}\\n.json-formatter-row.json-formatter-open > .json-formatter-toggler-link .json-formatter-toggler:after {\\n transform: rotate(90deg);\\n}\\n.json-formatter-row.json-formatter-open > .json-formatter-children:after {\\n display: inline-block;\\n}\\n.json-formatter-row.json-formatter-open > a > .json-formatter-preview-text {\\n display: none;\\n}\\n.json-formatter-row.json-formatter-open.json-formatter-empty:after {\\n display: block;\\n}\\n.json-formatter-dark.json-formatter-row {\\n font-family: monospace;\\n}\\n.json-formatter-dark.json-formatter-row,\\n.json-formatter-dark.json-formatter-row 
a,\\n.json-formatter-dark.json-formatter-row a:hover {\\n color: white;\\n text-decoration: none;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-row {\\n margin-left: 1rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty {\\n opacity: 0.5;\\n margin-left: 1rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty:after {\\n display: none;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-object:after {\\n content: \\\"No properties\\\";\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-array:after {\\n content: \\\"[]\\\";\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-string,\\n.json-formatter-dark.json-formatter-row .json-formatter-stringifiable {\\n color: #31F031;\\n white-space: pre;\\n word-wrap: break-word;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-number {\\n color: #66C2FF;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-boolean {\\n color: #EC4242;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-null {\\n color: #EEC97D;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-undefined {\\n color: #ef8fbe;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-function {\\n color: #FD48CB;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-date {\\n background-color: rgba(255, 255, 255, 0.05);\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-url {\\n text-decoration: underline;\\n color: #027BFF;\\n cursor: pointer;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-bracket {\\n color: #9494FF;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-key {\\n color: #23A0DB;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-toggler-link {\\n cursor: pointer;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-toggler {\\n line-height: 1.2rem;\\n font-size: 0.7rem;\\n vertical-align: middle;\\n opacity: 0.6;\\n cursor: pointer;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-toggler:after {\\n display: inline-block;\\n transition: transform 100ms ease-in;\\n content: \\\"\\\\25BA\\\";\\n}\\n.json-formatter-dark.json-formatter-row > a > .json-formatter-preview-text {\\n opacity: 0;\\n transition: opacity 0.15s ease-in;\\n font-style: italic;\\n}\\n.json-formatter-dark.json-formatter-row:hover > a > .json-formatter-preview-text {\\n opacity: 0.6;\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open > .json-formatter-toggler-link .json-formatter-toggler:after {\\n transform: rotate(90deg);\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open > .json-formatter-children:after {\\n display: inline-block;\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open > a > .json-formatter-preview-text {\\n display: none;\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open.json-formatter-empty:after {\\n display: block;\\n}\\n\");\n", " },\n", " function(Bokeh) {\n", " inject_raw_css(\".widget-box {\\n\\tmin-height: 20px;\\n\\tbackground-color: #f5f5f5;\\n\\tborder: 1px solid #e3e3e3 !important;\\n\\tborder-radius: 4px;\\n\\t-webkit-box-shadow: inset 0 1px 1px rgba(0,0,0,.05);\\n\\tbox-shadow: inset 0 1px 1px rgba(0,0,0,.05);\\n\\toverflow-x: hidden;\\n\\toverflow-y: hidden;\\n}\\n\\n.scrollable {\\n overflow: scroll;\\n}\\n\\nprogress 
{\\n\\tappearance: none;\\n\\t-moz-appearance: none;\\n\\t-webkit-appearance: none;\\n\\n\\tborder: none;\\n\\theight: 20px;\\n\\tbackground-color: whiteSmoke;\\n\\tborder-radius: 3px;\\n\\tbox-shadow: 0 2px 3px rgba(0,0,0,.5) inset;\\n\\tcolor: royalblue;\\n\\tposition: relative;\\n\\tmargin: 0 0 1.5em;\\n}\\n\\nprogress[value]::-webkit-progress-bar {\\n\\tbackground-color: whiteSmoke;\\n\\tborder-radius: 3px;\\n\\tbox-shadow: 0 2px 3px rgba(0,0,0,.5) inset;\\n}\\n\\nprogress[value]::-webkit-progress-value {\\n\\tposition: relative;\\n\\n\\tbackground-size: 35px 20px, 100% 100%, 100% 100%;\\n\\tborder-radius:3px;\\n}\\n\\nprogress.active:not([value])::before {\\n\\tbackground-position: 10%;\\n\\tanimation-name: stripes;\\n\\tanimation-duration: 3s;\\n\\tanimation-timing-function: linear;\\n\\tanimation-iteration-count: infinite;\\n}\\n\\nprogress[value]::-moz-progress-bar {\\n\\tbackground-size: 35px 20px, 100% 100%, 100% 100%;\\n\\tborder-radius:3px;\\n}\\n\\nprogress:not([value])::-moz-progress-bar {\\n\\tborder-radius:3px;\\n\\tbackground:\\n\\tlinear-gradient(-45deg, transparent 33%, rgba(0, 0, 0, 0.2) 33%, rgba(0, 0, 0, 0.2) 66%, transparent 66%) left/2.5em 1.5em;\\n\\n}\\n\\nprogress.active:not([value])::-moz-progress-bar {\\n\\tbackground-position: 10%;\\n\\tanimation-name: stripes;\\n\\tanimation-duration: 3s;\\n\\tanimation-timing-function: linear;\\n\\tanimation-iteration-count: infinite;\\n}\\n\\nprogress.active:not([value])::-webkit-progress-bar {\\n\\tbackground-position: 10%;\\n\\tanimation-name: stripes;\\n\\tanimation-duration: 3s;\\n\\tanimation-timing-function: linear;\\n\\tanimation-iteration-count: infinite;\\n}\\n\\nprogress.primary[value]::-webkit-progress-value { background-color: #007bff; }\\nprogress.primary:not([value])::before { background-color: #007bff; }\\nprogress.primary:not([value])::-webkit-progress-bar { background-color: #007bff; }\\nprogress.primary::-moz-progress-bar { background-color: #007bff; }\\n\\nprogress.secondary[value]::-webkit-progress-value { background-color: #6c757d; }\\nprogress.secondary:not([value])::before { background-color: #6c757d; }\\nprogress.secondary:not([value])::-webkit-progress-bar { background-color: #6c757d; }\\nprogress.secondary::-moz-progress-bar { background-color: #6c757d; }\\n\\nprogress.success[value]::-webkit-progress-value { background-color: #28a745; }\\nprogress.success:not([value])::before { background-color: #28a745; }\\nprogress.success:not([value])::-webkit-progress-bar { background-color: #28a745; }\\nprogress.success::-moz-progress-bar { background-color: #28a745; }\\n\\nprogress.danger[value]::-webkit-progress-value { background-color: #dc3545; }\\nprogress.danger:not([value])::before { background-color: #dc3545; }\\nprogress.danger:not([value])::-webkit-progress-bar { background-color: #dc3545; }\\nprogress.danger::-moz-progress-bar { background-color: #dc3545; }\\n\\nprogress.warning[value]::-webkit-progress-value { background-color: #ffc107; }\\nprogress.warning:not([value])::before { background-color: #ffc107; }\\nprogress.warning:not([value])::-webkit-progress-bar { background-color: #ffc107; }\\nprogress.warning::-moz-progress-bar { background-color: #ffc107; }\\n\\nprogress.info[value]::-webkit-progress-value { background-color: #17a2b8; }\\nprogress.info:not([value])::before { background-color: #17a2b8; }\\nprogress.info:not([value])::-webkit-progress-bar { background-color: #17a2b8; }\\nprogress.info::-moz-progress-bar { background-color: #17a2b8; 
}\\n\\nprogress.light[value]::-webkit-progress-value { background-color: #f8f9fa; }\\nprogress.light:not([value])::before { background-color: #f8f9fa; }\\nprogress.light:not([value])::-webkit-progress-bar { background-color: #f8f9fa; }\\nprogress.light::-moz-progress-bar { background-color: #f8f9fa; }\\n\\nprogress.dark[value]::-webkit-progress-value { background-color: #343a40; }\\nprogress.dark:not([value])::-webkit-progress-bar { background-color: #343a40; }\\nprogress.dark:not([value])::before { background-color: #343a40; }\\nprogress.dark::-moz-progress-bar { background-color: #343a40; }\\n\\nprogress:not([value])::-webkit-progress-bar {\\n\\tborder-radius: 3px;\\n\\tbackground:\\n\\tlinear-gradient(-45deg, transparent 33%, rgba(0, 0, 0, 0.2) 33%, rgba(0, 0, 0, 0.2) 66%, transparent 66%) left/2.5em 1.5em;\\n}\\nprogress:not([value])::before {\\n\\tcontent:\\\" \\\";\\n\\tposition:absolute;\\n\\theight: 20px;\\n\\ttop:0;\\n\\tleft:0;\\n\\tright:0;\\n\\tbottom:0;\\n\\tborder-radius: 3px;\\n\\tbackground:\\n\\tlinear-gradient(-45deg, transparent 33%, rgba(0, 0, 0, 0.2) 33%, rgba(0, 0, 0, 0.2) 66%, transparent 66%) left/2.5em 1.5em;\\n}\\n\\n@keyframes stripes {\\n from {background-position: 0%}\\n to {background-position: 100%}\\n}\");\n", " },\n", " function(Bokeh) {\n", " inject_raw_css(\"table.panel-df {\\n margin-left: auto;\\n margin-right: auto;\\n border: none;\\n border-collapse: collapse;\\n border-spacing: 0;\\n color: black;\\n font-size: 12px;\\n table-layout: fixed;\\n width: 100%;\\n}\\n\\n.panel-df tr, th, td {\\n text-align: right;\\n vertical-align: middle;\\n padding: 0.5em 0.5em !important;\\n line-height: normal;\\n white-space: normal;\\n max-width: none;\\n border: none;\\n}\\n\\n.panel-df tbody {\\n display: table-row-group;\\n vertical-align: middle;\\n border-color: inherit;\\n}\\n\\n.panel-df tbody tr:nth-child(odd) {\\n background: #f5f5f5;\\n}\\n\\n.panel-df thead {\\n border-bottom: 1px solid black;\\n vertical-align: bottom;\\n}\\n\\n.panel-df tr:hover {\\n background: lightblue !important;\\n cursor: pointer;\\n}\\n\");\n", " },\n", " function(Bokeh) {\n", " inject_raw_css(\".codehilite .hll { background-color: #ffffcc }\\n.codehilite { background: #f8f8f8; }\\n.codehilite .c { color: #408080; font-style: italic } /* Comment */\\n.codehilite .err { border: 1px solid #FF0000 } /* Error */\\n.codehilite .k { color: #008000; font-weight: bold } /* Keyword */\\n.codehilite .o { color: #666666 } /* Operator */\\n.codehilite .ch { color: #408080; font-style: italic } /* Comment.Hashbang */\\n.codehilite .cm { color: #408080; font-style: italic } /* Comment.Multiline */\\n.codehilite .cp { color: #BC7A00 } /* Comment.Preproc */\\n.codehilite .cpf { color: #408080; font-style: italic } /* Comment.PreprocFile */\\n.codehilite .c1 { color: #408080; font-style: italic } /* Comment.Single */\\n.codehilite .cs { color: #408080; font-style: italic } /* Comment.Special */\\n.codehilite .gd { color: #A00000 } /* Generic.Deleted */\\n.codehilite .ge { font-style: italic } /* Generic.Emph */\\n.codehilite .gr { color: #FF0000 } /* Generic.Error */\\n.codehilite .gh { color: #000080; font-weight: bold } /* Generic.Heading */\\n.codehilite .gi { color: #00A000 } /* Generic.Inserted */\\n.codehilite .go { color: #888888 } /* Generic.Output */\\n.codehilite .gp { color: #000080; font-weight: bold } /* Generic.Prompt */\\n.codehilite .gs { font-weight: bold } /* Generic.Strong */\\n.codehilite .gu { color: #800080; font-weight: bold } /* Generic.Subheading */\\n.codehilite .gt { 
color: #0044DD } /* Generic.Traceback */\\n.codehilite .kc { color: #008000; font-weight: bold } /* Keyword.Constant */\\n.codehilite .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */\\n.codehilite .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */\\n.codehilite .kp { color: #008000 } /* Keyword.Pseudo */\\n.codehilite .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */\\n.codehilite .kt { color: #B00040 } /* Keyword.Type */\\n.codehilite .m { color: #666666 } /* Literal.Number */\\n.codehilite .s { color: #BA2121 } /* Literal.String */\\n.codehilite .na { color: #7D9029 } /* Name.Attribute */\\n.codehilite .nb { color: #008000 } /* Name.Builtin */\\n.codehilite .nc { color: #0000FF; font-weight: bold } /* Name.Class */\\n.codehilite .no { color: #880000 } /* Name.Constant */\\n.codehilite .nd { color: #AA22FF } /* Name.Decorator */\\n.codehilite .ni { color: #999999; font-weight: bold } /* Name.Entity */\\n.codehilite .ne { color: #D2413A; font-weight: bold } /* Name.Exception */\\n.codehilite .nf { color: #0000FF } /* Name.Function */\\n.codehilite .nl { color: #A0A000 } /* Name.Label */\\n.codehilite .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */\\n.codehilite .nt { color: #008000; font-weight: bold } /* Name.Tag */\\n.codehilite .nv { color: #19177C } /* Name.Variable */\\n.codehilite .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */\\n.codehilite .w { color: #bbbbbb } /* Text.Whitespace */\\n.codehilite .mb { color: #666666 } /* Literal.Number.Bin */\\n.codehilite .mf { color: #666666 } /* Literal.Number.Float */\\n.codehilite .mh { color: #666666 } /* Literal.Number.Hex */\\n.codehilite .mi { color: #666666 } /* Literal.Number.Integer */\\n.codehilite .mo { color: #666666 } /* Literal.Number.Oct */\\n.codehilite .sa { color: #BA2121 } /* Literal.String.Affix */\\n.codehilite .sb { color: #BA2121 } /* Literal.String.Backtick */\\n.codehilite .sc { color: #BA2121 } /* Literal.String.Char */\\n.codehilite .dl { color: #BA2121 } /* Literal.String.Delimiter */\\n.codehilite .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */\\n.codehilite .s2 { color: #BA2121 } /* Literal.String.Double */\\n.codehilite .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */\\n.codehilite .sh { color: #BA2121 } /* Literal.String.Heredoc */\\n.codehilite .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */\\n.codehilite .sx { color: #008000 } /* Literal.String.Other */\\n.codehilite .sr { color: #BB6688 } /* Literal.String.Regex */\\n.codehilite .s1 { color: #BA2121 } /* Literal.String.Single */\\n.codehilite .ss { color: #19177C } /* Literal.String.Symbol */\\n.codehilite .bp { color: #008000 } /* Name.Builtin.Pseudo */\\n.codehilite .fm { color: #0000FF } /* Name.Function.Magic */\\n.codehilite .vc { color: #19177C } /* Name.Variable.Class */\\n.codehilite .vg { color: #19177C } /* Name.Variable.Global */\\n.codehilite .vi { color: #19177C } /* Name.Variable.Instance */\\n.codehilite .vm { color: #19177C } /* Name.Variable.Magic */\\n.codehilite .il { color: #666666 } /* Literal.Number.Integer.Long */\\n\\n.markdown h1 { margin-block-start: 0.34em }\\n.markdown h2 { margin-block-start: 0.42em }\\n.markdown h3 { margin-block-start: 0.5em }\\n.markdown h4 { margin-block-start: 0.67em }\\n.markdown h5 { margin-block-start: 0.84em }\\n.markdown h6 { margin-block-start: 1.17em }\\n.markdown ul { padding-inline-start: 2em }\\n.markdown ol { padding-inline-start: 2em }\\n.markdown strong { 
font-weight: 600 }\\n.markdown a { color: -webkit-link }\\n.markdown a { color: -moz-hyperlinkText }\\n\");\n", " },\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " function(Bokeh) {\n", " \n", " \n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if (root.Bokeh !== undefined || force === true) {\n", " \n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i].call(root, root.Bokeh);\n", " }\n", " if (force === true) {\n", " display_loaded();\n", " }} else if (Date.now() < root._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!root._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " root._bokeh_failed_load = true;\n", " } else if (force !== true) {\n", " var cell = $(document.getElementById(\"1220\")).parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (root._bokeh_is_loading === 0) {\n", " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(css_urls, js_urls, function() {\n", " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(window));" ], "application/vnd.bokehjs_load.v0+json": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n \n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"1220\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error() {\n console.error(\"failed to load \" + url);\n }\n\n for (var i = 0; i < css_urls.length; i++) {\n var url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n const hashes = {\"https://cdn.bokeh.org/bokeh/release/bokeh-2.0.1.min.js\": \"JpP8FXbgAZLkfur7LiK3j9AGBhHNIvF742meBJrjO2ShJDhCG2I1uVvW+0DUtrmc\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.0.1.min.js\": \"xZlADit0Q04ISQEdKg2k3L4W9AwQBAuDs9nJL9fM/WwzL1tEU9VPNezOFX0nLEAz\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.0.1.min.js\": \"4BuPRZkdMKSnj3zoxiNrQ86XgNw0rYmBOxe7nshquXwwcauupgBF2DHLVG1WuZlV\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-2.0.1.min.js\": \"Dv1SQ87hmDqK6S5OhBf0bCuwAEvL5QYL0PuR/F1SPVhCS/r/abjkbpKDYL2zeM19\"};\n\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n if (url in hashes) {\n element.crossOrigin = \"anonymous\";\n element.integrity = \"sha384-\" + hashes[url];\n }\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };var element = document.getElementById(\"1220\");\n if (element == null) {\n console.error(\"Bokeh: ERROR: autoload.js configured with elementid '1220' but no matching script tag was found. 
\")\n return false;\n }\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n \n var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-2.0.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.0.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.0.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-2.0.1.min.js\", \"https://unpkg.com/@holoviz/panel@^0.9.5/dist/panel.min.js\"];\n var css_urls = [];\n \n\n var inline_js = [\n function(Bokeh) {\n inject_raw_css(\".json-formatter-row {\\n font-family: monospace;\\n}\\n.json-formatter-row,\\n.json-formatter-row a,\\n.json-formatter-row a:hover {\\n color: black;\\n text-decoration: none;\\n}\\n.json-formatter-row .json-formatter-row {\\n margin-left: 1rem;\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty {\\n opacity: 0.5;\\n margin-left: 1rem;\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty:after {\\n display: none;\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-object:after {\\n content: \\\"No properties\\\";\\n}\\n.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-array:after {\\n content: \\\"[]\\\";\\n}\\n.json-formatter-row .json-formatter-string,\\n.json-formatter-row .json-formatter-stringifiable {\\n color: green;\\n white-space: pre;\\n word-wrap: break-word;\\n}\\n.json-formatter-row .json-formatter-number {\\n color: blue;\\n}\\n.json-formatter-row .json-formatter-boolean {\\n color: red;\\n}\\n.json-formatter-row .json-formatter-null {\\n color: #855A00;\\n}\\n.json-formatter-row .json-formatter-undefined {\\n color: #ca0b69;\\n}\\n.json-formatter-row .json-formatter-function {\\n color: #FF20ED;\\n}\\n.json-formatter-row .json-formatter-date {\\n background-color: rgba(0, 0, 0, 0.05);\\n}\\n.json-formatter-row .json-formatter-url {\\n text-decoration: underline;\\n color: blue;\\n cursor: pointer;\\n}\\n.json-formatter-row .json-formatter-bracket {\\n color: blue;\\n}\\n.json-formatter-row .json-formatter-key {\\n color: #00008B;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-row .json-formatter-toggler-link {\\n cursor: pointer;\\n}\\n.json-formatter-row .json-formatter-toggler {\\n line-height: 1.2rem;\\n font-size: 0.7rem;\\n vertical-align: middle;\\n opacity: 0.6;\\n cursor: pointer;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-row .json-formatter-toggler:after {\\n display: inline-block;\\n transition: transform 100ms ease-in;\\n content: \\\"\\\\25BA\\\";\\n}\\n.json-formatter-row > a > .json-formatter-preview-text {\\n opacity: 0;\\n transition: opacity 0.15s ease-in;\\n font-style: italic;\\n}\\n.json-formatter-row:hover > a > .json-formatter-preview-text {\\n opacity: 0.6;\\n}\\n.json-formatter-row.json-formatter-open > .json-formatter-toggler-link .json-formatter-toggler:after {\\n transform: rotate(90deg);\\n}\\n.json-formatter-row.json-formatter-open > .json-formatter-children:after {\\n display: inline-block;\\n}\\n.json-formatter-row.json-formatter-open > a > .json-formatter-preview-text {\\n display: none;\\n}\\n.json-formatter-row.json-formatter-open.json-formatter-empty:after {\\n display: block;\\n}\\n.json-formatter-dark.json-formatter-row {\\n font-family: monospace;\\n}\\n.json-formatter-dark.json-formatter-row,\\n.json-formatter-dark.json-formatter-row a,\\n.json-formatter-dark.json-formatter-row a:hover {\\n color: 
white;\\n text-decoration: none;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-row {\\n margin-left: 1rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty {\\n opacity: 0.5;\\n margin-left: 1rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty:after {\\n display: none;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-object:after {\\n content: \\\"No properties\\\";\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-children.json-formatter-empty.json-formatter-array:after {\\n content: \\\"[]\\\";\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-string,\\n.json-formatter-dark.json-formatter-row .json-formatter-stringifiable {\\n color: #31F031;\\n white-space: pre;\\n word-wrap: break-word;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-number {\\n color: #66C2FF;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-boolean {\\n color: #EC4242;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-null {\\n color: #EEC97D;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-undefined {\\n color: #ef8fbe;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-function {\\n color: #FD48CB;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-date {\\n background-color: rgba(255, 255, 255, 0.05);\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-url {\\n text-decoration: underline;\\n color: #027BFF;\\n cursor: pointer;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-bracket {\\n color: #9494FF;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-key {\\n color: #23A0DB;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-toggler-link {\\n cursor: pointer;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-toggler {\\n line-height: 1.2rem;\\n font-size: 0.7rem;\\n vertical-align: middle;\\n opacity: 0.6;\\n cursor: pointer;\\n padding-right: 0.2rem;\\n}\\n.json-formatter-dark.json-formatter-row .json-formatter-toggler:after {\\n display: inline-block;\\n transition: transform 100ms ease-in;\\n content: \\\"\\\\25BA\\\";\\n}\\n.json-formatter-dark.json-formatter-row > a > .json-formatter-preview-text {\\n opacity: 0;\\n transition: opacity 0.15s ease-in;\\n font-style: italic;\\n}\\n.json-formatter-dark.json-formatter-row:hover > a > .json-formatter-preview-text {\\n opacity: 0.6;\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open > .json-formatter-toggler-link .json-formatter-toggler:after {\\n transform: rotate(90deg);\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open > .json-formatter-children:after {\\n display: inline-block;\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open > a > .json-formatter-preview-text {\\n display: none;\\n}\\n.json-formatter-dark.json-formatter-row.json-formatter-open.json-formatter-empty:after {\\n display: block;\\n}\\n\");\n },\n function(Bokeh) {\n inject_raw_css(\".widget-box {\\n\\tmin-height: 20px;\\n\\tbackground-color: #f5f5f5;\\n\\tborder: 1px solid #e3e3e3 !important;\\n\\tborder-radius: 4px;\\n\\t-webkit-box-shadow: inset 0 1px 1px rgba(0,0,0,.05);\\n\\tbox-shadow: inset 0 1px 1px rgba(0,0,0,.05);\\n\\toverflow-x: hidden;\\n\\toverflow-y: hidden;\\n}\\n\\n.scrollable {\\n overflow: scroll;\\n}\\n\\nprogress {\\n\\tappearance: none;\\n\\t-moz-appearance: none;\\n\\t-webkit-appearance: 
none;\\n\\n\\tborder: none;\\n\\theight: 20px;\\n\\tbackground-color: whiteSmoke;\\n\\tborder-radius: 3px;\\n\\tbox-shadow: 0 2px 3px rgba(0,0,0,.5) inset;\\n\\tcolor: royalblue;\\n\\tposition: relative;\\n\\tmargin: 0 0 1.5em;\\n}\\n\\nprogress[value]::-webkit-progress-bar {\\n\\tbackground-color: whiteSmoke;\\n\\tborder-radius: 3px;\\n\\tbox-shadow: 0 2px 3px rgba(0,0,0,.5) inset;\\n}\\n\\nprogress[value]::-webkit-progress-value {\\n\\tposition: relative;\\n\\n\\tbackground-size: 35px 20px, 100% 100%, 100% 100%;\\n\\tborder-radius:3px;\\n}\\n\\nprogress.active:not([value])::before {\\n\\tbackground-position: 10%;\\n\\tanimation-name: stripes;\\n\\tanimation-duration: 3s;\\n\\tanimation-timing-function: linear;\\n\\tanimation-iteration-count: infinite;\\n}\\n\\nprogress[value]::-moz-progress-bar {\\n\\tbackground-size: 35px 20px, 100% 100%, 100% 100%;\\n\\tborder-radius:3px;\\n}\\n\\nprogress:not([value])::-moz-progress-bar {\\n\\tborder-radius:3px;\\n\\tbackground:\\n\\tlinear-gradient(-45deg, transparent 33%, rgba(0, 0, 0, 0.2) 33%, rgba(0, 0, 0, 0.2) 66%, transparent 66%) left/2.5em 1.5em;\\n\\n}\\n\\nprogress.active:not([value])::-moz-progress-bar {\\n\\tbackground-position: 10%;\\n\\tanimation-name: stripes;\\n\\tanimation-duration: 3s;\\n\\tanimation-timing-function: linear;\\n\\tanimation-iteration-count: infinite;\\n}\\n\\nprogress.active:not([value])::-webkit-progress-bar {\\n\\tbackground-position: 10%;\\n\\tanimation-name: stripes;\\n\\tanimation-duration: 3s;\\n\\tanimation-timing-function: linear;\\n\\tanimation-iteration-count: infinite;\\n}\\n\\nprogress.primary[value]::-webkit-progress-value { background-color: #007bff; }\\nprogress.primary:not([value])::before { background-color: #007bff; }\\nprogress.primary:not([value])::-webkit-progress-bar { background-color: #007bff; }\\nprogress.primary::-moz-progress-bar { background-color: #007bff; }\\n\\nprogress.secondary[value]::-webkit-progress-value { background-color: #6c757d; }\\nprogress.secondary:not([value])::before { background-color: #6c757d; }\\nprogress.secondary:not([value])::-webkit-progress-bar { background-color: #6c757d; }\\nprogress.secondary::-moz-progress-bar { background-color: #6c757d; }\\n\\nprogress.success[value]::-webkit-progress-value { background-color: #28a745; }\\nprogress.success:not([value])::before { background-color: #28a745; }\\nprogress.success:not([value])::-webkit-progress-bar { background-color: #28a745; }\\nprogress.success::-moz-progress-bar { background-color: #28a745; }\\n\\nprogress.danger[value]::-webkit-progress-value { background-color: #dc3545; }\\nprogress.danger:not([value])::before { background-color: #dc3545; }\\nprogress.danger:not([value])::-webkit-progress-bar { background-color: #dc3545; }\\nprogress.danger::-moz-progress-bar { background-color: #dc3545; }\\n\\nprogress.warning[value]::-webkit-progress-value { background-color: #ffc107; }\\nprogress.warning:not([value])::before { background-color: #ffc107; }\\nprogress.warning:not([value])::-webkit-progress-bar { background-color: #ffc107; }\\nprogress.warning::-moz-progress-bar { background-color: #ffc107; }\\n\\nprogress.info[value]::-webkit-progress-value { background-color: #17a2b8; }\\nprogress.info:not([value])::before { background-color: #17a2b8; }\\nprogress.info:not([value])::-webkit-progress-bar { background-color: #17a2b8; }\\nprogress.info::-moz-progress-bar { background-color: #17a2b8; }\\n\\nprogress.light[value]::-webkit-progress-value { background-color: #f8f9fa; }\\nprogress.light:not([value])::before { 
background-color: #f8f9fa; }\\nprogress.light:not([value])::-webkit-progress-bar { background-color: #f8f9fa; }\\nprogress.light::-moz-progress-bar { background-color: #f8f9fa; }\\n\\nprogress.dark[value]::-webkit-progress-value { background-color: #343a40; }\\nprogress.dark:not([value])::-webkit-progress-bar { background-color: #343a40; }\\nprogress.dark:not([value])::before { background-color: #343a40; }\\nprogress.dark::-moz-progress-bar { background-color: #343a40; }\\n\\nprogress:not([value])::-webkit-progress-bar {\\n\\tborder-radius: 3px;\\n\\tbackground:\\n\\tlinear-gradient(-45deg, transparent 33%, rgba(0, 0, 0, 0.2) 33%, rgba(0, 0, 0, 0.2) 66%, transparent 66%) left/2.5em 1.5em;\\n}\\nprogress:not([value])::before {\\n\\tcontent:\\\" \\\";\\n\\tposition:absolute;\\n\\theight: 20px;\\n\\ttop:0;\\n\\tleft:0;\\n\\tright:0;\\n\\tbottom:0;\\n\\tborder-radius: 3px;\\n\\tbackground:\\n\\tlinear-gradient(-45deg, transparent 33%, rgba(0, 0, 0, 0.2) 33%, rgba(0, 0, 0, 0.2) 66%, transparent 66%) left/2.5em 1.5em;\\n}\\n\\n@keyframes stripes {\\n from {background-position: 0%}\\n to {background-position: 100%}\\n}\");\n },\n function(Bokeh) {\n inject_raw_css(\"table.panel-df {\\n margin-left: auto;\\n margin-right: auto;\\n border: none;\\n border-collapse: collapse;\\n border-spacing: 0;\\n color: black;\\n font-size: 12px;\\n table-layout: fixed;\\n width: 100%;\\n}\\n\\n.panel-df tr, th, td {\\n text-align: right;\\n vertical-align: middle;\\n padding: 0.5em 0.5em !important;\\n line-height: normal;\\n white-space: normal;\\n max-width: none;\\n border: none;\\n}\\n\\n.panel-df tbody {\\n display: table-row-group;\\n vertical-align: middle;\\n border-color: inherit;\\n}\\n\\n.panel-df tbody tr:nth-child(odd) {\\n background: #f5f5f5;\\n}\\n\\n.panel-df thead {\\n border-bottom: 1px solid black;\\n vertical-align: bottom;\\n}\\n\\n.panel-df tr:hover {\\n background: lightblue !important;\\n cursor: pointer;\\n}\\n\");\n },\n function(Bokeh) {\n inject_raw_css(\".codehilite .hll { background-color: #ffffcc }\\n.codehilite { background: #f8f8f8; }\\n.codehilite .c { color: #408080; font-style: italic } /* Comment */\\n.codehilite .err { border: 1px solid #FF0000 } /* Error */\\n.codehilite .k { color: #008000; font-weight: bold } /* Keyword */\\n.codehilite .o { color: #666666 } /* Operator */\\n.codehilite .ch { color: #408080; font-style: italic } /* Comment.Hashbang */\\n.codehilite .cm { color: #408080; font-style: italic } /* Comment.Multiline */\\n.codehilite .cp { color: #BC7A00 } /* Comment.Preproc */\\n.codehilite .cpf { color: #408080; font-style: italic } /* Comment.PreprocFile */\\n.codehilite .c1 { color: #408080; font-style: italic } /* Comment.Single */\\n.codehilite .cs { color: #408080; font-style: italic } /* Comment.Special */\\n.codehilite .gd { color: #A00000 } /* Generic.Deleted */\\n.codehilite .ge { font-style: italic } /* Generic.Emph */\\n.codehilite .gr { color: #FF0000 } /* Generic.Error */\\n.codehilite .gh { color: #000080; font-weight: bold } /* Generic.Heading */\\n.codehilite .gi { color: #00A000 } /* Generic.Inserted */\\n.codehilite .go { color: #888888 } /* Generic.Output */\\n.codehilite .gp { color: #000080; font-weight: bold } /* Generic.Prompt */\\n.codehilite .gs { font-weight: bold } /* Generic.Strong */\\n.codehilite .gu { color: #800080; font-weight: bold } /* Generic.Subheading */\\n.codehilite .gt { color: #0044DD } /* Generic.Traceback */\\n.codehilite .kc { color: #008000; font-weight: bold } /* Keyword.Constant */\\n.codehilite .kd { color: 
#008000; font-weight: bold } /* Keyword.Declaration */\\n.codehilite .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */\\n.codehilite .kp { color: #008000 } /* Keyword.Pseudo */\\n.codehilite .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */\\n.codehilite .kt { color: #B00040 } /* Keyword.Type */\\n.codehilite .m { color: #666666 } /* Literal.Number */\\n.codehilite .s { color: #BA2121 } /* Literal.String */\\n.codehilite .na { color: #7D9029 } /* Name.Attribute */\\n.codehilite .nb { color: #008000 } /* Name.Builtin */\\n.codehilite .nc { color: #0000FF; font-weight: bold } /* Name.Class */\\n.codehilite .no { color: #880000 } /* Name.Constant */\\n.codehilite .nd { color: #AA22FF } /* Name.Decorator */\\n.codehilite .ni { color: #999999; font-weight: bold } /* Name.Entity */\\n.codehilite .ne { color: #D2413A; font-weight: bold } /* Name.Exception */\\n.codehilite .nf { color: #0000FF } /* Name.Function */\\n.codehilite .nl { color: #A0A000 } /* Name.Label */\\n.codehilite .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */\\n.codehilite .nt { color: #008000; font-weight: bold } /* Name.Tag */\\n.codehilite .nv { color: #19177C } /* Name.Variable */\\n.codehilite .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */\\n.codehilite .w { color: #bbbbbb } /* Text.Whitespace */\\n.codehilite .mb { color: #666666 } /* Literal.Number.Bin */\\n.codehilite .mf { color: #666666 } /* Literal.Number.Float */\\n.codehilite .mh { color: #666666 } /* Literal.Number.Hex */\\n.codehilite .mi { color: #666666 } /* Literal.Number.Integer */\\n.codehilite .mo { color: #666666 } /* Literal.Number.Oct */\\n.codehilite .sa { color: #BA2121 } /* Literal.String.Affix */\\n.codehilite .sb { color: #BA2121 } /* Literal.String.Backtick */\\n.codehilite .sc { color: #BA2121 } /* Literal.String.Char */\\n.codehilite .dl { color: #BA2121 } /* Literal.String.Delimiter */\\n.codehilite .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */\\n.codehilite .s2 { color: #BA2121 } /* Literal.String.Double */\\n.codehilite .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */\\n.codehilite .sh { color: #BA2121 } /* Literal.String.Heredoc */\\n.codehilite .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */\\n.codehilite .sx { color: #008000 } /* Literal.String.Other */\\n.codehilite .sr { color: #BB6688 } /* Literal.String.Regex */\\n.codehilite .s1 { color: #BA2121 } /* Literal.String.Single */\\n.codehilite .ss { color: #19177C } /* Literal.String.Symbol */\\n.codehilite .bp { color: #008000 } /* Name.Builtin.Pseudo */\\n.codehilite .fm { color: #0000FF } /* Name.Function.Magic */\\n.codehilite .vc { color: #19177C } /* Name.Variable.Class */\\n.codehilite .vg { color: #19177C } /* Name.Variable.Global */\\n.codehilite .vi { color: #19177C } /* Name.Variable.Instance */\\n.codehilite .vm { color: #19177C } /* Name.Variable.Magic */\\n.codehilite .il { color: #666666 } /* Literal.Number.Integer.Long */\\n\\n.markdown h1 { margin-block-start: 0.34em }\\n.markdown h2 { margin-block-start: 0.42em }\\n.markdown h3 { margin-block-start: 0.5em }\\n.markdown h4 { margin-block-start: 0.67em }\\n.markdown h5 { margin-block-start: 0.84em }\\n.markdown h6 { margin-block-start: 1.17em }\\n.markdown ul { padding-inline-start: 2em }\\n.markdown ol { padding-inline-start: 2em }\\n.markdown strong { font-weight: 600 }\\n.markdown a { color: -webkit-link }\\n.markdown a { color: -moz-hyperlinkText }\\n\");\n },\n function(Bokeh) {\n 
Bokeh.set_log_level(\"info\");\n },\n function(Bokeh) {\n \n \n }\n ];\n\n function run_inline_js() {\n \n if (root.Bokeh !== undefined || force === true) {\n \n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"1220\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import re\n", "import pickle\n", "import sklearn\n", "import pandas as pd\n", "import numpy as np\n", "import holoviews as hv\n", "import hvplot\n", "import nltk \n", "from bokeh.io import output_notebook\n", "output_notebook()\n", "\n", "from hvplot import pandas\n", "from pathlib import Path\n", "\n", "#hv.extension(\"bokeh\")\n", "\n", "pd.options.display.max_columns = 100\n", "pd.options.display.max_rows = 300\n", "pd.options.display.max_colwidth = 100\n", "np.set_printoptions(threshold=2000)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import LinearSVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import GridSearchCV\n", "from xgboost import XGBClassifier\n", "from nltk.stem.snowball import SnowballStemmer\n", "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For dealing with text related tasks, we will be using [nltk](https://www.nltk.org). The terrific [scikit-learn](https://scikit-learn.org/) library will be used to handle tasks related to machine learning. \n", "Now, we can take a peek into the data:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "hide" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
commentrating
0Ich bin franzose und bin seit ein paar Wochen in muenchen. Ich hatte Zahn Schmerzen und mein Kol...2.0
1Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist,er ist unfreundlich ,se...6.0
2Hatte akute Beschwerden am Rücken. Herr Magura war der erste Arzt der sich wirklich Zeit für ein...1.0
\n", "
" ], "text/plain": [ " comment \\\n", "0 Ich bin franzose und bin seit ein paar Wochen in muenchen. Ich hatte Zahn Schmerzen und mein Kol... \n", "1 Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist,er ist unfreundlich ,se... \n", "2 Hatte akute Beschwerden am Rücken. Herr Magura war der erste Arzt der sich wirklich Zeit für ein... \n", "\n", " rating \n", "0 2.0 \n", "1 6.0 \n", "2 1.0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read raw data\n", "FILE_REVIEWS = Path(CURR_PATH[0]) / \"german_doctor_reviews.csv\"\n", "data = pd.read_csv(FILE_REVIEWS, sep=\",\", na_values=[\"\"])\n", "data.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll focus on the comments and the ratings, which range from one to six. The comments are mostly written in proper German using punctuation and don't include emojis. However, as with any real life text data there will be slang, grammatical mistakes, misspellings etc. Also, in some places we find html tags like `
`. Nonetheless, compared to data from Facebook or Twitter, this is mostly harmless and probably won't pose too many issues for us. We will deal with all of that in the next step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning and pre processing\n", "\n", "Having consistent and clean data is fundamental for good modeling results. No matter how sophisticated your model is, the basic principle remains: trash in, trash out. When dealing with NLP, the appropriate cleaning and pre processing steps differ depending on which model you intend to use. We will use frequency based representation methods for our text. Thus, we usually want a pretty thorough manipulation of the input data:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/mc/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to /home/mc/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n" ] } ], "source": [ "nltk.download('punkt')\n", "nltk.download('stopwords')\n", "\n", "stemmer = SnowballStemmer(\"german\")\n", "stop_words = set(stopwords.words(\"german\"))\n", "\n", "\n", "def clean_text(text, for_embedding=False):\n", "    \"\"\"\n", "    - remove any html tags (<br/> is often found)\n", "    - keep only ASCII + European chars and whitespace, no digits\n", "    - remove single letter chars\n", "    - convert all whitespaces (tabs etc.) to a single whitespace\n", "    if not for embedding (but e.g. tf-idf):\n", "    - all lowercase\n", "    - remove stopwords and punctuation, stem words\n", "    \"\"\"\n", "    RE_WSPACE = re.compile(r\"\\s+\", re.IGNORECASE)\n", "    RE_TAGS = re.compile(r\"<[^>]+>\")\n", "    RE_ASCII = re.compile(r\"[^A-Za-zÀ-ž ]\", re.IGNORECASE)\n", "    RE_SINGLECHAR = re.compile(r\"\\b[A-Za-zÀ-ž]\\b\", re.IGNORECASE)\n", "    if for_embedding:\n", "        # Keep punctuation\n", "        RE_ASCII = re.compile(r\"[^A-Za-zÀ-ž,.!? ]\", re.IGNORECASE)\n", "        RE_SINGLECHAR = re.compile(r\"\\b[A-Za-zÀ-ž,.!?]\\b\", re.IGNORECASE)\n", "\n", "    text = re.sub(RE_TAGS, \" \", text)\n", "    text = re.sub(RE_ASCII, \" \", text)\n", "    text = re.sub(RE_SINGLECHAR, \" \", text)\n", "    text = re.sub(RE_WSPACE, \" \", text)\n", "\n", "    word_tokens = word_tokenize(text)\n", "    words_tokens_lower = [word.lower() for word in word_tokens]\n", "\n", "    if for_embedding:\n", "        # keep original tokens: no stemming, lowercasing or stop word removal\n", "        words_filtered = word_tokens\n", "    else:\n", "        words_filtered = [\n", "            stemmer.stem(word) for word in words_tokens_lower if word not in stop_words\n", "        ]\n", "\n", "    text_clean = \" \".join(words_filtered)\n", "    return text_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `clean_text` function takes a string input and applies a bunch of manipulations to it (described in the code). Check out this example:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'python best programmiersprach welt'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_text(\"Python ist die beste Programmiersprache der Welt.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This transformation has a few benefits. Removing characters and words that don't hold much meaning reduces the size of our data. Moreover, it can improve prediction performance when modeling by lowering the noise in the data. This is because, for example, 
stop words like prepositions or punctuation won't allow our model to extract additional information / meaning (at least when using simple models). By stemming and lower casing words, we make sure that similar words are treated identically. Thus, we can improve model performance again by increasing the number of relevant data points. \n", "Let's apply this to our data:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3min 18s, sys: 148 ms, total: 3min 18s\n", "Wall time: 3min 18s\n" ] } ], "source": [ "%%time\n", "# Clean comments, keeping only those longer than 20 characters\n", "data[\"comment_clean\"] = data.loc[data[\"comment\"].str.len() > 20, \"comment\"]\n", "data[\"comment_clean\"] = data[\"comment_clean\"].map(\n", "    lambda x: clean_text(x, for_embedding=False) if isinstance(x, str) else x\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our classification task, we want to be able to recognize whether a comment has a positive or negative sentiment. We make use of the ratings that come along with the comments. Naturally, we assume that good ratings (1-2) convey a positive message while bad ratings (5-6) convey a negative one. We exclude neutral ratings, so that our task becomes a binary classification. Keeping them would turn the task into a multi-class classification problem, requiring a slightly different modeling approach." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Create binary grade, class 1-2 or 5-6 = good or bad\n", "data[\"grade_bad\"] = 0\n", "data.loc[data[\"rating\"] >= 3, \"grade_bad\"] = np.NaN\n", "data.loc[data[\"rating\"] >= 5, \"grade_bad\"] = 1\n", "\n", "# Drop rows where any of these columns is empty or missing\n", "data = data[(data[\"comment_clean\"] != \"\") & (data[\"comment_clean\"] != \"null\")]\n", "\n", "data = data.dropna(\n", "    axis=\"index\", subset=[\"grade_bad\", \"comment\", \"comment_clean\"]\n", ").reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "tags": [ "hide" ] }, "outputs": [], "source": [ "# data.to_csv(\"../../data/processed/comments_clean.csv\", index=False)\n", "data_clean = data.copy()\n", "# data_clean = pd.read_csv(\"../../data/processed/comments_clean.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These steps conclude the cleaning and pre processing. As a result, we get this:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
commentratingcomment_cleangrade_bad
0Ich bin franzose und bin seit ein paar Wochen in muenchen. Ich hatte Zahn Schmerzen und mein Kol...2.0franzos seit paar woch muench zahn schmerz kollegu dr mainka empfohl schnell termin bekomm team ...0.0
1Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist,er ist unfreundlich ,se...6.0arzt unmog leb je begegnet unfreund herablass medizin unkompetent diagnos hautarzt gegang ordent...1.0
2Hatte akute Beschwerden am Rücken. Herr Magura war der erste Arzt der sich wirklich Zeit für ein...1.0akut beschwerd ruck herr magura erst arzt wirklich zeit therapieplan genomm nachhalt schmerz beseit0.0
\n", "
" ], "text/plain": [ " comment \\\n", "0 Ich bin franzose und bin seit ein paar Wochen in muenchen. Ich hatte Zahn Schmerzen und mein Kol... \n", "1 Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist,er ist unfreundlich ,se... \n", "2 Hatte akute Beschwerden am Rücken. Herr Magura war der erste Arzt der sich wirklich Zeit für ein... \n", "\n", " rating \\\n", "0 2.0 \n", "1 6.0 \n", "2 1.0 \n", "\n", " comment_clean \\\n", "0 franzos seit paar woch muench zahn schmerz kollegu dr mainka empfohl schnell termin bekomm team ... \n", "1 arzt unmog leb je begegnet unfreund herablass medizin unkompetent diagnos hautarzt gegang ordent... \n", "2 akut beschwerd ruck herr magura erst arzt wirklich zeit therapieplan genomm nachhalt schmerz beseit \n", "\n", " grade_bad \n", "0 0.0 \n", "1 1.0 \n", "2 0.0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_clean.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cleaned comments are much more concise because their original sentence structure and their words have been altered severely. Although the meaning can still be grasped, humans will probably have a harder time understanding these sentences. In contrast, many classification approaches greatly benefit from this simplification. Some reasons for that have been mentioned above. Basically, it boils down to the fact that we try to keep only informative pieces of text that aid the **specific model** we intend to apply to the task. In this case, we will use simple models that benefit from less complexity. However, more advanced models are able to extract information from more complex features. Thus, they can perform better with less simplification of the input texts. We'll take a look at that in an upcoming post." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Descriptive analysis\n", "\n", "Even though we deal with texts, we should still use some descriptive analysis to get a better understanding of the data:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "" ], "text/plain": [ ":Bars [index] (Word frequency of most common words in comments)" ] }, "execution_count": 22, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "1002" } }, "output_type": "execute_result" } ], "source": [ "from bokeh.models import NumeralTickFormatter\n", "# Word Frequency of most common words\n", "word_freq = pd.Series(\" \".join(data_clean[\"comment_clean\"]).split()).value_counts()\n", "word_freq[1:40].rename(\"Word frequency of most common words in comments\").hvplot.bar(\n", " rot=45\n", ").opts(width=700, height=400, yformatter=NumeralTickFormatter(format=\"0,0\"))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "" ], "text/plain": [ ":Table [index,freq]" ] }, "execution_count": 23, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "1094" } }, "output_type": "execute_result" } ], "source": [ "# list most uncommon words\n", "word_freq[-10:].reset_index(name=\"freq\").hvplot.table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the most frequent words, we can identify additional candidates for our stop word list in the pre-processing step. For example \"doctor\" (arzt) and \"miss\" (frau) are very common but probably won't help our algorithm to differentiate between sentiments. In contrast, words like \"good\" (gut) and \"competent\" (kompetent) are not only frequent but also carry a strong sentiment. They will be crucial for the performance of our model. We also observe many uncommon words that are hardly used. Often, these will be misspellings or very uncommon words. Such sparse data will not be useful for our model, as it won't have enough observations to learn any associations. We'll come back to this in the modeling phase making use of our models ability to deal with such issues. \n", "Finally, we should not omit a look at the distribution of our target variable, i.e. the ratings. Highly skewed distributions are common. In some more extreme cases, that might even require adapting the modeling approach." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:param.BarPlot01791: title_format is deprecated. Please use title instead\n", "WARNING:param.BarPlot01791: title_format is deprecated. Please use title instead\n" ] }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "" ], "text/plain": [ ":Bars [index] (grade_bad)" ] }, "execution_count": 24, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "1124" } }, "output_type": "execute_result" } ], "source": [ "# Distribution of ratings\n", "data_clean[\"grade_bad\"].value_counts(normalize=True).sort_index().hvplot.bar(\n", " title=\"Distribution of ratings\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have pretty unbalanced classes. In some cases, this can have a negative effect on model results. It is obvious, that models will often have a harder time predicting the minority class. There are methods to deal with that, i.e. over- and under-sampling, weighing of classes and more. This won't be necessary in our case as even the minority class has lots of observations. We will see shortly, that our model performance will not be severely impacted by the imbalance. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature creation with TF-IDF\n", "\n", "Because classification models cannot deal with text data directly, we need to convert our comments to a numeric representation. As mentioned before, there are several ways to achieve this. [This article](https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7) provides a concise overview. All methods have in common that they assign each unique word in a document a unique number. A vector of numbers is created in which each element represents a word. Logically, the length of the vector will equal the number of unique words. In the simplest form (bag of words), a sentence can be represented by such a vector by indicating the presence of a word using a 1 in the appropriate index representing the word. All elements standing for words not included in the sentence will be 0. \n", "Frequency methods improve on this very basic approach. For many applications, `TF-IDF` (term frequency, inverse document frequency) is a good choice. In our case, the `TF` part summarizes how often a word appears in a comment in relation to all words. As was mentioned earlier, that is not always a sufficient indicator for a useful word as it might be overly general or be used inflationary in many comments. This is where the `IDF` part comes into play. It downscales words that are prevalent in many other comments. Consequently, words that are frequent in a comment and also specific to it (i.e. they are uncommon in other comments) will get a high weight. Unspecific words or those with a low overall frequency will get a low weight. \n", "This is how we apply `TF-IDF` to our comments using `scikit-learn`: " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8',\n", " input='content', lowercase=True, max_df=0.3, max_features=None,\n", " min_df=10, ngram_range=(1, 2), norm='l2', preprocessor=None,\n", " smooth_idf=True, stop_words=None, strip_accents=None,\n", " sublinear_tf=False, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, use_idf=True, vocabulary=None)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Compute unique word vector with frequencies\n", "exclude very uncommon (<10 obsv.) 
and common (>=30%) words\n", "use single words and pairs of two words (ngrams)\n", "\"\"\"\n", "vectorizer = TfidfVectorizer(\n", "    analyzer=\"word\", max_df=0.3, min_df=10, ngram_range=(1, 2), norm=\"l2\"\n", ")\n", "vectorizer.fit(data_clean[\"comment_clean\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important parameter that needs explanation is the `ngram_range`. An `ngram` of one means that you look at each word separately. An `ngram` of two (or `bigram`) means that each pair of adjacent words is also used as a token. Thus, some context is added. This is helpful because a model can then learn that \"good\" and \"not good\" are different. In our case, in addition to using each word by itself we also add `bigrams` to make use of context. Let's see some of the created `ngrams` and their indices:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unique word (ngram) vector extract:\n", "\n", " abzusetz                      925\n", "rekonstruktion              87321\n", "angeb                        2236\n", "dr schier                   25838\n", "ganzheit herangehensweis    41621\n", "dtype: int64\n" ] } ], "source": [ "# Vector representation of vocabulary\n", "word_vector = pd.Series(vectorizer.vocabulary_).sample(5, random_state=1)\n", "print(f\"Unique word (ngram) vector extract:\\n\\n {word_vector}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This creates a numeric representation of the `ngrams` in our corpus. We see that the vectorizer also uses `bigrams` in addition to single words. The word \"abzusetz\" is represented by the number `925`, while the number `25838` stands for the `bigram` \"dr schier\" and so on. \n", "This is only the first part of our text-to-numeric process. Before we can move on to transform each sentence to a vector of `TF-IDF` values, we need to prepare the data for the modeling part first." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modeling\n", "\n", "To test the classification performance of our models, we evaluate their predictions on held-out data. For that, we split our data into a training and a testing set. The former is used to train the models and the latter to evaluate their predictions:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(253776,)\n", "(84593,)\n" ] } ], "source": [ "# Sample data - 25% of data to test set\n", "train, test = train_test_split(data_clean, random_state=1, test_size=0.25, shuffle=True)\n", "\n", "X_train = train[\"comment_clean\"]\n", "Y_train = train[\"grade_bad\"]\n", "X_test = test[\"comment_clean\"]\n", "Y_test = test[\"grade_bad\"]\n", "print(X_train.shape)\n", "print(X_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training set consists of more than 253k rows and the testing will be performed on more than 84k observations. 
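\n", "As a small aside: with classes as unbalanced as ours, one could also stratify the split so that the train and test sets keep the same class proportions. A minimal sketch, assuming the `data_clean` frame from the cleaning step and using the `stratify` argument of `train_test_split`:\n", "\n", "```python\n", "# Stratified variant of the split above: preserve the good/bad ratio in both sets\n", "train, test = train_test_split(\n", "    data_clean, random_state=1, test_size=0.25, shuffle=True,\n", "    stratify=data_clean[\"grade_bad\"],\n", ")\n", "```\n", "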
Now that we have split our data, we can transform the text into its `TF-IDF` representation:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(253776, 122618)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transform each sentence to a numeric vector with tf-idf values as elements\n", "X_train_vec = vectorizer.transform(X_train)\n", "X_test_vec = vectorizer.transform(X_test)\n", "X_train_vec.get_shape()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that each of our sentences is now represented by a vector of length 122618. It might be interesting to compare the text and its numeric representation:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original sentence:\n", "['seid jahr dr schmall zufried']\n", "\n", "Vector representation of sentence:\n", "       jahr   jahr dr      seid  seid jahr   zufried\n", "0  0.236397   0.46459  0.534599   0.616833  0.248985\n" ] } ], "source": [ "# Compare original comment text with its numeric vector representation\n", "print(f\"Original sentence:\\n{X_train[3:4].values}\\n\")\n", "# Feature Matrix\n", "features = pd.DataFrame(\n", "    X_train_vec[3:4].toarray(), columns=vectorizer.get_feature_names()\n", ")\n", "nonempty_feat = features.loc[:, (features != 0).any(axis=0)]\n", "print(f\"Vector representation of sentence:\\n {nonempty_feat}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this five-word sentence, the vector of length 122618 contains mostly zeros. However, the indices representing the used words / ngrams are non-empty. They hold the values that `TF-IDF` assigned to them. In this particular case, \"seid jahr\" (since year) has the largest weight, meaning that it is relatively frequent in our sentence while not being very common in other sentences of our dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have prepared our features, we can start to train and evaluate models. For a binary classification task there are many options to [choose from in `scikit-learn`](https://scikit-learn.org/stable/supervised_learning.html). We will focus on the ones that are most promising. In my experience, they are: Logistic Regression, Support Vector Classification (SVC), Ensemble Methods (Boosting, Random Forest) and Neural Networks (i.e. Multi Layer Perceptron or MLP in sklearn). 
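\n", "Before comparing them, a trivial baseline helps put the scores in context. A minimal sketch using scikit-learn's `DummyClassifier`, assuming the vectorized data from above:\n", "\n", "```python\n", "from sklearn.dummy import DummyClassifier\n", "\n", "# Always predicts the majority class; any useful model must beat this\n", "dummy = DummyClassifier(strategy=\"most_frequent\")\n", "dummy.fit(X_train_vec, Y_train)\n", "print(sklearn.metrics.f1_score(Y_test, dummy.predict(X_test_vec), average=\"macro\"))\n", "```\n", "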
We will compare these models and choose the most promising one:" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classifiers to test: ['LogisticRegression', 'LinearSVC', 'RandomForestClassifier', 'XGBClassifier', 'MLPClassifier']\n" ] } ], "source": [ "# models to test\n", "classifiers = [\n", "    LogisticRegression(solver=\"sag\", random_state=1),\n", "    LinearSVC(random_state=1),\n", "    RandomForestClassifier(random_state=1),\n", "    XGBClassifier(random_state=1),\n", "    MLPClassifier(\n", "        random_state=1,\n", "        solver=\"adam\",\n", "        hidden_layer_sizes=(12, 12, 12),\n", "        activation=\"relu\",\n", "        early_stopping=True,\n", "        n_iter_no_change=1,\n", "    ),\n", "]\n", "# get the names of the objects in the list (too lazy for c&p...)\n", "names = [re.match(r\"[^\\(]+\", name.__str__())[0] for name in classifiers]\n", "print(f\"Classifiers to test: {names}\")" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training classifier: LogisticRegression\n", "Training classifier: LinearSVC\n", "Training classifier: RandomForestClassifier\n", "Training classifier: XGBClassifier\n", "Training classifier: MLPClassifier\n", "CPU times: user 21min 3s, sys: 730 ms, total: 21min 4s\n", "Wall time: 21min 4s\n" ] } ], "source": [ "%%time\n", "# train all classifiers and save their predictions on the test data\n", "results = {}\n", "for name, clf in zip(names, classifiers):\n", "    print(f\"Training classifier: {name}\")\n", "    clf.fit(X_train_vec, Y_train)\n", "    prediction = clf.predict(X_test_vec)\n", "    report = sklearn.metrics.classification_report(Y_test, prediction)\n", "    results[name] = report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training LogisticRegression and LinearSVC was very fast, while the remaining classifiers were significantly slower. This is partly due to their higher model complexity, but training time can also vary greatly depending on the parameters used. 
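\n", "If we only want a single summary number per model, the macro F1 (the metric we motivate below) can also be computed directly. A minimal sketch, assuming the classifiers above are still fitted:\n", "\n", "```python\n", "from sklearn.metrics import f1_score\n", "\n", "# One summary score per classifier instead of the full report\n", "for name, clf in zip(names, classifiers):\n", "    macro = f1_score(Y_test, clf.predict(X_test_vec), average=\"macro\")\n", "    print(f\"{name}: macro F1 = {macro:.3f}\")\n", "```\n"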
\n", "After having trained all our models on the train data and applying their prediction on the test data, we can judge their performance:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results for LogisticRegression:\n", " precision recall f1-score support\n", "\n", " 0.0 0.97 0.99 0.98 76373\n", " 1.0 0.87 0.72 0.79 8220\n", "\n", " accuracy 0.96 84593\n", " macro avg 0.92 0.85 0.88 84593\n", "weighted avg 0.96 0.96 0.96 84593\n", "\n", "\n", "Results for LinearSVC:\n", " precision recall f1-score support\n", "\n", " 0.0 0.98 0.99 0.98 76373\n", " 1.0 0.85 0.78 0.82 8220\n", "\n", " accuracy 0.97 84593\n", " macro avg 0.92 0.88 0.90 84593\n", "weighted avg 0.96 0.97 0.97 84593\n", "\n", "\n", "Results for RandomForestClassifier:\n", " precision recall f1-score support\n", "\n", " 0.0 0.93 1.00 0.96 76373\n", " 1.0 0.90 0.25 0.39 8220\n", "\n", " accuracy 0.92 84593\n", " macro avg 0.91 0.62 0.68 84593\n", "weighted avg 0.92 0.92 0.90 84593\n", "\n", "\n", "Results for XGBClassifier:\n", " precision recall f1-score support\n", "\n", " 0.0 0.93 0.99 0.96 76373\n", " 1.0 0.80 0.31 0.45 8220\n", "\n", " accuracy 0.93 84593\n", " macro avg 0.86 0.65 0.70 84593\n", "weighted avg 0.92 0.93 0.91 84593\n", "\n", "\n", "Results for MLPClassifier:\n", " precision recall f1-score support\n", "\n", " 0.0 0.98 0.98 0.98 76373\n", " 1.0 0.85 0.80 0.82 8220\n", "\n", " accuracy 0.97 84593\n", " macro avg 0.91 0.89 0.90 84593\n", "weighted avg 0.97 0.97 0.97 84593\n", "\n", "\n" ] } ], "source": [ "# Prediction results\n", "for k, v in results.items():\n", " print(f\"Results for {k}:\")\n", " print(f\"{v}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output of sci-kit's `classification_report` provides us with several metrics. For unbalanced data sets `accuracy` is an inappropriate metric. Since it's value solely tells you how many cases have been properly classified, a high value can be achieved by always predicting the majority class. In contrast, the `f1-score` puts precision and recall of each class in relation to each other. As such, it is a more fine grained measure. We can use an aggregate version of it to have a single metric summarizing the performance of each model. For example, the `weighted f1-score` is an average of the `f1-scores` of both our classes taking the class distribution into account. On the other hand, the `macro f1-score` averages over class scores without weighing them. Consequently, we'll use that as we wish to give the same importance to both our classes. This is because even though bad grades are much more rare they also have a more severe impact. \n", "All methods achieve impressive results predicting the good ratings class with `f1-scores` above 0.95. However, results for the bad ratings class are much lower and vary wildly. \n", "Here, the ensemble methods, i.e. RandomForest and XGBoost, perform worst. However, with some more effort to tune their parameters they would probably fare significantly better. \n", "As a general rule of thumb, logistic regressions deliver decent results in many different use cases. Moreover, they are simple to apply, robust and computationally efficient. Our results support this claim as the logistic regression achieves the third best result with an `macro f1-score` of 0.88. This result is only surpassed by the linear SVC and MLP that both score 0.9. In general, SVC is comparable in speed and simplicity to the logistic regression. 
In addition, it often performs very well in classification tasks, as we can see here. In contrast, Neural Networks perform extremely well on all sorts of tasks but are also much more complex and slower all around. Because of this, we will stick to the LinearSVC classifier for now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameter tuning\n", "\n", "We've learned that linear SVC is a solid model choice and delivers great results out of the box. Still, we might be able to improve upon those by taking a more guided approach to choosing parameters. To do so, we will compare different parameters for feature creation as well as modeling. We can achieve this by making use of the pipeline and grid search functionality in scikit-learn. \n", "The `Pipeline` object encapsulates several processing steps into one. The last step in a pipeline only needs to provide a `fit()` method, while all previous steps must provide both `fit()` and `transform()`. The `Pipeline` object itself has the same `fit()`, `transform()` and `predict()` methods as any other model in scikit-learn. Thus, we can streamline the whole process of feature creation, model fitting and prediction into one step. \n", "With `GridSearchCV` we can define parameter spaces for our functions. All parameter combinations will be evaluated against one another in a cross validation, using a defined score metric. Next, we combine our pipeline with grid search. As a result, we can jointly test combinations of parameters in feature creation and model training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# feature creation and modelling in a single function\n", "pipe = Pipeline([(\"tfidf\", TfidfVectorizer()), (\"svc\", LinearSVC())])\n", "\n", "# define parameter space to test # runtime 35min\n", "params = {\n", "    \"tfidf__ngram_range\": [(1, 1), (1, 2), (1, 3)],\n", "    \"tfidf__max_df\": np.arange(0.3, 0.8, 0.2),\n", "    \"tfidf__min_df\": np.arange(5, 100, 45),\n", "}\n", "pipe_clf = GridSearchCV(pipe, params, n_jobs=-1, scoring=\"f1_macro\")\n", "pipe_clf.fit(X_train, Y_train)\n", "pickle.dump(pipe_clf, open(\"./clf_pipe.pck\", \"wb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each value added to the `params` to be tested with `GridSearchCV` increases the run time of the training significantly. To limit the combinations, we first optimize the feature creation step. We test different values for the `ngram_range`. Moreover, `max_df` and `min_df` set an upper and lower limit for word frequencies. We want to exclude infrequent words because the model won't be able to learn (meaningful) associations from very few observations. The same is true for very frequent words, which won't allow the model to differentiate between classes. 
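\n", "Beyond the single best combination, `GridSearchCV` also stores the scores of every candidate in its `cv_results_` attribute. A quick sketch for inspecting the whole grid, assuming the fitted `pipe_clf` from above:\n", "\n", "```python\n", "# All tested parameter combinations with their cross-validated macro F1, best first\n", "cv_results = pd.DataFrame(pipe_clf.cv_results_)\n", "cols = [\"params\", \"mean_test_score\", \"std_test_score\", \"rank_test_score\"]\n", "print(cv_results[cols].sort_values(\"rank_test_score\").head())\n", "```\n", "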
For each parameter, we check the values which returned the best model fit:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'tfidf__max_df': 0.5, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 3)}\n" ] } ], "source": [ "print(pipe_clf.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, using these best parameters for `TF-IDF`, we can search for optimal parameters for the `LinearSVC` classifier:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 6s, sys: 5.66 s, total: 1min 12s\n", "Wall time: 12min 51s\n" ] } ], "source": [ "%%time\n", "# feature creation and modelling in a single function\n", "pipe = Pipeline([(\"tfidf\", TfidfVectorizer()), (\"svc\", LinearSVC())])\n", "\n", "# define parameter space to test # runtime 19min\n", "params = {\n", "    \"tfidf__ngram_range\": [(1, 3)],\n", "    \"tfidf__max_df\": [0.5],\n", "    \"tfidf__min_df\": [5],\n", "    \"svc__C\": np.arange(0.2, 1, 0.15),\n", "}\n", "pipe_svc_clf = GridSearchCV(pipe, params, n_jobs=-1, scoring=\"f1_macro\")\n", "pipe_svc_clf.fit(X_train, Y_train)\n", "pickle.dump(pipe_svc_clf, open(\"./pipe_svc_clf.pck\", \"wb\"))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'svc__C': 0.6499999999999999, 'tfidf__max_df': 0.5, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 3)}\n" ] } ], "source": [ "best_params = pipe_svc_clf.best_params_\n", "print(best_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We focus on the `C` parameter, which is basically a regularization parameter and essential for the performance of the SVC classifier. High values of `C` mean that the margin of the hyperplane chosen by the SVC to separate the data will be smaller. Thus, while classification on the training data will be better, this can also lead to overfitting. Consequently, `C` controls a trade-off between a low training error and a low testing error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we combine these best parameters and test the prediction of our model using the pipe:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "         0.0       0.98      0.99      0.98     76373\n", "         1.0       0.86      0.79      0.82      8220\n", "\n", "    accuracy                           0.97     84593\n", "   macro avg       0.92      0.89      0.90     84593\n", "weighted avg       0.97      0.97      0.97     84593\n", "\n" ] } ], "source": [ "# run pipe with optimized parameters\n", "pipe.set_params(**best_params).fit(X_train, Y_train)\n", "pipe_pred = pipe.predict(X_test)\n", "report = sklearn.metrics.classification_report(Y_test, pipe_pred)\n", "print(report)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using parameter tuning, we have slightly improved on our already decent model by increasing the precision for class 1.0. However, we can see that the margin for improvement is small. On the one hand, this is because our initial parameters were already very close to the optimum. On the other hand, given our data and the approach used, we might already be close to a ceiling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction\n", "\n", "Now that we have our best performing model (so far), we can use it to make predictions. 
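\n", "Since the tuned grid search object was pickled above, it can also be reloaded in a fresh session and applied to new, cleaned text right away. A minimal sketch (the example comment is made up):\n", "\n", "```python\n", "# Reload the persisted model and classify a single new comment\n", "with open(\"./pipe_svc_clf.pck\", \"rb\") as f:\n", "    loaded_clf = pickle.load(f)\n", "\n", "sample = clean_text(\"Sehr kompetenter Arzt, ich kann ihn nur empfehlen.\")\n", "print(loaded_clf.predict([sample]))  # 0 = good, 1 = bad\n", "```\n", "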
One possible application is to find contradictory reviews, i.e. reviews where the sentiment of the comment doesn't match the rating. For that, we look at cases where our model makes a prediction with high confidence which doesn't match the original rating:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Threshold for negative rating: 1.7340235055695108\n", "Threshold for positive rating: -2.0051706951873167\n" ] } ], "source": [ "# Get confidence score for prediction\n", "conf_score = pipe.decision_function(X_test)\n", "# Get the Nth highest / lowest score\n", "# high score indicates class 1 (bad), low score 0 (good)\n", "score_neg = np.sort(conf_score)[-400]\n", "score_pos = np.sort(conf_score)[20000]\n", "print(\n", " f\"Threshold for negative rating: {score_neg}\\nThreshold for positive rating: {score_pos}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `decision_function` returns values ranging from negative to positive (without a general boundary). They depict the distance of the input to the hyperplane which the SVM algorithm uses to separate the two classes. In our case, large values indicate a higher confidence in class 1 (i.e. a bad rating) and low values in class 0. Consequently, we take the n highest (lowest) values as a threshold for a high confidence in class 1 (0). " ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
grade_badkommentar
2700761.0... nimmt sich Zeit, erklärt und kümmert sich.
1847971.0Ich bin Erika , ich hatte am 15 .12.2017 zahnimplanta Bai voll Narkose alles ist gut gelaufen:) ich bin super zufrieden mit Dr Zeides er ist voll nett und locker drauf, ich kann für jede Empfehlen er ist die beste.ich muss sagen ich habe tierisch Angst von Zahnarzt aber egal wie oft ich Frage hatte und ich hatte viele frage:) er hätt e mir immer zeit genomen und erklärt wie es laufen wird.
2085511.0Super zufriedener Zahnarzt <br />\\nTop Behandlung.... nur weiter zu empfehlen <br />\\nBin ein Angst Patient und er nimmt dir die Angst weg<br />\\nUnd redet auf dich ein . Bester Zahnarzt den ich vertraue.
1036901.0Trotz akuten Schmerzen und Fremdkörper im Ohr, wird man weggeschickt mit einer Arroganz der Sprechstundenhilfe die wohl einmalig ist und ich noch nie so gesehen habe. Bin jetzt im Heegberg bei einem anderen Arzt super Betreuung und super freundlich.
2136311.0Bie Dr. Hamdosch alls prima, sind zehr nett und perfekt
2469871.0Ich wollte telefonisch ein Termin vereinbaren, mir wurde nur Montags ein Termin angeboten, als ich um den Dienstag gebeten habe, hat die Artthelferin wieder mich auch den Montag verwiesen. Ich habe ihr erklärt das ich nur Dienstags oder Freitag kann. Daraufhin hat diese unverschämte Person einfach aufgelegt.
\n", "
" ], "text/plain": [ " grade_bad \\\n", "270076 1.0 \n", "184797 1.0 \n", "208551 1.0 \n", "103690 1.0 \n", "213631 1.0 \n", "246987 1.0 \n", "\n", " kommentar \n", "270076 ... nimmt sich Zeit, erklärt und kümmert sich. \n", "184797 Ich bin Erika , ich hatte am 15 .12.2017 zahnimplanta Bai voll Narkose alles ist gut gelaufen:) ich bin super zufrieden mit Dr Zeides er ist voll nett und locker drauf, ich kann für jede Empfehlen er ist die beste.ich muss sagen ich habe tierisch Angst von Zahnarzt aber egal wie oft ich Frage hatte und ich hatte viele frage:) er hätt e mir immer zeit genomen und erklärt wie es laufen wird. \n", "208551 Super zufriedener Zahnarzt
\\nTop Behandlung.... nur weiter zu empfehlen
\\nBin ein Angst Patient und er nimmt dir die Angst weg
\nUnd redet auf dich ein . Bester Zahnarzt den ich vertraue.  \n", "103690  Trotz akuten Schmerzen und Fremdkörper im Ohr, wird man weggeschickt mit einer Arroganz der Sprechstundenhilfe die wohl einmalig ist und ich noch nie so gesehen habe. Bin jetzt im Heegberg bei einem anderen Arzt super Betreuung und super freundlich.  \n", "213631  Bie Dr. Hamdosch alls prima, sind zehr nett und perfekt  \n", "246987  Ich wollte telefonisch ein Termin vereinbaren, mir wurde nur Montags ein Termin angeboten, als ich um den Dienstag gebeten habe, hat die Artthelferin wieder mich auch den Montag verwiesen. Ich habe ihr erklärt das ich nur Dienstags oder Freitag kann. Daraufhin hat diese unverschämte Person einfach aufgelegt.  " ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_colwidth = 800\n", "# Predicted good but rated bad\n", "test[[\"grade_bad\", \"comment\"]][(Y_test != pipe_pred) & (conf_score <= score_pos)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've learned before that our model is really good at classifying the positive class. Thus, it is not surprising that it's able to correctly recognize that most comments above are positive even though they have a negative rating. However, the model makes two mistakes: reviews `103690` and `246987` really do have a negative sentiment. We can hypothesize why this is the case. The first of the two starts with a negative experience with the rated doctor; then the patient switches doctors and talks very positively about the new one. Since our model is rather simple, it won't be able to identify that only the first part of the review addresses the rated doctor. Consequently, the positive sentiment is attributed to the rated doctor as well and seems to outweigh the negative one. The second review is probably too complex and contains a lot of neutral information, while the negative sentiment is not very strong. \n", "Let's check the comments that our model predicts as negative with high confidence:" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
grade_badkommentar
679960.0- nicht ernsthaft mit den Beschwerden befaßt<br />\\n- oberflächliche und dazu falsche Diagnose <br />\\n- ....schnell schnell<br />\\n<br />\\n<br />\\ndas hat gereicht - nie wieder
1461890.0Wie kann man so unfreundlich sein?!
3316310.0Überhaupt nicht gut, weder Zahnärzlich noch Zwischenmenschlich. Sehr schlecht, ewig wenig Zeit, immer unkonzentriert und lässt sich nicht auf dich ein. Er will immer nur was verkaufen und redet bei jedem Besuch über Geld (Vorschuss usw.) Auch seine eigenen Pläne hält er nicht ein und verändert bei Stress und Unzufriedenheit die Rechnungen. Wir haben dort 910€ für Implantate angezahlt, dieses Geld hat er einfach abkassiert für 2Kronen, welche die DAK auch mit 2500€ honoriert hat. Also wenn euch euere Gesundheit, Zähne und Geld wichtig sind- sucht euch jemand anders, wir haben genug GELD und Zeit dort verloren- deshalb will ich dass es andere Leute auch erfahren.
2608970.0Unfreundlich, unsensibel...unmöglich! Keinerlei Einfühlungsvermögen und Verständnis für Angstpatienten. Direkt nach dem Umbinden der Serviette, noch bevor ich den Arzt überhaupt zu sehen bekam, wollte man mir mit Kältespray an meine temperaturempfindlichen Zähne (angeblich sei dies als Eingangsuntersuchung vorgeschrieben) Ich lehnte dies ab, sah für eine solche Untersuchung überhaupt keinen Anlass, zumal ich davor überhaupt nicht nach meinen Beschwerden gefragt wurde. Verständnis und Aufklärung? Fehlanzeige! Zu einem solchen Arzt kann man kein Vertrauen aufbauen. Ich habe mit einem sehr schlechten Gefühl und ohne Behandlung die Praxis verlassen.
1405000.0Sehr schlechte Beziehung, sehr unfreundlicher Arzt, jetzt es ist mir klar geworden warum das Praxis so leer ist. Furchtbar! Schräklich! Nie wieder!
3350010.0Unhöflich, gemein, unfreundlich. Nie wieder. Die Anamnese hat 5 Sekunden gedauert. Ohne Anamnese wollte er was er will machen und die Patienten dürfen dazu kein Wort aussprechen, sonst bekommen sie Ärger. Geld Industrie, kein Mensch zu vertrauen als Arzt. Ich würde diesen Arzt niemand empfehlen.
\n", "
" ], "text/plain": [ " grade_bad \\\n", "67996 0.0 \n", "146189 0.0 \n", "331631 0.0 \n", "260897 0.0 \n", "140500 0.0 \n", "335001 0.0 \n", "\n", " kommentar \n", "67996 - nicht ernsthaft mit den Beschwerden befaßt
\\n- oberflächliche und dazu falsche Diagnose
\\n- ....schnell schnell
\\n
\\n
\ndas hat gereicht - nie wieder  \n", "146189  Wie kann man so unfreundlich sein?!  \n", "331631  Überhaupt nicht gut, weder Zahnärzlich noch Zwischenmenschlich. Sehr schlecht, ewig wenig Zeit, immer unkonzentriert und lässt sich nicht auf dich ein. Er will immer nur was verkaufen und redet bei jedem Besuch über Geld (Vorschuss usw.) Auch seine eigenen Pläne hält er nicht ein und verändert bei Stress und Unzufriedenheit die Rechnungen. Wir haben dort 910€ für Implantate angezahlt, dieses Geld hat er einfach abkassiert für 2Kronen, welche die DAK auch mit 2500€ honoriert hat. Also wenn euch euere Gesundheit, Zähne und Geld wichtig sind- sucht euch jemand anders, wir haben genug GELD und Zeit dort verloren- deshalb will ich dass es andere Leute auch erfahren.  \n", "260897  Unfreundlich, unsensibel...unmöglich! Keinerlei Einfühlungsvermögen und Verständnis für Angstpatienten. Direkt nach dem Umbinden der Serviette, noch bevor ich den Arzt überhaupt zu sehen bekam, wollte man mir mit Kältespray an meine temperaturempfindlichen Zähne (angeblich sei dies als Eingangsuntersuchung vorgeschrieben) Ich lehnte dies ab, sah für eine solche Untersuchung überhaupt keinen Anlass, zumal ich davor überhaupt nicht nach meinen Beschwerden gefragt wurde. Verständnis und Aufklärung? Fehlanzeige! Zu einem solchen Arzt kann man kein Vertrauen aufbauen. Ich habe mit einem sehr schlechten Gefühl und ohne Behandlung die Praxis verlassen.  \n", "140500  Sehr schlechte Beziehung, sehr unfreundlicher Arzt, jetzt es ist mir klar geworden warum das Praxis so leer ist. Furchtbar! Schräklich! Nie wieder!  \n", "335001  Unhöflich, gemein, unfreundlich. Nie wieder. Die Anamnese hat 5 Sekunden gedauert. Ohne Anamnese wollte er was er will machen und die Patienten dürfen dazu kein Wort aussprechen, sonst bekommen sie Ärger. Geld Industrie, kein Mensch zu vertrauen als Arzt. Ich würde diesen Arzt niemand empfehlen.  " ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Predicted bad but rated good\n", "test[[\"grade_bad\", \"comment\"]][(Y_test != pipe_pred) & ((conf_score >= score_neg))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the model's rating prediction fits the comment's sentiment better than the rating given by the patients in all cases. Unlike before, all comments have a clear, direct and strong sentiment. This is why the model performs error-free. \n", "We conclude that, for cases where our model shows high confidence in its prediction, we are able to uncover reviews where the original comment and rating are mismatched. However, we've also seen that the prediction accuracy has its limits, particularly when dealing with more complex sentence structures or more indirect expressions of sentiment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a last test of our model's performance, let's see how it copes with data that was never seen before. For that, we use some new comments as input. Then, we apply the same pre processing and cleaning as in our model preparation. Finally, we convert the texts to their numeric representations and feed them to our model for the binary prediction:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "# Get new comments from website that were not included in original data\n", "INPUT = [\n", "    \"Super sympathische Ärztin, fühle mich bei ihr bestens aufgehoben.\"\n", "    \"Sprechstundenhilfe war super nett man fühlt sich wohl.\",\n", "    \"Frau Doktor Merz nimmt sich richtig Zeit für mich. 
Hilft wo sie kann.\"\n", " \"Hört wirklich einen zu. Sehr nett und freundlich. Sie ist sehr kompetent,\"\n", " \"zuverlässig und vertrauenswürdig.\",\n", " \"Nach meiner Beobachtung hat diese Praxis eine schlechte Hygiene. \",\n", " \"Mangels akriebischer Behandlung musste mehrmals nachgebessert werden.\",\n", "]" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input after pre-processing / cleaning:\n", "\n", "sup sympath arztin fuhl best aufgehob sprechstundenhilf sup nett fuhlt wohl \n", "frau doktor merz nimmt richtig zeit hilft hort wirklich nett freundlich kompetent zuverlass vertrauenswurd \n", "beobacht praxis schlecht hygi \n", "mangel akrieb behandl mehrmal nachgebessert\n" ] } ], "source": [ "# Pre-Process comments as we did with train data\n", "text = [clean_text(comment) for comment in INPUT]\n", "text_out = \" \\n\".join(text)\n", "print(f\"Input after pre-processing / cleaning:\\n\\n{text_out}\")\n", "# run comments through pipe: predict using our best model from above\n", "predictions = pipe.predict(text)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
commentprediction
0Super sympathische Ärztin, fühle mich bei ihr bestens aufgehoben.Sprechstundenhilfe war super nett man fühlt sich wohl.good
1Frau Doktor Merz nimmt sich richtig Zeit für mich. Hilft wo sie kann.Hört wirklich einen zu. Sehr nett und freundlich. Sie ist sehr kompetent,zuverlässig und vertrauenswürdig.good
2Nach meiner Beobachtung hat diese Praxis eine schlechte Hygiene.bad
3Mangels akriebischer Behandlung musste mehrmals nachgebessert werden.bad
\n", "
" ], "text/plain": [ " comment \\\n", "0 Super sympathische Ärztin, fühle mich bei ihr bestens aufgehoben.Sprechstundenhilfe war super nett man fühlt sich wohl. \n", "1 Frau Doktor Merz nimmt sich richtig Zeit für mich. Hilft wo sie kann.Hört wirklich einen zu. Sehr nett und freundlich. Sie ist sehr kompetent,zuverlässig und vertrauenswürdig. \n", "2 Nach meiner Beobachtung hat diese Praxis eine schlechte Hygiene. \n", "3 Mangels akriebischer Behandlung musste mehrmals nachgebessert werden. \n", "\n", " prediction \n", "0 good \n", "1 good \n", "2 bad \n", "3 bad " ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show comment and predicted Labels\n", "predictions = pd.Series(predictions)\n", "predictions = predictions.replace(0, \"good\").replace(1, \"bad\")\n", "\n", "pd.concat(\n", " [pd.Series(INPUT), predictions], axis=\"columns\", keys=[\"comment\", \"prediction\"]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is very promising. Our model correctly classifies the first two comments as positive and the last two as negative. We can conclude that our predictions are pretty accurate even when dealing with unknown data. Consequently, we have successfully completed our text classification task!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "In this post we have followed a workflow for natural language processing with the goal to implement a binary text classification. After applying a pre processing and cleaning strategy, we have used a frequency based method (`TF-IDF`) to transform input texts to a numeric representation. These text vectors are the features used as input in our classification models. In the next step, we applied different models to our classification task and compared the results using a meaningful prediction score metric. We've learned that this rather simple process is able to already produce decent results. Following, we have performed parameter tuning to further improve our model performance. Finally, we have used the model's predictions to uncover inconsistent reviews and to classify new comments. \n", "In the next post, we will improve upon the outcome of this process, particularly on the harder to predict negative rating class. For that, we will apply more advanced methods. For one, this will include using word embeddings to represent our texts. Moreover, we will use different, advanced implementations of Neural Networks as our classification models." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }