{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "+ title: Does Jameda discriminate against non-paying users? Part 2: New Data, new Insights\n", "+ date: 2021-02-20\n", "+ tags: python, plotly, Jameda, premium, doctors\n", "+ Slug: discrimination-on-jameda-analysis-new-data-new-insights\n", "+ Category: Analytics\n", "+ Authors: MC\n", "+ Summary: Does the physicians rating platform Jameda discriminate against non-paying physicians? In a previous post, we analyzed this claim inconclusively. Using new data, we are able to gain new insights and shed light on the issue." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "\n", "In the [first part]({filename}/jameda_part1.ipynb) of this series, we investigated a claim made in an [article](http://www.zeit.de/2018/04/jameda-aerzte-bewertungsportal-profile-bezahlung/komplettansicht) by the newspaper \"Die Zeit\". The claim was that the popular German physicians rating platform Jameda favors it's paying users while discriminating against non paying physicians. Using a larger and more robust dataset than the original analysis by \"Die Zeit\", we confirmed many of their findings. We found that paying physicians on average have much higher ratings and suspiciously low numbers of poor ratings. However, we took a stance in disagreeing with the conclusion of the article which didn't account for alternative explanations for these observations. As these findings are only based on correlation, we argued that this alone can't be seen as proof for Jameda's favoring of paying members. Especially, as there is a very intuitive theory why paying physicians might have better ratings: Paying to be a premium member on the platform might be related to other positive traits of the physician leading to better ratings. For example, doctors who value their reputation highly might be more careful in interacting with their patients and also more willing to be a paying user. Hence, the result of our first analysis was inconclusive. \n", "However, there still was a credible claim stated by the original article. On Jameda, physicians can report ratings they disagree with. This (temporary) removes the reviews from the site and starts a [validation process](https://www.jameda.de/qualitaetssicherung/). The rating's comment is checked by Jameda and can be removed permanently, if it violates certain rules. It could be that premium members report negative ratings more often. This seems intuitive. Premium members are probably more engaged in cultivating their profiles and also more active in general. In contrast, non paying members might have never looked up their own profile at all. Thus, missing out on the opportunity to report any negative reviews. The author from \"Die Zeit\" asked Jameda whether they remove more negative reviews from premium users' profiles. Jameda stated, that they don't have any data on this. \n", "Well, no worries Jameda. I got your back! I'll gladly offer some data myself to help you out with this question. With a little delay of about three years, we'll finally conclude our analysis. Using new data, we'll be able to shed light on the original question!\n", "
\n", "As always, you can download this notebook on \"Open and \n", " \"Open. Unfortunately, I won't be able to share the underlying data this time. Sorry for that!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data\n", "\n", "The data was scraped once a day and stored to a SQLite database. It is made up of two parts. The first part consists of the information on the physicians. The second part consists of the reviews. (Refer to the [first post]({filename}/jameda_part1.ipynb) for more details on the data content) \n", "We observed changes in the reviews for a period of about nine months from January 2020 to September 2020. Thus, we were able to identify all reviews that were removed by Jameda during this period because they were reported by a physician. Also, we can see the result of the validation process for some of those. They could have been removed after being reported or could have been re published. \n", "First, we create a class for reading the data from our SQLite database:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "tags": [ "hide" ] }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import sqlite3\n", "import logging\n", "import pandas as pd\n", "import numpy as np\n", "import plotly.express as px\n", "import plotly.io as pio\n", "from sqlite3 import Error\n", "from pandas.tseries.frequencies import to_offset\n", "from IPython.display import Markdown as md\n", "\n", "pio.renderers.default = \"notebook_connected\"\n", "px.defaults.template = \"plotly_white\"\n", "pd.options.display.max_columns = 100\n", "pd.options.display.max_rows = 600\n", "pd.options.display.max_colwidth = 100\n", "np.set_printoptions(threshold=2000)\n", "log = logging.getLogger()\n", "log.setLevel(logging.DEBUG)\n", "\n", "# Data for original observation period\n", "DATA_OLD = \"../data/raw/2020-09-23_jameda.db\"\n", "# To check for longer term changes we got some new data\n", "DATA_NEW = \"../data/raw/2021-02-13_jameda.db\"\n", "DATE_START_WAVE2 = \"2021-02-10\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class DB:\n", " def __init__(self, db):\n", " \"\"\"\n", " Connect to sqllite DB expose connection and cursor of instance\n", " \"\"\"\n", " self.cursor = None\n", " self.conn = self.create_connection(db)\n", " self.conn.row_factory = sqlite3.Row # return column names on fetch\n", " try:\n", " self.cursor = self.conn.cursor()\n", " except Exception as e:\n", " log.exception(f\"Error getting cursor for DB connection: {e}\")\n", "\n", " def create_connection(self, db_path):\n", " \"\"\"return a database connection \"\"\"\n", " try:\n", " conn = sqlite3.connect(db_path)\n", " except Error:\n", " log.exception(f\"Error connecting to DB: {db_path}\")\n", " return conn\n", "\n", " def send_single_statement(self, statement):\n", " \"\"\" Send single statement to DB \"\"\"\n", " try:\n", " self.cursor.execute(statement)\n", " except Error:\n", " log.exception(f\"Error sending statement: {statement}\")\n", " self.conn.rollback()\n", " return None\n", " else:\n", " log.info(f\"OK sending statement: {statement}\")\n", " self.conn.commit()\n", " return True\n", "\n", " def select_and_fetchall(self, statement):\n", " \"\"\" Execute a select statement and return all rows \"\"\"\n", " try:\n", " 
self.cursor.execute(statement)\n", " rows = self.cursor.fetchall()\n", " except Exception:\n", " log.exception(\"Could not select and fetchall\")\n", " return None\n", " else:\n", " return rows\n", "\n", " def __del__(self):\n", " \"\"\" make sure we close connection \"\"\"\n", " self.conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the class and its methods, let's read in the data regarding the doctors' profiles:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ref_idstrasseanredegesamt_notebewertungenorttypplzscoreartentfernunglatlngname_nicename_kurztyp_stringfach_stringurlurl_hintensnippetscanHaveReviewsportraitdate_createdate_updateerrorspremium
080283417Venloer Str. 38922.438KölnHA5082551.263910.150.9510996.915398Dr. med. Brigitte JähnigDr. JähnigÄrztinKinderärztin/koeln/aerzte/kinderaerzte/dr-brigitte-jaehnig/80283417_1/[{'type': 'positive', 'label': 'öffentlich gut erreichbar'}, {'type': 'positive', 'label': 'freu...1None2020-01-07 18:16:352020-01-07 18:16:3500
180424197Venloer Str. 38921.001KölnHA5082553.663710.150.9510996.915398Dr. med. Andrea SteinleDr. SteinleÄrztinInternistin/koeln/aerzte/innere-allgemeinmediziner/dr-andrea-steinle/80424197_1/None1None2020-01-07 18:16:352020-01-07 18:16:3500
\n", "
" ], "text/plain": [ " ref_id strasse anrede gesamt_note bewertungen ort typ \\\n", "0 80283417 Venloer Str. 389 2 2.43 8 Köln HA \n", "1 80424197 Venloer Str. 389 2 1.00 1 Köln HA \n", "\n", " plz score art entfernung lat lng \\\n", "0 50825 51.2639 1 0.1 50.951099 6.915398 \n", "1 50825 53.6637 1 0.1 50.951099 6.915398 \n", "\n", " name_nice name_kurz typ_string fach_string \\\n", "0 Dr. med. Brigitte Jähnig Dr. Jähnig Ärztin Kinderärztin \n", "1 Dr. med. Andrea Steinle Dr. Steinle Ärztin Internistin \n", "\n", " url url_hinten \\\n", "0 /koeln/aerzte/kinderaerzte/dr-brigitte-jaehnig/ 80283417_1/ \n", "1 /koeln/aerzte/innere-allgemeinmediziner/dr-andrea-steinle/ 80424197_1/ \n", "\n", " snippets \\\n", "0 [{'type': 'positive', 'label': 'öffentlich gut erreichbar'}, {'type': 'positive', 'label': 'freu... \n", "1 None \n", "\n", " canHaveReviews portrait date_create date_update errors \\\n", "0 1 None 2020-01-07 18:16:35 2020-01-07 18:16:35 0 \n", "1 1 None 2020-01-07 18:16:35 2020-01-07 18:16:35 0 \n", "\n", " premium \n", "0 0 \n", "1 0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from sqlite to df\n", "db = DB(DATA_OLD)\n", "ret = db.select_and_fetchall(\"SELECT * FROM doctors;\")\n", "rows = [dict(row) for row in ret]\n", "docs = pd.DataFrame(rows)\n", "docs[\"premium\"] = 0\n", "# If user has a portrait picture, he is paying member / premium user\n", "docs.loc[docs[\"portrait\"].notna(), \"premium\"] = 1\n", "docs_unq = (\n", " docs.sort_values([\"ref_id\", \"date_update\"])\n", " .groupby(\"ref_id\", as_index=False)\n", " .agg(\"last\")\n", ")\n", "# Clean doctor subject string\n", "docs[\"fach_string\"] = docs[\"fach_string\"].str.replace(\"[\\['\\]]\", \"\")\n", "docs.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is very similar to the data we used in the first post. In this analysis, most of the columns can be ignored. We'll focus on the `ref_id` (which is the user id) and whether or not the physician is a paying (`premium = 1`) or non-paying user (`premium = 0`). 
\n", "Next, we read in and process the reviews:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# from multiple sqlite DBs to single df\n", "DATA = [DATA_OLD, DATA_NEW]\n", "reviews = pd.DataFrame()\n", "for data in DATA:\n", " db = DB(data)\n", " ret = db.select_and_fetchall(\"SELECT * FROM reviews;\")\n", " rows = [dict(row) for row in ret]\n", " df = pd.DataFrame(rows).sort_values([\"ref_id\", \"b_id\", \"date_create\"])\n", " # columns to dates\n", " df[\"b_date\"] = pd.to_datetime(df[\"b_date\"], unit=\"s\")\n", " df[\"date_create\"] = pd.to_datetime(df[\"date_create\"])\n", " df[\"date_update\"] = pd.to_datetime(df[\"date_update\"])\n", " reviews = reviews.append(df)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# some processing\n", "# force numeric content to numeric type\n", "reviews[\n", " [\n", " \"ref_id\",\n", " \"b_id\",\n", " \"b_stand\",\n", " \"gesamt_note_class\",\n", " \"br_total_votes\",\n", " \"br_index\",\n", " \"is_archived\",\n", " \"kommentar_entfernt\",\n", " ]\n", "] = reviews[\n", " [\n", " \"ref_id\",\n", " \"b_id\",\n", " \"b_stand\",\n", " \"gesamt_note_class\",\n", " \"br_total_votes\",\n", " \"br_index\",\n", " \"is_archived\",\n", " \"kommentar_entfernt\",\n", " ]\n", "].apply(\n", " pd.to_numeric\n", ")\n", "# flag wave 1 and 2\n", "reviews[\"wave\"] = np.where(reviews[\"date_create\"] < DATE_START_WAVE2, 1, 2)\n", "reviews_w2 = reviews[reviews[\"wave\"] == 2]\n", "reviews_w1 = reviews[reviews[\"wave\"] == 1]\n", "# skip incomplete days\n", "date_firstday = reviews_w1[\"date_create\"].min().ceil(\"d\")\n", "date_lastday = reviews_w1[\"date_create\"].max().floor(\"d\")\n", "reviews_w1 = reviews_w1.loc[\n", " (reviews_w1[\"date_create\"] >= date_firstday)\n", " & (reviews_w1[\"date_create\"] <= date_lastday),\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ref_id` of each review can be related back to the corresponding physician. The `b_id` refers to the unique id of a review. Moreover, we'll need `gesamt_note` which is the numerical rating of the review (from 1 = best, to 6 = worst), `b_date` which is the first publication date of a review, `date_create` which is the date we observed this review and `b_stand` which is the status of the review (explained below). \n", "Here, we strictly look at reviews for which we have multiple observations, i.e. reviews which have changed during the period we scraped them:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# filter for reviews with multiple entries (new entry only created when reviews changed)\n", "num_entries = reviews_w1.groupby(\"b_id\").size()\n", "multi_entry = reviews_w1.loc[\n", " reviews_w1[\"b_id\"].isin(num_entries[num_entries > 1].index),\n", "].sort_values([\"ref_id\", \"b_id\", \"date_create\"])\n", "\n", "# filter on relevant columns\n", "cols_change = [\"ref_id\", \"b_date\", \"b_id\", \"b_stand\", \"date_create\", \"wave\"]\n", "multi_entry = multi_entry[cols_change].reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we come back to the `b_stand` (status) variable. We know that `1` means the status is normal. This is just a regular review, containing a rating and a comment. When `b_stand` is `4` it indicates that the review was reported, temporarily removed, and is being verified by Jameda (the process is explained [here](https://www.jameda.de/qualitaetssicherung/)). 
A `5` tells us, that the review has a comment but no rating. Here's an example for a reported review:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "titel Warum ist diese Bewertung aktuell nicht online?\n", "kommentar Dr. Herrmann hat uns die Bewertung gemeldet, da sie sie für rechtswidrig hält. Aus diesem G...\n", "Name: 55045, dtype: object\n" ] } ], "source": [ "print(\n", " reviews_w1.loc[reviews_w1[\"b_stand\"] == 4,].iloc[\n", " 0\n", " ][[\"titel\", \"kommentar\"]]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These reviews are grayed out on the website. The rating is deleted and the original text in the title and comment is replaced by a standard message. It says something along the lines of: \"This review has been reported by the physician and is under review\". Also, the rating is not displayed for them anymore. \n", "Following, for the reviews with multiple entries, we check how `b_stand` changed between observations:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ref_idb_dateb_idb_standdate_createwaveb_stand_prevchangeremovedreadded
0800009392016-07-14 18:17:01281701512020-07-15 03:22:201NaNNaNNaNNaN
1800009392016-07-14 18:17:01281701512020-08-23 03:03:1811.00.0NaNNaN
2800009412016-06-10 12:49:17275413412020-02-25 03:24:121NaNNaNNaNNaN
3800009412016-06-10 12:49:17275413412020-09-19 03:02:1911.00.0NaNNaN
\n", "
" ], "text/plain": [ " ref_id b_date b_id b_stand date_create wave \\\n", "0 80000939 2016-07-14 18:17:01 2817015 1 2020-07-15 03:22:20 1 \n", "1 80000939 2016-07-14 18:17:01 2817015 1 2020-08-23 03:03:18 1 \n", "2 80000941 2016-06-10 12:49:17 2754134 1 2020-02-25 03:24:12 1 \n", "3 80000941 2016-06-10 12:49:17 2754134 1 2020-09-19 03:02:19 1 \n", "\n", " b_stand_prev change removed readded \n", "0 NaN NaN NaN NaN \n", "1 1.0 0.0 NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 1.0 0.0 NaN NaN " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute direction of b_stand change: review removed or re added (after removal)\n", "multi_entry = multi_entry[multi_entry[\"b_stand\"].isin([1, 4, 5])]\n", "multi_entry[\"b_stand_prev\"] = multi_entry.groupby(\"b_id\")[\"b_stand\"].shift(\n", " fill_value=np.nan\n", ")\n", "multi_entry[\"change\"] = multi_entry[\"b_stand\"] - multi_entry[\"b_stand_prev\"]\n", "multi_entry.loc[multi_entry[\"change\"].isin([3, -1]), \"removed\"] = 1\n", "multi_entry.loc[multi_entry[\"change\"].isin([-3, 1]), \"readded\"] = 1\n", "multi_entry.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the change in `b_stand` we can conclude whether a review has been removed (it went from 1 to 4 or 5 to 4) or re added after being removed some time before (change from 4 to 1 or 4 to 5). \n", "Let's store a data frame of only the reviews which have been removed or re added:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
b_idremovedreadded
06828471.0NaN
18070421.0NaN
\n", "
" ], "text/plain": [ " b_id removed readded\n", "0 682847 1.0 NaN\n", "1 807042 1.0 NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only reviews that were added or removed\n", "changed = multi_entry[(multi_entry[\"removed\"] == 1) | (multi_entry[\"readded\"] == 1)]\n", "changed = changed[[\"b_id\", \"removed\", \"readded\"]].groupby(\"b_id\").max().reset_index()\n", "changed.head(2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ref_idb_idb_standu_alterkasse_privatb_dategesamt_notegesamt_note_class_xgesamt_note_formattedbs_inhaltbr_total_votesbr_total_valuebr_indexis_archivedtitelkommentar_entferntkommentarheaderfragendate_createdate_updatewaveremovedreaddedpremiumortfach_stringurlurl_hintengesamt_note_class_y
08000072733978801002017-05-23 09:48:031.001.01,002.063.00Sehr gute Kompetenz mit perfekten Netzwerk0Hier fühlt man sich bestens aufgehoben und wird perfekt betreut in einen sehr schönen Ambiente.Bewertung vom 23.05.17[{'fragekurz': 'Behandlung', 'note': '1'}, {'fragekurz': 'Aufklärung', 'note': '1'}, {'fragekurz...2020-09-09 03:24:192020-09-09 03:24:1910.00.01MünchenInternistin/muenchen/aerzte/innere-allgemeinmediziner/dr-daniela-grenacher-horn/80000727_1/1.0
18000072734135801212017-06-01 19:11:211.001.01,003.031.00Eine super Ärztin mit viel Leidenschaft, Menschlichkeit und Beruf als Berufung0Durch ganz großes Glück kam Ich zu Frau Dr. Grenacher-Horn. <br />\\r\\n<br />\\r\\nIch habe noch ni...Bewertung vom 01.06.17, gesetzlich versichert, 30 bis 50[{'fragekurz': 'Behandlung', 'note': '1'}, {'fragekurz': 'Aufklärung', 'note': '1'}, {'fragekurz...2020-09-09 03:24:192020-09-09 03:24:1910.00.01MünchenInternistin/muenchen/aerzte/innere-allgemeinmediziner/dr-daniela-grenacher-horn/80000727_1/1.0
\n", "
" ], "text/plain": [ " ref_id b_id b_stand u_alter kasse_privat b_date \\\n", "0 80000727 3397880 1 0 0 2017-05-23 09:48:03 \n", "1 80000727 3413580 1 2 1 2017-06-01 19:11:21 \n", "\n", " gesamt_note gesamt_note_class_x gesamt_note_formatted bs_inhalt \\\n", "0 1.00 1.0 1,0 0 \n", "1 1.00 1.0 1,0 0 \n", "\n", " br_total_votes br_total_value br_index is_archived \\\n", "0 2.0 6 3.0 0 \n", "1 3.0 3 1.0 0 \n", "\n", " titel \\\n", "0 Sehr gute Kompetenz mit perfekten Netzwerk \n", "1 Eine super Ärztin mit viel Leidenschaft, Menschlichkeit und Beruf als Berufung \n", "\n", " kommentar_entfernt \\\n", "0 0 \n", "1 0 \n", "\n", " kommentar \\\n", "0 Hier fühlt man sich bestens aufgehoben und wird perfekt betreut in einen sehr schönen Ambiente. \n", "1 Durch ganz großes Glück kam Ich zu Frau Dr. Grenacher-Horn.
\\r\\n
\\r\\nIch habe noch ni... \n", "\n", " header \\\n", "0 Bewertung vom 23.05.17 \n", "1 Bewertung vom 01.06.17, gesetzlich versichert, 30 bis 50 \n", "\n", " fragen \\\n", "0 [{'fragekurz': 'Behandlung', 'note': '1'}, {'fragekurz': 'Aufklärung', 'note': '1'}, {'fragekurz... \n", "1 [{'fragekurz': 'Behandlung', 'note': '1'}, {'fragekurz': 'Aufklärung', 'note': '1'}, {'fragekurz... \n", "\n", " date_create date_update wave removed readded premium \\\n", "0 2020-09-09 03:24:19 2020-09-09 03:24:19 1 0.0 0.0 1 \n", "1 2020-09-09 03:24:19 2020-09-09 03:24:19 1 0.0 0.0 1 \n", "\n", " ort fach_string \\\n", "0 München Internistin \n", "1 München Internistin \n", "\n", " url \\\n", "0 /muenchen/aerzte/innere-allgemeinmediziner/dr-daniela-grenacher-horn/ \n", "1 /muenchen/aerzte/innere-allgemeinmediziner/dr-daniela-grenacher-horn/ \n", "\n", " url_hinten gesamt_note_class_y \n", "0 80000727_1/ 1.0 \n", "1 80000727_1/ 1.0 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# keep only unique reviews and add cols from other tables with infos\n", "reviews_unq = reviews_w1[reviews_w1[\"b_stand\"].isin([1, 4, 5])]\n", "reviews_unq = reviews_unq.drop_duplicates(\"b_id\", keep=\"last\")\n", "reviews_unq = reviews_unq.merge(changed, how=\"left\", on=\"b_id\")\n", "doc_infos = docs[\n", " [\"ref_id\", \"premium\", \"ort\", \"fach_string\", \"url\", \"url_hinten\"]\n", "].drop_duplicates(\"ref_id\", keep=\"last\")\n", "reviews_unq = reviews_unq.merge(doc_infos, how=\"left\", on=\"ref_id\")\n", "# store original rating (on removal of review grade disappears) and re add to reviews\n", "ratings = reviews_w1[[\"gesamt_note_class\", \"b_id\"]].groupby(\"b_id\").min().reset_index()\n", "reviews_unq = reviews_unq.merge(ratings, how=\"left\", on=\"b_id\")\n", "reviews_unq = reviews_unq.replace(np.nan, 0)\n", "# remove those without a grade (under review without previous entry)\n", "# reviews_unq = reviews_unq[reviews_unq[\"gesamt_note_class_y\"] > 0]\n", "reviews_unq.head(2)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
b_idb_datedate_createremovedreaddedpremiumgesamt_note_class_ywave
046829672019-07-12 09:34:262020-03-02 03:21:151.00.016.01
148716982019-12-02 10:37:172020-03-02 03:21:151.00.016.01
\n", "
" ], "text/plain": [ " b_id b_date date_create removed readded premium \\\n", "0 4682967 2019-07-12 09:34:26 2020-03-02 03:21:15 1.0 0.0 1 \n", "1 4871698 2019-12-02 10:37:17 2020-03-02 03:21:15 1.0 0.0 1 \n", "\n", " gesamt_note_class_y wave \n", "0 6.0 1 \n", "1 6.0 1 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only reviews that were removed With additional cols\n", "removed = reviews_unq.loc[\n", " (reviews_unq[\"removed\"] == 1),\n", " [\n", " \"b_id\",\n", " \"b_date\",\n", " \"date_create\",\n", " \"removed\",\n", " \"readded\",\n", " \"premium\",\n", " \"gesamt_note_class_y\",\n", " \"wave\",\n", " ],\n", "].reset_index(drop=True)\n", "removed.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we have cleaned and prepared the data, we can check some of its properties. First, let's see how many new reviews are published each week during our observation period:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Only reviews that were published during observation wave 1\n", "reviews_unq_new_in_wave1 = reviews_unq.loc[\n", " (reviews_unq[\"b_date\"] >= date_firstday) & (reviews_unq[\"b_date\"] <= date_lastday),\n", "]\n", "\n", "# From daily data to weekly aggregation\n", "plt = reviews_unq_new_in_wave1[[\"b_date\", \"premium\"]].set_index(\"b_date\")\n", "plt[\"reviews\"] = 1\n", "plt[\"non premium\"] = abs(plt[\"premium\"] - 1)\n", "plt = plt.resample(\"W-MON\", label=\"left\").sum()\n", "reviews_new_total = plt[\"reviews\"].sum()\n", "\n", "fig = px.bar(\n", " plt,\n", " x=plt.index,\n", " y=[\"non premium\", \"premium\"],\n", " title=f\"New reviews per week in 2020 (Total {reviews_new_total})\",\n", " labels={\"b_date\": \"Date published\", \"variable\": \"Published reviews\"},\n", " barmode=\"stack\",\n", ")\n", "fig.update_xaxes(dtick=7 * 24 * 60 * 60 * 1000, tickformat=\"%d %b\", tick0=\"2020-01-06\")\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [ "hide" ] }, "outputs": [ { "data": { "text/markdown": [ "Each week, there are between 191 and 862 newly published reviews. The average week sees about 617 of them. Reviews on premium profiles have a share of 42%." ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Share of Premium\n", "reviews_new_total_prem = plt[\"premium\"].sum()\n", "reviews_new_total_noprem = plt[\"non premium\"].sum()\n", "reviews_new_share_prem = (\n", " reviews_new_total_prem / (reviews_new_total_prem + reviews_new_total_noprem) * 100\n", ")\n", "# descriptives\n", "reviews_min = plt[\"reviews\"].min()\n", "reviews_max = plt[\"reviews\"].max()\n", "reviews_mean = plt[\"reviews\"].mean()\n", "md(\n", " f\"Each week, there are between {reviews_min} and {reviews_max} newly published reviews. The average week sees about {reviews_mean:.0f} of them. Reviews on premium profiles have a share of {reviews_new_share_prem:.0f}%.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we compare those numbers to the number of reviews removed during the same time span:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Removed reviews: 459 (0.18% of all, 2.01% of new reviews in period)\n" ] } ], "source": [ "freq_removed = reviews_unq[\"removed\"].value_counts()\n", "freq_removed_perc = reviews_unq[\"removed\"].value_counts(normalize=True) * 100\n", "freq_removed_perc_of_new = freq_removed / reviews_new_total * 100\n", "print(\n", " f\"Removed reviews: {freq_removed[1]:.0f} ({freq_removed_perc[1]:.2f}% of all\"\\\n", " f\", {freq_removed_perc_of_new[1]:.2f}% of new reviews in period)\"\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [ "hide" ] }, "outputs": [ { "data": { "text/markdown": [ "Over the nine month period in which we observed all changes in the reviews on a daily basis, we find only 459 removed reviews. That amounts to only 0.18% of all reviews and 2.01% of the newly published reviews during that same observation period." ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md(\n", " f\"Over the nine month period in which we observed all changes in the reviews on a daily basis, we find only {freq_removed[1]:.0f} removed reviews. 
That amounts to only {freq_removed_perc[1]:.2f}% of all reviews and {freq_removed_perc_of_new[1]:.2f}% of the newly published reviews during that same observation period.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, **the removal of reviews does not seem to be very common**. Still, **it could have a substantial impact on total ratings**. As there are only a few negative reviews in general, removing those can alter the picture greatly. \n", "As before, we visualize the removed reviews by week and check for patterns:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# From daily data to weekly aggregation\n", "plt = removed[[\"date_create\", \"removed\", \"premium\"]].set_index(\"date_create\")\n", "plt = plt.resample(\"W-MON\", label=\"left\").sum()\n", "plt[\"non premium\"] = plt[\"removed\"] - plt[\"premium\"]\n", "reviews_removed_total = plt[\"removed\"].sum()\n", "\n", "fig = px.bar(\n", " plt,\n", " x=plt.index,\n", " y=[\"non premium\", \"premium\"],\n", " title=f\"Removed reviews per week in 2020 (Total {reviews_removed_total:.0f})\",\n", " labels={\"date_create\": \"Date\", \"variable\": \"Removed Reviews\"},\n", " barmode=\"stack\",\n", ")\n", "fig.update_xaxes(dtick=7 * 24 * 60 * 60 * 1000, tickformat=\"%d %b\", tick0=\"2020-01-06\")\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Share of Premium\n", "reviews_removed_prem = plt[\"premium\"].sum()\n", "reviews_removed_prem_share = reviews_removed_prem / reviews_removed_total * 100\n", "\n", "# descriptives\n", "reviews_min = int(plt[\"removed\"].min())\n", "reviews_max = int(plt[\"removed\"].max())\n", "reviews_mean = plt[\"removed\"].mean()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [ "hide" ] }, "outputs": [ { "data": { "text/markdown": [ "Each week, between 1 and 38 reviews are removed. In an average week there are about 12 of them. Out of all removed reviews during the observation period, those on premium profiles have a share of 24%. Hence, **the share of removed reviews is substantially lower then the share of published reviews (42%, see above) on premium profiles.**" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md(\n", " f\"Each week, between {reviews_min} and {reviews_max} reviews are removed. In an average week there are about {reviews_mean:.0f} of them. Out of all removed reviews during the observation period, those on premium profiles have a share of {reviews_removed_prem_share:.0f}%. Hence, **the share of removed reviews is substantially lower then the share of published reviews ({reviews_new_share_prem:.0f}%, see above) on premium profiles.**\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, there is no problem with reviews being disproportionately removed from premium profiles to improve their ratings? Not so fast! While this seems to hold true, it's not the right question to ask. We've learned before, that in general premium users get way less poor ratings than non paying users (see [first part]({filename}/jameda_part1.ipynb)). Also, it's intuitive that removed reviews will predominantly have low ratings (we'll check that in a minute). Consequently, the real question is: **\"Are relatively more critical reviews (i.e. those with low ratings) removed from premium profiles?\"**. 
Thus, we also need to take the ratings into account when comparing groups:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "premium removed\n", "0 0.0 98.880847\n", " 1.0 1.119153\n", "1 0.0 96.618357\n", " 1.0 3.381643\n", "Name: removed, dtype: float64\n" ] } ], "source": [ "# Filter for poor reviews\n", "reviews_poor = reviews_unq[reviews_unq[\"gesamt_note_class_y\"] >= 4]\n", "# How many of the poor ratings are removed by status\n", "share = reviews_poor.groupby([\"premium\"])[\"removed\"].value_counts(normalize=True) * 100\n", "prob_removed_premium = share[1][1] / share[0][1]\n", "print(share)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "tags": [ "hide" ] }, "outputs": [ { "data": { "text/markdown": [ "The answer to the above is: **Yes**. On premium profiles 3.4% of poor reviews are removed but only 1.1% are removed on non premium profiles. **A poor rating on a premium profile is 3 times more likely to be removed compared to one on a non premium profile**." ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md(\n", " f\"The answer to the above is: **Yes**. On premium profiles {share[1][1]:.1f}% of poor reviews are removed but only {share[0][1]:.1f}% are removed on non premium profiles. **A poor rating on a premium profile is {prob_removed_premium:.0f} times more likely to be removed compared to one on a non premium profile**.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following, we look a bit closer at the reviews that get removed. As stated above, we'd expect them to have strictly negative ratings:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# frequency removed by rating\n", "plt = removed[[\"gesamt_note_class_y\", \"removed\", \"premium\"]].reset_index(drop=True)\n", "plt = (\n", " plt.value_counts(\"gesamt_note_class_y\", normalize=True)\n", " .rename(\"frequency\")\n", " .reset_index()\n", ")\n", "\n", "fig = px.bar(\n", " plt,\n", " x=\"gesamt_note_class_y\",\n", " y=\"frequency\",\n", " title=\"Rating frequency of removed reviews\",\n", " labels={\"gesamt_note_class_y\": \"Rating\"},\n", ")\n", "fig.update_yaxes(tickformat=\"%\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, our intuition was pretty much right. **By far, most of the removed reviews have a poor rating**. Still, a few of the removed reviews had a good rating. Turns out, those are mostly misrated cases where a positive rating was given to a critical comment. (Notice: A rating of 0 means that the review didn't have a rating at all, i.e. it was just a comment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we ask ourselves: How long is the typical time span between a critical rating being published and it being reported (more precisely removed, as we don't now how much time passes from report to removal)? We look at all reviews that were removed during the observation period and compare the time of removal to the time of creation (which can be well before our observation phase):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
b_idb_datedate_createremovedreaddedpremiumgesamt_note_class_ywavereport_durationreport_duration_days
046829672019-07-12 09:34:262020-03-02 03:21:151.00.016.01233 days 17:46:49233
148716982019-12-02 10:37:172020-03-02 03:21:151.00.016.0190 days 16:43:5890
\n", "
" ], "text/plain": [ " b_id b_date date_create removed readded premium \\\n", "0 4682967 2019-07-12 09:34:26 2020-03-02 03:21:15 1.0 0.0 1 \n", "1 4871698 2019-12-02 10:37:17 2020-03-02 03:21:15 1.0 0.0 1 \n", "\n", " gesamt_note_class_y wave report_duration report_duration_days \n", "0 6.0 1 233 days 17:46:49 233 \n", "1 6.0 1 90 days 16:43:58 90 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute duration between review published and removed in days\n", "removed[\"report_duration\"] = removed[\"date_create\"] - removed[\"b_date\"]\n", "removed[\"report_duration_days\"] = removed[\"report_duration\"].map(lambda x: x.days)\n", "removed.head(2)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize time delta from publishing to removal for removed reviews\n", "plt = removed[[\"report_duration_days\", \"removed\"]].reset_index(drop=True)\n", "\n", "fig = px.histogram(\n", " plt,\n", " x=\"report_duration_days\",\n", " title=\"Removed reviews: days between publishing and removal\",\n", " labels={\"report_duration_days\": \"Days\"},\n", " histnorm=\"probability\",\n", " nbins=200,\n", " marginal=\"box\",\n", ")\n", "fig.update_yaxes(tickformat=\"%\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chart give us some nice insights about the whole reporting / removal process: \n", "First, the record for the most short lived comment is only four days. That might be a good proxy for the minimum reaction time of Jameda, i.e. the time between receiving a report and acting on it. About 12% of the removed reviews are removed within 20 days. But in general, **it takes about three months for a review to be removed**. Nonetheless, there are quite a few reviews that get removed much later. In one case, the review was removed after more than seven years! \n", "If we compare this distribution between paying and non paying physicians, we might learn some more:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize time delta from publishing to removal for removed reviews\n", "plt = removed[[\"report_duration_days\", \"removed\", \"premium\"]].reset_index(drop=True)\n", "plt = plt[(plt[\"removed\"] == 1)]\n", "\n", "fig = px.histogram(\n", " plt,\n", " x=\"report_duration_days\",\n", " color=\"premium\",\n", " title=\"Removed reviews: days between publishing and removal\",\n", " labels={\"report_duration_days\": \"Days\"},\n", " histnorm=\"probability\",\n", " nbins=200,\n", " marginal=\"box\",\n", ")\n", "fig.update_yaxes(tickformat=\"%\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Those distributions look quite different and support our previous hypothesis: Premium users seem in fact to be more concerned about their reputation on Jameda. **Critical reviews on their profiles are removed (reported) much faster**. On median, this is the case after 52 days while for non premium users the median is 139 days." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a last analysis, let's see what happens to the removed reviews we observed after some time has passed. Following our original observation period in 2020 (wave 1), we updated our review data in February 2021 (wave 2). After about five months since the end of wave 1, how many of the removed reviews have been re published? How many have been deleted for good? " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Removed during wave one: 459\n", "Out of those, re added until wave two: 33\n", "Percent re added of removed total: 7.2%\n", "Percent re added of removed for non premium: 7.8%\n", "Percent re added of removed for premium: 5.4%\n" ] } ], "source": [ "# Which removed during wave 1 are re added until wave 2\n", "print(\"Removed during wave one:\", removed[\"b_id\"].shape[0])\n", "readded = removed.merge(reviews_w2, on=\"b_id\")\n", "readded = readded[readded[\"b_stand\"].isin([1, 5])]\n", "print(\"Out of those, re added until wave two:\", readded.shape[0])\n", "pct_readded = readded.shape[0] / removed[\"b_id\"].shape[0] * 100\n", "print(f\"Percent re added of removed total: {pct_readded:.1f}%\")\n", "# Perct of removed is re added by status\n", "pct_readded_by_status = (\n", " readded[\"premium\"].value_counts() / removed[\"premium\"].value_counts() * 100\n", ")\n", "print(f\"Percent re added of removed for non premium: {pct_readded_by_status[0]:.1f}%\")\n", "print(f\"Percent re added of removed for premium: {pct_readded_by_status[1]:.1f}%\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide" ] }, "outputs": [ { "data": { "text/markdown": [ "**When a review gets removed, it stays removed in most cases**. Only 7.2% of all removed reviews are re published after five months or longer in our observation. This share differs by status. For reviews on non premium profiles the share of re added reviews is 7.8% but for premium profiles it is only 5.4%. This could indicate that Jameda does not validate all reported reviews equally and therefor favors premium users. However, the number of observations is low and there are a few necessary assumptions that can't be checked (e.g. that the share of reported reviews violating the rules is identical). Hence, this is speculative and can't be backed by the data.
Nonetheless, **reporting unpleasant reviews seems like a good strategy for physicians to improve their ratings.**" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md(\n", " f\"**When a review gets removed, it stays removed in most cases**. Only {pct_readded:.1f}% of all removed reviews are re-published after five months or longer in our observation. This share differs by status. For reviews on non-premium profiles the share of re-added reviews is {pct_readded_by_status[0]:.1f}%, but for premium profiles it is only {pct_readded_by_status[1]:.1f}%. This could indicate that Jameda does not validate all reported reviews equally and thereby favors premium users. However, the number of observations is low and there are a few necessary assumptions that can't be checked (e.g. that the share of reported reviews violating the rules is identical). Hence, this is speculative and can't be backed by the data.
Nonetheless, **reporting unpleasant reviews seems like a good strategy for physicians to improve their ratings.**\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "This post was a sequel to the [original analysis]({filename}/jameda_part1.ipynb), which investigated whether Jameda favors its paying users. By collecting and analyzing new data over a period of about nine months, we were able to overcome the limitations of the first analysis. In particular, we focused on the removal of critical reviews from physicians' profiles. The main takeaways are these:\n", "\n", "1. Jameda has a lot of active users. Each week, several hundred new reviews are written by patients and published\n", "2. About 42% of the newly published reviews appear on the profiles of premium physicians\n", "3. Critical reviews can be reported by physicians and are removed until Jameda has validated them. Reviews on premium profiles make up only about 24% of the removed reviews\n", "4. Removed reviews overwhelmingly have poor ratings\n", "5. Removed reviews are rare. However, they almost exclusively target poor ratings, which are rare themselves. As such, removals can significantly alter a profile's overall rating\n", "6. On premium profiles, reviews with poor ratings are three times more likely to be removed than those on non-premium profiles\n", "7. Critical reviews on premium profiles are removed much faster than those on non-premium profiles\n", "8. Once deleted, a rating is very unlikely to ever be re-published. Only 7.2% of removed reviews were re-published after more than five months\n", "9. There is a weak hint that removed reviews on premium profiles are more likely to stay removed\n", "\n", "The last few points are the most relevant ones for answering our original question. **Differences in the total ratings of physicians on Jameda are not solely due to received ratings**. The removal of reviews plays a role as well: poor reviews are removed faster and more often from premium users' profiles. This is particularly impactful because a deleted review is very unlikely to ever be re-published. **As a result, at least some profiles will have inflated ratings (on average, those are premium profiles)**. This has serious consequences. Not only is the rating a strong signal for potential patients, but Jameda also uses it as the default sorting criterion in its search. Consequently, physicians with higher ratings will get more patients through the platform. \n", "While this might seem unfair, it's not easy to assign guilt. It's likely a consequence of the greater effort that premium users put into maintaining their good reputation on the platform. They are simply more inclined to report critical reviews. Also, it is not too far-fetched to believe that there are good reasons why at least some of the reviews are removed. While this is not active favoring of premium users, it certainly is a disadvantage for physicians who do not actively monitor their profiles. Those are usually the non-paying users. \n", "Jameda's main responsibility is to ensure that the reporting of reviews is not abused. This includes making sure that reviews are validated quickly and re-published if they don't violate any rules. It's questionable whether this is currently the case. Only a small share of removed reviews ever seems to be re-added. \n", "However, one must also give Jameda some credit. It's a very complicated matter of law to decide which reviews are permissible. 
Erring on the side of removal might be the safest strategy for them. Also, there have been some efforts to penalize abusive physicians (i.e. those who report reviews on baseless grounds): they can get their [quality seal retracted](https://www.jameda.de/qualitaetssicherung/bewertung-loeschen/). \n", "The most critical aspect is that Jameda must treat all reported reviews equally. If they don't, that would be a severe case of misconduct. Unfortunately, it might be next to impossible to tell \"from the outside\". \n", "To sum up: it would probably be too simple to judge Jameda harshly for this outcome. Nonetheless, it's clear that the outcome is suboptimal for at least some (potential) patients and physicians: **total ratings for doctors on the platform do not always represent the unfiltered aggregate feedback of patients**. Hence, patients' choices will be biased. On average, premium profiles benefit from this and non-premium profiles are at a disadvantage.\n", "

\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }