{ "cells": [ { "cell_type": "markdown", "id": "5451dfbe-903f-4700-a0e2-484d3f885b57", "metadata": {}, "source": [ "# Comparison of Proposed Vandalism Criteria with Revert Risk Scores\n", "\n", "[TASK: T349083](https://phabricator.wikimedia.org/T349083)\n", "\n", "➤ ***Please view this notebook on [nbviewer](https://nbviewer.org/github/wikimedia-research/moderator-tools-FY24/blob/main/%5BT349083%5D%20vandalism_criteria_comparision/vandal_criteria_revert_risk_comparision.ipynb)***\n", "\n", "To support baseline measurements for evaluating [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator), we want to develop criteria for identifying potential vandalism. In this analysis, the criteria are compared with the [revert risk scores](https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_revert_risk). Starting with an initial set of criteria, several dimensions are varied to see how each affects the median revert risk score by project, and how many edits each further restriction eliminates from consideration. 
The goal is to find a balance: a high median score without eliminating too many edits from consideration.\n", "\n", "**Initial criteria:**\n", "- Edits from accounts with 25 or fewer edits, or from anonymous users\n", "- Reverted by a different editor\n", "- Reverted within 24 hours\n", "- Edits in the content namespace\n", "\n", "**Dimensions considered:**\n", "- Time to revert\n", "- User edit count (for registered users)\n", "- Time since user's first revision (for registered users)\n", "- Time since user's previous revision (for registered users)\n", "- Time since previous revision on the page being edited\n", "- Absolute difference in bytes made by the revision\n", "\n", "## Summary\n", "Based on the analysis, the following additions/modifications can improve the median risk score:\n", "- Reverted within 12 hours\n", "- User edit count of 15 edits or fewer\n", "- Time since user's first edit is at most 48 hours\n", "- Absolute bytes difference is at least 5 bytes" ] }, { "cell_type": "code", "execution_count": 566, "id": "31f2f8c8-5923-4ca3-8f0b-337764e49908", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Changes in the Median Risk & Number of Edits
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
<table border=\"1\" class=\"dataframe\">\n", "<caption>Initial</caption>\n", "<thead>\n", "<tr><th></th><th>wiki_db</th><th>median_risk</th><th>n_edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>dewiki</td><td>0.901974</td><td>16829</td></tr>\n",
"<tr><th>1</th><td>enwiki</td><td>0.910679</td><td>172584</td></tr>\n",
"<tr><th>2</th><td>eswiki</td><td>0.922596</td><td>55105</td></tr>\n",
"<tr><th>3</th><td>fawiki</td><td>0.916366</td><td>9967</td></tr>\n",
"<tr><th>4</th><td>frwiki</td><td>0.903316</td><td>19375</td></tr>\n",
"<tr><th>5</th><td>idwiki</td><td>0.902464</td><td>3554</td></tr>\n",
"<tr><th>6</th><td>itwiki</td><td>0.919648</td><td>23440</td></tr>\n",
"<tr><th>7</th><td>jawiki</td><td>0.875682</td><td>10170</td></tr>\n",
"<tr><th>8</th><td>ptwiki</td><td>0.913064</td><td>3361</td></tr>\n",
"<tr><th>9</th><td>ruwiki</td><td>0.914291</td><td>23587</td></tr>\n",
"<tr><th>10</th><td>zhwiki</td><td>0.883454</td><td>7568</td></tr>\n",
"</tbody>\n", "</table>\n",
"<table border=\"1\" class=\"dataframe\">\n", "<caption>+ Reverted within 12 hours</caption>\n", "<thead>\n", "<tr><th></th><th>wiki_db</th><th>median_risk</th><th>n_edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>dewiki</td><td>0.904239</td><td>16077</td></tr>\n",
"<tr><th>1</th><td>enwiki</td><td>0.912205</td><td>162439</td></tr>\n",
"<tr><th>2</th><td>eswiki</td><td>0.923474</td><td>52922</td></tr>\n",
"<tr><th>3</th><td>fawiki</td><td>0.916792</td><td>9228</td></tr>\n",
"<tr><th>4</th><td>frwiki</td><td>0.905588</td><td>18401</td></tr>\n",
"<tr><th>5</th><td>idwiki</td><td>0.901994</td><td>3231</td></tr>\n",
"<tr><th>6</th><td>itwiki</td><td>0.921301</td><td>22077</td></tr>\n",
"<tr><th>7</th><td>jawiki</td><td>0.879789</td><td>9401</td></tr>\n",
"<tr><th>8</th><td>ptwiki</td><td>0.914363</td><td>3147</td></tr>\n",
"<tr><th>9</th><td>ruwiki</td><td>0.916403</td><td>22250</td></tr>\n",
"<tr><th>10</th><td>zhwiki</td><td>0.886989</td><td>6880</td></tr>\n",
"</tbody>\n", "</table>\n",
"<table border=\"1\" class=\"dataframe\">\n", "<caption>+ User Edit Count &lt;= 15 edits</caption>\n", "<thead>\n", "<tr><th></th><th>wiki_db</th><th>median_risk</th><th>n_edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>dewiki</td><td>0.904503</td><td>16061</td></tr>\n",
"<tr><th>1</th><td>enwiki</td><td>0.912847</td><td>160889</td></tr>\n",
"<tr><th>2</th><td>eswiki</td><td>0.923850</td><td>52696</td></tr>\n",
"<tr><th>3</th><td>fawiki</td><td>0.918056</td><td>9136</td></tr>\n",
"<tr><th>4</th><td>frwiki</td><td>0.906304</td><td>18285</td></tr>\n",
"<tr><th>5</th><td>idwiki</td><td>0.902892</td><td>3190</td></tr>\n",
"<tr><th>6</th><td>itwiki</td><td>0.921365</td><td>22011</td></tr>\n",
"<tr><th>7</th><td>jawiki</td><td>0.880116</td><td>9109</td></tr>\n",
"<tr><th>8</th><td>ptwiki</td><td>0.916916</td><td>3079</td></tr>\n",
"<tr><th>9</th><td>ruwiki</td><td>0.916746</td><td>22204</td></tr>\n",
"<tr><th>10</th><td>zhwiki</td><td>0.887588</td><td>6819</td></tr>\n",
"</tbody>\n", "</table>\n",
"<table border=\"1\" class=\"dataframe\">\n", "<caption>+ Time Since First Edit &lt;= 48 hrs</caption>\n", "<thead>\n", "<tr><th></th><th>wiki_db</th><th>median_risk</th><th>n_edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>dewiki</td><td>0.907555</td><td>15468</td></tr>\n",
"<tr><th>1</th><td>enwiki</td><td>0.915196</td><td>153858</td></tr>\n",
"<tr><th>2</th><td>eswiki</td><td>0.924792</td><td>51696</td></tr>\n",
"<tr><th>3</th><td>fawiki</td><td>0.920468</td><td>8539</td></tr>\n",
"<tr><th>4</th><td>frwiki</td><td>0.909034</td><td>17489</td></tr>\n",
"<tr><th>5</th><td>idwiki</td><td>0.905071</td><td>3067</td></tr>\n",
"<tr><th>6</th><td>itwiki</td><td>0.922709</td><td>21633</td></tr>\n",
"<tr><th>7</th><td>jawiki</td><td>0.882525</td><td>8828</td></tr>\n",
"<tr><th>8</th><td>ptwiki</td><td>0.930669</td><td>2458</td></tr>\n",
"<tr><th>9</th><td>ruwiki</td><td>0.918103</td><td>21661</td></tr>\n",
"<tr><th>10</th><td>zhwiki</td><td>0.890380</td><td>6481</td></tr>\n",
"</tbody>\n", "</table>\n",
"<table border=\"1\" class=\"dataframe\">\n", "<caption>+ Absolute Bytes Diff &gt;= 5 bytes</caption>\n", "<thead>\n", "<tr><th></th><th>wiki_db</th><th>median_risk</th><th>n_edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>dewiki</td><td>0.917214</td><td>11281</td></tr>\n",
"<tr><th>1</th><td>enwiki</td><td>0.920194</td><td>115997</td></tr>\n",
"<tr><th>2</th><td>eswiki</td><td>0.930483</td><td>39239</td></tr>\n",
"<tr><th>3</th><td>fawiki</td><td>0.924352</td><td>6734</td></tr>\n",
"<tr><th>4</th><td>frwiki</td><td>0.913709</td><td>13492</td></tr>\n",
"<tr><th>5</th><td>idwiki</td><td>0.910019</td><td>2361</td></tr>\n",
"<tr><th>6</th><td>itwiki</td><td>0.924533</td><td>15505</td></tr>\n",
"<tr><th>7</th><td>jawiki</td><td>0.883670</td><td>6679</td></tr>\n",
"<tr><th>8</th><td>ptwiki</td><td>0.934228</td><td>1855</td></tr>\n",
"<tr><th>9</th><td>ruwiki</td><td>0.923788</td><td>16914</td></tr>\n",
"<tr><th>10</th><td>zhwiki</td><td>0.896337</td><td>4813</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pr_centered('Changes in the Median Risk & Number of Edits', True)\n", "display_h(results)" ] }, { "cell_type": "markdown", "id": "d3b4e707-db23-4cc9-95fe-19285595e171", "metadata": {}, "source": [ "- Restricting user-related metrics yields only minor improvements to the median risk, as the majority of the reverted edits are made by anonymous users.\n", "- Requiring a minimum absolute bytes difference improves the median risk, but eliminates a substantial number of edits compared to the initial criteria.\n", "- Besides the time to revert, the absolute bytes difference is the only control factor available for anonymous edits.\n" ] }, { "cell_type": "markdown", "id": "1013433a-2909-4b7c-8325-ac05c07ed8ea", "metadata": { "tags": [] }, "source": [ "# Data-Gathering" ] }, { "cell_type": "code", "execution_count": 2, "id": "7bfb58a2-a2f9-4ddd-b622-7f8130c12dfd", "metadata": {}, "outputs": [], "source": [ "import random\n", "import warnings\n", "from datetime import datetime\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import wmfdata as wmf\n", "from IPython.display import HTML, clear_output, display, display_html\n", "\n", "pd.options.display.max_columns = None" ] }, { "cell_type": "code", "execution_count": 89, "id": "c806fdbd-1195-4d58-b54e-313dd35c8ced", "metadata": {}, "outputs": [], "source": [ "# import seaborn as sns\n", "# import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 180, "id": "0e5f4221-5d33-4abe-991d-e862d4d5e7f7", "metadata": {}, "outputs": [], "source": [ "# stop any Spark session that is still active before creating a new one\n", "spark_session = wmf.spark.get_active_session()\n", "\n", "if spark_session is not None:\n", "    spark_session.stop()\n", "else:\n", "    print('no active session')" ] }, { "cell_type": "code", "execution_count": 574, "id": 
"f24d2e1f-eebb-4b99-8dd8-fd5b8ade338f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
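\n", "<div>\n", "<p><b>SparkSession - hive</b></p>\n", "<div>\n", "<p><b>SparkContext</b></p>\n", "<p>Spark UI</p>\n", "<dl>\n", "<dt>Version</dt>\n", "<dd><code>v3.1.2</code></dd>\n", "<dt>Master</dt>\n", "<dd><code>yarn</code></dd>\n", "<dt>AppName</dt>\n", "<dd><code>vandal-criteria-comparision</code></dd>\n", "</dl>\n", "</div>\n", "</div>\n", "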
\n", " " ], "text/plain": [ "" ] }, "execution_count": 574, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark_session = wmf.spark.create_custom_session(\n", "    master=\"yarn\",\n", "    app_name='vandal-criteria-comparision',\n", "    spark_config={\n", "        \"spark.driver.memory\": \"6g\",\n", "        \"spark.dynamicAllocation.maxExecutors\": 64,\n", "        \"spark.executor.memory\": \"24g\",\n", "        \"spark.executor.cores\": 4,\n", "        \"spark.sql.shuffle.partitions\": 256,\n", "        \"spark.driver.maxResultSize\": \"2g\"\n", "    }\n", ")\n", "\n", "clear_output()\n", "\n", "spark_session.sparkContext.setLogLevel(\"ERROR\")\n", "spark_session" ] }, { "cell_type": "markdown", "id": "bd65d9cb-c8af-4a44-826e-85e999a3bc4f", "metadata": {}, "source": [ "## Query" ] }, { "cell_type": "code", "execution_count": 575, "id": "54c7c413-8b81-428f-95f7-b408b97d9544", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 0:> (0 + 1) / 1]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- rev_id: long (nullable = true)\n", " |-- wiki_db: string (nullable = true)\n", " |-- rev_timestamp: string (nullable = true)\n", " |-- revision_is_identity_reverted: boolean (nullable = true)\n", " |-- revision_seconds_to_identity_revert: long (nullable = true)\n", " |-- page_id: long (nullable = true)\n", " |-- revision_revert_risk: float (nullable = true)\n", " |-- user_is_anonymous: boolean (nullable = true)\n", " |-- user_is_bot: boolean (nullable = true)\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "rr_scores_path = '/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet'\n", "\n", "rr_scores = spark_session.read.parquet(rr_scores_path)\n", "rr_scores.createOrReplaceTempView('rr_scores')\n", "\n", "rr_scores.printSchema()" ] }, { "cell_type": "code", "execution_count": 8, "id": "7a5746a4-b4df-4c23-aa3e-6072d0ccf2dc", "metadata": {}, "outputs": [], "source": [ 
"mwh_snapshot = '2023-10'\n", "\n", "wikis_list = [f'{lang}wiki' for lang in ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']]\n", "wikis_sql = wmf.utils.sql_tuple(wikis_list)" ] }, { "cell_type": "code", "execution_count": 9, "id": "41ff9f3d-87d6-448c-a460-742878d55f7a", "metadata": {}, "outputs": [], "source": [ "# generate random dates within a given year (sampled independently, so duplicates are possible)\n", "\n", "def generate_random_dates(year, num_dates):\n", "    dates = []\n", "    for _ in range(num_dates):\n", "        month = random.randint(1, 12)\n", "        if month in [1, 3, 5, 7, 8, 10, 12]:\n", "            day = random.randint(1, 31)\n", "        elif month == 2:\n", "            day = random.randint(1, 28)\n", "        else:\n", "            day = random.randint(1, 30)\n", "\n", "        date = datetime(year, month, day)\n", "        dates.append(date.strftime(\"%Y-%m-%d\"))\n", "\n", "    return dates\n", "\n", "random_dates_2022 = generate_random_dates(2022, 30)\n", "random_dates_2022_sql = wmf.utils.sql_tuple(random_dates_2022)" ] }, { "cell_type": "code", "execution_count": 585, "id": "e3058fbc-4d4f-4ecf-9345-423fd62a5bcd", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 2]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.2 s, sys: 0 ns, total: 5.2 s\n", "Wall time: 5min 16s\n" ] } ], "source": [ "%%time\n", "\n", "query = f\"\"\"\n", "WITH \n", " base_criteria AS (\n", " SELECT\n", " mwh.wiki_db,\n", " rr.rev_id,\n", " revision_revert_risk AS risk,\n", " mwh.event_user_text AS user_name,\n", " event_timestamp AS rev_ts,\n", " event_user_is_anonymous AS is_anon,\n", " event_user_revision_count AS user_edit_count,\n", " COALESCE(event_user_registration_timestamp, event_user_creation_timestamp) AS user_reg_ts,\n", " event_user_first_edit_timestamp AS user_first_rev_ts,\n", " event_user_seconds_since_previous_revision AS time_user_prev_rev,\n", " page_seconds_since_previous_revision AS time_page_prev_rev,\n", " revision_text_bytes_diff AS rev_bytes_diff,\n", " 
mwh.revision_seconds_to_identity_revert AS time_to_revert,\n", " revision_text_bytes AS rev_bytes,\n", " revision_is_identity_revert AS reverting_edit,\n", " revision_first_identity_reverting_revision_id AS reverting_edit_id\n", " FROM \n", " rr_scores rr\n", " JOIN \n", " wmf.mediawiki_history mwh \n", " ON rr.wiki_db = mwh.wiki_db AND rr.rev_id = mwh.revision_id\n", " WHERE \n", " snapshot = '{mwh_snapshot}'\n", " AND rr.wiki_db IN {wikis_sql}\n", " AND event_entity = 'revision'\n", " AND event_type = 'create'\n", " AND DATE(event_timestamp) IN {random_dates_2022_sql}\n", " AND page_namespace_is_content\n", " AND (event_user_is_anonymous OR event_user_revision_count <= 250)\n", " AND SIZE(event_user_is_bot_by_historical) = 0\n", " AND mwh.revision_is_identity_reverted\n", " AND mwh.revision_seconds_to_identity_revert <= 3*24*60*60\n", " )\n", " \n", "\n", "SELECT\n", " bc.*,\n", " mwh.event_user_is_anonymous AS reverting_user_is_anon,\n", " mwh.event_user_revision_count AS reverting_user_edit_count,\n", " mwh.event_user_first_edit_timestamp AS reverting_user_first_rev_ts,\n", " mwh.revision_is_identity_reverted AS is_revert_reverted,\n", " mwh.revision_seconds_to_identity_revert AS revert_time_to_revert\n", "FROM \n", " base_criteria bc\n", "JOIN\n", " wmf.mediawiki_history mwh\n", " ON bc.wiki_db = mwh.wiki_db AND bc.reverting_edit_id = mwh.revision_id\n", "WHERE\n", " snapshot = '{mwh_snapshot}'\n", " AND NOT bc.user_name = mwh.event_user_text\n", "\"\"\"\n", "\n", "edits = wmf.spark.run(query)" ] }, { "cell_type": "code", "execution_count": 586, "id": "d1f360de-a02c-4ca0-ad9f-05c6a79b01cf", "metadata": {}, "outputs": [], "source": [ "edits = (\n", " edits\n", " .assign(\n", " rev_ts=pd.to_datetime(edits['rev_ts'], utc=True),\n", " user_reg_ts=pd.to_datetime(edits['user_reg_ts'], utc=True),\n", " user_first_rev_ts=pd.to_datetime(edits['user_first_rev_ts'], utc=True),\n", " reverting_user_first_rev_ts=pd.to_datetime(edits['reverting_user_first_rev_ts'], 
utc=True),\n", " is_anon=pd.Categorical(edits['is_anon']),\n", " reverting_user_is_anon=pd.Categorical(edits['reverting_user_is_anon']),\n", " is_revert_reverted=pd.Categorical(edits['is_revert_reverted'])\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 587, "id": "a71392f0-9a22-4a01-a325-b204304e10a1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 391096 entries, 0 to 391095\n", "Data columns (total 21 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 wiki_db 391096 non-null object \n", " 1 rev_id 391096 non-null int64 \n", " 2 risk 391096 non-null float32 \n", " 3 user_name 391096 non-null object \n", " 4 rev_ts 391096 non-null datetime64[ns, UTC]\n", " 5 is_anon 391096 non-null category \n", " 6 user_edit_count 92095 non-null float64 \n", " 7 user_reg_ts 92053 non-null datetime64[ns, UTC]\n", " 8 user_first_rev_ts 92095 non-null datetime64[ns, UTC]\n", " 9 time_user_prev_rev 75259 non-null float64 \n", " 10 time_page_prev_rev 391096 non-null int64 \n", " 11 rev_bytes_diff 387254 non-null float64 \n", " 12 time_to_revert 391096 non-null int64 \n", " 13 rev_bytes 387436 non-null float64 \n", " 14 reverting_edit 391096 non-null bool \n", " 15 reverting_edit_id 391096 non-null int64 \n", " 16 reverting_user_is_anon 391096 non-null category \n", " 17 reverting_user_edit_count 376252 non-null float64 \n", " 18 reverting_user_first_rev_ts 376252 non-null datetime64[ns, UTC]\n", " 19 is_revert_reverted 391096 non-null category \n", " 20 revert_time_to_revert 48583 non-null float64 \n", "dtypes: bool(1), category(3), datetime64[ns, UTC](4), float32(1), float64(6), int64(4), object(2)\n", "memory usage: 50.7+ MB\n" ] } ], "source": [ "edits.info()" ] }, { "cell_type": "markdown", "id": "a2b680e6-afb1-4da3-b3d4-aa79746b0a33", "metadata": {}, "source": [ "# Analysis" ] }, { "cell_type": "markdown", "id": "5691279e-683d-41a7-abb0-ad092bda57f5", "metadata": {}, 
"source": [ "## Functions" ] }, { "cell_type": "code", "execution_count": 409, "id": "8c0f36bf-2183-42a7-9c6c-fa7690679ce1", "metadata": {}, "outputs": [], "source": [ "# prints a string at center of the output, bold if needed\n", "def pr_centered(content, bold=False):\n", " if bold:\n", " content = f\"{content}\"\n", " \n", " centered_html = f\"
{content}
\"\n", " \n", " display(HTML(centered_html))\n", "\n", "\n", "# display dataframes horizontally with title for each\n", "def display_h(frames, space=100):\n", " html = \"\"\n", " \n", " for key in frames.keys():\n", " html_df =f'
{key} {frames[key]._repr_html_()}
'\n", " html += html_df\n", " \n", " html = f\"\"\"\n", "
\n", " {html}\n", "
\"\"\"\n", " \n", " display_html(html, raw=True)" ] }, { "cell_type": "code", "execution_count": 503, "id": "ec036262-2f85-4be4-88c1-a1f8933b68e5", "metadata": {}, "outputs": [], "source": [ "def calculate_grouped(df, intervals, pivot_column, columns_title=None, column_names=None, target_column='risk', group_column='wiki_db', grp_function='median'):\n", "\n", "    final_results = []\n", "\n", "    for interval in intervals:\n", "\n", "        # unlike the temporal columns, the bytes difference must be at least the given value;\n", "        # note: this overwrites the column in df with its absolute values\n", "        if pivot_column == 'rev_bytes_diff':\n", "            df[pivot_column] = df[pivot_column].abs()\n", "            filtered_df = df[df[pivot_column] >= interval]\n", "        else:\n", "            filtered_df = df[df[pivot_column] <= interval]\n", "\n", "        grouped = filtered_df.groupby(group_column).agg({target_column: grp_function}).reset_index()\n", "\n", "        grouped['interval'] = interval\n", "        final_results.append(grouped)\n", "\n", "    concatenated_df = pd.concat(final_results)\n", "    pivot_df = concatenated_df.pivot(index=group_column, columns='interval', values=target_column)\n", "\n", "    if columns_title is None:\n", "        pivot_df.columns.name = f'median: {pivot_column}'\n", "    else:\n", "        pivot_df.columns.name = f'median: {columns_title}'\n", "\n", "    if column_names is not None:\n", "        pivot_df.columns = column_names\n", "\n", "    return pivot_df\n", "\n", "# def plot_hmap(df, x_label, title, fontsize=10, y_label='Wikipedia', cbar_label='Median Risk'):\n", "\n", "#     ax = sns.heatmap(df, annot=True, annot_kws={\"size\": fontsize})\n", "\n", "#     # set labels\n", "#     ax.set_xlabel(x_label, fontsize=fontsize)\n", "#     ax.set_ylabel(y_label, fontsize=fontsize)\n", "#     ax.set_title(title, fontsize=fontsize + 1)\n", "\n", "#     # color bar properties\n", "#     cbar = ax.collections[0].colorbar\n", "#     cbar.set_label(cbar_label, fontsize=fontsize)\n", "#     cbar.ax.tick_params(labelsize=fontsize)\n", "\n", "#     plt.show()\n", "\n", "# seconds elapsed between two datetime columns (NaN where either timestamp is missing)\n", "def time_delta(df, start_column, end_column):\n", "    try:\n", "        return (df[end_column] - df[start_column]).dt.total_seconds()\n", "    except Exception:\n", "        return np.nan" ] }, { "cell_type": "markdown", "id": "759f2ed7-d168-4ced-9f48-18b21c6f6e48", "metadata": {}, "source": [ "## Initial Criteria" ] }, { "cell_type": "code", "execution_count": null, "id": "c72ae52d-e1d9-48e5-8679-1e08a6de6c9d", "metadata": {}, "outputs": [], "source": [ "init_criteria = edits.query(\"\"\"(time_to_revert <= 24*60*60) & ((is_anon == True) | (user_edit_count <= 25))\"\"\")\n", "\n", "init_criteria = (\n", "    init_criteria\n", "    .assign(\n", "        elapsed_reg=time_delta(init_criteria, 'user_reg_ts', 'rev_ts'),\n", "        elapsed_first_rev=time_delta(init_criteria, 'user_first_rev_ts', 'rev_ts'),\n", "        rv_user_elapsed_first_rev=time_delta(init_criteria, 'reverting_user_first_rev_ts', 'rev_ts')\n", "    )\n", ")" ] }, { "cell_type": "code", "execution_count": 589, "id": "1785a25a-6880-4c9c-a76d-ae9eadd5a2ae", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>wiki_db</th><th>median_risk</th><th>n_edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>dewiki</td><td>0.901974</td><td>16829</td></tr>\n",
"<tr><th>1</th><td>enwiki</td><td>0.910679</td><td>172584</td></tr>\n",
"<tr><th>2</th><td>eswiki</td><td>0.922596</td><td>55105</td></tr>\n",
"<tr><th>3</th><td>fawiki</td><td>0.916366</td><td>9967</td></tr>\n",
"<tr><th>4</th><td>frwiki</td><td>0.903316</td><td>19375</td></tr>\n",
"<tr><th>5</th><td>idwiki</td><td>0.902464</td><td>3554</td></tr>\n",
"<tr><th>6</th><td>itwiki</td><td>0.919648</td><td>23440</td></tr>\n",
"<tr><th>7</th><td>jawiki</td><td>0.875682</td><td>10170</td></tr>\n",
"<tr><th>8</th><td>ptwiki</td><td>0.913064</td><td>3361</td></tr>\n",
"<tr><th>9</th><td>ruwiki</td><td>0.914291</td><td>23587</td></tr>\n",
"<tr><th>10</th><td>zhwiki</td><td>0.883454</td><td>7568</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>\n", "
" ], "text/plain": [ " wiki_db median_risk n_edits\n", "0 dewiki 0.901974 16829\n", "1 enwiki 0.910679 172584\n", "2 eswiki 0.922596 55105\n", "3 fawiki 0.916366 9967\n", "4 frwiki 0.903316 19375\n", "5 idwiki 0.902464 3554\n", "6 itwiki 0.919648 23440\n", "7 jawiki 0.875682 10170\n", "8 ptwiki 0.913064 3361\n", "9 ruwiki 0.914291 23587\n", "10 zhwiki 0.883454 7568" ] }, "execution_count": 589, "metadata": {}, "output_type": "execute_result" } ], "source": [ "init_criteria_risk = (\n", " init_criteria\n", " .groupby('wiki_db')\n", " .agg({\n", " 'risk': 'median', \n", " 'rev_id': 'count'\n", " })\n", " .reset_index()\n", " .rename({\n", " 'rev_id': 'n_edits', \n", " 'risk': 'median_risk'\n", " }, axis=1)\n", ")\n", "\n", "init_criteria_risk" ] }, { "cell_type": "markdown", "id": "88270901-6d56-4933-a087-1a0a07875c09", "metadata": {}, "source": [ "## Time to Revert" ] }, { "cell_type": "code", "execution_count": 434, "id": "a7c2f9cb-d5a9-4cde-a4fd-2428742c0557", "metadata": {}, "outputs": [], "source": [ "ttr_hour_intervals = [1, 2, 4, 8, 12, 24]\n", "ttr_time_intervals = [i*60*60 for i in ttr_hour_intervals]\n", "ttr_column_names = [f'{i} hr' for i in ttr_hour_intervals]\n", "\n", "ttr_median_risk = calculate_grouped(init_criteria, ttr_time_intervals, \n", " 'time_to_revert', column_names=ttr_column_names)\n", "ttr_interval_counts = calculate_grouped(init_criteria, ttr_time_intervals, \n", " 'time_to_revert', column_names=ttr_column_names, grp_function = 'count')" ] }, { "cell_type": "code", "execution_count": 457, "id": "0784ab64-9003-4ef1-bf41-25b447bf0940", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
<table>\n", "<caption>Median Risk</caption>\n", "<thead>\n", "<tr><th>wiki_db</th><th>1 hr</th><th>2 hr</th><th>4 hr</th><th>8 hr</th><th>12 hr</th><th>24 hr</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>dewiki</th><td>0.910</td><td>0.908</td><td>0.906</td><td>0.905</td><td>0.904</td><td>0.902</td></tr>\n",
"<tr><th>enwiki</th><td>0.920</td><td>0.918</td><td>0.915</td><td>0.913</td><td>0.912</td><td>0.911</td></tr>\n",
"<tr><th>eswiki</th><td>0.928</td><td>0.926</td><td>0.925</td><td>0.924</td><td>0.923</td><td>0.923</td></tr>\n",
"<tr><th>fawiki</th><td>0.918</td><td>0.918</td><td>0.918</td><td>0.917</td><td>0.917</td><td>0.916</td></tr>\n",
"<tr><th>frwiki</th><td>0.913</td><td>0.911</td><td>0.909</td><td>0.907</td><td>0.906</td><td>0.903</td></tr>\n",
"<tr><th>idwiki</th><td>0.900</td><td>0.901</td><td>0.899</td><td>0.902</td><td>0.902</td><td>0.902</td></tr>\n",
"<tr><th>itwiki</th><td>0.927</td><td>0.926</td><td>0.924</td><td>0.922</td><td>0.921</td><td>0.920</td></tr>\n",
"<tr><th>jawiki</th><td>0.894</td><td>0.891</td><td>0.886</td><td>0.882</td><td>0.880</td><td>0.876</td></tr>\n",
"<tr><th>ptwiki</th><td>0.920</td><td>0.918</td><td>0.917</td><td>0.916</td><td>0.914</td><td>0.913</td></tr>\n",
"<tr><th>ruwiki</th><td>0.926</td><td>0.923</td><td>0.920</td><td>0.918</td><td>0.916</td><td>0.914</td></tr>\n",
"<tr><th>zhwiki</th><td>0.896</td><td>0.893</td><td>0.891</td><td>0.889</td><td>0.887</td><td>0.883</td></tr>\n",
"</tbody>\n", "</table>\n",
"<table border=\"1\" class=\"dataframe\">\n", "<caption>Number of Edits</caption>\n", "<thead>\n", "<tr><th>wiki_db</th><th>1 hr</th><th>2 hr</th><th>4 hr</th><th>8 hr</th><th>12 hr</th><th>24 hr</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>dewiki</th><td>13349</td><td>14151</td><td>14974</td><td>15661</td><td>16077</td><td>16829</td></tr>\n",
"<tr><th>enwiki</th><td>114940</td><td>128218</td><td>141591</td><td>155008</td><td>162439</td><td>172584</td></tr>\n",
"<tr><th>eswiki</th><td>42487</td><td>45577</td><td>48656</td><td>51468</td><td>52922</td><td>55105</td></tr>\n",
"<tr><th>fawiki</th><td>6798</td><td>7417</td><td>8123</td><td>8816</td><td>9228</td><td>9967</td></tr>\n",
"<tr><th>frwiki</th><td>14078</td><td>15335</td><td>16506</td><td>17687</td><td>18401</td><td>19375</td></tr>\n",
"<tr><th>idwiki</th><td>1662</td><td>2070</td><td>2550</td><td>3006</td><td>3231</td><td>3554</td></tr>\n",
"<tr><th>itwiki</th><td>16739</td><td>18198</td><td>19752</td><td>21189</td><td>22077</td><td>23440</td></tr>\n",
"<tr><th>jawiki</th><td>6351</td><td>7245</td><td>8150</td><td>8943</td><td>9401</td><td>10170</td></tr>\n",
"<tr><th>ptwiki</th><td>2081</td><td>2347</td><td>2686</td><td>2985</td><td>3147</td><td>3361</td></tr>\n",
"<tr><th>ruwiki</th><td>15851</td><td>17794</td><td>19570</td><td>21248</td><td>22250</td><td>23587</td></tr>\n",
"<tr><th>zhwiki</th><td>4071</td><td>4823</td><td>5637</td><td>6446</td><td>6880</td><td>7568</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", "    'Median Risk': ttr_median_risk.style.background_gradient(cmap='viridis_r').format(\"{:.3f}\"),\n", "    'Number of Edits': ttr_interval_counts\n", "})" ] }, { "cell_type": "markdown", "id": "20d197eb-630e-4b46-8498-888644874a32", "metadata": {}, "source": [ "Limiting to an 8 hr window provides a slight improvement (on enwiki, the median risk rises from 0.911 at 24 hr to 0.913) while retaining most edits (155008 of 172584 on enwiki)." ] }, { "cell_type": "markdown", "id": "fcd1397d-ce09-4d02-aa01-b5fe2af0ef20", "metadata": {}, "source": [ "## User Edit Count" ] }, { "cell_type": "code", "execution_count": 439, "id": "fdd8b535-c4ff-4aef-b575-c40e0762b8b9", "metadata": {}, "outputs": [], "source": [ "edit_count_intervals = [5, 10, 15, 20, 25]\n", "edit_count_column_names = [f'{i} edits' for i in edit_count_intervals]\n", "\n", "edit_count_median_risk = calculate_grouped(init_criteria, edit_count_intervals, \n", "    'user_edit_count', column_names=edit_count_column_names)\n", "edit_count_interval_counts = calculate_grouped(init_criteria, edit_count_intervals, \n", "    'user_edit_count', column_names=edit_count_column_names, grp_function='count')" ] }, { "cell_type": "code", "execution_count": 456, "id": "b3df9e0b-10b3-4696-b80e-f8ee0c2a494c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
<table>\n", "<caption>Median Risk</caption>\n", "<thead>\n", "<tr><th>wiki_db</th><th>5 edits</th><th>10 edits</th><th>15 edits</th><th>20 edits</th><th>25 edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>dewiki</th><td>0.900</td><td>0.892</td><td>0.885</td><td>0.880</td><td>0.876</td></tr>\n",
"<tr><th>enwiki</th><td>0.924</td><td>0.918</td><td>0.913</td><td>0.910</td><td>0.908</td></tr>\n",
"<tr><th>eswiki</th><td>0.936</td><td>0.931</td><td>0.929</td><td>0.926</td><td>0.924</td></tr>\n",
"<tr><th>fawiki</th><td>0.929</td><td>0.923</td><td>0.920</td><td>0.915</td><td>0.911</td></tr>\n",
"<tr><th>frwiki</th><td>0.919</td><td>0.910</td><td>0.903</td><td>0.897</td><td>0.893</td></tr>\n",
"<tr><th>idwiki</th><td>0.906</td><td>0.893</td><td>0.891</td><td>0.889</td><td>0.886</td></tr>\n",
"<tr><th>itwiki</th><td>0.921</td><td>0.914</td><td>0.908</td><td>0.905</td><td>0.904</td></tr>\n",
"<tr><th>jawiki</th><td>0.920</td><td>0.916</td><td>0.909</td><td>0.904</td><td>0.899</td></tr>\n",
"<tr><th>ptwiki</th><td>0.923</td><td>0.919</td><td>0.917</td><td>0.914</td><td>0.913</td></tr>\n",
"<tr><th>ruwiki</th><td>0.921</td><td>0.914</td><td>0.910</td><td>0.906</td><td>0.903</td></tr>\n",
"<tr><th>zhwiki</th><td>0.909</td><td>0.903</td><td>0.900</td><td>0.897</td><td>0.892</td></tr>\n",
"</tbody>\n", "</table>\n",
"<table border=\"1\" class=\"dataframe\">\n", "<caption>Number of Edits</caption>\n", "<thead>\n", "<tr><th>wiki_db</th><th>5 edits</th><th>10 edits</th><th>15 edits</th><th>20 edits</th><th>25 edits</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>dewiki</th><td>1560</td><td>1893</td><td>2086</td><td>2200</td><td>2301</td></tr>\n",
"<tr><th>enwiki</th><td>22503</td><td>28206</td><td>31350</td><td>33433</td><td>34986</td></tr>\n",
"<tr><th>eswiki</th><td>3898</td><td>4866</td><td>5345</td><td>5655</td><td>5851</td></tr>\n",
"<tr><th>fawiki</th><td>1227</td><td>1702</td><td>2018</td><td>2259</td><td>2442</td></tr>\n",
"<tr><th>frwiki</th><td>2398</td><td>2944</td><td>3252</td><td>3463</td><td>3611</td></tr>\n",
"<tr><th>idwiki</th><td>268</td><td>383</td><td>443</td><td>495</td><td>527</td></tr>\n",
"<tr><th>itwiki</th><td>1230</td><td>1514</td><td>1647</td><td>1745</td><td>1820</td></tr>\n",
"<tr><th>jawiki</th><td>1342</td><td>1889</td><td>2239</td><td>2484</td><td>2691</td></tr>\n",
"<tr><th>ptwiki</th><td>2345</td><td>2848</td><td>3079</td><td>3236</td><td>3361</td></tr>\n",
"<tr><th>ruwiki</th><td>1971</td><td>2373</td><td>2594</td><td>2737</td><td>2833</td></tr>\n",
"<tr><th>zhwiki</th><td>863</td><td>1201</td><td>1393</td><td>1510</td><td>1587</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", "    'Median Risk': edit_count_median_risk.style.background_gradient(cmap='viridis_r').format(\"{:.3f}\"),\n", "    'Number of Edits': edit_count_interval_counts\n", "})" ] }, { "cell_type": "markdown", "id": "678c9879-762e-4e57-b288-c77267901ade", "metadata": {}, "source": [ "Limiting to 15 edits slightly improves the scores without eliminating a lot of edits." ] }, { "cell_type": "markdown", "id": "0953fa80-d3dd-4461-88e8-74a90b9b2491", "metadata": {}, "source": [ "## Time Since User Registration" ] }, { "cell_type": "code", "execution_count": 591, "id": "c13b8c20-39cd-48ad-80e5-e2ccec28be2f", "metadata": {}, "outputs": [], "source": [ "non_anon = init_criteria.query(\"\"\"is_anon == False\"\"\").reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 443, "id": "811998da-7e61-471e-b0d3-d188f283a308", "metadata": {}, "outputs": [], "source": [ "elapsed_reg_minutes = [1, 5, 30]\n", "elapsed_reg_hours = [1, 2, 4, 12, 24, 48, 72]\n", "# the final interval is the maximum elapsed time (already in seconds), i.e. no restriction\n", "elapsed_reg_time_intervals = [i*60 for i in elapsed_reg_minutes] + [i*60*60 for i in elapsed_reg_hours] + [non_anon.elapsed_reg.max()]\n", "\n", "elapsed_reg_column_names = [f'{i} min' for i in elapsed_reg_minutes] + [f'{i} hr' for i in elapsed_reg_hours] + ['max']\n", "\n", "elapsed_reg_median_risk = calculate_grouped(non_anon, elapsed_reg_time_intervals, \n", "    'elapsed_reg', column_names=elapsed_reg_column_names)\n", "elapsed_reg_interval_counts = calculate_grouped(non_anon, elapsed_reg_time_intervals, \n", "    'elapsed_reg', column_names=elapsed_reg_column_names, grp_function='count')" ] }, { "cell_type": "code", "execution_count": 484, "id": "94f582b0-384d-490c-90a9-9d87b7b057f4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
<table>\n", "<caption>Median Risk</caption>\n", "<thead>\n", "<tr><th>wiki_db</th><th>1 min</th><th>5 min</th><th>30 min</th><th>1 hr</th><th>2 hr</th><th>4 hr</th><th>12 hr</th><th>24 hr</th><th>48 hr</th><th>72 hr</th><th>max</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>dewiki</th><td>0.953</td><td>0.936</td><td>0.930</td><td>0.930</td><td>0.929</td><td>0.929</td><td>0.928</td><td>0.927</td><td>0.926</td><td>0.924</td><td>0.876</td></tr>\n",
"<tr><th>enwiki</th><td>0.937</td><td>0.941</td><td>0.938</td><td>0.936</td><td>0.935</td><td>0.934</td><td>0.934</td><td>0.933</td><td>0.932</td><td>0.931</td><td>0.908</td></tr>\n",
"<tr><th>eswiki</th><td>0.962</td><td>0.949</td><td>0.946</td><td>0.945</td><td>0.944</td><td>0.944</td><td>0.943</td><td>0.943</td><td>0.940</td><td>0.939</td><td>0.924</td></tr>\n",
"<tr><th>fawiki</th><td>0.929</td><td>0.943</td><td>0.944</td><td>0.944</td><td>0.943</td><td>0.943</td><td>0.942</td><td>0.940</td><td>0.938</td><td>0.936</td><td>0.911</td></tr>\n",
"<tr><th>frwiki</th><td>0.945</td><td>0.940</td><td>0.933</td><td>0.931</td><td>0.930</td><td>0.930</td><td>0.930</td><td>0.930</td><td>0.928</td><td>0.927</td><td>0.893</td></tr>\n",
"<tr><th>idwiki</th><td>0.932</td><td>0.934</td><td>0.918</td><td>0.915</td><td>0.909</td><td>0.909</td><td>0.909</td><td>0.911</td><td>0.906</td><td>0.905</td><td>0.886</td></tr>\n",
"<tr><th>itwiki</th><td>0.968</td><td>0.947</td><td>0.939</td><td>0.940</td><td>0.940</td><td>0.940</td><td>0.938</td><td>0.938</td><td>0.935</td><td>0.935</td><td>0.904</td></tr>\n",
"<tr><th>jawiki</th><td>0.950</td><td>0.935</td><td>0.928</td><td>0.926</td><td>0.924</td><td>0.923</td><td>0.921</td><td>0.922</td><td>0.921</td><td>0.919</td><td>0.899</td></tr>\n",
"<tr><th>ptwiki</th><td>0.918</td><td>0.935</td><td>0.937</td><td>0.936</td><td>0.935</td><td>0.935</td><td>0.935</td><td>0.934</td><td>0.934</td><td>0.933</td><td>0.913</td></tr>\n",
"<tr><th>ruwiki</th><td>0.942</td><td>0.939</td><td>0.934</td><td>0.934</td><td>0.934</td><td>0.934</td><td>0.932</td><td>0.931</td><td>0.929</td><td>0.928</td><td>0.903</td></tr>\n",
"<tr><th>zhwiki</th><td>0.966</td><td>0.938</td><td>0.932</td><td>0.933</td><td>0.932</td><td>0.932</td><td>0.928</td><td>0.928</td><td>0.922</td><td>0.920</td><td>0.892</td></tr>\n",
"</tbody>\n", "</table>\n", "
Number of Edits
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1 min5 min30 min1 hr2 hr4 hr12 hr24 hr48 hr72 hrmax
wiki_db
dewiki403508559951050109411491191124712952299
enwiki7946181147571682618287192132033521605225812317034976
eswiki1591278285531983459357837333981421243385851
fawiki82477729361038110311641258133113702442
frwiki53652155717661906199520832160226023183611
idwiki856153194230242246270284296527
itwiki694348339251004104810971152121212271820
jawiki99645158217261817187619502016207521112691
ptwiki22673169719462104218522612319237024103361
ruwiki46544135315311627169017511841190719392833
zhwiki21238653765794848910964104410721587
\n", "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", " 'Median Risk': elapsed_reg_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n", " 'Number of Edits': elapsed_reg_interval_counts\n", "})" ] }, { "cell_type": "markdown", "id": "300da015-79da-4684-92fe-2425180957ea", "metadata": {}, "source": [ "Limiting to 48 hr window significantly improves the scores. However, this only when registered users are considered." ] }, { "cell_type": "markdown", "id": "0acac942-4607-412e-bc82-5b84d69f67cc", "metadata": { "tags": [] }, "source": [ "## Time Since User First Revision" ] }, { "cell_type": "code", "execution_count": 459, "id": "d775040f-910a-4bf7-8585-4b3d4f7b95a4", "metadata": {}, "outputs": [], "source": [ "elapsed_first_rev_minutes = [1, 5, 30]\n", "elapsed_first_rev_hours = [1, 2, 4, 12, 24, 48, 72, non_anon.elapsed_first_rev.max()/60*60]\n", "elapsed_first_rev_time_intervals = [i*60 for i in elapsed_first_rev_minutes] + [i*60*60 for i in elapsed_first_rev_hours]\n", "\n", "elapsed_first_rev_column_names = [f'{i} min' for i in elapsed_first_rev_minutes] + [f'{i} hr' if i<=72 else 'max' for i in elapsed_first_rev_hours]\n", "\n", "elapsed_first_rev_median_risk = calculate_grouped(non_anon, elapsed_first_rev_time_intervals, \n", " 'elapsed_first_rev', column_names=elapsed_first_rev_column_names)\n", "elapsed_first_rev_counts = calculate_grouped(non_anon, elapsed_first_rev_time_intervals, \n", " 'elapsed_first_rev', column_names=elapsed_first_rev_column_names, grp_function='count')" ] }, { "cell_type": "code", "execution_count": 485, "id": "ffb1a034-f114-49ee-93ea-1aeb92221fbd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
Median Risk \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 1 min5 min30 min1 hr2 hr4 hr12 hr24 hr48 hr72 hrmax
wiki_db           
dewiki0.9160.9170.9170.9160.9150.9140.9130.9130.9100.9080.876
enwiki0.9300.9330.9320.9310.9300.9300.9290.9280.9260.9250.908
eswiki0.9390.9410.9410.9410.9400.9400.9390.9390.9370.9360.924
fawiki0.9320.9340.9390.9380.9370.9370.9340.9320.9300.9300.911
frwiki0.9240.9270.9270.9260.9250.9250.9250.9230.9210.9210.893
idwiki0.9160.9150.9070.9050.9040.9040.9030.9030.9010.8960.886
itwiki0.9320.9320.9310.9340.9340.9320.9310.9290.9290.9280.904
jawiki0.9350.9310.9250.9240.9220.9210.9170.9170.9160.9150.899
ptwiki0.9250.9300.9320.9320.9320.9320.9310.9310.9300.9290.913
ruwiki0.9250.9290.9270.9270.9280.9270.9270.9260.9250.9240.903
zhwiki0.9030.9200.9250.9260.9240.9220.9210.9210.9170.9120.892
\n", "
Number of Edits
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1 min5 min30 min1 hr2 hr4 hr12 hr24 hr48 hr72 hrmax
wiki_db
dewiki763937122213001338137514241460152715682301
enwiki991713471190622054821640223552326124472254052593334986
eswiki17662432340636543848397440924321452446245851
fawiki44460998911251199126113191469153715702442
frwiki11071464198121172222228423672438253225733611
idwiki111143215268289300320341355376527
itwiki570770102111001134117512291283132913411820
jawiki6701181189319772056210121762206225522792691
ptwiki10481375201722142323240624822536258726243361
ruwiki9031196171618151874192219652049211521482833
zhwiki33650179087391396110011052113011821587
\n", "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", " 'Median Risk': elapsed_first_rev_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n", " 'Number of Edits': elapsed_first_rev_counts\n", "})" ] }, { "cell_type": "markdown", "id": "b00e726e-4deb-4bcb-b7ee-5e9405cb42bb", "metadata": {}, "source": [ "- Limiting to 48 hr window significantly improves the scores. However, this only when registered users are considered.\n", "- The impact is similar to that of time since user registration, however, time since user's frist edit eliminates less number of edits compared to user registration." ] }, { "cell_type": "markdown", "id": "3049a57c-a4cd-4848-b562-505efe5be05d", "metadata": { "tags": [] }, "source": [ "## Time Since User's Previous Edit" ] }, { "cell_type": "code", "execution_count": 506, "id": "9acd2efe-f1df-4a81-a516-9231c63fe9bb", "metadata": {}, "outputs": [], "source": [ "time_user_prev_rev_minutes = [1, 5, 15, 30, 60, 120, non_anon.time_user_prev_rev.max()/60]\n", "time_user_prev_rev_time_intervals = [i*60 for i in time_user_prev_rev_minutes]\n", "\n", "time_user_prev_rev_column_names = [f'{i} min' if i<=120 else 'max' for i in time_user_prev_rev_minutes]\n", "\n", "time_user_prev_rev_median_risk = calculate_grouped(non_anon, time_user_prev_rev_time_intervals, \n", " 'time_user_prev_rev', column_names=time_user_prev_rev_column_names)\n", "time_user_prev_rev_counts = calculate_grouped(non_anon, time_user_prev_rev_time_intervals, \n", " 'time_user_prev_rev', column_names=time_user_prev_rev_column_names, grp_function='count')" ] }, { "cell_type": "code", "execution_count": 507, "id": "f05af18d-002a-4c41-b99b-572f1ee09221", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
Median Risk \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 1 min5 min15 min30 min60 min120 minmax
wiki_db       
dewiki0.8930.8890.8830.8820.8790.8790.848
enwiki0.9160.9150.9110.9100.9090.9090.897
eswiki0.9310.9290.9260.9260.9260.9260.917
fawiki0.9180.9160.9140.9130.9120.9120.905
frwiki0.9250.9060.9030.9010.9000.8970.877
idwiki0.8850.8850.8840.8830.8840.8840.879
itwiki0.9430.9170.9090.9070.9050.9060.894
jawiki0.9270.9150.9100.9080.9080.9070.897
ptwiki0.9310.9250.9220.9210.9210.9210.908
ruwiki0.9040.9110.9090.9080.9070.9080.894
zhwiki0.8970.9090.9060.9040.9030.9020.890
\n", "
Number of Edits
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1 min5 min15 min30 min60 min120 minmax
wiki_db
dewiki189658891960100310311571
enwiki4042133451750618766195142004025824
eswiki700234430363242334734294237
fawiki268104613781470151215332020
frwiki309121716531785187619372574
idwiki55211289312327335421
itwiki22467189695298710101307
jawiki1045171619201992201320352266
ptwiki338123516561774183318672371
ruwiki24499213321423148115121977
zhwiki237722927980101110311277
\n", "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", " 'Median Risk': time_user_prev_rev_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n", " 'Number of Edits': time_user_prev_rev_counts\n", "})" ] }, { "cell_type": "markdown", "id": "ec32ee06-b272-46b7-b18a-c24ac6c6ff8c", "metadata": {}, "source": [ "While resitricting improves the score, a susbsantial number of edits will be elimated for no significant benefit." ] }, { "cell_type": "markdown", "id": "b459d7cc-d95f-4704-bf0d-61fe9fe2609e", "metadata": { "tags": [] }, "source": [ "## Time Since Page's Previous Edit" ] }, { "cell_type": "code", "execution_count": 463, "id": "c37a401a-f2c5-45ce-bce8-e50f509d6cfb", "metadata": {}, "outputs": [], "source": [ "time_page_prev_rev_minutes = [1, 5, 15, 30, 60, init_criteria.time_page_prev_rev.max()/60]\n", "time_page_prev_rev_time_intervals = [i*60 for i in time_page_prev_rev_minutes]\n", "\n", "time_page_prev_rev_column_names = [f'{i} min' if i<=60 else 'max' for i in time_page_prev_rev_minutes]\n", "\n", "time_page_prev_rev_median_risk = calculate_grouped(init_criteria, time_page_prev_rev_time_intervals, \n", " 'time_page_prev_rev', column_names=time_page_prev_rev_column_names)\n", "time_page_prev_rev_counts = calculate_grouped(init_criteria, time_page_prev_rev_time_intervals, \n", " 'time_page_prev_rev', column_names=time_page_prev_rev_column_names, grp_function='count')" ] }, { "cell_type": "code", "execution_count": 487, "id": "bce8bbe7-2c20-44ac-aa13-9cdbfdfb84ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
Median Risk \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 1 min5 min15 min30 min60 minmax
wiki_db      
dewiki0.9320.9180.9130.9110.9100.902
enwiki0.9220.9160.9130.9120.9120.911
eswiki0.9320.9270.9240.9230.9230.923
fawiki0.9430.9310.9270.9270.9270.916
frwiki0.9340.9190.9150.9130.9120.903
idwiki0.9130.9110.9080.9070.9080.902
itwiki0.9340.9260.9220.9210.9200.920
jawiki0.9160.8920.8870.8850.8830.876
ptwiki0.9370.9280.9240.9230.9210.913
ruwiki0.9280.9230.9200.9180.9180.914
zhwiki0.8960.8880.8850.8850.8840.883
\n", "
Number of Edits
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1 min5 min15 min30 min60 minmax
wiki_db
dewiki1440339839874192441116829
enwiki2082850764600956369466835172584
eswiki69681704419714206872152655105
fawiki108630783671386040219967
frwiki1906531563326647691219375
idwiki46510731286137014423554
itwiki3027670678448220856123440
jawiki1625338939734217445010170
ptwiki3409671235131813773361
ruwiki2323632674207792808923587
zhwiki103526233168338835817568
\n", "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", " 'Median Risk': time_page_prev_rev_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n", " 'Number of Edits': time_page_prev_rev_counts\n", "})" ] }, { "cell_type": "markdown", "id": "78939d4d-236e-41f2-b8d6-95467ab8f742", "metadata": {}, "source": [ "While resitricting improves the score, a susbsantial number of edits will be elimated for no significant benefit." ] }, { "cell_type": "markdown", "id": "479739c4-ccda-4173-a471-9257720ec1a4", "metadata": {}, "source": [ "## Bytes Diff" ] }, { "cell_type": "code", "execution_count": 494, "id": "b38691ad-c6a4-4a64-9eb0-d40993031666", "metadata": {}, "outputs": [], "source": [ "warnings.filterwarnings('ignore')\n", "\n", "bytes_diff_intervals = [0, 1, 5, 10, 100, 500, 1000, 5000, init_criteria.rev_bytes_diff.abs().max()]\n", "\n", "bytes_diff_column_labels = ['min'] + bytes_diff_intervals[1:-1] + ['max']\n", "\n", "bytes_diff_median_risk = calculate_grouped(init_criteria, bytes_diff_intervals, \n", " 'rev_bytes_diff', column_names=bytes_diff_column_labels)\n", "bytes_diff_counts = calculate_grouped(init_criteria, bytes_diff_intervals, \n", " 'rev_bytes_diff', column_names=bytes_diff_column_labels, grp_function='count')" ] }, { "cell_type": "code", "execution_count": 495, "id": "3a8b18a4-17f8-42e8-9ae2-476e3d758bc2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
Median Risk \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 min151010050010005000max
wiki_db         
dewiki0.9010.9050.9120.9120.9150.9680.9830.9930.000
enwiki0.9100.9120.9150.9150.9170.9650.9780.9860.000
eswiki0.9220.9240.9280.9290.9430.9780.9840.9430.963
fawiki0.9160.9170.9200.9200.9210.9510.9730.9780.000
frwiki0.9030.9050.9080.9080.9160.9740.9830.9920.000
idwiki0.9020.9050.9060.9080.9190.9760.9830.9790.000
itwiki0.9170.9190.9210.9210.9280.9780.9870.9950.000
jawiki0.8680.8710.8750.8760.8960.9610.9740.9790.000
ptwiki0.9120.9140.9170.9160.9060.9190.9120.9140.000
ruwiki0.9130.9150.9190.9210.9310.9740.9830.9900.000
zhwiki0.8830.8860.8900.8910.9150.9650.9760.9850.000
\n", "
Number of Edits
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min151010050010005000max
wiki_db
dewiki16711155661242010723384014918942320
enwiki1711911591061312461142464224615488926819440
eswiki5494951473419133580512103516732421831
fawiki985793878041726931351046592860
frwiki192631815515031132825303230315374300
idwiki3526326127732397824303168310
itwiki227612101016844144804761175610642950
jawiki9659896877516926299913578982240
ptwiki33393153269324461038402217460
ruwiki232642181018545166777075297519946080
zhwiki748266225646486918258545461200
\n", "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_h({\n", " 'Median Risk': bytes_diff_median_risk.fillna(0).style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n", " 'Number of Edits': bytes_diff_counts.fillna(0).astype(int)\n", "})" ] }, { "cell_type": "markdown", "id": "31bed787-0a34-407f-b552-bd3628f6d026", "metadata": {}, "source": [ "Restricting to have at least 5 bytes difference provides a good balance between the score and the number of edits" ] }, { "cell_type": "markdown", "id": "b92bd161-00c3-42e7-ad73-3e8c646abd2a", "metadata": {}, "source": [ "## Incremental criteria" ] }, { "cell_type": "markdown", "id": "95bbe63a-40ee-4c7a-b1ff-5fef43c06d25", "metadata": {}, "source": [ "Based on the above results, we will incrementally apply additional restrictions\n", "- Reverted within 12 hours\n", "- User edit count less 15 edits\n", "- Time since user's first edit is less than 48 hours\n", "- Absolute bytes difference is more than 5 bytes" ] }, { "cell_type": "code", "execution_count": 512, "id": "7f10c1a0-13a8-46ab-9eb4-9ccd7987f579", "metadata": {}, "outputs": [], "source": [ "init_criteria['abs_bytes_diff'] = init_criteria['rev_bytes_diff'].abs()" ] }, { "cell_type": "code", "execution_count": 564, "id": "5ed4c037-3230-4a04-a5ef-440b5a55552d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
Initial
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wiki_dbmedian_riskn_edits
0dewiki0.90197416829
1enwiki0.910679172584
2eswiki0.92259655105
3fawiki0.9163669967
4frwiki0.90331619375
5idwiki0.9024643554
6itwiki0.91964823440
7jawiki0.87568210170
8ptwiki0.9130643361
9ruwiki0.91429123587
10zhwiki0.8834547568
\n", "
+ Reverted within 12 hours
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wiki_dbmedian_riskn_edits
0dewiki0.90423916077
1enwiki0.912205162439
2eswiki0.92347452922
3fawiki0.9167929228
4frwiki0.90558818401
5idwiki0.9019943231
6itwiki0.92130122077
7jawiki0.8797899401
8ptwiki0.9143633147
9ruwiki0.91640322250
10zhwiki0.8869896880
\n", "
+ User Edit Count <= 15 edits
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wiki_dbmedian_riskn_edits
0dewiki0.90450316061
1enwiki0.912847160889
2eswiki0.92385052696
3fawiki0.9180569136
4frwiki0.90630418285
5idwiki0.9028923190
6itwiki0.92136522011
7jawiki0.8801169109
8ptwiki0.9169163079
9ruwiki0.91674622204
10zhwiki0.8875886819
\n", "
+ Time Since First Edit <= 48 hrs
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wiki_dbmedian_riskn_edits
0dewiki0.90755515468
1enwiki0.915196153858
2eswiki0.92479251696
3fawiki0.9204688539
4frwiki0.90903417489
5idwiki0.9050713067
6itwiki0.92270921633
7jawiki0.8825258828
8ptwiki0.9306692458
9ruwiki0.91810321661
10zhwiki0.8903806481
\n", "
+ Absolute Bytes Diff >= 5 bytes
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wiki_dbmedian_riskn_edits
0dewiki0.91721411281
1enwiki0.920194115997
2eswiki0.93048339239
3fawiki0.9243526734
4frwiki0.91370913492
5idwiki0.9100192361
6itwiki0.92453315505
7jawiki0.8836706679
8ptwiki0.9342281855
9ruwiki0.92378816914
10zhwiki0.8963374813
\n", "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def calculate_median_risk_and_count(df, criteria, time_to_revert_limit=12*60*60):\n", " \n", " query_string = f\"time_to_revert <= {time_to_revert_limit} \" + (\"& \" + criteria if criteria else \"\")\n", " filtered_df = df.query(query_string)\n", " aggregated_df = filtered_df.groupby('wiki_db').agg({'risk': 'median', 'rev_id': 'count'})\n", " aggregated_df.rename({'rev_id': 'n_edits', 'risk': 'median_risk'}, inplace=True, axis=1)\n", " \n", " return aggregated_df.reset_index()\n", "\n", "criteria_conditions = {\n", " 'Initial': init_criteria_risk,\n", " '+ Reverted within 12 hours': '',\n", " '+ User Edit Count <= 15 edits': \"(is_anon == True) | (user_edit_count <= 15)\",\n", " '+ Time Since First Edit <= 48 hrs': \"(is_anon == True) | ((user_edit_count <= 15) & (elapsed_first_rev < 48*60*60))\",\n", " '+ Absolute Bytes Diff >= 5 bytes': \"(abs_bytes_diff >= 5) & ((is_anon == True) | ((user_edit_count <= 15) & (elapsed_first_rev < 48*60*60)))\"\n", "}\n", "\n", "results = {label: calculate_median_risk_and_count(init_criteria, criteria) if label != 'Initial' \\\n", " else init_criteria_risk for label, criteria in criteria_conditions.items()}\n", "display_h(results)" ] }, { "cell_type": "markdown", "id": "f44014bf-be98-439f-8d58-672ae7fc0504", "metadata": {}, "source": [ "- Restricting user related related metrics make minor improvements to the median risk, as majority of the reverted edits are made by anonymous users.\n", "- While having at least an n number of absolute bytes difference, a substantial number of edits are elimiated, as compared to the initial criteria.\n", "- In addition to the time to revert, absolute bytes difference is only the control factor available for anonymous edits." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 5 }