{
"cells": [
{
"cell_type": "markdown",
"id": "5451dfbe-903f-4700-a0e2-484d3f885b57",
"metadata": {},
"source": [
"# Comparision of Proposed Vandalism Criteria with Revert Risk scores\n",
"\n",
"[TASK: T349083](https://phabricator.wikimedia.org/T349083)\n",
"\n",
"➤ ***Please view this notebook on [nbviewer](https://nbviewer.org/github/wikimedia-research/moderator-tools-FY24/blob/main/%5BT349083%5D%20vandalism_criteria_comparision/vandal_criteria_revert_risk_comparision.ipynb)***\n",
"\n",
"For various baseline measurements for evaluation of [Automoderator](https://www.mediawiki.org/wiki/Moderator_Tools/Automoderator), we want to develop a criteria to identify potential vandalism. In this analysis the criteria will be compared with the [revert risk scores](https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_revert_risk). Starting with an set an intial set, different dimensions will be used to see how that impacts the median revert risk score by project and also how restricting the criteria further elimiates edits from consideration. The goal is find a balance between good median score, without eliminating too many edits from consideration.\n",
"\n",
"**Initial criteria:**\n",
"- Edits from account with less than 25 edits or anonymous user\n",
"- Reverted by a different editor\n",
"- Reverts happen within 24 hours\n",
"- Edits in the content namespace\n",
"\n",
"**Dimensions considered**\n",
"- Time to revert \n",
"- User edit count (for registered users)\n",
"- Time since user's first revision (for registered users)\n",
"- Time since user's previous revision (for registered users)\n",
"- Time since previous revision on the page being edited\n",
"- Absolute difference in bytes made by the revision\n",
"\n",
"## Summary\n",
"Based on the analysis, the following additions/modifications can improve the median risk score\n",
"- Reverted within 12 hours\n",
"- User edit count less 15 edits\n",
"- Time since user's first edit is less than 48 hours\n",
"- Absolute bytes difference is more than 5 bytes"
]
},
{
"cell_type": "code",
"execution_count": 566,
"id": "31f2f8c8-5923-4ca3-8f0b-337764e49908",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
Changes in the Median Risk & Number of Edits
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" \n",
"
Initial
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.901974 | \n",
" 16829 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.910679 | \n",
" 172584 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.922596 | \n",
" 55105 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.916366 | \n",
" 9967 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.903316 | \n",
" 19375 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.902464 | \n",
" 3554 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.919648 | \n",
" 23440 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.875682 | \n",
" 10170 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.913064 | \n",
" 3361 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.914291 | \n",
" 23587 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.883454 | \n",
" 7568 | \n",
"
\n",
" \n",
"
\n",
"
+ Reverted within 12 hours
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.904239 | \n",
" 16077 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.912205 | \n",
" 162439 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.923474 | \n",
" 52922 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.916792 | \n",
" 9228 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.905588 | \n",
" 18401 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.901994 | \n",
" 3231 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.921301 | \n",
" 22077 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.879789 | \n",
" 9401 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.914363 | \n",
" 3147 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.916403 | \n",
" 22250 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.886989 | \n",
" 6880 | \n",
"
\n",
" \n",
"
\n",
"
+ User Edit Count <= 15 edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.904503 | \n",
" 16061 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.912847 | \n",
" 160889 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.923850 | \n",
" 52696 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.918056 | \n",
" 9136 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.906304 | \n",
" 18285 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.902892 | \n",
" 3190 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.921365 | \n",
" 22011 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.880116 | \n",
" 9109 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.916916 | \n",
" 3079 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.916746 | \n",
" 22204 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.887588 | \n",
" 6819 | \n",
"
\n",
" \n",
"
\n",
"
+ Time Since First Edit <= 48 hrs
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.907555 | \n",
" 15468 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.915196 | \n",
" 153858 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.924792 | \n",
" 51696 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.920468 | \n",
" 8539 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.909034 | \n",
" 17489 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.905071 | \n",
" 3067 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.922709 | \n",
" 21633 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.882525 | \n",
" 8828 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.930669 | \n",
" 2458 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.918103 | \n",
" 21661 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.890380 | \n",
" 6481 | \n",
"
\n",
" \n",
"
\n",
"
+ Absolute Bytes Diff >= 5 bytes
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.917214 | \n",
" 11281 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.920194 | \n",
" 115997 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.930483 | \n",
" 39239 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.924352 | \n",
" 6734 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.913709 | \n",
" 13492 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.910019 | \n",
" 2361 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.924533 | \n",
" 15505 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.883670 | \n",
" 6679 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.934228 | \n",
" 1855 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.923788 | \n",
" 16914 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.896337 | \n",
" 4813 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pr_centered('Changes in the Median Risk & Number of Edits', True)\n",
"display_h(results)"
]
},
{
"cell_type": "markdown",
"id": "d3b4e707-db23-4cc9-95fe-19285595e171",
"metadata": {},
"source": [
"- Restricting user related related metrics make minor improvements to the median risk, as majority of the reverted edits are made by anonymous users.\n",
"- While having at least an n number of absolute bytes difference, improves the median risk, a substantial number of edits are elimiated, as compared to the initial criteria.\n",
"- In addition to the time to revert, absolute bytes difference is only the control factor available for anonymous edits.\n"
]
},
{
"cell_type": "markdown",
"id": "1013433a-2909-4b7c-8325-ac05c07ed8ea",
"metadata": {
"tags": []
},
"source": [
"# Data-Gathering"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7bfb58a2-a2f9-4ddd-b622-7f8130c12dfd",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import wmfdata as wmf\n",
"\n",
"pd.options.display.max_columns = None\n",
"from IPython.display import clear_output\n",
"\n",
"import warnings\n",
"import random\n",
"from datetime import datetime\n",
"\n",
"from IPython.display import display_html\n",
"from IPython.display import display, HTML\n",
"from IPython.display import clear_output"
]
},
{
"cell_type": "code",
"execution_count": 89,
"id": "c806fdbd-1195-4d58-b54e-313dd35c8ced",
"metadata": {},
"outputs": [],
"source": [
"# import seaborn as sns\n",
"# import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 180,
"id": "0e5f4221-5d33-4abe-991d-e862d4d5e7f7",
"metadata": {},
"outputs": [],
"source": [
"spark_session = wmf.spark.get_active_session()\n",
"\n",
"if type(spark_session) != type(None):\n",
" spark_session.stop()\n",
"else:\n",
" print('no active session')"
]
},
{
"cell_type": "code",
"execution_count": 574,
"id": "f24d2e1f-eebb-4b99-8dd8-fd5b8ade338f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
SparkSession - hive
\n",
" \n",
"
\n",
"
SparkContext
\n",
"\n",
"
Spark UI
\n",
"\n",
"
\n",
" - Version
\n",
" v3.1.2
\n",
" - Master
\n",
" yarn
\n",
" - AppName
\n",
" vandal-criteria-comparision
\n",
"
\n",
"
\n",
" \n",
"
\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 574,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark_session = wmf.spark.create_custom_session(\n",
" master=\"yarn\",\n",
" app_name='vandal-criteria-comparision',\n",
" spark_config={\n",
" \"spark.driver.memory\": \"6g\",\n",
" \"spark.dynamicAllocation.maxExecutors\": 64,\n",
" \"spark.executor.memory\": \"24g\",\n",
" \"spark.executor.cores\": 4,\n",
" \"spark.sql.shuffle.partitions\": 256,\n",
" \"spark.driver.maxResultSize\": \"2g\"\n",
" \n",
" }\n",
")\n",
"\n",
"clear_output()\n",
"\n",
"spark_session.sparkContext.setLogLevel(\"ERROR\")\n",
"spark_session"
]
},
{
"cell_type": "markdown",
"id": "bd65d9cb-c8af-4a44-826e-85e999a3bc4f",
"metadata": {},
"source": [
"## query"
]
},
{
"cell_type": "code",
"execution_count": 575,
"id": "54c7c413-8b81-428f-95f7-b408b97d9544",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- rev_id: long (nullable = true)\n",
" |-- wiki_db: string (nullable = true)\n",
" |-- rev_timestamp: string (nullable = true)\n",
" |-- revision_is_identity_reverted: boolean (nullable = true)\n",
" |-- revision_seconds_to_identity_revert: long (nullable = true)\n",
" |-- page_id: long (nullable = true)\n",
" |-- revision_revert_risk: float (nullable = true)\n",
" |-- user_is_anonymous: boolean (nullable = true)\n",
" |-- user_is_bot: boolean (nullable = true)\n",
"\n"
]
}
],
"source": [
"rr_scores_path = '/user/paragon/riskobservatory/revertrisk_20212022_anonymous_bot.parquet'\n",
"\n",
"rr_scores = spark_session.read.parquet(rr_scores_path)\n",
"rr_scores.createOrReplaceTempView('rr_scores')\n",
"\n",
"rr_scores.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7a5746a4-b4df-4c23-aa3e-6072d0ccf2dc",
"metadata": {},
"outputs": [],
"source": [
"mwh_snapshot = '2023-10'\n",
"\n",
"wikis_list = [f'{lang}wiki' for lang in ['en', 'es', 'ja', 'de', 'fr', 'ru', 'zh', 'it', 'pt', 'fa', 'id']]\n",
"wikis_sql = wmf.utils.sql_tuple(wikis_list)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "41ff9f3d-87d6-448c-a460-742878d55f7a",
"metadata": {},
"outputs": [],
"source": [
"# generate 30 random dates in an year\n",
"\n",
"def generate_random_dates(year, num_dates):\n",
" dates = []\n",
" for _ in range(num_dates):\n",
" month = random.randint(1, 12)\n",
" if month in [1, 3, 5, 7, 8, 10, 12]:\n",
" day = random.randint(1, 31)\n",
" elif month == 2:\n",
" day = random.randint(1, 28)\n",
" else:\n",
" day = random.randint(1, 30)\n",
" \n",
" date = datetime(year, month, day)\n",
" dates.append(date.strftime(\"%Y-%m-%d\"))\n",
" \n",
" return dates\n",
"\n",
"random_dates_2022 = generate_random_dates(2022, 30)\n",
"random_dates_2022_sql = wmf.utils.sql_tuple(random_dates_2022)"
]
},
{
"cell_type": "code",
"execution_count": 585,
"id": "e3058fbc-4d4f-4ecf-9345-423fd62a5bcd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.2 s, sys: 0 ns, total: 5.2 s\n",
"Wall time: 5min 16s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"query = f\"\"\"\n",
"WITH \n",
" base_criteria AS (\n",
" SELECT\n",
" mwh.wiki_db,\n",
" rr.rev_id,\n",
" revision_revert_risk AS risk,\n",
" mwh.event_user_text AS user_name,\n",
" event_timestamp AS rev_ts,\n",
" event_user_is_anonymous AS is_anon,\n",
" event_user_revision_count AS user_edit_count,\n",
" COALESCE(event_user_registration_timestamp, event_user_creation_timestamp) AS user_reg_ts,\n",
" event_user_first_edit_timestamp AS user_first_rev_ts,\n",
" event_user_seconds_since_previous_revision AS time_user_prev_rev,\n",
" page_seconds_since_previous_revision AS time_page_prev_rev,\n",
" revision_text_bytes_diff AS rev_bytes_diff,\n",
" mwh.revision_seconds_to_identity_revert AS time_to_revert,\n",
" revision_text_bytes AS rev_bytes,\n",
" revision_is_identity_revert AS reverting_edit,\n",
" revision_first_identity_reverting_revision_id AS reverting_edit_id\n",
" FROM \n",
" rr_scores rr\n",
" JOIN \n",
" wmf.mediawiki_history mwh \n",
" ON rr.wiki_db = mwh.wiki_db AND rr.rev_id = mwh.revision_id\n",
" WHERE \n",
" snapshot = '{mwh_snapshot}'\n",
" AND rr.wiki_db IN {wikis_sql}\n",
" AND event_entity = 'revision'\n",
" AND event_type = 'create'\n",
" AND DATE(event_timestamp) IN {random_dates_2022_sql}\n",
" AND page_namespace_is_content\n",
" AND (event_user_is_anonymous OR event_user_revision_count <= 250)\n",
" AND SIZE(event_user_is_bot_by_historical) = 0\n",
" AND mwh.revision_is_identity_reverted\n",
" AND mwh.revision_seconds_to_identity_revert <= 3*24*60*60\n",
" )\n",
" \n",
"\n",
"SELECT\n",
" bc.*,\n",
" mwh.event_user_is_anonymous AS reverting_user_is_anon,\n",
" mwh.event_user_revision_count AS reverting_user_edit_count,\n",
" mwh.event_user_first_edit_timestamp AS reverting_user_first_rev_ts,\n",
" mwh.revision_is_identity_reverted AS is_revert_reverted,\n",
" mwh.revision_seconds_to_identity_revert AS revert_time_to_revert\n",
"FROM \n",
" base_criteria bc\n",
"JOIN\n",
" wmf.mediawiki_history mwh\n",
" ON bc.wiki_db = mwh.wiki_db AND bc.reverting_edit_id = mwh.revision_id\n",
"WHERE\n",
" snapshot = '{mwh_snapshot}'\n",
" AND NOT bc.user_name = mwh.event_user_text\n",
"\"\"\"\n",
"\n",
"edits = wmf.spark.run(query)"
]
},
{
"cell_type": "code",
"execution_count": 586,
"id": "d1f360de-a02c-4ca0-ad9f-05c6a79b01cf",
"metadata": {},
"outputs": [],
"source": [
"edits = (\n",
" edits\n",
" .assign(\n",
" rev_ts=pd.to_datetime(edits['rev_ts'], utc=True),\n",
" user_reg_ts=pd.to_datetime(edits['user_reg_ts'], utc=True),\n",
" user_first_rev_ts=pd.to_datetime(edits['user_first_rev_ts'], utc=True),\n",
" reverting_user_first_rev_ts=pd.to_datetime(edits['reverting_user_first_rev_ts'], utc=True),\n",
" is_anon=pd.Categorical(edits['is_anon']),\n",
" reverting_user_is_anon=pd.Categorical(edits['reverting_user_is_anon']),\n",
" is_revert_reverted=pd.Categorical(edits['is_revert_reverted'])\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 587,
"id": "a71392f0-9a22-4a01-a325-b204304e10a1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 391096 entries, 0 to 391095\n",
"Data columns (total 21 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 wiki_db 391096 non-null object \n",
" 1 rev_id 391096 non-null int64 \n",
" 2 risk 391096 non-null float32 \n",
" 3 user_name 391096 non-null object \n",
" 4 rev_ts 391096 non-null datetime64[ns, UTC]\n",
" 5 is_anon 391096 non-null category \n",
" 6 user_edit_count 92095 non-null float64 \n",
" 7 user_reg_ts 92053 non-null datetime64[ns, UTC]\n",
" 8 user_first_rev_ts 92095 non-null datetime64[ns, UTC]\n",
" 9 time_user_prev_rev 75259 non-null float64 \n",
" 10 time_page_prev_rev 391096 non-null int64 \n",
" 11 rev_bytes_diff 387254 non-null float64 \n",
" 12 time_to_revert 391096 non-null int64 \n",
" 13 rev_bytes 387436 non-null float64 \n",
" 14 reverting_edit 391096 non-null bool \n",
" 15 reverting_edit_id 391096 non-null int64 \n",
" 16 reverting_user_is_anon 391096 non-null category \n",
" 17 reverting_user_edit_count 376252 non-null float64 \n",
" 18 reverting_user_first_rev_ts 376252 non-null datetime64[ns, UTC]\n",
" 19 is_revert_reverted 391096 non-null category \n",
" 20 revert_time_to_revert 48583 non-null float64 \n",
"dtypes: bool(1), category(3), datetime64[ns, UTC](4), float32(1), float64(6), int64(4), object(2)\n",
"memory usage: 50.7+ MB\n"
]
}
],
"source": [
"edits.info()"
]
},
{
"cell_type": "markdown",
"id": "a2b680e6-afb1-4da3-b3d4-aa79746b0a33",
"metadata": {},
"source": [
"# Analysis"
]
},
{
"cell_type": "markdown",
"id": "5691279e-683d-41a7-abb0-ad092bda57f5",
"metadata": {},
"source": [
"## Functions"
]
},
{
"cell_type": "code",
"execution_count": 409,
"id": "8c0f36bf-2183-42a7-9c6c-fa7690679ce1",
"metadata": {},
"outputs": [],
"source": [
"# prints a string at center of the output, bold if needed\n",
"def pr_centered(content, bold=False):\n",
" if bold:\n",
" content = f\"{content}\"\n",
" \n",
" centered_html = f\"{content}
\"\n",
" \n",
" display(HTML(centered_html))\n",
"\n",
"\n",
"# display dataframes horizontally with title for each\n",
"def display_h(frames, space=100):\n",
" html = \"\"\n",
" \n",
" for key in frames.keys():\n",
" html_df =f'{key} {frames[key]._repr_html_()}
'\n",
" html += html_df\n",
" \n",
" html = f\"\"\"\n",
" \n",
" {html}\n",
"
\"\"\"\n",
" \n",
" display_html(html, raw=True)"
]
},
{
"cell_type": "code",
"execution_count": 503,
"id": "ec036262-2f85-4be4-88c1-a1f8933b68e5",
"metadata": {},
"outputs": [],
"source": [
"def calculate_grouped(df, intervals, pivot_column, columns_title=None, column_names=None, target_column='risk', group_column='wiki_db', grp_function='median'):\n",
"\n",
" final_results = []\n",
"\n",
" for interval in intervals:\n",
" \n",
" # unlike other temporal columns, bytes difference should be greater than given value\n",
" \n",
" if pivot_column == 'rev_bytes_diff':\n",
" df[pivot_column] = df[pivot_column].abs()\n",
" filtered_df = df[df[pivot_column] >= interval]\n",
" else:\n",
" filtered_df = df[df[pivot_column] <= interval]\n",
" \n",
" grouped = filtered_df.groupby(group_column).agg({target_column: grp_function}).reset_index()\n",
"\n",
" grouped['interval'] = interval\n",
" final_results.append(grouped)\n",
"\n",
" concatenated_df = pd.concat(final_results)\n",
" pivot_df = concatenated_df.pivot(index=group_column, columns='interval', values=target_column)\n",
" \n",
" if columns_title == None:\n",
" pivot_df.columns.name = f'median: {pivot_column}'\n",
" else:\n",
" pivot_df.columns.name = f'median: {columns_title}'\n",
" \n",
" if column_names != None:\n",
" pivot_df.columns = column_names\n",
"\n",
" return pivot_df\n",
"\n",
"# def plot_hmap(df, x_label, title, fontsize=10, y_label='Wikipedia', cbar_label='Median Risk'):\n",
" \n",
"# ax = sns.heatmap(df, annot=True, annot_kws={\"size\": fontsize})\n",
" \n",
"# # set labels\n",
"# ax.set_xlabel(x_label, fontsize=fontsize)\n",
"# ax.set_ylabel(y_label, fontsize=fontsize)\n",
"# ax.set_title(title, fontsize=fontsize + 1)\n",
" \n",
"# # color bar properties\n",
"# cbar = ax.collections[0].colorbar\n",
"# cbar.set_label(cbar_label, fontsize=fontsize)\n",
"# cbar.ax.tick_params(labelsize=fontsize)\n",
"\n",
"# plt.show()\n",
" \n",
"def time_delta(df, start_column, end_column):\n",
" try: \n",
" return df.apply(lambda row: (row[end_column] - row[start_column]).total_seconds(), axis=1)\n",
" except:\n",
" return np.NaN"
]
},
{
"cell_type": "markdown",
"id": "759f2ed7-d168-4ced-9f48-18b21c6f6e48",
"metadata": {},
"source": [
"## Initial Criteria"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c72ae52d-e1d9-48e5-8679-1e08a6de6c9d",
"metadata": {},
"outputs": [],
"source": [
"init_criteria = edits.query(\"\"\"(time_to_revert <= 24*60*60) & ((is_anon == True) | (user_edit_count <= 25))\"\"\")\n",
"\n",
"init_criteria = (\n",
" init_criteria\n",
" .assign(\n",
" elapsed_reg=time_delta(init_criteria, 'user_reg_ts', 'rev_ts'),\n",
" elapsed_first_rev=time_delta(init_criteria, 'user_first_rev_ts', 'rev_ts'),\n",
" rv_user_elapsed_first_rev=time_delta(init_criteria, 'reverting_user_first_rev_ts', 'rev_ts')\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 589,
"id": "1785a25a-6880-4c9c-a76d-ae9eadd5a2ae",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.901974 | \n",
" 16829 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.910679 | \n",
" 172584 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.922596 | \n",
" 55105 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.916366 | \n",
" 9967 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.903316 | \n",
" 19375 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.902464 | \n",
" 3554 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.919648 | \n",
" 23440 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.875682 | \n",
" 10170 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.913064 | \n",
" 3361 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.914291 | \n",
" 23587 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.883454 | \n",
" 7568 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" wiki_db median_risk n_edits\n",
"0 dewiki 0.901974 16829\n",
"1 enwiki 0.910679 172584\n",
"2 eswiki 0.922596 55105\n",
"3 fawiki 0.916366 9967\n",
"4 frwiki 0.903316 19375\n",
"5 idwiki 0.902464 3554\n",
"6 itwiki 0.919648 23440\n",
"7 jawiki 0.875682 10170\n",
"8 ptwiki 0.913064 3361\n",
"9 ruwiki 0.914291 23587\n",
"10 zhwiki 0.883454 7568"
]
},
"execution_count": 589,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"init_criteria_risk = (\n",
" init_criteria\n",
" .groupby('wiki_db')\n",
" .agg({\n",
" 'risk': 'median', \n",
" 'rev_id': 'count'\n",
" })\n",
" .reset_index()\n",
" .rename({\n",
" 'rev_id': 'n_edits', \n",
" 'risk': 'median_risk'\n",
" }, axis=1)\n",
")\n",
"\n",
"init_criteria_risk"
]
},
{
"cell_type": "markdown",
"id": "88270901-6d56-4933-a087-1a0a07875c09",
"metadata": {},
"source": [
"## Time to Revert"
]
},
{
"cell_type": "code",
"execution_count": 434,
"id": "a7c2f9cb-d5a9-4cde-a4fd-2428742c0557",
"metadata": {},
"outputs": [],
"source": [
"ttr_hour_intervals = [1, 2, 4, 8, 12, 24]\n",
"ttr_time_intervals = [i*60*60 for i in ttr_hour_intervals]\n",
"ttr_column_names = [f'{i} hr' for i in ttr_hour_intervals]\n",
"\n",
"ttr_median_risk = calculate_grouped(init_criteria, ttr_time_intervals, \n",
" 'time_to_revert', column_names=ttr_column_names)\n",
"ttr_interval_counts = calculate_grouped(init_criteria, ttr_time_intervals, \n",
" 'time_to_revert', column_names=ttr_column_names, grp_function = 'count')"
]
},
{
"cell_type": "code",
"execution_count": 457,
"id": "0784ab64-9003-4ef1-bf41-25b447bf0940",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" 1 hr | \n",
" 2 hr | \n",
" 4 hr | \n",
" 8 hr | \n",
" 12 hr | \n",
" 24 hr | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.910 | \n",
" 0.908 | \n",
" 0.906 | \n",
" 0.905 | \n",
" 0.904 | \n",
" 0.902 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.920 | \n",
" 0.918 | \n",
" 0.915 | \n",
" 0.913 | \n",
" 0.912 | \n",
" 0.911 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.928 | \n",
" 0.926 | \n",
" 0.925 | \n",
" 0.924 | \n",
" 0.923 | \n",
" 0.923 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.918 | \n",
" 0.918 | \n",
" 0.918 | \n",
" 0.917 | \n",
" 0.917 | \n",
" 0.916 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.913 | \n",
" 0.911 | \n",
" 0.909 | \n",
" 0.907 | \n",
" 0.906 | \n",
" 0.903 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.900 | \n",
" 0.901 | \n",
" 0.899 | \n",
" 0.902 | \n",
" 0.902 | \n",
" 0.902 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.927 | \n",
" 0.926 | \n",
" 0.924 | \n",
" 0.922 | \n",
" 0.921 | \n",
" 0.920 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.894 | \n",
" 0.891 | \n",
" 0.886 | \n",
" 0.882 | \n",
" 0.880 | \n",
" 0.876 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.920 | \n",
" 0.918 | \n",
" 0.917 | \n",
" 0.916 | \n",
" 0.914 | \n",
" 0.913 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.926 | \n",
" 0.923 | \n",
" 0.920 | \n",
" 0.918 | \n",
" 0.916 | \n",
" 0.914 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.896 | \n",
" 0.893 | \n",
" 0.891 | \n",
" 0.889 | \n",
" 0.887 | \n",
" 0.883 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 1 hr | \n",
" 2 hr | \n",
" 4 hr | \n",
" 8 hr | \n",
" 12 hr | \n",
" 24 hr | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 13349 | \n",
" 14151 | \n",
" 14974 | \n",
" 15661 | \n",
" 16077 | \n",
" 16829 | \n",
"
\n",
" \n",
" enwiki | \n",
" 114940 | \n",
" 128218 | \n",
" 141591 | \n",
" 155008 | \n",
" 162439 | \n",
" 172584 | \n",
"
\n",
" \n",
" eswiki | \n",
" 42487 | \n",
" 45577 | \n",
" 48656 | \n",
" 51468 | \n",
" 52922 | \n",
" 55105 | \n",
"
\n",
" \n",
" fawiki | \n",
" 6798 | \n",
" 7417 | \n",
" 8123 | \n",
" 8816 | \n",
" 9228 | \n",
" 9967 | \n",
"
\n",
" \n",
" frwiki | \n",
" 14078 | \n",
" 15335 | \n",
" 16506 | \n",
" 17687 | \n",
" 18401 | \n",
" 19375 | \n",
"
\n",
" \n",
" idwiki | \n",
" 1662 | \n",
" 2070 | \n",
" 2550 | \n",
" 3006 | \n",
" 3231 | \n",
" 3554 | \n",
"
\n",
" \n",
" itwiki | \n",
" 16739 | \n",
" 18198 | \n",
" 19752 | \n",
" 21189 | \n",
" 22077 | \n",
" 23440 | \n",
"
\n",
" \n",
" jawiki | \n",
" 6351 | \n",
" 7245 | \n",
" 8150 | \n",
" 8943 | \n",
" 9401 | \n",
" 10170 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 2081 | \n",
" 2347 | \n",
" 2686 | \n",
" 2985 | \n",
" 3147 | \n",
" 3361 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 15851 | \n",
" 17794 | \n",
" 19570 | \n",
" 21248 | \n",
" 22250 | \n",
" 23587 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 4071 | \n",
" 4823 | \n",
" 5637 | \n",
" 6446 | \n",
" 6880 | \n",
" 7568 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': ttr_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': ttr_interval_counts\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "20d197eb-630e-4b46-8498-888644874a32",
"metadata": {},
"source": [
"Limiting to 8 hr window provides a slight improvement without eliminating a lot of edits."
]
},
{
"cell_type": "markdown",
"id": "fcd1397d-ce09-4d02-aa01-b5fe2af0ef20",
"metadata": {},
"source": [
"## User Edit Count"
]
},
{
"cell_type": "code",
"execution_count": 439,
"id": "fdd8b535-c4ff-4aef-b575-c40e0762b8b9",
"metadata": {},
"outputs": [],
"source": [
"edit_count_intervals = [5, 10, 15, 20, 25]\n",
"edit_count_column_names = [f'{i} edits' for i in edit_count_intervals]\n",
"\n",
"edit_count_median_risk = calculate_grouped(init_criteria, edit_count_intervals, \n",
" 'user_edit_count', column_names=edit_count_column_names)\n",
"edit_count_interval_counts = calculate_grouped(init_criteria, edit_count_intervals, \n",
" 'user_edit_count', column_names=edit_count_column_names, grp_function='count')"
]
},
{
"cell_type": "code",
"execution_count": 456,
"id": "b3df9e0b-10b3-4696-b80e-f8ee0c2a494c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" 5 edits | \n",
" 10 edits | \n",
" 15 edits | \n",
" 20 edits | \n",
" 25 edits | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.900 | \n",
" 0.892 | \n",
" 0.885 | \n",
" 0.880 | \n",
" 0.876 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.924 | \n",
" 0.918 | \n",
" 0.913 | \n",
" 0.910 | \n",
" 0.908 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.936 | \n",
" 0.931 | \n",
" 0.929 | \n",
" 0.926 | \n",
" 0.924 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.929 | \n",
" 0.923 | \n",
" 0.920 | \n",
" 0.915 | \n",
" 0.911 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.919 | \n",
" 0.910 | \n",
" 0.903 | \n",
" 0.897 | \n",
" 0.893 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.906 | \n",
" 0.893 | \n",
" 0.891 | \n",
" 0.889 | \n",
" 0.886 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.921 | \n",
" 0.914 | \n",
" 0.908 | \n",
" 0.905 | \n",
" 0.904 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.920 | \n",
" 0.916 | \n",
" 0.909 | \n",
" 0.904 | \n",
" 0.899 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.923 | \n",
" 0.919 | \n",
" 0.917 | \n",
" 0.914 | \n",
" 0.913 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.921 | \n",
" 0.914 | \n",
" 0.910 | \n",
" 0.906 | \n",
" 0.903 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.909 | \n",
" 0.903 | \n",
" 0.900 | \n",
" 0.897 | \n",
" 0.892 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 5 edits | \n",
" 10 edits | \n",
" 15 edits | \n",
" 20 edits | \n",
" 25 edits | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 1560 | \n",
" 1893 | \n",
" 2086 | \n",
" 2200 | \n",
" 2301 | \n",
"
\n",
" \n",
" enwiki | \n",
" 22503 | \n",
" 28206 | \n",
" 31350 | \n",
" 33433 | \n",
" 34986 | \n",
"
\n",
" \n",
" eswiki | \n",
" 3898 | \n",
" 4866 | \n",
" 5345 | \n",
" 5655 | \n",
" 5851 | \n",
"
\n",
" \n",
" fawiki | \n",
" 1227 | \n",
" 1702 | \n",
" 2018 | \n",
" 2259 | \n",
" 2442 | \n",
"
\n",
" \n",
" frwiki | \n",
" 2398 | \n",
" 2944 | \n",
" 3252 | \n",
" 3463 | \n",
" 3611 | \n",
"
\n",
" \n",
" idwiki | \n",
" 268 | \n",
" 383 | \n",
" 443 | \n",
" 495 | \n",
" 527 | \n",
"
\n",
" \n",
" itwiki | \n",
" 1230 | \n",
" 1514 | \n",
" 1647 | \n",
" 1745 | \n",
" 1820 | \n",
"
\n",
" \n",
" jawiki | \n",
" 1342 | \n",
" 1889 | \n",
" 2239 | \n",
" 2484 | \n",
" 2691 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 2345 | \n",
" 2848 | \n",
" 3079 | \n",
" 3236 | \n",
" 3361 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 1971 | \n",
" 2373 | \n",
" 2594 | \n",
" 2737 | \n",
" 2833 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 863 | \n",
" 1201 | \n",
" 1393 | \n",
" 1510 | \n",
" 1587 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': edit_count_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': edit_count_interval_counts\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "678c9879-762e-4e57-b288-c77267901ade",
"metadata": {},
"source": [
"Limiting to 15 edits slightly improves the scores without elimating a lot of edits."
]
},
{
"cell_type": "markdown",
"id": "0953fa80-d3dd-4461-88e8-74a90b9b2491",
"metadata": {},
"source": [
"## Time Since User Registration"
]
},
{
"cell_type": "code",
"execution_count": 591,
"id": "c13b8c20-39cd-48ad-80e5-e2ccec28be2f",
"metadata": {},
"outputs": [],
"source": [
"non_anon = init_criteria.query(\"\"\"is_anon == False\"\"\").reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": 443,
"id": "811998da-7e61-471e-b0d3-d188f283a308",
"metadata": {},
"outputs": [],
"source": [
"elapsed_reg_minutes = [1, 5, 30]\n",
"elapsed_reg_hours = [1, 2, 4, 12, 24, 48, 72, non_anon.elapsed_reg.max()/60*60]\n",
"elapsed_reg_time_intervals = [i*60 for i in elapsed_reg_minutes] + [i*60*60 for i in elapsed_reg_hours]\n",
"\n",
"elapsed_reg_column_names = [f'{i} min' for i in elapsed_reg_minutes] + [f'{i} hr' if i<=72 else 'max' for i in elapsed_reg_hours]\n",
"\n",
"elapsed_reg_median_risk = calculate_grouped(non_anon, elapsed_reg_time_intervals, \n",
" 'elapsed_reg', column_names=elapsed_reg_column_names)\n",
"elapsed_reg_interval_counts = calculate_grouped(non_anon, elapsed_reg_time_intervals, \n",
" 'elapsed_reg', column_names=elapsed_reg_column_names, grp_function='count')"
]
},
{
"cell_type": "code",
"execution_count": 484,
"id": "94f582b0-384d-490c-90a9-9d87b7b057f4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 30 min | \n",
" 1 hr | \n",
" 2 hr | \n",
" 4 hr | \n",
" 12 hr | \n",
" 24 hr | \n",
" 48 hr | \n",
" 72 hr | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.953 | \n",
" 0.936 | \n",
" 0.930 | \n",
" 0.930 | \n",
" 0.929 | \n",
" 0.929 | \n",
" 0.928 | \n",
" 0.927 | \n",
" 0.926 | \n",
" 0.924 | \n",
" 0.876 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.937 | \n",
" 0.941 | \n",
" 0.938 | \n",
" 0.936 | \n",
" 0.935 | \n",
" 0.934 | \n",
" 0.934 | \n",
" 0.933 | \n",
" 0.932 | \n",
" 0.931 | \n",
" 0.908 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.962 | \n",
" 0.949 | \n",
" 0.946 | \n",
" 0.945 | \n",
" 0.944 | \n",
" 0.944 | \n",
" 0.943 | \n",
" 0.943 | \n",
" 0.940 | \n",
" 0.939 | \n",
" 0.924 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.929 | \n",
" 0.943 | \n",
" 0.944 | \n",
" 0.944 | \n",
" 0.943 | \n",
" 0.943 | \n",
" 0.942 | \n",
" 0.940 | \n",
" 0.938 | \n",
" 0.936 | \n",
" 0.911 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.945 | \n",
" 0.940 | \n",
" 0.933 | \n",
" 0.931 | \n",
" 0.930 | \n",
" 0.930 | \n",
" 0.930 | \n",
" 0.930 | \n",
" 0.928 | \n",
" 0.927 | \n",
" 0.893 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.932 | \n",
" 0.934 | \n",
" 0.918 | \n",
" 0.915 | \n",
" 0.909 | \n",
" 0.909 | \n",
" 0.909 | \n",
" 0.911 | \n",
" 0.906 | \n",
" 0.905 | \n",
" 0.886 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.968 | \n",
" 0.947 | \n",
" 0.939 | \n",
" 0.940 | \n",
" 0.940 | \n",
" 0.940 | \n",
" 0.938 | \n",
" 0.938 | \n",
" 0.935 | \n",
" 0.935 | \n",
" 0.904 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.950 | \n",
" 0.935 | \n",
" 0.928 | \n",
" 0.926 | \n",
" 0.924 | \n",
" 0.923 | \n",
" 0.921 | \n",
" 0.922 | \n",
" 0.921 | \n",
" 0.919 | \n",
" 0.899 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.918 | \n",
" 0.935 | \n",
" 0.937 | \n",
" 0.936 | \n",
" 0.935 | \n",
" 0.935 | \n",
" 0.935 | \n",
" 0.934 | \n",
" 0.934 | \n",
" 0.933 | \n",
" 0.913 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.942 | \n",
" 0.939 | \n",
" 0.934 | \n",
" 0.934 | \n",
" 0.934 | \n",
" 0.934 | \n",
" 0.932 | \n",
" 0.931 | \n",
" 0.929 | \n",
" 0.928 | \n",
" 0.903 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.966 | \n",
" 0.938 | \n",
" 0.932 | \n",
" 0.933 | \n",
" 0.932 | \n",
" 0.932 | \n",
" 0.928 | \n",
" 0.928 | \n",
" 0.922 | \n",
" 0.920 | \n",
" 0.892 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 30 min | \n",
" 1 hr | \n",
" 2 hr | \n",
" 4 hr | \n",
" 12 hr | \n",
" 24 hr | \n",
" 48 hr | \n",
" 72 hr | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 40 | \n",
" 350 | \n",
" 855 | \n",
" 995 | \n",
" 1050 | \n",
" 1094 | \n",
" 1149 | \n",
" 1191 | \n",
" 1247 | \n",
" 1295 | \n",
" 2299 | \n",
"
\n",
" \n",
" enwiki | \n",
" 794 | \n",
" 6181 | \n",
" 14757 | \n",
" 16826 | \n",
" 18287 | \n",
" 19213 | \n",
" 20335 | \n",
" 21605 | \n",
" 22581 | \n",
" 23170 | \n",
" 34976 | \n",
"
\n",
" \n",
" eswiki | \n",
" 159 | \n",
" 1278 | \n",
" 2855 | \n",
" 3198 | \n",
" 3459 | \n",
" 3578 | \n",
" 3733 | \n",
" 3981 | \n",
" 4212 | \n",
" 4338 | \n",
" 5851 | \n",
"
\n",
" \n",
" fawiki | \n",
" 8 | \n",
" 247 | \n",
" 772 | \n",
" 936 | \n",
" 1038 | \n",
" 1103 | \n",
" 1164 | \n",
" 1258 | \n",
" 1331 | \n",
" 1370 | \n",
" 2442 | \n",
"
\n",
" \n",
" frwiki | \n",
" 53 | \n",
" 652 | \n",
" 1557 | \n",
" 1766 | \n",
" 1906 | \n",
" 1995 | \n",
" 2083 | \n",
" 2160 | \n",
" 2260 | \n",
" 2318 | \n",
" 3611 | \n",
"
\n",
" \n",
" idwiki | \n",
" 8 | \n",
" 56 | \n",
" 153 | \n",
" 194 | \n",
" 230 | \n",
" 242 | \n",
" 246 | \n",
" 270 | \n",
" 284 | \n",
" 296 | \n",
" 527 | \n",
"
\n",
" \n",
" itwiki | \n",
" 69 | \n",
" 434 | \n",
" 833 | \n",
" 925 | \n",
" 1004 | \n",
" 1048 | \n",
" 1097 | \n",
" 1152 | \n",
" 1212 | \n",
" 1227 | \n",
" 1820 | \n",
"
\n",
" \n",
" jawiki | \n",
" 99 | \n",
" 645 | \n",
" 1582 | \n",
" 1726 | \n",
" 1817 | \n",
" 1876 | \n",
" 1950 | \n",
" 2016 | \n",
" 2075 | \n",
" 2111 | \n",
" 2691 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 22 | \n",
" 673 | \n",
" 1697 | \n",
" 1946 | \n",
" 2104 | \n",
" 2185 | \n",
" 2261 | \n",
" 2319 | \n",
" 2370 | \n",
" 2410 | \n",
" 3361 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 46 | \n",
" 544 | \n",
" 1353 | \n",
" 1531 | \n",
" 1627 | \n",
" 1690 | \n",
" 1751 | \n",
" 1841 | \n",
" 1907 | \n",
" 1939 | \n",
" 2833 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 21 | \n",
" 238 | \n",
" 653 | \n",
" 765 | \n",
" 794 | \n",
" 848 | \n",
" 910 | \n",
" 964 | \n",
" 1044 | \n",
" 1072 | \n",
" 1587 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': elapsed_reg_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': elapsed_reg_interval_counts\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "300da015-79da-4684-92fe-2425180957ea",
"metadata": {},
"source": [
"Limiting to 48 hr window significantly improves the scores. However, this only when registered users are considered."
]
},
{
"cell_type": "markdown",
"id": "0acac942-4607-412e-bc82-5b84d69f67cc",
"metadata": {
"tags": []
},
"source": [
"## Time Since User First Revision"
]
},
{
"cell_type": "code",
"execution_count": 459,
"id": "d775040f-910a-4bf7-8585-4b3d4f7b95a4",
"metadata": {},
"outputs": [],
"source": [
"elapsed_first_rev_minutes = [1, 5, 30]\n",
"elapsed_first_rev_hours = [1, 2, 4, 12, 24, 48, 72, non_anon.elapsed_first_rev.max()/60*60]\n",
"elapsed_first_rev_time_intervals = [i*60 for i in elapsed_first_rev_minutes] + [i*60*60 for i in elapsed_first_rev_hours]\n",
"\n",
"elapsed_first_rev_column_names = [f'{i} min' for i in elapsed_first_rev_minutes] + [f'{i} hr' if i<=72 else 'max' for i in elapsed_first_rev_hours]\n",
"\n",
"elapsed_first_rev_median_risk = calculate_grouped(non_anon, elapsed_first_rev_time_intervals, \n",
" 'elapsed_first_rev', column_names=elapsed_first_rev_column_names)\n",
"elapsed_first_rev_counts = calculate_grouped(non_anon, elapsed_first_rev_time_intervals, \n",
" 'elapsed_first_rev', column_names=elapsed_first_rev_column_names, grp_function='count')"
]
},
{
"cell_type": "code",
"execution_count": 485,
"id": "ffb1a034-f114-49ee-93ea-1aeb92221fbd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 30 min | \n",
" 1 hr | \n",
" 2 hr | \n",
" 4 hr | \n",
" 12 hr | \n",
" 24 hr | \n",
" 48 hr | \n",
" 72 hr | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.916 | \n",
" 0.917 | \n",
" 0.917 | \n",
" 0.916 | \n",
" 0.915 | \n",
" 0.914 | \n",
" 0.913 | \n",
" 0.913 | \n",
" 0.910 | \n",
" 0.908 | \n",
" 0.876 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.930 | \n",
" 0.933 | \n",
" 0.932 | \n",
" 0.931 | \n",
" 0.930 | \n",
" 0.930 | \n",
" 0.929 | \n",
" 0.928 | \n",
" 0.926 | \n",
" 0.925 | \n",
" 0.908 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.939 | \n",
" 0.941 | \n",
" 0.941 | \n",
" 0.941 | \n",
" 0.940 | \n",
" 0.940 | \n",
" 0.939 | \n",
" 0.939 | \n",
" 0.937 | \n",
" 0.936 | \n",
" 0.924 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.932 | \n",
" 0.934 | \n",
" 0.939 | \n",
" 0.938 | \n",
" 0.937 | \n",
" 0.937 | \n",
" 0.934 | \n",
" 0.932 | \n",
" 0.930 | \n",
" 0.930 | \n",
" 0.911 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.924 | \n",
" 0.927 | \n",
" 0.927 | \n",
" 0.926 | \n",
" 0.925 | \n",
" 0.925 | \n",
" 0.925 | \n",
" 0.923 | \n",
" 0.921 | \n",
" 0.921 | \n",
" 0.893 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.916 | \n",
" 0.915 | \n",
" 0.907 | \n",
" 0.905 | \n",
" 0.904 | \n",
" 0.904 | \n",
" 0.903 | \n",
" 0.903 | \n",
" 0.901 | \n",
" 0.896 | \n",
" 0.886 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.932 | \n",
" 0.932 | \n",
" 0.931 | \n",
" 0.934 | \n",
" 0.934 | \n",
" 0.932 | \n",
" 0.931 | \n",
" 0.929 | \n",
" 0.929 | \n",
" 0.928 | \n",
" 0.904 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.935 | \n",
" 0.931 | \n",
" 0.925 | \n",
" 0.924 | \n",
" 0.922 | \n",
" 0.921 | \n",
" 0.917 | \n",
" 0.917 | \n",
" 0.916 | \n",
" 0.915 | \n",
" 0.899 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.925 | \n",
" 0.930 | \n",
" 0.932 | \n",
" 0.932 | \n",
" 0.932 | \n",
" 0.932 | \n",
" 0.931 | \n",
" 0.931 | \n",
" 0.930 | \n",
" 0.929 | \n",
" 0.913 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.925 | \n",
" 0.929 | \n",
" 0.927 | \n",
" 0.927 | \n",
" 0.928 | \n",
" 0.927 | \n",
" 0.927 | \n",
" 0.926 | \n",
" 0.925 | \n",
" 0.924 | \n",
" 0.903 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.903 | \n",
" 0.920 | \n",
" 0.925 | \n",
" 0.926 | \n",
" 0.924 | \n",
" 0.922 | \n",
" 0.921 | \n",
" 0.921 | \n",
" 0.917 | \n",
" 0.912 | \n",
" 0.892 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 30 min | \n",
" 1 hr | \n",
" 2 hr | \n",
" 4 hr | \n",
" 12 hr | \n",
" 24 hr | \n",
" 48 hr | \n",
" 72 hr | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 763 | \n",
" 937 | \n",
" 1222 | \n",
" 1300 | \n",
" 1338 | \n",
" 1375 | \n",
" 1424 | \n",
" 1460 | \n",
" 1527 | \n",
" 1568 | \n",
" 2301 | \n",
"
\n",
" \n",
" enwiki | \n",
" 9917 | \n",
" 13471 | \n",
" 19062 | \n",
" 20548 | \n",
" 21640 | \n",
" 22355 | \n",
" 23261 | \n",
" 24472 | \n",
" 25405 | \n",
" 25933 | \n",
" 34986 | \n",
"
\n",
" \n",
" eswiki | \n",
" 1766 | \n",
" 2432 | \n",
" 3406 | \n",
" 3654 | \n",
" 3848 | \n",
" 3974 | \n",
" 4092 | \n",
" 4321 | \n",
" 4524 | \n",
" 4624 | \n",
" 5851 | \n",
"
\n",
" \n",
" fawiki | \n",
" 444 | \n",
" 609 | \n",
" 989 | \n",
" 1125 | \n",
" 1199 | \n",
" 1261 | \n",
" 1319 | \n",
" 1469 | \n",
" 1537 | \n",
" 1570 | \n",
" 2442 | \n",
"
\n",
" \n",
" frwiki | \n",
" 1107 | \n",
" 1464 | \n",
" 1981 | \n",
" 2117 | \n",
" 2222 | \n",
" 2284 | \n",
" 2367 | \n",
" 2438 | \n",
" 2532 | \n",
" 2573 | \n",
" 3611 | \n",
"
\n",
" \n",
" idwiki | \n",
" 111 | \n",
" 143 | \n",
" 215 | \n",
" 268 | \n",
" 289 | \n",
" 300 | \n",
" 320 | \n",
" 341 | \n",
" 355 | \n",
" 376 | \n",
" 527 | \n",
"
\n",
" \n",
" itwiki | \n",
" 570 | \n",
" 770 | \n",
" 1021 | \n",
" 1100 | \n",
" 1134 | \n",
" 1175 | \n",
" 1229 | \n",
" 1283 | \n",
" 1329 | \n",
" 1341 | \n",
" 1820 | \n",
"
\n",
" \n",
" jawiki | \n",
" 670 | \n",
" 1181 | \n",
" 1893 | \n",
" 1977 | \n",
" 2056 | \n",
" 2101 | \n",
" 2176 | \n",
" 2206 | \n",
" 2255 | \n",
" 2279 | \n",
" 2691 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 1048 | \n",
" 1375 | \n",
" 2017 | \n",
" 2214 | \n",
" 2323 | \n",
" 2406 | \n",
" 2482 | \n",
" 2536 | \n",
" 2587 | \n",
" 2624 | \n",
" 3361 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 903 | \n",
" 1196 | \n",
" 1716 | \n",
" 1815 | \n",
" 1874 | \n",
" 1922 | \n",
" 1965 | \n",
" 2049 | \n",
" 2115 | \n",
" 2148 | \n",
" 2833 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 336 | \n",
" 501 | \n",
" 790 | \n",
" 873 | \n",
" 913 | \n",
" 961 | \n",
" 1001 | \n",
" 1052 | \n",
" 1130 | \n",
" 1182 | \n",
" 1587 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': elapsed_first_rev_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': elapsed_first_rev_counts\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "b00e726e-4deb-4bcb-b7ee-5e9405cb42bb",
"metadata": {},
"source": [
"- Limiting to 48 hr window significantly improves the scores. However, this only when registered users are considered.\n",
"- The impact is similar to that of time since user registration, however, time since user's frist edit eliminates less number of edits compared to user registration."
]
},
{
"cell_type": "markdown",
"id": "3049a57c-a4cd-4848-b562-505efe5be05d",
"metadata": {
"tags": []
},
"source": [
"## Time Since User's Previous Edit"
]
},
{
"cell_type": "code",
"execution_count": 506,
"id": "9acd2efe-f1df-4a81-a516-9231c63fe9bb",
"metadata": {},
"outputs": [],
"source": [
"time_user_prev_rev_minutes = [1, 5, 15, 30, 60, 120, non_anon.time_user_prev_rev.max()/60]\n",
"time_user_prev_rev_time_intervals = [i*60 for i in time_user_prev_rev_minutes]\n",
"\n",
"time_user_prev_rev_column_names = [f'{i} min' if i<=120 else 'max' for i in time_user_prev_rev_minutes]\n",
"\n",
"time_user_prev_rev_median_risk = calculate_grouped(non_anon, time_user_prev_rev_time_intervals, \n",
" 'time_user_prev_rev', column_names=time_user_prev_rev_column_names)\n",
"time_user_prev_rev_counts = calculate_grouped(non_anon, time_user_prev_rev_time_intervals, \n",
" 'time_user_prev_rev', column_names=time_user_prev_rev_column_names, grp_function='count')"
]
},
{
"cell_type": "code",
"execution_count": 507,
"id": "f05af18d-002a-4c41-b99b-572f1ee09221",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 15 min | \n",
" 30 min | \n",
" 60 min | \n",
" 120 min | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.893 | \n",
" 0.889 | \n",
" 0.883 | \n",
" 0.882 | \n",
" 0.879 | \n",
" 0.879 | \n",
" 0.848 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.916 | \n",
" 0.915 | \n",
" 0.911 | \n",
" 0.910 | \n",
" 0.909 | \n",
" 0.909 | \n",
" 0.897 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.931 | \n",
" 0.929 | \n",
" 0.926 | \n",
" 0.926 | \n",
" 0.926 | \n",
" 0.926 | \n",
" 0.917 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.918 | \n",
" 0.916 | \n",
" 0.914 | \n",
" 0.913 | \n",
" 0.912 | \n",
" 0.912 | \n",
" 0.905 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.925 | \n",
" 0.906 | \n",
" 0.903 | \n",
" 0.901 | \n",
" 0.900 | \n",
" 0.897 | \n",
" 0.877 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.885 | \n",
" 0.885 | \n",
" 0.884 | \n",
" 0.883 | \n",
" 0.884 | \n",
" 0.884 | \n",
" 0.879 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.943 | \n",
" 0.917 | \n",
" 0.909 | \n",
" 0.907 | \n",
" 0.905 | \n",
" 0.906 | \n",
" 0.894 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.927 | \n",
" 0.915 | \n",
" 0.910 | \n",
" 0.908 | \n",
" 0.908 | \n",
" 0.907 | \n",
" 0.897 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.931 | \n",
" 0.925 | \n",
" 0.922 | \n",
" 0.921 | \n",
" 0.921 | \n",
" 0.921 | \n",
" 0.908 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.904 | \n",
" 0.911 | \n",
" 0.909 | \n",
" 0.908 | \n",
" 0.907 | \n",
" 0.908 | \n",
" 0.894 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.897 | \n",
" 0.909 | \n",
" 0.906 | \n",
" 0.904 | \n",
" 0.903 | \n",
" 0.902 | \n",
" 0.890 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 15 min | \n",
" 30 min | \n",
" 60 min | \n",
" 120 min | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 189 | \n",
" 658 | \n",
" 891 | \n",
" 960 | \n",
" 1003 | \n",
" 1031 | \n",
" 1571 | \n",
"
\n",
" \n",
" enwiki | \n",
" 4042 | \n",
" 13345 | \n",
" 17506 | \n",
" 18766 | \n",
" 19514 | \n",
" 20040 | \n",
" 25824 | \n",
"
\n",
" \n",
" eswiki | \n",
" 700 | \n",
" 2344 | \n",
" 3036 | \n",
" 3242 | \n",
" 3347 | \n",
" 3429 | \n",
" 4237 | \n",
"
\n",
" \n",
" fawiki | \n",
" 268 | \n",
" 1046 | \n",
" 1378 | \n",
" 1470 | \n",
" 1512 | \n",
" 1533 | \n",
" 2020 | \n",
"
\n",
" \n",
" frwiki | \n",
" 309 | \n",
" 1217 | \n",
" 1653 | \n",
" 1785 | \n",
" 1876 | \n",
" 1937 | \n",
" 2574 | \n",
"
\n",
" \n",
" idwiki | \n",
" 55 | \n",
" 211 | \n",
" 289 | \n",
" 312 | \n",
" 327 | \n",
" 335 | \n",
" 421 | \n",
"
\n",
" \n",
" itwiki | \n",
" 224 | \n",
" 671 | \n",
" 896 | \n",
" 952 | \n",
" 987 | \n",
" 1010 | \n",
" 1307 | \n",
"
\n",
" \n",
" jawiki | \n",
" 1045 | \n",
" 1716 | \n",
" 1920 | \n",
" 1992 | \n",
" 2013 | \n",
" 2035 | \n",
" 2266 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 338 | \n",
" 1235 | \n",
" 1656 | \n",
" 1774 | \n",
" 1833 | \n",
" 1867 | \n",
" 2371 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 244 | \n",
" 992 | \n",
" 1332 | \n",
" 1423 | \n",
" 1481 | \n",
" 1512 | \n",
" 1977 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 237 | \n",
" 722 | \n",
" 927 | \n",
" 980 | \n",
" 1011 | \n",
" 1031 | \n",
" 1277 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': time_user_prev_rev_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': time_user_prev_rev_counts\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "ec32ee06-b272-46b7-b18a-c24ac6c6ff8c",
"metadata": {},
"source": [
"While resitricting improves the score, a susbsantial number of edits will be elimated for no significant benefit."
]
},
{
"cell_type": "markdown",
"id": "b459d7cc-d95f-4704-bf0d-61fe9fe2609e",
"metadata": {
"tags": []
},
"source": [
"## Time Since Page's Previous Edit"
]
},
{
"cell_type": "code",
"execution_count": 463,
"id": "c37a401a-f2c5-45ce-bce8-e50f509d6cfb",
"metadata": {},
"outputs": [],
"source": [
"time_page_prev_rev_minutes = [1, 5, 15, 30, 60, init_criteria.time_page_prev_rev.max()/60]\n",
"time_page_prev_rev_time_intervals = [i*60 for i in time_page_prev_rev_minutes]\n",
"\n",
"time_page_prev_rev_column_names = [f'{i} min' if i<=60 else 'max' for i in time_page_prev_rev_minutes]\n",
"\n",
"time_page_prev_rev_median_risk = calculate_grouped(init_criteria, time_page_prev_rev_time_intervals, \n",
" 'time_page_prev_rev', column_names=time_page_prev_rev_column_names)\n",
"time_page_prev_rev_counts = calculate_grouped(init_criteria, time_page_prev_rev_time_intervals, \n",
" 'time_page_prev_rev', column_names=time_page_prev_rev_column_names, grp_function='count')"
]
},
{
"cell_type": "code",
"execution_count": 487,
"id": "bce8bbe7-2c20-44ac-aa13-9cdbfdfb84ad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 15 min | \n",
" 30 min | \n",
" 60 min | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.932 | \n",
" 0.918 | \n",
" 0.913 | \n",
" 0.911 | \n",
" 0.910 | \n",
" 0.902 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.922 | \n",
" 0.916 | \n",
" 0.913 | \n",
" 0.912 | \n",
" 0.912 | \n",
" 0.911 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.932 | \n",
" 0.927 | \n",
" 0.924 | \n",
" 0.923 | \n",
" 0.923 | \n",
" 0.923 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.943 | \n",
" 0.931 | \n",
" 0.927 | \n",
" 0.927 | \n",
" 0.927 | \n",
" 0.916 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.934 | \n",
" 0.919 | \n",
" 0.915 | \n",
" 0.913 | \n",
" 0.912 | \n",
" 0.903 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.913 | \n",
" 0.911 | \n",
" 0.908 | \n",
" 0.907 | \n",
" 0.908 | \n",
" 0.902 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.934 | \n",
" 0.926 | \n",
" 0.922 | \n",
" 0.921 | \n",
" 0.920 | \n",
" 0.920 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.916 | \n",
" 0.892 | \n",
" 0.887 | \n",
" 0.885 | \n",
" 0.883 | \n",
" 0.876 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.937 | \n",
" 0.928 | \n",
" 0.924 | \n",
" 0.923 | \n",
" 0.921 | \n",
" 0.913 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.928 | \n",
" 0.923 | \n",
" 0.920 | \n",
" 0.918 | \n",
" 0.918 | \n",
" 0.914 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.896 | \n",
" 0.888 | \n",
" 0.885 | \n",
" 0.885 | \n",
" 0.884 | \n",
" 0.883 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 1 min | \n",
" 5 min | \n",
" 15 min | \n",
" 30 min | \n",
" 60 min | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 1440 | \n",
" 3398 | \n",
" 3987 | \n",
" 4192 | \n",
" 4411 | \n",
" 16829 | \n",
"
\n",
" \n",
" enwiki | \n",
" 20828 | \n",
" 50764 | \n",
" 60095 | \n",
" 63694 | \n",
" 66835 | \n",
" 172584 | \n",
"
\n",
" \n",
" eswiki | \n",
" 6968 | \n",
" 17044 | \n",
" 19714 | \n",
" 20687 | \n",
" 21526 | \n",
" 55105 | \n",
"
\n",
" \n",
" fawiki | \n",
" 1086 | \n",
" 3078 | \n",
" 3671 | \n",
" 3860 | \n",
" 4021 | \n",
" 9967 | \n",
"
\n",
" \n",
" frwiki | \n",
" 1906 | \n",
" 5315 | \n",
" 6332 | \n",
" 6647 | \n",
" 6912 | \n",
" 19375 | \n",
"
\n",
" \n",
" idwiki | \n",
" 465 | \n",
" 1073 | \n",
" 1286 | \n",
" 1370 | \n",
" 1442 | \n",
" 3554 | \n",
"
\n",
" \n",
" itwiki | \n",
" 3027 | \n",
" 6706 | \n",
" 7844 | \n",
" 8220 | \n",
" 8561 | \n",
" 23440 | \n",
"
\n",
" \n",
" jawiki | \n",
" 1625 | \n",
" 3389 | \n",
" 3973 | \n",
" 4217 | \n",
" 4450 | \n",
" 10170 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 340 | \n",
" 967 | \n",
" 1235 | \n",
" 1318 | \n",
" 1377 | \n",
" 3361 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 2323 | \n",
" 6326 | \n",
" 7420 | \n",
" 7792 | \n",
" 8089 | \n",
" 23587 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 1035 | \n",
" 2623 | \n",
" 3168 | \n",
" 3388 | \n",
" 3581 | \n",
" 7568 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': time_page_prev_rev_median_risk.style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': time_page_prev_rev_counts\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "78939d4d-236e-41f2-b8d6-95467ab8f742",
"metadata": {},
"source": [
"While resitricting improves the score, a susbsantial number of edits will be elimated for no significant benefit."
]
},
{
"cell_type": "markdown",
"id": "479739c4-ccda-4173-a471-9257720ec1a4",
"metadata": {},
"source": [
"## Bytes Diff"
]
},
{
"cell_type": "code",
"execution_count": 494,
"id": "b38691ad-c6a4-4a64-9eb0-d40993031666",
"metadata": {},
"outputs": [],
"source": [
"warnings.filterwarnings('ignore')\n",
"\n",
"bytes_diff_intervals = [0, 1, 5, 10, 100, 500, 1000, 5000, init_criteria.rev_bytes_diff.abs().max()]\n",
"\n",
"bytes_diff_column_labels = ['min'] + bytes_diff_intervals[1:-1] + ['max']\n",
"\n",
"bytes_diff_median_risk = calculate_grouped(init_criteria, bytes_diff_intervals, \n",
" 'rev_bytes_diff', column_names=bytes_diff_column_labels)\n",
"bytes_diff_counts = calculate_grouped(init_criteria, bytes_diff_intervals, \n",
" 'rev_bytes_diff', column_names=bytes_diff_column_labels, grp_function='count')"
]
},
{
"cell_type": "code",
"execution_count": 495,
"id": "3a8b18a4-17f8-42e8-9ae2-476e3d758bc2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Median Risk \n",
"
\n",
" \n",
" \n",
" | \n",
" min | \n",
" 1 | \n",
" 5 | \n",
" 10 | \n",
" 100 | \n",
" 500 | \n",
" 1000 | \n",
" 5000 | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 0.901 | \n",
" 0.905 | \n",
" 0.912 | \n",
" 0.912 | \n",
" 0.915 | \n",
" 0.968 | \n",
" 0.983 | \n",
" 0.993 | \n",
" 0.000 | \n",
"
\n",
" \n",
" enwiki | \n",
" 0.910 | \n",
" 0.912 | \n",
" 0.915 | \n",
" 0.915 | \n",
" 0.917 | \n",
" 0.965 | \n",
" 0.978 | \n",
" 0.986 | \n",
" 0.000 | \n",
"
\n",
" \n",
" eswiki | \n",
" 0.922 | \n",
" 0.924 | \n",
" 0.928 | \n",
" 0.929 | \n",
" 0.943 | \n",
" 0.978 | \n",
" 0.984 | \n",
" 0.943 | \n",
" 0.963 | \n",
"
\n",
" \n",
" fawiki | \n",
" 0.916 | \n",
" 0.917 | \n",
" 0.920 | \n",
" 0.920 | \n",
" 0.921 | \n",
" 0.951 | \n",
" 0.973 | \n",
" 0.978 | \n",
" 0.000 | \n",
"
\n",
" \n",
" frwiki | \n",
" 0.903 | \n",
" 0.905 | \n",
" 0.908 | \n",
" 0.908 | \n",
" 0.916 | \n",
" 0.974 | \n",
" 0.983 | \n",
" 0.992 | \n",
" 0.000 | \n",
"
\n",
" \n",
" idwiki | \n",
" 0.902 | \n",
" 0.905 | \n",
" 0.906 | \n",
" 0.908 | \n",
" 0.919 | \n",
" 0.976 | \n",
" 0.983 | \n",
" 0.979 | \n",
" 0.000 | \n",
"
\n",
" \n",
" itwiki | \n",
" 0.917 | \n",
" 0.919 | \n",
" 0.921 | \n",
" 0.921 | \n",
" 0.928 | \n",
" 0.978 | \n",
" 0.987 | \n",
" 0.995 | \n",
" 0.000 | \n",
"
\n",
" \n",
" jawiki | \n",
" 0.868 | \n",
" 0.871 | \n",
" 0.875 | \n",
" 0.876 | \n",
" 0.896 | \n",
" 0.961 | \n",
" 0.974 | \n",
" 0.979 | \n",
" 0.000 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 0.912 | \n",
" 0.914 | \n",
" 0.917 | \n",
" 0.916 | \n",
" 0.906 | \n",
" 0.919 | \n",
" 0.912 | \n",
" 0.914 | \n",
" 0.000 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 0.913 | \n",
" 0.915 | \n",
" 0.919 | \n",
" 0.921 | \n",
" 0.931 | \n",
" 0.974 | \n",
" 0.983 | \n",
" 0.990 | \n",
" 0.000 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 0.883 | \n",
" 0.886 | \n",
" 0.890 | \n",
" 0.891 | \n",
" 0.915 | \n",
" 0.965 | \n",
" 0.976 | \n",
" 0.985 | \n",
" 0.000 | \n",
"
\n",
" \n",
"
\n",
"
Number of Edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" min | \n",
" 1 | \n",
" 5 | \n",
" 10 | \n",
" 100 | \n",
" 500 | \n",
" 1000 | \n",
" 5000 | \n",
" max | \n",
"
\n",
" \n",
" wiki_db | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" dewiki | \n",
" 16711 | \n",
" 15566 | \n",
" 12420 | \n",
" 10723 | \n",
" 3840 | \n",
" 1491 | \n",
" 894 | \n",
" 232 | \n",
" 0 | \n",
"
\n",
" \n",
" enwiki | \n",
" 171191 | \n",
" 159106 | \n",
" 131246 | \n",
" 114246 | \n",
" 42246 | \n",
" 15488 | \n",
" 9268 | \n",
" 1944 | \n",
" 0 | \n",
"
\n",
" \n",
" eswiki | \n",
" 54949 | \n",
" 51473 | \n",
" 41913 | \n",
" 35805 | \n",
" 12103 | \n",
" 5167 | \n",
" 3242 | \n",
" 183 | \n",
" 1 | \n",
"
\n",
" \n",
" fawiki | \n",
" 9857 | \n",
" 9387 | \n",
" 8041 | \n",
" 7269 | \n",
" 3135 | \n",
" 1046 | \n",
" 592 | \n",
" 86 | \n",
" 0 | \n",
"
\n",
" \n",
" frwiki | \n",
" 19263 | \n",
" 18155 | \n",
" 15031 | \n",
" 13282 | \n",
" 5303 | \n",
" 2303 | \n",
" 1537 | \n",
" 430 | \n",
" 0 | \n",
"
\n",
" \n",
" idwiki | \n",
" 3526 | \n",
" 3261 | \n",
" 2773 | \n",
" 2397 | \n",
" 824 | \n",
" 303 | \n",
" 168 | \n",
" 31 | \n",
" 0 | \n",
"
\n",
" \n",
" itwiki | \n",
" 22761 | \n",
" 21010 | \n",
" 16844 | \n",
" 14480 | \n",
" 4761 | \n",
" 1756 | \n",
" 1064 | \n",
" 295 | \n",
" 0 | \n",
"
\n",
" \n",
" jawiki | \n",
" 9659 | \n",
" 8968 | \n",
" 7751 | \n",
" 6926 | \n",
" 2999 | \n",
" 1357 | \n",
" 898 | \n",
" 224 | \n",
" 0 | \n",
"
\n",
" \n",
" ptwiki | \n",
" 3339 | \n",
" 3153 | \n",
" 2693 | \n",
" 2446 | \n",
" 1038 | \n",
" 402 | \n",
" 217 | \n",
" 46 | \n",
" 0 | \n",
"
\n",
" \n",
" ruwiki | \n",
" 23264 | \n",
" 21810 | \n",
" 18545 | \n",
" 16677 | \n",
" 7075 | \n",
" 2975 | \n",
" 1994 | \n",
" 608 | \n",
" 0 | \n",
"
\n",
" \n",
" zhwiki | \n",
" 7482 | \n",
" 6622 | \n",
" 5646 | \n",
" 4869 | \n",
" 1825 | \n",
" 854 | \n",
" 546 | \n",
" 120 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_h({\n",
" 'Median Risk': bytes_diff_median_risk.fillna(0).style.background_gradient(cmap ='viridis_r').format(\"{:.3f}\"),\n",
" 'Number of Edits': bytes_diff_counts.fillna(0).astype(int)\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "31bed787-0a34-407f-b552-bd3628f6d026",
"metadata": {},
"source": [
"Restricting to have at least 5 bytes difference provides a good balance between the score and the number of edits"
]
},
{
"cell_type": "markdown",
"id": "b92bd161-00c3-42e7-ad73-3e8c646abd2a",
"metadata": {},
"source": [
"## Incremental criteria"
]
},
{
"cell_type": "markdown",
"id": "95bbe63a-40ee-4c7a-b1ff-5fef43c06d25",
"metadata": {},
"source": [
"Based on the above results, we will incrementally apply additional restrictions\n",
"- Reverted within 12 hours\n",
"- User edit count less 15 edits\n",
"- Time since user's first edit is less than 48 hours\n",
"- Absolute bytes difference is more than 5 bytes"
]
},
{
"cell_type": "code",
"execution_count": 512,
"id": "7f10c1a0-13a8-46ab-9eb4-9ccd7987f579",
"metadata": {},
"outputs": [],
"source": [
"init_criteria['abs_bytes_diff'] = init_criteria['rev_bytes_diff'].abs()"
]
},
{
"cell_type": "code",
"execution_count": 564,
"id": "5ed4c037-3230-4a04-a5ef-440b5a55552d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"
Initial
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.901974 | \n",
" 16829 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.910679 | \n",
" 172584 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.922596 | \n",
" 55105 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.916366 | \n",
" 9967 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.903316 | \n",
" 19375 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.902464 | \n",
" 3554 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.919648 | \n",
" 23440 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.875682 | \n",
" 10170 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.913064 | \n",
" 3361 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.914291 | \n",
" 23587 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.883454 | \n",
" 7568 | \n",
"
\n",
" \n",
"
\n",
"
+ Reverted within 12 hours
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.904239 | \n",
" 16077 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.912205 | \n",
" 162439 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.923474 | \n",
" 52922 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.916792 | \n",
" 9228 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.905588 | \n",
" 18401 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.901994 | \n",
" 3231 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.921301 | \n",
" 22077 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.879789 | \n",
" 9401 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.914363 | \n",
" 3147 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.916403 | \n",
" 22250 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.886989 | \n",
" 6880 | \n",
"
\n",
" \n",
"
\n",
"
+ User Edit Count <= 15 edits
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.904503 | \n",
" 16061 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.912847 | \n",
" 160889 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.923850 | \n",
" 52696 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.918056 | \n",
" 9136 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.906304 | \n",
" 18285 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.902892 | \n",
" 3190 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.921365 | \n",
" 22011 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.880116 | \n",
" 9109 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.916916 | \n",
" 3079 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.916746 | \n",
" 22204 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.887588 | \n",
" 6819 | \n",
"
\n",
" \n",
"
\n",
"
+ Time Since First Edit <= 48 hrs
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.907555 | \n",
" 15468 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.915196 | \n",
" 153858 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.924792 | \n",
" 51696 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.920468 | \n",
" 8539 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.909034 | \n",
" 17489 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.905071 | \n",
" 3067 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.922709 | \n",
" 21633 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.882525 | \n",
" 8828 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.930669 | \n",
" 2458 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.918103 | \n",
" 21661 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.890380 | \n",
" 6481 | \n",
"
\n",
" \n",
"
\n",
"
+ Absolute Bytes Diff >= 5 bytes
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wiki_db | \n",
" median_risk | \n",
" n_edits | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" dewiki | \n",
" 0.917214 | \n",
" 11281 | \n",
"
\n",
" \n",
" 1 | \n",
" enwiki | \n",
" 0.920194 | \n",
" 115997 | \n",
"
\n",
" \n",
" 2 | \n",
" eswiki | \n",
" 0.930483 | \n",
" 39239 | \n",
"
\n",
" \n",
" 3 | \n",
" fawiki | \n",
" 0.924352 | \n",
" 6734 | \n",
"
\n",
" \n",
" 4 | \n",
" frwiki | \n",
" 0.913709 | \n",
" 13492 | \n",
"
\n",
" \n",
" 5 | \n",
" idwiki | \n",
" 0.910019 | \n",
" 2361 | \n",
"
\n",
" \n",
" 6 | \n",
" itwiki | \n",
" 0.924533 | \n",
" 15505 | \n",
"
\n",
" \n",
" 7 | \n",
" jawiki | \n",
" 0.883670 | \n",
" 6679 | \n",
"
\n",
" \n",
" 8 | \n",
" ptwiki | \n",
" 0.934228 | \n",
" 1855 | \n",
"
\n",
" \n",
" 9 | \n",
" ruwiki | \n",
" 0.923788 | \n",
" 16914 | \n",
"
\n",
" \n",
" 10 | \n",
" zhwiki | \n",
" 0.896337 | \n",
" 4813 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def calculate_median_risk_and_count(df, criteria, time_to_revert_limit=12*60*60):\n",
" \n",
" query_string = f\"time_to_revert <= {time_to_revert_limit} \" + (\"& \" + criteria if criteria else \"\")\n",
" filtered_df = df.query(query_string)\n",
" aggregated_df = filtered_df.groupby('wiki_db').agg({'risk': 'median', 'rev_id': 'count'})\n",
" aggregated_df.rename({'rev_id': 'n_edits', 'risk': 'median_risk'}, inplace=True, axis=1)\n",
" \n",
" return aggregated_df.reset_index()\n",
"\n",
"criteria_conditions = {\n",
" 'Initial': init_criteria_risk,\n",
" '+ Reverted within 12 hours': '',\n",
" '+ User Edit Count <= 15 edits': \"(is_anon == True) | (user_edit_count <= 15)\",\n",
" '+ Time Since First Edit <= 48 hrs': \"(is_anon == True) | ((user_edit_count <= 15) & (elapsed_first_rev < 48*60*60))\",\n",
" '+ Absolute Bytes Diff >= 5 bytes': \"(abs_bytes_diff >= 5) & ((is_anon == True) | ((user_edit_count <= 15) & (elapsed_first_rev < 48*60*60)))\"\n",
"}\n",
"\n",
"results = {label: calculate_median_risk_and_count(init_criteria, criteria) if label != 'Initial' \\\n",
" else init_criteria_risk for label, criteria in criteria_conditions.items()}\n",
"display_h(results)"
]
},
{
"cell_type": "markdown",
"id": "f44014bf-be98-439f-8d58-672ae7fc0504",
"metadata": {},
"source": [
"- Restricting user related related metrics make minor improvements to the median risk, as majority of the reverted edits are made by anonymous users.\n",
"- While having at least an n number of absolute bytes difference, a substantial number of edits are elimiated, as compared to the initial criteria.\n",
"- In addition to the time to revert, absolute bytes difference is only the control factor available for anonymous edits."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}