{ "cells": [ { "cell_type": "markdown", "id": "6abfc4e5-9349-487d-b549-ca341a89666c", "metadata": {}, "source": [ "# QA of BlockedEditAttemptInstrumentation\n", "\n", "[Task](https://phabricator.wikimedia.org/T310390)\n", "\n", "[Schema](https://schema.wikimedia.org/#!/secondary/jsonschema/analytics/mediawiki/editattemptsblocked)" ] }, { "cell_type": "markdown", "id": "114986a7-9b67-4a05-b5dd-791eae9b202f", "metadata": {}, "source": [ "## QA Notes / Issues\n", "* Schema documentation lists the stream as 'mediawiki.editattemptsblocked' but it appears in Hive as mediawiki_editattempt_block. Can we update the documentation to be consistent?\n", "* About 1.1% of events have a country_code that was not logged correctly. These appear as a long string of characters and numbers (e.g. '()&%'). Per T310390#8333040, these appear to be users injecting XSS vectors via the GeoIP cookie." ] }, { "cell_type": "code", "execution_count": 4, "id": "a1edd401-b08c-4510-a731-7faa4c116e00", "metadata": {}, "outputs": [], "source": [ "shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))\n", "shhh({\n", " library(tidyverse); library(glue); library(lubridate); library(scales)\n", "})\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "d1ef3c2d-1b30-4041-9edc-905bcc5b5da8", "metadata": {}, "outputs": [], "source": [ "# Collect all events from new instrumentation to review\n", "\n", "query <-\n", "\"SELECT\n", " date_format(dt, 'yyyy-MM-dd') AS block_time,\n", "block_type,\n", "block_expiry,\n", "block_id,\n", "block_scope,\n", "interface,\n", "country_code,\n", "platform,\n", "`database`,\n", "page_id,\n", "page_namespace,\n", "rev_id,\n", "performer.user_id,\n", "performer.user_edit_count,\n", "http.client_ip\n", "FROM\n", "event.mediawiki_editattempt_block\n", "WHERE\n", "YEAR = 2022\n", "\"" ] }, { "cell_type": "code", "execution_count": 10, "id": "08ae9dd7-eb09-40c2-86af-3f166e473b89", "metadata": {}, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "Don't forget to authenticate with Kerberos using kinit\n", "\n" ] } ], "source": [ "edits_blocked_events <- wmfdata::query_hive(query)" ] }, { "cell_type": "markdown", "id": "3a9be366-0e09-48aa-93b0-7999de88cdb4", "metadata": {}, "source": [ "## Block Types" ] }, { "cell_type": "code", "execution_count": 11, "id": "1f210f0b-e1d2-45d9-a470-f5967097b535", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 4 × 4
block_type | n_events | n_users | pct_users
<chr> | <int> | <int> | <dbl>
autoblock | 1520 | 124 | 0.017
ip | 1466221 | 1177 | 0.162
range | 6598136 | 3169 | 0.437
user | 10860 | 2782 | 0.384
\n" ], "text/latex": [ "A tibble: 4 × 4\n", "\\begin{tabular}{llll}\n", " block\\_type & n\\_events & n\\_users & pct\\_users\\\\\n", " & & & \\\\\n", "\\hline\n", "\t autoblock & 1520 & 124 & 0.017\\\\\n", "\t ip & 1466221 & 1177 & 0.162\\\\\n", "\t range & 6598136 & 3169 & 0.437\\\\\n", "\t user & 10860 & 2782 & 0.384\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 4 × 4\n", "\n", "| block_type <chr> | n_events <int> | n_users <int> | pct_users <dbl> |\n", "|---|---|---|---|\n", "| autoblock | 1520 | 124 | 0.017 |\n", "| ip | 1466221 | 1177 | 0.162 |\n", "| range | 6598136 | 3169 | 0.437 |\n", "| user | 10860 | 2782 | 0.384 |\n", "\n" ], "text/plain": [ " block_type n_events n_users pct_users\n", "1 autoblock 1520 124 0.017 \n", "2 ip 1466221 1177 0.162 \n", "3 range 6598136 3169 0.437 \n", "4 user 10860 2782 0.384 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_bytype <- edits_blocked_events %>%\n", " group_by(block_type) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_users = round(n_users / sum(n_users), 3)) \n", "\n", "edits_blocked_bytype" ] }, { "cell_type": "markdown", "id": "03ec2b22-6e39-4cd0-96fd-5609f7e7c777", "metadata": {}, "source": [ "## Block Events by platform" ] }, { "cell_type": "code", "execution_count": 12, "id": "d320e46a-082a-4a2c-9228-55c374194964", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\n", "
A tibble: 2 × 4
platform | n_events | n_users | pct_events
<chr> | <int> | <int> | <dbl>
desktop | 7402183 | 5691 | 0.916
mobile | 674554 | 1601 | 0.084
\n" ], "text/latex": [ "A tibble: 2 × 4\n", "\\begin{tabular}{llll}\n", " platform & n\\_events & n\\_users & pct\\_events\\\\\n", " & & & \\\\\n", "\\hline\n", "\t desktop & 7402183 & 5691 & 0.916\\\\\n", "\t mobile & 674554 & 1601 & 0.084\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 2 × 4\n", "\n", "| platform <chr> | n_events <int> | n_users <int> | pct_events <dbl> |\n", "|---|---|---|---|\n", "| desktop | 7402183 | 5691 | 0.916 |\n", "| mobile | 674554 | 1601 | 0.084 |\n", "\n" ], "text/plain": [ " platform n_events n_users pct_events\n", "1 desktop 7402183 5691 0.916 \n", "2 mobile 674554 1601 0.084 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_byplatform <- edits_blocked_events %>%\n", " group_by(platform) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_events = round(n_events / sum(n_events), 3)) \n", "\n", "edits_blocked_byplatform" ] }, { "cell_type": "markdown", "id": "2786a8ae-b8de-4068-9c0f-a4150b1296b3", "metadata": {}, "source": [ "## Block Events by Interface" ] }, { "cell_type": "code", "execution_count": 13, "id": "ba437592-c60d-484d-8453-debf3088ff53", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 4 × 4
interface | n_events | n_users | pct_users
<chr> | <int> | <int> | <dbl>
discussiontools | 7880 | 504 | 0.062
mobilefrontend | 673580 | 1583 | 0.193
visualeditor | 213207 | 2094 | 0.256
wikieditor | 7182070 | 4009 | 0.489
\n" ], "text/latex": [ "A tibble: 4 × 4\n", "\\begin{tabular}{llll}\n", " interface & n\\_events & n\\_users & pct\\_users\\\\\n", " & & & \\\\\n", "\\hline\n", "\t discussiontools & 7880 & 504 & 0.062\\\\\n", "\t mobilefrontend & 673580 & 1583 & 0.193\\\\\n", "\t visualeditor & 213207 & 2094 & 0.256\\\\\n", "\t wikieditor & 7182070 & 4009 & 0.489\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 4 × 4\n", "\n", "| interface <chr> | n_events <int> | n_users <int> | pct_users <dbl> |\n", "|---|---|---|---|\n", "| discussiontools | 7880 | 504 | 0.062 |\n", "| mobilefrontend | 673580 | 1583 | 0.193 |\n", "| visualeditor | 213207 | 2094 | 0.256 |\n", "| wikieditor | 7182070 | 4009 | 0.489 |\n", "\n" ], "text/plain": [ " interface n_events n_users pct_users\n", "1 discussiontools 7880 504 0.062 \n", "2 mobilefrontend 673580 1583 0.193 \n", "3 visualeditor 213207 2094 0.256 \n", "4 wikieditor 7182070 4009 0.489 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_byinterface <- edits_blocked_events %>%\n", " group_by(interface) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_users = round(n_users / sum(n_users), 3)) \n", "\n", "edits_blocked_byinterface" ] }, { "cell_type": "markdown", "id": "4125da0e-cde3-43ff-99d7-82338e6be613", "metadata": {}, "source": [ "## Block Events by anon or logged in" ] }, { "cell_type": "code", "execution_count": 14, "id": "289f77ae-aee5-4a95-83dd-77f559613c97", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\n", "
A tibble: 2 × 3
isanon | n_events | n_users
<chr> | <int> | <int>
false | 21455 | 6993
true | 8055282 | 1
\n" ], "text/latex": [ "A tibble: 2 × 3\n", "\\begin{tabular}{lll}\n", " isanon & n\\_events & n\\_users\\\\\n", " & & \\\\\n", "\\hline\n", "\t false & 21455 & 6993\\\\\n", "\t true & 8055282 & 1\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 2 × 3\n", "\n", "| isanon <chr> | n_events <int> | n_users <int> |\n", "|---|---|---|\n", "| false | 21455 | 6993 |\n", "| true | 8055282 | 1 |\n", "\n" ], "text/plain": [ " isanon n_events n_users\n", "1 false 21455 6993 \n", "2 true 8055282 1 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_byanon <- edits_blocked_events %>%\n", " mutate(isanon = ifelse(user_id == 0, \"true\", \"false\")) %>%\n", " group_by(isanon) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) \n", "\n", "edits_blocked_byanon" ] }, { "cell_type": "markdown", "id": "51710730-ed2c-4c3e-8131-bbcf3203979c", "metadata": {}, "source": [ "Confirmed that blocked edit attempts from both anonymous and registered users are included. All anonymous events carry a user ID of 0, so they count as a single distinct user in `n_users`." ] }, { "cell_type": "markdown", "id": "2a28e406-c8da-4c23-8fdb-7629bd7fa6eb", "metadata": {}, "source": [ "## Block Events by Block Scope" ] }, { "cell_type": "code", "execution_count": 15, "id": "47c3b972-ff7a-40be-b62c-10788b73832f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\n", "
A tibble: 2 × 4
block_scope | n_events | n_users | pct_events
<chr> | <int> | <int> | <dbl>
global | 2878909 | 1197 | 0.356
local | 5197828 | 5893 | 0.644
\n" ], "text/latex": [ "A tibble: 2 × 4\n", "\\begin{tabular}{llll}\n", " block\\_scope & n\\_events & n\\_users & pct\\_events\\\\\n", " & & & \\\\\n", "\\hline\n", "\t global & 2878909 & 1197 & 0.356\\\\\n", "\t local & 5197828 & 5893 & 0.644\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 2 × 4\n", "\n", "| block_scope <chr> | n_events <int> | n_users <int> | pct_events <dbl> |\n", "|---|---|---|---|\n", "| global | 2878909 | 1197 | 0.356 |\n", "| local | 5197828 | 5893 | 0.644 |\n", "\n" ], "text/plain": [ " block_scope n_events n_users pct_events\n", "1 global 2878909 1197 0.356 \n", "2 local 5197828 5893 0.644 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_byscope <- edits_blocked_events %>%\n", " group_by(block_scope) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_events = round(n_events / sum(n_events), 3)) \n", "\n", "edits_blocked_byscope" ] }, { "cell_type": "markdown", "id": "1b318106-a663-4c7a-ba70-0bad23af43fe", "metadata": {}, "source": [ "## Block Expiration" ] }, { "cell_type": "code", "execution_count": 16, "id": "aed3d7b6-ccc1-4a7b-b1d9-f09472f4fa2d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\n", "
A tibble: 2 × 4
is_infinite | n_events | n_users | pct_users
<chr> | <int> | <int> | <dbl>
false | 7828840 | 4386 | 0.622
true | 247897 | 2661 | 0.378
\n" ], "text/latex": [ "A tibble: 2 × 4\n", "\\begin{tabular}{llll}\n", " is\\_infinite & n\\_events & n\\_users & pct\\_users\\\\\n", " & & & \\\\\n", "\\hline\n", "\t false & 7828840 & 4386 & 0.622\\\\\n", "\t true & 247897 & 2661 & 0.378\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 2 × 4\n", "\n", "| is_infinite <chr> | n_events <int> | n_users <int> | pct_users <dbl> |\n", "|---|---|---|---|\n", "| false | 7828840 | 4386 | 0.622 |\n", "| true | 247897 | 2661 | 0.378 |\n", "\n" ], "text/plain": [ " is_infinite n_events n_users pct_users\n", "1 false 7828840 4386 0.622 \n", "2 true 247897 2661 0.378 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_byexpiry <- edits_blocked_events %>%\n", " mutate(is_infinite = ifelse(block_expiry == 'infinity', \"true\", \"false\")) %>%\n", " group_by(is_infinite) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_users = round(n_users / sum(n_users), 3)) \n", "\n", "edits_blocked_byexpiry" ] }, { "cell_type": "markdown", "id": "e2d2f4b7-7843-4d64-b554-dbcba45bd97f", "metadata": {}, "source": [ "Both infinite and non-infinite blocks have been logged." ] }, { "cell_type": "code", "execution_count": 17, "id": "ad0d14c2-8292-4764-b0a1-05a894a64900", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 2
block_expiry | n_events
<chr> | <int>
2022-10-12T22:41:35Z | 10
2022-10-12T23:08:11Z | 1
2022-10-12T23:10:04Z | 11
2022-10-13T00:06:06Z | 1
2022-10-13T00:13:35Z | 2
2022-10-13T00:19:09Z | 9
\n" ], "text/latex": [ "A tibble: 6 × 2\n", "\\begin{tabular}{ll}\n", " block\\_expiry & n\\_events\\\\\n", " & \\\\\n", "\\hline\n", "\t 2022-10-12T22:41:35Z & 10\\\\\n", "\t 2022-10-12T23:08:11Z & 1\\\\\n", "\t 2022-10-12T23:10:04Z & 11\\\\\n", "\t 2022-10-13T00:06:06Z & 1\\\\\n", "\t 2022-10-13T00:13:35Z & 2\\\\\n", "\t 2022-10-13T00:19:09Z & 9\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 2\n", "\n", "| block_expiry <chr> | n_events <int> |\n", "|---|---|\n", "| 2022-10-12T22:41:35Z | 10 |\n", "| 2022-10-12T23:08:11Z | 1 |\n", "| 2022-10-12T23:10:04Z | 11 |\n", "| 2022-10-13T00:06:06Z | 1 |\n", "| 2022-10-13T00:13:35Z | 2 |\n", "| 2022-10-13T00:19:09Z | 9 |\n", "\n" ], "text/plain": [ " block_expiry n_events\n", "1 2022-10-12T22:41:35Z 10 \n", "2 2022-10-12T23:08:11Z 1 \n", "3 2022-10-12T23:10:04Z 11 \n", "4 2022-10-13T00:06:06Z 1 \n", "5 2022-10-13T00:13:35Z 2 \n", "6 2022-10-13T00:19:09Z 9 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# quick check of block expiration dates\n", "edits_blocked_exp_date <- edits_blocked_events %>%\n", " group_by(block_expiry) %>%\n", " summarise(n_events = n()) %>%\n", " arrange(block_expiry)\n", "\n", "head(edits_blocked_exp_date)\n" ] }, { "cell_type": "markdown", "id": "9bdc2b6c-1351-4080-88ca-4ac296301179", "metadata": {}, "source": [ "The earliest block expiration date logged is 12 October 2022, which is the date we started logging events. Let's make sure that is also the earliest date logged for the block events themselves." 
] }, { "cell_type": "markdown", "id": "d388fee4-0f6b-4830-bddc-029e502e5ee1", "metadata": {}, "source": [ "## Block Time" ] }, { "cell_type": "code", "execution_count": 18, "id": "adbc4178-e214-436a-947c-512e0e91fc2e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 2
block_time | n_events
<chr> | <int>
2022-10-12 | 95976
2022-10-13 | 1116715
2022-10-14 | 1074355
2022-10-15 | 1018007
2022-10-16 | 1078969
2022-10-17 | 1148724
\n" ], "text/latex": [ "A tibble: 6 × 2\n", "\\begin{tabular}{ll}\n", " block\\_time & n\\_events\\\\\n", " & \\\\\n", "\\hline\n", "\t 2022-10-12 & 95976\\\\\n", "\t 2022-10-13 & 1116715\\\\\n", "\t 2022-10-14 & 1074355\\\\\n", "\t 2022-10-15 & 1018007\\\\\n", "\t 2022-10-16 & 1078969\\\\\n", "\t 2022-10-17 & 1148724\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 2\n", "\n", "| block_time <chr> | n_events <int> |\n", "|---|---|\n", "| 2022-10-12 | 95976 |\n", "| 2022-10-13 | 1116715 |\n", "| 2022-10-14 | 1074355 |\n", "| 2022-10-15 | 1018007 |\n", "| 2022-10-16 | 1078969 |\n", "| 2022-10-17 | 1148724 |\n", "\n" ], "text/plain": [ " block_time n_events\n", "1 2022-10-12 95976 \n", "2 2022-10-13 1116715 \n", "3 2022-10-14 1074355 \n", "4 2022-10-15 1018007 \n", "5 2022-10-16 1078969 \n", "6 2022-10-17 1148724 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# quick check of the dates on which block events were logged\n", "edits_blocked_date <- edits_blocked_events %>%\n", " group_by(block_time) %>%\n", " summarise(n_events = n()) %>%\n", " arrange(block_time)\n", "\n", "head(edits_blocked_date)" ] }, { "cell_type": "markdown", "id": "437582e2-c2b1-4c61-8c5c-c70cb8615475", "metadata": {}, "source": [ "Events are first logged on `2022-10-12`, as expected." ] }, { "cell_type": "markdown", "id": "f8cff35d-2643-4c21-bf91-b8b91bd4e8c5", "metadata": {}, "source": [ "## Blocks by Country Code" ] }, { "cell_type": "code", "execution_count": 19, "id": "d48032a6-0f47-404a-b292-c642b3d760d4", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 4
country_code | n_events | n_users | pct_users
<chr> | <int> | <int> | <dbl>
US | 2506830 | 1410 | 0.185
CA | 1390084 | 200 | 0.026
DE | 719657 | 343 | 0.045
RU | 653564 | 391 | 0.051
HK | 309201 | 189 | 0.025
BE | 303395 | 38 | 0.005
\n" ], "text/latex": [ "A tibble: 6 × 4\n", "\\begin{tabular}{llll}\n", " country\\_code & n\\_events & n\\_users & pct\\_users\\\\\n", " & & & \\\\\n", "\\hline\n", "\t US & 2506830 & 1410 & 0.185\\\\\n", "\t CA & 1390084 & 200 & 0.026\\\\\n", "\t DE & 719657 & 343 & 0.045\\\\\n", "\t RU & 653564 & 391 & 0.051\\\\\n", "\t HK & 309201 & 189 & 0.025\\\\\n", "\t BE & 303395 & 38 & 0.005\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 4\n", "\n", "| country_code <chr> | n_events <int> | n_users <int> | pct_users <dbl> |\n", "|---|---|---|---|\n", "| US | 2506830 | 1410 | 0.185 |\n", "| CA | 1390084 | 200 | 0.026 |\n", "| DE | 719657 | 343 | 0.045 |\n", "| RU | 653564 | 391 | 0.051 |\n", "| HK | 309201 | 189 | 0.025 |\n", "| BE | 303395 | 38 | 0.005 |\n", "\n" ], "text/plain": [ " country_code n_events n_users pct_users\n", "1 US 2506830 1410 0.185 \n", "2 CA 1390084 200 0.026 \n", "3 DE 719657 343 0.045 \n", "4 RU 653564 391 0.051 \n", "5 HK 309201 189 0.025 \n", "6 BE 303395 38 0.005 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# count events and distinct users by country code\n", "edits_blocked_bycountry <- edits_blocked_events %>%\n", " group_by(country_code) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_users = round(n_users / sum(n_users), 3)) %>%\n", " arrange(desc(n_events)) \n", "\n", "head(edits_blocked_bycountry)" ] }, { "cell_type": "markdown", "id": "4b972f93-9958-4f72-9c02-e4e5114b2158", "metadata": {}, "source": [ "In some instances the country code field is logged as a long string rather than a two-letter country code. Taking a closer look at those:" ] }, { "cell_type": "markdown", "id": "da2221df-44fa-4488-8831-605bb55bad20", "metadata": {}, "source": [ "### Country Code Issues" ] }, { "cell_type": "code", "execution_count": 20, "id": "0216cf8c-e124-49d8-9419-d104f82db8f3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 4
country_code | n_events | n_users | pct_users
<chr> | <int> | <int> | <dbl>
'\"()&%<acx><ScRiPt >1pVo(9336)</ScRiPt> | 1 | 1 | 0
'\"()&%<acx><ScRiPt >1TBU(9672)</ScRiPt> | 1 | 1 | 0
'\"()&%<acx><ScRiPt >23hz(9646)</ScRiPt> | 1 | 1 | 0
'\"()&%<acx><ScRiPt >26PX(9583)</ScRiPt> | 1 | 1 | 0
'\"()&%<acx><ScRiPt >2asY(9963)</ScRiPt> | 1 | 1 | 0
'\"()&%<acx><ScRiPt >2NsB(9331)</ScRiPt> | 1 | 1 | 0
\n" ], "text/latex": [ "A tibble: 6 × 4\n", "\\begin{tabular}{llll}\n", " country\\_code & n\\_events & n\\_users & pct\\_users\\\\\n", " & & & \\\\\n", "\\hline\n", "\t '\"()\\&\\% & 1 & 1 & 0\\\\\n", "\t '\"()\\&\\% & 1 & 1 & 0\\\\\n", "\t '\"()\\&\\% & 1 & 1 & 0\\\\\n", "\t '\"()\\&\\% & 1 & 1 & 0\\\\\n", "\t '\"()\\&\\% & 1 & 1 & 0\\\\\n", "\t '\"()\\&\\% & 1 & 1 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 4\n", "\n", "| country_code <chr> | n_events <int> | n_users <int> | pct_users <dbl> |\n", "|---|---|---|---|\n", "| '\"()&%<acx><ScRiPt >1pVo(9336)</ScRiPt> | 1 | 1 | 0 |\n", "| '\"()&%<acx><ScRiPt >1TBU(9672)</ScRiPt> | 1 | 1 | 0 |\n", "| '\"()&%<acx><ScRiPt >23hz(9646)</ScRiPt> | 1 | 1 | 0 |\n", "| '\"()&%<acx><ScRiPt >26PX(9583)</ScRiPt> | 1 | 1 | 0 |\n", "| '\"()&%<acx><ScRiPt >2asY(9963)</ScRiPt> | 1 | 1 | 0 |\n", "| '\"()&%<acx><ScRiPt >2NsB(9331)</ScRiPt> | 1 | 1 | 0 |\n", "\n" ], "text/plain": [ " country_code n_events n_users pct_users\n", "1 '\"()&% 1 1 0 \n", "2 '\"()&% 1 1 0 \n", "3 '\"()&% 1 1 0 \n", "4 '\"()&% 1 1 0 \n", "5 '\"()&% 1 1 0 \n", "6 '\"()&% 1 1 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "edits_blocked_bycountry_errors <- edits_blocked_events %>%\n", " group_by(country_code) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_users = round(n_users / sum(n_users), 3)) %>%\n", " filter(str_length(country_code) > 2) # keep values longer than a two-letter country code\n", "\n", "head(edits_blocked_bycountry_errors) " ] }, { "cell_type": "code", "execution_count": 21, "id": "90c73828-4835-409f-86e0-17144e24496f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 3 × 4
error | n_events | n_users | pct_events
<chr> | <int> | <int> | <dbl>
country_code_error | 90518 | 7 | 0.011
normal | 7986215 | 6990 | 0.989
NA | 4 | 1 | 0.000
\n" ], "text/latex": [ "A tibble: 3 × 4\n", "\\begin{tabular}{llll}\n", " error & n\\_events & n\\_users & pct\\_events\\\\\n", " & & & \\\\\n", "\\hline\n", "\t country\\_code\\_error & 90518 & 7 & 0.011\\\\\n", "\t normal & 7986215 & 6990 & 0.989\\\\\n", "\t NA & 4 & 1 & 0.000\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 3 × 4\n", "\n", "| error <chr> | n_events <int> | n_users <int> | pct_events <dbl> |\n", "|---|---|---|---|\n", "| country_code_error | 90518 | 7 | 0.011 |\n", "| normal | 7986215 | 6990 | 0.989 |\n", "| NA | 4 | 1 | 0.000 |\n", "\n" ], "text/plain": [ " error n_events n_users pct_events\n", "1 country_code_error 90518 7 0.011 \n", "2 normal 7986215 6990 0.989 \n", "3 NA 4 1 0.000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# find percent of these occurences\n", "edits_blocked_bycountry_errors <- edits_blocked_events %>%\n", " mutate(error = ifelse(str_length(country_code) > 2, \"country_code_error\", \"normal\")) %>%\n", " group_by(error) %>%\n", " summarise(n_events = n(),\n", " n_users = n_distinct(user_id)) %>%\n", "mutate(pct_events = round(n_events / sum(n_events), 3)) \n", "\n", "edits_blocked_bycountry_errors " ] }, { "cell_type": "markdown", "id": "0ff88a00-6f82-42ff-8854-3e5a5d1b7ba1", "metadata": {}, "source": [ "## Block Types by Page Namespace" ] }, { "cell_type": "code", "execution_count": 22, "id": "2a9daa0c-5ad0-43e0-9b81-2d854b60b8a7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", 
"\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 67 × 2
page_namespace | n_events
<int> | <int>
0 | 6979785
1 | 86777
2 | 109794
3 | 124554
4 | 178533
5 | 17867
6 | 38155
7 | 697
8 | 5739
9 | 311
10 | 316506
11 | 23257
12 | 12286
13 | 1317
14 | 107361
15 | 1419
90 | 1707
92 | 6
100 | 21575
101 | 489
102 | 9854
103 | 2117
104 | 5711
105 | 105
106 | 1423
107 | 45
108 | 152
109 | 2
110 | 145
111 | 1
121 | 107
124 | 1
128 | 149
129 | 2
130 | 3
132 | 80
133 | 1
134 | 47
136 | 9
200 | 3904
201 | 508
202 | 1264
203 | 131
206 | 31
207 | 1
250 | 118
252 | 18
460 | 3
470 | 24
471 | 19
482 | 2
486 | 26
487 | 1
710 | 4
828 | 18375
829 | 229
866 | 22
867 | 1
1198 | 964
1199 | 4
\n" ], "text/latex": [ "A tibble: 67 × 2\n", "\\begin{tabular}{ll}\n", " page\\_namespace & n\\_events\\\\\n", " & \\\\\n", "\\hline\n", "\t 0 & 6979785\\\\\n", "\t 1 & 86777\\\\\n", "\t 2 & 109794\\\\\n", "\t 3 & 124554\\\\\n", "\t 4 & 178533\\\\\n", "\t 5 & 17867\\\\\n", "\t 6 & 38155\\\\\n", "\t 7 & 697\\\\\n", "\t 8 & 5739\\\\\n", "\t 9 & 311\\\\\n", "\t 10 & 316506\\\\\n", "\t 11 & 23257\\\\\n", "\t 12 & 12286\\\\\n", "\t 13 & 1317\\\\\n", "\t 14 & 107361\\\\\n", "\t 15 & 1419\\\\\n", "\t 90 & 1707\\\\\n", "\t 92 & 6\\\\\n", "\t 100 & 21575\\\\\n", "\t 101 & 489\\\\\n", "\t 102 & 9854\\\\\n", "\t 103 & 2117\\\\\n", "\t 104 & 5711\\\\\n", "\t 105 & 105\\\\\n", "\t 106 & 1423\\\\\n", "\t 107 & 45\\\\\n", "\t 108 & 152\\\\\n", "\t 109 & 2\\\\\n", "\t 110 & 145\\\\\n", "\t 111 & 1\\\\\n", "\t ⋮ & ⋮\\\\\n", "\t 121 & 107\\\\\n", "\t 124 & 1\\\\\n", "\t 128 & 149\\\\\n", "\t 129 & 2\\\\\n", "\t 130 & 3\\\\\n", "\t 132 & 80\\\\\n", "\t 133 & 1\\\\\n", "\t 134 & 47\\\\\n", "\t 136 & 9\\\\\n", "\t 200 & 3904\\\\\n", "\t 201 & 508\\\\\n", "\t 202 & 1264\\\\\n", "\t 203 & 131\\\\\n", "\t 206 & 31\\\\\n", "\t 207 & 1\\\\\n", "\t 250 & 118\\\\\n", "\t 252 & 18\\\\\n", "\t 460 & 3\\\\\n", "\t 470 & 24\\\\\n", "\t 471 & 19\\\\\n", "\t 482 & 2\\\\\n", "\t 486 & 26\\\\\n", "\t 487 & 1\\\\\n", "\t 710 & 4\\\\\n", "\t 828 & 18375\\\\\n", "\t 829 & 229\\\\\n", "\t 866 & 22\\\\\n", "\t 867 & 1\\\\\n", "\t 1198 & 964\\\\\n", "\t 1199 & 4\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 67 × 2\n", "\n", "| page_namespace <int> | n_events <int> |\n", "|---|---|\n", "| 0 | 6979785 |\n", "| 1 | 86777 |\n", "| 2 | 109794 |\n", "| 3 | 124554 |\n", "| 4 | 178533 |\n", "| 5 | 17867 |\n", "| 6 | 38155 |\n", "| 7 | 697 |\n", "| 8 | 5739 |\n", "| 9 | 311 |\n", "| 10 | 316506 |\n", "| 11 | 23257 |\n", "| 12 | 12286 |\n", "| 13 | 1317 |\n", "| 14 | 107361 |\n", "| 15 | 1419 |\n", "| 90 | 1707 |\n", "| 92 | 6 |\n", "| 100 | 21575 |\n", "| 101 | 489 |\n", "| 102 | 9854 |\n", "| 
103 | 2117 |\n", "| 104 | 5711 |\n", "| 105 | 105 |\n", "| 106 | 1423 |\n", "| 107 | 45 |\n", "| 108 | 152 |\n", "| 109 | 2 |\n", "| 110 | 145 |\n", "| 111 | 1 |\n", "| ⋮ | ⋮ |\n", "| 121 | 107 |\n", "| 124 | 1 |\n", "| 128 | 149 |\n", "| 129 | 2 |\n", "| 130 | 3 |\n", "| 132 | 80 |\n", "| 133 | 1 |\n", "| 134 | 47 |\n", "| 136 | 9 |\n", "| 200 | 3904 |\n", "| 201 | 508 |\n", "| 202 | 1264 |\n", "| 203 | 131 |\n", "| 206 | 31 |\n", "| 207 | 1 |\n", "| 250 | 118 |\n", "| 252 | 18 |\n", "| 460 | 3 |\n", "| 470 | 24 |\n", "| 471 | 19 |\n", "| 482 | 2 |\n", "| 486 | 26 |\n", "| 487 | 1 |\n", "| 710 | 4 |\n", "| 828 | 18375 |\n", "| 829 | 229 |\n", "| 866 | 22 |\n", "| 867 | 1 |\n", "| 1198 | 964 |\n", "| 1199 | 4 |\n", "\n" ], "text/plain": [ " page_namespace n_events\n", "1 0 6979785 \n", "2 1 86777 \n", "3 2 109794 \n", "4 3 124554 \n", "5 4 178533 \n", "6 5 17867 \n", "7 6 38155 \n", "8 7 697 \n", "9 8 5739 \n", "10 9 311 \n", "11 10 316506 \n", "12 11 23257 \n", "13 12 12286 \n", "14 13 1317 \n", "15 14 107361 \n", "16 15 1419 \n", "17 90 1707 \n", "18 92 6 \n", "19 100 21575 \n", "20 101 489 \n", "21 102 9854 \n", "22 103 2117 \n", "23 104 5711 \n", "24 105 105 \n", "25 106 1423 \n", "26 107 45 \n", "27 108 152 \n", "28 109 2 \n", "29 110 145 \n", "30 111 1 \n", "⋮ ⋮ ⋮ \n", "38 121 107 \n", "39 124 1 \n", "40 128 149 \n", "41 129 2 \n", "42 130 3 \n", "43 132 80 \n", "44 133 1 \n", "45 134 47 \n", "46 136 9 \n", "47 200 3904 \n", "48 201 508 \n", "49 202 1264 \n", "50 203 131 \n", "51 206 31 \n", "52 207 1 \n", "53 250 118 \n", "54 252 18 \n", "55 460 3 \n", "56 470 24 \n", "57 471 19 \n", "58 482 2 \n", "59 486 26 \n", "60 487 1 \n", "61 710 4 \n", "62 828 18375 \n", "63 829 229 \n", "64 866 22 \n", "65 867 1 \n", "66 1198 964 \n", "67 1199 4 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# count events by page namespace\n", "edits_blocked_bynp <- edits_blocked_events %>%\n", " group_by(page_namespace) %>%\n", " summarise(n_events = 
n()) \n", "\n", "edits_blocked_bynp" ] }, { "cell_type": "markdown", "id": "12fa62d7-1403-49a0-a358-7488e9f23e3f", "metadata": {}, "source": [ "## Edit Counts by Anon Users (Should always be 0)" ] }, { "cell_type": "code", "execution_count": 23, "id": "9008d58f-c1ac-4fe2-adb8-ed36ec9283ba", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` regrouping output by 'isanon' (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A grouped_df: 1 × 3
isanon | user_edit_count | n_events
<chr> | <int> | <int>
true | 0 | 8055282
\n" ], "text/latex": [ "A grouped\\_df: 1 × 3\n", "\\begin{tabular}{lll}\n", " isanon & user\\_edit\\_count & n\\_events\\\\\n", " & & \\\\\n", "\\hline\n", "\t true & 0 & 8055282\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A grouped_df: 1 × 3\n", "\n", "| isanon <chr> | user_edit_count <int> | n_events <int> |\n", "|---|---|---|\n", "| true | 0 | 8055282 |\n", "\n" ], "text/plain": [ " isanon user_edit_count n_events\n", "1 true 0 8055282 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "edits_blocked_byanon_count <- edits_blocked_events %>%\n", " mutate(isanon = ifelse(user_id == 0, \"true\", \"false\")) %>%\n", " filter(isanon == 'true') %>%\n", " group_by(isanon, user_edit_count) %>%\n", " summarise(n_events = n())\n", "\n", "edits_blocked_byanon_count" ] }, { "cell_type": "markdown", "id": "29ef167d-57db-4e83-994d-8338be0d4f5d", "metadata": {}, "source": [ "## Blocks by Database" ] }, { "cell_type": "code", "execution_count": 28, "id": "ccc5dd91-92e8-4ee9-818b-536b22424c35", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`summarise()` ungrouping output (override with `.groups` argument)\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 2
database | n_events
<chr> | <int>
enwiki | 4638043
zhwiki | 264465
frwiki | 263387
ruwiki | 189636
dewiki | 166379
ptwiki | 130881
\n" ], "text/latex": [ "A tibble: 6 × 2\n", "\\begin{tabular}{ll}\n", " database & n\\_events\\\\\n", " & \\\\\n", "\\hline\n", "\t enwiki & 4638043\\\\\n", "\t zhwiki & 264465\\\\\n", "\t frwiki & 263387\\\\\n", "\t ruwiki & 189636\\\\\n", "\t dewiki & 166379\\\\\n", "\t ptwiki & 130881\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 2\n", "\n", "| database <chr> | n_events <int> |\n", "|---|---|\n", "| enwiki | 4638043 |\n", "| zhwiki | 264465 |\n", "| frwiki | 263387 |\n", "| ruwiki | 189636 |\n", "| dewiki | 166379 |\n", "| ptwiki | 130881 |\n", "\n" ], "text/plain": [ " database n_events\n", "1 enwiki 4638043 \n", "2 zhwiki 264465 \n", "3 frwiki 263387 \n", "4 ruwiki 189636 \n", "5 dewiki 166379 \n", "6 ptwiki 130881 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# count events by wiki database, largest first\n", "edits_blocked_bydatabase <- edits_blocked_events %>%\n", " group_by(database) %>%\n", " summarise(n_events = n()) %>%\n", "arrange(desc(n_events))\n", "\n", "head(edits_blocked_bydatabase)" ] }, { "cell_type": "markdown", "id": "01df83db-e485-4463-ac5c-3abd3bdca536", "metadata": {}, "source": [ "Most blocked edit attempts occur on enwiki." ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 5 }