{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Content Translation Article Deletion Ratios Across All Wikipedias\n", "\n", "[Task](https://phabricator.wikimedia.org/T286636)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From task description:\n", "\n", "\"Across all languages, Wikipedia articles created with Content Translation are deleted less often than those created from scratch. For example, in 2020, 3% of new translations were deleted, compared to 12% of other new articles. However, this is not the case for all Wikipedias and some specific wikis have a higher deletion rate for translations. For example, for Indonesian ([T219851#5914691](https://phabricator.wikimedia.org/T219851#5914691)) and Telugu ([T244769](https://phabricator.wikimedia.org/T244769)) the deletion ratios for Content Translation were higher compared to other articles created in these wikis.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Purpose\n", "\n", "The purpose of this analysis is to identify and count the wikis where the deletion rate of articles created with Content Translation is higher than the deletion rate of articles created with other tools. Specifically, we want to answer the following questions:\n", "\n", "* How many wikis have translations deleted more often than regular articles?\n", "* Which are these wikis?\n", "* Has the number of those wikis reduced compared to the previous period?\n", "* How high is the highest deletion ratio a wiki has for translations?\n", "\n", "This analysis will be used as a baseline to assess the evolution of deletion rates as improvements are made. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data\n", "\n", "Data comes from the [mediawiki_history](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history) table and reflects the deletion ratios of main namespace articles that were created using Content Translation compared to the deletion ratio for main namespace articles created without the tool. Bots were excluded. \n", "\n", "For the purpose of this analysis, we compared the deletion ratios for the most recent completed quarter (Q4: April through June 2021) to the prior quarter (Q1: January through March 2021) and two 6-month periods. The reviewed time period can be adjusted as needed.\n", "\n", "**Wiki size threshold**: We removed wikis where 15 or fewer articles were created with Content Translation during the reviewed period to reduce noise in the data and focus on wikis with more representative data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next Steps\n", "- Check threshold proposal - look at wikis with at least 10 articles created\n", "- Look at different ways to present information in a table\n", "- Post results to ticket for review\n", "- Create a table to collect the data and potentially present it in a Superset dashboard\n" ] }, { "cell_type": "code", "execution_count": 218, "metadata": {}, "outputs": [], "source": [ "shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))\n", "shhh({\n", " library(tidyverse);\n", " # Tables:\n", " library(gt);\n", " library(gtsummary);\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quarterly Comparison" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [], "source": [ "# Q4: April through June 2021\n", "query <-\n", "\"\n", "-- find both cx and non-cx created articles \n", "WITH created_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') 
AS INT)) AS created_cx,\n", " COUNT(*) AS created_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2021-04-01' and '2021-06-30' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", " AND event_type = 'create' \n", "GROUP BY \n", " wiki_db\n", "),\n", "\n", "--find all deleted articles that were created with cx \n", "\n", "deleted_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,\n", " COUNT(*) AS deleted_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2021-04-01' and '2021-06-30' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", "-- find revisions moved to the archive table\n", " AND event_type = 'create'\n", " AND revision_is_deleted_by_page_deletion = TRUE\n", "-- remove all bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 -- not a bot\n", "GROUP BY \n", " wiki_db\n", ")\n", "\n", "-- main query to aggregate and join sources above\n", "SELECT\n", " created_articles.wiki,\n", " created_cx,\n", " (created_total - created_cx) AS created_non_cx,\n", " deleted_cx,\n", " (deleted_total - deleted_cx) AS deleted_non_cx\n", "FROM created_articles\n", "JOIN deleted_articles ON \n", " created_articles.wiki = deleted_articles.wiki\n", "\"" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Don't forget to authenticate with Kerberos using kinit\n", "\n" ] } ], "source": [ "cx_deletion_ratio_q4 <- wmfdata::query_hive(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overall Deletion 
Ratio for Q4" ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 3

| deleted_cx_pct <chr> | deleted_non_cx_pct <chr> | deletion_pct_diff <chr> |
|---|---|---|
| 3.88% | 5.41% | 1.52% |
\n" ], "text/latex": [ "A data.frame: 1 × 3\n", "\\begin{tabular}{lll}\n", " deleted\\_cx\\_pct & deleted\\_non\\_cx\\_pct & deletion\\_pct\\_diff\\\\\n", " & & \\\\\n", "\\hline\n", "\t 3.88\\% & 5.41\\% & 1.52\\%\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 3\n", "\n", "| deleted_cx_pct <chr> | deleted_non_cx_pct <chr> | deletion_pct_diff <chr> |\n", "|---|---|---|\n", "| 3.88% | 5.41% | 1.52% |\n", "\n" ], "text/plain": [ " deleted_cx_pct deleted_non_cx_pct deletion_pct_diff\n", "1 3.88% 5.41% 1.52% " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_ratio_q4_overall <- cx_deletion_ratio_q4 %>%\n", " summarise(deleted_cx_pct = paste0(round(sum(deleted_cx)/sum(created_cx) * 100, 2), \"%\"),\n", " deleted_non_cx_pct = paste0(round(sum(deleted_non_cx)/sum(created_non_cx) * 100, 2), \"%\"),\n", " deletion_pct_diff = paste0(round((sum(deleted_non_cx)/sum(created_non_cx)*100)-((sum(deleted_cx)/sum(created_cx))*100), 2),\"%\")\n", " )\n", "\n", "cx_deletion_ratio_q4_overall \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By Wiki\n", " " ] }, { "cell_type": "code", "execution_count": 245, "metadata": {}, "outputs": [], "source": [ "# Add columns with calculated deletion ratio\n", "\n", "cx_deletion_ratio_q4_bywiki <- cx_deletion_ratio_q4 %>%\n", " #filter(wiki == 'arwiki') %>% # use to find ratios for single wiki\n", " filter(created_cx > 15) %>% # remove wikis with 15 or fewer articles created using cx\n", " mutate(deleted_cx_ratio = deleted_cx/created_cx, \n", " deleted_non_cx_ratio = deleted_non_cx/created_non_cx, \n", " deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)\n", " ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many wikis have translations deleted more often than regular articles?" 
] }, { "cell_type": "code", "execution_count": 283, "metadata": {}, "outputs": [], "source": [ "cx_deletion_higher_q4 <- cx_deletion_ratio_q4_bywiki %>%\n", " filter(deletion_ratio_diff < 0) %>% #find wikis with higher cx deletion ratio\n", " summarise(total_wikis = n())\n" ] }, { "cell_type": "code", "execution_count": 248, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Across all wikis where more than 15 articles have been created with content translation in Q4, there were 15 wikis where articles created with content translation were deleted more than articles created without cx\"\n" ] } ], "source": [ "print(paste0(\"Across all wikis where more than 15 articles have been created with content translation in Q4, there were \", \n", " cx_deletion_higher_q4[1], \n", " \" wikis where articles created with content translation were deleted more than articles created without cx\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Which are these wikis?" 
] }, { "cell_type": "code", "execution_count": 249, "metadata": {}, "outputs": [], "source": [ "cx_deletion_higher_list_q4 <- cx_deletion_ratio_q4_bywiki %>%\n", " filter(deletion_ratio_diff < 0)%>% # only wikis where cx deletion ratio is higher\n", " arrange(deletion_ratio_diff) #sort by highest deletion ratio difference\n", " " ] }, { "cell_type": "code", "execution_count": 297, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Wikis with higher deletion ratios for articles created with Content Translation
Reviewed Time Period: April 2021 through June 2021 (Q4)

| Wiki project¹ | Created CX Articles | Created non-CX Articles | Deleted CX Articles | Deleted non-CX Articles | CX Articles Deletion Ratio | Non-CX Articles Deletion Ratio | Deletion Ratio Difference |
|---|---|---|---|---|---|---|---|
| arywiki | 25 | 1461 | 9 | 24 | 36.00% | 1.64% | −34.36% |
| jvwiki | 890 | 721 | 162 | 22 | 18.20% | 3.05% | −15.15% |
| kawiki | 48 | 4558 | 10 | 519 | 20.83% | 11.39% | −9.45% |
| afwiki | 100 | 1735 | 10 | 93 | 10.00% | 5.36% | −4.64% |
| astwiki | 37 | 14525 | 1 | 150 | 2.70% | 1.03% | −1.67% |
| mrwiki | 99 | 5864 | 2 | 60 | 2.02% | 1.02% | −1.00% |
| ttwiki | 952 | 32396 | 10 | 22 | 1.05% | 0.07% | −0.98% |
| crhwiki | 75 | 1883 | 1 | 8 | 1.33% | 0.42% | −0.91% |
| bewiki | 174 | 4215 | 5 | 87 | 2.87% | 2.06% | −0.81% |
| hiwiki | 385 | 2735 | 93 | 639 | 24.16% | 23.36% | −0.79% |
| ckbwiki | 94 | 3892 | 3 | 97 | 3.19% | 2.49% | −0.70% |
| pswiki | 47 | 335 | 2 | 13 | 4.26% | 3.88% | −0.37% |
| swwiki | 95 | 3305 | 2 | 59 | 2.11% | 1.79% | −0.32% |
| urwiki | 132 | 3043 | 7 | 158 | 5.30% | 5.19% | −0.11% |
| fawiki | 1897 | 81694 | 110 | 4699 | 5.80% | 5.75% | −0.05% |
\n", "

\n", " \n", " 1\n", " \n", " \n", " Excludes wikis with 15 or fewer articles created with Content Translation\n", " during the reviewed time period\n", "
\n", "

\n", "
\n", "\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# reformat into table\n", "\n", "cx_deletion_higher_list_q4_tbl <- cx_deletion_higher_list_q4 %>%\n", " gt() %>%\n", " tab_header(\n", " title = \"Wikis with higher deletion ratios for articles created with Content Translation\",\n", " subtitle = \"Reviewed Time Period: April 2021 through June 2021 (Q4)\") %>%\n", " fmt_percent(\n", " columns = 6:8\n", " ) %>%\n", "\n", " cols_label(wiki = \"Wiki project\",\n", " created_cx = \"Created CX Articles\", \n", " created_non_cx = \"Created non-CX Articles\",\n", " deleted_cx = \"Deleted CX Articles\",\n", " deleted_non_cx = \"Deleted non-CX Articles\",\n", " deleted_cx_ratio = \"CX Articles Deletion Ratio\",\n", " deleted_non_cx_ratio = \"Non-CX Articles Deletion Ratio\",\n", " deletion_ratio_diff = \"Deletion Ratio Difference\") %>%\n", " tab_spanner(\"Created Articles\", 2:3) %>%\n", " tab_spanner(\"Deleted Articles\", 4:5) %>%\n", " tab_spanner(\"Deletion Ratios\", 6:8) %>%\n", " tab_footnote(\n", " footnote = \"Excludes wikis with 15 or fewer articles created with Content Translation\n", " during the reviewed time period\",\n", " locations = cells_column_labels(\n", " columns = 'wiki'\n", " )) %>%\n", " gtsave(\n", " \"cx_deletion_higher_wikis_q4.html\", inline_css = TRUE) \n", "\n", "\n", "IRdisplay::display_html(data = cx_deletion_higher_list_q4_tbl, file = \"cx_deletion_higher_wikis_q4.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How high is the highest deletion ratio a wiki has for translations?" ] }, { "cell_type": "code", "execution_count": 256, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
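A data.frame: 5 × 8

| | wiki <chr> | created_cx <int> | created_non_cx <int> | deleted_cx <int> | deleted_non_cx <int> | deleted_cx_ratio <chr> | deleted_non_cx_ratio <chr> | deletion_ratio_diff <chr> |
|---|---|---|---|---|---|---|---|---|
| 1 | arywiki | 25 | 1461 | 9 | 24 | 36% | 1.64% | -34.36% |
| 2 | guwiki | 16 | 83 | 4 | 41 | 25% | 49.4% | 24.4% |
| 3 | hiwiki | 385 | 2735 | 93 | 639 | 24.16% | 23.36% | -0.79% |
| 4 | kawiki | 48 | 4558 | 10 | 519 | 20.83% | 11.39% | -9.45% |
| 5 | ltwiki | 21 | 1959 | 4 | 888 | 19.05% | 45.33% | 26.28% |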
\n" ], "text/latex": [ "A data.frame: 5 × 8\n", "\\begin{tabular}{r|llllllll}\n", " & wiki & created\\_cx & created\\_non\\_cx & deleted\\_cx & deleted\\_non\\_cx & deleted\\_cx\\_ratio & deleted\\_non\\_cx\\_ratio & deletion\\_ratio\\_diff\\\\\n", " & & & & & & & & \\\\\n", "\\hline\n", "\t1 & arywiki & 25 & 1461 & 9 & 24 & 36\\% & 1.64\\% & -34.36\\%\\\\\n", "\t2 & guwiki & 16 & 83 & 4 & 41 & 25\\% & 49.4\\% & 24.4\\% \\\\\n", "\t3 & hiwiki & 385 & 2735 & 93 & 639 & 24.16\\% & 23.36\\% & -0.79\\% \\\\\n", "\t4 & kawiki & 48 & 4558 & 10 & 519 & 20.83\\% & 11.39\\% & -9.45\\% \\\\\n", "\t5 & ltwiki & 21 & 1959 & 4 & 888 & 19.05\\% & 45.33\\% & 26.28\\% \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 5 × 8\n", "\n", "| | wiki <chr> | created_cx <int> | created_non_cx <int> | deleted_cx <int> | deleted_non_cx <int> | deleted_cx_ratio <chr> | deleted_non_cx_ratio <chr> | deletion_ratio_diff <chr> |\n", "|---|---|---|---|---|---|---|---|---|\n", "| 1 | arywiki | 25 | 1461 | 9 | 24 | 36% | 1.64% | -34.36% |\n", "| 2 | guwiki | 16 | 83 | 4 | 41 | 25% | 49.4% | 24.4% |\n", "| 3 | hiwiki | 385 | 2735 | 93 | 639 | 24.16% | 23.36% | -0.79% |\n", "| 4 | kawiki | 48 | 4558 | 10 | 519 | 20.83% | 11.39% | -9.45% |\n", "| 5 | ltwiki | 21 | 1959 | 4 | 888 | 19.05% | 45.33% | 26.28% |\n", "\n" ], "text/plain": [ " wiki created_cx created_non_cx deleted_cx deleted_non_cx deleted_cx_ratio\n", "1 arywiki 25 1461 9 24 36% \n", "2 guwiki 16 83 4 41 25% \n", "3 hiwiki 385 2735 93 639 24.16% \n", "4 kawiki 48 4558 10 519 20.83% \n", "5 ltwiki 21 1959 4 888 19.05% \n", " deleted_non_cx_ratio deletion_ratio_diff\n", "1 1.64% -34.36% \n", "2 49.4% 24.4% \n", "3 23.36% -0.79% \n", "4 11.39% -9.45% \n", "5 45.33% 26.28% " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_ration_highest_q4 <- cx_deletion_ratio_q4_bywiki %>%\n", " arrange(desc(deleted_cx_ratio)) %>% #sort by highest to lowest cx deletion ratio\n", " mutate(deleted_cx_ratio = 
paste0(round(deleted_cx_ratio *100,2),\"%\") ,\n", " deleted_non_cx_ratio = paste0(round(deleted_non_cx_ratio *100,2),\"%\") ,\n", " deletion_ratio_diff = paste0(round(deletion_ratio_diff * 100,2),\"%\") )\n", "\n", "head(cx_deletion_ration_highest_q4, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moroccan Arabic Wikipedia had the highest deletion ratio for articles created with content translation: 36% of all articles created with content translation were deleted, compared to only 1.64% of articles created without cx.\n", "\n", "Among larger wikis (over 100,000 total articles), Hindi Wikipedia had the highest deletion ratio: 24.16% for cx-created articles compared to 23.36% for articles created without cx." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Has the number of those wikis reduced compared to the previous period?" ] }, { "cell_type": "code", "execution_count": 298, "metadata": {}, "outputs": [], "source": [ "# Deletion ratios from Q1\n", "\n", "query <-\n", "\"\n", "-- find all created articles \n", "WITH created_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,\n", " COUNT(*) AS created_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2021-01-01' and '2021-03-31' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", " AND event_type = 'create'\n", "-- remove bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 \n", "GROUP BY \n", " wiki_db\n", "),\n", "\n", "--find all deleted articles \n", "\n", "deleted_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,\n", " COUNT(*) AS deleted_total\n", "FROM 
wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2021-01-01' and '2021-03-31' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", "-- find revisions moved to the archive table\n", " AND event_type = 'create'\n", " AND revision_is_deleted_by_page_deletion = TRUE\n", "-- remove bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 \n", "GROUP BY \n", " wiki_db\n", ")\n", "\n", "-- main query \n", "SELECT\n", " created_articles.wiki,\n", " created_cx,\n", " (created_total - created_cx) AS created_non_cx,\n", " deleted_cx,\n", " (deleted_total - deleted_cx) AS deleted_non_cx\n", "FROM created_articles\n", "JOIN deleted_articles ON \n", " created_articles.wiki = deleted_articles.wiki\n", "\"" ] }, { "cell_type": "code", "execution_count": 299, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Don't forget to authenticate with Kerberos using kinit\n", "\n" ] } ], "source": [ "cx_deletion_ratio_q1 <- wmfdata::query_hive(query)" ] }, { "cell_type": "code", "execution_count": 300, "metadata": {}, "outputs": [], "source": [ "cx_deletion_ratio_q1_bywiki <- cx_deletion_ratio_q1 %>%\n", " #filter(wiki == 'idwiki') %>%\n", " filter(created_cx > 15) %>%\n", " mutate(deleted_cx_ratio = deleted_cx/created_cx,\n", " deleted_non_cx_ratio = deleted_non_cx/created_non_cx,\n", " deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)\n", " ))\n" ] }, { "cell_type": "code", "execution_count": 301, "metadata": {}, "outputs": [], "source": [ "cx_deletion_higher_q1 <- cx_deletion_ratio_q1_bywiki %>%\n", " filter(deletion_ratio_diff < 0) %>%\n", " summarise(total_wikis = n())\n" ] }, { "cell_type": "code", "execution_count": 302, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Across all wikis where more than 15 
articles have been created with content translation in Q1, there were 13 wikis where articles created with content translation were deleted more than articles created without cx\"\n" ] } ], "source": [ "print(paste0(\"Across all wikis where more than 15 articles have been created with content translation in Q1, there were \", \n", " cx_deletion_higher_q1[1], \n", " \" wikis where articles created with content translation were deleted more than articles created without cx\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of wikis with higher content translation deletion ratios increased by 2 from Q1 to Q4. \n", "We next compared the two lists of wikis to confirm if most of the wikis with higher deletion rates were the same across each quarter." ] }, { "cell_type": "code", "execution_count": 303, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 13 × 8

| wiki <chr> | created_cx <int> | created_non_cx <int> | deleted_cx <int> | deleted_non_cx <int> | deleted_cx_ratio <dbl> | deleted_non_cx_ratio <dbl> | deletion_ratio_diff <dbl> |
|---|---|---|---|---|---|---|---|
| hawwiki | 64 | 85 | 25 | 1 | 0.39062500 | 0.011764706 | -0.3788602941 |
| kuwiki | 204 | 4011 | 34 | 69 | 0.16666667 | 0.017202693 | -0.1494639741 |
| lawiki | 25 | 923 | 3 | 48 | 0.12000000 | 0.052004334 | -0.0679956663 |
| ltwiki | 23 | 2272 | 13 | 1135 | 0.56521739 | 0.499559859 | -0.0656575321 |
| fiwiki | 83 | 9448 | 12 | 787 | 0.14457831 | 0.083298052 | -0.0612802608 |
| fiu_vrowiki | 16 | 131 | 1 | 4 | 0.06250000 | 0.030534351 | -0.0319656489 |
| eowiki | 123 | 5221 | 5 | 62 | 0.04065041 | 0.011875120 | -0.0287752868 |
| kawiki | 122 | 5434 | 23 | 889 | 0.18852459 | 0.163599558 | -0.0249250318 |
| arzwiki | 110 | 37033 | 3 | 335 | 0.02727273 | 0.009045986 | -0.0182267413 |
| thwiki | 17 | 4635 | 1 | 208 | 0.05882353 | 0.044875944 | -0.0139475855 |
| bewiki | 256 | 5225 | 9 | 115 | 0.03515625 | 0.022009569 | -0.0131466806 |
| mrwiki | 164 | 4771 | 3 | 60 | 0.01829268 | 0.012575980 | -0.0057167030 |
| bswiki | 52 | 1953 | 5 | 187 | 0.09615385 | 0.095750128 | -0.0004037181 |
\n" ], "text/latex": [ "A data.frame: 13 × 8\n", "\\begin{tabular}{llllllll}\n", " wiki & created\\_cx & created\\_non\\_cx & deleted\\_cx & deleted\\_non\\_cx & deleted\\_cx\\_ratio & deleted\\_non\\_cx\\_ratio & deletion\\_ratio\\_diff\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\t hawwiki & 64 & 85 & 25 & 1 & 0.39062500 & 0.011764706 & -0.3788602941\\\\\n", "\t kuwiki & 204 & 4011 & 34 & 69 & 0.16666667 & 0.017202693 & -0.1494639741\\\\\n", "\t lawiki & 25 & 923 & 3 & 48 & 0.12000000 & 0.052004334 & -0.0679956663\\\\\n", "\t ltwiki & 23 & 2272 & 13 & 1135 & 0.56521739 & 0.499559859 & -0.0656575321\\\\\n", "\t fiwiki & 83 & 9448 & 12 & 787 & 0.14457831 & 0.083298052 & -0.0612802608\\\\\n", "\t fiu\\_vrowiki & 16 & 131 & 1 & 4 & 0.06250000 & 0.030534351 & -0.0319656489\\\\\n", "\t eowiki & 123 & 5221 & 5 & 62 & 0.04065041 & 0.011875120 & -0.0287752868\\\\\n", "\t kawiki & 122 & 5434 & 23 & 889 & 0.18852459 & 0.163599558 & -0.0249250318\\\\\n", "\t arzwiki & 110 & 37033 & 3 & 335 & 0.02727273 & 0.009045986 & -0.0182267413\\\\\n", "\t thwiki & 17 & 4635 & 1 & 208 & 0.05882353 & 0.044875944 & -0.0139475855\\\\\n", "\t bewiki & 256 & 5225 & 9 & 115 & 0.03515625 & 0.022009569 & -0.0131466806\\\\\n", "\t mrwiki & 164 & 4771 & 3 & 60 & 0.01829268 & 0.012575980 & -0.0057167030\\\\\n", "\t bswiki & 52 & 1953 & 5 & 187 & 0.09615385 & 0.095750128 & -0.0004037181\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 13 × 8\n", "\n", "| wiki <chr> | created_cx <int> | created_non_cx <int> | deleted_cx <int> | deleted_non_cx <int> | deleted_cx_ratio <dbl> | deleted_non_cx_ratio <dbl> | deletion_ratio_diff <dbl> |\n", "|---|---|---|---|---|---|---|---|\n", "| hawwiki | 64 | 85 | 25 | 1 | 0.39062500 | 0.011764706 | -0.3788602941 |\n", "| kuwiki | 204 | 4011 | 34 | 69 | 0.16666667 | 0.017202693 | -0.1494639741 |\n", "| lawiki | 25 | 923 | 3 | 48 | 0.12000000 | 0.052004334 | -0.0679956663 |\n", "| ltwiki | 23 | 2272 | 13 | 1135 | 0.56521739 | 0.499559859 | 
-0.0656575321 |\n", "| fiwiki | 83 | 9448 | 12 | 787 | 0.14457831 | 0.083298052 | -0.0612802608 |\n", "| fiu_vrowiki | 16 | 131 | 1 | 4 | 0.06250000 | 0.030534351 | -0.0319656489 |\n", "| eowiki | 123 | 5221 | 5 | 62 | 0.04065041 | 0.011875120 | -0.0287752868 |\n", "| kawiki | 122 | 5434 | 23 | 889 | 0.18852459 | 0.163599558 | -0.0249250318 |\n", "| arzwiki | 110 | 37033 | 3 | 335 | 0.02727273 | 0.009045986 | -0.0182267413 |\n", "| thwiki | 17 | 4635 | 1 | 208 | 0.05882353 | 0.044875944 | -0.0139475855 |\n", "| bewiki | 256 | 5225 | 9 | 115 | 0.03515625 | 0.022009569 | -0.0131466806 |\n", "| mrwiki | 164 | 4771 | 3 | 60 | 0.01829268 | 0.012575980 | -0.0057167030 |\n", "| bswiki | 52 | 1953 | 5 | 187 | 0.09615385 | 0.095750128 | -0.0004037181 |\n", "\n" ], "text/plain": [ " wiki created_cx created_non_cx deleted_cx deleted_non_cx\n", "1 hawwiki 64 85 25 1 \n", "2 kuwiki 204 4011 34 69 \n", "3 lawiki 25 923 3 48 \n", "4 ltwiki 23 2272 13 1135 \n", "5 fiwiki 83 9448 12 787 \n", "6 fiu_vrowiki 16 131 1 4 \n", "7 eowiki 123 5221 5 62 \n", "8 kawiki 122 5434 23 889 \n", "9 arzwiki 110 37033 3 335 \n", "10 thwiki 17 4635 1 208 \n", "11 bewiki 256 5225 9 115 \n", "12 mrwiki 164 4771 3 60 \n", "13 bswiki 52 1953 5 187 \n", " deleted_cx_ratio deleted_non_cx_ratio deletion_ratio_diff\n", "1 0.39062500 0.011764706 -0.3788602941 \n", "2 0.16666667 0.017202693 -0.1494639741 \n", "3 0.12000000 0.052004334 -0.0679956663 \n", "4 0.56521739 0.499559859 -0.0656575321 \n", "5 0.14457831 0.083298052 -0.0612802608 \n", "6 0.06250000 0.030534351 -0.0319656489 \n", "7 0.04065041 0.011875120 -0.0287752868 \n", "8 0.18852459 0.163599558 -0.0249250318 \n", "9 0.02727273 0.009045986 -0.0182267413 \n", "10 0.05882353 0.044875944 -0.0139475855 \n", "11 0.03515625 0.022009569 -0.0131466806 \n", "12 0.01829268 0.012575980 -0.0057167030 \n", "13 0.09615385 0.095750128 -0.0004037181 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_higher_list_q1 <- 
cx_deletion_ratio_q1_bywiki %>%\n", " filter(deletion_ratio_diff < 0) %>%\n", " arrange(deletion_ratio_diff)\n", "\n", "cx_deletion_higher_list_q1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many wikis had higher deletion ratios for cx translated articles both quarters?" ] }, { "cell_type": "code", "execution_count": 304, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 3 × 1

| wiki <chr> |
|---|
| kawiki |
| bewiki |
| mrwiki |
\n" ], "text/latex": [ "A data.frame: 3 × 1\n", "\\begin{tabular}{l}\n", " wiki\\\\\n", " \\\\\n", "\\hline\n", "\t kawiki\\\\\n", "\t bewiki\\\\\n", "\t mrwiki\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 3 × 1\n", "\n", "| wiki <chr> |\n", "|---|\n", "| kawiki |\n", "| bewiki |\n", "| mrwiki |\n", "\n" ], "text/plain": [ " wiki \n", "1 kawiki\n", "2 bewiki\n", "3 mrwiki" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "intersect(cx_deletion_higher_list_q1[1], cx_deletion_higher_list_q4[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There were 3 wikis that had higher deletion ratios for content translated articles in both quarters. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6 Month Period Comparison" ] }, { "cell_type": "code", "execution_count": 269, "metadata": {}, "outputs": [], "source": [ "# Current 6 Months\n", "# Jan - June 2021\n", "query <-\n", "\"\n", "-- find both cx and non-cx created articles \n", "WITH created_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,\n", " COUNT(*) AS created_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2021-01-01' and '2021-06-30' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", " AND event_type = 'create'\n", "-- remove bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 \n", "GROUP BY \n", " wiki_db\n", "),\n", "\n", "--find all deleted articles that were created with cx \n", "\n", "deleted_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,\n", " COUNT(*) AS deleted_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " 
AND event_timestamp BETWEEN '2021-01-01' and '2021-06-30' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", " AND event_type = 'create'\n", "-- find revisions moved to the archive table\n", " AND revision_is_deleted_by_page_deletion = TRUE\n", "-- remove bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 \n", "GROUP BY \n", " wiki_db\n", ")\n", "\n", "-- main query to aggregate and join sources above\n", "SELECT\n", " created_articles.wiki,\n", " created_cx,\n", " (created_total - created_cx) AS created_non_cx,\n", " deleted_cx,\n", " (deleted_total - deleted_cx) AS deleted_non_cx\n", "FROM created_articles\n", "JOIN deleted_articles ON \n", " created_articles.wiki = deleted_articles.wiki\n", "\"" ] }, { "cell_type": "code", "execution_count": 270, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Don't forget to authenticate with Kerberos using kinit\n", "\n" ] } ], "source": [ "cx_deletion_ratio_current_6mo <- wmfdata::query_hive(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overall Deletion Ratio - Current 6 mo" ] }, { "cell_type": "code", "execution_count": 271, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 3

| deleted_cx_pct <chr> | deleted_non_cx_pct <chr> | deletion_pct_diff <chr> |
|---|---|---|
| 3.6% | 8.47% | 4.87% |
\n" ], "text/latex": [ "A data.frame: 1 × 3\n", "\\begin{tabular}{lll}\n", " deleted\\_cx\\_pct & deleted\\_non\\_cx\\_pct & deletion\\_pct\\_diff\\\\\n", " & & \\\\\n", "\\hline\n", "\t 3.6\\% & 8.47\\% & 4.87\\%\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 3\n", "\n", "| deleted_cx_pct <chr> | deleted_non_cx_pct <chr> | deletion_pct_diff <chr> |\n", "|---|---|---|\n", "| 3.6% | 8.47% | 4.87% |\n", "\n" ], "text/plain": [ " deleted_cx_pct deleted_non_cx_pct deletion_pct_diff\n", "1 3.6% 8.47% 4.87% " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_ratio_6cur_overall <- cx_deletion_ratio_current_6mo %>%\n", " summarise(deleted_cx_pct = paste0(round(sum(deleted_cx)/sum(created_cx) * 100, 2), \"%\"),\n", " deleted_non_cx_pct = paste0(round(sum(deleted_non_cx)/sum(created_non_cx) * 100, 2), \"%\"),\n", " deletion_pct_diff = paste0(round((sum(deleted_non_cx)/sum(created_non_cx)*100)-((sum(deleted_cx)/sum(created_cx))*100), 2),\"%\")\n", " )\n", "\n", "cx_deletion_ratio_6cur_overall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By Wiki" ] }, { "cell_type": "code", "execution_count": 272, "metadata": {}, "outputs": [], "source": [ "cx_deletion_ratio_current_bywiki <- cx_deletion_ratio_current_6mo %>%\n", " #filter(wiki == 'idwiki') %>%\n", " filter(created_cx > 15) %>% # only review wikis with more than 15 cx articles\n", " mutate(deleted_cx_ratio = deleted_cx/created_cx,\n", " deleted_non_cx_ratio = deleted_non_cx/created_non_cx,\n", " deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)\n", " ))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many wikis have translations deleted more often than regular articles?" ] }, { "cell_type": "code", "execution_count": 274, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 1

| total_wikis <int> |
|---|
| 20 |
\n" ], "text/latex": [ "A data.frame: 1 × 1\n", "\\begin{tabular}{l}\n", " total\\_wikis\\\\\n", " \\\\\n", "\\hline\n", "\t 20\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 1\n", "\n", "| total_wikis <int> |\n", "|---|\n", "| 20 |\n", "\n" ], "text/plain": [ " total_wikis\n", "1 20 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_higher_current_6mo <- cx_deletion_ratio_current_bywiki %>%\n", " filter(deletion_ratio_diff < 0) %>%\n", " summarise(total_wikis = n())\n", "\n", "cx_deletion_higher_current_6mo " ] }, { "cell_type": "code", "execution_count": 284, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Across all wikis where more than 15 articles have been created with content translation from Jan 2021 - June 2021, there were 20 wikis where articles created with content translation were deleted more than articles created without cx\"\n" ] } ], "source": [ "print(paste0(\"Across all wikis where more than 15 articles have been created with content translation from Jan 2021 - June 2021, there were \", \n", " cx_deletion_higher_current_6mo[1], \n", " \" wikis where articles created with content translation were deleted more than articles created without cx\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Which are these wikis?" 
] }, { "cell_type": "code", "execution_count": 276, "metadata": {}, "outputs": [], "source": [ "cx_deletion_higher_list_current <- cx_deletion_ratio_current_bywiki %>%\n", " filter(deletion_ratio_diff < 0)%>% #only wikis with higher cx deletion ratios\n", " arrange(deletion_ratio_diff)\n", " " ] }, { "cell_type": "code", "execution_count": 279, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Wikis with higher deletion ratios for articles created with Content Translation
Reviewed Time Period: January 2021 through June 2021
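<table>
<tr><th rowspan=2>Wiki project<sup>1</sup></th><th colspan=2>Created Articles</th><th colspan=2>Deleted Articles</th><th colspan=3>Deletion Ratios</th></tr>
<tr><th>Created CX Articles</th><th>Created non-CX Articles</th><th>Deleted CX Articles</th><th>Deleted non-CX Articles</th><th>CX Articles Deletion Ratio</th><th>Non-CX Articles Deletion Ratio</th><th>Deletion Ratio Difference</th></tr>
<tr><td>hawwiki</td><td>68</td><td>128</td><td>25</td><td>25</td><td>36.76%</td><td>19.53%</td><td>−17.23%</td></tr>
<tr><td>iswiki</td><td>30</td><td>2157</td><td>7</td><td>140</td><td>23.33%</td><td>6.49%</td><td>−16.84%</td></tr>
<tr><td>kuwiki</td><td>221</td><td>5486</td><td>34</td><td>127</td><td>15.38%</td><td>2.31%</td><td>−13.07%</td></tr>
<tr><td>arywiki</td><td>57</td><td>1161</td><td>9</td><td>46</td><td>15.79%</td><td>3.96%</td><td>−11.83%</td></tr>
<tr><td>fiu_vrowiki</td><td>31</td><td>235</td><td>4</td><td>6</td><td>12.90%</td><td>2.55%</td><td>−10.35%</td></tr>
<tr><td>thwiki</td><td>24</td><td>9975</td><td>3</td><td>411</td><td>12.50%</td><td>4.12%</td><td>−8.38%</td></tr>
<tr><td>arzwiki</td><td>119</td><td>121643</td><td>9</td><td>606</td><td>7.56%</td><td>0.50%</td><td>−7.06%</td></tr>
<tr><td>azbwiki</td><td>18</td><td>1674</td><td>2</td><td>83</td><td>11.11%</td><td>4.96%</td><td>−6.15%</td></tr>
<tr><td>siwiki</td><td>37</td><td>1573</td><td>4</td><td>87</td><td>10.81%</td><td>5.53%</td><td>−5.28%</td></tr>
<tr><td>kawiki</td><td>170</td><td>10010</td><td>33</td><td>1415</td><td>19.41%</td><td>14.14%</td><td>−5.28%</td></tr>
<tr><td>lldwiki</td><td>18</td><td>171</td><td>1</td><td>1</td><td>5.56%</td><td>0.58%</td><td>−4.97%</td></tr>
<tr><td>jvwiki</td><td>2536</td><td>1182</td><td>162</td><td>45</td><td>6.39%</td><td>3.81%</td><td>−2.58%</td></tr>
<tr><td>crhwiki</td><td>77</td><td>2658</td><td>2</td><td>18</td><td>2.60%</td><td>0.68%</td><td>−1.92%</td></tr>
<tr><td>fiwiki</td><td>230</td><td>18370</td><td>25</td><td>1646</td><td>10.87%</td><td>8.96%</td><td>−1.91%</td></tr>
<tr><td>pswiki</td><td>55</td><td>780</td><td>3</td><td>31</td><td>5.45%</td><td>3.97%</td><td>−1.48%</td></tr>
<tr><td>bewiki</td><td>437</td><td>9537</td><td>14</td><td>203</td><td>3.20%</td><td>2.13%</td><td>−1.08%</td></tr>
<tr><td>afwiki</td><td>210</td><td>3797</td><td>11</td><td>170</td><td>5.24%</td><td>4.48%</td><td>−0.76%</td></tr>
<tr><td>mrwiki</td><td>268</td><td>10667</td><td>5</td><td>120</td><td>1.87%</td><td>1.12%</td><td>−0.74%</td></tr>
<tr><td>lawiki</td><td>56</td><td>1766</td><td>3</td><td>83</td><td>5.36%</td><td>4.70%</td><td>−0.66%</td></tr>
<tr><td>eowiki</td><td>558</td><td>10766</td><td>9</td><td>127</td><td>1.61%</td><td>1.18%</td><td>−0.43%</td></tr>
</table>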
\n", "

\n", " \n", " 1\n", " \n", " \n", " Excludes wikis with 15 or fewer articles created with Content Translation\n", " during the reviewed time period\n", "
\n", "

\n", "
\n", "\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# reformat into table\n", "\n", "cx_deletion_higher_list_6mo_tbl <- cx_deletion_higher_list_current %>%\n", " gt() %>%\n", " tab_header(\n", " title = \"Wikis with higher deletion ratios for articles created with Content Translation\",\n", " subtitle = \"Reviewed Time Period: January 2021 through June 2021\") %>%\n", " fmt_percent(\n", " columns = 6:8\n", " ) %>%\n", "\n", " cols_label(wiki = \"Wiki project\",\n", " created_cx = \"Created CX Articles\", \n", " created_non_cx = \"Created non-CX Articles\",\n", " deleted_cx = \"Deleted CX Articles\",\n", " deleted_non_cx = \"Deleted non-CX Articles\",\n", " deleted_cx_ratio = \"CX Articles Deletion Ratio\",\n", " deleted_non_cx_ratio = \"Non-CX Articles Deletion Ratio\",\n", " deletion_ratio_diff = \"Deletion Ratio Difference\") %>%\n", " tab_spanner(\"Created Articles\", 2:3) %>%\n", " tab_spanner(\"Deleted Articles\", 4:5) %>%\n", " tab_spanner(\"Deletion Ratios\", 6:8) %>%\n", " tab_footnote(\n", " footnote = \"Excludes wikis with 15 or fewer articles created with Content Translation\n", " during the reviewed time period\",\n", " locations = cells_column_labels(\n", " columns = 'wiki'\n", " )) %>%\n", " gtsave(\n", " \"cx_deletion_higher_wikis_6mo.html\", inline_css = TRUE) \n", "\n", "\n", "IRdisplay::display_html(data = cx_deletion_higher_list_6mo_tbl, file = \"cx_deletion_higher_wikis_6mo.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How high is the highest deletion ratio a wiki has for translations?\n" ] }, { "cell_type": "code", "execution_count": 282, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
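<table>
<caption>A data.frame: 5 × 8</caption>
<thead>
<tr><th></th><th>wiki</th><th>created_cx</th><th>created_non_cx</th><th>deleted_cx</th><th>deleted_non_cx</th><th>deleted_cx_ratio</th><th>deleted_non_cx_ratio</th><th>deletion_ratio_diff</th></tr>
<tr><th></th><th>&lt;chr&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;chr&gt;</th><th>&lt;chr&gt;</th><th>&lt;chr&gt;</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>ltwiki</td><td>45</td><td>4254</td><td>17</td><td>2041</td><td>37.78%</td><td>47.98%</td><td>10.2%</td></tr>
<tr><th>2</th><td>hawwiki</td><td>68</td><td>128</td><td>25</td><td>25</td><td>36.76%</td><td>19.53%</td><td>-17.23%</td></tr>
<tr><th>3</th><td>mnwiki</td><td>30</td><td>1265</td><td>10</td><td>542</td><td>33.33%</td><td>42.85%</td><td>9.51%</td></tr>
<tr><th>4</th><td>iswiki</td><td>30</td><td>2157</td><td>7</td><td>140</td><td>23.33%</td><td>6.49%</td><td>-16.84%</td></tr>
<tr><th>5</th><td>kawiki</td><td>170</td><td>10010</td><td>33</td><td>1415</td><td>19.41%</td><td>14.14%</td><td>-5.28%</td></tr>
</tbody>
</table>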
\n" ], "text/latex": [ "A data.frame: 5 × 8\n", "\\begin{tabular}{r|llllllll}\n", " & wiki & created\\_cx & created\\_non\\_cx & deleted\\_cx & deleted\\_non\\_cx & deleted\\_cx\\_ratio & deleted\\_non\\_cx\\_ratio & deletion\\_ratio\\_diff\\\\\n", " & & & & & & & & \\\\\n", "\\hline\n", "\t1 & ltwiki & 45 & 4254 & 17 & 2041 & 37.78\\% & 47.98\\% & 10.2\\% \\\\\n", "\t2 & hawwiki & 68 & 128 & 25 & 25 & 36.76\\% & 19.53\\% & -17.23\\%\\\\\n", "\t3 & mnwiki & 30 & 1265 & 10 & 542 & 33.33\\% & 42.85\\% & 9.51\\% \\\\\n", "\t4 & iswiki & 30 & 2157 & 7 & 140 & 23.33\\% & 6.49\\% & -16.84\\%\\\\\n", "\t5 & kawiki & 170 & 10010 & 33 & 1415 & 19.41\\% & 14.14\\% & -5.28\\% \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 5 × 8\n", "\n", "| | wiki <chr> | created_cx <int> | created_non_cx <int> | deleted_cx <int> | deleted_non_cx <int> | deleted_cx_ratio <chr> | deleted_non_cx_ratio <chr> | deletion_ratio_diff <chr> |\n", "|---|---|---|---|---|---|---|---|---|\n", "| 1 | ltwiki | 45 | 4254 | 17 | 2041 | 37.78% | 47.98% | 10.2% |\n", "| 2 | hawwiki | 68 | 128 | 25 | 25 | 36.76% | 19.53% | -17.23% |\n", "| 3 | mnwiki | 30 | 1265 | 10 | 542 | 33.33% | 42.85% | 9.51% |\n", "| 4 | iswiki | 30 | 2157 | 7 | 140 | 23.33% | 6.49% | -16.84% |\n", "| 5 | kawiki | 170 | 10010 | 33 | 1415 | 19.41% | 14.14% | -5.28% |\n", "\n" ], "text/plain": [ " wiki created_cx created_non_cx deleted_cx deleted_non_cx deleted_cx_ratio\n", "1 ltwiki 45 4254 17 2041 37.78% \n", "2 hawwiki 68 128 25 25 36.76% \n", "3 mnwiki 30 1265 10 542 33.33% \n", "4 iswiki 30 2157 7 140 23.33% \n", "5 kawiki 170 10010 33 1415 19.41% \n", " deleted_non_cx_ratio deletion_ratio_diff\n", "1 47.98% 10.2% \n", "2 19.53% -17.23% \n", "3 42.85% 9.51% \n", "4 6.49% -16.84% \n", "5 14.14% -5.28% " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_ration_highest_current <- cx_deletion_ratio_current_bywiki %>%\n", " arrange(desc(deleted_cx_ratio)) %>% \n", " 
mutate(deleted_cx_ratio = paste0(round(deleted_cx_ratio *100,2),\"%\") ,\n", " deleted_non_cx_ratio = paste0(round(deleted_non_cx_ratio *100,2),\"%\") ,\n", " deletion_ratio_diff = paste0(round(deletion_ratio_diff * 100,2),\"%\") )\n", "\n", "head(cx_deletion_ration_highest_current, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lithuanian Wikipedia had the highest deletion ratio for articles created with content translation: 37.8% of these articles were deleted. However, this was still lower than the deletion ratio for articles created without content translation (48.0%).\n", "\n", "The wiki where the cx deletion ratio most exceeded the non-cx deletion ratio was Hawaiian Wikipedia: 36.8% of all articles created with cx were deleted during the reviewed time period, compared to 19.5% of articles created without content translation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Has the number of those wikis reduced compared to the previous period?" 
] }, { "cell_type": "code", "execution_count": 285, "metadata": {}, "outputs": [], "source": [ "# Previous 6 Months\n", "# July 2020 - December 2020\n", "\n", "query <-\n", "\"\n", "-- find both cx and non-cx created articles \n", "WITH created_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,\n", " COUNT(*) AS created_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2020-07-01' and '2020-12-31' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", " AND event_type = 'create'\n", "-- remove bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 \n", "GROUP BY \n", " wiki_db\n", "),\n", "\n", "--find all deleted articles that were created with cx \n", "\n", "deleted_articles AS (\n", "\n", "SELECT\n", " wiki_db AS wiki,\n", " SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,\n", " COUNT(*) AS deleted_total\n", "FROM wmf.mediawiki_history\n", "WHERE\n", " snapshot = '2021-08'\n", " AND event_timestamp BETWEEN '2020-07-01' and '2020-12-31' \n", "-- interested in main page namespaces\n", " AND page_namespace = 0\n", "-- only look at new page creations\n", " AND revision_parent_id = 0\n", " AND event_entity = 'revision'\n", "-- find revisions moved to the archive table\n", " AND event_type = 'create'\n", " AND revision_is_deleted_by_page_deletion = TRUE\n", "-- remove bots\n", " AND SIZE(event_user_is_bot_by_historical) = 0 \n", "GROUP BY \n", " wiki_db\n", ")\n", "\n", "-- main query to aggregate and join sources above\n", "SELECT\n", " created_articles.wiki,\n", " created_cx,\n", " (created_total - created_cx) AS created_non_cx,\n", " deleted_cx,\n", " (deleted_total - deleted_cx) AS deleted_non_cx\n", "FROM created_articles\n", "JOIN 
deleted_articles ON \n", " created_articles.wiki = deleted_articles.wiki\n", "\"" ] }, { "cell_type": "code", "execution_count": 286, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Don't forget to authenticate with Kerberos using kinit\n", "\n" ] } ], "source": [ "cx_deletion_ratio_previous_6mo <- wmfdata::query_hive(query)" ] }, { "cell_type": "code", "execution_count": 287, "metadata": {}, "outputs": [], "source": [ "cx_deletion_ratio_bywiki_previous <- cx_deletion_ratio_previous_6mo %>%\n", " #filter(wiki == 'idwiki') %>%\n", " filter(created_cx > 15) %>% # only review wikis with more than 15 cx articles\n", " mutate(deleted_cx_ratio = deleted_cx/created_cx,\n", " deleted_non_cx_ratio = deleted_non_cx/created_non_cx,\n", " deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)\n", " ))\n" ] }, { "cell_type": "code", "execution_count": 288, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
A data.frame: 1 × 1
total_wikis
<int>
21
\n" ], "text/latex": [ "A data.frame: 1 × 1\n", "\\begin{tabular}{l}\n", " total\\_wikis\\\\\n", " \\\\\n", "\\hline\n", "\t 21\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 1 × 1\n", "\n", "| total_wikis <int> |\n", "|---|\n", "| 21 |\n", "\n" ], "text/plain": [ " total_wikis\n", "1 21 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_higher_previous <- cx_deletion_ratio_bywiki_previous %>%\n", " filter(deletion_ratio_diff < 0) %>%\n", " summarise(total_wikis = n())\n", "\n", "cx_deletion_higher_previous" ] }, { "cell_type": "code", "execution_count": 290, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Across all wikis where more than 15 articles have been created with content translation between July 2020 and December 2020, there were 21 wikis where articles created with content translation were deleted more than articles created without cx\"\n" ] } ], "source": [ "print(paste0(\"Across all wikis where more than 15 articles have been created with content translation between July 2020 and December 2020, there were \", \n", " cx_deletion_higher_previous[1], \n", " \" wikis where articles created with content translation were deleted more than articles created without cx\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of wikis with higher content translation deletion ratios decreased by 1, from 21 in July through December 2020 to 20 in January through June 2021.\n", "\n", "We next compared the two lists of wikis to check whether the wikis with higher deletion ratios were the same in both six-month periods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many wikis had higher deletion ratios for cx translated articles in both periods?" 
] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
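<table>
<caption>A data.frame: 21 × 8</caption>
<thead>
<tr><th>wiki</th><th>created_cx</th><th>created_non_cx</th><th>deleted_cx</th><th>deleted_non_cx</th><th>deleted_cx_ratio</th><th>deleted_non_cx_ratio</th><th>deletion_ratio_diff</th></tr>
<tr><th>&lt;chr&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th><th>&lt;dbl&gt;</th></tr>
</thead>
<tbody>
<tr><td>fywiki</td><td>17</td><td>1755</td><td>14</td><td>65</td><td>0.82352941</td><td>0.037037037</td><td>-0.786492375</td></tr>
<tr><td>hawwiki</td><td>42</td><td>132</td><td>31</td><td>24</td><td>0.73809524</td><td>0.181818182</td><td>-0.556277056</td></tr>
<tr><td>ltwiki</td><td>59</td><td>3337</td><td>28</td><td>644</td><td>0.47457627</td><td>0.192987714</td><td>-0.281588558</td></tr>
<tr><td>iswiki</td><td>26</td><td>2000</td><td>7</td><td>155</td><td>0.26923077</td><td>0.077500000</td><td>-0.191730769</td></tr>
<tr><td>lawiki</td><td>48</td><td>2979</td><td>9</td><td>158</td><td>0.18750000</td><td>0.053037932</td><td>-0.134462068</td></tr>
<tr><td>hywiki</td><td>159</td><td>33338</td><td>22</td><td>1080</td><td>0.13836478</td><td>0.032395465</td><td>-0.105969315</td></tr>
<tr><td>azwiki</td><td>206</td><td>29671</td><td>29</td><td>1885</td><td>0.14077670</td><td>0.063530046</td><td>-0.077246653</td></tr>
<tr><td>arywiki</td><td>63</td><td>2443</td><td>5</td><td>50</td><td>0.07936508</td><td>0.020466639</td><td>-0.058898440</td></tr>
<tr><td>mywiki</td><td>313</td><td>6698</td><td>37</td><td>439</td><td>0.11821086</td><td>0.065541953</td><td>-0.052668910</td></tr>
<tr><td>cywiki</td><td>122</td><td>1451</td><td>13</td><td>85</td><td>0.10655738</td><td>0.058580289</td><td>-0.047977088</td></tr>
<tr><td>vecwiki</td><td>20</td><td>10293</td><td>1</td><td>46</td><td>0.05000000</td><td>0.004469057</td><td>-0.045530943</td></tr>
<tr><td>arzwiki</td><td>133</td><td>355316</td><td>4</td><td>730</td><td>0.03007519</td><td>0.002054509</td><td>-0.028020679</td></tr>
<tr><td>eowiki</td><td>277</td><td>10800</td><td>8</td><td>90</td><td>0.02888087</td><td>0.008333333</td><td>-0.020547533</td></tr>
<tr><td>zhwiki</td><td>1512</td><td>80866</td><td>137</td><td>5753</td><td>0.09060847</td><td>0.071142384</td><td>-0.019466082</td></tr>
<tr><td>dewiki</td><td>505</td><td>119158</td><td>87</td><td>18351</td><td>0.17227723</td><td>0.154005606</td><td>-0.018271622</td></tr>
<tr><td>zh_yuewiki</td><td>35</td><td>23696</td><td>1</td><td>267</td><td>0.02857143</td><td>0.011267725</td><td>-0.017303704</td></tr>
<tr><td>ckbwiki</td><td>64</td><td>2901</td><td>5</td><td>183</td><td>0.07812500</td><td>0.063081696</td><td>-0.015043304</td></tr>
<tr><td>kuwiki</td><td>402</td><td>5291</td><td>13</td><td>133</td><td>0.03233831</td><td>0.025137025</td><td>-0.007201283</td></tr>
<tr><td>fiwiki</td><td>138</td><td>20467</td><td>12</td><td>1680</td><td>0.08695652</td><td>0.082083354</td><td>-0.004873168</td></tr>
<tr><td>etwiki</td><td>55</td><td>8239</td><td>6</td><td>865</td><td>0.10909091</td><td>0.104988469</td><td>-0.004102440</td></tr>
<tr><td>bswiki</td><td>62</td><td>2677</td><td>6</td><td>254</td><td>0.09677419</td><td>0.094882331</td><td>-0.001891863</td></tr>
</tbody>
</table>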
\n" ], "text/latex": [ "A data.frame: 21 × 8\n", "\\begin{tabular}{llllllll}\n", " wiki & created\\_cx & created\\_non\\_cx & deleted\\_cx & deleted\\_non\\_cx & deleted\\_cx\\_ratio & deleted\\_non\\_cx\\_ratio & deletion\\_ratio\\_diff\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\t fywiki & 17 & 1755 & 14 & 65 & 0.82352941 & 0.037037037 & -0.786492375\\\\\n", "\t hawwiki & 42 & 132 & 31 & 24 & 0.73809524 & 0.181818182 & -0.556277056\\\\\n", "\t ltwiki & 59 & 3337 & 28 & 644 & 0.47457627 & 0.192987714 & -0.281588558\\\\\n", "\t iswiki & 26 & 2000 & 7 & 155 & 0.26923077 & 0.077500000 & -0.191730769\\\\\n", "\t lawiki & 48 & 2979 & 9 & 158 & 0.18750000 & 0.053037932 & -0.134462068\\\\\n", "\t hywiki & 159 & 33338 & 22 & 1080 & 0.13836478 & 0.032395465 & -0.105969315\\\\\n", "\t azwiki & 206 & 29671 & 29 & 1885 & 0.14077670 & 0.063530046 & -0.077246653\\\\\n", "\t arywiki & 63 & 2443 & 5 & 50 & 0.07936508 & 0.020466639 & -0.058898440\\\\\n", "\t mywiki & 313 & 6698 & 37 & 439 & 0.11821086 & 0.065541953 & -0.052668910\\\\\n", "\t cywiki & 122 & 1451 & 13 & 85 & 0.10655738 & 0.058580289 & -0.047977088\\\\\n", "\t vecwiki & 20 & 10293 & 1 & 46 & 0.05000000 & 0.004469057 & -0.045530943\\\\\n", "\t arzwiki & 133 & 355316 & 4 & 730 & 0.03007519 & 0.002054509 & -0.028020679\\\\\n", "\t eowiki & 277 & 10800 & 8 & 90 & 0.02888087 & 0.008333333 & -0.020547533\\\\\n", "\t zhwiki & 1512 & 80866 & 137 & 5753 & 0.09060847 & 0.071142384 & -0.019466082\\\\\n", "\t dewiki & 505 & 119158 & 87 & 18351 & 0.17227723 & 0.154005606 & -0.018271622\\\\\n", "\t zh\\_yuewiki & 35 & 23696 & 1 & 267 & 0.02857143 & 0.011267725 & -0.017303704\\\\\n", "\t ckbwiki & 64 & 2901 & 5 & 183 & 0.07812500 & 0.063081696 & -0.015043304\\\\\n", "\t kuwiki & 402 & 5291 & 13 & 133 & 0.03233831 & 0.025137025 & -0.007201283\\\\\n", "\t fiwiki & 138 & 20467 & 12 & 1680 & 0.08695652 & 0.082083354 & -0.004873168\\\\\n", "\t etwiki & 55 & 8239 & 6 & 865 & 0.10909091 & 0.104988469 & -0.004102440\\\\\n", "\t bswiki 
& 62 & 2677 & 6 & 254 & 0.09677419 & 0.094882331 & -0.001891863\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 21 × 8\n", "\n", "| wiki <chr> | created_cx <int> | created_non_cx <int> | deleted_cx <int> | deleted_non_cx <int> | deleted_cx_ratio <dbl> | deleted_non_cx_ratio <dbl> | deletion_ratio_diff <dbl> |\n", "|---|---|---|---|---|---|---|---|\n", "| fywiki | 17 | 1755 | 14 | 65 | 0.82352941 | 0.037037037 | -0.786492375 |\n", "| hawwiki | 42 | 132 | 31 | 24 | 0.73809524 | 0.181818182 | -0.556277056 |\n", "| ltwiki | 59 | 3337 | 28 | 644 | 0.47457627 | 0.192987714 | -0.281588558 |\n", "| iswiki | 26 | 2000 | 7 | 155 | 0.26923077 | 0.077500000 | -0.191730769 |\n", "| lawiki | 48 | 2979 | 9 | 158 | 0.18750000 | 0.053037932 | -0.134462068 |\n", "| hywiki | 159 | 33338 | 22 | 1080 | 0.13836478 | 0.032395465 | -0.105969315 |\n", "| azwiki | 206 | 29671 | 29 | 1885 | 0.14077670 | 0.063530046 | -0.077246653 |\n", "| arywiki | 63 | 2443 | 5 | 50 | 0.07936508 | 0.020466639 | -0.058898440 |\n", "| mywiki | 313 | 6698 | 37 | 439 | 0.11821086 | 0.065541953 | -0.052668910 |\n", "| cywiki | 122 | 1451 | 13 | 85 | 0.10655738 | 0.058580289 | -0.047977088 |\n", "| vecwiki | 20 | 10293 | 1 | 46 | 0.05000000 | 0.004469057 | -0.045530943 |\n", "| arzwiki | 133 | 355316 | 4 | 730 | 0.03007519 | 0.002054509 | -0.028020679 |\n", "| eowiki | 277 | 10800 | 8 | 90 | 0.02888087 | 0.008333333 | -0.020547533 |\n", "| zhwiki | 1512 | 80866 | 137 | 5753 | 0.09060847 | 0.071142384 | -0.019466082 |\n", "| dewiki | 505 | 119158 | 87 | 18351 | 0.17227723 | 0.154005606 | -0.018271622 |\n", "| zh_yuewiki | 35 | 23696 | 1 | 267 | 0.02857143 | 0.011267725 | -0.017303704 |\n", "| ckbwiki | 64 | 2901 | 5 | 183 | 0.07812500 | 0.063081696 | -0.015043304 |\n", "| kuwiki | 402 | 5291 | 13 | 133 | 0.03233831 | 0.025137025 | -0.007201283 |\n", "| fiwiki | 138 | 20467 | 12 | 1680 | 0.08695652 | 0.082083354 | -0.004873168 |\n", "| etwiki | 55 | 8239 | 6 | 865 | 0.10909091 | 0.104988469 | 
-0.004102440 |\n", "| bswiki | 62 | 2677 | 6 | 254 | 0.09677419 | 0.094882331 | -0.001891863 |\n", "\n" ], "text/plain": [ " wiki created_cx created_non_cx deleted_cx deleted_non_cx\n", "1 fywiki 17 1755 14 65 \n", "2 hawwiki 42 132 31 24 \n", "3 ltwiki 59 3337 28 644 \n", "4 iswiki 26 2000 7 155 \n", "5 lawiki 48 2979 9 158 \n", "6 hywiki 159 33338 22 1080 \n", "7 azwiki 206 29671 29 1885 \n", "8 arywiki 63 2443 5 50 \n", "9 mywiki 313 6698 37 439 \n", "10 cywiki 122 1451 13 85 \n", "11 vecwiki 20 10293 1 46 \n", "12 arzwiki 133 355316 4 730 \n", "13 eowiki 277 10800 8 90 \n", "14 zhwiki 1512 80866 137 5753 \n", "15 dewiki 505 119158 87 18351 \n", "16 zh_yuewiki 35 23696 1 267 \n", "17 ckbwiki 64 2901 5 183 \n", "18 kuwiki 402 5291 13 133 \n", "19 fiwiki 138 20467 12 1680 \n", "20 etwiki 55 8239 6 865 \n", "21 bswiki 62 2677 6 254 \n", " deleted_cx_ratio deleted_non_cx_ratio deletion_ratio_diff\n", "1 0.82352941 0.037037037 -0.786492375 \n", "2 0.73809524 0.181818182 -0.556277056 \n", "3 0.47457627 0.192987714 -0.281588558 \n", "4 0.26923077 0.077500000 -0.191730769 \n", "5 0.18750000 0.053037932 -0.134462068 \n", "6 0.13836478 0.032395465 -0.105969315 \n", "7 0.14077670 0.063530046 -0.077246653 \n", "8 0.07936508 0.020466639 -0.058898440 \n", "9 0.11821086 0.065541953 -0.052668910 \n", "10 0.10655738 0.058580289 -0.047977088 \n", "11 0.05000000 0.004469057 -0.045530943 \n", "12 0.03007519 0.002054509 -0.028020679 \n", "13 0.02888087 0.008333333 -0.020547533 \n", "14 0.09060847 0.071142384 -0.019466082 \n", "15 0.17227723 0.154005606 -0.018271622 \n", "16 0.02857143 0.011267725 -0.017303704 \n", "17 0.07812500 0.063081696 -0.015043304 \n", "18 0.03233831 0.025137025 -0.007201283 \n", "19 0.08695652 0.082083354 -0.004873168 \n", "20 0.10909091 0.104988469 -0.004102440 \n", "21 0.09677419 0.094882331 -0.001891863 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cx_deletion_higher_list_previous <- cx_deletion_ratio_bywiki_previous %>%\n", " 
filter(deletion_ratio_diff < 0) %>%\n", " arrange(deletion_ratio_diff)\n", "\n", "cx_deletion_higher_list_previous" ] }, { "cell_type": "code", "execution_count": 294, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 8 × 1
wiki
<chr>
hawwiki
iswiki
kuwiki
arywiki
arzwiki
fiwiki
lawiki
eowiki
\n" ], "text/latex": [ "A data.frame: 8 × 1\n", "\\begin{tabular}{l}\n", " wiki\\\\\n", " \\\\\n", "\\hline\n", "\t hawwiki\\\\\n", "\t iswiki \\\\\n", "\t kuwiki \\\\\n", "\t arywiki\\\\\n", "\t arzwiki\\\\\n", "\t fiwiki \\\\\n", "\t lawiki \\\\\n", "\t eowiki \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 8 × 1\n", "\n", "| wiki <chr> |\n", "|---|\n", "| hawwiki |\n", "| iswiki |\n", "| kuwiki |\n", "| arywiki |\n", "| arzwiki |\n", "| fiwiki |\n", "| lawiki |\n", "| eowiki |\n", "\n" ], "text/plain": [ " wiki \n", "1 hawwiki\n", "2 iswiki \n", "3 kuwiki \n", "4 arywiki\n", "5 arzwiki\n", "6 fiwiki \n", "7 lawiki \n", "8 eowiki " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "intersect(cx_deletion_higher_list_current[1], cx_deletion_higher_list_previous[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eight wikis had higher deletion ratios for content translated articles in both six-month periods. " ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 4 }