{
"cells": [
{
"cell_type": "markdown",
"id": "f313f054-03ba-4431-bdd0-20082e476892",
"metadata": {},
"source": [
"# What projects do volunteers from sub-Saharan Africa edit most?\n",
"\n",
"Date: 17 June 2022\n",
"\n",
"[Task](https://phabricator.wikimedia.org/T310322)"
]
},
{
"cell_type": "markdown",
"id": "82768a4d-35e0-4976-ac52-a951a1623b3b",
"metadata": {},
"source": [
"# Scratch Notes\n",
"* First-time editors cannot be identified directly; the aggregated data currently only has a 1-4 edits group. Is there any way to get this from the aggregated data? Might need to look at the raw data.\n",
"* Find a quick way to limit the project family to Wikipedia.\n",
"* There is also Geoeditors Edits Monthly, which shows the number of edits rather than distinct editors.\n",
"* How are anons handled in the data?"
]
},
{
"cell_type": "markdown",
"id": "b087001b-0d1a-46ac-b98d-2454ae7d0366",
"metadata": {},
"source": [
"# Purpose\n",
"\n",
"For the 2022-2023 fiscal year, the Editing Team is considering centering the work we will do around people A) who are contributing to a Wikipedia from sub-Saharan Africa and B) who have made fewer than 100 cumulative edits to Wikipedia.\n",
"\n",
"Before moving forward with defining the audience in the ways \"A)\" and \"B)\" describe, we would like to understand:\n",
"\n",
"* Which language Wikipedias are people who contribute from sub-Saharan Africa and have made fewer than 100 cumulative edits most active on?\n",
"Knowing the above will help us determine what projects we will consider partnering most closely with in the 2022-2023 fiscal year.\n",
"* On average, how many people from sub-Saharan Africa during a given month are making their first edit to Wikipedia as an account holder?\n",
"Knowing the above will help us determine how long we can expect the planned quantitative analyses to take to reach statistical significance.\n"
]
},
{
"cell_type": "markdown",
"id": "91a609f3-290a-4934-ad3a-d1d93ed3b89b",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"# Data caveats\n",
"* This data comes from [geoeditors_monthly](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors), which is accessible via Superset.\n",
"* To get details on first-time editors, `wmf.mediawiki_history` needs to be joined with `wmf.editors_daily` for geolocation data. The query and results for that are noted below.\n",
"* Reviewed data logged from 1 January 2022 to date (17 June 2022).\n",
"* Limited analysis to countries within the sub-Saharan Africa region as defined in https://docs.google.com/spreadsheets/d/1xPzo1KtspT2EaWbO_HfLHRvuk9BfbXsfwjkrOV_5m2M/edit#gid=741931431\n",
"* Geoeditors Data https://superset.wikimedia.org/superset/dashboard/9/\n"
]
},
{
"cell_type": "markdown",
"id": "86574516-f3e4-4311-bb39-ca28c9a20fbc",
"metadata": {},
"source": [
"# References\n",
"* [Regional pageview and editor metrics](https://docs.google.com/spreadsheets/d/1BFymKLz8PzClZBIDT7DoaNLARe5xWOHGxX0CDYLt--8/edit#gid=1419525221)\n",
"* Irene’s recent work on SSA Campaigns activity T287715\n",
"* [Irene's superset dashboard](https://superset.wikimedia.org/superset/dashboard/88/)\n",
"* [Sub-Saharan Africa country names and codes](https://docs.google.com/spreadsheets/d/18zsc-KqEbAfgQrD4yGCC60PC42i4t39agz8DFRvBhEs/edit#gid=0)"
]
},
{
"cell_type": "markdown",
"id": "148e1bbd-d8bf-4dda-b40e-4b74da0f2263",
"metadata": {},
"source": [
"\n",
"## Data Notes:\n",
"* Data comes from the `wmf.editors_daily` and `wmf.mediawiki_history` tables.\n",
"* Data is limited to May 2022 as data from `wmf.editors_daily` is only retained for the last two calendar months.\n",
"* First-time editors are registered users whose first edit on a wiki (`event_user_revision_count` = 1) was made during the month, not users who made only one edit."
]
},
{
"cell_type": "markdown",
"id": "69f2c3bd-18c8-4244-8e8a-03040b5118c8",
"metadata": {},
"source": [
"# On average, how many people from sub-Saharan Africa during a given month are making their first edit to Wikipedia as an account holder?\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "a66e0e31-cced-4203-b8db-78cd5a869844",
"metadata": {},
"outputs": [],
"source": [
"shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))\n",
"shhh({\n",
" library(tidyverse); library(glue); library(lubridate); library(scales)\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "103ec952-15c7-4a71-a092-969927169598",
"metadata": {},
"source": [
"### May 2022"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "5774a904-17e0-47c2-aa54-d36a0b0a27a7",
"metadata": {},
"outputs": [],
"source": [
"query <-\n",
"\n",
"\"\n",
"SELECT\n",
"mwh.event_user_id as user_id,\n",
"-- group each user by their lowest cumulative edit count on the wiki during the month\n",
"CASE\n",
" WHEN min(event_user_revision_count) = 1 THEN 'first_time_editor'\n",
" WHEN min(event_user_revision_count) < 100 THEN 'under 100'\n",
" WHEN (min(event_user_revision_count) >= 100 AND min(event_user_revision_count) <= 500) THEN '100-500'\n",
" ELSE 'over 500'\n",
" END AS edit_count_group,\n",
"mwh.wiki_db AS project\n",
"FROM \n",
"wmf.mediawiki_history mwh\n",
"-- limit to the Wikipedia project family\n",
"INNER JOIN canonical_data.wikis\n",
"ON mwh.wiki_db = database_code AND\n",
" database_group = 'wikipedia'\n",
"INNER JOIN\n",
"wmf.editors_daily ed\n",
"ON mwh.event_user_id = CAST(ed.user_fingerprint_or_id AS BIGINT)\n",
"AND mwh.wiki_db = ed.wiki_db\n",
"WHERE\n",
"-- from sub-Saharan Africa region\n",
"ed.country_code IN ('AO','BJ', 'BW','BF','BI','CM','CV','TD','KM','CI', 'DJ', 'GQ', 'ER', 'ET', 'GA','GM','GH',\n",
"'GN','GW','KE','LS','LR','MG','MW','ML','MU', 'MZ', 'NA', 'NE', 'NG', 'CD', 'RW', 'ST', 'SN', 'SC', 'SL', 'SO', 'ZA', \n",
"'SD', 'TZ','CF', 'CD', 'TG', 'UG', 'ZM','ZW')\n",
"AND ed.month = '2022-05'\n",
"AND mwh.snapshot = '2022-05'\n",
"AND mwh.event_timestamp >= '2022-05-01'\n",
"AND mwh.event_timestamp < '2022-06-01'\n",
"AND ed.user_is_anonymous = FALSE \n",
"AND mwh.event_user_is_anonymous = FALSE\n",
"GROUP BY \n",
"mwh.event_user_id,\n",
"mwh.wiki_db\n",
"\""
]
},
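{
"cell_type": "markdown",
"id": "sketch-edit-count-groups",
"metadata": {},
"source": [
"A minimal sketch, in R, of the edit-count grouping that the `CASE` expression in the query above is meant to produce. The helper name `edit_count_group()` and the input values are hypothetical, chosen only to show where each boundary falls; they are not results from the data.\n",
"\n",
"```r\n",
"# Hypothetical helper mirroring the intended CASE logic.\n",
"# Input: a user's lowest cumulative edit count on a wiki during the month.\n",
"edit_count_group <- function(min_revision_count) {\n",
"  ifelse(min_revision_count == 1, 'first_time_editor',\n",
"  ifelse(min_revision_count < 100, 'under 100',\n",
"  ifelse(min_revision_count <= 500, '100-500', 'over 500')))\n",
"}\n",
"\n",
"# Illustrative inputs at the group boundaries (not real data)\n",
"edit_count_group(c(1, 2, 99, 100, 500, 501))\n",
"```\n",
"\n",
"Because the lowest count in the month is used, a user lands in `first_time_editor` if their first edit on that wiki happened during the month, even if they went on to make more edits in the same month."
]
},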
{
"cell_type": "code",
"execution_count": 57,
"id": "79bef232-c2d0-4b9a-88a1-871ade0e1442",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Don't forget to authenticate with Kerberos using kinit\n",
"\n"
]
}
],
"source": [
"first_time_editors_may <- wmfdata::query_hive(query)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "fd9f3c42-0847-4d62-a17b-8b641a351f4b",
"metadata": {},
"outputs": [
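{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A data.frame: 1 × 1</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>n_users</th></tr>\n",
"\t<tr><th scope=col>&lt;int&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>10534</td></tr>\n",
"</tbody>\n",
"</table>\n"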
],
"text/latex": [
"A data.frame: 1 × 1\n",
"\\begin{tabular}{l}\n",
" n\\_users\\\\\n",
" \\\\\n",
"\\hline\n",
"\t 10534\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 1 × 1\n",
"\n",
"| n_users <int> |\n",
"|---|\n",
"| 10534 |\n",
"\n"
],
"text/plain": [
" n_users\n",
"1 10534 "
]
},
"metadata": {},
"output_type": "display_data"
}
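],
"source": [
"# NOTE: the first-time-editor filter is commented out and the data is limited to enwiki,\n",
"# so the count below covers all registered enwiki editors from the region in May,\n",
"# not first-time editors across all Wikipedias.\n",
"first_time_editors_total_may <- first_time_editors_may %>%\n",
" #filter(edit_count_group == 'first_time_editor') %>%\n",
" filter(project == 'enwiki') %>%\n",
" summarise(n_users = n_distinct(user_id))\n",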
"\n",
"first_time_editors_total_may"
]
},
{
"cell_type": "markdown",
"id": "ae25e381-9f22-4ee2-b157-e42dc95e67ab",
"metadata": {},
"source": [
"### April 2022"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "f4968917-b085-4bf4-a30d-40fbf4a9c1a4",
"metadata": {},
"outputs": [],
"source": [
"query <-\n",
"\n",
"\"\n",
"SELECT\n",
"mwh.event_user_id as user_id,\n",
"-- group each user by their lowest cumulative edit count on the wiki during the month\n",
"CASE\n",
" WHEN min(event_user_revision_count) = 1 THEN 'first_time_editor'\n",
" WHEN min(event_user_revision_count) < 100 THEN 'under 100'\n",
" WHEN (min(event_user_revision_count) >= 100 AND min(event_user_revision_count) <= 500) THEN '100-500'\n",
" ELSE 'over 500'\n",
" END AS edit_count_group,\n",
"mwh.wiki_db AS project\n",
"FROM \n",
"wmf.mediawiki_history mwh\n",
"-- limit to the Wikipedia project family\n",
"INNER JOIN canonical_data.wikis\n",
"ON mwh.wiki_db = database_code AND\n",
" database_group = 'wikipedia'\n",
"INNER JOIN\n",
"wmf.editors_daily ed\n",
"ON mwh.event_user_id = CAST(ed.user_fingerprint_or_id AS BIGINT)\n",
"AND mwh.wiki_db = ed.wiki_db\n",
"WHERE\n",
"-- from sub-Saharan Africa region\n",
"ed.country_code IN ('AO','BJ', 'BW','BF','BI','CM','CV','TD','KM','CI', 'DJ', 'GQ', 'ER', 'ET', 'GA','GM','GH',\n",
"'GN','GW','KE','LS','LR','MG','MW','ML','MU', 'MZ', 'NA', 'NE', 'NG', 'CD', 'RW', 'ST', 'SN', 'SC', 'SL', 'SO', 'ZA', \n",
"'SD', 'TZ','CF', 'CD', 'TG', 'UG', 'ZM','ZW')\n",
"AND ed.month = '2022-04'\n",
"AND mwh.snapshot = '2022-05'\n",
"AND mwh.event_timestamp >= '2022-04-01'\n",
"AND mwh.event_timestamp < '2022-05-01'\n",
"AND ed.user_is_anonymous = FALSE \n",
"AND mwh.event_user_is_anonymous = FALSE\n",
"GROUP BY \n",
"mwh.event_user_id,\n",
"mwh.wiki_db\n",
"\""
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "316e3bde-e90a-45ca-96e7-c5ee6d0ca4ca",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Don't forget to authenticate with Kerberos using kinit\n",
"\n"
]
}
],
"source": [
"first_time_editors_april <- wmfdata::query_hive(query)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "adac7e69-e98e-4f42-8fe9-258e32f5f264",
"metadata": {},
"outputs": [
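{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A data.frame: 1 × 1</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>n_users</th></tr>\n",
"\t<tr><th scope=col>&lt;int&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>3072</td></tr>\n",
"</tbody>\n",
"</table>\n"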
],
"text/latex": [
"A data.frame: 1 × 1\n",
"\\begin{tabular}{l}\n",
" n\\_users\\\\\n",
" \\\\\n",
"\\hline\n",
"\t 3072\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A data.frame: 1 × 1\n",
"\n",
"| n_users <int> |\n",
"|---|\n",
"| 3072 |\n",
"\n"
],
"text/plain": [
" n_users\n",
"1 3072 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"first_time_editors_total_april <- first_time_editors_april %>%\n",
" filter(edit_count_group == 'first_time_editor') %>%\n",
" summarise(n_users = n_distinct(user_id))\n",
"\n",
"first_time_editors_total_april"
]
},
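{
"cell_type": "markdown",
"id": "sketch-month-filters",
"metadata": {},
"source": [
"The May and April queries above differ only in their month filters. Below is a minimal sketch of deriving those filters from a single month string, so the same query text could be reused for each month. The helper `month_filters()` is hypothetical and uses base R only. The column names are the ones already used in the queries above, and the snapshot is passed separately because both queries read the `2022-05` snapshot.\n",
"\n",
"```r\n",
"# Hypothetical helper: build the month-specific WHERE conditions for a 'YYYY-MM' month.\n",
"month_filters <- function(month, snapshot) {\n",
"  start <- as.Date(paste0(month, '-01'))\n",
"  # First day of the following month, used as an exclusive upper bound\n",
"  end <- seq(start, by = 'month', length.out = 2)[2]\n",
"  conditions <- c(\n",
"    sprintf(\"ed.month = '%s'\", month),\n",
"    sprintf(\"mwh.snapshot = '%s'\", snapshot),\n",
"    sprintf(\"mwh.event_timestamp >= '%s'\", format(start)),\n",
"    sprintf(\"mwh.event_timestamp < '%s'\", format(end))\n",
"  )\n",
"  paste(conditions, collapse = '\\nAND ')\n",
"}\n",
"\n",
"# Print the conditions for April 2022, read from the 2022-05 snapshot\n",
"cat(month_filters('2022-04', snapshot = '2022-05'))\n",
"```\n",
"\n",
"The resulting text could be spliced into the query string (for example with `glue`, which is already loaded) in place of the hard-coded month conditions."
]
},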
{
"cell_type": "markdown",
"id": "00d668dd-4ebf-4d7a-8340-8764fdd7ae9b",
"metadata": {},
"source": [
"# Which language Wikipedias are people who contribute from sub-Saharan Africa and have made fewer than 100 cumulative edits most active on?\n",
"\n",
"Note: We'll use May to get a more recent picture."
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "2b58fbee-0e04-434e-82ff-04af994ac7bd",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"`summarise()` ungrouping output (override with `.groups` argument)\n",
"\n"
]
},
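{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A tibble: 15 × 3</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>project</th><th scope=col>n_users</th><th scope=col>pct_users</th></tr>\n",
"\t<tr><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;int&gt;</th><th scope=col>&lt;chr&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>enwiki</td><td>3017</td><td>68.17%</td></tr>\n",
"\t<tr><td>frwiki</td><td>711</td><td>16.06%</td></tr>\n",
"\t<tr><td>ptwiki</td><td>143</td><td>3.23%</td></tr>\n",
"\t<tr><td>arwiki</td><td>97</td><td>2.19%</td></tr>\n",
"\t<tr><td>swwiki</td><td>67</td><td>1.51%</td></tr>\n",
"\t<tr><td>igwiki</td><td>41</td><td>0.93%</td></tr>\n",
"\t<tr><td>hawiki</td><td>37</td><td>0.84%</td></tr>\n",
"\t<tr><td>simplewiki</td><td>32</td><td>0.72%</td></tr>\n",
"\t<tr><td>yowiki</td><td>22</td><td>0.5%</td></tr>\n",
"\t<tr><td>dewiki</td><td>21</td><td>0.47%</td></tr>\n",
"\t<tr><td>rwwiki</td><td>21</td><td>0.47%</td></tr>\n",
"\t<tr><td>afwiki</td><td>18</td><td>0.41%</td></tr>\n",
"\t<tr><td>sowiki</td><td>14</td><td>0.32%</td></tr>\n",
"\t<tr><td>twwiki</td><td>13</td><td>0.29%</td></tr>\n",
"\t<tr><td>eswiki</td><td>12</td><td>0.27%</td></tr>\n",
"</tbody>\n",
"</table>\n"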
],
"text/latex": [
"A tibble: 15 × 3\n",
"\\begin{tabular}{lll}\n",
" project & n\\_users & pct\\_users\\\\\n",
" & & \\\\\n",
"\\hline\n",
"\t enwiki & 3017 & 68.17\\%\\\\\n",
"\t frwiki & 711 & 16.06\\%\\\\\n",
"\t ptwiki & 143 & 3.23\\% \\\\\n",
"\t arwiki & 97 & 2.19\\% \\\\\n",
"\t swwiki & 67 & 1.51\\% \\\\\n",
"\t igwiki & 41 & 0.93\\% \\\\\n",
"\t hawiki & 37 & 0.84\\% \\\\\n",
"\t simplewiki & 32 & 0.72\\% \\\\\n",
"\t yowiki & 22 & 0.5\\% \\\\\n",
"\t dewiki & 21 & 0.47\\% \\\\\n",
"\t rwwiki & 21 & 0.47\\% \\\\\n",
"\t afwiki & 18 & 0.41\\% \\\\\n",
"\t sowiki & 14 & 0.32\\% \\\\\n",
"\t twwiki & 13 & 0.29\\% \\\\\n",
"\t eswiki & 12 & 0.27\\% \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 15 × 3\n",
"\n",
"| project <chr> | n_users <int> | pct_users <chr> |\n",
"|---|---|---|\n",
"| enwiki | 3017 | 68.17% |\n",
"| frwiki | 711 | 16.06% |\n",
"| ptwiki | 143 | 3.23% |\n",
"| arwiki | 97 | 2.19% |\n",
"| swwiki | 67 | 1.51% |\n",
"| igwiki | 41 | 0.93% |\n",
"| hawiki | 37 | 0.84% |\n",
"| simplewiki | 32 | 0.72% |\n",
"| yowiki | 22 | 0.5% |\n",
"| dewiki | 21 | 0.47% |\n",
"| rwwiki | 21 | 0.47% |\n",
"| afwiki | 18 | 0.41% |\n",
"| sowiki | 14 | 0.32% |\n",
"| twwiki | 13 | 0.29% |\n",
"| eswiki | 12 | 0.27% |\n",
"\n"
],
"text/plain": [
" project n_users pct_users\n",
"1 enwiki 3017 68.17% \n",
"2 frwiki 711 16.06% \n",
"3 ptwiki 143 3.23% \n",
"4 arwiki 97 2.19% \n",
"5 swwiki 67 1.51% \n",
"6 igwiki 41 0.93% \n",
"7 hawiki 37 0.84% \n",
"8 simplewiki 32 0.72% \n",
"9 yowiki 22 0.5% \n",
"10 dewiki 21 0.47% \n",
"11 rwwiki 21 0.47% \n",
"12 afwiki 18 0.41% \n",
"13 sowiki 14 0.32% \n",
"14 twwiki 13 0.29% \n",
"15 eswiki 12 0.27% "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"first_time_editors_byproject <- first_time_editors_may %>%\n",
" filter(edit_count_group %in% c('first_time_editor', 'under 100')) %>%\n",
" group_by(project) %>%\n",
" summarise(n_users = n_distinct(user_id)) %>%\n",
" mutate(pct_users = paste0(round(prop.table(n_users) * 100, 2), \"%\")) %>%\n",
" arrange(desc(n_users))\n",
" \n",
"head(first_time_editors_byproject, 15)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61a1120e-a8af-4f5c-aae5-ef6674ad2a3d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}