{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 5: Persistence of editor actions over time\n", "\n", "[Wikiwho Edit Persistence](https://www.wikiwho.net/en/edit_persistence/v1.0.0-beta/) to track the persistence of all actions done in a page, or all actions done by an editor by providing monthly accumulates (sums) of the actions (insertions, deletions and reinserions) per page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Edit Persistence of a page\n", "\n", "The method `edit_persistance` will query the persistence of changes on a page." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from wikiwho_wrapper import WikiWho\n", "ww = WikiWho(lng='en')\n", "\n", "df = ww.dv.edit_persistence(page_id=6187)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe accumulates several actions by month (`year_month` column), and editor (`editor_id`):\n", " \n", "- **`adds`**: number of tokens inserted for the first time\n", "- **`adds_surv_48h`**: number of tokens inserted for the first time that survived at least 48 hours\n", "- **`adds_persistent`**: number of tokens inserted for the first time that survived until, at least, the end of the month\n", "- **`adds_stopword_count`**: number of tokens inserted that were stop words\n", "- **`dels`**: number of tokens deleted\n", "- **`dels_surv_48h`**: number of tokens deleted that were not resinserted in the next 48 hours\n", "- **`dels_persistent`**: number of tokens deleted that were not resinserted until, at least, the end of the month\n", "- **`dels_stopword_count`**: number of tokens deleted that were stop words\n", "- **`reins`**: number of tokens reinserted \n", "- **`reins_surv_48h`**: number of tokens reinserted that survived at least 48 hours\n", "- **`reins_persistent`**: number of tokens reinserted that survived until the end of the month\n", "- **`reins_stopword_count`**: number of tokens reinserted that were stop words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Calculating total actions\n", "\n", "The dataframe present the persistance of granulated actions, i.e insertions, resinsertions and deletions, so a common operation is to add them up to get the big picture of the persistence." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Calculating total actions (regardles persistence)\n", "df['total_actions'] = df['adds'] + df['dels'] + df['reins']\n", "\n", "#Calculating total actions in 48h\n", "df['total_actions_48h'] = df['adds_surv_48h'] + df['dels_surv_48h'] + df['reins_surv_48h']\n", "\n", "# Calculating total persistent actions\n", "df['total_persistent'] = df['adds_persistent'] + df['dels_persistent'] + df['reins_persistent']\n", "\n", "# Calculating total stopword counts\n", "df['total_stopword_count'] = df['adds_stopword_count'] + df['dels_stopword_count'] + df['reins_stopword_count']\n", "\n", "#display\n", "df[['year_month', 'editor_id', 'total_actions', 'total_actions_48h', 'total_persistent', 'total_stopword_count']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Edit persistance of a page (without editor)\n", "\n", "The dataframe present the data by month and by editor. If the goal is to only track the changes of the pages (without the editor), a simple groupby will be sufficient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['editor_id']).groupby(['year_month', 'page_id']).sum().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. Total actions per editor (without month)\n", "\n", "Another possibility is to calculate the total number of actions per editor regardless the time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['page_id']).groupby('editor_id').sum().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 Number of actions per editor\n", "\n", "We can also check the number of months in which each of the editors have had, at least, one contribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby('editor_id').size().sort_values(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Edit Persistence of an editor (across all page)\n", "\n", "The most valuable service that the `edit_persistance` function provides is tracking all actions of an editor across all pages. Let's start with the editor id `2092791` taken from the previous section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = ww.dv.edit_persistence(editor_id=2092791)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe accumulates several actions by month (`year_month` column), and page (`page_id`):\n", " \n", "- **`adds`**: number of tokens inserted for the first time\n", "- **`adds_surv_48h`**: number of tokens inserted for the first time that survived at least 48 hours\n", "- **`adds_persistent`**: number of tokens inserted for the first time that survived until, at least, the end of the month\n", "- **`adds_stopword_count`**: number of tokens inserted that were stop words\n", "- **`dels`**: number of tokens deleted\n", "- **`dels_surv_48h`**: number of tokens deleted that were not resinserted in the next 48 hours\n", "- **`dels_persistent`**: number of tokens deleted that were not resinserted until, at least, the end of the month\n", "- **`dels_stopword_count`**: number of tokens deleted that were stop words\n", "- **`reins`**: number of tokens reinserted \n", "- **`reins_surv_48h`**: number of tokens reinserted that survived at least 48 hours\n", "- **`reins_persistent`**: number of tokens reinserted that survived until the end of the month\n", "- **`reins_stopword_count`**: number of tokens reinserted that were stop words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Calculating total actions \n", "\n", "The dataframe present the persistance of granulated actions, i.e insertions, resinsertions and deletions, so a common operation is to add them up to get the big picture of the persistence." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Calculating total actions (regardles persistence)\n", "df['total_actions'] = df['adds'] + df['dels'] + df['reins']\n", "\n", "#Calculating total actions in 48h\n", "df['total_actions_48h'] = df['adds_surv_48h'] + df['dels_surv_48h'] + df['reins_surv_48h']\n", "\n", "# Calculating total persistent actions\n", "df['total_persistent'] = df['adds_persistent'] + df['dels_persistent'] + df['reins_persistent']\n", "\n", "# Calculating total stopword counts\n", "df['total_stopword_count'] = df['adds_stopword_count'] + df['dels_stopword_count'] + df['reins_stopword_count']\n", "\n", "#display\n", "df[['year_month', 'editor_id', 'total_actions', 'total_actions_48h', 'total_persistent', 'total_stopword_count']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Edit persistance of an editor (across all pages)\n", "\n", "The dataframe present the data by month and by page. If the goal is to only track the actions of the editor (without the pages), a simple groupby will be sufficient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['page_id']).groupby(['year_month', 'editor_id']).sum().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Total actions per editor (without month)\n", "\n", "Another possibility is to calculate the total number of actions (per page) per editor regardless the time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['editor_id']).groupby(['page_id']).sum().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can get number of actions across all pages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['page_id']).groupby('editor_id').sum().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Number of actions per page\n", "\n", "We can also check the number of months in which each of the editors have had, at least, one contribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby('page_id').size().sort_values(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the above also list all the pages in which an author has contributed. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from utils.notebooks import get_next_notebook\n", "from IPython.display import HTML\n", "try:\n", " display(HTML(f'Go to next workbook'))\n", "except:\n", " HTML('Go to next workbook')\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }