{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is focused on users. You can:\n",
    "- [Fetch all languages of a specific user](#languages_of_user)\n",
    "- [Get the list of users speaking a specific language](#users_of_language) \n",
    "- [Get the list of native speakers of a specific language](#natives_of_language)\n",
    "- [Get a list of native speakers of one language speaking another](#natives_speaking)\n",
    "- [Check the date of the last contribution of some users](#last_contribution)\n",
    "\n",
    "Before experimenting with any of the options described above, it is necessary to set and execute the cells under the [Languages of users](#languages_of_users) section.\n",
    "\n",
    "If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. You should see that cells that are running have an `In[*]` that will become `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In any case, to be able to use the notebook correctly, please run the two following cells first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import csv\n",
    "import tarfile\n",
    "\n",
    "import requests as req\n",
    "from bs4 import BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_colwidth', -1) # To display full content of the column\n",
    "# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='languages_of_users'></a>\n",
    "# Languages of users"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the two following cell (you don't have to modify it). \n",
    "\n",
    "Note that by default, we remove unknown users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_user_languages():\n",
    "    data = pd.read_csv('./user_languages.csv', \n",
    "            sep='\\t', \n",
    "            header=None, \n",
    "            names=['Language', 'Level', 'Username', 'Details'],\n",
    "            quoting=csv.QUOTE_NONE)\n",
    "    # The next two lines remove unknown users\n",
    "    data = data[data['Username'] != r'\\N']\n",
    "    data = data.dropna(subset=['Username'])\n",
    "    return data.fillna('')\n",
    "\n",
    "user_infos = read_user_languages()\n",
    "print(f\"{len(user_infos):,} users found.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below displays 10 random rows, just to give you an overview of the structure of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_infos.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='languages_of_user'></a>\n",
    "### Languages of a specific user"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell (you don't have to modify it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def languages_of_user(username, user_frame):\n",
    "    return user_frame[user_frame['Username'] == username].sort_values(by='Level', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace `username` by the username you want to check, and run the following cell. The results are displayed by descending `Level` order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "username = 'nimfeo'  # <-- Modify this value\n",
    "languages_of_user(username, user_infos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='users_of_language'></a>\n",
    "### Users of a specific language"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell. You don't have to modify it unless you want to sort the final results by an attribute other than `Username`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def users_of_language(iso, user_frame):\n",
    "    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username')  # <-- Modify by=<value> to change the sort order\n",
    "    print(f\"{len(frame):,} users found.\")\n",
    "    return frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specify your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the next cell to obtain the list of all its speakers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "language = 'fra'  # <-- Modify this value\n",
    "users_of_language(language, user_infos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='natives_of_language'></a>\n",
    "### Natives of a specific language"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell (you don't have to modify it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def natives_of_language(iso, user_frame):\n",
    "    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username')  # <-- Modify by=<value> to change the sort order\n",
    "    frame = frame[frame['Level'] == '5']\n",
    "    print(f\"{len(frame):,} native users found.\")\n",
    "    return frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specify your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the following cells to obtain the list of all its native speakers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "language = 'fra'  # <-- Modify this value\n",
    "natives_of_language(language, user_infos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='natives_speaking'></a>\n",
    "### Natives of X speaking Y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell (you don't have to modify it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def natives_speaking_other(main_language, other_language, user_frame):\n",
    "    native_frame = user_frame[user_frame['Language'] == main_language]\n",
    "    native_users = native_frame[native_frame['Level'] == '5'].Username.tolist()\n",
    "    second_frame = user_frame[user_frame['Language'] == other_language]\n",
    "    second_users = second_frame.Username.tolist()\n",
    "    target_users = list(set(native_users).intersection(second_users))\n",
    "    result = user_frame[user_frame['Username'].isin(target_users) & user_frame['Language'].isin([main_language, other_language])].sort_values(by='Username')\n",
    "    print(f'{len(result) // 2:,} users found.')\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell will fetch users who are natives in `main_language` but also speak `other_language`.  \n",
    "Specify your target languages as 3-letter ISO codes (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "main_language = 'fra'  # <-- Modify this value\n",
    "other_language = 'eng'  # <-- Modify this value\n",
    "natives_speaking_other(main_language, other_language, user_infos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can get the list of usernames from any frame you built by appending `.Username.tolist()`. A list may be easier to export and work with. Try it for the natives of X speaking Y you built above:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "natives_speaking_other(main_language, other_language, user_infos).Username.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other Users information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='last_contribution'></a>\n",
    "### Date of the last contribution of some users"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell (you don't have to modify it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def last_contribution_date(users):\n",
    "    dates = []\n",
    "    for user in users:\n",
    "        response = req.get(f\"https://tatoeba.org/eng/contributions/of_user/{user}\")\n",
    "        soup = BeautifulSoup(response.text, features=\"html.parser\")\n",
    "        logs = soup.find(id=\"logs\")\n",
    "        p_tag = logs.find(\"p\")\n",
    "        if p_tag is None:\n",
    "            dates.append(\"Unknown\")\n",
    "        else:\n",
    "            text = p_tag.text\n",
    "            dates.append(text[text.find(',') + 2:])\n",
    "    frame = {\"Username\":users, 'Last contribution':dates} \n",
    "    return pd.DataFrame(frame)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the date of the last contributed sentence by specific users, replace the values in `users` by the username you're interested in.  \n",
    "Don't forget the brackets. For example, for only one user, write `users = ['usersname']`.  \n",
    "\n",
    "Note: The execution of the cell may take some time, especially if many usernames are given."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users = ['Ergulis', 'Abdellah', 'TRANG']  # <-- Modify these values\n",
    "last_contribution_date(users)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}