{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is focused on users. You can:\n", "- [Fetch all languages of a specific user](#languages_of_user)\n", "- [Get the list of users speaking a specific language](#users_of_language) \n", "- [Get the list of native speakers of a specific language](#natives_of_language)\n", "- [Get a list of native speakers of one language speaking another](#natives_speaking)\n", "- [Check the date of the last contribution of some users](#last_contribution)\n", "\n", "Before experimenting with any of the options described above, it is necessary to set and execute the cells under the [Languages of users](#languages_of_users) section.\n", "\n", "If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. You should see that cells that are running have an `In[*]` that will become `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In any case, to be able to use the notebook correctly, please run the two following cells first." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import csv\n", "import tarfile\n", "\n", "import requests as req\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', None) # To display full content of the column (pandas >= 1.0; older versions used -1)\n", "# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Languages of users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the two following cells (you don't have to modify them). \n", "\n", "Note that by default, we remove unknown users." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_user_languages():\n", "    data = pd.read_csv('./user_languages.csv',\n", "                       sep='\\t',\n", "                       header=None,\n", "                       names=['Language', 'Level', 'Username', 'Details'],\n", "                       quoting=csv.QUOTE_NONE)\n", "    # The next two lines remove unknown users\n", "    data = data[data['Username'] != r'\\N']\n", "    data = data.dropna(subset=['Username'])\n", "    return data.fillna('')\n", "\n", "user_infos = read_user_languages()\n", "# Each row is one (user, language) pair, so this counts entries rather than distinct users\n", "print(f\"{len(user_infos):,} entries found.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below displays 10 random rows, just to give you an overview of the structure of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "user_infos.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Languages of a specific user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell (you don't have to modify it)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def languages_of_user(username, user_frame):\n", "    return user_frame[user_frame['Username'] == username].sort_values(by='Level', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace the value of `username` with the username you want to check, and run the following cell. The results are displayed in descending order of `Level`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "username = 'nimfeo' # <-- Modify this value\n", "languages_of_user(username, user_infos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Users of a specific language" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell. You don't have to modify it unless you want to sort the final results by an attribute other than `Username`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def users_of_language(iso, user_frame):\n", "    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username') # <-- Modify by= to change the sort order\n", "    print(f\"{len(frame):,} users found.\")\n", "    return frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the next cell to obtain the list of all its speakers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "language = 'fra' # <-- Modify this value\n", "users_of_language(language, user_infos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Natives of a specific language" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell (you don't have to modify it)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def natives_of_language(iso, user_frame):\n", "    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username') # <-- Modify by= to change the sort order\n", "    # Levels are stored as strings; '5' is the level this notebook treats as native\n", "    frame = frame[frame['Level'] == '5']\n", "    print(f\"{len(frame):,} native users found.\")\n", "    return frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the following cell to obtain the list of all its native speakers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "language = 'fra' # <-- Modify this value\n", "natives_of_language(language, user_infos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Natives of X speaking Y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell (you don't have to modify it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def natives_speaking_other(main_language, other_language, user_frame):\n", "    native_frame = user_frame[user_frame['Language'] == main_language]\n", "    native_users = native_frame[native_frame['Level'] == '5'].Username.tolist()\n", "    second_frame = user_frame[user_frame['Language'] == other_language]\n", "    second_users = second_frame.Username.tolist()\n", "    target_users = list(set(native_users).intersection(second_users))\n", "    result = user_frame[user_frame['Username'].isin(target_users) & user_frame['Language'].isin([main_language, other_language])].sort_values(by='Username')\n", "    # Each matching user has two rows (one per language), hence the division by 2\n", "    print(f'{len(result) // 2:,} users found.')\n", "    return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell will fetch users who are natives in `main_language` but also speak `other_language`. \n",
"Specify your target languages as 3-letter ISO codes (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "main_language = 'fra' # <-- Modify this value\n", "other_language = 'eng' # <-- Modify this value\n", "natives_speaking_other(main_language, other_language, user_infos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can get the list of usernames from any frame you built by appending `.Username.tolist()`. A list may be easier to export and work with. Try it for the natives of X speaking Y frame you built above. Since that frame holds two rows per user, `.unique()` is added so that each username is listed once: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "natives_speaking_other(main_language, other_language, user_infos).Username.unique().tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other user information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Date of the last contribution of some users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell (you don't have to modify it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def last_contribution_date(users):\n", "    dates = []\n", "    for user in users:\n", "        response = req.get(f\"https://tatoeba.org/eng/contributions/of_user/{user}\")\n", "        soup = BeautifulSoup(response.text, features=\"html.parser\")\n", "        logs = soup.find(id=\"logs\")\n", "        # Assumes the first <p> inside the element with id 'logs' describes the latest contribution\n", "        p_tag = logs.find(\"p\") if logs is not None else None\n", "        if p_tag is None:\n", "            dates.append(\"Unknown\")\n", "        else:\n", "            text = p_tag.text\n", "            # Keep only the text that follows the first comma\n", "            dates.append(text[text.find(',') + 2:])\n", "    frame = {\"Username\": users, 'Last contribution': dates}\n", "    return pd.DataFrame(frame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the date of the last sentence contributed by specific users, replace the values in `users` with the usernames you're interested in. \n",
"Don't forget the brackets. For example, for only one user, write `users = ['username']`. \n", "\n", "Note: The cell sends one web request per username, so its execution may take some time, especially if many usernames are given." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users = ['Ergulis', 'Abdellah', 'TRANG'] # <-- Modify these values\n", "last_contribution_date(users)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }