{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is focused on users. You can:\n",
"- [Fetch all languages of a specific user](#languages_of_user)\n",
"- [Get the list of users speaking a specific language](#users_of_language) \n",
"- [Get the list of native speakers of a specific language](#natives_of_language)\n",
"- [Get a list of native speakers of one language speaking another](#natives_speaking)\n",
"- [Check the date of the last contribution of some users](#last_contribution)\n",
"\n",
"Before experimenting with any of the options described above, it is necessary to set and execute the cells under the [Languages of users](#languages_of_users) section.\n",
"\n",
"If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. You should see that cells that are running have an `In[*]` that will become `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In any case, to be able to use the notebook correctly, please run the two following cells first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import csv\n",
"import tarfile\n",
"\n",
"import requests as req\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_colwidth', -1) # To display full content of the column\n",
"# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Languages of users"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the two following cell (you don't have to modify it). \n",
"\n",
"Note that by default, we remove unknown users."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def read_user_languages():\n",
" data = pd.read_csv('./user_languages.csv', \n",
" sep='\\t', \n",
" header=None, \n",
" names=['Language', 'Level', 'Username', 'Details'],\n",
" quoting=csv.QUOTE_NONE)\n",
" # The next two lines remove unknown users\n",
" data = data[data['Username'] != r'\\N']\n",
" data = data.dropna(subset=['Username'])\n",
" return data.fillna('')\n",
"\n",
"user_infos = read_user_languages()\n",
"print(f\"{len(user_infos):,} users found.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below displays 10 random rows, just to give you an overview of the structure of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"user_infos.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Languages of a specific user"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell (you don't have to modify it)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def languages_of_user(username, user_frame):\n",
" return user_frame[user_frame['Username'] == username].sort_values(by='Level', ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace `username` by the username you want to check, and run the following cell. The results are displayed by descending `Level` order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"username = 'nimfeo' # <-- Modify this value\n",
"languages_of_user(username, user_infos)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Users of a specific language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell. You don't have to modify it unless you want to sort the final results by an attribute other than `Username`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def users_of_language(iso, user_frame):\n",
" frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username') # <-- Modify by= to change the sort order\n",
" print(f\"{len(frame):,} users found.\")\n",
" return frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the next cell to obtain the list of all its speakers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"language = 'fra' # <-- Modify this value\n",
"users_of_language(language, user_infos)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Natives of a specific language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell (you don't have to modify it)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def natives_of_language(iso, user_frame):\n",
" frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username') # <-- Modify by= to change the sort order\n",
" frame = frame[frame['Level'] == '5']\n",
" print(f\"{len(frame):,} native users found.\")\n",
" return frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the following cells to obtain the list of all its native speakers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"language = 'fra' # <-- Modify this value\n",
"natives_of_language(language, user_infos)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Natives of X speaking Y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell (you don't have to modify it)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def natives_speaking_other(main_language, other_language, user_frame):\n",
" native_frame = user_frame[user_frame['Language'] == main_language]\n",
" native_users = native_frame[native_frame['Level'] == '5'].Username.tolist()\n",
" second_frame = user_frame[user_frame['Language'] == other_language]\n",
" second_users = second_frame.Username.tolist()\n",
" target_users = list(set(native_users).intersection(second_users))\n",
" result = user_frame[user_frame['Username'].isin(target_users) & user_frame['Language'].isin([main_language, other_language])].sort_values(by='Username')\n",
" print(f'{len(result) // 2:,} users found.')\n",
" return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell will fetch users who are natives in `main_language` but also speak `other_language`. \n",
"Specify your target languages as 3-letter ISO codes (`cmn`, `fra`, `jpn`, `eng`, etc.) and run the cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"main_language = 'fra' # <-- Modify this value\n",
"other_language = 'eng' # <-- Modify this value\n",
"natives_speaking_other(main_language, other_language, user_infos)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get the list of usernames from any frame you built by appending `.Username.tolist()`. A list may be easier to export and work with. Try it for the natives of X speaking Y you built above: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"natives_speaking_other(main_language, other_language, user_infos).Username.tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other Users information"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Date of the last contribution of some users"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell (you don't have to modify it)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def last_contribution_date(users):\n",
" dates = []\n",
" for user in users:\n",
" response = req.get(f\"https://tatoeba.org/eng/contributions/of_user/{user}\")\n",
" soup = BeautifulSoup(response.text, features=\"html.parser\")\n",
" logs = soup.find(id=\"logs\")\n",
" p_tag = logs.find(\"p\")\n",
" if p_tag is None:\n",
" dates.append(\"Unknown\")\n",
" else:\n",
" text = p_tag.text\n",
" dates.append(text[text.find(',') + 2:])\n",
" frame = {\"Username\":users, 'Last contribution':dates} \n",
" return pd.DataFrame(frame)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check the date of the last contributed sentence by specific users, replace the values in `users` by the username you're interested in. \n",
"Don't forget the brackets. For example, for only one user, write `users = ['usersname']`. \n",
"\n",
"Note: The execution of the cell may take some time, especially if many usernames are given."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"users = ['Ergulis', 'Abdellah', 'TRANG'] # <-- Modify these values\n",
"last_contribution_date(users)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}