{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook focuses on users' sentences and audio contributions. You can:\n", "- [Filter sentences by their owner](#users_sentences), including or excluding specific users.\n", "- [Check audio contributions](#audio) and fetch sentences with or without audio.\n", "\n", "Before trying either of the options above, you need to run the cells in the [Read sentences](#read_sentences) section.\n", "\n", "If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. Cells that are running show an `In[*]` that becomes `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button in the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In any case, to be able to use the notebook correctly, please run the two following cells first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import csv\n", "import tarfile" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', None) # To display the full content of the column (pandas >= 1.0; the older value -1 is deprecated)\n", "# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Read sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading all sentences takes a long time, so let's split the process into two steps. You only need to run the two following cells once."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2\n", "def read_sentences_file():\n", "    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:\n", "        csv_path = tar.getnames()[0]\n", "        sentences = pd.read_csv(tar.extractfile(csv_path),\n", "                                sep='\\t',\n", "                                header=None,\n", "                                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],\n", "                                quoting=csv.QUOTE_NONE)\n", "    print(f\"{len(sentences):,} sentences fetched.\")\n", "    return sentences" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "all_sentences = read_sentences_file()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.\n", "\n", "Note that by default, we get rid of the `ISO` (that is, ISO 639 three-letter language code), `Date added`, and `Date last modified` columns. 
\n", "If you need any of these columns, comment out the corresponding `del` line in the next cell by adding a `#` at its beginning.\n", "\n", "Then run the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sentences_of_language(sentences, language):\n", "    # Work on a copy so the edits below do not trigger SettingWithCopyWarning\n", "    target_sentences = sentences[sentences['ISO'] == language].copy()\n", "    del target_sentences['Date added']\n", "    del target_sentences['Date last modified']\n", "    del target_sentences['ISO']\n", "    target_sentences = target_sentences.set_index(\"sentenceID\")\n", "    # Unknown users (stored as \\N in the file) have their names set to an empty string\n", "    target_sentences.loc[target_sentences['Username'] == r'\\N', 'Username'] = ''\n", "    print(f\"{len(target_sentences):,} sentences fetched.\")\n", "    return target_sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose your target `language` as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.), and run the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "language = 'eng' # <-- Modify this value\n", "sentences = sentences_of_language(all_sentences, language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the variable `sentences` contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences from your set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "sentences.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, you can see three columns: `sentenceID`, `Text`, and `Username`. `sentenceID` is the same as on Tatoeba, so you can easily find the sentence's page there."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Filter sentences by their owner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As its name indicates, you can use this section to fetch sentences that belong, or do not belong, to specific users.\n", "\n", "Run the following cell (you don't have to modify it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_sentences_of_users(sentences, users_included, users_excluded):\n", "    if not users_included: # Empty list: include everybody\n", "        target = sentences[~sentences['Username'].isin(users_excluded)]\n", "    else:\n", "        target = sentences[sentences['Username'].isin(users_included) & ~sentences['Username'].isin(users_excluded)]\n", "    print(f\"{len(target):,} sentences fetched.\")\n", "    return target.sort_values(by='Username') # Modify this to change the sort order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify the users to include or exclude in the next cell. Users to include go into the `users_to_include` variable and users to exclude go into the `users_to_exclude` variable. The syntax is a bit special here. If you want to do an \"include\" search, set `users_to_exclude = []`. If you want to do an \"exclude\" search, set `users_to_include = []`. For example:\n", "- `users_to_include = ['user1', 'user2'], users_to_exclude = []` will fetch every sentence belonging to user1 or user2.\n", "- `users_to_include = [], users_to_exclude = ['userA', 'userB']` will fetch every sentence that belongs to neither userA nor userB.\n", "\n", "By running the cell, you will fetch the matching sentences in the language you set above."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users_to_include = ['AlanF_US', 'CK'] # <-- Modify these values\n", "users_to_exclude = [] # <-- Modify these values\n", "user_sentences = get_sentences_of_users(sentences, users_to_include, users_to_exclude)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell displays a small sample of the sentences you fetched, just to verify that everything looks fine." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "user_sentences.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Audio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, you can filter your set of sentences to fetch only the ones that have audio (or the ones that don't). Note that you need to have prepared the `user_sentences` variable in one of the previous sections.\n", "\n", "Run the following cell (you don't have to modify it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_audio_file():\n", "    with tarfile.open('./sentences_with_audio.tar.bz2', 'r:*') as tar:\n", "        csv_path = tar.getnames()[0]\n", "        sentences = pd.read_csv(tar.extractfile(csv_path),\n", "                                sep='\\t',\n", "                                header=None,\n", "                                names=['sentenceID', 'Username', 'License', 'Attribution URL'],\n", "                                quoting=csv.QUOTE_NONE)\n", "    del sentences['License']\n", "    del sentences['Attribution URL']\n", "    return sentences\n", "\n", "sentences_with_audio = read_audio_file()\n", "print(f'{len(sentences_with_audio):,} sentences have audio.')\n", "audio_ids = sentences_with_audio['sentenceID'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, all sentences that have audio should be inside the `sentences_with_audio` variable. You can quickly check whether `sentences_with_audio` looks OK by running the following cell."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences_with_audio.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following section allows you to check which sentences of your current `user_sentences` set have or do not have audio. This is where you need to have run the cells of one of the previous sections.\n", "\n", "Note that `user_sentences` was filtered by the sentence owner, not the audio contributor.\n", "\n", "Run the following cell (you don't have to modify it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def subset_with_audio(sentences, audio_ids):\n", "    # Use the `sentences` argument rather than the global `user_sentences`\n", "    target = sentences[sentences.index.isin(audio_ids)]\n", "    print(f\"{len(target):,} sentences with audio fetched.\")\n", "    return target\n", "\n", "def subset_without_audio(sentences, audio_ids):\n", "    target = sentences[~sentences.index.isin(audio_ids)]\n", "    print(f\"{len(target):,} sentences without audio fetched.\")\n", "    return target\n", "\n", "user_sentences_with_audio = subset_with_audio(user_sentences, audio_ids)\n", "user_sentences_without_audio = subset_without_audio(user_sentences, audio_ids)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, `user_sentences_with_audio` and `user_sentences_without_audio` contain the sentences with and without audio belonging to your current set. You can fetch them and play with them. \n", "That way, you can extract the sentences you want to record in the way you want: in order, at random, containing specific words, etc. (You may need to have a look at other notebooks, or get your hands dirty to achieve what you want, though). \n", "\n", "Notice also that taking sentences at random may require a bit more management if you want to do it often. Because the sampling is random, you may get the same sentences several times. 
Using files, copy-pasting directly, etc.\n", "\n", "As a side note, by default, a maximum of 60 rows will be displayed (the first 30 and the last 30). If you want to display more, you can use \n", "```\n", "pd.set_option('display.max_rows', n)\n", "```\n", "where n is the maximum number of rows you want to display.\n", "\n", "Slicing is often better than displaying everything, but in this particular case, we may need to display one or two hundred sentences so...\n", "\n", "As an illustration, here is how to take the first 50 sentences of `user_sentences_without_audio`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "user_sentences_without_audio[:50]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "50 sentences at random:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "user_sentences_without_audio.sample(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More filtering, limiting to one user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you created a set with sentences belonging to several users, and created the `user_sentences_without_audio` set above. Now, if you want to filter the set down to only one user, of course you can go back to the [Filter sentences by their owner](#users_sentences) section and do everything again. However, there is a simpler way!\n", "\n", "The sentences without audio owned by a specific user can be fetched in the following manner. Here, we include only the sentences belonging to the user CK."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "username = 'CK' # <-- Modify this value\n", "one_user_no_audio = user_sentences_without_audio[user_sentences_without_audio['Username'] == username]\n", "print(f'{one_user_no_audio.shape[0]} sentences without audio belonging to {username}.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, you can browse the `one_user_no_audio` set the same way as before. For example, for `sample_size` random sentences, run the following." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "sample_size = 50 # <-- Modify this value\n", "if sample_size > len(one_user_no_audio):\n", "    print(f'Warning: Your specified sample size ({sample_size}) is larger than '\n", "          f'the number of items that match ({len(one_user_no_audio)}).\\n'\n", "          'Printing all items that match.')\n", "    sample_size = len(one_user_no_audio)\n", "one_user_no_audio.sample(sample_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more advanced ways of fetching sentences, check the other notebooks and add code below!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }