{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, you can: \n",
    "- [Fetch sentences that do not have final punctuation](#no_punctuation)\n",
    "- [Fetch sentences that do not start with a capital letter](#no_capital)\n",
    "\n",
    "First of all, please run the cells under the [Read sentences section](#read_sentences).\n",
    "\n",
    "\n",
    "If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. You should see that cells that are running have an `In[*]` that will become `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In any case, to be able to use the notebook correctly, please run the two following cells first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import csv\n",
    "import tarfile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_colwidth', -1) # To display full content of the column\n",
    "# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='read_sentences'></a>\n",
    "# Read sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading all sentences takes a long time so let's split the process into two steps. You only need to run the two following cells once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2\n",
    "def read_sentences_file():\n",
    "    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:\n",
    "        csv_path = tar.getnames()[0]\n",
    "        sentences = pd.read_csv(tar.extractfile(csv_path), \n",
    "                sep='\\t', \n",
    "                header=None, \n",
    "                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],\n",
    "                quoting=csv.QUOTE_NONE)\n",
    "        print(f\"{len(sentences):,} sentences fetched.\")\n",
    "        return sentences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_sentences = read_sentences_file()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, you can fetch sentences of a specific language using the following cells. If you want to change you target language, you can start again from here.\n",
    "\n",
    "Note that by default, we get rid of the `ISO` (that is, ISO 639 three-letter language code), `Date added`, `Date last modified`, and `Username` columns.  \n",
    "If you need any of these columns, you can comment out the lines you need by adding a `#` at the beginning of the corresponding lines of the next cell.\n",
    "\n",
    "So run the following cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sentences_of_language(sentences, language):\n",
    "    target_sentences = sentences[sentences['ISO'] == language]\n",
    "    del target_sentences['Date added']\n",
    "    del target_sentences['Date last modified']\n",
    "    del target_sentences['ISO']\n",
    "    del target_sentences['Username']\n",
    "    target_sentences = target_sentences.set_index(\"sentenceID\")\n",
    "    print(f\"{len(target_sentences):,} sentences fetched.\")\n",
    "    return target_sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choose your target `language` as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.), and run the next one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "language = 'fra'\n",
    "sentences = sentences_of_language(all_sentences, language)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, the variable `sentences` contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences in your set, just for a quick check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sentences.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='no_punctuation'></a>\n",
    "# Sentences without punctuation at the end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the sentences that do not have final punctuation, we first need to define correct final punctuation signs.\n",
    "\n",
    "You can use one of the list provided in the cell below or define a list of all the correct punctuations that can be found in your target language.  If you define it yourself, be careful to use the same format.\n",
    "\n",
    "So make sure that one variable has all the signs you need and run the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Punctuation I expect to find at the end of French sentences\n",
    "french_end_punctuation = ('!', '?', '.', '»', '…')\n",
    "# Punctuation I expect to find at the end of English sentences\n",
    "english_end_punctuation = ('!', '?', '.', '\"', '…')\n",
    "# Punctuation I expect to find at the end of Japanese sentences\n",
    "japanese_end_punctuation = ('！', '!', '？', '?', '。', '」')\n",
    "# Punctuation I expect to find at the end of English sentences\n",
    "german_end_punctuation = ('!', '?', '.', '“', '…')\n",
    "# Punctuation I expect to find at the end of Esperanto sentences\n",
    "esperanto_end_punctuation = ('!', '?', '.', '\"', '…')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set `end_punctuation` to the list you need and run the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "end_punctuation = french_end_punctuation  # Replace by the one you need\n",
    "no_punc = sentences[~sentences['Text'].str.endswith(end_punctuation)]\n",
    "print(f'{len(no_punc):,} sentences found.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`no_punc` contains the list of sentences not ending by any of the characters you specified earlier. Note that if a sentence seems to end correctly, it is probably because there exists a space after the punctuation symbol."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "no_punc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You may have noticed that if `no_punc` is too long, only the first and last 30 rows are displayed. It is better to avoid displaying its entire content at once, but you can still explore it by slices. \n",
    "\n",
    "`no_punc[n:m]` will give you the sentences between the n-th and the m-th (excluded)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "no_punc[15:50]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='no_capital'></a>\n",
    "# Sentences that do not start with a capital letter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to check sentences that do not start with a capital letter, simply run the following cells. You do not need to modify anything this time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "no_capital = sentences[[x[0].islower() for x in sentences['Text']]]\n",
    "print(f'{len(no_capital):,} sentences found.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`no_capital` contains the list of sentences found in your current set of sentences. Run the following cell to display them.\n",
    "\n",
    "Note: You may find sentences that look like they start by a capital I (i). That is a font issue. The sentence actually starts by a lower l (L)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "no_capital"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}