{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Translation in Python 3 with NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Version:** 1.1, January 2024"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Prerequisites:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U nltk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a brief introduction to the Machine Translation components in NLTK."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading an Aligned Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the *comtrans* module from *nltk.corpus*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.corpus import comtrans"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can load a word-level alignment corpus for English and French from the NLTK dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = comtrans.words(\"alignment-en-fr.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print out the words in the corpus as a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resumption\n",
      "of\n",
      "the\n",
      "session\n",
      "I\n",
      "declare\n",
      "resumed\n",
      "the\n",
      "session\n",
      "of\n",
      "the\n",
      "European\n",
      "Parliament\n",
      "adjourned\n",
      "on\n",
      "Friday\n",
      "17\n",
      "December\n",
      "1999\n",
      ",\n",
      "...\n"
     ]
    }
   ],
   "source": [
    "for word in words[:20]:\n",
    "    print(word)\n",
    "print(\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Access a word by index in the list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resumption\n"
     ]
    }
   ],
   "source": [
    "print(words[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can load the aligned sentences. Here we will load just one sentence, the firs one in the corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "als = comtrans.aligned_sents(\"alignment-en-fr.txt\")[0]\n",
    "als\n",
    "\n",
    "print(\" \".join(als.words))\n",
    "print(\" \".join(als.mots))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The alignments can be accessed via the *alignment* property:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "als.alignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can display the alignment using the *invert* function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "als.invert()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also create alignments directly using the NLTK translate module. We import the translation modules from NLTK:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.translate import Alignment, AlignedSent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create an alignment example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "als = AlignedSent( [\"Reprise\", \"de\", \"la\", \"session\" ], \\\n",
    "    [\"Resumption\", \"of\", \"the\", \"session\" ] , \\\n",
    "    Alignment( [ (0 , 0), (1 , 1), (2 , 2), (3 , 3) ] ) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Translating with IBM Model 1 in NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We already imported comtrans from NLTK in the code above. We have to import IBMModel1 from *nltk.translate*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.translate import IBMModel1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create an IBMModel1 using 20 iterations to run the learning algorithm using the first 10 sentences from the aligned corpus; see the EM explanation on the slides and the following publications:\n",
    "\n",
    "- Philipp Koehn. 2010. *Statistical Machine Translation*. Cambridge University Press, New York.\n",
    "\n",
    "- Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. *Computational Linguistics*, 19 (2), 263-311.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "com_ibm1 = IBMModel1(comtrans.aligned_sents()[:10], 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(round(com_ibm1.translation_table[\"bitte\"][\"Please\"], 3) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(round(com_ibm1.translation_table[\"Sitzungsperiode\"][\"session\"] , 3) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "latex_metadata": {
   "affiliation": "Indiana University, Bloomington, IN, USA",
   "author": "Damir Cavar",
   "title": "Machine Translation in Python 3 with NLTK"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": false,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}