{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Translation in Python 3 with NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U nltk\n", "!python -m nltk.downloader comtrans" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a brief introduction to the Machine Translation components in NLTK." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading an Aligned Corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the *comtrans* corpus reader from *nltk.corpus*." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import comtrans" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the words of the word-aligned English-French corpus from the NLTK dataset:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "words = comtrans.words(\"alignment-en-fr.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the first 20 words of the corpus, one per line:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Resumption\n", "of\n", "the\n", "session\n", "I\n", "declare\n", "resumed\n", "the\n", "session\n", "of\n", "the\n", "European\n", "Parliament\n", "adjourned\n", "on\n", "Friday\n", "17\n", "December\n", "1999\n", ",\n", "...\n" ] } ], "source": [ "for word in words[:20]:\n", "    print(word)\n", "print(\"...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access a word by its index in the list:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Resumption\n" ] } ], "source": [ "print(words[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the aligned sentences. 
Here we will load just one sentence, the first one in the corpus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "als = comtrans.aligned_sents(\"alignment-en-fr.txt\")[0]\n", "print(als)\n", "\n", "print(\" \".join(als.words))\n", "print(\" \".join(als.mots))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The alignments can be accessed via the *alignment* property:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "als.alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can reverse the direction of the aligned sentence pair using the *invert* method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "als.invert()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create alignments directly using the NLTK translate module. We import the *Alignment* and *AlignedSent* classes from *nltk.translate*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.translate import Alignment, AlignedSent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create an aligned sentence pair by hand:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "als = AlignedSent([\"Reprise\", \"de\", \"la\", \"session\"],\n", "                 [\"Resumption\", \"of\", \"the\", \"session\"],\n", "                 Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Translating with IBM Model 1 in NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We already imported comtrans from NLTK in the code above. 
We have to import IBMModel1 from *nltk.translate*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.translate import IBMModel1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can train an IBMModel1 on the first 10 sentence pairs of the aligned corpus, running 100 iterations of the EM learning algorithm; see the EM explanation on the slides and the following publications:\n", "\n", "- Philipp Koehn. 2010. *Statistical Machine Translation*. Cambridge University Press, New York.\n", "\n", "- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. *Computational Linguistics*, 19 (2), 263-311.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "com_ibm1 = IBMModel1(comtrans.aligned_sents()[:10], 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *translation_table* of the trained model holds the lexical translation probabilities, indexed first by the target word (here German) and then by the source word (here English):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(round(com_ibm1.translation_table[\"bitte\"][\"Please\"], 3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(round(com_ibm1.translation_table[\"Sitzungsperiode\"][\"session\"], 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "latex_metadata": { "affiliation": "Indiana University, Bloomington, IN, 
USA", "author": "Damir Cavar", "title": "Machine Translation in Python 3 with NLTK" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }