{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Translation in *krain*\n", "\n", "As of v0.17.x, *ktrain* supports language translation. There are currently two different classes to support language translation. 1) `EnglishTranslator` for translation to English and 2) `Translator` for translations between many different languages. Both are wrappers around `MarianMT` models in the `transformers` library.\n", "\n", "## The `EnglishTranslator` Class for Translation to English\n", "The `EnglishTranslator` class can be used to easily convert text in various languages to English. It currently supports the following languages:\n", "\n", "- `zh` : Chinese (both Simplified and Traditional)\n", "- `ar` : Arabic\n", "- `ru` : Russian\n", "- `de` : German\n", "- `af` : Afrikaans\n", "- `fr` : French\n", "- `es` : Spanish\n", "- `it` : Italian\n", "- `pt` : Portuguese\n", "\n", "To demonstrate such translations, let us first begin by importing *ktrain*." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ktrain\n", "from ktrain.text.translation import EnglishTranslator, Translator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will translate the following sentence to English from a number of different source languages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> The pandemic has wreaked havoc on world economies. However, as of June 2020, the U.S. stock market continues to rise.\n", "\n", "To generate translations to English from a particuar language, we simply supply the language code for the language as the `src_lang` argument to `EnglishTranslator`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chinese to English" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The pandemic has caused serious damage to the world economy.\n", "However, the United States stock market continued to rise as of June 2020.\n" ] } ], "source": [ "translator = EnglishTranslator(src_lang='zh')\n", "src_text = '''大流行对世界经济造成了严重破坏。但是,截至2020年6月,美国股票市场持续上涨。'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Some comments about translations:\n", "Notice in the example above that we supplied a document of **two** sentences as input. The `translate` method can accept single sentences, paragraphs, or entire documents. However, if the document is large (e.g., a book), we recommend that you break it up into smaller chunks (e.g., pages or paragraphs). This is because *ktrain* tokenizes your document into individual sentences, which are joined together and fed to model as single batch when generating the translation. If the batch is too large for memory, errors will occur.\n", "\n", "When instantiating the `EnglishTranslator`, pretrained models are automatically loaded, which may take a few seconds. Once instantiated, the `translate` method can be repeatedly invoked on different documents or sentences. Next, let us reinstantiate an `EnglishTranslator` object to translate Arabic.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arabic to English" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The epidemic has devastated the world's economies.\n", "However, as of June 2020, the American stock market continued to rise.\n" ] } ], "source": [ "translator = EnglishTranslator(src_lang='ar')\n", "src_text = '''لقد أحدث الوباء دمارا في اقتصادات العالم.\n", "ومع ذلك ، اعتبارًا من يونيو 2020 ، استمرت سوق الأسهم الأمريكية في الارتفاع.\n", "'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Russian to English" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The pandemic has damaged the world economy.\n", "However, as of June 2020, the US stock market continues to grow.\n" ] } ], "source": [ "translator = EnglishTranslator(src_lang='ru')\n", "src_text = '''Пандемия нанесла ущерб мировой экономике.\n", "Однако по состоянию на июнь 2020 года фондовый рынок США продолжает расти.\n", "'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### German to English" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The pandemic has devastated the global economy.\n", "However, as of June 2020, the US stock market will continue to rise.\n" ] } ], "source": [ "translator = EnglishTranslator(src_lang='de')\n", "src_text = '''Die Pandemie hat die Weltwirtschaft verwüstet. \n", "Ab Juni 2020 steigt der US-Aktienmarkt jedoch weiter an.'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### French to English" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The pandemic has wreaked havoc in world economies.\n", "However, in June 2020, the U.S. stock market continues to climb.\n" ] } ], "source": [ "translator = EnglishTranslator(src_lang='fr')\n", "src_text = '''La pandémie a fait des ravages dans les économies mondiales. \n", "Cependant, en juin 2020, le marché boursier américain continue de grimper.'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Spanish to English" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The pandemic has wreaked havoc on world economies.\n", "However, from June 2020 onwards, the US securities market.\n", "USA.\n", "Continues to grow.\n" ] } ], "source": [ "translator = EnglishTranslator(src_lang='es')\n", "src_text = '''La pandemia ha causado estragos en las economías mundiales. \n", "Sin embargo, a partir de junio de 2020, el mercado de valores de EE. UU. Continúa aumentando.'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `Translator` Class for Translating to and from Many Languages\n", "\n", "For translations **from** and **to** other languages, `text.Translator`instances can be used. `Translator` instances accept as input a pretrained model from [Helsinki-NLP](https://huggingface.co/models?search=Helsinki-NLP%2Fopus-mt). For instance, to translate Chinese to German, one can use the [Helsinki-NLP/opus-mt-zh-de ](https://huggingface.co/Helsinki-NLP/opus-mt-zh-de) model:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Die Pandemie hat eine ernste Zerstörung der Weltwirtschaft verursacht.\n", "Aber bis Juni 2020 stieg der US-Markt weiter an.\n" ] } ], "source": [ "translator = Translator(model_name='Helsinki-NLP/opus-mt-zh-de')\n", "src_text = '''冠状病毒大流行对世界经济造成了严重破坏。但是,截至2020年6月,美国股市继续上涨。'''\n", "print(translator.translate(src_text))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }