{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transliteration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transliteration is the conversion of a text from one script to another.\n",
    "For instance, a Latin transliteration of the Greek phrase \"Ελληνική Δημοκρατία\", usually translated as 'Hellenic Republic', is \"Ellēnikḗ Dēmokratía\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.transliteration import Transliterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Language Coverage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  1. Haitian; Haitian Creole    2. Tamil                      3. Vietnamese               \n",
      "  4. Telugu                     5. Croatian                   6. Hungarian                \n",
      "  7. Thai                       8. Kannada                    9. Tagalog                  \n",
      " 10. Armenian                  11. Hebrew (modern)           12. Turkish                  \n",
      " 13. Portuguese                14. Belarusian                15. Norwegian Nynorsk        \n",
      " 16. Norwegian                 17. Dutch                     18. Japanese                 \n",
      " 19. Albanian                  20. Bulgarian                 21. Serbian                  \n",
      " 22. Swahili                   23. Swedish                   24. French                   \n",
      " 25. Latin                     26. Czech                     27. Yiddish                  \n",
      " 28. Hindi                     29. Danish                    30. Finnish                  \n",
      " 31. German                    32. Bosnian-Croatian-Serbian  33. Slovak                   \n",
      " 34. Persian                   35. Lithuanian                36. Slovene                  \n",
      " 37. Latvian                   38. Bosnian                   39. Gujarati                 \n",
      " 40. Italian                   41. Icelandic                 42. Spanish; Castilian       \n",
      " 43. Ukrainian                 44. Georgian                  45. Urdu                     \n",
      " 46. Indonesian                47. Marathi (Marāṭhī)         48. Korean                   \n",
      " 49. Galician                  50. Khmer                     51. Catalan; Valencian       \n",
      " 52. Romanian, Moldavian, ...  53. Basque                    54. Macedonian               \n",
      " 55. Russian                   56. Azerbaijani               57. Chinese                  \n",
      " 58. Estonian                  59. Welsh                     60. Arabic                   \n",
      " 61. Bengali                   62. Amharic                   63. Irish                    \n",
      " 64. Malay                     65. Afrikaans                 66. Polish                   \n",
      " 67. Greek, Modern             68. Esperanto                 69. Maltese                  \n",
      "\n"
     ]
    }
   ],
   "source": [
    "from polyglot.downloader import downloader\n",
    "print(downloader.supported_languages_table(\"transliteration2\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Downloading Necessary Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[polyglot_data] Downloading package embeddings2.en to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package embeddings2.en is already up-to-date!\n",
      "[polyglot_data] Downloading package transliteration2.ar to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package transliteration2.ar is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "polyglot download embeddings2.en transliteration2.ar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We transliterate each word of an English sentence into Arabic script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.text import Text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "blob = \"\"\"We will meet at eight o'clock on Thursday morning.\"\"\"\n",
    "text = Text(blob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
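    "We can iterate over the transliteration of each word. The argument to `transliterate` is the target language code, here `ar` for Arabic."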
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "وي\n",
      "ويل\n",
      "ميت\n",
      "ات\n",
      "ييايت\n",
      "أوكلوك\n",
      "ون\n",
      "ثورسداي\n",
      "مورنينغ\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for x in text.transliterate(\"ar\"):\n",
    "  print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Command Line Interface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "which           ويكه            \r\n",
      "India           ينديا           \r\n",
      "beat            بيت             \r\n",
      "Bermuda         بيرمودا         \r\n",
      "in              ين              \r\n",
      "Port            بورت            \r\n",
      "of              وف              \r\n",
      "Spain           سباين           \r\n",
      "in              ين              \r\n",
      "2007                            \r\n",
      ",                               \r\n",
      "which           ويكه            \r\n",
      "was             واس             \r\n",
      "equalled        يكالليد         \r\n",
      "five            فيفي            \r\n",
      "days            دايس            \r\n",
      "ago             اغو             \r\n",
      "by              بي              \r\n",
      "South           سووث            \r\n",
      "Africa          افريكا          \r\n",
      "in              ين              \r\n",
      "their           ثير             \r\n",
      "victory         فيكتوري         \r\n",
      "over            وفير            \r\n",
      "West            ويست            \r\n",
      "Indies          يندييس          \r\n",
      "in              ين              \r\n",
      "Sydney          سيدني           \r\n",
      ".                               \r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en transliteration --target ar | tail -n 30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Citation\n",
    "\n",
    "\n",
    "This work is a direct implementation of the research described in\n",
    "the [False-Friend Detection and Entity Matching via Unsupervised Transliteration](https://arxiv.org/abs/1611.06722) paper. The author of this library strongly encourages you to cite the following paper if you\n",
    "are using this software.\n",
    "\n",
    "```\n",
    "       @article{chen2016false,\n",
    "       title = {False-Friend Detection and Entity Matching via Unsupervised Transliteration},\n",
    "       author = {Chen, Yanqing and Skiena, Steven},\n",
    "       journal = {arXiv preprint arXiv:1611.06722},\n",
    "       year = {2016}\n",
    "       }\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}