{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transliteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transliteration is the conversion of a text from one script to another.\n", "For instance, a Latin transliteration of the Greek phrase \"Ελληνική Δημοκρατία\", usually translated as 'Hellenic Republic', is \"Ellēnikḗ Dēmokratía\"." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.transliteration import Transliterator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Languages Coverage" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1. Haitian; Haitian Creole 2. Tamil 3. Vietnamese \n", " 4. Telugu 5. Croatian 6. Hungarian \n", " 7. Thai 8. Kannada 9. Tagalog \n", " 10. Armenian 11. Hebrew (modern) 12. Turkish \n", " 13. Portuguese 14. Belarusian 15. Norwegian Nynorsk \n", " 16. Norwegian 17. Dutch 18. Japanese \n", " 19. Albanian 20. Bulgarian 21. Serbian \n", " 22. Swahili 23. Swedish 24. French \n", " 25. Latin 26. Czech 27. Yiddish \n", " 28. Hindi 29. Danish 30. Finnish \n", " 31. German 32. Bosnian-Croatian-Serbian 33. Slovak \n", " 34. Persian 35. Lithuanian 36. Slovene \n", " 37. Latvian 38. Bosnian 39. Gujarati \n", " 40. Italian 41. Icelandic 42. Spanish; Castilian \n", " 43. Ukrainian 44. Georgian 45. Urdu \n", " 46. Indonesian 47. Marathi (Marāṭhī) 48. Korean \n", " 49. Galician 50. Khmer 51. Catalan; Valencian \n", " 52. Romanian, Moldavian, ... 53. Basque 54. Macedonian \n", " 55. Russian 56. Azerbaijani 57. Chinese \n", " 58. Estonian 59. Welsh 60. Arabic \n", " 61. Bengali 62. Amharic 63. Irish \n", " 64. Malay 65. Afrikaans 66. Polish \n", " 67. Greek, Modern 68. Esperanto 69. Maltese \n", "\n" ] } ], "source": [ "from polyglot.downloader import downloader\n", "print(downloader.supported_languages_table(\"transliteration2\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Downloading Necessary Models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[polyglot_data] Downloading package embeddings2.en to\n", "[polyglot_data] /home/rmyeid/polyglot_data...\n", "[polyglot_data] Package embeddings2.en is already up-to-date!\n", "[polyglot_data] Downloading package transliteration2.ar to\n", "[polyglot_data] /home/rmyeid/polyglot_data...\n", "[polyglot_data] Package transliteration2.ar is already up-to-date!\n" ] } ], "source": [ "%%bash\n", "polyglot download embeddings2.en transliteration2.ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We tag each word in the text with one part of speech." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.text import Text" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "blob = \"\"\"We will meet at eight o'clock on Thursday morning.\"\"\"\n", "text = Text(blob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query all the tagged words" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "وي\n", "ويل\n", "ميت\n", "ات\n", "ييايت\n", "أوكلوك\n", "ون\n", "ثورسداي\n", "مورنينغ\n", "\n" ] } ], "source": [ "for x in text.transliterate(\"ar\"):\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Command Line Interface" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "which ويكه \r\n", "India ينديا \r\n", "beat بيت \r\n", "Bermuda بيرمودا \r\n", "in ين \r\n", "Port بورت \r\n", "of وف \r\n", "Spain سباين \r\n", "in ين \r\n", "2007 \r\n", ", \r\n", "which ويكه \r\n", "was واس \r\n", "equalled يكالليد \r\n", "five فيفي \r\n", "days دايس \r\n", "ago اغو \r\n", "by بي \r\n", "South سووث \r\n", "Africa افريكا \r\n", "in ين \r\n", "their ثير \r\n", "victory فيكتوري \r\n", "over وفير \r\n", "West ويست \r\n", "Indies يندييس \r\n", "in ين \r\n", "Sydney سيدني \r\n", ". \r\n", "\r\n" ] } ], "source": [ "!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en transliteration --target ar | tail -n 30" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Citation\n", "\n", "\n", "This work is a direct implementation of the research being described in\n", "the [False-Friend Detection and Entity Matching via Unsupervised Transliteration](https://arxiv.org/abs/1611.06722) paper. The author of this library strongly encourage you to cite the following paper if you\n", "are using this software.\n", "\n", "```\n", " @article{chen2016false,\n", " title = {False-Friend Detection and Entity Matching via Unsupervised Transliteration},\n", " author = {Chen, Yanqing and Skiena, Steven},\n", " journal = {arXiv preprint arXiv:1611.06722},\n", " year = {2016}\n", " }\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }