{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Morphological Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polyglot offers trained [Morfessor models](http://www.cis.hut.fi/cis/projects/morpho/) that split words into morphemes.\n", "The goal of the Morpho project is to develop unsupervised data-driven methods that discover the regularities behind word formation in natural languages. In particular, the Morpho project focuses on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Language Coverage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using polyglot vocabulary dictionaries, we trained Morfessor models on the 50,000 most frequent words of each language." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1. Piedmontese language 2. Lombard language 3. Gan Chinese \n", " 4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz \n", " 7. Pashto, Pushto 8. Kurdish 9. Portuguese \n", " 10. Kannada 11. Korean 12. Khmer \n", " 13. Kazakh 14. Ilokano 15. Polish \n", " 16. Panjabi, Punjabi 17. Georgian 18. Chuvash \n", " 19. Alemannic 20. Czech 21. Welsh \n", " 22. Chechen 23. Catalan; Valencian 24. Northern Sami \n", " 25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese \n", " 28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian \n", " 31. Swedish 32. Swahili 33. Sundanese \n", " 34. Serbian 35. Albanian 36. Japanese \n", " 37. Western Frisian 38. French 39. Finnish \n", " 40. Upper Sorbian 41. Faroese 42. Persian \n", " 43. Sinhala, Sinhalese 44. Italian 45. Amharic \n", " 46. Aragonese 47. 
Volapük 48. Icelandic \n", " 49. Sakha 50. Afrikaans 51. Indonesian \n", " 52. Interlingua 53. Azerbaijani 54. Ido \n", " 55. Arabic 56. Assamese 57. Yoruba \n", " 58. Yiddish 59. Waray-Waray 60. Croatian \n", " 61. Hungarian 62. Haitian; Haitian Creole 63. Quechua \n", " 64. Armenian 65. Hebrew (modern) 66. Silesian \n", " 67. Hindi 68. Divehi; Dhivehi; Mald... 69. German \n", " 70. Danish 71. Occitan 72. Tagalog \n", " 73. Turkmen 74. Thai 75. Tajik \n", " 76. Greek, Modern 77. Telugu 78. Tamil \n", " 79. Oriya 80. Ossetian, Ossetic 81. Tatar \n", " 82. Turkish 83. Kapampangan 84. Venetian \n", " 85. Manx 86. Gujarati 87. Galician \n", " 88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali \n", " 91. Cebuano 92. Zazaki 93. Walloon \n", " 94. Dutch 95. Norwegian 96. Norwegian Nynorsk \n", " 97. West Flemish 98. Chinese 99. Bosnian \n", "100. Breton 101. Belarusian 102. Bulgarian \n", "103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib... \n", "106. Bengali 107. Burmese 108. Romansh \n", "109. Marathi (Marāṭhī) 110. Malay 111. Maltese \n", "112. Russian 113. Macedonian 114. Malayalam \n", "115. Mongolian 116. Malagasy 117. Vietnamese \n", "118. Spanish; Castilian 119. Estonian 120. Basque \n", "121. Bishnupriya Manipuri 122. Asturian 123. English \n", "124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin \n", "127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan... \n", "130. Latvian 131. Urdu 132. Lithuanian \n", "133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ... 
\n", "\n" ] } ], "source": [ "from polyglot.downloader import downloader\n", "print(downloader.supported_languages_table(\"morph2\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download Necessary Models" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[polyglot_data] Downloading package morph2.en to\n", "[polyglot_data] /home/rmyeid/polyglot_data...\n", "[polyglot_data] Package morph2.en is already up-to-date!\n", "[polyglot_data] Downloading package morph2.ar to\n", "[polyglot_data] /home/rmyeid/polyglot_data...\n", "[polyglot_data] Package morph2.ar is already up-to-date!\n" ] } ], "source": [ "%%bash\n", "polyglot download morph2.en morph2.ar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example\n", "\n", "### Word Segmentation" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.text import Text, Word" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preprocessing ['pre', 'process', 'ing']\n", "processor ['process', 'or']\n", "invaluable ['in', 'valuable']\n", "thankful ['thank', 'ful']\n", "crossed ['cross', 'ed']\n" ] } ], "source": [ "words = [\"preprocessing\", \"processor\", \"invaluable\", \"thankful\", \"crossed\"]\n", "# Segment each word into morphemes using the English model.\n", "for w in words:\n", "    w = Word(w, language=\"en\")\n", "    print(\"{:<20}{}\".format(w, w.morphemes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence Segmentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the text is not tokenized properly, morphological analysis can offer a smart way of splitting the text into its original units. 
Here is an example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "blob = \"Wewillmeettoday.\"\n", "text = Text(blob)\n", "text.language = \"en\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text.morphemes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Command Line Interface" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "which which\r\n", "India In_dia\r\n", "beat beat \r\n", "Bermuda Ber_mud_a\r\n", "in in \r\n", "Port Port \r\n", "of of \r\n", "Spain Spa_in\r\n", "in in \r\n", "2007 2007 \r\n", ", , \r\n", "which which\r\n", "was wa_s \r\n", "equalled equal_led\r\n", "five five \r\n", "days day_s\r\n", "ago ago \r\n", "by by \r\n", "South South\r\n", "Africa Africa\r\n", "in in \r\n", "their t_heir\r\n", "victory victor_y\r\n", "over over \r\n", "West West \r\n", "Indies In_dies\r\n", "in in \r\n", "Sydney Syd_ney\r\n", ". . 
\r\n", "\r\n" ] } ], "source": [ "!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This demo does not reflect the models supplied by polyglot; however, we think it is indicative of what you should expect from Morfessor.\n", "\n", "[Demo](http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Citation\n", "\n", "This is an interface to the implementation described in the [Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline](https://aaltodoc.aalto.fi/bitstream/handle/123456789/11836/isbn9789526055015.pdf?sequence=1) technical report." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "```\n", "@InProceedings{morfessor2,\n", " title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},\n", " author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},\n", " year = {2013},\n", " publisher = {Department of Signal Processing and Acoustics, Aalto University},\n", " booktitle = {Aalto University publication series}\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "- [Morpho project](http://www.cis.hut.fi/cis/projects/morpho/)\n", "- [Background information on morpheme discovery](http://www.cis.hut.fi/cis/projects/morpho/problem.shtml)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }