{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Morphological Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Polyglot offers trained [morfessor models](http://www.cis.hut.fi/cis/projects/morpho/) to generate morphemes from words.\n",
    "The goal of the Morpho project is to develop unsupervised data-driven methods that discover the regularities behind word forming in natural languages. In particular, Morpho project is focussing on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Languages Coverage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using polyglot vocabulary dictionaries, we trained morfessor models on the most frequent words 50,000 words of each language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  1. Piedmontese language       2. Lombard language           3. Gan Chinese              \n",
      "  4. Sicilian                   5. Scots                      6. Kirghiz, Kyrgyz          \n",
      "  7. Pashto, Pushto             8. Kurdish                    9. Portuguese               \n",
      " 10. Kannada                   11. Korean                    12. Khmer                    \n",
      " 13. Kazakh                    14. Ilokano                   15. Polish                   \n",
      " 16. Panjabi, Punjabi          17. Georgian                  18. Chuvash                  \n",
      " 19. Alemannic                 20. Czech                     21. Welsh                    \n",
      " 22. Chechen                   23. Catalan; Valencian        24. Northern Sami            \n",
      " 25. Sanskrit (Saṁskṛta)       26. Slovene                   27. Javanese                 \n",
      " 28. Slovak                    29. Bosnian-Croatian-Serbian  30. Bavarian                 \n",
      " 31. Swedish                   32. Swahili                   33. Sundanese                \n",
      " 34. Serbian                   35. Albanian                  36. Japanese                 \n",
      " 37. Western Frisian           38. French                    39. Finnish                  \n",
      " 40. Upper Sorbian             41. Faroese                   42. Persian                  \n",
      " 43. Sinhala, Sinhalese        44. Italian                   45. Amharic                  \n",
      " 46. Aragonese                 47. Volapük                   48. Icelandic                \n",
      " 49. Sakha                     50. Afrikaans                 51. Indonesian               \n",
      " 52. Interlingua               53. Azerbaijani               54. Ido                      \n",
      " 55. Arabic                    56. Assamese                  57. Yoruba                   \n",
      " 58. Yiddish                   59. Waray-Waray               60. Croatian                 \n",
      " 61. Hungarian                 62. Haitian; Haitian Creole   63. Quechua                  \n",
      " 64. Armenian                  65. Hebrew (modern)           66. Silesian                 \n",
      " 67. Hindi                     68. Divehi; Dhivehi; Mald...  69. German                   \n",
      " 70. Danish                    71. Occitan                   72. Tagalog                  \n",
      " 73. Turkmen                   74. Thai                      75. Tajik                    \n",
      " 76. Greek, Modern             77. Telugu                    78. Tamil                    \n",
      " 79. Oriya                     80. Ossetian, Ossetic         81. Tatar                    \n",
      " 82. Turkish                   83. Kapampangan               84. Venetian                 \n",
      " 85. Manx                      86. Gujarati                  87. Galician                 \n",
      " 88. Irish                     89. Scottish Gaelic; Gaelic   90. Nepali                   \n",
      " 91. Cebuano                   92. Zazaki                    93. Walloon                  \n",
      " 94. Dutch                     95. Norwegian                 96. Norwegian Nynorsk        \n",
      " 97. West Flemish              98. Chinese                   99. Bosnian                  \n",
      "100. Breton                   101. Belarusian               102. Bulgarian                \n",
      "103. Bashkir                  104. Egyptian Arabic          105. Tibetan Standard, Tib... \n",
      "106. Bengali                  107. Burmese                  108. Romansh                  \n",
      "109. Marathi (Marāṭhī)        110. Malay                    111. Maltese                  \n",
      "112. Russian                  113. Macedonian               114. Malayalam                \n",
      "115. Mongolian                116. Malagasy                 117. Vietnamese               \n",
      "118. Spanish; Castilian       119. Estonian                 120. Basque                   \n",
      "121. Bishnupriya Manipuri     122. Asturian                 123. English                  \n",
      "124. Esperanto                125. Luxembourgish, Letzeb... 126. Latin                    \n",
      "127. Uighur, Uyghur           128. Ukrainian                129. Limburgish, Limburgan... \n",
      "130. Latvian                  131. Urdu                     132. Lithuanian               \n",
      "133. Fiji Hindi               134. Uzbek                    135. Romanian, Moldavian, ... \n",
      "\n"
     ]
    }
   ],
   "source": [
    "from polyglot.downloader import downloader\n",
    "print(downloader.supported_languages_table(\"morph2\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download Necessary Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[polyglot_data] Downloading package morph2.en to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package morph2.en is already up-to-date!\n",
      "[polyglot_data] Downloading package morph2.ar to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package morph2.ar is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "polyglot download morph2.en morph2.ar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "### Word Segmentation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.text import Text, Word"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preprocessing       ['pre', 'process', 'ing']\n",
      "processor           ['process', 'or']\n",
      "invaluable          ['in', 'valuable']\n",
      "thankful            ['thank', 'ful']\n",
      "crossed             ['cross', 'ed']\n"
     ]
    }
   ],
   "source": [
    "words = [\"preprocessing\", \"processor\", \"invaluable\", \"thankful\", \"crossed\"]\n",
    "for w in words:\n",
    "  w = Word(w, language=\"en\")\n",
    "  print(\"{:<20}{}\".format(w, w.morphemes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sentence Segmentation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the text is not tokenized properly, morphological analysis could offer a smart of way of splitting the text into its original units. Here, is an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "blob = \"Wewillmeettoday.\"\n",
    "text = Text(blob)\n",
    "text.language = \"en\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.morphemes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Command Line Interface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "which           which\r\n",
      "India           In_dia\r\n",
      "beat            beat \r\n",
      "Bermuda         Ber_mud_a\r\n",
      "in              in   \r\n",
      "Port            Port \r\n",
      "of              of   \r\n",
      "Spain           Spa_in\r\n",
      "in              in   \r\n",
      "2007            2007 \r\n",
      ",               ,    \r\n",
      "which           which\r\n",
      "was             wa_s \r\n",
      "equalled        equal_led\r\n",
      "five            five \r\n",
      "days            day_s\r\n",
      "ago             ago  \r\n",
      "by              by   \r\n",
      "South           South\r\n",
      "Africa          Africa\r\n",
      "in              in   \r\n",
      "their           t_heir\r\n",
      "victory         victor_y\r\n",
      "over            over \r\n",
      "West            West \r\n",
      "Indies          In_dies\r\n",
      "in              in   \r\n",
      "Sydney          Syd_ney\r\n",
      ".               .    \r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en morph | tail -n 30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This demo does not reflect the models supplied by polyglot, however, we think it is indicative of what you should expect from morfessor\n",
    "\n",
    "[Demo](http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Citation\n",
    "\n",
    "This is an interface to the implementation being described in the [Morfessor2.0: Python Implementation and Extensions for Morfessor Baseline](https://aaltodoc.aalto.fi/bitstream/handle/123456789/11836/isbn9789526055015.pdf?sequence=1) technical report."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "```\n",
    "@InProceedings{morfessor2,\n",
    "               title:{Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},\n",
    "               author:\t{Virpioja, Sami ; Smit, Peter ; Grönroos, Stig-Arne ; Kurimo, Mikko},\n",
    "               year: {2013},\n",
    "               publisher: {Department of Signal Processing and Acoustics, Aalto University},\n",
    "               booktitle:{Aalto University publication series}\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "- [Morpho project](http://www.cis.hut.fi/cis/projects/morpho/)\n",
    "- [Background information on morpheme discovery](http://www.cis.hut.fi/cis/projects/morpho/problem.shtml)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}