{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Ri5cVDAWIgp7" }, "source": [ "# PyThaiNLP Get Started\n", "\n", "Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "id": "3HsfhZlwInqs", "outputId": "c4e91a7c-356c-4d07-802d-530cd62e4a6d" }, "outputs": [], "source": [ "# # pip install required modules\n", "# # uncomment if running from colab\n", "# # see list of modules in `requirements` and `extras`\n", "# # in https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py\n", "\n", "#!pip install pythainlp\n", "#!pip install epitran" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import PyThaiNLP" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.2.1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pythainlp\n", "\n", "pythainlp.__version__" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "A6gy4MLGIgp9" }, "source": [ "## Thai Characters\n", "\n", "PyThaiNLP provides ready-to-use Thai character sets (e.g. Thai consonants, vowels, tone marks, symbols) as strings for convenience. There are also a few utility functions to test whether a string is Thai or not."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "GAvoeZg3Igp-", "outputId": "13509870-fe94-4957-ae37-b86a677d9234" }, "outputs": [ { "data": { "text/plain": [ "'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮฤฦะัาำิีึืุูเแโใไๅํ็่้๊๋ฯฺๆ์ํ๎๏๚๛๐๑๒๓๔๕๖๗๘๙฿'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.thai_characters" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "88" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pythainlp.thai_characters)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "uPwx53A6IgqF", "outputId": "7693ee7c-f42f-4503-fc0a-fa2a47e5a374" }, "outputs": [ { "data": { "text/plain": [ "'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.thai_consonants" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "44" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pythainlp.thai_consonants)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "5UA7Hwy_IgqI", "outputId": "9de7d50e-8499-48d9-bd2f-b025ddab9479" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"๔\" in pythainlp.thai_digits # check if Thai digit \"4\" is in the character set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking if a string contains Thai characters, and how many" ] }, { "cell_type": 
"code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "t3NvXqYFIgqK", "outputId": "52d91e75-cfd7-4176-ff3b-a725724a8871" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pythainlp.util\n", "\n", "pythainlp.util.isthai(\"ก\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "sRzSQjugIgqM", "outputId": "212049ed-56d2-4b03-aef0-87d05b861ddb" }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.isthai(\"(ก.พ.)\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "DP5yfJebIgqP", "outputId": "0eca64e8-dbfc-479a-ec0c-c4da71ff3b1c" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.isthai(\"(ก.พ.)\", ignore_chars=\".()\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`countthai()` returns the proportion of Thai characters in the text. By default, it ignores non-alphabet characters." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "87Z8P9WPIgqS", "outputId": "0b92019f-9773-49db-e0b0-840ba9f7d8a0" }, "outputs": [ { "data": { "text/plain": [ "100.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.countthai(\"วันอาทิตย์ที่ 24 มีนาคม 2562\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify characters to be ignored with the `ignore_chars=` parameter."
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "ukSQP8ZTIgqV", "outputId": "9f0bff09-0527-45ca-9f25-65c60f286930" }, "outputs": [ { "data": { "text/plain": [ "67.85714285714286" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.countthai(\"วันอาทิตย์ที่ 24 มีนาคม 2562\", ignore_chars=\"\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kW89ZW-IIgqX" }, "source": [ "## Collation\n", "\n", "Sorting according to Thai dictionary." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "hT1Pj52bIgqY", "outputId": "b948f6ce-ee51-4f3e-cdad-43b3957155e0" }, "outputs": [ { "data": { "text/plain": [ "['กรรไกร', 'กระดาษ', 'ไข่', 'ค้อน', 'ผ้าไหม']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import collate\n", "\n", "thai_words = [\"ค้อน\", \"กระดาษ\", \"กรรไกร\", \"ไข่\", \"ผ้าไหม\"]\n", "collate(thai_words)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "XgWpZM8hIgqb", "outputId": "d9003bd2-e0ee-47c7-aa67-f498e4f47578" }, "outputs": [ { "data": { "text/plain": [ "['ผ้าไหม', 'ค้อน', 'ไข่', 'กระดาษ', 'กรรไกร']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collate(thai_words, reverse=True)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "g-czYhLoIgqd" }, "source": [ "## Date/Time Format and Spellout\n", "\n", "### Date/Time Format\n", "\n", "Get Thai day and month names with Thai Buddhist Era (B.E.).\n", "Use [formatting directives](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) similar to 
`datetime.strftime()`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "F03_rMWzIgqe", "outputId": "ffeda738-0926-4439-d3a0-14869d0d59db" }, "outputs": [ { "data": { "text/plain": [ "'วันพุธที่ 6 ตุลาคม พ.ศ. 2519 เวลา 01:40 น. (พ 06-ต.ค.-19)'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "from pythainlp.util import thai_strftime\n", "\n", "fmt = \"%Aที่ %-d %B พ.ศ. %Y เวลา %H:%M น. (%a %d-%b-%y)\"\n", "date = datetime.datetime(1976, 10, 6, 1, 40)\n", "\n", "thai_strftime(date, fmt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From version 2.2, these modifiers can be applied right before the main directive:\n", "\n", "- \\- *(minus)* Do not pad a numeric result string (also available in version 2.1)\n", "- _ *(underscore)* Pad a numeric result string with spaces\n", "- 0 *(zero)* Pad a number result string with zeros\n", "- ^ Convert alphabetic characters in result string to upper case\n", "- \\# Swap the case of the result string\n", "- O *(letter o)* Use the locale's alternative numeric symbols (Thai digit)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'06 ต.ค. 19'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_strftime(date, \"%d %b %y\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'06 ต.ค. 
2519'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_strftime(date, \"%d %b %Y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time Spellout\n", "\n", "*Note: `thai_time()` has been renamed to `time_to_thaiword()` in version 2.2. The old name still works but is deprecated.*" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ศูนย์นาฬิกาสิบสี่นาทียี่สิบเก้าวินาที'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import thai_time\n", "\n", "thai_time(\"00:14:29\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spellout format can be chosen using the `fmt` parameter.\n", "It can be `24h`, `6h`, or `m6h`. Try each one yourself." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เที่ยงคืนสิบสี่นาทียี่สิบเก้าวินาที'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_time(\"00:14:29\", fmt=\"6h\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The precision of the spellout can be chosen as well, using the `precision` parameter.\n", "It can be `m` for minute-level, `s` for second-level, or `None` to read only the non-zero values."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ศูนย์นาฬิกาสิบสี่นาที'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_time(\"00:14:29\", precision=\"m\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "สองโมงเช้าสิบเจ็ดนาที\n", "แปดโมงสิบเจ็ดนาทีศูนย์วินาที\n", "หกโมงครึ่ง\n", "บ่ายโมงครึ่ง\n" ] } ], "source": [ "print(thai_time(\"8:17:00\", fmt=\"6h\"))\n", "print(thai_time(\"8:17:00\", fmt=\"m6h\", precision=\"s\"))\n", "print(thai_time(\"18:30:01\", fmt=\"m6h\", precision=\"m\"))\n", "print(thai_time(\"13:30:01\", fmt=\"6h\", precision=\"m\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pass `datetime` and `time` objects to `thai_time()`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'สิบสามนาฬิกาสิบสี่นาทีสิบห้าวินาที'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "\n", "time = datetime.time(13, 14, 15)\n", "thai_time(time)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'บ่ายโมงสิบสี่นาที'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "time = datetime.datetime(10, 11, 12, 13, 14, 15)\n", "thai_time(time, fmt=\"6h\", precision=\"m\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "8VFPOHyZIgqh" }, "source": [ "## Tokenization and Segmentation\n", "\n", "At sentence, word, and sub-word levels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence\n", "\n", "The default sentence tokenizer is \"crfcut\". The tokenization engine can be chosen using the `engine=` parameter."
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default (crfcut):\n", "['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ เป็นรัฐธรรมนูญฉบับชั่วคราว ', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม ', 'ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 ', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร']\n", "\n", "whitespace+newline:\n", "['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว', 'พุทธศักราช', '๒๔๗๕', 'เป็นรัฐธรรมนูญฉบับชั่วคราว', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม', 'ประกาศใช้เมื่อวันที่', '27', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่', '24', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยคณะราษฎร']\n" ] } ], "source": [ "from pythainlp import sent_tokenize\n", "\n", "text = (\"พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ \"\n", " \"เป็นรัฐธรรมนูญฉบับชั่วคราว ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม \"\n", " \"ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 \"\n", " \"โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร\")\n", "\n", "print(\"default (crfcut):\")\n", "print(sent_tokenize(text))\n", "print(\"\\nwhitespace+newline:\")\n", "print(sent_tokenize(text, engine=\"whitespace+newline\"))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SklPJ-DbIgqi" }, "source": [ "### Word\n", "The default word tokenizer (\"newmm\") uses a maximum matching algorithm."
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "JEbY-MGCIgqi", "outputId": "ce82fcbe-117f-4e12-db86-f01b4ea988e4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default (newmm):\n", "['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', ' ', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน', ' ']\n", "\n", "newmm and keep_whitespace=False:\n", "['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน']\n" ] } ], "source": [ "from pythainlp import word_tokenize\n", "\n", "text = \"ก็จะรู้ความชั่วร้ายที่ทำไว้ และคงจะไม่ยอมให้ทำนาบนหลังคน \"\n", "\n", "print(\"default (newmm):\")\n", "print(word_tokenize(text))\n", "print(\"\\nnewmm and keep_whitespace=False:\")\n", "print(word_tokenize(text, keep_whitespace=False))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "e5P_YygrIgqm" }, "source": [ "Other algorithms can be chosen. We can also create a tokenizer with a custom dictionary."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "mI_Qz3k3Igqm", "outputId": "2d10dc44-fc8d-4c4d-8526-3b5abe9494d5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "newmm : ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศ', 'ใช้แล้ว']\n", "longest: ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศใช้', 'แล้ว']\n", "newmm (custom dictionary): ['กฎหมาย', 'แรงงาน', 'ฉบับปรับปรุงใหม่ประกาศใช้แล้ว']\n" ] } ], "source": [ "from pythainlp import word_tokenize, Tokenizer\n", "\n", "text = \"กฎหมายแรงงานฉบับปรับปรุงใหม่ประกาศใช้แล้ว\"\n", "\n", "print(\"newmm :\", word_tokenize(text)) # default engine is \"newmm\"\n", "print(\"longest:\", word_tokenize(text, engine=\"longest\"))\n", "\n", "words = [\"แรงงาน\"]\n", "custom_tokenizer = Tokenizer(words)\n", "print(\"newmm (custom dictionary):\", custom_tokenizer.word_tokenize(text))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zIXUxXlTIgqo" }, "source": [ "The default word tokenizer uses a word list from `pythainlp.corpus.common.thai_words()`.\n", "We can get that list, add/remove words, and create a new tokenizer from the modified list."
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "RblqNckGIgqp", "outputId": "b0c50208-55ce-4f63-8e99-8f98bbd31733" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default dictionary: ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิ', 'มอ', 'ฟ']\n", "custom dictionary : ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิมอฟ']\n" ] } ], "source": [ "from pythainlp.corpus.common import thai_words\n", "from pythainlp import Tokenizer\n", "\n", "text = \"นิยายวิทยาศาสตร์ของไอแซค อสิมอฟ\"\n", "\n", "print(\"default dictionary:\", word_tokenize(text))\n", "\n", "words = set(thai_words()) # thai_words() returns frozenset\n", "words.add(\"ไอแซค\") # Isaac\n", "words.add(\"อสิมอฟ\") # Asimov\n", "custom_tokenizer = Tokenizer(words)\n", "print(\"custom dictionary :\", custom_tokenizer.word_tokenize(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can create a dictionary trie with the `pythainlp.util.Trie()` function and pass it to the default tokenizer."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default dictionary: ['ILO', '87', ' ', 'ว่าด้วย', 'เสรีภาพ', 'ใน', 'การสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', ' ', 'ILO', '98', ' ', 'ว่าด้วย', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', 'และ', 'การ', 'ร่วม', 'เจรจา', 'ต่อรอง']\n", "custom dictionary : ['ILO87', ' ', 'ว่าด้วย', 'เสรีภาพในการสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิในการรวมตัว', ' ', 'ILO98', ' ', 'ว่าด้วย', 'สิทธิในการรวมตัว', 'และ', 'การร่วมเจรจาต่อรอง']\n" ] } ], "source": [ "from pythainlp.corpus.common import thai_words\n", "from pythainlp.util import Trie\n", "\n", "text = \"ILO87 ว่าด้วยเสรีภาพในการสมาคมและการคุ้มครองสิทธิในการรวมตัว ILO98 ว่าด้วยสิทธิในการรวมตัวและการร่วมเจรจาต่อรอง\"\n", "\n", "print(\"default dictionary:\", word_tokenize(text))\n", "\n", "new_words = {\"ILO87\", \"ILO98\", \"การร่วมเจรจาต่อรอง\", \"สิทธิในการรวมตัว\", \"เสรีภาพในการสมาคม\", \"แรงงานสัมพันธ์\"}\n", "words = new_words.union(thai_words())\n", "\n", "custom_dictionary_trie = Trie(words)\n", "print(\"custom dictionary :\", word_tokenize(text, custom_dict=custom_dictionary_trie))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Testing different tokenization engines" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": {}, "colab_type": "code", "id": "4L2kRMY5Igqr" }, "outputs": [], "source": [ "speedtest_text = \"\"\"\n", "ครบรอบ 14 ปี ตากใบ เช้าวันนั้น 25 ต.ค. 2547 ผู้ชุมนุมชายกว่า 1,370 คน\n", "ถูกโยนขึ้นรถยีเอ็มซี 22 หรือ 24 คัน นอนซ้อนกันคันละ 4-5 ชั้น เดินทางจากสถานีตำรวจตากใบ ไปไกล 150 กิโลเมตร\n", "ไปถึงค่ายอิงคยุทธบริหาร ใช้เวลากว่า 6 ชั่วโมง / ในอีกคดีที่ญาติฟ้องร้องรัฐ คดีจบลงที่การประนีประนอมยอมความ\n", "กระทรวงกลาโหมจ่ายค่าสินไหมทดแทนรวม 42 ล้านบาทให้กับญาติผู้เสียหาย 79 ราย\n", "ปิดหีบและนับคะแนนเสร็จแล้ว ที่หน่วยเลือกตั้งที่ 32 เขต 13 แขวงหัวหมาก เขตบางกะปิ กรุงเทพมหานคร\n", "ผู้สมัคร ส.ส. 
และตัวแทนพรรคการเมืองจากหลายพรรคต่างมาเฝ้าสังเกตการนับคะแนนอย่างใกล้ชิด โดย\n", "ฐิติภัสร์ โชติเดชาชัยนันต์ จากพรรคพลังประชารัฐ และพริษฐ์ วัชรสินธุ จากพรรคประชาธิปัตย์ได้คะแนน\n", "96 คะแนนเท่ากัน\n", "เช้าวันอาทิตย์ที่ 21 เมษายน 2019 ซึ่งเป็นวันอีสเตอร์ วันสำคัญของชาวคริสต์\n", "เกิดเหตุระเบิดต่อเนื่องในโบสถ์คริสต์และโรงแรมอย่างน้อย 7 แห่งในประเทศศรีลังกา\n", "มีผู้เสียชีวิตแล้วอย่างน้อย 156 คน และบาดเจ็บหลายร้อยคน ยังไม่มีข้อมูลว่าผู้ก่อเหตุมาจากฝ่ายใด\n", "จีนกำหนดจัดการประชุมข้อริเริ่มสายแถบและเส้นทางในช่วงปลายสัปดาห์นี้ ปักกิ่งยืนยันว่า\n", "อภิมหาโครงการเชื่อมโลกของจีนไม่ใช่เครื่องมือแผ่อิทธิพล แต่ยินดีรับฟังข้อวิจารณ์ เช่น ประเด็นกับดักหนี้สิน\n", "และความไม่โปร่งใส รัฐบาลปักกิ่งบอกว่า เวทีประชุม Belt and Road Forum ในช่วงวันที่ 25-27 เมษายน\n", "ถือเป็นงานการทูตที่สำคัญที่สุดของจีนในปี 2019\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "qMF_0xyOIgqs", "outputId": "7b914ac8-456f-4af7-f62d-e4cdf61409aa" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 253 ms, sys: 2.27 ms, total: 256 ms\n", "Wall time: 255 ms\n" ] } ], "source": [ "# Speed test: Calling \"longest\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"longest\")" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "NlCSHylIIgqv", "outputId": "c270e307-6804-4dc6-93a4-5b64776d01e7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.4 ms, sys: 60 µs, total: 3.46 ms\n", "Wall time: 3.47 ms\n" ] } ], "source": [ "# Speed test: Calling \"newmm\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"newmm\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "CPU times: user 4.08 ms, sys: 88 µs, total: 4.16 ms\n", "Wall time: 4.15 ms\n" ] } ], "source": [ "# Speed test: Calling \"newmm-safe\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"newmm-safe\")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 833 ms, sys: 174 ms, total: 1.01 s\n", "Wall time: 576 ms\n" ] } ], "source": [ "#!pip install attacut\n", "# Speed test: Calling \"attacut\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"attacut\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get all possible segmentations" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 272 }, "colab_type": "code", "id": "qFTYqAB1Igq1", "outputId": "607ea178-c070-48c6-c12e-1a62c09f8847" }, "outputs": [ { "data": { "text/plain": [ "['มี|ความ|เป็น|ไป|ได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความ|เป็นไป|ได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความ|เป็นไปได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความเป็นไป|ได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความเป็นไปได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความ|เป็น|ไป|ได้|อย่างไร|บ้าง|',\n", " 'มี|ความ|เป็นไป|ได้|อย่างไร|บ้าง|',\n", " 'มี|ความ|เป็นไปได้|อย่างไร|บ้าง|',\n", " 'มี|ความเป็นไป|ได้|อย่างไร|บ้าง|',\n", " 'มี|ความเป็นไปได้|อย่างไร|บ้าง|',\n", " 'มี|ความ|เป็น|ไป|ได้|อย่างไรบ้าง|',\n", " 'มี|ความ|เป็นไป|ได้|อย่างไรบ้าง|',\n", " 'มี|ความ|เป็นไปได้|อย่างไรบ้าง|',\n", " 'มี|ความเป็นไป|ได้|อย่างไรบ้าง|',\n", " 'มี|ความเป็นไปได้|อย่างไรบ้าง|']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tokenize.multi_cut import find_all_segment, mmcut, segment\n", "\n", "find_all_segment(\"มีความเป็นไปได้อย่างไรบ้าง\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fWXiNUoBIgq4" }, "source": [ "### Subword, syllable, and Thai Character Cluster (TCC)\n", "\n", "Tokenization can also be done at subword level, either syllable or Thai Character Cluster (TCC).\n", "\n", "- Syllable segmentation uses [`ssg`](https://github.com/ponrawee/ssg), a CRF syllable segmenter for Thai by Ponrawee Prasertsom.\n", "- A TCC is smaller than a syllable. For information about TCC, see [Character Cluster Based Thai Information Retrieval](https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval) (Theeramunkong et al. 2004)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Subword tokenization\n", "The default subword tokenization engine is `tcc`, which uses Thai Character Clusters (TCC) as the subword unit." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "Z1wdMmdmIgq6", "outputId": "c74fa03a-a288-47b8-9c49-7ed42c3545c9" }, "outputs": [ { "data": { "text/plain": [ "['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp import subword_tokenize\n", "\n", "subword_tokenize(\"ประเทศไทย\") # default subword unit is TCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Syllable tokenization\n", "The default syllable tokenization engine is `dict`, which uses the `newmm` word tokenization engine with a custom dictionary containing known Thai syllables."
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['อับ',\n", " 'ดุล',\n", " 'เลาะ',\n", " ' ',\n", " 'อี',\n", " 'ซอ',\n", " 'มู',\n", " 'ซอ',\n", " ' ',\n", " 'สมอง',\n", " 'บวม',\n", " 'รุน',\n", " 'แรง']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tokenize import syllable_tokenize\n", "\n", "text = \"อับดุลเลาะ อีซอมูซอ สมองบวมรุนแรง\"\n", "\n", "syllable_tokenize(text) # default engine is \"dict\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The external [`ssg`](https://github.com/ponrawee/ssg) engine can also be called. Note that the `ssg` engine does not emit separate whitespace tokens; whitespace is attached to adjacent tokens." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['อับ', 'ดุล', 'เลาะ', ' อี', 'ซอ', 'มู', 'ซอ ', 'สมอง', 'บวม', 'รุน', 'แรง']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "syllable_tokenize(text, engine=\"ssg\") # use \"ssg\" for syllable" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "RfO1WbneIgq9" }, "source": [ "#### Low-level subword operations\n", "\n", "These low-level TCC operations can be useful for some pre-processing tasks, such as checking whether it is OK to cut a string at a certain point, or finding typos."
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "3Gyig20XIgq-", "outputId": "da9484b7-4dc4-4159-9b5f-df2771805ac9" }, "outputs": [ { "data": { "text/plain": [ "['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tokenize import tcc\n", "\n", "tcc.segment(\"ประเทศไทย\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "cF-zQJU1IgrA", "outputId": "fd966eb3-fd8b-46b6-ea61-93427431cdf9" }, "outputs": [ { "data": { "text/plain": [ "{1, 3, 5, 6, 8, 9}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tcc.tcc_pos(\"ประเทศไทย\") # return positions" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "aL2PiPUvIgrE", "outputId": "a04d64f4-b174-4337-8859-ecac7e660f29" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ป-ระ-เท-ศ-ไท-ย-" ] } ], "source": [ "for ch in tcc.tcc(\"ประเทศไทย\"): # TCC generator\n", " print(ch, end='-')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "jxvgbdlhIgrG" }, "source": [ "## Transliteration\n", "\n", "There are two types of transliteration here: romanization and transliteration.\n", "\n", "- Romanization will render Thai words in the Latin alphabet using the [Royal Thai General System of Transcription (RTGS)](https://en.wikipedia.org/wiki/Royal_Thai_General_System_of_Transcription).\n", " - Two engines are supported here: a simple `royin` engine (default) and a more accurate `thai2rom` engine.\n", "- Transliteration here, in PyThaiNLP context, means the sound representation of a string.\n", " - Two engines are supported here: `ipa` 
(International Phonetic Alphabet system, using [Epitran](https://github.com/dmort27/epitran)) (default) and `icu` (International Components for Unicode, using [PyICU](https://github.com/ovalhub/pyicu))." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "ujAsMHwyIgrH", "outputId": "4a14e9df-9699-4ede-848b-a280ac0ba5d1" }, "outputs": [ { "data": { "text/plain": [ "'maeo'" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.transliterate import romanize\n", "\n", "romanize(\"แมว\") # output: 'maeo'" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'phapn'" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "romanize(\"ภาพยนตร์\") # output: 'phapn' (*obviously wrong)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "LlDosHqXIgrJ", "outputId": "a7a88a3a-a04e-4e90-aef4-c6b357ab04c7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Update Corpus...\n", "Corpus: thai-g2p\n", "- Already up to date.\n" ] }, { "data": { "text/plain": [ "'m ɛː w ˧'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.transliterate import transliterate\n", "\n", "transliterate(\"แมว\") # output: 'mɛːw'" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'pʰ aː p̚ ˥˩ . pʰ a ˦˥ . 
j o n ˧'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transliterate(\"ภาพยนตร์\") # output: 'pʰaːpjanot'" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4UwQtF3oIgrM" }, "source": [ "## Normalization\n", "\n", "`normalize()` removes zero-width spaces (ZWSP and ZWNJ), duplicated spaces, repeating vowels, and dangling characters. It also reorders vowels and tone marks while removing repeating vowels." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "WXPq5bqfIgrN", "outputId": "42a7a5be-7d7d-4997-811c-424c79ce3169" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import normalize\n", "\n", "normalize(\"เเปลก\") == \"แปลก\" # เ เ ป ล ก vs แ ป ล ก" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The string below contains a non-standard order of Thai characters,\n", "Sara Aa (following vowel) + Mai Ek (upper tone mark).\n", "`normalize()` will reorder it to Mai Ek + Sara Aa." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เก่า'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"เกา่\"\n", "normalize(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be useful for string matching, including tokenization."
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tokenize immediately:\n", "['เก็บ', 'วัน', 'น้ี', ' ', 'พรุ่งน้ี', 'ก็', 'เกา', '่']\n", "\n", "normalize, then tokenize:\n", "['เก็บ', 'วันนี้', ' ', 'พรุ่งนี้', 'ก็', 'เก่า']\n" ] } ], "source": [ "from pythainlp import word_tokenize\n", "\n", "text = \"เก็บวันน้ี พรุ่งน้ีก็เกา่\"\n", "\n", "print(\"tokenize immediately:\")\n", "print(word_tokenize(text))\n", "print(\"\\nnormalize, then tokenize:\")\n", "print(word_tokenize(normalize(text)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The string below contains repeating vowels (multiple Sara A in a row).\n", "`normalize()` will keep only one of them. This can be used to reduce variations in spelling, which is useful for classification tasks." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เกะ'" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalize(\"เกะะะ\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally, `normalize()` is just a series of function calls like this:\n", "```\n", "text = remove_zw(text)\n", "text = remove_dup_spaces(text)\n", "text = remove_repeat_vowels(text)\n", "text = remove_dangling(text)\n", "```\n", "\n", "If you don't like the behavior of the default `normalize()`, you can call the functions shown above, as well as `remove_tonemark()` and `reorder_vowels()`, individually from `pythainlp.util` to customize your own normalization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Digit conversion\n", "\n", "Thai text sometimes uses Thai digits. This can reduce performance in classification and searching. PyThaiNLP provides a few utility functions to deal with this."
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ฉุกเฉินที่ยุโรปเรียก ๑๑๒ ๑๑๒'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import arabic_digit_to_thai_digit, thai_digit_to_arabic_digit, digit_to_text\n", "\n", "text = \"ฉุกเฉินที่ยุโรปเรียก 112 ๑๑๒\"\n", "\n", "arabic_digit_to_thai_digit(text)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ฉุกเฉินที่ยุโรปเรียก 112 112'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_digit_to_arabic_digit(text)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ฉุกเฉินที่ยุโรปเรียก หนึ่งหนึ่งสอง หนึ่งหนึ่งสอง'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digit_to_text(text)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rlDji6ecIgrP" }, "source": [ "## Soundex\n", "\n", "\"Soundex is a phonetic algorithm for indexing names by sound.\" ([Wikipedia](https://en.wikipedia.org/wiki/Soundex)). PyThaiNLP provides three kinds of Thai soundex." 
] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "I4JyUCRJIgrP", "outputId": "6af3c11c-3f9a-4154-b7f2-c899312846dc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n" ] } ], "source": [ "from pythainlp.soundex import lk82, metasound, udom83\n", "\n", "# check equivalence\n", "print(lk82(\"รถ\") == lk82(\"รด\"))\n", "print(udom83(\"วรร\") == udom83(\"วัน\"))\n", "print(metasound(\"นพ\") == metasound(\"นภ\"))" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "id": "XTznoTg5IgrS", "outputId": "8178cd7b-735d-4ccc-c36b-a8c67ea2ddb2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n", "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n", "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n", "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n", "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", " - lk82: - udom83: - metasound: \n" ] } ], "source": [ "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n", "for text in texts:\n", " print(\n", " \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n", " text, lk82(text), udom83(text), metasound(text)\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "spFQD8QsIgrT" }, "source": [ "## Spellchecking\n", "\n", "The default spellchecker uses [Peter Norvig's algorithm](http://www.norvig.com/spell-correct.html) together with word frequencies from the Thai National Corpus (TNC).\n", "\n", "`spell()` returns a list 
of all possible spellings." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "GAz0q6lWIgrU", "outputId": "73427202-cdfe-47d9-8925-9596baafd9d3" }, "outputs": [ { "data": { "text/plain": [ "['เหลียม', 'เหลือม']" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp import spell\n", "\n", "spell(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`correct()` returns the most likely spelling." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "I_fDSYEmIgrV", "outputId": "e9b6f2eb-37b6-4189-8cfd-1273caf48f38" }, "outputs": [ { "data": { "text/plain": [ "'เหลียม'" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp import correct\n", "\n", "correct(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "N-dOvexUIgrX" }, "source": [ "## Spellchecking - Custom dictionary and word frequency\n", "\n", "A custom dictionary can be provided when creating a spellchecker.\n", "\n", "When creating a `NorvigSpellChecker` object, you can pass a custom dictionary to the `custom_dict` parameter.\n", "\n", "`custom_dict` can be:\n", "- a dictionary (`dict`), with words (`str`) as keys and frequencies (`int`) as values; or\n", "- a list, a tuple, or a set of (word, frequency) tuples; or\n", "- a list, a tuple, or a set of just words, without their frequencies -- in this case a frequency of `1` will be assigned to every word."
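] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For illustration only, the hypothetical helper below (not part of PyThaiNLP) sketches how these three accepted `custom_dict` shapes all reduce to the same word-frequency mapping:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Hypothetical sketch, not PyThaiNLP API:\n",
"# reduce any accepted custom_dict shape to a {word: frequency} dict.\n",
"def to_word_freq(custom_dict):\n",
"    if isinstance(custom_dict, dict):\n",
"        return dict(custom_dict)\n",
"    items = list(custom_dict)\n",
"    if items and isinstance(items[0], (list, tuple)):\n",
"        return {word: freq for word, freq in items}\n",
"    return {word: 1 for word in items}  # bare words default to frequency 1\n",
"\n",
"to_word_freq([\"เหลียม\", \"เหลือม\"])  # {'เหลียม': 1, 'เหลือม': 1}"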
] }, { "cell_type": "code", "execution_count": 56, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "ixx-8YtfIgrY", "outputId": "0dbcf0dc-287e-4c00-f005-c371c81211cd" }, "outputs": [ { "data": { "text/plain": [ "['เหลือม', 'เหลียม']" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.spell import NorvigSpellChecker\n", "\n", "user_dict = [(\"เหลียม\", 50), (\"เหลือม\", 1000), (\"เหลียว\", 1000000)]\n", "checker = NorvigSpellChecker(custom_dict=user_dict)\n", "\n", "checker.spell(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As you can see, our version of `NorvigSpellChecker` gives edit distance priority over word frequency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use word frequencies from the Thai National Corpus and the Thai Textbook Corpus as well.\n", "\n", "By default, `NorvigSpellChecker` uses the Thai National Corpus."
] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['เหลือม']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.corpus import ttc # Thai Textbook Corpus\n", "\n", "checker = NorvigSpellChecker(custom_dict=ttc.word_freqs())\n", "\n", "checker.spell(\"เหลืยม\")" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เหลือม'" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker.correct(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the current dictionary of a spellchecker:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "id": "H7TxgdwbIgra", "outputId": "3709f50f-3541-41d1-d8c6-7dfd090f3c1f" }, "outputs": [ { "data": { "text/plain": [ "[('พิธีเปิด', 18),\n", " ('ไส้กรอก', 40),\n", " ('ปลิง', 6),\n", " ('เต็ง', 13),\n", " ('ขอบคุณ', 356),\n", " ('ประสาน', 84),\n", " ('รำไร', 11),\n", " ('ร่วมท้อง', 4),\n", " ('ฝักมะขาม', 3)]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(checker.dictionary())[1:10]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "70sKEAlGIgrc" }, "source": [ "We can also apply conditions and a filter function to the dictionary when creating a spellchecker."
] }, { "cell_type": "code", "execution_count": 60, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "gT8G4cFzIgrc", "outputId": "ad4dd927-4c65-4164-a6cb-d123a08c9ee2" }, "outputs": [ { "data": { "text/plain": [ "39963" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker = NorvigSpellChecker() # use default filter (remove any word with number or non-Thai character)\n", "len(checker.dictionary())" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "w6qI7M92Igre", "outputId": "862b3111-e83d-4662-f643-e013e2fc8cd5" }, "outputs": [ { "data": { "text/plain": [ "30376" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker = NorvigSpellChecker(min_freq=5, min_len=2, max_len=15)\n", "len(checker.dictionary())" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "cTkFjK8IIgrh", "outputId": "7c2e3d09-aa49-4ee0-edfa-49fd4876a968" }, "outputs": [ { "data": { "text/plain": [ "66209" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker_no_filter = NorvigSpellChecker(dict_filter=None) # use no filter\n", "len(checker_no_filter.dictionary())" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "70ZHCbBQIgrm", "outputId": "0dc68873-9e46-4578-bd22-aa3fc1cfb198" }, "outputs": [ { "data": { "text/plain": [ "66204" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def remove_yamok(word):\n", " return False if \"ๆ\" in word else True\n", "\n", "checker_custom_filter = NorvigSpellChecker(dict_filter=remove_yamok) # use custom 
filter\n", "len(checker_custom_filter.dictionary())" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1hoODyDrIgro" }, "source": [ "## Part-of-Speech Tagging" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "39JixRHsIgro", "outputId": "cff28d16-3fa7-4b66-df2e-beef601ec41d" }, "outputs": [ { "data": { "text/plain": [ "[('การ', 'FIXN'), ('เดินทาง', 'VACT')]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tag import pos_tag, pos_tag_sents\n", "\n", "pos_tag([\"การ\",\"เดินทาง\"])" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 306 }, "colab_type": "code", "id": "qrSDelkrIgrq", "outputId": "8cce2c89-7599-4020-b5a6-771c0fa0c005" }, "outputs": [ { "data": { "text/plain": [ "[[('ประกาศสำนักนายกฯ', 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('ให้', 'JSBR'),\n", " (' ', 'PUNC'),\n", " (\"'พล.ท.สรรเสริญ แก้วกำเนิด'\", 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('พ้นจากตำแหน่ง', 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('ผู้ทรงคุณวุฒิพิเศษ', 'NCMN'),\n", " ('กองทัพบก', 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('กระทรวงกลาโหม', 'NCMN')],\n", " [('และ', 'JCRG'),\n", " ('แต่งตั้ง', 'VACT'),\n", " ('ให้', 'JSBR'),\n", " ('เป็น', 'VSTA'),\n", " (\"'อธิบดีกรมประชาสัมพันธ์'\", 'NCMN')]]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sents = [[\"ประกาศสำนักนายกฯ\", \" \", \"ให้\",\n", " \" \", \"'พล.ท.สรรเสริญ แก้วกำเนิด'\", \" \", \"พ้นจากตำแหน่ง\",\n", " \" \", \"ผู้ทรงคุณวุฒิพิเศษ\", \"กองทัพบก\", \" \", \"กระทรวงกลาโหม\"],\n", " [\"และ\", \"แต่งตั้ง\", \"ให้\", \"เป็น\", \"'อธิบดีกรมประชาสัมพันธ์'\"]]\n", "\n", "pos_tag_sents(sents)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f6ShDKpHIgrs" }, "source": [ "## Named-Entity Tagging\n", "\n", "The tagger 
uses the BIO scheme:\n", "- B - beginning of entity\n", "- I - inside entity\n", "- O - outside entity" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 561 }, "colab_type": "code", "id": "TVso09S7Igrv", "outputId": "f801ac2c-d013-4243-ba0e-b8f88bc69efd" }, "outputs": [ { "data": { "text/plain": [ "[('24', 'NUM', 'B-DATE'),\n", " (' ', 'PUNCT', 'I-DATE'),\n", " ('มิ.ย.', 'NOUN', 'I-DATE'),\n", " (' ', 'PUNCT', 'O'),\n", " ('2563', 'NUM', 'O'),\n", " (' ', 'PUNCT', 'O'),\n", " ('ทดสอบระบบ', 'PART', 'O'),\n", " ('เวลา', 'NOUN', 'O'),\n", " (' ', 'PUNCT', 'O'),\n", " ('6:00', 'NUM', 'B-TIME'),\n", " (' ', 'PUNCT', 'I-TIME'),\n", " ('น.', 'NOUN', 'I-TIME'),\n", " (' ', 'PUNCT', 'O'),\n", " ('เดินทาง', 'VERB', 'O'),\n", " ('จาก', 'ADP', 'O'),\n", " ('ขนส่ง', 'NOUN', 'B-ORGANIZATION'),\n", " ('กรุงเทพ', 'NOUN', 'I-ORGANIZATION'),\n", " ('ใกล้', 'ADJ', 'O'),\n", " ('ถนน', 'NOUN', 'B-LOCATION'),\n", " ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),\n", " (' ', 'PUNCT', 'O'),\n", " ('ไป', 'AUX', 'O'),\n", " ('จังหวัด', 'VERB', 'B-LOCATION'),\n", " ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),\n", " (' ', 'PUNCT', 'O'),\n", " ('ตั๋ว', 'NOUN', 'O'),\n", " ('ราคา', 'NOUN', 'O'),\n", " (' ', 'PUNCT', 'O'),\n", " ('297', 'NUM', 'B-MONEY'),\n", " (' ', 'PUNCT', 'I-MONEY'),\n", " ('บาท', 'NOUN', 'I-MONEY')]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#!pip3 install pythainlp[ner]\n", "from pythainlp.tag.thainer import ThaiNameTagger\n", "\n", "ner = ThaiNameTagger()\n", "ner.get_ner(\"24 มิ.ย. 2563 ทดสอบระบบเวลา 6:00 น. 
เดินทางจากขนส่งกรุงเทพใกล้ถนนกำแพงเพชร ไปจังหวัดกำแพงเพชร ตั๋วราคา 297 บาท\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "6cF88wN2Igry" }, "source": [ "## Word Vector" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 224 }, "colab_type": "code", "id": "GshCfJiBIgrz", "outputId": "921340b3-4962-41a5-f550-f463360fb3b8" }, "outputs": [ { "data": { "text/plain": [ "0.2504981" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pythainlp.word_vector\n", "\n", "pythainlp.word_vector.similarity(\"คน\", \"มนุษย์\")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 122 }, "colab_type": "code", "id": "qJP9As-_Igr0", "outputId": "7f528d29-0edf-4b3c-9c31-138d7b85e83a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/gensim/models/keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. 
Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n", " vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n" ] }, { "data": { "text/plain": [ "'ไก่'" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.word_vector.doesnt_match([\"คน\", \"มนุษย์\", \"บุคคล\", \"เจ้าหน้าที่\", \"ไก่\"])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iS7iwPoiIgr3" }, "source": [ "## Number Spell Out" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "F9PEEvWLIgr4", "outputId": "a5782efd-aceb-4c5e-d746-df69ba9cad8d" }, "outputs": [ { "data": { "text/plain": [ "'หนึ่งล้านสองแสนสามหมื่นสี่พันห้าร้อยหกสิบเจ็ดล้านแปดแสนเก้าหมื่นหนึ่งร้อยยี่สิบสามบาทสี่สิบห้าสตางค์'" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import bahttext\n", "\n", "bahttext(1234567890123.45)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`bahttext()` will round the satang part" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "Y6DLJYOEIgr7", "outputId": "eac48468-ab8b-4e67-acad-5d6560a18979" }, "outputs": [ { "data": { "text/plain": [ "'หนึ่งบาทเก้าสิบเอ็ดสตางค์'" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bahttext(1.909)" ] } ], "metadata": { "colab": { "name": "pythainlp-get-started.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3.8.13 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "a1d6ff38954a1cdba4cf61ffa51e42f4658fc35985cd256cd89123cae8466a39" } } }, "nbformat": 4, "nbformat_minor": 1 }