{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Ri5cVDAWIgp7" }, "source": [ "# PyThaiNLP Get Started\n", "\n", "Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "id": "3HsfhZlwInqs", "outputId": "c4e91a7c-356c-4d07-802d-530cd62e4a6d" }, "outputs": [], "source": [ "# # pip install required modules\n", "# # uncomment if running from colab\n", "# # see list of modules in `requirements` and `extras`\n", "# # in https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py\n", "\n", "#!pip install pythainlp\n", "#!pip install epitran" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import PyThaiNLP" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.2.1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pythainlp\n", "\n", "pythainlp.__version__" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "A6gy4MLGIgp9" }, "source": [ "## Thai Characters\n", "\n", "PyThaiNLP provides ready-to-use Thai character sets (e.g. Thai consonants, vowels, tone marks, symbols) as strings for convenience. There are also a few utility functions to test whether a string is Thai or not."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "GAvoeZg3Igp-", "outputId": "13509870-fe94-4957-ae37-b86a677d9234" }, "outputs": [ { "data": { "text/plain": [ "'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮฤฦะัาำิีึืุูเแโใไๅํ็่้๊๋ฯฺๆ์ํ๎๏๚๛๐๑๒๓๔๕๖๗๘๙฿'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.thai_characters" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "88" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pythainlp.thai_characters)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "uPwx53A6IgqF", "outputId": "7693ee7c-f42f-4503-fc0a-fa2a47e5a374" }, "outputs": [ { "data": { "text/plain": [ "'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.thai_consonants" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "44" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pythainlp.thai_consonants)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "5UA7Hwy_IgqI", "outputId": "9de7d50e-8499-48d9-bd2f-b025ddab9479" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"๔\" in pythainlp.thai_digits # check if Thai digit \"4\" is in the character set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking if a string contains Thai characters, and how many" ] }, { "cell_type": 
"code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "t3NvXqYFIgqK", "outputId": "52d91e75-cfd7-4176-ff3b-a725724a8871" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pythainlp.util\n", "\n", "pythainlp.util.isthai(\"ก\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "sRzSQjugIgqM", "outputId": "212049ed-56d2-4b03-aef0-87d05b861ddb" }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.isthai(\"(ก.พ.)\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "DP5yfJebIgqP", "outputId": "0eca64e8-dbfc-479a-ec0c-c4da71ff3b1c" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.isthai(\"(ก.พ.)\", ignore_chars=\".()\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`countthai()` returns the proportion of Thai characters in the text. By default, it ignores non-alphabet characters." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "87Z8P9WPIgqS", "outputId": "0b92019f-9773-49db-e0b0-840ba9f7d8a0" }, "outputs": [ { "data": { "text/plain": [ "100.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.countthai(\"วันอาทิตย์ที่ 24 มีนาคม 2562\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify characters to be ignored with the `ignore_chars=` parameter."
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "ukSQP8ZTIgqV", "outputId": "9f0bff09-0527-45ca-9f25-65c60f286930" }, "outputs": [ { "data": { "text/plain": [ "67.85714285714286" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.util.countthai(\"วันอาทิตย์ที่ 24 มีนาคม 2562\", ignore_chars=\"\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kW89ZW-IIgqX" }, "source": [ "## Collation\n", "\n", "Sorting according to Thai dictionary." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "hT1Pj52bIgqY", "outputId": "b948f6ce-ee51-4f3e-cdad-43b3957155e0" }, "outputs": [ { "data": { "text/plain": [ "['กรรไกร', 'กระดาษ', 'ไข่', 'ค้อน', 'ผ้าไหม']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import collate\n", "\n", "thai_words = [\"ค้อน\", \"กระดาษ\", \"กรรไกร\", \"ไข่\", \"ผ้าไหม\"]\n", "collate(thai_words)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "XgWpZM8hIgqb", "outputId": "d9003bd2-e0ee-47c7-aa67-f498e4f47578" }, "outputs": [ { "data": { "text/plain": [ "['ผ้าไหม', 'ค้อน', 'ไข่', 'กระดาษ', 'กรรไกร']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collate(thai_words, reverse=True)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "g-czYhLoIgqd" }, "source": [ "## Date/Time Format and Spellout\n", "\n", "### Date/Time Format\n", "\n", "Get Thai day and month names with Thai Buddhist Era (B.E.).\n", "Use [formatting directives](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) similar to 
`datetime.strftime()`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "F03_rMWzIgqe", "outputId": "ffeda738-0926-4439-d3a0-14869d0d59db" }, "outputs": [ { "data": { "text/plain": [ "'วันพุธที่ 6 ตุลาคม พ.ศ. 2519 เวลา 01:40 น. (พ 06-ต.ค.-19)'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "from pythainlp.util import thai_strftime\n", "\n", "fmt = \"%Aที่ %-d %B พ.ศ. %Y เวลา %H:%M น. (%a %d-%b-%y)\"\n", "date = datetime.datetime(1976, 10, 6, 1, 40)\n", "\n", "thai_strftime(date, fmt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From version 2.2, these modifiers can be applied right before the main directive:\n", "\n", "- \\- *(minus)* Do not pad a numeric result string (also available in version 2.1)\n", "- _ *(underscore)* Pad a numeric result string with spaces\n", "- 0 *(zero)* Pad a number result string with zeros\n", "- ^ Convert alphabetic characters in result string to upper case\n", "- \\# Swap the case of the result string\n", "- O *(letter o)* Use the locale's alternative numeric symbols (Thai digit)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'06 ต.ค. 19'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_strftime(date, \"%d %b %y\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'06 ต.ค. 
2519'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_strftime(date, \"%d %b %Y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time Spellout\n", "\n", "*Note: `thai_time()` has been renamed to `time_to_thaiword()` in version 2.2. The old name still works but is deprecated.*" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ศูนย์นาฬิกาสิบสี่นาทียี่สิบเก้าวินาที'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import thai_time\n", "\n", "thai_time(\"00:14:29\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spellout format can be chosen using the `fmt` parameter.\n", "It can be `24h`, `6h`, or `m6h`. Try each one yourself." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เที่ยงคืนสิบสี่นาทียี่สิบเก้าวินาที'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_time(\"00:14:29\", fmt=\"6h\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The precision of the spellout can be chosen as well, using the `precision` parameter.\n", "It can be `m` for minute-level, `s` for second-level, or `None` to read only the non-zero values."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ศูนย์นาฬิกาสิบสี่นาที'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_time(\"00:14:29\", precision=\"m\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "สองโมงเช้าสิบเจ็ดนาที\n", "แปดโมงสิบเจ็ดนาทีศูนย์วินาที\n", "หกโมงครึ่ง\n", "บ่ายโมงครึ่ง\n" ] } ], "source": [ "print(thai_time(\"8:17:00\", fmt=\"6h\"))\n", "print(thai_time(\"8:17:00\", fmt=\"m6h\", precision=\"s\"))\n", "print(thai_time(\"18:30:01\", fmt=\"m6h\", precision=\"m\"))\n", "print(thai_time(\"13:30:01\", fmt=\"6h\", precision=\"m\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pass `datetime` and `time` objects to `thai_time()`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'สิบสามนาฬิกาสิบสี่นาทีสิบห้าวินาที'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "\n", "time = datetime.time(13, 14, 15)\n", "thai_time(time)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'บ่ายโมงสิบสี่นาที'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "time = datetime.datetime(10, 11, 12, 13, 14, 15)\n", "thai_time(time, fmt=\"6h\", precision=\"m\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "8VFPOHyZIgqh" }, "source": [ "## Tokenization and Segmentation\n", "\n", "At sentence, word, and sub-word levels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence\n", "\n", "The default sentence tokenizer is \"crfcut\". The tokenization engine can be chosen using the `engine=` parameter."
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default (crfcut):\n", "['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ เป็นรัฐธรรมนูญฉบับชั่วคราว ', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม ', 'ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 ', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร']\n", "\n", "whitespace+newline:\n", "['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว', 'พุทธศักราช', '๒๔๗๕', 'เป็นรัฐธรรมนูญฉบับชั่วคราว', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม', 'ประกาศใช้เมื่อวันที่', '27', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่', '24', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยคณะราษฎร']\n" ] } ], "source": [ "from pythainlp import sent_tokenize\n", "\n", "text = (\"พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ \"\n", " \"เป็นรัฐธรรมนูญฉบับชั่วคราว ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม \"\n", " \"ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 \"\n", " \"โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร\")\n", "\n", "print(\"default (crfcut):\")\n", "print(sent_tokenize(text))\n", "print(\"\\nwhitespace+newline:\")\n", "print(sent_tokenize(text, engine=\"whitespace+newline\"))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SklPJ-DbIgqi" }, "source": [ "### Word\n", "The default word tokenizer (\"newmm\") uses a maximum matching algorithm."
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "JEbY-MGCIgqi", "outputId": "ce82fcbe-117f-4e12-db86-f01b4ea988e4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default (newmm):\n", "['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', ' ', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน', ' ']\n", "\n", "newmm and keep_whitespace=False:\n", "['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน']\n" ] } ], "source": [ "from pythainlp import word_tokenize\n", "\n", "text = \"ก็จะรู้ความชั่วร้ายที่ทำไว้ และคงจะไม่ยอมให้ทำนาบนหลังคน \"\n", "\n", "print(\"default (newmm):\")\n", "print(word_tokenize(text))\n", "print(\"\\nnewmm and keep_whitespace=False:\")\n", "print(word_tokenize(text, keep_whitespace=False))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "e5P_YygrIgqm" }, "source": [ "Other algorithms can be chosen. We can also create a tokenizer with a custom dictionary."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "mI_Qz3k3Igqm", "outputId": "2d10dc44-fc8d-4c4d-8526-3b5abe9494d5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "newmm : ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศ', 'ใช้แล้ว']\n", "longest: ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศใช้', 'แล้ว']\n", "newmm (custom dictionary): ['กฎหมาย', 'แรงงาน', 'ฉบับปรับปรุงใหม่ประกาศใช้แล้ว']\n" ] } ], "source": [ "from pythainlp import word_tokenize, Tokenizer\n", "\n", "text = \"กฎหมายแรงงานฉบับปรับปรุงใหม่ประกาศใช้แล้ว\"\n", "\n", "print(\"newmm :\", word_tokenize(text)) # default engine is \"newmm\"\n", "print(\"longest:\", word_tokenize(text, engine=\"longest\"))\n", "\n", "words = [\"แรงงาน\"]\n", "custom_tokenizer = Tokenizer(words)\n", "print(\"newmm (custom dictionary):\", custom_tokenizer.word_tokenize(text))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zIXUxXlTIgqo" }, "source": [ "The default word tokenizer uses a word list from `pythainlp.corpus.common.thai_words()`.\n", "We can get that list, add/remove words, and create a new tokenizer from the modified list."
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "RblqNckGIgqp", "outputId": "b0c50208-55ce-4f63-8e99-8f98bbd31733" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default dictionary: ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิ', 'มอ', 'ฟ']\n", "custom dictionary : ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิมอฟ']\n" ] } ], "source": [ "from pythainlp.corpus.common import thai_words\n", "from pythainlp import Tokenizer\n", "\n", "text = \"นิยายวิทยาศาสตร์ของไอแซค อสิมอฟ\"\n", "\n", "print(\"default dictionary:\", word_tokenize(text))\n", "\n", "words = set(thai_words()) # thai_words() returns frozenset\n", "words.add(\"ไอแซค\") # Isaac\n", "words.add(\"อสิมอฟ\") # Asimov\n", "custom_tokenizer = Tokenizer(words)\n", "print(\"custom dictionary :\", custom_tokenizer.word_tokenize(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can create a dictionary trie with the `pythainlp.util.Trie()` function and pass it to the default tokenizer."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default dictionary: ['ILO', '87', ' ', 'ว่าด้วย', 'เสรีภาพ', 'ใน', 'การสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', ' ', 'ILO', '98', ' ', 'ว่าด้วย', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', 'และ', 'การ', 'ร่วม', 'เจรจา', 'ต่อรอง']\n", "custom dictionary : ['ILO87', ' ', 'ว่าด้วย', 'เสรีภาพในการสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิในการรวมตัว', ' ', 'ILO98', ' ', 'ว่าด้วย', 'สิทธิในการรวมตัว', 'และ', 'การร่วมเจรจาต่อรอง']\n" ] } ], "source": [ "from pythainlp.corpus.common import thai_words\n", "from pythainlp.util import Trie\n", "\n", "text = \"ILO87 ว่าด้วยเสรีภาพในการสมาคมและการคุ้มครองสิทธิในการรวมตัว ILO98 ว่าด้วยสิทธิในการรวมตัวและการร่วมเจรจาต่อรอง\"\n", "\n", "print(\"default dictionary:\", word_tokenize(text))\n", "\n", "new_words = {\"ILO87\", \"ILO98\", \"การร่วมเจรจาต่อรอง\", \"สิทธิในการรวมตัว\", \"เสรีภาพในการสมาคม\", \"แรงงานสัมพันธ์\"}\n", "words = new_words.union(thai_words())\n", "\n", "custom_dictionary_trie = Trie(words)\n", "print(\"custom dictionary :\", word_tokenize(text, custom_dict=custom_dictionary_trie))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Testing different tokenization engines" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": {}, "colab_type": "code", "id": "4L2kRMY5Igqr" }, "outputs": [], "source": [ "speedtest_text = \"\"\"\n", "ครบรอบ 14 ปี ตากใบ เช้าวันนั้น 25 ต.ค. 2547 ผู้ชุมนุมชายกว่า 1,370 คน\n", "ถูกโยนขึ้นรถยีเอ็มซี 22 หรือ 24 คัน นอนซ้อนกันคันละ 4-5 ชั้น เดินทางจากสถานีตำรวจตากใบ ไปไกล 150 กิโลเมตร\n", "ไปถึงค่ายอิงคยุทธบริหาร ใช้เวลากว่า 6 ชั่วโมง / ในอีกคดีที่ญาติฟ้องร้องรัฐ คดีจบลงที่การประนีประนอมยอมความ\n", "กระทรวงกลาโหมจ่ายค่าสินไหมทดแทนรวม 42 ล้านบาทให้กับญาติผู้เสียหาย 79 ราย\n", "ปิดหีบและนับคะแนนเสร็จแล้ว ที่หน่วยเลือกตั้งที่ 32 เขต 13 แขวงหัวหมาก เขตบางกะปิ กรุงเทพมหานคร\n", "ผู้สมัคร ส.ส. 
และตัวแทนพรรคการเมืองจากหลายพรรคต่างมาเฝ้าสังเกตการนับคะแนนอย่างใกล้ชิด โดย\n", "ฐิติภัสร์ โชติเดชาชัยนันต์ จากพรรคพลังประชารัฐ และพริษฐ์ วัชรสินธุ จากพรรคประชาธิปัตย์ได้คะแนน\n", "96 คะแนนเท่ากัน\n", "เช้าวันอาทิตย์ที่ 21 เมษายน 2019 ซึ่งเป็นวันอีสเตอร์ วันสำคัญของชาวคริสต์\n", "เกิดเหตุระเบิดต่อเนื่องในโบสถ์คริสต์และโรงแรมอย่างน้อย 7 แห่งในประเทศศรีลังกา\n", "มีผู้เสียชีวิตแล้วอย่างน้อย 156 คน และบาดเจ็บหลายร้อยคน ยังไม่มีข้อมูลว่าผู้ก่อเหตุมาจากฝ่ายใด\n", "จีนกำหนดจัดการประชุมข้อริเริ่มสายแถบและเส้นทางในช่วงปลายสัปดาห์นี้ ปักกิ่งยืนยันว่า\n", "อภิมหาโครงการเชื่อมโลกของจีนไม่ใช่เครื่องมือแผ่อิทธิพล แต่ยินดีรับฟังข้อวิจารณ์ เช่น ประเด็นกับดักหนี้สิน\n", "และความไม่โปร่งใส รัฐบาลปักกิ่งบอกว่า เวทีประชุม Belt and Road Forum ในช่วงวันที่ 25-27 เมษายน\n", "ถือเป็นงานการทูตที่สำคัญที่สุดของจีนในปี 2019\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "qMF_0xyOIgqs", "outputId": "7b914ac8-456f-4af7-f62d-e4cdf61409aa" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 253 ms, sys: 2.27 ms, total: 256 ms\n", "Wall time: 255 ms\n" ] } ], "source": [ "# Speed test: Calling \"longest\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"longest\")" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "NlCSHylIIgqv", "outputId": "c270e307-6804-4dc6-93a4-5b64776d01e7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.4 ms, sys: 60 µs, total: 3.46 ms\n", "Wall time: 3.47 ms\n" ] } ], "source": [ "# Speed test: Calling \"newmm\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"newmm\")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "CPU times: user 4.08 ms, sys: 88 µs, total: 4.16 ms\n", "Wall time: 4.15 ms\n" ] } ], "source": [ "# Speed test: Calling \"newmm-safe\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"newmm-safe\")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 833 ms, sys: 174 ms, total: 1.01 s\n", "Wall time: 576 ms\n" ] } ], "source": [ "#!pip install attacut\n", "# Speed test: Calling \"attacut\" engine through word_tokenize wrapper\n", "%time tokens = word_tokenize(speedtest_text, engine=\"attacut\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get all possible segmentations" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 272 }, "colab_type": "code", "id": "qFTYqAB1Igq1", "outputId": "607ea178-c070-48c6-c12e-1a62c09f8847" }, "outputs": [ { "data": { "text/plain": [ "['มี|ความ|เป็น|ไป|ได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความ|เป็นไป|ได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความ|เป็นไปได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความเป็นไป|ได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความเป็นไปได้|อย่าง|ไร|บ้าง|',\n", " 'มี|ความ|เป็น|ไป|ได้|อย่างไร|บ้าง|',\n", " 'มี|ความ|เป็นไป|ได้|อย่างไร|บ้าง|',\n", " 'มี|ความ|เป็นไปได้|อย่างไร|บ้าง|',\n", " 'มี|ความเป็นไป|ได้|อย่างไร|บ้าง|',\n", " 'มี|ความเป็นไปได้|อย่างไร|บ้าง|',\n", " 'มี|ความ|เป็น|ไป|ได้|อย่างไรบ้าง|',\n", " 'มี|ความ|เป็นไป|ได้|อย่างไรบ้าง|',\n", " 'มี|ความ|เป็นไปได้|อย่างไรบ้าง|',\n", " 'มี|ความเป็นไป|ได้|อย่างไรบ้าง|',\n", " 'มี|ความเป็นไปได้|อย่างไรบ้าง|']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tokenize.multi_cut import find_all_segment, mmcut, segment\n", "\n", "find_all_segment(\"มีความเป็นไปได้อย่างไรบ้าง\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fWXiNUoBIgq4" }, "source": [ "### Subword, syllable, and Thai Character Cluster (TCC)\n", "\n", "Tokenization can also be done at subword level, either syllable or Thai Character Cluster (TCC).\n", "\n", "- Syllable segmentation uses [`ssg`](https://github.com/ponrawee/ssg), a CRF syllable segmenter for Thai by Ponrawee Prasertsom.\n", "- A TCC is smaller than a syllable. For information about TCC, see [Character Cluster Based Thai Information Retrieval](https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval) (Theeramunkong et al. 2004)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Subword tokenization\n", "The default subword tokenization engine is `tcc`, which uses Thai Character Clusters (TCC) as the subword unit." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "Z1wdMmdmIgq6", "outputId": "c74fa03a-a288-47b8-9c49-7ed42c3545c9" }, "outputs": [ { "data": { "text/plain": [ "['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp import subword_tokenize\n", "\n", "subword_tokenize(\"ประเทศไทย\") # default subword unit is TCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Syllable tokenization\n", "The default syllable tokenization engine is `dict`, which uses the `newmm` word tokenization engine with a custom dictionary containing known Thai syllables."
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['อับ',\n", " 'ดุล',\n", " 'เลาะ',\n", " ' ',\n", " 'อี',\n", " 'ซอ',\n", " 'มู',\n", " 'ซอ',\n", " ' ',\n", " 'สมอง',\n", " 'บวม',\n", " 'รุน',\n", " 'แรง']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tokenize import syllable_tokenize\n", "\n", "text = \"อับดุลเลาะ อีซอมูซอ สมองบวมรุนแรง\"\n", "\n", "syllable_tokenize(text) # default engine is \"dict\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The external [`ssg`](https://github.com/ponrawee/ssg) engine can also be called. Note that the `ssg` engine does not emit separate whitespace tokens; whitespace is attached to adjacent tokens." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['อับ', 'ดุล', 'เลาะ', ' อี', 'ซอ', 'มู', 'ซอ ', 'สมอง', 'บวม', 'รุน', 'แรง']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "syllable_tokenize(text, engine=\"ssg\") # use \"ssg\" for syllable" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "RfO1WbneIgq9" }, "source": [ "#### Low-level subword operations\n", "\n", "These low-level TCC operations can be useful for some pre-processing tasks, such as checking whether it is OK to cut a string at a certain point, or finding typos."
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "3Gyig20XIgq-", "outputId": "da9484b7-4dc4-4159-9b5f-df2771805ac9" }, "outputs": [ { "data": { "text/plain": [ "['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tokenize import tcc\n", "\n", "tcc.segment(\"ประเทศไทย\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "cF-zQJU1IgrA", "outputId": "fd966eb3-fd8b-46b6-ea61-93427431cdf9" }, "outputs": [ { "data": { "text/plain": [ "{1, 3, 5, 6, 8, 9}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tcc.tcc_pos(\"ประเทศไทย\") # return positions" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "aL2PiPUvIgrE", "outputId": "a04d64f4-b174-4337-8859-ecac7e660f29" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ป-ระ-เท-ศ-ไท-ย-" ] } ], "source": [ "for ch in tcc.tcc(\"ประเทศไทย\"): # TCC generator\n", " print(ch, end='-')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "jxvgbdlhIgrG" }, "source": [ "## Transliteration\n", "\n", "There are two types of transliteration here: romanization and transliteration.\n", "\n", "- Romanization will render Thai words in the Latin alphabet using the [Royal Thai General System of Transcription (RTGS)](https://en.wikipedia.org/wiki/Royal_Thai_General_System_of_Transcription).\n", " - Two engines are supported here: a simple `royin` engine (default) and a more accurate `thai2rom` engine.\n", "- Transliteration here, in PyThaiNLP context, means the sound representation of a string.\n", " - Two engines are supported here: `ipa` 
(International Phonetic Alphabet system, using [Epitran](https://github.com/dmort27/epitran)) (default) and `icu` (International Components for Unicode, using [PyICU](https://github.com/ovalhub/pyicu))." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "ujAsMHwyIgrH", "outputId": "4a14e9df-9699-4ede-848b-a280ac0ba5d1" }, "outputs": [ { "data": { "text/plain": [ "'maeo'" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.transliterate import romanize\n", "\n", "romanize(\"แมว\") # output: 'maeo'" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'phapn'" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "romanize(\"ภาพยนตร์\") # output: 'phapn' (*obviously wrong)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "LlDosHqXIgrJ", "outputId": "a7a88a3a-a04e-4e90-aef4-c6b357ab04c7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Update Corpus...\n", "Corpus: thai-g2p\n", "- Already up to date.\n" ] }, { "data": { "text/plain": [ "'m ɛː w ˧'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.transliterate import transliterate\n", "\n", "transliterate(\"แมว\") # output: 'mɛːw'" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'pʰ aː p̚ ˥˩ . pʰ a ˦˥ . 
j o n ˧'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transliterate(\"ภาพยนตร์\") # output: 'pʰaːpjanot'" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4UwQtF3oIgrM" }, "source": [ "## Normalization\n", "\n", "`normalize()` removes zero-width spaces (ZWSP and ZWNJ), duplicated spaces, repeating vowels, and dangling characters. It also reorders vowels and tone marks while removing repeating vowels." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "WXPq5bqfIgrN", "outputId": "42a7a5be-7d7d-4997-811c-424c79ce3169" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import normalize\n", "\n", "normalize(\"เเปลก\") == \"แปลก\" # เ เ ป ล ก vs แ ป ล ก" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The string below contains a non-standard order of Thai characters,\n", "Sara Aa (following vowel) + Mai Ek (upper tone mark).\n", "`normalize()` will reorder it to Mai Ek + Sara Aa." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เก่า'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"เกา่\"\n", "normalize(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be useful for string matching, including tokenization."
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tokenize immediately:\n", "['เก็บ', 'วัน', 'น้ี', ' ', 'พรุ่งน้ี', 'ก็', 'เกา', '่']\n", "\n", "normalize, then tokenize:\n", "['เก็บ', 'วันนี้', ' ', 'พรุ่งนี้', 'ก็', 'เก่า']\n" ] } ], "source": [ "from pythainlp import word_tokenize\n", "\n", "text = \"เก็บวันน้ี พรุ่งน้ีก็เกา่\"\n", "\n", "print(\"tokenize immediately:\")\n", "print(word_tokenize(text))\n", "print(\"\\nnormalize, then tokenize:\")\n", "print(word_tokenize(normalize(text)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The string below contains repeating vowels (multiple Sara A in a row).\n", "`normalize()` will keep only one of them. This can be used to reduce variations in spelling, which is useful for classification tasks." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เกะ'" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalize(\"เกะะะ\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally, `normalize()` is just a series of function calls like this:\n", "```\n", "text = remove_zw(text)\n", "text = remove_dup_spaces(text)\n", "text = remove_repeat_vowels(text)\n", "text = remove_dangling(text)\n", "```\n", "\n", "If you don't like the behavior of the default `normalize()`, you can call the functions shown above, as well as `remove_tonemark()` and `reorder_vowels()`, individually from `pythainlp.util` to customize your own normalization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Digit conversion\n", "\n", "Thai text sometimes uses Thai digits. This can reduce performance in classification and searching. PyThaiNLP provides a few utility functions to deal with this."
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ฉุกเฉินที่ยุโรปเรียก ๑๑๒ ๑๑๒'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import arabic_digit_to_thai_digit, thai_digit_to_arabic_digit, digit_to_text\n", "\n", "text = \"ฉุกเฉินที่ยุโรปเรียก 112 ๑๑๒\"\n", "\n", "arabic_digit_to_thai_digit(text)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ฉุกเฉินที่ยุโรปเรียก 112 112'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thai_digit_to_arabic_digit(text)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ฉุกเฉินที่ยุโรปเรียก หนึ่งหนึ่งสอง หนึ่งหนึ่งสอง'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "digit_to_text(text)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rlDji6ecIgrP" }, "source": [ "## Soundex\n", "\n", "\"Soundex is a phonetic algorithm for indexing names by sound.\" ([Wikipedia](https://en.wikipedia.org/wiki/Soundex)). PyThaiNLP provides three kinds of Thai soundex." 
] }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "I4JyUCRJIgrP", "outputId": "6af3c11c-3f9a-4154-b7f2-c899312846dc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n" ] } ], "source": [ "from pythainlp.soundex import lk82, metasound, udom83\n", "\n", "# check equivalence\n", "print(lk82(\"รถ\") == lk82(\"รด\"))\n", "print(udom83(\"วรร\") == udom83(\"วัน\"))\n", "print(metasound(\"นพ\") == metasound(\"นภ\"))" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "id": "XTznoTg5IgrS", "outputId": "8178cd7b-735d-4ccc-c36b-a8c67ea2ddb2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n", "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n", "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n", "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n", "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", " - lk82: - udom83: - metasound: \n" ] } ], "source": [ "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n", "for text in texts:\n", " print(\n", " \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n", " text, lk82(text), udom83(text), metasound(text)\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "spFQD8QsIgrT" }, "source": [ "## Spellchecking\n", "\n", "The default spellchecker uses [Peter Norvig's algorithm](http://www.norvig.com/spell-correct.html) together with word frequencies from the Thai National Corpus (TNC).\n", "\n", "`spell()` returns a list 
of all possible spellings." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "GAz0q6lWIgrU", "outputId": "73427202-cdfe-47d9-8925-9596baafd9d3" }, "outputs": [ { "data": { "text/plain": [ "['เหลียม', 'เหลือม']" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp import spell\n", "\n", "spell(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`correct()` returns the most likely spelling." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "I_fDSYEmIgrV", "outputId": "e9b6f2eb-37b6-4189-8cfd-1273caf48f38" }, "outputs": [ { "data": { "text/plain": [ "'เหลียม'" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp import correct\n", "\n", "correct(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "N-dOvexUIgrX" }, "source": [ "## Spellchecking - Custom dictionary and word frequency\n", "\n", "A custom dictionary can be provided when creating a spellchecker.\n", "\n", "When creating a `NorvigSpellChecker` object, you can pass a custom dictionary to the `custom_dict` parameter.\n", "\n", "`custom_dict` can be:\n", "- a dictionary (`dict`), with words (`str`) as keys and frequencies (`int`) as values; or\n", "- a list, a tuple, or a set of (word, frequency) tuples; or\n", "- a list, a tuple, or a set of just words, without their frequencies -- in this case a frequency of `1` will be assigned to every word."
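] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For illustration only, the hypothetical helper below (not part of PyThaiNLP) sketches how these three accepted `custom_dict` shapes all reduce to the same word-frequency mapping:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Hypothetical sketch, not PyThaiNLP API:\n",
"# reduce any accepted custom_dict shape to a {word: frequency} dict.\n",
"def to_word_freq(custom_dict):\n",
"    if isinstance(custom_dict, dict):\n",
"        return dict(custom_dict)\n",
"    items = list(custom_dict)\n",
"    if items and isinstance(items[0], (list, tuple)):\n",
"        return {word: freq for word, freq in items}\n",
"    return {word: 1 for word in items}  # bare words default to frequency 1\n",
"\n",
"to_word_freq([\"เหลียม\", \"เหลือม\"])  # {'เหลียม': 1, 'เหลือม': 1}"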
] }, { "cell_type": "code", "execution_count": 56, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "ixx-8YtfIgrY", "outputId": "0dbcf0dc-287e-4c00-f005-c371c81211cd" }, "outputs": [ { "data": { "text/plain": [ "['เหลือม', 'เหลียม']" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.spell import NorvigSpellChecker\n", "\n", "user_dict = [(\"เหลียม\", 50), (\"เหลือม\", 1000), (\"เหลียว\", 1000000)]\n", "checker = NorvigSpellChecker(custom_dict=user_dict)\n", "\n", "checker.spell(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As you can see, our version of `NorvigSpellChecker` gives edit distance priority over word frequency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use word frequencies from the Thai National Corpus and the Thai Textbook Corpus as well.\n", "\n", "By default, `NorvigSpellChecker` uses the Thai National Corpus."
] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['เหลือม']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.corpus import ttc # Thai Textbook Corpus\n", "\n", "checker = NorvigSpellChecker(custom_dict=ttc.word_freqs())\n", "\n", "checker.spell(\"เหลืยม\")" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'เหลือม'" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker.correct(\"เหลืยม\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check the current dictionary of a spellchecker:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "id": "H7TxgdwbIgra", "outputId": "3709f50f-3541-41d1-d8c6-7dfd090f3c1f" }, "outputs": [ { "data": { "text/plain": [ "[('พิธีเปิด', 18),\n", " ('ไส้กรอก', 40),\n", " ('ปลิง', 6),\n", " ('เต็ง', 13),\n", " ('ขอบคุณ', 356),\n", " ('ประสาน', 84),\n", " ('รำไร', 11),\n", " ('ร่วมท้อง', 4),\n", " ('ฝักมะขาม', 3)]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(checker.dictionary())[1:10]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "70sKEAlGIgrc" }, "source": [ "We can also apply conditions and a filter function to the dictionary when creating a spellchecker."
] }, { "cell_type": "code", "execution_count": 60, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "gT8G4cFzIgrc", "outputId": "ad4dd927-4c65-4164-a6cb-d123a08c9ee2" }, "outputs": [ { "data": { "text/plain": [ "39963" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker = NorvigSpellChecker() # use default filter (remove any word with number or non-Thai character)\n", "len(checker.dictionary())" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "w6qI7M92Igre", "outputId": "862b3111-e83d-4662-f643-e013e2fc8cd5" }, "outputs": [ { "data": { "text/plain": [ "30376" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker = NorvigSpellChecker(min_freq=5, min_len=2, max_len=15)\n", "len(checker.dictionary())" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "cTkFjK8IIgrh", "outputId": "7c2e3d09-aa49-4ee0-edfa-49fd4876a968" }, "outputs": [ { "data": { "text/plain": [ "66209" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checker_no_filter = NorvigSpellChecker(dict_filter=None) # use no filter\n", "len(checker_no_filter.dictionary())" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "70ZHCbBQIgrm", "outputId": "0dc68873-9e46-4578-bd22-aa3fc1cfb198" }, "outputs": [ { "data": { "text/plain": [ "66204" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def remove_yamok(word):\n", " return False if \"ๆ\" in word else True\n", "\n", "checker_custom_filter = NorvigSpellChecker(dict_filter=remove_yamok) # use custom 
filter\n", "len(checker_custom_filter.dictionary())" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1hoODyDrIgro" }, "source": [ "## Part-of-Speech Tagging" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "39JixRHsIgro", "outputId": "cff28d16-3fa7-4b66-df2e-beef601ec41d" }, "outputs": [ { "data": { "text/plain": [ "[('การ', 'FIXN'), ('เดินทาง', 'VACT')]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.tag import pos_tag, pos_tag_sents\n", "\n", "pos_tag([\"การ\",\"เดินทาง\"])" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 306 }, "colab_type": "code", "id": "qrSDelkrIgrq", "outputId": "8cce2c89-7599-4020-b5a6-771c0fa0c005" }, "outputs": [ { "data": { "text/plain": [ "[[('ประกาศสำนักนายกฯ', 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('ให้', 'JSBR'),\n", " (' ', 'PUNC'),\n", " (\"'พล.ท.สรรเสริญ แก้วกำเนิด'\", 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('พ้นจากตำแหน่ง', 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('ผู้ทรงคุณวุฒิพิเศษ', 'NCMN'),\n", " ('กองทัพบก', 'NCMN'),\n", " (' ', 'PUNC'),\n", " ('กระทรวงกลาโหม', 'NCMN')],\n", " [('และ', 'JCRG'),\n", " ('แต่งตั้ง', 'VACT'),\n", " ('ให้', 'JSBR'),\n", " ('เป็น', 'VSTA'),\n", " (\"'อธิบดีกรมประชาสัมพันธ์'\", 'NCMN')]]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sents = [[\"ประกาศสำนักนายกฯ\", \" \", \"ให้\",\n", " \" \", \"'พล.ท.สรรเสริญ แก้วกำเนิด'\", \" \", \"พ้นจากตำแหน่ง\",\n", " \" \", \"ผู้ทรงคุณวุฒิพิเศษ\", \"กองทัพบก\", \" \", \"กระทรวงกลาโหม\"],\n", " [\"และ\", \"แต่งตั้ง\", \"ให้\", \"เป็น\", \"'อธิบดีกรมประชาสัมพันธ์'\"]]\n", "\n", "pos_tag_sents(sents)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f6ShDKpHIgrs" }, "source": [ "## Named-Entity Tagging\n", "\n", "The tagger 
uses the BIO scheme:\n", "- B - beginning of entity\n", "- I - inside entity\n", "- O - outside entity" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 561 }, "colab_type": "code", "id": "TVso09S7Igrv", "outputId": "f801ac2c-d013-4243-ba0e-b8f88bc69efd" }, "outputs": [ { "data": { "text/plain": [ "[('24', 'NUM', 'B-DATE'),\n", " (' ', 'PUNCT', 'I-DATE'),\n", " ('มิ.ย.', 'NOUN', 'I-DATE'),\n", " (' ', 'PUNCT', 'O'),\n", " ('2563', 'NUM', 'O'),\n", " (' ', 'PUNCT', 'O'),\n", " ('ทดสอบระบบ', 'PART', 'O'),\n", " ('เวลา', 'NOUN', 'O'),\n", " (' ', 'PUNCT', 'O'),\n", " ('6:00', 'NUM', 'B-TIME'),\n", " (' ', 'PUNCT', 'I-TIME'),\n", " ('น.', 'NOUN', 'I-TIME'),\n", " (' ', 'PUNCT', 'O'),\n", " ('เดินทาง', 'VERB', 'O'),\n", " ('จาก', 'ADP', 'O'),\n", " ('ขนส่ง', 'NOUN', 'B-ORGANIZATION'),\n", " ('กรุงเทพ', 'NOUN', 'I-ORGANIZATION'),\n", " ('ใกล้', 'ADJ', 'O'),\n", " ('ถนน', 'NOUN', 'B-LOCATION'),\n", " ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),\n", " (' ', 'PUNCT', 'O'),\n", " ('ไป', 'AUX', 'O'),\n", " ('จังหวัด', 'VERB', 'B-LOCATION'),\n", " ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),\n", " (' ', 'PUNCT', 'O'),\n", " ('ตั๋ว', 'NOUN', 'O'),\n", " ('ราคา', 'NOUN', 'O'),\n", " (' ', 'PUNCT', 'O'),\n", " ('297', 'NUM', 'B-MONEY'),\n", " (' ', 'PUNCT', 'I-MONEY'),\n", " ('บาท', 'NOUN', 'I-MONEY')]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#!pip3 install pythainlp[ner]\n", "from pythainlp.tag.thainer import ThaiNameTagger\n", "\n", "ner = ThaiNameTagger()\n", "ner.get_ner(\"24 มิ.ย. 2563 ทดสอบระบบเวลา 6:00 น. 
เดินทางจากขนส่งกรุงเทพใกล้ถนนกำแพงเพชร ไปจังหวัดกำแพงเพชร ตั๋วราคา 297 บาท\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "6cF88wN2Igry" }, "source": [ "## Word Vector" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 224 }, "colab_type": "code", "id": "GshCfJiBIgrz", "outputId": "921340b3-4962-41a5-f550-f463360fb3b8" }, "outputs": [ { "data": { "text/plain": [ "0.2504981" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pythainlp.word_vector\n", "\n", "pythainlp.word_vector.similarity(\"คน\", \"มนุษย์\")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 122 }, "colab_type": "code", "id": "qJP9As-_Igr0", "outputId": "7f528d29-0edf-4b3c-9c31-138d7b85e83a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/gensim/models/keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. 
Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n", " vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n" ] }, { "data": { "text/plain": [ "'ไก่'" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pythainlp.word_vector.doesnt_match([\"คน\", \"มนุษย์\", \"บุคคล\", \"เจ้าหน้าที่\", \"ไก่\"])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iS7iwPoiIgr3" }, "source": [ "## Number Spell Out" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "F9PEEvWLIgr4", "outputId": "a5782efd-aceb-4c5e-d746-df69ba9cad8d" }, "outputs": [ { "data": { "text/plain": [ "'หนึ่งล้านสองแสนสามหมื่นสี่พันห้าร้อยหกสิบเจ็ดล้านแปดแสนเก้าหมื่นหนึ่งร้อยยี่สิบสามบาทสี่สิบห้าสตางค์'" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pythainlp.util import bahttext\n", "\n", "bahttext(1234567890123.45)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`bahttext()` will round the satang part" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "Y6DLJYOEIgr7", "outputId": "eac48468-ab8b-4e67-acad-5d6560a18979" }, "outputs": [ { "data": { "text/plain": [ "'หนึ่งบาทเก้าสิบเอ็ดสตางค์'" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bahttext(1.909)" ] } ], "metadata": { "colab": { "name": "pythainlp-get-started.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3.8.13 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "a1d6ff38954a1cdba4cf61ffa51e42f4658fc35985cd256cd89123cae8466a39" } } }, "nbformat": 4, "nbformat_minor": 1 }