{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Wordnet.ipynb", "version": "0.3.2", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "cells": [ { "metadata": { "colab_type": "text", "id": "xVExYqDizWSe" }, "cell_type": "markdown", "source": [ "# **통계를 활용한 자연어 분석**\n", "## **1 WordNet 을 활용한 문장분석 및 의미구분**\n", "문법적 구조를 활용한 분석" ] }, { "metadata": { "id": "DrEfvaFP3rCr", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 454 }, "outputId": "90ecec73-0030-4af5-c8b1-b4ccf6f1dfa1" }, "cell_type": "code", "source": [ "! pip3 install nltk pywsd symspellpy\n", "\n", "import nltk\n", "nltk.download('treebank')\n", "nltk.download('wordnet')\n", "nltk.download('stopwords')\n", "nltk.download('averaged_perceptron_tagger')\n", "nltk.download('punkt')" ], "execution_count": 7, "outputs": [ { "output_type": "stream", "text": [ "Requirement already satisfied: nltk in /usr/local/lib/python3.6/dist-packages (3.2.5)\n", "Requirement already satisfied: pywsd in /usr/local/lib/python3.6/dist-packages (1.1.7)\n", "Collecting symspellpy\n", " Downloading https://files.pythonhosted.org/packages/4c/d5/9cf41f05a30f205c00489e3d37639c348349ba6f8d0e1005f26dc9a9ac60/symspellpy-6.3.8-py3-none-any.whl\n", "Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from nltk) (1.11.0)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from pywsd) (0.22.0)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pywsd) (1.14.6)\n", "Requirement already satisfied: python-dateutil>=2 in /usr/local/lib/python3.6/dist-packages (from pandas->pywsd) (2.5.3)\n", "Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas->pywsd) (2018.9)\n", "Installing collected packages: symspellpy\n", "Successfully installed symspellpy-6.3.8\n", "[nltk_data] Downloading package treebank to /root/nltk_data...\n", "[nltk_data] Package treebank is already up-to-date!\n", "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", "[nltk_data] Package wordnet is already up-to-date!\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n", "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /root/nltk_data...\n", "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", "[nltk_data] date!\n", "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "metadata": { "colab_type": "code", "id": "sgj4wE8NzWS1", "colab": { "base_uri": "https://localhost:8080/", "height": 454 }, "outputId": "ce8e835f-a71a-4fed-88e4-42aea8d9b323" }, "cell_type": "code", "source": [ "# 촘스키 CGF 문법규칙\n", "from nltk.grammar import toy_pcfg2\n", "grammar = toy_pcfg2\n", "print(grammar)" ], "execution_count": 2, "outputs": [ { "output_type": "stream", "text": [ "Grammar with 23 productions (start state = S)\n", " S -> NP VP [1.0]\n", " VP -> V NP [0.59]\n", " VP -> V [0.4]\n", " VP -> VP PP [0.01]\n", " NP -> Det N [0.41]\n", " NP -> Name [0.28]\n", " NP -> NP PP [0.31]\n", " PP -> P NP [1.0]\n", " V -> 'saw' [0.21]\n", " V -> 'ate' [0.51]\n", " V -> 'ran' [0.28]\n", " N -> 'boy' [0.11]\n", " 
{ "metadata": { "colab_type": "code", "id": "XO64HgtKzWTC", "colab": { "base_uri": "https://localhost:8080/", "height": 219 }, "outputId": "ec40aa03-5eaf-4dd7-9c24-a3edf5044b45" }, "cell_type": "code", "source": [ "# Semantic relation analysis using WordNet data\n", "from nltk.corpus import wordnet as wn\n", "word = 'bank'\n", "\n", "print(\"Synset list : \\n{}\\n\\nbank.n.02 definition and examples : \\n{}\\n{}\\n\\ntrust.v.01 definition and examples :\\n{}\\n{}\".format(\n", "    wn.synsets(word),\n", "    wn.synset('bank.n.02').definition(), wn.synset('bank.n.02').examples(),\n", "    wn.synset('trust.v.01').definition(), wn.synset('trust.v.01').examples()))" ], "execution_count": 3, "outputs": [ { "output_type": "stream", "text": [ "Synset list : \n", "[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]\n", "\n", "bank.n.02 definition and examples : \n", "a financial institution that accepts deposits and channels the money into lending activities\n", "['he cashed a check at the bank', 'that bank holds the mortgage on my home']\n", "\n", "trust.v.01 definition and examples :\n", "have confidence or faith in\n", "['We can trust in God', 'Rely on your friends', 'bank on your good education', \"I swear by my grandmother's recipes\"]\n" ], "name": "stdout" } ] }, { "metadata": { "id": "JU9MW1om4m6l", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## **2 Per-Sentence Word Sense Disambiguation with the pywsd Module**\n", "Results are produced using statistical methods" ] }, { "metadata": { "colab_type": "code", "id": "yxd_6tpQzWTr", "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "outputId": "dafad391-4eb1-4c64-fe16-190e72d1e16f" }, "cell_type": "code", "source": [ "# Word sense disambiguation within a sentence using pywsd\n", "sent = 'I went to the bank to deposit my money'\n", "# sent = 'I bank my money'\n", "ambiguous = 'bank'\n", "\n", "from pywsd.lesk import simple_lesk\n", "answer = simple_lesk(sent, ambiguous)\n", "answer.definition()" ], "execution_count": 4, "outputs": [ { "output_type": "stream", "text": [ "Warming up PyWSD (takes ~10 secs)... took 4.01186728477478 secs.\n" ], "name": "stderr" }, { "output_type": "execute_result", "data": { "text/plain": [ "'a financial institution that accepts deposits and channels the money into lending activities'" ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] },
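{ "metadata": {}, "cell_type": "markdown", "source": [ "A follow-up sketch, added for illustration: the `Synset` that `simple_lesk` returns supports the usual WordNet relation lookups, so we can inspect its hypernyms and compare it with the river-bank sense. The last line also tries the verb reading of the commented-out sentence above, assuming your pywsd version accepts the optional `pos` keyword." ] }, { "metadata": {}, "cell_type": "code", "source": [ "# Added sketch: explore the disambiguated synset with WordNet relations.\n", "from nltk.corpus import wordnet as wn\n", "from pywsd.lesk import simple_lesk\n", "\n", "answer = simple_lesk('I went to the bank to deposit my money', 'bank')\n", "print(answer.hypernyms())                              # more general concepts\n", "print(answer.path_similarity(wn.synset('bank.n.01')))  # vs. the river-bank sense\n", "\n", "# pos filter restricts candidates to verbs (keyword assumed available in pywsd)\n", "print(simple_lesk('I bank my money', 'bank', pos='v').definition())" ], "execution_count": null, "outputs": [] },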
{ "metadata": { "id": "fqo3OEu_4xUP", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## **3 Finding Typos with the Symspellpy Module**\n", "Restores word boundaries using per-word frequency information" ] }, { "metadata": { "id": "9dGoblFW3rDB", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "outputId": "c6972e2e-b1de-481f-fedd-f1a8de099707" }, "cell_type": "code", "source": [ "from symspellpy.symspellpy import SymSpell\n", "\n", "def spellCheck(dict_file, sentence):\n", "    # edit distance 0 keeps every word intact; segmentation only inserts spaces\n", "    max_edit_dist, prefix_length = 0, 7\n", "    term_index, count_index = 0, 1\n", "\n", "    sym_spell = SymSpell(max_edit_dist, prefix_length)\n", "    if not sym_spell.load_dictionary(dict_file, term_index, count_index):\n", "        print(\"Failed to load the dictionary file\"); return\n", "\n", "    result = sym_spell.word_segmentation(sentence)\n", "    print(\"Corrected text: {}\\nTotal edit distance: {}\".format(\n", "        result.corrected_string,\n", "        result.distance_sum))\n", "\n", "text = \"thequickbrownfoxjumpsoverthelazydog\"\n", "spellCheck(\"frequency_dictionary_en_82_765.txt\", sentence=text)" ], "execution_count": 8, "outputs": [ { "output_type": "stream", "text": [ "Corrected text: the quick brown fox jumps over the lazy dog\n", "Total edit distance: 8\n" ], "name": "stdout" } ] },
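{ "metadata": {}, "cell_type": "markdown", "source": [ "Beyond segmentation, SymSpell is mostly used for spelling correction. The cell below is an added sketch: `lookup` returns ranked candidates for a single misspelled term, and `lookup_compound` corrects a run-together, misspelled string (the input comes from the SymSpell documentation). Note that the `SymSpell` constructor arguments vary between symspellpy releases; the call here assumes the 6.3.x signature `(initial_capacity, max_dictionary_edit_distance, prefix_length)`." ] }, { "metadata": {}, "cell_type": "code", "source": [ "# Added sketch: spelling correction with symspellpy's lookup APIs.\n", "# Assumes the 6.3.x constructor; later releases drop initial_capacity.\n", "from symspellpy.symspellpy import SymSpell, Verbosity\n", "\n", "sym_spell = SymSpell(83000, 2, 7)  # capacity, max edit distance, prefix length\n", "sym_spell.load_dictionary(\"frequency_dictionary_en_82_765.txt\", 0, 1)\n", "\n", "# single-term correction: candidates sorted by edit distance, then frequency\n", "for suggestion in sym_spell.lookup(\"memebers\", Verbosity.CLOSEST, 2):\n", "    print(suggestion.term, suggestion.distance, suggestion.count)\n", "\n", "# multi-word correction on a run-together, misspelled string\n", "result = sym_spell.lookup_compound(\"whereis th elove\", 2)\n", "print(result[0].term)" ], "execution_count": null, "outputs": [] } ] }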