{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Understanding Sentence Structure**\n", "## **1 Syntactic Parsing**\n", "- **[Stanford University parser module](https://github.com/smilli/py-corenlp)**\n", "- ! pip install **pycorenlp**\n", "- Uses NLTK's **CFG parsing** module by default\n", "- The **pycorenlp (Stanford CoreNLP)** module raises a Java connection error" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--------Parsing result as per defined grammar-------\n", "(S (NP I) (VP (V shot) (NP (Det an) (N elephant))))\n" ] } ], "source": [ "import nltk\n", "from nltk import CFG\n", "from nltk.tree import Tree\n", "\n", "# Part 1: Print the parse produced by NLTK's built-in CFG chart parser\n", "def define_grammar_parse_result():\n", "    Grammar = nltk.CFG.fromstring(\"\"\"\n", "    S -> NP VP\n", "    PP -> P NP\n", "    NP -> Det N | Det N PP | 'I'\n", "    VP -> V NP | VP PP\n", "    Det -> 'an' | 'my'\n", "    N -> 'elephant' | 'pajamas'\n", "    V -> 'shot'\n", "    P -> 'in'\n", "    \"\"\")\n", "    sent = \"I shot an elephant\".split()\n", "    parser = nltk.ChartParser(Grammar)\n", "    trees = parser.parse(sent)\n", "    for tree in trees:\n", "        print(tree)\n", "\n", "print(\"\\n--------Parsing result as per defined grammar-------\")\n", "define_grammar_parse_result()" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--------Drawing Parse Tree-------\n", "(s (dp (d the) (np dog)) (vp (v chased) (dp (d the) (np cat))))\n", "\\Tree [.s\n", " [.dp [.d the ] [.np dog ] ]\n", " [.vp [.v chased ] [.dp [.d the ] [.np cat ] ] ] ]\n", " s \n", " ________|_____ \n", " | vp \n", " | _____|___ \n", " dp | dp \n", " ___|___ | ___|___ \n", " d np v d np\n", " | | | | | \n", "the dog chased the cat\n", "\n" ] } ], "source": [ "# Part 2: Drawing a parse tree with the Tree class\n", "def draw_parser_tree():\n", "    dp1 = Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])])\n", "    dp2 = Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])\n", "    vp = Tree('vp', [Tree('v', ['chased']), dp2])\n", "    tree = Tree('s', [dp1, vp])\n", "    print(tree)\n", "    print(tree.pformat_latex_qtree())\n", "    tree.pretty_print()\n", "\n", "print(\"\\n--------Drawing Parse Tree-------\")\n", "draw_parser_tree()" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# from pycorenlp import StanfordCoreNLP\n", "# from collections import defaultdict\n", "\n", "# Part 3: Visualization using \"pycorenlp\"\n", "# After installing pycorenlp, stanford-corenlp-full-* must also be downloaded from the Stanford CoreNLP website\n", "# def stanford_parsing_result():\n", "#     text = \"\"\" I shot an elephant. The dog chased the cat. School go to boy. \"\"\"\n", "#     nlp = StanfordCoreNLP('http://localhost:9000')\n", "#     res = nlp.annotate(text, properties={\n", "#         'annotators': 'tokenize,ssplit,pos,depparse,parse',\n", "#         'outputFormat': 'json'\n", "#     })\n", "#     print(res['sentences'][0]['parse'])\n", "#     print(res['sentences'][2]['parse'])\n", "\n", "# Raises an error: Exception: Check whether you have started the CoreNLP server e.g. ...\n", "# print(\"\\n--------Stanford Parser result------\")\n", "# stanford_parsing_result()" ] },
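{ "cell_type": "markdown", "metadata": {}, "source": [ "The grammar above also covers the classic ambiguous sentence *I shot an elephant in my pajamas*. The cell below is an added sketch (it is not part of the original notebook, and the variable names are only illustrative): it reuses the same CFG and lists the noun phrases of every parse the chart parser finds, which makes the PP-attachment ambiguity visible." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: enumerate parses of an ambiguous sentence and list their NPs\n", "import nltk\n", "\n", "grammar = nltk.CFG.fromstring(\"\"\"\n", "    S -> NP VP\n", "    PP -> P NP\n", "    NP -> Det N | Det N PP | 'I'\n", "    VP -> V NP | VP PP\n", "    Det -> 'an' | 'my'\n", "    N -> 'elephant' | 'pajamas'\n", "    V -> 'shot'\n", "    P -> 'in'\n", "\"\"\")\n", "parser = nltk.ChartParser(grammar)\n", "sent = \"I shot an elephant in my pajamas\".split()\n", "for tree in parser.parse(sent):\n", "    # subtrees() walks every node; the filter keeps only noun phrases\n", "    nps = [' '.join(sub.leaves()) for sub in tree.subtrees(lambda t: t.label() == 'NP')]\n", "    print(nps)" ] },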
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **2 spaCy NER Example**\n", "Implementing NER (named-entity recognition) with spaCy\n", "- **[spaCy](https://spacy.io/usage/models)** also ships a Korean (ko) model, but documentation and examples for it are still scarce.\n", "- It distinguishes classes such as **PERSON** (people), **NORP** (nationalities, religious or political groups), **FACILITY** (buildings, airports, etc.), **ORG** (companies, institutions), **GPE** (cities, countries), **LOC** (non-GPE locations such as mountain ranges and bodies of water), **PRODUCT** (objects, vehicles, foods), **EVENT** (named events), **WORK_OF_ART** (books, songs, etc.), **LANGUAGE** (named languages), plus **dates, times, percentages, monetary values, quantities, ordinals and cardinals**.\n", "1. ! pip install spacy\n", "1. ! python -m spacy download en # [how to install the model](https://stackoverflow.com/questions/49964028/spacy-oserror-cant-find-model-en)" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------Example 1 ------\n", "GPE : London\n", "GPE : the United Kingdom\n" ] } ], "source": [ "import spacy\n", "nlp = spacy.load('en')\n", "doc = nlp(u'London is a big city in the United Kingdom.')\n", "print(\"-------Example 1 ------\")\n", "for ent in doc.ents:\n", "    print(\"{} : {}\".format(ent.label_, ent.text))" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------Example 2 ------\n", "GPE : France\n", "PERSON : Christine Lagarde\n", "TIME : 5:00 P.M.\n", "ORG : the Wall Street Journal\n" ] } ], "source": [ "doc1 = nlp(u'While in France, Christine Lagarde discussed short-term stimulus efforts in a '\n", "           u'recent interview on 5:00 P.M. with the Wall Street Journal')\n", "print(\"-------Example 2 ------\")\n", "for ent in doc1.ents:\n", "    print(\"{} : {}\".format(ent.label_, ent.text))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# **Bag of Words**\n", "Building a **Bag of Words** with **sklearn**" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 0, 1, 1, 1, 0],\n", " [1, 1, 0, 1, 1, 1, 0, 1]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# character bigrams (including word boundaries) as features\n", "ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)\n", "counts = ngram_vectorizer.fit_transform(['words', 'wprds'])\n", "# sanity check of the learned bigram vocabulary\n", "ngram_vectorizer.get_feature_names() == [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']\n", "counts.toarray().astype(int)" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',\n", " dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(2, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ngram_vectorizer" ] },
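{ "cell_type": "markdown", "metadata": {}, "source": [ "For comparison with the character-bigram features above, the cell below is an added sketch (not part of the original notebook; the two sample sentences are made up) showing the more common word-level bag of words with CountVectorizer's default `analyzer='word'`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: word-level bag of words with CountVectorizer's default analyzer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "docs = ['the dog chased the cat', 'the cat chased the mouse']  # illustrative sentences\n", "vectorizer = CountVectorizer()         # analyzer='word' is the default\n", "bow = vectorizer.fit_transform(docs)   # sparse document-term count matrix\n", "print(sorted(vectorizer.vocabulary_))  # vocabulary learned from the two sentences\n", "print(bow.toarray())                   # one row per document, one column per word" ] },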
{ "cell_type": "markdown", "metadata": {}, "source": [ "# **Measuring Tf-IDF**\n", "## **1 Tf-IDF with textblob**\n", "! pip install textblob" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[TextBlob(\"tf idf, short form of term frequency, inverse document frequency\"),\n", " TextBlob(\"is a numerical statistic that is intended to reflect how important\"),\n", " TextBlob(\"a word is to a document in a collection or corpus\")]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'tf idf, short form of term frequency, inverse document frequency'\n", "text2 = 'is a numerical statistic that is intended to reflect how important'\n", "text3 = 'a word is to a document in a collection or corpus'\n", "\n", "from textblob import TextBlob\n", "blob = TextBlob(text)\n", "blob2 = TextBlob(text2)\n", "blob3 = TextBlob(text3)\n", "bloblist = [blob, blob2, blob3]\n", "bloblist" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf : 0.1\n", "idf : 0.40546\n", "tf*idf : 0.04054\n" ] } ], "source": [ "import math\n", "def tf(word, blob):\n", "    return blob.words.count(word) / len(blob.words)\n", "\n", "def idf(word, bloblist):\n", "    # number of documents containing the word, with add-1 smoothing so it is never 0\n", "    x = 1 + sum(1 for blob in bloblist if word in blob)\n", "    return math.log(len(bloblist) / (x if x else 1))\n", "\n", "def tfidf(word, blob, bloblist):\n", "    return tf(word, blob) * idf(word, bloblist)\n", "\n", "tf_score = tf('short', blob)\n", "idf_score = idf('short', bloblist)\n", "tfidf_score = tfidf('short', blob, bloblist)\n", "print(\"\"\"tf : {:.7}\\nidf : {:.7}\\ntf*idf : {:.7}\"\"\".format(\n", "    str(tf_score), str(idf_score), str(tfidf_score)))" ] },
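{ "cell_type": "markdown", "metadata": {}, "source": [ "In formula form, the cell above computes (the symbols $N$, $n_w$ and $|d|$ are just shorthand added here for quantities in the code):\n", "\n", "$$\\mathrm{tf}(w, d) = \\frac{\\text{count}(w, d)}{|d|}, \\qquad \\mathrm{idf}(w) = \\log\\frac{N}{1 + n_w}, \\qquad \\mathrm{tfidf}(w, d) = \\mathrm{tf}(w, d)\\,\\mathrm{idf}(w)$$\n", "\n", "where $N$ is the number of documents (blobs), $n_w$ is the number of documents containing $w$, $|d|$ is the number of words in document $d$, and the $1+$ term is the add-1 smoothing in `idf`. For `'short'`: $\\mathrm{tf} = 1/10 = 0.1$ and $\\mathrm{idf} = \\log(3/2) \\approx 0.4055$, so $\\mathrm{tfidf} \\approx 0.0405$, matching the printed output." ] },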
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Tf-IDF with scikit-learn**\n", "1. ! pip install scikit-learn\n", "1. https://stackoverflow.com/questions/23175809/str-translate-gives-typeerror-translate-takes-one-argument-2-given-worked\n", "    - text.**translate**(**str**.maketrans('', '', string.punctuation)) strips punctuation in Python 3" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('and', 48), ('the', 33), ('to', 29), ('i', 26), ('of', 25)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load a single document and build its tokens\n", "import nltk, string, os\n", "from collections import Counter\n", "\n", "# Preprocessing: lowercasing, punctuation removal, etc.\n", "def get_tokens(file):\n", "    with open(file, 'r') as shakes:\n", "        text = shakes.read()\n", "\n", "    lowers = text.lower()  # lowercase\n", "    no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))  # remove punctuation\n", "    tokens = no_punctuation.split(' ')  # split the text into tokens\n", "    return tokens\n", "\n", "tokens = get_tokens('./data/shakes1.txt')\n", "count = Counter(tokens)\n", "count.most_common(5)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('thy', 11), ('go', 7), ('love', 7), ('would', 5), ('thou', 5)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Token normalization using NLTK's stopword list\n", "from nltk.corpus import stopwords\n", "\n", "filtered = [w for w in tokens if w not in stopwords.words('english')]\n", "count = Counter(filtered)\n", "count.most_common(5)" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('thi', 11), ('go', 7), ('love', 7), ('would', 5), ('natur', 5)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Stemming with the PorterStemmer\n", "from nltk.stem.porter import PorterStemmer\n", "\n", "def stem_tokens(tokens, stemmer):\n", "    stemmed = []\n", "    for item in tokens:\n", "        stemmed.append(stemmer.stem(item))\n", "    return stemmed\n", "\n", "stemmer = PorterStemmer()\n", "stemmed = stem_tokens(filtered, stemmer)\n", "count = Counter(stemmed)\n", "count.most_common(5)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from glob import glob\n", "fileList = glob(\"./data/shake*.txt\")\n", "token_dict = {}\n", "stemmer = PorterStemmer()\n", "\n", "def stem_tokens(tokens, stemmer):\n", "    stemmed = []\n", "    for item in tokens:\n", "        stemmed.append(stemmer.stem(item))\n", "    return stemmed\n", "\n", "def tokenize(text):\n", "    tokens = nltk.word_tokenize(text)\n", "    stems = stem_tokens(tokens, stemmer)\n", "    return stems\n", "\n", "for file in fileList:\n", "    shakes = open(file, 'r')\n", "    text = shakes.read()\n", "    lowers = text.lower()\n", "    no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))  # remove punctuation\n", "    token_dict[file] = no_punctuation\n", "# token_dict" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/nltk/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'formerli', 'forti', 'ha', 'henc', 'hereaft', 'herebi', 'hi', 'howev', 'hundr', 'inde', 'latterli', 'mani', 'meanwhil', 'moreov', 'mostli', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'seriou', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'thi', 'thu', 'togeth', 'twelv', 'twenti', 'veri', 'wa', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.\n", "  'stop_words.' % sorted(inconsistent))\n" ] }, { "data": { "text/plain": [ "array([0.34618161, 0.66338461, 0.66338461])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this can take some time\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')\n", "tfs = tfidf.fit_transform(token_dict.values())\n", "sample_str = 'this sentence has unseen text such as computer but also king lord juliet'\n", "response = tfidf.transform([sample_str])\n", "response.data" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "thi - 0.34618161159873423\n", "lord - 0.6633846138519129\n", "king - 0.6633846138519129\n" ] } ], "source": [ "# map the nonzero columns of the transformed sample back to feature names\n", "feature_names = tfidf.get_feature_names()\n", "for col in response.nonzero()[1]:\n", "    print(feature_names[col], ' - ', response[0, col])" ] },
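{ "cell_type": "markdown", "metadata": {}, "source": [ "The fitted matrix `tfs` can be inspected the same way as `response`. The cell below is an added sketch (not part of the original notebook; it reuses `tfs` and `feature_names` from the cells above, and its output depends on the local `./data/shake*.txt` files, so none is shown): it lists the highest-weighted terms of the first Shakespeare document." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: top tf-idf terms of the first fitted document\n", "import numpy as np\n", "\n", "row = tfs[0].toarray().ravel()   # dense tf-idf weights of document 0\n", "top = np.argsort(row)[::-1][:5]  # column indices of the 5 largest weights\n", "for col in top:\n", "    print(feature_names[col], '-', row[col])  # term and its tf-idf weight" ] },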
{ "cell_type": "markdown", "metadata": {}, "source": [ "# **Encoding and Normalization**\n", "## **1 One-Hot Encoding**\n", "Encoding categorical columns with **pandas** and **scikit-learn**" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " name age-group\n", "0 rick young\n", "1 phil old" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.feature_extraction import DictVectorizer\n", "df = pd.DataFrame([['rick','young'],['phil','old']], columns=['name','age-group'])\n", "df" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " name_phil name_rick age-group_old age-group_young\n", "0 0 1 0 1\n", "1 1 0 1 0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Encoding with the pandas module\n", "pd.get_dummies(df)" ] },
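{ "cell_type": "markdown", "metadata": {}, "source": [ "`get_dummies` can also be restricted to specific columns. The cell below is an added sketch (not part of the original notebook) that reuses the `df` defined above and one-hot encodes only `age-group`, leaving `name` untouched." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: encode only the selected column(s); 'name' stays as a plain column\n", "pd.get_dummies(df, columns=['age-group'])" ] },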
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'country=US': 2, 'country=CAN': 0, 'country=MEX': 1},\n", " '\\n',\n", " array([[0., 0., 1.],\n", " [1., 0., 0.],\n", " [0., 0., 1.],\n", " [1., 0., 0.],\n", " [0., 1., 0.],\n", " [0., 0., 1.]]))" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One-hot encoding of a categorical column with scikit-learn's DictVectorizer\n", "X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],\n", "                  'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],\n", "                  'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']})\n", "\n", "v = DictVectorizer()\n", "qualitative_features = ['country']\n", "X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))\n", "v.vocabulary_, \"\\n\", X_qual.toarray()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Concept Text Normalization**\n", "**Concept text normalization**: converting text into a single standard form\n", "1. **Stemming** is itself one such normalization method\n", "1. **Min-max scaling** is a useful normalization when building a **language model** for NLP analysis\n", "\n", "## Indexing: use categorical data as index numbers\n", "## Ranking: reorder the data so the most relevant items come first and the least relevant last" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 4 }