{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from numpy.random import randint, seed\n", "from sklearn.feature_extraction.text import CountVectorizer #n-Gram 만들어줌" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. n-Gram based autofill:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#train을 위한 텍스트 데이터\n", "my_text = \"\"\"Machine learning is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as \"training data\", in order to make predictions or decisions without being explicitly programmed to perform the task.[1][2]:2 Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning In its application across business problems, machine learning is also referred to as predictive analytics.\"\"\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "my_text = [my_text.lower()] # 소문자로 변환 => Required by the CountVectorizer()." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['machine learning is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. it is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model of sample data, known as \"training data\", in order to make predictions or decisions without being explicitly programmed to perform the task.[1][2]:2 machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. machine learning is closely related to computational statistics, which focuses on making predictions using computers. the study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning in its application across business problems, machine learning is also referred to as predictive analytics.']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. n-Gram trial run:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "n = 3 \n", "n_min = n #n이 3이면 3까지만 가져오도록 함\n", "n_max = n\n", "n_gram_type = 'word' \n", "vectorizer = CountVectorizer(ngram_range=(n_min,n_max), analyzer = n_gram_type) #CountVectorizer: ngram의 range를 보냄" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "n_grams = vectorizer.fit(my_text).get_feature_names() #get_feature_names: 리스트로 ngram 가져옴 \n", "n_gram_cts = vectorizer.transform(my_text).toarray() #출력물은 배열의 배열\n", "#ngram에 해당하는 횟수\n", "#transform: 도수분포표 \n", "#toarray를 안하면 도수분포표가 듬성듬성한 행렬로 만들어짐 \n", "\n", "n_gram_cts = list(n_gram_cts[0]) #첫 번째 원소 가져와야 제대로 된 도수분포표 나옴 " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('across business problems', 1),\n", " ('algorithm of specific', 1),\n", " ('algorithms and statistical', 1),\n", " ('algorithms are used', 1),\n", " ('algorithms build mathematical', 1),\n", " ('also referred to', 1),\n", " ('an algorithm of', 1),\n", " ('analysis through unsupervised', 1),\n", " ('and application domains', 1),\n", " ('and computer vision', 1),\n", " ('and focuses on', 1),\n", " ('and inference instead', 1),\n", " ('and statistical models', 1),\n", " ('application across business', 1),\n", " ('application domains to', 1),\n", " ('applications of email', 1),\n", " ('are used in', 1),\n", " ('artificial intelligence machine', 1),\n", " ('as predictive analytics', 1),\n", " ('as subset of', 1),\n", " ('as training data', 1),\n", " ('being explicitly programmed', 1),\n", " ('build mathematical model', 1),\n", " ('business problems machine', 1),\n", " ('closely related to', 1),\n", " ('computational statistics which', 1),\n", " ('computer systems use', 1),\n", " ('computer vision where', 1),\n", " ('computers the study', 1),\n", " ('data analysis through', 1),\n", " ('data in order', 1),\n", " ('data known as', 1),\n", " ('data mining is', 1),\n", " ('decisions without being', 1),\n", " ('delivers methods theory', 1),\n", " ('detection of network', 1),\n", " ('develop an algorithm', 1),\n", " ('domains to the', 1),\n", " ('effectively perform specific', 1),\n", " ('email filtering detection', 1),\n", " ('explicit instructions relying', 1),\n", " ('explicitly programmed to', 1),\n", " ('exploratory data analysis', 1),\n", " ('field of machine', 1),\n", " ('field of study', 1),\n", " ('filtering detection of', 1),\n", " ('focuses on exploratory', 1),\n", " ('focuses on making', 1),\n", " ('for performing the', 1),\n", " ('in its application', 1),\n", " ('in order to', 1),\n", " ('in the applications', 1),\n", " ('infeasible to develop', 1),\n", " ('inference instead it', 1),\n", " ('instead it is', 1),\n", " ('instructions for performing', 1),\n", " ('instructions relying on', 1),\n", " ('intelligence machine learning', 1),\n", " ('intruders and computer', 1),\n", " ('is also referred', 1),\n", " ('is closely related', 1),\n", " ('is field of', 1),\n", " ('is infeasible to', 1),\n", " ('is seen as', 1),\n", " ('is the scientific', 1),\n", " ('it is infeasible', 1),\n", " ('it is seen', 1),\n", " ('its application across', 1),\n", " ('known as training', 1),\n", " ('learning algorithms are', 1),\n", " ('learning algorithms build', 1),\n", " ('learning and focuses', 1),\n", " ('learning data mining', 1),\n", " ('learning in its', 1),\n", " ('learning is also', 1),\n", " ('learning is closely', 1),\n", " ('learning is the', 1),\n", " ('machine learning algorithms', 2),\n", " ('machine learning and', 1),\n", " ('machine learning data', 1),\n", " ('machine learning is', 3),\n", " ('make predictions or', 1),\n", " ('making predictions using', 1),\n", " ('mathematical model of', 1),\n", " ('mathematical optimization delivers', 1),\n", " ('methods theory and', 1),\n", " ('mining is field', 1),\n", " ('model of sample', 1),\n", " ('models that computer', 1),\n", " ('network intruders and', 1),\n", " ('of algorithms and', 1),\n", " ('of artificial intelligence', 1),\n", " ('of email filtering', 1),\n", " ('of machine learning', 1),\n", " ('of mathematical optimization', 1),\n", " ('of network intruders', 1),\n", " ('of sample data', 1),\n", " ('of specific instructions', 1),\n", " ('of study within', 1),\n", " ('on exploratory data', 1),\n", " ('on making predictions', 1),\n", " ('on patterns and', 1),\n", " ('optimization delivers methods', 1),\n", " ('or decisions without', 1),\n", " ('order to make', 1),\n", " ('patterns and inference', 1),\n", " ('perform specific task', 1),\n", " ('perform the task', 1),\n", " ('performing the task', 1),\n", " ('predictions or decisions', 1),\n", " ('predictions using computers', 1),\n", " ('problems machine learning', 1),\n", " ('programmed to perform', 1),\n", " ('referred to as', 1),\n", " ('related to computational', 1),\n", " ('relying on patterns', 1),\n", " ('sample data known', 1),\n", " ('scientific study of', 1),\n", " ('seen as subset', 1),\n", " ('specific instructions for', 1),\n", " ('specific task without', 1),\n", " ('statistical models that', 1),\n", " ('statistics which focuses', 1),\n", " ('study of algorithms', 1),\n", " ('study of mathematical', 1),\n", " ('study within machine', 1),\n", " ('subset of artificial', 1),\n", " ('systems use to', 1),\n", " ('task machine learning', 2),\n", " ('task without using', 1),\n", " ('that computer systems', 1),\n", " ('the applications of', 1),\n", " ('the field of', 1),\n", " ('the scientific study', 1),\n", " ('the study of', 1),\n", " ('the task machine', 2),\n", " ('theory and application', 1),\n", " ('through unsupervised learning', 1),\n", " ('to as predictive', 1),\n", " ('to computational statistics', 1),\n", " ('to develop an', 1),\n", " ('to effectively perform', 1),\n", " ('to make predictions', 1),\n", " ('to perform the', 1),\n", " ('to the field', 1),\n", " ('training data in', 1),\n", " ('unsupervised learning in', 1),\n", " ('use to effectively', 1),\n", " ('used in the', 1),\n", " ('using computers the', 1),\n", " ('using explicit instructions', 1),\n", " ('vision where it', 1),\n", " ('where it is', 1),\n", " ('which focuses on', 1),\n", " ('within machine learning', 1),\n", " ('without being explicitly', 1),\n", " ('without using explicit', 1)]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip(n_grams,n_gram_cts)) #도수분포표와 ngram 맞물리기위해 zip " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Train by making a dictionary based on n-Grams:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trigram근사란?
\n", "\n", "바로 직전의 두 단어만 가지고 그 다음 단어가 나올 확률에 구해 예측하는 것" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "n = 3 \n", "n_min = n \n", "n_max = n \n", "n_gram_type = 'word'\n", "vectorizer = CountVectorizer(ngram_range=(n_min,n_max), analyzer = n_gram_type)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "n_grams = vectorizer.fit(my_text).get_feature_names() #Trigram 만듬\n", "my_dict = {}\n", "for a_gram in n_grams: \n", " \n", " #ngram가져와서 trigram이면 trigram에서 첫번째 두단어만 쪼갬, 그다음단어는 value로 넣음\n", " #트라이 w1,w2,w3을 w1w2,w1로 쪼갬\n", " #어폴시는 w3,w1w2 w1w2 두개가 발생했을 때 세번째 단어가 발생할 확률\n", " #두단어 w1w2를 키로 두고 3을 value로 둠\n", " \n", " words = nltk.word_tokenize(a_gram)\n", " a_nm1_gram = ' '.join(words[0:n-1]) # (n-1)-Gram.\n", " next_word = words[-1] # Word after the a_nm1_gram.\n", " if a_nm1_gram not in my_dict.keys(): #딕셔너리\n", " my_dict[a_nm1_gram] = [next_word] #딕셔너리 키에 없다면 리스트로 새로 만들고, # a_nm1_gram is a new key. So, initialize the dictionary entry.\n", " #키 #value는 그 다음에 나오는 단어를 리스트로 추가\n", " else: #키에 이미 있다면 추가\n", " my_dict[a_nm1_gram] += [next_word] # an_nm1_gram is already in the dictionary.\n", " \n", " #2개를 ㅜn-1로 \n", " #1개를 넥스트" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'across business': ['problems'],\n", " 'algorithm of': ['specific'],\n", " 'algorithms and': ['statistical'],\n", " 'algorithms are': ['used'],\n", " 'algorithms build': ['mathematical'],\n", " 'also referred': ['to'],\n", " 'an algorithm': ['of'],\n", " 'analysis through': ['unsupervised'],\n", " 'and application': ['domains'],\n", " 'and computer': ['vision'],\n", " 'and focuses': ['on'],\n", " 'and inference': ['instead'],\n", " 'and statistical': ['models'],\n", " 'application across': ['business'],\n", " 'application domains': ['to'],\n", " 'applications of': ['email'],\n", " 'are used': ['in'],\n", " 'artificial intelligence': ['machine'],\n", " 'as predictive': ['analytics'],\n", " 'as subset': ['of'],\n", " 'as training': ['data'],\n", " 'being explicitly': ['programmed'],\n", " 'build mathematical': ['model'],\n", " 'business problems': ['machine'],\n", " 'closely related': ['to'],\n", " 'computational statistics': ['which'],\n", " 'computer systems': ['use'],\n", " 'computer vision': ['where'],\n", " 'computers the': ['study'],\n", " 'data analysis': ['through'],\n", " 'data in': ['order'],\n", " 'data known': ['as'],\n", " 'data mining': ['is'],\n", " 'decisions without': ['being'],\n", " 'delivers methods': ['theory'],\n", " 'detection of': ['network'],\n", " 'develop an': ['algorithm'],\n", " 'domains to': ['the'],\n", " 'effectively perform': ['specific'],\n", " 'email filtering': ['detection'],\n", " 'explicit instructions': ['relying'],\n", " 'explicitly programmed': ['to'],\n", " 'exploratory data': ['analysis'],\n", " 'field of': ['machine', 'study'],\n", " 'filtering detection': ['of'],\n", " 'focuses on': ['exploratory', 'making'],\n", " 'for performing': ['the'],\n", " 'in its': ['application'],\n", " 'in order': ['to'],\n", " 'in the': ['applications'],\n", " 'infeasible to': ['develop'],\n", " 'inference instead': ['it'],\n", " 'instead it': ['is'],\n", " 'instructions for': ['performing'],\n", " 'instructions relying': ['on'],\n", " 'intelligence machine': ['learning'],\n", " 'intruders and': ['computer'],\n", " 'is also': ['referred'],\n", " 'is closely': ['related'],\n", " 'is field': ['of'],\n", " 'is infeasible': ['to'],\n", " 'is seen': ['as'],\n", " 'is the': ['scientific'],\n", " 'it is': ['infeasible', 'seen'],\n", " 'its application': ['across'],\n", " 'known as': ['training'],\n", " 'learning algorithms': ['are', 'build'],\n", " 'learning and': ['focuses'],\n", " 'learning data': ['mining'],\n", " 'learning in': ['its'],\n", " 'learning is': ['also', 'closely', 'the'],\n", " 'machine learning': ['algorithms', 'and', 'data', 'is'],\n", " 'make predictions': ['or'],\n", " 'making predictions': ['using'],\n", " 'mathematical model': ['of'],\n", " 'mathematical optimization': ['delivers'],\n", " 'methods theory': ['and'],\n", " 'mining is': ['field'],\n", " 'model of': ['sample'],\n", " 'models that': ['computer'],\n", " 'network intruders': ['and'],\n", " 'of algorithms': ['and'],\n", " 'of artificial': ['intelligence'],\n", " 'of email': ['filtering'],\n", " 'of machine': ['learning'],\n", " 'of mathematical': ['optimization'],\n", " 'of network': ['intruders'],\n", " 'of sample': ['data'],\n", " 'of specific': ['instructions'],\n", " 'of study': ['within'],\n", " 'on exploratory': ['data'],\n", " 'on making': ['predictions'],\n", " 'on patterns': ['and'],\n", " 'optimization delivers': ['methods'],\n", " 'or decisions': ['without'],\n", " 'order to': ['make'],\n", " 'patterns and': ['inference'],\n", " 'perform specific': ['task'],\n", " 'perform the': ['task'],\n", " 'performing the': ['task'],\n", " 'predictions or': ['decisions'],\n", " 'predictions using': ['computers'],\n", " 'problems machine': ['learning'],\n", " 'programmed to': ['perform'],\n", " 'referred to': ['as'],\n", " 'related to': ['computational'],\n", " 'relying on': ['patterns'],\n", " 'sample data': ['known'],\n", " 'scientific study': ['of'],\n", " 'seen as': ['subset'],\n", " 'specific instructions': ['for'],\n", " 'specific task': ['without'],\n", " 'statistical models': ['that'],\n", " 'statistics which': ['focuses'],\n", " 'study of': ['algorithms', 'mathematical'],\n", " 'study within': ['machine'],\n", " 'subset of': ['artificial'],\n", " 'systems use': ['to'],\n", " 'task machine': ['learning'],\n", " 'task without': ['using'],\n", " 'that computer': ['systems'],\n", " 'the applications': ['of'],\n", " 'the field': ['of'],\n", " 'the scientific': ['study'],\n", " 'the study': ['of'],\n", " 'the task': ['machine'],\n", " 'theory and': ['application'],\n", " 'through unsupervised': ['learning'],\n", " 'to as': ['predictive'],\n", " 'to computational': ['statistics'],\n", " 'to develop': ['an'],\n", " 'to effectively': ['perform'],\n", " 'to make': ['predictions'],\n", " 'to perform': ['the'],\n", " 'to the': ['field'],\n", " 'training data': ['in'],\n", " 'unsupervised learning': ['in'],\n", " 'use to': ['effectively'],\n", " 'used in': ['the'],\n", " 'using computers': ['the'],\n", " 'using explicit': ['instructions'],\n", " 'vision where': ['it'],\n", " 'where it': ['is'],\n", " 'which focuses': ['on'],\n", " 'within machine': ['learning'],\n", " 'without being': ['explicitly'],\n", " 'without using': ['explicit']}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#딕셔너리 보기\n", "my_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. Predict the next word: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "위에서 만든 딕셔너리 가지고 그 다음 단어를 확률에 의해 예측" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Helper function that picks the following word.\n", "def predict_next(a_nm1_gram):\n", " value_list_size = len(my_dict[a_nm1_gram]) # 키에 해당하는 값의 길이 = a_nm1_gram.\n", " i_pick = randint(0, value_list_size) # 0 ~ value_list_size의 랜덤숫자\n", " return(my_dict[a_nm1_gram][i_pick]) # 랜덤으로 선택한 다음 다음 단어를 반환" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'make'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Test.\n", "input_str = 'order to' # Has to be a VALID (n-1)-Gram!\n", "predict_next(input_str) #그 다음 단어를 확률적으로 뽑아서 줌\n", "\n", "#위 결과에서 'order to': ['make'] 2개뿐임" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "is\n", "and\n", "algorithms\n", "algorithms\n", "and\n", "and\n", "data\n", "data\n", "data\n", "algorithms\n" ] } ], "source": [ "# Another test.\n", "# 10 번 반복하고 다음 단어가 확률에 따라 무작위로 선택되는지 확인\n", "input_str = 'machine learning' # Has to be a VALID (n-1)-Gram!\n", "for i in range(10):\n", " print(predict_next(input_str))\n", " \n", "#machine learning을 넣으면 확률에 따라 결과값이 다르게 나옴\n", " \n", "#range에 해당하는 원소를 랜덤으로 뽑음\n", "#반복되는게 있으면 더 높은 확률로 뽑힐 것 (그래서 결과값이 다 다름)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4. Predict a sequence:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "seed 사용하는 방법도 있음" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "#random seed 초기화\n", "seed(123) " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# A seed string has to be input by the user.\n", "my_seed_str = 'machine learning' # Has to be a VALID (n-1)-Gram!\n", "# my_seed_str = 'in order' # Has to be a VALID (n-1)-Gram!" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "a_nm1_gram = my_seed_str #seed가지고 n-1gram 만듬\n", "output_string = my_seed_str # Initialize the output string.\n", "while a_nm1_gram in my_dict:\n", " output_string += \" \" + predict_next(a_nm1_gram) #n-1바탕으로 예측해서 추가시킴\n", " words = nltk.word_tokenize(output_string)\n", " a_nm1_gram = ' '.join(words[-n+1:]) #만들어논 단어를 토큰화해서 끝에서 두개만 뽑아옴 해서\n", " #n-1gram bigram 만들어 게속 감 # Update a_nm1_gram." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'machine learning data mining is field of study within machine learning data mining is field of machine learning algorithms are used in the applications of email filtering detection of network intruders and computer vision where it is infeasible to develop an algorithm of specific instructions for performing the task machine learning and focuses on making predictions using computers the study of algorithms and statistical models that computer systems use to effectively perform specific task without using explicit instructions relying on patterns and inference instead it is seen as subset of artificial intelligence machine learning and focuses on exploratory data analysis through unsupervised learning in its application across business problems machine learning and focuses on exploratory data analysis through unsupervised learning in its application across business problems machine learning and focuses on exploratory data analysis through unsupervised learning in its application across business problems machine learning is closely related to computational statistics which focuses on exploratory data analysis through unsupervised learning in its application across business problems machine learning data mining is field of machine learning is closely related to computational statistics which focuses on making predictions using computers the study of algorithms and statistical models that computer systems use to effectively perform specific task without using explicit instructions relying on patterns and inference instead it is seen as subset of artificial intelligence machine learning algorithms are used in the applications of email filtering detection of network intruders and computer vision where it is infeasible to develop an algorithm of specific instructions for performing the task machine learning algorithms build mathematical model of sample data known as training data in order to make predictions or decisions without being explicitly programmed to perform the task machine learning is the scientific study of algorithms and statistical models that computer systems use to effectively perform specific task without using explicit instructions relying on patterns and inference instead it is seen as subset of artificial intelligence machine learning data mining is field of machine learning is also referred to as predictive analytics'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Output the predicted sequence.\n", "output_string \n", "\n", "#trigram근사에 따라 bigram seed를 주면 그 다음 단어를 예측해 trigram 완성시킴\n", "#완성시킨 끝에 두개를 bifram으로 뽑아 그걸 바탕으로 3번째 단어 예측해 완성시킴\n", "#이 과정이 계속 돌고돌아 문장만들어 감" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }