{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 9장. 웹 애플리케이션에 머신 러닝 모델 내장하기" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lst = [1,2,4,5]\n", "list(map(lambda x: 'lower' if x < 3 else 'higher', lst))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load_ext watermark\n", "# %watermark -a \"Sebastian Raschka\" -u -d -v -p numpy,pandas,pyprind,matplotlib,nltk,sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "플래스크(Flask) 웹 애플리케이션 코드는 다음 디렉토리에 있습니다:\n", " \n", "- `1st_flask_app_1/`: 간단한 플래스크 웹 애플리케이션\n", "- `1st_flask_app_2/`: `1st_flask_app_1`에 폼 검증과 렌더링을 추가하여 확장한 버전\n", "- `movieclassifier/`: 웹 애플리케이션에 내장한 영화 리뷰 분류기\n", "- `movieclassifier_with_update/`: `movieclassifier`와 같지만 초기화를 위해 sqlite 데이터베이스를 사용합니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "웹 애플리케이션을 로컬에서 실행하려면 `cd`로 (위에 나열된) 각 디렉토리에 들어가서 메인 애플리케이션 스크립트를 실행합니다.\n", "\n", " cd ./1st_flask_app_1\n", " python3 app.py\n", " \n", "터미널에서 다음같은 내용일 출력됩니다.\n", " \n", " * Running on http://127.0.0.1:5000/\n", " * Restarting with reloader\n", " \n", "웹 브라우저를 열고 터미널에 출력된 주소(일반적으로 http://127.0.0.1:5000/)를 입력하여 웹 애플리케이션에 접속합니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**이 튜토리얼로 만든 예제 애플리케이션 데모는 다음 주소에서 볼 수 있습니다: http://haesun.pythonanywhere.com/**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8장 정리 - 영화 리뷰 분류를 위한 모델 훈련하기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "이 절은 8장의 마지막 섹션에서 훈련한 로지스틱 회귀 모델을 다시 사용합니다. 이어지는 코드 블럭을 실행하여 다음 절에서 사용할 모델을 훈련시키겠습니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**노트**\n", "\n", "다음 코드는 8장에서 만든 `movie_data.csv` 데이터셋을 사용합니다." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # 압축파일을 풀어서 저장하는 함수\n", "# import gzip\n", "# with gzip.open('movie_data.csv.gz') as f_in, open('movie_data.csv', 'wb') as f_out:\n", "# f_out.writelines(f_in)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import nltk\n", "# nltk.download('stopwords')" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('\"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\\'s, they discover the criminal and a net of power and money to cover the murder.

\"\"Murder in Greenwich\"\" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.

Title (Brazil): Not Available\"',\n", " 1)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import re\n", "from nltk.corpus import stopwords\n", "from nltk.stem import PorterStemmer\n", "\n", "stop = stopwords.words('english')\n", "porter = PorterStemmer()\n", "\n", "def tokenizer(text):\n", " text = re.sub('<[^>]*>', '', text)\n", " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text.lower())\n", " text = re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')\n", " tokenized = [w for w in text.split() if w not in stop]\n", " return tokenized\n", "\n", "def stream_docs(path):\n", " with open(path, 'r', encoding='utf-8') as csv:\n", " next(csv) # skip header\n", " for line in csv:\n", " text, label = line[:-3], int(line[-2])\n", " yield text, label\n", "\n", "text_instnace = stream_docs(path='data/movie_data.csv')\n", "next(text_instnace)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('\"OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie!

I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn\\'t understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about!

I don\\'t care that everyone on this movie was doing out of love for the project, or some such nonsense... I\\'ve seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid!

I rented this piece of garbage for a buck, and I want my money back! I want my 2 hours back I invested on this Grade F waste of my time! Don\\'t watch this movie, or waste 1 minute of your valuable time while passing through a room where it\\'s playing or even open up the case that is holding the DVD! Believe me, you\\'ll thank me for the advice!\"',\n", " 0)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 반복 실행할때 마다 새로운 인스턴스 객체를 출력 합니다\n", "next(text_instnace)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def get_minibatch(doc_stream, size):\n", " docs, y = [], []\n", " try:\n", " for _ in range(size):\n", " text, label = next(doc_stream)\n", " docs.append(text)\n", " y.append(label)\n", " except StopIteration:\n", " return None, None\n", " return docs, y" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n", "from sklearn.linear_model import SGDClassifier\n", "\n", "vect = HashingVectorizer(decode_error='ignore', \n", " n_features=2**21,\n", " preprocessor=None, \n", " tokenizer=tokenizer)\n", "\n", "# max_iter 를 설정시 tol=0.01 값을 0.21 ~ 0.01 사이의 값으로 추가를 해주면 warning이 사라짐\n", "clf = SGDClassifier(loss='log', random_state=1, max_iter=1, tol=0.01)\n", "doc_stream = stream_docs(path='data/movie_data.csv')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% [##############################] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:22\n" ] } ], "source": [ "import pyprind\n", "pbar = pyprind.ProgBar(45)\n", "\n", "classes = np.array([0, 1])\n", "for _ in range(45):\n", " X_train, y_train = get_minibatch(doc_stream, size=1000)\n", " if not X_train:\n", " break\n", " X_train = vect.transform(X_train)\n", " clf.partial_fit(X_train, y_train, classes=classes)\n", " pbar.update()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "정확도: 0.867\n" ] } ], "source": [ "X_test, y_test = get_minibatch(doc_stream, size=5000)\n", "X_test = vect.transform(X_test)\n", "print('정확도: %.3f' % clf.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "clf = clf.partial_fit(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 학습된 사이킷런 추정기 저장하기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "앞에서 로지스틱 회귀 모델을 훈련한 후에 분류기, 불용어, 포터 어간 추출기, `HashingVectorizer`를 로컬 디스크에 직렬화된 객체로 저장합니다. 나중에 웹 애플리케이션에서 학습된 분류기를 이용하겠습니다." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import pickle, os\n", "# dest = os.path.join('movieclassifier', 'pkl_objects')\n", "# if not os.path.exists(dest):\n", "# os.makedirs(dest)\n", "\n", "pickle.dump(stop, open('data/stopwords.pkl', 'wb'), protocol=4) \n", "pickle.dump(clf, open('data/classifier.pkl', 'wb'), protocol=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile movieclassifier/vectorizer.py\n", "\n", "# 나중에 임포트할 수 있도록 별도의 파일에 `HashingVectorizer`를 저장합니다.\n", "from sklearn.feature_extraction.text import HashingVectorizer\n", "import re, os, pickle\n", "cur_dir = os.path.dirname(__file__)\n", "stop = pickle.load(open('data/stopwords.pkl'), 'rb')\n", "\n", "def tokenizer(text):\n", " text = re.sub('<[^>]*>', '', text)\n", " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)',\n", " text.lower())\n", " text = re.sub('[\\W]+', ' ', text.lower()) \\\n", " + ' '.join(emoticons).replace('-', '')\n", " tokenized = [w for w in text.split() if w not in stop]\n", " return tokenized\n", "\n", "vect = HashingVectorizer(decode_error='ignore',\n", " n_features=2**21,\n", " preprocessor=None,\n", " tokenizer=tokenizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 새로운 환경에서도 제대로 작동을 확인하기 위해 IPython 노트북 커널을 재시작\n", "먼저 현재 파이썬 디렉토리를 `movieclassifer`로 변경합니다:\n", "```\n", "%reset\n", "Once deleted, variables cannot be recovered. Proceed (y/[n])?\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import os\n", "# os.chdir('movieclassifier')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%reset\n", "import numpy as np\n", "label = {0:'양성', 1:'음성'}\n", "\n", "example = ['I love this movie']\n", "X = vect.transform(example)\n", "print('예측: %s\\n확률: %.2f%%' %\\\n", " (label[clf.predict(X)[0]], \n", " np.max(clf.predict_proba(X))*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 데이터 저장을 위해 SQLite 데이터베이스 설정하기\n", "이 코드를 실행하기 전에 현재 위치가 `movieclassifier` 디렉토리인지 확인합니다." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "\n", "conn = sqlite3.connect('reviews.sqlite')\n", "c = conn.cursor()\n", "c.execute('DROP TABLE IF EXISTS review_db')\n", "c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')\n", "\n", "example1 = 'I love this movie'\n", "c.execute(\"INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))\", (example1, 1))\n", "\n", "example2 = 'I disliked this movie'\n", "c.execute(\"INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))\", (example2, 0))\n", "\n", "conn.commit()\n", "conn.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conn = sqlite3.connect('reviews.sqlite')\n", "c = conn.cursor()\n", "c.execute(\"SELECT * FROM review_db WHERE date BETWEEN '2017-01-01 10:10:10' AND DATETIME('now')\")\n", "results = c.fetchall()\n", "conn.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 플라스크 웹 애플리케이션 개발하기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 영화 분류기 업데이트" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "다운로드한 깃허브 저장소에 들어있는 movieclassifier_with_update 디렉토리를 사용합니다(그렇지 않으면 `movieclassifier` 디렉토리를 복사해서 사용하세요)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "os.chdir('../movieclassifier_with_update')\n", "shutil.copyfile('../movieclassifier/pkl_objects/classifier.pkl',\n", " './pkl_objects/classifier.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SQLite 데이터베이스에 저장된 데이터로 분류기를 업데이트하는 함수를 정의합니다:\n", "import pickle, sqlite3\n", "import numpy as np\n", "\n", "# 로컬 디렉토리에서 HashingVectorizer를 임포트합니다\n", "from vectorizer import vect\n", "\n", "def update_model(db_path, model, batch_size=10000):\n", "\n", " conn = sqlite3.connect(db_path)\n", " c = conn.cursor()\n", " c.execute('SELECT * from review_db') \n", " results = c.fetchmany(batch_size)\n", " while results:\n", " data = np.array(results)\n", " X = data[:, 0]\n", " y = data[:, 1].astype(int)\n", " classes = np.array([0, 1])\n", " X_train = vect.transform(X)\n", " clf.partial_fit(X_train, y, classes=classes)\n", " results = c.fetchmany(batch_size) \n", " conn.close()\n", " return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 모델을 업데이트합니다:\n", "cur_dir = '.'\n", "\n", "# app.py 파일에 이 코드를 삽입했다면 다음 경로를 사용하세요.\n", "# import os\n", "# cur_dir = os.path.dirname(__file__)\n", "\n", "clf = pickle.load(open(os.path.join(cur_dir,\n", " 'pkl_objects',\n", " 'classifier.pkl'), 'rb'))\n", "db = os.path.join(cur_dir, 'reviews.sqlite')\n", "\n", "update_model(db_path=db, model=clf, batch_size=10000)\n", "\n", "# classifier.pkl 파일을 업데이트하려면 다음 주석을 해제하세요.\n", "\n", "# pickle.dump(clf, open(os.path.join(cur_dir, \n", "# 'pkl_objects', 'classifier.pkl'), 'wb')\n", "# , protocol=4)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }