{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 레진 데이터 챌린지 with 파이콘 한국 2017\n", "\n", "http://tech.lezhin.com/events/data-challenge-pyconkr-2017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 모델 학습 및 평가" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "학습에 사용할 데이터를 다음처럼 준비하였습니다." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_table('lezhin_dataset_v2_training.tsv', header=None)\n", "df_1 = df.drop(df.columns[[6,7,]], axis=1)\n", "df_1 = df_1.dropna(axis=1)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "# very important. It does not work without it.\n", "scaler = MinMaxScaler(feature_range=(0, 1))\n", "df_1 = scaler.fit_transform(df_1)\n", "df_1 = pd.DataFrame(df_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "학습에 사용할 데이터는 전체 사용합니다. 정확도를 판단하기 위해 30% 데이터를 사용합니다." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "455675\n" ] } ], "source": [ "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.optimizers import SGD\n", "import numpy as np\n", "\n", "# 데이터 전처리\n", "rows = len(df_1)\n", "cut_row = rows * 7 // 10\n", "#cut_row = rows\n", "print(cut_row)\n", "x_train = df_1.iloc[:,1:].values.tolist()\n", "y_train = df_1.iloc[:,0].values.tolist()\n", "x_test = df_1.iloc[cut_row:,1:].values.tolist()\n", "y_test = df_1.iloc[cut_row:,0].values.tolist()\n", "\n", "# pickle로 저장\n", "import pickle\n", "import os\n", "def save_data(data, fname='data2.pic'):\n", " if not os.path.isfile(fname):\n", " f = open(fname, 'wb')\n", " pickle.dump(data, f)\n", " f.close()\n", " \n", "def load_data(fname='data2.pic'):\n", " f = open(fname, 'rb')\n", " x_train,y_train,x_test,y_test = pickle.load(f)\n", " f.close()\n", " return x_train,y_train,x_test,y_test\n", " \n", "save_data([x_train,y_train,x_test,y_test])\n", "x_train, y_train, x_test, y_test = load_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "binary crossentropy를 사용하여 학습합니다." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "dense_1 (Dense) (None, 1) 148 \n", "=================================================================\n", "Total params: 148\n", "Trainable params: 148\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "Epoch 1/10\n", "650965/650965 [==============================] - 54s - loss: 0.5262 - acc: 0.7354 \n", "Epoch 2/10\n", "650965/650965 [==============================] - 51s - loss: 0.5214 - acc: 0.7396 \n", "Epoch 3/10\n", "650965/650965 [==============================] - 60s - loss: 0.5209 - acc: 0.7404 \n", "Epoch 4/10\n", "650965/650965 [==============================] - 62s - loss: 0.5207 - acc: 0.7402 \n", "Epoch 5/10\n", "650965/650965 [==============================] - 51s - loss: 0.5205 - acc: 0.7405 \n", "Epoch 6/10\n", "650965/650965 [==============================] - 52s - loss: 0.5204 - acc: 0.7408 \n", "Epoch 7/10\n", "650965/650965 [==============================] - 50s - loss: 0.5204 - acc: 0.7405 \n", "Epoch 8/10\n", "650965/650965 [==============================] - 51s - loss: 0.5204 - acc: 0.7404 \n", "Epoch 9/10\n", "650965/650965 [==============================] - 52s - loss: 0.5203 - acc: 0.7407 \n", "Epoch 10/10\n", "650965/650965 [==============================] - 53s - loss: 0.5203 - acc: 0.7406 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = Sequential()\n", "model.add(Dense(1, input_dim=147, activation='sigmoid'))\n", "\n", "model.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.1), metrics=['accuracy'])\n", "\n", "model.summary()\n", "model.fit(x_train, y_train, epochs=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "해당 모델을 사용하여 정확도를 구합니다.\n", "정확도는 약 67.64%가 나왔습니다." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "194496/195290 [============================>.] - ETA: 0s\n", "loss_and_metrics : [0.56578984513517916, 0.67638896000838367]\n" ] } ], "source": [ "loss_and_metrics = model.evaluate(x_test, y_test)\n", "\n", "print('')\n", "print('loss_and_metrics : ' + str(loss_and_metrics))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4개의 relu 계층과 1개의 sigmoid 계층을 사용하여 학습합니다." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/donglyeolsin/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:4: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(100, kernel_initializer=\"uniform\", input_dim=147, activation=\"relu\")`\n", "/Users/donglyeolsin/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:5: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(100, kernel_initializer=\"uniform\", activation=\"relu\")`\n", "/Users/donglyeolsin/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:6: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(100, kernel_initializer=\"uniform\", activation=\"relu\")`\n", "/Users/donglyeolsin/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:7: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(50, kernel_initializer=\"uniform\", activation=\"relu\")`\n", "/Users/donglyeolsin/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:8: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, kernel_initializer=\"uniform\", activation=\"sigmoid\")`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "dense_2 (Dense) (None, 100) 14800 \n", "_________________________________________________________________\n", "dense_3 (Dense) (None, 100) 10100 \n", "_________________________________________________________________\n", "dense_4 (Dense) (None, 100) 10100 \n", "_________________________________________________________________\n", "dense_5 (Dense) (None, 50) 5050 \n", "_________________________________________________________________\n", "dense_6 (Dense) (None, 1) 51 \n", "=================================================================\n", "Total params: 40,101\n", "Trainable params: 40,101\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "Epoch 1/100\n", "650965/650965 [==============================] - 119s - loss: 0.4386 - acc: 0.7938 \n", "Epoch 2/100\n", "650965/650965 [==============================] - 116s - loss: 0.4090 - acc: 0.8132 \n", "Epoch 3/100\n", "650965/650965 [==============================] - 115s - loss: 0.3985 - acc: 0.8190 \n", "Epoch 4/100\n", "650965/650965 [==============================] - 114s - loss: 0.3913 - acc: 0.8236 \n", "Epoch 5/100\n", "650965/650965 [==============================] - 114s - loss: 0.3860 - acc: 0.8269 \n", "Epoch 6/100\n", "650965/650965 [==============================] - 115s - loss: 0.3816 - acc: 0.8292 \n", "Epoch 7/100\n", "650965/650965 [==============================] - 117s - loss: 0.3779 - acc: 0.8316 \n", "Epoch 8/100\n", "650965/650965 [==============================] - 114s - loss: 0.3746 - acc: 0.8338 \n", "Epoch 9/100\n", "650965/650965 [==============================] - 115s - loss: 0.3718 - acc: 0.8351 \n", "Epoch 10/100\n", "650965/650965 [==============================] - 114s - loss: 0.3692 - acc: 0.8363 \n", "Epoch 11/100\n", "650965/650965 [==============================] - 121s - loss: 0.3668 - acc: 0.8380 \n", "Epoch 12/100\n", "650965/650965 [==============================] - 121s - loss: 0.3647 - acc: 0.8393 \n", "Epoch 13/100\n", "650965/650965 [==============================] - 119s - loss: 0.3626 - acc: 0.8403 \n", "Epoch 14/100\n", "650965/650965 [==============================] - 120s - loss: 0.3609 - acc: 0.8411 \n", "Epoch 15/100\n", "650965/650965 [==============================] - 121s - loss: 0.3589 - acc: 0.8422 \n", "Epoch 16/100\n", "650965/650965 [==============================] - 124s - loss: 0.3572 - acc: 0.8432 \n", "Epoch 17/100\n", "650965/650965 [==============================] - 122s - loss: 0.3552 - acc: 0.8445 \n", "Epoch 18/100\n", "650965/650965 [==============================] - 121s - loss: 0.3540 - acc: 0.8447 \n", "Epoch 19/100\n", "650965/650965 [==============================] - 121s - loss: 0.3522 - acc: 0.8460 \n", "Epoch 20/100\n", "650965/650965 [==============================] - 122s - loss: 0.3508 - acc: 0.8469 \n", "Epoch 21/100\n", "650965/650965 [==============================] - 131s - loss: 0.3494 - acc: 0.8474 \n", "Epoch 22/100\n", "650965/650965 [==============================] - 144s - loss: 0.3483 - acc: 0.8486 \n", "Epoch 23/100\n", "650965/650965 [==============================] - 134s - loss: 0.3471 - acc: 0.8489 \n", "Epoch 24/100\n", "650965/650965 [==============================] - 134s - loss: 0.3456 - acc: 0.8498 \n", "Epoch 25/100\n", "650965/650965 [==============================] - 134s - loss: 0.3445 - acc: 0.8501 \n", "Epoch 26/100\n", "650965/650965 [==============================] - 134s - loss: 0.3434 - acc: 0.8508 \n", "Epoch 27/100\n", "650965/650965 [==============================] - 134s - loss: 0.3424 - acc: 0.8515 \n", "Epoch 28/100\n", "650965/650965 [==============================] - 134s - loss: 0.3411 - acc: 0.8521 \n", "Epoch 29/100\n", "650965/650965 [==============================] - 134s - loss: 0.3405 - acc: 0.8523 \n", "Epoch 30/100\n", "650965/650965 [==============================] - 141s - loss: 0.3393 - acc: 0.8528 \n", "Epoch 31/100\n", "650965/650965 [==============================] - 149s - loss: 0.3390 - acc: 0.8531 \n", "Epoch 32/100\n", "650965/650965 [==============================] - 149s - loss: 0.3378 - acc: 0.8539 \n", "Epoch 33/100\n", "650965/650965 [==============================] - 151s - loss: 0.3373 - acc: 0.8541 \n", "Epoch 34/100\n", "650965/650965 [==============================] - 150s - loss: 0.3363 - acc: 0.8551 \n", "Epoch 35/100\n", "650965/650965 [==============================] - 150s - loss: 0.3356 - acc: 0.8550 \n", "Epoch 36/100\n", "650965/650965 [==============================] - 150s - loss: 0.3347 - acc: 0.8559 \n", "Epoch 37/100\n", "650965/650965 [==============================] - 150s - loss: 0.3339 - acc: 0.8560 \n", "Epoch 38/100\n", "650965/650965 [==============================] - 150s - loss: 0.3333 - acc: 0.8565 \n", "Epoch 39/100\n", "650965/650965 [==============================] - 150s - loss: 0.3327 - acc: 0.8570 \n", "Epoch 40/100\n", "650965/650965 [==============================] - 151s - loss: 0.3322 - acc: 0.8574 \n", "Epoch 41/100\n", "650965/650965 [==============================] - 150s - loss: 0.3313 - acc: 0.8577 \n", "Epoch 42/100\n", "650965/650965 [==============================] - 150s - loss: 0.3302 - acc: 0.8582 \n", "Epoch 43/100\n", "650965/650965 [==============================] - 150s - loss: 0.3298 - acc: 0.8580 \n", "Epoch 44/100\n", "650965/650965 [==============================] - 150s - loss: 0.3295 - acc: 0.8586 \n", "Epoch 45/100\n", "650965/650965 [==============================] - 151s - loss: 0.3287 - acc: 0.8588 \n", "Epoch 46/100\n", "650965/650965 [==============================] - 150s - loss: 0.3285 - acc: 0.8589 \n", "Epoch 47/100\n", "650965/650965 [==============================] - 149s - loss: 0.3279 - acc: 0.8592 \n", "Epoch 48/100\n", "650965/650965 [==============================] - 149s - loss: 0.3271 - acc: 0.8597 \n", "Epoch 49/100\n", "650965/650965 [==============================] - 149s - loss: 0.3264 - acc: 0.8601 \n", "Epoch 50/100\n", "650965/650965 [==============================] - 150s - loss: 0.3264 - acc: 0.8601 \n", "Epoch 51/100\n", "650965/650965 [==============================] - 150s - loss: 0.3260 - acc: 0.8604 \n", "Epoch 52/100\n", "650965/650965 [==============================] - 150s - loss: 0.3250 - acc: 0.8608 \n", "Epoch 53/100\n", "650965/650965 [==============================] - 150s - loss: 0.3248 - acc: 0.8608 \n", "Epoch 54/100\n", "650965/650965 [==============================] - 150s - loss: 0.3243 - acc: 0.8611 \n", "Epoch 55/100\n", "650965/650965 [==============================] - 153s - loss: 0.3237 - acc: 0.8614 \n", "Epoch 56/100\n", "650965/650965 [==============================] - 150s - loss: 0.3232 - acc: 0.8620 \n", "Epoch 57/100\n", "650965/650965 [==============================] - 150s - loss: 0.3228 - acc: 0.8621 \n", "Epoch 58/100\n", "650965/650965 [==============================] - 150s - loss: 0.3225 - acc: 0.8623 \n", "Epoch 59/100\n", "650965/650965 [==============================] - 150s - loss: 0.3223 - acc: 0.8622 \n", "Epoch 60/100\n", "650965/650965 [==============================] - 150s - loss: 0.3220 - acc: 0.8624 \n", "Epoch 61/100\n", "650965/650965 [==============================] - 150s - loss: 0.3213 - acc: 0.8624 \n", "Epoch 62/100\n", "650965/650965 [==============================] - 150s - loss: 0.3210 - acc: 0.8626 \n", "Epoch 63/100\n", "650965/650965 [==============================] - 150s - loss: 0.3209 - acc: 0.8629 \n", "Epoch 64/100\n", "650965/650965 [==============================] - 152s - loss: 0.3204 - acc: 0.8632 \n", "Epoch 65/100\n", "650965/650965 [==============================] - 151s - loss: 0.3204 - acc: 0.8630 \n", "Epoch 66/100\n", "650965/650965 [==============================] - 150s - loss: 0.3200 - acc: 0.8633 \n", "Epoch 67/100\n", "650965/650965 [==============================] - 151s - loss: 0.3194 - acc: 0.8638 \n", "Epoch 68/100\n", "650965/650965 [==============================] - 149s - loss: 0.3190 - acc: 0.8637 \n", "Epoch 69/100\n", "650965/650965 [==============================] - 151s - loss: 0.3190 - acc: 0.8638 \n", "Epoch 70/100\n", "650965/650965 [==============================] - 153s - loss: 0.3185 - acc: 0.8638 \n", "Epoch 71/100\n", "650965/650965 [==============================] - 151s - loss: 0.3182 - acc: 0.8641 \n", "Epoch 72/100\n", "650965/650965 [==============================] - 150s - loss: 0.3181 - acc: 0.8643 \n", "Epoch 73/100\n", "650965/650965 [==============================] - 151s - loss: 0.3177 - acc: 0.8643 \n", "Epoch 74/100\n", "650965/650965 [==============================] - 150s - loss: 0.3173 - acc: 0.8646 \n", "Epoch 75/100\n", "650965/650965 [==============================] - 151s - loss: 0.3174 - acc: 0.8644 \n", "Epoch 76/100\n", "650965/650965 [==============================] - 151s - loss: 0.3171 - acc: 0.8650 \n", "Epoch 77/100\n", "650965/650965 [==============================] - 152s - loss: 0.3169 - acc: 0.8652 \n", "Epoch 78/100\n", "650965/650965 [==============================] - 151s - loss: 0.3166 - acc: 0.8652 \n", "Epoch 79/100\n", "650965/650965 [==============================] - 150s - loss: 0.3161 - acc: 0.8655 \n", "Epoch 80/100\n", "650965/650965 [==============================] - 150s - loss: 0.3162 - acc: 0.8653 \n", "Epoch 81/100\n", "650965/650965 [==============================] - 152s - loss: 0.3158 - acc: 0.8655 \n", "Epoch 82/100\n", "650965/650965 [==============================] - 152s - loss: 0.3156 - acc: 0.8654 \n", "Epoch 83/100\n", "650965/650965 [==============================] - 151s - loss: 0.3159 - acc: 0.8657 \n", "Epoch 84/100\n", "650965/650965 [==============================] - 151s - loss: 0.3155 - acc: 0.8656 \n", "Epoch 85/100\n", "650965/650965 [==============================] - 151s - loss: 0.3154 - acc: 0.8657 \n", "Epoch 86/100\n", "650965/650965 [==============================] - 152s - loss: 0.3150 - acc: 0.8658 \n", "Epoch 87/100\n", "650965/650965 [==============================] - 151s - loss: 0.3147 - acc: 0.8659 \n", "Epoch 88/100\n", "650965/650965 [==============================] - 152s - loss: 0.3149 - acc: 0.8656 \n", "Epoch 89/100\n", "650965/650965 [==============================] - 151s - loss: 0.3142 - acc: 0.8664 \n", "Epoch 90/100\n", "650965/650965 [==============================] - 153s - loss: 0.3141 - acc: 0.8665 \n", "Epoch 91/100\n", "650965/650965 [==============================] - 151s - loss: 0.3141 - acc: 0.8662 \n", "Epoch 92/100\n", "650965/650965 [==============================] - 152s - loss: 0.3138 - acc: 0.8664 \n", "Epoch 93/100\n", "650965/650965 [==============================] - 151s - loss: 0.3136 - acc: 0.8668 \n", "Epoch 94/100\n", "650965/650965 [==============================] - 153s - loss: 0.3132 - acc: 0.8669 \n", "Epoch 95/100\n", "650965/650965 [==============================] - 160s - loss: 0.3132 - acc: 0.8668 \n", "Epoch 96/100\n", "650965/650965 [==============================] - 157s - loss: 0.3135 - acc: 0.8667 \n", "Epoch 97/100\n", "650965/650965 [==============================] - 152s - loss: 0.3131 - acc: 0.8664 \n", "Epoch 98/100\n", "650965/650965 [==============================] - 152s - loss: 0.3125 - acc: 0.8668 \n", "Epoch 99/100\n", "650965/650965 [==============================] - 152s - loss: 0.3127 - acc: 0.8669 \n", "Epoch 100/100\n", "650965/650965 [==============================] - 152s - loss: 0.3125 - acc: 0.8669 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from keras.optimizers import Adam\n", "\n", "model = Sequential()\n", "model.add(Dense(100, input_dim=147, init='uniform', activation='relu'))\n", "model.add(Dense(100, init='uniform', activation='relu'))\n", "model.add(Dense(100, init='uniform', activation='relu'))\n", "model.add(Dense(50, init='uniform', activation='relu'))\n", "model.add(Dense(1, init='uniform', activation='sigmoid'))\n", "\n", "#sgd = SGD()\n", "#model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.1), metrics=['accuracy'])\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "\n", "model.summary()\n", "model.fit(x_train, y_train, epochs=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "해당 모델을 사용하여 정확도를 구합니다.\n", "정확도는 약 87.29%가 나왔습니다." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "194560/195290 [============================>.] - ETA: 0s\n", "loss_and_metrics : [0.3222619390542914, 0.87285575298030194]\n" ] } ], "source": [ "loss_and_metrics = model.evaluate(x_test, y_test)\n", "\n", "print('')\n", "print('loss_and_metrics : ' + str(loss_and_metrics))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "해당 모델을 저장합니다." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 모델 저장하기\n", "from keras.models import load_model\n", "model.save('lezhin.h5')" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }