{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Click-Through Rate Prediction with Logistic Regression**\n", "1. Data preprocessing and one-hot encoding\n", "1. How **logistic regression** works\n", "1. **Gradient descent** and **stochastic gradient descent**\n", "1. Training a **logistic regression** classifier and making predictions\n", "1. Logistic regression with **L1 and L2 regularization**\n", "1. Online learning\n", "1. **Feature selection** with **random forest**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **1 One-Hot Encoding**\n", "1. Convert **categorical features** into **binary numerical features**\n", "1. A **categorical** feature with **k possible values** is mapped to **k binary features**, one per value\n", "1. Encoded categorical data can be converted **back to its original form**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **01 Creating One-Hot Encoded Data**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 1., 1., 0., 0.],\n", "       [1., 0., 0., 0., 0., 1.],\n", "       [1., 0., 0., 1., 0., 0.],\n", "       [0., 1., 0., 0., 0., 1.],\n", "       [0., 0., 1., 0., 0., 1.],\n", "       [0., 0., 1., 0., 1., 0.],\n", "       [0., 1., 0., 1., 0., 0.]])" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert categorical data given as dicts into one-hot encoded vectors\n", "from sklearn.feature_extraction import DictVectorizer\n", "dict_one_hot_encoder = DictVectorizer(sparse=False)\n", "\n", "X_dict = [{'interest': 'tech', 'occupation': 'professional'},\n", "          {'interest': 'fashion', 'occupation': 'student'},\n", "          {'interest': 'fashion', 'occupation': 'professional'},\n", "          {'interest': 'sports', 'occupation': 'student'},\n", "          {'interest': 'tech', 'occupation': 'student'},\n", "          {'interest': 'tech', 'occupation': 'retired'},\n", "          {'interest': 'sports', 'occupation': 'professional'}]\n", "\n", "X_encoded = dict_one_hot_encoder.fit_transform(X_dict)\n", "X_encoded" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'interest=fashion': 0,\n", " 'interest=sports': 1,\n", " 'interest=tech': 2,\n", " 'occupation=professional': 3,\n", " 'occupation=retired': 4,\n", " 'occupation=student': 5}\n" ] } ], "source": [ "# Inspect the mapping from category values to column indices\n", "from pprint import pprint\n", "pprint(dict_one_hot_encoder.vocabulary_)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## **02 Converting Data by Using Map Data**\n", "위에서 학습한 **dict_one_hot_encoder** 를 활용하여 데이터를 컨버팅/ 복원" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 1. 0. 0. 1. 0.]]\n" ] } ], "source": [ "# 위에서 매팽한 table 을 사용하여 새로운 데이터 인코딩\n", "new_dict = [{'interest': 'sports', 'occupation': 'retired'}]\n", "new_encoded = dict_one_hot_encoder.transform(new_dict)\n", "print(new_encoded)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'interest=sports': 1.0, 'occupation=retired': 1.0}]\n" ] } ], "source": [ "# new_encoded 인코딩 데이터를 원본형태로 되돌린다\n", "print(dict_one_hot_encoder.inverse_transform(new_encoded))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **03 Learning New Map Data**\n", "1. **new_encoded :** 새로운 매핑 데이터 추가하면, 결과적으로 **무시된다**\n", "1. 두개의 **dict** 데이터 중 **없는건 제외하고 나머지만 Converting** 된다" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 0. 0. 0. 1. 0.]\n", " [0. 0. 1. 0. 0. 0.]]\n" ] } ], "source": [ "# 1개의 인덱스에 포함된 2개의 Dict 중, 1개만 converting 된다\n", "new_dict = [{'interest': 'unknown_interest', 'occupation': 'retired'},\n", " {'interest': 'tech', 'occupation': 'unseen_occupation'}]\n", "new_encoded = dict_one_hot_encoder.transform(new_dict)\n", "print(new_encoded)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **04 LabelEncoder 를 활용한 One-Hot-Encoding**\n", "1. **X_int** : One Hot 의 **인덱스값을** 출력한다\n", "1. 
This representation is more concise and easier to tell apart" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[5 1]\n", " [0 4]\n", " [0 1]\n", " [3 4]\n", " [5 4]\n", " [5 2]\n", " [3 1]]\n" ] } ], "source": [ "import numpy as np\n", "X_str = np.array([['tech', 'professional'],\n", "                  ['fashion', 'student'],\n", "                  ['fashion', 'professional'],\n", "                  ['sports', 'student'],\n", "                  ['tech', 'student'],\n", "                  ['tech', 'retired'],\n", "                  ['sports', 'professional']])\n", "\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "label_encoder = LabelEncoder()\n", "X_int = label_encoder.fit_transform(X_str.ravel()).reshape(*X_str.shape)\n", "print(X_int)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 0. 1. 1. 0. 0.]\n", " [1. 0. 0. 0. 0. 1.]\n", " [1. 0. 0. 1. 0. 0.]\n", " [0. 1. 0. 0. 0. 1.]\n", " [0. 0. 1. 0. 0. 1.]\n", " [0. 0. 1. 0. 1. 0.]\n", " [0. 1. 0. 1. 0. 0.]]\n" ] } ], "source": [ "# Convert X_int into the one-hot encoded X_encoded\n", "one_hot_encoder = OneHotEncoder()\n", "X_encoded = one_hot_encoder.fit_transform(X_int).toarray()\n", "print(X_encoded)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 0. 0. 0. 1. 0.]\n", " [0. 0. 1. 0. 0. 0.]\n", " [0. 0. 0. 0. 0. 
0.]]\n" ] } ], "source": [ "# Values that were not part of the fitted mapping are ignored, as above\n", "new_str = np.array([['unknown_interest', 'retired'],\n", "                    ['tech', 'unseen_occupation'],\n", "                    ['unknown_interest', 'unseen_occupation']])\n", "\n", "def string_to_dict(columns, data_str):\n", "    data_dict = []\n", "    for sample_str in data_str:\n", "        data_dict.append({column : value for column, value in zip(columns, sample_str)})\n", "    return data_dict\n", "\n", "columns = ['interest', 'occupation']\n", "new_encoded = dict_one_hot_encoder.transform(string_to_dict(columns, new_str))\n", "print(new_encoded)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **2 Logistic Regression Classifier**\n", "1. The logistic function $y(z) = \\frac{1}{1+\\exp(-z)}$ maps **any real value to a value between 0 and 1**\n", "1. The algorithm **scales well** to large datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **01 How Logistic Regression Works**\n", "Like the naive Bayes classifier, logistic regression is a **probability-based classifier**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Define the logistic (sigmoid) function\n", "import numpy as np\n", "\n", "def sigmoid(input):\n", "    return 1.0 / (1 + np.exp(-input))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "# -8~8 사이의 값으로 로지스틱 회귀모델을 구현\n", "import matplotlib.pyplot as plt\n", "plt.figure(figsize=(5,3))\n", "z = np.linspace(-8, 8, 1000)\n", "y = sigmoid(z)\n", "plt.plot(z, y)\n", "plt.axhline(y=0, ls='dotted', color='k')\n", "plt.axhline(y=0.5, ls='dotted', color='k')\n", "plt.axhline(y=1, ls='dotted', color='k')\n", "plt.yticks([0.0, 0.25, 0.5, 0.75, 1.0])\n", "plt.xlabel('z'); plt.ylabel('y(z)'); plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **02 MSE를 최소로 하는 로지스틱 회귀**\n", "비용함수를 최소로(실질적으로는 **MSE기반의 비용함수를** 최소로)하는 값들을 예측한다" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/python/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log\n", " This is separate from the ipykernel package so we can avoid doing imports until\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot sample cost vs y_hat (prediction), for y (truth) = 1\n", "y_hat = np.linspace(0, 1, 1000)\n", "cost = -np.log(y_hat)\n", "plt.figure(figsize=(4,3))\n", "plt.plot(y_hat, cost)\n", "plt.xlabel('Prediction'); plt.ylabel('Cost')\n", "plt.xlim(0, 1); plt.ylim(0, 7); plt.show()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/python/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log\n", " This is separate from the ipykernel package so we can avoid doing imports until\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot sample cost vs y_hat (prediction), for y (truth) = 0\n", "y_hat = np.linspace(0, 1, 1000)\n", "cost = -np.log(1 - y_hat)\n", "plt.figure(figsize=(4,3))\n", "plt.plot(y_hat, cost)\n", "plt.xlabel('Prediction'); plt.ylabel('Cost')\n", "plt.xlim(0, 1); plt.ylim(0, 7); plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **3 Logistic Regression with Gradient Descent**\n", "1. The log loss is convex, so gradient descent can find the weights that minimize it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **01 Defining Logistic Regression with Gradient Descent**\n", "The functions below compute predictions, update the weights by gradient descent, compute the cost, and train the model" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Compute predictions with the current weights\n", "def compute_prediction(X, weights):\n", "    z = np.dot(X, weights)\n", "    predictions = sigmoid(z)\n", "    return predictions\n", "\n", "# One gradient descent step: update the weights using all training samples\n", "def update_weights_gd(X_train, y_train, weights, learning_rate):\n", "    predictions = compute_prediction(X_train, weights)\n", "    weights_delta = np.dot(X_train.T, y_train - predictions)\n", "    m = y_train.shape[0]\n", "    weights += learning_rate / float(m) * weights_delta\n", "    return weights # updated weights (numpy.ndarray)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Compute the cost (mean log loss)\n", "def compute_cost(X, y, weights):\n", "    predictions = compute_prediction(X, weights)\n", "    cost = np.mean(-y * np.log(predictions) - (1-y) * np.log(1-predictions))\n", "    return cost # float\n", "\n", "# Train the logistic regression model\n", "def train_logistic_regression(X_train, y_train, max_iter, learning_rate, fit_intercept=False):\n", "    if fit_intercept:\n", "        intercept = np.ones((X_train.shape[0], 1))\n", "        # np.hstack() joins arrays with the same number of rows side by side\n", "        X_train_np = np.hstack((intercept, X_train))\n", "    else:\n", "        X_train_np = X_train # without this branch the name would be undefined\n", "    weights = np.zeros(X_train_np.shape[1])\n", "    for iteration in range(max_iter):\n", "        weights = update_weights_gd(X_train_np, y_train, weights, learning_rate)\n", "        if iteration % 1000 == 0: # report the cost every 1,000 iterations\n", "            print(\"{:,}th Logistic Cost : {:.5f}\".format(\n", "                iteration, compute_cost(X_train_np, y_train, weights)))\n", "    return weights" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Predict results for new data with the trained weights\n", "def predict(X, weights):\n", "    if 
X.shape[1] == weights.shape[0] - 1:\n", "        intercept = np.ones((X.shape[0], 1))\n", "        X = np.hstack((intercept, X))\n", "    return compute_prediction(X, weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **02 Training the Model on Example Data**\n", "1. The model is trained **with an intercept term** included in the weights\n", "1. The **learning rate is 0.1** and training runs for **10,000 iterations**, reporting the cost every 1,000" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0th Logistic Cost : 0.57440\n", "1,000th Logistic Cost : 0.00395\n", "2,000th Logistic Cost : 0.00202\n", "3,000th Logistic Cost : 0.00136\n", "4,000th Logistic Cost : 0.00103\n", "5,000th Logistic Cost : 0.00082\n", "6,000th Logistic Cost : 0.00069\n", "7,000th Logistic Cost : 0.00059\n", "8,000th Logistic Cost : 0.00052\n", "9,000th Logistic Cost : 0.00046\n" ] } ], "source": [ "# The cost keeps decreasing as the number of iterations grows\n", "X_train = np.array([[6, 7],[2, 4],[3, 6],[4, 7],[1, 6],\n", "                    [5, 2],[2, 0],[6, 3],[4, 1],[7, 2]])\n", "y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])\n", "weights = train_logistic_regression(X_train, y_train,\n", "                                    max_iter = 10000,\n", "                                    learning_rate = 0.1,\n", "                                    fit_intercept = True)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([9.99999394e-01, 8.71880199e-04, 9.96881227e-01, 3.66361408e-03])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test = np.array([[6, 1],[1, 3],[3, 1],[4, 5]])\n", "predictions = predict(X_test, weights)\n", "predictions" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 분류 판단을 위한 임계치로 0.5를 설정하여 결과를 출력한다\n", "# Train 데이터로 학습한 모델이, 새로운 데이터에 대해서도 잘 적용됨을 볼 수 있다\n", "plt.scatter(X_train[:,0], X_train[:,1], \n", " marker = 'o',\n", " c = ['b'] * 5 + ['k'] * 5)\n", "\n", "colours = ['k' if prediction >= 0.5 else 'b' for prediction in predictions]\n", "plt.scatter(X_test[:,0], X_test[:,1], \n", " marker = '*',\n", " c = colours)\n", "plt.xlabel('x1'); plt.ylabel('x2'); plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **4 CTR Prediction with Gradient Descent and Logistic Regression**\n", "Click Through Rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **01 Training on 1K Samples**\n", "Train on the **first 1,000 samples** of the dataset and test on the **next 1,000**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'C1': '1005', 'C14': '15706', 'C15': '320', 'C16': '50', 'C17': '1722', 'C18': '0', 'C19': '35', 'C20': '-1', 'C21': '79', 'app_category': '07d7df22', 'app_domain': '7801e8d9', 'app_id': 'ecad2386', 'banner_pos': '0', 'device_conn_type': '2', 'device_model': '44956a24', 'device_type': '1', 'site_category': '28905ebd', 'site_domain': 'f3845767', 'site_id': '1fbe01fe'}\n", "{'C1': '1005', 'C14': '15704', 'C15': '320', 'C16': '50', 'C17': '1722', 'C18': '0', 'C19': '35', 'C20': '100084', 'C21': '79', 'app_category': '07d7df22', 'app_domain': '7801e8d9', 'app_id': 'ecad2386', 'banner_pos': '0', 'device_conn_type': '0', 'device_model': '711ee120', 'device_type': '1', 'site_category': '28905ebd', 'site_domain': 'f3845767', 'site_id': '1fbe01fe'}\n" ] } ], "source": [ "import csv\n", "# Read n samples starting at row `offset`; returns feature dicts and click labels\n", "def read_ad_click_data(n, offset=0):\n", "    X_dict, y = [], []\n", "    with open('./data/train.csv', 'r') as csvfile:\n", "        reader = csv.DictReader(csvfile)\n", "        for i in range(offset):\n", "            next(reader)\n", "        i = 0\n", "        for row in reader:\n", "            i += 1\n", "            y.append(int(row['click']))\n", "            del row['click'], row['id'], row['hour'], row['device_id'], row['device_ip']\n", "            X_dict.append(dict(row))\n", "            if i >= n: break\n", "    return X_dict, y\n", "\n", "n = 1000\n", "X_dict_train, y_train = read_ad_click_data(n)\n", "print(X_dict_train[0])\n", "print(X_dict_train[1])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0th Logistic Cost : 0.68107\n", "1,000th Logistic Cost : 0.41219\n", "2,000th Logistic Cost : 0.40069\n", "3,000th Logistic Cost : 0.39300\n", 
"4,000th Logistic Cost : 0.38696\n", "5,000th Logistic Cost : 0.38186\n", "6,000th Logistic Cost : 0.37740\n", "7,000th Logistic Cost : 0.37341\n", "8,000th Logistic Cost : 0.36979\n", "9,000th Logistic Cost : 0.36646\n", "--- 7.747s seconds ---\n" ] } ], "source": [ "# 데이터 학습을 위해 One-Hot-Encoding 객체로 임베딩\n", "from sklearn.feature_extraction import DictVectorizer\n", "dict_one_hot_encoder = DictVectorizer(sparse=False)\n", "X_train = dict_one_hot_encoder.fit_transform(X_dict_train)\n", "X_dict_test, y_test_1k = read_ad_click_data(n, n)\n", "X_test = dict_one_hot_encoder.transform(X_dict_test)\n", "\n", "X_train_1k = X_train\n", "y_train_1k = np.array(y_train)\n", "\n", "import timeit\n", "start_time = timeit.default_timer()\n", "weights = train_logistic_regression(X_train_1k, y_train_1k, max_iter=10000, learning_rate=0.01, fit_intercept=True)\n", "print(\"--- %0.3fs seconds ---\" % (timeit.default_timer() - start_time))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ROC AUC on testing set is: 0.663\n" ] } ], "source": [ "# 위에서 학습한 모델의 정확도 측정\n", "X_test_1k = X_test\n", "predictions = predict(X_test_1k, weights)\n", "from sklearn.metrics import roc_auc_score\n", "print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(y_test_1k, predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **02 SGD 그래디언트 하강기법을 사용**\n", "1. **update_weights_sgd()** 함수를 사용\n", "1. 
With **SGD**, train on the **first 1,000 samples** of the dataset and test on the **next 1,000**" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Weight update adapted for SGD: the weights change after every single sample\n", "def update_weights_sgd(X_train, y_train, weights, learning_rate):\n", "    for X_each, y_each in zip(X_train, y_train):\n", "        prediction = compute_prediction(X_each, weights)\n", "        weights_delta = X_each.T * (y_each - prediction)\n", "        weights += learning_rate * weights_delta\n", "    return weights" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Train logistic regression using update_weights_sgd()\n", "def train_logistic_regression(X_train, y_train, max_iter, learning_rate, fit_intercept=False):\n", "    if fit_intercept:\n", "        intercept = np.ones((X_train.shape[0], 1))\n", "        X_train = np.hstack((intercept, X_train))\n", "    weights = np.zeros(X_train.shape[1])\n", "    for iteration in range(max_iter):\n", "        weights = update_weights_sgd(X_train, y_train, weights, learning_rate)\n", "        # Check the cost for every 2 (for example) iterations\n", "        if iteration % 2 == 0:\n", "            print(\"{:,}th SGD Logistic : {:.5f}\".format(\n", "                iteration, compute_cost(X_train, y_train, weights)))\n", "    return weights" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0th SGD Logistic : 0.41983\n", "2th SGD Logistic : 0.40212\n", "4th SGD Logistic : 0.39185\n", "--- 0.155s seconds ---\n" ] } ], "source": [ "# Train the SGD model on the 1K samples\n", "start_time = timeit.default_timer()\n", "weights = train_logistic_regression(X_train_1k, y_train_1k, max_iter=5, learning_rate=0.01, fit_intercept=True)\n", "print(\"--- %0.3fs seconds ---\" % (timeit.default_timer() - start_time))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ROC AUC on testing set is: 0.672\n" ] } ], "source": [ "predictions = predict(X_test_1k, 
weights)\n", "print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(y_test_1k, predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **03 Training SGD on 10K Samples**\n", "1. Train on the **first 10,000 samples** of the dataset and test on the **next 10,000**\n", "1. SGD trains on 10,000 samples in under a second, versus about 7.7 seconds for batch gradient descent on 1,000 samples, and the test AUC rises to 0.720" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0th SGD Logistic : 0.41497\n", "2th SGD Logistic : 0.40601\n", "4th SGD Logistic : 0.40105\n", "--- 0.947s seconds ---\n" ] } ], "source": [ "n = 10000\n", "X_dict_train, y_train = read_ad_click_data(n)\n", "dict_one_hot_encoder = DictVectorizer(sparse=False)\n", "X_train = dict_one_hot_encoder.fit_transform(X_dict_train)\n", "\n", "X_train_10k = X_train\n", "y_train_10k = np.array(y_train)\n", "\n", "# Train the SGD model based on 10,000 samples\n", "start_time = timeit.default_timer()\n", "weights = train_logistic_regression(X_train_10k, y_train_10k, max_iter=5, learning_rate=0.01, fit_intercept=True)\n", "print(\"--- %0.3fs seconds ---\" % (timeit.default_timer() - start_time))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ROC AUC on testing set is: 0.720\n" ] } ], "source": [ "X_dict_test, y_test_10k = read_ad_click_data(10000, 10000)\n", "X_test_10k = dict_one_hot_encoder.transform(X_dict_test)\n", "\n", "predictions = predict(X_test_10k, weights)\n", "print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(y_test_10k, predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
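\n", "In contrast to the batch update, `update_weights_sgd` above adjusts the weights after every single sample. Sketched with $\\eta$ standing for `learning_rate` and $\\hat{y}^{(i)}$ for the prediction on sample $i$ (notation assumed here, not taken from the notebook):\n", "\n", "$$w \\leftarrow w + \\eta\\left(y^{(i)} - \\hat{y}^{(i)}\\right)x^{(i)} \\qquad \\text{for } i = 1, \\dots, m$$\n", "\n", "One pass over $m$ samples therefore performs $m$ weight updates instead of one, which helps explain why five passes already reach a cost comparable to thousands of batch iterations in the runs above.\n", "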
\n", "\n", "# **5 CTR Prediction with the scikit-learn SGD Algorithm**\n", "Using the scikit-learn module" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ROC AUC on testing set is: 0.721\n" ] } ], "source": [ "# Use scikit-learn package (newer scikit-learn releases name this loss 'log_loss')\n", "from sklearn.linear_model import SGDClassifier\n", "sgd_lr = SGDClassifier(loss='log', penalty=None,\n", "                       fit_intercept=True, max_iter=5,\n", "                       learning_rate='constant', eta0=0.01)\n", "sgd_lr.fit(X_train_10k, y_train_10k)\n", "\n", "\n", "predictions = sgd_lr.predict_proba(X_test_10k)[:, 1]\n", "print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(y_test_10k, predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **6 SGD with Regularization**\n", "**Feature selection** for the logistic regression model using **L1 regularization**" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,\n", "       eta0=0.01, fit_intercept=True, l1_ratio=0.15,\n", "       learning_rate='constant', loss='log', max_iter=5, n_iter=None,\n", "       n_jobs=1, penalty='l1', power_t=0.5, random_state=None,\n", "       shuffle=True, tol=None, verbose=0, warm_start=False)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l1_feature_selector = SGDClassifier(loss = 'log', penalty = 'l1',\n", "                                    alpha = 0.0001, fit_intercept = True,\n", "                                    max_iter = 5, learning_rate = 'constant',\n", "                                    eta0 = 0.01)\n", "l1_feature_selector.fit(X_train_10k, y_train_10k)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/python/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Original dataset: (10000, 2820)\n", "Features selected by Random Forest: (10000, 500)\n" ] } ], "source": [ "# Selecting features via l1_feature_selector.transform() does not work here, so it is commented out\n", "# X_train_10k_selected = l1_feature_selector.transform(X_train_10k)\n", "# X_train_10k_selected below is the Random Forest selection from section 9; run that cell first\n", "print(\"Original dataset:\", X_train_10k.shape)\n", "print(\"Features selected by Random Forest:\", X_train_10k_selected.shape)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-0.5962561 -0.44022485 -0.42428472 -0.42428472 -0.41595815 -0.41548047\n", " -0.31676318 -0.30903059 -0.30744771 -0.28089655]\n", "[ 559 2172 2566 2370 1540 34 579 2116 278 577]\n" ] } ], "source": [ "# The 10 most negative weights and the indices of the corresponding features\n", "print(np.sort(l1_feature_selector.coef_)[0][:10])\n", "print(np.argsort(l1_feature_selector.coef_)[0][:10])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.28423705 0.2842371 0.29318359 0.29969314 0.31062841 0.34092667\n", " 0.34649048 0.34906087 0.36057499 0.40919723]\n", "[2769 363 546 2275 547 2149 1503 2580 1519 2761]\n" ] } ], "source": [ "# The 10 most positive weights and the indices of the corresponding features\n", "print(np.sort(l1_feature_selector.coef_)[0][-10:])\n", "print(np.argsort(l1_feature_selector.coef_)[0][-10:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
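\n", "As a rough sketch of what `penalty='l1'` adds, the regularized cost can be written as below, where $J(w)$ denotes the unregularized log loss and $\\alpha$ corresponds to the `alpha` parameter (notation assumed here; scikit-learn's exact scaling may differ):\n", "\n", "$$J_{L1}(w) = J(w) + \\alpha\\sum_{j}\\left|w_j\\right|$$\n", "\n", "The absolute-value penalty tends to push many weights to exactly zero, which is what makes L1 regularization usable for feature selection. L2 regularization penalizes $\\frac{\\alpha}{2}\\sum_j w_j^2$ instead, which shrinks weights without zeroing them.\n", "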
\n", "\n", "# **7 Online Learning: Training on Large Datasets**\n", "Streaming data is handled in **chunks**: each **small chunk is preprocessed** and then used to update the model" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- 2.423s seconds ---\n" ] } ], "source": [ "# The number of iterations is set to 1 if using partial_fit.\n", "sgd_lr = SGDClassifier(loss='log', penalty=None, fit_intercept=True, max_iter=1, learning_rate='constant', eta0=0.01)\n", "\n", "import timeit\n", "start_time = timeit.default_timer()\n", "\n", "# there are 40428968 labelled samples; use the first 20 chunks of 1,000 samples for training, and the next 1,000 for testing\n", "for i in range(20):\n", "    X_dict_train, y_train_every_100k = read_ad_click_data(1000, i * 1000)\n", "    X_train_every_100k = dict_one_hot_encoder.transform(X_dict_train)\n", "    sgd_lr.partial_fit(X_train_every_100k, y_train_every_100k, classes=[0, 1])\n", "\n", "print(\"--- %0.3fs seconds ---\" % (timeit.default_timer() - start_time))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ROC AUC on testing set is: 0.694\n" ] } ], "source": [ "X_dict_test, y_test_next10k = read_ad_click_data(1000, (i + 1) * 1000)\n", "X_test_next10k = dict_one_hot_encoder.transform(X_dict_test)\n", "predictions = sgd_lr.predict_proba(X_test_next10k)[:, 1]\n", "print('The ROC AUC on testing set is: {0:.3f}'.format(roc_auc_score(y_test_next10k, predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **8 Multi-Class Classification**\n", "Model text classified into **20 categories** using **SGD**" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Helpers for preprocessing the newsgroups text\n", "from nltk.corpus import names\n", "from nltk.stem import WordNetLemmatizer\n", "all_names = set(names.words())\n", "lemmatizer = WordNetLemmatizer()\n", "\n", "def letters_only(astr):\n", "    for c in astr:\n", "        if not c.isalpha():\n", "            return False\n", "    return True\n", "\n", "def clean_text(docs):\n", "    cleaned_docs = []\n", "    for doc in docs:\n", "        cleaned_docs.append(' '.join([lemmatizer.lemmatize(word.lower())\n", "                                      for word in doc.split()\n", "                                      if letters_only(word)\n", "                                      and word not in all_names]))\n", "    return cleaned_docs" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Load the newsgroups data\n", "from sklearn.datasets import fetch_20newsgroups\n", "data_train = fetch_20newsgroups(subset='train', categories=None, random_state=42)\n", "data_test = fetch_20newsgroups(subset='test', categories=None, random_state=42)\n", "\n", "# Clean the text\n", "cleaned_train = clean_text(data_train.data)\n", "cleaned_test = clean_text(data_test.data)\n", "# Extract the labels\n", "label_train = data_train.target\n", "label_test = data_test.target" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Convert the cleaned text into TF-IDF features\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english', max_features=40000)\n", "term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)\n", "term_docs_test = tfidf_vectorizer.transform(cleaned_test)\n", "\n", "# Parameter grid for the grid search\n", "from sklearn.model_selection import GridSearchCV\n", "parameters = {'penalty': ['l2', None],\n", "              'alpha' : [1e-07, 1e-06, 1e-05, 1e-04],\n", "              'eta0' : [0.01, 0.1, 1, 10]}" ] }, { "cell_type": "code", "execution_count": 38, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'alpha': 1e-07, 'eta0': 10, 'penalty': 'l2'}\n" ] } ], "source": [ "# SGD 분류기를 활용하여 예측모델을 생성한다\n", "sgd_lr = SGDClassifier(loss='log', learning_rate='constant', eta0=0.01, fit_intercept=True, max_iter=10)\n", "grid_search = GridSearchCV(sgd_lr, parameters, n_jobs=-1, cv=3)\n", "\n", "grid_search.fit(term_docs_train, label_train)\n", "print(grid_search.best_params_)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The accuracy on testing set is: 79.6%\n" ] }, { "data": { "text/plain": [ "SGDClassifier(alpha=1e-07, average=False, class_weight=None, epsilon=0.1,\n", " eta0=10, fit_intercept=True, l1_ratio=0.15,\n", " learning_rate='constant', loss='log', max_iter=10, n_iter=None,\n", " n_jobs=1, penalty='l2', power_t=0.5, random_state=None,\n", " shuffle=True, tol=None, verbose=0, warm_start=False)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy = sgd_lr_best.score(term_docs_test, label_test)\n", "print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))\n", "grid_search.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# **9 Feature Selection with Random Forest**\n", "1. **feature_importances_** : reports the importance of each feature" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# Use a random forest to select the 500 most important features\n", "from sklearn.ensemble import RandomForestClassifier\n", "random_forest = RandomForestClassifier(n_estimators = 100,\n", "                                       criterion = 'gini',\n", "                                       min_samples_split = 30,\n", "                                       n_jobs = -1)\n", "random_forest.fit(X_train_10k, y_train_10k)\n", "\n", "top500_feature = np.argsort(random_forest.feature_importances_)[-500:]\n", "X_train_10k_selected = X_train_10k[:, top500_feature]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", "[2040 2764 2280 1896 1001 1454 756 135 2676 764]\n" ] } ], "source": [ "# The 10 lowest feature importances and the indices of the corresponding features\n", "print(np.sort(random_forest.feature_importances_)[:10])\n", "print(np.argsort(random_forest.feature_importances_)[:10])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.00755481 0.00772242 0.00798538 0.00818412 0.00886733 0.00905481\n", " 0.00942318 0.00986043 0.01424382 0.01465488]\n", "[2307 549 1284 1503 1540 1923 1085 314 554 393]\n" ] } ], "source": [ "# The 10 highest feature importances and their indices (ascending, most important last)\n", "print(np.sort(random_forest.feature_importances_)[-10:])\n", "print(np.argsort(random_forest.feature_importances_)[-10:])" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C18=2\n" ] } ], "source": [ "# The feature at index 393 has the highest importance\n", "print(dict_one_hot_encoder.feature_names_[393])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10000, 500)\n" ] } ], "source": [ "# Select the top 500 features and print the resulting shape\n", "top500_feature = 
np.argsort(random_forest.feature_importances_)[-500:]\n", "X_train_10k_selected = X_train_10k[:, top500_feature]\n", "print(X_train_10k_selected.shape)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }