{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Breast Cancer Wisconsin (Diagnostic) Data Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Attribute Information:\n", "\n", "1. ID number \n", "1. Diagnosis (M = malignant:1, B = benign:-1) M:悪性,B:良性\n", "2. 3-32\n", "\n", " * M:悪性(-1),B:良性(1) => M:悪性(1),B:良性(-1)\n", " * cancer detectorの評価なので,positiveは悪性であるべき.\n", "\n", "Ten real-valued features are computed for each cell nucleus: \n", "\n", "* 半径radius (mean of distances from center to points on the perimeter) \n", "* テクスチャtexture (standard deviation of gray-scale values) \n", "* 境界の長さperimeter \n", "* 面積area \n", "* なめらかさsmoothness (local variation in radius lengths) \n", "* コンパクトさcompactness (perimeter^2 / area - 1.0) \n", "* くぼみ度合いconcavity (severity of concave portions of the contour) \n", "* くぼみの数concave points (number of concave portions of the contour) \n", "* 対称性symmetry \n", "* フラクタル次元fractal dimension (\"coastline approximation\" - 1)\n", "\n", "のそれぞれのmean, stderr, worst数値を保持している.\n", "\n", "http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html\n", "\n", "## 分類器\n", "用意された訓練(training)データには,\n", "$\\boldsymbol{A}$に上に記した特徴量が,\n", "$\\boldsymbol{b}$に悪性(1, positive)か良性(-1, negative)かを\n", "示す数値が入っている.\n", "\n", "与えられた特徴ベクトル$\\boldsymbol{y}$に対し,\n", "細胞組織が悪性か良性かを分類する関数$C(\\boldsymbol{y})$を選び出すプログラムを作成しよう." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 仮説クラス\n", "分類器は可能な分類器の集合(**仮説クラス**)から選ばれる.この場合,仮説クラスとは特徴ベクトルの空間$\\mathbb{R}^D$から$\\mathbb{R}$への線形関数$h(\\cdot)$である.すると分類器は次のような関数として定義される.\n", "\n", "$$\n", "C(\\boldsymbol{y}) = \n", "\\left\\{ \\begin{array}{ccc}\n", "+1 & {\\rm when} & h(\\boldsymbol{y})\\geq 0\\\\\n", "-1 & {\\rm when} & h(\\boldsymbol{y})<0\n", "\\end{array} \\right.\n", "$$\n", "\n", "各線形関数$h:\\mathbb{R}^D \\rightarrow \\mathbb{R}$に対して,\n", "次のような$D$ベクトル$\\boldsymbol{w}$が存在する.\n", "$$\n", "h(\\boldsymbol{y}) = \\boldsymbol{w}\\cdot \\boldsymbol{y}\n", "$$\n", "したがって,そのような線形関数を選ぶことは,結局$D$ベクトル$\\boldsymbol{w}$を\n", "選ぶことに等しい.特に,$\\boldsymbol{w}$を選ぶことは,仮説クラス$h$を\n", "選ぶことと等価なので,$\\boldsymbol{w}$を**仮説ベクトル**と呼ぶ.\n", "\n", "問題を単純化すると,分類器を単なるベクトルとみなして,データとの掛け算で予測がつきます.本来は予測を-1,1とかに投影しないといけないんですが,単純化のためにそのままの値を用います.問題はどうやってこの仮説ベクトル$\\boldsymbol{w}$の各要素の値を決定するか?ですよね.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 最急降下法\n", "\n", "損失関数に\n", "$$\n", "L(w)=\\sum_{i=1}^n (A_i \\cdot \\boldsymbol{w} - b_i)^2\n", "$$\n", "を選ぶと,ベクトル$\\boldsymbol{w}$の\n", "$j$偏微分は,\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial L}{\\partial w_j} &= \n", "\\sum_{i=1}^n \\frac{\\partial}{\\partial w_j}(A_i \\cdot w -b_i)^2 \\\\\n", "&= \\sum_{i=1}^n 2(A_i \\cdot w -b_i) A_{ij}\n", "\\end{aligned}\n", "$$\n", "となる.\n", "ここで,$A_{ij}$は$A_i$の$j$番目の要素を意味する.\n", "この偏微分$\\frac{\\partial L}{\\partial w_j}$を\n", "$\\boldsymbol{w}_j$の勾配(slope)として,$L(w)$の極小値(local minimum)を求める.\n", "\n", "このような探索方法を最急降下法(steepest descent method)と呼ぶ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## print_w\n", "\n", "出てきた$\\boldsymbol{w}$の$j$要素をきれいに表示する関数を用意しておきます." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def print_w(w):\n", " import numpy as np\n", " params = [\n", " \"radius\", \"texture\", \"perimeter\", \"area\",\n", " \"smoothness\", \"compactness\", \"concavity\", \"concave_pts\",\n", " \"symmetry\", \"fractal_dim\"\n", " ]\n", " w = np.array(w).flatten()\n", " print(\"{:<13} {:>13} {:>13} {:>13}\".format(\"Parameter\", \"Mean\", \"StdErr\", \"Worst\"))\n", " for i, param in enumerate(params):\n", " mean, stderr, worst = w[i*3], w[i*3+1], w[i*3+2]\n", " print(\"{:<13} {:13.9f} {:13.9f} {:13.9f}\".format(param, mean, stderr, worst))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## データの読み込みと初期化" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "tmp = np.fromfile('./train_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(300,30)\n", "tmp = np.fromfile('./train_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(300,1)\n", "w = np.zeros(30).reshape(30,1)\n", "for i in range(30):\n", " w[i] = 0" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,\n", " 3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,\n", " 8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,\n", " 3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,\n", " 1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A[0]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [-1.]])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[0:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 最急降下法によるw探索(steepest descent)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter Mean StdErr Worst\n", "radius -0.000426997 -0.000741817 -0.002548876\n", "texture -0.001687946 -0.000004707 -0.000000127\n", "perimeter 0.000003968 0.000002078 -0.000008954\n", "area -0.000003595 -0.000002569 -0.000070324\n", "smoothness -0.000001139 0.000881778 -0.000000430\n", "compactness -0.000000441 -0.000000723 -0.000000267\n", "concavity -0.000001200 -0.000000191 -0.000411499\n", "concave_pts -0.000921972 -0.002395138 0.001932789\n", "symmetry -0.000005930 0.000003750 0.000008147\n", "fractal_dim 0.000002341 -0.000011565 -0.000003523\n" ] } ], "source": [ "loop, sigma = 300, 3.0*10**(-9)\n", "for i in range(loop):\n", " dLw = A.dot(w)-b\n", " w = w - (dLw.transpose().dot(A)).transpose()*sigma\n", "\n", "print_w(w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 結果" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def show_accuracy(mA, vb, vw):\n", " # M:悪性(1),B:良性(-1)\n", "\n", " correct,safe_error,critical_error=0,0,0\n", " predict = mA.dot(vw)\n", " n = vb.size\n", " for i in range(n):\n", " if predict[i]*vb[i]>0:\n", " correct += 1\n", " elif (predict[i]<0 and vb[i]>0): # 悪性なのに良性と予測:見落としcritical(fn)\n", " critical_error += 1\n", " elif (predict[i]>0 and vb[i]<0): # 良性なのに悪性と予測:再検査safe(fp)\n", " safe_error += 1\n", " print(\" correct(true predicttion): %4d/%4d\" % (correct,n))\n", " print(\" safe error(false positive): %4d\" % safe_error)\n", " print(\"critical error(false negative): %4d\" % critical_error)\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def show_accuracy2(mA, vb, vw):\n", " # M:悪性(1) positive,B:良性(-1) negative\n", " # true(真) , false(偽)\n", " tp, tn, fp, fn = 0, 0, 0, 0\n", " predict = mA.dot(vw)\n", " n = vb.size\n", " for i in range(n):\n", " if predict[i]*vb[i]>0:\n", " if predict[i]>0:\n", " tp += 1\n", " else:\n", " tn += 1\n", " elif (predict[i]<0 and vb[i]>0): # 悪性なのに良性と予測:見落としcritical(fn)\n", " fn += 1\n", " elif (predict[i]>0 and vb[i]<0): # 良性なのに悪性と予測:再検査safe(fp)\n", " fp += 1\n", " print(\"%21s %12s %10s\" % (\"predict\",\"positive\", \"negative\"))\n", " print(\"%21s | %10d %10d\" % (\"actual positive\",tp, fn))\n", " print(\"%21s | %10d %10d\" % (\"actual negative\",fp, tn))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct(true predicttion): 274/ 300\n", " safe error(false positive): 5\n", "critical error(false negative): 21\n" ] } ], "source": [ "show_accuracy(A, b, w)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct(true predicttion): 248/ 269\n", " safe error(false positive): 10\n", "critical error(false negative): 11\n" ] } ], "source": [ "tmp = np.fromfile('./validate_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(269,30)\n", "tmp = np.fromfile('./validate_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(269,1)\n", "\n", "show_accuracy(A, b, w)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " predict positive negative\n", " actual positive | 55 11\n", " actual negative | 10 193\n" ] } ], "source": [ "show_accuracy2(A, b, w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# QR decomposition\n", "\n", "QR分解を使うとより簡単に最小値を求めることができる.\n", "行列$A$は正方行列でないので,逆行列をもとめることができない.\n", "しかし,その場合でも$||A.w -b ||^2$を最小にする$w$を求めることができる.\n", "\n", "QR分解によって,$n \\times m$行列は\n", "$$\n", "A = QR\n", "$$\n", "と分解される.ここで,Qは$n \\times m$行列,Rは$m \\times m$の正方行列で,逆行列を求めることができる.\n", "\n", "$|| Aw - b ||$が最小となるのはQRを使って,\n", "$$\n", "Q.R.w=b \\\\\n", "R.w = Q^t.b \\\\\n", "R^{-1}.R.w = R^{-1}.Q^t.b\n", "$$\n", "となりそう." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "tmp = np.fromfile('./train_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(300,30)\n", "tmp = np.fromfile('./train_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(300,1)\n", "\n", "q, r = np.linalg.qr(A)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "ww = np.linalg.inv(r).dot(np.transpose(q).dot(b))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(300, 30)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q.shape" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-2.57579883e+02 -3.32324268e+02 -1.68607899e+03 -1.29450676e+04\n", " -1.65446346e+00]\n" ] } ], "source": [ "print(r[0,0:5])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct(true predicttion): 286/ 300\n", " safe error(false positive): 1\n", "critical error(false negative): 13\n", " predict positive negative\n", " actual positive | 133 13\n", " actual negative | 1 153\n" ] } ], "source": [ "show_accuracy(A, b, ww)\n", "show_accuracy2(A, b, ww)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter Mean StdErr Worst\n", "radius -0.869922200 0.024313945 0.062679604\n", "texture 0.003274620 8.790304633 -1.747147681\n", "perimeter 0.202847615 6.506449469 -5.061760150\n", "area -49.167544936 0.956592244 0.082052701\n", "smoothness 0.007943079 -0.004976910 27.841937812\n", "compactness -3.301518835 -4.985956723 16.318887300\n", "concavity -10.316290100 21.332168405 0.408605799\n", "concave_pts 0.003345717 0.000677877 -0.002510735\n", "symmetry -4.531369160 -0.590110921 0.719368538\n", "fractal_dim 2.158967280 3.803466937 12.298424597\n" ] } ], "source": [ "print_w(ww)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct(true predicttion): 261/ 269\n", " safe error(false positive): 6\n", "critical error(false negative): 2\n", " predict positive negative\n", " actual positive | 64 2\n", " actual negative | 6 197\n" ] } ], "source": [ "tmp = np.fromfile('./validate_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(269,30)\n", "tmp = np.fromfile('./validate_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(269,1)\n", "\n", "show_accuracy(A, b, ww)\n", "show_accuracy2(A, b, ww)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": { "height": "12px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }