{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  Breast Cancer Wisconsin (Diagnostic) Data Set
1.1  Attribute Information:
1.2  分類器
1.3  仮説クラス
2  最急降下法
3  code(python)
3.1  print_w
3.2  データの読み込みと初期化
3.3  最急降下法によるw探索(steepest descent)
4  結果
5  QR decomposition
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
\n", "Breast Cancer Detector\n", "
\n", "
\n", "
\n", "cc by Shigeto R. Nishitani 2018-19 \n", "
\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Breast Cancer Wisconsin (Diagnostic) Data Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Attribute Information:\n", "\n", "1. ID number \n", "1. Diagnosis (M = malignant, B = benign) M:悪性,B:良性\n", "1. 3-32\n", "\n", "Ten real-valued features are computed for each cell nucleus: \n", "\n", "* 半径radius (mean of distances from center to points on the perimeter) \n", "* テクスチャtexture (standard deviation of gray-scale values) \n", "* 境界の長さperimeter \n", "* 面積area \n", "* なめらかさsmoothness (local variation in radius lengths) \n", "* コンパクトさcompactness (perimeter^2 / area - 1.0) \n", "* くぼみ度合いconcavity (severity of concave portions of the contour) \n", "* くぼみの数concave points (number of concave portions of the contour) \n", "* 対称性symmetry \n", "* フラクタル次元fractal dimension (\"coastline approximation\" - 1)\n", "\n", "http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html\n", "\n", "## 分類器\n", "与えられた特徴ベクトル$\\boldsymbol{y}$に対し,\n", "細胞組織が悪性か良性かを分類する関数$C(\\boldsymbol{y})$を選び出すプログラムを作成しよう." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 仮説クラス\n", "分類器は可能な分類器の集合(**仮説クラス**)から選ばれる.この場合,仮説クラスとは特徴ベクトルの空間$\\mathbb{R}^D$から$\\mathbb{R}$への線形関数$h(\\cdot)$である.すると分類器は次のような関数として定義される.\n", "\n", "$$\n", "C(\\boldsymbol{y}) = \n", "\\left\\{ \\begin{array}{ccc}\n", "+1 & {\\rm when} & h(\\boldsymbol{y})\\geq 0\\\\\n", "-1 & {\\rm when} & h(\\boldsymbol{y})<0\n", "\\end{array} \\right.\n", "$$\n", "\n", "各線形関数$h:\\mathbb{R}^D \\rightarrow \\mathbb{R}$に対して,\n", "次のような$D$ベクトル$\\boldsymbol{w}$が存在する.\n", "$$\n", "h(\\boldsymbol{y}) = \\boldsymbol{w}\\cdot \\boldsymbol{y}\n", "$$\n", "したがって,そのような線形関数を選ぶことは,結局$D$ベクトル$\\boldsymbol{w}$を\n", "選ぶことに等しい.特に,$\\boldsymbol{w}$を選ぶことは,仮説クラス$h$を\n", "選ぶことと等価なので,$\\boldsymbol{w}$を**仮説ベクトル**と呼ぶ.\n", "\n", "単に,ベクトルの掛け算で分類器はできそう.問題はどうやってこの仮説ベクトル$\\boldsymbol{w}$の各要素の値を決定するか?ですよね.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 最急降下法\n", "\n", "損失関数に\n", "$$\n", "L(w)=\\sum_{i=1}^n (A_i \\cdot \\boldsymbol{w} - b_i)^2\n", "$$\n", "を選ぶと,ベクトル$\\boldsymbol{w}$の$j$偏微分は,\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial L}{\\partial w_j} &= \n", "\\sum_{i=1}^n \\frac{\\partial}{\\partial w_j}(A_i \\cdot w -b_i)^2 \\\\\n", "&= \\sum_{i=1}^n 2(A_i \\cdot w -b_i) A_{ij}\n", "\\end{aligned}\n", "$$\n", "となる.\n", "ここで,$A_{ij}$は$A_i$の$j$番目の要素です.\n", "この偏微分$\\frac{\\partial L}{\\partial w_j}$を\n", "$\\boldsymbol{w}_j$の勾配(slope)として,$L(w)$の極小値(local minimum)を求める.\n", "\n", "このような探索方法を最急降下法(steepest descent method)と呼ぶ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# code(python)\n", "\n", "* /Users/bob/python/doing_math_with_python/numerical_calc/breast_cancer_detector/codes/python.ipynb\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## print_w\n", "\n", "出てきた$\\boldsymbol{w}$の$j$要素をきれいに表示する関数を用意しておきます." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def print_w(w):\n", " params = [\"radius\", \"texture\",\"perimeter\",\"area\",\n", " \"smoothness\",\"compactness\",\"concavity\",\"concave points\",\n", " \"symmetry\",\"fractal dimension\"]\n", " print(\" (params) : \",end=\"\")\n", " print(\" (mean) (stderr) (worst)\")\n", " for i, param in enumerate(params):\n", " print(\"%18s:\" %param, end=\"\")\n", " for j in range(3):\n", " print(\"%13.9f\" % w[i*3+j], end=\"\")\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## データの読み込みと初期化" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(30, 1)\n" ] } ], "source": [ "import numpy as np\n", "tmp = np.fromfile('./codes/train_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(300,30)\n", "tmp = np.fromfile('./codes/train_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(300,1)\n", "w = np.zeros(30).reshape(30,1)\n", "for i in range(30):\n", " w[i] = 0.0\n", "print(w.shape)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-1.],\n", " [ 1.],\n", " [-1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [ 1.],\n", " [-1.],\n", " [ 1.],\n", " [ 1.]])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[0:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 最急降下法によるw探索(steepest descent)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (params) : (mean) (stderr) (worst)\n", " radius: 0.000426997 0.000741817 0.002548876\n", " texture: 0.001687946 0.000004707 0.000000127\n", " perimeter: -0.000003968 -0.000002078 0.000008954\n", " area: 0.000003595 0.000002569 0.000070324\n", " smoothness: 0.000001139 -0.000881778 0.000000430\n", " compactness: 0.000000441 0.000000723 0.000000267\n", " concavity: 0.000001200 0.000000191 0.000411499\n", " concave points: 0.000921972 0.002395138 -0.001932789\n", " symmetry: 0.000005930 -0.000003750 -0.000008147\n", " fractal dimension: -0.000002341 0.000011565 0.000003523\n" ] } ], "source": [ "loop, sigma = 300, 3.0*10**(-9)\n", "for i in range(loop):\n", " dLw = A.dot(w)-b\n", " w = w - (dLw.transpose().dot(A)).transpose()*sigma\n", "\n", "print_w(w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 結果" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def show_accuracy(mA, vb, vw):\n", " # M:悪性(-1),B:良性(1)\n", "\n", " correct,safe_error,critical_error=0,0,0\n", " predict = mA.dot(vw)\n", " n = vb.size\n", " for i in range(n):\n", " if predict[i]*vb[i]>0:\n", " correct += 1\n", " elif (predict[i]<0 and vb[i]>0):\n", " safe_error += 1\n", " elif (predict[i]>0 and vb[i]<0):\n", " critical_error += 1\n", " print(\" correct: %4d/%4d\" % (correct,n))\n", " print(\" safe error: %4d\" % safe_error)\n", " print(\"critical error: %4d\" % critical_error)\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct: 274/ 300\n", " safe error: 5\n", "critical error: 21\n" ] } ], "source": [ "show_accuracy(A, b, w)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct: 240/ 260\n", " safe error: 10\n", "critical error: 10\n" ] } ], "source": [ "tmp = np.fromfile('./codes/validate_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(260,30)\n", "tmp = np.fromfile('./codes/validate_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(260,1)\n", "\n", "show_accuracy(A, b, w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# QR decomposition\n", "\n", "QR分解を使うとより簡単に最小値を求めることができる.\n", "行列$A$は正方行列でないので,逆行列をもとめることができない.\n", "しかし,その場合でも$||A.w -b ||^2$を最小にする$w$を求めることができる.\n", "\n", "QR分解によって,$n \\times m$行列は\n", "$$\n", "A = QR\n", "$$\n", "と分解される.ここで,Qは$n \\times m$行列,Rは$m \\times m$の正方行列.逆行列を求めることができる.\n", "\n", "$|| Aw - b ||$がzeroとなるのはQRを使って,\n", "$$\n", "Q.R.w=b \\\\\n", "R.w = Q^t.b \\\\\n", "R^{-1}.R.w = R^{-1}.Q^t.b\n", "$$\n", "となりそう." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "tmp = np.fromfile('./codes/train_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(300,30)\n", "tmp = np.fromfile('./codes/train_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(300,1)\n", "\n", "q, r = np.linalg.qr(A)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "ww = np.linalg.inv(r).dot(np.transpose(q).dot(b))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(300, 30)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q.shape" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ -2.57579883e+02 -3.32324268e+02 -1.68607899e+03 -1.29450676e+04\n", " -1.65446346e+00]\n" ] } ], "source": [ "print(r[0,0:5])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct: 286/ 300\n", " safe error: 1\n", "critical error: 13\n" ] } ], "source": [ "show_accuracy(A, b, ww)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (params) : (mean) (stderr) (worst)\n", " radius: 0.869921844 -0.024313948 -0.062679561\n", " texture: -0.003274619 -8.790300861 1.747147500\n", " perimeter: -0.202849407 -6.506451098 5.061760446\n", " area: 49.167541566 -0.956591421 -0.082052658\n", " smoothness: -0.007943157 0.004976908-27.841944367\n", " compactness: 3.301527110 4.985959134-16.318886295\n", " concavity: 10.316289081-21.332232171 -0.408605816\n", " concave points: -0.003345722 -0.000677873 0.002510735\n", " symmetry: 4.531369718 0.590110016 -0.719368704\n", " fractal dimension: -2.158965299 -3.803467225-12.298417038\n" ] } ], "source": [ "print_w(ww)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " correct: 252/ 260\n", " safe error: 6\n", "critical error: 2\n" ] } ], "source": [ "tmp = np.fromfile('./codes/validate_A.data', np.float64, -1, \" \")\n", "A = tmp.reshape(260,30)\n", "tmp = np.fromfile('./codes/validate_b.data', np.float64, -1, \" \")\n", "b = tmp.reshape(260,1)\n", "\n", "show_accuracy(A, b, ww)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }