{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# %load /Users/facai/Study/book_notes/preconfig.py\n", "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from IPython.display import SVG" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "逻辑回归在spark中的实现简介\n", "=======================" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "分析用的代码版本信息:\n", "\n", "```bash\n", "~/W/g/spark ❯❯❯ git log -n 1\n", "commit d9ad78908f6189719cec69d34557f1a750d2e6af\n", "Author: Wenchen Fan \n", "Date: Fri May 26 15:01:28 2017 +0800\n", "\n", " [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0. 总纲\n", "\n", "下图是ml包中逻辑回归的构成情况:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "LogisticRegression+train()ProbabilisticClassifierInstanceInstrumentationMultivariateOnlineSummarizerMultiClassSummarizerMetadataUtils+getNumClasses()LogisticCostFun+calculate()DiffFunction+calculate()LogisticAggregator+gradient+loss+add()+merge()LBFGSQWLQNCachedDiffFunctionFirstOrderMinimizer+State+iterations()" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SVG(\"./res/spark_ml_lr.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可以看到,逻辑回归是比较简单的,在它的`train`函数里,除开左侧的几个辅助类:\n", "\n", "+ Instance: 封装数据\n", "+ MetadataUtils: 数据信息\n", "+ Instrumentation: 日志 \n", "+ Multi*Summarizer: 统计\n", "\n", "主要就是做两件事:\n", "\n", "+ 构造损失函数 => costFun: DiffFunction\n", "+ 创建寻优算子 => optimizer: FirstOrderMinizer \n", " ml里两种算子都是拟牛顿法,理论上比SGD迭代更少,收敛更快。其中QWLQN是LBFGS的变种,可使用L1正则。\n", " \n", "接下来,我们就将精力放在这两件事的实现上。这里寻优算子主要是根据正则确定的,而损失函数会由二分类和多分类而有所变化,下面一一叙迖述。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. 寻优算子\n", "\n", "jupyter的markdown,无法正确处理`$`取值语法,所以做了点小变动。\n", "\n", "```scala\n", " 645 val optimizer = if (elasticNetParam == 0.0 || regParam == 0.0) {\n", " 646 // +-- 4 lines: if (lowerBounds != null && upperBounds != null) {--------------------------------------\n", " 650 new BreezeLBFGS[BDV[Double]](maxIter, 10, tol)\n", " 651 }\n", " 652 } else {\n", " 653 val standardizationParam = standardization\n", " 654 def regParamL1Fun = (index: Int) => {\n", " 655 // +-- 2 lines: Remove the L1 penalization on the intercept--------------------------------------------\n", " 657 if (isIntercept) {\n", " 658 0.0\n", " 659 } else {\n", " 660 if (standardizationParam) {\n", " 661 regParamL1\n", " 662 } else {\n", " 663 val featureIndex = index / numCoefficientSets\n", " 664 // +-- 5 lines: If `standardization` is false, we still standardize the data---------------------------\n", " 669 if (featuresStd(featureIndex) != 0.0) {\n", " 670 regParamL1 / featuresStd(featureIndex)\n", " 671 } else {\n", " 672 0.0\n", " 673 }\n", " 674 }\n", " 675 }\n", " 676 }\n", " 677 new BreezeOWLQN[Int, BDV[Double]](maxIter, 10, regParamL1Fun, $(tol))\n", " 678 }\n", "```\n", "\n", "可以看到,逻辑很简单:如果不用正则,或只用L2,就用LBFGS算子;如果用到L1正则,就用QWLQN算子。其中下半代码均是在折算合适的L1正则值。\n", "\n", "因为QWLQN会自己处理L1正则,所以在接下来的损失函数计算中,我们只考虑L2正则,而不管L1。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
损失函数\n", "#### 2.1 二分类\n", "\n", "预测公式:$f(x) = \\frac1{1 + e^{w^T x}}$\n", "\n", "[损失函数](http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression)定义是:\n", "\n", "\\begin{equation}\n", "L(w;x,y) = \\log(1+e^{-y w^T x}) + r_2 \\cdot \\frac{1}{2} w^T w + r_1 \\cdot \\|w\\|\n", "\\end{equation}\n", "\n", "[导数是](http://spark.apache.org/docs/latest/mllib-linear-methods.html#loss-functions):\n", "\n", "\\begin{align}\n", " \\frac{\\partial L}{\\partial w} &= -y \\left(1-\\frac1{1+e^{-y w^T x}} \\right) \\cdot x + r_2 w \\pm r_1 \\\\\n", " &= \\left ( \\frac{y}{1+e^{-y w^T x}} - y \\right ) \\cdot x + r_2 w \\pm r_1 \\\\\n", " \\text{因为$y$只有1和-1两值,可简化为} \\\\\n", " &= \\left ( \\frac{1}{1+e^{-w^T x}} - y \\right ) \\cdot x + r_2 w \\pm r_1 \\\\\n", " &= \\left ( f(x) - y \\right ) \\cdot x + r_2 w \\pm r_1\n", "\\end{align}\n", "\n", "好,我们先看没有正则的计算,在LogisticAggregator类里:\n", "\n", "```scala\n", "1670 /** Update gradient and loss using binary loss function. */\n", "1671 private def binaryUpdateInPlace(\n", "1672 features: Vector,\n", "1673 weight: Double,\n", "1674 label: Double): Unit = {\n", "1675 +-- 4 lines: val localFeaturesStd = bcFeaturesStd.value----------\n", "1679 val margin = - {\n", "1680 var sum = 0.0\n", "1681 features.foreachActive { (index, value) =>\n", "1682 if (localFeaturesStd(index) != 0.0 && value != 0.0) {\n", "1683 sum += localCoefficients(index) * value / localFeaturesStd(index)\n", "1684 }\n", "1685 }\n", "1686 if (fitIntercept) sum += localCoefficients(numFeaturesPlusIntercept - 1)\n", "1687 sum\n", "1688 }\n", "1689\n", "1690 val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label)\n", "1691\n", "1692 features.foreachActive { (index, value) =>\n", "1693 if (localFeaturesStd(index) != 0.0 && value != 0.0) {\n", "1694 localGradientArray(index) += multiplier * value / localFeaturesStd(index)\n", "1695 }\n", "1696 }\n", "1697\n", "1698 if (fitIntercept) {\n", "1699 localGradientArray(numFeaturesPlusIntercept - 1) += multiplier\n", "1700 }\n", "1701\n", "1702 if (label > 0) {\n", "1703 // The following is equivalent to log(1 + exp(margin)) but more numerically stable.\n", "1704 lossSum += weight * MLUtils.log1pExp(margin)\n", "1705 } else {\n", "1706 lossSum += weight * (MLUtils.log1pExp(margin) - margin)\n", "1707 }\n", "1708 }\n", "```\n", "\n", "其中,\n", "+ margin = $-w^T x$ \n", " 注意:这里用的$x / \\operatorname{std}(x)$,相当于归一化,统一量纲。很奇怪,没有同时移动坐标,我不清楚是否合理。\n", "+ multiplier = $\\frac1{1 + e^{w^T x}} - y$ = $f(x) - y$\n", "+ localGradientArray = $(f(x) - y) x$\n", "+ lossSum = $\\log(1+e^{-y w^T x})$。注意:因为margin计算时是$y=1$,所以1706L,对$y=-1$做了变换。数学技巧比较简单:\n", "\n", "\\begin{align}\n", " log(1 + e^x) - x &= log(1 + e^x) - log(e^x) \\\\\n", " &= log(\\frac{1 + e^x}{e^x}) \\\\\n", " &= log(1 + e^{-x})\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "再在损失函数和偏导里,均加上L2的部份,代码在LogisticCostFun类的calculate方法里:\n", "\n", "```scala\n", "1877 override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {\n", "1878 // +-- 6 lines: val coeffs = Vectors.fromBreeze(coefficients)------\n", "1884\n", "1885 val logisticAggregator = {\n", "1886 // +-- 3 lines: val seqOp = (c: LogisticAggregator, instance: Instance) =>\n", "1889 instances.treeAggregate(\n", "1890 new LogisticAggregator(bcCoeffs, bcFeaturesStd, numClasses, fitIntercept,\n", "1891 multinomial)\n", "1892 )(seqOp, combOp, aggregationDepth)\n", "1893 }\n", "1894\n", "1895 val totalGradientMatrix = logisticAggregator.gradient\n", "1896 val coefMatrix = new 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, the L2 term is added to both the loss and the gradient, in the calculate method of the LogisticCostFun class:\n", "\n", "```scala\n", "1877 override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {\n", "1878 // +-- 6 lines: val coeffs = Vectors.fromBreeze(coefficients)------\n", "1884\n", "1885 val logisticAggregator = {\n", "1886 // +-- 3 lines: val seqOp = (c: LogisticAggregator, instance: Instance) =>\n", "1889 instances.treeAggregate(\n", "1890 new LogisticAggregator(bcCoeffs, bcFeaturesStd, numClasses, fitIntercept,\n", "1891 multinomial)\n", "1892 )(seqOp, combOp, aggregationDepth)\n", "1893 }\n", "1894\n", "1895 val totalGradientMatrix = logisticAggregator.gradient\n", "1896 val coefMatrix = new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, coeffs.toArray)\n", "1897 // regVal is the sum of coefficients squares excluding intercept for L2 regularization.\n", "1898 val regVal = if (regParamL2 == 0.0) {\n", "1899 0.0\n", "1900 } else {\n", "1901 var sum = 0.0\n", "1902 coefMatrix.foreachActive { case (classIndex, featureIndex, value) =>\n", "1903 // We do not apply regularization to the intercepts\n", "1904 val isIntercept = fitIntercept && (featureIndex == numFeatures)\n", "1905 if (!isIntercept) {\n", "1906 // +-- 2 lines: The following code will compute the loss of the regularization; also---\n", "1908 sum += {\n", "1909 if (standardization) {\n", "1910 val gradValue = totalGradientMatrix(classIndex, featureIndex)\n", "1911 totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * value)\n", "1912 value * value\n", "1913 // +-- 14 lines: } else {------------------\n", "1927 }\n", "1928 }\n", "1929 }\n", "1930 }\n", "1931 0.5 * regParamL2 * sum\n", "1932 }\n", "1933 // +-- 2 lines: bcCoeffs.destroy(blocking = false)--------\n", "1935 (logisticAggregator.loss + regVal, new BDV(totalGradientMatrix.toArray))\n", "1936 }\n", "```\n", "\n", "Here, lines 1912 and 1931 add the L2 penalty $r_2 \\cdot \\frac{1}{2}w^T w$ to the loss, and line 1911 adds its gradient $r_2 \\cdot w$. The branching is heavy because the non-standardized case needs extra handling, and since the loss and the gradient are computed in the same pass, the code reads as somewhat tangled." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With this, we have the loss function and its gradient for binary classification:\n", "\n", "```scala\n", " 601 val costFun = new LogisticCostFun(instances, numClasses, fitIntercept,\n", " 602 standardization, bcFeaturesStd, regParamL2, multinomial = isMultinomial,\n", " 603 aggregationDepth)\n", "```" ] },
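{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on to the multinomial case, here is a small sketch (not Spark code) of how the L2 term is folded into an already-aggregated data loss and gradient while skipping the intercept. It assumes a single coefficient set stored in a plain array whose last entry is the intercept, and ignores the standardization branch:\n", "\n", "```scala\n", "// Add r2 * w to the gradient and 0.5 * r2 * ||w||^2 to the loss, excluding the intercept.\n", "object L2RegSketch {\n", "  def addL2(\n", "      coefficients: Array[Double],\n", "      gradient: Array[Double],   // already holds the data-loss gradient\n", "      dataLoss: Double,          // already aggregated over all instances\n", "      regParamL2: Double): Double = {\n", "    var sumOfSquares = 0.0\n", "    var i = 0\n", "    while (i < coefficients.length - 1) {  // last entry is the intercept: not regularized\n", "      gradient(i) += regParamL2 * coefficients(i)\n", "      sumOfSquares += coefficients(i) * coefficients(i)\n", "      i += 1\n", "    }\n", "    dataLoss + 0.5 * regParamL2 * sumOfSquares\n", "  }\n", "}\n", "```" ] },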
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 Multinomial classification" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For the multinomial case Spark replaces the logit with the softmax function, which is an interesting choice. The key steps of the derivation are already documented in detail in the comments of the LogisticAggregator class, so I simply quote them below, add a few annotations, and then match the code to the formulas." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "LogisticAggregator computes the gradient and loss for binary or multinomial logistic (softmax)\n", "loss function, as used in classification for instances in sparse or dense vector in an online\n", "fashion.\n", " \n", "Two LogisticAggregators can be merged together to have a summary of loss and gradient of\n", "the corresponding joint dataset.\n", " \n", "For improving the convergence rate during the optimization process and also to prevent against\n", "features with very large variances exerting an overly large influence during model training,\n", "packages like R's GLMNET perform the scaling to unit variance and remove the mean in order to\n", "reduce the condition number. The model is then trained in this scaled space, but returns the\n", "coefficients in the original scale. See page 9 in\n", "http://cran.r-project.org/web/packages/glmnet/glmnet.pdf\n", " \n", "However, we don't want to apply the [[org.apache.spark.ml.feature.StandardScaler]] on the\n", "training dataset, and then cache the standardized dataset since it will create a lot of overhead.\n", "As a result, we perform the scaling implicitly when we compute the objective function (though\n", "we do not subtract the mean).\n", " \n", "Note that there is a difference between multinomial (softmax) and binary loss. The binary case\n", "uses one outcome class as a \"pivot\" and regresses the other class against the pivot. In the\n", "multinomial case, the softmax loss function is used to model each class probability\n", "independently. Using softmax loss produces `K` sets of coefficients, while using a pivot class\n", "produces `K - 1` sets of coefficients (a single coefficient vector in the binary case). In the\n", "binary case, we can say that the coefficients are shared between the positive and negative\n", "classes. When regularization is applied, multinomial (softmax) loss will produce a result\n", "different from binary loss since the positive and negative don't share the coefficients while the\n", "binary regression shares the coefficients between positive and negative.\n", " \n", "The following is a mathematical derivation for the multinomial (softmax) loss.\n", " \n", "The probability of the multinomial outcome $y$ taking on any of the K possible outcomes is:\n", " \n",
\n", " $$\n", " P(y_i=0|\\vec{x}_i, \\beta) = \\frac{e^{\\vec{x}_i^T \\vec{\\beta}_0}}{\\sum_{k=0}^{K-1}\n", " e^{\\vec{x}_i^T \\vec{\\beta}_k}} \\\\\n", " P(y_i=1|\\vec{x}_i, \\beta) = \\frac{e^{\\vec{x}_i^T \\vec{\\beta}_1}}{\\sum_{k=0}^{K-1}\n", " e^{\\vec{x}_i^T \\vec{\\beta}_k}}\\\\\n", " P(y_i=K-1|\\vec{x}_i, \\beta) = \\frac{e^{\\vec{x}_i^T \\vec{\\beta}_{K-1}}\\,}{\\sum_{k=0}^{K-1}\n", " e^{\\vec{x}_i^T \\vec{\\beta}_k}}\n", " $$\n", "
\n", " \n", "The model coefficients $\\beta = (\\beta_0, \\beta_1, \\beta_2, ..., \\beta_{K-1})$ become a matrix\n", "which has dimension of $K \\times (N+1)$ if the intercepts are added. If the intercepts are not\n", "added, the dimension will be $K \\times N$.\n", " \n", "Note that the coefficients in the model above lack identifiability. That is, any constant scalar\n", "can be added to all of the coefficients and the probabilities remain the same.\n", " \n", "
\n", " $$\n", " \\begin{align}\n", " \\frac{e^{\\vec{x}_i^T \\left(\\vec{\\beta}_0 + \\vec{c}\\right)}}{\\sum_{k=0}^{K-1}\n", " e^{\\vec{x}_i^T \\left(\\vec{\\beta}_k + \\vec{c}\\right)}}\n", " = \\frac{e^{\\vec{x}_i^T \\vec{\\beta}_0}e^{\\vec{x}_i^T \\vec{c}}\\,}{e^{\\vec{x}_i^T \\vec{c}}\n", " \\sum_{k=0}^{K-1} e^{\\vec{x}_i^T \\vec{\\beta}_k}}\n", " = \\frac{e^{\\vec{x}_i^T \\vec{\\beta}_0}}{\\sum_{k=0}^{K-1} e^{\\vec{x}_i^T \\vec{\\beta}_k}}\n", " \\end{align}\n", " $$\n", "
\n", " \n", "However, when regularization is added to the loss function, the coefficients are indeed\n", "identifiable because there is only one set of coefficients which minimizes the regularization\n", "term. When no regularization is applied, we choose the coefficients with the minimum L2\n", "penalty for consistency and reproducibility. For further discussion see:\n", " \n", "Friedman, et al. \"Regularization Paths for Generalized Linear Models via Coordinate Descent\"\n", " \n", "The loss of objective function for a single instance of data (we do not include the\n", "regularization term here for simplicity) can be written as\n", " \n", "
\n", " $$\n", " \\begin{align}\n", " \\ell\\left(\\beta, x_i\\right) &= -log{P\\left(y_i \\middle| \\vec{x}_i, \\beta\\right)} \\\\\n", " &= log\\left(\\sum_{k=0}^{K-1}e^{\\vec{x}_i^T \\vec{\\beta}_k}\\right) - \\vec{x}_i^T \\vec{\\beta}_y\\\\\n", " &= log\\left(\\sum_{k=0}^{K-1} e^{margins_k}\\right) - margins_y\n", " \\end{align}\n", " $$\n", "
\n", " \n", "where ${margins}_k = \\vec{x}_i^T \\vec{\\beta}_k$.\n", " \n", "For optimization, we have to calculate the first derivative of the loss function, and a simple\n", "calculation shows that\n", " \n", "
\n", " $$\n", " \\begin{align}\n", " \\frac{\\partial \\ell(\\beta, \\vec{x}_i, w_i)}{\\partial \\beta_{j, k}}\n", " &= x_{i,j} \\cdot w_i \\cdot \\left(\\frac{e^{\\vec{x}_i \\cdot \\vec{\\beta}_k}}{\\sum_{k'=0}^{K-1}\n", " e^{\\vec{x}_i \\cdot \\vec{\\beta}_{k'}}\\,} - I_{y=k}\\right) \\\\\n", " &= x_{i, j} \\cdot w_i \\cdot multiplier_k\n", " \\end{align}\n", " $$\n", "
\n", " \n", "where $w_i$ is the sample weight, $I_{y=k}$ is an indicator function\n", " \n", "
\n", " $$\n", " I_{y=k} = \\begin{cases}\n", " 1 & y = k \\\\\n", " 0 & else\n", " \\end{cases}\n", " $$\n", "
\n", " \n", "and\n", " \n", "
\n", " $$\n", " multiplier_k = \\left(\\frac{e^{\\vec{x}_i \\cdot \\vec{\\beta}_k}}{\\sum_{k=0}^{K-1}\n", " e^{\\vec{x}_i \\cdot \\vec{\\beta}_k}} - I_{y=k}\\right)\n", " $$\n", "
\n", "\n", "$\\exp(709.78)$超出Double上限。\n", "\n", "If any of margins is larger than 709.78, the numerical computation of multiplier and loss\n", "function will suffer from arithmetic overflow. This issue occurs when there are outliers in\n", "data which are far away from the hyperplane, and this will cause the failing of training once\n", "infinity is introduced. Note that this is only a concern when max(margins) > 0.\n", " \n", "Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can\n", "easily be rewritten into the following equivalent numerically stable formula.\n", " \n", "这里变换非常简单,将括号打开,用指数和对数规则依次套用。 \n", " \n", "
\n", " $$\n", " \\ell\\left(\\beta, x\\right) = log\\left(\\sum_{k=0}^{K-1} e^{margins_k - maxMargin}\\right) -\n", " margins_{y} + maxMargin\n", " $$\n", "
\n", " \n", "Note that each term, $(margins_k - maxMargin)$ in the exponential is no greater than zero; as a\n", "result, overflow will not happen with this formula.\n", " \n", "For $multiplier$, a similar trick can be applied as the following,\n", " \n", "
\n", " $$\n", " multiplier_k = \\left(\\frac{e^{\\vec{x}_i \\cdot \\vec{\\beta}_k - maxMargin}}{\\sum_{k'=0}^{K-1}\n", " e^{\\vec{x}_i \\cdot \\vec{\\beta}_{k'} - maxMargin}} - I_{y=k}\\right)\n", " $$\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```scala\n", "1711 private def multinomialUpdateInPlace(\n", "1712 // +-- 12 lines: features: Vector,----------\n", "1724 // marginOfLabel is margins(label) in the formula\n", "1725 var marginOfLabel = 0.0\n", "1726 var maxMargin = Double.NegativeInfinity\n", "1727\n", "1728 val margins = new Array[Double](numClasses)\n", "1729 features.foreachActive { (index, value) =>\n", "1730 // +-- 2 lines: val stdValue = value / localFeaturesStd(index)-------\n", "1732 while (j < numClasses) {\n", "1733 margins(j) += localCoefficients(index * numClasses + j) * stdValue\n", "1734 // +-- 4 lines: j += 1-----------------\n", "1738 while (i < numClasses) {\n", "1739 if (fitIntercept) {\n", "1740 margins(i) += localCoefficients(numClasses * numFeatures + i)\n", "1741 }\n", "1742 if (i == label.toInt) marginOfLabel = margins(i)\n", "1743 if (margins(i) > maxMargin) {\n", "1744 maxMargin = margins(i)\n", "1745 }\n", "1746 i += 1\n", "1747 }\n", "1748 // +-- 6 lines: *---------------------\n", "1754 val multipliers = new Array[Double](numClasses)\n", "1755 val sum = {\n", "1756 var temp = 0.0\n", "1757 var i = 0\n", "1758 while (i < numClasses) {\n", "1759 if (maxMargin > 0) margins(i) -= maxMargin\n", "1760 val exp = math.exp(margins(i))\n", "1761 temp += exp\n", "1762 multipliers(i) = exp\n", "1763 i += 1\n", "1764 }\n", "1765 temp\n", "1766 }\n", "1767\n", "1768 margins.indices.foreach { i =>\n", "1769 multipliers(i) = multipliers(i) / sum - (if (label == i) 1.0 else 0.0)\n", "1770 }\n", "1771 features.foreachActive { (index, value) =>\n", "1772 if (localFeaturesStd(index) != 0.0 && value != 0.0) {\n", "1773 val stdValue = value / localFeaturesStd(index)\n", "1774 var j = 0\n", "1775 while (j < numClasses) {\n", "1776 localGradientArray(index * numClasses + j) +=\n", "1777 weight * multipliers(j) * stdValue\n", "1778 j += 1\n", "1779 }\n", "1780 }\n", "1781 }\n", "1782 if (fitIntercept) {\n", "1783 var i = 0\n", "1784 while (i < numClasses) {\n", "1785 localGradientArray(numFeatures * numClasses + i) += weight * multipliers(i)\n", "1786 i += 1\n", "1787 }\n", "1788 }\n", "1789\n", "1790 val loss = if (maxMargin > 0) {\n", "1791 math.log(sum) - marginOfLabel + maxMargin\n", "1792 } else {\n", "1793 math.log(sum) - marginOfLabel\n", "1794 }\n", "1795 lossSum += weight * loss\n", "1796 }\n", "```\n", "\n", "+ 1728L-1733L,在计算margins = $x \\beta$。1738L的循环是找出maxMargin和标签对应的marginOfLabel,因为后面公式要用到。\n", "+ 1754L-1770L,计算了multipliers。我个人很不喜欢这种一个循环做两件事,且出口不同的风格。\n", "+ 1771L-1788L,计算导数localGradientArray = $x_{i, j} \\cdot w_i \\cdot \\operatorname{multiplier}_k$。\n", "+ 1790L-1795L,根据最大margin是否大于0,计算损失值loss。注意1759L也有针对做修正。\n", "\n", "公式较复杂,但代码挺简单的。为了效率,有的地方写得不太好看。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 小结\n", "\n", "spark-ml里逻辑回归支持样本加权,二分类和多分类。寻优算子是相对优秀的拟牛顿算法,多分类是softmax。总体而言,功能完整够用,实现也比较优秀。但有的代码,个人认为像面条,冗余,不够清减。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }