{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 02. Lab exercises, Supervised learning, KNN\n", "\n", "----\n", "\n", "1. Implement naive K nearest neighbour regression as a function using only Python and NumPy. The signature of the function should be:\n", "\n", "\n", "   ```\n", "   def knn_regression(x2pred, x_train, y_train, k=10):\n", "       \"\"\"Return prediction with knn regression.\"\"\"\n", "       .\n", "       .\n", "       .\n", "       return y_pred\n", "   ```\n", "\n", "\n", "2. Apply the KNN regressor to photometric redshift estimation using the provided photoz_mini.csv file. Use an 80-20% train-test split. Calculate the mean absolute error (MAE) of the predictions, and plot the true and the predicted values on a scatterplot.\n", "\n", "3. Apply the KNN regressor to photometric redshift estimation using the provided photoz_mini.csv file. Use 5-fold cross-validation. Estimate the mean and standard deviation of the MAE of the predictions.\n", "\n", "4. Repeat 3 with the KNN regression class from sklearn. Compare the predictions and the runtime.\n", "\n", "5. Implement weighted KNN regression and apply it to the same data. Use 5-fold cross-validation. Estimate the mean and standard deviation of the MAE of the predictions. Plot the true and the predicted values from one fold on a scatterplot.\n", "\n", "---\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1, Write the KNN regressor" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def knn_regression(x2pred, x_train, y_train, k=10):\n", "    \"\"\"Return prediction with knn regression.\"\"\"\n", "    # squared Euclidean distance from x2pred to every training point\n", "    dist = [((x2pred-xi)**2).sum() for xi in x_train]\n", "    # indices of the k nearest training points\n", "    knn = np.argsort(dist)[:k]\n", "    # predict the mean target of the k neighbours\n", "    return y_train[knn].mean()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2, Apply it to photoz data with an 80%-20% split" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/photoz_mini.csv') # load the data\n", "x = df[['u','g','r','i','z']].values # features as a 2D array, as sklearn expects\n", "y = df['redshift'].values # target as a 1D array, as sklearn expects" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 652 ms, sys: 6.47 ms, total: 658 ms\n", "Wall time: 657 ms\n" ] } ], "source": [ "%%time\n", "yp = [knn_regression(xi, x_train, y_train) for xi in x_test]" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0, 1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "figsize(6,6)\n", "plot(y_test, yp,'o')\n", "xlabel('z true')\n", "ylabel('z predicted')\n", "title('mae = %.3f' % np.mean(np.abs(y_test-yp)))\n", "xlim(0,1)\n", "ylim(0,1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3, Apply it in 5-fold cross-validation" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import KFold" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "maes = 0.0487 +/- 0.0031\n", "CPU times: user 3.04 s, sys: 13.8 ms, total: 3.06 s\n", "Wall time: 3.08 s\n" ] } ], "source": [ "%%time\n", "kf = KFold(n_splits=5)\n", "maes = []\n", "for train_index, test_index in kf.split(x):\n", "    x_train, x_test = x[train_index], x[test_index]\n", "    y_train, y_test = y[train_index], y[test_index]\n", "    yp = [knn_regression(xi, x_train, y_train) for xi in x_test]\n", "    maes.append(np.mean(np.abs(y_test-yp)))\n", "print('maes = %.4f +/- %.4f' % (np.mean(maes), np.std(maes, ddof=1))) # note ddof=1: sample standard deviation"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4, Use sklearn, compare results and runtime" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsRegressor" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "maes = 0.0487 +/- 0.0031\n", "CPU times: user 13.6 ms, sys: 1.6 ms, total: 15.2 ms\n", "Wall time: 13.9 ms\n" ] } ], "source": [ "%%time\n", "knnr = KNeighborsRegressor(n_neighbors=10)\n", "kf = KFold(n_splits=5)\n", "maes = []\n", "for train_index, test_index in kf.split(x):\n", "    x_train, x_test = x[train_index], x[test_index]\n", "    y_train, y_test = y[train_index], y[test_index]\n", "    knnr.fit(x_train, y_train)\n", "    yp = knnr.predict(x_test)\n", "    maes.append(np.mean(np.abs(y_test-yp)))\n", "print('maes = %.4f +/- %.4f' % (np.mean(maes), np.std(maes, ddof=1))) # note ddof=1: sample standard deviation" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5, Implement weighted KNN regression" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def knn_regression_w(x2pred, x_train, y_train, k=10):\n", "    \"\"\"Return prediction with weighted knn regression.\"\"\"\n", "    dist = np.array([((x2pred-xi)**2).sum() for xi in x_train])\n", "    knn = np.argsort(dist)[:k]\n", "    # inverse squared-distance weights; the floor avoids division by zero\n", "    w = 1. / np.maximum(dist[knn], 1e-12)\n", "    # weighted mean of the neighbours' targets, normalised by the total weight\n", "    return (w * y_train[knn]).sum() / w.sum()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "kf = KFold(n_splits=5)\n", "maes = []\n", "for train_index, test_index in kf.split(x):\n", "    x_train, x_test = x[train_index], x[test_index]\n", "    y_train, y_test = y[train_index], y[test_index]\n", "    yp = [knn_regression_w(xi, x_train, y_train) for xi in x_test]\n", "    maes.append(np.mean(np.abs(y_test-yp)))\n", "print('maes = %.4f +/- %.4f' % (np.mean(maes), np.std(maes, ddof=1))) # note ddof=1: sample standard deviation" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0, 1)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "figsize(6,6)\n", "plot(y_test, yp,'o')\n", "xlabel('z true')\n", "ylabel('z predicted')\n", "title('mae = %.3f' % np.mean(np.abs(y_test-yp)))\n", "xlim(0,1)\n", "ylim(0,1)" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 2 }