{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Custom Courses\n", "\n", "In response to this Cross-Validated post: http://stats.stackexchange.com/questions/90330/using-knn-with-contributing-weights-to-calculate-ranking.\n", "\n", "This is hosted at on my [github](https://github.com/Ogaday/PyDoodles/blob/master/Custom_Courses.ipynb). Feel free to use it.\n", "\n", "## First Impressions\n", "\n", "Hello, This over a year old but I hope this might still be useful.\n", "\n", "If I understand correctly, you want an algorithm to rank your $800$ student in terms of their suitability for a specific course, based upon their achievements in the subjects that comprise the course.\n", "\n", "Now, normally this is tricky because there are multiple measures: how do you compare someone who gets $79$ in history and $41$ in maths with someone who gets $65$ in history and $62$ in maths? Summing the points and comparing (ie. student $b$ has $127$pts vs $120$pts for student $a$) gives implicit equal weighting to both measures and this is problem that comes up in multi-objective/pareto optimisation. I can write more detail about this if you're interested, but you've already decided on weights! Great - that makes this task very simple.\n", "\n", "The suitability of a student for a given course with defined weights can be found easily: multiply each subject score of the student by the corresponding weight (which should be $0$ is the subject is not considered) and sum the results. Once this has been done for all students, you can sort them according to this suitability, and viola, you are done.\n", "\n", "## Code\n", "\n", "I will use python." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np # Use numpy for mathematical operations\n", "\n", "np.random.seed(0)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0 47 64 67 67 9 83 21 36]\n", " [ 1 70 88 88 12 58 65 39 87]\n", " [ 2 88 81 37 25 77 72 9 20]\n", " [ 3 69 79 47 64 82 99 88 49]\n", " [ 4 19 19 14 39 32 65 9 57]\n", " [ 5 31 74 23 35 75 55 28 34]\n", " [ 6 0 36 53 5 38 17 79 4]\n", " [ 7 58 31 1 65 41 57 35 11]\n", " [ 8 82 91 0 14 99 53 12 42]\n", " [ 9 75 68 6 68 47 3 76 52]]\n" ] } ], "source": [ "# students = load(file)\n", "# Will generate random data instead\n", "\n", "n_subs = 8 # Number of subjects\n", "students = np.random.randint(0, 100, (800, n_subs+1)) # Uniformaly random generated grades for 800 students\n", "students[:,0] = range(800)\n", "\n", "# First ten students:\n", "print(students[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I assume here that ```students``` is an array with $800$ rows and $N+1$ columns. Each row represents a student and the first column is the student id, and the remain $N$ columns are the student's grades. eg with $8$ subjects, ```students``` might look like this:\n", "\n", " [[ 0 38 26 13 60 95 58 72 44]\n", " [ 1 32 63 62 66 1 16 43 91]\n", " [ 2 97 60 4 9 53 21 9 96]\n", " [ 3 56 56 51 34 69 37 96 77]\n", " [ 4 50 76 56 55 40 32 29 11]\n", " [ 5 35 30 75 81 97 23 38 19]\n", " [ 6 57 45 96 35 4 72 11 36]\n", " [ 7 69 46 95 24 73 56 95 5]\n", " [ 8 45 91 59 24 7 30 82 84]\n", " .\n", " .\n", " .\n", " [ 799 24 88 75 96 82 60 7 27]]\n", "\n", "Back to the code:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def rank(students, course_weights):\n", " ids = students[:,0] # Get a vector of student ids\n", " grades = students[:, 1:] # Get just the grades by themselves in a matrix\n", " scores = grades@course_weights # matrix multiplication as of python 3.5\n", " I = scores.argsort()[::-1] # sort by suitability in descending order\n", " return ids[I]\n", "\n", "course_A_weights = [0,1,0,4,3,2,0,5]\n", " \n", "most_suitable_for_A = rank(students, course_A_weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```most_suitable_for_A``` is a list of student ids in descending order of the most suitable for the custom course A, as defined by the weights." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The three most suitable students for course A are students 750, 14 and 779\n" ] } ], "source": [ "print(\"The three most suitable students for course A are students {}, {} and {}\".format(*most_suitable_for_A[:3]))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[750 84 43 90 91 87 90 73 97]\n", "[14 98 62 35 94 67 82 46 99]\n", "[779 94 92 54 85 83 88 47 87]\n" ] } ], "source": [ "# Inspect the students:\n", "for student in students[most_suitable_for_A[:3]]:\n", " print(student)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, they all have top grades in eighth and fourth subjects because those are most import subjects for this course.\n", "\n", "\n", "## Alternatively\n", "\n", "What you want to do is, given a student's grades and several optional custom courses, you want to find the most suitable custom course for the student. In which case, we can define a function that takes a student's grades and a matrix of course weights:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The most successful courses for student 1112:\n", "In descending order: 0 1 \n" ] } ], "source": [ "def most_suitable(student, courses):\n", " # Evaluate suitability of student for each course\n", " scores = courses@student[1:]\n", " I = scores.argsort()[::-1]\n", " return I\n", "\n", "student = [1112, 789, 56, 64, 38, 41, 15, 25, 32]\n", "courses = np.array([[5,3,4,0,2,1,0,0],\n", " [0,1,0,4,3,2,0,5]])\n", "\n", "print(\"The most successful courses for student {}:\".format(student[0]))\n", "print(\"In descending order: \"+(len(courses)*\"{} \").format(*most_suitable(student, courses)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What you might use KNN for\n", "\n", "The above isn't KNN. What KNN is used for is predicting the classification of observations. If you received the transcript for a new student and wanted to predict which custom course they might choose, you might be able to use KNN to do that.\n", "\n", "## Final Words\n", "\n", "In my opinion, what would be interesting is if you look at clustering the students. There are some simple and effective clustering algorithms: agglomerative such as single and complete linkage, and centroid based, such as k-means. You could feed these algorithms your student data and see if there are any patterns, such as students who do well at mathematics also doing well at history. Then you could tailor your custom course to the clusters that you find in the data for instance doing a maths-history combo custom course if you find it's a common dual specialism." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }