{
 "metadata": {
  "name": "",
  "signature": "sha256:81c38476aaea95ac17ccfa86f93aea408282475780e2e20545818910e0014e2a"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Project Outline\n",
      "\n",
      "From the [task description](https://pslcdatashop.web.cmu.edu/KDDCup/rules_task.jsp):\n",
      "\n",
      ">The competition task will be to develop a learning model based on the challenge and/or development data sets, use this algorithm to learn from the training portion of the challenge data sets, and then accurately predict student performance in the test sections.\n",
      "\n",
      "Some of the technical challenges of this problem include:\n",
      "\n",
      "> The data matrix is sparse: not all students are given every problem, and some problems have only 1 or 2 students who completed each item. So, the contestants need to exploit relationships among problems to bring to bear enough data to hope to learn.\n",
      "\n",
      "> There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions in other items. So, contestants must pay attention to temporal relationships as well as conceptual relationships among items.\n",
      "\n",
      "> Which problems a given student sees is determined in part by student choices or past success history: e.g., students only see remedial problems if they are having trouble with the non-remedial problems. So, contestants need to pay attention to causal relationships in order to avoid selection bias."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "I am not going to concern myself too much with the last aspect for now. The interactive tutorial system that students are using is suggestion remedial problems based on mistakes. The result could be that students are seeing more of a certain kind of problem. This could skew estimations of the student's total competency if this is not taking into account. But this will addressed later."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first step in solving this problem is establishing the relationships between problems. To predict how well a student is going to perform against a new problem, we must first establish that problem in relation to the other problems in the database, and then to the problems within the database that the student has already encountered.\n",
      "\n",
      "To establish the relationships between problems, we must use some kind of unsupervised machine learning technique."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np\n",
      "import matplotlib.pyplot as plt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Read in the data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Get the data: Algebra 2005-2006\n",
      "train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'\n",
      "test_filepath  = 'data/algebra0506/algebra_2005_2006_test.txt'\n",
      "traindata = pd.read_table(train_filepath)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What does the training data look like?\n",
      "traindata.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Row</th>\n",
        "      <th>Anon Student Id</th>\n",
        "      <th>Problem Hierarchy</th>\n",
        "      <th>Problem Name</th>\n",
        "      <th>Problem View</th>\n",
        "      <th>Step Name</th>\n",
        "      <th>Step Start Time</th>\n",
        "      <th>First Transaction Time</th>\n",
        "      <th>Correct Transaction Time</th>\n",
        "      <th>Step End Time</th>\n",
        "      <th>Step Duration (sec)</th>\n",
        "      <th>Correct Step Duration (sec)</th>\n",
        "      <th>Error Step Duration (sec)</th>\n",
        "      <th>Correct First Attempt</th>\n",
        "      <th>Incorrects</th>\n",
        "      <th>Hints</th>\n",
        "      <th>Corrects</th>\n",
        "      <th>KC(Default)</th>\n",
        "      <th>Opportunity(Default)</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 1</td>\n",
        "      <td> 0BrbPbwCMz</td>\n",
        "      <td> Unit ES_04, Section ES_04-1</td>\n",
        "      <td> EG4-FIXED</td>\n",
        "      <td> 1</td>\n",
        "      <td> 3(x+2) = 15</td>\n",
        "      <td> 2005-09-09 12:24:35.0</td>\n",
        "      <td> 2005-09-09 12:24:49.0</td>\n",
        "      <td> 2005-09-09 12:25:15.0</td>\n",
        "      <td> 2005-09-09 12:25:15.0</td>\n",
        "      <td>  40</td>\n",
        "      <td> NaN</td>\n",
        "      <td> 40</td>\n",
        "      <td> 0</td>\n",
        "      <td> 2</td>\n",
        "      <td> 3</td>\n",
        "      <td> 1</td>\n",
        "      <td> [SkillRule: Eliminate Parens; {CLT nested; CLT...</td>\n",
        "      <td>    1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 2</td>\n",
        "      <td> 0BrbPbwCMz</td>\n",
        "      <td> Unit ES_04, Section ES_04-1</td>\n",
        "      <td> EG4-FIXED</td>\n",
        "      <td> 1</td>\n",
        "      <td>     x+2 = 5</td>\n",
        "      <td> 2005-09-09 12:25:15.0</td>\n",
        "      <td> 2005-09-09 12:25:31.0</td>\n",
        "      <td> 2005-09-09 12:25:31.0</td>\n",
        "      <td> 2005-09-09 12:25:31.0</td>\n",
        "      <td>  16</td>\n",
        "      <td>  16</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> [SkillRule: Remove constant; {ax+b=c, positive...</td>\n",
        "      <td> 1~~1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 3</td>\n",
        "      <td> 0BrbPbwCMz</td>\n",
        "      <td> Unit ES_04, Section ES_04-1</td>\n",
        "      <td>      EG40</td>\n",
        "      <td> 1</td>\n",
        "      <td>   2-8y = -4</td>\n",
        "      <td> 2005-09-09 12:25:36.0</td>\n",
        "      <td> 2005-09-09 12:25:43.0</td>\n",
        "      <td> 2005-09-09 12:26:12.0</td>\n",
        "      <td> 2005-09-09 12:26:12.0</td>\n",
        "      <td>  36</td>\n",
        "      <td> NaN</td>\n",
        "      <td> 36</td>\n",
        "      <td> 0</td>\n",
        "      <td> 2</td>\n",
        "      <td> 3</td>\n",
        "      <td> 1</td>\n",
        "      <td> [SkillRule: Remove constant; {ax+b=c, positive...</td>\n",
        "      <td>    2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 4</td>\n",
        "      <td> 0BrbPbwCMz</td>\n",
        "      <td> Unit ES_04, Section ES_04-1</td>\n",
        "      <td>      EG40</td>\n",
        "      <td> 1</td>\n",
        "      <td>    -8y = -6</td>\n",
        "      <td> 2005-09-09 12:26:12.0</td>\n",
        "      <td> 2005-09-09 12:26:34.0</td>\n",
        "      <td> 2005-09-09 12:26:34.0</td>\n",
        "      <td> 2005-09-09 12:26:34.0</td>\n",
        "      <td>  22</td>\n",
        "      <td>  22</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> [SkillRule: Remove coefficient; {ax+b=c, divid...</td>\n",
        "      <td> 1~~1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 5</td>\n",
        "      <td> 0BrbPbwCMz</td>\n",
        "      <td> Unit ES_04, Section ES_04-1</td>\n",
        "      <td>      EG40</td>\n",
        "      <td> 2</td>\n",
        "      <td>  -7y-5 = -4</td>\n",
        "      <td> 2005-09-09 12:26:38.0</td>\n",
        "      <td> 2005-09-09 12:28:36.0</td>\n",
        "      <td> 2005-09-09 12:28:36.0</td>\n",
        "      <td> 2005-09-09 12:28:36.0</td>\n",
        "      <td> 118</td>\n",
        "      <td> 118</td>\n",
        "      <td>NaN</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> [SkillRule: Remove constant; {ax+b=c, positive...</td>\n",
        "      <td> 3~~1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 19 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "   Row Anon Student Id            Problem Hierarchy Problem Name  \\\n",
        "0    1      0BrbPbwCMz  Unit ES_04, Section ES_04-1    EG4-FIXED   \n",
        "1    2      0BrbPbwCMz  Unit ES_04, Section ES_04-1    EG4-FIXED   \n",
        "2    3      0BrbPbwCMz  Unit ES_04, Section ES_04-1         EG40   \n",
        "3    4      0BrbPbwCMz  Unit ES_04, Section ES_04-1         EG40   \n",
        "4    5      0BrbPbwCMz  Unit ES_04, Section ES_04-1         EG40   \n",
        "\n",
        "   Problem View    Step Name        Step Start Time First Transaction Time  \\\n",
        "0             1  3(x+2) = 15  2005-09-09 12:24:35.0  2005-09-09 12:24:49.0   \n",
        "1             1      x+2 = 5  2005-09-09 12:25:15.0  2005-09-09 12:25:31.0   \n",
        "2             1    2-8y = -4  2005-09-09 12:25:36.0  2005-09-09 12:25:43.0   \n",
        "3             1     -8y = -6  2005-09-09 12:26:12.0  2005-09-09 12:26:34.0   \n",
        "4             2   -7y-5 = -4  2005-09-09 12:26:38.0  2005-09-09 12:28:36.0   \n",
        "\n",
        "  Correct Transaction Time          Step End Time  Step Duration (sec)  \\\n",
        "0    2005-09-09 12:25:15.0  2005-09-09 12:25:15.0                   40   \n",
        "1    2005-09-09 12:25:31.0  2005-09-09 12:25:31.0                   16   \n",
        "2    2005-09-09 12:26:12.0  2005-09-09 12:26:12.0                   36   \n",
        "3    2005-09-09 12:26:34.0  2005-09-09 12:26:34.0                   22   \n",
        "4    2005-09-09 12:28:36.0  2005-09-09 12:28:36.0                  118   \n",
        "\n",
        "   Correct Step Duration (sec)  Error Step Duration (sec)  \\\n",
        "0                          NaN                         40   \n",
        "1                           16                        NaN   \n",
        "2                          NaN                         36   \n",
        "3                           22                        NaN   \n",
        "4                          118                        NaN   \n",
        "\n",
        "   Correct First Attempt  Incorrects  Hints  Corrects  \\\n",
        "0                      0           2      3         1   \n",
        "1                      1           0      0         1   \n",
        "2                      0           2      3         1   \n",
        "3                      1           0      0         1   \n",
        "4                      1           0      0         1   \n",
        "\n",
        "                                         KC(Default) Opportunity(Default)  \n",
        "0  [SkillRule: Eliminate Parens; {CLT nested; CLT...                    1  \n",
        "1  [SkillRule: Remove constant; {ax+b=c, positive...                 1~~1  \n",
        "2  [SkillRule: Remove constant; {ax+b=c, positive...                    2  \n",
        "3  [SkillRule: Remove coefficient; {ax+b=c, divid...                 1~~1  \n",
        "4  [SkillRule: Remove constant; {ax+b=c, positive...                 3~~1  \n",
        "\n",
        "[5 rows x 19 columns]"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## What relates problems to each other?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Let's look at the columns\n",
      "traindata.columns"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "Index([u'Row', u'Anon Student Id', u'Problem Hierarchy', u'Problem Name', u'Problem View', u'Step Name', u'Step Start Time', u'First Transaction Time', u'Correct Transaction Time', u'Step End Time', u'Step Duration (sec)', u'Correct Step Duration (sec)', u'Error Step Duration (sec)', u'Correct First Attempt', u'Incorrects', u'Hints', u'Corrects', u'KC(Default)', u'Opportunity(Default)'], dtype='object')"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So the columns in the dataset pertaining to problem characteristics are 'KC(Default)' (the knowledge components of a problem), the 'Problem Name', the 'Problem Hierarchy', and the 'Step Name'. The problem is that 'Step Name' is unique within a problem but there may be collisions with other problems. When taken together with 'Problem Name', it is unique.\n",
      "\n",
      "There may also be a need to generate more features if these prove insufficient for clustering analysis. Other features might include an estimated relative difficult based on how successful students are at solving the problem, the number of hints they require, and the time required to solve certain problems."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Perhaps the first thing to do is to establish a dictionary of knowledge components."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create empty list\n",
      "KCs = []\n",
      "\n",
      "# Grab the column of Knowledge Components, dropping all NaNs\n",
      "KCcol = traindata['KC(Default)']\n",
      "KCcol = list(KCcol.dropna())\n",
      "\n",
      "# Loop over every database entry, read the skills, split on '~~' separator, and append to list\n",
      "for i in range(len(KCcol)):\n",
      "    skills = KCcol[i].split('~~')\n",
      "    for skill in skills:\n",
      "        KCs.append(skill)\n",
      "        \n",
      "# Convert to set, which keeps only unique entries, then convert back to list\n",
      "KCs = list(set(KCs))\n",
      "\n",
      "# Print length\n",
      "print 'The total number of unique skills is: ',len(KCs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The total number of unique skills is:  112\n"
       ]
      }
     ],
     "prompt_number": 103
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}