{ "metadata": { "name": "", "signature": "sha256:81c38476aaea95ac17ccfa86f93aea408282475780e2e20545818910e0014e2a" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project Outline\n", "\n", "From the [task description](https://pslcdatashop.web.cmu.edu/KDDCup/rules_task.jsp):\n", "\n", ">The competition task will be to develop a learning model based on the challenge and/or development data sets, use this algorithm to learn from the training portion of the challenge data sets, and then accurately predict student performance in the test sections.\n", "\n", "Some of the technical challenges of this problem include:\n", "\n", "> The data matrix is sparse: not all students are given every problem, and some problems have only 1 or 2 students who completed each item. So, the contestants need to exploit relationships among problems to bring to bear enough data to hope to learn.\n", "\n", "> There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions in other items. So, contestants must pay attention to temporal relationships as well as conceptual relationships among items.\n", "\n", "> Which problems a given student sees is determined in part by student choices or past success history: e.g., students only see remedial problems if they are having trouble with the non-remedial problems. So, contestants need to pay attention to causal relationships in order to avoid selection bias." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I am not going to concern myself too much with the last aspect for now. The interactive tutorial system that students are using suggests remedial problems based on their mistakes. The result could be that students see more of a certain kind of problem. 
This could skew estimates of the student's overall competency if it is not taken into account, but this will be addressed later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step in solving this problem is establishing the relationships between problems. To predict how well a student is going to perform on a new problem, we must first place that problem in relation to the other problems in the database, and then in relation to the problems within the database that the student has already encountered.\n", "\n", "To establish the relationships between problems, we must use some kind of unsupervised machine learning technique." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read in the data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Get the data: Algebra 2005-2006\n", "train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'\n", "test_filepath = 'data/algebra0506/algebra_2005_2006_test.txt'\n", "traindata = pd.read_table(train_filepath)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# What does the training data look like?\n", "traindata.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | Row | Anon Student Id | Problem Hierarchy | Problem Name | Problem View | Step Name | Step Start Time | First Transaction Time | Correct Transaction Time | Step End Time | Step Duration (sec) | Correct Step Duration (sec) | Error Step Duration (sec) | Correct First Attempt | Incorrects | Hints | Corrects | KC(Default) | Opportunity(Default) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | 3(x+2) = 15 | 2005-09-09 12:24:35.0 | 2005-09-09 12:24:49.0 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:15.0 | 40 | NaN | 40 | 0 | 2 | 3 | 1 | [SkillRule: Eliminate Parens; {CLT nested; CLT... | 1 |
| 1 | 2 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | x+2 = 5 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 16 | 16 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 1~~1 |
| 2 | 3 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | 2-8y = -4 | 2005-09-09 12:25:36.0 | 2005-09-09 12:25:43.0 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:12.0 | 36 | NaN | 36 | 0 | 2 | 3 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 2 |
| 3 | 4 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | -8y = -6 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 22 | 22 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove coefficient; {ax+b=c, divid... | 1~~1 |
| 4 | 5 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 2 | -7y-5 = -4 | 2005-09-09 12:26:38.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 118 | 118 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 3~~1 |

5 rows × 19 columns
\n", "