{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas, vectors, and text work homework\n",
    "### Data and Databases\n",
    "\n",
    "Start thinking about your own project this week!\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's use the MovieLens 100K (`ml-100k`) data set.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `pd.read_csv` to load the `u.item` and `u.data` files as dataframes.\n",
    "\n",
    "Using the dataframe you've created from `u.data`, produce:\n",
    "\n",
    "- a dataframe containing all the item numbers and ratings given by user 42\n",
    "- the mean of user 42's ratings"
   ]
  },
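  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on loading, offered as a minimal sketch rather than the expected solution: in the MovieLens 100K distribution, `u.data` is tab-separated and `u.item` is pipe-separated, and neither file has a header row, so the defaults of `pd.read_csv` will not parse them correctly. The column names and the sample rows below are made up for illustration, and the path in the final comment is an assumption about where you unpacked the data.\n",
    "\n",
    "```python\n",
    "from __future__ import print_function\n",
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Three made-up rows in the layout of u.data:\n",
    "# user id, item id, rating, timestamp, separated by tabs, no header row.\n",
    "sample = io.StringIO(u'42\\t10\\t5\\t881105000\\n42\\t11\\t2\\t881105001\\n7\\t10\\t3\\t881105002\\n')\n",
    "\n",
    "cols = ['user_id', 'item_id', 'rating', 'timestamp']  # hypothetical names\n",
    "data = pd.read_csv(sample, sep='\\t', names=cols)\n",
    "print(data)\n",
    "\n",
    "# For the real files (the path is an assumption):\n",
    "# items = pd.read_csv('ml-100k/u.item', sep='|', header=None, encoding='latin-1')\n",
    "```\n",
    "\n",
    "Passing `names=` also tells pandas to treat the first line as data rather than as a header."
   ]
  },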
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take the item numbers of the films that user 42 rated above their own mean rating. Using the `u.item` data, give the titles of the movies corresponding to those item numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start with the `ratings` dataframe we created with the pivot command, with users as rows and films as columns.\n",
    "\n",
    "Using `cosine_similarity` from Wednesday's lecture, compute the similarity between\n",
    "\n",
    "- (a) any two users of your choice\n",
    "- (b) any two films of your choice\n",
    "\n",
    "based on the users' ratings of films.\n",
    "\n",
    "You will need to replace all the NaNs with something. For now, use `ratings.fillna(0)` to replace the NaNs with zeros. This would badly distort averages, of course, because every unrated film would be counted as a rating of zero!\n"
   ]
  },
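  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the two points above concrete, here is a small sketch on made-up ratings. It assumes that the `cosine_similarity` from the lecture computes the usual dot product divided by the product of the vector lengths; the `cosine` helper and the toy table are hypothetical stand-ins, not the lecture code or the real data.\n",
    "\n",
    "```python\n",
    "from __future__ import division, print_function\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up ratings: rows are users, columns are films, NaN means 'not rated'.\n",
    "toy = pd.DataFrame(\n",
    "    {'film_a': [5, 4, np.nan], 'film_b': [np.nan, 2, 1], 'film_c': [3, np.nan, 5]},\n",
    "    index=['user_1', 'user_2', 'user_3'])\n",
    "filled = toy.fillna(0)\n",
    "\n",
    "def cosine(u, v):\n",
    "    # Dot product divided by the product of the two vector lengths.\n",
    "    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
    "\n",
    "# Users are the rows and films are the columns of the same table.\n",
    "print(cosine(filled.loc['user_1'], filled.loc['user_2']))\n",
    "print(cosine(filled['film_a'], filled['film_b']))\n",
    "\n",
    "# mean() skips NaN but counts a filled-in zero: 4.5 versus 3.0 here.\n",
    "print(toy['film_a'].mean(), filled['film_a'].mean())\n",
    "```\n",
    "\n",
    "The same function serves both cases: a user is a row of the table and a film is a column."
   ]
  },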
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, compute two similarity matrices:\n",
    "\n",
    "1. among users (i.e., a matrix with users as both rows and columns)\n",
    "2. among films (i.e., a matrix with films as both rows and columns)\n",
    "\n",
    "(You can use pandas dataframes if you want.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pick two films. For each one, give:\n",
    "\n",
    "- (a) the item numbers of its ten most similar films\n",
    "- (b) the titles and release dates of those ten films"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, using the Capitolwords API, obtain at least one hundred speeches containing terms of interest to you. You may need to try several phrases. Note that the API returns only 50 results at a time, so you will need more than one request. Bonus points if \"kittens\" figures prominently. HA!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Who is the most frequent speaker in your data set? Show how you computed it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, look at the documentation for the API at http://capitolwords.org/api/1/. Look at the boldfaced section *phrases.json*.\n",
    "\n",
    "Using the example there, write your own queries to request:\n",
    "\n",
    "1. the top words in August 2011 by count\n",
    "2. the top words in August 2011 by tf-idf\n",
    "\n",
    "Explain briefly why the count and tf-idf rankings are likely to differ, and give an example. (See the explanation in the notes; Wikipedia is also a reasonable reference on the subject.)"
   ]
  },
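  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point for the explanation, here is a toy sketch of the two rankings. It assumes one common tf-idf variant (term frequency times the log of the inverse document frequency); the weighting used by the API may differ. The three 'speeches' and the `tf_idf` helper are made up for illustration.\n",
    "\n",
    "```python\n",
    "from __future__ import division, print_function\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "# Three made-up 'speeches', already split into words.\n",
    "docs = [\n",
    "    'the senate debates the budget'.split(),\n",
    "    'the house debates the debt ceiling'.split(),\n",
    "    'the kittens visit the senate'.split(),\n",
    "]\n",
    "\n",
    "# Raw counts over the whole collection.\n",
    "counts = Counter(word for doc in docs for word in doc)\n",
    "\n",
    "def tf_idf(word, doc, docs):\n",
    "    # Share of this document taken up by the word ...\n",
    "    tf = doc.count(word) / len(doc)\n",
    "    # ... weighted down by how many documents contain it.\n",
    "    df = sum(1 for d in docs if word in d)\n",
    "    return tf * math.log(len(docs) / df)\n",
    "\n",
    "print(counts.most_common(1))             # 'the' has the highest raw count\n",
    "print(tf_idf('the', docs[2], docs))      # 0.0, since 'the' is in every document\n",
    "print(tf_idf('kittens', docs[2], docs))  # positive, since only one document has it\n",
    "```\n",
    "\n",
    "A word that appears everywhere tops the count ranking but gets no tf-idf weight, while a rare word concentrated in a few documents ranks high by tf-idf."
   ]
  },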
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More text mining next time!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}