{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Code and IPython Notebook\n", "\n", "https://github.com/BartBaddeley/PyDataTalk-2014" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why\n", "\n", "- Preprocessing to another technique\n", "- Exploring\n", "\n", "## Why not use scikit-learn\n", "\n", "- When things don't work useful to dig in with knowing how it works\n", "- Do use this library, sometimes do stuff by hand\n", "\n", "## K-Means\n", "\n", "- Convex problem, so will always find **a** solution but might be a local minima, not guaranteed to be the global minima.\n", " - Run multiple times, randomise start, pick best\n", "\n", "##\u00a0When does K-Means break down and when not\n", "\n", "- If features (dimensions) have different scales\n", " - KM assumes data well defined by centroid, and have same scales.\n", " - !!AI so what kind of feature scaling to use?\n", "- Need to scale features\n", "- K-Means is resistant to irrelevant dimensions (no information)\n", " - but if large variance in irrelevant dimension(s), fails (!!AI variance, or different scale? both?)\n", "- If dimension isn't Gaussian but uniform random (noise), then also breaks down.\n", "- Lesson:\n", " - be careful when adding new features / dimensions.\n", " - \"curse of dimensionality\": with too many dimensions, the actual notion of similarity breaks down. Euclidean distance with too many dimensions less useful.\n", "- If non-isotropic (not linearly separatable), i.e. no clear centroid, will fail too\n", "\n", "## Choosing between results\n", "\n", "- Given two sets of clustering results, what is error?\n", "- **Inertia**: sum squared error between point and nearest centre (!!AI ?)\n", " - But in fact this isn't useful. As you add more clusters this always decreases, but we're not penalising the increased complexity of the solution. (regularisation)\n", " - Use if comparing two different clustering results given a priori assumption of number of clusters.\n", "- **Sillhouette coefficient**: considers compactness of your own cluster, in relation to distance to points not part of your cluster.\n", " - Use to choose optimum number of clusters.\n", "\n", "## Visualising errors\n", "\n", "!!AI see presenter notes.\n", "\n", "##\u00a0Spectral clustering\n", "\n", "- Can capture non-linearly-separatable clusters. (non-isotropic).\n", "- Estimates number of clusters.\n", "- Not restricted to Euclidean distance as distance metric\n", " - Pass in some \"similarity matrix\"\n", "- Good for text.\n", "\n", "!!AI see notes. requires solid linear algebra knowledge (maybe add references).\n", "\n", "- Looking at smallest eigenvectors, whereas PCA looks at largest eigenvectors, different problem.\n", "\n", "- Unnormalised spectral clustering tries to balance the size of the clusters\n", "- Normalised spectral clustering tries to balance the **degree** (density) of the clusters.\n", " - Ratio cut graph partitioning algorithms\n", "\n", "## Questions\n", "\n", "- Time complexity?\n", " - Not sure about analytically. 
{ "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }