{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Code and IPython Notebook\n", "\n", "https://github.com/BartBaddeley/PyDataTalk-2014" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why\n", "\n", "- Preprocessing to another technique\n", "- Exploring\n", "\n", "## Why not use scikit-learn\n", "\n", "- When things don't work useful to dig in with knowing how it works\n", "- Do use this library, sometimes do stuff by hand\n", "\n", "## K-Means\n", "\n", "- Convex problem, so will always find **a** solution but might be a local minima, not guaranteed to be the global minima.\n", " - Run multiple times, randomise start, pick best\n", "\n", "##\u00a0When does K-Means break down and when not\n", "\n", "- If features (dimensions) have different scales\n", " - KM assumes data well defined by centroid, and have same scales.\n", " - !!AI so what kind of feature scaling to use?\n", "- Need to scale features\n", "- K-Means is resistant to irrelevant dimensions (no information)\n", " - but if large variance in irrelevant dimension(s), fails (!!AI variance, or different scale? both?)\n", "- If dimension isn't Gaussian but uniform random (noise), then also breaks down.\n", "- Lesson:\n", " - be careful when adding new features / dimensions.\n", " - \"curse of dimensionality\": with too many dimensions, the actual notion of similarity breaks down. Euclidean distance with too many dimensions less useful.\n", "- If non-isotropic (not linearly separatable), i.e. no clear centroid, will fail too\n", "\n", "## Choosing between results\n", "\n", "- Given two sets of clustering results, what is error?\n", "- **Inertia**: sum squared error between point and nearest centre (!!AI ?)\n", " - But in fact this isn't useful. As you add more clusters this always decreases, but we're not penalising the increased complexity of the solution. (regularisation)\n", " - Use if comparing two different clustering results given a priori assumption of number of clusters.\n", "- **Sillhouette coefficient**: considers compactness of your own cluster, in relation to distance to points not part of your cluster.\n", " - Use to choose optimum number of clusters.\n", "\n", "## Visualising errors\n", "\n", "!!AI see presenter notes.\n", "\n", "##\u00a0Spectral clustering\n", "\n", "- Can capture non-linearly-separatable clusters. (non-isotropic).\n", "- Estimates number of clusters.\n", "- Not restricted to Euclidean distance as distance metric\n", " - Pass in some \"similarity matrix\"\n", "- Good for text.\n", "\n", "!!AI see notes. requires solid linear algebra knowledge (maybe add references).\n", "\n", "- Looking at smallest eigenvectors, whereas PCA looks at largest eigenvectors, different problem.\n", "\n", "- Unnormalised spectral clustering tries to balance the size of the clusters\n", "- Normalised spectral clustering tries to balance the **degree** (density) of the clusters.\n", " - Ratio cut graph partitioning algorithms\n", "\n", "## Questions\n", "\n", "- Time complexity?\n", " - Not sure about analytically. 
{ "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }