{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Common Preprocesing Methods\n", "- **feature-wise normalization** - minus feature-mean and divided by feature std: $\\tilde{f_i}=\\frac{f_i-[f_i]}{std(f_i)}$. but if the features are sparse, usually we dont do the mean-minus step - for numerical stability, we usually add a small number to std before the normalization\n", "- **sample-wise normalization** - each sample divided by sample $l1$ or $l2$ norm: $\\tilde{x_i}=\\frac{x_i}{|x_i|_2}$, usually we dont need to substract the mean of each sample, specially when the samples are sparse vectors\n", "- **global sample-wise normalization** - divide each sample by the global mean/median $l1$ or $l2$ norm of all samples\n", "- based on some reports of machine learning competence, sample-wise normalization is usually suprisingly better, where feature-wise should always be the first thing to try\n", "- **PCA** is always worth trying since it filter out low-pass noise by dimensionality reduction\n", "- **ZCA** might be very useful for image data\n", "- **Clustering**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common Data Patterns\n", "- **sparseness**: how sparse the sample (feature) vectors are - this affect normalization methods\n", "- **binary/discrete v.s. continouse** - this affect PCA, clustering - if it is binary data, it doesnt make sense to use kmeans directly as it will be doomed to converge to a tie. One way of doing this is to use specialized clustering (such as bi-clustering) for text data and do PCA first, then use kmeans. For discrete values, we need to use other tech such as mixture of Bernoulli to do clustering\n", "- **shape of data** - fat or long. e.g., if the dimensionality of data is too high, it does not make much sense to use an Euclidean based distance to do clustering\n", "\n", "## Common Data Types:\n", "- **Text data** - e.g., 20-newsgroup\n", "- **Image data** - e.g. MNIST, human faces, CIFAR\n", "- **Time Series** - e.g. YAHOO Stock\n", "- **Discrete-value features** - e.g. evergreen Kaggle competetion\n", "- **Continous features where a deep architecture is needed to find the non-linear structure** - e.g. 
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## 20-newsgroup\n",
      "from sklearn.datasets import fetch_20newsgroups\n",
      "text = fetch_20newsgroups()\n",
      "print len(text.data)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "11314\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## MNIST\n",
      "import cPickle\n",
      "train_mnist, valid_mnist, test_mnist = cPickle.load(open('data/mnist.pkl', 'rb'))\n",
      "print train_mnist[0].shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(50000, 784)\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## human faces\n",
      "from sklearn.datasets import fetch_olivetti_faces\n",
      "faces = fetch_olivetti_faces()\n",
      "faces.data.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "(400, 4096)"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## time series\n",
      "## NOTE: the Yahoo Finance endpoint used by matplotlib.finance now returns 404,\n",
      "## so the download below fails (see the traceback in the output)\n",
      "import datetime\n",
      "import numpy as np\n",
      "from matplotlib import finance\n",
      "symbol_dict = {\n",
      "    'TOT': 'Total',\n",
      "    'XOM': 'Exxon',\n",
      "    'CVX': 'Chevron',\n",
      "    'COP': 'ConocoPhillips',\n",
      "    'VLO': 'Valero Energy',\n",
      "    'MSFT': 'Microsoft',\n",
      "    'IBM': 'IBM',\n",
      "    'TWX': 'Time Warner',\n",
      "    'CMCSA': 'Comcast',\n",
      "    'CVC': 'Cablevision',\n",
      "    'YHOO': 'Yahoo',\n",
      "    'DELL': 'Dell',\n",
      "    'HPQ': 'HP',\n",
      "    'AMZN': 'Amazon',\n",
      "    'TM': 'Toyota',\n",
      "    'CAJ': 'Canon',\n",
      "    'MTU': 'Mitsubishi',\n",
      "    'SNE': 'Sony',\n",
      "    'F': 'Ford',\n",
      "    'HMC': 'Honda',\n",
      "    'NAV': 'Navistar',\n",
      "    'NOC': 'Northrop Grumman',\n",
      "    'BA': 'Boeing',\n",
      "    'KO': 'Coca Cola',\n",
      "    'MMM': '3M',\n",
      "    'MCD': 'Mc Donalds',\n",
      "    'PEP': 'Pepsi',\n",
      "    'MDLZ': 'Kraft Foods',\n",
      "    'K': 'Kellogg',\n",
      "    'UN': 'Unilever',\n",
      "    'MAR': 'Marriott',\n",
      "    'PG': 'Procter Gamble',\n",
      "    'CL': 'Colgate-Palmolive',\n",
      "    'NWS': 'News Corp',\n",
      "    'GE': 'General Electrics',\n",
      "    'WFC': 'Wells Fargo',\n",
      "    'JPM': 'JPMorgan Chase',\n",
      "    'AIG': 'AIG',\n",
      "    'AXP': 'American express',\n",
      "    'BAC': 'Bank of America',\n",
      "    'GS': 'Goldman Sachs',\n",
      "    'AAPL': 'Apple',\n",
      "    'SAP': 'SAP',\n",
      "    'CSCO': 'Cisco',\n",
      "    'TXN': 'Texas instruments',\n",
      "    'XRX': 'Xerox',\n",
      "    'LMT': 'Lockheed Martin',\n",
      "    'WMT': 'Wal-Mart',\n",
      "    'WAG': 'Walgreen',\n",
      "    'HD': 'Home Depot',\n",
      "    'GSK': 'GlaxoSmithKline',\n",
      "    'PFE': 'Pfizer',\n",
      "    'SNY': 'Sanofi-Aventis',\n",
      "    'NVS': 'Novartis',\n",
      "    'KMB': 'Kimberly-Clark',\n",
      "    'R': 'Ryder',\n",
      "    'GD': 'General Dynamics',\n",
      "    'RTN': 'Raytheon',\n",
      "    'CVS': 'CVS',\n",
      "    'CAT': 'Caterpillar',\n",
      "    'DD': 'DuPont de Nemours'}\n",
      "d1 = datetime.datetime(2003, 1, 1)\n",
      "d2 = datetime.datetime(2008, 1, 1)\n",
      "symbols, names = np.array(symbol_dict.items()).T\n",
      "quotes = [finance.quotes_historical_yahoo(symbol, d1, d2, asobject=True)\n",
      "          for symbol in symbols]\n",
      "open_quotes = np.array([q.open for q in quotes]).astype(np.float)\n",
      "close_quotes = np.array([q.close for q in quotes]).astype(np.float)\n",
      "## It seems that the daily variations of the quotes are what carry most information\n",
      "variation = close_quotes - open_quotes\n",
      "print variation.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "HTTPError",
       "evalue": "HTTP Error 404: Not Found",
       "output_type": "pyerr",
       "traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mHTTPError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 67\u001b[0m \u001b[0msymbols\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnames\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msymbol_dict\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mT\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 68\u001b[0m quotes = [finance.quotes_historical_yahoo(symbol, d1, d2, asobject=True)\n\u001b[1;32m---> 69\u001b[1;33m for symbol in symbols]\n\u001b[0m\u001b[0;32m 70\u001b[0m \u001b[0mopen_quotes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mq\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mq\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mquotes\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfloat\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 71\u001b[0m \u001b[0mclose_quotes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mq\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mq\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mquotes\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfloat\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/pymodules/python2.7/matplotlib/finance.pyc\u001b[0m in \u001b[0;36mquotes_historical_yahoo\u001b[1;34m(ticker, date1, date2, asobject, adjusted, cachename)\u001b[0m\n\u001b[0;32m 225\u001b[0m \u001b[1;31m# warnings.warn(\"Recommend changing to asobject=None\")\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 226\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 227\u001b[1;33m \u001b[0mfh\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfetch_historical_yahoo\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mticker\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdate1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdate2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcachename\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 228\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 229\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/pymodules/python2.7/matplotlib/finance.pyc\u001b[0m in \u001b[0;36mfetch_historical_yahoo\u001b[1;34m(ticker, date1, date2, cachename, dividends)\u001b[0m\n\u001b[0;32m 186\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 187\u001b[0m \u001b[0mmkdirs\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcachedir\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 188\u001b[1;33m \u001b[0murlfh\u001b[0m \u001b[1;33m=\u001b[0m 
\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 189\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 190\u001b[0m \u001b[0mfh\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcachename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'wb'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36murlopen\u001b[1;34m(url, data, timeout)\u001b[0m\n\u001b[0;32m 125\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0m_opener\u001b[0m \u001b[1;32mis\u001b[0m \u001b[0mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 126\u001b[0m \u001b[0m_opener\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mbuild_opener\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 127\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0m_opener\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 128\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 129\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0minstall_opener\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mopener\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mopen\u001b[1;34m(self, fullurl, data, timeout)\u001b[0m\n\u001b[0;32m 408\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mprocessor\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mprocess_response\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 409\u001b[0m \u001b[0mmeth\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mprocessor\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmeth_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 410\u001b[1;33m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmeth\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mreq\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 411\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 412\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mhttp_response\u001b[1;34m(self, request, response)\u001b[0m\n\u001b[0;32m 521\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;33m(\u001b[0m\u001b[1;36m200\u001b[0m \u001b[1;33m<=\u001b[0m \u001b[0mcode\u001b[0m \u001b[1;33m<\u001b[0m \u001b[1;36m300\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 522\u001b[0m response = self.parent.error(\n\u001b[1;32m--> 523\u001b[1;33m 'http', request, response, code, msg, hdrs)\n\u001b[0m\u001b[0;32m 524\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 525\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36merror\u001b[1;34m(self, 
proto, *args)\u001b[0m\n\u001b[0;32m 446\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mhttp_err\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 447\u001b[0m \u001b[0margs\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'default'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'http_error_default'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0morig_args\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 448\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_call_chain\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 449\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 450\u001b[0m \u001b[1;31m# XXX probably also want an abstract factory that knows when it makes\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36m_call_chain\u001b[1;34m(self, chain, kind, meth_name, *args)\u001b[0m\n\u001b[0;32m 380\u001b[0m \u001b[0mfunc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mhandler\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmeth_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 381\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 382\u001b[1;33m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 383\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 384\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mhttp_error_default\u001b[1;34m(self, req, fp, code, msg, hdrs)\u001b[0m\n\u001b[0;32m 529\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mHTTPDefaultErrorHandler\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mBaseHandler\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 530\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mhttp_error_default\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreq\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfp\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mhdrs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 531\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mHTTPError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mreq\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_full_url\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mhdrs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfp\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 532\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 533\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mHTTPRedirectHandler\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mBaseHandler\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mHTTPError\u001b[0m: HTTP Error 404: 
Not Found" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "## evergreen data\n", "import pandas as pd\n", "df = pd.read_csv('data/evergreen.tsv', sep = '\\t', )\n", "print df.shape\n", "df.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(7395, 27)\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "/usr/local/lib/python2.7/dist-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n", "/usr/local/lib/python2.7/dist-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n", "/usr/local/lib/python2.7/dist-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n" ] }, { "html": [ "
\n",
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Index: 8 entries, count to max\n",
        "Data columns (total 21 columns):\n",
        "urlid                             8  non-null values\n",
        "avglinksize                       8  non-null values\n",
        "commonlinkratio_1                 8  non-null values\n",
        "commonlinkratio_2                 8  non-null values\n",
        "commonlinkratio_3                 8  non-null values\n",
        "commonlinkratio_4                 8  non-null values\n",
        "compression_ratio                 8  non-null values\n",
        "embed_ratio                       8  non-null values\n",
        "framebased                        8  non-null values\n",
        "frameTagRatio                     8  non-null values\n",
        "hasDomainLink                     8  non-null values\n",
        "html_ratio                        8  non-null values\n",
        "image_ratio                       8  non-null values\n",
        "lengthyLinkDomain                 8  non-null values\n",
        "linkwordscore                     8  non-null values\n",
        "non_markup_alphanum_characters    8  non-null values\n",
        "numberOfLinks                     8  non-null values\n",
        "numwords_in_url                   8  non-null values\n",
        "parametrizedLinkRatio             8  non-null values\n",
        "spelling_errors_ratio             8  non-null values\n",
        "label                             8  non-null values\n",
        "dtypes: float64(21)\n",
        "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "\n", "Index: 8 entries, count to max\n", "Data columns (total 21 columns):\n", "urlid 8 non-null values\n", "avglinksize 8 non-null values\n", "commonlinkratio_1 8 non-null values\n", "commonlinkratio_2 8 non-null values\n", "commonlinkratio_3 8 non-null values\n", "commonlinkratio_4 8 non-null values\n", "compression_ratio 8 non-null values\n", "embed_ratio 8 non-null values\n", "framebased 8 non-null values\n", "frameTagRatio 8 non-null values\n", "hasDomainLink 8 non-null values\n", "html_ratio 8 non-null values\n", "image_ratio 8 non-null values\n", "lengthyLinkDomain 8 non-null values\n", "linkwordscore 8 non-null values\n", "non_markup_alphanum_characters 8 non-null values\n", "numberOfLinks 8 non-null values\n", "numwords_in_url 8 non-null values\n", "parametrizedLinkRatio 8 non-null values\n", "spelling_errors_ratio 8 non-null values\n", "label 8 non-null values\n", "dtypes: float64(21)" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "## blackbox \n", "black_X, black_y = cPickle.load(open('data/blackbox.pkl'))\n", "print black_X.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(1000, 1875)\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }