{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Common Preprocesing Methods\n", "- **feature-wise normalization** - minus feature-mean and divided by feature std: $\\tilde{f_i}=\\frac{f_i-[f_i]}{std(f_i)}$. but if the features are sparse, usually we dont do the mean-minus step - for numerical stability, we usually add a small number to std before the normalization\n", "- **sample-wise normalization** - each sample divided by sample $l1$ or $l2$ norm: $\\tilde{x_i}=\\frac{x_i}{|x_i|_2}$, usually we dont need to substract the mean of each sample, specially when the samples are sparse vectors\n", "- **global sample-wise normalization** - divide each sample by the global mean/median $l1$ or $l2$ norm of all samples\n", "- based on some reports of machine learning competence, sample-wise normalization is usually suprisingly better, where feature-wise should always be the first thing to try\n", "- **PCA** is always worth trying since it filter out low-pass noise by dimensionality reduction\n", "- **ZCA** might be very useful for image data\n", "- **Clustering**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common Data Patterns\n", "- **sparseness**: how sparse the sample (feature) vectors are - this affect normalization methods\n", "- **binary/discrete v.s. continouse** - this affect PCA, clustering - if it is binary data, it doesnt make sense to use kmeans directly as it will be doomed to converge to a tie. One way of doing this is to use specialized clustering (such as bi-clustering) for text data and do PCA first, then use kmeans. For discrete values, we need to use other tech such as mixture of Bernoulli to do clustering\n", "- **shape of data** - fat or long. e.g., if the dimensionality of data is too high, it does not make much sense to use an Euclidean based distance to do clustering\n", "\n", "## Common Data Types:\n", "- **Text data** - e.g., 20-newsgroup\n", "- **Image data** - e.g. MNIST, human faces, CIFAR\n", "- **Time Series** - e.g. YAHOO Stock\n", "- **Discrete-value features** - e.g. evergreen Kaggle competetion\n", "- **Continous features where a deep architecture is needed to find the non-linear structure** - e.g. 
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## 20-newsgroup\n",
      "from sklearn.datasets import fetch_20newsgroups\n",
      "text = fetch_20newsgroups()\n",
      "print len(text.data)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "11314\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## MNIST\n",
      "import cPickle\n",
      "train_mnist, valid_mnist, test_mnist = cPickle.load(open('data/mnist.pkl', 'rb'))\n",
      "print train_mnist[0].shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(50000, 784)\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## human faces\n",
      "from sklearn.datasets import fetch_olivetti_faces\n",
      "faces = fetch_olivetti_faces()\n",
      "faces.data.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "(400, 4096)"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "## time series\n",
      "## NOTE: the Yahoo Finance endpoint used by matplotlib.finance now returns 404,\n",
      "## so the download below fails (see the traceback in the output)\n",
      "import datetime\n",
      "import numpy as np\n",
      "from matplotlib import finance\n",
      "symbol_dict = {\n",
      "    'TOT': 'Total',\n",
      "    'XOM': 'Exxon',\n",
      "    'CVX': 'Chevron',\n",
      "    'COP': 'ConocoPhillips',\n",
      "    'VLO': 'Valero Energy',\n",
      "    'MSFT': 'Microsoft',\n",
      "    'IBM': 'IBM',\n",
      "    'TWX': 'Time Warner',\n",
      "    'CMCSA': 'Comcast',\n",
      "    'CVC': 'Cablevision',\n",
      "    'YHOO': 'Yahoo',\n",
      "    'DELL': 'Dell',\n",
      "    'HPQ': 'HP',\n",
      "    'AMZN': 'Amazon',\n",
      "    'TM': 'Toyota',\n",
      "    'CAJ': 'Canon',\n",
      "    'MTU': 'Mitsubishi',\n",
      "    'SNE': 'Sony',\n",
      "    'F': 'Ford',\n",
      "    'HMC': 'Honda',\n",
      "    'NAV': 'Navistar',\n",
      "    'NOC': 'Northrop Grumman',\n",
      "    'BA': 'Boeing',\n",
      "    'KO': 'Coca Cola',\n",
      "    'MMM': '3M',\n",
      "    'MCD': 'Mc Donalds',\n",
      "    'PEP': 'Pepsi',\n",
      "    'MDLZ': 'Kraft Foods',\n",
      "    'K': 'Kellogg',\n",
      "    'UN': 'Unilever',\n",
      "    'MAR': 'Marriott',\n",
      "    'PG': 'Procter Gamble',\n",
      "    'CL': 'Colgate-Palmolive',\n",
      "    'NWS': 'News Corp',\n",
      "    'GE': 'General Electrics',\n",
      "    'WFC': 'Wells Fargo',\n",
      "    'JPM': 'JPMorgan Chase',\n",
      "    'AIG': 'AIG',\n",
      "    'AXP': 'American express',\n",
      "    'BAC': 'Bank of America',\n",
      "    'GS': 'Goldman Sachs',\n",
      "    'AAPL': 'Apple',\n",
      "    'SAP': 'SAP',\n",
      "    'CSCO': 'Cisco',\n",
      "    'TXN': 'Texas instruments',\n",
      "    'XRX': 'Xerox',\n",
      "    'LMT': 'Lockheed Martin',\n",
      "    'WMT': 'Wal-Mart',\n",
      "    'WAG': 'Walgreen',\n",
      "    'HD': 'Home Depot',\n",
      "    'GSK': 'GlaxoSmithKline',\n",
      "    'PFE': 'Pfizer',\n",
      "    'SNY': 'Sanofi-Aventis',\n",
      "    'NVS': 'Novartis',\n",
      "    'KMB': 'Kimberly-Clark',\n",
      "    'R': 'Ryder',\n",
      "    'GD': 'General Dynamics',\n",
      "    'RTN': 'Raytheon',\n",
      "    'CVS': 'CVS',\n",
      "    'CAT': 'Caterpillar',\n",
      "    'DD': 'DuPont de Nemours'}\n",
      "d1 = datetime.datetime(2003, 1, 1)\n",
      "d2 = datetime.datetime(2008, 1, 1)\n",
      "symbols, names = np.array(symbol_dict.items()).T\n",
      "quotes = [finance.quotes_historical_yahoo(symbol, d1, d2, asobject=True)\n",
      "          for symbol in symbols]\n",
      "open_quotes = np.array([q.open for q in quotes]).astype(np.float)\n",
      "close_quotes = np.array([q.close for q in quotes]).astype(np.float)\n",
      "## It seems that the daily variations of the quotes are what carry most information\n",
      "variation = close_quotes - open_quotes\n",
      "print variation.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "HTTPError",
       "evalue": "HTTP Error 404: Not Found",
       "output_type": "pyerr",
       "traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mHTTPError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 67\u001b[0m \u001b[0msymbols\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnames\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msymbol_dict\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mT\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 68\u001b[0m quotes = [finance.quotes_historical_yahoo(symbol, d1, d2, asobject=True)\n\u001b[1;32m---> 69\u001b[1;33m for symbol in symbols]\n\u001b[0m\u001b[0;32m 70\u001b[0m \u001b[0mopen_quotes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mq\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mq\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mquotes\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfloat\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 71\u001b[0m \u001b[0mclose_quotes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mq\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mq\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mquotes\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfloat\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/pymodules/python2.7/matplotlib/finance.pyc\u001b[0m in \u001b[0;36mquotes_historical_yahoo\u001b[1;34m(ticker, date1, date2, asobject, adjusted, cachename)\u001b[0m\n\u001b[0;32m 225\u001b[0m \u001b[1;31m# warnings.warn(\"Recommend changing to asobject=None\")\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 226\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 227\u001b[1;33m \u001b[0mfh\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfetch_historical_yahoo\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mticker\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdate1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdate2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcachename\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 228\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 229\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/pymodules/python2.7/matplotlib/finance.pyc\u001b[0m in \u001b[0;36mfetch_historical_yahoo\u001b[1;34m(ticker, date1, date2, cachename, dividends)\u001b[0m\n\u001b[0;32m 186\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 187\u001b[0m \u001b[0mmkdirs\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcachedir\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 188\u001b[1;33m \u001b[0murlfh\u001b[0m \u001b[1;33m=\u001b[0m 
\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 189\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 190\u001b[0m \u001b[0mfh\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcachename\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'wb'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36murlopen\u001b[1;34m(url, data, timeout)\u001b[0m\n\u001b[0;32m 125\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0m_opener\u001b[0m \u001b[1;32mis\u001b[0m \u001b[0mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 126\u001b[0m \u001b[0m_opener\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mbuild_opener\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 127\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0m_opener\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 128\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 129\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0minstall_opener\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mopener\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mopen\u001b[1;34m(self, fullurl, data, timeout)\u001b[0m\n\u001b[0;32m 408\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mprocessor\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mprocess_response\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 409\u001b[0m \u001b[0mmeth\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mprocessor\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmeth_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 410\u001b[1;33m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmeth\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mreq\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 411\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 412\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mhttp_response\u001b[1;34m(self, request, response)\u001b[0m\n\u001b[0;32m 521\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;33m(\u001b[0m\u001b[1;36m200\u001b[0m \u001b[1;33m<=\u001b[0m \u001b[0mcode\u001b[0m \u001b[1;33m<\u001b[0m \u001b[1;36m300\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 522\u001b[0m response = self.parent.error(\n\u001b[1;32m--> 523\u001b[1;33m 'http', request, response, code, msg, hdrs)\n\u001b[0m\u001b[0;32m 524\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 525\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36merror\u001b[1;34m(self, 
proto, *args)\u001b[0m\n\u001b[0;32m 446\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mhttp_err\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 447\u001b[0m \u001b[0margs\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'default'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'http_error_default'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0morig_args\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 448\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_call_chain\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 449\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 450\u001b[0m \u001b[1;31m# XXX probably also want an abstract factory that knows when it makes\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36m_call_chain\u001b[1;34m(self, chain, kind, meth_name, *args)\u001b[0m\n\u001b[0;32m 380\u001b[0m \u001b[0mfunc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mhandler\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmeth_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 381\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 382\u001b[1;33m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 383\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 384\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mhttp_error_default\u001b[1;34m(self, req, fp, code, msg, hdrs)\u001b[0m\n\u001b[0;32m 529\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mHTTPDefaultErrorHandler\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mBaseHandler\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 530\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mhttp_error_default\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreq\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfp\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mhdrs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 531\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mHTTPError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mreq\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_full_url\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mhdrs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfp\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 532\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 533\u001b[0m \u001b[1;32mclass\u001b[0m \u001b[0mHTTPRedirectHandler\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mBaseHandler\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mHTTPError\u001b[0m: HTTP Error 404: 
Not Found" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "## evergreen data\n", "import pandas as pd\n", "df = pd.read_csv('data/evergreen.tsv', sep = '\\t', )\n", "print df.shape\n", "df.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(7395, 27)\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "/usr/local/lib/python2.7/dist-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n", "/usr/local/lib/python2.7/dist-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n", "/usr/local/lib/python2.7/dist-packages/pandas/core/config.py:570: DeprecationWarning: height has been deprecated.\n", "\n", " warnings.warn(d.msg, DeprecationWarning)\n" ] }, { "html": [ "
\n",
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Index: 8 entries, count to max\n",
        "Data columns (total 21 columns):\n",
        "urlid                             8  non-null values\n",
        "avglinksize                       8  non-null values\n",
        "commonlinkratio_1                 8  non-null values\n",
        "commonlinkratio_2                 8  non-null values\n",
        "commonlinkratio_3                 8  non-null values\n",
        "commonlinkratio_4                 8  non-null values\n",
        "compression_ratio                 8  non-null values\n",
        "embed_ratio                       8  non-null values\n",
        "framebased                        8  non-null values\n",
        "frameTagRatio                     8  non-null values\n",
        "hasDomainLink                     8  non-null values\n",
        "html_ratio                        8  non-null values\n",
        "image_ratio                       8  non-null values\n",
        "lengthyLinkDomain                 8  non-null values\n",
        "linkwordscore                     8  non-null values\n",
        "non_markup_alphanum_characters    8  non-null values\n",
        "numberOfLinks                     8  non-null values\n",
        "numwords_in_url                   8  non-null values\n",
        "parametrizedLinkRatio             8  non-null values\n",
        "spelling_errors_ratio             8  non-null values\n",
        "label                             8  non-null values\n",
        "dtypes: float64(21)\n",
        "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "\n", "Index: 8 entries, count to max\n", "Data columns (total 21 columns):\n", "urlid 8 non-null values\n", "avglinksize 8 non-null values\n", "commonlinkratio_1 8 non-null values\n", "commonlinkratio_2 8 non-null values\n", "commonlinkratio_3 8 non-null values\n", "commonlinkratio_4 8 non-null values\n", "compression_ratio 8 non-null values\n", "embed_ratio 8 non-null values\n", "framebased 8 non-null values\n", "frameTagRatio 8 non-null values\n", "hasDomainLink 8 non-null values\n", "html_ratio 8 non-null values\n", "image_ratio 8 non-null values\n", "lengthyLinkDomain 8 non-null values\n", "linkwordscore 8 non-null values\n", "non_markup_alphanum_characters 8 non-null values\n", "numberOfLinks 8 non-null values\n", "numwords_in_url 8 non-null values\n", "parametrizedLinkRatio 8 non-null values\n", "spelling_errors_ratio 8 non-null values\n", "label 8 non-null values\n", "dtypes: float64(21)" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "## blackbox \n", "black_X, black_y = cPickle.load(open('data/blackbox.pkl'))\n", "print black_X.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(1000, 1875)\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }