{ "metadata": { "kernelspec": { "codemirror_mode": { "name": "ipython", "version": 3 }, "display_name": "IPython (Python 3)", "language": "python", "name": "python3" }, "name": "", "signature": "sha256:56ee8b7ea5e38567dafb3c8b98bb3d6b2f93787914bb6ee7f7a978dec06fd473" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "from IPython.core.display import HTML\n", "\n", "with open('creative_commons.txt', 'r') as f:\n", " html = f.read()\n", " \n", "name = '2015-03-16-outlier_detection'\n", "\n", "html = '''\n", "\n", "
This post was written as an IPython notebook.\n", " It is available for download\n", " or as a static html.
\n", "\n", "%s''' % (name, name, html)\n", "\n", "%matplotlib inline\n", "from matplotlib import style\n", "style.use('ggplot')\n", "\n", "from datetime import datetime\n", "\n", "title = \"3 ways to remove outliers from your data\"\n", "hour = datetime.utcnow().strftime('%H:%M')\n", "comments=\"true\"\n", "\n", "date = '-'.join(name.split('-')[:3])\n", "slug = '-'.join(name.split('-')[3:])\n", "\n", "metadata = dict(title=title,\n", " date=date,\n", " hour=hour,\n", " comments=comments,\n", " slug=slug,\n", " name=name)\n", "\n", "markdown = \"\"\"Title: {title}\n", "date: {date} {hour}\n", "comments: {comments}\n", "slug: {slug}\n", "\n", "{{% notebook {name}.ipynb cells[1:] %}}\n", "\"\"\".format(**metadata)\n", "\n", "content = os.path.abspath(os.path.join(os.getcwd(), os.pardir, os.pardir, '{}.md'.format(name)))\n", "with open('{}'.format(content), 'w') as f:\n", " f.writelines(markdown)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to Google Analytics, my post\n", "[\"Dealing with spiky data\"](https://ocefpaf.github.io/python4oceanographers/blog/2013/05/20/spikes),\n", "is by far the most visited on the blog. 
I think that the reasons are: it is\n", "one of the oldest posts, and it is a real problem that people have to deal everyday.\n", "\n", "Recently I found an amazing series of post writing by Bugra on how to perform\n", "outlier detection using\n", "[FFT, median filtering](http://bugra.github.io/work/notes/2014-03-31/outlier-detection-in-time-series-signals-fft-median-filtering/),\n", "[Gaussian processes](http://bugra.github.io/work/notes/2014-05-11/robust-regression-and-outlier-detection-via-gaussian-processes/),\n", "and [MCMC](http://bugra.github.io/work/notes/2014-04-26/outlier-detection-markov-chain-monte-carlo-via-pymc/)\n", "\n", "I will test out the low hanging fruit (FFT and median filtering) using the same\n", "data from my [original post](https://ocefpaf.github.io/python4oceanographers/blog/2013/05/20/spikes)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from datetime import datetime\n", "from pandas import read_table\n", "\n", "fname = './data/spikey_v.dat'\n", "cols = ['j', 'u', 'v', 'temp', 'sal', 'y', 'mn', 'd', 'h', 'mi']\n", "\n", "df = read_table(fname , delim_whitespace=True, names=cols)\n", "\n", "df.index = [datetime(*x) for x in zip(df['y'], df['mn'], df['d'], df['h'], df['mi'])]\n", "df = df.drop(['y', 'mn', 'd', 'h', 'mi'], axis=1)\n", "\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "| \n", " | j | \n", "u | \n", "v | \n", "temp | \n", "sal | \n", "
|---|---|---|---|---|---|
| 1994-11-22 04:00:00 | \n", "1056.1667 | \n", "-0.1 | \n", "0.0 | \n", "23.9 | \n", "34.6 | \n", "
| 1994-11-22 05:00:00 | \n", "1056.2083 | \n", "-0.2 | \n", "0.7 | \n", "23.9 | \n", "34.6 | \n", "
| 1994-11-22 06:00:00 | \n", "1056.2500 | \n", "-0.1 | \n", "2.0 | \n", "23.9 | \n", "34.6 | \n", "
| 1994-11-22 07:00:00 | \n", "1056.2917 | \n", "0.0 | \n", "3.1 | \n", "23.9 | \n", "34.6 | \n", "
| 1994-11-22 08:00:00 | \n", "1056.3333 | \n", "-0.1 | \n", "2.7 | \n", "23.9 | \n", "34.6 | \n", "