{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What top-line value does data science give to companies?\n", "\n", "- Medical analysis using decision trees.\n", "- Helping engineering companies rent/reuse/extend the lifetime of physical goods.\n", "    - e.g. GE instrumenting engines to predict failure.\n", "    - In fact GE bought 10% of Pivotal to help them extend this.\n", "    - Don't need much of a percentage increase in savings to make a massive difference.\n", "- It's hard to put a percentage on savings upfront; US companies are more willing than UK companies to try out data science without one.\n", "- Sometimes you need to push companies to have well-defined measures of success to prove projects provide better analyses.\n", "\n", "## What is most underappreciated by clients about data science?\n", "\n", "- Pitching how accurate your results are and making sure clients know whether that does or doesn't meet their needs.\n", "- Best practices for software engineering (documentation, automated tests) are often ignored.\n", "- If you want to make an impact you need buy-in from the high-level business side.\n", "    - Sometimes in large companies departments are resistant without high-level buy-in.\n", "- Scoping - allowing for an initial exploration stage with iterative feedback.\n", "- Customers often ignore caveats associated with results.\n", "- Customers are often deeply wedded to domain knowledge or heuristics.\n", "    - If you show up with a fancy data-driven classifier customers will go \"but what is this? 
what did you do?\"\n", "- How is data gathered? Even before cleaning or feature engineering.\n", "    - Crafting surveys to gather good data is important.\n", "- Customers see a number and ignore the prerequisite expertise required to get there.\n", "- Finance doesn't need motivation to be data-driven; in fact it couldn't work without data.\n", "    - Really the big problem is poor tooling.\n", "    - Excel and VBA are too widespread. No arguments there.\n", "    - Maybe being controversial, MATLAB and R are also too widespread.\n", "\n", "## Visualisation and its role\n", "\n", "- Important to start, day 1, with any picture, e.g. matplotlib, so you can poke at it with the customer from day 1.\n", "    - Or at least agree that is the type of picture you want to get to.\n", "- Rule of thumb - customers are always impressed with a histogram of their own data.\n", "    - They've never even bothered looking at it.\n", "    - Simplicity is powerful.\n", "- Sketching the data pipeline (source, processing, output) is often useful.\n", "    - Try to make it not just a set of numbers on a spreadsheet.\n", "    - Give it form, allow customers to visualise operations.\n", "    - They don't trust the outright output from a classifier. It's not auditable without expertise.\n", "\n", "## Customers want magic APIs or methods. What would you wish for?\n", "\n", "- Clients often don't even have data. Their magic is \"get me data\".\n", "    - Scraping etc.\n", "- 80% of the time is spent cleaning up data.\n", "    - Even if there were a magic API, it would hide dangerous assumptions!\n", "    - Want a Dyson of data.\n", "- Black-box classification solvers, even if they work, don't give enough confidence or auditable results.\n", "\n", "## Data munging, data cleaning\n", "\n", "- Suspect there's some automation of data cleaning and munging that is missing.\n", "    - Not fully automatable but surely some work can be done.\n", "- But is there always too much domain knowledge required to automate data munging?\n", "    - e.g. sentiment analysis of tweets. 
Not automatable because too much domain knowledge is required.\n", "\n", "## What tool needs more exposure and education?\n", "\n", "- RStudio, although R-only, is very intuitive.\n", "- RShiny is great for quick, interactive, publishable results.\n", "    - Would be great if IPython Notebook was exportable to something like what RShiny produces.\n", "- Want to be able to hide code in an IPython Notebook.\n", "- Tools inevitably are domain- and organisation-specific; integration work is necessary.\n", "    - Software needs to be better engineered.\n", "    - All software is subordinate to business requirements.\n", "\n", "## Questions\n", "\n", "### Part of Speech tagging?\n", "\n", "- TextBlob is a combination of `nltk` and `pattern`. Not at version 1 yet, but good for playing.\n", "- `word2vec` is fast, heavily optimised C code, for Python.\n", "\n", "### Education vs Experience?\n", "\n", "- \"Science to Data Science\" - finishing school for numerate PhDs to become data scientists.\n", "\n", "### Do you miss academia and its freedom?\n", "\n", "- The only output in academia is papers, and it doesn't matter if anyone reads them.\n", "    - Now want to have an impact.\n", "    - Yes, it sacrifices freedom, but if you fight your corner you can get some back.\n", "    - You can still write papers, but on your own initiative.\n", "- Academia's freedom can be illusory.\n", "    - There still is structure. Some professor hates an idea so you will not be able to pursue it.\n", "    - No software craftsmanship." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }