{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sponsors\n", "\n", "- NESTA - looking for areas of growth to invest in and support.\n", "- Research is freely available online, to provoke discussion.\n", "\n", "\n", "## Research questions\n", "\n", "- What are \"digital companies\"? (Main focus of the talk.)\n", "- What do they look like?\n", "- What drives their innovation/growth?\n", "\n", "## Why?\n", "\n", "- Standard classifications of businesses don't work.\n", "- They are used to measure economic output, but that doesn't work for digital companies.\n", "\n", "## SIC - Standard Industrial Classification\n", "\n", "- 731 SICs, self-classified.\n", " - (from a question) self-classification has no incentive for accuracy; in fact, quite the opposite. 
Changing your classification over time to accurately reflect a changing business strategy just adds paperwork.\n", "- e.g.\n", " - 77220: renting of video tapes and disks\n", " - 01440: raising of camels and camelids\n", "- 82990: other business support service activities (**10%**)\n", "- **20%** not classified\n", "- 3 million companies in Companies House\n", "- Almost a million are unclassified or improperly classified.\n", "- This presentation / research did not attempt to classify these unclassified companies.\n", "\n", "## Challenge\n", "\n", "- Mapping is necessarily imprecise.\n", "- Data-driven methods can be richer, more informative, and more up to date.\n", "\n", "## Linked datasets\n", "\n", "- Online activity\n", "- Trade activity\n", "- Trademarks / Patents\n", "- News/events\n", "- Financials\n", "- ...\n", "\n", "## Approach\n", "\n", "- Classify by:\n", " - Sector (their vertical)\n", " - Product type\n", " - Client type (B2B, B2C, government)\n", " - Sales process (franchise, subscription)\n", "- e.g. 
you might be an Oil and Gas company that produces software\n", "\n", "## Tech stack\n", "\n", "- scrapy / pandas / scikit-learn\n", "\n", "## Getting a training set\n", "\n", "- Some public companies are pre-classified.\n", "- Expert panels for authoritative labels.\n", "- Crowdsourcing\n", " - !!AI sounds like Amazon Mechanical Turk\n", " - use qualification tests, send each task to many humans, and take the majority vote.\n", "\n", "## Feature engineering\n", "\n", "- Multiple sources\n", " - Free text (news)\n", " - Structured (patent filings)\n", "- Cleaning\n", " - Malformed HTML\n", " - Stripping JavaScript\n", " - `lxml`, `beautifulsoup` (prefer `lxml`, more robust)\n", " - !!AI `beautifulsoup4` defaults to using `lxml` if installed.\n", " - `goose` for article extraction\n", "- Tokenising and calculating TF-IDF weights\n", "\n", "## Modelling\n", "\n", "- pandas to build up feature sets\n", "- linear SVMs; linear models are fast for thousands of features\n", "- multi-class classifiers for sector, product, client, sales process\n", "\n", "## Example\n", "\n", "- Kelton\n", " - Official SIC classification: 82990 (other)\n", " - Their classification\n", " - Sector: Oil and Energy\n", " - Product: Software Company\n", " - Client: Businesses\n", " - Sales process: Projects\n", " - Based in: Aberdeen\n", "\n", "## Challenges\n", "\n", "- Company structure\n", " - Subsidiaries, trading partners, who is actually trading and where and what." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }