{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Conversocial\n", "\n", "- Aggregate social data for Sainsbury's, etc large companies\n", "- Web platform, make it easy for large teams to collaborate.\n", "- Know where to focus efforts, both short and long term.\n", "- Diverse customer base.\n", " - One solution for everyone doesn't (cannot) exist.\n", " - Learn what data each particular person (and business) find important, autonomously.\n", " - What do they tend to read and respond to? Return or raise importance of those types of elements.\n", " - Still need heuristics to establish priority of items, most popular doesn't always work either.\n", "\n", "## Process\n", "\n", "- Iterative.\n", " - Constantly seek feedback from customers, monitor performance.\n", " - This is useful across customers; particular concerns, when addresed, improve experience for others.\n", " - Plan, Do, Check, Act, Plan, ...\n", "- Customers have metrics\n", " - Given a customer tweet, how long does it take a company respond to it.\n", " - Need to minimise this to prevent crises.\n", " \n", "##\u00a0How\n", "\n", "- Feature engineering\n", " - Term Frequency, Inverse Document Frequency (TF-IDF), bag of words\n", " - Strip out stopwords\n", " - ngrams\n", " - length of text\n", " - contains a URL\n", " - contains question mark\n", " - ...\n", "- Converto binary matrix (0/1)\n", "- Singular Value Decomposition (SVD)\n", " - Reduces size of matrix you're working on\n", " - !!AI feature reduction, kind of like Principle Component Analysis (PCA).\n", "- Feature selection same for all customers\n", "- But different machine learning models for each customers.\n", "- Use a read-only slave copy of data, away from production, safter.\n", " - Experiment with different models\n", " \n", "## Training schedule\n", "\n", "- Daily training with queue\n", "- Low priority; production already has good models, just trying to find if better ones can be found.\n", "\n", "##\u00a0Models in production\n", "\n", "- Dedicated servers to classifying live data\n", "- Chef, AWS.\n", "- In-memory as much as possible.\n", "\n", "## Validation\n", "\n", "- Compare to crowd-sourced data to confirm models are useful.\n", "- True positive rate vs false positive rate.\n", "\n", "##\u00a0Questions\n", "\n", "- What models and libraries do you use?\n", " - Bayesian, SVMs, linear regression.\n", " - `scikit-learn` and `nltk`\n", " - Not very mysterious, obvious methods.\n", "\n", "- Stack, how to scale `nltk` to production?\n", " - Yes, there are problems with scaling, then switch to something else.\n", " - `redis` didn't scale, switching to `rabbitmq`.\n", " - !!AI shame, no specific feedback on scaling `nltk`.\n", "\n", "- What customer feedback specifically?\n", " - Heuristic - customer who posts, then posts again, defaults to high priority.\n", " - Some customers like this, but \"super fans\" tend to multiple post just because, not due to high priority.\n", " - Other customers need to filter out super fans.\n", "\n", "- Is Python a bottleneck for high throughput?\n", " - Yes, systems did collapse at peak throughput sometimes.\n", " - Didn't have DevOps or monitoring to warn about these events, but now do. Part of natural lifecycle of a company.\n", "\n", "- Do you do picture analysis?\n", " - No, not yet.\n", " - Lots of startups got acquired that did picture analysis, but went quiet.\n", "\n", "- Have you considered feature reduction using `word2vec` [https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/):\n", " - No. Use SVD for feature reduction for now.\n", " - `word2vec` used for feature engineering raw text into numeric arrays.\n", "\n", "- Time series analysis? Seasonality, old documents?\n", " - No.\n", " \n", "- Disambiguation of language, slang?\n", " - Use heuristics, manual dictionary, to convert words to standard vocabulary. This is a vital part of feature engineering.\n", " - Use stopword lists to filter out uninformative words.\n", " - Rely on SVD to do the rest.\n", "\n", "- Do any categorisation or sentiment analysis?\n", " - No, don't do sentiment analysis. Customers think manual tagging is more accurate and don't want automatic tagging.\n", " - No, don't do automatic tagging, but are looking into it. Need to confirm exactly what customers want; variable tags, fixed set of customer-defined tags.\n", " - Different customers want different things, still at product stage." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }