{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%autosave 10"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "javascript": [
        "IPython.notebook.set_autosave_interval(10000)"
       ],
       "metadata": {},
       "output_type": "display_data"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Autosaving every 10 seconds\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## What is Conversocial\n",
      "\n",
      "- Aggregate social data for Sainsbury's, etc large companies\n",
      "- Web platform, make it easy for large teams to collaborate.\n",
      "- Know where to focus efforts, both short and long term.\n",
      "- Diverse customer base.\n",
      "    - One solution for everyone doesn't (cannot) exist.\n",
      "    - Learn what data each particular person (and business) find important, autonomously.\n",
      "    - What do they tend to read and respond to? Return or raise importance of those types of elements.\n",
      "    - Still need heuristics to establish priority of items, most popular doesn't always work either.\n",
      "\n",
      "## Process\n",
      "\n",
      "- Iterative.\n",
      "    - Constantly seek feedback from customers, monitor performance.\n",
      "    - This is useful across customers; particular concerns, when addresed, improve experience for others.\n",
      "    - Plan, Do, Check, Act, Plan, ...\n",
      "- Customers have metrics\n",
      "    - Given a customer tweet, how long does it take a company respond to it.\n",
      "    - Need to minimise this to prevent crises.\n",
      "    \n",
      "##\u00a0How\n",
      "\n",
      "- Feature engineering\n",
      "    - Term Frequency, Inverse Document Frequency (TF-IDF), bag of words\n",
      "    - Strip out stopwords\n",
      "    - ngrams\n",
      "    - length of text\n",
      "    - contains a URL\n",
      "    - contains question mark\n",
      "    - ...\n",
      "- Converto binary matrix (0/1)\n",
      "- Singular Value Decomposition (SVD)\n",
      "    - Reduces size of matrix you're working on\n",
      "    - !!AI feature reduction, kind of like Principle Component Analysis (PCA).\n",
      "- Feature selection same for all customers\n",
      "- But different machine learning models for each customers.\n",
      "- Use a read-only slave copy of data, away from production, safter.\n",
      "    - Experiment with different models\n",
      "    \n",
      "## Training schedule\n",
      "\n",
      "- Daily training with queue\n",
      "- Low priority; production already has good models, just trying to find if better ones can be found.\n",
      "\n",
      "##\u00a0Models in production\n",
      "\n",
      "- Dedicated servers to classifying live data\n",
      "- Chef, AWS.\n",
      "- In-memory as much as possible.\n",
      "\n",
      "## Validation\n",
      "\n",
      "- Compare to crowd-sourced data to confirm models are useful.\n",
      "- True positive rate vs false positive rate.\n",
      "\n",
      "##\u00a0Questions\n",
      "\n",
      "- What models and libraries do you use?\n",
      "    - Bayesian, SVMs, linear regression.\n",
      "    - `scikit-learn` and `nltk`\n",
      "    - Not very mysterious, obvious methods.\n",
      "\n",
      "- Stack, how to scale `nltk` to production?\n",
      "    - Yes, there are problems with scaling, then switch to something else.\n",
      "    - `redis` didn't scale, switching to `rabbitmq`.\n",
      "    - !!AI shame, no specific feedback on scaling `nltk`.\n",
      "\n",
      "- What customer feedback specifically?\n",
      "    - Heuristic - customer who posts, then posts again, defaults to high priority.\n",
      "    - Some customers like this, but \"super fans\" tend to multiple post just because, not due to high priority.\n",
      "    - Other customers need to filter out super fans.\n",
      "\n",
      "- Is Python a bottleneck for high throughput?\n",
      "    - Yes, systems did collapse at peak throughput sometimes.\n",
      "    - Didn't have DevOps or monitoring to warn about these events, but now do. Part of natural lifecycle of a company.\n",
      "\n",
      "- Do you do picture analysis?\n",
      "    - No, not yet.\n",
      "    - Lots of startups got acquired that did picture analysis, but went quiet.\n",
      "\n",
      "- Have you considered feature reduction using `word2vec` [https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/):\n",
      "    - No. Use SVD for feature reduction for now.\n",
      "    - `word2vec` used for feature engineering raw text into numeric arrays.\n",
      "\n",
      "- Time series analysis? Seasonality, old documents?\n",
      "    - No.\n",
      "    \n",
      "- Disambiguation of language, slang?\n",
      "    - Use heuristics, manual dictionary, to convert words to standard vocabulary. This is a vital part of feature engineering.\n",
      "    - Use stopword lists to filter out uninformative words.\n",
      "    - Rely on SVD to do the rest.\n",
      "\n",
      "- Do any categorisation or sentiment analysis?\n",
      "    - No, don't do sentiment analysis. Customers think manual tagging is more accurate and don't want automatic tagging.\n",
      "    - No, don't do automatic tagging, but are looking into it. Need to confirm exactly what customers want; variable tags, fixed set of customer-defined tags.\n",
      "    - Different customers want different things, still at product stage."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}