{ "metadata": { "name": "", "signature": "sha256:923f8917f6de195904aab6f660095890f6ea9dbf0cf518ff1741faf4b0e0af07" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning\n", "> Machine learning is to be able to write programs with data without explicitly writing it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Problem Statement\n", "You are given a task: build an Human Resources classifier; we want to be able to classify company news into human resources or not.\n", " \n", "#### What do you do? \n", "\n", "- Let's look at the job titles in the news. It may not say if it is human resources event, but the news that has job titles in the news is more probable to be human resources news.\n", "- If there are company names, that is also a good signal. \n", "- We could look at the verbs and other noun forms(hiring, resignation, letting go, laying off) to say that is actually an HR article.\n", "- We combine all these rules.\n", "- Then, we use some heuristics to determine a threshold to \"classify\" the news article whether it is an HR or not.\n", "\n", "#### Disadvantages\n", "- Rule and heuristic based. If we want to _improve_ the classifier, we introduce more rules.\n", "- The rules are highly dependent on the developer. Biases and prejudices on the articles that you read will be the main driver.\n", "- For a more complicated classification problem, you may not cover all of the rules.\n", "- For a domain that you do not know much about, you need to first learn the domain, the article structure and what you need to use to build the rules.(this is true for machine learning but a lesser extent)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### No more heuristics and rules" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## HR Article classification Schema without ML\n", "company_names = ['Google', 'Microsoft', 'Apple']\n", "job_titles = ['CFO', 'CEO', 'CFO']\n", "text = 'Microsoft recently hired such and such person AS CFO'\n", "SOME_THRESHOLD = 20\n", "confidence =0 \n", "### HR article keywords\n", "if 'hire' in text or 'hiring' in text or 'join' or 'joining' in text or 'laying off' in text or 'resign' in text: \n", " confidence += 10\n", "\n", "# Job titles are definitely good again, HR articles generally say the position of the new hire\n", "for job_title in job_titles:\n", " if job_title in text:\n", " confidence += 10\n", "\n", "# If we have company name, that is a good sign as article could be in business domain\n", "for company_name in company_names:\n", " if company_name in text:\n", " confidence += 10\n", "label = 0\n", "if confidence >= SOME_THRESHOLD:\n", " label = 1 # It is an hr article" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "if label == 1:\n", " print('-- {} -- is an HR article'.format(text))\n", "else:\n", " print('-- {} -- is not an HR article'.format(text))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "-- Microsoft recently hired such and such person AS CFO -- is an HR article\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### But data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Given trained classifier, vectorizer and feature selection method\n", "# This is how one may classify an article in Scikit-learn(assuming the classifier is also trained on labeled data)\n", "## Convert into a vector\n", "count = vectorizer.transform(np.asarray(text).toarray())\n", "## Do feature selection\n", "selected_feats = feat_selector.transform(count)\n", "## Algorithm to classify\n", "pred_class = clf.predict(selected_feats)\n", "if label == 1:\n", " print('-- {} -- is an HR article'.format(text))\n", "else:\n", " print('-- {} -- is not an HR article'.format(text))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "Data does replace heuristics, hard-coded rules, assumptions and beliefs. Machine learning only enables data to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generalization\n", "> We need to learn small amount of data and induce from it; generalize over much larger datasets by using that knowledge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Advantages\n", "- No rules, no heuristic.\n", "- Provides an evaluation mechanism and based on the classifier, probability of the classifier for a given class. You could have confidence levels based on probability of classifier.(if you still need it).\n", "- As long as you have a representative training data, you would have a much more comprehensive classifier.\n", "- In order to improve the classifier, you need more data. This scales very well. \n", "- If you want to still incorporate heuristics and rules, you could still have the option. Actually, if you want to improve on precision or recall, some heuristics may come handy in order to prevent false positives and/or false negatives." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine Learning Types\n", "==== \n", "\n", "### Unsupervised Learning\n", "- Dimensionality Reduction (Feature Extraction(PCA), Feature Selection, Text Feature Visualization)\n", "- Clustering (Community Detection, Image Segmentation, Recommender Systems)\n", "\n", "### Supervised Learning\n", "- Classification (Object recognition, text classification, loan prediction)\n", "- Regression (Trend Estimation, House Price Prediction)" ] } ], "metadata": {} } ] }