{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# News Headline Analysis\n", "\n", "In this project we analyze news headlines written by two journalists – a **finance** reporter from Business Insider and a **celebrity** reporter from the Huffington Post – to find similarities and differences in the way these authors write headlines for their news articles and blog posts. Our selected reporters are:\n", "\n", "- Akin Oyedele from Business Insider, who covers market updates; and\n", "- Carly Ledbetter from the Huffington Post, who mainly writes about celebrities.\n", "\n", "### Approach\n", "\n", "We first collect the news headlines from each author and parse them to obtain parse trees. From each parse tree we then extract information that is indicative of the overall structure of the headline.\n", "\n", "Next, we define a simple sequence similarity metric to compare any pair of headlines quantitatively, and apply it to all of the headlines we've gathered for each author to find out how similar each pair of headlines is.\n", "\n", "Finally, we use K-Means and t-SNE to produce a visual map of all the headlines, where we can see the similarities and differences between the two authors more clearly.\n", "\n", "### Data\n", "\n", "For this project we've gathered 700 headlines for each author using the [AYLIEN News API](https://newsapi.aylien.com), which we're going to analyze using Python. You can obtain the pickled data files directly from the GitHub repository, or generate them with [the data collection notebook](XXX) that we've prepared for this project." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A primer on parse trees\n", "\n", "In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence, according to some pre-defined grammar.\n", "\n", "For a simple sentence like \"The cat sat on the mat\", a parse tree might look like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![The cat sat on the mat](https://raw.githubusercontent.com/AYLIEN/headline_analysis/master/parsetree.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to use the [Pattern library](http://www.clips.ua.ac.be/pages/pattern-en#tree) for Python to parse the headlines and create parse trees for them:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pattern.en import parsetree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see an example:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NP [(u'The', u'DT'), (u'cat', u'NN')]\n", "VP [(u'sat', u'VBD')]\n", "PP [(u'on', u'IN')]\n", "NP [(u'the', u'DT'), (u'mat', u'NN')]\n" ] } ], "source": [ "s = parsetree('The cat sat on the mat.')\n", "for sentence in s:\n", "    for chunk in sentence.chunks:\n", "        # Print the chunk type (NP, VP, PP, ...) and the part-of-speech tag of each word in it\n", "        print chunk.type, [(w.string, w.type) for w in chunk.words]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the data\n", "\n", "Let's load the pickled data file for the first author (Akin Oyedele), which contains 700 headlines, and look at an example headline:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{u'title': u\"One corner of the real-estate market might've peaked\"}" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import 
cPickle as pickle\n", "\n", "# Open the file in a with-block so that it is closed after loading\n", "with open(\"author1.p\", \"rb\") as f:\n", "    author1 = pickle.load(f)\n", "author1[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing the data\n", "\n", "Now that we have loaded all the headlines for the first author, we're going to create a parse tree for each headline and store its chunk types, together with some basic information about the headline, in the same object:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for story in author1:\n", "    # Number of characters in the headline\n", "    story[\"title_length\"] = len(story[\"title\"])\n", "    # Chunk types (NP, VP, PP, ...) of the first sentence of the headline\n", "    story[\"title_chunks\"] = [chunk.type for chunk in parsetree(story[\"title\"])[0].chunks]\n", "    # Number of chunks in the headline\n", "    story[\"title_chunks_length\"] = len(story[\"title_chunks\"])" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{u'title': u\"One corner of the real-estate market might've peaked\",\n", " 'title_chunks': [u'NP', u'PP', u'NP', u'VP'],\n", " 'title_chunks_length': 4,\n", " 'title_length': 52}" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "author1[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the numeric attributes of the headlines written by this author. We're going to use [Pandas](http://pandas.pydata.org/) for this." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df1 = pd.DataFrame.from_dict(author1)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

| | title_chunks_length | title_length |
|---|---|---|
| count | 700.000000 | 700.000000 |
| mean | 5.691429 | 57.730000 |
| std | 3.762884 | 28.035283 |
| min | 1.000000 | 9.000000 |
| 25% | 2.000000 | 35.000000 |
| 50% | 5.000000 | 53.000000 |
| 75% | 7.000000 | 77.000000 |
| max | 30.000000 | 188.000000 |
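
The summary above has the shape of the output of pandas' `DataFrame.describe()`, although the cell that produced it isn't preserved here. As a rough, dependency-free sketch of what a few of those rows mean, the snippet below computes the count, mean, sample standard deviation, minimum, median and maximum for a list of numbers. The `summarize` helper and the four `sample_stories` records are made up for illustration; they aren't part of the dataset.

```python
import math


def summarize(values):
    """Return count, mean, sample std, min, median and max for a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mean = float(sum(ordered)) / n
    # Sample variance (divides by n - 1), the same convention pandas uses for std.
    if n > 1:
        variance = sum((v - mean) ** 2 for v in ordered) / (n - 1)
    else:
        variance = float("nan")
    mid = n // 2
    # Median: the middle value, or the average of the two middle values.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2.0
    return {
        "count": n,
        "mean": mean,
        "std": math.sqrt(variance),
        "min": ordered[0],
        "50%": median,
        "max": ordered[-1],
    }


# Hypothetical records shaped like the entries of `author1`; the values are illustrative only.
sample_stories = [
    {"title_length": 52, "title_chunks_length": 4},
    {"title_length": 35, "title_chunks_length": 2},
    {"title_length": 77, "title_chunks_length": 7},
    {"title_length": 60, "title_chunks_length": 5},
]

# One summary per numeric attribute, keyed by column name.
summary = {
    column: summarize([story[column] for story in sample_stories])
    for column in ("title_length", "title_chunks_length")
}
```

Comparing the `std` row with the `mean` row is a quick way to see how much an author's headlines vary in length and structure.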

| | title_chunks_length | title_length |
|---|---|---|
| count | 700.000000 | 700.000000 |
| mean | 5.452857 | 62.532857 |
| std | 1.896252 | 9.996154 |
| min | 1.000000 | 35.000000 |
| 25% | 4.000000 | 57.000000 |
| 50% | 5.000000 | 62.000000 |
| 75% | 7.000000 | 68.000000 |
| max | 13.000000 | 96.000000 |
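
The Approach section mentions a simple sequence similarity metric for comparing pairs of headlines. The exact metric isn't shown in this part of the notebook, so here's one possible sketch that assumes Python's built-in `difflib.SequenceMatcher` as a stand-in, applied to chunk-type sequences like the ones stored in `title_chunks`. The names `chunk_similarity` and `pairwise_similarities` are hypothetical.

```python
from difflib import SequenceMatcher


def chunk_similarity(chunks_a, chunks_b):
    """Similarity in [0, 1] between two chunk-type sequences (1.0 means identical)."""
    # ratio() is 2 * M / T, where M is the number of matched elements and
    # T is the total number of elements in both sequences.
    return SequenceMatcher(None, chunks_a, chunks_b).ratio()


def pairwise_similarities(sequences):
    """Return an n x n list of lists with the similarity of every pair of sequences."""
    n = len(sequences)
    return [
        [chunk_similarity(sequences[i], sequences[j]) for j in range(n)]
        for i in range(n)
    ]


# Chunk types taken from the two examples shown earlier in the notebook.
headline_a = ["NP", "PP", "NP", "VP"]  # "One corner of the real-estate market might've peaked"
headline_b = ["NP", "VP", "PP", "NP"]  # "The cat sat on the mat."

score = chunk_similarity(headline_a, headline_b)  # 0.75 for these two sequences
matrix = pairwise_similarities([headline_a, headline_b])
```

A score of 1.0 means two headlines have identical chunk sequences, and 0.0 means none of their chunk types match. A matrix like `matrix`, computed over all headlines, is the kind of input that a clustering or t-SNE step could work from.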