{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Matteo Renzi mentions across Italian and Portuguese Wikipedia\n",
    "\n",
    "__*Remark:*__ This notebook contains interactive plots. Open [this link](https://nbviewer.jupyter.org/github/CriMenghini/Wikipedia/blob/master/Mention/Mention_example.ipynb) to view it correctly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Table of contents\n",
    "1. [Find articles](#parse)\n",
    "2. [Rank articles according to November pageviews](#rank)\n",
    "3. [Make comparisons](#comp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import plotly\n",
    "import pandas as pd  # used explicitly by the pd.merge calls below\n",
    "from pageviews import *\n",
    "from wiki_parser import *\n",
    "import plotly.tools as tls\n",
    "from helpers_parser import *\n",
    "from across_languages import *\n",
    "plotly.tools.set_credentials_file(username='crimenghini', api_key='***')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Find articles <a name=\"parse\"></a>\n",
    "The first goal is to find all the Italian and Portuguese Wikipedia articles that mention `Matteo Renzi`. To do so, we use the [WikiHandler](wiki_parser.py) class, which goes through the raw data and stores the `title` and the `text` of every article in the corpora that mentions the [Italian (almost) ex-Prime Minister](http://gph.is/29zqwgy).\n",
    "\n",
    "In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20th November 2016 and 1st December 2016 respectively. In general, the code can handle more than two languages. The [`README`](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) file describes how the data was collected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Define the path of the corpora\n",
    "path = '/Users/cristinamenghini/Downloads/'\n",
    "# Xml file\n",
    "xml_files = ['itwiki-20161120-pages-articles-multistream.xml', \n",
    "             'ptwiki-20161201-pages-articles-multistream.xml']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick peek at a snippet of the `XML` shows that the elements we are interested in sit under the child `page`, which identifies an article. From each page we want the contents of `title` and `text`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the `XML` is large, we opted for a parser that registers callbacks for events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed, since our analysis does not examine the text itself.\n",
    "\n",
    "We then parse the Italian corpus using the `parse_articles` function from the [`wiki_parser`](wiki_parser.py) library, which simply runs the parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Parse italian corpus\n",
    "parse_articles('ita', path + xml_files[0], 'Matteo Renzi')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we do the same for the Portuguese corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Parse portuguese corpus\n",
    "parse_articles('port', path + xml_files[1], 'Matteo Renzi')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The articles are filtered by whether they mention Matteo Renzi. Those in Italian are stored in a [`.json`](Corpus/wiki_ita_matteo_renzi.json) file in which each line corresponds to a page (`title`, `text`). The same holds for the [articles](Corpus/wiki_port_matteo_renzi.json) in Portuguese. Both corpora are automatically stored in the [`Corpus`](https://github.com/CriMenghini/Wikipedia/tree/master/Corpus) folder.\n",
    "\n",
    "                                  {\"title\": \"title_1\", \"text\": \"text_1\"}\n",
    "                                                        ...\n",
    "                                                        ...\n",
    "                                  {\"title\": \"title_n\", \"text\": \"text_n\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Rank articles according to November pageviews <a name=\"rank\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the data has been filtered, we proceed with a *simple* analysis of the pageviews. In particular, using the [`article_df_from_json`](pageviews.py) function, all the article titles are extracted from the corpus and then stored in a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get the df for the Italian articles\n",
    "df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')\n",
    "\n",
    "# Get the df for the Portuguese articles\n",
    "df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a look at the obtained `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>101</th>\n",
       "      <td>Centro-sinistra</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>Viadotto Italia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>223</th>\n",
       "      <td>TG5 Prima Pagina</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>403</th>\n",
       "      <td>Carcere di Santo Stefano</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>498</th>\n",
       "      <td>Referendum costituzionale del 2016 in Italia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            Title\n",
       "101                               Centro-sinistra\n",
       "413                               Viadotto Italia\n",
       "223                              TG5 Prima Pagina\n",
       "403                      Carcere di Santo Stefano\n",
       "498  Referendum costituzionale del 2016 in Italia"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_it_titles.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Next, we extract the number of monthly page views for each article in the languages of interest (i.e. `it` and `pt`) from the *page views* file (see [Additional data](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) in the `README`). To filter the file we use the [`filter_pageviews_file`](pageviews.py) function, which returns a dictionary of dictionaries with the following structure (for our example):\n",
    "\n",
    "                                {'it':{'Title_1':'Number of pageviews',\n",
    "                                               ...\n",
    "                                       'Title_n':'Number of pageviews'},\n",
    "                                 'pt':{'Title_1':'Number of pageviews',\n",
    "                                               ...\n",
    "                                       'Title_k':'Number of pageviews'}}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Page views file\n",
    "pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'\n",
    "\n",
    "# Filter the page view file\n",
    "articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, a right join is performed between the two `DataFrames`: the one obtained from the pageviews and the one obtained from the corpus. The result shows that, for both Italian and Portuguese, some articles that mention Matteo Renzi have no recorded page views in November. Since the name of the page views file (`views-ge-5`) suggests that it only lists pages with at least 5 views, these articles may simply have been viewed fewer than 5 times. The `define_ranked_df` function is stored in the [`pageviews`](pageviews.py) library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Over the whole number of articles in the corpus  39  have not been visited during the considered period.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Pageviews</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>409</th>\n",
       "      <td>Marco Travaglio</td>\n",
       "      <td>19795.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>Pif (conduttore televisivo)</td>\n",
       "      <td>19557.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>474</th>\n",
       "      <td>Partito Democratico (Italia)</td>\n",
       "      <td>11653.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>154</th>\n",
       "      <td>Vittorio Sgarbi</td>\n",
       "      <td>11324.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>Malala Yousafzai</td>\n",
       "      <td>9452.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>433</th>\n",
       "      <td>Jobs Act</td>\n",
       "      <td>8908.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>Enrico Letta</td>\n",
       "      <td>7791.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>343</th>\n",
       "      <td>Startup (economia)</td>\n",
       "      <td>7698.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>312</th>\n",
       "      <td>Marianna Madia</td>\n",
       "      <td>7608.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>155</th>\n",
       "      <td>Nuovo Centro Congressi</td>\n",
       "      <td>6894.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            Title  Pageviews\n",
       "409               Marco Travaglio    19795.0\n",
       "146   Pif (conduttore televisivo)    19557.0\n",
       "474  Partito Democratico (Italia)    11653.0\n",
       "154               Vittorio Sgarbi    11324.0\n",
       "226              Malala Yousafzai     9452.0\n",
       "433                      Jobs Act     8908.0\n",
       "274                  Enrico Letta     7791.0\n",
       "343            Startup (economia)     7698.0\n",
       "312                Marianna Madia     7608.0\n",
       "155        Nuovo Centro Congressi     6894.0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define the italian ranked article df according to the number of page views\n",
    "ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)\n",
    "# Show the df head\n",
    "ranked_df_ita.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Over the whole number of articles in the corpus  4  have not been visited during the considered period.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Pageviews</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Partido Democrático (Itália)</td>\n",
       "      <td>567.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Lista de chefes de Estado e de governo atuais</td>\n",
       "      <td>410.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>G7</td>\n",
       "      <td>259.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Federica Mogherini</td>\n",
       "      <td>215.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>G20</td>\n",
       "      <td>185.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Centro-esquerda</td>\n",
       "      <td>141.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Privatização</td>\n",
       "      <td>93.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Lista de chefes de Estado e de governo por dat...</td>\n",
       "      <td>83.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>9.ª reunião de cúpula do G20</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>10.ª reunião de cúpula do G20</td>\n",
       "      <td>70.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                Title  Pageviews\n",
       "33                       Partido Democrático (Itália)      567.0\n",
       "31      Lista de chefes de Estado e de governo atuais      410.0\n",
       "4                                                  G7      259.0\n",
       "19                                 Federica Mogherini      215.0\n",
       "7                                                 G20      185.0\n",
       "23                                    Centro-esquerda      141.0\n",
       "8                                        Privatização       93.0\n",
       "15  Lista de chefes de Estado e de governo por dat...       83.0\n",
       "21                       9.ª reunião de cúpula do G20       81.0\n",
       "17                      10.ª reunião de cúpula do G20       70.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define the Portuguese ranked article df according to the number of page views\n",
    "ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)\n",
    "# Show the df\n",
    "ranked_df_port.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick glance at the two top 10 rankings shows that:\n",
    "* The number of page views for the Italian articles that mention Matteo Renzi is considerably higher than for those written in Portuguese.\n",
    "* The only article present in both top 10 rankings is `Partito Democratico (Italia)` (`Partido Democrático (Itália)` in Portuguese). \n",
    "* The pages seem to differ in content: the Portuguese ones mostly concern international politics, whereas the Italian ones cover politics, journalists and public figures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Make comparisons <a name = 'comp'></a>\n",
    "\n",
    "We now explore the preprocessed data further, looking for something interesting.\n",
    "* We take a look at the number of mentions in each article. In this context, Matteo Renzi may receive more than one mention just because of references. For instance on [this](https://it.wikipedia.org/wiki/Francesco_Guccini) page, if you search for Matteo Renzi you will find 2 mentions, but one of them is only a reference attached to the other. For the moment we do not address this issue.\n",
    "\n",
    "The `DataFrames` below, obtained using the `article_mentions` function in [this](across_languages.py) library, show the number of mentions that Matteo Renzi receives in each article of the Italian and Portuguese corpora. They are sorted by the number of mentions, so that the pages where Matteo Renzi is most \"popular\" come first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Number of mentions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>Matteo Renzi</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>424</th>\n",
       "      <td>Governo Renzi</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>Partito Democratico (Italia)</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>492</th>\n",
       "      <td>Riforma costituzionale Renzi-Boschi</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>214</th>\n",
       "      <td>Storia del Partito Democratico (Italia)</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                       Title  Number of mentions\n",
       "37                              Matteo Renzi                  62\n",
       "424                            Governo Renzi                  30\n",
       "195             Partito Democratico (Italia)                  17\n",
       "492      Riforma costituzionale Renzi-Boschi                  12\n",
       "214  Storia del Partito Democratico (Italia)                  12"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Italian df of mentions per page\n",
    "df_it_mentions = article_mentions('Corpus/wiki_ita_Matteo_Renzi.json', 'Matteo Renzi')\n",
    "\n",
    "# Sort the df by the number of mentions and see the top 5\n",
    "df_it_mentions = df_it_mentions.sort_values('Number of mentions', ascending = False)\n",
    "\n",
    "# Show results\n",
    "df_it_mentions.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Number of mentions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Matteo Renzi</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Partido Democrático (Itália)</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Itália</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Lista de primeiros-ministros da Itália</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Lista de viagens presidenciais de Dilma Rousseff</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               Title  Number of mentions\n",
       "20                                      Matteo Renzi                  11\n",
       "7                       Partido Democrático (Itália)                   6\n",
       "9                                             Itália                   3\n",
       "0             Lista de primeiros-ministros da Itália                   2\n",
       "16  Lista de viagens presidenciais de Dilma Rousseff                   2"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Portuguese df of mentions per page\n",
    "df_pt_mentions = article_mentions('Corpus/wiki_port_Matteo_Renzi.json', 'Matteo Renzi')\n",
    "\n",
    "# Sort the df by the number of mentions and see the top 5\n",
    "df_pt_mentions = df_pt_mentions.sort_values('Number of mentions', ascending = False)\n",
    "\n",
    "# Show results\n",
    "df_pt_mentions.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Comparing the two `DataFrames`, we immediately notice that the maximum number of mentions of Matteo Renzi differs greatly between the Italian and Portuguese articles (62 versus 11). In the Portuguese corpus only two articles have more than 5 mentions. It is therefore interesting to visualize the distribution of the mentions for both the IT and PT corpora."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distributions are shown as boxplots. For both languages, 75% of the articles contain no more than 3 mentions of the Italian premier. In the Portuguese corpus two outliers stand out, `Matteo Renzi 11 mentions` and `Partido Democrático (Itália) 6 mentions`, whereas the Italian corpus has more outliers and its maximum is `Matteo Renzi 62 mentions`. Moreover, zooming in on the boxes, we observe that both distributions are concentrated at the low end (number of mentions equal to 1) with a long tail toward higher counts, i.e. they are right-skewed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~crimenghini/20.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#boxplot_mentions(df_pt_mentions, df_it_mentions, 'PT', 'IT', 'Number of mentions')\n",
    "tls.embed(\"https://plot.ly/~crimenghini/20\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this direction, one aspect worth considering is the following: \n",
    "> Define how important Matteo Renzi is in the articles that mention him. This requires defining the concept of *importance*. Intuitively, the higher the number of mentions, the greater the importance of our object in the article. Moreover, it may be useful to weight the number of mentions by the number of words in the article. \n",
    "$$I_{string} = \\frac{M}{|D|}$$\n",
    "where $I_{string}$ is the importance of the string, *M* is the number of mentions and $|D|$ is the number of words in the document. In this way, if an article cites Renzi only once but consists of just a few lines, the string of interest turns out to be more significant.\n",
    "\n",
    "Another aspect should also be considered, especially when there is only one mention: \n",
    "* The string (i.e. Matteo Renzi) may be a pointer to its main page (i.e. Matteo Renzi -> [Matteo Renzi](https://it.wikipedia.org/wiki/Matteo_Renzi)). If the pointer is present, we can assume that the figure is more *important* than on a page where there is no hyperlink."
   ]
  },
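  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the length-weighted importance defined above, consider two purely hypothetical articles (the counts are invented for illustration and are not taken from the corpora): article A mentions Matteo Renzi 3 times in 1500 words, while article B mentions him once in 200 words.\n",
    "\n",
    "$$I_A = \\frac{3}{1500} = 0.002, \\qquad I_B = \\frac{1}{200} = 0.005$$\n",
    "\n",
    "Under this definition the shorter article B gets the higher score even though it has fewer raw mentions, which is the behaviour the weighting is meant to capture."
   ]
  },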
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another thing that can be visualized is the relationship between the `Number of mentions` and the `Pageviews`. To do that, we first merge the pageviews and mentions `DataFrames`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Number of mentions</th>\n",
       "      <th>Pageviews</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>255</th>\n",
       "      <td>Faccia a faccia (programma televisivo)</td>\n",
       "      <td>1</td>\n",
       "      <td>483.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>457</th>\n",
       "      <td>Fausto Brizzi</td>\n",
       "      <td>1</td>\n",
       "      <td>2161.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Ivan Scalfarotto</td>\n",
       "      <td>4</td>\n",
       "      <td>1077.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Elezioni amministrative italiane del 2009</td>\n",
       "      <td>4</td>\n",
       "      <td>723.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>472</th>\n",
       "      <td>Anonymous</td>\n",
       "      <td>1</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         Title  Number of mentions  Pageviews\n",
       "255     Faccia a faccia (programma televisivo)                   1      483.0\n",
       "457                              Fausto Brizzi                   1     2161.0\n",
       "20                            Ivan Scalfarotto                   4     1077.0\n",
       "32   Elezioni amministrative italiane del 2009                   4      723.0\n",
       "472                                  Anonymous                   1        6.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Merge pageviews and mentions DataFrames for IT\n",
    "df_it_mension_pageview = pd.merge(df_it_mentions, ranked_df_ita, on=['Title'])\n",
    "\n",
    "# Show it\n",
    "df_it_mension_pageview.sample(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Number of mentions</th>\n",
       "      <th>Pageviews</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>42.ª reunião de cúpula do G7</td>\n",
       "      <td>2</td>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Maria Elena Boschi</td>\n",
       "      <td>2</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Lista de primeiros-ministros da Itália</td>\n",
       "      <td>2</td>\n",
       "      <td>12.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>G20</td>\n",
       "      <td>1</td>\n",
       "      <td>185.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Lista de líderes do G20</td>\n",
       "      <td>1</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     Title  Number of mentions  Pageviews\n",
       "14            42.ª reunião de cúpula do G7                   2       33.0\n",
       "12                      Maria Elena Boschi                   2        5.0\n",
       "3   Lista de primeiros-ministros da Itália                   2       12.0\n",
       "19                                     G20                   1      185.0\n",
       "20                 Lista de líderes do G20                   1        8.0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Merge pageviews and mentions DataFrames for PT\n",
    "df_pt_mension_pageview = pd.merge(df_pt_mentions, ranked_df_port, on=['Title'])\n",
    "\n",
    "# Show it\n",
    "df_pt_mension_pageview.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "A scatterplot shows where each article falls with respect to these two variables. The plot shows:\n",
    "* __IT__: when the number of mentions is 1, the number of page views ranges between 0 and ~20k. As the number of mentions increases, the page views fall within a narrower range. \n",
    "* __PT__: the same pattern is observed for the Portuguese articles. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~crimenghini/36.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# scatter_plot(df_it_mension_pageview, df_pt_mension_pageview, 'Number of mentions', 'Pageviews', 'Italian', 'Portuguese')\n",
    "tls.embed('https://plot.ly/~crimenghini/36')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regarding these two features, another direction worth exploring is the following:\n",
    "> Consider how the number of pageviews of an article changes when the number of Matteo Renzi citations increases from one revision to the next. In particular, the *importance* (I) is redefined as: \n",
    "$$I = \\sum_{t = 1}^{T} \\frac{(p_t-p_{t-1}) \\times m_t}{|D_t|}$$\n",
    "where *t* indexes the sequential revisions of the article, $p_t$ is the number of page views at revision *t*, $m_t$ is the number of mentions and $|D_t|$ is the number of words in the document at that revision."
   ]
  },
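  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the formula above, the following sketch computes $I$ for a short revision history. It is only a sketch under stated assumptions: the function name `importance` and the toy numbers are hypothetical, and since $|D_t|$ is not defined further in this notebook, it is treated here as a positive normalising value supplied by the caller for each revision.\n",
    "\n",
    "```python\n",
    "def importance(pageviews, mentions, norms):\n",
    "    # pageviews holds p_0 ... p_T; mentions holds m_1 ... m_T; norms holds |D_1| ... |D_T|\n",
    "    if not (len(pageviews) - 1 == len(mentions) == len(norms)):\n",
    "        raise ValueError('expected T+1 pageview values and T mention/norm values')\n",
    "    total = 0.0\n",
    "    for t in range(1, len(pageviews)):\n",
    "        delta = pageviews[t] - pageviews[t - 1]          # p_t - p_{t-1}\n",
    "        total += delta * mentions[t - 1] / norms[t - 1]  # m_t and |D_t| (assumed normaliser)\n",
    "    return total\n",
    "\n",
    "\n",
    "# Toy data (hypothetical): an initial state followed by two revisions\n",
    "score = importance(pageviews=[100, 150, 130], mentions=[1, 3], norms=[10, 10])\n",
    "print(score)\n",
    "```\n",
    "\n",
    "With these toy values the first revision contributes $50 \\times 1 / 10 = 5$ and the second $-20 \\times 3 / 10 = -6$, so a drop in page views after a mention-heavy revision pulls the score down."
   ]
  },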
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus we proceed to look for the presence of same articles (in different languages) that mention Matteo Renzi. To do so we make a request for each Portugues *Wikipedia* page (that cites Renzi) than we parse the `HTML` source to extract - where available- the title of the IT article related to that the request has been sent. Precisely, the requests are sent for each title of the language that has less article that match Matteo Renzi. The function `get_matches` is stored in [this](across_languages) library. "
   ]
  },
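  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parsing step described above could look roughly like the sketch below. This is not the code of `get_matches`: the class `InterlanguageLinkParser`, the helper `linked_title` and the sample `HTML` are hypothetical, and the sketch assumes that language links carry an `hreflang` attribute and a `/wiki/<title>` path. It works on an `HTML` string only, so no request is sent.\n",
    "\n",
    "```python\n",
    "from html.parser import HTMLParser\n",
    "from urllib.parse import unquote, urlparse\n",
    "\n",
    "\n",
    "class InterlanguageLinkParser(HTMLParser):\n",
    "    # Keeps the title of the first link whose hreflang matches target_lang\n",
    "\n",
    "    def __init__(self, target_lang):\n",
    "        super().__init__()\n",
    "        self.target_lang = target_lang\n",
    "        self.title = None\n",
    "\n",
    "    def handle_starttag(self, tag, attrs):\n",
    "        attributes = dict(attrs)\n",
    "        if tag == 'a' and self.title is None and attributes.get('hreflang') == self.target_lang:\n",
    "            path = urlparse(attributes.get('href') or '').path\n",
    "            # Assumed URL layout: '/wiki/Some_Title' -> 'Some Title'\n",
    "            self.title = unquote(path.split('/wiki/', 1)[-1]).replace('_', ' ')\n",
    "\n",
    "\n",
    "def linked_title(html, target_lang):\n",
    "    # Returns the linked title in target_lang, or None when no such link exists\n",
    "    parser = InterlanguageLinkParser(target_lang)\n",
    "    parser.feed(html)\n",
    "    return parser.title\n",
    "\n",
    "\n",
    "# Hypothetical snippet modelled on a language link of a PT page\n",
    "sample_html = '<li><a href=\"https://it.wikipedia.org/wiki/Partito_Democratico_(Italia)\" hreflang=\"it\">Italiano</a></li>'\n",
    "print(linked_title(sample_html, 'it'))\n",
    "print(linked_title(sample_html, 'fr'))\n",
    "```\n",
    "\n",
    "Returning `None` when no link is found mirrors the \"where available\" caveat: PT articles without an IT counterpart simply produce no match."
   ]
  },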
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Built the common articles matches\n",
    "dict_italian = get_matches(df_pt_titles, 'it')\n",
    "# Create the inverted one\n",
    "inverted_dict = {v : k for k, v in dict_italian.items()}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are:  31 . The number of PT articles that have not been matched is:  11 .\n"
     ]
    }
   ],
   "source": [
    "print ('The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are: ', len(dict_italian), \n",
    "       '. The number of PT articles that have not been matched is: ', len(df_pt_titles)-len(dict_italian), '.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Proceed to create a `DataFrame` that contains the information related to those articles.\n",
    "\n",
    "* We extract the titles of all involved articles (both IT and PT)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# From the dictionary get the titles of both languages\n",
    "italian_titles = list(dict_italian.values())\n",
    "portugues_titles = list(dict_italian.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before gooing further, we check whether all the matched IT articles mention Matteo Renzi. In order to do so, we run a query on the `DataFrame` that stores all the IT articles that cite Renzi."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are  10 IT articles that do not mention Matteo Renzi.\n"
     ]
    }
   ],
   "source": [
    "# Run the query\n",
    "match_with_mention = df_it_titles.query('Title in @italian_titles')\n",
    "\n",
    "# Get the number\n",
    "print ('There are ', len(portugues_titles)-len(match_with_mention), 'IT articles that do not mention Matteo Renzi.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Re-define the list of IT articles according to the aforementioned \"issue\"\n",
    "it_titles_with_mention = list(match_with_mention.Title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The dictionaries that match the PT and IT titles are re-defined taking into account the fact that some IT do not mention Renzi."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Re-define the two dictionaries \n",
    "dict_italian_mentions = {k:v for k,v in dict_italian.items() if v in it_titles_with_mention}\n",
    "# Define the inverted\n",
    "inverted_dict_italian_mentions = {v : k for k, v in dict_italian.items()}\n",
    "\n",
    "# Create the list of titles for PT articles according to the IT that don't mention Renzi\n",
    "pt_titles_with_mention = list(dict_italian_mentions.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we create a unique `DataFrame` which contains the mentions in IT an PT articles for the tuple of articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create df for IT mentions\n",
    "df_match_it_mentions = df_it_mentions.query('Title in @it_titles_with_mention').sort_values('Number of mentions', ascending = False)\n",
    "\n",
    "# Create df for PT mentions\n",
    "df_match_pt_mentions = df_pt_mentions.query('Title in @pt_titles_with_mention').sort_values('Number of mentions', ascending = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Add a column containing the matches to join the two dfs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create new column\n",
    "new_column_it = ['/'.join([k]+[v]) for i in df_match_it_mentions.Title for k,v in dict_italian_mentions.items()  if i == v]\n",
    "new_column_pt = ['/'.join([k]+[v]) for i in df_match_pt_mentions.Title for k,v in dict_italian_mentions.items()  if i == k]\n",
    "\n",
    "# Add the new column to the two dataframes\n",
    "df_match_it_mentions['Matches'] = new_column_it\n",
    "df_match_pt_mentions['Matches'] = new_column_pt"
   ]
  },
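  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The join key built in the previous cell is found by scanning the dictionary for every title, so the result relies on the rows and the dictionary lining up one to one. A lookup-based alternative is sketched below. It is only an illustration: the names `dict_matches`, `key_by_pt` and `key_by_it` are hypothetical, and the two title pairs are toy data taken from the tables shown in this notebook.\n",
    "\n",
    "```python\n",
    "# Toy PT -> IT mapping (hypothetical stand-in for the matching dictionary)\n",
    "dict_matches = {\n",
    "    'Partido Democrático (Itália)': 'Partito Democratico (Italia)',\n",
    "    'G20': 'G20 (paesi industrializzati)',\n",
    "}\n",
    "\n",
    "# One lookup table per language: title -> 'PT title/IT title'\n",
    "key_by_pt = {pt: '/'.join([pt, it]) for pt, it in dict_matches.items()}\n",
    "key_by_it = {it: '/'.join([pt, it]) for pt, it in dict_matches.items()}\n",
    "\n",
    "# Titles in an arbitrary row order, as they might appear in each DataFrame\n",
    "pt_titles = ['G20', 'Partido Democrático (Itália)']\n",
    "it_titles = ['Partito Democratico (Italia)', 'G20 (paesi industrializzati)']\n",
    "\n",
    "# Each title gets its key directly, independently of the row order\n",
    "pt_keys = [key_by_pt[title] for title in pt_titles]\n",
    "it_keys = [key_by_it[title] for title in it_titles]\n",
    "print(sorted(pt_keys) == sorted(it_keys))\n",
    "```\n",
    "\n",
    "With `pandas`, the same lookup tables could be applied to a title column through `Series.map`, which keeps every key aligned with its own row."
   ]
  },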
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Perform the join on the `Matches` and plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title_IT</th>\n",
       "      <th>Number of mentions_IT</th>\n",
       "      <th>Matches</th>\n",
       "      <th>Title_PT</th>\n",
       "      <th>Number of mentions_PT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Matteo Renzi</td>\n",
       "      <td>62</td>\n",
       "      <td>Matteo Renzi/Matteo Renzi</td>\n",
       "      <td>Matteo Renzi</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Partito Democratico (Italia)</td>\n",
       "      <td>17</td>\n",
       "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n",
       "      <td>Partido Democrático (Itália)</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Maria Elena Boschi</td>\n",
       "      <td>6</td>\n",
       "      <td>Maria Elena Boschi/Maria Elena Boschi</td>\n",
       "      <td>Maria Elena Boschi</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Federica Mogherini</td>\n",
       "      <td>4</td>\n",
       "      <td>Federica Mogherini/Federica Mogherini</td>\n",
       "      <td>Federica Mogherini</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Presidenti del Consiglio dei ministri della Re...</td>\n",
       "      <td>3</td>\n",
       "      <td>Lista de primeiros-ministros da Itália/Preside...</td>\n",
       "      <td>Lista de primeiros-ministros da Itália</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            Title_IT  Number of mentions_IT  \\\n",
       "0                                       Matteo Renzi                     62   \n",
       "1                       Partito Democratico (Italia)                     17   \n",
       "2                                 Maria Elena Boschi                      6   \n",
       "3                                 Federica Mogherini                      4   \n",
       "4  Presidenti del Consiglio dei ministri della Re...                      3   \n",
       "\n",
       "                                             Matches  \\\n",
       "0                          Matteo Renzi/Matteo Renzi   \n",
       "1  Partido Democrático (Itália)/Partito Democrati...   \n",
       "2              Maria Elena Boschi/Maria Elena Boschi   \n",
       "3              Federica Mogherini/Federica Mogherini   \n",
       "4  Lista de primeiros-ministros da Itália/Preside...   \n",
       "\n",
       "                                 Title_PT  Number of mentions_PT  \n",
       "0                            Matteo Renzi                     11  \n",
       "1            Partido Democrático (Itália)                      6  \n",
       "2                      Maria Elena Boschi                      2  \n",
       "3                      Federica Mogherini                      1  \n",
       "4  Lista de primeiros-ministros da Itália                      2  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Join the two dfs on the correspondence tuples\n",
    "matches_mention = pd.merge(df_match_it_mentions, df_match_pt_mentions, on = 'Matches', suffixes = ('_IT','_PT'))\n",
    "\n",
    "# Show result\n",
    "matches_mention.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~crimenghini/38.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# bar_plot(df, 'Matches', 'Number of mentions_IT', 'Number of mentions_P', 'IT', 'PT', 'Compare IT and PT mentions', \n",
    "# 'Article','No. mentions', 'color-bar-prova')\n",
    "tls.embed('https://plot.ly/~crimenghini/38')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the plot:\n",
    "* Among this group of articles, the two that mention Matteo Renzi more result to be the same.\n",
    "\n",
    "From this kind of analysis a question one can think about is the following:\n",
    "\n",
    "> Given articles in different languages that correspond one to each other, if we are interested in measuring the proximity of these articles, an element that may be considered is the number of common mentions. It is likely that the necessity of quoting s.o./s.t. derives from the fact that the two articles are talking about the same topics that need to refer to the same thing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same procedure is repeated for the page views.\n",
    "\n",
    "* Check whether some articles have not been visited."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are  12 IT articles that have not been visited.\n",
      "There are  3 PT articles that have not been visited.\n"
     ]
    }
   ],
   "source": [
    "# Run the query\n",
    "match_with_pageviews_it = ranked_df_ita.query('Title in @italian_titles')\n",
    "match_with_pageviews_pt = ranked_df_port.query('Title in @portugues_titles')\n",
    "# Get the number\n",
    "print ('There are ', len(portugues_titles)-len(match_with_pageviews_it), 'IT articles that have not been visited.')\n",
    "print ('There are ', len(portugues_titles)-len(match_with_pageviews_pt), 'PT articles that have not been visited.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Define list of articles that have been visualized\n",
    "it_titles_with_pageviews = list(match_with_pageviews_it.Title)\n",
    "pt_titles_with_pageviews = list(match_with_pageviews_pt.Title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Define the matching dictionaries according to what said above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Re-define the two dictionaries according to this evidence\n",
    "dict_italian_pageviews = {k:v for k,v in dict_italian.items() if v in it_titles_with_pageviews}\n",
    "\n",
    "# PT \n",
    "dict_pt_pageviews = {v : k for k, v in dict_italian.items() if k in pt_titles_with_pageviews}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create df for IT mentions\n",
    "df_match_it_pageviews = ranked_df_ita.query('Title in @it_titles_with_pageviews').sort_values('Pageviews', ascending = False)\n",
    "\n",
    "# Create df for PT mentions\n",
    "df_match_pt_pageviews = ranked_df_port.query('Title in @pt_titles_with_pageviews').sort_values('Pageviews', ascending = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Add new variable to allow the join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create new column\n",
    "new_column_it = ['/'.join([k]+[v]) for i in df_match_it_pageviews.Title for k,v in dict_italian_pageviews.items()  if i == v]\n",
    "new_column_pt = ['/'.join([v]+[k]) for i in df_match_pt_pageviews.Title for k,v in dict_pt_pageviews.items()  if i == v]\n",
    "\n",
    "# Add the new column to the two dataframes\n",
    "df_match_it_pageviews['Matches'] = new_column_it\n",
    "df_match_pt_pageviews['Matches'] = new_column_pt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Pageviews</th>\n",
       "      <th>Matches</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>474</th>\n",
       "      <td>Partito Democratico (Italia)</td>\n",
       "      <td>11653.0</td>\n",
       "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>Enrico Letta</td>\n",
       "      <td>7791.0</td>\n",
       "      <td>Enrico Letta/Enrico Letta</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>312</th>\n",
       "      <td>Marianna Madia</td>\n",
       "      <td>7608.0</td>\n",
       "      <td>Marianna Madia/Marianna Madia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>442</th>\n",
       "      <td>G20 (paesi industrializzati)</td>\n",
       "      <td>2545.0</td>\n",
       "      <td>G20/G20 (paesi industrializzati)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>318</th>\n",
       "      <td>Giuliano Poletti</td>\n",
       "      <td>2021.0</td>\n",
       "      <td>Giuliano Poletti/Giuliano Poletti</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            Title  Pageviews  \\\n",
       "474  Partito Democratico (Italia)    11653.0   \n",
       "274                  Enrico Letta     7791.0   \n",
       "312                Marianna Madia     7608.0   \n",
       "442  G20 (paesi industrializzati)     2545.0   \n",
       "318              Giuliano Poletti     2021.0   \n",
       "\n",
       "                                               Matches  \n",
       "474  Partido Democrático (Itália)/Partito Democrati...  \n",
       "274                          Enrico Letta/Enrico Letta  \n",
       "312                      Marianna Madia/Marianna Madia  \n",
       "442                   G20/G20 (paesi industrializzati)  \n",
       "318                  Giuliano Poletti/Giuliano Poletti  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_match_it_pageviews.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Pageviews</th>\n",
       "      <th>Matches</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Partido Democrático (Itália)</td>\n",
       "      <td>567.0</td>\n",
       "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Lista de chefes de Estado e de governo atuais</td>\n",
       "      <td>410.0</td>\n",
       "      <td>Lista de chefes de Estado e de governo atuais/...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>G7</td>\n",
       "      <td>259.0</td>\n",
       "      <td>G7/G7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Federica Mogherini</td>\n",
       "      <td>215.0</td>\n",
       "      <td>Federica Mogherini/Federica Mogherini</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>G20</td>\n",
       "      <td>185.0</td>\n",
       "      <td>G20/G20 (paesi industrializzati)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            Title  Pageviews  \\\n",
       "33                   Partido Democrático (Itália)      567.0   \n",
       "31  Lista de chefes de Estado e de governo atuais      410.0   \n",
       "4                                              G7      259.0   \n",
       "19                             Federica Mogherini      215.0   \n",
       "7                                             G20      185.0   \n",
       "\n",
       "                                              Matches  \n",
       "33  Partido Democrático (Itália)/Partito Democrati...  \n",
       "31  Lista de chefes de Estado e de governo atuais/...  \n",
       "4                                               G7/G7  \n",
       "19              Federica Mogherini/Federica Mogherini  \n",
       "7                    G20/G20 (paesi industrializzati)  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_match_pt_pageviews.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Join the two `DatFrames` with a right join, so that we see also the PT articles that have not been visualised in IT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title_IT</th>\n",
       "      <th>Pageviews_IT</th>\n",
       "      <th>Matches</th>\n",
       "      <th>Title_PT</th>\n",
       "      <th>Pageviews_PT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Partito Democratico (Italia)</td>\n",
       "      <td>11653.0</td>\n",
       "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n",
       "      <td>Partido Democrático (Itália)</td>\n",
       "      <td>567.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Enrico Letta</td>\n",
       "      <td>7791.0</td>\n",
       "      <td>Enrico Letta/Enrico Letta</td>\n",
       "      <td>Enrico Letta</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Marianna Madia</td>\n",
       "      <td>7608.0</td>\n",
       "      <td>Marianna Madia/Marianna Madia</td>\n",
       "      <td>Marianna Madia</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>G20 (paesi industrializzati)</td>\n",
       "      <td>2545.0</td>\n",
       "      <td>G20/G20 (paesi industrializzati)</td>\n",
       "      <td>G20</td>\n",
       "      <td>185.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Giuliano Poletti</td>\n",
       "      <td>2021.0</td>\n",
       "      <td>Giuliano Poletti/Giuliano Poletti</td>\n",
       "      <td>Giuliano Poletti</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       Title_IT  Pageviews_IT  \\\n",
       "0  Partito Democratico (Italia)       11653.0   \n",
       "1                  Enrico Letta        7791.0   \n",
       "2                Marianna Madia        7608.0   \n",
       "3  G20 (paesi industrializzati)        2545.0   \n",
       "4              Giuliano Poletti        2021.0   \n",
       "\n",
       "                                             Matches  \\\n",
       "0  Partido Democrático (Itália)/Partito Democrati...   \n",
       "1                          Enrico Letta/Enrico Letta   \n",
       "2                      Marianna Madia/Marianna Madia   \n",
       "3                   G20/G20 (paesi industrializzati)   \n",
       "4                  Giuliano Poletti/Giuliano Poletti   \n",
       "\n",
       "                       Title_PT  Pageviews_PT  \n",
       "0  Partido Democrático (Itália)         567.0  \n",
       "1                  Enrico Letta           9.0  \n",
       "2                Marianna Madia           6.0  \n",
       "3                           G20         185.0  \n",
       "4              Giuliano Poletti           7.0  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Join the two dfs on the correspondence tuples\n",
    "matches_pageviews = pd.merge(df_match_it_pageviews, df_match_pt_pageviews, how = 'right',on = 'Matches', suffixes = ('_IT','_PT'))\n",
    "matches_pageviews.fillna(0, inplace =True)\n",
    "# Show result\n",
    "matches_pageviews.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use a bar plot to visualize the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https://plot.ly/~crimenghini/40.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# bar_plot(df, 'Matches', 'Pageviews_IT', 'Pageviews_PT', 'IT', 'PT', 'Compare IT and PT pageviews', 'Article',\n",
    "# 'No. pageviews', 'color-bar-pvs')\n",
    "tls.embed('https://plot.ly/~crimenghini/40')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "From the plot:\n",
    "* The page with the highest visits are the same.\n",
    "* In general, it seems that the PT pages that mention Matteo Renzi are related to general topic and politic figures on the international stage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can be interesting:\n",
    "\n",
    "> To present the same plot using the relative frequencies of the visit to see the importance of the page respect the list of articles (that mention Renzi) in that language. "
   ]
  },
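  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The relative-frequency view suggested above can be sketched as follows. The helper `relative_frequencies` is hypothetical, and the input uses only the five PT rows displayed by `df_match_pt_pageviews.head()`, so the shares are relative to those five pages and not to the full list of articles.\n",
    "\n",
    "```python\n",
    "def relative_frequencies(pageviews_by_title):\n",
    "    # Share of the total page views that goes to each title\n",
    "    total = sum(pageviews_by_title.values())\n",
    "    if total == 0:\n",
    "        return {title: 0.0 for title in pageviews_by_title}\n",
    "    return {title: views / total for title, views in pageviews_by_title.items()}\n",
    "\n",
    "\n",
    "# PT page views of the five rows shown earlier (illustrative subset only)\n",
    "pt_pageviews = {\n",
    "    'Partido Democrático (Itália)': 567,\n",
    "    'Lista de chefes de Estado e de governo atuais': 410,\n",
    "    'G7': 259,\n",
    "    'Federica Mogherini': 215,\n",
    "    'G20': 185,\n",
    "}\n",
    "\n",
    "pt_shares = relative_frequencies(pt_pageviews)\n",
    "for title, share in sorted(pt_shares.items(), key=lambda item: item[1], reverse=True):\n",
    "    print(title, round(share, 3))\n",
    "```\n",
    "\n",
    "Because the shares of each language sum to one, IT and PT bars become comparable even though the absolute page views differ by orders of magnitude."
   ]
  },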
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 To study the">
    "> It would also be interesting to study the relationships between the articles that mention Renzi, in particular whether they are connected and point to each other. This could be used to define the importance of Matteo Renzi in an article. For example, if Matteo Renzi is mentioned on the page of a TV show just because he has been a guest, and that page is not connected to the other articles, it is reasonable to assume that Renzi is not the main topic of the article. I'm not totally sure this can be done, since moving from one article to another (even if they cover extremely different topics) does not require many hops."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}