{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Matteo Renzi mentions across Italian and Portuguese Wikipedia\n", "\n", "__*Remark:*__ This notebook contains interactive plots; open [this](https://nbviewer.jupyter.org/github/CriMenghini/Wikipedia/blob/master/Mention/Mention_example.ipynb) link to view the `Notebook` correctly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Table of contents\n", "1. [Find articles](#parse)\n", "2. [Rank articles according to November pageviews](#rank)\n", "3. [Make comparisons](#comp)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import plotly\n", "from pageviews import *\n", "from wiki_parser import *\n", "import plotly.tools as tls\n", "from helpers_parser import *\n", "from across_languages import *\n", "plotly.tools.set_credentials_file(username='crimenghini', api_key='***')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Find articles \n", "The first goal is to find all the Italian and Portuguese Wikipedia articles that mention `Matteo Renzi`. To do so, we use the [WikiHandler](wiki_parser.py) class, which goes through the raw data and stores the `title` and the `text` of the elements of the corpora that mention the [Italian (almost) ex-Prime Minister](http://gph.is/29zqwgy).\n", "\n", "In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20th November 2016 and 1st December 2016 respectively. In general, the code allows you to take into account more than two languages. In the [`README`](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) file, you can find the information related to the collection of data." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define the path of the corpora\n", "path = '/Users/cristinamenghini/Downloads/'\n", "# Xml file\n", "xml_files = ['itwiki-20161120-pages-articles-multistream.xml', \n", " 'ptwiki-20161201-pages-articles-multistream.xml']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After a quick peek at a snippet of the `XML`, we see that the elements we are interested in are under the child `page`, which identifies an article. From each `page` we want to get the contents of `title` and `text`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the large size of the `XML` we opted for a parser that registers callbacks for events of interest and then proceeds through the document. The text of the article has not been preprocessed, since for the purpose of our analysis we are not going to analyze the text itself.\n", "\n", "Hence, we proceed to parse the Italian corpus using the `parse_articles` function stored in the [`wiki_parser`](wiki_parser.py) library, which basically activates the parser." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Parse italian corpus\n", "parse_articles('ita', path + xml_files[0], 'Matteo Renzi')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we move on to the Portuguese one." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Parse portuguese corpus\n", "parse_articles('port', path + xml_files[1], 'Matteo Renzi')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The articles are filtered by the presence of a mention of Matteo Renzi. Those in Italian are stored in a [`.json`](Corpus/wiki_ita_matteo_renzi.json) file in which each line corresponds to a page (`title`, `text`). 
The same holds for the [articles](Corpus/wiki_port_matteo_renzi.json) in Portuguese. The two corpora are automatically stored in the folder [`Corpus`](https://github.com/CriMenghini/Wikipedia/tree/master/Corpus).\n", "\n", " {\"title\": \"title_1\", \"text\": \"text_1\"}\n", " ...\n", " ...\n", " {\"title\": \"title_n\", \"text\": \"text_n\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Rank articles according to November pageviews " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the data has been filtered, we proceed with a *simple* analysis of the pageviews. In particular, using the [`article_df_from_json`](pageviews.py) function, all the article titles are extracted from the corpus and then stored in a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the df for the Italian articles\n", "df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')\n", "\n", "# Get the df for the Portuguese articles\n", "df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the obtained `DataFrame`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title\n", "101 Centro-sinistra\n", "413 Viadotto Italia\n", "223 TG5 Prima Pagina\n", "403 Carcere di Santo Stefano\n", "498 Referendum costituzionale del 2016 in Italia" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_it_titles.sample(5)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Thus, we extract the number of monthly page views for each article related to the languages of interest (i.e. `it` and `pt`) from the *page views* file - [Additional data](https://github.com/CriMenghini/Wikipedia/tree/master/Mention) in the `README`. To filter the file we use the [`filter_pageviews_file`](pageviews.py) function and get a dictionary of dictionaries with the following structure (according to our example):\n", "\n", " {'it':{'Title_1':'No pageviews',\n", " ...\n", " 'Title_n':'No pageviews'},\n", " 'pt':{'Title_1':'No pageviews',\n", " ...\n", " 'Title_k':'No pageviews'}}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Page views file\n", "pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'\n", "\n", "# Filter the page view file\n", "articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, a right join between the `DataFrames`, namely the one obtained from the pageviews and the other obtained from the corpus, is performed. It results that both for the Italian and Portuguese articles there are articles that mention Matteo Renzi that have not been visualized in November. The `define_ranked_df` function is stored in the [`pageviews`](pageviews.py) library." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Over the whole number of articles in the corpus 39 have not been visited during the considered period.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Title Pageviews\n", "409 Marco Travaglio 19795.0\n", "146 Pif (conduttore televisivo) 19557.0\n", "474 Partito Democratico (Italia) 11653.0\n", "154 Vittorio Sgarbi 11324.0\n", "226 Malala Yousafzai 9452.0\n", "433 Jobs Act 8908.0\n", "274 Enrico Letta 7791.0\n", "343 Startup (economia) 7698.0\n", "312 Marianna Madia 7608.0\n", "155 Nuovo Centro Congressi 6894.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the italian ranked article df according to the number of page views\n", "ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)\n", "# Show the df head\n", "ranked_df_ita.head(10)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Over the whole number of articles in the corpus 4 have not been visited during the considered period.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Title Pageviews\n", "33 Partido Democrático (Itália) 567.0\n", "31 Lista de chefes de Estado e de governo atuais 410.0\n", "4 G7 259.0\n", "19 Federica Mogherini 215.0\n", "7 G20 185.0\n", "23 Centro-esquerda 141.0\n", "8 Privatização 93.0\n", "15 Lista de chefes de Estado e de governo por dat... 83.0\n", "21 9.ª reunião de cúpula do G20 81.0\n", "17 10.ª reunião de cúpula do G20 70.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the italian ranked article df according to the number of page views\n", "ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)\n", "# Show the df\n", "ranked_df_port.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having a quick glance at the two top 10, we notice:\n", "* The number of page views for the Italian articles which mention Matteo Renzi is considerably higher than for those that are written in Portuguese.\n", "* The only article that is present in both the top ranking is `Partito Democratico (Italia)`. \n", "* It seems that the pages differ in the content: the Portuguese ones are more related to topics that regard the international politics rather the Italians that refer to politics, journalists and public figures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Make comparisons \n", "\n", "We now move ahead exploring the data that we preprocessed and trying to figure out something interesting.\n", "* We take a look at the number of mentions received in each article. In this contest, it may be possible that Matteo Renzi received more than one mention just because of the presence of references. For instance on [this](https://it.wikipedia.org/wiki/Francesco_Guccini) page, if you look up for Matteo Renzi, you will find 2 mentions but one of those just refers to the first. 
For the moment we do not address this issue.\n", "\n", "The `DataFrame` below, obtained using the `article_mentions` function in [this](across_languages.py) library, shows the number of mentions that Matteo Renzi has received in each article, for both the Italian and the Portuguese corpora. The `DataFrames` are sorted by the number of mentions so that we get the pages where Matteo Renzi is most \"popular\"." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title Number of mentions\n", "37 Matteo Renzi 62\n", "424 Governo Renzi 30\n", "195 Partito Democratico (Italia) 17\n", "492 Riforma costituzionale Renzi-Boschi 12\n", "214 Storia del Partito Democratico (Italia) 12" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Italian df of mentions per page\n", "df_it_mentions = article_mentions('Corpus/wiki_ita_Matteo_Renzi.json', 'Matteo Renzi')\n", "\n", "# Sort the df by the number of mentions and see the top 5\n", "df_it_mentions = df_it_mentions.sort_values('Number of mentions', ascending = False)\n", "\n", "# Show results\n", "df_it_mentions.head(5)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title Number of mentions\n", "20 Matteo Renzi 11\n", "7 Partido Democrático (Itália) 6\n", "9 Itália 3\n", "0 Lista de primeiros-ministros da Itália 2\n", "16 Lista de viagens presidenciais de Dilma Rousseff 2" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Portuguese df of mentions per page\n", "df_pt_mentions = article_mentions('Corpus/wiki_port_Matteo_Renzi.json', 'Matteo Renzi')\n", "\n", "# Sort the df by the number of mentions and see the top 5\n", "df_pt_mentions = df_pt_mentions.sort_values('Number of mentions', ascending = False)\n", "\n", "# Show results\n", "df_pt_mentions.head(5)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Comparing the two `DataFrames` we immediately notice that even if the maximum number of mentions that Matteo Renzi received for Italian and Portuguese articles are very different. In the Portuguese corpus there are only two articles that have more than 5 mentions. Thus, can be interesting to visualize the distribution of the mentions both for the IT and PT corpora." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distributions are represented using the boxplots. They show that for both the languages the 75% of the articles contain no more than 3 mentions of the Italian premier. For the Portuguese corpus stand out two outliers that correspond to `Matteo Renzi 11 mentions` and `Partido Democrático (Itália) 6 mentions`, rather for the Italians the number of outliers is bigger and the maximum number of mentions are contained in `Matteo Renzi 62 mentions`. Moreover, zooming in the boxes, we observe that the two distributions are skewed toward left (number of mentions equal to 1). 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#boxplot_mentions(df_pt_mentions, df_it_mentions, 'PT', 'IT', 'Number of mentions')\n", "tls.embed(\"https://plot.ly/~crimenghini/20\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this direction, one aspect that can be considered is the following: \n", "> Define how important is Matteo Renzi in the articles that mention him. It requires defining the concept of *importance*. Intuitively, we would say that higher is the number of mentions more is the importance of our object in the article. Moreover, it may be useful to weight the number of mentions according to the number of words in the article. \n", "$$I_{string} = \\frac{M}{|D|}$$\n", "Where *I* is the importance, *M* is the number of mentions and *D* the number of words in the document. In this way, whether an article cited Renzi once but it is made up just by a few lines, the string of interest will result more significant.\n", "\n", "Moreover, another aspect should be considered, especially when there is only one mention: \n", "* The string (i.e. Matteo Renzi) is a pointer to its main page (i.e. Matteo Renzi -> [Matteo Renzi](https://it.wikipedia.org/wiki/Matteo_Renzi)). Whether the pointer is present we can imagine that the figure is more *important* than a page where there is no a hyperlink." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another thing that can be visualized is the realtionship between the `Number of mentions` and the `Pageviews`. In order to do that we first merge the two pageviews and mentions `DataFrames`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title Number of mentions Pageviews\n", "255 Faccia a faccia (programma televisivo) 1 483.0\n", "457 Fausto Brizzi 1 2161.0\n", "20 Ivan Scalfarotto 4 1077.0\n", "32 Elezioni amministrative italiane del 2009 4 723.0\n", "472 Anonymous 1 6.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Merge pageviews and mentions DataFrames for IT\n", "df_it_mension_pageview = pd.merge(df_it_mentions, ranked_df_ita, on=['Title'])\n", "\n", "# Show it\n", "df_it_mension_pageview.sample(5)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title Number of mentions Pageviews\n", "14 42.ª reunião de cúpula do G7 2 33.0\n", "12 Maria Elena Boschi 2 5.0\n", "3 Lista de primeiros-ministros da Itália 2 12.0\n", "19 G20 1 185.0\n", "20 Lista de líderes do G20 1 8.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Merge pageviews and mentions DataFrames for PT\n", "df_pt_mension_pageview = pd.merge(df_pt_mentions, ranked_df_port, on=['Title'])\n", "\n", "# Show it\n", "df_pt_mension_pageview.sample(5)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A scatterplot is used to get how an article is positioned according to these two variables. The plot shows:\n", "* __IT__: when the mentions are equal to 1 the number of page views is spread between 0 and ~20k. Where the number of mentions increases the number of page visualizations belongs to a smaller range. \n", "* __PT__: also for Portuguese article the same is observed. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# def scatter_plot(df_it_mension_pageview, df_pt_mension_pageview, 'Number of mentions', 'Pageviews', 'Italian', 'Portuguese')\n", "tls.embed('https://plot.ly/~crimenghini/36')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "About these two features, we can think that another way to explore should be the following:\n", "> Consider how the number of pageviews of an article changes when the number of Matteo Renzi citations increases from a revision to another. In particular, the *importance*(I) is re-defined as: \n", "$$I = \\sum_{t = 1}^{T} \\frac{(p_t-p_{t-1}) \\times m_t}{|D_t|}$$\n", "Where *t* is the time of sequential revision of the article, *p* is the number of page views at time and m is the number of mentions." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now look for articles that exist in both languages and mention Matteo Renzi. To do so, we make a request for each Portuguese *Wikipedia* page (that cites Renzi), then we parse the `HTML` source to extract, where available, the title of the corresponding IT article. More precisely, the requests are sent for each title of the language that has fewer articles mentioning Matteo Renzi. The function `get_matches` is stored in [this](across_languages.py) library. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Build the common articles matches\n", "dict_italian = get_matches(df_pt_titles, 'it')\n", "# Create the inverted one\n", "inverted_dict = {v : k for k, v in dict_italian.items()}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are: 31 . The number of PT articles that have not been matched is: 11 .\n" ] } ], "source": [ "print ('The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are: ', len(dict_italian), \n", " '. The number of PT articles that have not been matched is: ', len(df_pt_titles)-len(dict_italian), '.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We proceed to create a `DataFrame` that contains the information related to those articles.\n", "\n", "* We extract the titles of all involved articles (both IT and PT)." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# From the dictionary get the titles of both languages\n", "italian_titles = list(dict_italian.values())\n", "portugues_titles = list(dict_italian.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before going further, we check whether all the matched IT articles mention Matteo Renzi. To do so, we run a query on the `DataFrame` that stores all the IT articles that cite Renzi." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 10 IT articles that do not mention Matteo Renzi.\n" ] } ], "source": [ "# Run the query\n", "match_with_mention = df_it_titles.query('Title in @italian_titles')\n", "\n", "# Get the number\n", "print ('There are ', len(portugues_titles)-len(match_with_mention), 'IT articles that do not mention Matteo Renzi.')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Re-define the list of IT articles according to the aforementioned \"issue\"\n", "it_titles_with_mention = list(match_with_mention.Title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The dictionaries that match the PT and IT titles are re-defined taking into account the fact that some IT articles do not mention Renzi." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Re-define the two dictionaries \n", "dict_italian_mentions = {k:v for k,v in dict_italian.items() if v in it_titles_with_mention}\n", "# Define the inverted one\n", "inverted_dict_italian_mentions = {v : k for k, v in dict_italian_mentions.items()}\n", "\n", "# Create the list of titles for PT articles, excluding those whose IT match does not mention Renzi\n", "pt_titles_with_mention = list(dict_italian_mentions.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we create a single `DataFrame` which contains the mentions in the IT and PT articles for each pair of matched articles." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create df for IT mentions\n", "df_match_it_mentions = df_it_mentions.query('Title in @it_titles_with_mention').sort_values('Number of mentions', ascending = False)\n", "\n", "# Create df for PT mentions\n", "df_match_pt_mentions = df_pt_mentions.query('Title in @pt_titles_with_mention').sort_values('Number of mentions', ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Add a column containing the matches to join the two dfs." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create new column\n", "new_column_it = ['/'.join([k]+[v]) for i in df_match_it_mentions.Title for k,v in dict_italian_mentions.items() if i == v]\n", "new_column_pt = ['/'.join([k]+[v]) for i in df_match_pt_mentions.Title for k,v in dict_italian_mentions.items() if i == k]\n", "\n", "# Add the new column to the two dataframes\n", "df_match_it_mentions['Matches'] = new_column_it\n", "df_match_pt_mentions['Matches'] = new_column_pt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Perform the join on the `Matches` column and plot the results." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title_IT Number of mentions_IT \\\n", "0 Matteo Renzi 62 \n", "1 Partito Democratico (Italia) 17 \n", "2 Maria Elena Boschi 6 \n", "3 Federica Mogherini 4 \n", "4 Presidenti del Consiglio dei ministri della Re... 3 \n", "\n", " Matches \\\n", "0 Matteo Renzi/Matteo Renzi \n", "1 Partido Democrático (Itália)/Partito Democrati... \n", "2 Maria Elena Boschi/Maria Elena Boschi \n", "3 Federica Mogherini/Federica Mogherini \n", "4 Lista de primeiros-ministros da Itália/Preside... \n", "\n", " Title_PT Number of mentions_PT \n", "0 Matteo Renzi 11 \n", "1 Partido Democrático (Itália) 6 \n", "2 Maria Elena Boschi 2 \n", "3 Federica Mogherini 1 \n", "4 Lista de primeiros-ministros da Itália 2 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the two dfs on the correspondence tuples\n", "matches_mention = pd.merge(df_match_it_mentions, df_match_pt_mentions, on = 'Matches', suffixes = ('_IT','_PT'))\n", "\n", "# Show result\n", "matches_mention.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# bar_plot(df, 'Matches', 'Number of mentions_IT', 'Number of mentions_P', 'IT', 'PT', 'Compare IT and PT mentions', \n", "# 'Article','No. mentions', 'color-bar-prova')\n", "tls.embed('https://plot.ly/~crimenghini/38')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the plot:\n", "* Among this group of articles, the two that mention Matteo Renzi more result to be the same.\n", "\n", "From this kind of analysis a question one can think about is the following:\n", "\n", "> Given articles in different languages that correspond one to each other, if we are interested in measuring the proximity of these articles, an element that may be considered is the number of common mentions. 
It is likely that the need to mention someone or something derives from the fact that the two articles are talking about the same topics, which need to refer to the same thing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same procedure is repeated for the page views.\n", "\n", "* Check whether some articles have not been visited." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 12 IT articles that have not been visited.\n", "There are 3 PT articles that have not been visited.\n" ] } ], "source": [ "# Run the query\n", "match_with_pageviews_it = ranked_df_ita.query('Title in @italian_titles')\n", "match_with_pageviews_pt = ranked_df_port.query('Title in @portugues_titles')\n", "# Get the number\n", "print ('There are ', len(portugues_titles)-len(match_with_pageviews_it), 'IT articles that have not been visited.')\n", "print ('There are ', len(portugues_titles)-len(match_with_pageviews_pt), 'PT articles that have not been visited.')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define list of articles that have been visited\n", "it_titles_with_pageviews = list(match_with_pageviews_it.Title)\n", "pt_titles_with_pageviews = list(match_with_pageviews_pt.Title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Define the matching dictionaries according to what was said above." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Re-define the two dictionaries according to this evidence\n", "dict_italian_pageviews = {k:v for k,v in dict_italian.items() if v in it_titles_with_pageviews}\n", "\n", "# PT \n", "dict_pt_pageviews = {v : k for k, v in dict_italian.items() if k in pt_titles_with_pageviews}" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create df for IT pageviews\n", "df_match_it_pageviews = ranked_df_ita.query('Title in @it_titles_with_pageviews').sort_values('Pageviews', ascending = False)\n", "\n", "# Create df for PT pageviews\n", "df_match_pt_pageviews = ranked_df_port.query('Title in @pt_titles_with_pageviews').sort_values('Pageviews', ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Add a new column to allow the join." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create new column\n", "new_column_it = ['/'.join([k]+[v]) for i in df_match_it_pageviews.Title for k,v in dict_italian_pageviews.items() if i == v]\n", "new_column_pt = ['/'.join([v]+[k]) for i in df_match_pt_pageviews.Title for k,v in dict_pt_pageviews.items() if i == v]\n", "\n", "# Add the new column to the two dataframes\n", "df_match_it_pageviews['Matches'] = new_column_it\n", "df_match_pt_pageviews['Matches'] = new_column_pt" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
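<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Title</th>\n", "      <th>Pageviews</th>\n", "      <th>Matches</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>474</th>\n", "      <td>Partito Democratico (Italia)</td>\n", "      <td>11653.0</td>\n", "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>274</th>\n", "      <td>Enrico Letta</td>\n", "      <td>7791.0</td>\n", "      <td>Enrico Letta/Enrico Letta</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>312</th>\n", "      <td>Marianna Madia</td>\n", "      <td>7608.0</td>\n", "      <td>Marianna Madia/Marianna Madia</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>442</th>\n", "      <td>G20 (paesi industrializzati)</td>\n", "      <td>2545.0</td>\n", "      <td>G20/G20 (paesi industrializzati)</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>318</th>\n", "      <td>Giuliano Poletti</td>\n", "      <td>2021.0</td>\n", "      <td>Giuliano Poletti/Giuliano Poletti</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>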
" ], "text/plain": [ " Title Pageviews \\\n", "474 Partito Democratico (Italia) 11653.0 \n", "274 Enrico Letta 7791.0 \n", "312 Marianna Madia 7608.0 \n", "442 G20 (paesi industrializzati) 2545.0 \n", "318 Giuliano Poletti 2021.0 \n", "\n", " Matches \n", "474 Partido Democrático (Itália)/Partito Democrati... \n", "274 Enrico Letta/Enrico Letta \n", "312 Marianna Madia/Marianna Madia \n", "442 G20/G20 (paesi industrializzati) \n", "318 Giuliano Poletti/Giuliano Poletti " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_match_it_pageviews.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Title</th>\n", "      <th>Pageviews</th>\n", "      <th>Matches</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>33</th>\n", "      <td>Partido Democrático (Itália)</td>\n", "      <td>567.0</td>\n", "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>31</th>\n", "      <td>Lista de chefes de Estado e de governo atuais</td>\n", "      <td>410.0</td>\n", "      <td>Lista de chefes de Estado e de governo atuais/...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>G7</td>\n", "      <td>259.0</td>\n", "      <td>G7/G7</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>19</th>\n", "      <td>Federica Mogherini</td>\n", "      <td>215.0</td>\n", "      <td>Federica Mogherini/Federica Mogherini</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>7</th>\n", "      <td>G20</td>\n", "      <td>185.0</td>\n", "      <td>G20/G20 (paesi industrializzati)</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Title Pageviews \\\n", "33 Partido Democrático (Itália) 567.0 \n", "31 Lista de chefes de Estado e de governo atuais 410.0 \n", "4 G7 259.0 \n", "19 Federica Mogherini 215.0 \n", "7 G20 185.0 \n", "\n", " Matches \n", "33 Partido Democrático (Itália)/Partito Democrati... \n", "31 Lista de chefes de Estado e de governo atuais/... \n", "4 G7/G7 \n", "19 Federica Mogherini/Federica Mogherini \n", "7 G20/G20 (paesi industrializzati) " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_match_pt_pageviews.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Join the two `DataFrames` with a right join, so that we also keep the PT articles whose IT counterparts have no recorded page views." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Title_IT</th>\n", "      <th>Pageviews_IT</th>\n", "      <th>Matches</th>\n", "      <th>Title_PT</th>\n", "      <th>Pageviews_PT</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>Partito Democratico (Italia)</td>\n", "      <td>11653.0</td>\n", "      <td>Partido Democrático (Itália)/Partito Democrati...</td>\n", "      <td>Partido Democrático (Itália)</td>\n", "      <td>567.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>Enrico Letta</td>\n", "      <td>7791.0</td>\n", "      <td>Enrico Letta/Enrico Letta</td>\n", "      <td>Enrico Letta</td>\n", "      <td>9.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>Marianna Madia</td>\n", "      <td>7608.0</td>\n", "      <td>Marianna Madia/Marianna Madia</td>\n", "      <td>Marianna Madia</td>\n", "      <td>6.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>G20 (paesi industrializzati)</td>\n", "      <td>2545.0</td>\n", "      <td>G20/G20 (paesi industrializzati)</td>\n", "      <td>G20</td>\n", "      <td>185.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>Giuliano Poletti</td>\n", "      <td>2021.0</td>\n", "      <td>Giuliano Poletti/Giuliano Poletti</td>\n", "      <td>Giuliano Poletti</td>\n", "      <td>7.0</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Title_IT Pageviews_IT \\\n", "0 Partito Democratico (Italia) 11653.0 \n", "1 Enrico Letta 7791.0 \n", "2 Marianna Madia 7608.0 \n", "3 G20 (paesi industrializzati) 2545.0 \n", "4 Giuliano Poletti 2021.0 \n", "\n", " Matches \\\n", "0 Partido Democrático (Itália)/Partito Democrati... \n", "1 Enrico Letta/Enrico Letta \n", "2 Marianna Madia/Marianna Madia \n", "3 G20/G20 (paesi industrializzati) \n", "4 Giuliano Poletti/Giuliano Poletti \n", "\n", " Title_PT Pageviews_PT \n", "0 Partido Democrático (Itália) 567.0 \n", "1 Enrico Letta 9.0 \n", "2 Marianna Madia 6.0 \n", "3 G20 185.0 \n", "4 Giuliano Poletti 7.0 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the two dfs on the 'Matches' key; the right join keeps every PT row\n", "matches_pageviews = pd.merge(df_match_it_pageviews, df_match_pt_pageviews, how='right', on='Matches', suffixes=('_IT', '_PT'))\n", "# PT rows without an IT counterpart get 0 instead of NaN\n", "matches_pageviews.fillna(0, inplace=True)\n", "# Show result\n", "matches_pageviews.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a bar plot to visualize the results." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# bar_plot(df, 'Matches', 'Pageviews_IT', 'Pageviews_PT', 'IT', 'PT', 'Compare IT and PT pageviews', 'Article',\n", "# 'No. pageviews', 'color-bar-pvs')\n", "tls.embed('https://plot.ly/~crimenghini/40')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "From the plot:\n", "* The most visited page is the same in both languages.\n", "* In general, it seems that the PT pages that mention Matteo Renzi are related to general topics and political figures on the international stage." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It could be interesting:\n", "\n", "> To present the same plot using the relative frequencies of visits, in order to see the importance of each page with respect to the list of articles (that mention Renzi) in that language. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> To study the relationships between the articles that mention Renzi, in particular whether they are connected and point to each other. This could be used to gauge the importance of Matteo Renzi in an article. For example, if Matteo Renzi is mentioned on the page of a TV show just because he has been a guest, and that page turns out not to be connected to the other articles, we can assume that Renzi is not the main topic of the article. I'm not totally sure this can be done, since moving from one article to another (even when they cover extremely different topics) does not take many hops." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }