---
title: Web scraping and analyzing 400,000 articles
author: ''
date: '2018-04-04'
description: "Perform web scraping and analysis on 400,000 articles published by NEWSBEAST.GR, one of the main on-line publishers in Greece"
slug: web-scraping-and-analyzing-400-000-articles
categories:
  - R
tags:
  - Web scraping
  - Time series
  - Smoothing
---

In this post I will analyze the articles published by [NEWSBEAST](http://www.newsbeast.gr), one of the main on-line publishers in Greece. The main question is **"Are there any trends in the daily posting frequency of NEWSBEAST?"**

The following steps were carried out before the main data analysis (a link to the full R code can be found at the end of the article):

* Scraping the articles from the site (www.newsbeast.gr) using the rvest & car libraries in R
* Data preparation (including data cleansing and transformation)
* Exploratory analysis

```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# Load libraries for web scraping
library(rvest)
library(car)

# Create an empty list
v <- list()

# Loop over pages 1 to 8000 and download the post metadata from each page
for(i in 1:8000){
  v[[i]] <- read_html(paste("http://www.newsbeast.gr/page/", i, sep = "")) %>% 
    html_nodes("h2 a , .article-top-meta-time , .article-top-meta-date , .feed-article-subtitle, .feed-article-excerpt, .article-top-meta-category") %>% 
    html_text()
}

# Unlist the object created before
test <- unlist(v)

# Get the length of the object
a <- 1:length(test)

# Create the title object
title <- a[seq(1, length(a), 10)]
title <- test[title]

# Create the time object
time <- a[seq(2, length(a), 10)]
time <- test[time]

# Create the date object
date <- a[seq(3, length(a), 10)]
date <- test[date]

# Create the category object
cat <- a[seq(4, length(a), 10)]
cat <- test[cat]

# Create the second title object
title2 <- a[seq(5, length(a), 10)]
title2 <- test[title2]

# Create the second time object
time2 <- a[seq(6, length(a), 10)]
time2 <- test[time2]

# Create the second date object
date2 <- a[seq(7, length(a), 10)]
date2 <- test[date2]

# Create the second category object
cat2 <- a[seq(8, length(a), 10)]
cat2 <- test[cat2]

# Create the text/content objects
text1 <- a[seq(9, length(a), 10)]
text1 <- test[text1]

text2 <- a[seq(10, length(a), 10)]
text2 <- test[text2]

# Combine into the final dataset with all posts
final <- as.data.frame(cbind(date, time, cat, title, title2, text1, text2))

# Save as RDS for future use
saveRDS(final, file = "/Users/manos/Dropbox/Projects/R/2018_01_Newsbeast/data/newsbeast.all.RDS")
```

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(tidyverse)
library(lubridate)
library(ggthemes)

# Load the data
news <- readRDS("/Users/manos/OneDrive/Projects/R/All_Projects/2018_01_Newsbeast/data/newsbeast.all.RDS")

# Create the date variable
news$date <- paste(news$date, news$time)
news$date <- parse_date_time(news$date, '%d/%m/%Y %H:%M')

# Change the class of the text variables to character
news$title <- as.character(news$title)
news$title2 <- as.character(news$title2)
news$text1 <- as.character(news$text1)
news$text2 <- as.character(news$text2)
```

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# Calculate how many observations differ between the two title variables
sum(news$title != news$title2)

# Exclude the 2 differing observations from the dataset
news <- filter(news, title == title2)

# Delete the second title variable because the two are identical
news$title2 <- NULL
```

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# Create a summary of the dataset
summary(news)
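# Additional sanity checks: the overall date range of the scraped posts and how
# many date strings failed to parse (parse_date_time() returns NA when parsing fails)
range(news$date, na.rm = TRUE)
sum(is.na(news$date))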
# Explore the structure
str(news)
```

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
news <- news %>% 
  rename(category = cat,          # Change variable names
         subtitle = text1,
         summary = text2) %>% 
  mutate(year = year(date),       # Create year, month, and weekday variables
         month_num = month(date),
         month_name = month(date, label = TRUE),
         day_num = wday(date),
         day_name = wday(date, label = TRUE, week_start = 1)) %>% 
  filter(year > 2013)             # Keep observations from 2014 to 2017
```

# MAIN DATA ANALYSIS

Let's create a scatter plot of the number of articles published per day over the last 4 years.

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
news %>% 
  group_by(day = floor_date(date, unit = "day")) %>% 
  summarise(N = n()) %>% 
  ggplot() +
  geom_point(aes(day, N), color = "steelblue", alpha = 0.7) +
  labs(y = "Number of daily posts",
       x = "",
       title = "Number of daily posts from NEWSBEAST.GR",
       subtitle = "Includes all posts from 2014 until the end of 2017") +
  theme_fivethirtyeight()
```

Two patterns in the number of daily published articles persist throughout the last 4 years. First, on specific weekdays (the weekends, perhaps?) significantly fewer articles seem to be published. A second, less obvious trend is that fewer articles are published each consecutive year.

Now let's plot it again, distinguishing each weekday with a different color for a clearer view.

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
news %>% 
  group_by(day = floor_date(date, unit = "day"), day_name) %>% 
  summarise(N = n()) %>% 
  ggplot() +
  geom_point(aes(day, N, color = day_name)) +
  labs(y = "Number of daily posts",
       x = "",
       title = "Number of daily posts from NEWSBEAST.GR",
       subtitle = "Includes all posts from 2014 until the end of 2017",
       fill = "Week Day") +
  scale_color_discrete(name = "Week Day") +
  theme_fivethirtyeight()
```

So now it is clear that the number of daily posts during the weekends is much lower than on weekdays.

# MODELLING

Now let's fit a model for each weekday and plot the results. In the plot below, each weekday is represented by a separate smooth curve fitted with [LOESS regression](https://en.wikipedia.org/wiki/Local_regression).

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
news %>% 
  group_by(day = floor_date(date, unit = "day"), day_name) %>% 
  summarise(N = n()) %>% 
  ggplot() +
  geom_smooth(mapping = aes(day, N, colour = day_name),
              show.legend = TRUE,
              method = "loess") +
  labs(y = "Number of daily posts",
       x = "",
       title = "Number of daily posts from NEWSBEAST.GR",
       subtitle = "with LOESS curve fitted") +
  scale_color_discrete(name = "Week Day") +
  theme_fivethirtyeight()
```

The standard error band for the weekend days (at the 95% confidence level) indicates that the difference from the weekdays is significant.
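To put rough numbers on the weekday effect, a minimal sketch along these lines could be used (assuming the `news` data frame prepared above); it computes the average number of daily posts per weekday together with an approximate 95% confidence interval:

```{r message=FALSE, warning=FALSE}
# Average daily posts per weekday with approximate 95% confidence intervals
news %>% 
  group_by(day = floor_date(date, unit = "day"), day_name) %>% 
  summarise(N = n()) %>%                      # number of posts per calendar day
  group_by(day_name) %>% 
  summarise(avg_daily = mean(N),              # mean daily posts for each weekday
            se = sd(N) / sqrt(n()),           # standard error of the mean
            lower_95 = avg_daily - qt(0.975, n() - 1) * se,
            upper_95 = avg_daily + qt(0.975, n() - 1) * se)
```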
Now let's investigate whether there are significant differences between years.

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
news %>% 
  group_by(year, day = floor_date(date, unit = "day")) %>% 
  summarise(N = n()) %>% 
  group_by(year) %>% 
  summarise(avg.daily = mean(N),
            se = sd(N) / sqrt(n())) %>% 
  mutate(lower.ci = avg.daily - qt(1 - (0.05 / 2), n() - 1) * se,
         upper.ci = avg.daily + qt(1 - (0.05 / 2), n() - 1) * se) %>% 
  ggplot(aes(year, avg.daily)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = .7, position = position_dodge()) +
  geom_errorbar(aes(ymin = lower.ci, ymax = upper.ci), color = "red") +
  labs(y = "Average daily posts",
       x = "",
       title = "Average daily posts per year from NEWSBEAST.GR",
       subtitle = "with error bars") +
  theme_fivethirtyeight()
```

Judging by the error bars, all differences between years appear statistically significant, except for the comparison of 2016 with 2017, where the error bars overlap. These comparisons can be checked more formally by applying the non-parametric [Wilcoxon rank-sum](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test to the yearly samples (table below).

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# Create the dataset of daily post counts per year
news2 <- news %>% 
  group_by(year, day = floor_date(date, unit = "day")) %>% 
  summarise(N = n())

# Perform the Wilcoxon rank-sum test between consecutive years
wilcox.test(news2[news2$year == 2014,]$N, news2[news2$year == 2015,]$N, alternative = "two.sided")
wilcox.test(news2[news2$year == 2015,]$N, news2[news2$year == 2016,]$N, alternative = "two.sided")
wilcox.test(news2[news2$year == 2016,]$N, news2[news2$year == 2017,]$N, alternative = "two.sided")
```

| Wilcoxon test between years | p-value |
|:---------------------------:|:-------:|
| 2014 - 2015                 | < 0.001 |
| 2015 - 2016                 | < 0.001 |
| 2016 - 2017                 | < 0.001 |

All p-values are very low (< 0.001), so at the 0.95 confidence level the differences in mean daily posted articles are significant.

# TIME SERIES ANALYSIS

Finally, in order to investigate other, less obvious seasonal trends, we used a package dedicated to time series analysis. A forecasting model was fitted with the [prophet](https://github.com/facebook/prophet) package for R, as shown below, and it revealed additional seasonal patterns. In particular, two major "drops" in daily articles appear around the **Christmas period** and the **summer holidays** (around August). There is also a minor "drop" around the **Easter period**.

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(prophet)

# Prepare the data in the format prophet expects (ds = date, y = value)
df <- news2[, -1] %>% 
  rename(y = N, ds = day)

# Fit the model and produce a one-year-ahead forecast
m <- prophet(df)
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)

plot(m, forecast)
```

Another useful feature is the ability to plot the various components of the model (plot below). It is a simple way to pull out the weekly and yearly trends. In particular, in the chart below we can easily spot the drop during the weekends, the "drops" in daily articles around the **Christmas period**, the **summer holidays** and the **Easter period**, as well as the overall yearly decline.

```{r}
prophet_plot_components(m, forecast)
```

# CONCLUSIONS

There are several takeaways from this analysis of Newsbeast.gr. In particular:

* During the last 4 years there has been a significant decrease in daily posted articles: overall, around **30%** fewer articles (a rough check of this figure is sketched below).
* Far fewer articles are published during the weekends than on weekdays.
* There are 2 major "drops" in daily articles each year, during the Christmas and summer holidays.
* There is a minor "drop" in daily articles published each year, during the Easter holidays.

[Full R code](https://raw.githubusercontent.com/mantoniou/Blog/master/content/post/2018-04-04-web-scraping-and-analyzing-400-000-articles.Rmd)
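As a rough check on the ~30% figure mentioned in the conclusions, a minimal sketch along the following lines could be used, assuming the `news2` data frame of daily post counts created above for the Wilcoxon tests:

```{r message=FALSE, warning=FALSE}
# Average daily posts per year
yearly <- news2 %>% 
  group_by(year) %>% 
  summarise(avg_daily = mean(N))

# Approximate percentage decrease in average daily posts between 2014 and 2017
100 * (yearly$avg_daily[yearly$year == 2014] - yearly$avg_daily[yearly$year == 2017]) /
  yearly$avg_daily[yearly$year == 2014]
```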