Understanding Botnet-driven Blog Spam: Motivations and Methods

Brandon Bevans
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Bruce DeBruhl
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Foaad Khosmood
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Introduction

Spam, or unsolicited commercial communication, has evolved from telemarketing schemes to a highly sophisticated and profitable black-market business. Although many users are aware that email spam is prominent, they are less aware of blog spam (Thomason, 2007). Blog spam, also known as forum spam, is spam that is posted to a public or outward-facing website. Blog spam can be used to accomplish many of the tasks that email spam is used for, such as posting links to a malicious executable.

Blog spam can also serve some unique purposes. First, blog spam can influence purchasing decisions by featuring illegitimate advertisements or reviews. Second, blog spam can include content with target keywords designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link spam, which places a URL on a victim page to increase the inserted URL's search engine ranking. Overall, blog spam weakens search engines' model of the Internet popularity distribution.

Much academic and industrial effort has been spent to detect, filter, and deter spam (Dinh, 2013; Spirin and Han, 2012). Less effort has been placed in understanding the underlying distribution mechanisms of spambots and botnets. One foundational study in characterizing blog spam (Niu et al., 2007) provided a quantitative analysis of blog spam in 2007. This study showed that blogs in 2007 contained large amounts of spam, but it did not try to identify linked behavior that would imply botnet behavior.
A later study on blog spam (Stringhini, 2015) explored using IPs and usernames to detect botnets but did not characterize the behavior of these botnets. In 2011, a research team (Stone-Gross et al., 2011) infiltrated a botnet, which allowed for observations of the logistics around botnet spam campaigns. Overall, our understanding of blog spam generated by botnets is still limited.

Related Work

Various projects have attempted to identify the mechanics, characteristics, and behavior of botnets that control spam. In one important study (Shin et al., 2011), researchers fully evaluated how one of the most popular spam automation programs, XRumer, operates. Another study explored the behavior of botnets across multiple spam campaigns (Thonnard and Dacier, 2011). Others (Pitsillidis et al., 2012) examined the impact that spam datasets had on characterization results. Lumezanu et al. (2012) explored the similarities between email spam and blog spam on Twitter. They show that over 50% of spam links from emails also appeared on Twitter.

Figure 1: Browser rendering of the ggjx honeypot

The underground ecosystem built around the botnet community has been explored (Stone-Gross et al., 2011). In a surprising result, over 95% of pharmaceuticals advertised in spam were handled by a small group of banks (Levchenko et al., 2011). Our work is similar in that we are trying to characterize the botnet ecosystem, focusing on the distribution and classification of certain spam-producing botnets.

Experimental Design

To classify linguistic similarities and differences among botnets, we implement three honeypots to gather samples of blog spam. We configure our honeypots identically using the Drupal content management system (CMS), as shown in Figure 1. Our honeypots are identical except for the content of their first post and their domain name. The three sites are ggjx.org (fashion themed), npcagent.com (sports themed), and gjams.com (pharmaceutical themed).
We combine the data collected from Drupal with the Apache server logs (Apache, 2016) to allow for content analysis of data collected over 42 days. To allow botnets time to discover the honeypots, we activate the honeypots at least six weeks before data collection.

We generate three tables of content for each honeypot (Bevans and Khosmood, 2016). In the user table, we record the information the spambot enters while registering, along with user login statistics, which we summarize in Table 1. This includes the user id, username, password, date of registration, registration IP, and number of logins. In the content table, we record the content of spam posts and comments, which we summarize in Table 2. This includes the blog node id, the author's unique id, the date posted, the number of hits, type of post, title of the post, text of the post, links in the post, language of the post, and a taxonomy of the post from IBM's Alchemy API.

┌─────────┬────────┬────────────────┬──────────────┐
│Honeypot │Quantity│Mean Logins/User│# of Countries│
├─────────┼────────┼────────────────┼──────────────┤
│ggjx     │62992   │1.066           │83            │
├─────────┼────────┼────────────────┼──────────────┤
│gjams    │28230   │1.102           │40            │
├─────────┼────────┼────────────────┼──────────────┤
│npcagent │34332   │1.05            │53            │
└─────────┴────────┴────────────────┴──────────────┘
Table 1: User table characteristics for three honeypots

┌────────┬────────┬─────────┬──────────┬─────────────┐
│Honeypot│Quantity│Avg. Hits│Avg. Links│English Posts│
├────────┼────────┼─────────┼──────────┼─────────────┤
│ggjx    │2279    │28.237   │2.356     │1962         │
├────────┼────────┼─────────┼──────────┼─────────────┤
│gjams   │2225    │18.178   │0.311     │2137         │
├────────┼────────┼─────────┼──────────┼─────────────┤
│npcagent│1430    │29.043   │1.823     │1409         │
└────────┴────────┴─────────┴──────────┴─────────────┘
Table 2: Characteristics for the content tables

┌────────────────────────┬───────┬───────┬────────┐
│Honeypot                │ggjx   │gjams  │npcagent│
├────────────────────────┼───────┼───────┼────────┤
│# of Entities           │3430   │1790   │1566    │
├────────────────────────┼───────┼───────┼────────┤
│# of Users              │62992  │28230  │34332   │
├────────────────────────┼───────┼───────┼────────┤
│Mean Users/Entity       │18.365 │15.771 │21.923  │
├────────────────────────┼───────┼───────┼────────┤
│Max Users/Entity        │37589  │14249  │23577   │
├────────────────────────┼───────┼───────┼────────┤
│σ of Users/Entity       │666.128│359.619│611.157 │
├────────────────────────┼───────┼───────┼────────┤
│# of IPs                │5291   │3092   │2120    │
├────────────────────────┼───────┼───────┼────────┤
│Mean IPs/Entity         │1.543  │1.727  │1.354   │
├────────────────────────┼───────┼───────┼────────┤
│Max IPs/Entity          │118    │135    │60      │
├────────────────────────┼───────┼───────┼────────┤
│