{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SERP Analysis\n", "This notebook analyzes the search results pages for the 1.9K recommended keywords returned by Google Keyword Planner for our eight inputs, \"Black girls\", \"Latina girls\", \"Asian girls\", \"White girls\", \"Black boys\", \"Latino boys\", \"Asian boys\", and \"White Boys\".\n", "\n", "Here we determine the 200 most-shared domains across searches, and hand-label each domain as pornographic by Googling the website and checking if the site self-identifies as \"porn\" in the search listing. We found 132 of these domains to be pornographic and use that list to estimate how many of the searches return a majority (>50%) of pornographic links." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import glob\n", "import json\n", "from collections import Counter\n", "\n", "import urlexpander\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# input\n", "fn_serp_pattern = '../data/input/browser/*.html'\n", "\n", "# output\n", "fn_top_sites = '../data/intermediary/websites-from-search.csv'\n", "fn_labelled_sites = '../data/intermediary/websites-we-found-to-be-pornographic.csv'\n", "fn_keywords_labelled = '../data/intermediary/keywords-labelled-as-adult.json'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1976" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# HTML source code of Google searches for all recommended keywords.\n", "files = glob.glob(fn_serp_pattern)\n", "len(files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will read each search result into BeautifulSoup and parse out the traditional search listings and their respective web domains." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1976/1976 [00:39<00:00, 49.67it/s]\n" ] } ], "source": [ "link_data = []\n", "for fn in tqdm(files):\n", "    # The keyword is encoded in the file name, with underscores for spaces.\n", "    kw = fn.split('/')[-1].replace('.html', '').replace('_', ' ')\n", "    with open(fn) as f:\n", "        soup = BeautifulSoup(f)\n", "    serp = soup.find('div', attrs={'id': 'main'})\n", "    # Select links that have both `href` and `ping` attributes.\n", "    links = serp.find_all('a', href=True, attrs={'ping': True})\n", "\n", "    for link in links:\n", "        # Keep only links that wrap a heading div with no inline style\n", "        # (the traditional search listings).\n", "        if link.find('div', role='heading', style=False):\n", "            url = link.get('href')\n", "            domain = urlexpander.get_domain(url)\n", "            link_data.append({'keyword': kw, 'url': url, 'domain': domain})" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(link_data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| keyword | url | domain |
|---|---|---|
| 1 girl 5 black guys | 10 | 10 |
| 18 year old black boys | 9 | 9 |
| 18 year old black girl | 10 | 10 |
| 18 year old latina girls | 10 | 10 |
| 2 latino boys | 7 | 7 |
| ... | ... | ... |
| young nude mexican girls | 10 | 10 |
| young thai gays | 8 | 8 |
| young thai ladyboy | 9 | 9 |
| youporn black girl | 10 | 10 |
| zexy asian | 10 | 10 |

1976 rows × 2 columns
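
The per-keyword listing counts above feed the next step, ranking domains by how many distinct searches they appear on. A minimal sketch of that ranking follows, assuming rows shaped like the dicts appended to `link_data` (`keyword`, `url`, `domain`). The helper name `top_shared_domains` and the toy rows are hypothetical, and this is not necessarily how the notebook computes `n_searches_found_on`.

```python
from collections import Counter


def top_shared_domains(link_rows, n=200):
    """Return (domain, n_searches) pairs for the n most widely shared domains."""
    # Collect unique (keyword, domain) pairs so a domain that appears several
    # times on one results page is counted once for that search.
    pairs = {(row["keyword"], row["domain"]) for row in link_rows}
    counts = Counter(domain for _, domain in pairs)
    # Sort by count (descending), then by name, so ties are ordered deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Toy rows shaped like the dicts appended to `link_data` (hypothetical values).
toy_rows = [
    {"keyword": "kw a", "url": "https://site-one.example/1", "domain": "site-one.example"},
    {"keyword": "kw a", "url": "https://site-one.example/2", "domain": "site-one.example"},
    {"keyword": "kw a", "url": "https://site-two.example/", "domain": "site-two.example"},
    {"keyword": "kw b", "url": "https://site-one.example/3", "domain": "site-one.example"},
    {"keyword": "kw b", "url": "https://site-three.example/", "domain": "site-three.example"},
    {"keyword": "kw c", "url": "https://site-three.example/x", "domain": "site-three.example"},
]
top = top_shared_domains(toy_rows, n=2)
print(top)
```

Counting unique (keyword, domain) pairs rather than raw links keeps a site that fills a whole results page from outranking one that appears on more searches.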
|  | website | n_searches_found_on |
|---|---|---|
| 0 | xvideos.com | 1625 |
| 1 | pornhub.com | 1573 |
| 2 | xnxx.com | 1468 |
| 3 | xhamster.com | 745 |
| 4 | redtube.com | 420 |
| ... | ... | ... |
| 195 | blogspot.com | 6 |
| 196 | surfgayvideo.com | 6 |
| 197 | pinterest.nz | 6 |
| 198 | aliexpress.com | 6 |
| 199 | xfantazy.com | 6 |

200 rows × 2 columns
|  | website | n_searches_found_on | identifies_as_porn |
|---|---|---|---|
| 0 | xvideos.com | 1625 | True |
| 1 | pornhub.com | 1573 | True |
| 2 | xnxx.com | 1468 | True |
| 3 | xhamster.com | 745 | True |
| 4 | redtube.com | 420 | True |
| ... | ... | ... | ... |
| 195 | blogspot.com | 6 | False |
| 196 | surfgayvideo.com | 6 | False |
| 197 | pinterest.nz | 6 | False |
| 198 | aliexpress.com | 6 | False |
| 199 | xfantazy.com | 6 | True |

200 rows × 3 columns
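
With the hand labels in place, each search can be flagged by the share of its links that point to a labelled domain. The sketch below is one possible way to do this, assuming rows shaped like the dicts in `link_data` and a set of domains where `identifies_as_porn` is True. The function name `majority_porn_keywords`, the `threshold` parameter, and the toy data are hypothetical. In this sketch, domains outside the labelled set count as non-pornographic.

```python
from collections import defaultdict


def majority_porn_keywords(link_rows, porn_domains, threshold=0.5):
    """Map each keyword to True when more than `threshold` of its links
    point to a domain in `porn_domains`."""
    totals = defaultdict(int)
    porn = defaultdict(int)
    for row in link_rows:
        totals[row["keyword"]] += 1
        if row["domain"] in porn_domains:
            porn[row["keyword"]] += 1
    # Strictly greater than the threshold, so an even split is not a majority.
    return {kw: porn[kw] / totals[kw] > threshold for kw in totals}


# Toy inputs (hypothetical): two searches and one labelled domain.
toy_rows = [
    {"keyword": "kw a", "domain": "labelled.example"},
    {"keyword": "kw a", "domain": "labelled.example"},
    {"keyword": "kw a", "domain": "other.example"},
    {"keyword": "kw b", "domain": "labelled.example"},
    {"keyword": "kw b", "domain": "other.example"},
]
flags = majority_porn_keywords(toy_rows, porn_domains={"labelled.example"})
share_flagged = sum(flags.values()) / len(flags)
print(flags, share_flagged)
```

Because only labelled domains are counted, a search whose links mostly fall outside the labelled list is never flagged under this sketch.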