{ "cells": [ { "cell_type": "raw", "id": "935c1c01", "metadata": {}, "source": [ "---\n", "title: \"Browser Automation\"\n", "pagetitle: \"Browser Automation\"\n", "description-meta: \"Introduction, case studies, and exercises for automating browsers.\"\n", "description-title: \"Introduction, case studies, and exercises for automating browsers.\"\n", "author: \"Piotr Sapiezynski and Leon Yin\"\n", "author-meta: \"Piotr Sapiezynski and Leon Yin\"\n", "date: \"06-11-2023\"\n", "date-modified: \"06-17-2023\"\n", "execute: \n", " enabled: false\n", "keywords: data collection, web scraping, browser automation, algorithm audits, personalization\n", "twitter-card:\n", " title: Browser Automation\n", " description: Introduction, case studies, and exercises for automating browsers.\n", " image: assets/inspect-element-logo.jpg\n", "open-graph:\n", " title: Browser Automation\n", " description: Introduction, case studies, and exercises for automating browsers.\n", " locale: en_US\n", " site-name: Inspect Element\n", " image: assets/inspect-element-logo.jpg\n", "href: browser_automation\n", "---" ] }, { "cell_type": "code", "execution_count": 40, "id": "0b1cac78", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "📖 Read online\n", "⚙️ GitHub\n", "🏛 Citation\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#| echo: false\n", "from utils import build_buttons\n", "from importlib import reload\n", "import utils\n", "reload(utils)\n", "utils.build_buttons(link= 'browser_automation', \n", " github= 'https://github.com/yinleon/inspect-element/blob/main/browser_automation.ipynb',\n", " colab = False,\n", " citation= True)" ] }, { "cell_type": "markdown", "id": "f47810f1", "metadata": {}, "source": [ "Browser automation is a fundamental web scraping technique for building your own dataset.\n", "\n", "It is essential for investigating personalization, working with rendered elements, and waiting for scripts and code to execute on a web page.\n", "\n", "However, browser automation can be resource intensive and slow compared to other data collection approaches.\n", "\n", "👉[Click here to jump to the Selenium tutorial](#tutorial)." ] }, { "cell_type": "markdown", "id": "3ba5c004", "metadata": {}, "source": [ "# Intro\n", "\n", "If you’ve tried to buy concert tickets to a popular act lately, you’ve probably watched in horror as the blue “available” seats evaporate before your eyes the instant tickets are released. Part of that may be pure ✨star power✨, but more than likely, bots were programmed to buy tickets to be resold at a premium.\n", "\n", "These bots are programmed to act like an eager fan: waiting in the queue, selecting a seat, and paying for the show. These tasks can all be executed using browser automation.\n", "\n", "**Browser automation** is used to programmatically interact with web applications. \n", "\n", "The most frequent use case for browser automation is to run tests on websites by simulating user behavior (mouse clicks, scrolling, and filling out forms). 
This is routine and invisible work that you wouldn’t remember, unlike seeing your dream of crowd surfing with your favorite musician disappear thanks to ticket-buying bots.\n", "\n", "But browser automation has another use, one which _may_ make your dreams come true: web scraping.\n", "\n", "Browser automation isn’t always the best solution for building a dataset, but it is necessary when you need to:\n", "\n", "1. **Analyze rendered HTML**: see what's on a website as a user would.\n", "2. **Simulate user behavior**: experiment with personalization and experience a website as a user would.\n", "3. **Trigger event execution**: retrieve responses to JavaScript or [network requests](/apis.html) following an action.\n", "\n", "These reasons are often interrelated. We will walk through case studies (below) that each highlight at least one of these strengths and explain why browser automation was a necessary choice.\n", "\n", "Some popular browser automation tools are [Puppeteer](https://pptr.dev/), [Playwright](https://playwright.dev/), and [Selenium](https://www.selenium.dev/documentation/webdriver/elements/). \n", "\n", "## Headless Browsing\n", "\n", "Browser automation can be executed in a \"headless\" state by some tools.\n", "\n", "This doesn't mean that the browser is a ghost or anything like that; it just means that the _user interface_ is not visible.\n", "\n", "One benefit of headless browsing is that it is less [resource intensive](/apis.html#case-study-on-scalability-collecting-internet-plans). However, there is no visibility into what the browser is doing, which makes headless scrapers difficult to debug.\n", "\n", "Luckily, some browser automation tools (such as Selenium) allow you to [toggle headless browsing](https://www.selenium.dev/blog/2023/headless-is-going-away/) on and off. Other tools, such as Puppeteer, run headless by default but can also open a visible browser window.\n", "\n", "If you’re new to browser automation, we suggest not using headless browsing off the bat. 
Instead try Selenium (or Playwright), which is exactly what we’ll do in the [tutorial](#tutorial) below." ] }, { "cell_type": "markdown", "id": "c2c689ea", "metadata": {}, "source": [ "
\n", "
Using Selenium to automate browsing TikTok's \"For You\" page for food videos.
\n", "
" ] }, { "cell_type": "markdown", "id": "8ff579f4", "metadata": {}, "source": [ "# Case Studies\n", "## Case Study 1: Google Search\n", "In the investigation “[Google the Giant](https://themarkup.org/google-the-giant/2020/07/28/google-search-results-prioritize-google-products-over-competitors),” The Markup wanted to measure how much of a Google Search page is “Google.” Aside from the daunting task of classifying what is \"Google,\" and what is \"not Google,\" the team of two investigative journalists-- Adrianne Jeffries and Leon Yin (a co-author of this section) needed to measure real estate on a web page.\n", "\n", "The team developed a [targeted staining technique](https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool) inspired by the life sciences, originally used to highlight the presence of chemicals, compounds, or cancers. \n", "\n", "
\n", "\"https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool#google-search-flow\"\n", "
\n", "Source: The Markup\n", "
\n", "
\n", "\n", "The reporters wrote over [68 web parsers](https://github.com/the-markup/investigation-google-search-audit/blob/master/utils/parsers.py) to classify elements on trending Google Search results as \"Google\" or one of three other categories. Once an element was identified, they could find the [coordinates](https://developer.mozilla.org/en-US/docs/Web/SVG/Element/rect) of each element along with its corresponding bounding box. Using the categorization and bounding box, The Markup was able to measure how many pixels were allocated to Google properties, as well as how far down the page they were placed on a mobile phone.\n", "\n", "
\n", "\"https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool#google-search-flow\"\n", "
\n", "Source: The Markup\n", "
\n", "
\n", "\n", "Browser automation tools' ability to collect and analyze **rendered HTML pages** can be essential. This is especially the case for search results, since most search results contain modules, carousels, and other non-standardized rows and columns that are more complex than lists.\n", "\n", "Rendered HTML can be used to analyze the allocation of real estate on a website, which can be a useful metric to gauge self-preferencing and [anti-competitive business practices](https://themarkup.org/amazons-advantage/2021/10/14/amazon-puts-its-own-brands-first-above-better-rated-products) relevant to [antitrust](https://themarkup.org/google-the-giant/2020/07/29/congressman-says-the-markup-investigation-proves-google-has-created-a-walled-garden). Take for example this case study, which was placed above the others because one of this section's co-authors happened to work on it." ] }, { "cell_type": "markdown", "id": "2292b672", "metadata": {}, "source": [ "## Case Study 2: Deanonymizing Google's Ad Network\n", "\n", "Google ad sellers offer space on websites like virtual billboards, and are compensated by Google after an ad is shown. However, unlike physical ad sellers, almost all of the ~1.3 million ad sellers on Google are anonymous. To limit transparency further, multiple websites and apps can be monetized by the same seller, and it’s not clear which websites are part of Google’s ad network in the first place. \n", "\n", "As a result, [advertisers](https://checkmyads.org/branded/google-ads-has-become-a-massive-dark-money-operation/) and the public do not know who is making money from Google ads. Fortunately, watchdog groups, industry analysts, and reporters have developed methods to hold Google accountable for this oversight.\n", "\n", "The methods boil down to triggering a JavaScript function that sends a request to Google to show an ad on a loaded web page. 
Importantly, the request reveals the seller ID used to monetize the website displaying the ad, and in doing so, links the seller ID to the website.\n", "\n", "In 2022, reporters from ProPublica used Playwright to [automate this process](https://www.propublica.org/article/google-display-ads-piracy-porn-fraud), visiting 7 million websites and deanonymizing over 900,000 Google ad sellers. Their investigation found some websites were able to monetize advertisements, despite breaking Google’s policies.\n", "\n", "ProPublica's investigation used browser automation tools to **trigger event execution** and successfully load ads. Often, this required waiting for a page to fully render, scrolling down to potential ad space, and browsing multiple pages. The reporters used a combination of network requests, rendered HTML, and cross-referencing screenshots to confirm that each website monetized ads from Google’s ad network.\n", "\n", "Browser automation can help you trawl for clues, especially when it comes to looking for specific network requests sent to a central player by many different websites." ] }, { "cell_type": "markdown", "id": "b006db29", "metadata": {}, "source": [ "## Case Study 3: TikTok Personalization\n", "An investigation conducted by the Wall Street Journal, \"[Inside TikTok's Algorithm](https://www.wsj.com/articles/tiktok-algorithm-video-investigation-11626877477)\" found that even when a user does not like, share, or follow any creators, TikTok still personalizes their \"For You\" page based on how long they watch the recommended videos.\n", "\n", "In particular, the WSJ investigation found that users who watch content related to depression and skip other content are soon presented with mental health content and little else. Importantly, this effect happened even when the users did not explicitly like or share any videos, nor did they follow any creators. 
\n", "\n", "You can watch the WSJ's video showing how they mimic user behavior to study the effects of personalization:" ] }, { "cell_type": "markdown", "id": "fd6cc6d9", "metadata": { "tags": [] }, "source": [ "
\n", "
Source: WSJ
\n", "
\n" ] }, { "cell_type": "markdown", "id": "43ff2b93", "metadata": {}, "source": [ "This investigation was possible only after **simulating user behavior** and triggering personalization from TikTok's \"For You\" recommendations." ] }, { "cell_type": "markdown", "id": "b95c954a", "metadata": {}, "source": [ "# Tutorial\n", "In the hands-on tutorial we will attempt to study personalization on TikTok with a mock experiment. \n", "\n", "We’re going to teach you the basics of browser automation in Selenium, but the techniques we'll discuss could be used to study any other website using any other automation tool.\n", "\n", "We will try to replicate elements of the WSJ investigation and see if we can trigger a personalized \"For You\" page. Although the WSJ ran their investigation using an Android on a Raspberry Pi, we will try our luck with something you can run locally on a personal computer using browser automation.\n", "\n", "In this tutorial we'll use Selenium to watch TikTok videos where the description mentions keywords of our choosing, while skipping all others. 
In doing so, you will learn practical skills such as:\n", "\n", "* Setting up the automated browser in Python\n", "* Hiding signs that are easy tells of an automated browser\n", "* Finding particular elements on the screen, extracting their content, and interacting with them\n", "* Scrolling\n", "* Taking screenshots\n", "\n", "Importantly, we’ll be watching videos with lighter topics than depression (the example chosen in the WSJ investigation).\n", "\n", "::: {.callout-tip}\n", "#### Pro tip: Minimizing harms\n", "When developing an audit or investigation, start with low-stakes themes, both to minimize your exposure to harmful content and to avoid boosting its popularity unnecessarily.\n", ":::" ] }, { "cell_type": "markdown", "id": "922cb96e", "metadata": {}, "source": [ "## Step 1: Setting up the browser\n", "Our setup will consist of a real browser and an interface that will allow us to control that browser using Python. We chose Google Chrome because it's the most popular browser and easy enough (famous last words) to set up.\n", "\n", "### 1.1 Installing Google Chrome\n", "Please download the most recent version [here](https://www.google.com/chrome/).\n", "\n", "If you already have Google Chrome installed, make sure it's the latest version by opening Chrome and pasting this address in the address bar: [chrome://settings/help](chrome://settings/help). Now verify that there are no pending updates.\n", "\n", "![](assets/browser1_01_version1.png \"Google Chrome window showing the current version\")\n", "\n", "### 1.2 Installing the webdriver\n", "The `webdriver` is our interface between Python and the browser. It is specific to the browser (there are different webdrivers for Firefox [called geckodriver], Safari, etc.) and even to the particular version of the browser. It's easier to ensure we are working with the correct version by installing a webdriver that automatically detects the current version of Chrome. 
\n", "\n", "Run the code in the cell below to download the Python package [`chromedriver-binary-auto`](https://pypi.org/project/chromedriver-binary-auto/). Adding an exclamation mark before code in Jupyter notebook allows you to run commands as if you were in your computer terminal's [command line](https://www.computerhope.com/jargon/c/commandi.htm)" ] }, { "cell_type": "code", "execution_count": 3, "id": "90f9e0f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting chromedriver-binary-auto\n", " Downloading chromedriver-binary-auto-0.2.6.tar.gz (5.2 kB)\n", " Preparing metadata (setup.py) ... \u001b[?25ldone\n", "\u001b[?25hBuilding wheels for collected packages: chromedriver-binary-auto\n", " Building wheel for chromedriver-binary-auto (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for chromedriver-binary-auto: filename=chromedriver_binary_auto-0.2.6-py3-none-any.whl size=8652851 sha256=1ccd18edd04cf5e1c63e0305676dc1c9c0c0532c8dc09842f6cf963a910e4f04\n", " Stored in directory: /Users/leon/Library/Caches/pip/wheels/2a/4e/a6/e342ab457a4cd1642a94bbc8f132e56e90a7a320d08d6bfeb2\n", "Successfully built chromedriver-binary-auto\n", "Installing collected packages: chromedriver-binary-auto\n", "Successfully installed chromedriver-binary-auto-0.2.6\n" ] } ], "source": [ "!pip install chromedriver-binary-auto" ] }, { "cell_type": "markdown", "id": "fedbef26", "metadata": {}, "source": [ "Let's see if the installation worked correctly! 
Run the cell below to import the correct webdriver and open a new Chrome window.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "4b2cb889", "metadata": {}, "outputs": [], "source": [ "from selenium import webdriver\n", "import chromedriver_binary # adds the chromedriver binary to the path\n", "\n", "driver = webdriver.Chrome()" ] }, { "cell_type": "markdown", "id": "6b9f465d", "metadata": {}, "source": [ "The `chromedriver-binary-auto` package should have installed a driver that's suitable for your current Chrome version, and running the code above should have opened a new Chrome window.\n", "\n", "This step is notoriously hard, and you might get a version mismatch error:\n", "\n", "```\n", "SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 112\n", "Current browser version is 113 with binary path /Applications/Google Chrome.app/Contents/MacOS/Google Chrome\n", "```\n", "It means that you probably updated your Chrome in the meantime. 
To fix it, reinstall the Python package:" ] }, { "cell_type": "code", "execution_count": 4, "id": "1ca9f52a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting chromedriver-binary-auto\n", " Using cached chromedriver_binary_auto-0.2.6-py3-none-any.whl\n", "Installing collected packages: chromedriver-binary-auto\n", " Attempting uninstall: chromedriver-binary-auto\n", " Found existing installation: chromedriver-binary-auto 0.2.6\n", " Uninstalling chromedriver-binary-auto-0.2.6:\n", " Successfully uninstalled chromedriver-binary-auto-0.2.6\n", "Successfully installed chromedriver-binary-auto-0.2.6\n" ] } ], "source": [ "!pip install --upgrade --force-reinstall chromedriver-binary-auto" ] }, { "cell_type": "markdown", "id": "c4301811", "metadata": {}, "source": [ "If everything works fine and you have the window open, our setup is complete and you can now close the Chrome window:" ] }, { "cell_type": "code", "execution_count": null, "id": "d55631d7", "metadata": {}, "outputs": [], "source": [ "driver.close()" ] }, { "cell_type": "markdown", "id": "2dc34332", "metadata": {}, "source": [ "## Step 2: Hiding typical tells of an automated browser\n", "When you open Chrome with Selenium you'll notice that the window displays a warning about being an \"automated session\". \n", "Even though the warning is only displayed to you, the webdriver leaves behind other red flags that inform website administrators that you are using browser automation.\n", "\n", "The website admins will use these red flags to refuse service to your browser.\n", "\n", "Let's remove those." 
] }, { "cell_type": "code", "execution_count": 8, "id": "6b3f3303", "metadata": {}, "outputs": [], "source": [ "options = webdriver.ChromeOptions()\n", "options.add_argument(\"start-maximized\")\n", "\n", "# remove all signs of this being an automated browser\n", "options.add_argument('--disable-blink-features=AutomationControlled')\n", "options.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\n", "options.add_experimental_option('useAutomationExtension', False)\n", "\n", "# open the browser with the new options\n", "driver = webdriver.Chrome(options=options)\n", "driver.get('https://tiktok.com/foryou')" ] }, { "cell_type": "markdown", "id": "6af332c5", "metadata": {}, "source": [ "This should open a new window without those warnings and navigate to tiktok.com:\n", "\n", "![](assets/browser1_02_tiktok1.png \"tiktok main page\")\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "192072ed", "metadata": {}, "source": [ "## Step 3: Finding elements on page and interacting with them\n", "\n", "We will perform our mock experiment without logging in (but we will also learn how to create multiple accounts and how to log in later).\n", "\n", "Instead of logging in, our first interaction will be dismissing this login window. Doing this programmatically has two steps:\n", "\n", "1. We need to identify that \\[X\\] button in the page source \n", "2. And then click it\n", "\n", "Let's inspect the button element:\n", "![](assets/browser1_03_dismiss1.png \"Inspecting the Dismiss button\")\n", "\n", "In my case, the particular element that the Developer Tools navigated to is just the graphic on the button, not the button itself, but you can still find the actual button by hovering your mouse over different elements in the source and seeing what elements on page are highlighted:\n", "\n", "![](assets/browser1_04_inspect1.png \"Inspecting the Dismiss button\")\n", "\n", "Our close button is a `
` element, whose `data-e2e` attribute is `\"modal-close-inner-button\"`. \n", "\n", "There are many ways to fish for the exact element you want, and [many of those methods](https://www.selenium.dev/documentation/webdriver/elements/locators/) are built into Selenium. One way to find it would be using a `CSS_SELECTOR`, like so:" ] }, { "cell_type": "code", "execution_count": 4, "id": "16502655", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from selenium.webdriver.common.by import By\n", "\n", "close_button = driver.find_element(By.CSS_SELECTOR, '[data-e2e=\"modal-close-inner-button\"]')\n", "close_button" ] }, { "cell_type": "markdown", "id": "022e501b", "metadata": {}, "source": [ "If Selenium successfully finds an element, you'll get a `WebElement` object for the first match. However, if Selenium **does not** find the element (for example, because the element hasn't loaded yet), `find_element()` raises a `NoSuchElementException`. Left unhandled, this will crash your script. 
\n", "\n", "One thing you can do is to tell Selenium to wait up to `X_seconds` for that particular element before trying to click on it, like this:" ] }, { "cell_type": "code", "execution_count": 9, "id": "1f26712b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "# let's wait up to 20 seconds\n", "X_seconds = 20\n", "wait = WebDriverWait(driver, timeout=X_seconds)\n", "wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '[data-e2e=\"modal-close-inner-button\"]')))\n", "\n", "# this line only runs once the element is visible; if it never shows up,\n", "# wait.until() raises a TimeoutException after 20 seconds\n", "close_button = driver.find_element(By.CSS_SELECTOR, '[data-e2e=\"modal-close-inner-button\"]')\n", "close_button" ] }, { "cell_type": "markdown", "id": "2922df18", "metadata": {}, "source": [ "We seem to have found something, so let's click it! `WebElement`s come equipped with methods you can use to [interact](https://www.selenium.dev/documentation/webdriver/elements/interactions/) with them:" ] }, { "cell_type": "code", "execution_count": 10, "id": "702b1b21", "metadata": {}, "outputs": [], "source": [ "close_button.click()" ] }, { "cell_type": "markdown", "id": "0a490ca4", "metadata": {}, "source": [ "Did you notice a change on the page? Congratulations! You just automated the browser to click something." ] }, { "cell_type": "markdown", "id": "d8df81f3", "metadata": {}, "source": [ "## Step 4: Scrolling\n", "\n", "We now have a browser instance open and displaying the For You page. Let's scroll through the videos.\n", "\n", "If you are a *real person* who (for whatever reason) visits TikTok on their computer, you could press the down arrow key on the keyboard to see new videos. 
We will do that programmatically instead:" ] }, { "cell_type": "code", "execution_count": 11, "id": "abe85668", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver.common.action_chains import ActionChains\n", "from selenium.webdriver.common.keys import Keys\n", "\n", "actions = ActionChains(driver)\n", "actions.send_keys(Keys.ARROW_DOWN)\n", "actions.perform()" ] }, { "cell_type": "markdown", "id": "72a9b726", "metadata": {}, "source": [ "When you run the cell above you will see that your browser scrolls down to the next video. You just automated scrolling!" ] }, { "cell_type": "markdown", "id": "c5988f8b", "metadata": {}, "source": [ "## Step 5: Finding TikTok videos on the page\n", "\n", "Now that the site loaded and you can browse it, let's find all the TikTok videos that are displayed and extract the information (called metadata) from each of them.\n", "\n", "1. Right click on the white space around a TikTok video and choose \"Inspect\".\n", "![Inspect Element](assets/browser1_05_inspect_tiktok_a1.png)\n", "1. Hover your mouse over the surrounding `
` elements and observe the highlighted elements on the page to see which ones correspond to each TikTok video.\n", "![Inspect Element](assets/browser1_05_inspect_tiktok_b1.png)\n", "1. You will see that each video is in a separate `
` container but each of these containers has the same `data-e2e` attribute with the value of `recommend-list-item-container`.\n", "1. Similarly to how we found the close button, we can now use this to find all videos on page:" ] }, { "cell_type": "code", "execution_count": 12, "id": "f6e19d63", "metadata": {}, "outputs": [], "source": [ "videos = driver.find_elements(By.CSS_SELECTOR, '[data-e2e=\"recommend-list-item-container\"]')" ] }, { "cell_type": "markdown", "id": "b3c9177e", "metadata": {}, "source": [ "When we searched for the \"dismiss\" button we used the `driver.find_element()` function because we were only interested in the first element that matched our CSS selector.\n", "\n", "Now we're trying to find all videos on page, so we use the `driver.find_elements()` function instead - it returns the complete list of elements that match the selector." ] }, { "cell_type": "code", "execution_count": 13, "id": "b48701be", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "videos" ] }, { "cell_type": "markdown", "id": "3931f75c", "metadata": {}, "source": [ "## Step 6: Parsing TikTok metadata\n", "Now that we found all the TikTok videos on the page, let's extract the description from each - this is how we will decide whether to watch the video, or to skip it. The process of extracting a specific field from a webpage is \"parsing\".\n", "\n", "1. Pick any description, right click, \"Inspect\". \n", "1. Let's locate the `
` that contains the whole description (including any hashtags) and make note of its `data-e2e` attribute.\n", "1. Now let's write the code that extracts the description from a single video (note that you can get the text content of any element by calling `element.text`)" ] }, { "cell_type": "code", "execution_count": 14, "id": "9a0b7e79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The last one 😂😂 #pet #cat #dog #cute #animals #funny #foryou #fyp\n", "الرد على @hadeelalsamare #اكسبلور #fyp #fypシ\n", "BEST MAGIC TRICKS REVEALED 😱😳 #magician #learnfromme #foru #popular\n", "The most Useful Toy ever! 2 😂 #fun #play #fyp\n", "Iphone 13 pro max #repair #tamarshabi🥰 תיקון\n", "Herb-Crusted Rack of Lamb 😍 #lamb #easyrecipe #easyrecipes #asmrfood #foodtok #cooktok #dinnerwithme #homecook #homecooking #dinnerideas #dinnerparty\n", "#fyp #halsey #geazy #scandal\n", "شو رأيكم كان فيها تكفي اللقمة اللي بتمها؟ 😐#hasanandhawraa #ramdan2023 #رمضان_يجمعنا #رمضان\n" ] } ], "source": [ "for video in videos:\n", "    print(video.find_element(By.CSS_SELECTOR, '[data-e2e=\"video-desc\"]').text)" ] }, { "cell_type": "markdown", "id": "0e09cc34", "metadata": {}, "source": [ "::: {.callout-note}\n", "Note: We previously searched for elements using `driver.find_element()` and `driver.find_elements()`. That allowed us to search the whole page. Notice that here, instead of `driver`, we're using a particular element which we called `video`: this way we can search for elements **within an element**, rather than on the whole page.\n", ":::" ] }, { "cell_type": "markdown", "id": "7a494899", "metadata": {}, "source": [ "## Step 7: Finding the TikTok video that's currently playing\n", "We know how to scroll to the next video, and we know how to find all videos that are loaded.\n", "At this point we could either:\n", "\n", "1. Assume that at the beginning, the 0th video is playing, and then every time we press arrow down, the next video is being displayed
\n", "2. Assume that the arrow down does not always work and each time verify which video is actually playing\n", "\n", "The problem with the first approach is that even if scrolling fails just once, our experiment will be compromised (from that point on, we would be watching and skipping different videos than our script thinks). This is why we will go with the second approach and verify which video is actually playing. Back to our favorite tool: inspect element!\n", "\n", "When you right click on the playing video, you will see that instead of our familiar UI we get a custom TikTok menu, so that won't work. Try right-clicking on the description of the video instead, then hovering over different elements in the inspector and expanding the one that highlights the video in the browser. Dig deep until you get to the `div` that only contains the video. \n", "\n", "Still in the inspector, try looking at the video below. You will see that the `div` that contains the video is missing and there is no element with the tag name `video`. That's how we can tell whether a video is currently playing: its `div` will contain a `video` element that we can find by `TAG_NAME`:" ] }, { "cell_type": "code", "execution_count": 15, "id": "9d1e3cd6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "playing \n", "not playing The last one 😂😂 #pet #cat #dog #cute #animals #funny #foryou #fyp\n", "not playing الرد على @hadeelalsamare #اكسبلور #fyp #fypシ\n", "not playing BEST MAGIC TRICKS REVEALED 😱😳 #magician #learnfromme #foru #popular\n", "not playing The most Useful Toy ever! 
2 😂 #fun #play #fyp\n", "not playing Iphone 13 pro max #repair #tamarshabi🥰 תיקון\n", "not playing Herb-Crusted Rack of Lamb 😍 #lamb #easyrecipe #easyrecipes #asmrfood #foodtok #cooktok #dinnerwithme #homecook #homecooking #dinnerideas #dinnerparty\n", "not playing #fyp #halsey #geazy #scandal\n", "not playing شو رأيكم كان فيها تكفي اللقمة اللي بتمها؟ 😐#hasanandhawraa #ramdan2023 #رمضان_يجمعنا #رمضان\n" ] } ], "source": [ "for video in videos:\n", "    description = video.find_element(By.CSS_SELECTOR, '[data-e2e=\"video-desc\"]').text\n", "    if video.find_elements(By.TAG_NAME, 'video'):\n", "        playing = 'playing'\n", "    else:\n", "        playing = 'not playing'\n", "    print(playing, description)" ] }, { "cell_type": "markdown", "id": "ccf69708", "metadata": {}, "source": [ "## Step 8: Taking screenshots and saving page source\n", "Your results may be more compelling when they are accompanied by screenshots rather than just data. Selenium allows you to take screenshots of the whole browser window or of just a particular element (though the latter is a bit cumbersome):" ] }, { "cell_type": "code", "execution_count": 22, "id": "07951510", "metadata": {}, "outputs": [], "source": [ "# take a screenshot of the whole browser\n", "driver.save_screenshot('full_screenshot.png')\n", "\n", "# take a screenshot of just one video\n", "screenshot = video.screenshot_as_png\n", "with open('element_screenshot.png', 'wb') as output:\n", "    output.write(screenshot)" ] }, { "cell_type": "markdown", "id": "361d31df", "metadata": {}, "source": [ "In the spirit of _bringing receipts_, you can also save the entire webpage to parse it later." 
] }, { "cell_type": "code", "execution_count": 25, "id": "dc37d600", "metadata": {}, "outputs": [], "source": [ "# save the source of the entire page\n", "# (explicit UTF-8 so emoji and non-Latin text are written correctly on any OS)\n", "page_html = driver.page_source\n", "with open('webpage.html', 'w', encoding='utf-8') as output:\n", "    output.write(page_html)" ] }, { "cell_type": "markdown", "id": "990753f7", "metadata": {}, "source": [ "::: {.callout-tip}\n", "#### Pro tip: Keep these records to sanity check your results\n", "Taking a screenshot and saving the page source are useful practices for checking your work. Use the two to cross-reference what was visible in the browser with whatever data you end up extracting during the parsing step.\n", ":::\n", "\n", "Let's close the browser for now, and kick this workflow up a notch." ] }, { "cell_type": "code", "execution_count": 16, "id": "bd1f4675", "metadata": {}, "outputs": [], "source": [ "driver.close()" ] }, { "cell_type": "markdown", "id": "6bbfaaa2", "metadata": {}, "source": [ "## Step 9: Putting it all together\n", "At this point, we can read the description of TikTok videos and navigate the \"For You\" page. \n", "\n", "That's most of the setup we need to try our mock experiment:
\n", "let's watch all TikTok videos that mention food in the description and skip videos that do not mention food.\n", "\n", "After one hundred videos, we will see whether we are served videos from FoodTok more frequently than other topics.\n", "\n", "::: {.callout-tip}\n", "#### Pro tip: Use functions!\n", "So far we wrote code to open the browser, close the dialog, and find videos as separate cells in the notebook. We _could_ copy that code over here to use it, but it will be much easier to understand and maintain the code if we write clean, well-documented functions with descriptive names.\n", ":::" ] }, { "cell_type": "code", "execution_count": 17, "id": "9405d23e", "metadata": {}, "outputs": [], "source": [ "from selenium import webdriver\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.common.action_chains import ActionChains\n", "from selenium.webdriver.common.keys import Keys\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "import chromedriver_binary\n", "\n", "\n", "\n", "def open_browser():\n", "    \"\"\"\n", "    Opens a new automated browser window with the telltale signs of automation disabled\n", "    \"\"\"\n", "    options = webdriver.ChromeOptions()\n", "    options.add_argument(\"start-maximized\")\n", "\n", "    # remove all signs of this being an automated browser\n", "    options.add_argument('--disable-blink-features=AutomationControlled')\n", "    options.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\n", "    options.add_experimental_option('useAutomationExtension', False)\n", "\n", "    # open the browser with the new options\n", "    driver = webdriver.Chrome(options=options)\n", "    return driver\n", "\n", "def close_login_dialog(driver):\n", "    \"\"\"\n", "    Waits for the login dialog to appear, then closes it\n", "    \"\"\"\n", "\n", "    # rather than trying to click a button that might not have loaded yet, we will\n", "    # wait 
up to 20 seconds for it to actually appear first\n", "    wait = WebDriverWait(driver, timeout=20)\n", "    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '[data-e2e=\"modal-close-inner-button\"]')))\n", "\n", "    close_button = driver.find_element(By.CSS_SELECTOR, '[data-e2e=\"modal-close-inner-button\"]')\n", "    if close_button:\n", "        close_button.click()\n", "\n", "def arrow_down(driver):\n", "    \"\"\"\n", "    Sends the ARROW_DOWN key to a webdriver instance.\n", "    \"\"\"\n", "    actions = ActionChains(driver)\n", "    actions.send_keys(Keys.ARROW_DOWN)\n", "    actions.perform()\n", "\n", "def find_videos(driver):\n", "    \"\"\"\n", "    Finds all tiktoks loaded in the browser\n", "    \"\"\"\n", "    videos = driver.find_elements(By.CSS_SELECTOR, '[data-e2e=\"recommend-list-item-container\"]')\n", "    return videos\n", "\n", "def get_description(video):\n", "    \"\"\"\n", "    Extracts the video description along with any hashtags\n", "    \"\"\"\n", "    try:\n", "        description = video.find_element(By.CSS_SELECTOR, '[data-e2e=\"video-desc\"]').text\n", "    except Exception:\n", "        # if the description is missing, just get any text from the video\n", "        description = video.text\n", "    return description\n", "\n", "def get_current(videos):\n", "    \"\"\"\n", "    Given the list of videos, returns the one that's currently playing\n", "    \"\"\"\n", "    for video in videos:\n", "        if video.find_elements(By.TAG_NAME, 'video'):\n", "            # this one has the video, we can return it and that ends the function.\n", "            return video\n", "\n", "    return None\n", "\n", "def is_target_video(description, keywords):\n", "    \"\"\"\n", "    Looks for keywords in the given description. 
\n", "    NOTE: only looks for the substring, i.e., a partial match is enough.\n", "    Returns `True` if there are any or `False` when there are none.\n", "    \"\"\"\n", "    # check if any of the keywords is in the description\n", "    for keyword in keywords:\n", "        if keyword in description:\n", "            # we have a video of interest, let's watch it\n", "            return True\n", "\n", "    # if we're still here it means no keywords were found\n", "    return False\n", "\n", "def screenshot(video, filename=\"screenshot.png\"):\n", "    \"\"\"\n", "    Saves a screenshot of a given video to a specified file\n", "    \"\"\"\n", "    image = video.screenshot_as_png\n", "    with open(filename, 'wb') as output:\n", "        output.write(image)\n", "\n", "def save_source(driver, filename=\"webpage.html\"):\n", "    \"\"\"\n", "    Saves the browser HTML to a specified file\n", "    \"\"\"\n", "    page_html = driver.page_source\n", "    with open(filename, 'w', encoding='utf-8') as output:\n", "        output.write(page_html)" ] }, { "cell_type": "markdown", "id": "b560caef", "metadata": {}, "source": [ "Ok, with that out of the way, let's set up our first data collection.\n", "\n", "First, let's make a directory to save screenshots. We will save screenshots here whenever we find a video related to food." 
] }, { "cell_type": "code", "execution_count": 18, "id": "89e12185", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.makedirs('data/screenshots/', exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 22, "id": "0fdff08c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 False ДО КОНЦА😂 а какой у тебя рост?\n", "1 False • Reprodução: (SBT/Programa Raul Gil) 🇧🇷\n", "#combateaosuicidio\n", "2 False #stitch #이어찍기 #추천 #fyp #viral #xyzbca #korean #おすすめ\n", "3 True Cuando hago papas de esta manera, todos me preguntan por la receta😋😱#viral #parati #recetas #cocina #recetasfaciles #papa #queso #jamon #food #saborestiktok\n", "4 False #ومش_هزود_في_الملام #explore\n", "#fypシ #foryoupage #fyp #viral\n", "#مش_هنظبط_الريتش_بقي🖤 #حزين\n", "#حالات_واتس_حزينه💔 #foryou\n", "5 False #PasiondeGavilanes #telenovelacolombiana\n", "6 False #accident a veces pasa de todo 👉 sigueme para PARTE 2.\n", "7 False Zjedzcie se tez cos fajnego dzis #gotowaniezdominika\n", "8 False كيف تكتب اسم يوسف بخط جميل♥️🌹-\n", "-\n", "-\n", "-\n", "9 False بنت الجنوب 🔥🤍🇹🇳#مطماطة_قابس_تونس #اكسبلور\n", "10 False Game on\n", "11 False Чи бачите різницю між фото? Чи бачите які кадри зроблені на дорогу , а які на дешеву камеру? ☺️ #фотограф #фотоапарат #обзор #фотографія\n", "12 False #bendiciones #mideseo #TikTok #viral #\n", "13 False The most Useful Toy ever! 
2 😂 #fun #play #fyp\n", "14 False Replying to @user4034722293618\n", "15 False jajeczniczka z kielbasiana\n", "16 False كام مره بكيت ؟ 🥺💔🎧 #المصمم_sheko🎧 #الرتش_فى_زمه_الله💔 #حالات_واتس #شاشه_سوداء #مصمم_حالات_واتس_شاشه_سوداء #fypシ #foryou #fyp #viral #music #tiktok\n", "17 False #movie #movieclip #fyp\n", "18 False Я ПРОТИВ КУРЕНИЯ, А ВЫ?\n", "19 False Uno de nuestros trends favoritos 😍🍭 @SHICKO 💊 @N E N A 🍓\n", "20 False Esse final me quebrou…🥺💛\n", "\n", "🎥Filme: Extraordinário\n", "\n", "#disciplina #motivacional #trechosvisionarios #extraordinar\n", "21 False Parece que o Vin Diesel curtiu “Vai Sentando” 😅\n", "22 False Para mi mama una niña valiente ♥️💕🇺🇸#parati #parati #parati #parati #parati #fyp #fyp #viral #viral #viral #viral #viral #viral #vistas #vistas #vistas ##vistas #vistas #muyviral @TikTok\n", "23 False #drawing #viralvideo🔥 #fypシ゚viral\n", "24 False شو رأيكم كان فيها تكفي اللقمة اللي بتمها؟ 😐#hasanandhawraa #ramdan2023 #رمضان_يجمعنا #رمضان\n", "25 False Brock is always there to save the day 🦈💪🏼 #wwe #wrestling #wrestlingmemes #brocklesnar #wweisfake #fakesituation #sharkattack #sharks #wwe2023 #nextgen #wwenetwork #smackdown #wwefan #bodybuilding #beach #holiday #pool #sea\n", "26 False HONEY, I SEE YOU #foryou #omspxd #music #mashonda #fyp #lyrics #speed #spedup #🎧\n", "27 False #fyp #mrbeast #foryou wow 😳😲\n", "28 False \n", "29 False Sometimes its better to save your breathe. #trauma #traumahealing #awakining #love #relationship #relatable #loveyourself #men #women #healing #problems #girltalk #therapy #couple #mom #fyp #fypシ #emotion\n", "30 False I love my body 🥰💜.. Dc: @Dance God 🦅🇬🇭 #purplespeedy\n", "31 False raye mi camioneta por ustedes jajajajajajaja\n", "32 False \n", "33 False My new car #catsoftiktok #fyp #fypシ\n", "34 False \n", "35 False Fiz um almoço insano! 
@Mateus ASMR\n", "36 False #bajonesemocionales #🥀💔\n", "37 False Don't mess with Cristiano 😤|| #cristianoronaldo #cr7 #mufc #intermilan #manutd #viral #ucl #tiktoktrending\n", "38 False This Small Town Farmer Better Buckle Up! - End #dealornodeal #show #fyp #deal\n", "39 False Genius, billionaire, playboy, philanthropist... and a great dancer🕺#downey #rdj #robertdowneyjr #ironman #tonystark #unrealdowneyjr #unrealrobertdowneyjr\n", "40 False Celebre as suas vitórias, amiga! 😍 Fazer 1% todos os dias vai te levar a lugares que você nem imagina.\n", "Eu treino com a @Queima Diária 🔥 desde novembro de 2022 e fico muito feliz com esses resultados. Quem vem nessa comigo? Clica no link da bio ou nos stories e experimente por 30 dias!\n", "41 False اكتب شيء تؤجر عليه ✨🤍 #fyp #قران #عبدالرحمن_مسعد\n", "42 False I ❤️ Michael Jordan 🏀 #mercuri_88 #funny #littlebrother #tiktok #mom #CapCut #basketball #nba #jordan\n", "43 False Estavam com saudade? Nao me deixa sem graça nao caraaaa kkkkkk\n", "44 False يعني ارسمها علشان افرحها ويحصل معايا كدة 🤦‍♂️ #علي_الراوي\n", "45 False Ролик уже на канале💋\n", "46 False What k-drama do you think this is? 
#kdrama #드라마 #seoul #theglory\n", "47 False cat #cat #catsoftiktok #fun #foryou #fyp #viral #funny #😂😂😂 #🤣🤣🤣\n", "48 False #korea #seoul #socialexperiment #fyp\n", "49 False \n", "50 False #fyp #foryou #طيران\n", "51 False الماء والنار… 🥀💔 #lebrany #viral #foryou #explor\n", "52 False #foryou #recovery #homecare #gloves\n", "53 False Салат из одного ингредиента\n", "54 False #blog #vacuna Hoy tocó hacer vacunar a Salchipapu contra la rabia 🥺🐶\n", "55 False Song name: Jegi Jegi\n", "Watch full song on youtube ( Barbud Music )\n", "\n", "#lailakhan #newsong #rejarahish #tiktokpakistan\n", "56 False Putting automatic stickers on manual doors 😂 #rosscreations #prank\n", "57 False Abril 11 parte 7 “Comida Turka”\n", "58 True recipe: @ファビオ飯(イタリア料理人)🇮🇹Fabio #tiktokfood #asmr\n", "59 False Metallic silver epoxy floor🔥 #fyp #epoxyresin #garagegoals #epoxypour #polyasparticfloors #polyaspartic #theepoxypros\n", "60 False Enter the homepage to watch more wonderful videos#movieclips\n", "61 False Respuesta a @RZㅤGOLOSAღ -😅 @Duhsein\n", "62 False Почему «Титаник» до сих пор не подняли со дна океана? 
#титаник\n", "63 False Funny homework!✨✨#asmr #home #goodthing #foryou\n", "64 False 😂😂@도윤 #주전 #fyp\n", "65 False #parati #fyp #foryou #foryoupage #viral #trump #trump2024 #biden #teamtrump #donaldtrump\n", "66 False Não acreditei no resultado🥺🙌🏼\n", "67 False Atât de vrednică sunt… 😂\n", "M-am făcut de negreală pe obraz🤦🏻‍♀️😂 #soferițadecamion🚛😍 #AGLogistics #oriundeîneuropa #truckgirl\n", "68 False Gatinho Resgatado na chuva 🙏🏻 #jesus #jesuscristo #deus #resgateanimal #resgate #gato #gatinho #cat #viraliza\n", "69 False #pegar un video de\n", "@Yohary Michell Rios #maestra #maestros #universidad #universidad #clases #clasesvirtuales #profesora #profesor #fyp #parati #fouryou #fouyoupage #escuela #escuelatiktok #viral #\n", "70 False So cuteee😂\n", "71 False بوظتلهم الدنيا 😂\n", "72 False #pourtoi #foryou #cpl #bracelet #trend\n", "73 False What’s one way He’s held you as you’ve stepped out in faith? 🌊 #UNITED #fyp #christiantiktok #worship #Oceans\n", "74 False Antwort auf @🍇Wallah Krise🍇 I am going out tonight 💚 #bumpyride\n", "75 False #ليلياناا_نحن_عنوان_الجمال👑😍 #viral #fipシ #foryou #foryoupage #جمال #مكياج #شنيون #عرايس #لف #ميش #اكسبلور #لايك #هشتاك #مشاهير_تيك_توك #تخصيل\n", "76 False Full Episode 293 on YT & Spotify | ShxtsnGigs Podcast\n", "77 False GAME DE RUA COM LARRIKA! #gamederua #viral #fy #fypシ #pravoce #foryoupage\n", "78 False The smallest phone #CapCut #oppo #infinix #Motorola #zte #huawei #vivo #samsung\n", "79 False \n", "80 False Respect Moment in Football ❤️#footballeur #surprise #fan #respectmoment #respectinfootball #moment #respect #foryou #pourtoi #football\n", "81 False I think I got it in my pants 😧 #learnfromkhaby #comic\n", "82 False Respondendo a @hg_11236 ta aqui a reacao dela ❤️❤️❤️❤️ fofa demais! #fypシ #diadasmaes #surpresa\n", "83 False Наступ на Белгород. Що роблять добровольці там #війна #грайворон #белгород #українськийтікток #андрійковаленко\n", "84 False Have you ever eaten a cappuccino croissant? 
☕️🥐\n", ".\n", ".\n", ".\n", "#pastry #pasticceria #italia #croissant\n", "85 False #recetas #facil whatia en tierrra\n", "86 False seyran inşallah gidersin feritinn bı kazimdan tokat yemediği kalmamisti#yalıcapkınıxferit #feritkorhan #seyrankorhan #mertramazandemir #afrasaraçoğlu #seyfer #yalıçapkını #keşfet #fypシ #foryoupage #foryou #viral\n", "87 False \n", "88 False La puissance de l’eau #pourtoi #meteo #inondation #eau #vigilance\n", "89 False Olha a aranha\n", "#alegriaquecontagia #comedia #viral #rireomelhorremedio #rireprosfortes #rirrenovaalma #gargalhada #fypシ #viralvideo #comediante #trolagem\n", "90 False Se puede ser infiel por chat? VIDEO COMPLETO EN EL LINK DE MI PERFIL ✅ #juliosinfiltros #relaciones #pareja #relacionessanas #infidelidad #infieles #microinfidelidades\n", "91 False Replying to @MC Codër\n", "92 False #kamalaghalan❣\n", "93 False Лобода про детей\n", "94 False Відмічай друга😅#українськийтікток #футболкизпринтами #подарунокхлопцю #подарунокдругу\n", "95 False Find your self worth.#real #loyalty #love #sad #sadquotes #relatable #betryal #foryou #scrolling #mindset #reality #xyzbca #fyp\n", "96 False #київ #вибух #нло #метеорит #ракета #сяйво #спалах #сніданокз1плюс1\n", "97 False اكثر مسلسل حبيتوها برمضان ؟#مهند_رفل #explore\n", "98 False المنتج اللي قالب التيك توك .. 
أسفنجة التنضيف السحرية 🧐 #حركة_لاكسبلورر #fyp #gym #عبدالرحمن_وابتسام #trendingtiktok #challenge #fypシ\n", "99 True Scotch Egg 😍🥚 #scotchegg #egg #easyrecipe #easyrecipes #caviar #eggs #asmrfood #bacon #cooktok #foodtok #recipesoftiktok #homecook #dinnerideas #eggrecipe #breakfastideas #fancy\n" ] } ], "source": [ "import time\n", "\n", "# if the description contains any of these words, we will watch the video\n", "keywords = ['food', 'dish', 'cook', 'pizza', 'recipe', 'mukbang', 'dinner', 'foodie', 'restaurant']\n", "\n", "# this is where we will store the decisions we make\n", "decisions = []\n", "\n", "# open a browser, and go to TikTok's For You page.\n", "driver = open_browser()\n", "driver.get('https://tiktok.com/foryou')\n", "close_login_dialog(driver)\n", "\n", "for tiktok_index in range(0, 100):\n", "    # get all videos\n", "    tiktoks = find_videos(driver)\n", "\n", "    # the current tiktok is the one that's currently showing the video player\n", "    current_video = get_current(tiktoks)\n", "\n", "    if current_video is None:\n", "        print('no more videos')\n", "        break\n", "\n", "    # read the description of the video\n", "    description = get_description(current_video)\n", "\n", "    # categorize the video as relevant to `keywords` or not.\n", "    contains_keyword = is_target_video(description, keywords)\n", "    decisions.append(contains_keyword)\n", "\n", "    print(tiktok_index, contains_keyword, description)\n", "\n", "    if contains_keyword:\n", "        # we have a video of interest, let's take a screenshot\n", "        ## here we declare the files we'll save. 
they're named according to their order.\n", "        fn_screenshot = f\"data/screenshots/screenshot_{tiktok_index:05}.png\"\n", "        fn_page_source = fn_screenshot.replace('.png', '.html')\n", "        screenshot(current_video, fn_screenshot)\n", "        save_source(driver, fn_page_source)\n", "        # and now watch it for 30 seconds\n", "        time.sleep(30)\n", "\n", "    # move to the next video\n", "    arrow_down(driver)\n", "    time.sleep(2)\n", "\n", "driver.close()" ] }, { "cell_type": "markdown", "id": "2c9c72f4", "metadata": {}, "source": [ "::: {.callout-tip}\n", "#### Pro tip: Be careful about keywords\n", "For experiments that use `keywords`, the choices we make will directly shape our results. In the field, you can mitigate your own predisposition and biases by working with [domain experts to curate keyword lists](https://themarkup.org/google-the-giant/2021/04/09/how-we-discovered-googles-social-justice-blocklist-for-youtube-ad-placements#sourcing-social-justice-keywords).\n", ":::" ] }, { "cell_type": "code", "execution_count": 24, "id": "8b45827f", "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYwAAAEGCAYAAAB2EqL0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAARG0lEQVR4nO3debBkZX3G8e8DBBhABEUNizgQURyJjmSKIFiWCiaiEtQQcI2SKC4YBLXiEqvERGORGBfKiBJFMIUjQlBwRUXixqLDIuAQRQERHGFcWERZhF/+OOdqz2Xu3HeY6Xub7u+n6tbtc/p09+/wNveZ855z3jdVhSRJs9lgvguQJN03GBiSpCYGhiSpiYEhSWpiYEiSmmw03wUMyzbbbFMLFy6c7zIk6T7lggsu+HlVPWh1z41tYCxcuJBly5bNdxmSdJ+S5MczPWeXlCSpiYEhSWpiYEiSmhgYkqQmBoYkqcmcXCWV5IHAWf3iHwN3ASv75T2q6o65qEOSdO/NSWBU1S+AxQBJjgJ+XVXvGtwmSYBU1d1zUZMkae3M630YSR4OnAFcBDwO2C/Jd6tqq/755wL7VtVLkzwEOBbYEbgbOLyqzpvLej9+/jWcfvF1v18+YPH2PP/Pd5zLEqQ18js6maa3+6LttuSt+z96vX/OKJzD2BV4T1UtAq5bw3bHAP9WVUuAg4APT98gyaFJliVZtnLlynu8wbo6/eLrWL7iZgCWr7h5lQaSRoHf0ck02O7DNAp3ev+oqlpuyd4XeGTXcwXA1kkWVNVvp1ZU1XHAcQBLliwZysxQi7bdkpNf/ngO/tC5w3h7aZ35HZ1MU+0+TKMQGLcOPL4byMDypgOPgyfIJWnejEKX1O/1J7x/lWSXJBsAzx54+ivAYVMLSRbPdX2SNMlGKjB6bwDOBM4Brh1Yfxiwd5JLkiwHXjYfxUnSpJrzLqmqOmrg8Q/pL7cdWHcycPJqXrcSOHDY9UmSVm8UjzAkSSPIwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNdloTU8mec6anq+q09ZvOZKkUbXGwAD2738/GNgL+Gq//GTgHMDAkKQJscbAqKpDAJJ8CVhUVSv65W2BE4ZenSRpZLSew3joVFj0rgd2HEI9kqQRNVuX1JSzkpwJLO2XDwa+MpySJEmjqCkwqurVSZ4NPLFfdVxVfWp4ZUmSRk3rEQbAhcAtVfWVJJsluV9V3TKswiRJo6XpHEaSlwGnAh/qV20PfHpYRUmSRk/rSe/DgL2BmwGq6gq6S20lSROiNTBur6o7phaSbATUcEqSJI2i1sD4WpI3AwuSPBU4BfjM8MqSJI2a1sB4I7ASuBR4OfB54C3DKkqSNHpaL6u9G/iv/keSNIGaAiPJ3sBRwMP61wSoqtp5eKVJkkZJ630YHwGOBC4A7hpeOZKkUdUaGDdV1ReGWokkaaTNNh/G7v3Ds5P8O91w5rdPPV9VFw6xNknSCJntCOM/pi0vGXhcwFPWbzmSpFE123wYT56rQiRJo611LKl/TbLVwPLWSd4+vLIkSaOm9ca
9/arqxqmFqvoV8PThlCRJGkWtgbFhkk2mFpIsADZZw/aSpDHTelntSXSz7n20Xz4E+NhwSpIkjaLWoUGOTvJdYN9+1b9U1ZnDK0uSNGpahwY5uqreAHxxNeskSROg9RzGU1ezbr/1WYgkabTNdqf3K4FXATsnuWTgqfsB3xpmYZKk0TJbl9THgS8A76SbE2PKLVX1y6FVJUkaObPd6X0TcBPwPIAkDwY2BbZIskVVXTP8EiVJo6D1Tu/9k1wBXAV8Dbia7shDkjQhWk96vx3YE/hBVe0E7AOcN7SqJEkjpzUw7qyqXwAbJNmgqs5m1ZFrJUljrvVO7xuTbAF8AzgpyQ3ArcMrS5I0atZ4hJHkiCR7AM8CfgMcQXfz3o+A/YdfniRpVMx2hLED8F5gV+BSunsvzgE+42W1kjRZZrus9vUASTamO2exF93Ag8clubGqFg2/REnSKGg9h7EA2BK4f//zU7ojDknShJhtaJDjgEcDtwDn03VHvbufQEmSNEFmu6x2R7qJkn4GXAdcC9y4xldIksbSbOcwnpYkdEcZewGvA3ZL8kvg3Kp66xzUKEkaAbOew6iqAi5LciPduFI3Ac8E9gAMDEmaELOdwzic7shiL+BOunMY5wDH40lvSZoosx1hLAROAY6sqhXDL0eSNKpmO4fx2rkqRJI02loHH5QkTTgDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNRlaYCS5K8nFAz8L17DtwiSXDasWSdK622iI7/3bqlo8xPeXJM2hYQbGPfRHGf8NbN6venVVnTNtm0cDHwU2pjsC+uuquiLJC4HD+/XnA6+qqruGUefbPvM9lv/05nusX77iZhZtu+Uqywd/6NxhlCDdK35HJ9P0dh+WYQbGgiQX94+vqqpnAzcAT62q25LsAiwFlkx73SuA91XVSUk2BjZM8ijgYGDvqrozyQeAFwAfG3xhkkOBQwF23HHH9b5Di7bdkgMWbw/w+9/SKPE7OpkG232YUlXDeePk11W1xbR19wfeDywG7gIeUVWb9Ucen62q3ZI8H/gnujA4rT+6eDXwZrrAAVgALK2qo2b6/CVLltSyZcvW815J0nhLckFVTf+HPDDHXVLAkcD1wGPpuptum75BVX08yfnAM4DPJ3k5EODEqnrTXBYrSfqDub6s9v7Aiqq6G3gRsOH0DZLsDFxZVccApwOPAc4CDkzy4H6bByR52NyVLUma68D4APDiJN8FdgVuXc02BwGX9ec/dgM+VlXLgbcAX0pyCfBlYNs5qlmSxBDPYcw3z2FI0tpb0zkM7/SWJDUxMCRJTQwMSVITA0OS1GRsT3onWQn8eB3eYhvg5+upnPuKSdxnmMz9dp8nx9ru98Oq6kGre2JsA2NdJVk205UC42oS9xkmc7/d58mxPvfbLilJUhMDQ5LUxMCY2XHzXcA8mMR9hsncb/d5cqy3/fYchiSpiUcYkqQmBoYkqYmBMU2SpyX5fpIfJnnjfNczDEkemuTsJMuTfC/Ja/r1D0jy5SRX9L+3nu9ahyHJhkkuSvLZfnmnJOf3bX5yP9Pj2EiyVZJTk/xfksuTPH4S2jrJkf33+7I
kS5NsOo5tneT4JDckuWxg3WrbN51j+v2/JMnua/NZBsaAJBsC/wnsBywCnpdk0fxWNRS/A15XVYuAPYHD+v18I3BWVe1CNwfJWAYm8Brg8oHlo4H3VNXDgV8Bfz8vVQ3P+4AvVtWudJOXXc6Yt3WS7YHDgSVVtRvd3DvPZTzb+gTgadPWzdS++wG79D+HAseuzQcZGKvaA/hhVV1ZVXcAnwAOmOea1ruqWlFVF/aPb6H7A7I93b6e2G92IvCs+alweJLsQDeb44f75QBPAU7tNxmr/e6nRX4i8BGAqrqjqm5kAtqabkbRBUk2AjYDVjCGbV1VXwd+OW31TO17AN0cQ1VV5wFbJWmeW8jAWNX2wE8Glq/t142tfj71xwHnAw+pqhX9Uz8DHjJPZQ3Te4F/BO7ulx8I3FhVv+uXx63NdwJWAh/tu+E+nGRzxrytq+o64F3ANXRBcRNwAePd1oNmat91+htnYEywJFsA/wMcUVU3Dz5X3fXWY3XNdZJnAjdU1QXzXcsc2gjYHTi2qh5HN8vlKt1PY9rWW9P9a3onYDtgc+7ZbTMR1mf7Ghirug546MDyDv26sZPkj+jC4qSqOq1fff3U4Wn/+4b5qm9I9gb+KsnVdN2NT6Hr39+q77aA8Wvza4Frq+r8fvlUugAZ97beF7iqqlZW1Z3AaXTtP85tPWim9l2nv3EGxqq+A+zSX0mxMd1JsjPmuab1ru+3/whweVW9e+CpM4AX949fDJw+17UNU1W9qap2qKqFdG371ap6AXA2cGC/2Vjtd1X9DPhJkkf2q/YBljPmbU3XFbVnks367/vUfo9tW08zU/ueAfxtf7XUnsBNA11Xs/JO72mSPJ2un3tD4Piqesc8l7TeJXkC8A3gUv7Ql/9muvMYnwR2pBsa/qCqmn4ybSwkeRLw+qp6ZpKd6Y44HgBcBLywqm6fz/rWpySL6U7ybwxcCRxC94/FsW7rJG8DDqa7KvAi4KV0/fVj1dZJlgJPohvG/HrgrcCnWU379uH5frruud8Ah1TVsubPMjAkSS3skpIkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMDQR+tF5/3LauiOSHJtkuySnzvC6/02yZD18/glJrkuySb+8TX8D4TpL8qSpkXelYTIwNCmW0t2sN+i5wNKq+mlVHbia16xvdwF/Nwefs1b6UZqlWRkYmhSnAs+Ymv+gH3RxO+AbSRZOzSWQZEGST/TzRnwKWDD1Bkn+Ism5SS5Mcko/FhdJ9ukH9ru0n5tgkxlqeC9w5MDQFFPvu8oRQpL3J3lJ//jqJO9McnGSZUl2T3Jmkh8lecXA22yZ5HPp5nL5YJINZqn56iRHJ7kQ+Jt7/V9VE8XA0ETo72L+Nt18ANAdXXyy7nnn6iuB31TVo+jumP0z6LqQgLcA+1bV7sAy4LVJNqWbj+DgqvpTusH+XjlDGdcA3wRetJblX1NVi+nuzj+BbmiLPYG3DWyzB/APdPO4/AnwnJlqHnjNL6pq96r6xFrWowm10eybSGNjqlvq9P736ibPeSJwDEBVXZLkkn79nnR/jL/Vja7AxsC5wCPpBrn7Qb/dicBhdEcTq/PO/vM/txZ1T41ndimwRT+HyS1Jbk+yVf/ct6vqSvj9UBFPAG6boeYpJ69FDZKBoYlyOvCeflrKzdZymPMAX66q562yMnns2hRQVVckuRg4aGD171j1aH/TaS+bGuvo7oHHU8tT/w9PP1KqmWoecGtr3RLYJaUJUlW/phut9Hi6o43V+TrwfIAkuwGP6defB+yd5OH9c5sneQTwfWDh1Hq67qavzVLKO4DXDyz/GFiUZJP+iGGftdqxzh79KMsb0A2498011CzdKwaGJs1SunmtZwqMY4EtklwO/DPdLG1U1UrgJcDSvpvqXGDXqrqNbvTXU5JMjf77wTUVUFXfAy4cWP4J3ciil/W/L7oX+/UdulFILweuAj41U8334r0lwNFqJUmNPMKQJDUxMCRJTQwMSVI
TA0OS1MTAkCQ1MTAkSU0MDElSk/8HABSZHuiqMGwAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "plt.plot(decisions, ds='steps')\n", "plt.xlabel('Video Number')\n", "plt.ylabel('Watched')\n", "plt.yticks([0, 1], ['False', 'True']);" ] }, { "cell_type": "markdown", "id": "b2b28c2d", "metadata": {}, "source": [ "The figure above shows when during our 100-videos-long session we were recommended a video about food (from `keywords`). The x-axis is chronological, the 1st video displayed is on the left, and the most recent video is on the right. The y-axis is \"yes\" or \"no,\" depending on if the video was related to food. " ] }, { "cell_type": "markdown", "id": "01b7a95a", "metadata": {}, "source": [ "### Results\n", "\n", "You can look back to the `data/screenshots` folder we created to check whether the videos we watched appear to be food-related. \n", "\n", "If the feed was indeed increasingly filled with food videos, we would see more lines towards the right of the graph. At least here it does not appear to be the case. \n", "\n", "Does it mean that the WSJ investigation was wrong, or that TikTok stopped personalizing content? \n", "\n", "The answer is \"No,\" for several reasons: \n", "\n", "1. We only scrolled through 100 videos, this is likely too few to observe any effects. Try re-running with a higher number!
\n", "2. When studying personalization you should use an account per profile and make sure you're logged in, rather than relying on a fresh browser. So, instead of closing the login dialog, try actually logging in! You know how to find and click buttons, and [this is how you put text in text fields](https://www.geeksforgeeks.org/send_keys-element-method-selenium-python/).
\n", "3. When you're not logged in, you will be presented with content from all over the world, in all languages. If you filtered `keywords` in just one language, you will miss plenty of target content in other languages.
\n", "4. You should always have a baseline to compare to. In this case, you should probably run two accounts at the same time - one that watches food videos and one that doesn't. Then you compare the prevalence of food videos between these two.
\n", "5. The WSJ investigation was run on the mobile app rather than on a desktop browser. Perhaps TikTok's personalization works differently based on device or operating system." ] }, { "cell_type": "markdown", "id": "e94ac133", "metadata": {}, "source": [ "## Advanced Usage\n", "\n", "Above we highlighted some ideas to make your investigation or study more robust. Some are methodological choices; others are technical.\n", "\n", "Some advanced tasks you can perform with browser automation include:\n", "\n", "- Authentication using the browser and storing cookies for later use.
\n", "- Intercepting background [API](/apis.html) calls and combining browser automation with API calls. See [`selenium-wire`](https://pypi.org/project/selenium-wire/) as an example.
\n", "- Signing in with one or more email addresses.
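
As a rough illustration of the first item, cookies from a logged-in session can be written to disk and restored in a later session. This is a minimal sketch, not a tested login workflow: `save_cookies`, `load_cookies`, and the file path are hypothetical names chosen for this example, while `get_cookies()` and `add_cookie()` are the standard Selenium WebDriver methods. It assumes `driver` is an open Selenium browser like the one returned by `open_browser()`.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    # get_cookies() returns a list of dictionaries, one per cookie
    cookies = driver.get_cookies()
    Path(path).write_text(json.dumps(cookies, indent=2))
    return len(cookies)


def load_cookies(driver, path):
    # the browser must already be on the cookies' domain,
    # otherwise add_cookie() refuses them
    cookies = json.loads(Path(path).read_text())
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

After signing in once, call `save_cookies(driver, 'data/cookies.json')`. In a later session, open the same site with `driver.get(...)`, call `load_cookies(driver, 'data/cookies.json')`, and then `driver.refresh()` so the page reloads with the stored session. Sites may expire or invalidate cookies, so treat this as a starting point rather than a guaranteed login.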
\n", "\n", "We may cover some or all of these topics in subsequent tutorials, but you should feel free to experiment.\n", "\n", "Let us know what you're interested in learning more about!" ] }, { "cell_type": "markdown", "id": "758b6226", "metadata": {}, "source": [ "# Related Readings\n", "\n", "More tutorials on the same subject:\n", "\n", "- \"[Using real browsers](https://scrapism.lav.io/using-real-browsers/)\" - Sam Lavigne\n", "\n", "Notable investigations, audits, and tools using browser automation:\n", "\n", "- \"[Blacklight](https://themarkup.org/blacklight)\" - an investigative tool by Surya Mattu
\n", "- \"[TheirTube](https://www.their.tube/)\" - an art and advocacy project by Tomo Kihara
\n", "- \"[Worlds Apart](https://www.nrk.no/osloogviken/xl/tiktok-doesn_t-show-the-war-in-ukraine-to-russian-users-1.15921522)\" - a TikTok investigation by Henrik Bøe and Christian Nicolai Bjørke
\n", "- \"[WebSearcher](https://github.com/gitronald/WebSearcher)\" - A Python package by Ronald E. Robertson
\n", "- \"[Googling for Abortion](https://journalqd.org/article/view/2752)\" - Yelena Mejova, Tatiana Gracyk, and Ronald E. Robertson
\n", "- \"[webXray](https://webxray.org/)\" - A website forensics tool by Tim Libert
\n", "- \"[OpenWPM](https://github.com/itdelatrisu/OpenWPM)\" - A privacy-measurement tool\n", "\n", "Please reach out with more examples to add." ] }, { "cell_type": "markdown", "id": "0a3fea01", "metadata": {}, "source": [ "# Citation\n", "\n", "To cite this chapter, please use the following BibTeX entry:\n", "\n", "
\n",
    "@incollection{inspect2023browser,\n",
    "  author    = {Sapiezynski, Piotr and Yin, Leon},\n",
    "  title     = {Browser Automation},\n",
    "  booktitle = {Inspect Element: A practitioner's guide to auditing algorithms and hypothesis-driven investigations},\n",
    "  year      = {2023},\n",
    "  editor    = {Yin, Leon and Sapiezynski, Piotr and Raji, Inioluwa Deborah},\n",
    "  note      = {\\url{https://inspectelement.org}}\n",
    "}\n",
    "
\n", "\n", "## Acknowledgements\n", "\n", "Thank you to Ruth Talbot and John West for answering questions about their two respective investigations." ] }, { "cell_type": "code", "execution_count": null, "id": "fe7eff69", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 5 }