{ "cells": [ { "cell_type": "raw", "id": "11500a81", "metadata": {}, "source": [ "---\n", "title: \"Browser Automation\"\n", "pagetitle: \"Browser Automation\"\n", "description-meta: \"Introduction, case studies, and exercises for automating browsers.\"\n", "description-title: \"Introduction, case studies, and exercises for automating browsers.\"\n", "author: \"Piotr Sapiezynski and Leon Yin\"\n", "author-meta: Piotr Sapiezynski and Leon Yin\"\n", "date: \"06-11-2023\"\n", "date-modified: \"07-09-2024\"\n", "execute: \n", " enabled: false\n", "keywords: data collection, web scraping, browser automation, algorithm audits, personalization\n", "twitter-card:\n", " title: Browser Automation\n", " description: Introduction, case studies, and exercises for automating browsers.\n", " image: assets/inspect-element-logo.jpg\n", "open-graph:\n", " title: Browser Automation\n", " description: Introduction, case studies, and exercises for automating browsers.\n", " locale: us_EN\n", " site-name: Inspect Element\n", " image: assets/inspect-element-logo.jpg\n", "href: browser_automation\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "id": "0c6b2d5c", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "📖 Read online\n", "⚙️ GitHub\n", "🏛 Citation\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#| echo: false\n", "from utils import build_buttons\n", "from importlib import reload\n", "import utils\n", "reload(utils)\n", "utils.build_buttons(link= 'browser_automation', \n", " github= 'https://github.com/yinleon/inspect-element/blob/main/browser_automation.ipynb',\n", " colab = False,\n", " citation= True)" ] }, { "cell_type": "markdown", "id": "9a047941", "metadata": {}, "source": [ "Browser automation is a fundamental web scraping technique for building your own dataset.\n", "\n", "It is essential for investigating personalization, working with rendered elements, and waiting for scripts and code to execute on a web page.\n", "\n", "However, browser automation can be resource intensive and slow compared to other data collection approaches.\n", "\n", "👉[Click here to jump to the Playwright tutorial](#tutorial)." ] }, { "cell_type": "markdown", "id": "d5b2f9f1", "metadata": {}, "source": [ "# Intro\n", "\n", "If you’ve tried to buy concert tickets to a popular act lately, you’ve probably watched in horror as the blue “available” seats evaporate before your eyes the instant tickets are released. Part of that may be pure ✨star power✨, but more than likely, bots were programmed to buy tickets to be resold at a premium.\n", "\n", "These bots are programmed to act like an eager fan: waiting in the queue, selecting a seat, and paying for the show. These tasks can all be executed using browser automation.\n", "\n", "**Browser automation** is used to programmatically interact with web applications. \n", "\n", "The most frequent use case for browser automation is to run tests on websites by simulating user behavior (mouse clicks, scrolling, and filling out forms). 
This is routine and invisible work that you wouldn’t remember, unlike seeing your dream of crowd surfing with your favorite musician disappear thanks to ticket-buying bots.\n", "\n", "But browser automation has another use, one which _may_ make your dreams come true: web scraping.\n", "\n", "Browser automation isn’t always the best solution for building a dataset, but it is necessary when you need to:\n", "\n", "1. **Analyze rendered HTML**: see what's on a website as a user would.\n", "2. **Simulate user behavior**: experiment with personalization and experience a website as a user would.\n", "3. **Trigger event execution**: retrieve responses to JavaScript or [network requests](/apis.html) following an action.\n", "\n", "These reasons are often interrelated. We will walk through case studies (below) that highlight at least one of these strengths, as well as why browser automation was a necessary choice.\n", "\n", "Some popular browser automation tools are [Puppeteer](https://pptr.dev/), [Playwright](https://playwright.dev/), and [Selenium](https://www.selenium.dev/documentation/webdriver/elements/). \n", "\n", "## Headless Browsing\n", "\n", "Browser automation can be executed in a \"headless\" state by some tools.\n", "\n", "This doesn't mean that the browser is a ghost or anything like that; it just means that the _user interface_ is not visible.\n", "\n", "One benefit of headless browsing is that it is less [resource intensive](/apis.html#case-study-on-scalability-collecting-internet-plans). However, there is no visibility into what the browser is doing, which makes headless scrapers difficult to debug.\n", "\n", "Luckily, some browser automation tools (such as Playwright) allow you to [toggle headless browsing](https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch) on and off. 
Other tools, such as Puppeteer, offer the same option but run headless by default.\n", "\n", "If you’re new to browser automation, we suggest not using headless browsing right off the bat. Instead, try headed Playwright, which is exactly what we’ll do in the [tutorial](#tutorial) below (see the same tutorial in Selenium [here](/browser_automation_selenium))." ] }, { "cell_type": "markdown", "id": "6bbafecb", "metadata": {}, "source": [ "
\n", "
Using Playwright to automate browsing TikTok's \"For You\" page for food videos.
\n", "
" ] }, { "cell_type": "markdown", "id": "a011a16a", "metadata": {}, "source": [ "# Case Studies\n", "## Case Study 1: Google Search\n", "In the investigation “[Google the Giant](https://themarkup.org/google-the-giant/2020/07/28/google-search-results-prioritize-google-products-over-competitors),” The Markup wanted to measure how much of a Google Search page is “Google.” Aside from the daunting task of classifying what is \"Google,\" and what is \"not Google,\" the team of two investigative journalists-- Adrianne Jeffries and Leon Yin (a co-author of this section) needed to measure real estate on a web page.\n", "\n", "The team developed a [targeted staining technique](https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool) inspired by the life sciences, originally used to highlight the presence of chemicals, compounds, or cancers. \n", "\n", "
\n", "\"https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool#google-search-flow\"\n", "
\n", "Source: The Markup\n", "
\n", "
\n", "\n", "The reporters wrote over [68 web parsers](https://github.com/the-markup/investigation-google-search-audit/blob/master/utils/parsers.py) to identify elements on trending Google Search results as \"Google\" or one of three other categories. Once an element was identified, they could find the [coordinates](https://developer.mozilla.org/en-US/docs/Web/SVG/Element/rect) of each element along with its corresponding bounding box. Using the categorization and bounding box, The Markup was able to measure how many pixels were allocated to Google properties, as well as how far down the page they were placed on a mobile phone.\n", "\n", "
\n", "\"https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool#google-search-flow\"\n", "
\n", "Source: The Markup\n", "
\n", "
\n", "\n", "Browser automation tools' ability to collect and analyze **rendered HTML pages** can be essential. This is especially the case for search results, since most search results contain modules, carousels, and other non-standardized rows and columns that are more complex than lists.\n", "\n", "Rendered HTML can be used to analyze the allocation of real estate on a website, which can be a useful metric to gauge self-preferencing and [anti-competitive business practices](https://themarkup.org/amazons-advantage/2021/10/14/amazon-puts-its-own-brands-first-above-better-rated-products) relevant to [antitrust](https://themarkup.org/google-the-giant/2020/07/29/congressman-says-the-markup-investigation-proves-google-has-created-a-walled-garden)." ] }, { "cell_type": "markdown", "id": "b11a8f77", "metadata": {}, "source": [ "## Case Study 2: Deanonymizing Google's Ad Network\n", "\n", "Google ad sellers offer space on websites like virtual billboards, and are compensated by Google after an ad is shown. However, unlike physical ad sellers, almost all of the ~1.3 million ad sellers on Google are anonymous. To limit transparency further, multiple websites and apps can be monetized by the same seller, and it’s not clear which websites are part of Google’s ad network in the first place. \n", "\n", "As a result, [advertisers](https://checkmyads.org/branded/google-ads-has-become-a-massive-dark-money-operation/) and the public do not know who is making money from Google ads. Fortunately, watchdog groups, industry analysts, and reporters have developed methods to hold Google accountable for this oversight.\n", "\n", "The methods boil down to triggering a JavaScript function that sends a request to Google to show an ad on a loaded web page. 
Importantly, the request reveals the seller ID used to monetize the website displaying the ad, and in doing so, links the seller ID to the website.\n", "\n", "In 2022, reporters from ProPublica used Playwright to [automate this process](https://www.propublica.org/article/google-display-ads-piracy-porn-fraud), visiting 7 million websites and deanonymizing over 900,000 Google ad sellers. Their investigation found some websites were able to monetize advertisements, despite breaking Google’s policies.\n", "\n", "ProPublica's investigation used browser automation tools to **trigger event execution** to successfully load ads. Often, this required waiting for a page to fully render, scrolling down to potential ad space, and browsing multiple pages. The reporters used a combination of network requests, rendered HTML, and cross-referencing screenshots to confirm that each website monetized ads from Google’s ad network.\n", "\n", "Browser automation can help you trawl for clues, especially when it comes to looking for specific network requests sent to a central player by many different websites." ] }, { "cell_type": "markdown", "id": "d5e47cdc", "metadata": {}, "source": [ "## Case Study 3: TikTok Personalization\n", "An investigation conducted by the Wall Street Journal, \"[Inside TikTok's Algorithm](https://www.wsj.com/articles/tiktok-algorithm-video-investigation-11626877477),\" found that even when a user does not like, share, or follow any creators, TikTok still personalizes the \"For You\" page based on how long they watch the recommended videos.\n", "\n", "In particular, the WSJ investigation found that users who watch content related to depression and skip other content are soon presented with mental health content and little else. Importantly, this effect happened even when the users did not explicitly like or share any videos, nor did they follow any creators. 
\n", "\n", "You can watch the WSJ's video showing how they mimic user behavior to study the effects of personalization:" ] }, { "cell_type": "markdown", "id": "92f2440f", "metadata": { "tags": [] }, "source": [ "
\n", "
Source: WSJ
\n", "
\n" ] }, { "cell_type": "markdown", "id": "e713418e", "metadata": {}, "source": [ "This investigation was possible only by **simulating user behavior** and triggering personalization from TikTok's \"For You\" recommendations." ] }, { "cell_type": "markdown", "id": "5f303855", "metadata": {}, "source": [ "# Tutorial\n", "In the hands-on tutorial we will attempt to study personalization on TikTok with a mock experiment. \n", "\n", "We’re going to teach you the basics of browser automation in Playwright, but the techniques we'll discuss could be used to study any other website using any other automation tool.\n", "\n", "We will try to replicate elements of the WSJ investigation and see if we can trigger a personalized \"For You\" page. Although the WSJ ran their investigation using an Android on a Raspberry Pi, we will try our luck with something you can run locally on a personal computer using browser automation.\n", "\n", "In this tutorial we'll use Playwright to watch TikTok videos where the description mentions keywords of our choosing, while skipping all others. In doing so, you will learn practical skills such as:\n", "\n", "* Setting up the automated browser in Python\n", "* Finding particular elements on the screen, extracting their content, and interacting with them\n", "* Scrolling\n", "* Taking screenshots\n", "\n", "Importantly, we’ll be watching videos with lighter topics than depression (the example chosen in the WSJ investigation).\n", "\n", "::: {.callout-tip}\n", "#### Pro tip: Minimizing harms\n", "When developing the data collection methodology for an audit or investigation, start with low-stakes themes. This minimizes your exposure to harmful content and avoids unnecessarily boosting its popularity.\n", ":::" ] }, { "cell_type": "markdown", "id": "c3e5b621", "metadata": {}, "source": [ "## Step 1: Installing Playwright\n", "Playwright will take care of finding and installing the browser binary that's suitable for your operating system. 
Such setup is much more straightforward than [Selenium](https://selenium-python.readthedocs.io/), which requires the user to manage each browser version.\n", "\n", "The first line below installs the Python library, the second line installs the browser binaries." ] }, { "cell_type": "code", "execution_count": 2, "id": "0cd30c59", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: playwright in /Users/lyin72/miniconda3/lib/python3.11/site-packages (1.44.0)\n", "Requirement already satisfied: greenlet==3.0.3 in /Users/lyin72/miniconda3/lib/python3.11/site-packages (from playwright) (3.0.3)\n", "Requirement already satisfied: pyee==11.1.0 in /Users/lyin72/miniconda3/lib/python3.11/site-packages (from playwright) (11.1.0)\n", "Requirement already satisfied: typing-extensions in /Users/lyin72/miniconda3/lib/python3.11/site-packages (from pyee==11.1.0->playwright) (4.7.1)\n" ] } ], "source": [ "!pip install playwright\n", "!playwright install" ] }, { "cell_type": "markdown", "id": "a48204d3", "metadata": {}, "source": [ "Let's see if the installation worked correctly! Run the cell below to open a new Firefox window. 
We're going to use Firefox in this tutorial because Playwright's default browser (Chromium) does not support video playback in TikTok's format.\n" ] }, { "cell_type": "code", "execution_count": 25, "id": "6cda6fd6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ ">" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from playwright.async_api import async_playwright\n", "\n", "# Start the browser\n", "playwright = await async_playwright().start()\n", "browser = await playwright.firefox.launch(headless=False)\n", "\n", "# Create a new browser window\n", "page = await browser.new_page()\n", "\n", "# Open the default TikTok For You page\n", "await page.goto(\"https://www.tiktok.com/foryou\")" ] }, { "cell_type": "markdown", "id": "6eaa70e9", "metadata": {}, "source": [ "::: {.callout-note}\n", "What is `await`? We're running Playwright [asynchronously](https://realpython.com/python-async-features/#understanding-asynchronous-programming), which is the only way to be compatible with Jupyter Notebooks. You _can_ run Playwright synchronously (aka in regular Python) as a **script**, but not as a notebook. In practice you'll want to tinker and iterate, so a notebook is preferred.\n", "\n", "We put `await` in front of each Playwright call so that the command finishes before the next line runs. Without `await`, the call only returns a coroutine object and the command never executes.\n", ":::" ] }, { "cell_type": "markdown", "id": "890f29ee", "metadata": {}, "source": [ "If everything works fine and you have the browser with TikTok open, our setup is complete!\n", "\n", "Unfortunately, depending on your system this setup might not work:\n", "* It will not work at all in Google Colab - you need to run this on your own machine\n", "* It might not work on a Windows machine. If you're using Windows, you will need to downgrade your `ipykernel` to a version that supports Playwright. 
Uncomment the last line of the next code cell and run it, then restart this notebook:" ] }, { "cell_type": "code", "execution_count": null, "id": "bae3fa15", "metadata": {}, "outputs": [], "source": [ "## Only uncomment and run the next line if you're using Windows and the cell above did not give you an open browser window.\n", "#!pip install ipykernel==6.28.0" ] }, { "cell_type": "markdown", "id": "f8dd4f31", "metadata": {}, "source": [ "## Step 2: Finding elements on the page and interacting with them\n", "\n", "We will perform our mock experiment without logging in (but we will also learn how to create multiple accounts and how to log in later).\n", "\n", "Press the arrow down button on your keyboard a few times until a dialog pops up asking you to log in:\n", "\n", "![](assets/browser1_02_tiktok1.png \"tiktok main page\")\n", "\n", "Instead of logging in, our first interaction will be to click the \"Continue as guest\" button.\n", "\n", "Playwright has built-in tools called [Locators](https://playwright.dev/python/docs/locators) to find and interact with elements on the page. One helpful locator is [based on the text](https://playwright.dev/python/docs/api/class-page#page-get-by-text) of a button you want to press. We can use the `get_by_text` locator to find the button that says \"Continue as guest\" on the `page` and click it:" ] }, { "cell_type": "code", "execution_count": 26, "id": "19ccfd80", "metadata": { "scrolled": true }, "outputs": [], "source": [ "await page.get_by_text(\"Continue as guest\").click()" ] }, { "cell_type": "markdown", "id": "4aec1f05", "metadata": {}, "source": [ "If Playwright successfully finds the button with the text you specified, it will be clicked. However, if Playwright **does not** find the element (because the element hasn't loaded yet or you misspelled the text), you will get a `TimeoutError`. \n", "\n", "This error is thrown because Playwright only waits a limited period of time for an element to appear on screen. 
The default is 30,000 milliseconds (30 seconds). You can specify a different timeout as an argument to `click()`, for example 1,000 milliseconds (1 second):\n", "```python\n", "await page.get_by_text(\"Continue as guest\").click(timeout=1000)\n", "```" ] }, { "cell_type": "markdown", "id": "6a431668", "metadata": {}, "source": [ "Did you notice a change on the page? Congratulations! You just automated the browser to click something." ] }, { "cell_type": "markdown", "id": "b4f70b31", "metadata": {}, "source": [ "## Step 4: Scrolling\n", "\n", "We now have a browser instance open and displaying the For You page. Let's scroll through the videos.\n", "\n", "If you are a *real person* who (for whatever reason) visits TikTok on your computer, you could press the down key on the keyboard to see new videos. We will do that programmatically using a [virtual keyboard](https://playwright.dev/docs/api/class-keyboard) instead:" ] }, { "cell_type": "code", "execution_count": 27, "id": "0de8d265", "metadata": {}, "outputs": [], "source": [ "await page.keyboard.press(\"ArrowDown\")" ] }, { "cell_type": "markdown", "id": "3c6a27fd", "metadata": {}, "source": [ "When you run the cell above you will see that your browser scrolls down to the next video." ] }, { "cell_type": "markdown", "id": "17dbb289", "metadata": {}, "source": [ "## Step 5: Finding TikTok videos on the page\n", "\n", "Now that we have the building blocks for swiping through the For You page, let's view the recommended TikTok videos and parse out information (called metadata) for each video.\n", "\n", "When we asked Playwright to search for the \"Continue as guest\" button (Step 2), we used a locator function based on text. 
Playwright has other [locator](https://playwright.dev/python/docs/locators) functions to find what you're looking for:\n", "\n", "- `get_by_role()` to locate by explicit and implicit accessibility attributes.\n", "- `get_by_text()` to locate by text content.\n", "- `get_by_label()` to locate by the associated label's text.\n", "- `get_by_placeholder()` to locate an input by placeholder.\n", "- `get_by_alt_text()` to locate an element, usually image, by its text alternative.\n", "- `get_by_title()` to locate an element by its title attribute.\n", "- `get_by_test_id()` to locate an element based on its data-testid attribute (other attributes can be configured).\n", "\n", "The developers suggest using these recommended locators. This will make your code more legible and reliable. Other browser automation tools have comparable functions.\n", "\n", "Unfortunately for us, none of these will work for our task. If you look at the source code for TikTok videos, you won't find any of these locators useful. However, there are fields that we can use to identify videos another way.\n", "\n", "1. Right-click on the white space around a TikTok video and choose \"Inspect\".\n", "![Inspect Element](assets/browser1_05_inspect_tiktok_a1.png)\n", "2. Hover your mouse over the surrounding `<div>
` elements and observe the highlighted elements on the page to see which ones correspond to each TikTok video.\n", "![Inspect Element](assets/browser1_05_inspect_tiktok_b1.png)\n", "3. You will see that each video is in a separate `<div>
` container but each of these containers has the same [data attribute](https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/data-*) (`data-e2e`) with the value of `recommend-list-item-container`.\n", "4. We can now use this to find all videos on the page (you can search by attribute value using square brackets):" ] }, { "cell_type": "markdown", "id": "6fb01a78", "metadata": {}, "source": [ "Playwright has a generic `locator` function that accepts both XPath and CSS [selectors](https://playwright.dev/python/docs/locators#locate-by-css-or-xpath).\n", "\n", "The same `<div>
` can be identified in xpath as `//div[@data-e2e=\"recommend-list-item-container\"]` or as a CSS selector as `[data-e2e=\"recommend-list-item-container\"]`." ] }, { "cell_type": "code", "execution_count": 31, "id": "57474ce4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[ selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=0'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=1'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=2'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=3'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=4'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=5'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=6'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=7'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=8'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=9'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=10'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=11'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=12'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=13'>,\n", " selector='//div[@data-e2e=\"recommend-list-item-container\"] >> nth=14'>]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "videos = await page.locator('//div[@data-e2e=\"recommend-list-item-container\"]').all()\n", "videos" ] }, { "cell_type": "markdown", "id": "90d901d5", "metadata": {}, "source": [ "When we searched for the \"Continue as guest\" button we didn't need to use the `all` method because we were only expecting one element to match our locator.\n", "\n", "Now we're trying to find **all** videos on page, so we will chain the `locator` and 
`all` functions to return a full list of elements that match the locator." ] }, { "cell_type": "markdown", "id": "56709634", "metadata": {}, "source": [ "## Step 6: Parsing TikTok metadata\n", "With all the TikTok `videos` on the page, let's extract the description from each. Later, we'll use this metadata to decide whether to watch a video or to skip it. The process of extracting a specific field from a webpage is called \"parsing\".\n", "\n", "1. Pick any description, right-click, and choose \"Inspect\". \n", "2. Let's locate the `<div>
` that contains the whole description (including any hashtags) and make note of its `data` attribute.\n", "3. Now let's write the code that extracts the description from a single video. You can get the text of any located element by calling the `inner_text` function." ] }, { "cell_type": "code", "execution_count": 34, "id": "42eb180c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unbelievable fish trap technique #fish #fishing #fishinglife #wild #wildlife #nature #asmr #river #fyp \n", "Super winner! Whoever clears the board first wins👌Sling Puck Game #viral #viralvideo #2024 #satisfying \n", "Head on to your nearest retail store and spot the 900g promo pack! Prepare your child for all school age challenges with NIDO!*Applicable on select retail stores nationwide.\n", "#india #streetfood #food #fpy #foryou #longervideos \n", "Jajaja YO NO SOY LA QUE REACCIONA, es una nena que se muere por el juhador #richardRios #colombia \n", "Every car needs this!🤯 #lifehack #cars #diy #sports \n", "This was insane 🫣🤣\n", "How North Korea is Now Impossible to Escape 🇰🇵🇰🇷 #northkorea #korea #southkorea #northkoreafact #northkorealife #border #maps #geography #learn #history #geotok #historytok #funfacts #fyp \n", "Geeze im tired of hurting \n", "#momsoftiktok #baseketball #nba #tiktok #fyp #foryou \n", "should dweeb count? 🤣 #trivia \n", "Apple watch hidden camera\n", "\n", "Nah fam. I’m not for this. I had to get back into line so I could record this. #ai #artificialintelligence #wendys \n", "Part 1#foryou #viral \n" ] } ], "source": [ "for video in videos:\n", " print(await video.locator('//div[@data-e2e=\"video-desc\"]').inner_text())" ] }, { "cell_type": "markdown", "id": "ee9cdbb6", "metadata": {}, "source": [ "::: {.callout-note}\n", "Note: We previously searched for elements using `page.locator()`. That allowed us to search the whole page. Here we're using a locator within a previously located element: `video.locator()`. 
This allows us to access attributes and elements **within an element on the page**, rather than on the whole page.\n", ":::" ] }, { "cell_type": "markdown", "id": "05f314b0", "metadata": {}, "source": [ "## Step 7: Finding the TikTok video that's currently playing\n", "We know how to scroll to the next video, and we know how to find all videos that are loaded.\n", "At this point we could either:\n", "\n", "a. Assume that at the beginning, the 0th video is playing, and then every time we press arrow down, the next video is being displayed
\n", "b. Or, assume that the arrow down does not always work and each time verify which video is actually playing\n", "\n", "The problem with the first approach is that even if scrolling fails just once, our experiment will be compromised (after it happens we will be watching and skipping different videos than our script tells us). This is why we will go with the second approach and verify which video is actually playing. Back to our favorite tool: inspect element!\n", "\n", "When you right-click on the playing video, you will see that instead of our familiar UI we get a custom TikTok menu, so that won't work. Try right-clicking on the description of the video instead, then hovering over different elements in the inspector and expanding the one that highlights the video in the browser. Dig deep until you get to the `div` that only contains the video. \n", "\n", "Still in the inspector, try looking at the video below. You will see that the `div` that contains the video is missing and there is no element with the tag name `video`. That's how we can tell whether a video is currently playing: its `div` will contain a `video` element that we can find by its tag name:" ] }, { "cell_type": "code", "execution_count": 15, "id": "50499287", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not playing Unbelievable fish trap technique #fish #fishing #fishinglife #wild #wildlife #nature #asmr #river #fyp \n", "not playing Super winner! Whoever clears the board first wins👌Sling Puck Game #viral #viralvideo #2024 #satisfying \n", "playing Head on to your nearest retail store and spot the 900g promo pack! 
Prepare your child for all school age challenges with NIDO!*Applicable on select retail stores nationwide.\n", "not playing #india #streetfood #food #fpy #foryou #longervideos \n", "not playing Jajaja YO NO SOY LA QUE REACCIONA, es una nena que se muere por el juhador #richardRios #colombia \n" ] } ], "source": [ "for video in videos:\n", " # let's get the description of each video using the method we already know\n", " description = await video.locator('//div[@data-e2e=\"video-desc\"]').inner_text()\n", "\n", " # now let's count all the