---
title: Wrestling Undetected-Chromedriver & Persistence in NixOS
permalink: /futureproof/wrestling-undetected-chromedriver-persistence-nixos/
description: This entry details a challenging and ultimately successful journey through a 'deep rabbit hole' of stealth browser automation, specifically integrating `undetected-chromedriver` with persistence on NixOS. The author highlights the frustration of an 'arms race' against anti-bot measures and the value of a 'Roomba Robot' (iterative, persistent) problem-solving approach. They emphasize the utility of `git diff` for progress tracking during exhaustion and the critical role of AI collaboration in pinpointing the obscure NixOS-specific `exit code 127` errors related to both browser and driver executable paths. The narrative captures the raw experience of debugging a complex environmental issue, culminating in a triumphant 'Dayton Wind Tunnel success assured' moment.
meta_description: Successfully integrates `undetected-chromedriver` and browser persistence on NixOS for stealth automation. Details `exit code 127` troubleshooting, anti-bot bypass, and binary path alignment.
meta_keywords: undetected-chromedriver, Selenium, browser automation, NixOS, persistence, anti-bot scraping, troubleshooting, git diff, devops, python
layout: post
sort_order: 8
---

## Setting the Stage: Context for the Curious Book Reader

This entry documents a pivotal moment in the ongoing "Stealth Browser Automation Series," focusing on a particularly thorny technical challenge. It chronicles the deep dive into implementing `undetected-chromedriver` with browser session persistence within a unique NixOS environment, an endeavor that pushed the boundaries of traditional debugging. Readers will witness a real-time account of overcoming complex `exit code 127` errors and sophisticated anti-bot mechanisms, showcasing the power of persistent problem-solving, AI collaboration, and precise environmental configuration. This journal serves as a testament to the iterative nature of engineering and the wins that fuel progress, no matter how small.

---

## Technical Journal Entry Begins

This is part 2 of the stealth browser automation series, and boy did I dive down a deep rabbit hole. But it's still *today*, so that's okay. I didn't do too badly. Getting into these *arms races* sucks. You have to make sure you're working off the very latest tools without trying to build them yourself, for that rabbit hole is very, very deep.

## The Marathon of Stealth Automation

I managed to get to the end of Wonderland on this one. I didn't have the choice not to... oh, I don't know... fail. This is one of those *failure is not an option* situations. Sometimes the rabbit holes are critical dependencies of the project that you must create an adequate solution for. Even if you give up on one particular approach, you would have to find another way, even if that means doing it manually. And that... that... that... is not an option.

So much is invested in this Selenium route that switching to the Chrome/Puppeteer route, even though Google put out an MCP server recently, would bring all the same multi-platform issues and the corresponding whack-a-mole game. And there is no time for that. I committed to the Selenium route because it works so well cross-platform (Macs, Windows and Linux). It runs from Nix Flakes really well in this *write once, run anywhere* way, and with browser automation, that's no small feat. I've been training for decades for this kind of work.
I'm your guy to figure out these impossiscripts. I'll get to the Google Puppeteer MCP Chrome automation server eventually... support that too... later. Tonight I had to fix Selenium not crawling one site that is particularly adept at fighting off the bots.

## Embracing the "Roomba Robot" Approach

Did I mention the Roomba Robot? Particularly the early versions with a random walk and not a lot of *real* machine intelligence? Back up, turn a bit, retry. Back up, turn a bit, retry. Do this ad infinitum and you shall eventually solve the problem. It's approximately the same problem-solving method as putting infinite monkeys at infinite typewriters until they reproduce the works of William Shakespeare. And sometimes it works. Today it worked.

It was not a sure thing. And my credibility is on the line. It was touch and go there for a minute. But it's only 11:00 PM and the meeting isn't until 11:00 AM tomorrow morning. That's a cool 12 hours. And I also have a client meeting to prepare for that comes immediately on its tail. And I am totally on my second wind and way past diminishing returns.

## Leveraging Git Diffs for Progress Tracking

Have I mentioned `git diff`s as a way of pinning up progress? That sort of fuzzy-mindedness sets in when you're pushing yourself real hard and you start to forget things. Things can't be forgotten with `git diff`s.

Okay, so I had to ask for the implementation plan again. That last one wasn't bad, but I really had to nail it. It was a nice short prompt:

Help me update scraper_tools.py to properly use undetected-chromedriver. Notice the techniques that are used to make a persistent browser user profile. That would probably be good in the Notebooks folder for persistence, the same way I do data, logs and browser_cache. So the idea is to have a nice undetectable Selenium browser by improving what I have in scraper_tools.py to use undetected-chromedriver and add browser persistence. There are several innovations here having to do with ensuring session persistence. It's just a wee bit different.

## The Implementation Plan: Undetected-Chromedriver & Persistence

**Gemini**: Context verified. Of course. Using `undetected-chromedriver` is a great way to improve the reliability of your browser automation. I'll help you integrate it into `scraper_tools.py` and add a straightforward way to manage persistent browser profiles, drawing inspiration from your existing `440_browser_automation.py` workflow.

Here's a breakdown of the plan:

1. **Update `scraper_tools.py`:** We'll replace the current `selenium-stealth` and `seleniumwire` implementation with `undetected-chromedriver`.
2. **Add Persistence:** We'll introduce `persistent` and `profile_name` parameters to the `scrape` function, allowing you to save and reuse browser sessions (cookies, logins, etc.).
3. **Organize Profiles:** A new `data/uc_profiles/` directory will be created to store these persistent profiles, keeping them organized alongside your other project data.
4. **Update the Call Chain:** I'll show you the minor changes needed in `pipulate/core.py` and `Notebooks/secretsauce.py` to pass the new persistence options through.
5. **Provide a Usage Example:** You'll see exactly how to call the updated function from your `FAQuilizer.ipynb` notebook.

-----

## 1. Updated `tools/scraper_tools.py`

This updated version integrates `undetected-chromedriver` and handles both temporary and persistent profiles.
```python
# /home/mike/repos/pipulate/tools/scraper_tools.py
import asyncio
import json
import os
import sys
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from urllib.parse import quote, urlparse

from loguru import logger
import undetected_chromedriver as uc
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

from tools import auto_tool
from . import dom_tools


def get_safe_path_component(url: str) -> tuple[str, str]:
    """Converts a URL into filesystem-safe components for directory paths."""
    parsed = urlparse(url)
    domain = parsed.netloc
    path = parsed.path
    if not path or path == '/':
        path_slug = "%2F"
    else:
        path_slug = quote(path, safe='').replace('/', '_')[:100]
    return domain, path_slug


@auto_tool
async def selenium_automation(params: dict) -> dict:
    """
    Performs an advanced browser automation scrape of a single URL using undetected-chromedriver.
    Supports persistent profiles to maintain sessions across runs.
    Captures artifacts including DOM, source, headers, screenshot, and visual DOM layouts.
    """
    url = params.get("url")
    domain = params.get("domain")
    url_path_slug = params.get("url_path_slug")
    take_screenshot = params.get("take_screenshot", False)
    headless = params.get("headless", True)
    is_notebook_context = params.get("is_notebook_context", False)
    persistent = params.get("persistent", False)
    profile_name = params.get("profile_name", "default")
    verbose = params.get("verbose", True)

    if not all([url, domain, url_path_slug is not None]):
        return {"success": False, "error": "URL, domain, and url_path_slug parameters are required."}

    driver = None
    artifacts = {}
    profile_path = None
    temp_profile = False

    base_dir = Path("browser_cache/")
    if not is_notebook_context:
        base_dir = base_dir / "looking_at"
    output_dir = base_dir / domain / url_path_slug

    try:
        if output_dir.exists():
            if verbose: logger.info(f"Clearing existing artifacts in: {output_dir}")
            shutil.rmtree(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        if verbose: logger.info(f"Saving new artifacts to: {output_dir}")

        options = uc.ChromeOptions()
        if headless:
            options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--start-maximized")
        options.add_argument("--window-size=1920,1080")

        if persistent:
            profile_path = Path(f"data/uc_profiles/{profile_name}")
            profile_path.mkdir(parents=True, exist_ok=True)
            logger.info(f"Using persistent profile: {profile_path}")
        else:
            profile_path = tempfile.mkdtemp(prefix='pipulate_automation_')
            temp_profile = True
            logger.info(f"Using temporary profile: {profile_path}")

        logger.info(f"Initializing undetected-chromedriver (Headless: {headless})...")
        driver = uc.Chrome(options=options, user_data_dir=str(profile_path))

        logger.info(f"Navigating to: {url}")
        driver.get(url)

        try:
            if verbose: logger.info("Waiting for security challenge to trigger a reload (Stage 1)...")
            initial_body = driver.find_element(By.TAG_NAME, 'body')
            WebDriverWait(driver, 20).until(EC.staleness_of(initial_body))
            if verbose: logger.success("Page reload detected!")

            if verbose: logger.info("Waiting for main content to appear after reload (Stage 2)...")
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "body")))
            if verbose: logger.success("Main content located!")
        except Exception as e:
            if verbose: logger.warning(f"Did not detect a page reload for security challenge. Proceeding anyway. Error: {e}")

        # --- Capture Core Artifacts ---
        dom_path = output_dir / "rendered_dom.html"
        dom_path.write_text(driver.execute_script("return document.documentElement.outerHTML;"), encoding='utf-8')
        artifacts['rendered_dom'] = str(dom_path)

        source_path = output_dir / "source_html.text"
        source_path.write_text(driver.page_source, encoding='utf-8')
        artifacts['source_html'] = str(source_path)

        if take_screenshot:
            screenshot_path = output_dir / "screenshot.png"
            driver.save_screenshot(str(screenshot_path))
            artifacts['screenshot'] = str(screenshot_path)

        # --- Generate Visualization Artifacts ---
        if verbose: logger.info("Generating DOM box visualization...")
        viz_result = await dom_tools.visualize_dom_boxes({"file_path": str(dom_path), "verbose": False})
        if viz_result.get("success"):
            viz_path = output_dir / "dom_layout_boxes.txt"
            viz_path.write_text(viz_result["output"], encoding='utf-8')
            artifacts['dom_layout_boxes'] = str(viz_path)

        if verbose: logger.info("Generating DOM hierarchy visualization...")
        hierarchy_viz_result = await dom_tools.visualize_dom_hierarchy({"file_path": str(dom_path), "verbose": False})
        if hierarchy_viz_result.get("success"):
            hierarchy_viz_path = output_dir / "dom_hierarchy.txt"
            hierarchy_viz_path.write_text(hierarchy_viz_result["output"], encoding='utf-8')
            artifacts['dom_hierarchy'] = str(hierarchy_viz_path)

        # --- Generate Accessibility Tree Artifact ---
        if verbose: logger.info("Extracting accessibility tree...")
        try:
            driver.execute_cdp_cmd("Accessibility.enable", {})
            ax_tree_result = driver.execute_cdp_cmd("Accessibility.getFullAXTree", {})
            ax_tree = ax_tree_result.get("nodes", [])
            ax_tree_path = output_dir / "accessibility_tree.json"
            ax_tree_path.write_text(json.dumps({"success": True, "node_count": len(ax_tree), "accessibility_tree": ax_tree}, indent=2), encoding='utf-8')
            artifacts['accessibility_tree'] = str(ax_tree_path)

            summary_result = await dom_tools.summarize_accessibility_tree({"file_path": str(ax_tree_path)})
            if summary_result.get("success"):
                summary_path = output_dir / "accessibility_tree_summary.txt"
                summary_path.write_text(summary_result["output"], encoding='utf-8')
                artifacts['accessibility_tree_summary'] = str(summary_path)
        except Exception as ax_error:
            logger.warning(f"Could not extract accessibility tree: {ax_error}")

        logger.success(f"Scrape successful for {url}")
        return {"success": True, "looking_at_files": artifacts}

    except Exception as e:
        logger.error(f"Scrape failed for {url}: {e}", exc_info=True)
        return {"success": False, "error": str(e), "looking_at_files": artifacts}

    finally:
        if driver:
            driver.quit()
            if verbose: logger.info("Browser closed.")
        if temp_profile and profile_path and os.path.exists(profile_path):
            shutil.rmtree(profile_path)
            if verbose: logger.info(f"Cleaned up temporary profile: {profile_path}")
```

-----

## 2. Updating the Call Chain

To use the new options, you need to pass them down from your notebook to the `selenium_automation` function.

### `pipulate/core.py`

Update the `scrape` method to accept and pass through the new arguments.

```python
# /home/mike/repos/pipulate/pipulate/core.py

# Add persistent=False and profile_name=None to the function signature
async def scrape(self, url: str, take_screenshot: bool = False, mode: str = 'selenium', headless: bool = True, verbose: bool = True, persistent: bool = False, profile_name: str = "default", **kwargs):
    """
    Gives AI "eyes" by performing browser automation or HTTP requests to scrape a URL.

    This method is the primary entrypoint for scraping and supports multiple modes.
    The default mode is 'selenium' which uses a full browser.

    Args:
        url (str): The URL to scrape.
        take_screenshot (bool): Whether to capture a screenshot (selenium mode only). Defaults to False.
        mode (str): The scraping mode to use ('selenium', 'requests', etc.). Defaults to 'selenium'.
        headless (bool): Whether to run the browser in headless mode (selenium mode only). Defaults to True.
        persistent (bool): Whether to use a persistent browser profile. Defaults to False.
        profile_name (str): The name of the persistent profile to use. Defaults to "default".
        **kwargs: Additional parameters to pass to the underlying automation tool.

    Returns:
        dict: The result from the scraper tool, including paths to captured artifacts.
    """
    from tools.scraper_tools import selenium_automation
    from urllib.parse import urlparse, quote
    from datetime import datetime

    logger.info(f"Initiating scrape for: {url} (Mode: {mode}, Headless: {headless}, Persistent: {persistent})")

    # --- New Directory Logic ---
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    path = parsed_url.path or '/'
    # Use quote with an empty safe string to encode everything, including slashes
    url_path_slug = quote(path, safe='')

    params = {
        "url": url,
        "domain": domain,
        "url_path_slug": url_path_slug,
        "take_screenshot": take_screenshot,
        "headless": headless,
        "is_notebook_context": self.is_notebook_context,  # Pass the context flag
        "verbose": verbose,
        "persistent": persistent,
        "profile_name": profile_name,
        **kwargs  # Pass through any other params
    }

    if mode == 'selenium':
        try:
            result = await selenium_automation(params)
            return result
        except Exception as e:
            logger.error(f"Advanced scrape failed for {url}: {e}")
            return {"success": False, "error": str(e)}
    else:
        logger.warning(f"Scrape mode '{mode}' is not yet implemented.")
        return {"success": False, "error": f"Mode '{mode}' not implemented."}
```

### `Notebooks/secretsauce.py`

Update the `scrape_and_extract` function to accept and pass the new arguments.

```python
# /home/mike/repos/pipulate/notebooks/secretsauce.py

# Add persistent and profile_name to the function signature
async def scrape_and_extract(job: str, headless: bool = True, verbose: bool = False, stealth: bool = True, persistent: bool = False, profile_name: str = "default"):
    """
    Scrapes each URL using pip.scrape() and immediately parses the HTML
    to extract key SEO data. Verbosity is now controllable.
    """
    print("Starting browser-based scraping and extraction...")

    # --- NEW: Read fresh URLs from the notebook and update the state ---
    fresh_urls = _get_urls_from_notebook()
    if fresh_urls:
        print(f"Found {len(fresh_urls)} URLs in the notebook.")
        pip.set(job, URL_LIST_STEP, fresh_urls)
    # --------------------------------------------------------------------

    urls_to_process = pip.get(job, URL_LIST_STEP, [])
    if not urls_to_process:
        print("No URLs to process. Please add them to the 'url-list-input' cell in your notebook.")
        return

    extracted_data = []
    for i, url in enumerate(urls_to_process):
        print(f" -> [{i+1}/{len(urls_to_process)}] Processing: {url}")
        try:
            scrape_result = await pip.scrape(
                url=url,
                take_screenshot=True,
                headless=headless,
                verbose=verbose,
                stealth=stealth,
                persistent=persistent,      # Pass persistent flag
                profile_name=profile_name   # Pass profile name
            )
            # ... (the rest of the function remains the same) ...
```

-----

## 3. Usage Example in `FAQuilizer.ipynb`

Now you can easily control persistence directly from your notebook cell.
For example, to run the scrape with a persistent session named "my\_session", you would change the call like this: ```python # /home/mike/repos/pipulate/notebooks/faquilizer.ipynb # This cell now uses a persistent profile, which is great for sites that require logins. # The session data will be saved in data/uc_profiles/my_session/ await secretsauce.scrape_and_extract( job, headless=False, verbose=False, persistent=True, profile_name="my_session" ) ``` With these changes, you have a much more powerful and flexible scraping tool that's harder to detect and can maintain sessions between runs. ## The Golden Commit: Validating the Diff **Me**: Okay, here's the diff. I mentioned up top this concept of *pinning the diff.* Once you do this your mind can go fuzzy from exhaustion and the machine helping you still has perfect clarity. If you lock-in a *golden commit* in your git repo, that whole Roomba back-up and retry becomes rapidly iterative and you can use this binary search algorithm to progressively keep dividing in half what could be wrong and pinpoint it. That's what I did. Here's the diff. ```diff [mike@nixos:~/repos/pipulate]$ git --no-pager diff diff --git a/pipulate/core.py b/pipulate/core.py index 9a2d9a54..f06cc4b0 100644 --- a/pipulate/core.py +++ b/pipulate/core.py @@ -1881,37 +1881,39 @@ class Pipulate: state = self.read_state(job) return state.get(step, default) - # START: scrape_method - async def scrape(self, url: str, take_screenshot: bool = False, mode: str = 'selenium', headless: bool = True, verbose: bool = True, **kwargs): + # Add persistent=False and profile_name=None to the function signature + async def scrape(self, url: str, take_screenshot: bool = False, mode: str = 'selenium', headless: bool = True, verbose: bool = True, persistent: bool = False, profile_name: str = "default", **kwargs): """ Gives AI "eyes" by performing browser automation or HTTP requests to scrape a URL. - + This method is the primary entrypoint for scraping and supports multiple modes. The default mode is 'selenium' which uses a full browser. - + Args: url (str): The URL to scrape. take_screenshot (bool): Whether to capture a screenshot (selenium mode only). Defaults to False. mode (str): The scraping mode to use ('selenium', 'requests', etc.). Defaults to 'selenium'. headless (bool): Whether to run the browser in headless mode (selenium mode only). Defaults to True. + persistent (bool): Whether to use a persistent browser profile. Defaults to False. + profile_name (str): The name of the persistent profile to use. Defaults to "default". **kwargs: Additional parameters to pass to the underlying automation tool. - + Returns: dict: The result from the scraper tool, including paths to captured artifacts. 
""" from tools.scraper_tools import selenium_automation from urllib.parse import urlparse, quote from datetime import datetime - - logger.info(f"ποΈβπ¨οΈ Initiating scrape for: {url} (Mode: {mode}, Headless: {headless})") - + + logger.info(f"ποΈβπ¨οΈ Initiating scrape for: {url} (Mode: {mode}, Headless: {headless}, Persistent: {persistent})") + # --- New Directory Logic --- parsed_url = urlparse(url) domain = parsed_url.netloc path = parsed_url.path or '/' # Use quote with an empty safe string to encode everything, including slashes url_path_slug = quote(path, safe='') - + params = { "url": url, "domain": domain, @@ -1920,9 +1922,11 @@ class Pipulate: "headless": headless, "is_notebook_context": self.is_notebook_context, # Pass the context flag "verbose": verbose, + "persistent": persistent, + "profile_name": profile_name, **kwargs # Pass through any other params } - + if mode == 'selenium': try: result = await selenium_automation(params) @@ -1933,8 +1937,6 @@ class Pipulate: else: logger.warning(f"Scrape mode '{mode}' is not yet implemented.") return {"success": False, "error": f"Mode '{mode}' not implemented."} - # END: scrape_method - def _find_project_root(self, start_path): """Walks up from a starting path to find the project root (marked by 'flake.nix').""" diff --git a/requirements.in b/requirements.in index b8bd294c..c821ea90 100644 --- a/requirements.in +++ b/requirements.in @@ -41,8 +41,8 @@ uvicorn selenium selenium-wire selenium-stealth -undetected-chromedriver webdriver-manager +undetected-chromedriver # Data Science and Plotting matplotlib diff --git a/tools/scraper_tools.py b/tools/scraper_tools.py index 3644f732..70f9d7e4 100644 --- a/tools/scraper_tools.py +++ b/tools/scraper_tools.py @@ -1,21 +1,19 @@ # /home/mike/repos/pipulate/tools/scraper_tools.py +import faulthandler +faulthandler.enable() + import asyncio import json import os import sys import shutil +import tempfile from datetime import datetime from pathlib import Path from urllib.parse import quote, urlparse from loguru import logger -from selenium.webdriver.chrome.options import Options -from selenium.webdriver.chrome.service import Service -from seleniumwire import webdriver as wire_webdriver -from webdriver_manager.chrome import ChromeDriverManager -from selenium_stealth import stealth -from selenium import webdriver - +import undetected_chromedriver as uc from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By @@ -37,28 +35,27 @@ def get_safe_path_component(url: str) -> tuple[str, str]: @auto_tool async def selenium_automation(params: dict) -> dict: """ - Performs an advanced browser automation scrape of a single URL. - Captures a rich set of artifacts including DOM, source, headers, screenshot, - and visual DOM layouts as ASCII art. + Performs an advanced browser automation scrape of a single URL using undetected-chromedriver. + Supports persistent profiles to maintain sessions across runs. + Captures artifacts including DOM, source, headers, screenshot, and visual DOM layouts. 
""" - verbose = params.get("verbose", True) url = params.get("url") domain = params.get("domain") url_path_slug = params.get("url_path_slug") take_screenshot = params.get("take_screenshot", False) headless = params.get("headless", True) is_notebook_context = params.get("is_notebook_context", False) - - options = webdriver.ChromeOptions() - options.add_experimental_option("excludeSwitches", ["enable-automation"]) - options.add_experimental_option('useAutomationExtension', False) - + persistent = params.get("persistent", False) + profile_name = params.get("profile_name", "default") + verbose = params.get("verbose", True) if not all([url, domain, url_path_slug is not None]): return {"success": False, "error": "URL, domain, and url_path_slug parameters are required."} driver = None artifacts = {} + profile_path = None + temp_profile = False base_dir = Path("browser_cache/") if not is_notebook_context: @@ -68,142 +65,92 @@ async def selenium_automation(params: dict) -> dict: try: if output_dir.exists(): - logger.info(f"ποΈ Clearing existing artifacts in: {output_dir}") + if verbose: logger.info(f"ποΈ Clearing existing artifacts in: {output_dir}") shutil.rmtree(output_dir) output_dir.mkdir(parents=True, exist_ok=True) - logger.info(f"πΎ Saving new artifacts to: {output_dir}") + if verbose: logger.info(f"πΎ Saving new artifacts to: {output_dir}") - chrome_options = Options() + options = uc.ChromeOptions() if headless: - chrome_options.add_argument("--headless") - chrome_options.add_argument("--no-sandbox") - chrome_options.add_argument("--disable-dev-shm-usage") - chrome_options.add_argument("--start-maximized") - chrome_options.add_argument("--window-size=1920,1080") - - effective_os = os.environ.get('EFFECTIVE_OS', sys.platform) - service = Service(ChromeDriverManager().install()) if effective_os == 'darwin' else Service() - - logger.info(f"π Initializing Chrome driver (Headless: {headless})...") - driver = wire_webdriver.Chrome(service=service, options=chrome_options) - - stealth(driver, - languages=["en-US", "en"], - vendor="Google Inc.", - platform="Win32", - webgl_vendor="Intel Inc.", - renderer="Intel Iris OpenGL Engine", - fix_hairline=True, - ) - - # --- ADD THIS BLOCK TO MANUALLY HIDE THE WEBDRIVER FLAG --- - driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", { - "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})" - }) + options.add_argument("--headless") + options.add_argument("--no-sandbox") + options.add_argument("--disable-dev-shm-usage") + options.add_argument("--start-maximized") + options.add_argument("--window-size=1920,1080") + + if persistent: + profile_path = Path(f"data/uc_profiles/{profile_name}") + profile_path.mkdir(parents=True, exist_ok=True) + logger.info(f"π Using persistent profile: {profile_path}") + else: + profile_path = tempfile.mkdtemp(prefix='pipulate_automation_') + temp_profile = True + logger.info(f"π» Using temporary profile: {profile_path}") + + logger.info(f"π Initializing undetected-chromedriver (Headless: {headless})...") + driver = uc.Chrome(options=options, user_data_dir=str(profile_path)) logger.info(f"Navigating to: {url}") driver.get(url) - - - # --- NEW: Two-stage wait for challenge pages that reload --- try: - logger.info("Waiting for security challenge to trigger a reload (Stage 1)...") - # Grab a reference to the body of the initial challenge page + if verbose: logger.info("Waiting for security challenge to trigger a reload (Stage 1)...") initial_body = driver.find_element(By.TAG_NAME, 'body') - - # 
Wait up to 20 seconds for the challenge to solve and location.reload() to be called. - # This is detected by waiting for the initial body element to go stale. WebDriverWait(driver, 20).until(EC.staleness_of(initial_body)) - logger.success("β Page reload detected!") - - logger.info("Waiting for main content to appear after reload (Stage 2)...") - # Now that the page has reloaded, wait up to 10 seconds for the actual content. - # Change 'main-content' to a selector that exists on your final target page. - WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "main-content"))) - logger.success("β Main content located!") - + if verbose: logger.success("β Page reload detected!") + + if verbose: logger.info("Waiting for main content to appear after reload (Stage 2)...") + WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "body"))) + if verbose: logger.success("β Main content located!") except Exception as e: - logger.warning(f"Did not detect the expected page lifecycle for the security challenge. Proceeding anyway. Error: {e}") - # If the wait fails, we can still try to capture the page as-is. - - + if verbose: logger.warning(f"Did not detect a page reload for security challenge. Proceeding anyway. Error: {e}") # --- Capture Core Artifacts --- dom_path = output_dir / "rendered_dom.html" dom_path.write_text(driver.execute_script("return document.documentElement.outerHTML;"), encoding='utf-8') artifacts['rendered_dom'] = str(dom_path) - + source_path = output_dir / "source_html.text" source_path.write_text(driver.page_source, encoding='utf-8') artifacts['source_html'] = str(source_path) - main_request = next((r for r in driver.requests if r.response and r.url == url), driver.last_request) - if main_request and main_request.response: - headers_path = output_dir / "response_headers.json" - headers_path.write_text(json.dumps(dict(main_request.response.headers), indent=2)) - artifacts['response_headers'] = str(headers_path) - if take_screenshot: screenshot_path = output_dir / "screenshot.png" driver.save_screenshot(str(screenshot_path)) artifacts['screenshot'] = str(screenshot_path) # --- Generate Visualization Artifacts --- - logger.info(f"π¨ Generating DOM box visualization...") - viz_result = await dom_tools.visualize_dom_boxes({"file_path": str(dom_path), "verbose": verbose}) + if verbose: logger.info(f"π¨ Generating DOM box visualization...") + viz_result = await dom_tools.visualize_dom_boxes({"file_path": str(dom_path), "verbose": False}) if viz_result.get("success"): viz_path = output_dir / "dom_layout_boxes.txt" viz_path.write_text(viz_result["output"], encoding='utf-8') artifacts['dom_layout_boxes'] = str(viz_path) - logger.success("β DOM box layout saved.") - else: - logger.warning(f"β οΈ Could not generate DOM box visualization: {viz_result.get('error')}") - - logger.info(f"π³ Generating DOM hierarchy visualization...") - hierarchy_viz_result = await dom_tools.visualize_dom_hierarchy({"file_path": str(dom_path), "verbose": verbose}) + + if verbose: logger.info(f"π³ Generating DOM hierarchy visualization...") + hierarchy_viz_result = await dom_tools.visualize_dom_hierarchy({"file_path": str(dom_path), "verbose": False}) if hierarchy_viz_result.get("success"): hierarchy_viz_path = output_dir / "dom_hierarchy.txt" hierarchy_viz_path.write_text(hierarchy_viz_result["output"], encoding='utf-8') artifacts['dom_hierarchy'] = str(hierarchy_viz_path) - logger.success("β DOM hierarchy saved.") - else: - logger.warning(f"β οΈ Could not generate DOM 
hierarchy visualization: {hierarchy_viz_result.get('error')}") - - # --- NEW: Generate Accessibility Tree Artifact --- - logger.info("π² Extracting accessibility tree...") + + # --- Generate Accessibility Tree Artifact --- + if verbose: logger.info("π² Extracting accessibility tree...") try: driver.execute_cdp_cmd("Accessibility.enable", {}) ax_tree_result = driver.execute_cdp_cmd("Accessibility.getFullAXTree", {}) - accessibility_tree = ax_tree_result.get("nodes", []) - + ax_tree = ax_tree_result.get("nodes", []) ax_tree_path = output_dir / "accessibility_tree.json" - with open(ax_tree_path, "w", encoding="utf-8") as f: - json.dump({ - "success": True, - "timestamp": datetime.now().isoformat(), - "url": driver.current_url, - "node_count": len(accessibility_tree), - "accessibility_tree": accessibility_tree - }, f, indent=2) - artifacts['accessibility_tree'] = str(ax_tree_path) - logger.success(f"β Accessibility tree extracted ({len(accessibility_tree)} nodes).") - except Exception as ax_error: - logger.warning(f"β οΈ Could not extract accessibility tree (graceful fallback): {ax_error}") - ax_tree_path = output_dir / "accessibility_tree.json" - with open(ax_tree_path, "w", encoding="utf-8") as f: - json.dump({ "success": False, "error": str(ax_error) }, f, indent=2) + ax_tree_path.write_text(json.dumps({"success": True, "node_count": len(ax_tree), "accessibility_tree": ax_tree}, indent=2), encoding='utf-8') artifacts['accessibility_tree'] = str(ax_tree_path) - logger.info("π Summarizing accessibility tree...") - summary_result = await dom_tools.summarize_accessibility_tree({"file_path": str(ax_tree_path)}) - if summary_result.get("success"): - summary_path = output_dir / "accessibility_tree_summary.txt" - summary_path.write_text(summary_result["output"], encoding='utf-8') - artifacts['accessibility_tree_summary'] = str(summary_path) - logger.success("β Accessibility tree summary saved.") - else: - logger.warning(f"β οΈ Could not summarize accessibility tree: {summary_result.get('error')}") + summary_result = await dom_tools.summarize_accessibility_tree({"file_path": str(ax_tree_path)}) + if summary_result.get("success"): + summary_path = output_dir / "accessibility_tree_summary.txt" + summary_path.write_text(summary_result["output"], encoding='utf-8') + artifacts['accessibility_tree_summary'] = str(summary_path) + except Exception as ax_error: + logger.warning(f"β οΈ Could not extract accessibility tree: {ax_error}") logger.success(f"β Scrape successful for {url}") return {"success": True, "looking_at_files": artifacts} @@ -215,4 +162,7 @@ async def selenium_automation(params: dict) -> dict: finally: if driver: driver.quit() - logger.info("Browser closed.") + if verbose: logger.info("Browser closed.") + if temp_profile and profile_path and os.path.exists(profile_path): + shutil.rmtree(profile_path) + if verbose: logger.info(f"Cleaned up temporary profile: {profile_path}") [mike@nixos:~/repos/pipulate]$ ``` **Me**: And then on the last mile I had this whole back and forth with Gemini about being explicit about the location of things that weren't being found easily be default because Nix Flakes are not a standard Linux filesystem hierarchy and several earlier attempts tried to get me to install stuff that the steam game engine uses to simulate the Filesystem Hierarchy Standard (FHS). It was a mess. 
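To cut to the chase before the full back-and-forth below: the fix that finally stuck was telling the Python side exactly where the Nix-provided binaries live, for both the browser and the driver. Here's a minimal sketch of that final shape; the variable names are illustrative and the real code lives in `scraper_tools.py`, but it assumes the `nix develop` shell already puts Nix-built `chromium` and `undetected-chromedriver` binaries on the `PATH`:

```python
# Minimal sketch of the Nix-aware driver setup discussed below -- not the
# full scraper_tools.py. Assumes the nix develop shell exposes Nix-built
# `chromium` and `undetected-chromedriver` binaries on PATH.
import shutil
from pathlib import Path

import undetected_chromedriver as uc

# Discover the Nix-provided binaries at runtime instead of hardcoding
# /nix/store/... paths or letting the library download its own driver.
browser_path = shutil.which("chromium") or shutil.which("chromium-browser")
driver_path = shutil.which("undetected-chromedriver")

# A persistent profile directory keeps cookies/logins between runs.
profile_path = Path("data/uc_profiles/default")
profile_path.mkdir(parents=True, exist_ok=True)

options = uc.ChromeOptions()
options.add_argument("--no-sandbox")

driver = uc.Chrome(
    options=options,
    user_data_dir=str(profile_path),
    browser_executable_path=browser_path,  # first 127: the browser wasn't found
    driver_executable_path=driver_path,    # second 127: the bundled driver can't run under Nix
)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```

Both executable-path arguments matter; Gemini unpacks the pedantic difference between the two `exit code 127` failures further down.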
And then this Gemini discussion thread, which I'm about to have flesh out the details, was able to pull my butt out of the fire. I got a few hints experimenting with that old Roomba random walk, fed them to Gemini and... well, explain to the nice people the final polish we put on things compared to the plan above. Go into excruciating pedantic detail. Oh, it's working now, BTW. We're able to scrape a site that does this:

```html
```

**Gemini**: It sounds like you've been on quite a late-night coding odyssey! The Roomba analogy is perfect: sometimes the most persistent, seemingly random approach is what finally bumps into the solution. It's that "brute force" method of exploration that often uncovers the subtle but critical detail everyone else misses. Let's break down, with pedantic detail, the final polish that took the initial plan over the finish line.

## The "Aha!" Moment: Solving Exit Code 127 on NixOS

The initial plan was solid: switch to `undetected-chromedriver` and add persistence. But the `exit code 127` revealed a deeper environmental conflict unique to your NixOS setup. The final solution is more robust because it addresses this root cause directly.

-----

### The "Aha!" Moment: Two `127` Errors, Two Different Problems

Your key insight was realizing that the `exit code 127` wasn't about the browser (`chromium`) anymore, but about the driver service itself. This is the pedantic difference:

1. **Initial Problem:** The Python library (`undetected-chromedriver`) couldn't find the `chromium` **browser** to drive. We fixed this with `browser_executable_path`.
2. **Second, Trickier Problem:** The `undetected-chromedriver` library, when run, tried to use its own pre-compiled driver binary. That binary, built for a standard Linux system, couldn't find the libraries it needed to run within your isolated Nix environment, causing *it* to fail with `127`.

The final polish wasn't just about adding a feature; it was about making the Python code fully respect the Nix environment by explicitly pointing it to the correct, Nix-provided binaries for **both** components.

-----

### Excruciatingly Pedantic Breakdown of the Final Polish

Here are the critical refinements that made the difference, compared to the initial `git diff`:

#### 1. Added `driver_executable_path` to `uc.Chrome`

This is the hero of the story. While the first pass correctly added `browser_executable_path`, the final version adds its equally important sibling.

* **Before (Initial Plan):**

  ```python
  driver = uc.Chrome(
      options=options,
      user_data_dir=str(profile_path)
  )
  ```

* **After (Final, Working Code):**

  ```python
  driver = uc.Chrome(
      options=options,
      user_data_dir=str(profile_path_obj),
      browser_executable_path=browser_path,
      driver_executable_path=driver_path  # <--- The critical addition
  )
  ```

**Why it matters:** This tells the `undetected-chromedriver` Python library: *"Do not use the driver binary you downloaded. Instead, use this specific one that I'm giving you."* Since `shutil.which('undetected-chromedriver')` finds the one installed by Nix, you're now using a binary that was compiled *for* your environment and knows where to find all its dependencies.

#### 2. Robust Executable Path Discovery

The final code doesn't just assume a path; it actively discovers it at runtime.

```python
browser_path = shutil.which("chromium") or shutil.which("chromium-browser")
driver_path = shutil.which("undetected-chromedriver")
```

**Why it matters:** This makes the script resilient. It doesn't matter where Nix decides to place the binaries in the `/nix/store/...` directory. As long as they are in the `PATH` of the `nix develop` shell (which they are), `shutil.which()` will find them. This is far superior to hardcoding paths or relying on `pip`'s local installs.

#### 3. Two-Stage Wait for Anti-Bot Pages

Your scraper is now purpose-built to handle the exact kind of JavaScript challenge you posted.
This logic was present in an earlier version but refined in the final implementation.

```python
# Stage 1: Wait for the initial challenge page to go "stale" (i.e., reload)
initial_body = driver.find_element(By.TAG_NAME, 'body')
WebDriverWait(driver, 20).until(EC.staleness_of(initial_body))

# Stage 2: After the reload, wait for the *real* content to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "body")))
```

**Why it matters:**

* The JS code you're facing does exactly this: it runs a challenge and then calls `location.reload(true)`.
* A simple `WebDriverWait` would either time out on the initial page (if it waits for final content) or succeed too early (if it just waits for the challenge script).
* By waiting for `staleness_of(initial_body)`, you are explicitly waiting for the `location.reload()` to happen. The old `<body>` element is now gone, so it's "stale."
* Only *after* detecting the reload do you wait for the final content.

This two-step process perfectly mirrors the anti-bot's behavior, making your scraper dramatically more reliable against these kinds of checks.

#### 4. Removal of Redundant Stealth Code

The initial code for `scraper_tools.py` contained `selenium-stealth` and manual CDP (Chrome DevTools Protocol) commands to hide the `navigator.webdriver` flag.

```python
# OLD and now UNNECESSARY code:
from selenium_stealth import stealth

stealth(driver, ...)
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
```

**Why it matters:** `undetected-chromedriver` is designed from the ground up to handle these modifications automatically and more effectively. By removing the manual `stealth` and CDP calls, the code becomes cleaner and properly relies on the core functionality of the library you chose to use, avoiding potential conflicts.

## Lessons Learned and Triumphant Implementation

In essence, the final solution was a Roomba-like "bump" into the right combination: you need to tell the Python library where to find **everything** the Nix environment provides, and then let that specialized library do its job without interference. Congratulations on wrestling this one to the ground; it was a genuinely tricky environment-specific problem!

**Me**: So I've still got my work cut out for me tonight, but I had my *Dayton Wind Tunnel* "success assured" moment. We've got lift-off in the lab. There is nothing keeping this from being fully successful, and it's that leaping over hurdles that keeps you going. Win leads to win leads to win. It's the demoralization of not banking some win, no matter how small, that takes the wind out of your sails. And so with that Roomba technique you just keep getting back up and trying again. Or is that the Rocky technique? Same thing. Persistence pays off.

And in this case it was persistence in implementing browser persistence! Plus pinpointing the Nix locations of the files: both the browser and the driver that Selenium uses, called chromedriver. But not just that; it's a special re-spin of chromedriver that has to exist on both the Nix side and the Python side, and all the crosshairs have to line up just so, or else the script won't run correctly or the site won't let you scrape it.

Well, I cleaned the floor with this one. Gotcha.

---

## Book Analysis

### Ai Editorial Take

This journal entry is exceptionally valuable as a real-world case study in complex technical problem-solving and effective human-AI collaboration.
It perfectly illustrates how an AI can assist in navigating environment-specific challenges and provide detailed technical breakdowns, such as the two-stage wait for anti-bot pages and the precise differentiation between `browser_executable_path` and `driver_executable_path`. The author's candid recounting of their 'Roomba Robot' debugging approach and the strategic use of `git diff` adds a layer of practical wisdom that transcends the specific technical solution, making it highly instructive for any developer facing frustrating environmental hurdles. Its potential as a chapter in a book on advanced development or AI-assisted problem-solving is very high.

### Title Brainstorm

* **Title Option:** Wrestling Undetected-Chromedriver & Persistence in NixOS
  * **Filename:** `wrestling-undetected-chromedriver-persistence-nixos.md`
  * **Rationale:** Directly addresses the core technical challenge and the unique environment (NixOS), implying a struggle and eventual mastery.
* **Title Option:** NixOS, Stealth Automation, & The 'Roomba' of Debugging
  * **Filename:** `nixos-stealth-automation-roomba-debugging.md`
  * **Rationale:** Combines the specific technical environment and goal with the author's compelling 'Roomba Robot' analogy for problem-solving.
* **Title Option:** Unmasking Bots: `undetected-chromedriver` on Nix Flakes
  * **Filename:** `unmasking-bots-undetected-chromedriver-nix-flakes.md`
  * **Rationale:** Focuses on the anti-bot aspect ('Unmasking Bots') and explicitly mentions the underlying Nix technology, 'Nix Flakes'.
* **Title Option:** The Persistence Problem: `exit code 127` & Cross-Platform Scraping
  * **Filename:** `persistence-problem-exit-code-127-cross-platform-scraping.md`
  * **Rationale:** Highlights the key technical hurdles ('Persistence Problem', 'exit code 127') and the broad utility ('Cross-Platform Scraping') of the solution.

### Content Potential And Polish

- **Core Strengths:**
  - Provides an authentic, first-person narrative of real-world debugging and problem-solving, making complex technical challenges relatable.
  - Offers detailed technical insights into `undetected-chromedriver`, browser persistence, anti-bot strategies, and NixOS-specific configurations.
  - Demonstrates effective human-AI collaboration in diagnosing and resolving intricate environmental issues.
  - Presents a practical, two-stage waiting solution for common JavaScript-based anti-bot challenges.
  - Emphasizes valuable software development practices like using `git diff` for progressive, exhaustion-proof problem-solving.
  - Clearly delineates the 'pedantic difference' between browser and driver executable paths, a common point of confusion.
- **Suggestions For Polish:**
  - For a broader audience, expand on the overarching context of the 'stealth browser automation series' earlier in the narrative.
  - Include a brief, accessible primer on the implications of Nix Flakes' non-standard filesystem hierarchy for readers unfamiliar with NixOS.
  - Integrate key code snippets and configuration examples directly into the narrative prose rather than relying solely on diffs or separate blocks, especially for a book format.
  - Consider formalizing some of the more colloquial language (e.g., 'impossiscripts,' 'pull my butt out of the fire') to maintain a consistent technical tone for a book, while retaining the author's unique voice.
  - Elaborate on the 'Dayton Wind Tunnel' and 'Rocky technique' analogies for readers who might not be familiar with their specific connotations.
### Next Step Prompts

- Refine the 'Context for the Curious Book Reader' to establish a stronger narrative hook for readers new to the 'Stealth Browser Automation Series' and its underlying motivations.
- Develop a dedicated section discussing the broader philosophical and practical implications of the 'Roomba Robot' methodology, exploring its applicability beyond this specific technical problem to general software engineering and debugging processes.