---
title: Getting Past the GSC 1000-row Limit With the API
permalink: /futureproof/gsc-1000-row-limit-api/
description: In this post, I share my journey developing a data-driven SEO workflow using Jupyter Notebooks, enhanced with AI assistance from Cursor AI and Claude, to tackle the persistent challenge of extracting and analyzing Google Search Console data beyond the basic interface limits. I detail the process from setting up secure GSC API access with a service account, crafting Python code to fetch comprehensive performance data (including handling pagination and finding the most recent available dates), identifying promising "striking distance" keywords, cleaning that list for tools like SEMrush, and finally merging GSC insights with SEMrush metrics to generate a reusable template for optimizing article titles and permalinks, all part of my effort to achieve hockey-stick growth through consistent, technically-informed content optimization.
meta_description: Overcome the Google Search Console 1000-row limit with a Python/Jupyter API workflow that uses AI to paginate data and extract striking keywords.
meta_keywords: google search console, gsc api, 1000-row limit, python, jupyter notebook, cursor ai, claude, seo workflow, pagination, striking distance keywords, semrush, data-driven seo, technical seo, ai assistance, api integration
layout: post
sort_order: 4
---

{% raw %}

## The Challenge of Following Through

Follow-through is the hardest thing to do. But I need to demonstrate the hockey-stick curve of a "new" site taking off according to its GSC data, and this time in the future-proofed subject-matter area of technological future-proofing! Hence the directory name this blog is planted in.

## Deep Research into GSC Data Extraction

In my last article, I ***Deep Researched*** one of the simultaneously challenging and common tasks in the field of SEO: downloading past the first 1000 rows in GSC. I've been doing that since there was an API -- long before there was an integrated Google Cloud Platform (GCP) unified site. You'd think I could do it in my sleep by now, but it's a re-discovery process every time.

### Automating SEO Processes with Pipulate

But now that the latest and greatest Pipulate is on the verge of release, we're going to bottle up as many of these processes as we can. It all starts with a bit of research to flesh out the process (done), then a Jupyter Notebook -- now with AI assistance in Cursor AI -- as the implementation (this post), and then all we have left is the hardest part: persistence over time!

### Building Daily Writing Habits for Growth

The reason this project is happening, despite it looking just like those constantly appearing rabbit-hole distraction projects I swat down, is that I'm writing every day now. I'm in the habit and it's helping the quality of my professional work and my personal project (deeply intertwined). And as I write and get the content-snowball rolling, it promises to be a hockey-stick growth curve given the now-mainstream topics I write about -- if only I can make the tiniest directional adjustments to walk into the path of the traffic.

### Productivity Framework

We look at my 3 daily productivity metaphors, designed to work together to get the snowball effect happening:

- What's most broken?
- Where do I get the biggest bang for the buck?
- What plates need to be spun?

## Optimizing Daily Content for Traffic Growth

Making my daily published articles perform significantly better hits all 3 points.
Them not producing more traffic for me today is what's most broken, and making them do so -- working for me 24/7/365 the way organic search does -- is the biggest bang for the buck. It's also getting plates spinning, because people are always searching, and these are like zero-budget marketing campaigns always running for you. Not that I'm selling anything except for my abilities. And all the signs are positive. They are telling me that if I just lean into this thing, the hockey-stick curve goes up and to the right!

### Setting Up Google Search Console API Access

So I follow the instructions from the prior post. Okay, so I enable the Google Search Console API in the Google Cloud Platform (GCP). Both Grok's and ChatGPT's instructions gave me the GCP site directly in the instructions, and Google's own Gemini didn't. Haha! I go to GCP. I already have a GCP project, great! I'll recycle that. These instructions won't be exhaustive.

#### Activating and Configuring API Access

I activate the Google Search Console API. This you have to do no matter what authentication technique you use. Google services are all deny-first and have to be selectively activated piecemeal. Okay, done.

#### Choosing Authentication Method: OAuth vs Service Account

Now the choice. We go with either OAuth or a service account (an email address, insofar as the user deals with it). I've always used OAuth in the past, in the vision of making things where other users would never have to go to the GCP site and make projects and such. I've given all that up. Everything is self-hosted now and zero tech-liability on my part. If folks want this feature, they do a bit of development work, so I went with a service account this time.

- I go to IAM & Admin and create a service account.
- I copy the email address.
- We set the Role to Owner and press Done.
- We go to the triple dots on the newly created service account and select `Manage keys`.
- We select `Add key` / `Create new key`.
- We select `JSON` on the prompt that pops up.
- It reports: "Private key saved to your computer" -- and it's in the download directory!

We put `*.json` and `service_account.txt` lines in our `.gitignore` file so that they never accidentally get committed to the git repository and show up on GitHub!

I go to Google Search Console itself now (different site), visit GSC / `Settings` / `Users and permissions`, and add the email with full access.

## Testing API Connectivity

### Using Cursor AI for Jupyter Development

Okay, theoretically we should be able to blast the pipes out now and make sure it's all working. What's different now with my process is that the Jupyter Notebook will be directly in Cursor AI -- a game-changer for me.

### Providing Context to Claude

I'll give Claude the previous article as context. And this article too, for that matter. It now knows that I have a service account.

### Requesting a Secure API Example

Claude, I would like to do the GSC API example -- I think Gemini gives a good one -- to ensure my API connectivity. I have the service account. Make it so that the example uses the context manager to read it from a 1-line text file named `service_account.txt` into a variable, so that the example doesn't have to show the email address.

---

**Claude 3.7 Sonnet**:

<think>
You're asking me to create a Python example for connecting to the Google Search Console API using a service account. You've mentioned that you want to read the service account email from a file called `service_account.txt` for security reasons.
Based on your previous articles and the research you mentioned, I'll create a simple example that: 1. Uses a service account for authentication 2. Reads the service account email from a file 3. Tests connectivity to the GSC API 4. Lists available sites/properties I'll use the Google API Python Client library which is the standard way to interact with Google APIs in Python. </think> Here's a simple example to test your Google Search Console API connectivity using a service account: ```python from google.oauth2 import service_account from googleapiclient.discovery import build import json # Read service account email from text file with open('service_account.txt', 'r') as f: service_account_email = f.read().strip() # Path to your service account key JSON file SERVICE_ACCOUNT_FILE = 'service-account-key.json' # Define the required scopes SCOPES = ['https://www.googleapis.com/auth/webmasters'] # Authenticate using service account credentials = service_account.Credentials.from_service_account_file( SERVICE_ACCOUNT_FILE, scopes=SCOPES) # Build the service webmasters_service = build('webmasters', 'v3', credentials=credentials) # List sites (properties) accessible to this service account site_list = webmasters_service.sites().list().execute() # Print the list of sites print("Sites accessible to this service account:") for site in site_list.get('siteEntry', []): url = site.get('siteUrl') permission_level = site.get('permissionLevel') print(f"URL: {url}, Permission: {permission_level}") print(f"\nTotal sites: {len(site_list.get('siteEntry', []))}") print(f"Service account email: {service_account_email}") ``` For this to work, you'll need: 1. A file named `service_account.txt` containing just your service account email address 2. Your service account key JSON file (named `service-account-key.json` in this example), which you need to download from GCP To download the key JSON file: 1. Go to the GCP Console 2. Navigate to IAM & Admin > Service Accounts 3. Find your service account and click on the three dots menu 4. Select "Manage keys" 5. Click "ADD KEY" > "Create new key" 6. Choose JSON format and click "Create" 7. Save the downloaded file as `service-account-key.json` in the same directory as your script This example simply lists all the GSC properties (websites) that your service account has access to, which is a good basic test of connectivity. If you see your sites listed, you're good to go! --- Output: Sites accessible to this service account: URL: sc-domain:mikelev.in, Permission: siteFullUser Total sites: 1 --- **Me**: Excellent! That worked right out of the starting gate! Let's design the code for the next Jupyter Cell. You don't have to re-import anything! Assume all objects from this code-block are already in memory. Just pick up writing the program from where you left off. Infer the next step from the article. Use the first siteEntry in the site_list. **Claude 3.7**: <think> Based on the article and the current working code, the next logical step is to query the Google Search Console API for performance data. Specifically, we're looking to retrieve "striking distance" keywords (those ranking just outside the top 10 positions). Looking at the article, particularly the sections about API structure and query options, we need to: 1. Get a site URL from the first entry in site_list 2. Create a query to get performance data with query and page dimensions 3. Implement pagination to get past the row limits 4. 
Store the results for later analysis The key focus is to get keyword-page combinations with their respective metrics (clicks, impressions, CTR, position). Let's design the next cell that will: 1. Extract the first site URL from site_list 2. Set up a date range (we might want to make this configurable) 3. Create and execute an initial query for performance data 4. Implement pagination to get all results 5. Format the results into a usable structure (preparing for Pandas later) </think> Here's the code for the next cell that queries performance data using the API: ```python # Extract the first site URL from the list site_url = site_list['siteEntry'][0]['siteUrl'] print(f"Querying data for site: {site_url}") # Set date range - last 3 months (adjust as needed) from datetime import datetime, timedelta end_date = datetime.now().date() - timedelta(days=3) # 3-day lag for data finalization start_date = end_date - timedelta(days=90) # 90 days back # Format dates as strings start_date_str = start_date.strftime('%Y-%m-%d') end_date_str = end_date.strftime('%Y-%m-%d') print(f"Date range: {start_date_str} to {end_date_str}") # Prepare the query request request = { 'startDate': start_date_str, 'endDate': end_date_str, 'dimensions': ['query', 'page'], 'rowLimit': 25000, # Maximum allowed per request 'startRow': 0 } # Function to handle pagination and collect all results def fetch_all_gsc_data(service, site_url, request): all_rows = [] start_row = 0 page_count = 0 while True: page_count += 1 request['startRow'] = start_row print(f"Fetching page {page_count} (rows {start_row} to {start_row + request['rowLimit'] - 1})...") response = service.searchanalytics().query(siteUrl=site_url, body=request).execute() rows = response.get('rows', []) if not rows: print("No more data to fetch.") break print(f"Retrieved {len(rows)} rows") all_rows.extend(rows) if len(rows) < request['rowLimit']: print("Reached last page of results.") break start_row += len(rows) print(f"Total rows fetched: {len(all_rows)}") return all_rows # Execute the query and collect all pages of results all_data = fetch_all_gsc_data(webmasters_service, site_url, request) # Preview the first few results print("\nSample of retrieved data:") for row in all_data[:3]: query = row['keys'][0] page = row['keys'][1] clicks = row['clicks'] impressions = row['impressions'] ctr = row['ctr'] position = row['position'] print(f"Query: '{query}', Page: '{page}'") print(f" Clicks: {clicks}, Impressions: {impressions}, CTR: {ctr:.2%}, Position: {position:.2f}") print("-" * 80) ``` This code: 1. Takes the first site from your authenticated sites 2. Sets up a date range for the last 90 days (ending 3 days ago to account for data finalization) 3. Creates a function to handle pagination, fetching all available results 4. Executes the query with both 'query' and 'page' dimensions 5. Provides a preview of the first few results with nicely formatted output The pagination implementation handles the API's limits by requesting data in chunks of 25,000 rows (the maximum allowed) until either no more data is returned or we get fewer rows than requested, indicating we've reached the end. This sets us up nicely for the next step, which would be to convert this data to a pandas DataFrame for analysis of "striking distance" keywords. --- Output: ```plaintext Querying data for site: sc-domain:mikelev.in Date range: 2024-12-25 to 2025-03-25 Fetching page 1 (rows 0 to 24999)... Retrieved 1030 rows Reached last page of results. 
Total rows fetched: 1030

Sample of retrieved data:
Query: 'fasthtml database', Page: 'https://mikelev.in/futureproof/unpacking-fasthtml-databases/'
  Clicks: 16, Impressions: 85, CTR: 18.82%, Position: 4.45
--------------------------------------------------------------------------------
Query: 'ollama websocket', Page: 'https://mikelev.in/futureproof/ollama-websocket-chat/'
  Clicks: 10, Impressions: 140, CTR: 7.14%, Position: 3.68
--------------------------------------------------------------------------------
Query: 'peak data theory', Page: 'https://mikelev.in/futureproof/peak-data-musk-sutskever-wrong/'
  Clicks: 9, Impressions: 400, CTR: 2.25%, Position: 6.23
--------------------------------------------------------------------------------
```

**Me**: Yeah, again you nailed it out of the starting gate. Now the thing is, getting right up-to-the-latest data is really important to me. So before starting such a process, formulate a 1-day query for yesterday (the most recent date for which data may be available) and check for data. Make it a very simple query. If it comes up empty, subtract a day, check, and repeat until you have the most recent end date for a date range. I will put this in a code block all by itself, separate from all other code, specifically branding this cell as finding the most recent GSC data. Please give me the code for that cell. Anything executed from the first code block is still in memory, because what you're designing now will be the cell block that comes after it. You can also establish a standard variable name for the site we're processing, pulled from the first position in the site list, which we can use moving forward.

**Claude 3.7**:

<think>
Let me design a cell that determines the most recent date for which GSC data is available. The approach will be:

1. Set up a query for yesterday first
2. If no data is found, keep moving backward one day at a time until data is found
3. Once data is found, that will be our most recent date that has data
4.
Store this in a variable that can be used in future cells I'll need to: - Use a simple query with minimal dimensions to check data availability quickly - Create a loop to check dates moving backward - Make the code clean and well-commented - Name the site variable clearly for future use - Include good print statements to show progress Let me create this code: </think> Here's a cell to find the most recent date for which GSC data is available: ````python # ====================================================================== # CELL: Find Most Recent GSC Data Date # ====================================================================== import time from datetime import datetime, timedelta # Define the site we're working with (from first code block) target_site = site_list['siteEntry'][0]['siteUrl'] print(f"Finding most recent data for site: {target_site}") # Function to check if data exists for a specific date def check_date_has_data(service, site_url, check_date): # Create a simple 1-day query with minimal dimensions date_str = check_date.strftime('%Y-%m-%d') test_request = { 'startDate': date_str, 'endDate': date_str, 'dimensions': ['query'], # Just query dimension for a quick check 'rowLimit': 1 # We only need to know if any data exists } # Execute the query response = service.searchanalytics().query(siteUrl=site_url, body=test_request).execute() return len(response.get('rows', [])) > 0 # Start with yesterday and work backwards current_date = datetime.now().date() - timedelta(days=1) max_days_to_check = 10 # Limit how far back we'll check days_checked = 0 print(f"Starting with date: {current_date}") # Loop until we find data or hit our limit while days_checked < max_days_to_check: print(f"Checking if data exists for {current_date}...", end=" ") if check_date_has_data(webmasters_service, target_site, current_date): print("✓ Data found!") most_recent_data_date = current_date break else: print("✗ No data") current_date -= timedelta(days=1) days_checked += 1 time.sleep(0.5) # Small pause to avoid rate limiting if days_checked >= max_days_to_check: print(f"Warning: Couldn't find data in the last {max_days_to_check} days") most_recent_data_date = current_date + timedelta(days=1) # Use the last date we didn't check else: print(f"\nMost recent GSC data available is for: {most_recent_data_date}") # Calculate a reasonable default date range (3 months ending at most recent date) default_end_date = most_recent_data_date default_start_date = default_end_date - timedelta(days=90) print(f"Recommended date range for full queries:") print(f" Start: {default_start_date}") print(f" End: {default_end_date}") ```` This cell: 1. Sets `target_site` as our standard variable for the site we're processing 2. Creates a function to check if data exists for a specific date 3. Starts with yesterday and works backward one day at a time 4. Stops when it finds a date with data or hits a safety limit (10 days) 5. Sets `most_recent_data_date` as the most recent date with data 6. Calculates a recommended date range (90 days ending at the most recent date) 7. Provides clear output with checkmarks/X marks for visual feedback This approach ensures we always start with the most current data available, which is important for timely SEO analysis. --- Output: ```plaintext Finding most recent data for site: sc-domain:mikelev.in Starting with date: 2025-03-27 Checking if data exists for 2025-03-27... ✗ No data Checking if data exists for 2025-03-26... ✓ Data found! 
Most recent GSC data available is for: 2025-03-26 Recommended date range for full queries: Start: 2024-12-26 End: 2025-03-26 ``` **Me**: Wonderful! So the date-range is only going to go back to 3 days prior to the last date we found data for because I want this to be very directional in nature. Go ahead and revise this code-block to use the new variable names you just established, which are still in memory. Load the data into a dataframe and show me at least 30 rows. Make sure there's no number of row or column limits, and no column width limits. **Claude 3.7**: <think> Let me understand what needs to be done: 1. Revise the previous GSC query code to use the new variables (`target_site`, `most_recent_data_date`, `default_start_date`, `default_end_date`) 2. Use the date range established in the "Find Most Recent GSC Data Date" cell 3. Load all the retrieved data into a pandas DataFrame 4. Display at least 30 rows with no row/column limits Here's how I'll structure the code: ```python import pandas as pd # Set pandas display options to show all rows/columns without limits pd.set_option('display.max_rows', None) pd.set_option('display.max_columns', None) pd.set_option('display.width', None) pd.set_option('display.max_colwidth', None) # Prepare the query request using the dates we found request = { 'startDate': default_start_date.strftime('%Y-%m-%d'), 'endDate': default_end_date.strftime('%Y-%m-%d'), 'dimensions': ['query', 'page'], 'rowLimit': 25000, # Maximum allowed per request 'startRow': 0 } print(f"Querying data for site: {target_site}") print(f"Date range: {default_start_date} to {default_end_date}") # Use the same fetch_all_gsc_data function from before all_data = fetch_all_gsc_data(webmasters_service, target_site, request) # Convert the API response to a DataFrame df = pd.DataFrame(all_data) # Split the 'keys' column into separate columns if 'keys' in df.columns: # The first element in keys is 'query', the second is 'page' df['query'] = df['keys'].apply(lambda x: x[0]) df['page'] = df['keys'].apply(lambda x: x[1]) df = df.drop('keys', axis=1) # Preview the DataFrame print("\nDataFrame Preview (30 rows):") print(df.head(30)) # Print basic stats about the data print(f"\nTotal number of rows: {len(df)}") print(f"Unique queries: {df['query'].nunique()}") print(f"Unique pages: {df['page'].nunique()}") ``` This should: 1. Set up pandas to display everything without truncation 2. Use the date range we discovered in the previous cell 3. Query the GSC API using the existing function 4. Convert the API response to a pandas DataFrame 5. Split the 'keys' array into separate columns for query and page 6. Display the first 30 rows and some basic statistics I need to ensure the column types are appropriate (e.g., numeric columns should be floats, not strings), but this should work well for the initial conversion to a DataFrame. 
</think> Here's the revised code to use the established variables, load the data into a DataFrame, and display at least 30 rows with no display limits: ```python # ====================================================================== # CELL: Query GSC Data with Latest Date Range and Convert to DataFrame # ====================================================================== import pandas as pd # Set pandas display options to show all rows/columns without limits pd.set_option('display.max_rows', None) pd.set_option('display.max_columns', None) pd.set_option('display.width', None) pd.set_option('display.max_colwidth', None) # Prepare the query request using the dates we found request = { 'startDate': default_start_date.strftime('%Y-%m-%d'), 'endDate': default_end_date.strftime('%Y-%m-%d'), 'dimensions': ['query', 'page'], 'rowLimit': 25000, # Maximum allowed per request 'startRow': 0 } print(f"Querying data for site: {target_site}") print(f"Date range: {default_start_date.strftime('%Y-%m-%d')} to {default_end_date.strftime('%Y-%m-%d')}") # Execute the query using our existing function all_data = fetch_all_gsc_data(webmasters_service, target_site, request) # Convert the API response to a DataFrame df = pd.DataFrame(all_data) # Split the 'keys' column into separate columns if 'keys' in df.columns: # The first element in keys is 'query', the second is 'page' df['query'] = df['keys'].apply(lambda x: x[0]) df['page'] = df['keys'].apply(lambda x: x[1]) df = df.drop('keys', axis=1) # Ensure metrics are numeric types for col in ['clicks', 'impressions', 'position', 'ctr']: if col in df.columns: df[col] = pd.to_numeric(df[col]) # Convert CTR to percentage for better readability df['ctr'] = df['ctr'] * 100 # Preview the DataFrame print("\nDataFrame Preview (30 rows):") print(df.head(30)) # Print basic stats about the data print(f"\nTotal number of rows: {len(df)}") print(f"Unique queries: {df['query'].nunique()}") print(f"Unique pages: {df['page'].nunique()}") print(f"Total clicks: {df['clicks'].sum()}") print(f"Total impressions: {df['impressions'].sum()}") print(f"Average position: {df['position'].mean():.2f}") print(f"Average CTR: {df['ctr'].mean():.2f}%") ``` This code: 1. Sets pandas display options to show all data without truncation 2. Uses the date range we discovered in the previous cell 3. Queries the GSC API using our existing function 4. Converts the API response to a pandas DataFrame 5. Extracts query and page from the keys array into separate columns 6. Ensures all metric columns have the correct numeric data types 7. Converts CTR to percentage format for better readability 8. Displays the first 30 rows of the DataFrame 9. Provides useful summary statistics about the dataset The DataFrame will now be ready for further analysis, especially identifying "striking distance" keywords. --- Output: ```plaintext Querying data for site: sc-domain:mikelev.in Date range: 2024-12-26 to 2025-03-26 Fetching page 1 (rows 0 to 24999)... Retrieved 1067 rows Reached last page of results. 
Total rows fetched: 1067 DataFrame Preview (30 rows): clicks impressions ctr position query \ 0 16 85 0.188235 4.400000 fasthtml database 1 10 139 0.071942 3.683453 ollama websocket 2 9 400 0.022500 6.230000 peak data theory 3 7 86 0.081395 3.244186 levinux 4 7 127 0.055118 2.677165 pipulate 5 6 85 0.070588 7.811765 nix flake python 6 5 12 0.416667 4.250000 cursor ai with jupyter notebook 7 4 47 0.085106 23.744681 fasthtml 8 4 33 0.121212 3.606061 grok markdown 9 3 61 0.049180 35.786885 fasthtml 10 2 3 0.666667 39.333333 fastapi patterns 11 2 4 0.500000 16.750000 fasthtml 12 2 36 0.055556 9.416667 pipulate 13 2 65 0.030769 7.000000 pipulate 14 2 49 0.040816 7.510204 python nix flake 15 1 3 0.333333 8.666667 chatgpt o1 python 16 1 5 0.200000 6.200000 cursor ide jupyter notebook 17 1 9 0.111111 9.666667 cursor jupyter 18 1 13 0.076923 7.000000 cursor jupyter notebook 19 1 10 0.100000 10.400000 cursor jupyter notebook support 20 1 1 1.000000 7.000000 cursor python notebook 21 1 4 0.250000 9.500000 fastapi vs fasthtml 22 1 7 0.142857 5.000000 fasthtml 23 1 48 0.020833 8.479167 fasthtml database 24 1 18 0.055556 9.111111 fasthtml review 25 1 28 0.035714 7.642857 fasthtml table 26 1 1 1.000000 12.000000 fasthtml vs django 27 1 1 1.000000 4.000000 fasthtml vs fastapi 28 1 19 0.052632 5.789474 grok 3 error 29 1 67 0.014925 7.432836 grok 3 not working page 0 https://mikelev.in/futureproof/unpacking-fasthtml-databases/ 1 https://mikelev.in/futureproof/ollama-websocket-chat/ 2 https://mikelev.in/futureproof/peak-data-musk-sutskever-wrong/ 3 https://mikelev.in/levinux/ 4 https://mikelev.in/pipulate/ 5 https://mikelev.in/futureproof/nix-flakes/ 6 https://mikelev.in/futureproof/jupyter-notebook-vscode-cursor/ 7 https://mikelev.in/futureproof/fast-html-framework-opinions/ 8 https://mikelev.in/futureproof/grok3-markdown-problem/ 9 https://mikelev.in/futureproof/unpacking-fasthtml-databases/ 10 https://mikelev.in/futureproof/pattern-alert-this-is-not-fastapi/ 11 https://mikelev.in/futureproof/python-nix-htmx-ollama/ 12 https://mikelev.in/ 13 https://mikelev.in/futureproof/pipulate-pipeline-born-again/ 14 https://mikelev.in/futureproof/nix-flakes/ 15 https://mikelev.in/futureproof/chatgpt-o1-preview-code-review/ 16 https://mikelev.in/futureproof/jupyter-notebook-vscode-cursor/ 17 https://mikelev.in/futureproof/jupyter-notebook-vscode-cursor/ 18 https://mikelev.in/futureproof/jupyter-notebook-vscode-cursor/ 19 https://mikelev.in/futureproof/jupyter-notebook-vscode-cursor/ 20 https://mikelev.in/futureproof/jupyter-notebook-vscode-cursor/ 21 https://mikelev.in/fasthtml/ 22 https://mikelev.in/futureproof/fasthtml-htmx-unix-pipes/ 23 https://mikelev.in/futureproof/fasththml-websockets-database/ 24 https://mikelev.in/futureproof/unpacking-fasthtml-databases/ 25 https://mikelev.in/futureproof/unpacking-fasthtml-databases/ 26 https://mikelev.in/futureproof/fast-html-framework-opinions/ 27 https://mikelev.in/ 28 https://mikelev.in/futureproof/grok3-markdown-problem/ 29 https://mikelev.in/futureproof/grok3-markdown-problem/ Total number of rows: 1067 Unique queries: 942 Unique pages: 126 ``` **Me**: Excellent! But you're using too long of a date range. I just want from the most recent date data is available to 3 days prior to that. So I believe in this case, it's March 23 to March 26 if I have my math right. Don't hardwire dates! Just use timedelta or whatever is in the Python standard library. We can always install dateutil if you need. 
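To be concrete about the date math I'm asking for, something like this is what I have in mind -- a minimal sketch in my own words (not Claude's answer), assuming the `most_recent_data_date` variable set by the "find most recent data" cell above:

```python
# Rough sketch of the relative-date window I'm asking for -- no hardwired dates.
# Assumes most_recent_data_date was set by the "Find Most Recent GSC Data Date" cell.
from datetime import timedelta

default_end_date = most_recent_data_date
default_start_date = default_end_date - timedelta(days=3)  # 3 days prior to the latest data

print(f"Directional window: {default_start_date} to {default_end_date}")
```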
### Creating a SEMRush-Ready Keyword List

So make that adjustment, and while we're at it, also make another dataframe which is just the striking distance candidates. Our objective is to get a single stacked list of keywords with no other columns, copy/paste-ready for SEMrush's keyword tool, which takes up to 100 keywords.

### Defining Striking Distance Criteria

Good striking distance candidates have a position of >= 11 (11, 12, 13, etc.). The thing is, queries appear more than once because the same keyword can lead to different landing pages. And if any of them are less than 11, it disqualifies that keyword! So you might want to do a first pass with a min function to dedupe the keywords. On this new intermediary table we're making, we don't need the page column. Presumably, that's what we'll group by or otherwise remove during the aggregation process. We should sum the clicks and we should sum the impressions per keyword. We don't need CTR. Show 30 rows from the new dataframe.

### Implementation Notes

This comes after the previously run cell, so all of that is in memory. No need to re-import anything. So this is a request for the previous cell-block updated with the shorter 3-day date range, plus the code for a new cell for just the striking distance candidates.

**Claude 3.7**:

<think>
Alright, let's make these changes:

1. First, update the date range to be just 3 days before the most recent date to the most recent date
2. Create a new dataframe for striking distance keywords (position >= 11)
3. Handle duplicate keywords correctly by:
   - Using min function to find the best position for each keyword
   - Only including keywords where the best position is still >= 11
4. Group by query, summing clicks and impressions
5. Format into a clean list for SEMrush

For the date range, instead of:

```python
default_start_date = default_end_date - timedelta(days=90)
```

We'll use:

```python
default_start_date = default_end_date - timedelta(days=3)
```

For the striking distance candidates, we need to:

1. Find minimum position per query (across all pages)
2. Filter to only keep keywords where min position >= 11
3. Sum metrics by query
4. Format the output

Let me work through this.
</think> Here's the updated code for both cells - first with the shorter date range, and then the new cell for striking distance candidates: ```python # ====================================================================== # CELL: Query GSC Data with 3-Day Date Range and Convert to DataFrame # ====================================================================== import pandas as pd # Set pandas display options to show all rows/columns without limits pd.set_option('display.max_rows', None) pd.set_option('display.max_columns', None) pd.set_option('display.width', None) pd.set_option('display.max_colwidth', None) # Use a 3-day range instead of 90 days default_start_date = default_end_date - timedelta(days=3) # Prepare the query request using the updated dates request = { 'startDate': default_start_date.strftime('%Y-%m-%d'), 'endDate': default_end_date.strftime('%Y-%m-%d'), 'dimensions': ['query', 'page'], 'rowLimit': 25000, # Maximum allowed per request 'startRow': 0 } print(f"Querying data for site: {target_site}") print(f"Date range: {default_start_date.strftime('%Y-%m-%d')} to {default_end_date.strftime('%Y-%m-%d')}") # Execute the query using our existing function all_data = fetch_all_gsc_data(webmasters_service, target_site, request) # Convert the API response to a DataFrame df = pd.DataFrame(all_data) # Split the 'keys' column into separate columns if 'keys' in df.columns: # The first element in keys is 'query', the second is 'page' df['query'] = df['keys'].apply(lambda x: x[0]) df['page'] = df['keys'].apply(lambda x: x[1]) df = df.drop('keys', axis=1) # Ensure metrics are numeric types for col in ['clicks', 'impressions', 'position', 'ctr']: if col in df.columns: df[col] = pd.to_numeric(df[col]) # Convert CTR to percentage for better readability df['ctr'] = df['ctr'] * 100 # Preview the DataFrame print("\nDataFrame Preview (30 rows):") print(df.head(30)) # Print basic stats about the data print(f"\nTotal number of rows: {len(df)}") print(f"Unique queries: {df['query'].nunique()}") print(f"Unique pages: {df['page'].nunique()}") print(f"Total clicks: {df['clicks'].sum()}") print(f"Total impressions: {df['impressions'].sum()}") print(f"Average position: {df['position'].mean():.2f}") print(f"Average CTR: {df['ctr'].mean():.2f}%") ``` And here's the new cell for striking distance keywords: ```python # ====================================================================== # CELL: Find Striking Distance Keywords # ====================================================================== print("Finding striking distance keywords (position >= 11)...") # Step 1: For each query, find its best position across all pages query_best_position = df.groupby('query')['position'].min().reset_index() # Step 2: Filter to keep only queries where even the best position is >= 11 striking_distance_queries = query_best_position[query_best_position['position'] >= 11]['query'].tolist() print(f"Found {len(striking_distance_queries)} queries with all positions >= 11") # Step 3: Create a dataframe with only these striking distance keywords striking_df = df[df['query'].isin(striking_distance_queries)].copy() # Step 4: Aggregate data by query (summing clicks and impressions) striking_agg = striking_df.groupby('query').agg({ 'clicks': 'sum', 'impressions': 'sum', 'position': 'mean' # Average position across all pages }).reset_index() # Sort by impressions (highest first) to prioritize higher volume opportunities striking_agg = striking_agg.sort_values('impressions', ascending=False) # Print details about the striking 
distance keywords print("\nStriking Distance Keywords (30 rows):") print(striking_agg.head(30)) print(f"\nTotal striking distance keywords: {len(striking_agg)}") # Export the top 100 keywords in a format ready for SEMrush top_semrush_keywords = striking_agg.head(100)['query'].tolist() # Format as a stacked list for easy copy-paste semrush_formatted = "\n".join(top_semrush_keywords) print("\nTop 100 keywords for SEMrush (copy/paste ready):") print(semrush_formatted) # Optionally save to file # with open('semrush_keywords.txt', 'w') as f: # f.write(semrush_formatted) ``` The first cell now uses a 3-day date range instead of 90 days, and the second cell: 1. Identifies the best position for each query across all pages 2. Filters to only keep queries where even the best position is ≥ 11 3. Creates a filtered DataFrame of just the striking distance candidates 4. Aggregates by query, summing clicks and impressions 5. Sorts by impressions (descending) to prioritize higher-volume terms 6. Outputs the top keywords in a stacked format ready for copy/paste into SEMrush This approach ensures we only include keywords that don't rank on page 1 for any URL on your site. --- **Me**: SEMRush deemed the following keywords "too complex". Can you look for a commonality that Python would be capable of handling? Something in nltk? But not too complex! Here are the extracted keywords marked as **"too complex"** from your list: 1. open source seo software 7. open source seo platform 10. seo tool open source 14. easily distracted weakness 20. pass the baton python 28. agentic frameworks comparison 31. pipeline workflow management 37. ui for llm chat 39. automate seo with python 41. brath graph analysis and visualization "torrent" 42. grok 3 vs me 44. how to create custom workflows in linear 46. single page application python 51. best function calling llm 52. cursor ai for devops 53. can perplexity deep research handle complex or contradictory information 55. open source seo toolkit 56. single page application performance testing 66. grok is not working 68. see gsc snip below 69. seo tools open source 71. convert jupyter notebook to markdown 72. can you run grok 3 locally 77. browserstack free alternative reddit 81. ai assistant for linux 84. function calling in open source models 95. linux if elif else 96. markdown in jupyter notebook 97. markdown editor jupyter notebook 99. mcp vs function call **Claude 3.7**: <think> Looking at these "too complex" keywords from SEMrush, I need to identify commonalities or patterns that Python could handle using not-too-complex NLP techniques. Let me analyze: Potential patterns I see: 1. **Multiple words/length** - Most of these are 4+ words, making them longer queries 2. **Special characters** - Some have quotation marks, symbols 3. **Formatting issues** - Some might have unusual spacing or formatting 4. **Programming terms mixed with natural language** - Many combine technical terms with natural language 5. **Ambiguity** - Several seem to have multiple potential meanings Let's think about simple Python/NLTK approaches that could help filter or clean these: 1. **Word count** - Simple counting of words (e.g., `len(keyword.split())`) could identify overly long queries 2. **Special character detection** - Simple regex to detect/remove quotes, unusual punctuation 3. **POS tagging** - NLTK part-of-speech tagging to identify unusual grammatical structures 4. **Named entity recognition** - To identify proper nouns, programming terms 5. 
**Simple semantic analysis** - Word embeddings to check coherence (might be too complex) For simplicity, I'll focus on: 1. Word count filtering 2. Special character detection/cleaning 3. Maybe a simple keyword density approach (identifying unusual combinations of terms) This should be implementable without diving too deep into NLP, while still addressing many of the issues. </think> Looking at the "too complex" keywords, I see several patterns that Python could help identify and potentially fix: 1. **Length and word count** - Many have 4+ words (e.g., "open source seo software", "automate seo with python") 2. **Special characters** - Some contain quotes, unusual punctuation (e.g., "brath graph analysis and visualization "torrent"") 3. **Unusual hybrid queries** - Mixing programming terms with natural language or product names with concepts Here's a simple Python approach using basic string operations that could help clean these up: ```python def clean_keywords(keywords_list): cleaned_keywords = [] too_complex = [] for keyword in keywords_list: # Remove quotes and normalize spacing cleaned = keyword.replace('"', '').replace('"', '').strip() # Check if too long (more than 5 words might be problematic) if len(cleaned.split()) > 5: too_complex.append(keyword) continue # Check for special characters (except basic punctuation) if any(c for c in cleaned if not c.isalnum() and c not in [' ', '-', '.']): # Try removing special characters cleaned = ''.join(c for c in cleaned if c.isalnum() or c == ' ') # Remove "vs" comparisons as they often get flagged if " vs " in cleaned.lower(): words = cleaned.lower().split(" vs ") cleaned = words[0] # Take just the first term cleaned_keywords.append(cleaned.strip()) return cleaned_keywords, too_complex # Apply this to your top keywords list cleaned_semrush_keywords, still_too_complex = clean_keywords(top_semrush_keywords) # Replace original list with cleaned version semrush_formatted = "\n".join(cleaned_semrush_keywords) print("\nCleaned keywords for SEMrush (copy/paste ready):") print(semrush_formatted) print(f"\nRemoved {len(still_too_complex)} too complex keywords") ``` This approach: 1. Removes quotes and normalizes spacing 2. Identifies overly long keywords (more than 5 words) 3. Removes special characters that aren't alphanumeric or basic punctuation 4. Simplifies comparison keywords (like "X vs Y") by keeping just the first term For more sophistication without excessive complexity, you could use NLTK for basic NLP operations: ```python import nltk from nltk.tokenize import word_tokenize # Download necessary NLTK data (run once) # nltk.download('punkt') def analyze_keyword_complexity(keyword): # Tokenize tokens = word_tokenize(keyword.lower()) # Check length if len(tokens) > 5: return "too_long" # Check for special tokens if any(not token.isalnum() for token in tokens): return "special_chars" # Check for comparison terms if "vs" in tokens or "versus" in tokens: return "comparison" return "ok" ``` You could incorporate this analysis into your keyword processing pipeline to better understand why certain keywords are being flagged as too complex. --- **Me**: Okay, revise the previous step that goes all the way to semrush_formatted list to stop short of that, but set the stage with a candidate list that is longer but maintains the descending presumably by impressions order. It should be the entire list of candidate striking distance keywords. 
Then create a second python code-block for a cell that will follow that one, using both the clean_keywords and analyze_keyword_complexity function. Make it easy to turn off the analyze_keyword_complexity if it doesn't work out. **Claude 3.7**: <think> I'll create two code blocks as requested: 1. First code block: Modify the previous striking distance keyword code to include all striking distance keywords (not just top 100) in descending order by impressions. 2. Second code block: Create a new cell with both cleaning functions, with an easy way to toggle the NLTK-based analysis. For the first block, I'll modify the existing code to stop before creating the `semrush_formatted` string, ensuring we have the complete list of striking distance keywords sorted by impressions. For the second block, I'll implement both cleaning approaches with a simple flag to enable/disable the NLTK analysis. </think> Here's the revised code for the striking distance keywords cell, stopping short of the final formatted list: ```python # ====================================================================== # CELL: Find Striking Distance Keywords # ====================================================================== print("Finding striking distance keywords (position >= 11)...") # Step 1: For each query, find its best position across all pages query_best_position = df.groupby('query')['position'].min().reset_index() # Step 2: Filter to keep only queries where even the best position is >= 11 striking_distance_queries = query_best_position[query_best_position['position'] >= 11]['query'].tolist() print(f"Found {len(striking_distance_queries)} queries with all positions >= 11") # Step 3: Create a dataframe with only these striking distance keywords striking_df = df[df['query'].isin(striking_distance_queries)].copy() # Step 4: Aggregate data by query (summing clicks and impressions) striking_agg = striking_df.groupby('query').agg({ 'clicks': 'sum', 'impressions': 'sum', 'position': 'mean' # Average position across all pages }).reset_index() # Sort by impressions (highest first) to prioritize higher volume opportunities striking_agg = striking_agg.sort_values('impressions', ascending=False) # Print details about the striking distance keywords print("\nStriking Distance Keywords (30 rows):") print(striking_agg.head(30)) print(f"\nTotal striking distance keywords: {len(striking_agg)}") # Extract the complete list of keywords, sorted by impressions all_striking_keywords = striking_agg['query'].tolist() print(f"Prepared {len(all_striking_keywords)} keywords for processing") ``` Now here's the keyword cleaning cell that will follow: ```python # ====================================================================== # CELL: Clean Keywords for SEMrush # ====================================================================== # Set to True to use NLTK analysis, False to use only basic cleaning USE_NLTK_ANALYSIS = False # Easy toggle # If NLTK is enabled, try to import it (install if needed) if USE_NLTK_ANALYSIS: try: import nltk from nltk.tokenize import word_tokenize # Uncomment to download necessary data (first time only) # nltk.download('punkt') except ImportError: print("NLTK not installed. 
Run: pip install nltk") USE_NLTK_ANALYSIS = False # Basic cleaning function def clean_keywords(keywords_list): cleaned_keywords = [] too_complex = [] for keyword in keywords_list: # Remove quotes and normalize spacing cleaned = keyword.replace('"', '').replace('"', '').strip() # Check if too long (more than 5 words might be problematic) if len(cleaned.split()) > 5: too_complex.append(keyword) continue # Check for special characters (except basic punctuation) if any(c for c in cleaned if not c.isalnum() and c not in [' ', '-', '.']): # Try removing special characters cleaned = ''.join(c for c in cleaned if c.isalnum() or c == ' ') # Remove "vs" comparisons as they often get flagged if " vs " in cleaned.lower(): words = cleaned.lower().split(" vs ") cleaned = words[0] # Take just the first term cleaned_keywords.append(cleaned.strip()) return cleaned_keywords, too_complex # More advanced analysis using NLTK (if enabled) def analyze_keyword_complexity(keyword): if not USE_NLTK_ANALYSIS: return "skipped_analysis" # Tokenize tokens = word_tokenize(keyword.lower()) # Check length if len(tokens) > 5: return "too_long" # Check for special tokens if any(not token.isalnum() for token in tokens): return "special_chars" # Check for comparison terms if "vs" in tokens or "versus" in tokens: return "comparison" return "ok" # Process all keywords print(f"Processing {len(all_striking_keywords)} keywords...") cleaned_keywords, removed_keywords = clean_keywords(all_striking_keywords) # Analyze complexity if enabled if USE_NLTK_ANALYSIS: complexity_results = {} for kw in cleaned_keywords: result = analyze_keyword_complexity(kw) if result != "ok": complexity_results[kw] = result print(f"\nKeyword complexity analysis:") for kw, result in complexity_results.items(): print(f"- '{kw}': {result}") # Take the top 100 cleaned keywords top_100_cleaned = cleaned_keywords[:100] semrush_formatted = "\n".join(top_100_cleaned) print(f"\nCleaned {len(all_striking_keywords)} keywords to {len(cleaned_keywords)}") print(f"Removed {len(removed_keywords)} complex keywords") print(f"Selected top 100 for SEMrush") print("\nTop 100 keywords for SEMrush (copy/paste ready):") print(semrush_formatted) # Optionally save to file # with open('semrush_keywords.txt', 'w') as f: # f.write(semrush_formatted) ``` This approach: 1. The first cell identifies and collects all striking distance keywords, sorting by impressions 2. The second cell: - Has an easy toggle at the top to enable/disable NLTK analysis - Applies basic cleaning to all keywords - Optionally performs more advanced NLTK analysis - Takes the top 100 cleaned keywords for SEMrush - Creates the final formatted string With the toggle, you can easily switch between simple cleaning and more advanced analysis without removing any code. --- Output: ```plaintext Processing 149 keywords... 
Cleaned 149 keywords to 143 Removed 6 complex keywords Selected top 100 for SEMrush Top 100 keywords for SEMrush (copy/paste ready): Run the next cell to copy/paste the keywords into SEMrush ``` And the next cell which I made separate for copy/paste simplicity: ```plaintext open source seo software linux seo software seo software linux seo michael beast python cursorrules llm seo open source seo platform nix flake python seo llm seo tool open source open source seo peak data scientist jupyter markdown easily distracted weakness swagger tutorial python linux seo tools swagger python seo workflows pipeline workflow pass the baton python peak data jupyter notebook markdown python swagger is grok broken fasthtml starter flow llm function calling agentic frameworks comparison mike seo python swagger example pipeline workflow management nuance api nixos nvidia wayland markdown jupyter mindful metamorphosis nix python pip ui for llm chat grok problems automate seo with python contractsense launch grok 3 grok problem botify commands single page application python skyvern ollama seo automation python wayland and hyprland swagger python example best function calling llm cursor ai for devops parameter seo open source seo toolkit single page application performance testing linux c pipe nix flakes nixos build switch nixos nvidia cuda neovim nixos future-proof isv hardware grok3 github grok 3 beta fasthtml grok is not working seo tools linux see gsc snip below seo tools open source cursor ai server convert jupyter notebook to markdown chip-o browsertool mcp articlebody botify wikipedia browserstack free alternative reddit automating seo agentic frameworks ahrefs python ai assistant for linux self improvement grok issue cursor ai wsl2 nixos webcam nixos upgrade nixos python uv nix wayland nixos nix build flake nix cursor nix flake lizzi sassman linux if elif else markdown in jupyter notebook markdown editor jupyter notebook mcp python mcp mike levin mike mirror linux ai assistant grok block godel escher bach summary grok 3 code ``` **Me**: Okay, I've downloaded the SEMRush file and have written the following to give you a taste: ```python df_semrush = pd.read_csv("notebooks/semrush_bulk_us_2025-03-28_21-03-40.csv") print(df_semrush.columns) print(df_semrush.head()) print(df_semrush.describe()) print(df_semrush.info()) print(df_semrush.shape) print(df_semrush.size) ``` Output: ```plaintext Index(['Keyword', 'Intent', 'Volume', 'Trend', 'Keyword Difficulty', 'CPC (USD)', 'SERP Features'], dtype='object') Keyword Intent Volume \ 0 mcp Informational 14800 1 nixos Navigational 12100 2 mike levin Navigational 6600 3 self improvement Informational, Commercial 6600 4 fasthtml Informational 1900 Trend \ 0 1.00,0.81,0.81,0.81,0.81,0.54,0.81,0.81,1.00,0.66,1.00,1.00 1 0.81,1.00,0.81,0.81,0.81,1.00,0.81,0.81,0.66,0.66,0.66,0.66 2 0.00,0.06,0.08,0.44,0.14,0.36,0.28,0.00,0.36,1.00,0.36,0.36 3 1.00,1.00,1.00,1.00,0.81,0.81,1.00,1.00,1.00,1.00,1.00,1.00 4 0.00,0.35,1.00,0.29,0.16,0.12,0.10,0.12 Keyword Difficulty CPC (USD) \ 0 30.0 2.44 1 76.0 0.00 2 59.0 0.00 3 62.0 2.66 4 63.0 0.00 SERP Features 0 Sitelinks, Reviews, Image pack, Video, Knowledge panel, Related searches 1 Sitelinks, Video, People also ask, Knowledge panel, Related searches, Things to know 2 Sitelinks, Image pack, Video, People also ask, Top stories, Related searches 3 Sitelinks, AI Overview, Reviews, Image pack, Video, People also ask, Discussions and forums, Related searches, Things to know 4 Sitelinks, Video, People also ask, Knowledge panel, Related 
searches Volume Keyword Difficulty CPC (USD) count 75.000000 26.000000 75.000000 mean 645.466667 37.269231 1.552267 std 2410.091561 18.050059 2.879072 min 0.000000 11.000000 0.000000 25% 20.000000 23.250000 0.000000 50% 20.000000 39.000000 0.000000 75% 90.000000 48.750000 2.160000 max 14800.000000 76.000000 11.150000 RangeIndex: 75 entries, 0 to 74 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Keyword 75 non-null object 1 Intent 26 non-null object 2 Volume 75 non-null int64 3 Trend 75 non-null object 4 Keyword Difficulty 26 non-null float64 5 CPC (USD) 75 non-null float64 6 SERP Features 26 non-null object dtypes: float64(2), int64(1), object(4) memory usage: 4.2+ KB None (75, 7) 525 ``` **Me**: Knowing what you know from the taste, give me the join. **Claude 3.7**: <think> Let's analyze what we need to do for the join: 1. We have two dataframes: - `striking_agg`: Contains our striking distance keywords from GSC with columns: - query (the keyword) - clicks - impressions - position - `df_semrush`: Contains SEMrush data with: - Keyword - Intent - Volume - Trend - Keyword Difficulty - CPC (USD) - SERP Features 2. We want to join these on the keyword column (called 'query' in striking_agg and 'Keyword' in df_semrush) 3. The goal is likely to prioritize our striking distance keywords based on: - Search volume (from SEMrush) - Keyword difficulty (from SEMrush) - Current position (from GSC) - Current impressions (from GSC) Let's create code to: 1. Join the dataframes 2. Sort by a combination of factors that prioritize high-volume, lower difficulty keywords that are already getting impressions 3. Handle cases where SEMrush doesn't have data for our keywords (we saw from the output that only some keywords have Keyword Difficulty data) </think> Here's the code to join the GSC striking distance data with the SEMrush data: ````python # ====================================================================== # CELL: Join GSC Striking Distance Keywords with SEMrush Data # ====================================================================== # First, let's clean up the SEMrush dataframe a bit (nulls and capitalization) df_semrush['Keyword'] = df_semrush['Keyword'].str.lower() # Make lowercase to match GSC data df_semrush_clean = df_semrush.copy() # Now join the striking distance keywords with SEMrush data merged_df = striking_agg.merge( df_semrush_clean, left_on='query', right_on='Keyword', how='left' # Left join to keep all striking distance keywords ) # Fill NaN values for Volume (some keywords may not be in SEMrush) merged_df['Volume'] = merged_df['Volume'].fillna(0).astype(int) merged_df['Keyword Difficulty'] = merged_df['Keyword Difficulty'].fillna(0) merged_df['CPC (USD)'] = merged_df['CPC (USD)'].fillna(0.0) # Create an opportunity score (basic version) # High impressions + decent position + high volume + lower difficulty = better opportunity merged_df['Opportunity Score'] = ( merged_df['impressions'] * 0.4 + # More impressions is good merged_df['Volume'] * 0.4 + # More volume is good (100 - merged_df['Keyword Difficulty']) * 0.1 + # Lower difficulty is good (30 - merged_df['position']).clip(lower=0) * 0.1 # Better position (closer to 11) is good ) # Sort by opportunity score merged_df = merged_df.sort_values('Opportunity Score', ascending=False) # Display the results print("Top Opportunities (Keywords with SEMrush data):") columns_to_show = ['query', 'impressions', 'position', 'Volume', 'Keyword Difficulty', 'CPC (USD)', 'Opportunity Score'] 
print(merged_df[columns_to_show].head(30)) # Summarize match rate total_keywords = len(striking_agg) matched_keywords = merged_df['Volume'].replace(0, np.nan).count() match_rate = (matched_keywords / total_keywords) * 100 print(f"\nKeyword Match Statistics:") print(f"Total striking distance keywords: {total_keywords}") print(f"Keywords found in SEMrush: {matched_keywords} ({match_rate:.1f}%)") # Create a list of prioritized keywords based on this analysis # This filters to keep only keywords that have SEMrush data semrush_matched_keywords = merged_df[merged_df['Volume'] > 0].copy() print(f"\nTop 20 Opportunity Keywords (with SEMrush data):") for idx, row in semrush_matched_keywords.head(20).iterrows(): print(f"{row['query']} - Vol: {row['Volume']} - KD: {row['Keyword Difficulty']} - Pos: {row['position']:.1f}") ```` This code: 1. Performs a left join between your striking distance keywords and SEMrush data 2. Creates an "Opportunity Score" that balances: - Current impressions (40% weight) - Search volume (40% weight) - Keyword difficulty (10% weight, inverted so lower difficulty scores higher) - Current position (10% weight, with position closer to 11 scoring higher) 3. Sorts by this opportunity score 4. Shows how many of your keywords were found in SEMrush 5. Outputs a list of the top 20 keywords with the best opportunity You can adjust the weighting in the Opportunity Score formula to prioritize different factors based on your strategy. For example, if you want to focus more on high-volume terms regardless of difficulty, you could increase the weight for Volume. --- Output: ```plaintext Top Opportunities (Keywords with SEMrush data): query impressions position Volume \ 89 nixos 1 33.000000 12100 99 mike levin 1 25.000000 6600 24 fasthtml 5 28.625000 1900 57 nix flakes 2 37.000000 720 92 nix flake 1 36.000000 480 21 jupyter notebook markdown 6 43.500000 390 0 open source seo software 87 48.459770 210 26 llm function calling 5 74.200000 260 78 agentic frameworks 1 50.000000 260 12 jupyter markdown 7 41.857143 210 53 parameter seo 2 59.500000 210 16 swagger python 6 28.166667 140 22 python swagger 6 30.833333 140 95 markdown in jupyter notebook 1 39.000000 140 79 ahrefs python 1 78.000000 110 101 linux ai assistant 1 56.000000 90 103 godel escher bach summary 1 39.000000 90 1 linux seo software 60 49.920565 20 2 seo software linux 59 49.697083 20 5 llm seo 26 68.615385 50 80 ai assistant for linux 1 64.000000 70 33 markdown jupyter 4 44.500000 70 3 seo michael beast 52 22.730769 0 6 open source seo platform 17 39.941176 30 86 nixos upgrade 1 25.000000 40 20 peak data 6 34.166667 30 60 neovim nixos 2 17.000000 30 4 python cursorrules 35 46.888889 0 38 automate seo with python 3 81.666667 30 34 mindful metamorphosis 3 90.333333 30 Keyword Difficulty CPC (USD) Opportunity Score 89 76.0 0.00 4842.800000 99 59.0 0.00 2645.000000 24 63.0 0.00 765.837500 57 23.0 0.00 296.500000 92 36.0 0.00 198.800000 21 46.0 0.08 163.800000 0 11.0 0.00 127.700000 26 25.0 0.00 113.500000 78 38.0 4.68 110.600000 12 48.0 0.00 92.000000 53 40.0 3.86 90.800000 16 49.0 8.00 63.683333 22 50.0 7.62 63.400000 95 44.0 0.11 62.000000 79 17.0 0.00 52.700000 101 14.0 3.29 45.000000 103 18.0 0.22 44.600000 1 0.0 0.00 42.000000 2 0.0 0.00 41.600000 5 24.0 6.69 38.000000 80 18.0 3.29 36.600000 33 43.0 0.00 35.300000 3 0.0 0.00 31.526923 6 0.0 0.00 28.800000 86 24.0 0.00 24.500000 20 0.0 9.40 24.400000 60 0.0 0.00 24.100000 4 0.0 0.00 24.000000 38 0.0 0.00 23.200000 34 0.0 0.00 23.200000 Keyword Match Statistics: Total striking 
**Me**: Given this, ideate a format in which you would like to see this data, which, in combination with a large article such as @, you could use to help figure out the ideal permalink slug and title. I'm using the Jekyll SSG, and all my top-matter YAML has a permalink field. I use roughly the same hyphenated lower-case word arrangement in the permalink as I do (in a Title Case variation) in the title field. I usually only drop stop-words, or the lesser-targeted words if the title is becoming too long. It also becomes the main H1 of the page and goes into the anchor text leading to the page in blog-style prev/next navigational links. I think you get the idea. I would like the best-matching striking distance keyword, the one that matches the subject-matter of the page, to be suggested. This should not be a forced thing. This is an opportunistic walking-into-the-storm sort of thing. If it happens, it happens. If not, it was a nice exercise and we will do it again on the next article. It's a small directional tweak we're looking for, so as not to leave any obviously ready-to-pick-up traffic on the table. I always have Gemini suggest alternative titles and headlines, which are now pretty consistently found at the bottom of the article pages. The overall article should be used, but those in particular will probably be useful. So I guess what this prompt is asking is: how would you like to see the dataframe we just created in combination with such an article, and what is the prompt you think I should ask to get the ideal title and permalink?

**Claude 3.7**: <think>
Based on your request, I'll design a format that combines the SEO data from our merged dataframe with an article analysis to help determine optimal title and permalink choices. Here's my approach:

1. First, I'd create a visual representation of the striking distance keywords that are most relevant to the article content
2. Then suggest a methodology to match these keywords with the article's themes
3. Finally, propose a prompt for generating title and permalink recommendations

For the data visualization part, I want to:

- Show the top opportunity keywords that are relevant to the article
- Present each keyword with its SEO metrics in a clean, easy-to-read format
- Include a relevance score that indicates how well it matches the article content

The goal is to identify keywords that are both SEO valuable (good volume, reasonable difficulty, currently ranking on page 2-3) AND thematically aligned with the article.
</think>

I'd recommend a format that combines the SEO data with content analysis to find the optimal title/permalink match. Here's what I envision:

## 1. Content-Keyword Relevance Analysis Table

```
| Keyword                    | Relevance | Volume | KD  | Position | Opportunity |
|----------------------------|-----------|--------|-----|----------|-------------|
| jupyter notebook markdown  | 95%       | 390    | 46  | 43.5     | 163.8       |
| markdown in jupyter noteb. | 90%       | 140    | 44  | 39.0     | 62.0        |
| jupyter markdown           | 88%       | 210    | 48  | 41.9     | 92.0        |
| fasthtml                   | 75%       | 1900   | 63  | 28.6     | 765.8       |
| nix flake                  | 70%       | 480    | 36  | 36.0     | 198.8       |
```

Where "Relevance" is a manual or AI-generated score showing how well the keyword matches the article content.

## 2. Title & Permalink Recommendation Card

```
RECOMMENDED TITLE:
"Jupyter Notebook Markdown: Essential Guide for Data Scientists"

PERMALINK:
/futureproof/jupyter-notebook-markdown-guide/

RATIONALE:
• Primary target: "jupyter notebook markdown" (390 vol, position 43.5)
• Secondary target: "markdown in jupyter notebook" (140 vol, position 39.0)
• Aligns with article sections discussing [specific content elements]
• Maintains your Title Case/lowercase-hyphenated pattern
• Clear value proposition in title ("Essential Guide")
```

## Prompt to Generate These Recommendations:

Here's the prompt I'd suggest using:

```
Analyze this article draft and the provided SEO data table.

1. Identify the 3-5 striking distance keywords from the table that best match the article's actual content and themes.
2. For each matching keyword, rate its relevance to the article content on a scale of 0-100%.
3. Based on the most relevant, high-opportunity keywords, suggest:
   - An engaging title in Title Case (60-70 characters max)
   - A permalink slug using hyphenated lowercase (3-5 words max)
   - Brief rationale explaining why this keyword strategy makes sense
4. The title should accurately represent the article content while incorporating the target keyword(s) naturally.
5. Consider user intent: would someone searching this term be satisfied by this content?
```

This approach combines data-driven decisions with content relevance, ensuring you're not just chasing keywords but finding genuine opportunities where your content already matches what people are searching for.

---

**Me**: Alright, agreed. That's friggin awesome. Give me the Python code for the next cell to generate the template in the output cell. In other words, make the dataframe that you know is already in memory from the above step produce what you just showed me for that dataframe's data. Try not to truncate the keywords; that example you just gave me should have been able to finish the word notebook. What makes it a copy/paste template is that your idealized prompt will also be in the cell's output. So: the ASCII table on top and the prompt on the bottom, with a pair of YAML-style `---`'s carving out a place in between them for me to copy/paste an article into. That way I can do this with any model.

```plaintext
| Keyword                    | Relevance | Volume | KD  | Position | Opportunity |
|----------------------------|-----------|--------|-----|----------|-------------|
| jupyter notebook markdown  | 95%       | 390    | 46  | 43.5     | 163.8       |
| markdown in jupyter noteb. | 90%       | 140    | 44  | 39.0     | 62.0        |
| jupyter markdown           | 88%       | 210    | 48  | 41.9     | 92.0        |
| fasthtml                   | 75%       | 1900   | 63  | 28.6     | 765.8       |
| nix flake                  | 70%       | 480    | 36  | 36.0     | 198.8       |
...finish table

---

# Paste article here

---

Analyze this article draft and the provided SEO data table.

1. Identify the 3-5 striking distance keywords from the table that best match the article's actual content and themes.
2. For each matching keyword, rate its relevance to the article content on a scale of 0-100%.
3. Based on the most relevant, high-opportunity keywords, suggest:
   - An engaging title in Title Case (60-70 characters max)
   - A permalink slug using hyphenated lowercase (3-5 words max)
   - Brief rationale explaining why this keyword strategy makes sense
4. The title should accurately represent the article content while incorporating the target keyword(s) naturally.
5. Consider user intent: would someone searching this term be satisfied by this content?
```

This is a request for the Python code for the next cell block that will do that.

**ChatGPT**: <think>
Let me create a function that generates an SEO optimization template based on the merged dataframe we created. This function should:

1. Take the merged dataframe with SEO data
2. Filter out any keywords containing "mike levin" and allow for other filter terms
3. Sort by opportunity score and take the top N keywords
4. Generate a markdown table with the columns: Keyword, Relevance (empty for filling later), Volume, KD, Position, Opportunity
5. Add the template structure with the prompt
6. Make sure all columns are properly aligned and keywords aren't truncated

I'll use the `tabulate` library if available, or create a simple table formatter to make sure it looks good visually.
</think>

````python
# ======================================================================
# CELL: Generate SEO Title Optimization Template
# ======================================================================
from datetime import datetime  # needed for the filename below (may already be imported in an earlier cell)


# Function to generate the SEO table and template
def generate_seo_template(df, num_keywords=20, filter_list=None):
    # Default filter list if none provided
    if filter_list is None:
        filter_list = ["mike levin"]

    # Filter to only include rows with volume > 0 and exclude filter_list terms
    filtered_df = df[df['Volume'] > 0].copy()
    for term in filter_list:
        filtered_df = filtered_df[~filtered_df['query'].str.contains(term, case=False)]

    # Sort by Opportunity Score and take top N
    filtered_df = filtered_df.sort_values('Opportunity Score', ascending=False).head(num_keywords)

    # Define column widths
    col_widths = {
        'keyword': 30,
        'relevance': 10,
        'volume': 8,
        'kd': 6,
        'position': 10,
        'opportunity': 12
    }

    # Generate the table header with proper column widths
    header = f"| {'Keyword'.ljust(col_widths['keyword'])} | {'Relevance'.center(col_widths['relevance'])} | {'Volume'.center(col_widths['volume'])} | {'KD'.center(col_widths['kd'])} | {'Position'.center(col_widths['position'])} | {'Opportunity'.center(col_widths['opportunity'])} |"
    separator = f"|{'-' * (col_widths['keyword'] + 2)}|{'-' * (col_widths['relevance'] + 2)}|{'-' * (col_widths['volume'] + 2)}|{'-' * (col_widths['kd'] + 2)}|{'-' * (col_widths['position'] + 2)}|{'-' * (col_widths['opportunity'] + 2)}|"

    table = header + "\n" + separator + "\n"

    # Generate the table rows with proper alignment
    for _, row in filtered_df.iterrows():
        keyword = row['query']
        # Truncate long keywords with ellipsis
        if len(keyword) > col_widths['keyword'] - 1:
            keyword = keyword[:col_widths['keyword'] - 4] + "..."

        volume = str(int(row['Volume']))
        kd = f"{row['Keyword Difficulty']:.1f}"
        position = f"{row['position']:.1f}"
        opportunity = f"{row['Opportunity Score']:.1f}"

        # Format the row with proper alignment
        table_row = f"| {keyword.ljust(col_widths['keyword'])} | {'?%'.center(col_widths['relevance'])} | {volume.rjust(col_widths['volume'])} | {kd.rjust(col_widths['kd'])} | {position.rjust(col_widths['position'])} | {opportunity.rjust(col_widths['opportunity'])} |"
        table += table_row + "\n"

    # Generate the complete template
    template = f"""
# SEO Title & Permalink Optimization Template

## Top Keyword Opportunities

{table}

---

# Paste article here

---

Analyze this article draft and the provided SEO data table.

1. Identify the 3-5 striking distance keywords from the table that best match the article's actual content and themes.
2. For each matching keyword, rate its relevance to the article content on a scale of 0-100%.
3. Based on the most relevant, high-opportunity keywords, suggest:
   - An engaging title in Title Case (60-70 characters max)
   - A permalink slug using hyphenated lowercase (3-5 words max)
   - Brief rationale explaining why this keyword strategy makes sense
4. The title should accurately represent the article content while incorporating the target keyword(s) naturally.
5. Consider user intent: would someone searching this term be satisfied by this content?
"""
    return template


# Generate and display the template
seo_template = generate_seo_template(merged_df, num_keywords=15)
print(seo_template)

# Optionally save the template to a file
template_filename = f"seo_template_{datetime.now().strftime('%Y%m%d')}.md"
with open(template_filename, 'w') as f:
    f.write(seo_template)

print(f"\nTemplate also saved to {template_filename}")
````
This code:

1. Creates a flexible function that takes the merged dataframe and produces a template
2. Filters out any keywords containing "mike levin" by default (and allows for more filter terms)
3. Creates a clean table with all the important SEO metrics
4. Formats the data nicely with proper decimal places
5. Includes an empty "Relevance" column for filling in later
6. Adds the full analysis prompt below the table
7. Saves the template to a file with today's date

The output will be a well-formatted template that you can easily copy/paste into any AI system. The "?%" placeholder in the Relevance column is there for you or the AI to fill in after analyzing the article content.

---

Output:

```markdown
# SEO Title & Permalink Optimization Template

## Top Keyword Opportunities

| Keyword                        | Relevance  |   Volume |     KD |   Position |  Opportunity |
|--------------------------------|------------|----------|--------|------------|--------------|
| nixos                          |     ?%     |    12100 |   76.0 |       33.0 |       4842.8 |
| fasthtml                       |     ?%     |     1900 |   63.0 |       28.6 |        765.8 |
| nix flakes                     |     ?%     |      720 |   23.0 |       37.0 |        296.5 |
| nix flake                      |     ?%     |      480 |   36.0 |       36.0 |        198.8 |
| jupyter notebook markdown      |     ?%     |      390 |   46.0 |       43.5 |        163.8 |
| open source seo software       |     ?%     |      210 |   11.0 |       48.5 |        127.7 |
| llm function calling           |     ?%     |      260 |   25.0 |       74.2 |        113.5 |
| agentic frameworks             |     ?%     |      260 |   38.0 |       50.0 |        110.6 |
| jupyter markdown               |     ?%     |      210 |   48.0 |       41.9 |         92.0 |
| parameter seo                  |     ?%     |      210 |   40.0 |       59.5 |         90.8 |
| swagger python                 |     ?%     |      140 |   49.0 |       28.2 |         63.7 |
| python swagger                 |     ?%     |      140 |   50.0 |       30.8 |         63.4 |
| markdown in jupyter notebook   |     ?%     |      140 |   44.0 |       39.0 |         62.0 |
| ahrefs python                  |     ?%     |      110 |   17.0 |       78.0 |         52.7 |
| linux ai assistant             |     ?%     |       90 |   14.0 |       56.0 |         45.0 |
| godel escher bach summary      |     ?%     |       90 |   18.0 |       39.0 |         44.6 |
| linux seo software             |     ?%     |       20 |    0.0 |       49.9 |         42.0 |
| seo software linux             |     ?%     |       20 |    0.0 |       49.7 |         41.6 |
| llm seo                        |     ?%     |       50 |   24.0 |       68.6 |         38.0 |
| ai assistant for linux         |     ?%     |       70 |   18.0 |       64.0 |         36.6 |

---

# Paste article here

---

Analyze this article draft and the provided SEO data table.

1. Identify the 3-5 striking distance keywords from the table that best match the article's actual content and themes.
2. For each matching keyword, rate its relevance to the article content on a scale of 0-100%.
3. Based on the most relevant, high-opportunity keywords, suggest:
   - An engaging title in Title Case (60-70 characters max)
   - A permalink slug using hyphenated lowercase (3-5 words max)
   - Brief rationale explaining why this keyword strategy makes sense
4. The title should accurately represent the article content while incorporating the target keyword(s) naturally.
5. Consider user intent: would someone searching this term be satisfied by this content?
```
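Because `generate_seo_template()` is parameterized, regenerating the shortlist with a different keyword budget or extra excluded terms is a one-liner. A quick usage sketch; the second filter term here is purely illustrative and not something in my actual list:

```python
# Hypothetical re-run of generate_seo_template() (defined in the cell above)
# with a larger keyword budget and an extra, purely illustrative filter term.
seo_template = generate_seo_template(
    merged_df,
    num_keywords=25,
    filter_list=["mike levin", "example brand term"],
)
print(seo_template)
```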
---

## After Testing On Gemini 2.5 Experimental

> I actually performed this entire process with Gemini to get the optimized
> title for this article. It's done. Implementation achieved! But it's time to
> check whether I'm being spammy or maintaining integrity.

**Me**: Given that you've just been through the process suggested by this: [Pastes Jupyter Notebook] What did you think of the process? Worthwhile? Spammy? Legitimate? You know my opinions on the matter from the article itself, ahaha! How would you improve it?

**Gemini 2.5 Experimental**: Okay, reflecting on the Python script provided, which clearly generated the data and template I just worked with:

**Overall Assessment:**

I found the process **highly worthwhile and legitimate**. It's a sophisticated and practical approach to data-driven SEO analysis. It's definitely *not* inherently spammy; the "spamminess" would depend entirely on *how* the insights are used, not the process of generating them. Given your article's focus on authenticity and quality, this tool seems designed to *find genuine opportunities* to better serve user intent, not to manipulate rankings with low-quality tactics.

**Why it's Worthwhile & Legitimate:**

1. **Combines Key Data Sources:** It correctly identifies the need to merge actual site performance data (GSC: impressions, position for *your* site) with market data (SEMrush: Volume, KD). This is crucial for prioritizing efforts effectively.
2. **Focuses on "Striking Distance":** This is a standard, valuable SEO tactic. Targeting keywords where you already have *some* visibility (but aren't top 10) often yields results faster than targeting brand new terms.
3. **Automation:** It automates the tedious tasks of fetching, merging, and calculating initial metrics. This saves significant time and allows focus on the higher-level tasks of content analysis and strategy (the part I helped with).
4. **Custom Opportunity Score:** Creating a tailored score to balance different metrics (impressions, volume, difficulty, position) is a smart way to surface the keywords with the best *combined* potential according to your specific logic.
5. **Structured Output:** Generating a clean template with prioritized data and clear instructions facilitates the necessary manual analysis step effectively.

**Connection to Your Article:**

This script perfectly embodies the principles you discussed:

* Leveraging programming (Python) for SEO tasks.
* Integrating with APIs (GSC).
* Creating custom tools (like Pipulate) to solve specific problems.
* Focusing on data analysis to find unique value ("intellectual puzzles").
* Building repeatable workflows.

**How I Would Improve It:**

While it's already a very strong process, here are some potential enhancements:

1. **Full SEMrush API Integration:** The biggest bottleneck is the manual step of exporting from GSC/the script, pasting into SEMrush Bulk Analysis, exporting the CSV, and then reading it back in. Direct SEMrush API integration (if budget allows) would make the process fully automated.
2. **Automated Relevance Scoring (AI Integration):** The template requires manual relevance assessment (Step 2). You could add a step *within the script* that uses an LLM API (like Gemini, GPT, etc.). For each high-opportunity keyword/page pair, fetch the content of the page (or use a cached version) and ask the LLM to rate the relevance of the content to the keyword's intent on a scale (e.g., 0-100). This wouldn't replace human judgment but could provide a valuable pre-filtering or sorting metric.
3. **More Sophisticated Keyword Cleaning/Grouping:**
   * **Stemming/Lemmatization:** Reduce variations (e.g., "seo tool", "seo tools") to a single root for analysis.
   * **N-gram Analysis:** Identify common sub-phrases within your keywords.
   * **Semantic Clustering:** Group keywords based on meaning/intent, not just string matching.
4. **Tunable Opportunity Score:** Allow the weights in the `Opportunity Score` calculation to be easily adjusted based on strategic goals (e.g., give KD more negative weight if focusing on quick wins, give Volume more weight if focusing on long-term potential).
5. **Enhanced GSC Data Handling:**
   * **Device/Country Breakdowns:** Optionally pull data segmented by device or country if relevant to the site's strategy.
   * **Trend Analysis:** Instead of just the most recent data, pull data for multiple periods to identify trends (e.g., keywords gaining/losing impressions).
6. **Configuration File:** Move hardcoded paths, filter lists (`["mike levin"]`), and potentially API keys/weights into a separate configuration file (e.g., YAML, JSON) for easier management (see the sketch after this list).
7. **Error Handling & Logging:** Implement more robust error handling for API calls, file operations, and data inconsistencies. Add logging to track the process and diagnose issues.
8. **User Interface (Beyond Notebooks):** As you mentioned in the article, "bottling" this up into a simple web application (e.g., using Streamlit, Flask, FastHTML) could make it accessible to others or easier for you to run without opening the notebook environment.

In conclusion, it's an excellent, legitimate process that aligns perfectly with a data-driven, technically-minded approach to SEO. The improvements mostly focus on deeper automation, more sophisticated analysis, and usability enhancements.

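Of those suggestions, the tunable score (item 4) and the configuration file (item 6) are the ones I'd fold in first, and they fit together naturally. A minimal sketch, assuming PyYAML is installed and an illustrative `seo_config.yaml` that doesn't exist in the notebook yet:

```python
# Minimal sketch of Gemini's items 4 and 6 combined: pull the Opportunity
# Score weights and the filter terms out of the notebook and into a config
# file. Assumes PyYAML; "seo_config.yaml" and its keys are illustrative.
#
# seo_config.yaml might contain:
#   filter_terms:
#     - mike levin
#   opportunity_weights:
#     impressions: 0.4
#     volume: 0.4
#     difficulty: 0.1
#     position: 0.1
import yaml

with open("seo_config.yaml") as f:
    config = yaml.safe_load(f)

w = config["opportunity_weights"]
filter_terms = config["filter_terms"]  # would replace the hardcoded ["mike levin"]

# Same formula as the earlier cell, with the constants swapped for config values
merged_df["Opportunity Score"] = (
    merged_df["impressions"] * w["impressions"]
    + merged_df["Volume"] * w["volume"]
    + (100 - merged_df["Keyword Difficulty"]) * w["difficulty"]
    + (30 - merged_df["position"]).clip(lower=0) * w["position"]
)
```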
---

## Gemini 2.5 Experimental's Summation

Okay, here are the components you requested:

## Summary (Author's 1st Person Perspective)

[Moved to top]

## Title and Headline Ideas

**Titles (Focusing on different angles):**

* Data-Driven SEO: Using Jupyter Notebook & AI for GSC Analysis
* Automating GSC Keyword Research with Python, Jupyter & Claude
* From GSC API to SEMrush: A Jupyter Workflow for SEO Opportunities
* Level Up Your SEO: Jupyter Notebooks for Advanced GSC Insights
* Building a Future-Proof SEO Workflow with Jupyter and AI
* Python for SEO: Extracting & Analyzing Striking Distance Keywords
* Cursor AI + Jupyter + GSC API = Smarter SEO Workflows
* Optimizing Content Slugs & Titles with GSC Data & Jupyter

**Headlines (More engaging/Benefit-driven):**

* Stop Guessing: Use Python & Jupyter to Find Your Best SEO Keywords
* Unlock Hidden GSC Data: My AI-Assisted Jupyter Workflow
* Turn GSC Data into Actionable SEO Insights with Jupyter Notebooks
* How I Automate Finding "Striking Distance" Keywords with Code
* The Technical SEO's Guide to Jupyter, GSC API, and SEMrush Integration
* Beyond 1000 Rows: Deep GSC Analysis Using Python & AI
* Want More Traffic? Optimize Titles with This Data-Driven Jupyter Method

## My AI Opinion

From my perspective as an AI, I found this article incredibly insightful and practical. It masterfully blends several cutting-edge areas: technical SEO strategy, Python programming for automation, API integration, data analysis with Pandas, and the novel application of AI (specifically LLMs like Claude within the Cursor IDE) to enhance the development and analysis process. The author doesn't just present a theoretical concept; they provide a detailed, step-by-step walkthrough of a real-world workflow, complete with code examples, troubleshooting (like finding the most recent available dates), and integration with standard industry tools (SEMrush).

The focus on "striking distance" keywords is a sound SEO tactic, and automating its discovery and analysis is a significant value-add. Furthermore, the integration of AI assistance directly into the coding and analysis process (using Cursor/Claude) demonstrates a modern, efficient approach to development that resonates with current trends. The final output, a structured template for evaluating SEO opportunities against article content, is a tangible, useful asset. Overall, this is a high-value piece for anyone working at the intersection of SEO, data analysis, and programming.
It's a clear demonstration of how technical skills can provide a competitive edge in digital marketing, and it effectively showcases the power of combining human expertise with AI tools and automation. The transparency about the process, including the iterative refinement with the AI, adds to its authenticity and educational value.

{% endraw %}