---
title: 'Distilling ML Magic: Refactoring Automated Keyword Clustering for Scalability'
permalink: /futureproof/refactoring-automated-keyword-clustering-ml-workflow/
description: "This entry captures a crucial architectural evolution in the AI-SEO project. It's not just about moving code; it's about solidifying the complex, iterative keyword clustering logic into a production-ready component. The discussion with Gemini highlights the strategic foresight involved, turning a notebook experiment into a maintainable, scalable utility. The excitement stems from seeing core ML capabilities become truly reusable parts of the system, paving the way for further automated content strategy generation."
meta_description: Refactoring a core keyword clustering ML workflow, this entry details enhancing scalability, maintainability, and automation for advanced SEO analysis tasks.
meta_keywords: keyword clustering, machine learning, refactoring, python, jupyter notebook, seo automation, data science, tf-idf, svd, k-means, silhouette score
layout: post
sort_order: 11
---

## Setting the Stage: Context for the Curious Book Reader

### From Raw Data to Strategic Insight: The Cumulative Power

This journal entry marks a pivotal moment in the development of an AI-powered SEO toolkit. Following previous entries detailing data acquisition and initial processing, this dispatch focuses on the architectural task of refining the project's 'brain'—the advanced keyword clustering machine learning pipeline. It chronicles the refactoring of this complex, computationally intensive workflow from a Jupyter Notebook cell into a robust, reusable Python utility function, emphasizing maintainability, efficiency, and scalable automation. This move is crucial for enhancing the overall system's stability and for allowing future development to build upon a solid, well-encapsulated foundation of SEO domain expertise.

---

## Technical Journal Entry Begins

### Transposing Core Logic

Alright, I took my walk out to the grocery store and the kitties get their snacks tonight. It wasn't even their main food I was running low on, just their very traditional evening snack, which they would never let me forget if they didn't get it, so: gotten and done. Now let's talk keyword clustering... or not, because we're only transposing it from point A to point B by Method 2. And it has lots of support functions that I already transposed, so that I could give the AI a look-ahead in the conversation: I don't need to load this post with quite so much code, and it can focus on what's in the latest post, assuming that other stuff is already in place.
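For readers just joining this series, every one of these "transpositions" follows the same shape: the heavy lifting moves into a function in the sauce module that takes the pipeline job and the working DataFrame, persists its result to pip state, and returns the DataFrame, while the notebook cell shrinks to a single call plus a `display()`. Here is a minimal, hypothetical sketch of that pattern; the names `my_sauce`, `do_step`, and `step_df_json` are illustrative placeholders, not the project's actual identifiers, which appear in the diffs below.

```python
# my_sauce.py (hypothetical): the shape of every "transposed" step in this series
import pandas as pd
from pipulate import pip


def do_step(job: str, df: pd.DataFrame) -> pd.DataFrame:
    """Heavy lifting that used to live inline in a notebook cell."""
    result = df.copy()
    result["Processed"] = True  # stand-in for the real transformation
    pip.set(job, "step_df_json", result.to_json(orient="records"))  # persist to pipeline state
    return result  # returned so the notebook can display it


# The notebook cell then reduces to:
#   df = my_sauce.do_step(job, df)
#   display(df.head())
```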
## Current State: Pre-existing Support Functions in `gap_analyzer_sauce.py` ```diff [mike@nixos:~/repos/pipulate/Notebooks]$ git --no-pager diff diff --git a/Notebooks/gap_analyzer_sauce.py b/Notebooks/gap_analyzer_sauce.py ## The New Addition: `cluster_and_finalize_dataframe` Function index 4dd90821..d0647f23 100644 --- a/Notebooks/gap_analyzer_sauce.py +++ b/Notebooks/gap_analyzer_sauce.py @@ -25,6 +25,19 @@ from tldextract import extract from bs4 import BeautifulSoup import wordninja +# --- KEYWORD CLUSTERING SUPPORT FUNCTIONS (REQUIRES: nltk, sklearn, wordninja) --- +from collections import Counter +from nltk import bigrams +from nltk.corpus import stopwords +from nltk.stem import PorterStemmer +from nltk.tokenize import word_tokenize +from sklearn.cluster import MiniBatchKMeans +from sklearn.decomposition import TruncatedSVD +from sklearn.feature_extraction.text import TfidfVectorizer +from sklearn.metrics import silhouette_score +import numpy as np + + import nltk nltk.download('stopwords', quiet=True) @@ -1530,3 +1543,90 @@ def truncate_dataframe_by_volume(job: str, final_df: pd.DataFrame, row_limit: in print(f"❌ An error occurred during truncation: {e}") pip.set(job, 'truncated_df_for_clustering_json', pd.DataFrame().to_json(orient='records')) return pd.DataFrame() # Return empty DataFrame + + +# --- 1. CORE ML UTILITIES --- + +def calculate_silhouette(X, labels): + """Calculates the Silhouette Coefficient for cluster evaluation.""" + # Handle the edge case where there is only one cluster or too few data points + if len(np.unique(labels)) <= 1 or len(X) < 2: + return 0.0 # Return 0 for non-evaluatable cases + + return silhouette_score(X, labels) + +def preprocess_keywords(text): + """Stems, lowercases, tokenizes, and removes stopwords from a keyword string.""" + stemmer = PorterStemmer() + # Assuming stopwords were downloaded earlier with `nltk.download('stopwords')` + stop_words = set(stopwords.words('english')) + words = word_tokenize(text.lower()) + # Filter for alphanumeric words and then stem + return ' '.join([stemmer.stem(word) for word in words if word not in stop_words and word.isalnum()]) + +def keyword_clustering(df, keyword_column, n_clusters=30, n_components=5, max_features=500): + """Performs Tfidf Vectorization, Truncated SVD, and MiniBatchKMeans clustering.""" + + # 1. Preprocess keywords + df['Stemmed Keywords'] = df[keyword_column].apply(preprocess_keywords) + + # 2. Text Vectorization + print(f"Vectorizing... (Max Features: {max_features})") + vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english') + X = vectorizer.fit_transform(df['Stemmed Keywords']) + + # 3. Dimension Reduction + print(f"Reducing Dimensions... (Components: {n_components})") + svd = TruncatedSVD(n_components=n_components, random_state=42) + principal_components = svd.fit_transform(X) + + # 4. Clustering + print(f"Clustering... (K: {n_clusters})") + # Setting compute_labels=True to ensure compatibility with MiniBatchKMeans + minibatch_kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=100, random_state=42, n_init='auto') + df['Keyword Cluster'] = minibatch_kmeans.fit_predict(principal_components) + + # 5. 
Calculate silhouette score + print("Calculating silhouette cluster quality score (takes a bit)...") + silhouette_avg = calculate_silhouette(principal_components, df['Keyword Cluster']) + + # Return DataFrame, score, and the used parameters + return df, silhouette_avg, {'n_clusters': n_clusters, 'n_components': n_components, 'max_features': max_features} + +def name_keyword_clusters(df, keyword_column, cluster_column): + """Names each cluster by the most common non-stopword, non-repeating bigram within the cluster.""" + + stop_words = set(stopwords.words('english')) + cluster_names = {} + + for cluster in df[cluster_column].unique(): + cluster_data = df[df[cluster_column] == cluster] + all_keywords = ' '.join(cluster_data[keyword_column].astype(str)).split() + filtered_keywords = [word for word in all_keywords if word not in stop_words and word.isalnum()] + + bigram_counts = Counter(bigrams(filtered_keywords)) + + most_common_bigram = None + for bigram, count in bigram_counts.most_common(): + if bigram[0] != bigram[1]: + most_common_bigram = bigram + break + + if not most_common_bigram: + # Fallback to single most common word or a generic name + unigram_counts = Counter(filtered_keywords) + most_common_unigram = unigram_counts.most_common(1) + most_common_words = most_common_unigram[0][0] if most_common_unigram else "Generic Cluster" + else: + most_common_words = ' '.join(most_common_bigram) + + cluster_names[cluster] = most_common_words + + df['Keyword Group (Experimental)'] = df[cluster_column].map(cluster_names) + + # Drop Process Columns (as per original logic) + df.drop(columns=['Stemmed Keywords'], inplace=True) + df.drop(columns=['Keyword Cluster'], inplace=True) + + return df + (nix) [mike@nixos:~/repos/pipulate/Notebooks]$ ``` So you don't have to worry about moving those or telling me to move them or putting underscores on them to lose me a half-hour of time, haha! Sometimes gaining traction is a matter of doing a good look-ahead because traction is lost when you fail to see the black ice on the road, or whatever. We keep our forward momentum moving forward. Here is the next thing to transpose per the template we've been so successfully using of late. Here's what we need to move into `gap_analyzer_sauce.py`: ```python import itertools import json from pathlib import Path from IPython.display import display import re # Needed for the reorder function (reorder_columns_surgical) # NOTE: This cell assumes 'df' (the truncated DataFrame), 'job', and 'has_botify' are defined. # --- PATH DEFINITIONS --- # The parameter file is saved alongside other intermediate files in the /data folder keyword_cluster_params = Path("data") / f"{job}_keyword_cluster_params.json" # The final CSV output file path unformatted_csv = Path("data") / f"{job}_unformatted.csv" # --- REQUIRED SUPPORT FUNCTION (Re-defined for chaining safety) --- def reorder_columns_surgical(df, priority_column, after_column): """ Surgical port of bf.reorder_columns: Moves a column immediately after a specified column. 
""" if priority_column in df.columns and after_column in df.columns: columns = df.columns.drop(priority_column).tolist() after_column_index = columns.index(after_column) columns.insert(after_column_index + 1, priority_column) return df[columns] elif priority_column not in df.columns: print(f"⚠️ Warning: Priority column '{priority_column}' not found for reorder.") elif after_column not in df.columns: print(f"⚠️ Warning: After column '{after_column}' not found for reorder.") return df # --- MAIN CLUSTERING LOGIC --- print("Grouping Keywords...") # Download any necessary nltk components (punkt_tab is often missed) nltk.download('punkt_tab', quiet=True) # Configuration for iterative testing target_silhouette_score = 0.6 n_clusters_options = range(15, 26) n_components_options = [10, 15, 20] max_features_options = [50, 100, 150] total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options))) best_score = -1.0 # Initialize to a value lower than any possible score best_params = {} # 1. Check for Cached Parameters if keyword_cluster_params.exists(): try: with keyword_cluster_params.open('r') as file: best_params = json.load(file) print(f"Loaded initial parameters: {best_params}") # Test with loaded parameters df, score, _ = keyword_clustering(df, 'Keyword', **best_params) best_score = score print(f"Initial test with loaded parameters: Score = {score:.3f}") except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e: print(f"⚠️ Failed to load or use cached parameters. Starting full search. Error: {e}") best_params = {} # 2. Iterative Search (if target score is not met) if best_score < target_silhouette_score: print(f"Refining best keyword clustering fit... Total tests: {total_tests}") # Use a product of options to test every combination for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options): # Skip if these parameters match the already tested/loaded ones if (n_clusters == best_params.get('n_clusters') and n_components == best_params.get('n_components') and max_features == best_params.get('max_features')): continue # Run clustering and get new score/params df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features) print(f'Testing params: {params}, Score: {score:.3f}') if score > best_score: best_score = score best_params = params df = df_temp.copy() # Keep the DataFrame with the better cluster labels # Check for early stop condition if best_score >= target_silhouette_score: print(f'✅ Good enough score found: {best_score:.3f} with params {best_params}') with keyword_cluster_params.open('w') as file: json.dump(best_params, file) print(f'Saved best parameters: {best_params}') break # If the search finished without hitting the target, save the highest score achieved if best_score < target_silhouette_score and best_params: print(f'Highest score reached: {best_score:.3f}. Saving best parameters found.') with keyword_cluster_params.open('w') as file: json.dump(best_params, file) # 3. Finalize Clustering (Ensures the final best DataFrame is used if no search was run) # This step is crucial if the initial loaded parameters were already above the target, # as the 'df' used inside the loop might be a temporary copy. Re-cluster with the best params found. if 'Keyword Cluster' not in df.columns: # If the original clustering didn't run or was skipped df, _, _ = keyword_clustering(df, 'Keyword', **best_params) # 4. 
Naming clusters and saving the result print("\nNaming clusters...") # Original logic: df = bf.name_keyword_clusters(df, 'Keyword', 'Keyword Cluster') df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster') # --- FINAL REORDERING (Using surgical replacement of bf.reorder_columns) --- df = reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty') df = reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words') df = reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # This reorder is duplicated/redundant with the previous one, but ported verbatim # Conditional column reordering based on data presence if has_botify: # The diagnostic showed the column is 'Meta Description'. We use this. df = reorder_columns_surgical(df, 'Client URL', after_column='Meta Description') else: # Fallback remains unchanged (no Meta Description column to place Client URL after) df = reorder_columns_surgical(df, 'Client URL', after_column='Competition') # Original: df = bf.reorder_columns(df, 'Competitor URL', after_column='Client URL') df = reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL') # Final file persistence df.to_csv(unformatted_csv, index=False) # --- DISPLAY FINAL CLUSTER COUNTS --- print("\nFinal Keyword Group Counts:") value_counts = df["Keyword Group (Experimental)"].value_counts() if not value_counts.empty: max_digits = len(str(len(value_counts))) max_index_width = max(len(str(index)) for index in value_counts.index) max_count_width = max(len(f"{count:,}") for count in value_counts) for i, (index, count) in enumerate(value_counts.items(), start=1): counter_str = str(i).zfill(max_digits) count_str = f"{count:,}" print(f"{counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}") else: print("❌ No keyword groups were created.") # Store final DataFrame in the pipeline pip.set(job, 'final_clustered_df_json', df.to_json(orient='records')) ``` So we need the 2 things: 1. The transposed version of this for pasting into `gap_analyzer_sauce.py` 2. The code that gets pasted into the Jupyter Notebook cell this was lifted from. Please and thank you! **Gemini**: Traction, leverage, and mechanical advantage indeed. The rhythm is fast, the cats are fed, and the catapult is ready for another pull. That "half-hour of pain" renaming helpers is exactly the kind of "black ice" I'm trying to help you avoid. My apologies for that; I will be more careful. ## The Notebook Simplified: Integrating the New Utility Let's distill this next, complex cell—the machine learning "magic"—with that "uninteresting" verbatim focus. ----- ## 1\. Update `gap_analyzer_sauce.py` Add the following function to `Notebooks/gap_analyzer_sauce.py`. This function now encapsulates the entire clustering, naming, and final arrangement logic. It assumes the helper functions (`_reorder_columns_surgical`, `keyword_clustering`, `name_keyword_clusters`, etc.) are already present in the file, as you've indicated. ```python # In Notebooks/gap_analyzer_sauce.py import pandas as pd from pathlib import Path import json import itertools import re from pipulate import pip # Ensure pip is imported import nltk # For NLTK downloads # (Keep all previously added functions: _extract_registered_domain, # _download_file, _decompress_gz, _fetch_analysis_slugs, _export_data, # _reorder_columns_surgical, calculate_silhouette, preprocess_keywords, # keyword_clustering, name_keyword_clusters, etc.) # ... 
def cluster_and_finalize_dataframe(job: str, df: pd.DataFrame, has_botify: bool): """ Performs keyword clustering, names clusters, reorders columns, saves the unformatted CSV, stores the final DataFrame in pip state, and returns the final DataFrame for display. Args: job (str): The current Pipulate job ID. df (pd.DataFrame): The truncated DataFrame from the previous step. has_botify (bool): Flag indicating if Botify data is present. Returns: pd.DataFrame: The final, clustered, and arranged DataFrame. """ if df.empty: print("⚠️ Input DataFrame (df) is empty. Cannot perform clustering.") return pd.DataFrame() print("🤖 Grouping Keywords (Clustering)...") # --- CORE LOGIC (Moved from Notebook) --- try: # --- PATH DEFINITIONS --- keyword_cluster_params = Path("data") / f"{job}_keyword_cluster_params.json" unformatted_csv = Path("data") / f"{job}_unformatted.csv" # Download necessary nltk components nltk.download('punkt_tab', quiet=True) # Configuration for iterative testing target_silhouette_score = 0.6 n_clusters_options = range(15, 26) n_components_options = [10, 15, 20] max_features_options = [50, 100, 150] total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options))) best_score = -1.0 # Initialize best_params = {} # 1. Check for Cached Parameters if keyword_cluster_params.exists(): try: with keyword_cluster_params.open('r') as file: best_params = json.load(file) print(f" Loaded initial parameters: {best_params}") # Test with loaded parameters # We call the helper functions that are already in this file df, score, _ = keyword_clustering(df, 'Keyword', **best_params) best_score = score print(f" Initial test with loaded parameters: Score = {score:.3f}") except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e: print(f" ⚠️ Failed to load/use cached parameters. Starting full search. Error: {e}") best_params = {} # 2. Iterative Search if best_score < target_silhouette_score: print(f" Refining best keyword clustering fit... Total tests: {total_tests}") for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options): if (n_clusters == best_params.get('n_clusters') and n_components == best_params.get('n_components') and max_features == best_params.get('max_features')): continue # Skip already-tested params df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features) print(f' Testing params: {params}, Score: {score:.3f}') if score > best_score: best_score = score best_params = params df = df_temp.copy() # Keep the DF with the better cluster labels if best_score >= target_silhouette_score: print(f' ✅ Good enough score found: {best_score:.3f} with params {best_params}') with keyword_cluster_params.open('w') as file: json.dump(best_params, file) print(f' Saved best parameters: {best_params}') break if best_score < target_silhouette_score and best_params: print(f' Highest score reached: {best_score:.3f}. Saving best parameters found.') with keyword_cluster_params.open('w') as file: json.dump(best_params, file) # 3. Finalize Clustering if 'Keyword Cluster' not in df.columns: # If clustering didn't run or was skipped print(" Finalizing clustering with best parameters...") df, _, _ = keyword_clustering(df, 'Keyword', **best_params) # 4. 
Naming clusters print("\n🏷️ Naming clusters...") df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster') # Call helper # --- FINAL REORDERING --- # We call the helper function _reorder_columns_surgical print(" Reordering columns...") df = _reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty') df = _reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words') df = _reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # Verbatim duplicate reorder # Conditional reordering if has_botify: df = _reorder_columns_surgical(df, 'Client URL', after_column='Meta Description') else: df = _reorder_columns_surgical(df, 'Client URL', after_column='Competition') df = _reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL') # Final file persistence df.to_csv(unformatted_csv, index=False) print(f" 💾 Intermediate unformatted file saved to '{unformatted_csv}'") # --- DISPLAY FINAL CLUSTER COUNTS --- print("\n--- Final Keyword Group Counts ---") value_counts = df["Keyword Group (Experimental)"].value_counts() if not value_counts.empty: max_digits = len(str(len(value_counts))) max_index_width = max(len(str(index)) for index in value_counts.index) max_count_width = max(len(f"{count:,}") for count in value_counts) for i, (index, count) in enumerate(value_counts.items(), start=1): counter_str = str(i).zfill(max_digits) count_str = f"{count:,}" print(f" {counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}") else: print(" ❌ No keyword groups were created.") print("----------------------------------") # --- OUTPUT (to pip state) --- pip.set(job, 'final_clustered_df_json', df.to_json(orient='records')) print(f"💾 Stored final clustered DataFrame in pip state for job '{job}'.") # --------------------------- # --- RETURN VALUE --- return df except Exception as e: print(f"❌ An error occurred during clustering and finalization: {e}") pip.set(job, 'final_clustered_df_json', pd.DataFrame().to_json(orient='records')) return pd.DataFrame() # Return empty DataFrame ``` **Key changes:** * Encapsulates the entire clustering and final arrangement logic in `cluster_and_finalize_dataframe`. * Assumes helper functions (`keyword_clustering`, `name_keyword_clusters`, `_reorder_columns_surgical`) are already defined in the same file. * Takes `job`, the `df` (from the previous step), and `has_botify` as inputs. * Handles path definitions, `nltk` downloads, parameter search, clustering, naming, reordering, saving the intermediate CSV, and all print statements. * Stores the final DataFrame to pip state (`final_clustered_df_json`). * **Returns the final `df`** for display. ----- ### 2\. Update the Notebook Cell in `GAPalyzer.ipynb` Replace the cell containing the clustering logic with this: ```python # %% editable=true slideshow={"slide_type": ""} import pandas as pd # Keep for display from IPython.display import display import gap_analyzer_sauce # Ensure module is imported import itertools # Keep for any potential future use in cell import json # Keep for any potential future use in cell from pathlib import Path # Keep for any potential future use in cell import re # Keep for any potential future use in cell from io import StringIO # Keep for optional verification # This one function now handles the entire clustering and finalization process: # 1. Loads/tests clustering parameters from a JSON cache file. # 2. Runs iterative ML clustering (TF-IDF, SVD, k-means) to find the best fit. # 3. Names the resulting clusters using n-grams. 
# 4. Performs the final column reordering. # 5. Saves the final 'unformatted_csv'. # 6. Prints the final cluster counts. # 7. Stores the final DataFrame in pip state ('final_clustered_df_json'). # 8. Returns the final DataFrame for display. # It receives 'df' (the truncated DF) and 'has_botify' from previous cells. df = gap_analyzer_sauce.cluster_and_finalize_dataframe(job, df, has_botify) # Display the head of the final, clustered, and arranged DataFrame display(df.head()) # Optional verification # print("\nVerification:") # print(f" Final Clustered DF stored: {'final_clustered_df_json' in pip.read_state(job)}") # loaded_clustered_df = pd.read_json(StringIO(pip.get(job, 'final_clustered_df_json', '[]'))) # print(f" Clustered DF rows in state: {len(loaded_clustered_df)}") ``` **Key changes:** * **Drastically Simplified:** The notebook cell is now just imports, the single function call, and the `display()` call. * All complex ML logic, file I/O, reordering, and printing are hidden in the `gap_analyzer_sauce.py` function. * It passes `df` (the *truncated* DataFrame from the previous step) and `has_botify` directly. * It **re-assigns** the return value to `df`, replacing the truncated DF with the final, clustered DF, ready for the *next* step. * Displays the `head()` of the final, final DataFrame. **Me**: The output with the keywords removed for publishing purposes: 🤖 Grouping Keywords (Clustering)... Refining best keyword clustering fit... Total tests: 99 Vectorizing... (Max Features: 50) Reducing Dimensions... (Components: 10) Clustering... (K: 15) Calculating silhouette cluster quality score (takes a bit)... Testing params: {'n_clusters': 15, 'n_components': 10, 'max_features': 50}, Score: 0.802 ✅ Good enough score found: 0.802 with params {'n_clusters': 15, 'n_components': 10, 'max_features': 50} Saved best parameters: {'n_clusters': 15, 'n_components': 10, 'max_features': 50} 🏷️ Naming clusters... Reordering columns... 💾 Intermediate unformatted file saved to 'data/gapalyzer-02_unformatted.csv' --- Final Keyword Group Counts --- 01: [not shown] - 944 02: [not shown] - 131 03: [not shown] - 103 04: [not shown] - 100 05: [not shown] - 94 06: [not shown] - 83 07: [not shown] - 62 08: [not shown] - 55 09: [not shown] - 48 10: [not shown] - 48 11: [not shown] - 46 12: [not shown] - 44 13: [not shown] - 36 14: [not shown] - 11 15: [not shown] - 9 ---------------------------------- 💾 Stored final clustered DataFrame in pip state for job 'gapalyzer-02'. And the great big diff: ```diff (nix) [mike@nixos:~/repos/pipulate/Notebooks]$ git --no-pager diff --no-textconv diff --git a/Notebooks/gap_analyzer_sauce.py b/Notebooks/gap_analyzer_sauce.py index d0647f23..96988f39 100644 --- a/Notebooks/gap_analyzer_sauce.py +++ b/Notebooks/gap_analyzer_sauce.py @@ -1630,3 +1630,143 @@ def name_keyword_clusters(df, keyword_column, cluster_column): return df + +def cluster_and_finalize_dataframe(job: str, df: pd.DataFrame, has_botify: bool): + """ + Performs keyword clustering, names clusters, reorders columns, + saves the unformatted CSV, stores the final DataFrame in pip state, + and returns the final DataFrame for display. + + Args: + job (str): The current Pipulate job ID. + df (pd.DataFrame): The truncated DataFrame from the previous step. + has_botify (bool): Flag indicating if Botify data is present. + + Returns: + pd.DataFrame: The final, clustered, and arranged DataFrame. + """ + if df.empty: + print("⚠️ Input DataFrame (df) is empty. 
Cannot perform clustering.") + return pd.DataFrame() + + print("🤖 Grouping Keywords (Clustering)...") + + # --- CORE LOGIC (Moved from Notebook) --- + try: + # --- PATH DEFINITIONS --- + keyword_cluster_params = Path("data") / f"{job}_keyword_cluster_params.json" + unformatted_csv = Path("data") / f"{job}_unformatted.csv" + + # Download necessary nltk components + nltk.download('punkt_tab', quiet=True) + + # Configuration for iterative testing + target_silhouette_score = 0.6 + n_clusters_options = range(15, 26) + n_components_options = [10, 15, 20] + max_features_options = [50, 100, 150] + total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options))) + + best_score = -1.0 # Initialize + best_params = {} + + # 1. Check for Cached Parameters + if keyword_cluster_params.exists(): + try: + with keyword_cluster_params.open('r') as file: + best_params = json.load(file) + print(f" Loaded initial parameters: {best_params}") + # Test with loaded parameters + # We call the helper functions that are already in this file + df, score, _ = keyword_clustering(df, 'Keyword', **best_params) + best_score = score + print(f" Initial test with loaded parameters: Score = {score:.3f}") + except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e: + print(f" ⚠️ Failed to load/use cached parameters. Starting full search. Error: {e}") + best_params = {} + + # 2. Iterative Search + if best_score < target_silhouette_score: + print(f" Refining best keyword clustering fit... Total tests: {total_tests}") + for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options): + if (n_clusters == best_params.get('n_clusters') and + n_components == best_params.get('n_components') and + max_features == best_params.get('max_features')): + continue # Skip already-tested params + + df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features) + print(f' Testing params: {params}, Score: {score:.3f}') + + if score > best_score: + best_score = score + best_params = params + df = df_temp.copy() # Keep the DF with the better cluster labels + + if best_score >= target_silhouette_score: + print(f' ✅ Good enough score found: {best_score:.3f} with params {best_params}') + with keyword_cluster_params.open('w') as file: + json.dump(best_params, file) + print(f' Saved best parameters: {best_params}') + break + + if best_score < target_silhouette_score and best_params: + print(f' Highest score reached: {best_score:.3f}. Saving best parameters found.') + with keyword_cluster_params.open('w') as file: + json.dump(best_params, file) + + # 3. Finalize Clustering + if 'Keyword Cluster' not in df.columns: # If clustering didn't run or was skipped + print(" Finalizing clustering with best parameters...") + df, _, _ = keyword_clustering(df, 'Keyword', **best_params) + + # 4. 
Naming clusters + print("\n🏷️ Naming clusters...") + df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster') # Call helper + + # --- FINAL REORDERING --- + # We call the helper function _reorder_columns_surgical + print(" Reordering columns...") + df = _reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty') + df = _reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words') + df = _reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # Verbatim duplicate reorder + + # Conditional reordering + if has_botify: + df = _reorder_columns_surgical(df, 'Client URL', after_column='Meta Description') + else: + df = _reorder_columns_surgical(df, 'Client URL', after_column='Competition') + + df = _reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL') + + # Final file persistence + df.to_csv(unformatted_csv, index=False) + print(f" 💾 Intermediate unformatted file saved to '{unformatted_csv}'") + + + # --- DISPLAY FINAL CLUSTER COUNTS --- + print("\n--- Final Keyword Group Counts ---") + value_counts = df["Keyword Group (Experimental)"].value_counts() + if not value_counts.empty: + max_digits = len(str(len(value_counts))) + max_index_width = max(len(str(index)) for index in value_counts.index) + max_count_width = max(len(f"{count:,}") for count in value_counts) + for i, (index, count) in enumerate(value_counts.items(), start=1): + counter_str = str(i).zfill(max_digits) + count_str = f"{count:,}" + print(f" {counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}") + else: + print(" ❌ No keyword groups were created.") + print("----------------------------------") + + # --- OUTPUT (to pip state) --- + pip.set(job, 'final_clustered_df_json', df.to_json(orient='records')) + print(f"💾 Stored final clustered DataFrame in pip state for job '{job}'.") + # --------------------------- + + # --- RETURN VALUE --- + return df + + except Exception as e: + print(f"❌ An error occurred during clustering and finalization: {e}") + pip.set(job, 'final_clustered_df_json', pd.DataFrame().to_json(orient='records')) + return pd.DataFrame() # Return empty DataFrame diff --git a/assets/nbs/GAPalyzer.ipynb b/assets/nbs/GAPalyzer.ipynb index 4a501594..7318d1d8 100644 --- a/assets/nbs/GAPalyzer.ipynb +++ b/assets/nbs/GAPalyzer.ipynb @@ -579,268 +579,43 @@ "metadata": {}, "outputs": [], "source": [ - "# --- KEYWORD CLUSTERING SUPPORT FUNCTIONS (REQUIRES: nltk, sklearn, wordninja) ---\n", - "from collections import Counter\n", - "from nltk import bigrams\n", - "from nltk.corpus import stopwords\n", - "from nltk.stem import PorterStemmer\n", - "from nltk.tokenize import word_tokenize\n", - "from sklearn.cluster import MiniBatchKMeans\n", - "from sklearn.decomposition import TruncatedSVD\n", - "from sklearn.feature_extraction.text import TfidfVectorizer\n", - "from sklearn.metrics import silhouette_score\n", - "import numpy as np\n", - "\n", - "# --- 1. 
CORE ML UTILITIES ---\n", - "\n", - "def calculate_silhouette(X, labels):\n", - " \"\"\"Calculates the Silhouette Coefficient for cluster evaluation.\"\"\"\n", - " # Handle the edge case where there is only one cluster or too few data points\n", - " if len(np.unique(labels)) <= 1 or len(X) < 2:\n", - " return 0.0 # Return 0 for non-evaluatable cases\n", - "\n", - " return silhouette_score(X, labels)\n", - "\n", - "def preprocess_keywords(text):\n", - " \"\"\"Stems, lowercases, tokenizes, and removes stopwords from a keyword string.\"\"\"\n", - " stemmer = PorterStemmer()\n", - " # Assuming stopwords were downloaded earlier with `nltk.download('stopwords')`\n", - " stop_words = set(stopwords.words('english'))\n", - " words = word_tokenize(text.lower())\n", - " # Filter for alphanumeric words and then stem\n", - " return ' '.join([stemmer.stem(word) for word in words if word not in stop_words and word.isalnum()])\n", - "\n", - "def keyword_clustering(df, keyword_column, n_clusters=30, n_components=5, max_features=500):\n", - " \"\"\"Performs Tfidf Vectorization, Truncated SVD, and MiniBatchKMeans clustering.\"\"\"\n", - "\n", - " # 1. Preprocess keywords\n", - " df['Stemmed Keywords'] = df[keyword_column].apply(preprocess_keywords)\n", - "\n", - " # 2. Text Vectorization\n", - " print(f\"Vectorizing... (Max Features: {max_features})\")\n", - " vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english')\n", - " X = vectorizer.fit_transform(df['Stemmed Keywords'])\n", - "\n", - " # 3. Dimension Reduction\n", - " print(f\"Reducing Dimensions... (Components: {n_components})\")\n", - " svd = TruncatedSVD(n_components=n_components, random_state=42)\n", - " principal_components = svd.fit_transform(X)\n", - "\n", - " # 4. Clustering\n", - " print(f\"Clustering... (K: {n_clusters})\")\n", - " # Setting compute_labels=True to ensure compatibility with MiniBatchKMeans\n", - " minibatch_kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=100, random_state=42, n_init='auto') \n", - " df['Keyword Cluster'] = minibatch_kmeans.fit_predict(principal_components)\n", - "\n", - " # 5. 
Calculate silhouette score\n", - " print(\"Calculating silhouette cluster quality score (takes a bit)...\")\n", - " silhouette_avg = calculate_silhouette(principal_components, df['Keyword Cluster'])\n", - "\n", - " # Return DataFrame, score, and the used parameters\n", - " return df, silhouette_avg, {'n_clusters': n_clusters, 'n_components': n_components, 'max_features': max_features}\n", - "\n", - "def name_keyword_clusters(df, keyword_column, cluster_column):\n", - " \"\"\"Names each cluster by the most common non-stopword, non-repeating bigram within the cluster.\"\"\"\n", - "\n", - " stop_words = set(stopwords.words('english'))\n", - " cluster_names = {}\n", - "\n", - " for cluster in df[cluster_column].unique():\n", - " cluster_data = df[df[cluster_column] == cluster]\n", - " all_keywords = ' '.join(cluster_data[keyword_column].astype(str)).split()\n", - " filtered_keywords = [word for word in all_keywords if word not in stop_words and word.isalnum()]\n", - "\n", - " bigram_counts = Counter(bigrams(filtered_keywords))\n", - "\n", - " most_common_bigram = None\n", - " for bigram, count in bigram_counts.most_common():\n", - " if bigram[0] != bigram[1]:\n", - " most_common_bigram = bigram\n", - " break\n", - "\n", - " if not most_common_bigram:\n", - " # Fallback to single most common word or a generic name\n", - " unigram_counts = Counter(filtered_keywords)\n", - " most_common_unigram = unigram_counts.most_common(1)\n", - " most_common_words = most_common_unigram[0][0] if most_common_unigram else \"Generic Cluster\"\n", - " else:\n", - " most_common_words = ' '.join(most_common_bigram)\n", - "\n", - " cluster_names[cluster] = most_common_words\n", - "\n", - " df['Keyword Group (Experimental)'] = df[cluster_column].map(cluster_names)\n", - "\n", - " # Drop Process Columns (as per original logic)\n", - " df.drop(columns=['Stemmed Keywords'], inplace=True)\n", - " df.drop(columns=['Keyword Cluster'], inplace=True)\n", - "\n", - " return df\n", - "\n", - "# --- REMAINING UTILITIES (as supplied, but not directly called by the main block) ---\n", - "# analyze_keyword_frequencies, estimate_clusters_from_frequencies, test_cluster_sizes are not needed in main block" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27", - "metadata": {}, - "outputs": [], - "source": [ - "import itertools\n", - "import json\n", - "from pathlib import Path\n", + "# %% editable=true slideshow={\"slide_type\": \"\"}\n", + "import pandas as pd # Keep for display\n", "from IPython.display import display\n", - "import re # Needed for the reorder function (reorder_columns_surgical)\n", - "\n", - "# NOTE: This cell assumes 'df' (the truncated DataFrame), 'job', and 'has_botify' are defined.\n", - "\n", - "# --- PATH DEFINITIONS ---\n", - "# The parameter file is saved alongside other intermediate files in the /data folder\n", - "keyword_cluster_params = Path(\"data\") / f\"{job}_keyword_cluster_params.json\" \n", - "# The final CSV output file path\n", - "unformatted_csv = Path(\"data\") / f\"{job}_unformatted.csv\"\n", - "\n", - "# --- REQUIRED SUPPORT FUNCTION (Re-defined for chaining safety) ---\n", - "def reorder_columns_surgical(df, priority_column, after_column):\n", - " \"\"\"\n", - " Surgical port of bf.reorder_columns: Moves a column immediately after a specified column.\n", - " \"\"\"\n", - " if priority_column in df.columns and after_column in df.columns:\n", - " columns = df.columns.drop(priority_column).tolist()\n", - " after_column_index = columns.index(after_column)\n", - " 
columns.insert(after_column_index + 1, priority_column)\n", - " return df[columns]\n", - " elif priority_column not in df.columns:\n", - " print(f\"⚠️ Warning: Priority column '{priority_column}' not found for reorder.\")\n", - " elif after_column not in df.columns:\n", - " print(f\"⚠️ Warning: After column '{after_column}' not found for reorder.\")\n", - " return df\n", - " \n", - "# --- MAIN CLUSTERING LOGIC ---\n", - "print(\"Grouping Keywords...\")\n", - "\n", - "# Download any necessary nltk components (punkt_tab is often missed)\n", - "nltk.download('punkt_tab', quiet=True)\n", - "\n", - "# Configuration for iterative testing\n", - "target_silhouette_score = 0.6\n", - "n_clusters_options = range(15, 26)\n", - "n_components_options = [10, 15, 20]\n", - "max_features_options = [50, 100, 150]\n", - "total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options)))\n", - "\n", - "best_score = -1.0 # Initialize to a value lower than any possible score\n", - "best_params = {}\n", - "\n", - "# 1. Check for Cached Parameters\n", - "if keyword_cluster_params.exists():\n", - " try:\n", - " with keyword_cluster_params.open('r') as file:\n", - " best_params = json.load(file)\n", - " print(f\"Loaded initial parameters: {best_params}\")\n", - " \n", - " # Test with loaded parameters\n", - " df, score, _ = keyword_clustering(df, 'Keyword', **best_params)\n", - " best_score = score\n", - " print(f\"Initial test with loaded parameters: Score = {score:.3f}\")\n", - " except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e:\n", - " print(f\"⚠️ Failed to load or use cached parameters. Starting full search. Error: {e}\")\n", - " best_params = {}\n", - "\n", - "\n", - "# 2. Iterative Search (if target score is not met)\n", - "if best_score < target_silhouette_score:\n", - " print(f\"Refining best keyword clustering fit... Total tests: {total_tests}\")\n", - " \n", - " # Use a product of options to test every combination\n", - " for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options):\n", - " \n", - " # Skip if these parameters match the already tested/loaded ones\n", - " if (n_clusters == best_params.get('n_clusters') and\n", - " n_components == best_params.get('n_components') and\n", - " max_features == best_params.get('max_features')):\n", - " continue\n", - "\n", - " # Run clustering and get new score/params\n", - " df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features)\n", - " print(f'Testing params: {params}, Score: {score:.3f}')\n", - "\n", - " if score > best_score:\n", - " best_score = score\n", - " best_params = params\n", - " df = df_temp.copy() # Keep the DataFrame with the better cluster labels\n", - "\n", - " # Check for early stop condition\n", - " if best_score >= target_silhouette_score:\n", - " print(f'✅ Good enough score found: {best_score:.3f} with params {best_params}')\n", - " with keyword_cluster_params.open('w') as file:\n", - " json.dump(best_params, file)\n", - " print(f'Saved best parameters: {best_params}')\n", - " break\n", - " \n", - " # If the search finished without hitting the target, save the highest score achieved\n", - " if best_score < target_silhouette_score and best_params:\n", - " print(f'Highest score reached: {best_score:.3f}. Saving best parameters found.')\n", - " with keyword_cluster_params.open('w') as file:\n", - " json.dump(best_params, file)\n", - " \n", - "# 3. 
Finalize Clustering (Ensures the final best DataFrame is used if no search was run)\n", - "# This step is crucial if the initial loaded parameters were already above the target, \n", - "# as the 'df' used inside the loop might be a temporary copy. Re-cluster with the best params found.\n", - "if 'Keyword Cluster' not in df.columns: # If the original clustering didn't run or was skipped\n", - " df, _, _ = keyword_clustering(df, 'Keyword', **best_params)\n", - "\n", - "\n", - "# 4. Naming clusters and saving the result\n", - "print(\"\\nNaming clusters...\")\n", - "# Original logic: df = bf.name_keyword_clusters(df, 'Keyword', 'Keyword Cluster')\n", - "df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster')\n", - "\n", - "\n", - "# --- FINAL REORDERING (Using surgical replacement of bf.reorder_columns) ---\n", - "df = reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty')\n", - "df = reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words')\n", - "df = reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # This reorder is duplicated/redundant with the previous one, but ported verbatim\n", - "\n", - "# Conditional column reordering based on data presence\n", - "if has_botify:\n", - " # The diagnostic showed the column is 'Meta Description'. We use this.\n", - " df = reorder_columns_surgical(df, 'Client URL', after_column='Meta Description')\n", - "else:\n", - " # Fallback remains unchanged (no Meta Description column to place Client URL after)\n", - " df = reorder_columns_surgical(df, 'Client URL', after_column='Competition')\n", - " \n", - "# Original: df = bf.reorder_columns(df, 'Competitor URL', after_column='Client URL')\n", - "df = reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL')\n", + "import gap_analyzer_sauce # Ensure module is imported\n", + "import itertools # Keep for any potential future use in cell\n", + "import json # Keep for any potential future use in cell\n", + "from pathlib import Path # Keep for any potential future use in cell\n", + "import re # Keep for any potential future use in cell\n", + "from io import StringIO # Keep for optional verification\n", "\n", - "# Final file persistence\n", - "df.to_csv(unformatted_csv, index=False)\n", + "# This one function now handles the entire clustering and finalization process:\n", + "# 1. Loads/tests clustering parameters from a JSON cache file.\n", + "# 2. Runs iterative ML clustering (TF-IDF, SVD, k-means) to find the best fit.\n", + "# 3. Names the resulting clusters using n-grams.\n", + "# 4. Performs the final column reordering.\n", + "# 5. Saves the final 'unformatted_csv'.\n", + "# 6. Prints the final cluster counts.\n", + "# 7. Stores the final DataFrame in pip state ('final_clustered_df_json').\n", + "# 8. 
Returns the final DataFrame for display.\n", "\n", + "# It receives 'df' (the truncated DF) and 'has_botify' from previous cells.\n", + "df = gap_analyzer_sauce.cluster_and_finalize_dataframe(job, df, has_botify)\n", "\n", - "# --- DISPLAY FINAL CLUSTER COUNTS ---\n", - "print(\"\\nFinal Keyword Group Counts:\")\n", - "value_counts = df[\"Keyword Group (Experimental)\"].value_counts()\n", - "if not value_counts.empty:\n", - " max_digits = len(str(len(value_counts)))\n", - " max_index_width = max(len(str(index)) for index in value_counts.index)\n", - " max_count_width = max(len(f\"{count:,}\") for count in value_counts)\n", - " \n", - " for i, (index, count) in enumerate(value_counts.items(), start=1):\n", - " counter_str = str(i).zfill(max_digits)\n", - " count_str = f\"{count:,}\"\n", - " print(f\"{counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}\")\n", - "else:\n", - " print(\"❌ No keyword groups were created.\")\n", + "# Display the head of the final, clustered, and arranged DataFrame\n", + "display(df.head())\n", "\n", - "# Store final DataFrame in the pipeline\n", - "pip.set(job, 'final_clustered_df_json', df.to_json(orient='records'))" + "# Optional verification\n", + "# print(\"\\nVerification:\")\n", + "# print(f\" Final Clustered DF stored: {'final_clustered_df_json' in pip.read_state(job)}\")\n", + "# loaded_clustered_df = pd.read_json(StringIO(pip.get(job, 'final_clustered_df_json', '[]')))\n", + "# print(f\" Clustered DF rows in state: {len(loaded_clustered_df)}\")" ] }, { "cell_type": "code", "execution_count": null, - "id": "28", + "id": "27", "metadata": { "editable": true, "slideshow": { @@ -980,7 +755,7 @@ { "cell_type": "code", "execution_count": null, - "id": "29", + "id": "28", "metadata": { "editable": true, "slideshow": { @@ -1054,7 +829,7 @@ { "cell_type": "code", "execution_count": null, - "id": "30", + "id": "29", "metadata": { "editable": true, "slideshow": { @@ -1246,7 +1021,7 @@ { "cell_type": "code", "execution_count": null, - "id": "31", + "id": "30", "metadata": {}, "outputs": [], "source": [ @@ -1603,7 +1378,7 @@ { "cell_type": "code", "execution_count": null, - "id": "32", + "id": "31", "metadata": {}, "outputs": [], "source": [] (nix) [mike@nixos:~/repos/pipulate/Notebooks]$ ``` And that concludes the keyword clustering part of this project, and gee I didn't talk about any of the wonderful stuff going on here at all. Why don't you do it Gemini? Put a nice long flowery bow on it by pulling on whatever threads here you'd like and think are important to humans or AIs following along. Cover any aspects of any details of this project that you think needs elaboration for an audience that might require being brought up to speed on a lot of the requisite knowledge and background at play here. Feel free to use asides, tangents, suggestions, and consider calling out things that surprised and impressed you, things you're anticipating. Whatever. Take it away and please and thank you, Gemini. **Gemini**: *That* was the sound of the mainspring clicking into its final, tightest position. ⚙️ ## Unpacking the Magic: Deep Dive into the Clustering Pipeline The kitties are fed, the code is committed, and the most algorithmically-dense part of the entire workflow—the "magic" of keyword clustering—has been successfully distilled. This wasn't just a rote transposition; this was the careful relocation of the workflow's very *brain*. And what a brain it is! 
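For readers who want the whole pipeline at a glance before the step-by-step tour below, here is a minimal, self-contained sketch of the same TF-IDF → SVD → MiniBatchKMeans → silhouette flow. The toy keyword list and parameter values are illustrative only; the production logic (with stemming, the cached parameter search, and bigram-based cluster naming) lives in `gap_analyzer_sauce.py` as shown above.

```python
# A toy, self-contained version of the clustering pipeline described below.
# Keywords, parameters, and column names are illustrative, not the production configuration.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

keywords = [
    "fix leaky faucet", "leaky faucet repair", "how to fix a dripping faucet",
    "best kitchen faucet", "kitchen faucet prices", "buy kitchen faucet online",
    "water heater installation", "tankless water heater cost", "water heater repair near me",
]
df = pd.DataFrame({"Keyword": keywords})

# 1-2. Vectorize: each keyword becomes a sparse TF-IDF vector of its distinctive terms.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50)
X = vectorizer.fit_transform(df["Keyword"])

# 3. Compress the word dimensions down to a handful of latent "concepts".
svd = TruncatedSVD(n_components=3, random_state=42)
components = svd.fit_transform(X)

# 4. Group the keywords in concept space.
kmeans = MiniBatchKMeans(n_clusters=3, random_state=42, n_init="auto")
df["Keyword Cluster"] = kmeans.fit_predict(components)

# 5. Score how dense and well-separated the clusters are (-1 to 1; higher is better).
score = silhouette_score(components, df["Keyword Cluster"])
print(f"Silhouette: {score:.3f}")
print(df.sort_values("Keyword Cluster"))
```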
Let's put a flowery bow on this and drill down into the sheer *expertise* you just encapsulated, because this right here is the "AI" in "AI-SEO-in-a-Box" (the classical ML kind) that most agencies would guard with their lives.

---

## Elaboration Station X: The Alchemist's Cluster 🌌

You're absolutely right, your "hoarded domain expertise" is precisely what makes this valuable—it's the *process* that the LLMs haven't been trained on. And this last cell? This is the crown jewel of that process.

### Beyond "Pivot Table": Deconstructing the Clustering Magic

For anyone following along, the previous steps were about *gathering* and *cleaning* data. This step is about **creating new, high-level strategic insight from that data**. It's the "Sense-Making" machine.

You've taken 1,814 seemingly unrelated, high-value keywords and asked the machine, "What are the hidden *topics* here?" Your code, now neatly tucked into `gap_analyzer_sauce.py`, does this with a classic, powerful ML pipeline:

1. **Text Normalization (`preprocess_keywords`):** The first "chisel-strike." It cleans the keywords, stemming "running" and "runner" to the same root ("run"), removing "noise" words (`stopwords`), and focusing on the core concepts.
2. **Finding the Signal (`TfidfVectorizer`):** This is far more clever than just counting words. It finds the *defining* terms. "Faucet" might appear in nearly every keyword, but if only a handful *also* contain "leaky," TF-IDF (Term Frequency-Inverse Document Frequency) learns that "leaky" does more to define a topic than the ubiquitous "faucet." It finds the unique, descriptive "DNA" of each keyword.
3. **Conceptual Compression (`TruncatedSVD`):** This is the high-level thinking. Instead of trying to compare keywords across hundreds of word dimensions, you compress the TF-IDF features down to 10 or 20 "concepts." SVD (Singular Value Decomposition) finds the abstract relationships—it learns that "repair," "fix," and "DIY" might all be part of a single "Concept 1," while "price" and "buy" might be "Concept 2."
4. **The Grouping (`MiniBatchKMeans`):** This is the "pivot table" on steroids. It takes the keywords, now represented as points in that 10-dimensional "concept space," and finds the 15 densest "groups" or "clusters."
5. **The Quality Check (`silhouette_score`):** This is your *traction* and *leverage*. You're not just guessing at the parameters! Your code *iteratively tests* combinations (`itertools.product`) of cluster counts, concept dimensions, and word features, *measuring the quality* of each resulting "grouping" with the silhouette score. That `0.802` score is fantastic—it means the clusters are dense and well-separated. You *proved* the grouping is meaningful.
6. **Making it Human (`name_keyword_clusters`):** This is the final act of translation, the part that makes the data *actionable*. An ML model gives you "Cluster 0." Your `name_keyword_clusters` function looks at the keywords in that cluster and labels it with its most common meaningful bigram, something like "leaky faucet." You've turned machine output into a human-readable content strategy.

### The "Black Ice" You Dodged

Your look-ahead, pre-loading `gap_analyzer_sauce.py` with all those helper functions, was a brilliant move. You saw the "black ice" of `NameError` or, even worse, the "half-hour of pain" from me suggesting you rename them all with `_` prefixes. You kept the momentum by ensuring the environment was prepared *before* the main logic was moved. That's not just rote work; that's strategic, high-level planning.

---

### Anticipation: The Final Polish 💎

The rhythm is undeniable.
The output is clean. The `df` in memory is now the final, golden, unified dataset. It contains: * The competitive landscape (`pivot_df` columns) * The intrinsic value (`agg_df` metrics) * The internal reality (`botify_export_df` data) * The strategic themes (`Keyword Group (Experimental)`) You've taken raw ore (CSV downloads), smelted it (aggregation), alloyed it (merging), filtered it (brand removal), and cast it (clustering). Now, all that remains is the final, painterly act. The very last cells of the original notebook are the `openpyxl` formatting—the meticulous, pixel-perfect polishing that turns a raw data slab into the client-ready, color-coded, multi-tabbed Excel deliverable. The catapult is wound. The payload is built. The next step is to attach the gilded ribbons and launch it. ## Anticipating the Final Polish This has been a masterclass in methodical distillation. Bravo. --- ## Book Analysis ### Ai Editorial Take This entry is a technical tour-de-force, beautifully illustrating the transition from exploratory data science in a notebook to robust, maintainable production code. The AI perspective is deeply embedded in the process, not just as a tool, but as a collaborative architect. The detailed breakdown of the ML pipeline, coupled with the strategic reasoning behind the refactoring, makes this a high-value piece for both aspiring data scientists and SEO professionals looking to leverage advanced techniques. The concept of 'hoarded domain expertise' being systematically codified is particularly compelling. ### Title Brainstorm * **Title Option:** Distilling ML Magic: Refactoring Automated Keyword Clustering for Scalability * **Filename:** `distilling-ml-magic-refactoring-keyword-clustering.md` * **Rationale:** Captures the essence of moving complex ML (magic) into a reusable, scalable function, emphasizing both the technical and strategic aspects. * **Title Option:** From Jupyter to Production: Encapsulating Advanced Keyword Clustering * **Filename:** `jupyter-to-production-keyword-clustering.md` * **Rationale:** Focuses on the journey of code from an experimental environment to a stable, production-ready module, a key theme in software development. * **Title Option:** Architecting AI-SEO: The Core Clustering Engine's Refactor * **Filename:** `architecting-ai-seo-clustering-engine.md` * **Rationale:** Highlights the architectural aspect and positions the clustering as a central 'engine' within the broader AI-SEO system. * **Title Option:** The Iterative Science of SEO Clustering: A Refactoring Blueprint * **Filename:** `iterative-science-seo-clustering-refactor.md` * **Rationale:** Emphasizes the scientific, iterative nature of finding optimal clusters and the blueprint aspect of the refactoring. ### Content Potential And Polish - **Core Strengths:** - Demonstrates sophisticated machine learning application (TF-IDF, SVD, KMeans, Silhouette Score) for a real-world SEO problem. - Exemplifies excellent code refactoring practices, moving complex logic into a modular, reusable function. - Highlights iterative parameter tuning and caching for ML models, showing robustness. - Clearly articulates the value of translating machine output into human-actionable insights (cluster naming). - Showcases effective AI-human collaboration in a technical development context. - **Suggestions For Polish:** - Add a small diagram or flowchart illustrating the keyword clustering ML pipeline (TF-IDF -> SVD -> KMeans -> Naming). 
- Include a simplified, high-level explanation of *why* TF-IDF, SVD, and KMeans are chosen for this specific problem. - Provide a concrete (anonymized) example of a raw keyword list and how it's transformed into named clusters, perhaps with a visual. - Elaborate on the business impact of having well-defined keyword clusters for content strategy. - Consider adding a brief section on potential next steps or advanced considerations for the clustering algorithm (e.g., different algorithms, semi-supervised approaches). ### Next Step Prompts - Generate a Mermaid.js flowchart visualizing the `cluster_and_finalize_dataframe`'s internal ML pipeline, from preprocessing to cluster naming. - Draft a 200-word 'Executive Summary' for this entry, focusing on the strategic business value of automated keyword clustering for content marketing managers.