--- name: keyword-extractor description: Extract keywords and key phrases from text using TF-IDF, RAKE, and frequency analysis. Generate word clouds and export to various formats. --- # Keyword Extractor Extract important keywords and key phrases from text documents using multiple algorithms. Supports TF-IDF, RAKE, and simple frequency analysis with word cloud visualization. ## Quick Start ```python from scripts.keyword_extractor import KeywordExtractor # Extract keywords extractor = KeywordExtractor() keywords = extractor.extract("Your long text document here...") print(keywords[:10]) # Top 10 keywords # From file keywords = extractor.extract_from_file("document.txt") extractor.to_wordcloud("keywords.png") ``` ## Features - **Multiple Algorithms**: TF-IDF, RAKE, frequency-based - **Key Phrases**: Extract multi-word phrases, not just single words - **Scoring**: Relevance scores for ranking - **Stopword Filtering**: Built-in + custom stopwords - **N-gram Support**: Unigrams, bigrams, trigrams - **Word Cloud**: Visualize keyword importance - **Batch Processing**: Process multiple documents ## API Reference ### Initialization ```python extractor = KeywordExtractor( method="tfidf", # tfidf, rake, frequency max_keywords=20, # Maximum keywords to return min_word_length=3, # Minimum word length ngram_range=(1, 3) # Unigrams to trigrams ) ``` ### Extraction Methods ```python # TF-IDF (best for comparing documents) keywords = extractor.extract(text, method="tfidf") # RAKE (best for key phrases) keywords = extractor.extract(text, method="rake") # Frequency (simple word counts) keywords = extractor.extract(text, method="frequency") ``` ### Results Format ```python keywords = extractor.extract(text) # Returns list of tuples: [(keyword, score), ...] # [('machine learning', 0.85), ('data science', 0.72), ...] # Get just keywords keyword_list = extractor.get_keywords(text) # ['machine learning', 'data science', ...] ``` ### Customization ```python # Add custom stopwords extractor.add_stopwords(['company', 'product', 'service']) # Set minimum frequency extractor.min_frequency = 2 # Filter by part of speech (nouns only) extractor.pos_filter = ['NN', 'NNS', 'NNP'] ``` ### Visualization ```python # Generate word cloud extractor.to_wordcloud("wordcloud.png", colormap="viridis") # Bar chart of top keywords extractor.plot_keywords("keywords.png", top_n=15) ``` ### Export ```python # To JSON extractor.to_json("keywords.json") # To CSV extractor.to_csv("keywords.csv") # To plain text extractor.to_text("keywords.txt") ``` ## CLI Usage ```bash # Extract from text python keyword_extractor.py --text "Your text here" --top 10 # Extract from file python keyword_extractor.py --input document.txt --method tfidf --output keywords.json # Generate word cloud python keyword_extractor.py --input document.txt --wordcloud cloud.png # Batch process directory python keyword_extractor.py --input-dir ./docs --output keywords_all.csv ``` ### CLI Arguments | Argument | Description | Default | |----------|-------------|---------| | `--text` | Text to analyze | - | | `--input` | Input file path | - | | `--input-dir` | Directory of files | - | | `--output` | Output file | - | | `--method` | Algorithm (tfidf, rake, frequency) | `tfidf` | | `--top` | Number of keywords | 20 | | `--ngrams` | N-gram range (e.g., "1,2") | `1,3` | | `--wordcloud` | Generate word cloud | - | | `--stopwords` | Custom stopwords file | - | ## Examples ### Article Keyword Extraction ```python extractor = KeywordExtractor(method="tfidf") article = """ Machine learning is transforming data science. Deep learning models are achieving state-of-the-art results in natural language processing and computer vision. Neural networks continue to advance... """ keywords = extractor.extract(article, top_n=10) for keyword, score in keywords: print(f"{score:.3f}: {keyword}") ``` ### Compare Multiple Documents ```python extractor = KeywordExtractor(method="tfidf") docs = [ open("doc1.txt").read(), open("doc2.txt").read(), open("doc3.txt").read() ] # Extract keywords from each for i, doc in enumerate(docs): keywords = extractor.extract(doc, top_n=5) print(f"\nDocument {i+1}:") for kw, score in keywords: print(f" {kw}: {score:.3f}") ``` ### SEO Keyword Research ```python extractor = KeywordExtractor( method="rake", ngram_range=(2, 4), # Focus on phrases max_keywords=30 ) webpage_content = open("page.html").read() keywords = extractor.extract(webpage_content) # Filter by score threshold high_value = [(kw, s) for kw, s in keywords if s > 0.5] print("High-value keywords for SEO:") for kw, score in high_value: print(f" {kw}") ``` ## Algorithm Comparison | Algorithm | Best For | Strengths | |-----------|----------|-----------| | **TF-IDF** | Document comparison | Finds unique terms, good for search | | **RAKE** | Key phrases | Extracts multi-word concepts | | **Frequency** | Quick overview | Simple, fast, interpretable | ## Dependencies ``` scikit-learn>=1.2.0 nltk>=3.8.0 pandas>=2.0.0 matplotlib>=3.7.0 wordcloud>=1.9.0 ``` ## Limitations - English optimized (other languages need language-specific stopwords) - Very short texts may not have enough data for TF-IDF - Domain-specific jargon may need custom stopword handling