---
name: named-entity-extractor
description: Extract named entities (people, organizations, locations, dates) from text using NLP. Use for document analysis, information extraction, or data enrichment.
---

# Named Entity Extractor

Extract named entities from text including people, organizations, locations, dates, and more.

## Features

- **Entity Types**: People, organizations, locations, dates, money, percentages
- **Multiple Models**: spaCy for accuracy, regex for speed
- **Batch Processing**: Process multiple documents
- **Entity Linking**: Group same entities across text
- **Export**: JSON, CSV output formats
- **Visualization**: Entity highlighting

## Quick Start

```python
from entity_extractor import EntityExtractor

extractor = EntityExtractor()

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
entities = extractor.extract(text)

for entity in entities:
    print(f"{entity['text']}: {entity['type']}")

# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
# 1976: DATE
```

## CLI Usage

```bash
# Extract from text
python entity_extractor.py --text "Steve Jobs founded Apple in California."
# Extract from file
python entity_extractor.py --input document.txt

# Batch process folder
python entity_extractor.py --input ./documents/ --output entities.csv

# Filter by entity type
python entity_extractor.py --input document.txt --types PERSON,ORG

# Use regex mode (faster, less accurate)
python entity_extractor.py --input document.txt --mode regex

# JSON output
python entity_extractor.py --input document.txt --json
```

## API Reference

### EntityExtractor Class

```python
class EntityExtractor:
    def __init__(self, mode: str = "spacy", model: str = "en_core_web_sm")

    # Extraction
    def extract(self, text: str) -> list
    def extract_file(self, filepath: str) -> list
    def extract_batch(self, folder: str) -> dict

    # Filtering
    def filter_entities(self, entities: list, types: list) -> list
    def get_unique_entities(self, entities: list) -> list
    def group_by_type(self, entities: list) -> dict

    # Analysis
    def entity_frequency(self, text: str) -> dict
    def find_relationships(self, text: str) -> list

    # Export
    def to_csv(self, entities: list, output: str) -> str
    def to_json(self, entities: list, output: str) -> str
    def highlight_text(self, text: str) -> str
```

## Entity Types

### Standard Entity Types (spaCy)

| Type | Description | Example |
|------|-------------|---------|
| PERSON | People, including fictional | "Steve Jobs" |
| ORG | Companies, agencies, institutions | "Apple Inc." |
| GPE | Countries, cities, states | "California" |
| LOC | Non-GPE locations, mountains, water | "Pacific Ocean" |
| DATE | Dates, periods | "January 2024" |
| TIME | Times | "3:30 PM" |
| MONEY | Monetary values | "$1.5 million" |
| PERCENT | Percentages | "20%" |
| PRODUCT | Products | "iPhone" |
| EVENT | Events | "World Cup" |
| WORK_OF_ART | Books, songs, etc. | "The Great Gatsby" |
| LAW | Laws, regulations | "GDPR" |
| LANGUAGE | Languages | "English" |
| NORP | Nationalities, groups | "American" |

### Regex Mode Entities

Faster extraction with regex patterns:

| Type | Description |
|------|-------------|
| EMAIL | Email addresses |
| PHONE | Phone numbers |
| URL | Web URLs |
| DATE | Common date formats |
| MONEY | Currency amounts |
| PERCENTAGE | Percentages |

## Output Format

### Entity Result

```python
{
    "text": "Steve Jobs",
    "type": "PERSON",
    "start": 10,
    "end": 20,
    "confidence": 0.95
}
```

### Full Extraction Result

```python
{
    "text": "Original text...",
    "entities": [
        {"text": "Steve Jobs", "type": "PERSON", "start": 10, "end": 20},
        {"text": "Apple Inc.", "type": "ORG", "start": 30, "end": 40}
    ],
    "summary": {
        "total_entities": 5,
        "unique_entities": 4,
        "by_type": {
            "PERSON": 2,
            "ORG": 1,
            "GPE": 2
        }
    }
}
```

## Filtering and Grouping

### Filter by Type

```python
entities = extractor.extract(text)

# Get only people and organizations
filtered = extractor.filter_entities(entities, ["PERSON", "ORG"])
```

### Get Unique Entities

```python
# Remove duplicates, keep first occurrence
unique = extractor.get_unique_entities(entities)
```

### Group by Type

```python
grouped = extractor.group_by_type(entities)

# Returns:
# {
#     "PERSON": ["Steve Jobs", "Tim Cook"],
#     "ORG": ["Apple Inc."],
#     "GPE": ["California", "Cupertino"]
# }
```

## Entity Frequency

```python
frequency = extractor.entity_frequency(text)

# Returns:
# {
#     "Steve Jobs": {"count": 5, "type": "PERSON"},
#     "Apple": {"count": 8, "type": "ORG"},
#     "California": {"count": 2, "type": "GPE"}
# }
```

## Batch Processing

### Process Folder

```python
results = extractor.extract_batch("./documents/")

# Returns:
# {
#     "doc1.txt": {
#         "entities": [...],
#         "summary": {...}
#     },
#     "doc2.txt": {
#         "entities": [...],
#         "summary": {...}
#     }
# }
```

### Export to CSV

```python
extractor.to_csv(results, "entities.csv")

# Creates CSV with columns:
# filename, entity_text, entity_type, start, end
```

## Text Highlighting

Generate HTML with highlighted entities:

```python
html = extractor.highlight_text(text)
# Returns HTML with colored spans for each entity type
```

## Example Workflows

### Document Analysis

```python
extractor = EntityExtractor()

# Analyze a document
text = open("article.txt").read()
result = extractor.extract(text)

# Get key people mentioned
people = extractor.filter_entities(result, ["PERSON"])
print(f"People mentioned: {len(people)}")

# Get frequency
freq = extractor.entity_frequency(text)
top_entities = sorted(freq.items(), key=lambda x: x[1]["count"], reverse=True)[:10]
```

### Contact Information Extraction

```python
extractor = EntityExtractor(mode="regex")

text = """
Contact John Smith at john.smith@example.com or call (555) 123-4567.
"""

entities = extractor.extract(text)
# Finds: EMAIL, PHONE entities
```

### Content Tagging

```python
extractor = EntityExtractor()

articles = ["article1.txt", "article2.txt", "article3.txt"]
tags = {}

for article in articles:
    entities = extractor.extract_file(article)
    tags[article] = extractor.get_unique_entities(entities)
```

## Dependencies

- spacy>=3.7.0
- pandas>=2.0.0
- en_core_web_sm (spaCy model)

Note: Run `python -m spacy download en_core_web_sm` to install the model.
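
For readers who want to see how a regex mode can work without spaCy installed, here is a minimal standard-library sketch. It is not the skill's actual implementation: the `PATTERNS` table and the `extract_with_regex` and `group_by_type` functions are hypothetical stand-ins, and the patterns are deliberately simple (US-style phone numbers, dollar amounts only). Each match is returned as a dict with the `text`, `type`, `start`, and `end` keys from the Output Format section; `confidence` is omitted because a plain regex match has no score.

```python
import re

# Hypothetical, simplified patterns for illustration only;
# the real regex mode may use different or stricter expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENTAGE": re.compile(r"\d+(?:\.\d+)?%"),
}


def extract_with_regex(text):
    """Return entity dicts (text, type, start, end) sorted by position."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({
                "text": match.group(),
                "type": entity_type,
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(entities, key=lambda e: e["start"])


def group_by_type(entities):
    """Map each entity type to its unique strings, in first-seen order."""
    grouped = {}
    for entity in entities:
        texts = grouped.setdefault(entity["type"], [])
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return grouped


sample = "Contact John Smith at john.smith@example.com or call (555) 123-4567."
found = extract_with_regex(sample)
for entity in found:
    print(f"{entity['text']}: {entity['type']}")
# Output:
# john.smith@example.com: EMAIL
# (555) 123-4567: PHONE
```

Note that "John Smith" is not found: regex patterns only catch entities with a predictable surface form, which is why names, organizations, and places need the spaCy mode.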