---
name: content-similarity-checker
description: Compare document similarity using TF-IDF, cosine similarity, and Jaccard index. Use for plagiarism detection, duplicate finding, or content matching.
---

# Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

## Features

- **Cosine Similarity**: TF-IDF based comparison
- **Jaccard Similarity**: Set-based comparison
- **Levenshtein Distance**: Edit distance for short texts
- **Batch Comparison**: Compare multiple documents
- **Similarity Matrix**: Pairwise comparison of all documents
- **Reports**: Detailed similarity reports

## Quick Start

```python
from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
```

## CLI Usage

```bash
# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in a folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use a specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
```

## API Reference

### SimilarityChecker Class

```python
import pandas as pd

class SimilarityChecker:
    def __init__(self, method: str = "cosine"): ...

    # Text comparison
    def compare(self, text1: str, text2: str) -> float: ...
    def compare_files(self, file1: str, file2: str) -> float: ...

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict: ...

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list: ...
    def similarity_matrix(self, documents: list) -> pd.DataFrame: ...
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list: ...

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict: ...
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list: ...

    # Report
    def generate_report(self, output: str) -> str: ...
```

## Similarity Methods

### Cosine Similarity (Default)

Best for comparing documents of different lengths:

```python
checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
```

### Jaccard Similarity

Good for comparing sets of words/tokens:

```python
checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
```

### Levenshtein (Edit Distance)

Best for short texts and typo detection:

```python
checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
```

### TF-IDF + Cosine

Weights each term by its TF-IDF importance before comparing:

```python
checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
```

## Batch Comparison

### Compare to Corpus

```python
checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns (sorted by similarity, highest first):
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]
```

### Similarity Matrix

```python
documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#         doc_0   doc_1   doc_2
# doc_0   1.000   0.750   0.320
# doc_1   0.750   1.000   0.410
# doc_2   0.320   0.410   1.000
```

### Find Duplicates

```python
documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
```

## Compare All Methods

Get similarity scores from all algorithms:

```python
checker = SimilarityChecker()

results = checker.compare_all_methods(text1, text2)

# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}
```

## Folder Operations

### Compare All Files in Folder

```python
checker = SimilarityChecker()

results = checker.compare_folder("./documents/")

# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": ...
}
```

### Find Most Similar to Query

```python
query = "Your search text here..."

results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]
```

## Output Format

### Comparison Result

```python
result = checker.compare_with_details(text1, text2)

# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}
```

## Example Workflows

### Plagiarism Check

```python
from pathlib import Path

checker = SimilarityChecker()

# Compare the submission against every file in the source folder
submission = Path("student_paper.txt").read_text()
matches = checker.find_most_similar(submission, "./source_materials/", top_n=10)

suspicious = [m for m in matches if m["similarity"] > 0.6]
if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for m in suspicious:
        print(f"  {m['file']}: {m['similarity']:.0%}")
```

### Document Deduplication

```python
from pathlib import Path

checker = SimilarityChecker()

# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)
print(f"Found {len(duplicates)} duplicate pairs")
```

### Content Matching

```python
checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")
```

## Dependencies

- scikit-learn>=1.3.0
- nltk>=3.8.0
- numpy>=1.24.0
- pandas>=2.0.0
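
## Appendix: What the Scores Measure

To make the Similarity Methods section more concrete, here is a minimal standard-library sketch of three of the measures. It is an illustration only, not the code inside `similarity_checker.py`: the function names (`tokenize`, `jaccard`, `cosine`, `levenshtein_similarity`) and the regex tokenizer are hypothetical, and `cosine` uses raw term counts rather than TF-IDF weights to keep the example dependency-free.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(text1: str, text2: str) -> float:
    """Number of shared unique words divided by the number of all unique words."""
    a, b = set(tokenize(text1)), set(tokenize(text2))
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def cosine(text1: str, text2: str) -> float:
    """Cosine of the angle between raw term-count vectors (no IDF weighting)."""
    a, b = Counter(tokenize(text1)), Counter(tokenize(text2))
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein_similarity(text1: str, text2: str) -> float:
    """1 - (character edit distance / length of the longer text)."""
    if not text1 and not text2:
        return 1.0
    previous = list(range(len(text2) + 1))  # distances from the empty prefix
    for i, ch1 in enumerate(text1, start=1):
        current = [i]
        for j, ch2 in enumerate(text2, start=1):
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + (ch1 != ch2),  # substitution (free on a match)
            ))
        previous = current
    return 1.0 - previous[-1] / max(len(text1), len(text2))


if __name__ == "__main__":
    t1 = "The quick brown fox jumps over the lazy dog"
    t2 = "A fast brown fox leaps over a sleepy dog"
    print(f"jaccard:     {jaccard(t1, t2):.2f}")
    print(f"cosine:      {cosine(t1, t2):.2f}")
    print(f"levenshtein: {levenshtein_similarity(t1, t2):.2f}")
```

All three return a value between 0.0 and 1.0, matching the ranges documented above. Jaccard ignores how often a word occurs, cosine takes word frequency into account, and the Levenshtein score works on characters, which is why it suits short strings and typo detection better than whole documents.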