# My own tiny similarity and plagiarism detector in Python with scikit-learn

- I want to build a small similarity and plagiarism detector that works on any files!

## First try

- Source: <https://kalebujordan.com/make-your-own-plagiarism-detector-in-python/>

In [1]:
import os

# Extension of the files to compare. Only the last assignment is kept, so Java files here.
extension = 'txt'
extension = 'ml'
extension = 'java'

In [2]:
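from pathlib import Path

student_files = [doc for doc in os.listdir() if doc.endswith(f'.{extension}')]
print("students_files:", student_files)

# read_text() opens and closes each file, unlike a bare open(...).read()
student_notes = [Path(name).read_text() for name in student_files]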

students_files: ['Ivann_VYSLANKO.java', 'Ethan_GAUTHIER.java', 'Florian_EPAIN__Salma_BEN_AYAD.java', 'Astrid_PELISSOU.java', 'Clement_DESBROUSSES__Dorian_LAHOCHE_MA1-2.java', 'Adélaïde_MONTEMBAULT__Mealig_LE_GUEVEL_MA1-2.java', 'Mohamed__MOUHIMINE_MA1-2.java', 'Enzo_LEGUY__Albane_CHALLAMEL.java', 'Axel_ALLAIN.java', 'Enzo_SOLDI.java', 'Remi_CAZOULAT.java', 'Andrea-Karol-JAKUBOWSKI.java', 'Louis_LIEUTAUD.java', 'Lena_ARHUIS.java', 'EOuann_AUBRY__Mathias_SALDANHA.java', 'Nouhou_OUSSENI__Emmanuel_LE_PANNERER.java', 'Yoan_PETTORELLI.java', 'Thomas_DERRIEN.java', 'Ariane_NICOLAS.java', 'Yasmine_TELLACHE.java', 'Florian_EPAIN.java', 'Lea_AUBRY__Iska_LE_MENN.java', 'Naoufel_GIRARD_MA1-2.java', 'Ryan_BORCHANI__Ael_COIC.java', 'Tom_CHAUVEAU.java', 'Yann_BALLANGER__Abel_LOCOCGUEN_MA1-2.java', 'Theo_LE_GOC.java', 'Gabriel_STIERER__Romain_SINIC_MA1-2.java', 'Bouchra_BOUSSIF__Youssouf_DIAKITE.java', 'Divi_SINQUIN__Alexane_FAISANT.java', 'Amelie_BREJOT__Clemence_BOUVIER_MA1-2.java', 'Mathurin_GESNY_

In [5]:
len(student_notes)
len(student_notes[0])

14785

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [8]:
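# One TF-IDF row per document, converted to a dense array.
vectorize = lambda Text: TfidfVectorizer().fit_transform(Text).toarray()
# 2x2 cosine-similarity matrix; entry [0][1] is the score between doc1 and doc2.
similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2])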

vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))
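To see what these two helpers compute, here is a minimal sketch on three made-up snippets. The strings and the `toy_*` names are assumptions for illustration only; they are not taken from the students' files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy documents, not the real submissions.
toy_docs = [
    "int total = 0; for (int i = 0; i < n; i++) total += i;",
    "int total = 0; for (int i = 0; i < n; i++) total += i;",  # exact copy
    "String name = scanner.nextLine(); System.out.println(name);",
]

# One TF-IDF row per document (kept sparse here).
toy_vectors = TfidfVectorizer().fit_transform(toy_docs)

# Full 3x3 matrix: entry [i, j] is the cosine similarity of documents i and j.
toy_similarity = cosine_similarity(toy_vectors)
```

The two identical snippets score 1 (up to rounding), and the third one scores 0 against them because it shares no token with them. Note that the default `TfidfVectorizer` tokenizer lowercases the text and ignores one-character tokens such as `i` or `n`, which matters for source code.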

In [9]:
def check_plagiarism(s_vectors):
    """Return a set of (file_a, file_b, score) for every pair of distinct files."""
    plagiarism_results = set()
    for index_a, (student_a, text_vector_a) in enumerate(s_vectors):
        for index_b, (student_b, text_vector_b) in enumerate(s_vectors):
            if index_a == index_b:
                continue  # never compare a file with itself
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            # Sort the names so (a, b) and (b, a) collapse to one entry in the set.
            student_pair = sorted((student_a, student_b))
            score = (student_pair[0], student_pair[1], sim_score)
            plagiarism_results.add(score)
    return plagiarism_results
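`check_plagiarism` calls `cosine_similarity` once per ordered pair and relies on the set to drop duplicates. As a possible alternative, `cosine_similarity` can compute the whole pairwise matrix in a single call, and `itertools.combinations` then visits each unordered pair once. This is only a sketch: `pairwise_scores` and the `demo_*` values are hypothetical names introduced here, not part of the original notebook.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_scores(names, vectors):
    """Return a list of (name_a, name_b, score), one per unordered pair of files."""
    matrix = cosine_similarity(vectors)  # shape: (n_files, n_files)
    results = []
    for i, j in combinations(range(len(names)), 2):
        first, second = sorted((names[i], names[j]))
        results.append((first, second, float(matrix[i, j])))
    return results


# Hypothetical toy input: three files described by 2-dimensional vectors.
demo_names = ["b.java", "a.java", "c.java"]
demo_vectors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
demo_scores = pairwise_scores(demo_names, demo_vectors)
```

With the notebook's data, the call would presumably be `pairwise_scores(student_files, vectors)`; the result can be sorted and filtered the same way as the set returned by `check_plagiarism`.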

Now let's sort the pairs by increasing similarity score:

In [19]:
def sort_plagiarism(s_vectors):
    data_plagiarism = check_plagiarism(s_vectors)
    # Reversing each (n1, n2, score) tuple sorts by score first, then by names to break ties.
    sorted_data = sorted(data_plagiarism, key=lambda n1n2score: n1n2score[::-1])
    return sorted_data

Then keep only the pairs whose similarity reaches a threshold:

In [22]:
def filter_plagiarism(sorted_data, threshold=0.70):
    # Each item is (n1, n2, score); the score is the last element.
    return [n1n2score for n1n2score in sorted_data if n1n2score[-1] >= threshold]

In [31]:
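for n1, n2, score in filter_plagiarism(sort_plagiarism(s_vectors)):
    # Only the first 5 characters of each file name are shown,
    # so two different files can print the same label (e.g. "Flori").
    name1 = n1.replace(f'.{extension}', '')[:5]
    name2 = n2.replace(f'.{extension}', '')[:5]
    print(f"Files {name1} and {name2} have similarity = {score:.2%}")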

Files Adéla and Cleme have similarity = 70.02%
Files Adéla and Enzo_ have similarity = 70.02%
Files Camil and Tom_C have similarity = 70.04%
Files Bouch and Nicol have similarity = 70.13%
Files Bouch and Lea_A have similarity = 70.22%
Files Louis and Thoma have similarity = 70.33%
Files Thoma and Yvan_ have similarity = 70.33%
Files Adéla and Bouch have similarity = 70.34%
Files Ameli and Ewen_ have similarity = 70.46%
Files Dimit and EOuan have similarity = 70.55%
Files Julie and Lena_ have similarity = 70.65%
Files Gabri and Ryan_ have similarity = 70.67%
Files Camil and Gabri have similarity = 70.74%
Files Thoma and Tom_C have similarity = 70.76%
Files Nicol and Ryan_ have similarity = 71.04%
Files Astri and Nicol have similarity = 71.05%
Files Adéla and Dimit have similarity = 71.13%
Files Astri and Divi_ have similarity = 71.19%
Files Naouf and Yann_ have similarity = 71.28%
Files Flori and Flori have similarity = 71.34%
Files Andre and Astri have similarity = 71.39%
Files Andre a

It's already pretty good!