{ "cells": [ { "cell_type": "markdown", "id": "1b079b26-ef7d-44cc-8a14-eee8963538cb", "metadata": {}, "source": [ "I was studying German the other day and stumbled upon a typo that leads me an interesting observation on these two words:\n", "\n", "- [anschließen](https://en.wiktionary.org/wiki/anschlie%C3%9Fen#German) (to connect)\n", "- [anschließend](https://en.wiktionary.org/wiki/anschlie%C3%9Fend#German) (following, afterwards)\n", "\n", "They are very \"similar\" and I would like them to be connected in [wilhelmlang.com](https://wilhelmlang.com/), a platform that helps language learner learn multi-languages via knowledge graph.\n", "\n", "We define the similarity of two words in this context as follows:\n", "\n", "___Two words are similar either structurally semantically___.\n", "\n", "For example:\n", "\n", "- __anschließen__ and __anschließend__ are structually similar because they differ by just one character (trailing __d__).\n", "- __anschließend__ and [__nachher__](https://en.wiktionary.org/wiki/nachher#German), are semantically similar because they both mean __afterwards__ as adverb\n", "- Some can possess both. For instance, [__das Theater__](https://en.wiktionary.org/wiki/Theater#German) (the theater) and [__das Theaterstück__](https://en.wiktionary.org/wiki/Theaterst%C3%BCck#German) (the drama) are similar both semantically and structurally" ] }, { "cell_type": "markdown", "id": "24027445-317b-4124-96a9-3da581eaced6", "metadata": {}, "source": [ "### Lavenshtien's Distance\n", "\n", "The first idea was to calculating the similarity between two words\n", "\n", "The closest would be like the [Levenstein's distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (also popularly called the _edit distance_).\n", "\n", "> In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.\n", "\n", "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other." ] }, { "cell_type": "code", "execution_count": 1, "id": "d9baf248-9a8b-41c7-94ad-a45e23561e85", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.edit_distance(\"anschließen\", \"anschließend\")" ] }, { "cell_type": "markdown", "id": "6cf5be91-9a2f-4ae7-bc6f-ce7384aa15ea", "metadata": {}, "source": [ "The code above would return 1, as only one letter is different between the two words. Lavenshtien's distance is good for spotting the __anschließen-anschließend__ case" ] }, { "cell_type": "markdown", "id": "1c538194-f6a1-4ff5-80aa-83f6cffa1426", "metadata": {}, "source": [ "It should be noted that Lavenshtien's distance must be combined with stemming to eliminate false-positive. For example, the German words \"Bank\" and \"Sahne\" (cream) has a pretty small edit distance (3). To distinguish \"Bank-Sahne\" with \"anschließen-anschließend\", notice that the former share different word stem while the latter share the same:" ] }, { "cell_type": "code", "execution_count": 2, "id": "43baa9b4-6f74-466e-b08a-5eecdeaf8917", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bank stem: bank\n", "Sahne stem: sahn\n", "anschließen stem: anschliess\n", "anschließend stem: anschliess\n" ] } ], "source": [ "from nltk.stem.snowball import GermanStemmer\n", "stemmer = GermanStemmer()\n", "\n", "words = [\"Bank\", \"Sahne\", \"anschließen\", \"anschließend\"]\n", "\n", "for word in words:\n", " print(\"{original} stem: {stemmed}\".format(original=word, stemmed=stemmer.stem(word)))" ] }, { "cell_type": "markdown", "id": "ae6ec13a-20a5-4f6d-88c9-82dffdd5486e", "metadata": {}, "source": [ "The __anschließend-nachher__, however, won't work well with the approach above. We need a different metric approach." ] }, { "cell_type": "markdown", "id": "a795a63a-2694-4976-89dd-d72903bc77d7", "metadata": {}, "source": [ "### Cosin Similarity" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }