{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1b079b26-ef7d-44cc-8a14-eee8963538cb",
   "metadata": {},
   "source": [
    "I was studying German the other day and stumbled upon a typo that led me to an interesting observation about these two words:\n",
    "\n",
    "- [anschließen](https://en.wiktionary.org/wiki/anschlie%C3%9Fen#German) (to connect)\n",
    "- [anschließend](https://en.wiktionary.org/wiki/anschlie%C3%9Fend#German) (following, afterwards)\n",
    "\n",
    "They are very \"similar\", and I would like them to be connected in [wilhelmlang.com](https://wilhelmlang.com/), a platform that helps language learners study multiple languages via a knowledge graph.\n",
    "\n",
    "We define the similarity of two words in this context as follows:\n",
    "\n",
    "___Two words are similar either structurally or semantically___.\n",
    "\n",
    "For example:\n",
    "\n",
    "- __anschließen__ and __anschließend__ are structurally similar because they differ by just one character (the trailing __d__).\n",
    "- __anschließend__ and [__nachher__](https://en.wiktionary.org/wiki/nachher#German) are semantically similar because both mean __afterwards__ as adverbs.\n",
    "- Some pairs are both. For instance, [__das Theater__](https://en.wiktionary.org/wiki/Theater#German) (the theater) and [__das Theaterstück__](https://en.wiktionary.org/wiki/Theaterst%C3%BCck#German) (the drama) are similar both semantically and structurally."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24027445-317b-4124-96a9-3da581eaced6",
   "metadata": {},
   "source": [
    "### Levenshtein Distance\n",
    "\n",
    "The first idea was to calculate the similarity between two words.\n",
    "\n",
    "The closest existing measure is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (also popularly called the _edit distance_).\n",
    "\n",
    "> In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d9baf248-9a8b-41c7-94ad-a45e23561e85",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import nltk\n",
    "nltk.edit_distance(\"anschließen\", \"anschließend\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cf5be91-9a2f-4ae7-bc6f-ce7384aa15ea",
   "metadata": {},
   "source": [
    "The code above returns 1 because the two words differ only by the trailing __d__. Levenshtein distance is good at spotting the __anschließen-anschließend__ case."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c538194-f6a1-4ff5-80aa-83f6cffa1426",
   "metadata": {},
   "source": [
    "Note that Levenshtein distance must be combined with stemming to eliminate false positives. For example, the German words \"Bank\" and \"Sahne\" (cream) have a fairly small edit distance (3). To distinguish \"Bank-Sahne\" from \"anschließen-anschließend\", notice that the former pair has different word stems while the latter pair shares the same stem:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "43baa9b4-6f74-466e-b08a-5eecdeaf8917",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bank stem: bank\n",
      "Sahne stem: sahn\n",
      "anschließen stem: anschliess\n",
      "anschließend stem: anschliess\n"
     ]
    }
   ],
   "source": [
    "from nltk.stem.snowball import GermanStemmer\n",
    "stemmer = GermanStemmer()\n",
    "\n",
    "words = [\"Bank\", \"Sahne\", \"anschließen\", \"anschließend\"]\n",
    "\n",
    "for word in words:\n",
    "    print(\"{original} stem: {stemmed}\".format(original=word, stemmed=stemmer.stem(word)))"
   ]
  },
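  {
   "cell_type": "markdown",
   "id": "7c2e5a90-3d41-4b8e-9f16-2a6d0c4e81b3",
   "metadata": {},
   "source": [
    "The combination described above can be sketched as a single check. This is only a minimal sketch: the helper name `structurally_similar` and the threshold `max_distance=3` are assumptions made for illustration, not a setting taken from wilhelmlang.com. It treats two words as structurally similar when their edit distance is small and their Snowball stems are identical:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4b8f1d2-6a37-4c59-8d02-5f9a1b7c3e64",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk.stem.snowball import GermanStemmer\n",
    "\n",
    "stemmer = GermanStemmer()\n",
    "\n",
    "# Hypothetical helper: the name and the default threshold are illustrative assumptions.\n",
    "def structurally_similar(a, b, max_distance=3):\n",
    "    # Require both a small edit distance and an identical stem.\n",
    "    close = nltk.edit_distance(a, b) <= max_distance\n",
    "    same_stem = stemmer.stem(a) == stemmer.stem(b)\n",
    "    return close and same_stem\n",
    "\n",
    "for pair in [(\"anschließen\", \"anschließend\"), (\"Bank\", \"Sahne\")]:\n",
    "    print(pair, structurally_similar(*pair))"
   ]
  },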
  {
   "cell_type": "markdown",
   "id": "ae6ec13a-20a5-4f6d-88c9-82dffdd5486e",
   "metadata": {},
   "source": [
    "The __anschließend-nachher__ case, however, does not work with the approach above: the two words are semantically similar but share neither spelling nor stem. We need a different metric."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a795a63a-2694-4976-89dd-d72903bc77d7",
   "metadata": {},
   "source": [
    "### Cosine Similarity"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}