{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Predstavitev besedil s terkami in ocenjevanje podobnosti besedil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iz spletnega portala kopiramo nekaj člankov, in besedilo vsakega od teh shranimo v svojo datoteko v direktoriju `text-data`. Taki zbirki besedil pravimo korpus. Korpus shranimo v slovar, besedila pretvorimo v množico terk ter podobnost med besedili ocenimo z mero po Jaccardu." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import glob\n", "import os.path\n", "from collections import Counter\n", "from itertools import combinations" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "corpus = {}\n", "for file_name in glob.glob(\"text-data/*\"):\n", " name = os.path.splitext(os.path.basename(file_name))[0]\n", " text = \" \".join([line.strip() for line in open(file_name).readlines()])\n", " text = text.lower()\n", " corpus[name] = text" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['potop', 'snaga', 'opel', 'audi', 'očistimo'])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.keys()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'odpadno embalažo in odpadne sveče bi lahko na podlagi interventnega zakona, ki je od petka v javni o'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus['potop'][:100]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def kmers(s, k=3):\n", " \"\"\"Generates k-mers for an input string.\"\"\"\n", " for i in range(len(s)-k+1):\n", " yield s[i:i+k]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{' mo',\n", " ' ne',\n", " 'a m',\n", " 'a n',\n", " 'dra',\n", " 'dri',\n", " 'eba',\n", " 'ina',\n", " 'mod',\n", " 'na ',\n", " 'neb',\n", " 'odr',\n", " 'ra ',\n", " 'rin'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(kmers(\"modra modrina neba\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dela! Spodaj iz slovarja besedil sestavimo slovar množice terk za vsako od slovarjev" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "data = {k: set(kmers(corpus[k], 3)) for k in corpus}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def jaccard(data, a, b):\n", " \"\"\"Jaccard distance between two sets.\"\"\"\n", " return len(data[a] & data[b]) / len(data[a] | data[b])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.75" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = {'a': {1,2,3,4}, 'b': {1, 3, 4}}\n", "jaccard(test, 'a', 'b')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tudi podobnost po Jaccardu dela (za zgornja dva niza je podobnost enaka 3/4). Spodaj zmerimo podobnost za naše članke iz spletnega portala in razmislimo, ali je izračun smiselen." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.28112449799196787, 'opel', 'očistimo'),\n", " (0.28206896551724137, 'snaga', 'opel'),\n", " (0.30111306385471587, 'potop', 'opel'),\n", " (0.3038309114927345, 'snaga', 'audi'),\n", " (0.310999441652708, 'audi', 'očistimo'),\n", " (0.32602423542989034, 'snaga', 'očistimo'),\n", " (0.326519023282226, 'potop', 'audi'),\n", " (0.37184115523465705, 'potop', 'očistimo'),\n", " (0.38333333333333336, 'opel', 'audi'),\n", " (0.4154798761609907, 'potop', 'snaga')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted([(jaccard(data, a, b), a, b) \n", " for a, b in combinations(data.keys(), 2)])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }