{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# “Word Sense Disambiguation” mit Hilfe von Text Mining Methoden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die Mehrdeutigkeit von Suchbegriffen kann die Korpusbildung erschweren, da wichtige Suchbegriffe auch zu einer großen Anzahl von irrelevanten Dokumenten für das eigene Forschungsprojekt führen können. Das deutsche Wort “Krebs” zum Beispiel, beschreibt eine Krankheit, ist aber auch ein Tier, ein Sternzeichen sowie ein weit verbreiteter Nachname. Um die Bedeutung von Krebs zu bestimmen, braucht es Kontext. Mit Text Mining Methoden kann der Kontext von Suchwörtern mitberücksichtigt und somit die Unterscheidung zwischen relevanten und irrelevanten Dokumenten erleichtert werden. Nachfolgendes Notebook zeigt, wie die Methoden des Topic Modeling (Gensim Library) in Kombination mit der Jensen-Shannon (JS) Distanz genutzt werden können, um semantisch ähnlich Inhalte zu automatisch zu erkennen und zu gruppieren. \n", "\n", "Word Sense Disambiguation (WSD) aus dem Bereich Natural Language Processing (NLP) umfasst Methoden, die polysemische Suchbegriffe disambiguieren sollen. WSD kann als Aufgabe beschrieben werden, die richtige Bedeutung mit einem Wort in einem gegebenen Kontext zu assoziieren\" (Pasini und Navigli, 2020). WSD-Techniken können wissensbasiert (z. B. auf der Grundlage von Wörterbüchern), überwacht (mit Hilfe von maschinellem Lernen aus manuell annotierten Daten) oder unüberwacht sein. Unüberwachte WSD-Methoden gehen davon aus, dass ähnliche Bedeutungen in ähnlichen Kontexten auftreten (Pal und Saha, 2015; Navigli 2009)" ] }, { "attachments": { "grafik.png": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAA50AAABECAYAAAAC9VEwAAAgAElEQVR4nO2dfWwUZ4Lm69QSf8RSIs2I0aA5abasHqM1B3sOKAgphGTRmUSGUTaTA8JKoKIX7O1AgKQxXkgcPhK2h/NsghGEHM2SjAePYQQmZ0gIo5ZvcYDLgXMGEgwMM6RhWIJtMk6GZrB7lef+qKruqur66nZ1V2Oen/RISbu66u2qruL99fslxONxEEJIKRGPxxmGYRiGYZhREiEep3QSQkoLvx+MDMMwDMMwjHehdBJCSg6/H4wMwzAMwzCMd6F0EkJKDr8fjAzDMAzDMIx3oXQSQkoOvx+MDMMwDMMwjHehdBJCSg6/H4wMwzAMwzCMd8lLOr/99ev4978T8s63v37d+1oqIWTU4PeDkWEYhmEYhvEueUtn36q/wdD5zpzTt+pvKJ2EEFv8fjAyDMMwDMMw3iVv6Rx4dUZelcmBV2dQOn2gpUaAIAgQhBq0+F2YB5Ce+nKX574FNYKA8vqeIpTKHclkEo2Njejq6nLcNhaLIRaLjfiYfj8YGYZhGIZhGO8yYulMJT7HwKszHDN0Qa6wOkpnTz3KBVWQzFOICnlLjQChvB657jkjc4Utn8vSmApL9mcbodi01Jhel5r71mZ7UF9u8l3L4/sgk31+85dO/yU0kUggHA5DkiRb8YzFYpAkCY2NjUgmkyM6pt8PRoZhGIZhGMa7jFg6h3tPuBrH+ZdTBwHk3tLpvrI+MnKXTlkGBKNptdSUmHS6fc39/s2uR099+f0vnboPoHzWvMRzdEkn4CyeXgonQOlkGIZhGIYZTaF05kmplisbr6TFTMxGCxafraUGglAOL3zvfu5eq2Ilnl4LJ0DpZBiGYRiGGU2hdOZJvt1xi483ElOq18EbKJ1uMYpnIYQToHQyDMMwDMOMptzH0mkyDi+rFU6uwNe0qPuxHndpLpE2x1DGNTo1/LXUKO8xjlW1EFbjGFHT/WeNqVTFSC8sxs+c+dxWYqN2nzVuL5+H3ETIxfXpqUe5UI76nsy22vI5Xzerz2EmkdnlybzPXDqzvntWEqpc28zb3Xevzb7ezt1re+rLle+P4XqZf1kM17QGLSOUaa14FkI4AUonwzAMwzDMaMr9KZ2qwDmOwVMq7OXlejFRpE37mtVkO1mvaY6ZFgabFk95v+Uod9y3Ij7a10zEVpUw/Ws1ptKpOwdOr5md05561NT3mEiVA26vjyKd5eVGAXJ73dxKp4lUqp/N5u9Zn9lT6TS53hpBdJRO49hapQxm19TsHI60Bbe5uTktnR9//HH+O7LA7wcjwzAMwzAM411GLp0XT+rksm/5XyP1h/+H22/M1kvn/2kH4I10plsPszc2rfybbWvcr1E6XXef1bY6mmwvi6lJS62xrBZCoyuHo/zlK50O4zVzlE7X18dMijTlc75ubqUz03Jq8QHNZ6912902H+lsqbH9XjhLp8N3BdbXwer9btF2qXUzq20++P1gZBiGYRiGYbyLpy2dX295Ht8N3U1v9+e29fj3v/tPHrd02glEDrJhEAh9hT2PcXVp+XQpyIZj2AuCvE/ncYF5SqeTVOYknTlcH8v9ur1uObZ0WoqWuXTLPxho3uOhdLr9Xth3rzUWw42Q23wOFxjHcLpdTiVX/H4wMgzDMAzDMN7FG+n872OQ/HC76bb3Tnfg5vwy76TTVoA8ks5cu5NmdprVQudOLixa2rTj8IxltDl+ztLpKCE5SHgu1yc9ptP8eN5Jp/pWq7GhVi29htc9k067lmWPpNPuOuQpnVaTBhVCPP1+MDIMwzAMwzDeZcTSmfrjRQz/7v/abp+63ovhy58CKLWWTqvut
U7dMd2X10k6Mz7jvByJlWwY95lfS6e9hLifrTfXls58pFMvb26lU78PhzGf6U1dlLcUWzo9lk6nWWq9Fk+/H4wMwzAMwzCMdxmxdObKyKXTRihyGNNpP4Yz/zUpjUJgOabTUPF3taRGXi2Sbl5z0ZJpOf4ya0P318dBOp2vm8WxTCcyMm5S7uJ6u+sOnD25k7N0Wl5vF5Mluete6yDSOUhnb2+vq1lqVfEMh8Mjns3W7wcjwzAMwzAM413yls5bS/8Kf25bn3NuLf2rws5ea/aaixlgs1ryrI6h/H9LjUkrkslss+kunSYzozrPvivvM3t/emHwZPZakxl99TO8ZraxEhnjBEGO18dJOl1ct+zXNF2VtS3eZsunOLR0Zp9rdd/Z3b1zlU7z653D7LWO0gnza2oye62biYW6u7tdiWQikUAikXDczgm/H4wMwzAMwzCMd8lLOu/G92Dg1Rl55258j+tjWbcAWq0pmb1NTYtxzKTzzJ9Wx9B1h80ae2mx35oWx3VClU+bPbbTUi7MthmBdMo71q8laioiVuNPjdfIxfVx7F7rfN0A47XQrPtp6GZtPTOtxWcy7U5s2Fd5PXryXqfTZA1NzyYSSp+c7M9t2so+siVUvMbvByPDMAzDMAzjXfKSzvuH/MdmelYCF2M1iRH/r9uoxmq5lhLC7wcjwzAMwzAM410onYUuAaUzD/y/bqMZ95NC+YffD0aGYRiGYRjGu1A6C10CSmce+H/dRgM99eUWkx6VVldaM/x+MDIMwzAMwzDehdJZ6BJQOvPA/+s2KjCO5zQde1ua+P1gZBiGYRiGYbzLKJdOQsj9iN8PRoZhGIZhGMa7UDoJISWH3w9GhmEYhmEYxrtQOgkhJYffD0aGYRiGYRjGu1A6CSElh98PRoZhGIZhGMa7UDoJISWH3w9GhmEYhmEYxrtQOgkhJQefS4QQQgghowdKJyGEEEIIIYSQgkHpJIQQQgghhBBSMCidhBBCCCGEEEIKBqWTEEIIIYQQQkjBoHQSQgghhBBCCCkYlE5CCCGEEEIIIQWD0kkIIYQQQgghpGBQOgkhhBBCCCGEFAxKJyGEEEIIIYSQgkHpJA8Ee/fuhSRJCIfD6OrqKsIRhTxCSCmQwpWPdmLr1q3p7PzoClJ+F4uQkieFO7f7cO38CcTjJ3D+Wh/6vrnnd6EecO7h1vk4Wncqz7M9B3DmD3f4PCPEByidJUZ/f7/fRTAlkUigubkZ0WjUVYojdtZlvXjxYjqtra2QJAmRSAThcBiSJOGTTz7RbeP9ec9HOimjpBS4itbVEiRJk9WtuOp3sYhn3PvmMs7E44jH44ifuYzS9aJ7+ObyGbmc8TjOXP4GJVnUewl0vfcmVtWG9PeNkiXhf8L/OPAZBmg6RSV1oxNvvWx2TUJYvvEALpXkl4mQ0cuIpTOZTOoq704pVanym/b2dt1D8dChQ67epxWsQtLY2Gj6j6lV2tvbC1qefMqZTCbR3d1t+fdYLOZhSbyUToooKSYPqnS2Y7EoQlxczGdXcY+ZGvgMrRuXI2R8/oWWY2NrKUlRCgOftWLj8mxhCC3fiNbPBkqmpWrw7HtYu8Tdv4uh5RtxYNSZjh/3jTOpG0exOexwPRpacaVUvkgq7YshiiJK7HQS4gkjks5kMpmzjITDYSQSCQ8/wv2P2vUzEokgGo2mW+M+/vhj2/d1dXXpzmshUY9TyvT29pqeR63INzc3614zbusdhZZOSqgV2h9irJ41yWSyyKW6n/BGOtsXixBFfSomTkH1vNfQ+vk3GC5E0UfEKJfOweNocqiEh5uOY7DwJXFk8HgTwrZ1iTCajvtf0sFPt+PlkPv6jyz4DYiddVf2u5c+QFSqRlVlULmHgqismo45/xjFB+l76AKiM0SI4mKo3yL53sv8Pxxeh3Y/M6K4kNNZKD3pzBLO0Mv4eesRxONxHGiOYImmxbPx0E3znSjyl/0Mm4750U7cTD/AlM+f83mzgNKZPy01EAQBNS25v7WnvhyCUI76Hu+LRTKMSDrVSn40GkV7e7ttYrFYSYpne3u7r11BAaTlR60IJ5NJhMNhRCIRy/cYhbPQ59Monbm0cBfrWqvfR7WVNRqNum6RVbf1DvfCePUqpXOkJJNJHDp0KOuHBvXe+uSTT9LbxmIx9Pb2+ljaUsdL6ZyKn9bVoa6uDnV181BdVYmgUnmetuIoBgpR/LwZzdJ5FfsazLt+6hPCy6++jhVKy12odg2aO28Ut1Xx6j40uBG50Mt49fUVikCEULumGZ03iljSm4fQqJQz9PJGbH1jDWqtyh2qxZo3dmLreuX5FGqElevIDOPcjmcxSRQhihWYWD1PvocWzcH0iRWKAM1A9AIAXMb2n03BlCmv4CPl3Q+ydGYJ55K12KdrzkzhSmtDprXf6tmmyN/Un6rPL/n8T1N+AJi0Iq5I/0d4ZcoUTPnZdlz24gOMFulUBFAQyjD34JDJBlewYaIgb1NeD09cj9JZ8nginW66Uqrblpp4qtLsF/39/aZlsJOgYgsnoJfO/v5+08q9XZxabb3gfpXOaFRAU5OAP/2p9KSzp6cH+/btc5Uvv/yyYOWwQ9vjIhwOIxaLpX/s0rZsx2Kx9I9fxZLO/v5+21bV/v7+Ehxy4KV0Zldw715qRe20IEQxiJnRcyXU4jmKpfPMdtTl0hpnEFHL1qCCFLUuz3K6kTmvuINjmzMS/8bROwDk7svxI3GcudyHvr4+9F07jxNH4vhM7bd84q207ITeOGrZqjzYUYtJoojgo7XYfzP7Drl77Riizz+PJgs7fGClM/U5YivthFPdrhNb0t+bJhw325eV/A28j7lBEeLktThVgI8w+qRTQGD6Ntwy/v3EixgnBBAIUDofJHyTTrWC6Dd+S6dahnA4nK6cJhIJy5ZOP4RTLaMqZR9//HH6vDm1cKtjVe1abb3C+H1MJBLo7e21jVrh91s6JUlAOCzg3/5NwHfflY507tu3z3WF7+TJkwUrhxVa4YzFYqaC19/fnzUMoFjS2dvbi0gkYnm83t5ehMNhdHd3F6U87iisdAIABjtQO0mEGHwBv0rXvIdx8/BrmDNNbQ2twMTqMFov3VX+fgM7ZosQJ6yA7l+t+ApMEEXMfV9bhT+HTY+LEGc24bKmMt19qRXh6omoEEWIFRNRHdG2tppVnu+ie5eEarV1KViJafOj6NSKwI3/hZeqp2BihajpAlkNaVc37gK6fV1qDaf3FaychjmRxXi6CBX2q62rTe7Z1dh2RJlM6Mg2rLa7v5fH8HlBS5guafZ3T5IgSWFsPnoD9wY6HbsIL48VoaQ3D2Cd5piqdDpyvElT1pV4z3Qqhm40ThUhBmdaSqUerSyqXW2Nke/D3KTTSkKNr6v3zX7c7IxivnL/BiunYX60q6i9Ge4c25xpwQw1oPVKCqkrnThiHAM8EunEKaydrD5bgOznhs352GH4ke3uJbSGq5VnRxCV0+Ygsvjp7OMO/x77I5lWVrFiIqrDrUg/GjVluHupFeGnKhFMt4T7hCKA0598EgFhHF48of3j13jnyQAC
VQswt5zS+SDhq3R6W8nPj1KQTlXMwuFw1lhEbddfv4QT0EunWl63kwUV61rn8n00UgrSqWbzZgFffUXpdIP6XXSaBErbvb/Y0qke02xyMO3fm5ubfRprmsKNU63YmV4iJYo1dYbrW7cG0fTfd6L1lHOXS/sKLpDYKleuVFm88PYsVIgiKp6QEN3dhrbmCJ55NAgxOBPRc3JV7dymxyGKM9Gk6cd2au1kWeJe+FWm5WjwfcwVRUxeewrpSnKwAhXBICqfmoe6ukXyvsUgZu9Qn6PGyuMgOsKPIigG8egzETS3tWF3VMITFSLESYvRrtamLzShZtocLHolit1tbWhra8ZrP5uGoG7fwzgXnalUPp/CPENXvcJKZwqdW8zuWW1l+ziabO9vi4q550XVyoBeONXvW8pJPJsKX9L+Q9ofsVYi9rm7br2pgSN4Q9MFt2Hf9eyNzm3C46KI4OJ2l70AtBI4iPMftmHN0yJE8WmsaWtDW1sb2tpOIIECS2cwiKBYgcfmLEJd3Tw8VWm8vwrPibcyrc/rDtxE6kqr0lU7jPUdV9PbDR59w3X3Wv1vUNdwbNNsTBAnobZDfdpYSGdFBSo05+OJChGiOAkr4spVHT6H6Ez5HMnPpDosmjMNlUFRf1x1u4onMO+1ZrS1taE58gweDYoIzmzSX4NJj8qvV1ZhypSf4u0SkM6alhN4cZyAwJPv4Gv1b1c2YKJQhrkHT6NeJ50J/Hp+JX7w8JhMK+lD30PlM004mf6nsSf9ntPXP0T9jB9ijCqaJtKZPLkK4wMCAuPrcVrp5Tt0/UPUz/gRHgoIEIQxeLjyGSyYPtYgnUO4vi+MyT96CAFB3e55vHteLUgCm6sECGUL0KH93B0LUCYImL7ta82Lp7HqxwKE8evwhab8J8+/i+crH8YYQYAw5mFULjqY3SKcI99++y0OHz5su00qlcJvfvMb3L1713a7QuCJdKqTsdjFasIhvykF6QTk1sNIJJI+n9pzpIqntgthsbsmUzpzJX/plCQBS5YIOHw4l1bPwlDq0qkug2Mna0bh9Es6JUlCY2Oj7t41/t2uVbRw9KNjvZvxfpmE1nfAqVOwk3Ti1FpMFkVMWBEHBn+FF4IigrN3QPdkG9iPhRNEiE9vlV/PatWUWx1mzJgBMbgY7WotPb4CE8QgajuAjHTOxNrOm5mK/PBRLJtg02LR3YipurFbytvObcIMUcSEZUdtpEBpaZ29AzcAILEVT4sigjOjOKd7U2G7JqYGTmJnZIkLkXSQzhffwWcFHS6ZwsDJnYhkzQKrF87M57IWzxff+azgY1B1XYAbDzneC1o+jy23F2RFdmY2uR0hmC2H9t1rHZKvdM5Yre8BoN67xp4JHpO6cRRN/7QLZ1LA8SZNK/IlQwu/0vU6dWa7bpKqdQdym0hIFP8az++4rOnFYCGdk15A7JKmUq8+A+SHkvKjm8kQA4PsJrY+rfvhTWXg/bkIihOwIq45ZsUsNGqfcX6iEcArm6sQSAvdEDoWPKKIZo9BOnuw7r/+CJNnLUT9v+zCrl27sDE8Df85ICBQtRlXlG3qywUIZd/H98fIsvaDH3wPL7QhSzqHTtdnCSdutaDmEVkifzh5Fp577jk8oYqfRjp71v8XjBEEjBGfkcuycREmfT8AITAe9crOTq/6MQRhPNZ9kfnYnUvHyrKsleyvt2G6IGDs0s5M+QNjMCYQwEM/eQLPPTdL3rcQQNXmKyM67du3b4ckSfjlL39p+vdUKoUtW7ZAkiS8/fbbIzpWPnginSOJ35SKdBrRtmqq4plMJhGNRn0ZC0vpzJWRSaeaDRsEXL9O6TQjkUhAkuxbObu6ukx/BCv25FbahMNhHDt2zPLvkuR+ySTPGPwU203XszMRzpe341MXk286SueFKGYolbXh/QshiiIW7s+uLsVXTIAozsaOGwCG92OhKqrpfcxANL4Ds8UgFivWeblpJkRxLmQ3tao8X0bTTLNugnKJ5VbVx7HpnLFE2d18h7/5HB9EX8G86imYkp4sKbNvuTxqBVF3lgonnY6z1ebS0ikpy6qcLcjstuaz1ZoLp4qdeIaWb0Sryxli8yEjNxbiaINOhEze++Xbs0y7dWYLo9p9Mlfp1E7upc3f428njKR7bfZ3WL53C9fNMzNp0Gq0Xs1u6UTqClo1E2itfO8i8HkMy9Xvid2SKWYTCdXNS3ePn7S4Xek6bNW91ng+OlAbVF9Xnj1mQq6TTmW7OW/jiz/+EX/U5n+vw+OiiBnRCzbH9BGtAA4dxNwyAWVzD2Lo63fwZCCAJ9/5GsiSTnNkuavC5gQy7wl8H//tF+eh+7lZc8yhi9vw5CMCAuNXaVpJh3BwbhkE4RHUtOjbFHXda79+B0/qRFfh1m5UlwkQJm6QX89q1ezE0rECysvLIQRq0KKKbscClAkBPNumLf94LP3wOtJTLCnnSG4NzZ9vv/0Wq1fLz5g9e/bgu+++S/9taGgoXddduXIlbt++PYIj5ccDI53GdTDdxO8ZLlXxjEajiEQivs6yS+nMFW+kU5IE/MM/CDh4UMB//AelU8tIrnexsHtGNjc3264ba2wVLTguxNOtcALuWzonrz2FC9EZFoIH5W+qsCnCp7ROJrY+rVR+E9j6tNr6OIz9C7UtmNYTpLQvtq48y+VfCBMPVv4mi/DwuShmBrWzjL6C6O63sWhyZt/y9qoE6/ZUoMpiCp9uc5qQJ0fplCRIUggNrVe8bUlMfYptxu7ckgTpn38LpxUt7/32n20kWR7TVwh00vlKC3Jpm9C1kubQ0nlmZ0Z+/v5vJ4xAOgs1pjN7r/K9q/Y48Jg7J/CW5keHN47eMR/T+em2zCRajYfQr3ThDjXEYPu7hOWYzgHsXzgBojgBy44Ow710KudtcXtmm7nvZ/+Iozuusp1N7gvpBHBlw0QIgSosmFsOYdyLkId4mkjn0J/w2d56LHyiEj/4wfeULrDaVkgbUVWP+Qu5NTMwfimO6dyyDc8GBAhVm2H8l1UrnUO7qyEIAqp3Z8+627GgLCPAQ7tRLQgoW9Ch7gTlQjnqOzajSgigRrHOL9aNhyBMh+ymVuX/AuvGezO+9fbt21i5ciUkScKuXbvw3Xff4S9/+QvefPNNSJKEZcuW4auvvhrhUfKjoNKpdn0rBem0avGQpMxYSr9aQ+zKbDyPfoknpTNXvJPOJUsEfPABpdPI/S6d7e3tjlJa9DGeNuKZi3ACuY3pdC+dagvkXLw/KAvo1MbuzHYTViCudLmVx3MChZVORYKDM7GpWzs+Rn9M630VqLKYOo4m0+U7NuCI0zUcPIINtve63KrkXVGbMqJgFa2c6SbjcUgeE2C5QT8xUw7nwyDYddvPZG+j/BhjN6ZTvifuF+mcjLUFmOb14nsrs39gMMxeG3rrBHC1NTNJ1upWXMUgPj1wGJecftGwm0VW+Zu58OUgnQv3Z19jM+k0207/ppKXTrX1UN+F1CBgQ6dRPz4AQRAw5uFKPPHcc1hY/y94fdbYnKQzEJD3Ub7qNHTa2FOP8nQ3Vz1a6ZT/+8dYdTr7Y8l
/K4Psmcq4TqV18sqGiUq55OVgyuYexBCGsLta24JpXf6WGu8mVfrqq6/w0ksvpcVz06ZNad+5du2aB0fIj4JIZzgc1rUSqrOd+imdVkhS6XSvVde+7O/vz5o0SHvO/BBPSmeueCOdP/+5gFu32L3WDLV7bXNzs+v3hMPhos6abfaM1I7btOp+6+tstibimatwAg4V3OFTWDtVTM9eq3avnZvdFKh00dNMHhRfgQniBKxofx9ztRXaC1HMECdjbbvc1TbTupKfdMpya1ZhVkWzFh2WlUL9Ma276haosqitaOsSwsvbP7XpIjuIs7EGRwk0nQAn76KazazrkXRKDfCwqBnUJWhCtQjXhhB6+S389g93bFuA7906jdb1Yd21aDpu9g7n2WvvF+nUdY33lOvY16Cexzps6cx8o1NXDmF9WIIUbsLxQf33K/TWCZt9GrCRTvV5lb90KmO+H9+E7EeCSfdax+VZ7gPpBHDixXEQyuYis2ynXsASm6sgCAGMX3VS121WP7Oss3Q+UrMBq6oC2d1oFen8sYlN5iedatff6dj2tSyg45Rpenvqy5VJhuQutxnRLY50AsCNGzd0/lBbW4urV696tPf8KIh0mq3JaDaZB6VTJplM6tYTNFZAE4mE6RjPYkLpzJWRSeeyZQK6ujh7rRPqpFtu1rpU7yGnmW69xPiMNC7rYvx7NBot0rqdKVz5aKcyG60+ezpvIKURT1k4U7jRucd0+50fmXe3tKrgDt/sROOsCnkSDbVWrU4kZJxoxziREAAMt2NxUMTUqVMNlTG5sj556lQEdV1Z85PO9ERC6bFbyuF1EwkpyydMXo0uTbnvXnoHc7Xj45QJkKauPaWR02HcPLwMUwtRWby+Dw2W96qVeLoTTkmSsHKPd0NPru9rKKB0roSHRdVwEe+tlCCFGrDj6FG8r8jkkvBLqH/3tGaSmbv4oqURL4VNJnOq24ZPLSxVniTGep1Od9KZ3bKem3SqP65olzUC7nb/ArMnuJBO5d7NmhzME7Tdwbeg03ge732Db+4Bg8ffwvJ0i38Im4+5XNoGsJbOu93YNDMIUZwKuZNFPtKpCvlUrD2luUjDN3F42VTdcbsbp0IUg5jT/HvDD1vDuLm/Fg0ddsf0EVfLl+gFrKVGgCBUw9irNVfprGlBZsKgsmrsVr1T6Q4rVO+GseOsWfda/Qy0MnL3Ws3kQR0LUCaUYUHLNkwXxiLjlvUoF8ZiaYvc1VYez2lffq+lEwCuXr2Kuro6LFmyBBcvmq7RVFQKIp1mYyGtxlT6TSlI5969eyFJcgvI3r170xVp4yy1foonpTNX8pfOf/1XAX/+cy7C+eBKp3pPNDY22nZFVde+tXo+FQr1O2nVeql9hha3m7DdGD5lApfBsziw5wDODqZw4+hmk4le1Jgvp5E9ack8VFdl1uCc1ahfw+9Ck7ykiN2SKTJKZVg7oZCCunxKZjwnkLd0YhAdtZMgOiyZIlcKRVQ8NgeL6uqwaM5jqAhWoCKobzVq0i2PsAhzHqtQllUoRGXxDLabjZO0FE/3wilJ3rZ0plsNCyKdBWrphGbZjbomHD1zDLs2R/DSsjV43/h4MW11DqFh31WbvQ/g6IppmvVq5yn30CLMmV6lLKuhtv5nf79v/M9nERRFTHq2EbvbmhF5pgEdyFU6gcSO2Yr8PoNFdXWY91QlglnfbXXcYQWemPcamtvs7l2v0H6/18FsAlq9cEoINezLrau12URC8zJrak6LxJX7Jz/pxIUmeSx4sBJPzZOXUHqsQkRFRYX5kiliBR6bE1HO72vytTB2w73PpVOe+XUspGMaJUyeR3R6We7SCWCocynGCdrZa5XusIHp2KYd65k8iVerNMdQJxLSznoLZE8kBABDLagJCBg3bhyEsUvRmd5YXipm7LhxCKTHc9qXvxDSCQCXLl3C2bNnPd5rfhREOs1kSBUrSqd5GSKRSLrSnEwmLbsB+iWe7e3t6UoxpdMNuUvnunUCfve7XGXzwT4M2R0AAATGSURBVJZOINOLorGx0VQou7q60sJZ7PGfvb29tq2Xvb29xZ8wCIDzxDFhrG89jWt913C6db2NcDpJpz4VE6eget5raNUuJZBmGL/fH8GcaZWaynbYdFvjOM80msmJNFvnKZ0AMICurVJ6xkoxWIlp86P6pSGGb+Jw5ClFBNQyd2cfc6AL0fnKOnzp/ewvWGXx6j4niVTFMzfhVJee8LCk2NfgMHPy6vdwrq8PfX19uHZwg+tnUqjxEDwtqo5BHG+Snysrtx3HH27fwuDAN7hjHCtoIp22s6amGcbN0zG8Nq8aVep6rmIFJk6fg3+M/hqn099Bs+/3AI5GMoJUWfUKPkLu0gkMoCs6H49V2H23lfvm6cWIpNeZtL53veEOjm3OfGfCTZ0YSJ/PFAY6jcLpMGmQGWZLpgQrUTV9Pho/uOS8ZIqTdAIY6IpivvK8C1ZOw/xoJ27uN2lhHf49jkQ1zyExiMqqakjRDyCf4tEhnTjxIsYJAoQxP8TkWc/huVmT8cMxAYwZE8hLOoEhnK4fj4Ag4JGaFtwC8HXbs3hEe4wnfoKHAsZjAD3r5PfZLZkio4isdkIhBXX5FP2MtMWXzlKiYGM6tZUouxka/cZv6VTHpRnLYCdBfne1pXS6wb0wNjW5mSjIH+n88ssvcfLkSVfp6+srWDnsSCaTuu772nWDteMZ/JhwyGkioGQyWfzJggC4n61UnyXhMJZIEqRQLWrTayqaSyfxm4wU2YlnbbjWvXAWYvZawMXyLnmkgLPXpkldwb612q6zJpMKGaQzn/HRJBvjepuh5a9iy9at2PL6CvkZNRLhJCMnH+nEEK7vW4SfPJSZTOj5d8/jZD7da9Pcwu7qMnms6LoeAEO4uPt5VD48xuYYULZbhMk/eggBQV7X8+HK5/Hu+ex/r43jPNN0LsXYrImLKJ15v9lp9trGxkY0NjbabuM3vb29vs9Sq7ZqquVQRTQSiVi+x0/xpHS6wb0w3r07EtksrHTeT/T29pqOjW5ubvZ9+aPSIx/p3ILOVAqdWyRs6UwBZ9/Fi5TOEmcQZ1vfxIoluV7r7IRq16C503rdzJEX9Sxa39QLQ34JoXZNMzpvFFg4VVIDOLkzki531iooaekMYfnPjyDhNGsqcUkKV1rtW+hDDS3Os9QSQorGiKQTgOUEOG5SyksdFBPt7L7qeE5JkhxnsDSKZ7FQpTMWi+HixYuOKbZ0alu63EZtEfMOL0SS0pkviUSiSBPy3K+cwFumS2rYV+aXr1mD5SEJoeVrsGa50r0t9BZymBOS+EIKd27L3VPzym37mVm9Leod3M63nH19uH2naCXVF3ugF0f27MSxq4Y/3DmDtp2t6KJtFoAUbnQ2I5L1o8oSRJo7UazfHQgh7hixdBJv6O7uTrcKR6PRkm6ZcWrhNksuy1qMBKuxw27ibWsxpZOUMikM9J5APB4fcU70DhRPSAghxEjqDm5fPiM/k85c9u2HB0KIPZROkhddXV2uWxFjsdgD2OpE6SSEEEIIIQSgdBJSRCidhBBCCC
HkwYPSSYjvUDoJIYQQQsjohdJJSElC6SSEEEIIIaMDSich9wWUTkIIIYQQcn9C6SSEEEIIIYQQUjAonYQQQgghhBBCCsb/B7lXD/m3e+N2AAAAAElFTkSuQmCC" } }, "cell_type": "markdown", "metadata": {}, "source": [ " Laden Sie das Coverbild. Klicken Sie dafür aud die nächste Zelle mit dem Code und dann auf \"Run\", wie im Bild gelb hervorgehoben. ![grafik.png](attachment:grafik.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "display(Image(\"Bilder/cover.png\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This work has been supported by the European Union Horizon 2020 research and innovation programme under grant 770299 (NewsEye)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Schritt für Schritt...\n", "* [Importieren des manuell annotierten Korpus](#1-bullet)\n", "* [Pre-processing and die Bildung eines Trainings- und testkorpus](#2-bullet)\n", "* [Klassifizierung in relevante und irrelevante Artikel](#3-bullet)\n", "* [Anwendung des Modells auf den gesamten Korpus (400 Zeitungsausschnitte) ](#4-bullet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importieren des manuell annotierten Korpus " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Fügen Sie die CSV Datei \"export_krebs_krankheit_12_05_2020_10_52\" ein. Dafür geben Sie in der nächsten Zelle den Dateiname zwischen die zwei leeren Anführungszeichen bei \"df_all = pd.read_csv('')\" ein und fügen dem Dateinamen \".csv\" hinzu. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "import re, numpy as np, pandas as pd\n", "import csv\n", "from pprint import pprint\n", "from IPython.display import display\n", "get_ipython().magic(u'matplotlib inline')\n", "#import data\n", "df_all = pd.read_csv('')\n", "print('Tabelle 1: Annotierter Korpus mit Metadaten.')\n", "df_all.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Klicken Sie in der nachfolgenden Zeilen erneut auf \"Run\". Für das weitere Notebook werden wir uns auf den Text und die Relevanzlabels konzentrieren. Die Ziffer 0 wurde an nicht relevante Texte vergeben, die Ziffer 3 an relevante Texte. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('export_krebs_krankheit_12_05_2020_10_52.csv', usecols = ['text','relevancy'])\n", "caption_content = 'Tabelle 2: Text mit Relevanzlables (3 = relevant; 0 = irrelevant).'\n", "display(df[22:24].style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Visualisieren Sie die Verteilung der Relevanzlabels und der Tageszeitungen in Ihrem Korpus, indem Sie den nachfolgenden Code ausführen. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "figure-1" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "df_newspaper = pd.read_csv('export_krebs_krankheit_12_05_2020_10_52.csv')\n", "fig = df_newspaper.groupby(['relevancy','newspaper_id']).size().unstack().plot(kind='bar',stacked=True)\n", "plt.title('Abbildung 1: Manuell annotierte Zeitungsartikel zum Thema \"Krebs\" (0 = irrelevant, 3 = relevant).')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-processing and die Bildung eines Trainings- und Testkorpus " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Reinigen, tokenisieren und stemmen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bevor Textmining Methoden angewendet werden können, muss der Text bereinigt(Interpunktion wird entfernt, alle Wörter werden kleingeschrieben), und tokenisiert (der Text wird in einzelne sprachliche Einheiten zerlegt) werden. Außerdem werden Stoppwörter entfernt und die Token gestemmt (flektierte Wörter auf ihren Wortstamm reduziert). Das Pre-Processing ist wichtig, da Interpunktion oder Sonderzeichen für die weitere Analyse in der Regel nicht benötigt werden. Das Gleiche gilt für Wörter wie *und*, *oder*, *mit* und ähnliche, die nicht als wichtiger Kontext betrachtet werden können. Aus diesem Grund werden wir im nächsten Schritt nur jene Wörter beibehalten, die eine Unterschiedung von relevanten und nicht-relevanten Kontexten ermöglichen.\n", "\n", "Ein weiterer Pre-Processing Schritt, der angewendet werden könnte, ist die Lemmatisierung (Umwandlung von Wörtern in ihre Lemmaform/Lexeme). In diesem Notebook wird jedoch ein deutschsprachigen Stemmer verwendet, da dieser weniger Verarbeitungsaufwand erfordert (kein Part-of-Speech-Tagging erforderlich). Wir haben jedoch die Liste der deutschen Stoppwörter aus dem NLTK-Paket erweitert. Eine längere Liste mit deutschen Stoppwörtern wurde von https://countwordsfree.com/stopwords/german abgerufen und der bestehenden Datei hinzugefügt." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Pre-Processing Methoden sind sprachenabhängig. Da wir mit deutschen Texten arbeiten, ist es notwendig, auf Modelle zurückzugreifen, die auf die deutsche Sprache abgestimmt sind. Fügen Sie jeweils dort, wo zwei Punkte zwischen Anführungszeichen zu finden sind, die Sprache für die einzelnen Pre-Processing Schritte hinzu. Für Deutsch verwenden Sie \"german\". 
Erweitern Sie ebenfalls die Liste der manuell hinzugefügten Stoppwörtern: \n", "\n", "\"a\",\"ab\",\"aber\",\"ach\",\"acht\",\"achte\",\"achten\",\"achter\",\"achtes\",\"ag\",\"alle\",\"allein\",\"allem\",\"allen\",\"aller\",\"allerdings\",\"alles\",\"allgemeinen\",\"als\",\"also\",\"am\",\"an\",\"andere\",\"anderen\",\"andern\",\"anders\",\"au\",\"auch\",\"auf\",\"aus\",\"ausser\",\"außer\",\"ausserdem\",\"außerdem\"," ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import numpy as np\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem.snowball import SnowballStemmer\n", "from nltk import FreqDist\n", "\n", "#functions to clean and tokenize the data\n", "def initial_clean(text):\n", " text = re.sub(r'[^\\w\\s]','',text)\n", " text = text.lower() \n", " text = nltk.word_tokenize(text, language = '..')\n", " return text\n", "\n", "#remove stop words\n", "nltk.download('stopwords')\n", "nltk.download('punkt')\n", "\n", "stop_words = stopwords.words('..')\n", "#add stop words manually\n", "stop_words.extend([\"a\",\"ab\",\"aber\",\"ach\",\"acht\",\"achte\",\"achten\",\"achter\",\"achtes\",\"ag\",\"alle\",\"allein\",\"allem\",\"allen\",\"aller\",\"allerdings\",\"alles\",\"allgemeinen\",\"als\",\"also\",\"am\",\"an\",\"andere\",\"anderen\",\"andern\",\"anders\",\"au\",\"auch\",\"auf\",\"aus\",\"ausser\",\"außer\",\"ausserdem\",\"außerdem\",\"b\",\"bald\",\"bei\",\"beide\",\"beiden\",\"beim\",\"beispiel\",\"bekannt\",\"bereits\",\"besonders\",\"besser\",\"besten\",\"bin\",\"bis\",\"bisher\",\"bist\",\"c\",\"d\",\"da\",\"dabei\",\"dadurch\",\"dafür\",\"dagegen\",\"daher\",\"dahin\",\"dahinter\",\"damals\",\"damit\",\"danach\",\"daneben\",\"dank\",\"dann\",\"daran\",\"darauf\",\"daraus\",\"darf\",\"darfst\",\"darin\",\"darüber\",\"darum\",\"darunter\",\"das\",\"dasein\",\"daselbst\",\"dass\",\"daß\",\"dasselbe\",\"davon\",\"davor\",\"dazu\",\"dazwischen\",\"dein\",\"deine\",\"deinem\",\"deiner\",\"dem\",\"dementsprechend\",\"demgegenüber\",\"demgemäss\",\"demgemäß\",\"demselben\",\"demzufolge\",\"den\",\"denen\",\"denn\",\"denselben\",\"der\",\"deren\",\"derjenige\",\"derjenigen\",\"dermassen\",\"dermaßen\",\"derselbe\",\"derselben\",\"des\",\"deshalb\",\"desselben\",\"dessen\",\"deswegen\",\"d.h\",\"dich\",\"die\",\"diejenige\",\"diejenigen\",\"dies\",\"diese\",\"dieselbe\",\"dieselben\",\"diesem\",\"diesen\",\"dieser\",\"dieses\",\"dir\",\"doch\",\"dort\",\"drei\",\"drin\",\"dritte\",\"dritten\",\"dritter\",\"drittes\",\"du\",\"durch\",\"durchaus\",\"dürfen\",\"dürft\",\"durfte\",\"durften\",\"e\",\"eben\",\"ebenso\",\"ehrlich\",\"ei\",\"ei,\",\"eigen\",\"eigene\",\"eigenen\",\"eigener\",\"eigenes\",\"ein\",\"einander\",\"eine\",\"einem\",\"einen\",\"einer\",\"eines\",\"einige\",\"einigen\",\"einiger\",\"einiges\",\"einmal\",\"eins\",\"elf\",\"en\",\"ende\",\"endlich\",\"entweder\",\"er\",\"Ernst\",\"erst\",\"erste\",\"ersten\",\"erster\",\"erstes\",\"es\",\"etwa\",\"etwas\",\"euch\",\"f\",\"früher\",\"fünf\",\"fünfte\",\"fünften\",\"fünfter\",\"fünftes\",\"für\",\"g\",\"gab\",\"ganz\",\"ganze\",\"ganzen\",\"ganzer\",\"ganzes\",\"gar\",\"gedurft\",\"gegen\",\"gegenüber\",\"gehabt\",\"gehen\",\"geht\",\"gekannt\",\"gekonnt\",\"gemacht\",\"gemocht\",\"gemusst\",\"genug\",\"gerade\",\"gern\",\"gesagt\",\"geschweige\",\"gewesen\",\"gewollt\",\"geworden\",\"gibt\",\"ging\",\"gleich\",\"gott\",\"gross\",\"groß\",\"grosse\",\"große\",\"grossen\",\"großen\",\"grosser\",\"großer\",\"grosses\",\"großes\",\"gut\",\"g
ute\",\"guter\",\"gutes\",\"h\",\"habe\",\"haben\",\"habt\",\"hast\",\"hat\",\"hatte\",\"hätte\",\"hatten\",\"hätten\",\"heisst\",\"her\",\"heute\",\"hier\",\"hin\",\"hinter\",\"hoch\",\"i\",\"ich\",\"ihm\",\"ihn\",\"ihnen\",\"ihr\",\"ihre\",\"ihrem\",\"ihren\",\"ihrer\",\"ihres\",\"im\",\"immer\",\"in\",\"indem\",\"infolgedessen\",\"ins\",\"irgend\",\"ist\",\"j\",\"ja\",\"jahr\",\"jahre\",\"jahren\",\"je\",\"jede\",\"jedem\",\"jeden\",\"jeder\",\"jedermann\",\"jedermanns\",\"jedoch\",\"jemand\",\"jemandem\",\"jemanden\",\"jene\",\"jenem\",\"jenen\",\"jener\",\"jenes\",\"jetzt\",\"k\",\"kam\",\"kann\",\"kannst\",\"kaum\",\"kein\",\"keine\",\"keinem\",\"keinen\",\"keiner\",\"kleine\",\"kleinen\",\"kleiner\",\"kleines\",\"kommen\",\"kommt\",\"können\",\"könnt\",\"konnte\",\"könnte\",\"konnten\",\"kurz\",\"l\",\"lang\",\"lange\",\"leicht\",\"leide\",\"lieber\",\"los\",\"m\",\"machen\",\"macht\",\"machte\",\"mag\",\"magst\",\"mahn\",\"man\",\"manche\",\"manchem\",\"manchen\",\"mancher\",\"manches\",\"mann\",\"mehr\",\"mein\",\"meine\",\"meinem\",\"meinen\",\"meiner\",\"meines\",\"mensch\",\"menschen\",\"mich\",\"mir\",\"mit\",\"mittel\",\"mochte\",\"möchte\",\"mochten\",\"mögen\",\"möglich\",\"mögt\",\"morgen\",\"muss\",\"muß\",\"müssen\",\"musst\",\"müsst\",\"musste\",\"mussten\",\"n\",\"na\",\"nach\",\"nachdem\",\"nahm\",\"natürlich\",\"neben\",\"nein\",\"neue\",\"neuen\",\"neun\",\"neunte\",\"neunten\",\"neunter\",\"neuntes\",\"nicht\",\"nichts\",\"nie\",\"niemand\",\"niemandem\",\"niemanden\",\"noch\",\"nun\",\"nur\",\"o\",\"ob\",\"oben\",\"oder\",\"offen\",\"oft\",\"ohne\",\"Ordnung\",\"p\",\"q\",\"r\",\"recht\",\"rechte\",\"rechten\",\"rechter\",\"rechtes\",\"richtig\",\"rund\",\"s\",\"sa\",\"sache\",\"sagt\",\"sagte\",\"sah\",\"satt\",\"schlecht\",\"Schluss\",\"schon\",\"sechs\",\"sechste\",\"sechsten\",\"sechster\",\"sechstes\",\"sehr\",\"sei\",\"seid\",\"seien\",\"sein\",\"seine\",\"seinem\",\"seinen\",\"seiner\",\"seines\",\"seit\",\"seitdem\",\"selbst\",\"sich\",\"sie\",\"sieben\",\"siebente\",\"siebenten\",\"siebenter\",\"siebentes\",\"sind\",\"so\",\"solang\",\"solche\",\"solchem\",\"solchen\",\"solcher\",\"solches\",\"soll\",\"sollen\",\"sollte\",\"sollten\",\"sondern\",\"sonst\",\"sowie\",\"später\",\"statt\",\"t\",\"tag\",\"tage\",\"tagen\",\"tat\",\"teil\",\"tel\",\"tritt\",\"trotzdem\",\"tun\",\"u\",\"über\",\"überhaupt\",\"übrigens\",\"uhr\",\"um\",\"und\",\"und?\",\"uns\",\"unser\",\"unsere\",\"unserer\",\"unter\",\"v\",\"vergangenen\",\"viel\",\"viele\",\"vielem\",\"vielen\",\"vielleicht\",\"vier\",\"vierte\",\"vierten\",\"vierter\",\"viertes\",\"vom\",\"von\",\"vor\",\"w\",\"wahr?\",\"während\",\"währenddem\",\"währenddessen\",\"wann\",\"war\",\"wäre\",\"waren\",\"wart\",\"warum\",\"was\",\"wegen\",\"weil\",\"weit\",\"weiter\",\"weitere\",\"weiteren\",\"weiteres\",\"welche\",\"welchem\",\"welchen\",\"welcher\",\"welches\",\"wem\",\"wen\",\"wenig\",\"wenige\",\"weniger\",\"weniges\",\"wenigstens\",\"wenn\",\"wer\",\"werde\",\"werden\",\"werdet\",\"wessen\",\"wie\",\"wieder\",\"will\",\"willst\",\"wir\",\"wird\",\"wirklich\",\"wirst\",\"wo\",\"wohl\",\"wollen\",\"wollt\",\"wollte\",\"wollten\",\"worden\",\"wurde\",\"würde\",\"wurden\",\"würden\",\"x\",\"y\",\"z\",\"z.b\",\"zehn\",\"zehnte\",\"zehnten\",\"zehnter\",\"zehntes\",\"zeit\",\"zu\",\"zuerst\",\"zugleich\",\"zum\",\"zunächst\",\"zur\",\"zurück\",\"zusammen\",\"zwanzig\",\"zwar\",\"zwei\",\"zweite\",\"zweiten\",\"zweiter\",\"zweites\",\"zwischen\",\"zwölf\",\"euer\",\"eure\",\"hattest\",\"hattet\",\"jedes\",\"mußt\",
\"müßt\",\"sollst\",\"sollt\",\"soweit\",\"weshalb\",\"wieso\",\"woher\",\"wohin\"])\n", "def remove_stop_words(text):\n", " return [word for word in text if word not in stop_words]\n", "\n", "#stemming\n", "stemmer = SnowballStemmer('..')\n", "def stem_words(text):\n", " try:\n", " text = [stemmer.stem(word) for word in text]\n", " text = [word for word in text if len(word) > 1] \n", " except IndexError: \n", " pass\n", " return text\n" ] }, { "attachments": { "grafik.png": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAA6EAAABKCAYAAABKMFL2AAAgAElEQVR4nO2dfWwU573v58oSf4BUpFOlKsofdKytowsXn8uLwgWFmB4kaGQ4orQXKKVFY+rY2UCAxrgUGjskhLvNdQ/BCELLUqAbfAwVWXJMUiBaWcXh7QKnvORgoLRkQ12CDa6TshTbVb73j5lnd16eefXszqzz+0g/Jd6dnXl2Znd4Pvt7nt8jgCAIooCkUimMqb9MQUFBEZpIpVLo7e2loKCgCEWkUqlhH0LQHVKCIL5YkIRSUFCELUhCKSgowhRBCyJJKEEQww6SUAoKirAFSSgFBUWYImhBJAklCGLYkUqlgm4CQRCEBpJQCgqKMEXQgkgSShDEsIMklCCIsEESSkFBEaYIWhBJQgmCGHaQhBIEETZIQikoKMIUQQsiSShBEMMOklCCIMIGSSgFBUWYImhBJAklCGLYQRJKEETYIAmloKAIUwQtiCShBEEMO0hCCYIIGyShFBQUYYqgBTFQCc1kMjh+/DgOHz5siM7OTgBAR0eH4bmOjo6C/aNBEETx4UVCP/v3RvzlW4Ln+OzfG/1/IwRBDBu8SOisvXM8R8eNDwLv5FJQUIQ3ghbEQCX02LFjkCSJG8lkEgAQi8UMz8VisYL9o1FoklUixKpk0M0gCF9Jp9PIZDK222UyGUfb2eFVQrvX/DP6r7S7ju41/0wSShCEJV4ldN5b87GybbUmFh34HmbtnYNFB75neG7eW/NJQtVxphHTx05H45l8HSOBpWPHYmkiv+8jsXQsxi5NBH4+7dtRmPMxHOPKlSvo6uqy3a6rq8vRdnYRtCAGKqHJZBKSJDm6ecfjcXcSmqyCKFYgdtXxvw+hgCSUGG6k02lEo1E0NDRYCmYmk0FDQ4Ptdk7wKqH3flrh6Xj3flpBEhoAlZWVEAQBgkCzPoKgvr7e8bkXBAH19fV5bpFz2P3GyciqeDyOeDw+5GN6ldCVbasNj+88/UvM2jsHO0//0vDcyrbVJKHqIAktcDv8PR9nGqdj7NilSPj9mZjeiDOmxxubC0/nXD4H2X2YHEsdV65cwXPPPYcNGzZYCmZXVxc2bNhgu52TCFoQi0JCmYCy/5KEEkTxwDp7kiSZCqZ6G786e25RS+hg+kPc+2mFbfRflTuwdhJ68eLFrCyZRT466JWVlSgtLfX0ukK0zym84/Pe21DamUgkuNclkUh4bnfQlJaWGt6Pl88DQ39+hyKhQX+m2I9jkiRZiijrd/j145jbTiJJKEce3AolSWiB25F/CR3KuchKJkcMzzRO1z5+phHTXYuo/P6nN55R/j6Dxun2IsrkUpIkU8FUb7Njx44hn9ugBTH0EqoWUAAkoQRRhFiJqN8CCgxdQgc6TzqaB/r3028DcJ8JddN5HwpeJFQQBFRWVmoeSyQSoZNQp4853T/vetTX1xe9hOqv5VBEdDhJKGAvon4KKEAS6kuQhAZ+DcIwHNfTuUgs1WY4eRKaSPDF1EUm1iCyvb1ZmbU7J1Yi6reA9vaShFpKqF5AAZJQgihWeCKaDwEFSEK9EtZ28fBLYniiNlzgvTeW8b148eKQ91/Mw3EZZiLqt4ACJKG+BElo4NegWCVULYdcUTSLxFKMdfz5kbOeuSyo9nEnbeaJaD4EtLeXJNRUQnkC6gonEpqsgiiKmtD4n9k+rsZQodk2iSrl76uxCtX+TI5vOG5uu6yEKsdg21TwdqTbRqyI4arm6Qr5sex2xSflxPBCL6L5EFCAJNQrXofvBoEfUhPW6+AXJKHO0ItoPgQU8C6hiw8uxc7Tv9TEC0fWYNbeOXjhyBrDc4sPLnUlofo5cOoOdGLpWE4HXJnvpnSos0LAhi7azIMzzLnjZJmy+8xmr+Zh0fSxutfxOvuc4Eko269eSrLvQbu9fB5yoRWsnHRZnUuNjLh9D1zxyu1Lbk+uHdn2ZrfXzVHkDDE1u4769rlph/Z8mIiczedGk4nUZzNNPj+OPvP5kFCLjGdiqbO5ob29RhHNh4D29pKEciXUSkCTyaSzJVpsJfQqYhVVUDunLJCq17iUUL0sJqtEiCLvGFrZvRqr0kqoXigVaTUKsvqxq4hVaF8nS2gFKnRyShBBohbRfAgoUNwSqp/Hx8vSsbmKbD9m8zbNpNLsGExQ7IagVlZWorKy0jDX1Uxg9XNMefvnzclkoqR+b/r3rH7OTHTMti8tLXUtRnbXh52TixcvZrdVt8/JdTN7Hzyp1LdH/Tre9vrPnpmUsvehvlZOh+PyrrfdcNz6+vrs58fu86/fRn0Mr3KtFtF8CCgQziVaZFlRdeKVTrRhPptKOvSdafnv6bpCL4r0aB5jssKZ38eTvunTMV2fOfIjE6qIjFquZCnRvwdVm9WPK6/PiYYyB3D6dK2wGY7DyYadacRSTxLKO5e5dugl6Eyj7jHDdVaJq+V7dd4O/g8aOmE07N94vv2eE5q73s6E0I08Wn0+tce0v+ZqEc2HgPb2koQaJNQuA5rf4bi5jKblPswkVD+MVr+d4XW8ZhvFNSuYuR3p/ubv3yDVBBEC9BKar86eW4KWUNbpdzKHjz2m7sTzOvpmxXv0j6mPyQTCKiPK9utk3/rteKLLzof+MZ6Eqo9l9xjvnF68eDG7bzeFh5xeH7ZdaWmpQYicXjenEqr/m703q+f179lvCTX7XDiRUP3cXNYG3jXlncOhZnibm5uz96Vjx4553o8ZXiV00YHvGbKdbMjtyrbVhufY8i22EmqStTF0+tXbcTrZXLng7T+x1CRrZSK6pvscgoRyBDT3njn7NcmCacVEmxk2P5dDG6aaEy++zFu1w3p/Vufc5Np4aYfh82YyRFW3XZASavq5cPJZs/ws2IdeQv2ohquPoAUxVBK6f/9+28yI7xKqH9KqzmZ6GI6rawSqVPuTpVAvmPpm8+eEJqtUWc6rMVRw35v74xFEIdHPAbWrmuuVIUvotVMa2exe+d8x+Kff4/6muVoJPSN/u/yQUJZd1GMmA7xt9fvVS6jT4bbqTj1veyaqdm01Exx1O5zIoFcJtZrv6VZ
CnV4fniSp2+fkujmVULv286rjOh2e60VC2b708M6J2b6sPivsb6tz6FVC1UNwnVTN9YJXCc3XnFDTjjhHvORtl2IpZ76buRBoq4RaiYO+g26afRqKhCaM2T87QTBrs2O51JxLJm3e5qeytpgKuqEqqz6MQ4ENGW3Oe+Vmvh20w3g+dO0zvZba7YKRUP0QY5efNdtMqHXo54DaVc31GkELYqgkNJ1OY//+/ZY3cf8klA2hVW+jlTi/JVQjkqbNdiChnLmsPInOzgm1OB5BFApeESIny7d4wc9MaO/r38Hn/Q+z2/2t9WX85Vv/zfdMqJVQOJUPvVDoO/BmcmOGWkbVmImA/hh2wqD/fzO8SKidZLqVUKfXx2q/Tq+bm0yolXjxJJz9gMBe46eEOv1cmO2L94OHU0EfynBc/RxQp8u3uCVsEqqf56gNfUfafJkJZxJqXZiFK6G8bT1LqIN5qmZDcU3PkVsJNZ53p/NB9a/ji5F5O9hrDcNjvUqop3boJJQ7v9N4bgovoWw4sYe1SX2YE8orQuRk+RYvEbQghkpCneCXhPKFUCehZhnHoWRC/ZBQ00yovpkkoUQ4sKqCmw8R9UVC//cIZN7bzt320bk23Fk8yjcJtRMiPyTUrXSp0WfQnMoGLxOnDn0brY7vVkKdSIlTKXdzfdRzQnnH81NCAe38S/1rzDLB6sf9lFCrzLMfEmp1HbxKqFkRonyIaNgk1E1mJjtn0lGhGr6MuM6E+iqh09F4hjdP1fpcOJMdOwk1kRkmYS6Hz7JiP2aSZ3jc0ZBi95lQ1+1wnAm1/mw4vy4ePvvsBwunc0BNPgteq+NaVcHNh4gGLYgkoYbX6CXUKJfG4kLOJNTJEGFHEqrfrwkkoUQYcLIMi98iOlQJHfzzNQz84f9Zbj94uxMDN84CCF8mlMHLhHqRULcZL3YMq+3U+86HhFrJoPp9OK0GXAgJVe/PqYTq92E1Z5Shvi5m7Q1jJtRvCbWrguu3iIZNQh1X/VRnd0wzexzR0m1rPifOat4hry1DKUzEF1EzKXE2j898LqZd9st1gRzlGHwBdCOhxvPgVkJdt8MgaHbDh82vQb4k1FURIqvr5GGdUCfLsPgtokELYigk1G04qqZpI3zGoj28CrdKASD1vErVHFLXEgpWeEjbLkN1XFsJ5VfZBa4iVqWvjksSSgQLE0y77y3bLgwS6hY/JNRMGNzMCbWbA+p1TUy9IJjNCdWLgJOhtl4zlk4es8t0Ws3f1OP0+thJqJPrxjuWWWEkNfrr5CQTaiZ2vGJRdhJqdr3dVsfltcPuPemHGdvR2dnp6EcvJqLRaNSX+5LbTuKsvXMwb/+38MKRNZpgy7AsPrjU8Ny8/d9yWB3XpLCMpmKrw+q4egljGSXuUh686rhOJcOZuPTq3o9GXDnZLnMZNMmeJpYaquPqZcwoaAks5Z2PIRUSMlbpNYiOw0q4XiTUVTs4144vsWfQuNS6Oq7bIj/64F9vn9Y3dVBhurfXeJ6ZYNpVwWXbkYQOUUIzmQzi8ThisZjjiMfjzv4hsJg3ycQtuxyKyESTl2HMyWl26RSPw3EZ2rVEtXLpVELN3qN26ReSUCIcOO28ZTKZQIfj3n32a/hb68uu4+6zX8trdVzeY3pB4EmDXm7MjsH+rqysNMgIr5ot6/DzKq86qe6bSCS4+1MLhB/VcXnyo68gy8sgqtull0u762MnoU6uG+8xNrRZPyxajV31XPYaffvYvvXvwa2Esm28Vsd1IqG8a8qrjuukUNGFCxcc3W/S6TTS6bTtdnZ4ldB8LtGiERLDXEdzSVULpn54pt18R8PxXA6DVR/H8zqhuvVArTOS1gV9cvKi384k+6gOF9k83jnR/gBgNyxYO5dzqMNx3bfD5AcEztxQo8ybVOD1OHfTSkJN56m6ybzq1z41vJa/L6dS2dXVRcNxhyqhBEEQ+cCLhD5M7cG9n1Z4joepPY6PZZUh1M+bNKuymkgkDHMu7SqLmh1DPXxW/5zZfisrK23Xu2To22klG7xtvEoooBUqs/fDayMLu3Nndjyr4bh21w0wXgu27qh+WLY67NYQNTv3+n2x5WW8SCivXbzXepVQwLimLPuxwK2EFhovEtpx4wPPcfvu7SF3Ur3KEQUFRfgjaEEkCSUIYtjhRUKLCd7wyULiZK4nYSTo6zacMVseJkx4kdBiCJJQCorijKAFkSSUIIhhB0lofiEJ9UbQ120446bIVFCQhOYh9MMeLYZ1hjn0Q5n5w6MpzMJyuaEhFhrSxDD5vLEIWhBJQgmCGHaQhOYXklBvBH3dhgP19fWmRZTCNPSWB0koBQVFmCJoQSQJJQhi2EESml9IQr0R9HUbDujng5rN3Q0jw1VCKSgoijOCFkSSUIIghh3DXUIJgig+SEIpKCjCFEELIkmoKYJNEAQRVkhCCYIIGyShFBQUYYqgBZEklIudgBbhWyKILxAkoQRBhA2SUAoKijBF0IJIEsrFiYSSoBJEWCEJJQgibJCEUlBQhCmCFsSCSGjQDXAfgs8R9PuhoPjiBUEQRJgI+p5IQUFB8UWLIkwNDjUTShlSggiSVIoklCCIcEH3JYIgiMJSZPaVLwElESUIgiAIgiAIgigERWZe+ZbQIjsdBEEQBEEQBEEQRUaRWVchJLTITglBEARBEARBEEQRUWTGRRJKEARBEARBEARRzBSZcZGEEgRBEARBEARBFDNFZlyFklCSUYIgCIIgCIIgiHxQZKZVaAktstNDEARBEARBEAQRcorIsoIQUBJRgiAIgiAIgiAIPykiwwpSQovoNBEEQRAEQRAEQYSYIrIrklDCnP3790OSJESjUXR0dATdHIIIL31nsX/rVmzNxn6c7Qu6UQRRBDz6FN3dN3A+lULq/A10d9/Hg8GgG/XFZvDBn3D+0B7lXrYTLakruPso6FYRBOGEIrKr4SGhPT09vu3LT9LpNJqbmxGLxRxFkKKXTqdx7dq1bLS0tECSJNTV1SEajUKSJHzwwQeabcJ63gmi4JxogiRJmmg6EXSjCN8YfIC/XDmJVCqFVOokrvzlAULrSUXR1kHc63wXzT+Jolr3vZEkCdLyGqxp3In302Q+haUPl1peRpR3TarrsPPUvRB+lgiCUGNrV5lMRtOZt4v8dfaDltChiWgymdTcJA8fPuzodWrhyicNDQ3GG7lFJJPJvLbHSzszmQwuXLhg+nw8Hg+kzQQRKr6gEno1VgFRrEDs6nA95iOk329GXbXx3ldd1xwySSqStg524ejPVmK5o38Xq1G38xTuDTPzCeJ7Y08fLsXX2VyXKJpOhG+IRzjPJ0EEg6VZZTIZ13ISjUaRTqfz1NSgwxtsqGhdXR1isVg2W3fs2DHL13V0dGjOaz5hxwkznZ2d3POoFvvm5mbNY/ptieJG/aOM2X0mk8kUuFVFhi8SehWxChGiqI4Ixk2cgXnPbcXxjx/mo+VDYnhL6CButth0ypevQ8vNMBhSkbR18CYOvxw1b6NZH2jzUXQ5avoA7pyLo27eDEwoY9+hMkyYMhuLXorjFPsOJasgiiIq2I
foagwV6r8ZZo8zlP1Uufz9OHzSZBTQ6PqdOJRKIfXuXryycnnuekS34zz3WvDuXyLEyDhMnB1Fy/Xc/Ut+/+7PmxnhO5/FwkXUlwoQSutx0fVrE6gUBAiViTy0ixgKlmbFOv2xWAzJZNIy4vF4nkXUmzgmkwI6OoIVUSZDrHOcyWQQjUZRV1dn+hq9gOZH7LVtVEuamwx4vtvGYJ9HloWNxWKOM7Zs20Jz69atgh9zuJHJZHD48GHDjw7se/XBBx9kt43H4+js7AywtUWAnxI6/l/wvdpa1NbWonbZPMyYUKZ06Moxf8dlDOSj/R4ZzhL64OQW/rBEfUTXo/HHNUoHvhqrXmvBpQIni4qjrYM4v53db6JY//rraFxVbdrW6lWNeH3nz7BuufJv9vbzNkNB7+HoqmmIKOIzbd4y+Tu0aDYmjoso36EqJAHgty9iypQp+Pb2G/JLv9ASqhfQ5Vi55QQ0H4u+E2iK5p5vOsG7Epz7V20tFs2egDJRhBiZix1Kt+bG9m9jypQpePG3/ryDcJ1PryhCKAgQJmzETc4W/W8vxChB3sYf9yMJHY44klAnQy/ZtvkTJ2/SKEkCYrHgJLSnpycr8mqspKjQAgpoJbSnp4fb4bcKu6yuHxSjhMZiMTQ1NeGvf/1rwY9tx8WLF3HgwAFH8dFHHwXSRvVojGg0ing8nv3hS531jsfj2R/CCiWhPT09llnXnp6ecM5F9lNCK2LQ9qUGcKc9hvnlsohWJe/51uyhMnwltAeHG9xl67SyZ5Yt+gK39dperGbHXBnHhwCAR0h3vIt3T17Bx93d6O7uxo3zKbzbcR3y4OHbOLCOtXU14h+aN/Rq0yxERBFlc2K4YBg0MIBPP2xB9OkouD2vL7CE9h3dZC2gCjcTL2Y/M2tbbnG2ML9/nV4/FaIoYuG+/PziEabz6R2VhApj8Gx7v+75Xrw5swQlJSUkoYQleZNQ1mn0j+KUUADZc8E6rOl02jQTGoSAsjYySTt27FhWnO0y4Gyuq1VW1y/0n8d0Oo3Ozk7LYBIQpISya/m73/0On3/+ecHbYMaBAwccd/5OnTpV8PapBTQej3OFr6enxzBloFAS2tnZibq6OtPjdXZ2IhqN4sKFCwVpj2PyKqHs6SbMiogQpzYg9+4f4sIuCbNZtjQyDtMWx9B+R8mXplZhvChi7o4u1Y66sGOuCHHyepxWPTpw8AcQxQhq2pDrXB+8g/bYYkwbF5GHBk9bjB2Xc7lYbufv4QXskmZnh0NGxk3D4lg77qhSuOffmIcZE8fJWStRhFg2ATPmvYQjd7R53oE77YgtnoZxEWVY5WwJqxZOLkCH8wSaON/Z2k1vKQV/UnhrU63l93vT0Qf5bKBtW6Xl6xC/1IdH1xPZbGJwbR3E2W2q85WVUDtuoWVtrp3LNx8Ht6V9b+G7ERFieQ3anHiOWh6V/9dHReyqewk1k1Ld49nvzYXraImy70oZJsyO4uAfCznW4Rr2rlb1L5tOoA99OPvu+9BPIfYuoUDfvoW5ewuM9w3z81GHo7rf3JzeE+51bIXEsrDKvSvWfic7kiR7zMvyPe7JMhGiX+ODPcGEcCZmjhFQMvNN9KqfPvk8xgijsWTJbJJQwpK8Sqi/nf7ilVAmatFo1DCXUV1lNigBBbQSytrrtPiQ/9eaj5vPo56gJZTF5s2b8cknnxS8HTzCLqHsc2hXUEo9FaDQEsqOySs0pn6+ubk5uLmqfdfx7h7VkiyNqw3Xd3Wj6vk97+K6befYRkIxgKMrxkMUJ2P9aQDoQ1t0EiJiBJOeqUNzayt2xyQ8XSZCLK9C8h6AgYP4gSgiwnp/ANC3DwtFEaI4FQ0qlz+9fjJEcSH29SHbaS4rK4NY9iTmLatF7aKn5Q5d+SqklJ6cQUL72hCdFIEYmYRn6prR2robMUl+XXlVEqw/mYxOwOxFtXipuRWtra3YHXsOs8u1+8a9JKrK5Y7mk8rQyuzQvnxL6M0EXuR8Z9Wd71stay2/3/yOeoHaqggow05E89/W89heq5LJTUe52TYjj3DpF6rv1vItOMnZSv4BxUIW9ailMH0SrW8sw2RRxORlb6C1Vf5MvnelL88SGkEkIiIy7htYVFuLZfOeNHy/8s7tA1jHzm3tNpwd7MOJJrkvtfxHO3GeGf/gh4hnZdVmOK7m/iVnoKumiojMaso+zpfQCMrKIrnz8cwkREQRkbk7kO21Obwn3EtWoVyMYNIzzyG2uxWtu2OQni6DKJajRvmVQj7meEyaVKbMG56CiS+o7pMFJyeE7W/ORIkwBs9nP+z9eHvhKHmYbqJSK6EfNGLy1/4JI0tYFnUEvvS1yYgeuI1cLjUnjJkrv8B3vj4SJUIp6i9yJLT/NnZXjoYgjEZl4q7yYAZXfvEdjPvSCAiCgJKRj2PyskpMMEhoBqeansluJ5SMxOMV9XjvttKStiUYJQiYuFndD09j80QBwmPPol31aP/u2RCEEsxvBcDe8+7beK++Ao+PLIEglGDk4xXYfE6fMXbPZ599hiNHjlhuMzg4iN/85jd4+DB8tRn0OJJQVtzFKswKGPnb1OKUUEDOLtbV1WXPp/ocMRFVDzsspIACJKH5gjdkuLq6GkeOHAk8Kxp2CWVL7ljJm15Ag5JQSZLQ0NCg+d7qn7fKmuaV3+90NgcvG1Hs/L3dTu0klGUUlMzmhQZMFUWUr0pp5okOXH4VFaKI8SuOYoCT9ezbtxDi+ApUTFZ3rpXtZjXhBpDtNJd/Nw5VPRGkt37TMqNxoWEqRLEcqzS96AFcflXu8K04at67lkViPFal5NfIwm0cfpzfoXePcP3QK1hpImxuJLTxnXwPmzZpq05As1tbiGje29pzGA3Z49Vi+3kXr31wFJuyr10Lni8zialx6hB6WbQZjsvLlKrDm4SOx8I3ryP39WLfE/3IBb/pw6W9jXit7Q5wqwVr2bn9P+/jkS6rLs/DvYPDDarCRLXbcNZNYSJRhDitDqn75iMo2PWbtV49YoLdA2ah6Yb6b5t7wsBRrBiv/dFL3t1prJ8qQpy7A12qY076vvYeFxwqIexvw5LRAkZVJuRs6M3NmFgyCpWJ3pyQMfdLfAdfGvc0FkRfwa5du7Dr3+rx7fGyRC5pY4KmSOjoL+PLJQJKRv4TvvKVKXjZIKF3kTAIaD/O1T+BEkFAyciv4+kFC7BgzmRFBNUS2ovW73wZJUIJvly+DK/s2oV/q38G4ggBwuhKJO4C6N+N2YKAkvmtubfduw0zlCHIz6t+YWp/9jEIwgxs60X2PY8YMQLCiK9i8pwFWPC0iBGCAGH0ErQN0UO3b98OSZLw61//mvv84OAgXn/9dUiShDfeeGNoBysAjiR0KOFvU4tXQvWos55MRDOZDGKxWMEFFCAJzRdW81Y3btyI27dvF7xNjDBLaDqdhiRZZ0E7Ojq4P4gVulCWOqLRKI4fP276vCQ5X57JPwZx87DJenocAX358E0H6+vZS6i6quflV5+CKD6FVy/rN1KEcvwqpABlu7mQ+7UDSFZFEKlK4uiq8apjtaEmImLy+tOa4
xg60201iPCGFV4FgMt49SkR4lOvwtikHZgrihgvGyaAh/j4VBwvLZuHGVOmqCqZsn3L7WEdRs1ZypuE2leYdSOhkiQvjdLurKyrP201EVCGlYjmr63Qyo6JSJqjFiP+a38bLeN8JnhSpBQmcimh4//le9lCO5r416lDG46r/wwr35P8DQtVFSFqOsHJhAJ9J5py97Xlm3H8wQMc3ZS7l5kv0cIvTLRsnjJ0NjILMWUov+lwXN35uNE0S/W4w3tCWw0i4v/EC//xZ/z5z9rYvST3GQjfPFKtEN7cOAGCMAEbbwInnx8DYczz8igAvYTyUGRv1BL2q4wioSP+B55/T50hVR/zLo4/+wRKhNGo3K3a5uZGTBAElDxRD23SUTcc9+TzGCMIGL2kDerN+s+tQakgYNTCt9HPyXr2bpsBYVQpSh8TUFrP8rHKdk9swH+p3vPomT/HFdXv5/I5UrKlQ+Czzz7D2rXy/XzPnj2aZEZ/f3+2z7l69Wrcv39/aAcrAMNGQpNJWTjdRGdncBIK5EQ0Fouhrq5OMzS30JCE5ge74kk//OEP8fbbb+Mf//hHwdsWZgkdyrUuFFb3x+bmZss1a/VZ0/zjRESdCijgJhO6cF8fklUiRPEHOMhJLsrPKeKZWoXxLEs0cBQrxkdQlRzAwNEVGC9+E1vTAC6/iqfUmSSzzrTSKed3ppOoEkWIPzjIqeCrPDd3B7pwD8mq8ux80XnLalH7UjNaf/zN3DGV42SlWNOEPHUe7xzCBpvvrFsJlSQJUrQJvi+taNLW1Xuv2770+l7j0PG8thXQSfjMq6QAAAzYSURBVKiE19tdyK4mi+omE9qF/3g5J0P/OtW7hOZtTqhht8r3hI1I8JlbB1Q/XKyM40PunNA7OLSBPSZnreX5oFFsPtplcS8zv3+x0RniN7ciDecSKp835XGH9wS2/It5FIeEov9tLBwlYHTlEsweVYKZbyozRDkSmvlTCj+PzsHkr30FX2FDYTVZSrP5m+yYa7C7XhbQmduuaSTyvzY8AUEYhSWGUQbafZ5bMxaCMBZrzum3U4Ry1BK0ZbebCHlEbj8SlSUoqUzg7SWjVBnZVswvEfDYs+2m71nebD5KfJofe//+faxeLd8bd+3ahc8//xx///vf8dprr0GSJKxYsSI0077sGJKEsuFyYZDQjg4546kPSRIQjfKfS6fDIaHq8xiUiJKE5gc7Ca2ursY777xDEqqj2CU0mUzaSmrh54haiagbAQXczgl1LKHKvNDJ60/LQhr5Lt7qAzCQRFVEHvony60yH1TeQf4kVCmWVF51UFOsSHNM5ThPGdO8ees8mknlyl9csrmGg7j0i5WW3/XaphP8gjo+t3XIwpyHtspoh3m6OcadQxtU7WvAYU5xbPbjjNWcUPk7USQSunCfwzmzLnhwHJtVWXBZOPXVcdfhwG3gRFNuu6YTwODN3+LQqXs23wOr+xfLSvMF0I2E2t0T5P/njRAxf004MM7PPPn8GFkm1UNOdUJ2N1GJ0Wz+5eQ5WLAgild2LdfN17SR0JISlAgChFGzsfuudotEpZAbFqt9RrNPebvZ2M0ZGis/p4hn2xKMYtnL/rexcFQJKhP9yvIzcuYX59ZgrDrDaSahF+tR6luRJuCTTz7BCy+8kBXRV199Vf6uRKP4+OOP/TlIAfAkodFoVDO3iVVTDVJCi2E4Llt7s6enx1CESH3OghBRktD8YCWhP/vZz3D37l37neSJMEsoG47b3Nzs+DXRaNTnitzW8O6P6nmfZsN1g62WyxNRtwIK2ErovX1YqKqOKw+zZUWK1CjDcSM1aFP/PXcHkusnqzq48tBcceE+JFeN12ZfPEmoMhxXV3VXboI8zDBS02beSVQfUxFnntDmq/Oo7nhrIvoyDt80v5KDXUexOWrzfTcpqON7W32QUL/bKpNbRqY6GkW1FMXLLedw95HFSwYf4E/vb8GP1MOH17bgFm9bB9Vxi0JCDcPWfeTklpxsrk0g95Huw9ntP8JyaTnWtdzE4IMTaMoWkZKl1BlW9y/l3jAUCXV6T1CmDNgtB1MMEoreNzGzRMCEjapVQzVC1oYlo+Q5l7tvq+1PL512EvoE1mxciNGcYbfmculRQpWhwo892y4LaclMvNkLoD+ByhK5aFHvthla8S2QhAJAV1eXxh9qamqKbm16TxLKWxOSVyCEJFQmk8lo1jTUd0rT6TR3jmghIQnNDzwJXbFiRaBDrxlhllAA2QJeTtbaZN8fu0q6fqK/P+qXkdE/H4vFCrduaN9Z7N+qqnjLYuchXOpTi6gioH2XcGgnZ/ut+3GW2z8y78Q9vB7H9ydFoK7umC1MpCvAoS1MpOw5VgExMhVTJ2s7Z31vfRcRcSqmTtUNc/MkobnCRNrCIdrCRCxrtSCu2mbgDtrXVqiOyUR6Ifapd/XwAn4+d3xeOo8ntyw3/76aiKgjAZUkSNJmHPex+IllW4cqoT63lfHg+GYslyRENx/A0cNb5Lmpy2uw4oVNeOcvqg3vv4+m1StQw5m7uuHQHZO959ai5K8T6kxCDVk2txKqZPmnqstOD9zBkRVTHUgo+57oC3v5g/oz8GLipu7ZQTz49AEGB2/iwPrq3DlfvRfXHB/BYp3jg1UoF0VEvvsW+uBRQp3eE9gPElNX4pi+3tbDC4hF38BVq2MGhsPlUtRCpkjYWMMYWJcSWlqPi6oCRKVrzmWH5JoPs+UNx30MbARtDmU4bsl8tKr/nrgZiWcfgzBjm7IUjTw0V5ixDYklo3LzQfXvWdN8/yUUAG7duoXa2lpUV1fj2jXn34Cw4ElCeRUembiQhBrZv38/JEnOkuzfvz/budZXwQ1SRElC84NeQn/1q1/hb3/7W8HbwSPsEsq+Dw0NDZZDV9m6u2b3pnzBPo9m2U31/bPQw4qtOvLL18VxqW8QXe17sKe9C4N9lxBfZy4KluvsqQt7LJuHGdk1QCfh+/v+qMoC9KGtphyi1RItDKVznCtQxHbBlmvRzafzKKHoa0NNuWi9RAvrJEbG4RuLalFbuwjfGBeRl4NR7buvrQbloqhaIuYbGBeRl3HIR+ex5zC/Gr2ZiDoXUMn37KJtW4cioXnJhEKzzMfaxHmcb23C+jUvYM3/PQ79z0jcTG90C05ajeEduIwd88uRXS9XWcKjtnYRZk9RlvFgowMMn+/foa5chBiZhppm+XskveFhndCBFFYpS4g8vUj+/j5ZJho+22zeYmTSM3guttt0KSM/UX9maredNY7S0AuoFMUWyxOuh1eYaBnmTVPWAy6fj1+aSKczCXV+T5CXaJGXv5HY+X1OvhZMkoeFhLLKsv/r58iN/+rH7fcklLqWUAC4i20zSrTVcZVlVcY8266aK9qP2wcWYgyvMFFlAuqxaNrCRMpR60shlIzBmMcEzFCN8+1VlqYZM0Y1H1T/njXNz4+EAsD169dx6dIl/3dcADxJKE+OmGiRhBphAso60plMxnToYFAimkwmsx1lklD/YMfdsGED/vCHPxT8+FaEXUKB3AiLhoYGrmB2dHRkBbTQotfZ2WmZ
3ezs7AygAJGMXUd++Y+24P0b3ei+8T62/Mg6U2UpoZoiGhGMmzgD857bivY7vOzIPXRslTA7K6rjMG1xzLitMv/TWFmyD/sWitr5oIB3CQWAex3YKs3OVryNjJuGxbF2zfzPhxe2YvGTZdrnD+qPOYA/Hoxm31vZhNmItlzHhXx1Hh+cxBY7qVRE1JWASmypi8K2tfb1o/i4uxvd3d04/ab1nNW8tlXF4M0WOQNa24ADv+/G3Xt9uP/AKDpGCbWqyqrmIa6/E8Nz82bkKi5HxmHi7EV4qfn93FIcnM/3wOUdWMyEqWwCvr39hnsJBTDwx4OIfkPej9lnW/7eTMbCVbnvLu974ivX9mJ19oeGdUhcV42FfnQdLToBtS5CxIO/REvZhCmYHY3j3B27JVrsJdTNPeHef8ZRxyrzKtd0xrw6xM/dwYDVMQPDg4SiF2/OVNbM/PrTWLBgAZ7++kiUjBghL1/iWkIB3E2gcrQ8RLf+XD+Ai9jwhPoYczD5qyMgGI7Ri9b5oyFYLdHCUMQ2V6CI7YIt16KreBuAhBYzniRUn8GzqgLpb1OLT0LZ3LZYLKZ53EqKgh6aSxLqH01NTYEVHrLjo48+wqlTpxxFd3d3IG3MZDKaof7qNYvVcyGCKGBkV1gok8kEUHxIxt2QRtbZq0G0RhbS6pqa7JwsvoQSQZOVJKtrWh1FtNrFZyBPFWcdtdVt5Ks6roq+E1u0a5s2nTBso5VQL/OrCSO69T6latRt2oqtWzfhxzXLNefbvYASQ8eLhALInMLGiq/KQlgyEo9X1OO927s9DMfNwbKXwuj5aO0FcPc46isex8gSq2MAwF0c3/gMxrEKvdltdRNFlfmfwsTN0P6c3IttMziFkEhCXeFJQlk0NDSgocF6qI2/TXUfnZ1eq+D6lwllWU8m7kxM6+rqTF8TpIiShPrHw4ehWFm66Ons7OTOq25ubi7oENxiwYuEvpi4CdxM4MUXE7iJPry7kSQ07Ax2taP5xzWW64U6i2qseq0FFst2fqHaqubR9UN4ZaUiPpxiQ1kJrV6Pvf9pV5WVcEzfCTRZZtCjaGqn800QxYytVZkV1HES/mYn/BTJwkqounowmw8qSZJtlUy9iBYKJqHxeBzXrl2zjUJLqDob5jRY1owoftLpdOEK/BQptw+sc3/Prq7DT+qq5azDT+pQrTy+znnJSSIoHn2KbmU4q5f41Kry6xe5rblGI93Rgp2t5w3Ltdw6vhN73u3EPbIh/+m7hJZXVhp+uFi+8hW0FOpXCIIg8oY3qwqEoOXTu4QC8pBlljWOxWKhzt7YZcB54WYpjaFgNvfYSYShIi1BFIRHaZxPpZAacpxHOpBOP0EQhMyjTz/GlZMppFInceXjT0G3JIIYHpCEFkhCi42Ojg7HWcZ4PE6ZKYIgCIIgCIIgHFFkVhW0gBbZ6SIIgiAIgiAIgggZRWZVQQtokZ0ugiAIgiAIgiCIkFFkVhW0gBbZ6SIIgiAIgiAIgggZRWZVQQtokZ0ugiAIgiAIgiCIkFFkVhW0gBbZ6SIIgiAIgiAIgggZRW5VJKEEQRAEQRAEQRDFRJFbFYknQRAEQRAEQRBEMTFMDIuEkyAIgiAIgiAIohgQUqkUKCgoKCgoKCgoKCgoKCgKEf8f8cHN8XraXuMAAAAASUVORK5CYII=" } }, "cell_type": "markdown", "metadata": {}, "source": [ " In einem nächsten Schritt müssen die einzelnen Funktionen angewendet werden. Um den dafür notwendigen Code zu erstellen, müssen Sie eine neue Code-Zelle einfügen. Dafür müssen Sie sich zunächst in jener Zelle befinden, auf die eine neue Zelle folgen soll (zum Beispiel in der Zelle dieses Textes). Anschließend klicken Sie in der Menüleiste auf das \"plus\" Symbol: ![grafik.png](attachment:grafik.png) \n", "\n", "\n", " Fügen Sie nun die folgenden Codezeilen in die neue Zelle ein. Kopieren Sie den Text im Bearbeitungsmodus dieser Zelle. Dafür müssen Sie mit einem Doppelklick auf diese Zelle klicken. \n", "\n", "#apllying all functions \n", "def apply_all(text):\n", " return stem_words(remove_stop_words(initial_clean(text)))\n", "\n", "df['tokenized'] = df['text'].apply(apply_all) \n", "\n", "caption_content='Tabelle 2: Relevanz, Originaltext und Tokens.'\n", "display(df[12:14].style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In einem weiteren Schritt wird die Sammlung in ein Trainings- und Testkorpus zu unterteilt. Damit erhalten wir eine Reihe von Dokumenten (Trainingskorpus), um den Algorithmus zu trainieren, und eine Reihe von Dokumenten (Testkorpus), um die Effizienz der gewählten Methoden zu testen. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Training- und Testkorpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Um die Sammlung in ein Trainings- und ein Testkorpus aufzuteilen, wird die Funktion numpy.random.rand() verwendet, um ein Array mit einer bestimmten Form zu erstellen und es mit Zufallswerten zu füllen. Dies ermöglichte es, eine gute Mischung aus relevanten und nicht relevanten Artikeln in jedem der Korpora zu erhalten (Abbildung 2). 
Da diese Funktion auf dem Zufallsprinzip beruht und die Aufteilung der Korpora bei jedem Aufruf variieren kann, wurde ein Zufallswert gesetzt, um reproduzierbare Aufrufe zu erzeugen. Folglich sind alle Zufallszahlen, die nach dem Setzen des Seeds erzeugt werden, auf jeder Maschine gleich." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#create testing and training corpus\n", "np.random.seed(1)\n", "msk = np.random.rand(len(df)) < 0.599\n", "train_df = df[msk]\n", "train_df.reset_index(drop=True,inplace=True)\n", "test_df = df[~msk]\n", "test_df.reset_index(drop=True,inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#plot the result\n", "my_colors = [(0.20,0.200,0.50), (0.100, 0.75, 0.200)] #set colors\n", "fig, axes = plt.subplots(1,2,figsize=(10,3))\n", "test_df.relevancy.value_counts().plot(kind='bar', color = (0.100, 0.75, 0.20), ax=axes[1])\n", "test_df.relevancy.value_counts().plot(kind='bar', color = my_colors, ax=axes[1])\n", "train_df.relevancy.value_counts().plot(kind='bar', color = (0.100, 0.75, 0.20), ax=axes[0])\n", "train_df.relevancy.value_counts().plot(kind='bar', color = my_colors, ax=axes[0])\n", "axes[1].legend(['Non_Relevant', 'Relevant'])\n", "axes[0].legend(['Non_Relevant', 'Relevant'])\n", "axes[1].title.set_text('figure 2a: Test Corpus.')\n", "axes[0].title.set_text('figure 2b: Training Corpus.')\n", "print(f\"Der Trainingskorpus enthält {len(train_df)} Artikel, {train_df.relevancy.value_counts()[3]} sind relevant und {train_df.relevancy.value_counts()[0]} irrelevant.\")\n", "print(f\"Der Testkoprus enthält {len(test_df)} Artikel, {test_df.relevancy.value_counts()[3]} sind relevant und {test_df.relevancy.value_counts()[0]} irrelevant.\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Klassifizierung in relevante und irrelevante Artikel " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Um Wörter und ähnliche Ausdrücke zu gruppieren, die relevante oder irrelevante Dokumente am besten charakterisieren, und um jeden Artikel mit Informationen über seine Themenverteilung zu versehen, wird ein Topic Modeling Altgorithmus (LDA) trainiert. Die Jensen-Shannon Distanz hingegen wird verwendet, um die Ähnlichkeit zwischen der Themenverteilung der Dokumente zu messen. Die Kombination von LDA und JSD hat Vorteile gegenüber Textklassifikatoren, da auch für hochkomplexe Sammlung gute Ergebnisse erziehlt werden können. Die Grenzen zwischen relevanten und nicht relevanten Artikeln ist bei vielen mehrdeutigen Begriffen sehr wage. Mit LDA in Kombination mit JSD kann diese Komplexität bewältigt werden, indem der Dateninput für die endgültige Klassifizierung eines ungesehenen Artikels auf die 10 ähnlichsten Artikel eingrenzen. Dies bedeutet, dass nur die ähnlichsten Artikel als Grundlage für die Klassifizierung eines ungesehenen Artikels herangezogen werden." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Trainieren des Topic Modeling Algorithmus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic Modelle beruhen auf der Annahme, dass Texten in natürlicher Sprache eine relativ kleine Menge latenter oder verborgener Themen zugrunde liegt, wobei ein Wort zu mehreren Themen gehören kann. Themenmodelle verwenden die so genannte Bag-of-Words-Annahme innerhalb eines Dokuments bzw. sie gruppieren statistisch signifikante Wörter innerhalb eines bestimmten Korpus. 
Wie von (Blei, Ng und Jordan 2003) beschrieben, können Dokumente \"als zufällige Mischungen über latente Themen dargestellt werden, wobei jedes Thema durch eine Verteilung über Wörter charakterisiert ist\". Die Themenmodellierung wird für verschiedene Zwecke eingesetzt: um einen Koprpus besser zu verstehen (Zosa et al. 2020), um Diskursdynamik zu erfassen (Marjanen et al. 2020), um einen besseren Einblick in die Art oder das Genre von Dokumenten in einem Korpus zu erhalten (Oberbichler 2021), um die Entwicklung von Themen und Trends in mehrsprachigen Sammlungen zu erfassen (Zosa und Granroth-Wilding 2019) oder um verschiedene Korpora zu vergleichen (Lu, Henchion und Namee 2019).\n", "\n", "\n", "Jeder dieser Anwendungsbereiche benötigt unterschiedliche Parameter. Während Methoden zur automatischen Bestimmung der Themenanzahl in einigen Fällen hilfreich sein können (Zhao et al. 2015) (O'Callaghan et al. 2015), ist das genaue Lesen einer beträchtlichen Menge von Dokumenten nach wie vor am zuverlässigsten." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir verwenden hier die Python-Library *Gensim*, um die Topic Modelle zu trainieren. Gensim ist eine Bibliothek, die für die Themenmodellierung, die Dokumentenindexierung als auch die Ähnlichkeitssuche verwendet wird. Alle Algorithmen sind speicher- und sprachunabhängig sowie unüberwacht, was bedeutet, dass keine menschliche Eingabe erforderlich ist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Die Gensim Library beinhaltet diverse Parameter, die je nach Anwendung abgestimmt werden müssen. Passen Sie die Parameter auf folgende Werte an: \n", "\n", "- Trainieren Sie 250 topics\n", "- Lassen Sie das Modell 5 mal \"passieren\", was bedeutet, dass das Modell fünf mal trainiert wird (=passes)\n", "- Aktualisieren Sie die Wahrscheinlichkeitsberechnung, dass ein Wort zu einem bestimmten Topic gehört, 200 Mal (=iterations)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "from gensim.models import LdaModel\n", "from gensim import parsing, corpora, matutils, interfaces, models, utils\n", "import gensim.corpora as corpora\n", "import gensim, spacy, logging, warnings\n", "from gensim.models import CoherenceModel\n", "from matplotlib import pyplot as plt\n", "from wordcloud import WordCloud\n", "\n", "#function to train the topic model\n", "def train_lda(data):\n", " num_topics = \n", " dictionary = corpora.Dictionary(data['tokenized'])\n", " corpus = [dictionary.doc2bow(doc) for doc in data['tokenized']]\n", " lda = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary,\n", " alpha=0.2e-2, eta=1e-2, minimum_probability=0.0, passes=, iterations=, update_every=1, random_state =1)\n", " return dictionary,corpus,lda\n", "\n", "#apply function and check results\n", "dictionary,corpus,lda = train_lda(train_df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Sehen Sie sich neben dem Topic 27 noch weitere Topics an. Ändern Sie dafür die Zahl 27 in der nachfolgenden Zeile. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "figure-3" ] }, "outputs": [], "source": [ "# plot word cloud\n", "plt.figure(dpi=100)\n", "plt.imshow(WordCloud(max_words=30).fit_words(dict(lda.show_topic(27, 200))))\n", "plt.axis(\"off\")\n", "plt.title(\"Abbildung 3: Topic Nummer 27.\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Dominante Topics ermitteln" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nach dem Training des Modells sollte geprüft werden, wie gut diese Themen in relevante und nicht relevante Artikel unterteilt werden konnten. Nicht alle Themen haben die gleiche Dominanz innerhalb eines Textes. Um zu verstehen, welches Thema in einem Dokument am dominantesten ist, werden die Themen mit der höchsten Gewichtung für ein Dokument ermittel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def format_topics_texts(ldamodel=None, corpus=corpus, relevancy=df['relevancy']):\n", " sent_topics_df = pd.DataFrame()\n", "\n", " #get main topic in each document\n", " for i, row_list in enumerate(ldamodel[corpus]):\n", " row = row_list[0] if ldamodel.per_word_topics else row_list \n", " row = sorted(row, key=lambda x: (x[1]), reverse=True)\n", " for s, (topic_num, prop_topic) in enumerate(row):\n", " if s == 0: # => dominant topic\n", " wp = ldamodel.show_topic(topic_num)\n", " topic_keywords = \", \".join([word for word, prop in wp])\n", " sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)\n", " else:\n", " break\n", " sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']\n", "\n", " #add relevancy to the end of the output\n", " contents = pd.Series(relevancy)\n", " \n", " sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)\n", " return(sent_topics_df)\n", "\n", "\n", "df_topic_sents_keywords = format_topics_texts(ldamodel=lda, corpus=corpus, relevancy=df['relevancy'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#format\n", "df_dominant_topic = df_topic_sents_keywords.reset_index()\n", "df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Relevancy']\n", "caption_content= 'Tabelle 2: Dominante Topics eines jeden Artikels, topic Keyörter und die Relevanz der Artikel.'\n", "display(df_dominant_topic[23:28].style.set_table_attributes(\"style='display:block'\").set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wie in [Tabelle 3](#table-3) zu sehen ist, stellt das Topic 27 ([Abbildung 3](#figure-3)) eindeutig diesen Artikel dar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Sehen Sie sich auch jenen Text an, im dem Topic 1 dominiert. Dafür müssen Sie nicht nach Artikel 23, sondern nach Artikel 26 (wie aus Tabelle 2 hervorgeht) suchen. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "table-3" ] }, "outputs": [], "source": [ "article_df_1 = train_df['text'][23]\n", "article_df_1 = pd.DataFrame(np.column_stack([article_df_1]), \n", " columns=['Der Text, der überwiegend von Topic 27 repräsentiert wird'])\n", "article_df_1 = article_df_1.apply(lambda x: x[:600])\n", "caption_content = \"Tabelle 3: Der Text, der überwiegend von Topic 27 repräsentiert wird\"\n", "display(article_df_1.style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Visualisierung der Verknüpfung von dominaten Themen und Relevanzleveln" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Um zu sehen, wie gut die dominanten Themen auf relevante (3) und nicht relevante (0) Artikel verteilt sind, erstellen wir eine Netzwerkvisualisierung mit dem Python-Paket *NetworkX*. NetworkX wird hauptsächlich für die Erstellung, Manipulation und Untersuchung der Struktur, Dynamik und Funktionen komplexer Netzwerke verwendet. Anhand dieser Visualisierung lässt sich erkennen, wie effektiv das Modell trainiert wurde. Für das Netzwerk werden das dominanteste Thema sowie das Relevanzlabel für jeden Zeitungsausschnitt miteinander in Verbindung gebracht." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "figure-4" ] }, "outputs": [], "source": [ "import networkx as nx\n", "import seaborn as sns\n", "import sys\n", "\n", "#create a list with topics and the relevancy\n", "df_dominant_topic.to_csv('topic_relevancy.csv')\n", "import csv\n", "with open('topic_relevancy.csv', encoding=\"utf8\") as infile:\n", " reader = csv.reader(infile) \n", " csv_data = list(reader)\n", "df_dominant_topics= pd.read_csv('topic_relevancy.csv', usecols = ['Dominant_Topic', 'Relevancy'])\n", "list_topic = []\n", "for key in csv_data: \n", " list_topic.append(key[2])\n", "topic = list_topic[1:]\n", "\n", "list_relevancy = []\n", "for key in csv_data:\n", " list_relevancy.append(key[5])\n", "relevance = list_relevancy[1:] \n", "\n", "\n", "#build a dataframe with 4 connections\n", "df = pd.DataFrame({ 'from': relevance, 'to': topic})\n", "\n", "#build the graph \n", "plt.figure(figsize=(6.5,6.5))\n", "G = nx.from_pandas_edgelist(df, 'from', 'to')\n", "color_map = []\n", "for node in G:\n", " if node == \"3\":\n", " color_map.append('#b85399')\n", " if node == \"0\":\n", " color_map.append('#b85399')\n", " else: \n", " color_map.append('#2d95b3') \n", "\n", "\n", " \n", "#plot it \n", "nx.draw(G, with_labels=True, node_color=color_map, node_size=500)\n", "plt.legend(('Relevancy labels (0= irrelevant, 3= relevant)', 'Relationship between entities'),\n", " loc='upper left')\n", "plt.title('Abbildung 4: Dieses Diagramm zeigt, wie gut die dominierenden Themen zwischen relevanten und irrelevanten Zeitungsausschnitten getrennt sind')\n", "plt.show()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Können Sie Topic 27 finden? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Semantisch ähnliche Artikel finden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Im nächsten Schritt wird die Jensen-Shannon-Distanz (JS) angewendet, um die Ähnlichkeit zwischen der Themenverteilung der Dokumente aus dem Trainingskorpus und jenen aus dem Testkorpus zu messen. Es wird davon ausgegangen, dass Artikel mit ähnlichen Inhalten auch als solche erkannt werden. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import entropy\n", "\n", "#JS distance functions\n", "def jensen_shannon(query, matrix):\n", " p = query[None,:].T \n", " q = matrix.T \n", " m = 0.5*(p + q)\n", " return np.sqrt(0.5*(entropy(p,m) + entropy(q,m)))\n", "def get_most_similar_documents(query,matrix,k=10):\n", " sims = jensen_shannon(query,matrix) \n", " return sims.argsort()[:k]\n", "\n", "#most similar articles\n", "bow = dictionary.doc2bow(test_df.iloc[11,2])\n", "doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=bow)])\n", "doc_topic_dist = np.stack([np.array([tup[1] for tup in lst]) for lst in lda[corpus]])\n", "doc_topic_dist.shape\n", "sim_ids = get_most_similar_documents(doc_distribution,doc_topic_dist)\n", "similar_df = train_df[train_df.index.isin(sim_ids)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mit der Hilfe der JS-Distanz wird die Themenverteilung jedes neuen Artikels (aus dem Testkorpus) mit der Themenverteilung jedes Artikels im Trainingkorpus verglichen. Auf diese Weise werden die 10 ähnlichsten Artikel aus dem Trainingskorpus abgerufen. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sehen wir uns ein Beispiel aus dem Testkorpus an:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "figure-5" ] }, "outputs": [], "source": [ "from IPython.display import Image, display\n", "display(Image(\"Bilder/artikel.png\"), 'Abbildung 5: Neue Freie Presse, 02.07.1871')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Tabelle 4](#table-4) zeigt die zehn ähnlichsten Artikel für den Zeitungsartikel aus der \"Neue Freie Presse\" vom 02.07.1871. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Lesen Sie den Artikel aus Abbildung 5. Dann sehen Sie sich die Ausschnitte der autmatisch gefundenen 10 ähnlichsten Artikel an (dafür müssen Sie den Code der nächsten Zeile ausführen). Finden Sie Ähnlichkeiten? " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "table-4" ] }, "outputs": [], "source": [ "similar_df = similar_df.drop(['tokenized'], axis =1)\n", "similar_df['text'] = similar_df['text'].apply(lambda x: x[:300])\n", "similar_df.rename(columns={'text': 'Die zehn ähnlichsten Artikel aus dem Trainingskorpus für den ungesehenen Artikel aus Abbildung 5'}, inplace=True)\n", "caption_content= 'Tabelle 4: Die zehn ähnlichsten Artikel aus dem Trainingskorpus für den ungesehenen Artikel aus Abbildung 5'\n", "display(similar_df.style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Nun wählen Sie selbst einen bisher ungesehenen Artikel aus dem Testkorpus aus, indem Sie eine Zahl zwischen 1 und 50 überall dort einfügen, wo derzeit zwei Punkte vorhanden sind. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(test_df['text'][..])\n", "\n", "#most similar articles\n", "bow = dictionary.doc2bow(test_df.iloc[.., 2])\n", "doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=bow)])\n", "doc_topic_dist = np.stack([np.array([tup[1] for tup in lst]) for lst in lda[corpus]])\n", "doc_topic_dist.shape\n", "sim_ids = get_most_similar_documents(doc_distribution,doc_topic_dist)\n", "similar_df = train_df[train_df.index.isin(sim_ids)]\n", "similar_df = similar_df.drop(['tokenized'], axis =1)\n", "similar_df['text'] = similar_df['text'].apply(lambda x: x[:300])\n", "similar_df.rename(columns={'text': 'Die zehn ähnlichsten Artikel aus dem Trainingskorpus für den ungesehenen Artikel'}, inplace=True)\n", "caption_content= 'Tabelle 5: Die zehn ähnlichsten Artikel aus dem Trainingskorpus für den ungesehenen Artikel'\n", "display(similar_df.style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Klassifikation in relevant und irrelevant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nun durchläuft jeder Artikel aus dem Testkorpus eine Schleife, in der er mit allen Artikeln aus dem Trainingskoprus verglichen wird. Wenn 60 Prozent der ähnlichsten Artikel ursprünglich als relevant eingestuft wurden, wird auch der neue Artikel als relevant eingestuft. Andernfalls wird er als irrelevant eingestuft. In diesem Fall wurde der ungesehene Artikel aus [Abbildung 5](#figure-5) als relevant eingestuft (erste Spalte, dritte Zeile), weil mehr als 60 Prozent der ähnlichsten Artikel ([Tabelle 6](#table-6)) ursprünglich als relevant annotiert wurden (= 3)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Dieser Code bräucht etwas länger, bis er ausgeführt wird. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "table-6" ] }, "outputs": [], "source": [ "#lists for the output\n", "\n", "text_relevant = []\n", "number_relevant = []\n", "text_non_relevant = []\n", "number_non_relevant = []\n", "\n", "index = 0\n", "#loop for the classification of each article\n", "while index < len(test_df) -1:\n", " index +=1\n", " new_bow = dictionary.doc2bow(test_df.iloc[index,2])\n", " new_doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=new_bow)])\n", " doc_topic_dist = np.stack([np.array([tup[1] for tup in lst]) for lst in lda[corpus]])\n", " doc_topic_dist.shape\n", " most_sim_ids = get_most_similar_documents(new_doc_distribution,doc_topic_dist)\n", " most_similar_df = train_df[train_df.index.isin(most_sim_ids)]\n", " relevant = []\n", " if sum(most_similar_df['relevancy']) > 17: \n", " text_relevant.append(test_df.iloc[index,1])\n", " number_relevant.append(test_df.iloc[index,0])\n", " else:\n", " text_non_relevant.append(test_df.iloc[index,1])\n", " number_non_relevant.append(test_df.iloc[index,0])\n", "\n", "#dataframe with results\n", "\n", "df_relevant = pd.DataFrame(np.column_stack([text_relevant, number_relevant]), \n", " columns=['Relevant_Text', 'Real_Relevancy'])\n", "\n", "df_non_relevant = pd.DataFrame(np.column_stack([text_non_relevant, number_non_relevant]), \n", " columns=['Unrelevant_Text', 'Real_Relevancy'])\n", "\n", "df_results = pd.concat([df_relevant,df_non_relevant], ignore_index=True, axis=1)\n", "df_results.columns=['Diese Texte wurden als relevant klassifiziert', '3','Diese Texte wurden als irrelevant klassifiziert', '0']\n", "\n", "\n", "df_results['Diese Texte wurden als relevant klassifiziert'][0:20] = df_results['Diese Texte wurden als relevant klassifiziert'][0:20].apply(lambda x: x[:50])\n", "df_results['Diese Texte wurden als irrelevant klassifiziert'][0:20] = df_results['Diese Texte wurden als irrelevant klassifiziert'][0:20].apply(lambda x: x[:50])\n", "caption_content= 'Tabelle 6: Die Artikel werden automatisch in relevante und nicht relevante Artikel gruppiert. 
Die manuell vergebenen Annotierungen zeigen, ob die Artikel korrekt klassifiziert wurden.'\n", "display(df_results[0:20].style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "table-7" ] }, "outputs": [], "source": [ "#calculation of correct or incorrect classified results\n", "rev_3 = []\n", "for key in df_results['3']:\n", " if key == '3':\n", " rev_3.append(key)\n", "rev_0 = []\n", "for key in df_results['3']:\n", " if key == '0':\n", " rev_0.append(key)\n", "non_rev_3 = []\n", "for key in df_results['0']:\n", " if key == '3':\n", " non_rev_3.append(key)\n", "non_rev_0 = []\n", "for key in df_results['0']:\n", " if key == '0':\n", " non_rev_0.append(key)\n", "result_right = len(non_rev_0) + len(rev_3)\n", "result_wrong = len(non_rev_3) + len(rev_0)\n", "relevant = len(rev_3) / (len(rev_0) + len(rev_3))\n", "irrelevant = len(non_rev_0) / (len(non_rev_0) + len(non_rev_3))\n", "all_ = len(non_rev_3) + len(rev_0) + len(non_rev_0) + len(rev_3)\n", "score = result_right / all_\n", "\n", "\n", "df_score = pd.DataFrame(np.column_stack([relevant, irrelevant, score]), \n", " columns=['Correctly classified relevant articles', 'Correctly classified irrelevant articles', 'Total score'])\n", "caption_content= 'Tabelle 7: Evaluierung der Klassifikation der Artikel aus dem Testkorpus.'\n", "display(df_score.style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Diese Ergebnisse werden als gut genug für die weitere Verarbeitung angesehen. Der letzte Schritt betrifft den gesamten Korpus, der disambiguiert werden soll." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anwendung des Modells auf den gesamten Korpus (400 Zeitungsausschnitte) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "table-8" ] }, "outputs": [], "source": [ "df_all = pd.read_csv('export_krebs_all_25_05_2020_20_00.csv', usecols = ['id','language','date','newspaper_id','iiif_url','text'])\n", "print('Tabelle 8: Gesamter Korpus')\n", "df_all.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_all['tokenized'] = df_all['text'].apply(apply_all) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Anzahl der Wörter und Texte in der Sammlung" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first get a list of all words\n", "all_words = [word for item in list(df_all['tokenized']) for word in item]\n", "# use nltk fdist to get a frequency distribution of all words\n", "fdist = FreqDist(all_words)\n", "f\"Die Anzahl unterschiedlicher Wörter ist: {len(fdist)}\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#document length\n", "df_all['doc_len'] = df_all['tokenized'].apply(lambda x: len(x))\n", "doc_lengths = list(df_all['doc_len'])\n", "df_all.drop(labels='doc_len', axis=1, inplace=True)\n", "\n", "print(f\"Die Sammlung enthält {len(doc_lengths)} Artikel\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Sehr kurze Artikel (unter 30 Token) werden ausgeschlossen" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_all = df_all[df_all['tokenized'].map(len) >= 30]\n", "df_all = df_all[df_all['tokenized'].map(type) == list]\n", "df_all.reset_index(drop=True,inplace=True)\n", "print(\"Die Sammlung enthält nun\", len(df_all), \"Artikel\")"
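] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ein Hinweis als Skizze (Annahme: `dictionary` und `lda` stammen aus dem obigen Training): `get_document_topics` lässt standardmäßig Topics mit sehr kleiner Wahrscheinlichkeit weg; mit `minimum_probability=0` liefert Gensim dagegen für jedes Dokument einen Vektor über alle Topics, was Längenunterschiede bei der Distanzberechnung vermeidet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#sketch: full-length topic distribution of a single document from the complete corpus\n", "beispiel_bow = dictionary.doc2bow(df_all.iloc[1, 6])\n", "volle_verteilung = np.array([p for _, p in lda.get_document_topics(beispiel_bow, minimum_probability=0)])\n", "print(volle_verteilung.shape)"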
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def jensen_shannon(query, matrix):\n", " p = query[None,:].T \n", " q = matrix.T \n", " m = 0.5*(p + q)\n", " return np.sqrt(0.5*(entropy(p,m) + entropy(q,m)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_most_similar_documents(query,matrix,k=10):\n", " sims = jensen_shannon(query,matrix) # list of jensen shannon distances\n", " return sims.argsort()[:k] # the top k positional index of the smallest Jensen Shannon distances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Klassifikation der Artikel \n", "\n", "Die Klassifikation dauert, wenn im Browser ausgeführt, etwas länger. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create lists for your output\n", "text_relevant = []\n", "number_relevant = []\n", "date_relevant = []\n", "text_non_relevant = []\n", "number_non_relevant = []\n", "language_relevant = []\n", "newspaper_id_relevant = []\n", "iiif_url_relevant = []\n", "id_relevant = []\n", "\n", "#find most similar articles and select between relevant and non-relevant\n", "\n", "index = 0\n", "while index < len(df_all) -1:\n", " index +=1\n", " new_bow = dictionary.doc2bow(df_all.iloc[index,6])\n", " new_doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=new_bow)])\n", " doc_topic_dist = np.stack([np.array([tup[1] for tup in lst]) for lst in lda[corpus]])\n", " doc_topic_dist.shape\n", " most_sim_ids = get_most_similar_documents(new_doc_distribution,doc_topic_dist)\n", " most_similar_df = train_df[train_df.index.isin(most_sim_ids)]\n", " # Calculate \n", " if sum(most_similar_df['relevancy']) > 17: \n", " text_relevant.append(df_all.iloc[index,5])\n", " date_relevant.append(df_all.iloc[index,2])\n", " language_relevant.append(df_all.iloc[index,1])\n", " newspaper_id_relevant.append(df_all.iloc[index,3])\n", " iiif_url_relevant.append(df_all.iloc[index,4])\n", " id_relevant.append(df_all.iloc[index,0])\n", " \n", " else:\n", " text_non_relevant.append(df_all.iloc[index,5])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#transform your lists into a dataframe\n", "df_relevant = pd.DataFrame(np.column_stack([text_relevant]), \n", " columns=['Relevanter_Text'])\n", "\n", "df_non_relevant = pd.DataFrame(np.column_stack([text_non_relevant]), \n", " columns=['Unrelevanter_Text'])\n", "\n", "df_results = pd.concat([df_relevant,df_non_relevant], ignore_index=True, axis=1)\n", "df_results.columns=['Diese Texte wurden als relevant klassifiziert', 'Diese Texte wurden als irrelevant klassifiziert']\n", "\n", "\n", "df_results['Diese Texte wurden als relevant klassifiziert'][0:5] = df_results['Diese Texte wurden als relevant klassifiziert'][0:5].apply(lambda x: x[:400])\n", "df_results['Diese Texte wurden als irrelevant klassifiziert'][0:5] = df_results['Diese Texte wurden als irrelevant klassifiziert'][0:5].apply(lambda x: x[:400])\n", "caption_content= 'Tabelle 5: Die Artikel werden automatisch in relevante und nicht relevante Artikel gruppiert. 
'\n", "display(df_results[0:5].style.set_caption(caption_content).hide_index())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_final = pd.DataFrame(np.column_stack([id_relevant, language_relevant, newspaper_id_relevant, date_relevant, iiif_url_relevant, text_relevant]), \n", " columns=['id', 'language', 'date', 'newspaper_id', 'iiif_url', 'text'])\n", "\n", "\n", "df_new = pd.concat([df_final], ignore_index=True, axis=1)\n", "df_new.columns=['id','language', 'date', 'newspaper_id', 'iiif_url', 'text']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Exportieren der relevanten Artikel in Form des Originalfiles (CSV und Excel)\n", "\n", " Schauen Sie sich das Excel-File \"Collection_relevant\" an, das automatisch in Ihrem Ordner generiert wurde. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openpyxl import Workbook\n", "df_new.to_csv('Collection_relevant.csv')\n", "df_new.to_excel('Collection_relevant.xlsx')" ] } ], "metadata": { "celltoolbar": "Tags", "cite2c": { "citations": { "6142573/2FLWKIR9": { "ISBN": "978-3-540-00532-2 978-3-540-36456-6", "URL": "http://link.springer.com/10.1007/3-540-36456-0_24", "accessed": { "day": 29, "month": 4, "year": 2021 }, "author": [ { "family": "Patwardhan", "given": "Siddharth" }, { "family": "Banerjee", "given": "Satanjeev" }, { "family": "Pedersen", "given": "Ted" } ], "collection-editor": [ { "family": "Goos", "given": "Gerhard" }, { "family": "Hartmanis", "given": "Juris" }, { "family": "van Leeuwen", "given": "Jan" } ], "container-title": "Computational Linguistics and Intelligent Text Processing", "editor": [ { "family": "Gelbukh", "given": "Alexander" } ], "event-place": "Berlin, Heidelberg", "id": "6142573/2FLWKIR9", "issued": { "year": 2003 }, "note": "Series Title: Lecture Notes in Computer Science\nDOI: 10.1007/3-540-36456-0_24", "page": "241-257", "page-first": "241", "publisher": "Springer", "publisher-place": "Berlin, Heidelberg", "title": "Using Measures of Semantic Relatedness for Word Sense Disambiguation", "type": "chapter", "volume": "2588" }, "6142573/3MAA8Z5M": { "DOI": "10.1016/j.eswa.2015.02.055", "URL": "https://linkinghub.elsevier.com/retrieve/pii/S0957417415001633", "accessed": { "day": 27, "month": 4, "year": 2021 }, "author": [ { "family": "O’Callaghan", "given": "Derek" }, { "family": "Greene", "given": "Derek" }, { "family": "Carthy", "given": "Joe" }, { "family": "Cunningham", "given": "Pádraig" } ], "container-title": "Expert Systems with Applications", "container-title-short": "Expert Systems with Applications", "id": "6142573/3MAA8Z5M", "issue": "13", "issued": { "year": 2015 }, "journalAbbreviation": "Expert Systems with Applications", "language": "en", "page": "5645-5657", "page-first": "5645", "title": "An analysis of the coherence of descriptors in topic modeling", "type": "article-journal", "volume": "42" }, "6142573/3YMW54I3": { "ISBN": "978-84-9773-529-2", "URL": "https://wlv.openrepository.com/handle/2436/622560", "abstract": "We should always bear in mind that the assumption of representativeness ‘must be regarded largely as an act of faith’ (Leech 1991: 2), as at present we have no means of ensuring it, or even evaluating it objectively. (Tognini-Bonelli 2001: \n 57) Corpus Linguistics (CL) has not yet come of age. 
It does not make any difference whether we consider it a full-fledged linguistic discipline (Tognini-Bonelli 2000: 1) or, else, a set of analytical techniques that can be applied to any discipline (McEnery et al. 2006: 7). The truth is that CL is still striving to solve thorny, central issues such as optimum size, balance and representativeness of corpora (of the language as a whole or of some subset of the language). \nCorpus-driven/based studies rely on the quality and representativeness of each corpus as their true foundation for producing valid results. This entails deciding on valid external and internal criteria for corpus design and compilation. A basic tenet is that corpus representativeness determines the kinds of research questions that can be addressed and the generalizability of the results obtained (cf. Biber et al. 1988: 246). Unfortunately, faith and beliefs do not seem to ensure quality. \nIn this paper we will attempt to deal with these key questions. Firstly, we will give a brief description of the R&D projects which originally have served as the main framework for this research. Secondly, we will focus on the complex notion of corpus representativeness and ideal size, from both a theoretical and an applied perspective. Finally, we will describe a computer application which has been developed as part of the research. This software will be used to verify whether a sample bilingual comparable corpus could be deemed representative.", "accessed": { "day": 3, "month": 3, "year": 2021 }, "author": [ { "family": "Corpas Pastor", "given": "Gloria" }, { "family": "Seghiri Domínguez", "given": "Míriam" } ], "container-title": "Lengua, traducción, recepción en honor de Julio César Santoyo. León: Universidad de León Área de Publicaciones", "editor": [ { "family": "Rabadán", "given": "Rosa" }, { "family": "Fernández López", "given": "Marisa" }, { "family": "Guzmán González", "given": "Trinidad" } ], "id": "6142573/3YMW54I3", "issued": { "day": 1, "month": 6, "year": 2010 }, "page": "111-145", "page-first": "111", "publisher": "Publicaciones Universidad de León", "title": "Size Matters: A Quantitative Approach to Corpus Representativeness", "type": "chapter" }, "6142573/5FI5SV3F": { "author": [ { "family": "Pfanzelter", "given": "Eva" } ], "collection-number": "3157", "collection-title": "UTB", "container-title": "Digitale Arbeitstechniken für Geistes- und Kulturwissenschaften", "editor": [ { "family": "Gasteiner", "given": "Martin" }, { "family": "Haber", "given": "Peter" } ], "event-place": "Stuttgart", "id": "6142573/5FI5SV3F", "issued": { "year": 2010 }, "page": "39-50", "page-first": "39", "publisher-place": "Stuttgart", "title": "Von der Quellenkritik zum kritischen Umgang mit digitalen Ressourcen", "type": "chapter" }, "6142573/5LNAAFQE": { "DOI": "10.5937/SPSUNP1701001S", "URL": "http://scindeks.ceon.rs/Article.aspx?artid=2217-55391701001S", "abstract": "Probabilistic topic modeling is a text mining technique that allows to extract sets of term probability distributions which can intuitively be interpreted as latent topics. The extraction in most techniques uses only document term frequency matrices as input data. Moreover, topic models estimate posterior document-topic distributions useful for intelligent document retrieval query processing. This paper discusses two approaches to topic modeling involving Dirichlet distributions and Dirichlet processes. 
However, these and related approaches presume suitable text preprocessing in order to keep parameter spaces for estimations from training text corpora at manageable sizes. In the present paper, we discuss the influence of morphological preprocessing of training texts. Morphological analysis is a computer linguistic discipline that allows to decompose observed terms into base lemmata. This is effected by a deep analysis of the observed terms as opposed to straightforward prefix or postfix elimination used in conventional stemming algorithms. Morphological preprocessing is especially effective in inflection rich languages like, e.g. Finnish or German, and effectively reduces the training vocabulary size. In addition, morphological preprocessing allows for decomposing compound words. It is of considerable interest to study the influence of morphological preprocessing on text mining and statistical topic models. In experiments reported in the application section of this paper, significant changes of the frequency structure of document term matrices were found. Interestingly, these changes also led to substantial improvements in model quality indicators of topic models due to morphological preprocessing. Steps for further research are suggested in the concluding section.", "accessed": { "day": 30, "month": 8, "year": 2021 }, "author": [ { "family": "Spies", "given": "Marcus" } ], "container-title": "Scientific Publications of the State University of Novi Pazar Series A: Applied Mathematics, Informatics and mechanics", "id": "6142573/5LNAAFQE", "issue": "1", "issued": { "year": 2017 }, "note": "PMID: 2217-55391701001S", "page": "1-18", "page-first": "1", "title": "Topic modelling with morphologically analyzed vocabularies", "type": "article-journal", "volume": "9" }, "6142573/5ZGM8EKM": { "URL": "https://researchportal.helsinki.fi/en/publications/disappearing-discourses-avoiding-anachronisms-and-teleology-with-", "accessed": { "day": 11, "month": 1, "year": 2021 }, "author": [ { "family": "Zosa", "given": "Elaine" }, { "family": "Hengchen", "given": "Simon" }, { "family": "Marjanen", "given": "Jani" }, { "family": "Pivovarova", "given": "Lidia" }, { "family": "Tolonen", "given": "Mikko" } ], "container-title": "Digital Humanities in the Nordic Countries DHN 2020", "event": "Digital Humanities in the Nordic Countries DHN 2020", "event-place": "Riga", "id": "6142573/5ZGM8EKM", "issued": { "year": 2020 }, "language": "English", "publisher": "Riga", "shortTitle": "Disappearing Discourses", "title": "Disappearing Discourses: Avoiding Anachronisms and Teleology with Data-Driven Methods in Studying Digital Newspaper Collections", "title-short": "Disappearing Discourses", "type": "paper-conference" }, "6142573/6IFE7WQI": { "ISBN": "978-1-57181-522-4 978-1-84545-522-4", "author": [ { "family": "Waine", "given": "Anthony Edward" } ], "call-number": "PT111 .W35 2007", "event-place": "New York", "id": "6142573/6IFE7WQI", "issued": { "year": 2007 }, "note": "OCLC: ocn156834470", "number-of-pages": "184", "publisher": "Berghahn Books", "publisher-place": "New York", "shortTitle": "Changing cultural tastes", "title": "Changing cultural tastes: writers and the popular in modern Germany", "title-short": "Changing cultural tastes", "type": "book" }, "6142573/74BYZPVE": { "ISBN": "0-8014-2875-0", "author": [ { "family": "Wyman", "given": "Mark" } ], "event-place": "Ithaca, N.Y.", "id": "6142573/74BYZPVE", "issued": { "year": 1993 }, "language": "English", "number-of-pages": "267", "publisher": "Cornell University 
Press", "publisher-place": "Ithaca, N.Y.", "title": "Round-trip to America: the immigrants return to Europe, 1880-1930", "type": "book" }, "6142573/9C4MU2UT": { "URL": "https://aclanthology.org/U15-1013", "accessed": { "day": 30, "month": 8, "year": 2021 }, "author": [ { "family": "Martin", "given": "Fiona" }, { "family": "Johnson", "given": "Mark" } ], "container-title": "Proceedings of the Australasian Language Technology Association Workshop 2015", "event": "ALTA 2015", "event-place": "Parramatta, Australia", "id": "6142573/9C4MU2UT", "issued": { "month": 12, "year": 2015 }, "page": "111–115", "page-first": "111", "publisher-place": "Parramatta, Australia", "title": "More Efficient Topic Modelling Through a Noun Only Approach", "type": "paper-conference" }, "6142573/9DLPRG9V": { "ISBN": "978-3-938375-77-8", "call-number": "JV6217.5 .B55 2017", "collection-number": "Band 30", "collection-title": "Arco Wissenschaft", "editor": [ { "family": "Prager", "given": "Katharina" }, { "family": "Straub", "given": "Wolfgang" } ], "event-place": "Wuppertal", "id": "6142573/9DLPRG9V", "issued": { "year": 2017 }, "language": "ger eng", "note": "OCLC: ocn987202232", "number-of-pages": "388", "publisher": "Arco Verlag", "publisher-place": "Wuppertal", "shortTitle": "Bilderbuch-Heimkehr?", "title": "Bilderbuch-Heimkehr? Remigration im Kontext", "title-short": "Bilderbuch-Heimkehr?", "type": "book" }, "6142573/9KTJU78H": { "DOI": "10.13140/RG.2.2.31214.43846", "URL": "http://rgdoi.net/10.13140/RG.2.2.31214.43846", "accessed": { "day": 3, "month": 3, "year": 2021 }, "author": [ { "family": "Malone", "given": "Daniel" } ], "event": "Corpora and Discourse International Conference 2020", "id": "6142573/9KTJU78H", "issued": { "year": 2020 }, "language": "en", "publisher": "Unpublished", "shortTitle": "Developing a complex query to build a specialised corpus", "title": "Developing a complex query to build a specialised corpus: Reducing the issue of polysemous query terms.", "title-short": "Developing a complex query to build a specialised corpus", "type": "paper-conference" }, "6142573/AXWYSSW3": { "URL": "https://www.deutsche-biographie.de/sfz50193.html", "abstract": "Biografische Information zu Lenau, Nikolaus, Biografienachweise, Quellen, Quellennachweise, Literatur, Literaturnachweise Portrait, Porträtnachweise, Objekte, Objektnachweise, Verbindungen, Orte , Niembsch Edler von Strehlenau, Nikolaus (eigentlich); Niembsch, Nikolaus (eigentlich, bis 1820); Niembsch von Strehlenau, Nikolaus (eigentlich); Strehlenau, Nikolaus Niembsch Edler von; Lenau, Nikolaus; Niembsch Edler von Strehlenau, Nikolaus (eigentlich); niembsch edler von strehlenau, nikolaus; Niembsch, Nikolaus (eigentlich, bis 1820); niembsch, nikolaus; Niembsch von Strehlenau, Nikolaus (eigentlich); niembsch von strehlenau, nikolaus; Strehlenau, Nikolaus Niembsch Edler von; Lenau, H.; Lenau, Miklós; Lenau, N.; Lenau, Nicolaus; Lenau, Nikolus; Lenau,̆ Nikalaus̆; Lenaŭ, Nikalaŭs; Len̄au, Nîqôlaus; Lēnau, Nîqôlaus; Niembsch von Strehlenau, Nicolaus; Niembsch von Strehlenau, Nicolaus Franz; Niembsch von Strehlenau, Nikolaus F.; Niembsch von Strehlenau, Nikolaus Franz; Niembsch, Nicolaus; Niembsch, Nikolaus F.; Niembsch, Nikolaus Franz, Edler von Strehlenau; Strehlenau, Nicolaus Franz Niembsch von; Strehlenau, Nicolaus Niembsch von; Strehlenau, Nikolaus F. 
von; Strehlenau, Nikolaus Franz Niembsch von; Strehlenau, Nikolaus Franz von; Strehlenau, Nikolaus Niembsch von; Von Strehlenau, Nikolaus Franz Niembsch", "accessed": { "day": 4, "month": 5, "year": 2021 }, "container-title": "Deutsche Biographie", "id": "6142573/AXWYSSW3", "language": "de", "note": "Publisher: Bayerische Staatsbibliothek", "title": "Lenau, Nikolaus - Deutsche Biographie", "type": "webpage" }, "6142573/B353HSFG": { "DOI": "10.14765/ZZF.DOK.2.269.V1", "URL": "http://zeitgeschichte-digital.de/doks/269", "accessed": { "day": 9, "month": 7, "year": 2020 }, "author": [ { "family": "Haber", "given": "Peter" } ], "container-title": "Docupedia-Zeitgeschichte", "id": "6142573/B353HSFG", "issued": { "year": 2012 }, "language": "ger", "title": "Zeitgeschichte und Digital HumanitiesZeitgeschichte und Digital Humanities", "type": "article-journal" }, "6142573/B3YGSZTZ": { "DOI": "10.5121/ijctcm.2015.5301", "URL": "http://arxiv.org/abs/1508.01346", "abstract": "In this paper, we made a survey on Word Sense Disambiguation (WSD). Near about in all major languages around the world, research in WSD has been conducted upto different extents. In this paper, we have gone through a survey regarding the different approaches adopted in different research works, the State of the Art in the performance in this domain, recent works in different Indian languages and finally a survey in Bengali language. We have made a survey on different competitions in this field and the bench mark results, obtained from those competitions.", "accessed": { "day": 15, "month": 3, "year": 2021 }, "author": [ { "family": "Pal", "given": "Alok Ranjan" }, { "family": "Saha", "given": "Diganta" } ], "container-title": "arXiv:1508.01346 [cs]", "id": "6142573/B3YGSZTZ", "issued": { "day": 6, "month": 8, "year": 2015 }, "note": "arXiv: 1508.01346", "shortTitle": "Word sense disambiguation", "title": "Word sense disambiguation: a survey", "title-short": "Word sense disambiguation", "type": "article-journal" }, "6142573/BEHFC5EB": { "DOI": "10.1177/1558689816651015", "URL": "https://doi.org/10.1177/1558689816651015", "abstract": "This article demonstrates how a digital environment offers new opportunities for transforming qualitative data into quantitative data in order to use data mining and information visualization for mixed methods research. The digital approach to mixed methods research is illustrated by a framework which combines qualitative methods of multimodal discourse analysis with quantitative methods of data mining and information visualization in a multilevel, contextual model that will result in an integrated, theoretically well-founded, and empirically evaluated technology for analyzing large data sets of multimodal texts. The framework is applicable to situations in which critical information needs to be extracted from geotagged public data: for example, in crisis informatics, where public reports of extreme events provide valuable data sources for disaster management.", "accessed": { "day": 28, "month": 12, "year": 2020 }, "author": [ { "family": "O’Halloran", "given": "Kay L." 
}, { "family": "Tan", "given": "Sabine" }, { "family": "Pham", "given": "Duc-Son" }, { "family": "Bateman", "given": "John" }, { "family": "Vande Moere", "given": "Andrew" } ], "container-title": "Journal of Mixed Methods Research", "container-title-short": "Journal of Mixed Methods Research", "id": "6142573/BEHFC5EB", "issue": "1", "issued": { "day": 1, "month": 1, "year": 2018 }, "journalAbbreviation": "Journal of Mixed Methods Research", "page": "11-30", "page-first": "11", "shortTitle": "A Digital Mixed Methods Research Design", "title": "A Digital Mixed Methods Research Design: Integrating Multimodal Analysis With Data Mining and Information Visualization for Big Data Analytics", "title-short": "A Digital Mixed Methods Research Design", "type": "article-journal", "volume": "12" }, "6142573/BMJS79M7": { "DOI": "10.1080/19312458.2018.1430754", "URL": "https://doi.org/10.1080/19312458.2018.1430754", "abstract": "Latent Dirichlet allocation (LDA) topic models are increasingly being used in communication research. Yet, questions regarding reliability and validity of the approach have received little attention thus far. In applying LDA to textual data, researchers need to tackle at least four major challenges that affect these criteria: (a) appropriate pre-processing of the text collection; (b) adequate selection of model parameters, including the number of topics to be generated; (c) evaluation of the model’s reliability; and (d) the process of validly interpreting the resulting topics. We review the research literature dealing with these questions and propose a methodology that approaches these challenges. Our overall goal is to make LDA topic modeling more accessible to communication researchers and to ensure compliance with disciplinary standards. Consequently, we develop a brief hands-on user guide for applying LDA topic modeling. We demonstrate the value of our approach with empirical data from an ongoing research project.", "accessed": { "day": 22, "month": 4, "year": 2021 }, "author": [ { "family": "Maier", "given": "Daniel" }, { "family": "Waldherr", "given": "A." }, { "family": "Miltner", "given": "P." }, { "family": "Wiedemann", "given": "G." }, { "family": "Niekler", "given": "A." }, { "family": "Keinert", "given": "A." }, { "family": "Pfetsch", "given": "B." }, { "family": "Heyer", "given": "G." }, { "family": "Reber", "given": "U." }, { "family": "Häussler", "given": "T." }, { "family": "Schmid-Petri", "given": "H." }, { "family": "Adam", "given": "S." } ], "container-title": "Communication Methods and Measures", "id": "6142573/BMJS79M7", "issue": "2-3", "issued": { "day": 3, "month": 4, "year": 2018 }, "note": "Publisher: Routledge\n_eprint: https://doi.org/10.1080/19312458.2018.1430754", "page": "93-118", "page-first": "93", "shortTitle": "Applying LDA Topic Modeling in Communication Research", "title": "Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology", "title-short": "Applying LDA Topic Modeling in Communication Research", "type": "article-journal", "volume": "12" }, "6142573/BXVFU4ZK": { "DOI": "10.1080/02619288.2001.9975006", "abstract": "This article locates the phenomenon of return migration within a broader history of European and American experiences. It argues that by studying return migration we can perhaps better understand some apparent contradictions in the broader themes of migration history. 
For example, by considering the patterns of, and motivations for, return migration from America to Europe, we are able to probe why some countries were severely damaged both economically and socially by high emigration and why others were not. In considering such issues, this article draws heavily, but not exclusively, on the Scandinavian countries. The discussion that follows hopes to reinforce the notion, apparent from the history of the 1880–1930 era in Europe and America, that return migration was among those major influences that challenged and jarred traditional societies and produced the modern world we all inhabit.", "author": [ { "family": "Wyman", "given": "Mark" } ], "collection-title": "Historical Studies in Ethnicity, Migration and Diaspora", "container-title": "Immigrants & Minorities", "id": "6142573/BXVFU4ZK", "issue": "1", "issued": { "year": 2001 }, "language": "English", "page": "1–18", "page-first": "1", "title": "Return migration ‐ old story, new story", "type": "article-journal", "volume": "20" }, "6142573/CHE86H2F": { "DOI": "10.1186/1471-2105-16-S13-S8", "URL": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/", "abstract": "Background\nTopic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach.\n\nMethods and results\nBased on analysis of variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of numbers of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method for three markedly different types of grounded-truth datasets: Salmonella next generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed.\n\nConclusion\nThe proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with number of topics as a parameter. 
We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics.", "accessed": { "day": 23, "month": 4, "year": 2021 }, "author": [ { "family": "Zhao", "given": "Weizhong" }, { "family": "Chen", "given": "James J" }, { "family": "Perkins", "given": "Roger" }, { "family": "Liu", "given": "Zhichao" }, { "family": "Ge", "given": "Weigong" }, { "family": "Ding", "given": "Yijun" }, { "family": "Zou", "given": "Wen" } ], "container-title": "BMC Bioinformatics", "container-title-short": "BMC Bioinformatics", "id": "6142573/CHE86H2F", "issue": "Suppl 13", "issued": { "day": 25, "month": 9, "year": 2015 }, "journalAbbreviation": "BMC Bioinformatics", "note": "PMID: 26424364\nPMCID: PMC4597325", "page": "S8", "page-first": "S8", "title": "A heuristic approach to determine an appropriate number of topics in topic modeling", "type": "article-journal", "volume": "16" }, "6142573/CVSFNSE2": { "DOI": "10.1145/2133806.2133826", "URL": "https://doi.org/10.1145/2133806.2133826", "abstract": "Surveying a suite of algorithms that offer a solution to managing large document archives.", "accessed": { "day": 28, "month": 4, "year": 2021 }, "author": [ { "family": "Blei", "given": "David M." } ], "container-title": "Communications of the ACM", "container-title-short": "Commun. ACM", "id": "6142573/CVSFNSE2", "issue": "4", "issued": { "day": 1, "month": 4, "year": 2012 }, "journalAbbreviation": "Commun. ACM", "page": "77–84", "page-first": "77", "title": "Probabilistic topic models", "type": "article-journal", "volume": "55" }, "6142573/DHFC4A24": { "author": [ { "family": "Pfanzelter", "given": "Eva" } ], "container-title": "Zeitschrift für das Archivwesen der Wirtschaft", "id": "6142573/DHFC4A24", "issue": "1", "issued": { "year": 2015 }, "language": "de", "page": "5-19", "page-first": "5", "title": "Die historische Quellenkritik und das Digitale", "type": "article-journal", "volume": "48" }, "6142573/DXVIWVKG": { "ISBN": "978-1-4051-9473-0 978-1-4051-9843-1", "URL": "http://doi.wiley.com/10.1002/9781405198431.wbeal0974", "accessed": { "day": 4, "month": 5, "year": 2021 }, "author": [ { "family": "Hasko", "given": "Victoria" } ], "container-title": "The Encyclopedia of Applied Linguistics", "editor": [ { "family": "Chapelle", "given": "Carol A." } ], "event-place": "Oxford, UK", "id": "6142573/DXVIWVKG", "issued": { "day": 5, "month": 11, "year": 2012 }, "language": "en", "note": "DOI: 10.1002/9781405198431.wbeal0974", "page": "wbeal0974", "page-first": "wbeal0974", "publisher": "Blackwell Publishing Ltd", "publisher-place": "Oxford, UK", "title": "Qualitative Corpus Analysis", "type": "chapter" }, "6142573/DZ5DRETR": { "URL": "http://journals.openedition.org/cognitextes/1311", "abstract": "Twentieth-century structuralist and generative linguists argued that the study of the language system (langue, competence) must be separated from the study of language use (parole, performance). For Saussure or Chomsky, no generalizations about language could be made based on the observation of patterns, regularities and rules of language performance. For Saussure, “Il n’y a donc rien de collectif dans la parole ; les manifestations en sont individuelles et momentanées. Ici il n’y a rien de p...", "accessed": { "day": 26, "month": 4, "year": 2021 }, "author": [ { "family": "Raineri", "given": "Sophie" }, { "family": "Debras", "given": "Camille" } ], "container-title": "CogniTextes. 
Revue de l’Association française de linguistique cognitive", "id": "6142573/DZ5DRETR", "issue": "Volume 19", "issued": { "day": 17, "month": 6, "year": 2019 }, "language": "en", "note": "Number: Volume 19\nPublisher: Association française de linguistique cognitive (AFLiCo)", "shortTitle": "Corpora and Representativeness", "title": "Corpora and Representativeness: Where to go from now?", "title-short": "Corpora and Representativeness", "type": "article-journal", "volume": "19" }, "6142573/EAL2CICM": { "DOI": "10.1073/pnas.0307752101", "URL": "http://www.pnas.org/cgi/doi/10.1073/pnas.0307752101", "accessed": { "day": 31, "month": 8, "year": 2021 }, "author": [ { "family": "Griffiths", "given": "T. L." }, { "family": "Steyvers", "given": "M." } ], "container-title": "Proceedings of the National Academy of Sciences", "container-title-short": "Proceedings of the National Academy of Sciences", "id": "6142573/EAL2CICM", "issue": "Supplement 1", "issued": { "day": 6, "month": 4, "year": 2004 }, "journalAbbreviation": "Proceedings of the National Academy of Sciences", "language": "en", "page": "5228-5235", "page-first": "5228", "title": "Finding scientific topics", "type": "article-journal", "volume": "101" }, "6142573/F3B5EV7L": { "ISBN": "0-7190-7071-6", "abstract": "Emigrant homecomings addresses the significant but neglected issue of return migration to Britain and Europe since 1600. While emigration studies have become prominent in both scholarly and popular circles in recent years, return migration has remained comparatively under-researched, despite evidence that in the nineteenth and twentieth centuries between a quarter and a third of all emigrants from many parts of Britain and Europe ultimately returned to their countries of origin. Emigrant homecomings analyses the motives, experiences and impact of these returning migrants in a wide range of locations over four hundred years, as well as examining the mechanisms and technologies which enabled their return.\n\nThe book examines the multiple identities that migrants adopted and the huge range and complexity of homecomers' motives and experiences. 
It also dissects migrants' perception of 'home' and the social, economic, cultural and political change that their return engendered.", "collection-title": "Studies in Imperialism MUP", "editor": [ { "family": "Harper", "given": "Marjory" } ], "event-place": "Manchester", "id": "6142573/F3B5EV7L", "issued": { "year": 2012 }, "language": "Englisch", "number-of-pages": "288", "publisher": "Manchester University Press;", "publisher-place": "Manchester", "title": "Emigrant homecomings: The return movement of emigrants, 1600-2000", "type": "book" }, "6142573/FRE6XIFV": { "URL": "https://zenodo.org/record/3895269", "abstract": "This short paper introduces the NewsEye project.", "accessed": { "day": 11, "month": 1, "year": 2021 }, "author": [ { "family": "Doucet", "given": "Antoine" }, { "family": "Gasteiner", "given": "Martin" }, { "family": "Granroth-Wilding", "given": "Mark" }, { "family": "Kaiser", "given": "Max" }, { "family": "Kaukonen", "given": "Minna" }, { "family": "Labahn", "given": "Roger" }, { "family": "Moreux", "given": "Jean-Philippe" }, { "family": "Muehlberger", "given": "Günter" }, { "family": "Pfanzelter", "given": "Eva" }, { "family": "Therenty", "given": "Marie-Eve" }, { "family": "Toivonen", "given": "Hannu" }, { "family": "Tolonen", "given": "Mikko" } ], "container-title": "Book of Abstracts", "event": "DH2020", "event-place": "Ottawa", "id": "6142573/FRE6XIFV", "issued": { "day": 23, "month": 7, "year": 2020 }, "language": "en", "publisher-place": "Ottawa", "shortTitle": "NewsEye", "title": "NewsEye: A digital investigator for historical newspapers", "title-short": "NewsEye", "type": "paper-conference" }, "6142573/G3B3QXYX": { "DOI": "10.1145/3383583.3398627", "ISBN": "978-1-4503-7585-6", "URL": "https://doi.org/10.1145/3383583.3398627", "abstract": "The NewsEye project demonstrator is a proof of concept of a digital platform dedicated to historical newspapers, intended to show benefits for researchers and the general public. This platform presently hosts newspapers from partner libraries in four different languages (Finnish, Swedish, German and French) providing users with various analysis tools as well as allowing them to manage their research in an interactive way. 
The platform gives access to these enriched data sets, and additionally interfaces with analysis tools developed in the NewsEye project, letting users experiment with tools specifically developed for investigating historical newspapers.", "accessed": { "day": 19, "month": 11, "year": 2020 }, "author": [ { "family": "Jean-Caurant", "given": "Axel" }, { "family": "Doucet", "given": "Antoine" } ], "collection-title": "JCDL '20", "container-title": "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020", "event-place": "New York, NY, USA", "id": "6142573/G3B3QXYX", "issued": { "day": 1, "month": 8, "year": 2020 }, "page": "531–532", "page-first": "531", "publisher": "Association for Computing Machinery", "publisher-place": "New York, NY, USA", "title": "Accessing and Investigating Large Collections of Historical Newspapers with the NewsEye Platform", "type": "paper-conference" }, "6142573/GP9W87WP": { "author": [ { "family": "Olivier", "given": "Claudia" } ], "container-title": "Transnationales Wissen und Soziale Arbeit", "editor": [ { "family": "Bender", "given": "Désirée" }, { "family": "Duscha", "given": "Annemarie" }, { "family": "Huber", "given": "Lena" }, { "family": "Klein-Zimmer", "given": "Kathrin" } ], "id": "6142573/GP9W87WP", "issued": { "year": 2013 }, "language": "German", "page": "181–205", "page-first": "181", "publisher": "Beltz Juventa", "title": "Brain Gain oder Brain Clash? Implizites transnationales Wissen im Kontext von Rückkehr-Migration", "type": "chapter" }, "6142573/GRIVXPM6": { "ISBN": "978-3-8233-6295-1", "author": [ { "family": "Steyer", "given": "Kathrin" }, { "family": "Lauer", "given": "Meike" } ], "call-number": "PF3065 .S67 2007", "collection-number": "Bd. 40", "collection-title": "Studien zur deutschen Sprache", "container-title": "Sprach-Perspektiven: germanistische Linguistik und das Institut für Deutsche Sprache", "editor": [ { "family": "Kämper", "given": "Heidrun" }, { "family": "Eichinger", "given": "Ludwig M." } ], "event-place": "Tübingen", "id": "6142573/GRIVXPM6", "issued": { "year": 2007 }, "note": "OCLC: ocn122260172", "page": "493-509", "page-first": "493", "publisher": "G. Narr", "publisher-place": "Tübingen", "title": "„Corpus-Driven“: Linguistische Interpretation von Kookkurrenzbeziehungen", "type": "chapter" }, "6142573/HNFZPFHD": { "DOI": "10.1162/tacl_a_00099", "URL": "https://aclanthology.org/Q16-1021", "abstract": "Rule-based stemmers such as the Porter stemmer are frequently used to preprocess English corpora for topic modeling. In this work, we train and evaluate topic models on a variety of corpora using several different stemming algorithms. We examine several different quantitative measures of the resulting models, including likelihood, coherence, model stability, and entropy. 
Despite their frequent use in topic modeling, we find that stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability.", "accessed": { "day": 29, "month": 8, "year": 2021 }, "author": [ { "family": "Schofield", "given": "Alexandra" }, { "family": "Mimno", "given": "David" } ], "container-title": "Transactions of the Association for Computational Linguistics", "id": "6142573/HNFZPFHD", "issued": { "year": 2016 }, "page": "287–300", "page-first": "287", "shortTitle": "Comparing Apples to Apple", "title": "Comparing Apples to Apple: The Effects of Stemmers on Topic Models", "title-short": "Comparing Apples to Apple", "type": "article-journal", "volume": "4" }, "6142573/HP5RBUIN": { "abstract": "While Austro-Hungarian officials initially opposed emigration and considered it disloyal to leave the homeland, the massive growth of transatlantic labor migration, its economic benefits, and its potentially temporary duration prompted a change in governmental attitudes and policy at the turn of the twentieth century. Even as it continued to discourage and police the exit of emigrants, the Hungarian government, in particular, also became an active promoter of return migration. Using files from the Hungarian Prime Minister’s Office, the Hungarian Ministry of Agriculture, and the joint Austro-Hungarian Foreign Ministry, this article examines the Hungarian government’s attempts to encourage return migration to further its economic and nationalist goals. These initiatives emphasized the homecoming of desirable “patriotic” subjects, of Hungarian-speakers, and of farmers and skilled industrial workers to address the state’s perceived labor needs. Officials debated the risks of welcoming back migrants with undesirable social and political orientations and speakers of minority languages, as well as the risks of potential conflicts with the United States government.", "author": [ { "family": "Poznan", "given": "Kristina E." } ], "container-title": "The Hungarian Historical Review", "id": "6142573/HP5RBUIN", "issue": "3", "issued": { "year": 2017 }, "language": "English", "page": "647–667", "page-first": "647", "title": "Return Migration to Austria-Hungary from the United States in Homeland Economic and Ethnic Politics and International Diplomacy", "type": "article-journal", "volume": "6" }, "6142573/JETQF4X2": { "abstract": "Many approaches have been introduced to enable Latent Dirichlet Allocation (LDA) models to be updated in an online manner. This includes inferring new documents into the model, passing parameter priors to the inference algorithm or a mixture of both, leading to more complicated and computationally expensive models. We present a method to match and compare the resulting LDA topics of different models with light weight easy to use similarity measures. 
We address the on-line problem by keeping the model inference simple and matching topics solely by their high probability word lists.", "author": [ { "family": "Niekler", "given": "Andreas" }, { "family": "Jähnichen", "given": "Patrick" } ], "container-title": "Proceedings of ICCM 2012, 11th International Conference on Cognitive Modeling", "event": "11th International Conference on Cognitive Modeling", "id": "6142573/JETQF4X2", "issued": { "year": 2012 }, "page": "317-322", "page-first": "317", "publisher": "Universitätsverlag der TU Berlin", "title": "Matching Results of Latent Dirichlet Allocation for Text", "type": "paper-conference" }, "6142573/JMD7CSSP": { "author": [ { "family": "Oberbichler", "given": "Sarah" }, { "family": "Pfanzelter", "given": "Eva" } ], "container-title": "Digitised Newspapers – A New Eldorado for Historians? Tools, Methodology, Epistemology, and the Changing Practices of Writing History in the Context of Historical Newspapers Mass Digitization", "id": "6142573/JMD7CSSP", "issued": { "year": 2021 }, "publisher": "De Gruyter", "title": "Tracing Discourses in Digital Newspaper Collections: A Contribution to Digital Hermeneutics while Investigating ’Return Migration’ in Historical Press Coverage", "type": "chapter" }, "6142573/JMZAZWUX": { "URL": "https://www.c2dh.uni.lu/thinkering/digital-hermeneutics-history-theory-and-practice", "abstract": "On 25-26 October 2018, the Luxembourg Centre for Contemporary and Digital History (C²DH) organised the two day conference and workshop Digital Hermeneutics in History: Theory and Practice, on occasion of the official launch of the Ranke.2 teaching platform for Digital Source Criticism.", "accessed": { "day": 11, "month": 1, "year": 2021 }, "author": [ { "family": "Fickers", "given": "Andreas" } ], "container-title": "C2DH | Luxembourg Centre for Contemporary and Digital History", "id": "6142573/JMZAZWUX", "issued": { "day": 22, "month": 2, "year": 2019 }, "language": "en", "shortTitle": "Digital Hermeneutics in History", "title": "Digital Hermeneutics in History: Theory and Practice", "title-short": "Digital Hermeneutics in History", "type": "webpage" }, "6142573/K2RNGLST": { "abstract": "Regional mobility took many different forms. People moved shorter and longer distances, passed over administrative, geographical, or cultural borders, went back and forth between rural and urban areas, migrated to a neighboring country, or even crossed oceans. 
While some migrations consisted of a one-time move from one place of residence to another, other movements, even across national borders, were temporary, circular, or repeated.", "author": [ { "family": "Steidl", "given": "Annemarie" } ], "container-title": "Migration in Austria", "editor": [ { "family": "Rupnow", "given": "Dirk" }, { "family": "Bischof", "given": "Günter" } ], "event-place": "New Orleans", "id": "6142573/K2RNGLST", "issued": { "year": 2017 }, "language": "English", "number-of-volumes": "26", "page": "69–88", "page-first": "69", "publisher": "University of New Orleans Press", "publisher-place": "New Orleans", "title": "Migration Patterns in the Late Habsburg Empire", "type": "chapter", "volume": "Contemproary Austrian Studies" }, "6142573/KAFFLBWQ": { "ISBN": "978-3-11-020041-6 978-3-11-020937-2", "URL": "https://www.degruyter.com/document/doi/10.1515/9783110209372.6.407/html", "accessed": { "day": 27, "month": 4, "year": 2021 }, "author": [ { "family": "Bubenhofer", "given": "Noah" } ], "collection-editor": [ { "family": "Günthner", "given": "Susanne" }, { "family": "Konerding", "given": "Klaus-Peter" }, { "family": "Liebert", "given": "Wolf-Andreas" }, { "family": "Roelcke", "given": "Thorsten" } ], "container-title": "Diskurse berechnen? Wege zu einer korpuslinguistischen Diskursanalyse", "editor": [ { "family": "Warnke", "given": "Ingo H." }, { "family": "Spitzmüller", "given": "Jürgen" } ], "event-place": "Berlin, New York", "id": "6142573/KAFFLBWQ", "issued": { "day": 19, "month": 8, "year": 2008 }, "language": "en", "note": "Series Title: Linguistik - Impulse & Tendenzen\nDOI: 10.1515/9783110209372.6.407", "page": "407-434", "page-first": "407", "publisher": "Walter de Gruyter", "publisher-place": "Berlin, New York", "title": "Methods of Discourse Linguistics", "type": "chapter", "volume": "31" }, "6142573/LAF2DBJT": { "DOI": "10.1093/llc/fqy048", "URL": "https://academic.oup.com/dsh/article/34/2/368/5127711", "accessed": { "day": 3, "month": 12, "year": 2020 }, "author": [ { "family": "Koolen", "given": "Marijn" }, { "family": "van Gorp", "given": "Jasmijn" }, { "family": "van Ossenbruggen", "given": "Jacco" } ], "container-title": "Digital Scholarship in the Humanities", "id": "6142573/LAF2DBJT", "issue": "2", "issued": { "day": 1, "month": 6, "year": 2019 }, "language": "en", "page": "368-385", "page-first": "368", "shortTitle": "Toward a model for digital tool criticism", "title": "Toward a model for digital tool criticism: Reflection as integrative practice", "title-short": "Toward a model for digital tool criticism", "type": "article-journal", "volume": "34" }, "6142573/LM8L24CE": { "DOI": "10.1109/18.61115", "abstract": "A novel class of information-theoretic divergence measures based on the Shannon entropy is introduced. Unlike the well-known Kullback divergences, the new measures do not require the condition of absolute continuity to be satisfied by the probability distributions involved. More importantly, their close relationship with the variational distance and the probability of misclassification error are established in terms of bounds. These bounds are crucial in many applications of divergence measures. The measures are also well characterized by the properties of nonnegativity, finiteness, semiboundedness, and boundedness.<>", "author": [ { "family": "Lin", "given": "J." 
} ], "container-title": "IEEE Transactions on Information Theory", "id": "6142573/LM8L24CE", "issue": "1", "issued": { "month": 1, "year": 1991 }, "note": "Conference Name: IEEE Transactions on Information Theory", "page": "145-151", "page-first": "145", "title": "Divergence measures based on the Shannon entropy", "type": "article-journal", "volume": "37" }, "6142573/LVI27PCC": { "DOI": "10.1145/1459352.1459355", "URL": "https://dl.acm.org/doi/10.1145/1459352.1459355", "abstract": "Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner. WSD is considered an AI-complete problem, that is, a task whose solution is at least as hard as the most difficult problems in artificial intelligence. We introduce the reader to the motivations for solving the ambiguity of words and provide a description of the task. We overview supervised, unsupervised, and knowledge-based approaches. The assessment of WSD systems is discussed in the context of the Senseval/Semeval campaigns, aiming at the objective evaluation of systems participating in several different disambiguation tasks. Finally, applications, open problems, and future directions are discussed.", "accessed": { "day": 15, "month": 3, "year": 2021 }, "author": [ { "family": "Navigli", "given": "Roberto" } ], "container-title": "ACM Computing Surveys", "container-title-short": "ACM Comput. Surv.", "id": "6142573/LVI27PCC", "issue": "2", "issued": { "year": 2009 }, "journalAbbreviation": "ACM Comput. Surv.", "language": "en", "page": "1-69", "page-first": "1", "shortTitle": "Word sense disambiguation", "title": "Word sense disambiguation: A survey", "title-short": "Word sense disambiguation", "type": "article-journal", "volume": "41" }, "6142573/N6WP2ZUH": { "URL": "/paper/Understanding-Critical-Discourse-Analysis-in-Mogashoa/9f63601020213bdeb08053613d19d476d2009292", "abstract": "This article explores critical discourse analysis as a theory in qualitative research. The framework of analysis includes analysis of texts, interactions and social practices at the local, institutional and societal levels. It aims at revealing the motivation and politics involved in the arguing for or against a specific research method, statement, or value. It draws on the necessity for describing, interpreting, analysing, and critiquing social life reflected in text by using critical discourse analysis. The article recognises that Human subjects use texts to make sense of their world and to construct social actions and relations in the labour of everyday life while at the same time, texts position and construct individuals, making available various meanings, ideas and versions of the world (Lucke 1996:12). 
Drawing from literature, this study will explore programmes, various forms of critical discourse analysis, principles as well as advantages and disadvantages of using this theory in qualitative research.", "accessed": { "day": 4, "month": 5, "year": 2021 }, "author": [ { "family": "Mogashoa", "given": "Tebogo" } ], "id": "6142573/N6WP2ZUH", "issued": { "year": 2014 }, "language": "en", "title": "Understanding Critical Discourse Analysis in Qualitative Research", "type": "webpage" }, "6142573/NBV4BG2G": { "URL": "https://pro.europeana.eu/page/issue-11-generous-interfaces", "abstract": "EuropeanaTech Insight is a multimedia publication about R&D developments by the EuropeanaTech Community", "accessed": { "day": 25, "month": 11, "year": 2020 }, "author": [ { "family": "Oberbichler", "given": "Sarah" }, { "family": "Pfanzelter", "given": "Eva" }, { "family": "Hechl", "given": "Stefan" }, { "family": "Marjanen", "given": "J." } ], "container-title": "EuropeanaTech Insight", "id": "6142573/NBV4BG2G", "issued": { "year": 2020 }, "language": "en-GB", "title": "Doing historical research with digital newspapers - perspectives of DH scholars", "type": "article-journal", "volume": "16: Newspapers" }, "6142573/NY822LF2": { "DOI": "10.14765/ZZF.DOK-1765", "URL": "https://zeitgeschichte-digital.de/doks/1765", "abstract": "»[…] wenn ›die Quelle‹ die Reliquie historischen Arbeitens ist – nicht nur Überbleibsel, sondern auch Objekt wissenschaftlicher Verehrung –, dann wäre analog ›das Archiv‹ die Kirche der Geschichtswissenschaft, in der die heiligen Handlungen des Suchens, Findens, Entdeckens und Erforschens vollzogen werden.« Achim Landwehr wirft in seinem geschichtstheoretischen Essay den Historikern ihren »Quellenglauben« vor – diese Kritik ließe sich im digitalen Zeitalter leicht auf die Heilsversprechen der Apostel der »Big Data Revolution« übertragen. Zwar regen sich mittlerweile vermehrt Stimmen, die den »Wahnwitz« der digitalen Utopie in Frage stellen, doch wird der öffentliche Diskurs weiterhin von jener Revolutionsrhetorik dominiert, die standardmäßig als Begleitmusik neuer Technologien ertönt. Statt in der intellektuell wenig fruchtbaren Dichotomie von Gegnern und Befürwortern, »First Movers« und Ignoranten zu verharren, welche die Landschaft der »Digital Humanities« ein wenig überspitzt auch heute noch kennzeichnet, ist das Ziel dieses Beitrages eine praxeologische Reflexion, die den Einfluss von digitalen Infrastrukturen, digitalen Werkzeugen und digitalen »Quellen« auf die Praxis historischen Arbeitens zeigen möchte. Ausgehend von der These, dass ebenjene digitalen Infrastrukturen, Werkzeuge und »Quellen« heute einen zentralen Einfluss darauf haben, wie wir Geschichte denken, erforschen und erzählen, plädiert der Beitrag für ein »Update« der klassischen Hermeneutik in der Geschichtswissenschaft. Die kritische Reflexion über die konstitutive Rolle des Digitalen in der Konstruktion und Vermittlung historischen Wissens ist nicht nur eine Frage epistemologischer Dringlichkeit, sondern zentraler Bestandteil der Selbstverständigung eines Faches, dessen Anspruch als Wissenschaft sich auf die Methoden der Quellenkritik gründet.", "accessed": { "day": 9, "month": 7, "year": 2020 }, "author": [ { "family": "Fickers", "given": "Andreas" } ], "container-title": "Zeithistorische Forschungen/Studies in Contemporary History", "id": "6142573/NY822LF2", "issue": "1", "issued": { "year": 2020 }, "language": "de", "page": "157-168", "page-first": "157", "title": "Update für die Hermeneutik. 
Geschichtswissenschaft auf dem Weg zur digitalen Forensik?", "type": "article-journal", "volume": "17" }, "6142573/PBSKPE7S": { "author": [ { "family": "Foucault", "given": "Michel" } ], "event-place": "London", "id": "6142573/PBSKPE7S", "issued": { "year": 1969 }, "publisher": "Routledge", "publisher-place": "London", "title": "The Archaeology of Knowledge", "type": "book" }, "6142573/PWHTUDVE": { "ISBN": "978-3-642-41061-1 978-3-642-41062-8", "URL": "http://link.springer.com/10.1007/978-3-642-41062-8_16", "accessed": { "day": 1, "month": 9, "year": 2021 }, "author": [ { "family": "Connor", "given": "Richard" }, { "family": "Cardillo", "given": "Franco Alberto" }, { "family": "Moss", "given": "Robert" }, { "family": "Rabitti", "given": "Fausto" } ], "collection-editor": [ { "family": "Hutchison", "given": "David" }, { "family": "Kanade", "given": "Takeo" }, { "family": "Kittler", "given": "Josef" }, { "family": "Kleinberg", "given": "Jon M." }, { "family": "Mattern", "given": "Friedemann" }, { "family": "Mitchell", "given": "John C." }, { "family": "Naor", "given": "Moni" }, { "family": "Nierstrasz", "given": "Oscar" }, { "family": "Pandu Rangan", "given": "C." }, { "family": "Steffen", "given": "Bernhard" }, { "family": "Sudan", "given": "Madhu" }, { "family": "Terzopoulos", "given": "Demetri" }, { "family": "Tygar", "given": "Doug" }, { "family": "Vardi", "given": "Moshe Y." }, { "family": "Weikum", "given": "Gerhard" } ], "container-title": "Similarity Search and Applications", "editor": [ { "family": "Brisaboa", "given": "Nieves" }, { "family": "Pedreira", "given": "Oscar" }, { "family": "Zezula", "given": "Pavel" } ], "event-place": "Berlin, Heidelberg", "id": "6142573/PWHTUDVE", "issued": { "year": 2013 }, "note": "Series Title: Lecture Notes in Computer Science\nDOI: 10.1007/978-3-642-41062-8_16", "page": "163-168", "page-first": "163", "publisher": "Springer Berlin Heidelberg", "publisher-place": "Berlin, Heidelberg", "title": "Evaluation of Jensen-Shannon Distance over Sparse Data", "type": "chapter", "volume": "8199" }, "6142573/QA64ENM8": { "URL": "https://www.aclweb.org/anthology/L16-1042", "abstract": "Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been previously evaluated for this task. To evaluate these methods we use known-similarity corpora that have been previously used for this purpose, as well as a number of newly-constructed known-similarity corpora targeting differences in genre, topic, time, and region. 
Our findings indicate that, overall, the topic modelling approach did not improve on a chi-square method that had previously been found to work well for measuring corpus similarity.", "accessed": { "day": 27, "month": 12, "year": 2020 }, "author": [ { "family": "Fothergill", "given": "Richard" }, { "family": "Cook", "given": "Paul" }, { "family": "Baldwin", "given": "Timothy" } ], "container-title": "Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)", "event": "LREC 2016", "event-place": "Portorož, Slovenia", "id": "6142573/QA64ENM8", "issued": { "month": 5, "year": 2016 }, "page": "273–279", "page-first": "273", "publisher": "European Language Resources Association (ELRA)", "publisher-place": "Portorož, Slovenia", "title": "Evaluating a Topic Modelling Approach to Measuring Corpus Similarity", "type": "paper-conference" }, "6142573/QA8I2BAZ": { "abstract": "We describe a method for automatic word sense disambiguation using a text corpus and a machine-readable dictionary (MRD). The method is based on word similarity and context similarity measures. Words are considered similar if they appear in similar contexts; contexts are similar if they contain similar words. The circularity of this definition is resolved by an iterative, converging process, in which the system learns from the corpus a set of typical usages for each of the senses of the polysemous word listed in the MRD. A new instance of a polysemous word is assigned the sense associated with the typical usage most similar to its context. Experiments show that this method can learn even from very sparse training data, achieving over 92 % correct disambiguation performance.", "author": [ { "family": "Karov", "given": "Yael" }, { "family": "Edelman", "given": "Shimon" } ], "container-title": "Computational Linguistics", "id": "6142573/QA8I2BAZ", "issued": { "year": 1998 }, "page": "41–59", "page-first": "41", "title": "Similarity-based word sense disambiguation", "type": "article-journal", "volume": "24" }, "6142573/RWUWZF7D": { "ISBN": "978-3-205-00702-9", "author": [ { "family": "Kürnberger", "given": "Ferdinand" } ], "collection-number": "3", "collection-title": "Österreichische Bibliothek", "event-place": "Wien", "id": "6142573/RWUWZF7D", "issued": { "year": 1985 }, "note": "OCLC: 246634489", "number-of-pages": "631", "publisher": "Böhlau", "publisher-place": "Wien", "shortTitle": "Der Amerikamüde", "title": "Der Amerikamüde", "title-short": "Der Amerikamüde", "type": "book" }, "6142573/RZWYTHC7": { "ISBN": "978-1-85604-694-7", "abstract": "This textbook, for students of library and information studies undertaking courses in information retrieval, information organization, information use and knowledge-based systems, explains the theory, techniques and tools of traditional approaches to the organization and processing of information.", "author": [ { "family": "Chowdhury", "given": "Gobinda G."
} ], "id": "6142573/RZWYTHC7", "issued": { "year": 2010 }, "language": "en", "note": "Google-Books-ID: cN4qDgAAQBAJ", "number-of-pages": "529", "publisher": "Facet Publishing", "title": "Introduction to Modern Information Retrieval", "type": "book" }, "6142573/SC642F4H": { "DOI": "10.5121/ijctcm.2015.5301", "URL": "http://www.airccse.org/journal/ijctcm/papers/5315ijctcm01.pdf", "accessed": { "day": 6, "month": 5, "year": 2021 }, "author": [ { "family": "Ranjan Pal", "given": "Alok" }, { "family": "Saha", "given": "Diganta" } ], "container-title": "International Journal of Control Theory and Computer Modeling", "container-title-short": "IJCTCM", "id": "6142573/SC642F4H", "issue": "3", "issued": { "day": 31, "month": 7, "year": 2015 }, "journalAbbreviation": "IJCTCM", "page": "1-16", "page-first": "1", "shortTitle": "Word Sense Disambiguation", "title": "Word Sense Disambiguation: A Survey", "title-short": "Word Sense Disambiguation", "type": "article-journal", "volume": "5" }, "6142573/SCBME2FU": { "DOI": "10.26615/978-954-452-056-4_159", "URL": "https://www.aclweb.org/anthology/R19-1159", "abstract": "Dynamic topic models (DTMs) capture the evolution of topics and trends in time series data.Current DTMs are applicable only to monolingual datasets. In this paper we present the multilingual dynamic topic model (ML-DTM), a novel topic model that combines DTM with an existing multilingual topic modeling method to capture cross-lingual topics that evolve across time. We present results of this model on a parallel German-English corpus of news articles and a comparable corpus of Finnish and Swedish news articles. We demonstrate the capability of ML-DTM to track significant events related to a topic and show that it finds distinct topics and performs as well as existing multilingual topic models in aligning cross-lingual topics.", "accessed": { "day": 28, "month": 1, "year": 2021 }, "author": [ { "family": "Zosa", "given": "Elaine" }, { "family": "Granroth-Wilding", "given": "Mark" } ], "container-title": "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)", "event": "RANLP 2019", "event-place": "Varna, Bulgaria", "id": "6142573/SCBME2FU", "issued": { "month": 9, "year": 2019 }, "page": "1388–1396", "page-first": "1388", "publisher": "INCOMA Ltd.", "publisher-place": "Varna, Bulgaria", "title": "Multilingual Dynamic Topic Model", "type": "paper-conference" }, "6142573/TTCX55K3": { "URL": "https://eprints.lancs.ac.uk/id/eprint/528/", "abstract": "This paper proposes an accessible measure of the relevance of additional terms to a given query, describes and comments on the steps leading to its develop-ment, and discusses its utility. The measure, termed relative query term rele-vance (RQTR), draws on techniques used in information retrieval, and can becombined with a technique used in creating corpora from the world wide web,namely keyword analysis. It is independent of reference corpora, and does notrequire knowledge of the number of (relevant) documents in the database. Although it does not make use of user/expert judgements of document relevance,it does allow for subjective decisions. 
However, subjective decisions are triangu-lated against two objective indicators: keyness and, mainly, RQTR.", "accessed": { "day": 19, "month": 11, "year": 2020 }, "author": [ { "family": "Gabrielatos", "given": "Costas" } ], "container-title": "ICAME Journal", "id": "6142573/TTCX55K3", "issued": { "month": 4, "year": 2007 }, "language": "en", "page": "5-44", "page-first": "5", "title": "Selecting query terms to build a specialised corpus from a restricted-access database.", "type": "article-journal", "volume": "31" }, "6142573/U48CR9BT": { "DOI": "10.1016/j.artint.2019.103215", "URL": "https://www.sciencedirect.com/science/article/pii/S0004370218307021", "abstract": "Word Sense Disambiguation (WSD) is the task of associating the correct meaning with a word in a given context. WSD provides explicit semantic information that is beneficial to several downstream applications, such as question answering, semantic parsing and hypernym extraction. Unfortunately, WSD suffers from the well-known knowledge acquisition bottleneck problem: it is very expensive, in terms of both time and money, to acquire semantic annotations for a large number of sentences. To address this blocking issue we present Train-O-Matic, a knowledge-based and language-independent approach that is able to provide millions of training instances annotated automatically with word meanings. The approach is fully automatic, i.e., no human intervention is required, and the only type of human knowledge used is a task-independent WordNet-like resource. Moreover, as the sense distribution in the training set is pivotal to boosting the performance of WSD systems, we also present two unsupervised and language-independent methods that automatically induce a sense distribution when given a simple corpus of sentences. We show that, when the learned distributions are taken into account for generating the training sets, the performance of supervised methods is further enhanced. Experiments have proven that Train-O-Matic on its own, and also coupled with word sense distribution learning methods, lead a supervised system to achieve state-of-the-art performance consistently across gold standard datasets and languages. Importantly, we show how our sense distribution learning techniques aid Train-O-Matic to scale well over domains, without any extra human effort. To encourage future research, we release all the training sets in 5 different languages and the sense distributions for each domain of SemEval-13 and SemEval-15 at http://trainomatic.org.", "accessed": { "day": 3, "month": 4, "year": 2021 }, "author": [ { "family": "Pasini", "given": "Tommaso" }, { "family": "Navigli", "given": "Roberto" } ], "container-title": "Artificial Intelligence", "container-title-short": "Artificial Intelligence", "id": "6142573/U48CR9BT", "issued": { "day": 1, "month": 2, "year": 2020 }, "journalAbbreviation": "Artificial Intelligence", "language": "en", "page": "103215", "page-first": "103215", "shortTitle": "Train-O-Matic", "title": "Train-O-Matic: Supervised Word Sense Disambiguation with no (manual) effort", "title-short": "Train-O-Matic", "type": "article-journal", "volume": "279" }, "6142573/VJ7HSBUL": { "ISBN": "978-3-7728-0676-6", "author": [ { "family": "Leyh", "given": "Peter" } ], "event-place": "Stuttgart-Bad Cannstatt", "id": "6142573/VJ7HSBUL", "issued": { "year": 1977 }, "language": "de", "note": "OCLC: 256305642", "number-of-pages": "532", "publisher-place": "Stuttgart-Bad Cannstatt", "shortTitle": "Historik. Bd. 
1", "title": "Johann Gustav Droysen: Historik. Bd. 1: Rekonstruktion der ersten vollständigen Fassung der Vorlesungen (1857). Grundriß der Historik in der ersten handschriftlichen (1857/58) und in der letzten gedruckten Fassung (1882)", "title-short": "Historik. Bd. 1", "type": "book" }, "6142573/VJZ89VCM": { "URL": "http://arxiv.org/abs/2011.10428", "abstract": "This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data. Our main contributions are a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections; b) a discussion on the differences between two topic models for this type of data; c) quantifying topic prominence for a period and thus a generalization of document-wise topic assignment to a discourse level; and d) a discussion of the role of humanistic interpretation with regard to analysing discourse dynamics through topic models.", "accessed": { "day": 28, "month": 4, "year": 2021 }, "author": [ { "family": "Marjanen", "given": "Jani" }, { "family": "Zosa", "given": "Elaine" }, { "family": "Hengchen", "given": "Simon" }, { "family": "Pivovarova", "given": "Lidia" }, { "family": "Tolonen", "given": "Mikko" } ], "container-title": "arXiv:2011.10428 [cs]", "id": "6142573/VJZ89VCM", "issued": { "day": 20, "month": 11, "year": 2020 }, "note": "arXiv: 2011.10428", "title": "Topic modelling discourse dynamics in historical newspapers", "type": "article-journal" }, "6142573/VPCBKFBD": { "abstract": "We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.", "author": [ { "family": "Blei", "given": "David M." }, { "family": "Ng", "given": "Andrew Y." }, { "family": "Jordan", "given": "Michael I." } ], "container-title": "The Journal of Machine Learning Research", "container-title-short": "J. Mach. Learn. Res.", "id": "6142573/VPCBKFBD", "issued": { "day": 1, "month": 3, "year": 2003 }, "journalAbbreviation": "J. Mach. Learn. 
Res.", "page": "993–1022", "page-first": "993", "title": "Latent dirichlet allocation", "type": "article-journal", "volume": "3" }, "6142573/WLBLU3DX": { "URL": "https://www.aclweb.org/anthology/D07-1109", "accessed": { "day": 28, "month": 4, "year": 2021 }, "author": [ { "family": "Boyd-Graber", "given": "Jordan" }, { "family": "Blei", "given": "David" }, { "family": "Zhu", "given": "Xiaojin" } ], "container-title": "Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)", "event": "CoNLL-EMNLP 2007", "event-place": "Prague, Czech Republic", "id": "6142573/WLBLU3DX", "issued": { "month": 6, "year": 2007 }, "page": "1024–1033", "page-first": "1024", "publisher": "Association for Computational Linguistics", "publisher-place": "Prague, Czech Republic", "title": "A Topic Model for Word Sense Disambiguation", "type": "paper-conference" }, "6142573/WSN56ZDB": { "DOI": "10.3115/979617.979625", "URL": "http://portal.acm.org/citation.cfm?doid=979617.979625", "accessed": { "day": 29, "month": 4, "year": 2021 }, "author": [ { "family": "Dagan", "given": "Ido" }, { "family": "Lee", "given": "Lillian" }, { "family": "Pereira", "given": "Fernando" } ], "container-title": "Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics", "event": "the eighth conference", "event-place": "Madrid, Spain", "id": "6142573/WSN56ZDB", "issued": { "year": 1997 }, "language": "en", "page": "56-63", "page-first": "56", "publisher": "Association for Computational Linguistics", "publisher-place": "Madrid, Spain", "title": "Similarity-based methods for word sense disambiguation", "type": "paper-conference" }, "6142573/X8HTUPMY": { "ISBN": "978-3-8467-4432-1 978-3-7705-4432-5", "URL": "https://www.fink.de/view/book/edcoll/9783846744321/B9783846744321-s009.xml", "accessed": { "day": 2, "month": 12, "year": 2020 }, "author": [ { "family": "Busse", "given": "Dietrich" } ], "container-title": "Diskurse der Personalität", "editor": [ { "family": "Plotnikov", "given": "Nikolaj" }, { "family": "Haardt", "given": "Alexander" } ], "id": "6142573/X8HTUPMY", "issued": { "day": 1, "month": 1, "year": 2008 }, "note": "DOI: 10.30965/9783846744321_009", "page": "115-142", "page-first": "115", "publisher": "Wilhelm Fink Verlag", "title": "Begriffsgeschichte – Diskursgeschichte – Linguistische Epistemologie. Bemerkungen zu den theoretischen und methodischen Grundlagen einer Historischen Semantik in philosophischem Interesse anlässlich einer Philosophie der ‚Person‘", "type": "chapter" }, "6142573/XVWTJR4U": { "DOI": "10.3389/fdigh.2018.00015", "URL": "https://www.frontiersin.org/article/10.3389/fdigh.2018.00015/full", "accessed": { "day": 29, "month": 8, "year": 2021 }, "author": [ { "family": "Navarro-Colorado", "given": "Borja" } ], "container-title": "Frontiers in Digital Humanities", "container-title-short": "Front. Digit. Humanit.", "id": "6142573/XVWTJR4U", "issued": { "day": 20, "month": 6, "year": 2018 }, "journalAbbreviation": "Front. Digit. 
Humanit.", "page": "15", "page-first": "15", "shortTitle": "On Poetic Topic Modeling", "title": "On Poetic Topic Modeling: Extracting Themes and Motifs From a Corpus of Spanish Poetry", "title-short": "On Poetic Topic Modeling", "type": "article-journal", "volume": "5" }, "6142573/Y4CJJ4XN": { "DOI": "10.1023/A:1007537716579", "URL": "https://doi.org/10.1023/A:1007537716579", "abstract": "In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and ”eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words.", "accessed": { "day": 29, "month": 4, "year": 2021 }, "author": [ { "family": "Dagan", "given": "Ido" }, { "family": "Lee", "given": "Lillian" }, { "family": "Pereira", "given": "Fernando C. N." } ], "container-title": "Machine Learning", "container-title-short": "Machine Learning", "id": "6142573/Y4CJJ4XN", "issue": "1", "issued": { "day": 1, "month": 2, "year": 1999 }, "journalAbbreviation": "Machine Learning", "language": "en", "page": "43-69", "page-first": "43", "title": "Similarity-Based Models of Word Cooccurrence Probabilities", "type": "article-journal", "volume": "34" }, "6142573/YGUFHGJK": { "ISBN": "978-1-315-83436-8 978-1-4058-5822-9 978-1-317-86465-3", "URL": "http://dx.doi.org/10.4324/9781315834368", "accessed": { "day": 7, "month": 12, "year": 2020 }, "author": [ { "family": "Fairclough", "given": "Norman" } ], "edition": "Second Editon", "event-place": "New York", "id": "6142573/YGUFHGJK", "issued": { "year": 2013 }, "language": "English", "note": "OCLC: 1167313756", "publisher": "Routledge", "publisher-place": "New York", "shortTitle": "Critical discourse analysis", "title": "Critical discourse analysis: the critical study of language", "title-short": "Critical discourse analysis", "type": "book" }, "6142573/YJ6WDMIM": { "abstract": "Corpus comparison techniques are often used to compare different types of online media, for example social media posts and news articles. Most corpus comparison algorithms operate at a word-level and results are shown as lists of individual discriminating words which makes identifying larger underlying differences between corpora challenging. Most corpus comparison techniques also work on pairs of corpora and do need easily extend to multiple corpora. To counter these issues, we introduce Multi-corpus Topic-based Corpus Comparison (MTCC) a corpus comparison approach that works at a topic level and that can compare multiple corpora at once. Experiments on multiple real-world datasets are carried demonstrate the effectiveness of MTCC and compare the usefulness of different statistical discrimination metrics - the χ2 and Jensen-Shannon Divergence metrics are shown to work well. Finally we demonstrate the usefulness of reporting corpus comparison results via topics rather than individual words. 
Overall we show that the topic-level MTCC approach can capture the difference between multiple corpora, and show the results in a more meaningful and interpretable way than approaches that operate at a word-level.", "author": [ { "family": "Lu", "given": "Jinghui" }, { "family": "Henchion", "given": "Maeve" }, { "family": "Namee", "given": "Brian Mac" } ], "container-title": "Proceedings for the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science", "event": "28th Irish Conference on Artificial Intelligence and Cognitive Science", "event-place": "Galway", "id": "6142573/YJ6WDMIM", "issued": { "year": 2019 }, "language": "en", "page": "64-75", "page-first": "64", "publisher": "CEUR-WS.org", "title": "A Topic-Based Approach to Multiple Corpus Comparison", "type": "article-journal" }, "6142573/YKNUBLK6": { "abstract": "'Die Entwicklung von Theorien und Typisierungen zum Thema Rückkehrmigration spiegeln die historischen Entwicklungen im vergangenen Jahrhundert wider. Lange Zeit nicht auf der wissenschaftlichen Agenda, setzte in Deutschland eine intensive Beschäftigung mit dem Thema erst in den 1970er Jahren ein. Geographische, soziologische, politikwissenschaftliche und volkswirtschaftliche Ansätze setzen dabei unterschiedliche Schwerpunkte. Neben der Frage nach den Motiven für oder gegen eine Rückkehr sowie der Frage nach dem (optimalen) Zeitpunkt können die Reintegration der Rückkehrer oder deren Einflüsse auf das Heimatland im Mittelpunkt der Analyse stehen. Die folgende Systematisierung dient zum einem dem Ziel, einen Überblick über die Entwicklung der Theorieansätze zu geben. Zum anderen soll herausgearbeitet werden, in welchem Umfang die verschiedenen Ansätze den unterschiedlichen Remigrantentypen sowie den relevanten Forschungsfragen gerecht werden können.' (Autorenreferat)", "author": [ { "family": "Currle", "given": "Edda" } ], "container-title": "Sozialwissenschaftlicher Fachinformationsdienst", "id": "6142573/YKNUBLK6", "issue": "2", "issued": { "year": 2006 }, "language": "de", "note": "Series Title: Migration und ethnische Minderheiten", "page": "7-23", "page-first": "7", "title": "Theorieansätze zur Erklärung von Rückkehr und Remigration", "type": "article-journal" }, "6142573/YQFJLI5C": { "URL": "https://github.com/soberbichler/Text_classification_of_newspaper_clippings", "abstract": "Text classification for topic-specific newspaper collections", "accessed": { "day": 3, "month": 5, "year": 2021 }, "author": [ { "family": "Oberbichler", "given": "Sarah" } ], "id": "6142573/YQFJLI5C", "issued": { "day": 22, "month": 4, "year": 2021 }, "note": "original-date: 2020-07-05T16:55:27Z", "title": "Text classification of newspaper clippings", "type": "book" }, "8918850/AH3TIH3N": { "URL": "http://edoc.unibas.ch/diss/DissB_12621", "abstract": "Der Autor geht der Frage nach, wie sich die historische Quellenkritik durch die Verwendung von digitalen Objekten als Forschungsressource sowie digitalen Informations- und Kommunikationsmedien verändert. Da digitale Objekte neue und bisher nicht bekannte Eigenschaften aufweisen und sich von bisher bekannten Objekten unterscheiden, wird der gesamten Prozess der historisch-kritischen Methode und insbesondere die Quellenkritik als deren Hauptprozessschritt hinterfragt und angepasst. Dafür werden Methoden aus der Informationstechnik beigezogen, denn nur mit diesen lassen sich diese neuartigen Forschungsressourcen, die auch neue Quellentypen und -gattungen sowie Funktionen hervorbringen, vollständig untersuchen. 
Für die sich neu stellenden Probleme im Umgang mit digitalen Objekten werden Lösungsvorschläge präsentiert, die Anpassungen an der Arbeitsweise von (Geschichts-)Wissenschaftlern und die Schaffung von informationstechnischen Infrastrukturen betreffen.", "accessed": { "day": 10, "month": 12, "year": 2020 }, "author": [ { "family": "Föhr", "given": "Pascal" } ], "genre": "Thesis", "id": "8918850/AH3TIH3N", "issued": { "year": 2017 }, "language": "deu", "note": "DOI: 10.5451/unibas-006805169", "number-of-pages": "1 Online-Ressource (VIII, 339 Seiten)", "publisher": "University of Basel", "title": "Historische Quellenkritik im Digitalen Zeitalter", "type": "thesis" } } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }