{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "We're now switching focus away from the Network Science (for a little bit), beginning to think about _Language Processing_ instead. In other words, today will be all about learning to parse and make sense of textual data. This ties in nicely with our work on the network of Computational Social Scientists, because papers naturally contain text.\n", "\n", "We've looked at the network so far - now, let's see if we can include the text. Today is about \n", "\n", "* Part 1 - Installing the _natural language toolkit_ (NLTK) package and learning the basics of how it works (Chapter 1)\n", "* Part 2 - Figuring out how to make NLTK to work with real world data.\n", "* Part 3 - Apply some of the concepts that you have learned to study the abstract dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **_Video Lecture_**. [Intro to Natural Language processing](https://www.youtube.com/watch?v=Ph0EHmFT3n4). Today is all about working with NLTK, so not much lecturing - we will start with a perspective on text analysis by Sune (you will hear him talking about Wikipedia data here and there. Everything he sais applies to other textual data as well!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1 : Installing and the basics" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2MBERISGBUYLxoaL2NCOEJjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY//AABEIAWgB4AMBIgACEQEDEQH/xAAbAAACAwEBAQAAAAAAAAAAAAAAAwECBAUGB//EADwQAAIBAwICBwYFAwQBBQAAAAABAgMEESExBRIiMkFRcZHRBhMVUmGBFCMzQlMWkqEkQ3Kx0gdiY8Hw/8QAGAEBAQEBAQAAAAAAAAAAAAAAAAECAwT/xAAgEQEBAQEAAwEAAgMAAAAAAAAAARECITFBEgMTIlFh/9oADAMBAAIRAxEAPwD5+AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1f6fu/5KP9z9CV7PXb/wByh/c/QuJsckDsf05efyUP7n6B/Td5/JQ/ufoMNjjgdj+nLz+Sh/c/QP6cvP5KH9z9BhsccDr/ANO3n8lD+5+hP9N3n8lD+5+gymxxwOv/AE5efyUP7n6B/Tt3/JQ/ufoMpscgDqvgF0v9yj5v0I+A3X8lHzfoMNjlgdT4DdfyUfN+gfArr+Sj5v0GGxywOn8Dufno+b9CHwS5X76Xm/QYa5oHQfB7hfvpeb9CPhNf56fm/Qi6wAdBcHrv/cpeb9Cfgtx89LzfoBzgOi+DXPz0vN+hV8IuF+6n5v0AwAbvhVf56fm/QPhVf56fm/QGsIHRjwW5ktJ0vN+he34DdXFX3cKlFP6t+gTXLA7tf2Uv7eHNOrbNfSUvQx/Brj56Xm/QK5wG58KrredPzfoV+G1vmp+b9AMYGv4dW+aHm/QPh1b5oeb9AMgGv4dW+aHm/QlcNrN45oeb9AMYG18LuEs9DzfoV+H1s9aHmwMgHRXBrmUcp098bv0LLgd04OXPS07Mv0Jq5XMA6D4PcJdek/u/QXPhtaG8oeb9BsMrGBq/AVfmh5sj8DV+aHmyozAafwNX5oebD8FU+aHmBmA0/gqnzQ8w/BVPmh5sDMBqjw+rLaUPNky4fVjvKHm/QDIBo/B1O+PmSrGq/wB0PNk1cZgNfw+r80PNk/Da3zQ836DTGMDauGVn+6n5v0LLhNd/vp+b9BpjAB0PhFx89LzfoHwe4+el5v0GmOeB0Pg9x89LzfoSuDXDfXpeb9C6Y5wHU+BXX8lHzfoQ+B3K/fR836BHpi8UVGRRtzWSLNAkTgCmCWtA7SXsBRIs1oCRYqF4IaL4IYCZIXgbMWwIwQywNEVTBWSGFWgEtFXDI1oMEVncGiMtD5IVJEEKt3l1OEhTiUwF1qdOMtiv4d9glTlE0UbnD1AZTozUdh/CYv8AGPKNNtVpTjrgfYQpu6bjjcodxbSh9jzrPR8ZWKJ51krUZ6m4pjqiFNakSqYLOnJJPGj7SYxTeqZspxXu+WUcoluLzNYo02bYWaUeZ4l2osnCDWiIlXjBYWEu9MzenScYl8iWO36matGHPnHLLt7mKuLle83z9SJ1HO1k1q4a/YzreRvpVYVKSi9GtyYTUVlbSxlf4ONRufeKSzjCN9tLmp023nouRSKzn7uqoN64/wDsW6iqrl7jPe1WqsJZ7MEKeacavZ2oJTp08JYF4LurzeDIWGWMWKAWlgqaRABkMhDaW5NbYrR3LV9gRmGRFoZEy0Yi6KIugLxGIpEYgLEkEgBenuVReG4DisixEiocNgKHQOriui2CEWAo9wexL3B7AViXa0KxLvYqKEPYsQ9gETF4GzFgRgMFsEYIqrRDRchgJYEyKsgiQtl2VYUtlcF2ioEYBIklIgfSTUdGb+BybuZZfaY6a6Bu4FH86T+pRu4y/wAo8+9jvcZ/TOFLYlbjPKSzgFFT2IlDMmPt4Jvo08PvZm3FktTTo7J9nabY0E10Z69mRbzTT55NZ7ik2pxxTr4f1Rzt12kwq7pTjByUXLByqteMtFmNRftfabJ3NajUcZPLW2HhmS4qUriKclnDw3jVAYalV9jx9O438IxXlUjLVRgzDUoPmUZar9sl2nS4NRdJXEu+GCo5kYujdzp9zaOm/wAilyrdQx/+8jJKm53kJd7Q24nzTn3ZwgYz3T5uSXZzMrGf+mkvAibzRS7FLP2FZ1wVD/eYcV9B1Oaaw3qYa0sRptdxehNyj9UMRt31IwFOacN1kllYqMBgkCi9Fak19iaO4XAIyoZEWi8TLRiGIWi6AbEZEXEZECxIEgCL09ypenuA4rIuVkVDkOhsJjuOhsdnAyJYiJJFVZD2Je4FER3LPYrEuwipEiUEtgM8tygyRQADBIEVXBEloXIlsAhohrQYlqEloQIaKsvIowqjILMjAEYJQYLIDRBflnQ4BH8yT+phivyjp+z61b+oDONdQ4TWh3eOdU5FLGra+5mtzypTt6k30Ema4050YfmNa9wUaknP8vHixd9e04Qx71ub7jja9HPOMty608qMYzXkzm+8r0ajXI+V7xlqdC3t69d8+XhnYocInWiverJNbx5ypCdRKWXJf5IVpKck+V52em57GlwRRWkdB64RFPql1MjydHhrccOOVuvobKdi6dKoowxzI9PT4eorGCZWX/tQPDxNbh0qck0tkZallJR2+p7ipYuUsOOneLnwuOOqNXI8DUs5rTHYZZUZwy8anvKnCo/KYbjhUWn0RpeI8TPsymyIyfYsfc7t9wvlb5VqcWrRnSm4yRuXXLrmxMJtSWNEbqcuZHPjq8Sf3NNKful0isWNSQYCMoyScSSsmUdxF/PkRoo7mTinVAxU7lp6mylUU1oco1WbeRYsrpIuhcRkTKmxGxFRGxAsSQWAC9PcoMp7gOKyLlZFQ6I6OwlbjonZxMiWIiWIKMjsJe4MqIjuWZCJkUVQSBFZAKkVLS3KogkAJIqCGixD2AWis2WeguTApIoyzKsgqQWIAMEpagWiFaYr8o6ns+uizmf7R1/Z9flsorxpczS72Z6PD5zgtjffqPvk5apdhLk40svo/RHHuu/8ccjiNaFjQdOlHmnLu7zn8L4dUva/vKuXh7F7n3l9xBUKWdXjQ9nwvhlOzt4U4rVLVnN23CrHhkY4bidanbxisJDIQUUMNzly67t9KKmkT7tFiTeOe0r3aXYVlTQ4qyWLLWaVMq4I0NFJIzjpOmWpTWNjHWoJ9h0JozVjNdZXAvaC1yjg39jGpGWmuNz1d3DMWcerDpYwRr28VVpSo1HCW5MYuUH9DucWsuZc8Vqtjl0oYizcuuHXOVFvKUVjt7jWtVqLhCMtnhjVCSWpqMWG0THxTqm2ksGLia0ZWXJN1pHCyZKSzNZOhTSSWBSHxGxExGxMtGxGxFRGxAuWRCJABlLcoMpdYoeVmXwVmEMQ2IpDYnZwOgXKx2LEVR7gyQZUVRMgW4TKKrYiRaOxWewQmRBMiERUgBKIqAexIPYBMhUhsxMgKsglkAQ0QSABgtFalUXhuQasfknY4AvyTkS/RO1wFf6cqxe7WKuVjIq4WaGM6sbdrNXH11JlSU4JLY49e3o49E+zvDkq07maWVpE9LFYE2tJUqEYpY0NESQ6urIkANxySBAFAVZLKmasDFSYyQqRmt8lTEzWUOkLkZrtGC4gsM5Nen0s4O3Vg2zBcU9zLccS7WY4Zw6tHkqvl2fYdq+0mc6vHmWVuajHbBJNPK27R8anNBdv1F1E1nvF0X0vp3G441tpPJi4m9DbRMPE+00w5kXhm2jUzozChkJYZvNjPp1IsZEz0Jc0TRE546Q2I6ImI6IDESQiyAlDKW4tDaO4GhIrNFyswiUNgLQ2B2cD47EhEl7EVUhkgyohbhMlbkyWhRRFagxIrNaBGZkIme5BFSSQSFBEnoSUmyBc2KkXZRgUZBYgCAAABDYboWhsN0Bpn+idzgS/032OLU/RO7wVYtfsFhd3U/NxjtG28lKUUuwwX1RwrScn4Gixaik3uzhfb08+noodRDIoTRlmmh0SxjpYAA1GAQ85LAUUeSUSwIqk0xLjkfJlHgljXNKcBcoo0PGDNUnGO7M2OnNIqpJHNudHg3VK0MtZOfcyjJPD1M2OscHiXXObUniP1OnxHtOLWlqIz0pKopyFQ6M8dgmcunnPaMpz6SRtxrfR2MPE3ubqGxz+Jdc0w5uGTk0xgvdZMz3NxLGy0nnQ2xOVby5ZnUpvKRnqLD4DoCIj4mWjUWRRF0BKG0txQ2luBpKzLFZhFkOpikOpnZwh8QZMSGRUIGSiGVAty72KRGFCwexMtGQ9gjLU3KItU6xBFBKILAQUmhhWexFZ2UYyRRgUAlkAQAEoAQ6nuhSHUl0kBpqL8o7/AAhf6VeBwav6Z6DhaxarwCuNxiWLqLeyZe2qt1IrIrjGHca9hWxbnXi5bHHr29HHp6+1/RiPcuVGJ31vRpJKfM0tkZZXFW4WW/dw7Et2Qza6crmMX0pJeLFz4lbw3qx8zF7lSprkppy75amarw+vPWUor/isDV/MdF8Yt+bljJS8GMhxGnU2eDifDuyai/sLdhVpPnozaa/a3lMfqtfiPTqopLKJcsI4dHi0KVLFSElNdhjufaOrlqnSjj66l1n8PRuoZq19Cistnmpe0dVrEqcfs8CHdTvpZlmNPsS3ZGvzHTvPaDD5aScn9Dkz4vd1Z/tXjI00rSMpZaxE0UqdjTliUoZI1jnK6vJa8tNr/kxFe9rxeZUWkt3F8x6HFnJdCUGc3iFCGOanhNdqBjjTvoXMeVvpdhguo9DPcPu6KquTilGrFZ0/cvUxVfdrKVectM7dvcWRi35WKo92MtW5S8AvKDpSWOrLYbaU+WPMac66FHY5vEn02dOj1TlcT67NMErLo6GZ7jqNTEcMVN5k8FhUrR5OjbT5o4Odg1WuUXpJ7dKLHRM0GPgzm6HxLoXEugLjaO4kbS3CNRWZJEwGI0UloZ1uaqS0OzgYkQy6KPcihESLIiZUViMQpbjYlFahSWwyohU+qEZp7kIJbgRUkkIsAFZ7Fys9iKzyFsZIWwKkEkAAAAEofS6yExH0usiDVV/TR3+G6Wn2OBV6iO/YaWi8Arg8W1uSnDq0adwnNRkkurJ4TOtT4ZG/uZznJqMdEl2mK94FOleUYp9FzTTOfTvxPDdRSub2mnCMI9sY5xodGlSWW5Gf3Lo39Jv9yx9zdVi4weNzFdJ4Jr3tOhHCxlHJr8YuqlZ0qMEmlnL0WPua1w+pUqxnUfRTzgdd8OjVl7ymo8zjyyUiyb7Tq56cijxG6qtNYlnU6VGu6sdVh9xFpw6NutVFYWEkbLW0UNd+/UWf6al8eXLu6FP8ZTc9I1F/k60eGW0aCUoJvBi4jS99xWxoxWmXKXgsHZqaRGM3q/HjOO2VGi3KDUTFbT5JrTRHQ9pHLllKP7dTHYw95FT/AGvdk+Ol9rzuKtzNUaWUm8aFLuylb1J06k6nRp86UNMs71pY21JJwy3vk1V6FG5iverLWzZqePbHe308JSVwpJ5ksx5sS7DRTvZT6E2/B9h6eVnb0E3CKbx4s4d9YuVR1YxwS4vOz2wwhzXtNd8kcmNvnnbWcNpHft6bpOdxU2pRcvT/ACYI0XC3hndrLGlm1xbm4lOn7ua6UXozVawk6KeNjPdQSuX9dTVCo3ywjp3l1ic61UeqcrifXZ2Ka5co4/E+uzccr4c8skVLI3yzV0a6O2hjRpovQd+jn22QY+DM0GPgzk6NMGMQmDGoBiGUtxKY2k9QNZEiEwmVD47mqnsZoLU1Q2OjiuUe5ZlXuBKKzLIpNgQhkRSeo2BpkVBM+qMqMVPYDO9wCW5GSKsSUyWyQWRWpsHMVnLKClMoy0mUyBDKl2VAgCQAmI+j1kIiPo9ZEVpq7I9DZaWi8Dz1X9viehtdLP7ATYZ5JyT1Uy17UlK/tIvvyI4RWUrmvQe/WRq4jSxc21Xulg5V6ebMh97RdSipQ68HzIZRqRr0Yzj27ruY2L0M8qE7eo6tBc0Zdan3+AZ09QfYRJTxpgKdzSn+7ll8stGN5kMTayuhKT6T+yLxgoLXRIvUr0qSzOaiYa8q14/dwTp0nu3u0K3Laiwg69/WvZdTHu6Xh2s6NTqlaVONOnGnBYjFYRaprArH15TjUc1WmtJHL4bUVncO2raQn1JP/o9JxO395F9/eef93Gu3RrLVbM5+npzXorSKUTXyruPP2U76zagsVoLZSeH5nUhxHT8y2rwf/DP/AEXWbK0Tox7jJXoJppLLG/j1JdC3rzf/AAx/2Zq6urjSo1b03uovM39+wJlci+gqj/DUtYRalWmtm+yJhuscjO1WVOjS93TiowXYcC8lukRvMjicQ/VjIfbR6SeNxN4uZx8TdaJOKfcaYns9rD+xxOJdd+J295SZwuJP8x+J0jz9eaxEoqSjcrJiHUWJ7BtF6l79JPbXAfFmeA+JxdWiDHRZngx8WBcZS3FDKXWA1oiQIiRUaovDHKrgQ0VydHFq96HOjNkOYI1KZSchPOQ5gNi9R8GZIzwOVVYKLyeWLnsRzrJWc1gqEyepTISlqVyRVslkxWSckDckNi+YOYKJC2WbKgQAAAAQAFoj6PWQiI+j1kRWmpvHxPQW+ln9jz0+vDxPQUtLP7Ace2uXb8bhPOjfKz1tzRVenFdzyjwl1JxvHJbp5PeWs1WtKc12xTMV0lFMcnoJh2jURrpSrRp1evCMvFCvwlFfsfmzUgGJLjPG3hF5hTin341LKGPEY3l4QqrUVGLctgs2mQwWljGpyocTU6mFSqKPzNFrm+5Vlak/Ua/rukcUuYUtG9WefqThOtppntK8avJTrdCLnJdiZnsak7rPvKDpyj25yjna9PMek4e41afJNJtaG78PjqSaPPcKusX0qfN2Hp6cuaJqM9eCXTnjczV4S5dTdJ4M1eSwKc15+8i8s4l52nev5JNnCuVlSZGq5tXHaOtHy09e0z1H0sGqDSgtNjTluHR7ThcS/Vfidyn1Di8Q1qPxOkea1gLRBrBU1PCGtrBai9RORlF9IvV0kxuiOiIgPic2zoD4iID4kVcZS6wsZS3A1IiRJEistrQuSHSZRo3rnhWCBmCrLpipDIbK5CL5I5iuSuQi7kwc2LbDJRLZBGSMkFshkrkjIFgK5DIEsgjJGQqxBAASBAAXRoodZGeJoodYiny/Uh4noI6Wf2PPv9WHid/OLP7BXl7t/wCol4nr/Zq499wyEW9YPlPHXT/1E/E38B4mrG55Kj/LqPfuZmtx7NfqSRdbiIVIzmpReU0PRlqrkNgQ3gMpSF14RqR5ZLRhKehG4ak+s34OEV0WzNc0ujjB0thE2nLDM2OvPVeRu7Kp79yXVfaZlQrYxzuK+h7GvbQlHY5da1S1xsZx1nWufwy0VGopdv1PRW9XTGTjOapy3H0Lh82AXy605aGG5n3lnX5UYrirzNompHOvpZbOZWX5TZ0Lh80n4GC7xGi0IVxZvpvxHqfRwY6lRRyxdO+XvOkui+06SOPVyO1S/TONffqPxOrSrU3T0mjkX005vDydI4VlnsLJk8kCkAyj1xZen10RW+A+Jnp9hpgRo6A+OwmmaIbAWGUlqUwNorUB6REi5EkEaQADTKCsti5SewCmUZZlWVhBAMgoCAIAAAgIAAAAgAAAIAigAJKAkgCC8TRQ6xniaaG4U7/fh4nenpZ/Y89OrClVjKckkjRd+0VlTtuSNRSljZBXMuX+fPxM85xXacy54nOrUk46JsxTuZyfWZMV7/2Xvpu5dCU3KOOim9j2CeT5H7N306HGrZyl0ZS5X9z6xCWYJmOvFbnmHZ0FyZOcoVUeDNrUg3GRwlqzlXF/KnU5KVOVSfcjnXt/xecfyLF+Dlgmuv8AXa9P7yntkW4RbymjydPifF6cPzbOMPs2TPjlWm8SoVvGJdWfxf8AXqqjhGOsjm15x6WJI85U47J5Xua0n4GCrxmvLLhbVvuS3XScY7dzNJtiaFbNdHCnxS8ln/TP7sfY17r3qlOEVF9iMj0lWr0TFzye5s9y6lOLw12mavTcJbYIyy1Xh4fac7ibxSaRrnPmrfRHN4rU5aaWdSxK4lbWEvAxGvOYy8DIdo83SctbNkABWQAAUBaHWRUmO4HQpPRGqBjoPKRspmWj6aNMEZ6aNUFoBOBtJalMDKS1AegmiyREkENJIDJUSLnsXKSCFMqy7Ks0yoyCWVKgIJIACCScAVIZZoqwIJwQSmAYDAZAAwTgAIqMFkAYIuLRKV7+naxfbLuE3V3GhFrOpw7ivKrNtvcsgvecQq3M3mWI9xkbI7SSiewW9xmyFyCnUJuE4zjo4tNH17gt7G94dRrReeaKz4nx2kz2XsTxVUasrGo9JdKDb7e1GOp4b4r30ZYZFRcyF82ZJl5SwcnRFKhCMubHSLSgm87MmMizWUU26S4J9aORVShRa6iGVZOKMdWvVWcJYGunOpdvbpa04+Rzr2nS/ZCPkXqXFZ6YRlr87jl5ZLXaVhrRgtcJsvY01KqpSWiI9zKUuZ7GmlFU+l3GCuopKMdDmcUqKNJv6DXcYWTjcWuXU6KfaVyIhV1b+hyOLV8yxk2yqKFNyzocK4qOpVZqRjqoTxTkZjRU6NHxM50jj0AACsgAJSKBE6BjAFRaMnHqs2212tI1NPqc8nOCK9FSaaTTyjVBHm7a7nQl0XldzO5Z31Kuks8su5kxdbMDaS1KoZSWoDsESRdIiSCACAAkrInJWTCKMqyxVlZUkVZZlSogADAAixCJCoZRl2VwERggtgMAVJRPKSohUEk4JwRVUhN3cRoQeuo2tUjSpuTOBdV3VqNtlkVWrVlUk5NmdvUu3oL7TSIkEWWktBcXqRTXsKmN7BUxQU9zTGrOjOFWnLlnF5TMsdx71iB9M9nuM0+KWUZN4qR0nHuZ2s8yR8e4VxKrwq+jXpPT90e9H1DhfFaHEbWNajJNNaruONmO3N10oprtHJ6CYSTRaMiNJqLJhqwkuw6GmCr5W9UFlcp28s5cWE7bMcSWDp5T0RluHjLJW5XPnQjHQy3GFFx2H1q2G3nJyLq65ubXUzi6mrcRVNqW62ONXre8rFLu75tIsxTrckG86lkYtF9cPqRZhhFyaSLPM5cz7TRGCpUnUlvg25+2S5fSUVtEQTKXNJt9pBtzoACyRUCWS2MEA2VAQBKiBAFnpuQwK7FoyaeU8MgAOnacYqUsRq9OP+TuWnELWthqqk+5njw2ZFfQlrsRJHlOGcbrW0o060veUvruj1UZxq04zg8xksogpkMlQAtkpIkAihDLaBoVC8EcpfKDmQFOUnlBzRDqATyhgo6hDqBDNCNBbmyOZlw0zKI5kL5iMjDTXJEc4rIA0znI5ygq4qe7pNjBk4hccz5U9jlOXSNFSXNlmVvDNKY3lCy2SvaRV1rEU9JF4vQrVWmQGdguZaDzFETApHccuoIW45dQQJnudPgXEa9hcxdKXRk0pRezOZPcZQb1xuZV9ap1p0nGNdY5llM1wqJy30wZOHVafF/Z+1r7yUeWX0a0MlWrWs5Yac4LZ9xzsyu3N2O57xJfQVUrpao4suLx5dWs9xkq8Xjr0l5mdbj0cauYZMV3cpRaysnHXGY8rw3kw3PFFPbVsL4aLy6cdnp9DhXV20njd7EXF1KS7vEx4lN5w22JEtU5svmb1KOLqPL27jRGg3LbLNlG0UVzTLrGayW9q2+aa0MnELhTl7uGy3NfEb+MIujR63a12HHNSM9X5AAEpZNOaUixGO4sloaRGyKssyEu0CUu0N9iG2w0QA9/+yA3B94EEgQRQQTggAO7wHi3umrWu+g+o+44QJ4eSD3POR7wz5JLjOne8KuoLILiaY6hVzZVkZAs5MjmIIAnJGQAIMgVAokCAAnIEBgCSQSBogg5nEK3NLlWyN1eXu6TZxqr5m2WKonlCpYT1LReuCs1qFVzqXpUp168KVNZnOWEhWT1fsDw5XfFncTWY0Fp4ktakd3hfsRaU7RTuszqNa/Q8v7U8JpcNrpUeo+w+qV5qFJrOD5l7aV3UulHOxie3TP8XmqWxMitMszbkXsx0eqKki8HoBSpuFF4mEysdJIn0ey9juPw4dXlaXLxb1nv8r7z2N3CE1lNOL1TR8kWkjq8M4/dWH5bk6lLsjJ7eA6mtc9Y9hX4fCrnC5X3o51awqU88uGvqhlt7S2dWSjKfK32tG+VxRqw5o1IyX0Zysx2nlwKtOstOijJOhKWjl5HXu6lLn0kvrqZJVKME25rQisCtFu8vxGq17XohdbilClnGG/M59zxec1in5ly1m2R1JToUE3KSyjk33E51c06T5Yd/eYKlWdR5nJtlDU5YvegAA0wlFtkCWCUaiJisv7Et6gtEVfeUH17ABsOzsIDZalSdw0AFsyCcf4IACAYEUEEsgAAAIPXok4Pxq5+Sl5P1D43c/JS8n6l1nHeDBwfjdz8lLyfqHxu5+Sl5P1GmO9gMHB+N3PyUvJ+ofG7n5KXk/UumO9gDg/G7n5KXk/UPjdz8lLyfqNMd0jJw/jVz8lLyfqHxq5+Sl5P1GpldzAcpw/jdz8lLyfqQ+NXL/ZS8n6jVx3cBp3nBfGLl/tp+T9Sr4rcPsh5P1JpjvucV2lHWijg/Eq3yw8n6h8RrfLDyfqNXHcdx3DKVK5uFmlSlJd6RwI8SqqSfJTf0afqdq39uOIW1JU6VnYRiv8A45f+RFkn1n4jCvSXLUhJfY5beTq3ntdeXsHGtZ2TT7oS/wDI4k68pScuWMc9iLKWT4mTxIs3lCXNvuBVGl2F1MWe59K/9OKEY8IqVu2dR/4PmfMzu8F9ruIcEs/w1tRtpwy3mpGTf+GiVY+qXiUoNHy72shyX/ijVU/9QeLVFrb2X2hP/wAjgcS4nX4lW97XjCL7oJpf9mJMuun6n5xlg9S4pPBPO/ob1yXlsFPZlHJsFNrYuiZ7lCW8kEU+Dzgh7i4zcdgc2+4upi8ngmFxWisQqSXgxbm3vgjJNDHcVs/qS8yJVqko8spya7si2BF0AAAAAAATFEEp4AsWS1F8z7kTzv6GtQx6lZdxXnf0I5mNFluS99Cqm1HGF4kczJov3akYxsV5mHMxos9M/UgjLDI1QBGQAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/9k=\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"Ph0EHmFT3n4\",width=800, height=450)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "> _Reading_\n", "> The reading for today is Natural Language Processing with Python (NLPP) Chapter 1, Sections 1, 2, 3\\. [It's free online](http://www.nltk.org/book/). \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *Exercises*: NLPP Chapter 1\\.\n", "> \n", "> * First, install `nltk` if it isn't installed already (there are some tips below that I recommend checking out before doing installing)\n", "> * Second, work through chapter 1. The book is set up as a kind of tutorial with lots of examples for you to work through. I recommend you read the text with an open IPython Notebook and type out the examples that you see. ***It becomes much more fun if you to add a few variations and see what happens***. Some of those examples might very well be due as assignments (see below the install tips), so those ones should definitely be in a `notebook`. \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK Install tips \n", "\n", "Check to see if `nltk` is installed on your system by typing `import nltk` in a `notebook`. If it's not already installed, install it as part of _Anaconda_ by typing \n", "\n", " conda install nltk \n", "\n", "at the command prompt. If you don't have them, you can download the various corpora using a command-line version of the downloader that runs in Python notebooks: In the iPython notebook, run the code \n", "\n", " import nltk\n", " nltk.download()\n", "\n", "Now you can hit `d` to download, then type \"book\" to fetch the collection needed today's `nltk` session. Now that everything is up and running, let's get to the actual exercises." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Exercises: NLPP Chapter 1 (the stuff that might be due in an upcoming assignment).\n", "> \n", "> The following exercises from Chapter 1 are what might be due in an assignment later on.\n", ">\n", "> * Try out the `concordance` method, using another text and a word of your own choosing.\n", "> * Also try out the `similar` and `common_context` methods for a few of your own examples.\n", "> * Create your own version of a dispersion plot (\"your own version\" means another text and different word).\n", "> * Explain in your own words what aspect of language _lexical diversity_ describes. \n", "> * Create frequency distributions for `text2`, including the cumulative frequency plot for the 75 most common words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2 - Processing real text (from out on the inter-webs)\n", "\n", "Ok. So Chapter 3 in NLPP is all about working with text from the real world. Getting text from this internet, cleaning it, tokenizing, modifying (e.g. stemming, converting to lower case, etc) to get the text in shape to work with the NLTK tools you've already learned about.\n", "> \n", "> **Video lecture**: Short overview of chapter 3 + a few words about kinds of language processing that we don't address in this class. \n", "> " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2MBERISGBUYLxoaL2NCOEJjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY//AABEIAWgB4AMBIgACEQEDEQH/xAAbAAEAAwEBAQEAAAAAAAAAAAAAAQIDBAUGB//EAD0QAAIBAwIDBgQEAwcEAwAAAAABAgMEESExBRJRBhMiMkFhFFJxkRUjQtFigaEWU3KSsdLhM4LB8CVDov/EABgBAQEBAQEAAAAAAAAAAAAAAAABAgME/8QAIBEBAQACAwEBAQADAAAAAAAAAAECERIhMUEDURMiYf/aAAwDAQACEQMRAD8A/PwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbfDT6xHw0+sQMQbfDT6xI+Hn1iBkDTuZdUWVtN7SiBiDd2s16xK/Dz6xAyBr8PPrEfDz6xAyBr8PPrEfDz6xAyB0qyquPNzQx9Sbaxq3Nbuqbgn1beAOUHo3XBbm1SdSdJ5+Vv9jldrNesQMAbfDz6xJ+Fn1iBgDd2s16xKuhJeqAyBp3MuqHcy6oDMGncy6odzLqgMwadzLqh3MuqAzBp3MuqHcy6oDMGncy6olUJP1QGQNvhp9YllZ1Hs4gc4Or4Ct1j9x8DV6w+42unKDsXDaz9Yfd/sV+Aq/ND7sGnKDr/D6vzQ+7H4fV+aH3f7BHIDr/AA+r80Pu/wBh+HVvmh93+wNOQHX+HVvmh93+w/Dq3zQ+7/YGnIDr/Dq3zQ+7/Yfh1X5ofd/sByA6/wAOrfND7v8AYfh1b5ofd/sDTkB1/h9X5ofd/sPw+r80Pu/2A5Adf4fV+aH3f7D8Pq/ND7sDkB1/h9X5ofdkfh9X5ofdgcoOr4Cr80Pux8BV+aH3YHKDq+Aq/ND7sfAVfmh92DTlB1fA1fmh92R8DV+aH3YHMDp+Cq/ND7sfBVPmj9wOgEkGmEEYLEBVcEarYuQQIz9GWwnsUaITcQqzTRBrCUWtRKn0IMiUMYJW5R1RWKP8jTgMea8myj0oHV2bhmtUl7lR0cdeqR4ktj2ePP8AOweNPYlWKIukQiyIIlsYy3N5bGD3AgABQAkCASCCAAALQWpU67a3U1nLUugWIjRk2sprOzNlRlF9H1OmPNSWJvK6MrUqxw+XDWNjO25FOfZtewSjLVbHNz5i8PK3TRnSufzEs6PRhXZCrmXImQlk4YTcLzc1dbFbR7toDo9Sxz07qE/P4XszohOEl5v5lZ0YJH2ZIRGASCiASAIwRgsAKgkYIKgkAQQSAIILEFEYILEAQQSCCoJIZRiCSDbkgEgioBJAEYILEMCjXQvCo47kEYA2XLMd009DFZT0Om3r6pSCtauluej2Xh4JP3OerSjVt20/Q9Ls1R5LZssRwccebk8mex6XGHm7kebMzfViEXiVRdAVnsYs3nsYsCpBICgBIEYBIAqSMGlGOai2/mQXoprGaXNn2OxeBaw5ZCnlPOVH2RldKbXNB5M2usjf4qlOLjLLj667e5xXFOVCfPCXNDqYOTzzLSXqaUnOUeRJtensF9YwqOnX/gmVrRaeUepQ4VVuP0Ho0uztSbWVvuTa8HzksuqpdVkrPmznHufYR7MPTQtPsvFr3G14Pi22pt43LRnKLxnQ+pqdlnjfU4bjs5Wp55Xn2G04V5ELiUXuXp3rjLVZQr8OuKOeaLOZwa0aKzZXr06kKnlZc8NVJ0pZWjOqlxCaeKmqNMPSBSnVjVimvU0wQQCcDAFQW5WOUCgwX5RylGYL8o5AMwacpHKgKEGnKiMIDPBBrhFJbgVIZJBBiCSDo5BBIIIBJAVBBYgCoJIAF6cczRU1oLNRAa3NWVKliL3Pouz0v/j4t9D5q/8AKkfT8HhycOX+E0jwuJy5ruRwyR1Xrzcz+pxzbTMVpKNEZRkaRYETMWbTZkwKsE4ICgJAAgkAQWhJxehBMPMRY6Kc56czz7NCo4vZcr9pGcliLNuE2Vfil9GjBtQjrNr0Rl0i9lw+teTxGGY/Ng+m4d2fjSSc46nuWPD6VtRjCnFJI74wSRl1nTiocPhBeVHVCjCCwkakpZBtlyIrKmdGCkl7FHHOGGYVKalujsqLJzzWplp51zZQqxeIrJ8/xPgzUeaK+uD67BlVpqSaaA/M7q3nTeHqc/I8Z3PsOM8NWHOCPnpUoxnrsblccsHNbXHct6HqWldVovO+dDy6lNd4+RrUm1rOnPVZ6GnN72EMGVGtGpFN6PobZQFQHJEc6CAwR3i6kOqgJwMFHVRDqgXwRgp3pHePoBpgjlM+8ZeLyBVrUrPc0luZzAoQSQRWQAOjmgEkAAABBDJIIIBIKBtbLxmJ02i8RBS+1nFH11pHk4Z/2nyVz4rqEf4kfYtcnDH/AITSPkbl5rzfuc01mRvW1qy+pi9zDSvKSsoukTy6AZ7lWsEz0ehCn6MCCC+M7FWFQBkZAkEACQtHkgnAGmji87n3PZfhisbCMpL82r4p+3RHyXBLb4riNOm1lZy/oj9ForEcLYxk7fnProisInKRRblnjBl1XXUsmyq2Jw2ETlkNhxZV5W5UZzWTCcTpk0ZVFoRrbmehm9S02ZN4ZFY3dOM6bT3PjuKUXCb6H19xPDweBxWkp8yRWa+aUueSVTTXzL0Ir0e7aqRw1s10Juabi9CKVTD5aqzF6M3HGxpRqTg44lmJ6kYylFPqeXyKlOMZPMc6M9qn/wBOP0KzWXdsd0zchsIw7od0bNkAZd0O6RpkjIFO6Q5EWyAKOKwIEy2IgAkZyNJFJgUZUsyCKxAB0cwAACCSAiASQRUAkAEddmtWciO2zXhbLBk1z8RpL+I+wu/Dwx/Q+TtVz8Wpr3Pq+Kvl4aVHx09Zv6mX6jSW7KLcw0si+NCsUX9AjCaMmjaZmwqE8E5TIIwBblICeC2jCowA0AAGBgD6fsfbZnWuGv4EfYwp5STPB7Pxp2PC6XfSUZS8TXrqezQvaNR4gznfXox6jrjHQnBKksIrVliLwGl9EZyuaUXjnWfqeVdXNacHGGUzy4UKznvLO7efUmzT6d3VPO+pV3MXojx6NrdJZxnBNRV0vK1IbNPV7xT23KznphnmUp1Us5lnoy6u+aShUWJP1Ctp7mVRYWfUtzJtCS5kRXDcPmT6nnXMcptnq1qeHqefcSSWuzKy+evrfKeNzypKT0a0R9DcJSb6M8i6pcrbRqOeTnU5d3yyWej6HsWc3OhFvc8VP0eD0+FVXKEoS/SzTnXciHuStw9wiMEYLEARgjBYgCjWCEWlsQgIlsVhuWlsVgBMjOexpIpLYDNkMlkEViADo5gAAEEkACCSCIAAKlHdaL8ts4kd9usUGWIrwuPPxiPsfTcdfLYJHz3AY83Fm+iPd7RvFpFFHybKos9isTDVXiXexWJaWxUYzMy8ygVAAIIGxICpUupOCuAngCTt4dShKbq1FmMNcHGmme32doxr3UYS8urZmtYzddVCF3dT7zlkl6aaI9W3o1aTUpPY9fnoUIY5V7JIx+IpVZ8ihJN9Y4Ob0R029VVIrB0zipQwzko01Ri8bvUtKq1uVdMK1HEtCsHCjmVRrB0NNrPU83i9vUVLzaPoSrO+nfS4nQmvBlrqkdCrUKvqsnxMrq9pSiqdWcI5wljGh6VvdXtWg+f8xqWE3HDZqTpi2S6e/Uox1ccHm3NvltrdbHPTueIU5L8iTj7vJ306zrYTg4y9UzNakZRT7uOVj0LZwjtqUV3PucSxIqs60eaOfVHjXekpRZ7ckeHxF4nLG6EZrgqS0emxxVqecnW9Wl6GU01J41Kw8evT8TwsYOrhGXKX0KXKxJvGhbhTSuJLPobjnXqrcMlbhhkIJAFSCzIAq9iqLPYqtwEtikNy8ikdwJkUnsaT2M5bAZkFipFYgA6OYAADIJICBABAAJQEo9Glpb/yPPitT0cYtn9Cwa9mY83EKjPU7TyxSiji7KQzcVZe509qZeKKKPm5bERJlsImGqvEtLYiJMtio557lC89yhFAAAAJCoBIAjB6nZ+v3N8vdM8w7eDrm4nRi9pPD+xmtY9V7Vfj9zC4jRtKalUk8arLf0PqbGlVhQjO6kpVmsywtEcXCuE0qVeVZ0/zFs2e5KEIw1epl3+uSo9SFFVNi9SClszKMZRloRtvCOFh7ETg+VpNNFo1tovBpHlqLbUqVywtofqj90bqEUsLGPZF+RkqASsnTi/QxnDD2OtxwZVFoSkZzSdI86O7OybaTXocefEw3IiSwmfPcRy7rlXqfQzeUeHxCKhzVG9WmkImUecn+bFeiaN50HKPNBZwzlot1KiSW6S/mfW8Pt4W9uqtTGUtPqViTb5+XZ+6u6Sk1Gjn0nueb+FXPDb2PexTg08TjsfUXlzVvKjhSbVOPT1LW1p30HSq5aktM+jHJq/lLHz63JlubV6Ere5nSlvFmc0beWzSqWgwWTGUBXBVovzIhyQGbWhRbmkmiqWoB7GUfMbNaGa8wEzWhnLym014TKXlAyKkkMKxBBJtyAAFCAAiAAQCyRBZBV4LVHfV0tjiprMkd1zpbFg9DslHSo+sivaiWbhI6eyUcWzl1kcHaSWbzBfiR4sti0SrLRMtVpES2ERPYI557lC8igAAEUJIJCgAAH0vY/hir3Er2ovDSeILqz5k/SOzVFUeDW8UvNHmf1epK3hN11XFd204z/S9GZVL9Si8PKNr2lCrbVIz0WN+h8pb1ZU67hH/AKSeFk5V6ZI7fxy8lOSpWjwnpzPc6rbi91zJXFlKK+aMkyOVaNLXGTocY8ifLnTVhrpNzeRkko7s9C3U1ShUlvjVHHZwpLE+RLo2tT1Kc1OGuhYmV66aKSaJckkYwemOhC5qksR+5rbnxJVNQqUp+xtGnGmvfqZyrqDlrhrYhv8AjO6tOSnlSy+h5Mmmmdl1xOMacsvJyWdCVylKTxDf6itY712wcsJo8niMo1G6WjbPsoWtGFHWCWfY+d4xbU6NeNSCSWuRrRy28Syo4uMdGezOs5U2m8RjosnDaU+Sprq8Zyd8rZThGL8tTczVj0rGyjRt+V4k2t0b2/LCOOXVMpYQdrBUnlpbM2q4hWTX6kaHzvHYJcQyt3FZPMnE6uIXPxF/Ul6J4RyVGbjzZ+quJHKRqx4isp5SvKh4ir5iCWiEtSGmI7gWa0Mf1m72MP1gaT8pk/KbzXgMH5QMSrJIYVgEAjbkkMAKghkkMiIJICAsi6KRNEFa0fOjqvHigc9Dzo2v3+Uij6HsvHlsE+p4nHpZvme/2eXLw2H0Pm+MS5r6YvhPXBIvEo9y8SLWkSJkxImEc8iheRUCCQCKAAAAAqYxc5KMd28I/T+H06ltYUYVEsxgk8P2PzKjLkrQl8skz9bt3GtbxktnFMzXT87p5PE62bfu45zN40Oa3sILElDLOi9t8XcMvwZ/qdSuI0Ia4RzenX8ZqjnCUcYNHb8tPGNXuQ7+EWstJGkbiNVbpopZXHNT5oL0RsqzSeHsy9VLmTis50RvaWipNznrN/0DNumlClNpuWifodMYKCGUkYV66hHc145W2or1lDd6Hi8U4lTow5m8P06v6GV9xCrWuPhrODq1n6ekfdnXw7gEYVFcXsu+uH6vaPskY9dJ/r68yx4decVqqtdZp0E8xp/ufRRoxt6SjHpg6pONKGEkkeTeXygnnZepdaOVyXvb/FP+JaYPmLziErms1lcsdMIvKdfjNz3du5KhnEpr1+h6/wDZ2yoWyhCHLNLzJ65Nd06jw6FTV56HtWVRT5Ib8q1Pn0nG8dF7xeGfV2ijSpLlglNrV4Mq6pVI6Qaxj1Zx3VdRhVr/AKYRxH3Zatl1OaTyjyeLXqq/kU/LHfBot1HjN5m2JhecmpsaeVMEmiWkRDYMojQiSJIewFHsZrzGj2MluQaPYwfnN3sYS8wG0vIYejNn5DH0YGLKslkEVzk5IB0ck5JKgCWVZJVkAkgkqrxLopEuiDot/Oi3EH4ERbeci/eXFFH1vCFycNj/AIT5PiT5ryf1PrrLwcN/7T428lm5m/cUnrF7l4mfqaRIVrEpMtErMDCRUtIqAJAIoAAqAAASyz9Y4NTnT4dQjUeZKCz9j8y4XQVzxK3ovaU1k/V7dctNLojNbx8cHE4eFS+V5PD4vOFGdOrWk1SUlzP2Pd4rUjGhPLPA4zayv6CoweJTxFZ9DF9evDxy3HaPg9KUZwhVq/4Y4/1Mbfj9TiV9ChwuyqtvzczSSXVmdHsFXlVfeXsFD0cYNtn2fB+D2vCLVUbeGv6pveT9y6jhyydFpQ7uCc9Z41Z0PbQjJnOqordA7qas1QoylJ7I8GtWub6srW1/6j1nP0prr9fY6KveX1x3VCT0803tH/k9O1taVnR5Ka95Se8n1ZPV8UseH2/D6KhTjmT1lN7yfVm8qq9DKb5m9dDnu66oxXVl2mts7+8VOL1Pn6FtX4/XaTlGyT1lt3n/AAdFO1r8crtaws4vxS/vPZex9LSo0rOhGnTioxitMD1renPRtaPDqUY0YJKHoefxLiSp05z5lGKWXKRpxLiMaUZyk0kkfBcV4hO8rPMnyLyx/wDIlP8Atetwlq6v6laXq8n06qxhDLZ+f2NzWtl3sHv/AFOi749dOi4xxF9RpqZTXb6TiPE4Z7mlPM309EeX1PL4NCU4yuKjcpTe7PTNSOGefJmvOTU2KrzlquxWCGwZENiQAYIYGcjP1LyM/UDR7GM/Ma5MZ+YDV+Qx9GbfoMeoGL3IJe5UK5ySAbckgAgFSWQUCSEWQVdFolUWiQdVt5il34q8F/EjS23M6nivaS/iRR9fHwcMf+E+LrvNaT9z7K4fLwx/Q+KqPNSX1FIqtzWJktzWJBoikyyKVAMWAwgoACAAAoQSAPY7KRUuO0crOj/0P02C8OD4DsLR5uKVanyU8fd/8H6BEzfXSeODiMI4gsLLZzUrNVZZlHQ760VVqp+kdjRLTCMu2Odk0inTjTSUVsXKvQnDa0KwpUeInDOlVupctN8sPWb/APB2/DqUszk5e3oa6JYRNLv+M6FCnbUlCnHCX9SKlT0JnNJZPOuruME9QSN61zCjFvdnHStZ8Tnz124UtuVeqJsrSd7JVqqaorVJ/q/4O+rUUKqhHRY9CL543SpW9NU6aUUlokeVxa/jQpNtm17Xjb0eaUtXovqfL16j4hf8kn+VDf3ZUkePxW7uLqXiUo0X/wDo8mby0lu2kfQ9oK1OFGFNRWG9D5+2j3lzndR2Kld01y0lH2OKusxZ31Focdz4YNiF8V4dxJ2yVKccwzut0e7Sqwqw54SUk+h8iaUq1Sk/y5uP0ZtwfT58ZepsfLSr1ZvMqkm/qa07+5prCqNro9RofRw2JPPsuJ06uIVPBL+jO9NPZjSrFZEkS2IMpGfqXZm9wLozqbl0zOpuBon4DPqXXkM3uwMXuQyZblWFYAgZNuSSSAQCCQVQlbkEoC6LxKF4kHXbmUPFxKkv4jWh5SloubitP6lH1HEXy8Nf0PjJPxM+v4xLl4cfHCkTHc0RlHc1iQaLYzmX9DObAze4IyCKkFHOK9SrrL0NcabagwdaRDqSZeFTboyluysqi9NTmcpdS0I1KklCGZSk8JL1Lwht+gdhKHJw6rdSWHVnhfRf+s+p8Ut9jg4HZ/B8Nt6D3hBJ/X1PT2OF9d51FVHCJRJIXauBnBLZVsCJTwjJ1OhFd+BnNGpiGpG5OkXNfli1kztLD4marXC/LWsYv192Wo2sq9VVKulOOqj1PQlUUVhbBNlSapwwtEjzq1xCm3OepW9vFBPL0R81c3la/rOnRyoZw2FkbXt/Uv66pUotKL0yyZQo2drJzaWFqaRo0rC2c35lu2fK8V4jUvarp02+RPV9Ro5acV/d1L25csvG0V0R02VLkRlQt8bnpUaWCsTu7VnHQ8y/nhcvU9S4kqcHJvRHz9ao6tRyZYZ3SgJJSOkjihIlElksI1IztCRvSuq1LyTf8zHdj2RrQ61xK5j+tP8AkSuJXGc6HGSyaht6NPiSbxVhy+6OrmUknF5TPDN7e5lRaT1j0M3H+LK9dMpUFKrGrHmg8oioc2mkX4TN7l4eUze4GctyrJnuVCucAGnJIyAUAARUlolC8QLovEojSIV1UdIFeGLm4rH2LQ0pk8FWeJt9EVHtceliyS9j5I+o7RSxbRXsfLCkWgaxMoGsSC/oYVMmxlUeAMttzOc8vCZao9DnT8R0mOk209SGtSd1kLVG2USWhEXpqXRm9J46hSS10PouxVkrrijqyWY0VlfVngPU++7CWXc8OlXa1rSz/JHP9LqN4TdfWU1yxRfJRE5PO9CxOSCMlQe5nN4Rdsxk9SNRhVlKWi2Kwpv1NmsEZXqRocsLCMLit3cW29EJ1G56aI8Xi105NUKbzKTwVNOK4qVeJXUqVJtU8+JnfbWsLaCwvDFa/U2s7SFpbfxNas8njHEXQpOlTliUiK8vj19O4qujTeIp6nnU6SikkWjmTy3l53NqdPJpzvdWo09EdihhCjT0WmC9V8sW2RuPG4tU5YcvU8k6eIVu9uHh6I5kdMY4Z3dSkWIRY6xzokG/Qn0Iiv1FROyx6gEAA9g3gjVgMk4I9QBrQryoTzF6eqPTVSNWClH1PHNrat3U8PyvcxlGpXrw8pV7l6bzHJR7nNpjPzFDSfmKgc6BBOTTASQSBAwSABaJACtUXiZRNYkHUtKT+hpwGOb6bM5aUWb9nVm5qM0jq7Sy8EV7HzZ7/aWXjSPnyUi8DRGcdjRAWbwjjnUzPU1qVMvCZySfjOuM12y2esTnejNs+F4Mc5eDVI0i/QbMpF4ZefUCyxkpVWNQmTV1hkfBalCVarCnBZlJpJH61wy2VpY0aMdFCKR8D2LsPi+IqtNZp0Nf5+h+jwR5v1y309H54/WsSyKFk9DnHSrNkMFW8lRWexRalqrwjLOItkaUqT8WEZTn6IjOW3kxrXEaUW/ULGd9cunTxHdnk2lNzuuaesm9PZG1StKdTOOab2Rego26lUm8yxlsLpPGL+NrRazrjY+QqTncVXOTy2dPErp3l03nwJ6GMIlYpCGp10qZSlTzud1KlswsiIwa19EcXE63d0ZPbQ9VUnLCS1PmOPXUKlx8PRlzRp+aXWRqRnK6jyt22WSIRbZHaR56lIn1IiWNxlEhsvYbzIlrLBAWpLeF7jGhXdgQlnUsAAIGcjYAQwwRXqcNrc8HTlvHY6ZbnjW9V0a8ZLbZntPXVHOxqMZrUqaVFqUZlXIBgYNMBYjBbAEAnAwFQSME4ILRNI7mSZeD8SKOuo8UTs7NrxVH7nDWeKJ6HZteCT9yxGPaOWa6R4h6vH5ZusHlEqxpHYipPCwhzcsTJvO50wx32zaJ+pzyf5hsYVPObyTFsnoYbTNIszlpIlWLvfJd+RFIvMS8vKEUTNlF1EopZbZlCEpvRH1nZbg3eVIXddeGOsU/V9SXLUamPKvoOzXC1w3hsKbX5k/FN+57kVgpTWPQ1X0PHe7t65NdJSyWI2RCZUTJ6FUw2HsFYVp+LBnOWIE1HmZxXdVvwRCs6lZ64ehxOE60svKidKp82F6G8YRpxz0Irlp0VRi5Nas8XjF83+RTe+7O3i3EFTg4xep89rOTlLdlS1WC6G9OGSacV6m8OSJU01o0lvjU7qVJaaa9DOzo1bmWKMMpbyeyM+Lcat+DwdG2arXrWs/SBZjamWcxY9ouJLhtH4WjJfFVF4mv/rX7nxq1ZarVnXqyqVZOU5PLb9SFojrI89u0x3JZMV4SFqzbK8cJE+hA2RplWO8mFq8iPlJ2RFVkyY6IjGWWAhkMllW/RATnGxASLEVADIZQPZs595bQfqtDxT0uF1PDKn75MZK6qi1KNGlRalWjDTj0GUUyCsL5GSiJAtzDmKgC3MMlS2AJyaUtZIyNaXmQVvcP8o9Xs6sW+fc8i6f5Z7XAFizTLEeTxqWbtnmylyrJ3cVebuR5dSXiLJum2mcrJGSY7GcspnXxhbmM6i9S3Mmho0SrFYsVNiq0eC0tYk+NfSmzZQc2ktjOhTcnnZHTotEax8ZyWilCOIn2fZTiMa1BW1R/mU9vdHxZtaXdSzuIVqT8Uf8AQmePKLhlxr9WpyNkzx+DcSpcQtlUg1zeq6HqKX2PHrT2ertkFdScgN2JPCGUc9ap6IIwr1NWonMo41e7NZYWrMJ1FDMpMNRo3GEctnl39/yxaTK3N23nD0PleL8S5m6VKWX6yLJtMstRe9q3VStzRozlH0aRlD4yo0oW9T/KcFLiF1R0hWnjo3k6Fxy/W1b+hvjHHnXsWfCeIVpZqKNGPWT1PVVnw3h1J1by47xx9G8L7Hx8+MX81h3VRL2eDjlOc5ZnJyfVvJqSRm5ZV9LxXtZUq0nb2EO5pbc2z/l0PmZScpOUm23u2AVkW5LYWxMdZGhZ6RwREmbENjX1n4stCJaIkrPYqC8qDHoiCKlaIbArJ9ADedETFCKJZFGQGwVAABVWdFjPkuY++hgWpvFSL9zNHuTKM0esUyrRzbjzAc3fy6Inv5dEVjTpQOb4iXRD4ifRBdOoHL8RPpEn4ifSINOpFjj+Jn0iT8VPpEGnU3qa0fMjg+Kn8sS8L+pB5UIfZ/uB33WeTY9/g0XHh6f8J8nU4lVqQ5XCn/JP9zqo9obuhQ7qFOjy4xlxef8AUu4mjiM18XLPU8+qsSyRUrzqVHOeHJlZVXJYaRqWGq1g9CZaowU2ie8l7F5xOKXuSZuTZPO/YnKLpaWuqNqNPKzL7GCqNPOEy/xM+kSzKFldWcaLREN5OX4ifSI+Il0ia/yRnjXVkq3/AO9Tn+In0Q7+XRDnDjXr8J4lU4bcqrBtx2kvRo/ReH31K+to1aM00z8j7+XSJ3cM47ecMlJ2/I1LeM02v9TlnrLx1wyuPVfrkXpqVnNeh+c/264p/c2n+SX+4f254n/cWn+SX+45ca6c4/QnNsxm8ZZ8F/bnif8AcWn+SX+4pPtrxKe9G1X0jL/cONXnH21atGnFts8W7u+eTy8RPma3aW+reaNH+Sf7nFccTubiPLJxiv4Rxp/kju4rxVzbpUZfVo8VgG5HK3YAAgACiQtyMkp4LuCz0QprXJVvJKk0i8ptNJnuWjsZt5ZPO/YvKbTTQiRTnfsHNsvKGl3ghFOZhSaJyhpeTIiiuSVNocoulyNyvMxzscomlhkpzMczHKGl2QV5mOYcoaT6kwaVSLe2SreSDNyV9I9YrBVo8mHEq0KagowaXq0/3J/E63yU/s/3MrHEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD/9k=\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"Rwakh-HXPJk\",width=800, height=450)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *Reading*: NLPP Chapter 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.9, and 3.10\\. It's not important that you go in depth with everything here - the key thing is that you *know that Chapter 3 of this book exists*, and that it's a great place to return to if you're ever in need of an explanation on topics that you forget as soon as you stop using them (and don't worry, I forget about those things too)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Part 3 - Putting things into practice with the abstract dataset\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prelude to Exercise 1: Some theory on the Zipf's law. \n", "\n", "\n", "**Zipf's Law:** Let $f(w)$ be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. The [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) states that the frequency of a word type is inversely proportional to its rank (i.e. f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type. \n", "\n", "\n", "> _Reading_\n", "> Skim through the Wikipedia page on the [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 1: Tokenization and Zipf's law**.\n", "> Consider the list of abstracts of Computational Social Science papers. You shall take the paperIds of Computational Social Science papers (from Week 4), then go back to your *abstract* dataframe (Week 2) to get the abstracts only for those. \n", "> \n", ">\n", "> 1. Tokenize the __text__ of each abstract. Create a column __tokens__ in your dataframe containing the tokens. Remember the bullets below for success.\n", "> * If you dont' know what _tokenization_ means, go back and read Chapter 3 again. **The advice to go back and check Chapter 3 is valid for every cleaning step below**.\n", "> * Exclude punctuation.\n", "> * Exclude URLs\n", "> * Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).\n", "> * Exclude numbers.\n", "> * Set everything to lower case.\n", "> * **Note** that none of the above has to be perfect. And there's some room for improvisation. You can try using stemming. Choices like that are up to you.\n", "> 2. Create a single list that includes the concatenation of all the __tokens__ (from all abstracts). \n", "> 3. What are the top 50 most common tokens in your corpus? \n", "> 4. Write a function to process your list of tokens and plot word frequency against word rank using. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale). What is going on at the extreme ends of the plotted line?\n", "> 5. Generate random text, e.g., using random.choice(\"abcdefg \"), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, and generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's Law in the light of this?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prelude to Exercise 2: Some theory on bigrams and contingency tables .\n", "\n", "In this course, we use a \"bag-of-words\" approach, beucause using simple methods to explore the data is **very** important before applying any complex model. \n", "Here, we learn how to account for an issue that often comes up when using a bag-of-words approach when studying textual data. \n", "The concept of [*collocation*](https://en.wikipedia.org/wiki/Collocation), or *pairs of words that tend to appear together more often than by chance*. It is an important concept in linguistics. \n", "\n", "In the case of collocations, words should be considered together to retain their original meaning (e.g. *machine learning* is not simply *machine* and *learning*. The same applies to *computer science*, *social media*, *computational social science* ).\n", "\n", "**How do we find out if a pair of words $w_1, w_2$ appears in a corpus more often than one would expect by chance?** We study the corresponding *contingency table*.\n", "Given a corpus, and two words $w_1$ and $w_2$, a contingency table is a matrix with the following elements:\n", "\n", "$$C_{w_1,w_2}= \\begin{bmatrix} n_{ii} & n_{oi} \\\\ n_{io} & n_{oo} \\end{bmatrix}$$\n", "\n", "$n_{ii}$: the number of times the bigram ($w_1$, $w_2$) appear in the corpus \n", "$n_{io}$: the number of bigrams ($w_1$, * ), where the first element is $w_1$ and the second element is **not** $w_2$ \n", "$n_{oi}$: the number of bigrams ( * , $w_2$ ), where the first element is **not** $w_1$ and the second element is $w_2$ \n", "$n_{oo}$: the number of bigrams ( * , * ) where the first element is **not** $w_1$ and the second is **not** $w_2$. \n", "\n", "Then, we can compare the observed number of occurrences of the bigram, $n_{ii}$, with the number of occurrences we would expect simply by random chance.\n", "The value we would expect by chance is equal to the product between three terms: the total number of bigrams, $N$; the probability that a bigram starts with $w_1$, which is equal to $(n_{ii} + n_{io})/N$; the probability that a bigram ends with $w_2$, which is equal to $(n_{ii} + n_{oi})/N$. \n", "\n", "\n", "To check if the number of times our bigram appears, $n_{ii}$, is close to the value we expect by chance, we can run a [Chi Square test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html).\n", "\n", "\n", "\n", "**Note:** [*contigency tables*](https://en.wikipedia.org/wiki/Contingency_table) can be used in any statistical problem where one wants to study multivariate frequency distributions (not just in the case of textual data). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise 2: Bigrams and contingency tables**. \n", "> 1. Find the list of bigrams in each of the abstracts. If you don't remember how to do it, go back to [Chapter 1](http://www.nltk.org/book/) of your book. Store all the bigrams in a single list. \n", "> 2. For each unique bigram in your list:\n", "> - compute the corresponding *contingency table* (see the theory just above)\n", "> - compute the p-value associated to the [Chi-squared test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)\n", "> 3. What is the sum of all the elements of a contingency table? \n", "> 4. Find the list of bigrams with p-value smaller than 0.001.\n", "> 5. How many bigrams have you found? Print out 10 of them. What do you observe? Which bigrams does this list include? \n", "> 6. *(Optional)* Recompute the __tokens__ column in your dataframe. This time, do not split pairs of words that constitute a collocation (they should be part of the same token). **Hint:** You can use the [MWETokenizer](https://www.nltk.org/_modules/nltk/tokenize/mwe.html). \n", "> 7. Save your filtered abstract dataframe with the new __tokens__ column. **You can also delete your old abstract dataframe, that included all papers.**\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }