{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "We're now switching focus away from the Network Science (for a little bit), beginning to think about _Natural Language Processing_ instead. In other words, today will be all about teaching your computer to \"understand\" text. This ties in nicely with our work on wikipedia, since wikipedia is a network of connected pieces of text. We've looked at the network so far - now, let's see if we can include the text. Today is about \n", "\n", "* Installing the _natural language toolkit_ (NLTK) package and learning the basics of how it works (Chapter 1)\n", "* Figuring out how to make NLTK to work with other types of text (Chapter 2)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **_Video Lecture_**. Today is all about working with NLTK, so not much lecturing - you can get my perspective and a little pep-talk" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkz\nODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2MBERISGBUYLxoaL2NCOEJjY2NjY2NjY2Nj\nY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY//AABEIAWgB4AMBIgACEQED\nEQH/xAAbAAACAwEBAQAAAAAAAAAAAAAAAwECBAUGB//EADwQAAIBAwICBwYFAwQBBQAAAAABAgME\nESExBRIiMkFRcZHRBhMVUmGBFCMzQlMWkqEkQ3Kx0gdiY8Hw/8QAGAEBAQEBAQAAAAAAAAAAAAAA\nAAECAwT/xAAgEQEBAQEAAwEAAgMAAAAAAAAAARECITFBEgMTIlFh/9oADAMBAAIRAxEAPwD5+AAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1f6fu/5KP9z9CV7PXb/wByh/c/QuJs\nckDsf05efyUP7n6B/Td5/JQ/ufoMNjjgdj+nLz+Sh/c/QP6cvP5KH9z9BhsccDr/ANO3n8lD+5+h\nP9N3n8lD+5+gymxxwOv/AE5efyUP7n6B/Tt3/JQ/ufoMpscgDqvgF0v9yj5v0I+A3X8lHzfoMNjl\ngdT4DdfyUfN+gfArr+Sj5v0GGxywOn8Dufno+b9CHwS5X76Xm/QYa5oHQfB7hfvpeb9CPhNf56fm\n/Qi6wAdBcHrv/cpeb9Cfgtx89LzfoBzgOi+DXPz0vN+hV8IuF+6n5v0AwAbvhVf56fm/QPhVf56f\nm/QGsIHRjwW5ktJ0vN+he34DdXFX3cKlFP6t+gTXLA7tf2Uv7eHNOrbNfSUvQx/Brj56Xm/QK5wG\n58KrredPzfoV+G1vmp+b9AMYGv4dW+aHm/QPh1b5oeb9AMgGv4dW+aHm/QlcNrN45oeb9AMYG18L\nuEs9DzfoV+H1s9aHmwMgHRXBrmUcp098bv0LLgd04OXPS07Mv0Jq5XMA6D4PcJdek/u/QXPhtaG8\noeb9BsMrGBq/AVfmh5sj8DV+aHmyozAafwNX5oebD8FU+aHmBmA0/gqnzQ8w/BVPmh5sDMBqjw+r\nLaUPNky4fVjvKHm/QDIBo/B1O+PmSrGq/wB0PNk1cZgNfw+r80PNk/Da3zQ836DTGMDauGVn+6n5\nv0LLhNd/vp+b9BpjAB0PhFx89LzfoHwe4+el5v0GmOeB0Pg9x89LzfoSuDXDfXpeb9C6Y5wHU+BX\nX8lHzfoQ+B3K/fR836BHpi8UVGRRtzWSLNAkTgCmCWtA7SXsBRIs1oCRYqF4IaL4IYCZIXgbMWwI\nwQywNEVTBWSGFWgEtFXDI1oMEVncGiMtD5IVJEEKt3l1OEhTiUwF1qdOMtiv4d9glTlE0UbnD1AZ\nTozUdh/CYv8AGPKNNtVpTjrgfYQpu6bjjcodxbSh9jzrPR8ZWKJ51krUZ6m4pjqiFNakSqYLOnJJ\nPGj7SYxTeqZspxXu+WUcoluLzNYo02bYWaUeZ4l2osnCDWiIlXjBYWEu9MzenScYl8iWO36matGH\nPnHLLt7mKuLle83z9SJ1HO1k1q4a/YzreRvpVYVKSi9GtyYTUVlbSxlf4ONRufeKSzjCN9tLmp02\n3nouRSKzn7uqoN64/wDsW6iqrl7jPe1WqsJZ7MEKeacavZ2oJTp08JYF4LurzeDIWGWMWKAWlgqa\nRABkMhDaW5NbYrR3LV9gRmGRFoZEy0Yi6KIugLxGIpEYgLEkEgBenuVReG4DisixEiocNgKHQOri\nui2CEWAo9wexL3B7AViXa0KxLvYqKEPYsQ9gETF4GzFgRgMFsEYIqrRDRchgJYEyKsgiQtl2VYUt\nlcF2ioEYBIklIgfSTUdGb+BybuZZfaY6a6Bu4FH86T+pRu4y/wAo8+9jvcZ/TOFLYlbjPKSzgFFT\n2IlDMmPt4Jvo08PvZm3FktTTo7J9nabY0E10Z69mRbzTT55NZ7ik2pxxTr4f1Rzt12kwq7pTjByU\nXLByqteMtFmNRftfabJ3NajUcZPLW2HhmS4qUriKclnDw3jVAYalV9jx9O438IxXlUjLVRgzDUoP\nmUZar9sl2nS4NRdJXEu+GCo5kYujdzp9zaOm/wAilyrdQx/+8jJKm53kJd7Q24nzTn3ZwgYz3T5u\nSXZzMrGf+mkvAibzRS7FLP2FZ1wVD/eYcV9B1Oaaw3qYa0sRptdxehNyj9UMRt31IwFOacN1kllY\nqMBgkCi9Fak19iaO4XAIyoZEWi8TLRiGIWi6AbEZEXEZECxIEgCL09ypenuA4rIuVkVDkOhsJjuO\nhsdnAyJYiJJFVZD2Je4FER3LPYrEuwipEiUEtgM8tygyRQADBIEVXBEloXIlsAhohrQYlqEloQIa\nKsvIowqjILMjAEYJQYLIDRBflnQ4BH8yT+phivyjp+z61b+oDONdQ4TWh3eOdU5FLGra+5mtzypT\nt6k30Ema4050YfmNa9wUaknP8vHixd9e04Qx71ub7jja9HPOMty608qMYzXkzm+8r0ajXI+V7xlq\ndC3t69d8+XhnYocInWiverJNbx5ypCdRKWXJf5IVpKck+V52em57GlwRRWkdB64RFPql1MjydHhr\nccOOVuvobKdi6dKoowxzI9PT4eorGCZWX/tQPDxNbh0qck0tkZallJR2+p7ipYuUsOOneLnwuOOq\nNXI8DUs5rTHYZZUZwy8anvKnCo/KYbjhUWn0RpeI8TPsymyIyfYsfc7t9wvlb5VqcWrRnSm4yRuX\nXLrmxMJtSWNEbqcuZHPjq8Sf3NNKful0isWNSQYCMoyScSSsmUdxF/PkRoo7mTinVAxU7lp6mylU\nU1oco1WbeRYsrpIuhcRkTKmxGxFRGxAsSQWAC9PcoMp7gOKyLlZFQ6I6OwlbjonZxMiWIiWIKMjs\nJe4MqIjuWZCJkUVQSBFZAKkVLS3KogkAJIqCGixD2AWis2WeguTApIoyzKsgqQWIAMEpagWiFaYr\n8o6ns+uizmf7R1/Z9flsorxpczS72Z6PD5zgtjffqPvk5apdhLk40svo/RHHuu/8ccjiNaFjQdOl\nHmnLu7zn8L4dUva/vKuXh7F7n3l9xBUKWdXjQ9nwvhlOzt4U4rVLVnN23CrHhkY4bidanbxisJDI\nQUUMNzly67t9KKmkT7tFiTeOe0r3aXYVlTQ4qyWLLWaVMq4I0NFJIzjpOmWpTWNjHWoJ9h0JozVj\nNdZXAvaC1yjg39jGpGWmuNz1d3DMWcerDpYwRr28VVpSo1HCW5MYuUH9DucWsuZc8Vqtjl0oYizc\nuuHXOVFvKUVjt7jWtVqLhCMtnhjVCSWpqMWG0THxTqm2ksGLia0ZWXJN1pHCyZKSzNZOhTSSWBSH\nxGxExGxMtGxGxFRGxAuWRCJABlLcoMpdYoeVmXwVmEMQ2IpDYnZwOgXKx2LEVR7gyQZUVRMgW4TK\nKrYiRaOxWewQmRBMiERUgBKIqAexIPYBMhUhsxMgKsglkAQ0QSABgtFalUXhuQasfknY4AvyTkS/\nRO1wFf6cqxe7WKuVjIq4WaGM6sbdrNXH11JlSU4JLY49e3o49E+zvDkq07maWVpE9LFYE2tJUqEY\npY0NESQ6urIkANxySBAFAVZLKmasDFSYyQqRmt8lTEzWUOkLkZrtGC4gsM5Nen0s4O3Vg2zBcU9z\nLccS7WY4Zw6tHkqvl2fYdq+0mc6vHmWVuajHbBJNPK27R8anNBdv1F1E1nvF0X0vp3G441tpPJi4\nm9DbRMPE+00w5kXhm2jUzozChkJYZvNjPp1IsZEz0Jc0TRE546Q2I6ImI6IDESQiyAlDKW4tDaO4\nGhIrNFyswiUNgLQ2B2cD47EhEl7EVUhkgyohbhMlbkyWhRRFagxIrNaBGZkIme5BFSSQSFBEnoSU\nmyBc2KkXZRgUZBYgCAAABDYboWhsN0Bpn+idzgS/032OLU/RO7wVYtfsFhd3U/NxjtG28lKUUuww\nX1RwrScn4Gixaik3uzhfb08+noodRDIoTRlmmh0SxjpYAA1GAQ85LAUUeSUSwIqk0xLjkfJlHglj\nXNKcBcoo0PGDNUnGO7M2OnNIqpJHNudHg3VK0MtZOfcyjJPD1M2OscHiXXObUniP1OnxHtOLWlqI\nz0pKopyFQ6M8dgmcunnPaMpz6SRtxrfR2MPE3ubqGxz+Jdc0w5uGTk0xgvdZMz3NxLGy0nnQ2xOV\nby5ZnUpvKRnqLD4DoCIj4mWjUWRRF0BKG0txQ2luBpKzLFZhFkOpikOpnZwh8QZMSGRUIGSiGVAt\ny72KRGFCwexMtGQ9gjLU3KItU6xBFBKILAQUmhhWexFZ2UYyRRgUAlkAQAEoAQ6nuhSHUl0kBpqL\n8o7/AAhf6VeBwav6Z6DhaxarwCuNxiWLqLeyZe2qt1IrIrjGHca9hWxbnXi5bHHr29HHp6+1/RiP\ncuVGJ31vRpJKfM0tkZZXFW4WW/dw7Et2Qza6crmMX0pJeLFz4lbw3qx8zF7lSprkppy75amarw+v\nPWUor/isDV/MdF8Yt+bljJS8GMhxGnU2eDifDuyai/sLdhVpPnozaa/a3lMfqtfiPTqopLKJcsI4\ndHi0KVLFSElNdhjufaOrlqnSjj66l1n8PRuoZq19Cistnmpe0dVrEqcfs8CHdTvpZlmNPsS3ZGvz\nHTvPaDD5aScn9Dkz4vd1Z/tXjI00rSMpZaxE0UqdjTliUoZI1jnK6vJa8tNr/kxFe9rxeZUWkt3F\n8x6HFnJdCUGc3iFCGOanhNdqBjjTvoXMeVvpdhguo9DPcPu6KquTilGrFZ0/cvUxVfdrKVectM7d\nvcWRi35WKo92MtW5S8AvKDpSWOrLYbaU+WPMac66FHY5vEn02dOj1TlcT67NMErLo6GZ7jqNTEcM\nVN5k8FhUrR5OjbT5o4Odg1WuUXpJ7dKLHRM0GPgzm6HxLoXEugLjaO4kbS3CNRWZJEwGI0UloZ1u\naqS0OzgYkQy6KPcihESLIiZUViMQpbjYlFahSWwyohU+qEZp7kIJbgRUkkIsAFZ7Fys9iKzyFsZI\nWwKkEkAAAAEofS6yExH0usiDVV/TR3+G6Wn2OBV6iO/YaWi8Arg8W1uSnDq0adwnNRkkurJ4TOtT\n4ZG/uZznJqMdEl2mK94FOleUYp9FzTTOfTvxPDdRSub2mnCMI9sY5xodGlSWW5Gf3Lo39Jv9yx9z\ndVi4weNzFdJ4Jr3tOhHCxlHJr8YuqlZ0qMEmlnL0WPua1w+pUqxnUfRTzgdd8OjVl7ymo8zjyyUi\nyb7Tq56cijxG6qtNYlnU6VGu6sdVh9xFpw6NutVFYWEkbLW0UNd+/UWf6al8eXLu6FP8ZTc9I1F/\nk60eGW0aCUoJvBi4jS99xWxoxWmXKXgsHZqaRGM3q/HjOO2VGi3KDUTFbT5JrTRHQ9pHLllKP7dT\nHYw95FT/AGvdk+Ol9rzuKtzNUaWUm8aFLuylb1J06k6nRp86UNMs71pY21JJwy3vk1V6FG5iverL\nWzZqePbHe308JSVwpJ5ksx5sS7DRTvZT6E2/B9h6eVnb0E3CKbx4s4d9YuVR1YxwS4vOz2wwhzXt\nNd8kcmNvnnbWcNpHft6bpOdxU2pRcvT/ACYI0XC3hndrLGlm1xbm4lOn7ua6UXozVawk6KeNjPdQ\nSuX9dTVCo3ywjp3l1ic61UeqcrifXZ2Ka5co4/E+uzccr4c8skVLI3yzV0a6O2hjRpovQd+jn22Q\nY+DM0GPgzk6NMGMQmDGoBiGUtxKY2k9QNZEiEwmVD47mqnsZoLU1Q2OjiuUe5ZlXuBKKzLIpNgQh\nkRSeo2BpkVBM+qMqMVPYDO9wCW5GSKsSUyWyQWRWpsHMVnLKClMoy0mUyBDKl2VAgCQAmI+j1kIi\nPo9ZEVpq7I9DZaWi8Dz1X9viehtdLP7ATYZ5JyT1Uy17UlK/tIvvyI4RWUrmvQe/WRq4jSxc21Xu\nlg5V6ebMh97RdSipQ68HzIZRqRr0Yzj27ruY2L0M8qE7eo6tBc0Zdan3+AZ09QfYRJTxpgKdzSn+\n7ll8stGN5kMTayuhKT6T+yLxgoLXRIvUr0qSzOaiYa8q14/dwTp0nu3u0K3Laiwg69/WvZdTHu6X\nh2s6NTqlaVONOnGnBYjFYRaprArH15TjUc1WmtJHL4bUVncO2raQn1JP/o9JxO395F9/eef93Gu3\nRrLVbM5+npzXorSKUTXyruPP2U76zagsVoLZSeH5nUhxHT8y2rwf/DP/AEXWbK0Tox7jJXoJppLL\nG/j1JdC3rzf/AAx/2Zq6urjSo1b03uovM39+wJlci+gqj/DUtYRalWmtm+yJhuscjO1WVOjS93Ti\nowXYcC8lukRvMjicQ/VjIfbR6SeNxN4uZx8TdaJOKfcaYns9rD+xxOJdd+J295SZwuJP8x+J0jz9\neaxEoqSjcrJiHUWJ7BtF6l79JPbXAfFmeA+JxdWiDHRZngx8WBcZS3FDKXWA1oiQIiRUaovDHKrg\nQ0VydHFq96HOjNkOYI1KZSchPOQ5gNi9R8GZIzwOVVYKLyeWLnsRzrJWc1gqEyepTISlqVyRVslk\nxWSckDckNi+YOYKJC2WbKgQAAAAQAFoj6PWQiI+j1kRWmpvHxPQW+ln9jz0+vDxPQUtLP7Ace2uX\nb8bhPOjfKz1tzRVenFdzyjwl1JxvHJbp5PeWs1WtKc12xTMV0lFMcnoJh2jURrpSrRp1evCMvFCv\nwlFfsfmzUgGJLjPG3hF5hTin341LKGPEY3l4QqrUVGLctgs2mQwWljGpyocTU6mFSqKPzNFrm+5V\nlak/Ua/rukcUuYUtG9WefqThOtppntK8avJTrdCLnJdiZnsak7rPvKDpyj25yjna9PMek4e41afJ\nNJtaG78PjqSaPPcKusX0qfN2Hp6cuaJqM9eCXTnjczV4S5dTdJ4M1eSwKc15+8i8s4l52nev5JNn\nCuVlSZGq5tXHaOtHy09e0z1H0sGqDSgtNjTluHR7ThcS/Vfidyn1Di8Q1qPxOkea1gLRBrBU1PCG\ntrBai9RORlF9IvV0kxuiOiIgPic2zoD4iID4kVcZS6wsZS3A1IiRJEistrQuSHSZRo3rnhWCBmCr\nLpipDIbK5CL5I5iuSuQi7kwc2LbDJRLZBGSMkFshkrkjIFgK5DIEsgjJGQqxBAASBAAXRoodZGeJ\noodYiny/Uh4noI6Wf2PPv9WHid/OLP7BXl7t/wCol4nr/Zq499wyEW9YPlPHXT/1E/E38B4mrG55\nKj/LqPfuZmtx7NfqSRdbiIVIzmpReU0PRlqrkNgQ3gMpSF14RqR5ZLRhKehG4ak+s34OEV0WzNc0\nujjB0thE2nLDM2OvPVeRu7Kp79yXVfaZlQrYxzuK+h7GvbQlHY5da1S1xsZx1nWufwy0VGopdv1P\nRW9XTGTjOapy3H0Lh82AXy605aGG5n3lnX5UYrirzNompHOvpZbOZWX5TZ0Lh80n4GC7xGi0IVxZ\nvpvxHqfRwY6lRRyxdO+XvOkui+06SOPVyO1S/TONffqPxOrSrU3T0mjkX005vDydI4VlnsLJk8kC\nkAyj1xZen10RW+A+Jnp9hpgRo6A+OwmmaIbAWGUlqUwNorUB6REi5EkEaQADTKCsti5SewCmUZZl\nWVhBAMgoCAIAAAgIAAAAgAAAIAigAJKAkgCC8TRQ6xniaaG4U7/fh4nenpZ/Y89OrClVjKckkjRd\n+0VlTtuSNRSljZBXMuX+fPxM85xXacy54nOrUk46JsxTuZyfWZMV7/2Xvpu5dCU3KOOim9j2CeT5\nH7N306HGrZyl0ZS5X9z6xCWYJmOvFbnmHZ0FyZOcoVUeDNrUg3GRwlqzlXF/KnU5KVOVSfcjnXt/\nxecfyLF+Dlgmuv8AXa9P7yntkW4RbymjydPifF6cPzbOMPs2TPjlWm8SoVvGJdWfxf8AXqqjhGOs\njm15x6WJI85U47J5Xua0n4GCrxmvLLhbVvuS3XScY7dzNJtiaFbNdHCnxS8ln/TP7sfY17r3qlOE\nVF9iMj0lWr0TFzye5s9y6lOLw12mavTcJbYIyy1Xh4fac7ibxSaRrnPmrfRHN4rU5aaWdSxK4lbW\nEvAxGvOYy8DIdo83SctbNkABWQAAUBaHWRUmO4HQpPRGqBjoPKRspmWj6aNMEZ6aNUFoBOBtJalM\nDKS1AegmiyREkENJIDJUSLnsXKSCFMqy7Ks0yoyCWVKgIJIACCScAVIZZoqwIJwQSmAYDAZAAwTg\nAIqMFkAYIuLRKV7+naxfbLuE3V3GhFrOpw7ivKrNtvcsgvecQq3M3mWI9xkbI7SSiewW9xmyFyCn\nUJuE4zjo4tNH17gt7G94dRrReeaKz4nx2kz2XsTxVUasrGo9JdKDb7e1GOp4b4r30ZYZFRcyF82Z\nJl5SwcnRFKhCMubHSLSgm87MmMizWUU26S4J9aORVShRa6iGVZOKMdWvVWcJYGunOpdvbpa04+Rz\nr2nS/ZCPkXqXFZ6YRlr87jl5ZLXaVhrRgtcJsvY01KqpSWiI9zKUuZ7GmlFU+l3GCuopKMdDmcUq\nKNJv6DXcYWTjcWuXU6KfaVyIhV1b+hyOLV8yxk2yqKFNyzocK4qOpVZqRjqoTxTkZjRU6NHxM50j\nj0AACsgAJSKBE6BjAFRaMnHqs2212tI1NPqc8nOCK9FSaaTTyjVBHm7a7nQl0XldzO5Z31Kuks8s\nu5kxdbMDaS1KoZSWoDsESRdIiSCACAAkrInJWTCKMqyxVlZUkVZZlSogADAAixCJCoZRl2VwERgg\ntgMAVJRPKSohUEk4JwRVUhN3cRoQeuo2tUjSpuTOBdV3VqNtlkVWrVlUk5NmdvUu3oL7TSIkEWWk\ntBcXqRTXsKmN7BUxQU9zTGrOjOFWnLlnF5TMsdx71iB9M9nuM0+KWUZN4qR0nHuZ2s8yR8e4VxKr\nwq+jXpPT90e9H1DhfFaHEbWNajJNNaruONmO3N10oprtHJ6CYSTRaMiNJqLJhqwkuw6GmCr5W9UF\nlcp28s5cWE7bMcSWDp5T0RluHjLJW5XPnQjHQy3GFFx2H1q2G3nJyLq65ubXUzi6mrcRVNqW62ON\nXre8rFLu75tIsxTrckG86lkYtF9cPqRZhhFyaSLPM5cz7TRGCpUnUlvg25+2S5fSUVtEQTKXNJt9\npBtzoACyRUCWS2MEA2VAQBKiBAFnpuQwK7FoyaeU8MgAOnacYqUsRq9OP+TuWnELWthqqk+5njw2\nZFfQlrsRJHlOGcbrW0o060veUvruj1UZxq04zg8xksogpkMlQAtkpIkAihDLaBoVC8EcpfKDmQFO\nUnlBzRDqATyhgo6hDqBDNCNBbmyOZlw0zKI5kL5iMjDTXJEc4rIA0znI5ygq4qe7pNjBk4hccz5U\n9jlOXSNFSXNlmVvDNKY3lCy2SvaRV1rEU9JF4vQrVWmQGdguZaDzFETApHccuoIW45dQQJnudPgX\nEa9hcxdKXRk0pRezOZPcZQb1xuZV9ap1p0nGNdY5llM1wqJy30wZOHVafF/Z+1r7yUeWX0a0MlWr\nWs5Yac4LZ9xzsyu3N2O57xJfQVUrpao4suLx5dWs9xkq8Xjr0l5mdbj0cauYZMV3cpRaysnHXGY8\nrw3kw3PFFPbVsL4aLy6cdnp9DhXV20njd7EXF1KS7vEx4lN5w22JEtU5svmb1KOLqPL27jRGg3Lb\nLNlG0UVzTLrGayW9q2+aa0MnELhTl7uGy3NfEb+MIujR63a12HHNSM9X5AAEpZNOaUixGO4sloaR\nGyKssyEu0CUu0N9iG2w0QA9/+yA3B94EEgQRQQTggAO7wHi3umrWu+g+o+44QJ4eSD3POR7wz5JL\njOne8KuoLILiaY6hVzZVkZAs5MjmIIAnJGQAIMgVAokCAAnIEBgCSQSBogg5nEK3NLlWyN1eXu6T\nZxqr5m2WKonlCpYT1LReuCs1qFVzqXpUp168KVNZnOWEhWT1fsDw5XfFncTWY0Fp4ktakd3hfsRa\nU7RTuszqNa/Q8v7U8JpcNrpUeo+w+qV5qFJrOD5l7aV3UulHOxie3TP8XmqWxMitMszbkXsx0eqK\nki8HoBSpuFF4mEysdJIn0ey9juPw4dXlaXLxb1nv8r7z2N3CE1lNOL1TR8kWkjq8M4/dWH5bk6lL\nsjJ7eA6mtc9Y9hX4fCrnC5X3o51awqU88uGvqhlt7S2dWSjKfK32tG+VxRqw5o1IyX0Zysx2nlwK\ntOstOijJOhKWjl5HXu6lLn0kvrqZJVKME25rQisCtFu8vxGq17XohdbilClnGG/M59zxec1in5ly\n1m2R1JToUE3KSyjk33E51c06T5Yd/eYKlWdR5nJtlDU5YvegAA0wlFtkCWCUaiJisv7Et6gtEVfe\nUH17ABsOzsIDZalSdw0AFsyCcf4IACAYEUEEsgAAAIPXok4Pxq5+Sl5P1D43c/JS8n6l1nHeDBwf\njdz8lLyfqHxu5+Sl5P1GmO9gMHB+N3PyUvJ+ofG7n5KXk/UumO9gDg/G7n5KXk/UPjdz8lLyfqNM\nd0jJw/jVz8lLyfqHxq5+Sl5P1GpldzAcpw/jdz8lLyfqQ+NXL/ZS8n6jVx3cBp3nBfGLl/tp+T9S\nr4rcPsh5P1JpjvucV2lHWijg/Eq3yw8n6h8RrfLDyfqNXHcdx3DKVK5uFmlSlJd6RwI8SqqSfJTf\n0afqdq39uOIW1JU6VnYRiv8A45f+RFkn1n4jCvSXLUhJfY5beTq3ntdeXsHGtZ2TT7oS/wDI4k68\npScuWMc9iLKWT4mTxIs3lCXNvuBVGl2F1MWe59K/9OKEY8IqVu2dR/4PmfMzu8F9ruIcEs/w1tRt\npwy3mpGTf+GiVY+qXiUoNHy72shyX/ijVU/9QeLVFrb2X2hP/wAjgcS4nX4lW97XjCL7oJpf9mJM\nuun6n5xlg9S4pPBPO/ob1yXlsFPZlHJsFNrYuiZ7lCW8kEU+Dzgh7i4zcdgc2+4upi8ngmFxWisQ\nqSXgxbm3vgjJNDHcVs/qS8yJVqko8spya7si2BF0AAAAAAATFEEp4AsWS1F8z7kTzv6GtQx6lZdx\nXnf0I5mNFluS99Cqm1HGF4kczJov3akYxsV5mHMxos9M/UgjLDI1QBGQAAACAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/9k=\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"Ph0EHmFT3n4\",width=800, height=450)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Installing and the basics\n", "\n", "> _Reading_\n", "> The reading for today is Natural Language Processing with Python, first edition (NLPP1e) Chapter 1, Sections 1.1, 1.2, 1.3\\. [It's free online](http://www.nltk.org/book_1ed/). \n", "> \n", "> * **Important**: Do not use the newest version of this book. Use the first edition. (The newest version is based on on Python 3).\n", "> * **Important**: Seriously, remember that we're using the *first edition*.\n", "> " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> _Exercises_: NLPP1e Chapter 1\\.\n", "> \n", "> * First, install `nltk` if it isn't installed already (there are some tips below that I recommend checking out before doing installing)\n", "> * Second, work through chapter 1. The book is set up as a kind of tutorial with lots of examples for you to work through. I recommend you read the text with an open IPython Notebook and type out the examples that you see. ***It becomes much more fun if you to add a few variations and see what happens***. Some of those examples might very well be due as assignments (see below the install tips), so those ones should definitely be in a `notebook`. \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK Install tips \n", "\n", "Check to see if `nltk` is installed on your system by typing `import nltk` in a `notebook`. If it's not already installed, install it as part of _Anaconda_ by typing \n", "\n", " conda install nltk \n", "\n", "at the command prompt. If you don't have them, you can download the various corpora using a command-line version of the downloader that runs in Python notebooks: In the iPython notebook, run the code \n", "\n", " import nltk\n", " nltk.download()\n", "\n", "Now you can hit `d` to download, then type \"book\" to fetch the collection needed today's `nltk` session. Now that everything is up and running, let's get to the actual exercises." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> _Exercises_: NLPP1e Chapter 1 (the stuff that might be due in an upcoming assignment).\n", "> \n", "> The following exercises from Chapter 1 are what might be due in an assignment later on.\n", ">\n", "> * Try out the `concordance` method, using another text and a word of your own choosing.\n", "> * Also try out the `similar` and `common_context` methods for a few of your own examples.\n", "> * Create your own version of a dispersion plot (\"your own version\" means another text and different word).\n", "> * Explain in your own words what aspect of language _lexical diversity_ describes. \n", "> * Create frequency distributions for `text2`, including the cumulative frequency plot for the 75 most common words.\n", "> * What is a bigram? How does it relate to `collocations`. Explain in your own words.\n", "> * Work through ex 2-12 in NLPP's section 1.8\\. \n", "> * Work through exercise 15, 17, 19, 22, 23, 26, 27, 28 in section 1.8\\. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with NLTK and other types of text\n", "\n", "So far, we've worked with text from Wikipedia. But that's not the only source of text in the universe. In fact, it's far from it. Chapter 2 in NLPP1e is all about getting access to nicely curated texts that you can find built into NLTK. \n", "> \n", "> _Reading_: NLPP1e Chapter 2.1 - 2.4\\.\n", "> " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> _Exercises_: NLPP1e Chapter 2\\.\n", "> \n", "> * Solve exercise 4, 8, 11, 15, 16, 17, 18 in NLPP1e, section 2.8\\. As always, I recommend you write up your solutions nicely in a `notebook`.\n", "> * Work through exercise 2.8.23 on Zipf's law. [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) connects to a property of the Barabasi-Albert networks. Which one? Take a look at [this article](http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf) and write a paragraph or two describing other important instances of power-laws found on the internet.\n", ">" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }