{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK (Natural Language Toolkit) package\n", "\n", "## What is that?\n", "**NLTK** (short for Natural Language Toolkit) is one of the most popular libraries for natural language processing (NLP). It is written in Python and has a lot of useful methods, lots of tutorials, and a very supportive community.\n", "\n", "What is NLP?\n", "\n", "The aim of NLP is to enable *computers to understand human languages*. Here are some practical examples of NLP tasks:\n", "* speech recognition;\n", "* understanding synonyms;\n", "* sentence analysis;\n", "\n", "etc.\n", "\n", "## What for?\n", "Here are some examples of NLP that you have definitely used:\n", "* **Search engines** - many popular search engines like Google, Yandex, etc. show the most relevant results for each user, depending on their search history and interests.\n", "* **Spam filters** - using NLP techniques, modern spam filters can understand the purpose of an e-mail and quite precisely detect whether it is spam or ham.\n", "* **Conversational agents and Chatbots** - Siri by Apple and other agents understand you and can reply to you because of NLP.\n", "* **Social networks** - giants like Instagram and Facebook fill your feed with the posts and pictures that you are most likely to be interested in. How do they achieve it? NLP.\n", "* **Plagiarism detection**\n", "\n", "With the NLTK library, you will be able to build things like these, too.\n", "\n", "### First steps\n", "\n", "Make sure that you have NLTK installed on your computer. If it isn't, run this command in your terminal/cmd: `pip install nltk` or `conda install -c anaconda nltk` (for Anaconda users).\n", "\n", "Next, import the NLTK library and download its data packages. You can easily install all of them, as they are not that heavy." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.download()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classical operations with NLTK\n", "\n", "In NLP, it's common to start by splitting text into tokens. The most popular approaches are tokenization by sentences (**sent_tokenize**), by words (**word_tokenize**), and by words and punctuation (**wordpunct_tokenize**, in case you want to retrieve text only). These approaches can be seen below:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize\n", "\n", "text = \"NLTK (stands for Natural Language Toolkit) is one of the most popular libraries for natural language processing (NLP). 
It is written in Python, has a lot of useful methods, as well as lots of tutorials and very supportive community.\"" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['NLTK (stands for Natural Language Toolkit) is one of the most popular libraries for natural language processing (NLP).', 'It is written in Python, has a lot of useful methods, as well as lots of tutorials and very supportive community.']\n" ] } ], "source": [ "print(sent_tokenize(text))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['NLTK', '(', 'stands', 'for', 'Natural', 'Language', 'Toolkit', ')', 'is', 'one', 'of', 'the', 'most', 'popular', 'libraries', 'for', 'natural', 'language', 'processing', '(', 'NLP', ')', '.', 'It', 'is', 'written', 'in', 'Python', ',', 'has', 'a', 'lot', 'of', 'useful', 'methods', ',', 'as', 'well', 'as', 'lots', 'of', 'tutorials', 'and', 'very', 'supportive', 'community', '.']\n" ] } ], "source": [ "print(word_tokenize(text))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['NLTK', '(', 'stands', 'for', 'Natural', 'Language', 'Toolkit', ')', 'is', 'one', 'of', 'the', 'most', 'popular', 'libraries', 'for', 'natural', 'language', 'processing', '(', 'NLP', ').', 'It', 'is', 'written', 'in', 'Python', ',', 'has', 'a', 'lot', 'of', 'useful', 'methods', ',', 'as', 'well', 'as', 'lots', 'of', 'tutorials', 'and', 'very', 'supportive', 'community', '.']\n" ] } ], "source": [ "words = wordpunct_tokenize(text)\n", "print(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, there are lots of unnecessary words, e.g., articles or commas. 
These are also called *stop-words*, which are often excluded from the analysis, as they carry no additional meaning and do not change the main theme of the text. Let's get rid of them:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['nltk', 'stands', 'natural', 'language', 'toolkit', 'one', 'popular', 'libraries', 'natural', 'language', 'processing', 'nlp', 'it', 'written', 'python', 'lot', 'useful', 'methods', 'well', 'lots', 'tutorials', 'supportive', 'community']\n" ] } ], "source": [ "from nltk.corpus import stopwords\n", "\n", "# build the stop-word set once instead of reloading the list for every token\n", "stopWords = set(stopwords.words('english'))\n", "\n", "# the check is case-sensitive, so the capitalized 'It' is not removed\n", "noStopwords = [token for token in words if token not in stopWords]\n", "\n", "# keep only alphabetic tokens and convert them to lower case\n", "noStopwords = [word.lower() for word in noStopwords if word.isalpha()]\n", "\n", "print(noStopwords)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK provides lists of stop-words not only for English, but for other languages as well.\n", "\n", "Now let's find the 20 most-used words in the text and plot them." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from nltk import FreqDist\n", "\n", "frequence = FreqDist(noStopwords)\n", "\n", "# this line displays the graph correctly\n", "%matplotlib inline \n", "frequence.plot(20, cumulative=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of this analysis are not that interesting, but NLTK understood that the text is about itself. Nice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieving synonyms and antonyms\n", "\n", "NLTK includes **WordNet**, a lexical database of English. 
It includes groups of synonyms and antonyms that you can use during your analysis. Let's try to find synonyms and antonyms for the first 5 words from our cleaned list:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WORD stands\n", "DEFINITION a support or foundation\n", "USAGE EXAMPLES \n", "['the base of the lamp']\n", "SYNONYMS\n", "['base', 'pedestal', 'stand', 'stand', 'stand', 'stand', 'rack', 'stand', 'stand', 'standstill', 'tie-up', 'point_of_view', 'viewpoint', 'stand', 'standpoint', 'stall', 'stand', 'sales_booth', 'stand', 'stand', 'bandstand', 'outdoor_stage', 'stand', 'stand', 'stand', 'stand_up', 'stand', 'stand', 'stand', 'remain_firm', 'digest', 'endure', 'stick_out', 'stomach', 'bear', 'stand', 'tolerate', 'support', 'brook', 'abide', 'suffer', 'put_up', 'stand', 'stand', 'stand', 'stand', 'stand', 'stand_up', 'place_upright', 'resist', 'stand', 'fend', 'stand']\n", "ANTONYMS\n", "['sit', 'yield']\n", "\n", "\n", "WORD natural\n", "DEFINITION someone regarded as certain to succeed\n", "USAGE EXAMPLES \n", "[\"he's a natural for the job\"]\n", "SYNONYMS\n", "['natural', 'natural', 'cancel', 'natural', 'natural', 'natural', 'natural', 'natural', 'natural', 'natural', 'instinctive', 'natural', 'raw', 'rude', 'natural', 'natural', 'born', 'innate', 'lifelike', 'natural']\n", "ANTONYMS\n", "['unnatural', 'artificial', 'supernatural', 'sharp']\n", "\n", "\n", "WORD language\n", "DEFINITION a systematic means of communicating by the use of sounds or conventional symbols\n", "USAGE EXAMPLES \n", "['he taught foreign languages', 'the language introduced is standard throughout the text', 'the speed with which a program can be executed depends on the language in which it is written']\n", "SYNONYMS\n", "['language', 'linguistic_communication', 'speech', 'speech_communication', 'spoken_communication', 'spoken_language', 'language', 'voice_communication', 'oral_communication', 
'lyric', 'words', 'language', 'linguistic_process', 'language', 'language', 'speech', 'terminology', 'nomenclature', 'language']\n", "ANTONYMS\n", "[]\n", "\n", "\n" ] } ], "source": [ "from nltk.corpus import wordnet\n", "\n", "# look at the first 5 words of the cleaned list\n", "for word in noStopwords[:5]:\n", "    foundWord = wordnet.synsets(word)\n", "    # words that WordNet does not know are skipped silently\n", "    if foundWord:\n", "        print(\"WORD \" + word)\n", "        # shows the definition of the first sense of the word\n", "        print(\"DEFINITION \" + foundWord[0].definition())\n", "        print(\"USAGE EXAMPLES \")\n", "        # shows usage examples for the first sense\n", "        print(foundWord[0].examples())\n", "\n", "        synonyms = []\n", "        antonyms = []\n", "\n", "        # collects the lemma names of every sense as synonyms, plus any antonyms\n", "        for syn in foundWord:\n", "            for lemma in syn.lemmas():\n", "                synonyms.append(lemma.name())\n", "\n", "                if lemma.antonyms():\n", "                    antonyms.append(lemma.antonyms()[0].name())\n", "\n", "        print(\"SYNONYMS\")\n", "        print(synonyms)\n", "        print(\"ANTONYMS\")\n", "        print(antonyms)\n", "        print(\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, no information about *toolkit* and *nltk* was found, but we'll be fine - we still retrieved a lot of interesting information. No need to use a dictionary or Google Translate when you have NLTK. :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stemming and lemmatizing\n", "\n", "Word stemming removes affixes from a word and returns its stem. Many search engines use this approach when indexing pages, as people may write the same word in different forms that all mean the same thing.\n", "\n", "Word lemmatizing is similar to stemming, but lemmatizing always returns a valid word." 
] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "increas\n", "increase\n" ] } ], "source": [ "from nltk.stem import PorterStemmer, WordNetLemmatizer\n", "\n", "# the stemmer returns \"increas\", which is not a real word; the lemmatizer returns \"increase\"\n", "word = \"increases\"\n", "\n", "stemmer = PorterStemmer()\n", "print(stemmer.stem(word))\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "print(lemmatizer.lemmatize(word))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part-Of-Speech tagging\n", "\n", "For more precise analysis, NLTK offers lots of features built on part-of-speech tagging. I will show you only the most common operation: tagging itself." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('nltk', 'RB'), ('stands', 'VBZ'), ('natural', 'JJ'), ('language', 'NN'), ('toolkit', 'NN'), ('one', 'CD'), ('popular', 'JJ'), ('libraries', 'NNS'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('nlp', 'IN'), ('it', 'PRP'), ('written', 'VBN'), ('python', 'JJ'), ('lot', 'NN'), ('useful', 'JJ'), ('methods', 'NNS'), ('well', 'RB'), ('lots', 'NNS'), ('tutorials', 'NNS'), ('supportive', 'VBP'), ('community', 'NN')]\n" ] } ], "source": [ "from nltk import pos_tag\n", "\n", "taggedWords = pos_tag(noStopwords)\n", "print(taggedWords)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, NLTK tagged each word with its part of speech. To see the list of available part-of-speech tags, take a look here: https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's put the gained knowledge into practice!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chatbots\n", "\n", "There is a module in NLTK called [**nltk.chat**](https://www.nltk.org/api/nltk.chat.html) which allows you to build a simple yet working chatbot. It's probably not the most intelligent one, but it definitely will work.\n", "\n", "To create a simple chatbot, you will need **Chat** - a class that contains the chatting logic for your bot, and **reflections** - a dictionary that maps input values to output values; usually these are synonyms or different forms of the word, e.g., here are some default reflections:\n", "\n", "`reflections = {\n", " \"i am\" : \"you are\",\n", " \"i was\" : \"you were\",\n", " \"i\" : \"you\",\n", " \"i'm\" : \"you are\"\n", "}`\n", "\n", "You can use the package's default reflections or create your own. There are also **pairs** - basically, each pair is a regular expression that matches the user's input, plus a list of possible replies.\n", "\n", "So, let's build our simple chatbot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hi, I'm your first chatbot. Let's chat!\n", ">hello\n", "Hello!\n", ">my name is test\n", "Hello test, How are you doing today?\n", ">i'm doing fine\n", "Nice to hear that\n", ">are you studying at the university?\n", "I'm studying at University of Latvia, I'm enjoying my time here!\n", ">it was nice to meet you, bye\n", "See you soon!\n" ] } ], "source": [ "from nltk.chat.util import Chat, reflections\n", "\n", "# each pair is [regular expression for the input, list of possible replies]\n", "pairs = [\n", "    [\n", "        r\"my name is (.*)\",\n", "        [\"Hello %1, How are you doing today?\", \"Nice to meet you, %1!\"]\n", "    ],\n", "    [\n", "        r\"how are you?\",\n", "        [\"I'm doing great, thanks! 
What about you?\"]\n", "    ],\n", "    [\n", "        r\"i'm doing fine\",\n", "        [\"Nice to hear that\"]\n", "    ],\n", "    [\n", "        r\"good evening|hello\",\n", "        [\"Hello!\", \"Hi!\"]\n", "    ],\n", "    [\n", "        r\"it was nice to meet you, bye\",\n", "        [\"See you soon!\"]\n", "    ],\n", "    [\n", "        r\"who is the strongest man on the earth?\",\n", "        [\"CHUCK NORRIS\"]\n", "    ],\n", "    [\n", "        r\"(.*) university?\",\n", "        [\"I'm studying at University of Latvia, I'm enjoying my time here!\"]\n", "    ]\n", "]\n", "\n", "def chatty():\n", "    print(\"Hi, I'm your first chatbot. Let's chat!\")\n", "    # Chat matches each input against the patterns and answers with one of the replies\n", "    chat = Chat(pairs, reflections)\n", "    chat.converse()\n", "\n", "if __name__ == \"__main__\":\n", "    chatty()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically, we create a list of pattern-reply pairs, instantiate the main **Chat** class with these pairs and the reflections, and start a conversation with the **converse()** method.\n", "\n", "That's it! Enjoy your first chatbot!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gender finder by name\n", "\n", "Let's try to implement a script that will try to guess your gender by your name. Hoping we'll get it right and won't offend anybody.\n", "\n", "Importing all necessary packages:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "from nltk import NaiveBayesClassifier\n", "from nltk.classify import accuracy\n", "from nltk.corpus import names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's extract the last N letters of the input word; they will act as the feature from which we will try to predict the gender." 
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "def extractLetters(name, N = 2):\n", "    # the feature is the lower-cased last N letters of the name\n", "    lastLetters = name[-N:]\n", "    return {'feature': lastLetters.lower()}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating training data from the datasets available in NLTK:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "if __name__=='__main__':\n", "    maleNames = [(name, 'male') for name in names.words('male.txt')]\n", "    femaleNames = [(name, 'female') for name in names.words('female.txt')]\n", "    data = (maleNames + femaleNames)\n", "\n", "    # fixed seed so that the shuffle (and the accuracy below) is reproducible\n", "    random.seed(5)\n", "    random.shuffle(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's enter some test data:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "# called testNames so that it does not shadow the imported names corpus\n", "testNames = ['Karina', 'Ramzes', 'James', 'Gloria']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Defining the number of training samples; let's use 90% of the available dataset:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "importedData = int(0.9 * len(data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's test the model with different numbers of trailing letters:" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Last letter count: 1\n", "Accuracy = 75.47%\n", "Karina ==> female\n", "Ramzes ==> male\n", "James ==> male\n", "Gloria ==> female\n", "\n", "Last letter count: 2\n", "Accuracy = 79.25%\n", "Karina ==> female\n", "Ramzes ==> male\n", "James ==> male\n", "Gloria ==> female\n", "\n", "Last letter count: 3\n", "Accuracy = 77.86%\n", "Karina ==> female\n", "Ramzes ==> female\n", "James ==> male\n", "Gloria ==> female\n", "\n", "Last letter count: 4\n", "Accuracy = 70.94%\n", "Karina ==> female\n", "Ramzes ==> female\n", "James ==> male\n", "Gloria ==> female\n" ] } ], "source": [ "for i in range(1, 5):\n", "    print('\\nLast letter count:', i)\n", "    # use the last i letters of each name as the feature\n", "    features = [(extractLetters(n, i), gender) for (n, gender) in data]\n", "    trainingData, testingData = features[:importedData], features[importedData:]\n", "    classifier = NaiveBayesClassifier.train(trainingData)\n", "\n", "    # evaluate on the 10% of names held out from training\n", "    modelAccuracy = round(100 * accuracy(classifier, testingData), 2)\n", "    print('Accuracy = ' + str(modelAccuracy) + '%')\n", "\n", "    for name in testNames:\n", "        print(name, '==>', classifier.classify(extractLetters(name, i)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, looking at more letters does not always help: accuracy peaks at 2 letters (79.25%) and falls to 70.94% at 4. The best option is to use the last 2 letters. Accuracy is not the best, but it clearly sees me as female, which is nice. :D Feel free to optimize this algorithm; the best one will receive cookies.\n", "\n", "This script uses NLTK's built-in Naive Bayes classifier.\n", "The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.\n", "\n", "Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method. [Source](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "As NLTK is a huge library, I included only the most needed and most used basic methods, as well as some easy yet interesting ones that would motivate you to dive into NLP more deeply. I really hope you've learned something.\n", "\n", "NLTK is very useful but easy to understand at the same time. 
It supports many languages and can be used for both human and computer languages, as long as they are natural. It can help you to analyse text, write your own filters or chatbots, filter spam, as well as use some artificial intelligence tricks. Who knows, maybe the next Alexa or Google will be written by you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional and used sources\n", "* http://www.nltk.org/howto/ - official NLTK how-to's\n", "* https://likegeeks.com/nlp-tutorial-using-python-nltk/ - intro to NLP with NLTK, NLTK basics\n", "* https://towardsdatascience.com/build-your-first-chatbot-using-python-nltk-5d07b027e727 - chatbot tutorial with NLTK\n", "* https://stackabuse.com/text-summarization-with-nltk-in-python/ - easy text summarization tutorial (processes like that do happen in search engines) using NLTK\n", "* https://www.tutorialspoint.com/artificial_intelligence_with_python/artificial_intelligence_with_python_nltk_package.htm - some more intelligent NLTK\n", "* https://www.nltk.org - official NLTK docs\n", "* https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk - NLP in search engine style using NLTK - calculating word weights, etc.\n", "* https://towardsdatascience.com/spam-classifier-in-python-from-scratch-27a98ddd8e73 - spam classifier from scratch in NLTK\n", "\n", "## How can NLTK help you at the university?\n", "\n", "* One of the examples is my program for the \"Formal Grammars\" course.\n", "\n", "In case you haven't had this course yet, consider this package to be your best friend during it, as it'll help you to easily parse any formal grammar and give you the correct answer. You will have only one task: to reduce the grammar correctly.\n", "\n", "This example parses the grammar reduced by me and returns all possible parse trees for the word, as well as detects whether it belongs to the given grammar. 
This was how I first discovered NLTK, and I've never regretted it since. :)\n", "\n", "[Click here](https://colab.research.google.com/drive/1vULur3cAO4vmiqHhAECatmIhnTMnWm0z)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }