{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian Classification for Machine Learning for Computational Linguistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using token probabilities for classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.5, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the discussion of a Bayesian classifier in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Training Corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume that we have a set of e-mails that are annotated as *spam* or *ham*, as described in the textbook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are $4$ e-mails labeled *ham* and $1$ e-mail is labeled *spam*, that is we have a total of $5$ texts in our corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we would randomly pick an e-mail from the collection, the probability that we pick a spam e-mail would be $1 / 5$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spam emails might differ from ham e-mails just in some words. Here is a sample email constructed with typical keywords:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "spam = [ \"\"\"Our medicine cures baldness. No diagnostics needed.\n", " We guarantee Fast Viagra delivery.\n", " We can provide Human growth hormone. The cheapest Life\n", " Insurance with us. You can Lose weight with this treatment.\n", " Our Medicine now and No medical exams necessary.\n", " Our Online pharmacy is the best. This cream Removes\n", " wrinkles and Reverses aging.\n", " One treatment and you will Stop snoring. We sell Valium\n", " and Viagra.\n", " Our Vicodin will help with Weight loss. Cheap Xanax.\"\"\" ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data structure above is a list of strings that contains only one string. The triple-double-quotes mark multi-line text. 
We can output the size of the variable *spam* this way:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n" ] } ], "source": [ "print(len(spam))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a list of *ham* mails in a similar way:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ham = [ \"\"\"Hi Hans, hope to see you soon at our family party.\n", " When will you arrive.\n", " All the best to the family.\n", " Sue\"\"\",\n", " \"\"\"Dear Ata,\n", " did you receive my last email related to the car insurance\n", " offer? I would be happy to discuss the details with you.\n", " Please give me a call, if you have any questions.\n", " John Smith\n", " Super Car Insurance\"\"\",\n", " \"\"\"Hi everyone:\n", " This is just a gentle reminder of today's first 2017 SLS\n", " Colloquium, from 2.30 to 4.00 pm, in Ballantine 103.\n", " Rodica Frimu will present a job talk entitled \"What is\n", " so tricky in subject-verb agreement?\". The text of the\n", " abstract is below.\n", " If you would like to present something during the Spring,\n", " please let me know.\n", " The current online schedule with updated title\n", " information and abstracts is available under:\n", " http://www.iub.edu/~psyling/SLSColloquium/Spring2017.html\n", " See you soon,\n", " Peter\"\"\",\n", " \"\"\"Dear Friends,\n", " As our first event of 2017, the Polish Studies Center\n", " presents an evening with artist and filmmaker Wojtek Sawa.\n", " Please join us on JANUARY 26, 2017 from 5:30 p.m. to\n", " 7:30 p.m. in the Global and International Studies\n", " Building room 1100 for a presentation by Wojtek Sawa\n", " on his interactive installation art piece The Wall\n", " Speaks–Voices of the Unheard. A reception will follow\n", " the event where you will have a chance to meet the artist\n", " and discuss his work.\n", " Best,\"\"\"]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ham-mail list contains $4$ e-mails:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "print(len(ham))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access a particular e-mail via index from either spam or ham:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Our medicine cures baldness. No diagnostics needed.\n", " We guarantee Fast Viagra delivery.\n", " We can provide Human growth hormone. The cheapest Life\n", " Insurance with us. You can Lose weight with this treatment.\n", " Our Medicine now and No medical exams necessary.\n", " Our Online pharmacy is the best. This cream Removes\n", " wrinkles and Reverses aging.\n", " One treatment and you will Stop snoring. We sell Valium\n", " and Viagra.\n", " Our Vicodin will help with Weight loss. Cheap Xanax.\n" ] } ], "source": [ "print(spam[0])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dear Friends,\n", " As our first event of 2017, the Polish Studies Center\n", " presents an evening with artist and filmmaker Wojtek Sawa.\n", " Please join us on JANUARY 26, 2017 from 5:30 p.m. to\n", " 7:30 p.m. 
in the Global and International Studies\n", " Building room 1100 for a presentation by Wojtek Sawa\n", " on his interactive installation art piece The Wall\n", " Speaks–Voices of the Unheard. A reception will follow\n", " the event where you will have a chance to meet the artist\n", " and discuss his work.\n", " Best,\n" ] } ], "source": [ "print(ham[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can lower-case the email using the string *lower* function:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dear friends,\n", " as our first event of 2017, the polish studies center\n", " presents an evening with artist and filmmaker wojtek sawa.\n", " please join us on january 26, 2017 from 5:30 p.m. to\n", " 7:30 p.m. in the global and international studies\n", " building room 1100 for a presentation by wojtek sawa\n", " on his interactive installation art piece the wall\n", " speaks–voices of the unheard. a reception will follow\n", " the event where you will have a chance to meet the artist\n", " and discuss his work.\n", " best,\n" ] } ], "source": [ "print(ham[3].lower())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can loop over all e-mails in spam or ham and lower-case the content:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hi hans, hope to see you soon at our family party.\n", " when will you arrive.\n", " all the best to the family.\n", " sue\n", "dear ata,\n", " did you receive my last email related to the car insurance\n", " offer? i would be happy to discuss the details with you.\n", " please give me a call, if you have any questions.\n", " john smith\n", " super car insurance\n", "hi everyone:\n", " this is just a gentle reminder of today's first 2017 sls\n", " colloquium, from 2.30 to 4.00 pm, in ballantine 103.\n", " rodica frimu will present a job talk entitled \"what is\n", " so tricky in subject-verb agreement?\". the text of the\n", " abstract is below.\n", " if you would like to present something during the spring,\n", " please let me know.\n", " the current online schedule with updated title\n", " information and abstracts is available under:\n", " http://www.iub.edu/~psyling/slscolloquium/spring2017.html\n", " see you soon,\n", " peter\n", "dear friends,\n", " as our first event of 2017, the polish studies center\n", " presents an evening with artist and filmmaker wojtek sawa.\n", " please join us on january 26, 2017 from 5:30 p.m. to\n", " 7:30 p.m. in the global and international studies\n", " building room 1100 for a presentation by wojtek sawa\n", " on his interactive installation art piece the wall\n", " speaks–voices of the unheard. 
a reception will follow\n", " the event where you will have a chance to meet the artist\n", " and discuss his work.\n", " best,\n" ] } ], "source": [ "for text in ham:\n", " print(text.lower())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the tokenizer from NLTK to tokenize the lower-cased text into single tokens (words and punctuation marks):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['hi', 'hans', ',', 'hope', 'to', 'see', 'you', 'soon', 'at', 'our', 'family', 'party', '.', 'when', 'will', 'you', 'arrive', '.', 'all', 'the', 'best', 'to', 'the', 'family', '.', 'sue']\n" ] } ], "source": [ "from nltk import word_tokenize\n", "\n", "print(word_tokenize(ham[0].lower()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can count the numer of tokens and types in lower-cased text:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'this': 2, 'test': 2, 'is': 1, 'a': 1, '.': 1, 'will': 1, 'teach': 1, 'us': 1, 'how': 1, 'to': 1, 'count': 1, 'tokens': 1, '?': 1})\n", "number of types: 13\n", "number of tokens: 15\n" ] } ], "source": [ "from collections import Counter\n", "\n", "myCounts = Counter(word_tokenize(\"This is a test. Will this test teach us how to count tokens?\".lower()))\n", "\n", "print(myCounts)\n", "print(\"number of types:\", len(myCounts))\n", "print(\"number of tokens:\", sum(myCounts.values()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create a frequency profile of ham and spam words given the two text collections:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ham:\n", " Counter({'the': 14, ',': 11, '.': 11, 'to': 8, 'you': 8, 'a': 6, 'will': 4, 'is': 4, 'of': 4, 'and': 4, 'with': 3, 'please': 3, ':': 3, '2017': 3, 'in': 3, 'hi': 2, 'see': 2, 'soon': 2, 'our': 2, 'family': 2, 'best': 2, 'dear': 2, 'car': 2, 'insurance': 2, '?': 2, 'would': 2, 'discuss': 2, 'me': 2, 'if': 2, 'have': 2, 'first': 2, 'from': 2, 'present': 2, 'event': 2, 'studies': 2, 'artist': 2, 'wojtek': 2, 'sawa': 2, 'on': 2, 'p.m.': 2, 'his': 2, 'hans': 1, 'hope': 1, 'at': 1, 'party': 1, 'when': 1, 'arrive': 1, 'all': 1, 'sue': 1, 'ata': 1, 'did': 1, 'receive': 1, 'my': 1, 'last': 1, 'email': 1, 'related': 1, 'offer': 1, 'i': 1, 'be': 1, 'happy': 1, 'details': 1, 'give': 1, 'call': 1, 'any': 1, 'questions': 1, 'john': 1, 'smith': 1, 'super': 1, 'everyone': 1, 'this': 1, 'just': 1, 'gentle': 1, 'reminder': 1, 'today': 1, \"'s\": 1, 'sls': 1, 'colloquium': 1, '2.30': 1, '4.00': 1, 'pm': 1, 'ballantine': 1, '103.': 1, 'rodica': 1, 'frimu': 1, 'job': 1, 'talk': 1, 'entitled': 1, '``': 1, 'what': 1, 'so': 1, 'tricky': 1, 'subject-verb': 1, 'agreement': 1, \"''\": 1, 'text': 1, 'abstract': 1, 'below': 1, 'like': 1, 'something': 1, 'during': 1, 'spring': 1, 'let': 1, 'know': 1, 'current': 1, 'online': 1, 'schedule': 1, 'updated': 1, 'title': 1, 'information': 1, 'abstracts': 1, 'available': 1, 'under': 1, 'http': 1, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 1, 'peter': 1, 'friends': 1, 'as': 1, 'polish': 1, 'center': 1, 'presents': 1, 'an': 1, 'evening': 1, 'filmmaker': 1, 'join': 1, 'us': 1, 'january': 1, '26': 1, '5:30': 1, '7:30': 1, 'global': 1, 'international': 1, 'building': 1, 'room': 1, '1100': 1, 'for': 1, 'presentation': 1, 'by': 1, 
'interactive': 1, 'installation': 1, 'art': 1, 'piece': 1, 'wall': 1, 'speaks–voices': 1, 'unheard': 1, 'reception': 1, 'follow': 1, 'where': 1, 'chance': 1, 'meet': 1, 'work': 1})\n", "------------------------------\n", "Spam:\n", " Counter({'.': 13, 'our': 4, 'and': 4, 'we': 3, 'with': 3, 'medicine': 2, 'no': 2, 'viagra': 2, 'can': 2, 'the': 2, 'you': 2, 'weight': 2, 'this': 2, 'treatment': 2, 'will': 2, 'cures': 1, 'baldness': 1, 'diagnostics': 1, 'needed': 1, 'guarantee': 1, 'fast': 1, 'delivery': 1, 'provide': 1, 'human': 1, 'growth': 1, 'hormone': 1, 'cheapest': 1, 'life': 1, 'insurance': 1, 'us': 1, 'lose': 1, 'now': 1, 'medical': 1, 'exams': 1, 'necessary': 1, 'online': 1, 'pharmacy': 1, 'is': 1, 'best': 1, 'cream': 1, 'removes': 1, 'wrinkles': 1, 'reverses': 1, 'aging': 1, 'one': 1, 'stop': 1, 'snoring': 1, 'sell': 1, 'valium': 1, 'vicodin': 1, 'help': 1, 'loss': 1, 'cheap': 1, 'xanax': 1})\n" ] } ], "source": [ "hamFP = Counter()\n", "spamFP = Counter()\n", "\n", "for text in spam:\n", " spamFP.update(word_tokenize(text.lower()))\n", "\n", "for text in ham:\n", " hamFP.update(word_tokenize(text.lower()))\n", "\n", "print(\"Ham:\\n\", hamFP)\n", "print(\"-\" * 30)\n", "print(\"Spam:\\n\", spamFP)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "our 2.947862376664825\n", "medicine 4.643856189774724\n", "cures 2.321928094887362\n", "baldness 2.321928094887362\n", ". 0.0\n", "no 4.643856189774724\n", "diagnostics 2.321928094887362\n", "needed 2.321928094887362\n", "we 6.965784284662087\n", "guarantee 2.321928094887362\n", "fast 2.321928094887362\n", "viagra 4.643856189774724\n", "delivery 2.321928094887362\n", "can 4.643856189774724\n", "provide 2.321928094887362\n", "human 2.321928094887362\n", "growth 2.321928094887362\n", "hormone 2.321928094887362\n", "the 0.0\n", "cheapest 2.321928094887362\n", "life 2.321928094887362\n", "insurance 1.3219280948873624\n", "with 0.965784284662087\n", "us 1.3219280948873624\n", "you 0.0\n", "lose 2.321928094887362\n", "weight 4.643856189774724\n", "this 2.643856189774725\n", "treatment 4.643856189774724\n", "now 2.321928094887362\n", "and 2.947862376664825\n", "medical 2.321928094887362\n", "exams 2.321928094887362\n", "necessary 2.321928094887362\n", "online 1.3219280948873624\n", "pharmacy 2.321928094887362\n", "is 1.3219280948873624\n", "best 0.7369655941662062\n", "cream 2.321928094887362\n", "removes 2.321928094887362\n", "wrinkles 2.321928094887362\n", "reverses 2.321928094887362\n", "aging 2.321928094887362\n", "one 2.321928094887362\n", "will 0.6438561897747247\n", "stop 2.321928094887362\n", "snoring 2.321928094887362\n", "sell 2.321928094887362\n", "valium 2.321928094887362\n", "vicodin 2.321928094887362\n", "help 2.321928094887362\n", "loss 2.321928094887362\n", "cheap 2.321928094887362\n", "xanax 2.321928094887362\n" ] } ], "source": [ "from math import log\n", "\n", "tokenlist = []\n", "frqprofiles = []\n", "for x in spam:\n", " frqprofiles.append( Counter(word_tokenize(x.lower())) )\n", " tokenlist.append( set(word_tokenize(x.lower())) )\n", "for x in ham:\n", " frqprofiles.append( Counter(word_tokenize(x.lower())) )\n", " tokenlist.append( set(word_tokenize(x.lower())) )\n", "#print(tokenlist)\n", "\n", "for x in frqprofiles[0]:\n", " frq = frqprofiles[0][x]\n", " counter = 0\n", " for y in tokenlist:\n", " if x in y:\n", " counter += 1\n", " print(x, frq * log(len(tokenlist)/counter, 2))\n" ] }, { "cell_type": "markdown", "metadata": {}, 
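"source": [ "The cell above computes a tf-idf-style weight for each token of the spam e-mail: the token's frequency in that e-mail multiplied by $\\log_2(N/n_t)$, where $N$ is the number of e-mails in the corpus ($5$) and $n_t$ is the number of e-mails that contain the token. Tokens that occur in every e-mail, such as *the* or *you*, therefore receive a weight of $0$, while tokens specific to the spam e-mail keep a high weight. As a small sketch (reusing the *frqprofiles* and *tokenlist* variables and the *log* and *Counter* imports from above), we can collect these weights in a *Counter* and list the highest-weighted, i.e. most distinctive, spam tokens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: collect the tf-idf-style weights in a Counter and show the highest-weighted tokens\n", "weights = Counter()\n", "for token, frq in frqprofiles[0].items():\n", "    df = sum(1 for doc in tokenlist if token in doc)  # number of e-mails containing the token\n", "    weights[token] = frq * log(len(tokenlist) / df, 2)\n", "print(weights.most_common(10))" ] }, { "cell_type": "markdown", "metadata": {},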
"source": [ "The probability that we pick randomly an e-mail that is spam or ham can be computed as the ratio of the counts divided by the number of e-mails:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "probability to pick spam: 0.2\n", "probability to pick ham: 0.8\n" ] } ], "source": [ "total = len(spam) + len(ham)\n", "\n", "spamP = len(spam) / total\n", "hamP = len(ham) / total\n", "\n", "print(\"probability to pick spam:\", spamP)\n", "print(\"probability to pick ham:\", hamP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need the total token count to calculate the relative frequency of the tokens, that is to generate likelihood estimates. We could *brute force* add one to create space in the probability mass for unknown tokens." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total spam counts + 1: 87\n", "total ham counts + 1: 251\n" ] } ], "source": [ "totalSpam = sum(spamFP.values()) + 1\n", "totalHam = sum(hamFP.values()) + 1\n", "\n", "print(\"total spam counts + 1:\", totalSpam)\n", "print(\"total ham counts + 1:\", totalHam)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can relativize the counts in the frequency profiles now:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'the': 0.055776892430278883, ',': 0.043824701195219126, '.': 0.043824701195219126, 'to': 0.03187250996015936, 'you': 0.03187250996015936, 'a': 0.02390438247011952, 'will': 0.01593625498007968, 'is': 0.01593625498007968, 'of': 0.01593625498007968, 'and': 0.01593625498007968, 'with': 0.01195219123505976, 'please': 0.01195219123505976, ':': 0.01195219123505976, '2017': 0.01195219123505976, 'in': 0.01195219123505976, 'hi': 0.00796812749003984, 'see': 0.00796812749003984, 'soon': 0.00796812749003984, 'our': 0.00796812749003984, 'family': 0.00796812749003984, 'best': 0.00796812749003984, 'dear': 0.00796812749003984, 'car': 0.00796812749003984, 'insurance': 0.00796812749003984, '?': 0.00796812749003984, 'would': 0.00796812749003984, 'discuss': 0.00796812749003984, 'me': 0.00796812749003984, 'if': 0.00796812749003984, 'have': 0.00796812749003984, 'first': 0.00796812749003984, 'from': 0.00796812749003984, 'present': 0.00796812749003984, 'event': 0.00796812749003984, 'studies': 0.00796812749003984, 'artist': 0.00796812749003984, 'wojtek': 0.00796812749003984, 'sawa': 0.00796812749003984, 'on': 0.00796812749003984, 'p.m.': 0.00796812749003984, 'his': 0.00796812749003984, 'hans': 0.00398406374501992, 'hope': 0.00398406374501992, 'at': 0.00398406374501992, 'party': 0.00398406374501992, 'when': 0.00398406374501992, 'arrive': 0.00398406374501992, 'all': 0.00398406374501992, 'sue': 0.00398406374501992, 'ata': 0.00398406374501992, 'did': 0.00398406374501992, 'receive': 0.00398406374501992, 'my': 0.00398406374501992, 'last': 0.00398406374501992, 'email': 0.00398406374501992, 'related': 0.00398406374501992, 'offer': 0.00398406374501992, 'i': 0.00398406374501992, 'be': 0.00398406374501992, 'happy': 0.00398406374501992, 'details': 0.00398406374501992, 'give': 0.00398406374501992, 'call': 0.00398406374501992, 'any': 0.00398406374501992, 'questions': 0.00398406374501992, 'john': 0.00398406374501992, 'smith': 0.00398406374501992, 'super': 0.00398406374501992, 'everyone': 0.00398406374501992, 'this': 
0.00398406374501992, 'just': 0.00398406374501992, 'gentle': 0.00398406374501992, 'reminder': 0.00398406374501992, 'today': 0.00398406374501992, \"'s\": 0.00398406374501992, 'sls': 0.00398406374501992, 'colloquium': 0.00398406374501992, '2.30': 0.00398406374501992, '4.00': 0.00398406374501992, 'pm': 0.00398406374501992, 'ballantine': 0.00398406374501992, '103.': 0.00398406374501992, 'rodica': 0.00398406374501992, 'frimu': 0.00398406374501992, 'job': 0.00398406374501992, 'talk': 0.00398406374501992, 'entitled': 0.00398406374501992, '``': 0.00398406374501992, 'what': 0.00398406374501992, 'so': 0.00398406374501992, 'tricky': 0.00398406374501992, 'subject-verb': 0.00398406374501992, 'agreement': 0.00398406374501992, \"''\": 0.00398406374501992, 'text': 0.00398406374501992, 'abstract': 0.00398406374501992, 'below': 0.00398406374501992, 'like': 0.00398406374501992, 'something': 0.00398406374501992, 'during': 0.00398406374501992, 'spring': 0.00398406374501992, 'let': 0.00398406374501992, 'know': 0.00398406374501992, 'current': 0.00398406374501992, 'online': 0.00398406374501992, 'schedule': 0.00398406374501992, 'updated': 0.00398406374501992, 'title': 0.00398406374501992, 'information': 0.00398406374501992, 'abstracts': 0.00398406374501992, 'available': 0.00398406374501992, 'under': 0.00398406374501992, 'http': 0.00398406374501992, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 0.00398406374501992, 'peter': 0.00398406374501992, 'friends': 0.00398406374501992, 'as': 0.00398406374501992, 'polish': 0.00398406374501992, 'center': 0.00398406374501992, 'presents': 0.00398406374501992, 'an': 0.00398406374501992, 'evening': 0.00398406374501992, 'filmmaker': 0.00398406374501992, 'join': 0.00398406374501992, 'us': 0.00398406374501992, 'january': 0.00398406374501992, '26': 0.00398406374501992, '5:30': 0.00398406374501992, '7:30': 0.00398406374501992, 'global': 0.00398406374501992, 'international': 0.00398406374501992, 'building': 0.00398406374501992, 'room': 0.00398406374501992, '1100': 0.00398406374501992, 'for': 0.00398406374501992, 'presentation': 0.00398406374501992, 'by': 0.00398406374501992, 'interactive': 0.00398406374501992, 'installation': 0.00398406374501992, 'art': 0.00398406374501992, 'piece': 0.00398406374501992, 'wall': 0.00398406374501992, 'speaks–voices': 0.00398406374501992, 'unheard': 0.00398406374501992, 'reception': 0.00398406374501992, 'follow': 0.00398406374501992, 'where': 0.00398406374501992, 'chance': 0.00398406374501992, 'meet': 0.00398406374501992, 'work': 0.00398406374501992})\n", "------------------------------\n", "Counter({'.': 0.14942528735632185, 'our': 0.04597701149425287, 'and': 0.04597701149425287, 'we': 0.034482758620689655, 'with': 0.034482758620689655, 'medicine': 0.022988505747126436, 'no': 0.022988505747126436, 'viagra': 0.022988505747126436, 'can': 0.022988505747126436, 'the': 0.022988505747126436, 'you': 0.022988505747126436, 'weight': 0.022988505747126436, 'this': 0.022988505747126436, 'treatment': 0.022988505747126436, 'will': 0.022988505747126436, 'cures': 0.011494252873563218, 'baldness': 0.011494252873563218, 'diagnostics': 0.011494252873563218, 'needed': 0.011494252873563218, 'guarantee': 0.011494252873563218, 'fast': 0.011494252873563218, 'delivery': 0.011494252873563218, 'provide': 0.011494252873563218, 'human': 0.011494252873563218, 'growth': 0.011494252873563218, 'hormone': 0.011494252873563218, 'cheapest': 0.011494252873563218, 'life': 0.011494252873563218, 'insurance': 0.011494252873563218, 'us': 0.011494252873563218, 'lose': 
0.011494252873563218, 'now': 0.011494252873563218, 'medical': 0.011494252873563218, 'exams': 0.011494252873563218, 'necessary': 0.011494252873563218, 'online': 0.011494252873563218, 'pharmacy': 0.011494252873563218, 'is': 0.011494252873563218, 'best': 0.011494252873563218, 'cream': 0.011494252873563218, 'removes': 0.011494252873563218, 'wrinkles': 0.011494252873563218, 'reverses': 0.011494252873563218, 'aging': 0.011494252873563218, 'one': 0.011494252873563218, 'stop': 0.011494252873563218, 'snoring': 0.011494252873563218, 'sell': 0.011494252873563218, 'valium': 0.011494252873563218, 'vicodin': 0.011494252873563218, 'help': 0.011494252873563218, 'loss': 0.011494252873563218, 'cheap': 0.011494252873563218, 'xanax': 0.011494252873563218})\n" ] } ], "source": [ "hamFP = Counter( dict([ (token, frequency/totalHam) for token, frequency in hamFP.items() ]) )\n", "spamFP = Counter( dict([ (token, frequency/totalSpam) for token, frequency in spamFP.items() ]) )\n", "\n", "print(hamFP)\n", "print(\"-\" * 30)\n", "print(spamFP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now compute the default probability that we want to assign to unknown words as $1 / totalSpam$ or $1 / totalHam$ respectively. Whenever we encounter an unknown token that is not in our frequency profile, we will assign the default probability to it." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "default spam probability: 0.011494252873563218\n", "default ham probability: 0.00398406374501992\n" ] } ], "source": [ "defaultSpam = 1 / totalSpam\n", "defaultHam = 1 / totalHam\n", "\n", "print(\"default spam probability:\", defaultSpam)\n", "print(\"default ham probability:\", defaultHam)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test an unknown document by calculating how likely it was generated by the hamFP-distribution or the spamFP-distribution. We have to tokenize the lower-cased unknown document and compute the product of the likelihood of every single token in the text. We should scale this likelihood with the likelihood of randomly picking a ham or a spam e-mail. Let us calculate the likelihood that the random email is spam:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9.669490943645368e-37\n" ] } ], "source": [ "unknownEmail = \"\"\"Dear ,\n", "we sell the cheapest and best Viagra on the planet. Our delivery is guaranteed confident and cheap.\n", "\"\"\"\n", "#unknownEmail = \"\"\"Dear Hans,\n", "#I have not seen you for so long. 
When will we go out for a coffee again.\n", "#\"\"\"\n", "\n", "tokens = word_tokenize(unknownEmail.lower())\n", "\n", "result = 1.0\n", "for token in tokens:\n", " result *= spamFP.get(token, defaultSpam)\n", "\n", "print(result * spamP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this number is very small, a better strategy might be to sum up the log-likelihoods:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-118.92540938825404\n" ] } ], "source": [ "from math import log\n", "\n", "resultSpam = 0.0\n", "for token in tokens:\n", " resultSpam += log(spamFP.get(token, defaultSpam), 2)\n", "resultSpam += log(spamP)\n", "\n", "print(resultSpam)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-139.6325534842533\n" ] } ], "source": [ "resultHam = 0.0\n", "for token in tokens:\n", " resultHam += log(hamFP.get(token, defaultHam), 2)\n", "resultHam += log(hamP)\n", "\n", "print(resultHam)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The log-likelihood for *spam* is larger than the one for *ham*. Our simple classifier would therefore classify this e-mail as *spam*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if max(resultHam, resultSpam) == resultHam:\n", " print(\"e-mail is ham\")\n", "else:\n", " print(\"e-mail is spam\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are numerous ways to improve the algorithm and this tutorial. Please [send me](http://cavar.me/damir/) your suggestions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "Bayesian Classification for Machine Learning for Computational Linguistics" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false }, "vscode": { "interpreter": { "hash": "1e28a5307a9b5c2fbeb0b263581f1cf3bfba9739188743f6a231f74c7de58892" } } }, "nbformat": 4, "nbformat_minor": 4 }