{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Complex Data Structures: Sets\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A set is a data structure where all elements are unique. While they are similar to lists, in sets there is **no order** among the items in a set, and sets do **not** support indexing. Sets are ideal for membership queries, i.e., checking if an items appears in the set or not." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Creating Sets" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "\n", "Sets are specified by curly braces, `{ }`, containing one or more comma separated values. To specify an empty list, you can use the alternative construct, `set()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# creating sets\n", "one_set = {4, 2, 1, 3}\n", "print(one_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "two_set = {6, 5, 4, -2}\n", "print(two_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "three_set = {5, 0, -3, 4, 4, 4, 4}\n", "print(three_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# creating an empty set; notice that we do *not* use the \"empty set = {}\" command\n", "# as someone would expect based on the way that we create an empty list\n", "empty_set = set()\n", "print(empty_set)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We can also create a set from a list:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "my_list = [1, 2, 3, 0, 5, 10, 11, 1, 5]\n", "my_set = set(my_list)\n", "print(my_list)\n", "print(my_set)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "As expected, the set created from a list does not contain any duplicate elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "print(\"Length of list:\", len(my_list))\n", "print(\"Length of set:\", len(my_set))" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "As seen above, we can use `len()` to get the number of elements in a list. Similarly, we can use the other functions `min()`, `max()`, `sum()`, `sorted()` etc., that we also used for lists." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Exercise " ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "What is the number of _distinct_ words in the `washington_post` variable?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "washington_post = \"\"\"Russian officials vehemently defended the country s airstrikes in Syria on Thursday as blows to Islamic State militants even as evidence mounted suggesting that U S backed rebels and others were facing the brunt of Moscow s attacks \n", "And while Russian officials and diplomats rallied behind President Vladimir Putin the Kremlin s stance appeared further clouded by acknowledgments that the missions have already extended beyond solely the Islamic State \n", "In Paris the Russian ambassador to France Alexander Orlov said the Russian attacks also targeted an al Qaeda linked group Jabhat al Nusra or al Nusra Front \n", "Syrias ambassador to Russia Riad Haddad echoed that the joint hit list for Russia and the Syrian government included Jabhat al Nusra which is believed to have some coordination with the Islamic State but is still seen mostly as a rival \n", "We are confronting armed terrorist groups in Syria regardless of how they identify themselves whether it is Jabhat al Nusra the ISIL or others he said using one of the acronyms for the Islamic State \n", "They all are pursuing ISIL ends he added according to the Interfax news agency \n", "The ambassadors did not specifically mention any U S and Western backed rebel groups \n", "But the comment was certain to deepen suspicions by Washington and allies that Putin s short term aim is to give more breathing space to Syria s embattled President Bashar al Assad whose government is strongly supported by Moscow \n", "Syrian activists meanwhile ramped up their own claims that Moscow was hitting groups seeking to bring down Assad who has managed to hang on during more than four years of civil war \n", "Russia s expanding military intervention in Syria added urgency to separate efforts by Russia and U S officials to coordinate strategies against the Islamic State and avoid potential airspace missteps between the two powers so called deconfliction talks The Pentagon said the discussions will begin Thursday \n", "One monitoring group the Britain based Syrian Observatory for Human Rights said Russian airstrikes again struck strongholds of an American backed rebel group Tajamu Alezzah in central Hama province \n", "The actions quickly criticized by Washington add an unpredictable element to a multilayered war \n", "The observatory also reported that airstrikes hit the northwestern city Jisr al Shughour which is in the hands of rebel groups including al Nusra after battles last month to drive back Assad s forces \n", "Among the locations hit was a site near Kafr Nabl the northern Syrian town whose weekly protests against the government often featuring pithy slogans in English won it renown as a symbol of what began as a peaceful protest movement against the Assad regime The local council receives U S assistance and the rebel unit there has received support under a covert CIA program aimed at bolstering moderate rebels \n", "Raed Fares one of the leaders of the protest movement in Kafr Nabl said warplanes struck a Free Syrian Army checkpoint guarding Roman ruins on the outskirts of the town He said the explosion was bigger than anything local residents had seen in three years of airstrikes conducted by Syrian warplanes \n", "It made a fire six kilometers wide he told The Washington Post \n", "Other sites hit on the second day of Russian bombing included locations in the province of Hama The targets suggested the main intention of the strikes was to shore up government control over a corridor of territory linking the capital Damascus to the Assad family s coastal heartland where the Russians are operating out of an expanded air base \n", "Syrian rebels some of them U S backed had been making slow but steady gains in the area considered one of the government s biggest vulnerabilities There has been no Islamic State presence there since January when moderate rebels rose up against the extremists and forced them to retreat to eastern Syria \n", "In Washington Sen John McCain R Ariz told CNN he could absolutely confirm that airstrikes hit Western backed groups such as the Free Syrian Army and other factions armed and trained by the CIA \n", "We have communications with people there said McCain chairman of the Senate Armed Services Committee \n", "The accounts could not be independently assessed but the main focus of the Russian attacks appeared to be in areas not known to have strong Islamic State footholds \n", "In Moscow the reply was blunt \n", "Total rubbish Gennady Zyuganov a member of parliament and leader of Russia s Communist Party said of the U S accusations \n", "In televised remarks Thursday Putin called accusations that Russian airstrikes had killed civilians in Syria information attacks \n", "He also addressed concerns about an accidental military clash between Russian and U S led coalition forces saying that his intelligence and military agencies were establishing contacts with counterparts in the United States \n", "This work is ongoing and I hope that it will conclude with the creation of a regularly acting mechanism he said \n", "A spokesman for Russia s Defense Ministry Igor Konashenkov said Thursday that warplanes hit a dozen Islamic State sites in the past hours destroying targets including a command center and two arms depots \n", "The United States and Russia agree on the need to fight the Islamic State but not about what to do with the Syrian president The Syrian civil war which grew out of an uprising against Assad has killed more than people since March and sent millions of refugees fleeing to countries in the Middle East and Europe \n", "Accusing Russia of pouring gasoline on the fire Defense Secretary Ashton B Carter vowed that U S pilots would continue their year long bombing campaign against the Islamic State in Syria despite Moscow s warning that American planes should stay away from its operations \n", "I think what they re doing is going to backfire and is counterproductive Carter said on Wednesday \n", "Yet Russia s military flexing in Syria brought quick overtures from neighboring Iraq where the Islamic State also holds significant territory but the government is within Washington s fold \n", "Iraq s prime minister Haider al Abadi told France that he would welcome Russia joining the U S led airstrikes against Islamic State targets but there have been no specific discussions \n", "Joining the protests against the Russian airstrikes was Saudi Arabia a leading foe of Assad and one of Washington s top Middle East allies \n", "At the United Nations late Wednesday the Saudi ambassador Abdallah al Mouallimi demanded that the Russian air campaign stop immediately and accused Moscow of carrying out attacks in areas outside the control of the Islamic State \n", "In Iran Assad s main regional backer Foreign Ministry spokeswoman Marzieh Afkham called Russia s military role a step toward resolving the current crisis in Syria \"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "#your code here" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "solution2": "hidden", "solution2_first": true }, "source": [ "#### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# Let's get our list of words from the big string\n", "words = washington_post.split()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# Let's see the first few words that we extracted\n", "words[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# Let's see how many words we have\n", "print(len(words))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# By using a set, we will eliminate duplicates\n", "unique_words = set(words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# Let's see how many unique words we have\n", "print(len(unique_words))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# However, if we take a look at the set, we will see that capitalization \n", "# messes things up. We count as different words variations of the same word\n", "# with different capitalization (e.g, 'President' and 'president'; 'They' and 'they'; etc)\n", "print(unique_words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# So, let's repeat the process by converting first the \n", "# string to lowercase, and then creating the words\n", "words = washington_post.lower().split()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "d1 = sorted(set(washington_post.split()))\n", "d2 = sorted(set(washington_post.lower().split()))\n", "d3 = [x.lower() for x in d1]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# As expected, the number of non-distinct words in the list remains the same\n", "print(len(words))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# However, we have a smaller set of distinct words now\n", "unique_words = set(words)\n", "print(len(unique_words))" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Checking for membership in a set" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "For sets, we can only check if an item appears within the set or not. We achieve this using the `in` operator:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "my_set = {1, 2, 3, 4}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "val = 1\n", "if val in my_set:\n", " print(\"The value\", val ,\"appears in the set\", my_set)\n", "else: \n", " print(\"The value\", val ,\"does not appear in the set\", my_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "val = 0\n", "if val in my_set:\n", " print(\"The value\", val ,\"appears in the set\", my_set)\n", "else: \n", " print(\"The value\", val ,\"does not appear in the set\", my_set)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We also have the `not in` operator, which behaves as expected:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "if val not in my_set:\n", " print(\"The value\", val ,\"does not appear in the set\", my_set)\n", "else: \n", " print(\"The value\", val ,\"appears in the set\", my_set)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Set operators: Add, remove elements" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "To add and remove elements in a set, we use, respectively the `add` and `remove` commands:\n", "\n", "+ `set_a.add(x)`: add an element to a set\n", "+ `set_a.remove(x)`: remove an element from a set" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "set_A = set()\n", "set_A.add(1)\n", "set_A.add(2)\n", "set_A.add(3)\n", "set_A.add(4)\n", "set_A.add(5)\n", "set_A.add(6)\n", "set_A" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "set_A.remove(6)\n", "set_A" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Set operators: Union, intersection, difference, subset. Plus, Jaccard Similarity" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Sets also support operations that allow us to quickly compute the difference, intersection, and union of two sets. For example:\n", "\n", "+ `set_a - set_b`: elements in a but not in b. Equivalent to `set_a.difference(set_b)`\n", "+ `set_a | set_b`: elements in a or b. Equivalent to `set_a.union(set_b)`\n", "+ `set_a & set_b`: elements in both a and b. Equivalent to `set_a.intersection(set_b)`\n", "+ `set_a ^ set_b`: elements in a or b but not both. Equivalent to `set_a.symmetric_difference(set_b)` \n", "+ `set_a <= set_b`:\ttests whether every element in set_a is in set_b. Equivalent to `set_a.issubset(set_b)`\n" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Exercise\n" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "\n", "Try the above yourself using the `set_A` and `set_B` variables, and compute the difference, union, intersection, and symmetric difference, between the two sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Your code here\n", "set_A = {1, 2, 3, 4, 5}\n", "set_B = {4, 5, 6, 7}\n", "print(\"Set A\", set_A)\n", "print(\"Set B\", set_B)\n", "print(\"Difference A-B\", {} )\n", "print(\"Union\", {})\n", "print(\"Intersection\", {})\n", "print(\"Symmetric Difference\", {})" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "solution2": "hidden", "solution2_first": true }, "source": [ "#### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Difference A-B\", set_A - set_B)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Difference A-B\", set_A.difference(set_B))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Union\", set_A | set_B)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Union\", set_A.union(set_B))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Intersection\", set_A & set_B)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Intersection\", set_A.intersection(set_B))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Symmetric Difference\", set_A ^ set_B)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "print(\"Symmetric Difference\", set_A.symmetric_difference(set_B))" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Now, lets try to use the [Jaccard index similarity](https://en.wikipedia.org/wiki/Jaccard_index) to compute the similarity of the two sets. The Jaccard coefficient is defined as the ratio of the size of the intersection of the two sets, divided by the size of the union of the two sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "solution2": "hidden", "solution2_first": true }, "source": [ "#### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "intersection = set_A & set_B\n", "print(\"The intersection has\", len(intersection), \"items\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "union = set_A | set_B\n", "print(\"The union has\", len(union), \"items\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "jaccard = len(intersection) / len(union)\n", "print(\"The similarity of set_A with set_B is\", jaccard)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Below we have two news articles discussing a security breach at Yahoo. We want to compute the similarity of these articles using the Jaccard similarity. (For the sake of simplicity, we have removed all punctuation from the text.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "wsj = \"\"\"\n", "Yahoo Inc disclosed a massive security breach by a state sponsored actor affecting at least 500 million users potentially the largest such data breach on record and the latest hurdle for the beaten down internet company as it works through the sale of its core business \n", "Yahoo said certain user account information including names email addresses telephone numbers dates of birth hashed passwords and in some cases encrypted or unencrypted security questions and answers was stolen from the company s network in late 2014 by what it believes is a state sponsored actor \n", "Yahoo said it is notifying potentially affected users and has taken steps to secure their accounts by invalidating unencrypted security questions and answers so they can t be used to access an account and asking potentially affected users to change their passwords \n", "Yahoo recommended users who haven t changed their passwords since 2014 do so It also encouraged users change their passwords as well as security questions and answers for any other accounts on which they use the same or similar information used for their Yahoo account \n", "The company which is working with law enforcement said the continuing investigation indicates that stolen information didn t include unprotected passwords payment card data or bank account information \n", "With 500 million user accounts affected this is the largest ever publicly disclosed data breach according to Paul Stephens director of policy and advocacy with Privacy Rights Clearing House a not for profit group that compiles information on data breaches \n", "No evidence has been found to suggest the state sponsored actor is currently in Yahoo s network and Yahoo didn t name the country it suspected was involved In August a hacker called Peace appeared in online forums offering to sell 200 million of the company s usernames and passwords for about 1 900 in total Peace had previously sold data taken from breaches at Myspace and LinkedIn Corp \n", "\"\"\"\n", "\n", "ust = \"\"\"\n", "SAN FRANCISCO Information from at least 500 million Yahoo accounts was stolen from the company in 2014 and the company said Thursday it believes that a state sponsored actor was behind the hack \n", "The information may have included names email addresses telephone numbers dates of birth and in some cases encrypted or unencrypted security questions and answers Yahoo said \n", "Claims surfaced in early August that a hacker using the name Peace was trying to sell the usernames passwords and dates of birth of Yahoo account users on the dark web a black market of thousands of secret websites \n", "The FBI said it was aware of the matter The compromise of public and private sector systems is something the agency takes very seriously and it said it will continue to investigate and hold accountable all who pose a threat in cyberspace the agency said in an emailed statement \n", "Yahoo recommends that users who haven t changed their passwords since 2014 do so The company said it was notifying potentially affected users and taking steps to secure their accounts That included invalidating unencrypted security questions and answers and asking users to change their passwords \n", "The announcement comes as Yahoo looks to complete its 4 8 billion sale of its core Internet business to media giant Verizon Communications which said it was notified of the Yahoo breach within the last two days \n", " We understand that Yahoo is conducting an active investigation of this matter but we otherwise have limited information and understanding of the impact Verizon said \n", "Given the unsettled nature of Yahoo s ownership just now regulators should be concerned with who will take responsibility for the response to this compromise It can be easy for the right thing to do to slip through the cracks in a multi billion dollar transition said Tim Erlin senior director of IT security and risk strategy at Tripwire a computer security firm \n", "Yahoo Chief Executive Officer Marissa Mayer has pledged to stay on with the company through the close of the merger which is being overseen by Verizon s Marni Walden and AOL CEO Tim Armstrong Yahoo shares YHOO were flat Thursday Verizon VZ shares were up 1 at 52 39 \n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "solution2": "hidden", "solution2_first": true }, "source": [ "#### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# We first get the words from each article, in lowercase\n", "# and compute the total number of words and the distinct words \n", "words_wsj = wsj.lower().split()\n", "print(\"The WSJ article had\", len(words_wsj), \"words\")\n", "print(\"The WSJ article had\", len(set(words_wsj)), \"distinct words\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "words_ust = ust.lower().split()\n", "print(\"The US Today article had\", len(words_ust), \"words\")\n", "print(\"The US Today article had\", len(set(words_ust)), \"distinct words\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# We then compute the intersection of the words across the two articles\n", "intersection = set(words_wsj) & set(words_ust)\n", "print(\"The intersection has\", len(intersection), \"words\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# And the union of the words across the two articles\n", "union = set(words_wsj) | set(words_ust)\n", "print(\"The union has\", len(union), \"words\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "# And now we can compute hte similarity\n", "similarity = len(intersection) / len(union)\n", "print(\"The similarity of the two articles is\", similarity)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }