{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formalia:\n", "\n", "Please read the [assignment overview page](https://github.com/lalessan/comsocsci2022/wiki/Assignments) carefully before proceeding. This page contains information about formatting (including formats etc), group sizes, and many other aspects of handing in the assignment. \n", "\n", "_If you fail to follow these simple instructions, it will negatively impact your grade!_\n", "\n", "**Due date and time**: The assignment is due on Tuesday, April 5th at 23:55. Hand in your Jupyter notebook file (with extension `.ipynb`) via DTU Learn _(Course Content, Assignemnts, Assignment 2)_\n", "\n", "\n", "Remember to include in the first cell of your notebook:\n", "* the link to your group's Git repository\n", "* group members' contributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: TF-IDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise, you need the following data: \n", "* The r/wallstreetbets submissions (either the one provided by me [here](https://github.com/lalessan/comsocsci2021/blob/master/data/wallstreet_subs.csv.gz) or the one you downloaded in Week 6).\n", "* The list of 15 stocks you identified in Week 6, Exercise 2.\n", "\n", "_Exercise_\n", "\n", "\n", "> 1. Tokenize the __text__ of each submission. Create a column __tokens__ in your dataframe containing the tokens. Remember to follow the instructions in Week 6, Exercise 3. \n", "> 2. Find submissions discussing at least one of the top 15 stocks you identified above (follow the instructions in Week 6, Exercise 3).\n", "> 3. Now, we want to find out which words are important for each *stock*, so we're going to create several ***large documents, one for each stock***. Each document includes all the tokens related to the same stock. We will also have a document including discussions that do not relate to the top 15 stocks.\n", "> 4. Now, we're ready to calculate the TF for each word. Find the top 5 terms within __5 stocks of your choice__. \n", "> * Describe similarities and differences between the stocks.\n", "> * Why aren't the TFs not necessarily a good description of the stocks?\n", "> * Next, we calculate IDF for every word. \n", "> * What base logarithm did you use? Is that important?\n", "> 5. We're ready to calculate TF-IDF. Do that for the __5 stock of your choice__. \n", "> * List the 10 top TF words for each stock.\n", "> * List the 10 top TF-IDF words for each stock.\n", "> * Are these 10 words more descriptive of the stock? If yes, what is it about IDF that makes the words more informative?\n", "> 6. Visualize the results in a Wordcloud and comment your results (follow the instrutions in Week 6, Exercise 4). \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Sentiment analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Exercise: Creating Word Shifts_\n", "> 1. Pick a day of your choice in 2020. We call it $d$. It is more interesting if you pick a day where you expect something relevant to occur (e.g. Christmas, New Year, Corona starting, the market crashes...).\n", "> 2. Build two lists $l$ and $l_{ref}$ containing all tokens for submissions posted on r/wallstreebets on day $d$, and in the 7 days preceding day $d$, respectively. \n", "> 3. For each token $i$, compute the relative frequency in the two lists $l$ and $l_{ref}$. We call them $p(i,l)$ and $p(i,l_{ref})$, respectively. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Sentiment analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Exercise: Creating Word Shifts_\n", "> 1. Pick a day of your choice in 2020. We call it $d$. It is more interesting if you pick a day where you expect something relevant to occur (e.g. Christmas, New Year, the start of the Corona pandemic, a market crash...).\n", "> 2. Build two lists $l$ and $l_{ref}$ containing all tokens for submissions posted on r/wallstreetbets on day $d$, and in the 7 days preceding day $d$, respectively. \n", "> 3. For each token $i$, compute the relative frequency in the two lists $l$ and $l_{ref}$. We call them $p(i,l)$ and $p(i,l_{ref})$, respectively. The relative frequency is computed as the number of times a token occurs divided by the total length of the document. Store the result in a dictionary.\n", "> 4. For each token $i$, compute the difference in relative frequency $\\delta p(i) = p(i,l) - p(i,l_{ref})$. Store the values in a dictionary. Print the top 10 tokens (those with the largest difference in relative frequency). Do you notice anything interesting?\n", "> 5. Now, for each token, compute the happiness $h(i) = labMT(i) - 5$, using the labMT dictionary. Here, we subtract $5$, so that positive tokens will have a positive value and negative tokens will have a negative value. Then, compute the product $\\delta \\Phi = h(i)\\cdot \\delta p(i)$. Store the results in a dictionary. \n", "> 6. Print the top 10 tokens, ordered by the absolute value $|\\delta \\Phi|$. Explain in your own words the meaning of $\\delta \\Phi$. If that is unclear, have a look at [this page](https://shifterator.readthedocs.io/en/latest/cookbook/weighted_avg_shifts.html).\n", "> 7. Now install the [``shifterator``](https://shifterator.readthedocs.io/en/latest/installation.html) Python package. We will use it for plotting Word Shifts. \n", "> 8. Use the function ``shifterator.WeightedAvgShift`` to plot the WordShift, showing which words contributed the most to making your day of choice _d_ happier or sadder than the preceding 7 days. Comment on the figure. \n", "> 9. How do the words that you printed in step 6 relate to those shown by the WordShift? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Communities for the Zachary Karate Club Network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Exercise: Zachary's karate club_: In this exercise, we will work on Zachary's karate club graph (refer to the Introduction of Chapter 9). The dataset is available in NetworkX, by calling the function [karate_club_graph](https://networkx.org/documentation/stable//auto_examples/graph/plot_karate_club.html). \n", "\n", "> 1. Visualize the graph using [netwulf](https://netwulf.readthedocs.io/en/latest/). Set the color of each node based on the club split (the information is stored as a node attribute). My version of the visualization is below.\n", ">\n", "> 2. Write a function to compute the __modularity__ of a graph partitioning (use **equation 9.12** in the book). The function should take a networkX Graph and a partitioning as inputs and return the modularity (a minimal code sketch follows this exercise).\n", "> 3. Explain in your own words the concept of _modularity_. \n", "> 4. Compute the modularity of the Karate club split partitioning using the function you just wrote. Note: the Karate club split partitioning is available as a [node attribute](https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.classes.function.get_node_attributes.html), called _\"club\"_.\n", "> 5. We will now perform a small randomization experiment to assess whether the modularity you just computed is statistically different from $0$. To do so, we will implement the _double edge swap_ algorithm. The _double edge swap_ algorithm is quite old... it was introduced in 1891 (!) by the Danish mathematician [Julius Petersen](https://en.wikipedia.org/wiki/Julius_Petersen). Given a network G, this algorithm creates a new network, such that each node has exactly the same degree as in the original network, but different connections. Here is how the algorithm works:\n", "> * __a.__ Create an identical copy of your original network.\n", "> * __b.__ Consider two edges in your new network, (u,v) and (x,y), such that the four nodes u, v, x and y are all distinct.\n", "> * __c.__ If neither of the edges (u,y) and (x,v) exists already, add them to the network and remove the edges (u,v) and (x,y).\n", "> * Repeat steps __b.__ and __c.__ to achieve at least N swaps (I suggest choosing N larger than the number of edges).\n", "> 6. Double check that your algorithm works well, by showing that each node has the same degree in the original network and in the new 'randomized' version of the network.\n", "> 7. Create $1000$ randomized versions of the Karate Club network using the _double edge swap_ algorithm you wrote in step 5. For each of them, compute the modularity of the \"club\" split and store it in a list.\n", "> 8. Compute the average and standard deviation of the modularity for the random networks.\n", "> 9. Plot the distribution of the \"random\" modularity. Plot the actual modularity of the club split as a vertical line (use [axvline](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html)). \n", "> 10. Comment on the figure. Is the club split a good partitioning? Why do you think I asked you to perform a randomization experiment? Why did we preserve the nodes' degrees?\n", "> 11. Use [the Python Louvain-algorithm implementation](https://anaconda.org/auto/python-louvain) to find communities in this graph. Report the value of modularity found by the algorithm. Is it higher or lower than what you found above for the club split? What does this comparison reveal?\n", "> 12. Compare the communities found by the Louvain algorithm with the club split partitioning by creating a matrix **_D_** with dimension (2 times _A_), where _A_ is the number of communities found by Louvain. We set entry _D_(_i_,_j_) to be the number of nodes that club split group _i_ has in common with Louvain community _j_. The matrix **_D_** is what we call a [**confusion matrix**](https://en.wikipedia.org/wiki/Confusion_matrix). Use the confusion matrix to explain how well the communities you've detected correspond to the club split partitioning." ] },
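{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of how the modularity of equation 9.12 and the double edge swap of step 5 could be implemented. It assumes the partitioning is given as a dictionary mapping each node to its community label (e.g. the _\"club\"_ attribute); the names `G`, `partition` and `n_swaps` are placeholders." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "import networkx as nx\n", "\n", "# Sketch of steps 2 and 5; `partition` is assumed to map node -> community label.\n", "def modularity(G, partition):\n", "    # Equation 9.12: M = sum over communities c of (L_c / L - (k_c / (2 L))^2),\n", "    # where L_c is the number of links inside c and k_c the total degree of c.\n", "    L = G.number_of_edges()\n", "    M = 0.0\n", "    for c in set(partition.values()):\n", "        nodes = [n for n in G if partition[n] == c]\n", "        L_c = G.subgraph(nodes).number_of_edges()\n", "        k_c = sum(G.degree(n) for n in nodes)\n", "        M += L_c / L - (k_c / (2 * L)) ** 2\n", "    return M\n", "\n", "def double_edge_swap(G, n_swaps):\n", "    # Randomize the network while preserving every node's degree (steps a.-c.).\n", "    R = G.copy()\n", "    swaps = 0\n", "    while swaps < n_swaps:\n", "        (u, v), (x, y) = random.sample(list(R.edges()), 2)\n", "        if len({u, v, x, y}) < 4:\n", "            continue  # the four nodes must be distinct\n", "        if not R.has_edge(u, y) and not R.has_edge(x, v):\n", "            R.add_edge(u, y)\n", "            R.add_edge(x, v)\n", "            R.remove_edge(u, v)\n", "            R.remove_edge(x, y)\n", "            swaps += 1\n", "    return R\n", "\n", "# Example usage:\n", "# G = nx.karate_club_graph()\n", "# club = nx.get_node_attributes(G, 'club')\n", "# print(modularity(G, club))\n", "# print(modularity(double_edge_swap(G, 2 * G.number_of_edges()), club))" ] },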
{ "cell_type": "markdown", "metadata": {}, "source": [ "_Exercise: Community detection on the GME network._\n", "> * Consider the GME network you built in [Week 4](https://github.com/lalessan/comsocsci2022/blob/main/lectures/Week4.ipynb), part 2.\n", "> * Use [the Python Louvain-algorithm implementation](https://anaconda.org/auto/python-louvain) to find communities. How many communities do you find? What are their sizes? Report the value of modularity found by the algorithm. Is the modularity significantly different from 0? \n", "> * Visualize the network, using netwulf (see Week 4). This time, assign each node a color based on its _community_. Describe the structure you observe." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }