{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Formalia:\n", "\n", "Please read the [assignment overview page](https://laura.alessandretti.com/comsocsci2023/assignments.html) carefully before proceeding. This page contains information about formatting (including formats etc), group sizes, and many other aspects of handing in the assignment. \n", "\n", "_If you fail to follow these simple instructions, it will negatively impact your grade!_\n", "\n", "**Due date and time**: The assignment is due on Mar 28th at 23:59. Hand in your Jupyter notebook file (with extension `.ipynb`) via DTU Learn _(Assignment, Assignment 2)_. \n", "\n", "Remember to include in the first cell of your notebook:\n", "* the link to your group's Git repository (if you don't have a shared Git repository, it's fine. Remember to do it next time)\n", "* group members' contributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Mixing Patterns and Assortativity\n", "\n", "Let's dig in and try to understand more about the network of Computational Social Scientists using more advanced features. If your network has more than one component, just work on the giant connected component (GCC) in the exercises below. For the exercises below, you shall use the network you built in [Week 4](https://github.com/lalessan/comsocsci2023/blob/main/lectures/Week4.ipynb).\n", "\n", "\n", "> **Exercise 1: Mixing Patterns and Assortativity.** \n", ">\n", "> * For each node, compute the fraction of edges that connect to a node that works in the same top field. Find the average value across all nodes.\n", "> * Create a new graph, with the same nodes and edges, but where the association between nodes and field is shuffled. Compute the measure above for this randomized graph.\n", "> * Repeat the point above 100 times (at least). Plot the distribution of the values obtained and compare it with the value you have found for the real graph. Is the chance to connect to a member of the same field significantly higher than it would be by chance?\n", "> * Compute the assortativity coefficient with respect to author's field. How do you interpret the value you obtain? (__Hint__: See [this paper](https://nbviewer.org/github/suneman/socialgraphs2019/blob/master/lectures/Week5.ipynb), eq (2)). **Important**: here I do not want you to use the NetworkX implementation, but rather to implement your own version of the measure. \n", "> * Is the graph assortative with respect to the degree? (e.g. do high-degree scientists tend to link to other high-degree scientists, and low-degree scientists to other low-degree scientists?). Provide an interpretation of your answer.\n", "> * _Optional:_ Estimate the gender of each author from their name, using the [World Gender Name Dictionary](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MSEGSJ). Repeat the analysis above to study the assortativity of the network by gender rather than by field. What do you observe?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Communities\n", "\n", "> __Exercise 2: Zachary's karate club__: In this exercise, we will work on Zarachy's karate club graph (refer to the Introduction of Chapter 9). The dataset is available in NetworkX, by calling the function [karate_club_graph](https://networkx.org/documentation/stable//auto_examples/graph/plot_karate_club.html) \n", ">\n", "> 1. 
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2: Communities\n",
    "\n",
    "> __Exercise 2: Zachary's karate club__: In this exercise, we will work on Zachary's karate club graph (refer to the Introduction of Chapter 9). The dataset is available in NetworkX via the function [karate_club_graph](https://networkx.org/documentation/stable//auto_examples/graph/plot_karate_club.html).\n",
    ">\n",
    "> 1. Visualize the graph using [netwulf](https://netwulf.readthedocs.io/en/latest/). Set the color of each node based on the club split (the information is stored as a node attribute). My version of the visualization is below.\n",
    ">\n",
    "> 2. Write a function to compute the __modularity__ of a graph partitioning (use **equation 9.12** in the book). The function should take a NetworkX Graph and a partitioning as inputs and return the modularity (see the sketch after this list).\n",
    "> 3. Explain in your own words the concept of _modularity_.\n",
    "> 4. Compute the modularity of the karate club split partitioning using the function you just wrote. Note: the karate club split partitioning is available as a [node attribute](https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.classes.function.get_node_attributes.html) called _\"club\"_.\n",
    "> 5. We will now perform a small randomization experiment to assess whether the modularity you just computed is statistically different from $0$. To do so, we will implement the _double edge swap_ algorithm. The _double edge swap_ algorithm is quite old... it was introduced in 1891 (!) by the Danish mathematician [Julius Petersen](https://en.wikipedia.org/wiki/Julius_Petersen). Given a network G, this algorithm creates a new network in which each node has exactly the same degree as in the original network, but different connections. Here is how the algorithm works (a code sketch follows this list):\n",
    "> * __a.__ Create an identical copy of your original network.\n",
    "> * __b.__ Consider two edges in your new network, (u,v) and (x,y), such that u != y and v != x (otherwise the swap in step __c.__ would create self-loops).\n",
    "> * __c.__ If neither of the edges (u,y) and (x,v) exists already, add them to the network and remove the edges (u,v) and (x,y).\n",
    "> * Repeat steps __b.__ and __c.__ until you have performed at least N swaps (I suggest choosing N larger than the number of edges).\n",
    "> 6. Double-check that your algorithm works, by showing that each node has the same degree in the original network and in the new 'randomized' version of the network.\n",
    "> 7. Create $1000$ randomized versions of the karate club network using the _double edge swap_ algorithm you wrote in step 5. For each of them, compute the modularity of the \"club\" split and store it in a list.\n",
    "> 8. Compute the average and standard deviation of the modularity for the random networks.\n",
    "> 9. Plot the distribution of the \"random\" modularity values. Plot the actual modularity of the club split as a vertical line (use [axvline](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html)).\n",
    "> 10. Comment on the figure. Is the club split a good partitioning? Why do you think I asked you to perform a randomization experiment? Why did we preserve the nodes' degrees?\n",
    "> 11. Use [the Python Louvain-algorithm implementation](https://anaconda.org/auto/python-louvain) to find communities in this graph (a usage sketch follows this list). Report the value of modularity found by the algorithm. Is it higher or lower than what you found above for the club split? What does this comparison reveal?\n",
    "> 12. Compare the communities found by the Louvain algorithm with the club split partitioning by creating a matrix **_D_** with dimensions (2 × _A_), where _A_ is the number of communities found by Louvain. Set entry _D_(_i_,_j_) to the number of nodes that split group _i_ has in common with Louvain community _j_. The matrix **_D_** is what we call a [**confusion matrix**](https://en.wikipedia.org/wiki/Confusion_matrix). Use the confusion matrix to explain how well the communities you've detected correspond to the club split partitioning.\n",
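    "\n",
    "A minimal sketch of the modularity function of step 2, based on eq. 9.12, $M = \\sum_{c} \\left[ \\frac{L_c}{L} - \\left( \\frac{k_c}{2L} \\right)^2 \\right]$, where $L$ is the total number of edges, $L_c$ the number of edges inside community $c$, and $k_c$ the total degree of the nodes in $c$. The `partition` input format (a dict mapping nodes to community labels) is an assumption:\n",
    "\n",
    "```python\n",
    "import networkx as nx\n",
    "\n",
    "def modularity(G, partition):\n",
    "    \"\"\"partition maps each node to a community label.\"\"\"\n",
    "    L = G.number_of_edges()\n",
    "    M = 0.0\n",
    "    for c in set(partition.values()):\n",
    "        nodes_c = [n for n in G.nodes() if partition[n] == c]\n",
    "        L_c = G.subgraph(nodes_c).number_of_edges()   # edges inside community c\n",
    "        k_c = sum(d for _, d in G.degree(nodes_c))    # total degree of community c\n",
    "        M += L_c / L - (k_c / (2 * L)) ** 2\n",
    "    return M\n",
    "\n",
    "G = nx.karate_club_graph()\n",
    "club_split = nx.get_node_attributes(G, \"club\")\n",
    "print(modularity(G, club_split))\n",
    "```\n",
    "\n",
    "For step 5, one possible sketch of the double edge swap (assuming a simple undirected `nx.Graph`):\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def double_edge_swap(G, n_swaps):\n",
    "    \"\"\"Degree-preserving randomization by repeated edge rewiring.\"\"\"\n",
    "    H = G.copy()                                            # step a\n",
    "    swaps = 0\n",
    "    while swaps < n_swaps:\n",
    "        (u, v), (x, y) = random.sample(list(H.edges()), 2)  # step b\n",
    "        if u == y or v == x:\n",
    "            continue  # the swap would create a self-loop\n",
    "        if not H.has_edge(u, y) and not H.has_edge(x, v):   # step c\n",
    "            H.add_edge(u, y)\n",
    "            H.add_edge(x, v)\n",
    "            H.remove_edge(u, v)\n",
    "            H.remove_edge(x, y)\n",
    "            swaps += 1\n",
    "    return H\n",
    "```\n",
    "\n",
    "For step 11, the python-louvain package (imported as `community`) provides the two calls you need:\n",
    "\n",
    "```python\n",
    "import community as community_louvain\n",
    "\n",
    "partition = community_louvain.best_partition(G)  # node -> community id\n",
    "print(community_louvain.modularity(partition, G))\n",
    "```"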
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> __Exercise 3__: Community detection on the network of Computational Social Scientists.\n",
    ">\n",
    "> * Consider the network you built in [Week 4](https://github.com/lalessan/comsocsci2023/blob/main/lectures/Week4.ipynb).\n",
    "> * Use [the Python Louvain-algorithm implementation](https://anaconda.org/auto/python-louvain) to find communities. How many communities do you find? What are their sizes? Report the value of modularity found by the algorithm. Is the modularity significantly different from 0?\n",
    "> * If you are curious, you can also try the *Infomap* algorithm: see [this page](https://mapequation.github.io/infomap/python/). It's harder to install, but it's a better community-detection algorithm. You can read about it in [advanced topics 9B](http://networksciencebook.com/chapter/9#advanced-9b).\n",
    "> * Visualize the network using netwulf (see Week 5). This time, assign each node a color based on its _community_. Describe the structure you observe.\n",
    "> * Make sure you save the assignment of authors to communities.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 3: TF-IDF\n",
    "\n",
    "\n",
    "> __Exercise 4: TF-IDF and the Computational Social Science communities.__ The goal of this exercise is to find the words characterizing each of the communities of Computational Social Scientists.\n",
    "> What you need for this exercise:\n",
    "> * The assignment of each author to their network community, and the degree of each author (Week 6, Exercise 4). This can be stored in a dataframe or in two dictionaries, as you prefer.\n",
    "> * The tokenized _abstract_ dataframe (Week 7, Exercise 2).\n",
    ">\n",
    "> 1. First, check out [the Wikipedia page for TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Explain in your own words the point of TF-IDF.\n",
    "> * What does TF stand for?\n",
    "> * What does IDF stand for?\n",
    "> 2. Now, we want to find out which words are important for each *community*, so we're going to create several ***large documents, one for each community***. Each document includes all the tokens of abstracts written by members of a given community.\n",
    "> * Consider a community _c_.\n",
    "> * Find all the abstracts of papers written by a member of community _c_.\n",
    "> * Create a long array that stores all the abstract tokens.\n",
    "> * Repeat for all the communities.\n",
    "> __Note:__ Here, to ensure your code is efficient, you should exploit ``pandas`` built-in functions, such as [``groupby.apply``](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html) or [``explode``](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html).\n",
    "> 3. Now, we're ready to calculate the TF for each word. Use the method of your choice to find the top 5 terms within the __top 5 communities__ (by number of authors).\n",
    "> * Describe similarities and differences between the communities.\n",
    "> * Why aren't the TFs necessarily a good description of the communities?\n",
    "> * Next, we calculate the IDF for every word.\n",
    "> * What base logarithm did you use? Is that important?\n",
    "> 4. We're ready to calculate TF-IDF. Do that for the __top 9 communities__ (by number of authors). Then, for each community (a minimal TF-IDF sketch follows this exercise):\n",
    "> * List the 10 top TF words.\n",
    "> * List the 10 top TF-IDF words.\n",
    "> * List the top 3 authors (by degree).\n",
    "> * Are these 10 words more descriptive of the community? If yes, what is it about IDF that makes the words more informative?\n",
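    "\n",
    "A minimal TF-IDF sketch. `community_tokens` is assumed to be a dict mapping each community label to the flat list of abstract tokens built in step 2 (the name and format are assumptions, not a required interface):\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "def tf(tokens):\n",
    "    \"\"\"Term frequency: raw counts normalized by document length.\"\"\"\n",
    "    counts = Counter(tokens)\n",
    "    return {term: count / len(tokens) for term, count in counts.items()}\n",
    "\n",
    "def idf(documents):\n",
    "    \"\"\"Inverse document frequency across the community documents.\n",
    "    The natural log is used here; changing the base rescales all\n",
    "    scores by a constant, so the rankings are unaffected.\"\"\"\n",
    "    n_docs = len(documents)\n",
    "    df = Counter(t for tokens in documents.values() for t in set(tokens))\n",
    "    return {term: math.log(n_docs / freq) for term, freq in df.items()}\n",
    "\n",
    "def tf_idf(community_tokens):\n",
    "    idf_scores = idf(community_tokens)\n",
    "    return {c: {t: f * idf_scores[t] for t, f in tf(tokens).items()}\n",
    "            for c, tokens in community_tokens.items()}\n",
    "\n",
    "scores = tf_idf(community_tokens)\n",
    "top10 = {c: sorted(s, key=s.get, reverse=True)[:10] for c, s in scores.items()}\n",
    "```"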
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> __Exercise 5: The Wordcloud__. It's time to visualize our results!\n",
    "> * Install the [`WordCloud`](https://pypi.org/project/wordcloud/) module.\n",
    "> * Now, create a word cloud for each community. Feel free to make it as fancy or non-fancy as you like.\n",
    "> * Make sure that, together with the word cloud, you print the names of the top three authors in each community (see my plot above for inspiration).\n",
    "> * Comment on your results. What can you conclude about the different sub-communities in Computational Social Science?\n",
    "> * Look up the top author of each community online. In light of your search, do your results make sense?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "vscode": {
   "interpreter": {
    "hash": "5c7b89af1651d0b8571dde13640ecdccf7d5a6204171d6ab33e7c296e100e08a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}