{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Some important admin and homework for next time.\n", "\n", "\n", "__Important point__. If you have not filled out __[the mid-term survey](https://forms.gle/wQMj2GV4XPFNJEuS7)__, please do! It's 100% anonymous. I promise it should not take more than 5 minutes. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview of today's class.\n", "\n", "This week's curriculum is about text analysis. The overview is:\n", "\n", "* Finding the important words in a document (TF-IDF)\n", "* Applying these tricks to understand what different communities in Computational Social Science are working on. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1 - Words that characterize Computational Social Science communities\n", "\n", "In this section, we'll begin to play around with how far we can get with simple strategies for looking at text. In the video, Sune talks about a fun paper, which shows how little is needed to reveal something very interesting about the humans who produce text. Then, we'll use a simple weighting scheme called TF-IDF to find the words in the Computational Social Science papers that characterize different communities. Finally, we'll visualize them in a fun little word cloud (below is what I found). The word clouds may not be immediately understandable, but if you do some research on the important words, you will find that the TF-IDF method extracts quite interesting information.\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Video lecture**: Simple methods reveal a lot. Sune talks a little bit about the paper: [Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791). 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/jpeg": "\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"wkYvdfkVmlI\", width=800, height=450)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> __Exercise 1: TF-IDF and the Computational Social Science communities.__ The goal of this exercise is to find the words characterizing each of the communities of Computational Social Scientists.\n", "> What you need for this exercise: \n", "> * The assignment of each author to their network community, and the degree of each author (Week 6, Exercise 4). This can be stored in a dataframe or in two dictionaries, as you prefer. \n", "> * The tokenized _abstract_ dataframe (Week 7, Exercise 2)\n", ">\n", "> 1. First, check out [the Wikipedia page for TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Explain in your own words the point of TF-IDF. \n", "> * What does TF stand for? \n", "> * What does IDF stand for?\n", "> 2. Now, we want to find out which words are important for each *community*, so we're going to create several ***large documents, one for each community***. Each document includes all the tokens of abstracts written by members of a given community. \n", "> * Consider a community _c_\n", "> * Find all the abstracts of papers written by a member of community _c_.\n", "> * Create a long array that stores all the abstract tokens \n", "> * Repeat for all the communities. \n", "> __Note:__ To keep your code efficient, you should use ``pandas`` built-in functions, such as [``groupby.apply``](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html) or [``explode``](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html).\n", "> 3. Now, we're ready to calculate the TF for each word. 
Use the method of your choice to find the top 5 terms within the __top 5 communities__ (by number of authors). \n", "> * Describe similarities and differences between the communities.\n", "> * Why are the TFs not necessarily a good description of the communities?\n", "> * Next, we calculate IDF for every word. \n", "> * What base logarithm did you use? Is that important?\n", "> 4. We're ready to calculate TF-IDF. Do that for the __top 9 communities__ (by number of authors). Then for each community: \n", "> * List the 10 top TF words \n", "> * List the 10 top TF-IDF words\n", "> * List the top 3 authors (by degree)\n", "> * Are the 10 top TF-IDF words more descriptive of the community than the 10 top TF words? If yes, what is it about IDF that makes the words more informative?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> __Exercise 2: The Wordcloud__. It's time to visualize our results!\n", "\n", "> * Install the [`WordCloud`](https://pypi.org/project/wordcloud/) module. \n", "> * Now, create a word cloud for each community. Feel free to make it as fancy or non-fancy as you like.\n", "> * Make sure that, together with the word cloud, you print the names of the top three authors in each community (see my plot above for inspiration). \n", "> * Comment on your results. What can you conclude about the different sub-communities in Computational Social Science? \n", "> * Look up the top author in each community online. In light of your search, do your results make sense?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }