{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 2" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Today\n", "\n", "In this lecture, we will continue to work on the data. \n", "\n", "* **Part 1:** We will learn the differences between **different kinds of data sources**. We will go a bit through the theory, and then work on an exercise. \n", "\n", "* **Part 2:** In the second part of this class, I will introduce you to **APIs**. We will use one API to gather some data on Computational Social Scientists and their works. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Data Sources for Computational Social Science\n", "\n", "\n", "We have seen how __DATA__ is central to Computational Social Science. But what data sources are we talking about? What are the limitations of different types of data sources? In the video below, I will give you an introduction to different types of data sources. As an example, I will introduce you to two studies that use two very different datasets to answer a similar question. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "> **_Video lecture_**: Watch the video below about Data Sources in Computational Social Science\n", ">\n", "> *Optional Reading: [The Spread of Behavior in an Online Social Network Experiment.](https://www.science.org/doi/full/10.1126/science.1185231)* This is the article describing the first study I talk about in the video. \n", "> *Optional Reading: [Exercise contagion in a global social network.](https://www.nature.com/articles/ncomms14753)* This is the article describing the second study I talk about in the video. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"Hr5yKJaQUhE\",width=600, height=337.5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this course, we are going to focus mostly on observational data collected online to address social science questions. So, I would like us to reflect a little bit more on what it means to use *Ready made* data in the social science, and understand its advantages and challenges. This is something that you can read about in Sections 2.1 to 2.3 of the book _Bit by Bit_. \n", "\n", "> *Reading*: [Bit by Bit, sections 2.1 to 2.3](https://www.bitbybitbook.com/en/1st-ed/observing-behavior/observing-intro/) Read sections 2.1 to 2.3. I don't expect you to read all the details, but to have a general understanding of advantages and challenges of large observational datasets (a.k.a. Big Data/Ready made data) for social science research." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Using APIs to download data\n", "\n", "In this class, we will work with *Ready made* data. The second thing we will learn today is how to get data ready made data using APIs. We will do it using the Academic Graph API provided by Semantic Scholar. The Academic Graph API enables you to gather information on scientists and their publications. \n", "\n", "I made a short video for you to get familiar with the API. Check it out here below. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "> **_Video lecture_**: Watch the video below about APIs ([here is the notebook I used in the video](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/API_example.ipynb) ) " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"7AQO3vJptvg\",width=600, height=337.5)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise: A list of APIs for Computational Social Science research.** Work in pairs. \n", ">\n", "> - Use the web to look for two APIs that could be used for gathering interesting data (from a Computational Social Science perspective). \n", "> - *Data description*: describe in a couple of lines the data types that you can gather using each API\n", "> - *Rate limits*: What are the rate limits of the free version of each API? \n", "> - Add your APIs to [this list](https://docs.google.com/spreadsheets/d/1LHdU-E6msMduqIHOvWGqn1uAbo5YjVwft2YkVu5subY/edit?usp=sharing) (**note: if someone had the same idea before you, just add your resource to the list a second time**). \n", ">\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prelude to part 3: Pandas Dataframes\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before starting, we will learn a bit about [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), a very user-friendly data structure that you can use to manipulate tabular data. Pandas dataframes are built on NumPy, which is in turn largely implemented in C, so they are quite an efficient data structure. You will find them quite useful :)\n", "\n", "Pandas dataframes should be intuitive to use. 
**I suggest you go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to learn what you need to solve the next exercise.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Getting data from the Semantic Scholar API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All right, so now we will start gathering some data on Computational Social Scientists and their publications. \n", "We will do this in steps. \n", "There is a lot of data to gather and process, so be patient. \n", "Feel free to team up if you find it useful, and if you have any issues, come and ask me or the TAs. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise: Find potential Computational Social Scientists** In this exercise, we are going to find a list of potential authors of Computational Social Science papers using the Semantic Scholar API. The idea here is that we will find the researchers who have been to a conference on Computational Social Science or have worked with someone who has.\n", ">\n", "> 1. Consider the set of unique researcher names that you collected in Week 1, Exercise 3 (considering all years). Use the _author_ endpoint of the [Academic Graph API](https://api.semanticscholar.org/api-docs/graph#tag/Author-Data) to _search_ for these researchers in the database based on their names. For each researcher in your list, find: \n", "> - their _authorId_ (the unique identifier in the Semantic Scholar API). **Hint**: the first result is typically the one that best matches your query. \n", "> - the _authorId_ of their collaborators. **Hint**: check out the field papers.authors\n", "> 2. Save the list of ids of the authors and their collaborators. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Exercise: Find potential Computational Social Science papers** Now, we are going to look for works by these researchers. \n", ">\n", "> 1. 
Consider the list of author ids you have found in the exercise above. For each author, use the Academic Graph API to find:\n", ">\n", "> - their _aliases_\n", "> - their _name_\n", "> - their _papers_, where for each paper we want to retain: \n", "> - _title_ \n", "> - _abstract_ \n", "> - the _year_ of publication\n", "> - the _externalIds_ (this is because there are universal identifiers for scientific works called DOI that we can use across platforms)\n", "> - _s2FieldsOfStudy_ the fields of study\n", "> - _citationCount_ the number of times that this paper was cited \n", "> (**Hint**: you can retrieve authors in batches)\n", "> \n", ">\n", "> 2. Create three dataframes to store the data you have collected. \n", "> \n", "> - **Author dataset:** in the author dataset, one row is one unique author, and each row contains the following information: \n", "> - *authorId*: (str) the id of the author\n", "> - *name*: (str) the name of the author\n", "> - *aliases*: (list) the aliases of the author\n", "> - *citationCount*: (int) the total number of citations received by an author\n", "> - *field*: (str) the _s2FieldsOfStudy_ that occurs most often across an author's papers (you should first obtain the *category* for each _s2FieldsOfStudy_)\n", "> - **Paper dataset:** in the paper dataset, one row is one unique paper, and each row contains the following information:\n", "> - *paperId*: (str) the id of the paper\n", "> - *title*: (str) the title of the paper\n", "> - *year*: (int) the year of publication\n", "> - *externalId.DOI:* (str) the DOI of the paper\n", "> - *citationCount*: (int) the number of citations\n", "> - *fields*: (list) the fields included in the paper (you should first obtain the *category* for each _s2FieldsOfStudy_)\n", "> - *authorIds:* (list) this is a list of *author Ids*, including all the authors of this paper that are in our author dataset\n", "> - **Paper abstract dataset:** in the paper abstract dataset, one row is one unique paper, and each row 
contains the following information: \n", "> - *paperId*: (str) the id of the paper\n", "> - *abstract*: (str) the abstract of the paper \n", "> (Note: we keep the abstract separate to keep the size of files more manageable)\n", "> \n", "> 3. Save the three dataframes to file.\n", "> 4. How many rows does the _Author_ dataframe have? How many rows does the _Paper_ dataframe have? \n", "> 5. One person per pair: go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/145262) and fill in the survey \"_Week 2 - Semantic Scholar API data_\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Your Feedback\n", "I hope you enjoyed today's class. It would be awesome if you could spend a few minutes sharing your feedback. \n", "**Go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/145262) and fill in the survey \"_Week 2 - Feedback\"_.**" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "vscode": { "interpreter": { "hash": "5c7b89af1651d0b8571dde13640ecdccf7d5a6204171d6ab33e7c296e100e08a" } } }, "nbformat": 4, "nbformat_minor": 1 }