{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab exercise, you will learn how to perform scientometric network analysis in Python. We will start with practicalities on some basic data handling and import. We then move on to creating a network and cover some basic analysis. In the next session, we will be using more advanced techniques." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "This Python notebook is intended to be used as an exercise. We have prepared it for you to include many details, but at some parts we will ask you to fill in some of the blanks. Exercises where you are asked to do something, or to think about something, will be indicated like this. If you need to execute and write your own code, we provide empty space below to do so.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "If you need any help with anything, please don't hesitate to ask your teachers. \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data handling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python is a general purpose programming language and it can be used to handle data in general. In this notebook we will specifically deal with scientometric datasets, but you can also use it for other purposes.\n", "\n", "We will start by handling some data from a scientometric data source. There are many different possible data sources, and we discussed some of them earlier this week. In this notebook we will focus on data downloaded from Web of Science. We have already downloaded some data for you to demonstrate Python. At the end of the exercise you will be asked to load your own data. \n", "\n", "The data that we provided is a selection of publications from authors from Belgium from Tropical Medicine from 2000-2017." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Note: You cannot load your own data when you run this notebook online using Binder.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by loading the data. In order to read in the data, we first need to make sure that Python is able to read it. A very versatile *package* for handling data in Python is called `pandas`. For those of you familiar with `R`, it is similar to the `data.frame` in `R`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We *import* this package as follows, and we call the `pandas` package `pd`, for easy reference. We also need the `csv` package to indicate some options to the `pandas` package." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " In order to execute the code you have to press Ctrl-Enter while selecting the code cell below. Alternatively, you can press the \"Play\" button at the top of the screen. This also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " If you have executed that code cell correctly, it should now be numbered 1. While the code in a cell is being executed it is marked by an asterisk *. Each cell of executed code will be numbered in the order in which you execute it. If you execute it again, it will be numbered 2, et cetera.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready to read in the data that you just downloaded. We have named the `pandas` package `pd`, which will save us some typing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df = pd.read_csv('data-files/wos/tab-delimited/savedrecs_0001_0500.txt', \n", " sep='\\t', index_col='UT',\n", " quoting=csv.QUOTE_NONE, usecols=range(68))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We called the *function* `read_csv` of the `pandas` package. We provide it with several *arguments*. \n", "\n", "1. The location of the file we want to read.\n", "\n", "2. The second argument is a *named argument*, we provide both the name of the argument (`sep`) and its value (`'\\t'`). This indicates the *sep*arator between different fields. In this case it is a tab-delimited file, so the fields are separated by tabs, which is indicated by `'\\t'`.\n", "\n", "3. The third argument is again a named argument. We indicate that the `UT` field should be the index. This is the unique identifier that WoS uses.\n", "\n", "The two subqeuent arguments are needed to correctly handle some peculiarities of WoS files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "We downloaded some example files for you, which are located in the folder data_files/wos. At the end of this notebook, you will be asked to download your own data. If you want to load that data instead, use the path to that data.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Note: Windows usually uses backslashes \\ to separate directories, in Python you can also use the forward slash /, which is usually more convenient for a number of reasons.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pandas` package took care of reading the file, and has now stored it in the variable called `publications_df`. You can take a closer look at `publications_df` to see the data that we just read." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will see that the data has quite cryptic column headers. Each line contains information about a single publication, and contains various details, such as the title (`TI`), abstract (`AB`), authors (`AU`), journal title (`SO`) and cited references (`CR`). Unfortunately, the documentation of Web of Science is relatively limited, but some explanation can be found here. You can retrieve this information in various ways from the pandas dataframe `publications_df`. For example, you can list the first five titles as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df.TI[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, `[:5]` indicates that you want the first elements (starting at 0) until (but excluding) 5, so item 0, 1, 2, 3 and 4. This is called a *slice* of the data. You can also look at authors for rows 5 until 10 as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df.AU[5:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to get the last few elements, you can use negative indices. The last element is indicated by `-1`, the penultimate element is indicated by `-2`, and so on. 
You can get the journals for the last five sources as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df.SO[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, there are various ways to index the dataframe. For example, to get the title and abstract for the first five elements you can do the following." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df[0:5][['TI', 'AB']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notation `['TI', 'AB']` creates a *list* of elements in Python. We now used it to get multiple columns from the dataframe. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following does exactly the same:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df[['TI', 'AB']][0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pandas` package automatically determines whether you try to get columns or rows. Slices are always assumed to refer to rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
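As a minimal sketch of the two kinds of indexing described above, the toy dataframe below shows that a slice selects rows by position, a list of names selects columns, and the `.iloc` indexer makes the positional intent explicit. The titles, journals and `WOS:` identifiers are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy data; the identifiers and values are invented for illustration.
toy_df = pd.DataFrame(
    {'TI': ['Paper A', 'Paper B', 'Paper C', 'Paper D'],
     'PY': [2001, 2005, 2005, 2017],
     'SO': ['Journal X', 'Journal Y', 'Journal X', 'Journal Z']},
    index=pd.Index(['WOS:1', 'WOS:2', 'WOS:3', 'WOS:4'], name='UT'))

# A slice refers to rows (by position), a list of names refers to columns.
first_two = toy_df[0:2][['TI', 'PY']]

# The same selection with the explicit positional indexer .iloc.
same_rows = toy_df[['TI', 'PY']].iloc[0:2]

# Negative positions count from the end.
last_title = toy_df['TI'].iloc[-1]

print(first_two)
print(last_title)
```

Using `.iloc` for positions and `.loc` for labels avoids any ambiguity when the index consists of identifiers rather than numbers.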
\n", " Show the title (TI), abstract (AB), journal (SO) and publication year (PY) for rows 200-210.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "To start typing in the cell below, select the cell using the mouse, or select it using the arrows on the keyboard and press Enter\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also access a particular `UT` directly by using the `.loc` indexer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df.loc['WOS:000419235100004', ['TI', 'AU', 'SO', 'PY']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading multiple files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Until now we have only loaded one file. But we have of course downloaded more files, and we need to load all of them. We can list all files in a directory using the package `glob`. We first import the package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import glob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us get a list of all files in the directory `data_files/wos/tab-delimited/`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", "files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We asked `glob` for a list of files that end with `txt` (`*.txt`) in the directory `data-files/wos/tab-delimited`. We sorted the list to ensure that we read the files in the correct order. We can now simply pass this list of files to read multiple WoS files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", " usecols=range(68), index_col='UT') for f in files)\n", "publications_df = publications_df.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
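When several files are concatenated, the same record may end up in the result more than once. The sketch below illustrates one way to check for this, using two small in-memory dataframes in place of downloaded files. The data and the names `part_1`, `part_2` and `deduplicated` are invented for illustration, and whether the provided files actually overlap is not something this example assumes.

```python
import pandas as pd

# Two small frames stand in for two downloaded files (invented data).
part_1 = pd.DataFrame({'UT': ['WOS:3', 'WOS:1'], 'PY': [2003, 2001]}).set_index('UT')
part_2 = pd.DataFrame({'UT': ['WOS:2', 'WOS:1'], 'PY': [2002, 2001]}).set_index('UT')

# Stack the frames on top of each other and sort by identifier.
combined = pd.concat([part_1, part_2]).sort_index()

# A record that occurs in two files appears twice after concatenation,
# so the index is no longer unique.
print(combined.index.is_unique)

# Keep only the first occurrence of each identifier.
deduplicated = combined[~combined.index.duplicated(keep='first')]
print(deduplicated)
```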
\n", " Now checkout the new publications_df data frame, and see how many rows it has.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data summarisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pandas` package provides various ways to summarise the data and get a useful overview of the data. For example, you can group by a certain column, and count or sum things. For example, we can count the number of articles in each journal that is included in this dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped_by_journal = publications_df.groupby('SO')\n", "grouped_by_journal.size().sort_values(ascending=False)[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also ask the mean publication year of publications in those journals" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped_by_journal['PY'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
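The grouping pattern above can be tried out on a small example first. The sketch below uses an invented toy dataframe and shows that `size` counts the rows per group, while `agg` computes several summaries of one column in a single call.

```python
import pandas as pd

# Invented toy data: one row per publication.
toy_df = pd.DataFrame({
    'SO': ['Journal X', 'Journal Y', 'Journal X', 'Journal Z', 'Journal X'],
    'PY': [2001, 2005, 2005, 2017, 2009]})

grouped = toy_df.groupby('SO')

# Number of rows per journal, largest first.
counts = grouped.size().sort_values(ascending=False)

# Several summaries of the publication year at once.
summary = grouped['PY'].agg(['count', 'mean', 'min', 'max'])

print(counts)
print(summary)
```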
\n", " Group by the year (PY) and count the number of paper from each year.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Now it is time to introduce you a little trick: you can get a list of all functions and argument of some variable by simply pressing Tab. For example, you can type publications_df., including the . and then press Tab (make sure the cursor is located after the .). If you then start typing the name of the function you are looking for and press Tab again, Python will automatically finish it as much as possible. This is something general: whenever you press Tab Python will try to autocomplete whatever you are typing.\n", "\n", "One other trick: if you have selected a function and press Shift-Tab you will get documentation of what this function does. You can press the + to find out more.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Network generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ultimately, we would like to use this data to generation scientometric networks. This is not a trivial task, and we will now show how to construct a co-authorship network and a journal level bibliographic coupling network.\n", "\n", "We first load the network analysis package that we will use in the notebook, `igraph`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Import the pacakge igraph and call it ig.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Co-authorship" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first build a co-authorship network. We will do this one publication at the time. All combinations of authors that are involved in a publication are co-authors. Let us look at the authors for publication 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df['AU'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the authors are all listed and separated with a semicolon (`;`). In computer terms, it is now a single *string*. We will split this string of all authors into a list of strings where each string then represents a single author." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df['AU_split'] = publications_df['AU'].fillna('').str.split('; ')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "authors = publications_df['AU_split'][0]\n", "authors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to create all possible combinations, we can use a convenient package, called `itertools`. The function `combinations` can generate all possible combinations of the elements of a list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools as itr\n", "list(itr.combinations(authors, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we don't want to do this for a single publication only, but rather, for all publications in our dataset. We can do that using the function `apply`. We can supply it with a small function (called a `lambda` function) that simply takes some input and produces some output. 
In this case, the input is the list of `authors`, and the output is the result of `itr.combinations(...)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coauthors_per_publication = publications_df['AU_split'].apply(\n", " lambda authors: list(itr.combinations(authors, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable `coauthors_per_publication` now holds, for each publication, a list of co-author pairs. That is, each element of `coauthors_per_publication` contains a list of all co-author pairs for that publication. So, `coauthors_per_publication.iloc[0]` contains the co-authors we examined previously." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coauthors_per_publication.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us turn each element of these lists into a separate row. This is done by using `explode` in `pandas`. Publications with only one author have no co-authors, which results in an `NA` (Not Available) value. We will drop those using `dropna`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coauthors = coauthors_per_publication.explode().dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can create the actual network as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship = ig.Graph.TupleList(\n", " edges=coauthors.to_list(),\n", " vertex_name_attr='author',\n", " directed=False\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this graph will still contain many duplicate edges, because two authors who wrote several papers together are connected by one edge for each joint paper. Let us therefore simplify this graph, and count the number of duplicate edges between each pair of authors. We first create a so-called edge attribute `n_joint_papers`. We can create it by using the edge sequence `es` of the graph. 
We can then simply sum this weight when we simplify the graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.es['n_joint_papers'] = 1\n", "G_coauthorship = G_coauthorship.simplify(combine_edges='sum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us see how many authors (i.e. nodes) there are in the network. This is called the `vcount` (vertex count) in `igraph`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vcount()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, the number of edges is available as the `ecount` of the graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.ecount()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do all sorts of analysis on this network. But first, we will create a bibliographic coupling network." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bibliographic coupling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bibliographic coupling and co-authorship are in a sense very similar. Previously, we computed for each publication all combinations of co-authors. For bibliographic coupling we can compute for each cited reference all combinations of citing journals. We will first create a dataframe that lists, for each publication, the citing journal (`SO`) and its cited references (`CR`). Similar to the authors, we first need to split the cited references." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publication_with_cr_df = publications_df.loc[pd.notnull(publications_df['CR']), ['SO', 'CR']]\n", "publication_with_cr_df['CR'] = publication_with_cr_df['CR'].str.split('; ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now simply list all citations from a certain journal (`SO`) to a certain cited reference (`CR`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "journal_cits_df = publication_with_cr_df[['SO', 'CR']].explode('CR')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then create all bibliographic couplings per cited reference as follows. We first group by the cited reference (`CR`) and then take all combinations of citing journals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bibcoupling_per_cr = journal_cits_df.groupby('CR').apply(lambda x: list(itr.combinations(x['SO'], 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We again `explode` all combinations of two sources citing the same reference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bibcouplings = bibcoupling_per_cr.explode().dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then create the network." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coupling = ig.Graph.TupleList(\n", " edges=bibcouplings,\n", " vertex_name_attr='SO',\n", " directed=False\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
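The steps above can be sketched on a toy example. The journals and references below are invented for illustration. Instead of building an `igraph` graph, this sketch counts the pairs with a plain `Counter`, which plays the same role as summing an edge attribute when a graph is simplified: each pair is put in a fixed order so that (X, Y) and (Y, X) are counted together, and a journal paired with itself is skipped.

```python
import itertools as itr
from collections import Counter

import pandas as pd

# Invented citing journals (SO) and cited references (CR).
toy_df = pd.DataFrame({
    'SO': ['Journal X', 'Journal Y', 'Journal Z', 'Journal X'],
    'CR': ['Ref 1; Ref 2', 'Ref 1; Ref 2', 'Ref 2', 'Ref 3']})

# One row per (citing journal, cited reference) pair.
cits = toy_df.assign(CR=toy_df['CR'].str.split('; ')).explode('CR')

# For every cited reference, pair up the journals that cite it.
coupling = Counter()
for _, journals in cits.groupby('CR')['SO']:
    for a, b in itr.combinations(sorted(journals), 2):
        if a != b:  # skip a journal coupled with itself
            coupling[(a, b)] += 1

print(coupling)
```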
\n", " We again need to simplify this network. Create a new edge attribute called coupling set it to 1 and then sum this attribute when simplifying the network.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This network should be reasonably sized, and you should be able to visualize this network by calling `ig.plot`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ig.plot(G_coupling, vertex_label=G_coupling.vs['SO'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Network analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have created some scientometric networks, let us look at some basic analyses of these networks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connectivity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us start with a very simple question. Is the co-authorship network connected?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.is_connected()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, not all authors in this dataset are connected via co-authored papers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "How many authors do you think will be connected to each other? 500? 5000? Almost everybody?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to take a closer look, we need to detect the *connected components*. This is easily done, but the function is confusingly called `clusters`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "components = G_coauthorship.clusters()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only want the so-called giant component. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
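To make the idea of connected components concrete, here is a minimal pure-Python sketch on an invented edge list. It is not how `igraph` is implemented, only an illustration of the concept: starting from an unvisited node, a breadth-first search collects everything that can be reached, and that set of nodes forms one component.

```python
from collections import defaultdict, deque

# Invented undirected co-authorship edges.
edges = [('A', 'B'), ('B', 'C'), ('D', 'E'), ('F', 'G'), ('C', 'A')]

# Store the set of neighbours of every node.
neighbours = defaultdict(set)
for u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)


def connected_components(neighbours):
    # Return a list of node sets, one set per connected component.
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        components.append(component)
    return components


components = connected_components(neighbours)
largest = max(components, key=len)
print(len(components), sorted(largest))
```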
\n", "What function do you think returns the giant component?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Remember, you can use Tab and Shift-Tab to find out more about possible functions.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us only look at the giant component." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H = components.giant()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us check how many nodes are in the giant component. We can call the function `summary`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(H.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line indicates that we have an undirected graph (`U`) with 7871 nodes and 69928 links. The next line shows vertex attributes (indicated by the `v` behind the name of the attribute), and edge attributes (indicated by the `e`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
    \n", "
  1. What is the percentage of nodes that are in the giant component? \n", "
  2. Double check whether the giant component is connected.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us take a closer look at how far authors in this data set are apart from one another. Let us simply take a look at node number `0` (remember, the first node has number `0`, not `1`) and node number `355`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "paths = G_coauthorship.get_shortest_paths(0, 355)\n", "paths" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a list of all shortests paths of the nodes between node number 0 and node number 355. In fact, there is only one path, so let us select that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = paths[0]\n", "path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "How many nodes are in the path? What is the path length?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These numbers probably do not mean that much to you. You can find out more about an individual node by looking at the `VertexSequence` of `igraph`, abbreviated as `vs`. This is a sort of list of all vertices, and is indexed by brackets `[ ]`, similar to lists, instead of parentheses `( )` as we do for functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vs[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vertex itself is also a type of list (called a *dictionary*), and you can only return the author name as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vs[0]['author']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also list multiple vertices at once." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vs[[0, 3, 223, 355]]['author']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can of course also simply pass the variable `path` that we constructed earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vs[path]['author']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows that Osaer collaborated with Geert, who collaborated with Van Mark, who in the end collaborated with Watkins." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also get the vertex by searching for the author name. For example, if we want to find `'Van Marck, E'` we can use the following." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vs.find(author_eq = 'Van Marck, E')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here `author_eq` refers to the condition that the vertex attribute `author` should **eq**ual `'Van Marck, E'`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Find the shortest path from 'Van Marck, E' to 'Migchelsen, S'. Who is in between?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can let `igraph` also calculate how far apart all nodes are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "The following may take some time to run\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_lengths = G_coauthorship.path_length_hist()\n", "print(path_lengths)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "How far apart are most authors? Do you think most authors are close by? Or do you think they are far apart?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us take a closer look at the path between node 0 and node 355 again. Instead of the nodes on the path, we now want to take a closer look at the edges on the path." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epath = G_coauthorship.get_shortest_paths(0, 355, output='epath')\n", "epath" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the `VertexSequence` we encountered earlier, there is also an `EdgeSequence`, abbreviated as `es`. Let us take a closer look to the number of joint papers that the authors had co-authored." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.es[epath[0]]['n_joint_papers']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epath = G_coauthorship.get_shortest_paths(0, 355, weights='n_joint_papers', output='epath')\n", "epath" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do get a different path, which it is actually longer. Let us take a look at the number of joint papers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.es[epath[0]]['n_joint_papers']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total number of joint papers is lower! That is because *shortest path* means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the *shortest path*." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Attention! Weighted shortest paths have the lowest total weight.\n", "
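One common way around this is to turn the weight into a cost, for example 1 divided by the number of joint papers, so that frequent collaborations become short steps. The sketch below illustrates the effect with a small hand-written version of the Dijkstra algorithm on an invented graph. The function `shortest_path` and the data are assumptions made for this illustration and are not part of `igraph`. In `igraph` the same idea would presumably amount to storing the cost as an edge attribute and passing its name via `weights`.

```python
import heapq

# Invented co-authorship graph: (author, author) -> number of joint papers.
joint_papers = {('A', 'B'): 1, ('B', 'D'): 1,
                ('A', 'C'): 5, ('C', 'E'): 4, ('E', 'D'): 6}


def shortest_path(edge_cost, source, target):
    # Dijkstra: return (total cost, path) with the lowest summed edge cost.
    adjacency = {}
    for (u, v), cost in edge_cost.items():
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))
    queue, done = [(0, source, [source])], set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == target:
            return total, path
        if node in done:
            continue
        done.add(node)
        for other, cost in adjacency[node]:
            if other not in done:
                heapq.heappush(queue, (total + cost, other, path + [other]))
    return float('inf'), []


# Using the counts directly favours the weakest ties.
by_count = shortest_path(joint_papers, 'A', 'D')
print(by_count)

# Using 1 / count as the cost makes frequent collaborations short.
inverse = {pair: 1 / n for pair, n in joint_papers.items()}
by_inverse = shortest_path(inverse, 'A', 'D')
print(by_inverse)
```

With the raw counts the path runs through the two single-paper collaborations, whereas with the inverse cost it follows the longer chain of frequent collaborators.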
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering coefficient" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look whether co-authors of an author also tend to be co-authors among themselves." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us take a look at the co-authors of of author number 0, which are called the *neighbors* in network terminology." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.neighborhood(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we actually want to know is whether many of those neighors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H = G_coauthorship.induced_subgraph(G_coauthorship.neighborhood(0))\n", "print(H.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This subgraph only has 4 nodes (including node 0, so it has 3 neighbours) and 6 edges. This is sufficiently small to be easily plotted for visual inspection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H.vs['color'] = 'red'\n", "H.vs[0]['color'] = 'grey'\n", "ig.plot(H)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Do many of the co-authors collaborate among themselves as well? Why do you think this happens?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also ask `igraph` to calculate the clustering coefficient (which is called *transitivity* in igraph, which is the same concept using different terms) of node 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.transitivity_local_undirected(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
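To see what this number measures, the local clustering coefficient can also be computed by hand: it is the fraction of pairs of neighbors of a node that are themselves connected. The sketch below does this for a small invented graph. The function `local_clustering` is written for this illustration only, and returning NaN for nodes with fewer than two neighbors is a choice made here, not a statement about what `igraph` does.

```python
import itertools as itr

# Invented undirected graph, stored as a set of neighbors per node.
neighbors = {
    'A': {'B', 'C', 'D'},
    'B': {'A', 'C'},
    'C': {'A', 'B'},
    'D': {'A'},
}


def local_clustering(neighbors, node):
    # Fraction of pairs of neighbors of the node that are linked to each other.
    pairs = list(itr.combinations(sorted(neighbors[node]), 2))
    if not pairs:
        return float('nan')  # undefined for fewer than two neighbors
    linked = sum(1 for u, v in pairs if v in neighbors[u])
    return linked / len(pairs)


# A has three neighbors, so three possible pairs, of which only B-C is linked.
print(local_clustering(neighbors, 'A'))
print(local_clustering(neighbors, 'B'))
```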
\n", "What percentage of the co-authors of node 0 have also written papers with each other?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can calculate the average for all nodes using the function `transitivity_avglocal_undirected`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "What percentage of the co-authors have also written papers with each other on average? Do you think this is high or not?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Centrality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, people want to identify wich nodes seem to be most important in some way in the network. This is often thought of as a type of *centrality* of a node." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Degree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest type of centrality is the *degree* of a node, which is simply the number of its neighbors. Previously, we saw that node 0 had 3 neighbors, we therefore say its degree is 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.degree(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also simply calculate the degree for everybody and store it in a new vertex attribute called `degree`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.vs['degree'] = G_coauthorship.degree()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " What is the degree of 'Van Marck, E'?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also take a look at the complete degree distribution. To plot it, we load the `matplotlib` package. We import the plotting functionality and name the package `plt`. We also include a statement telling Python to show the plots immediately in this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us plot a histogram of the degree, using 50 bins." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(G_coauthorship.vs['degree'], 50);\n", "plt.yscale('log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This clearly shows that the degree distribution is quite skewed. Most authors have only few collaborators, while a few authors have many collaborators. If the degree distribution is so skewed, it is sometimes referred to as a *scale-free* network, although the exact definition has been a topic of intense discussion recently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below sorts the nodes in descending order of the degree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "highest_degree = sorted(G_coauthorship.vs, key=lambda v: v['degree'], reverse=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sorted` function takes a list as input, `G_coauthorship.vs` in our case, and sorts it according to a sort key. We indicate the sort key by a small function, called a `lambda` function, that returns the degree. In other words, the `sorted` function will sort the nodes according to the degree. 
By indicating `reverse=True` we obtain a list that is sorted highest to lowest, instead of the other way around." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can look at the first five results in the following way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "highest_degree[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, apparently, U D'Allessandro has collaborated with 715 other authors! This of course only considers the number of co-authors; it does not take into account the number of papers written with somebody else.\n", "When specifying such *edge weights* like the number of joint papers, the weighted degree is referred to as the *strength* of a node (a term that can be somewhat confusing). \n", "\n", "Let us look at the strength of node 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship.strength(0, weights='n_joint_papers')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, author 0 collaborated with 3 different authors, and has a total strength of 3. But what does this 3 mean? We need to think carefully about this. Suppose that author 0 has co-authored a single publication with three other authors. Then the edge to each of the three co-authors would have a weight of `n_joint_papers = 1`. So, the strength would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and the number of collaborators per paper.\n", "\n", "Sometimes, we wish to take into account the number of co-authors of a publication when creating a link weight. We can then fractionally count the weight of each collaboration in a publication with $n_a$ authors as\n", "\n", "$$\\frac{1}{n_a - 1}.$$\n", "\n", "We need to go back to the `publications_df` in order to construct such a *fractional* edge weight." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools as itr\n", "[(coauthor[0], coauthor[1], 1/(len(authors) - 1)) for coauthor in itr.combinations(authors, 2)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We again do this for all publications." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coauthors_per_publication = publications_df['AU_split'].apply(\n", " lambda authors: \n", " [(coauthor[0], coauthor[1], 1, 1/(len(authors) - 1)) \n", " for coauthor in itr.combinations(authors, 2)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable `coauthors_per_publication` now contains, for each publication, a list of co-author pairs, each with a full weight of `1` and a fractional weight of `1/(len(authors) - 1)`, where `len(authors)` is the number of authors of the publication. We again `explode` this list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coauthors = coauthors_per_publication.explode().dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can again create the network, but now we can pass two edge attributes, `n_joint_papers` and `n_joint_papers_frac`. We of course also have to simplify the network again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coauthorship = ig.Graph.TupleList(\n", " edges=coauthors.to_list(),\n", " vertex_name_attr='author',\n", " directed=False,\n", " edge_attrs=('n_joint_papers', 'n_joint_papers_frac')\n", " )\n", "G_coauthorship = G_coauthorship.simplify(loops=False, combine_edges='sum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
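The construction above can be traced by hand on a toy example. The sketch below uses plain Python instead of pandas and igraph, and the data in `toy_publications` is hypothetical. Summing the weights of repeated pairs plays the role of the `simplify` step with `combine_edges='sum'`.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: the author lists of two publications.
toy_publications = [
    ['A', 'B', 'C', 'D'],  # four authors: each pair gets fractional weight 1/3
    ['A', 'B'],            # two authors: the single pair gets fractional weight 1
]

full = defaultdict(int)    # plays the role of n_joint_papers
frac = defaultdict(float)  # plays the role of n_joint_papers_frac

for authors in toy_publications:
    for u, v in combinations(authors, 2):
        pair = tuple(sorted((u, v)))
        full[pair] += 1
        frac[pair] += 1 / (len(authors) - 1)

# One line per co-author pair: full count and fractional count.
for pair in sorted(full):
    print(pair, full[pair], round(frac[pair], 3))
```

Authors A and B share both toy publications, so their edge ends up with a full weight of 2 and a fractional weight of 1/3 + 1.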
\n", "What is the sum of n_joint_papers_frac over all co-authors? Then shouldn't the strength sum up to a whole number? Why isn't that the case here? (Hint: look at the authors of publication 'WOS:000242241600004'.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "publications_df.loc['WOS:000242241600004', 'AU']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Betweenness centrality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Betweenness centrality is much more elaborate, and gives an indication of the number of times a node lies on a shortest path between two other nodes.\n", "\n", "As you can imagine, this can take quite some time to calculate for all nodes. We will therefore use the somewhat smaller bibliographic coupling network of journals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Note: On larger networks, it may take a long time to calculate the betweenness centrality.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coupling.vs['betweenness'] = G_coupling.betweenness()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the journals that have the highest betweenness." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sorted(G_coupling.vs, key=lambda v: v['betweenness'], reverse=True)[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coupling.vs['betweenness_weighted'] = G_coupling.betweenness(weights='coupling')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Which journal has the highest weighted betweenness centrality? Does this make sense if you compare it to the unweighted betweenness centrality?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Attention! Weighted shortest paths have the lowest total weight.\n", "
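To see why this matters, here is a minimal sketch in plain Python, independent of igraph. The network `toy_coupling` and the helper `shortest_path` are hypothetical and only serve as illustration. When the coupling strength is used directly as an edge length, the shortest path runs over the weakest links. Using the inverse strength as a length is one possible alternative, shown here purely as an assumption and not as a recommendation.

```python
import heapq

# Hypothetical toy network: (neighbor, coupling strength) pairs per journal.
toy_coupling = {
    'J1': [('J2', 10), ('J3', 1)],
    'J2': [('J1', 10), ('J4', 10)],
    'J3': [('J1', 1), ('J4', 1)],
    'J4': [('J2', 10), ('J3', 1)],
}

def shortest_path(graph, source, target, length):
    '''Dijkstra: return (total length, path), minimising the summed edge lengths.'''
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node]:
            if neighbor not in visited:
                heapq.heappush(queue, (dist + length(weight), neighbor, path + [neighbor]))
    return float('inf'), []

# Coupling strength used directly as a length: the path follows the weak links.
direct = shortest_path(toy_coupling, 'J1', 'J4', lambda w: w)
# Inverse strength used as a length: the path follows the strong links.
inverse = shortest_path(toy_coupling, 'J1', 'J4', lambda w: 1 / w)
print(direct[1])
print(inverse[1])
```

In the toy network the first path goes through J3 and the second through J2, even though J1 and J4 are much more strongly coupled to J2.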
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pagerank" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way of identifying central nodes relies on the idea of a random walk in a network. We will study this in the journal bibliographic coupling network. When performing such a random walk, we simply go from one journal to the next, following the bibliographic coupling links. The journal that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine. Luckily, we can compute that a lot faster than betweenness." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coupling.vs['pagerank'] = G_coupling.pagerank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Get the top 5 most central journals according to Pagerank. Who is the most central? Are the results very different from the betweenness?\n", "
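The random-walk idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the igraph implementation. The network `toy_links`, the helper `pagerank_sketch`, the restart probability implied by `damping=0.85` and the fixed number of iterations are all assumptions. The sketch also assumes that every node has at least one neighbor.

```python
# Hypothetical toy undirected network given as adjacency lists.
toy_links = {
    'J1': ['J2', 'J3', 'J4'],
    'J2': ['J1'],
    'J3': ['J1'],
    'J4': ['J1'],
}

def pagerank_sketch(links, damping=0.85, n_iter=100):
    '''Power iteration for the visit frequencies of a random walk with restarts.'''
    nodes = list(links)
    rank = {node: 1 / len(nodes) for node in nodes}
    for _ in range(n_iter):
        # With probability 1 - damping the walk restarts at a random node.
        new_rank = {node: (1 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Otherwise it moves to one of the neighbors, chosen uniformly.
            share = damping * rank[node] / len(links[node])
            for neighbor in links[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

ranks = pagerank_sketch(toy_links)
print(max(ranks, key=ranks.get))
```

In this star-shaped toy network the walk passes through J1 on every second step, so J1 receives the highest score.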
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can again take into account the weights. In pagerank this means that a journal that is a more closely bibliographically coupled will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_coupling.vs['pagerank_weighted'] = G_coupling.pagerank(weights='coupling')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Are the results different for the weighted version of pagerank?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Pagerank is very similar to the techniques that underlie the journal \"Eigenfactor\" and the \"SCImago Journal Rank\", which are seen as indicators of the scientific impact of a journal. Do you think it makes sense to interpret Pagerank on a bibliographic coupling network as the scientific impact of a journal? Why (not)?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Co-authorship using bipartite projection (optional)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create co-authorship using a more theoretical approach from graph theory. We can first construct a network consisting of publications and authors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first again `explode` all authors for each publication, and create a graph out of it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "author_pubs_df = publications_df['AU_split'].explode()\n", "\n", "G_pub_authors = ig.Graph.TupleList(\n", " edges=author_pubs_df.reset_index().values,\n", " vertex_name_attr='name',\n", " directed=False\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This network consists of two types: publications and authors. This is called a *bipartite* graph. We can automatically get the types using `is_bipartite`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "is_bipartite, types = G_pub_authors.is_bipartite(return_types = True)\n", "print(is_bipartite)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The actual types are simply returned as a list of `True` and `False` values, which are arbitrary labels for publications and authors. Let us see what the first label stands for." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(types[0])\n", "print(G_pub_authors.vs[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the `name` of node `0` we can see that it refers to a publication, and so `False` indicates publications, while `True` indicates authors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now would like to perform a so-called *bipartite projection* onto the authors. 
This is exactly the type of operation that leads to a co-authorship network. If we were to *project* onto the publications, we would end up with a network of publications where two publications are linked if they share at least one author." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_author_projection = G_pub_authors.bipartite_projection(types=types, which=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, it keeps track of the *multiplicity* (i.e. the number of joint papers) in the `weight` edge attribute. Unfortunately, it is not possible to do fractional counting using this approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Check the number of nodes in the bipartite projection. Why is it different from the number of nodes in the G_coauthorship network constructed earlier? (Hint: check out the degree.)\n", "
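What the projection onto the authors computes can be sketched in plain Python. This is only an illustration of the idea, not of how igraph implements it, and the data in `toy_pub_authors` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy bipartite data: each publication maps to its list of authors.
toy_pub_authors = {
    'P1': ['A', 'B', 'C'],
    'P2': ['A', 'B'],
    'P3': ['C', 'D'],
}

# Project onto the authors: two authors are linked once for every publication they share.
multiplicity = Counter()
for authors in toy_pub_authors.values():
    for u, v in combinations(sorted(set(authors)), 2):
        multiplicity[(u, v)] += 1

# One line per projected edge, with the number of joint publications.
for pair, n_joint in sorted(multiplicity.items()):
    print(pair, n_joint)
```

Authors A and B share two toy publications, so their projected edge has a multiplicity of 2, while every other pair has a multiplicity of 1.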
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis of your own data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have now learned the basics of handling WoS files and transforming them into scientometric networks. Please take some time now to do your own analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Go to Web of Science and select a publication set of interest. Make sure that the number of publications is higher than 1000, but lower than 5000. Export the files as follows:\n", "
    \n", "
  1. Export using \"Save to Other File Formats\".\n", "
  2. Select the appropriate records (e.g. 1-500, 501-1000, etc...).\n", "
  3. Select the Record Content \"Full Record and Cited References\".\n", "
  4. Select the File Format \"Tab delimited (Win, UTF8)\".\n", "
  5. Click on Send.\n", "
\n", "Repeat the above steps for each batch of 500 publications.\n", "\n", "Load the data from all files you downloaded using pandas.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Create a co-authorship network of your publications. Hint: use the approach you encountered earlier.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Identify the authors that are most central to the co-authorship network and interpret the results.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Create a co-citation network of your publications. Hint: use the bibliographic coupling approach, but switch the roles of the source and the target.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Identify the publications that are most central to the co-citation network and interpret the results. Are they relatively recent publications or not?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }