{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous sessions, you learned how to construct scientometric networks in Python. It was clear that this can be quite challenging. VOSviewer takes care of a lot of the necessary work in creating scientometric networks. You can hence use VOSviewer to create networks, which you could then export and analyse further in Python. We will here take this approach."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## VOSviewer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have previously constructed scientometric networks using VOSviewer. You can import the resulting network for further analysis in `igraph`. In order to import the file in `igraph` you need to have saved both the `map` file and the `network` file in VOSviewer. See the manual of VOSviewer for more explanation. As in the previous Python notebook, we have prepared some files for you, in this case the author collaboration network from the Web of Science files that we analysed previously."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first import the necessary packages. You will presumably recognize these still from the previous Python notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import igraph as ig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us read the map and network file from VOSviewer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Read the file <code>data-files/vosviewer/vosviewer_map.txt</code> using tabs (<code>'\\t'</code>) as a field separator, and call the resulting variable <code>map_df</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The network file from VOSviewer has no header, so we set it manually"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "network_df = pd.read_csv('data-files/vosviewer/vosviewer_network.txt', sep='\\t', header=None,\n",
    "                         names=['idA', 'idB', 'weight'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have loaded the data, so we can simply construct a network as before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_vosviewer = ig.Graph.DictList(\n",
    "      vertices=map_df.to_dict('records'),\n",
    "      edges=network_df.to_dict('records'),\n",
    "      vertex_name_attr='id',\n",
    "      edge_foreign_keys=('idA', 'idB'),\n",
    "      directed=False\n",
    "      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The layout and clustering is also stored by VOSviewer, and we can use that to display the same visualization in `igraph`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "layout = ig.Layout(coords=zip(*[G_vosviewer.vs['x'], G_vosviewer.vs['y']]))\n",
    "clustering = ig.VertexClustering.FromAttribute(G_vosviewer, 'cluster')\n",
    "\n",
    "ig.plot(clustering, layout=layout, vertex_size=4, vertex_frame_width=0, vertex_label=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common phenomenon in many networks is the presence of group structure, where nodes within the same group are densely connected. Such a structure is sometimes called a *modular* structure, and a frequently used measure of group structure is known as *modularity*. You have already encountered this functionality briefly in VOSviewer, which provides clusters. Here we will explore this a bit more in-depth."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we will import a package called `leidenalg` which is the *Leiden algorithm*, which we will use for clustering. It is built on top of `igraph` so that it easily integrates with all the exisiting methods of `igraph`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import leidenalg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us find clusters in the collaboration network from VOSviewer, using the weight of the edges. Because the algorithm is stochastic, it may yield somewhat different results every time you run it. To prevent that from happening, and to always get the same result, we will set the random seed to 0. The result is a `VertexClustering`, which we already briefly encountered when using the clustering results from VOSviewer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will first find clusters using *modularity*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimiser = leidenalg.Optimiser()\n",
    "optimiser.set_rng_seed(0)\n",
    "clusters = leidenalg.ModularityVertexPartition(G_vosviewer, weights='weight')\n",
    "optimiser.optimise_partition(clusters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The length of the `clusters` variable indicates the number of clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(clusters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When accessing `clusters` variable as a list, each element corresponds to the set of nodes in that cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    What are the nodes in cluster <code>30</code>?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hence, node `548`, node `1052`, etc... belong to cluster `30`. Another way to look at the clusters is by looking at the `membership` of `clusters`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    What is the membership of the first 10 nodes?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hence, node `0` belongs to cluster `7`, node `1` belongs to cluster `9`, node `2` belongs to cluster `4`, et cetera."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us take a closer look at the largest cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "H = clusters.giant()\n",
    "print(H.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could again detect clusters using modularity in the largest cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimiser.set_rng_seed(0)\n",
    "subclusters = leidenalg.ModularityVertexPartition(H, weights='weight')\n",
    "optimiser.optimise_partition(subclusters)\n",
    "ig.plot(subclusters, vertex_size=5, vertex_label=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, modularity will continue to find subclusters in this way. An alternative approach, called CPM, does not suffer from that problem. \n",
    "\n",
    "Let us detect clusters using CPM. We do have to specify a parameter, called the `resolution_parameter`. As its name suggests, it specifies the resolution of the clusters we would like to find. At a higher resolution we will tend to find smaller clusters, while at a lower resolution we find larger clusters. Let us use the resolution parameter 0.01."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimiser.set_rng_seed(0)\n",
    "clusters = leidenalg.CPMVertexPartition(G_vosviewer,\n",
    "                                     weights='weight',\n",
    "                                     resolution_parameter=0.1)\n",
    "optimiser.optimise_partition(clusters)\n",
    "clusters.giant().vcount()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Detect subclusters in the largest cluster using CPM, using the same <code>resolution_parameter</code>. How many subclusters do you find? How does that compare to modularity?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Try to find more subclusters by specifying a higher <code>resolution_parameter</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modularity adapts itself to the network. In a sense that is convenient, because you then do not have to specify any parameters. On the other hand, it makes the definition of what a \"cluster\" is less clear.\n",
    "\n",
    "CPM does not adapt itself to the network, and maintains the same defintion across different networks. That is convenient, because it brings more clarity to what we mean by a \"cluster\". Whenever you try to find subclusters using the same `resolution_parameter`, CPM should not find any subclusters. In practice, it may happen that CPM still finds some subclusters, in which case the original clusters were actually not the best possible. The Leiden algorithm can be run for multiple iterations, and with each iteration, the chances are smaller that CPM would find such subclusters. Modularity will always find subclusters, independent of the number of iterations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Try to optimise the partition further. Note that the function <code>optimise_partition</code> returns how much further it managed to improve the function, so that if it returns <code>0.0</code>, it means it couldn't find any further improvement. Execute the cell repeatedly. Does it return 0.0 after some time?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us compare the clusters that we detected in Python with the clustering results from VOSviewer.\n",
    "\n",
    "We can summarize the overall similarity to the partition based on the disciplines using the Normalised Mutual Information (NMI). The NMI varies between 0 and 1 and equals 1 if both are identical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters.compare_to(clustering, method='nmi')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some differences between the clustering from VOSviewer and the clusters we detected in Python. This will of course highly depend on what resolution parameter we have used for both results. One other important difference is that VOSviewer will by default use *normalized* weights. By default, it will divide the weight of a link by the expected weight, assuming that the total link weight of each node would remain the same, which is sometimes referred to as the *association strength*. We also perform this normalization here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_vosviewer.es['weight_normalized'] = [\n",
    "    e['weight']/( G_vosviewer.vs[e.source]['weight<Total link strength>']*G_vosviewer.vs[e.target]['weight<Total link strength>'] / (2*sum(G_vosviewer.es['weight'])) ) \n",
    "    for e in G_vosviewer.es]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default VOSviewer uses the default resolution of `1` for these normalized weights. If we now detect clusters using these weights, you will see that the result are more closely aligned to the VOSviewer results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters = leidenalg.find_partition(G_vosviewer, leidenalg.CPMVertexPartition, \n",
    "                                       weights='weight_normalized', resolution_parameter=1,\n",
    "                                       n_iterations=10)\n",
    "\n",
    "clusters.compare_to(clustering, method='nmi')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, the Leiden algorithm is also directly implemented in `igraph` itself nowadays. It is somewhat less elaborate than the `leidenalg` package, but it is also substantially faster. If you are analysing very large networks, it might be better to use the `igraph` Leiden algorithm. Using it is straightforward."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters = G_vosviewer.community_leiden(objective_function='CPM',weights='weight_normalized', \n",
    "                                        resolution_parameter=1.0, n_iterations=10)\n",
    "\n",
    "clusters.compare_to(clustering, method='nmi')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us explore cluster detection a bit further."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Vary the <code>resolution_parameter</code> when detecting clusters using the CPM method. What <code>resolution_parameter</code> seems reasonable to you, and why?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Try to find a <code>resolution_parameter</code> such that the network separates in two large clusters (and some remaining small clusters). What is the cause of these two large clusters? (Hint: examine the author names)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Compare the co-authorship network that we created previously in Python to the network created in VOSviewer. What are the differences?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Document-term clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now use the same type of clustering technique that we used previously in a slightly different way. Instead of clustering a network, we will cluster a specific type of network, namely a bipartite network. This requires a slightly different (and more complicated) approach. More specifically, we will cluster a document-term network, where documents are linked to terms if those terms appear in a document.\n",
    "\n",
    "We leave the task of extracting terms to VOSviewer, and simply import the resulting document-term network in Python. At the end of the notebook, you will find instructions how to extract the document-term network from VOSviewer yourself.\n",
    "\n",
    "We read two files: (1) the `terms.txt` file, which simply contains the terms and their `id`; and (2) the `doc-term.txt` file, which contains which term occurs in which document. The `document id` refers to the line number of the WoS files that were read by VOSviewer. We will encounter this later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terms_df = pd.read_csv('data-files/vosviewer/terms.txt', sep='\\t', index_col='id')\n",
    "doc_terms_df = pd.read_csv('data-files/vosviewer/doc-term.txt', sep='\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this file, both the documents and the terms are using the same numbers, so that `igraph` cannot distinguish them (e.g. there is both a document `1` and a term `1`). We therefore create separate ids for both the documents and the terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_terms_df['document id'] = doc_terms_df['document id'].map(lambda x: str(x) + '-doc');\n",
    "doc_terms_df['term id'] = doc_terms_df['term id'].map(lambda x: str(x) + '-term');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now create the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_doc_term = ig.Graph.TupleList(\n",
    "      edges=doc_terms_df.values,\n",
    "      vertex_name_attr='id',\n",
    "      directed=False\n",
    "      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a bipartite network, and we create a specific vertex attribute to indicate what the type is of the node: either a `doc` or a `term`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_doc_term.vs['type'] = ['doc' if 'doc' in v['id'] else 'term' for v in G_doc_term.vs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the co-authorship network, VOSviewer typically normalizes the weights in a network by using the association strength, and we will also use that here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_doc_term.es['weight'] = [2.0*G_doc_term.ecount()/(G_doc_term.vs[e.source].degree()*G_doc_term.vs[e.target].degree()) \n",
    "                           for e in G_doc_term.es];"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now employ a small trick in the `leidenalg` package in order to do clustering in a bipartite network. We will not explain the full details here, please see the [documentation](https://leidenalg.readthedocs.io/en/latest/multiplex.html#bipartite) for a brief explanation of this approach. Please note that this approach is *not* possible using the internal `igraph` Leiden algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "partition, partition_docs, partition_terms = leidenalg.CPMVertexPartition.Bipartite(\n",
    "    G_doc_term, types='type', weights='weight', resolution_parameter_01=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to detect clusters, but we are going to use all three partitions we created. We do so by using the function `optimise_partition_multiplex` instead of the `optimise_partition` function that we used previously. We have to pass a list of partitions to that function. For the trick to work, we also need to pass the argument `layer_weights=[1,-1,-1]`, which assumes that the `partition` is the first element of the list that we pass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimiser = leidenalg.Optimiser()\n",
    "optimiser.set_rng_seed(0)\n",
    "optimiser.optimise_partition_multiplex(\n",
    "              [partition, partition_docs, partition_terms],  \n",
    "              layer_weights=[1,-1,-1], n_iterations=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now `partition` contains the clustering results (actually, `partition_docs` and `partition_terms` contain the identical clustering results). We extract the cluster membership of each node, and make it a new node attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_doc_term.vs['cluster'] = partition.membership\n",
    "G_doc_term.vs['degree'] = G_doc_term.degree();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now create a so-called *projection* of the bipartite graph, which actually simply refers to the creation of a co-occurrence network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G_doc_term.vs['type_int'] = [1 if v['type'] == 'term' else 0 for v in G_doc_term.vs];\n",
    "G_terms = G_doc_term.bipartite_projection(types='type_int', which=1);\n",
    "G_terms.simplify(combine_edges='sum');\n",
    "\n",
    "G_terms.vs['id'] = [int(v['id'][:-5]) for v in G_terms.vs];\n",
    "G_terms.vs['term'] = [terms_df.loc[v['id'],'term'] for v in G_terms.vs];"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now `G_terms` contains only terms and the co-occurrence between them. We will export this network to a file format so that we can read it back into VOSviewer. First, let us create the output directory (if necessary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "output_dir = 'results/'\n",
    "if not os.path.exists(output_dir):\n",
    "    os.makedirs(output_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we export the network `G_terms` in file format which is understandable to VOSviewer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes_df = pd.DataFrame.from_dict({attr: G_terms.vs[attr] for attr in G_terms.vs.attributes()});\n",
    "nodes_df['label'] = nodes_df['term'];\n",
    "nodes_df['cluster'] += 1;\n",
    "nodes_df['weight<Occurence>'] = nodes_df['degree'];\n",
    "nodes_df = nodes_df.sort_values('id')\n",
    "nodes_df[['id', 'label', 'cluster', 'weight<Occurence>']].to_csv(output_dir + 'map_vosviewer.txt', sep='\\t', index=False);\n",
    "\n",
    "edge_df = pd.DataFrame([(G_terms.vs[e.source]['id'], G_terms.vs[e.target]['id'], e['weight']) for e in G_terms.es],\n",
    "                       columns=['source', 'target', 'weight']);\n",
    "edge_df = edge_df.sort_values(['source', 'target']);\n",
    "edge_df.to_csv(output_dir + 'network_vosviewer.txt', sep='\\t', index=False, header=False);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The great benefit of doing the clustering in Python is that we now also have a clustering of the publications. This is something that is not possible in VOSviewer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us first load the actual publication files which were used by VOSviewer (we have already done this in the previous notebook). As said, the `document id` refers to the line number of the WoS files that were read by VOSviewer, starting from `1`. We therefore also create a `document id` that is the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "import csv\n",
    "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n",
    "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n",
    "                                        usecols=range(68), index_col='UT') for f in files)\n",
    "publications_df['document id'] = range(1,publications_df.shape[0]+1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us create a dataframe from `G_doc_term` with all the information from the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes_df = pd.DataFrame.from_dict({attr: G_doc_term.vs[attr] for attr in G_doc_term.vs.attributes()});\n",
    "nodes_df = nodes_df[nodes_df['type'] == 'doc'];"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need back the original integer `document id`, instead of the identifiers we created `doc-1`, `doc-2`, etc... We can then use those `document id` to merge back the results with the original information from the publications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes_df['document id'] = nodes_df['id'].str[:-4].astype(int);\n",
    "publications_df = pd.merge(nodes_df[['document id', 'cluster']], publications_df, \n",
    "                           left_on='document id', right_on='document id')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, for further inspection, we may want to export our results to a `.csv` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "publications_df[['AU', 'PY', 'TI', 'SO', 'cluster']].to_csv(output_dir + 'publications_clustering.csv', \n",
    "                                                            index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Own analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Load your own data in VOSviewer and create a co-citation network of journals.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Detect comunities in the journal co-citation network. What do you think the different clusters mean?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Load your own data in VOSviewer and create a term-map. Please take the following steps to create the term-map and extract the <code>terms.csv</code> file and the <code>doc-terms.csv</code> file.\n",
    "    \n",
    "<ol>\n",
    "    <li>Open VOSviewer and press the button \"Create...\".</li>    \n",
    "    <li>Choose \"Create a map based on text data\" and press \"Next\".</li>\n",
    "    <li>Choose \"Read data from bibliographic database files\" and press \"Next\".</li>\n",
    "    <li>Choose the \"Web of Science\" tab and select the files you have downloaded yourself and press \"Next\".</li>\n",
    "    <li>Choose \"Title and abstract fields\" (the default) and press \"Next\".</li>\n",
    "    <li>Choose \"Binary counting\" (the default) and press \"Next\".</li>\n",
    "    <li>Leave the default threshold of 10 and press \"Next\".</li>\n",
    "    <li>Leave the default number of terms to be selected and press \"Next\".</li>\n",
    "</ol>\n",
    "\n",
    "VOSviewer will now calculate the \"relevance\" scores. When it is done, you will be shown a list of terms together with the number of their occurrences and the relevance scores. Please follow the following remaining steps.\n",
    "\n",
    "<ol start=9>\n",
    "    <li>On the list of terms, click-right, and choose \"Export selected terms...\". Choose an appropriate file name (<code>terms.txt</code>) and make sure you choose an appropriate directory and then press \"Export\".</li>\n",
    "    <li>On the list of terms, click-right, and choose \"Export document-term relations...\". Choose an appropriate file name (<code>doc-terms.txt</code>) and make sure you choose an appropriate directory and then press \"Export\".</li>\n",
    "</ol>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Load the <code>terms.csv</code> file and the <code>doc-terms.csv</code> files. Detect the clusters in this bipartite network, as explained above.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Compare the results to the clusters you can detect immediately in VOSviewer itself. Are they similar or not?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Try to identify the main topic for the largest few clusters on the basis of the terms in the term map. Does that match well with the publications in the same cluster? Do you see any discrepancies?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}