# Cluster morphology using NetworkX

This is a short example demonstrating network analysis with DICES data. I'm using NetworkX to build the network models and Pyplot to visualize them.

I'm by no means an expert in network tools. If you have more complex case studies you'd like to share, please get in touch!

In [None]:
# import statements
import pickle
import pandas as pd
import os
from dicesapi import DicesAPI
from dicesapi.jupyter import NotebookPBar
from collections import Counter
from matplotlib import pyplot as plt
import networkx as nx

# initialize connection to database
api = DicesAPI(
    progress_class = NotebookPBar,
    logfile = 'dices.log',
)

### Define what we want to study

In this case, I'd like to organize every conversation in the corpus according to which parties talk to which other parties.

Nodes in my network will be character instances, and edges will be speaker-addressee relationships. I'm not going to consider how many times they speak throughout the conversation, simply whether person A ever speaks to person B.

I'll assign numbers to the participants in the order in which they appear.

The function below produces a dictionary with three components:
- `key`: a shorthand representation of who speaks to whom
- `turns`: a table of all the speeches in the cluster
- `graph`: a networkx graph representing speaker-addressee relationships

In [None]:
def convo_graph(cluster):
    # map each character name to a number, in order of first appearance
    persons = dict()

    def get_id(inst):
        name = inst.name if inst is not None else 'N/A'

        return persons.setdefault(name, len(persons) + 1)

    # one row per speech; source and target hold lists of participant numbers
    turns = pd.DataFrame(dict(
        id = cluster.id,
        source = [get_id(inst) for inst in (s.spkr or [None])],
        target = [get_id(inst) for inst in (s.addr or [None])],
    ) for s in cluster.getSpeeches())

    # one row per individual speaker-addressee pair
    all_edges = turns.explode('source').explode('target')

    # collapse repeated pairs, counting how often each occurs
    flat_with_weights = all_edges.groupby(['source','target']
        ).size(
        ).reset_index(name='weight'
        ).sort_values(['source', 'target'])

    graph = nx.from_pandas_edgelist(flat_with_weights, create_using=nx.DiGraph,
        source='source', target='target')

    key = tuple((e.source, e.target) for i, e in flat_with_weights.iterrows())

    return dict(key=key, graph=graph, turns=turns)
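To see what a key looks like without connecting to the database, here is a minimal pure-Python sketch of the same numbering and flattening logic. The conversation, the participant names, and the `toy_key` helper are all made up for illustration; it uses plain tuples in place of DICES speech objects and skips the pandas and NetworkX steps entirely.

```python
def toy_key(speeches):
    """Build a morphology key from (speakers, addressees) pairs.

    `speeches` is a hypothetical stand-in for cluster.getSpeeches():
    a list of (list_of_speaker_names, list_of_addressee_names) tuples.
    """
    persons = {}

    def get_id(name):
        # number participants in order of first appearance, starting at 1
        return persons.setdefault(name, len(persons) + 1)

    edges = set()
    for speakers, addressees in speeches:
        # as in convo_graph, speakers are numbered before addressees
        src = [get_id(n) for n in (speakers or ['N/A'])]
        tgt = [get_id(n) for n in (addressees or ['N/A'])]
        edges.update((s, t) for s in src for t in tgt)

    # each distinct pair appears once, sorted by source and then target
    return tuple(sorted(edges))

# made-up conversation: A and B trade speeches, then C addresses both
toy = [
    (['A'], ['B']),
    (['B'], ['A']),
    (['A'], ['B']),
    (['C'], ['A', 'B']),
]
print(toy_key(toy))
```

The repeated A-to-B speech contributes only one pair, which is the "ever speaks to" simplification described above.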

### Download all the speech clusters in the *Iliad*

In [None]:
clusters = api.getClusters(work_title='Iliad')
print(len(clusters), 'clusters')

### Test out our model

Let's try building a graph to see what it looks like. I'm starting with item 10 in the list of clusters. Try picking other numbers to compare the results.

In [None]:
cl = clusters[10]
print(cl)

pd.DataFrame(dict(
    cluster = cl.id,
    speech = s.id,
    work = f'{s.author.name} {s.work.title}',
    first = s.l_fi,
    last = s.l_la,
    spkr = s.getSpkrString(),
    addr = s.getAddrString(),
) for s in cl.getSpeeches())

Run our custom function to produce key, turns, and graph as a dict.

In [None]:
bundle = convo_graph(cl)

Let's start with the turns, since that's the easiest for us to interpret. The speeches are still in order, but the names have been replaced by numbers.

In [None]:
bundle['turns']

The key is a flattened form of this table: each distinct speaker-addressee pair appears once, no matter how many turns share it.

In [None]:
bundle['key']

### Build graphs for each cluster

In [None]:
pbar = NotebookPBar(max=len(clusters))
graphs = []

for i, cl in enumerate(clusters):
    graphs.append(convo_graph(cl))
    pbar.update(i)

### Organize the cluster graphs according to key

Here we create two dictionaries. One stores all the graphs under their key, the flat representation of the morphology. The other stores all the turn-taking tables in the same way.

In [None]:
graph_index = {}
turns_index = {}

for graph in graphs:
    k = graph['key']
    g = graph['graph']
    m = graph['turns']

    if k not in graph_index:
        graph_index[k] = []
    graph_index[k].append(g)

    if k not in turns_index:
        turns_index[k] = []
    turns_index[k].append(m)
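The same grouping can be written a little more compactly with `collections.defaultdict`, which creates the empty list on first access. This is only an alternative idiom, sketched here on made-up stand-in records rather than real output from `convo_graph`.

```python
from collections import defaultdict

# made-up stand-ins for the dicts returned by convo_graph()
toy_graphs = [
    dict(key=((1, 2),), graph='g0', turns='t0'),
    dict(key=((1, 2), (2, 1)), graph='g1', turns='t1'),
    dict(key=((1, 2),), graph='g2', turns='t2'),
]

toy_graph_index = defaultdict(list)
toy_turns_index = defaultdict(list)

for rec in toy_graphs:
    # a key seen for the first time starts out as an empty list
    toy_graph_index[rec['key']].append(rec['graph'])
    toy_turns_index[rec['key']].append(rec['turns'])

print(dict(toy_graph_index))
```

One thing to keep in mind with this variant: looking up a key that was never stored silently adds an empty list to a `defaultdict`, whereas a plain dictionary raises a `KeyError`.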

### Count conversations according to key

Make a quick counter of how many graphs are organized under each key, so we can see which morphologies are most common.

In [None]:
key_count = Counter([g['key'] for g in graphs])

In [None]:
key_count.most_common()
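`most_common()` returns a list of `(key, count)` pairs, largest count first, so it is easy to turn the raw counts into shares of the whole. A small sketch on invented keys (not results from the corpus):

```python
from collections import Counter

# invented keys standing in for [g['key'] for g in graphs]
toy_keys = [((1, 2),), ((1, 2), (2, 1)), ((1, 2),), ((1, 2),), ((1, 2), (2, 1))]
toy_count = Counter(toy_keys)

# report each morphology with its count and its share of all conversations
for key, count in toy_count.most_common():
    print(f'{count / len(toy_keys):.0%}', count, key)
```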

### Plot the most common morphologies

We use the counter to take each key in turn, from most common down. For each one we pull the first example graph from `graph_index` and plot it. The final line below also saves a copy of the image.

In [None]:
fig, ax = plt.subplots(3, 4, figsize=(22,12))
plt.subplots_adjust(wspace=1, hspace=.5)

for i, rec in enumerate(key_count.most_common(12)):
    key, count = rec
    # fill the 3x4 grid left to right, top to bottom
    row = i // 4
    col = i % 4

    plt.sca(ax[row, col])
    g = graph_index[key][0]
    nx.draw_spring(g, node_color='pink', width=4, with_labels=True)
    ax[row, col].set_title(f'n={count}', fontsize=18)

plt.savefig('foo.pdf')
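The grid position of each panel comes from an integer division and a remainder, which `divmod` computes in one step. A small sketch, independent of Matplotlib, showing where each of the 12 panels lands in a 3x4 grid (the variable names are illustrative):

```python
n_rows, n_cols = 3, 4

# divmod(i, n_cols) returns (i // n_cols, i % n_cols), i.e. (row, column)
positions = [divmod(i, n_cols) for i in range(n_rows * n_cols)]

for i, (row, col) in enumerate(positions):
    print(f'panel {i:2d} -> row {row}, column {col}')
```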

### Search for speeches by morphology

We can also go the other direction: specify a key and look for examples of it in the corpus by using the indices we built.

#### Define the relationship we're looking for

In [None]:
# person 1 speaks to person 2, and person 3 speaks to person 1
key = ((1, 2), (3, 1))
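A hand-written key only matches if it has exactly the form that `convo_graph` produces: a tuple of `(source, target)` pairs, each pair listed once, sorted by source and then by target. The hypothetical helper below (`make_key` is not part of this notebook's earlier code or of any library) normalizes a loosely written list of pairs into that form.

```python
def make_key(pairs):
    """Normalize (source, target) pairs into a sorted tuple of unique pairs."""
    return tuple(sorted({(int(s), int(t)) for s, t in pairs}))

# order and duplicates in the input do not matter
print(make_key([(3, 1), (1, 2), (1, 2)]))

# extra parentheses around a single number do nothing: (1) is just 1
assert (((1), (2)), ((3), (1))) == make_key([(1, 2), (3, 1)])
```

Remember that participants are numbered in order of first appearance, so the same shape of conversation can yield a different key if the participants enter in a different order.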

#### Visualize it

In [None]:
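# look at the first graph with this morphology
graph = graph_index[key][0]

fig, ax = plt.subplots(figsize=(8,6))
nx.draw(graph, ax=ax, node_color='pink', width=2, with_labels=True)
# n is the number of conversations that share this key
ax.set_title(f'n={len(graph_index[key])}')
fig.savefig('chain.pdf')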
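Because a key is just a sequence of directed edges, it can also be turned back into a graph without going through the indices, which may be handy for inspecting a morphology before checking whether it occurs in the corpus. A minimal sketch using NetworkX's `DiGraph.add_edges_from`; the variable names are illustrative.

```python
import networkx as nx

# the same morphology as the key defined above
chain_key = ((1, 2), (3, 1))

chain = nx.DiGraph()
chain.add_edges_from(chain_key)

# person 1 is addressed by person 3 and speaks to person 2
print(sorted(chain.edges()))
print(dict(chain.in_degree()), dict(chain.out_degree()))
```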

#### List all matching conversations

In [None]:
cl_ids = [turns.loc[0,'id'] for turns in turns_index[key]]

for cl in clusters.filterIDs(cl_ids):
    display(
        pd.DataFrame(dict(
            author = s.author.name,
            work = s.work.title,
            lines = s.l_range,
            speaker = ', '.join([i.name for i in s.spkr]),
            addressee = ', '.join([i.name for i in s.addr]),
        ) for s in cl.getSpeeches())
    )