--- name: "networkx-graph-analysis" description: "Graph and network analysis toolkit. Four graph types (directed, undirected, multi-edge), centrality, shortest paths, community detection, generators, I/O (GraphML, GML, edge list), matplotlib viz. For large graphs (100K+ nodes) use igraph or graph-tool; for GNNs use PyG." license: BSD-3-Clause --- # NetworkX Graph Analysis ## Overview NetworkX is a Python library for creating, manipulating, and analyzing complex networks and graphs. It provides data structures for undirected, directed, and multi-edge graphs along with a comprehensive collection of graph algorithms, generators, and I/O utilities. Use NetworkX when working with relationship data in social networks, biological interaction networks, transportation systems, citation graphs, or any domain involving pairwise entity relationships. ## When to Use - Analyzing protein-protein interaction networks, gene regulatory networks, or metabolic pathways - Computing centrality measures (degree, betweenness, PageRank) to identify important nodes - Finding shortest paths or optimal routes in transportation or communication networks - Detecting communities or clusters in social networks or co-expression data - Generating synthetic networks (scale-free, small-world, random) for simulation or null models - Reading and writing graph data in standard formats (GraphML, GML, edge lists, JSON) - Visualizing network topology with node/edge attribute mapping - Checking graph properties: connectivity, planarity, isomorphism, DAG structure - For large-scale graphs (100K+ nodes) where speed is critical, use `igraph` or `graph-tool` instead - For billion-edge graphs or GPU-accelerated analytics, use `graph-tool` with OpenMP or `cuGraph` - For graph neural networks and deep learning on graphs, use `torch-geometric-graph-neural-networks` ## Prerequisites - **Python packages**: `networkx`, `matplotlib`, `scipy`, `pandas`, `numpy` - **Optional**: `pydot` or `pygraphviz` (Graphviz layouts) ```bash pip install networkx matplotlib scipy pandas numpy ``` ## Quick Start ```python import networkx as nx # Create a graph and add edges with weights G = nx.karate_club_graph() print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}") # Nodes: 34, Edges: 78 # Compute centrality and find most central node bc = nx.betweenness_centrality(G) top_node = max(bc, key=bc.get) print(f"Most central node: {top_node}, betweenness: {bc[top_node]:.3f}") # Detect communities from networkx.algorithms import community comms = community.greedy_modularity_communities(G) print(f"Communities found: {len(comms)}") ``` ## Core API ### Module 1: Graph Creation and Types ```python import networkx as nx # Undirected graph (most common) G = nx.Graph() G.add_node("protein_A", type="kinase", weight=1.5) G.add_nodes_from(["protein_B", "protein_C"]) G.add_edge("protein_A", "protein_B", weight=0.9, interaction="phosphorylation") G.add_edges_from([("protein_B", "protein_C"), ("protein_A", "protein_C")]) print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}") # Nodes: 3, Edges: 3 # Directed graph (gene regulation, citations) D = nx.DiGraph() D.add_edges_from([("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneA")]) print(f"TF1 out-degree: {D.out_degree('TF1')}") # 2 # MultiGraph (multiple relationship types between same nodes) M = nx.MultiGraph() M.add_edge("A", "B", key="binding", affinity=0.8) M.add_edge("A", "B", key="regulation", effect="inhibition") print(f"Edges between A-B: {M.number_of_edges('A', 'B')}") # 2 ``` ### Module 2: Node and Edge Operations ```python import networkx as nx G = nx.karate_club_graph() # Query structure print(f"Degree of node 0: {G.degree(0)}") print(f"Neighbors of node 0: {list(G.neighbors(0))[:5]}") print(f"Has edge 0-1: {G.has_edge(0, 1)}") # Set and get attributes G.nodes[0]["role"] = "instructor" nx.set_node_attributes(G, {0: "high", 33: "high"}, "importance") G[0][1]["weight"] = 0.95 # Iterate with data for u, v, data in G.edges(data=True): if "weight" in data: print(f" Edge {u}-{v}: weight={data['weight']}") break # Subgraphs (returns read-only view; use .copy() for mutable) H = G.subgraph([0, 1, 2, 3, 4, 5]).copy() print(f"Subgraph: {H.number_of_nodes()} nodes, {H.number_of_edges()} edges") ``` ### Module 3: Graph Analysis (Centrality) ```python import networkx as nx G = nx.karate_club_graph() degree_c = nx.degree_centrality(G) between_c = nx.betweenness_centrality(G, weight="weight") # For large graphs, approximate: nx.betweenness_centrality(G, k=100) close_c = nx.closeness_centrality(G) eigen_c = nx.eigenvector_centrality(G, max_iter=1000) pr = nx.pagerank(G, alpha=0.85) # Compare top nodes across measures for name, metric in [("Degree", degree_c), ("Betweenness", between_c), ("Closeness", close_c), ("PageRank", pr)]: top = max(metric, key=metric.get) print(f"{name:12s}: top node={top}, score={metric[top]:.4f}") ``` ### Module 4: Path and Connectivity ```python import networkx as nx G = nx.karate_club_graph() # Shortest path path = nx.shortest_path(G, source=0, target=33) length = nx.shortest_path_length(G, source=0, target=33) print(f"Shortest path 0->33: {path} (length {length})") print(f"Average shortest path length: {nx.average_shortest_path_length(G):.3f}") # Connected components print(f"Connected: {nx.is_connected(G)}") components = list(nx.connected_components(G)) print(f"Components: {len(components)}, largest: {len(max(components, key=len))}") # For directed graphs: strong/weak connectivity D = nx.DiGraph([(0,1),(1,2),(2,0),(3,4)]) print(f"Strongly connected: {list(nx.strongly_connected_components(D))}") # Connectivity measures print(f"Node connectivity: {nx.node_connectivity(G)}") print(f"Edge connectivity: {nx.edge_connectivity(G)}") ``` ### Module 5: Community Detection Partition networks into densely connected groups. ```python import networkx as nx from networkx.algorithms import community import itertools G = nx.karate_club_graph() # Greedy modularity maximization comms_greedy = community.greedy_modularity_communities(G) mod_score = community.modularity(G, comms_greedy) print(f"Greedy: {len(comms_greedy)} communities, modularity={mod_score:.4f}") # Label propagation (fast, non-deterministic) comms_lpa = community.label_propagation_communities(G) print(f"Label propagation: {len(list(comms_lpa))} communities") # Girvan-Newman (hierarchical, edge betweenness removal) gn = community.girvan_newman(G) # Get first level of partition first_level = next(gn) print(f"Girvan-Newman first split: {len(first_level)} groups") print(f" Sizes: {[len(c) for c in first_level]}") ``` ### Module 6: I/O and Serialization ```python import networkx as nx import pandas as pd import json G = nx.karate_club_graph() # Edge list (simple text format) nx.write_edgelist(G, "karate.edgelist") G_loaded = nx.read_edgelist("karate.edgelist", nodetype=int) # GraphML (preserves all attributes, XML-based) nx.write_graphml(G, "karate.graphml") G_xml = nx.read_graphml("karate.graphml") # JSON (node-link format, web-friendly for d3.js) data = nx.node_link_data(G) with open("karate.json", "w") as f: json.dump(data, f) # Pandas integration df = pd.DataFrame({"source": [1,2,3], "target": [2,3,4], "weight": [0.5,1.0,0.75]}) G_pd = nx.from_pandas_edgelist(df, "source", "target", edge_attr="weight") df_out = nx.to_pandas_edgelist(G_pd) print(f"Pandas round-trip: {len(df_out)} edges") # NumPy/SciPy matrices A = nx.to_numpy_array(G) print(f"Adjacency matrix shape: {A.shape}") A_sparse = nx.to_scipy_sparse_array(G, format="csr") # Memory-efficient ``` ### Module 7: Visualization ```python import networkx as nx import matplotlib.pyplot as plt G = nx.karate_club_graph() pos = nx.spring_layout(G, seed=42) # Color by degree, size by betweenness centrality bc = nx.betweenness_centrality(G) fig, ax = plt.subplots(figsize=(10, 8)) nx.draw(G, pos=pos, ax=ax, node_color=[G.degree(n) for n in G.nodes()], cmap=plt.cm.viridis, node_size=[3000 * bc[n] + 100 for n in G.nodes()], edge_color="gray", alpha=0.8, with_labels=True, font_size=8) plt.tight_layout() plt.savefig("network.png", dpi=300, bbox_inches="tight") plt.savefig("network.pdf", bbox_inches="tight") # Vector format print("Saved network.png and network.pdf") ``` ### Module 8: Generators ```python import networkx as nx # Erdos-Renyi random graph: n nodes, edge probability p G_er = nx.erdos_renyi_graph(n=200, p=0.05, seed=42) print(f"ER: {G_er.number_of_nodes()} nodes, {G_er.number_of_edges()} edges") # Barabasi-Albert scale-free (power-law degree distribution) G_ba = nx.barabasi_albert_graph(n=200, m=3, seed=42) # Watts-Strogatz small-world G_ws = nx.watts_strogatz_graph(n=200, k=6, p=0.1, seed=42) print(f"WS clustering: {nx.average_clustering(G_ws):.3f}") # Stochastic block model (community structure) sizes, probs = [50, 50, 50], [[0.25,0.05,0.02],[0.05,0.35,0.07],[0.02,0.07,0.40]] G_sbm = nx.stochastic_block_model(sizes, probs, seed=42) # Built-in datasets and classic graphs G_karate = nx.karate_club_graph() # Zachary's karate club G_grid = nx.grid_2d_graph(5, 7) # 2D lattice G_tree = nx.random_tree(n=50, seed=42) # Random tree G_geo = nx.random_geometric_graph(n=100, radius=0.2, seed=42) # See references/algorithms_generators.md for full generator catalog ``` ## Key Concepts ### Graph Types | Class | Directed | Multi-edge | Self-loops | Use Case | |-------|----------|------------|------------|----------| | `Graph` | No | No | Yes | Undirected networks: social, PPI | | `DiGraph` | Yes | No | Yes | Gene regulation, citations, web | | `MultiGraph` | No | Yes | Yes | Multiple relationship types | | `MultiDiGraph` | Yes | Yes | Yes | Transportation with routes | ### Attribute Patterns Attributes are stored as dictionaries at graph, node, and edge levels: ```python import networkx as nx G = nx.Graph(name="example") # Graph-level attribute G.add_node(1, label="hub", weight=1.5) # Node attributes G.add_edge(1, 2, weight=0.8, type="ppi") # Edge attributes # Bulk set/get nx.set_node_attributes(G, {1: "red", 2: "blue"}, "color") colors = nx.get_node_attributes(G, "color") # {1: 'red', 2: 'blue'} ``` ### Layout Algorithms | Layout | Function | Best For | |--------|----------|----------| | Spring (force-directed) | `spring_layout(G, seed=42)` | General networks | | Circular | `circular_layout(G)` | Regular graphs, cycles | | Kamada-Kawai | `kamada_kawai_layout(G)` | Small-medium networks | | Spectral | `spectral_layout(G)` | Highlighting clusters | | Shell (concentric) | `shell_layout(G, nlist=[[...],[...]])` | Layered/hierarchical | | Planar | `planar_layout(G)` | Planar graphs only | ## Common Workflows ### Workflow 1: Social Network Analysis **Goal**: Identify influential actors, detect communities, and visualize. ```python import networkx as nx import matplotlib.pyplot as plt from networkx.algorithms import community # Step 1: Load network and basic stats G = nx.karate_club_graph() print(f"Network: {G.number_of_nodes()} actors, {G.number_of_edges()} ties") print(f"Density: {nx.density(G):.4f}, Clustering: {nx.average_clustering(G):.4f}") # Step 2: Identify influential nodes bc = nx.betweenness_centrality(G) top_bc = sorted(bc.items(), key=lambda x: x[1], reverse=True)[:5] print("Top 5 by betweenness:", [(n, f"{s:.3f}") for n, s in top_bc]) # Step 3: Detect communities comms = community.greedy_modularity_communities(G) print(f"Communities: {len(comms)}, modularity: {community.modularity(G, comms):.4f}") # Step 4: Visualize with community coloring pos = nx.spring_layout(G, seed=42) fig, ax = plt.subplots(figsize=(10, 8)) for i, comm in enumerate(comms): nx.draw_networkx_nodes(G, pos, nodelist=list(comm), ax=ax, node_color=[plt.cm.Set2(i)]*len(comm), node_size=400) nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.3) nx.draw_networkx_labels(G, pos, ax=ax, font_size=8) plt.axis("off") plt.tight_layout() plt.savefig("social_network_analysis.png", dpi=300, bbox_inches="tight") print("Saved social_network_analysis.png") ``` ### Workflow 2: Biological Interaction Network **Goal**: Build a PPI network from tabular data, analyze topology, and identify hub proteins. ```python import networkx as nx import pandas as pd # Step 1: Load interaction data from DataFrame interactions = pd.DataFrame({ "protein_a": ["TP53","TP53","BRCA1","BRCA1","MDM2","ATM","ATM","CHEK2","RB1","CDK2"], "protein_b": ["MDM2","BRCA1","ATM","CHEK2","RB1","CHEK2","BRCA2","CDC25A","CDK2","CCNA2"], "score": [0.99, 0.95, 0.92, 0.88, 0.91, 0.97, 0.85, 0.90, 0.87, 0.93] }) G = nx.from_pandas_edgelist(interactions, "protein_a", "protein_b", edge_attr="score") print(f"PPI network: {G.number_of_nodes()} proteins, {G.number_of_edges()} interactions") # Step 2: Network statistics print(f"Connected: {nx.is_connected(G)}") print(f"Diameter: {nx.diameter(G)}") print(f"Avg path length: {nx.average_shortest_path_length(G):.2f}") print(f"Transitivity: {nx.transitivity(G):.4f}") # Step 3: Hub identification (multiple centrality measures) degree_c = nx.degree_centrality(G) between_c = nx.betweenness_centrality(G) close_c = nx.closeness_centrality(G) results = pd.DataFrame({ "protein": list(G.nodes()), "degree_centrality": [degree_c[n] for n in G.nodes()], "betweenness": [between_c[n] for n in G.nodes()], "closeness": [close_c[n] for n in G.nodes()], }).sort_values("betweenness", ascending=False) print("\nHub proteins:") print(results.head(5).to_string(index=False)) # Step 4: Export for downstream analysis nx.write_graphml(G, "ppi_network.graphml") results.to_csv("protein_centrality.csv", index=False) print("Exported ppi_network.graphml and protein_centrality.csv") ``` ## Key Parameters | Parameter | Module | Default | Range / Options | Effect | |-----------|--------|---------|-----------------|--------| | `weight` | Paths/Centrality | `None` | Edge attribute name | Use weighted edges for path/centrality calculations | | `alpha` | `pagerank` | `0.85` | `0.0`-`1.0` | Damping factor; lower = more uniform distribution | | `k` | `betweenness_centrality` | `None` | `int` | Sample k nodes for approximation on large graphs | | `max_iter` | `eigenvector_centrality` | `100` | `int` | Max iterations for convergence | | `seed` | Generators/Layouts | `None` | `int` | Random seed for reproducibility | | `n` / `p` / `m` | ER/BA generators | varies | `int`/`float` | Node count, edge probability, edges per new node | | `k` / `p` | Watts-Strogatz | varies | `int`/`float` | Nearest neighbors, rewiring probability | | `nodetype` | `read_edgelist` | `str` | `int`, `float`, `str` | Type conversion for node identifiers | | `edge_attr` | `from_pandas_edgelist` | `None` | Column name(s) | Edge attribute columns to include from DataFrame | | `format` | `to_scipy_sparse_array` | `"csc"` | `"csr"`, `"csc"`, `"coo"` | Sparse matrix format | ## Best Practices 1. **Always set random seeds** for reproducible generators and layouts: `seed=42` in both `erdos_renyi_graph()` and `spring_layout()`. 2. **Use approximate algorithms for large graphs**: `nx.betweenness_centrality(G, k=500)` samples k nodes instead of all pairs. 3. **Prefer `from_pandas_edgelist`** over manual `add_edge` loops for bulk data loading -- handles attributes cleanly and is faster. 4. **Copy subgraphs before modification**: `G.subgraph(nodes)` returns a read-only view; call `.copy()` for a mutable independent graph. 5. **Use GraphML or GML for persistent storage** to preserve all node/edge attributes. Edge lists lose metadata unless explicitly handled. 6. **Convert graph types explicitly**: `D.to_undirected()` (DiGraph -> Graph), `nx.Graph(M)` (MultiGraph -> Graph, collapses multi-edges). 7. **Use sparse matrices for large adjacency exports**: `to_scipy_sparse_array()` is far more memory-efficient than `to_numpy_array()`. 8. **Anti-pattern -- Don't use `nx.info()`**: Deprecated; use `G.number_of_nodes()`, `G.number_of_edges()`, `nx.density(G)` directly. 9. **Anti-pattern -- Don't assume node ordering**: Algorithms may return results in different orders. Always index by node key, not position. ## Common Recipes ### Recipe: Minimum Spanning Tree Extract the minimum spanning tree and compare to the original graph. ```python import networkx as nx # Create weighted graph G = nx.erdos_renyi_graph(50, 0.15, seed=42) for u, v in G.edges(): G[u][v]["weight"] = round(nx.utils.py_random_state(42).random(), 2) mst = nx.minimum_spanning_tree(G, weight="weight") print(f"Original: {G.number_of_edges()} edges") print(f"MST: {mst.number_of_edges()} edges") total_weight = sum(d["weight"] for _, _, d in mst.edges(data=True)) print(f"MST total weight: {total_weight:.2f}") ``` ### Recipe: Graph Coloring and Cliques Find cliques and compute graph coloring. ```python import networkx as nx G = nx.karate_club_graph() # Find all maximal cliques cliques = list(nx.find_cliques(G)) print(f"Maximal cliques: {len(cliques)}") largest_clique = max(cliques, key=len) print(f"Largest clique size: {len(largest_clique)}, nodes: {largest_clique}") # Greedy graph coloring coloring = nx.greedy_color(G, strategy="largest_first") n_colors = max(coloring.values()) + 1 print(f"Chromatic number (greedy upper bound): {n_colors}") ``` ### Recipe: DAG and Topological Sort Build a directed acyclic graph and find execution order. ```python import networkx as nx # Task dependency DAG D = nx.DiGraph() D.add_edges_from([ ("download_data", "preprocess"), ("download_data", "validate"), ("preprocess", "analyze"), ("validate", "analyze"), ("analyze", "visualize"), ("analyze", "report"), ("visualize", "report"), ]) print(f"Is DAG: {nx.is_directed_acyclic_graph(D)}") order = list(nx.topological_sort(D)) print(f"Execution order: {order}") # Find all paths from start to end paths = list(nx.all_simple_paths(D, "download_data", "report")) print(f"Paths to report: {len(paths)}") for p in paths: print(f" {' -> '.join(p)}") ``` ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `NetworkXError: Graph is not connected` | Algorithm requires connected graph | Extract largest component: `G.subgraph(max(nx.connected_components(G), key=len)).copy()` | | `PowerIterationFailedConvergence` | Eigenvector/PageRank did not converge | Increase `max_iter` (e.g., 1000) or check for disconnected components | | Very slow centrality computation | O(n*m) complexity on large graphs | Use `k` parameter for sampling: `betweenness_centrality(G, k=500)` | | `nx.NetworkXNotImplemented` | Algorithm not available for graph type | Convert graph type: `G.to_undirected()` or `G.to_directed()` | | Memory error on large graphs | Dense adjacency matrix | Use `to_scipy_sparse_array()` instead of `to_numpy_array()` | | Node IDs read as strings from file | `read_edgelist` defaults to `str` | Pass `nodetype=int`: `nx.read_edgelist(f, nodetype=int)` | | Community detection returns frozen sets | Normal return type for communities | Convert: `[list(c) for c in communities]` | | Self-loops in generated graphs | Configuration model allows self-loops | Remove: `G.remove_edges_from(nx.selfloop_edges(G))` | | Visualization too cluttered | Too many nodes/edges | Filter to subgraph, adjust `alpha`, increase figure size, or use interactive tools (Plotly, PyVis) | ## Bundled Resources Migrated from original entry (STUB: 436-line main file + 2,014 lines across 5 reference files, main/total = 17.8%). ### references/algorithms_generators.md Covers: Detailed algorithm parameters for traversal (DFS/BFS), cycles, cliques, graph coloring, isomorphism, matching/covering, tree algorithms (MST variants). Full generator catalog: classic graphs, lattice/grid, tree, bipartite, degree sequence, graph operations (union, compose, complement, products). Relocated inline: Core algorithms (centrality, paths, connectivity, community, flow) -> Core API Modules 3-5. Core generators (ER, BA, WS, SBM) -> Module 8. Omitted: A* heuristic customization, Bellman-Ford negative weights -- consult official docs. **Original file disposition**: - `algorithms.md` (383 lines): Top algorithms relocated to Core API Modules 3-5 + Recipes. Remaining (traversal, cliques, coloring, isomorphism, matching, cycles, trees) -> this reference. - `generators.md` (378 lines): Core generators relocated to Module 8. Full catalog (classic, lattice, tree, bipartite, degree sequence, operators) -> this reference. ### references/io_visualization.md Covers: All I/O formats (adjacency list, GEXF, Pajek, LEDA, Cytoscape JSON, DOT/Graphviz, Matrix Market, CSV, database/SQL, compressed gzip). Format selection guide. Advanced visualization: Plotly interactive, PyVis HTML, Graphviz layouts, 3D networks, bipartite layout, community coloring, subgraph highlighting, multi-panel figures, edge labels, directed arrows. Relocated inline: Core I/O (edge list, GraphML, JSON, pandas, NumPy/SciPy) -> Module 6. Basic matplotlib -> Module 7. Omitted: `write_gpickle`/`read_gpickle` (deprecated), `read_shp`/`write_shp` (removed in NetworkX 3.0; use geopandas). **Original file disposition**: - `io.md` (441 lines): Core formats relocated to Module 6. Remaining formats + format selection guide -> this reference. - `visualization.md` (529 lines): Basic matplotlib relocated to Module 7. Advanced techniques (Plotly, PyVis, 3D, bipartite, community coloring) -> this reference. ### Fully consolidated original file - `graph-basics.md` (283 lines): Fully consolidated into main SKILL.md. Graph types -> Key Concepts. Node/edge operations, attributes, subgraphs -> Core API Modules 1-2. Diagnostics -> Common Workflows. Memory/float-point considerations -> Best Practices + Troubleshooting. Omitted: `nx.info()` (deprecated). ## Related Skills - **torch-geometric-graph-neural-networks** -- graph neural networks (GCN, GAT, GraphSAGE) for node/graph classification and link prediction on graph-structured data - **matplotlib-scientific-plotting** -- advanced figure customization beyond NetworkX's built-in `nx.draw` - **plotly-interactive-visualization** -- interactive network plots with hover, zoom, and pan - **pandas** (planned) -- DataFrame operations for preparing edge/node data before graph construction - **scipy** (planned) -- sparse matrix operations and numerical algorithms used by NetworkX internally ## References - [NetworkX documentation](https://networkx.org/documentation/latest/) -- official docs and API reference - [NetworkX tutorial](https://networkx.org/documentation/latest/tutorial.html) -- official getting started guide - [NetworkX GitHub](https://github.com/networkx/networkx) -- source code and issue tracker - [NetworkX gallery](https://networkx.org/documentation/latest/auto_examples/index.html) -- example gallery with visualizations - Hagberg, A., Schult, D., & Swart, P. (2008). Exploring network structure, dynamics, and function using NetworkX. SciPy 2008.