{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Arango in Python\n",
    "\n",
    "To work with Arango in Python we simply need to leverage the python-arango package (`pip install python-arango` and `import arango`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll import networkx and matplotlib for some light visualizations\n",
    "import json\n",
    "import networkx as nx\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from arango import ArangoClient"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connecting\n",
    "Like almost every other database we connect to, we need to connect to the server. With Arango, we use the **arango.ArangoClient**.\n",
    "\n",
    "We simply need to provide:\n",
    " - protocol - 'http'\n",
    " - host - uri/ipaddress\n",
    " - port - 8529 by default\n",
    " \n",
    "Once connected we can access a given database by passing in our credentials **client.db()**\n",
    "\n",
    "Finally we can also connect to a graph, by name, by using **db.graph()**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Our client connection\n",
    "client = ArangoClient(hosts='http://18.219.151.47:8529')\n",
    "\n",
    "# Our database connection\n",
    "db = client.db('emse6586', username='root', password='emse6586pass')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AQL Queries\n",
    "Once connected to a database, we can query the database using AQL, simply by executing it.\n",
    "\n",
    "`db.aql.execute(AQL_query)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"FOR tweet IN statuses\n",
    "            LIMIT 10\n",
    "            RETURN tweet\"\"\"\n",
    "results = db.aql.execute(query)\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Arango Cursor\n",
    "Like other connections, we get back curosr objects from our executed queries. We can access these exactly the same way we interact with other cursors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tweets = list(results)\n",
    "print(tweets[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given the dictionary-like structure of the objects, they can be easily loaded into a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame(tweets)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Arango Traversals\n",
    "Given Arango is built around traversing graphs and graph-like queries, pyarango provides a streamlined API. However, to use the simplified API requires a graph to have been defined within Arango.\n",
    "\n",
    "If you recall from the lecture, we have one called **twitter_sphere** which we can hook into. The graph can be initialized by `db.graph({graph_name})`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Our graph connection\n",
    "graph = db.graph('twitter_sphere')\n",
    "\n",
    "results = graph.traverse(max_depth=2, direction='any', start_vertex='users/22203756', vertex_uniqueness='global')\n",
    "print(results.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main difference with this resultset is the format of the datastructure. Given that it is a graph traversal there are three different subelements:\n",
    " - paths\n",
    " - vertices\n",
    " - edges\n",
    " \n",
    "Each of these elements has it's own structure that will mirror those we saw when reviewing graph traversals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'Number of paths: {len(results[\"paths\"])}')\n",
    "print(f'Number of vertices: {len(results[\"vertices\"])}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(json.dumps(results['paths'][4000], indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for edge in results['paths'][4000]['edges']:\n",
    "    print(edge)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Traversals return vertices, paths, and edges (depending on the traversal)\n",
    "\n",
    "----\n",
    "\n",
    "## NetworkX Integration\n",
    "\n",
    "Using networkx and matplotlib, we can actually plot some of these graphs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def populate_from_query(results, G, limit=100):\n",
    "    \"\"\"Given results from a query populate a networkx graph\n",
    "    Args:\n",
    "        results (list/dict) - Results from an AQL graph\n",
    "        G (networkx.Graph) - A networkx graph\n",
    "        limit (int) - Limit to number of nodes/edges to populate\n",
    "    \"\"\"\n",
    "\n",
    "    edge_count = 0\n",
    "    for result in results['paths']:\n",
    "\n",
    "        nodes = result['vertices']\n",
    "        edges = result['edges']\n",
    "\n",
    "        for edge in edges:\n",
    "            if edge_count % 100 == 0:\n",
    "                print(f'{edge_count} of {len(edges)}')\n",
    "            from_user = edge['_from']\n",
    "            to_user = edge['_to']\n",
    "            for node in nodes:\n",
    "                if node['_id'] == from_user:\n",
    "                    if 'screen_name' in node:\n",
    "                        from_node = node['screen_name']\n",
    "                    elif 'status_id' in node:\n",
    "                        from_node = node['status_id']\n",
    "                if node['_id'] == to_user:\n",
    "                    if 'screen_name' in node:\n",
    "                        to_node = node['screen_name']\n",
    "                    elif 'status_id' in node:\n",
    "                        to_node = node['status_id']\n",
    "            G.add_edge(from_node, to_node)\n",
    "            edge_count += 1\n",
    "\n",
    "            if edge_count > limit and limit != -1:\n",
    "                return\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(1, 1, figsize=(16, 14));\n",
    "\n",
    "G = nx.Graph(ax=ax)\n",
    "populate_from_query(results, G, 25)\n",
    "\n",
    "pos = nx.spring_layout(G, k=.01)\n",
    "nx.draw_networkx_nodes(G, pos, node_color='red', alpha=0.7, node_size=500)\n",
    "nx.draw_networkx_edges(G, pos, edge_color='gray', alpha=0.5)\n",
    "nx.draw_networkx_labels(G, pos, font_weight='bold', font_size=12, font_color='black')\n",
    "#nx.draw(G, pos, font_size=16, with_labels=True)\n",
    "for p in pos:  # raise text positions\n",
    "    pos[p][1] += 0.07\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In-Class Work\n",
    "Let's apply this kind of logic to solve a more interesting, and complex, problem.\n",
    "\n",
    "In our twitter data we'll focus on two people, and the relationships that connect them together:\n",
    " - Elon Musk\n",
    " - Tim Cook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing Our Query\n",
    "To start we need to write our query to identify the paths that link our two people together.\n",
    "\n",
    "We'll focus on just friendships and only allow our traversal to search a depth of 2. Meaning that we will only allow one intermediary friend to link our two people together.\n",
    "\n",
    " - start_node (Elon) = users/44196397\n",
    " - end_node (Tim) = tim_cook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Space for the query\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleaning the results\n",
    "Because our data treats friendship as a single direction, we can end up with paths that touch the same vertices (while techincally being a separate path). So let's clean it up to see how many actual intermediate people connect our two users.\n",
    "\n",
    "Identify the unique vertexes for the query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Space to identify unique connections\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting Our Data\n",
    "Now lets plot the data to see if our findings are corroborated by the graphs visualization. We can hook into the previously created **populate_from_query** function, however you will need to modify the results to work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Space for our graph\n",
    "fig, ax = plt.subplots(1, 1, figsize=(14, 8));\n",
    "\n",
    "G = nx.Graph(ax=ax)\n",
    "populate_from_query({'paths': paths}, G, 150)\n",
    "\n",
    "pos = nx.spring_layout(G, k=.3)\n",
    "nx.draw_networkx_nodes(G, pos, node_color='red', alpha=0.7, node_size=500)\n",
    "nx.draw_networkx_edges(G, pos, edge_color='gray', alpha=0.5)\n",
    "nx.draw_networkx_labels(G, pos, font_weight='bold', font_size=12, font_color='black')\n",
    "#nx.draw(G, pos, font_size=16, with_labels=True)\n",
    "for p in pos:  # raise text positions\n",
    "    pos[p][1] += 0.07"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How About in SQL\n",
    "How would we do this same analysis using SQL?\n",
    "\n",
    "Do you think that this would be harder or easier?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Expanding the Concept\n",
    "Looking at only one jump wasn't to complicated when jumping back and forth between SQL and Arango (graphs), but how about expanding to enabling two jumps?\n",
    "\n",
    "Change our AQL to search for the connections between Trump and The Rock with 1 to 2 intermediary friends:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Updated query\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scaling Solutions\n",
    "How does changing AQL compare to what we would have to do to change the SQL query?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Handles one intermediary friend\n",
    "\n",
    "Select * FROM Friends\n",
    "    Where friend_from = \"tim_cook\"\n",
    "    AND friend_to IN (SELECT friend_from FROM friends WHERE friend_to='elonmusk') as inter_1\n",
    "    OR friend_to IN (SELECT friend_from FROM inter_1) as inter_2\n",
    "    OR "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}