{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Web Crawler\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I'm interesting in text processing, I decided to write simple web page crawler. I know that there is Python library beautifull soup, but writting parser is interesting itself... OK lets go. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a n ary tree, traversal, and function tu built abstract syntax tree - having it properly done - we are home." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# n - ary tree class\n", "\n", "class NaryTree:\n", " \"\"\"n ary tree, Python list as an array of children\"\"\"\t\n", " def __init__(self, rootObj):\n", " self.key = rootObj\n", " self.parent = None\n", " self.kids = None\n", "\n", " def insert(self, newNode):\n", " if self.kids:\n", " t = NaryTree(newNode)\n", " self.kids.append(t)\n", " t.parent = self\n", " else:\n", " t = NaryTree(newNode)\n", " self.kids = []\n", " self.kids.append(t)\n", " t.parent = self\n", "\n", " def setRootVal(self, obj):\n", " self.key.append(obj)\n", "\n", " def getRootVal(self):\n", " return self.key\n", "\n", " def getParent(self):\n", " return self.parent\n", "\n", " def getNthKid(self, i=-1): # return a child index i, the first from right if i not specified\n", " return self.kids[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tree code is suprisingly simple, simplest than a binary tree class! Nodes are inside a python array (self.kids) - which can be any size we want; function insert starts creating child nodes appending them to array from left to right. Traversal: " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# tree traversal and convert to a python list\n", "\n", "def traverse(tree): # dfs over tree\n", " if tree:\n", " stk = []\n", " stk.append(tree)\n", " while len(stk) > 0:\n", " top = stk.pop()\n", " if top.kids:\n", " for child in top.kids:\n", " stk.append(child)\n", " print(top.getRootVal())\n", " else:\n", " return None\n", "\n", "def nary_tree_tolist(tree, lst): # convert tree to a list\n", " if tree:\n", " stk = []\n", " stk.append(tree)\n", " while len(stk) > 0:\n", " top = stk.pop()\n", " if top.kids:\n", " for child in top.kids:\n", " stk.append(child)\n", " lst.append(top.getRootVal())\n", " return lst\n", " else:\n", " return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Traverse function is similar to DFS in graph, yes a tree is a graph, no problem.;) The second method just transfer tree content to a list. Finally, time to built a syntax tree:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# having supplied delimiters builds a syntax tree\n", "\n", "def buildNaryParseTree(lexp):\n", " \"\"\"build a syntax tree from tokenized list expression, returns n ary tree\"\"\"\n", " tree = NaryTree([])\n", " cur = tree\n", " ctr = 0\n", " ctr1 = 0\n", " for token in lexp:\n", " if token == '
':\n", " cur.insert([])\n", " cur = cur.getNthKid()\n", " ctr += 1\n", " elif token == '
', '
', '