{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Web Crawler\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I'm interesting in text processing, I decided to write simple web page crawler. I know that there is Python library beautifull soup, but writting parser is interesting itself... OK lets go. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a n ary tree, traversal, and function tu built abstract syntax tree - having it properly done - we are home." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# n - ary tree class\n", "\n", "class NaryTree:\n", " \"\"\"n ary tree, Python list as an array of children\"\"\"\t\n", " def __init__(self, rootObj):\n", " self.key = rootObj\n", " self.parent = None\n", " self.kids = None\n", "\n", " def insert(self, newNode):\n", " if self.kids:\n", " t = NaryTree(newNode)\n", " self.kids.append(t)\n", " t.parent = self\n", " else:\n", " t = NaryTree(newNode)\n", " self.kids = []\n", " self.kids.append(t)\n", " t.parent = self\n", "\n", " def setRootVal(self, obj):\n", " self.key.append(obj)\n", "\n", " def getRootVal(self):\n", " return self.key\n", "\n", " def getParent(self):\n", " return self.parent\n", "\n", " def getNthKid(self, i=-1): # return a child index i, the first from right if i not specified\n", " return self.kids[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tree code is suprisingly simple, simplest than a binary tree class! Nodes are inside a python array (self.kids) - which can be any size we want; function insert starts creating child nodes appending them to array from left to right. Traversal: " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# tree traversal and convert to a python list\n", "\n", "def traverse(tree): # dfs over tree\n", " if tree:\n", " stk = []\n", " stk.append(tree)\n", " while len(stk) > 0:\n", " top = stk.pop()\n", " if top.kids:\n", " for child in top.kids:\n", " stk.append(child)\n", " print(top.getRootVal())\n", " else:\n", " return None\n", "\n", "def nary_tree_tolist(tree, lst): # convert tree to a list\n", " if tree:\n", " stk = []\n", " stk.append(tree)\n", " while len(stk) > 0:\n", " top = stk.pop()\n", " if top.kids:\n", " for child in top.kids:\n", " stk.append(child)\n", " lst.append(top.getRootVal())\n", " return lst\n", " else:\n", " return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Traverse function is similar to DFS in graph, yes a tree is a graph, no problem.;) The second method just transfer tree content to a list. Finally, time to built a syntax tree:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# having supplied delimiters builds a syntax tree\n", "\n", "def buildNaryParseTree(lexp):\n", " \"\"\"build a syntax tree from tokenized list expression, returns n ary tree\"\"\"\n", " tree = NaryTree([])\n", " cur = tree\n", " ctr = 0\n", " ctr1 = 0\n", " for token in lexp:\n", " if token == '

':\n", " cur.insert([])\n", " cur = cur.getNthKid()\n", " ctr += 1\n", " elif token == '':\n", " cur.insert([])\n", " cur = cur.getNthKid()\n", " ctr1 += 1\n", " elif not token in ['<p>', '</p>', '<title>', ''] and (ctr is not 0 or ctr1 is not 0):\n", " cur.setRootVal(token)\n", " elif token == '

':\n", " ctr -= 1\n", " cur = cur.getParent()\n", " elif token == '':\n", " ctr1 -= 1\n", " cur = cur.getParent()\n", " return tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function in standard way parse an expression: seeing opening token - create new node and go to it, seeing closing token - go level up on the tree; in the mean time it writes down whats inside delimiters (cur.setRootVal(token)). \n", "Variables ctr, ctr1 are for controlling writing (to not put the whole page inside a tree - just what's between delimiters). \n", "Of course, more tags must be added, this just for testing if design is correct (I believe yes!). \n", "Generally this function can parse anything, delimiters maybe nested or not (I check lots of expressions). \n", "The worst part now prepare html code for parsing, I think, I just scratch it;/..." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def prepare_expression(exp):\n", " \"\"\"takes a page as a list and make tags visible\"\"\"\n", " del_list = []\n", " l = len(exp)\n", " for i in range(l):\n", " if exp[i] == '<' and exp[i + 1] == 't' and exp[i + 2] == 'i' and exp[i + 3] == 't' and exp[i + 4] == 'l' and \\\n", " exp[i + 5] == 'e' and exp[i + 6] == '>':\n", " exp[i] = ''\n", " exp[i + 1] = []\n", " exp[i + 2] = []\n", " exp[i + 3] = []\n", " exp[i + 4] = []\n", " exp[i + 5] = []\n", " exp[i + 6] = []\n", " if exp[i] == '<' and exp[i + 1] == '/' and exp[i + 2] == 't' and exp[i + 3] == 'i' and exp[i + 4] == 't' and \\\n", " exp[i + 5] == 'l' and exp[i + 6] == 'e' and exp[i + 7] == '>':\n", " exp[i] = ''\n", " exp[i + 1] = []\n", " exp[i + 2] = []\n", " exp[i + 3] = []\n", " exp[i + 4] = []\n", " exp[i + 5] = []\n", " exp[i + 6] = []\n", " exp[i + 7] = []\n", " for t in range(len(exp) - 1, -1, -1):\n", " if isinstance(exp[t], list):\n", " del (exp[t])\n", " return exp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, having supplied a html page as a list of tokens, this function traverses it and bring to the world defined in a loop tags; here is just one \"title\", but there is more on github page. Lots of things to do, same sanity checks of html code, remove unbalanced tags, check tags balance and probaly more - depends on particular page and task. \n", "Finally, there is a recursive crawler code:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# crawler \n", "import urllib.request\n", "from parse_tree import *\n", "import re\n", "url = 'http://www.example.com/'\n", "s = set()\n", "List = []\n", "def f_go(List, s, url, iter_cnt):\n", " try:\n", " if iter_cnt > 200:\n", " return\n", " if url in s:\n", " return\n", " s.add(url)\n", " with urllib.request.urlopen(url) as response:\n", " html = response.read()\n", " print(\"here\", url)\n", " h = html.decode(\"utf-8\")\n", " lst0 = prepare_expression(list(h))\n", " ntr = buildNaryParseTree(lst0)\n", " lst1 = []\n", " lst2 = nary_tree_tolist(ntr, lst1)\n", " List.append(lst2)\n", " f1 = re.finditer(str_regex, h)\n", " l1 = []\n", " for tok in f1:\n", " ind1 = tok.span()\n", " l1.append(h[ind1[0]:ind1[1]])\n", " for exp in l1:\n", " f_go(List, s, exp, iter_cnt + 1)\n", " except:\n", " return" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The function f_go (function go!) 
  { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the recursive crawler code:" ] },
  { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [
    "# crawler\n",
    "import urllib.request\n",
    "from parse_tree import *\n",
    "import re\n",
    "\n",
    "url = 'http://www.example.com/'\n",
    "s = set()\n",
    "List = []\n",
    "\n",
    "def f_go(List, s, url, iter_cnt):\n",
    "    try:\n",
    "        if iter_cnt > 200:   # arbitrary recursion limit\n",
    "            return\n",
    "        if url in s:         # already visited\n",
    "            return\n",
    "        s.add(url)\n",
    "        with urllib.request.urlopen(url) as response:\n",
    "            html = response.read()\n",
    "        print(\"here\", url)\n",
    "        h = html.decode(\"utf-8\")\n",
    "        lst0 = prepare_expression(list(h))\n",
    "        ntr = buildNaryParseTree(lst0)\n",
    "        lst1 = []\n",
    "        lst2 = nary_tree_tolist(ntr, lst1)\n",
    "        List.append(lst2)\n",
    "        f1 = re.finditer(str_regex, h)   # str_regex is defined in the next cell\n",
    "        l1 = []\n",
    "        for tok in f1:\n",
    "            ind1 = tok.span()\n",
    "            l1.append(h[ind1[0]:ind1[1]])\n",
    "        for exp in l1:\n",
    "            f_go(List, s, exp, iter_cnt + 1)\n",
    "    except:\n",
    "        # anything that fails (bad URL, decode error, ...) simply ends this branch\n",
    "        return"
  ] },
  { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [
    "The function f_go (function go!) recursively traverses a set of pages (a single domain together with its subdomains, though it may be tweaked for different goals); it uses a Python set to store URLs (to avoid repetition), and the parsed content of every scraped page goes into the Python list List. \n",
    "How does it work? It takes a URL, returns if the URL was already visited, otherwise adds it to the set and tries to open it; if that succeeds, it parses the content and appends it to List, then, using a regexp, it extracts all URLs in the given domain into a list and, in the last loop, recursively visits them. \n",
    "The weak points: the base case is an arbitrary counter (calls stop once iter_cnt exceeds a limit, 200 this time, which has to be set manually), and choosing which subdomains to visit needs care - too wide a selection may make the function slow, too narrow will certainly lose data... \n",
    "Here are the regexps used:" ] },
  { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [
    "str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-Z-]+)*)?/?'\n",
    "str_regex1 = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?[^\s\"\.]*/?)*'"
  ] },
  { "cell_type": "markdown", "metadata": {}, "source": [ "After all this, an example (a small test, rather):" ] },
  { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
    "here http://www.activeingredients.com/\n",
    "here https://www.activeingredients.com/work\n",
    "here https://www.activeingredients.com/\n",
    "here https://www.activeingredients.com/about\n",
    "here https://www.activeingredients.com/contact\n",
    "here https://www.activeingredients.com/work/lindsay\n",
    "here https://www.activeingredients.com/work/lithium\n",
    "here https://www.activeingredients.com/work/docusign\n",
    "here https://www.activeingredients.com/work/chicken-of-the-sea\n",
    "here https://www.activeingredients.com/work/cline\n",
    "here https://www.activeingredients.com/work/comfort-zone\n",
    "CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s\n",
    "Wall time: 13.5 s\n",
    "2392\n",
    "[[[], ['A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', ' ', 'D', 'i', 'g', 'i', 't', 'a', 'l', ' ', 'E', 'x', 'p', 'e', 'r', 'i', 'e', 'n', 'c', 'e', ' ', 'A', 'g', 'e', 'n', 'c', 'y']], [[], ['W', 'o', 'r', 'k', ' ', '\\x1f', '\\x1f', '\\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', ' ', 'D', 'i', 'g', 'i', 't', 'a', 'l', ' ', 'E', 'x', 'p', 'e', 'r', 'i', 'e', 'n', 'c', 'e', ' ', 'A', 'g', 'e', 'n', 'c', 'y']], [[], ['A', 'b', 'o', 'u', 't', ' ', '\\x1f', '\\x1f', '\\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'o', 'n', 't', 'a', 'c', 't', ' ', '\\x1f', '\\x1f', '\\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['L', 'i', 'n', 'd', 's', 'a', 'y', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['L', 'i', 't', 'h', 'i', 'u', 'm', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['D', 'o', 'c', 'u', 'S', 'i', 'g', 'n', ' ', '\\x1f', '\\x1f', '\\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'h', 'i', 'c', 'k', 'e', 'n', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'S', 'e', 'a', ' ', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'l', 'i', 'n', 'e', ' ', 'C', 'e', 'l', 'l', 'a', 'r', 's', ' ', '\\x1f', '\\x1f', '\\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']], [[], ['C', 'o', 'm', 'f', 'o', 'r', 't', ' ', 'Z', 'o', 'n', 'e', ' ', '\\x1f', '\\x1f', '\\x1f', '&', 'n', 'd', 'a', 's', 'h', ';', ' ', 'A', 'c', 't', 'i', 'v', 'e', ' ', 'I', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's']]]\n"
  ] } ], "source": [
    "url = 'http://www.activeingredients.com/'\n",
    "s = set()\n",
    "List = []\n",
    "\n",
    "def f_go(List, s, url, iter_cnt):\n",
    "    try:\n",
    "        if iter_cnt > 200:\n",
    "            return\n",
    "        if url in s:\n",
    "            return\n",
    "        s.add(url)\n",
    "        with urllib.request.urlopen(url) as response:\n",
    "            html = response.read()\n",
    "        print(\"here\", url)\n",
    "        h = html.decode(\"utf-8\")\n",
    "        lst0 = prepare_expression(list(h))\n",
    "        ntr = buildNaryParseTree(lst0)\n",
    "        lst1 = []\n",
    "        lst2 = nary_tree_tolist(ntr, lst1)\n",
    "        List.append(lst2)\n",
    "        f1 = re.finditer(str_regex, h)\n",
    "        l1 = []\n",
    "        for tok in f1:\n",
    "            ind1 = tok.span()\n",
    "            l1.append(h[ind1[0]:ind1[1]])\n",
    "        for exp in l1:\n",
    "            f_go(List, s, exp, iter_cnt + 1)\n",
    "    except:\n",
    "        return\n",
    "\n",
    "%time f_go(List, s, url, 0)\n",
    "print(len(str(List)))\n",
    "print(str(List))"
  ] },
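  { "cell_type": "markdown", "metadata": {}, "source": [ "Since every entry of List stores a page title as a list of single characters, it is easy to glue them back into readable strings; a small sketch of my own (assuming List is filled as in the cell above):" ] },
  { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [
    "# join the character lists collected in List back into strings;\n",
    "# each entry looks like [[], ['A', 'c', 't', ...]], so skip the empty root value\n",
    "titles = [''.join(chunk) for page in List for chunk in page if chunk]\n",
    "for title in titles:\n",
    "    print(title)"
  ] },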
  { "cell_type": "markdown", "metadata": {}, "source": [ "I've chosen a small page, just to keep things visible, and I've left the \"debugging print\" (the print(\"here\", url) line) in place, to check which domains are reached. It's a simple function, but it can give nice material for data analysis. Having spaces as word delimiters, we can easily come back to the original text, as sketched above, and do whatever we like: collocations, frequencies, statistics, information retrieval. The biggest problem, I think, is HTML itself. Thanks!" ] }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" },
  "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}