{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# fastlinkcheck API\n", "> API for fast local and online link checking" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#export\n", "from fastcore.all import *\n", "from html.parser import HTMLParser\n", "from urllib.parse import urlparse,urlunparse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find links in an HTML file" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#export\n", "class _HTMLParseAttrs(HTMLParser):\n", "    def reset(self):\n", "        super().reset()\n", "        self.found = set()\n", "    def handle_starttag(self, tag, attrs):\n", "        # record the first `src` or `href` attribute value of each tag\n", "        a = first(v for k,v in attrs if k in (\"src\",\"href\"))\n", "        if a: self.found.add(a)\n", "    # self-closing tags are handled the same way as opening tags\n", "    handle_startendtag = handle_starttag" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#export\n", "def get_links(fn):\n", "    \"List of all links in file `fn`\"\n", "    h = _HTMLParseAttrs()\n", "    h.feed(Path(fn).read_text())\n", "    return L(h.found)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `get_links` to parse an HTML file for different types of links. 
For example, here are the contents of `./example/test.html`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ] } ], "source": [ "!cat ./example/test.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling `get_links` with the above file path returns a list of links:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#4) ['http://fastlinkcheck.com/test.html','http://www.bing.com','//somecdn.com/doesntexist.html','test.js']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links = get_links('./example/test.html')\n", "links" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "test_eq(set(links), {'test.js',\n", "                     '//somecdn.com/doesntexist.html',\n", "                     'http://www.bing.com','http://fastlinkcheck.com/test.html'})" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _local_url(u, root, host, fname):\n", "    \"Change url `u` to local path if it is a local link\"\n", "    fpath = Path(fname).parent\n", "    islocal=False\n", "    # remove `host` prefix (skipped when `host` is empty, otherwise every http(s) URL would match)\n", "    for o in 'http://','https://','http://www.','https://www.':\n", "        if host and u.startswith(o+host): u,islocal = remove_prefix(u, o+host),True\n", "    # drop the fragment (for local links only the path component `p[2]` is used below)\n", "    p = list(urlparse(u))[:5]+['']\n", "    # local prefix, or no protocol or host\n", "    if islocal or (not p[0] and not p[1]):\n", "        u = p[2]\n", "        if u and u[0]=='/': return (root/u[1:]).resolve()\n", "        else: return (fpath/u).resolve()\n", "    # URLs without a protocol are \"protocol relative\"\n", "    if not p[0]: p[0]='http'\n", "    # mailto etc are not checked\n", "    if p[0] not in ('http','https'): return ''\n", "    return urlunparse(p)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, 
"outputs": [], "source": [ "class LinkMap(dict):\n", " \"\"\"A dict that pretty prints Links and their associated locations.\"\"\"\n", " def __repr__(self):\n", " rstr=''\n", " for k in self:\n", " rstr+=f'Link: {repr(k)}\\n Locations found:\\n'\n", " for p in self[k]:\n", " rstr+=f' - {p}\\n'\n", " rstr+='\\n'\n", " return rstr" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#export\n", "def local_urls(path:Path, host:str):\n", " \"returns a `dict` mapping all HTML files in `path` to a list of locally-resolved links in that file\"\n", " path=Path(path)\n", " fns = L(path.glob('**/*.html'))+L(path.glob('**/*.htm'))\n", " found = [(fn.resolve(),_local_url(link, root=path, host=host, fname=fn))\n", " for fn in fns for link in get_links(fn)]\n", " return LinkMap(groupby(found, 1, 0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keys of the `dict` returned by `local_urls` are links found in HTML files, and the values of this `dict` are a list of paths that those links are found in. \n", "\n", "Furthermore, local links are returned as `Path` objects, whereas external URLs are strings. 
For example, notice how the link:\n", "\n", "```html\n", "\n", "```\n", "\n", "is resolved to a local path because the `host` parameter supplied to `local_urls`, `fastlinkcheck.com`, matches the host in the link's URL:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Link: Path('/Users/hamelsmu/github/fastlinkcheck/example/test.html')\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n", "\n", "Link: 'http://www.bing.com'\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n", "\n", "Link: 'http://somecdn.com/doesntexist.html'\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n", "\n", "Link: Path('/Users/hamelsmu/github/fastlinkcheck/example/test.js')\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = Path('./example')\n", "links = local_urls(path, host='fastlinkcheck.com')\n", "links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding broken links" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def broken_local(links) -> L:\n", "    \"List of items in keys of `links` that are `Path`s that do not exist\"\n", "    return L(o for o in links if isinstance(o,Path) and not o.exists())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `test.js` does not exist in the `example/` directory, `broken_local` returns this path:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#1) [Path('/Users/hamelsmu/github/fastlinkcheck/example/test.js')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "broken_local(links)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], 
"source": [ "assert not all([x.exists() for x in broken_local(links)])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def broken_urls(links):\n", " \"List of items in keys of `links` that are URLs that return a failure status code\"\n", " its = L(links).filter(risinstance(str))\n", " working_urls = parallel(urlcheck, its, n_workers=32, threadpool=True)\n", " return L(o for o,p in zip(its,working_urls) if not p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly the url `http://somecdn.com/doesntexist.html` doesn't exist, which is why it is returned by `broken_urls`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "@call_parse\n", "def fastlinkcheck(path:Param(\"Root directory searched recursively for HTML files\", str),\n", " host:Param(\"Host and path (without protocol) of web server\", str)='',\n", " config_file:Param(\"Location of file with urls to ignore\", str)=None):\n", " if config_file: assert Path(config_file).is_file(), f\"{config_file} is either not a file or doesn't exist.\"\n", " ignore = [] if not config_file else [x.strip() for x in Path(config_file).readlines()]\n", " links = local_urls(path, host=host)\n", " return LinkMap({k:links[k] for k in (broken_urls(links) + broken_local(links)) if str(k) not in ignore})" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Link: 'http://somecdn.com/doesntexist.html'\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n", "\n", "Link: 
Path('/Users/hamelsmu/github/fastlinkcheck/example/test.js')\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fastlinkcheck(path='./example', host='fastlinkcheck.com')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can exclude links from the results by supplying a plain-text file that lists the URLs and paths to ignore, one per line. For example, the file `linkcheck.rc` contains a list of urls I want to ignore:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/hamelsmu/github/fastlinkcheck/example/test.js\r\n", "https://www.google.com" ] } ], "source": [ "! cat linkcheck.rc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case `example/test.js` is filtered out of the results:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Link: 'http://somecdn.com/doesntexist.html'\n", " Locations found:\n", " - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fastlinkcheck(path='./example', host='fastlinkcheck.com', config_file='linkcheck.rc')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "with ExceptionExpected(ex=AssertionError, regex=\"not a file or doesn't exist\"):\n", "    fastlinkcheck(path='./example/', config_file='doesnt_exist')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.8.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }