{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#default_exp linkcheck" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# fastlinkcheck API\n", "> API for fast local and online link checking" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "from fastcore.all import *\n", "from html.parser import HTMLParser\n", "from urllib.parse import urlparse,urlunparse\n", "from fastcore.script import SCRIPT_INFO\n", "import sys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find links in an HTML file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class _HTMLParseAttrs(HTMLParser):\n", " def reset(self):\n", " super().reset()\n", " self.found = set()\n", " def handle_starttag(self, tag, attrs):\n", " a = first(v for k,v in attrs if k in (\"src\",\"href\"))\n", " if a: self.found.add(a)\n", " handle_startendtag = handle_starttag" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def get_links(fn):\n", " \"List of all links in file `fn`\"\n", " h = _HTMLParseAttrs()\n", " h.feed(Path(fn).read_text())\n", " return L(h.found)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `get_links` to parse an HTML file for different types of links. For example, this is the contents of `./example/test.html`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "example = Path('_example/test.html')\n", "print(example.read_text())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling `get_links` with the above file path will return a list of links:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "links = get_links(example)\n", "test_eq(set(links), {'test.js',\n", " '//somecdn.com/doesntexist.html',\n", " 'http://www.bing.com','http://fastlinkcheck.com/test.html'})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _local_url(u, root, host, fname):\n", " \"Change url `u` to local path if it is a local link\"\n", " fpath = Path(fname).parent\n", " islocal=False\n", " # remove `host` prefix\n", " for o in 'http://','https://','http://www.','https://www.':\n", " if u.startswith(o+host): u,islocal = remove_prefix(u, o+host),True\n", " # remove params, querystring, and fragment\n", " p = list(urlparse(u))[:5]+['']\n", " # local prefix, or no protocol or host\n", " if islocal or (not p[0] and not p[1]):\n", " u = p[2]\n", " if u and u[0]=='/': return (root/u[1:]).resolve()\n", " else: return (fpath/u).resolve()\n", " # URLs without a protocol are \"protocol relative\"\n", " if not p[0]: p[0]='http'\n", " # mailto etc are not checked\n", " if p[0] not in ('http','https'): return ''\n", " return urlunparse(p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class _LinkMap(dict):\n", " \"\"\"A dict that pretty prints Links and their associated locations.\"\"\"\n", " def _repr_locs(self, k): return '\\n'.join(f' - `{p}`' for p in self[k])\n", " def __repr__(self):\n", " rstr = L(f'- {k!r} was found in the following pages:\\n{self._repr_locs(k)}' for k in self).concat()\n", " return '\\n'.join(rstr)\n", " _repr_markdown_ = __repr__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def local_urls(path:Path, host:str):\n", " \"returns a `dict` mapping all HTML files in `path` to a list of locally-resolved links in that file\"\n", " path=Path(path)\n", " fns = L(path.glob('**/*.html'))+L(path.glob('**/*.htm'))\n", " found = [(fn.resolve(),_local_url(link, root=path, host=host, fname=fn))\n", " for fn in fns for link in get_links(fn)]\n", " return _LinkMap(groupby(found, 1, 0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keys of the `dict` returned by `local_urls` are links found in HTML files, and the values of this `dict` are a list of paths that those links are found in. \n", "\n", "Furthermore, local links are returned as `Path` objects, whereas external URLs are strings. For example, notice how the link:\n", "\n", "\n", "`http://fastlinkcheck.com/test.html`\n", "\n", "\n", "is resolved to a local path, because the `host` parameter supplied to `local_urls`, `fastlinkcheck.com` matches the url in the link: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.html') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- 'http://www.bing.com' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ], "text/plain": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.html') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- 'http://www.bing.com' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = Path('./_example')\n", "links = local_urls(path, host='fastlinkcheck.com')\n", "links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding broken links" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def broken_local(links, ignore_paths=None):\n", " \"List of items in keys of `links` that are `Path`s that do not exist\"\n", " ignore_paths = setify(ignore_paths)\n", " return L(o for o in links if isinstance(o,Path) and o not in ignore_paths and not o.exists())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `test.js` does not exist in the `example/` directory, `broken_local` returns this path:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#1) [Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "broken_local(links)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert not all([x.exists() for x in broken_local(links)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def broken_urls(links, ignore_urls=None):\n", " \"List of items in keys of `links` that are URLs that return a failure status code\"\n", " ignore_urls = setify(ignore_urls)\n", " its = L(o for o in links if isinstance(o, str) and o not in ignore_urls)\n", " working_urls = parallel(urlcheck, its, n_workers=32, threadpool=True)\n", " return L(o for o,p in zip(its,working_urls) if not p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly the url `http://somecdn.com/doesntexist.html` doesn't exist, which is why it is returned by `broken_urls`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "@call_parse\n", "def link_check(path:Param(\"Root directory searched recursively for HTML files\", str),\n", " host:Param(\"Host and path (without protocol) of web server\", str)='',\n", " config_file:Param(\"Location of file with urls to ignore\",str)=None,\n", " actions_output:Param(\"Toggle GitHub Actions output on/off\",store_true)=False,\n", " exit_on_found:Param(\"(CLI Only) Exit with status code 1 if broken links are found\", store_true)=False,\n", " print_logs:Param(\"Toggle printing logs to stdout.\", store_true)=False):\n", " \"\"\"Check for broken links recursively in `path`.\"\"\"\n", " path = Path(path)\n", " is_cli = (SCRIPT_INFO.func == 'link_check')\n", " assert path.exists(), f\"{path.absolute()} does not exist.\"\n", " if config_file: assert Path(config_file).is_file(), f\"{config_file} is either not a file or doesn't exist.\"\n", " ignore = L(x.strip() for x in (Path(config_file).readlines() if config_file else ''))\n", " links = local_urls(path, host=host)\n", " ignore_paths = set((path/o).resolve() for o in ignore if not urlvalid(o))\n", " ignore_urls = set(ignore.filter(urlvalid))\n", " lm = _LinkMap({k:links[k] for k in (broken_urls(links, ignore_urls) + broken_local(links, ignore_paths))})\n", " if actions_output: print(f\"::set-output name=broken_links::{bool(lm)}\")\n", " msg = f'\\nERROR: The Following Broken Links or Paths were found:\\n{lm}' if lm else 'No Broken Links Found!'\n", " if print_logs or is_cli: print(msg)\n", " if is_cli and lm and exit_on_found: sys.exit(1)\n", " else: return lm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ], "text/plain": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "link_check(path='_example', host='fastlinkcheck.com')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ignore links with a configuration file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can choose to ignore files with a a plain-text file containing a list of urls to ignore. For example, the file `linkcheck.rc` contains a list of urls I want to ignore:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test.js\n", "https://www.google.com\n", "\n" ] } ], "source": [ "print((path/'linkcheck.rc').read_text())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case `example/test.js` will be filtered out from the list:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ], "text/plain": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "link_check(path='_example', host='fastlinkcheck.com', config_file='_example/linkcheck.rc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can optionally emit the variable `broken_links` for GitHub Actions [as discussed here](https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-commands-for-github-actions#setting-an-output-parameter). This variable will be either `True` or `False` depending on if broken links are found or not:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "::set-output name=broken_links::True\n" ] }, { "data": { "text/markdown": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ], "text/plain": [ "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/test.html`" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "link_check(path=\"_example\", host=\"fastlinkcheck.com\", actions_output=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`link_check` can also be called use from the command line like this:\n", "\n", "> Note: the `!` command in Jupyter allows you [run shell commands](https://stackoverflow.com/questions/38694081/executing-terminal-commands-in-jupyter-notebook/48529220)\n", "\n", "The `-h` or `--help` flag will allow you to see the command line docs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: link_check [-h] [--host HOST] [--config_file CONFIG_FILE]\r\n", " [--actions_output] [--exit_on_found] [--print_logs] [--pdb]\r\n", " [--xtra XTRA]\r\n", " path\r\n", "\r\n", "Check for broken links recursively in `path`.\r\n", "\r\n", "positional arguments:\r\n", " path Root directory searched recursively for HTML files\r\n", "\r\n", "optional arguments:\r\n", " -h, --help show this help message and exit\r\n", " --host HOST Host and path (without protocol) of web server\r\n", " (default: )\r\n", " --config_file CONFIG_FILE\r\n", " Location of file with urls to ignore\r\n", " --actions_output Toggle GitHub Actions output on/off (default: False)\r\n", " --exit_on_found (CLI Only) Exit with status code 1 if broken links are\r\n", " found (default: False)\r\n", " --print_logs Toggle printing logs to stdout. (default: False)\r\n", " --pdb Run in pdb debugger (default: False)\r\n", " --xtra XTRA Parse for additional args (default: '')\r\n" ] } ], "source": [ "!link_check -h" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Converted index.ipynb.\n", "Converted linkcheck.ipynb.\n" ] } ], "source": [ "#hide\n", "from nbdev.export import *\n", "notebook2script()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }