{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \"Introducing fastlinkcheck\"\n", "\n", "> Say goodbye to broken links on your static sites. Platform independent, fast, and built in Python.\n", "- author: \"Jeremy Howard and Hamel Husain\"\n", "- toc: false\n", "- image: images/copied_from_nb/fastlinkcheck_images/fastlinkcheck.png\n", "- comments: true\n", "- categories: [nbdev, fastlinkcheck]\n", "- permalink: /fastlinkcheck/\n", "- badges: true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![diagram](fastlinkcheck_images/fastlinkcheck.png)\n", "\n", "# Motivation\n", "\n", "Recently, fastai has been hard at work improving and overhauling [nbdev](https://nbdev.fast.ai/), a literate programming environment for Python. A key feature of nbdev is automated generation of documentation from Jupyter notebooks. This documentation system adds many niceties, such as automatically generating the following types of hyperlinks:\n", "\n", " - Links to source code on GitHub.\n", " - Links to both internal and external documentation by introspecting variable names in backticks.\n", "\n", "Because documentation is so easy to create and maintain in nbdev, we find ourselves and others creating much more of it! In addition to automatic hyperlinks, we often include our own links to relevant websites, blogs and videos when documenting code. For example, one of the largest nbdev-generated sites, [docs.fast.ai](https://docs.fast.ai/), has more than 300 external and internal links at the time of this writing. \n", "\n", "\n", "# The Solution\n", "\n", "Due to the continued popularity of [fastai](https://github.com/fastai/fastai) and the growth of new nbdev projects, grooming these links manually became quite tedious. We investigated solutions that could verify links for us automatically, but were not satisfied with any existing solutions. 
These are the features we desired:\n", "\n", "- A platform independent solution that is not tied to a specific static site generator like Jekyll or Hugo.\n", "- Intelligent introspection of external links that are actually internal links. For example, if we are building the site docs.fast.ai, a link to `https://docs.fast.ai/tutorial` should not result in a web request, but rather introspection of the local file system for the presence of `tutorial.html` in the right location.\n", "- Verification of any links to assets like CSS, data, JavaScript or other files. \n", "- Well-organized logs that let us see each broken link or reference to a non-existent path, and the pages these are found in.\n", "- Parallelism to verify links as fast as possible. \n", "- Lightweight, easy to install with minimal dependencies.\n", "\n", "We tried tools such as [linkchecker](https://github.com/wummel/linkchecker) and [pylinkvalidator](https://github.com/bartdag/pylinkvalidator), but these required your site to first be hosted. Since we wanted to check links on a static site, hosting is overhead we wanted to avoid.\n", "\n", "This is what led us to create [fastlinkcheck](https://fastlinkcheck.fast.ai/), which we discuss below.\n", "\n", "> Note: For Ruby users, [htmlproofer](https://github.com/gjtorikian/html-proofer) appears to provide overlapping functionality. We have not tried this library. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A tour of fastlinkcheck\n", "\n", "For this tour we will be referring to the files in the [fastlinkcheck repo](https://github.com/fastai/fastlinkcheck). 
You should clone this repo into the current directory to follow along:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloning into 'fastlinkcheck'...\n", "remote: Enumerating objects: 135, done.\n", "remote: Counting objects: 100% (135/135), done.\n", "remote: Compressing objects: 100% (98/98), done.\n", "remote: Total 608 (delta 69), reused 76 (delta 34), pack-reused 473\n", "Receiving objects: 100% (608/608), 1.12 MiB | 10.47 MiB/s, done.\n", "Resolving deltas: 100% (302/302), done.\n" ] } ], "source": [ "git clone https://github.com/fastai/fastlinkcheck.git\n", "cd fastlinkcheck" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "You can install `fastlinkcheck` with pip:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pip install fastlinkcheck`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After installing `fastlinkcheck`, the CLI command `link_check` is available from the command line. We can see its options with the `--help` flag." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: link_check [-h] [--host HOST] [--config_file CONFIG_FILE] [--pdb]\n", " [--xtra XTRA]\n", " path\n", "\n", "Check for broken links recursively in `path`.\n", "\n", "positional arguments:\n", " path Root directory searched recursively for HTML files\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " --host HOST Host and path (without protocol) of web server\n", " --config_file CONFIG_FILE\n", " Location of file with urls to ignore\n", " --pdb Run in pdb debugger (default: False)\n", " --xtra XTRA Parse for additional args (default: '')\n" ] } ], "source": [ "link_check --help" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the root of the [fastlinkcheck repo](https://github.com/fastai/fastlinkcheck), we can search the directory `_example/broken_links` recursively for broken links like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "ERROR: The Following Broken Links or Paths were found:\n", "\n", "- 'http://fastlinkcheck.com/test.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`\n", "\n", "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`\n", "\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "link_check _example/broken_links " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specifying the `--host` parameter allows you to detect links that are internal by 
identifying links with that host name. External links are verified by making a request to the appropriate website. Internal links, on the other hand, are verified by checking for the presence and content of local files. \n", "\n", "When using the `--host` argument, be careful to pass only the host (and path, if necessary) **without** the protocol. For example, this is how to specify the host if your site's URL is `http://fastlinkcheck.com/test.html`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "ERROR: The Following Broken Links or Paths were found:\n", "\n", "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`\n", "\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "link_check _example/broken_links --host fastlinkcheck.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have one fewer broken link, as there is indeed a file named `test.html` in the root of the path we are searching. 
However, if we add a path to the end of `--host`, such as `fastlinkcheck.com/mysite`, the link would again be listed as broken because `_example/broken_links/mysite/test.html` does not exist:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "ERROR: The Following Broken Links or Paths were found:\n", "\n", "- 'http://fastlinkcheck.com/test.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`\n", "\n", "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`\n", "\n", "- Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js') was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "link_check _example/broken_links --host fastlinkcheck.com/mysite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can ignore links by creating a text file that contains a list of URLs and paths to ignore. For example, the file `_example/broken_links/linkcheck.rc` contains:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test.js\n", "https://www.google.com\n" ] } ], "source": [ "cat _example/broken_links/linkcheck.rc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this file to ignore URLs and paths with the `--config_file` argument. 
This will filter out references to the broken link `/test.js` from our earlier results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "ERROR: The Following Broken Links or Paths were found:\n", "\n", "- 'http://somecdn.com/doesntexist.html' was found in the following pages:\n", " - `/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html`" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "link_check _example/broken_links --host fastlinkcheck.com --config_file _example/broken_links/linkcheck.rc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if there are no broken links, `link_check` will not report any errors. The directory `_example/no_broken_links/` does not contain any HTML files with broken links:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "█\r", "\r", " |--------------------| 0.00% [0/2 00:00<00:00]\r", "\r", " |██████████----------| 50.00% [1/2 00:00<00:00]\r", "\r", " |████████████████████| 100.00% [2/2 00:00<00:00]\r", "\r", " \r", "No broken links found!\n" ] } ], "source": [ "link_check _example/no_broken_links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use these utilities from Python instead of the terminal. Please see [these docs](https://fastlinkcheck.fast.ai/linkcheck.html/) for more information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `link_check` in GitHub Actions\n", "\n", "\n", "The `link_check` CLI utility that is installed with `fastlinkcheck` can be very useful in continuous integration systems like [GitHub Actions](https://github.com/features/actions). 
Here is an example GitHub Actions workflow that uses `link_check`:\n", "\n", "\n", "```yaml\n", "name: Check Links\n", "on: [workflow_dispatch, push]\n", "\n", "jobs:\n", "  check-links:\n", "    runs-on: ubuntu-latest\n", "    steps:\n", "    - uses: actions/checkout@v2\n", "    - uses: actions/setup-python@v2\n", "    - name: check for broken links\n", "      run: |\n", "        pip install fastlinkcheck\n", "        link_check _example\n", "```\n", "\n", "We can add a few more lines of code to open an issue instead when a broken link is found, using the [gh cli](https://github.com/cli/cli):\n", "\n", "```yaml\n", "...\n", "    - name: check for broken links\n", "      run: |\n", "        pip install fastlinkcheck\n", "        link_check _example 2> err || true\n", "        export GITHUB_TOKEN=\"YOUR_TOKEN\"\n", "        [[ -s err ]] && gh issue create -t \"Broken links found\" -b \"$(< err)\" -R \"yourusername/yourrepo\"\n", "```\n", "\n", "We can extend this even further to only open an issue when another issue with a specific label isn't already open:\n", "\n", "```yaml\n", "...\n", "    - name: check for broken links\n", "      run: |\n", "        pip install fastlinkcheck\n", "        link_check \"docs/_site\" --host \"docs.fast.ai\" 2> err || true\n", "        export GITHUB_TOKEN=\"YOUR_TOKEN\"\n", "        if [[ -z $(gh issue list -l \"broken-link\") && -s err ]]; then\n", "          gh issue create -t \"Broken links found\" -b \"$(< err)\" -l \"broken-link\" -R \"yourusername/yourrepo\"\n", "        fi\n", "```\n", "\n", "\n", "See the [GitHub Actions docs](https://docs.github.com/en/free-pro-team@latest/actions) for more information." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Resources\n", "\n", "The following resources are relevant for those interested in learning more about `fastlinkcheck`:\n", "\n", "- The fastlinkcheck [GitHub repo](https://github.com/fastai/fastlinkcheck)\n", "- The fastlinkcheck [docs](https://fastlinkcheck.fast.ai/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }