{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# fastlinkcheck API\n",
"> API for fast local and online link checking"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#export\n",
"from fastcore.all import *\n",
"from html.parser import HTMLParser\n",
"from urllib.parse import urlparse,urlunparse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find links in an HTML file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#export\n",
"class _HTMLParseAttrs(HTMLParser):\n",
"    def reset(self):\n",
"        super().reset()\n",
"        self.found = set()\n",
"    def handle_starttag(self, tag, attrs):\n",
"        a = first(v for k,v in attrs if k in (\"src\",\"href\"))\n",
"        if a: self.found.add(a)\n",
"    handle_startendtag = handle_starttag"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#export\n",
"def get_links(fn):\n",
"    \"List of all links in file `fn`\"\n",
"    h = _HTMLParseAttrs()\n",
"    h.feed(Path(fn).read_text())\n",
"    return L(h.found)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `get_links` to parse an HTML file for different types of links. For example, here are the contents of `./example/test.html`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n"
]
}
],
"source": [
"!cat ./example/test.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `get_links` with the above file path will return a list of links:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(#4) ['http://fastlinkcheck.com/test.html','http://www.bing.com','//somecdn.com/doesntexist.html','test.js']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"links = get_links('./example/test.html')\n",
"links"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"test_eq(set(links), {'test.js',\n",
" '//somecdn.com/doesntexist.html',\n",
" 'http://www.bing.com','http://fastlinkcheck.com/test.html'})"
]
},
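{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the parsing step concrete, here is a minimal sketch that uses only the standard library. `LinkCollector` and the HTML snippet are hypothetical stand-ins, not part of fastlinkcheck; the class mirrors the idea of `_HTMLParseAttrs` by keeping the first `src` or `href` value of each tag in a set, so repeated links are reported once.\n",
"\n",
"```python\n",
"from html.parser import HTMLParser\n",
"\n",
"class LinkCollector(HTMLParser):\n",
"    'Hypothetical stand-in for `_HTMLParseAttrs`, standard library only'\n",
"    def reset(self):\n",
"        # HTMLParser.__init__ calls reset(), so `found` always exists\n",
"        super().reset()\n",
"        self.found = set()\n",
"    def handle_starttag(self, tag, attrs):\n",
"        # keep the first `src` or `href` value on the tag, if it is non-empty\n",
"        for k,v in attrs:\n",
"            if k in ('src','href'):\n",
"                if v: self.found.add(v)\n",
"                break\n",
"    # self-closing tags such as <img .../> are handled the same way\n",
"    handle_startendtag = handle_starttag\n",
"\n",
"# made-up HTML for illustration only\n",
"html = \"<a href='a.html'>A</a><img src='pic.png'/><p>no link</p><a href='a.html'>again</a>\"\n",
"parser = LinkCollector()\n",
"parser.feed(html)\n",
"links = sorted(parser.found)\n",
"```\n",
"\n",
"Because the links are collected in a set, their order is not guaranteed, which is why the sketch sorts them before use."
]
},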
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#export\n",
"def _local_url(u, root, host, fname):\n",
"    \"Change url `u` to local path if it is a local link\"\n",
"    fpath = Path(fname).parent\n",
"    islocal=False\n",
"    # remove `host` prefix (skipped when `host` is empty, otherwise every http(s) URL would match)\n",
"    for o in 'http://','https://','http://www.','https://www.':\n",
"        if host and u.startswith(o+host): u,islocal = remove_prefix(u, o+host),True\n",
"    # drop the fragment; params and query are only dropped below, for local paths\n",
"    p = list(urlparse(u))[:5]+['']\n",
"    # local prefix, or no protocol or host\n",
"    if islocal or (not p[0] and not p[1]):\n",
"        u = p[2]\n",
"        if u and u[0]=='/': return (root/u[1:]).resolve()\n",
"        else: return (fpath/u).resolve()\n",
"    # URLs without a protocol are \"protocol relative\"\n",
"    if not p[0]: p[0]='http'\n",
"    # mailto etc are not checked\n",
"    if p[0] not in ('http','https'): return ''\n",
"    return urlunparse(p)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"class LinkMap(dict):\n",
"    \"\"\"A dict that pretty prints Links and their associated locations.\"\"\"\n",
"    def __repr__(self):\n",
"        rstr=''\n",
"        for k in self:\n",
"            rstr+=f'Link: {repr(k)}\\n Locations found:\\n'\n",
"            for p in self[k]:\n",
"                rstr+=f' - {p}\\n'\n",
"            rstr+='\\n'\n",
"        return rstr"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"#export\n",
"def local_urls(path:Path, host:str):\n",
"    \"Returns a `dict` mapping each locally-resolved link found in the HTML files under `path` to the list of files that contain it\"\n",
"    path=Path(path)\n",
"    fns = L(path.glob('**/*.html'))+L(path.glob('**/*.htm'))\n",
"    found = [(fn.resolve(),_local_url(link, root=path, host=host, fname=fn))\n",
"             for fn in fns for link in get_links(fn)]\n",
"    return LinkMap(groupby(found, 1, 0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The keys of the `dict` returned by `local_urls` are the links found in HTML files, and each value is the list of files in which that link appears. \n",
"\n",
"Furthermore, local links are returned as `Path` objects, whereas external URLs are strings. For example, notice how the link:\n",
"\n",
"```html\n",
"\n",
"```\n",
"\n",
"is resolved to a local path, because the `host` parameter supplied to `local_urls`, `fastlinkcheck.com`, matches the host of the URL in the link: "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Link: Path('/Users/hamelsmu/github/fastlinkcheck/example/test.html')\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n",
"\n",
"Link: 'http://www.bing.com'\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n",
"\n",
"Link: 'http://somecdn.com/doesntexist.html'\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n",
"\n",
"Link: Path('/Users/hamelsmu/github/fastlinkcheck/example/test.js')\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"path = Path('./example')\n",
"links = local_urls(path, host='fastlinkcheck.com')\n",
"links"
]
},
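{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distinction between local links, external URLs, and links that are not checked can be sketched with `urllib.parse.urlparse` alone. The helper `classify` below is hypothetical and only labels each link; it is a simplified illustration of the rules described above, not the library's implementation, and the `mailto:` address is made up.\n",
"\n",
"```python\n",
"from urllib.parse import urlparse\n",
"\n",
"def classify(u, host):\n",
"    'Hypothetical helper: label a link as local, external, or unchecked'\n",
"    # a link that starts with the site's own host counts as local\n",
"    for prefix in ('http://','https://','http://www.','https://www.'):\n",
"        if host and u.startswith(prefix+host): return 'local'\n",
"    p = urlparse(u)\n",
"    # no scheme and no host: a path relative to the site root or the current file\n",
"    if not p.scheme and not p.netloc: return 'local'\n",
"    # a host without a scheme is protocol-relative, e.g. //somecdn.com/x.html\n",
"    if not p.scheme or p.scheme in ('http','https'): return 'external'\n",
"    # mailto: and similar schemes are not checked\n",
"    return 'unchecked'\n",
"\n",
"examples = ['http://fastlinkcheck.com/test.html', 'http://www.bing.com',\n",
"            '//somecdn.com/doesntexist.html', 'test.js', 'mailto:someone@example.com']\n",
"labels = {u: classify(u, 'fastlinkcheck.com') for u in examples}\n",
"```\n",
"\n",
"With `host='fastlinkcheck.com'`, the first and fourth examples are labelled local, the second and third external, and the `mailto:` link unchecked."
]
},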
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding broken links"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def broken_local(links) -> L:\n",
"    \"List of items in keys of `links` that are `Path`s that do not exist\"\n",
"    return L(o for o in links if isinstance(o,Path) and not o.exists())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `test.js` does not exist in the `example/` directory, `broken_local` returns this path:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(#1) [Path('/Users/hamelsmu/github/fastlinkcheck/example/test.js')]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"broken_local(links)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"assert not any(x.exists() for x in broken_local(links))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def broken_urls(links):\n",
"    \"List of items in keys of `links` that are URLs that return a failure status code\"\n",
"    its = L(links).filter(risinstance(str))\n",
"    working_urls = parallel(urlcheck, its, n_workers=32, threadpool=True)\n",
"    return L(o for o,p in zip(its,working_urls) if not p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, the URL `http://somecdn.com/doesntexist.html` doesn't exist, which is why it is returned by `broken_urls`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']"
]
},
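{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pattern behind `broken_urls` (check every URL concurrently, then keep the ones whose check failed) can be sketched with the standard library's `ThreadPoolExecutor`. The helper `find_broken` and the offline `check` function are assumptions made for illustration; a real checker would issue an HTTP request instead.\n",
"\n",
"```python\n",
"from concurrent.futures import ThreadPoolExecutor\n",
"\n",
"def find_broken(urls, check, n_workers=32):\n",
"    'Hypothetical helper: return the items of `urls` for which `check` is falsy'\n",
"    urls = list(urls)\n",
"    # threads suit this job because each check mostly waits on the network\n",
"    with ThreadPoolExecutor(max_workers=n_workers) as ex:\n",
"        ok = list(ex.map(check, urls))  # results come back in input order\n",
"    return [u for u,good in zip(urls, ok) if not good]\n",
"\n",
"# stand-in checker so the sketch runs offline\n",
"known_good = {'http://www.bing.com'}\n",
"broken = find_broken(['http://www.bing.com', 'http://somecdn.com/doesntexist.html'],\n",
"                     check=lambda u: u in known_good)\n",
"```\n",
"\n",
"Since `ex.map` preserves input order, zipping the URLs with the results pairs each URL with its own status."
]
},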
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"@call_parse\n",
"def fastlinkcheck(path:Param(\"Root directory searched recursively for HTML files\", str),\n",
"                  host:Param(\"Host and path (without protocol) of web server\", str)='',\n",
"                  config_file:Param(\"Location of file with urls to ignore\", str)=None):\n",
"    if config_file: assert Path(config_file).is_file(), f\"{config_file} is either not a file or doesn't exist.\"\n",
"    ignore = [] if not config_file else [x.strip() for x in Path(config_file).readlines()]\n",
"    links = local_urls(path, host=host)\n",
"    return LinkMap({k:links[k] for k in (broken_urls(links) + broken_local(links)) if str(k) not in ignore})"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Link: 'http://somecdn.com/doesntexist.html'\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n",
"\n",
"Link: Path('/Users/hamelsmu/github/fastlinkcheck/example/test.js')\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fastlinkcheck(path='./example', host='fastlinkcheck.com')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can ignore specific links by passing a plain-text file that lists one URL or path per line. For example, the file `linkcheck.rc` contains a list of links I want to ignore:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/hamelsmu/github/fastlinkcheck/example/test.js\r\n",
"https://www.google.com"
]
}
],
"source": [
"! cat linkcheck.rc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case `example/test.js` will be filtered out from the list:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Link: 'http://somecdn.com/doesntexist.html'\n",
" Locations found:\n",
" - /Users/hamelsmu/github/fastlinkcheck/example/test.html\n"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fastlinkcheck(path='./example', host='fastlinkcheck.com', config_file='linkcheck.rc')"
]
},
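{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filtering step is an exact string comparison between each broken link and the stripped lines of the ignore file. A minimal sketch, assuming a hypothetical helper `drop_ignored` and made-up paths:\n",
"\n",
"```python\n",
"from pathlib import PurePosixPath\n",
"\n",
"def drop_ignored(broken, ignore_lines):\n",
"    'Hypothetical helper: remove entries whose string form appears in the ignore list'\n",
"    # strip newlines and surrounding whitespace, and skip blank lines\n",
"    ignore = {line.strip() for line in ignore_lines if line.strip()}\n",
"    return [b for b in broken if str(b) not in ignore]\n",
"\n",
"# made-up data: one local path and one external URL\n",
"broken = [PurePosixPath('/site/example/test.js'), 'http://somecdn.com/doesntexist.html']\n",
"kept = drop_ignored(broken, ['/site/example/test.js\\n', 'https://www.google.com\\n'])\n",
"```\n",
"\n",
"Because the match is exact, a local file must be listed by its full resolved path, and a URL must be written exactly as it is reported."
]
},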
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"with ExceptionExpected(ex=AssertionError, regex=\"not a file or doesn't exist\"):\n",
"    fastlinkcheck(path='./example/', config_file='doesnt_exist')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}