{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#default_exp page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pagination\n", "\n", "> Parallel and serial pagination" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "from fastcore.utils import *\n", "from fastcore.foundation import *\n", "from ghapi.core import *\n", "\n", "import re\n", "from urllib.parse import parse_qs,urlsplit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Paged operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some GitHub API operations return their results one page at a time. For instance, there are many thousands of [gists](https://docs.github.com/en/free-pro-team@latest/github/writing-on-github/creating-gists), but if we call `list_public` we only see the first 30:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "api = GhApi()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gists = api.gists.list_public()\n", "len(gists)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's because this operation takes two optional parameters, `per_page`, and `page`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "[gists.list_public](https://docs.github.com/v3/gists/#list-public-gists)(since, per_page, page): *List public gists*" ], "text/plain": [ "[gists.list_public](https://docs.github.com/v3/gists/#list-public-gists)(since, per_page, page): *List public gists*" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api.gists.list_public" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a common pattern for `list_*` operations in the GitHub API. One way to get more results is to increase `per_page`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(api.gists.list_public(per_page=100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, `per_page` has a maximum of `100`, so if you want more, you'll have to pass `page=` to get pages beyond the first. An easy way to iterate through all pages is to use `paged`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def paged(oper, *args, per_page=30, max_pages=9999, **kwargs):\n", " \"Convert operation `oper(*args,**kwargs)` into an iterator\"\n", " yield from itertools.takewhile(noop, (oper(*args, per_page=per_page, page=i, **kwargs) for i in range(1,max_pages+1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll demonstrate this using the `repos.list_for_org` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "[repos.list_for_org](https://docs.github.com/v3/repos/#list-organization-repositories)(org, type, sort, direction, per_page, page): *List organization repositories*" ], "text/plain": [ "[repos.list_for_org](https://docs.github.com/v3/repos/#list-organization-repositories)(org, type, sort, direction, per_page, page): *List organization repositories*" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api.repos.list_for_org" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(30, 'fast-image')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "repos = api.repos.list_for_org('fastai')\n", "len(repos),repos[0].name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To convert this operation into a Python iterator, pass the operation itself, along with any arguments (either keyword or positional) to `paged`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "repos = paged(api.repos.list_for_org, 'fastai')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now iterate through `repos` using Python, e.g:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30 fast-image\n", "30 fastforest\n", "30 .github\n", "3 tweetrel\n" ] } ], "source": [ "for page in repos: print(len(page), page[0].name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Link header (RFC 5988)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GitHub tells us how many pages are available using the [link header](https://tools.ietf.org/html/rfc5988). Unfortunately the pypi [LinkHeader](https://pypi.org/project/LinkHeader/) library appears to no longer be maintained, so we've put a refactored version of it here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class _Scanner:\n", " def __init__(self, buf): self.buf,self.match = buf,None\n", " def __getitem__(self, key): return self.match.group(key)\n", " def scan(self, pattern):\n", " self.match = re.compile(pattern).match(self.buf)\n", " if self.match: self.buf = self.buf[self.match.end():]\n", " return self.match\n", "\n", "_QUOTED = r'\"((?:[^\"\\\\]|\\\\.)*)\"'\n", "_TOKEN = r'([^()<>@,;:\\\"\\[\\]?={}\\s]+)'\n", "_RE_COMMA_HREF = r' *,? *< *([^>]*) *> *'\n", "_RE_ATTR = rf'{_TOKEN} *(?:= *({_TOKEN}|{_QUOTED}))? 
*'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _parse_link_hdr(header):\n", " \"Parse an RFC 5988 link header, returning a `list` of `tuple`s of URL and attr `dict`\"\n", " scanner,links = _Scanner(header),[]\n", " while scanner.scan(_RE_COMMA_HREF):\n", " href,attrs = scanner[1],[]\n", " while scanner.scan('; *'):\n", " if scanner.scan(_RE_ATTR):\n", " attr_name, token, quoted = scanner[1], scanner[3], scanner[4]\n", " if quoted is not None: attrs.append([attr_name, quoted.replace(r'\\\"', '\"')])\n", " elif token is not None: attrs.append([attr_name, token])\n", " else: attrs.append([attr_name, None])\n", " links.append((href,dict(attrs)))\n", " if scanner.buf: raise Exception(f\"parse() failed at {scanner.buf!r}\")\n", " return links" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def parse_link_hdr(header):\n", " \"Parse an RFC 5988 link header, returning a `dict` from rels to a `tuple` of URL and attrs `dict`\"\n", " return {a.pop('rel'):(u,a) for u,a in _parse_link_hdr(header)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an example of a link header with just one link:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'foo bar': ('http://example.com', {'type': 'text/html'})}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_link_hdr('; rel=\"foo bar\"; type=text/html')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "links = parse_link_hdr('; rel=\"foo bar\"; type=text/html')\n", "link = links['foo bar']\n", "test_eq(link[0], 'http://example.com')\n", "test_eq(link[1]['type'], 'text/html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test it on the headers we received on our last call to GitHub. You can access the last call's headers in `recv_hdrs':" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'; rel=\"prev\", ; rel=\"last\", ; rel=\"first\"'" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api.recv_hdrs['Link']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what happens when we parse that:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'prev': ('https://api.github.com/organizations/20547620/repos?per_page=30&page=4',\n", " {}),\n", " 'last': ('https://api.github.com/organizations/20547620/repos?per_page=30&page=4',\n", " {}),\n", " 'first': ('https://api.github.com/organizations/20547620/repos?per_page=30&page=1',\n", " {})}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_link_hdr(api.recv_hdrs['Link'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting pages in parallel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than requesting each page one at a time, we can save some time by getting all the pages we need in parallel." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "@patch\n", "def last_page(self:GhApi):\n", " \"Parse RFC 5988 link header from most recent operation, and extract the last page\"\n", " header = self.recv_hdrs.get('Link', '')\n", " last = nested_idx(parse_link_hdr(header), 'last', 0) or ''\n", " qs = parse_qs(urlsplit(last).query)\n", " return int(nested_idx(qs,'page',0) or 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To help us know the number of pages needed, we can use `last_page`, which uses the link header we just looked at to grab the last page from GitHub.\n", "\n", "We will need multiple pages to get all the repos in the `github` organization, even if we get 100 at a time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api.repos.list_for_org('github', per_page=100)\n", "api.last_page()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _call_page(i, oper, args, kwargs, per_page):\n", " return oper(*args, per_page=per_page, page=i, **kwargs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def pages(oper, n_pages, *args, n_workers=None, per_page=100, **kwargs):\n", " \"Get `n_pages` pages from `oper(*args,**kwargs)`\"\n", " return parallel(_call_page, range(1,n_pages+1), oper=oper, per_page=per_page, args=args, kwargs=kwargs,\n", " progress=False, n_workers=ifnone(n_workers,n_pages), threadpool=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pages` by default passes `per_page=100` to the operation.\n", "\n", "Let's look at some examples. To get all the pages for the repos in the `github` organization in parallel, we can use this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "367" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gh_repos = pages(api.repos.list_for_org, api.last_page(), 'github').concat()\n", "len(gh_repos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already know ahead of time the number of pages required, there's no need to call `last_page`. For instance, the GitHub docs specify that we can get at most 3000 gists:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3000" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gists = pages(api.gists.list_public, 30).concat()\n", "len(gists)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GitHub ignores the `per_page` parameter for some API calls, such as listing public events, which it limits to 8 pages of 30 items per page. 
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api.activity.list_public_events()\n", "api.last_page()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "232" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "evts = pages(api.activity.list_public_events, api.last_page(), per_page=30).concat()\n", "len(evts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export -" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Converted 00_core.ipynb.\n", "Converted 01_actions.ipynb.\n", "Converted 02_auth.ipynb.\n", "Converted 03_page.ipynb.\n", "Converted 04_event.ipynb.\n", "Converted 10_cli.ipynb.\n", "Converted 50_fullapi.ipynb.\n", "Converted 80_tutorial_actions.ipynb.\n", "Converted 90_build_lib.ipynb.\n", "Converted Untitled.ipynb.\n", "Converted ghapi demo.ipynb.\n", "Converted index.ipynb.\n" ] } ], "source": [ "#hide\n", "from nbdev.export import notebook2script\n", "notebook2script()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }