{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Downloading Gist data from the GitHub API" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from toolz.curried import *\n", "from pandas import Series, DataFrame, concat, get_dummies, TimeGrouper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cache the `pandas.read_json` function with `memoize`, because that is how we will download the results. Be aware that redefining `read_json`, for example by re-running the next cell, clears the cache." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "read_json = memoize(__import__('pandas').read_json)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Begin by downloading the GitHub user information." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def get_info(user='tonyfast'):\n", "    # One-row frame indexed by user name, one column per field of the API response.\n", "    return concat({user:\n", "        read_json(f\"\"\"https://api.github.com/users/{user}\"\"\", typ='series')\n", "    }).unstack()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the info we can determine the location and quantity of the user's gists." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_gists(info, max=3):\n", "    # Turn the `{/gist_id}` URL template into a paginated URL formatter.\n", "    gist_page_url = info.loc[\"gists_url\"].format(**{\"/gist_id\": \"?page={}\"}).format\n", "\n", "    return (\n", "        concat(\n", "            [\n", "                read_json(gist_page_url(page))\n", "                for page in range(1, min(max, (info.loc[\"public_gists\"] // 30) + 1))\n", "            ]\n", "        )\n", "        .pipe(\n", "            # Expand each gist's `files` dict into one row per file, then rejoin the gist metadata.\n", "            lambda df: df[\"files\"]\n", "            .apply(compose(Series, list, dict.values))\n", "            .stack()\n", "            .apply(Series)\n", "            .reset_index(-1, drop=True)\n", "            .join(df)\n", "        )\n", "        .pipe(do(cleanse)).pipe(convert_to_feather, f\"{info.name}_gists.feather\")\n", "    )\n", "\n", "\n", "def cleanse(df):\n", "    \"\"\"Drop `files` and `owner` in place: they hold dicts, which feather cannot serialize.\"\"\"\n", "    del df[\"files\"], df[\"owner\"]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Storing the data in [feather](https://github.com/wesm/feather) makes it easier to reuse." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def convert_to_feather(df, dest):\n", "    import feather\n", "    feather.write_dataframe(df, dest)\n", "    return feather.read_dataframe(dest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `main` function" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def main(user=\"tonyfast\", max=3):\n", "    return get_gists(get_info(user).loc[user], max)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Demonstrate the function's use." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | filename | \n", "type | \n", "language | \n", "raw_url | \n", "size | \n", "comments | \n", "comments_url | \n", "commits_url | \n", "created_at | \n", "description | \n", "... | \n", "git_pull_url | \n", "git_push_url | \n", "html_url | \n", "id | \n", "node_id | \n", "public | \n", "truncated | \n", "updated_at | \n", "url | \n", "user | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2554 | \n", "Untitled279.ipynb | \n", "text/plain | \n", "Jupyter Notebook | \n", "https://gist.githubusercontent.com/tonyfast/a4... | \n", "1928 | \n", "0 | \n", "https://api.github.com/gists/748a91e1769d24979... | \n", "https://api.github.com/gists/748a91e1769d24979... | \n", "2015-12-29 13:51:19 | \n", "Underlay an svg element as a <div> background | \n", "... | \n", "https://gist.github.com/748a91e1769d24979393.git | \n", "https://gist.github.com/748a91e1769d24979393.git | \n", "https://gist.github.com/748a91e1769d24979393 | \n", "748a91e1769d24979393 | \n", "MDQ6R2lzdDc0OGE5MWUxNzY5ZDI0OTc5Mzkz | \n", "True | \n", "False | \n", "2015-12-29 13:51:19 | \n", "https://api.github.com/gists/748a91e1769d24979393 | \n", "NaN | \n", "
8969 | \n", "Untitled144.ipynb | \n", "text/plain | \n", "Jupyter Notebook | \n", "https://gist.githubusercontent.com/tonyfast/0c... | \n", "3753 | \n", "0 | \n", "https://api.github.com/gists/b0125d860d5ffe74d... | \n", "https://api.github.com/gists/b0125d860d5ffe74d... | \n", "2016-02-03 17:58:58 | \n", "A simple dropdown widget | \n", "... | \n", "https://gist.github.com/b0125d860d5ffe74dbdb.git | \n", "https://gist.github.com/b0125d860d5ffe74dbdb.git | \n", "https://gist.github.com/b0125d860d5ffe74dbdb | \n", "b0125d860d5ffe74dbdb | \n", "MDQ6R2lzdGIwMTI1ZDg2MGQ1ZmZlNzRkYmRi | \n", "True | \n", "False | \n", "2016-02-03 17:58:59 | \n", "https://api.github.com/gists/b0125d860d5ffe74dbdb | \n", "NaN | \n", "
2 rows × 21 columns
\n", "