{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intake Caching\n", "\n", "This notebook demonstrates how to use and manage caching in Intake to avoid repeated downloads of large data files.\n", "\n", "Let's start with a simple example. First, import intake as normal." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from intake.config import conf\n", "conf['cache_download_progress'] = False # <-- turn off download progress to display download times\n", "\n", "import intake\n", "cat = intake.open_catalog('catalog.yml')\n", "list(cat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sales = cat.sales()\n", "cache = sales.cache[0]\n", "cache.clear_all() # <-- clearing cache to make sure we start from scratch.\n", "sales._urlpath" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the urlpath points to a remote HTTP server. The first time the data source is read, a download is triggered." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df = sales.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's read the data again. Notice that the read is faster this time thanks to local caching." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df = sales.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See that we do indeed have the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking under the hood at the default cache directory, notice that the files now exist locally in a hashed subdirectory." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%ls -la ~/.intake/cache/975358c19433bc3c5eae68abbde7f2ca" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These subdirectories are named by hashing the data source driver, urlpath, and cache regex to avoid collisions among data sources and cache specifications. We can call the `_hash` method directly to find out the subdirectory name for a given `urlpath`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache._hash(sales._urlpath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspecting the metadata shows the created timestamp, original path, and cached path." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.get_metadata(sales._urlpath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are not sure where the cache directory is located, the data source can report it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sales.cache_dirs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cache can be cleared for an individual source." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.clear_cache(sales._urlpath)\n", "cache.get_metadata(sales._urlpath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After clearing the cache, the files are removed from the cache directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%ls -la ~/.intake/cache" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the data source is read again, the file is downloaded again." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df = sales.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%ls -la ~/.intake/cache/975358c19433bc3c5eae68abbde7f2ca" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cache object?\n", "\n", "Let's take a quick look at the cache object. This object provides utilities for managing cached data files. When a request for data is made, this object checks whether data for the urlpath specified in the source exists on local disk in the cache directory. If so, it returns a reference to the local file path rather than the remote path. If the file(s) do not exist, it downloads them, updates the metadata, and returns a local reference.\n", "\n", "Below are a few methods that Intake users should be familiar with." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.get_metadata?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.clear_cache?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.clear_all?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cache directory is configurable\n", "\n", "The config and cache metadata are stored in ``~/.intake``. By default, the cache directory is located at ``~/.intake/cache``; however, it can be set to a different location in the config file, through an environment variable, or at runtime. Here it is set at runtime." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from intake.config import conf\n", "conf['cache_download_progress'] = True # <-- turn progress bars back on (default)\n", "\n", "cache.clear_all()\n", "\n", "import os.path\n", "\n", "cat = intake.open_catalog('catalog.yml')\n", "sales = cat.sales()\n", "sales.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))\n", "cache = sales.cache[0]\n", "sales.cache_dirs # <-- last expression in the cell, so the new location is displayed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = sales.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.get_metadata(sales._urlpath)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.clear_all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cache directory can also be set in the Intake config. This is equivalent to setting the ``INTAKE_CACHE_DIR`` environment variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from intake.config import conf, defaults\n", "import os.path\n", "\n", "conf['cache_dir'] = defaults['cache_dir']\n", "cat = intake.open_catalog('catalog.yml')\n", "sales = cat.sales()\n", "sales.cache_dirs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Disable Caching\n", "\n", "Caching can be disabled globally through ``intake.config``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from intake.config import conf\n", "conf['cache_disabled'] = True\n", "\n", "cat = intake.open_catalog('catalog.yml')\n", "sales = cat.sales()\n", "cache = sales.cache[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the read times are consistently longer." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df = sales.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df = sales.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, the cache directory and metadata are empty." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sales.cache_dirs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%ls -la ~/.intake/cache" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cache.get_metadata(sales._urlpath)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }