{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example of XML caching for pydov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To speed up subsequent queries involving similar data, pydov uses a caching mechanism where raw DOV XML data is cached locally for later reuse. For regular usage of the package and data requests, the cache will be a *convenient* feature speeding up the time for subsequent queries. However, in case you want to alter the configuration or cache handling, this notebook illustrates some use cases on the cache handling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use cases:\n", "* Check cached files\n", "* Speed up subsequent queries\n", "* Disabling the cache\n", "* Changing the location of cached data\n", "* Changing the maximum age of cached data\n", "* Cleaning the cache" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# check pydov path\n", "import warnings; warnings.simplefilter('ignore')\n", "import pydov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use cases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check cached files" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pydov.search.boring import BoringSearch\n", "boring = BoringSearch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pydov.cache.cachedir` defines the directory on the file system used to cache DOV files:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "c:\\users\\rhbav33\\appdata\\local\\temp\\pydov\n", "('directories: ', [])\n" ] } ], "source": [ "# check the cache dir\n", "import os\n", "import pydov.util.caching\n", "cachedir = pydov.cache.cachedir\n", 
"print(cachedir)\n", "print('directories: ', os.listdir(cachedir))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Speed up subsequent queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the benefit of caching for subsequent data requests, run the following request while measuring its execution time:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[000/111] ..................................................\n", "[050/111] ..................................................\n", "[100/111] ...........\n", "Wall time: 38.4 s\n" ] } ], "source": [ "from pydov.util.location import Within, Box\n", "\n", "# Get all borehole data in a bounding box (llx, lly, urx, ury) and time the request\n", "%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('number of files: ', 111)\n", "('files present: ', ['1879-119364.xml.gz', '1879-121292.xml.gz', '1879-121293.xml.gz', '1879-121387.xml.gz', '1879-121401.xml.gz', '1879-121412.xml.gz', '1879-121424.xml.gz', '1879-122256.xml.gz', '1894-121258.xml.gz', '1894-122153.xml.gz', '1894-122154.xml.gz', '1894-122155.xml.gz', '1895-121232.xml.gz', '1895-121241.xml.gz', '1895-121242.xml.gz', '1895-121244.xml.gz', '1895-121247.xml.gz', '1895-121248.xml.gz', '1923-121199.xml.gz', '1923-121200.xml.gz', '1932-121315.xml.gz', '1936-122224.xml.gz', '1938-121359.xml.gz', '1938-121360.xml.gz', '1953-121327.xml.gz', '1953-121361.xml.gz', '1953-121362.xml.gz', '1969-033206.xml.gz', '1969-033207.xml.gz', '1969-033208.xml.gz', '1969-033209.xml.gz', '1969-033211.xml.gz', '1969-033212.xml.gz', '1969-033213.xml.gz', '1969-033214.xml.gz', '1969-033215.xml.gz', '1969-033216.xml.gz', '1969-033217.xml.gz', '1969-033218.xml.gz', '1969-033219.xml.gz', '1969-033220.xml.gz', 
'1969-092685.xml.gz', '1969-092686.xml.gz', '1969-092687.xml.gz', '1969-092688.xml.gz', '1969-092689.xml.gz', '1970-018757.xml.gz', '1970-018762.xml.gz', '1970-018763.xml.gz', '1970-061362.xml.gz', '1970-061363.xml.gz', '1970-061364.xml.gz', '1970-061365.xml.gz', '1970-061366.xml.gz', '1970-061442.xml.gz', '1970-061443.xml.gz', '1970-061444.xml.gz', '1970-061445.xml.gz', '1970-061446.xml.gz', '1970-061447.xml.gz', '1970-061450.xml.gz', '1970-061454.xml.gz', '1970-104897.xml.gz', '1970-104898.xml.gz', '1970-104899.xml.gz', '1970-104900.xml.gz', '1973-018152.xml.gz', '1973-060207.xml.gz', '1973-060208.xml.gz', '1973-081811.xml.gz', '1973-104723.xml.gz', '1973-104727.xml.gz', '1973-104728.xml.gz', '1974-010351.xml.gz', '1975-010345.xml.gz', '1976-014856.xml.gz', '1976-015297.xml.gz', '1976-015298.xml.gz', '1976-015779.xml.gz', '1976-015780.xml.gz', '1976-015781.xml.gz', '1976-015782.xml.gz', '1978-012352.xml.gz', '1978-121458.xml.gz', '1984-081833.xml.gz', '1984-081834.xml.gz', '1985-084552.xml.gz', '1986-005594.xml.gz', '1986-005596.xml.gz', '1986-005597.xml.gz', '1986-005598.xml.gz', '1986-059814.xml.gz', '1986-059815.xml.gz', '1986-059816.xml.gz', '1987-119382.xml.gz', '1996-021717.xml.gz', '1996-081802.xml.gz', '2017-148854.xml.gz', '2017-152011.xml.gz', '2017-153161.xml.gz', '2018-153957.xml.gz', '2018-154057.xml.gz', '2018-155266.xml.gz', '2018-155580.xml.gz', '2018-156632.xml.gz', '2018-156633.xml.gz', '2018-156634.xml.gz', '2018-157193.xml.gz', '2018-157294.xml.gz', '2018-157386.xml.gz', '2019-160294.xml.gz'])\n" ] } ], "source": [ "# The cache directory contains a separate subdirectory for each query type, since permalinks are not unique across types\n", "# This example queries 'boring', so list the XML files in the 'boring' subdirectory of the cache\n", "# list files present\n", "print('number of files: ', len(os.listdir(os.path.join(pydov.cache.cachedir, 'boring'))))\n", "print('files present: ', os.listdir(os.path.join(pydov.cache.cachedir, 
'boring')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rerun the previous request and time it again:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[000/111] cccccccccccccccccccccccccccccccccccccccccccccccccc\n", "[050/111] cccccccccccccccccccccccccccccccccccccccccccccccccc\n", "[100/111] ccccccccccc\n", "Wall time: 980 ms\n" ] } ], "source": [ "%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the cache decreased the runtime by a factor of roughly 40 in the current example (38.4 s versus 980 ms). The time saved grows with the number of permalinks queried, since downloading a file takes much longer than reading it from the local cache." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Disabling the cache" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can (temporarily!) disable the caching mechanism. This disables both saving newly downloaded data in the cache \n", "and reusing existing data in the cache. Caching stays disabled until a cache object is assigned to `pydov.cache` again.\n", "Disabling the cache does not delete existing data in the cache." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('number of files: ', 111)\n" ] } ], "source": [ "# list number of files\n", "print('number of files: ', len(os.listdir(os.path.join(cachedir, 'boring'))))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[000/002] ..\n", " pkey_boring boornummer x \\\n", "0 https://www.dov.vlaanderen.be/data/boring/1895... kb15d43w-B47 151600.0 \n", "1 https://www.dov.vlaanderen.be/data/boring/1984... 
kb15d43w-B403 151041.0 \n", "\n", " y mv_mtaw start_boring_mtaw gemeente diepte_boring_van \\\n", "0 205998.0 15.00 15.00 Antwerpen 0.0 \n", "1 205933.0 21.07 21.07 Antwerpen 0.0 \n", "\n", " diepte_boring_tot datum_aanvang uitvoerder \\\n", "0 3.3 1895-01-04 onbekend \n", "1 7.0 1984-09-26 Universiteit Gent - Geologisch Instituut \n", "\n", " boorgatmeting diepte_methode_van diepte_methode_tot boormethode \n", "0 False 0.0 3.3 onbekend \n", "1 False 0.0 7.0 droge boring \n" ] } ], "source": [ "# disable caching, keeping a reference to the original cache so it can be restored later\n", "cache_orig = pydov.cache\n", "pydov.cache = None\n", "# new query\n", "df = boring.search(location=Within(Box(151000, 205930, 153000, 206000)))\n", "print(df.head())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('number of files: ', 111)\n" ] } ], "source": [ "# list number of files\n", "print('number of files: ', len(os.listdir(os.path.join(cachedir, 'boring'))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence, no new files were added to the cache while it was disabled." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Caching is disabled by setting `pydov.cache` to `None`. To enable caching again, assign a cache object to `pydov.cache`: either the original object saved before disabling, or a new instance." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pydov.cache = cache_orig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the location of cached data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, pydov stores the cache in a temporary directory provided by the user's operating system. 
On Windows, the cache is usually located in: `C:\\Users\\username\\AppData\\Local\\Temp\\pydov\\`\n", "If you want the cached XML files to be saved in another location, you can define your own cache for the current runtime. 
Note that this does not move previously saved data, and pydov no longer looks up data in the old directory after the location has changed.\n", "Besides controlling the cache's location, this also allows using a separate cache for each script or project." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pydov.util.caching\n", "\n", "pydov.cache = pydov.util.caching.GzipTextFileCache(\n", " cachedir=r'C:\\temp\\pydov'\n", " )" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C:\\temp\\pydov\n" ] } ], "source": [ "cachedir = pydov.cache.cachedir\n", "print(cachedir)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# for the sake of the example, change dir location back \n", "pydov.cache = cache_orig\n", "cachedir = pydov.cache.cachedir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the maximum age of cached data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you work with rapidly changing data or want to control when cached data is renewed, you can change the maximum age at which cached data is still considered valid for the current runtime. The maximum age is a `datetime.timedelta`, so you can express it in weeks, days, or any other unit that `timedelta` accepts.\n", "If a cached version exists and is younger than the maximum age, it is used instead of renewing the data from DOV services. If no cached version exists, or the cached version is older than the maximum age, the data is renewed and saved in the cache.\n", "Note that data older than the maximum age is not automatically deleted from the cache." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:00:01\n" ] } ], "source": [ "import pydov.util.caching\n", "import datetime\n", "pydov.cache = pydov.util.caching.GzipTextFileCache(\n", " max_age=datetime.timedelta(seconds=1)\n", " )\n", "print(pydov.cache.max_age)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1879-119364.xml.gz\n" ] }, { "data": { "text/plain": [ "'Wed Mar 06 14:36:24 2019'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from time import ctime\n", "print(os.listdir(os.path.join(cachedir, 'boring'))[0])\n", "ctime(os.path.getmtime(os.path.join(os.path.join(cachedir, 'boring'),\n", " os.listdir(os.path.join(cachedir, 'boring'))[0]\n", " )\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[000/111] ..................................................\n", "[050/111] ..................................................\n", "[100/111] ...........\n", "Wall time: 35.7 s\n" ] } ], "source": [ "# rerun previous query \n", "%time df = boring.search(location=Within(Box(150145, 205030, 155150, 206935)))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1879-119364.xml.gz\n" ] }, { "data": { "text/plain": [ "'Wed Mar 06 14:38:20 2019'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from time import ctime\n", "print(os.listdir(os.path.join(cachedir, 'boring'))[0])\n", "ctime(os.path.getmtime(os.path.join(os.path.join(cachedir, 'boring'),\n", " os.listdir(os.path.join(cachedir, 'boring'))[0]\n", " )\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning the cache" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Since we use a temporary directory provided by the operating system, we rely on the operating system to clean the folder when it deems necessary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To clean the cache yourself, removing all records older than the maximum age, call `pydov.cache.clean()`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from time import sleep" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('number of files before clean: ', 111)\n", "('number of files after clean: ', 0)\n" ] } ], "source": [ "print('number of files before clean: ', len(os.listdir(os.path.join(cachedir, 'boring'))))\n", "sleep(2) # remember that the maximum age was set to 1 second\n", "pydov.cache.clean()\n", "print('number of files after clean: ', len(os.listdir(os.path.join(cachedir, 'boring'))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Should you want to remove the pydov cache from code yourself, you can do so as illustrated below. Note that this will erase the entire cache, not only the records older than the maximum age:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "pydov.cache.remove()\n", "# check existence of the cache directory:\n", "print(os.path.exists(os.path.join(cachedir, 'boring')))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 2 }