{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This module provides functions for downloading several useful datasets that you might want to use in your models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.datasets import * \n", "from fastai.datasets import Config\n", "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class URLs

\n", "\n", "> URLs()\n", "\n", "

\n", "\n", "Global constants for dataset and model URLs. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(URLs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This class contains the URLs of all the datasets and models, along with some classmethods to help use them; you don't create objects of this class. The supported datasets are (with their calling names): `S3_NLP`, `S3_COCO`, `MNIST_SAMPLE`, `MNIST_TINY`, `IMDB_SAMPLE`, `ADULT_SAMPLE`, `ML_SAMPLE`, `PLANET_SAMPLE`, `CIFAR`, `PETS`, `MNIST`. For details on the datasets, see the [fast.ai datasets webpage](http://course.fast.ai/datasets). Datasets with `SAMPLE` in their name are subsets of the original datasets. In the case of MNIST, there is also a `TINY` dataset, which is even smaller than `MNIST_SAMPLE`.\n", "\n", "Models are currently limited to `WT103`, but you can expect more in the future!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'http://files.fast.ai/data/examples/mnist_sample'" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "URLs.MNIST_SAMPLE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use these datasets, you need to download them with [`untar_data`](/datasets.html#untar_data) or [`download_data`](/datasets.html#download_data). [`untar_data`](/datasets.html#untar_data) will download the data file and decompress it, while [`download_data`](/datasets.html#download_data) will just download and save the compressed file in `.tgz` format. \n", "\n", "The locations where the data and models are downloaded are set in `config.yml`, which by default is located in `~/.fastai`. 
This directory can be changed via the optional environment variable `FASTAI_HOME` (e.g., `FASTAI_HOME=/home/.fastai`).\n", "\n", "If no `config.yml` is present in the specified directory, a default one will be created with `data_archive_path`, `data_path` and `models_path` entries. The `data_path` and `models_path` entries point to the `data` and `models` folders, respectively, in the same directory as `config.yml`. The `data_archive_path` entry lets you set a separate folder in which to save compressed datasets for archiving purposes; it defaults to the same directory as `data_path`. \n", "\n", "To change these download locations, edit `data_archive_path`, `data_path` and `models_path` in `config.yml`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

untar_data

\n", "\n", "> untar_data(**`url`**:`str`, **`fname`**:`PathOrStr`=***`None`***, **`dest`**:`PathOrStr`=***`None`***, **`data`**=***`True`***, **`force_download`**=***`False`***) → `Path`\n", "\n", "
Tests found for untar_data:

  • pytest -sv tests/test_datasets.py::test_load_config
  • pytest -sv tests/test_datasets.py::test_user_config
  • pytest -sv tests/test_vision_data.py::test_trunc_download

Some other tests where untar_data is used:

  • pytest -sv tests/test_datasets.py::test_user_config

\n", "\n", "Download `url` to `fname` if `dest` doesn't exist, and un-tgz to folder `dest`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(untar_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, [`untar_data`](/datasets.html#untar_data) uses a `url` to download a `tgz` file to `fname`, and then un-tgzs `fname` into a folder under `dest`. \n", "\n", "If you have run [`untar_data`](/datasets.html#untar_data) before, then running `untar_data(URLs.something)` again will just return `dest` without downloading again.\n", "\n", "If you have run [`untar_data`](/datasets.html#untar_data) before, then running it again with `force_download=True`, or with a tgz file under `fname` that has somehow been corrupted, will remove the existing `fname` and `dest` and start downloading again.\n", "\n", "If you have run [`untar_data`](/datasets.html#untar_data) before but `dest` no longer exists (the folder may have been removed or renamed), then running `untar_data(URLs.something)` again will call [`download_data`](/datasets.html#download_data). If the tgz file under `fname` still exists, nothing is downloaded and `fname` is simply un-tgzed into `dest`; if `fname` does not exist, the tgz file is downloaded first.\n", "\n", "**Note**: the `url` you feed to [`untar_data`](/datasets.html#untar_data) must be one of `URLs.something`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/planet_sample')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "untar_data(URLs.PLANET_SAMPLE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

download_data

\n", "\n", "> download_data(**`url`**:`str`, **`fname`**:`PathOrStr`=***`None`***, **`data`**:`bool`=***`True`***, **`ext`**:`str`=***`'.tgz'`***) → `Path`\n", "\n", "
Tests found for download_data:

  • pytest -sv tests/test_datasets.py::test_load_config
  • pytest -sv tests/test_datasets.py::test_user_config

Some other tests where download_data is used:

  • pytest -sv tests/test_datasets.py::test_user_config

\n", "\n", "Download `url` to destination `fname`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(download_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: If the data file already exists in a `data` directory next to the notebook, that file will be used instead of the one in the folder specified in `config.yml`. `config.yml` is located in the directory specified by the optional environment variable `FASTAI_HOME` (defaults to `~/.fastai/`). Paths are resolved by calling the function [`datapath4file`](/datasets.html#datapath4file), which first checks whether the data exists locally (`data/`) before downloading to the folder specified in `config.yml`.\n", "\n", "Example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/planet_sample.tgz')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "download_data(URLs.PLANET_SAMPLE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

datapath4file

\n", "\n", "> datapath4file(**`filename`**, **`ext`**:`str`=***`'.tgz'`***, **`archive`**=***`True`***)\n", "\n", "
Tests found for datapath4file:

  • pytest -sv tests/test_datasets.py::test_load_config
  • pytest -sv tests/test_datasets.py::test_user_config

\n", "\n", "Return data path to `filename`, checking locally first then in the config file. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(datapath4file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the downloading functions use this to decide where to put the tgz file and the expanded folder. If `filename` already exists in a `data` directory in the same place as the calling notebook/script, that directory is used as the parent directory; otherwise `config.yml` is read to see what path to use, which defaults to `~/.fastai/data`. To override this default, simply modify the values in your `config.yml`:\n", "\n", "    data_archive_path: ~/.fastai/data\n", "    data_path: ~/.fastai/data\n", "\n", "`config.yml` is located in the directory specified by the optional environment variable `FASTAI_HOME` (defaults to `~/.fastai/`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

url2path

\n", "\n", "> url2path(**`url`**, **`data`**=***`True`***, **`ext`**:`str`=***`'.tgz'`***)\n", "\n", "
Tests found for url2path:

  • pytest -sv tests/test_datasets.py::test_load_config
  • pytest -sv tests/test_datasets.py::test_user_config

\n", "\n", "Change `url` to a path. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(url2path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class Config

\n", "\n", "> Config()\n", "\n", "
Tests found for Config:

  • pytest -sv tests/test_datasets.py::test_creates_config
  • pytest -sv tests/test_datasets.py::test_default_config
  • pytest -sv tests/test_datasets.py::test_load_config
  • pytest -sv tests/test_datasets.py::test_user_config

Some other tests where Config is used:

  • pytest -sv tests/test_datasets.py::test_user_config

\n", "\n", "Creates a default config file 'config.yml' in $FASTAI_HOME (default `~/.fastai/`) " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You probably won't need to use this yourself - it's used by [`datapath4file`](/datasets.html#datapath4file)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_path

\n", "\n", "> get_path(**`path`**)\n", "\n", "

\n", "\n", "Get the `path` in the config file. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config.get_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the key corresponding to `path` in the [`Config`](/datasets.html#Config)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

data_path

\n", "\n", "> data_path()\n", "\n", "

\n", "\n", "Get the path to data in the config file. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config.data_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the `Path` where the data is stored." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

model_path

\n", "\n", "> model_path()\n", "\n", "

\n", "\n", "Get the path to fastai pretrained models in the config file. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config.model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

create

\n", "\n", "> create(**`fpath`**)\n", "\n", "

\n", "\n", "Creates a [`Config`](/datasets.html#Config) from `fpath`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config.create)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

url2name

\n", "\n", "> url2name(**`url`**)\n", "\n", "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(url2name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get_key

\n", "\n", "> get_key(**`key`**)\n", "\n", "

\n", "\n", "Get the path to `key` in the config file. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config.get_key)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get

\n", "\n", "> get(**`fpath`**=***`None`***, **`create_missing`**=***`True`***)\n", "\n", "
Some other tests where get is used:

  • pytest -sv tests/test_datasets.py::test_creates_config
  • pytest -sv tests/test_datasets.py::test_default_config

\n", "\n", "Retrieve the [`Config`](/datasets.html#Config) in `fpath`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Config.get)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] } ], "metadata": { "jekyll": { "keywords": "fastai" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }