{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#default_exp data.external" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "from fastai2.torch_basics import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# External data\n", "> Helper functions to download the fastai datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The complete list of datasets available by default inside the library is: \n", "\n", "**Main datasets**:\n", "1. **ADULT_SAMPLE**: A small sample of the [adults dataset](https://archive.ics.uci.edu/ml/datasets/Adult) to predict whether income exceeds $50K/yr based on census data. \n", "- **BIWI_SAMPLE**: A sample of the [BIWI kinect headpose database](https://www.kaggle.com/kmader/biwi-kinect-head-pose-database). The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation are provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch. \n", "1. **CIFAR**: The famous [cifar-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. \n", "1. **COCO_SAMPLE**: A sample of the [coco dataset](http://cocodataset.org/#home) for object detection. \n", "1. **COCO_TINY**: A tiny version of the [coco dataset](http://cocodataset.org/#home) for object detection.\n", "- **HUMAN_NUMBERS**: A synthetic dataset consisting of human number counts in text such as one, two, three, four... Useful for experimenting with Language Models.\n", "- **IMDB**: The full [IMDB sentiment analysis dataset](https://ai.stanford.edu/~amaas/data/sentiment/). \n", "\n", "- **IMDB_SAMPLE**: A sample of the full [IMDB sentiment analysis dataset](https://ai.stanford.edu/~amaas/data/sentiment/). 
\n", "- **ML_SAMPLE**: A movielens sample dataset for recommendation engines to recommend movies to users. \n", "- **ML_100k**: The movielens 100k dataset for recommendation engines to recommend movies to users. \n", "- **MNIST_SAMPLE**: A sample of the famous [MNIST dataset](http://yann.lecun.com/exdb/mnist/) consisting of handwritten digits. \n", "- **MNIST_TINY**: A tiny version of the famous [MNIST dataset](http://yann.lecun.com/exdb/mnist/) consisting of handwritten digits. \n", "- **MNIST_VAR_SIZE_TINY**: \n", "- **PLANET_SAMPLE**: A sample of the planets dataset from the Kaggle competition [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space).\n", "- **PLANET_TINY**: A tiny version of the planets dataset from the Kaggle competition [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space) for faster experimentation and prototyping.\n", "- **IMAGENETTE**: A smaller version of the [imagenet dataset](http://www.image-net.org/) pronounced just like 'Imagenet', except with a corny inauthentic French accent. \n", "- **IMAGENETTE_160**: The 160px version of the Imagenette dataset. \n", "- **IMAGENETTE_320**: The 320px version of the Imagenette dataset. \n", "- **IMAGEWOOF**: Imagewoof is a subset of 10 classes from Imagenet that aren't so easy to classify, since they're all dog breeds.\n", "- **IMAGEWOOF_160**: 160px version of the ImageWoof dataset. \n", "- **IMAGEWOOF_320**: 320px version of the ImageWoof dataset.\n", "- **IMAGEWANG**: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem\n", "- **IMAGEWANG_160**: 160px version of Imagewang. \n", "- **IMAGEWANG_320**: 320px version of Imagewang. \n", "\n", "**Kaggle competition datasets**:\n", "1. 
**DOGS**: Image dataset consisting of dogs and cats images from [Dogs vs Cats kaggle competition](https://www.kaggle.com/c/dogs-vs-cats). \n", "\n", "**Image Classification datasets**:\n", "1. **CALTECH_101**: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc 'Aurelio Ranzato.\n", "1. CARS: The [Cars dataset](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) contains 16,185 images of 196 classes of cars. \n", "1. **CIFAR_100**: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. \n", "1. **CUB_200_2011**: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations\n", "1. **FLOWERS**: 17 category [flower dataset](http://www.robots.ox.ac.uk/~vgg/data/flowers/) by gathering images from various websites.\n", "1. **FOOD**: \n", "1. **MNIST**: [MNIST dataset](http://yann.lecun.com/exdb/mnist/) consisting of handwritten digits. \n", "1. **PETS**: A 37 category [pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) with roughly 200 images for each class.\n", "\n", "**NLP datasets**:\n", "1. **AG_NEWS**: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.\n", "1. **AMAZON_REVIEWS**: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.\n", "1. **AMAZON_REVIEWS_POLARITY**: Amazon reviews dataset for sentiment analysis.\n", "1. **DBPEDIA**: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples for each of 14 nonoverlapping classes from DBpedia. \n", "1. 
**MT_ENG_FRA**: Machine translation dataset from English to French.\n", "1. **SOGOU_NEWS**: [The Sogou-SRR](http://www.thuir.cn/data-srr/) (Search Result Relevance) dataset was constructed to support researches on search engine relevance estimation and ranking tasks.\n", "1. **WIKITEXT**: The [WikiText language modeling dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. \n", "1. **WIKITEXT_TINY**: A tiny version of the WIKITEXT dataset.\n", "1. **YAHOO_ANSWERS**: YAHOO's question answers dataset.\n", "1. **YELP_REVIEWS**: The [Yelp dataset](https://www.yelp.com/dataset) is a subset of YELP businesses, reviews, and user data for use in personal, educational, and academic purposes\n", "1. **YELP_REVIEWS_POLARITY**: For sentiment classification on YELP reviews.\n", "\n", "\n", "**Image localization datasets**:\n", "1. **BIWI_HEAD_POSE**: A [BIWI kinect headpose database](https://www.kaggle.com/kmader/biwi-kinect-head-pose-database). The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch. \n", "1. **CAMVID**: Consists of driving labelled dataset for segmentation type models.\n", "1. **CAMVID_TINY**: A tiny camvid dataset for segmentation type models.\n", "1. **LSUN_BEDROOMS**: [Large-scale Image Dataset](https://arxiv.org/abs/1506.03365) using Deep Learning with Humans in the Loop\n", "1. **PASCAL_2007**: [Pascal 2007 dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/) to recognize objects from a number of visual object classes in realistic scenes.\n", "1. 
**PASCAL_2012**: [Pascal 2012 dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/) to recognize objects from a number of visual object classes in realistic scenes.\n", "\n", "**Audio classification**:\n", "1. **MACAQUES**: [7285 macaque coo calls](https://datadryad.org/stash/dataset/doi:10.5061/dryad.7f4p9) across 8 individuals from [Distributed acoustic cues for caller identity in macaque vocalization](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4806230).\n", "2. **ZEBRA_FINCH**: [3405 zebra finch calls](https://ndownloader.figshare.com/articles/11905533/versions/1) classified [across 11 call types](https://link.springer.com/article/10.1007/s10071-015-0933-6). Additional labels include the name of the individual making the vocalization and its age.\n", "\n", "**Medical Imaging datasets**:\n", "1. **SIIM_SMALL**: A smaller version of the [SIIM dataset](https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation/overview) where the objective is to classify pneumothorax from a set of chest radiographic images.\n", "\n", "**Pretrained models**:\n", "1. **OPENAI_TRANSFORMER**: The GPT2 Transformer pretrained weights.\n", "1. **WT103_FWD**: The WikiText-103 forward language model weights.\n", "1. **WT103_BWD**: The WikiText-103 backward language model weights." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To download any of the datasets or pretrained weights, call `untar_data` with any of the names listed above, like so: \n", "\n", "```python \n", "path = untar_data(URLs.PETS)\n", "path.ls()\n", "\n", ">> (#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Abyssinian_92.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Russian_Blue_194.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/staffordshire_bull_terrier_91.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Persian_69.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/english_setter_33.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Russian_Blue_155.jpg')...]\n", "```\n", "\n", "To download model pretrained weights: \n", "```python \n", "path = untar_data(URLs.WT103_BWD)\n", "path.ls()\n", "\n", ">> (#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Config -" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# export\n", "class Config:\n", " \"Setup config at `~/.fastai` unless it exists already.\"\n", " config_path = Path(os.getenv('FASTAI_HOME', '~/.fastai')).expanduser()\n", " config_file = config_path/'config.yml'\n", "\n", " def __init__(self):\n", " self.config_path.mkdir(parents=True, exist_ok=True)\n", " if not self.config_file.exists(): self.create_config()\n", " self.d = self.load_config()\n", "\n", " def __getitem__(self,k):\n", " k = k.lower()\n", " if k not in self.d: k = 
k+'_path'\n", " return Path(self.d[k])\n", "\n", " def __getattr__(self,k):\n", " if k=='d': raise AttributeError\n", " return self[k]\n", "\n", " def __setitem__(self,k,v): self.d[k] = str(v)\n", " def __contains__(self,k): return k in self.d\n", "\n", " def load_config(self):\n", " \"load and return config if version equals 2 in existing, else create new config.\"\n", " with open(self.config_file, 'r') as f:\n", " config = yaml.safe_load(f)\n", " if 'version' in config and config['version'] == 2: return config\n", " elif 'version' in config: self.create_config(config)\n", " else: self.create_config()\n", " return self.load_config()\n", "\n", " def create_config(self, cfg=None):\n", " \"create new config with default paths and set `version` to 2.\"\n", " config = {'data_path': str(self.config_path/'data'),\n", " 'archive_path': str(self.config_path/'archive'),\n", " 'storage_path': '/tmp',\n", " 'model_path': str(self.config_path/'models'),\n", " 'version': 2}\n", " if cfg is not None:\n", " cfg['version'] = 2\n", " config = merge(config, cfg)\n", " self.save_file(config)\n", "\n", " def save(self): self.save_file(self.d)\n", " def save_file(self, config):\n", " \"save config file at default config location `~/.fastai/config.yml`.\"\n", " with self.config_file.open('w') as f: yaml.dump(config, f, default_flow_style=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a config file doesn't exist already, it is always created at `~/.fastai/config.yml` location by default whenever an instance of the `Config` class is created. 
Here is a quick example to explain: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_file = Path(\"~/.fastai/config.yml\").expanduser()\n", "if config_file.exists(): os.remove(config_file)\n", "assert not config_file.exists()\n", "\n", "config = Config()\n", "assert config_file.exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The config is now available as `config.d`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'archive_path': '/home/jhoward/.fastai/archive',\n", " 'data_path': '/home/jhoward/.fastai/data',\n", " 'model_path': '/home/jhoward/.fastai/models',\n", " 'storage_path': '/tmp',\n", " 'version': 2}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "config.d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, this is a basic config file that consists of `data_path`, `model_path`, `storage_path` and `archive_path`. \n", "All future downloads go to the paths defined in the config file, based on the type of download. For example, all future fastai datasets are downloaded to the `data_path` while all pretrained model weights are downloaded to `model_path`, unless the default download location is updated." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Path('/home/jhoward/.fastai/config.yml'),\n", " Path('/home/jhoward/.fastai/config.yml.bak'))" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#hide\n", "config = Config()\n", "config_path = config.config_path\n", "config_file,config_bak = config_path/'config.yml',config_path/'config.yml.bak'\n", "config_file,config_bak" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "#This cell is just to make the config file compatible with current fastai\n", "# TODO: make this a method that auto-runs as needed\n", "if 'data_archive_path' not in config:\n", " config['data_archive_path'] = config.data_path\n", " config.save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please note that it is possible to update the default path locations in the config file. Let's first create a backup of the config file, then update the config to show the changes, and finally restore the original config from the backup file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if config_file.exists(): shutil.move(config_file, config_bak)\n", "config['archive_path'] = Path(\".\")\n", "config.save()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'archive_path': '.',\n", " 'data_archive_path': '/home/jhoward/.fastai/data',\n", " 'data_path': '/home/jhoward/.fastai/data',\n", " 'model_path': '/home/jhoward/.fastai/models',\n", " 'storage_path': '/tmp',\n", " 'version': 2}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "config = Config()\n", "config.d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `archive_path` has been updated to `\".\"`. 
Now let's revert the changes we made to the config file for the purpose of this example. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'archive_path': '/home/jhoward/.fastai/archive',\n", " 'data_archive_path': '/home/jhoward/.fastai/data',\n", " 'data_path': '/home/jhoward/.fastai/data',\n", " 'model_path': '/home/jhoward/.fastai/models',\n", " 'storage_path': '/tmp',\n", " 'version': 2}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if config_bak.exists(): shutil.move(config_bak, config_file)\n", "config = Config()\n", "config.d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## URLs -" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class URLs():\n", " \"Global constants for dataset and model URLs.\"\n", " LOCAL_PATH = Path.cwd()\n", " URL = 'http://files.fast.ai/data/examples/'\n", " MDL = 'http://files.fast.ai/models/'\n", " S3 = 'https://s3.amazonaws.com/fast-ai-'\n", "\n", " S3_IMAGE = f'{S3}imageclas/'\n", " S3_IMAGELOC = f'{S3}imagelocal/'\n", " S3_AUDI = f'{S3}audio/'\n", " S3_NLP = f'{S3}nlp/'\n", " S3_COCO = f'{S3}coco/'\n", " S3_MODEL = f'{S3}modelzoo/'\n", "\n", " # main datasets\n", " ADULT_SAMPLE = f'{URL}adult_sample.tgz'\n", " BIWI_SAMPLE = f'{URL}biwi_sample.tgz'\n", " CIFAR = f'{URL}cifar10.tgz'\n", " COCO_SAMPLE = f'{S3_COCO}coco_sample.tgz'\n", " COCO_TINY = f'{URL}coco_tiny.tgz'\n", " HUMAN_NUMBERS = f'{URL}human_numbers.tgz'\n", " IMDB = f'{S3_NLP}imdb.tgz'\n", " IMDB_SAMPLE = f'{URL}imdb_sample.tgz'\n", " ML_SAMPLE = f'{URL}movie_lens_sample.tgz'\n", " ML_100k = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'\n", " MNIST_SAMPLE = f'{URL}mnist_sample.tgz'\n", " MNIST_TINY = f'{URL}mnist_tiny.tgz'\n", " MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'\n", " PLANET_SAMPLE = f'{URL}planet_sample.tgz'\n", " PLANET_TINY = 
f'{URL}planet_tiny.tgz'\n", " IMAGENETTE = f'{S3_IMAGE}imagenette2.tgz'\n", " IMAGENETTE_160 = f'{S3_IMAGE}imagenette2-160.tgz'\n", " IMAGENETTE_320 = f'{S3_IMAGE}imagenette2-320.tgz'\n", " IMAGEWOOF = f'{S3_IMAGE}imagewoof2.tgz'\n", " IMAGEWOOF_160 = f'{S3_IMAGE}imagewoof2-160.tgz'\n", " IMAGEWOOF_320 = f'{S3_IMAGE}imagewoof2-320.tgz'\n", " IMAGEWANG = f'{S3_IMAGE}imagewang.tgz'\n", " IMAGEWANG_160 = f'{S3_IMAGE}imagewang-160.tgz'\n", " IMAGEWANG_320 = f'{S3_IMAGE}imagewang-320.tgz'\n", "\n", " # kaggle competitions download dogs-vs-cats -p {DOGS.absolute()}\n", " DOGS = f'{URL}dogscats.tgz'\n", "\n", " # image classification datasets\n", " CALTECH_101 = f'{S3_IMAGE}caltech_101.tgz'\n", " CARS = f'{S3_IMAGE}stanford-cars.tgz'\n", " CIFAR_100 = f'{S3_IMAGE}cifar100.tgz'\n", " CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'\n", " FLOWERS = f'{S3_IMAGE}oxford-102-flowers.tgz'\n", " FOOD = f'{S3_IMAGE}food-101.tgz'\n", " MNIST = f'{S3_IMAGE}mnist_png.tgz'\n", " PETS = f'{S3_IMAGE}oxford-iiit-pet.tgz'\n", "\n", " # NLP datasets\n", " AG_NEWS = f'{S3_NLP}ag_news_csv.tgz'\n", " AMAZON_REVIEWS = f'{S3_NLP}amazon_review_full_csv.tgz'\n", " AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'\n", " DBPEDIA = f'{S3_NLP}dbpedia_csv.tgz'\n", " MT_ENG_FRA = f'{S3_NLP}giga-fren.tgz'\n", " SOGOU_NEWS = f'{S3_NLP}sogou_news_csv.tgz'\n", " WIKITEXT = f'{S3_NLP}wikitext-103.tgz'\n", " WIKITEXT_TINY = f'{S3_NLP}wikitext-2.tgz'\n", " YAHOO_ANSWERS = f'{S3_NLP}yahoo_answers_csv.tgz'\n", " YELP_REVIEWS = f'{S3_NLP}yelp_review_full_csv.tgz'\n", " YELP_REVIEWS_POLARITY = f'{S3_NLP}yelp_review_polarity_csv.tgz'\n", "\n", " # Image localization datasets\n", " BIWI_HEAD_POSE = f\"{S3_IMAGELOC}biwi_head_pose.tgz\"\n", " CAMVID = f'{S3_IMAGELOC}camvid.tgz'\n", " CAMVID_TINY = f'{URL}camvid_tiny.tgz'\n", " LSUN_BEDROOMS = f'{S3_IMAGE}bedroom.tgz'\n", " PASCAL_2007 = f'{S3_IMAGELOC}pascal_2007.tgz'\n", " PASCAL_2012 = f'{S3_IMAGELOC}pascal_2012.tgz'\n", "\n", " # Audio classification 
datasets\n", " MACAQUES = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'\n", " ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'\n", "\n", " # Medical Imaging datasets\n", " #SKIN_LESION = f'{S3_IMAGELOC}skin_lesion.tgz'\n", " SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'\n", "\n", " #Pretrained models\n", " OPENAI_TRANSFORMER = f'{S3_MODEL}transformer.tgz'\n", " WT103_FWD = f'{S3_MODEL}wt103-fwd.tgz'\n", " WT103_BWD = f'{S3_MODEL}wt103-bwd.tgz'\n", "\n", " def path(url='.', c_key='archive'):\n", " \"Return local path where to download based on `c_key`\"\n", " fname = url.split('/')[-1]\n", " local_path = URLs.LOCAL_PATH/('models' if c_key=='models' else 'data')/fname\n", " if local_path.exists(): return local_path\n", " return Config()[c_key]/fname" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default local path is at `~/.fastai/archive/` but this can be updated by passing a different `c_key`. Note: `c_key` should be one of `'archive_path', 'data_archive_path', 'data_path', 'model_path', 'storage_path'`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Path('/home/jhoward/.fastai/archive/oxford-iiit-pet.tgz')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = URLs.PETS\n", "local_path = URLs.path(url)\n", "test_eq(local_path.parent, Config()['archive']); \n", "local_path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Path('/home/jhoward/.fastai/models/oxford-iiit-pet.tgz')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "local_path = URLs.path(url, c_key='model')\n", "test_eq(local_path.parent, Config()['model'])\n", "local_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# export\n", "def download_url(url, dest, overwrite=False, pbar=None, show_progress=True, chunk_size=1024*1024,\n", " timeout=4, retries=5):\n", " \"Download `url` to `dest` unless it exists and not `overwrite`\"\n", " if os.path.exists(dest) and not overwrite: return\n", "\n", " s = requests.Session()\n", " s.mount('http://',requests.adapters.HTTPAdapter(max_retries=retries))\n", " # additional line to identify as a firefox browser, see fastai/#2438\n", " s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'})\n", " u = s.get(url, stream=True, timeout=timeout)\n", " try: file_size = int(u.headers[\"Content-Length\"])\n", " except: show_progress = False\n", "\n", " with open(dest, 'wb') as f:\n", " nbytes = 0\n", " if show_progress: pbar = progress_bar(range(file_size), leave=False, parent=pbar)\n", " try:\n", " if show_progress: pbar.update(0)\n", " for chunk in u.iter_content(chunk_size=chunk_size):\n", " nbytes += len(chunk)\n", " if show_progress: pbar.update(nbytes)\n", " 
f.write(chunk)\n", " except requests.exceptions.ConnectionError as e:\n", " fname = url.split('/')[-1]\n", " data_dir = dest.parent\n", " print(f'\\n Download of {url} has failed after {retries} retries\\n'\n", " f' Fix the download manually:\\n'\n", " f'$ mkdir -p {data_dir}\\n'\n", " f'$ cd {data_dir}\\n'\n", " f'$ wget -c {url}\\n'\n", " f'$ tar xf {fname}\\n'\n", " f' And re-run your code once the download is successful\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`download_url` is a very handy function inside fastai! It can be used to download any file from the internet to the location passed as the `dest` argument. It is not limited to fastai files. As an example, let's download an image of a dog from the web: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fname = Path(\"./dog.jpg\")\n", "if fname.exists(): os.remove(fname)\n", "url = \"https://i.insider.com/569fdd9ac08a80bd448b7138?width=1100&format=jpeg&auto=webp\"\n", "download_url(url, fname)\n", "assert fname.exists()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's confirm that the file was indeed downloaded correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from PIL import Image\n", "im = Image.open(fname)\n", "plt.imshow(im);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, the file has been downloaded to the local path provided in `dest` argument. Calling the function again doesn't trigger a download since the file is already there. This can be confirmed by checking that the last modified time of the file that is downloaded doesn't get updated. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if fname.exists(): last_modified_time = os.path.getmtime(fname)\n", "download_url(url, fname)\n", "test_eq(os.path.getmtime(fname), last_modified_time)\n", "if fname.exists(): os.remove(fname)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use the `download_url` function to download the pet's dataset straight from the source by simply passing `https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz` in `url`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# export\n", "def download_data(url, fname=None, c_key='archive', force_download=False):\n", " \"Download `url` to `fname`.\"\n", " fname = Path(fname or URLs.path(url, c_key=c_key))\n", " fname.parent.mkdir(parents=True, exist_ok=True)\n", " if not fname.exists() or force_download: download_url(url, fname, overwrite=force_download)\n", " return fname" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `download_data` is a convenience function and a wrapper outside `download_url` to download fastai files to the appropriate local path based on the `c_key`. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `fname` is None, it will default to the archive folder you have in your config file (or data, model if you specify a different `c_key`) followed by the last part of the url: for instance `URLs.MNIST_SAMPLE` is `http://files.fast.ai/data/examples/mnist_sample.tgz` and the default value for `fname` will be `~/.fastai/archive/mnist_sample.tgz`.\n", "\n", "If `force_download=True`, the file is alwayd downloaded. Otherwise, it's only when the file doesn't exists that the download is triggered." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#hide\n", "try:\n", " test_eq(download_data(URLs.MNIST_SAMPLE), config.archive/'mnist_sample.tgz')\n", " test_eq(download_data(URLs.MNIST_TINY, fname=Path('mnist.tgz')), Path('mnist.tgz'))\n", "finally: Path('mnist.tgz').unlink()\n", "\n", "try:\n", " tst_model = config.model/'mnist_tiny.tgz'\n", " test_eq(download_data(URLs.MNIST_TINY, c_key='model'), tst_model)\n", " os.remove(tst_model)\n", "finally:\n", " if tst_model.exists(): tst_model.unlink()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check datasets -" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "from nbdev.imports import Config as NbdevConfig\n", "__file__ = NbdevConfig().lib_path/'data'/'external.py'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _get_check(url):\n", " \"internal function to get the hash of the file at `url`.\"\n", " checks = json.load(open(Path(__file__).parent/'checks.txt', 'r'))\n", " return checks.get(url, None)\n", "\n", "def _check_file(fname):\n", " \"internal function to get the hash 
of the local file at `fname`.\"\n", " size = os.path.getsize(fname)\n", " with open(fname, \"rb\") as f: hash_nb = hashlib.md5(f.read(2**20)).hexdigest()\n", " return [size,hash_nb]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([3214948, '2dbc7ec6f9259b583af0072c55816a88'],\n", " [3214948, '2dbc7ec6f9259b583af0072c55816a88'])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#hide\n", "test_eq(_get_check(URLs.MNIST_SAMPLE), _check_file(URLs.path(URLs.MNIST_SAMPLE)))\n", "_get_check(URLs.MNIST_SAMPLE), _check_file(URLs.path(URLs.MNIST_SAMPLE))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([1637796771, '433b4706eb7c42bd74e7f784e3fdf244'],\n", " [2618908000, 'd90e29e54a4c76c0c6fba8355dcbaca5'])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_get_check(URLs.PASCAL_2007),_get_check(URLs.PASCAL_2012)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _add_check(url, fname):\n", " \"Internal function to update the internal check file with `url` and check on `fname`.\"\n", " checks = json.load(open(Path(__file__).parent/'checks.txt', 'r'))\n", " checks[url] = _check_file(fname)\n", " json.dump(checks, open(Path(__file__).parent/'checks.txt', 'w'), indent=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def file_extract(fname, dest=None):\n", " \"Extract `fname` to `dest` using `tarfile` or `zipfile`.\"\n", " if dest is None: dest = Path(fname).parent\n", " fname = str(fname)\n", " if fname.endswith('gz'): tarfile.open(fname, 'r:gz').extractall(dest)\n", " elif fname.endswith('zip'): zipfile.ZipFile(fname ).extractall(dest)\n", " else: raise 
Exception(f'Unrecognized archive: {fname}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`file_extract` is used by default in `untar_data` to decompress the downloaded file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def _try_from_storage(dest, storage):\n", " \"an internal function to create symbolic links for files from `storage` to `dest` if `storage` exists\"\n", " if not storage.exists(): return\n", " os.makedirs(dest, exist_ok=True)\n", " for f in storage.glob('*'): os.symlink(f, dest/f.name, target_is_directory=f.is_dir())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "with tempfile.TemporaryDirectory() as d:\n", " with tempfile.TemporaryDirectory() as d2:\n", " d,d2 = Path(d),Path(d2)\n", " for k in ['a', 'b', 'c']: os.makedirs(d/k)\n", " for k in ['d', 'e', 'f']: (d/k).touch()\n", " _try_from_storage(d2, d)\n", " for k in ['a', 'b', 'c']: \n", " assert (d2/k).exists()\n", " assert (d2/k).is_dir()\n", " for k in ['d', 'e', 'f']: \n", " assert (d2/k).exists()\n", " assert (d2/k).is_file()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def newest_folder(path):\n", " \"Return newest folder on path\"\n", " list_of_paths = path.glob('*')\n", " return max(list_of_paths, key=lambda p: p.stat().st_ctime)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def rename_extracted(dest):\n", " \"Rename file if different from dest\"\n", " extracted = newest_folder(dest.parent)\n", " if not (extracted.name == dest.name): extracted.rename(dest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's rename the untar/unzip data if dest name is different from fname" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def untar_data(url, 
fname=None, dest=None, c_key='data', force_download=False, extract_func=file_extract):\n", " \"Download `url` to `fname` if `dest` doesn't exist, and un-tgz or unzip to folder `dest`.\"\n", " default_dest = URLs.path(url, c_key=c_key).with_suffix('')\n", " dest = default_dest if dest is None else Path(dest)/default_dest.name\n", " fname = Path(fname or URLs.path(url))\n", " if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):\n", " print(\"A new version of this dataset is available, downloading...\")\n", " force_download = True\n", " if force_download:\n", " if fname.exists(): os.remove(fname)\n", " if dest.exists(): shutil.rmtree(dest)\n", " if not dest.exists(): _try_from_storage(dest, URLs.path(url, c_key='storage').with_suffix(''))\n", " if not dest.exists():\n", " fname = download_data(url, fname=fname, c_key=c_key)\n", " if _get_check(url) and _check_file(fname) != _get_check(url):\n", " print(f\"File downloaded is broken. Remove {fname} and try again.\")\n", " extract_func(fname, dest.parent)\n", " rename_extracted(dest)\n", " return dest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`untar_data` is a convenience function that downloads the archive at `url` and extracts it to `dest`. The `url` can be a default `url` from the `URLs` class or a custom url. If `dest` is not passed, files are extracted to `default_dest`, which defaults to `~/.fastai/data/`.\n", "\n", "The downloaded files are extracted to `dest` by default. To download the files without extracting them, pass the `noop` function as `extract_func`.\n", "\n", "Note that it is also possible to pass a custom `extract_func` to `untar_data` if the filename doesn't end with `.tgz` or `.zip`. Gzipped tar archives and `zip` files are supported by default, so there is no need to pass a custom `extract_func` for these types of files. 
\n", "\n", "Internally, if the archive is not already available at `fname`, which defaults to a path in `~/.fastai/archive/`, it is downloaded there and then extracted to `dest`. If no `dest` is passed, files are extracted to `default_dest`, under `~/.fastai/data`. If `dest` does not exist but a matching folder exists in the `storage` location, a symbolic link is created in `dest` for each file in that folder and nothing is downloaded.\n", "\n", "Also, if `force_download` is set to `True`, files are re-downloaded even if they exist." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tempfile import TemporaryDirectory" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "test_eq(untar_data(URLs.MNIST_SAMPLE), config.data/'mnist_sample')\n", "\n", "with TemporaryDirectory() as d:\n", " d = Path(d)\n", " dest = untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)\n", " assert Path('mnist_tiny.tgz').exists()\n", " assert (d/'mnist_tiny').exists()\n", " os.unlink('mnist_tiny.tgz')\n", "\n", "#Test c_key\n", "tst_model = config.model/'mnist_sample'\n", "test_eq(untar_data(URLs.MNIST_SAMPLE, c_key='model'), tst_model)\n", "assert not tst_model.with_suffix('.tgz').exists() #Archive wasn't downloaded in the models path\n", "assert (config.archive/'mnist_sample.tgz').exists() #Archive was downloaded there\n", "shutil.rmtree(tst_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes the extracted folder does not have the same name as the downloaded file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#test fname!=dest\n", "with TemporaryDirectory() as d:\n", " d = Path(d)\n", " untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)\n", " Path('mnist_tiny.tgz').rename('nims_tini.tgz')\n", " p = Path('nims_tini.tgz')\n", " dest = Path('nims_tini')\n", " assert p.exists()\n", " file_extract(p, dest.parent)\n", " rename_extracted(dest)\n", " p.unlink()\n", " shutil.rmtree(dest)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "#Check all URLs are in the checks.txt file and match for downloaded archives\n", "_whitelist = \"MDL LOCAL_PATH URL WT103_BWD WT103_FWD\".split()\n", "checks = json.load(open(Path(__file__).parent/'checks.txt', 'r'))\n", "for d in dir(URLs): \n", " if d.upper() == d and not d.startswith(\"S3\") and not d in _whitelist: \n", " url = getattr(URLs, d)\n", " assert url in checks,f\"\"\"{d} is not in the check file for all URLs.\n", "To fix this, you need to run the following code in this notebook before making a PR (there is a commented cell for this below):\n", "url = URLs.{d}\n", "untar_data(url, force_download=True)\n", "_add_check(url, URLs.path(url))\n", "\"\"\"\n", " f = URLs.path(url)\n", " if f.exists():\n", " assert checks[url] == _check_file(f),f\"\"\"The log we have for {d} in checks does not match the actual archive.\n", "To fix this, you need to run the following code in this notebook before making a PR (there is a commented cell for this below):\n", "url = URLs.{d}\n", "_add_check(url, URLs.path(url))\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export -" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Converted 
00_torch_core.ipynb.\n", "Converted 01_layers.ipynb.\n", "Converted 02_data.load.ipynb.\n", "Converted 03_data.core.ipynb.\n", "Converted 04_data.external.ipynb.\n", "Converted 05_data.transforms.ipynb.\n", "Converted 06_data.block.ipynb.\n", "Converted 07_vision.core.ipynb.\n", "Converted 08_vision.data.ipynb.\n", "Converted 09_vision.augment.ipynb.\n", "Converted 09b_vision.utils.ipynb.\n", "Converted 09c_vision.widgets.ipynb.\n", "Converted 10_tutorial.pets.ipynb.\n", "Converted 11_vision.models.xresnet.ipynb.\n", "Converted 12_optimizer.ipynb.\n", "Converted 13_callback.core.ipynb.\n", "Converted 13a_learner.ipynb.\n", "Converted 13b_metrics.ipynb.\n", "Converted 14_callback.schedule.ipynb.\n", "Converted 14a_callback.data.ipynb.\n", "Converted 15_callback.hook.ipynb.\n", "Converted 15a_vision.models.unet.ipynb.\n", "Converted 16_callback.progress.ipynb.\n", "Converted 17_callback.tracker.ipynb.\n", "Converted 18_callback.fp16.ipynb.\n", "Converted 18a_callback.training.ipynb.\n", "Converted 19_callback.mixup.ipynb.\n", "Converted 20_interpret.ipynb.\n", "Converted 20a_distributed.ipynb.\n", "Converted 21_vision.learner.ipynb.\n", "Converted 22_tutorial.imagenette.ipynb.\n", "Converted 23_tutorial.vision.ipynb.\n", "Converted 24_tutorial.siamese.ipynb.\n", "Converted 24_vision.gan.ipynb.\n", "Converted 30_text.core.ipynb.\n", "Converted 31_text.data.ipynb.\n", "Converted 32_text.models.awdlstm.ipynb.\n", "Converted 33_text.models.core.ipynb.\n", "Converted 34_callback.rnn.ipynb.\n", "Converted 35_tutorial.wikitext.ipynb.\n", "Converted 36_text.models.qrnn.ipynb.\n", "Converted 37_text.learner.ipynb.\n", "Converted 38_tutorial.text.ipynb.\n", "Converted 39_tutorial.transformers.ipynb.\n", "Converted 40_tabular.core.ipynb.\n", "Converted 41_tabular.data.ipynb.\n", "Converted 42_tabular.model.ipynb.\n", "Converted 43_tabular.learner.ipynb.\n", "Converted 44_tutorial.tabular.ipynb.\n", "Converted 45_collab.ipynb.\n", "Converted 46_tutorial.collab.ipynb.\n", 
"Converted 50_tutorial.datablock.ipynb.\n", "Converted 60_medical.imaging.ipynb.\n", "Converted 61_tutorial.medical_imaging.ipynb.\n", "Converted 65_medical.text.ipynb.\n", "Converted 70_callback.wandb.ipynb.\n", "Converted 71_callback.tensorboard.ipynb.\n", "Converted 72_callback.neptune.ipynb.\n", "Converted 73_callback.captum.ipynb.\n", "Converted 74_callback.cutmix.ipynb.\n", "Converted 97_test_utils.ipynb.\n", "Converted 99_pytorch_doc.ipynb.\n", "Converted index.ipynb.\n", "Converted tutorial.ipynb.\n" ] } ], "source": [ "#hide\n", "from nbdev.export import notebook2script\n", "notebook2script()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "jupytext": { "split_at_heading": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }