{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "aliases:\n", "- /noisy_imagenette\n", "badges: true\n", "branch: master\n", "categories:\n", "- deep learning\n", "- imagenette\n", "description: A noisy version of fastai's Imagenette/Imagewoof datasets\n", "permalink: /noisy_imagenette\n", "title: Introducing Noisy Imagenette\n", "toc: false\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TL;DR:** We introduce a dataset, Noisy Imagenette, which is a version of the Imagenette dataset with noisy labels. We hope this dataset is useful for rapid experimentation and testing of methods to address noisy label training.\n", "\n", "# Introduction\n", "\n", "## Dataset have noisy labels!\n", "\n", "Deep learning has led to impressive results on datasets of all types, but its success often shines when models are trained with large datasets with human-annotated labels (extreme example: GPT-3 and more recently CLIP/ALIGN/DALL-E). A major challenge when constructing these datasets is obtaining enough labels to train a neural network model. There is an inherent tradeoff between the quality of the annotations and the cost of annotation (in the form of time or money). For example, while using sources like Amazon Mechanical Turk provide cheap labeling, the use of these non-expert labeling services will often produce unreliable labels. This is what is referred to as noisy labels, as these unreliable labels are not necessarily ground truth. Unfortunately, neural networks are known to be susceptible to overfitting to noisy labels (see [here](https://arxiv.org/abs/1611.03530)) which means alternative approaches are needed to achieve good generalization in the presence of noisy labels.\n", " \n", " ## Prior research on noisy labels\n", " \n", " Recently, many techniques have been presented in order to address label noise. 
These include novel loss functions like [Bi-Tempered Logistic Loss](https://arxiv.org/abs/1906.03361), [Taylor Cross Entropy Loss](https://www.ijcai.org/Proceedings/2020/305), or [Symmetric Cross Entropy](https://arxiv.org/abs/1908.06112). Additionally, many novel training techniques have recently been developed, such as [MentorMix](https://arxiv.org/abs/1911.09781), [DivideMix](https://arxiv.org/abs/2002.07394), [Early-Learning Regularization](https://arxiv.org/abs/2007.00151), and [Noise-Robust Contrastive Learning](https://openreview.net/forum?id=D1E1h-K3jso).\n", "\n", "Most of these papers use MNIST, SVHN, CIFAR10, or related datasets with synthetically added noise. Other common choices are WebVision and Clothing1M, which are large-scale datasets with real-world label noise and millions of images. There is therefore an opportunity for a mid-scale dataset that allows rapid prototyping but is complex enough to provide useful results for noisy-label training.\n", "\n", "## fastai's Imagenette - a dataset for rapid prototyping\n", "\n", "The idea of mid-scale datasets for rapid prototyping has been explored in the past. For example, in 2019, fast.ai [released](https://github.com/fastai/imagenette) the Imagenette and Imagewoof datasets (subsequently updated in 2020), subsets of ImageNet for rapid experimentation and prototyping. Imagenette can serve as a small proxy for ImageNet: a dataset with more complexity than MNIST or CIFAR10, but still small and simple enough for benchmarking and rapid experimentation. This dataset has been used to test and establish new training techniques like the [Mish activation function](https://arxiv.org/abs/1908.08681) and the [Ranger optimizer](https://forums.fast.ai/t/meet-ranger-radam-lookahead-optimizer/52886) (see [here](https://forums.fast.ai/t/how-we-beat-the-5-epoch-imagewoof-leaderboard-score-some-new-techniques-to-consider/53453)). 
The dataset has also been used in various papers (see [here](https://arxiv.org/abs/2004.07629), [here](https://arxiv.org/abs/2007.15248), [here](https://arxiv.org/abs/1906.04887), [here](https://arxiv.org/abs/2101.06639), [here](https://arxiv.org/abs/2006.05624), and [here](https://www.sciencedirect.com/science/article/pii/S1047320321000134?casa_token=uL4_SoQQgKsAAAAA:CPGu3HeZVciBO5YEocTnziH7YVhbcGF0JCpB0JuJi2pqHmkaAKibhaVYe-3t07nxtpdem2lv)). Clearly, this dataset has been quite useful to machine learning researchers and practitioners for testing and comparing new methods. We believe an analogous dataset could be similarly useful to researchers with modest compute for developing and comparing methods that address label noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introducing Noisy Imagenette\n", "We introduce Noisy Imagenette, a version of Imagenette (and Imagewoof) with synthetically noisy labels at several levels: 1%, 5%, 25%, and 50% of labels incorrect. The Noisy Imagenette labels already come bundled with the Imagenette dataset:\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fastai.vision.all import *\n", "source = untar_data(URLs.IMAGENETTE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the regular labels for the Imagenette dataset are given by the image folder names, the noisy labels are provided in a separate CSV file, with columns for the image filename and the labels at each noise level:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | path | \n", "noisy_labels_0 | \n", "noisy_labels_1 | \n", "noisy_labels_5 | \n", "noisy_labels_25 | \n", "noisy_labels_50 | \n", "is_valid | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "train/n02979186/n02979186_9036.JPEG | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "False | \n", "
1 | \n", "train/n02979186/n02979186_11957.JPEG | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n03000684 | \n", "False | \n", "
2 | \n", "train/n02979186/n02979186_9715.JPEG | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n03417042 | \n", "n03000684 | \n", "False | \n", "
3 | \n", "train/n02979186/n02979186_21736.JPEG | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n03417042 | \n", "False | \n", "
4 | \n", "train/n02979186/ILSVRC2012_val_00046953.JPEG | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n02979186 | \n", "n03394916 | \n", "False | \n", "