{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Image Cleaner Widget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "fastai offers several widgets to support the workflow of a deep learning practitioner. The purpose of the widgets is to help you organize, clean, and prepare your data for your model. Widgets are separated by data type." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.vision import *\n", "from fastai.widgets import DatasetFormatter, ImageCleaner, ImageDownloader, download_google_images\n", "from fastai.gen_doc.nbdoc import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "data = ImageDataBunch.from_folder(path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = cnn_learner(data, models.resnet18, metrics=error_rate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Total time: 00:17

class DatasetFormatter

> DatasetFormatter()

\n", "\n", "Returns a dataset with the appropriate format and file indices to be displayed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [`DatasetFormatter`](/widgets.image_cleaner.html#DatasetFormatter) class prepares your image dataset for widgets by returning a formatted [`DatasetTfm`](/vision.data.html#DatasetTfm) based on the [`DatasetType`](/basic_data.html#DatasetType) specified. Use `from_toplosses` to grab the most problematic images directly from your learner. Optionally, you can restrict the formatted dataset returned to `n_imgs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> from_similars(**`learn`**, **`layer_ls`**:`list`=***`[0, 7, 2]`***, **\\*\\*`kwargs`**)

\n", "\n", "Gets the indices for the most similar images. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.from_similars)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.widgets.image_cleaner import * " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> from_toplosses(**`learn`**, **`n_imgs`**=***`None`***, **\\*\\*`kwargs`**)

\n", "\n", "Gets indices with top losses. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.from_toplosses)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class ImageCleaner

> ImageCleaner(**`dataset`**, **`fns_idxs`**, **`path`**, **`batch_size`**:`int`=***`5`***, **`duplicates`**=***`False`***)

\n", "\n", "Displays images for relabeling or deletion and saves changes in `path` as 'cleaned.csv'. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) is for cleaning up images that don't belong in your dataset. It renders images in a row and lets you flag each one for deletion or relabeling. To use [`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) we must first use `DatasetFormatter().from_toplosses` to get the suggested indices for misclassified images." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds, idxs = DatasetFormatter().from_toplosses(learn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c1251de0d9ba41ccb63674dd40c91599", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(VBox(children=(Image(value=b'\\xff\\xd8\\xff\\xe0\\x00\\x10JFIF\\x00\\x01\\x01\\x01\\x00d\\x00d\\x00\\x00\\xff…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1d834d97a30046518493bf9c08f1ff0e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Button(button_style='primary', description='Next Batch', layout=Layout(width='auto'), style=ButtonStyle())" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageCleaner(ds, idxs, path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) does not change anything on disk (neither the labels nor the existence of images). 
Instead, it creates a 'cleaned.csv' file in your data path, from which you need to load your new databunch for the changes to be applied. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(path/'cleaned.csv', header='infer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We create a databunch from our csv. We include the data in the training set and we don't use a validation set (DatasetFormatter uses only the training set)\n", "np.random.seed(42)\n", "db = (ImageList.from_df(df, path)\n", " .split_none()\n", " .label_from_df()\n", " .databunch(bs=64))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = cnn_learner(db, models.resnet18, metrics=error_rate)\n", "learn = learn.load('stage-1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then use [`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) again to find duplicates in the dataset. To do this, specify `duplicates=True` when calling `ImageCleaner` after getting the indices and dataset from `.from_similars`. Note that if you are using a layer's output which has dimensions (n_batches, n_features, 1, 1) then you don't need any pooling (this is the case with the last layer). The suggested use of `.from_similars()` with resnets is to use the last layer and no pooling, as in the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Getting activations...\n" ] }, { "data": { "text/html": [ "\n", "
class ImageDownloader

> ImageDownloader(**`path`**:`PathOrStr`=***`'data'`***)

\n", "\n", "Displays a widget that allows searching and downloading images from Google Images search in a Jupyter Notebook or Lab. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageDownloader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [`ImageDownloader`](/widgets.image_downloader.html#ImageDownloader) widget gives you a way to quickly bootstrap your image dataset without leaving the notebook. It searches and downloads images that match the search criteria and resolution / quality requirements and stores them on your filesystem within the provided `path`.\n", "\n", "Images for each search query (or label) are stored in a separate folder within `path`. For example, if you populate `tiger` with a `path` set to `./data`, you'll get a folder `./data/tiger/` with the tiger images in it.\n", "\n", "[`ImageDownloader`](/widgets.image_downloader.html#ImageDownloader) will automatically clean up and verify the downloaded images with [`verify_images()`](/vision.data.html#verify_images) after downloading them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0f3f11652e204b68b8de634aa6ec1484", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HBox(children=(Text(value='', placeholder='What images to search for?'), BoundedIntText(value=1…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = Config.data_path()/'image_downloader'\n", "os.makedirs(path, exist_ok=True)\n", "ImageDownloader(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Downloading images in Python scripts outside Jupyter notebooks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
> download_google_images(**`path`**:`PathOrStr`, **`search_term`**:`str`, **`size`**:`str`=***`'>400*300'`***, **`n_images`**:`int`=***`10`***, **`format`**:`str`=***`'jpg'`***, **`max_workers`**:`int`=***`4`***, **`timeout`**:`int`=***`4`***) → `FilePathList`

\n", "\n", "Search for `n_images` images on Google, matching `search_term` and `size` requirements, download them into `path`/`search_term` and verify them, using `max_workers` threads. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(download_google_images)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After populating images with [`ImageDownloader`](/widgets.image_downloader.html#ImageDownloader), you can get an [`ImageDataBunch`](/vision.data.html#ImageDataBunch) by calling `ImageDataBunch.from_folder(path, size=size)`, or by using the data block API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
> make_dropdown_widget(**`description`**=***`'Description'`***, **`options`**=***`['Label 1', 'Label 2']`***, **`value`**=***`'Label 1'`***, **`file_path`**=***`None`***, **`layout`**=***`Layout()`***, **`handler`**=***`None`***)

\n", "\n", "Return a Dropdown widget with specified `handler`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.make_dropdown_widget)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> next_batch(**`_`**)

\n", "\n", "Handler for 'Next Batch' button click. Deletes all flagged images and renders the next batch. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.next_batch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> sort_idxs(**`similarities`**)

\n", "\n", "Sorts `similarities` and returns the indexes in pairs ordered by highest similarity. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.sort_idxs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> make_vertical_box(**`children`**, **`layout`**=***`Layout()`***, **`duplicates`**=***`False`***)

\n", "\n", "Make a vertical box with [`children`](/torch_core.html#children) and `layout`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.make_vertical_box)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> relabel(**`change`**)

\n", "\n", "Relabel images by moving from parent dir with old label `class_old` to parent dir with new label `class_new`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.relabel)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> largest_indices(**`arr`**, **`n`**)

\n", "\n", "Returns the `n` largest indices from a numpy array `arr`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.largest_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> delete_image(**`file_path`**)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.delete_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> empty()

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.empty)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> empty_batch()

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.empty_batch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> comb_similarity(**`t1`**:`Tensor`, **`t2`**:`Tensor`, **\\*\\*`kwargs`**)

\n", "\n", "Computes the similarity function between each embedding of `t1` and `t2` matrices. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.comb_similarity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> get_widgets(**`duplicates`**)

\n", "\n", "Create and format widget set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.get_widgets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> write_csv()

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.write_csv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> create_image_list(**`dataset`**, **`fns_idxs`**)

\n", "\n", "Create a list of images, filenames and labels but first removing files that are not supposed to be displayed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.create_image_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> render()

\n", "\n", "Re-render Jupyter cell for batch of images. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.render)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> get_similars_idxs(**`learn`**, **`layer_ls`**, **\\*\\*`kwargs`**)

\n", "\n", "Gets the indices for the most similar images in `ds_type` dataset " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.get_similars_idxs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> on_delete(**`btn`**)

\n", "\n", "Flag this image as delete or keep. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.on_delete)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> make_button_widget(**`label`**, **`file_path`**=***`None`***, **`handler`**=***`None`***, **`style`**=***`None`***, **`layout`**=***`Layout(width='auto')`***)

\n", "\n", "Return a Button widget with specified `handler`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.make_button_widget)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> make_img_widget(**`img`**, **`layout`**=***`Layout()`***, **`format`**=***`'jpg'`***)

\n", "\n", "Returns an image widget for specified file name `img`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.make_img_widget)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> get_actns(**`learn`**, **`hook`**:[`Hook`](/callbacks.hooks.html#Hook), **`dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`pool`**=***`'AdaptiveConcatPool2d'`***, **`pool_dim`**:`int`=***`4`***, **\\*\\*`kwargs`**)

\n", "\n", "Gets activations at the layer specified by `hook`, applies `pool` of dim `pool_dim` and concatenates " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.get_actns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> batch_contains_deleted()

\n", "\n", "Check if current batch contains already deleted images. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.batch_contains_deleted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> make_horizontal_box(**`children`**, **`layout`**=***`Layout()`***)

\n", "\n", "Make a horizontal box with [`children`](/torch_core.html#children) and `layout`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ImageCleaner.make_horizontal_box)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> get_toplosses_idxs(**`learn`**, **`n_imgs`**, **\\*\\*`kwargs`**)

\n", "\n", "Sorts `ds_type` dataset by top losses and returns dataset and sorted indices. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.get_toplosses_idxs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "


> padded_ds(**`ll_input`**, **`size`**=***`(250, 300)`***, **`resize_method`**=***``***, **`padding_mode`**=***`'zeros'`***, **\\*\\*`kwargs`**)

\n", "\n", "For a LabelList `ll_input`, resize each image to `size` using `resize_method` and `padding_mode`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(DatasetFormatter.padded_ds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }