{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The data block API" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai import *\n", "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular import *\n", "from fastai.text import *\n", "from fastai.vision import * \n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data block API lets you customize how to create a [`DataBunch`](/basic_data.html#DataBunch) by isolating the underlying parts of that process in separate blocks, mainly:\n", " 1. Where are the inputs and how to create them?\n", " 1. How to split the data into a training and validation set?\n", " 1. How to label the inputs?\n", " 1. What transforms to apply?\n", " 1. How to add a test set?\n", " 1. How to wrap in dataloaders and create the [`DataBunch`](/basic_data.html#DataBunch)?\n", " \n", "For each of those questions, you can have multiple possible blocks: your inputs might be in a folder, a csv file, a dataframe. You may want to split them randomly, by certain indexes or depending on the folder they are in. You can have your labels in your csv file or your dataframe, but they may also come from folders or a specific function of the input. You may or may not have data augmentation to deal with. Or a test set. Finally you have to set the arguments to put the data together in a [`DataBunch`](/basic_data.html#DataBunch) (batch size, collate function...).\n", "\n", "The data block API gets its name from the fact that you can mix and match each one of those blocks with the others, allowing you total flexibility to create your customized [`DataBunch`](/basic_data.html#DataBunch) for training. 
The factory methods of the various [`DataBunch`](/basic_data.html#DataBunch) are great for beginners but you can't always make your data fit in the tracks they require.\n", "\n", "As usual, we'll begin with end-to-end examples, then switch to the details of each of those parts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples of use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin with our traditional MNIST example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_TINY)\n", "tfms = get_transforms(do_flip=False)\n", "path.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/7'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/3')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(path/'train').ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In [`vision.data`](/vision.data.html#vision.data), we create an easy [`DataBunch`](/basic_data.html#DataBunch) suitable for classification by simply typing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=24)" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "This is aimed at data stored in ImageNet-style folders, with a train and a valid directory that each contain one subdirectory per class holding all the pictures of that class. There is also a test set containing unlabelled pictures. With the data block API, we can group everything together like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (ImageItemList.from_folder(path) #Where to find the data? -> in path and its subfolders\n", " .split_by_folder() #How to split in train/valid? -> use the folders\n", " .label_from_folder() #How to label? -> depending on the folder of the filenames\n", " .add_test_folder() #Optionally add a test set (here default name is test)\n", " .transform(tfms, size=64) #Data augmentation? -> use tfms with a size of 64\n", " .databunch()) #Finally? -> use the defaults for conversion to ImageDataBunch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch(3, figsize=(6,6), hide_axis=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((Image (3, 64, 64), Category 7), ['7', '3'])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.train_ds[0], data.test_ds.classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at another example from [`vision.data`](/vision.data.html#vision.data) with the planet dataset. This time, it's a multi-label classification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "planet = untar_data(URLs.PLANET_TINY)\n", "planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', sep = ' ', ds_tfms=planet_tfms)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data block API we can rewrite this as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (ImageItemList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')\n", " #Where to find the data? -> in planet 'train' folder\n", " .random_split_by_pct()\n", " #How to split in train/valid? -> randomly with the default 20% in valid\n", " .label_from_df(sep=' ')\n", " #How to label? -> use the csv file\n", " .transform(planet_tfms, size=128)\n", " #Data augmentation? 
-> use tfms with a size of 128\n", " .databunch()) \n", " #Finally -> use the defaults for conversion to databunch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch(rows=2, figsize=(9,7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data block API also allows you to get your data together in problems for which there is no direct [`ImageDataBunch`](/vision.data.html#ImageDataBunch) factory method. For a segmentation task, for instance, we can use it to quickly get a [`DataBunch`](/basic_data.html#DataBunch). Let's take the example of the [camvid dataset](http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/). The images are in an 'images' folder and their corresponding mask is in a 'labels' folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "camvid = untar_data(URLs.CAMVID_TINY)\n", "path_lbl = camvid/'labels'\n", "path_img = camvid/'images'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road...)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole',\n", " 'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',\n", " 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',\n", " 'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch(rows=2, figsize=(7,5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example for object detection. We use our tiny sample of the [COCO dataset](http://cocodataset.org/#home) here. 
There is a helper function in the library that reads the annotation file and returns the list of image names together with the list of labelled bboxes associated with each of them. We convert it to a dictionary that maps image names to their bboxes and then write the function that will give us the target for each image filename." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coco = untar_data(URLs.COCO_TINY)\n", "images, lbl_bbox = get_annotations(coco/'train.json')\n", "img2bbox = dict(zip(images, lbl_bbox))\n", "get_y_func = lambda o:img2bbox[o.name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This is needed because our images may have different numbers of bounding boxes, so within a batch we need to pad the targets to the largest number of bounding boxes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (ObjectItemList.from_folder(coco)\n", " #Where are the images? -> in coco\n", " .random_split_by_pct() \n", " #How to split in train/valid? -> randomly with the default 20% in valid\n", " .label_from_func(get_y_func)\n", " #How to find the labels? -> use get_y_func\n", " .transform(get_transforms(), tfm_y=True)\n", " #Data augmentation? -> Standard transforms with tfm_y=True\n", " .databunch(bs=16, collate_fn=bb_pad_collate)) \n", " #Finally we convert to a DataBunch and we use bb_pad_collate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But vision isn't the only application where the data block API works: it can also be used for text or tabular data. With our sample of the IMDB dataset (labelled texts in a csv file), here is how to get the data together for a language model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imdb = untar_data(URLs.IMDB_SAMPLE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_lm = (TextList.from_csv(imdb, 'texts.csv', cols='text')\n", " #Where are the inputs? Column 'text' of this csv\n", " .random_split_by_pct()\n", " #How to split it? Randomly with the default 20%\n", " .label_for_lm()\n", " #Label it for a language model\n", " .databunch())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idxtext
0xxbos xxmaj old xxmaj jane 's mannered tale seems very popular these days . i have lost count of the number of versions going around . xxmaj probably the reason is that her \" xxunk \" are our \" xxunk \" even at this late date . xxmaj this xxup tv mini - series gives it a mannered telling suitable to the novel . xxmaj xxunk , xxunk xxmaj emma
1directed the chilling and disturbing xxmaj capote 's book about the reasons that xxunk these kids to the crime ( xxmaj are they xxmaj natural xxmaj born xxmaj killers ? ) . xxmaj the crime scenes are very brutal and haunting because of the lack of senses and reasons for what we witnessed . xxmaj stunning black & white cinematography from xxmaj xxunk xxmaj hall , excellent country - road
2sisters get the idea of pushing xxmaj precious into the path of a drunken xxmaj hungarian count , xxunk the two gold - xxunk women into thinking he is one of the xxunk men in xxmaj europe . xxmaj but a case of mistaken identity makes the girls think the count is good - looking xxmaj ray xxmaj xxunk , who goes along with the scheme xxunk he has a
3no xxunk the first xxmaj azumi film was a commercial product ; it was an adaptation of a popular manga and had cast of young , attractive actors and certainly was n't lacking in the budget department . xxmaj yet it more than entertained for what it was , and i ca n't xxunk i enjoyed it immensely . \\n\\n \" xxmaj azumi 2 \" lacks just about everything that
4long flashback . xxmaj the xxunk of the brother and the sister , from a family of rich xxmaj xxunk oil owners , is brought to the xxunk by xxunk clothes , and xxunk cars that go at top speed in a xxunk landscape . xxmaj malone 's xxunk at the end of the movie is stunning : suit and xxunk , xxunk with a small xxunk : she 's
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data_lm.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a classification problem, we just have to change the way labelling is done. Here we use the column 'label' of our csv." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_clas = (TextList.from_csv(imdb, 'texts.csv', cols='text')\n", " .split_from_df(col='is_valid')\n", " .label_from_df(cols='label')\n", " .databunch())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
texttarget
xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \\n\\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmajnegative
xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up withpositive
xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of \" xxmaj at xxmaj the xxmaj movies \" in taking xxmaj steven xxmaj soderbergh to task . \\n\\n xxmaj it 's usually satisfying to watch a film director change his style / subject ,negative
xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \\n\\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' \" xxmaj la xxmaj xxunk , \"positive
xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphicspositive
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data_clas.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, for tabular data, we just have to pass the name of our categorical and continuous variables as an extra argument. We also add some [`PreProcessor`](/data_block.html#PreProcessor)s that are going to be applied to our data once the splitting and the labelling is done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adult = untar_data(URLs.ADULT_SAMPLE)\n", "df = pd.read_csv(adult/'adult.csv')\n", "dep_var = '>=50k'\n", "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']\n", "cont_names = ['education-num', 'hours-per-week', 'age', 'capital-loss', 'fnlwgt', 'capital-gain']\n", "procs = [FillMissing, Categorify, Normalize]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (TabularList.from_df(df, path=adult, cat_names=cat_names, cont_names=cont_names, procs=procs)\n", " .split_by_idx(valid_idx=range(800,1000))\n", " .label_from_df(cols=dep_var)\n", " .databunch())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
workclasseducationmarital-statusoccupationrelationshipracesexnative-countryeducation-num_naeducation-numhours-per-weekagecapital-lossfnlwgtcapital-gaintarget
Local-gov HS-grad Divorced Craft-repair Not-in-family White Male United-StatesFalse-0.4224-0.03560.03034.4430-0.9781-0.14590
Self-emp-not-inc 10th Married-civ-spouse Craft-repair Husband White Male United-StatesFalse-1.5958-2.62762.3758-0.21640.4623-0.14590
Private HS-grad Divorced Transport-moving Not-in-family White Male United-StatesFalse-0.4224-0.03560.6899-0.2164-0.4378-0.14590
Private Bachelors Married-civ-spouse Prof-specialty Husband White Male United-StatesFalse1.14220.2884-1.0692-0.21641.6128-0.14591
Self-emp-not-inc Some-college Never-married Other-service Own-child White Male United-StatesFalse-0.0312-0.8456-1.2891-0.21641.2244-0.14590
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Provide inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic class that holds your inputs is the following one. The same class is also used to contain all of your labels (hence the generic name [`ItemList`](/data_block.html#ItemList))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class ItemList[source]

\n", "\n", "> ItemList(`items`:`Iterator`, `path`:`PathOrStr`=`'.'`, `label_cls`:`Callable`=`None`, `xtra`:`Any`=`None`, `processor`:[`PreProcessor`](/data_block.html#PreProcessor)=`None`, `x`:`ItemList`=`None`, `kwargs`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList, title_level=3, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "This class groups the inputs for our model in `items` and saves a `path` attribute, which is where it will look for any files (image files, csv file with labels...). `create_func` is applied to `items` to get the final output. `label_cls` will be called to create the labels from the result of the label function, `xtra` contains additional information (usually an underlying dataframe) and `processor` is to be applied to the inputs after the splitting and labelling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It has multiple subclasses depending on the type of data you're handling. 
Here is a quick list:\n", " - [`CategoryList`](/data_block.html#CategoryList) for labels in classification\n", " - [`MultiCategoryList`](/data_block.html#MultiCategoryList) for labels in a multi classification problem\n", " - [`FloatList`](/data_block.html#FloatList) for float labels in a regression problem\n", " - [`ImageItemList`](/vision.data.html#ImageItemList) for data that are images\n", " - [`SegmentationItemList`](/vision.data.html#SegmentationItemList) like [`ImageItemList`](/vision.data.html#ImageItemList) but will default labels to [`SegmentationLabelList`](/vision.data.html#SegmentationLabelList)\n", " - [`SegmentationLabelList`](/vision.data.html#SegmentationLabelList) for segmentation masks\n", " - [`ObjectItemList`](/vision.data.html#ObjectItemList) like [`ImageItemList`](/vision.data.html#ImageItemList) but will default labels to `ObjectLabelList`\n", " - `ObjectLabelList` for object detection\n", " - [`PointsItemList`](/vision.data.html#PointsItemList) for points (of the type [`ImagePoints`](/vision.image.html#ImagePoints))\n", " - [`TextList`](/text.data.html#TextList) for text data\n", " - [`TextFilesList`](/text.data.html#TextFilesList) for text data stored in files\n", " - [`TabularList`](/tabular.data.html#TabularList) for tabular data\n", " - [`CollabList`](/collab.html#CollabList) for collaborative filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have selected the class that is suitable, you can instantiate it with one of the following factory methods" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_folder[source]

\n", "\n", "> from_folder(`path`:`PathOrStr`, `extensions`:`StrList`=`None`, `recurse`=`True`, `kwargs`) → `ItemList`\n", "\n", "Get the list of files in `path` that have a suffix in `extensions`. `recurse` determines if we search subfolders. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.from_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df[source]

\n", "\n", "> from_df(`df`:`DataFrame`, `path`:`PathOrStr`=`'.'`, `cols`:`Union`\\[`int`, `Collection`\\[`int`\\], `str`, `StrList`\\]=`0`, `kwargs`) → `ItemList`\n", "\n", "Create an [`ItemList`](/data_block.html#ItemList) in `path` from the inputs in the `cols` of `df`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.from_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_csv[source]

\n", "\n", "> from_csv(`path`:`PathOrStr`, `csv_name`:`str`, `cols`:`Union`\\[`int`, `Collection`\\[`int`\\], `str`, `StrList`\\]=`0`, `header`:`str`=`'infer'`, `kwargs`) → `ItemList`\n", "\n", "Create an [`ItemList`](/data_block.html#ItemList) in `path` from the inputs in the `cols` of `path/csv_name` opened with `header`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.from_csv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional step: filter your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The factory method may have grabbed too many items. For instance, if you were searching sub folders with the `from_folder` method, you may have gotten files you don't want. To remove those, you can use one of the following methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

filter_by_func[source]

\n", "\n", "> filter_by_func(`func`:`Callable`) → `ItemList`\n", "\n", "Only keeps elements for which `func` returns `True`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.filter_by_func)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

filter_by_folder[source]

\n", "\n", "> filter_by_folder(`include`=`None`, `exclude`=`None`)\n", "\n", "Only keep filenames in `include` folder or reject the ones in `exclude`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.filter_by_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

filter_by_rand[source]

\n", "\n", "> filter_by_rand(`p`:`float`, `seed`:`int`=`None`)\n", "\n", "Keep random sample of `items` with probability `p` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.filter_by_rand)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

to_text[source]

\n", "\n", "> to_text(`fn`:`str`)\n", "\n", "Save `self.items` to `fn` in `self.path` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.to_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing your own [`ItemList`](/data_block.html#ItemList)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First check whether you can easily customize one of the existing subclasses by:\n", "- subclassing an existing one and replacing the `get` method (or the `open` method if you're dealing with images)\n", "- applying a custom `processor` (see step 4)\n", "- changing the default `label_cls` for the label creation\n", "- adding a default [`PreProcessor`](/data_block.html#PreProcessor) with the `_processor` class variable\n", "\n", "If none of these is enough and you really need to write your own class, there is a [full tutorial](/tutorial.itemlist) that explains how to proceed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

analyze_pred[source]

\n", "\n", "> analyze_pred(`pred`:`Tensor`)\n", "\n", "Called on `pred` before `reconstruct` for additional preprocessing. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.analyze_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source]

\n", "\n", "> reconstruct(`t`:`Tensor`, `x`:`Tensor`=`None`)\n", "\n", "Reconstruct one of the underlying items from its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.reconstruct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Split the data between the training and the validation set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step is normally straightforward: you just have to pick one of the following functions depending on what you need." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

random_split_by_pct[source]

\n", "\n", "> random_split_by_pct(`valid_pct`:`float`=`0.2`, `seed`:`int`=`None`) → `ItemLists`\n", "\n", "Split the items randomly by putting `valid_pct` in the validation set. Set the `seed` in numpy if passed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.random_split_by_pct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_files[source]

\n", "\n", "> split_by_files(`valid_names`:`ItemList`) → `ItemLists`\n", "\n", "Split the data by using the names in `valid_names` for validation. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_files)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_fname_file[source]

\n", "\n", "> split_by_fname_file(`fname`:`PathOrStr`, `path`:`PathOrStr`=`None`) → `ItemLists`\n", "\n", "Split the data by using the file names in `fname` for the validation set. `path` will override `self.path`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_fname_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_folder[source]

\n", "\n", "> split_by_folder(`train`:`str`=`'train'`, `valid`:`str`=`'valid'`) → `ItemLists`\n", "\n", "Split the data depending on the folder (`train` or `valid`) in which the filenames are. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Note: This method looks at the folder immediately after `self.path` for `valid` and `train`.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_note(\"This method looks at the folder immediately after `self.path` for `valid` and `train`.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_idx[source]

\n", "\n", "> split_by_idx(`valid_idx`:`Collection`\\[`int`\\]) → `ItemLists`\n", "\n", "Split the data according to the indexes in `valid_idx`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_idx)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_idxs[source]

\n", "\n", "> split_by_idxs(`train_idx`, `valid_idx`)\n", "\n", "Split the data between `train_idx` and `valid_idx`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_idxs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_list[source]

\n", "\n", "> split_by_list(`train`, `valid`)\n", "\n", "Split the data between `train` and `valid`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_valid_func[source]

\n", "\n", "> split_by_valid_func(`func`:`Callable`) → `ItemLists`\n", "\n", "Split the data by result of `func` (which returns `True` for validation set) " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_valid_func)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_from_df[source]

\n", "\n", "> split_from_df(`col`:`Union`\\[`int`, `Collection`\\[`int`\\], `str`, `StrList`\\]=`2`)\n", "\n", "Split the data from the `col` in the dataframe in `self.xtra`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_from_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Warning: This method assumes the data has been created from a csv file or a dataframe.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_warn(\"This method assumes the data has been created from a csv file or a dataframe.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Label the inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To label your inputs, use one of the following functions. Note that even if it's not in the documented arguments, you can always pass a `label_cls` that will be used to create those labels (the default is the one from your input [`ItemList`](/data_block.html#ItemList), and if there is none, it will go to [`CategoryList`](/data_block.html#CategoryList), [`MultiCategoryList`](/data_block.html#MultiCategoryList) or [`FloatList`](/data_block.html#FloatList) depending on the type of the labels).\n", "\n", "The first example in these docs created labels as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = untar_data(URLs.MNIST_TINY)\n", "ll = ImageItemList.from_folder(path).split_by_folder().label_from_folder().train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to save the data necessary to recreate your [`LabelList`](/data_block.html#LabelList) (not including saving the actual image/text/etc files), you can use `to_df` or `to_csv`:\n", "\n", "```python\n", "ll.train.to_csv('tmp.csv')\n", "```\n", "\n", "Or just grab a `pd.DataFrame` directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
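<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>x</th>\n", "      <th>y</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>train/7/7994.png</td>\n", "      <td>7</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>train/7/8437.png</td>\n", "      <td>7</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>train/7/9767.png</td>\n", "      <td>7</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>train/7/7236.png</td>\n", "      <td>7</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>train/7/9445.png</td>\n", "      <td>7</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>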
" ], "text/plain": [ " x y\n", "0 train/7/7994.png 7\n", "1 train/7/8437.png 7\n", "2 train/7/9767.png 7\n", "3 train/7/7236.png 7\n", "4 train/7/9445.png 7" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ll.to_df().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_empty[source]

\n", "\n", "> label_empty()\n", "\n", "Label every item with an [`EmptyLabel`](/core.html#EmptyLabel). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_empty)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_list[source]

\n", "\n", "> label_from_list(`labels`:`Iterator`, `kwargs`) → `LabelList`\n", "\n", "Label `self.items` with `labels` using `label_cls` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_df[source]

\n", "\n", "> label_from_df(`cols`:`Union`\\[`int`, `Collection`\\[`int`\\], `str`, `StrList`\\]=`1`, `kwargs`)\n", "\n", "Label `self.items` from the values in `cols` in `self.xtra`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Warning: This method assumes the data has been created from a csv file or a dataframe.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_warn(\"This method assumes the data has been created from a csv file or a dataframe.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_const[source]

\n", "\n", "> label_const(`const`:`Any`=`0`, `kwargs`) → `LabelList`\n", "\n", "Label every item with `const`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_const)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_folder[source]

\n", "\n", "> label_from_folder(`kwargs`) → `LabelList`\n", "\n", "Give a label to each filename depending on its folder. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Note: This method looks at the last subfolder in the path to determine the classes.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_note(\"This method looks at the last subfolder in the path to determine the classes.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_func[source]

\n", "\n", "> label_from_func(`func`:`Callable`, `kwargs`) → `LabelList`\n", "\n", "Apply `func` to every input to get its label. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_func)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_re[source]

\n", "\n", "> label_from_re(`pat`:`str`, `full_path`:`bool`=`False`, `kwargs`) → `LabelList`\n", "\n", "Apply the re in `pat` to determine the label of every filename. If `full_path`, search in the full name. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_re)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CategoryList[source]

\n", "\n", "> CategoryList(`items`:`Iterator`, `classes`:`Collection`=`None`, `kwargs`) :: [`CategoryListBase`](/data_block.html#CategoryListBase)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ItemList`](/data_block.html#ItemList) suitable for storing labels in `items` belonging to `classes`. If `classes` is `None`, it is determined from the unique labels in `items`. `processor` defaults to [`CategoryProcessor`](/data_block.html#CategoryProcessor)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class MultiCategoryList[source]

\n", "\n", "> MultiCategoryList(`items`:`Iterator`, `classes`:`Collection`=`None`, `sep`:`str`=`None`, `kwargs`) :: [`CategoryListBase`](/data_block.html#CategoryListBase)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ItemList`](/data_block.html#ItemList) suitable for storing lists of labels in `items` belonging to `classes`. If `classes` is `None`, it is determined from the unique labels in `items`. `sep` is used to split the content of `items` into a list of labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class FloatList[source]

\n", "\n", "> FloatList(`items`:`Iterator`, `log`:`bool`=`False`, `kwargs`) :: [`ItemList`](/data_block.html#ItemList)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ItemList`](/data_block.html#ItemList) suitable for storing the floats in `items` for regression. If `log` is `True`, the log of each target is taken." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Invisible step: preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This isn't seen here in the API, but if you passed a `processor` (or a list of them) in your initial [`ItemList`](/data_block.html#ItemList) during step 1, it will be applied here. If you didn't pass any processor, a list of them might still be created depending on what is in the `_processor` variable of your class of items (this can be a list of [`PreProcessor`](/data_block.html#PreProcessor) classes).\n", "\n", "A processor is a transformation that is applied to all the inputs once at initialization, with a state computed on the training set that is then applied without modification on the validation set (and maybe the test set). For instance, it could tokenize texts and then numericalize them. In that case we want the validation set to be numericalized with exactly the same vocabulary as the training set.\n", "\n", "Another example is in tabular data, where we fill missing values with (for instance) the median computed on the training set. That statistic is stored in the inner state of the [`PreProcessor`](/data_block.html#PreProcessor) and applied on the validation set.\n", "\n", "This is the generic class for all processors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class PreProcessor[source]

\n", "\n", "> PreProcessor(`ds`:`Collection`=`None`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(PreProcessor, title_level=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process_one[source]

\n", "\n", "> process_one(`item`:`Any`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(PreProcessor.process_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Process one `item`. This method needs to be written in any subclass." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process[source]

\n", "\n", "> process(`ds`:`Collection`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(PreProcessor.process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Process a dataset. This defaults to applying `process_one` on every `item` of `ds`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CategoryProcessor[source]

\n", "\n", "> CategoryProcessor(`ds`:[`ItemList`](/data_block.html#ItemList)) :: [`PreProcessor`](/data_block.html#PreProcessor)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`PreProcessor`](/data_block.html#PreProcessor) that will convert labels to codes using `classes` (if passed) in a single-label classification problem." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

generate_classes[source]

\n", "\n", "> generate_classes(`items`)\n", "\n", "Generate classes from `items` by taking the sorted unique values. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.generate_classes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class MultiCategoryProcessor[source]

\n", "\n", "> MultiCategoryProcessor(`ds`:[`ItemList`](/data_block.html#ItemList)) :: [`CategoryProcessor`](/data_block.html#CategoryProcessor)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryProcessor, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`PreProcessor`](/data_block.html#PreProcessor) that will convert labels to codes using `classes` (if passed) in a multi-label classification problem." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

generate_classes[source]

\n", "\n", "> generate_classes(`items`)\n", "\n", "Generate classes from `items` by taking the sorted unique values. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryProcessor.generate_classes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add transforms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transforms differ from processors in the sense they are applied on the fly when we grab one item. They also may change each time we ask for the same item in the case of random transforms." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform[source]

\n", "\n", "> transform(`tfms`:`Optional`\\[`Tuple`\\[`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], `Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]\\]\\]=`(None, None)`, `kwargs`)\n", "\n", "Set `tfms` to be applied to the xs of the train and validation set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.transform)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is primarily for the vision application. The `kwargs` are the ones expected by the type of transforms you pass. `tfm_y` is among them and, if set to `True`, the transforms will be applied to both input and target." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add a test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add a test set, you can use one of the two following methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

add_test[source]

\n", "\n", "> add_test(`items`:`Iterator`, `label`:`Any`=`None`)\n", "\n", "Add test set containing items from `items` and an arbitrary `label` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.add_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Note: Here `items` can be an `ItemList` or a collection.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_note(\"Here `items` can be an `ItemList` or a collection.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

add_test_folder[source]

\n", "\n", "> add_test_folder(`test_folder`:`str`=`'test'`, `label`:`Any`=`None`)\n", "\n", "Add test set containing items from folder `test_folder` and an arbitrary `label`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.add_test_folder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: convert to a [`DataBunch`](/basic_data.html#DataBunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This last step is usually pretty straightforward. You just have to include all the arguments we pass to [`DataBunch.create`](/basic_data.html#DataBunch.create) (`bs`, `num_workers`, `collate_fn`). The class called to create a [`DataBunch`](/basic_data.html#DataBunch) is set in the `_bunch` attribute of the inputs of the training set if you need to modify it. Normally, the various subclasses we showed before handle that for you." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

databunch[source]

\n", "\n", "> databunch(`path`:`PathOrStr`=`None`, `kwargs`) → `ImageDataBunch`\n", "\n", "Create a [`DataBunch`](/basic_data.html#DataBunch) from self, `path` will override `self.path`, `kwargs` are passed to [`DataBunch.create`](/basic_data.html#DataBunch.create). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.databunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inner classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class LabelList[source]

\n", "\n", "> LabelList(`x`:[`ItemList`](/data_block.html#ItemList), `y`:[`ItemList`](/data_block.html#ItemList), `tfms`:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]=`None`, `tfm_y`:`bool`=`False`, `kwargs`) :: [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList, title_level=3, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic dataset in fastai. Inputs are in `x`, targets in `y`. Optionally apply `tfms` to `x` and also `y` if `tfm_y` is `True`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

export[source]

\n", "\n", "> export(`fn`:`PathOrStr`)\n", "\n", "Export the minimal state and save it in `fn` to load an empty version for inference. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.export)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform_y[source]

\n", "\n", "> transform_y(`tfms`:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]=`None`, `kwargs`)\n", "\n", "Set `tfms` to be applied to the targets only. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.transform_y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

load_empty[source]

\n", "\n", "> load_empty(`fn`:`PathOrStr`, `tfms`:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]=`None`, `tfm_y`:`bool`=`False`, `kwargs`)\n", "\n", "Load the state in `fn` to create an empty [`LabelList`](/data_block.html#LabelList) for inference. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.load_empty)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_lists[source]

\n", "\n", "> from_lists(`path`:`PathOrStr`, `inputs`, `labels`) → `LabelList`\n", "\n", "Create a [`LabelList`](/data_block.html#LabelList) in `path` with `inputs` and `labels`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.from_lists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

to_df[source]

\n", "\n", "> to_df()\n", "\n", "Create `pd.DataFrame` containing `items` from `self.x` and `self.y` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.to_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

to_csv[source]

\n", "\n", "> to_csv(`dest`:`str`)\n", "\n", "Save `self.to_df()` to a CSV file in `self.path`/`dest` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.to_csv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class ItemLists[source]

\n", "\n", "> ItemLists(`path`:`PathOrStr`, `train`:[`ItemList`](/data_block.html#ItemList), `valid`:[`ItemList`](/data_block.html#ItemList), `test`:[`ItemList`](/data_block.html#ItemList)=`None`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists, doc_string=False, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data in `path` split between several streams of inputs: `train`, `valid` and maybe `test`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_lists[source]

\n", "\n", "> label_from_lists(`train_labels`:`Iterator`, `valid_labels`:`Iterator`, `label_cls`:`Callable`=`None`, `kwargs`) → `LabelList`\n", "\n", "Use the labels in `train_labels` and `valid_labels` to label the data. `label_cls` will overwrite the default. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists.label_from_lists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class LabelLists[source]

\n", "\n", "> LabelLists(`path`:`PathOrStr`, `train`:[`ItemList`](/data_block.html#ItemList), `valid`:[`ItemList`](/data_block.html#ItemList), `test`:[`ItemList`](/data_block.html#ItemList)=`None`) :: [`ItemLists`](/data_block.html#ItemLists)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists, title_level=3, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helper functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_files[source]

\n", "\n", "> get_files(`path`:`PathOrStr`, `extensions`:`StrList`=`None`, `recurse`:`bool`=`False`) → `FilePathList`\n", "\n", "Return list of files in `path` that have a suffix in `extensions`. `recurse` determines if we search subfolders. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(get_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source]

\n", "\n", "> get(`i`) → `Any`" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source]

\n", "\n", "> new(`items`, `classes`=`None`, `kwargs`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get_processors[source]

\n", "\n", "> get_processors()" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.get_processors)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

from_lists[source]

\n", "\n", "> from_lists(`path`:`PathOrStr`, `inputs`, `labels`) → `LabelList`\n", "\n", "Create a [`LabelList`](/data_block.html#LabelList) in `path` with `inputs` and `labels`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.from_lists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

set_item[source]

\n", "\n", "> set_item(`item`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.set_item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source]

\n", "\n", "> new(`x`, `y`, `kwargs`) → `LabelList`" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source]

\n", "\n", "> get(`i`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

predict[source]

\n", "\n", "> predict(`res`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.predict)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source]

\n", "\n", "> new(`items`:`Iterator`, `processor`:[`PreProcessor`](/data_block.html#PreProcessor)=`None`, `kwargs`) → `ItemList`" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

clear_item[source]

\n", "\n", "> clear_item()" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.clear_item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one[source]

\n", "\n", "> process_one(`item`, `processor`=`None`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.process_one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process[source]

\n", "\n", "> process(`processor`=`None`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.process)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process[source]

\n", "\n", "> process()" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.process)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

transform[source]

\n", "\n", "> transform(`tfms`:`Optional`\\[`Tuple`\\[`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], `Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]\\]\\]=`(None, None)`, `kwargs`)\n", "\n", "Set `tfms` to be applied to the xs of the train and validation set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists.transform)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process[source]

\n", "\n", "> process(`xp`=`None`, `yp`=`None`, `filter_missing_y`:`bool`=`False`)\n", "\n", "Launch the preprocessing on `xp` and `yp`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.process)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

transform[source]

\n", "\n", "> transform(`tfms`:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], `tfm_y`:`bool`=`None`, `kwargs`)\n", "\n", "Set the `tfms` and `tfm_y` value to be applied to the inputs and targets. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.transform)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one[source]

\n", "\n", "> process_one(`item`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryProcessor.process_one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source]

\n", "\n", "> get(`i`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one[source]

\n", "\n", "> process_one(`item`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.process_one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

create_classes[source]

\n", "\n", "> create_classes(`classes`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.create_classes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process[source]

\n", "\n", "> process(`ds`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.process)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source]

\n", "\n", "> get(`i`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source]

\n", "\n", "> new(`items`, `kwargs`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get_label_cls[source]

\n", "\n", "> get_label_cls(`labels`, `label_cls`:`Callable`=`None`, `sep`:`str`=`None`, `kwargs`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.get_label_cls)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source]

\n", "\n", "> reconstruct(`t`)\n", "\n", "Reconstruct one of the underlying items from its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

analyze_pred[source]

\n", "\n", "> analyze_pred(`pred`, `thresh`:`float`=`0.5`)\n", "\n", "Called on `pred` before `reconstruct` for additional preprocessing. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList.analyze_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source]

\n", "\n", "> reconstruct(`t`)\n", "\n", "Reconstruct one of the underlying items from its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source]

\n", "\n", "> reconstruct(`t`)\n", "\n", "Reconstruct one of the underlying items from its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

transform_y[source]

\n", "\n", "> transform_y(`tfms`:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]=`None`, `kwargs`)\n", "\n", "Set `tfms` to be applied to the targets only. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.transform_y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

analyze_pred[source]

\n", "\n", "> analyze_pred(`pred`, `thresh`:`float`=`0.5`)\n", "\n", "Called on `pred` before `reconstruct` for additional preprocessing. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.analyze_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "The data block API", "title": "data_block" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }