{ "cells": [ { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "## The data block API" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.basics import *\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data block API lets you customize the creation of a [`DataBunch`](/basic_data.html#DataBunch) by isolating the underlying parts of that process in separate blocks, mainly:\n", " 1. Where are the inputs and how to create them?\n", " 1. How to split the data into training and validation sets?\n", " 1. How to label the inputs?\n", " 1. What transforms to apply?\n", " 1. How to add a test set?\n", " 1. How to wrap in dataloaders and create the [`DataBunch`](/basic_data.html#DataBunch)?\n", " \n", "Each of these may be addressed with a specific block designed for your unique setup. Your inputs might be in a folder, a csv file, or a dataframe. You may want to split them randomly, by certain indices or depending on the folder they are in. You can have your labels in your csv file or your dataframe, but they may also come from folders or from a specific function of the input. You may choose to add data augmentation or not. A test set is optional too. Finally, you have to set the arguments to put the data together in a [`DataBunch`](/basic_data.html#DataBunch) (batch size, collate function...).\n", "\n", "The data block API is so named because you can mix and match each one of those blocks with the others, giving you total flexibility to create your customized [`DataBunch`](/basic_data.html#DataBunch) for training, validation and testing. 
The factory methods of the various [`DataBunch`](/basic_data.html#DataBunch) classes are great for beginners, but you can't always make your data fit the tracks they require.\n", "\n", "As usual, we'll begin with end-to-end examples, then switch to the details of each of those parts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples of use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin with our traditional MNIST example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_TINY)\n", "tfms = get_transforms(do_flip=False)\n", "path.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/3'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/7')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(path/'train').ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 
[`vision.data`](/vision.data.html#vision.data), we can create a [`DataBunch`](/basic_data.html#DataBunch) suitable for image classification by simply typing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=64)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a shortcut method which is aimed at data that is in folders following an ImageNet style, with the [`train`](/train.html#train) and `valid` directories, each containing one subdirectory per class, where all the labelled pictures are. There is also a `test` directory containing unlabelled pictures. \n", "\n", "Here is the same code, but this time using the data block API, which can work with any style of a dataset. All the stages, which will be explained below, can be grouped together like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (ImageList.from_folder(path) #Where to find the data? -> in path and its subfolders\n", " .split_by_folder() #How to split in train/valid? -> use the folders\n", " .label_from_folder() #How to label? -> depending on the folder of the filenames\n", " .add_test_folder() #Optionally add a test set (here default name is test)\n", " .transform(tfms, size=64) #Data augmentation? -> use tfms with a size of 64\n", " .databunch()) #Finally? -> use the defaults for conversion to ImageDataBunch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the created DataBunch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data.show_batch(3, figsize=(6,6), hide_axis=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at another example from [`vision.data`](/vision.data.html#vision.data) with the planet dataset. This time, it's a multiclassification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "planet = untar_data(URLs.PLANET_TINY)\n", "planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
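<table>\n",
"  <thead>\n",
"    <tr><th></th><th>image_name</th><th>tags</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>train_31112</td><td>clear primary</td></tr>\n",
"    <tr><th>1</th><td>train_4300</td><td>partly_cloudy primary water</td></tr>\n",
"    <tr><th>2</th><td>train_39539</td><td>clear primary water</td></tr>\n",
"    <tr><th>3</th><td>train_12498</td><td>agriculture clear primary road</td></tr>\n",
"    <tr><th>4</th><td>train_9320</td><td>clear primary</td></tr>\n",
"  </tbody>\n",
"</table>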
\n", "
" ], "text/plain": [ " image_name tags\n", "0 train_31112 clear primary\n", "1 train_4300 partly_cloudy primary water\n", "2 train_39539 clear primary water\n", "3 train_12498 agriculture clear primary road\n", "4 train_9320 clear primary" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(planet/\"labels.csv\").head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', label_delim = ' ', ds_tfms=planet_tfms)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data block API we can rewrite this like that:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/planet_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/planet_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/planet_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/planet_tiny/models')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planet.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
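<table>\n",
"  <thead>\n",
"    <tr><th></th><th>image_name</th><th>tags</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>train_31112</td><td>clear primary</td></tr>\n",
"    <tr><th>1</th><td>train_4300</td><td>partly_cloudy primary water</td></tr>\n",
"    <tr><th>2</th><td>train_39539</td><td>clear primary water</td></tr>\n",
"    <tr><th>3</th><td>train_12498</td><td>agriculture clear primary road</td></tr>\n",
"    <tr><th>4</th><td>train_9320</td><td>clear primary</td></tr>\n",
"  </tbody>\n",
"</table>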
\n", "
" ], "text/plain": [ " image_name tags\n", "0 train_31112 clear primary\n", "1 train_4300 partly_cloudy primary water\n", "2 train_39539 clear primary water\n", "3 train_12498 agriculture clear primary road\n", "4 train_9320 clear primary" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(planet/\"labels.csv\").head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (ImageList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')\n", " #Where to find the data? -> in planet 'train' folder\n", " .split_by_rand_pct()\n", " #How to split in train/valid? -> randomly with the default 20% in valid\n", " .label_from_df(label_delim=' ')\n", " #How to label? -> use the second column of the csv file and split the tags by ' '\n", " .transform(planet_tfms, size=128)\n", " #Data augmentation? -> use tfms with a size of 128\n", " .databunch()) \n", " #Finally -> use the defaults for conversion to databunch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data.show_batch(rows=2, figsize=(9,7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data block API also allows you to get your data together in problems for which there is no direct [`ImageDataBunch`](/vision.data.html#ImageDataBunch) factory method. For a segmentation task, for instance, we can use it to quickly get a [`DataBunch`](/basic_data.html#DataBunch). Let's take the example of the [camvid dataset](http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/). The images are in an 'images' folder and their corresponding mask is in a 'labels' folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "camvid = untar_data(URLs.CAMVID_TINY)\n", "path_lbl = camvid/'labels'\n", "path_img = camvid/'images'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road...)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole',\n", " 'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',\n", " 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',\n", " 'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype=' in path_img and its subfolders\n", " .split_by_rand_pct()\n", " #How to split in train/valid? -> randomly with the default 20% in valid\n", " .label_from_func(get_y_fn, classes=codes)\n", " #How to label? -> use the label function on the file name of the data\n", " .transform(get_transforms(), tfm_y=True, size=128)\n", " #Data augmentation? 
-> use tfms with a size of 128, also transform the label images\n", " .databunch())\n", " #Finally -> use the defaults for conversion to databunch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data.show_batch(rows=2, figsize=(7,5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example for object detection. We use our tiny sample of the [COCO dataset](http://cocodataset.org/#home) here. There is a helper function in the library that reads the annotation file and returns the list of images names with the list of labelled bboxes associated to it. We convert it to a dictionary that maps image names with their bboxes and then write the function that will give us the target for each image filename." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coco = untar_data(URLs.COCO_TINY)\n", "images, lbl_bbox = get_annotations(coco/'train.json')\n", "img2bbox = dict(zip(images, lbl_bbox))\n", "get_y_func = lambda o:img2bbox[o.name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This comes from the fact that our images may have multiple bounding boxes, so we need to pad them to the largest number of bounding boxes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (ObjectItemList.from_folder(coco)\n", " #Where are the images? -> in coco and its subfolders\n", " .split_by_rand_pct() \n", " #How to split in train/valid? -> randomly with the default 20% in valid\n", " .label_from_func(get_y_func)\n", " #How to find the labels? -> use get_y_func on the file name of the data\n", " .transform(get_transforms(), tfm_y=True)\n", " #Data augmentation? 
-> Standard transforms; also transform the label images\n", " .databunch(bs=16, collate_fn=bb_pad_collate)) \n", " #Finally we convert to a DataBunch, use a batch size of 16,\n", " # and we use bb_pad_collate to collate the data into a mini-batch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But vision isn't the only application where the data block API works. It can also be used for text and tabular data. With our sample of the IMDB dataset (labelled texts in a csv file), here is how to get the data together for a language model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.text import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imdb = untar_data(URLs.IMDB_SAMPLE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_lm = (TextList\n", " .from_csv(imdb, 'texts.csv', cols='text')\n", " #Where are the text? Column 'text' of texts.csv\n", " .split_by_rand_pct()\n", " #How to split it? Randomly with the default 20% in valid\n", " .label_for_lm()\n", " #Label it for a language model\n", " .databunch())\n", " #Finally we convert to a DataBunch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>\n",
"  <thead>\n",
"    <tr><th>idx</th><th>text</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><td>0</td><td>! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk ! xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is</td></tr>\n",
"    <tr><td>1</td><td>, co - billed with xxup the xxup xxunk xxup vampire . a xxmaj spanish - xxmaj italian co - production where a series of women in a village are being murdered around the same time a local count named xxmaj yanos xxmaj xxunk is seen on xxunk , riding off with his ' man - eating ' dog behind him . \\n \\n xxmaj the xxunk already suspect</td></tr>\n",
"    <tr><td>2</td><td>sad relic that is well worth seeing . xxbos i caught this on the dish last night . i liked the movie . i xxunk to xxmaj russia 3 different times ( xxunk our 2 kids ) . i ca n't put my finger on exactly why i liked this movie other than seeing \" bad \" turn \" good \" and \" good \" turn \" semi - bad</td></tr>\n",
"    <tr><td>3</td><td>pushed him along . xxmaj the story ( if it can be called that ) is so full of holes it 's almost funny , xxmaj it never really explains why the hell he survived in the first place , or needs human flesh in order to survive . xxmaj the script is poorly written and the dialogue xxunk on just plane stupid . xxmaj the climax to movie (</td></tr>\n",
"    <tr><td>4</td><td>the xxunk of the xxmaj xxunk xxmaj race and had the xxunk of some of those racist xxunk . xxmaj fortunately , nothing happened like the incident in the movie where the young xxmaj caucasian man went off and started shooting at a xxunk gathering . \\n \\n i can only hope and pray that nothing like that ever will happen . \\n \\n xxmaj so is \"</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data_lm.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a classification problem, we just have to change the way labeling is done. Here we use the csv column `label`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_clas = (TextList.from_csv(imdb, 'texts.csv', cols='text')\n", " .split_from_df(col='is_valid')\n", " .label_from_df(cols='label')\n", " .databunch())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>\n",
"  <thead>\n",
"    <tr><th>text</th><th>target</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><td>xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \\n \\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj</td><td>negative</td></tr>\n",
"    <tr><td>xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with</td><td>positive</td></tr>\n",
"    <tr><td>xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of \" xxmaj at xxmaj the xxmaj movies \" in taking xxmaj steven xxmaj soderbergh to task . \\n \\n xxmaj it 's usually satisfying to watch a film director change his style /</td><td>negative</td></tr>\n",
"    <tr><td>xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \\n \\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' \" xxmaj la xxmaj xxunk</td><td>positive</td></tr>\n",
"    <tr><td>xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics</td><td>positive</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data_clas.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, for tabular data, we just have to pass the name of our categorical and continuous variables as an extra argument. We also add some [`PreProcessor`](/data_block.html#PreProcessor)s that are going to be applied to our data once the splitting and labelling is done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.tabular import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adult = untar_data(URLs.ADULT_SAMPLE)\n", "df = pd.read_csv(adult/'adult.csv')\n", "dep_var = 'salary'\n", "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']\n", "cont_names = ['education-num', 'hours-per-week', 'age', 'capital-loss', 'fnlwgt', 'capital-gain']\n", "procs = [FillMissing, Categorify, Normalize]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = (TabularList.from_df(df, path=adult, cat_names=cat_names, cont_names=cont_names, procs=procs)\n", " .split_by_idx(valid_idx=range(800,1000))\n", " .label_from_df(cols=dep_var)\n", " .databunch())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
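<table>\n",
"  <thead>\n",
"    <tr><th>workclass</th><th>education</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>native-country</th><th>education-num_na</th><th>education-num</th><th>hours-per-week</th><th>age</th><th>capital-loss</th><th>fnlwgt</th><th>capital-gain</th><th>target</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><td>?</td><td>Doctorate</td><td>Married-civ-spouse</td><td>?</td><td>Husband</td><td>Amer-Indian-Eskimo</td><td>Male</td><td>United-States</td><td>False</td><td>2.3157</td><td>-0.0356</td><td>1.7161</td><td>-0.2164</td><td>-1.1496</td><td>-0.1459</td><td>&gt;=50k</td></tr>\n",
"    <tr><td>Private</td><td>Some-college</td><td>Never-married</td><td>Sales</td><td>Own-child</td><td>White</td><td>Male</td><td>United-States</td><td>False</td><td>-0.0312</td><td>-0.4406</td><td>-1.4357</td><td>-0.2164</td><td>-0.1893</td><td>-0.1459</td><td>&lt;50k</td></tr>\n",
"    <tr><td>Private</td><td>Some-college</td><td>Never-married</td><td>Protective-serv</td><td>Own-child</td><td>White</td><td>Male</td><td>United-States</td><td>False</td><td>-0.0312</td><td>-2.0606</td><td>-1.2891</td><td>-0.2164</td><td>1.1154</td><td>-0.1459</td><td>&lt;50k</td></tr>\n",
"    <tr><td>Private</td><td>HS-grad</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Wife</td><td>White</td><td>Female</td><td>Mexico</td><td>False</td><td>-0.4224</td><td>-0.0356</td><td>-0.7027</td><td>-0.2164</td><td>0.0779</td><td>-0.1459</td><td>&gt;=50k</td></tr>\n",
"    <tr><td>Private</td><td>HS-grad</td><td>Married-civ-spouse</td><td>Tech-support</td><td>Husband</td><td>White</td><td>Male</td><td>United-States</td><td>False</td><td>-0.4224</td><td>3.2043</td><td>-0.1163</td><td>-0.2164</td><td>-0.6858</td><td>-0.1459</td><td>&gt;=50k</td></tr>\n",
"  </tbody>\n",
"</table>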
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Provide inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic class to get your inputs into is the following one. It's also the same class that will contain all of your labels (hence the name [`ItemList`](/data_block.html#ItemList))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

<h3><code>class</code> <code>ItemList</code></h3>

\n", "\n", "> ItemList(**`items`**:`Iterator`\\[`T_co`\\], **`path`**:`PathOrStr`=***`'.'`***, **`label_cls`**:`Callable`=***`None`***, **`inner_df`**:`Any`=***`None`***, **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **`x`**:`ItemList`=***`None`***, **`ignore_empty`**:`bool`=***`False`***)\n", "\n", "
\n", "\n", "A collection of items with `__len__` and `__getitem__` with `ndarray` indexing semantics. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "This class regroups the inputs for our model in `items` and saves a `path` attribute which is where it will look for any files (image files, csv file with labels...). `label_cls` will be called to create the labels from the result of the label function, `inner_df` is an underlying dataframe, and `processor` is to be applied to the inputs after the splitting and labeling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It has multiple subclasses depending on the type of data you're handling. Here is a quick list:\n", " - [`CategoryList`](/data_block.html#CategoryList) for labels in classification\n", " - [`MultiCategoryList`](/data_block.html#MultiCategoryList) for labels in a multi classification problem\n", " - [`FloatList`](/data_block.html#FloatList) for float labels in a regression problem\n", " - [`ImageList`](/vision.data.html#ImageList) for data that are images\n", " - [`SegmentationItemList`](/vision.data.html#SegmentationItemList) like [`ImageList`](/vision.data.html#ImageList) but will default labels to [`SegmentationLabelList`](/vision.data.html#SegmentationLabelList)\n", " - [`SegmentationLabelList`](/vision.data.html#SegmentationLabelList) for segmentation masks\n", " - [`ObjectItemList`](/vision.data.html#ObjectItemList) like [`ImageList`](/vision.data.html#ImageList) but will default labels to `ObjectLabelList`\n", " - `ObjectLabelList` for object detection\n", " - [`PointsItemList`](/vision.data.html#PointsItemList) for points (of the type [`ImagePoints`](/vision.image.html#ImagePoints))\n", " - [`ImageImageList`](/vision.data.html#ImageImageList) for image to image tasks\n", " - [`TextList`](/text.data.html#TextList) for text 
data\n", " - [`TextList`](/text.data.html#TextList) for text data stored in files\n", " - [`TabularList`](/tabular.data.html#TabularList) for tabular data\n", " - [`CollabList`](/collab.html#CollabList) for collaborative filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get a little glimpse of how [`ItemList`](/data_block.html#ItemList)'s basic attributes and methods behave with the following code examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (3 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/history.csv,/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from fastai.vision import *\n", "path_data = untar_data(URLs.MNIST_TINY)\n", "il_data = ItemList.from_folder(path_data, extensions=['.csv'])\n", "il_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is how to access the path of [`ItemList`](/data_block.html#ItemList) and the actual `items` (here files) in the path." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/mnist_tiny')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data.path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv')], dtype=object)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data.items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`len(il_data)` gives you the count of files inside `il_data`, and you can access individual items using an index. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(il_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ItemList`](/data_block.html#ItemList) returns a single item when given a single index, but returns an [`ItemList`](/data_block.html#ItemList) when given a slice or a list of indexes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (1 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/labels.csv\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data[:1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `il_data.add` we can concatenate another [`ItemList`](/data_block.html#ItemList) object in place." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (6 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/history.csv,/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv,/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/history.csv\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data.add(il_data); il_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),\n", " 
PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (20 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png,/home/ubuntu/.fastai/data/mnist_tiny/test/5071.png,/home/ubuntu/.fastai/data/mnist_tiny/test/617.png,/home/ubuntu/.fastai/data/mnist_tiny/test/585.png,/home/ubuntu/.fastai/data/mnist_tiny/test/2032.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny/test" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itemlist = ItemList.from_folder(path_data/'test')\n", "itemlist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the files are not necessarily returned in alphanumeric order by default. In the above: 1503.png, ... 617.png, 585.png ...\n", "\n", "This is OK when you're always using the same machine, as the same dataset should be returned in the same order. But when you build a datablock on one machine (say GCP) and then port the same code to a different machine (say your laptop), that same dataset and code might return the files in a different order.\n", "\n", "Since all random operations use the loaded order of the dataset as the starting point, you will not be able to replicate any random operation, say randomly splitting the data into 80% train and 20% validation, even with correct seeding.\n", "\n", "The solution is to use `presort=True` in the `.from_folder()` method. As can be seen below, with that argument turned on, the files are returned in sorted order, and this behavior will match across machines and across platforms. Now you can reproduce any random operation you perform on the loaded data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (20 items)\n", "/home/user/.fastai/data/mnist_tiny/test/1503.png,/home/user/.fastai/data/mnist_tiny/test/1605.png,/home/user/.fastai/data/mnist_tiny/test/1883.png,/home/user/.fastai/data/mnist_tiny/test/2032.png,/home/user/.fastai/data/mnist_tiny/test/205.png\n", "Path: /home/user/.fastai/data/mnist_tiny/test" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itemlist = ItemList.from_folder(path_data/'test', presort=True)\n", "itemlist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How is the output above generated?\n", "\n", "Behind the scenes, executing `itemlist` calls [`ItemList.__repr__`](/data_block.html#ItemList.__repr__), which basically prints out `itemlist[0]` to `itemlist[4]`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itemlist[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And `itemlist[0]` basically calls `itemlist.get(0)`, which returns `itemlist.items[0]`. That's why we get output like the above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have selected the class that is suitable, you can instantiate it with one of the following factory methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_folder[source][test]

\n", "\n", "> from_folder(**`path`**:`PathOrStr`, **`extensions`**:`StrList`=***`None`***, **`recurse`**:`bool`=***`True`***, **`include`**:`OptStrList`=***`None`***, **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **`presort`**:`Optional`\\[`bool`\\]=***`False`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create an [`ItemList`](/data_block.html#ItemList) in `path` from the filenames that have a suffix in `extensions`. [`recurse`](/core.html#recurse) determines if we search subfolders. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.from_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_TINY)\n", "path.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (1428 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageList.from_folder(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`path` is your root data folder. The `path` directory contains the _train_ and _valid_ folders, which hold your images. In the example below, the _train_ folder contains two folders (one per class), _cat_ and _dog_.\n", "\n", "\"from_folder\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df[source][test]

\n", "\n", "> from_df(**`df`**:`DataFrame`, **`path`**:`PathOrStr`=***`'.'`***, **`cols`**:`IntsOrStrs`=***`0`***, **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create an [`ItemList`](/data_block.html#ItemList) in `path` from the inputs in the `cols` of `df`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.from_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe has two columns. The first column is the path to the image and the second column contains the label id for that image. If you have multiple labels (i.e., more than one label for a single image), the labels column holds a single space-separated string (the separator is determined by the `label_delim` argument of `label_from_df`).\n", "\n", "`from_df` and `from_csv` can be used in a more general way. If you cannot figure out another way to build your ImageList, it is easy to create a csv file in the above format.\n", "\n", "How to set `path`? `path` refers to your root data directory, so the paths in your csv file should be relative to `path` rather than absolute. In the example below, the image paths in _labels.csv_ resolve to __path + train/3/7463.png__." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_sample/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/item_list.txt'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/trained_model.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "path.ls()" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name label\n", "0 train/3/7463.png 0\n", "1 train/3/21102.png 0\n", "2 train/3/31559.png 0\n", "3 train/3/46882.png 0\n", "4 train/3/26209.png 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(path/'labels.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (14434 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageList.from_df(df, path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_csv[source][test]

\n", "\n", "> from_csv(**`path`**:`PathOrStr`, **`csv_name`**:`str`, **`cols`**:`IntsOrStrs`=***`0`***, **`delimiter`**:`str`=***`None`***, **`header`**:`str`=***`'infer'`***, **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create an [`ItemList`](/data_block.html#ItemList) in `path` from the inputs in the `cols` of `path/csv_name` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.from_csv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_sample/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/item_list.txt'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/trained_model.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_sample/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "path.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (14434 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageList.from_csv(path, 'labels.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional step: filter your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The factory method may have grabbed too many items. For instance, if you were searching sub folders with the `from_folder` method, you may have gotten files you don't want. To remove those, you can use one of the following methods." 
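
Conceptually, each of these filters keeps only the items that satisfy some condition. The following plain-Python sketch shows the idea without fastai; `keep_if` is a hypothetical helper and the file names are invented for illustration.

```python
from pathlib import Path

def keep_if(items, func):
    """Return only the items for which `func` returns True."""
    return [item for item in items if func(item)]

# Hypothetical result of a recursive folder search that grabbed too much.
found = [Path("train/3/7463.png"), Path("train/3/notes.txt"),
         Path("models/tmp.pth"), Path("valid/7/21.png")]

# Keep only PNG files.
pngs = keep_if(found, lambda p: p.suffix == ".png")

# Keep only files whose top-level folder is 'train' or 'valid'.
in_splits = keep_if(found, lambda p: p.parts[0] in ("train", "valid"))

print(pngs)
print(in_splits)
```

The methods documented next apply the same kind of predicate, folder or random selection directly to an `ItemList`.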
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

filter_by_func[source][test]

\n", "\n", "> filter_by_func(**`func`**:`Callable`) → `ItemList`\n", "\n", "
\n", "\n", "Only keep elements for which `func` returns `True`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.filter_by_func)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name label\n", "0 train/3/7463.png 0\n", "1 train/3/21102.png 0\n", "2 train/3/31559.png 0\n", "3 train/3/46882.png 0\n", "4 train/3/26209.png 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "df = pd.read_csv(path/'labels.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that you only want to keep images with a suffix \".png\". Well, this method will do magic for you." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'.png'" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Path(df.name[0]).suffix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (14434 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageList.from_df(df, path).filter_by_func(lambda fname: Path(fname).suffix == '.png')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

filter_by_folder[source][test]

\n", "\n", "> filter_by_folder(**`include`**=***`None`***, **`exclude`**=***`None`***)\n", "\n", "
\n", "\n", "Only keep filenames in `include` folder or reject the ones in `exclude`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.filter_by_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

filter_by_rand[source][test]

\n", "\n", "> filter_by_rand(**`p`**:`float`, **`seed`**:`int`=***`None`***)\n", "\n", "
\n", "\n", "Keep random sample of `items` with probability `p` and an optional `seed`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.filter_by_rand)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (7267 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "ImageList.from_folder(path).filter_by_rand(0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Contrast the number of items with the list created without the filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (14434 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageList.from_folder(path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

to_text[source][test]

\n", "\n", "> to_text(**`fn`**:`str`)\n", "\n", "
\n", "\n", "Save `self.items` to `fn` in `self.path`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.to_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name label\n", "0 train/3/7463.png 0\n", "1 train/3/21102.png 0\n", "2 train/3/31559.png 0\n", "3 train/3/46882.png 0\n", "4 train/3/26209.png 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "pd.read_csv(path/'labels.csv').head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file_name = \"item_list.txt\"\n", "ImageList.from_folder(path).to_text(file_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train/3/5736.png\r\n", "train/3/35272.png\r\n", "train/3/26596.png\r\n", "train/3/42120.png\r\n", "train/3/39675.png\r\n", "train/3/47881.png\r\n", "train/3/38241.png\r\n", "train/3/59054.png\r\n", "train/3/9932.png\r\n", "train/3/50184.png\r\n", "cat: write error: Broken pipe\r\n" ] } ], "source": [ "! cat {path/file_name} | head" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

use_partial_data[source][test]

\n", "\n", "> use_partial_data(**`sample_pct`**:`float`=***`0.01`***, **`seed`**:`int`=***`None`***) → `ItemList`\n", "\n", "
\n", "\n", "Use only a sample of `sample_pct` of the full dataset and an optional `seed`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.use_partial_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (7217 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "ImageList.from_folder(path).use_partial_data(0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Contrast the number of items with the list created without the filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (14434 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ImageList.from_folder(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing your own [`ItemList`](/data_block.html#ItemList)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First check whether you can easily customize one of the existing subclasses by:\n", "- subclassing an existing one and replacing the `get` method (or the `open` method if you're dealing with images)\n", "- applying a custom `processor` (see step 4)\n", "- changing the default `label_cls` for the label creation\n", "- adding a default [`PreProcessor`](/data_block.html#PreProcessor) with the `_processor` class variable\n", "\n", "If none of these is enough and you really need to write your own class, there is a [full tutorial](/tutorial.itemlist) that explains how to proceed."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

analyze_pred[source][test]

\n", "\n", "> analyze_pred(**`pred`**:`Tensor`)\n", "\n", "
\n", "\n", "Called on `pred` before `reconstruct` for additional preprocessing. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.analyze_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get[source][test]

\n", "\n", "> get(**`i`**) → `Any`\n", "\n", "
\n", "\n", "Subclass if you want to customize how to create item `i` from `self.items`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.get)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following demo gives a glimpse of how `get` works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (20 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png,/home/ubuntu/.fastai/data/mnist_tiny/test/5071.png,/home/ubuntu/.fastai/data/mnist_tiny/test/617.png,/home/ubuntu/.fastai/data/mnist_tiny/test/585.png,/home/ubuntu/.fastai/data/mnist_tiny/test/2032.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data_base = ItemList.from_folder(path=path_data, extensions=['.png'], include=['test'])\n", "il_data_base" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get` is used implicitly within `il_data_base[15]`. `il_data_base.get(15)` gives the same result here, because the default `get` simply returns the corresponding entry of `items`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test/6736.png')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data_base[15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When creating your custom [`ItemList`](/data_block.html#ItemList), however, you can override this function to do something to your item (like opening an image)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (20 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data_image = ImageList.from_folder(path=path_data, extensions=['.png'], include=['test'])\n", "il_data_image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, normally `get` is used implicitly within `il_data_image[15]`."
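
The mechanism itself can be sketched with a pair of toy classes. This is not fastai's actual implementation: `ToyItemList` and `ToyStemList` are hypothetical names, and the subclass returns an upper-cased file stem where a real image list would open the file.

```python
from pathlib import Path

class ToyItemList:
    """Toy stand-in for an item list; not fastai's actual class."""
    def __init__(self, items):
        self.items = list(items)

    def get(self, i):
        # Default behaviour: return the raw item unchanged.
        return self.items[i]

    def __getitem__(self, i):
        # Indexing goes through `get`, so subclasses only need to override `get`.
        return self.get(i)

class ToyStemList(ToyItemList):
    def get(self, i):
        # Customised behaviour: turn the raw item into something else.
        # A real image list would open the file here instead.
        return Path(super().get(i)).stem.upper()

base = ToyItemList(["test/a.png", "test/b.png"])
custom = ToyStemList(["test/a.png", "test/b.png"])
print(base[1])    # the raw item
print(custom[1])  # the item after the overridden `get`
```

Because indexing always goes through `get`, overriding that one method is enough to change what every `il[i]` returns.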
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "/9j/4AAQSkZJRgABAQEAZABkAAD/2wBDAAIBAQEBAQIBAQECAgICAgQDAgICAgUEBAMEBgUGBgYFBgYGBwkIBgcJBwYGCAsICQoKCgoKBggLDAsKDAkKCgr/2wBDAQICAgICAgUDAwUKBwYHCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgr/wAARCAAcABwDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD+f+vtv9jb/g3x/wCCln7ev7Mlr+1h+zV8PvDereGdQkvYtLgvfFttZ3l5LayPE8aJMVVSZEZFMjoueSQvzV8SV+tf/BuX+yV/wTK+JkSfGL9pH/gqh4k+FPxJi8SG0034d+H/AIhx+DJL2KNofKJvzIJrsTNOFWO3eCUMkgUttJAB8j/tef8ABDL/AIKpfsL/AA6vPjD+0j+yVqml+ErCQre+JNJ1rT9VtbZPMSNZZjY3ErW6M0iBWmVMlsdQQPkyv6Jv+Dpf/gp7+2P+y38LL7/gm34f/ZV/4R/4eeOdHgsNL+MuteJ5PEDeJtIjgjFzaKt1AWt75ZNqSyzzTz7cSqQ0yTD+dmgD9N/+CO3/AASv+Gv/AAUj/wCCa/7RelfB2Pwf4g/aRtvEGhW3g/QfFN5NbyaLoizwyT31vIrxpG05eeMyMJwBZCIxx/aA7eufs0/8Gifxj8IanJ8Uv+CqH7SvgP4T/C/SpHOt3Wk+M4VvzEqbzILi6tzZ26HDLvkdmXazGMhQH/GyigD9Q/8Ag4y/4Kgfsq/tSW/ws/YM/YEupNS+EHwK0sWWn+IZraUJqF0tvFbpHaSTP5klrBDGIgzRxhnDlN8Qidvy8oooA//Z\n", "image/png": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAYAAAByDd+UAAAABHNCSVQICAgIfAhkiAAAAOlJREFUSIntlksOwyAMRE3Vgw0ny3Ay52Z0UVGlfE3Ssqg6EpuE5MEzjuJEJMrC3FbC/sCv5G6dCEAAvF0jKQBk3/cpaLQMVY2WkIwAeu/6LDClBTXX0Hsvzjnx3ksIYagx1z+ttDUAVHeoqteU1kAtzR3YHLAHMcLsQJJXIPPAVq1Gp/KSUpLdnRqh509oDW6AngemcQzJ7tyi8QEIyfxyM5e/pbOHIG+TKaWpJiMtAKr1Gz1XAI+rnYVZrbhElafPl+cQQuF+27ZmXZxzzXt53lY/G6PGdlv0GjtFVc+ASqUr8vt/bcuBD5ipIJ8bKsRaAAAAAElFTkSuQmCC\n", 
"text/plain": [ "Image (3, 28, 28)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_data_image[15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason why an image is printed out instead of a FilePath object, is [`ImageList.get`](/vision.data.html#ImageList.get) overwrites [`ItemList.get`](/data_block.html#ItemList.get) and use [`ImageList.open`](/vision.data.html#ImageList.open) to print an image." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

new[source][test]

\n", "\n", "> new(**`items`**:`Iterator`\\[`T_co`\\], **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create a new [`ItemList`](/data_block.html#ItemList) from `items`, keeping the same attributes. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll normally never need to override this method. Just don't forget to add to `self.copy_new`, in `__init__`, the names of the arguments that need to be copied each time `new` is called." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will get a feel for how `new` works with the following examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (699 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny/valid" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itemlist1 = ItemList.from_folder(path=path_data/'valid', 
extensions=['.png'])\n", "itemlist1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you will see below, `copy_new` allows us to borrow any argument and its value from `itemlist1`, and `itemlist1.new(itemlist1.items)` allows us to use `items` and the arguments listed in `copy_new` to create another [`ItemList`](/data_block.html#ItemList) by calling [`ItemList.__init__`](/data_block.html#ItemList.__init__)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itemlist1.copy_new == ['x', 'label_cls', 'path']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "((itemlist1.x == itemlist1.label_cls == itemlist1.inner_df == None) \n", " and (itemlist1.path == Path('/Users/Natsume/.fastai/data/mnist_tiny/valid')))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select any argument from [`ItemList.__init__`](/data_block.html#ItemList.__init__)'s signature and change its value. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "itemlist1.copy_new = ['x', 'label_cls', 'path', 'inner_df']\n", "itemlist1.x = itemlist1.label_cls = itemlist1.path = itemlist1.inner_df = 'test'\n", "itemlist2 = itemlist1.new(items=itemlist1.items)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(itemlist2.inner_df == itemlist2.x == itemlist2.label_cls == 'test' \n", "and itemlist2.path == Path('test'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source][test]

\n", "\n", "> reconstruct(**`t`**:`Tensor`, **`x`**:`Tensor`=***`None`***)\n", "\n", "
\n", "\n", "Reconstruct one of the underlying items from its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.reconstruct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Split the data between the training and the validation sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step is normally straightforward: you just have to pick one of the following functions, depending on what you need." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_none[source][test]

\n", "\n", "> split_none()\n", "\n", "
\n", "\n", "Don't split the data and create an empty validation set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_none)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_rand_pct

\n", "\n", "> split_by_rand_pct(**`valid_pct`**:`float`=***`0.2`***, **`seed`**:`int`=***`None`***) → `ItemLists`\n", "\n", "

Tests found for split_by_rand_pct:

  • pytest -sv tests/test_data_block.py::test_splitdata_datasets [source]

Some other tests where split_by_rand_pct is used:

  • pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

\n", "\n", "Split the items randomly by putting `valid_pct` in the validation set, optional `seed` can be passed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_rand_pct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_subsets

\n", "\n", "> split_subsets(**`train_size`**:`float`, **`valid_size`**:`float`, **`seed`**=***`None`***) → `ItemLists`\n", "\n", "

Tests found for split_subsets:

  • pytest -sv tests/test_data_block.py::test_split_subsets [source]

To run tests please refer to this guide.

\n", "\n", "Split the items into train set with size `train_size * n` and valid set with size `valid_size * n`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_subsets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function is handy if you want to work with subsets of specific sizes, e.g., you want to use 20% of the data for the validation dataset, but you only want to train on a small subset of the rest of the data: `split_subsets(train_size=0.08, valid_size=0.2)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_files

\n", "\n", "> split_by_files(**`valid_names`**:`ItemList`) → `ItemLists`\n", "\n", "

\n", "\n", "Split the data by using the names in `valid_names` for validation. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_files)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_fname_file

\n", "\n", "> split_by_fname_file(**`fname`**:`PathOrStr`, **`path`**:`PathOrStr`=***`None`***) → `ItemLists`\n", "\n", "

\n", "\n", "Split the data by using the names in `fname` for the validation set. `path` will override `self.path`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_fname_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method internally calls `split_by_files`. The file `fname` contains the names of the files to put in the validation set, such as `0001.png`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_folder

\n", "\n", "> split_by_folder(**`train`**:`str`=***`'train'`***, **`valid`**:`str`=***`'valid'`***) → `ItemLists`\n", "\n", "

Tests found for split_by_folder:

Some other tests where split_by_folder is used:

  • pytest -sv tests/test_data_block.py::test_wrong_order [source]

To run tests please refer to this guide.

\n", "\n", "Split the data depending on the folder (`train` or `valid`) in which the filenames are. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Note: This method looks at the folder immediately after `self.path` for `valid` and `train`.
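
As a rough illustration of the note above, here is a minimal pure-Python sketch of folder-based splitting. It is not fastai's implementation: the helper name `indices_in_folder` and the sample paths are made up for this example, and it only assumes that every item is a path located under a common root.

```python
from pathlib import PurePosixPath

def indices_in_folder(items, root, name):
    # Hypothetical helper: return the indices of the items whose first
    # folder below `root` is `name`.
    root = PurePosixPath(root)
    idxs = []
    for i, item in enumerate(items):
        rel = PurePosixPath(item).relative_to(root)
        # rel.parts[0] is the folder immediately after `root`;
        # files sitting directly in `root` have a single part and are skipped.
        if len(rel.parts) > 1 and rel.parts[0] == name:
            idxs.append(i)
    return idxs

# Made-up paths that mimic the layout of a dataset folder.
items = [
    'data/labels.csv',
    'data/train/3/9932.png',
    'data/train/7/9243.png',
    'data/valid/3/7692.png',
    'data/test/1503.png',
]
train_idx = indices_in_folder(items, 'data', 'train')
valid_idx = indices_in_folder(items, 'data', 'valid')
print(train_idx, valid_idx)
```

In this sketch, items that sit in neither folder (here `labels.csv` and the `test` image) end up in neither index list.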
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_note(\"This method looks at the folder immediately after `self.path` for `valid` and `train`.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically, `split_by_folder` takes in two folder names ('train' and 'valid' in the following example), to split `il` the large [`ImageList`](/vision.data.html#ImageList) into two smaller [`ImageList`](/vision.data.html#ImageList)s, one for training set and the other for validation set. Both [`ImageList`](/vision.data.html#ImageList)s are attached to a large [`ItemLists`](/data_block.html#ItemLists) which is the final output of `split_by_folder`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),\n", " PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemList (1439 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/labels.csv,/home/ubuntu/.fastai/data/mnist_tiny/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/history.csv,/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv,/home/ubuntu/.fastai/data/mnist_tiny/test/1503.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "il = ItemList.from_folder(path=path_data); il" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ItemList (713 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/train/3/9932.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/7189.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8498.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8888.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny;\n", "\n", "Valid: ItemList (699 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sd = il.split_by_folder(train='train', valid='valid'); sd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, `split_by_folder` uses `_get_by_folder(name)`, to turn both 'train' and 'valid' folders into two list of indexes, and pass them onto `split_by_idxs` to split `il` into two [`ImageList`](/vision.data.html#ImageList)s, and finally attached to a [`ItemLists`](/data_block.html#ItemLists). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([24, 25, 26, 27, 28], [732, 733, 734, 735, 736], 713)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_idx = il._get_by_folder(name='train')\n", "train_idx[:5], train_idx[-5:], len(train_idx)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([740, 741, 742, 743, 744], [1434, 1435, 1436, 1437, 1438], 699)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_idx = il._get_by_folder(name='valid') \n", "valid_idx[:5], valid_idx[-5:],len(valid_idx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the way, `_get_by_folder(name)` works in the following way, first, index the entire `il.items`, loop every item and if an item belongs to the named folder, e.g., 'train', then put it into a list. The folder `name` is the only input, and output is the list." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_idx

\n", "\n", "> split_by_idx(**`valid_idx`**:`Collection`\\[`int`\\]) → `ItemLists`\n", "\n", "

Tests found for split_by_idx:

Some other tests where split_by_idx is used:

  • pytest -sv tests/test_data_block.py::test_category [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_multi_category [source]

To run tests please refer to this guide.

\n", "\n", "Split the data according to the indexes in `valid_idx`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_idx)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   name               label
0  train/3/7463.png   0
1  train/3/21102.png  0
2  train/3/31559.png  0
3  train/3/46882.png  0
4  train/3/26209.png  0
\n", "
" ], "text/plain": [ " name label\n", "0 train/3/7463.png 0\n", "1 train/3/21102.png 0\n", "2 train/3/31559.png 0\n", "3 train/3/46882.png 0\n", "4 train/3/26209.png 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "df = pd.read_csv(path/'labels.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can pass a list of indices that you want to put in the validation set like [1, 3, 10]. Or you can pass a contiguous list like `list(range(1000))`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (13434 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample;\n", "\n", "Valid: ImageList (1000 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = (ImageList.from_df(df, path)\n", " .split_by_idx(list(range(1000))))\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_idxs

\n", "\n", "> split_by_idxs(**`train_idx`**, **`valid_idx`**)\n", "\n", "

\n", "\n", "Split the data between `train_idx` and `valid_idx`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_idxs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, `split_by_idxs` turns the two index lists (`train_idx` and `valid_idx`) into two [`ImageList`](/vision.data.html#ImageList)s, then passes them on to `split_by_list`, which attaches them to an [`ItemLists`](/data_block.html#ItemLists)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ItemList (713 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/train/3/9932.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/7189.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8498.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8888.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny;\n", "\n", "Valid: ItemList (699 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sd = il.split_by_idxs(train_idx=train_idx, valid_idx=valid_idx); sd" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_list

\n", "\n", "> split_by_list(**`train`**, **`valid`**)\n", "\n", "

\n", "\n", "Split the data between `train` and `valid`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`split_by_list` takes two [`ImageList`](/vision.data.html#ImageList)s, which in the case below are `il[train_idx]` and `il[valid_idx]`, and passes them on to `_split` ([`ItemLists`](/data_block.html#ItemLists)) to initialize an [`ItemLists`](/data_block.html#ItemLists) object, which holds the training, validation and (optionally) test [`ImageList`](/vision.data.html#ImageList)s as its properties." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ItemList (713 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/train/export.pkl,/home/ubuntu/.fastai/data/mnist_tiny/train/3/9932.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/7189.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8498.png,/home/ubuntu/.fastai/data/mnist_tiny/train/3/8888.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny;\n", "\n", "Valid: ItemList (699 items)\n", "/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7692.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/7484.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9157.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/8703.png,/home/ubuntu/.fastai/data/mnist_tiny/valid/3/9182.png\n", "Path: /home/ubuntu/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sd = il.split_by_list(train=il[train_idx], valid=il[valid_idx]); sd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is more of an internal method; use `split_by_files` if you want to pass a list of filenames for the validation set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_by_valid_func

\n", "\n", "> split_by_valid_func(**`func`**:`Callable`) → `ItemLists`\n", "\n", "

\n", "\n", "Split the data by result of `func` (which returns `True` for validation set). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_by_valid_func)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

split_from_df

\n", "\n", "> split_from_df(**`col`**:`IntsOrStrs`=***`2`***)\n", "\n", "

\n", "\n", "Split the data from the `col` in the dataframe in `self.inner_df`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.split_from_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use this function, you need a boolean column in your dataframe, selected with `col` (by default the column at index 2, called `is_valid` in the example below). If `is_valid[index]` is `True`, that example is put in the validation set; if it is `False`, the example is put in the training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(14434, 3)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      name               label  is_valid
2071  train/3/28571.png  0      True
9382  train/7/24434.png  1      False
6399  train/7/56604.png  1      True
130   train/3/4740.png   0      True
9226  train/7/18876.png  1      False
\n", "
" ], "text/plain": [ " name label is_valid\n", "2071 train/3/28571.png 0 True\n", "9382 train/7/24434.png 1 False\n", "6399 train/7/56604.png 1 True\n", "130 train/3/4740.png 0 True\n", "9226 train/7/18876.png 1 False" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.MNIST_SAMPLE)\n", "df = pd.read_csv(path/'labels.csv')\n", "\n", "# Create a new column for is_valid\n", "df['is_valid'] = [True]*(df.shape[0]//2) + [False]*(df.shape[0]//2)\n", "\n", "# Randomly shuffle dataframe\n", "df = df.reindex(np.random.permutation(df.index))\n", "print(df.shape)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (7217 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample;\n", "\n", "Valid: ImageList (7217 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /home/ubuntu/.fastai/data/mnist_sample;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = (ImageList.from_df(df, path)\n", " .split_from_df())\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Warning: This method assumes the data has been created from a csv file or a dataframe.
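
The boolean-column split described above can be sketched in plain Python. This is an illustration under simplifying assumptions rather than the library code: `split_by_flags` is a hypothetical helper, and the rows are plain tuples standing in for the rows of a dataframe.

```python
def split_by_flags(rows, flag_col=2):
    # Hypothetical helper: rows whose flag is truthy go to the validation
    # set, all the other rows go to the training set.
    train = [r for r in rows if not r[flag_col]]
    valid = [r for r in rows if r[flag_col]]
    return train, valid

# Each row mimics (name, label, is_valid); the flag sits in column 2.
rows = [
    ('train/3/28571.png', 0, True),
    ('train/7/24434.png', 1, False),
    ('train/7/56604.png', 1, True),
    ('train/7/18876.png', 1, False),
]
train, valid = split_by_flags(rows)
print(len(train), len(valid))
```

The sketch keeps the original row order inside each subset, so shuffling the rows beforehand only changes the order of the items, not which set they land in.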
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_warn(\"This method assumes the data has been created from a csv file or a dataframe.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Label the inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To label your inputs, use one of the following functions. Note that even if it's not in the documented arguments, you can always pass a `label_cls` that will be used to create those labels (the default is the one from your input [`ItemList`](/data_block.html#ItemList), and if there is none, it will go to [`CategoryList`](/data_block.html#CategoryList), [`MultiCategoryList`](/data_block.html#MultiCategoryList) or [`FloatList`](/data_block.html#FloatList) depending on the type of the labels). This is implemented in the following function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_label_cls

\n", "\n", "> get_label_cls(**`labels`**, **`label_cls`**:`Callable`=***`None`***, **`label_delim`**:`str`=***`None`***, **\\*\\*`kwargs`**)\n", "\n", "

\n", "\n", "Return `label_cls` or guess one from the first element of `labels`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.get_label_cls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, [`ItemList.get_label_cls`](/data_block.html#ItemList.get_label_cls) selects a label class according to the type of the items in `labels`, where `labels` can be a `Collection`, a `pandas.core.frame.DataFrame` or a `pandas.core.series.Series`. If the elements are strings or integers, `get_label_cls` returns [`CategoryList`](/data_block.html#CategoryList); if they are floats, it returns [`FloatList`](/data_block.html#FloatList); if they are of type `Collection`, it returns [`MultiCategoryList`](/data_block.html#MultiCategoryList)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Valid: ImageList (699 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY)\n", "sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid'); sd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.CategoryList" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = ['7', '3']\n", "label_cls = sd.train.get_label_cls(labels); 
label_cls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.CategoryList" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = [7, 3]\n", "label_cls = sd.train.get_label_cls(labels); label_cls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.FloatList" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = [7.0, 3.0]\n", "label_cls = sd.train.get_label_cls(labels); label_cls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.MultiCategoryList" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = [[7, 3],]\n", "label_cls = sd.train.get_label_cls(labels); label_cls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fastai.data_block.MultiCategoryList" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = [['7', '3'],]\n", "label_cls = sd.train.get_label_cls(labels); label_cls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If no `label_cls` argument is passed, the correct labeling type can usually be inferred from the data (for classification or regression). If you have multiple regression targets (e.g. predicting 5 different numbers from a single image/text), be aware that arrays of floats are by default considered to be targets for one-hot encoded classification. If your task is regression, be sure to pass `label_cls=FloatList` so that learners created from your databunch initialize correctly." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first example in these docs created labels as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = untar_data(URLs.MNIST_TINY)\n", "ll = ImageList.from_folder(path).split_by_folder().label_from_folder().train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to save the data necessary to recreate your [`LabelList`](/data_block.html#LabelList) (not including the actual image/text/etc files), you can use `to_df` or `to_csv`:\n", "\n", "```python\n", "ll.to_csv('tmp.csv')\n", "```\n", "\n", "Or just grab a `pd.DataFrame` directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   x                 y
0  train/7/9243.png  7
1  train/7/9519.png  7
2  train/7/7534.png  7
3  train/7/9082.png  7
4  train/7/8377.png  7
\n", "
" ], "text/plain": [ " x y\n", "0 train/7/9243.png 7\n", "1 train/7/9519.png 7\n", "2 train/7/7534.png 7\n", "3 train/7/9082.png 7\n", "4 train/7/8377.png 7" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ll.to_df().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_empty

\n", "\n", "> label_empty(**\\*\\*`kwargs`**)\n", "\n", "

\n", "\n", "Label every item with an [`EmptyLabel`](/core.html#EmptyLabel). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_empty)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_df

\n", "\n", "> label_from_df(**`cols`**:`IntsOrStrs`=***`1`***, **`label_cls`**:`Callable`=***`None`***, **\\*\\*`kwargs`**)\n", "\n", "

Tests found for label_from_df:

Some other tests where label_from_df is used:

  • pytest -sv tests/test_data_block.py::test_category [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_multi_category [source]
  • pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

\n", "\n", "Label `self.items` from the values in `cols` in `self.inner_df`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Warning: This method only works with data objects created with either `from_csv` or `from_df` methods.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_warn(\"This method only works with data objects created with either `from_csv` or `from_df` methods.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_const

\n", "\n", "> label_const(**`const`**:`Any`=***`0`***, **`label_cls`**:`Callable`=***`None`***, **\\*\\*`kwargs`**) → `LabelList`\n", "\n", "

Tests found for label_const:

Some other tests where label_const is used:

  • pytest -sv tests/test_data_block.py::test_split_subsets [source]
  • pytest -sv tests/test_data_block.py::test_splitdata_datasets [source]

To run tests please refer to this guide.

\n", "\n", "Label every item with `const`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_const)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_folder

\n", "\n", "> label_from_folder(**`label_cls`**:`Callable`=***`None`***, **\\*\\*`kwargs`**) → `LabelList`\n", "\n", "

Tests found for label_from_folder:

  • pytest -sv tests/test_text_data.py::test_filter_classes [source]
  • pytest -sv tests/test_text_data.py::test_from_folder [source]

Some other tests where label_from_folder is used:

  • pytest -sv tests/test_data_block.py::test_wrong_order [source]

To run tests please refer to this guide.

\n", "\n", "Give a label to each filename depending on its folder. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Note: This method looks at the last subfolder in the path to determine the classes.
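
To make the note above concrete, here is a minimal sketch of labeling by the last subfolder, written with `pathlib` only. The function name `folder_label` and the sample paths are assumptions made for this example; they are not part of the fastai API.

```python
from pathlib import PurePosixPath

def folder_label(item):
    # The label is the name of the folder that directly contains the file,
    # i.e. the second-to-last component of the path.
    return PurePosixPath(item).parts[-2]

# Made-up file names laid out as <dataset>/<split>/<class>/<file>.
items = ['mnist_tiny/train/3/9932.png', 'mnist_tiny/train/7/9243.png']
labels = [folder_label(o) for o in items]
print(labels)
```

Only the folder that directly contains each file matters here, so the same function gives the same labels for files under `train` and under `valid`.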
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_note(\"This method looks at the last subfolder in the path to determine the classes.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, when an [`ItemList`](/data_block.html#ItemList) calls `label_from_folder`, it creates a lambda function which outputs a foldername which a file Path object immediately or directly belongs to, and then calls `label_from_func` with the lambda function as input. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the practical and high level, `label_from_folder` is mostly used with [`ItemLists`](/data_block.html#ItemLists) rather than [`ItemList`](/data_block.html#ItemList) for simplicity and efficiency, for details see the `label_from_folder` example on [ItemLists](). Even when you just want a training set [`ItemList`](/data_block.html#ItemList), you still need to do `split_none` to create an [`ItemLists`](/data_block.html#ItemLists) and then do labeling with `label_from_folder`, as the example shown below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Valid: LabelList (0 items)\n", "x: ImageList\n", "\n", "y: CategoryList\n", "\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sd_train = ImageList.from_folder(path_data/'train').split_none()\n", "ll_train = sd_train.label_from_folder(); ll_train" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_func[source][test]

\n", "\n", "> label_from_func(**`func`**:`Callable`, **`label_cls`**:`Callable`=***`None`***, **\\*\\*`kwargs`**) → `LabelList`\n", "\n", "

\n", "\n", "Apply `func` to every input to get its label. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`label_from_func` applies the input `func` to every item of an [`ItemList`](/data_block.html#ItemList), collects the outputs in a list, and then passes that list on to [`ItemList._label_from_list`](/data_block.html#ItemList._label_from_list). Below is a simple example of using `label_from_func`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Valid: ImageList (699 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY)\n", "sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid');sd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func=lambda o: (o.parts if isinstance(o, Path) else o.split(os.path.sep))[-2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lambda function above returns the name of the folder that immediately contains a file, whether the file is given as a `Path` object or as a string."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Valid: LabelList (699 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ll = sd.label_from_func(func); ll" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_re[source][test]

\n", "\n", "> label_from_re(**`pat`**:`str`, **`full_path`**:`bool`=***`False`***, **`label_cls`**:`Callable`=***`None`***, **\\*\\*`kwargs`**) → `LabelList`\n", "\n", "

\n", "\n", "Apply the re in `pat` to determine the label of every filename. If `full_path`, search in the full name. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.label_from_re)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CategoryList[source][test]

\n", "\n", "> CategoryList(**`items`**:`Iterator`\\[`T_co`\\], **`classes`**:`Collection`\\[`T_co`\\]=***`None`***, **`label_delim`**:`str`=***`None`***, **\\*\\*`kwargs`**) :: [`CategoryListBase`](/data_block.html#CategoryListBase)\n", "\n", "

\n", "\n", "Basic [`ItemList`](/data_block.html#ItemList) for single classification labels. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`ItemList`](/data_block.html#ItemList) suitable for storing labels in `items` belonging to `classes`. If `classes` is `None`, it will be determined from the unique labels. `processor` will default to [`CategoryProcessor`](/data_block.html#CategoryProcessor)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`CategoryList`](/data_block.html#CategoryList) uses `labels` to create an [`ItemList`](/data_block.html#ItemList) for dealing with categorical labels. Behind the scenes, [`CategoryList`](/data_block.html#CategoryList) is a subclass of [`CategoryListBase`](/data_block.html#CategoryListBase), which is a subclass of [`ItemList`](/data_block.html#ItemList). [`CategoryList`](/data_block.html#CategoryList) inherits from [`CategoryListBase`](/data_block.html#CategoryListBase) properties such as `classes` (default `None`) and `filter_missing_y` (default `True`), and adds its own property `loss_func` (default `CrossEntropyFlat()`) and its own class attribute `_processor` (default [`CategoryProcessor`](/data_block.html#CategoryProcessor)). 
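As a rough illustration of how `classes` can be derived when none are passed, here is a plain-Python sketch. It is not the fastai source: the helper names `sketch_generate_classes` and `sketch_encode` are hypothetical, and the only assumption taken from this page is that classes are the sorted unique labels and that each label is stored as an integer index into that list.

```python
def sketch_generate_classes(labels):
    # Hypothetical helper: sorted unique label values, used when no classes are supplied.
    return sorted(set(labels))

def sketch_encode(labels, classes=None):
    # Hypothetical helper: map each label to its index in classes (derived if missing).
    if classes is None:
        classes = sketch_generate_classes(labels)
    c2i = {c: i for i, c in enumerate(classes)}
    return [c2i[label] for label in labels], classes

items, classes = sketch_encode(['7', '3', '7', '7', '3'])
print(items, classes)
```

The pair returned here has the same shape as the `items` and `classes` values displayed for `ll.train.y` in the MNIST example: integer indices alongside the list of class names.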
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([1, 1, 1, 1, ..., 0, 0, 0, 0]), ['3', '7'], Category 7)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY)\n", "ll = ImageList.from_folder(path_data).split_by_folder('train', 'valid').label_from_folder()\n", "ll.train.y.items, ll.train.y.classes, ll.train.y[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CategoryList (709 items)\n", "7,7,7,7,7\n", "Path: ." ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cl = CategoryList(ll.train.y.items, ll.train.y.classes); cl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the behavior of printing out [`CategoryList`](/data_block.html#CategoryList) object or access an element using index, please see [`CategoryList.get`](/data_block.html#CategoryList.get) below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, [`CategoryList.get`](/data_block.html#CategoryList.get) is used inexplicitly when printing out the [`CategoryList`](/data_block.html#CategoryList) object or `cl[idx]`. According to the source of [`CategoryList.get`](/data_block.html#CategoryList.get), each `item` is used to get its own `class`. When 'classes' is a list of strings, then elements of `items` are used as index of a list, therefore they must be integers in the range from 0 to `len(classes)-1`; if `classes` is a dictionary, then elements of `items` are used as keys, therefore they can be strings too. See examples below for details." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CategoryList (5 items)\n", "3,7,9,7,3\n", "Path: ." ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items = np.array([0, 1, 2, 1, 0])\n", "cl = CategoryList(items, classes=['3', '7', '9']); cl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CategoryList (5 items)\n", "3,7,9,7,3\n", "Path: ." ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items = np.array(['3', '7', '9', '7', '3'])\n", "classes = {'3':3, '7':7, '9':9}\n", "cl = CategoryList(items, classes); cl" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class MultiCategoryList[source][test]

\n", "\n", "> MultiCategoryList(**`items`**:`Iterator`\\[`T_co`\\], **`classes`**:`Collection`\\[`T_co`\\]=***`None`***, **`label_delim`**:`str`=***`None`***, **`one_hot`**:`bool`=***`False`***, **\\*\\*`kwargs`**) :: [`CategoryListBase`](/data_block.html#CategoryListBase)\n", "\n", "

\n", "\n", "Basic [`ItemList`](/data_block.html#ItemList) for multi-classification labels. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It stores a list of labels in `items` belonging to `classes`. If `classes` is `None`, it will be determined from the unique labels. `label_delim` is used to split the content of `items` into a list of tags.\n", "\n", "If `one_hot=True`, the items contain the labels one-hot encoded. In this case, it is mandatory to pass a list of `classes` (as we can't infer them from the distinct labels)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class FloatList[source][test]

\n", "\n", "> FloatList(**`items`**:`Iterator`\\[`T_co`\\], **`log`**:`bool`=***`False`***, **`classes`**:`Collection`\\[`T_co`\\]=***`None`***, **\\*\\*`kwargs`**) :: [`ItemList`](/data_block.html#ItemList)\n", "\n", "

\n", "\n", "[`ItemList`](/data_block.html#ItemList) suitable for storing the floats in items for regression. Will add a `log` if this flag is `True`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList, title_level=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class EmptyLabelList[source][test]

\n", "\n", "> EmptyLabelList(**`items`**:`Iterator`\\[`T_co`\\], **`path`**:`PathOrStr`=***`'.'`***, **`label_cls`**:`Callable`=***`None`***, **`inner_df`**:`Any`=***`None`***, **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **`x`**:`ItemList`=***`None`***, **`ignore_empty`**:`bool`=***`False`***) :: [`ItemList`](/data_block.html#ItemList)\n", "\n", "

\n", "\n", "Basic [`ItemList`](/data_block.html#ItemList) for dummy labels. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmptyLabelList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Invisible step: preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This isn't seen here in the API, but if you passed a `processor` (or a list of them) in your initial [`ItemList`](/data_block.html#ItemList) during step 1, it will be applied here. If you didn't pass any processor, a list of them might still be created depending on what is in the `_processor` variable of your class of items (this can be a list of [`PreProcessor`](/data_block.html#PreProcessor) classes).\n", "\n", "A processor is a transformation that is applied to all the inputs once at initialization, with a state computed on the training set that is then applied without modification on the validation set (and maybe the test set). For instance, it can be processing texts to tokenize then numericalize them. In that case we want the validation set to be numericalized with exactly the same vocabulary as the training set.\n", "\n", "Another example is in tabular data, where we fill missing values with (for instance) the median computed on the training set. That statistic is stored in the inner state of the [`PreProcessor`](/data_block.html#PreProcessor) and applied on the validation set.\n", "\n", "This is the generic class for all processors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class PreProcessor[source][test]

\n", "\n", "> PreProcessor(**`ds`**:`Collection`\\[`T_co`\\]=***`None`***)\n", "\n", "

\n", "\n", "Basic class for a processor that will be applied to items at the end of the data block API. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(PreProcessor, title_level=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process_one[source][test]

\n", "\n", "> process_one(**`item`**:`Any`)\n", "\n", "

Tests found for process_one:

Some other tests where process_one is used:

  • pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(PreProcessor.process_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Process one `item`. This method needs to be written in any subclass." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process[source][test]

\n", "\n", "> process(**`ds`**:`Collection`\\[`T_co`\\])\n", "\n", "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(PreProcessor.process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ds`: an object of [`ItemList`](/data_block.html#ItemList) \n", "Process a dataset. This default to apply `process_one` on every `item` of `ds`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CategoryProcessor[source][test]

\n", "\n", "> CategoryProcessor(**`ds`**:[`ItemList`](/data_block.html#ItemList)) :: [`PreProcessor`](/data_block.html#PreProcessor)\n", "\n", "

\n", "\n", "[`PreProcessor`](/data_block.html#PreProcessor) that create `classes` from `ds.items` and handle the mapping. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor, title_level=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

generate_classes[source][test]

\n", "\n", "> generate_classes(**`items`**)\n", "\n", "

\n", "\n", "Generate classes from `items` by taking the sorted unique values. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.generate_classes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process[source][test]

\n", "\n", "> process(**`ds`**)\n", "\n", "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ds` is an object of [`CategoryList`](/data_block.html#CategoryList). \n", "It basically generates a list of unique labels (assigned to `ds.classes`) and a dictionary mapping `classes` to indexes (assigned to `ds.c2i`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is an internal function only called to apply processors to training, validation and testing datasets after the labeling step." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class MultiCategoryProcessor[source][test]

\n", "\n", "> MultiCategoryProcessor(**`ds`**:[`ItemList`](/data_block.html#ItemList), **`one_hot`**:`bool`=***`False`***) :: [`CategoryProcessor`](/data_block.html#CategoryProcessor)\n", "\n", "

\n", "\n", "[`PreProcessor`](/data_block.html#PreProcessor) that create `classes` from `ds.items` and handle the mapping. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryProcessor, title_level=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

generate_classes[source][test]

\n", "\n", "> generate_classes(**`items`**)\n", "\n", "

\n", "\n", "Generate classes from `items` by taking the sorted unique values. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryProcessor.generate_classes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add transforms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transforms differ from processors in the sense they are applied on the fly when we grab one item. They also may change each time we ask for the same item in the case of random transforms." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform[source][test]

\n", "\n", "> transform(**`tfms`**:`Optional`\\[`Tuple`\\[`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], `Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]\\]\\]=***`(None, None)`***, **\\*\\*`kwargs`**)\n", "\n", "

\n", "\n", "Set `tfms` to be applied to the xs of the train and validation set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.transform)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is primarily for the vision application. The `kwargs` arguments are the ones expected by the type of transforms you pass. `tfm_y` is among them and, if set to `True`, the transforms will be applied to both the input and the target.\n", "\n", "For examples see: [vision.transform](vision.transform.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add a test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add a test set, you can use one of the two following methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

add_test[source][test]

\n", "\n", "> add_test(**`items`**:`Iterator`\\[`T_co`\\], **`label`**:`Any`=***`None`***)\n", "\n", "

\n", "\n", "Add test set containing `items` with an arbitrary `label`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.add_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Note: Here `items` can be an `ItemList` or a collection.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_note(\"Here `items` can be an `ItemList` or a collection.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

add_test_folder[source][test]

\n", "\n", "> add_test_folder(**`test_folder`**:`str`=***`'test'`***, **`label`**:`Any`=***`None`***)\n", "\n", "

\n", "\n", "Add test set containing items from `test_folder` and an arbitrary `label`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.add_test_folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Warning: In fastai the test set is unlabeled! No labels will be collected even if they are available.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_warn(\"In fastai the test set is unlabeled! No labels will be collected even if they are available.\")" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "Instead, either the passed `label` argument or an empty label will be used for all entries of this dataset (this is required by the internal pipeline of fastai). \n", "\n", "In the `fastai` framework `test` datasets have no labels - this is the unknown data to be predicted. If you want to validate your model on a `test` dataset with labels, you probably need to use it as a validation set, as in:\n", "\n", "```\n", "data_test = (ImageList.from_folder(path)\n", " .split_by_folder(train='train', valid='test')\n", " .label_from_folder()\n", " ...)\n", "```\n", "\n", "Another approach, where you do use a normal validation set, and then when the training is over, you just want to validate the test set w/ labels as a validation set, you can do this:\n", "\n", "```\n", "tfms = []\n", "path = Path('data').resolve()\n", "data = (ImageList.from_folder(path)\n", " .split_by_pct()\n", " .label_from_folder()\n", " .transform(tfms)\n", " .databunch()\n", " .normalize() ) \n", "learn = cnn_learner(data, models.resnet50, metrics=accuracy)\n", "learn.fit_one_cycle(5,1e-2)\n", "\n", "# now replace the validation dataset entry with the test dataset as a new validation dataset: \n", "# everything is exactly the same, except replacing `split_by_pct` w/ `split_by_folder` \n", "# (or perhaps you were already using the latter, so simply switch to valid='test')\n", "data_test = (ImageList.from_folder(path)\n", " .split_by_folder(train='train', valid='test')\n", " .label_from_folder()\n", " .transform(tfms)\n", " .databunch()\n", " .normalize()\n", " ) \n", "learn.validate(data_test.valid_dl)\n", "```\n", "Of course, your data block can be totally different, this is just an example." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: convert to a [`DataBunch`](/basic_data.html#DataBunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This last step is usually pretty straightforward. You just have to include all the arguments we pass to [`DataBunch.create`](/basic_data.html#DataBunch.create) (`bs`, `num_workers`, `collate_fn`). The class called to create a [`DataBunch`](/basic_data.html#DataBunch) is set in the `_bunch` attribute of the inputs of the training set if you need to modify it. Normally, the various subclasses we showed before handle that for you." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

databunch[source][test]

\n", "\n", "> databunch(**`path`**:`PathOrStr`=***`None`***, **`bs`**:`int`=***`64`***, **`val_bs`**:`int`=***`None`***, **`num_workers`**:`int`=***`4`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***, **\\*\\*`kwargs`**) → `DataBunch`\n", "\n", "

Tests found for databunch:

  • pytest -sv tests/test_vision_data.py::test_vision_datasets [source]

Some other tests where databunch is used:

  • pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

\n", "\n", "Create an [`DataBunch`](/basic_data.html#DataBunch) from self, `path` will override `self.path`, `kwargs` are passed to [`DataBunch.create`](/basic_data.html#DataBunch.create). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.databunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inner classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class LabelList[source][test]

\n", "\n", "> LabelList(**`x`**:[`ItemList`](/data_block.html#ItemList), **`y`**:[`ItemList`](/data_block.html#ItemList), **`tfms`**:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]=***`None`***, **`tfm_y`**:`bool`=***`False`***, **\\*\\*`kwargs`**) :: [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)\n", "\n", "

\n", "\n", "A list of inputs `x` and labels `y` with optional `tfms`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally apply `tfms` to `y` if `tfm_y` is `True`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, it takes inputs [`ItemList`](/data_block.html#ItemList) and labels [`ItemList`](/data_block.html#ItemList) as its properties `x` and `y`, sets property `item` to `None`, and uses [`LabelList.transform`](/data_block.html#LabelList.transform) to apply a list of transforms `TfmList` to `x` and `y` if `tfm_y` is set `True`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(ImageList (709 items)\n", " Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", " Path: /Users/Natsume/.fastai/data/mnist_tiny, CategoryList (709 items)\n", " 7,7,7,7,7\n", " Path: /Users/Natsume/.fastai/data/mnist_tiny)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY)\n", "ll = ImageList.from_folder(path_data).split_by_folder('train', 'valid').label_from_folder()\n", "ll.train.x, ll.train.y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LabelList(x=ll.train.x, y=ll.train.y)" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

export[source][test]

\n", "\n", "> export(**`fn`**:`PathOrStr`, **\\*\\*`kwargs`**)\n", "\n", "

\n", "\n", "Export the minimal state and save it in `fn` to load an empty version for inference. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.export)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform_y[source][test]

\n", "\n", "> transform_y(**`tfms`**:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]=***`None`***, **\\*\\*`kwargs`**)\n", "\n", "

\n", "\n", "Set `tfms` to be applied to the targets only. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.transform_y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_state[source][test]

\n", "\n", "> get_state(**\\*\\*`kwargs`**)\n", "\n", "

\n", "\n", "Return the minimal state for export. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.get_state)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

load_empty[source][test]

\n", "\n", "> load_empty(**`path`**:`PathOrStr`, **`fn`**:`PathOrStr`)\n", "\n", "

\n", "\n", "Load the state in `fn` to create an empty [`LabelList`](/data_block.html#LabelList) for inference. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.load_empty)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

load_state[source][test]

\n", "\n", "> load_state(**`path`**:`PathOrStr`, **`state`**:`dict`) → `LabelList`\n", "\n", "

\n", "\n", "Create a [`LabelList`](/data_block.html#LabelList) from `state`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.load_state)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process[source][test]

\n", "\n", "> process(**`xp`**:[`PreProcessor`](/data_block.html#PreProcessor)=***`None`***, **`yp`**:[`PreProcessor`](/data_block.html#PreProcessor)=***`None`***, **`name`**:`str`=***`None`***)\n", "\n", "

\n", "\n", "Launch the processing on `self.x` and `self.y` with `xp` and `yp`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, [`LabelList.process`](/data_block.html#LabelList.process) does three things: 1. asks the labels `y` to be processed by `yp` with `y.process(yp)`; 2. if `y.filter_missing_y` is `True`, removes the missing data samples from `x` and `y`; 3. asks the inputs `x` to be processed by `xp` with `x.process(xp)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_data = untar_data(URLs.MNIST_TINY)\n", "sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sd.train = sd.train.label_from_folder(from_item_lists=True)\n", "sd.valid = sd.valid.label_from_folder(from_item_lists=True)\n", "sd.__class__ = LabelLists" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([], [])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xp,yp = sd.get_processors()\n", "xp,yp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sd.train.process(xp, yp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { 
"data": { "text/markdown": [ "

set_item[source][test]

\n", "\n", "> set_item(**`item`**)\n", "\n", "
\n", "\n", "For inference, will briefly replace the dataset with one that only contains `item`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.set_item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

to_df[source][test]

\n", "\n", "> to_df()\n", "\n", "
\n", "\n", "Create `pd.DataFrame` containing `items` from `self.x` and `self.y`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.to_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

to_csv[source][test]

\n", "\n", "> to_csv(**`dest`**:`str`)\n", "\n", "
\n", "\n", "Save `self.to_df()` to a CSV file in `self.path`/`dest`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.to_csv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform[source][test]

\n", "\n", "> transform(**`tfms`**:`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], **`tfm_y`**:`bool`=***`None`***, **\\*\\*`kwargs`**)\n", "\n", "
\n", "\n", "Set the `tfms` and `tfm_y` value to be applied to the inputs and targets. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.transform)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class ItemLists[source][test]

\n", "\n", "> ItemLists(**`path`**:`PathOrStr`, **`train`**:[`ItemList`](/data_block.html#ItemList), **`valid`**:[`ItemList`](/data_block.html#ItemList))\n", "\n", "
\n", "\n", "An [`ItemList`](/data_block.html#ItemList) for each of `train` and `valid` (optional `test`). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It initializes an [`ItemLists`](/data_block.html#ItemLists) object, which holds the training, validation and (optionally) test [`ItemList`](/data_block.html#ItemList)s as its attributes. It also warns you when the training or validation [`ItemList`](/data_block.html#ItemList) is empty. \n", "\n", "See the following example for how to create an [`ItemLists`](/data_block.html#ItemLists) object. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "il_train = ImageList.from_folder(path_data/'train')\n", "il_valid = ImageList.from_folder(path_data/'valid')\n", "il_test = ImageList.from_folder(path_data/'test')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 
28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Valid: ImageList (699 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/valid;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ils = ItemLists(path=path_data, train=il_train, valid=il_valid); ils" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Valid: ImageList (699 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/valid;\n", "\n", "Test: ImageList (20 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/test" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ils.test = il_test; ils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, you will most often get an [`ItemLists`](/data_block.html#ItemLists) right after a large [`ItemList`](/data_block.html#ItemList) is split by a method such as [`ItemList.split_by_folder`](/data_block.html#ItemList.split_by_folder). You can then label the training and validation sets at once by calling `sd.label_from_folder()` (`sd` is an [`ItemLists`](/data_block.html#ItemLists), see the example below). 
Now, some of you may be surprised because `label_from_folder` is a method of [`ItemList`](/data_block.html#ItemList), not of [`ItemLists`](/data_block.html#ItemLists). This works because [`ItemLists`](/data_block.html#ItemLists) forwards the call to its inner [`ItemList`](/data_block.html#ItemList)s through `__getattr__`.\n", "\n", "The following example shows how labelling gets done by calling [`ItemLists.__getattr__`](/data_block.html#ItemLists.__getattr__) with [`ItemList.label_from_folder`](/data_block.html#ItemList.label_from_folder)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ImageList (1428 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il = ImageList.from_folder(path_data); il" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An [`ItemList`](/data_block.html#ItemList) (or an object of one of its subclasses) must first be split into an [`ItemLists`](/data_block.html#ItemLists) before it can be labelled to become a [`LabelLists`](/data_block.html#LabelLists) object."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Valid: ImageList (699 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sd = il.split_by_folder(train='train', valid='valid'); sd\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Valid: LabelList (699 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ll = sd.label_from_folder(); ll" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even when there is just an [`ImageList`](/vision.data.html#ImageList) from a training set folder with no split needed, we must still call `split_none()` to create an [`ItemLists`](/data_block.html#ItemLists); only then can we call `ItemLists.label_from_folder()`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ItemLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Valid: ImageList (0 items)\n", "\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "il_train = ImageList.from_folder(path_data/'train')\n", "sd_train = il_train.split_none(); sd_train" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Valid: LabelList (0 items)\n", "x: ImageList\n", "\n", "y: CategoryList\n", "\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ll_valid_empty = sd_train.label_from_folder(); ll_valid_empty" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in practice, although `label_from_folder` is not an [`ItemLists`](/data_block.html#ItemLists) method, we can call `ItemLists.label_from_folder()` to label the training and validation [`ItemList`](/data_block.html#ItemList)s in a single call."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, `ItemLists.label_from_folder()` actually calls `ItemLists.__getattr__('label_from_folder')`, which calls `label_from_folder` on the training and validation [`ItemList`](/data_block.html#ItemList)s, then turns the [`ItemLists`](/data_block.html#ItemLists) into a [`LabelLists`](/data_block.html#LabelLists) and finally calls [`LabelLists.process`](/data_block.html#LabelLists.process).\n", "\n", "You can also call `LabelLists.__getattr__` directly to do the labelling, as below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: LabelList (709 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Valid: LabelList (699 items)\n", "x: ImageList\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "y: CategoryList\n", "7,7,7,7,7\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny;\n", "\n", "Test: None" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ld_inner = sd.__getattr__('label_from_folder'); ld_inner()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

label_from_lists[source][test]

\n", "\n", "> label_from_lists(**`train_labels`**:`Iterator`\\[`T_co`\\], **`valid_labels`**:`Iterator`\\[`T_co`\\], **`label_cls`**:`Callable`=***`None`***, **\\*\\*`kwargs`**) → `LabelList`\n", "\n", "
\n", "\n", "Use the labels in `train_labels` and `valid_labels` to label the data. `label_cls` will overwrite the default. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists.label_from_lists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform[source][test]

\n", "\n", "> transform(**`tfms`**:`Optional`\\[`Tuple`\\[`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], `Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]\\]\\]=***`(None, None)`***, **\\*\\*`kwargs`**)\n", "\n", "
\n", "\n", "Set `tfms` to be applied to the xs of the train and validation set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists.transform)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

transform_y[source][test]

\n", "\n", "> transform_y(**`tfms`**:`Optional`\\[`Tuple`\\[`Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\], `Union`\\[`Callable`, `Collection`\\[`Callable`\\]\\]\\]\\]=***`(None, None)`***, **\\*\\*`kwargs`**)\n", "\n", "
\n", "\n", "Set `tfms` to be applied to the ys of the train and validation set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemLists.transform_y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class LabelLists[source][test]

\n", "\n", "> LabelLists(**`path`**:`PathOrStr`, **`train`**:[`ItemList`](/data_block.html#ItemList), **`valid`**:[`ItemList`](/data_block.html#ItemList)) :: [`ItemLists`](/data_block.html#ItemLists)\n", "\n", "
\n", "\n", "A [`LabelList`](/data_block.html#LabelList) for each of `train` and `valid` (optional `test`). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a [`LabelLists`](/data_block.html#LabelLists) object works exactly the same way as creating an [`ItemLists`](/data_block.html#ItemLists) object, because its base class is [`ItemLists`](/data_block.html#ItemLists) and it does not override [`ItemLists.__init__`](/data_block.html#ItemLists.__init__). The example below shows how to build a [`LabelLists`](/data_block.html#LabelLists) object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data = untar_data(URLs.MNIST_TINY); path_data.ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "il_train = ImageList.from_folder(path_data/'train')\n", "il_valid = ImageList.from_folder(path_data/'valid')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelLists;\n", "\n", "Train: ImageList (709 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: 
/Users/Natsume/.fastai/data/mnist_tiny/train;\n", "\n", "Valid: ImageList (699 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/valid;\n", "\n", "Test: ImageList (20 items)\n", "Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28),Image (3, 28, 28)\n", "Path: /Users/Natsume/.fastai/data/mnist_tiny/test" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ll_test = LabelLists(path_data, il_train, il_valid)\n", "ll_test.test = il_test = ImageList.from_folder(path_data/'test')\n", "ll_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_processors[source][test]

\n", "\n", "> get_processors()\n", "\n", "
\n", "\n", "Read the default class processors if none have been set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.get_processors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Behind the scenes, `LabelLists.get_processors()` first puts the `train.x._processor` classes and the `train.y._processor` classes into separate lists, then instantiates those processors and puts them into `xp` and `yp`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_data = untar_data(URLs.MNIST_TINY)\n", "sd = ImageList.from_folder(path_data).split_by_folder('train', 'valid')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sd.train = sd.train.label_from_folder(from_item_lists=True)\n", "sd.valid = sd.valid.label_from_folder(from_item_lists=True)\n", "sd.__class__ = LabelLists" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xp,yp = sd.get_processors()\n", "xp,yp" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

load_empty[source][test]

\n", "\n", "> load_empty(**`path`**:`PathOrStr`, **`fn`**:`PathOrStr`=***`'export.pkl'`***)\n", "\n", "
\n", "\n", "Create a [`LabelLists`](/data_block.html#LabelLists) with empty sets from the serialized file in `path/fn`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.load_empty)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

load_state[source][test]

\n", "\n", "> load_state(**`path`**:`PathOrStr`, **`state`**:`dict`)\n", "\n", "
\n", "\n", "Create a [`LabelLists`](/data_block.html#LabelLists) with empty sets from the serialized `state`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.load_state)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

process[source][test]

\n", "\n", "> process()\n", "\n", "
\n", "\n", "Process the inner datasets. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelLists.process)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": false }, "outputs": [ { "data": { "text/markdown": [ "

process[source][test]

\n", "\n", "> process(**`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***)\n", "\n", "
\n", "\n", "Apply `processor` or `self.processor` to `self`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.process)" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "`processor` is one or more [`PreProcessor`](/data_block.html#PreProcessor) objects. \n", "Behind the scenes, all the objects in `processor` are put into a list and applied in turn to an [`ItemList`](/data_block.html#ItemList) or an object of one of its subclasses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helper functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_files[source][test]

\n", "\n", "> get_files(**`path`**:`PathOrStr`, **`extensions`**:`StrList`=***`None`***, **`recurse`**:`bool`=***`False`***, **`include`**:`OptStrList`=***`None`***, **`presort`**:`bool`=***`False`***) → `FilePathList`\n", "\n", "
\n", "\n", "Return list of files in `path` that have a suffix in `extensions`; optionally [`recurse`](/core.html#recurse). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(get_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be more precise, this function returns a list of FilePath objects for the files in `path` that have a suffix in `extensions`; hidden folders and files are ignored. If `recurse=True`, files in subfolders are included as well; `include` selects the particular folders to search.\n", "\n", "Inside [`get_files`](/data_block.html#get_files), there is [`_get_files`](/data_block.html#_get_files), which turns all filenames inside `f` from directory `parent/p` into a list of FilePath objects. All filenames must have a suffix in `extensions`. All hidden files are ignored." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_data = untar_data(URLs.MNIST_TINY) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/models'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_data.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `recurse=False`, no subfolder files are made available."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_FilePath_noRecurse = get_files(path_data) \n", "list_FilePath_noRecurse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `recurse=True`, all subfolder files are made available, except hidden files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/valid/7/9294.png')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_FilePath_recurse = get_files(path_data, recurse=True)\n", "list_FilePath_recurse[:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train/3/7263.png'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/train/3/7288.png')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_FilePath_recurse[-2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `extensions=['.csv']`, only files with the suffix of `.csv` are made available." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/labels.csv'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/history.csv')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_FilePath_recurse_csv = get_files(path_data, recurse=True, extensions=['.csv'])\n", "list_FilePath_recurse_csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `include=['test']`, only files in `path_data` and its subfolder `test` are made available." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/4605.png'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/617.png'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/205.png')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_FilePath_include = get_files(path_data, recurse=True, extensions=['.png','.jpg','.jpeg'],\n", " include=['test'])\n", "list_FilePath_include[:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/1605.png'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/2642.png'),\n", " PosixPath('/Users/Natsume/.fastai/data/mnist_tiny/test/5071.png')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_FilePath_include[-3:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source][test]

\n", "\n", "> new(**`items`**:`Iterator`\\[`T_co`\\], **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create a new [`ItemList`](/data_block.html#ItemList) from `items`, keeping the same attributes. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source][test]

\n", "\n", "> new(**`x`**, **`y`**, **\\*\\*`kwargs`**) → `LabelList`\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source][test]

\n", "\n", "> get(**`i`**)\n", "\n", "
\n", "\n", "Subclass if you want to customize how to create item `i` from `self.items`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

predict[source][test]

\n", "\n", "> predict(**`res`**)\n", "\n", "
\n", "\n", "Delegates predict call on `res` to `self.y`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.predict)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source][test]

\n", "\n", "> new(**`items`**:`Iterator`\\[`T_co`\\], **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create a new [`ItemList`](/data_block.html#ItemList) from `items`, keeping the same attributes. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one[source][test]

\n", "\n", "> process_one(**`item`**:[`ItemBase`](/core.html#ItemBase), **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***)\n", "\n", "
×

Tests found for process_one:

Some other tests where process_one is used:

  • pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

\n", "\n", "Apply `processor` or `self.processor` to `item`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.process_one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one[source][test]

\n", "\n", "> process_one(**`item`**)\n", "\n", "
×

Tests found for process_one:

Some other tests where process_one is used:

  • pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryProcessor.process_one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source][test]

\n", "\n", "> get(**`i`**)\n", "\n", "
\n", "\n", "Subclass if you want to customize how to create item `i` from `self.items`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one[source][test]

\n", "\n", "> process_one(**`item`**)\n", "\n", "
Tests found for process_one:

  • pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
  • pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.process_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It basically converts `item` which is a category name to an index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`classes`: a list of unique and sorted labels; \n", "It creates the inner mapping from category name to index (stored in `c2i`) from the `classes`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

create_classes[source][test]

\n", "\n", "> create_classes(**`classes`**)\n", "\n", "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryProcessor.create_classes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source][test]

\n", "\n", "> get(**`i`**)\n", "\n", "

\n", "\n", "Subclass if you want to customize how to create item `i` from `self.items`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new[source][test]

\n", "\n", "> new(**`items`**:`Iterator`\\[`T_co`\\], **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "

\n", "\n", "Create a new [`ItemList`](/data_block.html#ItemList) from `items`, keeping the same attributes. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source][test]

\n", "\n", "> reconstruct(**`t`**)\n", "\n", "

\n", "\n", "Reconstruct one of the underlying item for its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FloatList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

analyze_pred[source][test]

\n", "\n", "> analyze_pred(**`pred`**, **`thresh`**:`float`=***`0.5`***)\n", "\n", "

\n", "\n", "Called on `pred` before `reconstruct` for additional preprocessing. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList.analyze_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source][test]

\n", "\n", "> reconstruct(**`t`**)\n", "\n", "

\n", "\n", "Reconstruct one of the underlying item for its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(MultiCategoryList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source][test]

\n", "\n", "> reconstruct(**`t`**)\n", "\n", "

\n", "\n", "Reconstruct one of the underlying item for its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

analyze_pred[source][test]

\n", "\n", "> analyze_pred(**`pred`**, **`thresh`**:`float`=***`0.5`***)\n", "\n", "

\n", "\n", "Called on `pred` before `reconstruct` for additional preprocessing. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CategoryList.analyze_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source][test]

\n", "\n", "> reconstruct(**`t`**:`Tensor`, **`x`**:`Tensor`=***`None`***)\n", "\n", "

\n", "\n", "Reconstruct one of the underlying item for its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmptyLabelList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get[source][test]

\n", "\n", "> get(**`i`**)\n", "\n", "

\n", "\n", "Subclass if you want to customize how to create item `i` from `self.items`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmptyLabelList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

databunch[source][test]

\n", "\n", "> databunch(**\\*\\*`kwargs`**)\n", "\n", "
Tests found for databunch:

Some other tests where databunch is used:

  • pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

\n", "\n", "To throw a clear error message when the data wasn't split. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(LabelList.databunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

add[source][test]

\n", "\n", "> add(**`items`**:`ItemList`)\n", "\n", "

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(ItemList.add)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "The data block API", "title": "data_block" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }