{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-label prediction with Planet Amazon dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai import *\n", "from fastai.vision import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Planet dataset isn't available on the [fastai dataset page](https://course.fast.ai/datasets) due to copyright restrictions, but you can download it from Kaggle. Let's see how to do this with the [Kaggle API](https://github.com/Kaggle/kaggle-api), which will be useful if you want to join a competition or use other Kaggle datasets later on.\n", "\n", "First, install the Kaggle API by uncommenting the following line and executing it, or by running it in your terminal. Depending on your platform, you may need to modify the command slightly, either by first running `source activate fastai` (or similar) or by prefixing `pip` with a path. Have a look at how `conda install` is called for your platform in the appropriate *Returning to work* section of https://course-v3.fast.ai/. Depending on your environment, you may also need to append `--user` to the command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! pip install kaggle --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then you need to upload your Kaggle credentials to your instance. Log in to Kaggle and click on your profile picture in the top right corner, then 'My account'. Scroll down until you find a button named 'Create New API Token' and click on it. 
This will trigger the download of a file named 'kaggle.json'.\n", "\n", "Upload this file to the directory this notebook is running in by clicking \"Upload\" on your main Jupyter page, then uncomment and execute the next two commands (or run them in a terminal)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! mkdir -p ~/.kaggle/\n", "# ! mv kaggle.json ~/.kaggle/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You're all set to download the data from the [Planet competition](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space). You **first need to go to its main page and accept its rules**, then run the two cells below (uncomment the shell commands to download and unzip the data). If you get a `403 forbidden` error, it means you haven't accepted the competition rules yet (go to the competition page, click on the *Rules* tab, and scroll to the bottom to find the *accept* button)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/.fastai/data/planet')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = Config.data_path()/'planet'\n", "path.mkdir(parents=True, exist_ok=True)\n", "path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path} \n", "# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train_v2.csv -p {path} \n", "# ! unzip -q -n {path}/train_v2.csv.zip -d {path}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract the contents of the `train-jpg.tar.7z` archive, we'll need 7zip, so uncomment the following line if you need to install it (or run `sudo apt install p7zip-full` in your terminal, which is the package that provides `7za` on Debian and Ubuntu)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! conda install -y -c haasad eidl7zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can unpack the data (uncomment to run; this might take a few minutes to complete)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-label classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike the pets dataset studied in the last lesson, here each picture can have multiple labels. If we take a look at the CSV file containing the labels (`train_v2.csv` here), we see that each `image_name` is associated with several tags separated by spaces." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | image_name | tags |\n",
"|---|---|---|\n",
"| 0 | train_0 | haze primary |\n",
"| 1 | train_1 | agriculture clear primary water |\n",
"| 2 | train_2 | clear primary |\n",
"| 3 | train_3 | clear primary |\n",
"| 4 | train_4 | agriculture clear habitation primary road |\n", "