# Training and Evaluation with TensorFlow 2

[![TensorFlow 2.2](https://img.shields.io/badge/TensorFlow-2.2-FF6F00?logo=tensorflow)](https://github.com/tensorflow/tensorflow/releases/tag/v2.2.0)
[![Python 3.6](https://img.shields.io/badge/Python-3.6-3776AB)](https://www.python.org/downloads/release/python-360/)

This page walks through the steps required to train an object detection model.
It assumes the reader has completed the following prerequisites:

1.  The TensorFlow Object Detection API has been installed as documented in the
    [installation instructions](tf2.md#installation).
2.  A valid data set has been created. See [this page](preparing_inputs.md) for
    instructions on how to generate a dataset for the PASCAL VOC challenge or
    the Oxford-IIIT Pet dataset.

## Recommended Directory Structure for Training and Evaluation

```bash
.
├── data/
│   ├── eval-00000-of-00001.tfrecord
│   ├── label_map.txt
│   ├── train-00000-of-00002.tfrecord
│   └── train-00001-of-00002.tfrecord
└── models/
    └── my_model_dir/
        ├── eval/                 # Created by evaluation job.
        ├── my_model.config
        ├── model_ckpt-100-data@1 # Created by training job.
        ├── model_ckpt-100-index  # Created by training job.
        └── checkpoint            # Created by training job.
```

## Writing a model configuration

Please refer to the sample [TF2 configs](../configs/tf2) and
[configuring jobs](configuring_jobs.md) to create a model config.

### Model Parameter Initialization

While optional, it is highly recommended that users utilize classification or
object detection checkpoints, since training an object detector from scratch
can take days. To speed up the training process, re-use the feature extractor
parameters from a pre-existing image classification or object detection
checkpoint. The `train_config` section in the config provides two fields to
specify pre-existing checkpoints:

*   `fine_tune_checkpoint`: a path prefix to the pre-existing checkpoint
    (e.g. "/usr/home/username/checkpoint/model.ckpt-#####").
*   `fine_tune_checkpoint_type`: with value `classification` or `detection`,
    depending on the type of the checkpoint.

A list of classification checkpoints can be found
[here](tf2_classification_zoo.md), and a list of detection checkpoints can be
found [here](tf2_detection_zoo.md).
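As a concrete illustration, the sketch below downloads a COCO-trained
checkpoint from the detection zoo and shows the prefix that
`fine_tune_checkpoint` should point at. The model name and URL are examples;
copy the actual link for your chosen model from the zoo page.

```bash
# Download and unpack a pre-trained checkpoint from the TF2 Detection Zoo.
# The model/URL below are illustrative; take the real link from the zoo page.
wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
tar -xzf ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz

# fine_tune_checkpoint expects the checkpoint *prefix*, not a single file:
#   fine_tune_checkpoint: "ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"
#   fine_tune_checkpoint_type: "detection"
```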
## Local

### Training

A local training job can be run with the following command:

```bash
# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --alsologtostderr
```

where `${PIPELINE_CONFIG_PATH}` points to the pipeline config and
`${MODEL_DIR}` points to the directory in which training checkpoints and events
will be written.

### Evaluation

A local evaluation job can be run with the following command:

```bash
# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
CHECKPOINT_DIR=${MODEL_DIR}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr
```

where `${CHECKPOINT_DIR}` points to the directory with checkpoints produced by
the training job. Evaluation events are written to `${MODEL_DIR}/eval`.

## Google Cloud VM

The TensorFlow Object Detection API supports training on Google Cloud with Deep
Learning GPU VMs and TPU VMs. This section documents instructions on how to
train and evaluate your model on them.

The reader should complete the following prerequisites:

1.  The reader has created and configured a GPU VM or TPU VM on Google Cloud
    with TensorFlow >= 2.2.0. See the
    [TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) and
    [GPU quickstart](https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance#with-one-or-more-gpus).
2.  The reader has installed the TensorFlow Object Detection API on the VM as
    documented in the [installation instructions](tf2.md#installation).
3.  The reader has a valid data set stored in a Google Cloud Storage bucket or
    locally on the VM. See [this page](preparing_inputs.md) for instructions on
    how to generate a dataset for the PASCAL VOC challenge or the Oxford-IIIT
    Pet dataset.

Additionally, it is recommended that users test their job by running training
and evaluation jobs for a few iterations
[locally on their own machines](#local).

### Training

Training on GPU or TPU VMs is similar to local training. It can be launched
using the following command:

```bash
# From the tensorflow/models/research/ directory
# --use_tpu and --tpu_name are only required for TPU training.
USE_TPU=true
TPU_NAME="MY_TPU_NAME"
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --use_tpu=${USE_TPU} \
    --tpu_name=${TPU_NAME} \
    --alsologtostderr
```

where `${PIPELINE_CONFIG_PATH}` points to the pipeline config and
`${MODEL_DIR}` points to the root directory for the files produced. Training
checkpoints and events are written to `${MODEL_DIR}`. Note that the paths can
be either local paths or paths to a GCS bucket.

### Evaluation

Evaluation is only supported on GPU. Similar to local evaluation, it can be
launched using the following command:

```bash
# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
CHECKPOINT_DIR=${MODEL_DIR}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr
```

where `${CHECKPOINT_DIR}` points to the directory with checkpoints produced by
the training job. Evaluation events are written to `${MODEL_DIR}/eval`. Note
that the paths can be either local paths or paths to a GCS bucket.

## Google Cloud AI Platform

The TensorFlow Object Detection API also supports training on Google Cloud AI
Platform. This section documents instructions on how to train and evaluate your
model using Cloud ML.

The reader should complete the following prerequisites:

1.  The reader has created and configured a project on Google Cloud AI
    Platform. See the
    [Using GPUs](https://cloud.google.com/ai-platform/training/docs/using-gpus)
    and
    [Using TPUs](https://cloud.google.com/ai-platform/training/docs/using-tpus)
    guides.
2.  The reader has a valid data set stored in a Google Cloud Storage bucket
    (a `gsutil` sketch for staging files is shown after this list). See
    [this page](preparing_inputs.md) for instructions on how to generate a
    dataset for the PASCAL VOC challenge or the Oxford-IIIT Pet dataset.

Additionally, it is recommended that users test their job by running training
and evaluation jobs for a few iterations
[locally on their own machines](#local).
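Since Cloud AI Platform jobs read their inputs from Cloud Storage, a typical
preparatory step is staging the dataset and pipeline config in a bucket. A
minimal sketch using `gsutil` is shown below; `${YOUR_GCS_BUCKET}` is a
placeholder for your own bucket name, and the local paths assume the
recommended directory structure above.

```bash
# Stage TFRecords, the label map, and the pipeline config in Cloud Storage.
# ${YOUR_GCS_BUCKET} is a placeholder; substitute your own bucket name.
gsutil -m cp data/*.tfrecord gs://${YOUR_GCS_BUCKET}/data/
gsutil cp data/label_map.txt gs://${YOUR_GCS_BUCKET}/data/
gsutil cp models/my_model_dir/my_model.config gs://${YOUR_GCS_BUCKET}/models/my_model_dir/
```

Remember to update the input paths inside the pipeline config (e.g.
`input_path` and `label_map_path`) to the corresponding `gs://` locations
before submitting a job.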
### Training with multiple GPUs

A user can start a training job on Cloud AI Platform following the
[custom containers training instructions](https://cloud.google.com/ai-platform/training/docs/custom-containers-training).

```bash
git clone https://github.com/tensorflow/models.git

# From the tensorflow/models/research/ directory
cp object_detection/dockerfiles/tf2_ai_platform/Dockerfile .
docker build -t gcr.io/${DOCKER_IMAGE_URI} .
docker push gcr.io/${DOCKER_IMAGE_URI}
```

```bash
gcloud ai-platform jobs submit training object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --region us-central1 \
    --master-machine-type n1-highcpu-16 \
    --master-accelerator count=8,type=nvidia-tesla-v100 \
    --master-image-uri gcr.io/${DOCKER_IMAGE_URI} \
    --scale-tier CUSTOM \
    -- \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
```

where `gs://${MODEL_DIR}` specifies the Google Cloud Storage directory to which
training checkpoints and events are written, `gs://${PIPELINE_CONFIG_PATH}`
points to the pipeline configuration stored on Google Cloud Storage, and
`gcr.io/${DOCKER_IMAGE_URI}` points to the Docker image stored in Google
Container Registry.

Users can monitor the progress of their training job on the
[ML Engine Dashboard](https://console.cloud.google.com/ai-platform/jobs).

### Training with TPU

Launching a training job with a TPU compatible pipeline config requires using
the following command:

```bash
# From the tensorflow/models/research/ directory
cp object_detection/packages/tf2/setup.py .
gcloud ai-platform jobs submit training `whoami`_object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --package-path ./object_detection \
    --module-name object_detection.model_main_tf2 \
    --runtime-version 2.1 \
    --python-version 3.6 \
    --scale-tier BASIC_TPU \
    --region us-central1 \
    -- \
    --use_tpu true \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
```

As before, `pipeline_config_path` points to the pipeline configuration stored
on Google Cloud Storage (but it must now be a TPU compatible config).

### Evaluating with GPU

Evaluation jobs run on a single machine. Run the following command to start the
evaluation job:

```bash
gcloud ai-platform jobs submit training object_detection_eval_`date +%m_%d_%Y_%H_%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    --master-image-uri gcr.io/${DOCKER_IMAGE_URI} \
    -- \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} \
    --checkpoint_dir=gs://${MODEL_DIR}
```

where `gs://${MODEL_DIR}` points to the directory on Google Cloud Storage where
training checkpoints are saved, `gs://${PIPELINE_CONFIG_PATH}` points to where
the model configuration file is stored on Google Cloud Storage, and
`gcr.io/${DOCKER_IMAGE_URI}` points to the Docker image stored in Google
Container Registry. Evaluation events are written to `gs://${MODEL_DIR}/eval`.

Typically one starts an evaluation job concurrently with the training job. Note
that we do not support running evaluation on TPU.

## Running TensorBoard

Progress for training and eval jobs can be inspected using TensorBoard. If
using the recommended directory structure, TensorBoard can be run using the
following command:

```bash
tensorboard --logdir=${MODEL_DIR}
```

where `${MODEL_DIR}` points to the directory that contains the train and eval
directories. Please note it may take TensorBoard a couple of minutes to
populate with data.
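For Cloud jobs that write to a GCS bucket, TensorBoard can also read the event
files directly from Cloud Storage. The sketch below assumes the local
environment is authenticated to Google Cloud:

```bash
# Read training/eval event files directly from Cloud Storage.
# Assumes GCP credentials are available, e.g. via:
#   gcloud auth application-default login
tensorboard --logdir=gs://${MODEL_DIR}
```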