{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Image Classification from scratch with TPUs on Cloud ML Engine using ResNet\n", "\n", "This notebook demonstrates how to do image classification from scratch on a flowers dataset using TPUs and the resnet trainer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID\n", "BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME\n", "REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1\n", "\n", "# do not change these\n", "os.environ['PROJECT'] = PROJECT\n", "os.environ['BUCKET'] = BUCKET\n", "os.environ['REGION'] = REGION\n", "os.environ['TFVERSION'] = '1.9'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gcloud config set project $PROJECT\n", "gcloud config set compute/region $REGION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert JPEG images to TensorFlow Records\n", "\n", "My dataset consists of JPEG images in Google Cloud Storage. 
I have two CSV files that are formatted as follows:\n", " image-name, category\n", "\n", "Instead of reading the images from JPEG each time, we'll convert the JPEG data and store it as TF Records.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | head -5 > /tmp/input.csv\n", "cat /tmp/input.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | sed 's/,/ /g' | awk '{print $2}' | sort | uniq > /tmp/labels.txt\n", "cat /tmp/labels.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clone the TPU repo\n", "\n", "Let's git clone the repo and get the preprocessing and model files. The model code has imports of the form:\n", "
\n", "import resnet_model as model_lib\n", "\n", "We will need to change this to:\n", "
\n", "from . import resnet_model as model_lib\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile copy_resnet_files.sh\n", "#!/bin/bash\n", "rm -rf tpu\n", "git clone https://github.com/tensorflow/tpu\n", "cd tpu\n", "TFVERSION=$1\n", "echo \"Switching to version r$TFVERSION\"\n", "git checkout r$TFVERSION\n", "cd ..\n", " \n", "MODELCODE=tpu/models/official/resnet\n", "OUTDIR=mymodel\n", "rm -rf $OUTDIR\n", "\n", "# preprocessing\n", "cp -r imgclass $OUTDIR # brings in setup.py and __init__.py\n", "cp tpu/tools/datasets/jpeg_to_tf_record.py $OUTDIR/trainer/preprocess.py\n", "\n", "# model: rewrite absolute imports of sibling modules as relative imports\n", "for FILE in $(ls -p $MODELCODE | grep -v /); do\n", " CMD=\"cat $MODELCODE/$FILE \"\n", " for f2 in $(ls -p $MODELCODE | grep -v /); do\n", " # strip only a trailing .py extension to get the module name\n", " MODULE=`echo $f2 | sed 's/\\.py$//'`\n", " CMD=\"$CMD | sed 's/^import ${MODULE}/from . import ${MODULE}/g' \"\n", " done\n", " CMD=\"$CMD > $OUTDIR/trainer/$FILE\"\n", " eval $CMD\n", "done\n", "find $OUTDIR\n", "echo \"Finished copying files into $OUTDIR\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!bash ./copy_resnet_files.sh $TFVERSION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enable TPU service account\n", "\n", "Allow Cloud ML Engine to access the TPU and bill to your project." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile enable_tpu_mlengine.sh\n", "SVC_ACCOUNT=$(curl -H \"Authorization: Bearer $(gcloud auth print-access-token)\" \\\n", " https://ml.googleapis.com/v1/projects/${PROJECT}:getConfig \\\n", " | grep tpuServiceAccount | tr '\"' ' ' | awk '{print $3}' )\n", "echo \"Enabling TPU service account $SVC_ACCOUNT to act as Cloud ML Service Agent\"\n", "gcloud projects add-iam-policy-binding $PROJECT \\\n", " --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent\n", "echo \"Done\"" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!bash ./enable_tpu_mlengine.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Try preprocessing locally" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel\n", " \n", "rm -rf /tmp/out\n", "python -m trainer.preprocess \\\n", " --train_csv /tmp/input.csv \\\n", " --validation_csv /tmp/input.csv \\\n", " --labels_file /tmp/labels.txt \\\n", " --project_id $PROJECT \\\n", " --output_dir /tmp/out --runner=DirectRunner" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -l /tmp/out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now run it over full training and evaluation datasets. This will happen in Cloud Dataflow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel\n", "gsutil -m rm -rf gs://${BUCKET}/tpu/resnet/data\n", "python -m trainer.preprocess \\\n", " --train_csv gs://cloud-ml-data/img/flower_photos/train_set.csv \\\n", " --validation_csv gs://cloud-ml-data/img/flower_photos/eval_set.csv \\\n", " --labels_file /tmp/labels.txt \\\n", " --project_id $PROJECT \\\n", " --output_dir gs://${BUCKET}/tpu/resnet/data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above preprocessing step will take 15-20 minutes. Wait for the job to finish before you proceed. Navigate to [Cloud Dataflow section of GCP web console](https://console.cloud.google.com/dataflow) to monitor job progress. You will see something like this
\n", "\n", "Alternatively, you can copy the already preprocessed files from the `cloud-training-demos` bucket into your own bucket:\n", "\n", "    gsutil -m cp gs://cloud-training-demos/tpu/resnet/data/* gs://${BUCKET}/tpu/resnet/copied_data\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil ls gs://${BUCKET}/tpu/resnet/data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train on the Cloud" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# Print the dataset sizes in the form of flags for the training job below\n", "echo -n \"--num_train_images=$(gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | wc -l) \"\n", "echo -n \"--num_eval_images=$(gsutil cat gs://cloud-ml-data/img/flower_photos/eval_set.csv | wc -l) \"\n", "echo \"--num_label_classes=$(cat /tmp/labels.txt | wc -l)\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "TOPDIR=gs://${BUCKET}/tpu/resnet\n", "OUTDIR=${TOPDIR}/trained\n", "JOBNAME=imgclass_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR # Comment out this line to resume training from the previous run\n", "# The num_train_images, num_eval_images and num_label_classes values below\n", "# should match the counts printed by the previous cell\n", "gcloud ml-engine jobs submit training $JOBNAME \\\n", " --region=$REGION \\\n", " --module-name=trainer.resnet_main \\\n", " --package-path=$(pwd)/mymodel/trainer \\\n", " --job-dir=$OUTDIR \\\n", " --staging-bucket=gs://$BUCKET \\\n", " --scale-tier=BASIC_TPU \\\n", " --runtime-version=$TFVERSION --python-version=3.5 \\\n", " -- \\\n", " --data_dir=${TOPDIR}/data \\\n", " --model_dir=${OUTDIR} \\\n", " --resnet_depth=18 \\\n", " --train_batch_size=128 --eval_batch_size=32 --skip_host_call=True \\\n", " --steps_per_eval=250 --train_steps=1000 \\\n", " --num_train_images=3300 --num_eval_images=370 --num_label_classes=5 \\\n", " --export_dir=${OUTDIR}/export" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above training job will take 15-20 minutes. \n", "Wait for the job to finish before you proceed. 
\n", "Navigate to the [Cloud ML Engine section of the GCP web console](https://console.cloud.google.com/mlengine) \n", "to monitor job progress.\n", "\n", "The model should finish with 80-83% accuracy (results will vary):\n", "```\n", "Eval results: {'global_step': 1000, 'loss': 0.7359053, 'top_1_accuracy': 0.82954544, 'top_5_accuracy': 1.0}\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can look at the training charts with TensorBoard:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "OUTDIR = 'gs://{}/tpu/resnet/trained/'.format(BUCKET)\n", "from google.datalab.ml import TensorBoard\n", "TensorBoard().start(OUTDIR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 11531 was the pid of the TensorBoard process in the original run; replace it with yours\n", "TensorBoard().stop(11531)\n", "print(\"Stopped TensorBoard\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These were the charts I got (with smoothing set to zero):\n", "
\n", "# Copyright 2018 Google Inc. All Rights Reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" } }, "nbformat": 4, "nbformat_minor": 2 }