In [1]:
# Copyright 2019 Google Inc.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Predicting Movie Review Sentiment with [kpe/bert-for-tf2](https://github.com/kpe/bert-for-tf2)


**BentoML makes moving trained ML models to production easy:**

* Package models trained with **any ML framework** and reproduce them for model serving in production
* **Deploy anywhere** for online API serving or offline batch serving
* High-Performance API model server with *adaptive micro-batching* support
* Central hub for managing models and deployment process via Web UI and APIs
* Modular and flexible design making it *adaptable to your infrastrcuture*

BentoML is a framework for serving, managing, and deploying machine learning models. It is aiming to bridge the gap between Data Science and DevOps, and enable teams to deliver prediction services in a fast, repeatable, and scalable way.

Before reading this example project, be sure to check out the [Getting started guide](https://github.com/bentoml/BentoML/blob/master/guides/quick-start/bentoml-quick-start-guide.ipynb) to learn about the basic concepts in BentoML.


A modification of https://github.com/kpe/bert-for-tf2/blob/master/examples/gpu_movie_reviews.ipynb,
which is a modification of https://github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb using the Tensorflow 2.0 Keras

![Impression](https://www.google-analytics.com/collect?v=1&tid=UA-112879361-3&cid=555&t=event&ec=tensorflow&ea=tensorflow_2_bert_movie_review&dt=tensorflow_2_bert_movie_review)

In [11]:
!pip install -q bentoml "tqdm==4.32.2" "bert-for-tf2==0.14.5"

In [2]:
import os
import sys
import math
import datetime
from tqdm import tqdm
import pandas as pd
import numpy as np
import tensorflow as tf

# tf.config.set_visible_devices([], 'GPU') # disable GPU

In [3]:
print("Tensorflow: ", tf.__version__)
print("Python: ", sys.version)
print("GPU: ", tf.test.is_gpu_available())
assert sys.version_info.major == 3 and sys.version_info.minor == 6 # required by clipper benchmark

Tensorflow: 2.1.0
Python: 3.6.10 |Anaconda, Inc.| (default, Jan 7 2020, 21:14:29) 
[GCC 7.3.0]
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
GPU: True


In [5]:
import bert
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from bert.tokenization.bert_tokenization import FullTokenizer
from tensorflow import keras
import os
import re

In [6]:
from tensorflow import keras
import os
import re

# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
 data = {}
 data["sentence"] = []
 data["sentiment"] = []
 for file_path in tqdm(os.listdir(directory), desc=os.path.basename(directory)):
 with tf.io.gfile.GFile(os.path.join(directory, file_path), "r") as f:
 data["sentence"].append(f.read())
 data["sentiment"].append(re.match("\d+_(\d+)\.txt", file_path).group(1))
 return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
 pos_df = load_directory_data(os.path.join(directory, "pos"))
 neg_df = load_directory_data(os.path.join(directory, "neg"))
 pos_df["polarity"] = 1
 neg_df["polarity"] = 0
 return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
 dataset = tf.keras.utils.get_file(
 fname="aclImdb.tar.gz", 
 origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
 extract=True)

 train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
 "aclImdb", "train"))
 test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
 "aclImdb", "test"))

 return train_df, test_df


Let's use the `MovieReviewData` class below, to prepare/encode 
the data for feeding into our BERT model, by:
 - tokenizing the text
 - trim or pad it to a `max_seq_len` length
 - append the special tokens `[CLS]` and `[SEP]`
 - convert the string tokens to numerical `ID`s using the original model's token encoding from `vocab.txt`

In [7]:
import bert
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights


class MovieReviewData:
 DATA_COLUMN = "sentence"
 LABEL_COLUMN = "polarity"

 def __init__(self, tokenizer: FullTokenizer, sample_size=None, max_seq_len=1024):
 self.tokenizer = tokenizer
 self.sample_size = sample_size
 self.max_seq_len = 0
 train, test = download_and_load_datasets()
 
 train, test = map(lambda df: df.reindex(df[MovieReviewData.DATA_COLUMN].str.len().sort_values().index), 
 [train, test])
 
 if sample_size is not None:
 train, test = train.head(sample_size), test.head(sample_size)
 # train, test = map(lambda df: df.sample(sample_size), [train, test])
 
 ((self.train_x, self.train_y),
 (self.test_x, self.test_y)) = map(self._prepare, [train, test])

 print("max seq_len", self.max_seq_len)
 self.max_seq_len = min(self.max_seq_len, max_seq_len)
 ((self.train_x, self.train_x_token_types),
 (self.test_x, self.test_x_token_types)) = map(self._pad, 
 [self.train_x, self.test_x])

 def _prepare(self, df):
 x, y = [], []
 with tqdm(total=df.shape[0], unit_scale=True) as pbar:
 for ndx, row in df.iterrows():
 text, label = row[MovieReviewData.DATA_COLUMN], row[MovieReviewData.LABEL_COLUMN]
 tokens = self.tokenizer.tokenize(text)
 tokens = ["[CLS]"] + tokens + ["[SEP]"]
 token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
 self.max_seq_len = max(self.max_seq_len, len(token_ids))
 x.append(token_ids)
 y.append(int(label))
 pbar.update()
 return np.array(x), np.array(y)

 def _pad(self, ids):
 x, t = [], []
 token_type_ids = [0] * self.max_seq_len
 for input_ids in ids:
 input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
 input_ids = input_ids + [0] * (self.max_seq_len - len(input_ids))
 x.append(np.array(input_ids))
 t.append(token_type_ids)
 return np.array(x), np.array(t)


## A tweak

Because of a `tf.train.load_checkpoint` limitation requiring list permissions on the google storage bucket, we need to copy the pre-trained BERT weights locally.

In [8]:
asset_path = 'asset'
bert_model_name = "uncased_L-12_H-768_A-12"
bert_ckpt_dir = os.path.join(asset_path, bert_model_name)
bert_ckpt_file = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")

In [9]:
%%bash

if [ ! -f asset/uncased_L-12_H-768_A-12.zip ]; then
 curl -o asset/uncased_L-12_H-768_A-12.zip --create-dirs https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
fi
if [ ! -d asset/uncased_L-12_H-768_A-12 ]; then
 unzip asset/uncased_L-12_H-768_A-12.zip -d asset/
fi

Archive: asset/uncased_L-12_H-768_A-12.zip
 creating: asset/uncased_L-12_H-768_A-12/
 inflating: asset/uncased_L-12_H-768_A-12/bert_model.ckpt.meta 
 inflating: asset/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001 
 inflating: asset/uncased_L-12_H-768_A-12/vocab.txt 
 inflating: asset/uncased_L-12_H-768_A-12/bert_model.ckpt.index 
 inflating: asset/uncased_L-12_H-768_A-12/bert_config.json 


 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 388M 100 388M 0 0 3564k 0 0:01:51 0:01:51 --:--:-- 3948k


# Preparing the Data

Now let's fetch and prepare the data by taking the first `max_seq_len` tokenens after tokenizing with the BERT tokenizer, und use `sample_size` examples for both training and testing.

To keep training fast, we'll take a sample of about 2500 train and test examples, respectively, and use the first 128 tokens only (transformers memory and computation requirements scale quadraticly with the sequence length - so with a TPU you might use `max_seq_len=512`, but on a GPU this would be too slow, and you will have to use a very small `batch_size`s to fit the model into the GPU memory).

In [10]:
%%time

tokenizer = FullTokenizer(vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt"))
data = MovieReviewData(tokenizer, 
 sample_size=10*128*2, #10*128*2
 max_seq_len=128)

pos: 100%|██████████| 12500/12500 [00:00<00:00, 19259.77it/s]
neg: 100%|██████████| 12500/12500 [00:00<00:00, 19675.69it/s]
pos: 100%|██████████| 12500/12500 [00:00<00:00, 19420.16it/s]
neg: 100%|██████████| 12500/12500 [00:00<00:00, 18046.21it/s]
100%|██████████| 2.56k/2.56k [00:02<00:00, 879it/s] 
100%|██████████| 2.56k/2.56k [00:02<00:00, 866it/s] 


max seq_len 178
CPU times: user 20 s, sys: 5.05 s, total: 25.1 s
Wall time: 25.5 s


In [9]:
print(" train_x", data.train_x.shape)
print("train_x_token_types", data.train_x_token_types.shape)
print(" train_y", data.train_y.shape)

print(" test_x", data.test_x.shape)

print(" max_seq_len", data.max_seq_len)

 train_x (2560, 128)
train_x_token_types (2560, 128)
 train_y (2560,)
 test_x (2560, 128)
 max_seq_len 128


## Adapter BERT

If we decide to use [adapter-BERT](https://arxiv.org/abs/1902.00751) we need some helpers for freezing the original BERT layers.

In [16]:
def flatten_layers(root_layer):
 if isinstance(root_layer, keras.layers.Layer):
 yield root_layer
 for layer in root_layer._layers:
 for sub_layer in flatten_layers(layer):
 yield sub_layer


def freeze_bert_layers(l_bert):
 """
 Freezes all but LayerNorm and adapter layers - see arXiv:1902.00751.
 """
 for layer in flatten_layers(l_bert):
 if layer.name in ["LayerNorm", "adapter-down", "adapter-up"]:
 layer.trainable = True
 elif len(layer._layers) == 0:
 layer.trainable = False
 l_bert.embeddings_layer.trainable = False


def create_learning_rate_scheduler(max_learn_rate=5e-5,
 end_learn_rate=1e-7,
 warmup_epoch_count=10,
 total_epoch_count=90):

 def lr_scheduler(epoch):
 if epoch < warmup_epoch_count:
 res = (max_learn_rate/warmup_epoch_count) * (epoch + 1)
 else:
 res = max_learn_rate*math.exp(
 math.log(end_learn_rate/max_learn_rate)*(epoch-warmup_epoch_count+1)/(total_epoch_count-warmup_epoch_count+1))
 return float(res)
 learning_rate_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_scheduler, verbose=1)

 return learning_rate_scheduler


# Creating a model

Now let's create a classification model using [adapter-BERT](https//arxiv.org/abs/1902.00751), which is clever way of reducing the trainable parameter count, by freezing the original BERT weights, and adapting them with two FFN bottlenecks (i.e. `adapter_size` bellow) in every BERT layer.

**N.B.** The commented out code below show how to feed a `token_type_ids`/`segment_ids` sequence (which is not needed in our case).

In [13]:
def create_model(max_seq_len, adapter_size=64):
 """Creates a classification model."""

 #adapter_size = 64 # see - arXiv:1902.00751

 # create the bert layer
 with tf.io.gfile.GFile(bert_config_file, "r") as reader:
 bc = StockBertConfig.from_json_string(reader.read())
 bert_params = map_stock_config_to_params(bc)
 bert_params.adapter_size = adapter_size
 bert = BertModelLayer.from_params(bert_params, name="bert")

 input_ids = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="input_ids")
 # token_type_ids = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="token_type_ids")
 # output = bert([input_ids, token_type_ids])
 output = bert(input_ids)

 print("bert shape", output.shape)
 cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(output)
 cls_out = keras.layers.Dropout(0.5)(cls_out)
 logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
 logits = keras.layers.Dropout(0.5)(logits)
 logits = keras.layers.Dense(units=2, activation="softmax")(logits)

 # model = keras.Model(inputs=[input_ids, token_type_ids], outputs=logits)
 # model.build(input_shape=[(None, max_seq_len), (None, max_seq_len)])
 model = keras.Model(inputs=input_ids, outputs=logits)
 model.build(input_shape=(None, max_seq_len))

 # load the pre-trained model weights
 load_stock_weights(bert, bert_ckpt_file)

 # freeze weights if adapter-BERT is used
 if adapter_size is not None:
 freeze_bert_layers(bert)

 model.compile(optimizer=keras.optimizers.Adam(),
 loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
 metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])

 model.summary()

 return model

# Train

In [14]:
adapter_size = None # use None to fine-tune all of BERT
model = create_model(data.max_seq_len, adapter_size=adapter_size)

bert shape (None, 128, 768)
Done loading 196 BERT weights from: asset/uncased_L-12_H-768_A-12/bert_model.ckpt into (prefix:bert). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Unused weights from checkpoint: 
	bert/embeddings/token_type_embeddings
	bert/pooler/dense/bias
	bert/pooler/dense/kernel
	cls/predictions/output_bias
	cls/predictions/transform/LayerNorm/beta
	cls/predictions/transform/LayerNorm/gamma
	cls/predictions/transform/dense/bias
	cls/predictions/transform/dense/kernel
	cls/seq_relationship/output_bias
	cls/seq_relationship/output_weights
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param # 
input_ids (InputLayer) [(None, 128)] 0 
_________________________________________________________________
bert (BertModelLayer) (None, 128, 768) 108890112 
_________________________________________________________________
lambda (Lambda) (None, 768) 0 
_________________

In [21]:
%%time

log_dir = ".log/movie_reviews/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%s")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=log_dir)

total_epoch_count = 2
model.fit(x=data.train_x, y=data.train_y,
 validation_split=0.1,
 batch_size=12,
 shuffle=True,
 epochs=total_epoch_count,
 callbacks=[create_learning_rate_scheduler(max_learn_rate=1e-5,
 end_learn_rate=1e-7,
 warmup_epoch_count=20,
 total_epoch_count=total_epoch_count),
 keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),
 tensorboard_callback])

Train on 2304 samples, validate on 256 samples

Epoch 00001: LearningRateScheduler reducing learning rate to 5.000000000000001e-07.
Epoch 1/2

Epoch 00002: LearningRateScheduler reducing learning rate to 1.0000000000000002e-06.
Epoch 2/2
CPU times: user 4min 31s, sys: 43.2 s, total: 5min 14s
Wall time: 4min 2s




In [22]:
model.save_weights('./movie_reviews.h5', overwrite=True)

In [23]:
%%time

_, train_acc = model.evaluate(data.train_x, data.train_y)
_, test_acc = model.evaluate(data.test_x, data.test_y)

print("train acc", train_acc)
print(" test acc", test_acc)

train acc 0.92695314
 test acc 0.89921874
CPU times: user 1min 7s, sys: 222 ms, total: 1min 8s
Wall time: 1min 7s


# Evaluation

To evaluate the trained model, let's load the saved weights in a new model instance, and evaluate.

In [12]:
model = create_model(data.max_seq_len, adapter_size=None)
model.load_weights("./movie_reviews.h5")

bert shape (None, 128, 768)
Done loading 196 BERT weights from: asset/uncased_L-12_H-768_A-12/bert_model.ckpt into (prefix:bert). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Unused weights from checkpoint: 
	bert/embeddings/token_type_embeddings
	bert/pooler/dense/bias
	bert/pooler/dense/kernel
	cls/predictions/output_bias
	cls/predictions/transform/LayerNorm/beta
	cls/predictions/transform/LayerNorm/gamma
	cls/predictions/transform/dense/bias
	cls/predictions/transform/dense/kernel
	cls/seq_relationship/output_bias
	cls/seq_relationship/output_weights
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param # 
input_ids (InputLayer) [(None, 128)] 0 
_________________________________________________________________
bert (BertModelLayer) (None, 128, 768) 108890112 
_________________________________________________________________
lambda (Lambda) (None, 768) 0 
_________________

In [17]:
%%time 

# _, train_acc = model.evaluate(data.train_x, data.train_y)
_, test_acc = model.evaluate(data.test_x, data.test_y)

# print("train acc", train_acc)
print(" test acc", test_acc)

 test acc 0.9230469
CPU times: user 34.6 s, sys: 113 ms, total: 34.7 s
Wall time: 34.6 s


# Prediction

For prediction, we need to prepare the input text the same way as we did for training - tokenize, adding the special `[CLS]` and `[SEP]` token at begin and end of the token sequence, and pad to match the model input shape.

In [25]:
%%time
CLASSES = ["negative","positive"]
max_seq_len = 128
pred_sentences = [
 "That movie was absolutely awful",
 "The acting was a bit lacking",
 "The film was creative and surprising",
 "Absolutely fantastic!",
]

inputs = pd.DataFrame(pred_sentences)

pred_tokens = map(tokenizer.tokenize, inputs.to_numpy()[:, 0].tolist())
pred_tokens = map(lambda tok: ["[CLS]"] + tok + ["[SEP]"], pred_tokens)
pred_token_ids = list(map(tokenizer.convert_tokens_to_ids, pred_tokens))
pred_token_ids = map(lambda tids: tids + [0] * (max_seq_len-len(tids)), pred_token_ids)
pred_token_ids = np.array(list(pred_token_ids))

res = model(pred_token_ids).numpy().argmax(axis=-1)
[CLASSES[i] for i in res]

CPU times: user 150 ms, sys: 7.46 ms, total: 158 ms
Wall time: 177 ms


['negative', 'negative', 'positive', 'positive']

# Build & Save bentoml service

In [29]:
%%writefile bentoml_service.py

import bentoml
import tensorflow as tf
import numpy as np
import pandas as pd
from typing import List


from bentoml.frameworks.tensorflow import TensorflowSavedModelArtifact
from bentoml.service.artifacts.common import PickleArtifact
from bentoml.adapters import DataframeInput


CLASSES = ["negative","positive"]
max_seq_len = 128

try:
 tf.config.set_visible_devices([], 'GPU') # disable GPU, required when served in docker
except:
 pass


@bentoml.env(pip_packages=['tensorflow', 'bert-for-tf2'])
@bentoml.artifacts([TensorflowSavedModelArtifact('model'), PickleArtifact('tokenizer')])
class BertService(bentoml.BentoService):

 def tokenize(self, inputs: pd.DataFrame):
 tokenizer = self.artifacts.tokenizer
 if isinstance(inputs, pd.DataFrame):
 inputs = inputs.to_numpy()[:, 0].tolist()
 else:
 inputs = inputs.tolist() # for predict_clipper
 pred_tokens = map(tokenizer.tokenize, inputs)
 pred_tokens = map(lambda tok: ["[CLS]"] + tok + ["[SEP]"], pred_tokens)
 pred_token_ids = list(map(tokenizer.convert_tokens_to_ids, pred_tokens))
 pred_token_ids = map(lambda tids: tids + [0] * (max_seq_len - len(tids)), pred_token_ids)
 pred_token_ids = tf.constant(list(pred_token_ids), dtype=tf.int32)
 return pred_token_ids

 @bentoml.api(input=DataframeInput(), mb_max_latency=3000, mb_max_batch_size=20, batch=True)
 def predict(self, inputs: pd.DataFrame) -> List[str]:
 model = self.artifacts.model
 pred_token_ids = self.tokenize(inputs)
 res = model(pred_token_ids).numpy().argmax(axis=-1)
 return [CLASSES[i] for i in res]

Overwriting bentoml_service.py


In [30]:
from bentoml_service import Service

bento_svc = Service()
bento_svc.pack("model", model)
bento_svc.pack("tokenizer", tokenizer)
saved_path = bento_svc.save()

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /tmp/bentoml-temp-r9yxj9ku/Service/artifacts/model_saved_model/assets
[2020-07-28 15:12:13,764] INFO - Detect BentoML installed in development model, copying local BentoML module file to target saved bundle path
running sdist
running egg_info
writing BentoML.egg-info/PKG-INFO
writing dependency_links to BentoML.egg-info/dependency_links.txt
writing entry points to BentoML.egg-info/entry_points.txt
writing requirements to BentoML.egg-info/requires.txt
writing top-level names to BentoML.egg-info/top_level.txt
reading manifest file 'BentoML.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'


no previously-included directories found matching 'e2e_tests'
no previously-included directories found matching 'tests'
no previously-included directories found matching 'benchmark'


writing manifest file 'BentoML.egg-info/SOURCES.txt'
running check
creating BentoML-0.8.3+42.gb8d36b6
creating BentoML-0.8.3+42.gb8d36b6/BentoML.egg-info
creating BentoML-0.8.3+42.gb8d36b6/bentoml
creating BentoML-0.8.3+42.gb8d36b6/bentoml/adapters
creating BentoML-0.8.3+42.gb8d36b6/bentoml/artifact
creating BentoML-0.8.3+42.gb8d36b6/bentoml/cli
creating BentoML-0.8.3+42.gb8d36b6/bentoml/clipper
creating BentoML-0.8.3+42.gb8d36b6/bentoml/configuration
creating BentoML-0.8.3+42.gb8d36b6/bentoml/configuration/__pycache__
creating BentoML-0.8.3+42.gb8d36b6/bentoml/handlers
creating BentoML-0.8.3+42.gb8d36b6/bentoml/marshal
creating BentoML-0.8.3+42.gb8d36b6/bentoml/saved_bundle
creating BentoML-0.8.3+42.gb8d36b6/bentoml/server
creating BentoML-0.8.3+42.gb8d36b6/bentoml/utils
creating BentoML-0.8.3+42.gb8d36b6/bentoml/yatai
creating BentoML-0.8.3+42.gb8d36b6/bentoml/yatai/client
creating BentoML-0.8.3+42.gb8d36b6/bentoml/yatai/deployment
creating BentoML-0.8.3+42.gb8d36b6/bentoml/yatai/dep

In [31]:
print(saved_path)

/home/bentoml/bentoml/repository/Service/20200728151102_B3E065


## REST API Model Serving


To start a REST API model server with the BentoService saved above, use the bentoml serve command:

*Since BERT is a large model, if you met OOM bellow, 
you may need to restart this kernel to release the RAM/GRAM used by training model.*

In [32]:
bentoml_bundle_path = '/home/bentoml/bentoml/repository/Service/20200728151102_B3E065' # saved_path

In [34]:
# Option 1: serve directly
print(f"bentoml serve-gunicorn {bentoml_bundle_path} --port 5000 --enable-microbatch --workers 1")

!bentoml serve-gunicorn {bentoml_bundle_path} --port 5000 --enable-microbatch --workers 1

bentoml serve-gunicorn /home/bentoml/bentoml/repository/Service/20200728151102_B3E065 --port 5000 --enable-microbatch --workers 1
[2020-07-28 15:14:35,761] INFO - Starting BentoML API server in production mode..
[2020-07-28 15:14:36,563] INFO - Running micro batch service on :5000
[2020-07-28 15:14:36 +0800] [2953201] [INFO] Starting gunicorn 20.0.4
[2020-07-28 15:14:36 +0800] [2952697] [INFO] Starting gunicorn 20.0.4
[2020-07-28 15:14:36 +0800] [2952697] [INFO] Listening at: http://0.0.0.0:60577 (2952697)
[2020-07-28 15:14:36 +0800] [2953201] [INFO] Listening at: http://0.0.0.0:5000 (2953201)
[2020-07-28 15:14:36 +0800] [2952697] [INFO] Using worker: sync
[2020-07-28 15:14:36 +0800] [2953201] [INFO] Using worker: aiohttp.worker.GunicornWebWorker
[2020-07-28 15:14:36 +0800] [2953203] [INFO] Booting worker with pid: 2953203
[2020-07-28 15:14:36 +0800] [2953202] [INFO] Booting worker with pid: 2953202
[2020-07-28 15:14:36,874] INFO - Micro batch enabled for API `predict`
[2020-07-28 15:1

If you are running this notebook from Google Colab, you can start the dev server with `--run-with-ngrok` option, to gain acccess to the API endpoint via a public endpoint managed by [ngrok](https://ngrok.com/):

In [None]:
!bentoml serve BertService --run-with-ngrok

Open http://127.0.0.1:5000 to see more information about the REST APIs server in your
browser.


### Send prediction requeset to the REST API server

Navigate to parent directory of the notebook(so you have reference to the `test.jpg` image), and run the following `curl` command to send the image to REST API server and get a prediction result:

```bash
curl -i \
 --request POST \
 --header "Content-Type: application/json" \
 --data '{"0":{"0":"The acting was a bit lacking."}}' \
 localhost:5000/predict
```

### Test the API with requests

In [4]:
%%time
import requests
import pandas as pd

server_url = f"http://127.0.0.1:5000/predict"
method = "POST"
headers = {"content-type": "application/json"}
pred_sentences = ["The acting was a bit lacking."]
data = pd.DataFrame(pred_sentences).to_json()

r = requests.request(method, server_url, headers=headers, data=data)
print(r.content)

b'["negative"]'
CPU times: user 2.36 ms, sys: 3.26 ms, total: 5.62 ms
Wall time: 29.1 s


In [None]:
# Option 2: serve in docker
!cd {bentoml_bundle_path}
IMG_NAME = bentoml_bundle_path.split('/')[-1].lower()

!docker build --quiet -t {IMG_NAME} {bentoml_bundle_path}
# launch docker instances
!docker run -itd -p 5000:5000 {IMG_NAME}:latest --workers 1 --enable-microbatch

## Containerize model server with Docker


One common way of distributing this model API server for production deployment, is via Docker containers. And BentoML provides a convenient way to do that.

Note that docker is **not available in Google Colab**. You will need to download and run this notebook locally to try out this containerization with docker feature.

If you already have docker configured, simply run the follow command to product a docker container serving the IrisClassifier prediction service created above:

In [None]:
!bentoml containerize BertService:latest

In [None]:
!docker run -p 5000:5000 bertservice

## Load saved BentoService

bentoml.load is the API for loading a BentoML packaged model in python:

In [None]:
from bentoml import load

service = load(saved_path)

print(service.predict([["The acting was a bit lacking."]]))

## Launch inference job from CLI

BentoML cli supports loading and running a packaged model from CLI. With the DataframeInput adapter, the CLI command supports reading input Dataframe data from CLI argument or local csv or json files:

In [None]:
!bentoml run BertService:latest predict --input '{"0":{"0":"The acting was a bit lacking."}}'

# Deployment Options

If you are at a small team with limited engineering or DevOps resources, try out automated deployment with BentoML CLI, currently supporting AWS Lambda, AWS SageMaker, and Azure Functions:
- [AWS Lambda Deployment Guide](https://docs.bentoml.org/en/latest/deployment/aws_lambda.html)
- [AWS SageMaker Deployment Guide](https://docs.bentoml.org/en/latest/deployment/aws_sagemaker.html)
- [Azure Functions Deployment Guide](https://docs.bentoml.org/en/latest/deployment/azure_functions.html)

If the cloud platform you are working with is not on the list above, try out these step-by-step guide on manually deploying BentoML packaged model to cloud platforms:
- [AWS ECS Deployment](https://docs.bentoml.org/en/latest/deployment/aws_ecs.html)
- [Google Cloud Run Deployment](https://docs.bentoml.org/en/latest/deployment/google_cloud_run.html)
- [Azure container instance Deployment](https://docs.bentoml.org/en/latest/deployment/azure_container_instance.html)
- [Heroku Deployment](https://docs.bentoml.org/en/latest/deployment/heroku.html)

Lastly, if you have a DevOps or ML Engineering team who's operating a Kubernetes or OpenShift cluster, use the following guides as references for implementating your deployment strategy:
- [Kubernetes Deployment](https://docs.bentoml.org/en/latest/deployment/kubernetes.html)
- [Knative Deployment](https://docs.bentoml.org/en/latest/deployment/knative.html)
- [Kubeflow Deployment](https://docs.bentoml.org/en/latest/deployment/kubeflow.html)
- [KFServing Deployment](https://docs.bentoml.org/en/latest/deployment/kfserving.html)
- [Clipper.ai Deployment Guide](https://docs.bentoml.org/en/latest/deployment/clipper.html)

