{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train a Model to Detect Sentiment from Trip Reports\n", "\n", "In this example, we have trip reports from customer engagements stored in Ceph. In order to detect the sentiment of future trips, we use the historic data to train our models. Over time, the accuracy of the models will improve as more data is stored in Ceph.\n", "\n", "The models are also stored back in Ceph for use by other execution environments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install Machine Learning libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install sklearn\n", "!pip install tensorflow\n", "!pip install keras\n", "!pip install pandas\n", "!pip install boto3\n", "!pip install matplotlib\n", "!pip install seaborn\n", "\n", "import pyspark\n", "\n", "import re\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access the data using Spark" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Set the Spark configuration\n", "#This will point to a local Spark instance running in stand-alone mode on the notebook\n", "conf = pyspark.SparkConf().setAppName('Sentiment Analysis').setMaster('local[*]')\n", "sc = pyspark.SparkContext.getOrCreate(conf)\n", "\n", "accessKey= 'S3user1'\n", "secretKey= 'S3user1key'\n", "endpointUrl= 'http://'\n", "\n", "#Set the S3 configurations to access Ceph Object Storage\n", "sc._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", 'S3user1') \n", "sc._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", 'S3user1key') \n", "sc._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\", 'http://10.0.1.111')\n", "\n", "#Get the SQL context\n", "sqlContext = pyspark.SQLContext(sc)\n", "\n", "feedbackFile = sqlContext.read.option(\"sep\", \"\\t\").csv(\"s3a://SENTIMENT/data/trip_report.tsv\", header=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### IMPORTANT: If you run the above step with incorrect Ceph parameters, you must reset the Kernel to see changes.\n", "This can be done by going to Kernel in the menu and selecting 'Restart'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert the data to a Pandas data frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = feedbackFile.toPandas()\n", "sc.stop()\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Types of trip outcomes by field representative" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "np.random.seed(sum(map(ord, \"categorical\")))\n", "\n", "from matplotlib.colors import ListedColormap\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(style=\"whitegrid\", color_codes=True)\n", "\n", "outcome_dict = {'Successful':0,'Partial Success':1,'Unsuccessful':2 }\n", "\n", "df_vis = df[['Your Name', 'Outcome']]\n", "df_vis['outcome_numeric'] = df_vis['Outcome'].apply(lambda a:outcome_dict[a])\n", "\n", "\n", "\n", "outcome_cross_table = pd.crosstab(index=df_vis[\"Your Name\"], \n", " columns=df_vis[\"Outcome\"])\n", "\n", "\n", "outcome_cross_table.plot(kind=\"bar\", \n", " figsize=(16,12),\n", " stacked=True,fontsize=12)\n", "plt.show();" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "#### Types of outcomes by event type" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "event_type_cross_table = pd.crosstab(index=df[\"Primary Audience Engaged\"], \n", " columns=df[\"Outcome\"])\n", "\n", "event_type_cross_table.plot(kind=\"bar\", \n", " figsize=(16,12),\n", " stacked=True,fontsize=12)\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now convert \"Highlights\" data to prepare for training the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Highlights'] = df['Highlights'].astype(str)\n", "\n", "df[['Highlights','Outcome']].head(20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_outcome = df[['Highlights','Outcome']]\n", "\n", "pd.set_option('display.height', 1000)\n", "pd.set_option('display.max_rows', 500)\n", "pd.set_option('display.max_columns', 500)\n", "pd.set_option('display.width', 1000)\n", "\n", "grouped_highlights = pd.DataFrame(df_outcome.groupby('Outcome')['Highlights'].apply(lambda x: \"%s\" % ' '.join(x)))\n", "\n", "grouped_highlights['Outcome'] = list(grouped_highlights.index.get_values())\n", "grouped_highlights.reset_index(drop=True, inplace=True)\n", "\n", "grouped_highlights['Highlights'] = grouped_highlights['Highlights'].astype(str)\n", "\n", "df['Highlights'] = df['Highlights'].apply(lambda a: a.lower())\n", "\n", "df_success = df[df['Outcome'] == 'Successful']\n", "df_unsuccess = df[df['Outcome'] == 'Unsuccessful']\n", "df_part_success = df[df['Outcome'] == 'Partial Success']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import additional Machine Learning libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.models import Sequential\n", "from keras.layers import Dense, Embedding, LSTM\n", "from sklearn.model_selection import train_test_split\n", "from keras.utils.np_utils import to_categorical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Separating train and test data. Taking successful and unsuccessful separately" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_failure = df_part_success.append(df_unsuccess, ignore_index= True)\n", "\n", "df_failure['Outcome'] = 'Unsuccessful'\n", "\n", "test_hold_out = 0.1\n", "\n", "#### Success\n", "\n", "train = df_success[ : -int(test_hold_out * len(df_success))]\n", "test = df_success[-int(test_hold_out * len(df_success)) : ]\n", "\n", "#### Failure\n", "\n", "train = train.append(df_failure[ : -int(test_hold_out * len(df_failure))])\n", "test = test.append(df_failure[-int(test_hold_out * len(df_failure)) : ])\n", "\n", "\n", "train = train.sample(frac = 1)\n", "train['type'] = \"Train\"\n", "test['type'] = \"Test\"\n", "\n", "train = train.append(test)\n", "\n", "train.reset_index(drop=True,inplace=True)\n", "\n", "Y = pd.get_dummies(train['Outcome']).values\n", "\n", "test_index_list = list(train[train['type'] == 'Test'].index)\n", "\n", "test_index_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use the HIGHLIGHTS field for sentiment analysis\n", "\n", "__max_features__ = Vocabulary size,its a hyper parameter
\n", "*Tokenizer creates vectors from text, mainly works like a dictionary id in total vocabulary, returns list of integers, where every integer acts like an index
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max_fatures = 10000\n", "tokenizer = Tokenizer(nb_words=max_fatures, split=' ')\n", "tokenizer.fit_on_texts(train['Highlights'].values)\n", "X_highlights = tokenizer.texts_to_sequences(train['Highlights'].values)\n", "X_highlights = pad_sequences(X_highlights)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating the network layer by layer\n", "First layer is word embedding layer, second layer is LSTM based RNN, and third layer is Softmax activation layer, due to categorical outcome" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embed_dim = 128\n", "lstm_out = 196\n", "\n", "model = Sequential()\n", "model.add(Embedding(max_fatures, embed_dim,input_length = X_highlights.shape[1], dropout=0.05))\n", "model.add(LSTM(lstm_out, dropout_U=0.1, dropout_W=0.1))\n", "model.add(Dense(2,activation='softmax'))\n", "model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])\n", "print(model.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Separating train and test data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_highlights_train = X_highlights[0:test_index_list[0]]\n", "Y_highlights_train = Y[0:test_index_list[0]]\n", "\n", "X_highlights_test = X_highlights[test_index_list[0]:]\n", "Y_highlights_test = Y[test_index_list[0]:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Running the model\n", "Batch size and number of epoch can be changed as optimisation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "batch_size = 20\n", "model.fit(X_highlights_train, Y_highlights_train, nb_epoch = 10, batch_size=batch_size, verbose = 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Printing test data accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score,accuracy = model.evaluate(X_highlights_test, Y_highlights_test, verbose = 2, batch_size = batch_size)\n", "print(\"score: %.2f\" % (score))\n", "print(\"accuracy: %.2f\" % (accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save the model, tokenizer and feature dimension and store them in Ceph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.save(\"./model\")\n", "\n", "import pickle\n", "\n", "with open('./tokenizer.pickle', 'wb') as handle:\n", " pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "\n", "feature_dimension = X_highlights_train.shape[1]\n", "with open('./feature_dimension.pickle', 'wb') as handle:\n", " pickle.dump(feature_dimension, handle, protocol=pickle.HIGHEST_PROTOCOL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Save models to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "s3 = boto3.resource('s3')\n", "\n", "#Create S3 session for writing manifest file\n", "session = boto3.Session(\n", " aws_access_key_id=accessKey,\n", " aws_secret_access_key=secretKey\n", ")\n", "\n", "s3 = session.resource('s3', endpoint_url=endpointUrl, verify=False)\n", "\n", "# Upload the model to S3\n", "s3.meta.client.upload_file('./model', 'SENTIMENT', 'models/trip_report_model')\n", "\n", "# Upload the tokenizer to S3\n", 
"s3.meta.client.upload_file('./tokenizer.pickle', 'SENTIMENT', 'models/trip_report_tokenizer.pickle')\n", "\n", "# Upload the feature dimension to S3\n", "s3.meta.client.upload_file('./feature_dimension.pickle', 'SENTIMENT', 'models/trip_report_feature_dimension.pickle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The model has been saved to s3 as binary files and can be viewed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }