{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Protodash Explanations for Text data\n", "\n", "In the example shown in this notebook, we train a text classifier based on [UCI SMS dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) to distinguish 'SPAM' and 'HAM' (i.e. non spam) SMS messages. We then use the ProtodashExplainer to obtain spam and ham prototypes based on the labels assigned by the text classifier. \n", "\n", "In order to run this notebook, please: \n", "1. Download [UCI SMS dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) dataset and place the directory 'smsspamcollection' in the location of this notebook. \n", "2. Place glove embedding file \"glove.6B.100d.txt\" in the location of this notebook. This can be downloaded from [here](https://nlp.stanford.edu/projects/glove/) \n", "3. Create 2 folders: \"results\" and \"logs\" in the location of this notebook (these are used to store training logs). \n", "4. The models trained in this notebook can also be accessed from [here](https://github.com/IBM/AIX360/tree/master/aix360/models/protodash) if required. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1. Train a LSTM classifier on SMS dataset\n", "We train a LSTM model to label the dataset as spam / ham. The model is based on the following code: https://www.thepythoncode.com/article/build-spam-classifier-keras-python " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import statements" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import tqdm\n", "import numpy as np\n", "import keras_metrics # for recall and precision metrics\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.layers import Embedding, LSTM, Dropout, Dense\n", "from keras.models import Sequential\n", "from keras.utils import to_categorical\n", "from keras.callbacks import ModelCheckpoint, TensorBoard\n", "from sklearn.model_selection import train_test_split\n", "import time\n", "import numpy as np\n", "import pickle\n", "import os.path\n", "from keras.models import model_from_json" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "SEQUENCE_LENGTH = 100 # the length of all sequences (number of words per sample)\n", "EMBEDDING_SIZE = 100 # Using 100-Dimensional GloVe embedding vectors\n", "TEST_SIZE = 0.25 # ratio of testing set\n", "\n", "BATCH_SIZE = 64\n", "EPOCHS = 20 # number of epochs\n", "\n", "# to convert labels to integers and vice-versa\n", "label2int = {\"ham\": 0, \"spam\": 1}\n", "int2label = {0: \"ham\", 1: \"spam\"}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "combined_df = pd.read_csv('smsspamcollection/SMSSpamCollection.csv', delimiter='\\t',header=None)\n", "combined_df.columns = ['label', 'text']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# clean text and store as a column in original df\n", "X = combined_df['text'].values.tolist()\n", "y = combined_df['label'].values.tolist()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Text tokenization\n", "# vectorizing text, turning each text into sequence of 
integers\n", "tokenizer = Tokenizer()\n", "tokenizer.fit_on_texts(X)\n", "# convert to sequence of integers\n", "X = tokenizer.texts_to_sequences(X)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# convert to numpy arrays\n", "X = np.array(X)\n", "y = np.array(y)\n", "# pad sequences at the beginning of each sequence with 0's\n", "# for example if SEQUENCE_LENGTH=4:\n", "# [[5, 3, 2], [5, 1, 2, 3], [3, 4]]\n", "# will be transformed to:\n", "# [[0, 5, 3, 2], [5, 1, 2, 3], [0, 0, 3, 4]]\n", "X = pad_sequences(X, maxlen=SEQUENCE_LENGTH)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "y = [ label2int[label] for label in y ]\n", "y = to_categorical(y)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# split and shuffle\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use glove embeddings" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def get_embedding_vectors(tokenizer, dim=100):\n", " embedding_index = {}\n", " with open(f\"glove.6B.{dim}d.txt\", encoding='utf8') as f:\n", " for line in tqdm.tqdm(f, \"Reading GloVe\"):\n", " values = line.split()\n", " word = values[0]\n", " vectors = np.asarray(values[1:], dtype='float32')\n", " embedding_index[word] = vectors\n", "\n", " word_index = tokenizer.word_index\n", " embedding_matrix = np.zeros((len(word_index)+1, dim))\n", " for word, i in word_index.items():\n", " embedding_vector = embedding_index.get(word)\n", " if embedding_vector is not None:\n", " # words not found will be 0s\n", " embedding_matrix[i] = embedding_vector\n", " \n", " return embedding_matrix" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def get_model(tokenizer, lstm_units):\n", " \"\"\"\n", " Constructs the model,\n", " Embedding vectors => LSTM => 2 output Fully-Connected neurons with softmax activation\n", " \"\"\"\n", " # get the GloVe embedding vectors\n", " embedding_matrix = get_embedding_vectors(tokenizer)\n", " model = Sequential()\n", " model.add(Embedding(len(tokenizer.word_index)+1,\n", " EMBEDDING_SIZE,\n", " weights=[embedding_matrix],\n", " trainable=False,\n", " input_length=SEQUENCE_LENGTH))\n", "\n", " model.add(LSTM(lstm_units, recurrent_dropout=0.2))\n", " model.add(Dropout(0.3))\n", " model.add(Dense(2, activation=\"softmax\"))\n", " # compile as rmsprop optimizer\n", " # aswell as with recall metric\n", " model.compile(optimizer=\"rmsprop\", loss=\"categorical_crossentropy\",\n", " metrics=[\"accuracy\", keras_metrics.precision(), keras_metrics.recall()])\n", " model.summary()\n", " return model" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Reading GloVe: 400000it [00:14, 26995.27it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "tracking tp\n", "tracking fp\n", "tracking tp\n", "tracking fn\n", "Model: \"sequential_1\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_1 (Embedding) (None, 100, 100) 901000 \n", "_________________________________________________________________\n", "lstm_1 (LSTM) (None, 128) 117248 \n", 
"_________________________________________________________________\n", "dropout_1 (Dropout) (None, 128) 0 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 2) 258 \n", "=================================================================\n", "Total params: 1,018,506\n", "Trainable params: 117,506\n", "Non-trainable params: 901,000\n", "_________________________________________________________________\n" ] } ], "source": [ "# constructs the model with 128 LSTM units\n", "model = get_model(tokenizer=tokenizer, lstm_units=128)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Train model or load trained model from disk" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded model from disk\n", "Model: \"sequential_1\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_1 (Embedding) (None, 100, 100) 901000 \n", "_________________________________________________________________\n", "lstm_1 (LSTM) (None, 128) 117248 \n", "_________________________________________________________________\n", "dropout_1 (Dropout) (None, 128) 0 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 2) 258 \n", "=================================================================\n", "Total params: 1,018,506\n", "Trainable params: 117,506\n", "Non-trainable params: 901,000\n", "_________________________________________________________________\n", "tracking tp\n", "tracking fp\n", "tracking tp\n", "tracking fn\n" ] } ], "source": [ "to_train = False\n", "\n", "if (to_train): \n", "\n", " # initialize our ModelCheckpoint and TensorBoard callbacks\n", " # model checkpoint for saving best weights\n", " model_checkpoint = ModelCheckpoint(\"results/spam_classifier_{val_loss:.2f}\", save_best_only=True,\n", " verbose=1)\n", " # for better visualization\n", " tensorboard = TensorBoard(f\"logs/spam_classifier_{time.time()}\")\n", " # print our data shapes\n", " print(\"X_train.shape:\", X_train.shape)\n", " print(\"X_test.shape:\", X_test.shape)\n", " print(\"y_train.shape:\", y_train.shape)\n", " print(\"y_test.shape:\", y_test.shape)\n", " # train the model\n", " model.fit(X_train, y_train, validation_data=(X_test, y_test),\n", " batch_size=BATCH_SIZE, epochs=EPOCHS,\n", " callbacks=[tensorboard, model_checkpoint],\n", " verbose=1)\n", " \n", " # serialize model to JSON\n", " model_json = model.to_json()\n", " with open(\"sms-lstm-forprotodash.json\", \"w\") as json_file:\n", " json_file.write(model_json)\n", "\n", " # serialize weights to HDF5\n", " model.save_weights(\"sms-lstm-forprotodash.h5\")\n", " print(\"Saved model to disk\")\n", " \n", "else: \n", "\n", " # load json and create model\n", " json_file = open(\"sms-lstm-forprotodash.json\", 'r')\n", " loaded_model_json = json_file.read()\n", " json_file.close()\n", " model = model_from_json(loaded_model_json)\n", "\n", " # load weights into new model\n", " model.load_weights(\"sms-lstm-forprotodash.h5\")\n", " print(\"Loaded model from disk\")\n", "\n", " # print model \n", " model.summary()\n", "\n", " model.compile(optimizer=\"rmsprop\", loss=\"categorical_crossentropy\",\n", " metrics=[\"accuracy\", keras_metrics.precision(), keras_metrics.recall()]) " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": 
[ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /root/anaconda3/envs/ctext/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.\n", "\n", "1393/1393 [==============================] - 2s 1ms/step\n", "[+] Accuracy: 98.35%\n", "[+] Precision: 98.44%\n", "[+] Recall: 99.70%\n" ] } ], "source": [ "# get the loss and metrics\n", "result = model.evaluate(X_test, y_test)\n", "# extract those\n", "loss = result[0]\n", "accuracy = result[1]\n", "precision = result[2]\n", "recall = result[3]\n", "\n", "print(f\"[+] Accuracy: {accuracy*100:.2f}%\")\n", "print(f\"[+] Precision: {precision*100:.2f}%\")\n", "print(f\"[+] Recall: {recall*100:.2f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2. Get model predictions for the dataset" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def get_predictions(doclist):\n", " \n", " sequence = tokenizer.texts_to_sequences(doclist)\n", " \n", " # pad the sequence\n", " sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)\n", "\n", " # get the prediction as one-hot encoded vector\n", " prediction = model.predict(sequence)\n", " \n", " return (prediction) " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "spam\n" ] } ], "source": [ "text = \"Congratulations! you have won 100,000$ this week, click here to claim fast\"\n", "pred = get_predictions([text])\n", "print(int2label [ np.argmax(pred, axis=1)[0] ] )" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ham\n" ] } ], "source": [ "text = \"Hi man, I was wondering if we can meet tomorrow.\"\n", "pred = get_predictions([text])\n", "print(int2label [ np.argmax(pred, axis=1)[0] ] )" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "doclist = combined_df['text'].values.tolist()\n", "one_hot_prediction = get_predictions(doclist)\n", "label_prediction = np.argmax(one_hot_prediction, axis=1)\n", "\n", "# 0: ham, 1:spam\n", "idx_ham = (label_prediction == 0)\n", "idx_spam = (label_prediction == 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3. Use protodash explainer to compute spam and ham prototypes" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from aix360.algorithms.protodash import ProtodashExplainer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert text to vectors using TF-IDF for use in explainer\n", "\n", "We use TF-IDF vectors for scalability reasons as the original embedding vector for a full sentence can be quite large. 
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(5572, 8713)\n" ] } ], "source": [ "# create the transform\n", "vectorizer = TfidfVectorizer()\n", "\n", "# tokenize and build vocab\n", "vectorizer.fit(doclist)\n", "\n", "vec = vectorizer.transform(doclist)\n", "docvec = vec.toarray()\n", "print(docvec.shape)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# separate spam and ham messages and corrsponding vectors\n", "\n", "docvec_spam = docvec[idx_spam, :]\n", "docvec_ham = docvec[idx_ham, :]\n", "\n", "df_spam = combined_df[idx_spam]['text']\n", "df_ham = combined_df[idx_ham]['text']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(727,)\n", "(4845,)\n" ] } ], "source": [ "print(df_spam.shape)\n", "print(df_ham.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compute prototypes for spam and ham datasets" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "explainer = ProtodashExplainer()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "m = 10\n", "\n", "# call protodash explainer\n", "# S contains indices of the selected prototypes\n", "# W contains importance weights associated with the selected prototypes \n", "(W_spam, S_spam, _) = explainer.explain(docvec_spam, docvec_spam, m=m)\n", "(W_ham, S_ham, _) = explainer.explain(docvec_ham, docvec_ham, m=m)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# get prototypes from index\n", "df_spam_prototypes = df_spam.iloc[S_spam].copy()\n", "df_ham_prototypes = df_ham.iloc[S_ham].copy()\n", "\n", "#normalize weights\n", "W_spam = np.around(W_spam/np.sum(W_spam), 2) \n", "W_ham = np.around(W_ham/np.sum(W_ham), 2) " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SPAM prototypes with weights:\n", "----------------------------\n", "0.13 We tried to call you re your reply to our sms for a video mobile 750 mins UNLIMITED TEXT + free camcorder Reply of call 08000930705 Now\n", "0.13 You have WON a guaranteed £1000 cash or a £2000 prize.To claim yr prize call our customer service representative on\n", "0.12 Get ur 1st RINGTONE FREE NOW! Reply to this msg with TONE. Gr8 TOP 20 tones to your phone every week just £1.50 per wk 2 opt out send STOP 08452810071 16\n", "0.1 December only! Had your mobile 11mths+? You are entitled to update to the latest colour camera mobile for Free! Call The Mobile Update Co FREE on 08002986906\n", "0.09 Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/expressoffer Ts&Cs apply. To stop texts, txt STOP to 80062\n", "0.09 URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n", "0.08 Free entry in 2 a weekly comp for a chance to win an ipod. Txt POD to 80182 to get entry (std txt rate) T&C's apply 08452810073 for details 18+\n", "0.09 Urgent! call 09066350750 from your landline. Your complimentary 4* Ibiza Holiday or 10,000 cash await collection SAE T&Cs PO BOX 434 SK3 8WP 150 ppm 18+ \n", "0.08 YES! The only place in town to meet exciting adult singles is now in the UK. Txt CHAT to 86688 now! 
150p/Msg.\n", "0.08 What do U want for Xmas? How about 100 free text messages & a new video phone with half price line rental? Call free now on 0800 0721072 to find out more!\n" ] } ], "source": [ "print(\"SPAM prototypes with weights:\")\n", "print(\"----------------------------\")\n", "for i in range(m):\n", " print(W_spam[i], df_spam_prototypes.iloc[i])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HAM prototypes with weights:\n", "----------------------------\n", "0.14 The last thing i ever wanted to do was hurt you. And i didn't think it would have. You'd laugh, be embarassed, delete the tag and keep going. But as far as i knew, it wasn't even up. The fact that you even felt like i would do it to hurt you shows you really don't know me at all. It was messy wednesday, but it wasn't bad. The problem i have with it is you HAVE the time to clean it, but you choose not to. You skype, you take pictures, you sleep, you want to go out. I don't mind a few things here and there, but when you don't make the bed, when you throw laundry on top of it, when i can't have a friend in the house because i'm embarassed that there's underwear and bras strewn on the bed, pillows on the floor, that's something else. You used to be good about at least making the bed.\n", "0.11 What do u want when i come back?.a beautiful necklace as a token of my heart for you.thats what i will give but ONLY to MY WIFE OF MY LIKING.BE THAT AND SEE..NO ONE can give you that.dont call me.i will wait till i come.\n", "0.12 It‘s £6 to get in, is that ok?\n", "0.12 Babe ! What are you doing ? Where are you ? Who are you talking to ? Do you think of me ? Are you being a good boy? Are you missing me? Do you love me ?\n", "0.1 So you think i should actually talk to him? Not call his boss in the morning? I went to this place last year and he told me where i could go and get my car fixed cheaper. He kept telling me today how much he hoped i would come back in, how he always regretted not getting my number, etc.\n", "0.08 Dude. What's up. How Teresa. Hope you have been okay. When i didnt hear from these people, i called them and they had received the package since dec <#> . Just thot you'ld like to know. Do have a fantastic year and all the best with your reading. Plus if you can really really Bam first aid for Usmle, then your work is done.\n", "0.09 Dont hesitate. You know this is the second time she has had weakness like that. So keep i notebook of what she eat and did the day before or if anything changed the day before so that we can be sure its nothing\n", "0.08 Sorry, I'll call you later. I am in meeting sir.\n", "0.08 Just got to <#>\n", "0.08 Yes but can we meet in town cos will go to gep and then home. You could text at bus stop. 
And don't worry we'll have finished by march … ish!\n" ] } ], "source": [ "print(\"HAM prototypes with weights:\")\n", "print(\"----------------------------\")\n", "for i in range(m):\n", " print(W_ham[i], df_ham_prototypes.iloc[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Given a spam message, look for similar messages that are classified as spam by the classifier" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "k = 0\n", "sample_text = df_spam.iloc[k]\n", "sample_vec = docvec_spam[k]\n", "sample_vec = sample_vec.reshape(1, sample_vec.shape[0])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", "(1, 8713)\n" ] } ], "source": [ "print(sample_text)\n", "print(sample_vec.shape)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "docvec_spam_other = docvec_spam[np.arange(docvec_spam.shape[0]) != k, :] \n", "df_spam_other = df_spam.iloc[np.arange(docvec_spam.shape[0]) != k].copy()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# Take a sample spam text and find samples similar to it. \n", "(W1_spam, S1_spam, _) = explainer.explain(sample_vec, docvec_spam_other, m=m) " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# normalize weights\n", "W1_spam = np.around(W1_spam/np.sum(W1_spam), 2) " ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([174, 264, 499, 210, 607, 517, 57, 480, 637, 711])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "S1_spam" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original text\n", "-------------\n", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", "\n", "Similar SPAM prototypes:\n", "------------------------\n", "1.0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", "0.0 Free 1st week entry 2 TEXTPOD 4 a chance 2 win 40GB iPod or £250 cash every wk. Txt VPOD to 81303 Ts&Cs www.textpod.net custcare 08712405020.\n", "0.0 Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50\n", "0.0 SMS. ac JSco: Energy is high, but u may not know where 2channel it. 2day ur leadership skills r strong. Psychic? Reply ANS w/question. End? Reply END JSCO\n", "0.0 Todays Voda numbers ending with 7634 are selected to receive a £350 reward. If you have a match please call 08712300220 quoting claim code 7684 standard rates apply.\n", "0.0 Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!\n", "0.0 100 dating service cal;l 09064012103 box334sk38ch\n", "0.0 2/2 146tf150p\n", "0.0 1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face <#> .joker face.\n", "0.0 http//tms. widelive.com/index. 
wml?id=820554ad0a1705572711&first=true¡C C Ringtone¡\n" ] } ], "source": [ "# similar spam prototypes\n", "print(\"original text\")\n", "print(\"-------------\")\n", "print(sample_text)\n", "print(\"\")\n", "\n", "print(\"Similar SPAM prototypes:\")\n", "print(\"------------------------\")\n", "m = 10\n", "for i in range(m):\n", " print(W1_spam[i], df_spam_other.iloc[S1_spam[i]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observation\n", "\n", "Note that several spam messages repeat in the dataset, as they may have been sent by the same entity to multiple users. As a consequence, the explainer retrieves these near-duplicates. Try a different k above to see prototypes corresponding to other sample messages. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Given a ham message, look for similar messages that are classified as ham by the classifier" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "k = 3\n", "sample_text = df_ham.iloc[k]\n", "sample_vec = docvec_ham[k]\n", "sample_vec = sample_vec.reshape(1, sample_vec.shape[0])" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nah I don't think he goes to usf, he lives around here though\n", "(1, 8713)\n" ] } ], "source": [ "print(sample_text)\n", "print(sample_vec.shape)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "docvec_ham_other = docvec_ham[np.arange(docvec_ham.shape[0]) != k, :] \n", "df_ham_other = df_ham.iloc[np.arange(docvec_ham.shape[0]) != k].copy()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Take a sample ham text and find samples similar to it. \n", "(W1_ham, S1_ham, _) = explainer.explain(sample_vec, docvec_ham_other, m=m) " ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# normalize weights\n", "W1_ham = np.around(W1_ham/np.sum(W1_ham), 2) " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([ 924, 3263, 831, 345, 2346, 3818, 3578, 2131, 2459, 945])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "S1_ham" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original text\n", "-------------\n", "Nah I don't think he goes to usf, he lives around here though\n", "\n", "Similar HAM prototypes:\n", "------------------------\n", "0.13 I don't think he has spatula hands!\n", "0.13 Aight, let me know when you're gonna be around usf\n", "0.1 If he started searching he will get job in few days.he have great potential and talent.\n", "0.09 None of that's happening til you get here though\n", "0.11 Nah, I'm a perpetual DD\n", "0.1 Yes just finished watching days of our lives. I love it.\n", "0.1 Babe! How goes that day ? What are you up to ? I miss you already, my Love ... * loving kiss* ... 
I hope everything goes well.\n", "0.09 S.i think he is waste for rr..\n", "0.08 Were trying to find a Chinese food place around here\n", "0.08 Awesome, think we can get an 8th at usf some time tonight?\n" ] } ], "source": [ "# similar ham prototypes\n", "print(\"original text\")\n", "print(\"-------------\")\n", "print(sample_text)\n", "print(\"\")\n", "\n", "print(\"Similar HAM prototypes:\")\n", "print(\"------------------------\")\n", "m = 10\n", "for i in range(m):\n", " print(W1_ham[i], df_ham_other.iloc[S1_ham[i]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }