{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explaining a classifier\n",
"\n",
"It's good to be able to explain what an algorithm is doing, especially in the public sector.\n",
"\n",
"Explanations boost confidence in the ML, support transparency, and are useful when iterating on the model.\n",
"\n",
"This notebook uses the Python package lime (Local Interpretable Model-Agnostic Explanations), the implementation released by the authors of the LIME paper:\n",
"https://arxiv.org/abs/1602.04938"
]
},
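{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind LIME can be sketched in a few lines. This is a minimal illustration, not the lime package's actual implementation: perturb one instance, query the black-box model, weight the perturbed samples by their proximity to the instance, and fit a weighted linear model whose coefficients act as the local explanation. The function name `local_surrogate`, the Gaussian perturbation and the kernel width are assumptions made for this sketch.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"def local_surrogate(predict_proba, instance, class_index, n_samples=1000, kernel_width=0.75, seed=0):\n",
"    # Hypothetical helper for illustration; lime's own sampling and kernel differ in detail.\n",
"    rng = np.random.RandomState(seed)\n",
"    n_features = instance.shape[0]\n",
"    # 1. Perturb the instance with Gaussian noise (an assumed perturbation scheme).\n",
"    samples = instance + rng.normal(size=(n_samples, n_features))\n",
"    # 2. Ask the black-box model for the probability of the class of interest.\n",
"    targets = predict_proba(samples)[:, class_index]\n",
"    # 3. Weight each sample by its proximity to the original instance.\n",
"    distances = np.linalg.norm(samples - instance, axis=1)\n",
"    width = kernel_width * np.sqrt(n_features)\n",
"    weights = np.exp(-(distances ** 2) / (width ** 2))\n",
"    # 4. Fit a weighted linear model; its coefficients are the local explanation.\n",
"    surrogate = Ridge(alpha=1.0).fit(samples, targets, sample_weight=weights)\n",
"    return surrogate.coef_\n",
"```\n",
"\n",
"With the model trained later in this notebook, a call might look like `local_surrogate(rf.predict_proba, test[0], class_index=0)`; the coefficients with the largest absolute values point to the features that matter most near that document."
]
},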
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: lime in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages\n",
"Requirement already satisfied: scikit-image>=0.12 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from lime)\n",
"Requirement already satisfied: numpy in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from lime)\n",
"Requirement already satisfied: scikit-learn>=0.18 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from lime)\n",
"Requirement already satisfied: scipy in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from lime)\n",
"Requirement already satisfied: six>=1.7.3 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from scikit-image>=0.12->lime)\n",
"Requirement already satisfied: networkx>=1.8 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from scikit-image>=0.12->lime)\n",
"Requirement already satisfied: pillow>=2.1.0 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from scikit-image>=0.12->lime)\n",
"Requirement already satisfied: dask[array]>=0.5.0 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from scikit-image>=0.12->lime)\n",
"Requirement already satisfied: decorator>=3.4.0 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from networkx>=1.8->scikit-image>=0.12->lime)\n",
"Requirement already satisfied: olefile in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from pillow>=2.1.0->scikit-image>=0.12->lime)\n",
"Requirement already satisfied: toolz>=0.7.2 in c:\\users\\leungr\\appdata\\local\\continuum\\anaconda3\\lib\\site-packages (from dask[array]>=0.5.0->scikit-image>=0.12->lime)\n"
]
}
],
"source": [
"! pip install lime "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
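},
"outputs": [],
"source": [
"from __future__ import print_function  # no-op on Python 3; placed first, as a plain script would require\n",
"import sklearn\n",
"import sklearn.ensemble\n",
"import sklearn.metrics  # used below for accuracy_score\n",
"import sklearn.model_selection  # used below for train_test_split\n",
"import numpy as np\n",
"import lime\n",
"import lime.lime_tabular\n",
"np.random.seed(2)"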
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(5000, 2)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>doc</th>\n",
"      <th>topic</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>That this House notes the conviction in the Un...</td>\n",
"      <td>House of Lords</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>That this House congratulates the US corporate...</td>\n",
"      <td>North America</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>That this House is dismayed by the report in t...</td>\n",
"      <td>Speaker</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>That this House notes that, according to a No....</td>\n",
"      <td>Members of the Lords</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>That this House notes the appointment of a new...</td>\n",
"      <td>House of Lords</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" doc topic\n",
"0 That this House notes the conviction in the Un... House of Lords\n",
"1 That this House congratulates the US corporate... North America\n",
"2 That this House is dismayed by the report in t... Speaker\n",
"3 That this House notes that, according to a No.... Members of the Lords\n",
"4 That this House notes the appointment of a new... House of Lords"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"blob_account_name = \"parlpublic\"\n",
"blob_account_key = \"xKEIV42ZsO8eL2IPjvbLarR2Xu1brxGucDauvVytPXD1uKhAfYUId7SwbGF82FslfkKebPB/ic6/RcPYnNBO6w==\" \n",
"container = \"trainingdata\" \n",
"blobname = \"5000_edms_justonetopic.csv\"\n",
"datafile = \"output.txt\" \n",
"import os\n",
"import pandas as pd\n",
"from azure.storage.blob import BlockBlobService\n",
"\n",
"dirname = os.getcwd()\n",
"\n",
"blob_service = BlockBlobService(account_name=blob_account_name,account_key=blob_account_key)\n",
"blob_service.get_blob_to_path(container, blobname, datafile)\n",
"\n",
"edm = pd.read_csv(datafile, header = 0)\n",
"os.remove(os.path.join(dirname, datafile))\n",
"print(edm.shape)\n",
"edm.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# load in tag hierarchy to extract top terms\n",
"import json\n",
"with open(\"tag_hierarchy.json\", 'r') as f:\n",
"    data = json.load(f)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>doc</th>\n",
"      <th>topic</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>that this house notes the conviction in the un...</td>\n",
"      <td>Parliament, government and politics</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>that this house congratulates the us corporate...</td>\n",
"      <td>International affairs</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>that this house is dismayed by the report in t...</td>\n",
"      <td>Parliament, government and politics</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>that this house notes that according to a no d...</td>\n",
"      <td>Parliament, government and politics</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>that this house notes the appointment of a new...</td>\n",
"      <td>Parliament, government and politics</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" doc \\\n",
"0 that this house notes the conviction in the un... \n",
"1 that this house congratulates the us corporate... \n",
"2 that this house is dismayed by the report in t... \n",
"3 that this house notes that according to a no d... \n",
"4 that this house notes the appointment of a new... \n",
"\n",
" topic \n",
"0 Parliament, government and politics \n",
"1 International affairs \n",
"2 Parliament, government and politics \n",
"3 Parliament, government and politics \n",
"4 Parliament, government and politics "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Cleaning\n",
"import re\n",
"\n",
"def clean_text(string):\n",
"    string = re.sub(r\"\\d\", \"\", string) # remove digits\n",
"    string = re.sub(r\"_+\", \"\", string) # remove runs of underscores\n",
"    string = re.sub(r\"<[^>]*>\", \" \", string) # remove all HTML tags\n",
"    string = re.sub(r\"[^0-9a-zA-Z]+\", \" \", string) # replace runs of special characters with a space\n",
"    string = string.lower() # transform to lower case\n",
"\n",
"    return string.strip()\n",
"\n",
"edm[\"doc\"] = edm.doc.apply(clean_text)\n",
"\n",
"def get_parent(term):\n",
"    # Return the top-level term above `term` (searches up to three levels deep; None if not found)\n",
"    for i in data['children']:\n",
"        for k in i['children']:\n",
"            if term == k['name']:\n",
"                return i['name']\n",
"            if \"children\" in k:\n",
"                for j in k['children']:\n",
"                    if term == j['name']:\n",
"                        return get_parent(k['name'])\n",
"\n",
"# get_parent(\"Supported housing\")\n",
"\n",
"edm[\"topic\"] = edm.topic.apply(get_parent)\n",
"\n",
"edm.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=0.95, max_features=1000, min_df=2,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,max_features=1000,stop_words='english')\n",
"tf_vectorizer"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<5000x1000 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 150811 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = tf_vectorizer.fit_transform(edm['doc'])\n",
"X"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" ..., \n",
" [ 0., 0., 0., ..., 0., 1., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert to a dense float array; the result must be assigned, since astype returns a copy\n",
"doc_array = X.toarray().astype(float)\n",
"doc_array"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['shall', 'fight', 'beaches']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"analyze = tf_vectorizer.build_analyzer()\n",
"analyze(\"we shall fight on the beaches\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
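},
"outputs": [],
"source": [
"# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()\n",
"feature_names = tf_vectorizer.get_feature_names()"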
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_split=1e-07, min_samples_leaf=1,\n",
" min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
" n_estimators=500, n_jobs=1, oob_score=False, random_state=None,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(doc_array, edm.topic, train_size=0.80)\n",
"rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)\n",
"rf.fit(train, labels_train)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.57399999999999995"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sklearn.metrics.accuracy_score(labels_test, rf.predict(test))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn import preprocessing\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit(edm.topic)\n",
"classes_names = le.classes_"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\leungr\\AppData\\Local\\Continuum\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n",
" warnings.warn(msg, _DataConversionWarning)\n"
]
}
],
"source": [
"explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=feature_names, class_names=classes_names, discretize_continuous=True)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[array(['calls', 'chief', 'concern', 'court', 'engage', 'ensure',\n",
" 'following', 'home', 'legislation', 'notes', 'officers', 'order',\n",
" 'police', 'properly', 'public', 'recent', 'secretary', 'senior',\n",
" 'settlement'], \n",
" dtype='\n",
" \n",
" \n",
" \n",
" \n",
" \n",
"