{ "cells": [ { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 1 Business Problem" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The Canadian banking system continues to rank at the top of the world thanks to a continuous effort to improve its quality control practices. As was evident during the 2008 Sub-Prime Mortgage Crisis, Canada was one of the few countries that withstood the Great Recession.\n", "\n", "One way to improve quality control practices is to analyze the quality of a Bank's business portfolio for each individual business line. For example, a Bank's core business line could be providing construction loan products. From the rationale written for each approval or denial of a construction loan, we can determine the topics behind each decision. With those topics in hand, we can perform quality control to ensure that every decision was made in accordance with the Bank's risk appetite and pricing.\n", "\n", "With this approach, Banks can measure their construction loan business against their own decision-making standards, and thus improve the overall quality of their business.\n", "\n", "However, to get this information, the Bank needs to extract topics from hundreds or thousands of rationales, and then interpret the topics before determining whether the decisions meet the Bank's decision-making standards, all of which can take a lot of time and resources.\n", "
\n", "
\n", "\n", "**Business Solutions:**\n", "\n", "To solve this issue, I have created a \"Quality Control System\" that learns and extracts topics from a Bank's decision-making rationales. The topics can then be used for quality control, to determine whether the decisions that were made are in accordance with the Bank's standards.\n", "\n", "We will apply Topic Modeling, an unsupervised learning technique, to an entire department's decision-making rationales, using both the Latent Dirichlet Allocation (LDA) Model and the LDA Mallet (Machine Learning for Language Toolkit) Model.\n", "\n", "We will also determine the dominant topic associated with each rationale, as well as the most relevant rationales for each dominant topic, in order to perform quality control analysis.\n", "\n", "Note: Although we were given permission to showcase this project, we will not show any actual information from the dataset, for privacy protection.\n", "
\n", "
\n", "\n", "**Benefits:**\n", "- Efficiently determine the main topics of rationale texts in a large dataset\n", "- Improve the quality control of decisions based on the topics that were extracted\n", "- Conveniently determine the topics of each rationale\n", "- Extract detailed information by determining the most relevant rationales for each topic\n", "
\n", "
\n", "\n", "**Robustness:**\n", "\n", "To ensure the model performs well, I will take the following steps:\n", "- Run the LDA Model and the LDA Mallet Model to compare the performance of each model\n", "- Run the LDA Mallet Model over a range of topic counts and choose the number of topics with the highest Coherence Score\n", "\n", "Note that the main difference between the LDA Model and the LDA Mallet Model is that the LDA Model uses the Variational Bayes method, which is faster but less precise than the Gibbs Sampling used by the LDA Mallet Model.\n", "
\n", "
\n", "\n", "**Assumption:**\n", "- We are using data with a sample size of 511, and assuming that this dataset is sufficient to capture the topics in the rationales\n", "- We are also assuming that the results of this model would apply in the same way if we were to train on the entire population of rationales, with the exception of a few parameter tweaks\n", "
\n", "
\n", "\n", "**Future:**\n", "\n", "This model is an innovative way to determine the key topics embedded in large quantities of text, and then apply them in a business context to improve a Bank's quality control practices for different business lines. However, since we did not showcase all the visualizations and outputs for privacy protection, please refer to \"[Employer Reviews using Topic Modeling](https://nbviewer.jupyter.org/github/mick-zhang/Employer-Reviews-using-Topic-Modeling/blob/master/Topic%20Employer%20Github.ipynb?flush_cache=true)\" for more detail." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 2 Data Overview" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "Empty DataFrame\n", "Columns: [Underwriter, Deal Number, Decision, Deal Notes]\n", "Index: []" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "csv = \"audit_rating_banking.csv\"\n", "df = pd.read_csv(csv, encoding='latin1') # Solves encoding issue when importing csv\n", "df.head(0) # Show the column names only" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "After importing the data, we see that the \"Deal Notes\" column holds the rationale for each deal. This is the column that we are going to use for extracting topics.\n", "\n", "Note that the actual data are not shown, for privacy protection." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(511, 1)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[['Deal Notes']]\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Deal Notes\n", "2 Unfortunately I will have\n", "3 Were going to pass\n", "4 Credit: main applicant has" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df.copy()\n", "df1[\"Deal Notes\"] = df1[\"Deal Notes\"].apply(lambda x: \" \".join(x.split()[:4])) # Keep only the first 4 words of each note\n", "df1.loc[2:4, ['Deal Notes']]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "As expected, we see that there are 511 rows in our dataset, with a single text column.\n", "\n", "I have also written a short snippet that gives a sneak peek of the rationale data (only the first 4 words of each rationale are shown)." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 3 Data Cleaning" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We will use regular expressions to remove unwanted characters from our dataset, and then preview what the data looks like after the cleaning." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true }, "outputs": [], "source": [ "data = df['Deal Notes'].values.tolist() # convert to list\n", "\n", "# Use Regex to remove all characters except letters and spaces\n", "import re\n", "data = [re.sub(r'[^a-zA-Z ]+', '', sent) for sent in data]\n", "\n", "# Preview the first cleaned rationale\n", "from pprint import pprint\n", "pprint(data[:1])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Note that the output was omitted for privacy protection. The actual output is text that has been cleaned so that only letters and spaces remain." 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 4 Pre-Processing" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "With our data now cleaned, the next step is to pre-process it so that it can be used as input for our LDA model.\n", "\n", "We will perform the following:\n", "- Break down each sentence into a list of words (Tokenization) using Gensim's `simple_preprocess`\n", "- Clean further by converting text to lowercase and removing punctuation, again with Gensim's `simple_preprocess`\n", "- Remove stopwords (words that carry little meaning, such as to, the, etc.) using NLTK's `corpus.stopwords`\n", "- Apply Bigram and Trigram models to words that occur together (e.g. warrant_proceeding, there_isnt_enough) using Gensim's `models.phrases.Phraser`\n", "- Transform words to their root form (e.g. walking to walk, mice to mouse) by lemmatizing the text with the English model loaded through `spacy.load('en')`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "# Implement simple_preprocess for Tokenization and additional cleaning\n", "import gensim\n", "from gensim.utils import simple_preprocess\n", "def sent_to_words(sentences):\n", "    for sentence in sentences:\n", "        # simple_preprocess lowercases, tokenizes and drops punctuation; deacc=True also strips accent marks\n", "        yield simple_preprocess(str(sentence), deacc=True)\n", "data_words = list(sent_to_words(data))\n", "\n", "# Remove stopwords using gensim's simple_preprocess and NLTK's stopwords\n", "from nltk.corpus import stopwords\n", "stop_words = stopwords.words('english')\n", "stop_words.extend(['from', 'subject', 're', 'edu', 'use']) # Add additional stop words\n", "def remove_stopwords(texts):\n", "    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]\n", "data_words_nostops = remove_stopwords(data_words)\n", "\n", "\n", "# Create and Apply Bigrams and Trigrams\n", "bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold, fewer phrases\n", "trigram = gensim.models.Phrases(bigram[data_words], threshold=100)\n", "bigram_mod = gensim.models.phrases.Phraser(bigram) # Faster way to get a sentence into a trigram/bigram\n", "trigram_mod = gensim.models.phrases.Phraser(trigram)\n", "def make_trigram(texts):\n", "    return [trigram_mod[bigram_mod[doc]] for doc in texts]\n", "data_words_trigrams = make_trigram(data_words_nostops)\n", "\n", "\n", "# Lemmatize the data\n", "import spacy\n", "nlp = spacy.load('en', disable=['parser', 'ner']) # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'\n", "def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):\n", "    texts_out = []\n", "    for sent in texts:\n", "        doc = nlp(\" \".join(sent)) # Run the spaCy pipeline on the re-joined tokens\n", "        # lemma_ is the base form of a token and pos_ is its coarse part-of-speech tag\n", "        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])\n", "    return texts_out\n", "data_lemmatized = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])\n", "\n", "\n", "# Preview the data\n", "print(data_lemmatized[:1])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Note that the output was omitted for privacy protection. The actual output is text that has been tokenized, cleaned (stopwords removed) and lemmatized, with bigrams and trigrams applied where applicable." 
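] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Since the real output is withheld, the first steps of this pipeline can be sketched on a made-up rationale. This is a pure-Python stand-in, not the Gensim and NLTK code above: the sample sentence, the tiny stop list and the `preprocess` helper are all illustrative assumptions, and Gensim's `simple_preprocess` additionally drops very short and very long tokens.

```python
import re

# Tiny illustrative stop list (an assumption; the notebook uses NLTK's English list)
stop_words = {'the', 'to', 'is', 'a', 'of', 'for', 'we', 'are'}

# Hypothetical rationale, invented for illustration only
note = 'We are going to pass: the income is too low for a loan of 500,000.'


def preprocess(text):
    # Keep letters and spaces only, as in the cleaning step
    letters_only = re.sub('[^a-zA-Z ]+', '', text)
    # Lowercase and split on whitespace (a rough stand-in for simple_preprocess)
    tokens = letters_only.lower().split()
    # Drop stopwords
    return [tok for tok in tokens if tok not in stop_words]


tokens = preprocess(note)
print(tokens)
```

Bigram detection and lemmatization are left out of this sketch because they depend on trained models."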
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 5 Prepare Dictionary and Corpus" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Now that our data have been cleaned and pre-processed, here are the final steps that we need to implement before our data is ready for LDA input:\n", "- Create a dictionary from our pre-processed data using Gensim's `corpora.Dictionary`\n", "- Create a corpus by applying \"term frequency\" (word count) to our \"pre-processed data dictionary\" using Gensim's `.doc2bow`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]\n" ] } ], "source": [ "import gensim.corpora as corpora\n", "id2word = corpora.Dictionary(data_lemmatized) # Create dictionary\n", "texts = data_lemmatized # Documents to convert into the corpus\n", "corpus = [id2word.doc2bow(text) for text in texts] # Apply Term Frequency (bag-of-words)\n", "print(corpus[:1]) # Preview the data" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We can see that each document in our corpus is a list of pairs: the dictionary index of a word, followed by the number of times that word occurs in the document." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'brother'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id2word[0]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We can also see the actual word behind each index by looking the index up in our pre-processed data dictionary." 
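] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "To make the (index, count) format concrete, here is a small pure-Python sketch of the same idea on two invented documents. It is only a stand-in for `corpora.Dictionary` and `.doc2bow`: the toy documents and the names `token2id`, `to_bow` and `id2token` are assumptions, and Gensim may assign the ids in a different order.

```python
from collections import Counter

# Toy pre-processed documents, invented for illustration only
docs = [
    ['income', 'verify', 'income', 'credit'],
    ['credit', 'score', 'low'],
]

# Give each distinct token an integer id in order of first appearance
# (a stand-in for the dictionary; Gensim may number tokens differently)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    # Count each token id and return sorted (token_id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)

# Map ids back to words, like the id2word lookup
id2token = {idx: token for token, idx in token2id.items()}
print([(id2token[idx], n) for idx, n in bow_corpus[0]])
```

The repeated word in the first toy document is the only one with a count above 1, which is why most pairs in a short rationale have a count of 1."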
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": false }, "outputs": [], "source": [ "[[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Lastly, a simple list comprehension shows each word as text (instead of its index) followed by its count.\n", "\n", "Note that the output was omitted for privacy protection. The actual output is a list of words with their corresponding counts.\n", "\n", "Now that we have created our dictionary and corpus, we can feed the data into our LDA Model." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 6 LDA Model" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "**Latent (hidden) Dirichlet Allocation** is a generative probabilistic model of documents (composites) made up of words (parts). Each topic is a probability distribution over words, and each document is a probability distribution (a mixture) over topics.\n", "\n", "Essentially, we extract the topics by inferring which words are most probable under each topic, and which topics are most probable in each document.\n", "\n", "We use two inference algorithms for LDA. **Variational Bayes** is used by Gensim's **LDA Model**, while **Gibbs Sampling** is used by the **LDA Mallet Model** through Gensim's wrapper package.\n", "\n", "Here is a general overview of Variational Bayes and Gibbs Sampling:\n", "- **Variational Bayes**\n", "  - Approximates the posterior distribution over topics with a simpler distribution and optimizes that approximation, rather than sampling\n", "  - Fast but less accurate\n", "- **Gibbs Sampling (Markov Chain Monte Carlo)**\n", "  - Samples one variable at a time, conditional on all other variables\n", "  - Slow but more accurate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Build LDA Model\n", "lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=9, random_state=100,\n", "                                            update_every=1, chunksize=100, passes=10, alpha='auto',\n", "                                            per_word_topics=True) # Here we selected 9 topics\n", "pprint(lda_model.print_topics())\n", "doc_lda = lda_model[corpus]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "After building the LDA Model using Gensim, we display the 9 topics in our documents along with the top 10 keywords and the corresponding weights that make up each topic.\n", "\n", "Note that the output was omitted for privacy protection. The actual output is a list of the 9 topics, where each topic shows its top 10 keywords and their corresponding weights." 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 7 LDA Model Performance" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Perplexity: -6.876382108603263\n", "\n", "Coherence Score: 0.4136469957174781\n" ] } ], "source": [ "# Compute perplexity\n", "print('Perplexity: ', lda_model.log_perplexity(corpus))\n", "\n", "# Compute coherence score\n", "from gensim.models import CoherenceModel\n", "coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')\n", "coherence_lda = coherence_model_lda.get_coherence()\n", "print('Coherence Score: ', coherence_lda)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "To assess the quality of the topics, we compute the Perplexity score and the Coherence score. Perplexity measures how well the LDA Model predicts the sample (the lower the perplexity, the better the model predicts). The Coherence score measures the quality of the topics that were learned (the higher the coherence score, the higher the quality of the learned topics).\n", "\n", "Here `log_perplexity` returns **-6.88**, which is a per-word likelihood bound in log space rather than the perplexity itself (hence the negative sign), and the **Coherence score is 0.41**.\n", "\n", "Note: We will use the Coherence score moving forward, since we want to optimize the number of topics in our documents." 
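] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Gensim documents the value returned by `log_perplexity` as a per-word likelihood bound, and logs the perplexity itself as 2^(-bound). As a rough sketch under that reading, the figure printed above can be converted as follows (the variable names are illustrative).

```python
# Value printed by lda_model.log_perplexity(corpus) in the cell above
log_perplexity_bound = -6.876382108603263

# Perplexity as Gensim logs it: 2 ** (-bound)
perplexity = 2 ** (-log_perplexity_bound)
print(round(perplexity, 1))
```

On this reading the bound corresponds to a perplexity of roughly 117, and a lower value would indicate a better fit."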
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## 7.1 Visualize LDA Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\", category=FutureWarning) # Hides all future warnings\n", "import pyLDAvis\n", "import pyLDAvis.gensim # Renamed to pyLDAvis.gensim_models in pyLDAvis 3.x\n", "pyLDAvis.enable_notebook()\n", "vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)\n", "vis" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We are using pyLDAvis to visualize our topics.\n", "\n", "For interpretation of pyLDAvis:\n", "- Each bubble represents a topic\n", "- The larger the bubble, the more prevalent the topic\n", "- A good topic model has fairly big, non-overlapping bubbles scattered through the chart (instead of being clustered in one quadrant)\n", "- Red highlight: Salient keywords that form the topics (most notable keywords)\n", "\n", "Note that the output was omitted for privacy protection." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 8 LDA Mallet Model" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Now that we have completed our Topic Modeling using the Variational Bayes algorithm from Gensim's LDA, we will explore Mallet's LDA through Gensim's wrapper package.\n", "\n", "Mallet's LDA Model is slower but more accurate, since it uses Gibbs Sampling (Markov Chain Monte Carlo), which samples one variable at a time conditional on all other variables." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true }, "outputs": [], "source": [ "import os\n", "from gensim.models.wrappers import LdaMallet # Requires gensim 3.x; the wrappers module was removed in gensim 4.0\n", "os.environ.update({'MALLET_HOME':r'/Users/Mick/Desktop/mallet/'}) # Set environment\n", "mallet_path = '/Users/Mick/Desktop/mallet/bin/mallet' # Update this path\n", "\n", "# Build the LDA Mallet Model\n", "ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=9, id2word=id2word) # Here we selected 9 topics again\n", "pprint(ldamallet.show_topics(formatted=False))" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "After building the LDA Mallet Model using Gensim's wrapper package, we see our 9 new topics along with the top 10 keywords and the corresponding weights that make up each topic.\n", "\n", "Note that the output was omitted for privacy protection. The actual output is a list of the 9 topics, where each topic shows its top 10 keywords and their corresponding weights." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## 8.1 LDA Mallet Model Performance" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Coherence Score: 0.4102038587308669\n" ] } ], "source": [ "# Compute coherence score\n", "coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence=\"c_v\")\n", "coherence_ldamallet = coherence_model_ldamallet.get_coherence()\n", "print('\\nCoherence Score: ', coherence_ldamallet)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Here we see that the Coherence Score for our **LDA Mallet Model** is **0.41**, which is similar to the LDA Model above. Since the Coherence Score measures the quality of the learned topics, and we are now using the more accurate **Gibbs Sampling** model, our next step is to improve the Coherence Score, and with it the overall quality of the topics learned.\n", "\n", "To do so, we need to find the optimal number of topics in our documents. With the right number of topics, the topics are extracted with as little redundancy as possible, and the Coherence Score should be at its highest." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 9 Finding the Optimal Number of Topics for LDA Mallet Model" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We will use the following function to run our **LDA Mallet Model**:\n", "\n", "    compute_coherence_values\n", "\n", "Note: We will train one model for each number of topics from 2 to 11 (`range(2, 12)`), in steps of 1." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Compute a list of LDA Mallet Models and corresponding Coherence Values\n", "def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):\n", "    coherence_values = []\n", "    model_list = []\n", "    for num_topics in range(start, limit, step):\n", "        # Use the dictionary argument rather than the global id2word\n", "        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)\n", "        model_list.append(model)\n", "        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')\n", "        coherence_values.append(coherencemodel.get_coherence())\n", "    return model_list, coherence_values\n", "model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized,\n", "                                                        start=2, limit=12, step=1)\n", "\n", "# Plot the coherence score against the number of topics\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "limit, start, step = 12, 2, 1\n", "x = range(start, limit, step)\n", "plt.plot(x, coherence_values)\n", "plt.xlabel('Num Topics')\n", "plt.ylabel('Coherence score')\n", "plt.legend(['coherence_values'], loc='best') # A list, so the whole label is used\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num Topics = 2 has Coherence Value of 0.3009\n", "Num Topics = 3 has Coherence Value of 0.3654\n", "Num Topics = 4 has Coherence Value of 0.3877\n", "Num Topics = 5 has Coherence Value of 0.3953\n", "Num Topics = 6 has Coherence Value of 0.4238\n", "Num Topics = 7 has Coherence Value of 0.3779\n", "Num Topics = 8 has Coherence Value of 0.4006\n", "Num Topics = 9 has Coherence Value of 0.391\n", "Num Topics = 10 has Coherence Value of 0.4349\n", "Num Topics = 11 has Coherence Value of 0.3763\n" ] } ], "source": [ "# Print the coherence scores\n", "for m, cv in zip(x, coherence_values):\n", "    print('Num Topics =', m, ' has Coherence Value of', round(cv, 4))" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "With our models trained and their performance visualized, we can see that the optimal number of topics is **10**, with a Coherence Score of **0.43**, which is slightly higher than our previous result of 0.41. Since this is the highest-scoring model in the range we tested, we treat 10 as the number of dominant topics in this dataset.\n", "\n", "We will proceed and select our final model using 10 topics." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "# Select the model with highest coherence value and print the topics\n", "optimal_model = model_list[8] # Index 8 is the 10-topic model (start=2, step=1)\n", "model_topics = optimal_model.show_topics(formatted=False)\n", "pprint(optimal_model.print_topics(num_words=10)) # Set num_words parameter to show 10 words per topic" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Using our **Optimal LDA Mallet Model** from Gensim's wrapper package, we displayed the 10 topics in our documents along with the top 10 keywords and the corresponding weights that make up each topic.\n", "\n", "Note that the output was omitted for privacy protection. The actual output is a list of the 10 topics, where each topic shows its top 10 keywords and their corresponding weights." 
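] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The index 8 used to pick the model can also be derived from the scores instead of being read off the printout. The sketch below does that with the coherence values copied from the printed output above; the names `best_index` and `best_num_topics` are illustrative.

```python
# Coherence values copied from the printed output (num_topics 2 to 11)
start, step = 2, 1
coherence_values = [0.3009, 0.3654, 0.3877, 0.3953, 0.4238,
                    0.3779, 0.4006, 0.391, 0.4349, 0.3763]

# Position of the highest score, and the topic count it corresponds to
best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
best_num_topics = start + best_index * step
print(best_index, best_num_topics, coherence_values[best_index])
```

In the notebook itself, `model_list[best_index]` would then select the same model as the hard-coded index, and would keep working if the range of topic counts changed."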
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## 9.1 Visualize the Optimal LDA Mallet Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Wordcloud of Top N words in each topic\n", "from matplotlib import pyplot as plt\n", "from wordcloud import WordCloud, STOPWORDS\n", "import matplotlib.colors as mcolors\n", "cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]\n", "cloud = WordCloud(stopwords=stop_words,\n", "                  background_color='white',\n", "                  width=2500,\n", "                  height=1800,\n", "                  max_words=10,\n", "                  colormap='tab10',\n", "                  color_func=lambda *args, **kwargs: cols[i],\n", "                  prefer_horizontal=1.0)\n", "topics = optimal_model.show_topics(formatted=False)\n", "fig, axes = plt.subplots(2, 5, figsize=(10,10), sharex=True, sharey=True)\n", "for i, ax in enumerate(axes.flatten()):\n", "    fig.add_subplot(ax)\n", "    topic_words = dict(topics[i][1])\n", "    cloud.generate_from_frequencies(topic_words, max_font_size=300)\n", "    plt.gca().imshow(cloud)\n", "    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))\n", "    plt.gca().axis('off')\n", "plt.subplots_adjust(wspace=0, hspace=0)\n", "plt.axis('off')\n", "plt.margins(x=0, y=0)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Here we also visualized the 10 topics in our documents along with their top 10 keywords. Each keyword's weight is shown by the size of its text.\n", "\n", "Note that the output was omitted for privacy protection." 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 10 Analysis" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Now that our **Optimal Model** is constructed, we will apply the model to determine the following:\n", "- The dominant topic for each document\n", "- The most relevant document for each of the 10 dominant topics\n", "- The distribution of documents across the 10 dominant topics" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## 10.1 Finding topics for each document" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data):\n", "    sent_topics_df = pd.DataFrame()\n", "    # Get dominant topic in each document\n", "    for i, row in enumerate(ldamodel[corpus]):\n", "        row = sorted(row, key=lambda x: (x[1]), reverse=True)\n", "        # Get the Dominant topic, Perc Contribution and Keywords for each document\n", "        for j, (topic_num, prop_topic) in enumerate(row):\n", "            if j == 0:\n", "                wp = ldamodel.show_topic(topic_num)\n", "                topic_keywords = \", \".join([word for word, prop in wp])\n", "                # DataFrame.append was removed in pandas 2.0; this cell targets an older pandas\n", "                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4),\n", "                                                                  topic_keywords]), ignore_index=True)\n", "            else:\n", "                break\n", "    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'] # Name the columns\n", "    # Add the original text to the end of the output (here texts = data, the cleaned rationales)\n", "    contents = pd.Series(texts)\n", "    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)\n", "    return sent_topics_df\n", "df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)\n", "df_dominant_topic = df_topic_sents_keywords.reset_index()\n", "df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Document']\n", "df_dominant_topic.head(10)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Note that the output was omitted for privacy protection. The actual output is a list of the first 10 documents with their corresponding dominant topics attached." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## 10.2 Finding documents for each topic" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "scrolled": true }, "outputs": [], "source": [ "# Get the most representative document for each of the 10 dominant topics\n", "sent_topics_sorteddf_mallet = pd.DataFrame()\n", "sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')\n", "for i, grp in sent_topics_outdf_grpd:\n", "    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet,\n", "                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], axis=0)\n", "sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)\n", "sent_topics_sorteddf_mallet.columns = ['Topic_Num', \"Topic_Perc_Contrib\", \"Keywords\", \"Document\"]\n", "sent_topics_sorteddf_mallet" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Note that the output was omitted for privacy protection. The actual output is a list of the most relevant document for each of the 10 dominant topics." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "## 10.3 Document distribution across Topics" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Dominant Topic Num_Document Perc_Document\n", "0 0.0 27 0.0528\n", "1 1.0 27 0.0528\n", "2 2.0 23 0.0450\n", "3 3.0 96 0.1879\n", "4 4.0 55 0.1076\n", "5 5.0 43 0.0841\n", "6 6.0 26 0.0509\n", "7 7.0 42 0.0822\n", "8 8.0 44 0.0861\n", "9 9.0 128 0.2505" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of Documents for Each Topic\n", "topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()\n", "topic_contribution = round(topic_counts/topic_counts.sum(), 4)\n", "topic_num_keywords = {'Topic_Num': pd.Series([0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0])}\n", "topic_num_keywords = pd.DataFrame(topic_num_keywords)\n", "df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)\n", "df_dominant_topics.reset_index(drop=True, inplace=True)\n", "df_dominant_topics.columns = ['Dominant Topic', 'Num_Document', 'Perc_Document']\n", "df_dominant_topics" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Here we see the number of documents, and the share of all 511 documents, assigned to each of the 10 dominant topics. Topics 9 and 3 are the largest, covering about 25% and 19% of the documents respectively." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# 11 Answering the Questions" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Based on our modeling above, we were able to use the more accurate Gibbs Sampling model, and to optimize it further by finding the number of dominant topics with the highest Coherence Score.\n", "\n", "As a result, we are now able to see the 10 dominant topics that were extracted from our dataset. Furthermore, we are also able to see the dominant topic for each of the 511 documents, and to determine the most relevant document for each dominant topic.\n", "\n", "With this in-depth analysis of individual topics and documents, the Bank can now use this approach as a \"Quality Control System\" to learn the topics from its decision-making rationales, and then determine whether those rationales are in accordance with the Bank's standards for quality control." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "413px", "left": "1097px", "top": "110px", "width": "183px" }, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }