{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will see another dimesionality reduction technique called LDA and compare it with PCA. Later on, using this we will do topic modelling.\n", "* LDA\n", "* LDA vs PCA\n", "* Topic Modelling" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import nltk\n", "import numpy as np\n", "import pandas as pd\n", "from nltk.corpus import stopwords\n", "from nltk.stem import WordNetLemmatizer\n", "from nltk.stem.porter import *\n", "from nltk.tokenize import word_tokenize\n", "from sklearn.datasets import fetch_20newsgroups, load_iris\n", "from sklearn.decomposition import PCA, LatentDirichletAllocation\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LDA(Linear Discriminant Analysis) and LDA vs PCA\n", "\n", "**LDA** is a dimesionality reduction tecnique.It transforms your data from say 'n' dimensions to 'k'dimensions. Both are pretty similar in output but with one **major** difference. LDA is a supervised algorithm whereas PCA is not, PCA ignores **class labels**.\n", "\n", " As, we have seen in previous weeks, PCA tries to find directions of maximum variance. PCA projects data onto new axis in such a way they explain the maximum variance without taking class labels into consideration. **LDA** on the other hand, creates new axis in such a way that when we project data on this axis, there is a maximum separation bewtween two class categories. LDA tries to separate classes as much as feasible on the new axis. \n", " \n", "Below is the demonstration of same with Iris dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data=load_iris().data\n", "target=load_iris().target\n", "target_names=load_iris().target_names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe=pd.DataFrame(data=np.concatenate((data,target.reshape(150,1)),axis=1),columns=['col_1','col_2','col_3','col_4','target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe.drop(columns=['target'],axis=1,inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = PCA (n_components=2)\n", "X_feature_reduced = pca.fit(dataframe).transform(dataframe)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X_feature_reduced[:,0],X_feature_reduced[:,1],c=target)\n", "plt.title(\"PCA\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " lda = LatentDirichletAllocation(n_components=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_feature_reduced = lda.fit(dataframe).transform(dataframe)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X_feature_reduced[:,0],X_feature_reduced[:,1],c=target)\n", "plt.title('LDA')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Observation\n", "As we can see from above that LDA projected data on new axis in such a way that class are separated as much as possible. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### How does LDA achieves this?\n", "\n", "LDA creates new axis based on two criteria:\n", "* Distance between means of classes\n", "* Variation within each category\n", "\n", "It projects data on new axis and finds mean for each class and variance for each class. It tries to maximise the distance between class means and tries to minimise the variation with each class. Using these into consideration we get a new axis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](../../img/shivam_panwar_data1.png \"data\")\n", "Above is the data for two Genes, we want to project them on new axis with one dimension." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Criterion which we choose above to solve this is\n", "\n", "**(µ1-µ2)^2\n", "/(s1+s2)^2**\n", "\n", ",where µ1 and µ2 are mean of each class and 's1 and s2' are variation/scatter within a clas while making new axis.\n", "We try to maximise this criteria while making new axis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Topic modelling \n", "\n", "**Topic modelling** is a method of assigning topic to each document. Each topic is made up of certain words.\n", "\n", "Consider for example:\n", "\n", "We have two topics, topic 1 and topic 2. **'Topic1'** is represented by 'apple, banana, mango' and **topic2** is represented by 'tennis, cricket, hockey'. We can infer that topic1 is talking about fruits and topic2 is talking about sports. We can assign new incoming document into one of these topics and that can be used for **clustering** purpose too. It is used in recommendation systems and many more. \n", "\n", "Another example: \n", "Consider we have 6 documents\n", "* apple banana\n", "* apple orange\n", "* banana orange\n", "* tiger cat\n", "* tiger dog\n", "* cat dog\n", "\n", "What topic modelling would do is if want to extract say two topic out of these documents, it will give two distributions, topic-word distribution and doc-topic distribution. In topic-word representation it should give word wise distribution for each topic and in doc-topic it would give for each document, it's topic representation or distribution of document for each topic.\n", "\n", "It's ideal topic-word distribution should be:\n", "\n", "| Topic | Apple | Banana | Orange | Tiger | Cat | Dog | \n", "| --- | --- | --- | --- | --- | --- | --- | \n", "| Topic 1 | .33 | .33 | .33 | 0 | 0 | 0 |\n", "| Topic 2 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 |\n", "\n", "and it's ideal document-topic distrubution should be:\n", "\n", "| Topic | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | \n", "| --- | --- | --- | --- | --- | --- | --- | \n", "| Topic 1 | 1 | 1 | 1 | 0 | 0 | 0 |\n", "| Topic 2 | 0 | 0 | 0 | 1 | 1 | 1 |\n", "\n", "and now suppose we have a new document say, ' cat dog apple', its topic wise representation should be\n", "\n", "\n", "**Topic1: 0.33**\n", "\n", "**Topic2: 0.63**\n", "\n", "\n", "LDA is highly used for this purpose.It's usage for topic modelling and has been demonstrated below. We give to it the number of topics we want to find out of the corpus. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer=WordNetLemmatizer()  # for lemmatizing words\n", "stemmer=PorterStemmer()  # for stemming words\n", "stop_words=set(stopwords.words('english'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def TokenizeText(text):\n", "    '''\n", "    Tokenizes text, removes stopwords and lemmatizes the remaining words.\n", "    '''\n", "    text=re.sub('[^A-Za-z0-9\\\s]+', '', text)\n", "    word_list=word_tokenize(text)\n", "    word_list_final=[]\n", "    for word in word_list:\n", "        if word not in stop_words:\n", "            word_list_final.append(lemmatizer.lemmatize(word))\n", "    return word_list_final" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gettopicwords(topics,cv,n_words=10):\n", "    '''\n", "    Print the top n_words for each topic.\n", "    cv is the fitted CountVectorizer.\n", "    '''\n", "    for i,topic in enumerate(topics):\n", "        top_words_array=np.array(cv.get_feature_names())[np.argsort(topic)[::-1][:n_words]]\n", "        print('For topic {} the top {} words are:'.format(i,n_words))\n", "        print(' '.join(top_words_array))\n", "        print(' ')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=pd.read_csv('million-headlines.zip',usecols=[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data link:\n", "\n", "[https://www.kaggle.com/therohk/million-headlines](https://www.kaggle.com/therohk/million-headlines)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "num_features=100000\n", "# cv=CountVectorizer(min_df=0.01,max_df=0.97,tokenizer=TokenizeText,max_features=num_features)\n", "cv=CountVectorizer(tokenizer=TokenizeText,max_features=num_features)\n", "transformed_data=cv.fit_transform(df['headline_text'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transformed_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "no_topics=10  # hyperparameter, we can change this\n", "lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(transformed_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**lda.components_** is the topic-word table; it shows the representation of each word in each topic. `components_[i, j]` can be viewed as a pseudo-count representing the number of times word j was assigned to topic i. After normalisation it can also be viewed as a distribution over the words for each topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gettopicwords(lda.components_,cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assigning topics to new documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that each document is a combination of the topics. Let's look at the topic representation of the first ten documents." ] },
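{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of the mechanics for a single headline first (the variable name `single_doc` is just for illustration): the raw text is vectorised with the fitted `cv` and then passed to `lda.transform`, which returns its distribution over the 10 topics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: topic distribution of a single headline from the corpus\n", "single_doc=df['headline_text'][0]\n", "print(single_doc)\n", "print(lda.transform(cv.transform([single_doc])))" ] },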
{ "cell_type": "markdown", "metadata": {}, "source": [ "The first ten documents and their topic-wise representation are shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "docs=df['headline_text'][:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data=[]\n", "for doc in docs:\n", "    data.append(lda.transform(cv.transform([doc])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cols=['topic'+str(i) for i in range(1,11)]\n", "doc_topic_df=pd.DataFrame(columns=cols,data=np.array(data).reshape((10,10)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_topic_df['major_topic']=doc_topic_df.idxmax(axis=1)\n", "doc_topic_df['raw_doc']=docs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_topic_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We saw how LDA can be used for topic modelling. It can also be used for document clustering based on the document-topic representation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "[StatQuest LDA](https://www.youtube.com/watch?v=azXCzI57Yfc)\n", "\n", "[https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)\n", "\n", "[https://sebastianraschka.com/faq/docs/lda-vs-pca.html](https://sebastianraschka.com/faq/docs/lda-vs-pca.html)\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }