{ "worksheets": [ { "cells": [ { "source": "# Topic extraction with Non-negative Matrix Factorization\n\nThis is a proof-of-concept application of Non-negative Matrix\nFactorization (NMF) to the term frequency matrix of a corpus of documents,\nused to extract an additive model of the topic structure of the corpus.", "cell_type": "markdown" }, { "source": "## Load the 20 newsgroups dataset", "cell_type": "markdown" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 14, "input": "from sklearn import datasets\n\n# Fetch the dataset (shuffled with a fixed seed) and show the newsgroup of the first document\ndataset = datasets.fetch_20newsgroups(shuffle=True, random_state=1)\nprint(dataset.target_names[dataset.target[0]])" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 15, "input": "print(dataset.data[0])" }, { "source": "## Restrict the dimensions of the problem\n\nUse a subset of the documents and of the vocabulary to shorten computation time.", "cell_type": "markdown" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": true, "prompt_number": 16, "input": "n_samples = 1000\nn_features = 1000" }, { "source": "## Vectorize to compute word frequencies for each document\n\nKeep only the `n_features` most frequent terms, ignore terms that occur in more than 95% of the documents (likely stop words), and apply TF-IDF weighting.", "cell_type": "markdown" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 17, "input": "from sklearn.feature_extraction import text\n\n# Count term occurrences, dropping terms present in more than 95% of the documents\nvectorizer = text.CountVectorizer(max_df=0.95, max_features=n_features)\ncounts = vectorizer.fit_transform(dataset.data[:n_samples])\n\n# Reweight the raw counts with TF-IDF\ntfidf = text.TfidfTransformer().fit_transform(counts)\ntfidf" }, { "source": "Convert from a `scipy.sparse.csr_matrix` representation to a dense `numpy` array for inspection. TF-IDF weights are already non-negative, as NMF requires, so no values need to be removed.", "cell_type": "markdown" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 18, "input": "tfidf.toarray()" }, { "source": "## Extract some topics with Non-negative Matrix Factorization", 
"cell_type": "markdown" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": true, "prompt_number": 19, "input": "from sklearn import decomposition\nn_topics = 5\n\n# Factorize the TF-IDF matrix into n_topics non-negative components\nnmf = decomposition.NMF(n_components=n_topics).fit(tfidf)" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 20, "input": "print(nmf)" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 21, "input": "print(nmf.components_)" }, { "source": "## Display the most important words for each extracted topic\n\nReuse the vocabulary of the vectorizer to map matrix column positions back to words.", "cell_type": "markdown" }, { "cell_type": "code", "language": "python", "outputs": [], "collapsed": false, "prompt_number": 22, "input": "n_top_words = 12\n# Map column indices back to the corresponding vocabulary terms\ninverse_vocabulary = dict((v, k) for k, v in vectorizer.vocabulary_.items())\n\nfor topic_idx, topic in enumerate(nmf.components_):\n    # Indices of the n_top_words largest weights, in decreasing order\n    top_indices = topic.argsort()[:-(n_top_words + 1):-1]\n    top_words = \" \".join(inverse_vocabulary[i] for i in top_indices)\n    print(\"Topic #%d: %s\" % (topic_idx, top_words))\n    print(\"\")" }, { "input": "", "cell_type": "code", "collapsed": true, "language": "python", "outputs": [] } ] } ], "metadata": { "name": "nmf_topics" }, "nbformat": 2 }