{
    "worksheets": [
        {
            "cells": [
                {
                    "source": "# Topics extraction with Non-Negative Matrix Factorization\n\nThis is a proof of concept application of Non Negative Matrix\nFactorization of the term frequency matrix of a corpus of documents so\nas to extract an additive model of the topic structure of the corpus.", 
                    "cell_type": "markdown"
                }, 
                {
                    "source": "## Load the 20 newsgroups dataset", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 14, 
                    "input": "from sklearn import datasets\ndataset = datasets.fetch_20newsgroups(shuffle=True, random_state=1)\nprint dataset.target_names[dataset.target[0]]"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 15, 
                    "input": "print dataset.data[0]"
                }, 
                {
                    "source": "## Restrict the dimensions of the problem\n\nFor shorter computation times.", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": true, 
                    "prompt_number": 16, 
                    "input": "n_samples = 1000\nn_features = 1000"
                }, 
                {
                    "source": "## Vectorize to compute word frequencies for each document\n\nRestrict to the most common word frequency and use TF-IDF weighting (without top 5% stop words)", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 17, 
                    "input": "from sklearn.feature_extraction import text\n\nvectorizer = text.CountVectorizer(max_df=0.95, max_features=n_features)\ncounts = vectorizer.fit_transform(dataset.data[:n_samples])\n\ntfidf = text.TfidfTransformer().fit_transform(counts)\ntfidf"
                }, 
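                {
                    "source": "As a rough illustration of the `max_df` filter, the sketch below vectorizes a hypothetical three-sentence corpus (`docs`, `toy_vectorizer`, `toy_counts` and `toy_tfidf` are names made up for this example, not part of the analysis). The word `the` occurs in every toy document, so with `max_df=0.95` it should be left out of the vocabulary, and a document made only of that word should yield an all-zero count vector.", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "input": "from sklearn.feature_extraction import text\n\n# Hypothetical toy corpus, not part of the 20 newsgroups data\ndocs = ['the cat sat', 'the dog sat', 'the bird flew']\n\ntoy_vectorizer = text.CountVectorizer(max_df=0.95)\ntoy_counts = toy_vectorizer.fit_transform(docs)\ntoy_tfidf = text.TfidfTransformer().fit_transform(toy_counts)\n\n# One row per document, one column per retained term\nprint(toy_tfidf.shape)\n\n# Total count for a document made only of the filtered word\nprint(toy_vectorizer.transform(['the']).sum())"
                }, 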
                {
                    "source": "Convert from a `scipy.sparse.csr_matrix` representation to a dense `numpy` array and remove negative values.", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 18, 
                    "input": "tfidf.toarray()"
                }, 
                {
                    "source": "## Extract some topics with Non-negative Matrix Factorization", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": true, 
                    "prompt_number": 19, 
                    "input": "from sklearn import decomposition\nn_topics = 5\n\nnmf = decomposition.NMF(n_components=n_topics).fit(tfidf)"
                }, 
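                {
                    "source": "NMF approximates a non-negative matrix `V` (documents by terms) by the product of two smaller non-negative matrices: `W` (documents by topics) and `H` (topics by terms). Because no entry can be negative, each document is rebuilt as a sum of topic contributions, which is what makes the model additive. The following sketch fits a hypothetical 4 x 3 toy matrix to show the shapes involved; `V`, `toy_nmf`, `W` and `H` are illustrative names, and the `init` and `max_iter` settings are arbitrary choices for this example.", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "input": "import numpy as np\nfrom sklearn import decomposition\n\n# Hypothetical toy matrix: 4 documents x 3 terms, all entries non-negative\nV = np.array([[1.0, 0.5, 0.0],\n              [2.0, 1.0, 0.0],\n              [0.0, 0.2, 1.0],\n              [0.0, 0.4, 2.0]])\n\ntoy_nmf = decomposition.NMF(n_components=2, init='nndsvd', max_iter=500)\nW = toy_nmf.fit_transform(V)  # document-topic weights\nH = toy_nmf.components_       # topic-term weights\n\nprint(W.shape)\nprint(H.shape)\n\n# Largest absolute reconstruction error of the product W.H\nprint(np.abs(V - W.dot(H)).max())"
                }, 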
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 20, 
                    "input": "print nmf"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 21, 
                    "input": "print nmf.components_"
                }, 
                {
                    "source": "## Display the most important words for each extracted topic\n\nReuse the vocabulary of the vectorizer to find the words names from the matrix positions.", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "prompt_number": 22, 
                    "input": "n_top_words = 12\ninverse_vocabulary = dict((v, k) for k, v in vectorizer.vocabulary.iteritems())\n\nfor topic_idx, topic in enumerate(nmf.components_):\n    print \"Topic #%d: \" % topic_idx,\n    print \" \".join([inverse_vocabulary[i]\n                    for i in topic.argsort()[:-(n_top_words + 1):-1]])\n    print"
                }, 
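                {
                    "source": "The slice `[:-(n_top_words + 1):-1]` applied to the result of `argsort()` can look cryptic. `argsort()` returns indices in ascending order of weight, so stepping backwards from the end yields the indices of the largest weights first. A small sketch with made-up weights (`weights`, `n_top` and `top_indices` are hypothetical names used only here):", 
                    "cell_type": "markdown"
                }, 
                {
                    "cell_type": "code", 
                    "language": "python", 
                    "outputs": [], 
                    "collapsed": false, 
                    "input": "import numpy as np\n\n# Hypothetical topic weights for a four-term vocabulary\nweights = np.array([0.1, 0.7, 0.3, 0.9])\nn_top = 2\n\n# Indices of the n_top largest weights, largest first\ntop_indices = weights.argsort()[:-(n_top + 1):-1]\nprint(top_indices)"
                }, 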
                {
                    "input": "", 
                    "cell_type": "code", 
                    "collapsed": true, 
                    "language": "python", 
                    "outputs": []
                }
            ]
        }
    ], 
    "metadata": {
        "name": "nmf_topics"
    }, 
    "nbformat": 2
}