{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "***\n", "***\n", "# Topic Modeling Using GraphLab\n", "***\n", "***\n", "\n", "王成军\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication website (计算传播网): http://computational-communication.com" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import graphlab\n", "# Render GraphLab Canvas output inline in the notebook\n", "graphlab.canvas.set_target(\"ipynb\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Download the data: http://select.cs.cmu.edu/code/graphlab/datasets/wikipedia/wikipedia_raw/w15" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This non-commercial license of GraphLab Create is assigned to wangchengjun@nju.edu.cn and will expire on July 31, 2016. For commercial licensing options, visit https://dato.com/buy/.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2016-04-14 01:12:14,140 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1460567529.log\n" ] }, { "data": { "text/html": [ "
Finished parsing file /Users/chengjun/bigdata/w15" ], "text/plain": [ "Finished parsing file /Users/chengjun/bigdata/w15" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.546547 secs." ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.546547 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "------------------------------------------------------\n", "Inferred types from first line of file as \n", "column_type_hints=[str]\n", "If parsing fails due to incorrect types, you can correct\n", "the inferred type list above and pass it to read_csv in\n", "the column_type_hints argument\n", "------------------------------------------------------\n" ] }, { "data": { "text/html": [ "
Read 12278 lines. Lines per second: 12121.5" ], "text/plain": [ "Read 12278 lines. Lines per second: 12121.5" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/chengjun/bigdata/w15" ], "text/plain": [ "Finished parsing file /Users/chengjun/bigdata/w15" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 72269 lines in 2.23078 secs." ], "text/plain": [ "Parsing completed. Parsed 72269 lines in 2.23078 secs." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sf = graphlab.SFrame.read_csv(\"/Users/chengjun/bigdata/w15\", header=False)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
| X1 |
|---|
| aynrand born and educated in russia rand migrated ... |
| asphalt in american english asphalt or ... |
| actinopterygii the actinopterygii consti ... |
| altaiclanguages these language families share ... |
| argon the name argon is derived from the greek ... |
| augustderleth a 1938 guggenheim fellow der ... |
| amateur amateurism can be seen in both a negative ... |
| assemblyline an assembly line is a manufacturing ... |
| astronomicalunit an astronomical unit ... |
| abbess an abbess latin abbatissa feminine form ... |

| X1 | tfidf | bow |
|---|---|---|
| aynrand born and educated in russia rand migrated ... | {'limited': 10.04705669672047, ... | {'limited': 3, 'writings': 2, ... |
| asphalt in american english asphalt or ... | {'all': 1.3891905239989626, ... | {'all': 1, 'accadian': 1, 'similarity': 1, ... |
| actinopterygii the actinopterygii consti ... | {'andreolepis': 11.188150547181156, ... | {'andreolepis': 1, 'all': 1, 'evolutionary': 2, ... |
| altaiclanguages these language families share ... | {'sergei': 20.031873121992916, ... | {'sergei': 3, 'all': 6, 'todays': 1, 'chinese': ... |
| argon the name argon is derived from the greek ... | {'limited': 3.3490188989068232, ... | {'limited': 1, 'embolism': 1, ... |
| augustderleth a 1938 guggenheim fellow der ... | {'evelyn': 6.7937013925087175, ... | {'evelyn': 1, 'detective': 4, ... |
| amateur amateurism can be seen in both a negative ... | {'since': 1.8775124538896095, ... | {'since': 1, 'subpar': 1, 'lack': 2, 'valuable' ... |
| assemblyline an assembly line is a manufacturing ... | {'all': 4.167571571996888, ... | {'all': 3, 'concept': 6, 'consider': 1, 'chine ... |
| astronomicalunit an astronomical unit ... | {'precise': 5.491057060675752, 'a ... | {'precise': 1, 'all': 2, 'chinese': 1, 'suns': 1, ... |
| abbess an abbess latin abbatissa feminine form ... | {'kildares': 11.188150547181156, ... | {'kildares': 1, 'they': 4, 'founder': 1, ... |
Learning a topic model" ], "text/plain": [ "Learning a topic model" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of documents 72269" ], "text/plain": [ " Number of documents 72269" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Vocabulary size 171005" ], "text/plain": [ " Vocabulary size 171005" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Running collapsed Gibbs sampling" ], "text/plain": [ " Running collapsed Gibbs sampling" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |" ], "text/plain": [ "| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 10 | 2.48s | 8.92734e+06 | 0 |" ], "text/plain": [ "| 10 | 2.48s | 8.92734e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m = graphlab.topic_model.create(docs)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Class : TopicModel\n", "\n", "Schema\n", "------\n", "Vocabulary Size : 171005\n", "\n", "Settings\n", "--------\n", "Number of Topics : 10\n", "alpha : 5.0\n", "beta : 0.1\n", "Iterations : 10\n", "Training time : 3.4936\n", "Verbose : False\n", "\n", "Accessible fields : \n", "m['topics'] : An SFrame containing the topics.\n", "m['vocabulary'] : An SArray containing the words in the vocabulary.\n", "Useful methods : \n", "m.get_topics() : Get the most probable words per topic.\n", "m.predict(new_docs) : Make predictions for new documents." ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
| topic | word | score |
|---|---|---|
| 0 | series | 0.018582602707 |
| 0 | time | 0.0160412461512 |
| 0 | played | 0.0142993990545 |
| 0 | back | 0.00951875933204 |
| 0 | game | 0.00839911774869 |
| 1 | war | 0.0176185833315 |
| 1 | film | 0.0159278169528 |
| 1 | group | 0.0140632734063 |
| 1 | party | 0.0103356107163 |
| 1 | year | 0.0102957274319 |

| topic_probabilities | vocabulary |
|---|---|
| [1.6417032014e-07, 1.42440301489e-07, ... | duke |
| [1.6417032014e-07, 1.42440301489e-07, ... | studies |
| [1.6417032014e-07, 1.42440301489e-07, ... | journal |
| [1.6417032014e-07, 1.42440301489e-07, ... | chris |
| [1.6417032014e-07, 1.42440301489e-07, ... | research |
| [0.000305520965781, 1.42440301489e-07, ... | matthew |
| [3.44757672294e-06, 1.42440301489e-07, ... | crisis |
| [1.6417032014e-07, 4.41564934616e-06, ... | financial |
| [1.6417032014e-07, 1.42440301489e-07, ... | paul |
| [1.6417032014e-07, 0.00033772595483, ... | 1987 |