{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "***\n", "***\n", "# Topic Modeling Using Turi Create\n", "***\n", "***\n", "\n", "Wang Chengjun (王成军)\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication (计算传播网): http://computational-communication.com" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:10:06.537165Z", "start_time": "2019-06-14T15:10:06.534426Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import turicreate as tc" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Download the data: http://select.cs.cmu.edu/code/graphlab/datasets/wikipedia/wikipedia_raw/w15" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:07.812317Z", "start_time": "2019-06-14T15:11:05.552049Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/w15
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/w15" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.447868 secs.
" ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.447868 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "------------------------------------------------------\n", "Inferred types from first 100 line(s) of file as \n", "column_type_hints=[str]\n", "If parsing fails due to incorrect types, you can correct\n", "the inferred type list above and pass it to read_csv in\n", "the column_type_hints argument\n", "------------------------------------------------------\n" ] }, { "data": { "text/html": [ "
Read 12278 lines. Lines per second: 16914.5
" ], "text/plain": [ "Read 12278 lines. Lines per second: 16914.5" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/w15
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/w15" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 72269 lines in 1.73223 secs.
" ], "text/plain": [ "Parsing completed. Parsed 72269 lines in 1.73223 secs." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sf = tc.SFrame.read_csv(\"/Users/datalab/bigdata/cjc/w15\", \n", " header=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:10.863623Z", "start_time": "2019-06-14T15:11:10.670898Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "<tr><th>X1</th></tr>\n",
 "<tr><td>aynrand born and educated in russia rand migrated ...</td></tr>\n",
 "<tr><td>asphalt in american english asphalt or ...</td></tr>\n",
 "<tr><td>actinopterygii the actinopterygii consti ...</td></tr>\n",
 "<tr><td>altaiclanguages these language families share ...</td></tr>\n",
 "<tr><td>argon the name argon is derived from the greek ...</td></tr>\n",
 "<tr><td>augustderleth a 1938 guggenheim fellow der ...</td></tr>\n",
 "<tr><td>amateur amateurism can be seen in both a negative ...</td></tr>\n",
 "<tr><td>assemblyline an assembly line is a manufacturing ...</td></tr>\n",
 "<tr><td>astronomicalunit an astronomical unit ...</td></tr>\n",
 "<tr><td>abbess an abbess latin abbatissa feminine form ...</td></tr>\n",
 "</table>\n",
 "[72269 rows x 1 columns]<br/>Note: Only the head of the SFrame is printed.<br/>You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
" ], "text/plain": [ "Columns:\n", "\tX1\tstr\n", "\n", "Rows: 72269\n", "\n", "Data:\n", "+-------------------------------+\n", "| X1 |\n", "+-------------------------------+\n", "| aynrand born and educated ... |\n", "| asphalt in american englis... |\n", "| actinopterygii the actinop... |\n", "| altaiclanguages these lang... |\n", "| argon the name argon is de... |\n", "| augustderleth a 1938 gugge... |\n", "| amateur amateurism can be ... |\n", "| assemblyline an assembly l... |\n", "| astronomicalunit an astron... |\n", "| abbess an abbess latin abb... |\n", "+-------------------------------+\n", "[72269 rows x 1 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sf" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Transformations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "https://dato.com/learn/userguide/text/analysis.html" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:14.476922Z", "start_time": "2019-06-14T15:11:14.470278Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['_SArray__check_min_observations',\n", " '__abs__',\n", " '__add__',\n", " '__and__',\n", " '__bool__',\n", " '__class__',\n", " '__contains__',\n", " '__copy__',\n", " '__deepcopy__',\n", " '__delattr__',\n", " '__dir__',\n", " '__div__',\n", " '__doc__',\n", " '__eq__',\n", " '__floordiv__',\n", " '__format__',\n", " '__ge__',\n", " '__get_content_identifier__',\n", " '__getattribute__',\n", " '__getitem__',\n", " '__gt__',\n", " '__has_size__',\n", " '__hash__',\n", " '__init__',\n", " '__is_materialized__',\n", " '__iter__',\n", " '__le__',\n", " '__len__',\n", " '__lt__',\n", " 
'__materialize__',\n", " '__mod__',\n", " '__module__',\n", " '__mul__',\n", " '__ne__',\n", " '__neg__',\n", " '__new__',\n", " '__nonzero__',\n", " '__or__',\n", " '__pos__',\n", " '__pow__',\n", " '__proxy__',\n", " '__radd__',\n", " '__rdiv__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__rfloordiv__',\n", " '__rmod__',\n", " '__rmul__',\n", " '__rpow__',\n", " '__rsub__',\n", " '__rtruediv__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__slots__',\n", " '__str__',\n", " '__sub__',\n", " '__subclasshook__',\n", " '__truediv__',\n", " '_count_ngrams',\n", " '_count_words',\n", " '_getitem_cache',\n", " '_save_as_text',\n", " 'all',\n", " 'any',\n", " 'append',\n", " 'apply',\n", " 'argmax',\n", " 'argmin',\n", " 'astype',\n", " 'clip',\n", " 'clip_lower',\n", " 'clip_upper',\n", " 'contains',\n", " 'countna',\n", " 'cumulative_max',\n", " 'cumulative_mean',\n", " 'cumulative_min',\n", " 'cumulative_std',\n", " 'cumulative_sum',\n", " 'cumulative_var',\n", " 'date_range',\n", " 'datetime_to_str',\n", " 'dict_has_all_keys',\n", " 'dict_has_any_keys',\n", " 'dict_keys',\n", " 'dict_trim_by_keys',\n", " 'dict_trim_by_values',\n", " 'dict_values',\n", " 'dropna',\n", " 'dtype',\n", " 'element_slice',\n", " 'explore',\n", " 'fillna',\n", " 'filter',\n", " 'filter_by',\n", " 'from_const',\n", " 'from_sequence',\n", " 'hash',\n", " 'head',\n", " 'is_in',\n", " 'is_materialized',\n", " 'is_topk',\n", " 'item_length',\n", " 'materialize',\n", " 'max',\n", " 'mean',\n", " 'min',\n", " 'nnz',\n", " 'pixel_array_to_image',\n", " 'plot',\n", " 'random_integers',\n", " 'random_split',\n", " 'read_json',\n", " 'rolling_count',\n", " 'rolling_max',\n", " 'rolling_mean',\n", " 'rolling_min',\n", " 'rolling_stdv',\n", " 'rolling_sum',\n", " 'rolling_var',\n", " 'sample',\n", " 'save',\n", " 'shape',\n", " 'show',\n", " 'sort',\n", " 'split_datetime',\n", " 'stack',\n", " 'std',\n", " 'str_to_datetime',\n", " 'sum',\n", " 'summary',\n", " 'tail',\n", " 
'to_numpy',\n", " 'unique',\n", " 'unpack',\n", " 'value_counts',\n", " 'var',\n", " 'vector_slice',\n", " 'where']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(sf['X1'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:19.154908Z", "start_time": "2019-06-14T15:11:19.150199Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Bag of words: one {word: count} dict per document.\n", "# Uses the public text_analytics API rather than the private SArray._count_words().\n", "bow = tc.text_analytics.count_words(sf['X1'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:19.961430Z", "start_time": "2019-06-14T15:11:19.956525Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "turicreate.data_structures.sarray.SArray" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(sf['X1'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:20.710963Z", "start_time": "2019-06-14T15:11:20.706859Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "turicreate.data_structures.sarray.SArray" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(bow)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:22.233966Z", "start_time": "2019-06-14T15:11:22.116394Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "dtype: int\n", "Rows: 72269\n", "[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... 
]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow.dict_has_any_keys(['limited'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:24.124953Z", "start_time": "2019-06-14T15:11:24.109085Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow.dict_values()[0][:20]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:25.136174Z", "start_time": "2019-06-14T15:11:24.947136Z" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "<tr><th>X1</th></tr>\n",
 "<tr><td>aynrand born and educated in russia rand migrated ...</td></tr>\n",
 "<tr><td>asphalt in american english asphalt or ...</td></tr>\n",
 "<tr><td>actinopterygii the actinopterygii consti ...</td></tr>\n",
 "<tr><td>altaiclanguages these language families share ...</td></tr>\n",
 "<tr><td>argon the name argon is derived from the greek ...</td></tr>\n",
 "<tr><td>augustderleth a 1938 guggenheim fellow der ...</td></tr>\n",
 "<tr><td>amateur amateurism can be seen in both a negative ...</td></tr>\n",
 "<tr><td>assemblyline an assembly line is a manufacturing ...</td></tr>\n",
 "<tr><td>astronomicalunit an astronomical unit ...</td></tr>\n",
 "<tr><td>abbess an abbess latin abbatissa feminine form ...</td></tr>\n",
 "</table>\n",
 "[72269 rows x 1 columns]<br/>Note: Only the head of the SFrame is printed.<br/>You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
" ], "text/plain": [ "Columns:\n", "\tX1\tstr\n", "\n", "Rows: 72269\n", "\n", "Data:\n", "+-------------------------------+\n", "| X1 |\n", "+-------------------------------+\n", "| aynrand born and educated ... |\n", "| asphalt in american englis... |\n", "| actinopterygii the actinop... |\n", "| altaiclanguages these lang... |\n", "| argon the name argon is de... |\n", "| augustderleth a 1938 gugge... |\n", "| amateur amateurism can be ... |\n", "| assemblyline an assembly l... |\n", "| astronomicalunit an astron... |\n", "| abbess an abbess latin abb... |\n", "+-------------------------------+\n", "[72269 rows x 1 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sf" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:25.831004Z", "start_time": "2019-06-14T15:11:25.827877Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "sf['bow'] = bow" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:27.075901Z", "start_time": "2019-06-14T15:11:26.563321Z" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "<tr><th>X1</th><th>bow</th></tr>\n",
 "<tr><td>aynrand born and educated in russia rand migrated ...</td><td>{'spoke': 1, '5000': 1, 'follows': 1, 'given' ...</td></tr>\n",
 "<tr><td>asphalt in american english asphalt or ...</td><td>{'lain': 1, 'commonly': 4, 'has': 6, 'percent': ...</td></tr>\n",
 "<tr><td>actinopterygii the actinopterygii consti ...</td><td>{'what': 1, 'follows': 1, 'given': 1, ...</td></tr>\n",
 "<tr><td>altaiclanguages these language families share ...</td><td>{'follows': 2, 'has': 11, 'general': 1, ...</td></tr>\n",
 "<tr><td>argon the name argon is derived from the greek ...</td><td>{'commonly': 1, 'lattice': 1, ...</td></tr>\n",
 "<tr><td>augustderleth a 1938 guggenheim fellow der ...</td><td>{'rescue': 2, 'amoral': 1, 'lovecraft': 11, ...</td></tr>\n",
 "<tr><td>amateur amateurism can be seen in both a negative ...</td><td>{'receiving': 1, 'having': 1, 'hand': 1, ...</td></tr>\n",
 "<tr><td>assemblyline an assembly line is a manufacturing ...</td><td>{'consider': 1, 'world': 1, 'bring': 2, 'pins' ...</td></tr>\n",
 "<tr><td>astronomicalunit an astronomical unit ...</td><td>{'given': 1, 'ephemeris': 2, 'world': 2, ...</td></tr>\n",
 "<tr><td>abbess an abbess latin abbatissa feminine form ...</td><td>{'major': 1, 'abbess': 10, 'given': 1, ...</td></tr>\n",
 "</table>\n",
 "[72269 rows x 2 columns]<br/>Note: Only the head of the SFrame is printed.<br/>You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
" ], "text/plain": [ "Columns:\n", "\tX1\tstr\n", "\tbow\tdict\n", "\n", "Rows: 72269\n", "\n", "Data:\n", "+-------------------------------+-------------------------------+\n", "| X1 | bow |\n", "+-------------------------------+-------------------------------+\n", "| aynrand born and educated ... | {'spoke': 1, '5000': 1, 'f... |\n", "| asphalt in american englis... | {'lain': 1, 'commonly': 4,... |\n", "| actinopterygii the actinop... | {'what': 1, 'follows': 1, ... |\n", "| altaiclanguages these lang... | {'follows': 2, 'has': 11, ... |\n", "| argon the name argon is de... | {'commonly': 1, 'lattice':... |\n", "| augustderleth a 1938 gugge... | {'rescue': 2, 'amoral': 1,... |\n", "| amateur amateurism can be ... | {'receiving': 1, 'having':... |\n", "| assemblyline an assembly l... | {'consider': 1, 'world': 1... |\n", "| astronomicalunit an astron... | {'given': 1, 'ephemeris': ... |\n", "| abbess an abbess latin abb... | {'major': 1, 'abbess': 10,... |\n", "+-------------------------------+-------------------------------+\n", "[72269 rows x 2 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." 
] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sf" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:27.825104Z", "start_time": "2019-06-14T15:11:27.820405Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "turicreate.data_structures.sarray.SArray" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(sf['bow'])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:28.665108Z", "start_time": "2019-06-14T15:11:28.660557Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "72269" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(sf['bow'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:11:43.181325Z", "start_time": "2019-06-14T15:11:43.158243Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('spoke', 1), ('5000', 1), ('follows', 1)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(sf['bow'][0].items())[:3]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:12:06.137333Z", "start_time": "2019-06-14T15:11:48.549415Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sf['tfidf'] = tc.text_analytics.tf_idf(sf['X1'])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:12:09.021826Z", "start_time": "2019-06-14T15:12:07.941753Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "<tr><th>X1</th><th>bow</th><th>tfidf</th></tr>\n",
 "<tr><td>aynrand born and educated in russia rand migrated ...</td><td>{'spoke': 1, '5000': 1, 'follows': 1, 'given' ...</td><td>{'spoke': 4.830308280673057, ...</td></tr>\n",
 "<tr><td>asphalt in american english asphalt or ...</td><td>{'lain': 1, 'commonly': 4, 'has': 6, 'percent': ...</td><td>{'lain': 8.480100346078947, ...</td></tr>\n",
 "<tr><td>actinopterygii the actinopterygii consti ...</td><td>{'what': 1, 'follows': 1, 'given': 1, ...</td><td>{'what': 2.505781957805935, ...</td></tr>\n",
 "<tr><td>altaiclanguages these language families share ...</td><td>{'follows': 2, 'has': 11, 'general': 1, ...</td><td>{'follows': 7.5113334785280745, ...</td></tr>\n",
 "<tr><td>argon the name argon is derived from the greek ...</td><td>{'commonly': 1, 'lattice': 1, ...</td><td>{'commonly': 3.7717720679882287, ...</td></tr>\n",
 "<tr><td>augustderleth a 1938 guggenheim fellow der ...</td><td>{'rescue': 2, 'amoral': 1, 'lovecraft': 11, ...</td><td>{'rescue': 9.19021202607744, ...</td></tr>\n",
 "<tr><td>amateur amateurism can be seen in both a negative ...</td><td>{'receiving': 1, 'having': 1, 'hand': 1, ...</td><td>{'receiving': 4.162612232542636, ...</td></tr>\n",
 "<tr><td>assemblyline an assembly line is a manufacturing ...</td><td>{'consider': 1, 'world': 1, 'bring': 2, 'pins' ...</td><td>{'consider': 4.336965619687414, ...</td></tr>\n",
 "<tr><td>astronomicalunit an astronomical unit ...</td><td>{'given': 1, 'ephemeris': 2, 'world': 2, ...</td><td>{'given': 2.5682202783138064, ...</td></tr>\n",
 "<tr><td>abbess an abbess latin abbatissa feminine form ...</td><td>{'major': 1, 'abbess': 10, 'given': 1, ...</td><td>{'major': 2.356876809458607, ...</td></tr>\n",
 "</table>\n",
 "[72269 rows x 3 columns]<br/>Note: Only the head of the SFrame is printed.<br/>You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
" ], "text/plain": [ "Columns:\n", "\tX1\tstr\n", "\tbow\tdict\n", "\ttfidf\tdict\n", "\n", "Rows: 72269\n", "\n", "Data:\n", "+-------------------------------+-------------------------------+\n", "| X1 | bow |\n", "+-------------------------------+-------------------------------+\n", "| aynrand born and educated ... | {'spoke': 1, '5000': 1, 'f... |\n", "| asphalt in american englis... | {'lain': 1, 'commonly': 4,... |\n", "| actinopterygii the actinop... | {'what': 1, 'follows': 1, ... |\n", "| altaiclanguages these lang... | {'follows': 2, 'has': 11, ... |\n", "| argon the name argon is de... | {'commonly': 1, 'lattice':... |\n", "| augustderleth a 1938 gugge... | {'rescue': 2, 'amoral': 1,... |\n", "| amateur amateurism can be ... | {'receiving': 1, 'having':... |\n", "| assemblyline an assembly l... | {'consider': 1, 'world': 1... |\n", "| astronomicalunit an astron... | {'given': 1, 'ephemeris': ... |\n", "| abbess an abbess latin abb... | {'major': 1, 'abbess': 10,... |\n", "+-------------------------------+-------------------------------+\n", "+-------------------------------+\n", "| tfidf |\n", "+-------------------------------+\n", "| {'spoke': 4.83030828067305... |\n", "| {'lain': 8.480100346078947... |\n", "| {'what': 2.505781957805935... |\n", "| {'follows': 7.511333478528... |\n", "| {'commonly': 3.77177206798... |\n", "| {'rescue': 9.1902120260774... |\n", "| {'receiving': 4.1626122325... |\n", "| {'consider': 4.33696561968... |\n", "| {'given': 2.56822027831380... |\n", "| {'major': 2.35687680945860... |\n", "+-------------------------------+\n", "[72269 rows x 3 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." 
] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sf" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:12:21.040646Z", "start_time": "2019-06-14T15:12:21.012144Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('spoke', 4.830308280673057),\n", " ('5000', 4.791220891965009),\n", " ('follows', 3.7556667392640373),\n", " ('given', 2.5682202783138064),\n", " ('percent', 17.481279902908025)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(sf['tfidf'][0].items())[:5]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Text cleaning" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:12:25.655792Z", "start_time": "2019-06-14T15:12:25.651399Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Keep only words that occur at least twice within a document.\n", "docs = sf['bow'].dict_trim_by_values(2)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:14:09.998427Z", "start_time": "2019-06-14T15:14:09.991281Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Remove stop words; exclude=True drops the listed keys instead of keeping them.\n", "docs = docs.dict_trim_by_keys(\n", "    tc.text_analytics.stop_words(),\n", "    exclude=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Topic modeling" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:14:15.739727Z", "start_time": "2019-06-14T15:14:15.734950Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function create in module turicreate.toolkits.topic_model.topic_model:\n", "\n", "create(dataset, num_topics=10, initial_topics=None, alpha=None, beta=0.1, num_iterations=10, 
num_burnin=5, associations=None, verbose=False, print_interval=10, validation_set=None, method='auto')\n", " Create a topic model from the given data set. A topic model assumes each\n", " document is a mixture of a set of topics, where for each topic some words\n", " are more likely than others. One statistical approach to do this is called a\n", " \"topic model\". This method learns a topic model for the given document\n", " collection.\n", " \n", " Parameters\n", " ----------\n", " dataset : SArray of type dict or SFrame with a single column of type dict\n", " A bag of words representation of a document corpus.\n", " Each element is a dictionary representing a single document, where\n", " the keys are words and the values are the number of times that word\n", " occurs in that document.\n", " \n", " num_topics : int, optional\n", " The number of topics to learn.\n", " \n", " initial_topics : SFrame, optional\n", " An SFrame with a column of unique words representing the vocabulary\n", " and a column of dense vectors representing\n", " probability of that word given each topic. When provided,\n", " these values are used to initialize the algorithm.\n", " \n", " alpha : float, optional\n", " Hyperparameter that controls the diversity of topics in a document.\n", " Smaller values encourage fewer topics per document.\n", " Provided value must be positive. Default value is 50/num_topics.\n", " \n", " beta : float, optional\n", " Hyperparameter that controls the diversity of words in a topic.\n", " Smaller values encourage fewer words per topic. 
Provided value\n", " must be positive.\n", " \n", " num_iterations : int, optional\n", " The number of iterations to perform.\n", " \n", " num_burnin : int, optional\n", " The number of iterations to perform when inferring the topics for\n", " documents at prediction time.\n", " \n", " verbose : bool, optional\n", " When True, print most probable words for each topic while printing\n", " progress.\n", " \n", " print_interval : int, optional\n", " The number of iterations to wait between progress reports.\n", " \n", " associations : SFrame, optional\n", " An SFrame with two columns named \"word\" and \"topic\" containing words\n", " and the topic id that the word should be associated with. These words\n", " are not considered during learning.\n", " \n", " validation_set : SArray of type dict or SFrame with a single column\n", " A bag of words representation of a document corpus, similar to the\n", " format required for `dataset`. This will be used to monitor model\n", " performance during training. Each document in the provided validation\n", " set is randomly split: the first portion is used estimate which topic\n", " each document belongs to, and the second portion is used to estimate\n", " the model's performance at predicting the unseen words in the test data.\n", " \n", " method : {'cgs', 'alias'}, optional\n", " The algorithm used for learning the model.\n", " \n", " - *cgs:* Collapsed Gibbs sampling\n", " - *alias:* AliasLDA method.\n", " \n", " Returns\n", " -------\n", " out : TopicModel\n", " A fitted topic model. This can be used with\n", " :py:func:`~TopicModel.get_topics()` and\n", " :py:func:`~TopicModel.predict()`. While fitting is in progress, several\n", " metrics are shown, including:\n", " \n", " +------------------+---------------------------------------------------+\n", " | Field | Description |\n", " +==================+===================================================+\n", " | Elapsed Time | The number of elapsed seconds. 
|\n", " +------------------+---------------------------------------------------+\n", " | Tokens/second | The number of unique words processed per second |\n", " +------------------+---------------------------------------------------+\n", " | Est. Perplexity | An estimate of the model's ability to model the |\n", " | | training data. See the documentation on evaluate. |\n", " +------------------+---------------------------------------------------+\n", " \n", " See Also\n", " --------\n", " TopicModel, TopicModel.get_topics, TopicModel.predict,\n", " turicreate.SArray.dict_trim_by_keys, TopicModel.evaluate\n", " \n", " References\n", " ----------\n", " - `Wikipedia - Latent Dirichlet allocation\n", " `_\n", " \n", " - Alias method: Li, A. et al. (2014) `Reducing the Sampling Complexity of\n", " Topic Models. `_.\n", " KDD 2014.\n", " \n", " Examples\n", " --------\n", " The following example includes an SArray of documents, where\n", " each element represents a document in \"bag of words\" representation\n", " -- a dictionary with word keys and whose values are the number of times\n", " that word occurred in the document:\n", " \n", " >>> docs = turicreate.SArray('https://static.turi.com/datasets/nytimes')\n", " \n", " Once in this form, it is straightforward to learn a topic model.\n", " \n", " >>> m = turicreate.topic_model.create(docs)\n", " \n", " It is also easy to create a new topic model from an old one -- whether\n", " it was created using Turi Create or another package.\n", " \n", " >>> m2 = turicreate.topic_model.create(docs, initial_topics=m['topics'])\n", " \n", " To manually fix several words to always be assigned to a topic, use\n", " the `associations` argument. 
The following will ensure that topic 0\n", " has the most probability for each of the provided words:\n", " \n", " >>> from turicreate import SFrame\n", " >>> associations = SFrame({'word':['hurricane', 'wind', 'storm'],\n", " 'topic': [0, 0, 0]})\n", " >>> m = turicreate.topic_model.create(docs,\n", " associations=associations)\n", " \n", " More advanced usage allows you to control aspects of the model and the\n", " learning method.\n", " \n", " >>> import turicreate as tc\n", " >>> m = tc.topic_model.create(docs,\n", " num_topics=20, # number of topics\n", " num_iterations=10, # algorithm parameters\n", " alpha=.01, beta=.1) # hyperparameters\n", " \n", " To evaluate the model's ability to generalize, we can create a train/test\n", " split where a portion of the words in each document are held out from\n", " training.\n", " \n", " >>> train, test = tc.text_analytics.random_split(.8)\n", " >>> m = tc.topic_model.create(train)\n", " >>> results = m.evaluate(test)\n", " >>> print results['perplexity']\n", "\n" ] } ], "source": [ "help(tc.topic_model.create)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:14:20.794184Z", "start_time": "2019-06-14T15:14:20.790797Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function random_split in module turicreate.toolkits.text_analytics._util:\n", "\n", "random_split(dataset, prob=0.5)\n", " Utility for performing a random split for text data that is already in\n", " bag-of-words format. 
For each (word, count) pair in a particular element,\n", " the counts are uniformly partitioned in either a training set or a test\n", " set.\n", " \n", " Parameters\n", " ----------\n", " dataset : SArray of type dict, SFrame with columns of type dict\n", " A data set in bag-of-words format.\n", " \n", " prob : float, optional\n", " Probability for sampling a word to be placed in the test set.\n", " \n", " Returns\n", " -------\n", " train, test : SArray\n", " Two data sets in bag-of-words format, where the combined counts are\n", " equal to the counts in the original data set.\n", " \n", " Examples\n", " --------\n", " >>> docs = turicreate.SArray([{'are':5, 'you':3, 'not': 1, 'entertained':10}])\n", " >>> train, test = turicreate.text_analytics.random_split(docs)\n", " >>> print(train)\n", " [{'not': 1.0, 'you': 3.0, 'are': 3.0, 'entertained': 7.0}]\n", " >>> print(test)\n", " [{'are': 2.0, 'entertained': 3.0}]\n", "\n" ] } ], "source": [ "help(tc.text_analytics.random_split)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:14:38.091836Z", "start_time": "2019-06-14T15:14:24.927630Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "train, test = tc.text_analytics.random_split(docs, .8)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:15:01.989430Z", "start_time": "2019-06-14T15:14:42.688161Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
Learning a topic model
" ], "text/plain": [ "Learning a topic model" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of documents 72269
" ], "text/plain": [ " Number of documents 72269" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Vocabulary size 108205
" ], "text/plain": [ " Vocabulary size 108205" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Running collapsed Gibbs sampling
" ], "text/plain": [ " Running collapsed Gibbs sampling" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |
" ], "text/plain": [ "| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 10 | 1.66s | 6.17013e+06 | 0 |
" ], "text/plain": [ "| 10 | 1.66s | 6.17013e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 20 | 3.12s | 6.57117e+06 | 0 |
" ], "text/plain": [ "| 20 | 3.12s | 6.57117e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 30 | 4.58s | 6.41968e+06 | 0 |
" ], "text/plain": [ "| 30 | 4.58s | 6.41968e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 40 | 6.00s | 6.61674e+06 | 0 |
" ], "text/plain": [ "| 40 | 6.00s | 6.61674e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 50 | 7.42s | 6.53873e+06 | 0 |
" ], "text/plain": [ "| 50 | 7.42s | 6.53873e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 60 | 8.85s | 6.46591e+06 | 0 |
" ], "text/plain": [ "| 60 | 8.85s | 6.46591e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 70 | 10.38s | 5.92867e+06 | 0 |
" ], "text/plain": [ "| 70 | 10.38s | 5.92867e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 80 | 11.90s | 6.36375e+06 | 0 |
" ], "text/plain": [ "| 80 | 11.90s | 6.36375e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 90 | 13.42s | 6.20572e+06 | 0 |
" ], "text/plain": [ "| 90 | 13.42s | 6.20572e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 100 | 15.03s | 5.4879e+06 | 0 |
" ], "text/plain": [ "| 100 | 15.03s | 5.4879e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m = tc.topic_model.create(train, \n", " num_topics=100, # number of topics\n", " num_iterations=100, # algorithm parameters\n", " alpha=None, beta=.1) # hyperparameters" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:15:19.366488Z", "start_time": "2019-06-14T15:15:12.923625Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4552.105441414741\n" ] } ], "source": [ "# Signature is evaluate(train_data, test_data=None); with one argument, test_data defaults to train_data\n", "results = m.evaluate(test)\n", "print(results['perplexity'])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:15:22.298503Z", "start_time": "2019-06-14T15:15:22.265410Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Class : TopicModel\n", "\n", "Schema\n", "------\n", "Vocabulary Size : 108205\n", "\n", "Settings\n", "--------\n", "Number of Topics : 100\n", "alpha : 0.5\n", "beta : 0.1\n", "Iterations : 100\n", "Training time : 16.0432\n", "Verbose : True\n", "\n", "Accessible fields : \n", "m.topics : An SFrame containing the topics.\n", "m.vocabulary : An SArray containing the words in the vocabulary.\n", "Useful methods : \n", "m.get_topics() : Get the most probable words per topic.\n", "m.predict(new_docs) : Make predictions for new documents." ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:15:26.257938Z", "start_time": "2019-06-14T15:15:26.096754Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
topic word score
0 years 0.004647514462204498
0 evans 0.004059221492305195
0 lebanon 0.0028172696669622205
0 green 0.0028172696669622205
0 time 0.0020982449259741827
1 national 0.005351237598960527
1 back 0.002278117379832135
1 baldwin 0.0018772756121197358
1 chicago 0.0018104686508343358
1 private 0.0016100477669781365
\n", "[500 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\ttopic\tint\n", "\tword\tstr\n", "\tscore\tfloat\n", "\n", "Rows: 500\n", "\n", "Data:\n", "+-------+----------+-----------------------+\n", "| topic | word | score |\n", "+-------+----------+-----------------------+\n", "| 0 | years | 0.004647514462204498 |\n", "| 0 | evans | 0.004059221492305195 |\n", "| 0 | lebanon | 0.0028172696669622205 |\n", "| 0 | green | 0.0028172696669622205 |\n", "| 0 | time | 0.0020982449259741827 |\n", "| 1 | national | 0.005351237598960527 |\n", "| 1 | back | 0.002278117379832135 |\n", "| 1 | baldwin | 0.0018772756121197358 |\n", "| 1 | chicago | 0.0018104686508343358 |\n", "| 1 | private | 0.0016100477669781365 |\n", "+-------+----------+-----------------------+\n", "[500 rows x 3 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.get_topics()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:15:30.005613Z", "start_time": "2019-06-14T15:15:30.001439Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method get_topics in module turicreate.toolkits.topic_model.topic_model:\n", "\n", "get_topics(topic_ids=None, num_words=5, cdf_cutoff=1.0, output_type='topic_probabilities') method of turicreate.toolkits.topic_model.topic_model.TopicModel instance\n", " Get the words associated with a given topic. The score column is the\n", " probability of choosing that word given that you have chosen a\n", " particular topic.\n", " \n", " Parameters\n", " ----------\n", " topic_ids : list of int, optional\n", " The topics to retrieve words. 
Topic ids are zero-based.\n", " Throws an error if greater than or equal to m['num_topics'], or\n", " if the requested topic name is not present.\n", " \n", " num_words : int, optional\n", " The number of words to show.\n", " \n", " cdf_cutoff : float, optional\n", " Allows one to only show the most probable words whose cumulative\n", " probability is below this cutoff. For example if there exist\n", " three words where\n", " \n", " .. math::\n", " p(word_1 | topic_k) = .1\n", " \n", " p(word_2 | topic_k) = .2\n", " \n", " p(word_3 | topic_k) = .05\n", " \n", " then setting :math:`cdf_{cutoff}=.3` would return only\n", " :math:`word_1` and :math:`word_2` since\n", " :math:`p(word_1 | topic_k) + p(word_2 | topic_k) <= cdf_{cutoff}`\n", " \n", " output_type : {'topic_probabilities' | 'topic_words'}, optional\n", " Determine the type of desired output. See below.\n", " \n", " Returns\n", " -------\n", " out : SFrame\n", " If output_type is 'topic_probabilities', then the returned value is\n", " an SFrame with a column of words ranked by a column of scores for\n", " each topic. 
Otherwise, the returned value is a SArray where\n", " each element is a list of the most probable words for each topic.\n", " \n", " Examples\n", " --------\n", " Get the highest ranked words for all topics.\n", " \n", " >>> docs = turicreate.SArray('https://static.turi.com/datasets/nips-text')\n", " >>> m = turicreate.topic_model.create(docs,\n", " num_iterations=50)\n", " >>> m.get_topics()\n", " +-------+----------+-----------------+\n", " | topic | word | score |\n", " +-------+----------+-----------------+\n", " | 0 | cell | 0.028974400831 |\n", " | 0 | input | 0.0259470208503 |\n", " | 0 | image | 0.0215721599763 |\n", " | 0 | visual | 0.0173635081992 |\n", " | 0 | object | 0.0172447874156 |\n", " | 1 | function | 0.0482834508265 |\n", " | 1 | input | 0.0456270024091 |\n", " | 1 | point | 0.0302662839454 |\n", " | 1 | result | 0.0239474934631 |\n", " | 1 | problem | 0.0231750116011 |\n", " | ... | ... | ... |\n", " +-------+----------+-----------------+\n", " \n", " Get the highest ranked words for topics 0 and 1 and show 15 words per\n", " topic.\n", " \n", " >>> m.get_topics([0, 1], num_words=15)\n", " +-------+----------+------------------+\n", " | topic | word | score |\n", " +-------+----------+------------------+\n", " | 0 | cell | 0.028974400831 |\n", " | 0 | input | 0.0259470208503 |\n", " | 0 | image | 0.0215721599763 |\n", " | 0 | visual | 0.0173635081992 |\n", " | 0 | object | 0.0172447874156 |\n", " | 0 | response | 0.0139740298286 |\n", " | 0 | layer | 0.0122585145062 |\n", " | 0 | features | 0.0115343177265 |\n", " | 0 | feature | 0.0103530459301 |\n", " | 0 | spatial | 0.00823387994361 |\n", " | ... | ... | ... 
|\n", " +-------+----------+------------------+\n", " \n", " If one wants to instead just get the top words per topic, one may\n", " change the format of the output as follows.\n", " \n", " >>> topics = m.get_topics(output_type='topic_words')\n", " dtype: list\n", " Rows: 10\n", " [['cell', 'image', 'input', 'object', 'visual'],\n", " ['algorithm', 'data', 'learning', 'method', 'set'],\n", " ['function', 'input', 'point', 'problem', 'result'],\n", " ['model', 'output', 'pattern', 'set', 'unit'],\n", " ['action', 'learning', 'net', 'problem', 'system'],\n", " ['error', 'function', 'network', 'parameter', 'weight'],\n", " ['information', 'level', 'neural', 'threshold', 'weight'],\n", " ['control', 'field', 'model', 'network', 'neuron'],\n", " ['hidden', 'layer', 'system', 'training', 'vector'],\n", " ['component', 'distribution', 'local', 'model', 'optimal']]\n", "\n" ] } ], "source": [ "help(m.get_topics)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:15:40.870971Z", "start_time": "2019-06-14T15:15:40.694223Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['storm', 'florida', 'area', 'due', 'texas', 'people', 'damage', 'hurricane', 'tropical', 'system']\n", "['project', 'ryan', 'founded', 'harvard', 'carroll', 'including', 'wilson', 'school', 'oxford', 'national']\n", "['alliance', 'organization', 'membership', 'board', 'member', 'members', 'association', 'groups', 'group', 'society']\n", "['summer', 'george', 'kan', 'julian', '1978', 'date', 'years', 'wolfe', 'london', 'italian']\n", "['german', 'division', 'forces', 'british', 'army', 'men', 'battle', 'general', 'war', 'military']\n", "['arrow', 'williams', 'california', 'warren', 'santa', 'los', 'san', 'great', 'angeles', 'renamed']\n", "['miller', 'time', 'produced', 'years', 'lewis', 'hamilton', 'morris', '1999', '2005', 'epic']\n", "['played', 'games', 'coach', 'won', 'points', 
'team', 'game', 'record', 'season', 'teams']\n", "['green', 'time', 'connecticut', 'evans', 'years', 'sixth', 'lebanon', '2001', 'worked', '2005']\n", "['van', 'berlin', 'im', 'des', 'von', 'die', 'den', 'der', 'das', 'und']\n", "['clothing', 'hat', 'worn', 'wear', 'made', 'green', 'summer', 'black', 'top', 'dress']\n", "['2009', 'year', 'kids', 'october', '2010', '2007', 'school', '2006', '2008', 'national']\n", "['magazine', 'grant', 'oregon', 'portland', 'college', '1985', '2006', 'year', 'long', 'beer']\n", "['network', 'show', 'local', 'news', 'broadcast', 'radio', 'channel', 'stations', 'television', 'station']\n", "['bridge', 'north', 'river', 'county', 'state', 'west', 'route', 'city', 'east', 'road']\n", "['part', 'degree', 'na', 'valid', 'agent', 'history', 'reform', 'born', 'agency', 'trn']\n", "['children', 'father', 'mother', 'married', 'years', 'family', 'born', 'life', 'died', 'time']\n", "['early', 'population', 'century', 'american', 'war', 'government', 'states', 'british', 'united', 'people']\n", "['stories', 'magazine', 'writing', 'history', 'work', 'wrote', 'works', 'published', 'book', 'books']\n", "['collection', 'painting', 'arts', 'museum', 'work', 'york', 'design', 'artists', 'art', 'works']\n", "['named', 'flag', 'orange', 'blue', 'colors', 'yellow', 'tiger', 'black', 'red', 'white']\n", "['made', 'mission', 'test', 'project', 'space', 'aircraft', 'wright', 'design', 'launch', 'flight']\n", "['alan', 'phillips', 'preov', 'koice', 'beth', 'trenn', 'phillip', 'nitra', 'town', 'trnava']\n", "['racing', 'cars', 'car', 'engines', 'models', 'series', 'engine', 'system', 'race', 'model']\n", "['building', 'duck', 'london', 'philadelphia', 'represented', 'major', 'york', 'design', 'journal', 'duncan']\n", "['english', 'language', 'form', 'meaning', 'word', 'words', 'names', 'written', 'languages', 'common']\n", "['gene', 'enzyme', 'dna', 'site', 'proteins', 'protein', 'cells', 'cell', 'structure', 'genes']\n", "['work', 'year', '2009', 'andrew', 
'list', 'henin', 'link', 'dates', 'part', 'calendar']\n", "['law', 'public', 'united', 'state', 'legal', 'act', 'states', 'court', 'case', 'federal']\n", "['trains', 'west', 'line', 'rail', 'train', 'station', 'services', 'service', 'opened', 'railway']\n", "['served', 'cemetery', 'received', 'texas', 'po', 'christi', 'university', 'post', 'corpus', 'mexico']\n", "['international', 'worked', 'festival', 'director', 'film', 'awards', 'won', 'award', 'awarded', 'academy']\n", "['empire', 'china', 'chinese', 'dynasty', 'roman', 'greek', 'time', 'king', 'bc', 'emperor']\n", "['port', 'ship', 'coast', 'island', 'sea', 'fleet', 'region', 'bay', 'ships', 'islands']\n", "['south', 'north', 'river', 'mountain', 'water', 'creek', 'lake', 'valley', 'park', 'area']\n", "['january', 'day', 'june', '2009', 'october', 'december', '2010', '2007', '2008', 'september']\n", "['round', 'team', 'won', 'event', 'title', 'world', 'match', 'championship', 'lost', 'open']\n", "['created', '2010', 'ash', 'complex', 'sanders', 'hotel', 'skating', 'school', 'house', 'msn']\n", "['left', 'easy', 'leg', 'ancient', 'benson', 'school', '2003', 'critical', 'british', 'westfield']\n", "['released', 'world', 'games', 'version', 'series', 'video', 'player', 'game', 'players', '2']\n", "['won', 'cup', 'team', 'teams', 'season', 'final', 'football', 'club', 'league', 'played']\n", "['1988', 'called', 'list', '1971', '2010', '1994', 'made', 'trent', 'found', 'local']\n", "['world', 'received', 'usa', 'wall', 'part', 'start0', 'williams', 'work', 'events', 'plotdata']\n", "['years', 'age', 'income', 'median', '18', 'living', 'males', 'town', 'average', 'population']\n", "['youth', 'bamboo', 'ross', 'griffin', 'williams', 'meeting', 'monument', 'town', 'joined', 'girl']\n", "['south', 'european', 'international', 'economic', 'states', 'development', 'united', 'government', 'countries', 'world']\n", "['national', 'constituencies', 'animation', 'part', 'track', 'baldwin', '1982', 'private', 'back', 
'chicago']\n", "['hill', 'morgan', 'davis', 'cj', '1992', 'arch', 'child', '1995', 'tony', 'part']\n", "['form', 'called', '1', 'theory', 'set', 'number', 'function', 'type', 'system', 'model']\n", "['form', 'women', 'social', 'life', 'society', 'people', 'time', 'world', 'human', 'work']\n", "['city', 'san', 'paris', 'el', 'france', 'la', 'de', 'french', 'spain', 'spanish']\n", "['city', 'located', 'house', 'park', 'building', 'street', 'built', 'town', 'area', 'century']\n", "['made', 'include', 'variety', 'traditional', 'rice', 'popular', 'food', 'called', 'served', 'wine']\n", "['canadian', 'ontario', 'alberta', 'british', 'york', '2002', 'toronto', 'quebec', 'canada', 'montreal']\n", "['khan', 'afghanistan', 'indian', 'singh', 'temple', 'punjab', 'pakistan', 'sri', 'india', 'community']\n", "['brooklyn', 'city', 'class', 'ambassador', '2004', 'due', 'babylon', 'historical', 'area', '5']\n", "['heat', 'iron', 'high', 'process', 'water', 'form', 'gas', 'temperature', 'nuclear', 'energy']\n", "['airlines', 'regional', 'airport', '2010', 'international', 'travel', 'airways', '2005', 'airline', 'joined']\n", "['head', '15', 'home', '2010', '2003', 'elected', 'park', 'stone', 'gaol', 'numerous']\n", "['poland', 'years', 'polish', 'soviet', 'moscow', 'war', 'russian', 'swedish', 'union', 'russia']\n", "['william', 'henry', 'england', 'sir', 'london', 'royal', 'lord', 'king', 'son', 'john']\n", "['ohio', 'virginia', 'district', 'washington', 'john', 'carolina', 'county', 'state', 'served', 'south']\n", "['state', 'election', 'political', 'minister', 'council', 'national', 'party', 'president', 'government', 'elected']\n", "['district', 'districts', 'register', 'properties', 'places', 'illinois', 'national', 'county', 'historic', 'school']\n", "['operations', 'aircraft', 'base', 'united', 'war', 'service', 'air', 'group', 'force', 'training']\n", "['number', '25', '2010', 'golf', 'time', 'scrooge', 'group', 'university', 'chams', 'world']\n", "['light', 'range', 
'made', 'time', 'system', 'small', 'power', 'energy', 'high', 'current']\n", "['1983', 'university', 'states', 'served', 'appointed', '1986', 'degree', 'bill', 'professor', '2000']\n", "['years', 'head', 'joseph', 'school', 'smiths', 'rejoice', 'smith', 'clark', 'stone', 'plates']\n", "['found', 'occur', 'body', 'small', 'disease', 'large', 'species', 'family', 'order', 'birds']\n", "['number', 'fat', '2006', 'named', 'work', 'death', 'published', 'flores', 'vincent', 'villa']\n", "['company', 'business', 'mine', 'sold', 'oil', 'stores', 'industry', 'year', 'production', 'mining']\n", "['martin', 'kitty', 'obesity', 'turtle', 'point', '2000', 'camp', 'clark', 'moved', 'born']\n", "['1994', '1998', 'ryo', 'venezuela', 'tannins', 'national', 'group', 'minor', 'christmas', 'world']\n", "['parish', 'saint', 'bishop', 'church', 'roman', 'council', 'st', 'century', 'catholic', 'pope']\n", "['released', 'records', 'single', 'album', 'songs', 'music', 'band', 'live', 'song', 'tour']\n", "['wales', 'australia', 'played', 'cricket', 'zealand', 'day', 'sydney', 'made', 'australian', 'south']\n", "['information', 'network', 'system', 'software', 'systems', 'internet', 'users', 'technology', 'computer', 'data']\n", "['philippine', 'clara', 'foundation', 'chief', 'full', 'fair', 'philippines', 'manila', 'rupiah', 'belle']\n", "['group', 'began', 'anderson', 'minnesota', 'john', 'period', 'jones', 'te', 'td', 'dont']\n", "['northern', 'cork', 'dublin', 'ireland', 'irish', 'senior', 'john', 'county', 'title', 'medal']\n", "['school', '2006', 'stone', 'york', 'bay', 'village', 'held', 'created', 'local', 'community']\n", "['years', 'jersey', 'police', 'city', 'york', 'family', 'crime', 'gang', 'prison', 'arrested']\n", "['2001', 'american', 'saudi', 'bin', 'world', 'al', 'sh', 'including', 'laden', 'joined']\n", "['school', 'education', 'year', 'students', 'research', 'schools', 'college', 'high', 'university', 'program']\n", "['events', 'time', 'mr', 'ride', 'roller', 'part', 
'including', 'bicycle', 'riders', 'bike']\n", "['montenegro', 'yugoslavia', 'serbia', 'school', 'serbian', 'albanian', 'albania', 'croatian', 'croatia', 'year']\n", "['austin', 'shells', 'place', 'peter', 'shell', 'uk', 'school', '2010', 'active', 'mark']\n", "['york', 'series', 'masters', 'booth', 'world', 'free', '2003', 'major', 'birmingham', 'setting']\n", "['role', 'appeared', 'films', 'episode', 'television', 'show', 'series', 'character', 'movie', 'film']\n", "['medicine', 'dr', 'training', 'care', 'hospital', 'health', 'surgery', 'center', 'dog', 'medical']\n", "['norwegian', 'travis', '2007', 'ste', 'bar', 'city', 'home', 'norway', 'parker', 'include']\n", "['jewish', 'christian', 'people', 'god', 'gods', 'religious', 'great', 'church', 'churches', 'jesus']\n", "['zion', 'contest', 'miss', 'relay', 'gold', 'post', 'nec', 'mori', 'yacht', 'time']\n", "['york', 'theatre', 'theater', 'dance', 'musical', 'orchestra', 'opera', 'music', 'performed', 'festival']\n", "['tax', 'money', 'financial', 'market', 'bank', 'million', 'business', 'price', 'services', 'company']\n", "['life', 'death', 'father', 'end', 'find', 'back', 'man', 'make', 'time', 'tells']\n", "['years', '2007', 'walker', 'post', '1911', 'croydon', 'burma', 'burmese', 'named', 'day']\n", "['high', 'vegas', 'purchased', 'alice', 'paul', 'hotel', 'scott', 'ranch', 'las', 'highlands']\n", "['horse', 'stakes', 'born', 'american', 'horses', 'race', 'boas', 'breed', 'english', 'racing']\n" ] } ], "source": [ "topics = m.get_topics(num_words=10).unstack(['word','score'], \\\n", " new_column_name='topic_words')['topic_words'].apply(lambda x: x.keys())\n", "for topic in topics:\n", " print(topic)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:16:11.030526Z", "start_time": "2019-06-14T15:16:11.023731Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on TopicModel in module turicreate.toolkits.topic_model.topic_model 
object:\n", "\n", "class TopicModel(turicreate.toolkits._model.Model)\n", " | TopicModel objects can be used to predict the underlying topic of a\n", " | document.\n", " | \n", " | This model cannot be constructed directly. Instead, use\n", " | :func:`turicreate.topic_model.create` to create an instance\n", " | of this model. A detailed list of parameter options and code samples\n", " | are available in the documentation for the create function.\n", " | \n", " | Method resolution order:\n", " | TopicModel\n", " | turicreate.toolkits._model.Model\n", " | turicreate.toolkits._model.ExposeAttributesFromProxy\n", " | builtins.object\n", " | \n", " | Methods defined here:\n", " | \n", " | __init__(self, model_proxy)\n", " | Initialize self. See help(type(self)) for accurate signature.\n", " | \n", " | __repr__(self)\n", " | Print a string description of the model when the model name is entered\n", " | in the terminal.\n", " | \n", " | __str__(self)\n", " | Return a string description of the model to the ``print`` method.\n", " | \n", " | Returns\n", " | -------\n", " | out : string\n", " | A description of the model.\n", " | \n", " | evaluate(self, train_data, test_data=None, metric='perplexity')\n", " | Estimate the model's ability to predict new data. Imagine you have a\n", " | corpus of books. One common approach to evaluating topic models is to\n", " | train on the first half of all of the books and see how well the model\n", " | predicts the second half of each book.\n", " | \n", " | This method returns a metric called perplexity, which is related to the\n", " | likelihood of observing these words under the given model. 
See\n", " | :py:func:`~turicreate.topic_model.perplexity` for more details.\n", " | \n", " | The provided `train_data` and `test_data` must have the same length,\n", " | i.e., both data sets must have the same number of documents; the model\n", " | will use train_data to estimate which topic the document belongs to, and\n", " | this is used to estimate the model's performance at predicting the\n", " | unseen words in the test data.\n", " | \n", " | See :py:func:`~turicreate.topic_model.TopicModel.predict` for details\n", " | on how these predictions are made, and see\n", " | :py:func:`~turicreate.text_analytics.random_split` for a helper function\n", " | that can be used for making train/test splits.\n", " | \n", " | Parameters\n", " | ----------\n", " | train_data : SArray or SFrame\n", " | A set of documents to predict topics for.\n", " | \n", " | test_data : SArray or SFrame, optional\n", " | A set of documents to evaluate performance on.\n", " | By default this will set to be the same as train_data.\n", " | \n", " | metric : str\n", " | The chosen metric to use for evaluating the topic model.\n", " | Currently only 'perplexity' is supported.\n", " | \n", " | Returns\n", " | -------\n", " | out : dict\n", " | The set of estimated evaluation metrics.\n", " | \n", " | See Also\n", " | --------\n", " | predict, turicreate.toolkits.text_analytics.random_split\n", " | \n", " | Examples\n", " | --------\n", " | >>> docs = turicreate.SArray('https://static.turi.com/datasets/nips-text')\n", " | >>> train_data, test_data = turicreate.text_analytics.random_split(docs)\n", " | >>> m = turicreate.topic_model.create(train_data)\n", " | >>> m.evaluate(train_data, test_data)\n", " | {'perplexity': 2467.530370396021}\n", " | \n", " | get_topics(self, topic_ids=None, num_words=5, cdf_cutoff=1.0, output_type='topic_probabilities')\n", " | Get the words associated with a given topic. 
The score column is the\n", " | probability of choosing that word given that you have chosen a\n", " | particular topic.\n", " | \n", " | Parameters\n", " | ----------\n", " | topic_ids : list of int, optional\n", " | The topics to retrieve words. Topic ids are zero-based.\n", " | Throws an error if greater than or equal to m['num_topics'], or\n", " | if the requested topic name is not present.\n", " | \n", " | num_words : int, optional\n", " | The number of words to show.\n", " | \n", " | cdf_cutoff : float, optional\n", " | Allows one to only show the most probable words whose cumulative\n", " | probability is below this cutoff. For example if there exist\n", " | three words where\n", " | \n", " | .. math::\n", " | p(word_1 | topic_k) = .1\n", " | \n", " | p(word_2 | topic_k) = .2\n", " | \n", " | p(word_3 | topic_k) = .05\n", " | \n", " | then setting :math:`cdf_{cutoff}=.3` would return only\n", " | :math:`word_1` and :math:`word_2` since\n", " | :math:`p(word_1 | topic_k) + p(word_2 | topic_k) <= cdf_{cutoff}`\n", " | \n", " | output_type : {'topic_probabilities' | 'topic_words'}, optional\n", " | Determine the type of desired output. See below.\n", " | \n", " | Returns\n", " | -------\n", " | out : SFrame\n", " | If output_type is 'topic_probabilities', then the returned value is\n", " | an SFrame with a column of words ranked by a column of scores for\n", " | each topic. 
Otherwise, the returned value is a SArray where\n", " | each element is a list of the most probable words for each topic.\n", " | \n", " | Examples\n", " | --------\n", " | Get the highest ranked words for all topics.\n", " | \n", " | >>> docs = turicreate.SArray('https://static.turi.com/datasets/nips-text')\n", " | >>> m = turicreate.topic_model.create(docs,\n", " | num_iterations=50)\n", " | >>> m.get_topics()\n", " | +-------+----------+-----------------+\n", " | | topic | word | score |\n", " | +-------+----------+-----------------+\n", " | | 0 | cell | 0.028974400831 |\n", " | | 0 | input | 0.0259470208503 |\n", " | | 0 | image | 0.0215721599763 |\n", " | | 0 | visual | 0.0173635081992 |\n", " | | 0 | object | 0.0172447874156 |\n", " | | 1 | function | 0.0482834508265 |\n", " | | 1 | input | 0.0456270024091 |\n", " | | 1 | point | 0.0302662839454 |\n", " | | 1 | result | 0.0239474934631 |\n", " | | 1 | problem | 0.0231750116011 |\n", " | | ... | ... | ... |\n", " | +-------+----------+-----------------+\n", " | \n", " | Get the highest ranked words for topics 0 and 1 and show 15 words per\n", " | topic.\n", " | \n", " | >>> m.get_topics([0, 1], num_words=15)\n", " | +-------+----------+------------------+\n", " | | topic | word | score |\n", " | +-------+----------+------------------+\n", " | | 0 | cell | 0.028974400831 |\n", " | | 0 | input | 0.0259470208503 |\n", " | | 0 | image | 0.0215721599763 |\n", " | | 0 | visual | 0.0173635081992 |\n", " | | 0 | object | 0.0172447874156 |\n", " | | 0 | response | 0.0139740298286 |\n", " | | 0 | layer | 0.0122585145062 |\n", " | | 0 | features | 0.0115343177265 |\n", " | | 0 | feature | 0.0103530459301 |\n", " | | 0 | spatial | 0.00823387994361 |\n", " | | ... | ... | ... 
|\n", " | +-------+----------+------------------+\n", " | \n", " | If one wants to instead just get the top words per topic, one may\n", " | change the format of the output as follows.\n", " | \n", " | >>> topics = m.get_topics(output_type='topic_words')\n", " | dtype: list\n", " | Rows: 10\n", " | [['cell', 'image', 'input', 'object', 'visual'],\n", " | ['algorithm', 'data', 'learning', 'method', 'set'],\n", " | ['function', 'input', 'point', 'problem', 'result'],\n", " | ['model', 'output', 'pattern', 'set', 'unit'],\n", " | ['action', 'learning', 'net', 'problem', 'system'],\n", " | ['error', 'function', 'network', 'parameter', 'weight'],\n", " | ['information', 'level', 'neural', 'threshold', 'weight'],\n", " | ['control', 'field', 'model', 'network', 'neuron'],\n", " | ['hidden', 'layer', 'system', 'training', 'vector'],\n", " | ['component', 'distribution', 'local', 'model', 'optimal']]\n", " | \n", " | predict(self, dataset, output_type='assignment', num_burnin=None)\n", " | Use the model to predict topics for each document. The provided\n", " | `dataset` should be an SArray object where each element is a dict\n", " | representing a single document in bag-of-words format, where keys\n", " | are words and values are their corresponding counts. If `dataset` is\n", " | an SFrame, then it must contain a single column of dict type.\n", " | \n", " | The current implementation will make inferences about each document\n", " | given its estimates of the topics learned when creating the model.\n", " | This is done via Gibbs sampling.\n", " | \n", " | Parameters\n", " | ----------\n", " | dataset : SArray, SFrame of type dict\n", " | A set of documents to use for making predictions.\n", " | \n", " | output_type : str, optional\n", " | The type of output desired. 
This can either be\n", " | \n", " | - assignment: the returned values are integers in [0, num_topics)\n", " | - probability: each returned prediction is a vector with length\n", " | num_topics, where element k represents the probability that\n", " | document belongs to topic k.\n", " | \n", " | num_burnin : int, optional\n", " | The number of iterations of Gibbs sampling to perform when\n", " | inferring the topics for documents at prediction time.\n", " | If provided this will override the burnin value set during\n", " | training.\n", " | \n", " | Returns\n", " | -------\n", " | out : SArray\n", " | \n", " | See Also\n", " | --------\n", " | evaluate\n", " | \n", " | Examples\n", " | --------\n", " | Make predictions about which topic each document belongs to.\n", " | \n", " | >>> docs = turicreate.SArray('https://static.turi.com/datasets/nips-text')\n", " | >>> m = turicreate.topic_model.create(docs)\n", " | >>> pred = m.predict(docs)\n", " | \n", " | If one is interested in the probability of each topic\n", " | \n", " | >>> pred = m.predict(docs, output_type='probability')\n", " | \n", " | Notes\n", " | -----\n", " | For each unique word w in a document d, we sample an assignment to\n", " | topic k with probability proportional to\n", " | \n", " | .. math::\n", " | p(z_{dw} = k) \\propto (n_{d,k} + \\alpha) * \\Phi_{w,k}\n", " | \n", " | where\n", " | \n", " | - :math:`W` is the size of the vocabulary,\n", " | - :math:`n_{d,k}` is the number of other times we have assigned a word in\n", " | document to d to topic :math:`k`,\n", " | - :math:`\\Phi_{w,k}` is the probability under the model of choosing word\n", " | :math:`w` given the word is of topic :math:`k`. 
This is the matrix\n", " | returned by calling `m['topics']`.\n", " | \n", " | This represents a collapsed Gibbs sampler for the document assignments\n", " | while we keep the topics learned during training fixed.\n", " | This process is done in parallel across all documents, five times per\n", " | document.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from turicreate.toolkits._model.Model:\n", " | \n", " | save(self, location)\n", " | Save the model. The model is saved as a directory which can then be\n", " | loaded using the :py:func:`~turicreate.load_model` method.\n", " | \n", " | Parameters\n", " | ----------\n", " | location : string\n", " | Target destination for the model. Can be a local path or remote URL.\n", " | \n", " | See Also\n", " | ----------\n", " | turicreate.load_model\n", " | \n", " | Examples\n", " | ----------\n", " | >>> model.save('my_model_file')\n", " | >>> loaded_model = turicreate.load_model('my_model_file')\n", " | \n", " | summary(self, output=None)\n", " | Print a summary of the model. The summary includes a description of\n", " | training data, options, hyper-parameters, and statistics measured\n", " | during model creation.\n", " | \n", " | Parameters\n", " | ----------\n", " | output : str, None\n", " | The type of summary to return.\n", " | \n", " | - None or 'stdout' : print directly to stdout.\n", " | \n", " | - 'str' : string of summary\n", " | \n", " | - 'dict' : a dict with 'sections' and 'section_titles' ordered\n", " | lists. 
The entries in the 'sections' list are tuples of the form\n", " | ('label', 'value').\n", " | \n", " | Examples\n", " | --------\n", " | >>> m.summary()\n", " | \n", " | ----------------------------------------------------------------------\n", " | Methods inherited from turicreate.toolkits._model.ExposeAttributesFromProxy:\n", " | \n", " | __dir__(self)\n", " | Combine the results of dir from the current class with the results of\n", " | list_fields().\n", " | \n", " | __getattribute__(self, attr)\n", " | Use the internal proxy object for obtaining list_fields.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from turicreate.toolkits._model.ExposeAttributesFromProxy:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data and other attributes inherited from turicreate.toolkits._model.ExposeAttributesFromProxy:\n", " | \n", " | __proxy__ = None\n", "\n" ] } ], "source": [ "help(m)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:16:53.661546Z", "start_time": "2019-06-14T15:16:53.495319Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['people', 'texas', 'tropical', 'florida', 'storm']\n", "['school', 'founded', 'wilson', 'harvard', 'including']\n", "['members', 'association', 'society', 'member', 'group']\n", "['summer', 'date', 'julian', 'years', 'george']\n", "['war', 'general', 'battle', 'german', 'army']\n", "['california', 'angeles', 'san', 'los', 'santa']\n", "['miller', 'morris', '2005', '1999', 'time']\n", "['game', 'season', 'team', 'points', 'games']\n", "['evans', 'years', 'time', 'green', 'lebanon']\n", "['und', 'der', 'die', 'van', 'von']\n", 
"['wear', 'worn', 'made', 'green', 'top']\n", "['2008', 'year', '2007', '2009', '2010']\n", "['oregon', 'magazine', 'year', 'portland', 'beer']\n", "['network', 'news', 'radio', 'show', 'station']\n", "['bridge', 'north', 'route', 'west', 'road']\n", "['part', 'born', 'na', 'agent', 'valid']\n", "['years', 'family', 'father', 'died', 'time']\n", "['united', 'population', 'people', 'american', 'states']\n", "['work', 'magazine', 'published', 'book', 'books']\n", "['collection', 'museum', 'arts', 'art', 'work']\n", "['white', 'black', 'red', 'flag', 'blue']\n", "['aircraft', 'design', 'space', 'project', 'flight']\n", "['trenn', 'phillip', 'nitra', 'koice', 'preov']\n", "['engine', 'cars', 'car', 'race', 'model']\n", "['london', 'duck', 'journal', 'philadelphia', 'york']\n", "['english', 'word', 'language', 'languages', 'words']\n", "['dna', 'cell', 'cells', 'enzyme', 'protein']\n", "['2009', 'calendar', 'year', 'link', 'list']\n", "['court', 'act', 'state', 'states', 'law']\n", "['station', 'services', 'railway', 'line', 'service']\n", "['cemetery', 'university', 'mexico', 'texas', 'corpus']\n", "['film', 'worked', 'festival', 'award', 'awards']\n", "['chinese', 'emperor', 'bc', 'king', 'empire']\n", "['islands', 'region', 'ships', 'island', 'ship']\n", "['water', 'park', 'lake', 'area', 'river']\n", "['2007', '2009', 'december', '2010', '2008']\n", "['match', 'world', 'title', 'won', 'championship']\n", "['msn', 'house', '2010', 'skating', 'school']\n", "['school', 'westfield', 'easy', 'british', '2003']\n", "['player', 'players', 'game', 'games', 'series']\n", "['team', 'league', 'season', 'club', 'football']\n", "['local', 'list', 'found', 'called', '2010']\n", "['plotdata', 'received', 'world', 'part', 'usa']\n", "['income', 'population', '18', 'years', 'age']\n", "['youth', 'ross', 'griffin', 'joined', 'williams']\n", "['south', 'government', 'international', 'countries', 'world']\n", "['national', 'private', 'back', 'chicago', 'baldwin']\n", "['cj', 'hill', 
'morgan', 'davis', 'child']\n", "['set', '1', 'model', 'theory', 'number']\n", "['social', 'people', 'time', 'form', 'work']\n", "['french', 'de', 'spanish', 'france', 'la']\n", "['area', 'city', 'building', 'town', 'built']\n", "['food', 'popular', 'made', 'called', 'wine']\n", "['canadian', 'british', 'ontario', 'canada', 'toronto']\n", "['india', 'temple', 'khan', 'pakistan', 'indian']\n", "['babylon', '2004', 'area', 'class', '5']\n", "['water', 'process', 'nuclear', 'energy', 'form']\n", "['international', '2010', 'airport', 'airlines', 'airline']\n", "['15', 'park', 'head', 'stone', '2003']\n", "['soviet', 'russia', 'poland', 'russian', 'polish']\n", "['john', 'royal', 'william', 'sir', 'king']\n", "['virginia', 'county', 'served', 'state', 'carolina']\n", "['election', 'government', 'president', 'state', 'party']\n", "['school', 'county', 'national', 'districts', 'district']\n", "['group', 'war', 'force', 'air', 'aircraft']\n", "['world', 'university', '2010', 'chams', 'number']\n", "['energy', 'power', 'system', 'time', 'light']\n", "['bill', 'professor', 'degree', 'served', 'university']\n", "['joseph', 'head', 'smiths', 'smith', 'school']\n", "['birds', 'family', 'small', 'species', 'found']\n", "['fat', '2006', 'named', 'work', 'vincent']\n", "['business', 'stores', 'oil', 'company', 'mine']\n", "['2000', 'kitty', 'camp', 'moved', 'obesity']\n", "['1994', 'christmas', 'world', 'national', 'minor']\n", "['bishop', 'saint', 'church', 'st', 'century']\n", "['band', 'released', 'album', 'music', 'song']\n", "['made', 'zealand', 'australian', 'australia', 'south']\n", "['system', 'software', 'data', 'systems', 'information']\n", "['full', 'fair', 'philippines', 'manila', 'philippine']\n", "['began', 'jones', 'minnesota', 'anderson', 'td']\n", "['county', 'ireland', 'medal', 'irish', 'dublin']\n", "['school', 'held', 'village', 'local', 'community']\n", "['family', 'city', 'prison', 'police', 'york']\n", "['al', 'bin', '2001', 'joined', 'world']\n", 
"['college', 'university', 'education', 'students', 'school']\n", "['including', 'riders', 'ride', 'time', 'part']\n", "['year', 'yugoslavia', 'serbian', 'albanian', 'serbia']\n", "['shells', 'uk', 'school', 'austin', 'shell']\n", "['world', 'birmingham', 'york', 'booth', 'free']\n", "['show', 'role', 'episode', 'series', 'film']\n", "['dr', 'care', 'health', 'medical', 'hospital']\n", "['include', '2007', 'norway', 'norwegian', 'city']\n", "['religious', 'christian', 'church', 'god', 'jesus']\n", "['nec', 'mori', 'miss', 'post', 'time']\n", "['festival', 'theatre', 'dance', 'music', 'opera']\n", "['business', 'company', 'financial', 'bank', 'million']\n", "['man', 'tells', 'life', 'back', 'time']\n", "['years', '2007', 'post', 'croydon', 'day']\n", "['las', 'scott', 'vegas', 'ranch', 'hotel']\n", "['american', 'horses', 'horse', 'breed', 'stakes']\n" ] } ], "source": [ "def print_topics(m):\n", " topics = m.get_topics(num_words=5)\n", " topics = topics.unstack(['word','score'], new_column_name='topic_words')['topic_words']\n", " topics = topics.apply(lambda x: x.keys())\n", " for topic in topics:\n", " print(topic)\n", "print_topics(m)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:17:00.220909Z", "start_time": "2019-06-14T15:17:00.207180Z" }, "slideshow": { "slide_type": "slide" } }, "source": [ "> pred = m.predict(another_data) \n", "\n", "> pred = m.predict(another_data, output_type='probabilities')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Initializing from other models" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:19:58.593941Z", "start_time": "2019-06-14T15:19:58.588723Z" } }, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " 
'__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__proxy__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " '_get',\n", " '_get_queryable_methods',\n", " '_get_summary_struct',\n", " '_list_fields',\n", " '_name',\n", " '_native_name',\n", " '_training_stats',\n", " 'alpha',\n", " 'beta',\n", " 'evaluate',\n", " 'get_topics',\n", " 'num_burnin',\n", " 'num_iterations',\n", " 'num_topics',\n", " 'predict',\n", " 'print_interval',\n", " 'save',\n", " 'summary',\n", " 'topics',\n", " 'training_iterations',\n", " 'training_time',\n", " 'validation_time',\n", " 'verbose',\n", " 'vocabulary']" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(m)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:20:38.229839Z", "start_time": "2019-06-14T15:20:38.200206Z" } }, "outputs": [ { "data": { "text/plain": [ "dtype: str\n", "Rows: 108205\n", "['book', 'political', 'time', 'readers', 'individual', 'appeared', 'peikoff', 'concepts', '100', 'picture', 'america', 'reviewers', 'philosopher', 'screenwriter', 'work', 'traditional', 'purged', 'ayn', 'york', 'articles', 'pharmacy', 'scholarship', '2001', 'designed', 'permission', 'taking', 'historian', 'library', 'russian', 'extent', 'childhood', 'respondents', 'language', 'alisa', 'writing', 'union', 'libertarian', 'positive', 'jennifer', 'notes', 'line', 'burns', 'state', 'crimea', 'sciabarra', 'based', 'rights', 'life', 'shes', 'argument', 'nonfiction', 'rejection', 'allowed', 'reason', 'culture', 'closest', 'shrugged', 'free', 'january', 'success', 'living', 'robert', 'literary', 'animated', 'american', 'reviews', 'paterson', 'people', 'percent', 'house', 'academic', 'sacrificing', 'referred', 'broadway', 
'fountainhead', 'lectures', 'john', 'inspiration', 'conditions', 'lifetime', 'written', '1938', 'established', 'barbara', 'twelve', 'modern', 'final', 'intellectual', 'audience', 'stated', 'selfinterest', 'achievement', 'century', 'relationship', 'allowing', 'delivering', 'writers', 'influence', 'branden', 'film', ... ]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.vocabulary" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:21:11.640706Z", "start_time": "2019-06-14T15:21:11.272036Z" } }, "outputs": [ { "data": { "text/plain": [ "108205" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.topics" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:21:44.842823Z", "start_time": "2019-06-14T15:21:39.593564Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Initializing from provided topics and vocabulary.
" ], "text/plain": [ "Initializing from provided topics and vocabulary." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Learning a topic model
" ], "text/plain": [ "Learning a topic model" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of documents 72269
" ], "text/plain": [ " Number of documents 72269" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Vocabulary size 171005
" ], "text/plain": [ " Vocabulary size 171005" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Running collapsed Gibbs sampling
" ], "text/plain": [ " Running collapsed Gibbs sampling" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |
" ], "text/plain": [ "| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 10 | 2.99s | 7.251e+06 | 0 |
" ], "text/plain": [ "| 10 | 2.99s | 7.251e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m2 = tc.topic_model.create(docs,\n", " num_topics=100,\n", " initial_topics=m.topics)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Seeding the model with prior knowledge" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:21:53.743973Z", "start_time": "2019-06-14T15:21:53.739544Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "associations = tc.SFrame()\n", "associations['word'] = ['recognition']\n", "associations['topic'] = [0]" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:22:10.912220Z", "start_time": "2019-06-14T15:22:00.395606Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Learning a topic model
" ], "text/plain": [ "Learning a topic model" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of documents 72269
" ], "text/plain": [ " Number of documents 72269" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Vocabulary size 171005
" ], "text/plain": [ " Vocabulary size 171005" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Running collapsed Gibbs sampling
" ], "text/plain": [ " Running collapsed Gibbs sampling" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |
" ], "text/plain": [ "| Iteration | Elapsed Time | Tokens/Second | Est. Perplexity |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 10 | 2.10s | 1.04058e+07 | 0 |
" ], "text/plain": [ "| 10 | 2.10s | 1.04058e+07 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 20 | 3.97s | 1.09325e+07 | 0 |
" ], "text/plain": [ "| 20 | 3.97s | 1.09325e+07 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 30 | 5.79s | 9.42067e+06 | 0 |
" ], "text/plain": [ "| 30 | 5.79s | 9.42067e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 40 | 7.66s | 1.0637e+07 | 0 |
" ], "text/plain": [ "| 40 | 7.66s | 1.0637e+07 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 50 | 9.51s | 9.86708e+06 | 0 |
" ], "text/plain": [ "| 50 | 9.51s | 9.86708e+06 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+---------------+----------------+-----------------+
" ], "text/plain": [ "+-----------+---------------+----------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m2 = tc.topic_model.create(docs,\n", " num_topics=20,\n", " num_iterations=50,\n", " associations=associations, \n", " verbose=False)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:22:26.128839Z", "start_time": "2019-06-14T15:22:26.065521Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
topic word score
0 line 0.0109206038501584
0 german 0.010542910113799127
0 de 0.010148794910641626
0 railway 0.010079824750089063
0 english 0.009242329943379372
0 chinese 0.008900763433976205
0 language 0.008654441432002766
0 china 0.008490226764020474
0 large 0.008004151346792889
0 russian 0.007705280651065117
\n", "[200 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\ttopic\tint\n", "\tword\tstr\n", "\tscore\tfloat\n", "\n", "Rows: 200\n", "\n", "Data:\n", "+-------+----------+----------------------+\n", "| topic | word | score |\n", "+-------+----------+----------------------+\n", "| 0 | line | 0.0109206038501584 |\n", "| 0 | german | 0.010542910113799127 |\n", "| 0 | de | 0.010148794910641626 |\n", "| 0 | railway | 0.010079824750089063 |\n", "| 0 | english | 0.009242329943379372 |\n", "| 0 | chinese | 0.008900763433976205 |\n", "| 0 | language | 0.008654441432002766 |\n", "| 0 | china | 0.008490226764020474 |\n", "| 0 | large | 0.008004151346792889 |\n", "| 0 | russian | 0.007705280651065117 |\n", "+-------+----------+----------------------+\n", "[200 rows x 3 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m2.get_topics(num_words=10)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T15:22:27.872285Z", "start_time": "2019-06-14T15:22:27.805060Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['german', 'line', 'english', 'de', 'railway']\n", "['son', 'time', 'book', 'life', 'john']\n", "['role', 'episode', 'show', 'film', 'series']\n", "['york', 'station', 'park', 'de', 'company']\n", "['information', 'law', 'time', 'work', 'social']\n", "['west', 'county', 'north', 'city', 'east']\n", "['years', '18', 'population', 'town', 'age']\n", "['games', 'team', 'game', 'won', 'season']\n", "['company', 'aircraft', 'air', 'force', 'division']\n", "['india', 'art', 'century', 'roman', 'church']\n", "['government', 'students', 'national', 'state', 'party']\n", "['schools', 'college', 'university', 'school', 'high']\n", "['series', 'world', 'back', 'king', 'time']\n", "['services', 
'system', 'service', 'million', 'engine']\n", "['area', 'built', 'river', 'road', 'region']\n", "['systems', 'set', 'number', 'system', 'data']\n", "['water', 'small', 'species', 'food', 'found']\n", "['league', 'year', 'club', 'song', 'time']\n", "['released', 'music', 'songs', 'album', 'band']\n", "['army', 'united', 'states', 'court', 'war']\n" ] } ], "source": [ "print_topics(m2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Further Reading\n", "\n", "~~https://dato.com/learn/userguide/text/topic-models.html~~\n", "\n", "https://apple.github.io/turicreate/docs/userguide/text/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }