{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different functions to tokenize text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.text import tokenizer_[type]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions for tokenizing text in natural language processing tasks, such as building a bag-of-words model for text classification." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "- -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - Extract Emoticons" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from mlxtend.text import tokenizer_emoticons" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[':)', ':(', ':-)']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer_emoticons('This :) is :( a test :-)!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2 - Extract Words and Emoticons" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from mlxtend.text import tokenizer_words_and_emoticons" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['this', 'is', 'a', 'test', ':)', ':(', ':-)']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer_words_and_emoticons('This :) is :( a test :-)!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## tokenizer_emoticons\n", "\n", 
"*tokenizer_emoticons(text)*\n", "\n", "Return emoticons from text\n", "\n", "**Examples**\n", "\n", "\n", " >>> tokenizer_emoticons('This :) is :( a test :-)!')\n", "[':)', ':(', ':-)']\n", "\n", "For usage examples, please see\n", "[http://rasbt.github.io/mlxtend/user_guide/text/tokenizer_emoticons/](http://rasbt.github.io/mlxtend/user_guide/text/tokenizer_emoticons/)\n", "\n", "\n", "\n", "\n", "*tokenizer_words_and_emoticons(text)*\n", "\n", "Convert text to lowercase words and emoticons.\n", "\n", "**Examples**\n", "\n", "\n", " >>> tokenizer_words_and_emoticons('This :) is :( a test :-)!')\n", "['this', 'is', 'a', 'test', ':)', ':(', ':-)']\n", "\n", "For more usage examples, please see\n", "[http://rasbt.github.io/mlxtend/user_guide/text/tokenizer_words_and_emoticons/](http://rasbt.github.io/mlxtend/user_guide/text/tokenizer_words_and_emoticons/)\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.text/tokenizer_emoticons.md', 'r') as f:\n", " s = f.read() + '\\n\\n'\n", "\n", "with open('../../api_modules/mlxtend.text/tokenizer_words_and_emoticons.md', 'r') as f:\n", " s2 = f.readlines()\n", " s += ''.join(s2[1:])\n", "print(s)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }