{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TF-IDF demo" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import h2o\n", "\n", "h2o.init()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data sources:\n", "\n", "* https://github.com/h2oai/h2o-3\n", "* https://en.wikipedia.org/wiki/Ice_hockey\n", "* https://en.wikipedia.org/wiki/Antibody" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |█████████████████████████████████████████████████████████| 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>
<thead><tr><th>DocID</th><th>Document</th></tr></thead>
<tbody>
<tr><td>0</td><td>H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.</td></tr>
<tr><td>1</td><td>Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. The sport is known to be fast-paced and physical.</td></tr>
<tr><td>2</td><td>An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.</td></tr>
</tbody>
</table>
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import OrderedDict\n", "\n", "documents = [\n", " 'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',\n", " 'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\\'s net to score goals. The sport is known to be fast-paced and physical.',\n", " 'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'\n", "]\n", "doc_ids = list(range(len(documents)))\n", "\n", "input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),\n", " column_types=['numeric', 'string'])\n", "input_frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TF-IDF with pre-processing" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>1</td><td>ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from h2o.information_retrieval.tf_idf import tf_idf\n", "\n", "tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', preprocess=False, case_sensitive=False)\n", "tf_idf_out.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display\n", "\n", "VALUES_CNT_TO_SHOW = 3\n", "\n", "def tf_idf_output_summary(tf_idf_out):\n", "    # Show the highest and lowest TF-IDF rows for each document.\n", "    for doc_id in doc_ids:\n", "        # Ascending sort: head() holds the lowest values, tail() the highest.\n", "        sorted_doc_tf_idfs = tf_idf_out[tf_idf_out['DocID'] == doc_id].sort(by='TF-IDF')\n", "        print('The highest TF-IDF values for document ' + str(doc_id) + ':')\n", "        display(sorted_doc_tf_idfs.tail(VALUES_CNT_TO_SHOW))\n", "        print('The lowest TF-IDF values for document ' + str(doc_id) + ':')\n", "        display(sorted_doc_tf_idfs.head(VALUES_CNT_TO_SHOW))\n", "        print('\\n')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The highest TF-IDF values for document 0:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
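<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>0</td><td>h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>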
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 0:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
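<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>0</td><td>h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>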
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The highest TF-IDF values for document 1:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
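<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>1</td><td>ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>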
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 1:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
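<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>1</td><td>ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>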
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The highest TF-IDF values for document 2:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
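<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>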
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 2:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
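<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>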
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "tf_idf_output_summary(tf_idf_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TF-IDF without pre-processing" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |█████████████████████████████████████████████████████████| 100%\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
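<table>
<thead><tr><th>DocID</th><th>Document</th></tr></thead>
<tbody>
<tr><td>0</td><td>H2O</td></tr>
<tr><td>0</td><td>is</td></tr>
<tr><td>0</td><td>an</td></tr>
<tr><td>0</td><td>in-memory</td></tr>
<tr><td>0</td><td>platform</td></tr>
<tr><td>0</td><td>for</td></tr>
<tr><td>0</td><td>distributed,</td></tr>
<tr><td>0</td><td>scalable</td></tr>
<tr><td>0</td><td>machine</td></tr>
<tr><td>0</td><td>learning.</td></tr>
</tbody>
</table>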
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessed_data = [(doc_id, word) for doc_id, document in enumerate(documents) for word in document.split()]\n", "\n", "preprocessed_input_frame = h2o.H2OFrame(preprocessed_data,\n", " column_names=['DocID', 'Document'],\n", " column_types=['numeric', 'string'])\n", "preprocessed_input_frame.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
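<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>(Ab),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>2</td><td>(Ig),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>2</td><td>An</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>Flow</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>H2O</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>0</td><td>Hadoop</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>1</td><td>Ice</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>JSON</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>Java,</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>Python,</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
</tbody>
</table>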
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_idf_out = tf_idf(preprocessed_input_frame, 'DocID', 'Document', preprocess=False)\n", "tf_idf_out.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The highest TF-IDF values for document 0:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
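<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>0</td><td>works</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>H2O</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>0</td><td>like</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
</tbody>
</table>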
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 0:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
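<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>0</td><td>and</td><td>3</td><td>0</td><td>0</td></tr>
<tr><td>0</td><td>is</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>0</td><td>an</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>
</tbody>
</table>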
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The highest TF-IDF values for document 1:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
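<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>1</td><td>in</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>1</td><td>sport</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>1</td><td>their</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
</tbody>
</table>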
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 1:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
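<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>1</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>1</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>
<tr><td>1</td><td>known</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>
</tbody>
</table>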
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The highest TF-IDF values for document 2:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
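<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>viruses.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>2</td><td>as</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>2</td><td>by</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
</tbody>
</table>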
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 2:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
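<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>2</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>
<tr><td>2</td><td>a</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>
</tbody>
</table>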
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "tf_idf_output_summary(tf_idf_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case insensitive TF-IDF" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |█████████████████████████████████████████████████████████| 100%\n" ] } ], "source": [ "input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),\n", " column_types=['numeric', 'string'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
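<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>(ab),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>2</td><td>(ig),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>1</td><td>a</td><td>3</td><td>0.287682</td><td>0.863046</td></tr>
<tr><td>2</td><td>a</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>
<tr><td>2</td><td>also</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>an</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>
<tr><td>2</td><td>an</td><td>2</td><td>0.287682</td><td>0.575364</td></tr>
<tr><td>0</td><td>and</td><td>3</td><td>0</td><td>0</td></tr>
<tr><td>1</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>2</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>
</tbody>
</table>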
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', case_sensitive=False)\n", "tf_idf_out.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The highest TF-IDF values for document 0:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
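<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>0</td><td>works</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>0</td><td>h2o</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>0</td><td>like</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
</tbody>
</table>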
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 0:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
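<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>0</td><td>and</td><td>3</td><td>0</td><td>0</td></tr>
<tr><td>0</td><td>is</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>0</td><td>the</td><td>1</td><td>0</td><td>0</td></tr>
</tbody>
</table>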
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The highest TF-IDF values for document 1:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
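<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>1</td><td>in</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>1</td><td>sport</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>1</td><td>their</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
</tbody>
</table>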
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 1:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
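<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>1</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>1</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>
<tr><td>1</td><td>the</td><td>1</td><td>0</td><td>0</td></tr>
</tbody>
</table>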
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The highest TF-IDF values for document 2:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
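<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>y-shaped</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>
<tr><td>2</td><td>as</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
<tr><td>2</td><td>by</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>
</tbody>
</table>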
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The lowest TF-IDF values for document 2:\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
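<table>
<thead><tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr></thead>
<tbody>
<tr><td>2</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>
<tr><td>2</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>
<tr><td>2</td><td>the</td><td>1</td><td>0</td><td>0</td></tr>
</tbody>
</table>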
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "tf_idf_output_summary(tf_idf_out)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3rc1" } }, "nbformat": 4, "nbformat_minor": 4 }