{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TF-IDF demo"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import h2o\n",
"\n",
"h2o.init()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data sources:\n",
"\n",
"* https://github.com/h2oai/h2o-3\n",
"* https://en.wikipedia.org/wiki/Ice_hockey\n",
"* https://en.wikipedia.org/wiki/Antibody"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parse progress: |█████████████████████████████████████████████████████████| 100%\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Document</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.</td></tr>\n",
"<tr><td>1</td><td>Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. The sport is known to be fast-paced and physical.</td></tr>\n",
"<tr><td>2</td><td>An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.</td></tr>\n",
"</tbody>\n",
"</table>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import OrderedDict\n",
"\n",
"documents = [\n",
"    'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',\n",
"    'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\\'s net to score goals. The sport is known to be fast-paced and physical.',\n",
"    'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'\n",
"]\n",
"doc_ids = list(range(len(documents)))\n",
"\n",
"input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),\n",
"                           column_types=['numeric', 'string'])\n",
"input_frame.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF with pre-processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from h2o.information_retrieval.tf_idf import tf_idf\n",
"\n",
"# Let tf_idf() tokenize the raw documents itself (preprocess=True) and keep the original letter case.\n",
"tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', preprocess=True, case_sensitive=True)\n",
"tf_idf_out.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"\n",
"VALUES_CNT_TO_SHOW = 3\n",
"\n",
"def tf_idf_output_summary(tf_idf_out):\n",
"    # Show the highest and lowest TF-IDF rows for every document listed in doc_ids.\n",
"    for doc_id in doc_ids:\n",
"        # Ascending sort: the lowest scores come first, the highest last.\n",
"        sorted_doc_tf_idfs = tf_idf_out[tf_idf_out['DocID'] == doc_id].sort(by='TF-IDF')\n",
"        print('The highest TF-IDF values for document ' + str(doc_id) + ':')\n",
"        display(sorted_doc_tf_idfs.tail(VALUES_CNT_TO_SHOW))\n",
"        print('The lowest TF-IDF values for document ' + str(doc_id) + ':')\n",
"        display(sorted_doc_tf_idfs.head(VALUES_CNT_TO_SHOW))\n",
"        print('\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tf_idf_output_summary(tf_idf_out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF without built-in pre-processing (pre-tokenized input)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parse progress: |█████████████████████████████████████████████████████████| 100%\n"
]
},
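{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Document</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>H2O</td></tr>\n",
"<tr><td>0</td><td>is</td></tr>\n",
"<tr><td>0</td><td>an</td></tr>\n",
"<tr><td>0</td><td>in-memory</td></tr>\n",
"<tr><td>0</td><td>platform</td></tr>\n",
"<tr><td>0</td><td>for</td></tr>\n",
"<tr><td>0</td><td>distributed,</td></tr>\n",
"<tr><td>0</td><td>scalable</td></tr>\n",
"<tr><td>0</td><td>machine</td></tr>\n",
"<tr><td>0</td><td>learning.</td></tr>\n",
"</tbody>\n",
"</table>"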
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Tokenize by hand: one (DocID, word) row per whitespace-separated token.\n",
"preprocessed_data = [(doc_id, word) for doc_id, document in enumerate(documents) for word in document.split()]\n",
"\n",
"preprocessed_input_frame = h2o.H2OFrame(preprocessed_data,\n",
"                                        column_names=['DocID', 'Document'],\n",
"                                        column_types=['numeric', 'string'])\n",
"preprocessed_input_frame.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
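{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2</td><td>(Ab),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>2</td><td>(Ig),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>2</td><td>An</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>Flow</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>H2O</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>0</td><td>Hadoop</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>1</td><td>Ice</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>JSON</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>Java,</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>Python,</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"</tbody>\n",
"</table>"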
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf_idf_out = tf_idf(preprocessed_input_frame, 'DocID', 'Document', preprocess=False)\n",
"tf_idf_out.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The highest TF-IDF values for document 0:\n"
]
},
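{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>works</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>H2O</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>0</td><td>like</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"</tbody>\n",
"</table>"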
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The lowest TF-IDF values for document 0:\n"
]
},
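{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>and</td><td>3</td><td>0</td><td>0</td></tr>\n",
"<tr><td>0</td><td>is</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>0</td><td>an</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>\n",
"</tbody>\n",
"</table>"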
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"The highest TF-IDF values for document 1:\n"
]
},
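{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1</td><td>in</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>1</td><td>sport</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>1</td><td>their</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"</tbody>\n",
"</table>"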
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The lowest TF-IDF values for document 1:\n"
]
},
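{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>1</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>\n",
"<tr><td>1</td><td>known</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>\n",
"</tbody>\n",
"</table>"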
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"The highest TF-IDF values for document 2:\n"
]
},
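{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2</td><td>viruses.</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>2</td><td>as</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>2</td><td>by</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"</tbody>\n",
"</table>"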
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The lowest TF-IDF values for document 2:\n"
]
},
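{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>2</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>\n",
"<tr><td>2</td><td>a</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>\n",
"</tbody>\n",
"</table>"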
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n"
]
}
],
"source": [
"tf_idf_output_summary(tf_idf_out)"
]
},
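{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values in the summary above are consistent with a smoothed inverse document frequency. The formula below is inferred from the recorded outputs rather than quoted from the H2O documentation, so treat it as an assumption. The symbols are notation introduced here: $N$ is the number of documents, $\\mathrm{df}(t)$ the number of documents containing term $t$, and $\\mathrm{tf}(t, d)$ the count of $t$ in document $d$.\n",
"\n",
"$$\\mathrm{IDF}(t) = \\ln\\frac{N + 1}{\\mathrm{df}(t) + 1}, \\qquad \\text{TF-IDF}(t, d) = \\mathrm{tf}(t, d) \\cdot \\mathrm{IDF}(t)$$\n",
"\n",
"With $N = 3$:\n",
"\n",
"$$\\begin{aligned}\n",
"\\text{and} &: \\mathrm{df} = 3, & \\mathrm{IDF} &= \\ln\\tfrac{4}{4} = 0 \\\\\n",
"\\text{an} &: \\mathrm{df} = 2, & \\mathrm{IDF} &= \\ln\\tfrac{4}{3} \\approx 0.287682 \\\\\n",
"\\text{H2O} &: \\mathrm{df} = 1, & \\mathrm{IDF} &= \\ln\\tfrac{4}{2} \\approx 0.693147\n",
"\\end{aligned}$$\n",
"\n",
"`H2O` occurs twice in document 0, so its TF-IDF there is $2 \\cdot 0.693147 \\approx 1.38629$, while `and` scores 0 in every document because it appears in all three."
]
},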
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case-insensitive TF-IDF"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parse progress: |█████████████████████████████████████████████████████████| 100%\n"
]
}
],
"source": [
"input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),\n",
"                           column_types=['numeric', 'string'])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
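{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2</td><td>(ab),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>2</td><td>(ig),</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>1</td><td>a</td><td>3</td><td>0.287682</td><td>0.863046</td></tr>\n",
"<tr><td>2</td><td>a</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>\n",
"<tr><td>2</td><td>also</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>an</td><td>1</td><td>0.287682</td><td>0.287682</td></tr>\n",
"<tr><td>2</td><td>an</td><td>2</td><td>0.287682</td><td>0.575364</td></tr>\n",
"<tr><td>0</td><td>and</td><td>3</td><td>0</td><td>0</td></tr>\n",
"<tr><td>1</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>2</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>"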
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', case_sensitive=False)\n",
"tf_idf_out.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The highest TF-IDF values for document 0:\n"
]
},
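{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>works</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>0</td><td>h2o</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>0</td><td>like</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"</tbody>\n",
"</table>"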
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The lowest TF-IDF values for document 0:\n"
]
},
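{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>and</td><td>3</td><td>0</td><td>0</td></tr>\n",
"<tr><td>0</td><td>is</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>0</td><td>the</td><td>1</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>"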
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"The highest TF-IDF values for document 1:\n"
]
},
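{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1</td><td>in</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>1</td><td>sport</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>1</td><td>their</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"</tbody>\n",
"</table>"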
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The lowest TF-IDF values for document 1:\n"
]
},
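{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>1</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>\n",
"<tr><td>1</td><td>the</td><td>1</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>"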
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"The highest TF-IDF values for document 2:\n"
]
},
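{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2</td><td>y-shaped</td><td>1</td><td>0.693147</td><td>0.693147</td></tr>\n",
"<tr><td>2</td><td>as</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"<tr><td>2</td><td>by</td><td>2</td><td>0.693147</td><td>1.38629</td></tr>\n",
"</tbody>\n",
"</table>"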
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The lowest TF-IDF values for document 2:\n"
]
},
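{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>DocID</th><th>Word</th><th>TF</th><th>IDF</th><th>TF-IDF</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2</td><td>and</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><td>2</td><td>is</td><td>2</td><td>0</td><td>0</td></tr>\n",
"<tr><td>2</td><td>the</td><td>1</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>"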
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n"
]
}
],
"source": [
"tf_idf_output_summary(tf_idf_out)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3rc1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}