{ "cells": [ { "cell_type": "markdown", "id": "22d93371-d72a-498a-b25e-affb11659f0d", "metadata": {}, "source": [ "## Tutorial : Exploration of full-text indexing\n", "We'll read in some files, then index the \"important\" words in their contents, and finally search for some of those words\n", "\n", "For more info and background info, please see: \n", " https://julianspolymathexplorations.blogspot.com/2023/08/full-text-search-neo4j-indexing.html\n", "\n", "#### CAUTION: running this tutorial will clear out the database!\n", "\n", "--- \n", "\n", "## PREPARATIONS: to run this tutorial, create a text file named `test1.txt` and one named `test2.htm`\n", "and place them on a local folder of your choice (make a note of its name!)\n", "\n", "---\n", "**Contents of test1.txt:** \n", "hello to the world !!! ? Welcome to learning how she cooks with potatoes...\n", "\n", "**Contents of test2.htm:** \n", "\n", "`

Let's make a much better world, shall we? What do you say to that enticing prospect?

`\n", "\n", "`

Starting on a small scale – we’ll learn cooking a potato well.

` \n", "\n", "--- \n", "Also, change the value of the variable `MY_FOLDER` , below, to the location on your computer where you stored the above folders,\n", "and use your database login credentials." ] }, { "cell_type": "code", "execution_count": 1, "id": "e00686a6-c019-414e-92be-a44d32cfe138", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import getpass\n", "\n", "from neoaccess import NeoAccess\n", "\n", "from brainannex import NeoSchema\n", "from brainannex.full_text_indexing import FullTextIndexing\n", "from brainannex.media_manager import MediaManager\n", "# In case of problems, try a sys.path.append(directory) , where directory is your project's root directory" ] }, { "cell_type": "code", "execution_count": 2, "id": "16304be0-1589-44de-8988-ff5dfa2f668b", "metadata": {}, "outputs": [], "source": [ "MY_FOLDER = \"D:/tmp/tests for tutorials/\" # ****** IMPORTANT: CHANGE AS NEEDED on your system; use forward slashes on Windows, too! ******" ] }, { "cell_type": "code", "execution_count": null, "id": "c0227343-d154-4ee7-9093-988dda462158", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "be1fb174-5bb9-4dee-a920-0ac5dcfb74a5", "metadata": {}, "source": [ "# Connect to the database\n", "#### You can use a free local install of the Neo4j database, or a remote one on a virtual machine under your control, or a hosted solution, or simply the FREE \"Sandbox\" : [instructions here](https://julianspolymathexplorations.blogspot.com/2023/03/neo4j-sandbox-tutorial-cypher.html)\n", "NOTE: This tutorial is tested on **version 4.4** of the Neo4j database, but will probably also work on the new version 5 (NOT guaranteed, however...)" ] }, { "cell_type": "code", "execution_count": 3, "id": "a14f375c-10d2-4560-afc9-d00617a21973", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Save your credentials here - or use the prompts given by the next cell\n", "host = \"\" # EXAMPLES: bolt://123.456.789.012 OR neo4j://localhost \n", " # (CAUTION: do NOT include the port number!)\n", "password = \"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "f742deed-9b21-4129-966e-659ce059fa1e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 4, "id": "7247f139-9f06-41d1-98e8-410ff7c9f177", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connection to Neo4j database established.\n" ] } ], "source": [ "db = NeoAccess(host=host,\n", " credentials=(\"neo4j\", password), debug=False) # Notice the debug option being OFF" ] }, { "cell_type": "code", "execution_count": 5, "id": "c96ece03-2b07-4a4d-ad6e-e41ccbf67251", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Version of the Neo4j driver: 4.4.11\n" ] } ], "source": [ "print(\"Version of the Neo4j driver: \", db.version())" ] }, { "cell_type": "code", "execution_count": null, "id": "f896cbac-5678-48f8-8fb9-bb7f7812528e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4ca98da0-f267-4efa-8302-624efbd1a744", "metadata": {}, "source": [ "# Explorations of Indexing" ] }, { "cell_type": "code", "execution_count": 6, "id": "4232fe77-a1ca-45e0-af0d-2d30c43396cc", "metadata": {}, "outputs": [], "source": [ "# CLEAR OUT THE DATABASE\n", "#db.empty_dbase() # UNCOMMENT ***************** WARNING: USE WITH CAUTION!!! ************************" ] }, { "cell_type": "code", "execution_count": 7, "id": "e86c1f5b-490f-4853-8944-f6a7ea5c2703", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Verify that the database is empty\n", "q = \"MATCH (n) RETURN COUNT(n) AS number_nodes\"\n", "\n", "db.query(q, single_cell=\"number_nodes\")" ] }, { "cell_type": "markdown", "id": "8065fd62-1609-454d-ad2e-23b752679f66", "metadata": {}, "source": [ "#### Initialize the indexing" ] }, { "cell_type": "code", "execution_count": 8, "id": "fd329598-dedd-4e53-b1ea-8bce1f540b3c", "metadata": { "tags": [] }, "outputs": [], "source": [ "NeoSchema.set_database(db)\n", "FullTextIndexing.db = db" ] }, { "cell_type": "code", "execution_count": 9, "id": "53e30599-8f89-4099-9051-8bcdcfb0d1b7", "metadata": {}, "outputs": [], "source": [ "FullTextIndexing.initialize_schema()" ] }, { "cell_type": "code", "execution_count": null, "id": "63088d2c-9d9b-45b2-9c4b-b3a42deb909a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "89288a04-bec9-48d9-a9cb-255ab30afa74", "metadata": {}, "source": [ "#### Read in 2 files (stored in the MY_FOLDER location specified above), and index them" ] }, { "cell_type": "code", "execution_count": 10, "id": "57e6f55b-4927-44e2-8d6a-d4536f2449c7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'hello to the world !!! ? Welcome to learning how she cooks with potatoes...'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = \"test1.txt\" # 1st FILE\n", "file_contents = MediaManager.get_from_text_file(path=MY_FOLDER, filename=filename)\n", "file_contents" ] }, { "cell_type": "code", "execution_count": 11, "id": "582287e0-e5ae-4a7a-bede-4fe7331963c7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'cooks', 'learning', 'potatoes', 'welcome', 'world'}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_list = FullTextIndexing.extract_unique_good_words(file_contents)\n", "word_list # Not shown in any particular order" ] }, { "cell_type": "markdown", "id": "c79449ec-85c8-4dfb-b707-7b29cc28c353", "metadata": {}, "source": [ "#### Note that many common words get dropped..." ] }, { "cell_type": "code", "execution_count": 12, "id": "ef6ebaa4-4e1b-4c51-8fbf-ae85722569e6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "43287" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "internal_id = NeoSchema.create_data_node(class_node=\"Content Item\", properties = {\"name\": filename})\n", "internal_id" ] }, { "cell_type": "code", "execution_count": 13, "id": "2dc06ee4-b0e9-451f-9a16-6ad144c0b2ff", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Index the chosen words for this first Content Item\n", "FullTextIndexing.new_indexing(internal_id = internal_id, unique_words = word_list)" ] }, { "cell_type": "code", "execution_count": null, "id": "8401dd5a-c0ba-4105-8950-a85970f28188", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1709c195-2e90-4173-90e7-d157ab6e71d7", "metadata": {}, "source": [ "#### Process the 2nd Content Item" ] }, { "cell_type": "code", "execution_count": 14, "id": "3b456b09-9300-4515-b520-a5c87a1e0f56", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\"

Let's make a much better world, shall we? What do you say to that enticing prospect?

\\n\\n

Starting on a small scale – we’ll learn cooking a potato well.

\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = \"test2.htm\" # 2nd FILE\n", "file_contents = MediaManager.get_from_text_file(path=MY_FOLDER, filename=filename)\n", "file_contents" ] }, { "cell_type": "code", "execution_count": 15, "id": "2a9a76f2-be99-45de-937c-a9e7e4733974", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'cooking', 'enticing', 'learn', 'potato', 'prospect', 'scale', 'world'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_list = FullTextIndexing.extract_unique_good_words(file_contents)\n", "word_list" ] }, { "cell_type": "code", "execution_count": 16, "id": "65f70363-f772-490e-af20-883b0fc0dc9b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "43294" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "internal_id = NeoSchema.create_data_node(class_node=\"Content Item\", properties = {\"name\": filename})\n", "internal_id" ] }, { "cell_type": "code", "execution_count": 17, "id": "5bb78b01-2dd7-4159-9496-49738004225b", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Index the chosen words for this 2nd Content Item\n", "FullTextIndexing.new_indexing(internal_id = internal_id, unique_words = word_list)" ] }, { "cell_type": "markdown", "id": "bb926fc6-724d-47d2-8954-10863d3d636c", "metadata": {}, "source": [ "_Here's what we have created so far (Note: **THE INDEXED WORDS MIGHT VARY, BASED ON THE LATEST LIST OF COMMON WORDS TO DROP**):_" ] }, { "cell_type": "markdown", "id": "dad0793b-6d5f-40c7-b482-b7fc3dddf8cf", "metadata": {}, "source": [ "![Full Text Indexing](https://raw.githubusercontent.com/BrainAnnex/brain-annex/main/docs/tutorial_full_text_indexing.png)" ] }, { "cell_type": "code", "execution_count": null, "id": "2eee3b10-998f-459f-b9f4-6fbb88a2df6d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a50dac74-5c63-4d8f-9f45-03ddaeb1dad1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "37865ecc-7857-4b4e-bc3d-e2c0db71957e", "metadata": {}, "source": [ "# Now, can finally try out some word searches" ] }, { "cell_type": "markdown", "id": "c90cb2d9-5e64-41b4-8ec4-c2c1047b6428", "metadata": {}, "source": [ "#### Using methods provided by the `FullTextIndexing` class:" ] }, { "cell_type": "code", "execution_count": 19, "id": "df65dda6-c2d5-41c5-94e2-f6ca965c31c4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'name': 'test2.htm', 'internal_id': 43294, 'neo4j_labels': ['Content Item']},\n", " {'name': 'test1.txt', 'internal_id': 43287, 'neo4j_labels': ['Content Item']}]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"world\", all_properties=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "aae1a25f-defe-4d28-b028-e562eb9984c7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fc7788c6-da52-4237-ab2c-a3e049eb572a", "metadata": {}, "source": [ "### IMPORTANT: make sure to search for the word *STEMS*, in order to find all variants!!\n", "For example, search for \"potato\" in order to find both \"potato\" and \"potatoes\"." ] }, { "cell_type": "code", "execution_count": 21, "id": "14a09629-0ab4-4ad3-92ac-7d6f150b80a6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[43287, 43294]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"POTATO\") # Here, we're just returning the internal database ID of the document records" ] }, { "cell_type": "code", "execution_count": 22, "id": "ea744c93-e11f-43cd-a213-86f4d3a37af8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[43287]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"POTATOES\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "99226331-f12a-4a7a-80c9-464a54e7858b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[43287, 43294]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"Learn\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "a051f005-6468-4a8e-ae94-82faa60c97c5", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[43287]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"Learning\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "e791d090-9899-430c-842a-450970b64234", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"Supercalifragili\")" ] }, { "cell_type": "code", "execution_count": null, "id": "64d75875-9a50-4394-b9ff-551cb19684f3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ecd36d8f-a34c-4c5a-a5ab-557c998772c6", "metadata": {}, "source": [ "### Note: full-text indexing and search is also available as part of the UI of the web app that is included in the release of Brain Annex.\n", "Currently supported: indexing of text files, HTML files (e.g., formatted notes) and PDF documents." ] }, { "cell_type": "code", "execution_count": null, "id": "21ead4df-7a0b-4822-bdba-379b3c19ae20", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }