{ "cells": [ { "cell_type": "markdown", "id": "22d93371-d72a-498a-b25e-affb11659f0d", "metadata": {}, "source": [ "## Tutorial: Exploration of full-text indexing\n", "We'll read in some files, then index the \"important\" words in their contents, and finally search for some of those words.\n", "\n", "For more background information, please see: \n", " https://julianspolymathexplorations.blogspot.com/2023/08/full-text-search-neo4j-indexing.html" ] }, { "cell_type": "code", "execution_count": 1, "id": "910c294a-eb6b-43d7-9369-980f20974e12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 'D:\\Docs\\- MY CODE\\Brain Annex\\BA-Win7' to sys.path\n" ] } ], "source": [ "import set_path # Importing this module will add the project's home directory to sys.path" ] }, { "cell_type": "code", "execution_count": 2, "id": "e00686a6-c019-414e-92be-a44d32cfe138", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import getpass\n", "\n", "from neoaccess import NeoAccess\n", "\n", "from BrainAnnex.modules.neo_schema.neo_schema import NeoSchema\n", "from BrainAnnex.modules.full_text_indexing.full_text_indexing import FullTextIndexing\n", "from BrainAnnex.modules.media_manager.media_manager import MediaManager" ] }, { "cell_type": "markdown", "id": "be1fb174-5bb9-4dee-a920-0ac5dcfb74a5", "metadata": {}, "source": [ "# Connect to the database\n", "#### You can use a free local install of the Neo4j database, or a remote one on a virtual machine under your control, or a hosted solution, or simply the FREE \"Sandbox\": [instructions here](https://julianspolymathexplorations.blogspot.com/2023/03/neo4j-sandbox-tutorial-cypher.html)\n", "NOTE: This tutorial was tested on version 4 of the Neo4j database, but will probably also work on the new version 5." ] }, { "cell_type": "code", "execution_count": 3, "id": "4f62ab54-ffd7-4432-8f4c-dacf5618fa91", "metadata": { "tags": [] }, "outputs": [], "source": 
[ "# Save your credentials here - or use the prompts given by the next cell\n", "host = \"\" # EXAMPLES: bolt://123.456.789.012 OR neo4j://localhost\n", "password = \"\"" ] }, { "cell_type": "code", "execution_count": 4, "id": "8ebe967e-3446-4ca5-8584-9603ec532a9b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "To create a database connection, enter the host IP, but leave out the port number: (EXAMPLES: bolt://123.456.789.012 OR neo4j://localhost )\n", "\n" ] }, { "name": "stdin", "output_type": "stream", "text": [ "Enter host IP WITHOUT the port number. EXAMPLE: bolt://123.456.789.012 bolt://123.456.789.012\n", "Enter the database password: ········\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "=> Will be using: host='bolt://123.456.789.012:7687', username='neo4j', password=**********\n" ] } ], "source": [ "print(\"To create a database connection, enter the host IP, but leave out the port number: (EXAMPLES: bolt://123.456.789.012 OR neo4j://localhost )\\n\")\n", "\n", "host = input(\"Enter host IP WITHOUT the port number. 
EXAMPLE: bolt://123.456.789.012 \")\n", "host += \":7687\" # EXAMPLE of host value: \"bolt://123.456.789.012:7687\"\n", "\n", "password = getpass.getpass(\"Enter the database password:\")\n", "\n", "print(f\"\\n=> Will be using: host='{host}', username='neo4j', password=**********\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f742deed-9b21-4129-966e-659ce059fa1e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 5, "id": "7247f139-9f06-41d1-98e8-410ff7c9f177", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Attempting to connect to Neo4j database\n" ] } ], "source": [ "db = NeoAccess(host=host,\n", " credentials=(\"neo4j\", password), debug=False) # Notice the debug option being OFF" ] }, { "cell_type": "code", "execution_count": 6, "id": "c96ece03-2b07-4a4d-ad6e-e41ccbf67251", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Version of the Neo4j driver: 4.4.11\n" ] } ], "source": [ "print(\"Version of the Neo4j driver: \", db.version())" ] }, { "cell_type": "markdown", "id": "4ca98da0-f267-4efa-8302-624efbd1a744", "metadata": {}, "source": [ "# Explorations of Indexing" ] }, { "cell_type": "code", "execution_count": 7, "id": "e86c1f5b-490f-4853-8944-f6a7ea5c2703", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "21" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Verify that the database is empty (if necessary, use db.empty_dbase() to clear it)\n", "q = \"MATCH (n) RETURN COUNT(n) AS number_nodes\"\n", "\n", "db.query(q, single_cell=\"number_nodes\")" ] }, { "cell_type": "markdown", "id": "8065fd62-1609-454d-ad2e-23b752679f66", "metadata": {}, "source": [ "#### Initialize the indexing" ] }, { "cell_type": "code", "execution_count": 8, "id": "fd329598-dedd-4e53-b1ea-8bce1f540b3c", "metadata": { "tags": [] }, "outputs": [], "source": [ "NeoSchema.set_database(db)\n", 
"FullTextIndexing.db = db" ] }, { "cell_type": "code", "execution_count": 9, "id": "13d47b2c-348b-437b-a595-84e2659ba8a1", "metadata": { "tags": [] }, "outputs": [], "source": [ "MediaManager.set_media_folder(\"D:/tmp/\") # CHANGE AS NEEDED on your system" ] }, { "cell_type": "code", "execution_count": 10, "id": "a213753b-66eb-487f-af01-fb6789379844", "metadata": {}, "outputs": [], "source": [ "db.empty_dbase() # WARNING: USE WITH CAUTION!!!" ] }, { "cell_type": "code", "execution_count": 11, "id": "53e30599-8f89-4099-9051-8bcdcfb0d1b7", "metadata": {}, "outputs": [], "source": [ "FullTextIndexing.initialize_schema()" ] }, { "cell_type": "code", "execution_count": null, "id": "63088d2c-9d9b-45b2-9c4b-b3a42deb909a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "89288a04-bec9-48d9-a9cb-255ab30afa74", "metadata": {}, "source": [ "#### Read in 2 files (stored in the \"media folder\" specified above), and index them" ] }, { "cell_type": "code", "execution_count": 12, "id": "57e6f55b-4927-44e2-8d6a-d4536f2449c7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'hello to the world !!! ? Welcome to learning how she cooks with potatoes...'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = \"test1.txt\" # 1st FILE\n", "file_contents = MediaManager.get_from_file(filename)\n", "file_contents" ] }, { "cell_type": "code", "execution_count": 13, "id": "582287e0-e5ae-4a7a-bede-4fe7331963c7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'cooks', 'learning', 'potatoes', 'welcome', 'world'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_list = FullTextIndexing.extract_unique_good_words(file_contents)\n", "word_list" ] }, { "cell_type": "markdown", "id": "c79449ec-85c8-4dfb-b707-7b29cc28c353", "metadata": {}, "source": [ "#### Note that many common words get dropped..." 
] }, { "cell_type": "code", "execution_count": 14, "id": "ef6ebaa4-4e1b-4c51-8fbf-ae85722569e6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content_item_id = NeoSchema.create_data_node(class_node=\"Content Item\", properties = {\"name\": filename})\n", "content_item_id" ] }, { "cell_type": "code", "execution_count": 15, "id": "2dc06ee4-b0e9-451f-9a16-6ad144c0b2ff", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Index the chosen words for this first Content Item\n", "FullTextIndexing.new_indexing(content_item_id = content_item_id, unique_words = word_list)" ] }, { "cell_type": "markdown", "id": "1709c195-2e90-4173-90e7-d157ab6e71d7", "metadata": {}, "source": [ "#### Process the 2nd Content Item" ] }, { "cell_type": "code", "execution_count": 16, "id": "3b456b09-9300-4515-b520-a5c87a1e0f56", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\"

Let's make a much better world, shall we? What do you say to that enticing prospect?

\\n\\n

Starting on a small scale – we’ll learn cooking a potato well.

\"" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = \"test2.htm\" # 2nd FILE\n", "file_contents = MediaManager.get_from_file(filename)\n", "file_contents" ] }, { "cell_type": "code", "execution_count": 17, "id": "2a9a76f2-be99-45de-937c-a9e7e4733974", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'cooking', 'enticing', 'learn', 'potato', 'prospect', 'say', 'scale', 'world'}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_list = FullTextIndexing.extract_unique_good_words(file_contents)\n", "word_list" ] }, { "cell_type": "code", "execution_count": 18, "id": "65f70363-f772-490e-af20-883b0fc0dc9b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "23" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content_item_id = NeoSchema.create_data_node(class_node=\"Content Item\", properties = {\"name\": filename})\n", "content_item_id" ] }, { "cell_type": "code", "execution_count": 19, "id": "5bb78b01-2dd7-4159-9496-49738004225b", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Index the chosen words for this 2nd Content Item\n", "FullTextIndexing.new_indexing(content_item_id = content_item_id, unique_words = word_list)" ] }, { "cell_type": "markdown", "id": "bb926fc6-724d-47d2-8954-10863d3d636c", "metadata": {}, "source": [ "_Here's what we have created so far:_" ] }, { "cell_type": "markdown", "id": "dad0793b-6d5f-40c7-b482-b7fc3dddf8cf", "metadata": {}, "source": [ "![Full Text Indexing](../BrainAnnex/docs/tutorial_full_text_indexing.png)" ] }, { "cell_type": "code", "execution_count": null, "id": "2eee3b10-998f-459f-b9f4-6fbb88a2df6d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "98cd92f4-ac3e-4843-89f6-787671c0da14", "metadata": {}, "source": [ "### The following function provides a simple way to search content that includes a 
given word in the index, for demonstration purposes; for actual use, please see the methods provided by the `FullTextIndexing` class" ] }, { "cell_type": "code", "execution_count": 20, "id": "630b7a82-f846-4efb-b43f-499e28ae1ff3", "metadata": { "tags": [] }, "outputs": [], "source": [ "def search_word(word :str) -> [str]:\n", "    \"\"\"\n", "    Look up any stored words that contain the requested one (ignoring case).\n", "    Then locate the Content Items that are indexed by any of those words.\n", "    Return a list of the values of the \"name\" attributes in all the found Content Items\n", "    \"\"\"\n", "    # The search word is passed as a query parameter, rather than spliced into the Cypher string\n", "    q = '''MATCH (w:Word)-[:occurs]->(:Indexer)<-[:has_index]-(ci:`Content Item`)\n", "           WHERE w.name CONTAINS toLower($word)\n", "           RETURN DISTINCT ci.name AS content_name\n", "        '''\n", "    result = db.query(q, data_binding={\"word\": word}, single_column=\"content_name\")\n", "    return result" ] }, { "cell_type": "code", "execution_count": null, "id": "a50dac74-5c63-4d8f-9f45-03ddaeb1dad1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "37865ecc-7857-4b4e-bc3d-e2c0db71957e", "metadata": {}, "source": [ "# Now we can finally try out some word searches" ] }, { "cell_type": "markdown", "id": "aaa1b677-e4f7-42ae-a0b9-d7ac50f1111c", "metadata": {}, "source": [ "### Using the search_word() function above:" ] }, { "cell_type": "code", "execution_count": 21, "id": "f0efce75-3353-489b-9502-1a377a0ee964", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_word(\"hello\")" ] }, { "cell_type": "code", "execution_count": 22, "id": "45087339-97b3-461a-95f8-58230394b91d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['test1.txt', 'test2.htm']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_word(\"world\")" ] }, { "cell_type": "markdown", "id": "c90cb2d9-5e64-41b4-8ec4-c2c1047b6428", 
"metadata": {}, "source": [ "### Or using methods provided by the `FullTextIndexing` class:" ] }, { "cell_type": "code", "execution_count": 23, "id": "df65dda6-c2d5-41c5-94e2-f6ca965c31c4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'name': 'test1.txt', 'internal_id': 16, 'neo4j_labels': ['Content Item']},\n", " {'name': 'test2.htm', 'internal_id': 23, 'neo4j_labels': ['Content Item']}]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FullTextIndexing.search_word(\"world\", all_properties=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "aae1a25f-defe-4d28-b028-e562eb9984c7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fc7788c6-da52-4237-ab2c-a3e049eb572a", "metadata": {}, "source": [ "### IMPORTANT: make sure to search for the word STEMS, in order to find all variants!!\n", "For example, search for \"potato\" in order to find both \"potato\" and \"potatoes\"." ] }, { "cell_type": "code", "execution_count": 24, "id": "14a09629-0ab4-4ad3-92ac-7d6f150b80a6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['test1.txt', 'test2.htm']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_word(\"POTATO\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "ea744c93-e11f-43cd-a213-86f4d3a37af8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['test1.txt']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_word(\"POTATOES\")" ] }, { "cell_type": "code", "execution_count": 26, "id": "99226331-f12a-4a7a-80c9-464a54e7858b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['test1.txt', 'test2.htm']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_word(\"Learn\")" ] }, { "cell_type": "code", "execution_count": 27, "id": 
"a051f005-6468-4a8e-ae94-82faa60c97c5", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['test1.txt']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_word(\"Learning\")" ] }, { "cell_type": "code", "execution_count": null, "id": "64d75875-9a50-4394-b9ff-551cb19684f3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }