{ "cells": [ { "cell_type": "markdown", "id": "22d93371-d72a-498a-b25e-affb11659f0d", "metadata": {}, "source": [ "## `NeoAccess` library - Tutorial : Importing data into Neo4j from a Pandas data frame, \n", "#### using the `load_pandas()` method\n", "\n", "#### [Overview and Intro article](https://julianspolymathexplorations.blogspot.com/2023/06/neo4j-python-neoaccess-library.html)\n", "\n", "#### [Reference guide](https://brainannex.org/docs/neoaccess.htm)\n", "\n", "#### CAUTION: running this tutorial will clear out the database" ] }, { "cell_type": "code", "execution_count": 1, "id": "e00686a6-c019-414e-92be-a44d32cfe138", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import getpass\n", "import pandas as pd\n", "\n", "from neoaccess import NeoAccess\n", "# In case of problems, try a sys.path.append(directory) , where directory is your project's root directory" ] }, { "cell_type": "markdown", "id": "be1fb174-5bb9-4dee-a920-0ac5dcfb74a5", "metadata": {}, "source": [ "# 1. Connecting to the database\n", "#### You can use a free local install of the Neo4j database, or a remote one on a virtual machine under your control, or a hosted solution\n", "NOTE: This tutorial is tested on **version 4.4** of the Neo4j database, but will probably also work on the new version 5 (NOT guaranteed, however...)" ] }, { "cell_type": "code", "execution_count": 2, "id": "34c5f375-64d7-44c4-bf14-c34abda2fd53", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Save your credentials here - or use the prompts given by the next cell\n", "host = \"\" # EXAMPLES: bolt://123.456.789.012 OR neo4j://localhost \n", " # (CAUTION: do NOT include the port number!)\n", "password = \"\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "7247f139-9f06-41d1-98e8-410ff7c9f177", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connection to Neo4j database established.\n" ] } ], "source": [ "db = NeoAccess(host=host,\n", " credentials=(\"neo4j\", password)) " ] }, { "cell_type": "code", "execution_count": 4, "id": "c96ece03-2b07-4a4d-ad6e-e41ccbf67251", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Version of the Neo4j driver: 4.4.11\n" ] } ], "source": [ "print(\"Version of the Neo4j driver: \", db.version())" ] }, { "cell_type": "code", "execution_count": null, "id": "fa4add63-bf1b-4222-a5cd-f53ab5b0c449", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "adc97bbf-9d16-4dfe-abe5-4f07733f45e3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4ca98da0-f267-4efa-8302-624efbd1a744", "metadata": {}, "source": [ "# 2. Importing data into an empty database from a Pandas dataframe" ] }, { "cell_type": "markdown", "id": "62803c10-bcb6-4d93-b747-a1d831ba84a6", "metadata": {}, "source": [ "### In the following example, the database entities are EMPLOYEES at a company" ] }, { "cell_type": "code", "execution_count": 5, "id": "ef0b6978-e8ca-4b26-b633-99c923ca935d", "metadata": {}, "outputs": [], "source": [ "# CLEAR OUT THE DATABASE\n", "#db.empty_dbase() # UNCOMMENT IF DESIRED ***************** WARNING: USE WITH CAUTION!!! ************************" ] }, { "cell_type": "code", "execution_count": 6, "id": "9b1f3651-f613-4966-b1ac-f2eddfbb2e79", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employee IDactivenamejob titlesalary
0100TrueJohn Doeadministrative assistant80000
1200TrueValerie LebenCEO400000
2300TrueJill Smithhead of marketing250000
\n", "
" ], "text/plain": [ " employee ID active name job title salary\n", "0 100 True John Doe administrative assistant 80000\n", "1 200 True Valerie Leben CEO 400000\n", "2 300 True Jill Smith head of marketing 250000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prepare a Pandas data frame with the data. (In some use cases, the data frame would be read in from a CSV file)\n", "\n", "df = pd.DataFrame({\"employee ID\": [100, 200, 300], \n", " \"active\": [True, True, True],\n", " \"name\": [\"John Doe\", \"Valerie Leben\", \"Jill Smith\"],\n", " \"job title\": [\"administrative assistant\", \"CEO\", \"head of marketing\"],\n", " \"salary\": [80000, 400000, 250000]\n", " })\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 7, "id": "a300fe8c-be58-42c4-8c68-c9411c11f087", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "employee ID int64\n", "active bool\n", "name object\n", "job title object\n", "salary int64\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Always a good idea to make sure that the data types are what we need them to be,\n", "# especially if your dataframe originated from a CSV file.\n", "# Note: it's normal for strings show up as \"object\" in a Pandas dataframe (because of their variable length)\n", "\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 8, "id": "5063cec5-e693-4002-aa84-c5c71240039d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[43302, 43303, 43304]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now import the data into the database. \n", "# We'll use the label \"Employee\". Note: blanks are allowed, and so are multiple labels (use a list or tuple)\n", "\n", "internal_ids = db.load_pandas(df, labels=\"Employee\")\n", "\n", "internal_ids # This is a list of the Neo4j ID's assigned to the new nodes" ] }, { "cell_type": "markdown", "id": "dd3d435f-acd5-45bb-95bd-1298269042de", "metadata": {}, "source": [ "## Voila', as simple as that! Here's what the database looks like at this stage:" ] }, { "cell_type": "markdown", "id": "07da197f-2b2a-4c9b-979f-b8d785ef2484", "metadata": {}, "source": [ "![Three imported records](https://raw.githubusercontent.com/BrainAnnex/brain-annex/main/docs/tutorials_neoaccess_load_pandas.jpg)" ] }, { "cell_type": "markdown", "id": "6f2ac5bf-a150-4d23-bf70-131890f77ec6", "metadata": {}, "source": [ "#### Now, retrieve the newly-created nodes" ] }, { "cell_type": "code", "execution_count": 9, "id": "755de70d-f5d7-4e30-a4b8-efdef812f880", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RAW match structure (object of class NodeSpecs):\n", " internal_id: None labels: Employee key_name: None key_value: None properties: None clause: None clause_dummy_name: None\n" ] } ], "source": [ "match_specs = db.match(labels=\"Employee\") # Our specifications to later locate nodes by (here, by using labels) \n", " # NO DATABASE OPERATION PERFORMED HERE\n", "\n", "print(match_specs)" ] }, { "cell_type": "code", "execution_count": 10, "id": "92397462-f3b6-4634-8726-5491cf9cfd57", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'employee ID': 100,\n", " 'name': 'John Doe',\n", " 'active': True,\n", " 'salary': 80000,\n", " 'job title': 'administrative assistant'},\n", " {'employee ID': 200,\n", " 'name': 'Valerie Leben',\n", " 'active': True,\n", " 'salary': 400000,\n", " 'job title': 'CEO'},\n", " {'employee ID': 300,\n", " 'name': 'Jill Smith',\n", " 'active': True,\n", " 'salary': 250000,\n", " 'job title': 'head of marketing'}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.get_nodes(match_specs)" ] }, { "cell_type": "markdown", "id": "277c3ec3-a5aa-46db-ab9c-f3475d237763", "metadata": {}, "source": [ "#### Side note - If APOC is installed in the database, we can verify that the **data types** got correctly imported by issuing the following Cyher command:" ] }, { "cell_type": "code", "execution_count": 11, "id": "43626de4-8e0e-49e9-af76-84da2edd6e64", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'apoc.meta.cypher.types(n)': {'employee ID': 'INTEGER',\n", " 'name': 'STRING',\n", " 'active': 'BOOLEAN',\n", " 'salary': 'INTEGER',\n", " 'job title': 'STRING'}}]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''MATCH (n :Employee) \n", " RETURN apoc.meta.cypher.types(n) \n", " LIMIT 1\n", " '''\n", "\n", "db.query(q) # Run the Cypher query" ] }, { "cell_type": "code", "execution_count": null, "id": "1932fb59-73c7-42c9-9e77-a7b56ef9af34", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6baf3f6c-befa-46c6-83e5-2aa1a5dcb8de", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "75878b79-a966-4d0b-bb7a-95d6d3af3a0a", "metadata": {}, "source": [ "# 3. Adding data to an existing database from a Pandas dataframe (keeping existing fields)" ] }, { "cell_type": "code", "execution_count": 12, "id": "76d3ee3c-d2e9-4cdd-8e94-1f054f05a68b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employee IDactivenamejob titlesalary
0100TrueJohn Doeadministrative assistant80000
1200TrueValerie LebenCEO400000
2300TrueJill Smithhead of marketing250000
\n", "
" ], "text/plain": [ " employee ID active name job title salary\n", "0 100 True John Doe administrative assistant 80000\n", "1 200 True Valerie Leben CEO 400000\n", "2 300 True Jill Smith head of marketing 250000" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's review our old, original data frame\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "31278b89-5b9f-4562-95f1-d37bf6abb9e1", "metadata": {}, "source": [ "#### Now, consider a NEW dataframe with changes/updates\n", "This dataframe spells out the details of Jill's raise:" ] }, { "cell_type": "code", "execution_count": 13, "id": "2704e1e6-6d3f-4dcb-ad31-01d696ca81d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employee IDsalary
0300270000
\n", "
" ], "text/plain": [ " employee ID salary\n", "0 300 270000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_update_1 = pd.DataFrame({\"employee ID\": [300], \n", " \"salary\": [270000]\n", " })\n", "\n", "df_update_1" ] }, { "cell_type": "markdown", "id": "3d9082d7-283d-4dfb-bd7b-f8bbcaa56e8c", "metadata": {}, "source": [ "#### The new data can be MERGED into the existing data, by using the \"employee ID\" for matching against existing records\n", "Notice the use of the argument `merge_primary_key=\"employee ID\"`" ] }, { "cell_type": "code", "execution_count": 14, "id": "cf9dec5d-2376-4a63-a1ab-24d5fbfc1dc2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[43304]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.load_pandas(df_update_1, labels=\"Employee\",\n", " merge_primary_key=\"employee ID\", merge_overwrite=False) # merge_overwrite=False means updating the record rather then completely over-writing it, \n", " # i.e. \"KEEP EXISTING DATA IN FIELDS NOT SPECIFIED IN THE DATAFRAME\"" ] }, { "cell_type": "code", "execution_count": 15, "id": "df19cdc1-992b-471b-8fa2-ada0fb4c226d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'employee ID': 100,\n", " 'name': 'John Doe',\n", " 'active': True,\n", " 'salary': 80000,\n", " 'job title': 'administrative assistant'},\n", " {'employee ID': 200,\n", " 'name': 'Valerie Leben',\n", " 'active': True,\n", " 'salary': 400000,\n", " 'job title': 'CEO'},\n", " {'employee ID': 300,\n", " 'name': 'Jill Smith',\n", " 'active': True,\n", " 'salary': 270000,\n", " 'job title': 'head of marketing'}]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.get_nodes(match_specs) # The specs haven't changed: we still want to \"locate all nodes with the \"Employee\" label" ] }, { "cell_type": "markdown", "id": "848cd359-0e33-4c6d-aca9-f46f2f9e30c5", "metadata": {}, "source": [ "#### Notice how Jill got her raise, and her other fields were left untouched" ] }, { "cell_type": "code", "execution_count": null, "id": "4af98bc8-eac8-4f52-a130-106613bdf372", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "16ac2c24-ea2d-4da1-9d70-a6e92675bdf0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3269d8f3-81c8-488f-b4c3-0e8ed5eb48ed", "metadata": {}, "source": [ "# 4. Adding data to an existing database from a Pandas dataframe (completely over-writing records)" ] }, { "cell_type": "markdown", "id": "6fad7a17-1039-46b6-b959-e3a39957b72b", "metadata": {}, "source": [ "#### Let's say that John is retiring. We may want to keep the record with the employee ID and the name, but flip the \"active\" field to false, and DROP all other fields\n", "This can be done by importing the following dataframe, again with `merge_primary_key=\"employee ID\"` BUT this time with the `merge_overwrite=True` option" ] }, { "cell_type": "code", "execution_count": 16, "id": "aebffaac-379a-4bbf-83b0-84493e57350f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employee IDactivename
0100FalseJohn Doe
\n", "
" ], "text/plain": [ " employee ID active name\n", "0 100 False John Doe" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_update_2 = pd.DataFrame({\"employee ID\": [100], \n", " \"active\": [False],\n", " \"name\": [\"John Doe\"]\n", " })\n", "\n", "df_update_2" ] }, { "cell_type": "code", "execution_count": 17, "id": "a4bea348-b332-4d71-b5b6-26e159b107e4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[43302]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.load_pandas(df_update_2, labels=\"Employee\",\n", " merge_primary_key=\"employee ID\", merge_overwrite=True) # merge_overwrite=True means blanking out the record and then re-building it, \n", " # i.e. \"DON'T KEEP EXISTING DATA IN FIELDS NOT SPECIFIED IN THE DATAFRAME\"" ] }, { "cell_type": "code", "execution_count": 18, "id": "0cab441d-547a-4868-95ef-ef1343f4b170", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'employee ID': 100, 'name': 'John Doe', 'active': False},\n", " {'employee ID': 200,\n", " 'name': 'Valerie Leben',\n", " 'active': True,\n", " 'salary': 400000,\n", " 'job title': 'CEO'},\n", " {'employee ID': 300,\n", " 'name': 'Jill Smith',\n", " 'active': True,\n", " 'salary': 270000,\n", " 'job title': 'head of marketing'}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.get_nodes(match_specs) # The specs haven't changed: we still want to \"locate all nodes with the \"Employee\" label" ] }, { "cell_type": "markdown", "id": "15b08375-23d2-4ca7-aa3e-82251790fcb5", "metadata": {}, "source": [ "### Notice how John's old record got ditched, and COMPLETELY REPLACED with the new data (containing fewer fields)\n", "Nothing else got modified" ] }, { "cell_type": "code", "execution_count": null, "id": "5bdef354-7b1e-47a0-b204-0e68a4003bed", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d6f97408-524f-41be-b817-181fa7d159d6", "metadata": {}, "source": [ "# For more options to use with `load_pandas()`, please see the [reference guide](https://brainannex.org/docs/neoaccess.htm)" ] }, { "cell_type": "code", "execution_count": null, "id": "8fc24bad-5a7c-43e6-99fd-6cfd1fc8529e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }