{ "cells": [ { "cell_type": "markdown", "id": "cdc74b23", "metadata": {}, "source": [ "# Build a Data Analyst AI Agent from Scratch" ] }, { "cell_type": "markdown", "id": "db160825", "metadata": {}, "source": [ "## Project Setup \n", "\n", "This section outlines the steps required to set up the project: \n", "- Import necessary libraries \n", "- Load the required API keys \n", "- Establish the database connection \n", "- Load the sample data into the database \n" ] }, { "cell_type": "code", "execution_count": null, "id": "a3eda0fc", "metadata": {}, "outputs": [], "source": [ "!pip install openai" ] }, { "cell_type": "code", "execution_count": null, "id": "d6717e73-fcac-4d57-96c0-d87c05f06c2f", "metadata": {}, "outputs": [], "source": [ "from openai import OpenAI\n", "import json\n", "import inspect\n", "from teradataml import *" ] }, { "cell_type": "code", "execution_count": null, "id": "00de9d73-5f8b-4be9-9cea-ba7d20bd4e13", "metadata": {}, "outputs": [], "source": [ "with open('configs.json', 'r') as file:\n", " configs=json.load(file)" ] }, { "cell_type": "code", "execution_count": null, "id": "515a6b17-6d23-4787-81d3-e1d3cd165208", "metadata": {}, "outputs": [], "source": [ "client = OpenAI(api_key = configs['llm-api-key'])" ] }, { "cell_type": "code", "execution_count": null, "id": "e66e2aef-96e0-4461-878f-1461721af3eb", "metadata": {}, "outputs": [], "source": [ "%run -i ../startup.ipynb\n", "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n", "print(eng)" ] }, { "cell_type": "code", "execution_count": null, "id": "babdcb79-cfe6-4861-9309-1811f84980bd", "metadata": {}, "outputs": [], "source": [ "data_loading_queries = [\n", "'''\n", "CREATE DATABASE teddy_retailers\n", "AS PERMANENT = 50000000;\n", "''',\n", "'''\n", "CREATE TABLE teddy_retailers.source_catalog AS\n", "(\n", " SELECT product_id, product_name, product_category, price_cents\n", " FROM (\n", "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_products.csv') as products\n", ") WITH DATA;''',\n", "\n", "'''\n", "CREATE TABLE teddy_retailers.source_stock AS\n", "(\n", " SELECT entry_id, product_id, product_quantity, purchase_price_cents, entry_date\n", " FROM (\n", "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_stock.csv') as stock\n", ") WITH DATA;\n", "''',\n", "'''\n", "CREATE TABLE teddy_retailers.source_customers AS\n", "(\n", " SELECT customer_id, customer_name, customer_surname, customer_email\n", " FROM (\n", "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_customers.csv') as customers\n", ") WITH DATA;\n", "''',\n", "'''\n", "CREATE TABLE teddy_retailers.source_orders AS\n", "(\n", " SELECT order_id, customer_id, order_status, order_date\n", " FROM (\n", "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_orders.csv') as orders\n", ") WITH DATA;\n", "''',\n", "'''\n", "CREATE TABLE teddy_retailers.source_order_products AS\n", "(\n", " SELECT transaction_id, order_id, product_id, product_quantity\n", " FROM (\n", "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_order_products.csv') as transactions\n", ") WITH DATA;\n", "'''\n", "]\n", "for query in data_loading_queries:\n", " execute_sql(query)" ] }, { "cell_type": "markdown", "id": "8d1d5a34", "metadata": {}, "source": [ "## Agent Configuration \n", "\n", "This section covers the configuration of the agent, including: \n", "* Defining the data context that the agent will interact with \n", "* Setting up the routine the agent will follow as a system prompt (embedding the data context) \n", "* Establishing the list of tools available for the agent to complete its tasks \n" ] }, { "cell_type": "code", "execution_count": null, "id": "7daf1083-749c-4be4-beb4-567dc8a1a578", "metadata": {}, "outputs": [], "source": [ "databases = [\"teddy_retailers\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "097baac5-72a8-4d7a-a578-c2d3666f38f3", "metadata": {}, "outputs": [], "source": [ "def query_teradata_dictionary(databases_of_interest):\n", " query = f'''\n", " SELECT DatabaseName, TableName, ColumnName, ColumnFormat, ColumnType\n", " FROM DBC.ColumnsV\n", " WHERE DatabaseName IN ('{', '.join(databases_of_interest)}')\n", " '''\n", " table_dictionary = DataFrame.from_query(query)\n", " return json.dumps(table_dictionary.to_pandas().to_json())" ] }, { "cell_type": "code", "execution_count": null, "id": "9e9399e6-155b-4149-b7b0-d25396a3d691", "metadata": {}, "outputs": [], "source": [ "system_prompt= f\"\"\"\n", "You are an advanced data analyst for a retailer company, specializing in analyzing data from a Teradata system. Your primary responsibility is to assist users by answering business-related questions using SQL queries on the Teradata database. Follow these steps:\n", "\n", "1. Understanding User Requests\n", " - Users provide business questions in plain English.\n", " - Extract relevant data points needed to construct a meaningful response.\n", "\n", "2. Generating SQL Queries\n", " - Construct an optimized Teradata SQL query to retrieve the necessary data.\n", " - The query must be a **single-line string** without carriage returns or line breaks.\n", " - Ensure that the SQL query adheres to **Teradata SQL syntax** and avoids unsupported keywords.\n", " - The catalog of databases, tables, and columns to query is in the following json structure \n", " {query_teradata_dictionary(databases)}\n", " - Apply appropriate filtering, grouping, and ordering to enhance performance and accuracy.\n", "\n", "3. Executing the Query\n", " - Run the SQL query on the Teradata system and retrieve the results efficiently.\n", "\n", "4. Responding to the User\n", " - Convert the query results into a **concise, insightful, and plain-English response**.\n", " - Present the information in a clear, structured, and user-friendly manner.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "c39da44e-c3e2-4fcd-bac1-51ad6e7fd2e6", "metadata": {}, "outputs": [], "source": [ "def function_to_schema(func) -> dict:\n", " type_map = {\n", " str: \"string\",\n", " int: \"integer\",\n", " float: \"number\",\n", " bool: \"boolean\",\n", " list: \"array\",\n", " dict: \"object\",\n", " type(None): \"null\",\n", " }\n", "\n", " try:\n", " signature = inspect.signature(func)\n", " except ValueError as e:\n", " raise ValueError(\n", " f\"Failed to get signature for function {func.__name__}: {str(e)}\"\n", " )\n", "\n", " parameters = {}\n", " for param in signature.parameters.values():\n", " try:\n", " param_type = type_map.get(param.annotation, \"string\")\n", " except KeyError as e:\n", " raise KeyError(\n", " f\"Unknown type annotation {param.annotation} for parameter {param.name}: {str(e)}\"\n", " )\n", " parameters[param.name] = {\"type\": param_type}\n", "\n", " required = [\n", " param.name\n", " for param in signature.parameters.values()\n", " if param.default == inspect._empty\n", " ]\n", "\n", " return {\n", " \"type\": \"function\",\n", " \"function\": {\n", " \"name\": func.__name__,\n", " \"description\": (func.__doc__ or \"\").strip(),\n", " \"parameters\": {\n", " \"type\": \"object\",\n", " \"properties\": parameters,\n", " \"required\": required,\n", " },\n", " },\n", " }" ] }, { "cell_type": "code", "execution_count": null, "id": "8bba18d6-5579-424d-8361-d2b28197d5bd", "metadata": {}, "outputs": [], "source": [ "def query_teradata_database(sql_statement):\n", " query_statement = sql_statement.split('ORDER BY',1)[0]\n", " query_result = DataFrame.from_query(query_statement)\n", " return json.dumps(query_result.to_pandas().to_json())" ] }, { "cell_type": "markdown", "id": "a796bc2b", "metadata": {}, "source": [ "## Agent Runtime\n", "This section covers the code executed while the agent is in action, including:\n", "* Preparing the tools for use by the agent\n", "* The agent's runtime function" ] }, { "cell_type": "code", "execution_count": null, "id": "88fd833b-57aa-4b11-b86c-89d5a7b41705", "metadata": {}, "outputs": [], "source": [ "tools = [query_teradata_database]\n", "tool_schemas = [function_to_schema(tool) for tool in tools]\n", "tools_map = {tool.__name__: tool for tool in tools}" ] }, { "cell_type": "code", "execution_count": null, "id": "d7a03fa4", "metadata": {}, "outputs": [], "source": [ "def execute_tool_call(tool_call, tools_map):\n", " name = tool_call.function.name\n", " args = json.loads(tool_call.function.arguments)\n", "\n", " print(f\"Assistant: {name}({args})\")\n", "\n", " # call corresponding function with provided arguments\n", " return tools_map[name](**args)" ] }, { "cell_type": "code", "execution_count": null, "id": "ceda72dd-8e44-4a83-b9ea-29f40b5929d4", "metadata": {}, "outputs": [], "source": [ "def run_full_turn(system_message, messages):\n", "\n", " while True:\n", " print(f\"just logging messages {messages}\")\n", " response = client.chat.completions.create(\n", " model=\"gpt-4o\",\n", " messages=[{\"role\": \"system\", \"content\": system_message}] + messages,\n", " tools=tool_schemas or None,\n", " seed = 2\n", " )\n", " print(f\"logging response {response}\")\n", " message = response.choices[0].message\n", " messages.append(message)\n", "\n", " if message.content: # print assistant response\n", " print(\"Assistant Response:\", message.content)\n", "\n", " if message.tool_calls: # if finished handling tool calls, break\n", " # === 2. handle tool calls ===\n", " for tool_call in message.tool_calls:\n", " result = execute_tool_call(tool_call, tools_map)\n", "\n", " result_message = {\n", " \"role\": \"tool\",\n", " \"tool_call_id\": tool_call.id,\n", " \"content\": result,\n", " }\n", " messages.append(result_message)\n", " else:\n", " break" ] }, { "cell_type": "markdown", "id": "1673e7da", "metadata": {}, "source": [ "## Running the Agent \n", "Ask a business question and receive a response. Since this is a simple agent, it can only handle basic questions. While its capabilities can be enhanced, such improvements are currently out of scope." ] }, { "cell_type": "code", "execution_count": null, "id": "ff9c8d62-b4ab-45ba-a98a-f09a7fdc16ed", "metadata": {}, "outputs": [], "source": [ "messages =[]\n", "user_input = input(\"User: \")\n", "messages.append({\"role\": \"user\", \"content\": user_input})\n", "new_messages = run_full_turn(system_prompt, messages)" ] }, { "cell_type": "markdown", "id": "e8b7ef0f", "metadata": {}, "source": [ "## Cleaning testing data" ] }, { "cell_type": "code", "execution_count": null, "id": "1302f8cf", "metadata": {}, "outputs": [], "source": [ "data_cleaning_queries = [\n", "'''\n", "DELETE DATABASE teddy_retailers ALL;\n", "''',\n", "'''\n", "DROP DATABASE teddy_retailers\n", "'''\n", "]\n", "for query in data_cleaning_queries:\n", " execute_sql(query)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" } }, "nbformat": 4, "nbformat_minor": 5 }