{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cdc74b23",
   "metadata": {},
   "source": [
    "# Build a Data Analyst AI Agent from Scratch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db160825",
   "metadata": {},
   "source": [
    "## Project Setup  \n",
    "\n",
    "This section outlines the steps required to set up the project:  \n",
    "- Import necessary libraries  \n",
    "- Load the required API keys  \n",
    "- Establish the database connection  \n",
    "- Load the sample data into the database  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3eda0fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6717e73-fcac-4d57-96c0-d87c05f06c2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "import json\n",
    "import inspect\n",
    "from teradataml import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00de9d73-5f8b-4be9-9cea-ba7d20bd4e13",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('configs.json', 'r') as file:\n",
    "    configs=json.load(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "515a6b17-6d23-4787-81d3-e1d3cd165208",
   "metadata": {},
   "outputs": [],
   "source": [
    "client = OpenAI(api_key = configs['llm-api-key'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e66e2aef-96e0-4461-878f-1461721af3eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%run -i ../startup.ipynb\n",
    "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n",
    "print(eng)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "babdcb79-cfe6-4861-9309-1811f84980bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_loading_queries = [\n",
    "'''\n",
    "CREATE DATABASE teddy_retailers\n",
    "AS PERMANENT = 50000000;\n",
    "''',\n",
    "'''\n",
    "CREATE TABLE teddy_retailers.source_catalog AS\n",
    "(\n",
    "  SELECT product_id, product_name, product_category, price_cents\n",
    "     FROM (\n",
    "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_products.csv') as products\n",
    ") WITH DATA;''',\n",
    "\n",
    "'''\n",
    "CREATE TABLE teddy_retailers.source_stock AS\n",
    "(\n",
    "  SELECT entry_id, product_id, product_quantity, purchase_price_cents, entry_date\n",
    "     FROM (\n",
    "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_stock.csv') as stock\n",
    ") WITH DATA;\n",
    "''',\n",
    "'''\n",
    "CREATE TABLE teddy_retailers.source_customers AS\n",
    "(\n",
    "  SELECT customer_id, customer_name, customer_surname, customer_email\n",
    "     FROM (\n",
    "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_customers.csv') as customers\n",
    ") WITH DATA;\n",
    "''',\n",
    "'''\n",
    "CREATE TABLE teddy_retailers.source_orders AS\n",
    "(\n",
    "  SELECT order_id, customer_id, order_status, order_date\n",
    "     FROM (\n",
    "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_orders.csv') as orders\n",
    ") WITH DATA;\n",
    "''',\n",
    "'''\n",
    "CREATE TABLE teddy_retailers.source_order_products AS\n",
    "(\n",
    "  SELECT transaction_id, order_id, product_id, product_quantity\n",
    "     FROM (\n",
    "\t\tLOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_order_products.csv') as transactions\n",
    ") WITH DATA;\n",
    "'''\n",
    "]\n",
    "for query in data_loading_queries:\n",
    "    execute_sql(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d1d5a34",
   "metadata": {},
   "source": [
    "## Agent Configuration  \n",
    "\n",
    "This section covers the configuration of the agent, including:  \n",
    "* Defining the data context that the agent will interact with  \n",
    "* Setting up the routine the agent will follow as a system prompt (embedding the data context)  \n",
    "* Establishing the list of tools available for the agent to complete its tasks  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7daf1083-749c-4be4-beb4-567dc8a1a578",
   "metadata": {},
   "outputs": [],
   "source": [
    "databases = [\"teddy_retailers\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "097baac5-72a8-4d7a-a578-c2d3666f38f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_teradata_dictionary(databases_of_interest):\n",
    "    query = f'''\n",
    "            SELECT DatabaseName, TableName, ColumnName, ColumnFormat, ColumnType\n",
    "            FROM DBC.ColumnsV\n",
    "            WHERE DatabaseName IN ('{', '.join(databases_of_interest)}')\n",
    "            '''\n",
    "    table_dictionary = DataFrame.from_query(query)\n",
    "    return json.dumps(table_dictionary.to_pandas().to_json())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e9399e6-155b-4149-b7b0-d25396a3d691",
   "metadata": {},
   "outputs": [],
   "source": [
    "system_prompt= f\"\"\"\n",
    "You are an advanced data analyst for a retailer company, specializing in analyzing data from a Teradata system. Your primary responsibility is to assist users by answering business-related questions using SQL queries on the Teradata database. Follow these steps:\n",
    "\n",
    "1. Understanding User Requests\n",
    "   - Users provide business questions in plain English.\n",
    "   - Extract relevant data points needed to construct a meaningful response.\n",
    "\n",
    "2. Generating SQL Queries\n",
    "   - Construct an optimized Teradata SQL query to retrieve the necessary data.\n",
    "   - The query must be a **single-line string** without carriage returns or line breaks.\n",
    "   - Ensure that the SQL query adheres to **Teradata SQL syntax** and avoids unsupported keywords.\n",
    "   - The catalog of databases, tables, and columns to query is in the following json structure \n",
    "     {query_teradata_dictionary(databases)}\n",
    "   - Apply appropriate filtering, grouping, and ordering to enhance performance and accuracy.\n",
    "\n",
    "3. Executing the Query\n",
    "   - Run the SQL query on the Teradata system and retrieve the results efficiently.\n",
    "\n",
    "4. Responding to the User\n",
    "   - Convert the query results into a **concise, insightful, and plain-English response**.\n",
    "   - Present the information in a clear, structured, and user-friendly manner.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c39da44e-c3e2-4fcd-bac1-51ad6e7fd2e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def function_to_schema(func) -> dict:\n",
    "    type_map = {\n",
    "        str: \"string\",\n",
    "        int: \"integer\",\n",
    "        float: \"number\",\n",
    "        bool: \"boolean\",\n",
    "        list: \"array\",\n",
    "        dict: \"object\",\n",
    "        type(None): \"null\",\n",
    "    }\n",
    "\n",
    "    try:\n",
    "        signature = inspect.signature(func)\n",
    "    except ValueError as e:\n",
    "        raise ValueError(\n",
    "            f\"Failed to get signature for function {func.__name__}: {str(e)}\"\n",
    "        )\n",
    "\n",
    "    parameters = {}\n",
    "    for param in signature.parameters.values():\n",
    "        try:\n",
    "            param_type = type_map.get(param.annotation, \"string\")\n",
    "        except KeyError as e:\n",
    "            raise KeyError(\n",
    "                f\"Unknown type annotation {param.annotation} for parameter {param.name}: {str(e)}\"\n",
    "            )\n",
    "        parameters[param.name] = {\"type\": param_type}\n",
    "\n",
    "    required = [\n",
    "        param.name\n",
    "        for param in signature.parameters.values()\n",
    "        if param.default == inspect._empty\n",
    "    ]\n",
    "\n",
    "    return {\n",
    "        \"type\": \"function\",\n",
    "        \"function\": {\n",
    "            \"name\": func.__name__,\n",
    "            \"description\": (func.__doc__ or \"\").strip(),\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": parameters,\n",
    "                \"required\": required,\n",
    "            },\n",
    "        },\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bba18d6-5579-424d-8361-d2b28197d5bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_teradata_database(sql_statement):\n",
    "    query_statement = sql_statement.split('ORDER BY',1)[0]\n",
    "    query_result = DataFrame.from_query(query_statement)\n",
    "    return json.dumps(query_result.to_pandas().to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a796bc2b",
   "metadata": {},
   "source": [
    "## Agent Runtime\n",
    "This section covers the code executed while the agent is in action, including:\n",
    "* Preparing the tools for use by the agent\n",
    "* The agent's runtime function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88fd833b-57aa-4b11-b86c-89d5a7b41705",
   "metadata": {},
   "outputs": [],
   "source": [
    "tools = [query_teradata_database]\n",
    "tool_schemas = [function_to_schema(tool) for tool in tools]\n",
    "tools_map = {tool.__name__: tool for tool in tools}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7a03fa4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def execute_tool_call(tool_call, tools_map):\n",
    "    name = tool_call.function.name\n",
    "    args = json.loads(tool_call.function.arguments)\n",
    "\n",
    "    print(f\"Assistant: {name}({args})\")\n",
    "\n",
    "    # call corresponding function with provided arguments\n",
    "    return tools_map[name](**args)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ceda72dd-8e44-4a83-b9ea-29f40b5929d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_full_turn(system_message, messages):\n",
    "\n",
    "    while True:\n",
    "        print(f\"just logging messages {messages}\")\n",
    "        response = client.chat.completions.create(\n",
    "            model=\"gpt-4o\",\n",
    "            messages=[{\"role\": \"system\", \"content\": system_message}] + messages,\n",
    "            tools=tool_schemas or None,\n",
    "            seed = 2\n",
    "        )\n",
    "        print(f\"logging response {response}\")\n",
    "        message = response.choices[0].message\n",
    "        messages.append(message)\n",
    "\n",
    "        if message.content:  # print assistant response\n",
    "            print(\"Assistant Response:\", message.content)\n",
    "\n",
    "        if message.tool_calls:  # if finished handling tool calls, break\n",
    "            # === 2. handle tool calls ===\n",
    "            for tool_call in message.tool_calls:\n",
    "                result = execute_tool_call(tool_call, tools_map)\n",
    "\n",
    "                result_message = {\n",
    "                    \"role\": \"tool\",\n",
    "                    \"tool_call_id\": tool_call.id,\n",
    "                    \"content\": result,\n",
    "                }\n",
    "                messages.append(result_message)\n",
    "        else:\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1673e7da",
   "metadata": {},
   "source": [
    "## Running the Agent  \n",
    "Ask a business question and receive a response. Since this is a simple agent, it can only handle basic questions. While its capabilities can be enhanced, such improvements are currently out of scope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff9c8d62-b4ab-45ba-a98a-f09a7fdc16ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages =[]\n",
    "user_input = input(\"User: \")\n",
    "messages.append({\"role\": \"user\", \"content\": user_input})\n",
    "new_messages = run_full_turn(system_prompt, messages)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8b7ef0f",
   "metadata": {},
   "source": [
    "## Cleaning testing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1302f8cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_cleaning_queries = [\n",
    "'''\n",
    "DELETE DATABASE teddy_retailers ALL;\n",
    "''',\n",
    "'''\n",
    "DROP DATABASE teddy_retailers\n",
    "'''\n",
    "]\n",
    "for query in data_cleaning_queries:\n",
    "    execute_sql(query)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}