{"cells":[{"cell_type":"markdown","id":"707d13a7","metadata":{},"source":["# Databricks\n","\n","本笔记涵盖了如何使用LangChain的SQLDatabase包连接[Databricks运行时](https://docs.databricks.com/runtime/index.html)和[Databricks SQL](https://www.databricks.com/product/databricks-sql)。它分为三个部分:安装和设置、连接到Databricks以及示例。"]},{"cell_type":"markdown","id":"0076d072","metadata":{},"source":["## 安装和设置"]},{"cell_type":"code","execution_count":1,"id":"739b489b","metadata":{},"outputs":[],"source":["# 安装databricks-sql-connector库\n","!pip install databricks-sql-connector"]},{"cell_type":"markdown","id":"73113163","metadata":{},"source":["## 连接到Databricks\n","\n","您可以使用`SQLDatabase.from_databricks()`方法连接到[Databricks运行时](https://docs.databricks.com/runtime/index.html)和[Databricks SQL](https://www.databricks.com/product/databricks-sql)。\n","\n","### 语法\n","```python\n","SQLDatabase.from_databricks(\n"," catalog: str,\n"," schema: str,\n"," host: Optional[str] = None,\n"," api_token: Optional[str] = None,\n"," warehouse_id: Optional[str] = None,\n"," cluster_id: Optional[str] = None,\n"," engine_args: Optional[dict] = None,\n"," **kwargs: Any)\n","```\n","### 必需参数\n","* `catalog`: Databricks数据库中的目录名称。\n","* `schema`: 目录中的模式名称。\n","\n","### 可选参数\n","以下参数是可选的。在Databricks笔记本中执行该方法时,在大多数情况下不需要提供它们。\n","* `host`: Databricks工作区主机名,不包括'https://'部分。默认为'DATABRICKS_HOST'环境变量或当前工作区(如果在Databricks笔记本中)。\n","* `api_token`: 用于访问Databricks SQL仓库或集群的Databricks个人访问令牌。默认为'DATABRICKS_TOKEN'环境变量,如果在Databricks笔记本中,则生成一个临时令牌。\n","* `warehouse_id`: Databricks SQL中的仓库ID。\n","* `cluster_id`: Databricks运行时中的集群ID。如果在Databricks笔记本中运行,并且'warehouse_id'和'cluster_id'都为None,则使用附加到笔记本的集群的ID。\n","* `engine_args`: 连接Databricks时要使用的参数。\n","* `**kwargs`: `SQLDatabase.from_uri`方法的其他关键字参数。"]},{"cell_type":"markdown","id":"b11c7e48","metadata":{},"source":["## 例子"]},{"cell_type":"code","execution_count":2,"id":"8102bca0","metadata":{},"outputs":[],"source":["# 使用SQLDatabase包连接到Databricks\n","from langchain_community.utilities import SQLDatabase\n","\n","# 创建SQLDatabase对象,并连接到Databricks\n","db = SQLDatabase.from_databricks(catalog=\"samples\", schema=\"nyctaxi\")"]},{"cell_type":"code","execution_count":3,"id":"9dd36f58","metadata":{},"outputs":[],"source":["# 创建一个OpenAI Chat LLM包装器\n","from langchain_openai import ChatOpenAI\n","\n","# 创建ChatOpenAI对象,并设置temperature为0,model_name为\"gpt-4\"\n","llm = ChatOpenAI(temperature=0, model_name=\"gpt-4\")"]},{"cell_type":"markdown","id":"5b5c5f1a","metadata":{},"source":["### SQL链示例\n","\n","该示例演示了使用[SQL链](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html)来回答关于Databricks数据库的问题。"]},{"cell_type":"code","execution_count":4,"id":"36f2270b","metadata":{},"outputs":[],"source":["from langchain_community.utilities import SQLDatabaseChain\n","\n","db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)"]},{"cell_type":"code","execution_count":5,"id":"4e2b5f25","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["\n","\n","\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n","What is the average duration of taxi rides that start between midnight and 6am?\n","SQLQuery:\u001b[32;1m\u001b[1;3mSELECT AVG(UNIX_TIMESTAMP(tpep_dropoff_datetime) - UNIX_TIMESTAMP(tpep_pickup_datetime)) as avg_duration\n","FROM trips\n","WHERE HOUR(tpep_pickup_datetime) >= 0 AND HOUR(tpep_pickup_datetime) < 6\u001b[0m\n","SQLResult: \u001b[33;1m\u001b[1;3m[(987.8122786304605,)]\u001b[0m\n","Answer:\u001b[32;1m\u001b[1;3mThe average duration of taxi rides that start between midnight and 6am is 987.81 seconds.\u001b[0m\n","\u001b[1m> Finished chain.\u001b[0m\n"]},{"data":{"text/plain":["'The average duration of taxi rides that start between midnight and 6am is 987.81 seconds.'"]},"execution_count":6,"metadata":{},"output_type":"execute_result"}],"source":["# 调用db_chain的run函数,查询在午夜到早上6点之间开始的出租车行程的平均持续时间\n","db_chain.run(\n"," \"What is the average duration of taxi rides that start between midnight and 6am?\"\n",")"]},{"cell_type":"markdown","id":"e496d5e5","metadata":{},"source":["### SQL数据库代理示例\n","\n","该示例演示了如何使用[SQL数据库代理](/docs/integrations/toolkits/sql_database.html)来对Databricks数据库中的问题进行回答。"]},{"cell_type":"code","execution_count":7,"id":"9918e86a","metadata":{},"outputs":[],"source":["from langchain.agents import create_sql_agent\n","from langchain_community.agent_toolkits import SQLDatabaseToolkit\n","\n","toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n","agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)"]},{"cell_type":"code","execution_count":8,"id":"c484a76e","metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["\n","\n","\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n","\u001b[32;1m\u001b[1;3mAction: list_tables_sql_db\n","Action Input: \u001b[0m\n","Observation: \u001b[38;5;200m\u001b[1;3mtrips\u001b[0m\n","Thought:\u001b[32;1m\u001b[1;3mI should check the schema of the trips table to see if it has the necessary columns for trip distance and duration.\n","Action: schema_sql_db\n","Action Input: trips\u001b[0m\n","Observation: \u001b[33;1m\u001b[1;3m\n","CREATE TABLE trips (\n","\ttpep_pickup_datetime TIMESTAMP, \n","\ttpep_dropoff_datetime TIMESTAMP, \n","\ttrip_distance FLOAT, \n","\tfare_amount FLOAT, \n","\tpickup_zip INT, \n","\tdropoff_zip INT\n",") USING DELTA\n","\n","/*\n","3 rows from trips table:\n","tpep_pickup_datetime\ttpep_dropoff_datetime\ttrip_distance\tfare_amount\tpickup_zip\tdropoff_zip\n","2016-02-14 16:52:13+00:00\t2016-02-14 17:16:04+00:00\t4.94\t19.0\t10282\t10171\n","2016-02-04 18:44:19+00:00\t2016-02-04 18:46:00+00:00\t0.28\t3.5\t10110\t10110\n","2016-02-17 17:13:57+00:00\t2016-02-17 17:17:55+00:00\t0.7\t5.0\t10103\t10023\n","*/\u001b[0m\n","Thought:\u001b[32;1m\u001b[1;3mThe trips table has the necessary columns for trip distance and duration. I will write a query to find the longest trip distance and its duration.\n","Action: query_checker_sql_db\n","Action Input: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1\u001b[0m\n","Observation: \u001b[31;1m\u001b[1;3mSELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1\u001b[0m\n","Thought:\u001b[32;1m\u001b[1;3mThe query is correct. I will now execute it to find the longest trip distance and its duration.\n","Action: query_sql_db\n","Action Input: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1\u001b[0m\n","Observation: \u001b[36;1m\u001b[1;3m[(30.6, '0 00:43:31.000000000')]\u001b[0m\n","Thought:\u001b[32;1m\u001b[1;3mI now know the final answer.\n","Final Answer: The longest trip distance is 30.6 miles and it took 43 minutes and 31 seconds.\u001b[0m\n","\n","\u001b[1m> Finished chain.\u001b[0m\n"]},{"data":{"text/plain":["'The longest trip distance is 30.6 miles and it took 43 minutes and 31 seconds.'"]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["# 运行代理程序,并向其传递一个问题作为参数\n","agent.run(\"What is the longest trip distance and how long did it take?\")"]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.3"}},"nbformat":4,"nbformat_minor":5}