{"cells":[{"attachments":{},"cell_type":"markdown","metadata":{},"source":["# Question Answering with Activeloop's DeepLake\n","In this tutorial, we use LangChain + Activeloop's Deep Lake with GPT4 to semantically search and ask questions over a group chat.\n","\n","View a working demo [here](https://twitter.com/thisissukh_/status/1647223328363679745)."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## 1. Install required packages"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Install the required packages (langchain-community, langchain-openai and langchain-text-splitters provide the modules imported below)\n","!python3 -m pip install --upgrade langchain langchain-community langchain-openai langchain-text-splitters 'deeplake[enterprise]' openai tiktoken\n"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## 2. Add API keys"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":[]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[],"source":["import getpass\n","import os\n","\n","# Import the required modules\n","from langchain.chains import RetrievalQA\n","from langchain_community.vectorstores import DeepLake\n","from langchain_openai import OpenAI, OpenAIEmbeddings\n","from langchain_text_splitters import (\n","    CharacterTextSplitter,\n","    RecursiveCharacterTextSplitter,\n",")\n","\n","# Prompt for the OpenAI API key\n","os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n","\n","# Prompt for the Activeloop token\n","activeloop_token = getpass.getpass(\"Activeloop Token:\")\n","os.environ[\"ACTIVELOOP_TOKEN\"] = activeloop_token\n","\n","# Prompt for the Activeloop organization\n","os.environ[\"ACTIVELOOP_ORG\"] = getpass.getpass(\"Activeloop Org:\")\n","\n","# Read the Activeloop organization ID back from the environment\n","org_id = os.environ[\"ACTIVELOOP_ORG\"]\n","\n","# Create the OpenAIEmbeddings instance\n","embeddings = OpenAIEmbeddings()\n","\n","# Set the dataset path\n","dataset_path = \"hub://\" + org_id + \"/data\"\n","\n","# Summary: the code above imports the required modules and prompts for the OpenAI API key, the Activeloop token, and the Activeloop organization. It then creates an OpenAIEmbeddings instance and sets the dataset path."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## 2. 
Create sample data"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["You can generate a sample group chat conversation with ChatGPT using this prompt:\n","\n","```\n","Generate a group chat conversation with three friends talking about their day, referencing real places and fictional names. Make it funny and as detailed as possible.\n","```\n","\n","I have already generated such a chat in `messages.txt`. To keep things simple, we use it as our example.\n","\n","## 3. Ingest chat embeddings\n","\n","We load the messages from the text file, split them into chunks, and upload them to the ActiveLoop vector store."]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["[Document(page_content='Participants:\\n\\nJerry: Loves movies and is a bit of a klutz.\\nSamantha: Enthusiastic about food and always trying new restaurants.\\nBarry: A nature lover, but always manages to get lost.\\nJerry: Hey, guys! You won\\'t believe what happened to me at the Times Square AMC theater. I tripped over my own feet and spilled popcorn everywhere! 🍿💥\\n\\nSamantha: LOL, that\\'s so you, Jerry! Was the floor buttery enough for you to ice skate on after that? 😂\\n\\nBarry: Sounds like a regular Tuesday for you, Jerry. Meanwhile, I tried to find that new hiking trail in Central Park. You know, the one that\\'s supposed to be impossible to get lost on? Well, guess what...\\n\\nJerry: You found a hidden treasure?\\n\\nBarry: No, I got lost. AGAIN. 🧭🙄\\n\\nSamantha: Barry, you\\'d get lost in your own backyard! But speaking of treasures, I found this new sushi place in Little Tokyo. \"Samantha\\'s Sushi Symphony\" it\\'s called. Coincidence? I think not!\\n\\nJerry: Maybe they named it after your ability to eat your body weight in sushi. 🍣', metadata={}), Document(page_content='Barry: How do you even FIND all these places, Samantha?\\n\\nSamantha: Simple, I don\\'t rely on Barry\\'s navigation skills. 😉 But seriously, the wasabi there was hotter than Jerry\\'s love for Marvel movies!\\n\\nJerry: Hey, nothing wrong with a little superhero action. By the way, did you guys see the new \"Captain Crunch: Breakfast Avenger\" trailer?\\n\\nSamantha: Captain Crunch? 
Are you sure you didn\\'t get that from one of your Saturday morning cereal binges?\\n\\nBarry: Yeah, and did he defeat his arch-enemy, General Mills? 😆\\n\\nJerry: Ha-ha, very funny. Anyway, that sushi place sounds awesome, Samantha. Next time, let\\'s go together, and maybe Barry can guide us... if we want a city-wide tour first.\\n\\nBarry: As long as we\\'re not hiking, I\\'ll get us there... eventually. 😅\\n\\nSamantha: It\\'s a date! But Jerry, you\\'re banned from carrying any food items.\\n\\nJerry: Deal! Just promise me no wasabi challenges. I don\\'t want to end up like the time I tried Sriracha ice cream.', metadata={}), Document(page_content=\"Barry: Wait, what happened with Sriracha ice cream?\\n\\nJerry: Let's just say it was a hot situation. Literally. 🔥\\n\\nSamantha: 🤣 I still have the video!\\n\\nJerry: Samantha, if you value our friendship, that video will never see the light of day.\\n\\nSamantha: No promises, Jerry. No promises. 🤐😈\\n\\nBarry: I foresee a fun weekend ahead! 
🎉\", metadata={})]\n"]},{"name":"stdout","output_type":"stream","text":["Your Deep Lake dataset has been successfully created!\n"]},{"name":"stderr","output_type":"stream","text":["\\"]},{"name":"stdout","output_type":"stream","text":["Dataset(path='hub://adilkhan/data', tensors=['embedding', 'id', 'metadata', 'text'])\n","\n","  tensor      htype      shape      dtype  compression\n","  -------    -------    -------    -------  ------- \n"," embedding  embedding  (3, 1536)  float32   None   \n","    id        text      (3, 1)      str     None   \n"," metadata     json      (3, 1)      str     None   \n","   text       text      (3, 1)      str     None   \n"]},{"name":"stderr","output_type":"stream","text":[" \r"]}],"source":["# Open the file named \"messages.txt\"\n","with open(\"messages.txt\") as f:\n","    # Read the whole group chat into the chat_log variable\n","    chat_log = f.read()\n","\n","# First pass: CharacterTextSplitter with chunks of 1000 characters and no overlap\n","text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n","# Split chat_log into chunks and store them in pages\n","pages = text_splitter.split_text(chat_log)\n","\n","# Second pass: RecursiveCharacterTextSplitter with chunks of 1000 characters and an overlap of 100\n","text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n","# Create Document objects from the pages and store them in texts\n","texts = text_splitter.create_documents(pages)\n","\n","# Print the resulting documents\n","print(texts)\n","\n","# Set the dataset path to \"hub://<org_id>/data\"\n","dataset_path = \"hub://\" + org_id + \"/data\"\n","# Create the OpenAIEmbeddings instance\n","embeddings = OpenAIEmbeddings()\n","# Embed the documents and store them in a Deep Lake dataset, overwriting any existing dataset at dataset_path\n","db = DeepLake.from_documents(\n","    texts, embeddings, dataset_path=dataset_path, overwrite=True\n",")"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["`Optional`: You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, set the runtime parameter to {'tensor_db': True} when creating the vector store. With this configuration, queries run on the Managed Tensor Database rather than on the client. Note that this feature does not apply to datasets stored locally or in memory. If a vector store has already been created outside the Managed Tensor Database, it can be transferred to the Managed Tensor Database by following the prescribed steps."]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[],"source":["# 
Open the file named \"messages.txt\"\n","with open(\"messages.txt\") as f:\n","    # Read the whole group chat into the chat_log variable\n","    chat_log = f.read()\n","\n","# First pass: CharacterTextSplitter with chunks of 1000 characters and no overlap\n","text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n","# Split chat_log into chunks and store them in pages\n","pages = text_splitter.split_text(chat_log)\n","\n","# Second pass: RecursiveCharacterTextSplitter with chunks of 1000 characters and an overlap of 100 characters\n","text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n","# Create Document objects from the pages\n","texts = text_splitter.create_documents(pages)\n","\n","# Print the resulting documents\n","print(texts)\n","\n","# Set the dataset path to \"hub://<org_id>/data\" (org_id is defined in the API keys cell)\n","dataset_path = \"hub://\" + org_id + \"/data\"\n","# Create the OpenAIEmbeddings instance\n","embeddings = OpenAIEmbeddings()\n","# Build the vector store from the documents, overwriting any existing dataset at dataset_path;\n","# runtime={\"tensor_db\": True} runs queries on the Managed Tensor Database\n","db = DeepLake.from_documents(\n","    texts, embeddings, dataset_path=dataset_path, overwrite=True, runtime={\"tensor_db\": True}\n",")"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## 4. 
Ask questions\n","\n","Now we can ask a question and get an answer back through semantic search:"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Open the Deep Lake dataset in read-only mode with the same embeddings\n","db = DeepLake(dataset_path=dataset_path, read_only=True, embedding=embeddings)\n","\n","# Turn the vector store into a retriever\n","retriever = db.as_retriever()\n","retriever.search_kwargs[\"distance_metric\"] = \"cos\"  # use cosine similarity as the distance metric\n","retriever.search_kwargs[\"k\"] = 4  # return up to 4 chunks per query\n","\n","# Build a RetrievalQA chain: OpenAI as the LLM, the \"stuff\" chain type, the retriever above, and no source documents in the result\n","qa = RetrievalQA.from_chain_type(\n","    llm=OpenAI(), chain_type=\"stuff\", retriever=retriever, return_source_documents=False\n",")\n","\n","# Read the question from the user\n","query = input(\"Enter query:\")\n","\n","# Run the chain to get the answer\n","ans = qa.invoke({\"query\": query})\n","\n","print(ans)"]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.12"}},"nbformat":4,"nbformat_minor":2}