{"cells": [{"attachments": {}, "cell_type": "markdown", "id": "0af3ec93", "metadata": {}, "source": ["<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/ChromaIndexDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"在 Colab 中打开\"/></a>\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "307804a3-c02b-4a57-ac0d-172c30ddc851", "metadata": {}, "source": ["# Chroma\n", "\n", "[Chroma](https://docs.trychroma.com/getting-started) 是一个以人工智能为基础的开源向量数据库，专注于开发者的生产力和幸福感。Chroma 使用 Apache 2.0 许可证。\n", "\n", "<a href=\"https://discord.gg/MMeYNTmh3x\" target=\"_blank\">\n", "      <img src=\"https://img.shields.io/discord/1073293645303795742\" alt=\"Discord\">\n", "  </a>&nbsp;&nbsp;\n", "  <a href=\"https://github.com/chroma-core/chroma/blob/master/LICENSE\" target=\"_blank\">\n", "      <img src=\"https://img.shields.io/static/v1?label=license&message=Apache 2.0&color=white\" alt=\"License\">\n", "  </a>&nbsp;&nbsp;\n", "  <img src=\"https://github.com/chroma-core/chroma/actions/workflows/chroma-integration-test.yml/badge.svg?branch=main\" alt=\"Integration Tests\">\n", "\n", "- [网站](https://www.trychroma.com/)\n", "- [文档](https://docs.trychroma.com/)\n", "- [Twitter](https://twitter.com/trychroma)\n", "- [Discord](https://discord.gg/MMeYNTmh3x)\n", "\n", "Chroma 是完全类型化、经过充分测试和充分文档化的。\n", "\n", "使用以下命令安装 Chroma：\n", "\n", "```sh\n", "pip install chromadb\n", "```\n", "\n", "Chroma 可以以各种模式运行。请参见下面的示例，每个示例都与 LangChain 集成。\n", "- `in-memory` - 在 Python 脚本或 Jupyter 笔记本中\n", "- `in-memory with persistance` - 在脚本或笔记本中，并保存/加载到磁盘\n", "- `在 Docker 容器中` - 作为在本地计算机或云中运行的服务器\n", "\n", "与任何其他数据库一样，您可以：\n", "- `.add` \n", "- `.get` \n", "- `.update`\n", "- `.upsert`\n", "- `.delete`\n", "- `.peek`\n", "- 和 `.query` 运行相似性搜索。\n", "\n", "在 [文档](https://docs.trychroma.com/reference/Collection) 中查看完整文档。\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "b5331b6b", "metadata": {}, "source": ["## 基本示例\n", "\n", "在这个基本示例中，我们将保罗·格雷厄姆的文章分成片段，使用开源嵌入模型进行嵌入，将其加载到 Chroma 中，然后进行查询。\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "54361467", "metadata": {}, "source": ["如果您在colab上打开这个笔记本，您可能需要安装LlamaIndex 🦙。\n"]}, {"cell_type": "code", "execution_count": null, "id": "e46f9e63", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-vector-stores-chroma\n", "%pip install llama-index-embeddings-huggingface"]}, {"cell_type": "code", "execution_count": null, "id": "0ffe7d98", "metadata": {}, "outputs": [], "source": ["!pip install llama-index"]}, {"attachments": {}, "cell_type": "markdown", "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396", "metadata": {}, "source": ["#### 创建色度指数\n", "\n"]}, {"cell_type": "code", "execution_count": null, "id": "b3df0b97", "metadata": {}, "outputs": [], "source": ["# !pip install llama-index chromadb --quiet\n", "# !pip install chromadb\n", "# !pip install sentence-transformers\n", "# !pip install pydantic==1.10.11"]}, {"cell_type": "code", "execution_count": null, "id": "d48af8e1", "metadata": {}, "outputs": [], "source": ["# import\n", "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n", "from llama_index.vector_stores.chroma import ChromaVectorStore\n", "from llama_index.core import StorageContext\n", "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", "from IPython.display import Markdown, display\n", "import chromadb"]}, {"cell_type": "code", "execution_count": null, "id": "374a148b", "metadata": {}, "outputs": [], "source": ["# 设置OpenAI\n", "import os\n", "import getpass\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n", "import openai\n", "\n", "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"]}, {"attachments": {}, "cell_type": "markdown", "id": "7b9a55de", "metadata": {}, "source": ["下载数据\n"]}, {"cell_type": "code", "execution_count": null, "id": "01f19bc6", "metadata": {}, "outputs": [], "source": ["!mkdir -p 'data/paul_graham/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"]}, {"cell_type": "code", "execution_count": null, "id": "667f3cb3-ce18-48d5-b9aa-bfc1a1f0f0f6", "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["/Users/loganmarkewich/llama_index/llama-index/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", "  from .autonotebook import tqdm as notebook_tqdm\n", "/Users/loganmarkewich/llama_index/llama-index/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", "  warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n"]}, {"name": "stdout", "output_type": "stream", "text": ["'NoneType' object has no attribute 'cadam32bit_grad_fp32'\n"]}, {"data": {"text/markdown": ["<b>The author worked on writing and programming growing up. They wrote short stories and tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming more extensively.</b>"], "text/plain": ["<IPython.core.display.Markdown object>"]}, "metadata": {}, "output_type": "display_data"}], "source": ["# 创建客户端和新的集合\n", "chroma_client = chromadb.EphemeralClient()\n", "chroma_collection = chroma_client.create_collection(\"quickstart\")\n", "\n", "# 定义嵌入函数\n", "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")\n", "\n", "# 加载文档\n", "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()\n", "\n", "# 设置ChromaVectorStore并加载数据\n", "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n", "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n", "index = VectorStoreIndex.from_documents(\n", "    documents, storage_context=storage_context, embed_model=embed_model\n", ")\n", "\n", "# 查询数据\n", "query_engine = index.as_query_engine()\n", "response = query_engine.query(\"What did the author do growing up?\")\n", "display(Markdown(f\"<b>{response}</b>\"))"]}, {"attachments": {}, "cell_type": "markdown", "id": "349de571", "metadata": {}, "source": ["## 基本示例（包括保存到磁盘）\n", "\n", "在扩展前面的示例时，如果你想要保存到磁盘，只需初始化Chroma客户端并传递要保存数据的目录即可。\n", "\n", "`注意`：Chroma会尽最大努力自动将数据保存到磁盘，但是多个内存中的客户端可能会互相干扰彼此的工作。作为最佳实践，任何时候只能有一个客户端在运行指定的路径下。\n"]}, {"cell_type": "code", "execution_count": null, "id": "9c3a56a5", "metadata": {}, "outputs": [{"data": {"text/markdown": ["<b>The author worked on writing and programming growing up. They wrote short stories and tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming games and a word processor.</b>"], "text/plain": ["<IPython.core.display.Markdown object>"]}, "metadata": {}, "output_type": "display_data"}], "source": ["# 保存到磁盘\n", "\n", "db = chromadb.PersistentClient(path=\"./chroma_db\")\n", "chroma_collection = db.get_or_create_collection(\"quickstart\")\n", "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n", "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n", "\n", "index = VectorStoreIndex.from_documents(\n", "    documents, storage_context=storage_context, embed_model=embed_model\n", ")\n", "\n", "# 从磁盘加载\n", "\n", "db2 = chromadb.PersistentClient(path=\"./chroma_db\")\n", "chroma_collection = db2.get_or_create_collection(\"quickstart\")\n", "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n", "index = VectorStoreIndex.from_vector_store(\n", "    vector_store,\n", "    embed_model=embed_model,\n", ")\n", "\n", "# 从持久化索引查询数据\n", "query_engine = index.as_query_engine()\n", "response = query_engine.query(\"作者在成长过程中做了什么？\")\n", "display(Markdown(f\"<b>{response}</b>\"))"]}, {"attachments": {}, "cell_type": "markdown", "id": "d596e475", "metadata": {}, "source": ["## 基本示例（使用Docker容器）\n", "\n", "您还可以在单独的Docker容器中运行Chroma服务器，创建一个客户端来连接它，然后将其传递给LlamaIndex。\n", "\n", "以下是克隆、构建和运行Docker镜像的方法：\n", "```\n", "git clone git@github.com:chroma-core/chroma.git\n", "docker-compose up -d --build\n", "```\n"]}, {"cell_type": "code", "execution_count": null, "id": "d6c9bd64", "metadata": {}, "outputs": [], "source": ["# 创建chroma客户端并添加我们的数据\n", "import chromadb\n", "\n", "remote_db = chromadb.HttpClient()\n", "chroma_collection = remote_db.get_or_create_collection(\"quickstart\")\n", "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n", "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n", "\n", "index = VectorStoreIndex.from_documents(\n", "    documents, storage_context=storage_context, embed_model=embed_model\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "88e10c26", "metadata": {}, "outputs": [{"data": {"text/markdown": ["<b>\n", "Growing up, the author wrote short stories, programmed on an IBM 1401, and wrote programs on a TRS-80 microcomputer. He also took painting classes at Harvard and worked as a de facto studio assistant for a painter. He also tried to start a company to put art galleries online, and wrote software to build online stores.</b>"], "text/plain": ["<IPython.core.display.Markdown object>"]}, "metadata": {}, "output_type": "display_data"}], "source": ["# 从Chroma Docker索引中查询数据\n", "query_engine = index.as_query_engine()\n", "response = query_engine.query(\"作者在成长过程中做了什么？\")\n", "display(Markdown(f\"<b>{response}</b>\"))"]}, {"attachments": {}, "cell_type": "markdown", "id": "0a0e79f7", "metadata": {}, "source": ["## 更新和删除\n", "\n", "在构建真实应用程序时，您希望不仅仅是添加数据，还要能够更新和删除数据。\n", "\n", "Chroma要求用户提供`ids`来简化这里的簿记工作。`ids`可以是文件名，也可以是类似于`filename_paragraphNumber`的组合哈希值。\n", "\n", "下面是一个基本示例，展示了如何进行各种操作：\n"]}, {"cell_type": "code", "execution_count": null, "id": "d9411826", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'_node_content': '{\"id_\": \"be08c8bc-f43e-4a71-ba64-e525921a8319\", \"embedding\": null, \"metadata\": {}, \"excluded_embed_metadata_keys\": [], \"excluded_llm_metadata_keys\": [], \"relationships\": {\"1\": {\"node_id\": \"2cbecdbb-0840-48b2-8151-00119da0995b\", \"node_type\": null, \"metadata\": {}, \"hash\": \"4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35\"}, \"3\": {\"node_id\": \"6a75604a-fa76-4193-8f52-c72a7b18b154\", \"node_type\": null, \"metadata\": {}, \"hash\": \"d6c408ee1fbca650fb669214e6f32ffe363b658201d31c204e85a72edb71772f\"}}, \"hash\": \"b4d0b960aa09e693f9dc0d50ef46a3d0bf5a8fb3ac9f3e4bcf438e326d17e0d8\", \"text\": \"\", \"start_char_idx\": 0, \"end_char_idx\": 4050, \"text_template\": \"{metadata_str}\\\\n\\\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\\\n\"}', 'author': 'Paul Graham', 'doc_id': '2cbecdbb-0840-48b2-8151-00119da0995b', 'document_id': '2cbecdbb-0840-48b2-8151-00119da0995b', 'ref_doc_id': '2cbecdbb-0840-48b2-8151-00119da0995b'}\n", "count before 20\n", "count after 19\n"]}], "source": ["doc_to_update = chroma_collection.get(limit=1)\n", "doc_to_update[\"metadatas\"][0] = {\n", "    **doc_to_update[\"metadatas\"][0],\n", "    **{\"author\": \"Paul Graham\"},\n", "}\n", "chroma_collection.update(\n", "    ids=[doc_to_update[\"ids\"][0]], metadatas=[doc_to_update[\"metadatas\"][0]]\n", ")\n", "updated_doc = chroma_collection.get(limit=1)\n", "print(updated_doc[\"metadatas\"][0])\n", "\n", "# 删除最后一个文档\n", "print(\"删除前计数\", chroma_collection.count())\n", "chroma_collection.delete(ids=[doc_to_update[\"ids\"][0]])\n", "print(\"删除后计数\", chroma_collection.count())"]}], "metadata": {"kernelspec": {"display_name": "llama-index", "language": "python", "name": "llama-index"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}, "vscode": {"interpreter": {"hash": "0ac390d292208ca2380c85f5bce7ded36a7a25670a97c40b8009630eb36cb06e"}}}, "nbformat": 4, "nbformat_minor": 5}