{ "cells": [ { "cell_type": "markdown", "id": "59723cea", "metadata": {}, "source": [ "# StarRocks\n", "\n", ">[StarRocks](https://www.starrocks.io/) is a high-performance analytical database.\n", "`StarRocks` is a next-gen sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics and ad-hoc queries.\n", "\n", ">`StarRocks` is usually categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it can also be used as a fast vectordb.\n", "\n", "Here we'll show how to use the StarRocks Vector Store." ] }, { "cell_type": "markdown", "id": "1685854f", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40", "metadata": {}, "outputs": [], "source": [ "#!pip install pymysql" ] }, { "cell_type": "markdown", "id": "2c891bba", "metadata": {}, "source": [ "Set `update_vectordb = False` at the beginning. 
If no docs have been updated, we don't need to rebuild their embeddings." ] }, { "cell_type": "code", "execution_count": 1, "id": "3c85fb93", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/dirlt/utils/py3env/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.7) or chardet (5.1.0)/charset_normalizer (2.0.9) doesn't match a supported version!\n", " warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] } ], "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.vectorstores import StarRocks\n", "from langchain.vectorstores.starrocks import StarRocksSettings\n", "from langchain.vectorstores import Chroma\n", "from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter\n", "from langchain.llms import OpenAI\n", "from langchain.chains import VectorDBQA\n", "from langchain.document_loaders import DirectoryLoader\n", "from langchain.chains import RetrievalQA\n", "from langchain.document_loaders import TextLoader, UnstructuredMarkdownLoader\n", "\n", "update_vectordb = False" ] }, { "cell_type": "markdown", "id": "ee821c00", "metadata": {}, "source": [ "## Load docs and split them into tokens" ] }, { "cell_type": "markdown", "id": "34ba0cfd", "metadata": {}, "source": [ "Load all Markdown files under the `docs` directory.\n", "\n", "For the StarRocks documentation, you can clone the repo from https://github.com/StarRocks/starrocks; it contains a `docs` directory." ] }, { "cell_type": "code", "execution_count": 2, "id": "85912696", "metadata": {}, "outputs": [], "source": [ "loader = DirectoryLoader(\n", "    \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n", ")\n", "documents = loader.load()" ] }, { "cell_type": "markdown", "id": "b415fe2a", "metadata": {}, "source": [ "Split docs into tokens, and set `update_vectordb = True` because there are new docs/tokens."
] }, { "cell_type": "code", "execution_count": 3, "id": "07e8acff", "metadata": {}, "outputs": [], "source": [ "# load text splitter and split docs into snippets of text\n", "text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n", "split_docs = text_splitter.split_documents(documents)\n", "\n", "# tell vectordb to update text embeddings\n", "update_vectordb = True" ] }, { "cell_type": "code", "execution_count": 4, "id": "1f365370", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Document(page_content='Compile StarRocks with Docker\\n\\nThis topic describes how to compile StarRocks using Docker.\\n\\nOverview\\n\\nStarRocks provides development environment images for both Ubuntu 22.04 and CentOS 7.9. With the image, you can launch a Docker container and compile StarRocks in the container.\\n\\nStarRocks version and DEV ENV image\\n\\nDifferent branches of StarRocks correspond to different development environment images provided on StarRocks Docker Hub.\\n\\nFor Ubuntu 22.04:\\n\\n| Branch name | Image name |\\n | --------------- | ----------------------------------- |\\n | main | starrocks/dev-env-ubuntu:latest |\\n | branch-3.0 | starrocks/dev-env-ubuntu:3.0-latest |\\n | branch-2.5 | starrocks/dev-env-ubuntu:2.5-latest |\\n\\nFor CentOS 7.9:\\n\\n| Branch name | Image name |\\n | --------------- | ------------------------------------ |\\n | main | starrocks/dev-env-centos7:latest |\\n | branch-3.0 | starrocks/dev-env-centos7:3.0-latest |\\n | branch-2.5 | starrocks/dev-env-centos7:2.5-latest |\\n\\nPrerequisites\\n\\nBefore compiling StarRocks, make sure the following requirements are satisfied:\\n\\nHardware\\n\\n', metadata={'source': 'docs/developers/build-starrocks/Build_in_docker.md'})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split_docs[-20]" ] }, { "cell_type": "code", "execution_count": 5, "id": "50012b29", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "# docs = 657, # splits = 2802\n" ] } ], "source": [ "print(\"# docs = %d, # splits = %d\" % (len(documents), len(split_docs)))" ] }, { "cell_type": "markdown", "id": "5371f152", "metadata": {}, "source": [ "## Create vectordb instance" ] }, { "cell_type": "markdown", "id": "15702d9c", "metadata": {}, "source": [ "### Use StarRocks as vectordb" ] }, { "cell_type": "code", "execution_count": 6, "id": "ced7dbe1", "metadata": {}, "outputs": [], "source": [ "def gen_starrocks(update_vectordb, embeddings, settings):\n", "    if update_vectordb:\n", "        # embed the split docs (module-level `split_docs`) and insert them\n", "        docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)\n", "    else:\n", "        # reuse the embeddings already stored in the table\n", "        docsearch = StarRocks(embeddings, settings)\n", "    return docsearch" ] }, { "cell_type": "markdown", "id": "15d86fda", "metadata": {}, "source": [ "## Convert tokens into embeddings and put them into vectordb" ] }, { "cell_type": "markdown", "id": "ff1322ea", "metadata": {}, "source": [ "Here we use StarRocks as the vectordb. You can configure the StarRocks instance via `StarRocksSettings`.\n", "\n", "Configuring a StarRocks instance is much like configuring a MySQL instance. You need to specify:\n", "1. host/port\n", "2. username(default: 'root')\n", "3. password(default: '')\n", "4. database(default: 'default')\n", "5. 
table(default: 'langchain')" ] }, { "cell_type": "code", "execution_count": 8, "id": "26410d9b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Inserting data...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2802/2802 [02:26<00:00, 19.11it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[92m\u001b[1mzya.langchain @ 127.0.0.1:41003\u001b[0m\n", "\n", "\u001b[1musername: root\u001b[0m\n", "\n", "Table Schema:\n", "----------------------------------------------------------------------------\n", "|\u001b[94mname \u001b[0m|\u001b[96mtype \u001b[0m|\u001b[96mkey \u001b[0m|\n", "----------------------------------------------------------------------------\n", "|\u001b[94mid \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mtrue \u001b[0m|\n", "|\u001b[94mdocument \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mfalse \u001b[0m|\n", "|\u001b[94membedding \u001b[0m|\u001b[96marray \u001b[0m|\u001b[96mfalse \u001b[0m|\n", "|\u001b[94mmetadata \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mfalse \u001b[0m|\n", "----------------------------------------------------------------------------\n", "\n" ] } ], "source": [ "embeddings = OpenAIEmbeddings()\n", "\n", "# configure starrocks settings(host/port/user/pw/db)\n", "settings = StarRocksSettings()\n", "settings.port = 41003\n", "settings.host = \"127.0.0.1\"\n", "settings.username = \"root\"\n", "settings.password = \"\"\n", "settings.database = \"zya\"\n", "docsearch = gen_starrocks(update_vectordb, embeddings, settings)\n", "\n", "print(docsearch)\n", "\n", "update_vectordb = False" ] }, { "cell_type": "markdown", "id": "bde66626", "metadata": {}, "source": [ "## Build QA and ask question to it" ] }, { "cell_type": "code", "execution_count": 10, "id": "84921814", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " No, profile is not 
enabled by default. To enable profile, set the variable `enable_profile` to `true` using the command `set enable_profile = true;`\n" ] } ], "source": [ "llm = OpenAI()\n", "qa = RetrievalQA.from_chain_type(\n", "    llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n", ")\n", "query = \"is profile enabled by default? if not, how to enable profile?\"\n", "resp = qa.run(query)\n", "print(resp)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }