{ "cells": [ { "cell_type": "markdown", "id": "e80e9ce5-7996-46b2-8924-e34157f7180d", "metadata": {}, "source": [ "Link to the original blog post: https://jina.ai/news/retrieve-jira-tickets-with-jina-reranker-and-haystack-20" ] }, { "cell_type": "markdown", "id": "11112c18-76e4-4705-a2c5-41c2ab8c4eb2", "metadata": {}, "source": [ "### Upload files to Google Colab" ] }, { "cell_type": "markdown", "id": "213e76bf-3b66-470c-a3bc-3384e22f1fef", "metadata": {}, "source": [ "Before we can access local files on Google Colab, we need to upload them to the Colab environment. Here are the steps to do so:\n", " \n", "1. [Download the file `tickets.json`](https://raw.githubusercontent.com/jina-ai/workshops/main/notebooks/embeddings/haystack/tickets.json) to your local drive.\n", "2. Click on the “Files” tab on the left-side menu in Google Colab (Make sure it is the “Files tab” not the “File” Dropdown menu).\n", "3. Click on the “Upload to Session Storage” button and select the `tickets.json` file you previously downloaded.\n", "4. Wait for the upload to complete.\n", "\n", "Once the `tickets.json` file is uploaded, you can access it in the “Files” tab." ] }, { "cell_type": "markdown", "id": "85c2e587-88ce-4618-a4ab-fd3ef89fa5dd", "metadata": {}, "source": [ "# Jina Haystack extension" ] }, { "cell_type": "markdown", "id": "4f64c0bb-9e69-4839-8e10-afd74ca33300", "metadata": {}, "source": [ "Install prerequisites:" ] }, { "cell_type": "code", "execution_count": null, "id": "01b0c2e8-7c47-43f0-b98a-1b223a819840", "metadata": {}, "outputs": [], "source": [ "!pip install --q chromadb haystack-ai jina-haystack chroma-haystack " ] }, { "cell_type": "markdown", "id": "3bd59cf2-ffe7-4f49-8254-902ea8df616b", "metadata": {}, "source": [ "Add the Jina API key as environment variable:" ] }, { "cell_type": "code", "execution_count": null, "id": "94690e26-e870-4322-ad4a-2f47966f89ef", "metadata": {}, "outputs": [], "source": [ "import os\n", "import getpass\n", "\n", "os.environ[\"JINA_API_KEY\"] = getpass.getpass()" ] }, { "cell_type": "markdown", "id": "48546e0a-9c52-4951-b10e-17377bd93cb3", "metadata": {}, "source": [ "Create the vector store:" ] }, { "cell_type": "code", "execution_count": null, "id": "13ab5fda-d1f3-4671-8115-7c9efccf315e", "metadata": {}, "outputs": [], "source": [ "from haystack_integrations.document_stores.chroma import ChromaDocumentStore\n", "\n", "document_store = ChromaDocumentStore()" ] }, { "cell_type": "markdown", "id": "83f059b4-fab7-4224-bc50-d8030ee28fe6", "metadata": {}, "source": [ "Define the custom cleaner to remove irrelevant data:" ] }, { "cell_type": "code", "execution_count": null, "id": "2f9cc271-e750-4c26-bfbc-69f752d0b7ad", "metadata": {}, "outputs": [], "source": [ "import json\n", "from typing import List\n", "from haystack import Document, component\n", "\n", "relevant_keys = ['Summary', 'Issue key', 'Issue id', 'Parent id', 'Issue type', 'Status', 'Project lead', 'Priority', 'Assignee', 'Reporter', 'Creator', 'Created', 'Updated', 'Last Viewed', 'Due Date', 'Labels',\n", " 'Description', 'Comment', 'Comment__1', 'Comment__2', 'Comment__3', 'Comment__4', 'Comment__5', 'Comment__6', 'Comment__7', 'Comment__8', 'Comment__9', 'Comment__10', 'Comment__11', 'Comment__12',\n", " 'Comment__13', 'Comment__14', 'Comment__15']\n", "\n", "@component\n", "class RemoveKeys:\n", " @component.output_types(documents=List[Document])\n", " def run(self, file_name: str):\n", " with open(file_name, 'r') as file:\n", " tickets = json.load(file)\n", " cleaned_tickets = []\n", " for t in tickets:\n", " t = {k: v for k, v in t.items() if k in relevant_keys and v}\n", " cleaned_tickets.append(t)\n", " return {'documents': cleaned_tickets}" ] }, { "cell_type": "markdown", "id": "9b48728a-40b9-45ed-9a0c-5bc717388089", "metadata": {}, "source": [ "Define the custom JSON converter:" ] }, { "cell_type": "code", "execution_count": null, "id": "139c037f", "metadata": {}, "outputs": [], "source": [ "@component\n", "class JsonConverter:\n", " @component.output_types(documents=List[Document])\n", " def run(self, tickets: List[Document]):\n", " tickets_documents = []\n", " for t in tickets:\n", " if 'Parent id' in t:\n", " t = Document(content=json.dumps(t), meta={'Issue key': t['Issue key'], 'Issue id': t['Issue id'], 'Parent id': t['Parent id']})\n", " else:\n", " t = Document(content=json.dumps(t), meta={'Issue key': t['Issue key'], 'Issue id': t['Issue id'], 'Parent id': ''})\n", " tickets_documents.append(t)\n", " return {'documents': tickets_documents}" ] }, { "cell_type": "markdown", "id": "70174c50-bd42-4c77-884a-de1fb18f52e4", "metadata": {}, "source": [ "Create and run the indexing pipeline:" ] }, { "cell_type": "code", "execution_count": null, "id": "dc118402-e6a9-4685-8b19-e3318738a1dd", "metadata": {}, "outputs": [], "source": [ "from haystack import Pipeline\n", "\n", "from haystack.components.writers import DocumentWriter\n", "from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever\n", "from haystack.document_stores.types import DuplicatePolicy\n", "\n", "from haystack_integrations.components.embedders.jina import JinaDocumentEmbedder\n", "\n", "retriever = ChromaEmbeddingRetriever(document_store=document_store)\n", "retriever_reranker = ChromaEmbeddingRetriever(document_store=document_store)\n", "\n", "indexing_pipeline = Pipeline()\n", "indexing_pipeline.add_component('cleaner', RemoveKeys())\n", "indexing_pipeline.add_component('converter', JsonConverter())\n", "indexing_pipeline.add_component('embedder', JinaDocumentEmbedder(model='jina-embeddings-v2-base-en'))\n", "indexing_pipeline.add_component('writer', DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))\n", "\n", "indexing_pipeline.connect('cleaner', 'converter')\n", "indexing_pipeline.connect('converter', 'embedder')\n", "indexing_pipeline.connect('embedder', 'writer')\n", "\n", "indexing_pipeline.run({'cleaner': {'file_name': 'tickets.json'}})" ] }, { "cell_type": "markdown", "id": "98160f99-8a80-45cf-85e2-24b54ad249f8", "metadata": {}, "source": [ "Define the custom cleaner to remove related tickets:" ] }, { "cell_type": "code", "execution_count": null, "id": "5f3bb1be-dc12-4332-813b-e4f377d14792", "metadata": {}, "outputs": [], "source": [ "from typing import Optional\n", "\n", "@component\n", "class RemoveRelated:\n", " @component.output_types(documents=List[Document])\n", " def run(self, tickets: List[Document], query_id: Optional[str]):\n", " retrieved_tickets = []\n", " for t in tickets:\n", " if not t.meta['Issue id'] == query_id and not t.meta['Parent id'] == query_id:\n", " retrieved_tickets.append(t)\n", " return {'documents': retrieved_tickets}" ] }, { "cell_type": "markdown", "id": "77104581-929a-4ccb-8963-6f7d7744eaf8", "metadata": {}, "source": [ "Create the query pipeline WITHOUT Jina Reranker to compare the results prior to the reranking:" ] }, { "cell_type": "code", "execution_count": null, "id": "c1e76c79-a682-400b-9bab-fe1a984e8e45", "metadata": {}, "outputs": [], "source": [ "from haystack_integrations.components.embedders.jina import JinaTextEmbedder\n", "from haystack_integrations.components.rankers.jina import JinaRanker\n", "\n", "query_pipeline = Pipeline()\n", "query_pipeline.add_component('query_embedder', JinaTextEmbedder(model='jina-embeddings-v2-base-en'))\n", "query_pipeline.add_component('query_retriever', retriever)\n", "query_pipeline.add_component('query_cleaner', RemoveRelated())\n", "\n", "query_pipeline.connect('query_embedder.embedding', 'query_retriever.query_embedding')\n", "query_pipeline.connect('query_retriever', 'query_cleaner')" ] }, { "cell_type": "markdown", "id": "9c45ff09-a28e-418c-8b0a-a7331cac2160", "metadata": {}, "source": [ "Create the query pipeline WITH Jina Reranker to compare the results after the reranking:" ] }, { "cell_type": "code", "execution_count": null, "id": "30be7ef1-684a-4f01-a9af-ce4946139d1c", "metadata": {}, "outputs": [], "source": [ "query_pipeline_reranker = Pipeline()\n", "query_pipeline_reranker.add_component('query_embedder_reranker', JinaTextEmbedder(model='jina-embeddings-v2-base-en'))\n", "query_pipeline_reranker.add_component('query_retriever_reranker', retriever_reranker)\n", "query_pipeline_reranker.add_component('query_cleaner_reranker', RemoveRelated())\n", "query_pipeline_reranker.add_component('query_ranker_reranker', JinaRanker())\n", "\n", "query_pipeline_reranker.connect('query_embedder_reranker.embedding', 'query_retriever_reranker.query_embedding')\n", "query_pipeline_reranker.connect('query_retriever_reranker', 'query_cleaner_reranker')\n", "query_pipeline_reranker.connect('query_cleaner_reranker', 'query_ranker_reranker')" ] }, { "cell_type": "markdown", "id": "8a4cb9de-0ac7-47f2-8188-ff0ceae32e3e", "metadata": {}, "source": [ "Define the query as a ticket in the dataset that needs to be compared:" ] }, { "cell_type": "code", "execution_count": null, "id": "7345043d-a202-4f8e-80f1-d95c181cea06", "metadata": {}, "outputs": [], "source": [ "query_ticket_key = 'ZOOKEEPER-3282'\n", "\n", "with open('tickets.json', 'r') as file:\n", " tickets = json.load(file)\n", "\n", "for ticket in tickets:\n", " if ticket['Issue key'] == query_ticket_key:\n", " query = str(ticket)\n", " query_ticket_id = ticket['Issue id']" ] }, { "cell_type": "markdown", "id": "31b54337-b3fa-46e4-8c63-9fc1151a40e8", "metadata": {}, "source": [ "Run the query pipeline WITHOUT Jina Reranker:" ] }, { "cell_type": "code", "execution_count": null, "id": "898ba91f-b95c-4bfa-bb48-90b61f88df1f", "metadata": {}, "outputs": [], "source": [ "result = query_pipeline.run(data={'query_embedder':{'text': query},\n", " 'query_retriever': {'top_k': 20},\n", " 'query_cleaner': {'query_id': query_ticket_id}\n", " }\n", " )\n", "\n", "for idx, res in enumerate(result['query_cleaner']['documents']):\n", " print('Doc {}:'.format(idx + 1), res)" ] }, { "cell_type": "markdown", "id": "1d3cf0cf-d645-49fc-8b53-1f968cd40441", "metadata": {}, "source": [ "Run the query pipeline WITH Jina Reranker:" ] }, { "cell_type": "code", "execution_count": null, "id": "3dec9401-9003-4ff2-9ea2-344ec896dc85", "metadata": {}, "outputs": [], "source": [ "result = query_pipeline_reranker.run(data={'query_embedder_reranker':{'text': query},\n", " 'query_retriever_reranker': {'top_k': 20},\n", " 'query_cleaner_reranker': {'query_id': query_ticket_id},\n", " 'query_ranker_reranker': {'query': query, 'top_k': 10}\n", " }\n", " )\n", "\n", "for idx, res in enumerate(result['query_ranker_reranker']['documents']):\n", " print('Doc {}:'.format(idx + 1), res)" ] }, { "cell_type": "markdown", "id": "0413a436-b0b7-478d-9dc2-6e8d222a3f8a", "metadata": {}, "source": [ "The results above clearly show the necessity for both Jina Embeddings to retrieve relevant documents through vector search, and Jina Reranker to finally obtain the most relevant context. If we take, for example, the two issues that relate to adding documentation, i.e. \"ZOOKEEPER-3585\" and \"ZOOKEEPER-3587\", we see that after the retrieval step, they are both correctly included in positions 11 and 9 respectively (note that the order in the output is reversed since the scores are outputted from least to most relevant). After reranking the documents, they are now within the top 5 most relevant documents at positions 5 and 1 respectively, showing a significant improvement." ] }, { "cell_type": "code", "execution_count": null, "id": "1463a65c-f4b9-4b38-95b1-aedbbc8a4e09", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }