Link to the original blog post: https://jina.ai/news/retrieve-jira-tickets-with-jina-reranker-and-haystack-20

### Upload files to Google Colab

Before we can access local files on Google Colab, we need to upload them to the Colab environment. Here are the steps to do so:
 
1. [Download the file `tickets.json`](https://raw.githubusercontent.com/jina-ai/workshops/main/notebooks/embeddings/haystack/tickets.json) to your local drive.
2. Click on the “Files” tab in the left-side menu in Google Colab (make sure it is the “Files” tab, not the “File” drop-down menu).
3. Click on the “Upload to Session Storage” button and select the `tickets.json` file you previously downloaded.
4. Wait for the upload to complete.

Once the `tickets.json` file is uploaded, you can access it in the “Files” tab.
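
Before running the rest of the notebook, it can help to confirm that the upload worked. The following is a minimal sketch, assuming the file is a JSON array of ticket objects (which is how the later cells read it). `load_tickets` is a hypothetical helper and is not part of the original notebook.

```python
import json
from pathlib import Path


def load_tickets(path: str = "tickets.json") -> list:
    """Load the uploaded file and check that it is a JSON array of objects.

    Hypothetical helper for illustration only.
    """
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"{path} not found; upload it via the Files tab first")
    with file_path.open("r", encoding="utf-8") as f:
        tickets = json.load(f)
    # The later cells iterate over the tickets and call .items() on each one,
    # so anything other than a list of dicts would fail there.
    if not isinstance(tickets, list) or not all(isinstance(t, dict) for t in tickets):
        raise ValueError(f"{path} should contain a JSON array of ticket objects")
    return tickets
```

Calling `len(load_tickets())` in a cell then gives a quick sanity check on the number of tickets available for indexing.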

# Jina Haystack extension

Install prerequisites:

In [None]:
!pip install -q chromadb haystack-ai jina-haystack chroma-haystack

Add the Jina API key as an environment variable:

In [None]:
import os
import getpass

os.environ["JINA_API_KEY"] = getpass.getpass()

Create the vector store:

In [None]:
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

document_store = ChromaDocumentStore()

Define the custom cleaner to remove irrelevant data:

In [None]:
import json
from typing import List
from haystack import Document, component

relevant_keys = ['Summary', 'Issue key', 'Issue id', 'Parent id', 'Issue type', 'Status', 'Project lead', 'Priority', 'Assignee', 'Reporter', 'Creator', 'Created', 'Updated', 'Last Viewed', 'Due Date', 'Labels',
 'Description', 'Comment', 'Comment__1', 'Comment__2', 'Comment__3', 'Comment__4', 'Comment__5', 'Comment__6', 'Comment__7', 'Comment__8', 'Comment__9', 'Comment__10', 'Comment__11', 'Comment__12',
 'Comment__13', 'Comment__14', 'Comment__15']

@component
class RemoveKeys:
    @component.output_types(documents=List[Document])
    def run(self, file_name: str):
        with open(file_name, 'r') as file:
            tickets = json.load(file)
        cleaned_tickets = []
        for t in tickets:
            # Keep only the relevant fields, and drop fields with empty values
            t = {k: v for k, v in t.items() if k in relevant_keys and v}
            cleaned_tickets.append(t)
        # At this stage the tickets are still plain dicts; JsonConverter turns them into Documents
        return {'documents': cleaned_tickets}
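
To make the filtering step concrete, here is a small sketch of the same dictionary comprehension applied to a single made-up ticket. The values and the `Watchers` field are invented for illustration, and `demo_keys` is a shortened stand-in for `relevant_keys`.

```python
# Toy data: only the key names mirror the dataset, the values are made up.
demo_keys = ['Summary', 'Issue key', 'Issue id', 'Parent id']

ticket = {
    'Summary': 'Example summary',
    'Issue key': 'DEMO-1',
    'Issue id': 1001,
    'Parent id': None,  # empty value, so it is dropped
    'Watchers': 3,      # hypothetical field not in demo_keys, so it is dropped
}

# Same expression as in RemoveKeys.run, with demo_keys in place of relevant_keys
cleaned = {k: v for k, v in ticket.items() if k in demo_keys and v}
print(cleaned)
# {'Summary': 'Example summary', 'Issue key': 'DEMO-1', 'Issue id': 1001}
```

Note that the `and v` test drops every falsy value, so a field holding `0`, `False`, or an empty string is removed along with `None`.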

Define the custom JSON converter:

In [None]:
@component
class JsonConverter:
    @component.output_types(documents=List[Document])
    def run(self, tickets: List[Document]):
        tickets_documents = []
        for t in tickets:
            # The whole ticket becomes the document content; the ids go into
            # the metadata so that related tickets can be filtered out later
            if 'Parent id' in t:
                t = Document(content=json.dumps(t), meta={'Issue key': t['Issue key'], 'Issue id': t['Issue id'], 'Parent id': t['Parent id']})
            else:
                t = Document(content=json.dumps(t), meta={'Issue key': t['Issue key'], 'Issue id': t['Issue id'], 'Parent id': ''})
            tickets_documents.append(t)
        return {'documents': tickets_documents}
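
The two branches of the converter differ only in the value of `Parent id`, so the same content and metadata can be built in one step with `dict.get`. The sketch below shows this with plain dictionaries rather than Haystack `Document` objects; `ticket_to_fields` is a hypothetical helper, and the ticket values are made up.

```python
import json


def ticket_to_fields(ticket: dict) -> dict:
    """Build the content/meta pair that the converter passes to Document(...).

    Hypothetical helper for illustration; it returns a plain dict.
    """
    return {
        'content': json.dumps(ticket),
        'meta': {
            'Issue key': ticket['Issue key'],
            'Issue id': ticket['Issue id'],
            # Fall back to an empty string when the ticket has no parent
            'Parent id': ticket.get('Parent id', ''),
        },
    }


fields = ticket_to_fields({'Issue key': 'DEMO-1', 'Issue id': 1001, 'Summary': 'Example'})
print(fields['meta'])
# {'Issue key': 'DEMO-1', 'Issue id': 1001, 'Parent id': ''}
```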

Create and run the indexing pipeline:

In [None]:
from haystack import Pipeline

from haystack.components.writers import DocumentWriter
from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever
from haystack.document_stores.types import DuplicatePolicy

from haystack_integrations.components.embedders.jina import JinaDocumentEmbedder

retriever = ChromaEmbeddingRetriever(document_store=document_store)
retriever_reranker = ChromaEmbeddingRetriever(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component('cleaner', RemoveKeys())
indexing_pipeline.add_component('converter', JsonConverter())
indexing_pipeline.add_component('embedder', JinaDocumentEmbedder(model='jina-embeddings-v2-base-en'))
indexing_pipeline.add_component('writer', DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))

indexing_pipeline.connect('cleaner', 'converter')
indexing_pipeline.connect('converter', 'embedder')
indexing_pipeline.connect('embedder', 'writer')

indexing_pipeline.run({'cleaner': {'file_name': 'tickets.json'}})

Define the custom cleaner to remove related tickets:

In [None]:
from typing import Optional

@component
class RemoveRelated:
    @component.output_types(documents=List[Document])
    def run(self, tickets: List[Document], query_id: Optional[str]):
        retrieved_tickets = []
        for t in tickets:
            if t.meta['Issue id'] != query_id and t.meta['Parent id'] != query_id:
                retrieved_tickets.append(t)
        return {'documents': retrieved_tickets}
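
The effect of this filter is easiest to see on a few toy documents. The sketch below uses `SimpleNamespace` objects as stand-ins for Haystack `Document` instances (an assumption made only so the example runs without Haystack), and all ids and keys are made up.

```python
from types import SimpleNamespace


def drop_related(docs, query_id):
    """Same condition as RemoveRelated.run: drop the query ticket and its children."""
    return [
        d for d in docs
        if d.meta['Issue id'] != query_id and d.meta['Parent id'] != query_id
    ]


docs = [
    SimpleNamespace(meta={'Issue key': 'DEMO-1', 'Issue id': 1, 'Parent id': ''}),  # the query ticket itself
    SimpleNamespace(meta={'Issue key': 'DEMO-2', 'Issue id': 2, 'Parent id': 1}),   # child of the query ticket
    SimpleNamespace(meta={'Issue key': 'DEMO-3', 'Issue id': 3, 'Parent id': ''}),  # unrelated ticket
]

kept = drop_related(docs, query_id=1)
print([d.meta['Issue key'] for d in kept])
# ['DEMO-3']
```

The comparison is type-sensitive: if the ids in the metadata are numbers and `query_id` is a string (or the reverse), nothing matches and the query ticket stays in the results. The filter also only covers the query ticket and its direct children, not its parent or siblings.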

Create the query pipeline WITHOUT Jina Reranker to compare the results prior to the reranking:

In [None]:
from haystack_integrations.components.embedders.jina import JinaTextEmbedder
from haystack_integrations.components.rankers.jina import JinaRanker

query_pipeline = Pipeline()
query_pipeline.add_component('query_embedder', JinaTextEmbedder(model='jina-embeddings-v2-base-en'))
query_pipeline.add_component('query_retriever', retriever)
query_pipeline.add_component('query_cleaner', RemoveRelated())

query_pipeline.connect('query_embedder.embedding', 'query_retriever.query_embedding')
query_pipeline.connect('query_retriever', 'query_cleaner')

Create the query pipeline WITH Jina Reranker to compare the results after the reranking:

In [None]:
query_pipeline_reranker = Pipeline()
query_pipeline_reranker.add_component('query_embedder_reranker', JinaTextEmbedder(model='jina-embeddings-v2-base-en'))
query_pipeline_reranker.add_component('query_retriever_reranker', retriever_reranker)
query_pipeline_reranker.add_component('query_cleaner_reranker', RemoveRelated())
query_pipeline_reranker.add_component('query_ranker_reranker', JinaRanker())

query_pipeline_reranker.connect('query_embedder_reranker.embedding', 'query_retriever_reranker.query_embedding')
query_pipeline_reranker.connect('query_retriever_reranker', 'query_cleaner_reranker')
query_pipeline_reranker.connect('query_cleaner_reranker', 'query_ranker_reranker')

Choose a ticket from the dataset to use as the query:

In [None]:
query_ticket_key = 'ZOOKEEPER-3282'

with open('tickets.json', 'r') as file:
    tickets = json.load(file)

for ticket in tickets:
    if ticket['Issue key'] == query_ticket_key:
        query = str(ticket)
        query_ticket_id = ticket['Issue id']
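
If the chosen key is not in the dataset, the loop above leaves `query` and `query_ticket_id` undefined, and the error only shows up later as a `NameError`. A sketch of a lookup that fails early instead is shown below; `find_ticket` is a hypothetical helper and the toy tickets are made up.

```python
def find_ticket(tickets, issue_key):
    """Return the first ticket with the given 'Issue key', or raise KeyError.

    Hypothetical helper for illustration only.
    """
    match = next((t for t in tickets if t.get('Issue key') == issue_key), None)
    if match is None:
        raise KeyError(f"No ticket with Issue key {issue_key!r}")
    return match


# Toy data standing in for the contents of tickets.json
toy_tickets = [
    {'Issue key': 'DEMO-1', 'Issue id': 1},
    {'Issue key': 'DEMO-2', 'Issue id': 2},
]

ticket = find_ticket(toy_tickets, 'DEMO-2')
print(str(ticket), ticket['Issue id'])
# {'Issue key': 'DEMO-2', 'Issue id': 2} 2
```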

Run the query pipeline WITHOUT Jina Reranker:

In [None]:
result = query_pipeline.run(
    data={
        'query_embedder': {'text': query},
        'query_retriever': {'top_k': 20},
        'query_cleaner': {'query_id': query_ticket_id},
    }
)

for idx, res in enumerate(result['query_cleaner']['documents']):
    print('Doc {}:'.format(idx + 1), res)

Run the query pipeline WITH Jina Reranker:

In [None]:
result = query_pipeline_reranker.run(
    data={
        'query_embedder_reranker': {'text': query},
        'query_retriever_reranker': {'top_k': 20},
        'query_cleaner_reranker': {'query_id': query_ticket_id},
        'query_ranker_reranker': {'query': query, 'top_k': 10},
    }
)

for idx, res in enumerate(result['query_ranker_reranker']['documents']):
    print('Doc {}:'.format(idx + 1), res)

The results above show why both stages are useful: Jina Embeddings retrieves relevant documents through vector search, and Jina Reranker then reorders them so that the most relevant context comes first. Take, for example, the two issues that relate to adding documentation, "ZOOKEEPER-3585" and "ZOOKEEPER-3587". After the retrieval step, both are correctly included, at positions 11 and 9 respectively (note that the order in the output is reversed, since the documents are printed from least to most relevant). After reranking, they move into the top 5 most relevant documents, at positions 5 and 1 respectively, which is a significant improvement.
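
Reading positions off the printed output by eye is error-prone, so a small helper can look them up instead. The sketch below is one possible way to do it, assuming each document carries the `Issue key` metadata set by `JsonConverter`. `positions` is a hypothetical helper, and the toy documents are `SimpleNamespace` stand-ins with made-up keys.

```python
from types import SimpleNamespace


def positions(docs, issue_keys):
    """Map each issue key to its 1-based position in docs, or None if absent.

    Hypothetical helper for illustration only.
    """
    lookup = {}
    for i, d in enumerate(docs, start=1):
        # Keep the first position if a key appears more than once
        lookup.setdefault(d.meta['Issue key'], i)
    return {key: lookup.get(key) for key in issue_keys}


# Toy result list standing in for result[...]['documents']
toy_docs = [
    SimpleNamespace(meta={'Issue key': 'DEMO-7'}),
    SimpleNamespace(meta={'Issue key': 'DEMO-2'}),
    SimpleNamespace(meta={'Issue key': 'DEMO-5'}),
]

print(positions(toy_docs, ['DEMO-2', 'DEMO-9']))
# {'DEMO-2': 2, 'DEMO-9': None}
```

In the notebook, this could be called on `result['query_cleaner']['documents']` after the first run and on `result['query_ranker_reranker']['documents']` after the second. Positions are counted from the start of each list as returned, and `result` is overwritten by the second run, so the first lookup has to happen before the reranker cell is executed.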