## Jina Embeddings: Top-Performing Open-Source Bilingual Models Now on Hugging Face

In this tutorial, we'll download the Jina Embeddings v2 bilingual German-English model and use it for cross-language information retrieval.


First, install the necessary libraries: `transformers`, `faiss-cpu` ([FAISS](https://faiss.ai/)), and `bs4` ([Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)):

In [None]:
!pip install transformers faiss-cpu bs4

Next, you will need a Hugging Face access token. Sign up for a Hugging Face account if you don't already have one and [follow these instructions](https://huggingface.co/docs/hub/security-tokens) to make a token. Then, insert your token into the code below and run it.

In [None]:
import os

os.environ['HF_TOKEN'] = "<your token>"

## Download Jina Embeddings v2 for German and English

Once your token is set, you can download the Jina Embeddings German-English bilingual model using the `transformers` library:

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

## Download English-language Data

For this tutorial, we are going to get the English-language version of the book [*Pro Git: Everything You Need to Know About Git*](https://open.umn.edu/opentextbooks/textbooks/pro-git-everything-you-need-to-know-about-git). This book is also available in Chinese and German, which we’ll use later in this tutorial.

In [None]:
!wget -O progit-en.epub https://open.umn.edu/opentextbooks/formats/3437

## Processing the Data

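The function below opens an EPUB file, reads each `.xhtml` file inside it, splits the contents on the `<section>` tag, and stores the text of each section in a Python dictionary. Text that belongs to a nested section is stored only under that nested section's own key.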

In [None]:
from zipfile import ZipFile
from bs4 import BeautifulSoup
import copy

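def decompose_epub(file_name):
  """Map each section of an EPUB to its own text, excluding nested sections."""

  def to_top_text(section):
    # Work on a copy so removing nested sections leaves the parsed document intact.
    selected = copy.copy(section)
    while next_section := selected.find("section"):
      next_section.decompose()
    return selected.get_text().strip()

  ret = {}
  # An EPUB is a ZIP archive; the book's content lives in its .xhtml files.
  with ZipFile(file_name, 'r') as archive:
    for name in archive.namelist():
      if name.endswith(".xhtml"):
        data = archive.read(name)
        doc = BeautifulSoup(data.decode('utf-8'), 'html.parser')
        # Text outside any <section> is stored under "<file>:top".
        ret[name + ":top"] = to_top_text(doc)
        # Every <section>, nested or not, gets its own "<file>::<n>" entry.
        for num, sect in enumerate(doc.find_all("section")):
          ret[name + f"::{num}"] = to_top_text(sect)
  return ret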
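To see how nested sections are handled, here is a minimal sketch that applies the same copy-then-`decompose()` approach to a hand-written HTML fragment. The fragment and the `own_text` name are made up for illustration, and the sketch joins text fragments with spaces to keep the output compact, which the function above does not do.

```python
import copy
from bs4 import BeautifulSoup

# Hypothetical fragment: one section containing a nested subsection.
SAMPLE = """
<body>
  <p>Intro</p>
  <section><h1>A</h1><p>Text A</p>
    <section><h2>A.1</h2><p>Text A.1</p></section>
  </section>
</body>
"""

def own_text(node):
    """Return the text of `node` with all nested <section> elements removed."""
    selected = copy.copy(node)  # copy so the original tree is left untouched
    while (nested := selected.find("section")) is not None:
        nested.decompose()
    # Join text fragments with single spaces to keep the demo output compact.
    return selected.get_text(" ", strip=True)

doc = BeautifulSoup(SAMPLE, "html.parser")
parts = {"top": own_text(doc)}
for num, sect in enumerate(doc.find_all("section")):
    parts[str(num)] = own_text(sect)

for key, text in parts.items():
    print(f"{key!r}: {text!r}")
```

The text of the nested subsection shows up only under its own key, not under its parent's, so no passage is embedded twice.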

Process the book you just downloaded:

In [None]:
book_data = decompose_epub("progit-en.epub")
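Before embedding, it can help to check what the dictionary holds. The following is a small sketch using a hypothetical `describe_sections` helper and toy keys; the real keys depend on the file names inside the EPUB.

```python
def describe_sections(sections):
    """Summarize a {key: text} mapping like the one decompose_epub returns."""
    lengths = {key: len(text) for key, text in sections.items()}
    empty = [key for key, n in lengths.items() if n == 0]
    longest = max(lengths, key=lengths.get) if lengths else None
    return {"count": len(lengths), "empty": empty, "longest": longest}

# Toy stand-in for book_data; these keys are hypothetical.
toy_book = {
    "ch01.xhtml:top": "",
    "ch01.xhtml::0": "Getting Started",
    "ch01.xhtml::1": "About Version Control and why it matters",
}
summary = describe_sections(toy_book)
print(summary)
```

In the notebook, `describe_sections(book_data)` would report how many entries will be embedded and which of them, if any, are empty.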

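The code below generates the embeddings and stores them in a FAISS index. `IndexFlatIP(768)` is an exact inner-product index sized to the model's 768-dimensional embeddings; because the embeddings are normalized, the inner product of two vectors equals their cosine similarity. Set the variable `batch_size` as appropriate to your resources. Colab without extra memory appears to work well with it set to 5.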

**This may take some time, depending on the speed and resources of the system you run it on.**

In [None]:
import faiss

batch_size = 5

vector_data = []
faiss_index = faiss.IndexFlatIP(768)

data = list(book_data.items())
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

for ind, batch in enumerate(batches):
    print(f"Processing batch {ind + 1} of {len(batches)}")
    batch_embeddings = model.encode([x[1] for x in batch], normalize_embeddings=True)
    vector_data.extend(batch)
    faiss_index.add(batch_embeddings)
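The index above is an inner-product index, yet the scores it returns are later labeled as cosines. That works because the embeddings are normalized to unit length. A plain-Python sketch with toy 3-dimensional vectors (the helper names are illustrative and not part of any library):

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as normalize_embeddings=True requests."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for 768-dimensional embeddings.
a, b = [3.0, 4.0, 0.0], [4.0, 3.0, 0.0]

inner = dot(normalize(a), normalize(b))
print(inner, cosine(a, b))  # the two values agree up to rounding
```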

Verify that we have 583 embeddings stored in the index:

In [None]:
# This should be 583
faiss_index.ntotal

Now, let's create a function to query the FAISS index and corresponding data:

In [None]:
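def query(query_str):
  # Embed the query with the same normalization used for the indexed texts.
  query_vec = model.encode([query_str], normalize_embeddings=True)
  # search returns one row per query; with k=1, [0][0] is the best match.
  cosine, index = faiss_index.search(query_vec, 1)
  print(f"Cosine: {cosine[0][0]}")
  loc, txt = vector_data[index[0][0]]
  print(f"Location: {loc}\nText:\n\n{txt}")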
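The double indexing in `cosine[0][0]` and `index[0][0]` follows from the shape of the search result: one row per query vector and one column per requested neighbor. The plain-Python sketch below mimics what an exhaustive inner-product search computes. It is not FAISS's implementation, and `flat_ip_search` is a hypothetical name.

```python
def flat_ip_search(query_vecs, stored_vecs, k):
    """Exhaustive inner-product search returning (scores, indices) per query."""
    all_scores, all_indices = [], []
    for q in query_vecs:
        scored = [
            (sum(x * y for x, y in zip(q, v)), i)
            for i, v in enumerate(stored_vecs)
        ]
        # Highest inner product first; ties broken by lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:k]
        all_scores.append([s for s, _ in top])
        all_indices.append([i for _, i in top])
    return all_scores, all_indices

# Three unit-length toy vectors and a single query.
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
queries = [[0.0, 1.0]]

scores, indices = flat_ip_search(queries, stored, k=2)
print(scores[0][0], indices[0][0])  # best score and its position for query 0
```

Raising `k` in the `query` function would work the same way: the extra matches appear as additional columns in row 0.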

Let's query in German to get English answers:

In [None]:
# Translation: "How do I roll back to a previous version?"
query("Wie kann ich auf eine frühere Version zurücksetzen?")


In [None]:
# Translation: "What does 'version control' mean?"
query("Was bedeutet 'Versionsverwaltung'?")


## Reversing the Roles: Querying German documents with English

The book [*Pro Git: Everything You Need to Know About Git*](https://open.umn.edu/opentextbooks/textbooks/pro-git-everything-you-need-to-know-about-git) is also available in German. We can use the same model to repeat the demo with the languages reversed:



Download the German edition:

In [None]:
!wget -O progit-de.epub https://open.umn.edu/opentextbooks/formats/3454

Process the book the same way we did for English:

In [None]:
book_data = decompose_epub("progit-de.epub")


Now we generate embeddings for the German version the same way we did for English:

In [None]:
batch_size = 5

vector_data = []
faiss_index = faiss.IndexFlatIP(768)

data = list(book_data.items())
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

for ind, batch in enumerate(batches):
    print(f"Processing batch {ind + 1} of {len(batches)}")
    batch_embeddings = model.encode([x[1] for x in batch], normalize_embeddings=True)
    vector_data.extend(batch)
    faiss_index.add(batch_embeddings)

We can use the same `query` function we used before, but with English questions:

In [None]:
# The result should start with "Was ist Versionsverwaltung?"
query("What is version control?")

## Querying in Chinese

The Chinese-English bilingual model works exactly the same way. To use it, load it in place of the German-English model:



In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

Get the Chinese edition of [*Pro Git: Everything You Need to Know About Git*](https://open.umn.edu/opentextbooks/textbooks/pro-git-everything-you-need-to-know-about-git):

In [None]:
!wget -O progit-zh.epub https://open.umn.edu/opentextbooks/formats/3455

Process the Chinese book like the German and English ones:

In [None]:
book_data = decompose_epub("progit-zh.epub")


To finish, rerun the embedding and indexing code from the previous sections to build a FAISS index for the Chinese book, then query it in English to get Chinese results.
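Since the same indexing loop is reused for every book, one option is to wrap it in a helper. The sketch below is an assumption about how that could look, not part of the original notebook: `build_index`, `ListIndex`, and `fake_encode` are hypothetical names, and the encoder and index are passed in so the example runs without a model or FAISS.

```python
def build_index(sections, encode, index, batch_size=5):
    """Embed every section in batches and add the vectors to `index`.

    `encode` maps a list of texts to their embeddings; `index` is any object
    with an `add` method. Returns the (key, text) pairs in insertion order.
    """
    data = list(sections.items())
    stored = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        index.add(encode([text for _, text in batch]))
        stored.extend(batch)
    return stored

class ListIndex:
    """Stand-in index that only collects vectors in a list."""
    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        self.vectors.extend(vectors)

def fake_encode(texts):
    """Stand-in encoder: a one-dimensional 'embedding' holding the text length."""
    return [[float(len(t))] for t in texts]

toy_index = ListIndex()
toy_data = build_index(
    {"a": "x", "b": "yy", "c": "zzz"}, fake_encode, toy_index, batch_size=2
)
print(len(toy_index.vectors), toy_data[2])
```

In the notebook, the real pieces would take the place of the stand-ins: create `faiss_index = faiss.IndexFlatIP(768)`, then call `vector_data = build_index(book_data, lambda texts: model.encode(texts, normalize_embeddings=True), faiss_index)`.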