Indexing and searching in Opera with Opera Quick Find History Search

By Pavel Studený

Introduction

Opera 9.5 contains a new feature that allows you to search the full text from any previously-visited pages — Opera Quick Find History Search. Anytime you visit a page, its contents is indexed. Then for example if, a month later, you start thinking "Where did I see that article about Pierre Richard?", you can go to Tools -> History (or press Ctrl/Cmd + Shift + H) and search for "pierre" and "richard" in the search field. Voilà — you get a list of all the pages you have previously read containing the strings "pierre" and "richard". You can also keep refining/reducing the search term as you type, which is very useful. Easier still, you can even do the search directly in the address bar by typing "h pierre richard" into it!

Of course, if you don't want others to know that you have seen such a page, you can delete the URL from Opera's history and it won't appear in your search results anymore. There is also an option to delete all your private data in a single step (choose Tools > Delete Private Data from the Opera menu bar).

The structure of the Index

So how does Opera Quick Find History Search work? Opera contains a simple database engine that specializes in indexing and storing full-text data. All changes to its indexes are journaled, so it doesn't break when you for example run out of power in the middle of modifying the data.

For the indexing to work effectively, all the words from the documents you read have to be recognized by the indexing database. It's quite easy for latin or arabic languages, but Japanese and Thai are difficult, because they don't have spaces between words. Luckily enough, they use different character sets to arabic languages, so it's easy to find out that a page contains such a script, and adjust things accordingly.

As the words are parsed, they are stored as an inverted file index. An inverted file index contains a list of words and, for each word, a list of ID numbers of the documents that contain the word. For example, say you have the following three documents, each containing a single line, numbered 1, 2 and 3:

1: Pourquoi un site sur Pierre Richard?
2: Interview de Pierre Richard
3: Interview de Pierre Curie

The three documents give us a following inverted file index:

curie{3}
de{2, 3}
interview{2, 3}
pierre{1, 2, 3}
pourquoi{1}
richard{1, 2}
site{1}
sur{1}
un{1}

The words in the indexed documents are ranked during indexing to give you the best search results first. Ranking is stored in the inverted file index together with document numbers; bear in mind that the same word can have a different ranking in different documents. An inverted file index with the ranking added looks something like the following:

curie{[3, 0.8]}
de{[2, 0.5], [3, 0.6]}
...

Searching

The index can contain hundreds of thousands of items, so it's important to return the best matching results first. When you search for multiple words, the ranking of the searched words is combined for each document to give an overall ranking of a result. Here is an example how searching works — consider the following inverted index:

Pierre Richard
rankdocument rankdocument
0.93 0.72
0.51 0.21
0.32   

Let's say that Gertrude, an Opera user, likes French movies, so she decides to buy a copy of a movie featuring Pierre Richard. She remembers she has previously read an article about it, but she doesn't remember the name of the movie, or the URL of the site she read it at. So she enters "Pierre Richard" in the History Search search box, as explained above.

Document 3 contains "Pierre", but doesn't contain "Richard", therefore it will not appear in the results. The other documents (1 and 2) contain both of the words, therefore documents 1 and 2 will be presented to Gertrude, sorted by their overall ranking. The overall ranking is computed as the arithmetic mean of all the rankings of the particular words. Document 1 has a ranking of 0.35 (0.2 + 0.5 divided by 2) and document 2 has a ranking of 0.5 (0.3 + 0.7 divided by 2,) so document 2 will appear first in the list of results.

Summary

Opera has a fast indexing and searching engine, able to, for example, import more than 100 emails per second and update their status. Searching for a word in Opera Quick Find History Search may require just 1 read from disk, but usually it takes at least 3 reads. A prefix search for multiple words will take longer, of course.

The engine has a safe system of transactions, which keeps its indexes safe even if power is lost in a the middle of a disk write operation. All the data is kept on your HDD, so there is very little memory used, except for caches. Processing long queries may require slightly more significant memory usage, but it's never likely to weigh heavily on your CPU.

Besides the visited documents and email indexing, the indexing engine is also used for other tasks in opera. It does have some weaknesses — for example no support for a database language such as SQL and weak support for advanced queries; this is deliberate however — a big effort has been put into making the transactions run as smoothly as possible.


* Note that all operations for the examples in this article were run in a single thread on a 2.4GHz processor with antivirus turned off.

Resources

  1. Bienvenue sur le site officiel de Pierre Richard
  2. Wikipedia inverted index entry

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

Comments

The forum archive of this article is still available on My Opera.

No new comments accepted.