# Performance

Loupe is fast - very fast! Why is this? And what can you do to make it even faster? This chapter documents some of the internals and gives you ideas on which parameters you can change in order to adjust the performance to your needs.

## Basics of search engines

At their core, all search engines basically work the same. When indexing, they split documents into terms and store them in such a way that they can be queried efficiently. Typo tolerance is usually achieved by using some sort of Levenshtein automaton. Basically, in order to know whether `Sawyer` can match `Huckleberry` with an edit distance of `2`, you don't need to calculate the entire distance between those words. The actual distance between those two words is `9`, so it would not match with a maximum distance of `2`. However, why bother calculating all the way up to `9`? If you were to look at the characters step by step, `Sa` could still potentially match `Huckleberry` (if after `Sa` the letters `ckleberry` were to follow), but as soon as you reach `Saw`, there is no way you can ever end up with a distance less than or equal to `2`, which means you can abort early. Moreover, if the next word to compare against `Huckleberry` were for example `Sawdust`, you can remember that `Saw` could never lead to a match, so you don't even have to compare this word anymore as it shares the same prefix.

Prefix! That's exactly the term we're looking for. Most search engines organize their terms in some sort of prefix tree. So if we indexed `Sawyer` and `Sawdust`, this could look something like this:

* S
  * a
    * w
      * y
        * e
          * r
      * d
        * u
          * s
            * t

As you can see, this allows search engines to **prune** entire parts of a tree. In our case, anything underneath `Saw` can be pruned because it will never match `Huckleberry`. This data structure is also called a "trie", and there are programming languages that support data structures like this natively. Also, modern search engines keep this structure in memory, so looking up matches is extremely fast. There's of course more to an automaton than just this part (backtracking etc.). You may want to search the Internet for more information in case you are interested in the details.

PHP doesn't have any of those specialized data structures, and Loupe is designed to work with an SQLite database, not in memory. So how can we still achieve good performance with typo tolerance? Enter the State Set Index.

## Adjust the State Set Index settings

So we've just learned that modern search engines that can rely on memory and specialized data structures can compare query terms with indexed terms very quickly and efficiently. Loupe, however, does not have those tools. What we do instead is apply the ideas of a Levenshtein automaton (a nondeterministic finite automaton) to calculate a state of a given term when indexing. When searching, we then calculate all the possible states that could be reached with a given edit distance and match those up. Pre-calculating all possible states would result in an enormous amount of possibilities, which of course would not work. Hence, we limit the "alphabet" we work with. So instead of working with the real alphabet (UTF-8), we work with a reduced alphabet of numbers. Sounds complex? It sure is, but it's also genius! Let's look at this with an example. Say we want to index our `Huckleberry` term again and we want it to match `Guckleberry` when we configure an allowed edit distance of `2` - as in this case, the distance is just `1`.
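For intuition, you can verify both distances mentioned so far with PHP's built-in `levenshtein()` function (which is byte-based, but sufficient here because the example terms are plain ASCII):

```php
// Far apart: the distance between these two words is 9,
// way above a maximum allowed distance of 2.
var_dump(levenshtein('Sawyer', 'Huckleberry')); // int(9)

// Close: a single substitution (H -> G) is enough,
// so this matches with an allowed edit distance of 2.
var_dump(levenshtein('Guckleberry', 'Huckleberry')); // int(1)
```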
Again, if we were to calculate all possibilities of what could be done with `Guckleberry` and an edit distance of `2`, they would be virtually endless. It could be `Guckleber`, `Gaakleberry`, `Puckleberro`, `Huüklebérry` and so on. Every UTF-8 character there is, times `2` mistakes: an endless amount of possibilities. So what we do instead of working with a UTF-8 alphabet is assign a number to every letter and reduce this alphabet. Let's say we only want an alphabet of `4` digits, because that would bring down the possible states to a manageable set! We're lucky: in UTF-8, every character has a code point, so we can get a number between `1` and `4` for every character like so:

```php
$number = (mb_ord($char, 'UTF-8') % 4) + 1;
```

So let's analyze our search term `Guckleberry`: `G` would be `4`, `u` would be `2`, `c` would be `4`, `k` would be `4` as well - hey, stop! There are already `3` letters that are assigned `4`, so there are going to be many duplicates now, right? Yes, exactly! That's the whole point! By reducing the alphabet, we can reduce the possibilities to search for. The downside is that this will lead to duplicates when searching, because searching for `Guckleberry` would be the same as searching for `keGclurubru`. Not a word that makes any sense, but it has the exact same meaning as `Guckleberry` when translated to an alphabet of `4`. This means that we still need to apply a real Levenshtein function on the results in order to filter out false positives. But at the same time, we can also filter out an **enormous amount** of terms that will never match, with a relatively cheap algorithm!

You can make it even a bit faster by not analyzing the entire query of `Guckleberry` but only - say - the first `5` characters. This is what's called the "index length". Again, this will make the calculation of possible matching states quicker, but it will also result in more false positives.

Both the "alphabet size" and the "index length" affect the size of the stored state set and the number of false positives that then need to be filtered again using a Levenshtein function. The higher those values, the fewer false positives you will have. However, generating the possible matching states when querying will also be more time-consuming.

Loupe ships with a decent default configuration that is also recommended by the [research paper introducing the State Set Index Algorithm][Paper] (alphabet size set to `4`, index length to `14`), which you should totally read in case you are interested in the details. You can adjust the default configuration like so:

```php
$typoTolerance = \Loupe\Loupe\Config\TypoTolerance::create()
    ->withAlphabetSize(5)
    ->withIndexLength(18)
;

$configuration = \Loupe\Loupe\Configuration::create()
    ->withTypoTolerance($typoTolerance)
;
```

Note: The paper works with the Levenshtein algorithm. Loupe includes adjustments built on top of that paper to support Damerau-Levenshtein.
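If you want a feeling for how the alphabet reduction behaves, here is a minimal sketch. The `encode()` helper is hypothetical (not part of Loupe's API) and simply applies the formula from above with an alphabet size of `4` to a whole term, which makes the `Guckleberry`/`keGclurubru` collision visible:

```php
/**
 * Hypothetical helper: maps every character of a term to the
 * reduced alphabet using (code point % 4) + 1.
 */
function encode(string $term): string
{
    $digits = '';

    foreach (mb_str_split($term, 1, 'UTF-8') as $char) {
        $digits .= (mb_ord($char, 'UTF-8') % 4) + 1;
    }

    return $digits;
}

var_dump(encode('Guckleberry')); // string(11) "42441232332"
var_dump(encode('keGclurubru')); // string(11) "42441232332" - same encoding!
```

Loupe's actual implementation works on states derived from such digit sequences rather than on plain strings, but the collision behavior is the same: different words can share an encoding and therefore have to be re-checked with a real Levenshtein function.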
## Limit the languages to detect

You can read more about what the tokenizer does in the [respective docs](tokenizer.md), but basically: if you know that your documents are always in, say, English and German, make sure to configure Loupe for exactly those languages. In order for Loupe to detect the languages, it has to load a set of n-grams into memory, and if you load the entire supported language set, that's quite a lot of data.

You can limit the languages Loupe should detect like so:

```php
$configuration = \Loupe\Loupe\Configuration::create()
    ->withLanguages(['en', 'de'])
;
```

## Configure the searchable attributes

Do not forget to configure the searchable attributes. By default, Loupe will index all attributes of the documents you give it to index. If you know you are only ever going to search a subset of those, configure Loupe accordingly:

```php
$configuration = \Loupe\Loupe\Configuration::create()
    ->withSearchableAttributes(['title', 'overview'])
;
```

Note: Attributes you neither want to search nor filter by are best kept **outside** of Loupe. Don't burden it with data that doesn't need to be processed.

## Avoid highlighting in nested attributes

You should limit the highlighting and cropping of results to top-level string attributes. Loupe optimizes for this case by re-using existing match information when formatting results. Highlighting nested/array values is a more expensive operation as it has to recalculate the matches for each value in the array, leading to slower performance when applied to large datasets.

## Memory considerations

Loupe uses SQLite and executes quite heavy queries which require complex computations by SQLite, but there's also quite a bit going on in PHP. Thus, if you have a long-running process that evaluates e.g. hundreds of queries, make sure to instantiate a fresh `Loupe` instance for every iteration so your database is closed and memory can be freed. Normally, the garbage collector of PHP does a good job cleaning up, but if you want to release memory immediately, you should look into calling `gc_collect_cycles()` right away (but do not do so without profiling the results!):

```php
foreach ($thousandsOfQueries as $query) {
    $loupe = $loupeFactory->create(...);
    $result = $loupe->search(SearchParameters::create()->withQuery($query));

    // Do something with the result

    unset($loupe);
    gc_collect_cycles();
}
```

The same also applies when indexing documents. Do not pass 20,000 documents to `$loupe->addDocuments()` if memory is relevant. Try splitting them into smaller chunks (see the sketch below) and measure your process.

> Hint: Do not measure using `memory_get_usage(true)`, Blackfire, Tideways or any other PHP profiler. You will miss out on
> all the allocations that are NOT created by PHP/your code. You won't see what SQLite itself is using, so monitor your
> process using process monitoring tools such as `top`.

[Paper]: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2012_fenz_efficient.pdf
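A minimal chunking sketch (the batch size of `500` and the `$documents` variable are arbitrary placeholders - profile what works for your data and memory budget):

```php
// Index in smaller batches instead of passing everything at once.
foreach (array_chunk($documents, 500) as $chunk) {
    $loupe->addDocuments($chunk);
}
```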