Gridnite part 2: developing the formula engine

By Slava Vishnyakov

24th April 2012: Please note

Starting with Opera 12, Opera Unite will be turned off for new users and completely removed in a later release. If you're interested in building addons for Opera, we recommend going with our extensions platform — check out our extensions documentation to get started.

Introduction

Part 3: saving and restoring multiple spreadsheets is now also available.

In the first part of my "Gridnite" series, I showed you how to build a basic multi-user spreadsheet application that can propagate (push) changes to other users. In this part, I'm going to fix some bugs and add formula functionality to Gridnite. Let's face it — formulas are an essential spreadsheet feature.

The structure of this article is as follows:

To follow along with this article and try out the Gridnite formula engine for yourself, download the updated application.

First, the bugs

Bugs are an inevitable part of the software development process, and I do mean inevitable. Turns out that the escape(...) function in the original Gridnite application's JavaScript wasn't doing exactly what it was supposed to. Its purpose is to safely send data between Unite's server-side JavaScript, and the client-side JavaScript in the application visitor's browsers. The problem arose when I started implementing the formula functionality — when I inserted a formula of say a1+b2 into a cell, the escape function didn't change the + sign at all, which meant that the a1+b2 became a1 b2 upon decoding.

The solution is to use the escapeURIComponent() function instead. I read that it only works in recent browsers, but I tested it in Internet Explorer 6 and found that it does work, although IE's JavaScript slows to a crawl with even the most basic calculations.

The formula engine

This article assumes that you understand what a regular expression is. If you don't, check out a tutorial on this subject, such as the Regular Expression Tutorial from regular-expressions.info. It's a very useful feature of many programming languages. There is also an excellent regular expression tester Opera Widget available.

So, how do you evaluate an expression like =(A1+B2)*SUM(A1:C3)?

The right way to develop a formula engine would be to create a recursive descent parser, which would be secure and fairly fast too. However, creating such a parser in a limited timeframe and explaining it in an article seems like an impossible task: you need to create a parser that converts a string into an abstract syntax tree and then evaluate that tree to match particular cell values.

That's a lot to explain, so for now I'm going to settle for an easier and slightly more evil solution: naive regular expressions and the eval (...) function. I will make it nicer along the way, but bear in mind that this is a simple solution, and therefore pretty slow, although it is fairly secure.

Planning the code

Let's outline the plan. I need to convert =(A1+B2)*SUM(A1:C3) to something like (cell_value('A1')+cell_value('B2'))*sum_of(range('A1', 'C3')), then supply this to eval(...) and get a result via Opera's JavaScript parser (which is recursive descent too).

There's a number of stumbling blocks here that I need to be mindful of. First of all, what if a user supplies a string like =alert(1); into a cell? That would result into a call to the alert function during eval. Although example could be quite annoying, it's pretty harmless. On the other hand, some bad guy could input a script element that could load an external JavaScript, which in turn could upload a whole spreadsheet onto their server, which is quite evil. I need to take precautions against this kind of activity.

The regular expressions

First, I will check that the string entered into a cell doesn't contain weird symbols — I can do this using a regular expression, like so:

_formula.match(/^=["A-Za-z0-9\.\(\)\*/:+-]+$/)

In human language, this equates to "I will match a string that starts with an equals (=) sign, followed by only letters, numbers, dots, brackets, math signs and colons ( : )". That is probably enough to validate most of the formulas I need.

There's an extra symbol that should be here though — the backslash ("\"). It's quite important, because you might want to use a string that has a quote inside of it, but it also brings a lot of insecurity to applications using eval(...) — that's one of the good reasons you should create a real parser when creating a real world app. Most PHP attacks use a backslash to end a string and then execute malicious code in some way. Therefore I will not be allowing the use of backslashes.

Is it possible to safely add a backslash using regular expressions? Yes, but it would require a lot of testing using the most of horrible strings you can imagine, like "TEST\"", "TEST\" and even "TEST\\". It isn't easy (which is why it's not covered in this article) but it's certainly possible.

Next I will get rid of another way to get eval to do more evil things — comments (for example // and /* ... */). The regular expression that does this for us is as follows:

_formula.match(/(\/\*|\/\/)/)

If there's a match, that means the string contains either "/*" or "//" symbols, which isn't needed in a spreadsheet formula, and could mean a lot of unexpected trouble during evaluation, eg from malicious users trying to set up XSS attacks.

The last thing to do is check that functions used in formulae entered into cells are limited to ones I actually want to support (thereby protecting against users entering JavaScript function names into a formula). The regular expression for this looks like so:

var strings = (_formula.replace(/".*?"/g,")+' ').match(/[a-z]{2,}/gi);

First I'll remove any quoted strings (".*?"); that's where the problems with backslash could begin — for example matching ".*?" on a string of "They call him \"Robot\"" would yield a result of "They call him \" (up to first quote), which is wrong. But since I'm not allowing backslashes, that's not a problem.

Then I match /[a-z]{2,}/, which is any letter, twice or more.

Next I check:

if(!strings[i].match(/^(SUM|MAX|MIN|AVG)$/)) {

This contains the functions I'm going to define — SUM, MAX, MIN and AVG. (You can extend this to contain any function you like).

Now I'm done with regular expressions and security, so we can move on.

Wiring up the regular expressions to the spreadsheet

Remember the sample formula I listed earlier:

=(A1+B2)*SUM(A1:C3)

To make this work I first need to find all references to cells by name (A1, B2, C3, etc.) and then replace them with a function call to get that cell's value. I've called this function formula_single_cell_value('A1') in Gridnite. The formula seems to be a pretty obvious first step when dealing with a regular expression:

_eval_formula = _formula.replace(/\b([A-Z][0-9]{1,3})\b/g,
'formula_single_cell_value(\'$1\')'); 

But here's the problem — if I just look for anything that looks like A1 and replace it, I'm going to run into expressions that A1 is just a part of, such as A1:C3. This would result in "formula_single_cell_value('A1'):formula_single_cell_value('C3')", which is JavaScript nonsense (two function calls, separated by colon!) I need to find those references and replace them with a call to another function, for example "formula_range_value('A1','C3')".

_eval_formula = _eval_formula.replace(
/\bformula_single_cell_value\(\'([A-Z][0-9]{1,3})\'\):formula_single_cell_value\(\'([A-Z][0-9]{1,3})\'\)/g,
'formula_range_value(\'$1\', \'$2\')'); 

The last step is to convert our special functions, like SUM(...), to JavaScript functions, like formula_sum(...). This is also straight-forward:

_eval_formula = _eval_formula.replace(/\bSUM\(/g, 'formula_sum(');

Getting the formula result

At this point, I just need to "eval" this variable (_eval_formula) to get the result. Just in case the user makes a mistake in the syntax, I'll wrap the code in a try { } catch construct:

try {
  _new_value = eval(_eval_formula);
} catch(err) {
  throw('Can not evaluate formula.' + /*_eval_formula + err*/);
};

The throw here is used to "bubble" the exception above (where it's handled and a nice, user-friendly message is shown). The commented out part (_eval_formula + err) is a way to debug our formula conversion — it allows us to see the final string to be evaluated).

Formulae inside formulae

You can write formulae involving cells that have other formulae inside them — how does that work? Let's say we have the formula B2 = A1+2, and A1 is itself a formula, say =5+C3?

B2 would be evaluated as a call to formula_single_cell_value('B2'), which would be converted to formula_single_cell_value('A1') + 2. The call to formula_single_cell_value('A1') would be evaluated as "5+formula_single_cell_value('C3')", so there's no problem here at all.

Evaluating the whole spreadsheet

How do we evaluate a whole spreadsheet? Simple — I'll just call formula_single_cell_value on each cell:

function evaluate_cells() {
  cells_cache = {};
  $('td.cell').each(function(){
    evaluate_cell(this.id.replace(/^CELL_/,"));
  });
};

evaluate_cell is a small helper function that sends a call to formula_single_cell_value and shows the value in a cell (via jQuery), OR shows a nice user-friendly message, as discussed above. I'll explain cells_cache very shortly.

Reading the cell contents

The last part of the puzzle is the actual formula_single_cell_value function. It looks at the cell contents and determines how to parse them. If the cell contains numbers and a dot, that's a float, and I can use JavaScript's parseFloat(...) function to return the number. If the cell contains only numbers, that's an int, so I can use parseInt. If neither of these possibilities are true formula_eval is called, which checks if the string starts with equal sign (=). If it does it's a formula; if not it's just a string.

If formula_eval comes across a formula it evaluates it, as described above, starting with the safety checks and working up to processing it with the eval(...) function.

Note: I didn't explain how to write functions like formula_sum(...) because of limited space, but it's pretty trivial — you just sum all of the formula_single_cell_value(...) results for each of the supplied elements. formula_range_value is a little more complex, but again it isn't really voodoo magic; all it does is convert the range — eg ('A1','C2') — to an array — eg ['A1','B1','C1', 'A2','B2','C2']. It does a little black magic with ASCII codes, but it's not too complex.

Dealing with cyclic references

One more thing that I haven't dealt with (yet) is "cyclic references" — what if I have two cells that reference each other, for example A1 = B1+1 and B1 = A1+1? That's an infinite loop! This is where the cells_cache variable comes into play.

When evaluate_cells is called, I create an associative array, as follows:

cells_cache = {}

Each time we start evaluating a formula, we compare the value stored in the cells_cache['..'] array to the calculating_34#43 variable. If they are identical then we're dealing with a cyclic reference, otherwise we set the value of cells_cache['..'] to 'calculating_34#43' (although this should be something even more random to avoid collision with something the user actually enters into a cell). Let's look at an example:

A1 = B1+1
  cells_cache['A1'] == "calculating_34#43" ?
  // No, it's empty - go on:
  Set cells_cache['A1'] to "calculating_34#43"

  // While evaluating this, we arrive at B1:
  B1 = A1+1
  cells_cache['B1'] == "calculating_34#43" ?
  // No, it's empty - go on:
  Set cells_cache['B1'] to "calculating_34#43"

  // While evaluating this, we arrive at A1:
  cells_cache['A1'] == "calculating_34#43" ?
  // YES, it's a cyclic reference!

Modifying the loading procedure

Now that the whole things works, I need to think about the fact that in the first article I made the whole spreadsheet load as a static HTML table. That was ok initially, but now I have a formula engine to worry about I need to do things differently, otherwise the formulas won't be evaluated. The modification will load the page as an EMPTY table, followed by a SCRIPT element with a series of calls to set_cell_value('A1', '=B1+1');

init_script = '';
... for each cell, stored on server side:
init_script += 'set_cell_value('+safe_js_val(A+j)+', '+safe_js_val(_value)+');';

template.select('body')[0].innerHTML +=
  '<scr'+'ipt>$(document).ready(function(){' + init_script + '});</scr'+'ipt>';

In any case, I would have needed to modify the set_cell_value(...) function so that each time a user updates a cell (ie, each time this function is called) the whole spreadsheet is recalculated with a call to evaluate_cells(...).

There's another modification needed. When all I was dealing with was static data, I could just keep the data in a table cell. But what about the formulas? Once I evaluate the formula, it's gone (replaced with the evaluated content). Here I'll do a clever thing and kill two rabbits with one stone: I'm going to store the actual formulas in the cell title attributes. By doing so I can prevent formulas from being replaced with evaluated content, and allow the user to see the formulas by hovering over the cells.

Optimizations

Now that the spreadsheet is calculating everything needed, it's time to think about some optimizations. Remember that Premature optimization is the root of all evil (Donald Knuth — see Program optimization < When to optimize on Wikipedia for the quote source). In other words, make sure your program works first, then optimize it.

At this stage, the main problem is that when the spreadsheet loads and a cell updates, the whole spread sheet needs to be recalculated, which results in a lot of unnecessary overhead. To get around this I've added a simple page_loaded = 0 variable to the start of the script that prevents the evaluate_cells function from being executed until all cells are filled up, at which point, page_loaded is set to 1. This simple modification resulted in 8600 calculations being cut down to 500 — the loading speed was increased by more than a factor of 10 with less than 5 lines of code.

Another thing to consider is that the cells_cache array can be reused to cut down on unnecessary calculations. Let's take a sample range formula, for example C4 = SUM(A1:C3). That's 9 calculations at least, even more if the cells within the A1:C3 contain their own formulas. This also means that any other cell that contains a formula involving C4 would also require at least 9 calculations. But wait — the formula values are already stored in the cells_cache array! So each time the code starts calculating a cell value it looks at that array to see if there's a value in the cells referenced in the formula — cells_cache['C4'] in the case of the above example. This means that in actual fact I doesn't need to be calculated again; I just return it early from the cache. If there is not already a value in the array, the formula is evaluated, and the result stored in cells_cache['C4'].

Conclusion

The development of the formula engine took about 7 hours in total. A recursive descent parser would have taken much longer to implement, but it would be more robust. However, the parser I've created here is still perfectly functional, it was a good exercise, and it was much easier to explain! This article contains some good tips on creating formulas, reg exs, and JavaScript security, and more importantly, I have added further functionality to the Gridnite Unite application.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

Comments

The forum archive of this article is still available on My Opera.

No new comments accepted.