Snowball 3.1.0 (2026-05-22)
===========================

Compiler changes
----------------

* Bug fixes:

  + Fix segmentation fault if -syntax is used on a program with no code.

  + Fix segmentation fault on some assignment syntax errors.

  + Fix bug introduced in v3.0.0 with conversion of `among` starter. If
    there were any commands after the among in the same command list then
    the among itself would get lost. Not triggered by any current
    algorithms.

  + Clear name field when removing dead assignments. This is visible in
    the syntax tree shown when command line option -syntax is used, but
    probably doesn't affect anything otherwise.

* Compiler command-line options:

  + Using `-` for the Snowball source file is now interpreted as stdin.

  + Improve comments generated by `-comments` to show more details of the
    corresponding Snowball code (e.g. variable names, arithmetic
    expressions, and literal strings).

  + Add `-coverage` option which enables a code coverage feature. So far
    this tracks which among strings and functions are exercised, and which
    grouping characters are exercised.

  + Support `-eprefix` for all target languages. This is easy to do and
    provides a way to deal with externals which collide with keywords in
    the target language. Our build system now uses `-eprefix _` for Python
    to make the `stem` external non-public (it is called by BaseStemmer
    method `stemWord()`) and we no longer hard-code prefixing Python
    externals with `_`.

  + Describe more options in `--help` output.

  + Sort target language options in `--help` output.

  + The `-o` option is now optional. If not specified we now write
    output(s) to the same filename as the first source, but with a
    different extension (e.g. path/to/english.sbl -> path/to/english.c and
    path/to/english.h).

  + The `-o` option can now optionally include an extension so you can now
    write `-c++ -o path/to/foo.cxx` instead of `-c++ -o path/to/foo`,
    which can be more convenient (e.g.
    in `make` rules) and also provides an easy way to specify an
    alternative extension (for example, `.cxx`, `.cc` and `.cpp` are all
    extensions commonly used for C++ source code).

  + Reject `-vprefix` option for target languages which don't support it
    (it is currently only implemented for C/C++).

* Diagnostics:

  + Clean up and improve error reporting.

  + Improve line numbers reported for some errors and warnings by using
    the line number of an appropriate token rather than the current line
    number of the tokeniser (which is often the line after the command
    being warned about).

  + Improve recovery after various errors, trying to resynchronise based
    on what's more likely, and eliminating some additional irrelevant
    errors (including reporting the exact same error twice in some
    situations).

  + Emit warnings for uses of legacy Snowball language features.

  + The Snowball manual describes `integers (x)` as a declaration of `x`
    so we now warn:

        integer 'x' declared but not used

    rather than:

        integer 'x' defined but not used

  + 3.0.0 added a warning if the body of a `repeat` or `atleast` loop
    always signals `t` (meaning it will loop forever which is very
    undesirable for a stemming algorithm) or always signals `f` (meaning
    it will never loop, which seems unlikely to be what was intended).
    This warning was added to the C generator, but has been moved to
    generic code so it is now issued regardless of the current target
    language.

  + Improve the wording of the warning if the body of a `repeat` or
    `atleast` loop always signals `t` to explicitly say this means the
    loop is infinite.

  + Improve warning message for unreachable code after `not`.

  + Fix `$x = x + 1` clearing the initialised status of `x` (rather than
    just not setting it), which could lead to bogus warnings that `x` is
    never initialised.

  + The compiler no longer exits immediately after reporting a division by
    zero error in the Snowball code.

  + We now report a division by zero error for `$x /= 0` (this was meant
    to be already implemented but wasn't working due to a code typo).

  + More consistent wording of "is a no-op" warnings.

  + Warn that `insert ''` and `attach ''` are no-ops (and don't generate
    code for them).

  + Warn if a string used to define a grouping repeats characters. There's
    no reason to do this, so it seems likely to be a typo.

  + Avoid sometimes reporting "-1 blocks unfreed".

* Optimisations:

  + Speed up processing larger Snowball programs by growing large string
    buffers exponentially to avoid a huge number of reallocations. For
    example, this reduced the time to compile serbian.sbl to C by about
    80%!

  + Optimise reading of input file when it is seekable (which it is in
    typical usage). Non-seekable input files are still supported.

  + Optimise writing integers when generating code. 72% of integers we
    write are 0 to 9 and these are now written as a character. Other
    values are now handled without a temporary buffer, avoiding a copy.
    This reduced the time to compile serbian.sbl to C by about 8%, for
    example.

  + Optimise comparing among actions to find and merge equivalent actions.
    The comparison function used for this was carefully returning a full
    order, but actually we only need to know if the actions are equivalent
    or not which can be tested more efficiently. For example, this reduced
    the time to compile serbian.sbl to C by about 2%.

  + We now precompute the possible signals from each command which means
    this is now done exactly once per command, whereas previously we could
    end up doing it many times for some commands in some cases. The only
    functional change should be that we no longer make a pessimistic
    assumption if the function call depth reaches 100. This is cleaner but
    is unlikely to make a difference for any real-world Snowball programs.

  + Handle possible_signals for string-$ which just passes on signals from
    its subcommand. This doesn't affect code generation for any algorithms
    we currently ship.

  + We now only generate function bodies to a temporary buffer for target
    languages where we need to. This makes the code a bit clearer and
    reduces the amount of copying of data so will make the Snowball
    compiler a little faster. This change produces identical output for
    all current algorithms.

  + Tokenisation now decodes symbol tokens using switch statements. We
    don't know the length of these tokens in advance, so the old approach
    of binary chop on a sorted list required searching the list multiple
    times with different possible lengths. Alphabetical tokens are still
    decoded by binary chop.

* Code quality:

  + Remove unused routines and groupings from the program during the
    analysis phase, which avoids each generator having to have duplicate
    code to skip them.

  + Fix small memory leak if all uses of a name are eliminated.

  + Always use `snprintf()` instead of `sprintf()`. If the buffer passed
    was too small we now emit an error rather than quietly using truncated
    output.

  + Fix GCC -Wcast-qual warnings in compiler and enable this warning by
    default.

  + Switch to using the standard C `bool` type in the code of the
    compiler. (The generated code still aims to require only C90.)

* Other changes:

  + Provide a simpler way to build a cut-down Snowball compiler. The
    motivation here was to have a way to more quickly build a smaller
    Snowball compiler which only targets C. Rather than have a DISABLE_xxx
    macro for each language, just check if TARGET_C_ONLY is defined, and
    only turn off the code to actually call the other generators which
    greatly reduces the amount of conditionalisation required.

Generic code generation changes
-------------------------------

* Bug fixes:

  + Fix code generated for `setlimit tomark` for all target languages to
    restore the limit correctly afterwards. The bug was not triggered by
    any of the existing stemmers.

  + We no longer optimise repeat/atleast applied to goto/gopast on a (non)
    grouping.
    This optimisation was flawed - it requires that the code in the loop
    preserves the cursor's value on failure, but the target language
    helper functions used here don't currently do that (they probably
    easily could so there's scope to reinstate this optimisation). Looking
    at the stemmers we ship, this affects the code generated for one loop
    in indonesian.sbl, but it happens the cursor value is overwritten
    immediately after this loop anyway. The bug could affect non-shipped
    Snowball code though so isn't purely latent. This bug was introduced
    in Snowball 3.0.0.

  + When generating target language literal strings we now always escape
    characters which can be problematic when viewing the generated source
    code. We always escape control characters U+007F to U+009F,
    non-breaking space U+00A0 (visually identical to a space), and U+0590
    and above (as a crude way to avoid literal RTL characters in sources
    which can result in confusing rendering).

  + Fix line numbers given to various tokens (the line numbers previously
    given were at least those of lines in the same command or the line
    after it). These line numbers can be seen in the target language
    comments generated when `-comments` is used.

  + Fix warning and simplification of code when `not` is applied to a
    command which always signals `t`. Bug introduced in v3.0.0. Fixes
    #271.

  + Warn and simplify `not` applied to a command which always signals `f`.

* Optimisations:

  + Add machinery to generate a Snowball variable as a local variable in
    the target language instead of it being "global" (typically a private
    class member in the target language). This reduces the amount of state
    in stemmer objects, and typically reduces the overhead of accessing
    these variables a little. We now do this for integers and booleans in
    all target languages, and for strings in target languages where
    benchmarking seems to show it is faster (Dart, Go, JS, Pascal, PHP,
    Python).
    It's done when a Snowball variable is only used in one routine, that
    routine doesn't (directly or indirectly) call itself, and the variable
    is set by any code path which leads to a use of the variable. The
    mechanism which traces the code paths errs on being too conservative
    in some cases, but it's good enough for all instances in the code we
    ship, and is likely to handle the vast majority of real-world cases.
    We issue an "info" diagnostic to report when a variable which is only
    used in one routine can't be localised - please report if you see this
    in real world code and we can try to enhance the code path tracing.

  + Tail-calling and similar optimisations can now work for non-trivial
    routines (previously they only worked for routines consisting of a
    single command and not enclosed in parentheses).

  + A grouping test at the end of a routine now generates simpler code.

  + A string test at the end of a routine now generates simpler code.

  + Optimise testing a boolean (optionally preceded by `not`) when used at
    the end of a routine.

  + Optimise an `among` with no commands at the end of a routine.

  + Generate simpler code for `not` applied to testing a boolean variable.

  + A `not` only needs to restore the cursor when its subcommand fails, so
    we now consider whether its subcommand can modify the cursor on
    failure (rather than whether it can modify the cursor at all). Related
    to #226.

  + An `or` only needs to restore the cursor when a subcommand fails, so
    we now consider whether its subcommand can modify the cursor on
    failure (rather than whether it can modify the cursor at all). We also
    now consider each subcommand individually, and only emit the cursor
    restore for those subcommands which need it. (#226)

  + Both `and` and `or` only need to restore the cursor between
    sub-commands so no longer consider if the final subcommand might
    change the cursor. This makes some small improvements to the generated
    code for a few of the currently shipped algorithms.
    (#226)

  + Handle more commands when checking if the cursor needs restoring -
    this improves the generated code for tamil.sbl a bit.

  + Single case amongs are now refactored to eliminate the `among` and so
    no longer call the among machinery. Sometimes a single case among is
    the natural way to express a single rule in Snowball code as it can
    show commonality with rulesets with multiple rules, but it's
    inefficient to actually generate as an among. Of the stemmers we
    currently ship, this improves code generation for arabic, estonian,
    greek and lithuanian.

  + Avoid unnecessary cursor update in among helpers. We only need to
    update the cursor on success, but were unconditionally doing so after
    calling an among function.

  + Handle more commands in repeat_score(). None of these help code
    generation for any currently shipped algorithms, but they are valid to
    optimise here.

  + Canonicalise `<-''` to `delete`.

  + Simplify some cases of compound assignment operators when the argument
    is (or can be simplified to) a constant integer, like we already do
    for arithmetic expressions. For example, `$x += len '{a"}' - len 'a'`
    is a no-op when using a fixed-width encoding.

  + Canonicalise `fail C` to `false` in some cases where `C` has no
    side-effects.

  + Removing unreachable code could leave single-entry `and`/`or` nodes
    which could result in generating target language code with unused
    variables. These nodes are now replaced with their subnode.

  + Eliminate `true` below `and`/`false` below `or`. These are unlikely to
    appear verbatim in real programs, but can be created by optimisations,
    and also can appear in runtime tests, leading to the generated target
    language code having unused variables and/or unreachable code.

  + Canonicalise setmark, atmark and atlimit by converting `setmark x` to
    `$x = cursor`, `atmark x` to `$(cursor == x)` and `atlimit` to either
    `$(cursor >= limit)` or `$(cursor <= limit)` (depending on whether
    we're in backwardmode or not).
    This means the target language generators have three fewer commands to
    handle, and also gives us tail-calling of `atlimit` and `atmark x`
    (there's a tail-callable use of `atlimit` in the Turkish stemmer).

  + Remerge among actions after optimisation. It seems hard to fully move
    the code to merge them later, but we can check for actions which have
    become equivalent to `true` or to other actions after optimisation but
    before we generate code.

* Code quality:

  + Find Snowball routines which are not reachable by calling an external.
    We no longer generate code for such routines, nor for variables and
    groupings which are only used in them, which helps to avoid "unused"
    warnings in the generated target language code.

  + If the sub-command of `repeat`/`atleast` always signals `t` or always
    signals `f` we now prune the rest of the current command list, and
    simplify the command for always `f` (c_repeat -> c_do; c_atleast ->
    c_bra). These changes help to avoid generating redundant target
    language code which can trigger errors or warnings.

* Other changes:

  + `delete` and `<-` now update the slice end (see the "Snowball Language
    Changes" section).

Ada
---

* Bug fixes:

  + Ada variable names are case-insensitive, so if two Snowball names of
    the same type differed only by case we would generate Ada code with a
    name collision. We now avoid such collisions by adding a counter after
    the type code for the second and subsequent names that differ only by
    case.

  + Ada stemmer names are now prefixed with `S_` so `or.sbl` now generates
    stemmer `S_or`, avoiding a name clash with an Ada keyword.

  + Fix Ada code generated for `setlimit tomark p`. This affected the
    generated code for the Lithuanian stemmer, but it appears by luck in
    this case the bug didn't actually affect the stemmer's output for any
    input.

  + Fix `setlimit` ... `repeat` bug. The generated Ada code was running
    the code to recover from a failure inside a `repeat` loop twice due to
    a missing line of code compared to other generators.
    In `backwardmode`, the failure code happens to be idempotent so
    running it twice doesn't cause a problem, but in forwards mode this
    results in the cursor getting double adjusted if the length of the
    stem has changed due to insertions, deletions or substitutions. None
    of the existing algorithms use `setlimit` in forwards mode, so they're
    unaffected by this bug. Fixes #275.

  + Fix overcopying in string replacement code. The code to move the tail
    up/down was copying one byte too many. We're working in a 1024 byte
    fixed-length string buffer, and the maximum allowed input word is one
    byte shorter, so it seems this was harmless in practice.

  + Allow characters <32 and 127 in string literals.

  + `=S` can no longer result in the slice ends becoming negative and
    triggering a CONSTRAINT_ERROR (the slice is now specified to be unset
    after `=S` - see the "Snowball Language Changes" section).

  + Fix Ada code generated for string-$ which was actually partly Pascal
    code (the Ada generator was originally based on the Pascal one) and
    didn't even compile. To fix this, Snowball string variables in Ada are
    now handled the same way as the current string. This means they now
    take up more space (a fixed 1KB), but a typical Snowball program has
    either no string variables or just one so the overhead seems
    acceptable.

  + Fix matching of an empty string variable. This valid Snowball code
    would trigger "failed precondition" in Ada:

        externals (stem)
        strings (s)
        define stem as ([] ->s s)

  + Fix assumption that there's a single external called "stem".

  + Fix incorrect assumption that an among containing the empty string
    always matched, even if the empty string had a gating function. This
    construct is not used by any existing stemmers.

* Optimisations:

  + Avoid calling among helper when the among contains only strings which
    are one byte long, no among functions are used, and there are no
    actions.

* Code quality:

  + Fix indentation of generated grouping tables.

  + Rename Context to Z in runtime code.
    This now matches variable naming in the generated Ada code (and also
    the C runtime and generated C code).

  + Eliminate redundant limit check (the Skip_Utf8 helper also checks the
    limit). Looking at the history this check is a left-over from when the
    generated code directly incremented the cursor.

  + Emit Ada literal strings without redundant empty strings between
    adjacent escaped bytes.

  + Generate dummy loop around `or`, which allows us to handle a
    sub-command succeeding with Ada `exit` rather than `goto`, which seems
    clearer.

  + Avoid creating unused labels. This is just a cosmetic improvement -
    there are no longer mysterious gaps in the numbering of labels in the
    generated code.

  + Avoid generating unreachable `exit`.

* Other changes:

  + Implement support for `?` (debug command). The code we generate for
    this case is gnat-specific, but previously the code generated didn't
    compile so working with one implementation seems a step forwards. The
    `?` command can now be used to debug Ada, and someone with actual Ada
    knowledge can now more easily step in and provide a portable
    replacement.

C/C++
-----

* Bug fixes:

  + Maintain invariant that the C variable corresponding to a Snowball
    string variable is non-NULL. Previously we would release and NULL out
    the entry in some error cases, but elsewhere the code was assuming the
    value was non-NULL.

  + Fix invalid code generated for `setlimit`. This doesn't happen for
    `setlimit tomark` (which is the only way `setlimit` is used in the
    stemmers we currently ship). Bug introduced in v3.0.0.

  + Fix codegen for `hop` with constant argument. We were relying on the
    cursor being restored on failure by the code which handled that
    failure, but if that code is a repeat or atleast command then it has
    an optimisation which assumes `hop` won't do this. This means we
    generated incorrect C code for some cases where `hop` was used inside
    `repeat` or `atleast`. This doesn't functionally affect any of the
    stemmers we currently ship.
    Bug introduced in v2.1.0.

  + Fix bug in code generated when `-vprefix` is specified, introduced in
    Snowball 2.1.0.

  + Fix incorrect assumption that an among containing the empty string
    always matched, even if the empty string had a gating function. This
    construct is not used by any existing stemmers.

* Optimisations:

  + Rework how non-localised variables are stored, which eliminates an
    indirection on every access to such a variable, and also avoids some
    extra allocations (one if a stemmer has any non-localised integer or
    boolean variables, and another if the stemmer has any string
    variables). So it uses a bit less memory, it makes creating and
    destroying a stemmer faster, and it also makes stemming a bit faster
    (though only by ~0.1% for the English stemmer on our sample
    vocabulary). The `-vprefix` option now generates getter functions
    rather than using macro magic, which means the syntax for accessing
    Snowball variables from C has changed.

  + We now maintain the invariant that SN_env's p member is non-NULL,
    which simplifies the runtime code.

  + We now have a specialised implementation of the slice_del() runtime
    helper. Deleting the slice is a fairly common operation, and can be
    done more simply than via a generic replace_s() with an empty
    replacement string. This speeds up the English stemmer by about 1% on
    our test vocabulary.

  + Avoid calling among helper when the among contains only strings which
    are one byte long, no among functions are used, and there are no
    actions.

  + Only fetch SIZE() in replace_s() if we need it.

  + Don't return adjustment from replace_s() runtime helper since
    calculating the adjustment in the one caller where we actually want it
    is just one integer addition and one integer subtraction, and that
    turns out to be slightly more efficient as well as simpler.

  + Move check for negative hop from runtime to generated code. This means
    we can omit it for hop with a constant argument, which is all uses of
    hop in the stemmers we currently ship.
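
As an illustration of why a specialised slice deletion is simpler than a
generic replace with an empty string, here is a minimal C sketch. It is not
the code of the real slice_del() or replace_s() runtime helpers: `struct buf`,
`replace_range()` and `delete_range()` are hypothetical names, and a small
fixed-size buffer is assumed to keep the example short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified buffer: NOT the real Snowball runtime types. */
struct buf {
    char data[64];  /* fixed capacity keeps the sketch short */
    int len;        /* number of bytes currently in use */
};

/* Generic replace: overwrite bytes [bra, ket) with s[0..s_len). */
static void replace_range(struct buf *b, int bra, int ket,
                          const char *s, int s_len) {
    int adjustment = s_len - (ket - bra);
    assert(0 <= bra && bra <= ket && ket <= b->len);
    assert(b->len + adjustment <= (int)sizeof b->data);
    /* Move the tail to its new position, then copy in the replacement. */
    memmove(b->data + ket + adjustment, b->data + ket,
            (size_t)(b->len - ket));
    if (s_len > 0) memcpy(b->data + bra, s, (size_t)s_len);
    b->len += adjustment;
}

/* Specialised delete: there is no replacement to copy and no capacity
 * check is needed, because the tail can only move down. */
static void delete_range(struct buf *b, int bra, int ket) {
    assert(0 <= bra && bra <= ket && ket <= b->len);
    memmove(b->data + bra, b->data + ket, (size_t)(b->len - ket));
    b->len -= ket - bra;
}
```

Both functions give the same result when the replacement is empty, but the
delete path does less work per call, which is the kind of saving the
slice_del() entry above describes.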

* Code quality:

  + The generated header is now included from the generated C/C++ source
    file (which seems cleaner than the previous approach of generating the
    same prototypes in the header and source file).

  + The implementation of among functions has changed. Previously we
    stored a function pointer in struct among, but that requires
    relocation when the code is in a dynamic library, which adds load-time
    overhead and means the among structures can't be put in a read-only
    section. We now store an integer index instead, and pass in a pointer
    to a dispatcher function when calling the find_among()/find_among_b()
    helper which gets called when this index is non-zero. The value of the
    index is stored in z->af so the dispatcher function can use it. If
    only one unique function is used in an among, we can just pass this to
    find_among() as the dispatcher which reduces the overhead for this
    common case. Profiling with cachegrind suggests this change adds a
    small overhead to algorithms which use among functions - currently
    finnish and hindi (and also lovins, but that's really only of academic
    interest and is not enabled by default).

  + Avoid long string in C source. C90 only guarantees support for literal
    strings up to 509 characters. Fixes GCC -Woverlength-strings warning.

  + Avoid C23 feature in C runtime code, introduced in Snowball 3.0.0.
    Initialising with empty braces was only standardised in C23 (though
    seems to be widely supported as an extension).

  + Fix code generated for `setlimit` to be C90. Bug introduced in v3.0.0,
    but isn't triggered by any of the stemmers we currently ship.

  + Fix -Wshadow warning for nested string-$ use. We were generating code
    using a C variable with the fixed name `failure` - now an integer
    suffix is appended, and we only emit the variable in cases where the
    subcommand signal isn't known at compile time.

  + Generate `do {`...`} while (0)` around `or` code, which allows us to
    handle a sub-command succeeding with `break;` rather than having to
    use `goto`, which reduces the number of labels used and makes the
    generated code a bit easier to follow.

  + C comments are now generated for `(` and `do` when `-comments` is
    used.

  + We now generate `+=` or `-=` for `hop ` (instead of something like
    `z->c = z->c + 2`). The C compiler should treat both the same, but it
    arguably makes the generated code a little clearer.

* Other changes:

  + C++: The `-c++` option used to generate exactly the same code as for
    C, except with extension `.cc` instead of `.c` but now:

    - C++ classes are generated.

    - C++ `bool` is used for Snowball booleans.

    - Loop variables are declared inside `for (`...`)`.

    - Allocation failures and internal errors (e.g. slice_check() failing)
      throw a C++ exception - this is a bit simpler and more efficient
      than the C code approach of returning -1 which then has to be
      checked for and propagated through the generated code.

  + Snowball's debug command (`?`) now works out of the box (previously
    you had to adjust a `#if 0` preprocessor conditional in the runtime
    code).

  + Rename `runtime/header.h` (which really seems too generic, and is also
    easy to confuse with `compiler/header.h`) to
    `runtime/snowball_runtime.h`. We expect most users will be using the C
    stemmers through libstemmer and so won't be affected by this.

C#
--

* Bug fixes:

  + Fix code generated for `<-s`. This is not used by any of the stemmers
    we currently ship. Test case based on one from ajroetker in #270.

  + Fix code generated for string-$. This feature is not used by any of
    the stemmers we currently ship.

  + Fix assumption that there's a single external called "stem".

* Optimisations:

  + Use Debug.Assert() in slice_check() runtime helper.
    Previously the runtime code wrote a diagnostic message and continued
    if one of these checks failed, but failures should only happen with a
    Snowball program containing logic errors, or for bugs in the Snowball
    compiler or its runtime (or possibly in the C# compiler, runtime, OS,
    hardware, etc). Therefore an assertion seems an appropriate choice,
    and means the check is not enabled for a production build, which seems
    more helpful overall. See #242.

* Code quality:

  + Eliminate duplicates from groupings. We currently implement these for
    C# with a linear string search, and a side-effect of this change is
    that the grouping string is now sorted, which will affect the time
    taken to look up different characters in an arbitrary way (none of the
    Snowball sources seem to try to list characters in frequency order).
    Really C# should be fixed to use an O(1) lookup like other target
    languages.

  + The implementation of among functions has changed. We now store an
    integer index in the Among class, and pass a dispatcher function to
    the among helper method. If only one unique function is used in an
    among, we can just pass this to the helper method as the dispatcher
    which reduces the overhead for this common case. Crude profiling with
    `time make check_csharp` suggests this doesn't harm performance
    (perhaps a little faster, but maybe just within the noise). The main
    benefit is all Among arrays can now be static, which previously wasn't
    possible for those which used among functions (#146).

  + Remove unused return value from Stemmer.Replace() runtime helper.

  + Fix inaccurate doc comments on runtime functions.

* Other changes:

  + csharp_stemwords: Speed up output to stdout.

  + csharp_stemwords: Don't write the chosen stemmer to stdout. This is
    not really useful information, and breaks sending the stemmed words to
    stdout because they're preceded by extra output.

  + csharp_stemwords: Try to open input before output so we don't leave an
    empty output file behind if we can't open the input file.
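
The integer-index dispatch described above for the C and C# among
implementations can be sketched in C roughly as follows. This is only an
illustration under simplifying assumptions: `struct ctx`, `struct entry`,
`dispatch()` and `find_suffix()` are hypothetical names, the table is searched
linearly, and the first matching suffix wins, none of which is claimed to
match the real find_among() helper. Only the use of a field named `af` to pass
the index to the dispatcher follows the description in these notes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types: NOT the real struct among or SN_env. */
struct ctx {
    const char *word;  /* word being examined */
    int af;            /* index of the among function being dispatched */
};

struct entry {
    char s[8];     /* string to match, stored inline (no pointer) */
    int result;    /* value reported on a match */
    int function;  /* 0 = no gating function, else dispatcher index */
};

typedef int (*dispatcher)(struct ctx *z);

/* One dispatcher per table: it selects the gating code via z->af. */
static int dispatch(struct ctx *z) {
    switch (z->af) {
        case 1: return strlen(z->word) > 4;  /* illustrative condition */
    }
    return 0;
}

/* The table contains no pointers to code, so it needs no relocation. */
static const struct entry suffixes[] = {
    { "ing", 1, 1 },  /* gated by function index 1 */
    { "s",   2, 0 },  /* not gated */
};

/* Return the result of the first entry which is a suffix of the word and
 * whose gating function (if any) succeeds, or 0 if there is none. */
static int find_suffix(struct ctx *z, const struct entry *v, int n,
                       dispatcher call) {
    size_t len = strlen(z->word);
    int i;
    for (i = 0; i < n; i++) {
        size_t slen = strlen(v[i].s);
        if (slen > len) continue;
        if (memcmp(z->word + len - slen, v[i].s, slen) != 0) continue;
        if (v[i].function != 0) {
            assert(call != NULL);
            z->af = v[i].function;  /* tell the dispatcher which one */
            if (!call(z)) continue;
        }
        return v[i].result;
    }
    return 0;
}
```

Because the table holds only integers and inline bytes, it can be fully
constant data, which is the property the C entry attributes to replacing the
function pointer with an index.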

Go
--

* Bug fixes:

  + Fix code generated for non-constant hop. A non-constant integer
    expression has type `int` in the generated Go code, but the hop
    helpers expected `int32`. For a constant hop this worked because Go
    integer literals are untyped, so will convert to `int32`. To fix this,
    the helpers now take `int` instead of `int32`.

  + Fix code generated if `minint` or `maxint` is used. In this case we
    were generating `use std::usize;` near the start of the Go code, but
    that's actually Rust code and a hangover from the Go backend being
    originally based on the Rust one.

  + The Go code generated for `->` was incorrectly signalling `f` if the
    slice was empty. Luckily this case is not exercised by any current
    algorithms. See #242.

  + Fix code generated for string-$ (which isn't used by any of the
    algorithms we currently ship).

  + A Snowball `external` could not previously be called from within the
    Snowball program. This is allowed by the Snowball language, but none
    of the shipped stemmers do this, and it's unlikely any stemmer would.
    Perhaps it's useful if you use Snowball for other string-processing
    tasks.

  + Fix handling of `minint` and `maxint` - we were generating some code
    copied verbatim from the Rust generator for this case which was not
    valid Go. (These are not used by any of the algorithms we currently
    ship.)

* Optimisations:

  + Reuse `env` in stemwords which is measurably faster than creating a
    new one for every word.

* Code quality:

  + Eliminate unnecessary semicolons from generated code.

  + Fix formatting of generated code. The code gets run through gofmt
    which was fixing up these issues, but better to generate the code
    cleanly to start with. The only things which gofmt now changes are
    that it indents variable names to align in adjacent variable
    declarations, and a couple of things which are apparently for
    compatibility with older versions of Go.

  + Runtime helpers SliceDel() and SliceFrom() always returned true, but
    the generated code included failure checks in case false was returned.
    These helpers no longer return anything, and the checks are gone.

* Documentation:

  + Recommend that users reuse an `env` since this is measurably faster
    than creating a new one for every word.

* Other changes:

  + Remove `-gopackage` option from compiler. Use `-package`/`-P` instead
    (`-gopackage` has just been an alias for these since Snowball 2.0.0).

Java
----

* Bug fixes:

  + Generate correct Java code for ASCII control chars in string literals.

  + Fix code generated for string-$. As part of this fix, we now use
    char[] for string variables as well as the current string, which makes
    it much simpler to switch to working on a string variable and back.
    Fixes #252.

  + Fix assumption that there's a single external called "stem".

* Other changes:

  + The generated Java classes no longer implement Serializable. This
    support was added in 2016, but in 2026 this approach to serialization
    in Java is apparently no longer used due to security problems. Fixes
    #255.

Javascript
----------

* Bug fixes:

  + Fix `->` to work when the slice is empty - previously it incorrectly
    signalled `f` for this case. Luckily this case is not exercised by any
    current algorithms (#242).

  + Generate public functions for all externals. Patch from simlrh (#258).

  + Fix code generated for string-$.

* Optimisations:

  + Use startsWith()/endsWith() in eq_s()/eq_s_b(). This is quite a bit
    faster as it avoids slice() creating a temporary string (e.g. measured
    a reduction of ~17% wallclock time for tamil on the test vocabulary,
    taking the fastest of 5 runs before and after).

  + Optimise among when all actions are `<-` with a literal string. We now
    generate a single call to slice_from() with the argument obtained by
    indexing into an array of literal strings. This is perhaps faster,
    albeit not by much, but it definitely results in smaller code, which
    is helpful for in browser use.
    See #227.

  + The substring_i member in the Among class is now an offset from the
    current index, and now zero in the common case where there's not
    another string which is a sub-prefix/sub-suffix. We've also swapped
    the order of elements so we can omit this in the common case when it
    is zero and there's no among function. This reduces the size of the
    generated Javascript code (even after minification). Fixes #236.

  + Change slice_check() to assert its conditions. In C we must not
    perform string slicing if slice_check() fails because that could
    result in writing outside of the allocated buffer, but it's not
    problematic in this way for Javascript, and the situations which
    slice_check() checks for should only happen with a Snowball program
    containing logic errors, or for bugs in the Snowball compiler or its
    runtime (or possibly in the Javascript interpreter, OS, hardware,
    etc). Therefore assert() seems an appropriate choice.

* Code quality:

  + Convert to using Javascript modules and classes. The way among
    functions are called has been reworked to allow this, copying the
    approach now used for C and C# (#234, #240). Patches from Adam Turner
    and Titus Ng.

  + Adjust generated code to work with deno, and suppress a few deno
    warnings which are hard to avoid in generated code.

  + Avoid generating blocks around failure handling. The failure handling
    code is always a single statement (and if we ever needed more than a
    statement for some situation then we could arrange to add a block for
    just those situations). This significantly reduces the size of the
    generated JS code.

  + Always inline code for `=>`. The code is not much longer than the call
    to a helper function in BaseStemmer. Also in 3.0.0 we deprecated `=>`
    and nothing we ship contains this command, so removing it from
    BaseStemmer reduces the total code size a little.

  + Rename BaseStemmer's internal `cursor` property to `c`.
Unfortunately, `cursor` is a DOM property, so Javascript minifiers are cautious about renaming it to avoid breaking code. The name `c` matches the naming we use for C, Ada and Pascal. + Generate smaller code for hop by constant. All current uses of hop in the stemmers we ship have a constant argument, so avoid using a temporary variable in these cases. + Optimise `+=1` to `++`, `-=1` to `--`. These are a byte shorter, and it seems Javascript minifiers don't do this for us because it's not a safe transformation unless the minifier can deduce that the variable can't hold a string. + Improve temporary var naming and use. These variables don't need unique generated names now we're declaring them as `const` which has more sensible scoping rules than `var`. + Generate smaller code for `insert` and string-`=`. In some cases we know we have the value of member variable `this.cursor` in local `const c` so use the latter instead. + Use triple equality for JavaScript. Patch from Adam Turner. + Fix position of grouping type comment which is now placed consistently with other type comments. + Use `a` instead of `among_var` in generated code. This reduces the size of the generated code, which is helpful if a minification step isn't being used. + Consistently cuddle braces in runtime code. The style wasn't entirely consistent before, and cuddling braces matches the generated Javascript code and the Snowball C code. + Generate block around case to bound the scope of `const` and `let` within the case. + Use `let` in README example. + Use `let` consistently in stemwords.js. + Initialise integer Snowball variables - we annotate them as being type "number" so we shouldn't let them have value undefined. Patch from Adam Turner. + Improve/fix typescript annotations in runtime and generated code. + Annotate runtime with @ts-expect-error. It doesn't seem to be possible to express the types fully in some places, but the invariants we require are ensured by the Snowball compiler. 
Annotating the expected errors allows unexpected type checking errors to be more easily seen, and they are now fatal in CI. + Use `===` and `!==` in stemwords.js. Patch from Adam Turner. * Other changes: + Make stemmer subclasses anonymous and export them by default. This makes creating a stemmer object easier as you only need to build the filename of the stemmer subclass, and not also its class name. + Adjust interpretation of `-parentclassname` option. We supply the JS snowball runtime so being able to specify a different base class name doesn't seem very useful, so instead interpret this as the name to import the base class as in generated stemmers. It now defaults to just `B` which reduces the size of the generated stemmer code a little (even after running it through most Javascript minification tools). + Improve stemwords.js option parsing. Make `-i` and `-o` optional to match other target language versions of stemwords. Eliminate the check that there are at least 3 command line arguments as we don't require any now. If we encounter an argument we don't understand, we now report it and show the usage message (previously we silently ignored it). We now exit with status 1 if there's a problem parsing the command line. + stemwords.js: Emit help message in one console.log. Patch from Titus Ng (#221). Pascal ------ * Bug fixes: + We were generating invalid Pascal code when tail-calling or calling a routine which always fails. Neither case is currently exercised by any stemmers we ship and generate Pascal code for (the Pascal generator currently only supports iso-8859-1). + Fix code generated for string-$ (which isn't used by any of the algorithms we currently ship). + Fix assumption that there's a single external called "stem". * Code quality: + Merge EqS and EqV runtime functions. We can get the length of a Pascal AnsiString `s` cheaply with `Length(s)` so there isn't a need to pass in the length in the string literal case.
+ Eliminate `While` in code generated for `repeat`/`atleast`. Pascal lacks `Continue` (at least as a standard feature) and this loop only exists so we can jump back to its start with `continue` in other languages - we have a `Break;` at its end so it doesn't loop in the normal way. In Pascal we generate a label before the loop and use `goto` to continue iterating, so we can get rid of the Pascal loop entirely. + Use `Break` instead of `Goto` in code generated for `go`/`gopast`. + Generate dummy loop around `or` so we can handle a sub-command succeeding with Pascal `Break` rather than `Goto`, which seems clearer. + Avoid generating `Repeat` ... `Until True` dummy loops which are not actually needed. + Fix problem introduced in v3.0.0 with formatting of code generated for `go`/`gopast` applied to a grouping. + Switch to a simpler name mangling system. Pascal variable names are case-insensitive but Snowball names are case-sensitive. We used to address this by encoding the case of letters into a prefix on the name but that can generate long and ugly names in some cases (e.g. integer Foo_Bar -> IUllU_Foo_Bar). We now avoid collisions by adding a counter after the type code for the second and subsequent names that differ only by case (so Foo_Bar is only mangled if there's another integer which differs only by case which is declared before it, and even then just becomes something like I2_Foo_Bar). + Emit Pascal literal strings without redundant empty strings between adjacent escaped bytes. + The -comments option now includes the values of string literals, so has been changed to generate "rest of line" comments (starting `//`) rather than block comments (delimited by `{` ... `}`) so that string literals containing `}` don't need escaping. We were already using `//` comments in the Pascal runtime so this shouldn't harm portability. Python ------ * Bug fixes: + Fix `algorithms()` when forwarding to PyStemmer. 
It looks like this has never worked as the code has been like this since it was merged, and we were forwarding to a method which PyStemmer doesn't provide and never seems to have provided. + stemwords.py: Make -i and -o optional. The command syntax already suggested they were, but actually we gave an error if they were omitted. + Fix code generated for string-$ (which isn't used by any of the algorithms we currently ship). + Fix `->` to work when the slice is empty - previously it incorrectly signalled `f` for this case. Luckily this case is not exercised by any current algorithms (#242) + Remove deprecated licence classifier which now triggers a deprecation warning from Python's setuptools. We already specify the licensing in the now preferred way via `license=` with a SPDX licence expression. * Optimisations: + Optimise single-character string literal checks in the same way we already do for C. This seems to be measurably faster (tested with Turkish which has lots of single character literal tests). + Groupings are now implemented via a Python set, or a string for small groupings. + Eliminate use of exception in code generated for `or`. We can instead wrap the code in a loop and use `break`. + Eliminate use of exception in `goto` and `gopast`. We can just use `break` here to exit the `while` loop we're also inside and move the `except` from the previous `try` onto the `while`. + Avoid using a temporary for `hop` with a constant argument as benchmarking with timeit shows this is faster. + Optimise string test by using startswith()/endswith() with suitable start/end parameters which avoids creating a temporary substring and avoids an explicit limit check. This speeds up artificial testcases consisting of `goto 'the'` by 10%. + Optimise among when all actions are `<-` with a literal string. We now generate a single call to slice_from() with the argument obtained by indexing into an array of literal strings. See #227. 
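The startswith()/endswith() string test described above can be sketched as follows. This is only an illustration under stated assumptions: `eq_s` and `eq_s_b` are written here as hypothetical standalone functions taking the current string, cursor and limit as arguments, which is not necessarily how the Snowball Python runtime structures them.

```python
def eq_s(current: str, cursor: int, limit: int, literal: str) -> bool:
    """Forward test: is `literal` at `cursor`, without reading past `limit`?"""
    # str.startswith(prefix, start, end) tests against current[start:end]
    # without building that substring, and is False when fewer than
    # len(literal) characters remain before `limit`, so no separate
    # limit check is needed.
    return current.startswith(literal, cursor, limit)


def eq_s_b(current: str, cursor: int, limit_backward: int, literal: str) -> bool:
    """Backward test: does `literal` end at `cursor`, staying after `limit_backward`?"""
    return current.endswith(literal, limit_backward, cursor)
```

For example, `eq_s("breathe", 4, 7, "the")` succeeds, while lowering the limit to 6 makes the same test fail because only two characters remain.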
+ Reduce overhead of code to forward to PyStemmer, both when forwarding and when using the pure Python stemmers. + Reuse exception classes much more. This reduces the number of labN classes we need by 142 over all the current stemmers. + Change slice_check() to assert its conditions. In C we must not perform string slicing if slice_check() fails because that could result in writing outside of the allocated buffer, but it's not problematic in this way for Python, and the situations which slice_check() checks for should only happen with a Snowball program containing logic errors, or for bugs in the Snowball compiler or its runtime (or possibly in the Python interpreter, OS, hardware, etc). Therefore assert() seems an appropriate choice. * Code quality: + Use _ as dummy loop variable. We don't use the loop variable's value, and the loop itself tracks the current iteration so generating nested loops using `_` as the loop variable works correctly. + Avoid mysterious gaps in the numbering of variables in the generated code. This was already done for the other languages, but I missed Python it seems. + Avoid generating unused lab0 class for a Snowball program which doesn't use any failure labels. + Avoid generating a blank line at start of the body of a Snowball `loop`. + stemwords.py: Replace deprecated `codecs.open()` with built-in `open()`. Patch from Dmitry Shachnev. * Documentation: + Remove unnecessary semicolons from Python code in docs. * Other changes: + Remove Python 2 support. We stopped officially supporting it in Snowball 2.1.0, but now we've actually stripped out support. Versions of Python ≥ 3.3 continue to be supported. Patch from Dmitry Shachnev (#212). Rust ---- * Bug fixes: + Fix code generated for string-$ (which isn't used by any of the algorithms we currently ship). + A snowball `external` could not previously be called from within the Snowball program. 
This is allowed by the Snowball language, but none of the shipped stemmers do this, and it's unlikely any stemmer would, but perhaps it's useful if you use Snowball for other string-processing tasks. + The generated code previously treated an empty string returned by slice_to() as an error, but this was buggy since if the slice is empty the return value will be an empty string. The helper doesn't try to signal an error with an empty string so we can just drop this check. Luckily this case is not exercised by any current algorithms. See #242. + Fix incorrect assumption that an among containing the empty string always matched, even if the empty string had a gating function. This construct is not used by any existing stemmers. * Optimisations: + Avoid calling among helper when the among contains only strings which are one byte long, no among functions are used, and there are no actions. * Code quality: + Fix formatting of code generated for `goto`/`gopast` applied to a grouping or inverted grouping. This is just a cosmetic problem - functionally it was correct. The poor formatting was introduced in v3.0.0. + Runtime helpers slice_del() and slice_from() always returned true, but the generated code included failure checks in case false was returned. These helpers no longer return anything, and the checks are gone. + Generate space after condition in integer test (purely cosmetic). New Code Generators ------------------- * Add Dart generator from Ryan Heise (#156, #250). * Add PHP generator from Tim Whitlock and Olly Betts (#243). Requires PHP 8.3 or later, which allows us to use typed class constants. * Add Zig backend from AJ Roetker. Requires Zig 0.16.0 or later. Snowball Language Changes ------------------------- * `delete` and `<-` now update the slice end. The manual said that after `[` and `]` "the slice ends will retain the same values until altered", which doesn't make it clear what happens for operations which modify the text the slice ends are in. 
The existing handling here was inconsistent between commands: `delete` and `<-` left the slice ends on the same numeric positions, while `attach` and `insert` adjusted the slice ends to leave the slice marking the equivalent substring of the updated string. When working in UTF-8 the slice end could end up in the middle of a multi-byte character after `delete` or `<-`, which seems especially undesirable. I talked this over with Martin Porter and we've agreed that it makes sense for `delete` and `<-` to also update the slice ends (in fact only the right end needs adjusting) and I've clarified the wording in the manual. Existing algorithms we ship don't rely on what the slice is set to after these commands. * The slice is now specified to be unset after `=S` (so the same state as at the start of the program). Previously Snowball attempted to adjust the slice after `=S`, but there isn't an obvious adjustment in general because it can replace part of the content of the slice. Martin said he'd not thought of this case, and we've concluded it's best to adjust the Snowball language definition. New stemming algorithms ----------------------- * Add Czech stemmer from Olly Betts and Jim O’Regan (#151). * Add Persian (Farsi) stemmer from Saeid Darvish (#181). * Add Polish stemmer from Dmitry Shachnev (#245). * Add Sesotho stemmer from Kamohelo Lebjane (#260). Behavioural changes to existing algorithms ------------------------------------------ * Danish: + Adjust to handle apostrophe (#187). + Restrict undoubling to valid cases. Coverage showed that a number of the consonants we would undouble never occur in our Danish vocabulary. Testing a larger list didn't find any matches for Danish words either, so restrict the undoubling, which reduces the potential for damage to foreign words and should be a little more efficient. * English: + Restore exception for `skis` so it stems to `ski`. This reverts a change made erroneously in Snowball 3.0.0.
+ Improve the stemming of some words starting `inter`: - We now avoid conflating intern, internal, international and internment. - We now conflate interfere/interferes/interference with interfered/interfering. - The stem of `interval` is now `interval` rather than `interv`, which is mostly a cosmetic change as no unrelated words stem to `interv`. * Estonian: + Handle apostrophe (#187). * Finnish: + Handle apostrophe (#187). + Improve fallback from illative rules. If a word ends -han, -hen, -hin, -hon, -hän or -hön but the vowel before does not match, we were not removing a suffix in case_ending; we now fall back to handling it as a genitive and remove -n. This changes how we handle about 90 words - almost all for the better, and most of the rest seem neutral changes. + Allow "ø" to match with -hön as this is seen with Norwegian place names, e.g. Bodøhön. + Remove illative form -hun. This improves the stemming of 14 words in our test vocabulary. * German: + Handle apostrophe (#187). * Italian: + Handle elisions (#187). * Lithuanian: + Don't remove -er- before normal suffixes. These aren't real grammatical suffixes and seem to have been included mainly to try to conflate ancient forms of the Lithuanian word for "sister" (e.g. "sesers") with modern forms (e.g. "sesė"). We weren't even doing a complete job there however as "seserimis" and "seseris" were not handled. Removing these suffixes entirely means we no longer try to conflate the ancient and modern forms here, but at least all the forms of the old word get grouped, as do all forms of the new word. The stemming for ~150 other words is also improved, without obvious downsides. Patch from Justas Sakalauskas (#263). + Remove trailing apostrophe as final step - an apostrophe is sometimes used to separate a Lithuanian ending from an international word (#187). * Norwegian: + Adjust to handle apostrophe (#187). * Polish: + Remove optional apostrophe after removing suffix.
Polish uses an apostrophe to separate loanwords from native suffixes. (The correct use is to mark the elision of the final sound of a loanword before a Polish inflectional ending, but it's also often used with any loanword) (#187). Optimisations to existing algorithms ------------------------------------ * English: + Optimise -eed, -eedly handling by performing the much cheaper R1 check before the among of exceptional cases. * Esperanto: + Eliminate use of among functions. It's easy to avoid them, and they come with a performance overhead in some target languages. For C, the new version is 0.09% faster (from cachegrind estimated cycle count). * Indonesian: + Avoid use of among functions, which gives a 1.9% speed up for C (from cachegrind estimated cycle count). * Lithuanian: + Minor simplification/optimisation by relying on Snowball restoring the cursor on failure. * Turkish: + Simplify `not test C` to just `not C`. If C succeeds, then the `not` fails and the cursor will get restored by whatever handles that signal. Code clarity improvements to existing algorithms ------------------------------------------------ * Finnish: Rename `V1` and `LONG` to match the names used in the algorithm description on the website. * Italian: Eliminate use of legacy among starter. Build system ------------ * The default flags used with `ar` are now `-cr` instead of `-cru`. Many Linux distros configure `ar` to use option `D` (deterministic mode) by default, which was triggering a warning that option `u` is ignored. Option `u` is just a minor optimisation for the case where the archive already exists and only some object files have changed, so it seems best to just not try to use it and avoid the warning.
Make variable `ARFLAGS` can now be used to specify flags to use with `ar`, so if you want to continue using `-cru`, you can use: make ARFLAGS=-cru If `D` is on by default in your `ar`, you'll actually want: make ARFLAGS=-cruU * Add comment documenting how to use iconv.py (simple pure-Python alternative which allows running the testsuite without iconv installed). * `make clean` now removes all built files for all target languages, and is now tested by CI to ensure this doesn't regress. * Make "make check_utf8" parallel-safe by avoiding writing the stemmed output to disk by default (except for Arabic). To get the output saved as tmp.txt on error for debugging you can now use: `make SAVETMP=1 check_utf8`. Patch from Adam Turner (#237, #238). * Ada: Fix parallel build by adding missing dependency from .adb to the corresponding .sbl file (#237, #238). * Go: Use `$(go)` for `go generate` as well. * Python: Omit output "(THIN_FACTOR=)" if set empty. * Add SNOWBALL_FLAGS, intended to allow passing options such as `-comments` and `-coverage` during development and debugging. * Add make targets to assist comparing generated code before and after a compiler change: `baseline-create`, `generate` and `baseline-diff`. * We now have CI testing that the Snowball compiler builds as C99 (we were already testing that the generated C code builds as C90). Fixes #283, reported by Domingo Alvarez Duarte. Testsuite --------- * New testsuite for the Snowball compiler which tests parsing, errors and warnings. * New runtime testsuite which tests the implementation of Snowball language features in each supported target language. These provide something much more like a proper set of unit tests rather than relying on checking all the algorithms produce the expected output to validate all the target language generators. These tests are run with -comments on to provide some test coverage for this option. Fixes #157. * stemtest: Add more number testcases, relocated to here from finnish/voc.txt. 
They're better by stemtest as we want to avoid any stemmer damaging numbers, and testcases here can easily be run for all stemmers. Snowball 3.0.1 (2025-05-09) =========================== Python ------ * The __init__.py in 3.0.0 was incorrectly generated due to a missing build dependency and the list of algorithms was empty. First reported by laymonage. Thanks to Dmitry Shachnev, Henry Schreiner and Adam Turner for diagnosing and fixing. (#229, #230, #231) * Add trove classifiers for Armenian and Yiddish which have now been registered with PyPI. Thanks to Henry Schreiner and Dmitry Shachnev. (#228) * Update documented details of Python 2 support in old versions. Snowball 3.0.0 (2025-05-08) =========================== Ada --- * Bug fixes: + Fix invalid Ada code generated for Snowball `loop` (it was partly Pascal!) None of the stemmers shipped in previous releases triggered this bug, but the Turkish stemmer now does. + The Ada runtime was not tracking the current length of the string but instead used the current limit value or some other substitute, which manifested as various incorrect behaviours for code inside of `setlimit`. + `size` was incorrectly returning the difference between the limit and the backwards limit. + `lenof` or `sizeof` on a string variable generated Ada code that didn't even compile. + Fix incorrect preconditions on some methods in the runtime. + Fix bug in runtime code used by `attach`, `insert`, `<-` and string variable assignment when a (sub)string was replaced with a larger string. This bug was triggered by code in the Kraaij-Pohlmann Dutch stemmer implementation (which was previously not enabled by default but is now the standard Dutch stemmer). + Fix invalid code generated for `insert`, `<-` and string variable assignment. This bug was triggered by code in the Kraaij-Pohlmann Dutch stemmer implementation (which was previously not enabled by default but is now the standard Dutch stemmer). 
+ Generate valid code for programs which don't use `among`. This didn't affect code generation for any algorithms we currently ship. + If the end of a routine was unreachable code the Snowball compiler would think the start of the next routine was also unreachable and would not generate it. This didn't affect code generation for any algorithms we currently ship. * Code quality: + Only declare variables A and C when each is needed. + Fix indentation of generated declarations. + Drop extra blank line before `Result := True`. C/C++ ----- * Bug fixes: + Fix potential NULL dereference in runtime code if we failed to allocate memory for the p or S member for a Snowball program which uses one or more string variables. Problem was introduced in Snowball 2.0.0. Fixes #206, reported by Maxim Korotkov. + Fix invalid C code generated when a failure is handled in a context with the opposite direction to where it happened, for example: externals (stem) define stem as ( try backwards 'x' ) This was fixed by changing the C generator to work like all the other generators and pre-generate the code to handle failure. + Eliminate assumptions that NULL has all-zero bit pattern. We don't know of any current platforms where this assumption fails, but the C standard doesn't require an all-zero bit pattern for NULL. Fixes #207. * Optimisations: + Store index delta for among substring_i field. This makes trying substrings after a failed match slightly faster because we can just add the offset to the pointer we already have to the current element. * Code quality: + Improve formatting of generated code. C# -- * Bug fixes: + Add missing runtime support for testing for a string var at the current position when working forwards. This situation isn't exercised by any of the stemming algorithms we currently ship. + Adjust generated code to work around a code flow analysis bug in the `mcs` C# compiler. * Code quality: + Prune unused `using System.Text;`. + Generate C# with UTF-8 source encoding. 
This makes the generated code easier to follow, which helps during development. It's also a bit smaller. For now codepoints U+0590 and above are still emitted as escape sequences to avoid confusing source code rendering when RTL scripts are involved. Go -- * Optimisations: + Drop some unneeded Go code generated for string `$`. None of the shipped stemmers use string `$`, though the Schinke Latin stemmer algorithm on the website does. * Code quality: + Dispatch among result with `switch` instead of an `if` ... `else if` chain (which looks like we did because the Go generator evolved from the Python generator and Python didn't use to have a switch-like construct). This doesn't make a measurable speed difference so it seems the Go compiler is optimising both to equivalent code, but using a switch here seems clearer, a better match for the intent, and is a bit simpler to generate. + Generate Go with UTF-8 source encoding. This makes the generated code easier to follow, which helps during development. It's also a bit smaller. For now codepoints U+0590 and above are still emitted as escape sequences to avoid confusing source code rendering when RTL scripts are involved. Java ---- * The Java code generated by Snowball now requires Java >= 7. Java 7 was released in 2011, and Java 6's EOL was 2013 so we don't expect this to be a problematic requirement. See #195. * Optimisations: + We now store the current string in a `char[]` rather than using a `StringBuilder` to reduce overheads. The `getCurrent()` method continues to return a Java `String`, but the `char[]` can be accessed using the new `getCurrentBuffer()` and `getCurrentBufferLength()` methods. Patch from Robert Muir (#195). + Use a more efficient mechanism for calling `among` functions. Patch from Robert Muir (#195). * Code quality: + Consistently put `[]` right after element type for array types, which seems the most used style. + Fix javac warnings in SnowballProgram.java.
+ Improve formatting of generated code. Javascript ---------- * Bug fixes: + Use base class specified by `-p` in string `$` rather than hard-coding `BaseStemmer` (which is the default if you don't specify `-p`). None of the shipped stemmers use string `$`, though the Schinke Latin stemmer algorithm on the website does. * Code quality: + Modernise the generated code a bit. Loosely based on changes proposed in #123 by Emily Marigold Klassen. * Other changes: + The Javascript runner is now specified by make variable `JSRUN` instead of `NODE` (since node is just one JS implementation). The default value is now `node` instead of `nodejs` (older Debian and Ubuntu packages used `/usr/bin/nodejs` because `/usr/bin/node` was already in use by a completely different package, but that has since changed). Pascal ------ * Bug fixes: + Add missing semicolons to code generated in some cases for a function which always succeeds or always fails. The new dutch.sbl was triggering this bug. + If the end of a routine was unreachable code the Snowball compiler would think the start of the next routine was also unreachable and would not generate it. This didn't affect code generation for any algorithms we currently ship. * Code quality: + Eliminate commented out code generated for string `$`. None of the shipped stemmers use string `$`, though the Schinke Latin stemmer algorithm on the website does. * Other changes: + Enable warnings, etc from fpc. + Select GNU-style diagnostic format. Python ------ * Optimisations: + Use Python set for grouping checks. This speeds up running the Python testsuite by about 4%. + Routines used in `among` are now referenced by name directly in the generated code, rather than using a string containing the name. This avoids a `getattr()` call each time an among wants to call a routine. This doesn't seem to make a measurable speed difference, but it's cleaner and avoids problems with name mangling. Suggested by David Corbett in #217. 
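The difference between referencing an among routine by a string name and referencing it directly can be sketched like this. `ToyStemmer`, `r_long_enough` and the tuple layout are hypothetical names invented for illustration; they are not taken from the generated Snowball code.

```python
class ToyStemmer:
    """Hypothetical stand-in for a generated stemmer class."""

    def __init__(self, word):
        self.word = word

    def r_long_enough(self):
        # Example gating routine: succeeds for words longer than 3 characters.
        return len(self.word) > 3

    # Older style: the entry stores the routine's name as a string,
    # so each call needs a getattr() lookup on the instance.
    by_name = ("ing", "r_long_enough")

    # Newer style: the entry stores the function object itself.
    by_ref = ("ing", r_long_enough)

    def check_by_name(self):
        suffix, routine_name = self.by_name
        return self.word.endswith(suffix) and getattr(self, routine_name)()

    def check_by_ref(self):
        suffix, routine = self.by_ref
        # The stored function is unbound, so pass the instance explicitly.
        return self.word.endswith(suffix) and routine(self)
```

Both `check_by_name()` and `check_by_ref()` give the same answer for a given word; the second simply avoids the name lookup on each call.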
+ Simplify code generated for `loop`. If the iteration count is constant and at most 4 then iterate over a tuple which microbenchmarking shows is faster. The only current uses of loop in the shipped stemmers are `loop 2` so they benefit from this. Otherwise we now use `range(AE)` instead of `range(AE, 0, -1)` (the actual value of the loop variable is never used so only the number of iterations matters). * Bug fixes: + Correctly handle stemmer names with an underscore. * Code quality: + Generate Python with UTF-8 source encoding. This makes the generated code easier to follow, which helps during development. It's also a bit smaller. For now codepoints U+0590 and above are still emitted as escape sequences to avoid confusing source code rendering when RTL scripts are involved. * Other changes: + Set python_requires to indicate to install tools that the generated code won't work with Python 3.0.x, 3.1.x and 3.2.x (due to use of `u"foo"` string literals). Closes #192 and #191, opened by Andreas Maier. + Add classifiers to indicate support for Python 3.3 and for 3.8 to 3.13. Fixes #158, reported by Dmitry Shachnev. + Stop marking the wheel as universal, which had started to give a warning message. Patch from Dmitry Shachnev (#210). + Stop calling `setup.py` directly which is deprecated and now produces a warning - use the `build` module instead. Patch from Dmitry Shachnev (#210). Rust ---- * Optimisations: + Shortcut unnecessary calls to find_among, porting an optimization from the C generator. In some stemming benchmarks this improves the performance of the Rust English stemmer by about 27%. Patch from jedav (#202). * Code quality: + Suppress unused_parens warning, for example triggered by the code generated for `$x = x*x` (where `x` is an integer). + Dispatch `among` result with `match` instead of an `if` ... `else if` chain (which looks like we did because the Rust generator evolved from the Python generator and Python didn't use to have a switch-like construct).
This results in a 3% speed-up for an unoptimised Rust compile but doesn't seem to make a measurable difference when optimising so it seems the Rust compiler is optimising both to equivalent code. However, using a `match` here seems clearer, a better match for the intent, and is a bit simpler to generate. + Generate Rust with UTF-8 source encoding. This makes the generated code easier to follow, which helps during development. It's also a bit smaller. For now codepoints U+0590 and above are still emitted as escape sequences to avoid confusing source code rendering when RTL scripts are involved. New stemming algorithms ----------------------- * Add Esperanto stemmer from David Corbett (#185). * Add Estonian algorithm from Linda Freienthal (#108). Behavioural changes to existing algorithms ------------------------------------------ * Dutch: Switch to Kraaij-Pohlmann as the default for Dutch. In case you want Martin Porter's Dutch stemming algorithm for compatibility, this is now available as `dutch_porter`. Fixes #1, reported by gboer. * Dutch (Kraaij-Pohlmann): Fix differences between the Snowball implementation and the original C implementation. * Dutch (Kraaij-Pohlmann): Add a small number of exceptions to the Snowball implementation to avoid unwanted conflations. This addresses all cases so far identified which Martin's Dutch stemmer handled better. Fixes #208. * Dutch (Porter): The "at least 3 characters" part of the R1 definition was actually implemented such that when working in UTF-8 it was "at least 3 bytes". We stripped accents normally found in Dutch except for `è` before setting R1, and no Dutch words starting `è` seem to stem differently depending on encoding, but proper nouns and other words of foreign origin may contain other accented characters and it seems better for the stemmer to handle such words the same way regardless of the encoding in use. * English: Replace '-ogist' with '-og' to conflate "geologist" and "geology", etc.
Suggested by Marc Schipperheijn on snowball-discuss. * English: Add extra condition to undoubling. We no longer undouble if the double consonant is preceded by exactly "a", "e" or "o" to avoid conflating "add"/"ad", "egg"/"eg", "off"/"of", etc. Fixes #182, reported by Ed Page. * English: Avoid conflating 'emerge' and 'emergency'. Reported by Frederick Ross on snowball-discuss. * English: Avoid conflating 'evening' and 'even'. Reported by Ann B on snowball-discuss. * English: Avoid conflating 'lateral' and 'later'. Reported by Steve Tolkin on snowball-discuss. * English: Avoid conflating 'organ', 'organic' and 'organize'. * English: Avoid conflating 'past' and 'paste'. Reported by Sonny on snowball-discuss. * English: Avoid conflating 'universe', 'universal' and 'university'. Reported by Clem Wang on snowball-discuss. * English: Handle -eed and -ing exceptions in their respective rules. This avoids the overhead of checking for them for the majority of words which don't end -eed or -ing. It also allows us to easily handle vying->vie and hying->hie at basically no extra cost. Reduces the time to stem all words in our English word list by nearly 2%. * French: Remove elisions as first step. See #187. Originally reported by Paul Rudin and kelson42. * French: Remove -aise and -aises so for example, "française" and "françaises" are now conflated with "français". Fixes #209. Originally reported by ririsoft and Fred Fung. * French: Avoid incorrect conflation of `mauvais` (bad) with `mauve` (mauve, mallow or seagull); avoid conflating `mal` with `malais`, `pal` with `palais`, etc. * French: Avoid conflating `ni` (neither/nor) with `niais` (inexperienced/silly) and `nie`/`nié`/`nier`/`nierais`/`nierons` (to deny). * French: -oux -> -ou. Fixes #91, reported by merwok. * German: Replace with the "german2" variant. 
  This normalises umlauts ("ä" to "ae", "ö" to "oe", "ü" to "ue") which is
  presumably much less common in newly created text than it once was as
  modern computer systems generally don't have the limitations which
  motivated this, but there will still be large amounts of legacy text
  which it seems helpful for the stemmer to handle without having to know
  to select a variant.

  On our sample German vocabulary which contains 35033 words, 77 words
  give different stems. A significant proportion of these are foreign
  words, and some are proper nouns. Some cases definitely seem improved,
  and quite a few are just different but effectively just change the stem
  for a word or group of words to a stem that isn't otherwise generated.
  There don't seem to be any changes that are clearly worse, though there
  are some changes that have both good and bad aspects to them. Fixes #92,
  reported by jrabensc.

* German: Don't remove -em if preceded by -syst to avoid overstemming
  words ending -system. This change means we now conflate e.g. "system"
  and "systemen". Partly addresses #161, reported by Olga Gusenikova.

* German: Remove -erin and -erinnen suffixes which conflates singular and
  plural female versions of nouns with the male versions. Fixes #85 and
  partly addresses #161, reported by Olga Gusenikova.

* German: Replace -ln and -lns with -l. This improves 82 cases in the
  current sample data without making anything worse. Tests on a larger
  word list look good too. Partly addresses #161, reported by Olga
  Gusenikova.

* German: Remove -et suffix when we safely can. Fixes #200, reported by
  Robert Frunzke.

* Greek: Fix "faulty slice operation" for input `ισαισα`. The fix changes
  `ισα` to stem to `ισ` instead of the empty string, which seems better
  (and to be what the second paper actually says to do if read carefully).
  Fixes #204, reported by subnix.

* Italian: Address overstemming of "divano" (sofa) which previously
  stemmed to "div", which is the stem for 'diva' (diva).
  Now it is stemmed to 'divan', which is what its plural form 'divani'
  already stemmed to. Fixes #49, reported by francesco.

* Norwegian: Improve stemming of words ending -ers. Fixes #175, reported
  by Karianne Berg.

* Norwegian: Include more accented vowels - treating "ê", "ò", "ó" and "ô"
  as vowels improves the stemming of a fairly small number of words, but
  there's basically no cost to having extra vowels in the grouping, and
  some of these words are commonly used. Fixes #218, reported by András
  Jankovics.

* Romanian: Fix to work with Romanian text encoded using the correct
  Unicode characters. Romanian uses a "comma below" diacritic on letters
  "s" and "t" ("ș" and "ț"). Before Unicode these weren't easily available
  so Romanian text was written using the visually similar "cedilla"
  diacritic on these letters instead ("ş" and "ţ"). Previously our stemmer
  only recognised the latter. Now it maps the cedilla forms to "comma
  below" as a first step. Patch from Robert Muir.

* Spanish: Handle -acion like -ación and -ucion like -ución. It's
  apparently common to miss off accents in Spanish, and there are examples
  in our test vocabulary that these changes help. Proposed by Damian
  Janowski.

* Swedish: Replace suffix "öst" with "ös" when preceded by any of
  'iklnprtuv' rather than just 'l'. The new rule only requires the "öst"
  to be in R1 whereas previously we required all of "löst" to be. This
  second tweak doesn't seem to affect any words ending "löst" but it
  conflates a few extra cases when combined with the expanded list of
  preceding letters, and seems more logical linguistically (since "ös" is
  akin to "ous" in English). Fixes #152, reported by znakeeye.

* Swedish: Remove -et/-ets in cases where it helps. Removing -et can't be
  done unconditionally because many words end in -et where this isn't a
  suffix. However it's a very common suffix so it seems worth crafting a
  more complex condition under which to remove. Fixes #47.

* Turkish: Remove proper noun suffixes.
  For example, `Türkiye'dir` ("it is Turkey") is now conflated with
  `Türkiye` ("Turkey"). Fixes #188.

* Yiddish: Avoid generating empty stem for input "גע" (not a valid word,
  but it's better to avoid an empty stem for any non-empty input).

Optimisations to existing algorithms
------------------------------------

* General change: Use `gopast` everywhere to establish R1 and R2 as it is
  a little more efficient to do so.

* Basque: Use an empty action rather than replacing the suffix with itself
  which seems clearer and is a little more efficient.

* Dutch (Porter): Optimise prelude routine.

* English: Remove unnecessary exception for `skis` as the algorithm stems
  `skis` to `ski` by itself (`skies` and `sky` do still need a special
  case to avoid conflation with `ski` though).

* Hungarian: We no longer take digraphs into account when determining
  where R1 starts. This can only make a difference to the stemming if we
  removed a suffix that started with the last character of the digraph (or
  with "zs" in the case of "dzs"), and that doesn't happen for any of the
  suffixes we remove for any valid Hungarian words. This simplification
  speeds up stemming by ~2% on the current sample vocabulary list. See
  #216. Thanks to András Jankovics for confirming no Hungarian words are
  affected by this change.

* Lithuanian: Remove redundant R1 check.

* Nepali: Eliminate redundant check_category_2 routine.

* Tamil: Optimise by using `among` instead of long `or` chains. The
  generated C version now takes 43% less time to process the test
  vocabulary.

* Tamil: Remove many cases which can't be triggered due to being handled
  by another case.

* Tamil: Clean up some uses of `test`.

* Tamil: Make `fix_va_start` simpler and faster.

* Tamil: Localise use of `found_a_match` flag.

* Tamil: Eliminate pointless flag changes.

* Turkish: Minor optimisations.
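
For readers unfamiliar with the `gopast` idiom mentioned in the "General
change" entry at the start of this section, here is a minimal Python sketch
of how R1 and R2 are conventionally established (vowel, non-vowel, mark p1;
then vowel, non-vowel, mark p2). It is an illustration only, not generated
code: the vowel set and the names `gopast` and `mark_regions` are
assumptions made for this example, and individual stemmers add their own
adjustments (such as the "at least 3 characters" rule noted for Dutch).

```python
# Sketch only: VOWELS and the helper names are assumptions for illustration,
# not taken from any shipped stemmer or from the generated code.
VOWELS = frozenset("aeiouy")


def gopast(word, pos, matches):
    """Mimic `gopast`: return the index just after the first character at or
    after `pos` for which matches(ch) is true, or None to signal f."""
    for i in range(pos, len(word)):
        if matches(word[i]):
            return i + 1
    return None


def mark_regions(word):
    """Return (p1, p2), the start offsets of R1 and R2 in `word`."""

    def is_vowel(ch):
        return ch in VOWELS

    def is_non_vowel(ch):
        return ch not in VOWELS

    # Both regions default to the empty region at the end of the word.
    p1 = p2 = len(word)

    # R1 starts after the first non-vowel which follows a vowel.
    pos = gopast(word, 0, is_vowel)
    if pos is not None:
        pos = gopast(word, pos, is_non_vowel)
    if pos is None:
        return p1, p2
    p1 = pos

    # R2 is found the same way, but starting from the beginning of R1.
    pos = gopast(word, pos, is_vowel)
    if pos is not None:
        pos = gopast(word, pos, is_non_vowel)
    if pos is not None:
        p2 = pos
    return p1, p2
```

With this vowel set, `mark_regions("beautiful")` gives `(5, 7)`, so R1 is
"iful" and R2 is "ul".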

Code clarity improvements to existing algorithms
------------------------------------------------

* Stop noting dates changes were made in comments in the code - we now
  maintain a changelog in each algorithm's description page on the website
  (and the version control history provides a finer grained view).

* Always use `insert` instead of `<+` as the named command seems clearer.

* English: Add comments documenting motivating examples for all
  exceptional cases.

* Lithuanian: Change to recommended latin stringdef codes. Using common
  codes makes it easier to work across algorithms, but they are more
  mnemonic so also seem clearer when just considering this one algorithm.

* Serbian: Change to recommended latin stringdef codes. Using common codes
  makes it easier to work across algorithms, but they are more mnemonic so
  also seem clearer when just considering this one algorithm.

* Turkish: Use `{sc}` for s-cedilla and `{i}` for dotless-i to match other
  uses.

Compiler
--------

* Generic code generation improvements:

  + Show Snowball source leafname in "generated" comment at start of
    files.

  + Add generic reachability tracking machinery. This facilitates various
    new optimisations, so far the following have been implemented:

    - Tail-calling
    - Simpler code for calling routines which always give the same signal
    - Simpler code when a routine ends in an integer test (this also
      allows eliminating an Ada-specific codegen optimisation which did
      something similar but only for routines which consisted *entirely*
      of a single integer test).
    - Dead code reporting and removal (only in simple cases currently)

    Currently this overlaps in functionality with the existing
    reachability tracking which is implemented on a per-language basis,
    and only for some languages. This reachability tracking was originally
    added for Java where some unreachable code is invalid and results in a
    compile time error, but then seems to have been copied for some other
    newer languages which may or may not actually need it.
    The approach it uses unfortunately relies on correctly updating the
    reachability flag anywhere in the generator code where reachability
    can change which has proved to be a source of bugs, some unfixed. This
    new approach seems better and with some more work should allow us to
    eliminate the older code. Fixes #83.

  + Omit check for `among` failing in generated code when we can tell at
    compile time that it can't fail.

  + Optimise `goto`/`gopast` applied to a grouping or inverted grouping
    (which is by far the most common way to use `goto`/`gopast`) for all
    target languages (new for Go, Java, Javascript, Pascal and Rust).

  + We never need to restore the cursor after `not`. If `not` turns signal
    `f` into `t` then it sets `c` back to its old position; otherwise,
    `not` signals `f` and `c` will get reset by whatever ultimately
    handles this `f` (or the program exits and the position of `c` no
    longer matters). This slightly improves the generated code for the
    `english` and `porter` stemmers.

  + Don't generate code for undefined or unused routines.

  + Avoid generating variable names and then not actually using them. This
    eliminates mysterious gaps in the numbering of variables in the
    generated code.

  + Eliminate `!`/`not` from integer test code by generating the inverse
    comparison operator instead for all languages, e.g. for Python we now
    generate

        if self.I_p1 >= self.I_x:

    instead of

        if not self.I_p1 < self.I_x:

    This isn't going to be faster in compiled languages with an optimiser
    but for scripting languages it may be faster, and even if not, it
    makes for a little less work when loading the script.

  + Canonicalise `hop 1` to `next` as the generated code for `next` can be
    slightly more efficient. This will also apply to `hop` followed by a
    constant expression which Snowball can reduce to `1`.

  + Avoid trailing whitespace in generated files.

  + Fix problems with --comments option:

    - When generating C code we would segfault for code containing
      `atleast`, `hop` or integer tests.
    - Fix missing comments for some commands in some target languages.
    - Fix inconsistent formatting of comments in some target languages.
    - Comments in C are now always on their own line - previously some
      were at the end of the line and some on their own line which made
      them harder to follow.
    - Emit comments before `among` and before routine/external
      definitions.

  + Simplify more cases of numeric expressions (e.g. `x * 1` to `x`).

* Improve --help output.

* Division by zero during constant folding now gives an error.

* For `hop` followed by an unexpected token (e.g. `hop hop`) we were
  already emitting a suitable error but would then segfault.

* Emit error for redefinition of a grouping.

* Improve errors for `define` of an undeclared name. We already peek at
  the next token to decide whether to try to parse as a routine or
  grouping. Previously we parsed as a routine if it was `as`, and a
  grouping otherwise, but routine definitions are more common and a
  grouping can only start with a literal string or a name, so now we
  assume a routine definition with a missing `as` if the next token isn't
  valid for either.

* Suppress duplicate (or even triplicate) "unexpected" errors for the same
  token when the compiler tried to recover from the error by adjusting the
  parse state and marking the token to be reparsed, but the same token
  then failed to parse in the new state.

* Fix NULL pointer dereference if an undefined grouping is used in the
  definition of another grouping.

* Fix mangled error for `set` or `unset` on a non-boolean:

      test.sbl:2: nameInvalid type 98 in name_of_type()

* Emit warning if `=>` is used. The documentation of how it works doesn't
  match the implementation, and it seems it has only ever been used in the
  Schinke stemmer implementation (which assumes the implemented
  behaviour). We've updated the Schinke implementation to avoid it. If
  you're using it in your own Snowball code please let us know.

* Improve errors for unterminated string literals.
* Fix NULL pointer dereference on invalid code such as `$x = $y`.

* If malloc fails while compiling the compiler will now report the failure
  and exit. Previously the NULL return from malloc wasn't checked for so
  we'd typically segfault.

* `lenof` and `sizeof` applied to a string variable now mark the variable
  as used, which avoids a bogus error followed by a confusing additional
  message if this is the only use of that variable:

      lenofsizeofbug.sbl:3: warning: string 's' is set but never used
      Unhandled type of dead assignment via sizeof

  This situation is unlikely to occur in real world code.

* The reported line number for "string not terminated" error was one too
  high in the case where we were in a stringdef (but correct if we
  weren't).

* Eliminate special handling for among starter. We now convert the starter
  to be a command before the among, adding an explicit substring if there
  isn't one.

* We now warn if the body of a `repeat` or `atleast` loop always signals
  `t` (meaning it will loop forever which is very undesirable for a
  stemming algorithm) or always signals `f` (meaning it will never loop,
  which seems unlikely to be what was intended).

* Release memory in compiler before exit. The OS will free all allocated
  memory when a process exits, so this memory isn't actually leaked, but
  it can be annoying when using snowball as part of a larger build process
  with some leak-finding tools. Patch from jsteemann in #166.

* Store textual data more efficiently in memory during Snowball
  compilation. Previously almost all textual data was stored as 16 bit
  values, but most such data only uses 8 bit character values. Doubling
  the memory usage isn't really an issue as Snowball programs are tiny,
  but this also complicated code handling such data. Now only literal
  strings use the 16 bit values.

* Fix clang -Wunused-but-set-variable warning in compiler code.

* Fix a few -Wshadow warnings in compiler and enable this warning by
  default.
* Tighten parsing of `writef()` format strings. We now error out on
  unrecognised escape codes or if a numbered escape is used with too high
  a number or a non-digit. This change reveals that the Go and Rust
  generators were using invalid escape ~A - the old writef() code was
  substituting this with just A which is what is wanted so this case was
  harmless but being lenient here could hide bugs, especially when copying
  code between generators as they don't all support the same set of format
  codes.

Build system
------------

* Turn on Java warnings and make them errors.

* Compile C code with -g by default. This makes debugging easier, and
  matches the default for at least some other build systems (e.g.
  autotools).

* Fix "make clean" to remove all built Ada files.

* Clean `stemtest` too. Patch from Stefano Rivera.

* Add missing `COMMON_FILES` dependency to dist targets.

* GNUmakefile: Tidy up and make more consistent.

* GNUmakefile: Make use of $* to improve speed and readability.

* Use $(patsubst ...) instead of sed in .java.class rule which gives
  cleaner make output and is a bit more efficient.

* Add `WERROR` make variable to provide a way to add `-Werror` to existing
  CFLAGS.

libstemmer
----------

Testsuite
---------

* Give a clear error if snowball-data isn't found. Fixes #196, reported by
  Andrea Maccis.

* Handle not thinning testdata better. If THIN_FACTOR is set to 1 we no
  longer run gzipped test data through awk. We also now handle THIN_FACTOR
  being set empty as equivalent to 1 for convenience.

* csharp_stemwords: Correctly handle a stemmer name containing an
  underscore.

* csharp_stemwords: Make `-i` option optional and read from stdin if
  omitted, like the C version does.

* csharp_stemwords: Process the input line by line which is more helpful
  for interactive testing, and also a little faster.

* Fix Java TestApp to allow a single argument.
  The documented command line syntax is that you only need to specify the
  language and there was already code to read from stdin if no input file
  was specified, but at least two command line options were required.

* Fix deprecation warning in TestApp.java.

* Optimise TestApp.java by creating fewer objects. Patch from Robert Muir.

* stemwords.py: We no longer create an empty output file if we fail to
  open the input file.

* stemwords: Improve error message to say "Out of memory or internal
  error" rather than just "Out of memory".

Documentation
-------------

* Include "what is stemming" section in each README.

* Include section on threads in each README. Based on patch for Python
  from dbcerigo.

* Document that input should be lowercase with composed accents. See #186,
  reported by 1993fpale.

* Add README section on building, including notes on cross-compiling.
  Fixes #205, reported by sin-ack.

* CONTRIBUTING.rst: Clarify which charsets to list.

* CONTRIBUTING.rst: Add general advice section. In particular, note to use
  spaces-only for indentation in most cases. Thanks to Dmitry Shachnev for
  raising this point.

* CONTRIBUTING.rst: Note that UTF-8 is OK in comments. Thanks to Dmitry
  Shachnev for asking.

* Fix some typos. Patch from Josh Soref.

* Document that our CI now uses github actions.

* Update link to Greek stemmer PDF. Patch from Michael Bissett (#33).

Snowball 2.2.0 (2021-11-10)
===========================

New Code Generators
-------------------

* Add Ada generator from Stephane Carrez (#135).

Javascript
----------

* Fix generated code to use integer division rather than floating point
  division. Noted by David Corbett.

Pascal
------

* Fix code generated for division. Previously real division was used and
  the generated code would fail to compile with an "Incompatible types"
  error. Noted by David Corbett.

* Fix code generated for Snowball's `minint` and `maxint` constants.
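
The division fixes in the Javascript and Pascal entries above come down to
the difference between integer and real division. The following Python
sketch shows the distinction, assuming Snowball's integer division is meant
to round towards zero; the helper name `trunc_div` is hypothetical and this
is not the code the compiler generates.

```python
def trunc_div(a, b):
    """Integer division which rounds towards zero (hypothetical helper).

    Works purely on integers, so it avoids the precision loss of going via
    floating point for large operands.
    """
    q = abs(a) // abs(b)
    # The quotient is negative exactly when the operand signs differ.
    return q if (a < 0) == (b < 0) else -q


# Python's built-in operators behave differently for negative fractions:
floor_result = -7 // 2   # floor division rounds towards negative infinity
real_result = -7 / 2     # real division yields a float
trunc_result = trunc_div(-7, 2)
```

Here `floor_result` is -4 and `real_result` is -3.5, while `trunc_result`
is -3.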

Python
------

* Python 2 is no longer actively supported, as proposed on the mailing
  list:
  https://lists.tartarus.org/pipermail/snowball-discuss/2021-August/001721.html

* Fix code generated for division. Previously the Python code we generated
  used integer division but rounded negative fractions towards negative
  infinity rather than zero under Python 2, and under Python 3 used
  floating point division. Noted by David Corbett.

Code Quality Improvements
-------------------------

* C/C++: Generate INT_MIN and INT_MAX directly, including <limits.h> from the
  generated C file if necessary, and remove the MAXINT and MININT macros
  from runtime/header.h.

* C#: An `among` without functions is now generated as `static` and
  groupings are now generated as constant. Patches from James Turner in
  #146 and #147.

Code generation improvements
----------------------------

* General:

  + Constant numeric subexpressions and constant numeric tests are now
    evaluated at Snowball compile time.

  + Simplify the following degenerate `loop` and `atleast` constructs
    where N is a compile-time constant:

    - loop N C where N <= 0 is a no-op.
    - loop N C where N == 1 is just C.
    - atleast N C where N <= 0 is just repeat C.

    If the value of N doesn't depend on the current target language,
    platform or Unicode settings then we also issue a warning.

Behavioural changes to existing algorithms
------------------------------------------

* german2: Fix handling of `qu` to match algorithm description. Previously
  the implementation erroneously did `skip 2` after `qu`. We suspect this
  was intended to skip the `qu` but that's already been done by the
  substring/among matching, so it actually skips an extra two characters.

  The implementation has always differed in this way, but there's no good
  reason to skip two extra characters here so overall it seems best to
  change the code to match the description.
  This change only affects the stemming of a single word in the sample
  vocabulary - `quae` which seems to actually be Latin rather than German.

Optimisations to existing algorithms
------------------------------------

* arabic: Handle exception cases in the among they're exceptions to.

* greek: Remove unused slice setting, handle exception cases in the among
  they're exceptions to, and turn `substring ... among ... or substring
  ... among ...` into a single `substring ... among ...` in cases where it
  is trivial to do so.

* hindi: Eliminate the need for variable `p`.

* irish: Minor optimisation in setting `pV` and `p1`.

* yiddish: Make use of `among` more.

Compiler
--------

* Fix handling of `len` and `lenof` being declared as names. For
  compatibility with programs written for older Snowball versions len and
  lenof stop being tokens if declared as names. However this code didn't
  work correctly if the tokeniser's name buffer needed to be enlarged to
  hold the token name (i.e. 3 or 5 elements respectively).

* Report a clearer error if `=` is used instead of `==` in an integer
  test.

* Replace a single entry command list with its contents in the internal
  syntax tree. This puts things in a more canonical form, which helps
  subsequent optimisations.

Build system
------------

* Support building on Microsoft Windows (using mingw+msys or a similar
  Unix-like environment). Patch from Jannick in #129.

* Split out INCLUDES from CPPFLAGS so that CPPFLAGS can now be overridden
  by the user if required. Fixes #148, reported by Dominique Leuenberger.

* Regenerate algorithms.mk only when needed rather than on every `make`
  run.

libstemmer
----------

* The libstemmer static library now has a `.a` extension, rather than
  `.o`. Patch from Michal Vasilek in #150.

Testsuite
---------

* stemtest: Test that numbers and numeric codes aren't damaged by any of
  the algorithms. Regression test for #66. Fixes #81.

* ada: Fix ada tests to fail if output differs.
  There was an extra `| head -300` compared to other languages, which
  meant that the exit code of `diff` was ignored. It seems more helpful
  (and is more consistent) not to limit how many differences are shown so
  just drop this addition.

* go: Stop thinning testdata. It looks like we were only doing so because
  the test harness code was based on that for rust, which was based on
  that for javascript, which was only thinning because it was reading
  everything into memory and the larger vocabulary lists were resulting in
  out of memory issues.

* javascript: Speed up stemwords.js. Process input line-by-line rather
  than reading the whole file into memory, splitting, iterating, and
  creating an array with all the output, joining and writing out a single
  huge string. This also means we can stop thinning the test data for
  javascript, which we were only doing because the huge arabic test data
  file was causing out of memory errors. Also drop the -p option, which
  isn't useful here and complicates the code.

* rust: Turn on optimisation in the makefile rather than the CI config.
  This makes the tests run in about 1/5 of the time and there's really no
  reason to be thinning the testdata for rust.

Documentation
-------------

* CONTRIBUTING.rst: Improve documentation for adding a new stemming
  algorithm.

* Improve wording of Python docs.

Snowball 2.1.0 (2021-01-21)
===========================

C/C++
-----

* Fix decoding of 4-byte UTF-8 sequences in `grouping` checks. This bug
  affected Unicode codepoints U+40000 to U+7FFFF and U+C0000 to U+FFFFF
  and doesn't affect any of the stemming algorithms we currently ship
  (#138, reported by Stephane Carrez).

Python
------

* Fix snowballstemmer.algorithms() method (#132, reported by kkaiser).

* Update code to generate trove language classifiers for PyPI. All the
  natural languages we previously had stemmers for have now been added to
  PyPI's list, but Armenian and Yiddish aren't on it. Patch from Dmitry
  Shachnev.
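
As background to the 4-byte UTF-8 fix in the C/C++ entry at the start of
this release's notes, here is a minimal Python sketch of decoding a 4-byte
sequence. It is not the C runtime's code, and the function name
`decode_4byte` is hypothetical. The affected ranges quoted in that entry
happen to be the valid code points with bit 18 set, which is carried in the
lowest payload bit of the lead byte, so those make useful test values.

```python
def decode_4byte(seq):
    """Decode one 4-byte UTF-8 sequence (a bytes object) to a code point.

    Layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, giving 3 + 6 + 6 + 6 = 21
    payload bits.  Overlong forms and values above U+10FFFF are not
    rejected here, to keep the sketch short.
    """
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    return (
        ((seq[0] & 0x07) << 18)    # bits 20..18 come from the lead byte
        | ((seq[1] & 0x3F) << 12)  # bits 17..12
        | ((seq[2] & 0x3F) << 6)   # bits 11..6
        | (seq[3] & 0x3F)          # bits 5..0
    )


# Round-trip check against Python's own encoder at the range boundaries.
for cp in (0x10000, 0x40000, 0x7FFFF, 0xC0000, 0xFFFFF, 0x10FFFF):
    assert decode_4byte(chr(cp).encode("utf-8")) == cp
```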

Code Quality Improvements
-------------------------

* Suppress GCC warning in compiler code.

* Use `const` pointers more in C runtime.

* Only use spaces for indentation in javascript code. Change proposed by
  Emily Marigold Klassen in #123, and seems to be the modern Javascript
  norm.

New Snowball Language Features
------------------------------

* `lenof` and `sizeof` can now be applied to a literal string, which can
  be useful if you want to do calculations on cursor values.

  This change actually simplifies the language a little, since you can now
  use a literal string in any read-only context which accepts a string
  variable.

Code generation improvements
----------------------------

* General:

  + Fix bugs in the code generated to handle failure of `goto`, `gopast`
    or `try` inside `setlimit` or string-`$`. This affected all languages
    (though the issue with `try` wasn't present for C). These bugs don't
    affect any of the stemming algorithms we currently ship. Reported by
    Stefan Petkovic on snowball-discuss.

  + Change `hop` with a negative argument to work as documented. The
    manual says a negative argument to hop will raise signal f, but the
    implementation for all languages was actually to move the cursor in
    the opposite direction to `hop` with a positive argument. The
    implemented behaviour is problematic as it allows invalidating
    implicitly saved cursor values by modifying the string outside the
    current region, so we've decided it's best to fix the implementation
    to match the documentation.

    The only Snowball code we're aware of which relies on this was the
    original version of the new Yiddish stemming algorithm, which has been
    updated not to rely on this.

    The compiler now issues a warning for `hop` with a constant negative
    argument (internally now converted to `false`), and for `hop` with a
    constant zero argument (internally now converted to `true`).
  + Canonicalise `among` actions equivalent to `()` such as `(true)` which
    previously resulted in an extra case in the among, and for Python we'd
    generate invalid Python code (`if` or `elif` with an empty body). Bug
    revealed by Assaf Urieli's Yiddish stemmer in #137.

  + Eliminate variables whose values are never used - they no longer have
    corresponding member variables, etc, and no code is generated for any
    assignments to them.

  + Don't generate anything for an unused `grouping`.

  + Stop warning "grouping X defined but not used" for a `grouping` which
    is only used to define another `grouping`.

* C/C++:

  + Store booleans in same array as integers. This means each boolean is
    stored as an int instead of an unsigned char which means 4 bytes
    instead of 1, but we save a pointer (4 or 8 bytes) in struct SN_env
    which is a win for all the current stemmers. For an algorithm which
    uses both integers and booleans, we also save the overhead of
    allocating a block on the heap, and potentially improve data locality.

  + Eliminate duplicate generated C comment for sliceto.

* Pascal:

  + Avoid generating unused variables. The Pascal code generated for the
    stemmers we ship is now warning free (tested with fpc 3.2.0).

  + Don't emit empty `private` sections. Cosmetic, but makes the generated
    code a bit easier to follow.

* Python:

  + End `if`-chain with `else` where possible, avoiding a redundant test
    of the variable being switched on. This optimisation kicks in for an
    `among` where all cases have commands. This change seems to speed up
    `make check_python_arabic` by a few percent.

New stemming algorithms
-----------------------

* Add Serbian stemmer from stef4np (#113).

* Add Yiddish stemmer from Assaf Urieli (#137).

* Add Armenian stemmer from Astghik Mkrtchyan. It's been on the website
  for over a decade, and included in Xapian for over 9 years without any
  negative feedback.

Optimisations to existing algorithms
------------------------------------

* kraaij_pohlmann: Use `$v = limit` instead of `do (tolimit setmark v)`
  since this generates simpler code, and also matches the code other
  algorithm implementations use.

  Probably for languages like C with optimising compilers the compiler
  will generate equivalent code anyway, but e.g. for Python this should be
  an improvement.

Code clarity improvements to existing algorithms
------------------------------------------------

* hindi.sbl: Fix comment typo.

Compiler
--------

* Don't count `$x = x + 1` as initialising or using `x`, so it's now
  handled like `$x += 1` already is.

* Comments are now only included in the generated code if command line
  option -comments is specified.

  The comments in the generated code are useful if you're trying to debug
  the compiler, and perhaps also if you are trying to debug your Snowball
  code, but for everyone else they just bloat the code which as the number
  of languages we support grows becomes more of an issue.

* `-parentclassname` is not only for java and csharp so don't disable it
  if those backends are disabled.

* `-syntax` now reports the value for each numeric literal.

* Report location for excessive get nesting error.

* Internally the compiler now represents negated literal numbers as a
  simple `c_number` rather than `c_neg` applied to a `c_number` with a
  positive value. This simplifies optimisations that want to check for a
  constant numeric expression.

Build system
------------

* Link binaries with LDFLAGS if it's set, which is needed for some
  platforms (e.g. OpenEmbedded). Patch from Andreas Müller (#120).

* Add missing dependencies of algorithms.go rule.

Testsuite
---------

* C: Add stemtest for low-level regression tests.

Documentation
-------------

* Document a C99 compiler as a requirement for building the snowball
  compiler (but the C code it generates should still work with any ISO C
  compiler).
  A few declarations mixed with code crept in some time ago (which
  nobody's complained about), so this is really just formally documenting
  a requirement which already existed.

* README: Explain what Snowball is and what Stemming is (#131, reported by
  Sean Kelly).

* CONTRIBUTING.rst: Expand section on adding a new generator.

* For Python snowballstemmer module include global NEWS instead of
  Python-specific CHANGES.rst and use README.rst as the long description.
  Patch from Dmitry Shachnev (#119).

* COPYING: Update and incorporate Python backend licensing information
  which was previously in a separate file.

Snowball 2.0.0 (2019-10-02)
===========================

C/C++
-----

* Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
  sequences of any length, but commands which look at the character value
  only handled sequences up to length 3. Fixes #89.

* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.

Java
----

* TestApp.java:

  - Always use UTF-8 for I/O. Patch from David Corbett (#80).

  - Allow reading input from stdin.

  - Remove rather pointless "stem n times" feature.

  - Only lower case ASCII to match stemwords.c.

  - Stem empty lines too to match stemwords.c.

Code Quality Improvements
-------------------------

* Fix various warnings from newer compilers.

* Improve use of `const`.

* Share common functions between compiler backends rather than having
  multiple copies of the same code.

* Assorted code clean-up.

* Initialise line_labelled member of struct generator to 0. Previously we
  were invoking undefined behaviour, though in practice it'll be zero
  initialised on most platforms.

New Code Generators
-------------------

* Add Python generator (#24). Originally written by Yoshiki Shibukawa,
  with additional updates by Dmitry Shachnev.

* Add Javascript generator. Based on JSX generator (#26) written by
  Yoshiki Shibukawa.

* Add Rust generator from Jakob Demler (#51).

* Add Go generator from Marty Schoch (#57).

* Add C# generator.
  Based on patch from Cesar Souza (#16, #17).

* Add Pascal generator. Based on Delphi backend from stemming.zip file on
  old website (#75).

New Snowball Language Features
------------------------------

* Add `len` and `lenof` to measure Unicode length. These are similar to
  `size` and `sizeof` (respectively), but `size` and `sizeof` return the
  length in bytes under `-utf8`, whereas these new commands give the same
  result whether using `-utf8`, `-widechars` or neither (but under `-utf8`
  they are O(n) in the length of the string). For compatibility with
  existing code which might use these as variable or function names, they
  stop being treated as tokens if declared to be a variable or function.

* New `{U+1234}` stringdef notation for Unicode codepoints.

* More versatile integer tests. Now you can compare any two arithmetic
  expressions with a relational operator in parentheses after the `$`, so
  for example `$(len > 3)` can now be used when previously a temporary
  variable was required: `$tmp = len $tmp > 3`

Code generation improvements
----------------------------

* General:

  + Avoid unnecessarily saving and restoring the cursor for more commands
    - `atlimit`, `do`, `set` and `unset` all leave the cursor alone or
    always restore its value, and for C `booltest` (which other languages
    already handled).

  + Special case handling for `setlimit tomark AE`. All uses of setlimit
    in the current stemmers we ship follow this pattern, and by
    special-casing we can avoid having to save and restore the cursor
    (#74).

  + Merge duplicate actions in the same `among`. This reduces the size of
    the switch/if-chain in the generated code which dispatches the among
    for many of the stemmers.

  + Generate simpler code for `among`. We always check for a zero return
    value when we call the among, so there's no point also checking for
    that in the switch/if-chain. We can also avoid the switch/if-chain
    entirely when there's only one possible outcome (besides the zero
    return).

  + Optimise code generated for `do `.
    This speeds up "make check_python" by about 2%, and should speed up other
    interpreted languages too (#110).

  + Generate more and better comments referencing snowball source.

  + Add homepage URL and compiler version as comments in generated files.

* C/C++:

  + Fix `size` and `sizeof` to not report one too high (reported by Assem
    Chelli in #32).

  + If signal `f` from a function call would lead to return from the current
    function then handle both this and bailing out on an error together with
    a simple `if (ret <= 0) return ret;`

  + Inline testing for single character literals.

  + Avoid generating `|| 0` in a corner case - this can result in a compiler
    warning when building the generated code.

  + Implement `insert_v()` in terms of `insert_s()`.

  + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
    code. Closes #90, reported by vvarma.

* Java:

  + Fix functions in `among` to work in Java. We seem to need to make the
    methods called from among `public` instead of `private`, and to call them
    on `this` instead of the `methodObject` (which is cleaner anyway). No
    revision in version control seems to generate working code for this case,
    but Richard says it definitely used to work - possibly older JVMs failed
    to correctly enforce the access controls when methods were invoked by
    reflection.

  + Code after handling `f` by returning from the current function is
    unreachable too.

  + Previously we incorrectly decided that code after an `or` was unreachable
    in certain cases. None of the current stemmers in the distribution
    triggered this, but Martin Porter's snowball version of the Schinke Latin
    stemmer does. Fixes #58, reported by Alexander Myltsev.

  + The reachability logic was failing to consider reachability from the
    final command in an `or`. Fixes #82, reported by David Corbett.

  + Fix `maxint` and `minint`. Patch from David Corbett in #31.

  + Fix `$` on strings. The previous generated code was just wrong.
    This doesn't affect any of the included algorithms, but for example breaks
    Martin Porter's snowball implementation of Schinke's Latin Stemmer.
    Issue noted by Jakob Demler while working on the Rust backend in #51, and
    reported for Schinke's Latin Stemmer by Alexander Myltsev in #58.

  + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in
    #43.

  + Eliminate range-check implementation for groupings. This was removed from
    the C generator 10 years earlier, isn't used for any of the existing
    algorithms, and it doesn't seem likely it would be - the grouping would
    have to consist entirely of a contiguous block of Unicode code-points.

  + Simplify code generated for `repeat` and `atleast`.

  + Eliminate unused return values and variables from runtime functions.

  + Only import the `among` and `SnowballProgram` classes if they're actually
    used.

  + Only generate `copy_from()` method if it's used.

  + Merge runtime functions `eq_s` and `eq_v`.

  + Java arrays know their own length so stop storing it separately.

  + Escape char 127 (DEL) in generated Java code. It's unlikely that this
    character would actually be used in a real stemmer, so this was more of a
    theoretical bug.

  + Drop unused import of InvocationTargetException from SnowballStemmer.
    Reported by GerritDeMeulder in #72.

  + Fix lint check issues in generated Java code. The stemmer classes are
    only referenced in the example app via reflection, so add
    @SuppressWarnings("unused") for them. The stemmer classes override
    equals() and hashCode() methods from the standard Java Object class, so
    mark these with @Override. Both suggested by GerritDeMeulder in #72.

  + Declare Java variables at point of use in generated code. Putting all
    declarations at the top of the function was adding unnecessary complexity
    to the Java generator code for no benefit.

  + Improve formatting of generated code.

New stemming algorithms
-----------------------

* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).
* Add Arabic stemmer from Assem Chelli (#32, #50).

* Add Irish stemmer from Jim O'Regan (#48).

* Add Nepali stemmer from Arthur Zakirov (#70).

* Add Indonesian stemmer from Olly Betts (#71).

* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.

* Add Lithuanian stemmer from Dainius Jocas (#22, #76).

* Add Greek stemmer from Oleg Smirnov (#44).

* Add Catalan and Basque stemmers from Israel Olalla (#104).

Behavioural changes to existing algorithms
------------------------------------------

* Portuguese:

  + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).

* French:

  + The MSDOS CP850 version of the French algorithm was missing changes
    present in the ISO8859-1 and Unicode versions. There's now a single
    version of each algorithm which was based on the Unicode version.

  + Recognize French suffixes even when they begin with diaereses. Patch from
    David Corbett in #78.

* Russian:

  + We now normalise 'ё' to 'е' before stemming. The documentation has long
    said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
    the stemmer to actually perform this normalisation. This change has no
    effect if the caller is already normalising as we recommend. It's a
    change in behaviour if they aren't, but 'ё' occurs rarely (there are
    currently no instances in our test vocabulary) and this improves behaviour
    when it does occur. Patch from Eugene Mirotin (#65, #68).

* Finnish:

  + Adjust the Finnish algorithm not to mangle numbers. This change also
    means it tends to leave foreign words alone. Fixes #66.

* Danish:

  + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
    alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
    space1999) are no longer mangled. See #81.

Optimisations to existing algorithms
------------------------------------

* Turkish:

  + Simplify uses of `test` in stemmer code.

  + Check for 'ad' or 'soyad' more efficiently, and without needing the
    strlen variable.
    This speeds up "make check_utf8_turkish" by 11% on x86 Linux.

* Kraaij-Pohlmann:

  + Eliminate variable x: `$p1 <= cursor` is simpler and a little more
    efficient than `setmark x $x >= p1`.

Code clarity improvements to existing algorithms
------------------------------------------------

* Turkish:

  + Use , for cedilla to match the conventions used in other stemmers.

* Kraaij-Pohlmann:

  + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
    `[substring] among (` ... `)` construct we do in other stemmers.

Compiler
--------

* Support conventional --help and --version options.

* Warn if -r or -ep used with backend other than C/C++.

* Warn if encoding command line options are specified when generating code in
  a language with a fixed encoding.

* The default classname is now set based on the output filename, so `-n` is
  now often no longer needed. Fixes #64.

* Avoid potential one byte buffer over-read when parsing snowball code.

* Avoid comparing with uninitialised array element during compilation.

* Improve `-syntax` output for `setlimit L for C`.

* Optimise away double negation so generators don't have to worry about
  generating `--` (decrement operator in many languages). Fixes #52, reported
  by David Corbett.

* Improved compiler error and warning messages:

  - We now report FILE:LINE: before each diagnostic message.

  - Improve warnings for unused declarations/definitions.

  - Warn for variables which are used, but either never initialised or never
    read.

  - Flag non-ASCII literal strings. This is an error for wide Unicode, but
    only a warning for single-byte and UTF-8 which work so long as the source
    encoding matches the encoding used in the generated stemmer code.

  - Improve error recovery after an undeclared `define`. We now sniff the
    token after the identifier and if it is `as` we parse as a routine,
    otherwise we parse as a grouping. Previously we always just assumed it
    was a routine, which gave a confusing second error if it was a grouping.
  - Improve error recovery after an unexpected token in `among`. Previously
    we acted as if the unexpected token closed the `among` (this probably
    wasn't intended but just a missing `break;` in a switch statement). Now
    we issue an error and try the next token.

* Report error instead of silently truncating character values (e.g. `hex 123`
  previously silently became byte 0x23 which is `#` rather than a
  g-with-cedilla).

* Enlarge the initial input buffer size to 8192 bytes and double each time we
  hit the end. Snowball programs are typically a few KB in size (with the
  current largest we ship being the Greek stemmer at 27KB) so the previous
  approach of starting with a 10 byte input buffer and increasing its size by
  50% plus 40 bytes each time it filled was inefficient, needing up to 15
  reallocations to load greek.sbl.

* Identify variables only used by one `routine`/`external`. This information
  isn't yet used, but such variables which are also always written to before
  being read can be emitted as local variables in most target languages.

* We now allow multiple source files on command line, and allow them to be
  after (or even interspersed with) options to better match modern Unix
  conventions. Support for multiple source files allows specifying a single
  byte character set mapping via a source file of `stringdef`.

* Avoid infinite recursion in compiler when optimising a recursive snowball
  function. Recursive functions aren't typical in snowball programs, but the
  compiler shouldn't crash for any input, especially not a valid one. We now
  simply limit how deep the compiler will recurse and make the pessimistic
  assumption in the unlikely event we hit this limit.

Build system
------------

* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
  (#59)

* Fix all the places which previously had to have a list of stemmers to work
  dynamically or be generated, so now only modules.txt needs updating to add a
  new stemmer.
* Add check_java make target which runs tests for Java.

* Support gzipped test data (the uncompressed Arabic test data is too big for
  GitHub).

* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
  invocations for Java - these are only meaningful when generating C code.

* Pass CFLAGS when linking which matches convention (e.g. automake does it)
  and facilitates use of tools such as ASan. Fixes #84, reported by Thomas
  Pointhuber.

* Add CI builds with -std=c90 to check compiler and generated code are C90
  (#54)

libstemmer
----------

* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.

* Add -O2 to CFLAGS.

* Make generated tables of encodings and modules const.

* Fix clang static analyzer memory leak warning (in practice this code path
  can never actually be taken). Patch from Patrick O. Perry (#56)

Documentation
-------------

* Added copyright and licensing details (#10).

* Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
  and romanian are available in ISO_8859_2.

* Remove documentation falsely claiming that libstemmer supports CP850
  encoding.

* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
  new language backends.

* Overhaul libstemmer_python_README. Most notably, replace the benchmark data
  which was very out of date.