--- title: 'Blazingly fast parsing, part 1: optimizing the scanner' author: 'Toon Verwaest ([@tverwaes](https://twitter.com/tverwaes)), scandalous optimizer' avatars: - 'toon-verwaest' date: 2019-03-25 13:33:37 tags: - internals - parsing tweet: '1110205101652787200' description: 'The cornerstone of parser performance is a fast scanner. This article explains how V8’s JavaScript scanner recently got up to 2.1× faster.' --- To run a JavaScript program, the source text needs to be processed so V8 can understand it. V8 starts out by parsing the source into an abstract syntax tree (AST), a set of objects that represent the program structure. That AST gets compiled to bytecode by Ignition. The performance of these parse + compile phases is important: V8 cannot run code before compilation is done. In this series of blog posts, we focus on parsing, and the work done in V8 to ship a blazingly fast parser. In fact, we start the series one stage before the parser. V8’s parser consumes ‘tokens’ provided by the ‘scanner’. Tokens are blocks of one or more characters that have a single semantic meaning: a string, an identifier, an operator like `++`. The scanner constructs these tokens by combining consecutive characters in an underlying character stream. The scanner consumes a stream of Unicode characters. These Unicode characters are always decoded from a stream of UTF-16 code units. Only a single encoding is supported to avoid branching or specializing the scanner and parser for various encodings, and we chose UTF-16 since that’s the encoding of JavaScript strings, and source positions need to be provided relative to that encoding. The [`UTF16CharacterStream`](https://cs.chromium.org/chromium/src/v8/src/scanner.h?rcl=edf3dab4660ed6273e5d46bd2b0eae9f3210157d&l=46) provides a (possibly buffered) UTF-16 view over the underlying Latin1, UTF-8, or UTF-16 encoding that V8 receives from Chrome, which Chrome in turn received from the network. In addition to supporting more than one encoding, the separation between scanner and character stream allows V8 to transparently scan as if the entire source is available, even though we may only have received a portion of the data over the network so far. ![](/_img/scanner/overview.svg) The interface between the scanner and the character stream is a method named [`Utf16CharacterStream::Advance()`](https://cs.chromium.org/chromium/src/v8/src/scanner.h?rcl=edf3dab4660ed6273e5d46bd2b0eae9f3210157d&l=54) that returns either the next UTF-16 code unit, or `-1` to flag end of input. UTF-16 cannot encode every Unicode character in a single code unit. Characters outside the [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane) are encoded as two code units, also called surrogate pairs. The scanner operates on Unicode characters rather than UTF-16 code units though, so it wraps this low-level stream interface in a [`Scanner::Advance()`](https://cs.chromium.org/chromium/src/v8/src/scanner.h?sq=package:chromium&g=0&rcl=edf3dab4660ed6273e5d46bd2b0eae9f3210157d&l=569) method that decodes UTF-16 code units into full Unicode characters. The currently decoded character is buffered and picked up by scan methods, such as [`Scanner::ScanString()`](https://cs.chromium.org/chromium/src/v8/src/scanner.cc?rcl=edf3dab4660ed6273e5d46bd2b0eae9f3210157d&l=775). The scanner [chooses](https://cs.chromium.org/chromium/src/v8/src/scanner.cc?rcl=edf3dab4660ed6273e5d46bd2b0eae9f3210157d&l=422) a specific scanner method or token based on a maximum lookahead of 4 characters, the longest ambiguous sequence of characters in JavaScript[^1]. Once a method like `ScanString` is chosen, it consumes the remainder of characters for that token, buffering the first character that’s not part of the token for the next scanned token. In the case of `ScanString` it also copies the scanned characters into a buffer encoded as Latin1 or UTF-16, while decoding escape sequences. [^1]: `