Tokenizer

A tokenizer converts an input string into tokens. It is very useful when you need to parse text.

This tokenizer is used by the BruteForce language written by El Muerte. For more information, see El Muerte/Journal History 1.

class Tokenizer extends Object;

const NEWLINE = 10;

var private array<string> buffer;   // the input buffer
var private byte c;                 // holds the current char
var private int linenr;             // current line in the input buffer
var private int pos;                // position on the current line

var int verbose;                    // for debugging

enum tokenType 
{
  TT_None,
  TT_Literal,
  TT_Identifier,  
  TT_Integer,
  TT_Float,
  TT_String,
  TT_Operator,
  TT_EOF,
};

This tokenizer recognizes eight different token types:

TT_None: this token type is never assigned; it is only used as the default value.
TT_Literal: the token is a single literal character, e.g. ( or ). Most tokenizers simply use the ASCII value of the character as the token, but because of limitations in UScript a separate token type is used here.
TT_Identifier: a string that begins with a letter or an underscore, followed by zero or more alphanumeric characters or underscores. Regular expression: Identifier ::= [A-Za-z_][A-Za-z0-9_]*
TT_Integer: a natural number. Negative numbers are not supported because they would conflict with the '-' operator, so keep this in mind when you define your grammar. Regular expression: Integer ::= [0-9]+
TT_Float: a number with a decimal point. Regular expression: Float ::= [0-9]+\.[0-9]*
TT_String: a string of characters enclosed in double quotes. Literal double quotes inside the string need to be escaped with a backslash. Regular expression: String ::= "([^"\\]|\\.)*"
TT_Operator: an operator such as =, ==, >, >=, ... Regular expression: Operator ::= [-=+<>*/!]+
TT_EOF: the end of file

var private tokenType curTokenType; // holds the current token
var private string curTokenString;  // holds the current string representation

/**
  Create a tokenizer
*/
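function Create(array<string> buf)
{
  buffer.length = 0;
  buffer = buf;
  linenr = 0;
  pos = 0;
  c = 0;
  // reset the token state, otherwise a reused tokenizer stays at TT_EOF
  curTokenType = TT_None;
  curTokenString = "";
}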

Call this to initialize the tokenizer with a new buffer.

/**
  returns the string representation of the current token
*/
function string tokenString()
{
  return curTokenString;
}

/**
  returns the type of the current token
*/
function tokenType currentToken()
{
  return curTokenType;
}

We don't want anybody writing to our variables, so we provide functions to read their values.

/**
  retrieves the next token
*/
function tokenType nextToken()
{
  return _nextToken();
}

Gets the next token in the buffer. This calls the private _nextToken() for the real processing.
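
As a rough illustration of how the public functions fit together, a caller could drive the tokenizer as sketched below. The class TokenizerExample and its function DumpTokens are hypothetical names that do not appear on this page, and the dependson(Tokenizer) modifier is only an assumption about what the compiler needs in order to resolve the TT_ constants from another class.

```unrealscript
// Hypothetical example class, not part of the original tokenizer.
class TokenizerExample extends Object dependson(Tokenizer);

// Logs every token found in the given lines of text.
function DumpTokens(array<string> lines)
{
  local Tokenizer t;

  t = new class'Tokenizer';
  t.Create(lines);
  // nextToken() returns TT_EOF once the whole buffer has been consumed
  while (t.nextToken() != TT_EOF)
  {
    // the accessors always describe the token that was read last
    log(t.currentToken() @ t.tokenString(), 'TokenizerExample');
  }
}
```

Following the rules above, a buffer holding the single line x = foo(42, 3.14) yields Identifier x, Operator =, Identifier foo, Literal (, Integer 42, Literal ,, Float 3.14 and Literal ) before TT_EOF is returned.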

/* Private functions */

private function tokenType _nextToken()
{
  local int tokenPos, endPos;
  skipBlanks();
  if (curTokenType == TT_EOF) return curTokenType; 
  tokenPos = pos;
  // identifier: [A-Za-z_][A-Za-z0-9_]*
  if (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95))
  {
    pos++;
    c = _c();
    while (((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122)) || (c == 95) || ((c >= 48) && (c <= 57)))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Identifier;
  }
  // number: [0-9]+(\.[0-9]*)?
  else if ((c >= 48) && (c <= 57))
  {
    pos++;
    c = _c();
    while ((c >= 48) && (c <= 57))
    {
      pos++;
      c = _c();
    }
    if (c == 46) // .
    {
      pos++;
      c = _c();
      while ((c >= 48) && (c <= 57))
      {
        pos++;
        c = _c();
      }
      endPos = pos;
      curTokenType = TT_Float;
    }
    else {
      endPos = pos;
      curTokenType = TT_Integer;
    }
  }
  // string: "([^"\\]|\\.)*"
  else if (c == 34)
  {
    pos++;
    c = _c();
    while (true)
    {
      if (c == 34) break;
      if (c == 92) // escape char skip one char
      {
        pos++;
      }
      if (c == NEWLINE)
      {
        Warn("Unterminated string @"@linenr$","$pos);
        assert(false);
      }
      pos++;
      c = _c();
    }
    tokenPos++;
    endPos = pos;
    pos++;
    curTokenType = TT_String;
  }
  // operator: [-+*/=<>!]+
  else if ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62) || (c == 61))
  {
    pos++;
    c = _c();
    while ((c == 33) || (c == 42) || (c == 43) || (c == 45) || (c == 47) || (c == 60) || (c == 61) || (c == 62) || (c == 61))
    {
      pos++;
      c = _c();
    }
    endPos = pos;
    curTokenType = TT_Operator;
  }
  // literal: any other single character
  else {
    pos++;
    endPos = pos;
    curTokenType = TT_Literal;
  }
  // make up result
  if (linenr >= buffer.length) // EOF break
  {
    curTokenType = TT_EOF; 
    curTokenString = "";
  }
  else {
    curTokenString = Mid(buffer[linenr], tokenPos, endPos-tokenPos);
  }
  if (verbose > 0) log(curTokenType@curTokenString, 'Tokenizer');
  return curTokenType;
}

/**
  Skip all characters with ascii value < 33 (32 is space)
*/
private function skipBlanks()
{  
  c = _c();
  while (c < 33)
  {
    if (c == NEWLINE)
    {
      linenr++;
      pos = 0;
      if (linenr >= buffer.length) // EOF break
      {
        curTokenType = TT_EOF; 
        curTokenString = "";
        return;
      }
    }
    else pos++;
    c = _c();
  }
}

skipBlanks skips all characters considered whitespace, in this case all ASCII control characters and the space.

/**
  returns the current char
*/
private function byte _c(optional int displacement)
{
  local string t;
  t =  Mid(buffer[linenr], pos+displacement, 1);
  if (t == "") return NEWLINE; // empty string is a newline
  return Asc(t);
}

This function reads the current character. Because we can't simply advance a read pointer as you normally would, we need to extract the current character from the current line and convert it to its ASCII value for easier processing.

defaultproperties
{
  verbose=0
}

Issues

Escape characters in strings

Escaped characters are accepted by this tokenizer, but the escape sequences are not converted.

"a string with \"double quotes\"" will be returned as:

a string with \"double quotes\"
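
If the parser needs the real string value, it has to convert the escape sequences itself. The following is a minimal sketch of one way to do that. The class TokenizerUtil and the function Unescape are hypothetical and not part of the original tokenizer, and the sketch assumes that a backslash simply makes the next character literal.

```unrealscript
// Hypothetical helper class, not part of the original tokenizer.
class TokenizerUtil extends Object;

// Removes the backslash in front of every escaped character,
// so that \" becomes " and \\ becomes \.
static function string Unescape(string s)
{
  local string result, bs;
  local int i;

  bs = Chr(92); // the backslash character
  i = InStr(s, bs);
  while (i != -1)
  {
    // keep the text before the backslash and the escaped character itself
    result = result $ Left(s, i) $ Mid(s, i+1, 1);
    // continue with the text after the escaped character
    s = Mid(s, i+2);
    i = InStr(s, bs);
  }
  return result $ s;
}
```

A parser could then call class'TokenizerUtil'.static.Unescape(t.tokenString()) whenever the current token is a TT_String.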

Negative numbers

Negative numbers are not supported by this tokenizer. Instead of a Number '-123' you will get an Operator '-' followed by a Number '123'. This is because the tokenizer cannot tell the difference between the operator '-' and a leading minus sign. For example:

x = x - 1 and x = -1

So when parsing your code, keep in mind that a number can be preceded by a '-' (a prefix operator).
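
One possible way to handle this in a parser is sketched below. The class SignedNumberExample and the function ParseSignedInt are hypothetical names, the dependson(Tokenizer) modifier is an assumption about how the TT_ constants are made visible, and the sketch assumes the tokenizer is already positioned on the first token of the operand.

```unrealscript
// Hypothetical parser fragment, not part of the original tokenizer.
class SignedNumberExample extends Object dependson(Tokenizer);

// Reads an integer with an optional leading minus sign.
// Returns false if the current token does not start an integer.
function bool ParseSignedInt(Tokenizer t, out int result)
{
  local bool negative;

  // a lone '-' operator in front of a number acts as a prefix operator
  if ((t.currentToken() == TT_Operator) && (t.tokenString() == "-"))
  {
    negative = true;
    t.nextToken();
  }
  if (t.currentToken() != TT_Integer) return false;
  result = int(t.tokenString());
  if (negative) result = -result;
  return true;
}
```

Note that the tokenizer groups adjacent operator characters into a single token, so x=-1 written without spaces produces the Operator '=-' rather than '=' followed by '-'. A grammar that relies on this sketch would need a space between the two.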

Discussion

Daid303: I noticed that you can get an 'Unterminated string' error if you put a normal enter in a string, but in normal C this is accepted; it actually does the same as \n. I dunno how UCC likes enters in string constants...

Wormbo: The compiler can't handle control characters in string constants and there are no escape characters for them.

El Muerte: It's by design (of this tokenizer). You could add support for it. Just remove the if (c == NEWLINE) check for strings to support newlines. Anyway, this tokenizer is experimental and quite slow. Maybe I should add a notice about that at the top of the page.

Daid303: Ok, I just noticed that the Visual C compiler doesn't like newlines in string constants, while the Borland compiler (I think) has no problems with them.

Another thing I noticed in the operator parsing part, the comment says "[-=+<>*/!]+" those are 8 characters (the | and & are missing) but the code has 9 character values. "If the comment and the code disagree then they are both wrong"

Maybe this has been fixed in the latest version and this is not the latest version...

PTGui: I sent a string with 999 characters and the game just crashed in the tokenizer. The way I fixed it was to make a string with MID(PREVIOUS_STRING,0,995) and the problem was solved. I just don't know if the problem came from the type of string (it was dynamically created), but the crash was due to the tokenizer.

Wormbo: UScript is not designed to handle long strings very well. Working with them is quite slow as they have to be copied every time they are assigned or passed into a function.
