tokenize

Function Names

tokenize

Description

This function breaks a string into individual tokens (pieces) and returns them as a set. The tokens are strings by default; depending on the options provided in the 2nd function parameter, they may also be returned as other types (numerals, dates, booleans).

Call as: function

Restrictions

Indirect parameter passing is disabled

Parameter count

1-2

Parameters

No.    Type    Description
1 (input)    string    Input string

This string will be tokenized

Opt. 2 (input)    set or string    Options

Specify one or more options. Formulation rules (applicable to the 2nd to 7th function parameters):

  • Use a string to specify one value
  • Use a set to specify multiple values
  • Use an empty set {} to specify no values

The following options are supported:

include blanks    Blank contents (between separators) are also recognized as contents and will not be ignored or skipped.
trim token    Leading and trailing whitespace characters (Unicode 0 to 32) such as spaces, new lines and tabs are removed from the token contents. Non-breaking blanks (Unicode 160) are excluded from this rule.
allow new line inside quotations    Normally, a line break inside quoted text raises an exception with an error message. This option suppresses the exception and treats the line break as part of the text.
include quotations in text    The returned text also includes the quotation symbols. Quotation marks may consist of more than one character, e.g. << ... >>.
include quotations as tokens    The quotation symbols are returned as dedicated tokens.
read numerals    Tokens that look like numbers are returned as numerals.
scientific notation    Tokens that look like numbers, with or without scientific notation, are returned as numerals.
read date    Tokens that look like dates are read as dates.
read boolean    Tokens that look like booleans (true, false) are read as boolean values.
thousand separator    The next element in the set must contain the separator character. Specifies the thousand separator assumed for numbers in the input string. Default: -
decimal separator    The next element in the set must contain the separator character. Specifies the decimal separator assumed for numbers in the input string. Default: .

Opt. 3 (input)    set or string    Token separator strings

Specify at least one separator string (e.g. blank, comma, tab, slash). A separator string may contain multiple characters. In this case, the complete character sequence acts as the separator, e.g. if { "//", "..." } is specified, then "//" and "..." each separate tokens.

Formulation rules: See 2nd function parameter.

Default value: {' ', new line } (blank and new line)

Opt. 4 (input)    set or string    Quotation marks

If one string is provided, it serves as both opening and closing quotation mark. Example: "Hello World"
If two strings are provided, the first is the opening and the second is the closing quotation mark. Example: <<Hello World>>
If the number of quotation marks provided is odd, the last one serves as both opening and closing quotation mark.
Multi-character quotation marks, e.g. <TEXT>Hello World</TEXT>, are also allowed.

Formulation rules: See 2nd function parameter.

Default value: (quotation marks not assumed)

Opt. 5 (input)    set or string    Additional tokens

Collection of token symbols to be parsed as separate tokens, which makes them suitable for categorization. This is particularly important if these tokens appear without separators (e.g. white spaces) in between. Multi-character tokens such as "<=", "<>", "while" and "for" are also allowed.

Formulation rules: See 2nd function parameter.

Default value: (none specified)

Opt. 6 (input)    set or string    Block comment symbols

Pairwise collection of opening and closing block comment symbols, e.g. { "/*", "*/", "<--", "-->" }. Contents commented out will not be tokenized.

Formulation rules: See 2nd function parameter.

Default value: (none specified)

Opt. 7 (input)    set or string    Line comment symbols

Specify all comment symbols which declare the rest of the line as a comment, e.g. { "//", "#!" }. Contents commented out will not be tokenized.

Formulation rules: See 2nd function parameter.

Default value: (none specified)
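
The parameter rules above can be combined in single calls. The following is a minimal sketch only: it is not part of the B4P example library, the input strings are made up for illustration, and the calls have not been verified against a B4P installation. The first call uses two multi-character separators. The second call passes three quotation marks, so by the rule for an odd count "<<" and ">>" should form a pair and "'" should serve as both opening and closing mark. The third call passes two pairs of block comment symbols and uses empty sets {} for the parameters left unspecified.

       echo( tokenize( "a//b...c", {}, { "//", "..." } ) );
       echo( tokenize( "<<Hello World>> 'Hi there' end", {}, " ", { "<<", ">>", "'" } ) );
       echo( tokenize( "a /* skip */ b <-- skip --> c", {}, " ", {}, {}, { "/*", "*/", "<--", "-->" } ) );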

Return value

Type    Description
set    Tokenized result

Every token is represented as an element in the set.

Examples

       echo( new line, "Basic use of tokenize. Separators are blank and new line" );
       echo( tokenize( "This   is a" + new line + " test" ) );

       echo( new line, "Demonstrate 'include blanks' and 'trim token'" );
       echo( tokenize( ",Ha, He ,,Hi,", {}, "," ) );
       echo( tokenize( ",Ha, He ,,Hi,", include blanks, "," ) );
       echo( tokenize( ";Ha; He ,,Hi,", trim token, {",",";"} ) );
       echo( tokenize( ",Ha, He ,,Hi,", {include blanks, trim token}, "," ) );

       echo( new line, "New line inside quotations allowed" );
       echo( tokenize( "'Me, and"+new line+"You','and us'" , allow new line inside quotations, ",", "'" ) );

       echo( new line, "Demonstrate usage of quotations" );
       echo( tokenize( "<text>A gnu</text>,<text>A gnat</text>" , {}, ",", { "<text>", "</text>" } ) );
       echo( tokenize( "<text>A gnu</text>,<text>A gnat</text>" , include quotations as tokens, ",", { "<text>", "</text>" } ) );

       echo( new line, "Read numerals, dates, booleans" );
       echo( tokenize( "1 true 1E+3 FALSE text 2020-05-07 15:30:00", { read numerals, scientific notation, read dates, read booleans } ) );

       echo( new line, "Thousand and Decimal separators" );
       echo( tokenize( "1,234 1.234", { read numerals, thousand separator, ".", decimal separator, "," } ) );

       echo( new line, "Additional tokens" );
       echo( tokenize( "for a=1to5 'do something'", {}, " ", "'", { "=", to, for } ) );

       echo( new line, "Ignore comments" );
       echo( tokenize( "for a=1to5 /: for a = 3 to 4 :/ 'do something'", {}, " ", "'", { "=", to, for }, { "/:", ":/" } ) );
       echo( tokenize( "for a=1to5 // 'do something'", {}, " ", "'", { "=", to, for }, {}, "//" ) );

Output

Basic use of tokenize. Separators are blank and new line
{'This','is','a','test'}

Demonstrate 'include blanks' and 'trim token'
{'Ha',' He ','Hi'}
{'','Ha',' He ','','Hi',''}
{'Ha','He','Hi'}
{'','Ha','He','','Hi',''}

New line inside quotations allowed
{'Me, and
You','and us'}

Demonstrate usage of quotations
{'A gnu','A gnat'}
{'<text>','A gnu','</text>','<text>','A gnat','</text>'}

Read numerals, dates, booleans
{1,true,1000,false,'text','2020-05-07','15:30:00'}

Thousand and Decimal separators
{1.234,1234}

Additional tokens
{'for','a','=','1','to','5','do something'}

Ignore comments
{'for','a','=','1','to','5','do something'}
{'for','a','=','1','to','5'}

Try it yourself: Open LIB_Function_tokenize.b4p in B4P_Examples.zip. Decompress before use.

See also

set