B4P Reference

Function Names

tokenize

Description

This function breaks a string into individual tokens (pieces) and returns them as a set of strings or other types, depending on the options provided in the 2nd function parameter.

Call as: function

Restrictions

Indirect parameter passing is disabled

Parameter count

1-7

Parameters

No.

Type

Description

1
input

string

Input string

The string to be tokenized.

Opt. 2
input

set or string

Options

Specify one or more options. Formulation rules (applicable to the 2nd to 7th function parameters):

Use a string to specify one value
Use a set to specify multiple values
Use an empty set {} to specify no values

The following options are supported:

include blanks	Empty contents between two separators are also returned as tokens instead of being skipped.
trim token	Leading and trailing whitespace characters (Unicode code points 0 to 32) such as spaces, new lines and tabs are removed from the token contents. Non-breaking spaces (Unicode 160) are excluded from this rule.
allow new line inside quotations	Normally, line breaks in text inside the specified quotation marks raise exceptions with error messages. This option suppresses the exceptions and treats new lines as part of the text.
include quotations in text	The returned text also includes the quotation symbols. Quotation symbols may consist of more than one character, e.g. << ... >>.
include quotations as tokens	The quotation symbols are returned as dedicated tokens.
read numerals	Tokenized contents that look like numbers are returned as tokens of numeral type.
scientific notation	Tokenized contents that look like numbers, with or without scientific notation, are returned as tokens of numeral type.
read date	Tokenized contents that look like dates are read as dates.
read boolean	Tokenized contents that look like booleans (true, false) are read as boolean values.
thousand separator	The next element in the set must contain the separator character. Defines the thousand separator assumed for numbers in the input string. Default: -
decimal separator	The next element in the set must contain the separator character. Defines the decimal separator assumed for numbers in the input string. Default: .
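
The effect of the numeral options can be modelled outside B4P. The following is a minimal Python sketch, not B4P's implementation: the function name `read_numeral`, its parameters, the accepted number syntax and the default of "no thousand separator" are all assumptions made for illustration. It also assumes that scientific notation is only accepted when that option is enabled.

```python
import re
from typing import Optional

# Assumed number syntax for this sketch: optional sign, digits, optional fraction.
PLAIN = re.compile(r"[+-]?\d+(\.\d+)?")
# Same syntax with an optional exponent part (scientific notation).
SCIENTIFIC = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")


def read_numeral(token: str,
                 thousand_sep: str = "",
                 decimal_sep: str = ".",
                 scientific: bool = False) -> Optional[float]:
    """Return the token as a number, or None if it does not look like one."""
    text = token
    if thousand_sep:
        # Thousand separators carry no value and are simply dropped.
        text = text.replace(thousand_sep, "")
    if decimal_sep != ".":
        if "." in text:
            # A leftover point is not a valid decimal separator here.
            return None
        text = text.replace(decimal_sep, ".")
    pattern = SCIENTIFIC if scientific else PLAIN
    if pattern.fullmatch(text) is None:
        return None
    return float(text)


# Mirrors the "Thousand and Decimal separators" example on this page.
print(read_numeral("1,234", thousand_sep=".", decimal_sep=","))
print(read_numeral("1.234", thousand_sep=".", decimal_sep=","))
```

Swapping the two separators is what turns "1,234" into a fraction and "1.234" into a four-digit integer value in this model.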

Opt. 3
input

set or string

Token separator strings

Specify at least one separator string (e.g. blank, comma, tab, slash). A separator may contain multiple characters, in which case the whole character sequence acts as one separator, e.g. { "//", "..." }.

Formulation rules: See 2nd function parameter.

Default value: {' ', new line } (blank and new line)
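
How separators interact with the 'include blanks' and 'trim token' options can be sketched in Python. This is a simplified model and not B4P's implementation: the name `split_tokens` and its parameters are hypothetical, quotation marks and comments are ignored, and the sketch assumes that trimming happens before blank tokens are dropped and that longer separators take precedence over shorter ones.

```python
import re

# Characters with code points 0 to 32. The non-breaking space (160) is
# deliberately absent, so it survives trimming.
TRIMMABLE = "".join(chr(c) for c in range(33))


def split_tokens(text, separators=(" ", "\n"),
                 include_blanks=False, trim_token=False):
    """Split text at any of the separator strings."""
    # Assumption: try longer separators first so "..." wins over ".".
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    pieces = re.split(pattern, text)
    if trim_token:
        pieces = [p.strip(TRIMMABLE) for p in pieces]
    if not include_blanks:
        # Without 'include blanks', empty contents are skipped.
        pieces = [p for p in pieces if p != ""]
    return pieces


print(split_tokens(",Ha, He ,,Hi,", [","]))
print(split_tokens(",Ha, He ,,Hi,", [","], include_blanks=True))
print(split_tokens(";Ha; He ,,Hi,", [",", ";"], trim_token=True))
```

The three calls correspond to the 'include blanks' and 'trim token' examples in the Examples section of this page.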

Opt. 4
input

set or string

Quotation marks

If one string is specified, it serves as both the opening and the closing quotation mark. Example: "Hello World"
If two strings are specified, the first is the opening and the second the closing quotation mark. Example: <<Hello World>>
If an odd number of quotation marks is specified, the last one serves as both the opening and the closing quotation mark. Multi-character quotation marks, e.g. <TEXT>Hello World</TEXT>, are also allowed.

Formulation rules: See 2nd function parameter.

Default value: (quotation marks not assumed)
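
The pairing rule for quotation marks can be illustrated with a short Python sketch. It is only a model of the rule as described above: the helper name `pair_quotation_marks` is hypothetical, and the sketch assumes that when more than two strings are given they are paired in the order specified.

```python
def pair_quotation_marks(marks):
    """Return (opening, closing) pairs for the given quotation marks."""
    if isinstance(marks, str):
        # A single string counts as a collection with one element.
        marks = [marks]
    pairs = []
    # Consecutive elements form opening/closing pairs.
    for i in range(0, len(marks) - 1, 2):
        pairs.append((marks[i], marks[i + 1]))
    if len(marks) % 2 == 1:
        # Odd count: the last mark both opens and closes.
        pairs.append((marks[-1], marks[-1]))
    return pairs


print(pair_quotation_marks('"'))
print(pair_quotation_marks(["<<", ">>"]))
print(pair_quotation_marks(["<text>", "</text>", "'"]))
```

In the last call, `<text>` and `</text>` form one pair and the single quote forms a second pair with itself.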

Opt. 5
input

set or string

Additional tokens

Collection of token symbols to be parsed separately and suitable for categorization. This is particularly important when these tokens appear without separators (e.g. white space) in between. You can also specify multi-character tokens, e.g. "<=", "<>", "while", "for", etc.

Formulation rules: See 2nd function parameter.

Default value: (none specified)
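
To show why additional tokens matter when no separators are present, here is a small Python sketch that splits one piece of text at the given token symbols. It is an illustration under assumptions, not B4P's algorithm: the name `split_additional_tokens` is hypothetical, and the sketch assumes that the longest matching symbol wins and that symbols are matched anywhere in the text, including inside longer words.

```python
def split_additional_tokens(piece, additional):
    """Split a piece of text so that each additional token stands alone."""
    # Assumption: prefer the longest symbol, so "<=" is not read as "<", "=".
    ordered = sorted(additional, key=len, reverse=True)
    tokens = []
    current = ""
    i = 0
    while i < len(piece):
        match = next((t for t in ordered if t and piece.startswith(t, i)), None)
        if match is None:
            # Ordinary character: extend the token being collected.
            current += piece[i]
            i += 1
        else:
            # Flush collected text, then emit the symbol as its own token.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(match)
            i += len(match)
    if current:
        tokens.append(current)
    return tokens


print(split_additional_tokens("a=1to5", ["=", "to", "for"]))
```

This reproduces the a, =, 1, to, 5 portion of the "Additional tokens" example in the Examples section.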

Opt. 6
input

set or string

Block comment symbols

Pairwise collection of opening and closing block comment symbols, e.g. { "/*", "*/", "<--", "-->" }. Contents commented out will not be tokenized.

Formulation rules: See 2nd function parameter.

Default value: (none specified)

Opt. 7
input

set or string

Line comment symbols

Specify all comment symbols which declare the rest of the line as comment, e.g. { "//", "#!" }. Contents commented out will not be tokenized.

Formulation rules: See 2nd function parameter.

Default value: (none specified)
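
The effect of block and line comment symbols can be modelled as a pre-processing step that removes commented text before tokenizing. The Python sketch below is such a model under stated assumptions, not B4P's implementation: the name `strip_comments` is hypothetical, comment symbols inside quotation marks are not treated specially, and an unclosed block comment is assumed to run to the end of the text.

```python
def strip_comments(text, block_symbols=(), line_symbols=()):
    """Remove block and line comments from text."""
    # Block symbols come pairwise: opening, closing, opening, closing, ...
    blocks = list(zip(block_symbols[0::2], block_symbols[1::2]))
    out = []
    i = 0
    while i < len(text):
        block = next(((o, c) for o, c in blocks
                      if o and text.startswith(o, i)), None)
        if block is not None:
            end = text.find(block[1], i + len(block[0]))
            # Assumption: an unclosed block comment swallows the rest.
            i = len(text) if end == -1 else end + len(block[1])
            continue
        if any(s and text.startswith(s, i) for s in line_symbols):
            end = text.find("\n", i)
            # The line break is kept because it may act as a separator.
            i = len(text) if end == -1 else end
            continue
        out.append(text[i])
        i += 1
    return "".join(out)


print(strip_comments("for a=1to5 /: for a = 3 to 4 :/ 'do something'", ["/:", ":/"]))
print(strip_comments("for a=1to5 // 'do something'", [], ["//"]))
```

Both calls use the inputs of the "Ignore comments" example; only the text outside the comments remains for tokenizing.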

Return value

Type	Description
set	Tokenized result. Every token is represented as an element in the set.

Examples

       echo( new line, "Basic use of tokenize. Separators are blank and new line" );
       echo( tokenize( "This   is a" + new line + " test" ) );

       echo( new line, "Demonstrate 'include blanks' and 'trim token'" );
       echo( tokenize( ",Ha, He ,,Hi,", {}, "," ) );
       echo( tokenize( ",Ha, He ,,Hi,", include blanks, "," ) );
       echo( tokenize( ";Ha; He ,,Hi,", trim token, {",",";"} ) );
       echo( tokenize( ",Ha, He ,,Hi,", {include blanks, trim token}, "," ) );

       echo( new line, "New line inside quotations allowed" );
       echo( tokenize( "'Me, and"+new line+"You','and us'" , allow new line inside quotations, ",", "'" ) );

       echo( new line, "Demonstrate usage of quotations" );
       echo( tokenize( "<text>A gnu</text>,<text>A gnat</text>" , {}, ",", { "<text>", "</text>" } ) );
       echo( tokenize( "<text>A gnu</text>,<text>A gnat</text>" , include quotations as tokens, ",", { "<text>", "</text>" } ) );

       echo( new line, "Read numerals, dates, booleans" );
       echo( tokenize( "1 true 1E+3 FALSE text 2020-05-07 15:30:00", { read numerals, scientific notation, read dates, read booleans } ) );

       echo( new line, "Thousand and Decimal separators" );
       echo( tokenize( "1,234 1.234", { read numerals, thousand separator, ".", decimal separator, "," } ) );

       echo( new line, "Additional tokens" );
       echo( tokenize( "for a=1to5 'do something'", {}, " ", "'", { "=", to, for } ) );

       echo( new line, "Ignore comments" );
       echo( tokenize( "for a=1to5 /: for a = 3 to 4 :/ 'do something'", {}, " ", "'", { "=", to, for }, { "/:", ":/" } ) );
       echo(tokenize("for a=1to5 // 'do something'", {}, " ", "'", { "=", to, for }, { }, "//"));

Output

Basic use of tokenize. Separators are blank and new line
{'This','is','a','test'}

Demonstrate 'include blanks' and 'trim token'
{'Ha',' He ','Hi'}
{'','Ha',' He ','','Hi',''}
{'Ha','He','Hi'}
{'','Ha','He','','Hi',''}

New line inside quotations allowed
{'Me, and
You','and us'}

Demonstrate usage of quotations
{'A gnu','A gnat'}
{'<text>','A gnu','</text>','<text>','A gnat','</text>'}

Read numerals, dates, booleans
{1,true,1000,false,'text','2020-05-07','15:30:00'}

Thousand and Decimal separators
{1.234,1234}

Additional tokens
{'for','a','=','1','to','5','do something'}

Ignore comments
{'for','a','=','1','to','5','do something'}
{'for','a','=','1','to','5'}

Try it yourself: Open LIB_Function_tokenize.b4p in B4P_Examples.zip. Decompress before use.
