Tokenizer

Assumes Tcl 8.6 (coroutine support)

if {[catch {package req Tcl 8.6}]} return

Rosetta example: Tokenize a string with escaping

Write a class which allows for splitting a string at each non-escaped occurrence of a separator character.

package req nx

nx::Class create Tokenizer {
    :property s:required
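    :method init {} {
        # Give the object its own namespace so the coroutine can live inside it.
        :require namespace
        # Start the iterator; its first yield returns the coroutine's name.
        set coro [coroutine [current]::nextCoro [current] iter ${:s}]
        # Forward "next" on this object to the coroutine to fetch one character.
        :public object forward next $coro
    }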
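    :public method iter {s} {
        # Suspend right away, handing the coroutine name back to init.
        yield [info coroutine]
        # Each resume yields the next character of the string.
        for {set i 0} {$i < [string length $s]} {incr i} {
            yield [string index $s $i]
        }
        # Input exhausted: raise a break in the caller, which ends its loop.
        return -code break
    }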
    :public object method tokenize {{-sep |} {-escape ^} s} {
        set t [[current] new -s $s]
        set part ""
        set parts [list]
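        # Runs until "next" propagates a break from the exhausted coroutine.
        while {1} {
            set c [$t next]
            if {$c eq $escape} {
                # Take the character after the escape literally.
                append part [$t next]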
            } elseif {$c eq $sep} {
                lappend parts $part
                set part ""
            } else {
                append part $c
            }
        }
        lappend parts $part
        return $parts
    }
}
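
The loop in tokenize has no explicit exit; it ends because the exhausted coroutine returns with a break code. The following is a minimal sketch of that mechanism in plain Tcl 8.6, without NX. The names charIter, charIterCoro and collectChars are hypothetical and used only for illustration.

```tcl
package require Tcl 8.6

# Hypothetical helper: yields the characters of s one at a time.
proc charIter {s} {
    # The first yield hands back the coroutine name and suspends
    # until the coroutine is resumed for the first character.
    yield [info coroutine]
    foreach c [split $s ""] {
        yield $c
    }
    # Exhausted: propagate a break to whoever resumed the coroutine.
    return -code break
}

# Hypothetical helper: drains the iterator into a list.
proc collectChars {s} {
    set it [coroutine charIterCoro charIter $s]
    set out [list]
    while {1} {
        # The final resume returns with code break, which ends this loop.
        set c [$it]
        lappend out $c
    }
    return $out
}

puts [collectChars "a^|"]
```

The break leaves the while loop just as a literal break command would, so the caller needs no end-of-input sentinel.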

Run some tests, including escaped separators and escaped escape characters:

% Tokenizer tokenize -sep | -escape ^ ^|
|
% Tokenizer tokenize -sep | -escape ^ ^|^|
||
% Tokenizer tokenize -sep | -escape ^ ^^^|
^|
% Tokenizer tokenize -sep | -escape ^ |
{} {}

Test for the output required by the Rosetta example:

% Tokenizer tokenize -sep | -escape ^ one^|uno||three^^^^|four^^^|^cuatro|
one|uno {} three^^ four^|cuatro {}
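
For cross-checking the results above without NX or coroutines, the same splitting rule can be sketched as a plain Tcl procedure. This is an illustrative counterpart, not part of the class; the name tokenizeEscaped and its positional arguments are assumptions.

```tcl
# Hypothetical plain-Tcl counterpart of Tokenizer tokenize.
proc tokenizeEscaped {s {sep |} {escape ^}} {
    set parts [list]
    set part ""
    set escaped 0
    foreach c [split $s ""] {
        if {$escaped} {
            # Character after an escape is taken literally.
            append part $c
            set escaped 0
        } elseif {$c eq $escape} {
            set escaped 1
        } elseif {$c eq $sep} {
            # Unescaped separator closes the current field.
            lappend parts $part
            set part ""
        } else {
            append part $c
        }
    }
    # The last field is always emitted, even when empty.
    lappend parts $part
    return $parts
}

puts [tokenizeEscaped {one^|uno||three^^^^|four^^^|^cuatro|}]
```

In this sketch an escape character at the very end of the input is dropped silently.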