The module lexing.py (download here) aims to reproduce the functionality of GNU flex in
Python. It defines a class Lexer
that can be subclassed in order
to define a new lexer. Here is a simple example.
class Calc(Lexer):
    order = 'real', 'integer'               # 123.45 is a float!
    separators = 'sep', 'operator', 'bind'
    sep = silent(r'\s')                     # This won't produce tokens
    integer = r'[0-9]+'
    real = r'[0-9]+\.[0-9]*'
    variable = r'[a-z]+'
    operator = r'[-+*/]'                    # '-' comes first so it isn't read as a range
    bind = r'='
    def process_integer(state, val):
        return int(val)
    def process_real(state, val):
        return float(val)
Token types are created automatically:
>>> Calc.variable
<tokentype variable>
>>> print Calc.variable
variable
>>>
Tokens can be created if needed (but usually just obtained through
scan()
):
>>> varx = Calc.variable.token('x')
>>> varx
<token variable: x>
>>> print varx
x
>>>
Use the scan()
method to tokenise a string. It returns an
iterator over all the tokens in the string.
>>> list(Calc.scan('a = 3.0*2'))
[<token variable: a>, <token bind: =>, <token real: 3.0>, <token operator: *>, <token integer: 2>]
>>>
The val
attribute of real tokens is a float thanks to the
process_real()
function.
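For instance, the values carried by the tokens above come out as follows (a plausible session, assuming that val simply stays a string when no process function is defined for a token type):

>>> [tok.val for tok in Calc.scan('a = 3.0*2')]
['a', '=', 3.0, '*', 2]
>>>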
To create a new token type, simply define a class attribute whose value is a regular expression recognising that token. E.g.
ident = r'[_a-zA-Z][_a-zA-Z0-9]*'
number = r'[0-9]+'
If you want number tokens to have an integer value rather than a string, define a process_number() function.
def process_number(state, val):
    return int(val)
Some tokens (e.g. comments) need not be processed at all. Wrap
their rules in the silent()
function. E.g.
comment = silent(r'#.*')
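Putting these pieces together, here is a sketch of a small lexer with silent rules (the class name Mini and the sample session are illustrative, and listing comment among the separators is an assumption):

class Mini(Lexer):
    separators = 'sep', 'comment'
    sep = silent(r'\s')                 # whitespace produces no tokens
    comment = silent(r'#.*')            # neither do comments
    ident = r'[_a-zA-Z][_a-zA-Z0-9]*'
    number = r'[0-9]+'
    def process_number(state, val):
        return int(val)                 # number tokens carry ints

>>> list(Mini.scan('x 42  # a comment'))
[<token ident: x>, <token number: 42>]
>>>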
Use the functions istart()
and xstart()
to define inclusive and
exclusive start conditions (with the same meaning as in GNU flex).
If you want to add a start condition to a token rule, write COND >> rule. If you want a token rule to change the current start condition, write rule >> COND (use None to clear the start condition). E.g.
STRING = xstart()

start_string = r'"' >> STRING
string_body = STRING >> r'[^"]*'
end_string = STRING >> '"' >> None
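Embedded in a complete lexer, the rules above might be used like this (a sketch; the class name Str and the exact session output are assumptions based on the token reprs shown earlier):

class Str(Lexer):
    STRING = xstart()                   # exclusive start condition
    start_string = r'"' >> STRING      # a double quote enters STRING
    string_body = STRING >> r'[^"]*'   # only active inside STRING
    end_string = STRING >> '"' >> None # leaves STRING again

>>> list(Str.scan('"abc"'))
[<token start_string: ">, <token string_body: abc>, <token end_string: ">]
>>>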
The token objects yielded by the scan()
method have a number of
useful attributes:
val and strval
    strval is the string matched by the token's rule; val is initially the same string, but can be changed if a process_token() function was defined (see example above).
toktype
    the token's type. Token types are created from the rules of the Lexer subclass, and are accessible as class attributes. Each token type has a name attribute.
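For instance, with the Calc lexer defined at the top of this page (a hypothetical session, assuming the attribute semantics just described):

>>> tok = list(Calc.scan('3.0'))[0]
>>> tok.strval
'3.0'
>>> tok.val
3.0
>>> tok.toktype
<tokentype real>
>>> tok.toktype.name
'real'
>>>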