The module lexing.py aims to reproduce the functionality of GNU flex in
Python. It defines a class Lexer that can be subclassed in order
to define a new lexer. Here is a simple example.
class Calc(Lexer):
    order = 'real', 'integer'    # 123.45 is a float!
    separators = 'sep', 'operator', 'bind'
    sep = silent(r'\s')          # This won't produce tokens
    integer = r'[0-9]+'
    real = r'[0-9]+\.[0-9]*'
    variable = r'[a-z]+'
    operator = r'[-+*/]'
    bind = r'='

    def process_integer(state, val):
        return int(val)

    def process_real(state, val):
        return float(val)
Token types are created automatically:
>>> Calc.variable
<tokentype variable>
>>> print Calc.variable
variable
Tokens can be created if needed (but usually just obtained through
scan()):
>>> varx = Calc.variable.token('x')
>>> varx
<token variable: x>
>>> print varx
x
Use the scan() method to tokenise a string. It returns an
iterator over all the tokens in the string.
>>> list(Calc.scan('a = 3.0*2'))
[<token variable: a>, <token bind: =>, <token real: 3.0>, <token operator: *>, <token integer: 2>]
The val attribute of real tokens is a float thanks to the
process_real() function.
To create a new token type, simply define a class attribute whose value is a regular expression recognising that token. E.g.
ident = r'[_a-zA-Z][_a-zA-Z0-9]*'
number = r'[0-9]+'
If you want number tokens to have an integer value rather than a string, define a process_number() function:
def process_number(state, val):
    return int(val)
Some rules (e.g. comments) should produce no tokens at all. Wrap
them in the silent() function. E.g.
comment = silent(r'#.*')
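Putting these pieces together, a complete lexer could look like the
sketch below. The Mini class name is made up for illustration, and I
assume that Lexer and silent are importable from the lexing module, as
in the Calc example above.

from lexing import Lexer, silent

class Mini(Lexer):
    separators = 'sep',
    sep = silent(r'\s')       # whitespace produces no tokens
    comment = silent(r'#.*')  # neither do comments
    ident = r'[_a-zA-Z][_a-zA-Z0-9]*'
    number = r'[0-9]+'

    def process_number(state, val):
        return int(val)       # number tokens carry int values

Scanning a string would then skip whitespace and the comment entirely
(expected output, following the reprs shown earlier):

>>> list(Mini.scan('x 42 # ignored'))
[<token ident: x>, <token number: 42>]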
Use the functions istart() and xstart() to define inclusive and
exclusive start conditions (with the same meaning as in GNU flex).
If you want to add a start condition to a token rule, write COND
>> rule. If you want a token rule to change the current start
condition, write rule >> COND (None if you want to clear the start
condition). E.g.
STRING = xstart()
start_string = r'"' >> STRING
string_body = STRING >> r'[^"]*'
end_string = STRING >> '"' >> None
The token objects yielded by the scan() method have a number of
useful attributes.

strval is the string matched by the token rule. val is the token's
value: it is the same as strval unless a process function was defined
for its type (see the process_real() example above), in which case
val holds the processed value.

toktype is the token's type. Token types are created in the Lexer
subclass and are accessible as class attributes. Each token type has
a name attribute.
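For example, with the Calc lexer defined at the top (a hypothetical
session; the reprs follow the earlier examples):

>>> tok = list(Calc.scan('a = 3.0'))[-1]
>>> tok.strval            # the matched string
'3.0'
>>> tok.val               # the value set by process_real()
3.0
>>> tok.toktype
<tokentype real>
>>> tok.toktype.name
'real'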