Saturday, March 27, 2010

Building a simple compiler series(7)

Last post, I have a brief introduction about the Lex, and gave a simple example. Lex is a important tool for me to achieve my goal, so I want to say more about this tool.

Regular expressions in Lex

A regular expression is a pattern description using a meta language. An expression is made up of symbols. Normal symbols are characters and numbers, but there are other symbols that have special meaning in Lex. Below are some symbols and definitions about the regular expression in Lex.(These tables come from here)

CharacterMeaning
A-Z, 0-9, a-zCharacters and numbers that form part of the pattern.
.Matches any character except \n.
-Used to denote range. Example: A-Z implies all characters from A to Z.
[ ]A character class. Matches any character in the brackets. If the first character is ^ then it indicates a negation pattern. Example: [abC] matches either of a, b, and C.
*Match zero or more occurrences of the preceding pattern.
+Matches one or more occurrences of the preceding pattern.
?Matches zero or one occurrences of the preceding pattern.
$Matches end of line as the last character of the pattern.
{ }Indicates how many times a pattern can be present. Example: A{1,3} implies one or three occurrences of A may be present.
\Used to escape meta characters. Also used to remove the special meaning of characters as defined in this table.
^Negation.
|Logical OR between expressions.
"<some symbols>"Literal meanings of characters. Meta characters hold.
/Look ahead. Matches the preceding pattern only if followed by the succeeding expression. Example: A0/1 matches A0 only if A01 is the input.
( )Groups a series of regular expressions.

Tokens in Lex are similar to  variable names in C language. Each token has its own expression.

TokenAssociated expressionMeaning
number([0-9])+1 or more occurrences of a digit
chars[A-Za-z]Any character
blank" "A blank space
word(chars)+1 or more occurrences of chars
variable(chars)+(number)*(chars)*( number)*Â

The internal variables and functions in Lex

yyinOf the type FILE*. This points to the current file being parsed by the lexer.
yyoutOf the type FILE*. This points to the location where the output of the lexer will be written. By default, both yyin and yyout point to standard input and output.
yytextThe text of the matched pattern is stored in this variable (char*).
yylengGives the length of the matched pattern.
yylinenoProvides current line number information. (May or may not be supported by the lexer.)
yylex()The function that starts the analysis. It is automatically generated by Lex.
yywrap()This function is called when end of file (or input) is encountered. If this function returns 1, the parsing stops. So, this can be used to parse multiple files. Code can be written in the third section, which will allow multiple files to be parsed. The strategy is to make yyin file pointer (see the preceding table) point to a different file until all the files are parsed. At the end, yywrap() can return 1 to indicate end of parsing.
yyless(int n)This function can be used to push back all but first ‘n’ characters of the read token.
yymore()This function tells the lexer to append the next token to the current token.

Lex declarations

In the definition section, you not only can add C variables, but also you can add the lex declarations like below:

%{
int wordCount = 0;
%}
chars [A-za-z\_\'\.\"]
numbers ([0-9])+
delim [" "\n\t]
whitespace {delim}+
words {chars}+
%%

The content of test file.

The result




No comments:

Post a Comment