Last post, I have a brief introduction about the Lex, and gave a simple example. Lex is a important tool for me to achieve my goal, so I want to say more about this tool.
Regular expressions in Lex
A regular expression is a pattern description using a meta language. An expression is made up of symbols. Normal symbols are characters and numbers, but there are other symbols that have special meaning in Lex. Below are some symbols and definitions about the regular expression in Lex.(These tables come from here)
Character | Meaning |
A-Z, 0-9, a-z | Characters and numbers that form part of the pattern. |
. | Matches any character except \n. |
- | Used to denote range. Example: A-Z implies all characters from A to Z. |
[ ] | A character class. Matches any character in the brackets. If the first character is ^ then it indicates a negation pattern. Example: [abC] matches either of a, b, and C. |
* | Match zero or more occurrences of the preceding pattern. |
+ | Matches one or more occurrences of the preceding pattern. |
? | Matches zero or one occurrences of the preceding pattern. |
$ | Matches end of line as the last character of the pattern. |
{ } | Indicates how many times a pattern can be present. Example: A{1,3} implies one or three occurrences of A may be present. |
\ | Used to escape meta characters. Also used to remove the special meaning of characters as defined in this table. |
^ | Negation. |
| | Logical OR between expressions. |
"<some symbols>" | Literal meanings of characters. Meta characters hold. |
/ | Look ahead. Matches the preceding pattern only if followed by the succeeding expression. Example: A0/1 matches A0 only if A01 is the input. |
( ) | Groups a series of regular expressions. |
Tokens in Lex are similar to variable names in C language. Each token has its own expression.
Token | Associated expression | Meaning |
number | ([0-9])+ | 1 or more occurrences of a digit |
chars | [A-Za-z] | Any character |
blank | " " | A blank space |
word | (chars)+ | 1 or more occurrences of chars |
variable | (chars)+(number)*(chars)*( number)* | Â |
The internal variables and functions in Lex
yyin | Of the type FILE*. This points to the current file being parsed by the lexer. |
yyout | Of the type FILE*. This points to the location where the output of the lexer will be written. By default, both yyin and yyout point to standard input and output. |
yytext | The text of the matched pattern is stored in this variable (char*). |
yyleng | Gives the length of the matched pattern. |
yylineno | Provides current line number information. (May or may not be supported by the lexer.) |
yylex() | The function that starts the analysis. It is automatically generated by Lex. |
yywrap() | This function is called when end of file (or input) is encountered. If this function returns 1, the parsing stops. So, this can be used to parse multiple files. Code can be written in the third section, which will allow multiple files to be parsed. The strategy is to make yyin file pointer (see the preceding table) point to a different file until all the files are parsed. At the end, yywrap() can return 1 to indicate end of parsing. |
yyless(int n) | This function can be used to push back all but first ‘n’ characters of the read token. |
yymore() | This function tells the lexer to append the next token to the current token. |
Lex declarations
In the definition section, you not only can add C variables, but also you can add the lex declarations like below:
%{int wordCount = 0;
%}
chars [A-za-z\_\'\.\"]
numbers ([0-9])+
delim [" "\n\t]
whitespace {delim}+
words {chars}+
%%
No comments:
Post a Comment