Monday, April 5, 2010
Building a simple compiler series(9)--Simple Context Free Grammar
Peter Norvig has a great post about Lisp and Python that compares the two languages. In that article, he implements a context-free grammar to generate random sentences using Lisp. I found it interesting, so I took most of Peter's code and gave it a try.
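Here is a minimal sketch of the same idea in C (the grammar and word lists follow Norvig's simple example, Sentence -> NP VP, NP -> Article Noun, VP -> Verb NP; the C rendering and function names are my own):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Word lists for each terminal category of the toy grammar. */
static const char *articles[] = { "the", "a" };
static const char *nouns[]    = { "man", "ball", "woman", "table" };
static const char *verbs[]    = { "hit", "took", "saw", "liked" };

/* Pick one alternative at random. */
static const char *pick(const char **words, int n)
{
    return words[rand() % n];
}

/* NP -> Article Noun */
static void noun_phrase(void)
{
    printf("%s %s ", pick(articles, 2), pick(nouns, 4));
}

/* VP -> Verb NP */
static void verb_phrase(void)
{
    printf("%s ", pick(verbs, 4));
    noun_phrase();
}

/* Sentence -> NP VP */
static void sentence(void)
{
    noun_phrase();
    verb_phrase();
    printf("\n");
}

int main(void)
{
    srand((unsigned)time(NULL));
    sentence();   /* prints something like "the man hit a table" */
    return 0;
}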
Wednesday, March 31, 2010
Building a simple compiler series (8)--Yacc
Yacc (Yet Another Compiler-Compiler) is another powerful tool; it generates a parser from a specified context-free grammar.
Yacc provides a general tool for imposing structure on the input to a computer program. The Yacc user prepares a specification of the input process; this includes rules describing the input structure, code to be invoked when these rules are recognized, and a low-level routine to do the basic input. Yacc then generates a function to control the input process. This function, called a parser, calls the user-supplied low-level input routine (the lexical analyzer) to pick up the basic items (called tokens) from the input stream. These tokens are organized according to the input structure rules, called grammar rules; when one of these rules has been recognized, then user code supplied for this rule, an action, is invoked; actions have the ability to return values and make use of the values of other actions.
A Yacc specification file looks like the skeleton below, which is similar to the Lex file specification:
declarations
%%
rules
%%
programs
Here is a quick demo based on the example here; you can refer to that link for more information. I added a main function and a yyerror function, and I will add more detail and other examples in the next post. This is just a simple example that shows what Yacc looks like and how to use it.
%{
#include <stdio.h>
#include <ctype.h>
int regs[26];
int base;
%}
%start list
%token DIGIT LETTER
%left '|'
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS /* supplies precedence for unary minus */
%% /* beginning of rules section */
list : /* empty */
| list stat '\n'
| list error '\n'
{ yyerrok; }
;
stat : expr
{ printf( "%d\n", $1 ); }
| LETTER '=' expr
{ regs[$1] = $3; }
;
expr : '(' expr ')'
{ $$ = $2; }
| expr '+' expr
{ $$ = $1 + $3; }
| expr '-' expr
{ $$ = $1 - $3; }
| expr '*' expr
{ $$ = $1 * $3; }
| expr '/' expr
{ $$ = $1 / $3; }
| expr '%' expr
{ $$ = $1 % $3; }
| expr '&' expr
{ $$ = $1 & $3; }
| expr '|' expr
{ $$ = $1 | $3; }
| '-' expr %prec UMINUS
{ $$ = - $2; }
| LETTER
{ $$ = regs[$1]; }
| number
;
number : DIGIT
{ $$ = $1; base = ($1==0) ? 8 : 10; }
| number DIGIT
{ $$ = base * $1 + $2; }
;
%% /* start of programs */
int yylex() { /* lexical analysis routine */
/* returns LETTER for a lower case letter, yylval = 0 through 25 */
/* return DIGIT for a digit, yylval = 0 through 9 */
/* all other characters are returned immediately */
int c;
while( (c=getchar()) == ' ' ) {/* skip blanks */ }
/* c is now nonblank */
if( islower( c ) ) {
yylval = c - 'a';
return ( LETTER );
}
if( isdigit( c ) ) {
yylval = c - '0';
return( DIGIT );
}
return( c );
}
int main()
{
return yyparse();
}
int yyerror(char *s)
{
fprintf(stderr,"%s\n",s);
return 0;
}
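To try the demo yourself (assuming you save the specification as calc.y, a file name of my choosing), generate and build the parser with yacc calc.y and cc y.tab.c -o calc. You can then type a line such as a=5 followed by a+3, and the parser prints 8.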
(Screenshots in the original post show the input file and the result of evaluating the expressions.)
Saturday, March 27, 2010
Building a simple compiler series(7)
In the last post, I gave a brief introduction to Lex along with a simple example. Lex is an important tool for reaching my goal, so I want to say more about it.
Regular expressions in Lex
A regular expression is a pattern description written in a meta-language. An expression is made up of symbols. Normal symbols are characters and numbers, but there are other symbols that have special meanings in Lex. Below are some symbols and their definitions for regular expressions in Lex. (These tables come from here.)
Character | Meaning |
A-Z, 0-9, a-z | Characters and numbers that form part of the pattern. |
. | Matches any character except \n. |
- | Used to denote range. Example: A-Z implies all characters from A to Z. |
[ ] | A character class. Matches any character in the brackets. If the first character is ^ then it indicates a negation pattern. Example: [abC] matches any one of a, b, or C. |
* | Match zero or more occurrences of the preceding pattern. |
+ | Matches one or more occurrences of the preceding pattern. |
? | Matches zero or one occurrences of the preceding pattern. |
$ | Matches end of line as the last character of the pattern. |
{ } | Indicates how many times a pattern can be present. Example: A{1,3} means one to three occurrences of A. |
\ | Used to escape meta characters. Also used to remove the special meaning of characters as defined in this table. |
^ | Negation; as the first character in a character class, it negates the class. At the start of a pattern, it matches the beginning of a line. |
| | Logical OR between expressions. |
"<some symbols>" | Literal meanings of characters. Meta characters hold. |
/ | Look ahead. Matches the preceding pattern only if followed by the succeeding expression. Example: A0/1 matches A0 only if A01 is the input. |
( ) | Groups a series of regular expressions. |
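A few concrete patterns built from these symbols (my own examples): [0-9]+ matches one or more digits; [^ \t\n]+ matches a run of non-whitespace characters; -?[0-9]+\.[0-9]+ matches a decimal number such as -3.14; and (ab)* matches the empty string, ab, abab, and so on.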
Tokens in Lex are similar to variable names in C. Each token has an associated expression.
Token | Associated expression | Meaning |
number | ([0-9])+ | 1 or more occurrences of a digit |
chars | [A-Za-z] | Any upper- or lower-case letter |
blank | " " | A blank space |
word | (chars)+ | 1 or more occurrences of chars |
variable | (chars)+(number)*(chars)*(number)* | A variable name: letters, optionally followed by digits and more letters |
The internal variables and functions in Lex
yyin | Of the type FILE*. This points to the current file being parsed by the lexer. |
yyout | Of the type FILE*. This points to the location where the output of the lexer will be written. By default, both yyin and yyout point to standard input and output. |
yytext | The text of the matched pattern is stored in this variable (char*). |
yyleng | Gives the length of the matched pattern. |
yylineno | Provides current line number information. (May or may not be supported by the lexer.) |
yylex() | The function that starts the analysis. It is automatically generated by Lex. |
yywrap() | This function is called when end of file (or input) is encountered. If this function returns 1, the parsing stops. So, this can be used to parse multiple files. Code can be written in the third section, which will allow multiple files to be parsed. The strategy is to make yyin file pointer (see the preceding table) point to a different file until all the files are parsed. At the end, yywrap() can return 1 to indicate end of parsing. |
yyless(int n) | This function can be used to push back all but the first n characters of the token that was read. |
yymore() | This function tells the lexer to append the next token to the current token. |
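As a small illustration of yytext, yyleng, and yyin (the rule and the file-handling code below are my own sketch, not from the original article):

%{
#include <stdio.h>
%}
%%
[a-zA-Z]+   { printf("word: %s (length %d)\n", yytext, yyleng); }
.|\n        { /* ignore anything that is not a word */ }
%%
int yywrap(void) { return 1; }

int main(int argc, char *argv[])
{
    if (argc > 1) {
        yyin = fopen(argv[1], "r");   /* lex a file instead of stdin */
        if (yyin == NULL) {
            perror(argv[1]);
            return 1;
        }
    }
    yylex();
    return 0;
}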
Lex declarations
In the definition section, you can not only declare C variables but also add Lex definitions (named patterns) like the ones below:
%{
int wordCount = 0;
%}
chars [A-Za-z\_\'\.\"]
numbers ([0-9])+
delim [" "\n\t]
whitespace {delim}+
words {chars}+
%%
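A sketch of how the rules and C code sections might complete this word-count example (the actions and the main function are my assumption; the scanner that Lex generates already includes stdio.h, so printf is available):

{words}      { wordCount++; }
{whitespace} { /* skip blanks, tabs, and newlines */ }
.            { /* ignore any other character */ }
%%
int yywrap(void) { return 1; }

int main(void)
{
    yylex();   /* scan standard input to end of file */
    printf("Number of words: %d\n", wordCount);
    return 0;
}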
Friday, March 26, 2010
Building a simple compiler series(6)
Lexical analysis is an important part of the compiling process. If you just want to learn something about compilers and try to write a simple compiler yourself, you do not need to write a lexical analyzer from scratch. There is a program named Lex that generates lexical analyzers; it was written by Eric Schmidt and Mike Lesk, and you can find it on many Unix/Linux systems (on Ubuntu you can install flex with apt-get). Lex and a parser generator, such as yacc or bison, are used together: Lex provides the tokens for the parser.
Structure of a lex file
Definition section
%%
Rules section
%%
C code section
The definition section is the place to define macros and include header files; you can also write any C code here.
The rules section is the most important part of the Lex file. Patterns are simply regular expressions; when Lex sees text in the input that matches a given pattern, it executes the C code associated with that pattern.
The C code section contains supporting statements and functions.
There are some internal variables and functions that are used in Lex source files:
yyin yyout yytext yyleng yylineno (Lex Variables)
yylex() yywrap() yyless(int n) yymore()
Here is a simple example that picks out all the numbers from a file.
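A minimal Lex specification along those lines (my own sketch; build it with flex and cc):

%{
#include <stdio.h>
%}
%%
[0-9]+   { printf("%s\n", yytext); /* print each maximal run of digits */ }
.|\n     { /* skip every non-digit character */ }
%%
int yywrap(void) { return 1; }

int main(void)
{
    yylex();
    return 0;
}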
Then create a file containing a string that mixes letters and numbers; the string is skdj3244sdkfj234kjsdfkl23kjl12.
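With that input, the generated lexer should print the numbers 3244, 234, 23, and 12, one per line.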
Thursday, March 18, 2010
Building a simple compiler series(5)
LR
LR is a bottom-up parsing algorithm: the L indicates that the input is processed from the left, and the R indicates that a rightmost derivation is produced (in reverse). A bottom-up parser has two possible actions: the first is Shift (move a terminal from the input string to the top of the stack), and the other is Reduce (replace a string a at the top of the stack with a nonterminal A, given the BNF choice A->a). A bottom-up parser is thus sometimes called a shift-reduce parser; a tiny worked trace follows.
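For example, with the toy grammar E -> E + n | n (my example) and the input n + n, a shift-reduce parser proceeds like this:
Stack | Input | Action |
$ | n + n $ | shift |
$ n | + n $ | reduce E -> n |
$ E | + n $ | shift |
$ E + | n $ | shift |
$ E + n | $ | reduce E -> E + n |
$ E | $ | accept |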
LR(0) item
An LR(0) item of a context-free grammar is a production choice with a distinguished position in its right side. Most books indicate this distinguished position with a period. For example, A->XYZ generates four items: A->.XYZ, A->X.YZ, A->XY.Z, and A->XYZ. . The item A->.XYZ means we expect a string that can be derived from XYZ; A->X.YZ means we have already seen a string that can be derived from X and now expect a string that can be derived from YZ.
LR(0) item closure
If I is an item set for grammar G, then CLOSURE(I) is constructed from I by the following two rules:
- First, add every item in I to CLOSURE(I).
- If A->a.Bb is in CLOSURE(I), B->c is a production, and B->.c is not yet in CLOSURE(I), then add it. Repeat until no more items can be added.
GOTO(I,X), where I is an item set and X is a grammar symbol, is the closure of the set of all items [A->aX.b] such that [A->a.Xb] is in I.
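For example (using a toy grammar of my own), take S'->S, S->(S)|a and I = {S'->.S}. Then CLOSURE(I) = {S'->.S, S->.(S), S->.a}, and GOTO(CLOSURE(I), '(') = CLOSURE({S->(.S)}) = {S->(.S), S->.(S), S->.a}.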
LR Parsing Algorithm
Input: an input string w and an LR parsing table for a grammar G, with its ACTION and GOTO functions.
Output: if w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error indication.
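Method (the standard LR driver loop): let s be the state on top of the stack and a the current input symbol, then repeat the following until accept or error:
- If ACTION[s, a] = shift t, push state t onto the stack and advance the input.
- If ACTION[s, a] = reduce A->b, pop |b| states off the stack, let t be the state now on top, push GOTO[t, A], and output the production A->b.
- If ACTION[s, a] = accept, parsing is complete.
- Otherwise, report a syntax error.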
Sunday, March 14, 2010
Building a simple compiler series(4)
A symbol table is a mechanism that associates attributes with identifiers. When we declare a variable in source code, we later refer to it by name, like temp or x. While compiling the source code, the compiler scans and parses the code and builds a table of the symbols.
There are some choices for data structure of the symbol table.
- Linear list
- Various binary search trees
- Hash table
A binary search tree is more efficient than a linear list: the average cost of insert and lookup is O(log n). But deletion is more complicated than in a linear list. If your language is not very complicated and the number of symbols is small, you can use a linear list or a binary search tree.
In practice, though, the hash table is the usual choice: most commercial compilers implement their symbol tables with hash tables, which offer constant expected time per operation.
Collisions are one of the main problems a hash table must handle. Perhaps the best scheme for compiler construction is open hashing, also called separate chaining: each bucket is actually a linked list, and when a collision happens, the new item is simply inserted into that bucket's list, as sketched below.
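A minimal C sketch of such a chained hash table (the names insert and lookup and the attr field are mine; a real symbol table would store richer attributes and handle scopes):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                /* a prime number of buckets */

struct entry {                        /* one node in a bucket's chain */
    char *name;
    int attr;                         /* stand-in for real attributes */
    struct entry *next;
};

static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s)   /* simple multiplicative string hash */
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

void insert(const char *name, int attr)
{
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = strdup(name);
    e->attr = attr;
    e->next = table[h];               /* new item goes at the head of the list */
    table[h] = e;
}

struct entry *lookup(const char *name)
{
    struct entry *e;
    for (e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;                      /* never declared */
}

int main(void)
{
    insert("temp", 1);
    insert("x", 2);
    printf("%d\n", lookup("x")->attr);                    /* prints 2 */
    printf("%s\n", lookup("y") ? "found" : "not found");  /* not found */
    return 0;
}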
Declarations
The behavior of a symbol table depends heavily on the properties of declarations in the language being translated. There are four basic kinds of declarations in programming languages: constant declarations, type declarations, variable declarations, and function declarations.
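For instance, in C the four kinds look like this (my own examples):

const int max = 100;         /* constant declaration */
typedef struct point Point;  /* type declaration */
int count;                   /* variable declaration */
int add(int a, int b);       /* function declaration */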
Wednesday, March 10, 2010
Building a simple compiler series(3)
Lexical analysis is the first phase of a compiler. The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. Besides identifying lexemes, it has other tasks: one is stripping out comments and whitespace; another is correlating error messages generated by the compiler with the source program.
Sometimes, lexical analyzers are divided into two parts:
- Scanning
- Lexical analysis proper
The regular languages used to specify tokens are built from a few basic operations on languages L and M:
- Union of L and M: the set of strings in L or in M
- Concatenation of L and M: the set of strings st with s in L and t in M
- Kleene closure of L: zero or more concatenations of strings from L
- Positive closure of L: one or more concatenations of strings from L
From NFA to DFA (Subset Construction Algorithm)
- At first, e-closure(s0) is the only state in Dstates, and it is unmarked.
- While there is an unmarked state T in Dstates, mark T.
- For every input symbol a, compute U = e-closure(move(T, a)).
- If U is not in Dstates, add U to Dstates as an unmarked state; in either case, set Dtran[T, a] = U.
- When every input symbol of T has been processed, go back and pick the next unmarked state; stop when no unmarked states remain.
For example, take the grammar A->Aa|b.
Removing the left recursion gives: A->bA'  A'->aA'|e (e is the empty string).
This technique only works for grammars with no e-productions and no cycles.
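Applied to a more realistic production (my example), expr -> expr + term | term becomes expr -> term expr' and expr' -> + term expr' | e in exactly the same way.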
Saturday, March 6, 2010
Building a simple compiler Series (2)
A syntax-directed definition associates
- With each grammar symbol, a set of attributes and,
- With each production, a set of semantic rules for computing the values of the attributes associated with the symbols appearing in the production.
Parsing is the process of determining how a string of terminals can be generated by a grammar. Most parsing methods fall into one of two classes, called top-down and bottom-up methods; these terms refer to the order in which nodes in the parse tree are constructed. Top-down parsing is easy to do by hand, but software tools that generate parsers directly from grammars often use bottom-up methods, because they can handle a larger class of grammars and translation schemes.
A Translator for Simple Expressions
This simple translator translates infix expressions into postfix expressions, but the expressions include only "+", "-", and single digits; a sketch is given below.
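Here is a minimal C sketch of such a translator (a recursive-descent version in the style of the Dragon book; the function names are mine):

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

static int lookahead;                 /* the current input character */

static void error(void)
{
    fprintf(stderr, "syntax error\n");
    exit(1);
}

static void match(int t)              /* consume t and read the next character */
{
    if (lookahead == t)
        lookahead = getchar();
    else
        error();
}

static void term(void)                /* a term is a single digit; echo it */
{
    if (isdigit(lookahead)) {
        int d = lookahead;
        match(lookahead);
        putchar(d);
    } else {
        error();
    }
}

static void expr(void)                /* expr -> term { ('+'|'-') term } */
{
    term();
    while (lookahead == '+' || lookahead == '-') {
        int op = lookahead;
        match(op);
        term();
        putchar(op);                  /* operator comes after its operands */
    }
}

int main(void)
{
    lookahead = getchar();
    expr();                           /* e.g. input 9-5+2 prints 95-2+ */
    putchar('\n');
    return 0;
}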
Friday, March 5, 2010
Building a simple compiler Series (1)
Syntax Definition
stmt-> if (expr) stmt else stmt
-> may be read as "can have the form"; this rule is called a production. In a production, lexical elements like the keyword if and the parentheses are called terminals, and variables like expr and stmt, which represent sequences of terminals, are called nonterminals.
Definition of Grammars:
A context-free grammar has four components:
- A set of terminal symbols
- A set of nonterminals
- A set of productions
- A designation of one of the nonterminals as the start symbol
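For example, here is the classic grammar for lists of digits separated by plus and minus signs (from the Dragon book):

list -> list + digit
list -> list - digit
list -> digit
digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The terminals are +, - and the digits; the nonterminals are list and digit; and list is the start symbol.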
Parse Tree
- The root is labeled by the start symbol
- Each leaf is labeled by a terminal or by the empty string
- Each interior node is labeled by a nonterminal
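For example, with the digit-list grammar above, the string 9-5+2 has the following parse tree (drawn with indentation; the root is the start symbol list):

list
 +- list
 |   +- list
 |   |   +- digit
 |   |       +- 9
 |   +- -
 |   +- digit
 |       +- 5
 +- +
 +- digit
     +- 2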
In most programming languages the four arithmetic operators (addition, subtraction, multiplication, and division) are left-associative; for example, 9-5+2 groups as (9-5)+2.
Syntax-Directed Translation
- Attributes. An attribute is any quantity associated with a programming construct. Examples of attributes are the data type of an expression and the number of instructions in the generated code.
- (Syntax-directed) translation schemes. A translation scheme is a notation for attaching program fragments to the productions of a grammar; an example is sketched below.
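For example (following the infix-to-postfix translator above), a translation scheme that prints the postfix form attaches a printing action to each production:

expr -> expr + term { print('+') }
expr -> expr - term { print('-') }
expr -> term
term -> 0 { print('0') }
...
term -> 9 { print('9') }

Each action runs only after both of its operands have been translated, which yields exactly the postfix order.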