A compiler that accepts any valid program written in C. It is made using Lex and Yacc. Returns a symbol table, parse tree, annotated syntax tree and intermediate code.
lex lexer.l
yacc -d -v parser.y
gcc -ll -w y.tab.c
./a.out<input1.c
LEX is a tool used to generate a lexical analyzer. The input is a set of regular expressions in addition to actions. The output is a table driven scanner called lex.yy.c.
YACC (Yet Another Compiler Compiler) is a tool used to generate a parser. It parses the input file and does semantic analysis on the stream of tokens produced by the LEX file. YACC translates a given Context Free Grammar (CFG) specifications into a C implementation y.tab.c. This C program when compiled, yields an executable parser.
A blog where we walk through on how to build the compiler from scratch! Link to the article.
Check out this amazing tutorial on YouTube by Jonathan Engelsma.
printf
statements take a single string as a parameter. No type checking is performed.scanf
statements take a string and &id as input. No type checking is performed.The final parser takes a C program with nested for loops or if-else blocks and performs lexical, syntax, and semantic analysis and then intermediate code generation. Let’s look at the implementation of each phase in detail:
The first step was to code the lexer file to take the stream of characters from the input programs and identify the tokens defined with the help of regular expressions. Next, the yacc file was created, which contained the context-free grammar that accepts a complete C program constituting headers, main function, variable declarations, and initializations, nested for loops, if-else constructs, and expressions of the form of binary arithmetic and unary operations. At this stage, the parser will accept a program with the correct structure and throw a syntax error if the input program is not derived from the CFG.
A struct was defined with the attributes id_name, data_type (int, float, char, etc.), type (keyword, identifier, constant, etc.), and line_no to generate a symbol table. The symbol table is an array of the above-defined struct. Whenever the program encounters a header, keyword, a constant or a variable declaration, the add function is called, which takes the type of the symbol as a parameter. In the add function, we first check if the element is already present in the symbol table. If it is not present, a new entry is created using the value of yytext as id_name, datatype from the global variable type in case of variables, type from the passed parameter, and line_no. After the CFG accepts the program, the symbol table is printed.
For the syntax analysis phase, a struct was declared that represented the node for the binary tree that is to be generated. The struct node has attributes left, right, and a token which is a character array. All the tokens are declared to be of type nam, a struct with attributes node and name, representing the token’s name. To generate the syntax tree, a node is created for each token and linked to the nodes of the tokens, which occur to its left and right semantically. The inorder traversal of the generated syntax tree should recreate the program logically.
The semantic analyzer handles three types of static checks.
For intermediate code generation, the three address code representation was used. Variables were used to keep track of the next temporary variable and label to be generated. The condition statements of if and for were also declared to store the labels to goto in case the condition is satisfied or not satisfied. The output of this step is the intermediate code.
#include<stdio.h>
#include<string.h>
int main() {
int a;
int x=1;
int y=2;
int z=3;
x=3;
y=10;
z=5;
if(x>5) {
for(int k=0; k<10; k++) {
y = x+3;
printf("Hello!");
}
} else {
int idx = 1;
}
for(int i=0; i<10; i++) {
printf("Hello World!");
scanf("%d", &x);
if (x>5) {
printf("Hi");
}
for(int j=0; j<z; j++) {
a=1;
}
}
return 1;
}