Wednesday, June 1, 2011

Compiler Construction



Programming languages are notations for describing computations to people and to machines. The world as we know it depends on many programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers.

There are many good books about how to design and implement compilers. We shall see that a few basic ideas, implemented in some Pascal compilers among others, can be used to construct translators for a wide variety of languages and machines. Beyond compilers themselves, the principles, algorithms, and techniques of compiler design apply to so many other domains that they are likely to be reused many times in the career of a computer scientist. The study of compiler writing touches upon programming languages, machine architecture, language theory, algorithms, and software engineering.

Usually compilers are treated as a single box that maps a source program into a semantically equivalent target program. If we open up this box a little, we see that there are two parts to this mapping: analysis and synthesis. The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It then uses this structure to create an intermediate representation of the source program. If the analysis part detects that the source program is either syntactically ill-formed or semantically unsound, then it must provide informative error reports, so the user can take corrective action. The analysis part also gathers information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the code synthesis part. The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. The analysis part is often called the front end of the compiler; the synthesis part is the back end.
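
To make the symbol table a little more concrete, here is a minimal sketch in Python; it is my own illustration rather than the layout of any particular compiler, and the field names are assumptions. The idea is simply one entry per identifier, holding the information that later phases will need.

class SymbolTable:
    def __init__(self):
        self._entries = {}

    def define(self, name, type_, line):
        # Record an identifier the first time the front end sees it.
        self._entries[name] = {"type": type_, "declared_at": line}

    def lookup(self, name):
        # Later phases (semantic analysis, code generation) read entries back.
        return self._entries.get(name)

table = SymbolTable()
table.define("position", "float", line=1)
print(table.lookup("position"))   # {'type': 'float', 'declared_at': 1}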

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to a record in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
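
As a rough illustration, here is a toy scanner in Python for a made-up little language of identifiers, integer literals and a few operators. The token names and the symbol-table layout are my own assumptions, not taken from any real compiler; the point is only to show characters being grouped into lexemes and emitted as (token-name, attribute-value) pairs.

import re

TOKEN_SPEC = [
    ("id",     r"[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    symbol_table = []                      # lexemes of identifiers seen so far
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":                 # whitespace only separates lexemes
            continue
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))   # attribute = symbol-table index
        elif kind == "number":
            tokens.append(("number", int(lexeme)))              # attribute = numeric value
        else:
            tokens.append((kind, None))                         # operators carry no attribute
    return tokens, symbol_table

print(scan("position = initial + rate * 60"))
# ([('id', 0), ('assign', None), ('id', 1), ('plus', None),
#   ('id', 2), ('times', None), ('number', 60)], ['position', 'initial', 'rate'])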

The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the tokens produced by the lexical analyzer (scanner) to create a tree-like intermediate representation that depicts the grammatical structure of the token stream. The subsequent phases of the compiler use the grammatical structure to help analyze the source program and generate the target code.
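
Continuing the toy example, a small recursive-descent parser (one of many possible parsing techniques) can turn such a token stream into a tree-like representation. The grammar and the nested-tuple node format below are assumptions made purely for illustration.

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                      # factor -> id | number
        name, value = take()
        return (name, value)

    def term():                        # term -> factor ('*' factor)*
        node = factor()
        while peek() == "times":
            take()
            node = ("*", node, factor())
        return node

    def expr():                        # expr -> term ('+' term)*
        node = term()
        while peek() == "plus":
            take()
            node = ("+", node, term())
        return node

    return expr()

tokens = [("id", 1), ("plus", None), ("id", 2), ("times", None), ("number", 60)]
print(parse(tokens))
# ('+', ('id', 1), ('*', ('id', 2), ('number', 60)))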

In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can have a variety of forms. Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis.

After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for a virtual machine. This intermediate representation should have two important properties: it should be easy to produce and it should be easy to translate into the target machine.
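
One common machine-like intermediate representation is three-address code, in which each instruction has at most one operator on its right-hand side and intermediate results live in compiler-generated temporaries. The sketch below shows how such code might be produced from a syntax tree; the names, the tree format, and the choice of three-address code itself are my assumptions for illustration, and for readability the leaves here carry identifier names directly rather than symbol-table indices.

def lower(tree):
    code = []
    counter = 0

    def emit(node):
        nonlocal counter
        if node[0] in ("id", "number"):       # leaves are already simple operands
            return str(node[1])
        op, left, right = node
        l, r = emit(left), emit(right)
        counter += 1
        temp = f"t{counter}"                  # compiler-generated temporary
        code.append(f"{temp} = {l} {op} {r}")
        return temp

    result = emit(tree)
    return code, result

tree = ("+", ("id", "initial"), ("*", ("id", "rate"), ("number", 60)))
code, result = lower(tree)
code.append(f"position = {result}")
print("\n".join(code))
# t1 = rate * 60
# t2 = initial + t1
# position = t2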

Of course, compiler construction is not an easy task. However, with the right data structures and algorithms it is possible to build a good optimizing compiler.


