introduction to compiler construction

8 min read | July 19, 2025

introduction

One of the modules that I decided to take during my honours year is, as you could guess from this blog series, compiler construction. Despite the puzzled looks from fellow classmates and questions like "why take that?" and "that is a lot of work", I am excited to see what this module has in store. I took it for two main reasons: I am genuinely curious about what actually goes on inside a compiler, and I want to try to make my own programming language.

As of the release of this introduction blog, I have completed one class. All I can say is that I am even more interested in this module than I was when I chose it at the beginning of the year. At the end of the 90 minute lecture, I exchanged curious glances with my one friend who is also taking the module, and with other students in the class, to see their reactions. The common response from each of us? “I am excited for this module!” It is incredibly encouraging to be surrounded by classmates who are also interested.

All in all, I decided to write about everything that we will be doing over the next 13 weeks. This introduction just brings some background to the first part of our work: a general compiler overview and production rules for a language. The blog series will follow everything I learn during class, as well as during the development of my compiler project.

compiler overview

First of all, a compiler is basically a computer program that translates other computer programs. In essence, it takes in an input (the source program), creates an intermediate representation (IR) of it, and then produces the final target program.

Figure 01: Flowchart of a simple compiler structure.

The following points are considered:

  • Form / Syntax
  • Context / Meaning
  • Mappings / Translations

Expanding on this, form/syntax refers to the structure of the input language. Context/meaning refers to what the program is actually intended to do. Most translations occur within two sections of the compiler, the front end and the back end (not to be confused with web development's frontend and backend server logic), and involve the transformations that the IR goes through as it moves between the different sections of the compiler.

front end

The front end's purpose is to take in the source program and construct the IR of the program (see [[#intermediate representation]] for more on this). This is done through calls to the following components (a small sketch of a scanner follows the list):

  • scanner
    • input character stream to a stream of valid words
  • parser
    • stream of valid words checked against the language’s grammar or production rules
  • elaborator
    • invoked by the parser to perform additional computations, such as building parts of the IR
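To make the scanner's job more concrete, here is a minimal sketch of one in Python. It is not from the course material: the token categories (NUMBER, IDENT, SYMBOL) and their regular expressions are my own assumptions, chosen only to show a character stream being turned into a stream of classified words.

import re

# Hypothetical token categories; a real language would define many more.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("SYMBOL", r"[+\-*/=();]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source: str):
    """Turn a character stream into a stream of (category, word) pairs."""
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind != "SKIP":          # whitespace is discarded
            yield kind, match.group()
    # note: characters matching no rule are silently skipped here;
    # a real scanner would report them as errors

print(list(scan("x = 42 + y")))
# [('IDENT', 'x'), ('SYMBOL', '='), ('NUMBER', '42'), ('SYMBOL', '+'), ('IDENT', 'y')]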

The front end follows a set of production rules (each rule describes one form of valid sentence) and checks the source program against them. This ensures that the source program is 1) in the correct format, 2) syntactically correct and 3) complete.

Its output, the IR, is a structured representation of what the compiler knows about the source program, including its syntax and structure. The IR is passed to the remaining components of the compiler; each component can then understand the source program and make its own changes, such as improving the code's structure or adding additional information to the IR.

intermediate representation

The IR holds the internal structure of the code, as well as any necessary supporting data structures. It can be represented in many different ways depending on the source/target language or the goals of the compiler, but the rules that govern its creation are fixed by the compiler's design and built into the compiler itself.
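As one illustration of what that internal structure might look like, here is a small sketch of a possible IR: an abstract syntax tree for arithmetic expressions built from Python dataclasses. The node shapes are invented for the example; a real compiler would choose its own based on its design goals.

from dataclasses import dataclass

# One possible IR: a tiny abstract syntax tree for arithmetic expressions.
@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str                 # "+", "-", "*", "/"
    left: "Num | BinOp"
    right: "Num | BinOp"

# The expression 1 + 2 * 3 as an IR that the rest of the compiler can walk.
tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))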

back end

The goal of the back end is to translate the contents of the IR into code that can be run on the target machine. It solves the following problems:

  • Instruction selection
    • the target machine has an instruction set architecture (ISA) that outlines all the operations the processor provides. This part maps each IR operation to the respective ISA operation(s)
  • Instruction scheduling
    • reordering the operations so that the processor's resources are used efficiently
  • Register Allocation
    • deciding which registers to use and which values they hold

The result of all of this is the executable of the program that can then be run.
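To give a rough feel for the instruction selection step alone (no scheduling or register allocation), below is a sketch that maps a toy tuple-based IR onto an imagined stack machine. Both the IR shape and the mnemonics (PUSH, ADD, MUL, ...) are assumptions made up for the example, not a real ISA.

# A tiny instruction selector for an imagined stack machine.
# The IR here is nested tuples like ("+", ("num", 1), ("num", 2)).
OPS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def select(node):
    """Map an IR node to a list of target-machine operations."""
    if node[0] == "num":
        return [f"PUSH {node[1]}"]
    op, left, right = node
    return select(left) + select(right) + [OPS[op]]

print(select(("+", ("num", 1), ("*", ("num", 2), ("num", 3)))))
# ['PUSH 1', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']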

What I have mentioned so far only covers the base sections of a compiler. A section like an optimiser can be present as well, and often is: a common compiler structure is the three-phase compiler, which has a front end, an optimiser, and a back end. Furthermore, each section is usually made up of a collection of interconnected components that get the job done. These components communicate through “passes”, each of which takes in the IR, works on it for a specific purpose, and passes it along. This gives the following benefits: 1) maintainability, 2) separation of concerns, 3) easier debugging, and 4) a simpler implementation process.

Figure 02: Flowchart of a three-phase compiler.
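As a sketch of the “pass” idea, here is a hypothetical optimiser pass (constant folding) over the same toy tuple IR as before: every pass takes the IR in and hands the IR back, so passes can be chained or added without touching the rest of the compiler. The pass and the IR shape are my own illustration, not the course's design.

import operator

FOLD = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(node):
    """Example optimiser pass: evaluate operations whose operands are constants."""
    if node[0] == "num":
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "num" and right[0] == "num":
        return ("num", FOLD[op](left[1], right[1]))
    return (op, left, right)

def run_passes(ir, passes=(fold_constants,)):
    """The compiler chains its passes, handing the IR from one to the next."""
    for run in passes:
        ir = run(ir)
    return ir

print(run_passes(("+", ("num", 1), ("*", ("num", 2), ("num", 3)))))
# ('num', 7)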

other notes

There are also different types of compilers, some of which are listed below:

  • Ahead-of-time compiler
    • in essence, translation occurs in a separate step, before the program is executed
  • Just-in-time compiler
    • translation of code at runtime

Furthermore, there is also an alternative to a compiler called an interpreter, which executes the source code directly (conceptually line by line) without producing a final executable file that would then be run; Python is a well-known example of an interpreted language.

A key part of designing a compiler is to ensure the following principles are met:

  • must preserve the meaning of the source program
  • must provide some discernible improvement to the input program, such as making it more energy efficient, producing a smaller executable file, or making it cross-platform compatible.

production rules

Referring back to the previous section, the production rules of a language are a set of rules governing what constitutes a valid sentence of source code. They are a key part of the compiler, as they are used to determine whether the source program is properly formatted and complete, allowing it to be translated and executed.

With that, various representations have been developed to help document these rules. Below is a brief explanation of one popular representation method, EBNF, which is what I have decided to go with for this project because it is 1) neat, 2) easy to understand, and 3) more concise than plain BNF.

extended backus-naur form

Extended Backus-Naur Form is an extension of Backus-Naur Form (BNF) [1], which was created by John W. Backus in the 1950s [2]. The notation was created to describe the syntax of a language's grammar [3]. Whilst John Backus created the notation, it was Peter Naur who popularised it [4].

With EBNF, additional meta-notations (namely repetition, grouping and optionality) were added. On top of this, some smaller changes were made to BNF to make the final production rules simpler and easier to interpret. The table below shows the changes made to BNF and the added meta-notations. Whilst there are set symbols that guide how to create a valid EBNF grammar, there are different variants and styling guides on how to approach its construction.

Notation    Meaning
=           is defined to be what follows
. or ;      end of a production rule
,           concatenation*
|           alternation: match either alternative
[…]         optional: match 0 or 1 times
{…}         repetition: match 0 or more times
(…)         grouping: treat the contents as a single item
(* … *)     comment

* it is common to see production rules use juxtaposition instead of the comma, which gives a cleaner and easier-to-read rule.

general structure

Below is an example of the general structure for an EBNF production rule:

digits = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" .

Furthermore, below is another example of matching any integer value:

digits = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" .
integer = ["+" | "-"] digit { digit } .

In the grammar above, digit matches any single digit character, while integer matches an optional sign character followed by one digit and then zero or more further digits.
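To tie this back to the parser from earlier, here is a small sketch of how the integer rule above could be recognised by hand in Python. It only checks this one rule and is not part of any real parser; it is just an illustration of reading a grammar rule off as code.

def is_integer(text: str) -> bool:
    """Recognise the rule: integer = ["+" | "-"] digit { digit } ."""
    i = 0
    if i < len(text) and text[i] in "+-":                 # ["+" | "-"]  optional sign
        i += 1
    if i == len(text) or text[i] not in "0123456789":     # digit        at least one required
        return False
    while i < len(text) and text[i] in "0123456789":      # { digit }    zero or more
        i += 1
    return i == len(text)                                 # the whole input must be consumed

print([is_integer(s) for s in ["42", "-7", "+003", "4.2", "+", ""]])
# [True, True, True, False, False, False]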

conclusion

That is basically it. It was enjoyable and interesting getting to grips with everything covered in this introduction post.

One last note is that I am making use of the textbook Engineering a Compiler (3rd Edition) by K. D. Cooper and L. Torczon [2] for this course.

Until next time…


Footnotes

  1. J. W. Backus et al., “Report on the algorithmic language ALGOL 60,” Commun. ACM, vol. 3, no. 5, pp. 299–311, May 1960, doi: 10.1145/367236.367262.

  2. K. D. Cooper and L. Torczon, Engineering a Compiler, 3rd ed. Morgan Kaufmann, 2022.

  3. D. D. McCracken and E. D. Reilly, “Backus-Naur form (BNF),” in Encyclopedia of Computer Science, GBR: John Wiley and Sons Ltd., 2003, pp. 129–131.

  4. R. Feynman, “EBNF: A Notation to Describe Syntax,” vol. 10, 2016, [Online]. Available: https://ics.uci.edu/~pattis/ICS-33/lectures/ebnf.pdf