
The Millipascal Compiler

This document is an exposition of the Millipascal compiler (mpc). It serves as a resource for other people trying to solve problems similar to the ones I did.

Additional resources complementing this document are the language specification and the compiler source. If all else fails, feel free to ask a question as an issue on GitHub.

The questions this document is meant to answer follow the structure of the contents below.

Contents

  1. A Brief Description of Millipascal
  2. The Overall Compiler Architecture
  3. Errors
  4. The High Level Intermediate Representation
    1. Representation of Symbols
    2. Scopes
    3. The Abstract Syntax Tree (AST)
    4. Lexemes
    5. The Types of Types
  5. Padeir0's Intermediate Representation (PIR)
  6. PIR validation
  7. The Middle-level Intermediate Representation (MIR)
  8. MIR validation
  9. The End

A Brief Description of Millipascal

Millipascal is a simple imperative language. It looks like Pascal, hence the name. Here's a hello-world program:

from io import print

data hello "Hello, World!\n"

proc main
begin
  print[hello, sizeof[hello]];
end

This program hardly shows anything about the language, but a few things are already visible.

Millipascal is very minimalistic. String literals are not directly present in expressions because they are treated as blobs, just as you would treat strings in assembly. The identifier hello evaluates to a pointer (ptr), completely untyped.

Besides pointers, we have the basic signed and unsigned integers from 8 to 64 bits, and a boolean. The boolean is not strictly necessary, but it helps when we're trying to validate our compiler transformations (more on that later).

Now, why would I choose square brackets over parentheses for procedure calls? That is: why f[] instead of f()? There are three main reasons.

After you get used to this call syntax, it grows on you. Initially it was a decision made purely to keep the grammar more robust; then I started seeing other advantages.

Now, if you're familiar with how print is implemented in other languages, you know how weird our software stack can get. In Millipascal it is not much better; here's one of the few assembly procedures I ever got working in the language:

export print

const begin
    SYS_WRITE = 1;
    STDOUT = 1;
end

proc print<stack>[p:ptr, size:i32] i64
asm
begin
    push rbp;
    mov rbp, rsp;

    mov r0, {SYS_WRITE};
    mov r2d, [rbp, size]@dword;
    mov r6, [rbp, p]@qword;
    mov r7, {STDOUT};
    syscall;

    mov [rbp, _ret0]@qword, r0;
    mov rsp, rbp;
    pop rbp;
    ret;
end

This directly implements the write syscall on Linux for amd64. Many more questions may be raised about this piece of code. The export print declaration is how names are made public to other modules. The <stack> annotation specifies the ABI the procedure will use; only one ABI was ever implemented.

If you haven't realised it by now: this project was never finished. It is functional, and I've even written an arbitrary-precision number library in the language, but some features never got fully developed. Support for multiple ABIs exists, but only one ABI was ever implemented. Support for asm procedures exists, but only a few procedures were thoroughly used and tested. This is what happens when projects grow out of proportion.

The following procedure converts a buffer to uppercase. This one is much more interesting: it shows that the language supports first-class procedures, and that global symbols can be declared out of order.

from io import print

data buff "am i uppercase yet?\n"

proc main
begin
  byte_map[buff, sizeof[buff], upper_case];
  print[buff, sizeof[buff]];
end

proc byte_map[b:ptr, bsize:i32, op:proc[i8][i8]]
var i:i32
begin
  set i = 0;
  while i < bsize begin
    set (b+i)@i8 = op[(b+i)@i8];
    set i += 1;
  end
end

proc upper_case[a:i8] i8
begin
  if a >= 'a' and a <= 'z' begin
    return a - 32ss;
  end
  return a;
end

Here, (b+i)@i8 is a pointer offset followed by a dereference. It reads "take b, add i and read an i8 from that address". There are other ways to do this; in fact, the language supports indexing, but that requires introducing the semantics behind struct (which is different from C, much to your dismay).

To be more precise, structs in Millipascal are only syntax sugar for pointers: they are not values per se, only compile-time metadata attached to pointers. Here's some code that uses structs to benefit from the indexing syntax:

(Note that ~ is the unary minus in Millipascal).

from ioutil import put_int, put_char

struct I32A begin
    num:i32;
end

data my_ints:I32A {
    5, ~10, 2, 50, ~9, 65, 0
}
const my_ints_length = sizeof[my_ints]/sizeof[I32A];

proc main
begin
    print_ints[my_ints, my_ints_length];
    insertion_sort[my_ints, my_ints_length];
    print_ints[my_ints, my_ints_length];
end

proc insertion_sort[array:I32A, length:i32]
var i, j:i32
begin
    set i = 1;
    while i < length begin
        set j = i;
        while j > 0 and
              array[j-1]->num > array[j]->num begin
            set array[j]->num <> array[j-1]->num;
            set j -= 1;
        end
        set i += 1;
    end
end

proc print_ints[array:I32A, length:i32]
var i:i32
begin
    set i = 0;
    while i < length begin
        put_int[array[i]->num:i64];
        put_char[' '];
        set i += 1;
    end
    put_char['\n'];
end

It may seem confusing, but array[i] is only syntax sugar for array+(i * sizeof[I32A]). The -> operator performs a dereference, just like in C. It is sugar for (pointer+offset)@type; in this particular case, the offset is zero, so the expression array[i]->num desugars to:

(array+(i * sizeof[I32A]))@i32

Structs are a bit more powerful than that: we're able to specify the offsets of fields, and even use negative offsets. This may seem silly, but I implemented it with a use-case already in mind: storing metadata about an object.

# object header
struct Obj [sizeof[i64]] begin
    Size:i64 {~sizeof[i64]};
end

This Obj struct is used inside the flalloc.mp allocator to retrieve the size of objects that come from user code:

set size = (p:Obj)->Size;

As you can see, the language is pretty boring: it's just like C, but strongly typed, with a module system and Pascal-like syntax. Even for a simple language, compilation can be quite involved; now we can start talking about how to implement it.

The Overall Compiler Architecture

The compiler is a monolithic 15 kloc project separated into various modules. It generates FASM assembly and depends on FASM to produce ELF files.

[figure: program overview]

The names of most stages here mirror the names of folders in the source. Here's a quick summary of the repository, with the total line count of each directory.

src               15756
├── asmproc       298
├── backend0      3603
│   ├── gen       1009
│   ├── mir       1411
│   └── resalloc  1183
├── constexpr     546
├── core          3420
├── fasm          224
├── format        490
├── lexer         761
├── linearization 1025
├── messages      297
├── parser        1870
├── pipelines     204
├── resolution    1058
├── testing       191
├── typechecker   1570
└── main.go       199

This gives you an idea of how complex each compiler phase is. For example, lexing is rather simple, so the line count of lexer is smaller than that of a more complicated stage like the typechecker.

The file that corresponds to most of the diagrams seen here is pipelines/pipelines.go. It contains the code that glues all compiler phases together, and it may be helpful to consult it while reading this section.
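
Since the compiler halts at the first error (more on that in the Errors section), the glue code boils down to running phases in order and stopping as soon as one fails. Here's a minimal sketch of that pattern; the phase type and runPhases function are illustrative, not the actual contents of pipelines.go.

type phase func(m *Module) *Error

// runPhases applies each phase in order, halting at the first error,
// mirroring the compiler's stop-at-first-error behaviour.
func runPhases(m *Module, phases ...phase) *Error {
	for _, p := range phases {
		if err := p(m); err != nil {
			return err
		}
	}
	return nil
}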

The core folder contains all the public data structures used throughout the compiler; a quick summary follows.

core           3420
├── asm        644
├── cc         51
├── errorkind  164
├── module     759
├── pir        1123
├── severity   28
├── strbuilder 48
├── types      409
├── util       53
└── core.go    141

The most important of these data structures will be described over the following sections.

The file core.go contains the Error data structure that is used almost everywhere in the source (hence it is the most core of the core data structures).

Internally, the compiler is separated into a frontend and a backend. The frontend is aware of how the language syntax works, and transforms the higher-level, syntax-oriented representation of the source into a semantics-oriented intermediate representation called PIR. The backend is only aware of the minute semantics of the intermediate representation, having only to understand the underlying abstract machine. It transforms PIR into lower representations and emits textual assembly.

The assembly procedures are directly transformed into the last representation the compiler has of the code: ASM. ASM is an in-memory representation of assembly code and, in the diagrams, is not the same as the "assembly" outputs. Don't worry too much about the nuances of each representation now; they will be explained in detail later.

[figure: mpc only]

PIR is a high-level representation of the semantics of Millipascal; it is quite different from the syntax tree of the first stages. The frontend is itself composed of several stages that transform Millipascal source into PIR.

[figure: frontend]

Stage 0 is concerned with resolving all dependencies and all types. It extracts all the necessary information from the source so that stage 1 can properly transform the module representation into PIR.

Most compiler errors will be emitted during stage 0; the few errors coming out of stage 1 are related to control flow and entry points.

[figure: frontend stage 0]

Stage 0 starts with the resolution phase: it finds all dependencies, builds the dependency graph, and checks all global declarations for name usage. It creates the module data structure, which contains the dependencies, the abstract syntax tree, and space for information collected in later phases. If all is well with name resolution and no errors are found, execution proceeds to the typechecking phase.

Typechecking is the first phase that deals with types. It infers and checks types at the same time, filling the syntax tree with the type of each expression. Besides types, it also checks for other semantic problems, such as whether assignments are well formed or whether a procedure with multiple returns is being used in an expression. If no problems are found in typechecking, execution proceeds to the constexpr phase.

The constant evaluation phase (constexpr) is a small pass that evaluates constants in global symbols and sizeof expressions using arbitrary-precision arithmetic (bignums). It has to run after all types are known so that numbers can be constrained to their fixed sizes. After constexpr, constants are just numbers and incur no runtime overhead.
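
To make "constrained to their fixed sizes" concrete, here's a small, self-contained sketch (not the compiler's actual code) that evaluates a constant with math/big and checks that the result fits in an i32:

package main

import (
	"fmt"
	"math/big"
)

// fitsInI32 reports whether an arbitrary-precision result can be
// stored in a signed 32-bit constant.
func fitsInI32(n *big.Int) bool {
	min := big.NewInt(-1 << 31)
	max := big.NewInt(1<<31 - 1)
	return n.Cmp(min) >= 0 && n.Cmp(max) <= 0
}

func main() {
	// Something like my_ints_length = sizeof[my_ints]/sizeof[I32A],
	// evaluated at compile time: 28 bytes of data, 4 bytes per item.
	length := new(big.Int).Div(big.NewInt(28), big.NewInt(4))
	fmt.Println(length, fitsInI32(length)) // 7 true
}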

[figure: frontend stage 1]

Stage 1 starts with the asmproc phase, which parses assembly procedures into the ASM representation. It needs to run at this exact place in the pipeline to use the evaluated constants, and to avoid keeping syntax trees through the next phase. It also checks for semantic errors like invalid labels or invalid instructions.

The linearization phase is a very complicated phase that transforms the whole source, including all dependencies, into a single PIR object. It mainly works on top of the AST, transforming it into a control flow graph (CFG). The only error this phase emits concerns missing returns in procedures.

Lastly, after the PIR is built, a phase checks whether the representation is well formed. This phase lives in core/pir/checker, as it is part of the semantics of the data structure.

[figure: backend]

The backend is where most of my tears have been shed. It starts by transforming PIR into a lower-level representation called MIR; this is the job of regalloc (the register allocator). This representation cares about where things are stored and implements a simplified version of what will be written in assembly.

After MIR is created, a validation phase performs abstract interpretation of the code to check that everything is well formed. This validation is crucial: it caught a lot of bugs in the register allocator, bugs that previously required chasing segmentation faults.
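
The validation itself is described later, but the core idea fits in a few lines: walk the instructions while tracking an abstract state per storage location, and reject any read of a location that was never written. The instr type below is invented for illustration; the real MIR is much richer.

// instr is a stand-in for a MIR instruction: which registers it
// reads and which it writes.
type instr struct {
	reads  []int
	writes []int
}

// checkDefsBeforeUses rejects code that reads a register before
// any instruction has written to it.
func checkDefsBeforeUses(code []instr, nregs int) bool {
	written := make([]bool, nregs)
	for _, in := range code {
		for _, r := range in.reads {
			if !written[r] {
				return false // read-before-write: ill-formed
			}
		}
		for _, r := range in.writes {
			written[r] = true
		}
	}
	return true
}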

If MIR is well formed, the next stage transforms it into ASM, joining it with the assembly procedures generated in the asmproc phase. Transforming MIR into ASM requires some workarounds for weird quirks of the amd64 architecture, quirks I only got to know after many segfaults.

Finally, the last phase transforms ASM into textual assembly following the syntax rules of the FASM assembler. This is rather simple and almost a one-to-one translation.

Errors

The compiler halts as soon as it encounters a single error. This is simple, but sufficient, as long as the user knows a bit about what they are doing.

There is a single Error type. It is used everywhere in the compiler and contains the information needed to communicate errors to the user. Here it is:

type Error struct {
	Code     et.ErrorKind
	Severity sv.Severity
	Message  string
	Location *Location
}

The ErrorKind type is the error code, enumerated in the src/core/errorkind package. It is used mostly to test the compiler, as I don't remember a single error code off the top of my head.

Severity is barely used; the only two kinds in use are Error and InternalError, which represent, respectively, errors to blame on the user and errors that are my fault.

Message is simply a string with the error message.

Finally, Location is a bit more complicated. Here is the full definition:

type Location struct {
	File  string
	Range *Range
}

type Range struct {
	Begin Position
	End   Position
}

type Position struct {
	Line   int
	Column int
}

Because most editors are line oriented, it is important that Position represents something the user can understand. The Range type simply encodes a section of source, from beginning to end, while the File field holds the file name of the source.
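
Judging by the error output shown further below (e.g. fact.mp:2:4 to 2:14), the textual form of a Location is file:line:column to line:column. A plausible sketch of such a String method (the real one lives in src/core/core.go, and this assumes fmt is imported) would be:

// String renders a Location as "file:line:col to line:col",
// matching the format seen in mpc's error messages.
func (this *Location) String() string {
	b, e := this.Range.Begin, this.Range.End
	return fmt.Sprintf("%s:%d:%d to %d:%d",
		this.File, b.Line, b.Column, e.Line, e.Column)
}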

To display the error to the user, a simple String method transforms the error to a string representation. This code is located in src/core/core.go:120.

func (this *Error) String() string {
	source := this.Location.Source()
	message := this.Location.String() + " " +
		this.Severity.String() +
		": " + this.Message
	if source != "" {
		return message + "\n" + source
	}
	return message
}

To print the source in the error message, the file is opened again and the relevant section is extracted. The offending part is then coloured red using ANSI escape codes.
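
The colouring itself is just string surgery with escape codes. Here's a minimal sketch of the idea (not mpc's actual code, and assuming a single-line, ASCII-only span):

// paintRed wraps the offending columns of a source line in ANSI
// escape codes so terminals render them in red.
func paintRed(line string, beginCol, endCol int) string {
	const red, reset = "\033[31m", "\033[0m"
	return line[:beginCol] + red + line[beginCol:endCol] + reset + line[endCol:]
}

Here are some typical mpc errors: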

fact.mp:2:4 to 2:14 error: invalid number of arguments, expected: 1
    if fact[5, 5] != 120 begin

F.mp:15:9 to 15:14 error: name is not defined
    proc F[a:Undef] i32

ioutil.mp:0:29 to 0:39 error: name not defined in module
    from io import print, fatal, everything

conv.mp:0:0 to 0:27 error: 'io' forms a invalid cycle:
(io, ioutil, conv, io)
    from io import print, fatal

As you can see, the errors are simple and sometimes even bad, but they're good enough to figure out what you did wrong.

The High Level Intermediate Representation

The first intermediate representation is the Module type, defined in src/core/module/module.go:82. This data structure agglomerates information about the source, including scopes, syntax trees, module graphs, types, etc.

type Module struct {
	BasePath string
	Name     string
	FullPath string
	Root     *Node

	Globals      map[string]*Global
	Dependencies map[string]*Dependency
	Exported     map[string]*Global

	Visited bool
}

A module has a one-to-one relation with a source file; this is why FullPath is present in the structure. Similarly, all imported modules are also files, and each has a corresponding Module instance.

The three maps (Globals, Dependencies and Exported) are first populated by the name resolution phase. The map[string]*Dependency maps names to imported modules, forming the dependency graph. Here's the definition of Dependency:

type Dependency struct {
	M      *Module
	Source *Node
}

It is a simple pair of a Module and the node where that module was imported.

Another field that deserves mention is Visited. This field is present on all types that are part of a directed graph, and is used to walk the graph without visiting the same node twice. There are lots of directed graphs in this compiler: some are acyclic, like the dependency graph; some are cyclic, like the control flow graph. These will be explained in detail later.
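
As a concrete picture, here's a sketch of a depth-first walk over the dependency graph using Visited (simplified; among other things, the real compiler also has to reset the flags between walks):

// walk visits every module reachable from m exactly once, even
// when several modules share the same dependency.
func walk(m *Module, visit func(*Module)) {
	if m.Visited {
		return
	}
	m.Visited = true
	visit(m)
	for _, dep := range m.Dependencies {
		walk(dep.M, visit)
	}
}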

Representation of Symbols

Symbols come in many forms, and the natural way to represent them would be a sum type. Since Go doesn't have sum types, we have to improvise using structs. Here's the struct for global symbols:

type Global struct {
	Kind       GK.GlobalKind
	ModuleName string
	Name       string
	N          *Node
	External   bool
	Refs       Refs
	Attr       []string
	Visited    bool

	Proc   *Proc
	Data   *Data
	Const  *Const
	Struct *Struct
}

Each of Proc, Data, Const and Struct would be a separate option in a sum type. In this case, we use Kind to disambiguate which one is populated, and the rest are set to nil.
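
In practice this means that code consuming a Global switches on Kind and touches only the matching field, something like the sketch below (the GK constant names and use* helpers are hypothetical):

// Only the field matching Kind is non-nil; the others stay nil.
switch g.Kind {
case GK.Proc:
	useProc(g.Proc)
case GK.Data:
	useData(g.Data)
case GK.Const:
	useConst(g.Const)
case GK.Struct:
	useStruct(g.Struct)
}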

There are, however, lots of fields common to all symbols. Name and ModuleName are, respectively, the symbol's name and the name of the module the symbol resides in. N is simply the root of the subtree for that symbol. External is true when the symbol comes from another module (as in from M import S imports); in that case, we use ModuleName to find the original definition of the symbol.

Again we see the Visited field; in this case, the graph in question is the symbol graph, defined by the Refs field. Because we can refer to the fields of structs, this graph is a little more complicated:

type Unit struct{}

type Refs struct {
	Set     map[string]Unit
	Symbols []SyField
}

type SyField struct {
	Sy    *Global
	Field int
}

When the reference in question is a field, the value of Field is greater than or equal to zero. If the reference is to a whole global, the field is set to -1. The Set field is used to check whether a symbol is already in the Symbols list.
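
For illustration, inserting a reference while keeping Set and Symbols in sync might look like the sketch below; the Add method and its "module::name#field" key format are assumptions, not the compiler's actual code (and fmt is assumed to be imported).

// Add records a reference once: Set gives O(1) deduplication,
// while Symbols preserves insertion order.
func (r *Refs) Add(sy *Global, field int) {
	key := fmt.Sprintf("%s::%s#%d", sy.ModuleName, sy.Name, field)
	if _, ok := r.Set[key]; ok {
		return
	}
	r.Set[key] = Unit{}
	r.Symbols = append(r.Symbols, SyField{Sy: sy, Field: field})
}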

This defines a reference graph that needs to be acyclic; otherwise a symbol would be defined in terms of itself. Symbols referenced in the bodies of procedures are not included in this graph, so recursive procedures are allowed even while the graph is kept free of cycles.

Scopes

Millipascal has only three nested scopes: the global module scope, where procedures, constants and data declarations are defined; the argument scope, where procedure arguments are defined; and finally the local scope, which contains the variables declared inside procedures.

Both the argument and local scopes live inside the Proc data structure, at src/core/module/module.go:239, while the global scope lives inside the Module data structure, described earlier.

The Proc type is defined as follows:

type Proc struct {
	Name   string
	ArgMap map[string]int
	VarMap map[string]int
	Args   []*Local
	Vars   []*Local
	Rets   []*T.Type
	Type   *T.Type
	Asm    []asm.Line

	N *Node
}

Here we can see the ArgMap and VarMap fields; these are part of the argument scope and local scope, respectively.

All scopes are simple Go maps. For example, the global scope is defined as map[string]*Global, while the argument and local scopes are similar, each with two structures: a map[string]int that maps an identifier to an index, and an array []*Local that contains the information about the local at that index. Locals are stored this way to preserve the order of arguments for the ABI.

[figure: scopes of Millipascal]

The compiler checks whether a name exists by first checking the local scope; if that fails, it checks the argument scope, then the global scope, following the direction of the arrows in the diagram above. If all of these checks fail, the compiler emits an error saying the name was not declared.
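
In code, the three-level lookup reads roughly like the sketch below (a simplification that only reports whether the name exists; the real code also returns information about the name):

// lookup resolves a name from the innermost scope outwards:
// locals, then arguments, then globals.
func lookup(m *Module, p *Proc, name string) bool {
	if _, ok := p.VarMap[name]; ok {
		return true // local variable
	}
	if _, ok := p.ArgMap[name]; ok {
		return true // procedure argument
	}
	if _, ok := m.Globals[name]; ok {
		return true // global symbol
	}
	return false // "name was not declared" error
}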

The Abstract Syntax Tree (AST)

The abstract syntax tree represents the source code as a hierarchical data structure. The compiler can output the AST by being called with mpc --ast file.mp. For example, consider the following code.

proc fact[n:i32] i32
begin
	if n == 0 begin
		return 1;
	end
	return n * fact[n-1];
end

Omitting the topmost node, the output becomes the following mess:

└─>{proc, 'proc':nil, 0:0 to 6:3}
    └─>{identifier, 'fact':nil, 0:5 to 0:9}
    └─>{parameters, '':nil, 0:10 to 0:15}
        └─>{:, ':':nil, 0:10 to 0:15}
            └─>{id list, '':nil, 0:10 to 0:11}
                └─>{identifier, 'n':nil, 0:10 to 0:11}
            └─>{i32, 'i32':nil, 0:12 to 0:15}
    └─>{type list, '':nil, 0:17 to 0:20}
        └─>{i32, 'i32':nil, 0:17 to 0:20}
    └─>nil
    └─>{block, '':nil, 1:0 to 6:3}
        └─>{if, 'if':nil, 2:1 to 4:4}
            └─>{==, '==':nil, 2:4 to 2:10}
                └─>{identifier, 'n':nil, 2:4 to 2:5}
                └─>{i32 literal, '0':nil, 2:9 to 2:10}
            └─>{block, '':nil, 2:11 to 4:4}
                └─>{return, 'return':nil, 3:2 to 3:10}
                    └─>{i32 literal, '1':nil, 3:9 to 3:10}
            └─>nil
            └─>nil
        └─>{return, 'return':nil, 5:1 to 5:21}
            └─>{*, '*':nil, 5:8 to 5:21}
                └─>{identifier, 'n':nil, 5:8 to 5:9}
                └─>{procedure call, '':nil, 5:12 to 5:21}
                    └─>{expression list, '':nil, 5:17 to 5:20}
                        └─>{-, '-':nil, 5:17 to 5:20}
                            └─>{identifier, 'n':nil, 5:17 to 5:18}
                            └─>{i32 literal, '1':nil, 5:19 to 5:20}
                    └─>{identifier, 'fact':nil, 5:12 to 5:16}
    └─>nil

You may find this unreadable, but I guarantee that after hours of debugging you will still find it unreadable. Anyway, here's what the compiler is telling you. Considering the excerpt {identifier, 'fact':nil, 0:5 to 0:9}: the first element is the node's lexical kind (identifier), 'fact' is the source text, the nil after the colon is the node's type (not yet inferred here), and 0:5 to 0:9 is the source range.

Now, if we ignore all the clutter, here's the graph representation of that piece of code:

[figure: syntax tree for the fact.mp program]

Here you can see the structure of the grammar much better. Inside the compiler this is implemented as a simple recursive data structure; you can find it in src/core/module/module.go:17.

type Node struct {
	Text   string
	Lex    LxK.LexKind
	Leaves []*Node

	Type     *T.Type
	MultiRet bool // if this is true, Type == nil

	Value *big.Int // for int literals

	Range *Range
}

Fields are filled depending on the type of the node, and in different phases of the compiler. There are other ways to implement this data structure, for example using sum types; Go does not have sum types, so we have to live with big structs.

The field Leaves is how the tree is linked together. It is simply linked in a single direction, and all passes go top-down on the AST.

Lexemes (Nodes)

A lexeme's single objective is labeling a section of source. In mpc, lexemes do not exist as a separate data structure; instead, they are inlined into the abstract syntax tree node. Two fields are of special interest: Text, which is of type string, and Lex, which is of type LxK.LexKind.

Text corresponds directly to a piece of source code. It is a substring, and since strings in Go are immutable, taking a substring produces no copy of the input.
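
For instance, slicing a Go string only creates a new (pointer, length) header over the same underlying bytes:

src := "proc max[a,b:i32] i32"
tok := src[5:8] // "max": a view into src's bytes, not a copy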

LxK.LexKind is an enum, defined in src/core/module/lexkind as a bunch of constants. This enum represents both lexical kinds (LEFTPAREN, RIGHTPAREN, IDENTIFIER, ...) and node kinds (IDLIST, TYPELIST, INSTR, ...).

The compiler can output lexemes by running mpc --lexemes file.mp. Consider the tokenization of the following excerpt:

proc max[a,b:i32] i32
begin
	if a >= b begin
		return a;
	end
	return b;
end

The function signature would be tokenized like so, with quotes added to remove ambiguity:

"proc", "max", "[", "a", ",", "b", ":", "i32", "]", "i32"

Each substring is marked with its respective LexKind according to the rules in src/lexer/lexer.go:245.

The Types of Types

I often see questions about how to represent the structure of types so that one can compare them and return meaningful errors. Here are a few types in Millipascal:

i32     ptr     proc[][]     proc[i32, i32][ptr]

Besides those, which are often called "anonymous types", there are nominal types created by struct declarations.

struct BigInt begin
    Array:I32A;
    Cap,Len:i16;
    Neg:bool;

    _pad1:u8; _pad2:u16;
end

Since structs are nominal types, we need not compare them by structure, only by name. In essence, structs are just pointers, so it's completely fine to cast them to ptr. However, we do need information about field names and field types, so structural information must still exist. Here's how I represent structs in the compiler:

type Struct struct {
	Module   string
	Name     string
	Fields   []Field
	FieldMap map[string]int
	Size     *big.Int

	WellBehaved bool // whether it can be used to typecheck blobs
}

Again we see both Module and Name; this is because any two structs in different modules are by definition different, i.e., M1::mystruct is not the same as M2::mystruct.

Fields must also remain in order; this is why we use both a map[string]int and a []Field. We then use a function to grab information based on names:

func (this *Struct) Field(name string) (Field, bool) {
	i, ok := this.FieldMap[name]
	if !ok {
		return Field{}, false
	}
	return this.Fields[i], true
}

Field simply encodes a field's name, type and offset:

type Field struct {
	Name   string
	Type   *Type
	Offset *big.Int
}

This Struct type is then part of the Type type, which encodes all necessary information about any Millipascal type.

type Type struct {
	Basic  BasicType
	Proc   *ProcType
	Struct *Struct
}

This is also one of those Go structs that should have been a sum type. In essence, only one of the fields ever carries a valid value, with the exception that if Struct is non-nil, then Basic is set to ptr. The Struct field was already explained, and Basic is just an enumeration of all built-in types. Lastly, ProcType is defined as follows:

type ProcType struct {
	CC   CCKind
	Args []*Type
	Rets []*Type
}

It includes information on the calling convention (CCKind) as an enumeration, and also includes all arguments and returns, in order. Since ProcType is defined in terms of Type, this is a recursive definition.

The code that compares whether two types are identical is bad enough that I will not dare to show it here (src/core/types/types.go:59). Instead, here's a high-level description of what it does, given two types T and U (a sketch in code follows the list):

  1. If T and U are not of the same kind, i.e., not both basic, or not both procedures, then they are different.
  2. If both T and U are basic types, then the Basic enum is compared directly (as an integer).
  3. If both T and U are procedures, the arguments and returns of each procedure are compared in order; if any one of them is not equal, the procedure types are different.
  4. If both T and U are struct types, then they are equal if they have the same name and live in the same module.
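
Here's a sketch of that comparison in code; it is not the actual implementation in src/core/types/types.go, and it ignores the calling convention of procedure types:

// equal follows the rules above: nominal comparison for structs,
// structural comparison for procedures, enum comparison for basics.
func equal(t, u *Type) bool {
	switch {
	case t.Struct != nil || u.Struct != nil:
		if t.Struct == nil || u.Struct == nil {
			return false // not both structs
		}
		return t.Struct.Module == u.Struct.Module &&
			t.Struct.Name == u.Struct.Name
	case t.Proc != nil || u.Proc != nil:
		if t.Proc == nil || u.Proc == nil {
			return false // not both procedures
		}
		return equalList(t.Proc.Args, u.Proc.Args) &&
			equalList(t.Proc.Rets, u.Proc.Rets)
	default:
		return t.Basic == u.Basic
	}
}

// equalList compares two type lists element by element, in order.
func equalList(ts, us []*Type) bool {
	if len(ts) != len(us) {
		return false
	}
	for i := range ts {
		if !equal(ts[i], us[i]) {
			return false
		}
	}
	return true
}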

How we build the types from source is a whole deal in itself. This is done by several procedures, including createInternalStructs in src/typechecker/typechecker.go:89, getType in the same file at line 467, and other small procedures.

How we infer types is rather simple: we propagate the types of expressions up the syntax tree. For example:

[figure: inference of 1 + a.field as i32]

Inference is done on all expressions; the only minor difference is that global constants must be inferred in topological order, i.e., all dependencies must be resolved before you start to infer a global expression.
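
As a sketch, bottom-up propagation over the Node type looks something like this (the LxK.PLUS constant name is hypothetical, and real inference also checks that the operand types match):

// inferExpr fills in Type bottom-up: leaves first, then the parent.
func inferExpr(n *Node) {
	for _, leaf := range n.Leaves {
		inferExpr(leaf)
	}
	// For 1 + a.field, both leaves come back as i32,
	// so the sum node is i32 as well.
	if n.Lex == LxK.PLUS {
		n.Type = n.Leaves[0].Type
	}
}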

The module IR is all about layering information on top of the source code. The next intermediate representation does not even look like Millipascal; in fact, I created it in a manner generic enough that other languages could use it as a target, which is why it's called Padeir0's Intermediate Representation and not Millipascal's Intermediate Representation.

Padeir0's Intermediate Representation (PIR)

PIR Validation

The Middle-level Intermediate Representation (MIR)

MIR Validation

The End

mpc was a big project that took way too long to finish. I decided to archive it and write this exposition so that it can serve as a resource for future compiler beginners. Hopefully it was useful to you, or at least, I hope you had fun reading it.