This document is an exposition of the Millipascal compiler (mpc).
It serves as a resource for other people trying to solve problems similar to the ones I faced.
Additional resources to this document are the language specification and compiler source. If all else fails, feel free to ask a question as an issue on github.
Millipascal is a simple imperative language. It looks like Pascal, hence the name. Here's a hello-world program:
from io import print

data hello "Hello, World!\n"

proc main
begin
    print[hello, sizeof[hello]];
end
This program hardly shows anything about the language, but we can already see that execution starts at a procedure called main.
Millipascal is very minimalistic. The reason string literals are not directly present in expressions is that they are treated as plain blobs, just like you would treat strings in assembly. The identifier hello evaluates to a pointer (ptr), completely untyped.
Besides pointers, we have the basic signed integers and unsigned integers from 8 bits to 64 bits, and a boolean. The boolean is not strictly necessary, but it helps when we're trying to validate our compiler transformations (more on that later).
Now, why would I choose square brackets over parentheses for procedure calls? That is: why f[] instead of f()? Initially it was a decision made purely to make the grammar more robust; it also helps that [ is easier to type than (. After you get used to this call syntax, it grows on you, and I started seeing other advantages.
Now, if you're familiar with how print is implemented in other languages, you know how weird our software stack can get. In Millipascal it is not much better; here's one of the few assembly procedures I ever got working in the language:
export print

const begin
    SYS_WRITE = 1;
    STDOUT = 1;
end

proc print<stack>[p:ptr, size:i32] i64
asm
begin
    push rbp;
    mov rbp, rsp;
    mov r0, {SYS_WRITE};
    mov r2d, [rbp, size]@dword;
    mov r6, [rbp, p]@qword;
    mov r7, {STDOUT};
    syscall;
    mov [rbp, _ret0]@qword, r0;
    mov rsp, rbp;
    pop rbp;
    ret;
end
This directly implements the write syscall on Linux for amd64. Many more questions may be raised about this piece of code.
The export print is
how we make names public to other modules.
When you see <stack>, this specifies the ABI the procedure will use.
If you haven't realised by now, this project was never finished. It is functional, and I've even written an arbitrary-precision number library in the language, but some features never got fully developed. The support for multiple ABIs exists, but only one ABI was ever implemented. The support for assembly procedures exists, but only a few procedures got thoroughly used and tested. This is what happens when projects grow out of proportion.
The following procedure converts a buffer to uppercase. This one is much more interesting: it shows that the language supports first-class procedures, and that global symbols can be declared out of order.
from io import print

data buff "am i uppercase yet?\n"

proc main
begin
    byte_map[buff, sizeof[buff], upper_case];
    print[buff, sizeof[buff]];
end

proc byte_map[b:ptr, bsize:i32, op:proc[i8][i8]]
var i:i32
begin
    set i = 0;
    while i < bsize begin
        set (b+i)@i8 = op[(b+i)@i8];
        set i += 1;
    end
end

proc upper_case[a:i8] i8
begin
    if a >= 'a' and a <= 'z' begin
        return a - 32ss;
    end
    return a;
end
Here, (b+i)@i8 is a pointer offset followed by a dereference. It reads "take b, add i, and read an i8 from that address". There are other ways to do this; in fact, the language supports indexing, but that requires introducing the semantics behind struct (which is different from C, much to your dismay).
To be more precise, structs in Millipascal are only syntax sugar for pointers; they are not values per se, only compile-time metadata for pointers. Here's some code that uses structs to benefit from the indexing syntax:
(Note that ~ is the unary minus in Millipascal).
from ioutil import put_int, put_char

struct I32A begin
    num:i32;
end

data my_ints:I32A {
    5, ~10, 2, 50, ~9, 65, 0
}

const my_ints_length = sizeof[my_ints]/sizeof[I32A];

proc main
begin
    print_ints[my_ints, my_ints_length];
    insertion_sort[my_ints, my_ints_length];
    print_ints[my_ints, my_ints_length];
end

proc insertion_sort[array:I32A, length:i32]
var i, j:i32
begin
    set i = 1;
    while i < length begin
        set j = i;
        while j > 0 and
              array[j-1]->num > array[j]->num begin
            set array[j]->num <> array[j-1]->num;
            set j -= 1;
        end
        set i += 1;
    end
end

proc print_ints[array:I32A, length:i32]
var i:i32
begin
    set i = 0;
    while i < length begin
        put_int[array[i]->num:i64];
        put_char[' '];
        set i += 1;
    end
    put_char['\n'];
end
It may seem confusing, but array[i] is only syntax sugar for array+(i * sizeof[I32A]). The -> operator performs a dereference, just like in C. It is sugar for (pointer+offset)@type; in this particular case, the offset is zero, so the expression array[i]->num is sugar for:
(array+(i * sizeof[I32A]))@i32
Structs are a bit more powerful than that: we're able to specify the offsets of fields, and even have negative offsets. This may seem silly, but I implemented it with a use case already in mind: storing metadata about an object.
# object header
struct Obj [sizeof[i64]] begin
    Size:i64 {~sizeof[i64]};
end
This Obj struct is used inside the flalloc.mp allocator to retrieve the size of objects that come from user code:
set size = (p:Obj)->Size;
As you can see, the language is pretty boring: it's just like C, but strongly typed, with a module system and Pascal's syntax. Even though the language is simple, compilation can be quite involved. Now we start talking about how to implement it.
The compiler is a monolithic 15kloc project separated into various modules. It generates FASM assembly and depends on FASM to generate ELF files.
Names of most stages here mirror the names of folders in the source. Here's a quick summary of the repository and the respective total line count of each directory.
src 15756
├── asmproc 298
├── backend0 3603
│ ├── gen 1009
│ ├── mir 1411
│ └── resalloc 1183
├── constexpr 546
├── core 3420
├── fasm 224
├── format 490
├── lexer 761
├── linearization 1025
├── messages 297
├── parser 1870
├── pipelines 204
├── resolution 1058
├── testing 191
├── typechecker 1570
└── main.go 199
This gives you an idea of how complex each compiler phase is. For example, lexing is rather simple, thus the line count of lexer is smaller than that of a more complicated stage like the typechecker.
The file that describes most of the diagrams seen here is the pipelines/pipelines.go file. It contains the code that glues all compiler phases together, and it may be helpful to consult this file while reading this section.
The core folder contains all public data structures used throughout the compiler; a quick summary follows.
core 3420
├── asm 644
├── cc 51
├── errorkind 164
├── module 759
├── pir 1123
├── severity 28
├── strbuilder 48
├── types 409
├── util 53
└── core.go 141
Here, the most important data structures are:
- asm, an in-memory representation of assembly code, henceforth referred to as ASM
- module, which represents the source using an abstract syntax tree (AST) and represents modules with a dependency graph
- types, which represents the structure of types, including structs
- pir, the main intermediate representation of source, based on a control flow graph (CFG)
The file core.go contains the Error data structure
that is used almost everywhere in the source (hence it is the most core
of the core data structures).
Internally, the compiler is separated into frontend and backend. The frontend is aware of how the language syntax works, and transforms the higher level, syntax-oriented representation of source into a semantic-oriented intermediate representation called PIR. The backend is only aware of the minute semantics of the intermediate representation, having only to understand the underlying abstract machine. The backend transforms PIR into lower representations and emits textual assembly.
The assembly procedures are directly transformed to the last representation the compiler has of the code: ASM. ASM is an in-memory representation of assembly code and, in the diagrams, is not the same as the "assembly" outputs. Don't worry too much about the nuances of each representation now; these will be explained in detail later.
PIR is a high-level representation of the semantics of Millipascal; it is quite different from the syntax tree of the first stages. The frontend is itself composed of several stages that transform Millipascal source into PIR.
Stage 0 is concerned with resolving all dependencies and all types. It extracts all necessary information from the source so that stage 1 is able to properly transform the module representation into PIR.
Most compiler errors will be emitted during stage 0; the few errors coming out of stage 1 are related to control flow and entry points.
Stage 0 starts with the resolution phase: it finds all dependencies, builds the dependency graph and checks all global declarations for name usage. It creates the module data structure, which contains dependencies, the abstract syntax tree, and space for information collected in other phases. If name resolution finds no errors, execution proceeds to the typechecking phase.
Typechecking is the first phase that deals with types. It infers and checks types at the same time and fills the syntax tree with the types of each expression. Besides types, it also checks for other semantic problems, such as whether assignments are well formed or whether a procedure with multiple returns is being used in an expression. If no problems are found in typechecking, execution proceeds to the constexpr phase.
The constant evaluation phase (constexpr) is a small
pass that evaluates constants in global symbols and
sizeof expressions using arbitrary precision
arithmetic (bignums). It has to run after all types are known
so that numbers can be constrained to their fixed sizes.
After constexpr,
constants are just numbers, and incur no runtime overhead.
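To make this concrete, here's a minimal self-contained sketch (in Go, like the compiler, but not the compiler's actual code) of folding a constant addition with math/big and then constraining the result to a fixed-size signed type:

package main

import (
    "fmt"
    "math/big"
)

// foldAdd is a sketch of the constexpr idea: evaluate with arbitrary
// precision first, then constrain the result to the fixed size the
// typechecker already determined.
func foldAdd(a, b *big.Int, bits uint) (*big.Int, error) {
    res := new(big.Int).Add(a, b)
    // Signed range for the target width: [-2^(bits-1), 2^(bits-1)-1].
    min := new(big.Int).Lsh(big.NewInt(1), bits-1)
    min.Neg(min)
    max := new(big.Int).Lsh(big.NewInt(1), bits-1)
    max.Sub(max, big.NewInt(1))
    if res.Cmp(min) < 0 || res.Cmp(max) > 0 {
        return nil, fmt.Errorf("constant does not fit in i%d", bits)
    }
    return res, nil
}

func main() {
    fmt.Println(foldAdd(big.NewInt(1<<30), big.NewInt(1<<30), 64)) // fits
    fmt.Println(foldAdd(big.NewInt(1<<30), big.NewInt(1<<30), 32)) // overflows
}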
Stage 1 starts with the asmproc phase, which parses assembly procedures into the ASM representation. It needs to run at this exact place in the pipeline to use evaluated constants, and to avoid keeping syntax trees through the next phase. It also checks for semantic errors like invalid labeling or invalid instructions.
The linearization phase is a very complicated phase that transforms the whole source, including all dependencies, into a single PIR object. It mainly works on top of the AST to transform it into a CFG. The only error this phase emits regards missing returns in procedures.
Lastly, after PIR is built, a phase checks if the representation
is well formed. This phase is located in core/pir/checker
as it is part of the semantics of the data structure.
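Before moving on to the backend, here's the general shape of the glue that runs these frontend phases, as a self-contained sketch; the real code in pipelines/pipelines.go is more involved, and all names here are illustrative:

package main

import "fmt"

// phase is a stand-in for one compiler pass: it transforms some
// shared state and may fail with an error.
type phase func(state *string) error

// run chains phases and halts at the first error, which is the
// error policy mpc follows.
func run(state *string, phases ...phase) error {
    for _, p := range phases {
        if err := p(state); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    src := "source"
    err := run(&src,
        func(s *string) error { *s += " -> resolved"; return nil },
        func(s *string) error { *s += " -> typechecked"; return nil },
        func(s *string) error { *s += " -> PIR"; return nil },
    )
    fmt.Println(src, err)
}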
The backend is where most of my tears have been shed. It starts by transforming PIR into a lower-level representation called MIR; this is the job of resalloc (the register allocator). This representation cares about where things are stored and implements a simplified version of what will be written in assembly.
After MIR is created, a validation phase performs abstract interpretation of the code to check that everything is well formed. This validation is crucial and caught a lot of bugs in the register allocator, bugs that previously required chasing segmentation faults.
If MIR is well formed, the next stage transforms it into ASM, joining it together with the assembly procedures generated in the asmproc phase. Transforming MIR into ASM requires some workarounds related to weird quirks of the amd64 architecture, quirks I only got to know after many segfaults.
Finally, the last phase transforms ASM into textual assembly following the syntax rules of the FASM assembler. This is rather simple and is almost a one-to-one translation.
The compiler halts as soon as it encounters a single error. This is simple, but sufficient, as long as the user knows a bit about what they are doing.
There is a single Error type. This type is used everywhere in the compiler and contains the information needed to communicate errors to the user. Here it is:
type Error struct {
    Code     et.ErrorKind
    Severity sv.Severity
    Message  string
    Location *Location
}
The type ErrorKind is the error code, enumerated in the file src/core/error/errorkind.go. This is used mostly to test the compiler, as I don't remember a single error code off the top of my head.
Severity is barely used; the only two kinds in use are Error and InternalError, which represent, respectively, errors to blame on the user and errors that are my fault.
Message is simply a string with the error message.
Finally, Location is a bit more complicated. Here is
the full definition:
type Location struct {
    File  string
    Range *Range
}

type Range struct {
    Begin Position
    End   Position
}

type Position struct {
    Line   int
    Column int
}
Because most editors are line oriented, it is important that Position represents something the user can understand. The type Range simply encodes a section of source, from beginning to end, while the field File is the file name of the source.
To display the error to the user, a simple String method
transforms the error to a string representation. This code is
located in src/core/core.go:120.
func (this *Error) String() string {
    source := this.Location.Source()
    message := this.Location.String() + " " +
        this.Severity.String() +
        ": " + this.Message
    if source != "" {
        return message + "\n" + source
    }
    return message
}
To print the source in the error message, the file is opened again and the relevant section is extracted. The offending part is then coloured red using ANSI colour codes. Here are some typical mpc errors:
fact.mp:2:4 to 2:14 error: invalid number of arguments, expected: 1
if fact[5, 5] != 120 begin
F.mp:15:9 to 15:14 error: name is not defined
proc F[a:Undef] i32
ioutil.mp:0:29 to 0:39 error: name not defined in module
from io import print, fatal, everything
conv.mp:0:0 to 0:27 error: 'io' forms a invalid cycle:
(io, ioutil, conv, io)
from io import print, fatal
As you can see, errors are simple and sometimes even bad, but they're good enough to find what you did wrong.
The first intermediate representation is the Module type, defined in src/core/module/module.go:82. This data structure agglomerates information about the source, including scopes, syntax trees, module graphs, types, etc.
type Module struct {
    BasePath     string
    Name         string
    FullPath     string
    Root         *Node
    Globals      map[string]*Global
    Dependencies map[string]*Dependency
    Exported     map[string]*Global
    Visited      bool
}
A module has a one-to-one relation with a source file; this is why FullPath is present in the structure. Similarly, all imported modules are also files, and each has a corresponding Module instance.
The three maps (Globals, Dependencies and Exported)
are first populated by the name resolution phase. The map[string]*Dependency
declares a map of modules, creating a dependency graph. Here's the definition of Dependency:
type Dependency struct {
    M      *Module
    Source *Node
}
It is a simple pair of a Module and the node where that module was imported.
Another field that needs mention is Visited. This field is present on all types that are part of a directed graph, and is used to walk the graph without visiting the same node twice. There are lots of directed graphs in this compiler: some are acyclic, like the dependency graph, and some are cyclic, like the control flow graph. These will be explained in detail later.
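As an illustration, here's a self-contained sketch of the traversal pattern that Visited enables; the Module here is trimmed down to just what the pattern needs, not the real structure:

package main

import "fmt"

// Module is a trimmed-down stand-in for the real data structure.
type Module struct {
    Name         string
    Dependencies map[string]*Module
    Visited      bool
}

// walk visits each module exactly once, following the Visited
// convention used by all graph-shaped structures in the compiler.
func walk(m *Module, visit func(*Module)) {
    if m.Visited {
        return
    }
    m.Visited = true
    visit(m)
    for _, dep := range m.Dependencies {
        walk(dep, visit)
    }
}

func main() {
    io := &Module{Name: "io", Dependencies: map[string]*Module{}}
    root := &Module{Name: "main", Dependencies: map[string]*Module{"io": io}}
    walk(root, func(m *Module) { fmt.Println(m.Name) })
}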
Symbols come in many forms, and the natural way to represent them would be a sum type; since Go doesn't have those, we have to improvise using structs. Here's the struct for global symbols:
type Global struct {
    Kind       GK.GlobalKind
    ModuleName string
    Name       string
    N          *Node
    External   bool
    Refs       Refs
    Attr       []string
    Visited    bool
    Proc       *Proc
    Data       *Data
    Const      *Const
    Struct     *Struct
}
Each of Proc, Data, Const and Struct should be a separate option in a sum type. In this case, we use Kind to disambiguate which one is populated, and the rest are set to nil.
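Here's a small self-contained sketch of this tagged-struct discipline: check the Kind tag before touching the corresponding pointer field. The names are simplified versions of the real ones:

package main

import "fmt"

// A miniature of the tagged-struct pattern: Kind says which pointer
// field is populated, the rest stay nil.
type GlobalKind int

const (
    KindProc GlobalKind = iota
    KindData
)

type Proc struct{ Name string }
type Data struct{ Name string }

type Global struct {
    Kind GlobalKind
    Proc *Proc
    Data *Data
}

// AsProc checks the tag before handing out the pointer, which is the
// discipline followed everywhere a Global is consumed.
func (g *Global) AsProc() (*Proc, bool) {
    if g.Kind != KindProc {
        return nil, false
    }
    return g.Proc, true
}

func main() {
    g := &Global{Kind: KindProc, Proc: &Proc{Name: "main"}}
    if p, ok := g.AsProc(); ok {
        fmt.Println(p.Name)
    }
}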
There are, however, lots of fields that are common to all symbols. Name and ModuleName are, respectively, the symbol name and the name of the module the symbol resides in. N is simply the root of the subtree for that symbol. External is true when the symbol comes from another module (as in from M import S imports); in this case, we use ModuleName to find the original definition of that symbol.
Again we see the Visited field; in this case, the graph in question is the symbol graph, and it is defined by the Refs field. Because we can refer to the fields of structs, this graph is a little more complicated:
type Unit struct{}

type Refs struct {
    Set     map[string]Unit
    Symbols []SyField
}

type SyField struct {
    Sy    *Global
    Field int
}
When the reference in question is a field, the value of Field is greater than or equal to zero. If the reference in question is a global, this field is set to -1. The field Set is used to look up whether a symbol is already on the Symbols list.
This defines a reference graph that needs to be acyclic, otherwise a symbol would be defined in terms of itself. The symbols referenced by the body of procedures are not included in this graph, so recursive procedures are allowed even while we keep this graph free of cycles.
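To show how Set and Symbols cooperate, here's a sketch of what inserting a reference could look like, building on the definitions above. This is my reconstruction, not the actual method, and the key format is an assumption:

// Insert adds a reference at most once: Set gives O(1) membership
// tests, while Symbols preserves insertion order.
func (r *Refs) Insert(sf SyField) {
    // Hypothetical key; the real code may key differently.
    key := sf.Sy.ModuleName + "::" + sf.Sy.Name
    if _, ok := r.Set[key]; ok {
        return
    }
    r.Set[key] = Unit{}
    r.Symbols = append(r.Symbols, sf)
}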
Millipascal has only three nested scopes: the global module scope, where procedures, constants and data declarations are defined; the argument scope, where procedure arguments are defined; and finally the local scope, which contains the variables declared inside procedures.
Both the argument and local scopes are inside the Proc data structure, at src/core/module/module.go:239, while the global scope is inside the Module data structure, described earlier.
The Proc type is defined as follows:
type Proc struct {
    Name   string
    ArgMap map[string]int
    VarMap map[string]int
    Args   []*Local
    Vars   []*Local
    Rets   []*T.Type
    Type   *T.Type
    Asm    []asm.Line
    N      *Node
}
Here we can see the ArgMap and VarMap fields; these are part of the argument scope and local scope, respectively.
All scopes are simple Go maps; for example, the global scope is defined as map[string]*Global. The argument and local scopes are similar, but use two structures: a map map[string]int that maps an identifier to an index, and an array []*Local that contains the information about the local at that index. Locals are stored this way to preserve the order of arguments for the ABI.
The compiler checks whether a name exists by first checking against the local scope; if that fails, it checks the argument scope, then the global scope, following the direction of the arrows in the above diagram. If all of these checks fail, the compiler emits an error saying the name was not declared.
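Using the structures shown above, the lookup chain can be sketched like this (a reconstruction, not the exact compiler code):

// findName resolves an identifier following the scope chain:
// local scope first, then arguments, then module globals.
func findName(m *Module, p *Proc, name string) (any, bool) {
    if i, ok := p.VarMap[name]; ok {
        return p.Vars[i], true // a *Local variable
    }
    if i, ok := p.ArgMap[name]; ok {
        return p.Args[i], true // a *Local argument
    }
    if g, ok := m.Globals[name]; ok {
        return g, true // a *Global symbol
    }
    return nil, false // not declared: emit an error
}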
The abstract syntax tree represents the source code using a hierarchical data structure.
The compiler can output the AST by calling it with mpc --ast file.mp. For example, consider the following code.
proc fact[n:i32] i32
begin
    if n == 0 begin
        return 1;
    end
    return n * fact[n-1];
end
Omitting the topmost node, it becomes the following mess:
└─>{proc, 'proc':nil, 0:0 to 6:3}
└─>{identifier, 'fact':nil, 0:5 to 0:9}
└─>{parameters, '':nil, 0:10 to 0:15}
└─>{:, ':':nil, 0:10 to 0:15}
└─>{id list, '':nil, 0:10 to 0:11}
└─>{identifier, 'n':nil, 0:10 to 0:11}
└─>{i32, 'i32':nil, 0:12 to 0:15}
└─>{type list, '':nil, 0:17 to 0:20}
└─>{i32, 'i32':nil, 0:17 to 0:20}
└─>nil
└─>{block, '':nil, 1:0 to 6:3}
└─>{if, 'if':nil, 2:1 to 4:4}
└─>{==, '==':nil, 2:4 to 2:10}
└─>{identifier, 'n':nil, 2:4 to 2:5}
└─>{i32 literal, '0':nil, 2:9 to 2:10}
└─>{block, '':nil, 2:11 to 4:4}
└─>{return, 'return':nil, 3:2 to 3:10}
└─>{i32 literal, '1':nil, 3:9 to 3:10}
└─>nil
└─>nil
└─>{return, 'return':nil, 5:1 to 5:21}
└─>{*, '*':nil, 5:8 to 5:21}
└─>{identifier, 'n':nil, 5:8 to 5:9}
└─>{procedure call, '':nil, 5:12 to 5:21}
└─>{expression list, '':nil, 5:17 to 5:20}
└─>{-, '-':nil, 5:17 to 5:20}
└─>{identifier, 'n':nil, 5:17 to 5:18}
└─>{i32 literal, '1':nil, 5:19 to 5:20}
└─>{identifier, 'fact':nil, 5:12 to 5:16}
└─>nil
You may find this unreadable, but I guarantee that after hours of debugging, you will still find it unreadable. Anyway, here's what the compiler is telling you, considering the excerpt {identifier, 'fact':nil, 0:5 to 0:9}: the first element is the kind of the node, here identifier; then comes the text of the node, 'fact', followed by the type of the node, in this case nil because types were still not inferred. The range 0:5 to 0:9 covers exactly the 'fact' token, directly from source, but other nodes, like the return, will comprise longer sections. The first pair of numbers represents the line and column of the beginning of the node, that is, line:column; the second pair represents the end.
Now, if we ignore all the clutter, here's the graph representation of that piece of code:
Here you can see the structure of the grammar much better. This is implemented inside the compiler as a simple recursive data structure, which you can find in src/core/module/module.go:17.
type Node struct {
    Text     string
    Lex      LxK.LexKind
    Leaves   []*Node
    Type     *T.Type
    MultiRet bool     // if this is true, T == nil
    Value    *big.Int // for int literals
    Range    *Range
}
Fields are filled depending on the type of the node, and in different phases of the compiler. There are other ways to implement this data structure, for example, by using sum types. Go does not have sum types, so we have to live with big structs.
The field Leaves is how the tree is linked together. It is
simply linked in a single direction, and all passes go top-down on the AST.
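This makes every pass a plain recursive descent. A minimal sketch of the shape of such a pass over the Node type above:

// walkNode applies fn to a node and then recurses into its leaves.
// Nil leaves are legitimate (see the AST dump above), so they must
// be skipped.
func walkNode(n *Node, fn func(*Node)) {
    if n == nil {
        return
    }
    fn(n)
    for _, leaf := range n.Leaves {
        walkNode(leaf, fn)
    }
}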
A lexeme's single objective is labeling a section of source. In mpc, lexemes do not exist as a separate data structure; instead, they are inlined into the abstract syntax tree node.
Two fields are of special interest: Text, which is of type string, and Lex, which is of type LxK.LexKind. Text corresponds directly to a piece of source code. It is a substring, and as strings in Go are immutable, it produces no copies of the input string.
LxK.LexKind is an enum, defined in src/core/module/lexkind as a bunch of constants. This enum can represent both lexical kinds (LEFTPAREN, RIGHTPAREN, IDENTIFIER, ...) and node kinds (IDLIST, TYPELIST, INSTR, ...).
The compiler can output lexemes by running mpc --lexemes file.mp.
Consider the tokenization of the following excerpt:
proc max[a,b:i32] i32
begin
    if a >= b begin
        return a;
    end
    return b;
end
The procedure signature would be tokenized like so, with quotes added to remove ambiguity:
"proc", "max", "[", "a", ",", "b", ":", "i32", "]", "i32"
Each substring is marked with its respective LexKind according to the rules in src/lexer/lexer.go:245.
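To illustrate both the marking and the no-copy property of Go substrings mentioned earlier, here's a toy scanner, loosely in the spirit of src/lexer/lexer.go but much simplified (it only recognizes identifiers and keywords, ignoring everything else):

package main

import (
    "fmt"
    "unicode"
)

// scanIdents walks the source once and emits substrings, never
// copying the input: each token shares the source's backing memory.
func scanIdents(src string) []string {
    var toks []string
    start := -1
    for i, r := range src {
        if unicode.IsLetter(r) || r == '_' ||
            (start >= 0 && unicode.IsDigit(r)) {
            if start < 0 {
                start = i
            }
            continue
        }
        if start >= 0 {
            toks = append(toks, src[start:i]) // substring: no copy
            start = -1
        }
    }
    if start >= 0 {
        toks = append(toks, src[start:])
    }
    return toks
}

func main() {
    fmt.Println(scanIdents("proc max[a,b:i32] i32"))
    // Output: [proc max a b i32 i32]
}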
I often find questions on how to represent the structure of types, so that one can compare them and return meaningful errors. Here are a few types in Millipascal:
i32 ptr proc[][] proc[i32, i32][ptr]
Besides those, which are often called "anonymous types", there are nominal types created by
struct declarations.
struct BigInt begin
    Array:I32A;
    Cap,Len:i16;
    Neg:bool;
    _pad1:u8; _pad2:u16;
end
As structs are nominal types, we need not compare them by structure, only by name. In essence, structs are just pointers, so it's completely fine to cast them to ptr. However, we do need information about field names and field types, so structural information must still exist. Here's how I represent structs in the compiler:
type Struct struct {
    Module      string
    Name        string
    Fields      []Field
    FieldMap    map[string]int
    Size        *big.Int
    WellBehaved bool // whether it can be used to typecheck blobs
}
Again, we see both Module and Name; this is because any two structs in different modules are by definition different, i.e., M1::mystruct is not the same as M2::mystruct. Fields must also remain in order, which is why we use both map[string]int and []Field.
We then use a function to grab information based on names:
func (this *Struct) Field(name string) (Field, bool) {
    i, ok := this.FieldMap[name]
    if !ok {
        return Field{}, false
    }
    return this.Fields[i], true
}
Field simply encodes information about offset, name and type:
type Field struct {
    Name   string
    Type   *Type
    Offset *big.Int
}
This Struct type is then part of the Type type,
which encodes all necessary information about any Millipascal type.
type Type struct {
    Basic  BasicType
    Proc   *ProcType
    Struct *Struct
}
This is also one of those Go structs that should have been a sum type. In essence, only one of the fields ever carries a valid value, with the exception that if Struct is non-nil, then Basic is set to ptr.
The Struct field was already explained, while Basic is just an enumeration of all built-in types. Lastly, ProcType is defined as follows:
type ProcType struct {
    CC   CCKind
    Args []*Type
    Rets []*Type
}
This includes information on the calling convention (CCKind) as an enumeration, and also includes all arguments and returns, in order. Since ProcType is defined in terms of Type, this is a recursive definition.
The code that compares whether two types are identical is bad enough that I will not dare to show it here (src/core/types/types.go:59). Instead, here's a high-level description of what it does, given T and U as types:
- If T and U are not of the same kind, i.e., not both basic or not both procedures, then they are different.
- If T and U are basic types, then the Basic enum is compared directly (as an integer).
- If T and U are procedures, the arguments and returns of each procedure are compared in order; if any one of them is not equal, the procedure types are different.
- If T and U are struct types, then they are equal if they have the same name and are in the same module.
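Here's a compact sketch of those rules over the Type and ProcType definitions above; the real code at src/core/types/types.go:59 handles more details (calling conventions, for instance), which I omit here:

// equal implements the comparison rules described above, simplified.
func equal(t, u *Type) bool {
    // Struct types are nominal: same module and same name.
    if t.Struct != nil || u.Struct != nil {
        return t.Struct != nil && u.Struct != nil &&
            t.Struct.Module == u.Struct.Module &&
            t.Struct.Name == u.Struct.Name
    }
    // Procedure types are structural: compare args and rets in order.
    if t.Proc != nil || u.Proc != nil {
        if t.Proc == nil || u.Proc == nil ||
            len(t.Proc.Args) != len(u.Proc.Args) ||
            len(t.Proc.Rets) != len(u.Proc.Rets) {
            return false
        }
        for i := range t.Proc.Args {
            if !equal(t.Proc.Args[i], u.Proc.Args[i]) {
                return false
            }
        }
        for i := range t.Proc.Rets {
            if !equal(t.Proc.Rets[i], u.Proc.Rets[i]) {
                return false
            }
        }
        return true
    }
    // Basic types compare directly as integers.
    return t.Basic == u.Basic
}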
How we build types from source is a whole deal in itself; this is done by several procedures, including createInternalStructs in src/typechecker/typechecker.go:89, getType in the same file at line 467, and other small procedures.
How we infer types is rather simple: we simply propagate the types of expressions up the syntax tree, for example in an expression like 1 + a.field as i32. Inference is done on all expressions; the only minor difference is that global constants must be inferred in topological order, i.e., all dependencies must be resolved before you start to infer a global expression.
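As a sketch of that propagation step over the Node type from earlier, here's what inferring a binary + might look like; infer stands for the hypothetical recursive entry point, and equal is the comparison sketched in the previous section:

// inferPlus infers the leaves first, then propagates their type to
// the '+' node itself, provided both operands agree.
func inferPlus(n *Node, infer func(*Node) bool) bool {
    left, right := n.Leaves[0], n.Leaves[1]
    if !infer(left) || !infer(right) {
        return false
    }
    if !equal(left.Type, right.Type) {
        return false // a real pass would emit a proper *Error here
    }
    n.Type = left.Type // propagate the operand type up the tree
    return true
}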
The module IR is all about placing information on top of the source code; the next intermediate representation does not even look like Millipascal. In fact, I created it in a manner generic enough that other languages could use it as a target, which is why it's called Padeir0's Intermediate Representation and not Millipascal's Intermediate Representation.
mpc was a big project that took way too long to finish. I decided to archive it and write this exposition so that it can serve as a resource for future compiler beginners. Hopefully it was useful to you, or at least, I hope you had fun reading this.