Skip to content

LLVM IR Emission(1): Overall and Type Emission

This blog briefly introduces the LLVM IR emission practice in Piovt-Language. The pl utilizes the LLVM to generates the backend code and does optimizer with the help of LLVM passes. The emission happens during converting from AST node into the LLVM, coupled with the syntax check as well. In this blog, we focus on the conversion details between LLVM IR and AST, and by the way inspects the ways of type check in a static language.

Kinds of AST Nodes

AST nodes stand all possible syntaxs in a language specification, and wholes all information about its own structure. For example, as a if-then-elseconditional statement, the AST node IfNode carries the conditional epxression(the result must be a bool type but it could be an expression as well),the then block logic and the else block logic.

Inside Pivot-Lang, there are many kinds of node which satisfy an enum here. We don't discuss each of them but focus on their classification. The AST nodes are the code representation of the Pivot-Lang EBNF. It carries the all necessary tokens mentioned in the ebnf, and nothing syntax check will be executed during construction.

Categories

Here, we could classify the nodes into different catagories: type, declaration, statement and expression.

Category Description
type type descirbe the meaning of a sequence of bits, such as the same bit sequence we could get diverse integer and float number.
declaration descirbe the properties of an identifier.
expression evaluate/calculate value, here we think the expression has no side effect.
statement control the how to compose the evaluations, it's a part of calculation but it focuses on the data flow. The expression statement is an ideal example as it contains both evaluation and control data flow.

Types

Bool Num String
Function Closure Macro
Tuple Trait Union

Blocks

A block is a possibly empty sequence of declarations and statements within matching brace brackets.

Declarations

Declaration defines some identifiers by associating the name with the other components.

  • Struct(struct)
  • Function(fn)
  • Implementation(impl)
  • New Variable(let)
  • Global variable(var)
  • Global const(const)
  • Macro(macro)
  • Type(type)
  • Trait(trait)
  • Use(use)
  • Generic Parameters(<>)
  • Generic Type Definition.

Expressions:

An expression specifies the computation of a value by applying operators and functions to operands. The common expressions are:

  • operands: Operands denote the elementary values in an expression, such as the literal values of the basic types, the identifier, qualified identifier(the external identifier), and parantheses expression. Unlike the math concept, it doesn't allow the operands to be nested.
    For example, 1 and 2 are operands but 1+2 is not an operand but an addition expression.
  • index exp: array element operation
  • primary exp: logic exp, compare exp, bit exp, arithmetic operators
  • selector(TakeExpOp): use . to denote the field or method f of the value x in x.f, digit is allowed to access tuple data like x.0.
  • macro pattern match exp
  • index expression(ArrayElementOp)
  • type cast expression (as and is)
  • **call expression**s: macro calls and function calls
  • address exp: pointer exp and de-reference expression
  • initialization expression: struct, array initialization
  • operators: Operators combine operands into expressions.
  • array exp(construct an array)
  • type cast: is and as
  • call: function call exp
  • closure expression
  • deconstruct expression: tuple, struct, struct field deconstructions
  • traits bound
  • type addition

Statements

  • empty stmt
  • expression stmt:
    An expression that performs as a statement.
  • conditional stmt: if
  • loop stmt: for and while
  • terminated stmt: break, continue
  • assignment stmt

LLVM IR Type Emission

LLVM IR is foundamentally represented in SSA form, and the LLVM tries to avoid exposing too many high level APIs for users. During emitting AST nodes into LLVM IR, we do the type check and then call the LLVM API to emit the LLVM IR into files. Currently, Pivot-Lang mixes them together. Hence, when it wants to support a check command for type check only, it uses a noop llvm builder which emits nothing LLVM IR code.

Then we look at how the LLVM IR represents the types.

Primary Types

Primary types are those basic and native types in Pivot-Lang, they are boolean and different kinds of numbers. LLVM provides some built-in types for them, which we can use them directly as the code refers:

fn get_pri_basic_type(&self, tp: &PriType) -> BasicTypeEnum<'ctx> {
    match tp {
      // the self.context is the inkwell context 
      PriType::I128 => self.context.i128_type().as_basic_type_enum(),
      PriType::U128 => self.context.i128_type().as_basic_type_enum(),
      PriType::F64 => self.context.f64_type().as_basic_type_enum(),
      PriType::BOOL => self.context.bool_type().as_basic_type_enum(),
      // ignore lines
    }
}
Then, it triggers the respective LLVM API to generate the constants with the value got in parsing stage.

String Literal

Unlike C stores the raw strings as an array of chars, Pivot-Lang stores it as a structure with three fields:

  • length of the string
  • utf8 length of the string
  • the content of the string

Here is a simple code snippet about using inkwell library to emit the string similar to the pivot-lang:

  let string_content = "this is a string literal";
  let val_ptr = builder.build_global_string_ptr(string_content, "string_literal").unwrap();

  // construct a structure field lists
  let struct_type = context.struct_type(&[
    context.i128_type().into(),
    context.i64_type().ptr_type(AddressSpace::from(1)).as_basic_type_enum(),
  ],false);

  // create a pointer which stands for an empty structure
  let alloc  = builder.build_alloca(struct_type, "string").unwrap();

  // insert fields based on the index
  let len= builder.build_struct_gep(struct_type,alloc, 0, "len").unwrap();
  builder.build_store(len, context.i128_type().const_int(string_content.len() as u64, true)).unwrap();
  let ptr = builder.build_struct_gep(struct_type,alloc, 1, "value").unwrap();
  builder.build_store(ptr, val_ptr).unwrap();

The emitting code looks like this:

@string_literal = private unnamed_addr constant [25 x i8] c"this is a string literal\00", align 1

define void @my_fn() {
entry:
  %string = alloca { i128, ptr }, align 8
  %len = getelementptr inbounds { i128, ptr }, ptr %string, i32 0, i32 0
  store i128 24, ptr %len, align 4
  %value = getelementptr inbounds { i128, ptr }, ptr %string, i32 0, i32 1
  store ptr @string_literal, ptr %value, align 8
}

In Pivot-Lang, the memory allocation is much complex as it utilizes its GC fully. The important here is to understand how the Pivot-Lang expresses types in LLVM format.

Array

As array is a sequence of elements with the same type, we need to allocate length * type_size at least to represent it. The array type holds two necessary fields, the vector of elements and the expression representing length. In Pivot-Lang, an array is managed by a structure as well:

+----------------------+           +-----------+
| ArrayType            |           | Real Data |
+======================+           +===========+
| GC Fn                | <---------| 8bytes ptr|
+----------------------+           +-----------+
| Element Type Pointer | --------->| elem1     |
+----------------------+           +-----------+
| Length               |           | ...       |
+----------------------+           +-----------+
                                   | elemn     |
                                   +-----------+

Because we focus on the representation of types, we ignore the GC here. There are two structures are constructed for array, one to store the meta-data of the array, and another stores the real element data only. Here is an example code about how to generate the fat pointer structure and the real data.

The build_in_bounds_gep API is called, which could append a value at the certain offset from the head(first element) specified by the index and element type.

let fat_pointer_type = context.struct_type(&[
  context.i64_type().into(), // array length
  context.i64_type().ptr_type(AddressSpace::from(1)).as_basic_type_enum(), // the real data
], false);

let fat_pointer_alloc = builder.build_alloca(fat_pointer_type, "fat_pointer").unwrap();

let elem_typ= context.i64_type();
let real_arr_head = builder.build_struct_gep(fat_pointer_type, fat_pointer_alloc, 1, "")
  .unwrap();

let elems = vec![111,2222,3333];
for (i, v) in elems.into_iter().enumerate(){
  let v = context.i64_type().const_int(v, false);

  let ptr = unsafe {
    builder.build_in_bounds_gep(
      elem_typ,
      real_arr_head, 
      &[context.i64_type().const_int(i as u64,false) ],
      "element"
    ).unwrap()
  };
  builder.build_store(ptr, v).unwrap();
}
The emit llvm IR is shown below:
define void @my_fn() {
entry:
  %fat_pointer = alloca { i64, ptr addrspace(1) }, align 8
  %0 = getelementptr inbounds { i64, ptr addrspace(1) }, ptr %fat_pointer, i32 0, i32 1
  %element = getelementptr inbounds i64, ptr %0, i64 0
  store i64 111, ptr %element, align 4
  %element1 = getelementptr inbounds i64, ptr %0, i64 1
  store i64 2222, ptr %element1, align 4
  %element2 = getelementptr inbounds i64, ptr %0, i64 2
  store i64 3333, ptr %element2, align 4
}
The getelementptr inbounds i64, ptr %0, i64 0 clearly shows it calculates the offset(0) of the head ptr with the element type(i64), and then get the pointer to be assigned. By this way, the values are assigned one by one.

Tuple

Pivot-Lang treats tuple as structure, hence it creates an LLVM structure though build_struct_gep instruction. Differ from the array which all elements holds the same type, tuple aggregates data with different types. But it doesn't disturb the LLVM IR emission for the layout of a tuple is clear.

The tuple llvm representation holds a 8-bytes GC pointer at the beginning and then the real data, the tuple fileds could be embed such as (1, (2,3), "4"), and the embedded tuples have the same layout conventions.

+--------+
| Tuple  |
+========+
| GC Fn  |
+--------+
| First  |
+--------+
| Second |
+--------+
| ...    |
+--------+
| Nth    |
+--------+

During accessing tuple fields, the compiler calculates the position of the field in the structure and visit it directly.

Structure

Structure is a bit more complex than the tuple due to the generic support, whose layout is unknown in the definition. Hence, the layout of a structure cannot be affirmed until it's instantiated. Due to this uncertain, the compiler processes it and stores it in symbol table for type check. Without generic, the structure layout is equivalent with the tuple in LLVM IR representation and the only difference is structure has named fields but tuple has index fields only.

As structure definitions are usually the top-level declarations, the Pivot-Lang compiler generates the structure definitions at the beginning before starting to generate LLVM IR. It processes the structure information for type check only and the real emission is delayed until compiler encountered the real usage.

Function

During the declaration, if the function doesn't carry any generic arguments, the compiler will help to generate the function llvm IR. Otherwise, the function IR emission will delay until the function is called so the compiler has enough information to instantiate the generic function.

No matter how complex the pivot-lang function is, in the LLVM layer, functions are very easy. Below is a simple demo to generate a function by the inkwell.

use inkwell::context::Context
fn main() {
    let context = Context::create();
    let module = context.create_module("my_mod");
    let builder = context.create_builder();

    let i32_type = context.i32_type();
    let fn_type = i32_type.fn_type(&[i32_type.into(), i32_type.into()], false);
    let fn_value = module.add_function("my_fn", fn_type, None);

    let entry = context.append_basic_block(fn_value, "fn_demo");

    builder.position_at_end(entry);


    let x = fn_value.get_nth_param(0).unwrap().into_int_value();
    let y = fn_value.get_nth_param(1).unwrap().into_int_value();

    let sum = builder.build_int_add(x, y, "sum").unwrap();
    builder.build_return(Some(&sum)).unwrap();
    module.print_to_stderr()
  }

The output is:

; ModuleID = 'my_mod'
source_filename = "my_mod"

define i32 @my_fn(i32 %0, i32 %1) {
fn_demo:
  %sum = add i32 %0, %1
  ret i32 %sum
}

Ditto, pivot-lang generates the function type in inkwell representation based on its all parameter types and the returned types.

Closure

Clsoure doesn't support generic like functions, which means it doesn't have a concept of instantiating. Instead, the closure supports to take parameters without decorated type.

Closure is consisted by a closure function type and a pointer which points to the closure captured data represented by a structure.

+--------------+     +--------------+
| Closure Type |     | CapturedVar1 |
+--------------+     +--------------+
| Point        | --> | CapturedVar2 |
+--------------+     +              |
                     |              |
                     +--------------+
                     | CapturedVarN |
                     +--------------+

The closure type is the same format with function, it uses LLVM function concept to create a function for the closure. One thing to highlight is that the first parameter of the closure is an i8 pointer points to the structure holds all captured data.

Union

Union is the sum-type in Pivot-Lang, it allows to create a new type by sum multiple types together.

pub type A<T|F> = B<T> | C
type A=b
pub type A<T|F> = B<T> | C
The representation of Union is marked here, which is a structure within i64 tag and an i8 pointer.

The i64 tag is a GC pointer and the i8 pointer is used to specify the real type for an union. Even though an union could be represented by many types, in the runtime it could be instanilized by only instantiate by one fixed type.