Array length in high-level languages

While pondering writing a standard library for a language I’ve written for my RPG engine, I’ve been stuck considering a question I find pretty interesting – how can one allow a standard library of a language, and ideally only the standard library, to do special operations, like direct memory accesses? Specifically, in my instance, array length accesses, but I was hoping for a more general philosophy.

First, let’s consider the problem of getting the length of an array. In my language, as in many, arrays are a first-class construct, and their allocated length is written in the first word of the array in memory. As a result, getting the length of the array involves reading the word of memory pointed at by the array’s reference.

For example, suppose we had an array of five integers. This would occupy six words of memory; the initial set to the length (5), and then the five elements, all initialised to zero.

This is where the problem arises for my standard library: the language allows  users to index the array in a fairly typical manner – x[0] through x[5] – but these do not correspond directly to the memory accesses (in fact, the memory accesses are all shifted upwards by one to skip the length). As a result, it is not possible to access the array’s length directly through the language.

A trivial solution would be to implement some special new syntax – length x – which generates a direct memory read to the array’s address, hence returning the length. But that’s no fun – it involves making the parser more complicated, adding a special case to code generation, and causes what I would call a “break in immersion” when coding – it’s one less thing that is intuitive and natural to users, when they can do array.sort() but not array.length(). Taking this vein of thought further, we could instead parse it as normal, and hijack it during code generation – if we’re generating code for a method call on an array, then we don’t try generating a classic method call, but instead directly output memory access code.

This approach has many benefits, in that it’s trivial to implement, and doesn’t add any special cases to parsing (just code generation), or increase the mental load for users too much. Essentially, to end users, this is a fairly seamless approach, but it still leaves something to be desired – now some array logic, such as sorting, is encoded in higher level code, but some is just hard-coded into the compiler.

Maybe that’s an acceptable loss, but I was still interested in how other languages had solved this problem, so I looked into how Java and C# solved this issue.

Java

Java seems to solve this by having a dedicated JVM instruction called arraylength – this is along the lines of what I was saying above, where the compiler hijacks what syntactically looks like a field access. Syntactically, it is next to identical to your average field, but you can use reflection to prove that it’s not actually a field.

C#

C# seems to take a very similar approach to Java (unsurprisingly, given the similarity between the two), with a CIL instruction ldlen (this article http://mattwarren.org/2017/05/08/Arrays-and-the-CLR-a-Very-Special-Relationship/ is a goldmine for related information)

Summary

I really intended to look into quite a lot more languages – specifically Python, Ruby and Lua, but didn’t really have time. Digging through the Python compiler to find the answer was taking me quite a long time. If anyone stumbles upon this and happens to know how they handle it, I’d love a comment.

It does seem like the mainstream approach is just a special case in code generation, though. Personally, I was expecting an approach where verified library code would be able to hold lower-level code in it (like inline assembly in C) to avoid this, but this seems like quite an overkill feature in retrospect.

vTables and runtime method resolution

My first second year university group project recently came to an end, in which we had to implement a fully functional compiler that generated machine code from a relatively simplistic procedural language. It had functions, nested expressions, and a strong dynamic typing system. In addition to typical stack-stored primitive types such as int, char, and bool, it also had two dynamically allocated types; arrays and pairs.

For the final milestone of the compiler, we had two weeks to implement our choice of an extension. We chose, fairly ambitiously, to implement a number of extensions, one of which was a fully working Object Oriented system which I took most of the responsibility for implementing.

In the end, our OO system ended up supporting three main types of user-defined types: structs, interfaces, and classes. Structs were relatively simplistic – little more than a fixed-size pieces of memory on the heap. Interfaces allowed you to declare methods which a class implementing the interface would have to implement, and classes allowed you to define methods and fields together.

Our classes supported extending a single other class, and implementing any number of interfaces. Semantically, this worked nearly identical to in Java – in fact, our implementation ended up creating a language that was essentially a subset of Java. The implementation had what are often referred to as the three pillars of object oriented programming – inheritance, encapsulation (public, private and protected, where private was class-only and protected was class-and-subclass-only), and polymorphism (run-time method dispatch).

This post is primarily about how I structured objects on the heap to allow them to support the run-time method dispatch – this was something we found difficult to research, as most resources about vTables and the like tend to be about C++ and rarely are about the low-level implementation. We found that C++ compilers would often optimise away the vTables, even with optimisations turned off, making it very difficult to analyse the ones it generated. As a result of these, I decided to write a summary of how I went about implementing it, in the hope that it is useful to others.

The system I ended up settling with results in a tiny amount of extra overhead for method calls on classes, but adds a significant overhead on calls to interfaces. This is something that I am sure is not done optimally, and as such I certainly do not wholeheartedly recommend this as a perfect way of laying out objects on the heap.

First, to consider why this is not a trivial problem, consider a basic example:

class A { void printAString() { print "A string" } }
class B extends A { void printAString() { print "Another string" } void anotherFunction() { print "Hello, World!" } }
void main() {
    A a = new B();
    a.printAString();
}

A naïve implementation of this would print out something that may be unexpected if you come from a background that leads you to expect the kind of behaviour that run-time dispatch allows, which is “A string”, rather than “Another string”. This comes about as the call to “a”  seems, to the compiler, to be acting on an A, not a B. This could theoretically be avoided by intelligently (as far as a compiler is concerned) noticing that it’s instantiated as a B, but this causes some problems – what if you have a function accepting an A as a parameter, or you instantiate the A in two mutually exclusive branches differently (for example an “if” that sets it to a B, but an “else” that sets it to an A)? These problems, so far as I am aware, make it nigh-on impossible to implement it with compile-time code analysis (Though, theoretically, you could generate a huge file that accounts for every branching possibility and expands them cleverly, this would certainly not be viable for large files).

So the way I solved this, and indeed, I believe, the way it is typically solved, is to create a table of methods for each object at compile-time, which is included in the assembly code generated (and hence the runtime, in some form). I implemented it in a fashion such that every object would have a table for itself, and an additional one for each object it “is” – that is, that it extends or implements. In the example above, the following might be generated:

A:
 o_A_printAString
B:
 o_B_printAString
 o_B_anotherFunction

Function labels here are named as “o_ImplementingType_functionName”. As you can see, the superclass function is in the same location in both tables. This means that, were a function calling a method on an object typed as A to use this as a layer of indirect to access A’s method, then it would be “tricked” into calling B’s version instead.

We then stored objects on the heap similarly to the instance of B that follows:

B's type identification number
B's vTable address
A's fields
B's fields
Interface information about B's implemented interfaces

To call a method on an instance of a class, you would have to navigate a word down from the object pointer, load the vTable pointer stored, use this to load the vTable, then add the offset of the method (this is known at compile time – for example, “printAString” will always be the zeroth method in the vTable). Then, load the value stored here into the program counter to jump to the method.

This is trickier for interfaces – since an object can implement many, we can’t just have a fixed position in the vTable for each method. There is undoubtedly a better way to do this, but I chose to put a few pieces of data about them at the end of the object. For each interface, the following was stored:

The interface's type identification number
The address of a custom version of this interface's vTable with correctly overloaded methods

Additionally, at the end of an object, a zero is stored, which denotes the end of the object. Interface vTables are looked up by a simple, but fairly significant overhead-adding algorithm: look to the interface section with the interface offset information stored at the beginning of the object, and then repeatedly jump two words until either the interface identification number you want is found, or you reach a zero (The zero indicating an explicit cast has failed – it should never occur on an implicit cast). Once found, jump a single word to get the vTable of the interface, then look down a fixed offset to find the method you are interested in calling. The overhead added by this is somewhere in the line of 6 instructions, plus potentially up to four times the total number of interfaces (depending on how far down the one you are interested in is in the list). This is clearly suboptimal, and an approach I considered is, when an object is cast to an interface, moving it’s pointer down to the start of the relevant interface, but this would have been considerably more difficult to integrate into the codebase, as all implicit cast locations would need to be considered.

There is undoubtedly a better way to handle interfaces – I am very interested in finding this out; feel free to contact me if you know of a more optimal way or are wondering about anything I have explained above.