Array length in high-level languages

While pondering writing a standard library for a language I’ve written for my RPG engine, I’ve been stuck considering a question I find pretty interesting – how can one allow a standard library of a language, and ideally only the standard library, to do special operations, like direct memory accesses? Specifically, in my instance, array length accesses, but I was hoping for a more general philosophy.

First, let’s consider the problem of getting the length of an array. In my language, as in many, arrays are a first-class construct, and their allocated length is written in the first word of the array in memory. As a result, getting the length of the array involves reading the word of memory pointed at by the array’s reference.

For example, suppose we had an array of five integers. This would occupy six words of memory; the initial set to the length (5), and then the five elements, all initialised to zero.

This is where the problem arises for my standard library: the language allows  users to index the array in a fairly typical manner – x[0] through x[5] – but these do not correspond directly to the memory accesses (in fact, the memory accesses are all shifted upwards by one to skip the length). As a result, it is not possible to access the array’s length directly through the language.

A trivial solution would be to implement some special new syntax – length x – which generates a direct memory read to the array’s address, hence returning the length. But that’s no fun – it involves making the parser more complicated, adding a special case to code generation, and causes what I would call a “break in immersion” when coding – it’s one less thing that is intuitive and natural to users, when they can do array.sort() but not array.length(). Taking this vein of thought further, we could instead parse it as normal, and hijack it during code generation – if we’re generating code for a method call on an array, then we don’t try generating a classic method call, but instead directly output memory access code.

This approach has many benefits, in that it’s trivial to implement, and doesn’t add any special cases to parsing (just code generation), or increase the mental load for users too much. Essentially, to end users, this is a fairly seamless approach, but it still leaves something to be desired – now some array logic, such as sorting, is encoded in higher level code, but some is just hard-coded into the compiler.

Maybe that’s an acceptable loss, but I was still interested in how other languages had solved this problem, so I looked into how Java and C# solved this issue.

Java

Java seems to solve this by having a dedicated JVM instruction called arraylength – this is along the lines of what I was saying above, where the compiler hijacks what syntactically looks like a field access. Syntactically, it is next to identical to your average field, but you can use reflection to prove that it’s not actually a field.

C#

C# seems to take a very similar approach to Java (unsurprisingly, given the similarity between the two), with a CIL instruction ldlen (this article http://mattwarren.org/2017/05/08/Arrays-and-the-CLR-a-Very-Special-Relationship/ is a goldmine for related information)

Summary

I really intended to look into quite a lot more languages – specifically Python, Ruby and Lua, but didn’t really have time. Digging through the Python compiler to find the answer was taking me quite a long time. If anyone stumbles upon this and happens to know how they handle it, I’d love a comment.

It does seem like the mainstream approach is just a special case in code generation, though. Personally, I was expecting an approach where verified library code would be able to hold lower-level code in it (like inline assembly in C) to avoid this, but this seems like quite an overkill feature in retrospect.

Leave a Reply

Your email address will not be published. Required fields are marked *