Tumbleweed Picks up Speed

Following up from my previous introduction to the Tumbleweed project, I’ve managed to scavenge some more free time to tinker with it. I’ve been looking into two areas: UI and performance.

GUI/IDE

The UI side of things is proving more problematic than I’d hoped. It originally used IUP, from the same Tecgraf/PUC-Rio stable as Lua. IUP is almost perfect in most ways: it has a very clean and simple API, looks nice, and is plenty powerful enough for my needs. However, I hit one snag early on: it has no native Mac OS X support. The only Mac OS X option is via GTK/X, which just doesn’t quite feel right to me. I’ve considered other options, including the following…

  • Implement a complete GUI on top of OpenGL within Tumbleweed.
    • Pros:
      • Complete control over the GUI from within the language.
      • Common look and feel across all platforms.
      • Fun challenge.
    • Cons:
      • No native look and feel on different platforms.
      • Lots of work.
  • Implement the GUI in a browser, and expose the necessary data via a socket in the runtime.
    • Pros:
      • Common interface on all platforms.
      • Can utilise existing UI tooling from the HTML/JavaScript world.
    • Cons:
      • No truly standalone applications (not sure yet whether this is a problem).
      • Not written in Smalltalk.
  • Create the GUI in Qt and bind the runtime into that.
    • Pros:
      • Industry standard cross platform GUI tools, complete and rich.
      • Well proven.
    • Cons:
      • Heavyweight.

Anyway, this is on hold for the moment, while I consider the options.

Performance

This area has been more productive. I brought in some of the tiny benchmark code from GNU Smalltalk for testing purposes. It doesn’t fully work yet, but it’s a good starting point, and from it I managed to get some initial timings for message throughput. Before any optimisation, tests were showing around 25,000 sends/second. Since then, I’ve implemented some key optimisations in the VM. Most of these are fairly obvious, but work up to now had been focused on getting things working correctly, without a view to performance.

  1. Re-introduce the special handling of small integers. LST originally treated small integers as a special type, holding the actual value in the object handle itself rather than a reference to an object on the heap. This was removed to make some refactoring easier and to simplify the code in various places. Profiling showed that the number of small integers allocated during normal running is very high, so this was a clear area for significant benefit. Re-implementing the small integer specialisation improved throughput to ~480,000 sends/sec, a big win! (See the first sketch after this list.)
  2. ObjectHandle referencing. Internally, Tumbleweed uses a custom class, ObjectHandle, to hold references to Smalltalk objects that mustn’t be garbage collected but aren’t yet reachable from the main system. The main problem this solves: when the VM has to allocate multiple objects during a primitive operation, and a second or subsequent allocation triggers a GC, the first object may be collected, as it isn’t yet visible from the Smalltalk side. Unfortunately, the naive mechanism used initially relied on a std::map to hold the set of hard object references, and adding and removing entries in every ObjectHandle constructor and destructor was expensive. To alleviate this I changed ObjectHandle to use an intrusive linked list (see the second sketch after this list). Inserting new ObjectHandles is very cheap, since I keep a ‘tail’ pointer statically on the class, and removal is just as easy: simply rewire the links around the handle being destroyed. During GC, iteration is a single pass over the linked list, one indirection per step. This change improved throughput to ~848,000 sends/second.
  3. Runtime optimisations. This third phase involved lots of tweaks to hotspots identified during profiling: validating every use of ObjectHandle and only using it where it’s actually needed; caching the lookup of all standard class objects used within the VM (see the third sketch below); inlining some key functions that the compilers weren’t inlining themselves; and finally, a smaller change that might open up more avenues in the future, completely separating the build of the initial image builder from the main runtime. This lets the runtime make assumptions that it can’t if the same code must also support building the initial image. Final throughput after these changes: ~1,914,000 sends/second.
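
For the curious, here’s a minimal sketch of the kind of tagging scheme the first item describes. The names and the exact bit layout are illustrative assumptions, not Tumbleweed’s actual encoding; the idea is simply that heap pointers are at least two-byte aligned, so the low bit is free to mark an immediate integer.

    #include <cstdint>

    // Illustrative only: an object reference is either a real heap pointer
    // (low bit 0, guaranteed by alignment) or an immediate small integer
    // (low bit 1), so small integer arithmetic never touches the heap.
    typedef std::intptr_t ObjectRef;

    inline bool isSmallInt(ObjectRef ref) { return (ref & 1) != 0; }

    // Assumes the usual two's-complement shift behaviour of mainstream
    // compilers for negative values.
    inline ObjectRef makeSmallInt(std::intptr_t value) { return (value << 1) | 1; }
    inline std::intptr_t smallIntValue(ObjectRef ref)  { return ref >> 1; }

    // Example: an add primitive can then run without any allocation.
    inline ObjectRef addSmallInts(ObjectRef a, ObjectRef b) {
        // Overflow handling omitted; a real VM would fall back to a
        // heap-allocated integer object here.
        return makeSmallInt(smallIntValue(a) + smallIntValue(b));
    }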
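
Second, a simplified sketch of the intrusive list idea behind ObjectHandle. Again, the member and method names are illustrative and the real class carries more than this; the point is that registering and unregistering a handle is a handful of pointer writes rather than a std::map insert and erase.

    struct object; // the VM's heap object type (assumed)

    class ObjectHandle {
    public:
        // Link this handle onto the tail of the global list.
        explicit ObjectHandle(object* obj)
            : m_obj(obj), m_prev(s_tail), m_next(nullptr) {
            if (s_tail) s_tail->m_next = this;
            s_tail = this;
        }

        // Unlink by rewiring the neighbours around this handle.
        ~ObjectHandle() {
            if (m_prev) m_prev->m_next = m_next;
            if (m_next) m_next->m_prev = m_prev;
            if (s_tail == this) s_tail = m_prev;
        }

        object* get() const { return m_obj; }

        // GC root enumeration: one pointer indirection per live handle.
        template <typename Fn>
        static void forEachRoot(Fn fn) {
            for (ObjectHandle* h = s_tail; h; h = h->m_prev)
                fn(h->m_obj);
        }

    private:
        object* m_obj;
        ObjectHandle* m_prev;
        ObjectHandle* m_next;
        static ObjectHandle* s_tail;
    };

    ObjectHandle* ObjectHandle::s_tail = nullptr;

Note that this sketch assumes a single-threaded VM; the static tail pointer would need synchronisation otherwise.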
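
Finally, the class-lookup caching amounts to resolving each standard class once at startup instead of on every use. A hypothetical sketch, where lookupGlobal and the struct layout are assumptions rather than the real code:

    struct object; // the VM's heap object type (assumed)

    // One slot per class the VM touches on a hot path.
    struct CachedClasses {
        object* Array = nullptr;
        object* String = nullptr;
        object* SmallInteger = nullptr;
    };

    static CachedClasses g_classes;

    object* lookupGlobal(const char* name); // the existing slow lookup (assumed)

    void initClassCache() {
        g_classes.Array        = lookupGlobal("Array");
        g_classes.String       = lookupGlobal("String");
        g_classes.SmallInteger = lookupGlobal("SmallInteger");
        // The cached references must themselves be known to the collector
        // as roots (or the class objects must be otherwise immortal),
        // or a GC could invalidate them.
    }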

Overall, very positive, although this only exercises the Fibonacci benchmark from tinyBenchmarks. Next I need to enable the memory benchmarks and see how it fares in that area.

Paul
