Hadrian 0.2

As recently mentioned, I’ve been working on improving the performance of my parser library. Now that I’m done picking all the low-hanging fruit, performance-wise, I have decided to release the next version of hadrian, 0.2. This version contains a bunch of bugfixes and a few new features, but mainly it’s just a lot faster than 0.1.

Last time I wrote about it, throughput was around 1 MB of input per second, discounting I/O, which was already a considerable improvement over 0.1. Now it is around 1.5 MB per second, or around 1.2 MB per second on my machine if you include I/O overhead. That’s actually faster than ANTLR, at least according to the benchmarks I used for my thesis.

Of course it doesn’t matter how fast it is. Even though it really is the world’s most ass-kickingest parser, people are unlikely to use it until I write some more documentation. Bugger.

Cheated!

Lately I’ve been working on optimizing the parser engine in tedir. It’s been going pretty well — over the last few days I’ve been able to get a 40% performance improvement (discounting I/O overhead) which brings the parse time down to around 1 second per megabyte of input for a pretty messy grammar. It’s still not as fast as the implementation I wrote for my thesis (I got it up to around 4 MB/s) but still pretty good considering that this is a generalized parser, not LL, LALR(1) or whatever.

Most of the work during parsing happens in just a handful of methods in a single class, Runtime. Optimizing tight loops like this is a lot of fun because you can learn a lot about what goes on in the VM by seeing which optimizations work as expected and which ones don’t. Sometimes I’ve made refactorings that I thought should obviously improve performance but didn’t. For instance, inlining methods by hand often doesn’t help because the VM will do that for you. On the other hand, sometimes optimizations do work as expected and that’s what makes it all worthwhile. You probably have to have tried it to know what I’m talking about.

Today, I came across an interesting example of an optimization that I was sure would give a great improvement in performance but which turned out to do exactly the opposite.

Tedir uses BitSets a lot. In particular, the methods that take most of the time during parsing do. If you’ve ever looked at the implementation of BitSet you will have noticed that it does not appear to have been, shall we say, particularly heavily optimized. At least, I got the distinct feeling that this annoying class was wasting my precious processor cycles big time. Naturally, I thought: I’ll write my own BitSet, FastBitSet, that’ll kick BitSet’s ass! So I did: I wrote a very simple bit set without all the bounds checks, basically BitSet with half the code or so discarded. I replaced all uses of BitSets with FastBitSets and ran the benchmarks. The result was shocking: not only was the implementation not faster, it was slower by a factor of two! I tried out various changes to the implementation but nothing helped. I got suspicious.
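To give an idea of what I mean, here is a minimal sketch in the spirit of that FastBitSet. This is my own reconstruction, not the actual tedir code:

```java
// A bare-bones bit set: bit i lives in bit (i % 64) of word (i / 64)
// of a plain long[]. No bounds checks and no growing -- roughly
// "BitSet with half the code discarded".
class SimpleBitSet {
    private final long[] words;

    SimpleBitSet(int capacity) {
        words = new long[(capacity + 63) >> 6]; // round up to whole 64-bit words
    }

    void set(int index)    { words[index >> 6] |=  1L << (index & 63); }
    void clear(int index)  { words[index >> 6] &= ~(1L << (index & 63)); }
    boolean get(int index) {
        return (words[index >> 6] & (1L << (index & 63))) != 0;
    }
}
```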

My colleague Kasper suggested that I should go back to Java’s BitSet again and see how much time the profiler reported was spent executing methods in BitSet. Surprise: none! According to the profiler, no time at all was spent in methods in BitSet. I can only draw one conclusion from this: the VM is cheating — it must have special handling of the BitSet class. Incidentally, I work with a bunch of people who have previously worked on Sun’s Java VM, but none of them knew anything about this optimization; at least, they weren’t the ones who had implemented it. But I think the empirical evidence is pretty strong that the optimization takes place.

It’s hard to complain about an optimization that causes your code to become twice as fast without losing generality but still, I do feel kind of cheated.

Post-script: Until now my goal has been to post here roughly once a week. However, as long as the weather in Denmark is as good as it has been the last few weeks I won’t post as often. Sitting indoors in front of my computer just doesn’t have the same appeal in June as it does in January.

CbfgFpevcg

One of my favorite languages is PostScript. It’s not so much the language itself — it’s simple and general but it can be a bit of a hassle to use — but the idea of using a general programming language as a document format. I’ve been playing around a bit lately and have written two small PostScript programs that can transform PS documents: munch-rot13.ps and munch-nl.ps. The first script ROT13s the document (example) and the other simply swaps the letter N and the letter L (example). The N/L thing is sort of a tradition at daimi.

How do you “run” the programs? You simply concatenate the program with the document you want to transform. To apply ROT13 to document.ps you write

$ cat munch-rot13.ps document.ps > document-rot13.ps

You can apply the transformations as many times as you want, so given a document encoded with munch-rot13.ps you can decode it by just applying munch-rot13.ps again. And, incidentally, if you want to decode a document but don’t have munch-rot13.ps, don’t worry: since the document was encoded by appending it to the script, the encoded document actually carries the script around with it.

You can also mix the transformations: apply ROT13, then N/L and then ROT13 again. That causes A and Y to be switched in the document.
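To see why, note that ROT13 maps A to N and L to Y, so conjugating the N/L swap by ROT13 turns it into an A/Y swap. A quick Java sketch of my own (not the PostScript code) that checks this:

```java
// Applying rot13, then the N/L swap, then rot13 again: any A becomes
// N, is swapped to L, and comes back as Y -- and symmetrically Y
// becomes A. Every other letter is restored unchanged.
class MunchDemo {
    static String rot13(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') c = (char) ('a' + (c - 'a' + 13) % 26);
            else if (c >= 'A' && c <= 'Z') c = (char) ('A' + (c - 'A' + 13) % 26);
            out.append(c);
        }
        return out.toString();
    }

    static String swapNL(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case 'N': c = 'L'; break;
                case 'L': c = 'N'; break;
                case 'n': c = 'l'; break;
                case 'l': c = 'n'; break;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rot13(swapNL(rot13("LAYMAN")))); // prints LYAMYN
    }
}
```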

Pointless? Maybe, but I thought it was an interesting challenge. And, as Mayer Goldberg demonstrates, you can actually write useful programs in PostScript. Maybe I’ll try that some day. I’m not completely done with pointlessness yet, though: I’m considering writing a version that Enigmas the document.

Hadrian Demo

My pet project over the last few years has been a parser library for Java called hadrian. I was lucky enough to be able to write my master’s thesis about this project but that was neither the beginning nor the end of it. Since my thesis I’ve rewritten it completely and now we’re using the new library in the neptune compiler.

One thing I’ve neglected, though, is documentation. I’ll try to write a few posts about it here at some point but, to get started, I’ve thrown together a demo applet which demonstrates some aspects of the library. The demo allows you to write a grammar and some input; it then parses the input according to the grammar and shows the result as a syntax tree and as XML. It also contains a few examples to play around with.

In itself, this demonstrates two things. Firstly, that hadrian allows you to parse input according to a grammar that wasn’t generated at compile-time but constructed by the running program. Secondly, that the result of parsing the input is constructed by the framework — you don’t have to specify your own ASTs. That’s only a small part of what you can do but it’s pretty cool in itself. If I do say so myself. Which I do.

The code that does the actual work of the applet is pretty simple. Here it is (I’ll explain the numbered steps below):

  private void processInput(final String grammarString,
                            final String inputString) {
    /* 1 */ final ITextDocument grammarDoc = new StringDocument(grammarString);
    /* 2 */ final MessageMonitor monitor = new MessageMonitor();
    /* 3 */ final HadrianParser parser =
                HadrianReader.getDefault().readParser(grammarDoc, monitor);
    /* 4 */ if (processMessages(monitor)) return;
            final ITextDocument inputDoc = new StringDocument(inputString);
    /* 5 */ final INode node = parser.parse(inputDoc, monitor);
            if (processMessages(monitor)) return;
            final DefaultTreeModel treeModel =
                new DefaultTreeModel(SyntaxTreeNode.create(node, null));
    /* 6 */ syntaxTree.setModel(treeModel);
    /* 7 */ syntaxXML.setText(TextUtils.toXML(node));
  }

Here’s what’s going on in the code.

  1. The string containing the grammar is wrapped in a text document. A text document is pretty similar to a string except that it keeps track of where line breaks are and stuff like that. That can be very handy for instance when reporting errors in a document.
  2. We create a message monitor. A message monitor collects any messages that the system might generate while processing the grammar.
  3. We read the grammar and construct a parser. If this goes well, readParser returns a parser we can use to parse the input. If something goes wrong it reports an error to the message monitor which we…
  4. …check for and bail out if necessary.
  5. We wrap the string containing the input and parse it using the parser we just successfully constructed from the grammar.
  6. After checking that parsing went well we take the result, a syntax tree in the form of an INode, and wrap it in something that can be displayed in the UI.
  7. Finally, we also convert the syntax tree to XML and plant the result in the appropriate text box.

Grammars don’t have to be read from source files or strings; there’s also an API that allows programs to construct them directly. But it turns out to be pretty handy to be able to read them from source files.

If you want to take a look at how the applet is implemented, the source is available with the code in tedir-applet.jar. Hadrian itself lives at sourceforge, as a part of the tedir project.

Neptune

Neptune is a new programming language inspired mainly by smalltalk and C/C++ but also taking ideas from many other places. It is, in short, a dynamically typed object-oriented language with an optional static type system. It is similar in many ways to smalltalk but has a C-like syntax. The language was designed by Esmertec AG in Århus, Denmark. Esmertec Denmark is now closed but work has started on a new implementation at sourceforge.

Posts

I’ve written about it in these posts:

  • Types #2: more about the type system, including the void type and protocol literals.
  • Types: an overview of the static type system.
  • Selectors: about selector objects, a dynamic representation of method names, which is a very powerful abstraction.
  • Characters: about a neat way to specify character literals in neptune.
  • Why Neptune?: A post that tries to explain why we decided to switch from using smalltalk to designing our own language.
  • Traits: How traits work in neptune.
  • Exceptions: About neptune’s exception mechanism, which has some unusual features.
  • Using: How neptune’s using statement works.
  • Brevity: An example of how you could represent a simple concept in neptune, demonstrating various language features.
  • C# 3.0: A look at some new features in C# that look very similar to features in neptune.
  • Constructors: How constructors work.
  • Interpol: About neptune’s approach to string construction: string interpolation.
  • Structs and Memory: Describes a tool I’ve written to make it easier to work with external C structures from neptune. Describes neptune’s interface to external calls and external memory.

I’ll keep this list updated as I write new posts.

Selectors

One of the most important rules in software engineering is don’t repeat yourself. If you find yourself writing the same code, or almost the same code, in more than one place then that is a sign that your code smells. For instance, if you see this code

Node root = current_node;
while (root.get_parent() != null)
  root = root.get_parent();

near this code

Node topmost = left_leaf;
while (topmost.get_parent() != null)
  topmost = topmost.get_parent();

you should feel a strong urge to factor out the similarities:

Node find_root(Node start) {
  Node current = start;
  while (current.get_parent() != null)
    current = current.get_parent();
  return current;
}

and then just use that method:

Node root = find_root(current_node);
...
Node topmost = find_root(left_leaf);

Refactoring like this is something we do all the time: notice similarities in our code and factor them out.

But not all similarities are easy to factor out. In the example above it was easy: the same thing was done with two different objects, current_node and left_leaf. Factoring out the subject of an operation is usually easy: you just create a method or function that takes the subject as an argument. But consider these two code snippets:

Node root = current_node;
while (root.get_parent() != null)
  root = root.get_parent();

and

File root_directory = current_directory;
while (root_directory.get_enclosing_directory() != null)
  root_directory = root_directory.get_enclosing_directory();

These two pieces of code are almost identical in what they do but in this case they’re not only different in the object they operate on but also in which method is called. In most object-oriented languages you can’t “factor out” the name of a method, like get_parent or get_enclosing_directory in this example, so you can’t write a find_root method that can be used to replace both loops as we could in the previous example.
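For comparison, here is a sketch of the closest you get in plain Java: you can’t pass the method name itself, but you can pass the step as a function value. The names below are my own:

```java
import java.util.function.UnaryOperator;

// find_root with both the start object and the "parent" step factored
// out. The step is a function value, so the caller decides whether it
// means get_parent, get_enclosing_directory, or something else.
class FindRoot {
    static <T> T findRoot(T start, UnaryOperator<T> parent) {
        T current = start;
        while (parent.apply(current) != null)
            current = parent.apply(current);
        return current;
    }
}
```

A call site would pass a lambda or method reference, e.g. findRoot(current_node, Node::get_parent), assuming a Node class with that accessor.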

In neptune, on the other hand, there is a mechanism for abstracting over method names: selectors. A selector is an object that represents the name of a method. For instance, the name of the get_enclosing_directory method is written as ##get_enclosing_directory:0. The syntax of a selector, at least in the common case, is ## followed by the name of the method, colon, and the number of arguments expected by the method. Given a selector object you can invoke the corresponding method on an object using the perform syntax:

Selector sel = ##to_string:0;
Point p = new Point(3, 5);
String s = p.{sel}(); // = p.to_string()

The syntax recv.{expr}(args...) means “invoke the method specified by expr on recv with the specified arguments”. Using this, the loop example from before can be refactored into

find_root(var start, Selector method_name) {
  var current = start;
  while (current.{method_name}() != null)
    current = current.{method_name}();
  return current;
}

and then both call sites can use that method:

Node root = find_root(current_node, ##get_parent:0);
...
File root_directory = find_root(current_directory,
                                ##get_enclosing_directory:0);

Using selectors this way can sometimes be useful but code that is identical except for the name of a method is pretty rare, at least in my experience. But selectors can be used for many other things.

One of the most useful applications of selectors is delegates. A delegate is a selector coupled with an object. You can think of it as a delayed method call: you specify a particular method to call on a particular object but you don’t perform the call just yet.

Point p = new Point(3, 5);
Delegate del = new Delegate(p, ##to_string:0);
String s = del();

Here, we create an object, then we create a delegate which can be used to send to_string() to the object, and finally we invoke the delegate which causes to_string() to be called on the point. The syntax for invoking a delegate is the standard function call syntax: delegate(args...).
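As an aside, a sketch of my own rather than tedir code: the closest Java analogue to such a delegate is a bound method reference, which likewise pairs a receiver with a method and defers the call:

```java
import java.util.function.Supplier;

// A delegate pairs a receiver with a method but performs the call
// later. In Java, p::toText captures p now and calls toText on it
// only when get() is invoked.
class DelegateDemo {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        String toText() { return "(" + x + ", " + y + ")"; }
    }

    public static void main(String[] args) {
        Point p = new Point(3, 5);
        Supplier<String> del = p::toText; // like new Delegate(p, ##to_string:0)
        System.out.println(del.get());    // prints (3, 5)
    }
}
```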

The syntax new Delegate(...) is a bit cumbersome so there is also a binary operator, =>, that can be used to create delegates:

...
Delegate del = (##to_string:0 => p);
...

How are delegates useful? Well, the place where I’ve had most use for them is as event handlers. For instance, we have a rudimentary GUI toolkit based on Qt that uses delegates for all events:

void draw_controls(qt::Widget parent) {
  qt::Button ok_button = new qt::Button(parent);
  ok_button.add_on_click_listener(##ok_button_clicked:0 => this);
}

void ok_button_clicked() {
  System.out.println("Ok button clicked");
}

This code demonstrates how delegates can be used in a very light-weight mechanism for specifying event handlers, in this case causing the system to print a message on the console each time the button is clicked. And if we use accessor methods the code that sets the event handler can be made even more concise:

...
ok_button.on_click = (##ok_button_clicked:0 => this);
...

Another use of delegates is for spawning threads. Besides just invoking a delegate, you can also call the spawn method which invokes the delegate in a new thread:

void start_process() {
  Worklist list = new Worklist();
  (##produce:1 => this).spawn(list); // spawn producer
  (##consume:1 => this).spawn(list); // spawn consumer
}

void produce(Worklist list) {
  while (true) {
    var obj = produce();
    list.offer(obj);
  }
}

void consume(Worklist list) {
  while (true) {
    var obj = list.take();
    consume(obj);
  }
}

The start_process method starts two threads: one that adds objects to the worklist and one that consumes those objects, again in a very light-weight fashion, using delegates to invoke two local methods in separate threads. I think that’s pretty elegant!
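For readers more at home in Java, the same shape can be sketched with bound method references as the delegates and a BlockingQueue standing in for the Worklist. All names here are mine, not tedir’s:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer/consumer where this::produce and this::consume play the
// role of the (##produce:1 => this)-style delegates: each is a bound
// method reference handed to a fresh thread.
class SpawnDemo {
    static final int ITEMS = 5;
    final BlockingQueue<Integer> worklist = new ArrayBlockingQueue<>(2);
    int consumed = 0; // only read after join(), so safely published

    void produce() {
        try {
            for (int i = 0; i < ITEMS; i++) worklist.put(i); // blocks when full
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    void consume() {
        try {
            for (int i = 0; i < ITEMS; i++) { worklist.take(); consumed++; }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    void startProcess() throws InterruptedException {
        Thread producer = new Thread(this::produce); // spawn producer
        Thread consumer = new Thread(this::consume); // spawn consumer
        producer.start(); consumer.start();
        producer.join(); consumer.join();
    }
}
```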

Unlike languages like C#, delegates in neptune are not “magic”; they are implemented as pure neptune code that uses selectors and perform to do the actual delegation. While selectors might not look like that useful a construct they can be used to build some very powerful abstractions.

Prefix Keywords

Even though we’ve decided not to use smalltalk in OSVM it doesn’t mean that we haven’t been happy with smalltalk ourselves. You always want what you can’t have and if I wasn’t a smalltalk fan before, I was certainly made one by the decision that we were not going to use it anymore. Having said that, I think there are situations where smalltalk is less than perfect; in particular, I find some smalltalk code very hard to read. In this post I’ll discuss some examples of this and describe a neat solution (at least I think so) I once experimented with.

One thing that can make smalltalk code hard to read is the fact that almost everything is a message send, and that the receiver of a message send is always written first. For instance, in a send to ifTrue:ifFalse: you always write the condition first because that is the receiver of the send:

self size ~= other size ifTrue: [ ... ] ifFalse: [ ... ]

When you read this line you have to read past the self size ~= other size part before you see that what you’re reading is actually the condition of a conditional. In this case the line is short so it’s not a big problem, but it does cost me one or two extra brain cycles when reading the line. I would have been spared those if you could write the conditional so that it was immediately clear, when reading from left to right, that it was a conditional. For instance:

if: (self size ~= other size) ...

The problem becomes much worse when you consider the operations whose receiver is a block, for instance some repetition, thread and exception operations, because blocks can be very long. An example is this code snippet taken from Squeak (well, slightly changed):

[
[ | promise |
promise := Processor tallyCPUUsageFor: secs
every: msecs.
tally := promise value.
promise := nil.
self findThePig.
] repeat
] on: Exception do: [
^nil.
]

What’s happening here? The outer block is an exception handler, the body of which repeatedly performs some operation. When you read this code you have to read ahead several lines to understand what each block means. By the way, I have no idea what findThePig does.

Another example, also taken from Squeak, is this:

[
[
newMorph := NetworkTerminalMorph connectTo:
self ipAddress.
WorldState
addDeferredUIMessage:
[newMorph openInStyle: #scaled] fixTemps.
]
on: Error
do: [ :ex |
WorldState addDeferredUIMessage: [
self inform: 'No connection to: ',
self ipAddress,' (',ex printString,')'
] fixTemps
].
] fork

Here you not only have to read way ahead to see that the outer block forks a thread, but you have to look closely at the body of the inner block to spot the on:do: selector which doesn’t stand out at all but is very important in understanding the code. This is only two levels of nesting and if you add another level the code becomes even harder to read.

Back when I wrote a lot of smalltalk I started inserting comments to make code such as this easier to read. For instance, I would insert “do” before a loop and “try” before an exception handler:

"try" [
"do" [ | promise |
promise := Processor tallyCPUUsageFor: secs
every: msecs.
tally := promise value.
promise := nil.
self findThePig.
] repeat
] on: Exception do: [
^nil.
]

Here I don’t have to read ahead because I can see which kind of construct I’m dealing with even before I see the opening bracket. You still have to read to the end to understand exactly how the code is repeated and which exceptions are caught, but my experience is that this gives just enough context that you can read and understand the code in one go, something I couldn’t do with the previous version of the code, without the comments.

Back when we were discussing how to change the language, before we decided to go for a C/C++ style syntax, I suggested something called prefix keywords to make the language slightly more C-like and easier to read. It was never very popular but I still think it’s a pretty decent idea. The suggestion was to allow keyword sends that start with a keyword. You can still write ordinary keyword sends such as:

(...) ifTrue: [...] ifFalse: [...]

but you would also be allowed to write:

if: (...) then: [...] else: [...]

which means sending the if:then:else: message to the first expression, with the two blocks as arguments. You would declare the if:then:else: method on booleans like this (here on True):

if: self then: thenPart else: elsePart
^thenPart value

That way, instead of using comments to mark the beginning of constructs like above, you can use a prefix keyword:

try: [
do: [ | promise |
promise := Processor tallyCPUUsageFor: secs
every: msecs.
tally := promise value.
promise := nil.
self findThePig.
].
] on: Exception do: [
^nil.
]

There’s no problem in parsing it and the semantics is as simple as the three other kinds of sends. It also works pretty well for methods such as acquireDuring:. Consider

mutex acquireDuring: [
... critical section ...
]

compared with

acquire: mutex during: [
... critical section ...
]

I think it’s pretty neat. It does change the flavor of the code somewhat but I think it makes the code much easier to read.

Totten

I read the Middle East Journal, a blog written by freelance journalist Michael Totten. Recently, he asked his readers what we thought about him no longer going on paid assignments and instead writing exclusively for the readers of his blog, financed by our voluntary tips. The feedback in the comments was very positive and apparently so was the feedback in his tip jar. So now, if things work out, he’s off to Iran. And maybe later Afghanistan, or Syria, or North Korea.

It’s an interesting experiment — a journalist cutting out the middleman and writing directly for, and interacting with, a group of people. At first I was pretty excited about it but as I’ve thought a bit more about it I’m having some second thoughts. For one thing, some might argue that the middleman, the editor, is there for a reason. I would tend to agree. Also, these are dangerous places he’ll be going. Will I be (partially) responsible if something happens to him? I’m sure that the more remote and dangerous the places he writes about, the more people will read his blog and hit his tip jar. Will that put pressure on him to visit these places and put himself at even greater risk?

On the other hand there’s the distinct possibility that I shouldn’t think so much and just enjoy the blog.

Entertainment

Where does the weirdest Wikipedia content go when it dies? Why, to BJAODN, Bad Jokes and Other Deleted Nonsense.

From Coleoptera

The most musical of the Insecta, Coleoptera are known for their tight, 4-part harmonizing and catchy melodies.

From Malaga (province)

The Sun Coast (Costa del Sol) is a concrete monster that swallows, burns, and spits back millions of happy European tourists.

From Alternative rock

Alternative rock is the name given to one stone when you’re looking at another stone. The term was coined by photographer Edwin Blastocyst when looking at one stone and speaking about another, oddly enough.

    The quote from Edwin Blastocyst needs to be verified.

In other (old) news: They’re made out of meat is pure genius, and now a film has been made based on the original short story.

Characters

Quick: which character is '\u03b5'?

Let’s see, well, since it’s between 0370 and 03ff it’s clearly Coptic or Greek. The Greek lower-case characters start at 03b1 so it would be the lower case form of the fifth Greek character. That would be, let’s see, ε! Man, that Unicode stuff is just so logical!

I don’t think there’s anything wrong with Unicode. But as soon as people have to use the character codes directly, for instance when using Unicode character constants in languages like Java and C#, there’s trouble. There are tens of thousands of Unicode code points and I can’t even remember which ASCII character is line feed and which one is carriage return. On the rare occasions where I either read or write code that uses Unicode constants or ASCII control characters, I usually have to open a browser and look up the values myself. That sucks.

The fortress language improves on things: there, instead of writing the code of a character in an identifier, you can write the name. It just seems so obvious really: the Unicode standard defines a name for every character, so why should I go to the trouble of looking up the character code?
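Today’s Java, as it happens, exposes those names at runtime through Character.getName, which also lets you check the epsilon arithmetic from the top of this post:

```java
// Character.getName returns the official Unicode name of a code point.
class CharNames {
    public static void main(String[] args) {
        System.out.println(Character.getName(0x03B5)); // GREEK SMALL LETTER EPSILON
        // Greek lower case starts at 0x03B1 (alpha), so the fifth
        // letter is 0x03B1 + 4 = 0x03B5, epsilon.
        System.out.println((char) (0x03B1 + 4) == '\u03b5'); // prints true
    }
}
```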

In neptune you’re welcome to still use character codes to specify Unicode characters. Writing '\u03b5' means greek small letter epsilon just as it does in Java and C#. But inspired by fortress we’ve added another syntax for specifying Unicode characters by name: writing \<name> specifies the Unicode character called name. So instead of writing '\u03b5' you can simply name the character: '\<greek small letter epsilon>'. If you feel that 'x' is too straightforward you can specify it equivalently as '\<latin small letter x>'. This works both in character literals and text strings:

"From \<greek capital letter alpha> to \<greek capital letter omega>"

which means the same as, but is a lot easier to understand than

"From \u0391 to \u03A9"

Besides Unicode names, we also allow the ASCII control characters to be specified by their short and long names. This means that I don’t have to remember that line feed is 10, not 13; instead I can just write '\<lf>' or '\<line feed>'.

The first time you want to write the character ε you probably still have to look it up to see that it’s called greek small letter epsilon. But there’s a lot more logic to the name than to the raw character code and there’s a better chance you’ll remember next time. And it will of course be obvious to anyone reading the code which character it is. The only problem I see is that the names tend to be very long. Fortress allows you to use shorter forms of some character names: you can for instance leave out words like “letter”. If the length of the names turns out to be a problem we might add something like that at some point.

Either way, I think this is a pretty nifty feature. And I wouldn’t be surprised if it turns out that there are other languages that have similar mechanisms.