Forth and the Furious

In which we go faster

This is a bit of an interlude. At this point, I’ve built a serviceable computing engine that’s not quite a Forth implementation but is capable of any computation. We can now look at refactoring parts of it to improve performance, while keeping an eye on the future.

In preparation for this, I’ve created a program that runs the Fibonacci calculator and records its running time. I have also added a counter to the Forth machine to count how many words it executes. Here we go.

import ArgumentParser
import Dispatch

struct JForth: ParsableCommand
{

	private func timedRun(_ block: () throws -> ()) rethrows -> Double
	{
		let start = DispatchTime.now()
		try block()
		let end = DispatchTime.now()
		let nanoseconds = end.uptimeNanoseconds - start.uptimeNanoseconds
		return Double(nanoseconds) / 1e9
	}

	let program =
	"""
	: fibo dup 3 branch? 15 branch
		dup 1 = 10 branch?
		dup 1 - fibo swap 2 - fibo + ;
	"""

	func run() throws
	{
		do
		{
			var output: [String] = []
			var forth = MyForth()
			try forth.interpret(input: program, output: &output)
			let time = try timedRun
			{
				try forth.interpret(input: "26 fibo .", output: &output)
			}
			let mips = (Double(forth.wordsExecuted) / time) / 1_000_000
			log.info("Time of run: \(time)s, words executed: \(forth.wordsExecuted), mips: \(mips)")
			log.info("\(output)")
		}
		catch
		{
			log.error("\(error)")
		}
	}
}

JForth.main()

ParsableCommand is a protocol from Apple’s Argument Parser library. It makes it pretty easy to add command line arguments and usage information.

The output from this program on my laptop looks something like:

Time of run: 3.678328657s, words executed: 3189047, mips: 0.8669826155776271
["121393"]

I’ve done several runs and they all produced numbers in the range of ~0.9 mips. This gives us a baseline for seeing how our modifications affect performance. I could also profile the code, but there is plenty of low hanging fruit before we get there. The code at this point is tagged blog-1021-1.

Compiled Words

The first point is that words are stored in definitions as strings. This means that every time a definition is executed, each word in it has to be looked up in the dictionary again. An obvious optimisation is to do the lookups once, at compile time, and replace each string with its definition. Our main interpreter loop will then iterate through a bidirectional collection of WordDefinition, except that I will be renaming WordDefinition to CompiledWord because the old name was confusing me.
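As a sketch of the idea (the type and function names here are my illustration, not necessarily what is in the repo), compilation resolves each token against the dictionary exactly once:

```swift
// Illustrative sketch only: the real CompiledWord type may differ.
enum CompileError: Error { case unknownWord(String) }

indirect enum CompiledWord
{
	case pushLiteral(Int)           // a literal, pushed when executed
	case definition([CompiledWord]) // a previously compiled user word
}

// Resolve every token at compile time, instead of looking it up
// in the dictionary on every execution.
func compile(tokens: [String], dictionary: [String: CompiledWord]) throws -> [CompiledWord]
{
	try tokens.map
	{
		(token: String) -> CompiledWord in
		if let word = dictionary[token] { return word }
		if let value = Int(token) { return .pushLiteral(value) }
		throw CompileError.unknownWord(token)
	}
}
```

The interpreter loop then walks an array of CompiledWord values and never touches the dictionary while running.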

The WordList type must be a reference to a list of compiled words. Why? Because I need to keep recursion working: the definition of a word such as fibo needs to be available while I am still compiling it, and if WordList were a value type, a half-completed copy of the definition would get baked in. So, WordList will be a class. There are some problems with this choice:

  • performance may suffer with retains and releases
  • a recursive definition is a circular reference.

I’m going to ignore the former problem for now and hope it’s not too serious; if profiling shows it is an issue, I will review the choice. I will also ignore the latter problem for now. It only becomes a problem when words get redefined or I implement a way to delete them. In any event, replacing the deleted word’s list with a single word that (for example) throws an exception will both break the circular reference and provide an automatic way to handle word definitions that reference other deleted or changed word definitions.
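A minimal sketch of why reference semantics matter here (the names WordList, Word and so on are my stand-ins, not the repo’s types): the WordList object for fibo goes into the dictionary before its body is compiled, so the recursive reference resolves to the very object being filled in, and “deleting” a word just swaps its contents for a trap.

```swift
// Illustrative sketch: WordList must be a class so a definition can
// refer to itself while it is still being compiled.
final class WordList
{
	var words: [Word] = []
}

enum Word
{
	case builtin(String)   // stand-in for a primitive operation
	case call(WordList)    // a resolved reference to another definition
	case trap(String)      // throws when executed; used for deleted words
}

let fibo = WordList()      // registered in the dictionary first...
fibo.words = [.builtin("dup"), .call(fibo), .builtin("+")]  // ...so the recursive call resolves

// "Deleting" fibo: replacing the body with a trap both breaks the
// retain cycle and makes stale references fail loudly.
fibo.words = [.trap("fibo was deleted")]
```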

Time passes

Implementing the above results in a 100% increase in mips. I get the following output now:

Time of run: 2.550830144s, words executed: 4685362, mips: 1.836798898986196
["121393"]

You’ll notice that the number of words executed has increased. This is because pushes of literal values onto the stack now count as executing words. In overall run time this is about a 30% decrease.

The code for this change is tagged blog-1021-2.

Cache the Stack Top

Suppose we maintain the top of the stack in a local variable. That might save something. For example, + currently looks like this:

dataStack.push(try dataStack.throwingPop() &+ dataStack.throwingPop())

If the top of the stack is in a local variable, it would look like this:

stackTop &+= try dataStack.throwingPop()

This saves a push and a pop in the short term and over several instructions might be a time saver, especially if Swift puts the local variable in a machine register.

In reality, stackTop could be empty, so I use an Optional. + really looks like this:

let op1 = try stackTop ?? dataStack.throwingPop()
stackTop = try op1 &+ dataStack.throwingPop()

At the end of running the interpreter loop, stackTop is flushed onto the data stack; otherwise it would get discarded. I also flush it before recursively calling interpret(words:output:). If you don’t do this, the stack gets out of sequence.
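Put together, the caching and flushing might look like this sketch (assuming a plain Int stack; the real MyForth code differs in the details):

```swift
// Simplified sketch of the cached-top-of-stack technique.
struct Machine
{
	enum Error: Swift.Error { case stackUnderflow }

	var dataStack: [Int] = []
	var stackTop: Int? = nil   // cached top of the data stack

	mutating func throwingPop() throws -> Int
	{
		guard let value = dataStack.popLast() else { throw Error.stackUnderflow }
		return value
	}

	// + with the cache: one push and one pop fewer than the naive version.
	mutating func add() throws
	{
		let op1 = try stackTop ?? throwingPop()
		stackTop = try op1 &+ throwingPop()
	}

	// Flush the cache back to the real stack, e.g. at the end of the
	// interpreter loop or before a recursive interpret call.
	mutating func flush()
	{
		if let top = stackTop
		{
			dataStack.append(top)
			stackTop = nil
		}
	}
}
```

After flush(), the data stack is complete again, so a recursive interpret call sees a consistent stack.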

Time of run: 1.885700435s, words executed: 4685362, mips: 2.484679916828889
["121393"]

That’s encouraging. It’s another 35% increase in mips and another 26% decrease in execution time. Overall, I’ve doubled the speed with just these two optimisations.

The code is tagged blog-1021-3.

Addendum

In the final code above, there was one problem: the stack could get out of sequence if an exception was thrown. Take the following:

case .divMod:
	let denominator = try stackTop ?? dataStack.throwingPop()
	let numerator = try dataStack.throwingPop()
	guard denominator != 0 else { throw Error.divisionByZero }
	let (q, r) = numerator.quotientAndRemainder(dividingBy: denominator)
	dataStack.push(r)
	stackTop = q

If the stack is 1 2 0 before the function and the top item is in stackTop, then denominator becomes 0, numerator becomes 2 and an error is thrown. The function passes the error up the stack, but there is a defer that runs as the function exits its scope and flushes stackTop to the data stack. So we get 1 0 on the stack: it looks like we’ve removed the second item from the stack and left the first.

In most cases, this is not really a problem. In the case of divMod, I could move the test for division by zero to just after getting the denominator, or I could set stackTop to nil as soon as I have the denominator. In the end, I hit on a nice generic solution: I added a function to Optional that returns its value and sets it to nil at the same time.

fileprivate extension Optional where Wrapped == MyForth.Data
{
	mutating func pop() -> MyForth.Data?
	{
		let ret = self
		self = nil
		return ret
	}
}

Every time I want to reference stackTop in a way that logically pops it off the top of the stack, I use this function instead, e.g.

let denominator = try stackTop.pop() ?? dataStack.throwingPop()
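To see the fix working end to end, here is a self-contained sketch of the 1 2 0 scenario (simplified types; ForthError and Machine are my stand-ins). Because pop() clears stackTop before any error can be thrown, the exit-time flush has nothing stale to push:

```swift
enum ForthError: Error { case divisionByZero, stackUnderflow }

extension Optional where Wrapped == Int
{
	// Return the cached value and clear it in one step.
	mutating func pop() -> Int?
	{
		let ret = self
		self = nil
		return ret
	}
}

struct Machine
{
	var dataStack: [Int] = [1, 2]
	var stackTop: Int? = 0     // the logical stack is 1 2 0

	mutating func throwingPop() throws -> Int
	{
		guard let value = dataStack.popLast() else { throw ForthError.stackUnderflow }
		return value
	}

	mutating func divMod() throws
	{
		// Flush the cache as the function exits, errors included.
		defer { if let top = stackTop.pop() { dataStack.append(top) } }
		let denominator = try stackTop.pop() ?? throwingPop()
		let numerator = try throwingPop()
		guard denominator != 0 else { throw ForthError.divisionByZero }
		let (q, r) = numerator.quotientAndRemainder(dividingBy: denominator)
		dataStack.append(r)
		stackTop = q
	}
}

var machine = Machine()
do { try machine.divMod() } catch { /* divisionByZero */ }
// Both operands were consumed before the throw, and the stale 0 is gone:
// the stack is left as 1, not the misleading 1 0.
```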

Here’s a typical timed run.

Time of run: 1.814111424s, words executed: 4685362, mips: 2.5827311035113136
["121393"]

If anything, it is actually slightly faster than before. I assume this is because the function is inlined by the compiler and discarding the value of stackTop allows other optimisations elsewhere.

The code is tagged blog-1021-4.
