Whilst I was writing my post about instruction decoding, a thought struck me about improving the structure of decoding. In the existing decoder, I do four things:
- Get the displacement, if needed
- Get the source operand, if needed
- Do the operation
- Save the result to the destination, if needed
This works fine, but the “do the operation” part is a bit ugly because an operation frequently has two or even more “sources”. For example, the 8-bit add uses the A register as an implicit source, so the operation needs to sneakily fetch the accumulator as well as the operand. Similarly, some operations might have an extra implicit destination. For example, most control transfers (jumps, returns) can be implemented as simple loads with a destination of the program counter. The call instruction is an exception because it also has to save the existing program counter on the stack. In fact, it has two sources too, because it needs to get both the target address and the existing program counter.
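To make the problem concrete, here is a minimal sketch of the four-step structure with an 8-bit add. It is not the emulator's actual code: `FourStepCPU`, `Operand`, `Operation` and `execute` are hypothetical names invented for illustration, and the displacement step is left out because none of these operands needs one.

```swift
// Hypothetical miniature decoder; all names are assumptions for illustration.
enum Operand {
    case unused
    case registerA
    case registerB
    case immediate(UInt8)
}

enum Operation {
    case load   // result = source
    case add8   // result = A + source, with A as an implicit second source
}

struct FourStepCPU {
    var a: UInt8 = 0
    var b: UInt8 = 0

    mutating func execute(_ operation: Operation, source: Operand, destination: Operand) {
        // Get the source operand, if needed.
        var source8: UInt8 = 0
        switch source {
        case .unused: break
        case .registerA: source8 = a
        case .registerB: source8 = b
        case .immediate(let value): source8 = value
        }

        // Do the operation.
        let dest8: UInt8
        switch operation {
        case .load:
            dest8 = source8
        case .add8:
            // The "sneaky" fetch: the operation reads the accumulator itself,
            // bypassing the operand-fetch step above. &+ wraps on overflow.
            dest8 = a &+ source8
        }

        // Save the result to the destination, if needed.
        switch destination {
        case .registerA: a = dest8
        case .registerB: b = dest8
        case .unused, .immediate: break
        }
    }
}
```

With `a = 0x10` and `b = 0x22`, calling `execute(.add8, source: .registerB, destination: .registerA)` leaves `0x32` in `a`, but only because the `add8` case quietly reads a register that the decoder never declared as a source.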
So, I thought, why not have two sources and two destinations?
- Get the displacement, if needed
- Get source operand 1, if needed
- Get source operand 2, if needed
- Do the operation
- Save result 1 to destination 1, if needed
- Save result 2 to destination 2, if needed
An obvious downside is that I either need to duplicate the operand calculations (steps 2 and 3 are identical except for the variable used, as are steps 5 and 6) or modularise them (i.e. put them in a function). So, in order to do this, I need to accept a performance hit.
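As a rough sketch of the modularised version, the following puts the operand fetch and the write-back into functions and uses them twice each. Everything here is assumed for illustration: `TwoByTwoCPU`, `Location`, `read`, `write` and `transfer` are hypothetical names, and the stack is modelled as a plain Swift array rather than bytes in emulated memory.

```swift
// Hypothetical two-source, two-destination transfer; names are assumptions.
enum Location {
    case unused
    case immediate(UInt16)
    case programCounter
    case stackPush          // only meaningful as a destination
}

struct TwoByTwoCPU {
    var pc: UInt16 = 0
    var stack: [UInt16] = []    // simplified stand-in for the real stack

    // Shared by both source steps.
    func read(_ location: Location) -> UInt16 {
        switch location {
        case .unused, .stackPush: return 0
        case .immediate(let value): return value
        case .programCounter: return pc
        }
    }

    // Shared by both destination steps.
    mutating func write(_ value: UInt16, to location: Location) {
        switch location {
        case .unused, .immediate: break
        case .programCounter: pc = value
        case .stackPush: stack.append(value)
        }
    }

    // A control transfer passes each source straight through to its destination.
    mutating func transfer(source1: Location, source2: Location,
                           destination1: Location, destination2: Location) {
        // Both sources are read before anything is written, so the old
        // program counter is captured before it is overwritten.
        let result1 = read(source1)
        let result2 = read(source2)
        write(result1, to: destination1)
        write(result2, to: destination2)
    }
}
```

In this model a jump is `transfer(source1: .immediate(target), source2: .unused, destination1: .programCounter, destination2: .unused)`, and a call is the same thing with `source2: .programCounter` and `destination2: .stackPush`, so the call no longer needs special treatment inside the operation itself.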
Another downside would be a proliferation of local variables. At the moment, I have four local variables: source8, source16, dest8 and dest16, so that each can be of the correct type for the width of the data. I would need to double these up to eight, at which point we are definitely running out of registers. I did consider using an array, but I think the array access would have an unacceptable performance impact.
I decided to go ahead anyway. The code is on a new branch so I can throw it away if it turns out to be a bust. In order to try to control the proliferation of local variables, I decided to start by coalescing the different width source and destination variables. After all, an eight-bit quantity fits into sixteen bits. The only problem is the casting needed to squeeze it back into an eight-bit register at the end, but that is localised to one place (two places in future). Also, given that I would be doing some casting anyway for 8-bit destinations, I decided to be consistent and make the source and destination variables UInt variables so that they are at the computer’s natural integer width.
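The idea can be sketched as follows, assuming a hypothetical `WideCPU` type with one 8-bit and one 16-bit register (none of these names come from the real emulator). The working values are all `UInt`, and the only narrowing casts live in the write-back function.

```swift
// Hypothetical sketch: UInt working values, truncated only on write-back.
struct WideCPU {
    var a: UInt8 = 0
    var hl: UInt16 = 0

    enum Destination {
        case a
        case hl
    }

    // The single place where a wide result is squeezed back into a register.
    mutating func store(_ result: UInt, to destination: Destination) {
        switch destination {
        case .a: a = UInt8(truncatingIfNeeded: result)
        case .hl: hl = UInt16(truncatingIfNeeded: result)
        }
    }

    mutating func add8(_ operand: UInt8) {
        let source = UInt(operand)
        // Two 8-bit values cannot overflow a UInt, so plain + is safe here.
        let result = UInt(a) + source
        store(result, to: .a)
    }

    mutating func add16(_ operand: UInt16) {
        let source = UInt(operand)
        let result = UInt(hl) + source
        store(result, to: .hl)
    }
}
```

The same `source` and `result` variables serve both widths, which is what removes the need for separate 8-bit and 16-bit locals. `UInt8(truncatingIfNeeded:)` is used because the plain `UInt8(_:)` initialiser traps at runtime when the value does not fit.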
Having completed the first part (well, not quite: I’ve introduced some errors in the ZEXALL test that need to be fixed), I was astonished to find that the ZEXALL test runs at a nominal 360 MHz, as compared to 217 MHz previously. That’s a 66% improvement. This means that, even with the slowdowns I anticipate later, I can probably still expect a net gain in speed.
Furthermore, it’s leading me to consider making the registers all UInts too, in the same way as Andre Weissflog did.
Even furthermore, I think I can apply the same lesson to my 6502 emulation. It’s running at around 90 MHz at the moment, although it is embedded in a PET emulation with a lot of memory-mapped devices.