Metal And Emulation

I thought it was time to learn how to program the GPU. On macOS, that means Metal. I started by following this tutorial. Once I got to about tutorial number 11, I started wondering if I could apply Metal to my emulators. I thought it would be cool if all of the graphics could be done on the GPU instead of in the main thread of the Cocoa application.

Here is what I did. Note that I’m a beginner at this, so much of the code could surely be improved. Also, some of my explanation might be wrong (although the code does display the PET screen properly), so please bear with me. The source code for the emulator is in my M6502 Bitbucket repository.

The basic principle of Metal is to draw into a texture using a pipeline. There are two sorts of pipeline (that I know about): the “render” pipeline and the “compute” pipeline. The render pipeline seems to be for drawing 3D objects. The compute pipeline is for general computing tasks, but it also works for drawing pixel-oriented images onto a texture. This, to me, seemed to be the one to use.

The Commodore PET 3032 had a 40 x 25 character display. The display was not bitmapped, but used electronics to convert the screen RAM, which was effectively an array of character codes, into the raster image on the display. There were two character sets, one with upper case and graphics characters and another with lower case and upper case (and a limited set of graphics characters). The character set to use was selected by toggling the CA2 line on the VIA chip. The CA2 line was controlled in software by toggling bit 1 of the VIA’s PCR register at location $E84C or 59,468. Even after more than 30 years, I still know the commands POKE 59468,12 and POKE 59468,14 off by heart.
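As a small illustration of the bit arithmetic behind those two POKEs, here is a sketch in Swift. The names PETCharacterSet and characterSet(forPCR:) are made up for this example and are not from the emulator's source. It assumes only what is described above (bit 1 of the PCR value selects the set), plus my recollection that 12 gives upper case with graphics and 14 gives lower case.

```swift
// Hypothetical names for illustration; not taken from the emulator source.
enum PETCharacterSet {
    case upperCaseAndGraphics   // selected by POKE 59468,12
    case lowerAndUpperCase      // selected by POKE 59468,14
}

/// Decides which character set is active from the value written to the
/// VIA's PCR at $E84C, looking only at bit 1 (mask 0b0000_0010).
func characterSet(forPCR value: UInt8) -> PETCharacterSet {
    return (value & 0b0000_0010) != 0 ? .lowerAndUpperCase : .upperCaseAndGraphics
}

let pcrAddress = 0xE84C   // the same location as 59468 in decimal
```

The two values differ only in bit 1: 12 is 0b1100 and 14 is 0b1110.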

The original code in VDUView.swift worked by blitting an NSImage containing the character’s bitmap onto the view in the right place using Cocoa technology. In fact, it calculated the lines that needed to be redrawn from the dirty rect and asked its controller for each line in turn and then used its delegate to draw the line. The delegate was responsible for rendering the image for each character and blitting it into the view.

The code that actually paints the texture is written in the Metal language, which is based on C++ and it is compiled for and runs on the GPU. Here is the code for my PET screen render.

#include <metal_stdlib>
using namespace metal;

struct Geometry
{
    uint2 charDims;
    uint2 screenDims;
};

constant struct Geometry pet40Geometry = { uint2(8, 8), uint2(40, 25) };

kernel void petVDURender(texture2d<float, access::write> output [[texture(0)]],
                         constant uint8_t *characterMaps [[buffer(1)]],
                         constant uint8_t *screenBuffer [[buffer(2)]],
                         uint2 gid [[thread_position_in_grid]])
{
    uint2 screenDims = uint2(output.get_width(), output.get_height());
    uint2 petPixelDims = pet40Geometry.charDims * pet40Geometry.screenDims;
    uint2 petPixel = gid * petPixelDims / screenDims;

    // Calculate the index of the character
    uint2 charPos = petPixel / pet40Geometry.charDims;
    uint screenBufferIndex = charPos.x + charPos.y * pet40Geometry.screenDims.x;
    uint charStartIndex = screenBuffer[screenBufferIndex] * pet40Geometry.charDims.y;
    // charStartIndex now points at the byte of the first line of the character
    uint2 bitIndexes = petPixel % pet40Geometry.charDims;
    bool isGreen = (characterMaps[charStartIndex + bitIndexes.y] & (0x80 >> bitIndexes.x)) != 0;

    float3 colour = isGreen ? float3(0, 1, 0) : float3(0);
    output.write(float4(colour, 1), gid);
}

The function is called petVDURender. The keyword kernel lets Metal know it is a function for a compute pipeline. It has four parameters:

  • output a two dimensional texture into which to write the colour data
  • characterMaps the 8 x 8 bitmaps for all 256 PET characters
  • screenBuffer the sequence of bytes in the PET’s screen memory
  • gid the thread’s position in the grid. This is effectively the coordinates of the pixel – more on threads later.

The bits surrounded by doubled square brackets [[...]] bind the function arguments to variables or buffers that the GPU knows about. So output is bound to the texture the GPU is drawing in; characterMaps is bound to a buffer that is provided as an input to the pipeline; screenBuffer is bound to a buffer that contains the bytes of the PET screen memory; gid is bound to a vector representing the position in the thread grid, which is effectively the position of the pixel on the screen, although I guess it doesn’t have to be.

The code first calculates which PET pixel the gid is in on the assumption that the PET screen is 320 x 200. It then figures out which character that refers to given that the PET screen is 40 x 25 characters. It uses that value to index into the screen memory to find out what byte is at that position on the PET screen.

The character map is supplied as an array of eight-byte sequences with each byte representing one line of the character’s glyph. Given the PET pixel position, it is a fairly simple matter to find the right line in the character and then find the right bit of the line. If it is 1, the colour is green; if it is 0, the colour is black.
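To check the arithmetic by hand, the same lookup can be mirrored on the CPU. This is only a sketch in Swift: the function name isLit and the test data are my own inventions for illustration, and the real work is done by the Metal kernel above.

```swift
// A CPU-side mirror of the kernel's arithmetic, for checking values by hand.
// The function name and the sample data are illustrative assumptions.
let charDims = (x: 8, y: 8)        // pixels per character cell
let screenDims = (x: 40, y: 25)    // characters per screen

/// Returns true if the output pixel at (gidX, gidY) should be green.
func isLit(gidX: Int, gidY: Int,
           outputWidth: Int, outputHeight: Int,
           characterMaps: [UInt8], screenBuffer: [UInt8]) -> Bool {
    // Scale the output pixel down to the 320 x 200 PET pixel grid.
    let petX = gidX * (charDims.x * screenDims.x) / outputWidth
    let petY = gidY * (charDims.y * screenDims.y) / outputHeight

    // Which character cell is that, and which byte of screen memory?
    let screenIndex = (petX / charDims.x) + (petY / charDims.y) * screenDims.x
    let charStart = Int(screenBuffer[screenIndex]) * charDims.y

    // Pick the glyph row, then test the bit; bit 7 is the leftmost pixel.
    let row = characterMaps[charStart + petY % charDims.y]
    let mask = UInt8(0x80) >> (petX % charDims.x)
    return (row & mask) != 0
}

// Sample data: character 1 has only its top-left pixel set, and it sits
// in the top-left cell of an otherwise empty screen.
var sampleMaps = [UInt8](repeating: 0, count: 256 * 8)
sampleMaps[1 * 8] = 0b1000_0000
var sampleScreen = [UInt8](repeating: 0, count: 40 * 25)
sampleScreen[0] = 1
```

With a 640 x 400 output texture, each PET pixel covers a 2 x 2 block of output pixels, so gid (0, 0) and gid (1, 1) both land on the same PET pixel.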

So that is the shader compute function. It runs for every single pixel on the screen, which might seem a lot. But this is the GPU and we can do lots of things at once. For example, my MacBook Pro has an AMD Radeon Pro 560X in it. This processor has 1,024 shader cores, which (I think) means 1,024 parallel threads, in theory.

The VDU view has to change because it renders itself in a fundamentally different way. When it needs updating, instead of redrawing the dirty rect using Cocoa functions, it has to rerun the compute pipeline. Our compute function has two input buffers, so each time it runs the pipeline it has to provide those two buffers.

The first step is to initialise the compute pipeline. Here is the code from init(frame:device:). Note that the view inherits from MTKView, not NSView.

    override init(frame frameRect: CGRect, device: MTLDevice?)
    {
        let notNilDevice = device ?? MTLCreateSystemDefaultDevice()!
        do
        {
            guard let commandQueue = notNilDevice.makeCommandQueue()
                else { throw MetalVDUView.Error.unexpectedNil("command queue") }
            self.commandQueue = commandQueue
            computePipelineState = try MetalVDUView.registerShaders(device: notNilDevice)
        }
        catch
        {
            fatalError("\(error)")
        }
        super.init(frame: frameRect, device: notNilDevice)
        framebufferOnly = false
    }

    static func registerShaders(device: MTLDevice) throws -> MTLComputePipelineState
    {
        let library = device.makeDefaultLibrary()!
        guard let kernel = library.makeFunction(name: "petVDURender")
            else { throw MetalVDUView.Error.unexpectedNil("kernel") }

        return try device.makeComputePipelineState(function: kernel)
    }

The first thing we do is make sure we have a Metal device. This is an abstraction of the GPU and it should be passed in to the initialiser as a parameter. If it isn’t, we use a Metal API call to get one.

We then create a command queue to send commands to the device.

Finally, we create the pipeline state. The pipeline state tells us which shader function to use (the one discussed above). The shader function has previously been compiled with the Metal compiler and included as a resource in the application.

The line framebufferOnly = false is necessary because, by default, MTKView assumes that its drawable textures are only ever used as render targets and optimises itself accordingly. Unless you turn that assumption off, it won’t let a compute pipeline write into the drawable’s texture.

When it’s time to draw something, we need to assemble a command for the GPU that tells it: what to do i.e. run the pipeline state we created with the compute function in it; what to draw into; what the two buffers for the character set and the screen are. Below is the main code in the draw(_ dirtyRect:) function, with commentary.

            guard let commandBuffer = commandQueue.makeCommandBuffer()
                else { throw MetalVDUView.Error.unexpectedNil("command buffer") }
            guard let commandEncoder = commandBuffer.makeComputeCommandEncoder()
                else { throw MetalVDUView.Error.unexpectedNil("command encoder") }

We have set up a command buffer for commands to send to the GPU and created a command encoder for our particular command.

            commandEncoder.setComputePipelineState(computePipelineState)
            commandEncoder.setTexture(drawable.texture, index: 0)

We have set the pipeline state for this command and the texture to draw into. This texture will be passed as the first parameter of our shader function. Note that drawable is an unwrapped version of currentDrawable, which is a variable maintained by MTKView for our convenience. I think it changes every time you render something into it.

            if characterSetbuffer == nil
            {
                let characterBitmap = charDelegate.characterSetBitmap
                assert(characterBitmap.count == 256 * 8, "Don't think our character map is big enough")

                // One byte per glyph line, so the element type is UInt8
                let bufferLength = characterBitmap.count * MemoryLayout<UInt8>.size

                guard let newBuffer = device.makeBuffer(length: bufferLength,
                                                        options: [])
                    else { throw MetalVDUView.Error.unexpectedNil("character set buffer") }
                memcpy(newBuffer.contents(), characterBitmap, bufferLength)
                characterSetbuffer = newBuffer
            }
            commandEncoder.setBuffer(characterSetbuffer, offset: 0, index: 1)

The above code sets up the character bitmap buffer. The character bitmap is supplied by the charDelegate, which has a new function for this purpose. Of the two classes that conform to this protocol, only PETChars works. The NativeChars class will throw a fatal error.

The character bitmap is created on demand and cached until the PET changes its character set from upper case to lower case or vice versa.
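The create-on-demand-and-invalidate pattern can be sketched on its own, away from Metal. CharacterSetCache and its build closure are assumed names standing in for the view's characterSetbuffer and the delegate's bitmap; the emulator itself simply sets the buffer back to nil.

```swift
// Illustrative only: `CharacterSetCache` and `build` are assumed names,
// standing in for the view's characterSetbuffer and the delegate's bitmap.
final class CharacterSetCache {
    private var cached: [UInt8]?
    private let build: () -> [UInt8]
    private(set) var buildCount = 0     // only here to make the behaviour visible

    init(build: @escaping () -> [UInt8]) { self.build = build }

    /// Returns the cached bitmap, building it on first use.
    func bitmap() -> [UInt8] {
        if let cached = cached { return cached }
        let fresh = build()
        buildCount += 1
        cached = fresh
        return fresh
    }

    /// Call when the PET switches character set; the next bitmap() rebuilds.
    func invalidate() { cached = nil }
}
```

The point of the cache is that the 2 KB character map is uploaded only when it actually changes, not on every frame.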

            if screenBuffer == nil
            {
                let bufferLength = geometry.height * geometry.width * MemoryLayout<UInt8>.size
                guard let newBuffer = device.makeBuffer(length: bufferLength, options: [])
                    else { throw MetalVDUView.Error.unexpectedNil("screen buffer") }
                screenBuffer = newBuffer
            }
            guard let screenBuffer = screenBuffer else { fatalError("Screen buffer cannot be nil here") }
            var bufferContentPtr = screenBuffer.contents()
            let rowLengthInBytes = geometry.width * MemoryLayout<UInt8>.size
            // Copy the PET screen memory into the buffer one row at a time
            for row in 0 ..< geometry.height
            {
                let bytes = controller.bytesFor(row: row)
                assert(bytes.count == geometry.width, "Geometry does not match backing bytes")
                memcpy(bufferContentPtr, bytes, rowLengthInBytes)
                bufferContentPtr += rowLengthInBytes
            }
            commandEncoder.setBuffer(screenBuffer, offset: 0, index: 2)

Now we have set up the screen buffer and copied the PET screen memory into it. We do it line by line because that means we don't have to amend the controller, although this is probably an optimisation point: we should be able to copy the entire screen into the buffer in one go.
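A minimal sketch of what the one-go copy might look like, assuming the controller could hand over all 40 x 25 bytes as a single array. No such controller call exists in the code above, and copyScreen is a name invented for this example. The destination pointer stands in for the value returned by the buffer's contents().

```swift
// Sketch only: assumes the whole screen is available as one [UInt8];
// `copyScreen` is a hypothetical helper, not part of the emulator.

/// Copies the whole screen into `destination` with a single call,
/// instead of one memcpy per row.
func copyScreen(_ screenBytes: [UInt8], into destination: UnsafeMutableRawPointer) {
    screenBytes.withUnsafeBytes { source in
        guard let base = source.baseAddress else { return }   // empty array
        destination.copyMemory(from: base, byteCount: source.count)
    }
}
```

The caller would still have to make sure the buffer is at least width x height bytes long, as the row-by-row version does with its assert.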


            let w = computePipelineState.threadExecutionWidth
            let h = computePipelineState.maxTotalThreadsPerThreadgroup / w
            let threadsPerGroup = MTLSizeMake(w, h, 1)
            commandEncoder.dispatchThreads(MTLSizeMake(drawable.texture.width,
                                                       drawable.texture.height,
                                                       1),
                                           threadsPerThreadgroup: threadsPerGroup)

This sets up the threads. The pipeline state tells us what the optimum number of parallel threads is and then we dispatch one thread per pixel in groups defined by the optimal values supplied.
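To make the arithmetic concrete, here is a sketch with made-up numbers. The values 32 and 1024 are assumptions for illustration only; the real ones come from threadExecutionWidth and maxTotalThreadsPerThreadgroup at run time, and the function names are mine.

```swift
// The numbers 32 and 1024 are assumed example values; the real ones come
// from threadExecutionWidth and maxTotalThreadsPerThreadgroup at run time.

/// Mirrors the w and h calculation used for threadsPerGroup.
func threadgroupSize(executionWidth: Int, maxTotalThreads: Int) -> (width: Int, height: Int) {
    return (executionWidth, maxTotalThreads / executionWidth)
}

/// How many threadgroups are needed to cover a grid, rounding up so that
/// partial groups at the right and bottom edges are counted.
func threadgroupCount(gridWidth: Int, gridHeight: Int,
                      group: (width: Int, height: Int)) -> (across: Int, down: Int) {
    return ((gridWidth + group.width - 1) / group.width,
            (gridHeight + group.height - 1) / group.height)
}

let exampleGroup = threadgroupSize(executionWidth: 32, maxTotalThreads: 1024)
let exampleCount = threadgroupCount(gridWidth: 640, gridHeight: 400, group: exampleGroup)
```

With these example numbers, a 640 x 400 texture is covered by 20 x 13 groups of 32 x 32 threads, and the bottom row of groups is only half used. As far as I understand it, dispatchThreads takes care of those partial groups itself, which is why the code only has to supply the grid size and the group size.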

            commandEncoder.endEncoding()
            commandBuffer.present(currentDrawable!)
            commandBuffer.commit()

We're done so we send the command buffer to the GPU.

[Screenshot, 2018-09-24 at 19.45.12]

Was it worth it? Well, in debug mode, it appears to more or less double the emulated CPU’s speed from 1 to 2 MHz. However, in release mode, it makes almost no difference to the emulated CPU’s speed at all. This is because the CPU emulation has always run in a different thread to the display code.

Next stop: the same exercise for the Spectrum emulation.
