Sep 30, 2021

Building an 8-Bit CPU on a Game Boy

I recently finished building Ben Eater’s 8-bit Breadboard Computer but near the end of it I realized that much of the fun was in the design and the learning, but the act of actually cutting wire was an exercise in tedium.

I wasn’t quite ready to put the 8-bit computer behind me, but I didn’t want to cut or bend wire anymore.

Another thing that’s been on my to-do list is programming for the original Game Boy (GB). The GB comes from a time when game development was a lot closer to the hardware: developers had to be intimately familiar with the hardware, not only because they were writing in assembly and needed to understand the quirks, but because it was necessary to squeeze every last bit of performance out of relatively weak machines.

The entire system is simple enough that you can understand how it all works from top to bottom, but powerful enough to actually do cool stuff. If you doubt its power, look at a game like The Legend of Zelda: Link’s Awakening.

Modern game development involves writing C++ for huge game engines like Unreal Engine where you can’t possibly understand how everything works. Instead there are layers above and below your code, and you’re mostly just inserting little glue layers in-between.

The Game Boy is an intersection of two of my interests: game development and low-level hardware stuff. It’s essentially a self-contained embedded system (8-bit CPU, buttons, speakers, RAM, timers, interrupts, memory-mapped IO, LCD) with the capability of game development (sound effects, sprites, tilemaps).

There is also a large amount of documentation around the GB on the net which makes it relatively easy to get started (but hard to master). The primary resource is gbdev which contains Pan Docs (the technical reference), RGBDS (tools for assembling), and links to tutorials.

Game Boy SAP-1

While staring at my (mostly) complete build of the 8-bit breadboard computer and pondering what to do next (besides fixing the final bug that haunts me), I had the idea of recreating it on a Game Boy as a fun and relatively simple way to ease myself into GB development.

The 160x144 display could display the LEDs as tiles. The buttons could act as pushbuttons and potentiometers and switches. The DIP switches could exist virtually and be set with the d-pad. The internal timer could work as the auto clock step.

Essentially I could emulate the computer on the GB, but with the added flair of visualizing the LEDs (that’s the coolest part of the breadboard computer).

Process

I first laid out the display in a spreadsheet to make sure that everything I wanted would fit onto the screen which could display 20x18 tiles of size 8x8.

I went through a few iterations until I was able to fit everything on the screen in a way that was still clean and legible. It would have been nice to put everything in the same location as they exist on the breadboards but there simply wasn’t enough horizontal space.

I designed a font where each glyph is one tile. I don’t claim to be the world’s greatest type designer, but it gets the job done.

And of course I needed an LED. I went for a square with rounded corners which reads nicely on the small display.

With the layout and tiles designed, it was time to get it all displayed on screen. But you can’t just put a PNG on the Game Boy and expect it to display it. Each tile is eight rows of 8 pixels, and each row is two bytes, for a total of 16 bytes per tile.

The original Game Boy had 2bpp color, meaning it could display four colors: white, light gray, dark gray, and black; although these shades of gray displayed on the original as shades of green.

Pixel N in the tile has its color determined by combining bit N of Byte 1 with bit N of Byte 0, and the resultant value is an index into the palette. For example, the digit 0 of the font:

83 83 7D 7D 6D 6D 6D 6D 6D 6D 7D 7D 83 83 FF FF

Each pixel is either fully black (0x3) or fully white (0x0).

The first line is defined by the bytes 83 83:

    1000 0011

    1000 0011
-------------
    3000 0033

Reading left to right, that means the order of pixels goes black, white, white, white, white, white, black, black, which is exactly what the image shows. You can read more about it here.

Needless to say, that would all be tedious to calculate by hand so I used the wonderful tool called Gameboy Tile Data Generator (gbtdg). You give it an 8x8 PNG and it generates the 16 bytes that the GB expects.

The LED and font tiles can then be put into the ROM.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
SECTION "Tile Data", ROM0

Tiles_Begin:
DB $FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF ; Blank   ($00)
DB $83,$83,$7D,$7D,$6D,$6D,$6D,$6D,$6D,$6D,$7D,$7D,$83,$83,$FF,$FF ; 0       ($01)
DB $EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$FF,$FF ; 1       ($02)
DB $07,$07,$FB,$FB,$FB,$FB,$87,$87,$7F,$7F,$7F,$7F,$03,$03,$FF,$FF ; 2       ($03)
DB $03,$03,$FD,$FD,$FD,$FD,$83,$83,$FD,$FD,$FD,$FD,$03,$03,$FF,$FF ; 3       ($04)
DB $7D,$7D,$7D,$7D,$7D,$7D,$81,$81,$FD,$FD,$FD,$FD,$FD,$FD,$FF,$FF ; 4       ($05)
DB $03,$03,$7F,$7F,$7F,$7F,$07,$07,$FB,$FB,$FB,$FB,$07,$07,$FF,$FF ; 5       ($06)
DB $83,$83,$7F,$7F,$7F,$7F,$03,$03,$7D,$7D,$7D,$7D,$83,$83,$FF,$FF ; 6       ($07)
DB $01,$01,$FD,$FD,$FD,$FD,$FD,$FD,$FD,$FD,$FD,$FD,$FD,$FD,$FF,$FF ; 7       ($08)
DB $83,$83,$7D,$7D,$7D,$7D,$83,$83,$7D,$7D,$7D,$7D,$83,$83,$FF,$FF ; 8       ($09)
DB $83,$83,$7D,$7D,$7D,$7D,$81,$81,$FD,$FD,$FD,$FD,$FD,$FD,$FF,$FF ; 9       ($0A)
DB $C7,$C7,$BB,$BB,$7D,$7D,$7D,$7D,$01,$01,$7D,$7D,$7D,$7D,$FF,$FF ; A       ($0B)
DB $03,$03,$7D,$7D,$7D,$7D,$03,$03,$7D,$7D,$7D,$7D,$03,$03,$FF,$FF ; B       ($0C)
DB $83,$83,$7D,$7D,$7F,$7F,$7F,$7F,$7F,$7F,$7D,$7D,$83,$83,$FF,$FF ; C       ($0D)
DB $03,$03,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$03,$03,$FF,$FF ; D       ($0E)
DB $83,$83,$7F,$7F,$7F,$7F,$07,$07,$7F,$7F,$7F,$7F,$83,$83,$FF,$FF ; E       ($0F)
DB $01,$01,$7F,$7F,$7F,$7F,$03,$03,$7F,$7F,$7F,$7F,$7F,$7F,$FF,$FF ; F       ($10)
DB $83,$83,$7D,$7D,$7F,$7F,$61,$61,$7D,$7D,$7D,$7D,$83,$83,$FF,$FF ; G       ($11)
DB $7D,$7D,$7D,$7D,$7D,$7D,$01,$01,$7D,$7D,$7D,$7D,$7D,$7D,$FF,$FF ; H       ($12)
DB $01,$01,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$01,$01,$FF,$FF ; I       ($13)
DB $01,$01,$F7,$F7,$F7,$F7,$F7,$F7,$77,$77,$77,$77,$8F,$8F,$FF,$FF ; J       ($14)
DB $7B,$7B,$7B,$7B,$7B,$7B,$07,$07,$7B,$7B,$7B,$7B,$7B,$7B,$FF,$FF ; K       ($15)
DB $7F,$7F,$7F,$7F,$7F,$7F,$7F,$7F,$7F,$7F,$7F,$7F,$03,$03,$FF,$FF ; L       ($16)
DB $BB,$BB,$55,$55,$6D,$6D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$FF,$FF ; M       ($17)
DB $03,$03,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$FF,$FF ; N       ($18)
DB $83,$83,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$83,$83,$FF,$FF ; O       ($19)
DB $03,$03,$7D,$7D,$7D,$7D,$03,$03,$7F,$7F,$7F,$7F,$7F,$7F,$FF,$FF ; P       ($1A)
DB $83,$83,$7D,$7D,$7D,$7D,$7D,$7D,$75,$75,$7B,$7B,$85,$85,$FF,$FF ; Q       ($1B)
DB $83,$83,$7D,$7D,$7D,$7D,$03,$03,$7D,$7D,$7D,$7D,$7D,$7D,$FF,$FF ; R       ($1C)
DB $81,$81,$7F,$7F,$7F,$7F,$83,$83,$FD,$FD,$FD,$FD,$03,$03,$FF,$FF ; S       ($1D)
DB $01,$01,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$EF,$FF,$FF ; T       ($1E)
DB $7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$83,$83,$FF,$FF ; U       ($1F)
DB $7D,$7D,$7D,$7D,$BB,$BB,$BB,$BB,$D7,$D7,$D7,$D7,$EF,$EF,$FF,$FF ; V       ($20)
DB $7D,$7D,$7D,$7D,$7D,$7D,$7D,$7D,$6D,$6D,$55,$55,$BB,$BB,$FF,$FF ; W       ($21)
DB $7D,$7D,$BB,$BB,$D7,$D7,$EF,$EF,$D7,$D7,$BB,$BB,$7D,$7D,$FF,$FF ; X       ($22)
DB $7D,$7D,$7D,$7D,$BB,$BB,$D7,$D7,$EF,$EF,$EF,$EF,$EF,$EF,$FF,$FF ; Y       ($23)
DB $01,$01,$FD,$FD,$FB,$FB,$C7,$C7,$BF,$BF,$7F,$7F,$01,$01,$FF,$FF ; Z       ($24)
DB $FF,$83,$83,$7D,$83,$7D,$83,$7D,$83,$7D,$83,$7D,$FF,$83,$FF,$FF ; LED Off ($25)
DB $FF,$83,$83,$01,$83,$01,$83,$01,$83,$01,$83,$01,$FF,$83,$FF,$FF ; LED On  ($26)
Tiles_End:

And the map can be defined as indices into the list of tiles, where tile $00 is the blank (black) tile.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
SECTION "Tile Map", ROM0

TileMap_Begin:
DB $0D,$16,$15,$00,$25,$00,$00,$00,$00,$00,$00,$00,$00,$25,$25,$25,$25,$00,$1A,$0D ; [$000] CLK  PC
DB $00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$020]
DB $00,$13,$1C,$00,$25,$25,$25,$25,$25,$25,$25,$25,$00,$00,$25,$25,$25,$00,$19,$1A ; [$040] IR   OP
DB $00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$060]
DB $17,$0B,$1C,$00,$25,$25,$25,$25,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$080] MAR
DB $17,$0F,$17,$00,$25,$25,$25,$25,$25,$25,$25,$25,$00,$00,$00,$00,$00,$00,$00,$00 ; [$0A0] MEM
DB $00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$0C0]
DB $0C,$1F,$1D,$00,$25,$25,$25,$25,$25,$25,$25,$25,$00,$00,$00,$00,$00,$00,$00,$00 ; [$0E0] BUS
DB $00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$100]
DB $00,$00,$0B,$00,$25,$25,$25,$25,$25,$25,$25,$25,$00,$00,$00,$00,$00,$00,$00,$00 ; [$120] A
DB $00,$00,$0C,$00,$25,$25,$25,$25,$25,$25,$25,$25,$00,$0D,$24,$00,$00,$00,$00,$00 ; [$140] B    C Z
DB $0B,$16,$1F,$00,$25,$25,$25,$25,$25,$25,$25,$25,$00,$25,$25,$00,$00,$00,$00,$00 ; [$160] ALU  x x
DB $00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$080]
DB $0D,$1E,$16,$00,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25,$25 ; [$1A0] CTL
DB $00,$00,$00,$00,$12,$17,$1C,$1C,$13,$13,$0B,$0B,$0F,$1D,$0C,$19,$0D,$0D,$14,$10 ; [$1C0] H M R R I I A A E S B O C C J F
DB $00,$00,$00,$00,$1E,$13,$13,$19,$19,$13,$13,$19,$19,$1F,$13,$13,$0F,$19,$1A,$13 ; [$1E0] T I I O O I I O O U I I E O P I
DB $00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$200]
DB $19,$1F,$1E,$00,$01,$01,$01,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00 ; [$220] OUT
TileMap_End:

I’ve truncated some empty offscreen tiles here to help everything fit better on screen but the code has the full version.

I fill the entire tile map rather than just the 20x18 tiles that are on screen. The actual full Game Boy tile map is 32x32 which allows for rendering things off screen before displaying them (by modifying the viewport). Storing all 32x32, with unused tiles as blank, allowed me to just write from Start Address to End Address easily without doing any offsets (see below).

The use of TileMap_Begin and TileMap_End is a convenient method of assigning names to addresses which can be used to do things like count the number of bytes in a section (TileMap_End-TileMap_Begin).

Displaying the tiles on screen involves first loading the tile descriptions (the sixteen bytes per tile) into VRAM (given the mnemonic _VRAM). Tile 0 is stored at address _VRAM, Tile 1 at _VRAM+16, Tile 2 at _VRAM+32, etc. Here that means blank is $0, digit 0 is $1, digit 1 is $2, and so on up until the enabled LED tile is $26.

1
2
3
4
5
; Load tiles into VRAM
ld hl, _VRAM
ld de, Tiles_Begin
ld bc, Tiles_End-Tiles_Begin
call CopyBytes

With the tiles in VRAM, we can fill the tile map (given the mnemonic _SCRN0) by copying the full 32x32 tile map shown above. That’s why I wasted the space filling out the offscreen regions; it allows for code that starts at TileMap_Begin and ends at TileMap_End, without needing to skip the offscreen bytes. We aren’t starved for bytes with this project anyway.

1
2
3
4
5
; Fill the map
ld hl, _SCRN0
ld de, TileMap_Begin
ld bc, TileMap_End-TileMap_Begin
call CopyBytes

The code uses a subroutine called CopyBytes which takes a source address, destination address, and a byte count, and then copies byte by byte from source to destination.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
; [in] de: Source address
; [in] hl: Destination address
; [in] bc: Count
CopyBytes:
  ; Increment each by one to prevent rolling over when decrementing
  inc b
  inc c
  jr .skip

.loop:
  ld a, [de]
  ld [hl+], a
  inc de

.skip:
  dec c
  jr nz, .loop
  dec b
  jr nz, .loop

  ret

There’s some other code required also to actually get everything on the screen, mainly turning off the LCD before accessing VRAM (don’t do that; either turn off the LCD or wait for VBLANK);

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
SECTION "Main", ROM0[$150]
start:
  di
  ld SP, $FFFF

; Turn LCD off
.wait_for_vblank
  ldh a, [rLY]
  cp 145
  jr nz, .wait_for_vblank
.disable_lcd
  ld a, [rLCDC]
  xor LCDCF_ON
  ld [rLCDC], a

  ; Load tiles into VRAM
  ld hl, _VRAM
  ld de, Tiles_Begin
  ld bc, Tiles_End-Tiles_Begin
  call CopyBytes

  ; Fill the map
  ld hl, _SCRN0
  ld de, TileMap
  ld bc, TileMap_End-TileMap_Begin
  call CopyBytes

  ; Set background palette
  ld a, %11100100
  ld [rBGP], a

  ; Turn LCD back on
  ld a, [rLCDC]
  or LCDCF_ON
  ld [rLCDC], a

  ei

.loop
  halt
  nop
  jr .loop

That’s essentially all that’s required to get the initial state of everything onto the screen.

Emulation

The majority of the work was actually emulating the CPU operations which meant breaking down the instructions into their two parts (4 bits of opcode and 4 bits of optional data), tracking the current op stage, and then asserting different signals depending on the stage.

Let’s step through the instruction LDA 14, where the opcode is $1 and the data is $E, which is 0b00011110. This instruction loads whatever is in memory at address 14 ($E) into the A Register.

1
2
3
4
5
6
7
  LDA 14

 OP  DATA
---- ----
 $1   $E

0001 1110

Every instruction has the first two stages:

Stage 0

CO: Put the value in the Program Counter onto the bus
MI: Load the value on the bus into the Memory Address Register

Stage 1

RO: Put the value that’s in RAM at the address in the MAR onto the bus
II: Put the value in the bus into the Instruction Register
CE: Increase the value in the Program Counter

These two steps are the Fetch. They fetch the next instruction from memory and put it into the Instruction Register.

In code, that means looking at the current op stage opc (which is initialized to 0), executing stages 0 and 1, and then branching elsewhere at stage 2 and beyond.

The control signals take up two bytes (16 bits), where each bit represents the state of one signal, so asserting one is as simple as using the set instruction to enable the bit that corresponds to that signal. ctrl is the address of the first byte and ctrl+1 is the address of the second.

To determine the instruction and thus the next stage is done by masking the upper four bits (and $f0) and jumping to the appropriate instruction based on the value.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
  ; ctrl:    ht | mi | ri | ro | io | ii | ai | ao
  ; ctrl+1:  eo | su | bi | oi | ce | co | jp | fi

  ; Operate based on current op stage
  ld a, [opc]
  ld c, a

  cp $0
  jr z, .stage_0
  cp $1
  jr z, .stage_1

  jr .stage_2_3_4

  ; Stage 0 same for all instructions: CO | MI
.stage_0:
  ld hl, ctrl
  set CTRL_MI, [hl]
  inc hl
  set CTRL_CO, [hl]
  jr .update_leds

  ; Stage 1 same for all instructions: RO | II | CE
.stage_1:
  ld hl, ctrl
  set CTRL_RO, [hl]
  set CTRL_II, [hl]
  inc hl
  set CTRL_CE, [hl]
  jr .update_leds

  ; Stages 2, 3, and 4 differ
.stage_2_3_4:
  ld a, [ir]
  and $f0
  ld b, c

  ; LDA
  cp $10
  call z, op_lda

  ; Go to next op stage (reset to 0 after stage 4)
  ld a, [opc]
  inc a
  ld [opc], a
  cp 5
  jr nz, .exit
  xor a
  ld [opc], a

Once the opcode is determined we call the corresponding function. In this case LDA.

When it comes time to evaluate the actual instruction, all it needs to do is set the appropriate signals based on what stage of the instruction it’s in. For LDA, that’s different for Stage 2 and Stage 3. There is no Stage 4.

Stage 2

IO: Put the lower four bits of the Instruction Register onto the bus
MI: Load the value on the bus into the Memory Address Register

Stage 3

RO: Put the value that’s in RAM at the address in the MAR onto the bus
AI: Put the value on the bus into the A Register

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
op_lda:
  ; save state
  ld c, a
  ld a, b

  cp 2
  jr z, .stage_2
  cp 3
  jr z, .stage_3
  jr .exit

.stage_2:
  ld hl, ctrl
  set CTRL_MI, [hl]
  set CTRL_IO, [hl]
  jr .exit

.stage_3:
  ld hl, ctrl
  set CTRL_RO, [hl]
  set CTRL_AI, [hl]

.exit:
  ld c, a
  ret

All of the actual work is done at the end of the tick depending on whichever signals have been asserted by the current stage of the current instruction. The order of execution matters because any signal that puts data somewhere must be evaluated before any signals that need to read that data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
; Update the control signals for next time. Order matters.
call ctrl_ht
call ctrl_ro
call ctrl_io
call ctrl_ao
call ctrl_su
call ctrl_fi
call ctrl_eo
call ctrl_co
call ctrl_mi
call ctrl_ri
call ctrl_ii
call ctrl_ai
call ctrl_bi
call ctrl_oi
call ctrl_ce
call ctrl_jp

Each signal subroutine checks if its respective bit is set in ctrl or ctrl+1 and, if it is, does its thing. Stage 2 of LDA 14, for instance, will have IO and MI set so ctrl_io will put the value in RAM onto the bus first, and then ctrl_mi will load that into the Memory Address Register.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
ctrl_io:
  ld a, [ctrl]
  ld b, a
  bit CTRL_IO, b
  jr z, .exit

  ld a, [ir]
  and $0f
  ld [bus], a

.exit:
  ret

ctrl_mi:
  ld a, [ctrl]
  ld b, a
  bit CTRL_MI, b
  jp z, .exit

  ld a, [bus]
  ld [mar], a

.exit:
  ret

Finally all of the LEDs are updated to display the current values, as well as the digits of the output register.

The subroutine UpdateLeds toggles LED tiles based on a binary value so that a high bit displays as an enabled LED and a low bit displays as a disabled LED. It works for an arbitrary number of LEDs by shifting the value left and seeing if a 1 or a 0 shifted out (which sets the carry flag).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
; [in] a:  Value to LED-ify
; [in] hl: Start address of LED tile
; [in] c:  LED count
UpdateLeds:
  ld b, a      ; cached A
  dec hl       ; pre-decrement, will be added back first way around the loop
  ld de, $00FF ; offset from HL (inc e will overflow to 0 for first offset)

  ; shift left for anything less than a count of 8 so we have MSB in the right spot for left shift
  ld a, 8
  sub c
  jr z, .check

.shift:
  sla b
  dec a
  jr nz, .shift

.check:
  inc hl

  ; determine if we've checked all bits
  inc e
  ld a, e
  cp c
  jr z, .exit

  ; shift left, carry contains whether the bit was set
  sla b
  jr c, .set

.clear:
  ld [hl], LED_OFF
  jr .check

.set:
  ld [hl], LED_ON
  jr .check

.exit:
  ret

Displaying the output as three digits meant converting a binary value to Binary-Coded Decimal (BCD) so that each digit could be represented as a single tile.

CalculateBCD does so by dividing by 100 until overflow and saves the number of divisions (the 100s place), then divides by 10 until overflow and saves the number of divisions (the 10s place), and then subtracts the remaining value to get the 1s place.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
; [in]  a: value to calculate
; [out] e: 100s place
; [out] c: 10s place
; [out] d: 1s place
CalculateBCD:
  ld e, 0 ; count (100s place)

  ld b, 100
.sub_led_100:
  ld h, a ; save for 10s calc
  sub b
  jr c, .reset
  ; count++
  ld c, a
  ld a, e
  inc a
  ld e, a
  ld a, c
  jr .sub_led_100
.reset:
  ld b, 10
  ld c, 0 ; count (10s place)
  ld a, h
.sub_led_10:
  sub b
  jr c, .mod
  ; count++
  ld d, a
  ld a, c
  inc a
  ld c, a
  ld a, d
  jr .sub_led_10
.mod:
  ; a contains overflow; difference is remainder ($ff-remainder-9)
  ld b, a
  ld a, $ff
  sub b
  ld b, a
  ld a, 9
  sub b
  ; 1s place
  ld d, a
  ret

  [...]

  ; Update output display
  ld a, [out_reg]
  call CalculateBCD
  ; 100s place
  ld hl, display
  ld [hl], e
  ; 10s place
  inc hl
  ld [hl], c
  ; 1s place
  inc hl
  ld [hl], d

Limitations

There are two parts to the program: the emulation, which just interprets the opcodes and performs the required action (e.g., LDI N will load the value N into the A register); and the LED simulation which ensures that the correct LEDs light up on screen at the same time as they would on the actual breadboard version.

The circuit that exists on the breadboard involves a lot of signals that operate simultaneously, but the GB’s CPU can only execute one instruction at a time. Achieving correct behavior as well as correct LED display required checking/executing signals in a specific order (which isn’t an issue on the breadboard because multiple signals can fire simultaneously).

On the breadboard the output signals put a value on the bus lines as soon as that signal is asserted and the input signals latch that value on the rising edge of the clock. So if AO is asserted, then the value in the A register is on the bus lines. And if OI is also asserted, then the value on the bus lines will go into the output register once the clock pulses.

The solution is simply to evaluate the output signals (RO, IO, AO, etc) before evaluating the input signals (MI, RI, AI, etc), so that the variable holding the bus value has been loaded before being read.

Making sure the state of the LEDs was correct during development mostly involved loading the GB version with the same program as Ben, watching him step through in his videos, and making sure the LEDs matched.

The Zen of Assembly

Even experienced programmers often fear assembly. I was surprised to discover that I found it mostly enjoyable.

It’s Fun

Programming in assembly turned out to be very therapeutic. Being able to understand the GB’s architecture and create behavior one instruction at a time was a welcome change from wrangling C++ and OOP.

There are no libraries, no compiler black magic, no hidden behavior, no fighting the language. It’s just you, the assembler, some registers, and a bunch of data.

Programming is, after all, just data manipulation which is often obscured by modern languages.

Stack and Registers

The usual stack-based approach to a function would be to push any registers you’re going to use onto the stack, use them however you want, and then pop the old values off the stack before returning. Or allocate space on the stack for local variables and function parameters.

But, while the Game Boy has a stack, using it is discouraged because pushing and popping takes a lot of machine cycles. I avoided pushing and popping whenever possible and instead tried to write all functions using nothing more than the given registers. If I needed to use a register in the function but its old value was important (common with A), I would try to save it to a separate register that wasn’t needed and then restore it at the end. This allowed me to avoid using the stack for all functions except interrupt handlers.

Interrupt handlers require use of the stack because you don’t know when they’re going to be called so you don’t know which registers are safe to use. In that case you must push any registers you’re going to overwrite onto the stack and pop them at the end.

You can see such a situation in the UpdateTiles function which is called when a VBLANK interrupt occurs. I use the A register extensively and so I push it (as well as the flags register because you can only push/pop 16-bit register pairs but also because flags are often important to preserve) before doing anything with A.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
UpdateTiles:
  push af

  ; CLK
  ld hl, _SCRN0 + 4
  ld a, [clk_led]
  ld [hl], a

  [...]

  pop af

Prefer Relative Jumps

The Game Boy has two types of jumps: jump absolute (jp) and jump relative (jr). The absolute jump will jump to a 16-bit address while the relative jump can jump -128 locations backward or +127 locations forward. In other words, if you’re at PC $205 and do a jr 5, it’ll go to $20A, and if you do jr -5 it’ll go to $200. (The assembler handles the offsets for you; you can just use a label as a destination.)

Relative jumps should be the first choice at all times because they take fewer cycles to execute and they use fewer bytes (8 bit offset instead of a 16 bit address). I used jr by default at all times and then would only change to jp once the assembler told me I had an error (due to the label I was trying to jump to being too far away).

Loops and Falling Through

When doing conditional logic it was often useful to think in terms of falling through conditions as a way of simplifying the logic.

As an example, here’s code for checking if the A button is pressed. It calls ReadKeys which loads A with four bits indicating Start, Select, B, and A; then it checks for bit 0 (A). If it’s set, the nz flag is set, so it jumps to key_pressed. Otherwise it jumps to key_released.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
.loop:
  call ReadKeys
  bit 0, a
  jr nz, .key_pressed
  jr .key_released

.key_pressed:
  ld a, [button_state]
  cp 0
  jr z, .tick
  jr .loop

.key_released:
  xor a
  ld [button_state], a
  ld a, LED_OFF
  ld [clk_led], a
  jr .loop

But we can simplify it by shuffling things around:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
.loop:
  call ReadKeys
  bit 0, a
  jr nz, .key_pressed

.key_released:
  xor a
  ld [button_state], a
  ld a, LED_OFF
  ld [clk_led], a
  jr .loop

.key_pressed:
  ld a, [button_state]
  cp 0
  jr z, .tick
  jr .loop

Or by changing the logic:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
.loop:
  call ReadKeys
  bit 0, a
  jr z, .key_released

.key_pressed:
  ld a, [button_state]
  cp 0
  jr z, .tick
  jr .loop

.key_released:
  xor a
  ld [button_state], a
  ld a, LED_OFF
  ld [clk_led], a
  jr .loop

In both cases the flow of execution falls through to the “else” case, which removes the need for two jr instructions for each case.

Setting to 0

It’s common to want to set some value to 0. The obvious way is with ld:

1
ld a, 0

But there’s another way:

1
xor a

1 XOR 1 is 0 and so XORing two values that are the same will be all zeros because all of the same bits are set.

    01010111
XOR
    01010111
------------
    00000000

ld r,n takes 8 clock cycles and uses 2 bytes.
xor a takes 4 clock cycles and uses 1 byte.

It’s simple tricks like this that can save a lot of space and cycles when you use them a lot in a large program/game.

Setting a value to zero is so common that some instruction set architectures have a dedicated zero register (e.g., MIPS and RISC-V).

Result

The following is the program that runs in the video titled “Conditional jump instructions” which counts from 0 to 255 and back to 0 again repeatedly. I made a slight modification and changed it to count in increments of 15 to make it go faster.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
ld a, $e0
ld [ram], a    ; OUT
ld a, $2f
ld [ram+1], a  ; ADD 15
ld a, $74
ld [ram+2], a  ; JC 4
ld a, $60
ld [ram+3], a  ; JMP 0
ld a, $3f
ld [ram+4], a  ; SUB 15
ld a, $e0
ld [ram+5], a  ; OUT
ld a, $80
ld [ram+6], a  ; JZ 0
ld a, $64
ld [ram+7], a  ; JMP 4
xor a
ld [ram+8], a
ld [ram+9], a
ld [ram+10], a
ld [ram+11], a
ld [ram+12], a
ld [ram+13], a
ld [ram+14], a
ld a, 15
ld [ram+15], a

The computer only has 16 bytes of RAM. The program itself takes up the first 8 bytes, and one more byte is used as for actual variable data (address 15 which contains the value 15).

Here it is running at a slow speed in the bgb emulator so that you can see the LEDs clearly.

We can also crank it up a bit.

But the true test is running it on real hardware. I have a Game Boy Color that has been fitted with a backlit IPS LCD (because I can) and an Everdrive GB X7 for running ROMs.

In all of the videos you can see that the clock LED doesn’t work quite right. It simply cycles too fast to be accurately represented on screen.

The Future

There are some features I’d still like to add:

Game Boy Color support

Monochromatic LEDs are a bit boring. I’d like to add color so that the LEDs match those of the breadboard version (blue for bus, red for ALU, yellow for control, etc).

Programmable at Runtime

The program is embedded directly inside the ROM, so if you want to change it you have to edit the source and rebuild. The breadboard is programmed with DIP switches and the same could be done on the Game Boy: boot into a programming menu with an address DIP switch and a data DIP switch, set them both, then press a button to program it. When finished, press another button to go into Run Mode.

Dynamic Clock Speed

The clock speed is set in the source and changing it again means rebuilding. It would be nice to be able to increase and decrease the clock as the program runs, as well as being able to select Auto or Manual modes.

Source Code

The source code is here.

Last Edited: Dec 20, 2022