Optimization

Introduction

Once you've worked the bugs out, you may want to make your program smaller and faster. This section is dedicated to that purpose. There are many things you can do; the following general techniques are a good start:

Code replacements

xor a vs. ld a,0

A simple way to set a to zero: xor a takes 1 byte and 4 T-states, saving 1 byte and 3 T-states over ld a,0. Unlike ld, it modifies the flags, so don't use it if you need to preserve them.

or a vs. cp 0

If you want to test a for zero or check its sign, or a saves 1 byte and 3 T-states over cp 0. Like cp 0, it always resets the c flag. It also sets the P/V flag to the parity of a, where cp 0 would report overflow.
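
As a small illustration, the check below branches when a one-byte counter reaches zero. The names lives and gameOver are hypothetical; they stand for any byte in RAM and any label within jr range:

 ld a,(lives)       ;hypothetical 1-byte variable
 or a               ;z set if a = 0, same test as cp 0
 jr z,gameOver      ;hypothetical label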

dec a vs. cp 1

If you can afford to modify the register, dec a is a smaller and faster way to check whether a (or any other 8-bit register) is 1. 8-bit increments and decrements affect both the z flag and the sign flag, among others, but leave the c flag unchanged.
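
One place this pays off is a chain of tests on small consecutive values. The sketch below assumes a holds a menu choice of 1, 2 or 3 and that a may be destroyed; the labels are hypothetical:

 dec a
 jr z,choiceOne     ;a was 1
 dec a
 jr z,choiceTwo     ;a was 2
 dec a
 jr z,choiceThree   ;a was 3

Each test costs 1 byte and 4 t-states instead of the 2 bytes and 7 t-states of cp 1, cp 2 and cp 3.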

inc a vs. cp 255

Again, inc a is a byte smaller and 3 t-states faster. It does not preserve a, but that is often acceptable, and inc works on all of the main 8-bit registers and (hl).

adc a,0 vs. jr nc,$+3 \ inc a

adc a,0 is 7 t-states and 2 bytes, whereas the latter is 3 bytes and 11 t-states if the c flag is set, 12 if it is reset. That saves a byte and 4 to 5 t-states.
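
A typical use is carrying out of a low byte into a high byte. This sketch adds the 8-bit value in a to hl, assuming a may be destroyed:

 add a,l            ;a = l + a, c flag set on overflow
 ld l,a             ;ld does not change the flags
 ld a,h
 adc a,0            ;a = h + carry
 ld h,a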

ccf \ adc a,0 vs. jr c,$+3 \ inc a

They are the same size, but the former is always 11 t-states whereas the latter is either 11 or 12 depending on the c flag.

sbc a,0 vs. jr nc,$+3 \ dec a

See adc a,0 vs. jr nc,$+3 \ inc a.

ccf \ sbc a,0 vs. jr c,$+3 \ dec a

See ccf \ adc a,0 vs. jr c,$+3 \ inc a.

scf \ ccf

This is used to reset the c flag, but there are many other ways to do that. This is 8 t-states, 2 bytes, but the following are 1 byte, 4 t-states:

or a    ;z flag is set if a = 0
and a   ;z flag is set if a = 0
xor a   ;always sets the z flag, sets a = 0
cp a    ;always sets the z flag
sub a   ;always sets the z flag, sets a = 0

The following also reset the c flag, but they take 2 bytes and 7 t-states each, so you should not use them:

sub 0
add a,0
cp 0

In each of these cases, other flags are also modified.

Cursor/pen

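 ld hl,$0100        ;l = $00 is stored first, h = $01 second
 ld (curRow),hl     ;curRow = $00, curCol = $01
 ld (penCol),hl     ;penCol = $00, penRow = $01

This is much more efficient when you are setting both coordinates of the cursor or pen. Because curCol is right after curRow (and penRow is right after penCol), you can use a 16-bit register to load both at once. The low byte (l) goes to the first address and the high byte (h) to the second, so the same value in hl means row 0, column 1 for the cursor but row 1, column 0 for the pen.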
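
To keep the byte order straight, the 16-bit constant can be built from named values. This is only a sketch: the equate names are hypothetical, and the .equ syntax assumes a TASM-style assembler:

textRow .equ 3
textCol .equ 5
pixRow  .equ 20
pixCol  .equ 10

 ld hl,textCol*256+textRow   ;h = column, l = row
 ld (curRow),hl
 ld hl,pixRow*256+pixCol     ;h = row, l = column
 ld (penCol),hl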

PutS

Something you may not know is that PutS and its variations leave HL pointing to the byte after the null terminator. This is very useful when displaying several strings at different locations on the screen, because you don't have to load each string's address into hl.

 ld hl,txtTest
 bcall(_PutS)
 ld de,$0100
 ld (curRow),de
 ld hl,txtTest2
 bcall(_PutS)
 ;...

txtTest:
 .db "Test",0
txtTest2:
 .db "Test2",0

can be shortened to:

 ld hl,txtTest
 bcall(_PutS)
 ld de,$0100
 ld (curRow),de
 ;we don't need "ld hl,txtTest2", because hl already points to txtTest2
 bcall(_PutS)
 ;...

txtTest:
 .db "Test",0
;txtTest2            ;Optional, doesn't affect speed or size here
 .db "Test2",0

It also lets you display strings in a loop, for example for a high score board:

high:
 ld b,8
 ld de,0
 ld (curRow),de
 ld hl,txtHigh
highloop:

 push hl
 push de
 ld a,(hl)
 ld h,0
 ld l,a
 bcall(_DispHL)
 pop de
 pop hl

 inc hl

 bcall(_PutS)

 inc e
 ld d,0
 ld (curRow),de
 djnz highloop

 bcall(_GetKey)
 ret

txtHigh:
 .db 20,"HIGH SCORE!",0
txt2nd:
 .db 19,"HIGH SCORE!",0
txt3rd:
 .db 18,"HIGH SCORE!",0
txt4th:
 .db 17,"HIGH SCORE!",0
txt5th:
 .db 16,"HIGH SCORE!",0
txt6th:
 .db 15,"HIGH SCORE!",0
txt7th:
 .db 14,"HIGH SCORE!",0
txt8th:
 .db 13,"HIGH SCORE!",0

Optimised Code Snippets

Test For 0 (8-bits)

For any 8-bit register, you can use the following:

inc [reg8]
dec [reg8]

This sets the z flag if the register is 0 and resets it otherwise. It is 8 t-states, 2 bytes, and preserves the register as well as the c flag.
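
A common use is guarding a djnz loop: djnz decrements b before testing it, so entering the loop with b = 0 would run it 256 times. A minimal sketch, with hypothetical labels:

 inc b
 dec b              ;z set if b = 0, b unchanged
 jr z,loopDone      ;skip the loop entirely
loop:
 ;...loop body...
 djnz loop
loopDone: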

Set A=0

ld a,0 is 2 bytes and 7 t-states, whereas the following are 1 byte and 4 t-states:

xor a
sub a

Note that these will change flags, but usually that is okay.

16-bit CP

To compare HL to another 16-bit register, you can do the following:

or a
sbc hl,[reg16]
add hl,[reg16]

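The or a is simply there to reset the c flag, so if the c flag is already reset at this point, leave it out and save a byte plus 4 t-states. The full sequence is 4+15+11 = 30 t-states and 4 bytes. Afterwards hl is unchanged, the z flag is set if the two values are equal, and the c flag is set if hl is the smaller value (unsigned).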
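
The add does not touch the z flag and regenerates the carry that sbc produced, so the result can be branched on just like an 8-bit cp. A sketch comparing hl with de, using hypothetical labels:

 or a               ;reset the c flag
 sbc hl,de          ;hl = hl - de, sets z and c
 add hl,de          ;restore hl, z unchanged
 jr z,isEqual       ;hl = de
 jr c,isLess        ;hl < de (unsigned)
 ;falls through when hl > de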

Conditionally Set or Reset A

In some cases, you need to set all of the bits in A or reset all of them based on a flag. If you are using the c flag:

sbc a,a

That takes just 1 byte and 4 t-states. If the c flag was set, a becomes 255, otherwise a becomes 0, and the c flag itself is preserved.
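
One use is sign-extending an 8-bit value. This sketch assumes a holds a signed byte and leaves the 16-bit equivalent in hl:

 ld l,a
 add a,a            ;bit 7 of a goes into the c flag
 sbc a,a            ;a = 255 if negative, else 0
 ld h,a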

16-bit NEG

To get the negative (additive inverse) of a 16-bit register, the following 6 byte, 24 t-state routine can be used:

xor a
sub [LSBreg16]
ld [LSBreg16],a
sbc a,a
sub [MSBreg16]
ld [MSBreg16],a

For example, to negate hl:

xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
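
Combined with a sign test, the routine gives an absolute value. A minimal sketch, assuming hl holds a signed 16-bit value; absHL is a hypothetical label:

absHL:
 bit 7,h
 ret z              ;already non-negative
 xor a
 sub l
 ld l,a
 sbc a,a
 sub h
 ld h,a
 ret

Note that $8000 (-32768) has no positive counterpart in 16 bits and comes back unchanged.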

ld hl,(hl)

Often we want to use indirection when using a lookup table of addresses. For example, say you have a look-up table for strings:

LUT:
 .dw String1
 .dw String2
 .dw String3
 .dw String4

String1: .db "String1",0
String2: .db "String2",0
String3: .db "String3",0
String4: .db "String4",0

Say you want to load the address of a string into HL. Assuming HL already points to the entry in the LUT:
 ld e,(hl)
 inc hl
 ld d,(hl)
 ex de,hl

That is 4 bytes, 24 t-states, but it destroys DE. The following is the same size and speed, destroying A:
 ld a,(hl)
 inc hl
 ld h,(hl)
 ld l,a
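
Put together with the LUT above, the sketch below displays the string whose index (0 to 3) is in a. It assumes a and de may be destroyed:

 ld l,a
 ld h,0
 add hl,hl          ;each entry is 2 bytes
 ld de,LUT
 add hl,de          ;hl points to the LUT entry
 ld a,(hl)
 inc hl
 ld h,(hl)
 ld l,a             ;hl = address of the string
 bcall(_PutS)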

In the case that you need extreme speed or size optimisations, the following also does the trick, but has a few drawbacks:
 ld sp,hl
 pop hl

At just 2 bytes and 16 t-states, that is pretty optimised, but it destroys the stack pointer, which is crucial to most routines. In general, you would need to save the stack pointer somewhere and restore it later, at a total cost of 40 t-states and 8 bytes, and your routine wouldn't be able to use the stack in between. Since each use saves 8 t-states and 2 bytes, you would need to use this version of indirection at least 6 times to get a speed saving and 5 times for a size saving, at the cost of 2 bytes of RAM.
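
A sketch of the save and restore, assuming the program runs from RAM so that the hypothetical variable saveSP can live inside it, and assuming interrupts were enabled beforehand. Interrupts are disabled here because an interrupt taken while sp points into the table would push a return address over the data:

 di
 ld (saveSP),sp     ;4 bytes, 20 t-states
 ld sp,hl           ;hl points to the LUT entry
 pop hl             ;hl = address read from the table
 ;...more lookups, but no push, call or bcall...
 ld sp,(saveSP)     ;4 bytes, 20 t-states
 ei
 ;...
saveSP:
 .dw 0

After the pop, sp points to the next table entry, so consecutive entries can be read with consecutive pops.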

Optimized 'ld a,Y \ jr nc,$+4 \ ld a,X'

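The above code (6 bytes, 19 or 21 t-states) loads a with X if the c flag is set and with Y otherwise. Try these instead, keeping in mind that they modify the flags, which the original does not:

In the case that Y==0: (3 bytes, 11 t-states)
sbc a,a \ and X

In the case that X==0: (4 bytes, 15 t-states)
ccf \ sbc a,a \ and Y

When neither X==0, nor Y==0: (5 bytes, 18 t-states)
sbc a,a \ and X^Y \ xor Y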
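
To see why the general form works, here is a trace with the arbitrary example values X = $2D and Y = $2B, so that X^Y = $06:

 sbc a,a            ;c set: a = $FF        c reset: a = $00
 and $2D^$2B        ;       a = $06                 a = $00
 xor $2B            ;       a = $2D (X)             a = $2B (Y)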

Conclusion

From this point on, you may be perfectly happy with your program. It works, runs at a decent speed and is also smaller than it used to be. What more could there be to do? Read on to find out what else you need to do before you release your program to the general public.

Unless otherwise stated, the content of this page is licensed under GNU Free Documentation License.