This article covers advanced math routines for a number of functions including the basic addition, multiplication, and square roots, as well as exponential and trigonometric functions.
In most instances it is most efficient to store numbers in little endian format as that is how the Z80 works internally. Doing this allows us to take advantage of the handful of 16bit operations. In the case of BCD formatted numbers, it really doesn't matter if they are stored bigendian or littleendian since built in operations are only 8bit. However, I choose to use little endian since it is easier to work with large numbers when the pointer passed to the routine points to the lowest byte.
I'll try to cover:
 Integer arithmetic
 BCD arithmetic
 Fixed point arithmetic
 Floating point arithmetic
Table of Contents

Intro to Integer Formats
Integers are one of the most basic format to work with on the Z80. The built in instructions primarily work with 8bit integers, though a few work with 16bit integers. Since you are reading this page, I am going to assume you know this much.
However, if you are looking to work with larger integers, you'll need to come up with a working format. Best practice for the z80 is store in little endian. You may have seen this mentioned here and elsewhere without a good reason, so here it is:
The Z80 stores 16bit integers in littleendian. If you need to access the lower 16 bits of a 64bit integer, it's easier and more efficient to use ld hl,(int64) for 3 bytes, 16cc, versus ld hl,(int64+6) \ ld a,l \ ld l,h \ ld h,a at 6 bytes, 28cc.
So if you wanted to store the 32bit integer 0x12345678 as a constant in your program, it would look like:
int32_var0:
.db $78,$56,$34,$12
OR:
int32_var0:
.dw $5678,$1234
And if you wanted to change that value to 0xFEDCBA98, you could do:
ld hl,$BA98
ld (int64_var0),hl
ld hl,$FEDC
ld (int64_var0+2),hl
Intro to BCD Formats
BCD, short for Binary Coded Decimal, is a format where one byte represents two base 10 digits. As an example, 0x66 is interpreted in base 10 as 66 instead of 102.
The Z80 has a handy instruction called DAA (1 byte, 4cc) which converts the A register based on it's contents, the c flag, halfcarry flag, and the last type of arithmetic operation. Addition and subtraction operations set or reset a special flag so that the DAA instruction knows what to do.
For example, if A=0x66, then add a,a, you would get A=0xCC, with the c flag reset. Followed up with the daa instruction, the c flag is set, and A is converted to 0x32, which is what we expect since 66*2=132.
Intro to Fixed Point Formats
Fixed point formats offer the advantage of precision smaller than a unit. For example, you can store values such as 119.125.
In fixed point, the decimal point is in a fixed location. An 8.8 fixed point number means that given a 16bit number, the top 8 bits form the integer part, and the lower 8 bits form the fractional part. If the lower 8 bits formed the integer 55, then it would be interpreted as 55/256.
Fixed point formats tend to be faster and easier to implement than floating point, but they can lose precision more easily. They are great for games where speed is desired, but often not desirable in mathematical applications.
Many fixed point routines can use integer routines with some minor modifications to overhead code.
Intro to Floating Point Formats
Floating point formats typically have:
 A sign bit
 An exponent for an implied base, which indicates where the decimal point is
 A mantissa
Floating point and scientific notation, are very similar. If we are using BCD floats, then base 10 is likely the implied base, so you might have a number like 3.14159*10^{0}.
In practice, a base of some power of two is far more efficient to compute. As well, the mantissa is typically of the form where one digit forms the nonzero integer part, and the rest form the fractional part. This is referred to as being normalized. When the top digit is 0, this is a denormalized number. Some formats allow this, some don't. For base2 float formats that don't allow denormalized numbers there is the advantage that the top bit can only be one value— 1, so it isn't actually necessary to occupy a bit that is known to always be 1. Instead, this bit is taken as implicit. In such cases, that bit is instead put to some other use, like a sign bit, or another bit of the exponent.
In general:
 Sign is represented as a bit where 0 means positive, 1 means negative.
 The exponent is an integer with a bias. For example, if our format uses an 8bit exponent, 128 might represent the exponent 0. In fact, this is what TI does for their BCD floats.
 The mantissa is a 1.n fixed point number. As examples,
** A 7byte BCD float has 14 digits, so the mantissa is a 1.13 fixed point number.
** An "extended precision float" has a 64bit mantissa with no implicit bit. Therefore, the mantissa is a 1.63 fixed point number.
** A "single precision float" has a 24bit mantissa, where the top bit is implied and in its place is the sign bit. So it is a 1.23 fixed point number, where the top bit is always expected to be 1.
There is a GitHub repository for Z80 floats here. At the time of this writing, it contains singleprecision and extendedprecision binary floats. Maybe at some future point there will be other formats.
Addition/Subtraction
Thankfully addition and subtraction are builtin instructions on the Z80 and by extension, the eZ80. However, if we want to work with multiple precision arithmetic, we still have to juggle pointers and RAM.
Integer Addition/Subtraction, multiprecision
Suppose we have two 64bit integers stored in a fixed location in RAM called input1 and input2, and we want to store their sum or difference to output. Then we can take advantage of 16bit math to make a speedoptimized routine:
;addint64:
;47 bytes, 264cc
ld hl,(input1) \ ld de,(input2) \ add hl,de \ ld (output),hl
ld hl,(input1+2) \ ld de,(input2+2) \ adc hl,de \ ld (output+2),hl
ld hl,(input1+4) \ ld de,(input2+4) \ adc hl,de \ ld (output+4),hl
ld hl,(input1+6) \ ld de,(input2+6) \ adc hl,de \ ld (output+6),hl
Note the use of add in the first iteration, versus adc in the latter iterations.
For subtraction, we only have sbc for 16bit subtraction, so we have to first reset carry
;subint64:
;49 bytes, 272cc
or a
ld hl,(input1) \ ld de,(input2) \ sbc hl,de \ ld (output),hl
ld hl,(input1+2) \ ld de,(input2+2) \ sbc hl,de \ ld (output+2),hl
ld hl,(input1+4) \ ld de,(input2+4) \ sbc hl,de \ ld (output+4),hl
ld hl,(input1+6) \ ld de,(input2+6) \ sbc hl,de \ ld (output+6),hl
Suppose you want to pass pointers to the input and output operands where DE is the first operand, HL is the second, and BC is the output:
;addint64:
;45 bytes, 294cc
ld a,(de) \ add a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a
;subint64:
;45 bytes, 294cc
ld a,(de) \ sub (hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a
Suppose you want to save a few bytes at the cost of some cycles:
;addint64:
;17 bytes, 515cc
or a
call +_
_:
call +_
_:
call +_
_:
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc \ ret
;subint64:
;17 bytes, 515cc
or a
call +_
_:
call +_
_:
call +_
_:
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc \ ret
Suppose you just want the output to overwrite the first input, in order to save time and space:
;addint64:
;38 bytes, 252cc
ld a,(de) \ add a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a
;subint64:
;38 bytes, 252cc
ld a,(de) \ sub (hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a
Or with BC freed up, roll your loops:
;addint64:
;11 bytes, 384cc
ld b,8
or a
_:
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
djnz _
ret
;subint64:
;11 bytes, 384cc
ld b,8
or a
_:
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
djnz _
ret
BCD Addition/Subtraction, multiprecision
BCD addition and subtraction work almost exactly like their integer counterparts
except that you have to use arithmetic with the A register, followed by daa.
I'll offer one example, and the reader can adapt the rest. This modifies the
above 64bit example, but encodes 16 BCD digits.
;addBCD_16:
;46 bytes, 284cc
ld a,(de) \ add a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a
Fixed Point Addition/Subtraction
Fixed Point addition and subtraction are identical to integer addition and subtraction, assuming both inputs have the same format.
Floating Point Addition/Subtraction
Believe it or not, addition and subtraction are one of the toughest floating point operations to implement. To show this (using base 10 for convenience), suppose you have two numbers, 3.14159 *10^{0} and 1.23456*10^{3}.
To add these, we have to align their decimal points, which means denormalizing a number until it's exponent matches. On top of that, if your precision is too limited, you'll have to round. In this example, we only have 6 digits allowed. Since it truncates the digit 1, our number is not rounded up.
0.00314*10^^3^^
+1.23456*10^^3^^

=1.23770*10^^3^^
So let's review what we need to do, then create some pseudocode and then some examples:
 The first input will be overwritten with the output.
 We need to align the mantissas so that their exponents are the same
 Organize the inputs so that the exponent of the first input is greater than or equal to that of the second.
 Find their difference in exponent. If this difference is more than the number of bits in the mantissa, then if we aligned the mantissas, the second one would be full of zeros, so we can skip these cases. If the difference is equal to the size of the mantissa, then when the mantissas are aligned and round:
 If the high bit is 1, increment the mantissa of the first input. For formats where the high bit is implicit, then we can just go ahead and increment that mantissa.
 If that difference is nonzero, shift the second input right a number of times equal to that difference.
 If the signs of the input are the same, add the second mantissa to that of the first as fixedpoint numbers.
 If there is overflow, increment the exponent and shift the mantissa right, rotating in a '1' bit.
 If the exponent overflows, set the result to INF
 If there is overflow, increment the exponent and shift the mantissa right, rotating in a '1' bit.
 Else the signs are different, so subtract the second mantissa from that of the first as fixedpoint numbers.
 If there is overflow, invert the sign bit, then perform 2's compliment on the mantissa (negate it).
 If the mantissa is zero, set the output to ZERO.
 While the upper bit of the result is 0, decrement the exponent, and shift the mantissa left.
 If the exponent underflows, set the result to ZERO.
;We'll assume that the top bit of the mantissa is always 1.
;exp1 represents the exponent of the first input as an integer with bias at half it's maximum value.
;exp2 represents the exponent of the second input as an integer with bias at half it's maximum value.
;sign1 is 0 or 1. 1 if the first input is negative, 0 otherwise.
;sign2 is 0 or 1. 1 if the second input is negative, 0 otherwise.
;mant1 is the mantissa of the first input, a fixedpoint number. Since it is addition, we'll just treat it as an integer.
;mant2 is the mantissa of the second input, a fixedpoint number. Since it is addition, we'll just treat it as an integer.
;size is the size of the mantissa.
if exp1<exp2:
swap(float1,float2)
k=exp1exp2
if k>size:
return
if k=size:
mant2=1
else:
mant2=mant2>>k
if carry=1:
mant2=mant2+1
if sign1=sign2:
mant1=mant1+mant2
if carry=1:
<<shift right mant1, shifting in a 1>>
exp1=exp1+1
if carry=1:
return INF
return float1
else:
mant1=mant1mant2
if carry=1:
sign1=not(sign1)
mant1=mant1
while topbit(mant1)=0:
exp1=1
if exp1=0
return ZERO
mant1=mant1<<1
return float1
In this example, we have a 24bit littleendian mantissa, with the top bit an implicit 1, instead, used as a sign bit. Finally, the top 8 bytes are a mantissa such that 128 corresponds to an exponent of 0. In all, these floats use 32 bits of space.
addend = scrap
addend2= scrap+7 ;4 bytes
;;Still need to tend to special cases
addSingle:
;;x+y
push af
push hl
push de
push bc
inc de
inc de
inc hl
inc hl
ld a,(de)
xor (hl)
push af
inc de
inc hl
ex de,hl
ld a,(de)
sub (hl)
ex de,hl
jr nc,$+5
ex de,hl
neg
cp 24
jp nc,add_unneeded
push hl
ld hl,addend+6
dec de
ld bc,0408h
dec hl
ld (hl),0
sub c
jr nc,$5
add a,c
push af
push hl
ex de,hl
ld a,(hl) \ or 80h \ ld (de),a \ dec de \ dec hl
ldd
ldd
ex de,hl
dec b
jr z,$+7
ld (hl),0 \ dec hl \ djnz $3
pop hl
pop af
ld b,a
jr z,noshift
set 7,(hl)
_:
push hl
srl (hl) \ dec hl
rr (hl) \ dec hl
rr (hl) \ dec hl
rr (hl)
pop hl
djnz _
noshift:
pop hl ;bigger float
dec hl
ld b,(hl)
dec hl
dec hl
ex de,hl
pop af
jp m,subtract
ld hl,addend+2
ld a,(hl) \ rla \ inc hl
ld a,(de) \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ set 7,a \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ ld (hl),a
jp nc,add_done
inc (hl)
jp z,add_overflow
dec hl \ rr (hl)
dec hl \ rr (hl)
dec hl \ rr (hl)
jp add_done
subtract:
ld hl,addend
xor a
ld c,a \ sub (hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c
sbc a,(hl) \ ld (hl),a \ inc hl
ld a,(de) \ sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ set 7,a
sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ ld (hl),a
dec de
ex de,hl
jr nc,negated
ld hl,addend
ld a,80h \ xor b \ ld b,a
ld a,c \ sub (hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a
negated:
jp m,add_done
push bc
ld hl,(addend)
ld de,(addend+2)
ld bc,(addend+4)
ld a,h \ or l \ or d \ or e \ or b \ or c
jp z,add_underflow
ld a,(addend+6)
normalize:
dec a \ jr z,add_underflow
add hl,hl \ rl e \ rl d \ rl c \ rl b
jp p,normalize
ld (addend),hl
ld (addend+2),de
ld (addend+4),bc
ld (addend+6),a
pop bc
add_done:
;;Need to adjust sign flag
ld hl,addend+5
ld a,(hl)
rla \ rl b \ rra \ ld (hl),a
dec hl
dec hl
add_copy:
pop de
push de
ldi
ldi
ldi
ld a,(hl)
ld (de),a
pop bc
pop de
pop hl
pop af
ret
add_underflow:
;;How many push/pops are needed?
;;return ZERO
ld hl,0
ld (addend+3),hl
ld (addend+5),hl
pop bc
jr add_done
add_overflow:
;;How many push/pops are needed?
;;return INF
dec hl
ld (hl),40h
jr add_done
add_unneeded:
;;How many push/pops are needed?
;;Return bigger number
pop af
dec hl
dec hl
dec hl
jr add_copy
For an extendedprecision example, see here.
Multiplication
Multiple precision multiplication is tricky on the eZ80 which has builtin 8bit multiplication instructions, and even trickier on the Z80 which doesn't have builtin multiplication. The way many of us learned in school is easy to implement, but it slows down significantly with higher digit numbers. For example, to multiply two 2digit numbers, you would need to perform 4 singledigit multiplications, but to multiply two 10digit numbers, you would need to use 100 singledigit multiplications! In general, to multiply two ndigit numbers, you would need n*n singledigit multiplications. For higher precision multiplication, this can be quite an issue.
In the past century, several algorithms have been developed to greatly reduce the amount of work required, starting off with Karatsuba multiplication, then ToomCook, and then FFT based multiplication (including Schönhage–Strassen). I'm not familiar with Fürer's algorithm, unfortunately.
"Schoolbook" method
This is how most people learn to multiply multidigit numbers in school. It's relatively easy, albeit tedious. That said, it is very easy in binary! Think about it— your only digits are 0 and 1, and multiplying by 0 or 1 is trivial.
So let's do an example. We will multiply 1011 by 1100, but in base 10:
1011
* 1101

1011
00000
101100
1011000

= 1113111
Sweet baby carrots, that was easy. Now if it was binary, that 3 is too big for one digit, so we'd subtract 2 (base 2) and carry a 1. If it had been, say, 7, then we'd subtract 6 and carry 3. In this case, when we carry the 1, the next digit gets bumped to 2, so we have to subtract two and carry 1 again, and again, and again, resulting in 10001111. In code, we would actually just sum it up as we go. Here is some pseudocode:
; multiply E times A
set accumulator to 0
while A is not 0, then:
shift A right, carrying out the bottom bit
if 1 is carried out:
add E to the accumulator
shift E to the left, shifting in 0.
return accumulator as the result
Putting that into assembly, we have:
ld hl,0 ; set accumulator to 0
ld d,h ; (makes adding 'E' to 'accumulator' easier)
loop:
or a ; while A is not 0, then:
ret z ;
rra ; shift A right, carrying out the bottom bit
jr nc,skip; if 1 is carried out:
add hl,de ; add E to the accumulator
skip:
sla e ; shift E to the left, shifting in 0.
rl d ;
jr loop
So that is the easy way to implement it. The better way for the Z80 requires looking at multiplication slightly different from how we did it in school. Luckily it's an easy change: work lefttoright instead of righttoleft! So our above example looks like this instead:
1011
* 1101

1011000
101100
00000
1011
To implement this version, we'll need a loop counter, and some trickery. On first glance, it looks like we need to shift our 'E' all the way to the left, then shift right on each iteration. However, we can simply shift the accumulator left each iteration before adding. The looks like the following by hand:
1011
* 1101
^
1011
1011
* 1101
^
10110
1011
1011
* 1101
^
101100
10110
0000
1011
* 1101
^
1011000
101100
00000
1011
So in code:
ld hl,0 ; set accumulator to 0
ld d,h ; (makes adding 'E' to 'accumulator' easier)
ld b,8
loop:
add hl,hl
rla
jr nc,skip; if 1 is carried out:
add hl,de ; add E to the accumulator
skip:
djnz loop
ret
Now let's create various routines:
H_Times_E
This is a classic. It packs everything into two 16bit registers, making it very streamlined, and providing a religious experience (even to atheists).
H_Times_E:
;Inputs:
; H and E
;Outputs:
; HL is the product
; D is 0
; A,E,B,C are preserved
;36 bytes
;min: 190cc
;max: 242cc
;avg: 216cc
ld d,0
ld l,d
sla h \ jr nc,$+3 \ ld l,e
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ jr nc,$+3 \ add hl,de
add hl,hl \ ret nc \ add hl,de
ret
Or we can sacrifice speed for size, taking almost twice the time, but at 1/3 the size:
H_Times_E:
;Inputs:
; H and E
;Outputs:
; HL is the product
; D is 0
; B is 0
; A,E,C are preserved
;12 bytes
;min: 381cc
;max: 429cc
;avg: 405cc
ld d,0
ld l,d
ld b,8
_:
add hl,hl
jr nc,$+3
add hl,de
djnz _
ret
DE_Times_A
DE_Times_A:
;Inputs:
; A and DE
;Outputs:
; AHL is the product, with A being the upper 8 bits
; BC = 0
; DE is preserved
ld bc,$0800
ld h,c
ld l,c
_:
add hl,hl
rla ; Check mostsignificant bit of accumulator
jr nc,_skip ; If zero, skip addition
add hl,de
adc a,c
_skip:
djnz _
ret
Karatsuba Multiplication Algorithm
The Karatsuba algorithm is a genuinely cool bit of math that is straightforward and fairly young as far as algorithms go. It reduced a 2by2 multiplication to three 1by1 multiplications— better than our four 1by1 multiplies that we needed with hand multiplication. If we apply this recursively, then a 4by4 uses three 2by2 multiplies, which translates to nine 1by1 multiplies. Much better than 16 1digit multiplies! The downside is a little more overhead with addition/subtraction.
Before we figure out how it works, here is the algorithm:
#Want to multiply (ax+b) * (cx+d)
#ex., in base 10 digits, this could be (a*10+b)*(c*10+d)
z0 = b*d
z2 = a*c
z1 = (a+b)*(c+d)z0z2
result = z0+z1*x+z2*x^2
Or another statement of the algorithm (my preferred interpretation):
z0 = b*d
z2 = a*c
z1 = z0+z2(ab)*(cd)
result = z0+z1*x+z2*x^2
So as an example, we'll multiply 37 by 69.
a=3, b=7, c=6, d=9
z0 = 7*9 = 63
z2 = 3*6 = 18
z1 = 63+18(37)*(69) = 69 #This is coincidence that z1 is 69 and one of our inputs was also 69 !
result = 63+690+1800 = 2553
To verify this algorithm, note that z1 = z0+z2(ab)*(cd) = bd+ac(ab)(cd) = ac+bd(acadbc+bd) = ad+bc so then result = z0+z1*x+z2*x^2 = bd+(ad+bc)x+acx^2 which is indeed (ax+b)(cx+d).
ToomCook Multiplication Algorithms
The ToomCook algorithm extends Karatsuba's algorithm via a different approach. For an nbym multiplication, we will only need to n+m1 1digit multiplications. This is an insane improvement over Karatsuba, let alone our hand method. A 16by16 multiply now needs 31 singledigit multiplies instead of 81 in Karatsuba, or 256 in the hand method. Holy heck. The problem is that overhead for recombining the multiplications quickly grows nontrivial.
The premise starts with polynomials. If you are given two points, there is a unique linear function that passes though both (a line). Given three points, there is only one quadratic function that passes through all three points. In general, given n+1 points, there is a unique nth degree polynomial that passes through each point.
Now let's look at multiplying polynomials. Given two linear functions, f(x)=ax+b and g(x)=cx+d, we can multiply them to get a quadratic function. If we evaluate this at three points, then theoretically we can find the coefficients for the quadratic polynomial. Some easy points to evaluate are x=0,1, and 1. Or in other words, we get bd, (a+b)(c+d), (ba),(dc). The formula for recombining the points at x=0,1, and 1 is this:
z0 = f(0)g(0)
z1 = f(1)g(1)
z2 = f(1)g(1)
result = z0+(z1z2)x/2+((z1+z2)/2z0) * x^2
In general, the formula is different depending on the x values you choose to evaluate at. Some formulas are easier than others, but some x values are more difficult than others.
So with our example in the Karatsuba section, f(x)=3x+7 and g(x)=6x+9,
z0 = f(0)g(0) = 7*9 = 63
z1 = f(1)g(1) = (3+7)(6+9) = 150
z2 = f(1)g(1) = (73)(96) = 12
result = z0+(z1z2)x/2+((z1+z2)/2z0) * x^2
= 63+(15012)x/2+((150+12)/263) * x^2
= 63+69x+18x^2
This example is similar to Karatsuba's method, but a little more complicated. It turns out that Karatsuba's method is just ToomCook with points at either x=0,1,+inf, or x=0,1,+inf.
For a fixed set of points, the formula to recombine the values will always be the same, so we have two challenges. First, we need to choose a set of points that our algorithm will evaluate at, second, we need to find the recombining formula. I like to solve this with a system of equations. If we want an algorithm for 3by3 multiplication, then we'll need five points. For example, x=2, 1, 0, 1, and 2. Then we want to generate coefficients c0, c1, c2, c3, and c4. The easy way is to put this matrix into your calculator:
[[1 2 4 8 16]
[1 1 1 1 1]
[A]= [1 0 0 0 0]
[1 1 1 1 1]
[1 2 4 8 16]]
And then find the inverse:
[[ 0 0 1 0 0]
[ 1/12 2/3 0 2/3 1/12]
[A]^1= [1/24 2/3 5/4 2/3 1/24]
[1/12 1/6 0 1/6 1/12]
[ 1/24 1/6 1/4 1/6 1/24]]
And this tells us that:
z0=f(2)g(2)
z1=f(1)g(1)
z2=f(0)g(0)
z3=f(1)g(1)
z4=f(2)g(2)
c0 = [ 0 0 1 0 0]*[z0 z1 z2 z3 z4] = z2
c1 = [ 1/12 2/3 0 2/3 1/12]*[z0 z1 z2 z3 z4] = (z0z4)/12 + 2(z3z1)/3
c2 = [1/24 2/3 5/4 2/3 1/24]*[z0 z1 z2 z3 z4] =(z0+z4)/24 + 2(z3+z1)/3  5*z2/4
c3 = [1/12 1/6 0 1/6 1/12]*[z0 z1 z2 z3 z4] = (z4z0)/12 + (z1z3)/6
c4 = [ 1/24 1/6 1/4 1/6 1/24]*[z0 z1 z2 z3 z4] = (z0+z4)/24  (z1+z3)/6 z2/4
This reduces a 3by3 multiplication to 5 nontrivial multiplications, 2 divisionsby3, 10 add/sub, and a handful of shifts. Those divisionsby three are a bit slow, but much faster than a multiplication if you use the right algorithms.
Multiplication, base case
In all cases, we will need a "base case" multiplication routine. This is the smallest multiplication performed, and all larger multiplication are broken up into some combination of these multiplications. On the eZ80, you can use the mlt instruction as your base case. For example, the mlt hl instruction multiplies H by L and stores the 16bit result in HL. On the regular Z80, we need to create our own custom routine. In my experience, a well made 16by16 multiplication routine is the best base case. The following is made by Runer112, analyzed by Zeda, and tested by jacobly:
;This was made by Runer112
;Tested by jacobly
mul16:
;BC*DE > DEHL
; ~544.887cc as calculated in jacobly's test
;min: 214cc (DE = 1)
;max: 667cc
;avg: 544.4507883cc however, deferring to jacobly's result as mine may have math issues ?
ld a,d
ld d,0
ld h,b
ld l,c
add a,a
jr c,Mul_BC_DE_DEHL_Bit14
add a,a
jr c,Mul_BC_DE_DEHL_Bit13
add a,a
jr c,Mul_BC_DE_DEHL_Bit12
add a,a
jr c,Mul_BC_DE_DEHL_Bit11
add a,a
jr c,Mul_BC_DE_DEHL_Bit10
add a,a
jr c,Mul_BC_DE_DEHL_Bit9
add a,a
jr c,Mul_BC_DE_DEHL_Bit8
add a,a
jr c,Mul_BC_DE_DEHL_Bit7
ld a,e
and %11111110
add a,a
jr c,Mul_BC_DE_DEHL_Bit6
add a,a
jr c,Mul_BC_DE_DEHL_Bit5
add a,a
jr c,Mul_BC_DE_DEHL_Bit4
add a,a
jr c,Mul_BC_DE_DEHL_Bit3
add a,a
jr c,Mul_BC_DE_DEHL_Bit2
add a,a
jr c,Mul_BC_DE_DEHL_Bit1
add a,a
jr c,Mul_BC_DE_DEHL_Bit0
rr e
ret c
ld h,d
ld l,e
ret
Mul_BC_DE_DEHL_Bit14:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit13
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit13:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit12
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit12:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit11
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit11:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit10
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit10:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit9
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit9:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit8
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit8:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit7
add hl,bc
adc a,d
Mul_BC_DE_DEHL_Bit7:
ld d,a
ld a,e
and %11111110
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit6
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit6:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit5
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit5:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit4
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit4:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit3
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit3:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit2
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit2:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit1
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit1:
add hl,hl
adc a,a
jr nc,Mul_BC_DE_DEHL_Bit0
add hl,bc
adc a,0
Mul_BC_DE_DEHL_Bit0:
add hl,hl
adc a,a
jr c,Mul_BC_DE_DEHL_FunkyCarry
rr e
ld e,a
ret nc
add hl,bc
ret nc
inc e
ret nz
inc d
ret
Mul_BC_DE_DEHL_FunkyCarry:
inc d
rr e
ld e,a
ret nc
add hl,bc
ret nc
inc e
ret
Although this is unlikely to change, the most uptodate version can be found here.
mul24, while less used on the Z80, can still be a basecase multiplication, though it does use an 8by24 multiplication subroutine. Here is 8by24:
C_times_BDE:
;;C*BDE => CAHL
;C = 0 157
;C = 1 141
;141+
;C>=128 135+6{0,33+{0,1}}+{0,20+{0,8}}
;C>=64 115+5{0,33+{0,1}}+{0,20+{0,8}}
;C>=32 95+4{0,33+{0,1}}+{0,20+{0,8}}
;C>=16 75+3{0,33+{0,1}}+{0,20+{0,8}}
;C>=8 55+2{0,33+{0,1}}+{0,20+{0,8}}
;C>=4 35+{0,33+{0,1}}+{0,20+{0,8}}
;C>=2 15+{0,20+{0,8}}
;min: 141cc
;max: 508cc
;avg: 349.21279907227cc
ld a,b
ld h,d
ld l,e
sla c \ jr c,mul8_24_1
sla c \ jr c,mul8_24_2
sla c \ jr c,mul8_24_3
sla c \ jr c,mul8_24_4
sla c \ jr c,mul8_24_5
sla c \ jr c,mul8_24_6
sla c \ jr c,mul8_24_7
sla c \ ret c
ld a,c
ld h,c
ld l,c
ret
mul8_24_1:
add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_2:
add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_3:
add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_4:
add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_5:
add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_6:
add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_7:
add hl,hl \ rla \ rl c \ ret nc \ add hl,de \ adc a,b \ ret nc \ inc c \ ret
And mul24 needs var48 defined.
mul24:
;;BDE*CHL > HLBCDE
;;155 bytes
;;402+3*C_Times_BDE
;;fastest:1201cc
;;slowest:1753cc
;;avg :1464.9033203125cc (1464+925/1024)
;min: 825cc
;max: 1926cc
;avg: 1449.63839751681cc
push bc
ld c,l
push hl
call C_Times_BDE
ld (var48),hl
ld l,a
ld h,c
ld (var48+2),hl
pop hl
ld c,h
call C_Times_BDE
push bc
ld bc,(var48+1)
add hl,bc
ld (var48+1),hl
pop bc
ld b,c
ld c,a
ld hl,(var48+3)
ld h,0
adc hl,bc
ld (var48+3),hl
pop bc
call C_Times_BDE
ld de,(var48+2)
add hl,de
ld (var48+2),hl
ld d,c
ld e,a
ld b,h
ld c,l
ld hl,(var48+4)
ld h,0
adc hl,de
ld de,(var48)
ret
Integer Multiplication, multiprecision
This 32by32 multiply requires 8 bytes of RAM and uses the above mul16 routine as a base case for Karatsuba Multiplication:
; You will need to define z32_0, which is where the result is written!
mul32:
;max: 703cc + 3*mul16
; 2704cc
;min: 655cc + 3*mul16
; 1297cc
;avg: 673.25cc+3*mul16
; 2307.911cc
;DEHL * BCIX ==> z32_0
z32_2 = z32_0+4
push de
push bc
push hl
push ix
call mul16 ;DEHL
ld (z32_2),hl
ld (z32_2+2),de
pop de
pop bc
; push bc
push de
call mul16 ;DEHL
ld (z32_0),hl
ld (z32_0+2),de
pop de ;low word
; pop bc ;low word
pop hl
xor a
sbc hl,de
jr nc,+_
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
inc a
_:
ex de,hl
pop hl
sbc hl,bc
jr nc,+_
ld b,a
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ld a,b
inc a
_:
ld b,h
ld c,l
push af
call mul16
pop af ;holds the sign in the low bit
rra
jr c,mul32_add
;need to perform z0+z2result
push de
push hl
xor a
ld hl,(z32_0)
ld bc,(z32_2)
add hl,bc
ex de,hl
ld hl,(z32_0+2)
ld bc,(z32_2+2)
adc hl,bc
rla
;now need to subtract
ex de,hl
pop bc
sbc hl,bc
ex de,hl
pop bc
sbc hl,bc
sbc a,0
;A:HL:DE is the result, need to add to z32_0+2
mul32_final:
ld bc,(z32_0+2)
ex de,hl
add hl,bc
ld (z32_0+2),hl
ld hl,(z32_2)
adc hl,de
ld (z32_2),hl
ld hl,z32_2+2
adc a,(hl)
ld (hl),a
ret nc
inc hl
inc (hl)
ret
mul32_add:
;add to the current result
xor a
ld bc,(z32_0)
add hl,bc
ex de,hl
ld bc,(z32_0+2)
adc hl,bc
rla
ex de,hl
ld bc,(z32_2)
add hl,bc
ex de,hl
ld bc,(z32_2+2)
adc hl,bc
adc a,0
jp mul32_final
And for 64bit, you'll need the above mul32, as well as a mov8 routine like this:
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ret
mul64:
;multiplies the 64bit integers at inp64_1 and inp64_2
;stores the 128bit (16byte) result at out128
;
;min: 1740+3*min(mul32)
; 5631cc
;max: 1901+3*max(mul32)
; 10013cc
;avg: 1797+3*avg(mul32) + 9572881/2^24
; ~8720.733cc
;uses 24 bytes at out128
z64_0 = out128
z64_2 = z64_0+8
z32_0 = z64_2+8
ld de,(inp64_1+6)
ld hl,(inp64_1+4)
ld bc,(inp64_2+6)
ld ix,(inp64_2+4)
call mul32
;copy the 8 bytes at z32_0 to z64_2
ld hl,z32_0
ld de,z64_2
call mov8
ld de,(inp64_1+2)
ld hl,(inp64_1)
ld bc,(inp64_2+2)
ld ix,(inp64_2)
call mul32
;copy the 8 bytes at z32_0 to z64_0
ld hl,z32_0
ld de,z64_0
call mov8
;now I need to subtract the 32bit digits from each other
xor a
ld hl,(inp64_1)
ld bc,(inp64_1+4)
sbc hl,bc
ex de,hl
ld hl,(inp64_1+2)
ld bc,(inp64_1+6)
sbc hl,bc
jr nc,+_
ld b,a \ sub e \ ld e,a
ld a,b \ sbc a,d \ ld d,a
ld a,b \ sbc a,l \ ld l,a
ld a,b \ sbc a,h \ ld h,a
ld a,b
_:
rla
push hl ;top byte
push de
ld hl,(inp64_2)
ld bc,(inp64_2+4)
sbc hl,bc
ex de,hl
ld hl,(inp64_2+2)
ld bc,(inp64_2+6)
sbc hl,bc
jr nc,+_
ld c,a
xor a
ld b,a
sub e \ ld e,a
ld a,b \ sbc a,d \ ld d,a
ld a,b \ sbc a,l \ ld l,a
ld a,b \ sbc a,h \ ld h,a
ld a,c
inc a
_:
ex de,hl
pop ix
pop bc
push af
call mul32
pop af ;holds the sign in the low bit
rra
jp c,mul64_add
;need to perform z0+z2result
xor a
ld hl,(z64_0)
ld de,(z64_2)
add hl,de
ld (inp64_1),hl
ld hl,(z64_0+2)
ld de,(z64_2+2)
adc hl,de
ld (inp64_1+2),hl
ld hl,(z64_0+4)
ld de,(z64_2+4)
adc hl,de
ld (inp64_1+4),hl
ld hl,(z64_0+6)
ld de,(z64_2+6)
adc hl,de
ld (inp64_1+6),hl
rla
;now need to subtract
ld hl,(inp64_1)
ld de,(z32_0)
sbc hl,de
ld (inp64_1),hl
ld hl,(inp64_1+2)
ld de,(z32_0+2)
sbc hl,de
ld (inp64_1+2),hl
ld hl,(inp64_1+4)
ld de,(z32_0+4)
sbc hl,de
ld (inp64_1+4),hl
ld hl,(inp64_1+6)
ld de,(z32_0+6)
sbc hl,de
ld (inp64_1+6),hl
sbc a,0
mul64_final:
;now need to add it back in
ld hl,(z64_0+4)
ld de,(inp64_1)
add hl,de
ld (z64_0+4),hl
ld hl,(z64_0+6)
ld de,(inp64_1+2)
adc hl,de
ld (z64_0+6),hl
ld hl,(z64_0+8)
ld de,(inp64_1+4)
adc hl,de
ld (z64_0+8),hl
ld hl,(z64_0+10)
ld de,(inp64_1+6)
adc hl,de
ld (z64_0+10),hl
ld hl,z64_0+12
adc a,(hl)
ld (hl),a
ret nc
inc hl \ inc (hl) \ ret nz
inc hl \ inc (hl) \ ret nz
inc hl \ inc (hl) \ ret
mul64_add:
;add to the current result
;z0+z2+result
xor a
ld hl,(z64_0)
ld de,(z64_2)
add hl,de
ld (inp64_1),hl
ld hl,(z64_0+2)
ld de,(z64_2+2)
adc hl,de
ld (inp64_1+2),hl
ld hl,(z64_0+4)
ld de,(z64_2+4)
adc hl,de
ld (inp64_1+4),hl
ld hl,(z64_0+6)
ld de,(z64_2+6)
adc hl,de
ld (inp64_1+6),hl
rla
;now need to subtract
ld hl,(inp64_1)
ld de,(z32_0)
add hl,de
ld (inp64_1),hl
ld hl,(inp64_1+2)
ld de,(z32_0+2)
adc hl,de
ld (inp64_1+2),hl
ld hl,(inp64_1+4)
ld de,(z32_0+4)
adc hl,de
ld (inp64_1+4),hl
ld hl,(inp64_1+6)
ld de,(z32_0+6)
adc hl,de
ld (inp64_1+6),hl
adc a,0
jp mul64_final
BCD Multiplication, multiprecision
Fixed Point Multiplication
If we multiply an A.B fixedpoint number by a C.D fixedpoint number, we get an (A+C).(B+D) fixedpoint number. So for example, if we use 16 bits to encode a 4.12 fixedpoint number, and multiply it by another 4.12 fixed point, the result will be 8.24. To convert this to a 4.12, we'll need to discard the top 4 bits of the result, and the bottom 12 bits. Depending on the application, you might also want to check for overflow, in which case you'd need to check the discarded 4 bits at the top.
Fixed point multiplication uses an integer multiplication at its core. For example, an 8.8 fixedpoint routine can use a 16by16 multiply, with minor adjustments. The result should be interpreted as a 16.16 fixedpoint number, so you'll need to grab the 8 bits above and below the decimal in order to keep the output as 8.8 fixed. As well, you'll need special code for handling negative numbers. Putting it together, using the mul16 from the integer multiplication section:
mulfixed8_8:
;Multiplies H.L by D.E, stores the result in H.L
; First, find out if the output is positive or negative
ld a,h
xor d
push af ;sign bit is the result sign bit
; Now make sure the inputs are positive
xor d ;A now has the value of H, since I XORed it with D twice (cancelling)
jp p,mulfixed8_8_lbl1 ;if Positive, don't negate
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
mulfixed8_8_lbl1:
bit 7,d
jr z,mulfixed8_8_lbl2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
mulfixed8_8_lbl2:
; Now we need to put DE in BC to use mul16
ld b,d
ld c,e
call mul16
;Get the middle two bytes, EH, and put them in HL
ld l,h
ld h,e
;We should check for overflow. If D>0, we will set HL to 0x7FFF
ld a,d
or a
jr z,mulfixed8_8_lbl3
ld hl,$7FFF
mulfixed8_8_lbl3:
; Now we need to restore the sign
pop af
ret p ;don't need to do anything, result is already positive
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
Or slightly more complicated 4.12 multiplication, which also uses the mul16 as its core, and only the resizing step changes:
mulfixed4_12:
;Multiplies H.L by D.E, stores the result in H.L
; First, find out if the output is positive or negative
ld a,h
xor d
push af ;sign bit is the result sign bit
; Now make sure the inputs are positive
xor d ;A now has the value of H, since I XORed it with D twice (cancelling)
jp p,mulfixed4_12_lbl1 ;if Positive, don't negate
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
mulfixed4_12_lbl1:
bit 7,d
jr z,mulfixed4_12_lbl2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
mulfixed4_12_lbl2:
; Now we need to put DE in BC to use mul16
ld b,d
ld c,e
call mul16
;The result doesn't need the top 4 bits or bottom 12 bits.
;We'll hold onto the top 4 bits to check overflow, though.
;Currently we need to shift DEH left by 4 bits and keep DE, or right by 12 bits and keep HL.
ld a,h ;we'll actually be moving the discared bits into A
and $F0
rla \ rl e \ rl d
rla \ rl e \ rl d
rla \ rl e \ rl d
rla \ rl e \ rl d
adc a,a
ex de,hl
;if A is nonzero, we have overflow
jr z,mulfixed4_12_lbl3
ld hl,$7FFF
mulfixed4_12_lbl3:
; Now we need to restore the sign
pop af
ret p ;don't need to do anything, result is already positive
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
For 16.16 fixed point, we'll use the mul32 from the integer multiplication section.
mulfixed16_16:
;Multiplies DE.HL by BC.IX, stores the result in DE.HL
; First, find out if the output is positive or negative
ld a,d
xor b
push af ;sign bit is the result sign bit
; Now make sure the inputs are positive
xor b ;A now has the value of D, since I XORed it with B twice (cancelling)
jp p,mulfixed16_16_lbl1 ;if Positive, don't negate
xor a
sub l
ld l,a
ld a,0
sbc a,h
ld h,a
ld a,0
sbc a,e
ld e,a
sbc a,a
sub d
ld d,a
mulfixed16_16_lbl1:
bit 7,b
jr z,mulfixed16_16_lbl2
xor a
sub ixl
ld ixl,a
ld a,0
sbc a,ixh
ld ixh,a
ld a,0
sbc a,c
ld c,a
sbc a,a
sub b
ld b,a
mulfixed16_16_lbl2:
; Now we multiply
call mul32
;We should check for overflow. If the upper two bytes are nonzero, we will set the result to 0x7FFFFFFF
ld hl,(z32_0+6)
ld a,h
or l
;Get the middle four bytes and put them in DEHL
ld hl,(z32_0+2)
ld de,(z32_0+4)
;Maybe we need to set the result to 0x7FFFFFFF
jr z,mulfixed16_16_lbl3
ld de,$7FFF
ld h,e
ld l,e
mulfixed16_16_lbl3:
; Now we need to restore the sign
pop af
ret p ;don't need to do anything, result is already positive
xor a
ld b,a
sub l
ld l,a
ld a,b
sbc a,h
ld h,a
ld a,b
sbc a,e
ld e,a
sbc a,a
sub d
ld d,a
ret
Floating Point Multiplication
Floating point multiplication is actually pretty easy compared to FP addition and subtraction.
We'll start with an example: Suppose we have a floating point format that stores a 6digit mantissa, and digits are base10. Let's multiply 3.14159*10^{1} * 6.54321*10^{3}. To do so, we can add the powers of their exponents and this is our new exponent, and directly multiply the mantissas for our new mantissa.
So our result is (3.14149 * 6.54321)*10^{(1+3) = 20.5560831039*10}2^{. Our mantissa is >=10, so we'll need to increment our exponent and divide our mantissa by 10 (shift right by 1 digit): 2.05560831039*10}3^{. Now we need to keep only six digits of the mantissa, and we'll round: 2.05561*10}3^^. Tada :)
So the steps we do:
 XOR the signs of the inputs. This is the sign bit of our result.
 Add the exponents together to form our new exponent. Be wary of overflow or underflow!
 Multiply their mantissas as 1.n fixed point numbers. If the top digit of the result is nonzero, we'll need to increment the exponent (again, check overflow), and then shift the digits right.
 In practice, I actually shift left by one digit if the top digit is 0, because then it is in the final form we need for our mantissa.
 Truncate the mantissa to the correct size, rounding if desired.
There are Single Precision and Extended Precision examples on GitHub. These also check for "special values" that represent 0, +infinity, infinity, and Not a Number (NaN).
In case something breaks, here are the current versions of those routines as of this writing:
var48 = scrap+4
mulSingle:
;Inputs: HL points to float1, DE points to float2, BC points to where the result is copied
;Outputs: float1*float2 is stored to (BC)
;573+mul24+{0,35}+{0,30}
;min: 1398cc
;max: 2564cc
;avg: 2055.13839751681cc
push af
push hl
push de
push bc
call +_ ;CHLB
ld a,c
ex de,hl
pop hl
push hl
ld (hl),b \ inc hl
ld (hl),e \ inc hl
ld (hl),d \ inc hl
ld (hl),a
pop bc
pop de
pop hl
pop af
ret
_:
;;return float in CHLB
push de
ld e,(hl)
inc hl
ld d,(hl)
inc hl
ld c,(hl)
inc hl
ld a,(hl)
or a
jr z,mulSingle_case0
ex de,hl
ex (sp),hl
ld e,(hl)
inc hl
ld d,(hl)
inc hl
ld b,(hl)
inc hl
inc (hl)
dec (hl)
jr z,mulSingle_case1
add a,(hl) ;\
pop hl ; 
rra ; Lots of help from Runer112 and
adc a,a ; calc84maniac for optimizing
jp po,bad ; this exponent check.
xor 80h ; 
jr z,underflow ;/
push af ;exponent
ld a,b
xor c
push af ;sign
set 7,b
set 7,c
call mul24 ;BDE*CHL>HLBCDE, returns sign info
pop de
ld a,e
pop de
jp m,+_
rl c
rl b
adc hl,hl
dec d
_:
inc d
jr z,overflow
rl c
ld c,d
ld de,0
push af
ld a,b
adc a,e
ld b,a
adc hl,de
jr nc,+_
inc c \ jr z,overflow
rr h
rr l
rr b
_:
pop af
cpl
and $80
xor h
ld h,a
ret
bad:
jr nc,overflow
underflow:
ld hl,0
rl b
rr h
ld c,l
ld b,l
ret
overflow:
ld hl,$8000
jr underflow+3
mulSingle_case1:
;x*0 > 0
;x*inf > inf
;x*NaN > NaN
pop hl
ld h,b
ld l,d
ld b,e
ld c,0
ret
mulSingle_case0:
;special*x = special
;NaN*x = NaN
;0*0 = 0
;0*NaN = NaN
;0*Inf = NaN
;Inf*Inf = Inf
;Inf*Inf =Inf
;0CDE
pop hl
inc hl
inc hl
inc hl
ld a,(hl)
or a
jr z,+_
ld h,c
ld c,0
ret
_:
dec hl
ld b,(hl)
;basically, if bc has bit 5 set, return NaN
ld a,b
or c
ld h,$20
and h
jr z,+_
ld c,0
ret
_:
ld a,c
xor b
rl b
rlca
rr b
res 4,b
rl c
rrca
rr c
ld a,c
and $E0
add a,b
rra
ld h,a
ld c,0
ret
xmul (extended precision, 80bit floats):
#include "routines/pushpop.z80"
#include "routines/mov.z80"
#include "mul/mul64.z80"
#include "routines/rl64.z80"
#define var_z xOP3+16
;uses 60 bytes after xOP1
xmul:
;Input:
; HL points to one number
; DE points to another
;Timing, excluding special cases (which take ~ 800cc):
;1057+{0,3}+{0,172}+mul64
;max: 1232+max(mul64)
; 11245cc
;min: 1057+min(mul64)
; 6688cc
;avg: 1144.5+avg(mul64)
; 9865.233ccs
push hl
push de
push bc
push af
push ix
push bc
call +_
pop hl
push de
ex de,hl
ld hl,var_z+8
call mov8
ex de,hl
pop de
ld (hl),e
inc hl
ld (hl),d
pop ix
pop af
pop bc
pop de
pop hl
ret
_:
push de
ld de,xOP1
call mov10
pop hl
call mov10
ld de,(xOP2+8)
ld hl,(xOP1+8)
xmul_stepin_geomean:
ld a,h
xor d
ld b,a
res 7,d
res 7,h
ld a,h \ or l \ jp z,casemul
ld a,d \ or e \ jp z,casemul2
add hl,de
ld de,$4000
sbc hl,de
jp c,mul_zero
jp m,mul_inf
sla b
jr nc,+_
set 7,h
_:
push hl
call mul64
ld a,(var_z+15)
add a,a
pop de
jr c,+_
#ifdef inc_FMA
ld hl,var_z
call rl64
#else
ld hl,var_z+7
sla (hl)
#endif
inc hl
jp rl64
_:
inc de
ret
casemul:
;xOP1 is inf/nan/0
ld hl,xOP2+9
ld a,(hl)
dec hl
or (hl)
dec hl
ld a,(hl)
ld hl,xOP2
jr nz,casemul2_copy
;now we have two special cases to multipy together
;inf*inf> inf
;0*0 > 0
;
;nan*nan> NaN
;inf*nan> NaN
;inf*0 > NaN
;nan*inf> NaN
;nan*0 > NaN
;0*inf > NaN
;0*nan > NaN
sla b
ld de,0
rr d
and $C0
ld c,a
ld a,(xOP1+7)
and $C0
cp c
jr z,+_
ld a,$40
_:
ld (var_z+15),a
ret
casemul2:
;finite times inf/nan/0, so xOP1 > out
ld hl,xOP1
casemul2_copy:
ld de,var_z+8
call mov8
ld e,(hl)
inc hl
ld d,(hl)
ret
mul_zero:
xor a
ld (var_z+15),a
ld d,a
ret
mul_inf:
ld d,e
ld a,255
ld (var_z+15),a
ret
Division
Division is usually considered to be a slow operation, but in my experience, at high precision, it is only a little slower than multiplication if done right.
"Schoolbook" Division
Division can be done just like base10 division most of us learned in school, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (0) or bigger (1) ?"
Here is the pseudocode:
N is the numerator
D is the denominator
R is the remainder
Q is the quotient
While N>0:
shift Q left one bit to make room for the next digit
leftshift the top bit of N into R, removing the bit from N.
If R>=D
R  D > R
Q + 1 > Q
Return Q
In Z80 assembly, this might look like
Div_HL_D:
;Inputs:
; HL and D
;Outputs:
; HL is the quotient (HL/D)
; A is the remainder
; B is 0
; C,D,E are preserved
xor a ; Clear upper eight bits of AHL
ld b,16 ; Sixteen bits in dividend
_loop:
add hl,hl ; Do a "SLA HL". If the upper bit was 1, the c flag is set
rla ; This moves the upper bits of the dividend into A
jr c,_overflow; If D is allowed to be >128, then it is possible for A to overflow here. (Yes future Zeda, 128 is "safe.")
cp d ; Check if we can subtract the divisor
jr c,_skip ; Carry means A < D
_overflow:
sub d ; Do subtraction for real this time
inc l ; Set the next bit of the quotient (currently bit 0)
_skip:
djnz _loop
ret
Notice here that we are using HL as both our 'N' value and our 'Q' value from the pseudocode.
'A' acts as our remainder, and D acts as our dividend.
Special Case Division
Division by a power of 2
This is easy: just rightshift! If you want to divide by 2, right shift once. Divide by 4? Right shift twice. Divide by 8? Right shift three times…
Division by Constants
Here is an example of how to divide by a constant. Suppose I want to divide an 8bit integer, x, by 7, and return the result as an 8bit integer. Well, 1/7 is approximately 73/512, so I can get a pretty close result with x*73/512. Division by 512 will be easy: Just take the top 8 bits from the multiplication, and shift right. For example:
;A/7
ld l,a
ld h,0
ld b,h
ld c,l
add hl,hl ;A*2
add hl,hl ;A*4
add hl,hl ;A*8
add hl,bc ;A*9
add hl,hl ;A*18
add hl,hl ;A*36
add hl,hl ;A*72
add hl,bc ;A*73
ld a,h
rra ;carry flag is always reset in the case, so no need to worry
adc a,b ;just to round. B is 0
ret
This isn't perfect, but it is pretty good, and fast.
Turning Division Into Multiplication
You can turn division into multiplication as follows:
Suppose you can write your division as x/(1y) where y<1. Then you can rewrite your division as:
x*(1+y)*(1+y^{2})*(1+y^{4})*(1+y^{8})*(1+y^{16})*…
So some pseudocode might be:
while y>10^15 ;basically, until y is very small
x + x*y > x
y * y > y
Division by 2^n  1
If we want to perform x/(2^{n}1), then we can rewrite this as 2^{n} * x/(12^{n}). Using the technique above, this gives us the following pseudocode:
x + (x>>n) > x
x + (x>>(2n)) > x
x + (x>>(4n)) > x
x + (x>>(8n)) > x
x + (x>>(16n))> x
...
return x>>n
For examples:
Division by 3
In this case, 3=2^{2}1, so we can use the code:
x + (x>>2) > x
x + (x>>4) > x
x + (x>>8) > x
x + (x>>16)> x
...
return x>>2
For example, here is a way to implement HL/3 with some rounding errors:
;HL/2 ;we'll do this to prevent overflow, and divide by 2 at the end instead of 4
srl h \ rr l
;HL + (HL>>2) > HL
ld b,h
ld c,l
srl b \ rr c
srl b \ rr c
add hl,bc
;HL + (HL>>4) > HL
ld b,h
ld c,l
srl b \ rr c
srl b \ rr c
srl b \ rr c
srl b \ rr c
add hl,bc
;HL + (HL>>8) > HL
ld b,0
ld c,h
add hl,bc
HL/2
srl h \ rr l
Since left shifts can often be done cheaper, let's try this with leftshifts, and then do a bunch of rightshifts at the end:
;We want a difference of a factor of 2 shifts
ld b,h
ld c,l
xor a
ld e,a
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,e
;We want a difference of a factor of 4 shifts
ld b,h
ld c,l
ld e,a
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,e
;We want a difference of a factor of 8 shifts
ld d,a
ld e,h
ld b,l
ld c,0
add hl,bc
add a,e
ld e,a
jr nc,$+3
inc d
;Now we need to shift right by 2+4+8 plus a final 2, so shift right by 16
;This means we just keep DE as our result!
But knowing that we are going to shift right by 16 at the end, we can optimize the final iteration:
;We want a difference of a factor of 2 shifts
ld b,h
ld c,l
xor a
ld e,a
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,e
;We want a difference of a factor of 4 shifts
ld b,h
ld c,l
ld e,a
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,e
ld b,a
ld c,h
add hl,bc
adc a,0
ld l,h
ld h,a
;now HL is our result
Finally, it turns out that the accuracy issues occur at products of 3, and we can fix that simply by incrementing HL:
HL_Div_3:
;Input: HL
;Output: HL is the input divided by 3
;Destroys: B,C,E,A
;217cc
;increment HL, putting overflow in A
ld bc,1
ld a,b
add hl,bc
adc a,b
;We want a difference of a factor of 2 shifts
ld b,h
ld c,l
ld e,a
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,e
;We want a difference of a factor of 4 shifts
ld b,h
ld c,l
ld e,a
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,e
ld b,a
ld c,h
add hl,bc
adc a,0
ld l,h
ld h,a
;now HL is our result
ret
Division by 2^n + 1
If we want to perform x/(2^{n}+1), then we can rewrite this as 2^{n} * x/(1+2^{n}). Using the technique for turning division into multiplication, this gives us the following pseudocode:
x  (x>>n) > x
x + (x>>(2n)) > x
x + (x>>(4n)) > x
x + (x>>(8n)) > x
x + (x>>(16n))> x
...
return x>>n
Division, BC/DE
In this section, we'll explore the integer division routine for dividing BC by DE and various ways to optimize it. We'll start off with our baseline routine, and note that our accumulator will never overflow (we'll use a 16bit accumulator, and up to 16 bits will be shifted in), so we don't need to take care of that as we did in the HL/D routine.
BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;1382cc
;23 bytes
ld hl,0 ;Accumulator
ld a,16 ;Loop counter
div_loop:
;shift the bits from BC (numerator) into HL (accumulator)
sla c \ rl b
adc hl,hl
;Check if remainder >= denominator (HL>=DE)
sbc hl,de
jr c,div_loop_readjust
inc c
jr div_loop_done
div_loop_readjust:
; remainder is not >= denominator, so we have to add DE back to HL
add hl,de
div_loop_done:
dec a
jr nz,div_loop
ret
Now that we have established our baseline routine, let's add our first minor optimization. In particular, let's fix up this part:
jr c,div_loop_readjust
inc c
jr div_loop_done
div_loop_readjust:
; remainder is not >= denominator, so we have to add DE bBCk to HL
add hl,de
The instruction ‘cp xx`, where `xx` is a constant, is compiled to `FExx`. As long as we don’t need to preserve flags, we can skip a onebyte instruction like this:
jr +_
inc c
_:
Becomes ====>
.db $FE
inc c
This saves one byte and 4cc. So reorganizing the section of code above:
jr c,div_loop_readjust
inc c
jr div_loop_done
div_loop_readjust:
; remainder is not >= denominator, so we have to add DE bBCk to HL
add hl,de
Becomes ====>
jr nc,divinc_acc ;change the condition from `c` to `nc`
add hl,de
.db $FE ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
inc c
Applying this, we now have:
BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;min: 1270cc
;max: 1414cc
;avg: 1342cc
;22 bytes
ld hl,0
ld a,16
div_loop:
sla c \ rl b
adc hl,hl
sbc hl,de
jr nc,divinc_acc
add hl,de
.db $FE ;this begins the instruction `cp *`, so it eats the `inc c`
div_inc_acc:
inc c
div_loop_done:
dec a
jr nz,div_loop
ret
This change saves one byte and on average 40cc, but saving up to to 112cc or costing up to 32cc depending on inputs.
There is another cheap optimization— using B instead of A as the loop counter.
It adds 2 bytes to overhead, but saves 2 bytes in the loop for no net change in size. And while it adds 8cc to overhead, it saves 112cc in the main loop, for a net savings of 104cc.
BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;min: 1166cc
;max: 1310cc
;avg: 1238cc
;22 bytes
ld hl,0
ld a,b
ld b,16
div_loop:
;shift the bits from BC into HL
sla c \ rla
adc hl,hl
sbc hl,de
jr nc,div_inc_acc
add hl,de
.db $FE ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
inc c
div_loop_done:
djnz div_loop
ld b,a
ret
Now, even the worst case is better than the original, and we've saved a byte!
But this next trick adds 5 bytes onto the previous routine. If we negate DE, then we can swap whether we are doing sbc or add when comparing HL to DE:
BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 1124cc
;max: 1332cc
;avg: 1228cc
;27 bytes
xor a
ld h,a ;take advantage of A=0 to set hl to 0 saving 2cc and 1 byte.
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld a,b
ld b,16
div_loop:
;shift the bits from BC into HL
sla c \ rla
adc hl,hl
add hl,de
jr c,divinc_acc
sbc hl,de
.db $FE ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
inc c
div_loop_done:
djnz div_loop
ld b,a
ret
Overhead: +5 bytes, +22cc
Loop: 32cc on average, up to +0cc, as low as 64cc
Net: +5 bytes, 10cc on average, as much as +22cc, as low as 42cc
But now we have that the comparison leaves carry set when C is incremented, and reset when it is not. If we use `rl c` instead of `sla c`, we can just shift in the next bit:
BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 1072cc
;max: 1232cc
;avg: 1152cc
;28 bytes
xor a
ld h,a
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld a,b
ld b,16
div_loop:
;shift the bits from BC into HL
rl c \ rla
adc hl,hl
add hl,de
jr c,div_loop_done
sbc hl,de
div_loop_done:
djnz div_loop
rl c \ rla
ld b,a
ret
This adds 3 bytes and 12cc to overhead, but saves 2 byte in the main loop and 88cc on average, for a net of +1 byte, 76cc.
And now, if we really want speed, we can unroll it, and take advantage of only needing to shift one byte at a time into HL:
BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 738cc
;max: 898cc
;avg: 818cc
;144 bytes
xor a
ld h,a
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld a,b
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla
ld b,a
ld a,c
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla
ld c,a
ret
It costs +115 bytes (making the routine just over 5 times larger), but saves 334cc in all cases (29% improvement on average).
If you really enjoy the speed improvements, but don't want to take up that much space, there is a cheap trade off by taking advantage of subroutines:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 773cc
;max: 933cc
;avg: 853cc
;82 bytes
xor a
ld h,a
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ld a,b
ld b,c
call BC_Div_DE_sub
ld a,b
ld b,c
BC_Div_DE_sub:
;min: 354cc
;max: 434cc
;avg: 394cc
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
rla
ld c,a
ret
Basically, we traded 35cc for a savings of 61 bytes.
Integer Division, multiprecision
BCD Division, multiprecision
Fixed Point Division
Suppose that we want to divide H.L by D.E. This is basically (HL/256)/(DE/256) = HL/DE. However, we can't directly use an integer routine, as we'd need 8 more bits of precision. You would have to apply this similarly to other formats.
8.8/8.8 Fixed Point Division
So we'll start off like an integer division, but we'll go an extra 8 iterations, shifting in zeros.
BC_Div_DE_88:
;Inputs:
; DE,BC are 8.8 Fixed Point numbers
;Outputs:
; DE is the 8.8 Fixed Point result (rounded to the least significant bit)
;if DE is 0 : 122cc or 136cc if BC is negative
;if BC>=128*DE : 152cc or 166cc if BC is negative
;Otherwise:
;min: 1107cc
;max: 1319cc
;avg: 1201cc
; First, find out if the output is positive or negative
ld a,b
xor d
push af ;sign bit is the result sign bit
; Now make sure the inputs are positive
xor b ;A now has the value of B, since I XORed it with D twice (cancelling)
jp p,BC_Div_DE_88_lbl1 ;if Positive, don't negate
xor a
sub c
ld c,a
sbc a,a
sub b
ld b,a
BC_Div_DE_88_lbl1:
;now make DE negative to optimize the remainder comparison
ld a,d
or d
jp m,BC_Div_DE_88_lbl2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
BC_Div_DE_88_lbl2:
;if DE is 0, we can call it an overflow
;A is the current value of D
or e
jr z,div_fixed88_overflow
;The accumulator gets set to B if no overflow.
;We can use H=0 to save a few cc in the meantime
ld h,0
;if B+DE>=0, then we'll have overflow
ld a,b
add a,e
ld a,d
adc a,h
jr c,div_fixed88_overflow
;Now we can load the accumulator/remainder with B
;H is already 0
ld l,b
ld a,c
call div_fixed88_sub
ld c,a
ld a,b ;A is now 0
call div_fixed88_sub
ld d,c
ld e,a
pop af
ret p
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret
div_fixed88_overflow:
ld de,$7FFF
pop af
ret p
inc de
inc e
ret
div_fixed88_sub:
;min: 456cc
;max: 536cc
;avg: 496cc
ld b,8
BC_Div_DE_88_lbl3:
rla
adc hl,hl
add hl,de
jr c,$+4
sbc hl,de
djnz BC_Div_DE_88_lbl3
adc a,a
ret
Floating Point Division
See divSingle and xdiv from the z80float project.
Square Root
Digit Extraction
Newton's Method (with analysis)
Integer Square Root multiprecision
BCD Square Root multiprecision
Fixed Point Square Root
Floating Point Square Root
See sqrtSingle and xsqrt from the z80float project.
BorchardtGauss Algorithm
Carlson's Improvement
CORDIC Algorithms
BKM Algoritms
Exponential
Integer Exponential
BCD Exponential
Fixed Point Exponential
Floating Point Exponential
See the following:
expSingle
xexp
pow2Single
xpow2
pow10Single
xpow10
powSingle
xpow
Logarithm
Integer Logarithm
BCD Logarithm
Fixed Point Logarithm
NOTE: These routines are from an older edit. It's cool, but I think I've come across other viable (better) options.
The following code is based on the approximation:
ln(x+1) : x(8+x)/(8+5x) on [0,1]
For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instructions. So to put the functions into code:
lognat:
;Input: H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 tstates
dec h
dec h
jr nz,$+8
inc l \ dec l
ret nz
ld l,177
ret
inc h
ret nz
ld b,h
ld c,l
ld e,l
ld d,8
add hl,hl
add hl,hl
add hl,de
ex de,hl
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ adc a,a
ld h,a \ ld l,b
sla h \ jr c,$+3 \ ld l,c
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
rl l
ld a,h
adc a,b
ld l,a
ld h,b
cp a
ret
But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):
lognat:
;Input: H.L needs to be on (0,128.0)
;Output: H.L if c flag set
; returns nc if input is negative (HL not modified)
;Error:
; The error on the outputs is as follows:
; 20592 inputs are exact
; 12075 inputs are off by 1/256
; 100 inputs are off by 2/256
; So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 tstates
ld a,h \ or l \ jr nz,$+5
ld h,80h \ ret
dec h
dec h
jr nz,$+9
inc l \ dec l
jr nz,normalizeln1
ld l,177
ret
inc h
jr nz,normalizeln
ld b,h
ld c,l
ld e,l
ld d,8
add hl,hl
add hl,hl
add hl,de
ex de,hl
call HL_Div_DE
ld h,a \ ld l,b
sla h \ jr c,$+3 \ ld l,c
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
add hl,hl \ jr c,$+3 \ add hl,bc
rl l
ld a,h
adc a,b
ld h,b
ld l,a
scf
ret
HL_Div_DE:
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
add hl,hl \ sbc hl,de \ adc a,a \ ret
inc h
normalizeln:
xor a
inc h \ ret m
ld d,a \ ld e,a
ld a,l
jr z,toosmall
inc e \ srl h \ rra \ jr nz,$4
rla \ rl h
dec e
stepin:
ld l,a
push de
call lognat
pop de
;now multiply DE by 355, then divide by 2 (rounding)
ld b,d \ ld c,e \ ld a,d
ex de,hl
add hl,hl
add hl,hl ;4
add hl,bc ;5
add hl,hl ;10
add hl,bc ;11
add hl,hl ;22
add hl,hl
add hl,hl
add hl,hl
add hl,bc
add hl,hl
add hl,bc
sra h \ rr l
adc hl,de
scf
ret
toosmall:
dec d
dec e \ add a,a \ jr nc,$2
inc h
jp stepin
Or a table driven method for log2:
Because we are working in binary, lg(), which is log2() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)lg(2) = lg(x)1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead—all of the odd numbers. And now we can build our algorithm:
lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)>H.L
;Speed: Avg: 340
ld de,lgLUT
ld b,0
ld a,h
or a
ret m ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex)
ld a,l
jr z,$+8
inc b \ srl h \ rra \ jr nz,$4 ;input >2, so we rotate, using b as the accumulator
or a \ jr nz,$+6
ld hl,8000h \ ret ;If input was 0, use this as inf
rra \ inc b \ jr nc,$2 ;input<1, so we rotate
;A is the element to look up in the LUT
ld l,a
ld c,h
dec b
add hl,hl
add hl,de
ld e,(hl)
inc hl
ld d,(hl)
ex de,hl
add hl,bc
ret
lglut:
.dw $F800
.dw $F996
.dw $FA52
.dw $FACF
.dw $FB2C
.dw $FB76
.dw $FBB3
.dw $FBE8
.dw $FC16
.dw $FC3F
.dw $FC64
.dw $FC86
.dw $FCA5
.dw $FCC1
.dw $FCDC
.dw $FCF4
.dw $FD0B
.dw $FD21
.dw $FD36
.dw $FD49
.dw $FD5C
.dw $FD6D
.dw $FD7E
.dw $FD8E
.dw $FD9D
.dw $FDAC
.dw $FDBA
.dw $FDC8
.dw $FDD5
.dw $FDE2
.dw $FDEE
.dw $FDFA
.dw $FE06
.dw $FE11
.dw $FE1C
.dw $FE26
.dw $FE31
.dw $FE3B
.dw $FE44
.dw $FE4E
.dw $FE57
.dw $FE60
.dw $FE69
.dw $FE71
.dw $FE7A
.dw $FE82
.dw $FE8A
.dw $FE92
.dw $FE9A
.dw $FEA1
.dw $FEA9
.dw $FEB0
.dw $FEB7
.dw $FEBE
.dw $FEC5
.dw $FECB
.dw $FED2
.dw $FED8
.dw $FEDF
.dw $FEE5
.dw $FEEB
.dw $FEF1
.dw $FEF7
.dw $FEFD
.dw $FF03
.dw $FF09
.dw $FF0E
.dw $FF14
.dw $FF19
.dw $FF1E
.dw $FF24
.dw $FF29
.dw $FF2E
.dw $FF33
.dw $FF38
.dw $FF3D
.dw $FF42
.dw $FF47
.dw $FF4B
.dw $FF50
.dw $FF55
.dw $FF59
.dw $FF5E
.dw $FF62
.dw $FF67
.dw $FF6B
.dw $FF6F
.dw $FF74
.dw $FF78
.dw $FF7C
.dw $FF80
.dw $FF84
.dw $FF88
.dw $FF8C
.dw $FF90
.dw $FF94
.dw $FF98
.dw $FF9B
.dw $FF9F
.dw $FFA3
.dw $FFA7
.dw $FFAA
.dw $FFAE
.dw $FFB2
.dw $FFB5
.dw $FFB9
.dw $FFBC
.dw $FFC0
.dw $FFC3
.dw $FFC6
.dw $FFCA
.dw $FFCD
.dw $FFD0
.dw $FFD4
.dw $FFD7
.dw $FFDA
.dw $FFDD
.dw $FFE0
.dw $FFE4
.dw $FFE7
.dw $FFEA
.dw $FFED
.dw $FFF0
.dw $FFF3
.dw $FFF6
.dw $FFF9
.dw $FFFC
.dw $FFFF
Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):
ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)>H.L
;Speed: Avg: 340+(325 worst case)
call lg_88
;now signed multiply HL by 355, then divide by 2 (rounding)
ld de,0
bit 7,h
jr z,$+9
dec e \ xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ld b,h
ld c,l
xor a
add hl,hl
add hl,hl \ rla
add hl,bc \ adc a,d
add hl,hl \ rla
add hl,bc \ adc a,d
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,hl \ rla
add hl,bc \ adc a,d
add hl,hl \ rla
add hl,bc \ adc a,d
sra a \ rr h
ld l,h
ld h,a
inc e
ret nz
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ret
Floating Point Logarithm
See the following:
lnSingle (Single Precision Natural Log)
lgSingle (Single Precision Log Base 2)
log10Single (Single Precision Log Base 10)
logSingle (Single Precision Log Base y)
xln (Extended Precision Natural Log)
xlg (Extended Precision Log Base 2)
xlog10 (Extended Precision Log Base 10)
xlog (Extended Precision Log Base y)
Cosine/Sine
Integer Cosine/Sine
BCD Cosine/Sine
Fixed Point Cosine/Sine
Floating Point Cosine/Sine
See the following:
sinSingle (Single Precision Sine)
cosSingle (Single Precision Cosine)
xsin (Extended Precision Sine)
xcos (Extended Precision Cosine)
Arctangent
Integer Arctangent
BCD Arctangent
Fixed Point Arctangent
NOTE: These routines are from an older edit. It's cool, but I think I've come across other viable (better) options.
The following code is based on the approximation:
atan(x) : x(9+2x^2)/(9+5x^2) on [1,1]
For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instructions. So to put the function into code:
ArcTan_88:
;Inputs: DE is positive
ld a,d
or a
jr z,ArcTan+2
inc a
jr z,ArcTan+2
ld b,d
ld c,e
ld de,256
call DE_Div_BC_88
call ArcTan
ld de,402 ;about pi/2
ex de,hl
or a
sbc hl,de
ret
ArcTan:
;Input: DE
;Output: HL
push de ;save X for later
ld b,d
ld c,e
call BC_Times_DE ;x^2 computed
ld d,h
ld e,l
add hl,hl
add hl,hl
add hl,de ;5x^2 is now in HL
ld a,9
add a,h
ld h,a ;9+5x^2>HL
ld a,9
sla e
rl d
add a,d
ld d,a ;9+2x^2>DE
ld b,h
ld c,l
call DE_Div_BC_88 ;>DE
pop bc
jp BC_Times_DE
Table driven arctangent can be implemented, too.
Arctan can be computed on the whole real line just from the values on [0,1] by the identities: arctan(x)=arctan(x) and arctan(x)=pi/2arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16bit numbers, we can just use 8bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:
arctan_88:
;Input:
; D.E
;Output: atan(D.E)>D.E
push de
ld a,d
or a
jp p,$+5
neg
ld d,a
dec a
jr nz,checkneedinv
inc e \ dec e \ jr nz,checkneedinv
pop af \ rla \ ld de,201 \ ret nc \ ld de,201 \ ret
checkneedinv:
inc a
call nz,DEgt1_Inv
;0.E is the value to atan
ld hl,adjustatan
push hl
ld a,e
cp 46 \ ret c
dec a \ cp 42h \ ret c
dec a \ cp 4Eh \ ret c
dec a \ cp 57h \ ret c
dec a \ cp 5Eh \ ret c
dec a \ cp 64h \ ret c
dec a \ cp 6Ah \ ret c
dec a \ cp 6Fh \ ret c
sub 6Fh \ ld e,a
ld hl,atanlut
add hl,de
ld a,(hl)
ret
adjustatan:
ld e,a
pop bc
ld a,b
or a
jp p,$+5
neg
jr z,$+9
ld hl,402
or a
sbc hl,de
ex de,hl
rl b
ret nc
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret
DEgt1_Inv:
;Works if DE>1
ld hl,256
ld b,8
InvLoop:
add hl,hl
sbc hl,de
jr nc,$+3
add hl,de
adc a,a
djnz InvLoop
cpl
ld e,a
ld d,b
ret
atanlut:
.db $6F
.db $6F
.db $70
.db $71
.db $72
.db $73
.db $73
.db $74
.db $75
.db $76
.db $77
.db $77
.db $78
.db $79
.db $7A
.db $7B
.db $7B
.db $7C
.db $7D
.db $7E
.db $7F
.db $7F
.db $80
.db $81
.db $82
.db $82
.db $83
.db $84
.db $85
.db $85
.db $86
.db $87
.db $88
.db $88
.db $89
.db $8A
.db $8B
.db $8B
.db $8C
.db $8D
.db $8E
.db $8E
.db $8F
.db $90
.db $90
.db $91
.db $92
.db $93
.db $93
.db $94
.db $95
.db $95
.db $96
.db $97
.db $97
.db $98
.db $99
.db $9A
.db $9A
.db $9B
.db $9C
.db $9C
.db $9D
.db $9E
.db $9E
.db $9F
.db $A0
.db $A0
.db $A1
.db $A2
.db $A2
.db $A3
.db $A3
.db $A4
.db $A5
.db $A5
.db $A6
.db $A7
.db $A7
.db $A8
.db $A9
.db $A9
.db $AA
.db $AA
.db $AB
.db $AC
.db $AC
.db $AD
.db $AD
.db $AE
.db $AF
.db $AF
.db $B0
.db $B0
.db $B1
.db $B2
.db $B2
.db $B3
.db $B3
.db $B4
.db $B5
.db $B5
.db $B6
.db $B6
.db $B7
.db $B7
.db $B8
.db $B9
.db $B9
.db $BA
.db $BA
.db $BB
.db $BB
.db $BC
.db $BC
.db $BD
.db $BE
.db $BE
.db $BF
.db $BF
.db $C0
.db $C0
.db $C1
.db $C1
.db $C2
.db $C2
.db $C3
.db $C3
.db $C4
.db $C4
.db $C5
.db $C6
.db $C6
.db $C7
.db $C7
.db $C8
.db $C8
.db $C9
Floating Point Arctangent
See the following:
atanSingle (Single Precision Arctangent)
xatan (Extended Precision Arctangent)
Deriving Tangent, Arcsine, Arccosine, Hyperbolic Functions
String Conversions
Integer String Conversions
BCD String Conversions
Fixed Point String Conversions
Floating Point String Conversions
See the following:
strToSingle (Single Precision String To Float)
strtox (Extended Precision String To Float)
singleToStr (Single Precision Float To String)
xtostr (Extended Precision Float To String)
Random Number Generators
rand16
;#define smc ;uncomment if you are using SMC
rand16:
;;collaboration by Zeda with Runer112
;;160cc or 148cc if using SMC
;;26 bytes
;;cycle: 4,294,901,760 (almost 4.3 billion)
#ifdef smc
seed1=$+1
ld hl,9999
#else
ld hl,(seed1)
#endif
ld b,h
ld c,l
add hl,hl
add hl,hl
inc l
add hl,bc
ld (seed1),hl
#ifdef smc
seed2=$+1
ld hl,9999
#else
ld hl,(seed2)
#endif
add hl,hl
sbc a,a
and %00101101
xor l
ld l,a
ld (seed2),hl
add hl,bc
ret
rand24
rand24:
;;219cc
#ifdef smc
seed1_0=$+1
ld hl,12345
seed1_1=$+1
ld a,67
#else
ld hl,(seed1_0)
ld a,(seed1_1)
#endif
ld b,h
ld c,l
ld d,a
add hl,hl \ rla
add hl,hl \ rla
inc l
add hl,bc \ adc a,0
ld (seed1_0),hl
ld (seed1_1),a
ld c,b
ld b,a
#ifdef smc
seed2_0=$+1
ld hl,65432
seed2_1=$+1
ld a,10
#else
ld hl,(seed2_0)
ld a,(seed2_1)
#endif
add hl,hl
rla
ld (seed2_1),a
sbc a,a
and %10000111
xor l
ld l,a
ld (seed2_0),hl
add hl,bc
ret
rand32
rand32:
;;Tested and passes all CAcert tests
;;Uses a very simple 32bit LCG and 32bit LFSR
;;it has a period of 18,446,744,069,414,584,320
;;roughly 18.4 quintillion.
;;LFSR taps: 0,2,6,7 = 11000101
;;291cc
;;Thanks to Runer112 for his help on optimizing the LCG and suggesting to try the much simpler LCG. On their own, the two are terrible, but together they are great.
seed1_0=$+1
ld hl,12345
seed1_1=$+1
ld de,6789
ld b,h
ld c,l
add hl,hl \ rl e \ rl d
add hl,hl \ rl e \ rl d
inc l
add hl,bc
ld (seed1_0),hl
ld hl,(seed1_1)
adc hl,de
ld (seed1_1),hl
ex de,hl
;;lfsr
seed2_0=$+1
ld hl,9876
seed2_1=$+1
ld bc,54321
add hl,hl \ rl c \ rl b
ld (seed2_1),bc
sbc a,a
and %11000101
xor l
ld l,a
ld (seed2_0),hl
ex de,hl
add hl,bc
ret
rand, Floatingpoint
For floatingpoint random numbers, we'll need to take a little extra care to make the distribution as uniform as possible. We'll generate a float on [0,1), but we want to make sure to select numbers from [0,2^n) with probability 2^n, for example. To do this, we'll first generate the exponent by initializing to 0, then decrement the exponent and generate a random bit, repeating this process until a 1 is generated, or the exponent is too small for our floats. Finally, we generate our mantissa in normalized form.
(This means the exponent is 1 with 50% probability, 2 with 25% probability, etc.)
So as an example for a singleprecision random float generator
randSingle:
;BC points to where the result is output.
;Stores a pseudorandom number on [0,1)
;A value of 0 is extraordinarily unlikely.
push bc
;The first thing we'll do is generate the exponent.
;Initialize to 1 (stored as 0x7F) and we'll generate
;random bits, decrementing the exponent until we get a 1.
ld c,$80 ;exponent
jr rand_gen_exp
_:
dec c
add hl,hl
jr c,+_ ;we have completed calculation of the exponent
djnz _
;make sure the exponent isn't zero!
inc c
dec c
jr z,rand_zero
rand_gen_exp:
push bc
call rand
pop bc
ld b,16
jr _
_:
; Now we have generated the exponent, let's generate the mantissa
push bc
call rand
push hl
call rand
pop de
pop bc
ld b,h
;Now our float is stored in CBDE
; Make the float positive
res 7,b
.db $FE ;start of `cp *` to skip `ld b,c`
rand_zero:
ld b,c
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
inc hl
ld (hl),c
ret
Or an extendedprecision routine:
xrand:
;Stores a pseudorandom number on [0,1)
;A value of 0 should be practically impossible in this universe.
push bc
;The first thing we'll do is generate the exponent.
;Initialize to 0 (stored as 0x4000) and we'll generate
;random bits, decrementing the exponent until we get a 1.
ld bc,$4000 ;exponent
jr rand_gen_exp
_:
dec bc
add hl,hl
jr c,+_ ;we have completed calculation of the exponent
dec a
jr nz,_
;make sure the exponent isn't zero!
ld a,b
or c
jr z,xrand_zero
rand_gen_exp:
push bc
call rand
pop bc
ld a,16
jr _
_:
; Now we have generated the exponent, let's generate the mantissa
pop ix ;pointer to the output
ld (ix+8),c
ld (ix+9),b
call rand
ld (ix),h
ld (ix+1),l
call rand
ld (ix+2),h
ld (ix+3),l
call rand
ld (ix+4),l ;it's "random" anyways, just fun to change it up :P
ld (ix+5),h
call rand
ld (ix+6),h
ld (ix+7),l
ret
xrand_zero:
pop hl
set_xzero:
;HL points to where to write 0 as an extended precision float.
xor a
ld b,10
_:
ld (hl),a
inc hl
djnz _
ret