This article covers advanced math routines for a number of functions including the basic addition, multiplication, and square roots, as well as exponential and trigonometric functions.
In most instances it is most efficient to store numbers in little endian format as that is how the Z80 works internally. Doing this allows us to take advantage of the handful of 16-bit operations. In the case of BCD formatted numbers, it really doesn't matter if they are stored big-endian or little-endian since built in operations are only 8-bit. However, I choose to use little endian since it is easier to work with large numbers when the pointer passed to the routine points to the lowest byte.
I'll try to cover:

• Integer arithmetic
• BCD arithmetic
• Fixed point arithmetic
• Floating point arithmetic

# Intro to Integer Formats

Integers are one of the most basic format to work with on the Z80. The built in instructions primarily work with 8-bit integers, though a few work with 16-bit integers. Since you are reading this page, I am going to assume you know this much.
However, if you are looking to work with larger integers, you'll need to come up with a working format. Best practice for the z80 is store in little endian. You may have seen this mentioned here and elsewhere without a good reason, so here it is:

The Z80 stores 16-bit integers in little-endian. If you need to access the lower 16 bits of a 64-bit integer, it's easier and more efficient to use ld hl,(int64) for 3 bytes, 16cc, versus ld hl,(int64+6) \ ld a,l \ ld l,h \ ld h,a at 6 bytes, 28cc.
So if you wanted to store the 32-bit integer 0x12345678 as a constant in your program, it would look like:

``````int32_var0:
.db \$78,\$56,\$34,\$12```
```

OR:
``````int32_var0:
.dw \$5678,\$1234```
```

And if you wanted to change that value to 0xFEDCBA98, you could do:
``````    ld hl,\$BA98
ld (int64_var0),hl
ld hl,\$FEDC
ld (int64_var0+2),hl```
```

# Intro to BCD Formats

BCD, short for Binary Coded Decimal, is a format where one byte represents two base 10 digits. As an example, 0x66 is interpreted in base 10 as 66 instead of 102.
The Z80 has a handy instruction called DAA (1 byte, 4cc) which converts the A register based on it's contents, the c flag, half-carry flag, and the last type of arithmetic operation. Addition and subtraction operations set or reset a special flag so that the DAA instruction knows what to do.
For example, if A=0x66, then add a,a, you would get A=0xCC, with the c flag reset. Followed up with the daa instruction, the c flag is set, and A is converted to 0x32, which is what we expect since 66*2=132.

# Intro to Fixed Point Formats

Fixed point formats offer the advantage of precision smaller than a unit. For example, you can store values such as 119.125.
In fixed point, the decimal point is in a fixed location. An 8.8 fixed point number means that given a 16-bit number, the top 8 bits form the integer part, and the lower 8 bits form the fractional part. If the lower 8 bits formed the integer 55, then it would be interpreted as 55/256.

Fixed point formats tend to be faster and easier to implement than floating point, but they can lose precision more easily. They are great for games where speed is desired, but often not desirable in mathematical applications.
Many fixed point routines can use integer routines with some minor modifications to overhead code.

# Intro to Floating Point Formats

Floating point formats typically have:

• A sign bit
• An exponent for an implied base, which indicates where the decimal point is
• A mantissa

Floating point and scientific notation, are very similar. If we are using BCD floats, then base 10 is likely the implied base, so you might have a number like 3.14159*100.
In practice, a base of some power of two is far more efficient to compute. As well, the mantissa is typically of the form where one digit forms the non-zero integer part, and the rest form the fractional part. This is referred to as being normalized. When the top digit is 0, this is a denormalized number. Some formats allow this, some don't. For base-2 float formats that don't allow denormalized numbers there is the advantage that the top bit can only be one value— 1, so it isn't actually necessary to occupy a bit that is known to always be 1. Instead, this bit is taken as implicit. In such cases, that bit is instead put to some other use, like a sign bit, or another bit of the exponent.

In general:

• Sign is represented as a bit where 0 means positive, 1 means negative.
• The exponent is an integer with a bias. For example, if our format uses an 8-bit exponent, 128 might represent the exponent 0. In fact, this is what TI does for their BCD floats.
• The mantissa is a 1.n fixed point number. As examples,

** A 7-byte BCD float has 14 digits, so the mantissa is a 1.13 fixed point number.
** An "extended precision float" has a 64-bit mantissa with no implicit bit. Therefore, the mantissa is a 1.63 fixed point number.
** A "single precision float" has a 24-bit mantissa, where the top bit is implied and in its place is the sign bit. So it is a 1.23 fixed point number, where the top bit is always expected to be 1.

There is a GitHub repository for Z80 floats here. At the time of this writing, it contains single-precision and extended-precision binary floats. Maybe at some future point there will be other formats.

Thankfully addition and subtraction are built-in instructions on the Z80 and by extension, the eZ80. However, if we want to work with multiple precision arithmetic, we still have to juggle pointers and RAM.

Suppose we have two 64-bit integers stored in a fixed location in RAM called input1 and input2, and we want to store their sum or difference to output. Then we can take advantage of 16-bit math to make a speed-optimized routine:

``````;addint64:
;47 bytes, 264cc
ld hl,(input1) \ ld de,(input2) \ add hl,de \ ld (output),hl
ld hl,(input1+2) \ ld de,(input2+2) \ adc hl,de \ ld (output+2),hl
ld hl,(input1+4) \ ld de,(input2+4) \ adc hl,de \ ld (output+4),hl
ld hl,(input1+6) \ ld de,(input2+6) \ adc hl,de \ ld (output+6),hl```
```

Note the use of add in the first iteration, versus adc in the latter iterations.

For subtraction, we only have sbc for 16-bit subtraction, so we have to first reset carry

``````;subint64:
;49 bytes, 272cc
or a
ld hl,(input1) \ ld de,(input2) \ sbc hl,de \ ld (output),hl
ld hl,(input1+2) \ ld de,(input2+2) \ sbc hl,de \ ld (output+2),hl
ld hl,(input1+4) \ ld de,(input2+4) \ sbc hl,de \ ld (output+4),hl
ld hl,(input1+6) \ ld de,(input2+6) \ sbc hl,de \ ld (output+6),hl```
```

Suppose you want to pass pointers to the input and output operands where DE is the first operand, HL is the second, and BC is the output:

``````;addint64:
;45 bytes, 294cc
ld a,(de) \ add a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ adc a,(hl) \ ld (bc),a```
```
``````;subint64:
;45 bytes, 294cc
ld a,(de) \ sub (hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
ld a,(de) \ sbc a,(hl) \ ld (bc),a```
```

Suppose you want to save a few bytes at the cost of some cycles:

``````;addint64:
;17 bytes, 515cc
or a
call +_
_:
call +_
_:
call +_
_:
ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc \ ret```
```
``````;subint64:
;17 bytes, 515cc
or a
call +_
_:
call +_
_:
call +_
_:
ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc \ ret```
```

Suppose you just want the output to overwrite the first input, in order to save time and space:

``````;addint64:
;38 bytes, 252cc
ld a,(de) \ add a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (de),a```
```
``````;subint64:
;38 bytes, 252cc
ld a,(de) \ sub (hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (de),a```
```

Or with BC freed up, roll your loops:
``````;addint64:
;11 bytes, 384cc
ld b,8
or a
_:
ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
djnz -_
ret```
```
``````;subint64:
;11 bytes, 384cc
ld b,8
or a
_:
ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
djnz -_
ret```
```

BCD addition and subtraction work almost exactly like their integer counterparts
except that you have to use arithmetic with the A register, followed by daa.
I'll offer one example, and the reader can adapt the rest. This modifies the
above 64-bit example, but encodes 16 BCD digits.

``````;addBCD_16:
;46 bytes, 284cc
ld a,(de) \ add a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ daa \ ld (de),a```
```

Fixed Point addition and subtraction are identical to integer addition and subtraction, assuming both inputs have the same format.

Believe it or not, addition and subtraction are one of the toughest floating point operations to implement. To show this (using base 10 for convenience), suppose you have two numbers, 3.14159 *100 and 1.23456*103.
To add these, we have to align their decimal points, which means denormalizing a number until it's exponent matches. On top of that, if your precision is too limited, you'll have to round. In this example, we only have 6 digits allowed. Since it truncates the digit 1, our number is not rounded up.

`````` 0.00314*10^^3^^
+1.23456*10^^3^^
----------------
=1.23770*10^^3^^```
```

So let's review what we need to do, then create some pseudo-code and then some examples:
• The first input will be overwritten with the output.
• We need to align the mantissas so that their exponents are the same
• Organize the inputs so that the exponent of the first input is greater than or equal to that of the second.
• Find their difference in exponent. If this difference is more than the number of bits in the mantissa, then if we aligned the mantissas, the second one would be full of zeros, so we can skip these cases. If the difference is equal to the size of the mantissa, then when the mantissas are aligned and round:
• If the high bit is 1, increment the mantissa of the first input. For formats where the high bit is implicit, then we can just go ahead and increment that mantissa.
• If that difference is non-zero, shift the second input right a number of times equal to that difference.
• If the signs of the input are the same, add the second mantissa to that of the first as fixed-point numbers.
• If there is overflow, increment the exponent and shift the mantissa right, rotating in a '1' bit.
• If the exponent overflows, set the result to INF
• Else the signs are different, so subtract the second mantissa from that of the first as fixed-point numbers.
• If there is overflow, invert the sign bit, then perform 2's compliment on the mantissa (negate it).
• If the mantissa is zero, set the output to ZERO.
• While the upper bit of the result is 0, decrement the exponent, and shift the mantissa left.
• If the exponent underflows, set the result to ZERO.
``````;We'll assume that the top bit of the mantissa is always 1.
;exp1 represents the exponent of the first input as an integer with bias at half it's maximum value.
;exp2 represents the exponent of the second input as an integer with bias at half it's maximum value.
;sign1 is 0 or 1. 1 if the first input is negative, 0 otherwise.
;sign2 is 0 or 1. 1 if the second input is negative, 0 otherwise.
;mant1 is the mantissa of the first input, a fixed-point number. Since it is addition, we'll just treat it as an integer.
;mant2 is the mantissa of the second input, a fixed-point number. Since it is addition, we'll just treat it as an integer.
;size is the size of the mantissa.
if exp1<exp2:
swap(float1,float2)
k=exp1-exp2
if k>size:
return
if k=size:
mant2=1
else:
mant2=mant2>>k
if carry=1:
mant2=mant2+1
if sign1=sign2:
mant1=mant1+mant2
if carry=1:
<<shift right mant1, shifting in a 1>>
exp1=exp1+1
if carry=1:
return INF
return float1
else:
mant1=mant1-mant2
if carry=1:
sign1=not(sign1)
mant1=-mant1
while topbit(mant1)=0:
exp1-=1
if exp1=0
return ZERO
mant1=mant1<<1
return float1```
```

In this example, we have a 24-bit little-endian mantissa, with the top bit an implicit 1, instead, used as a sign bit. Finally, the top 8 bytes are a mantissa such that 128 corresponds to an exponent of 0. In all, these floats use 32 bits of space.
``````addend = scrap

;;Still need to tend to special cases
;;x+y
push af
push hl
push de
push bc
inc de
inc de
inc hl
inc hl
ld a,(de)
xor (hl)
push af
inc de
inc hl
ex de,hl
ld a,(de)
sub (hl)
ex de,hl
jr nc,\$+5
ex de,hl
neg
cp 24
push hl
dec de
ld bc,0408h
dec hl
ld (hl),0
sub c
jr nc,\$-5
push af
push hl
ex de,hl
ld a,(hl) \ or 80h \ ld (de),a \ dec de \ dec hl
ldd
ldd
ex de,hl
dec b
jr z,\$+7
ld (hl),0 \ dec hl \ djnz \$-3
pop hl
pop af
ld b,a
jr z,noshift
set 7,(hl)
_:
push hl
srl (hl) \ dec hl
rr (hl) \ dec hl
rr (hl) \ dec hl
rr (hl)
pop hl
djnz -_
noshift:
pop hl  ;bigger float
dec hl
ld b,(hl)
dec hl
dec hl
ex de,hl
pop af
jp m,subtract
ld a,(hl) \ rla \ inc hl
ld a,(de) \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ set 7,a \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ ld (hl),a
inc (hl)
dec hl \ rr (hl)
dec hl \ rr (hl)
dec hl \ rr (hl)
subtract:
xor a
ld c,a \ sub (hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c
sbc a,(hl) \ ld (hl),a \ inc hl
ld a,(de) \ sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ set 7,a
sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
ld a,(de) \ ld (hl),a
dec de
ex de,hl
jr nc,negated
ld a,80h \ xor b \ ld b,a
ld a,c \ sub (hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
ld a,c \ sbc a,(hl) \ ld (hl),a
negated:
push bc
ld a,h \ or l \ or d \ or e \ or b \ or c
normalize:
add hl,hl \ rl e \ rl d \ rl c \ rl b
jp p,normalize
pop bc
ld a,(hl)
rla \ rl b \ rra \ ld (hl),a
dec hl
dec hl
pop de
push de
ldi
ldi
ldi
ld a,(hl)
ld (de),a
pop bc
pop de
pop hl
pop af
ret
;;How many push/pops are needed?
;;return ZERO
ld hl,0
pop bc
;;How many push/pops are needed?
;;return INF
dec hl
ld (hl),40h
;;How many push/pops are needed?
;;Return bigger number
pop af
dec hl
dec hl
dec hl
```

For an extended-precision example, see here.

# Multiplication

Multiple precision multiplication is tricky on the eZ80 which has built-in 8-bit multiplication instructions, and even trickier on the Z80 which doesn't have built-in multiplication. The way many of us learned in school is easy to implement, but it slows down significantly with higher digit numbers. For example, to multiply two 2-digit numbers, you would need to perform 4 single-digit multiplications, but to multiply two 10-digit numbers, you would need to use 100 single-digit multiplications! In general, to multiply two n-digit numbers, you would need n*n single-digit multiplications. For higher precision multiplication, this can be quite an issue.

In the past century, several algorithms have been developed to greatly reduce the amount of work required, starting off with Karatsuba multiplication, then Toom-Cook, and then FFT based multiplication (including Schönhage–Strassen). I'm not familiar with Fürer's algorithm, unfortunately.

## "Schoolbook" method

This is how most people learn to multiply multi-digit numbers in school. It's relatively easy, albeit tedious. That said, it is very easy in binary! Think about it— your only digits are 0 and 1, and multiplying by 0 or 1 is trivial.
So let's do an example. We will multiply 1011 by 1100, but in base 10:

``````     1011
*    1101
---------
1011
00000
101100
1011000
---------
= 1113111```
```

Sweet baby carrots, that was easy. Now if it was binary, that 3 is too big for one digit, so we'd subtract 2 (base 2) and carry a 1. If it had been, say, 7, then we'd subtract 6 and carry 3. In this case, when we carry the 1, the next digit gets bumped to 2, so we have to subtract two and carry 1 again, and again, and again, resulting in 10001111. In code, we would actually just sum it up as we go. Here is some pseudocode:
``````; multiply E times A
set accumulator to 0

while A is not 0, then:
shift A right, carrying out the bottom bit
if 1 is carried out:
shift E to the left, shifting in 0.

return accumulator as the result```
```

Putting that into assembly, we have:

``````  ld hl,0   ; set accumulator to 0
ld d,h    ; (makes adding 'E' to 'accumulator' easier)
loop:
or a      ; while A is not 0, then:
ret z     ;

rra       ;     shift A right, carrying out the bottom bit
jr nc,skip;     if 1 is carried out:
skip:
sla e     ;     shift E to the left, shifting in 0.
rl d      ;
jr loop```
```

So that is the easy way to implement it. The better way for the Z80 requires looking at multiplication slightly different from how we did it in school. Luckily it's an easy change: work left-to-right instead of right-to-left! So our above example looks like this instead:
``````     1011
*    1101
---------
1011000
101100
00000
1011```
```

To implement this version, we'll need a loop counter, and some trickery. On first glance, it looks like we need to shift our 'E' all the way to the left, then shift right on each iteration. However, we can simply shift the accumulator left each iteration before adding. The looks like the following by hand:

``````     1011
*    1101
-----^---
1011

1011
*    1101
------^--
10110
1011

1011
*    1101
-------^-
101100
10110
0000

1011
*    1101
--------^
1011000
101100
00000
1011```
```

So in code:

``````  ld hl,0   ; set accumulator to 0
ld d,h    ; (makes adding 'E' to 'accumulator' easier)
ld b,8
loop:
rla
jr nc,skip;     if 1 is carried out:
skip:
djnz loop
ret```
```

Now let's create various routines:

### H_Times_E

This is a classic. It packs everything into two 16-bit registers, making it very streamlined, and providing a religious experience (even to atheists).

``````H_Times_E:
;Inputs:
;  H and E
;Outputs:
;  HL is the product
;  D is 0
;  A,E,B,C are preserved
;36 bytes
;min: 190cc
;max: 242cc
;avg: 216cc

ld d,0
ld l,d
sla h \ jr nc,\$+3 \ ld l,e
ret```
```

Or we can sacrifice speed for size, taking almost twice the time, but at 1/3 the size:

``````H_Times_E:
;Inputs:
;  H and E
;Outputs:
;  HL is the product
;  D is 0
;  B is 0
;  A,E,C are preserved
;12 bytes
;min: 381cc
;max: 429cc
;avg: 405cc

ld d,0
ld l,d
ld b,8
_:
jr nc,\$+3
djnz -_
ret```
```

### DE_Times_A

``````DE_Times_A:
;Inputs:
;   A and DE
;Outputs:
;   AHL is the product, with A being the upper 8 bits
;   BC = 0
;   DE is preserved

ld bc,\$0800
ld h,c
ld l,c
_:
rla          ; Check most-significant bit of accumulator
jr nc,_skip  ; If zero, skip addition
_skip:
djnz -_
ret```
```

## Karatsuba Multiplication Algorithm

The Karatsuba algorithm is a genuinely cool bit of math that is straightforward and fairly young as far as algorithms go. It reduced a 2-by-2 multiplication to three 1-by-1 multiplications— better than our four 1-by-1 multiplies that we needed with hand multiplication. If we apply this recursively, then a 4-by-4 uses three 2-by-2 multiplies, which translates to nine 1-by-1 multiplies. Much better than 16 1-digit multiplies! The downside is a little more overhead with addition/subtraction.

Before we figure out how it works, here is the algorithm:

``````#Want to multiply (ax+b) * (cx+d)
#ex., in base 10 digits, this could be (a*10+b)*(c*10+d)
z0 = b*d
z2 = a*c
z1 = (a+b)*(c+d)-z0-z2
result = z0+z1*x+z2*x^2```
```

Or another statement of the algorithm (my preferred interpretation):
``````z0 = b*d
z2 = a*c
z1 = z0+z2-(a-b)*(c-d)
result = z0+z1*x+z2*x^2```
```

So as an example, we'll multiply 37 by 69.

``````a=3, b=7, c=6, d=9
z0 = 7*9 = 63
z2 = 3*6 = 18
z1 = 63+18-(3-7)*(6-9) = 69   #This is coincidence that z1 is 69 and one of our inputs was also 69 !
result = 63+690+1800 = 2553```
```

To verify this algorithm, note that z1 = z0+z2-(a-b)*(c-d) = bd+ac-(a-b)(c-d) = ac+bd-(ac-ad-bc+bd) = ad+bc so then result = z0+z1*x+z2*x^2 = bd+(ad+bc)x+acx^2 which is indeed (ax+b)(cx+d).

## Toom-Cook Multiplication Algorithms

The Toom-Cook algorithm extends Karatsuba's algorithm via a different approach. For an n-by-m multiplication, we will only need to n+m-1 1-digit multiplications. This is an insane improvement over Karatsuba, let alone our hand method. A 16-by-16 multiply now needs 31 single-digit multiplies instead of 81 in Karatsuba, or 256 in the hand method. Holy heck. The problem is that overhead for recombining the multiplications quickly grows non-trivial.

The premise starts with polynomials. If you are given two points, there is a unique linear function that passes though both (a line). Given three points, there is only one quadratic function that passes through all three points. In general, given n+1 points, there is a unique n-th degree polynomial that passes through each point.

Now let's look at multiplying polynomials. Given two linear functions, f(x)=ax+b and g(x)=cx+d, we can multiply them to get a quadratic function. If we evaluate this at three points, then theoretically we can find the coefficients for the quadratic polynomial. Some easy points to evaluate are x=0,1, and -1. Or in other words, we get bd, (a+b)(c+d), (b-a),(d-c). The formula for recombining the points at x=0,1, and -1 is this:

``````z0 = f(0)g(0)
z1 = f(1)g(1)
z2 = f(-1)g(-1)
result = z0+(z1-z2)x/2+((z1+z2)/2-z0) * x^2```
```

In general, the formula is different depending on the x values you choose to evaluate at. Some formulas are easier than others, but some x values are more difficult than others.

So with our example in the Karatsuba section, f(x)=3x+7 and g(x)=6x+9,

``````z0 = f(0)g(0)   = 7*9 = 63
z1 = f(1)g(1)   = (3+7)(6+9) = 150
z2 = f(-1)g(-1) = (7-3)(9-6) = 12
result = z0+(z1-z2)x/2+((z1+z2)/2-z0) * x^2
= 63+(150-12)x/2+((150+12)/2-63) * x^2
= 63+69x+18x^2```
```

This example is similar to Karatsuba's method, but a little more complicated. It turns out that Karatsuba's method is just Toom-Cook with points at either x=0,1,+inf, or x=0,-1,+inf.

For a fixed set of points, the formula to recombine the values will always be the same, so we have two challenges. First, we need to choose a set of points that our algorithm will evaluate at, second, we need to find the recombining formula. I like to solve this with a system of equations. If we want an algorithm for 3-by-3 multiplication, then we'll need five points. For example, x=-2, -1, 0, 1, and 2. Then we want to generate coefficients c0, c1, c2, c3, and c4. The easy way is to put this matrix into your calculator:

``````    [[1  -2   4  -8  16]
[1  -1   1  -1   1]
[A]= [1   0   0   0   0]
[1   1   1   1   1]
[1   2   4   8  16]]```
```

And then find the inverse:
``````       [[    0     0     1     0      0]
[ 1/12  -2/3     0   2/3  -1/12]
[A]^-1= [-1/24   2/3  -5/4   2/3  -1/24]
[-1/12   1/6     0  -1/6   1/12]
[ 1/24  -1/6   1/4  -1/6   1/24]]```
```

And this tells us that:
``````z0=f(-2)g(-2)
z1=f(-1)g(-1)
z2=f(0)g(0)
z3=f(1)g(1)
z4=f(2)g(2)

c0 = [    0     0     1     0      0]*[z0   z1   z2   z3   z4] = z2
c1 = [ 1/12  -2/3     0   2/3  -1/12]*[z0   z1   z2   z3   z4] = (z0-z4)/12 + 2(z3-z1)/3
c2 = [-1/24   2/3  -5/4   2/3  -1/24]*[z0   z1   z2   z3   z4] =-(z0+z4)/24 + 2(z3+z1)/3 - 5*z2/4
c3 = [-1/12   1/6     0  -1/6   1/12]*[z0   z1   z2   z3   z4] = (z4-z0)/12 + (z1-z3)/6
c4 = [ 1/24  -1/6   1/4  -1/6   1/24]*[z0   z1   z2   z3   z4] = (z0+z4)/24 - (z1+z3)/6 -z2/4```
```

This reduces a 3-by-3 multiplication to 5 non-trivial multiplications, 2 divisions-by-3, 10 add/sub, and a handful of shifts. Those divisions-by three are a bit slow, but much faster than a multiplication if you use the right algorithms.

## Multiplication, base case

In all cases, we will need a "base case" multiplication routine. This is the smallest multiplication performed, and all larger multiplication are broken up into some combination of these multiplications. On the eZ80, you can use the mlt instruction as your base case. For example, the mlt hl instruction multiplies H by L and stores the 16-bit result in HL. On the regular Z80, we need to create our own custom routine. In my experience, a well made 16-by-16 multiplication routine is the best base case. The following is made by Runer112, analyzed by Zeda, and tested by jacobly:

``````;This was made by Runer112
;Tested by jacobly
mul16:
;BC*DE --> DEHL
; ~544.887cc as calculated in jacobly's test
;min: 214cc  (DE = 1)
;max: 667cc
;avg: 544.4507883cc   however, deferring to jacobly's result as mine may have math issues ?
ld    a,d
ld    d,0
ld    h,b
ld    l,c
jr    c,Mul_BC_DE_DEHL_Bit14
jr    c,Mul_BC_DE_DEHL_Bit13
jr    c,Mul_BC_DE_DEHL_Bit12
jr    c,Mul_BC_DE_DEHL_Bit11
jr    c,Mul_BC_DE_DEHL_Bit10
jr    c,Mul_BC_DE_DEHL_Bit9
jr    c,Mul_BC_DE_DEHL_Bit8
jr    c,Mul_BC_DE_DEHL_Bit7
ld    a,e
and    %11111110
jr    c,Mul_BC_DE_DEHL_Bit6
jr    c,Mul_BC_DE_DEHL_Bit5
jr    c,Mul_BC_DE_DEHL_Bit4
jr    c,Mul_BC_DE_DEHL_Bit3
jr    c,Mul_BC_DE_DEHL_Bit2
jr    c,Mul_BC_DE_DEHL_Bit1
jr    c,Mul_BC_DE_DEHL_Bit0
rr    e
ret    c
ld    h,d
ld    l,e
ret

Mul_BC_DE_DEHL_Bit14:
jr    nc,Mul_BC_DE_DEHL_Bit13
Mul_BC_DE_DEHL_Bit13:
jr    nc,Mul_BC_DE_DEHL_Bit12
Mul_BC_DE_DEHL_Bit12:
jr    nc,Mul_BC_DE_DEHL_Bit11
Mul_BC_DE_DEHL_Bit11:
jr    nc,Mul_BC_DE_DEHL_Bit10
Mul_BC_DE_DEHL_Bit10:
jr    nc,Mul_BC_DE_DEHL_Bit9
Mul_BC_DE_DEHL_Bit9:
jr    nc,Mul_BC_DE_DEHL_Bit8
Mul_BC_DE_DEHL_Bit8:
jr    nc,Mul_BC_DE_DEHL_Bit7
Mul_BC_DE_DEHL_Bit7:
ld    d,a
ld    a,e
and    %11111110
jr    nc,Mul_BC_DE_DEHL_Bit6
Mul_BC_DE_DEHL_Bit6:
jr    nc,Mul_BC_DE_DEHL_Bit5
Mul_BC_DE_DEHL_Bit5:
jr    nc,Mul_BC_DE_DEHL_Bit4
Mul_BC_DE_DEHL_Bit4:
jr    nc,Mul_BC_DE_DEHL_Bit3
Mul_BC_DE_DEHL_Bit3:
jr    nc,Mul_BC_DE_DEHL_Bit2
Mul_BC_DE_DEHL_Bit2:
jr    nc,Mul_BC_DE_DEHL_Bit1
Mul_BC_DE_DEHL_Bit1:
jr    nc,Mul_BC_DE_DEHL_Bit0
Mul_BC_DE_DEHL_Bit0:
jr    c,Mul_BC_DE_DEHL_FunkyCarry
rr    e
ld    e,a
ret    nc
ret    nc
inc    e
ret    nz
inc    d
ret

Mul_BC_DE_DEHL_FunkyCarry:
inc    d
rr    e
ld    e,a
ret    nc
ret    nc
inc    e
ret```
```

Although this is unlikely to change, the most up-to-date version can be found here.

mul24, while less used on the Z80, can still be a base-case multiplication, though it does use an 8-by-24 multiplication subroutine. Here is 8-by-24:

``````C_times_BDE:
;;C*BDE => CAHL
;C = 0     157
;C = 1     141
;141+
;C>=128    135+6{0,33+{0,1}}+{0,20+{0,8}}
;C>=64     115+5{0,33+{0,1}}+{0,20+{0,8}}
;C>=32     95+4{0,33+{0,1}}+{0,20+{0,8}}
;C>=16     75+3{0,33+{0,1}}+{0,20+{0,8}}
;C>=8      55+2{0,33+{0,1}}+{0,20+{0,8}}
;C>=4      35+{0,33+{0,1}}+{0,20+{0,8}}
;C>=2      15+{0,20+{0,8}}
;min: 141cc
;max: 508cc
;avg: 349.21279907227cc

ld a,b
ld h,d
ld l,e
sla c \ jr c,mul8_24_1
sla c \ jr c,mul8_24_2
sla c \ jr c,mul8_24_3
sla c \ jr c,mul8_24_4
sla c \ jr c,mul8_24_5
sla c \ jr c,mul8_24_6
sla c \ jr c,mul8_24_7
sla c \ ret c
ld a,c
ld h,c
ld l,c
ret
mul8_24_1:
add hl,hl \ rla \ rl c \ jr nc,\$+7 \ add hl,de \ adc a,b \ jr nc,\$+3 \ inc c
mul8_24_2:
add hl,hl \ rla \ rl c \ jr nc,\$+7 \ add hl,de \ adc a,b \ jr nc,\$+3 \ inc c
mul8_24_3:
add hl,hl \ rla \ rl c \ jr nc,\$+7 \ add hl,de \ adc a,b \ jr nc,\$+3 \ inc c
mul8_24_4:
add hl,hl \ rla \ rl c \ jr nc,\$+7 \ add hl,de \ adc a,b \ jr nc,\$+3 \ inc c
mul8_24_5:
add hl,hl \ rla \ rl c \ jr nc,\$+7 \ add hl,de \ adc a,b \ jr nc,\$+3 \ inc c
mul8_24_6:
add hl,hl \ rla \ rl c \ jr nc,\$+7 \ add hl,de \ adc a,b \ jr nc,\$+3 \ inc c
mul8_24_7:
add hl,hl \ rla \ rl c \ ret nc \ add hl,de \ adc a,b \ ret nc \ inc c \ ret```
```

And mul24 needs var48 defined.

``````mul24:
;;BDE*CHL -> HLBCDE
;;155 bytes
;;402+3*C_Times_BDE
;;fastest:1201cc
;;slowest:1753cc
;;avg    :1464.9033203125cc (1464+925/1024)
;min: 825cc
;max: 1926cc
;avg: 1449.63839751681cc

push bc
ld c,l
push hl
call C_Times_BDE
ld (var48),hl
ld l,a
ld h,c
ld (var48+2),hl

pop hl
ld c,h
call C_Times_BDE
push bc
ld bc,(var48+1)
ld (var48+1),hl
pop bc
ld b,c
ld c,a
ld hl,(var48+3)
ld h,0
ld (var48+3),hl

pop bc
call C_Times_BDE
ld de,(var48+2)
ld (var48+2),hl
ld d,c
ld e,a
ld b,h
ld c,l
ld hl,(var48+4)
ld h,0
ld de,(var48)
ret```
```

## Integer Multiplication, multiprecision

This 32-by-32 multiply requires 8 bytes of RAM and uses the above mul16 routine as a base case for Karatsuba Multiplication:

``````; You will need to define z32_0, which is where the result is written!

mul32:
;max: 703cc  + 3*mul16
;     2704cc
;min: 655cc  + 3*mul16
;     1297cc
;avg: 673.25cc+3*mul16
;     2307.911cc
;DEHL * BCIX ==> z32_0
z32_2 = z32_0+4

push de
push bc
push hl
push ix
call mul16  ;DEHL
ld (z32_2),hl
ld (z32_2+2),de

pop de
pop bc
;  push bc
push de
call mul16  ;DEHL
ld (z32_0),hl
ld (z32_0+2),de

pop de    ;low word
;  pop bc    ;low word
pop hl
xor a
sbc hl,de
jr nc,+_
sub l
ld l,a
sbc a,a
sub h
ld h,a
xor a
inc a
_:
ex de,hl
pop hl
sbc hl,bc
jr nc,+_
ld b,a
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ld a,b
inc a
_:
ld b,h
ld c,l
push af
call mul16
rra
;need to perform z0+z2-result
push de
push hl
xor a
ld hl,(z32_0)
ld bc,(z32_2)
ex de,hl
ld hl,(z32_0+2)
ld bc,(z32_2+2)
rla
;now need to subtract
ex de,hl
pop bc
sbc hl,bc
ex de,hl
pop bc
sbc hl,bc
sbc a,0
;A:HL:DE is the result, need to add to z32_0+2
mul32_final:
ld bc,(z32_0+2)
ex de,hl
ld (z32_0+2),hl
ld hl,(z32_2)
ld (z32_2),hl
ld hl,z32_2+2
ld (hl),a
ret nc
inc hl
inc (hl)
ret
xor a
ld bc,(z32_0)
ex de,hl
ld bc,(z32_0+2)
rla
ex de,hl
ld bc,(z32_2)
ex de,hl
ld bc,(z32_2+2)
jp mul32_final```
```

And for 64-bit, you'll need the above mul32, as well as a mov8 routine like this:

``````  ldi
ldi
ldi
ldi
ldi
ldi
ldi
ldi
ret```
```
``````mul64:
;multiplies the 64-bit integers at inp64_1 and inp64_2
;stores the 128-bit (16-byte) result at out128
;
;min: 1740+3*min(mul32)
;     5631cc
;max: 1901+3*max(mul32)
;     10013cc
;avg: 1797+3*avg(mul32) + 9572881/2^24
;    ~8720.733cc
;uses 24 bytes at out128
z64_0 = out128
z64_2 = z64_0+8
z32_0 = z64_2+8

ld de,(inp64_1+6)
ld hl,(inp64_1+4)
ld bc,(inp64_2+6)
ld ix,(inp64_2+4)
call mul32
;copy the 8 bytes at z32_0 to z64_2
ld hl,z32_0
ld de,z64_2
call mov8

ld de,(inp64_1+2)
ld hl,(inp64_1)
ld bc,(inp64_2+2)
ld ix,(inp64_2)
call mul32
;copy the 8 bytes at z32_0 to z64_0
ld hl,z32_0
ld de,z64_0
call mov8

;now I need to subtract the 32-bit digits from each other
xor a
ld hl,(inp64_1)
ld bc,(inp64_1+4)
sbc hl,bc
ex de,hl
ld hl,(inp64_1+2)
ld bc,(inp64_1+6)
sbc hl,bc
jr nc,+_
ld b,a \ sub e \ ld e,a
ld a,b \ sbc a,d \ ld d,a
ld a,b \ sbc a,l \ ld l,a
ld a,b \ sbc a,h \ ld h,a
ld a,b
_:
rla
push hl   ;top byte
push de

ld hl,(inp64_2)
ld bc,(inp64_2+4)
sbc hl,bc
ex de,hl
ld hl,(inp64_2+2)
ld bc,(inp64_2+6)
sbc hl,bc
jr nc,+_
ld c,a
xor a
ld b,a
sub e \ ld e,a
ld a,b \ sbc a,d \ ld d,a
ld a,b \ sbc a,l \ ld l,a
ld a,b \ sbc a,h \ ld h,a
ld a,c
inc a
_:
ex de,hl
pop ix
pop bc
push af
call mul32

rra
;need to perform z0+z2-result
xor a
ld hl,(z64_0)
ld de,(z64_2)
ld (inp64_1),hl
ld hl,(z64_0+2)
ld de,(z64_2+2)
ld (inp64_1+2),hl
ld hl,(z64_0+4)
ld de,(z64_2+4)
ld (inp64_1+4),hl
ld hl,(z64_0+6)
ld de,(z64_2+6)
ld (inp64_1+6),hl
rla
;now need to subtract
ld hl,(inp64_1)
ld de,(z32_0)
sbc hl,de
ld (inp64_1),hl
ld hl,(inp64_1+2)
ld de,(z32_0+2)
sbc hl,de
ld (inp64_1+2),hl
ld hl,(inp64_1+4)
ld de,(z32_0+4)
sbc hl,de
ld (inp64_1+4),hl
ld hl,(inp64_1+6)
ld de,(z32_0+6)
sbc hl,de
ld (inp64_1+6),hl
sbc a,0
mul64_final:
;now need to add it back in
ld hl,(z64_0+4)
ld de,(inp64_1)
ld (z64_0+4),hl
ld hl,(z64_0+6)
ld de,(inp64_1+2)
ld (z64_0+6),hl
ld hl,(z64_0+8)
ld de,(inp64_1+4)
ld (z64_0+8),hl
ld hl,(z64_0+10)
ld de,(inp64_1+6)
ld (z64_0+10),hl
ld hl,z64_0+12
ld (hl),a
ret nc
inc hl \ inc (hl) \ ret nz
inc hl \ inc (hl) \ ret nz
inc hl \ inc (hl) \ ret
;z0+z2+result
xor a
ld hl,(z64_0)
ld de,(z64_2)
ld (inp64_1),hl
ld hl,(z64_0+2)
ld de,(z64_2+2)
ld (inp64_1+2),hl
ld hl,(z64_0+4)
ld de,(z64_2+4)
ld (inp64_1+4),hl
ld hl,(z64_0+6)
ld de,(z64_2+6)
ld (inp64_1+6),hl
rla
;now need to subtract
ld hl,(inp64_1)
ld de,(z32_0)
ld (inp64_1),hl
ld hl,(inp64_1+2)
ld de,(z32_0+2)
ld (inp64_1+2),hl
ld hl,(inp64_1+4)
ld de,(z32_0+4)
ld (inp64_1+4),hl
ld hl,(inp64_1+6)
ld de,(z32_0+6)
ld (inp64_1+6),hl
jp mul64_final```
```

## Fixed Point Multiplication

If we multiply an A.B fixed-point number by a C.D fixed-point number, we get an (A+C).(B+D) fixed-point number. So for example, if we use 16 bits to encode a 4.12 fixed-point number, and multiply it by another 4.12 fixed point, the result will be 8.24. To convert this to a 4.12, we'll need to discard the top 4 bits of the result, and the bottom 12 bits. Depending on the application, you might also want to check for overflow, in which case you'd need to check the discarded 4 bits at the top.

Fixed point multiplication uses an integer multiplication at its core. For example, an 8.8 fixed-point routine can use a 16-by-16 multiply, with minor adjustments. The result should be interpreted as a 16.16 fixed-point number, so you'll need to grab the 8 bits above and below the decimal in order to keep the output as 8.8 fixed. As well, you'll need special code for handling negative numbers. Putting it together, using the mul16 from the integer multiplication section:

``````mulfixed8_8:
;Multiplies H.L by D.E, stores the result in H.L
; First, find out if the output is positive or negative
ld a,h
xor d
push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
xor d     ;A now has the value of H, since I XORed it with D twice (cancelling)
jp p,mulfixed8_8_lbl1   ;if Positive, don't negate
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
mulfixed8_8_lbl1:
bit 7,d
jr z,mulfixed8_8_lbl2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
mulfixed8_8_lbl2:

; Now we need to put DE in BC to use mul16
ld b,d
ld c,e
call mul16

;Get the middle two bytes, EH, and put them in HL
ld l,h
ld h,e

;We should check for overflow. If D>0, we will set HL to 0x7FFF
ld a,d
or a
jr z,mulfixed8_8_lbl3
ld hl,\$7FFF
mulfixed8_8_lbl3:

; Now we need to restore the sign
pop af
ret p    ;don't need to do anything, result is already positive
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret```
```

Or slightly more complicated 4.12 multiplication, which also uses the mul16 as its core, and only the resizing step changes:

``````mulfixed4_12:
;Multiplies H.L by D.E, stores the result in H.L
; First, find out if the output is positive or negative
ld a,h
xor d
push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
xor d     ;A now has the value of H, since I XORed it with D twice (cancelling)
jp p,mulfixed4_12_lbl1   ;if Positive, don't negate
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
mulfixed4_12_lbl1:
bit 7,d
jr z,mulfixed4_12_lbl2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
mulfixed4_12_lbl2:

; Now we need to put DE in BC to use mul16
ld b,d
ld c,e
call mul16

;The result doesn't need the top 4 bits or bottom 12 bits.
;We'll hold onto the top 4 bits to check overflow, though.
;Currently we need to shift DEH left by 4 bits and keep DE, or right by 12 bits and keep HL.
ld a,h    ;we'll actually be moving the discared bits into A
and \$F0
rla \ rl e \ rl d
rla \ rl e \ rl d
rla \ rl e \ rl d
rla \ rl e \ rl d
ex de,hl

;if A is non-zero, we have overflow
jr z,mulfixed4_12_lbl3
ld hl,\$7FFF
mulfixed4_12_lbl3:

; Now we need to restore the sign
pop af
ret p    ;don't need to do anything, result is already positive
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret```
```

For 16.16 fixed point, we'll use the mul32 from the integer multiplication section.

``````mulfixed16_16:
;Multiplies DE.HL by BC.IX, stores the result in DE.HL
; First, find out if the output is positive or negative
ld a,d
xor b
push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
xor b     ;A now has the value of D, since I XORed it with B twice (cancelling)
jp p,mulfixed16_16_lbl1   ;if Positive, don't negate
xor a
sub l
ld l,a
ld a,0
sbc a,h
ld h,a
ld a,0
sbc a,e
ld e,a
sbc a,a
sub d
ld d,a
mulfixed16_16_lbl1:
bit 7,b
jr z,mulfixed16_16_lbl2
xor a
sub ixl
ld ixl,a
ld a,0
sbc a,ixh
ld ixh,a
ld a,0
sbc a,c
ld c,a
sbc a,a
sub b
ld b,a
mulfixed16_16_lbl2:

; Now we multiply
call mul32

;We should check for overflow. If the upper two bytes are non-zero, we will set the result to 0x7FFFFFFF
ld hl,(z32_0+6)
ld a,h
or l

;Get the middle four bytes and put them in DEHL
ld hl,(z32_0+2)
ld de,(z32_0+4)

;Maybe we need to set the result to 0x7FFFFFFF
jr z,mulfixed16_16_lbl3
ld de,\$7FFF
ld h,e
ld l,e
mulfixed16_16_lbl3:

; Now we need to restore the sign
pop af
ret p    ;don't need to do anything, result is already positive
xor a
ld b,a
sub l
ld l,a
ld a,b
sbc a,h
ld h,a
ld a,b
sbc a,e
ld e,a
sbc a,a
sub d
ld d,a
ret```
```

## Floating Point Multiplication

Floating point multiplication is actually pretty easy compared to FP addition and subtraction.
We'll start with an example: Suppose we have a floating point format that stores a 6-digit mantissa, and digits are base-10. Let's multiply 3.14159*10-1 * 6.54321*103. To do so, we can add the powers of their exponents and this is our new exponent, and directly multiply the mantissas for our new mantissa.
So our result is (3.14149 * 6.54321)*10(-1+3) = 20.5560831039*102. Our mantissa is >=10, so we'll need to increment our exponent and divide our mantissa by 10 (shift right by 1 digit): 2.05560831039*103. Now we need to keep only six digits of the mantissa, and we'll round: 2.05561*103^^. Tada :)

So the steps we do:

• XOR the signs of the inputs. This is the sign bit of our result.
• Add the exponents together to form our new exponent. Be wary of overflow or underflow!
• Multiply their mantissas as 1.n fixed point numbers. If the top digit of the result is non-zero, we'll need to increment the exponent (again, check overflow), and then shift the digits right.
• In practice, I actually shift left by one digit if the top digit is 0, because then it is in the final form we need for our mantissa.
• Truncate the mantissa to the correct size, rounding if desired.

There are Single Precision and Extended Precision examples on GitHub. These also check for "special values" that represent 0, +infinity, -infinity, and Not a Number (NaN).

In case something breaks, here are the current versions of those routines as of this writing:

``````var48 = scrap+4
mulSingle:
;Inputs: HL points to float1, DE points to float2, BC points to where the result is copied
;Outputs: float1*float2 is stored to (BC)
;573+mul24+{0,35}+{0,30}
;min: 1398cc
;max: 2564cc
;avg: 2055.13839751681cc
push af
push hl
push de
push bc

call +_   ;CHLB
ld a,c
ex de,hl
pop hl
push hl
ld (hl),b \ inc hl
ld (hl),e \ inc hl
ld (hl),d \ inc hl
ld (hl),a
pop bc
pop de
pop hl
pop af
ret

_:
;;return float in CHLB
push de
ld e,(hl)
inc hl
ld d,(hl)
inc hl
ld c,(hl)
inc hl
ld a,(hl)
or a
jr z,mulSingle_case0
ex de,hl
ex (sp),hl
ld e,(hl)
inc hl
ld d,(hl)
inc hl
ld b,(hl)
inc hl
inc (hl)
dec (hl)
jr z,mulSingle_case1
pop hl          ; |
rra             ; |Lots of help from Runer112 and
adc a,a         ; |calc84maniac for optimizing
jp po,bad       ; |this exponent check.
xor 80h         ; |
jr z,underflow  ;/
push af         ;exponent
ld a,b
xor c
push af         ;sign
set 7,b
set 7,c
pop de
ld a,e
pop de
jp m,+_
rl c
rl b
dec d
_:
inc d
jr z,overflow
rl c
ld c,d
ld de,0
push af
ld a,b
ld b,a
jr nc,+_
inc c \ jr z,overflow
rr h
rr l
rr b
_:
pop af
cpl
and \$80
xor h
ld h,a
ret
jr nc,overflow
underflow:
ld hl,0
rl b
rr h
ld c,l
ld b,l
ret
overflow:
ld hl,\$8000
jr underflow+3
mulSingle_case1:
;x*0   -> 0
;x*inf -> inf
;x*NaN -> NaN
pop hl
ld h,b
ld l,d
ld b,e
ld c,0
ret
mulSingle_case0:
;special*x = special
;NaN*x = NaN
;0*0 = 0
;0*NaN = NaN
;0*Inf = NaN
;Inf*Inf  = Inf
;Inf*-Inf =-Inf
;0CDE
pop hl
inc hl
inc hl
inc hl
ld a,(hl)
or a
jr z,+_
ld h,c
ld c,0
ret
_:
dec hl
ld b,(hl)
;basically, if b|c has bit 5 set, return NaN
ld a,b
or c
ld h,\$20
and h
jr z,+_
ld c,0
ret
_:
ld a,c
xor b
rl b
rlca
rr b
res 4,b

rl c
rrca
rr c

ld a,c
and \$E0
rra
ld h,a
ld c,0
ret```
```

xmul (extended precision, 80-bit floats):

``````#include "routines/pushpop.z80"
#include "routines/mov.z80"
#include "mul/mul64.z80"
#include "routines/rl64.z80"

#define var_z xOP3+16
;uses 60 bytes after xOP1
xmul:
;Input:
;  HL points to one number
;  DE points to another
;Timing, excluding special cases (which take ~ 800cc):
;1057+{0,3}+{0,172}+mul64
;max: 1232+max(mul64)
;     11245cc
;min: 1057+min(mul64)
;     6688cc
;avg: 1144.5+avg(mul64)
;     9865.233ccs

push hl
push de
push bc
push af
push ix
push bc
call +_
pop hl
push de
ex de,hl
ld hl,var_z+8
call mov8
ex de,hl
pop de
ld (hl),e
inc hl
ld (hl),d
pop ix
pop af
pop bc
pop de
pop hl
ret
_:
push de
ld de,xOP1
call mov10
pop hl
call mov10
ld de,(xOP2+8)
ld hl,(xOP1+8)
xmul_stepin_geomean:
ld a,h
xor d
ld b,a
res 7,d
res 7,h
ld a,h \ or l \ jp z,casemul
ld a,d \ or e \ jp z,casemul2
ld de,\$4000
sbc hl,de
jp c,mul_zero
jp m,mul_inf
sla b
jr nc,+_
set 7,h
_:
push hl
call mul64
ld a,(var_z+15)
pop de
jr c,+_
#ifdef inc_FMA
ld hl,var_z
call rl64
#else
ld hl,var_z+7
sla (hl)
#endif
inc hl
jp rl64

_:
inc de
ret

casemul:
;xOP1 is inf/nan/0
ld hl,xOP2+9
ld a,(hl)
dec hl
or (hl)
dec hl
ld a,(hl)
ld hl,xOP2
jr nz,casemul2_copy
;now we have two special cases to multipy together
;inf*inf-> inf
;0*0    -> 0
;
;nan*nan-> NaN
;inf*nan-> NaN
;inf*0  -> NaN
;nan*inf-> NaN
;nan*0  -> NaN
;0*inf  -> NaN
;0*nan  -> NaN

sla b
ld de,0
rr d
and \$C0
ld c,a
ld a,(xOP1+7)
and \$C0
cp c
jr z,+_
ld a,\$40
_:
ld (var_z+15),a
ret
casemul2:
;finite times inf/nan/0, so xOP1 -> out
ld hl,xOP1
casemul2_copy:
ld de,var_z+8
call mov8
ld e,(hl)
inc hl
ld d,(hl)
ret
mul_zero:
xor a
ld (var_z+15),a
ld d,a
ret
mul_inf:
ld d,e
ld a,255
ld (var_z+15),a
ret```
```

# Division

Division is usually considered to be a slow operation, but in my experience, at high precision, it is only a little slower than multiplication if done right.

## "Schoolbook" Division

Division can be done just like base-10 division most of us learned in school, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (0) or bigger (1) ?"

Here is the pseudocode:

``````N is the numerator
D is the denominator
R is the remainder
Q is the quotient

While N>0:
shift Q left one bit to make room for the next digit
left-shift the top bit of N into R, removing the bit from N.
If R>=D
R - D -> R
Q + 1 -> Q

Return Q```
```

In Z80 assembly, this might look like

``````Div_HL_D:
;Inputs:
;   HL and D
;Outputs:
;   HL is the quotient (HL/D)
;   A is the remainder
;   B is 0
;   C,D,E are preserved
xor a         ; Clear upper eight bits of AHL
ld b,16       ; Sixteen bits in dividend
_loop:
add hl,hl     ; Do a "SLA HL". If the upper bit was 1, the c flag is set
rla           ; This moves the upper bits of the dividend into A
jr c,_overflow; If D is allowed to be >128, then it is possible for A to overflow here. (Yes future Zeda, 128 is "safe.")
cp d          ; Check if we can subtract the divisor
jr c,_skip    ; Carry means A < D
_overflow:
sub d         ; Do subtraction for real this time
inc l         ; Set the next bit of the quotient (currently bit 0)
_skip:
djnz _loop
ret```
```

Notice here that we are using HL as both our 'N' value and our 'Q' value from the pseudocode.
'A' acts as our remainder, and D acts as our dividend.

## Special Case Division

### Division by a power of 2

This is easy: just right-shift! If you want to divide by 2, right shift once. Divide by 4? Right shift twice. Divide by 8? Right shift three times…

### Division by Constants

Here is an example of how to divide by a constant. Suppose I want to divide an 8-bit integer, x, by 7, and return the result as an 8-bit integer. Well, 1/7 is approximately 73/512, so I can get a pretty close result with x*73/512. Division by 512 will be easy: Just take the top 8 bits from the multiplication, and shift right. For example:

``````;A/7
ld l,a
ld h,0
ld b,h
ld c,l
ld a,h
rra         ;carry flag is always reset in the case, so no need to worry
adc a,b     ;just to round. B is 0
ret```
```

This isn't perfect, but it is pretty good, and fast.

### Turning Division Into Multiplication

You can turn division into multiplication as follows:
Suppose you can write your division as x/(1-y) where |y|<1. Then you can rewrite your division as:

x*(1+y)*(1+y2)*(1+y4)*(1+y8)*(1+y16)*…

So some pseudocode might be:

``````while y>10^-15    ;basically, until y is very small
x + x*y -> x
y * y   -> y```
```

### Division by 2^n - 1

If we want to perform x/(2n-1), then we can rewrite this as 2-n * x/(1-2-n). Using the technique above, this gives us the following pseudocode:

``````x + (x>>n) -> x
x + (x>>(2n)) -> x
x + (x>>(4n)) -> x
x + (x>>(8n)) -> x
x + (x>>(16n))-> x
...
return x>>n```
```

For examples:

#### Division by 3

In this case, 3=22-1, so we can use the code:

``````x + (x>>2) -> x
x + (x>>4) -> x
x + (x>>8) -> x
x + (x>>16)-> x
...
return x>>2```
```

For example, here is a way to implement HL/3 with some rounding errors:
``````;HL/2   ;we'll do this to prevent overflow, and divide by 2 at the end instead of 4
srl h \ rr l

;HL + (HL>>2) --> HL
ld b,h
ld c,l
srl b \ rr c
srl b \ rr c

;HL + (HL>>4) --> HL
ld b,h
ld c,l
srl b \ rr c
srl b \ rr c
srl b \ rr c
srl b \ rr c

;HL + (HL>>8) --> HL
ld b,0
ld c,h

HL/2
srl h \ rr l```
```

Since left shifts can often be done cheaper, let's try this with left-shifts, and then do a bunch of right-shifts at the end:

``````;We want a difference of a factor of 2 shifts
ld b,h
ld c,l
xor a
ld e,a

;We want a difference of a factor of 4 shifts
ld b,h
ld c,l
ld e,a

;We want a difference of a factor of 8 shifts
ld d,a
ld e,h
ld b,l
ld c,0
ld e,a
jr nc,\$+3
inc d

;Now we need to shift right by 2+4+8 plus a final 2, so shift right by 16
;This means we just keep DE as our result!```
```

But knowing that we are going to shift right by 16 at the end, we can optimize the final iteration:

``````;We want a difference of a factor of 2 shifts
ld b,h
ld c,l
xor a
ld e,a

;We want a difference of a factor of 4 shifts
ld b,h
ld c,l
ld e,a

ld b,a
ld c,h
ld l,h
ld h,a
;now HL is our result```
```

Finally, it turns out that the accuracy issues occur at products of 3, and we can fix that simply by incrementing HL:

``````HL_Div_3:
;Input: HL
;Output: HL is the input divided by 3
;Destroys: B,C,E,A
;217cc

;increment HL, putting overflow in A
ld bc,1
ld a,b

;We want a difference of a factor of 2 shifts
ld b,h
ld c,l
ld e,a

;We want a difference of a factor of 4 shifts
ld b,h
ld c,l
ld e,a

ld b,a
ld c,h
ld l,h
ld h,a
;now HL is our result

ret```
```

### Division by 2^n + 1

If we want to perform x/(2n+1), then we can rewrite this as 2-n * x/(1+2-n). Using the technique for turning division into multiplication, this gives us the following pseudocode:

``````x - (x>>n) -> x
x + (x>>(2n)) -> x
x + (x>>(4n)) -> x
x + (x>>(8n)) -> x
x + (x>>(16n))-> x
...
return x>>n```
```

## Division, BC/DE

In this section, we'll explore the integer division routine for dividing BC by DE and various ways to optimize it. We'll start off with our baseline routine, and note that our accumulator will never overflow (we'll use a 16-bit accumulator, and up to 16 bits will be shifted in), so we don't need to take care of that as we did in the HL/D routine.

``````BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;1382cc
;23 bytes

ld hl,0    ;Accumulator
ld a,16    ;Loop counter

div_loop:
;shift the bits from BC (numerator) into HL (accumulator)
sla c \ rl b

;Check if remainder >= denominator (HL>=DE)
sbc hl,de
inc c
jr div_loop_done

; remainder is not >= denominator, so we have to add DE back to HL

div_loop_done:
dec a
jr nz,div_loop
ret```
```

Now that we have established our baseline routine, let's add our first minor optimization. In particular, let's fix up this part:

``````  jr c,div_loop_readjust
inc c
jr div_loop_done

; remainder is not >= denominator, so we have to add DE bBCk to HL
```

The instruction ‘cp xx`, where `xx` is a constant, is compiled to `FExx`. As long as we don’t need to preserve flags, we can skip a one-byte instruction like this:
``````  jr +_
inc c
_:

Becomes ====>

.db \$FE
inc c```
```

This saves one byte and 4cc. So reorganizing the section of code above:

``````  jr c,div_loop_readjust
inc c
jr div_loop_done

; remainder is not >= denominator, so we have to add DE bBCk to HL

Becomes ====>

jr nc,divinc_acc      ;change the condition from `c` to `nc`
.db \$FE     ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
inc c```
```

Applying this, we now have:

``````BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;min: 1270cc
;max: 1414cc
;avg: 1342cc
;22 bytes

ld hl,0
ld a,16

div_loop:
sla c \ rl b
sbc hl,de
jr nc,divinc_acc
.db \$FE     ;this begins the instruction `cp *`, so it eats the `inc c`
div_inc_acc:
inc c

div_loop_done:
dec a
jr nz,div_loop
ret```
```

This change saves one byte and on average 40cc, but saving up to to 112cc or costing up to 32cc depending on inputs.

There is another cheap optimization— using B instead of A as the loop counter.
It adds 2 bytes to overhead, but saves 2 bytes in the loop for no net change in size. And while it adds 8cc to overhead, it saves 112cc in the main loop, for a net savings of 104cc.

``````BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;min: 1166cc
;max: 1310cc
;avg: 1238cc
;22 bytes
ld hl,0
ld a,b
ld b,16

div_loop:
;shift the bits from BC into HL
sla c \ rla
sbc hl,de
jr nc,div_inc_acc
.db \$FE     ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
inc c

div_loop_done:
djnz div_loop
ld b,a
ret```
```

Now, even the worst case is better than the original, and we've saved a byte!

But this next trick adds 5 bytes onto the previous routine. If we negate DE, then we can swap whether we are doing sbc or add when comparing HL to DE:

``````BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 1124cc
;max: 1332cc
;avg: 1228cc
;27 bytes
xor a
ld h,a  ;take advantage of A=0 to set hl to 0 saving 2cc and 1 byte.
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a

ld a,b
ld b,16

div_loop:
;shift the bits from BC into HL
sla c \ rla
jr c,divinc_acc
sbc hl,de
.db \$FE     ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
inc c

div_loop_done:
djnz div_loop
ld b,a
ret```
```

Loop: -32cc on average, up to +0cc, as low as -64cc
Net: +5 bytes, -10cc on average, as much as +22cc, as low as -42cc

But now we have that the comparison leaves carry set when C is incremented, and reset when it is not. If we use `rl c` instead of `sla c`, we can just shift in the next bit:

``````BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 1072cc
;max: 1232cc
;avg: 1152cc
;28 bytes
xor a
ld h,a
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a

ld a,b
ld b,16

div_loop:
;shift the bits from BC into HL
rl c \ rla
jr c,div_loop_done
sbc hl,de

div_loop_done:
djnz div_loop
rl c \ rla
ld b,a
ret```
```

This adds 3 bytes and 12cc to overhead, but saves 2 byte in the main loop and 88cc on average, for a net of +1 byte, -76cc.

And now, if we really want speed, we can unroll it, and take advantage of only needing to shift one byte at a time into HL:

``````BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 738cc
;max: 898cc
;avg: 818cc
;144 bytes
xor a
ld h,a
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a

ld a,b
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla
ld b,a

ld a,c
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla
ld c,a

ret```
```

It costs +115 bytes (making the routine just over 5 times larger), but saves 334cc in all cases (29% improvement on average).

If you really enjoy the speed improvements, but don't want to take up that much space, there is a cheap trade off by taking advantage of subroutines:

``````;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 773cc
;max: 933cc
;avg: 853cc
;82 bytes
xor a
ld h,a
ld l,a
sub e
ld e,a
sbc a,a
sub d
ld d,a

ld a,b
ld b,c
call BC_Div_DE_sub
ld a,b
ld b,c

BC_Div_DE_sub:
;min: 354cc
;max: 434cc
;avg: 394cc
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla \ adc hl,hl \ add hl,de \ jr c,\$+4 \ sbc hl,de
rla
ld c,a
ret```
```

Basically, we traded 35cc for a savings of 61 bytes.

## Fixed Point Division

Suppose that we want to divide H.L by D.E. This is basically (HL/256)/(DE/256) = HL/DE. However, we can't directly use an integer routine, as we'd need 8 more bits of precision. You would have to apply this similarly to other formats.

### 8.8/8.8 Fixed Point Division

So we'll start off like an integer division, but we'll go an extra 8 iterations, shifting in zeros.

``````BC_Div_DE_88:
;Inputs:
;     DE,BC are 8.8 Fixed Point numbers
;Outputs:
;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
;if DE is 0 : 122cc or 136cc if BC is negative
;if |BC|>=128*|DE| : 152cc or 166cc if BC is negative
;Otherwise:
;min: 1107cc
;max: 1319cc
;avg: 1201cc

; First, find out if the output is positive or negative
ld a,b
xor d
push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
xor b     ;A now has the value of B, since I XORed it with D twice (cancelling)
jp p,BC_Div_DE_88_lbl1   ;if Positive, don't negate
xor a
sub c
ld c,a
sbc a,a
sub b
ld b,a
BC_Div_DE_88_lbl1:

;now make DE negative to optimize the remainder comparison
ld a,d
or d
jp m,BC_Div_DE_88_lbl2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
BC_Div_DE_88_lbl2:

;if DE is 0, we can call it an overflow
;A is the current value of D
or e
jr z,div_fixed88_overflow

;The accumulator gets set to B if no overflow.
;We can use H=0 to save a few cc in the meantime
ld h,0

;if B+DE>=0, then we'll have overflow
ld a,b
ld a,d
jr c,div_fixed88_overflow

;Now we can load the accumulator/remainder with B
ld l,b

ld a,c
call div_fixed88_sub
ld c,a

ld a,b      ;A is now 0
call div_fixed88_sub

ld d,c
ld e,a
pop af
ret p
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret

div_fixed88_overflow:
ld de,\$7FFF
pop af
ret p
inc de
inc e
ret

div_fixed88_sub:
;min: 456cc
;max: 536cc
;avg: 496cc
ld b,8
BC_Div_DE_88_lbl3:
rla
jr c,\$+4
sbc hl,de
djnz BC_Div_DE_88_lbl3
ret```
```

## Floating Point Division

See divSingle and xdiv from the z80float project.

# Square Root

## Floating Point Square Root

See sqrtSingle and xsqrt from the z80float project.

# Exponential

## Floating Point Exponential

See the following:
expSingle
xexp
pow2Single
xpow2
pow10Single
xpow10
powSingle
xpow

# Logarithm

## Fixed Point Logarithm

NOTE: These routines are from an older edit. It's cool, but I think I've come across other viable (better) options.

The following code is based on the approximation:

````ln(x+1) : x(8+x)/(8+5x) on [0,1]`
```

For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instructions. So to put the functions into code:
``````lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
dec h
dec h
jr nz,\$+8
inc l \ dec l
ret nz
ld l,177
ret
inc h
ret nz
ld b,h
ld c,l
ld e,l
ld d,8
ex de,hl

ld h,a \ ld l,b
sla h \ jr c,\$+3 \ ld l,c
rl l
ld a,h
ld l,a
ld h,b
cp a
ret```
```

But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):

``````lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;     returns nc if input is negative (HL not modified)
;Error:
;    The error on the outputs is as follows:
;        20592 inputs are exact
;        12075 inputs are off by 1/256
;        100 inputs are off by 2/256
;    So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

ld a,h \ or l \ jr nz,\$+5
ld h,80h \ ret
dec h
dec h
jr nz,\$+9
inc l \ dec l
jr nz,normalizeln-1
ld l,177
ret
inc h
jr nz,normalizeln
ld b,h
ld c,l
ld e,l
ld d,8
ex de,hl
call HL_Div_DE
ld h,a \ ld l,b
sla h \ jr c,\$+3 \ ld l,c
rl l
ld a,h
ld h,b
ld l,a
scf
ret
HL_Div_DE:

inc h
normalizeln:
xor a
inc h \ ret m
ld d,a \ ld e,a
ld a,l
jr z,toosmall
inc e \ srl h \ rra \ jr nz,\$-4
rla \ rl h
dec e
stepin:
ld l,a
push de
call lognat
pop de
;now multiply DE by 355, then divide by 2 (rounding)
ld b,d \ ld c,e \ ld a,d
ex de,hl
sra h \ rr l
scf
ret
toosmall:
dec d
dec e \ add a,a \ jr nc,\$-2
inc h
jp stepin```
```

Or a table driven method for log2:
Because we are working in binary, lg(), which is log2() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)-lg(2) = lg(x)-1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)-1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead—all of the odd numbers. And now we can build our algorithm:

``````lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
ld de,lgLUT
ld b,0
ld a,h
or a
ret m     ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex)
ld a,l
jr z,\$+8
inc b \ srl h \ rra \ jr nz,\$-4   ;input >2, so we rotate, using b as the accumulator
or a \ jr nz,\$+6
ld hl,8000h \ ret     ;If input was 0, use this as -inf
rra \ inc b \ jr nc,\$-2      ;input<1, so we rotate
;A is the element to look up in the LUT
ld l,a
ld c,h
dec b
ld e,(hl)
inc hl
ld d,(hl)
ex de,hl
ret
lglut:
.dw \$F800
.dw \$F996
.dw \$FA52
.dw \$FACF
.dw \$FB2C
.dw \$FB76
.dw \$FBB3
.dw \$FBE8
.dw \$FC16
.dw \$FC3F
.dw \$FC64
.dw \$FC86
.dw \$FCA5
.dw \$FCC1
.dw \$FCDC
.dw \$FCF4
.dw \$FD0B
.dw \$FD21
.dw \$FD36
.dw \$FD49
.dw \$FD5C
.dw \$FD6D
.dw \$FD7E
.dw \$FD8E
.dw \$FD9D
.dw \$FDAC
.dw \$FDBA
.dw \$FDC8
.dw \$FDD5
.dw \$FDE2
.dw \$FDEE
.dw \$FDFA
.dw \$FE06
.dw \$FE11
.dw \$FE1C
.dw \$FE26
.dw \$FE31
.dw \$FE3B
.dw \$FE44
.dw \$FE4E
.dw \$FE57
.dw \$FE60
.dw \$FE69
.dw \$FE71
.dw \$FE7A
.dw \$FE82
.dw \$FE8A
.dw \$FE92
.dw \$FE9A
.dw \$FEA1
.dw \$FEA9
.dw \$FEB0
.dw \$FEB7
.dw \$FEBE
.dw \$FEC5
.dw \$FECB
.dw \$FED2
.dw \$FED8
.dw \$FEDF
.dw \$FEE5
.dw \$FEEB
.dw \$FEF1
.dw \$FEF7
.dw \$FEFD
.dw \$FF03
.dw \$FF09
.dw \$FF0E
.dw \$FF14
.dw \$FF19
.dw \$FF1E
.dw \$FF24
.dw \$FF29
.dw \$FF2E
.dw \$FF33
.dw \$FF38
.dw \$FF3D
.dw \$FF42
.dw \$FF47
.dw \$FF4B
.dw \$FF50
.dw \$FF55
.dw \$FF59
.dw \$FF5E
.dw \$FF62
.dw \$FF67
.dw \$FF6B
.dw \$FF6F
.dw \$FF74
.dw \$FF78
.dw \$FF7C
.dw \$FF80
.dw \$FF84
.dw \$FF88
.dw \$FF8C
.dw \$FF90
.dw \$FF94
.dw \$FF98
.dw \$FF9B
.dw \$FF9F
.dw \$FFA3
.dw \$FFA7
.dw \$FFAA
.dw \$FFAE
.dw \$FFB2
.dw \$FFB5
.dw \$FFB9
.dw \$FFBC
.dw \$FFC0
.dw \$FFC3
.dw \$FFC6
.dw \$FFCA
.dw \$FFCD
.dw \$FFD0
.dw \$FFD4
.dw \$FFD7
.dw \$FFDA
.dw \$FFDD
.dw \$FFE0
.dw \$FFE4
.dw \$FFE7
.dw \$FFEA
.dw \$FFED
.dw \$FFF0
.dw \$FFF3
.dw \$FFF6
.dw \$FFF9
.dw \$FFFC
.dw \$FFFF```
```

Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):
``````ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
call lg_88
;now signed multiply HL by 355, then divide by 2 (rounding)
ld de,0
bit 7,h
jr z,\$+9
dec e \ xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ld b,h
ld c,l
xor a
sra a \ rr h
ld l,h
ld h,a
inc e
ret nz
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ret```
```

## Floating Point Logarithm

See the following:
lnSingle (Single Precision Natural Log)
lgSingle (Single Precision Log Base 2)
log10Single (Single Precision Log Base 10)
logSingle (Single Precision Log Base y)

xln (Extended Precision Natural Log)
xlg (Extended Precision Log Base 2)
xlog10 (Extended Precision Log Base 10)
xlog (Extended Precision Log Base y)

# Cosine/Sine

## Floating Point Cosine/Sine

See the following:
sinSingle (Single Precision Sine)
cosSingle (Single Precision Cosine)
xsin (Extended Precision Sine)
xcos (Extended Precision Cosine)

# Arctangent

## Fixed Point Arctangent

NOTE: These routines are from an older edit. It's cool, but I think I've come across other viable (better) options.

The following code is based on the approximation:

````atan(x) : x(9+2x^2)/(9+5x^2) on [-1,1]`
```

For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instructions. So to put the function into code:
``````ArcTan_88:
;Inputs: DE is positive
ld a,d
or a
jr z,ArcTan+2
inc a
jr z,ArcTan+2
ld b,d
ld c,e
ld de,256
call DE_Div_BC_88
call ArcTan
ex de,hl
or a
sbc hl,de
ret
ArcTan:
;Input: DE
;Output: HL
push de    ;save X for later
ld b,d
ld c,e
call BC_Times_DE ;x^2 computed
ld d,h
ld e,l
add hl,de   ;5x^2 is now in HL
ld a,9
ld h,a      ;9+5x^2->HL
ld a,9
sla e
rl d
ld d,a      ;9+2x^2->DE
ld b,h
ld c,l
call DE_Div_BC_88   ;->DE
pop bc
jp BC_Times_DE```
```

Table driven arctangent can be implemented, too.
Arctan can be computed on the whole real line just from the values on [0,1] by the identities: arctan(-x)=-arctan(x) and arctan(x)=pi/2-arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16-bit numbers, we can just use 8-bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:
``````arctan_88:
;Input:
;    D.E
;Output: atan(D.E)->D.E
push de
ld a,d
or a
jp p,\$+5
neg
ld d,a
dec a
jr nz,checkneedinv
inc e \ dec e \ jr nz,checkneedinv
pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
inc a
call nz,DEgt1_Inv
;0.E is the value to atan
push hl
ld a,e
cp 46 \ ret c
dec a \ cp 42h \ ret c
dec a \ cp 4Eh \ ret c
dec a \ cp 57h \ ret c
dec a \ cp 5Eh \ ret c
dec a \ cp 64h \ ret c
dec a \ cp 6Ah \ ret c
dec a \ cp 6Fh \ ret c
sub 6Fh \ ld e,a
ld hl,atanlut
ld a,(hl)
ret
ld e,a
pop bc
ld a,b
or a
jp p,\$+5
neg
jr z,\$+9
ld hl,402
or a
sbc hl,de
ex de,hl
rl b
ret nc
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret
DEgt1_Inv:
;Works if DE>1
ld hl,256
ld b,8
InvLoop:
sbc hl,de
jr nc,\$+3
djnz InvLoop
cpl
ld e,a
ld d,b
ret
atanlut:
.db \$6F
.db \$6F
.db \$70
.db \$71
.db \$72
.db \$73
.db \$73
.db \$74
.db \$75
.db \$76
.db \$77
.db \$77
.db \$78
.db \$79
.db \$7A
.db \$7B
.db \$7B
.db \$7C
.db \$7D
.db \$7E
.db \$7F
.db \$7F
.db \$80
.db \$81
.db \$82
.db \$82
.db \$83
.db \$84
.db \$85
.db \$85
.db \$86
.db \$87
.db \$88
.db \$88
.db \$89
.db \$8A
.db \$8B
.db \$8B
.db \$8C
.db \$8D
.db \$8E
.db \$8E
.db \$8F
.db \$90
.db \$90
.db \$91
.db \$92
.db \$93
.db \$93
.db \$94
.db \$95
.db \$95
.db \$96
.db \$97
.db \$97
.db \$98
.db \$99
.db \$9A
.db \$9A
.db \$9B
.db \$9C
.db \$9C
.db \$9D
.db \$9E
.db \$9E
.db \$9F
.db \$A0
.db \$A0
.db \$A1
.db \$A2
.db \$A2
.db \$A3
.db \$A3
.db \$A4
.db \$A5
.db \$A5
.db \$A6
.db \$A7
.db \$A7
.db \$A8
.db \$A9
.db \$A9
.db \$AA
.db \$AA
.db \$AB
.db \$AC
.db \$AC
.db \$AE
.db \$AF
.db \$AF
.db \$B0
.db \$B0
.db \$B1
.db \$B2
.db \$B2
.db \$B3
.db \$B3
.db \$B4
.db \$B5
.db \$B5
.db \$B6
.db \$B6
.db \$B7
.db \$B7
.db \$B8
.db \$B9
.db \$B9
.db \$BA
.db \$BA
.db \$BB
.db \$BB
.db \$BC
.db \$BC
.db \$BD
.db \$BE
.db \$BE
.db \$BF
.db \$BF
.db \$C0
.db \$C0
.db \$C1
.db \$C1
.db \$C2
.db \$C2
.db \$C3
.db \$C3
.db \$C4
.db \$C4
.db \$C5
.db \$C6
.db \$C6
.db \$C7
.db \$C7
.db \$C8
.db \$C8
.db \$C9```
```

## Floating Point Arctangent

See the following:
atanSingle (Single Precision Arctangent)
xatan (Extended Precision Arctangent)

# String Conversions

## Floating Point String Conversions

See the following:
strToSingle (Single Precision String To Float)
strtox (Extended Precision String To Float)
singleToStr (Single Precision Float To String)
xtostr (Extended Precision Float To String)

# Random Number Generators

## rand16

``````;#define smc    ;uncomment if you are using SMC
rand16:
;;collaboration by Zeda with Runer112
;;160cc or 148cc if using SMC
;;26 bytes
;;cycle: 4,294,901,760 (almost 4.3 billion)
#ifdef smc
seed1=\$+1
ld hl,9999
#else
ld hl,(seed1)
#endif
ld b,h
ld c,l
inc l
ld (seed1),hl
#ifdef smc
seed2=\$+1
ld hl,9999
#else
ld hl,(seed2)
#endif
sbc a,a
and %00101101
xor l
ld l,a
ld (seed2),hl
ret```
```

## rand24

``````rand24:
;;219cc
#ifdef smc
seed1_0=\$+1
ld hl,12345
seed1_1=\$+1
ld a,67
#else
ld hl,(seed1_0)
ld a,(seed1_1)
#endif
ld b,h
ld c,l
ld d,a
inc l
ld (seed1_0),hl
ld (seed1_1),a
ld c,b
ld b,a
#ifdef smc
seed2_0=\$+1
ld hl,65432
seed2_1=\$+1
ld a,10
#else
ld hl,(seed2_0)
ld a,(seed2_1)
#endif
rla
ld (seed2_1),a
sbc a,a
and %10000111
xor l
ld l,a
ld (seed2_0),hl
ret```
```

## rand32

``````rand32:
;;Tested and passes all CAcert tests
;;Uses a very simple 32-bit LCG and 32-bit LFSR
;;it has a period of 18,446,744,069,414,584,320
;;roughly 18.4 quintillion.
;;LFSR taps: 0,2,6,7  = 11000101
;;291cc
;;Thanks to Runer112 for his help on optimizing the LCG and suggesting to try the much simpler LCG. On their own, the two are terrible, but together they are great.
seed1_0=\$+1
ld hl,12345
seed1_1=\$+1
ld de,6789
ld b,h
ld c,l
add hl,hl \ rl e \ rl d
add hl,hl \ rl e \ rl d
inc l
ld (seed1_0),hl
ld hl,(seed1_1)
ld (seed1_1),hl
ex de,hl
;;lfsr
seed2_0=\$+1
ld hl,9876
seed2_1=\$+1
ld bc,54321
add hl,hl \ rl c \ rl b
ld (seed2_1),bc
sbc a,a
and %11000101
xor l
ld l,a
ld (seed2_0),hl
ex de,hl
ret```
```

## rand, Floating-point

For floating-point random numbers, we'll need to take a little extra care to make the distribution as uniform as possible. We'll generate a float on [0,1), but we want to make sure to select numbers from [0,2^-n) with probability 2^-n, for example. To do this, we'll first generate the exponent by initializing to 0, then decrement the exponent and generate a random bit, repeating this process until a 1 is generated, or the exponent is too small for our floats. Finally, we generate our mantissa in normalized form.

(This means the exponent is -1 with 50% probability, -2 with 25% probability, etc.)

So as an example for a single-precision random float generator

``````randSingle:
;BC points to where the result is output.
;Stores a pseudo-random number on [0,1)
;A value of 0 is extraordinarily unlikely.

push bc

;The first thing we'll do is generate the exponent.
;Initialize to -1 (stored as 0x7F) and we'll generate
;random bits, decrementing the exponent until we get a 1.

ld c,\$80    ;exponent
jr rand_gen_exp
_:
dec c
jr c,+_   ;we have completed calculation of the exponent
djnz -_

;make sure the exponent isn't zero!
inc c
dec c
jr z,rand_zero

rand_gen_exp:
push bc
call rand
pop bc
ld b,16
jr -_
_:

; Now we have generated the exponent, let's generate the mantissa

push bc
call rand
push hl
call rand
pop de
pop bc
ld b,h
;Now our float is stored in CBDE

; Make the float positive
res 7,b

.db \$FE   ;start of `cp *` to skip `ld b,c`
rand_zero:
ld b,c

pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
inc hl
ld (hl),c
ret```
```

Or an extended-precision routine:

``````xrand:
;Stores a pseudo-random number on [0,1)
;A value of 0 should be practically impossible in this universe.
push bc

;The first thing we'll do is generate the exponent.
;Initialize to 0 (stored as 0x4000) and we'll generate
;random bits, decrementing the exponent until we get a 1.

ld bc,\$4000    ;exponent
jr rand_gen_exp
_:
dec bc
jr c,+_   ;we have completed calculation of the exponent
dec a
jr nz,-_

;make sure the exponent isn't zero!
ld a,b
or c
jr z,xrand_zero

rand_gen_exp:
push bc
call rand
pop bc
ld a,16
jr -_
_:

; Now we have generated the exponent, let's generate the mantissa
pop ix    ;pointer to the output
ld (ix+8),c
ld (ix+9),b

call rand
ld (ix),h
ld (ix+1),l

call rand
ld (ix+2),h
ld (ix+3),l

call rand
ld (ix+4),l   ;it's "random" anyways, just fun to change it up :P
ld (ix+5),h

call rand
ld (ix+6),h
ld (ix+7),l

ret

xrand_zero:
pop hl

set_xzero:
;HL points to where to write 0 as an extended precision float.
xor a
ld b,10
_:
ld (hl),a
inc hl
djnz -_
ret```
```

# High-Precision On-the-Fly Computation of Various Constants

## arctangent(t/(2k+s)

page revision: 24, last edited: 21 Mar 2021 00:02