Floating Point

# Floating Point Tutorial

Although integer math may seem more natural for us when we program the Z80, floating point math that works with fractional numbers is not much more difficult. The only change is in representation and how we program the handling of the data. After reading this, hopefully the reader will not feel limited to integer math and not limited to even the representations here. Don't just use a 'sign bit' if you think it would be more mathematically sound to have a bit for additive inverses and multiplicative inverses. Add complex arithmetic, or try making Rational representations for sometimes many, many more bits of precision and a smaller risk for roundoff errors.

## Format

The typical format includes a sign bit (additive inverse, positive/negative) an exponent, and a mantissa. TI, for example, uses 1 bit for sign, 8 bits for the exponent, and 7 bytes storing 14 base 10 digits. If the bytes were 0083 31 41 59 26 53 58 98, it would be read as sign=0, exponent=83h-80h=3, mantissa=3.1415926535898, so the value represented is 3.1415926535898*103=3141.5926535898.

Generally, the math has a wider range and better precision if we use a base of power of 2. Depending on the usage, we tend to try to pack all of the data to fill 8-bit multiples. TI uses an extra 8 bits to store the 1 sign bit, but in their application, they also use that byte to store variable type. They actually took advantage of unused bits that had to be there anyways.

### Sign Bit

The sign bit is just a 0 or 1 that determines if a number is positive or negative.

### Exponent

To decide a good range for the exponent, you will need to pick a base. TI used 10, but generally, 2 might be better or 256 so that we use whole bytes as digits. Essentially, we will use something like 2exponent*mantissa. It is generally easier to make the middle number in your range as 0. If you are using an 8-bit exponent, 80h=0 for the exponent.

### Mantissa

The mantissa is best kept "normalized" by making sure we don't have leading zeroes. If we are using a base of 2, then a mantissa of %0011111100000000 is usually not what we want. Instead, we shift it left twice, which means we decrement the exponent by 2.
++Examples
If we decided to have 1 bit for sign, 7 bits for the exponent, and 16 bits for the mantissa, using base 2, then the following would represent 1:

``````%01000000   ;sign=0, exponent=64 (treated as 0)
%1000000000000000 ;mantissa=1.0```
```

And likewise, 2 would be just multiplying by 2 which is the same as increasing the exponent by 1:
``````%01000001   ;sign=0, exponent=65 (65-64=1)
%1000000000000000 ;mantissa=1.0
1.0*2^1=2```
```

So if we think of the exponent as the location of the decimal, 2 is 10.00000000000000 and 1 is 1.000000000000000. If we want to perform some arithmetic…

This can be a bit difficult because we have to keep track of the sign of the inputs and the result, but we also have to line up the 'decimal points' (the exponent). In the above examples for the 24-bit floats for 1.0 and 2.0, to line up the decimals, we need to rotate the mantissa of 1 to the right (and likewise, increment the exponent:

``````%01000001   ;sign=0, exponent=65 (treated as 1)
%0100000000000000 ;mantissa=0.5
0.5*2^1=1```
```

Note that 1 is now represented as a denormalized number, but we can directly add the numbers by adding their mantissas:
``````%01000001   ;sign=0, exponent=65 (treated as 1)
%1100000000000000 ;mantissa=1.5
1.5*2^1=3```
```

If there was overflow, we would have to rotate the result right to get the upper bit of the result back in the number and then increment the exponent. For example, if we added 2 to this, we would get %0100000000000000 as the 16-bit addition, but the overflow (carry flag) would be a 1, so we would rotate it back in, dropping the last bit (in our case, a 0):
``````%01000010   ;sign=0, exponent=66 (treated as 2)
%1010000000000000 ;mantissa=1.25
1.25*2^2=5```
```

Adding two negative numbers -a+-b is the same as -(a+b) so it is just like adding two positive numbers and you don't need to change the sign. Adding a positive and negative number will require a subtraction of their absolute values. For example, 2+-1 would require rotating the mantissas as in addition, but then subtract %1000…-%0100… to get %0100… which needs to be renormalised. In this case, it is still a positive result. If there was overflow (on the z80, the c flag is set), then our result is negative, but we also need to perform a 2s compliment on the mantissa (that is, negate it as if it were an integer). So in the case of 1-2, we would get:
%01000001 1100000000000000. 2s compliment would return \$01000001 0100000000000000 and then we set the sign as negative and renormalise the number, getting the expected -1. In the case where the first number is negative and the second is positive, it would be easiest to swap their order for subtraction instead of making a separate reverse-subtract routine.

## Multiplication and Division

Multiplication and division are much easier than addition and subtraction. When multiplying or dividing two numbers, you just need to use XOR logic on the signs and perform basically integer math on the mantissas. If both numbers are negative, their sign bits will be 1 and 1 xor 1=0, which is positive. Likewise, 1 xor 0=1, 0 xor 1=1, 0 xor 0=0. To multiply A and B, add their exponents together, to divide, subtract the exponent of B from A (taking care of underflow/overflow, accordingly which is a simple check of the c flag).

Now you have the sign and exponent of the result, but you still need to get the mantissa. For multiplication, just multiply the two mantissas and take the upper bits for the result, rounding as seen fit. For division, it is a slightly different algorithm than integer division (or so it may seem). Since both numbers are normalized, you don't need to rotate bits into an accumulator. Instead, check if B<=A. If it is, perform B-A as an integer subtraction, and set the upper bit of the result, else leave it reset. Then shift B to the right and repeat the process, setting or resetting the next bits at each iteration. For a 64-bit mantissa, this requires 64 tests. As well, for best accuracy, A and B would need additional 64 trailing zero bits making subtraction 128-bits. Since it is integer subtraction, it is still fairly trivial.

Update from the future: On the z80, there is actually a faster way for large-ish float sizes that involves estimating the next byte of the result, and checking for off-by-one error. For a practical example, my (Zeda's) division routine for 80-bit floats takes ~19000cc versus a previous ~28000cc, compared to a very optimized ~11000cc multiplication.

## Square Root

For square roots, exponents are divided by 2. If the exponent is odd, add 1 to it to make it even and rotate the mantissa right. From there, the mantissa can be treated as an integer to take the square root of.

## Modulo

For many trig applications, you will need to take input mod pi. To do this, check the exponent. If it is less than 1, it is smaller than pi, so you are finished. if it is greater than or equal to 1, perform 3-exponent and this will be the number of iterations required for the algorithm about to be described.

If the mantissa is greater than or equal to pi, subtract pi. Then, shift pi to the right (dividing it by 2) and repeat the needed number of times.

For general modulo, just replace 'pi' with any number and check the exponent for being less than the exponent of the modulus.

# 24-bit Floating Point

24-bit floats are excellent because the math can be contained in the registers. These currently use 7-bit exponents such that 0=0 and %1111111 = 3Fh = -1

## RAM equates

The order in which the RAM values are stored are important:

``````f24_exp2:
.db 0
f24_exp1:
.db 0
f24_exp4:
.db 0
f24_exp3:
.db 0
f24_exp6:
.db 0
f24_exp5:
.db 0
f24_man1:
.dw 0
f24_man2:
.dw 0
f24_man3:
.dw 0
f24_man4:
.dw 0
f24_man5:
.dw 0
f24_man6:
.dw 0
tempbyte1:
.db 0
tempword1:
.dw 0```
```

``````Float24Add:
;BHL+CDE
ld a,c
xor 40h
ld c,a
rlc c
ld a,b
rla
rlc b
xor 80h
sub c
jr z,f24shifted
jr nc,f24shiftDE
rra
cp -16
jr nc,\$+6
ld a,c \ ex de,hl \ rrca \ ret
srl h \ rr l
inc a \ jr nz,\$-5
jr nc,\$+3 \ inc hl
ld a,b \ and 1 \ ld b,a \ ld a,c \ and \$FE \ xor 80h \ or b \ jp f24shifted+1
f24ShiftDE:
rra
cp 16
jr c,\$+5
ld a,b \ rrca \ ret
srl d \ rr e
dec a \ jr nz,\$-5
jr nc,\$+3 \ inc de
ld a,c \ and 1 \ ld c,a \ ld a,b \ and \$FE \ or c \ ld c,a
f24Shifted:
ld a,b \ rrca \ ld b,a \ ld a,c \ rrca \ ld c,a \ xor b \ ld a,b \ jp m,f24sub
ld c,0 \ inc c \ inc c \ rr h \ rr l \ jr nc,\$+8 \ inc l \ jr nz,\$+5 \ inc h
jr z,\$-12
rlca \ add a,c \ jp p,\$+10
jr c,\$+7 \ bit 6,b \ rrca \ jr z,\$+4
rrca \ ret
SetInf:
ld hl,\$FFFF
rla
ld a,h
rra
ld b,a
ret
f24Sub:
sbc hl,de
jr nc,normalise24
xor 80h
ld b,a
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ld a,b
normalise24:
bit 7,h \ ret nz
cp \$82 \ ret z
dec a \ dec a
rlc b \ rra \ ld b,a
jp normalise24

Float24Sub:
;BHL-CDE
ld a,80h
xor c
ld c,a
```

## Multiplication

``````Float24Mul:
;BHL*CDE -> AHL
ld a,b \ xor c
and 80h
push af
sla b \ sra b
sla c \ sra c
ld a,b
pop bc
jp m,\$+10
cp 64
jr nc,SetInf-2
jp \$+7
cp -64
jr c,SetInf-2
;B has the right sign
and 7Fh
or b
push af
ld b,h
ld c,l
call BC_Times_DE
bit 7,l
ld l,h
ld h,b
jr z,\$+3
inc hl
pop af
ret
BC_Times_DE:
;BHLA is the result
ld a,b
or a
ld hl,0
ld b,h
;1
jr nc,\$+4
ld h,d
ld l,e
;2
rla
jr nc,\$+4
;227+10b-7p
rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

;===
;AHL is the result of B*DE*256
push hl
ld h,b
ld l,b
ld b,a
ld a,c
ld c,h
;1
jr nc,\$+4
ld h,d
ld l,e
;2
rla
jr nc,\$+4
;227+10b-7p
rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

pop de
;Now BDE*256+AHL
ld c,a
ld a,l
ld l,h
ld h,c
ret nc
inc b
;BHLA is the 32-bit result
ret```
```

## Division

Note that this routine currently does not use 32-bit subtractions, so it can get bad accuracy.

``````Float24Div:
;BHL/CDE -> AHL
ld a,b \ xor c
and 80h
push af
sla b \ sra b
sla c \ sra c
ld a,b
sub c
pop bc
jp m,\$+11
cp 64
jp nc,SetInf-2
jp \$+7
cp -64
jp c,SetInf-2
;B has the right sign
and 7Fh
or b
push af

ld bc,0
or a \ sbc hl,de \ jr c,\$+7 \ set 7,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 6,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 5,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 4,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 3,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 2,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 1,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 0,b \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 7,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 6,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 5,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 4,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 3,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 2,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+7 \ set 1,c \ jp \$+4 \ add hl,de \ srl d \ rr e
or a \ sbc hl,de \ jr c,\$+4 \ set 0,c
ld h,b \ ld l,c
pop bc
jp normalise24```
```

## log base 2

``````Float24Lg:
;Inputs: AHL is the 24-bit float
;     bit 7,a is the sign
;     bits 0~6 are the signed exponent
;     HL is the mantissa
;Output:
;    nc flag set if result would be complex
;    c if real result
;    BHL is the computed result if real
or a
jp p,\$+9
ld a,-1
ld hl,0
ret
sra a
;A is now the value that must be added to the final result
push af
exx \ ld hl,0 \ exx
FloatLogLoop:
;Step 1 : check if HL >=AAAA
step1:
ld bc,\$AAAA
xor a
sbc hl,bc
jp c,step2
exx
ld bc,27200
exx
;multiply HL by 3/4
ld d,h
ld e,l
rra \ rr h \ rr l
rra \ rr h \ rr l
jr nc,step1+3
inc hl
jp step1+3
step2:
ld bc,\$9249
xor a
sbc hl,bc
jp c,step3
exx
ld bc,12625
exx
;multiply HL by 7/8
ld d,h
ld e,l
sbc hl,de \ sbc a,0
rra \ rr h \ rr l
rra \ rr h \ rr l
rra \ rr h \ rr l
jr nc,step2+3
inc hl
jp step2+3
step3:
ld bc,\$8888
xor a
sbc hl,bc
jp c,step4
exx
ld bc,6102
exx
;multiply HL by 7/8
ld d,h
ld e,l
sbc hl,de \ sbc a,0
rra \ rr h \ rr l
rra \ rr h \ rr l
rra \ rr h \ rr l
rra \ rr h \ rr l
jr nc,step3+3
inc hl
jp step3+3
step4:
ld bc,\$8421
xor a
sbc hl,bc
jp c,step5
exx
ld bc,3002
exx
;multiply HL by 31/32
;multiply HL by 31
push hl
srl h \ rr l \ rra
srl h \ rr l \ rra
srl h \ rr l \ rra
ld d,a
ld a,h \ ld h,l \ ld l,d
pop de
sbc hl,de \ sbc a,0
;now divide by 32
; 7 shifts right
bit 7,l
ld l,h
ld h,a
jr z,step4+3
inc hl
jp step4+3

step5:
ld bc,\$8208
xor a
sbc hl,bc
jp c,step6
exx
ld bc,1489
exx
;multiply HL by 63/64

;multiply HL by 63
push hl
srl h \ rr l \ rra
srl h \ rr l \ rra
ld d,a
ld a,h \ ld h,l \ ld l,d
pop de
sbc hl,de \ sbc a,0
;now divide by 64
; 7 shifts right
bit 7,l
ld l,h
ld h,a
jr z,step5+3
inc hl
jp step5+3

step6:
ld bc,\$8102
xor a
sbc hl,bc
jp c,step7
exx
ld bc,742
exx
;multiply HL by 127
ld d,h
ld e,l
srl h \ rr l
ld a,h \ ld h,l \ ld l,0 \ rr l
sbc hl,de \ sbc a,0
;now divide by 128
; 7 shifts right
bit 7,l
ld l,h
ld h,a
jr z,step6+3
inc hl
jp step6+3

step7:
ld bc,\$8080
xor a
sbc hl,bc
jp c,step8
exx
ld bc,370
exx
;multiply HL by 255
ld d,h
ld e,l
ld a,h \ ld h,l \ ld l,0
sbc hl,de \ sbc a,0
;now divide by 256
bit 7,l
ld l,h
ld h,a
jr z,step7+3
inc hl
jp step7+3

step8:
exx \ ld bc,185 \ exx
rrc c
sbc hl,bc
jp c,step9
exx
exx
;multiply HL by 511/512
xor a
ld e,h
ld d,a
cp L
jr z,\$+3
inc de
sbc hl,de \ sbc a,0
rra \ rr h \ rr l
jr nc,step8+7
inc hl
jp step8+7

step9:
rrc c
sbc hl,bc
jp c,step10
exx
ld c,92
exx
;multiply HL by 1023/1024
xor a
ld e,h
ld d,a
cp L
jr z,\$+3
inc de
sbc hl,de \ sbc a,0
rra \ rr h \ rr l
rra \ rr h \ rr l
jr nc,step9+2
inc hl
jp step9+2

step10:
rrc c
sbc hl,bc
jp c,step11
exx
ld c,46
exx
;multiply HL by 1023/1024
xor a
ld e,h
ld d,a
cp L
jr z,\$+3
inc de
sbc hl,de \ sbc a,0
rra \ rr h \ rr l
rra \ rr h \ rr l
rra \ rr h \ rr l
jr nc,step10+2
inc hl
jp step10+2

step11:
rrc c
sbc hl,bc
jp c,step12
exx
ld c,23
exx
;multiply HL by 2047/2048
xor a
ld e,h
ld d,a
cp L
jr z,\$+3
inc de
sbc hl,de \ sbc a,0
rra \ rr h \ rr l
rra \ rr h \ rr l
rra \ rr h \ rr l
rra \ rr h \ rr l
jr nc,step11+2
inc hl
jp step11+2

step12:
;    rrc c
;    sbc hl,bc
;    jp c,step13
;    exx
;    ld c,12
;    exx

step13:
logfin:

pop af
exx
ld bc,\$7F07
ld d,a
ld a,b
ret z
ld a,d
ld e,0

jp p,\$+7
ld c,\$87
neg
dec c
ld d,a
```

## Square Root

``````Float24Sqrt:
;Input:
;  AHL is the 24-bit float of which to find the square root
;Output:
;  BHL is the resulting 24-bit square root
or a
jp p,\$+9
ld a,-1
ld hl,0
ret

rra
jr c,\$+6
srl h
rr l
sra a
rra
ld b,a
SqrtHL_prec16:
;input: HL
;Output: DE
;247 bytes
;1269 t-states worst case
xor a
ld d,a

ld c,l
ld l,h
ld h,a

cp h
jr nc,\$+5
dec h
ld a,4

ld e,a
sub h
jr nc,\$+6
cpl
ld h,a
inc e
inc e

ld a,e
ld e,a
sub h
jr nc,\$+6
cpl
ld h,a
inc e
inc e

ld a,e
ld e,a
sub h
jr nc,\$+6
cpl
ld h,a
inc e
inc e

ld a,e
ld l,c

ld e,a
sub h
jr nc,\$+6
cpl
ld h,a
inc e
inc e

ld a,e
ld e,a
sub h
jr nc,\$+6
cpl
ld h,a
inc e
inc e

ld a,e
jr nc,\$+6
sub h \ jp \$+6
sub h
jr nc,\$+6
inc e \ inc e
cpl
ld h,a

ld a,l
ld l,h
ld h,a
sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

;iteration 9
sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

sll e \ rl d
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

sll e \ rl d
jr nc,\$+8
sbc hl,de
inc e
jp \$+13
sbc hl,de
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

srl d \ rr e \ scf
sbc hl,de
ccf
ex de,hl
ret```
```

## Float To Signed Int

``````FloatToSInt:
;Input: BHL
;Output : DE
;Overflow is handled as 0x7FFF or 0xFFFF
ld de,0
ld a,b
and 80h
or d
ld d,a
ld a,b
ret m
dec de
rr d \ rr e
sub 16
ret nc
cpl
ld d,h \ ld e,l
ld b,a
sra d \ rr e
djnz \$-4
ret```
```

## Float to Unsigned Int

``````FloatToUInt:
;Input: BHL
;Output : DE
;Overflow is handled as 0xFFFF
ld de,0
ld a,b
ret m
dec de
sub 16
ret nc
cpl
ld d,h \ ld e,l
ld b,a
srl d \ rr e
djnz \$-4
ret```
```

## Atan2 (CORDIC)

This is getting questionable accuracy, but it is a CORDIC implementation of arctangent using an x and y input:

``````Float24Atan2:
;Inputs:
; BHL=X
; CDE=Y
;Outputs:
; AHL is the angle (64 corresponds to 90 degrees)
; CDE can be multiplied by 0.60725293500888 (7F 9B75)to get sqrt(x^2+y^2).
ld (f24_exp2),bc
ld (f24_man1),hl
ld (f24_man2),de
xor a
ld h,a
ld l,a
ld (f24_man3),hl
ld (f24_exp3),a
ld (f24_man4),hl
ld (f24_exp4),a
ld hl,Float24_arctantable
ld (tempword1),hl
f24_atan2loop:
ld (tempbyte1),a
and \$7F
ld c,a
;add CDE to f1, store to f5
ld hl,(f24_man1)
ld (f24_exp5),a
ld (f24_man5),hl

ld bc,(f24_exp2)
ld a,(tempbyte1)
and \$7F
bit 7,b
jr z,\$+4
or 80h
ld d,a
ld b,c
ld c,d
ld de,(f24_man1)
ld hl,(f24_man2)

bit 7,b
jr nz,\$+2+3+3+3+3+1+3
call Float24Sub
ld (f24_man2),hl
ld (f24_exp2),a
ld hl,(tempword1)
ld c,(hl)

ld (f24_man2),hl
ld (f24_exp2),a
ld a,(tempbyte1)
ld hl,(tempword1)
ld c,(hl) \ set 7,c
inc hl
ld e,(hl) \ inc hl
ld d,(hl) \ inc hl
ld (tempword1),hl
ld hl,(f24_man4)
ld a,(f24_exp4)
ld b,a
ld (f24_man4),hl
ld (f24_exp4),a

ld hl,(f24_man5)
ld a,(f24_exp5)
ld (f24_man1),hl
ld (f24_exp1),a

ld bc,(f24_exp2)
ld de,(f24_man2)
ld a,(tempbyte1)
dec a
cp -8
jp nz,f24_atan2loop
ld hl,(f24_man4)
ld a,(f24_exp4)
ld bc,(f24_exp1)
ld de,(f24_man1)
ret

Float24_arctantable:
.db 5   \ .dw \$8000
.db 4   \ .dw \$9720
.db 3   \ .dw \$9FB4
.db 2   \ .dw \$A222
.db 1   \ .dw \$A2C3
.db 0   \ .dw \$A2EC
.db -1  \ .dw \$A2F6
.db -2  \ .dw \$A2F9    ;note that from here, these are essentially the same
.db -3  \ .dw \$A2F9
.db -4  \ .dw \$A2F9
.db -5  \ .dw \$A2FA
.db -6  \ .dw \$A2FA
.db -7  \ .dw \$A2FA
.db -8  \ .dw \$A2FA
.db -9  \ .dw \$A2FA
.db -10 \ .dw \$A2FA```
```

# Conclusion

For more routines, until this page can be filled in more, see this Floating Point Library

page revision: 1, last edited: 08 Aug 2017 21:48