Floating Point

Floating Point Tutorial

Although integer math may seem more natural for us when we program the Z80, floating point math that works with fractional numbers is not much more difficult. The only change is in representation and how we program the handling of the data. After reading this, hopefully the reader will not feel limited to integer math and not limited to even the representations here. Don't just use a 'sign bit' if you think it would be more mathematically sound to have a bit for additive inverses and multiplicative inverses. Add complex arithmetic, or try making Rational representations for sometimes many, many more bits of precision and a smaller risk for roundoff errors.

Format

The typical format includes a sign bit (additive inverse, positive/negative) an exponent, and a mantissa. TI, for example, uses 1 bit for sign, 8 bits for the exponent, and 7 bytes storing 14 base 10 digits. If the bytes were 0083 31 41 59 26 53 58 98, it would be read as sign=0, exponent=83h-80h=3, mantissa=3.1415926535898, so the value represented is 3.1415926535898*103=3141.5926535898.

Generally, the math has a wider range and better precision if we use a base of power of 2. Depending on the usage, we tend to try to pack all of the data to fill 8-bit multiples. TI uses an extra 8 bits to store the 1 sign bit, but in their application, they also use that byte to store variable type. They actually took advantage of unused bits that had to be there anyways.

Sign Bit

The sign bit is just a 0 or 1 that determines if a number is positive or negative.

Exponent

To decide a good range for the exponent, you will need to pick a base. TI used 10, but generally, 2 might be better or 256 so that we use whole bytes as digits. Essentially, we will use something like 2exponent*mantissa. It is generally easier to make the middle number in your range as 0. If you are using an 8-bit exponent, 80h=0 for the exponent.

Mantissa

The mantissa is best kept "normalized" by making sure we don't have leading zeroes. If we are using a base of 2, then a mantissa of %0011111100000000 is usually not what we want. Instead, we shift it left twice, which means we decrement the exponent by 2.
++Examples
If we decided to have 1 bit for sign, 7 bits for the exponent, and 16 bits for the mantissa, using base 2, then the following would represent 1:

%01000000   ;sign=0, exponent=64 (treated as 0)
%1000000000000000 ;mantissa=1.0

And likewise, 2 would be just multiplying by 2 which is the same as increasing the exponent by 1:
%01000001   ;sign=0, exponent=65 (65-64=1)
%1000000000000000 ;mantissa=1.0
1.0*2^1=2

So if we think of the exponent as the location of the decimal, 2 is 10.00000000000000 and 1 is 1.000000000000000. If we want to perform some arithmetic…

Addition and Subtraction

This can be a bit difficult because we have to keep track of the sign of the inputs and the result, but we also have to line up the 'decimal points' (the exponent). In the above examples for the 24-bit floats for 1.0 and 2.0, to line up the decimals, we need to rotate the mantissa of 1 to the right (and likewise, increment the exponent:

%01000001   ;sign=0, exponent=65 (treated as 1)
%0100000000000000 ;mantissa=0.5
0.5*2^1=1

Note that 1 is now represented as a denormalized number, but we can directly add the numbers by adding their mantissas:
%01000001   ;sign=0, exponent=65 (treated as 1)
%1100000000000000 ;mantissa=1.5
1.5*2^1=3

If there was overflow, we would have to rotate the result right to get the upper bit of the result back in the number and then increment the exponent. For example, if we added 2 to this, we would get %0100000000000000 as the 16-bit addition, but the overflow (carry flag) would be a 1, so we would rotate it back in, dropping the last bit (in our case, a 0):
%01000010   ;sign=0, exponent=66 (treated as 2)
%1010000000000000 ;mantissa=1.25
1.25*2^2=5

Adding two negative numbers -a+-b is the same as -(a+b) so it is just like adding two positive numbers and you don't need to change the sign. Adding a positive and negative number will require a subtraction of their absolute values. For example, 2+-1 would require rotating the mantissas as in addition, but then subtract %1000…-%0100… to get %0100… which needs to be renormalised. In this case, it is still a positive result. If there was overflow (on the z80, the c flag is set), then our result is negative, but we also need to perform a 2s compliment on the mantissa (that is, negate it as if it were an integer). So in the case of 1-2, we would get:
%01000001 1100000000000000. 2s compliment would return $01000001 0100000000000000 and then we set the sign as negative and renormalise the number, getting the expected -1. In the case where the first number is negative and the second is positive, it would be easiest to swap their order for subtraction instead of making a separate reverse-subtract routine.

Multiplication and Division

Multiplication and division are much easier than addition and subtraction. When multiplying or dividing two numbers, you just need to use XOR logic on the signs and perform basically integer math on the mantissas. If both numbers are negative, their sign bits will be 1 and 1 xor 1=0, which is positive. Likewise, 1 xor 0=1, 0 xor 1=1, 0 xor 0=0. To multiply A and B, add their exponents together, to divide, subtract the exponent of B from A (taking care of underflow/overflow, accordingly which is a simple check of the c flag).

Now you have the sign and exponent of the result, but you still need to get the mantissa. For multiplication, just multiply the two mantissas and take the upper bits for the result, rounding as seen fit. For division, it is a slightly different algorithm than integer division (or so it may seem). Since both numbers are normalized, you don't need to rotate bits into an accumulator. Instead, check if B<=A. If it is, perform B-A as an integer subtraction, and set the upper bit of the result, else leave it reset. Then shift B to the right and repeat the process, setting or resetting the next bits at each iteration. For a 64-bit mantissa, this requires 64 tests. As well, for best accuracy, A and B would need additional 64 trailing zero bits making subtraction 128-bits. Since it is integer subtraction, it is still fairly trivial.

Update from the future: On the z80, there is actually a faster way for large-ish float sizes that involves estimating the next byte of the result, and checking for off-by-one error. For a practical example, my (Zeda's) division routine for 80-bit floats takes ~19000cc versus a previous ~28000cc, compared to a very optimized ~11000cc multiplication.

Square Root

For square roots, exponents are divided by 2. If the exponent is odd, add 1 to it to make it even and rotate the mantissa right. From there, the mantissa can be treated as an integer to take the square root of.

Modulo

For many trig applications, you will need to take input mod pi. To do this, check the exponent. If it is less than 1, it is smaller than pi, so you are finished. if it is greater than or equal to 1, perform 3-exponent and this will be the number of iterations required for the algorithm about to be described.

If the mantissa is greater than or equal to pi, subtract pi. Then, shift pi to the right (dividing it by 2) and repeat the needed number of times.

For general modulo, just replace 'pi' with any number and check the exponent for being less than the exponent of the modulus.

24-bit Floating Point

24-bit floats are excellent because the math can be contained in the registers. These currently use 7-bit exponents such that 0=0 and %1111111 = 3Fh = -1

RAM equates

The order in which the RAM values are stored are important:

f24_exp2:
    .db 0
f24_exp1:
    .db 0
f24_exp4:
    .db 0
f24_exp3:
    .db 0
f24_exp6:
    .db 0
f24_exp5:
    .db 0
f24_man1:
    .dw 0
f24_man2:
    .dw 0
f24_man3:
    .dw 0
f24_man4:
    .dw 0
f24_man5:
    .dw 0
f24_man6:
    .dw 0
tempbyte1:
    .db 0
tempword1:
    .dw 0

Addition/Subtraction

Float24Add:
;BHL+CDE
    ld a,c
    xor 40h
    ld c,a
    rlc c
    ld a,b
    rla
    rlc b
    xor 80h
    sub c
    jr z,f24shifted
    jr nc,f24shiftDE
    rra
    cp -16
    jr nc,$+6
    ld a,c \ ex de,hl \ rrca \ ret
    srl h \ rr l
    inc a \ jr nz,$-5
    jr nc,$+3 \ inc hl
    ld a,b \ and 1 \ ld b,a \ ld a,c \ and $FE \ xor 80h \ or b \ jp f24shifted+1
f24ShiftDE:
    rra
    cp 16
    jr c,$+5
    ld a,b \ rrca \ ret
    srl d \ rr e
    dec a \ jr nz,$-5
    jr nc,$+3 \ inc de
    ld a,c \ and 1 \ ld c,a \ ld a,b \ and $FE \ or c \ ld c,a
f24Shifted:
    ld a,b \ rrca \ ld b,a \ ld a,c \ rrca \ ld c,a \ xor b \ ld a,b \ jp m,f24sub
    add hl,de \ ret nc
    ld c,0 \ inc c \ inc c \ rr h \ rr l \ jr nc,$+8 \ inc l \ jr nz,$+5 \ inc h
    jr z,$-12
    rlca \ add a,c \ jp p,$+10
    jr c,$+7 \ bit 6,b \ rrca \ jr z,$+4
    rrca \ ret
SetInf:
    ld hl,$FFFF
    rla
    ld a,h
    rra
    ld b,a
    ret
f24Sub:
    sbc hl,de
    jr nc,normalise24
    xor 80h
    ld b,a
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld a,b
normalise24:
    bit 7,h \ ret nz
    ld a,b \ add a,a
    cp $82 \ ret z
    dec a \ dec a
    add hl,hl
    rlc b \ rra \ ld b,a
    jp normalise24

Float24Sub:
;BHL-CDE
    ld a,80h
    xor c
    ld c,a
    jp Float24Add

Multiplication

Float24Mul:
;BHL*CDE -> AHL
    ld a,b \ xor c
    and 80h
    push af
    sla b \ sra b
    sla c \ sra c
    ld a,b
    add a,c
    pop bc
    jp m,$+10
    cp 64
    jr nc,SetInf-2
    jp $+7
    cp -64
    jr c,SetInf-2
;B has the right sign
    and 7Fh
    or b
    push af
    ld b,h
    ld c,l
    call BC_Times_DE
    bit 7,l
    ld l,h
    ld h,b
    jr z,$+3
    inc hl
    pop af
    ret
BC_Times_DE:
;BHLA is the result
    ld a,b
    or a
    ld hl,0
    ld b,h
;1
    add a,a
    jr nc,$+4
    ld h,d
    ld l,e
;2
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b
;227+10b-7p
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

;===
;AHL is the result of B*DE*256
    push hl
    ld h,b
    ld l,b
    ld b,a
    ld a,c
    ld c,h
;1
    add a,a
    jr nc,$+4
    ld h,d
    ld l,e
;2
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c
;227+10b-7p
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    pop de
;Now BDE*256+AHL
    ld c,a
    ld a,l
    ld l,h
    ld h,c
    add hl,de
    ret nc
    inc b
;BHLA is the 32-bit result
    ret

Division

Note that this routine currently does not use 32-bit subtractions, so it can get bad accuracy.

Float24Div:
;BHL/CDE -> AHL
    ld a,b \ xor c
    and 80h
    push af
    sla b \ sra b
    sla c \ sra c
    ld a,b
    sub c
    pop bc
    jp m,$+11
    cp 64
    jp nc,SetInf-2
    jp $+7
    cp -64
    jp c,SetInf-2
;B has the right sign
    and 7Fh
    or b
    push af

    ld bc,0
    or a \ sbc hl,de \ jr c,$+7 \ set 7,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 6,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 5,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 4,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 3,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 2,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 1,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 0,b \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 7,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 6,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 5,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 4,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 3,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 2,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+7 \ set 1,c \ jp $+4 \ add hl,de \ srl d \ rr e
    or a \ sbc hl,de \ jr c,$+4 \ set 0,c
    ld h,b \ ld l,c
    pop bc
    jp normalise24

log base 2

Float24Lg:
;Inputs: AHL is the 24-bit float
;     bit 7,a is the sign
;     bits 0~6 are the signed exponent
;     HL is the mantissa
;Output:
;    nc flag set if result would be complex
;    c if real result
;    BHL is the computed result if real
    or a
    jp p,$+9
    ld a,-1
    ld hl,0
    ret
    add a,a
    sra a
;A is now the value that must be added to the final result
    push af
    exx \ ld hl,0 \ exx
FloatLogLoop:
;Step 1 : check if HL >=AAAA
step1:
    ld bc,$AAAA
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step2
    exx
    ld bc,27200
    add hl,bc
    exx
;multiply HL by 3/4
    ld d,h
    ld e,l
    add hl,hl \ rla
    add hl,de \ adc a,0
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    jr nc,step1+3
    inc hl
    jp step1+3
step2:
    ld bc,$9249
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step3
    exx
    ld bc,12625
    add hl,bc
    exx
;multiply HL by 7/8
    ld d,h
    ld e,l
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    sbc hl,de \ sbc a,0
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    jr nc,step2+3
    inc hl
    jp step2+3
step3:
    ld bc,$8888
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step4
    exx
    ld bc,6102
    add hl,bc
    exx
;multiply HL by 7/8
    ld d,h
    ld e,l
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    sbc hl,de \ sbc a,0
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    jr nc,step3+3
    inc hl
    jp step3+3
step4:
    ld bc,$8421
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step5
    exx
    ld bc,3002
    add hl,bc
    exx
;multiply HL by 31/32
;multiply HL by 31
    push hl
    srl h \ rr l \ rra
    srl h \ rr l \ rra
    srl h \ rr l \ rra
    ld d,a
    ld a,h \ ld h,l \ ld l,d
    pop de
    sbc hl,de \ sbc a,0
;now divide by 32
; 7 shifts right
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    bit 7,l
    ld l,h
    ld h,a
    jr z,step4+3
    inc hl
    jp step4+3

step5:
    ld bc,$8208
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step6
    exx
    ld bc,1489
    add hl,bc
    exx
;multiply HL by 63/64

;multiply HL by 63
    push hl
    srl h \ rr l \ rra
    srl h \ rr l \ rra
    ld d,a
    ld a,h \ ld h,l \ ld l,d
    pop de
    sbc hl,de \ sbc a,0
;now divide by 64
; 7 shifts right
    add hl,hl \ rla
    add hl,hl \ rla
    bit 7,l
    ld l,h
    ld h,a
    jr z,step5+3
    inc hl
    jp step5+3

step6:
    ld bc,$8102
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step7
    exx
    ld bc,742
    add hl,bc
    exx
;multiply HL by 127
    ld d,h
    ld e,l
    srl h \ rr l
    ld a,h \ ld h,l \ ld l,0 \ rr l
    sbc hl,de \ sbc a,0
;now divide by 128
; 7 shifts right
    add hl,hl \ rla
    bit 7,l
    ld l,h
    ld h,a
    jr z,step6+3
    inc hl
    jp step6+3

step7:
    ld bc,$8080
    xor a
    sbc hl,bc
    add hl,bc
    jp c,step8
    exx
    ld bc,370
    add hl,bc
    exx
;multiply HL by 255
    ld d,h
    ld e,l
    ld a,h \ ld h,l \ ld l,0
    sbc hl,de \ sbc a,0
;now divide by 256
    bit 7,l
    ld l,h
    ld h,a
    jr z,step7+3
    inc hl
    jp step7+3

step8:
    exx \ ld bc,185 \ exx
    rrc c
    sbc hl,bc
    add hl,bc
    jp c,step9
    exx
    add hl,bc
    exx
;multiply HL by 511/512
    xor a
    ld e,h
    ld d,a
    cp L
    jr z,$+3
    inc de
    add hl,hl \ rla
    sbc hl,de \ sbc a,0
    rra \ rr h \ rr l
    jr nc,step8+7
    inc hl
    jp step8+7

step9:
    rrc c
    sbc hl,bc
    add hl,bc
    jp c,step10
    exx
    ld c,92
    add hl,bc
    exx
;multiply HL by 1023/1024
    xor a
    ld e,h
    ld d,a
    cp L
    jr z,$+3
    inc de
    add hl,hl \ rla
    add hl,hl \ rla
    sbc hl,de \ sbc a,0
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    jr nc,step9+2
    inc hl
    jp step9+2

step10:
    rrc c
    sbc hl,bc
    add hl,bc
    jp c,step11
    exx
    ld c,46
    add hl,bc
    exx
;multiply HL by 1023/1024
    xor a
    ld e,h
    ld d,a
    cp L
    jr z,$+3
    inc de
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    sbc hl,de \ sbc a,0
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    jr nc,step10+2
    inc hl
    jp step10+2

step11:
    rrc c
    sbc hl,bc
    add hl,bc
    jp c,step12
    exx
    ld c,23
    add hl,bc
    exx
;multiply HL by 2047/2048
    xor a
    ld e,h
    ld d,a
    cp L
    jr z,$+3
    inc de
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    sbc hl,de \ sbc a,0
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    rra \ rr h \ rr l
    jr nc,step11+2
    inc hl
    jp step11+2

step12:
;    rrc c
;    sbc hl,bc
;    add hl,bc
;    jp c,step13
;    exx
;    ld c,12
;    add hl,bc
;    exx

step13:
logfin:

    pop af
    exx
    ld bc,$7F07
    ld d,a
    ld a,b
    ret z
    ld a,d
    ld e,0

    jp p,$+7
    ld c,$87
    neg
    dec c
    add a,a \ jp p,$-2
    ld d,a
    jp Float24Add

Square Root

Float24Sqrt:
;Input:
;  AHL is the 24-bit float of which to find the square root
;Output:
;  BHL is the resulting 24-bit square root
    or a
    jp p,$+9
    ld a,-1
    ld hl,0
    ret

    rra
    jr c,$+6
    srl h
    rr l
    add a,a
    add a,a
    sra a
    rra
    ld b,a
SqrtHL_prec16:
;input: HL
;Output: DE
;247 bytes
;1269 t-states worst case
;Average is about 1219 t-states
    xor a
    ld d,a

    ld c,l
    ld l,h
    ld h,a

    add hl,hl
    add hl,hl
    cp h
    jr nc,$+5
    dec h
    ld a,4

    add hl,hl
    add hl,hl
    ld e,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc e
    inc e

    ld a,e
    add hl,hl
    add hl,hl
    add a,a
    ld e,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc e
    inc e

    ld a,e
    add hl,hl
    add hl,hl
    add a,a
    ld e,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc e
    inc e

    ld a,e
    ld l,c

    add hl,hl
    add hl,hl
    add a,a
    ld e,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc e
    inc e

    ld a,e
    add hl,hl
    add hl,hl
    add a,a
    ld e,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc e
    inc e

    ld a,e
    add a,a \ ld e,a
    add hl,hl
    add hl,hl
    jr nc,$+6
    sub h \ jp $+6
    sub h
    jr nc,$+6
    inc e \ inc e
    cpl
    ld h,a

    ld a,l
    ld l,h
    add a,a
    ld h,a
    adc hl,hl
    adc hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

;iteration 9
    add hl,hl \ add hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    add hl,hl \ add hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    add hl,hl \ add hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    add hl,hl \ add hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    add hl,hl \ add hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    add hl,hl \ add hl,hl
    sll e \ rl d
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    sll e \ rl d
    add hl,hl \ add hl,hl    ;up to 17 bits
    jr nc,$+8
        sbc hl,de
        inc e
        jp $+13
    sbc hl,de
    jr nc,$+3
    add hl,de
    sbc a,a \ add a,a \ inc a \ add a,e \ ld e,a

    srl d \ rr e \ scf
    sbc hl,de
    ccf
    ex de,hl
    adc hl,hl
    ret

Float To Signed Int

FloatToSInt:
;Input: BHL
;Output : DE
;Overflow is handled as 0x7FFF or 0xFFFF
    ld de,0
    ld a,b
    and 80h
    or d
    ld d,a
    ld a,b
    add a,a
    ret m
    dec de
    rr d \ rr e
    sub 16
    ret nc
    cpl
    ld d,h \ ld e,l
    ld b,a
    sra d \ rr e
    djnz $-4
    ret

Float to Unsigned Int

FloatToUInt:
;Input: BHL
;Output : DE
;Overflow is handled as 0xFFFF
    ld de,0
    ld a,b
    add a,a
    ret m
    dec de
    sub 16
    ret nc
    cpl
    ld d,h \ ld e,l
    ld b,a
    srl d \ rr e
    djnz $-4
    ret

Atan2 (CORDIC)

This is getting questionable accuracy, but it is a CORDIC implementation of arctangent using an x and y input:

Float24Atan2:
;Inputs:
; BHL=X
; CDE=Y
;Outputs:
; AHL is the angle (64 corresponds to 90 degrees)
; CDE can be multiplied by 0.60725293500888 (7F 9B75)to get sqrt(x^2+y^2).
    ld (f24_exp2),bc
    ld (f24_man1),hl
    ld (f24_man2),de
    xor a
    ld h,a
    ld l,a
    ld (f24_man3),hl
    ld (f24_exp3),a
    ld (f24_man4),hl
    ld (f24_exp4),a
    ld hl,Float24_arctantable
    ld (tempword1),hl
f24_atan2loop:
    ld (tempbyte1),a
    add a,c
    and $7F
    ld c,a
;add CDE to f1, store to f5
    ld hl,(f24_man1)
    call Float24Add
    ld (f24_exp5),a
    ld (f24_man5),hl

    ld bc,(f24_exp2)
    ld a,(tempbyte1)
    add a,b
    and $7F
    bit 7,b
    jr z,$+4
    or 80h
    ld d,a
    ld b,c
    ld c,d
    ld de,(f24_man1)
    ld hl,(f24_man2)

    bit 7,b
    jr nz,$+2+3+3+3+3+1+3
    call Float24Sub
    ld (f24_man2),hl
    ld (f24_exp2),a
    ld hl,(tempword1)
    ld c,(hl)
    jp adjustangle

    call Float24Add
    ld (f24_man2),hl
    ld (f24_exp2),a
    ld a,(tempbyte1)
    ld hl,(tempword1)
    ld c,(hl) \ set 7,c
AdjustAngle:
    inc hl
    ld e,(hl) \ inc hl
    ld d,(hl) \ inc hl
    ld (tempword1),hl
    ld hl,(f24_man4)
    ld a,(f24_exp4)
    ld b,a
    call Float24Add
    ld (f24_man4),hl
    ld (f24_exp4),a

    ld hl,(f24_man5)
    ld a,(f24_exp5)
    ld (f24_man1),hl
    ld (f24_exp1),a

    ld bc,(f24_exp2)
    ld de,(f24_man2)
    ld a,(tempbyte1)
    dec a
    cp -8
    jp nz,f24_atan2loop
    ld hl,(f24_man4)
    ld a,(f24_exp4)
    ld bc,(f24_exp1)
    ld de,(f24_man1)
    ret

Float24_arctantable:
 .db 5   \ .dw $8000
 .db 4   \ .dw $9720
 .db 3   \ .dw $9FB4
 .db 2   \ .dw $A222
 .db 1   \ .dw $A2C3
 .db 0   \ .dw $A2EC
 .db -1  \ .dw $A2F6
 .db -2  \ .dw $A2F9    ;note that from here, these are essentially the same
 .db -3  \ .dw $A2F9
 .db -4  \ .dw $A2F9
 .db -5  \ .dw $A2FA
 .db -6  \ .dw $A2FA
 .db -7  \ .dw $A2FA
 .db -8  \ .dw $A2FA
 .db -9  \ .dw $A2FA
 .db -10 \ .dw $A2FA

32-bit Floating Point

80-bit Floating Point

Conclusion

For more routines, until this page can be filled in more, see this Floating Point Library

Unless otherwise stated, the content of this page is licensed under GNU Free Documentation License.