Advanced Math

This article covers advanced math routines for a number of functions including the basic addition, multiplication, and square roots, as well as exponential and trigonometric functions.
In most instances it is most efficient to store numbers in little endian format as that is how the Z80 works internally. Doing this allows us to take advantage of the handful of 16-bit operations. In the case of BCD formatted numbers, it really doesn't matter if they are stored big-endian or little-endian since built in operations are only 8-bit. However, I choose to use little endian since it is easier to work with large numbers when the pointer passed to the routine points to the lowest byte.
I'll try to cover:

  • Integer arithmetic
  • BCD arithmetic
  • Fixed point arithmetic
  • Floating point arithmetic
Table of Contents

Intro to Integer Formats

Integers are one of the most basic format to work with on the Z80. The built in instructions primarily work with 8-bit integers, though a few work with 16-bit integers. Since you are reading this page, I am going to assume you know this much.
However, if you are looking to work with larger integers, you'll need to come up with a working format. Best practice for the z80 is store in little endian. You may have seen this mentioned here and elsewhere without a good reason, so here it is:

The Z80 stores 16-bit integers in little-endian. If you need to access the lower 16 bits of a 64-bit integer, it's easier and more efficient to use ld hl,(int64) for 3 bytes, 16cc, versus ld hl,(int64+6) \ ld a,l \ ld l,h \ ld h,a at 6 bytes, 28cc.
So if you wanted to store the 32-bit integer 0x12345678 as a constant in your program, it would look like:

int32_var0:
.db $78,$56,$34,$12

OR:
int32_var0:
.dw $5678,$1234

And if you wanted to change that value to 0xFEDCBA98, you could do:
    ld hl,$BA98
    ld (int64_var0),hl
    ld hl,$FEDC
    ld (int64_var0+2),hl

Intro to BCD Formats

BCD, short for Binary Coded Decimal, is a format where one byte represents two base 10 digits. As an example, 0x66 is interpreted in base 10 as 66 instead of 102.
The Z80 has a handy instruction called DAA (1 byte, 4cc) which converts the A register based on it's contents, the c flag, half-carry flag, and the last type of arithmetic operation. Addition and subtraction operations set or reset a special flag so that the DAA instruction knows what to do.
For example, if A=0x66, then add a,a, you would get A=0xCC, with the c flag reset. Followed up with the daa instruction, the c flag is set, and A is converted to 0x32, which is what we expect since 66*2=132.

Intro to Fixed Point Formats

Fixed point formats offer the advantage of precision smaller than a unit. For example, you can store values such as 119.125.
In fixed point, the decimal point is in a fixed location. An 8.8 fixed point number means that given a 16-bit number, the top 8 bits form the integer part, and the lower 8 bits form the fractional part. If the lower 8 bits formed the integer 55, then it would be interpreted as 55/256.

Fixed point formats tend to be faster and easier to implement than floating point, but they can lose precision more easily. They are great for games where speed is desired, but often not desirable in mathematical applications.
Many fixed point routines can use integer routines with some minor modifications to overhead code.

Intro to Floating Point Formats

Floating point formats typically have:

  • A sign bit
  • An exponent for an implied base, which indicates where the decimal point is
  • A mantissa

Floating point and scientific notation, are very similar. If we are using BCD floats, then base 10 is likely the implied base, so you might have a number like 3.14159*100.
In practice, a base of some power of two is far more efficient to compute. As well, the mantissa is typically of the form where one digit forms the non-zero integer part, and the rest form the fractional part. This is referred to as being normalized. When the top digit is 0, this is a denormalized number. Some formats allow this, some don't. For base-2 float formats that don't allow denormalized numbers there is the advantage that the top bit can only be one value— 1, so it isn't actually necessary to occupy a bit that is known to always be 1. Instead, this bit is taken as implicit. In such cases, that bit is instead put to some other use, like a sign bit, or another bit of the exponent.

In general:

  • Sign is represented as a bit where 0 means positive, 1 means negative.
  • The exponent is an integer with a bias. For example, if our format uses an 8-bit exponent, 128 might represent the exponent 0. In fact, this is what TI does for their BCD floats.
  • The mantissa is a 1.n fixed point number. As examples,

** A 7-byte BCD float has 14 digits, so the mantissa is a 1.13 fixed point number.
** An "extended precision float" has a 64-bit mantissa with no implicit bit. Therefore, the mantissa is a 1.63 fixed point number.
** A "single precision float" has a 24-bit mantissa, where the top bit is implied and in its place is the sign bit. So it is a 1.23 fixed point number, where the top bit is always expected to be 1.

There is a GitHub repository for Z80 floats here. At the time of this writing, it contains single-precision and extended-precision binary floats. Maybe at some future point there will be other formats.

Addition/Subtraction

Thankfully addition and subtraction are built-in instructions on the Z80 and by extension, the eZ80. However, if we want to work with multiple precision arithmetic, we still have to juggle pointers and RAM.

Integer Addition/Subtraction, multiprecision

Suppose we have two 64-bit integers stored in a fixed location in RAM called input1 and input2, and we want to store their sum or difference to output. Then we can take advantage of 16-bit math to make a speed-optimized routine:

;addint64:
;47 bytes, 264cc
    ld hl,(input1) \ ld de,(input2) \ add hl,de \ ld (output),hl
    ld hl,(input1+2) \ ld de,(input2+2) \ adc hl,de \ ld (output+2),hl
    ld hl,(input1+4) \ ld de,(input2+4) \ adc hl,de \ ld (output+4),hl
    ld hl,(input1+6) \ ld de,(input2+6) \ adc hl,de \ ld (output+6),hl

Note the use of add in the first iteration, versus adc in the latter iterations.

For subtraction, we only have sbc for 16-bit subtraction, so we have to first reset carry

;subint64:
;49 bytes, 272cc
    or a
    ld hl,(input1) \ ld de,(input2) \ sbc hl,de \ ld (output),hl
    ld hl,(input1+2) \ ld de,(input2+2) \ sbc hl,de \ ld (output+2),hl
    ld hl,(input1+4) \ ld de,(input2+4) \ sbc hl,de \ ld (output+4),hl
    ld hl,(input1+6) \ ld de,(input2+6) \ sbc hl,de \ ld (output+6),hl

Suppose you want to pass pointers to the input and output operands where DE is the first operand, HL is the second, and BC is the output:

;addint64:
;45 bytes, 294cc
    ld a,(de) \ add a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ adc a,(hl) \ ld (bc),a
;subint64:
;45 bytes, 294cc
    ld a,(de) \ sub (hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc
    ld a,(de) \ sbc a,(hl) \ ld (bc),a

Suppose you want to save a few bytes at the cost of some cycles:

;addint64:
;17 bytes, 515cc
    or a
    call +_
_:
    call +_
_:
    call +_
_:
    ld a,(de) \ adc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc \ ret
;subint64:
;17 bytes, 515cc
    or a
    call +_
_:
    call +_
_:
    call +_
_:
    ld a,(de) \ sbc a,(hl) \ ld (bc),a \ inc hl \ inc de \ inc bc \ ret

Suppose you just want the output to overwrite the first input, in order to save time and space:

;addint64:
;38 bytes, 252cc
    ld a,(de) \ add a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (de),a
;subint64:
;38 bytes, 252cc
    ld a,(de) \ sub (hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (de),a

Or with BC freed up, roll your loops:
;addint64:
;11 bytes, 384cc
    ld b,8
    or a
_:
    ld a,(de) \ adc a,(hl) \ ld (de),a \ inc hl \ inc de
    djnz -_
    ret
;subint64:
;11 bytes, 384cc
    ld b,8
    or a
_:
    ld a,(de) \ sbc a,(hl) \ ld (de),a \ inc hl \ inc de
    djnz -_
    ret

BCD Addition/Subtraction, multiprecision

BCD addition and subtraction work almost exactly like their integer counterparts
except that you have to use arithmetic with the A register, followed by daa.
I'll offer one example, and the reader can adapt the rest. This modifies the
above 64-bit example, but encodes 16 BCD digits.

;addBCD_16:
;46 bytes, 284cc
    ld a,(de) \ add a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ daa \ ld (de),a

Fixed Point Addition/Subtraction

Fixed Point addition and subtraction are identical to integer addition and subtraction, assuming both inputs have the same format.

Floating Point Addition/Subtraction

Believe it or not, addition and subtraction are one of the toughest floating point operations to implement. To show this (using base 10 for convenience), suppose you have two numbers, 3.14159 *100 and 1.23456*103.
To add these, we have to align their decimal points, which means denormalizing a number until it's exponent matches. On top of that, if your precision is too limited, you'll have to round. In this example, we only have 6 digits allowed. Since it truncates the digit 1, our number is not rounded up.

 0.00314*10^^3^^
+1.23456*10^^3^^
----------------
=1.23770*10^^3^^

So let's review what we need to do, then create some pseudo-code and then some examples:
  • The first input will be overwritten with the output.
  • We need to align the mantissas so that their exponents are the same
    • Organize the inputs so that the exponent of the first input is greater than or equal to that of the second.
    • Find their difference in exponent. If this difference is more than the number of bits in the mantissa, then if we aligned the mantissas, the second one would be full of zeros, so we can skip these cases. If the difference is equal to the size of the mantissa, then when the mantissas are aligned and round:
      • If the high bit is 1, increment the mantissa of the first input. For formats where the high bit is implicit, then we can just go ahead and increment that mantissa.
    • If that difference is non-zero, shift the second input right a number of times equal to that difference.
    • If the signs of the input are the same, add the second mantissa to that of the first as fixed-point numbers.
      • If there is overflow, increment the exponent and shift the mantissa right, rotating in a '1' bit.
        • If the exponent overflows, set the result to INF
    • Else the signs are different, so subtract the second mantissa from that of the first as fixed-point numbers.
      • If there is overflow, invert the sign bit, then perform 2's compliment on the mantissa (negate it).
      • If the mantissa is zero, set the output to ZERO.
      • While the upper bit of the result is 0, decrement the exponent, and shift the mantissa left.
      • If the exponent underflows, set the result to ZERO.
;We'll assume that the top bit of the mantissa is always 1.
;exp1 represents the exponent of the first input as an integer with bias at half it's maximum value.
;exp2 represents the exponent of the second input as an integer with bias at half it's maximum value.
;sign1 is 0 or 1. 1 if the first input is negative, 0 otherwise.
;sign2 is 0 or 1. 1 if the second input is negative, 0 otherwise.
;mant1 is the mantissa of the first input, a fixed-point number. Since it is addition, we'll just treat it as an integer.
;mant2 is the mantissa of the second input, a fixed-point number. Since it is addition, we'll just treat it as an integer.
;size is the size of the mantissa.
    if exp1<exp2:
        swap(float1,float2)
    k=exp1-exp2
    if k>size:
        return
    if k=size:
        mant2=1
    else:
        mant2=mant2>>k
        if carry=1:
            mant2=mant2+1
    if sign1=sign2:
        mant1=mant1+mant2
        if carry=1:
            <<shift right mant1, shifting in a 1>>
            exp1=exp1+1
            if carry=1:
                return INF
            return float1
    else:
        mant1=mant1-mant2
        if carry=1:
            sign1=not(sign1)
            mant1=-mant1
        while topbit(mant1)=0:
            exp1-=1
            if exp1=0
                return ZERO
            mant1=mant1<<1
        return float1

In this example, we have a 24-bit little-endian mantissa, with the top bit an implicit 1, instead, used as a sign bit. Finally, the top 8 bytes are a mantissa such that 128 corresponds to an exponent of 0. In all, these floats use 32 bits of space.
addend = scrap
addend2= scrap+7    ;4 bytes

;;Still need to tend to special cases
addSingle:
;;x+y
    push af
    push hl
    push de
    push bc
    inc de
    inc de
    inc hl
    inc hl
    ld a,(de)
    xor (hl)
    push af
    inc de
    inc hl
    ex de,hl
    ld a,(de)
    sub (hl)
    ex de,hl
    jr nc,$+5
    ex de,hl
    neg
    cp 24
    jp nc,add_unneeded
    push hl
    ld hl,addend+6
    dec de
    ld bc,0408h
    dec hl
    ld (hl),0
    sub c
    jr nc,$-5
    add a,c
    push af
    push hl
    ex de,hl
    ld a,(hl) \ or 80h \ ld (de),a \ dec de \ dec hl
    ldd
    ldd
    ex de,hl
    dec b
    jr z,$+7
    ld (hl),0 \ dec hl \ djnz $-3
    pop hl
    pop af
    ld b,a
    jr z,noshift
    set 7,(hl)
_:
    push hl
    srl (hl) \ dec hl
    rr (hl) \ dec hl
    rr (hl) \ dec hl
    rr (hl)
    pop hl
    djnz -_
noshift:
    pop hl  ;bigger float
    dec hl
    ld b,(hl)
    dec hl
    dec hl
    ex de,hl
    pop af
    jp m,subtract
    ld hl,addend+2
    ld a,(hl) \ rla \ inc hl
    ld a,(de) \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
    ld a,(de) \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
    ld a,(de) \ set 7,a \ adc a,(hl) \ ld (hl),a \ inc hl \ inc de
    ld a,(de) \ ld (hl),a
    jp nc,add_done
    inc (hl)
    jp z,add_overflow
    dec hl \ rr (hl)
    dec hl \ rr (hl)
    dec hl \ rr (hl)
    jp add_done
subtract:
    ld hl,addend
    xor a
    ld c,a \ sub (hl) \ ld (hl),a \ inc hl
    ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
    ld a,c
    sbc a,(hl) \ ld (hl),a \ inc hl
    ld a,(de) \ sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
    ld a,(de) \ sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
    ld a,(de) \ set 7,a
    sbc a,(hl) \ ld (hl),a \ inc hl \ inc de
    ld a,(de) \ ld (hl),a
    dec de
    ex de,hl
    jr nc,negated
    ld hl,addend
    ld a,80h \ xor b \ ld b,a
    ld a,c \ sub (hl) \ ld (hl),a \ inc hl
    ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
    ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
    ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
    ld a,c \ sbc a,(hl) \ ld (hl),a \ inc hl
    ld a,c \ sbc a,(hl) \ ld (hl),a
negated:
    jp m,add_done
    push bc
    ld hl,(addend)
    ld de,(addend+2)
    ld bc,(addend+4)
    ld a,h \ or l \ or d \ or e \ or b \ or c
    jp z,add_underflow
    ld a,(addend+6)
normalize:
    dec a \ jr z,add_underflow
    add hl,hl \ rl e \ rl d \ rl c \ rl b
    jp p,normalize
    ld (addend),hl
    ld (addend+2),de
    ld (addend+4),bc
    ld (addend+6),a
    pop bc
add_done:
;;Need to adjust sign flag
    ld hl,addend+5
    ld a,(hl)
    rla \ rl b \ rra \ ld (hl),a
    dec hl
    dec hl
add_copy:
    pop de
    push de
    ldi
    ldi
    ldi
    ld a,(hl)
    ld (de),a
    pop bc
    pop de
    pop hl
    pop af
    ret
add_underflow:
;;How many push/pops are needed?
;;return ZERO
    ld hl,0
    ld (addend+3),hl
    ld (addend+5),hl
    pop bc
    jr add_done
add_overflow:
;;How many push/pops are needed?
;;return INF
    dec hl
    ld (hl),40h
    jr add_done
add_unneeded:
;;How many push/pops are needed?
;;Return bigger number
    pop af
    dec hl
    dec hl
    dec hl
    jr add_copy

For an extended-precision example, see here.

Multiplication

Multiple precision multiplication is tricky on the eZ80 which has built-in 8-bit multiplication instructions, and even trickier on the Z80 which doesn't have built-in multiplication. The way many of us learned in school is easy to implement, but it slows down significantly with higher digit numbers. For example, to multiply two 2-digit numbers, you would need to perform 4 single-digit multiplications, but to multiply two 10-digit numbers, you would need to use 100 single-digit multiplications! In general, to multiply two n-digit numbers, you would need n*n single-digit multiplications. For higher precision multiplication, this can be quite an issue.

In the past century, several algorithms have been developed to greatly reduce the amount of work required, starting off with Karatsuba multiplication, then Toom-Cook, and then FFT based multiplication (including Schönhage–Strassen). I'm not familiar with Fürer's algorithm, unfortunately.

"Schoolbook" method

This is how most people learn to multiply multi-digit numbers in school. It's relatively easy, albeit tedious. That said, it is very easy in binary! Think about it— your only digits are 0 and 1, and multiplying by 0 or 1 is trivial.
So let's do an example. We will multiply 1011 by 1100, but in base 10:

     1011
*    1101
---------
     1011
    00000
   101100
  1011000
---------
= 1113111

Sweet baby carrots, that was easy. Now if it was binary, that 3 is too big for one digit, so we'd subtract 2 (base 2) and carry a 1. If it had been, say, 7, then we'd subtract 6 and carry 3. In this case, when we carry the 1, the next digit gets bumped to 2, so we have to subtract two and carry 1 again, and again, and again, resulting in 10001111. In code, we would actually just sum it up as we go. Here is some pseudocode:
; multiply E times A
    set accumulator to 0

    while A is not 0, then:
        shift A right, carrying out the bottom bit
        if 1 is carried out:
            add E to the accumulator
        shift E to the left, shifting in 0.

    return accumulator as the result

Putting that into assembly, we have:

  ld hl,0   ; set accumulator to 0
  ld d,h    ; (makes adding 'E' to 'accumulator' easier)
loop:
  or a      ; while A is not 0, then:
  ret z     ;

  rra       ;     shift A right, carrying out the bottom bit
  jr nc,skip;     if 1 is carried out:
  add hl,de ;         add E to the accumulator
skip:
  sla e     ;     shift E to the left, shifting in 0.
  rl d      ;
  jr loop

So that is the easy way to implement it. The better way for the Z80 requires looking at multiplication slightly different from how we did it in school. Luckily it's an easy change: work left-to-right instead of right-to-left! So our above example looks like this instead:
     1011
*    1101
---------
  1011000
   101100
    00000
     1011

To implement this version, we'll need a loop counter, and some trickery. On first glance, it looks like we need to shift our 'E' all the way to the left, then shift right on each iteration. However, we can simply shift the accumulator left each iteration before adding. The looks like the following by hand:

     1011
*    1101
-----^---
  1011

     1011
*    1101
------^--
  10110
   1011

     1011
*    1101
-------^-
  101100
   10110
    0000

     1011
*    1101
--------^
  1011000
   101100
    00000
     1011

So in code:

  ld hl,0   ; set accumulator to 0
  ld d,h    ; (makes adding 'E' to 'accumulator' easier)
  ld b,8
loop:
  add hl,hl
  rla
  jr nc,skip;     if 1 is carried out:
  add hl,de ;         add E to the accumulator
skip:
  djnz loop
  ret

Now let's create various routines:

H_Times_E

This is a classic. It packs everything into two 16-bit registers, making it very streamlined, and providing a religious experience (even to atheists).

H_Times_E:
;Inputs:
;  H and E
;Outputs:
;  HL is the product
;  D is 0
;  A,E,B,C are preserved
;36 bytes
;min: 190cc
;max: 242cc
;avg: 216cc

  ld d,0
  ld l,d
  sla h \ jr nc,$+3 \ ld l,e
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ jr nc,$+3 \ add hl,de
  add hl,hl \ ret nc \ add hl,de
  ret

Or we can sacrifice speed for size, taking almost twice the time, but at 1/3 the size:

H_Times_E:
;Inputs:
;  H and E
;Outputs:
;  HL is the product
;  D is 0
;  B is 0
;  A,E,C are preserved
;12 bytes
;min: 381cc
;max: 429cc
;avg: 405cc

  ld d,0
  ld l,d
  ld b,8
_:
  add hl,hl
  jr nc,$+3
  add hl,de
  djnz -_
  ret

DE_Times_A

DE_Times_A:
;Inputs:
;   A and DE
;Outputs:
;   AHL is the product, with A being the upper 8 bits
;   BC = 0
;   DE is preserved

    ld bc,$0800
    ld h,c
    ld l,c
_:
    add hl,hl
    rla          ; Check most-significant bit of accumulator
    jr nc,_skip  ; If zero, skip addition
    add hl,de
    adc a,c
_skip:
    djnz -_
    ret

Karatsuba Multiplication Algorithm

The Karatsuba algorithm is a genuinely cool bit of math that is straightforward and fairly young as far as algorithms go. It reduced a 2-by-2 multiplication to three 1-by-1 multiplications— better than our four 1-by-1 multiplies that we needed with hand multiplication. If we apply this recursively, then a 4-by-4 uses three 2-by-2 multiplies, which translates to nine 1-by-1 multiplies. Much better than 16 1-digit multiplies! The downside is a little more overhead with addition/subtraction.

Before we figure out how it works, here is the algorithm:

#Want to multiply (ax+b) * (cx+d)
#ex., in base 10 digits, this could be (a*10+b)*(c*10+d)
z0 = b*d
z2 = a*c
z1 = (a+b)*(c+d)-z0-z2
result = z0+z1*x+z2*x^2

Or another statement of the algorithm (my preferred interpretation):
z0 = b*d
z2 = a*c
z1 = z0+z2-(a-b)*(c-d)
result = z0+z1*x+z2*x^2

So as an example, we'll multiply 37 by 69.

a=3, b=7, c=6, d=9
z0 = 7*9 = 63
z2 = 3*6 = 18
z1 = 63+18-(3-7)*(6-9) = 69   #This is coincidence that z1 is 69 and one of our inputs was also 69 !
result = 63+690+1800 = 2553

To verify this algorithm, note that z1 = z0+z2-(a-b)*(c-d) = bd+ac-(a-b)(c-d) = ac+bd-(ac-ad-bc+bd) = ad+bc so then result = z0+z1*x+z2*x^2 = bd+(ad+bc)x+acx^2 which is indeed (ax+b)(cx+d).

Toom-Cook Multiplication Algorithms

The Toom-Cook algorithm extends Karatsuba's algorithm via a different approach. For an n-by-m multiplication, we will only need to n+m-1 1-digit multiplications. This is an insane improvement over Karatsuba, let alone our hand method. A 16-by-16 multiply now needs 31 single-digit multiplies instead of 81 in Karatsuba, or 256 in the hand method. Holy heck. The problem is that overhead for recombining the multiplications quickly grows non-trivial.

The premise starts with polynomials. If you are given two points, there is a unique linear function that passes though both (a line). Given three points, there is only one quadratic function that passes through all three points. In general, given n+1 points, there is a unique n-th degree polynomial that passes through each point.

Now let's look at multiplying polynomials. Given two linear functions, f(x)=ax+b and g(x)=cx+d, we can multiply them to get a quadratic function. If we evaluate this at three points, then theoretically we can find the coefficients for the quadratic polynomial. Some easy points to evaluate are x=0,1, and -1. Or in other words, we get bd, (a+b)(c+d), (b-a),(d-c). The formula for recombining the points at x=0,1, and -1 is this:

z0 = f(0)g(0)
z1 = f(1)g(1)
z2 = f(-1)g(-1)
result = z0+(z1-z2)x/2+((z1+z2)/2-z0) * x^2

In general, the formula is different depending on the x values you choose to evaluate at. Some formulas are easier than others, but some x values are more difficult than others.

So with our example in the Karatsuba section, f(x)=3x+7 and g(x)=6x+9,

z0 = f(0)g(0)   = 7*9 = 63
z1 = f(1)g(1)   = (3+7)(6+9) = 150
z2 = f(-1)g(-1) = (7-3)(9-6) = 12
result = z0+(z1-z2)x/2+((z1+z2)/2-z0) * x^2
       = 63+(150-12)x/2+((150+12)/2-63) * x^2
       = 63+69x+18x^2

This example is similar to Karatsuba's method, but a little more complicated. It turns out that Karatsuba's method is just Toom-Cook with points at either x=0,1,+inf, or x=0,-1,+inf.

For a fixed set of points, the formula to recombine the values will always be the same, so we have two challenges. First, we need to choose a set of points that our algorithm will evaluate at, second, we need to find the recombining formula. I like to solve this with a system of equations. If we want an algorithm for 3-by-3 multiplication, then we'll need five points. For example, x=-2, -1, 0, 1, and 2. Then we want to generate coefficients c0, c1, c2, c3, and c4. The easy way is to put this matrix into your calculator:

    [[1  -2   4  -8  16]
     [1  -1   1  -1   1]
[A]= [1   0   0   0   0]
     [1   1   1   1   1]
     [1   2   4   8  16]]

And then find the inverse:
       [[    0     0     1     0      0]
        [ 1/12  -2/3     0   2/3  -1/12]
[A]^-1= [-1/24   2/3  -5/4   2/3  -1/24]
        [-1/12   1/6     0  -1/6   1/12]
        [ 1/24  -1/6   1/4  -1/6   1/24]]

And this tells us that:
z0=f(-2)g(-2)
z1=f(-1)g(-1)
z2=f(0)g(0)
z3=f(1)g(1)
z4=f(2)g(2)

c0 = [    0     0     1     0      0]*[z0   z1   z2   z3   z4] = z2
c1 = [ 1/12  -2/3     0   2/3  -1/12]*[z0   z1   z2   z3   z4] = (z0-z4)/12 + 2(z3-z1)/3
c2 = [-1/24   2/3  -5/4   2/3  -1/24]*[z0   z1   z2   z3   z4] =-(z0+z4)/24 + 2(z3+z1)/3 - 5*z2/4
c3 = [-1/12   1/6     0  -1/6   1/12]*[z0   z1   z2   z3   z4] = (z4-z0)/12 + (z1-z3)/6
c4 = [ 1/24  -1/6   1/4  -1/6   1/24]*[z0   z1   z2   z3   z4] = (z0+z4)/24 - (z1+z3)/6 -z2/4

This reduces a 3-by-3 multiplication to 5 non-trivial multiplications, 2 divisions-by-3, 10 add/sub, and a handful of shifts. Those divisions-by three are a bit slow, but much faster than a multiplication if you use the right algorithms.

Multiplication, base case

In all cases, we will need a "base case" multiplication routine. This is the smallest multiplication performed, and all larger multiplication are broken up into some combination of these multiplications. On the eZ80, you can use the mlt instruction as your base case. For example, the mlt hl instruction multiplies H by L and stores the 16-bit result in HL. On the regular Z80, we need to create our own custom routine. In my experience, a well made 16-by-16 multiplication routine is the best base case. The following is made by Runer112, analyzed by Zeda, and tested by jacobly:

;This was made by Runer112
;Tested by jacobly
mul16:
;BC*DE --> DEHL
; ~544.887cc as calculated in jacobly's test
;min: 214cc  (DE = 1)
;max: 667cc
;avg: 544.4507883cc   however, deferring to jacobly's result as mine may have math issues ?
    ld    a,d
    ld    d,0
    ld    h,b
    ld    l,c
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit14
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit13
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit12
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit11
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit10
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit9
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit8
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit7
    ld    a,e
     and    %11111110
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit6
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit5
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit4
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit3
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit2
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit1
    add    a,a
    jr    c,Mul_BC_DE_DEHL_Bit0
    rr    e
    ret    c
    ld    h,d
    ld    l,e
    ret

Mul_BC_DE_DEHL_Bit14:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit13
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit13:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit12
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit12:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit11
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit11:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit10
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit10:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit9
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit9:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit8
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit8:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit7
    add    hl,bc
    adc    a,d
Mul_BC_DE_DEHL_Bit7:
    ld    d,a
    ld    a,e
    and    %11111110
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit6
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit6:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit5
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit5:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit4
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit4:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit3
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit3:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit2
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit2:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit1
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit1:
    add    hl,hl
    adc    a,a
    jr    nc,Mul_BC_DE_DEHL_Bit0
    add    hl,bc
    adc    a,0
Mul_BC_DE_DEHL_Bit0:
    add    hl,hl
    adc    a,a
    jr    c,Mul_BC_DE_DEHL_FunkyCarry
    rr    e
    ld    e,a
    ret    nc
    add    hl,bc
    ret    nc
    inc    e
    ret    nz
    inc    d
    ret

Mul_BC_DE_DEHL_FunkyCarry:
    inc    d
    rr    e
    ld    e,a
    ret    nc
    add    hl,bc
    ret    nc
    inc    e
    ret

Although this is unlikely to change, the most up-to-date version can be found here.

mul24, while less used on the Z80, can still be a base-case multiplication, though it does use an 8-by-24 multiplication subroutine. Here is 8-by-24:

C_times_BDE:
;;C*BDE => CAHL
;C = 0     157
;C = 1     141
;141+
;C>=128    135+6{0,33+{0,1}}+{0,20+{0,8}}
;C>=64     115+5{0,33+{0,1}}+{0,20+{0,8}}
;C>=32     95+4{0,33+{0,1}}+{0,20+{0,8}}
;C>=16     75+3{0,33+{0,1}}+{0,20+{0,8}}
;C>=8      55+2{0,33+{0,1}}+{0,20+{0,8}}
;C>=4      35+{0,33+{0,1}}+{0,20+{0,8}}
;C>=2      15+{0,20+{0,8}}
;min: 141cc
;max: 508cc
;avg: 349.21279907227cc

  ld a,b
  ld h,d
  ld l,e
  sla c \ jr c,mul8_24_1
  sla c \ jr c,mul8_24_2
  sla c \ jr c,mul8_24_3
  sla c \ jr c,mul8_24_4
  sla c \ jr c,mul8_24_5
  sla c \ jr c,mul8_24_6
  sla c \ jr c,mul8_24_7
  sla c \ ret c
  ld a,c
  ld h,c
  ld l,c
  ret
mul8_24_1:
    add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_2:
    add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_3:
    add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_4:
    add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_5:
    add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_6:
    add hl,hl \ rla \ rl c \ jr nc,$+7 \ add hl,de \ adc a,b \ jr nc,$+3 \ inc c
mul8_24_7:
    add hl,hl \ rla \ rl c \ ret nc \ add hl,de \ adc a,b \ ret nc \ inc c \ ret

And mul24 needs var48 defined.

mul24:
;;BDE*CHL -> HLBCDE
;;155 bytes
;;402+3*C_Times_BDE
;;fastest:1201cc
;;slowest:1753cc
;;avg    :1464.9033203125cc (1464+925/1024)
;min: 825cc
;max: 1926cc
;avg: 1449.63839751681cc

    push bc
    ld c,l
    push hl
    call C_Times_BDE
    ld (var48),hl
    ld l,a
    ld h,c
    ld (var48+2),hl

    pop hl
    ld c,h
    call C_Times_BDE
    push bc
    ld bc,(var48+1)
    add hl,bc
    ld (var48+1),hl
    pop bc
    ld b,c
    ld c,a
    ld hl,(var48+3)
    ld h,0
    adc hl,bc
    ld (var48+3),hl

    pop bc
    call C_Times_BDE
    ld de,(var48+2)
    add hl,de
    ld (var48+2),hl
    ld d,c
    ld e,a
    ld b,h
    ld c,l
    ld hl,(var48+4)
    ld h,0
    adc hl,de
    ld de,(var48)
    ret

Integer Multiplication, multiprecision

This 32-by-32 multiply requires 8 bytes of RAM and uses the above mul16 routine as a base case for Karatsuba Multiplication:

; You will need to define z32_0, which is where the result is written!

mul32:
;max: 703cc  + 3*mul16
;     2704cc
;min: 655cc  + 3*mul16
;     1297cc
;avg: 673.25cc+3*mul16
;     2307.911cc
;DEHL * BCIX ==> z32_0
z32_2 = z32_0+4

  push de
  push bc
  push hl
  push ix
  call mul16  ;DEHL
  ld (z32_2),hl
  ld (z32_2+2),de

  pop de
  pop bc
;  push bc
  push de
  call mul16  ;DEHL
  ld (z32_0),hl
  ld (z32_0+2),de

  pop de    ;low word
;  pop bc    ;low word
  pop hl
  xor a
  sbc hl,de
  jr nc,+_
  sub l
  ld l,a
  sbc a,a
  sub h
  ld h,a
  xor a
  inc a
_:
  ex de,hl
  pop hl
  sbc hl,bc
  jr nc,+_
  ld b,a
  xor a
  sub l
  ld l,a
  sbc a,a
  sub h
  ld h,a
  ld a,b
  inc a
_:
  ld b,h
  ld c,l
  push af
  call mul16
  pop af    ;holds the sign in the low bit
  rra
  jr c,mul32_add
;need to perform z0+z2-result
  push de
  push hl
  xor a
  ld hl,(z32_0)
  ld bc,(z32_2)
  add hl,bc
  ex de,hl
  ld hl,(z32_0+2)
  ld bc,(z32_2+2)
  adc hl,bc
  rla
;now need to subtract
  ex de,hl
  pop bc
  sbc hl,bc
  ex de,hl
  pop bc
  sbc hl,bc
  sbc a,0
;A:HL:DE is the result, need to add to z32_0+2
mul32_final:
  ld bc,(z32_0+2)
  ex de,hl
  add hl,bc
  ld (z32_0+2),hl
  ld hl,(z32_2)
  adc hl,de
  ld (z32_2),hl
  ld hl,z32_2+2
  adc a,(hl)
  ld (hl),a
  ret nc
  inc hl
  inc (hl)
  ret
mul32_add:
;add to the current result
  xor a
  ld bc,(z32_0)
  add hl,bc
  ex de,hl
  ld bc,(z32_0+2)
  adc hl,bc
  rla
  ex de,hl
  ld bc,(z32_2)
  add hl,bc
  ex de,hl
  ld bc,(z32_2+2)
  adc hl,bc
  adc a,0
  jp mul32_final

And for 64-bit, you'll need the above mul32, as well as a mov8 routine like this:

  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ret
mul64:
;multiplies the 64-bit integers at inp64_1 and inp64_2
;stores the 128-bit (16-byte) result at out128
;
;min: 1740+3*min(mul32)
;     5631cc
;max: 1901+3*max(mul32)
;     10013cc
;avg: 1797+3*avg(mul32) + 9572881/2^24
;    ~8720.733cc
;uses 24 bytes at out128
z64_0 = out128
z64_2 = z64_0+8
z32_0 = z64_2+8

  ld de,(inp64_1+6)
  ld hl,(inp64_1+4)
  ld bc,(inp64_2+6)
  ld ix,(inp64_2+4)
  call mul32
  ;copy the 8 bytes at z32_0 to z64_2
  ld hl,z32_0
  ld de,z64_2
  call mov8

  ld de,(inp64_1+2)
  ld hl,(inp64_1)
  ld bc,(inp64_2+2)
  ld ix,(inp64_2)
  call mul32
  ;copy the 8 bytes at z32_0 to z64_0
  ld hl,z32_0
  ld de,z64_0
  call mov8

;now I need to subtract the 32-bit digits from each other
  xor a
  ld hl,(inp64_1)
  ld bc,(inp64_1+4)
  sbc hl,bc
  ex de,hl
  ld hl,(inp64_1+2)
  ld bc,(inp64_1+6)
  sbc hl,bc
  jr nc,+_
  ld b,a \ sub e \ ld e,a
  ld a,b \ sbc a,d \ ld d,a
  ld a,b \ sbc a,l \ ld l,a
  ld a,b \ sbc a,h \ ld h,a
  ld a,b
_:
  rla
  push hl   ;top byte
  push de

  ld hl,(inp64_2)
  ld bc,(inp64_2+4)
  sbc hl,bc
  ex de,hl
  ld hl,(inp64_2+2)
  ld bc,(inp64_2+6)
  sbc hl,bc
  jr nc,+_
  ld c,a
  xor a
  ld b,a
  sub e \ ld e,a
  ld a,b \ sbc a,d \ ld d,a
  ld a,b \ sbc a,l \ ld l,a
  ld a,b \ sbc a,h \ ld h,a
  ld a,c
  inc a
_:
  ex de,hl
  pop ix
  pop bc
  push af
  call mul32
  pop af    ;holds the sign in the low bit

  rra
  jp c,mul64_add
;need to perform z0+z2-result
  xor a
  ld hl,(z64_0)
  ld de,(z64_2)
  add hl,de
  ld (inp64_1),hl
  ld hl,(z64_0+2)
  ld de,(z64_2+2)
  adc hl,de
  ld (inp64_1+2),hl
  ld hl,(z64_0+4)
  ld de,(z64_2+4)
  adc hl,de
  ld (inp64_1+4),hl
  ld hl,(z64_0+6)
  ld de,(z64_2+6)
  adc hl,de
  ld (inp64_1+6),hl
  rla
;now need to subtract
  ld hl,(inp64_1)
  ld de,(z32_0)
  sbc hl,de
  ld (inp64_1),hl
  ld hl,(inp64_1+2)
  ld de,(z32_0+2)
  sbc hl,de
  ld (inp64_1+2),hl
  ld hl,(inp64_1+4)
  ld de,(z32_0+4)
  sbc hl,de
  ld (inp64_1+4),hl
  ld hl,(inp64_1+6)
  ld de,(z32_0+6)
  sbc hl,de
  ld (inp64_1+6),hl
  sbc a,0
mul64_final:
;now need to add it back in
  ld hl,(z64_0+4)
  ld de,(inp64_1)
  add hl,de
  ld (z64_0+4),hl
  ld hl,(z64_0+6)
  ld de,(inp64_1+2)
  adc hl,de
  ld (z64_0+6),hl
  ld hl,(z64_0+8)
  ld de,(inp64_1+4)
  adc hl,de
  ld (z64_0+8),hl
  ld hl,(z64_0+10)
  ld de,(inp64_1+6)
  adc hl,de
  ld (z64_0+10),hl
  ld hl,z64_0+12
  adc a,(hl)
  ld (hl),a
  ret nc
  inc hl \ inc (hl) \ ret nz
  inc hl \ inc (hl) \ ret nz
  inc hl \ inc (hl) \ ret
mul64_add:
;add to the current result
;z0+z2+result
  xor a
  ld hl,(z64_0)
  ld de,(z64_2)
  add hl,de
  ld (inp64_1),hl
  ld hl,(z64_0+2)
  ld de,(z64_2+2)
  adc hl,de
  ld (inp64_1+2),hl
  ld hl,(z64_0+4)
  ld de,(z64_2+4)
  adc hl,de
  ld (inp64_1+4),hl
  ld hl,(z64_0+6)
  ld de,(z64_2+6)
  adc hl,de
  ld (inp64_1+6),hl
  rla
;now need to subtract
  ld hl,(inp64_1)
  ld de,(z32_0)
  add hl,de
  ld (inp64_1),hl
  ld hl,(inp64_1+2)
  ld de,(z32_0+2)
  adc hl,de
  ld (inp64_1+2),hl
  ld hl,(inp64_1+4)
  ld de,(z32_0+4)
  adc hl,de
  ld (inp64_1+4),hl
  ld hl,(inp64_1+6)
  ld de,(z32_0+6)
  adc hl,de
  ld (inp64_1+6),hl
  adc a,0
  jp mul64_final

BCD Multiplication, multiprecision

Fixed Point Multiplication

If we multiply an A.B fixed-point number by a C.D fixed-point number, we get an (A+C).(B+D) fixed-point number. So for example, if we use 16 bits to encode a 4.12 fixed-point number, and multiply it by another 4.12 fixed point, the result will be 8.24. To convert this to a 4.12, we'll need to discard the top 4 bits of the result, and the bottom 12 bits. Depending on the application, you might also want to check for overflow, in which case you'd need to check the discarded 4 bits at the top.

Fixed point multiplication uses an integer multiplication at its core. For example, an 8.8 fixed-point routine can use a 16-by-16 multiply, with minor adjustments. The result should be interpreted as a 16.16 fixed-point number, so you'll need to grab the 8 bits above and below the decimal in order to keep the output as 8.8 fixed. As well, you'll need special code for handling negative numbers. Putting it together, using the mul16 from the integer multiplication section:

mulfixed8_8:
;Multiplies H.L by D.E, stores the result in H.L
; First, find out if the output is positive or negative
  ld a,h
  xor d
  push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
  xor d     ;A now has the value of H, since I XORed it with D twice (cancelling)
  jp p,mulfixed8_8_lbl1   ;if Positive, don't negate
  xor a
  sub l
  ld l,a
  sbc a,a
  sub h
  ld h,a
mulfixed8_8_lbl1:
  bit 7,d
  jr z,mulfixed8_8_lbl2
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a
mulfixed8_8_lbl2:

; Now we need to put DE in BC to use mul16
  ld b,d
  ld c,e
  call mul16

;Get the middle two bytes, EH, and put them in HL
  ld l,h
  ld h,e

;We should check for overflow. If D>0, we will set HL to 0x7FFF
  ld a,d
  or a
  jr z,mulfixed8_8_lbl3
  ld hl,$7FFF
mulfixed8_8_lbl3:

; Now we need to restore the sign
  pop af
  ret p    ;don't need to do anything, result is already positive
  xor a
  sub l
  ld l,a
  sbc a,a
  sub h
  ld h,a
  ret

Or slightly more complicated 4.12 multiplication, which also uses the mul16 as its core, and only the resizing step changes:

mulfixed4_12:
;Multiplies H.L by D.E, stores the result in H.L
; First, find out if the output is positive or negative
  ld a,h
  xor d
  push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
  xor d     ;A now has the value of H, since I XORed it with D twice (cancelling)
  jp p,mulfixed4_12_lbl1   ;if Positive, don't negate
  xor a
  sub l
  ld l,a
  sbc a,a
  sub h
  ld h,a
mulfixed4_12_lbl1:
  bit 7,d
  jr z,mulfixed4_12_lbl2
  xor a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a
mulfixed4_12_lbl2:

; Now we need to put DE in BC to use mul16
  ld b,d
  ld c,e
  call mul16

;The result doesn't need the top 4 bits or bottom 12 bits.
;We'll hold onto the top 4 bits to check overflow, though.
;Currently we need to shift DEH left by 4 bits and keep DE, or right by 12 bits and keep HL.
  ld a,h    ;we'll actually be moving the discared bits into A
  and $F0
  rla \ rl e \ rl d
  rla \ rl e \ rl d
  rla \ rl e \ rl d
  rla \ rl e \ rl d
  adc a,a
  ex de,hl

;if A is non-zero, we have overflow
  jr z,mulfixed4_12_lbl3
  ld hl,$7FFF
mulfixed4_12_lbl3:

; Now we need to restore the sign
  pop af
  ret p    ;don't need to do anything, result is already positive
  xor a
  sub l
  ld l,a
  sbc a,a
  sub h
  ld h,a
  ret

For 16.16 fixed point, we'll use the mul32 from the integer multiplication section.

mulfixed16_16:
;Multiplies DE.HL by BC.IX, stores the result in DE.HL
; First, find out if the output is positive or negative
  ld a,d
  xor b
  push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
  xor b     ;A now has the value of D, since I XORed it with B twice (cancelling)
  jp p,mulfixed16_16_lbl1   ;if Positive, don't negate
  xor a
  sub l
  ld l,a
  ld a,0
  sbc a,h
  ld h,a
  ld a,0
  sbc a,e
  ld e,a
  sbc a,a
  sub d
  ld d,a
mulfixed16_16_lbl1:
  bit 7,b
  jr z,mulfixed16_16_lbl2
  xor a
  sub ixl
  ld ixl,a
  ld a,0
  sbc a,ixh
  ld ixh,a
  ld a,0
  sbc a,c
  ld c,a
  sbc a,a
  sub b
  ld b,a
mulfixed16_16_lbl2:

; Now we multiply
  call mul32

;We should check for overflow. If the upper two bytes are non-zero, we will set the result to 0x7FFFFFFF
  ld hl,(z32_0+6)
  ld a,h
  or l

;Get the middle four bytes and put them in DEHL
  ld hl,(z32_0+2)
  ld de,(z32_0+4)

;Maybe we need to set the result to 0x7FFFFFFF
  jr z,mulfixed16_16_lbl3
  ld de,$7FFF
  ld h,e
  ld l,e
mulfixed16_16_lbl3:

; Now we need to restore the sign
  pop af
  ret p    ;don't need to do anything, result is already positive
  xor a
  ld b,a
  sub l
  ld l,a
  ld a,b
  sbc a,h
  ld h,a
  ld a,b
  sbc a,e
  ld e,a
  sbc a,a
  sub d
  ld d,a
  ret

Floating Point Multiplication

Floating point multiplication is actually pretty easy compared to FP addition and subtraction.
We'll start with an example: Suppose we have a floating point format that stores a 6-digit mantissa, and digits are base-10. Let's multiply 3.14159*10-1 * 6.54321*103. To do so, we can add the powers of their exponents and this is our new exponent, and directly multiply the mantissas for our new mantissa.
So our result is (3.14149 * 6.54321)*10(-1+3) = 20.5560831039*102. Our mantissa is >=10, so we'll need to increment our exponent and divide our mantissa by 10 (shift right by 1 digit): 2.05560831039*103. Now we need to keep only six digits of the mantissa, and we'll round: 2.05561*103^^. Tada :)

So the steps we do:

  • XOR the signs of the inputs. This is the sign bit of our result.
  • Add the exponents together to form our new exponent. Be wary of overflow or underflow!
  • Multiply their mantissas as 1.n fixed point numbers. If the top digit of the result is non-zero, we'll need to increment the exponent (again, check overflow), and then shift the digits right.
    • In practice, I actually shift left by one digit if the top digit is 0, because then it is in the final form we need for our mantissa.
  • Truncate the mantissa to the correct size, rounding if desired.

There are Single Precision and Extended Precision examples on GitHub. These also check for "special values" that represent 0, +infinity, -infinity, and Not a Number (NaN).

In case something breaks, here are the current versions of those routines as of this writing:

var48 = scrap+4
mulSingle:
;Inputs: HL points to float1, DE points to float2, BC points to where the result is copied
;Outputs: float1*float2 is stored to (BC)
;573+mul24+{0,35}+{0,30}
;min: 1398cc
;max: 2564cc
;avg: 2055.13839751681cc
    push af
    push hl
    push de
    push bc

    call +_   ;CHLB
    ld a,c
    ex de,hl
    pop hl
    push hl
    ld (hl),b \ inc hl
    ld (hl),e \ inc hl
    ld (hl),d \ inc hl
    ld (hl),a
    pop bc
    pop de
    pop hl
    pop af
    ret

_:
;;return float in CHLB
    push de
    ld e,(hl)
    inc hl
    ld d,(hl)
    inc hl
    ld c,(hl)
    inc hl
    ld a,(hl)
    or a
    jr z,mulSingle_case0
    ex de,hl
    ex (sp),hl
    ld e,(hl)
    inc hl
    ld d,(hl)
    inc hl
    ld b,(hl)
    inc hl
    inc (hl)
    dec (hl)
    jr z,mulSingle_case1
    add a,(hl)      ;\
    pop hl          ; |
    rra             ; |Lots of help from Runer112 and
    adc a,a         ; |calc84maniac for optimizing
    jp po,bad       ; |this exponent check.
    xor 80h         ; |
    jr z,underflow  ;/
    push af         ;exponent
    ld a,b
    xor c
    push af         ;sign
    set 7,b
    set 7,c
    call mul24      ;BDE*CHL->HLBCDE, returns sign info
    pop de
    ld a,e
    pop de
    jp m,+_
    rl c
    rl b
    adc hl,hl
    dec d
_:
    inc d
    jr z,overflow
    rl c
    ld c,d
    ld de,0
    push af
    ld a,b
    adc a,e
    ld b,a
    adc hl,de
    jr nc,+_
    inc c \ jr z,overflow
    rr h
    rr l
    rr b
_:
    pop af
    cpl
    and $80
    xor h
    ld h,a
    ret
bad:
    jr nc,overflow
underflow:
    ld hl,0
    rl b
    rr h
    ld c,l
    ld b,l
    ret
overflow:
    ld hl,$8000
    jr underflow+3
mulSingle_case1:
;x*0   -> 0
;x*inf -> inf
;x*NaN -> NaN
  pop hl
  ld h,b
  ld l,d
  ld b,e
  ld c,0
  ret
mulSingle_case0:
;special*x = special
;NaN*x = NaN
;0*0 = 0
;0*NaN = NaN
;0*Inf = NaN
;Inf*Inf  = Inf
;Inf*-Inf =-Inf
  ;0CDE
  pop hl
  inc hl
  inc hl
  inc hl
  ld a,(hl)
  or a
  jr z,+_
  ld h,c
  ld c,0
  ret
_:
  dec hl
  ld b,(hl)
;basically, if b|c has bit 5 set, return NaN
  ld a,b
  or c
  ld h,$20
  and h
  jr z,+_
  ld c,0
  ret
_:
  ld a,c
  xor b
  rl b
  rlca
  rr b
  res 4,b

  rl c
  rrca
  rr c

  ld a,c
  and $E0
  add a,b
  rra
  ld h,a
  ld c,0
  ret

xmul (extended precision, 80-bit floats):

#include "routines/pushpop.z80"
#include "routines/mov.z80"
#include "mul/mul64.z80"
#include "routines/rl64.z80"

#define var_z xOP3+16
;uses 60 bytes after xOP1
xmul:
;Input:
;  HL points to one number
;  DE points to another
;Timing, excluding special cases (which take ~ 800cc):
;1057+{0,3}+{0,172}+mul64
;max: 1232+max(mul64)
;     11245cc
;min: 1057+min(mul64)
;     6688cc
;avg: 1144.5+avg(mul64)
;     9865.233ccs

  push hl
  push de
  push bc
  push af
  push ix
  push bc
  call +_
  pop hl
  push de
  ex de,hl
  ld hl,var_z+8
  call mov8
  ex de,hl
  pop de
  ld (hl),e
  inc hl
  ld (hl),d
  pop ix
  pop af
  pop bc
  pop de
  pop hl
  ret
_:
  push de
  ld de,xOP1
  call mov10
  pop hl
  call mov10
    ld de,(xOP2+8)
    ld hl,(xOP1+8)
xmul_stepin_geomean:
    ld a,h
    xor d
  ld b,a
    res 7,d
    res 7,h
    ld a,h \ or l \ jp z,casemul
    ld a,d \ or e \ jp z,casemul2
  add hl,de
  ld de,$4000
  sbc hl,de
  jp c,mul_zero
  jp m,mul_inf
  sla b
  jr nc,+_
  set 7,h
_:
  push hl
    call mul64
  ld a,(var_z+15)
  add a,a
  pop de
  jr c,+_
#ifdef inc_FMA
  ld hl,var_z
  call rl64
#else
  ld hl,var_z+7
  sla (hl)
#endif
  inc hl
  jp rl64

_:
  inc de
  ret

casemul:
;xOP1 is inf/nan/0
  ld hl,xOP2+9
  ld a,(hl)
  dec hl
  or (hl)
  dec hl
  ld a,(hl)
  ld hl,xOP2
  jr nz,casemul2_copy
  ;now we have two special cases to multipy together
;inf*inf-> inf
;0*0    -> 0
;
;nan*nan-> NaN
;inf*nan-> NaN
;inf*0  -> NaN
;nan*inf-> NaN
;nan*0  -> NaN
;0*inf  -> NaN
;0*nan  -> NaN

  sla b
  ld de,0
  rr d
  and $C0
  ld c,a
  ld a,(xOP1+7)
  and $C0
  cp c
  jr z,+_
  ld a,$40
_:
  ld (var_z+15),a
  ret
casemul2:
;finite times inf/nan/0, so xOP1 -> out
  ld hl,xOP1
casemul2_copy:
  ld de,var_z+8
  call mov8
  ld e,(hl)
  inc hl
  ld d,(hl)
  ret
mul_zero:
  xor a
  ld (var_z+15),a
  ld d,a
  ret
mul_inf:
  ld d,e
  ld a,255
  ld (var_z+15),a
  ret

Division

Division is usually considered to be a slow operation, but in my experience, at high precision, it is only a little slower than multiplication if done right.

"Schoolbook" Division

Division can be done just like base-10 division most of us learned in school, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (0) or bigger (1) ?"

Here is the pseudocode:

N is the numerator
D is the denominator
R is the remainder
Q is the quotient

While N>0:
    shift Q left one bit to make room for the next digit
    left-shift the top bit of N into R, removing the bit from N.
    If R>=D
        R - D -> R
        Q + 1 -> Q

Return Q

In Z80 assembly, this might look like

Div_HL_D:
;Inputs:
;   HL and D
;Outputs:
;   HL is the quotient (HL/D)
;   A is the remainder
;   B is 0
;   C,D,E are preserved
    xor a         ; Clear upper eight bits of AHL
    ld b,16       ; Sixteen bits in dividend
_loop:
    add hl,hl     ; Do a "SLA HL". If the upper bit was 1, the c flag is set
    rla           ; This moves the upper bits of the dividend into A
    jr c,_overflow; If D is allowed to be >128, then it is possible for A to overflow here. (Yes future Zeda, 128 is "safe.")
    cp d          ; Check if we can subtract the divisor
    jr c,_skip    ; Carry means A < D
_overflow:
    sub d         ; Do subtraction for real this time
    inc l         ; Set the next bit of the quotient (currently bit 0)
_skip:
    djnz _loop
    ret

Notice here that we are using HL as both our 'N' value and our 'Q' value from the pseudocode.
'A' acts as our remainder, and D acts as our dividend.

Special Case Division

Division by a power of 2

This is easy: just right-shift! If you want to divide by 2, right shift once. Divide by 4? Right shift twice. Divide by 8? Right shift three times…

Division by Constants

Here is an example of how to divide by a constant. Suppose I want to divide an 8-bit integer, x, by 7, and return the result as an 8-bit integer. Well, 1/7 is approximately 73/512, so I can get a pretty close result with x*73/512. Division by 512 will be easy: Just take the top 8 bits from the multiplication, and shift right. For example:

;A/7
  ld l,a
  ld h,0
  ld b,h
  ld c,l
  add hl,hl   ;A*2
  add hl,hl   ;A*4
  add hl,hl   ;A*8
  add hl,bc   ;A*9
  add hl,hl   ;A*18
  add hl,hl   ;A*36
  add hl,hl   ;A*72
  add hl,bc   ;A*73
  ld a,h
  rra         ;carry flag is always reset in the case, so no need to worry
  adc a,b     ;just to round. B is 0
  ret

This isn't perfect, but it is pretty good, and fast.

Turning Division Into Multiplication

You can turn division into multiplication as follows:
Suppose you can write your division as x/(1-y) where |y|<1. Then you can rewrite your division as:

x*(1+y)*(1+y2)*(1+y4)*(1+y8)*(1+y16)*…

So some pseudocode might be:

while y>10^-15    ;basically, until y is very small
    x + x*y -> x
    y * y   -> y

Division by 2^n - 1

If we want to perform x/(2n-1), then we can rewrite this as 2-n * x/(1-2-n). Using the technique above, this gives us the following pseudocode:

x + (x>>n) -> x
x + (x>>(2n)) -> x
x + (x>>(4n)) -> x
x + (x>>(8n)) -> x
x + (x>>(16n))-> x
...
return x>>n

For examples:

Division by 3

In this case, 3=22-1, so we can use the code:

x + (x>>2) -> x
x + (x>>4) -> x
x + (x>>8) -> x
x + (x>>16)-> x
...
return x>>2

For example, here is a way to implement HL/3 with some rounding errors:
;HL/2   ;we'll do this to prevent overflow, and divide by 2 at the end instead of 4
  srl h \ rr l

;HL + (HL>>2) --> HL
  ld b,h
  ld c,l
  srl b \ rr c
  srl b \ rr c
  add hl,bc

;HL + (HL>>4) --> HL
  ld b,h
  ld c,l
  srl b \ rr c
  srl b \ rr c
  srl b \ rr c
  srl b \ rr c
  add hl,bc

;HL + (HL>>8) --> HL
  ld b,0
  ld c,h
  add hl,bc

HL/2
  srl h \ rr l

Since left shifts can often be done cheaper, let's try this with left-shifts, and then do a bunch of right-shifts at the end:

;We want a difference of a factor of 2 shifts
  ld b,h
  ld c,l
  xor a
  ld e,a
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,bc \ adc a,e

;We want a difference of a factor of 4 shifts
  ld b,h
  ld c,l
  ld e,a
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,bc \ adc a,e

;We want a difference of a factor of 8 shifts
  ld d,a
  ld e,h
  ld b,l
  ld c,0
  add hl,bc
  add a,e
  ld e,a
  jr nc,$+3
  inc d

;Now we need to shift right by 2+4+8 plus a final 2, so shift right by 16
;This means we just keep DE as our result!

But knowing that we are going to shift right by 16 at the end, we can optimize the final iteration:

;We want a difference of a factor of 2 shifts
  ld b,h
  ld c,l
  xor a
  ld e,a
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,bc \ adc a,e

;We want a difference of a factor of 4 shifts
  ld b,h
  ld c,l
  ld e,a
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,bc \ adc a,e

  ld b,a
  ld c,h
  add hl,bc
  adc a,0
  ld l,h
  ld h,a
;now HL is our result

Finally, it turns out that the accuracy issues occur at products of 3, and we can fix that simply by incrementing HL:

HL_Div_3:
;Input: HL
;Output: HL is the input divided by 3
;Destroys: B,C,E,A
;217cc

;increment HL, putting overflow in A
  ld bc,1
  ld a,b
  add hl,bc
  adc a,b

;We want a difference of a factor of 2 shifts
  ld b,h
  ld c,l
  ld e,a
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,bc \ adc a,e

;We want a difference of a factor of 4 shifts
  ld b,h
  ld c,l
  ld e,a
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,hl \ rla
  add hl,bc \ adc a,e

  ld b,a
  ld c,h
  add hl,bc
  adc a,0
  ld l,h
  ld h,a
;now HL is our result

  ret

Division by 2^n + 1

If we want to perform x/(2n+1), then we can rewrite this as 2-n * x/(1+2-n). Using the technique for turning division into multiplication, this gives us the following pseudocode:

x - (x>>n) -> x
x + (x>>(2n)) -> x
x + (x>>(4n)) -> x
x + (x>>(8n)) -> x
x + (x>>(16n))-> x
...
return x>>n

Division, BC/DE

In this section, we'll explore the integer division routine for dividing BC by DE and various ways to optimize it. We'll start off with our baseline routine, and note that our accumulator will never overflow (we'll use a 16-bit accumulator, and up to 16 bits will be shifted in), so we don't need to take care of that as we did in the HL/D routine.

BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;1382cc
;23 bytes

  ld hl,0    ;Accumulator
  ld a,16    ;Loop counter

div_loop:
  ;shift the bits from BC (numerator) into HL (accumulator)
  sla c \ rl b
  adc hl,hl

  ;Check if remainder >= denominator (HL>=DE)
  sbc hl,de
  jr c,div_loop_readjust
  inc c
  jr div_loop_done

div_loop_readjust:
; remainder is not >= denominator, so we have to add DE back to HL
  add hl,de

div_loop_done:
  dec a
  jr nz,div_loop
  ret

Now that we have established our baseline routine, let's add our first minor optimization. In particular, let's fix up this part:

  jr c,div_loop_readjust
  inc c
  jr div_loop_done

div_loop_readjust:
  ; remainder is not >= denominator, so we have to add DE bBCk to HL
  add hl,de

The instruction ‘cp xx`, where `xx` is a constant, is compiled to `FExx`. As long as we don’t need to preserve flags, we can skip a one-byte instruction like this:
  jr +_
  inc c
_:

Becomes ====>

  .db $FE
  inc c

This saves one byte and 4cc. So reorganizing the section of code above:

  jr c,div_loop_readjust
  inc c
  jr div_loop_done

div_loop_readjust:
  ; remainder is not >= denominator, so we have to add DE bBCk to HL
  add hl,de

Becomes ====>

  jr nc,divinc_acc      ;change the condition from `c` to `nc`
  add hl,de
  .db $FE     ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
  inc c

Applying this, we now have:

BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;min: 1270cc
;max: 1414cc
;avg: 1342cc
;22 bytes

  ld hl,0
  ld a,16

div_loop:
  sla c \ rl b
  adc hl,hl
  sbc hl,de
  jr nc,divinc_acc
  add hl,de
  .db $FE     ;this begins the instruction `cp *`, so it eats the `inc c`
div_inc_acc:
  inc c

div_loop_done:
  dec a
  jr nz,div_loop
  ret

This change saves one byte and on average 40cc, but saving up to to 112cc or costing up to 32cc depending on inputs.

There is another cheap optimization— using B instead of A as the loop counter.
It adds 2 bytes to overhead, but saves 2 bytes in the loop for no net change in size. And while it adds 8cc to overhead, it saves 112cc in the main loop, for a net savings of 104cc.

BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;min: 1166cc
;max: 1310cc
;avg: 1238cc
;22 bytes
  ld hl,0
  ld a,b
  ld b,16

div_loop:
  ;shift the bits from BC into HL
  sla c \ rla
  adc hl,hl
  sbc hl,de
  jr nc,div_inc_acc
  add hl,de
  .db $FE     ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
  inc c

div_loop_done:
  djnz div_loop
  ld b,a
  ret

Now, even the worst case is better than the original, and we've saved a byte!

But this next trick adds 5 bytes onto the previous routine. If we negate DE, then we can swap whether we are doing sbc or add when comparing HL to DE:

BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 1124cc
;max: 1332cc
;avg: 1228cc
;27 bytes
  xor a
  ld h,a  ;take advantage of A=0 to set hl to 0 saving 2cc and 1 byte.
  ld l,a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a

  ld a,b
  ld b,16

div_loop:
  ;shift the bits from BC into HL
  sla c \ rla
  adc hl,hl
  add hl,de
  jr c,divinc_acc
  sbc hl,de
  .db $FE     ;this begins the instruction `cp *`, so it eats the next byte.
div_inc_acc:
  inc c

div_loop_done:
  djnz div_loop
  ld b,a
  ret

Overhead: +5 bytes, +22cc
Loop: -32cc on average, up to +0cc, as low as -64cc
Net: +5 bytes, -10cc on average, as much as +22cc, as low as -42cc

But now we have that the comparison leaves carry set when C is incremented, and reset when it is not. If we use `rl c` instead of `sla c`, we can just shift in the next bit:

BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 1072cc
;max: 1232cc
;avg: 1152cc
;28 bytes
  xor a
  ld h,a
  ld l,a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a

  ld a,b
  ld b,16

div_loop:
  ;shift the bits from BC into HL
  rl c \ rla
  adc hl,hl
  add hl,de
  jr c,div_loop_done
  sbc hl,de

div_loop_done:
  djnz div_loop
  rl c \ rla
  ld b,a
  ret

This adds 3 bytes and 12cc to overhead, but saves 2 byte in the main loop and 88cc on average, for a net of +1 byte, -76cc.

And now, if we really want speed, we can unroll it, and take advantage of only needing to shift one byte at a time into HL:

BC_Div_DE:
;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 738cc
;max: 898cc
;avg: 818cc
;144 bytes
  xor a
  ld h,a
  ld l,a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a

  ld a,b
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla
  ld b,a

  ld a,c
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla
  ld c,a

  ret

It costs +115 bytes (making the routine just over 5 times larger), but saves 334cc in all cases (29% improvement on average).

If you really enjoy the speed improvements, but don't want to take up that much space, there is a cheap trade off by taking advantage of subroutines:

;BC/DE ==> BC, remainder in HL
;NOTE: BC/0 returns 0 as the quotient.
;min: 773cc
;max: 933cc
;avg: 853cc
;82 bytes
  xor a
  ld h,a
  ld l,a
  sub e
  ld e,a
  sbc a,a
  sub d
  ld d,a

  ld a,b
  ld b,c
  call BC_Div_DE_sub
  ld a,b
  ld b,c

BC_Div_DE_sub:
;min: 354cc
;max: 434cc
;avg: 394cc
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla \ adc hl,hl \ add hl,de \ jr c,$+4 \ sbc hl,de
  rla
  ld c,a
  ret

Basically, we traded 35cc for a savings of 61 bytes.

Integer Division, multiprecision

BCD Division, multiprecision

Fixed Point Division

Suppose that we want to divide H.L by D.E. This is basically (HL/256)/(DE/256) = HL/DE. However, we can't directly use an integer routine, as we'd need 8 more bits of precision. You would have to apply this similarly to other formats.

8.8/8.8 Fixed Point Division

So we'll start off like an integer division, but we'll go an extra 8 iterations, shifting in zeros.

BC_Div_DE_88:
;Inputs:
;     DE,BC are 8.8 Fixed Point numbers
;Outputs:
;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
;if DE is 0 : 122cc or 136cc if BC is negative
;if |BC|>=128*|DE| : 152cc or 166cc if BC is negative
;Otherwise:
;min: 1107cc
;max: 1319cc
;avg: 1201cc

; First, find out if the output is positive or negative
    ld a,b
    xor d
    push af   ;sign bit is the result sign bit

; Now make sure the inputs are positive
    xor b     ;A now has the value of B, since I XORed it with D twice (cancelling)
    jp p,BC_Div_DE_88_lbl1   ;if Positive, don't negate
    xor a
    sub c
    ld c,a
    sbc a,a
    sub b
    ld b,a
BC_Div_DE_88_lbl1:

;now make DE negative to optimize the remainder comparison
    ld a,d
    or d
    jp m,BC_Div_DE_88_lbl2
    xor a
    sub e
    ld e,a
    sbc a,a
    sub d
    ld d,a
BC_Div_DE_88_lbl2:

;if DE is 0, we can call it an overflow
;A is the current value of D
  or e
  jr z,div_fixed88_overflow

;The accumulator gets set to B if no overflow.
;We can use H=0 to save a few cc in the meantime
    ld h,0

;if B+DE>=0, then we'll have overflow
    ld a,b
    add a,e
    ld a,d
    adc a,h
    jr c,div_fixed88_overflow

;Now we can load the accumulator/remainder with B
;H is already 0
    ld l,b

    ld a,c
    call div_fixed88_sub
    ld c,a

    ld a,b      ;A is now 0
    call div_fixed88_sub

    ld d,c
    ld e,a
    pop af
    ret p
    xor a
    sub e
    ld e,a
    sbc a,a
    sub d
    ld d,a
    ret

div_fixed88_overflow:
    ld de,$7FFF
    pop af
    ret p
    inc de
    inc e
    ret

div_fixed88_sub:
;min: 456cc
;max: 536cc
;avg: 496cc
    ld b,8
BC_Div_DE_88_lbl3:
    rla
    adc hl,hl
    add hl,de
    jr c,$+4
    sbc hl,de
    djnz BC_Div_DE_88_lbl3
    adc a,a
    ret

Floating Point Division

See divSingle and xdiv from the z80float project.

Square Root

Digit Extraction

Newton's Method (with analysis)

Integer Square Root multiprecision

BCD Square Root multiprecision

Fixed Point Square Root

Floating Point Square Root

See sqrtSingle and xsqrt from the z80float project.

Borchardt-Gauss Algorithm

Carlson's Improvement

CORDIC Algorithms

BKM Algoritms

Exponential

Integer Exponential

BCD Exponential

Fixed Point Exponential

Floating Point Exponential

See the following:
expSingle
xexp
pow2Single
xpow2
pow10Single
xpow10
powSingle
xpow

Logarithm

Integer Logarithm

BCD Logarithm

Fixed Point Logarithm

NOTE: These routines are from an older edit. It's cool, but I think I've come across other viable (better) options.

The following code is based on the approximation:

ln(x+1) : x(8+x)/(8+5x) on [0,1]

For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instructions. So to put the functions into code:
lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
    dec h
       dec h
    jr nz,$+8
    inc l \ dec l
    ret nz
    ld l,177
    ret
    inc h
    ret nz
    ld b,h
    ld c,l
    ld e,l
    ld d,8
    add hl,hl
    add hl,hl
    add hl,de
    ex de,hl

    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ adc a,a

    ld h,a \ ld l,b
    sla h \ jr c,$+3 \ ld l,c
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    rl l
    ld a,h
    adc a,b
    ld l,a
    ld h,b
    cp a
    ret

But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):

lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;     returns nc if input is negative (HL not modified)
;Error:
;    The error on the outputs is as follows:
;        20592 inputs are exact
;        12075 inputs are off by 1/256
;        100 inputs are off by 2/256
;    So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

    ld a,h \ or l \ jr nz,$+5
    ld h,80h \ ret
    dec h
    dec h
    jr nz,$+9
    inc l \ dec l
    jr nz,normalizeln-1
    ld l,177
    ret
    inc h
    jr nz,normalizeln
    ld b,h
    ld c,l
    ld e,l
    ld d,8
    add hl,hl
    add hl,hl
    add hl,de
    ex de,hl
    call HL_Div_DE
    ld h,a \ ld l,b
    sla h \ jr c,$+3 \ ld l,c
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    rl l
    ld a,h
    adc a,b
    ld h,b
    ld l,a
    scf
    ret
HL_Div_DE:
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ adc a,a \ ret

    inc h
normalizeln:
    xor a
    inc h \ ret m
    ld d,a \ ld e,a
    ld a,l
    jr z,toosmall
    inc e \ srl h \ rra \ jr nz,$-4
    rla \ rl h
    dec e
stepin:
    ld l,a
    push de
    call lognat
    pop de
    ;now multiply DE by 355, then divide by 2 (rounding)
    ld b,d \ ld c,e \ ld a,d
    ex de,hl
    add hl,hl
    add hl,hl    ;4
    add hl,bc    ;5
    add hl,hl    ;10
    add hl,bc    ;11
    add hl,hl    ;22
    add hl,hl
    add hl,hl
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,bc
    sra h \ rr l
    adc hl,de
    scf
    ret
toosmall:
    dec d
    dec e \ add a,a \ jr nc,$-2
    inc h
    jp stepin

Or a table driven method for log2:
Because we are working in binary, lg(), which is log2() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)-lg(2) = lg(x)-1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)-1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead—all of the odd numbers. And now we can build our algorithm:

lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
 ld de,lgLUT
 ld b,0
 ld a,h
 or a
 ret m     ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex)
 ld a,l
 jr z,$+8
 inc b \ srl h \ rra \ jr nz,$-4   ;input >2, so we rotate, using b as the accumulator
 or a \ jr nz,$+6
 ld hl,8000h \ ret     ;If input was 0, use this as -inf
 rra \ inc b \ jr nc,$-2      ;input<1, so we rotate
;A is the element to look up in the LUT
 ld l,a
 ld c,h
 dec b
 add hl,hl
 add hl,de
 ld e,(hl)
 inc hl
 ld d,(hl)
 ex de,hl
 add hl,bc
 ret
lglut:
.dw $F800
.dw $F996
.dw $FA52
.dw $FACF
.dw $FB2C
.dw $FB76
.dw $FBB3
.dw $FBE8
.dw $FC16
.dw $FC3F
.dw $FC64
.dw $FC86
.dw $FCA5
.dw $FCC1
.dw $FCDC
.dw $FCF4
.dw $FD0B
.dw $FD21
.dw $FD36
.dw $FD49
.dw $FD5C
.dw $FD6D
.dw $FD7E
.dw $FD8E
.dw $FD9D
.dw $FDAC
.dw $FDBA
.dw $FDC8
.dw $FDD5
.dw $FDE2
.dw $FDEE
.dw $FDFA
.dw $FE06
.dw $FE11
.dw $FE1C
.dw $FE26
.dw $FE31
.dw $FE3B
.dw $FE44
.dw $FE4E
.dw $FE57
.dw $FE60
.dw $FE69
.dw $FE71
.dw $FE7A
.dw $FE82
.dw $FE8A
.dw $FE92
.dw $FE9A
.dw $FEA1
.dw $FEA9
.dw $FEB0
.dw $FEB7
.dw $FEBE
.dw $FEC5
.dw $FECB
.dw $FED2
.dw $FED8
.dw $FEDF
.dw $FEE5
.dw $FEEB
.dw $FEF1
.dw $FEF7
.dw $FEFD
.dw $FF03
.dw $FF09
.dw $FF0E
.dw $FF14
.dw $FF19
.dw $FF1E
.dw $FF24
.dw $FF29
.dw $FF2E
.dw $FF33
.dw $FF38
.dw $FF3D
.dw $FF42
.dw $FF47
.dw $FF4B
.dw $FF50
.dw $FF55
.dw $FF59
.dw $FF5E
.dw $FF62
.dw $FF67
.dw $FF6B
.dw $FF6F
.dw $FF74
.dw $FF78
.dw $FF7C
.dw $FF80
.dw $FF84
.dw $FF88
.dw $FF8C
.dw $FF90
.dw $FF94
.dw $FF98
.dw $FF9B
.dw $FF9F
.dw $FFA3
.dw $FFA7
.dw $FFAA
.dw $FFAE
.dw $FFB2
.dw $FFB5
.dw $FFB9
.dw $FFBC
.dw $FFC0
.dw $FFC3
.dw $FFC6
.dw $FFCA
.dw $FFCD
.dw $FFD0
.dw $FFD4
.dw $FFD7
.dw $FFDA
.dw $FFDD
.dw $FFE0
.dw $FFE4
.dw $FFE7
.dw $FFEA
.dw $FFED
.dw $FFF0
.dw $FFF3
.dw $FFF6
.dw $FFF9
.dw $FFFC
.dw $FFFF

Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):
ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
 call lg_88
 ;now signed multiply HL by 355, then divide by 2 (rounding)
    ld de,0
    bit 7,h
    jr z,$+9
    dec e \ xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld b,h
    ld c,l
    xor a
    add hl,hl
    add hl,hl \ rla
    add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,bc \ adc a,d
    sra a \ rr h
    ld l,h
    ld h,a
    inc e
    ret nz
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ret

Floating Point Logarithm

See the following:
lnSingle (Single Precision Natural Log)
lgSingle (Single Precision Log Base 2)
log10Single (Single Precision Log Base 10)
logSingle (Single Precision Log Base y)

xln (Extended Precision Natural Log)
xlg (Extended Precision Log Base 2)
xlog10 (Extended Precision Log Base 10)
xlog (Extended Precision Log Base y)

Cosine/Sine

Integer Cosine/Sine

BCD Cosine/Sine

Fixed Point Cosine/Sine

Floating Point Cosine/Sine

See the following:
sinSingle (Single Precision Sine)
cosSingle (Single Precision Cosine)
xsin (Extended Precision Sine)
xcos (Extended Precision Cosine)

Arctangent

Integer Arctangent

BCD Arctangent

Fixed Point Arctangent

NOTE: These routines are from an older edit. It's cool, but I think I've come across other viable (better) options.

The following code is based on the approximation:

atan(x) : x(9+2x^2)/(9+5x^2) on [-1,1]

For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instructions. So to put the function into code:
ArcTan_88:
;Inputs: DE is positive
    ld a,d
    or a
    jr z,ArcTan+2
    inc a
    jr z,ArcTan+2
    ld b,d
    ld c,e
    ld de,256
    call DE_Div_BC_88
    call ArcTan
    ld de,402     ;about pi/2
    ex de,hl
    or a
    sbc hl,de
    ret
ArcTan:
;Input: DE
;Output: HL
    push de    ;save X for later
    ld b,d
    ld c,e
    call BC_Times_DE ;x^2 computed
    ld d,h
    ld e,l
    add hl,hl
    add hl,hl
    add hl,de   ;5x^2 is now in HL
    ld a,9
    add a,h
    ld h,a      ;9+5x^2->HL
    ld a,9
    sla e
    rl d
    add a,d
    ld d,a      ;9+2x^2->DE
    ld b,h
    ld c,l
    call DE_Div_BC_88   ;->DE     
    pop bc
    jp BC_Times_DE

Table driven arctangent can be implemented, too.
Arctan can be computed on the whole real line just from the values on [0,1] by the identities: arctan(-x)=-arctan(x) and arctan(x)=pi/2-arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16-bit numbers, we can just use 8-bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:
arctan_88:
;Input:
;    D.E
;Output: atan(D.E)->D.E
    push de
    ld a,d
    or a
    jp p,$+5
    neg
    ld d,a
    dec a
    jr nz,checkneedinv
    inc e \ dec e \ jr nz,checkneedinv
    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
    inc a
    call nz,DEgt1_Inv
;0.E is the value to atan
    ld hl,adjustatan
    push hl
    ld a,e
    cp 46 \ ret c
    dec a \ cp 42h \ ret c
    dec a \ cp 4Eh \ ret c
    dec a \ cp 57h \ ret c
    dec a \ cp 5Eh \ ret c
    dec a \ cp 64h \ ret c
    dec a \ cp 6Ah \ ret c
    dec a \ cp 6Fh \ ret c
    sub 6Fh \ ld e,a
    ld hl,atanlut
    add hl,de
    ld a,(hl)
    ret
adjustatan:
    ld e,a
    pop bc
    ld a,b
    or a
    jp p,$+5
    neg
    jr z,$+9
    ld hl,402
    or a
    sbc hl,de
    ex de,hl
    rl b
    ret nc
    xor a
    sub e
    ld e,a
    sbc a,a
    sub d
    ld d,a
    ret
DEgt1_Inv:
;Works if DE>1
    ld hl,256
    ld b,8
InvLoop:
    add hl,hl
    sbc hl,de
    jr nc,$+3
    add hl,de
    adc a,a
    djnz InvLoop
    cpl
    ld e,a
    ld d,b
    ret
atanlut:
.db $6F
.db $6F
.db $70
.db $71
.db $72
.db $73
.db $73
.db $74
.db $75
.db $76
.db $77
.db $77
.db $78
.db $79
.db $7A
.db $7B
.db $7B
.db $7C
.db $7D
.db $7E
.db $7F
.db $7F
.db $80
.db $81
.db $82
.db $82
.db $83
.db $84
.db $85
.db $85
.db $86
.db $87
.db $88
.db $88
.db $89
.db $8A
.db $8B
.db $8B
.db $8C
.db $8D
.db $8E
.db $8E
.db $8F
.db $90
.db $90
.db $91
.db $92
.db $93
.db $93
.db $94
.db $95
.db $95
.db $96
.db $97
.db $97
.db $98
.db $99
.db $9A
.db $9A
.db $9B
.db $9C
.db $9C
.db $9D
.db $9E
.db $9E
.db $9F
.db $A0
.db $A0
.db $A1
.db $A2
.db $A2
.db $A3
.db $A3
.db $A4
.db $A5
.db $A5
.db $A6
.db $A7
.db $A7
.db $A8
.db $A9
.db $A9
.db $AA
.db $AA
.db $AB
.db $AC
.db $AC
.db $AD
.db $AD
.db $AE
.db $AF
.db $AF
.db $B0
.db $B0
.db $B1
.db $B2
.db $B2
.db $B3
.db $B3
.db $B4
.db $B5
.db $B5
.db $B6
.db $B6
.db $B7
.db $B7
.db $B8
.db $B9
.db $B9
.db $BA
.db $BA
.db $BB
.db $BB
.db $BC
.db $BC
.db $BD
.db $BE
.db $BE
.db $BF
.db $BF
.db $C0
.db $C0
.db $C1
.db $C1
.db $C2
.db $C2
.db $C3
.db $C3
.db $C4
.db $C4
.db $C5
.db $C6
.db $C6
.db $C7
.db $C7
.db $C8
.db $C8
.db $C9

Floating Point Arctangent

See the following:
atanSingle (Single Precision Arctangent)
xatan (Extended Precision Arctangent)

Deriving Tangent, Arcsine, Arccosine, Hyperbolic Functions

String Conversions

Integer String Conversions

BCD String Conversions

Fixed Point String Conversions

Floating Point String Conversions

See the following:
strToSingle (Single Precision String To Float)
strtox (Extended Precision String To Float)
singleToStr (Single Precision Float To String)
xtostr (Extended Precision Float To String)

Random Number Generators

rand16

;#define smc    ;uncomment if you are using SMC
rand16:
;;collaboration by Zeda with Runer112
;;160cc or 148cc if using SMC
;;26 bytes
;;cycle: 4,294,901,760 (almost 4.3 billion)
#ifdef smc
seed1=$+1
    ld hl,9999
#else
    ld hl,(seed1)
#endif
    ld b,h
    ld c,l
    add hl,hl
    add hl,hl
    inc l
    add hl,bc
    ld (seed1),hl
#ifdef smc
seed2=$+1
    ld hl,9999
#else
    ld hl,(seed2)
#endif
    add hl,hl
    sbc a,a
    and %00101101
    xor l
    ld l,a
    ld (seed2),hl
    add hl,bc
    ret

rand24

rand24:
;;219cc
#ifdef smc
seed1_0=$+1
    ld hl,12345
seed1_1=$+1
    ld a,67
#else
    ld hl,(seed1_0)
    ld a,(seed1_1)
#endif
    ld b,h
    ld c,l
    ld d,a
    add hl,hl \ rla
    add hl,hl \ rla
    inc l
    add hl,bc \ adc a,0
    ld (seed1_0),hl
    ld (seed1_1),a
    ld c,b
    ld b,a
#ifdef smc
seed2_0=$+1
    ld hl,65432
seed2_1=$+1
    ld a,10
#else
    ld hl,(seed2_0)
    ld a,(seed2_1)
#endif
    add hl,hl
    rla
    ld (seed2_1),a
    sbc a,a
    and %10000111
    xor l
    ld l,a
    ld (seed2_0),hl
    add hl,bc
    ret

rand32

rand32:
;;Tested and passes all CAcert tests
;;Uses a very simple 32-bit LCG and 32-bit LFSR
;;it has a period of 18,446,744,069,414,584,320
;;roughly 18.4 quintillion.
;;LFSR taps: 0,2,6,7  = 11000101
;;291cc
;;Thanks to Runer112 for his help on optimizing the LCG and suggesting to try the much simpler LCG. On their own, the two are terrible, but together they are great.
seed1_0=$+1
    ld hl,12345
seed1_1=$+1
    ld de,6789
    ld b,h
    ld c,l
    add hl,hl \ rl e \ rl d
    add hl,hl \ rl e \ rl d
    inc l
    add hl,bc
    ld (seed1_0),hl
    ld hl,(seed1_1)
    adc hl,de
    ld (seed1_1),hl
    ex de,hl
;;lfsr
seed2_0=$+1
    ld hl,9876
seed2_1=$+1
    ld bc,54321
    add hl,hl \ rl c \ rl b
    ld (seed2_1),bc
    sbc a,a
    and %11000101
    xor l
    ld l,a
    ld (seed2_0),hl
    ex de,hl
    add hl,bc
    ret

rand, Floating-point

For floating-point random numbers, we'll need to take a little extra care to make the distribution as uniform as possible. We'll generate a float on [0,1), but we want to make sure to select numbers from [0,2^-n) with probability 2^-n, for example. To do this, we'll first generate the exponent by initializing to 0, then decrement the exponent and generate a random bit, repeating this process until a 1 is generated, or the exponent is too small for our floats. Finally, we generate our mantissa in normalized form.

(This means the exponent is -1 with 50% probability, -2 with 25% probability, etc.)

So as an example for a single-precision random float generator

randSingle:
;BC points to where the result is output.
;Stores a pseudo-random number on [0,1)
;A value of 0 is extraordinarily unlikely.

  push bc

;The first thing we'll do is generate the exponent.
;Initialize to -1 (stored as 0x7F) and we'll generate
;random bits, decrementing the exponent until we get a 1.

  ld c,$80    ;exponent
  jr rand_gen_exp
_:
  dec c
  add hl,hl
  jr c,+_   ;we have completed calculation of the exponent
  djnz -_

;make sure the exponent isn't zero!
  inc c
  dec c
  jr z,rand_zero

rand_gen_exp:
  push bc
  call rand
  pop bc
  ld b,16
  jr -_
_:

; Now we have generated the exponent, let's generate the mantissa

  push bc
  call rand
  push hl
  call rand
  pop de
  pop bc
  ld b,h
;Now our float is stored in CBDE

; Make the float positive
  res 7,b

  .db $FE   ;start of `cp *` to skip `ld b,c`
rand_zero:
  ld b,c

  pop hl
  ld (hl),e
  inc hl
  ld (hl),d
  inc hl
  ld (hl),b
  inc hl
  ld (hl),c
  ret

Or an extended-precision routine:

xrand:
;Stores a pseudo-random number on [0,1)
;A value of 0 should be practically impossible in this universe.
  push bc

;The first thing we'll do is generate the exponent.
;Initialize to 0 (stored as 0x4000) and we'll generate
;random bits, decrementing the exponent until we get a 1.

  ld bc,$4000    ;exponent
  jr rand_gen_exp
_:
  dec bc
  add hl,hl
  jr c,+_   ;we have completed calculation of the exponent
  dec a
  jr nz,-_

;make sure the exponent isn't zero!
  ld a,b
  or c
  jr z,xrand_zero

rand_gen_exp:
  push bc
  call rand
  pop bc
  ld a,16
  jr -_
_:

; Now we have generated the exponent, let's generate the mantissa
  pop ix    ;pointer to the output
  ld (ix+8),c
  ld (ix+9),b

  call rand
  ld (ix),h
  ld (ix+1),l

  call rand
  ld (ix+2),h
  ld (ix+3),l

  call rand
  ld (ix+4),l   ;it's "random" anyways, just fun to change it up :P
  ld (ix+5),h

  call rand
  ld (ix+6),h
  ld (ix+7),l

  ret

xrand_zero:
  pop hl

set_xzero:
;HL points to where to write 0 as an extended precision float.
  xor a
  ld b,10
_:
  ld (hl),a
  inc hl
  djnz -_
  ret

High-Precision On-the-Fly Computation of Various Constants

pi

e

ln(2)

ln(10)

.5*ln(1+s*21-k+(|s|-|t|)4-k+i(t*21-k+ts*21-2k))

arctangent(t/(2k+s)

Unless otherwise stated, the content of this page is licensed under GNU Free Documentation License.