Advanced Math

Integer Math

Addition/Subtraction

To do addition or subtraction on an integer of arbitrary size, we use ADC.

Here we add two 32-bit integers:

    LD     HL, (word1)        ; Get least-significant word of word1
    LD     DE, (word2)        ; Get least-significant word of word2
    ADD    HL, DE              ; Add them
    LD     (result), HL        ; Store least-significant result

    LD     HL, (dword1 + 2)    ; Get most-significant word of word1
    LD     DE, (dword2 + 2)    ; Get most-significant word of word2
    ADC    HL, DE              ; Add them with the carry from the previous addition
    LD     (result + 2), HL    ; Store most-significant result
    RET

word1:    .DB    $B3, $90, $12, $32    ; Each word is stored with the least-significant
word2:    .DB    $F1, $15, $06, $B8    ; bytes first. You could just as easily have stored
result:    .DB    $00, $00, $00, $00    ; them in big-endian, but because of how registers are
                                     ; loaded from RAM, it wouldn't work.

But now, what if you wanted to add truly arbitrary sized integers? We might do something like this:
     ld hl,Num1
     ld de,Num2
     ld bc,Size
AddLoop:
;Inputs:
;     HL points to Num1 (stored little-endian)
;     DE points to Num2 (stored little-endian)
;     BC is the size of the integers
;Outputs:
;     BC is 0
;     HL points to the byte after Num1
;     DE points tot he byte after Num2
;     c flag is the carry
;     z flag is set
;     p/v flag reset
;     Num1 is overwritten with the result
       ld a,(de)
       adc a,(hl)
       ld (hl),a
       inc de
       cpi         ;increment HL, decrement BC, set parity flag of BC becomes 0
       jp pe,AddLoop
     ret
Num1:
     .db $B3,$90,$12,$32,$47
Num2:
     .db $F1,$15,$06,$B8,0

Similarly, one can perform arbitrary subtractions.

Multiplication

For multiplication, we could just add them in a DJNZ loop. But that is slow for multiplying big numbers. There is a much more efficient algorithm that you probably learned as a child in school that involves multiplying multi-digit numbers by hand. Similarly, we will do that here by using the binary representation of numbers and the astute reader will realise that binary digits are only 1 or 0 and multiplying by 1 or 0 is rather trivial!

To understand this algorithm, it is suggested that the reader try this by hand with some simple 4-bit numbers, then follow the algorithm or try to develop it in pseudocode.

Here is the first example:

.module    DE_Times_A

;Note that we choose DE and A because of the readily available instructions that make this even more efficient.

DE_Times_A:           ; HL = DE × A
    LD     HL, 0      ; Use HL to store the product
    LD     B, 8       ; Eight bits to check
_loop:
    RRCA              ; Check least-significant bit of accumulator
    JR     NC, _skip  ; If zero, skip addition
    ADD    HL, DE
_skip:
    SLA    E          ; Shift DE one bit left
    RL     D

;In binary, a shift left is multiplying by 2, just like a shift left in decimal is like multiplying by 10.
;We essentially added a 0 to the end of the number

    DJNZ   _loop
    RET

Or alternatively for a slightly faster and smaller code:
.module    DE_Times_A
DE_Times_A:      ; HL = DE × A
    ld hl,0      ; Use HL to store the product
    ld b,8       ; Eight bits to check
_loop:
    add hl,hl
    rlca         ; Check most-significant bit of accumulator
    jr nc,_skip  ; If zero, skip addition
    add hl,de
_skip:
    djnz _loop
    ret

And you might realise that multiplying an 8-bit number by a 16-bit number can yield up to a 24-bit number, so to take care of that overflow by turning rlca into adc a,a which is the same size and speed. Note that adc a,a can be replaced with rla for the same effect.
.module    DE_Times_A
DE_Times_A:      ; AHL = DE × A
    ld hl,0      ; Use HL to store the product
    ld b,8       ; Eight bits to check
_loop:
    add hl,hl
    adc a,a         ; Check most-significant bit of accumulator
    jr nc,_skip  ; If zero, skip addition
    add hl,de
_skip:
    djnz _loop
    ret

Division

Division is also just like base-10 division as well, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (1) or not (0) ?"

Again, it is recommended that the reader try this with small examples to understand and possibly develop their own code.

.module    Div_HL_D
Div_HL_D:            ; HL = HL ÷ D, A = remainder
    XOR    A         ; Clear upper eight bits of AHL
    LD     B, 16     ; Sixteen bits in dividend
_loop:
    ADD    HL, HL    ; Do a SLA HL. If the upper bit was 1, the c flag is set
    RLA              ; This moves the upper bits of the dividend into A
    JR     C, _overflow
    CP     D         ; Check if we can subtract the divisor
    JR     C, _skip  ; Carry means D > A
_overflow:
    SUB    D         ; Do subtraction for real this time
    INC    L         ; Set the next bit of the quotient (currently bit 0)
_skip:
    DJNZ   _loop
    RET

Fixed Point Math

Fixed Point refers to a number format that includes a fixed size integer part, and a fixed size fractional part. For example, 8.8 fixed point refers to a format where the upper 8 bits form the integer part, the lower 8 bits form the fractional part. Fixed point math often provides a desired balance between the numerical precision of floating point and the computing speed of integer maths.

To elaborate a little more, to represent 3.0 as a fixed point number stored in HL, we might have HL=0300, where the decimal point is between H and L (03.00). Then HL=0341 would be equivalent to 3+65/256=3.25390625. If we were to square this, we get HL=0A96 where we have lost some bits of precision.

Addition and Subtraction

As with integer arithmetic, addition and subtraction are trivial on the Z80. To add H.L+D.E, just use add hl,de, and for subtraction, 'or a \ sbc hl,de' will sufice.

Multiplication and Division

If we view HL as a 16 bit integer, then H.L is essentially HL/256. For multiplication, the math for H.L*D.E is then:
(HL/256)(DE/256) = (HL*DE)/65536.
This means we can use a 16-bit multiplication routine that returns 32 bits and we now have a 16.16 fixed point result. To keep output as an 8.8 fixed point number, we need to grab the middle bytes around the decimal. If output was DE.HL, then E.H is a our 8.8 result. Overflow and underflow should be handled by the programmer depending on the task.
note: Really, only a routine to return the lower 24-bits of the 16x16 multiplication is needed.

Allowing speed to take priority, here is a very fast routine for both 16-bit integer multiplication (returned in BHLA) or 8.8 fixed point (returned in HL):

BC_Times_DE:
;  BHLA is the 32-bit result
    ld a,b
    or a
    ld hl,0
    ld b,h
;1
    add a,a
    jr nc,$+4
    ld h,d
    ld l,e
;2
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b
;227+10b-7p
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

;===
;AHL is the result of B*DE*256
    push hl
    ld h,b
    ld l,b
    ld b,a
    ld a,c
    ld c,h
;1
    add a,a
    jr nc,$+4
    ld h,d
    ld l,e
;2
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c
;227+10b-7p
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    pop de
;Now BDE*256+AHL
    ld c,a
    ld a,l
    ld l,h
    ld h,c
    add hl,de
    ret nc
    inc b
;BHLA is the 32-bit result
    ret

For division, we should notice that H.L/D.E evaluates as:
(HL/256)/(DE/256)=HL/DE
We could use a standard integer division routine, however, unlike with multiplication, the result can only be used a 16.0 integer. We need 8 more bits of accuracy for 16.8 output (of which we only use the lower 8-bits of the integer part). Since we are going to discard the upper 8 bits pf the acquired result, we can optimise the first 8 iterations to not store the bits anywhere.

DE_Div_BC_88:
;Inputs:
;     DE,BC are 8.8 Fixed Point numbers
;Outputs:
;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
     ld a,8
     ld hl,0
Loop1:
     rl d
     adc hl,hl
     sbc hl,bc
     jr nc,$+3
    add hl,bc
     dec a
     jr nz,Loop1

     ld d,e
     ld e,a
     ld a,16
    jp $+6
DivLoop:
     add hl,bc
     dec a
     ret z

     sla e
     rl d
     adc hl,hl
    jr c,overflow
     sbc hl,bc
     jr c,DivLoop
     inc e
     jp DivLoop+1
overflow:
    or a
    sbc hl,bc
    inc e
    jp DivLoop

Inverse (1/x)

If we want to invert a number >1, we know that its result will be <1, so for 8.8 fixed point, this means there will only be 8 bits in the result. We can optimise out most of the cycles of the division, making it much faster. In the DE_Div_BC_88 routine, we initialise HL=0 and rotate a bit of DE into HL at each iteration. If DE=100h=1.0, then no bits will get rotated into HL until the 8th iteration. Then after that, HL=1, but we know BC>256, so we are guaranteed to have nothing happen for another 8 iterations until HL=256. At this point, we have only 8 more iterations to compute, and DE is 0, so we no longer need to even rotate DE into HL.
So now at this point, we can just initialise HL=256, then rotate in a 0 each iteration. This can be done with add hl,hl, then we compare to BC to see if the remainder (here, HL). However, since we now only need two register pairs, lets use B for the iteration counter, and DE is the denominator. If we compare HL to DE, we can use sbc hl,de. If the C flag is set, it means HL<DE, so we need to rotate a 0 into our accumulator and then add DE to HL again. We could use sbc hl,de \ jr nc,$+3 \ add hl,de \ ccf \ rla, but if we note that we have a cpl instruction to invert the bits of a:

DEgt1_Inv:
;1/D.E->B.C
; 470 to 518 t-states
    ld hl,256
    ld b,8
InvLoop:
    add hl,hl
    sbc hl,de
    jr nc,$+3
    add hl,de
    rla
    djnz InvLoop
    cpl
    ld c,a
    ret

And to unroll, saving 102 t-states:
DEgt1_Inv:
;1/D.E->B.C
; 343 to 397 t-states
    ld hl,512
    ld b,l
InvLoop:
    or a       \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ adc a,a
    cpl
    ld c,a
    ret

Note that this actually works as an HL_Div_DE_88 routine, by removing the ld hl,256 and making sure HL<DE.

Square Root

Using the integer square root of HL, sqrt(H.L) would be returned in 4.4 fixed point format. The upper 4 bits of the integer part will be 0 regardless, so we can essentially say that we have 8.4 Fixed Point, so we need to modify the algorithm to iterate 4 more times to get 8.8 fixed point output. The following is adapted from a speed-optimised routine used in a floating point library (which required 16 bits instead of 12):

SqrtHL_prec12:
;input: HL
;Output: HL
;183 bytes
    xor a
    ld b,a

    ld e,l
    ld l,h
    ld h,a

    add hl,hl
    add hl,hl
    cp h
    jr nc,$+5
    dec h
    ld a,4

    add hl,hl
    add hl,hl
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    ld l,e

    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add a,a \ ld c,a
    add hl,hl
    add hl,hl
    jr nc,$+6
    sub h \ jp $+6
    sub h
    jr nc,$+6
    inc c \ inc c
    cpl
    ld h,a

    ld a,l
    ld l,h
    add a,a
    ld h,a
    adc hl,hl
    adc hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

;iteration 9
    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
;12th iteration completed
; output in BC
    srl b \ rr c
    ld h,b
    ld l,c
    ret

Exponentiation and Logarithms

Exponentiation

Here, exponentiation refers to 2x.
Observe that 239/16 = 22+7/16 = 22/1+0/2+1/4+1/8+1/16=2220/221/421/821/16 = 22sqrt(sqrt(2))sqrt(sqrt(sqrt(2)))sqrt(sqrt(sqrt(sqrt(2)))) = 22sqrt(sqrt(2sqrt(2sqrt(2))))

The observant reader will see that taking away the trivial integer part of the power, the fractional part can be represented as a sequence of square roots with 2s wherever there are 1s in the binary expansion. So to compute 2x where x is on [0,1), in pseudo code we have:

If X == 0
  Return 1
Else
  While (carry is not set)
    X>>    ;shift X right
  End
  y=sqrt(2)
  While X>0
    X>>
    If (carry is set)
        y=y*2
    y=sqrt(y)
  End

Converting that to assembly:
;Inputs:
;     HL is the 8.8 fixed point number 'x' for 2^x
;Outputs:
;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.
     ld a,l
     or a
     push hl     ;save H for later, H is the integer part of the power
     ld hl,1
     jr z,integerpow
     scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter.
;wait until we come across the lowest bit. Also note that we 
     rra
     jr nc,$-1
     ld hl,2*256
powloop:
     push af
     call FPSqrtHL    ;returns in HL
     pop af
     srl a
     jr z,integerpow
     jr nc,powloop
     add hl,hl
     jp powloop
integerpow:
     pop bc
;Now b is the integer part for 2^x that we need to multiply HL by.
     ld de,0
     ld a,b
     or a
     ret z

     add hl,hl
     rl e \ rl d \ jr c,wayoverflow
     djnz $-7
     ret
wayoverflow:
     ld hl,-1
     ld d,h
     ld e,l
     ret

Note that in the spirit of optimisation, that we can significantly speed up the fixed point sqrtHL routine since HL is never greater than 4.0.

Logarithms

Here, logarithm refers to log2 also written as lg().
Using logarithms will allow us to extend the exponential routine to other bases, such as 3^n without much difficulty. Here is a fast, fixed point log_2 routine, optimised for size:

Log_2_88_size:
;Inputs:
;     HL is an unsigned 8.8 fixed point number.
;Outputs:
;     HL is the signed 8.8 fixed point value of log base 2 of the input.
;Example:
;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...)
;averages about 39 t-states slower than original
;62 bytes
     ex de,hl
     ld hl,0
     ld a,d
     ld c,8
     or a
     jr z,DE_lessthan_1
     srl d
     jr z,logloop-1
     inc l
     rr e
     jr $-7
DE_lessthan_1:
     ld a,e
     dec hl
     or a
     ret z
     inc l
     dec l
     add a,a
     jr nc,$-2
     ld e,a

     inc d
logloop:
     add hl,hl
     push hl
     ld h,d
     ld l,e
     ld a,e
     ld b,8

     add hl,hl
     rla
     jr nc,$+5
       add hl,de
       adc a,0
     djnz $-7

     ld e,h
     ld d,a
     pop hl
     rr a         ;this is NOT supposed to be rra, we need the z flag affected
     jr z,$+7
       srl d
       rr e
       inc l
     dec c
     jr nz,logloop
     ret

Arctangent and Natural Logarithm

Although these are not the fastest methods, the approach to the problem of computing these functions efficiently is interesting. As well, they can work faster than CORDIC or AGM methods.
Given efficient routines for computing multiplication and division, the following are accurate to better than 8 bits of accuracy:

atan(x) : x(9+2x^2)/(9+5x^2) on [-1,1]
ln(x+1) : x(8+x)/(8+5x) on [0,1]

For a more complete analysis of how these formulas were derived, see this document.
What is convenient to note is that the constant multiplications can be reduced to a few shift and add instuctions. So to put the functions into code:
ArcTan_88:
;Inputs: DE is positive
    ld a,d
    or a
    jr z,ArcTan+2
    inc a
    jr z,ArcTan+2
    ld b,d
    ld c,e
    ld de,256
    call DE_Div_BC_88
    call ArcTan
    ld de,402     ;about pi/2
    ex de,hl
    or a
    sbc hl,de
    ret
ArcTan:
;Input: DE
;Output: HL
    push de    ;save X for later
    ld b,d
    ld c,e
    call BC_Times_DE ;x^2 computed
    ld d,h
    ld e,l
    add hl,hl
    add hl,hl
    add hl,de   ;5x^2 is now in HL
    ld a,9
    add a,h
    ld h,a      ;9+5x^2->HL
    ld a,9
    sla e
    rl d
    add a,d
    ld d,a      ;9+2x^2->DE
    ld b,h
    ld c,l
    call DE_Div_BC_88   ;->DE     
    pop bc
    jp BC_Times_DE

For the following ln() routine, the approximating function is employed taking advantage of lots of optimisations, putting it between the computing speed of multiplication and division:
lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
    dec h
       dec h
    jr nz,$+8
    inc l \ dec l
    ret nz
    ld l,177
    ret
    inc h
    ret nz
    ld b,h
    ld c,l
    ld e,l
    ld d,8
    add hl,hl
    add hl,hl
    add hl,de
    ex de,hl

    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ adc a,a

    ld h,a \ ld l,b
    sla h \ jr c,$+3 \ ld l,c
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    rl l
    ld a,h
    adc a,b
    ld l,a
    ld h,b
    cp a
    ret

But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):
lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;     returns nc if input is negative (HL not modified)
;Error:
;    The error on the outputs is as follows:
;        20592 inputs are exact
;        12075 inputs are off by 1/256
;        100 inputs are off by 2/256
;    So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

    ld a,h \ or l \ jr nz,$+5
    ld h,80h \ ret
    dec h
    dec h
    jr nz,$+9
    inc l \ dec l
    jr nz,normalizeln-1
    ld l,177
    ret
    inc h
    jr nz,normalizeln
    ld b,h
    ld c,l
    ld e,l
    ld d,8
    add hl,hl
    add hl,hl
    add hl,de
    ex de,hl
    call HL_Div_DE
    ld h,a \ ld l,b
    sla h \ jr c,$+3 \ ld l,c
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    add hl,hl \ jr c,$+3 \ add hl,bc
    rl l
    ld a,h
    adc a,b
    ld h,b
    ld l,a
    scf
    ret
HL_Div_DE:
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
    add hl,hl \ sbc hl,de \ adc a,a \ ret

    inc h
normalizeln:
    xor a
    inc h \ ret m
    ld d,a \ ld e,a
    ld a,l
    jr z,toosmall
    inc e \ srl h \ rra \ jr nz,$-4
    rla \ rl h
    dec e
stepin:
    ld l,a
    push de
    call lognat
    pop de
    ;now multiply DE by 355, then divide by 2 (rounding)
    ld b,d \ ld c,e \ ld a,d
    ex de,hl
    add hl,hl
    add hl,hl    ;4
    add hl,bc    ;5
    add hl,hl    ;10
    add hl,bc    ;11
    add hl,hl    ;22
    add hl,hl
    add hl,hl
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,bc
    sra h \ rr l
    adc hl,de
    scf
    ret
toosmall:
    dec d
    dec e \ add a,a \ jr nc,$-2
    inc h
    jp stepin

Now it averages about 1250 t-states, still faster than some implementations of division.

If you pass HL=768, in fixed point that is 3.0, so output is 282 which is 1.1015625 in fixed point.

Sine and Cosine

Unlike atan() and ln(), these have quickly converging Taylor series. From these, we can derive:

sin(x) : x-85x^3/512+x^5/128 is within 8 bits of accuracy on [-pi/2,pi/2]
cos(x) : 1-x^2/2+5x^4/128 is within 8 bits of accuracy on [-pi/2,pi/2] for almost all values

For sine, noting that 85=17*5 will help with fast constant multiplication:
sine_88:
;Inputs: de
    push de
    sra d \ rr e
    ld b,d \ ld c,e
    call BC_Times_DE
    push hl     ;x^2/4
    sra h \ rr l
    ex de,hl
    ld b,d \ ld c,e
    call BC_Times_DE
    sra h \ rr l
    inc h
    ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4
    xor a
    ld d,a
    ld b,h
    ld c,l
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,bc \ adc a,d
    ld b,h
    ld c,l
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,bc \ adc a,d
    ld e,l
    ld l,h
    ld h,a
    rl e
    adc hl,hl
    rl e
    jr nc,$+3
    inc hl

    pop de
    ex hl,de
    or a
    sbc hl,de
    ex de,hl
    pop bc
    jp BC_Times_DE

Table-Based Fixed 8.8

When speed is priority, using table-based routines are often the fastest way to achieve a task.

Logarithms

Because we are working in binary, lg(), which is log2() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)-lg(2) = lg(x)-1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)-1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead—all of the odd numbers. And now we can build our algorithm:

lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
 ld de,lgLUT
 ld b,0
 ld a,h
 or a
 ret m     ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex)
 ld a,l
 jr z,$+8
 inc b \ srl h \ rra \ jr nz,$-4   ;input >2, so we rotate, using b as the accumulator
 or a \ jr nz,$+6
 ld hl,8000h \ ret     ;If input was 0, use this as -inf
 rra \ inc b \ jr nc,$-2      ;input<1, so we rotate
;A is the element to look up in the LUT
 ld l,a
 ld c,h
 dec b
 add hl,hl
 add hl,de
 ld e,(hl)
 inc hl
 ld d,(hl)
 ex de,hl
 add hl,bc
 ret
lglut:
.dw $F800
.dw $F996
.dw $FA52
.dw $FACF
.dw $FB2C
.dw $FB76
.dw $FBB3
.dw $FBE8
.dw $FC16
.dw $FC3F
.dw $FC64
.dw $FC86
.dw $FCA5
.dw $FCC1
.dw $FCDC
.dw $FCF4
.dw $FD0B
.dw $FD21
.dw $FD36
.dw $FD49
.dw $FD5C
.dw $FD6D
.dw $FD7E
.dw $FD8E
.dw $FD9D
.dw $FDAC
.dw $FDBA
.dw $FDC8
.dw $FDD5
.dw $FDE2
.dw $FDEE
.dw $FDFA
.dw $FE06
.dw $FE11
.dw $FE1C
.dw $FE26
.dw $FE31
.dw $FE3B
.dw $FE44
.dw $FE4E
.dw $FE57
.dw $FE60
.dw $FE69
.dw $FE71
.dw $FE7A
.dw $FE82
.dw $FE8A
.dw $FE92
.dw $FE9A
.dw $FEA1
.dw $FEA9
.dw $FEB0
.dw $FEB7
.dw $FEBE
.dw $FEC5
.dw $FECB
.dw $FED2
.dw $FED8
.dw $FEDF
.dw $FEE5
.dw $FEEB
.dw $FEF1
.dw $FEF7
.dw $FEFD
.dw $FF03
.dw $FF09
.dw $FF0E
.dw $FF14
.dw $FF19
.dw $FF1E
.dw $FF24
.dw $FF29
.dw $FF2E
.dw $FF33
.dw $FF38
.dw $FF3D
.dw $FF42
.dw $FF47
.dw $FF4B
.dw $FF50
.dw $FF55
.dw $FF59
.dw $FF5E
.dw $FF62
.dw $FF67
.dw $FF6B
.dw $FF6F
.dw $FF74
.dw $FF78
.dw $FF7C
.dw $FF80
.dw $FF84
.dw $FF88
.dw $FF8C
.dw $FF90
.dw $FF94
.dw $FF98
.dw $FF9B
.dw $FF9F
.dw $FFA3
.dw $FFA7
.dw $FFAA
.dw $FFAE
.dw $FFB2
.dw $FFB5
.dw $FFB9
.dw $FFBC
.dw $FFC0
.dw $FFC3
.dw $FFC6
.dw $FFCA
.dw $FFCD
.dw $FFD0
.dw $FFD4
.dw $FFD7
.dw $FFDA
.dw $FFDD
.dw $FFE0
.dw $FFE4
.dw $FFE7
.dw $FFEA
.dw $FFED
.dw $FFF0
.dw $FFF3
.dw $FFF6
.dw $FFF9
.dw $FFFC
.dw $FFFF

Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):
ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
 call lg_88
 ;now signed multiply HL by 355, then divide by 2 (rounding)
    ld de,0
    bit 7,h
    jr z,$+9
    dec e \ xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld b,h
    ld c,l
    xor a
    add hl,hl
    add hl,hl \ rla
    add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,bc \ adc a,d
    sra a \ rr h
    ld l,h
    ld h,a
    inc e
    ret nz
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ret

Arctan

Arctan also can be computed on the whole real line just from the values on [0,1] by the identities: arctan(-x)=-arctan(x) and arctan(x)=pi/2-arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16-bit numbers, we can just use 8-bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:

arctan_88:
;Input:
;    D.E
;Output: atan(D.E)->D.E
    push de
    ld a,d
    or a
    jp p,$+5
    neg
    ld d,a
    dec a
    jr nz,checkneedinv
    inc e \ dec e \ jr nz,checkneedinv
    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
    inc a
    call nz,DEgt1_Inv
;0.E is the value to atan
    ld hl,adjustatan
    push hl
    ld a,e
    cp 46 \ ret c
    dec a \ cp 42h \ ret c
    dec a \ cp 4Eh \ ret c
    dec a \ cp 57h \ ret c
    dec a \ cp 5Eh \ ret c
    dec a \ cp 64h \ ret c
    dec a \ cp 6Ah \ ret c
    dec a \ cp 6Fh \ ret c
    sub 6Fh \ ld e,a
    ld hl,atanlut
    add hl,de
    ld a,(hl)
    ret
adjustatan:
    ld e,a
    pop bc
    ld a,b
    or a
    jp p,$+5
    neg
    jr z,$+9
    ld hl,402
    or a
    sbc hl,de
    ex de,hl
    rl b
    ret nc
    xor a
    sub e
    ld e,a
    sbc a,a
    sub d
    ld d,a
    ret
DEgt1_Inv:
;Works if DE>1
    ld hl,256
    ld b,8
InvLoop:
    add hl,hl
    sbc hl,de
    jr nc,$+3
    add hl,de
    adc a,a
    djnz InvLoop
    cpl
    ld e,a
    ld d,b
    ret
atanlut:
.db $6F
.db $6F
.db $70
.db $71
.db $72
.db $73
.db $73
.db $74
.db $75
.db $76
.db $77
.db $77
.db $78
.db $79
.db $7A
.db $7B
.db $7B
.db $7C
.db $7D
.db $7E
.db $7F
.db $7F
.db $80
.db $81
.db $82
.db $82
.db $83
.db $84
.db $85
.db $85
.db $86
.db $87
.db $88
.db $88
.db $89
.db $8A
.db $8B
.db $8B
.db $8C
.db $8D
.db $8E
.db $8E
.db $8F
.db $90
.db $90
.db $91
.db $92
.db $93
.db $93
.db $94
.db $95
.db $95
.db $96
.db $97
.db $97
.db $98
.db $99
.db $9A
.db $9A
.db $9B
.db $9C
.db $9C
.db $9D
.db $9E
.db $9E
.db $9F
.db $A0
.db $A0
.db $A1
.db $A2
.db $A2
.db $A3
.db $A3
.db $A4
.db $A5
.db $A5
.db $A6
.db $A7
.db $A7
.db $A8
.db $A9
.db $A9
.db $AA
.db $AA
.db $AB
.db $AC
.db $AC
.db $AD
.db $AD
.db $AE
.db $AF
.db $AF
.db $B0
.db $B0
.db $B1
.db $B2
.db $B2
.db $B3
.db $B3
.db $B4
.db $B5
.db $B5
.db $B6
.db $B6
.db $B7
.db $B7
.db $B8
.db $B9
.db $B9
.db $BA
.db $BA
.db $BB
.db $BB
.db $BC
.db $BC
.db $BD
.db $BE
.db $BE
.db $BF
.db $BF
.db $C0
.db $C0
.db $C1
.db $C1
.db $C2
.db $C2
.db $C3
.db $C3
.db $C4
.db $C4
.db $C5
.db $C6
.db $C6
.db $C7
.db $C7
.db $C8
.db $C8
.db $C9

Random Numbers

24-bit Pseudo-Random Number Generator

This routine takes about 1592 t-states and has a period length of 4292870399. It uses two LCGs of relatively prime period length to generate a chaotic sequence of integers with a very large period length (so it takes a very long time to repeat the cycle). It also can be seeded, which can be important:

Rand24: 
;Inputs: seed1,seed2 
;Outputs: 
;   AHL is the pseudo-random number 
;   seed1,seed2 incremented accordingly 
;Destroys: BC,DE 
;Notes: 
; seed1*243+83 mod 65519 -> seed1 
; seed2*251+43 mod 65521 -> seed2 
; Output seed1*seed2 mod 16777259
   ld hl,(seed1) 
   xor a 
   ld b,h \ ld c,l 
   ld de,83 
   add hl,hl \ rla      ;2 
   add hl,bc \ adc a,d   ;3 
   add hl,hl \ rla      ;6 
   add hl,bc \ adc a,d   ;7 
   add hl,hl \ rla      ;14 
   add hl,bc \ adc a,d   ;15 
   add hl,hl \ rla      ;30 
   add hl,hl \ rla      ;60 
   add hl,hl \ rla      ;120 
   add hl,bc \ adc a,d   ;121 
   add hl,hl \ rla      ;242 
   add hl,bc \ adc a,d   ;243 
   add hl,de \ adc a,d   ;243x+83 
;now AHL mod 65519 
; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL 
   ex de,hl 
   ld l,a 
   ld b,h 
   ld c,l 
   add hl,hl 
   add hl,hl 
   add hl,hl 
   add hl,hl 
   add hl,bc 
   add hl,de 
   ld de,65519 
   jr nc,$+5 
   or a \ sbc hl,de 
   or a \ sbc hl,de 
   jr nc,$+3 
   add hl,de 
   ld (seed1),hl 
;seed1 done, now we need to do seed2 
   ld hl,(seed2) 
; seed1*243+83 mod 65519 -> seed1 
; seed2*251+43 mod 65521 -> seed2 
; Output seed1*seed2 
   xor a 
   ld b,h \ ld c,l 
   ld de,43 
   add hl,hl \ rla      ;2 
   add hl,bc \ adc a,d   ;3 
   add hl,hl \ rla      ;6 
   add hl,bc \ adc a,d   ;7 
   add hl,hl \ rla      ;14 
   add hl,bc \ adc a,d   ;15 
   add hl,hl \ rla      ;30 
   add hl,bc \ adc a,d   ;31 
   add hl,hl \ rla      ;62 
   add hl,hl \ rla      ;124 
   add hl,bc \ adc a,d   ;125 
   add hl,hl \ rla      ;250 
   add hl,bc \ adc a,d   ;251 
   add hl,de \ adc a,d   ;251x+83 
;now AHL mod 65521 
; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL 
   ex de,hl 
   ld l,a 
   ld b,h 
   ld c,l 
   add hl,hl 
   add hl,hl 
   add hl,hl 
   add hl,hl 
   sbc hl,bc 
   add hl,de 
   ld de,65521 
   jr nc,$+5 
   or a \ sbc hl,de 
   or a \ sbc hl,de 
   jr nc,$+3 
   add hl,de 
   ld (seed2),hl 
;now seed1 and seed 2 are computed 
   ld bc,(seed1) 
   ex de,hl 
;now do BC_times_DE 
   call BC_Times_DE 
      ex de,hl 
   ld l,b 
   ld h,0 
   ld b,h 
   ld c,l 
   add hl,hl 
   add hl,hl 
   add hl,bc 
   add hl,hl 
   add hl,hl 
   add hl,bc 
   add hl,hl 
   add hl,bc 
   ld c,d 
   ld d,e 
   ld e,a 
   ld a,c 
   sbc hl,bc 
   sbc a,b 
   ret nc 
   ld c,43 
   add hl,bc 
   ret 
BC_Times_DE: 
;BHLA is the result 
   ld a,b 
   or a 
   ld hl,0 
   ld b,h 
;1 
   add a,a 
   jr nc,$+4 
   ld h,d 
   ld l,e 
;2 
   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 
;227+10b-7p 
   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,b 

;=== 
;AHL is the result of B*DE*256 
   push hl 
   ld h,b 
   ld l,b 
   ld b,a 
   ld a,c 
   ld c,h 
;1 
   add a,a 
   jr nc,$+4 
   ld h,d 
   ld l,e 
;2 
   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 
;227+10b-7p 
   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 

   add hl,hl 
   rla 
   jr nc,$+4 
   add hl,de 
   adc a,c 

   pop de 
;Now BDE*256+AHL 
   ld c,a 
   ld a,l 
   ld l,h 
   ld h,c 
   add hl,de 
   ret nc 
   inc b 
;BHLA is the 32-bit result 
   ret 
seed1: 
   .dw 0 
seed2: 
   .dw 0

Conclusion

For more info, see the TI-83 Asm in 28 days, week 3, day 1.

When multiplying by a constant, there are often more efficient ways than using these subprograms. See here for more information.

For ways to perform more accurate math (Floating Point), there are many prebuilt functions that operate on the OP operators. They can be found here. We also have Floating Point routines that cover other formats that may be of more use in other situations.

Sources:
http://eeems.omnimaga.org/files/Resources/Tutorials/ASMin28Days/lesson/day15.html
http://wikiti.brandonw.net/index.php?title=Category:83Plus:BCALLs:By_Name:FP_Math

Unless otherwise stated, the content of this page is licensed under GNU Free Documentation License.