How to add two floating point numbers with opposite sign?


For fun and to figure out more about how floats work, I'm trying to make a function that takes two single precision floats, and adds them together.

What I've made so far works perfectly for numbers with the same sign, but it falls apart when the signs differ. I've looked over a number of questions and sites (UAF, How do you add 8-bit floating point with different signs, ICL, Adding 32 bit floating point numbers., How to add and subtract 16 bit floating point half precision numbers?, How to subtract IEEE 754 numbers?), but the ones that bring up subtraction mostly describe it as "basically the same, but subtract instead", which I have not found very helpful. UAF does say:

Negative mantissas are handled by first converting to 2's complement and then performing the addition. After the addition is performed, the result is converted back to sign-magnitude form.

But I don't seem to know how to do that. I found this and this, which explained what sign-magnitude is and how to convert between it and two's complement, so I tried converting like this:

manz = manx + ( ( (many | 0x01000000) ^ 0x007FFFFF) + 1);

and like this:

manz = manx + ( ( (many | 0x01000000) ^ 0x007FFFFF) + 1);
manz = ( ((manz - 1) ^ 0x007FFFFF) & 0xFEFFFFFF);

But neither of those worked.
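
For comparison, the round trip that UAF describes can be sketched as below. This is only an illustration under a few assumptions: the mantissas already include the implied leading 1 (bit 23), they have already been aligned to a common exponent, and the helper names `to_twos` and `add_sign_magnitude` are made up for this sketch. The point it tries to show is that the whole signed value gets negated, not just the low 23 bits.

```c
#include <stdint.h>

/* Hypothetical helper: turn a sign flag plus a 24-bit magnitude
   (implied bit included) into an ordinary signed integer. */
static int32_t to_twos(uint32_t sign, uint32_t man) {
    return sign ? -(int32_t)man : (int32_t)man;
}

/* Adds two aligned sign-magnitude mantissas. The sign of the result
   is stored in *signz (0 or 1); the returned magnitude is not yet
   normalized. */
static uint32_t add_sign_magnitude(uint32_t signx, uint32_t manx,
                                   uint32_t signy, uint32_t many,
                                   uint32_t *signz) {
    /* Ordinary signed addition does the two's complement work. */
    int32_t sum = to_twos(signx, manx) + to_twos(signy, many);

    /* Convert back to sign-magnitude form. */
    *signz = (sum < 0) ? 1u : 0u;
    return (uint32_t)(sum < 0 ? -sum : sum);
}
```

For example, `add_sign_magnitude(0, 0xA00000, 1, 0xC00000, &sign)` returns the magnitude 0x200000 and sets `sign` to 1. The result still has to be normalized afterwards.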

Trying the method of subtraction described by the other sources, I tried negating the mantissa of the negative numbers in various ways like these:

manz = manx - many;
manz = manx + (many - (1<<23));
manz = manx + (many - (1<<24));
manz = manx + ( (many - (1<<23)) & 0x007FFFFF );
manz = manx + ( (many - (1<<23)) + 1);
manz = manx + ( (~many & 0x007FFFFF) + 1);
manz = manx + (~many + 1);
manz = manx + ( (many ^ 0x007FFFFF) + 1);
manz = manx + ( (many ^ 0x00FFFFFF) + 1);
manz = manx + ( (many ^ 0x003FFFFF) + 1);

This is the statement that is supposed to handle the addition based on the sign. It runs after the mantissas have been aligned:

expz = expy;
if(signx != signy) { // opp sign
  if(manx < many) {
    signz = signy;
    manz = many + ((manx ^ 0x007FFFFF) + 1);
  } else if(manx > many) {
    signz = signx;
    manz = manx - ((many ^ 0x007FFFFF) + 1);
  } else { // x == y
    signz = 0x00000000;
    expz  = 0x00000000;
    manz  = 0x00000000;
  }
} else {
  signz = signx;
  manz  = manx + many;
}

This is the code immediately following it, which normalizes the number in the case of an overflow. It works when the numbers have the same sign, but I'm not sure the way it works makes sense when subtracting:

if(manz & 0x01000000) {
  expz++;
  manz = (manz >> 1) + (manz & 0x1);
}
manz &= 0x007FFFFF;

With the test values -3.34632F and 34.8532413F, I get the answer 0x427E0716 (63.506920) when it should be 0x41FC0E2D (31.506922), and with the test values 3.34632F and -34.8532413F, I get the answer 0xC27E0716 (-63.506920) when it should be 0xC1FC0E2D (-31.506922).
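
To check results like these against the hardware, it helps to move between a `float` and its bit pattern. A minimal sketch using `memcpy` follows; the helper names `float_to_bits`, `bits_to_float` and `hw_add_bits` are made up here, and the code assumes `float` is a 32-bit IEEE 754 single.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern. memcpy avoids the
   aliasing problems of pointer casts. Assumes sizeof(float) == 4. */
static uint32_t float_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* The inverse: build a float from a 32-bit pattern. */
static float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Reference result from the hardware adder, as a bit pattern. */
static uint32_t hw_add_bits(float a, float b) {
    volatile float sum = a + b;  /* volatile: force a store as single */
    return float_to_bits(sum);
}
```

On a typical IEEE 754 platform with the default rounding mode, `hw_add_bits(-3.34632F, 34.8532413F)` should give 0x41FC0E2D, the expected value quoted above, so an emulated adder can be compared against it bit for bit.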


I was able to fix my problem by changing the way that I was normalizing the floats when subtracting.

expz = expy;
if(signx != signy) { // opp sign
  if(manx < many) {
    signz = signy;
    manz  = many - manx;
  } else if(manx > many) {
    signz = signx;
    manz  = manx - many;
  } else { // x == y
    signz = 0x00000000;
    expz  = 0x00000000;
    manz  = 0x00000000;
  }
  // Normalize subtraction
  while((manz & 0x00800000) == 0 && manz) {
      manz <<= 1;
      expz--;
  }
} else {
  signz = signx;
  manz  = manx + many;
  // Normalize addition
  if(manz & 0x01000000) {
    expz++;
    manz = (manz >> 1) + ( (manz & 0x2) ? (manz & 0x1) : 0 ); // round half to even, using the bits of manz
  }
}
manz &= 0x007FFFFF;
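
The reason the extra loop is needed: adding two normalized mantissas can carry out by at most one bit, but subtracting them can cancel any number of leading bits, so the result may need several left shifts. A small sketch of that step in isolation is below. The helper names `normalize_after_sub` and `pack` are made up for illustration, the `exp > 1` guard is an added assumption to stop before the exponent underflows, and subnormals and rounding are ignored.

```c
#include <stdint.h>

/* Hypothetical helper: renormalize a mantissa after a subtraction.
   *man is the 24-bit result (leading 1 expected at bit 23) and *exp
   is the biased exponent. A zero mantissa is left untouched. */
static void normalize_after_sub(uint32_t *man, uint32_t *exp) {
    while (*man != 0 && (*man & 0x00800000U) == 0 && *exp > 1) {
        *man <<= 1;   /* move the leading 1 toward bit 23 */
        (*exp)--;     /* compensate: each left shift doubles the mantissa */
    }
}

/* Pack sign (0 or 1), biased exponent and normalized mantissa into
   the bit pattern of a single precision float. */
static uint32_t pack(uint32_t sign, uint32_t exp, uint32_t man) {
    return (sign << 31) | (exp << 23) | (man & 0x007FFFFFU);
}
```

As a worked case, 5.0 - 3.0 with both mantissas aligned to biased exponent 129 is 0xA00000 - 0x600000 = 0x400000. One left shift gives 0x800000 with exponent 128, and `pack(0, 128, 0x800000)` is 0x40000000, which is 2.0.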
asked on Stack Overflow Oct 14, 2019 by sheep44 • edited Oct 15, 2019 by sheep44

1 Answer


How to add two floating point numbers with opposite sign?

Mostly you don't.

For everything that works with numerical types that can't rely on "two's complement wrap on overflow" (e.g. floating point, big number libraries, ...), you always end up with something like:

add_signed(v1, v2) {
    if( v1 < 0) {
        if( v2 < 0) {
            // Both negative
            return -add_unsigned(-v1, -v2);
        } else {
            // Different sign, v1 is negative
            return subtract_unsigned(v2, -v1);
        }
    } else {
        if( v2 < 0) {
            // Different sign, v2 is negative
            return subtract_unsigned(v1, -v2);
        } else {
            // Both positive
            return add_unsigned(v1, v2);
        }
    }
}

subtract_signed(v1, v2) {
    return add_signed(v1, -v2);
}

add_unsigned(v1, v2) {
    // Here we know that v1 and v2 will never be negative, and
    //   we know that the result will never be negative
    ...
}

subtract_unsigned(v1, v2) {
    if(v1 < v2) {
        return -subtract_unsigned(v2, v1);
    }
    // Here we know that v1 and v2 will never be negative, and
    //   we know that the result will never be negative
    ...
}

In other words; all of the actual addition and all of the actual subtraction happens with unsigned ("never negative") numbers.
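
The same dispatch can be tried out on something simpler than floats. The sketch below uses a made-up sign-magnitude integer type (`sm_int`, with hypothetical helpers `sm_negate`, `sm_add` and `sm_sub`) purely to show the pattern; it folds the four cases into "same sign" and "opposite sign", and it does not handle overflow of the magnitude.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sign-magnitude integer, for illustration only. */
typedef struct {
    bool     negative;
    uint32_t magnitude;
} sm_int;

static sm_int sm_negate(sm_int v) {
    if (v.magnitude != 0) v.negative = !v.negative;  /* keep zero positive */
    return v;
}

static sm_int sm_add(sm_int a, sm_int b) {
    sm_int r;
    if (a.negative == b.negative) {
        /* Same sign: add the magnitudes, keep the common sign. */
        r.negative  = a.negative;
        r.magnitude = a.magnitude + b.magnitude;
    } else if (a.magnitude >= b.magnitude) {
        /* Opposite signs: subtract the smaller magnitude from the
           larger one; the larger operand decides the sign. */
        r.negative  = a.negative;
        r.magnitude = a.magnitude - b.magnitude;
    } else {
        r.negative  = b.negative;
        r.magnitude = b.magnitude - a.magnitude;
    }
    if (r.magnitude == 0) r.negative = false;  /* normalize -0 to +0 */
    return r;
}

static sm_int sm_sub(sm_int a, sm_int b) {
    return sm_add(a, sm_negate(b));
}
```

With this, `sm_add` applied to -3 and +34 yields +31, and every arithmetic operation inside it is performed on non-negative magnitudes.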

A more complete example, for emulated addition of 32-bit floating point only (in C; untested and probably buggy; might or might not work for denormals; no support for NaNs or infinities; no support for overflow or underflow; no "shift mantissa left to reduce precision loss before rounding"; and no rounding mode other than "round towards zero"):

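#include <stdint.h>

#define SIGN_FLAG      0x80000000U
#define EXPONENT_MASK  0x7F800000U
#define MANTISSA_MASK  0x007FFFFFU
#define IMPLIED_BIT    0x00800000U
#define OVERFLOW_BIT   0x01000000U
#define EXPONENT_ONE   0x00800000U

// Forward declarations: add_signed() calls these before they are defined
uint32_t add_unsigned(uint32_t v1, uint32_t v2);
uint32_t subtract_unsigned(uint32_t v1, uint32_t v2);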

uint32_t add_signed(uint32_t v1, uint32_t v2) {
    if( (v1 & SIGN_FLAG) != 0) {
        if( (v2 & SIGN_FLAG) != 0) {
            // Both negative
            return SIGN_FLAG | add_unsigned(v1 & ~SIGN_FLAG, v2 & ~SIGN_FLAG);
        } else {
            // Different sign, v1 is negative
            return subtract_unsigned(v2, v1 & ~SIGN_FLAG);
        }
    } else {
        if( (v2 & SIGN_FLAG) != 0) {
            // Different sign, v2 is negative
            return subtract_unsigned(v1, v2 & ~SIGN_FLAG);
        } else {
            // Both positive
            return add_unsigned(v1, v2);
        }
    }
}

uint32_t subtract_signed(uint32_t v1, uint32_t v2) {
    return add_signed(v1, v2 ^ SIGN_FLAG);
}

uint32_t add_unsigned(uint32_t v1, uint32_t v2) {
    // Here we know that v1 and v2 will never be negative, and
    //   we know that the result will never be negative

    if(v1 < v2) {    // WARNING: Compares both exponents and mantissas
        return add_unsigned(v2, v1);
    }

    // Here we know the exponent of v1 is not smaller than the exponent of v2

    uint32_t m1 = (v1 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t m2 = (v2 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t exp2 = v2 & EXPONENT_MASK;
    uint32_t expr = v1 & EXPONENT_MASK;

    while(exp2 < expr) {
        m2 >>= 1;
        exp2 += EXPONENT_ONE;
    }
    uint32_t mr = m1+m2;
    if( (mr & OVERFLOW_BIT) != 0) {
        mr >>= 1;    // shift the carry back into the implied-bit position
        expr += EXPONENT_ONE;
    }
    return expr | (mr & ~IMPLIED_BIT);
}

uint32_t subtract_unsigned(uint32_t v1, uint32_t v2) {
    if(v1 == v2) {
        return 0;
    }
    if(v1 < v2) {
        return SIGN_FLAG ^ subtract_unsigned(v2, v1);
    }

    // Here we know the exponent of v1 is not smaller than the exponent of v2,
    //  and that (if exponents are equal) the mantissa of v1 is larger
    //  than the mantissa of v2; and therefore the result will be
    //  positive

    uint32_t m1 = (v1 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t m2 = (v2 & MANTISSA_MASK) | IMPLIED_BIT;
    uint32_t exp2 = v2 & EXPONENT_MASK;
    uint32_t expr = v1 & EXPONENT_MASK;

    while(exp2 < expr) {
        m2 >>= 1;
        exp2 += EXPONENT_ONE;
    }
    uint32_t mr = m1-m2;
    while( (mr & IMPLIED_BIT) == 0) {
        mr <<= 1;
        expr -= EXPONENT_ONE;
    }
    return expr | (mr & ~IMPLIED_BIT);
}
answered on Stack Overflow Oct 14, 2019 by Brendan • edited Oct 15, 2019 by Brendan

User contributions licensed under CC BY-SA 3.0