converting Golang float32 to half-precision float (GLSL float16) as uint16

Question

I need to pass some data from Go to a '300 es' shader. The data consists of two uint16s packed into a uint32, where each uint16 holds the bits of a half-precision float (float16). I found some public-domain Java code that looks like it will do the job, but I am struggling to port the last statement, which uses a couple of zero-extending right shifts ('>>>'). I think the other shifts are fine, i.e. their operands are non-negative. Since Go is a bit clever with extending, the solution to the port is eluding me. I did think the first one could perhaps be changed into a left shift, since it just seems to be positioning a single bit for the addition, but the final shift blows my mind out of the water :)
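For reference, a minimal sketch of how Go picks between the two kinds of right shift: the operator is always '>>', and the type of the left operand decides whether zeros or copies of the sign bit are shifted in. The helper names logicalShift and arithmeticShift are made up for this illustration. One difference worth keeping in mind: Java masks an int shift count to its low 5 bits, while Go yields 0 for an unsigned value shifted by 32 or more.

```go
package main

import "fmt"

// logicalShift behaves like Java's x >>> n on a 32-bit pattern:
// with an unsigned left operand, Go's >> always shifts in zeros.
func logicalShift(bits uint32, n uint) uint32 { return bits >> n }

// arithmeticShift behaves like Java's x >> n:
// with a signed left operand, Go's >> copies the sign bit.
func arithmeticShift(bits int32, n uint) int32 { return bits >> n }

func main() {
	var negOne uint32 = 0xbf800000 // bit pattern of float32(-1.0)

	// Isolate the sign bit the same way the Java code does.
	fmt.Printf("%#x\n", logicalShift(negOne, 16)&0x8000)

	// The signed shift smears the sign bit across the upper half.
	fmt.Printf("%#x\n", uint32(arithmeticShift(int32(negOne), 16)))

	// Shift counts at or above the width give 0 for unsigned operands.
	fmt.Println(logicalShift(1, 32))
}
```

So as long as the bits are held in a uint32 (which is what math.Float32bits returns), every '>>>' in the Java code should map onto a plain '>>' in Go.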

btw I hope I got the bracketing right, since the operator precedence seems to be different between Go and Java regarding '-' and '>>'...
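The precedence difference can be checked in isolation. In Java, '+' and '-' bind tighter than the shift operators; in Go, '<<' and '>>' sit at the multiplicative level and bind tighter than '+' and '-'. The sketch below uses the normalized-value expression from the Java code; the function names javaGrouping and goDefaultGrouping are invented for the example.

```go
package main

import "fmt"

// javaGrouping is how Java reads `val - 0x38000000 >>> 13`:
// subtract first, then shift.
func javaGrouping(val uint32) uint32 { return (val - 0x38000000) >> 13 }

// goDefaultGrouping is what the same text means in Go without
// parentheses: the constant is shifted first, then subtracted.
func goDefaultGrouping(val uint32) uint32 { return val - 0x38000000>>13 }

func main() {
	// Bits of float32(1.0) plus the 0x1000 rounding term.
	val := uint32(0x3f801000)
	fmt.Printf("java grouping: %#x\n", javaGrouping(val))
	fmt.Printf("go default:    %#x\n", goDefaultGrouping(val))
}
```

With the Java grouping the result is 0x3c00, the half-precision encoding of 1.0; the unparenthesised Go form gives a different value, so every mixed arithmetic-and-shift expression from the Java code needs explicit parentheses in Go.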

I need to go the other way around next, but that is hopefully easier without right shifts... famous last words!

Java code:

https://stackoverflow.com/a/6162687/345165

// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value

    if( val >= 0x47800000 )               // might be or become NaN/Inf
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 // is or must become NaN/Inf
            if( val < 0x7f800000 )        // was value but too large
                return sign | 0x7c00;     // make it +/-Inf
            return sign | 0x7c00 |        // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             // unrounded not quite Inf
    }
    if( val >= 0x38800000 )               // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                // too small for subnormal
        return sign;                      // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
         + ( 0x800000 >>> val - 102 )     // round depending on cut off
      >>> 126 - val );   // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}

My partial port:

import "math"

func float32toUint16(f float32) uint16 {
    fbits := math.Float32bits(f)
    sign := uint16((fbits >> 16) & 0x00008000)
    rv := (fbits & 0x7fffffff) + 0x1000

    if rv >= 0x47800000 {
        if (fbits & 0x7fffffff) >= 0x47800000 {
            if rv < 0x7f800000 {
                return sign | 0x7c00
            }
            return sign | 0x7c00 | uint16((fbits&0x007fffff)>>13)
        }
        return sign | 0x7bff
    }
    if rv >= 0x38800000 {
        return sign | uint16((rv-0x38000000)>>13)
    }
    if rv < 0x33000000 {
        return sign
    }
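    rv = (fbits & 0x7fffffff) >> 23 // biased exponent, used for the subnormal case
    // These two shifts are my problem. The sum is parenthesised before the final
    // shift because Go's >> binds tighter than +, whereas Java adds first.
    return sign | uint16((((fbits&0x7fffff)|0x800000)+(0x800000>>(rv-102)))>>(126-rv))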
}

func pack16(f1 float32, f2 float32) uint32 {
    ui161 := float32toUint16(f1)
    ui162 := float32toUint16(f2)
    return ((uint32(ui161) << 16) | uint32(ui162))
}
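One thing that may be worth checking is which half ends up in which 16 bits. If the shader unpacks the word with GLSL's unpackHalf2x16 (an assumption; the question does not say how the shader reads it), the first component comes from the 16 least-significant bits and the second from the 16 most-significant bits. pack16 above puts f1 in the high half, so f1 would arrive as the second component. The sketch below only illustrates the two layouts on already-converted half bits; both function names are hypothetical.

```go
package main

import "fmt"

// packHighFirst mirrors the layout of pack16 above:
// the first argument lands in the upper 16 bits.
func packHighFirst(a, b uint16) uint32 { return uint32(a)<<16 | uint32(b) }

// packLowFirst puts the first argument in the lower 16 bits,
// the layout that unpackHalf2x16 reads back as (x, y).
func packLowFirst(x, y uint16) uint32 { return uint32(x) | uint32(y)<<16 }

func main() {
	one, minusTwo := uint16(0x3c00), uint16(0xc000) // half bits of 1.0 and -2.0
	fmt.Printf("high first: %#08x\n", packHighFirst(one, minusTwo))
	fmt.Printf("low first:  %#08x\n", packLowFirst(one, minusTwo))
}
```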

I also found what looks like even more efficient, branch-free code, but understanding how it works well enough to port it is a bit ;) beyond my rusty (not the language) skills.

https://stackoverflow.com/a/5587983

Cheers

[Edit] The code appears to work with the values I am currently using (it's hard to be precise since I have no experience debugging a shader). So I guess my question is about the correctness of my port, especially the final two shifts.

[Edit2] In the light of day I can see I already got the precedence wrong in one place and fixed the above example.

changed:

    return sign | uint16(rv-(0x38000000>>13))

to:

    return sign | uint16((rv-0x38000000)>>13)

go

half-precision-float

asked on Stack Overflow Sep 17, 2020 by Peter • edited Sep 18, 2020 by Peter

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0