Assembly x86 "PSHUFB 128bit" implementation in another language


I was reverse-engineering an application and came across this instruction:

PSHUFB XMM2, XMMWORD_ADDRESS

I tried to implement this instruction's algorithm in Python, without success. The reference for how the instruction should work is here: http://www.felixcloutier.com/x86/PSHUFB.html

Here is the pseudocode from that reference:

PSHUFB (with 128 bit operands)

    for i = 0 to 15 {
        if (SRC[(i*8)+7] = 1) then
            DEST[(i*8)+7..(i*8)+0] ← 0;
        else
            index[3..0] ← SRC[(i*8)+3..(i*8)+0];
            DEST[(i*8)+7..(i*8)+0] ← DEST[(index*8+7)..(index*8+0)];
        endif
    }
    DEST[VLMAX-1:128] ← 0

I'm trying to implement the 128-bit version of this instruction. Here are the values of XMM2 before and after it executes:

Before

WINDBG>r xmm2
xmm2=           0 3.78351e-044 6.09194e+027 6.09194e+027

After

WINDBG>r xmm2
xmm2=9.68577e-042            0 4.92279e-029 4.92279e-029

In Python you can use the struct module to convert those float values to hex:

hex(struct.unpack('<I', struct.pack('<f', f))[0])
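The conversion above can be wrapped in a small helper. The sketch below is only an illustration: the names float_to_hex and hex_to_float are hypothetical, and it assumes the debugger printed each 32-bit lane with six significant digits, as the printouts here suggest. Under that assumption, several neighbouring bit patterns print as the same decimal string, so the low byte recovered this way should be treated as approximate.

```python
import struct


def float_to_hex(f):
    """Round f to a 32-bit float and return its raw bit pattern as hex."""
    return "0x%08x" % struct.unpack("<I", struct.pack("<f", f))[0]


def hex_to_float(bits):
    """Opposite direction: raw 32-bit pattern to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(float_to_hex(6.09194e+27))  # 0x6d9d7914

# Neighbouring bit patterns all print as the same six-digit value,
# so the printout alone cannot tell them apart.
for bits in (0x6D9D7910, 0x6D9D7914, 0x6D9D7918):
    print(hex(bits), "%.5e" % hex_to_float(bits))
```

In other words, a dword recovered from such a printout may be off by a few units in its lowest byte.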

So these are, roughly, the hex values of XMM2 before and after the PSHUFB instruction:

Before

xmm2 = 0x00000000 0x0000001b 0x6d9d7914 0x6d9d7914

After

xmm2 = 0x00001b00 0x00000000 0x10799d78 0x10799d78

Most importantly (I almost forgot), the value at XMMWORD_ADDRESS is:

03 02 01 00 07 06 05 04 0D 0C 0B 0A 09 08 80 80

xmmword 808008090A0B0C0D0405060700010203h

An implementation in Python would be highly appreciated. An implementation in C would work as well.

Or maybe some explanation of how it works, because I couldn't understand the Intel reference.

This is the code I have so far:

x = ['00', '00', '00', '00', '00', '00', '00', '1b', '6d', '9d', '79', '14', '6d', '9d', '79', '14']
s = ['03', '02', '01', '00', '07', '06', '05', '04', '0D', '0C', '0B', '0A', '09', '08', '80', '80']
new = []
for i in range(16):
    if 0x80 == int(s[i], 16) & 0x80:
        print "MSB", s[i]
        new.append('00')
    else:
        print "NOT MSB", s[i]
        new.append( x[int(s[i], 16) & 15] )
       
print x
print new

Here x is XMM2 (the destination) and s is SRC.

The output I get is:

['00', '00', '00', '00', '00', '00', '00', '1b', '6d', '9d', '79', '14', '6d', '9d', '79', '14']

['00', '00', '00', '00', '1b', '00', '00', '00', '9d', '6d', '14', '79', '9d', '6d', '00', '00']

whereas I should get:

['00', '00', '1b', '00', '00', '00', '00', '00', '10', '79', '9d', '78', '10', '79', '9d', '78']

Something else I just noticed: the expected output contains the hexadecimal value 0x78, which does not appear anywhere in the input. Where could it come from?

python
algorithm
assembly
x86
hex
asked on Stack Overflow Mar 3, 2015 by 0xAK • edited Jun 20, 2020 by Community

1 Answer


It works like 16 parallel table lookups, with special handling for indexes that have their top bit set. For example, it could look like this (not tested, not Python):

for (int i = 0; i < 16; i++)
    new_dest[i] = (src[i] & 0x80) ? 0 : dest[src[i] & 15];
dest = new_dest;

The new_dest there is significant, because these are really 16 parallel assignments, i.e. read-before-write: the second lookup is not affected by what happened to the first byte, and so on. Intel's code snippet leaves that implicit (or is wrong, depending on how you look at it).
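Since the question asked for Python, here is a minimal sketch of the same idea. The function name pshufb128 is hypothetical, and the sketch assumes both operands are given in memory order (byte 0 first). The mask in the question is already listed that way, but the XMM2 dwords are written highest first, so they have to be reversed before indexing. The code sidesteps this by starting from 128-bit integers and letting to_bytes produce the memory order.

```python
def pshufb128(dest, src):
    """Emulate 128-bit PSHUFB on two 16-byte sequences in memory order."""
    if len(dest) != 16 or len(src) != 16:
        raise ValueError("both operands must be exactly 16 bytes")
    # Build a new object so every lookup reads the ORIGINAL dest
    # (read-before-write, the role new_dest plays in the C snippet).
    return bytes(0 if s & 0x80 else dest[s & 0x0F] for s in src)


# XMM2 from the question as one 128-bit integer, highest dword first:
# 0x00000000 0x0000001b 0x6d9d7914 0x6d9d7914
xmm2 = 0x000000000000001B6D9D79146D9D7914
mask = 0x808008090A0B0C0D0405060700010203  # the xmmword at XMMWORD_ADDRESS

# to_bytes(..., "little") yields memory order: index 0 is the lowest byte.
result = pshufb128(xmm2.to_bytes(16, "little"), mask.to_bytes(16, "little"))

# Print highest byte first again, to compare with the debugger output.
print(result[::-1].hex())  # 00001b000000000014799d6d14799d6d
```

With that byte ordering the sketch yields 00001b00 00000000 14799d6d 14799d6d, which agrees with the debugger's result everywhere except the first and last byte of the repeated dword. A plausible cause, though not something this sketch proves, is that the hex values in the question were recovered from six-digit float printouts, which do not determine every bit of each dword.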

answered on Stack Overflow Mar 3, 2015 by harold

User contributions licensed under CC BY-SA 3.0