# Why is my straightforward quaternion multiplication faster than SSE?

6

I've been going through a few different quaternion multiplication implementations, but I've been rather surprised to see that the reference implementation is, so far, my fastest. This is the implementation in question:

``````inline static quat multiply(const quat& lhs, const quat& rhs)
{
return quat((lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y),
(lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z),
(lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y * rhs.x),
(lhs.w * rhs.w) - (lhs.x * rhs.x) - (lhs.y * rhs.y) - (lhs.z * rhs.z));
}
``````

I've tried a few other implementations, some using SSE, some that don't. Here's an example of one such SSE implementation, basically copied from the library that Bullet Physics uses:

``````inline static __m128 multiplynew(__m128 lhs, __m128 rhs)
{
__m128 qv, tmp0, tmp1, tmp2, tmp3;
__m128 product, l_wxyz, r_wxyz, xy, qw;
vec4 sw;

tmp0 = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 0, 2, 1));
tmp1 = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 1, 0, 2));
tmp2 = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 1, 0, 2));
tmp3 = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1));
qv = _mm_mul_ps(_mm_splat_ps(lhs, 3), rhs);
qv = _mm_madd_ps(_mm_splat_ps(rhs, 3), lhs, qv);
qv = _mm_nmsub_ps(tmp2, tmp3, qv);
product = _mm_mul_ps(lhs, rhs);
l_wxyz = _mm_sld_ps(lhs, lhs, 12);
r_wxyz = _mm_sld_ps(rhs, rhs, 12);
qw = _mm_nmsub_ps(l_wxyz, r_wxyz, product);
qw = _mm_sub_ps(qw, _mm_sld_ps(xy, xy, 8));

sw.uiw = 0xffffffff;
return _mm_sel_ps(qv, qw, sw);
}
``````

In release mode with optimizations turned on, my simple reference implementation runs 70%-90% faster than bullet's SSE implementation. In debug mode without optimizations, it runs as much as 3x faster.

My first question is, why does this happen?

My second question is, is there any way for me to optimize my quaternion-quaternion multiplication routine? I don't want to deal with assembly, but I use sse intrinsics quite a lot elsewhere.

(btw, if it matters, the data storage of my quaternion is defined as `union { __m128 data; struct { float x, y, z, w; }; float f[4]; };`)

I looked at the dissassembly. Here's the disassembly for `multiply` (the fast non-sse one):

``````00EC9940  movaps      xmm3,xmmword ptr [esp+0D0h]
00EC9948  movaps      xmm2,xmmword ptr [esp+0C0h]
00EC9950  movaps      xmm4,xmm3
00EC9953  mulss       xmm4,xmm5
00EC9957  movaps      xmm0,xmm2
00EC995A  mulss       xmm0,xmm6
00EC995E  mulss       xmm3,xmm1
00EC9966  movss       xmm0,dword ptr [esp+40h]
00EC996C  mulss       xmm0,xmm1
00EC9974  movss       xmm0,dword ptr [esp+0F0h]
00EC997D  mulss       xmm0,xmm7
00EC9981  subss       xmm4,xmm0
00EC9985  movss       xmm0,dword ptr [esp+0F0h]
00EC998E  mulss       xmm0,xmm6
00EC9996  movaps      xmm0,xmm2
00EC9999  movaps      xmm2,xmmword ptr [esp+40h]
00EC999E  mulss       xmm0,xmm7
00EC99A6  movaps      xmm0,xmm2
00EC99A9  mulss       xmm0,xmm5
00EC99B1  subss       xmm3,xmm0
00EC99B5  movss       xmm0,dword ptr [esp+0D0h]
00EC99BE  mulss       xmm0,xmm7
00EC99C6  movss       xmm0,dword ptr [esp+0F0h]
00EC99CF  mulss       xmm0,xmm5
00EC99D7  movss       xmm0,dword ptr [esp+0C0h]
00EC99E0  mulss       xmm0,xmm1
00EC99E4  movss       xmm1,dword ptr [esp+0D0h]
00EC99ED  mulss       xmm1,xmm6
00EC99F1  subss       xmm2,xmm0
00EC99F5  movss       xmm0,dword ptr [esp+0C0h]
00EC99FE  mulss       xmm0,xmm5
00EC9A02  movaps      xmm5,xmmword ptr [esp+50h]
00EC9A07  unpcklps    xmm4,xmm2
00EC9A0A  subss       xmm1,xmm0
00EC9A0E  movss       xmm0,dword ptr [esp+0F0h]
00EC9A17  mulss       xmm0,xmm5
00EC9A1B  subss       xmm1,xmm0
00EC9A1F  movss       xmm0,dword ptr [esp+40h]
00EC9A25  mulss       xmm0,xmm7
00EC9A29  subss       xmm1,xmm0
00EC9A2D  unpcklps    xmm3,xmm1
00EC9A30  unpcklps    xmm4,xmm3
00EC9A33  movaps      xmm5,xmm4
00EC9A36  movaps      xmmword ptr [esp+30h],xmm5
00EC9A3B  dec         eax
00EC9A3C  je          SDL_main+58Ah (0EC9A5Ah)
``````

And here's the disassembly for `multiplynew` (the slow sse one):

``````00329BF3  movaps      xmm6,xmm5
00329BF6  mulps       xmm6,xmm1
00329BF9  movaps      xmm0,xmm5
00329BFC  mov         dword ptr [esp+6Ch],0FFFFFFFFh
00329C04  shufps      xmm0,xmm5,93h
00329C08  movaps      xmm1,xmm5
00329C0B  mulps       xmm4,xmm0
00329C0E  movaps      xmm0,xmmword ptr [esp+110h]
00329C16  movaps      xmm3,xmm6
00329C19  shufps      xmm1,xmm5,0FFh
00329C1D  mulps       xmm1,xmmword ptr [esp+40h]
00329C22  movaps      xmm7,xmmword ptr [esp+60h]
00329C2A  mulps       xmm0,xmm5
00329C2D  subps       xmm6,xmm4
00329C30  shufps      xmm3,xmm3,4Eh
00329C37  movaps      xmm0,xmm5
00329C3A  shufps      xmm0,xmm5,0C9h
00329C3E  subps       xmm6,xmm3
00329C41  mulps       xmm0,xmmword ptr [esp+120h]
00329C49  shufps      xmm5,xmm5,0D2h
00329C4D  mulps       xmm5,xmmword ptr [esp+0C0h]
00329C55  andps       xmm6,xmmword ptr [esp+60h]
00329C5D  subps       xmm1,xmm5
00329C60  andnps      xmm7,xmm1
``````

The way I test the speed is using:

``````timer.update();
for (uint i = 0; i < 1000000; ++i)
{
temp1 = quat::multiply(temp1, q1);
}
timer.update();
printf("1M calls to multiplyOld took %fs.\n", timer.getDeltaTime());
``````

(timer.getDeltaTime() returns how much time has passed, in seconds, between the last time timer.update() was called and the time that timer.update() was called before that.)

Why does my non-sse version run faster despite having more instructions..? Am I reading the disassembly wrong or something?

EDIT: I've discovered that the sse version is running faster than the non-sse version when I compile in x64.

c++
optimization
sse
quaternions
asked on Stack Overflow Mar 6, 2014 by Haydn V. Harach • edited Mar 6, 2014 by Haydn V. Harach

3

If I had to make a wild guess, I'd say it's because `multiply` is highly parallelizable at the instruction level, whereas `multiplynew` is very serial in nature, which is inhibiting.

answered on Stack Overflow Mar 6, 2014 by Timothy Shields
3

The best way to use SIMD would be to multiply four (eight) independent quaternions at once with SSE (AVX). However, this often is not practical. In that case I recommend checking out Agner Fog's vectorclass. In the directory `special` he has a file `quaterinon.h`. I converted the multiply function to match your code. This only needs SSE2 (`#include <emmintrin.h>`).

``````inline static __m128 multiplynew(__m128 a, __m128 b) {
__m128 a1123 = _mm_shuffle_ps(a,a,0xE5);
__m128 a2231 = _mm_shuffle_ps(a,a,0x7A);
__m128 b1000 = _mm_shuffle_ps(b,b,0x01);
__m128 b2312 = _mm_shuffle_ps(b,b,0x9E);
__m128 t1    = _mm_mul_ps(a1123, b1000);
__m128 t2    = _mm_mul_ps(a2231, b2312);
__m128 t12m  = _mm_xor_ps(t12, _mm_castsi128_ps(mask)); // flip sign bits
__m128 a3312 = _mm_shuffle_ps(a,a,0x9F);
__m128 b3231 = _mm_shuffle_ps(b,b,0x7B);
__m128 a0000 = _mm_shuffle_ps(a,a,0x00);
__m128 t3    = _mm_mul_ps(a3312, b3231);
__m128 t0    = _mm_mul_ps(a0000, b);
__m128 t03   = _mm_sub_ps(t0, t3);
}
``````

Here is the assembly output (GCC 4.8 MASM style):

``````multiplynew(float __vector, float __vector):
movaps  xmm2, xmm0
movaps  xmm3, xmm0
movaps  xmm5, xmm1
movaps  xmm4, xmm1
shufps  xmm2, xmm0, 229
shufps  xmm4, xmm1, 158
shufps  xmm3, xmm0, 122
shufps  xmm5, xmm1, 1
mulps   xmm3, xmm4
movaps  xmm4, xmm1
mulps   xmm2, xmm5
shufps  xmm4, xmm1, 123
movaps  xmm3, xmm0
shufps  xmm3, xmm0, 159
shufps  xmm0, xmm0, 0
xorps   xmm2, XMMWORD PTR .LC0[rip]
mulps   xmm3, xmm4
mulps   xmm0, xmm1
subps   xmm0, xmm3
ret
``````
answered on Stack Overflow Mar 6, 2014 by Z boson • edited Mar 6, 2014 by Z boson
0

IMO, the easiest and most scalable way to use SIMD is AoSoA style https://en.wikipedia.org/wiki/AoS_and_SoA

For example, instead of `Quaternion { __m128 xyzw; }`, you instead do `Quaternion { T x, y, z, w; }` and then you can use the regular Quaternion for the most part, and then in your big loops you use `Quaternion<__m128>` to multiply then in chunks of four

The same holds true for most other mathematical types, Vectors in particular. I wouldn't worry about SIMD outside of the big loops IMO, it just doesn't matter, which is why I dislike the AoS SIMD style (which doesn't scale anyway as vector intrinsics get bigger).

answered on Stack Overflow Nov 12, 2020 by Elliott Prechter • edited Nov 12, 2020 by lennon310

User contributions licensed under CC BY-SA 3.0