128-bit Values - From XMM Registers To General Purpose


Answer :

You cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may or may not involve a roundtrip to memory or the destruction of a register.

in registers (SSE2)

    movq rax,xmm0       ;lower 64 bits
    movhlps xmm0,xmm0   ;move high 64 bits to low 64 bits.
    movq rbx,xmm0       ;high 64 bits.

punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not FP.
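For readers working from C, the same register-only split can be sketched with SSE2 intrinsics. This is a minimal illustration for x86-64, not code from the original answer: the `u64_pair` type and the `split_u64_regs` name are hypothetical, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <assert.h>     /* static_assert */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper type, introduced only for this sketch. */
typedef struct { uint64_t lo, hi; } u64_pair;

static_assert(sizeof(u64_pair) == sizeof(__m128i),
              "a 128-bit vector splits into two 64-bit halves");

/* Split a 128-bit vector into two 64-bit integers without touching memory. */
static u64_pair split_u64_regs(__m128i v) {
    u64_pair r;
    r.lo = (uint64_t)_mm_cvtsi128_si64(v);      /* movq rax,xmm0 */
    __m128i high = _mm_unpackhi_epi64(v, v);    /* punpckhqdq xmm0,xmm0 */
    r.hi = (uint64_t)_mm_cvtsi128_si64(high);   /* movq rbx,xmm0 */
    return r;
}
```

Because the intrinsic returns a new value instead of overwriting `v`, the original vector stays usable; the compiler decides whether that costs an extra register copy.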

via memory (SSE2)

    movdqu [mem],xmm0
    mov rax,[mem]
    mov rbx,[mem+8]
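A rough C equivalent of the memory round trip, assuming SSE2 intrinsics on x86-64. The function name `split_u64_mem` is hypothetical, and an optimizing compiler may replace the store and reloads with register moves, so treat the instruction comments as the intent, not a guarantee.

```c
#include <assert.h>     /* static_assert */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

static_assert(sizeof(uint64_t[2]) == sizeof(__m128i),
              "the buffer must hold one full XMM register");

/* Store the vector to a stack buffer, then reload each 64-bit half. */
static void split_u64_mem(__m128i v, uint64_t *lo, uint64_t *hi) {
    uint64_t buf[2];
    _mm_storeu_si128((__m128i *)buf, v);  /* movdqu [mem],xmm0 */
    *lo = buf[0];                         /* mov rax,[mem]     */
    *hi = buf[1];                         /* mov rbx,[mem+8]   */
}
```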

slow, but does not destroy xmm register (SSE4.1)

    movq rax,xmm0
    pextrq rbx,xmm0,1        ;3 cycle latency on Ryzen! (and 2 uops)

A hybrid strategy is possible, e.g. store to memory, movd/q e/rax,xmm0 so it's ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU, though.) That gives you a balance of uops for different back-end execution units. Store/reload is especially good when you want lots of small elements. (mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.)
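The hybrid idea for many small elements might look like this in C with SSE2 intrinsics. It is only a sketch: `unpack_bytes` is a hypothetical name, and the compiler may or may not emit the exact movd / movzx sequence suggested by the comments.

```c
#include <assert.h>     /* static_assert */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

static_assert(sizeof(__m128i) == 16, "one XMM register holds 16 bytes");

/* Zero-extend all 16 bytes of v into 32-bit integers.
   Element 0 comes straight from the register so it is ready early;
   the remaining elements are reloaded from the stored copy. */
static void unpack_bytes(__m128i v, uint32_t out[16]) {
    uint8_t buf[16];
    _mm_storeu_si128((__m128i *)buf, v);               /* movdqu [mem],xmm0 */
    out[0] = (uint32_t)_mm_cvtsi128_si32(v) & 0xFFu;   /* movd eax,xmm0, then keep the low byte */
    for (int i = 1; i < 16; i++)
        out[i] = buf[i];                               /* movzx r32, byte [mem+i] */
}
```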


For 32 bits, the code is similar:

in registers

    movd eax,xmm0
    psrldq xmm0,4         ;shift 4 bytes to the right
    movd ebx,xmm0
    psrldq xmm0,4         ;pshufd could copy-and-shuffle the original reg,
    movd ecx,xmm0         ;not destroying the XMM and maybe creating some ILP
    psrldq xmm0,4
    movd edx,xmm0
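In C, the register-only 32-bit extraction can be sketched with SSE2 intrinsics as follows. The name `split_u32_regs` is hypothetical. Each shift or shuffle here works on the original vector and not on the previous result, which is the copy-and-shuffle idea mentioned in the comments above: the four extractions do not depend on each other.

```c
#include <assert.h>     /* static_assert */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

static_assert(sizeof(uint32_t[4]) == sizeof(__m128i),
              "a 128-bit vector holds four 32-bit elements");

/* Extract the four 32-bit elements of v using only register operations. */
static void split_u32_regs(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);                           /* movd */
    out[1] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4));        /* psrldq by 4 bytes + movd */
    out[2] = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0xAA));  /* pshufd: broadcast element 2 + movd */
    out[3] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 12));       /* psrldq by 12 bytes + movd */
}
```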

via memory

    movdqu [mem],xmm0
    mov eax,[mem]
    mov ebx,[mem+4]
    mov ecx,[mem+8]
    mov edx,[mem+12]

Not destroying xmm register (SSE4.1) (slow like the psrldq / pshufd version)

    movd eax,xmm0
    pextrd ebx,xmm0,1        ;3 cycle latency on Skylake!
    pextrd ecx,xmm0,2        ;also 2 uops: like a shuffle(port5) + movd(port0)
    pextrd edx,xmm0,3

The 64-bit shift variant can run in 2 cycles. The pextrq version takes 4 minimum. For 32-bit, the numbers are 4 and 10, respectively.

