128-bit Values: From XMM Registers to General-Purpose Registers
Answer:
You cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may or may not involve a roundtrip to memory or the destruction of a register.
in registers (SSE2)
movq    rax,xmm0     ;lower 64 bits
movhlps xmm0,xmm0    ;move high 64 bits to low 64 bits.
movq    rbx,xmm0     ;high 64 bits.
punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not FP.
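Spelled out, the integer-domain version of the in-register extraction would look like this (a minimal sketch implied by the note above, not quoted from the original answer):

movq       rax,xmm0      ;lower 64 bits
punpckhqdq xmm0,xmm0     ;move high 64 bits to low 64 bits (integer domain)
movq       rbx,xmm0      ;high 64 bits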
via memory (SSE2)
movdqu [mem],xmm0
mov    rax,[mem]
mov    rbx,[mem+8]
slow, but does not destroy the xmm register (SSE4.1)
movq   rax,xmm0
pextrq rbx,xmm0,1    ;3 cycle latency on Ryzen! (and 2 uops)
A hybrid strategy is possible, e.g. store to memory, use movd/movq to get the low element into eax/rax so it's ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU latency, though.) That gives you a balance of uops for the different back-end execution units. Store/reload is especially good when you want lots of small elements: mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.
For 32 bits, the code is similar:
in registers
movd   eax,xmm0
psrldq xmm0,4        ;shift 4 bytes to the right
movd   ebx,xmm0
psrldq xmm0,4        ; pshufd could copy-and-shuffle the original reg
movd   ecx,xmm0      ; not destroying the XMM and maybe creating some ILP
psrldq xmm0,4
movd   edx,xmm0
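As the comments above suggest, pshufd can copy-and-shuffle instead, so xmm0 survives and the shuffles are independent of each other. A sketch of that variant (my instruction choice, assuming SSE2 and a free scratch register xmm1):

movd   eax,xmm0          ;element 0 from the low dword
pshufd xmm1,xmm0,0x55    ;broadcast element 1 into xmm1
movd   ebx,xmm1
pshufd xmm1,xmm0,0xAA    ;broadcast element 2
movd   ecx,xmm1
pshufd xmm1,xmm0,0xFF    ;broadcast element 3
movd   edx,xmm1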
via memory
movdqu [mem],xmm0
mov    eax,[mem]
mov    ebx,[mem+4]
mov    ecx,[mem+8]
mov    edx,[mem+12]
not destroying the xmm register (SSE4.1), slow like the psrldq / pshufd version
movd   eax,xmm0
pextrd ebx,xmm0,1    ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2    ;also 2 uops: like a shuffle(port5) + movd(port0)
pextrd edx,xmm0,3
The 64-bit shift variant can run in 2 cycles. The pextrq version takes 4 minimum. For 32-bit, the numbers are 4 and 10 cycles, respectively.