Assembly: REP MOVS Mechanism


Answer :

For questions about particular instructions always consult the instruction set reference.

In this case, you will need to look up rep and movs. In short, rep repeats the following string operation ecx times. movs copies data from ds:esi to es:edi and increments or decrements the pointers based on the setting of the direction flag. As such, repeating it will move a range of memory to somewhere else.
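As a rough illustration of the behaviour described above, here is a minimal C sketch that models rep movs in software. The `movs_state` struct and the `rep_movs` function are hypothetical names invented for this example; they only mimic the registers involved and are not how the CPU implements the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the registers that REP MOVS reads and updates. */
typedef struct {
    uint8_t *esi;  /* source pointer (stands in for DS:ESI) */
    uint8_t *edi;  /* destination pointer (stands in for ES:EDI) */
    size_t   ecx;  /* repeat count, in elements */
    int      df;   /* direction flag: 0 = forward (CLD), 1 = backward (STD) */
} movs_state;

/* Models REP MOVS for an element size of 1, 2, 4 or 8 bytes
 * (MOVSB, MOVSW, MOVSD, MOVSQ). */
static void rep_movs(movs_state *s, size_t elem_size)
{
    assert(elem_size == 1 || elem_size == 2 ||
           elem_size == 4 || elem_size == 8);

    while (s->ecx != 0) {
        /* One MOVS: read a whole element, then write it. */
        uint8_t tmp[8];
        memcpy(tmp, s->esi, elem_size);
        memcpy(s->edi, tmp, elem_size);

        /* Both pointers move by the element size, in the direction
         * selected by the direction flag. */
        if (s->df) {
            s->esi -= elem_size;
            s->edi -= elem_size;
        } else {
            s->esi += elem_size;
            s->edi += elem_size;
        }

        /* REP decrements the counter after each iteration. */
        s->ecx--;
    }
}
```

With the count set to 2 and an element size of 4, the model copies 8 bytes, leaves the count at zero and leaves both pointers just past the copied range, which matches what the registers hold after a forward rep movsd.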

PS: usually the operation size is encoded as an instruction suffix, so people use movsb and movsd to indicate a byte or dword operation. Some assemblers, however, allow specifying the size as in your example, with byte ptr or dword ptr. Also, the operands are implicit in the instruction, and you cannot modify them.


The short explanation about syntax

At the assembly-code level, two forms of this instruction are allowed: the “explicit-operands” form and the “no-operands” form. The explicit-operands form allows the source and destination memory addresses to be specified explicitly with symbols. This explicit-operands form is provided to allow documentation; however, note that the documentation provided by this form can be misleading. That is, the symbols do not have to specify the correct source and destination addresses. The source address is always specified by DS:(RSI/ESI/SI) and the destination address is always specified by ES:(RDI/EDI/DI), and these registers must be loaded correctly before the MOVS instruction is executed. This is how I understand the official position of Intel on this issue.

The long explanation about syntax

REP MOVS DWORD PTR ES:[EDI], DWORD PTR [ESI] is a synonym for REP MOVSD, and REP MOVS BYTE PTR ES:[EDI], BYTE PTR [ESI] is a synonym for REP MOVSB.

There are the following MOVS commands, based on data sizes:

  • MOVSB (byte, 8-bit)
  • MOVSW (word, 16-bit)
  • MOVSD (dword, 32-bit)
  • MOVSQ (qword, 64-bit) - only available in 64-bit mode

The MOVS command copies data from DS:(SI/ESI/RSI) to ES:(DI/EDI/RDI); the size of the SI/DI registers depends on your current mode (16-bit, 32-bit or 64-bit). It also increments or decrements the SI and DI registers by the element size, depending on the D flag: clear the flag with CLD to increment the registers, or set it with STD to decrement them.

The MOVS command cannot use other registers than SI/DI, so it is not necessary to specify them.

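If the MOVS command is prefixed by REP, it is repeated CX (ECX/RCX) times, so the count is a number of elements (bytes, words, dwords or qwords, depending on the variant), not necessarily a number of bytes. CX is decremented after each iteration, so at the end CX is zero.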

The explanation on relative performance

Starting with the first Pentium CPU in 1993, Intel made simple instructions execute faster and complex instructions (like REP MOVS) slower. As a result, REP MOVS became very slow, and there was no longer any reason to use it on Pentium CPUs based on the P5 microarchitecture (1993-1997).

In parallel with the P5 microarchitecture, Intel developed the P6 microarchitecture, where it decided to revisit REP MOVS and, since 1996, implemented the "fast strings" feature, which made REP MOVS fast again.

In 2013, Intel revisited REP MOVS again and added the CPUID ERMSB (Enhanced REP MOVSB) bit, which is supposed to indicate that the CPU implements the byte-sized move and store instructions (movsb, stosb) in a fast and efficient manner. In practice, it is only fast for large blocks, 256 bytes and larger, and only when certain conditions are met:

  • both the source and destination addresses have to be aligned to a 16-byte boundary (this boundary size is recommended for Ivy Bridge processors; on newer processors the boundary may be larger, up to 64 bytes for Cannonlake);
  • the source region should not overlap with the destination region;
  • the length has to be a multiple of 64 bytes to produce higher performance;
  • the direction has to be forward (CLD).

See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
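The conditions listed above can be collected into a simple predicate. The following is only a sketch: `ermsb_friendly` is a hypothetical helper name, addresses are passed as plain integers to keep the example portable, and the thresholds (256 bytes, 16-byte alignment, multiple of 64) are taken directly from this answer rather than from any guarantee about a particular CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when a copy of n bytes from address
 * src to address dst meets the rules of thumb listed in this answer. */
static bool ermsb_friendly(uintptr_t dst, uintptr_t src, size_t n, bool forward)
{
    /* Precondition: the end addresses must not wrap around. */
    assert(n <= UINTPTR_MAX - dst && n <= UINTPTR_MAX - src);

    bool large_enough = n >= 256;                     /* 256 bytes and larger */
    bool aligned      = (dst % 16 == 0) && (src % 16 == 0);
    bool disjoint     = (dst + n <= src) || (src + n <= dst);
    bool multiple_64  = (n % 64 == 0);

    return forward && large_enough && aligned && disjoint && multiple_64;
}
```

A memcpy-style routine could use a check like this to choose between REP MOVSB and an ordinary copy loop, although real library implementations use their own, processor-specific thresholds.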

REP MOVS instructions are very slow on small blocks because the startup cost is about 35 cycles. If you do plain simple MOV EAX (or something like that) in a loop, there are no startup costs and you can copy lots of data during these 35 cycles.

Please note that ERMSB produces the best results for REP MOVSB, not REP MOVSD (MOVSQ). All REP MOVS instructions became significantly faster, but REP MOVSB is the fastest of all with ERMSB. This is in contrast with older processors (before 2013), where the largest MOVS size available (MOVSQ in 64-bit mode, MOVSD in 32-bit mode) produced the fastest outcome.

So the code that you have shown is not optimal for processors with ERMSB, because REP MOVSB is the fastest variant there, not REP MOVSD, although the difference is not that big. A single REP MOVSB should be enough: it incurs the startup cost only once rather than twice, first for REP MOVSD and then for REP MOVSB.
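To make the comparison concrete, here is a C sketch of the two strategies. It assumes the two-phase code copies n / 4 dwords and then the remaining n % 4 bytes, which is the usual pattern but is an assumption about the code in the question. The function names are hypothetical, and the loops only stand in for the string instructions; they say nothing about actual timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two-phase copy: models REP MOVSD followed by REP MOVSB. */
static void copy_dwords_then_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t dwords = n / 4;  /* count for the REP MOVSD phase */
    size_t tail   = n % 4;  /* count for the REP MOVSB phase */

    /* First phase: one 4-byte element per iteration. */
    for (size_t i = 0; i < dwords; i++) {
        uint32_t tmp;
        memcpy(&tmp, src, sizeof tmp);
        memcpy(dst, &tmp, sizeof tmp);
        src += 4;
        dst += 4;
    }

    /* Second phase: the leftover 0 to 3 bytes. */
    for (size_t i = 0; i < tail; i++)
        *dst++ = *src++;
}

/* Single-phase copy: models one REP MOVSB with the count set to n. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions produce the same bytes in the destination. The difference on real hardware is that the first one pays the REP startup cost twice.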

However, for processors without ERMSB, your code is OK, except for P5-based Pentium processors released in 1993, where a plain MOV EAX copy (or one using the larger x87 registers) in a loop would be faster. The code that you have given will also give the best results on very old processors such as the 80386, released in 1985.

