assembly - Microarchitectural zeroing of a register via the register renamer: performance versus a mov?


I read in a blog post that recent x86 microarchitectures are also able to handle common register zeroing idioms (such as xor-ing a register with itself) in the register renamer; in the words of the author:

"the register renamer also knows how to execute these instructions – it can zero the registers itself."

Does anybody know how this works in practice? I know that some ISAs, like MIPS, contain an architectural register that is always set to zero in hardware; does this mean that the x86 microarchitecture has similar "zero" registers internally that registers are mapped to when convenient? Or is my mental model not quite correct on how this stuff works microarchitecturally?

The reason I'm asking is that (from some observation) it seems that a mov from one register containing zero to a destination, in a loop, is still substantially faster than zeroing the register via xor within the loop.

Basically, what is happening is that I'd like to zero a register within a loop depending on a condition; this can either be done by allocating an architectural register ahead of time to store zero (%xmm3, in this case), which is not modified for the entire duration of the loop, and executing the following within it:

movapd  %xmm3, %xmm0 

Or instead with the xor trick:

xorpd   %xmm0, %xmm0 

(Both in AT&T syntax.)
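For concreteness, the two alternatives might sit in a loop like the sketch below. This is a hypothetical illustration, not the actual loop from the question: the function name zero_where_flagged, its arguments, and the flag-based condition are all assumptions, and it is written for GNU as under the SysV x86-64 calling convention.

```asm
# Hypothetical example, not the asker's actual loop.
# Assumed C prototype: void zero_where_flagged(double *p, const int *flag, long n);
# SysV x86-64 ABI: %rdi = p, %rsi = flag, %rdx = n.
        .text
        .globl  zero_where_flagged
zero_where_flagged:
        xorpd   %xmm3, %xmm3            # variant A only: hoist 0.0 out of the loop
        testq   %rdx, %rdx
        jle     3f                      # nothing to do for n <= 0
1:
        movsd   (%rdi), %xmm0           # xmm0 = *p
        cmpl    $0, (%rsi)              # condition: is *flag nonzero?
        je      2f                      # flag == 0: keep the loaded value
        movapd  %xmm3, %xmm0            # variant A: copy the hoisted zero
        # xorpd %xmm0, %xmm0            # variant B: rematerialize zero, %xmm3 stays free
2:
        movsd   %xmm0, (%rdi)           # store the (possibly zeroed) value
        addq    $8, %rdi                # advance p by one double
        addq    $4, %rsi                # advance flag by one int
        decq    %rdx
        jnz     1b
3:
        ret
```

To try variant B, swap the movapd line for the commented-out xorpd line and drop the hoisted xorpd before the loop; %xmm3 is then no longer live across iterations.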

In other words, my choice is between hoisting a constant zero outside of the loop or rematerializing it within each iteration. The latter reduces the number of live architectural registers by one, and, with the supposed special-case awareness and handling of the xor idiom by the processor, it seems like it ought to be as fast as the former (especially since these machines have more physical registers than architectural registers anyway, so the processor should be able to do internally the equivalent of what I've done in the assembly by hoisting out the constant zero, or even better, with full awareness of and control over its own resources). But it doesn't seem to be, so I'm curious if anyone with CPU architecture knowledge can explain whether there's a good theoretical reason for that.

The registers in this case happen to be SSE registers and the machine happens to be Ivy Bridge; I'm not sure how important either of those factors is.

Executive summary: you can run up to four xor ax, ax instructions per cycle, as compared to the slower mov immediate, reg instructions.

Details and references:

Wikipedia has a nice overview of register renaming in general: http://en.wikipedia.org/wiki/register_renaming

Torbjörn Granlund's timings of instruction latencies and throughput for AMD and Intel x86 processors are at: http://gmplib.org/~tege/x86-timing.pdf

Agner Fog nicely covers the specifics in his micro-architecture study:

8.8 Register allocation and renaming

Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB) ... The µops from the decoders and the stack engine go to the RAT via a queue and then to the ROB-read and the reservation station. The RAT can handle 4 µops per clock cycle. The RAT can rename four registers per clock cycle, and it can even rename the same register four times in one clock cycle.

Special cases of independence

A common way of setting a register to zero is by XOR'ing it with itself or subtracting it from itself, e.g. XOR EAX,EAX. The Sandy Bridge processor recognizes that certain instructions are independent of the prior value of the register if the two operand registers are the same. This register is set to zero at the rename stage without using any execution unit. This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, VXORPS, VXORPD and all variants of PSUBxxx and PCMPGTxx, but not PANDN etc.

Instructions that need no execution unit

The abovementioned special cases where registers are set to zero by instructions such as XOR EAX,EAX are handled at the register rename/allocate stage without using any execution unit. This makes the use of these zeroing instructions extremely efficient, with a throughput of four zeroing instructions per clock cycle.
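To make the quoted list concrete, here is a small sketch of those idioms in the question's AT&T syntax, one example per mnemonic named above. The label zero_idioms is made up for illustration, and psubd and pcmpgtd are just one representative each of the PSUBxxx and PCMPGTxx families. The vxorps and vxorpd lines need an AVX-capable CPU.

```asm
# Illustrative only: each line has identical source and destination registers,
# the form the quoted text says is recognized at the rename stage.
        .text
        .globl  zero_idioms             # hypothetical label
zero_idioms:
        xorl    %eax, %eax              # XOR
        subl    %eax, %eax              # SUB
        pxor    %xmm0, %xmm0            # PXOR
        xorps   %xmm0, %xmm0            # XORPS
        xorpd   %xmm0, %xmm0            # XORPD
        vxorps  %xmm0, %xmm0, %xmm0     # VXORPS (AVX)
        vxorpd  %xmm0, %xmm0, %xmm0     # VXORPD (AVX)
        psubd   %xmm0, %xmm0            # one PSUBxxx variant
        pcmpgtd %xmm0, %xmm0            # one PCMPGTxx variant
        ret
```

Every instruction here leaves its destination zero regardless of the register's previous contents, which is why such instructions can be treated as independent of the prior value.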

