c++ - Fast way to "down-scale" a three-dimensional tensor index
This is a bit twiddling question for C or C++. I am running GCC 4.6.3 under Ubuntu 12.04.2.
I have a memory access index p for a three-dimensional tensor which has the form:
    p = (i<<(2*n)) + (j<<n) + k
Here 0 <= i,j,k < (1<<n), and n is some positive integer.
Now I want to compute a "down-scaled" memory access index for i>>s, j>>s, k>>s with 0 < s < n, where the result would be:
    q = ((i>>s)<<(2*(n-s))) + ((j>>s)<<(n-s)) + (k>>s)
What is the fastest way to compute q from p (without knowing i, j, k beforehand)? You can assume that 0 < n <= 10 (i.e. p is a 32 bit integer). I am especially interested in a fast approach for n=8 (i.e. i, j, k are 8 bit integers). n and s are both compile time constants.
An example for n=8, s=4:
    unsigned int p = 240407; // (3<<16) + (171<<8) + 23
    unsigned int q = 161;    // (0<<8) + (10<<4) + 1
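As a baseline for checking any faster formula, the definition of q can be written out directly by unpacking i, j, k from p. This is only a minimal sketch: the name `downscale_reference` and the template parameters `N` and `S` (standing in for n and s) are illustrative and not part of the original post, and `std::uint32_t` is assumed to be available.

```cpp
#include <cstdint>

// Reference version: unpack i, j, k from p, shift each right by S, and
// repack with the narrower field width N - S. Slow but easy to check.
template <unsigned N, unsigned S>
constexpr std::uint32_t downscale_reference(std::uint32_t p) {
    static_assert(S > 0 && S < N && N <= 10, "expects 0 < S < N <= 10");
    return ((((p >> (2 * N)) & ((1u << N) - 1u)) >> S) << (2 * (N - S)))  // i
         + ((((p >> N) & ((1u << N) - 1u)) >> S) << (N - S))              // j
         + ((p & ((1u << N) - 1u)) >> S);                                 // k
}

// The example from the question: (i, j, k) = (3, 171, 23) with n = 8, s = 4.
static_assert(downscale_reference<8, 4>(240407u) == 161u, "240407 -> 161");
```

Because it follows the definition term by term, this version is mainly useful as something to compare the optimized variants against.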
The straightforward way takes 8 operations (the other operations are on constants):
    m = (1<<(n-s)) - 1;                     // Mask with the n-s lowest bits set.
    q = (  ((p & (m<<(2*n+s))) >> (3*s))    // Mask 'i', shift to new position.
         + ((p & (m<<(n+s))) >> (2*s))      // Likewise for 'j'.
         + ((p & (m<<s)) >> s));            // Likewise for 'k'.
It looks complicated, but really isn't; it is just not easy (to me at least) to get all the constants correct.
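Since n and s are compile time constants, the straightforward formula can be wrapped in a template so the masks and shift counts fold to constants. The following is a sketch under that assumption; `kept_bits` and `downscale_mask_shift` are hypothetical names chosen for illustration.

```cpp
#include <cstdint>

// Mask with the lowest `bits` bits set (the role of m in the formula).
constexpr std::uint32_t kept_bits(unsigned bits) {
    return (std::uint32_t(1) << bits) - 1u;
}

// Straightforward variant: 3 ANDs, 3 shifts, 2 additions on p.
template <unsigned N, unsigned S>
constexpr std::uint32_t downscale_mask_shift(std::uint32_t p) {
    static_assert(S > 0 && S < N && N <= 10, "expects 0 < S < N <= 10");
    return ((p & (kept_bits(N - S) << (2 * N + S))) >> (3 * S))  // 'i'
         + ((p & (kept_bits(N - S) << (N + S))) >> (2 * S))      // 'j'
         + ((p & (kept_bits(N - S) << S)) >> S);                 // 'k'
}

// Same example as in the question: n = 8, s = 4.
static_assert(downscale_mask_shift<8, 4>(240407u) == 161u, "240407 -> 161");
```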
To create a formula with fewer operations, observe that shifting a number u bits to the left is the same as multiplying it by 1<<u. Thus, due to the distributivity of multiplication, multiplying by ((1<<u1) + (1<<u2) + ...) is the same as shifting left by u1, u2, ..., and adding everything together.
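The identity can be checked in isolation. The sketch below compares a sum of two shifts with a single multiplication by the matching constant, using 32 bit unsigned values so that anything shifted past bit 31 is dropped in both forms; the function names are made up for this illustration.

```cpp
#include <cstdint>

// Two explicit shifts, added together.
constexpr std::uint32_t sum_of_shifts(std::uint32_t x, unsigned u1, unsigned u2) {
    return (x << u1) + (x << u2);
}

// One multiplication by the constant (1<<u1) + (1<<u2).
constexpr std::uint32_t shift_by_multiply(std::uint32_t x, unsigned u1, unsigned u2) {
    return x * ((std::uint32_t(1) << u1) + (std::uint32_t(1) << u2));
}

// 0x17 shifted by 4 and by 12, summed: 0x170 + 0x17000 = 0x17170.
static_assert(sum_of_shifts(0x17u, 4, 12) == 0x17170u, "explicit shifts");
static_assert(shift_by_multiply(0x17u, 4, 12) == 0x17170u, "one multiply");
```

The two forms also agree when high bits are lost, because both are computed modulo 2^32.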
So, we could try to mask the needed portions of i, j, and k, "shift" them all to the correct positions relative to each other with one multiplication, and then shift the result right to the final destination. This gives 3 operations to compute q from p.
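Unfortunately, there are limitations, especially for the case where we try to do all three at once. When we add numbers together (indirectly, by adding together several multipliers), we have to make sure that any given bit can be set in only one of the numbers, or else we will get the wrong result. If we try to add (indirectly) three properly shifted numbers at once, we have this: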
    iiiii...........jjjjj...........kkkkk.......
     n-s      s      n-s      s      n-s
    .....jjjjj...........kkkkk................
     n-s  n-s      s      n-s
    ..........kkkkk...............
     n-s  n-s  n-s
Note that farther to the left in the second and third numbers there are bits of i and j, but we ignore them. For this, we assume that multiplication works as on x86: multiplying two values of type T gives a number of type T, with only the lowest bits of the actual result (equal to the result if there is no overflow).
So, to make sure that the k bits from the third number do not overlap with the j bits from the first, we need 3*(n-s) <= n, i.e. s >= 2*n/3, which for n = 8 limits us to s >= 6 (just one or two bits per component after shifting; I don't know if you would ever use such low precision).
However, if s >= 2*n/3, you can use just 3 operations:
    // Constant multiplier to perform three shifts at once.
    f = (1<<(32-3*n)) + (1<<(32-3*n+s)) + (1<<(32-3*n+2*s));
    // Mask, shift/combine with the multiplier, right shift to destination.
    q = (((p & ((m<<(2*n+s)) + (m<<(n+s)) + (m<<s))) * f) >> (32-3*(n-s)));
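The three-operation formula can be sketched as a template that rejects parameter choices violating 3*(n-s) <= n at compile time. The names `top_field_mask` and `downscale_one_multiply` are hypothetical, and the sketch assumes `std::uint32_t` is an ordinary 32 bit unsigned type that is not promoted to a wider signed type.

```cpp
#include <cstdint>

// Mask with the lowest `bits` bits set (the role of m in the formula).
constexpr std::uint32_t top_field_mask(unsigned bits) {
    return (std::uint32_t(1) << bits) - 1u;
}

// One AND, one multiply, one shift. Only valid when 3*(N-S) <= N.
template <unsigned N, unsigned S>
constexpr std::uint32_t downscale_one_multiply(std::uint32_t p) {
    static_assert(S > 0 && S < N && N <= 10, "expects 0 < S < N <= 10");
    static_assert(3 * (N - S) <= N, "this variant needs s >= 2*n/3");
    return ((p & ((top_field_mask(N - S) << (2 * N + S))
                + (top_field_mask(N - S) << (N + S))
                + (top_field_mask(N - S) << S)))
            // f: performs the three relative shifts in one multiplication.
            * ((std::uint32_t(1) << (32 - 3 * N))
             + (std::uint32_t(1) << (32 - 3 * N + S))
             + (std::uint32_t(1) << (32 - 3 * N + 2 * S))))
           >> (32 - 3 * (N - S));
}

// (i, j, k) = (200, 171, 23) with n = 8, s = 6 gives (3, 2, 0), i.e. q = 56.
static_assert(downscale_one_multiply<8, 6>((200u << 16) + (171u << 8) + 23u) == 56u,
              "n = 8, s = 6 example");
```

Instantiating it with, say, `<8, 4>` fails the second `static_assert`, which mirrors the s >= 6 restriction for n = 8.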
If the constraint on s is too strict (which it probably is), you can combine the first and second formulas: compute i and k with the second approach, then add j from the first formula. Here we need that the bits don't overlap in the following numbers:
    iiiii...............kkkkk.......
     n-s   s   n-s   s   n-s
    ..........kkkkk...............
     n-s  n-s  n-s
i.e. 3*(n-s) <= 2*n, which gives s >= n / 3, or, for n = 8, the much less strict s >= 3. The formula is as follows:
    // Constant multiplier to perform two shifts at once.
    f = (1<<(32-3*n)) + (1<<(32-3*n+2*s));
    // Mask, shift/combine with the multiplier, right shift to destination,
    // and then add 'j' from the straightforward formula.
    q = ((((p & ((m<<(2*n+s)) + (m<<s))) * f) >> (32-3*(n-s)))
         + ((p & (m<<(n+s))) >> (2*s)));
This formula works for your example where s = 4.
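The combined formula can be written as a template in the same way, with the weaker constraint 3*(n-s) <= 2*n enforced at compile time. This is a sketch only: `mid_field_mask` and `downscale_combined` are illustrative names, and `std::uint32_t` is again assumed to behave as a plain 32 bit unsigned type.

```cpp
#include <cstdint>

// Mask with the lowest `bits` bits set (the role of m in the formula).
constexpr std::uint32_t mid_field_mask(unsigned bits) {
    return (std::uint32_t(1) << bits) - 1u;
}

// 'i' and 'k' via one multiplication, 'j' via mask and shift.
// Only valid when 3*(N-S) <= 2*N.
template <unsigned N, unsigned S>
constexpr std::uint32_t downscale_combined(std::uint32_t p) {
    static_assert(S > 0 && S < N && N <= 10, "expects 0 < S < N <= 10");
    static_assert(3 * (N - S) <= 2 * N, "this variant needs s >= n/3");
    return (((p & ((mid_field_mask(N - S) << (2 * N + S))
                 + (mid_field_mask(N - S) << S)))
             // f: moves 'i' and 'k' next to the slot reserved for 'j'.
             * ((std::uint32_t(1) << (32 - 3 * N))
              + (std::uint32_t(1) << (32 - 3 * N + 2 * S))))
            >> (32 - 3 * (N - S)))
         + ((p & (mid_field_mask(N - S) << (N + S))) >> (2 * S));
}

// The example from the question: n = 8, s = 4.
static_assert(downscale_combined<8, 4>(240407u) == 161u, "240407 -> 161");
```

Counted on p, this is 2 ANDs, 1 multiplication, 2 shifts and 1 addition, so whether it beats the 8-operation version depends on how expensive the multiplication is on the target.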
Whether this is any faster than the straightforward approach depends on the architecture. Also, I have no idea if C++ guarantees the assumed multiplication overflow behavior. Finally, you need to make sure that the values are unsigned and exactly 32 bits wide for the formulas to work.
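On the overflow question: unsigned arithmetic in C and C++ is defined to wrap modulo 2^N, so with a 32 bit unsigned type the truncating multiplication should be well-defined, at least on platforms where `int` is no wider than 32 bits (otherwise integer promotion could change the picture). One way to gain confidence in a given (n, s) pair is an exhaustive comparison against the definition. The sketch below does that for the combined formula; `count_mismatches` is a hypothetical helper, not something from the original post.

```cpp
#include <cstdint>

// Compares the combined formula with the definition of q for every
// valid p (all 3*N-bit values) and returns how many inputs disagree.
template <unsigned N, unsigned S>
std::uint32_t count_mismatches() {
    static_assert(S > 0 && S < N && N <= 10, "expects 0 < S < N <= 10");
    static_assert(3 * (N - S) <= 2 * N, "combined formula needs s >= n/3");
    const std::uint32_t field = (std::uint32_t(1) << N) - 1u;    // one component
    const std::uint32_t m = (std::uint32_t(1) << (N - S)) - 1u;  // kept bits
    const std::uint32_t f = (std::uint32_t(1) << (32 - 3 * N))
                          + (std::uint32_t(1) << (32 - 3 * N + 2 * S));
    const std::uint32_t limit = std::uint32_t(1) << (3 * N);
    std::uint32_t mismatches = 0;
    for (std::uint32_t p = 0; p < limit; ++p) {
        // Definition: unpack, shift each component, repack.
        const std::uint32_t i = (p >> (2 * N)) & field;
        const std::uint32_t j = (p >> N) & field;
        const std::uint32_t k = p & field;
        const std::uint32_t expected =
            ((i >> S) << (2 * (N - S))) + ((j >> S) << (N - S)) + (k >> S);
        // Combined formula from the answer.
        const std::uint32_t actual =
            (((p & ((m << (2 * N + S)) + (m << S))) * f) >> (32 - 3 * (N - S)))
            + ((p & (m << (N + S))) >> (2 * S));
        if (expected != actual) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

For n = 8 the loop covers 2^24 inputs, which is small enough to run as a quick sanity test, e.g. checking that `count_mismatches<8, 4>()` returns zero.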