c++ - Fast way to "down-scale" a three-dimensional tensor index -
this bit twiddling question c or c++. running gcc 4.6.3 under ubuntu 12.04.2.
i have memory access index p three-dimensional tensor has form:
p = (i<<(2*n)) + (j<<n) + k here 0 <= i,j,k < (1<<n) , n positive integer.
now want compute "down-scaled" memory access index i>>s, j>>s, k>>s 0 < s < n, be:
q = ((i>>s)<<(2*(n-s))) + ((j>>s)<<(n-s)) + (k>>s) what fastest way compute q p (without knowing i,j,k beforehand)? can assume 0 < n <= 10 (i.e. p 32 bit integer). interested in fast approach n=8 (i.e. i,j,k 8 bit integers). n , s both compile time constants.
an example n=8 , s=4:
unsigned int p = 240407; // (3<<16) + (171<<8) + 23; unsigned int q = 161; // (0<<8) + (10<<4) + 1
straightforward way, 8 operations (others operations on constants):
m = (1<<(n-s)) - 1; // mask s lowest bits. q = ( ((p & (m<<(2*n+s))) >> (3*s)) // mask 'i', shift new position. + ((p & (m<<( n+s))) >> (2*s)) // likewise 'j'. + ((p & (m<< s)) >> s)); // likewise 'k'. looks complicated, isn't, not easy (to me @ least) constants correct.
to create formula less operations, observe shifting numbers u bits left same multiplying 1<<u. thus, due multiplication distributivity, multiplying ((1<<u1) + (1<<u2) + ...) same shifting left u1, u2, ... , adding together.
so, try mask needed portions of i, j , k, "shift" them correct positions relative each other 1 multiplication , shift result right, final destination. gives 3 operations compute q p.
unfortunately, there limitations, case try 3 @ once. when add numbers (indirectly, adding several multipliers), have make sure bits can set in 1 number, else we'll wrong result. if try add (indirectly) 3 shifted numbers @ once, have this:
iiiii...........jjjjj...........kkkkk....... n-s s n-s s n-s .....jjjjj...........kkkkk................ n-s n-s s n-s ..........kkkkk............... n-s n-s n-s note farther left in second , third numbers bits of i , j, ignore them. this, assume multiplication works on x86: multiplying 2 types t gives number of type t, lowest bits of actual result (equal result if there no overflow).
so, make sure k bits third number not overlap j bits first, need 3*(n-s) <= n, i.e. s >= 2*n/3 n = 8 limits s >= 6 (just 1 or 2 bits per component after shifting; don't know if ever use low precision).
however, if s >= 2*n/3, can use 3 operations:
// constant multiplier perform 3 shifts @ once. f = (1<<(32-3*n)) + (1<<(32-3*n+s)) + (1<<(32-3*n+2*s)); // mask, shift/combine multipler, right shift destination. q = (((p & ((m<<(2*n+s)) + (m<<(n+s)) + (m<<s))) * f) >> (32-3*(n-s))); if constraint s strict (which is), can combine first , second formula: compute i , k second approach, add j first formula. here need bits don't overlap in following numbers:
iiiii...............kkkkk....... n-s s n-s s n-s ..........kkkkk............... n-s n-s n-s i.e. 3*(n-s) <= 2*n, gives s >= n / 3, or, n = 8 less strict s >= 3. formula follows:
// constant multiplier perform 2 shifts @ once. f = (1<<(32-3*n)) + (1<<(32-3*n+2*s)); // mask, shift/combine multipler, right shift destination // , add 'j' straightforward formula. q = ((((p & ((m<<(2*n+s)) + (m<<s))) * f) >> (32-3*(n-s))) + ((p & (m<<(n+s))) >> (2*s))); this formula works example s = 4.
whether faster straightforward approach depends on architecture. also, have no idea if c++ guarantees assumed multiplication overflow behavior. finally, need make sure values unsigned , exactly 32 bit formulas work.
Comments
Post a Comment