hadoop - Pig cross product reducer key


When I perform a cross product operation (followed by filtering), the reducer sizes are very imbalanced: some reducers write 0 output, while others take several hours to complete. A basic example is the following code:

crossproduct = cross tweets, clients;
result = filter crossproduct by text matches CONCAT('.*', CONCAT(keyword, '.*'));
store result into 'result' using PigStorage(' ');

In this case, what is the reducer key?

This is a difficult question to answer. CROSS is implemented in Pig as a join on synthetic keys. The best resource for understanding CROSS is Programming Pig, page 68.

In your example, the cross looks like this:

a = foreach tweets generate flatten(GFCross(0, 2)), flatten(*);
b = foreach clients generate flatten(GFCross(1, 2)), flatten(*);
c = cogroup a by ($0, $1), b by ($0, $1);
crossproduct = foreach c generate flatten(a), flatten(b);

As explained in the book, GFCross is an internal UDF. The first argument is the input number, and the second argument is the total number of inputs. In your example, the UDF generates records that have a schema of (int, int). The field whose position matches the first argument holds a random number between 0 and 3, chosen once per input record. The other field counts from 0 to 3. So, if we assume that the first record in input a has random number 3, and the first record in input b has random number 2, the following 4 tuples are generated by the UDF for each input.

a {(3,0), (3,1), (3,2), (3,3)}
b {(0,2), (1,2), (2,2), (3,2)}

When the join is performed, the (3,2) tuple in a is joined to the (3,2) tuple in b. For every pair of records, one from each input, it is guaranteed that there is one and only one instance of the artificial keys that will match and produce a record.
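The matching argument above can be checked with a small simulation. This is a minimal sketch in plain Python, not Pig's actual implementation: the names `synthetic_keys` and `cross_via_join` are hypothetical, and the key range of 0 to 3 is taken from the example rather than from Pig's source.

```python
import random

DIM = 4  # assumed: each key field ranges over 0..3, as in the example above


def synthetic_keys(input_index, rng):
    """Return the DIM synthetic keys for one record of a two-input cross.

    The field at position input_index holds one random value for the record;
    the other field counts from 0 to DIM - 1.
    """
    fixed = rng.randrange(DIM)
    if input_index == 0:
        return [(fixed, i) for i in range(DIM)]
    return [(i, fixed) for i in range(DIM)]


def cross_via_join(left, right, seed=0):
    """Emulate the cogroup on synthetic keys followed by flattening."""
    rng = random.Random(seed)
    groups = {}  # synthetic key -> ([left records], [right records])
    for rec in left:
        for key in synthetic_keys(0, rng):
            groups.setdefault(key, ([], []))[0].append(rec)
    for rec in right:
        for key in synthetic_keys(1, rng):
            groups.setdefault(key, ([], []))[1].append(rec)
    # Within each key group, pair every left record with every right record.
    return [(l, r) for ls, rs in groups.values() for l in ls for r in rs]


tweets = ["t1", "t2", "t3"]
clients = ["c1", "c2"]
pairs = cross_via_join(tweets, clients)
```

A record from the first input with random value x only carries keys of the form (x, i), and a record from the second input with random value y only carries keys of the form (i, y), so the two share exactly the key (x, y). That is why `pairs` contains every (tweet, client) combination once and only once, whatever the seed.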

So, to answer your question of what the reduce key is: the reduce key is the synthetic key generated by GFCross. Since the random numbers are chosen independently for each record, the resulting joins should be spread over an even distribution of reducers.
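To get a rough feel for how evenly the joined pairs spread over the synthetic keys, the following sketch counts pairs per key. It is a plain Python simulation under stated assumptions: the function name `pairs_per_key`, the input sizes, and the 0 to 3 key range are illustrative, and it does not model how Hadoop assigns keys to reducers or how many rows survive the later filter.

```python
import random
from collections import Counter

DIM = 4  # assumed key range 0..3, taken from the example above


def pairs_per_key(n_left, n_right, seed=0):
    """Count how many joined pairs land on each synthetic key (x, y)."""
    rng = random.Random(seed)
    # Each record draws one random value; only that value decides its match key.
    left_counts = Counter(rng.randrange(DIM) for _ in range(n_left))
    right_counts = Counter(rng.randrange(DIM) for _ in range(n_right))
    return {
        (x, y): left_counts[x] * right_counts[y]
        for x in range(DIM)
        for y in range(DIM)
    }


load = pairs_per_key(10_000, 1_000)
total = sum(load.values())

for key in sorted(load):
    print(key, load[key], f"{load[key] / total:.1%}")
```

With uniform random draws each of the 16 keys receives roughly 1/16 of the joined pairs, which is consistent with the even distribution described above. The per-key counts only describe the join input, so reducers can still differ in how much output they write once the filter is applied.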

