python - Speeding up data transformations


I have been given (by a third party, hence unalterable) input data in the following structure: the data is a list of 4-tuples, each 4-tuple representing one sortie. The first element of each sortie is a list of categories of length 1 to 5, chosen from a total of 20 possible categories (without repetitions); the second element is the number of participants involved; the third is a datetime object indicating the begin of the sortie; and the last and fourth element is a datetime object indicating the end of the sortie.

Now I have to transform this data into the following format: for each category I need to calculate (a) the number of sorties of that category, (b) the total time spent, (c) the average time spent per sortie, (d) the total "man hours", i.e. the sum of the duration of each sortie multiplied by the number of participants of that selfsame sortie, and (e) the average "man hours" per sortie.
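For illustration (the numbers here are made up by me, only the layout matters): a single sortie with categories [3, 7], 4 participants and a duration of 1.5 hours contributes 1 sortie, 1.5 hours and 6.0 man hours to each of categories 3 and 7, so the desired result is a mapping of the form:

{3: (1, 1.5, 1.5, 6.0, 6.0),
 7: (1, 1.5, 1.5, 6.0, 6.0)}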

My first naive attempt was the following:

from collections import defaultdict

def transform(data):
    t = defaultdict(lambda: (0, 0, 0))
    for row in data:
        delta = row[3] - row[2]
        hours = delta.days * 24 + delta.seconds / 3600
        manhours = row[1] * hours
        for cat in row[0]:
            t[cat] = (t[cat][0] + 1, t[cat][1] + hours, t[cat][2] + manhours)
    return {k: (v[0], v[1], v[1] / v[0], v[2], v[2] / v[0]) for k, v in t.items()}

And I profiled it with the following:

import datetime
import random

cats = [_ for _ in range(20)]
for test in range(1000):
    data = [(random.sample(cats, random.randint(1, 5)),
             random.randint(2, 40),
             datetime.datetime(2013, 1, 1, 8),
             datetime.datetime(2013, 1, 1, 9)) for _ in range(1000)]
    transform(data)

run under python -m cProfile.
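(The exact invocation was something along the lines of python3 -m cProfile trans.py; I don't have the precise command line at hand.)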

I have read many times here on Stack Overflow about the incredible advantages of itertools for performant iterating, grouping, counting, etc., to the point where users prefer it over a simple dict or list comprehension.

I would like to take advantage of that module, but am unsure how to get the best out of it. Hence:

a) In what way can the transformation function be time-optimized (sped up)?

b) In what way can itertools help me in this endeavour?
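To make (b) more concrete, this is roughly the kind of groupby-based rewrite I had in mind; transform_groupby is just a sketch of my own, and I have not verified that it is actually faster than the naive version:

from itertools import groupby
from operator import itemgetter

def transform_groupby(data):
    # flatten each sortie into one (category, hours, man_hours) triple per category
    flat = []
    for categories, participants, begin, end in data:
        hours = (end - begin).total_seconds() / 3600
        man_hours = participants * hours
        for cat in categories:
            flat.append((cat, hours, man_hours))
    # groupby only groups consecutive items, so sort by category first
    flat.sort(key=itemgetter(0))
    result = {}
    for cat, group in groupby(flat, key=itemgetter(0)):
        group = list(group)
        n = len(group)
        total_hours = sum(g[1] for g in group)
        total_man_hours = sum(g[2] for g in group)
        result[cat] = (n, total_hours, total_hours / n,
                       total_man_hours, total_man_hours / n)
    return result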

I thank you in advance for your answers.

--

For reference: on my box (AMD Phenom II quad core, 4 GB RAM, 4 GB swap) using Python 3.3.1 (default, Apr 17 2013, 22:30:32) [GCC 4.7.3] on Linux, the profiler spits out: 1000 2.027 0.002 2.042 0.002 trans.py:6(transform). Moving from Python 3 to PyPy is not an option.

Edit: Some sample data (using the ISO representation for the datetimes), or use the second code snippet above to create some (obviously not real-life) data:

[([6, 4, 15], 3, '2013-07-31T17:23:00', '2013-07-31T18:40:00'), ([9, 18, 5], 15, '2013-07-08T17:49:00', '2013-07-08T18:57:00'), ([7, 14, 17, 12, 0], 18, '2013-07-20T08:16:00', '2013-07-20T09:06:00'), ([6, 1], 32, '2013-07-31T07:14:00', '2013-07-31T09:01:00'), ([17, 7], 7, '2013-07-05T06:59:00', '2013-07-05T07:52:00')]

Edit 2013-08-02: Profiling pillmuncher's idea unfortunately resulted in the version using numpy being 360% slower than the version without it:

1000    1.828    0.002    1.842    0.002    prof.py:8(transform)    # original function
1000    0.159    0.000    8.457    0.008    prof.py:43(transform3)  # numpy function

You could use numpy:

from collections import defaultdict
from datetime import datetime

import numpy as np

def transform(data):
    pair_type = np.dtype([('team_size', int), ('duration', 'timedelta64[s]')])
    rec_array = np.core.records.array
    total = np.sum
    mean = np.mean
    one_hour = np.timedelta64(1, 'h')
    tmp = defaultdict(list)
    for categories, team_size, begin, end in data:
        for category in categories:
            tmp[category].append((team_size, end - begin))
    for category, pairs in tmp.items():
        pairs = rec_array(pairs, dtype=pair_type)
        hours = pairs.duration / one_hour
        man_hours = pairs.team_size * hours
        yield category, (
                len(pairs),
                total(hours),
                mean(hours),
                total(man_hours),
                mean(man_hours))

some_data = ...
result = dict(transform(some_data))

I don't know if it is faster. If you try it out, please report the result.

Also, my numpy fu isn't that great. If anyone knows how to improve it, please tell me.

