python - Speeding up data transformations
I have been given (by a third party, hence unalterable) input data in the following structure: the data is a list of 4-tuples, each 4-tuple representing a sortie. The first element of each sortie is a list of 1 to 5 categories, chosen from a total of 20 possible categories (without repetitions); the second element is the number of participants involved; the third is a datetime object indicating the begin of the sortie; and the last, the fourth element, is a datetime object indicating the end of the sortie.
Now I have to transform the data into the following format: for each category I need to calculate (a) the number of sorties of that category, (b) the total time spent, (c) the average time spent per sortie, (d) the total "man hours", i.e. the sum of the duration of each sortie multiplied by the number of participants of that selfsame sortie, and (e) the average "man hours" per sortie.
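To make the five requirements concrete, here is a worked example for a single hypothetical sortie (the numbers are illustrative, not from my real data):

```python
from datetime import datetime

# A hypothetical single sortie: categories [6, 1], 32 participants,
# running from 07:14 to 09:01 (i.e. 107 minutes).
sortie = ([6, 1], 32, datetime(2013, 7, 31, 7, 14), datetime(2013, 7, 31, 9, 1))

duration = sortie[3] - sortie[2]
hours = duration.total_seconds() / 3600   # (b) time spent: 107 min ~ 1.783 h
man_hours = sortie[1] * hours             # (d) "man hours": 32 * 1.783 ~ 57.07

# With only this one sortie, each of its categories (6 and 1) would get:
# (a) 1 sortie, (b) hours, (c) hours / 1, (d) man_hours, (e) man_hours / 1
```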
My first naive attempt was the following:

```python
from collections import defaultdict

def transform(data):
    t = defaultdict(lambda: (0, 0, 0))
    for row in data:
        delta = row[3] - row[2]
        hours = delta.days * 24 + delta.seconds / 3600
        manhours = row[1] * hours
        for cat in row[0]:
            t[cat] = (t[cat][0] + 1, t[cat][1] + hours, t[cat][2] + manhours)
    return {k: (v[0], v[1], v[1] / v[0], v[2], v[2] / v[0]) for k, v in t.items()}
```
and I am profiling it with the following:

```python
import datetime
import random

cats = [_ for _ in range(20)]
for test in range(1000):
    data = [(random.sample(cats, random.randint(1, 5)),
             random.randint(2, 40),
             datetime.datetime(2013, 1, 1, 8),
             datetime.datetime(2013, 1, 1, 9))
            for _ in range(1000)]
    transform(data)
```

using `-m cProfile`.
I have read many times on StackOverflow about the incredible advantages of itertools for performant iterating, grouping, counting, etc., to the point where users prefer it over a simple dict- or list-comprehension.
I would like to take advantage of that module, but am unsure how to get the best out of it. Hence:
a) In what way can the transformation function be time-optimized (sped up)?

b) In what way can itertools help me in this endeavour?
I thank you in advance for your answers.
--
For reference: on my box (AMD Phenom II quad-core, 4 GB RAM, 4 GB swap), using Python 3.3.1 (default, Apr 17 2013, 22:30:32) [GCC 4.7.3] on linux,
the profiler spits out:

```
1000    2.027    0.002    2.042    0.002 trans.py:6(transform)
```

Moving from Python 3 to PyPy is not an option.
Edit: sample data (using ISO representation); or use the second code snippet above to create (obviously not real-life) data:

```python
[([6, 4, 15], 3, '2013-07-31T17:23:00', '2013-07-31T18:40:00'),
 ([9, 18, 5], 15, '2013-07-08T17:49:00', '2013-07-08T18:57:00'),
 ([7, 14, 17, 12, 0], 18, '2013-07-20T08:16:00', '2013-07-20T09:06:00'),
 ([6, 1], 32, '2013-07-31T07:14:00', '2013-07-31T09:01:00'),
 ([17, 7], 7, '2013-07-05T06:59:00', '2013-07-05T07:52:00')]
```
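If you want to feed these ISO-string rows into the function, they first need to be turned back into datetime objects. A minimal sketch of such a conversion (the helper name `parse_rows` is my own; `datetime.fromisoformat` requires Python 3.7+, so `strptime` is used instead):

```python
from datetime import datetime

def parse_rows(rows):
    # Convert the begin/end ISO strings of each row into datetime objects,
    # leaving the category list and participant count untouched.
    fmt = '%Y-%m-%dT%H:%M:%S'
    return [(cats, n, datetime.strptime(begin, fmt), datetime.strptime(end, fmt))
            for cats, n, begin, end in rows]

sample = [([6, 4, 15], 3, '2013-07-31T17:23:00', '2013-07-31T18:40:00')]
rows = parse_rows(sample)
```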
Edit 2013-08-02: profiling pillmuncher's idea unfortunately resulted in the version using numpy being 360% slower than the one without it:

```
1000    1.828    0.002    1.842    0.002 prof.py:8(transform)     # original function
1000    0.159    0.000    8.457    0.008 prof.py:43(transform3)   # numpy function
```
You could use numpy:
```python
from collections import defaultdict
from datetime import datetime

import numpy as np

def transform(data):
    pair_type = np.dtype([('team_size', int), ('duration', 'timedelta64[s]')])
    rec_array = np.core.records.array
    total = np.sum
    mean = np.mean
    one_hour = np.timedelta64(1, 'h')
    tmp = defaultdict(list)
    for categories, team_size, begin, end in data:
        for category in categories:
            tmp[category].append((team_size, end - begin))
    for category, pairs in tmp.items():
        pairs = rec_array(pairs, dtype=pair_type)
        hours = pairs.duration / one_hour
        man_hours = pairs.team_size * hours
        yield category, (
            len(pairs),
            total(hours),
            mean(hours),
            total(man_hours),
            mean(man_hours))

some_data = ...
result = dict(transform(some_data))
```
I don't know if it is faster. If you try it out, please report the result.
Also, my numpy fu isn't that great. If anybody knows how to improve it, please tell me.