python - How can I speed up this bit of code (loop/lists/tuple optimization)?
I repeat the following idiom again and again: read a large file (sometimes up to 1.2 million records!) and store the output in an SQLite database. Putting stuff into the SQLite db seems to be fairly fast.
def readerfunction(recordsize, recordformat, connection, outputdirectory, outputfile, numberofobjects):

    insertstring = "insert into node_disp_info(node, analysis, timestep, " \
                   "h1_translation, h2_translation, v_translation, " \
                   "h1_rotation, h2_rotation, v_rotation) " \
                   "values (?, ?, ?, ?, ?, ?, ?, ?, ?)"

    analysisnumber = int(outputpath[-3:])

    outputfileobject = open(os.path.join(outputdirectory, outputfile), "rb")
    outputfileobject, numberofrecordsinfileobject = \
        determinenumberofrecordsinfileobjectgivenrecordsize(recordsize, outputfileobject)

    numberofrecordsperobject = numberofrecordsinfileobject // numberofobjects

    loop1starttime = time.time()
    for i in range(numberofrecordsperobject):
        processedrecords = []
        loop2starttime = time.time()

        for j in range(numberofobjects):
            fout = outputfileobject.read(recordsize)
            processedrecords.append(tuple([j+1, analysisnumber, i] +
                                          [x for x in list(struct.unpack(recordformat, fout))]))

        loop2endtime = time.time()
        print "time taken to finish loop2: {}".format(loop2endtime - loop2starttime)

        dbinsertstarttime = time.time()
        connection.executemany(insertstring, processedrecords)
        dbinsertendtime = time.time()

    loop1endtime = time.time()
    print "time taken to finish loop1: {}".format(loop1endtime - loop1starttime)

    outputfileobject.close()

    print "finished reading output file for analysis {}...".format(analysisnumber)
When I run this code, it seems that "loop 2" and "inserting into the database" are where most of the execution time is spent. The average "loop 2" time is 0.003 s, but it is run up to 50,000 times in some analyses. The time spent putting stuff into the database is about the same: 0.004 s. Currently, I insert into the database every time after loop 2 finishes so that I don't have to deal with running out of RAM.
What can I do to speed up "loop 2"?
This is an I/O issue:
for j in range(numberofobjects):
    fout = outputfileobject.read(recordsize)
You are spending most of your time reading teeny incremental bits of the file (i.e. one record at a time) and using struct to unpack individual records. That is slow. Instead, grab the whole chunk of the file you want at once, and let struct.unpack churn through it at C speed.
You need a little bit of math to figure out how many bytes to read, and to alter the recordformat format string so it tells struct how to unpack the whole thing. There is not quite enough info in your example for me to tell you more precisely how you should do that.
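Purely as an illustration (the real format, record size, and counts depend on details not in the question), the "one big read plus one repeated format string" idea could look something like this; the file name and the "6d" layout below are made up:

import struct

# Made-up stand-ins for the question's recordformat / recordsize / numberofobjects.
recordformat = "6d"                          # hypothetical layout: six doubles per record
recordsize = struct.calcsize(recordformat)   # 48 bytes per record
numberofobjects = 1000

# Dummy file so the sketch is self-contained; the real code would use the
# already-open outputfileobject instead.
with open("dummy.bin", "wb") as f:
    f.write(b"\x00" * recordsize * numberofobjects)

with open("dummy.bin", "rb") as outputfileobject:
    # One read() for the whole block instead of numberofobjects tiny reads...
    chunk = outputfileobject.read(recordsize * numberofobjects)

# ...and one repeated format string, so struct.unpack does all the work in a
# single C-level call instead of once per record.
chunkformat = recordformat * numberofobjects      # "6d6d6d..."
allvalues = struct.unpack(chunkformat, chunk)     # flat tuple of 6 * numberofobjects floats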
I do have to point out that this:
tuple([j+1, analysisnumber, i] + [x for x in list(struct.unpack(recordformat, fout))])
is far more sanely written as this:
(j+1, analysisnumber, i) + struct.unpack(recordformat, fout)
...but you will need to refactor this line anyway if you follow the above advice and remove the loop entirely. (You can use zip and enumerate to prepend your data onto each struct member after the whole thing is unpacked; see the sketch below.)
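One hedged way that regrouping could look, with dummy data standing in for the file contents (the field counts and values here are illustrative, not from the question):

import struct

# Dummy stand-ins so the sketch runs on its own: 3 "records" of 6 doubles each.
recordformat = "6d"
fieldsperrecord = 6
numberofobjects = 3
analysisnumber, i = 7, 0
chunk = struct.pack(recordformat * numberofobjects,
                    *range(fieldsperrecord * numberofobjects))

# Unpack every record in one call, regroup the flat tuple into per-record
# tuples, then prepend the key columns with enumerate.
allvalues = struct.unpack(recordformat * numberofobjects, chunk)
records = zip(*[iter(allvalues)] * fieldsperrecord)   # ((v0..v5), (v6..v11), ...)
processedrecords = [(j + 1, analysisnumber, i) + rec
                    for j, rec in enumerate(records)]
# processedrecords can then go straight into connection.executemany(insertstring, ...)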
Edit: here's an example. I packed 1M unsigned ints into a file. yours() is your approach, mine() is mine.
def yours():
    res = []
    with open('packed', 'rb') as f:
        while True:
            b = f.read(4)
            if not b:
                break
            res.append(struct.unpack('i', b))
    return res

def mine():
    with open('packed', 'rb') as f:
        return struct.unpack('1000000i', f.read())
Timings:
%timeit yours()
1 loops, best of 3: 388 ms per loop

%timeit mine()
100 loops, best of 3: 6.14 ms per loop
So, two orders of magnitude difference.
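The 'packed' test file itself isn't shown in the answer; it was presumably created with something along these lines (my guess, not part of the original answer):

import struct

# Pack one million ints into a file in a single call, matching the '1000000i'
# format that mine() uses to read them back.
with open('packed', 'wb') as f:
    f.write(struct.pack('1000000i', *range(1000000)))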