loops - Python input/output more efficiently

May 15, 2010

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file returns a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file is going to be read around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, the reads are not duplicates, because each time I extract a different segment of the same spectrum.

I have a catalog file which contains all the information I need, such as the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information that tells me which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).

The code I have is:

    import numpy as np
    import fitsio

    x = []
    y = []
    r = []
    s = []
    n1 = []
    n2 = []
    n3 = []
    with open('spectra_id.dat') as file_id, open('catalog.txt') as file_c:
        for line1, line2 in zip(file_id, file_c):
            parts1 = line1.split()
            parts2 = line2.split()
            n1.append(parts1[0])
            n2.append(parts1[1])
            n3.append(int(parts1[2]))  # used as a row index below, so it must be an int
            x.append(float(parts2[0]))
            y.append(float(parts2[1]))
            r.append(float(parts2[2]))
            s.append(float(parts2[3]))

    def data_analysis(idx_start, idx_end):  #### loop over the 10 million entries
        data_stru = np.zeros((idx_end - idx_start), dtype=[('spec', 'f4', (200,)), ('x', 'f8'), ('y', 'f8'), ('r', 'f8'), ('s', 'f8')])

        for i in range(idx_start, idx_end):
            filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
            fits_spectra = fitsio.FITS(filename)
            fluxx = fits_spectra[0][n3[i]-1:n3[i], 0:4000]  #### returns a list of lists (one row)
            flux = fluxx[0]
            hdu = fits_spectra[0].read_header()
            wave_start = hdu['CRVAL1']
            logwave = wave_start + 0.0001 * np.arange(4000)
            wavegrid = np.power(10, logwave)

            ##### After reading flux and wavegrid, I can do the following analysis.
            ##### Save the data to data_stru.
            ##### Reading is the time-consuming part of the code; the later analysis is not time consuming.

The problem is that the files are too big: there is not enough memory to load them all at once, and my catalog is not structured such that all entries which open the same file are grouped together. I wonder if anyone can offer some thoughts on how to split the large loop into two loops: 1) a first loop over the files, so that we can avoid opening and reading the same files again and again, and 2) a loop over the entries which are going to use the same file.

If I understand your code correctly, n1 and n2 determine which file to open. So why not just lexsort them? You can then use itertools.groupby to group the records with the same n1, n2. Here is a down-scaled proof of concept:

    import itertools

    import numpy as np

    n1 = np.random.randint(0, 3, (10,))
    n2 = np.random.randint(0, 3, (10,))
    mockdata = np.arange(10) + 100

    s = np.lexsort((n2, n1))

    for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
        # groupby groups the iterations of its first argument
        # (zip(...) in this case) by the result of applying the
        # optional second argument (here the lambda) to each item.
        # Here we use a lambda expression to remove si from the
        # tuple (si, n1si, n2si) that zip produces, because otherwise
        # equal (n1si, n2si) pairs would still be treated as different
        # because of the distinct si's. Hence no grouping would occur.
        # Putting si in there in the first place is necessary so that
        # we can reference the other records of the corresponding row
        # in the inner loop.
        print(k)
        for si, n1s, n2s in g:
            # si can be used to access the corresponding other records
            print(si, mockdata[si])

This prints something like:

    (0, 1)
    4 104
    (0, 2)
    0 100
    2 102
    6 106
    (1, 0)
    1 101
    (2, 0)
    8 108
    9 109
    (2, 1)
    3 103
    5 105
    7 107
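The sort is what makes the grouping work: itertools.groupby only merges consecutive items with equal keys, so with unsorted keys the same file would show up in several separate groups. A small sketch with made-up keys (the values are illustrative, not taken from the catalog):

```python
import itertools

import numpy as np

# Made-up (n1, n2) keys; the pair (0, 1) occurs at positions 0, 2 and 4.
n1 = np.array([0, 1, 0, 1, 0])
n2 = np.array([1, 0, 1, 0, 1])

# Without sorting, equal keys that are not adjacent land in separate groups.
unsorted_groups = [k for k, _ in itertools.groupby(zip(n1.tolist(), n2.tolist()))]

# The LAST key passed to lexsort is the primary sort key,
# so (n2, n1) orders by n1 first and by n2 second.
order = np.lexsort((n2, n1))
sorted_groups = [
    k for k, _ in itertools.groupby(zip(n1[order].tolist(), n2[order].tolist()))
]

print(len(unsorted_groups))  # 5: every change of key starts a new group
print(sorted_groups)         # [(0, 1), (1, 0)]
```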

You may also want to include n3 in the lexsort, but not in the grouping, so that you can process each file's content in order.
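Putting both suggestions together, the loop over the catalog could be restructured roughly as below. This is only a sketch: process_grouped, open_file and handle_row are hypothetical names introduced here, and the two callbacks stand in for the fitsio read and the later analysis, so the example runs without any .fits files.

```python
import itertools

import numpy as np


def process_grouped(n1, n2, n3, open_file, handle_row):
    """Visit every catalog row, opening each (n1, n2) file only once.

    open_file(n1, n2) and handle_row(opened, row_index, n3) are hypothetical
    callbacks standing in for the file read and the per-row analysis.
    """
    n1, n2, n3 = np.asarray(n1), np.asarray(n2), np.asarray(n3)
    # Sort by n1, then n2, then n3 (lexsort takes the primary key last).
    order = np.lexsort((n3, n2, n1))
    rows = zip(order.tolist(), n1[order].tolist(), n2[order].tolist())
    # Group on (n1, n2) only; n3 just fixes the order inside each group.
    for (f1, f2), group in itertools.groupby(rows, key=lambda t: t[1:]):
        opened = open_file(f1, f2)  # called once per distinct file
        for i, _, _ in group:
            handle_row(opened, i, n3[i])


# Toy run with stand-in callbacks that only record what happened.
opened_files, visited = [], []


def fake_open(a, b):
    opened_files.append((a, b))
    return (a, b)


def fake_handle(opened, i, spec):
    visited.append((opened, i, int(spec)))


process_grouped(
    n1=[7, 3, 7, 3, 7],
    n2=[1, 2, 1, 2, 1],
    n3=[30, 5, 10, 2, 20],
    open_file=fake_open,
    handle_row=fake_handle,
)
print(opened_files)  # [(3, 2), (7, 1)]
print(visited)
```

In the question's setting, open_file would presumably build the path from n1 and n2 and call fitsio.FITS, while handle_row would slice out the row given by n3 and store the result at the original catalog index i. Because i is the original row number, the results can still be written back in catalog order even though the files are visited in sorted order.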
