loops - Python input/output more efficiently


I need to process 10 million spectroscopic data sets. The data are structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and there are around 4500 elements in each spectrum (so each file returns a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file is going to be read around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, the reads are not duplicates, because each time I extract a different segment of the same spectrum.

I have a catalog file which contains all the information I need, such as the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information on which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).

The code I have is:

    import numpy as np
    import fitsio

    x = []
    y = []
    r = []
    s = []
    n1 = []
    n2 = []
    n3 = []
    with open('spectra_id.dat') as file_id, open('catalog.txt') as file_c:
        for line1, line2 in zip(file_id, file_c):
            parts1 = line1.split()
            parts2 = line2.split()
            n1.append(parts1[0])
            n2.append(parts1[1])
            n3.append(int(float(parts1[2])))  # spectrum number, used as a row index below
            x.append(float(parts2[0]))
            y.append(float(parts2[1]))
            r.append(float(parts2[2]))
            s.append(float(parts2[3]))

    def data_analysis(idx_start, idx_end):  #### loop over the 10 million entries
        data_stru = np.zeros((idx_end - idx_start), dtype=[('spec', 'f4', (200)), ('x', 'f8'), ('y', 'f8'), ('r', 'f8'), ('s', 'f8')])

        for i in range(idx_start, idx_end):
            filename = "../../../data/" + str(n1[i]) + "/spplate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
            fits_spectra = fitsio.FITS(filename)
            fluxx = fits_spectra[0][n3[i]-1:n3[i], 0:4000]  #### returns a list of lists
            flux = fluxx[0]
            hdu = fits_spectra[0].read_header()
            wave_start = hdu['CRVAL1']
            logwave = wave_start + 0.0001 * np.arange(4000)
            wavegrid = np.power(10, logwave)

            ##### After reading flux and wavegrid, I can do the following analysis.
            ##### Save the data to data_stru.
            ##### Reading is the time-consuming part of the code; the later analysis is not time consuming.

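The problem is that the files are too big, there is not enough memory to load them all at once, and my catalog is not structured such that all entries which open the same file are grouped together. I wonder if anyone can offer some thoughts on how to split the large loop into two loops: 1) first loop over the files, so that I can avoid repeatedly opening/reading the same files again and again, and 2) loop over the entries which are going to use the same file.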

If I understand your code correctly, n1 and n2 determine which file to open. So why not just lexsort them? You can then use itertools.groupby to group the records with the same n1, n2. Here is a down-scaled proof of concept:

    import itertools
    import numpy as np

    n1 = np.random.randint(0, 3, (10,))
    n2 = np.random.randint(0, 3, (10,))
    mockdata = np.arange(10) + 100

    s = np.lexsort((n2, n1))

    for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
        # groupby groups the iterations of its first argument
        # (zip(...) in this case) by the result of applying the
        # optional second argument (here a lambda) to each item.
        # Here we use a lambda expression to remove si from the
        # tuple (si, n1si, n2si) that zip produces, because otherwise
        # equal (n1si, n2si) pairs would still be treated as different
        # because of their distinct si's. Hence no grouping would occur.
        # Putting si in there in the first place is necessary, so that
        # we can reference the other records of the corresponding row
        # in the inner loop.
        print(k)
        for si, n1s, n2s in g:
            # si can be used to access the corresponding other records
            print(si, mockdata[si])

This prints something like:

    (0, 1)
    4 104
    (0, 2)
    0 100
    2 102
    6 106
    (1, 0)
    1 101
    (2, 0)
    8 108
    9 109
    (2, 1)
    3 103
    5 105
    7 107

You may want to include n3 in the lexsort, but not in the grouping, so that you can process each file's content in order.
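As a rough sketch of how this could look for the catalog in the question: n3 is added as the lowest-priority sort key, the grouping still uses only (n1, n2), and each file is loaded once per group. The names `iter_file_groups`, `process_catalog`, `load_file` and `fake_load` are hypothetical, and the identifiers in the toy catalog are made up. In real use, `load_file` might wrap something like a single read of the primary HDU with fitsio, but that part is an assumption and is replaced here by an in-memory stand-in so the example runs without any data files.

```python
import itertools
import numpy as np


def iter_file_groups(n1, n2, n3):
    """Yield ((n1, n2), row_indices), one group per distinct file.

    Within each group the row indices are ordered by n3.
    """
    n1, n2, n3 = np.asarray(n1), np.asarray(n2), np.asarray(n3)
    # lexsort treats the LAST key as the primary one: sort by n1,
    # then n2, and use n3 only to order rows inside a file.
    order = np.lexsort((n3, n2, n1))
    keyed = zip(order.tolist(), n1[order].tolist(), n2[order].tolist())
    # Group on (n1, n2) only; the row index is carried along, not compared.
    for key, group in itertools.groupby(keyed, key=lambda t: t[1:]):
        yield key, [i for i, _, _ in group]


def process_catalog(n1, n2, n3, load_file):
    """Call load_file(n1, n2) once per file, then pick the requested rows.

    load_file is a hypothetical callable returning a 2-D array with
    one spectrum per row.
    """
    out = {}
    for (a, b), rows in iter_file_groups(n1, n2, n3):
        flux_all = load_file(a, b)              # one load per file
        for i in rows:
            # n3 is 1-based, as in the question's slice n3-1:n3
            out[i] = flux_all[n3[i] - 1, :4000]
    return out


# Toy catalog with made-up integer identifiers (assumption).
n1 = [3586, 3587, 3586, 3586]
n2 = [55181, 55182, 55181, 55181]
n3 = [7, 2, 1, 4]

calls = []


def fake_load(a, b):
    # Stand-in for reading a FITS file: 10 spectra of 4000 pixels,
    # where row r is filled with the value r.
    calls.append((a, b))
    return np.tile(np.arange(10.0)[:, None], (1, 4000))


result = process_catalog(n1, n2, n3, fake_load)
print("catalog rows:", len(result), "files loaded:", len(calls))
```

Since a file in the question is roughly a 1000*4500 array, holding one file at a time in memory should be manageable even though all files together are not.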

