python - Which sparse matrix representation to use with sklearn.svm.LinearSVC


I have a large data set (10 000 rows), where each row (sample) is represented as a list of bits (~200 000 bits). Each bit represents the absence or presence of a feature in the sample. So it's a large (10 000 x 200 000), high-dimensional, sparse data set.

In order to save memory, for each sample I'm saving only the indices of the non-zero bits. Example for a vector with 7 features:

[0, 0, 1, 0, 0, 1, 1] ===> [2, 5, 6]

I'm doing this for the whole data set. Let the result be x (10 000 variable-size vectors). Example for an initial 3x4 data set:

                 [[0,0,1,0],        [[2],
initial_data =    [0,1,1,0],  ===>   [1,2],   = x
                  [0,1,0,1]]         [1,3]]

Each row is labeled with one of 2 labels: malignant or benign. A linear support vector classification model (the one in sklearn.svm.LinearSVC) is to be trained on the data represented by x. I know that the aforementioned model accepts sparse input, and that there are 7 possible representations in scipy:

  • csc_matrix: compressed sparse column format
  • csr_matrix: compressed sparse row format
  • bsr_matrix: block sparse row format
  • lil_matrix: list of lists format
  • dok_matrix: dictionary of keys format
  • coo_matrix: coordinate format (aka ijv, triplet format)
  • dia_matrix: diagonal format

Which representation is more efficient for training the model? And how can I efficiently convert x to that representation?

CSR is the way to go, as supported by sklearn's sources. Excerpt:

class LinearSVC(BaseEstimator, LinearClassifierMixin,
                _LearntSelectorMixin, SparseCoefMixin):
    ...
    ...
    X, y = check_X_y(X, y, accept_sparse='csr',
                     dtype=np.float64, order="C")

CSR, like many other formats, is not recommended for building a sparse matrix directly (adding entries or changing the sparsity structure is costly).

Use dok_matrix or lil_matrix to build the sparse matrix from your data (this should be simple), and then convert it (which is done in linear time).

x = x.tocsr() 
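A minimal sketch of that build-then-convert step, using the 3x4 example from the question. The names index_lists and n_features are just illustrative, and float64 is chosen only because the excerpt above requests that dtype:

```python
import numpy as np
from scipy.sparse import lil_matrix

# Index lists from the question's 3x4 example (illustrative names).
index_lists = [[2], [1, 2], [1, 3]]
n_features = 4

# LIL is cheap to fill incrementally, so build in that format first.
x = lil_matrix((len(index_lists), n_features), dtype=np.float64)
for row, cols in enumerate(index_lists):
    x[row, cols] = 1.0  # mark every present feature of this sample

# One conversion to CSR once the structure is final.
x = x.tocsr()
```

For the real data set, n_features would be about 200 000 and index_lists would hold the 10 000 per-sample lists.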

Also keep in mind that the data you pass is converted internally for liblinear, the external library used by sklearn, which has its own data structures. So if you pass the wrong format, it's only a one-time conversion cost that should occur. The pure training procedure does not care!
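Since the question already stores per-row column indices, another option (not part of the advice above, just a sketch) is to skip the intermediate format and hand scipy the CSR arrays directly through the csr_matrix((data, indices, indptr), shape=...) constructor. The labels below are made up for illustration, and LinearSVC is used with its default settings:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# Index lists from the question's 3x4 example; labels are hypothetical.
index_lists = [[2], [1, 2], [1, 3]]
labels = ["malignant", "benign", "benign"]
n_features = 4

# Column indices of all non-zero entries, row after row.
indices = np.concatenate([np.asarray(c, dtype=np.int32) for c in index_lists])
# indptr[i]:indptr[i + 1] delimits the entries of row i inside `indices`.
indptr = np.cumsum([0] + [len(c) for c in index_lists])
# Every stored entry is a set bit, so the values are all 1.
data = np.ones(len(indices), dtype=np.float64)

x = csr_matrix((data, indices, indptr), shape=(len(index_lists), n_features))

clf = LinearSVC()
clf.fit(x, labels)
```

This relies on each sample's index list being a valid list of column positions. Whether it is noticeably faster than going through lil_matrix on the full data set would need to be measured.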

