python - Which sparse matrix representation to use with sklearn.svm.LinearSVC
I have a large data set (10 000 rows). Each row (sample) is represented as a list of bits (~200 000 bits), and each bit represents the absence or presence of a feature in the sample. So it is a large (10 000 x 200 000), high-dimensional, sparse data set.
To save memory, for each sample I store only the indices of the nonzero bits. Example for a vector with 7 features:
[0, 0, 1, 0, 0, 1, 1] ===> [2, 5, 6]
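A minimal sketch of this index encoding in plain Python, using the 7-feature vector above. The helper names `bits_to_indices` and `indices_to_bits` are hypothetical and only serve the illustration:

```python
def bits_to_indices(bits):
    """Return the positions of the nonzero bits, in increasing order."""
    return [i for i, b in enumerate(bits) if b]


def indices_to_bits(indices, n_features):
    """Rebuild the dense 0/1 vector from its list of nonzero positions."""
    bits = [0] * n_features
    for i in indices:
        bits[i] = 1
    return bits


sample = [0, 0, 1, 0, 0, 1, 1]
encoded = bits_to_indices(sample)
print(encoded)

# The encoding is lossless as long as the number of features is known.
restored = indices_to_bits(encoded, n_features=len(sample))
```

Only the number of features has to be stored separately, because the index list alone does not say how many trailing zeros the vector had.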
I do this for the whole data set. Let the result be x (10 000 variable-size vectors). Example for an initial data set of size 3x4:
initial_data = [[0,0,1,0],         [[2],
                [0,1,1,0],  ===>    [1,2],   = x
                [0,1,0,1]]          [1,3]]
Each row is labeled with one of 2 labels: malignant or benign. A linear support vector classification model (the one in sklearn.svm.LinearSVC) is to be trained on the data represented by x. Knowing that this model accepts sparse input, there are 7 possible representations in scipy:
- csc_matrix: compressed sparse column format
- csr_matrix: compressed sparse row format
- bsr_matrix: block sparse row format
- lil_matrix: list of lists format
- dok_matrix: dictionary of keys format
- coo_matrix: coordinate format (aka ijv, triplet format)
- dia_matrix: diagonal format
Which representation is more efficient for training the model? And how can I efficiently pass x in that representation?
CSR is the way to go, as supported by sklearn's sources. Excerpt:
class LinearSVC(BaseEstimator, LinearClassifierMixin, _LearntSelectorMixin, SparseCoefMixin):
    ...
    ...
    X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64, order="C")
CSR and many other formats are not recommended for building a sparse matrix directly (adding entries or changing the sparsity structure is costly).
Use dok_matrix or lil_matrix to build the sparse matrix from your data (this should be simple), then convert it to CSR (which is done in linear time):
x = x.tocsr()
Also keep in mind that the data you pass is converted internally for liblinear, the external library used by sklearn, which has its own data structures. If you pass the wrong format, it is only a one-time conversion cost. The pure training procedure does not care!
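The build-then-convert advice can be sketched on the 3x4 example from the question. The helper names `index_lists_to_csr_via_lil` and `index_lists_to_csr` and the toy labels are assumptions made for illustration, not part of the original post. The second helper is an alternative not mentioned in the answer: because x already stores the column indices of each row, the three CSR arrays (`data`, `indices`, `indptr`) can be assembled directly and handed to the `csr_matrix` constructor, skipping the intermediate format.

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix
from sklearn.svm import LinearSVC


def index_lists_to_csr_via_lil(rows, n_features):
    """Fill a lil_matrix entry by entry, then convert it to CSR."""
    m = lil_matrix((len(rows), n_features), dtype=np.float64)
    for i, cols in enumerate(rows):
        for j in cols:
            m[i, j] = 1.0
    return m.tocsr()


def index_lists_to_csr(rows, n_features):
    """Assemble the CSR arrays directly from per-row column indices."""
    lengths = [len(r) for r in rows]
    # indptr[i]:indptr[i + 1] is the slice of `indices` belonging to row i.
    indptr = np.concatenate(([0], np.cumsum(lengths, dtype=np.int64)))
    indices = np.fromiter(
        (j for r in rows for j in r), dtype=np.int64, count=int(indptr[-1])
    )
    data = np.ones(len(indices), dtype=np.float64)  # every stored bit is 1
    return csr_matrix((data, indices, indptr), shape=(len(rows), n_features))


# x from the question's 3x4 example; the labels are made up for the demo.
x = [[2], [1, 2], [1, 3]]
y = ["benign", "malignant", "benign"]

X_lil_route = index_lists_to_csr_via_lil(x, n_features=4)
X = index_lists_to_csr(x, n_features=4)

# Both routes produce the same matrix.
same = np.array_equal(X.toarray(), X_lil_route.toarray())

clf = LinearSVC()
clf.fit(X, y)  # CSR input with float64 data
predictions = clf.predict(X)
```

For the real data set, `n_features` would be the full feature count (about 200 000), which has to be passed explicitly since it cannot be inferred from the index lists alone.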