python - CountVectorizer deleting features that only appear once
I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer with a pre-created dictionary, where the CountVectorizer doesn't delete features that only appear once or don't appear at all.
Here is the sample code that I have:
    train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
    test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())
    print(len(train_count_vect.get_feature_names()))
    print(len(test_count_vect.get_feature_names()))
len(train_count_vect.get_feature_names()) outputs 89967
len(test_count_vect.get_feature_names()) outputs 9833
Inside the setup_data() function, I am just initializing CountVectorizer. For the training data, I'm initializing it without a preset vocabulary. Then, for the test data, I'm initializing CountVectorizer with the vocabulary I retrieved from the training data.
How do I get the vocabularies to be the same length? I think sklearn is deleting features because they only appear once or don't appear at all in my test corpus. I need to have the same vocabulary because otherwise my classifier will be of a different length from my test data points.
So, it's impossible to say for sure without seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit_transform format, meaning there are two stages: fit and transform.
In the case of CountVectorizer, the fit stage creates the vocabulary, and the transform step transforms your input text into that vocabulary space.
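To make the two stages concrete, here is a minimal pure-Python sketch of the idea. ToyCountVectorizer is a hypothetical toy class written for this illustration; it is not sklearn's implementation and skips sklearn's tokenization rules, it just splits on whitespace.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical toy class for illustration; not sklearn's implementation."""

    def fit(self, docs):
        # fit: learn the vocabulary (token -> column index) from the documents.
        tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        # transform: count tokens per document against the learned vocabulary.
        # Tokens that were never seen during fit are simply ignored.
        rows = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows


toy = ToyCountVectorizer().fit(["the cat sat", "the dog barked"])
train_rows = toy.transform(["the cat sat", "the dog barked"])
test_rows = toy.transform(["the bird sat"])

print(toy.vocabulary_)
print(test_rows)
```

Because the vocabulary is fixed at fit time, every row produced by transform has the same number of columns, whichever corpus it came from.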
My guess is that you're calling fit on both datasets instead of just one. You need to be using the same "fitted" version of CountVectorizer on both if you want the results to line up. E.g.:
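    from sklearn.feature_extraction.text import CountVectorizer
    model = CountVectorizer()
    # fit once on the training data, then reuse the fitted model on the test data
    transformed_train = model.fit_transform(train_corpus)
    transformed_test = model.transform(test_corpus)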
Again, this can only be a guess until you post the setup_data function, but having seen this before I would guess you're doing something more like this:
    model = CountVectorizer()
    transformed_train = model.fit_transform(train_corpus)
    # refitting here replaces the vocabulary learned from train_corpus
    transformed_test = model.fit_transform(test_corpus)
which will make a new vocabulary for the test_corpus, and unsurprisingly that won't give you the same vocabulary length in both cases.
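As a quick check of that difference, the sketch below compares column counts on two small made-up corpora (the example sentences are assumptions for illustration, not the asker's data). It uses only the vocabulary_ attribute and the matrix shape, so it does not depend on get_feature_names, which newer sklearn releases replace with get_feature_names_out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpora, for illustration only.
train_corpus = ["the cat sat on the mat", "the dog chased the cat"]
test_corpus = ["a bird sat on the fence"]

# Fit once on the training data, then reuse the fitted vectorizer.
shared = CountVectorizer()
train_matrix = shared.fit_transform(train_corpus)
test_matrix = shared.transform(test_corpus)
print(train_matrix.shape, test_matrix.shape)  # column counts match

# Refitting on the test data learns a separate, different vocabulary.
refit = CountVectorizer()
refit_matrix = refit.fit_transform(test_corpus)
print(refit_matrix.shape)

# Passing the training vocabulary explicitly also keeps the columns aligned,
# as long as transform (not fit_transform on a fresh vocabulary) is used.
fixed = CountVectorizer(vocabulary=shared.vocabulary_)
fixed_matrix = fixed.transform(test_corpus)
print(fixed_matrix.shape)
```

If the widths still differ when a vocabulary is passed in, the place to look is probably whatever setup_data does between constructing the vectorizer and returning it.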