python - CountVectorizer deleting features that only appear once


I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer with a pre-created dictionary, where the CountVectorizer doesn't delete features that only appear once or don't appear at all.

Here is the sample code that I have:

    train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
    test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())

    print(len(train_count_vect.get_feature_names()))
    print(len(test_count_vect.get_feature_names()))

len(train_count_vect.get_feature_names()) outputs 89967, and len(test_count_vect.get_feature_names()) outputs 9833.

Inside the setup_data() function, I am just initializing CountVectorizer. For the training data, I'm initializing it without a preset vocabulary. Then, for the test data, I'm initializing CountVectorizer with the vocabulary I retrieved from the training data.

How do I get the vocabularies to be the same length? I think sklearn is deleting features because they only appear once or don't appear at all in the test corpus. I need to have the same vocabulary because otherwise my classifier will expect a different length than the test data points have.

So, it's impossible to say for certain without seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit_transform format, meaning there are two stages: fit and transform.

In the case of CountVectorizer, the fit stage creates the vocabulary, and the transform step maps the input text into that vocabulary space.
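The two stages can be seen separately in a minimal sketch. The toy corpus and variable names below are made up for illustration and are not taken from the question; the sketch assumes CountVectorizer's default tokenizer settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog sat down"]

vectorizer = CountVectorizer()

# fit: learn the vocabulary (a mapping from token to column index).
vectorizer.fit(docs)
vocab = vectorizer.vocabulary_
print(sorted(vocab))

# transform: map new text into the already-learned vocabulary space.
# A word that was never seen during fit ("bird") is simply ignored,
# so the number of columns stays equal to the vocabulary size.
matrix = vectorizer.transform(["the bird sat"])
print(matrix.shape)
print(matrix.toarray())
```

The point to notice is that transform never changes the vocabulary: the output always has one column per term learned during fit.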

My guess is that you're calling fit on both datasets instead of just one. You need to use the same "fitted" version of CountVectorizer on both if you want the results to line up. E.g.:

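    from sklearn.feature_extraction.text import CountVectorizer

    model = CountVectorizer()
    # Learn the vocabulary from the training corpus only
    transformed_train = model.fit_transform(train_corpus)
    # Reuse that vocabulary for the test corpus
    transformed_test = model.transform(test_corpus)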
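As a rough check that this pattern keeps the feature dimension fixed, here is a small self-contained sketch. The two corpora are invented placeholders standing in for train_corpus and test_corpus, which the question does not show.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the real corpora.
train_corpus = ["red apple", "green apple", "red grape"]
test_corpus = ["green grape", "yellow banana"]

model = CountVectorizer()
transformed_train = model.fit_transform(train_corpus)  # learns the vocabulary
transformed_test = model.transform(test_corpus)        # reuses it unchanged

# Both matrices have one column per training-vocabulary term.
print(transformed_train.shape)
print(transformed_test.shape)

# The second test document contains only unseen words,
# so its row is all zeros rather than adding new columns.
print(transformed_test.toarray())
```

A classifier trained on transformed_train can then be applied to transformed_test, because both share the same column layout.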

Again, I can only guess until you post the setup_data function, but having seen this before, I would guess you're doing something more like this:

    model = CountVectorizer()
    transformed_train = model.fit_transform(train_corpus)
    # Fitting again here replaces the training vocabulary
    transformed_test = model.fit_transform(test_corpus)

which will build a new vocabulary from test_corpus and, unsurprisingly, won't give you the same vocabulary length in both cases.
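The mismatch is easy to reproduce on made-up data. The sketch below uses invented corpora and is not the asker's setup_data. It also shows, as one possible alternative, building a second vectorizer from the training terms via the vocabulary parameter, in which case the feature count stays fixed as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the real corpora.
train_corpus = ["red apple", "green apple", "red grape"]
test_corpus = ["green grape", "banana"]

train_vect = CountVectorizer()
train_vect.fit_transform(train_corpus)

# Fitting a vectorizer on the test corpus learns a different vocabulary.
refit_vect = CountVectorizer()
refit_vect.fit_transform(test_corpus)

n_train = len(train_vect.vocabulary_)
n_refit = len(refit_vect.vocabulary_)
print(n_train, n_refit)  # the two lengths differ

# Alternative: fix the vocabulary up front and only call transform.
fixed_vect = CountVectorizer(vocabulary=sorted(train_vect.vocabulary_))
fixed_test = fixed_vect.transform(test_corpus)
print(fixed_test.shape)  # one column per training term
```

If a vectorizer that was given the training vocabulary still reports a shorter feature list, the likely explanation is that fit or fit_transform is being called somewhere without that vocabulary actually being passed through.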

