python - TfidfVectorizer with grouped tokens


I've vectorized about 1 million text documents with TfidfVectorizer.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vect1 = TfidfVectorizer()
    vect1.fit_transform(data)

Now I need to group semantically similar tokens and repeat the vectorization.

Replacing the tokens in the raw data with new ones seems tedious and time-consuming to me, since I intend to repeat the process with multiple grouping strategies.

I've tried:

    voc = vect1.vocabulary_
    voc['cabman'] = voc['driver']  # grouping: similar tokens get the same id
    ...
    vect2 = TfidfVectorizer(vocabulary=voc)
    vect2.fit_transform(data)

ValueError: Vocabulary contains repeated indices.

Is there an efficient way to accomplish this without touching the input data?
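One possible direction, offered as a sketch rather than a definitive answer: since TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, the raw counts could be computed once and the columns of grouped tokens summed afterwards with a sparse matrix product. The helper name `merge_token_columns`, the `groups` mapping, and the sample documents below are all hypothetical and only serve the illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def merge_token_columns(counts, vocabulary, groups):
    """Sum the count columns of tokens that share a group representative.

    counts:     sparse document-term matrix from CountVectorizer
    vocabulary: token -> column index (CountVectorizer.vocabulary_)
    groups:     token -> representative token (hypothetical mapping)
    """
    # One output column per representative; ungrouped tokens represent themselves.
    representatives = sorted({groups.get(tok, tok) for tok in vocabulary})
    new_vocabulary = {tok: i for i, tok in enumerate(representatives)}

    # Pair every old column index with the new column it should be added to.
    old_cols = np.array([col for col in vocabulary.values()])
    new_cols = np.array(
        [new_vocabulary[groups.get(tok, tok)] for tok in vocabulary]
    )

    # 0/1 mapping matrix of shape (old vocabulary size, new vocabulary size).
    mapping = csr_matrix(
        (np.ones(len(old_cols), dtype=counts.dtype), (old_cols, new_cols)),
        shape=(len(vocabulary), len(new_vocabulary)),
    )
    return counts @ mapping, new_vocabulary


# Toy stand-in for the real corpus (assumption: `data` is a list of strings).
docs = ["the cabman waited", "the driver left", "a driver and a cabman"]

counter = CountVectorizer()
counts = counter.fit_transform(docs)  # tokenize and count only once

# Each grouping strategy is just a different dict; the raw text is not touched.
merged, new_vocab = merge_token_columns(
    counts, counter.vocabulary_, {"cabman": "driver"}
)
tfidf = TfidfTransformer().fit_transform(merged)
```

With this layout, trying another grouping strategy only repeats the cheap column merge and the TF-IDF weighting, not the tokenization of the documents. The original attempt fails because a vocabulary passed to the vectorizer must map every term to a distinct column index, so two tokens cannot share an id there.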

