python - TfidfVectorizer with grouped tokens


I've vectorized ~1 million textual documents with TfidfVectorizer.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vect1 = TfidfVectorizer()
    vect1.fit_transform(data)

Now I need to group (semantically similar) tokens and repeat the vectorization.

Replacing the tokens in the raw data with new ones seems tedious and time-consuming to me, since I intend to repeat the process with multiple grouping strategies.

I've tried:

    voc = vect1.vocabulary_
    voc['cabman'] = voc['driver']  # group similar tokens so they share the same id
    ...
    vect2 = TfidfVectorizer(vocabulary=voc)
    vect2.fit_transform(data)

This fails with:

    ValueError: Vocabulary contains repeated indices.

Is there an efficient way to accomplish this without touching the input data?
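One possible direction, sketched here under assumptions rather than taken from the original post: tokenize the corpus once with CountVectorizer, then for each grouping strategy sum the count columns of the grouped tokens and apply TfidfTransformer to the merged counts. The helper name `merge_columns`, the `groups` dict format (token -> representative token), and the three toy documents are all hypothetical and only serve the illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def merge_columns(counts, vocabulary, groups):
    """Sum the count columns of grouped tokens into one column per group.

    `groups` maps token -> representative token (hypothetical format).
    Returns the merged count matrix and the new token -> column vocabulary.
    """
    # Every token points at its representative (itself when ungrouped).
    reps = {tok: groups.get(tok, tok) for tok in vocabulary}
    new_terms = sorted(set(reps.values()))
    new_vocab = {term: j for j, term in enumerate(new_terms)}

    # 0/1 matrix of shape (old features, new features): old column i
    # contributes to the new column of its representative.
    rows = np.array([vocabulary[tok] for tok in reps])
    cols = np.array([new_vocab[rep] for rep in reps.values()])
    mapping = sparse.csr_matrix(
        (np.ones(len(rows), dtype=counts.dtype), (rows, cols)),
        shape=(len(vocabulary), len(new_vocab)),
    )
    return counts @ mapping, new_vocab


# Toy stand-in for the real corpus (assumption).
data = ["the cabman waited", "the driver waited", "a cabman and a driver"]

# Tokenize and count once; this is the expensive step on ~1 million documents.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(data)

# Repeat only this part for each grouping strategy.
groups = {"cabman": "driver"}
merged, new_vocab = merge_columns(counts, count_vect.vocabulary_, groups)
X2 = TfidfTransformer().fit_transform(merged)
```

With default settings, TfidfVectorizer behaves like CountVectorizer followed by TfidfTransformer, so this should give the same weights as replacing the tokens in the raw text and re-vectorizing. Options such as `min_df` or `max_df` would be applied to the ungrouped tokens here, so results may differ if those are set.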

