TfidfVectorizer with grouped tokens
I've vectorized ~1 million text documents with TfidfVectorizer:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vect1 = TfidfVectorizer()
    vect1.fit_transform(data)
Now I need to group semantically similar tokens together and repeat the vectorization. Replacing the tokens in the raw data with new ones seems tedious and time-consuming, since I intend to repeat the process with multiple grouping strategies.
I've tried:

    voc = vect1.vocabulary_
    voc['cabman'] = voc['driver']  # grouped similar tokens share the same id
    ...
    vect2 = TfidfVectorizer(vocabulary=voc)
    vect2.fit_transform(data)

but this raises:

    ValueError: Vocabulary contains repeated indices.
Is there an efficient way to accomplish this without touching the input data?
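One idea I'm considering is to parse the corpus only once: get raw term counts with CountVectorizer, merge the columns of synonymous terms with a sparse indicator matrix for each grouping strategy, and then apply TfidfTransformer to the merged counts. A rough sketch of what I mean (the cabman/driver mapping and the variable names are just illustrative, and I haven't verified this at scale):

    import numpy as np
    from scipy import sparse
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # raw term counts, computed once over the corpus
    cv = CountVectorizer()
    counts = cv.fit_transform(data)              # shape: (n_docs, n_terms)
    terms = cv.get_feature_names_out()

    # grouping strategy: map each token to its group representative
    groups = {'cabman': 'driver'}                # tokens not listed map to themselves

    group_names = sorted({groups.get(t, t) for t in terms})
    col_of = {g: j for j, g in enumerate(group_names)}

    # (n_terms, n_groups) indicator matrix that sums columns of the same group
    rows = np.arange(len(terms))
    cols = np.array([col_of[groups.get(t, t)] for t in terms])
    M = sparse.csr_matrix((np.ones(len(terms)), (rows, cols)),
                          shape=(len(terms), len(group_names)))

    grouped_counts = counts @ M                  # merge synonym columns
    tfidf = TfidfTransformer().fit_transform(grouped_counts)

The counts would be computed only once, so for each new grouping strategy only the cheap column merge and the TfidfTransformer step would have to run again.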