python - Most representative document in a list of documents
Hi, I'm trying to find out what the most representative document in a list of documents might be. I'm wondering if there are any resources or documentation on how to do that. I have put together some simple statistics that do this for me:
- removing stop words and using bigrams
- a matrix multiply: for each document, the sum of each term's tf multiplied by its df is that document's score
- whichever document has the tf*df score closest to the average tf*df score is retrieved
So the idea is that the higher a term's df is, the more representative it is of the corpus. Since the tf scoring is optimized toward the average, documents that overuse or underuse a high-df word are punished.
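For concreteness, the steps above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: the tiny `STOP_WORDS` set, the whitespace tokenizer, and the function names `bigrams` and `most_representative` are all assumptions made up for the example, and the "matrix multiply" is written as a plain sum over each document's bigram counts.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "on", "to"}

def bigrams(text):
    """Lowercase the text, drop stop words, and return adjacent word pairs."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return list(zip(words, words[1:]))

def most_representative(docs):
    """Return (index, scores) where index is the document whose tf*df score
    is closest to the mean tf*df score over all documents."""
    # Term frequency: bigram counts per document.
    tfs = [Counter(bigrams(d)) for d in docs]
    # Document frequency: number of documents each bigram appears in.
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    # Score of a document = sum over its bigrams of tf * df.
    scores = [sum(count * df[term] for term, count in tf.items()) for tf in tfs]
    mean = sum(scores) / len(scores)
    # Ties are resolved in favour of the earliest document.
    best = min(range(len(docs)), key=lambda i: abs(scores[i] - mean))
    return best, scores

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat sat on the sofa",
        "dogs bark loudly at night",
    ]
    index, scores = most_representative(corpus)
    print(index, scores)
```

Note that a long document with many rare bigrams can land near the average score just by length, which is part of why this feels hacky.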
It's pretty hacky, so I'm wondering if there is something better out there that people have encountered.
Are you correctly referring to df? Or do you mean inverse document frequency? Because in order to introduce penalization, you need to use the inverse. I have implemented such tools using dictionaries for faster results.
You need 3 of them:
token_doc_count = {doc_id: {token_id: count}}
tokens_freq_corpus = {token_id: count}
tokened_docs = {doc_id: list_of_tokens or string_of_tokens}
Also, tf-idf should already penalize stop words, so it's not necessary to remove them.
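A rough sketch of how those three dictionaries could be built and turned into tf-idf weights is below. Several things here are assumptions for illustration only: the token strings themselves are used as token ids, `tokens_freq_corpus` is read as total occurrences across the corpus, the helper names `build_indexes` and `tf_idf` are made up, and the idf is the plain `log(N / df)` variant, which is one of several common definitions.

```python
import math
from collections import Counter

def build_indexes(docs):
    """docs: {doc_id: text}. Returns the three dictionaries described above."""
    # tokened_docs = {doc_id: list_of_tokens}
    tokened_docs = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # token_doc_count = {doc_id: {token: count}}
    token_doc_count = {
        doc_id: dict(Counter(tokens)) for doc_id, tokens in tokened_docs.items()
    }
    # tokens_freq_corpus = {token: total count over the whole corpus}
    tokens_freq_corpus = Counter()
    for counts in token_doc_count.values():
        tokens_freq_corpus.update(counts)
    return token_doc_count, dict(tokens_freq_corpus), tokened_docs

def tf_idf(token_doc_count):
    """Return {doc_id: {token: tf * log(N / df)}} using raw counts as tf."""
    n_docs = len(token_doc_count)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for counts in token_doc_count.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        }
        for doc_id, counts in token_doc_count.items()
    }

if __name__ == "__main__":
    docs = {"a": "the cat sat", "b": "the dog sat", "c": "the bird flew"}
    token_doc_count, tokens_freq_corpus, tokened_docs = build_indexes(docs)
    for doc_id, weights in tf_idf(token_doc_count).items():
        print(doc_id, weights)
```

With this idf, a token that occurs in every document gets weight zero, which is the sense in which stop words are penalized without being removed.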