python - Most representative document in a list of documents


Hi, I am trying to find out which document in a list of documents might be the most representative. I am wondering if there are any resources or documentation on how to do that. I have put together some simple statistics to help me with this:

  • removing stop words, using bigrams
  • matrix multiply, so that each document's score is the sum of its tf values multiplied by df
  • whichever document has the tf*df score closest to the average tf*df score is retrieved

So the idea is that the higher the df is, the more representative a term is of the corpus. If the tf scoring is optimized toward the average, documents that overuse or underuse a high-df word are punished.
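The steps above can be sketched roughly as follows. This is only a minimal illustration under some assumptions not stated in the question: whitespace tokenization, raw counts for tf, the number of documents containing a bigram for df, and a non-empty list of documents. The function name `most_representative` and its parameters are made up for the example.

```python
from collections import Counter

def most_representative(docs, stop_words=frozenset()):
    """Return (index, scores); index is the doc whose tf*df score is closest to the mean."""
    # Tokenize on whitespace, drop stop words, then build bigrams per document.
    bigram_docs = []
    for text in docs:
        tokens = [t for t in text.lower().split() if t not in stop_words]
        bigram_docs.append(list(zip(tokens, tokens[1:])))

    # df: number of documents that contain each bigram at least once.
    df = Counter()
    for bigrams in bigram_docs:
        df.update(set(bigrams))

    # Score of a document: sum over its distinct bigrams of tf * df (tf is a raw count).
    scores = []
    for bigrams in bigram_docs:
        tf = Counter(bigrams)
        scores.append(sum(count * df[bg] for bg, count in tf.items()))

    # Pick the document whose score is closest to the average score.
    mean = sum(scores) / len(scores)
    best = min(range(len(scores)), key=lambda i: abs(scores[i] - mean))
    return best, scores
```

Because tf is a raw count here, longer documents tend to get larger scores, so normalizing by document length may be worth trying.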

It's pretty hacky, so I am wondering if there is something better out there that people have encountered.

Are you correctly referring to df? Or do you mean inverse document frequency? Because in order to introduce a penalization, you need to use the inverse. I have implemented such tools using dictionaries for faster results.

You need 3 of them:

token_doc_count = { doc_id: {token_id: count}}
tokens_freq_corpus = {token_id: count}
tokened_docs = {doc_id: list_of_tokens or string_of_tokens}
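As a rough sketch, the three dictionaries could be built in one pass like this. The assumptions here are mine: documents come in as a `{doc_id: text}` dict, tokens are split on whitespace, `tokens_freq_corpus` holds total occurrences across the corpus, and the extra `token_ids` map (token string to integer id) is a hypothetical helper not mentioned above.

```python
from collections import Counter

def build_indexes(docs):
    """Build the three dictionaries from a {doc_id: text} mapping."""
    token_ids = {}                  # hypothetical helper: token string -> integer token_id
    tokened_docs = {}               # {doc_id: list_of_tokens}
    token_doc_count = {}            # {doc_id: {token_id: count}}
    tokens_freq_corpus = Counter()  # {token_id: total count in the corpus}

    for doc_id, text in docs.items():
        tokens = text.lower().split()
        tokened_docs[doc_id] = tokens
        # Assign a new id the first time a token is seen, reuse it afterwards.
        ids = [token_ids.setdefault(tok, len(token_ids)) for tok in tokens]
        counts = Counter(ids)
        token_doc_count[doc_id] = dict(counts)
        tokens_freq_corpus.update(counts)

    return token_doc_count, dict(tokens_freq_corpus), tokened_docs, token_ids
```

If a document frequency is needed instead of a total count, updating the corpus counter with `set(ids)` rather than `counts` would give that.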

Also, tf-idf should already penalize stop words, so it's not necessary to remove them.
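To illustrate why, here is a small sketch using the plain unsmoothed form idf = log(N / df), which is one common definition and an assumption on my part. A token that appears in every document gets a weight of zero, which is what typically happens to stop words. The function name `idf` and the pre-tokenized input format are just for the example.

```python
import math

def idf(docs_tokens):
    """Return {token: log(N / df)} for a list of token lists."""
    n = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        # Count each token once per document.
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / count) for tok, count in df.items()}

weights = idf([["the", "cat"], ["the", "dog"], ["the", "bird"]])
# "the" occurs in all three documents, so its weight is log(3 / 3) = 0.
```

Smoothed idf variants give such tokens a small nonzero weight instead of exactly zero, so the penalty is softer there.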

