python - Most representative document in a list of documents

January 15, 2010

Hi, I'm trying to find out what the most representative document in a list of documents might be. I was wondering if there are any resources or documentation on how to do that. I have put together some simple statistics to help me with this:

Remove stop words and use bigrams.
Matrix multiply, so that each document's score is the sum of its tf values multiplied by the matching df values.
Retrieve whichever document has the tf*df score closest to the average tf*df score.

So the idea is that the higher a term's df is, the more representative it is of the corpus. If tf scoring is optimized toward the average, documents that overuse or underuse a high-df word are punished.
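The scoring described above can be sketched roughly as follows. This is only an illustration under assumptions: whitespace tokenization, a tiny made-up stop-word list, and hypothetical names (`STOP_WORDS`, `bigrams`, `most_representative`) that do not come from the original post.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "is", "in", "and", "on"}

def bigrams(text):
    """Lowercase, drop stop words, and return adjacent word pairs."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return list(zip(words, words[1:]))

def most_representative(docs):
    """Return (index, scores): the document whose tf*df score is closest to the mean."""
    if not docs:
        raise ValueError("docs must not be empty")
    tokenized = [bigrams(d) for d in docs]
    # df: number of documents each bigram appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Document score: sum over its bigrams of tf * df.
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(sum(count * df[tok] for tok, count in tf.items()))
    mean = sum(scores) / len(scores)
    best = min(range(len(docs)), key=lambda i: abs(scores[i] - mean))
    return best, scores
```

For example, `most_representative(["cat sat mat", "cat sat mat cat sat mat", "dog ran"])` scores the documents 4, 9 and 1, and picks the first one because 4 is closest to the mean of about 4.67. Note that the raw sum grows with document length, so long documents score high regardless of content; normalizing tf by length may be worth considering.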

It's pretty hacky, so I was wondering if there is anything better out there that people have encountered.

Are you correctly referring to df? Or do you mean inverse document frequency? Because in order to introduce penalization you need to use the inverse. I have implemented such tools using dictionaries for faster results.

You need three of them:

token_doc_count = {doc_id: {token_id: count}}
tokens_freq_corpus = {token_id: count}
tokened_docs = {doc_id: list_of_tokens or string_of_tokens}
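One way these three dictionaries might be built is sketched below. The answer does not say how tokens are identified or what exactly `tokens_freq_corpus` counts, so this sketch assumes that the token string itself serves as `token_id` and that `tokens_freq_corpus` holds the number of documents containing each token (the quantity an inverse document frequency needs). The function name `build_indexes` is hypothetical.

```python
from collections import Counter

def build_indexes(docs):
    """docs maps doc_id -> raw text; returns the three dictionaries."""
    # doc_id -> list of tokens (naive whitespace tokenization, an assumption).
    tokened_docs = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # doc_id -> {token: number of occurrences in that document}.
    token_doc_count = {
        doc_id: dict(Counter(tokens)) for doc_id, tokens in tokened_docs.items()
    }
    # token -> number of documents containing it (assumed meaning of "count").
    tokens_freq_corpus = Counter()
    for counts in token_doc_count.values():
        tokens_freq_corpus.update(counts.keys())
    return token_doc_count, dict(tokens_freq_corpus), tokened_docs
```

With these in place, both tf and df become constant-time dictionary lookups, which is presumably where the speed-up mentioned in the answer comes from.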

Also, tf-idf should penalize stop words, so it's not necessary to remove them.
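A small sketch of why that holds, assuming the plain unsmoothed definition idf = log(N / df) and whitespace tokenization (the function name `tf_idf` is hypothetical): a word that appears in every document gets idf = log(1) = 0, so its weight vanishes however often it is used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: tf * idf} dict per document, with idf = log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # df: number of documents each token appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()}
        )
    return weights
```

For the corpus `["the cat", "the dog", "the bird"]`, "the" receives a weight of 0.0 in every document while "cat" receives log(3). Smoothed idf variants give such words a small non-zero weight instead of exactly zero.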
