python - UnicodeDecodeError: 'ascii' codec can't decode byte in TextRank code



When I execute the code below:

    import networkx as nx
    import numpy as np
    from nltk.tokenize.punkt import PunktSentenceTokenizer
    from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

    def textrank(document):
        sentence_tokenizer = PunktSentenceTokenizer()
        sentences = sentence_tokenizer.tokenize(document)

        bow_matrix = CountVectorizer().fit_transform(sentences)
        normalized = TfidfTransformer().fit_transform(bow_matrix)

        similarity_graph = normalized * normalized.T

        nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
        scores = nx.pagerank(nx_graph)
        return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    fp = open("qc")
    txt = fp.read()
    sents = textrank(txt)
    print sents

I get the following error:

    Traceback (most recent call last):
      File "textrank.py", line 44, in <module>
        sents = textrank(txt)
      File "textrank.py", line 10, in textrank
        sentences = sentence_tokenizer.tokenize(document)
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
        return list(self.sentences_from_text(text, realign_boundaries))
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
        return [(sl.start, sl.stop) for sl in slices]
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
        for sl1, sl2 in _pair_iter(slices):
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
        for el in it:
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
        if self.text_contains_sentbreak(context):
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
        for t in self._annotate_tokens(self._tokenize_words(text)):
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
        for t1, t2 in _pair_iter(tokens):
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
        prev = next(it)
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
        for aug_tok in tokens:
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
        for line in plaintext.split('\n'):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

I am running the code on Ubuntu. The text is taken from the website https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file named qc (not qc.txt) and copy-pasted the data into it paragraph by paragraph. Kindly help me resolve this error. Thank you.
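For context (this explanation is mine, not from the original post): 0xe2 is the lead byte of a UTF-8 multi-byte sequence, which is exactly what curly quotes and dashes copied from a web page become. Python 2 reads the file as a byte string, and the tokenizer implicitly decodes it with the default ascii codec, which fails on any byte above 0x7f. A minimal reproduction, independent of NLTK:

```python
# An em dash (U+2014) copied from the web is stored on disk as the
# UTF-8 byte sequence 0xe2 0x80 0x94.
data = b'\xe2\x80\x94'

try:
    data.decode('ascii')      # what Python 2 attempts implicitly
except UnicodeDecodeError as err:
    print(err)                # 'ascii' codec can't decode byte 0xe2 ...

print(data.decode('utf-8'))   # decoding with the right codec works
```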

Please see if the following works for you.

    import networkx as nx
    import numpy as np
    import sys

    reload(sys)
    sys.setdefaultencoding('utf8')

    from nltk.tokenize.punkt import PunktSentenceTokenizer
    from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

    def textrank(document):
        sentence_tokenizer = PunktSentenceTokenizer()
        sentences = sentence_tokenizer.tokenize(document)

        bow_matrix = CountVectorizer().fit_transform(sentences)
        normalized = TfidfTransformer().fit_transform(bow_matrix)

        similarity_graph = normalized * normalized.T

        nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
        scores = nx.pagerank(nx_graph)
        return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    fp = open("qc")
    txt = fp.read()
    sents = textrank(txt.encode('utf-8'))
    print sents
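As an aside (my suggestion, not part of the answer above): `reload(sys)` / `sys.setdefaultencoding` is widely discouraged because it silently changes decoding behaviour process-wide. An alternative sketch is to decode the file explicitly when reading it, which behaves the same on Python 2 and 3:

```python
import io

def read_text(path, encoding='utf-8'):
    # io.open decodes the bytes up front, so textrank() receives
    # unicode text and the Punkt tokenizer never falls back to the
    # default ascii codec.
    with io.open(path, encoding=encoding) as fp:
        return fp.read()

# txt = read_text("qc")   # then: sents = textrank(txt)
```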
