python - UnicodeDecodeError: 'ascii' codec can't decode byte in TextRank code
When I execute the code below:
import networkx as nx
import numpy as np
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("qc")
txt = fp.read()
sents = textrank(txt)
print sents
I get the following error:
Traceback (most recent call last):
  File "textrank.py", line 44, in <module>
    sents = textrank(txt)
  File "textrank.py", line 10, in textrank
    sentences = sentence_tokenizer.tokenize(document)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
I am executing the code on Ubuntu. The text is from the website https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file named qc (not qc.txt) and copy-pasted the data into it, paragraph by paragraph. Kindly help me resolve this error. Thank you.
Please try if the following works for you.
import networkx as nx
import numpy as np
import sys

reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("qc")
txt = fp.read()
sents = textrank(txt.encode('utf-8'))
print sents
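As an aside, the error can also be avoided without the setdefaultencoding hack by decoding the file at read time (io.open in Python 2, which is the built-in open in Python 3), so the tokenizer receives unicode text instead of raw bytes. A minimal sketch, assuming the file is UTF-8 encoded; the file path and sample text here are made up for illustration:

```python
import io
import os
import tempfile

# Sample text containing a typographic apostrophe (U+2019), which
# encodes to the bytes 0xe2 0x80 0x99 in UTF-8 -- the same 0xe2 byte
# that triggers the ascii-codec error in the question.
text = u"Quantum computing isn\u2019t classical computing."

# Write a stand-in for the "qc" file (hypothetical temp file).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# io.open decodes while reading, so f.read() returns unicode and the
# Punkt tokenizer never falls back to the ascii codec.
with io.open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

os.remove(path)
```

With this approach, `decoded` can be passed straight to `textrank()` with no `.encode(...)` call needed.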