python - UnicodeDecodeError: 'ascii' codec can't decode byte in TextRank code
When I execute the code below:
import networkx as nx
import numpy as np
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("qc")
txt = fp.read()
sents = textrank(txt)
print sents
I get the following error:
Traceback (most recent call last):
  File "textrank.py", line 44, in <module>
    sents = textrank(txt)
  File "textrank.py", line 10, in textrank
    sentences = sentence_tokenizer.tokenize(document)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
I am executing the code on Ubuntu. The text is from the website https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file named qc (not qc.txt) and copy-pasted the data into it, paragraph by paragraph. Kindly help me resolve this error. Thank you.
Please try if the following works for you.
import networkx as nx
import numpy as np
import sys

reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("qc")
txt = fp.read()
sents = textrank(txt.encode('utf-8'))
print sents
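As an aside, the error can also be avoided without the setdefaultencoding hack by decoding the file at read time (io.open in Python 2, which is the built-in open in Python 3), so the tokenizer receives unicode text instead of raw bytes. A minimal sketch, assuming the file is UTF-8 encoded; the file path and sample text here are made up for illustration:

```python
import io
import os
import tempfile

# Sample text containing a typographic apostrophe (U+2019), which
# encodes to the bytes 0xe2 0x80 0x99 in UTF-8 -- the same 0xe2 byte
# that triggers the ascii-codec error in the question.
text = u"Quantum computing isn\u2019t classical computing."

# Write a stand-in for the "qc" file (hypothetical temp file).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# io.open decodes while reading, so f.read() returns unicode and the
# Punkt tokenizer never falls back to the ascii codec.
with io.open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

os.remove(path)
```

With this approach, `decoded` can be passed straight to `textrank()` with no `.encode(...)` call needed.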