UnicodeDecodeError: 'ascii' codec can't decode byte in Textrank code

When I run my Textrank script, I get the following error:

Traceback (most recent call last):
  File "Textrank.py", line 44, in <module>
    sents = textrank(txt)
  File "Textrank.py", line 10, in textrank
    sentences = sentence_tokenizer.tokenize(document)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

I am running this code on Ubuntu. For the text, I used this website:

https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101

I created a file named QC (not QC.txt) and copied and pasted the data into it paragraph by paragraph.

Please help me resolve this error.

Thank you.

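The traceback shows the UnicodeDecodeError: 'ascii' codec can't decode byte error being raised inside NLTK's Punkt tokenizer, at plaintext.split('\n'), not at a print statement. The file QC contains non-ASCII characters copied from the web page (byte 0xe2 is the first byte of the UTF-8 encoding of characters such as curly quotes and dashes), and the text reaches the tokenizer as a Python 2 byte string (str). When a byte string is combined with a unicode string, Python 2 implicitly decodes it with the default codec, which is ASCII, and that decoding fails. One workaround is to add the following lines at the top of the script: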

import sys
reload(sys)
sys.setdefaultencoding('utf8')

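This sets the default encoding to UTF-8, so implicit conversions between str and unicode use UTF-8 instead of ASCII. Note that reload(sys) and sys.setdefaultencoding exist only in Python 2 and change the behavior of the whole process, so this is generally considered a workaround rather than a fix; decoding the input to unicode explicitly when reading the file is more robust.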
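
As an alternative that avoids changing the interpreter default, the file can be decoded explicitly when it is read. The sketch below is a minimal example, assuming the file QC is saved as UTF-8 (likely for text pasted on Ubuntu, but not confirmed by the question). read_document is a hypothetical helper name, not part of the original script.

```python
import io


def read_document(path, encoding="utf-8"):
    """Read a text file and return its contents as decoded text.

    Hypothetical helper: the encoding is an assumption (UTF-8) and
    should be changed if the file was saved differently.
    """
    # io.open decodes while reading on both Python 2 and Python 3,
    # so the caller receives unicode text instead of raw bytes.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()
```

With the names from the traceback, the result could be used as txt = read_document("QC") followed by sents = textrank(txt). On Python 2.7 the returned value is a unicode object, so the tokenizer no longer has to fall back on an implicit ASCII decode.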
