How do I tokenize a text corpus?
I want to tokenize a text corpus using the NLTK library.
My corpus is:
['Did you hear about the Native American man that drank 200 cups of tea?', "What's the best anti diarrheal prescription?", 'What do you call a person who is outside a door and has no arms nor legs?', 'Which Star Trek character is a member of the magic circle?', "What's the difference between a bullet and a human?",
Here is what I tried:
tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]
but I got the following error:
AttributeError: 'str' object has no attribute 'decode'
Thanks for your help.
As suggested on this page, the word_tokenize method expects a string as its argument. In Python 3 every str is already Unicode text, so .decode() no longer exists on it; just drop the decode call and try
tok_corp = [nltk.word_tokenize(sent) for sent in corpus]
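As a side note, word_tokenize relies on NLTK's Punkt tokenizer models, which are downloaded separately. If you hit a LookupError when running the line above, fetching that resource first usually resolves it. A minimal sketch (depending on your NLTK version, the resource it asks for may be named 'punkt_tab' instead of 'punkt'):

import nltk

# One-time download of the Punkt models used by word_tokenize.
# Newer NLTK releases may prompt for 'punkt_tab' rather than 'punkt'.
nltk.download('punkt')

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]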
Edit: with the code below I get the tokenized corpus.
Code:
import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
          "What's the best anti diarrheal prescription?",
          'What do you call a person who is outside a door and has no arms nor legs?',
          'Which Star Trek character is a member of the magic circle?',
          "What's the difference between a bullet and a human?"]

# Tokenize each sentence and collect the token lists into a DataFrame.
tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])
Output:
       0     1     2           3        4  ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None
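One note on this design choice: the DataFrame pads shorter sentences with None so that every row has the same number of columns. If you only need the token lists themselves, a plain nested list avoids that padding:

# One list of tokens per sentence, with no None padding.
tok_corp = [word_tokenize(sent) for sent in corpus]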
That said, I suspect some non-string (or bytes) objects have slipped into your corpus; I'd suggest double-checking it.
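If that turns out to be the case, a small normalization pass before tokenizing can help. A minimal sketch (the bytes check and the 'utf-8' codec are assumptions about what your corpus might actually contain):

from nltk import word_tokenize

def to_text(item):
    # Assumption: stray items are UTF-8 encoded bytes; decode those,
    # and stringify anything else so word_tokenize always gets a str.
    if isinstance(item, bytes):
        return item.decode('utf-8')
    return str(item)

tok_corp = [word_tokenize(to_text(sent)) for sent in corpus]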