How do I tokenize a text corpus?
I want to tokenize a text corpus using the NLTK library.
My corpus is:
['Did you hear about the Native American man that drank 200 cups of tea?', "What's the best anti diarrheal prescription?", 'What do you call a person who is outside a door and has no arms nor legs?', 'Which Star Trek character is a member of the magic circle?', "What's the difference between a bullet and a human?",
Here is what I tried:
tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]
but I got the following error:
AttributeError: 'str' object has no attribute 'decode'
Thanks for your help.
As suggested on this page, the word_tokenize method expects a string as its argument. In Python 3 every str is already Unicode text, so .decode() no longer exists on it; just drop the decode call and try
tok_corp = [nltk.word_tokenize(sent) for sent in corpus]
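As a side note, word_tokenize relies on NLTK's Punkt tokenizer models, which are downloaded separately. If you hit a LookupError when running the line above, fetching that resource first usually resolves it. A minimal sketch (depending on your NLTK version, the resource it asks for may be named 'punkt_tab' instead of 'punkt'):

import nltk

# One-time download of the Punkt models used by word_tokenize.
# Newer NLTK releases may prompt for 'punkt_tab' rather than 'punkt'.
nltk.download('punkt')

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]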
Edit: with the code below I get the tokenized corpus.
Code:
import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
          "What's the best anti diarrheal prescription?",
          'What do you call a person who is outside a door and has no arms nor legs?',
          'Which Star Trek character is a member of the magic circle?',
          "What's the difference between a bullet and a human?"]

# Tokenize each sentence and collect the token lists into a DataFrame.
tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])
Output:
       0     1     2           3        4  ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None
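One note on this design choice: the DataFrame pads shorter sentences with None so that every row has the same number of columns. If you only need the token lists themselves, a plain nested list avoids that padding:

# One list of tokens per sentence, with no None padding.
tok_corp = [word_tokenize(sent) for sent in corpus]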
That said, I suspect some non-string (or bytes) objects have slipped into your corpus; I'd suggest double-checking it.
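If that turns out to be the case, a small normalization pass before tokenizing can help. A minimal sketch (the bytes check and the 'utf-8' codec are assumptions about what your corpus might actually contain):

from nltk import word_tokenize

def to_text(item):
    # Assumption: stray items are UTF-8 encoded bytes; decode those,
    # and stringify anything else so word_tokenize always gets a str.
    if isinstance(item, bytes):
        return item.decode('utf-8')
    return str(item)

tok_corp = [word_tokenize(to_text(sent)) for sent in corpus]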