How do I tokenize a text corpus?


I want to tokenize a text corpus using the NLTK library.

My corpus looks like this:

['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?",

I tried:

tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]

but got the following error:

AttributeError: 'str' object has no attribute 'decode'

Thanks for your help.

admin changed the status to published on May 20, 2023

The error is right there: sent has no decode attribute. You only need .decode() when the items were originally encoded, i.e. bytes objects rather than str objects. Remove it and you should be fine.
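
For illustration, here is a minimal sketch (with made-up variable names) of when .decode() does and does not apply:

raw = b"Which Star Trek character is a member of the magic circle?"   # bytes: still encoded
text = "Which Star Trek character is a member of the magic circle?"   # str: already decoded
raw.decode('utf-8')     # fine: bytes -> str
# text.decode('utf-8')  # would raise: AttributeError: 'str' object has no attribute 'decode'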


As suggested on this page, the word_tokenize method expects a string as its argument, so just try

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]
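
One caveat (not part of the original answer): word_tokenize relies on NLTK's Punkt tokenizer data, so if you have never downloaded it you may hit a LookupError first. A one-time download takes care of that; newer NLTK versions may ask for 'punkt_tab' instead:

import nltk
nltk.download('punkt')   # one-time download of the Punkt tokenizer models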

Edit: with the following code I was able to get the tokenized corpus.

Code:

import pandas as pd
from nltk import word_tokenize
corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]
# shorter rows are padded with None because the sentences tokenize to different lengths
tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])

Output:

      0     1     2           3        4   ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None

I think some non-string, non-bytes objects have crept into your corpus. I suggest you check it again.
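
If the corpus really does mix bytes and str items, one way to normalize it before tokenizing is a sketch along these lines (clean_corpus is just an illustrative name):

# decode only the items that are actually bytes; leave str items untouched
clean_corpus = [s.decode('utf-8') if isinstance(s, bytes) else s for s in corpus]
tok_corp = [nltk.word_tokenize(sent) for sent in clean_corpus]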
