Pandas: group by two text columns and return the rows with the maximum count

Question

12 views · April 10, 2023

Anonymous · April 10, 2023

I am trying to find the "First_Word, Group" pair that occurs most often.

Import pandas and build the DataFrame:

import pandas as pd
df = pd.DataFrame({'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
           'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
           'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
                'apple fell out of the tree', 'partrige in a pear tree']},
          columns=['First_Word', 'Group', 'Text'])

Then group and count:

grouped = df.groupby(['First_Word', 'Group']).count()

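Now I want to filter this down to only the index rows with the maximum "Text" count. You will notice that "apple bins" is removed, because "apple trees" has the maximum value.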

grouped = grouped[grouped['Text'] == grouped['Text'].max()]

This is similar to this max value of group question, but when I try something like this:

grouped = df.groupby(["First_Word", "Group"]).count().apply(lambda t: t[t['Text']==t['Text'].max()])

I get an error: "KeyError: ('Text', 'occurred at index Text')". If I add "axis=1" to "apply", I get a different error: "IndexError: ('index out of bounds', 'occurred at index (apple, apple bins)')".

1 Answer

Anonymous · Answer 1 · 2023-05-20T15:45:47+00:00

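The KeyError occurs because DataFrame.apply passes each column to the lambda as a Series, so t['Text'] is looked up in that Series' index rather than among the DataFrame's columns. No apply is needed here. The task is to group by the two text columns, count, and then return the row with the highest count within each First_Word group. Note that this keeps one row per First_Word, whereas the filter in the question keeps only the rows that equal the overall maximum.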
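As a rough illustration of why the lambda cannot find 'Text', the sketch below records what apply hands to the function. The helper name inspect_arg and the list seen are made up for this example; the data is the DataFrame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
             'apple fell out of the tree', 'partrige in a pear tree'],
})
grouped = df.groupby(['First_Word', 'Group']).count()

seen = []  # hypothetical log of what apply passes in

def inspect_arg(obj):
    # Record the type and name of the object, then return it unchanged.
    seen.append((type(obj).__name__, obj.name))
    return obj

grouped.apply(inspect_arg)
print(seen)  # each entry is a Series named after a column, not a DataFrame

# 'Text' is a column label of grouped, not a label in the column's own index,
# which is why t['Text'] fails inside the lambda.
print('Text' in grouped.columns)        # True
print('Text' in grouped['Text'].index)  # False
```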

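The solution is as follows:

1. Group grouped by the First_Word index level and call idxmax() on the Text column: grouped.groupby(level='First_Word')['Text'].idxmax(). For each First_Word, this returns the index label, a (First_Word, Group) tuple, of the row with the highest count.

2. Pass those index labels to grouped.loc to select the matching rows from grouped.

3. Finally, print the result with print(result).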
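The steps above can be sketched with the intermediate value made visible. The variable name labels is only illustrative, and converting it with tolist() before indexing is a choice made for this sketch so that .loc receives a plain list of tuples.

```python
import pandas as pd

df = pd.DataFrame({
    'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
             'apple fell out of the tree', 'partrige in a pear tree'],
})
grouped = df.groupby(['First_Word', 'Group']).count()

# Step 1: for each First_Word, the (First_Word, Group) label with the highest count.
labels = grouped.groupby(level='First_Word')['Text'].idxmax()
print(labels)

# Step 2: select those rows from the counted frame.
result = grouped.loc[labels.tolist()]

# Step 3: show the result.
print(result)
```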

Here is the complete code example:

import pandas as pd
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])
grouped = df.groupby(['First_Word', 'Group']).count()
result = grouped.loc[grouped.groupby(level='First_Word')['Text'].idxmax()]
print(result)

Running the code above prints the following result:

                         Text
First_Word Group             
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1
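One caveat: idxmax returns only the first label when several groups share the highest count. If tied rows should all be kept, a possible variant compares each count with its group maximum via transform. This is a sketch, not part of the original answer; the data below is made up so that "apple bins" and "apple trees" tie, and size() is used in place of count() since only the group sizes matter here.

```python
import pandas as pd

# Hypothetical data in which two apple groups occur equally often.
df_ties = pd.DataFrame({
    'First_Word': ['apple', 'apple', 'apple', 'apple', 'orange', 'pear'],
    'Group': ['apple bins', 'apple bins', 'apple trees', 'apple trees',
              'orange juice', 'pear tree'],
})

# Number of rows per (First_Word, Group) pair, as a Series named 'Text'.
counts = df_ties.groupby(['First_Word', 'Group']).size().rename('Text')

# Broadcast each First_Word's maximum count back onto every row of that group.
group_max = counts.groupby(level='First_Word').transform('max')

# Keep every pair whose count equals its group's maximum (ties included).
with_ties = counts[counts == group_max]
print(with_ties)

# For comparison: idxmax keeps only the first label per First_Word.
first_only = counts.loc[counts.groupby(level='First_Word').idxmax().tolist()]
print(first_only)
```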

That is how to group by two text columns in Pandas and return, for each First_Word group, the row with the highest count.