Ranking the values within each row of a multi-row DataFrame

Question

16 views · May 24, 2023

Anonymous · March 6, 2023


I have a large DataFrame that looks like this:

date	stock1	stock2	stock3	stock4	stock5	stock6	stock7	stock8	stock9	stock10
10/20	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.9
11/20	0.8	0.9	0.3	0.4	0.3	0.5	0.3	0.2	0.4	0.1
12/20	0.3	0.6	0.9	0.5	0.6	0.7	0.8	0.7	0.9	0.1

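For each row, I want to find the 20% of stocks with the highest values and the 20% of stocks with the lowest values. The output should be: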

date	highest-value stocks	lowest-value stocks
10/20	stock9, stock10	stock1, stock2
11/20	stock1, stock2	stock8, stock10
12/20	stock3, stock9	stock1, stock10

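I don't need the commas between the values above; the names can also be shown one below the other. I tried stacking with df = df.stack() and then sorting the values within the column, but I don't know how to continue from there.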




2 Answers

Anonymous · Answer 1 · 2023-03-06T20:57:58+00:00

You can do this with a helper function that sorts the values of each row:

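import pandas as pd

def get_top_bottom_20_pct(x):
    # Column names ordered from the highest to the lowest value in this row
    d = x.sort_values(ascending=False).index.tolist()
    # Join the first `size` names (top 20%) and the last `size` names (bottom 20%)
    return [*map(', '.join, (d[:size], d[-size:]))]

stocks = df.set_index('date')
size = int(0.2 * stocks.shape[1])  # count the stock columns only, not 'date'
s = stocks.apply(get_top_bottom_20_pct, axis=1)
out = pd.DataFrame(s.tolist(), index=s.index, columns=['higher', 'lower']).reset_index()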

If you are on Python >= 3.8, you can do the same thing with the walrus operator:

s = df.set_index('date').apply(lambda x: (', '.join((d := x.sort_values(ascending=False).index.tolist())[:size]), 
                                          ', '.join(d[-size:])), axis=1)
out = pd.DataFrame(s.tolist(), index=s.index, columns=['higher','lower']).reset_index()

Output:

    date           higher            lower
0  10/20  stock9, stock10   stock2, stock1
1  11/20   stock2, stock1  stock8, stock10
2  12/20   stock3, stock9  stock1, stock10
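For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same sort-and-slice idea. The helper name `top_bottom`, passing `size` through `args` instead of reading a global, and the choice of `kind="mergesort"` are my own assumptions for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; 'date' is used as the index here
values = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9],
    [0.8, 0.9, 0.3, 0.4, 0.3, 0.5, 0.3, 0.2, 0.4, 0.1],
    [0.3, 0.6, 0.9, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.1],
]
stocks = pd.DataFrame(
    values,
    columns=[f"stock{i}" for i in range(1, 11)],
    index=pd.Index(["10/20", "11/20", "12/20"], name="date"),
)


def top_bottom(row, size):
    """Hypothetical helper: names of the `size` highest and `size` lowest values in a row."""
    # mergesort is a stable sort, which makes the order of tied values deterministic
    names = row.sort_values(ascending=False, kind="mergesort").index.tolist()
    return [", ".join(names[:size]), ", ".join(names[-size:])]


size = int(0.2 * stocks.shape[1])  # 20% of the stock columns
s = stocks.apply(top_bottom, axis=1, args=(size,))
out = pd.DataFrame(s.tolist(), index=s.index, columns=["higher", "lower"]).reset_index()
print(out)
```

This should pick the same stocks as the output shown above. Passing `size` explicitly just keeps the helper independent of the order in which things are defined.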

Anonymous · Answer 2 · 2023-03-06T20:57:58+00:00

Try nlargest and nsmallest:

#df = df.set_index("date") #uncomment if date is a column and not the index
n = round(len(df.columns)*0.2) #number of stocks in the top/bottom 20%
output = pd.DataFrame()
output["higher"] = df.apply(lambda x: x.nlargest(n).index.tolist(), axis=1)
output["lower"] = df.apply(lambda x: x.nsmallest(n).index.tolist(), axis=1)
>>> output
                  higher              lower
date                                       
10/20  [stock9, stock10]   [stock1, stock2]
11/20   [stock2, stock1]  [stock10, stock8]
12/20   [stock3, stock9]  [stock10, stock1]

Edit:
If you want each stock name on its own line, you can do this:

output = pd.DataFrame()
output["higher"] = df.apply(lambda x: "\n".join(x.nlargest(n).index.tolist()), axis=1)
output["lower"] = df.apply(lambda x: "\n".join(x.nsmallest(n).index.tolist()), axis=1)
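Since the question mentions a large DataFrame, the row-wise apply calls may become slow. Below is a rough sketch of a vectorized variant built on numpy.argsort rather than nlargest/nsmallest; it is a swapped-in technique, not what the answers above use. It assumes `date` is already the index, that every remaining column is numeric, and that there are no missing values. The names are joined with a space because the question says commas are not needed.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question, with 'date' as the index
values = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9],
    [0.8, 0.9, 0.3, 0.4, 0.3, 0.5, 0.3, 0.2, 0.4, 0.1],
    [0.3, 0.6, 0.9, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.1],
]
df = pd.DataFrame(
    values,
    columns=[f"stock{i}" for i in range(1, 11)],
    index=pd.Index(["10/20", "11/20", "12/20"], name="date"),
)

n = round(len(df.columns) * 0.2)  # number of stocks in the top/bottom 20%

# For every row, the column positions ordered from lowest to highest value
order = np.argsort(df.to_numpy(), axis=1, kind="stable")
cols = df.columns.to_numpy()

lowest = cols[order[:, :n]]             # first n positions: smallest values
highest = cols[order[:, -n:][:, ::-1]]  # last n positions, largest first

output = pd.DataFrame(
    {
        "higher": [" ".join(names) for names in highest],
        "lower": [" ".join(names) for names in lowest],
    },
    index=df.index,
)
print(output)
```

The order of tied stocks can differ from what nlargest reports, so compare the results as sets if ties matter in your data.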