Converting a Pandas DataFrame to nested JSON
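I am trying to convert a Pandas DataFrame to nested JSON. The .to_json() function is not flexible enough for my purpose.
Here is a DataFrame with some of the data points (in CSV format, comma-separated):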
,ID,Location,Country,Latitude,Longitude,timestamp,tide
0,1,BREST,FRA,48.383,-4.495,1807-01-01,6905.0
1,1,BREST,FRA,48.383,-4.495,1807-02-01,6931.0
2,1,BREST,FRA,48.383,-4.495,1807-03-01,6896.0
3,1,BREST,FRA,48.383,-4.495,1807-04-01,6953.0
4,1,BREST,FRA,48.383,-4.495,1807-05-01,7043.0
2508,7,CUXHAVEN 2,DEU,53.867,8.717,1843-01-01,7093.0
2509,7,CUXHAVEN 2,DEU,53.867,8.717,1843-02-01,6688.0
2510,7,CUXHAVEN 2,DEU,53.867,8.717,1843-03-01,6493.0
2511,7,CUXHAVEN 2,DEU,53.867,8.717,1843-04-01,6723.0
2512,7,CUXHAVEN 2,DEU,53.867,8.717,1843-05-01,6533.0
4525,9,MAASSLUIS,NLD,51.918,4.25,1848-02-01,6880.0
4526,9,MAASSLUIS,NLD,51.918,4.25,1848-03-01,6700.0
4527,9,MAASSLUIS,NLD,51.918,4.25,1848-04-01,6775.0
4528,9,MAASSLUIS,NLD,51.918,4.25,1848-05-01,6580.0
4529,9,MAASSLUIS,NLD,51.918,4.25,1848-06-01,6685.0
6540,8,WISMAR 2,DEU,53.898999999999994,11.458,1848-07-01,6957.0
6541,8,WISMAR 2,DEU,53.898999999999994,11.458,1848-08-01,6944.0
6542,8,WISMAR 2,DEU,53.898999999999994,11.458,1848-09-01,7084.0
6543,8,WISMAR 2,DEU,53.898999999999994,11.458,1848-10-01,6898.0
6544,8,WISMAR 2,DEU,53.898999999999994,11.458,1848-11-01,6859.0
8538,10,SAN FRANCISCO,USA,37.806999999999995,-122.465,1854-07-01,6909.0
8539,10,SAN FRANCISCO,USA,37.806999999999995,-122.465,1854-08-01,6940.0
8540,10,SAN FRANCISCO,USA,37.806999999999995,-122.465,1854-09-01,6961.0
8541,10,SAN FRANCISCO,USA,37.806999999999995,-122.465,1854-10-01,6952.0
8542,10,SAN FRANCISCO,USA,37.806999999999995,-122.465,1854-11-01,6952.0
There is a lot of repeated information, and I would like a JSON like this:
[
  {
    "ID": 1,
    "Location": "BREST",
    "Latitude": 48.383,
    "Longitude": -4.495,
    "Country": "FRA",
    "Tide-Data": {
      "1807-02-01": 6931,
      "1807-03-01": 6896,
      "1807-04-01": 6953,
      "1807-05-01": 7043
    }
  },
  {
    "ID": 5,
    "Location": "HOLYHEAD",
    "Latitude": 53.31399999999999,
    "Longitude": -4.62,
    "Country": "GBR",
    "Tide-Data": {
      "1807-02-01": 6931,
      "1807-03-01": 6896,
      "1807-04-01": 6953,
      "1807-05-01": 7043
    }
  }
]
How can I achieve this?
Code to produce the DataFrame:
import json
import pandas as pd

# input json (a subset of the rows shown above)
json_str = '''[
  {"ID":1,"Location":"BREST","Country":"FRA","Latitude":48.383,"Longitude":-4.495,"timestamp":"1807-01-01","tide":6905},
  {"ID":1,"Location":"BREST","Country":"FRA","Latitude":48.383,"Longitude":-4.495,"timestamp":"1807-02-01","tide":6931},
  {"ID":1,"Location":"BREST","Country":"FRA","Latitude":48.383,"Longitude":-4.495,"timestamp":"1807-03-01","tide":6896},
  {"ID":7,"Location":"CUXHAVEN 2","Country":"DEU","Latitude":53.867,"Longitude":8.717,"timestamp":"1843-01-01","tide":7093},
  {"ID":7,"Location":"CUXHAVEN 2","Country":"DEU","Latitude":53.867,"Longitude":8.717,"timestamp":"1843-02-01","tide":6688},
  {"ID":7,"Location":"CUXHAVEN 2","Country":"DEU","Latitude":53.867,"Longitude":8.717,"timestamp":"1843-03-01","tide":6493}
]'''

# load json object
data_list = json.loads(json_str)

# create dataframe
df = pd.json_normalize(data_list)
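groupby.apply runs a separate operation on every group to build the nested structure, which makes it very slow. In contrast, a simple for loop over itertuples, combined with a list comprehension to build the nested structure and json.dumps to serialize it, is much faster. This is especially useful when the groups are small, because groupby.apply is very slow in those cases.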
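import json

keys = ['ID', 'Location', 'Country', 'Latitude', 'Longitude']
mydict = {}
# The first five fields of each row are the key columns; group on them
# and collect a timestamp -> tide mapping per group.
for row in df.itertuples(index=False):
    mydict.setdefault(row[:5], {})[row.timestamp] = row.tide
# Merge each key tuple back into a record with its nested tide data.
mylist = [{**dict(zip(keys, k)), 'Tide-Data': v} for k, v in mydict.items()]
j = json.dumps(mylist)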
Note that the groupby.apply approach shown by MaxU needs a small change to produce the expected output (the lambda passed to apply has to be modified slightly).
j = (df.groupby(keys)
       .apply(lambda x: x.set_index('timestamp')['tide'].to_dict())
       .reset_index(name='Tide-Data')
       .to_json(orient='records'))
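For the given input data, both approaches produce the same records. The loop keeps groups in order of first appearance, whereas groupby sorts them by key; json.dumps also writes floats at full precision (e.g. 53.898999999999994), while to_json rounds to 10 decimal places by default. The records, in loop order with rounded floats: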
[
  {
    "ID": 1,
    "Location": "BREST",
    "Country": "FRA",
    "Latitude": 48.383,
    "Longitude": -4.495,
    "Tide-Data": {
      "1807-01-01": 6905.0,
      "1807-02-01": 6931.0,
      "1807-03-01": 6896.0,
      "1807-04-01": 6953.0,
      "1807-05-01": 7043.0
    }
  },
  {
    "ID": 7,
    "Location": "CUXHAVEN 2",
    "Country": "DEU",
    "Latitude": 53.867,
    "Longitude": 8.717,
    "Tide-Data": {
      "1843-01-01": 7093.0,
      "1843-02-01": 6688.0,
      "1843-03-01": 6493.0,
      "1843-04-01": 6723.0,
      "1843-05-01": 6533.0
    }
  },
  {
    "ID": 9,
    "Location": "MAASSLUIS",
    "Country": "NLD",
    "Latitude": 51.918,
    "Longitude": 4.25,
    "Tide-Data": {
      "1848-02-01": 6880.0,
      "1848-03-01": 6700.0,
      "1848-04-01": 6775.0,
      "1848-05-01": 6580.0,
      "1848-06-01": 6685.0
    }
  },
  {
    "ID": 8,
    "Location": "WISMAR 2",
    "Country": "DEU",
    "Latitude": 53.899,
    "Longitude": 11.458,
    "Tide-Data": {
      "1848-07-01": 6957.0,
      "1848-08-01": 6944.0,
      "1848-09-01": 7084.0,
      "1848-10-01": 6898.0,
      "1848-11-01": 6859.0
    }
  },
  {
    "ID": 10,
    "Location": "SAN FRANCISCO",
    "Country": "USA",
    "Latitude": 37.807,
    "Longitude": -122.465,
    "Tide-Data": {
      "1854-07-01": 6909.0,
      "1854-08-01": 6940.0,
      "1854-09-01": 6961.0,
      "1854-10-01": 6952.0,
      "1854-11-01": 6952.0
    }
  }
]
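To check the comparison on a small scale, here is a minimal sketch that runs both approaches on a five-row frame built from the sample data and parses the two results back into Python objects. The frame and the names `loop_json` and `groupby_records` are illustrative assumptions; selecting the `timestamp` and `tide` columns before `apply` is a small variation on the one-liner above, used here only to keep the lambda away from the grouping columns.

```python
import json

import pandas as pd

# Illustrative five-row frame; the values are taken from the sample data.
df = pd.DataFrame({
    "ID": [1, 1, 9, 9, 8],
    "Location": ["BREST", "BREST", "MAASSLUIS", "MAASSLUIS", "WISMAR 2"],
    "Country": ["FRA", "FRA", "NLD", "NLD", "DEU"],
    "Latitude": [48.383, 48.383, 51.918, 51.918, 53.899],
    "Longitude": [-4.495, -4.495, 4.25, 4.25, 11.458],
    "timestamp": ["1807-01-01", "1807-02-01", "1848-02-01", "1848-03-01", "1848-07-01"],
    "tide": [6905.0, 6931.0, 6880.0, 6700.0, 6957.0],
})
keys = ["ID", "Location", "Country", "Latitude", "Longitude"]

# Loop approach: groups come out in order of first appearance.
groups = {}
for row in df.itertuples(index=False):
    groups.setdefault(row[:len(keys)], {})[row.timestamp] = row.tide
loop_json = json.dumps([{**dict(zip(keys, k)), "Tide-Data": v} for k, v in groups.items()])

# groupby approach: groups come out sorted by key (sort=True is the default).
groupby_json = (
    df.groupby(keys)[["timestamp", "tide"]]
    .apply(lambda g: g.set_index("timestamp")["tide"].to_dict())
    .reset_index(name="Tide-Data")
    .to_json(orient="records")
)

loop_records = json.loads(loop_json)
groupby_records = json.loads(groupby_json)
```

Sorting both lists by `ID` before comparing them removes the ordering difference, so the records themselves can be compared directly.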
Benchmark results: for a frame with 100k rows where each group is relatively small, the loop approach runs about 50 times faster than the groupby.apply approach.
import json
import numpy as np
import pandas as pd

def jsonify(df, groupers):
    # Assumes the grouper columns are the leading columns of df.
    res = {}
    for row in df.itertuples(index=False):
        res.setdefault(row[:len(groupers)], {})[row.timestamp] = row.tide
    # The dict union operator (|) requires Python 3.9+.
    j = json.dumps([dict(zip(groupers, k)) | {'Tide-Data': v} for k, v in res.items()])
    return j

# 100k rows of random integers in [0, 10): many small groups
df = pd.DataFrame(np.random.default_rng().choice(10, size=(100000, 7)),
                  columns=['ID', 'Location', 'Country', 'Latitude', 'Longitude', 'timestamp', 'tide'])
groupers = ['ID', 'Location', 'Country', 'Latitude', 'Longitude']

# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit jsonify(df, groupers)
# 502 ms ± 17.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.groupby(groupers).apply(lambda x: x.set_index('timestamp')['tide'].to_dict()).reset_index(name='Tide-Data').to_json(orient='records')
# 25 s ± 1.38 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
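If the groups are large, the difference between the two is much smaller, but the loop implementation is still faster than the groupby.apply approach: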
# 100k rows of random integers in [0, 3): fewer, larger groups
df = pd.DataFrame(np.random.default_rng().choice(3, size=(100000, 7)),
                  columns=['ID', 'Location', 'Country', 'Latitude', 'Longitude', 'timestamp', 'tide'])

%timeit jsonify(df, groupers)
# 155 ms ± 6.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.groupby(groupers).apply(lambda x: x.set_index('timestamp')['tide'].to_dict()).reset_index(name='Tide-Data').to_json(orient='records')
# 201 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
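Outside IPython, where %timeit is not available, a similar comparison can be sketched with the standard-library timeit module. The helper names (`make_frame`, `loop_version`, `groupby_version`, `bench`) are assumptions made for this sketch, and the small frame size is chosen only so that it finishes quickly; timings will differ from the figures quoted above.

```python
import json
import timeit

import numpy as np
import pandas as pd

GROUPERS = ["ID", "Location", "Country", "Latitude", "Longitude"]


def make_frame(n_rows, n_values, seed=0):
    """Random integer frame with the same seven columns as the benchmark data."""
    rng = np.random.default_rng(seed)
    data = rng.choice(n_values, size=(n_rows, len(GROUPERS) + 2))
    return pd.DataFrame(data, columns=GROUPERS + ["timestamp", "tide"])


def loop_version(df):
    # itertuples loop plus json.dumps, as in the answer above.
    res = {}
    for row in df.itertuples(index=False):
        res.setdefault(row[:len(GROUPERS)], {})[row.timestamp] = row.tide
    return json.dumps([{**dict(zip(GROUPERS, k)), "Tide-Data": v} for k, v in res.items()])


def groupby_version(df):
    # groupby.apply plus to_json; the column selection keeps the lambda
    # away from the grouping columns.
    return (
        df.groupby(GROUPERS)[["timestamp", "tide"]]
        .apply(lambda g: g.set_index("timestamp")["tide"].to_dict())
        .reset_index(name="Tide-Data")
        .to_json(orient="records")
    )


def bench(n_rows, n_values, repeat=3):
    """Return the best wall-clock time in seconds for each approach."""
    df = make_frame(n_rows, n_values)
    return {
        "loop": min(timeit.repeat(lambda: loop_version(df), number=1, repeat=repeat)),
        "groupby.apply": min(timeit.repeat(lambda: groupby_version(df), number=1, repeat=repeat)),
    }


# Small run so the sketch finishes quickly; raise n_rows to approach the sizes above.
results = bench(n_rows=2_000, n_values=10, repeat=1)
print(results)
```

Lowering `n_values` makes the groups larger and fewer, which is the second scenario measured above.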
Update:
j = (df.groupby(['ID','Location','Country','Latitude','Longitude'])
       .apply(lambda x: x[['timestamp','tide']].to_dict('records'))
       .reset_index()
       .rename(columns={0:'Tide-Data'})
       .to_json(orient='records'))
Result (formatted):
In [103]: print(json.dumps(json.loads(j), indent=2, sort_keys=True))
[
  {
    "Country": "FRA",
    "ID": 1,
    "Latitude": 48.383,
    "Location": "BREST",
    "Longitude": -4.495,
    "Tide-Data": [
      {
        "tide": 6905.0,
        "timestamp": "1807-01-01"
      },
      {
        "tide": 6931.0,
        "timestamp": "1807-02-01"
      },
      {
        "tide": 6896.0,
        "timestamp": "1807-03-01"
      },
      {
        "tide": 6953.0,
        "timestamp": "1807-04-01"
      },
      {
        "tide": 7043.0,
        "timestamp": "1807-05-01"
      }
    ]
  },
  {
    "Country": "DEU",
    "ID": 7,
    "Latitude": 53.867,
    "Location": "CUXHAVEN 2",
    "Longitude": 8.717,
    "Tide-Data": [
      {
        "tide": 7093.0,
        "timestamp": "1843-01-01"
      },
      {
        "tide": 6688.0,
        "timestamp": "1843-02-01"
      },
      {
        "tide": 6493.0,
        "timestamp": "1843-03-01"
      },
      {
        "tide": 6723.0,
        "timestamp": "1843-04-01"
      },
      {
        "tide": 6533.0,
        "timestamp": "1843-05-01"
      }
    ]
  },
  {
    "Country": "DEU",
    "ID": 8,
    "Latitude": 53.899,
    "Location": "WISMAR 2",
    "Longitude": 11.458,
    "Tide-Data": [
      {
        "tide": 6957.0,
        "timestamp": "1848-07-01"
      },
      {
        "tide": 6944.0,
        "timestamp": "1848-08-01"
      },
      {
        "tide": 7084.0,
        "timestamp": "1848-09-01"
      },
      {
        "tide": 6898.0,
        "timestamp": "1848-10-01"
      },
      {
        "tide": 6859.0,
        "timestamp": "1848-11-01"
      }
    ]
  },
  {
    "Country": "NLD",
    "ID": 9,
    "Latitude": 51.918,
    "Location": "MAASSLUIS",
    "Longitude": 4.25,
    "Tide-Data": [
      {
        "tide": 6880.0,
        "timestamp": "1848-02-01"
      },
      {
        "tide": 6700.0,
        "timestamp": "1848-03-01"
      },
      {
        "tide": 6775.0,
        "timestamp": "1848-04-01"
      },
      {
        "tide": 6580.0,
        "timestamp": "1848-05-01"
      },
      {
        "tide": 6685.0,
        "timestamp": "1848-06-01"
      }
    ]
  },
  {
    "Country": "USA",
    "ID": 10,
    "Latitude": 37.807,
    "Location": "SAN FRANCISCO",
    "Longitude": -122.465,
    "Tide-Data": [
      {
        "tide": 6909.0,
        "timestamp": "1854-07-01"
      },
      {
        "tide": 6940.0,
        "timestamp": "1854-08-01"
      },
      {
        "tide": 6961.0,
        "timestamp": "1854-09-01"
      },
      {
        "tide": 6952.0,
        "timestamp": "1854-10-01"
      },
      {
        "tide": 6952.0,
        "timestamp": "1854-11-01"
      }
    ]
  }
]
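The same list-of-records shape can also be built with the itertuples loop from the other answer. The following is only a sketch under that assumption, not part of the original update; the three-row frame and the names `groups` and `records` are illustrative.

```python
import json

import pandas as pd

# Illustrative three-row frame; the values are taken from the sample data.
df = pd.DataFrame({
    "ID": [1, 1, 7],
    "Location": ["BREST", "BREST", "CUXHAVEN 2"],
    "Country": ["FRA", "FRA", "DEU"],
    "Latitude": [48.383, 48.383, 53.867],
    "Longitude": [-4.495, -4.495, 8.717],
    "timestamp": ["1807-01-01", "1807-02-01", "1843-01-01"],
    "tide": [6905.0, 6931.0, 7093.0],
})
keys = ["ID", "Location", "Country", "Latitude", "Longitude"]

# Collect one {"timestamp": ..., "tide": ...} record per row, grouped by key tuple.
groups = {}
for row in df.itertuples(index=False):
    groups.setdefault(row[:len(keys)], []).append(
        {"timestamp": row.timestamp, "tide": row.tide}
    )

records = [{**dict(zip(keys, k)), "Tide-Data": v} for k, v in groups.items()]
j = json.dumps(records)
```

Unlike the dictionary form, a list of records keeps every row even when a timestamp occurs more than once within a group.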
Old answer:
You can do this with the groupby(), apply() and to_json() methods:
j = (df.groupby(['ID','Location','Country','Latitude','Longitude'], as_index=False)
       .apply(lambda x: dict(zip(x.timestamp,x.tide)))
       .reset_index()
       .rename(columns={0:'Tide-Data'})
       .to_json(orient='records'))
Output:
In [112]: print(json.dumps(json.loads(j), indent=2, sort_keys=True))
[
  {
    "Country": "FRA",
    "ID": 1,
    "Latitude": 48.383,
    "Location": "BREST",
    "Longitude": -4.495,
    "Tide-Data": {
      "1807-01-01": 6905.0,
      "1807-02-01": 6931.0,
      "1807-03-01": 6896.0,
      "1807-04-01": 6953.0,
      "1807-05-01": 7043.0
    }
  },
  {
    "Country": "DEU",
    "ID": 7,
    "Latitude": 53.867,
    "Location": "CUXHAVEN 2",
    "Longitude": 8.717,
    "Tide-Data": {
      "1843-01-01": 7093.0,
      "1843-02-01": 6688.0,
      "1843-03-01": 6493.0,
      "1843-04-01": 6723.0,
      "1843-05-01": 6533.0
    }
  },
  {
    "Country": "DEU",
    "ID": 8,
    "Latitude": 53.899,
    "Location": "WISMAR 2",
    "Longitude": 11.458,
    "Tide-Data": {
      "1848-07-01": 6957.0,
      "1848-08-01": 6944.0,
      "1848-09-01": 7084.0,
      "1848-10-01": 6898.0,
      "1848-11-01": 6859.0
    }
  },
  {
    "Country": "NLD",
    "ID": 9,
    "Latitude": 51.918,
    "Location": "MAASSLUIS",
    "Longitude": 4.25,
    "Tide-Data": {
      "1848-02-01": 6880.0,
      "1848-03-01": 6700.0,
      "1848-04-01": 6775.0,
      "1848-05-01": 6580.0,
      "1848-06-01": 6685.0
    }
  },
  {
    "Country": "USA",
    "ID": 10,
    "Latitude": 37.807,
    "Location": "SAN FRANCISCO",
    "Longitude": -122.465,
    "Tide-Data": {
      "1854-07-01": 6909.0,
      "1854-08-01": 6940.0,
      "1854-09-01": 6961.0,
      "1854-10-01": 6952.0,
      "1854-11-01": 6952.0
    }
  }
]
PS If you don't care about indentation, you can write directly to a JSON file:
(df.groupby(['ID','Location','Country','Latitude','Longitude'], as_index=False)
   .apply(lambda x: dict(zip(x.timestamp,x.tide)))
   .reset_index()
   .rename(columns={0:'Tide-Data'})
   .to_json('/path/to/file_name.json', orient='records'))
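If indentation does matter, one option is to parse the compact string and write it back out with the standard-library json module. This is a minimal sketch: the string `j` below is a hand-written stand-in for the output of `.to_json(orient='records')`, and the temporary output path is a hypothetical location chosen to keep the example self-contained.

```python
import json
import os
import tempfile

# Stand-in for the compact JSON string returned by .to_json(orient='records').
j = '[{"ID":1,"Location":"BREST","Tide-Data":{"1807-01-01":6905.0,"1807-02-01":6931.0}}]'

# Hypothetical output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "tides.json")

# Parse the compact string and write it back with two-space indentation.
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(json.loads(j), fh, indent=2)
```

Recent pandas versions also appear to accept an `indent` argument to `to_json`, which may avoid the round trip; check the documentation for the installed version.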