Reading an Excel sheet that contains multiple tables whose headers have a non-white cell background color

Question


Anonymous, March 2, 2023


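I have an Excel sheet that contains multiple tables.

The tables differ in their number of columns and rows.

The good news is that the header cells have a background color, while the table contents have a white background.

I would like to know whether xlrd or another package can read each table in as a separate DataFrame.

The approach I am currently considering is very verbose and probably not ideal.

For example: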

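import xlrd

# formatting_info=True is needed to read cell styles; it only works for .xls files
book = xlrd.open_workbook("some.xls", formatting_info=True)
sheets = book.sheet_names()
for index, sh in enumerate(sheets):
    sheet = book.sheet_by_index(index)
    rows, cols = sheet.nrows, sheet.ncols
    for row in range(rows):
        for col in range(cols):
            # Look up the cell's format record and its pattern colour index
            xfx = sheet.cell_xf_index(row, col)
            xf = book.xf_list[xfx]
            bgx = xf.background.pattern_colour_index
            if bgx != 64:  # not the default colour index, so treat as a header
                Header_row = row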

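Then iterate over Header_row, take the values of all its columns, and use them as the DataFrame's column names.

Then keep parsing rows from the first column down until reaching an empty cell, or a row with only one or two non-empty cells.

As you can see, this gets rather verbose and is not the best way to do it.

I would appreciate help with a quick way to extract all the tables as separate DataFrames.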


1 Answer

Anonymous · Answer 1 · 2023-03-24T20:45:33+00:00

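Cause of the problem:

The difficulty comes from reading an Excel sheet that holds several tables whose headers have a non-white background color. While parsing, you need to decide which cells are headers and then parse the remaining data relative to those headers.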
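As a rough illustration of the header-detection step, the sketch below groups adjacent header-colored cells in one row into column spans, so that tables sitting side by side get separate headers. The function name `header_runs` is hypothetical, and the input is assumed to be a plain list of `pattern_colour_index` values for one row, with 64 taken as the non-header index as in the question.

```python
from itertools import groupby


def header_runs(colour_indexes, default_index=64):
    """Return (start, end) column spans of adjacent header cells; end is exclusive.

    colour_indexes -- one pattern colour index per cell in a single row
    default_index  -- index treated as "not a header" (assumed to be 64 here)
    """
    runs = []
    col = 0
    # groupby splits the row into consecutive stretches of header / non-header cells
    for is_header, group in groupby(colour_indexes, key=lambda idx: idx != default_index):
        length = len(list(group))
        if is_header:
            runs.append((col, col + length))
        col += length
    return runs


# Two header spans in one row: columns 1-2 and column 4
spans = header_runs([64, 22, 22, 64, 44])
```

Each span can then be treated as the header of one table; adjacent header cells of different colors are still merged into a single span in this sketch.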

Solution:

To keep the code readable, the parsing can be wrapped in functions. One possible implementation:

import xlrd
# from typing import Dict

# formatting_info=True is required to read cell backgrounds; it only works for .xls files
book = xlrd.open_workbook("some.xls", formatting_info=True)

def is_header(sheet, row, col, exclude_color=64):
    """Return True if the cell's background colour index is not exclude_color."""
    xf_index = sheet.cell_xf_index(row, col)
    bg_color = book.xf_list[xf_index].background.pattern_colour_index
    return bg_color != exclude_color

def parse_sheet(sheet):
    """Parse the sheet's data so it can be loaded into DataFrames."""
    column_headers = dict()  # type: Dict[int, str]
    for row in range(sheet.nrows):
        # If the first cell is not a header and has no value, skip the row
        # (this check may need to be removed so that row 13 is not skipped)
        if not is_header(sheet, row, 0) and not sheet.cell_value(row, 0):
            column_headers.clear()
            continue
        # Otherwise, either record this row's header cells or parse it as data
        c_headers = [c for c in range(sheet.ncols) if is_header(sheet, row, c)]
        if c_headers:
            for col in c_headers:
                column_headers[col] = sheet.cell_value(row, col)
        else:
            for col in range(sheet.ncols):
                value = sheet.cell_value(row, col)
                # TODO: add the value to a DataFrame, using column_headers for the column names

# Call the parse function for each sheet in the workbook
for index in range(book.nsheets):
    parse_sheet(book.sheet_by_index(index))

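The code above outlines this approach. Calling `parse_sheet` on each sheet walks its rows and reads the remaining data relative to the headers. The `is_header` function decides whether a given cell is a header by checking the cell's background color.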
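The TODO in the code leaves open how the rows are split into separate tables. A minimal sketch of one way to do it follows, written against plain Python lists so it does not depend on a workbook. The function name `extract_tables` and both of its inputs are assumptions for illustration: `values` is a rectangular grid of cell values and `header_mask` is a grid of the same shape that is True wherever a cell has the header background. Each run of adjacent header cells is taken as one table, and rows are read downward until a blank row or another header row is reached.

```python
def extract_tables(values, header_mask):
    """Split a 2-D grid into tables, using header_mask to locate header cells.

    values      -- list of rows, each a list of cell values ("" or None = empty)
    header_mask -- same shape as values; True where a cell is header-colored
    Returns a list of {"columns": [...], "rows": [[...], ...]} dicts.
    """
    tables = []
    n_rows = len(values)
    for r in range(n_rows):
        n_cols = len(values[r])
        c = 0
        while c < n_cols:
            if not header_mask[r][c]:
                c += 1
                continue
            # Extend over adjacent header cells: one run is one table's header
            start = c
            while c < n_cols and header_mask[r][c]:
                c += 1
            end = c
            columns = values[r][start:end]
            rows = []
            # Read downward until a blank row or another header row in this span
            for rr in range(r + 1, n_rows):
                cells = values[rr][start:end]
                is_blank = all(v in ("", None) for v in cells)
                hits_header = any(header_mask[rr][start:end])
                if is_blank or hits_header:
                    break
                rows.append(cells)
            tables.append({"columns": columns, "rows": rows})
    return tables


# Made-up example grid: two tables side by side, then a third one below
values = [
    ["id", "name", "", "code"],
    [1, "a", "", "x"],
    [2, "b", "", ""],
    ["", "", "", ""],
    ["k", "v", "w", ""],
    [10, 20, 30, ""],
]
header_mask = [
    [True, True, False, True],
    [False, False, False, False],
    [False, False, False, False],
    [False, False, False, False],
    [True, True, True, False],
    [False, False, False, False],
]
tables = extract_tables(values, header_mask)
```

With xlrd, `values[r][c]` could be filled from `sheet.cell_value(r, c)` and `header_mask[r][c]` from `is_header(sheet, r, c)`, and each result could then be turned into a DataFrame with `pandas.DataFrame(table["rows"], columns=table["columns"])`. Headers that span more than one row are not handled by this sketch: each header row would start its own table.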