How do I recursively compute directory sizes in Python, like du . does?

Question

13 views · March 1, 2023

Anonymous · March 2, 2023

Suppose my directory structure looks like this:

/-- am here
/one/some/dir
/two
/three/has/many/leaves
/hello/world

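Also suppose /one/some/dir contains one large 500 MB file, and every folder in /three/has/many/leaves contains a 400 MB file.

I want to compute a size for each directory and get output like this:

/ - total size
/one/some/dir 500 MB
/two 0
/three/has/many/leaves 400 MB
/three/has/many 800 MB
/three/has/ 800 MB + the size of another large file

How should I go about this?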

3 Answers

Anonymous · Answer 1 · 2023-09-28T07:58:16+00:00

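The asker wants recursive per-directory sizes in Python, similar to what the command "du ." reports. They had already read the Python os.walk documentation and tried some of its code examples, but still could not get the result they wanted.

The solution is to walk the directory tree with os.walk. The example from the documentation can be adapted to fit; here it is: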

import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories

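This code visits every directory under the given root and prints the combined size of the non-directory files directly inside each one. It does not add in the sizes of subdirectories, so it needs modifying to produce recursive totals.

A comment on the question offered an untested example that walks the tree bottom-up and computes the size of each directory. Here it is: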

import os
from os.path import join, getsize
dirs_dict = {}
# Walk the tree bottom-up so each directory can easily look up the sizes of its subdirectories
for root, dirs, files in os.walk('python/Lib/email', topdown=False):
    # Sum the sizes of all non-directory files in this directory
    size = sum(getsize(join(root, name)) for name in files)
    # Sum the sizes of all subdirectories, taken from dirs_dict
    subdir_size = sum(dirs_dict[join(root, d)] for d in dirs)
    # Store this directory's size (subdirectories included) in the dict for later lookups
    my_size = dirs_dict[root] = size + subdir_size
    print('%s: %d' % (root, my_size))

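This code prints the size of every directory, subdirectories included. Adapt it as needed.

Finally, the asker mentioned some confusion in the comments: the documentation was hard to read, and the results did not match what they expected. Others suggested thinking bottom-up and using a dict to hold the size of each subdirectory. The asker said the idea made sense and thanked them for the help.

Note that getsize follows symbolic links. It can fail on a broken link, and you may not want linked files counted in the result. Use os.lstat(filename).st_size to get the size of the symbolic link itself.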
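To make the symlink point concrete, here is a minimal sketch of the bottom-up walk that uses os.lstat instead of getsize. The function name dir_sizes is made up for illustration, and it assumes you want each symlink counted at the size of the link itself rather than its target. Symlinked directories appear in dirs but are not walked by default, so they are treated as 0 here.

```python
import os


def dir_sizes(top):
    """Return {directory path: total bytes of the files beneath it}.

    Hypothetical helper for illustration; symlinks count as the size of
    the link itself (os.lstat), never the size of their target.
    """
    sizes = {}
    # Bottom-up: every subdirectory is finished before its parent.
    for root, dirs, files in os.walk(top, topdown=False):
        total = 0
        for name in files:
            # lstat does not follow links, so broken links do not raise.
            total += os.lstat(os.path.join(root, name)).st_size
        for d in dirs:
            # Symlinked or unreadable directories were never walked and
            # have no entry, so fall back to 0 for them.
            total += sizes.get(os.path.join(root, d), 0)
        sizes[root] = total
    return sizes


# Example use:
#     for path, size in sorted(dir_sizes('.').items()):
#         print(size, path)
```

The keys are exactly the root strings that os.walk yields, so a lookup should use paths built the same way, with os.path.join from the starting directory.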

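Anonymous · Answer 2 · 2023-05-13T14:01:29+00:00

The following script prints the size of every subdirectory under the specified directory. It should be platform-independent (POSIX, Windows, etc.). It also tries to benefit from caching calls to the recursive function where possible. If the argument is omitted, the script works on the current directory. The output is sorted by directory size, largest first, so you can adapt it to your needs.

Here is the solution code: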

from __future__ import print_function

import os
import sys
import operator


def null_decorator(ob):
    return ob


if sys.version_info >= (3, 2, 0):
    import functools
    my_cache_decorator = functools.lru_cache(maxsize=4096)
else:
    my_cache_decorator = null_decorator

start_dir = os.path.normpath(os.path.abspath(sys.argv[1])) if len(sys.argv) > 1 else '.'


@my_cache_decorator
def get_dir_size(start_path='.'):
    total_size = 0
    if 'scandir' in dir(os):
        # using fast 'os.scandir' method (new in version 3.5)
        for entry in os.scandir(start_path):
            if entry.is_dir(follow_symlinks=False):
                total_size += get_dir_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                total_size += entry.stat().st_size
    else:
        # using slow, but compatible 'os.listdir' method
        for entry in os.listdir(start_path):
            full_path = os.path.abspath(os.path.join(start_path, entry))
            if os.path.islink(full_path):
                continue
            if os.path.isdir(full_path):
                total_size += get_dir_size(full_path)
            elif os.path.isfile(full_path):
                total_size += os.path.getsize(full_path)
    return total_size


def get_dir_size_walk(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size


def bytes2human(n, format='%(value).1f %(symbol)s', symbols='customary'):
    """
    (c) http://code.activestate.com/recipes/578019/

    Convert n bytes into a human readable string based on format.
    symbols can be either "customary", "customary_ext", "iec" or "iec_ext",
    see: https://en.wikipedia.org/wiki/Binary_prefix#Specific_units_of_IEC_60027-2_A.2_and_ISO.2FIEC_80000

    >>> bytes2human(0)
    '0.0 B'
    >>> bytes2human(0.9)
    '0.0 B'
    >>> bytes2human(1)
    '1.0 B'
    >>> bytes2human(1.9)
    '1.0 B'
    >>> bytes2human(1024)
    '1.0 K'
    >>> bytes2human(1048576)
    '1.0 M'
    >>> bytes2human(1099511627776127398123789121)
    '909.5 Y'
    >>> bytes2human(9856, symbols="customary")
    '9.6 K'
    >>> bytes2human(9856, symbols="customary_ext")
    '9.6 kilo'
    >>> bytes2human(9856, symbols="iec")
    '9.6 Ki'
    >>> bytes2human(9856, symbols="iec_ext")
    '9.6 kibi'
    >>> bytes2human(10000, "%(value).1f %(symbol)s/sec")
    '9.8 K/sec'
    >>> # precision can be adjusted by playing with %f operator
    >>> bytes2human(10000, format="%(value).5f %(symbol)s")
    '9.76562 K'
    """
    SYMBOLS = {
        'customary'     : ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y'),
        'customary_ext' : ('byte', 'kilo', 'mega', 'giga', 'tera', 'peta', 'exa',
                           'zetta', 'yotta'),
        'iec'           : ('Bi', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi', 'Yi'),
        'iec_ext'       : ('byte', 'kibi', 'mebi', 'gibi', 'tebi', 'pebi', 'exbi',
                           'zebi', 'yobi'),
    }
    n = int(n)
    if n < 0:
        raise ValueError("n < 0")
    symbols = SYMBOLS[symbols]
    # Map each symbol after the first to its power-of-1024 multiplier
    prefix = {}
    for i, s in enumerate(symbols[1:]):
        prefix[s] = 1 << (i+1)*10
    # Pick the largest unit that does not exceed n
    for symbol in reversed(symbols[1:]):
        if n >= prefix[symbol]:
            value = float(n) / prefix[symbol]
            return format % locals()
    return format % dict(symbol=symbols[0], value=n)


############################################################
###
###  main ()
###
############################################################
if __name__ == '__main__':
    dir_tree = {}
    ### version, that uses 'slow' [os.walk method]
    #get_size = get_dir_size_walk
    ### this recursive version can benefit from caching the function calls (functools.lru_cache)
    get_size = get_dir_size

    for root, dirs, files in os.walk(start_dir):
        for d in dirs:
            dir_path = os.path.join(root, d)
            if os.path.isdir(dir_path):
                dir_tree[dir_path] = get_size(dir_path)

    for d, size in sorted(dir_tree.items(), key=operator.itemgetter(1), reverse=True):
        print('%s\t%s' % (bytes2human(size, format='%(value).2f%(symbol)s'), d))

    print('-' * 80)
    if sys.version_info >= (3, 2, 0):
        print(get_dir_size.cache_info())

Sample output:

37.61M .\subdir_b
2.18M .\subdir_a
2.17M .\subdir_a\subdir_a_2
4.41K .\subdir_a\subdir_a_1
----------------------------------------------------------
CacheInfo(hits=2, misses=4, maxsize=4096, currsize=4)

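Comment: I'm not sure what this function is doing, but it is not correct: get_dir_size('/var/lib/docker/overlay') returns 6889267157, while du -s /var/lib/docker/overlay returns 612820 and is an order of magnitude faster.

Reply: Thanks for your comment! I just found out that I was not checking for symbolic links on older Python versions (< 3.5). That should be fixed now. Could you check again?

Comment: Same problem :/ I'm on version 3.5.2.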
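The two numbers in that exchange may not be measuring the same thing. get_dir_size adds up apparent file sizes in bytes, whereas du -s reports allocated disk space in blocks (1 KiB units by default with GNU du) and counts a file with several hard links only once. Whether hard links explain the gap for that particular directory is an assumption; the thread never confirms it. A minimal sketch of a du-style total, under those assumptions and with the made-up name disk_usage, could look like this:

```python
import os


def _entries(top):
    """Yield top itself, then every directory and file path beneath it."""
    yield top
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            yield os.path.join(root, name)


def disk_usage(top):
    """Approximate `du -s top` in bytes (hypothetical helper).

    Counts allocated space rather than apparent size, and counts each
    (device, inode) pair once so hard links are not double counted.
    """
    seen = set()
    total = 0
    for path in _entries(top):
        st = os.lstat(path)  # never follow symlinks
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # another hard link to an inode already counted
        seen.add(key)
        # st_blocks is in 512-byte units on POSIX. It is missing on
        # Windows, where this falls back to the apparent size.
        blocks = getattr(st, "st_blocks", None)
        total += blocks * 512 if blocks is not None else st.st_size
    return total
```

Dividing the result by 1024 gives a figure in the same units as the default GNU du output. Exact agreement is not guaranteed, since du has its own handling of mount points and errors that this sketch ignores.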

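Anonymous · Answer 3 · 2023-03-10T19:19:36+00:00

The earlier answers cannot handle symbolic links inside the directory tree. To deal with that, the following code can be used to compute directory sizes recursively: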

import os
from os.path import join, getsize

directory = '.'  # assumption: root of the tree to measure; adjust as needed

dirs_dict = {}
for root, dirs, files in os.walk(directory, topdown=False):
    if os.path.islink(root):
        dirs_dict[root] = 0
    else:
        dir_size = getsize(root)
        # Go through all non-directory files in this directory and add up their sizes
        for name in files:
            full_name = join(root, name)
            if os.path.islink(full_name):
                nsize = 0
            else:
                nsize = getsize(full_name)
            dirs_dict[full_name] = nsize
            dir_size += nsize
        # Go through all subdirectories and add up their sizes from `dirs_dict`
        subdir_size = 0
        for d in dirs:
            full_d = join(root, d)
            if os.path.islink(full_d):
                dirs_dict[full_d] = 0
            else:
                subdir_size += dirs_dict[full_d]
        dirs_dict[root] = dir_size + subdir_size

With this code, directory sizes are computed recursively and symbolic links in the tree are counted as zero instead of being followed.