Java对于大堆内存的处理速度较慢,不会过度进行垃圾回收。
Java对于大堆内存的处理速度较慢,不会过度进行垃圾回收。
原始问题
我正在运行一个Java程序,假设叫做program.jar,它有一个“小”的初始堆大小(1GB)和一个“大”的初始堆大小(16GB)。当我用小的初始堆大小运行它时,即
java -jar -Xms1g -Xmx16g program.jar
程序在12秒内终止(多次运行的平均值)。相比之下,当我用大的初始堆大小运行它时,即
java -jar -Xms16g -Xmx16g program.jar
程序在30秒内终止(多次运行的平均值)。
我了解到,一般来说,大的堆可能会导致过多的垃圾回收,从而使程序变慢:
- Java very large heap sizes
- Does the Sun JVM slow down when more memory is allocated via -Xmx?
- Under what circumstances does Java performance degrade with more memory?
- Java slower with big heap
然而,当我使用-verbose:gc
标志运行program.jar时,没有任何垃圾回收活动被报告,即使是在大的初始堆大小下。使用小的初始堆大小时,只有在程序初始化阶段之前,才会有一些垃圾回收活动。因此,过度的垃圾回收并不能解释我的观察结果。
为了使问题更加混淆(至少对我来说),我有一个功能等效的程序,称为program2.jar,它与program.jar具有相同的输入输出行为。主要的区别在于program2.jar使用的数据结构比program.jar的内存效率低,至少在内存方面(是否也与时间效率有关,这实际上是我想确定的)。但无论我是使用小的初始堆还是大的初始堆运行program2.jar,它总是在大约22秒内终止(包括大约2-3秒的GC过程)。
所以,这是我的问题:(如何)大堆会让程序变慢,除了过多的GC处理之外?
(这个问题可能与Georg在《Java slower with big heap》中的问题相似,但他的问题与堆无关。在我的情况下,我觉得它一定与堆有关,因为它是program.jar两次运行之间唯一的区别。)
这里有一些相关的细节。我正在使用Java 7,OpenJDK:
> java -version java version "1.7.0_75" OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
我的机器有两个E5-2690V3处理器(http://ark.intel.com/products/81713),位于两个插槽中(关闭了超线程和Turbo Boost),并具有充足的内存(64GB),程序运行前约有一半的内存可用:
> free total used free shared buffers cached Mem: 65588960 31751316 33837644 20 154616 23995164 -/+ buffers/cache: 7601536 57987424 Swap: 1023996 11484 1012512
最后,程序具有多个线程(约70个)。
编辑1
响应 Bruno Reis 和 kdgregory
我收集了一些额外的统计数据。这是针对带有较小初始堆的program.jar:
Command being timed: "java -Xms1g -Xmx16g -verbose:gc -jar program.jar" User time (seconds): 339.11 System time (seconds): 29.86 Percent of CPU this job got: 701% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.61 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 12192224 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1 Minor (reclaiming a frame) page faults: 771372 Voluntary context switches: 7446 Involuntary context switches: 27788 Swaps: 0 File system inputs: 10216 File system outputs: 120 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0
这是针对带有较大初始堆的program.jar:
Command being timed: "java -Xms16g -Xmx16g -verbose:gc -jar program.jar" User time (seconds): 769.13 System time (seconds): 28.04 Percent of CPU this job got: 1101% Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.34 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 10974528 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 16 Minor (reclaiming a frame) page faults: 687727 Voluntary context switches: 6769 Involuntary context switches: 68465 Swaps: 0 File system inputs: 2032 File system outputs: 160 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0
(这里报告的墙钟时间与我原始帖子中报告的时间不同,因为有一个之前未计时的初始化阶段。)
Edit 2
我收集了一些关于缓存的统计数据。这是针对带有较小初始堆的program.jar(平均值在30次运行中):
2719852136 cache-references ( +- 1.56% ) [42.11%] 1931377514 cache-misses # 71.010 % of all cache refs ( +- 0.07% ) [42.11%] 56748034419 L1-dcache-loads ( +- 1.34% ) [42.12%] 16334611643 L1-dcache-load-misses # 28.78% of all L1-dcache hits ( +- 1.70% ) [42.12%] 24886806040 L1-dcache-stores ( +- 1.47% ) [42.12%] 2438414068 L1-dcache-store-misses ( +- 0.19% ) [42.13%] 0 L1-dcache-prefetch-misses [42.13%] 23243029 L1-icache-load-misses ( +- 0.66% ) [42.14%] 2424355365 LLC-loads ( +- 1.73% ) [42.15%] 278916135 LLC-stores ( +- 0.30% ) [42.16%] 515064030 LLC-prefetches ( +- 0.33% ) [10.54%] 63395541507 dTLB-loads ( +- 0.17% ) [15.82%] 7402222750 dTLB-load-misses # 11.68% of all dTLB cache hits ( +- 1.87% ) [21.08%] 20945323550 dTLB-stores ( +- 0.69% ) [26.34%] 294311496 dTLB-store-misses ( +- 0.16% ) [31.60%] 17012236 iTLB-loads ( +- 2.10% ) [36.86%] 473238 iTLB-load-misses # 2.78% of all iTLB cache hits ( +- 2.88% ) [42.12%] 29390940710 branch-loads ( +- 0.18% ) [42.11%] 19502228 branch-load-misses ( +- 0.57% ) [42.11%] 53.771209341 seconds time elapsed ( +- 0.42% )
这是针对带有较大初始堆的program.jar(平均值在30次运行中):
10465831994 cache-references ( +- 3.00% ) [42.10%] 1921281060 cache-misses # 18.358 % of all cache refs ( +- 0.03% ) [42.10%] 51072650956 L1-dcache-loads ( +- 2.14% ) [42.10%] 24282459597 L1-dcache-load-misses # 47.54% of all L1-dcache hits ( +- 0.16% ) [42.10%] 21447495598 L1-dcache-stores ( +- 2.46% ) [42.10%] 2441970496 L1-dcache-store-misses ( +- 0.13% ) [42.10%] 0 L1-dcache-prefetch-misses [42.11%] 24953833 L1-icache-load-misses ( +- 0.78% ) [42.12%] 10234572163 LLC-loads ( +- 3.09% ) [42.13%] 240843257 LLC-stores ( +- 0.17% ) [42.14%] 462484975 LLC-prefetches ( +- 0.22% ) [10.53%] 62564723493 dTLB-loads ( +- 0.10% ) [15.80%] 12686305321 dTLB-load-misses # 20.28% of all dTLB cache hits ( +- 0.01% ) [21.06%] 19201170089 dTLB-stores ( +- 1.11% ) [26.33%] 279236455 dTLB-store-misses ( +- 0.10% ) [31.59%] 16259758 iTLB-loads ( +- 4.65% ) [36.85%] 466127 iTLB-load-misses # 2.87% of all iTLB cache hits ( +- 6.68% ) [42.11%] 28098428012 branch-loads ( +- 0.13% ) [42.10%] 18707911 branch-load-misses ( +- 0.82% ) [42.10%] 73.576058909 seconds time elapsed ( +- 0.54% )
比较绝对数字,较大的初始堆导致L1-dcache-load-misses增加约50%,dTLB-load-misses增加约70%。我为dTLB-load-misses进行了一次粗略的计算,假设在我的2.6 GHZ机器上每个miss需要100个周期(来源:维基百科),这导致大初始堆与小初始堆相比延迟了484秒,而小初始堆仅延迟了284秒。我不知道如何将此数字转换为每个内核的延迟(可能不仅仅是除以内核数?),但数量级似乎是有道理的。
在收集这些统计数据后,我还对大和小初始堆(每种情况均为一次运行)的-XX:+PrintFlagsFinal
进行了diff:
< uintx InitialHeapSize := 17179869184 {product} --- > uintx InitialHeapSize := 1073741824 {product}
因此,-Xms
不会影响其他标志。以下是使用小初始堆的 -XX:+PrintGCDetails
输出程序.jar 的结果:
[GC [PSYoungGen: 239882K->33488K(306176K)] 764170K->983760K(1271808K), 0.0840630 secs] [Times: user=0.70 sys=0.66, real=0.09 secs] [Full GC [PSYoungGen: 33488K->0K(306176K)] [ParOldGen: 950272K->753959K(1508352K)] 983760K->753959K(1814528K) [PSPermGen: 2994K->2993K(21504K)], 0.0560900 secs] [Times: user=0.20 sys=0.03, real=0.05 secs] [GC [PSYoungGen: 234744K->33056K(306176K)] 988704K->983623K(1814528K), 0.0416120 secs] [Times: user=0.69 sys=0.03, real=0.04 secs] [GC [PSYoungGen: 264198K->33056K(306176K)] 1214765K->1212999K(1814528K), 0.0489600 secs] [Times: user=0.61 sys=0.23, real=0.05 secs] [Full GC [PSYoungGen: 33056K->0K(306176K)] [ParOldGen: 1179943K->1212700K(2118656K)] 1212999K->1212700K(2424832K) [PSPermGen: 2993K->2993K(21504K)], 0.1589640 secs] [Times: user=2.27 sys=0.10, real=0.16 secs] [GC [PSYoungGen: 230538K->33056K(431616K)] 1443238K->1442364K(2550272K), 0.0523620 secs] [Times: user=0.69 sys=0.23, real=0.05 secs] [GC [PSYoungGen: 427431K->33152K(557568K)] 1836740K->1835676K(2676224K), 0.0774750 secs] [Times: user=0.64 sys=0.72, real=0.08 secs] [Full GC [PSYoungGen: 33152K->0K(557568K)] [ParOldGen: 1802524K->1835328K(2897920K)] 1835676K->1835328K(3455488K) [PSPermGen: 2993K->2993K(21504K)], 0.2019870 secs] [Times: user=2.74 sys=0.13, real=0.20 secs] [GC [PSYoungGen: 492503K->33280K(647168K)] 2327831K->2327360K(3545088K), 0.0870810 secs] [Times: user=0.61 sys=0.92, real=0.09 secs] [Full GC [PSYoungGen: 33280K->0K(647168K)] [ParOldGen: 2294080K->2326876K(3603968K)] 2327360K->2326876K(4251136K) [PSPermGen: 2993K->2993K(21504K)], 0.0512730 secs] [Times: user=0.09 sys=0.12, real=0.05 secs] Heap PSYoungGen total 647168K, used 340719K [0x00000006aaa80000, 0x00000006dd000000, 0x0000000800000000) eden space 613376K, 55% used [0x00000006aaa80000,0x00000006bf73bc10,0x00000006d0180000) from space 33792K, 0% used [0x00000006d2280000,0x00000006d2280000,0x00000006d4380000) to space 33792K, 0% used [0x00000006d0180000,0x00000006d0180000,0x00000006d2280000) ParOldGen total 3603968K, used 2326876K [0x0000000400000000, 0x00000004dbf80000, 0x00000006aaa80000) object space 3603968K, 64% used [0x0000000400000000,0x000000048e0572d8,0x00000004dbf80000) PSPermGen total 21504K, used 3488K [0x00000003f5a00000, 0x00000003f6f00000, 0x0000000400000000) object space 21504K, 16% used [0x00000003f5a00000,0x00000003f5d68070,0x00000003f6f00000)
以下是使用大初始堆的程序.jar 的输出:
Heap PSYoungGen total 4893696K, used 2840362K [0x00000006aaa80000, 0x0000000800000000, 0x0000000800000000) eden space 4194816K, 67% used [0x00000006aaa80000,0x000000075804a920,0x00000007aab00000) from space 698880K, 0% used [0x00000007d5580000,0x00000007d5580000,0x0000000800000000) to space 698880K, 0% used [0x00000007aab00000,0x00000007aab00000,0x00000007d5580000) ParOldGen total 11185152K, used 0K [0x00000003fff80000, 0x00000006aaa80000, 0x00000006aaa80000) object space 11185152K, 0% used [0x00000003fff80000,0x00000003fff80000,0x00000006aaa80000) PSPermGen total 21504K, used 3489K [0x00000003f5980000, 0x00000003f6e80000, 0x00000003fff80000) object space 21504K, 16% used [0x00000003f5980000,0x00000003f5ce8400,0x00000003f6e80000)
访问内存会消耗CPU时间。访问更多的内存不仅线性地增加了CPU时间,还可能增加缓存压力,从而导致缓存错失率的增加,使CPU时间超线性增加。
使用perf stat java -jar ...
运行你的程序,以查看缓存错失的数量。参见Perf教程
图片来源:"Systems Performance: Enterprise and the Cloud Paperback", Brendan Gregg, ISBN: 978-0133390094
由于初始堆大小也影响到伊甸园空间的大小,而较小的伊甸园空间会触发GC,因此这可能会导致更紧凑的堆,这对缓存友好(没有在堆里临时开辟对象)。
为了减少两次运行之间的差异,请将初级和max young generation大小设置为相同的值,以便两次运行仅老年代大小不同。这应该会导致相同的性能(可能)。
顺便说一句:您还可以尝试使用巨型页启动JVM,这可能会(您需要衡量!)通过进一步减少TLB错误而获得额外的%性能。
注意未来的读者:限制新gen大小并不一定使您的JVM更快,这只是触发了@Peng的特定工作负载更快的GC效果。
在启动后手动触发GC将产生相同的效果。