堆外内存溢出排查

# 一、问题

log频繁出现Swap区域使用过高的异常，配置了8G堆内内存，但实际使用的物理内存为12G。JVM参数配置如下：

-Xms8g -Xmx8g -XX:+UseG1GC -XX:G1HeapRegionSize=4M -XX:MetaspaceSize=512M -XX:MaxMetaspaceSize=512M -XX:-OmitStackTraceInFastThrow -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/logs/app -XX:+AlwaysPreTouch -XX:ReservedCodeCacheSize=128m -XX:InitialCodeCacheSize=128m

实际使用的物理内存如下：RES=11.4g

[shankou@mysql ~]# top
PID    USER     PR  NI  VIRT    RES    SHR  S  %CPU   %MEM   TIME+   COMMAND
43992  shankou  20  0   8993m  11.4g   1987 S  128.0  89.0  118:64.88  java

1
2
3

JVM为了优化会额外分配本地内存，这些额外的内存最终会消耗超过-Xmx的限制。堆通常是java应用程序中消耗内存最大的部分，但也有特殊情况，除了堆JVM也会从本地内存中分配对外内存维护其元数据、代码缓存ReservedCodeCacheSize、垃圾回收、JIT生成的代码、使用unsafe.allocateMemory和DirectByteBuffer申请的堆外内存和内部数据结构等。

DirectByteBuffer(int cap) {                   // package-private
   super(-1, 0, cap, cap);
   boolean pa = VM.isDirectMemoryPageAligned();
   int ps = Bits.pageSize();
   long size = Math.max(1L, (long)cap + (pa ? ps : 0));
   Bits.reserveMemory(size, cap);

   long base = 0;
   try {
      base = unsafe.allocateMemory(size);
   } catch (OutOfMemoryError x) {
      Bits.unreserveMemory(size, cap);
      throw x;
   }
   unsafe.setMemory(base, size, (byte) 0);
   if (pa && (base % ps != 0)) {
      // Round up to page boundary
      address = base + ps - (base & (ps - 1));
   } else {
      address = base;
   }
   cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
   att = null;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

# 二、排查

项目中添加-XX:NativeMemoryTracking=detail JVM参数重启项目，使用命令 jcmd process_pid VM.native_memory [summary|detail]查看到的内存分布如下：summary仅跟踪JVM子系统（如：Java heap、class、code、thread等）的内存使用情况。detail除了通过JVM子系统跟踪内存使用情况外，还可以通过单独的CallSite、单独的虚拟内存区域及其提交区域来跟踪内存使用情况。

Native Memory Tracking:

Total: reserved=12813709KB, committed=10497485KB
-                 Java Heap (reserved=10485761KB, committed=10485761KB)
                            (mmap: reserved=1048576KB, committed=1048576KB) 
 
-                     Class (reserved=10568992KB, committed=49951KB)
                            (classes #442)
                            (malloc=131KB #259) 
                            (mmap: reserved=1056768KB, committed=4864KB) 
 
-                    Thread (reserved=2585683KB, committed=2585681KB)
                            (thread #127)
                            (stack: reserved=2580485KB, committed=2580481KB)
                            (malloc=390KB #711) 
                            (arena=130KB #234)
 
-                      Code (reserved=2662734KB, committed=40013KB)
                            (malloc=33KB #309) 
                            (mmap: reserved=266240KB, committed=3968KB) 
 
-                        GC (reserved=1644035KB, committed=1644032KB)
                            (malloc=92723KB #6540) 
                            (mmap: reserved=71680KB, committed=71680KB) 
 
-                  Compiler (reserved=1523KB, committed=1523KB)
                            (malloc=4KB #36) 
                            (arena=148KB #21)
 
-                  Internal (reserved=148597KB, committed=148597KB)
                            (malloc=14827KB #3632) 
                            (mmap: reserved=32KB, committed=32KB) 
 
-                    Symbol (reserved=14238KB, committed=14238KB)
                            (malloc=936KB #111) 
                            (arena=488KB #1)
 
-    Native Memory Tracking (reserved=3302KB, committed=3302KB)
                            (malloc=118KB #1641) 
                            (tracking overhead=211KB)
 
-               Arena Chunk (reserved=1781KB, committed=1781KB)
                            (malloc=178KB) 
 
-                   Unknown (reserved=2048KB, committed=0KB)
                            (mmap: reserved=2048KB, committed=0KB) 

    ......

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48

操作系统对内存的分配管理典型地分为两个阶段：保留reserve和提交commit。保留阶段告知系统从某一地址开始到后面的dwSize大小的连续虚拟内存需要供程序使用，进程其他分配内存的操作不得使用这段内存；提交阶段将虚拟地址映射到对应的真实物理内存中，这样这块内存就可以正常使用。

NMT中的committed与top中RES为什么存在差异： RES表示任务当前使用的非交换物理内存，那按对操作系统commit提交内存的理解，这两者貌似应该对上。没对上的原因是因为JVM中所谓的commit内存，只是将内存mmaped映射为可读可写可执行的状态。而在Linux中，在分配内存时又是lazy allocation的机制，只有在进程真正访问时才分配真实的物理内存。所以NMT中所统计的committed并不是对应的真实的物理内存，自然与RES等统计方式无法对应起来。所以JVM为我们提供了一个参数-XX:+AlwaysPreTouch，使我们可以在启动之初就按照内存页粒度都访问一遍Heap，强制为其分配物理内存以减少运行时再分配内存造成的延迟（但是相应的会影响JVM进程初始化启动的时间），查看相关代码：

void os::pretouch_memory(char* start, char* end) {
  for (volatile char *p = start; p < end; p += os::vm_page_size()) {
    *p = 0;
  }
}

1
2
3
4
5

从上述的分析能够看出committed的内存小于物理内存，因为jcmd命令显示的内存包含堆内内存、Code区域、通过unsafe.allocateMemory和DirectByteBuffer申请的内存，但是不包含其他Native Code（C代码）申请的堆外内存。所以猜测是使用Native Code申请内存所导致的问题。

pmap工具是linux的工具,能够查看进程用了多少内存,还能分析内存用在上面环节,对于一些长期占用内存居高不下的程序可以分析其行为。

[root@pmap ~]# pmap -h

Usage:
 pmap [options] PID [PID ...]

Options:
 -x, --extended              show details
 -X                          show even more details
            WARNING: format changes according to /proc/PID/smaps
 -XX                         show everything the kernel provides
 -c, --read-rc               read the default rc
 -C, --read-rc-from=<file>   read the rc from file
 -n, --create-rc             create new default rc
 -N, --create-rc-to=<file>   create new rc to file
            NOTE: pid arguments are not allowed with -n, -N
 -d, --device                show the device format
 -q, --quiet                 do not display header and footer
 -p, --show-path             show path in the mapping
 -A, --range=<low>[,<high>]  limit results to the given range

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see pmap(1).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

返回值	说明
Address	内存开始地址
Kbytes	占用内存的字节数（KB）
RSS	保留内存的字节数（KB）
Dirty	脏页的字节数（包括共享和私有的）（KB）
Mode	内存的权限：read、write、execute、shared、private (写时复制)
Mapping	占用内存的文件、或[anon]（分配的内存）、或[stack]（堆栈）
Offset	文件偏移
Device	设备名 (major:minor)

使用了pmap查看内存分布，发现大量的64M的地址；而这些地址空间不在jcmd命令所给出的地址空间里面，基本上就断定就是这些64M的内存所导致。

[root@pmap ~]# pmap -x 10659 | sort -n -r -k 3
10659:   /usr/local/dsjprs/pamp/jfullskill/jfullSkillService
Address           Kbytes     RSS     Dirty   Mode  Mapping
00000000007c0000   8675432   7654365 7654365 r-x-- YDService  
00000000018f5000    136520   136520  136520  rw--- YDService  
0000000001f53000    103245   101123  101123  rw---   [ anon ] 
00007f42d4445000     65536    65536   65536  rw---   [ anon ]
00007f42d4446000     65536    65536   65536  rw---   [ anon ]
00007f42d4c47000     65536    65536   65536  rw---   [ anon ]
00007f42d4c47000     65536    65536   65536  rw---   [ anon ]
00007f42d5448000     65536    65536   65536  rw---   [ anon ]
00007f42d5c89000     65536    65536   65536  rw---   [ anon ]
00007f42d6889000     65536    65536   65536  rw---   [ anon ]
00007f42d688a000     65516    65436   65536  rw---   [ anon ]
00007f42d708b000     65512    65426   65536  rw---   [ anon ]
00007f42d788b000     65507    65416   65536  rw---   [ anon ]
00007f42d788c000     65506    65415   65536  rw---   [ anon ]
00007f42d808d000     65496    65406   655368 rw---   [ anon ]
00007f42d88cd000     65481    65401   65536  rw---   [ anon ]
......
total kB         11372567   8253634   6996065

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

确定是Native Code导致的内存溢。我们使用Google开源的一款系统性能分析工具集gperftools定位问题。gperftools将malloc的调用替换为它自己的tcmalloc，从而统计所有内存分配的行为，帮助我们更快的定位到发生泄漏的地方。

gperftools/heap.hprof_54321.0022.heap(2354 MB currently in use)
heap-gpertools/heap.hprof_54321.0022.heap
gperftools/heap.hprof_54321.0021.heap(2476 MB currently in use)
heap-gpertools/heap.hprof_54321.0021.heap
gperftools/heap.hprof_54321.0022.heap(2833 MB currently in use)
heap-gpertools/heap.hprof_54322.0022.heap
gperftools/heap.hprof_54321.0023.heap(3091 MB currently in use)
heap-gpertools/heap.hprof_54322.0023.heap
gperftools/heap.hprof_54321.0024.heap(8675 MB allocated cumulatively, 743 MB currently in use)
heap-gpertools/heap.hprof_54322.0024.heap
gperftools/heap.hprof_54321.0025.heap(10453 MB allocated cumulatively, 785 MB currently in use)
heap-gpertools/heap.hprof_54322.0025.heap
gperftools/heap.hprof_54321.0026.heap(11356 MB allocated cumulatively, 821 MB currently in use)
heap-gpertools/heap.hprof_54322.0026.heap

1
2
3
4
5
6
7
8
9
10
11
12
13
14

通过上述信息可以得到malloc申请的的内存最高到3G之后就释放了，之后始终维持在700M-900M。得到Native Code中没有使用malloc申请，可能使用mmap/brk申请的。使用gperftools没有追踪到这些内存，直接使用命令strace -f -e brk,mmap,munmap -p pid追踪OS申请内存请求，并没有发现可以内存申请。

mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1 , 0) = 0x8b948304f9000
munmap(0x8b948304f9000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1 , 0) = 0x8b948304f9000
munmap(0x8b948304f9000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1 , 0) = 0x8b948304f9000
munmap(0x8b948304f9000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1 , 0) = 0x8b948304f9000
munmap(0x8b948304f9000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1 , 0) = 0x8b948304f9000
munmap(0x8b948304f9000, 4096) = 0

1
2
3
4
5
6
7
8
9
10

接着，使用GDB去dump可疑内存，想着看看内存中的情况。就是直接使用命令gdp -pid pid进入GDB之后，然后使用命令dump memory mem.bin startAddress endAddressdump内存，其中startAddress和endAddress可以从/proc/pid/smaps中查找。然后使用strings mem.bin查看dump的内容，如下：

StackMap Table
SourceFile
rorsimulate/internal/org/apache/commons/lang3/tuple/ImmutablePair
&lt;L:Ljava/lang/Object:R:Ljava/lang/Object:>Lcom/meituan/trip/errorsimulate/internal/org/apache/commons/lang3/tuple/Pair&lt;TL>
Kcom/meituan/trip/errorsimulate/internal/org/apache/common/lang3/tuple/Pair
ImmutablePair.java
serialVersionUID
left
Ljava/lang/Object;
right
l(Ljava/lang/Object:Ljava/lang/Object;)Lcom/meituan/trip/errorsimulate/internal/org/apache/commons/lang3/tuple/ImmutablePair
&lt;init>
'(Ljava/lang/Object:Ljava/lang/Object;)V
this
^Lcom/meituan/trip/errorsimulate/internal/org/apache/commons/lang3/tuple/ImmutablePair&lt;TL;TR;>;
VLcom/meituan/trip/errorsimulate/internal/org/apache/commons/lang3/tuple/ImmutablePair;
getLeft

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

从内容上来看，像是解压后的JAR包信息。读取JAR包信息应该是在项目启动的时候，那么在项目启动之后使用strace作用就不是很大了。所以应该在项目启动的时候使用strace，而不是启动完成之后。再次，项目启动时使用strace去追踪系统调用项目启动使用strace追踪系统调用，发现确实申请了很多64M的内存空间，截图如下：

[pid 18763] mmap(munmap(0x8b948304f9000, 4096) = 0, 67193874, PROT_NONE MAP_PRIVATE|MAP_NORESERVE, -1, 0) = 0x7f43c0000000

使用该mmap申请的地址空间在pmap对应如下：

0000x7f43c0000000 65528  26784 26784 rw--  [anno]

最后，使用jstack去查看对应的线程: 因为strace命令中已经显示申请内存的线程ID。直接使用命令jstack pid去查看线程栈，找到对应的线程栈（注意10进制和16进制转换）如下：

'main' #1 prio=5 os_prio=0 tid=0x00004f543355ae10 nid=ox4323 runnable [0x00004f432dl10000]
  java.lang.Thread.State: RUNNABLE
      at java.io.RandomAccessFile.readBytes(Native Method)
      at java.io.RandomAccessFile.read(RandomAccessFile.java:377)
      at org.springframework.boot.loader.data.RandomAccessDataFile$FileAccess.read(RandomAccessDataFile.java:224)
      - locked &lt;0x000000006c02a43a0> (a java.lang.Object)
      at org.springframework.boot.loader.data.RandomAccessDataFile$FileAccess.access$400(RandomAccessDataFile.java:206)
      at org.springframework.boot.loader.data.RandomAccessDataFile.read(RandomAccessDataFile.java:117)
      at org.springframework.boot.loader.data.RandomAccessDataFile.read(RandomAccessDataFile.java:101)
      at org.springframework.boot.loader.jar.JarFileEntries.getEntryData(JarFileEntries.java:209)
      at org.springframework.boot.loader.jar.JarFileEntries.getInputStream(JarFileEntries.java:189)
      at org.springframework.boot.loader.jar.JarFile.getInputStream(JarFile.java:223)
      - locked &lt;0x000000006c05ds784> (a org.springframework.boot.loader.jar.JarFile)
      at org.reflections.vfs.ZipFile.openInputStream(ZipFile.java:27)
      at org.reflections.adapters.javassistAdapter.getOfCreateClassObject(JavassistAdapter.java:98)
      at org.reflections.adapters.javassistAdapter.getOfCreateClassObject(JavassistAdapter.java:24)
      at org.reflections.scanner.AbstractScanner.scan(AbstractScanner.java:30)
      at org.reflections.Reflections.scan(Reflections.java:253)
      at org.reflections.Reflections.scan(Reflections.java:202)
      at org.reflections.Reflections.&lt;init>(Reflections.java:123)
      at com.sankuai.meituan.config.v2.MtConfigClientV2.initReflections(MtConfigClientV2.java:173)
      - locked &lt;0x000000006c05fg456> (a java.lang.Object)
      at com.sankuai.meituan.config.v2.MtConfigClientV2.scanAnnotation(MtConfigClientV2.java:130)
      at com.sankuai.meituan.config.v2.MtConfigClientV2.init(MtConfigClientV2.java:107)
      at com.sankuai.meituan.config.MtConfigClient.initInvoker(MtConfigClient.java:113)
      at com.sankuai.meituan.config.MtConfigClient.init(MtConfigClient.java:67)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sum.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

这里基本上就可以看出问题来了：使用了Reflections进行扫包，底层使用了Spring Boot去加载JAR。因为解压JAR使用Inflater类，需要用到堆外内存，然后使用Btrace去追踪这个类，栈如下：

com.sankuai.meituan.config.MtConfigClient.init(MtConfigClient.java:67)

发现没有配置扫包路径，默认是扫描所有的包。于是修改代码，配置扫包路径，发布上线后内存问题解决。

# 三、对外内存没释放的原因

Spring Boot对Java JDK的InflaterInputStream进行了包装并且使用了Inflater，而Inflater本身用于解压JAR包的需要用到堆外内存。Inflater这个对象本身实现了finalize方法，在这个方法中有调用释放堆外内存的逻辑。也就是说Spring Boot依赖于GC释放堆外内存。Inflater对应的C代码初始化时使用了malloc申请内存，end的时候也是用free释放了内存。

这时，再看gperftools的内存分布情况，发现使用Spring Boot时，内存使用一直在增加，突然某个点内存使用下降了好多（使用量直接由3G降为700M左右）。这个点应该就是GC引起的，内存应该释放了，但是在操作系统层面并没有看到内存变化，那是不是没有释放到操作系统，被内存分配器持有了呢？

继续探究，发现系统默认的内存分配器（glibc 2.12版本）和使用gperftools内存地址分布差别很明显，2.5G地址使用smaps发现它是属于Native Stack。内存地址分布如下：

pmap -x 33243
......
0000000000fwe000 3845543 2643054 2643054 rw--- [ anno ]
......

1
2
3
4

基本上可以确定是内存分配器的问题，glibc从2.11开始对每个线程引入内存池（64位机器大小就是64M内存）。

在扫描包的时候，Spring Boot不会主动去释放堆外内存，导致在扫描阶段，堆外内存占用量一直持续飙升。当发生GC的时候，Spring Boot依赖于finalize机制去释放了堆外内存；但是glibc为了性能考虑，并没有真正把内存归返到操作系统，而是留下来放入内存池了，导致应用层以为发生了“内存泄漏”。Spring Boot的最新版本2.0.5.RELEASE已经做了修改，在ZipInflaterInputStream主动释放了堆外内存不再依赖GC；所以Spring Boot升级到最新版本，这个问题也可以得到解决。

← JVM Metaspace内存溢出问题在线调试工具 Arthas →