mmap

Date: 2021-12-18 Author: sinkinben

An introduction to the Linux system call mmap.

Before we start, take a look at this diagram.

Linux IO Stack

Version 1.0: http://域名/files/域名域名

The conventional approach

In most cases, we perform I/O like this:

int fd = open(filename, flags, mode);
read(fd, buffer, size);

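The resulting call path is roughly:

read
sys_read
vfs_read: checks whether the requested data is in the Page Cache. On a hit it returns immediately; on a miss the kernel allocates page-cache pages and reads the file contents into them.
The kernel issues an I/O request to the Generic Block Layer, whose job is to hide the differences between storage devices such as SSDs, HDDs, and USB drives.
The I/O request reaches the I/O Scheduler; after that comes the actual hardware I/O (which we ignore for now).

So what is I/O scheduling for?

Before SSDs, hard disk drives (HDDs) were the standard storage device. An HDD has heads, tracks, and a rotation speed, and data lives in the sectors of each track. One obvious job of the I/O Scheduler is therefore to order I/O requests so that the head travels as short a distance as possible, which also helps reduce the average time processes spend blocked.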
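To make the benefit of reordering concrete, here is a toy sketch that compares the head travel of first-come-first-served ordering with servicing the same requests in ascending track order. It is not how any real Linux scheduler works; the function names and the request queue are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Total distance the head travels when servicing requests in the given order. */
static long head_travel(const long *reqs, size_t n, long start)
{
    long pos = start, total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += labs(reqs[i] - pos);
        pos = reqs[i];
    }
    return total;
}

/* qsort comparator for long values, ascending. */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Travel distance if the same requests are serviced in ascending track order. */
static long sorted_travel(const long *reqs, size_t n, long start)
{
    long *copy = malloc(n * sizeof *copy);
    if (copy == NULL)
        return -1;
    for (size_t i = 0; i < n; ++i)
        copy[i] = reqs[i];
    qsort(copy, n, sizeof *copy, cmp_long);
    long total = head_travel(copy, n, start);
    free(copy);
    return total;
}
```

For the hypothetical queue 98, 183, 37, 122, 14, 124, 65, 67 with the head starting at track 53, head_travel gives 640 tracks in arrival order, while sorted_travel gives 208.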

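The O_DIRECT flag

As the diagram above shows, Linux manages disks through a single abstraction, the block device.

From the user's point of view, programs are written against VFS: they read and write files with the basic open/close/write/read APIs. VFS, however, caches file contents in memory, so a write call only writes to the cache (memory), not to the actual file on disk. This is known as delayed write.

So when does the written data actually reach the disk? See fsync, fdatasync, and sync.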
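As a minimal sketch of forcing a delayed write to disk, the helper below writes a buffer and returns success only after fsync has completed. The name write_durably is hypothetical, and handling of interrupted system calls is left out for brevity.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path and return 0 only after fsync reports success. */
static int write_durably(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* lands in the page cache */
        if (n == -1) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) == -1) {               /* flush data and metadata to the device */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Replacing fsync with fdatasync would skip metadata that is not needed to read the data back, which can be cheaper.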

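In some scenarios (databases, new storage systems) we want to minimize the number of data copies, for example by bypassing the VFS Page Cache. The O_DIRECT flag does this; mmap does not bypass the Page Cache for file mappings, but it avoids the extra copy by mapping the cached pages directly into the process's address space.

According to version 1.0 of the Linux I/O Stack diagram, O_DIRECT bypasses the cache maintained by VFS and goes straight to the file system, although the file system may do its own caching; in the diagram, the most direct path is mmap, which leads to the generic block layer.

man 2 open describes O_DIRECT as follows: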

Try to minimize cache effects of the I/O to and from this file.  In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching.  File I/O is done directly to/from user-space buffers.  The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred.  To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.

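As the description shows, O_DIRECT does not guarantee that data has reached the disk when the call returns; to guarantee that, combine it with O_SYNC. The I/O then becomes synchronous: each write blocks until the device has completed it, so a workload that issues many such requests spends much of its time waiting (clearly not a good thing).

Linus himself seems to hold O_DIRECT in contempt: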

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." --Linus

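Memory mapping with mmap

mmap stands for memory mapping: it maps a file (specified by the file descriptor fd) into a range of the process's virtual address space. It is a typical zero-copy mechanism, saving one kernel -> user space data copy.

Note that "file" here means a file in the VFS sense, for example a regular file, a shared memory object from shm_open, or a device file; the same applies below. Not every descriptor supports mmap, though: a pipe, for instance, fails with ENODEV.

With this API we can:

Read, write, and copy file contents as if they were memory, with fewer data copies.
Combine it with shm_open to implement shared memory between processes.
Access a very large file at random offsets, where mmap works better than regular file reads and writes.

API definition: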

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

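Description:

Calling mmap does not allocate any physical memory.
After mmap returns, the first access to an address in [addr, addr + length) raises a page fault; only then is a physical page allocated and the contents of fd loaded into it.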

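Parameters:

addr: if NULL, the kernel picks a region in the process's virtual address space for the mapping (think about which region lies between the heap and the stack), so we usually pass NULL. If addr is a user-chosen address, the kernel treats it as a hint and, on Linux, places the mapping at a nearby page boundary.
length: the length to map.

prot describes the memory protection of the mapping: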

PROT_EXEC  Pages may be executed.
PROT_READ  Pages may be read.
PROT_WRITE Pages may be written.
PROT_NONE  Pages may not be accessed.

flags describes whether modifications to the address range [addr, addr + length) are shared, and whether they are flushed to the file.

The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:
- MAP_SHARED
  Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).)
- MAP_PRIVATE
  Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.

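For the meaning of the other flags, see man mmap.

fd is the file descriptor to map, and offset is the offset within the file at which the mapping starts; it must be a multiple of the page size.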
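Putting the parameters together, here is a minimal sketch that maps a whole file read-only and scans it like an ordinary array. The helper name count_byte is hypothetical, and the file is assumed to fit in the address space.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte ch in the file at path; returns -1 on error. */
static long count_byte(const char *path, char ch)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {       /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; ++i)
        if (data[i] == ch)
            ++count;

    munmap((void *)data, len);
    return count;
}
```

Calling count_byte(path, '\n') would, for example, count the lines of a text file without a single explicit read call.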

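SHARED and PRIVATE

The diagrams below show the behavior of the two mmap modes.

SHARED means every modification to the mapped memory is "synchronized" to the mapped object; the typical use case is shared memory between processes.
PRIVATE uses copy-on-write: as long as no process modifies the mapped memory, all processes read the same physical page; once a process writes, the page is copied and the new page becomes private to that process.
- A typical case: fork creates a child process. If the child only reads the data, the OS lets the parent and child share the data and code segments. If one of them modifies a buffer, the OS performs copy-on-write, and the buffer then exists as two copies in different physical pages.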
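The difference between the two modes can be checked with a small experiment: modify the first byte of a file through a mapping, then read the byte back from the file itself. This is a sketch with a hypothetical helper name, and it assumes the file exists and holds at least one byte.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Overwrite the first byte of the file through a mapping created with
 * the given flags (MAP_SHARED or MAP_PRIVATE), then return the first
 * byte as read back from the file itself. Returns -1 on error.
 */
static int poke_first_byte(const char *path, int flags, char ch)
{
    int fd = open(path, O_RDWR);
    if (fd == -1)
        return -1;

    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    p[0] = ch;                       /* modify the mapped page */
    if (flags & MAP_SHARED)
        msync(p, 1, MS_SYNC);        /* push the change to the file now */
    munmap(p, 1);

    char in_file;
    ssize_t n = pread(fd, &in_file, 1, 0);
    close(fd);
    return n == 1 ? (unsigned char)in_file : -1;
}
```

With MAP_PRIVATE the write lands in a private copy of the page and the file keeps its old byte; with MAP_SHARED the file shows the new byte.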

[Figures: behavior of SHARED and PRIVATE mappings]

Suppose we have this piece of code:

#include <stdio.h>
int main()
{
    puts("");
    while (1);
}

Run it in the background with ./域名 & and inspect the address-space mappings of the process with cat /proc/$pid/maps:

...
5630be30c000-5630be32d000 rw-p 00000000   [heap]
7f05d579b000-7f05d57c0000 r--p 00000000   /usr/lib/x86_64-linux-gnu/libc-域名
7f05d57c0000-7f05d5938000 r-xp 00025000   /usr/lib/x86_64-linux-gnu/libc-域名
7f05d5938000-7f05d5982000 r--p 0019d000   /usr/lib/x86_64-linux-gnu/libc-域名
7f05d5982000-7f05d5983000 ---p 001e7000   /usr/lib/x86_64-linux-gnu/libc-域名
7f05d5983000-7f05d5986000 r--p 001e7000   /usr/lib/x86_64-linux-gnu/libc-域名
7f05d5986000-7f05d5989000 rw-p 001ea000   /usr/lib/x86_64-linux-gnu/libc-域名
7f05d5989000-7f05d598f000 rw-p 00000000 
7f05d599f000-7f05d59a0000 r--p 00000000   /usr/lib/x86_64-linux-gnu/ld-域名
7f05d59a0000-7f05d59c3000 r-xp 00001000   /usr/lib/x86_64-linux-gnu/ld-域名
7f05d59c3000-7f05d59cb000 r--p 00024000   /usr/lib/x86_64-linux-gnu/ld-域名
7f05d59cc000-7f05d59cd000 r--p 0002c000   /usr/lib/x86_64-linux-gnu/ld-域名
7f05d59cd000-7f05d59ce000 rw-p 0002d000   /usr/lib/x86_64-linux-gnu/ld-域名
7f05d59ce000-7f05d59cf000 rw-p 00000000 
7ffc57b20000-7ffc57b41000 rw-p 00000000   [stack]
7ffc57b85000-7ffc57b89000 r--p 00000000   [vvar]
7ffc57b89000-7ffc57b8b000 r-xp 00000000   [vdso]
...

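The binary code of functions such as puts and printf lives in the shared library 域名 (unless we ask the compiler to link statically). When a program is dynamically linked against it, the library is mapped into the process with mmap at load time.

Now let's trace the system calls of 域名 with strace.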

execve("./域名", ["./域名"], 0x7fff9b085890 /* 33 vars */) = 0
brk(NULL)                               = 0x562755d26000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffd09c585d0) = -1 EINVAL (Invalid argument)
access("/etc/域名oad", R_OK)      = -1 ENOENT (No such file or directory)
# First, a file under /etc is opened and mapped for the dynamic linker 域名
openat(AT_FDCWD, "/etc/域名e", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=61731, ...}) = 0
mmap(NULL, 61731, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f1eeacca000
close(3)                                = 0
# Open the shared library file 域名
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/域名.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
# ...
# Map segments of the file into the virtual address space
mmap(0x7f1eeaafb000, 1540096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f1eeaafb000
mmap(0x7f1eeac73000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19d000) = 0x7f1eeac73000
mmap(0x7f1eeacbe000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f1eeacbe000
mmap(0x7f1eeacc4000, 13528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1eeacc4000
close(3)                                = 0
# ...
# puts("")
write(1, "\n", 1)                       = 1

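When the shell runs a command cmd, the logic is roughly:

fork a child process.
In the child, load and run the binary of cmd through an exec function.

In the output above:

First, the dynamic linker 域名 opens and maps a file under /etc.

域名 is the dynamic linker. From man 域名:
The programs 域名 and ld-域名* find and load the shared objects (shared libraries) needed by a program, prepare the program to run, and then run it.

Then the shared library file 域名 is opened.
Finally, the contents of 域名 (code and data segments) are mapped into the process's virtual address space. The file offsets 0x25000, 0x19d000, and so on match the output of /proc/{pid}/maps.

Notice that these mmap calls use the two flags MAP_PRIVATE|MAP_DENYWRITE. Code such as printf and puts should be read-only, so why is a private mapping needed?

Consider the library function strtok, whose implementation uses a static variable to remember where the previous call stopped. So although printf itself is read-only, other functions in libc may modify data, and with MAP_PRIVATE each process gets its own copy of the pages it writes. See an implementation from Apple for reference.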

Shared memory

First, the first process, p1.c:

#include <fcntl.h> /* For O_* constants */
#include <sys/mman.h>
#include <sys/stat.h> /* For mode constants */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main()
{
    const int len = 1024;
    const char *name = "shm1";
    int shmfd = shm_open(name, O_RDWR | O_CREAT, 0777);

    if (shmfd == -1)
        exit(EXIT_FAILURE);
    // extend the shared memory object, since it is created with size 0
    if (ftruncate(shmfd, len) == -1)
        exit(EXIT_FAILURE);

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, shmfd, 0);
    // check the result before touching the mapping
    if (addr == MAP_FAILED)
        exit(EXIT_FAILURE);

    memcpy(addr, "hello", 6); // 5 characters plus the terminating '\0'

    munmap(addr, len);
    close(shmfd);
}

Note that p1 does not call shm_unlink to remove the shared memory object, so it continues to exist in the kernel after p1 exits.

Compile and run:

gcc p1.c -o p1 -lrt
./p1

Then:

$ cat /dev/shm/shm1 
hello

The second process, p2.c:

#include <fcntl.h> /* For O_* constants */
#include <sys/mman.h>
#include <sys/stat.h> /* For mode constants */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main()
{
    const int len = 1024;
    const char *name = "shm1";
    // open the object created by p1; without O_CREAT this fails if p1 has not run
    int shmfd = shm_open(name, O_RDWR, 0);

    if (shmfd == -1)
        exit(EXIT_FAILURE);

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, shmfd, 0);
    // check the result before reading from the mapping
    if (addr == MAP_FAILED)
        exit(EXIT_FAILURE);

    puts(addr);

    munmap(addr, len);
    close(shmfd);
    shm_unlink(name);
}

Compile and run it the same way; puts(addr) prints hello. Afterwards, ls /dev/shm no longer shows the shm1 file, because p2 called shm_unlink.

Random file access

First, create a 4G file 域名 with the dd command.

$ ls -lh 域名 
-rw-r--r-- 1 xxx xxx 域名 Dec 16 18:22 域名

Now perform random reads on this file:

Each read copies 4096 bytes into a buffer on the stack.
1e6 random reads in total.

Using lseek and read for the random reads:

#include <fcntl.h> /* For O_* constants */
#include <sys/mman.h>
#include <sys/stat.h> /* For mode constants */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>
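int main()
{
    srand(time(NULL));
    uint64_t size = (uint64_t)4 * 1024 * 1024 * 1024;
    uint64_t counter = (uint64_t)(1e6);
    int fd = open("./域名", O_RDONLY);
    if (fd == -1)
    {
        perror("open");
        exit(EXIT_FAILURE);
    }
    char buf[4096];
    for (uint64_t i = 0; i < counter; ++i)
    {
        // rand() returns at most RAND_MAX (2^31 - 1 with glibc), so the
        // offsets only cover the first 2 GiB of the file
        off_t offset = (uint64_t)rand() % (size - sizeof(buf));
        lseek(fd, offset, SEEK_SET);
        if (read(fd, buf, sizeof(buf)) == -1)
        {
            perror("read");
            exit(EXIT_FAILURE);
        }
    }
    close(fd);
}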

Using mmap for the random reads:

#include <fcntl.h> /* For O_* constants */
#include <sys/mman.h>
#include <sys/stat.h> /* For mode constants */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>

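int main()
{
    srand(time(NULL));

    uint64_t size = (uint64_t)4 * 1024 * 1024 * 1024;
    uint64_t counter = (uint64_t)(1e6);

    int fd = open("./域名", O_RDONLY);
    if (fd == -1)
    {
        perror("open");
        exit(EXIT_FAILURE);
    }
    void *addr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED)
    {
        puts("mmap failed");
        exit(EXIT_FAILURE);
    }
    char buf[4096];

    for (uint64_t i = 0; i < counter; ++i)
    {
        // keep offset + sizeof(buf) inside the mapping; rand() returns at most
        // RAND_MAX (2^31 - 1 with glibc), so only the first 2 GiB are sampled
        off_t offset = (uint64_t)rand() % (size - sizeof(buf));
        memcpy(buf, (const char *)addr + offset, sizeof(buf));
    }

    munmap(addr, size);
    close(fd);
}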

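Execution time comparison

Use the built-in time command to measure the execution time:

# random reads the conventional way
$ time ./common
real    域名
user    域名
sys     域名
# random reads with mmap
$ time ./mmap
real    域名
user    域名
sys     域名

What the three metrics mean:

See StackOverflow.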

Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).

User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.

Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.

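An intuitive rule of thumb: the smaller the absolute value of real and the larger the share of user in it, the better the program's I/O performance.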

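Why is random access with mmap better than the conventional approach?

With conventional reads, a call to read(fd, buf, size) may trigger readahead, which loads several pages near the requested offset into memory (a design based on the principle of locality). Here the access pattern is random, so locality does not hold; readahead instead adds overhead by loading file pages that are unlikely to be accessed. In addition, every iteration pays for two system calls (lseek and read), and the data is copied once more from the kernel's Page Cache into the user-space buffer.
With mmap, every access to the address ptr = addr + offset first checks whether the page containing ptr is in memory. If not, a page fault occurs and the kernel reads the page containing ptr from disk (it may also map a few neighbouring pages that are already cached). The data goes from block I/O into memory that the process reads directly, with no extra copy.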

PAGE FAULT comparison

The perf command also reports page faults and other counters:

$ perf stat ./common

 Performance counter stats for './common':

           域名 msec task-clock:u              #    域名 CPUs utilized          
                 0      context-switches:u        #    域名 K/sec                  
                 0      cpu-migrations:u          #    域名 K/sec                  
                41      page-faults:u             #    域名 K/sec                  
         161839663      cycles:u                  #    域名 GHz                    
          84068713      instructions:u            #    域名  insn per cycle         
          27018959      branches:u                #   域名 M/sec                  
             48067      branch-misses:u           #    域名% of all branches        

       域名81243 seconds time elapsed

       域名70000 seconds user
       域名53000 seconds sys


$ perf stat ./mmap

 Performance counter stats for './mmap':

            域名 msec task-clock:u              #    域名 CPUs utilized          
                 0      context-switches:u        #    域名 K/sec                  
                 0      cpu-migrations:u          #    域名 K/sec                  
             32807      page-faults:u             #    域名 M/sec                  
        2443079663      cycles:u                  #    域名 GHz                    
          67103101      instructions:u            #    域名  insn per cycle         
          16052590      branches:u                #   域名 M/sec                  
             34696      branch-misses:u           #    域名% of all branches        

       域名02420 seconds time elapsed

       域名63000 seconds user
       域名03000 seconds sys

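As the output shows, the conventional version takes far fewer page faults than mmap (41 versus 32807): read copies data into a buffer that is already mapped, so file access itself does not fault in user space, whereas the mmap version faults whenever it touches a page that is not yet mapped. A page fault is expensive, yet the conventional version is still slower than mmap, even though the test machine has 8G of memory and can cache the whole file. Because the access pattern is random, readahead does little to raise the cache hit rate and only adds extra read overhead.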