Continuing from the previous post, which focused mainly on how libfuse is used, this post analyzes how the contents of the proc files are actually generated.
Along the way we also study aither64's approach: aither64 implemented a way of generating the CPU view in the /proc files according to the container's CPU quota: Merge pull request #260 from aither64/cpu-views.
git clone https://github.com/aither64/lxcfs.git
cd lxcfs
git branch cpu-views -t origin/cpu-views
git checkout cpu-views
The content format and meaning of every proc file can be found in man proc.
The file interface of the cgroup cpuset controller:
cgroup.clone_children cpuset.cpus cpuset.memory_pressure cpuset.sched_load_balance
cgroup.event_control cpuset.mem_exclusive cpuset.memory_spread_page cpuset.sched_relax_domain_level
cgroup.procs cpuset.mem_hardwall cpuset.memory_spread_slab notify_on_release
cpuset.cpu_exclusive cpuset.memory_migrate cpuset.mems tasks
lxcfs assembles the information of the CPUs specified in cpuset.cpus (for example 0-2) into /proc/cpuinfo; each CPU's entry comes from the host's /proc/cpuinfo file.
This was analyzed in detail in the previous post: lxcfs generates /proc files inside containers from cgroup information such as cpu-share and cpu-quota (part 1): reading cpuinfo from cgroup.
When computing the container's CPU view, aither64 derives the number of CPUs to present from the following cgroup parameters; for their meaning see "cgroup cpu controller: the CFS bandwidth control parameters":
cpu.cfs_quota_us
cpu.cfs_period_us
The number of CPUs inside the container is computed as cpu.cfs_quota_us / cpu.cfs_period_us, rounded up.
int max_cpu_count(const char *cg)
{
...
	/* ceil(cpu.cfs_quota_us / cpu.cfs_period_us) */
	rv = cfs_quota / cfs_period;
	if ((cfs_quota % cfs_period) > 0)
		rv += 1;
...
}
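For reference, a self-contained sketch of the same ceiling division, reading the two files from a hypothetical cgroup path (the path and the helper function are illustrative, not lxcfs code):

#include <stdio.h>

/* Read a single integer from a cgroup file; the path layout is illustrative. */
static long read_cg_long(const char *path)
{
	long v = -1;
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

int main(void)
{
	long quota  = read_cg_long("/sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us");
	long period = read_cg_long("/sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us");

	if (quota <= 0 || period <= 0) {
		printf("no quota set, the container sees all host CPUs\n");
		return 0;
	}

	/* e.g. quota=150000, period=100000: 150000/100000 = 1, remainder > 0, so 2 CPUs */
	long cpus = quota / period;
	if (quota % period > 0)
		cpus += 1;
	printf("container CPU count: %ld\n", cpus);
	return 0;
}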
The /proc/stat file as seen on the host:
cat /proc/stat
cpu 27342995 37984 12363147 2702613541 214034 0 2148297 906148 0 0
cpu0 5646549 8330 2772710 677553049 52358 0 391408 214713 0 0
cpu1 7467113 8682 3269935 674623853 54243 0 666477 233736 0 0
cpu2 7014447 10816 3154326 674959733 56337 0 624994 242532 0 0
cpu3 7214885 10154 3166175 675476904 51093 0 465416 215164 0 0
intr 7205758905 153 10 0 0 0 0 3 0 0 0 0 38 15 0 0 0 0 0 0 0 0 0 0 0 0 2196063 0 525227905 10640 0 19 0 1706137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 8831068106
btime 1543312071
processes 4328224
procs_running 1
procs_blocked 0
softirq 4429444472 3 1334970939 157132 1834767679 0 0 336 652016670 0 607531713
The /proc/stat content seen inside the container through lxcfs:
sh-4.2# cat /proc/stat
cpu 13113538 17012 6042607 1352154304 106601 0 1057884 448446 0 0
cpu0 5646486 8330 2772697 677541736 52358 0 391407 214712 0 0
cpu1 7467052 8682 3269910 674612568 54243 0 666477 233734 0 0
intr 7205702049 153 10 0 0 0 0 3 0 0 0 0 38 15 0 0 0 0 0 0 0 0 0 0 0 0 2196042 0 525226982 10640 0 19 0 1706122 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 8830978881
btime 1543312071
processes 4328178
procs_running 2
procs_blocked 0
softirq 4429413658 3 1334956474 157132 1834765030 0 0 336 652007038 0 607527645
lxcfs only reworks the cpu-prefixed lines of the host's /proc/stat: it keeps only the CPUs set in the cgroup's cpuset.cpus, and recomputes their aggregate values (the first line).
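As a rough illustration (the struct and the values below are made up, not the lxcfs types), the aggregate line is just the field-by-field sum of the per-CPU lines that remain visible:

#include <stdio.h>

/* The ten counters of a "cpuN ..." line in /proc/stat (illustrative type). */
struct cpu_line {
	unsigned long user, nice, system, idle, iowait;
	unsigned long irq, softirq, steal, guest, guest_nice;
};

/* Sum the per-CPU lines that stay visible to the container into the
 * aggregate "cpu " line printed first. */
static struct cpu_line sum_visible(const struct cpu_line *cpus, int n)
{
	struct cpu_line t = {0};
	for (int i = 0; i < n; i++) {
		t.user += cpus[i].user;       t.nice += cpus[i].nice;
		t.system += cpus[i].system;   t.idle += cpus[i].idle;
		t.iowait += cpus[i].iowait;   t.irq += cpus[i].irq;
		t.softirq += cpus[i].softirq; t.steal += cpus[i].steal;
		t.guest += cpus[i].guest;     t.guest_nice += cpus[i].guest_nice;
	}
	return t;
}

int main(void)
{
	/* Two visible CPUs with made-up counter values. */
	struct cpu_line cpus[2] = {
		{ .user = 100, .system = 50, .idle = 1000 },
		{ .user = 200, .system = 70, .idle =  900 },
	};
	struct cpu_line t = sum_visible(cpus, 2);

	printf("cpu  %lu %lu %lu %lu 0 0 0 0 0 0\n", t.user, t.nice, t.system, t.idle);
	printf("cpu0 %lu %lu %lu %lu 0 0 0 0 0 0\n",
	       cpus[0].user, cpus[0].nice, cpus[0].system, cpus[0].idle);
	printf("cpu1 %lu %lu %lu %lu 0 0 0 0 0 0\n",
	       cpus[1].user, cpus[1].nice, cpus[1].system, cpus[1].idle);
	return 0;
}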
The numbers that follow on the cpu-prefixed lines of /proc/stat mean, in order:
user (1) Time spent in user mode.
nice (2) Time spent in user mode with low priority (nice).
system (3) Time spent in system mode.
idle (4) Time spent in the idle task. This value should be USER_HZ times the second entry in the /proc/uptime pseudo-file.
iowait (since Linux 2.5.41)
(5) Time waiting for I/O to complete.
irq (since Linux 2.6.0-test4)
(6) Time servicing interrupts.
softirq (since Linux 2.6.0-test4)
(7) Time servicing softirqs.
steal (since Linux 2.6.11)
(8) Stolen time, which is the time spent in other operating systems when running in a virtualized environment.
guest (since Linux 2.6.24)
(9) Time spent running a virtual CPU for guest operating systems under the control of the Linux kernel.
guest_nice (since Linux 2.6.33)
(10) Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel).
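A small parsing sketch for one such line, using the fields listed above (assumes a kernel recent enough to print all ten counters):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* One per-CPU line taken from the host sample above. */
	const char *line = "cpu0 5646549 8330 2772710 677553049 52358 0 391408 214713 0 0";
	char name[16];
	unsigned long user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice;

	int n = sscanf(line, "%15s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
		       name, &user, &nice, &system, &idle, &iowait,
		       &irq, &softirq, &steal, &guest, &guest_nice);
	if (n == 11 && strncmp(name, "cpu", 3) == 0)
		printf("%s: user=%lu system=%lu idle=%lu\n", name, user, system, idle);
	return 0;
}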
aither64 adjusts the file contents based on information from the cgroup cpuacct controller. cpuacct provides the following files:
cpuacct.usage: the CPU time the cgroup has actually received, in nanoseconds
cpuacct.stat: the cgroup's CPU time broken down into user time and system time, in units of `USER_HZ`
user 24
system 9
cpuacct.usage_percpu: the CPU time consumed by the cgroup on each CPU, in nanoseconds
One issue here: aither64's code reads the cpuacct.usage_all file, but I could not find this file on CentOS 7.2. @2019-02-20 10:55:47
static int proc_stat_read(char *buf, size_t size, off_t offset,
			  struct fuse_file_info *fi)
{
...
	/*
	 * Read cpuacct.usage_all for all CPUs.
	 * If the cpuacct cgroup is present, it is used to calculate the container's
	 * CPU usage. If not, values from the host's /proc/stat are used.
	 */
	if (read_cpuacct_usage_all(cg, cpuset, &cg_cpu_usage, &cg_cpu_usage_size) != 0) {
		lxcfs_debug("%s\n", "proc_stat_read failed to read from cpuacct, "
				"falling back to the host's /proc/stat");
	}
...
}
static int read_cpuacct_usage_all(char *cg, char *cpuset, struct cpuacct_usage **return_usage, int *size)
{
...
	if (!cgfs_get_value("cpuacct", cg, "cpuacct.usage_all", &usage_str)) {
		rv = -1;
		goto err;
	}
...
}
From the code, the per-CPU user time and system time are obtained from cpuacct.usage_all.
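A sketch of parsing that file; the format assumed here is a "cpu user sys" header line followed by one "<cpu> <user_ns> <system_ns>" line per CPU, and the cgroup path is hypothetical:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/cpuacct/mycontainer/cpuacct.usage_all", "r");
	if (!f) {
		perror("cpuacct.usage_all");
		return 1;
	}

	char header[64];
	if (!fgets(header, sizeof(header), f)) {  /* skip the "cpu user sys" header */
		fclose(f);
		return 1;
	}

	int cpu;
	unsigned long long user_ns, sys_ns;
	while (fscanf(f, "%d %llu %llu", &cpu, &user_ns, &sys_ns) == 3)
		printf("cpu%d: user=%llu ns system=%llu ns\n", cpu, user_ns, sys_ns);

	fclose(f);
	return 0;
}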
Using the cgroup path as the key, a cg_proc_stat struct is created for each cgroup (stored in a hash table) to hold the CPU usage read on the previous query; if no entry exists yet, a new one is created:
/* Data for CPU view */
struct cg_proc_stat {
	char *cg;
	struct cpuacct_usage *usage; // Real usage as read from the host's /proc/stat
	struct cpuacct_usage *view; // Usage stats reported to the container
	int cpu_count;
	pthread_mutex_t lock; // For node manipulation
	struct cg_proc_stat *next;
};
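A simplified sketch of the lookup-or-create pattern keyed by the cgroup path (the hash function, bucket count, and node layout below are illustrative; lxcfs has its own hashing and locking):

#include <stdlib.h>
#include <string.h>

#define CPUVIEW_HASH_SIZE 100  /* illustrative bucket count */

struct stat_node {
	char *cg;                  /* cgroup path, used as the hash key */
	struct stat_node *next;    /* chaining within a bucket */
	/* ... usage/view arrays, cpu_count, lock would live here ... */
};

static struct stat_node *buckets[CPUVIEW_HASH_SIZE];

/* Illustrative string hash; lxcfs uses its own. */
static unsigned hash_cg(const char *cg)
{
	unsigned h = 0;
	while (*cg)
		h = h * 31 + (unsigned char)*cg++;
	return h % CPUVIEW_HASH_SIZE;
}

/* Return the node for this cgroup, creating it on first access. */
static struct stat_node *find_or_create_node(const char *cg)
{
	unsigned h = hash_cg(cg);
	struct stat_node *node;

	for (node = buckets[h]; node; node = node->next)
		if (strcmp(node->cg, cg) == 0)
			return node;

	node = calloc(1, sizeof(*node));
	if (!node)
		return NULL;
	node->cg = strdup(cg);
	node->next = buckets[h];
	buckets[h] = node;
	return node;
}

int main(void)
{
	/* hypothetical cgroup path */
	return find_or_create_node("/docker/abc123") ? 0 : 1;
}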
On the next read of /proc/stat from the container, the current CPU data of the cgroup is read again. The newly computed times should always be larger than the previous result stored in the hash table; if they are smaller, the cgroup configuration has changed, and the old result is simply replaced with the new one.
First, only ceil(cpu.cfs_quota_us / cpu.cfs_period_us) CPU entries are reported, consistent with the way /proc/cpuinfo is adjusted.
Then the values themselves are adjusted: only user, system, and idle time are reported, everything else is 0, and the user, system, and idle values are further corrected.
If no cpuset is specified and only a CPU quota is set, the computed number of CPUs the container effectively owns may be, say, 2 while the system has 4 CPUs. The container actually runs on all 4 CPUs, so its usage of the 4 CPUs has to be redistributed onto 2 CPUs.
The view field of cg_proc_stat holds the corrected usage of those 2 CPUs, while usage holds the original usage of the 4 CPUs. The process is as follows:
static int cpuview_proc_stat(const char *cg, const char *cpuset, struct cpuacct_usage *cg_cpu_usage, int cg_cpu_usage_size, FILE *f, char *buf, size_t buf_size)
...
	for (curcpu = 0, i = -1; curcpu < nprocs; curcpu++) {
		stat_node->usage[curcpu].online = cg_cpu_usage[curcpu].online;
		if (!stat_node->usage[curcpu].online)
			continue;
		i++;

		stat_node->usage[curcpu].user += diff[curcpu].user;
		stat_node->usage[curcpu].system += diff[curcpu].system;
		stat_node->usage[curcpu].idle += diff[curcpu].idle;

		/* max_cpus is the computed number of CPUs allotted to the container.
		 * Usage of the CPUs beyond that is accumulated into the *_surplus
		 * counters and redistributed into the container's CPU view below. */
		if (max_cpus > 0 && i >= max_cpus) {
			user_surplus += diff[curcpu].user;
			system_surplus += diff[curcpu].system;
		}
	}

	/* Calculate usage counters of visible CPUs */
	if (max_cpus > 0) {
		/* threshold = maximum usage per cpu, including idle */
		threshold = total_sum / cpu_cnt * max_cpus;

		for (curcpu = 0, i = -1; curcpu < nprocs; curcpu++) {
			if (i == max_cpus)
				break;

			if (!stat_node->usage[curcpu].online)
				continue;
			i++;

			if (diff[curcpu].user + diff[curcpu].system >= threshold)
				continue;

			/* Add user */
			/* add in the usage taken from the CPUs outside the view */
			add_cpu_usage(
					&user_surplus,
					&diff[curcpu],
					&diff[curcpu].user,
					threshold);

			if (diff[curcpu].user + diff[curcpu].system >= threshold)
				continue;

			/* If there is still room, add system */
			add_cpu_usage(
					&system_surplus,
					&diff[curcpu],
					&diff[curcpu].system,
					threshold);
		}

		if (user_surplus > 0)
			lxcfs_debug("leftover user: %lu for %s\n", user_surplus, cg);
		if (system_surplus > 0)
			lxcfs_debug("leftover system: %lu for %s\n", system_surplus, cg);
...
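Judging from the call sites above, add_cpu_usage moves as much of the surplus into the given counter as fits without pushing that CPU's user+system total past the threshold. A sketch along those lines (not the verbatim lxcfs function):

#include <stdio.h>

/* Illustrative definition; lxcfs has its own cpuacct_usage type. */
struct cpuacct_usage {
	unsigned long user;
	unsigned long system;
	unsigned long idle;
	int online;
};

/* Move as much of *surplus into *counter as fits without pushing this CPU's
 * user+system total past threshold. Taking the added time back out of idle,
 * so that the CPU's total stays constant, is an assumption of this sketch. */
static void add_cpu_usage(unsigned long *surplus, struct cpuacct_usage *usage,
			  unsigned long *counter, unsigned long threshold)
{
	unsigned long used = usage->user + usage->system;
	unsigned long room = threshold > used ? threshold - used : 0;
	unsigned long to_add = room < *surplus ? room : *surplus;

	if (to_add > usage->idle)
		to_add = usage->idle;

	*counter += to_add;
	usage->idle -= to_add;
	*surplus -= to_add;
}

int main(void)
{
	struct cpuacct_usage cpu = { .user = 30, .system = 10, .idle = 60, .online = 1 };
	unsigned long surplus = 25, threshold = 50;

	add_cpu_usage(&surplus, &cpu, &cpu.user, threshold);
	/* room = 50 - 40 = 10, so 10 ticks move from the surplus into user */
	printf("user=%lu idle=%lu leftover surplus=%lu\n", cpu.user, cpu.idle, surplus);
	return 0;
}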
How the container's idle time on one CPU is computed:
1. Get this CPU's total used time, all_used, from the host's /proc/stat:
all_used = user + nice + system + iowait + irq + softirq + steal + guest + guest_nice;
2. Read the time the container's cgroup spent on this CPU from the cgroup directory.
aither64 reads it from cpuacct.usage_all and uses user time plus system time as the used time:
cg_used = cg_cpu_usage[curcpu].user + cg_cpu_usage[curcpu].system;
On CentOS 7.2, the cpuacct.usage_all file interface is not present, so this has to be changed to read directly from cpuacct.usage_percpu:
// The system has 4 CPUs in total; the container was given cpuset 0,1 (--cpuset-cpus "0,1")
325517080 54277298 0 0
// The system has 4 CPUs in total; the container was given cpuset 1,2 (--cpuset-cpus "1,2")
0 21977223 28985453 0
3. The container's idle time on this CPU is the idle time from /proc/stat plus (all_used minus cg_used);
if all_used is smaller than cg_used, the idle time from /proc/stat is used directly (a sketch follows after this list).
cg_cpu_usage[curcpu].idle = idle + (all_used - cg_used);
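Putting the three steps together, a sketch that reads cg_used from cpuacct.usage_percpu (as on CentOS 7.2) and computes the per-CPU idle value; the cgroup path, the fixed CPU count, and the USER_HZ value are assumptions of this sketch:

#include <ctype.h>
#include <stdio.h>

#define NPROCS 4           /* illustrative; use get_nprocs() in real code */
#define TICKS_PER_SEC 100  /* assumed USER_HZ */

int main(void)
{
	unsigned long long cg_used_ns[NPROCS] = {0};

	/* cpuacct.usage_percpu: one nanosecond counter per CPU on a single line
	 * (hypothetical cgroup path). */
	FILE *cg = fopen("/sys/fs/cgroup/cpuacct/mycontainer/cpuacct.usage_percpu", "r");
	if (cg) {
		for (int i = 0; i < NPROCS; i++)
			if (fscanf(cg, "%llu", &cg_used_ns[i]) != 1)
				break;
		fclose(cg);
	}

	FILE *st = fopen("/proc/stat", "r");
	if (!st)
		return 1;

	char line[1024];
	while (fgets(line, sizeof(line), st)) {
		int id;
		unsigned long user, nice, sys, idle, iowait, irq, softirq, steal, guest, gnice;

		/* Only the per-CPU "cpuN ..." lines, not the aggregate "cpu " line. */
		if (sscanf(line, "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
			   &id, &user, &nice, &sys, &idle, &iowait,
			   &irq, &softirq, &steal, &guest, &gnice) != 11)
			continue;
		if (!isdigit((unsigned char)line[3]) || id >= NPROCS)
			continue;

		unsigned long all_used = user + nice + sys + iowait + irq +
					 softirq + steal + guest + gnice;
		/* Convert the cgroup's nanoseconds into USER_HZ ticks. */
		unsigned long cg_used = cg_used_ns[id] / (1000000000UL / TICKS_PER_SEC);

		/* idle + (all_used - cg_used); fall back to the host idle on underflow. */
		unsigned long cg_idle = (all_used > cg_used) ? idle + (all_used - cg_used)
							     : idle;
		printf("cpu%d: container idle = %lu ticks\n", id, cg_idle);
	}
	fclose(st);
	return 0;
}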
Seen in the container                          Seen on the host                           replaced by (bytes/1024, i.e. kB)
sh-4.2# cat /proc/meminfo [root@~]# cat /proc/meminfo
MemTotal: 262144 kB MemTotal: 8010244 kB memory.limit_in_bytes
MemFree: 260576 kB MemFree: 2638016 kB memlimit - memusage
MemAvailable: 260576 kB MemAvailable: 7077116 kB memlimit - memusage + cache
Buffers: 0 kB Buffers: 0 kB 0
Cached: 0 kB Cached: 4322404 kB 0
SwapCached: 0 kB SwapCached: 0 kB 0
Active: 480 kB Active: 2775260 kB active_anon + active_file (in memory.stat)
Inactive: 0 kB Inactive: 1740296 kB inactive_anon + inactive_file (in memory.stat)
Active(anon): 480 kB Active(anon): 168704 kB active_anon (in memory.stat)
Inactive(anon): 0 kB Inactive(anon): 42808 kB inactive_anon (in memory.stat)
Active(file): 0 kB Active(file): 2606556 kB active_file (in memory.stat)
Inactive(file): 0 kB Inactive(file): 1697488 kB inactive_file (in memory.stat)
Unevictable: 0 kB Unevictable: 0 kB unevictable (in memory.stat)
Mlocked: 0 kB Mlocked: 0 kB
SwapTotal: 0 kB SwapTotal: 0 kB memory.memsw.limit_in_bytes
SwapFree: 0 kB SwapFree: 0 kB memory.memsw.limit_in_bytes - (memory.memsw.usage_in_bytes - memory.usage_in_bytes)
Dirty: 256 kB Dirty: 24 kB
Writeback: 0 kB Writeback: 0 kB
AnonPages: 194228 kB AnonPages: 193152 kB
Mapped: 142404 kB Mapped: 142344 kB
Shmem: 0 kB Shmem: 18360 kB 0 | total_shmem (in memory.stat)
Slab: 0 kB Slab: 649156 kB 0
SReclaimable: 0 kB SReclaimable: 441024 kB 0
SUnreclaim: 0 kB SUnreclaim: 208132 kB 0
KernelStack: 4624 kB KernelStack: 4480 kB
PageTables: 8132 kB PageTables: 7412 kB
NFS_Unstable: 0 kB NFS_Unstable: 0 kB
Bounce: 0 kB Bounce: 0 kB
WritebackTmp: 0 kB WritebackTmp: 0 kB
CommitLimit: 4005120 kB CommitLimit: 4005120 kB
Committed_AS: 1270196 kB Committed_AS: 1264104 kB
VmallocTotal: 34359738367 kB VmallocTotal: 34359738367 kB
VmallocUsed: 29888 kB VmallocUsed: 29888 kB
VmallocChunk: 34358800208 kB VmallocChunk: 34358800208 kB
HardwareCorrupted: 0 kB HardwareCorrupted: 0 kB
AnonHugePages: 14336 kB AnonHugePages: 14336 kB
HugePages_Total: 0 HugePages_Total: 0
HugePages_Free: 0 HugePages_Free: 0
HugePages_Rsvd: 0 HugePages_Rsvd: 0
HugePages_Surp: 0 HugePages_Surp: 0
Hugepagesize: 2048 kB Hugepagesize: 2048 kB
DirectMap4k: 1023992 kB DirectMap4k: 1023992 kB
DirectMap2M: 7364608 kB DirectMap2M: 7364608 kB
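A sketch of computing a few of these lines from the cgroup memory files; the cgroup path is hypothetical and only MemTotal, MemFree, and MemAvailable are shown:

#include <stdio.h>
#include <string.h>

/* Read a single number from a cgroup file (paths below are illustrative). */
static unsigned long long read_cg_ull(const char *path)
{
	unsigned long long v = 0;
	FILE *f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%llu", &v) != 1)
		v = 0;
	fclose(f);
	return v;
}

int main(void)
{
	const char *base = "/sys/fs/cgroup/memory/mycontainer";  /* hypothetical cgroup */
	char path[256];

	snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", base);
	unsigned long long limit = read_cg_ull(path);
	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", base);
	unsigned long long usage = read_cg_ull(path);

	/* "cache" from memory.stat feeds the MemAvailable line. */
	unsigned long long cache = 0;
	snprintf(path, sizeof(path), "%s/memory.stat", base);
	FILE *f = fopen(path, "r");
	if (f) {
		char key[64];
		unsigned long long val;
		while (fscanf(f, "%63s %llu", key, &val) == 2) {
			if (strcmp(key, "cache") == 0) {
				cache = val;
				break;
			}
		}
		fclose(f);
	}

	/* /proc/meminfo reports kB, hence the division by 1024. */
	printf("MemTotal:     %llu kB\n", limit / 1024);
	printf("MemFree:      %llu kB\n", (limit - usage) / 1024);
	printf("MemAvailable: %llu kB\n", (limit - usage + cache) / 1024);
	return 0;
}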
The cgroup memory controller contains the following file interfaces:
cgroup.clone_children memory.kmem.tcp.max_usage_in_bytes memory.oom_control
cgroup.event_control memory.kmem.tcp.usage_in_bytes memory.pressure_level
cgroup.procs memory.kmem.usage_in_bytes memory.soft_limit_in_bytes
memory.failcnt memory.limit_in_bytes memory.stat
memory.force_empty memory.max_usage_in_bytes memory.swappiness
memory.kmem.failcnt memory.memsw.failcnt memory.usage_in_bytes
memory.kmem.limit_in_bytes memory.memsw.limit_in_bytes memory.use_hierarchy
memory.kmem.max_usage_in_bytes memory.memsw.max_usage_in_bytes notify_on_release
memory.kmem.slabinfo memory.memsw.usage_in_bytes tasks
memory.kmem.tcp.failcnt memory.move_charge_at_immigrate
memory.kmem.tcp.limit_in_bytes memory.numa_stat
Among them, memory.stat contains multiple key-value pairs:
cache 0
rss 344064
rss_huge 0
mapped_file 0
swap 0
pgpgin 589
pgpgout 505
pgfault 1729
pgmajfault 0
inactive_anon 0
active_anon 344064
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 268435456
hierarchical_memsw_limit 536870912
total_cache 0
total_rss 344064
total_rss_huge 0
total_mapped_file 0
total_swap 0
total_pgpgin 589
total_pgpgout 505
total_pgfault 1729
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 344064
total_inactive_file 0
total_active_file 0
total_unevictable 0
The /proc/uptime file contains the system uptime and the system idle time:
This file contains two numbers: the uptime of the system (seconds), and the amount of time spent in the idle process (seconds).
For the system uptime, lxcfs uses the running time of process 1 inside the container, obtained as:
current time since system boot - (field 22 of /proc/[pid]/stat (in clock ticks) / the system's clock tick rate)
See "getting the start time and running time of a specified process" for the details.
The system idle time is obtained as:
system uptime - the CPU time used by the cgroup that contains process 1 of the container
The CPU time used by process 1 in the container is read from the cgroup. The file interfaces of the cgroup cpu controller and cpuacct controller are as follows (the two controllers are mounted to the same directory):
cgroup.clone_children cpuacct.stat cpu.cfs_quota_us cpu.stat kube-proxy tasks
cgroup.event_control cpuacct.usage cpu.rt_period_us docker notify_on_release user.slice
cgroup.procs cpuacct.usage_percpu cpu.rt_runtime_us kubelet release_agent
cgroup.sane_behavior cpu.cfs_period_us cpu.shares kubepods system.slice
cpuacct.usage / 1000000000 is the CPU time (in seconds) used by this cgroup directory, since cpuacct.usage is in nanoseconds (10^-9 s).
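A minimal sketch of this uptime/idle arithmetic, reading process 1's start time and a hypothetical cpuacct cgroup path; it illustrates the calculation described above, not the lxcfs implementation:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Host uptime: the first number in /proc/uptime (seconds). */
	double host_uptime = 0;
	FILE *f = fopen("/proc/uptime", "r");
	if (!f)
		return 1;
	if (fscanf(f, "%lf", &host_uptime) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* Field 22 of /proc/1/stat is starttime, in clock ticks since boot.
	 * comm (field 2) may contain spaces, so skip past the closing ')'. */
	f = fopen("/proc/1/stat", "r");
	if (!f)
		return 1;
	char line[4096];
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	char *p = strrchr(line, ')');
	if (!p)
		return 1;
	p++;

	unsigned long long starttime = 0;
	for (int field = 3; field <= 22; field++) {
		char tok[64];
		int consumed = 0;
		if (sscanf(p, " %63s%n", tok, &consumed) != 1)
			return 1;
		if (field == 22)
			sscanf(tok, "%llu", &starttime);
		p += consumed;
	}

	double uptime = host_uptime - (double)starttime / sysconf(_SC_CLK_TCK);

	/* CPU time consumed by the container's cgroup, from cpuacct.usage
	 * (nanoseconds); the cgroup path is hypothetical. */
	unsigned long long usage_ns = 0;
	f = fopen("/sys/fs/cgroup/cpuacct/mycontainer/cpuacct.usage", "r");
	if (f) {
		if (fscanf(f, "%llu", &usage_ns) != 1)
			usage_ns = 0;
		fclose(f);
	}

	double idle = uptime - (double)usage_ns / 1e9;
	printf("%.2f %.2f\n", uptime, idle);
	return 0;
}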
The contents and meaning of files such as /proc/uptime and /proc/[pid]/stat can all be found in man proc.
The container's /proc/diskstats file is generated from information in the cgroup blkio controller.
The file interface of the cgroup blkio controller is as follows:
blkio.io_merged blkio.io_wait_time_recursive blkio.throttle.write_iops_device
blkio.io_merged_recursive blkio.leaf_weight blkio.time
blkio.io_queued blkio.leaf_weight_device blkio.time_recursive
blkio.io_queued_recursive blkio.reset_stats blkio.weight
blkio.io_service_bytes blkio.sectors blkio.weight_device
blkio.io_service_bytes_recursive blkio.sectors_recursive cgroup.clone_children
blkio.io_serviced blkio.throttle.io_service_bytes cgroup.event_control
blkio.io_serviced_recursive blkio.throttle.io_serviced cgroup.procs
blkio.io_service_time blkio.throttle.read_bps_device notify_on_release
blkio.io_service_time_recursive blkio.throttle.read_iops_device tasks
blkio.io_wait_time blkio.throttle.write_bps_device
lxcfs uses the contents of the following five interface files to override the host's /proc/diskstats and form the container's /proc/diskstats:
blkio.io_serviced_recursive
blkio.io_merged_recursive
blkio.io_service_bytes_recursive
blkio.io_wait_time_recursive
blkio.io_service_time_recursive
Strangely, in my environment these files all just contain "Total 0". Disk I/O is a blind spot for me, and for now I don't fully understand what these parameters mean. 2019-02-15 18:13:44
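For reference, a small parsing sketch; the per-device line format assumed here is "<major>:<minor> <Read|Write|Sync|Async|Total> <count>" with a final "Total <count>" line, and the cgroup path is hypothetical:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/blkio/mycontainer/blkio.io_serviced_recursive", "r");
	if (!f) {
		perror("blkio.io_serviced_recursive");
		return 1;
	}

	char line[256];
	while (fgets(line, sizeof(line), f)) {
		unsigned major, minor;
		char op[16];
		unsigned long long count;

		if (sscanf(line, "%u:%u %15s %llu", &major, &minor, op, &count) != 4)
			continue;  /* skips the trailing "Total <count>" line */

		if (strcmp(op, "Read") == 0 || strcmp(op, "Write") == 0)
			printf("device %u:%u %s ops completed: %llu\n",
			       major, minor, op, count);
	}
	fclose(f);
	return 0;
}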
As with the generation of the /proc/meminfo content, the information in /proc/swaps is a subset of the meminfo information; swap_total and swap_free are obtained in the same way.
In fact, there is really only one thing you need to master: the content format and meaning of every /proc file, and of every cgroup file interface. Well... that task... is quite daunting. They are collected in "Linux cgroup features (3): a summary of cgroup controllers and their parameters (file interfaces)".