[Kernel memory] [arm64] Zone watermarks (watermark) and reserved memory (lowmem_reserve) in detail

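This article walks through the structure of watermarks in the Linux kernel, how they are initialized, the roles of min_free_kbytes and watermark_scale_factor, and how users can tune them. It focuses on the per-zone watermark checks and on where they matter in practice, such as storage services and mixed (co-located) deployments.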

1 Introduction to watermarks

Every zone of physical memory in Linux has its own three watermark levels:

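  • Min watermark (WMARK_MIN): if the zone's free page count falls below the min watermark, the zone is severely short of memory.
  • Low watermark (WMARK_LOW): if the free page count falls below the low watermark, the zone is slightly short of memory. By default this value is 125% of WMARK_MIN (this holds when the gap between levels is set by min/4 rather than by watermark_scale_factor).
  • High watermark (WMARK_HIGH): if the zone's free page count is above the high watermark, the zone has plenty of free memory. By default this value is 150% of WMARK_MIN, under the same condition.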

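When the allocator (for example the buddy allocator) finds during an allocation that free memory is below "low" but still above "min", memory is under some pressure. Once this allocation completes, kswapd is woken to reclaim memory. The allocation triggers reclaim in this case but is never blocked by it; the two run asynchronously.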

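How much memory does kswapd have to reclaim before its job is done? It only needs to bring free memory back up to the "high" watermark. How much work that takes depends on the gap between the current free memory and "high": the larger the gap, the more memory has to be reclaimed. "low" can be seen as a warning level and "high" as a safe level.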

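If the allocator finds that free memory has dropped below "min", memory is severely short. Two cases apply. In the default case the allocator waits synchronously for reclaim to finish before allocating; this is direct reclaim. In the special case, the allocating context carries the PF_MEMALLOC flag and the current free memory can still satisfy the request, so the allocation again goes first and reclaim follows.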
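The three cases above can be summarized in a small userspace sketch. It is not kernel code: the enum, the struct and both function names are hypothetical, the boundary handling (`>=`) is a choice made for this illustration, and the PF_MEMALLOC case is reduced to a single `may_use_reserve` argument without checking whether the remaining free memory covers the request.

```c
#include <assert.h>

/* Hypothetical names for this sketch; the kernel does not define them. */
enum alloc_action {
    ALLOC_FAST,            /* free >= low: allocate, no reclaim needed */
    ALLOC_WAKE_KSWAPD,     /* min <= free < low: allocate, reclaim runs asynchronously */
    ALLOC_DIRECT_RECLAIM,  /* free < min: reclaim synchronously, then allocate */
    ALLOC_FROM_RESERVE     /* free < min, but the caller may allocate first */
};

struct zone_wmarks {
    unsigned long min;
    unsigned long low;
    unsigned long high;
};

/* Decide how an allocation proceeds for a given number of free pages. */
enum alloc_action classify_alloc(const struct zone_wmarks *w,
                                 unsigned long free_pages,
                                 int may_use_reserve)
{
    assert(w->min <= w->low && w->low <= w->high);

    if (free_pages >= w->low)
        return ALLOC_FAST;
    if (free_pages >= w->min)
        return ALLOC_WAKE_KSWAPD;
    return may_use_reserve ? ALLOC_FROM_RESERVE : ALLOC_DIRECT_RECLAIM;
}

/* kswapd's stopping condition: free pages are back at or above "high". */
int kswapd_done(const struct zone_wmarks *w, unsigned long free_pages)
{
    return free_pages >= w->high;
}
```

With watermarks of 1000/1250/1500 pages, 1100 free pages falls in the asynchronous kswapd case, while 900 free pages leads to direct reclaim unless the caller is allowed to use the reserve.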

2 Watermark-related data structures

How does each zone record its watermark levels, and how are those values read?

struct zone {
    /* Read-mostly fields */

    /*
     * Watermark values, indexed by WMARK_MIN/WMARK_LOW/WMARK_HIGH. Used by
     * the page allocator and by kswapd page reclaim; access them through
     * the *_wmark_pages(zone) macros.
     */
    unsigned long watermark[NR_WMARK];
    unsigned long nr_reserved_highatomic;
    /* Memory reserved in this zone */
    long lowmem_reserve[MAX_NR_ZONES];
    ...
};

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

enum zone_watermarks {
	WMARK_MIN,
	WMARK_LOW,
	WMARK_HIGH,
	NR_WMARK
};

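3 Watermark initialization

How are the three watermark levels of each zone calculated?

Before looking at the calculation, we need to understand what several kernel variables mean.

3.1 What managed_pages, spanned_pages and present_pages mean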

/*
 * spanned_pages is the total pages spanned by the zone, including
 * holes, which is calculated as:
 *  spanned_pages = zone_end_pfn - zone_start_pfn;
 *
 * present_pages is physical pages existing within the zone, which
 * is calculated as:
 *  present_pages = spanned_pages - absent_pages(pages in holes);
 *
 * managed_pages is present pages managed by the buddy system, which
 * is calculated as (reserved_pages includes pages allocated by the
 * bootmem allocator):
 *  managed_pages = present_pages - reserved_pages;
 */
unsigned long       managed_pages;
unsigned long       spanned_pages;
unsigned long       present_pages;
  • spanned_pages: all pages spanned by the zone, including holes. Formula: zone_end_pfn - zone_start_pfn
  • present_pages: all physical pages that actually exist in the zone. Formula: spanned_pages - hole_pages
  • managed_pages: all pages managed by the buddy system. Formula: present_pages - reserved_pages
  • The relationship between the three is: spanned_pages >= present_pages >= managed_pages
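As a worked illustration of the three formulas, the following userspace sketch derives the counters from a zone's PFN range. The struct and function names are hypothetical, and `hole_pages` and `reserved_pages` are passed in directly as assumed inputs rather than discovered the way the kernel does it.

```c
#include <assert.h>

/* Hypothetical container for the three per-zone counters. */
struct zone_pages {
    unsigned long spanned_pages;
    unsigned long present_pages;
    unsigned long managed_pages;
};

/* Apply the three formulas in order: spanned -> present -> managed. */
struct zone_pages zone_page_counts(unsigned long zone_start_pfn,
                                   unsigned long zone_end_pfn,
                                   unsigned long hole_pages,
                                   unsigned long reserved_pages)
{
    struct zone_pages z;

    assert(zone_end_pfn >= zone_start_pfn);
    z.spanned_pages = zone_end_pfn - zone_start_pfn;

    assert(hole_pages <= z.spanned_pages);
    z.present_pages = z.spanned_pages - hole_pages;

    assert(reserved_pages <= z.present_pages);
    z.managed_pages = z.present_pages - reserved_pages;

    return z;
}
```

For a zone covering PFNs 0x80000 to 0x100000 with 0x1000 pages of holes and 0x2000 reserved pages, this gives spanned_pages = 524288, present_pages = 520192 and managed_pages = 512000.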

3.2 What is min_free_kbytes

min_free_kbytes:
 
This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
 
Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
 
Setting this too high will OOM your machine instantly.

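From the above:

  • min_free_kbytes is the minimum amount of free memory the system keeps in reserve
  • watermark[WMARK_MIN] is calculated from min_free_kbytes

The kernel initializes min_free_kbytes in the function init_per_zone_wmark_min. The computed value has a lower and an upper bound: it cannot be less than 128 KiB or more than 65536 KiB. In practice, a value of at least 1024 KiB is usually recommended.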

// Count the pages above the high watermark in ZONE_DMA and ZONE_NORMAL
lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);           
min_free_kbytes = int_sqrt(lowmem_kbytes * 16);  
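To make the formula concrete, here is a userspace sketch that evaluates it for a given amount of low memory and applies the 128 KiB and 65536 KiB bounds mentioned above. `isqrt_floor` is a plain bitwise integer square root written for this example as a stand-in for the kernel's int_sqrt(), and `default_min_free_kbytes` is a hypothetical name.

```c
#include <assert.h>

/* Floor of the square root, computed bit by bit (stand-in for int_sqrt()). */
unsigned long isqrt_floor(unsigned long long x)
{
    unsigned long long r = 0;
    unsigned long long bit = 1ULL << 62;

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return (unsigned long)r;
}

/* min_free_kbytes = sqrt(lowmem_kbytes * 16), clamped to [128, 65536]. */
unsigned long default_min_free_kbytes(unsigned long lowmem_pages,
                                      unsigned long page_size)
{
    unsigned long long lowmem_kbytes;
    unsigned long value;

    assert(page_size >= 1024);
    lowmem_kbytes = (unsigned long long)lowmem_pages * (page_size >> 10);
    value = isqrt_floor(lowmem_kbytes * 16);

    if (value < 128)
        value = 128;
    if (value > 65536)
        value = 65536;
    return value;
}
```

With 4 KiB pages, 1048576 low-memory pages (4 GiB) give lowmem_kbytes = 4194304 and min_free_kbytes = sqrt(67108864) = 8192. Very small inputs are raised to 128, and inputs above 256 GiB of low memory are capped at 65536.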

3.3 Initializing the low, min and high watermark levels

3.3.1 How the watermarks are initialized

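The watermarks of ZONE_HIGHMEM are a special case. 64-bit systems no longer use ZONE_HIGHMEM, so it is not covered in detail here; the figure below shows how it is calculated.

In non-ZONE_HIGHMEM zones, the min watermark is derived from the computed min_free_kbytes, and the low and high watermarks are then derived from that min value. The figure below shows the calculation.

[Figure: watermark calculation process (image not available)]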

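Note: according to the figure above, the lowmem_pages calculation involves each zone's high watermark. At this stage of initialization, however, no zone's high watermark has been set yet, so each one is presumably 0, i.e. lowmem_pages = ZONE_NORMAL->managed_pages + ZONE_DMA->managed_pages (this has yet to be verified).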
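Since the original figure is not available, the derivation of min, low and high can be sketched in userspace C. The sketch is modeled on my reading of __setup_per_zone_wmarks in 4.x kernels and should be treated as an assumption rather than a copy of the kernel code: the struct and function names are hypothetical, only non-highmem zones are modeled, and the gap between levels is taken as the larger of min/4 and managed_pages * watermark_scale_factor / 10000.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone record for this sketch. */
struct wmark_zone {
    unsigned long managed_pages;
    unsigned long wmark_min;
    unsigned long wmark_low;
    unsigned long wmark_high;
};

/*
 * Split min_free_kbytes across the zones in proportion to managed_pages,
 * then place low and high one and two gaps above min.
 */
void setup_wmarks_sketch(struct wmark_zone *zones, size_t nr_zones,
                         unsigned long min_free_kbytes,
                         unsigned int page_shift,
                         unsigned long watermark_scale_factor)
{
    unsigned long long pages_min;
    unsigned long long lowmem_pages = 0;
    size_t i;

    assert(page_shift >= 10);
    pages_min = min_free_kbytes >> (page_shift - 10);  /* KiB -> pages */

    for (i = 0; i < nr_zones; i++)
        lowmem_pages += zones[i].managed_pages;
    if (lowmem_pages == 0)
        return;

    for (i = 0; i < nr_zones; i++) {
        unsigned long long min = pages_min * zones[i].managed_pages / lowmem_pages;
        unsigned long long gap = min >> 2;
        unsigned long long scaled = (unsigned long long)zones[i].managed_pages *
                                    watermark_scale_factor / 10000;

        if (scaled > gap)
            gap = scaled;

        zones[i].wmark_min = (unsigned long)min;
        zones[i].wmark_low = (unsigned long)(min + gap);
        zones[i].wmark_high = (unsigned long)(min + 2 * gap);
    }
}
```

Under these assumptions, a single zone with 1048576 managed pages, min_free_kbytes = 8192, 4 KiB pages and watermark_scale_factor = 10 gets min = 2048, low = 3096 and high = 4144. With watermark_scale_factor = 1 the min/4 term dominates and the result is min = 2048, low = 2560 and high = 3072, which matches the 125% and 150% ratios.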

3.3.2 Kernel code for watermark initialization

First, the function flow chart:

[Figure: function call flow (image not available)]

The init_per_zone_wmark_min function
//mm/page_alloc.c
/*
 * Initialise min_free_kbytes.
 *
 * For small machines we want it small (128k min).  For large machines
 * we want it large (64MB max).  But it is not linear, because network
 * bandwidth does not increase linearly with machine size.  We use
 *
 *	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
 *	min_free_kbytes = sqrt(lowmem_kbytes * 16)
 */
int __meminit init_per_zone_wmark_min(void)
{
	unsigned long lowmem_kbytes;
	int new_min_free_kbytes;

	/*
	 * nr_free_buffer_pages() returns the total number of pages above the
	 * high watermark in ZONE_DMA and ZONE_NORMAL:
	 * nr_free_buffer_pages = managed_pages - high_pages
	 */
	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
	// Compute new_min_free_kbytes as described in the comment above this function
	new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
	/*
	 * Pick the final min_free_kbytes based on new_min_free_kbytes
	 * (user_min_free_kbytes is the value the user set through the proc
	 * interface):
	 *  1. If new_min_free_kbytes > user_min_free_kbytes, min_free_kbytes
	 *     takes the value of new_min_free_kbytes, clamped to the range
	 *     [128, 65536] KiB.
	 *  2. If new_min_free_kbytes <= user_min_free_kbytes, min_free_kbytes
	 *     keeps the user-defined value.
	 */
	if (new_min_free_kbytes > user_min_free_kbytes) {
		min_free_kbytes = new_min_free_kbytes;
		if (min_free_kbytes < 128)
			min_free_kbytes = 128;
		if (min_free_kbytes > 65536)
			min_free_kbytes = 65536;
	} else {
		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
				new_min_free_kbytes, user_min_free_kbytes);
	}
	// Use the initialized min_free_kbytes to compute each zone's min, low and high watermarks
	setup_per_zone_wmarks();
	refresh_zone_stat_thresholds();
	setup_per_zone_lowmem_reserve();

#ifdef CONFIG_NUMA
	setup_min_unmapped_ratio();
	setup_min_slab_ratio();
#endif

	return 0;
}
core_initcall(init_per_zone_wmark_min)
The nr_free_buffer_pages function
// Counts the pages above the high watermark in ZONE_DMA and ZONE_NORMAL; at initialization each zone's high watermark is 0
unsigned long nr_free_buffer_pages(void)
{
	return nr_free_zone_pages(gfp_zone(GFP_USER));
}
EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
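/*
 * For each zone, add the number of pages above the high watermark to sum.
 * That number is managed_pages minus watermark[WMARK_HIGH], so the result
 * is the total, across the zones, of pages above the high watermark.
 */
static unsigned long nr_free_zone_pages(int offset)
{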
	struct zoneref *z;
	struct zone *zone;

	/* Just pick one node, since fallback list is circular */
	unsigned long sum = 0;

	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

	for_each_zone_zonelist(zone, z, zonelist, offset) {
		unsigned long size = zone->managed_pages;
		unsigned long high = high_wmark_pages(zone);
		if (size > high)
			sum += size - high;
	}

	return sum;
}
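The summation done by nr_free_zone_pages can be tried out in userspace with a plain array in place of the kernel zonelist. The struct and function names below are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct zone fields used here. */
struct buffer_zone {
    unsigned long managed_pages;
    unsigned long wmark_high;
};

/* Sum, over all zones, the managed pages that lie above the high watermark. */
unsigned long pages_above_high(const struct buffer_zone *zones, size_t nr_zones)
{
    unsigned long sum = 0;
    size_t i;

    assert(zones != NULL || nr_zones == 0);
    for (i = 0; i < nr_zones; i++) {
        unsigned long size = zones[i].managed_pages;
        unsigned long high = zones[i].wmark_high;

        /* A zone whose high watermark exceeds its size contributes nothing. */
        if (size > high)
            sum += size - high;
    }
    return sum;
}
```

When every wmark_high is 0, as assumed for the first call during boot, the result is simply the sum of managed_pages, which is the case the note in section 3.3.1 describes.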