From: Minchan Kim <minchan <at> kernel.org>
Subject: [RFC v5 0/8] Support volatile for anonymous range
Newsgroups: gmane.linux.kernel
Date: Thursday 3rd January 2013 04:27:58 UTC
This is still an RFC because we need more input from user-space
people, more stress testing, and design discussion about the interface and
reclaim policy of volatile pages. I also want to expand this concept to a
tmpfs volatile range if that is possible without a big performance drop for
the anonymous volatile range.
(Let's define our terms: anon volatile vs. tmpfs volatile? John?)

I hope for more input from user-space allocator people, and for them to
test this patch set with their allocators, because getting real value out
of it might require design changes in arena management.

TODO

 * Improve volatile range scanning speed
 * Be aware of NUMA policy via the vma's mempolicy
 * Add direct reclaim hook for discarding volatile pages first
 * Support tmpfs-volatile

Changelog from v5 - There are many changes.

 * Working with THP/KSM
 * Remove vma hacking logic in m[no]volatile system call
 * Discard page without swap cache
 * Kswapd discards volatile pages, so we can discard them even when we
   don't have swap.

Changelog from v4

 * Add new system call mvolatile/mnovolatile
 * Add SIGBUS when the user tries to access a volatile range
 * Rebased on v3.7
 * Applied bug fix from John Stultz, Thanks!

Changelog from v3

 * Remove madvise(addr, length, MADV_NOVOLATILE).
 * Add a vmstat counter for the number of discarded volatile pages
 * Discard volatile pages without promotion in the reclaim path

This is based on v3.7

- What is mvolatile(addr, length)?

  It's a hint that the user gives the kernel so that the kernel can
  *discard* pages in the range at any time.

- What happens if the user accesses a page (i.e., a virtual address) that
  the kernel has discarded?

  The user can encounter SIGBUS.

- What should the user do to avoid SIGBUS?

  Call mnovolatile(addr, length) on the range that was passed to
  mvolatile before accessing it.

- What happens if the user accesses a page (i.e., a virtual address) that
  the kernel has not discarded?

  The user sees the old data without a page fault.
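
  The rule above (cancel the hint before touching the range) could be
  wired into an allocator roughly as follows. This is only a sketch: the
  names struct volatile_ops, struct chunk, chunk_release, chunk_reuse and
  noop_hint are hypothetical and not part of this patch set. The hooks
  stand in for the mvolatile/mnovolatile wrappers so the pattern can be
  tried on a kernel without these patches.

```c
#include <stddef.h>

/* Hypothetical hooks: on a patched kernel these would wrap the
 * mvolatile/mnovolatile system calls from this series. */
struct volatile_ops {
	int (*mark_volatile)(void *addr, size_t length);
	int (*mark_novolatile)(void *addr, size_t length);
};

/* Hypothetical allocator-side bookkeeping for one free chunk. */
struct chunk {
	void *addr;
	size_t length;
	int is_volatile;
};

/* Stand-in hook for kernels without the syscalls: does nothing. */
static int noop_hint(void *addr, size_t length)
{
	(void)addr;
	(void)length;
	return 0;
}

/* free() path: tell the kernel it may discard the chunk's pages. */
static int chunk_release(const struct volatile_ops *ops, struct chunk *c)
{
	if (ops->mark_volatile(c->addr, c->length))
		return -1;
	c->is_volatile = 1;
	return 0;
}

/* malloc() path: cancel the hint BEFORE the memory is touched again. */
static int chunk_reuse(const struct volatile_ops *ops, struct chunk *c)
{
	if (c->is_volatile) {
		if (ops->mark_novolatile(c->addr, c->length))
			return -1;
		c->is_volatile = 0;
	}
	/* Old data may or may not have survived, so the caller should
	 * treat the contents as uninitialized. */
	return 0;
}
```

  Keeping the flag in the chunk lets the allocator skip the mnovolatile
  call for chunks that were never marked, which matters because every
  m[no]volatile call has a cost of its own.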

- How is it different from madvise(DONTNEED)?

  System call semantics

  DONTNEED guarantees that the user always sees zero-filled pages after
  calling madvise, while after mvolatile the user can see old data or
  encounter SIGBUS.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its
  overhead grows linearly with the number of mapped pages. On top of
  that, if the user then writes to a zapped page, a page fault + page
  allocation + memset has to happen.

  mvolatile just sets a flag on the range (i.e., the VMA) instead of
  zapping all the ptes in the vma, so it doesn't touch the ptes at all.
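
  The DONTNEED half of that comparison can be checked on any Linux
  kernel. A small sketch (the helper name dontneed_reads_zero is made up
  for illustration) that dirties a private anonymous page, drops it with
  madvise(MADV_DONTNEED) and reads it back:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a private anonymous page reads back as zero after
 * madvise(MADV_DONTNEED), 0 if old data survived, -1 on error.
 */
static int dontneed_reads_zero(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int result;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);	/* fault the page in and dirty it */
	if (madvise(p, len, MADV_DONTNEED)) {
		munmap(p, len);
		return -1;
	}

	/* This read takes a fresh page fault on the zapped page. */
	result = (p[0] == 0 && p[len - 1] == 0);
	munmap(p, len);
	return result;
}
```

  With mvolatile the same read would instead return the old 0xaa bytes
  or raise SIGBUS, depending on whether the kernel discarded the page.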

- What's the benefit compared to DONTNEED?

  1. The system call overhead is very small because mvolatile just sets a
     flag on the VMA instead of zapping all the pages in the range.

  2. It has a chance to eliminate overheads (e.g., zapping ptes + page fault
     + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
     severe.

  3. It has the potential to zap all ptes and free the pages if memory
     pressure is severe, so the reclaim overhead could disappear - TODO

- Is there any drawback?

  madvise(DONTNEED) doesn't need exclusive mmap_sem, so concurrent page
  faults from other threads are allowed. But m[no]volatile needs
  exclusive mmap_sem, so other threads are blocked if they try to
  access not-yet-mapped pages. That's why I designed m[no]volatile so
  that its overhead is as small as possible.

  It could also suffer from increased max RSS, because madvise(DONTNEED)
  deallocates pages as soon as the system call is issued while mvolatile
  delays that until memory pressure happens. So if memory pressure is
  severe because of the increased max RSS, the system would suffer. First
  of all, the allocator needs some balancing logic for that, or the kernel
  might handle it by zapping pages even though the user called mvolatile
  when memory pressure is severe.
  The problem is how we know that memory pressure is severe.
  One solution is to check whether kswapd is active. Another is
  Anton's mempressure, so that the allocator can handle it.
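
  From user space, one rough way to guess whether kswapd has been active
  is to sample a reclaim counter in /proc/vmstat twice and see whether it
  moved. The sketch below is only an illustration: the helper names are
  made up, and which counter to sample is an assumption (recent kernels
  export "pgscan_kswapd", older ones use per-zone names).

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns the value of the named counter in /proc/vmstat, or -1 if the
 * file cannot be read or the counter does not exist on this kernel.
 */
static long long read_vmstat_counter(const char *name)
{
	char key[128];
	long long value;
	long long found = -1;
	FILE *fp = fopen("/proc/vmstat", "r");

	if (!fp)
		return -1;
	/* Each line has the form "<name> <value>". */
	while (fscanf(fp, "%127s %lld", key, &value) == 2) {
		if (strcmp(key, name) == 0) {
			found = value;
			break;
		}
	}
	fclose(fp);
	return found;
}

/*
 * Assumption: kswapd counts as "active" if its scan counter advanced
 * between two samples. Returns 1 if it did, 0 if not, -1 if unknown.
 */
static int kswapd_scanned_between(long long before, long long after)
{
	if (before < 0 || after < 0)
		return -1;
	return after > before;
}
```

  An allocator could take one sample per arena trim, e.g.
  read_vmstat_counter("pgscan_kswapd"), and fall back to DONTNEED while
  the counter keeps advancing.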

- What is it targeting?

  Firstly, user-space allocators like ptmalloc and tcmalloc, or the heap
  management of virtual machines like Dalvik. It also comes in handy for
  embedded systems that have no swap device and so can't reclaim
  anonymous pages. By discarding instead of swapping out, it could be
  used in the non-swap system.

- Stupid performance test

  I attach a test program/script which are utter crap, and I don't expect
  that current smart allocators have ever done anything like it, so we
  need more practical data with real allocators.

  KVM - 8 core, 2G

13.16user 7.58system 0:06.04elapsed 343%CPU (0avgtext+0avgdata
0inputs+0outputs (0major+164050minor)pagefaults 0swaps

23.30user 228.92system 0:33.10elapsed 762%CPU (0avgtext+0avgdata
0inputs+0outputs (0major+16384210minor)pagefaults 0swaps

  x86-64 - 12 core, 2G

33.38user 0.44system 0:02.87elapsed 1178%CPU (0avgtext+0avgdata
0inputs+0outputs (0major+245989minor)pagefaults 0swaps

28.02user 41.25system 0:05.80elapsed 1192%CPU (0avgtext+0avgdata
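
  The minor page fault column in these time(1) outputs can also be
  sampled from inside a process. A sketch, assuming a made-up helper
  name, that uses getrusage() to count the faults taken when a range is
  dirtied again after MADV_DONTNEED:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Dirties npages anonymous pages, drops them with MADV_DONTNEED, dirties
 * them again and returns how many minor faults the second pass took.
 * Returns -1 on error.
 */
static long refault_count_after_dontneed(size_t npages)
{
	size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
	struct rusage before, after;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 1, len);	/* first touch: populate the range */
	if (madvise(p, len, MADV_DONTNEED) ||
	    getrusage(RUSAGE_SELF, &before)) {
		munmap(p, len);
		return -1;
	}

	memset(p, 2, len);	/* second touch: pages must be faulted in again */
	if (getrusage(RUSAGE_SELF, &after)) {
		munmap(p, len);
		return -1;
	}

	munmap(p, len);
	return after.ru_minflt - before.ru_minflt;
}
```

  On a typical setup this is roughly one fault per page, though the exact
  number depends on the kernel configuration (transparent huge pages, for
  instance), so it should be read as an order of magnitude.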

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap

Any comments are welcome!

Cc: Michael Kerrisk 
Cc: Arun Sharma 
Cc: [email protected]
Cc: Paul Turner 
CC: David Rientjes 
Cc: John Stultz <[email protected]>
Cc: Andrew Morton 
Cc: Christoph Lameter 
Cc: Android Kernel Team 
Cc: Robert Love 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Dave Chinner 
Cc: Neil Brown 
Cc: Mike Hommey 
Cc: Taras Glek 
Cc: KOSAKI Motohiro 
Cc: KAMEZAWA Hiroyuki 

Minchan Kim (8):
  Introduce new system call mvolatile
  Don't allow volatile attribute on THP and KSM
  bail out when the page is in VOLATILE vma
  add page_locked parameter in free_swap_and_cache
  Discard volatile page
  add PGVOLATILE vmstat count
  add volatile page discard hook to kswapd
  extend PGVOLATILE vmstat to kswapd

 arch/x86/syscalls/syscall_64.tbl |    2 +
 fs/exec.c                        |    4 +-
 include/linux/memory.h           |    2 +
 include/linux/mm.h               |    6 +-
 include/linux/mm_types.h         |    4 +
 include/linux/mvolatile.h        |   63 +++
 include/linux/rmap.h             |    2 +
 include/linux/sched.h            |    1 +
 include/linux/swap.h             |    6 +-
 include/linux/syscalls.h         |    2 +
 include/linux/vm_event_item.h    |    4 +
 kernel/fork.c                    |    2 +
 mm/Kconfig                       |   11 +
 mm/Makefile                      |    2 +-
 mm/fremap.c                      |    2 +-
 mm/huge_memory.c                 |    9 +-
 mm/internal.h                    |    2 +
 mm/ksm.c                         |    3 +-
 mm/madvise.c                     |    2 +-
 mm/memory.c                      |   12 +-
 mm/mempolicy.c                   |    2 +-
 mm/mlock.c                       |    7 +-
 mm/mmap.c                        |   62 ++-
 mm/mprotect.c                    |    3 +-
 mm/mremap.c                      |    2 +-
 mm/mvolatile.c                   |  813
 mm/rmap.c                        |   11 +-
 mm/shmem.c                       |    2 +-
 mm/swapfile.c                    |    7 +-
 mm/vmscan.c                      |   57 ++-
 mm/vmstat.c                      |    4 +
 31 files changed, 1065 insertions(+), 46 deletions(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

================== 8< =============================

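#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* x86-64 syscall numbers used by this patch set */
#define SYS_mvolatile 313
#define SYS_mnovolatile 314

#define ALLOC_SIZE (8 << 20)
#define MAP_SIZE  (ALLOC_SIZE * 10)
#define PAGE_SIZE (1 << 12)
#define RETRY 100
/* assumed value: mode 1 selects mvolatile, anything else DONTNEED */
#define VOLATILE_MODE 1

pthread_barrier_t barrier;
int mode;

static int mvolatile(void *addr, size_t length)
{
	return syscall(SYS_mvolatile, addr, length);
}

static int mnovolatile(void *addr, size_t length)
{
	return syscall(SYS_mnovolatile, addr, length);
}

void *thread_entry(void *data)
{
	unsigned long i;
	cpu_set_t set;
	int cpu = *(int*)data;
	char *mmap_area;	/* char * so that pointer arithmetic is standard C */
	int retry = RETRY;

	/* pin this thread to its own CPU */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	/* assumed: a private anonymous mapping, one per thread */
	mmap_area = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (mmap_area == MAP_FAILED) {
		fprintf(stderr, "Fail to mmap [%d]\n", *(int*)data);
		exit(1);
	}

	/* assumed: all threads start the loop together */
	pthread_barrier_wait(&barrier);

	while(retry--) {
		if (mode == VOLATILE_MODE) {
			mvolatile(mmap_area, MAP_SIZE);
			for (i = 0; i < MAP_SIZE; i+= ALLOC_SIZE) {
				/* cancel the hint, use the chunk, mark it again */
				mnovolatile(mmap_area + i, ALLOC_SIZE);
				memset(mmap_area + i, i, ALLOC_SIZE);
				mvolatile(mmap_area + i, ALLOC_SIZE);
			}
		} else {
			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
				/* use the chunk, then drop its pages right away */
				memset(mmap_area + i, i, ALLOC_SIZE);
				madvise(mmap_area + i, ALLOC_SIZE, MADV_DONTNEED);
			}
		}
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, nr_thread, err;
	int *data;
	pthread_t *thread;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <nr_thread> <mode>\n", argv[0]);
		return 1;
	}

	nr_thread = atoi(argv[1]);
	mode = atoi(argv[2]);
	if (nr_thread < 1)
		return 1;

	thread = malloc(sizeof(pthread_t) * nr_thread);
	data = malloc(sizeof(int) * nr_thread);
	if (!thread || !data)
		return 1;
	pthread_barrier_init(&barrier, NULL, nr_thread);

	for (i = 0; i < nr_thread; i++) {
		data[i] = i;
		/* pthread_create returns the error number, it does not set errno */
		err = pthread_create(&thread[i], NULL, thread_entry, &data[i]);
		if (err) {
			fprintf(stderr, "Fail to create thread: %s\n",
				strerror(err));
			exit(1);
		}
	}

	for (i = 0; i < nr_thread; i++) {
		if (pthread_join(thread[i], NULL))
			fprintf(stderr, "Fail to join thread\n");
		printf("[%d] thread done\n", i);
	}

	return 0;
}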


To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]  For more info on Linux MM,
see: http://www.linux-mm.org/ .