From: Srivatsa S. Bhat <srivatsa.bhat <at> linux.vnet.ibm.com>
Subject: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management
Newsgroups: gmane.linux.kernel
Date: Tuesday 9th April 2013 21:45:28 UTC
[I know, this cover letter is a little too long, but I wanted to clearly
explain the overall goals and the high-level design of this patchset in
detail. I hope this helps more than it annoys, and makes it easier for
reviewers to relate to the background and the goals of this patchset.]

Overview of Memory Power Management and its implications to the Linux MM

Today, we are increasingly seeing computer systems sporting larger and
larger amounts of RAM, in order to meet workload demands. However, memory
consumes a significant amount of power, potentially up to more than a third
of total system power on server systems. So naturally, memory becomes the
next big target for power management - on embedded systems and smartphones,
and all the way up to large server systems.

Power-management capabilities in modern memory hardware:

Modern memory hardware such as DDR3 supports a number of power management
capabilities - for instance, the memory controller can automatically put
memory DIMMs/banks into content-preserving low-power states, if it detects
that the *entire* memory DIMM/bank has not been referenced for a threshold
amount of time, thus reducing the energy consumption of the memory hardware.
We term these power-manageable chunks of memory as "Memory Regions".

Exporting memory region info of the platform to the OS:

The OS needs to know about the granularity at which the hardware can perform
automatic power-management of the memory banks (i.e., the address boundaries
of the memory regions). On ARM platforms, the bootloader can be modified to
pass on this info to the kernel via the device-tree. On x86 platforms, the
new ACPI 5.0 spec has added support for exporting the power-management
capabilities of the memory hardware to the OS in a standard way[5].

Estimate of power-savings from power-aware Linux MM:

Once the firmware/bootloader exports the required info to the OS, it is up to
the kernel's MM subsystem to make the best use of these capabilities and
manage memory power-efficiently. It has been demonstrated on a Samsung Exynos
board (with 2 GB RAM) that up to 6 percent of total system power can be saved
by making the Linux kernel MM subsystem power-aware[4]. (More savings can be
expected on systems with larger amounts of memory, and perhaps improved
further using better MM designs).

Role of the Linux MM in enhancing memory power savings:

Often, this simply translates to having the Linux MM understand the
granularity at which RAM modules can be power-managed, and keeping the memory
allocations and references consolidated to a minimum no. of these
power-manageable "memory regions". It is of particular interest to note that
most of this memory hardware has the intelligence to automatically save
power, by putting memory banks into (content-preserving) low-power states
when not referenced for a threshold amount of time. All that the kernel has
to do is avoid wrecking the power-savings logic by scattering its allocations
and references all over the system memory. (The kernel/MM doesn't have to
perform the actual power-state transitions; it's mostly done in the hardware
automatically, and this is OK because these are *content-preserving*
low-power states).

So we can summarize the goals for the Linux MM as:

o Consolidate memory allocations and/or references such that they are not
spread across the entire memory address space.  Basically the area of memory
that is not being referenced can reside in low power state.

o Support light-weight targeted memory compaction/reclaim, to evacuate
lightly-filled memory regions. This helps avoid memory references to
those regions, thereby allowing them to reside in low power states.

Assumptions and goals of this patchset:

In this patchset, we don't handle the part of getting the region boundary
info from the firmware/bootloader and populating it in the kernel
data-structures. The aim of this patchset is to propose and brainstorm on a
power-aware design of the Linux MM which can *use* the region boundary info
to influence the MM at various places such as page allocation,
reclamation/compaction etc, thereby contributing to memory power savings.
(This patchset is very much an RFC at the moment and is not intended for
mainline-inclusion yet).

So, in this patchset, we assume a simple model in which each 512MB chunk of
memory can be independently power-managed, and hard-code this into the
kernel. As mentioned, the focus of this patchset is not so much on how we get
this info from the firmware or how exactly we handle a variety of
configurations, but rather on discussing the power-savings/performance impact
of the MM algorithms that *act* upon this info in order to save memory power.
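
To make the hard-coded model concrete, here is a minimal userspace sketch of
mapping a page frame number to its memory region. It assumes 4 KB pages, so a
512 MB region spans 131072 pages. The macro and helper names below are
hypothetical and are not the ones used in the patches.

```c
#include <assert.h>

/* Assumptions for illustration only: 4 KB pages, 512 MB regions. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_REGION_SHIFT   29   /* 2^29 bytes = 512 MB */
#define SKETCH_PFN_SHIFT      (SKETCH_REGION_SHIFT - SKETCH_PAGE_SHIFT)
#define SKETCH_REGION_PAGES   (1UL << SKETCH_PFN_SHIFT)

/* Hypothetical helper: index of the region that contains a given pfn. */
static unsigned long sketch_pfn_to_region(unsigned long pfn)
{
    return pfn >> SKETCH_PFN_SHIFT;
}

/* Hypothetical helper: first pfn of a given region. */
static unsigned long sketch_region_start_pfn(unsigned long region)
{
    unsigned long pfn = region << SKETCH_PFN_SHIFT;

    /* Guard against the shift overflowing for an out-of-range index. */
    assert((pfn >> SKETCH_PFN_SHIFT) == region);
    return pfn;
}
```

With a fixed power-of-two region size the lookup is a single shift. Real
region boundaries exported by firmware need not be uniform, in which case a
table lookup would be needed instead.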

That said, it's not very far-fetched to try this out with actual region
boundary info to get the actual power savings numbers. For example, on ARM
platforms, we can make the bootloader export this info to the OS via
device-tree and then run this patchset. (This was the method used to get the
power numbers in [4]). But even without doing that, we can very well evaluate
the effectiveness of this patchset in contributing to power-savings, by
analyzing the free page statistics per-memory-region; and we can observe the
performance impact by running benchmarks - this is the approach currently
used to evaluate this patchset.

Brief overview of the design/approach used in this patchset:

This patchset implements the 'Sorted-buddy design' for Memory Power
Management, in which the buddy (page) allocator is altered to keep the buddy
freelists region-sorted, which helps influence the page allocation paths to
keep the allocations consolidated to a minimum no. of memory regions. This
patchset also includes a light-weight targeted compaction/reclaim algorithm
that works hand-in-hand with the page-allocator, to evacuate lightly-filled
memory regions when memory gets fragmented, in order to further enhance
memory power savings.

This Sorted-buddy design was developed based on some of the suggestions
received[1] during the review of the earlier patchset on Memory Power
Management written by Ankita Garg ('Hierarchy design')[2].
One of the key aspects of this Sorted-buddy design is that it avoids the
zone-fragmentation problem that was present in the earlier design[3].

Design of sorted buddy allocator and light-weight targeted region compaction

Sorted buddy allocator:

In this design, the memory region boundaries are captured in a data structure
parallel to zones, instead of fitting regions between nodes and zones in the
hierarchy. Further, the buddy allocator is altered, such that we maintain the
zones' freelists in region-sorted-order and thus do page allocation in the
order of increasing memory regions. (The freelists need not be fully
address-sorted, they just need to be region-sorted).
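
The region-sorted freelist idea can be sketched in plain C as below. This is
a simplified userspace model, not the code in the patches: struct free_block
and both helpers are hypothetical stand-ins for the kernel's freelist
handling, and the linear walk in the free path is for clarity only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a free pageblock on one buddy freelist. */
struct free_block {
    unsigned long pfn;        /* first page frame of the block */
    unsigned int region;      /* memory region the block belongs to */
    struct free_block *next;
};

/*
 * Free path: link the block in after the last entry whose region is
 * <= blk->region. The list stays sorted by region, not by address.
 */
static void freelist_add_sorted(struct free_block **head,
                                struct free_block *blk)
{
    struct free_block **pos = head;

    assert(blk != NULL);
    while (*pos && (*pos)->region <= blk->region)
        pos = &(*pos)->next;
    blk->next = *pos;
    *pos = blk;
}

/* Allocation path: take the head, i.e. a block from the lowest region. */
static struct free_block *freelist_take(struct free_block **head)
{
    struct free_block *blk = *head;

    if (blk) {
        *head = blk->next;
        blk->next = NULL;
    }
    return blk;
}
```

Note that the allocation side does no searching at all; all of the ordering
work happens when a block is freed.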

The idea is to do page allocation in increasing order of memory regions
(within a zone) and perform region-compaction in the reverse order, as
illustrated below.

---------------------------- Increasing region number---------------------->

Direction of allocation--->               <---Direction of compaction

The sorting logic (to maintain freelist pageblocks in region-sorted-order)
lies in the page-free path and hence the critical page-allocation paths
remain fast. Also, the sorting logic is optimized to be O(log n).
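
One generic way to avoid walking the freelist on every free is to track which
regions currently have free blocks in a bitmask, and to look up the nearest
lower-numbered non-empty region with a find-last-set operation. The sketch
below only illustrates that lookup; it is an assumption for discussion and
not necessarily the algorithm used in patch 9. All names are hypothetical,
and the loop in sketch_fls64() merely stands in for a hardware instruction.

```c
#include <assert.h>
#include <stdint.h>

/*
 * 1-based index of the most significant set bit, 0 when no bit is set
 * (the same convention as the kernel's fls()).
 */
static int sketch_fls64(uint64_t x)
{
    int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/*
 * Bit r of 'mask' is set when region r has at least one free block on
 * this freelist. Returns the highest-numbered non-empty region below
 * 'region', or -1 if there is none. A block freed into 'region' would
 * be linked right after the last block of the returned region.
 */
static int prev_nonempty_region(uint64_t mask, unsigned int region)
{
    uint64_t lower;

    assert(region < 64);
    lower = mask & ((UINT64_C(1) << region) - 1);
    return sketch_fls64(lower) - 1;
}
```

A single 64-bit word covers only 64 regions; supporting more would need
several words or a multi-level bitmap.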

Advantages of this design:
1. No zone-fragmentation (IOW, we don't create more zones than necessary)
   and hence we avoid its associated problems (like too many zones, extra
   activity, question of choosing watermarks etc).
   [This is an advantage over the 'Hierarchy' design]

2. Performance overhead is expected to be low: Since we retain the simplicity
   of the algorithm in the page allocation path, page allocation can
   potentially remain as fast as it would be without memory regions. The
   overhead is pushed to the page-freeing paths which are not that critical.

Light-weight targeted region compaction:

Over time, due to multiple alloc()s and free()s in random order, memory gets
fragmented, which means the memory allocations will no longer be consolidated
to a minimum no. of memory regions. In such cases we need a light-weight
mechanism to opportunistically compact memory to evacuate lightly-filled
memory regions, thereby enhancing the power-savings.

Noting that CMA (Contiguous Memory Allocator) does targeted compaction to
achieve its goals, this patchset generalizes the targeted compaction code
and reuses it to evacuate memory regions. The region evacuation is triggered
by the page allocator: when it notices the first page allocation in a new
region, it sets up a worker function to perform compaction and evacuate that
region in the future, if possible. There are handshakes between the alloc
and the free paths in the page allocator to help do this smartly, which are
explained in detail in the patches.
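
As a rough illustration of the trigger, the sketch below marks a region for
evacuation when the first page is allocated from it. The per-region counter,
the evac_pending flag (standing in for a queued worker) and the rule of
clearing the flag once the region becomes empty again are all assumptions
made for this example; the actual handshake is the one described in the
patches.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SKETCH_REGIONS 16

/* Hypothetical per-region bookkeeping, not the patchset's structures. */
struct region_state {
    unsigned long nr_allocated;  /* pages currently allocated from it */
    bool evac_pending;           /* stand-in for a queued worker */
};

static struct region_state regions[NR_SKETCH_REGIONS];

/* Alloc path: the first page taken from an unused region queues work. */
static void note_page_alloc(unsigned int region)
{
    assert(region < NR_SKETCH_REGIONS);
    if (regions[region].nr_allocated++ == 0)
        regions[region].evac_pending = true;
}

/* Free path: an empty region has nothing left to evacuate. */
static void note_page_free(unsigned int region)
{
    assert(region < NR_SKETCH_REGIONS);
    assert(regions[region].nr_allocated > 0);
    if (--regions[region].nr_allocated == 0)
        regions[region].evac_pending = false;
}
```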

This patchset has been hosted in the below git tree. It applies cleanly on


Changes in this v2:

* Fixed a bug in the NUMA case.
* Added a new optimized O(log n) sorting algorithm to speed up region-sorting
  of the buddy freelists (patch 9). The efficiency of this new algorithm and
  its design allows us to support large amounts of RAM quite easily.
* Added light-weight targeted compaction/reclaim support for memory power
  management (patches 10-14).
* Revamped the cover-letter to better explain the idea behind memory power
  management and this patchset.

Experimental Results:

Test setup:

x86 dual-socket quad core HT-enabled machine booted with mem=8G
Memory region size = 512 MB

Functional testing:

Ran pagetest, a simple C program that allocates and touches a required number
of pages.

Below are the statistics from the regions within ZONE_NORMAL, at various
sizes of allocations from pagetest.

               Present pages |  Free pages at various allocation sizes   |
                             |  start   |  512 MB  |  1024 MB |  2048 MB |
  Region 0           1       |      0   |      0   |       0  |       0  |
  Region 1      131072       |  41537   |  13858   |   13790  |   13334  |
  Region 2      131072       | 131072   |  26839   |      82  |     122  |
  Region 3      131072       | 131072   | 131072   |   26624  |       0  |
  Region 4      131072       | 131072   | 131072   |  131072  |       0  |
  Region 5      131072       | 131072   | 131072   |  131072  |   26624  |
  Region 6      131072       | 131072   | 131072   |  131072  |  131072  |
  Region 7      131072       | 131072   | 131072   |  131072  |  131072  |
  Region 8      131071       |  72704   |  72704   |   72704  |   72704  |

This shows that page allocation occurs in the order of increasing region
numbers, as intended in this design.
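
As a quick cross-check of the table, the helper below sums the drop in free
pages between two columns. The function name is hypothetical. Fed the "start"
and "512 MB" columns, it gives 131912 pages, all taken from regions 1 and 2,
which is slightly more than the 131072 pages that 512 MB corresponds to with
4 KB pages.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Total pages taken between two snapshots of per-region free-page
 * counts. A region whose free count went up contributes negatively.
 */
static long pages_consumed(const long *before, const long *after, size_t n)
{
    long total = 0;
    size_t i;

    assert(before != NULL && after != NULL);
    for (i = 0; i < n; i++)
        total += before[i] - after[i];
    return total;
}
```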

Performance impact:

Kernbench results didn't show any noticeable performance degradation with
this patchset as compared to vanilla 3.9-rc5.

Todos and ideas for enhancing the design further:

1. Add support for making this work with sparsemem, memcg etc.

2. Mel Gorman pointed out that the regular compaction algorithm would work
   against the sorted-buddy allocation strategy, since it creates free space
   at lower pfns. For now, I have not handled this because regular compaction
   triggers only when the memory pressure is very high, and hence memory
   power management is pointless in those situations. Besides, it is
   immaterial whether memory allocations are consolidated towards lower or
   higher pfns, because it saves power either way, and hence the regular
   compaction algorithm doesn't actually work against memory power savings.

3. Add more optimizations to the targeted region compaction algorithm in
   order to enhance its benefits and reduce the overhead, such as:
   a. Migrate only active pages during region evacuation, because, strictly
      speaking we only want to avoid _references_ to the region. So inactive
      pages can be kept around, thus reducing the page-migration overhead.
   b. Reduce the search-space for region evacuation, by having the
      page-allocator note down the highest allocated pfn within that region.

4. Have stronger influence over how freepages from different migratetypes
   are exchanged, so that unmovable and non-reclaimable allocations are
   contained within least no. of memory regions.

5. Influence the refill of per-cpu pagesets and perhaps even heavily used
   slab caches, such that they all get their memory from least no. of
   regions. This is to avoid frequent fragmentation of memory regions.

6. Don't perform region evacuation in situations of high memory pressure.
   Also, never use freepages from MIGRATE_RESERVE for the purpose of
   region evacuation.

7. Add more tracing/debug info to enable better evaluation of the
   effectiveness and benefits of this patchset over vanilla kernel.

8. Add a higher level policy to control the aggressiveness of memory power
   management.


[1]. Review comments suggesting modifying the buddy allocator to be aware of
     memory regions:

[2]. Patch series that implemented the node-region-zone hierarchy design:

     Summary of the discussion on that patchset:

     Forward-port of that patchset to 3.7-rc3 (minimal x86 config)

[3]. Disadvantages of having memory regions in the hierarchy between nodes
     and zones:

[4]. Estimate of potential power savings on Samsung Exynos board

[5]. ACPI 5.0 and MPST support
     Section 5.2.21 Memory Power State Table (MPST)

[6]. v1 of Sorted-buddy memory power management patchset:

 Srivatsa S. Bhat (15):
      mm: Introduce memory regions data-structure to capture region
boundaries within nodes
      mm: Initialize node memory regions during boot
      mm: Introduce and initialize zone memory regions
      mm: Add helpers to retrieve node region and zone region for a given
      mm: Add data-structures to describe memory regions within the zones'
      mm: Demarcate and maintain pageblocks in region-order in the zones'
      mm: Add an optimized version of del_from_freelist to keep page
allocation fast
      bitops: Document the difference in indexing between fls() and __fls()
      mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
      mm: Add support to accurately track per-memory-region allocation
      mm: Restructure the compaction part of CMA for wider use
      mm: Add infrastructure to evacuate memory regions using compaction
      mm: Implement the worker function for memory region compaction
      mm: Add alloc-free handshake to trigger memory region compaction
      mm: Print memory region statistics to understand the buddy allocator

  arch/x86/include/asm/bitops.h      |    4 
 include/asm-generic/bitops/__fls.h |    5 
 include/linux/compaction.h         |    7 
 include/linux/gfp.h                |    2 
 include/linux/migrate.h            |    3 
 include/linux/mm.h                 |   62 ++++
 include/linux/mmzone.h             |   78 ++++-
 include/trace/events/migrate.h     |    3 
 mm/compaction.c                    |  149 +++++++++
 mm/internal.h                      |   40 ++
 mm/page_alloc.c                    |  617
 mm/vmstat.c                        |   36 ++
 12 files changed, 935 insertions(+), 71 deletions(-)

Srivatsa S. Bhat
IBM Linux Technology Center
