From: Rafael J. Wysocki <rjw <at> sisk.pl>
Subject: Suspend and hibernation status report
Newsgroups: gmane.linux.power-management.general
Date: Friday 27th July 2007 08:57:04 UTC
Hi,

Below is a document describing the current state of development of the suspend
and hibernation infrastructure: how it works, what known problems there are in
it and what the future development plans are (at least as far as I am
concerned).

[It's almost exactly one year after I released the previous swsusp status
report and that's mostly because in the Summer I have more time to write such
things.  Thus, probably, the next report will be released next Summer, but
since the present one is quite long, the next one is going to be incremental.
;-)]

As usual, comments, suggestions, opinions etc are welcome.

Greetings,
Rafael


---
Hibernation and Suspend Status Report

I. Introduction

One year ago I wrote a report documenting the status of development of swsusp
(ie. software suspend, or hibernation, subsystem) that can be found at
http://lkml.org/lkml/2006/7/25/105 .  Although I thought I would be able to
release an updated version of the report within 3-4 months, this turned out to
be very difficult due to several substantial changes made to swsusp since then,
causing it to be a moving target from the documentation-writing perspective.
Moreover, in the meantime I started to work on the core suspend code used,
among other things, for transitioning the system into the ACPI S3 sleep state,
known as suspend to RAM, which currently has some things in common with
swsusp.  For this reason, I thought it would be a good idea to document these
two subsystems together, but that increased the number of things to cover and
added to the delay.  Finally, however, I have had some time to complete the
present document.

In analogy with the previous report, this document is intended as an
introductory presentation of the current (ie. as in the 2.6.23-rc1 kernel)
design of the suspend (ie. suspend-to-RAM and standby) and hibernation
code,
the status of it, known problems with it and the future development plans.
Thus, I will first explain how this code works and identify all of the
distinct
parts of it.  Next, I will describe each of these parts in more detail and
discuss the known problems related to them.  Finally, I will outline the
possible directions of future development related to suspend and
hibernation.

II. Terminology

Before I start to talk about technical details, some terms that will be used
throughout the rest of this document need to be defined.  They are the
following:
* system working state - any state, in which the system's processors can carry
  out useful computations
* system sleep state - state, in which no useful work can be done by the
  system's processors, but its main memory is powered and, consequently, the
  contents of memory are preserved, so that the computations carried out when
  the system was last in a working state can be continued after transitioning
  the system back to the working state
* system hibernation state - state, in which the system's processors are off
  and its main memory is not powered, but the information necessary for
  continuing the computations carried out when the system was last in a working
  state is preserved in a storage space, such as a disk
* ACPI S4 state - system hibernation state, in which some information is
  preserved by the ACPI platform, in accordance with the ACPI specification
* system suspend - operation, in which the system leaves a working state and
  enters a sleep state
* system resume - operation, in which the system leaves a sleep state and
  enters a working state
* system hibernation - operation, in which the system leaves a working state
  and enters a hibernation state
* system restore - operation, in which the system leaves a hibernation state
  and enters a working state
* device full power state - state of a device, in which it is fully operational
  and draws maximum power
* device low power state - state of a device, in which it draws less power than
  in the full power state and may not be fully operational
* device quiescent state - state of a device, in which it does not generate
  interrupts and/or it will not take part in any DMA transfers
* device off state - state of a device, in which it draws minimal power and is
  not regarded as operational
* device suspend - operation, in which the device is put into a low power state
  compatible with the system sleep state that is going to be entered
* device wake up - operation, in which the device is put into the full power
  state or into a low power state compatible with the system working state that
  is going to be entered

III. System suspend outline

System suspend support is included in the kernel if CONFIG_PM is set in the
.config .  Then, there is the file /sys/power/state, which can be read to
check what suspend states are available on a given system.

At present, two different suspend states can generally be supported,
"standby"
and "mem", but some platforms support only one of them and many platforms
do not
support any sleep states at all.  If both are supported, "standby" is the
state
in which the system draws more power, but can be switched to a working
state
faster than from the "mem" sleep state.
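
As a rough userland illustration (a sketch, not kernel code), the contents of
/sys/power/state could be interpreted as below.  The helper names are made up
for this example; only the file path and the state names come from the text
above, and the file is assumed to hold a whitespace-separated list of states.

```python
def parse_power_states(text):
    """Split the contents of /sys/power/state into a list of state names."""
    return text.split()


def supported_sleep_states(text):
    """Keep only the suspend states discussed here ("standby" and "mem")."""
    return [s for s in parse_power_states(text) if s in ("standby", "mem")]
```

For instance, supported_sleep_states(open("/sys/power/state").read()) would
return an empty list on a platform that supports no sleep states.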

A transition to a system sleep state can be started by writing the name of
a
system sleep state supported by the platform ("mem" or "standby") to
/sys/power/state (there is another method to do that, with the help of the
hibernation userland interface, but it should only be used as a part of the
suspend-to-both functionality described later).  If that happens, the
kernel
performs the following actions:

(1) power management notifiers are executed with PM_SUSPEND_PREPARE
(2) tasks are frozen
(3) target system sleep state is announced to the platform-handling code
(4) devices are suspended
(5) platform-specific global suspend preparation methods are executed
(6) non-boot CPUs are taken off-line
(7) interrupts are disabled on the remaining (main) CPU
(8) late suspend of devices is carried out
(9) platform-specific global methods are invoked to put the system to sleep

Of course all of this happens only if there are no errors along the way.
However, for example, if one of the devices refuses to suspend, we need to wake
up all of the devices that have already been suspended, inform the platform
that the transition to the low power state will not occur, enable the non-boot
CPUs and thaw tasks.  Finally, we have to execute the power management
notifiers to inform their owners that the transition has been canceled.
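
The "undo what has been done so far" pattern described above can be modelled
in a few lines.  This is only a toy sketch in Python, not the kernel's code:
the function names and the reduced list of step names are assumptions made for
the example.

```python
def run_with_rollback(steps):
    """Run (name, do, undo) steps in order.

    `do` returns True on success.  On the first failure the steps that have
    already completed are undone in reverse order.  Returns a log of actions.
    """
    log = []
    done = []
    for name, do, undo in steps:
        if do():
            log.append("do " + name)
            done.append((name, undo))
        else:
            log.append("fail " + name)
            for prev_name, prev_undo in reversed(done):
                prev_undo()
                log.append("undo " + prev_name)
            break
    return log


def example_steps(failing):
    """Build toy steps named after a few items of the list above."""
    names = ["notifiers", "freeze tasks", "suspend devices",
             "offline non-boot CPUs"]
    return [(n, (lambda n=n: n != failing), (lambda: None)) for n in names]
```

With example_steps("suspend devices") the log ends with the tasks being thawed
and the notifiers being informed last, which mirrors the ordering given above.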

A resume starts when the platform notices a wake-up event, such as the
opening
of a laptop's lid or pressing the power button.  Then, the platform
prepares
itself and the main processor for entering a system working state and
returns
the control to the kernel.  Next, the following actions are performed:

(10) the main CPU is switched to the appropriate mode, if necessary
(11) early resume of devices is carried out
(12) interrupts are enabled on the main CPU
(13) non-boot CPUs are enabled
(14) platform-specific global resume preparation methods are invoked
(15) devices are woken up
(16) tasks are thawed
(17) power management notifiers are executed with PM_POST_SUSPEND

For each of steps (1)-(17) above there is a separate part of the suspend
code
responsible for its completion.

IV. System hibernation outline

System hibernation support is included in the kernel if
CONFIG_SOFTWARE_SUSPEND
is set in the .config .  Then, the hibernation state called "disk" is
listed
in the /sys/power/state file.

Currently there are two possible ways of carrying out a system hibernation.
The first of them is entirely kernel-driven and the second one requires a
userland
task that will drive the hibernation procedure calling the kernel to
perform
specific, more or less atomic, actions.  Only the first method is covered
in
this part of the report, because it is generally simpler and the actions of
the
kernel are pretty much the same in both cases.  The other method will be
described later.

The kernel-driven hibernation procedure is started by writing "disk" to
/sys/power/state.  Then, the kernel performs the following actions:

(1) power management notifiers are executed with PM_HIBERNATION_PREPARE
(2) tasks are frozen
(3) some memory is released, if necessary
(4) (optional, on ACPI systems) target system sleep state (S4) is announced to
    the platform-handling code
(5) devices are suspended for hibernation
(6) (optional) platform-specific global hibernation preparation methods are
    invoked
(7) non-boot CPUs are taken off-line
(8) interrupts are disabled on the main CPU
(9) late suspend of devices for hibernation is carried out
(10) atomic copy of the system memory (aka hibernation image) is created
(11) early resume of devices is carried out
(12) interrupts are enabled on the main CPU
(13) non-boot CPUs are enabled
(14) (optional, but necessary if (6) is performed) platform-specific global
    hibernation-related methods are invoked
(15) devices are woken up
(16) hibernation image is saved in a storage space
(17) devices are put into the off state
(18) the system is powered off _or_ (optionally, on ACPI systems)
    platform-specific global methods are invoked to put the system into the S4
    sleep state

In analogy with the system suspend described in Section III, if any of the
operations listed above fails, the operations that have already been
performed
need to be reverted, so that the system can flawlessly continue operating
in the
working state.  In particular, the power management notifiers need to be
called
to inform their owners that the system state transition has been canceled.

System restore is started by booting the kernel with the "resume=" command
line parameter, the value of which identifies the partition the hibernation
image has been written to in step (16).  This partition may be a swap
partition or a partition containing the swap file with the hibernation image,
in which case the additional kernel command line parameter "resume_offset=" is
needed, the value of which points to the location of the swap file's header
(see Documentation/power/swsusp-and-swap-files.txt in the kernel tree for
details).
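
To make the role of the two parameters concrete, here is a small userland
sketch that picks them out of a kernel command line string.  The function name
is invented for this example and the device names used with it are purely
hypothetical; the kernel's own parsing is not shown here.

```python
def parse_resume_params(cmdline):
    """Extract the resume= and resume_offset= values from a command line.

    Returns a (resume, resume_offset) tuple; missing values are None.
    """
    resume = None
    offset = None
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if not sep:
            continue  # a bare flag such as "quiet"
        if key == "resume":
            resume = value
        elif key == "resume_offset":
            offset = int(value)
    return resume, offset
```

A swap partition needs only the first value, whereas a swap file needs both.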

The kernel booted with the "resume=" and (optionally) "resume_offset=" command
line parameters, often referred to as the boot kernel, is responsible for
loading the hibernation image into memory and passing control to the kernel
contained in the hibernation image, which from now on will be referred to as
the target kernel.  The following operations are performed by the boot kernel:

(19) hibernation image is loaded into RAM
(20) tasks are frozen
(21) devices are suspended for jumping to the target kernel
(22) (optional, but necessary if (6) was done during the hibernation)
    platform-specific global restore preparation functions are executed
(23) non-boot CPUs are taken off-line
(24) interrupts are disabled on the remaining CPU
(25) late suspend of devices (for jumping to the target kernel) is carried out
(26) control is passed to the target kernel

If any of steps (19)-(23) fails, the boot kernel continues running as in
the
case of a normal non-restore boot.  Otherwise, the target kernel gets the
control and the following operations are performed by it:

(27) early resume of devices is carried out
(28) interrupts are enabled on the main CPU
(29) non-boot CPUs are enabled
(30) (optional, but necessary if (6) is performed) finish of the system state
    transition is announced to the platform
(31) devices are woken up
(32) tasks are thawed
(33) power management notifiers are executed with PM_POST_HIBERNATION

Again, for each of steps (1)-(33) there is a part of the hibernation code
responsible for completing it and some of these parts are shared with the
suspend code outlined in Section III.

V. Power management notifiers

This is a new feature, introduced very recently in order to allow
subsystems
that need to know if a system state transition is going to happen to
register
notifiers called right before and right after any such transition.  The
parameter passed to the notifiers determines if the transition in question
is a
suspend or a hibernation.

This mechanism is described in detail in Documentation/power/notifiers.txt .
At present, it is only used to disable user mode helpers before the
freezing of
tasks.
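
The idea can be pictured with a toy model written in Python.  It is not the
kernel interface: the class names are invented, the event names are the ones
used in Sections III and IV, and re-enabling the helpers after the transition
is an assumption made for the sake of the example.

```python
PM_HIBERNATION_PREPARE = "PM_HIBERNATION_PREPARE"
PM_POST_HIBERNATION = "PM_POST_HIBERNATION"
PM_SUSPEND_PREPARE = "PM_SUSPEND_PREPARE"
PM_POST_SUSPEND = "PM_POST_SUSPEND"


class NotifierChain:
    """Toy notifier chain: callbacks run in registration order."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def call(self, event):
        for callback in self._callbacks:
            callback(event)


class UsermodeHelpers:
    """Hypothetical subscriber that is switched off around a transition."""

    def __init__(self):
        self.enabled = True

    def notify(self, event):
        if event in (PM_SUSPEND_PREPARE, PM_HIBERNATION_PREPARE):
            self.enabled = False
        elif event in (PM_POST_SUSPEND, PM_POST_HIBERNATION):
            self.enabled = True
```

The event value is what lets a subscriber tell a suspend from a hibernation.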

VI. Freezing and thawing tasks

Steps (2) and (16) of the suspend-resume cycle described in Section III as well
as steps (2) and (32) of the hibernation-restore cycle outlined in Section IV
are done by a special piece of code called the freezer.  Generally speaking, it
requests tasks to "park" themselves in a safe place, called "the refrigerator",
in which they do not hold any locks, do not start any new I/O operations, do
not allocate memory and do not do anything else that might destructively
interfere with the suspend or hibernation procedure.  Userland processes are
made to enter the refrigerator by the kernel's signal-handling code, but kernel
threads should enter the refrigerator voluntarily, by calling the function
try_to_freeze(), where it is appropriate from their point of view.  Moreover,
kernel threads that want to receive freeze requests from the freezer have to
explicitly mark themselves as freezable and they are responsible for entering
the refrigerator relatively quickly after receiving a freeze request.  The
freezable kernel threads are only asked to enter the refrigerator after
userland processes have been frozen and sys_sync() is called before sending any
freeze requests to kernel threads.  A frozen task is only allowed to exit the
refrigerator at the freezer's request.  Detailed description of this mechanism
is available in Documentation/power/freezing-of-tasks.txt .

The freezing of tasks generally works, although there are some known
problems
with it.  First of all, uninterruptible tasks cannot be frozen, so if there
are
any such tasks in the system, except for the tasks waiting for vfork()
completions handled in a special way, it is impossible to suspend or
hibernate
it.  This is a strong limitation stemming from the fact that
uninterruptible
tasks can hold locks that might be necessary for suspending devices later
during
the suspend or hibernation procedure.  Unfortunately, it also leads to
problems
in the situations, in which one userland task may wait in the
TASK_UNINTERRUPTIBLE state for another userland task.  Namely, in such
cases the
task that is being waited for may be frozen before the task that waits for
it
and the freezing of tasks will fail as a result.
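
The ordering and the failure mode described above can be summarized by a very
simplified model.  This Python sketch is an illustration only: the real
freezer works with signals, flags and timeouts that are not modelled, the
sys_sync() call is left out, and all names in it are invented.

```python
def make_task(name, kernel=False, freezable=True, uninterruptible=False):
    """Describe a task with the few properties the toy freezer looks at."""
    return {"name": name, "kernel": kernel, "freezable": freezable,
            "uninterruptible": uninterruptible}


def freeze_tasks(tasks):
    """Toy freezer: userland tasks first, then freezable kernel threads.

    Returns (ok, frozen_names).  Kernel threads that did not mark themselves
    as freezable are left alone.
    """
    frozen = []
    userland = [t for t in tasks if not t["kernel"]]
    kthreads = [t for t in tasks if t["kernel"] and t["freezable"]]
    for task in userland + kthreads:
        if task["uninterruptible"]:
            # The task cannot enter the refrigerator, so the whole attempt
            # fails and nothing stays frozen.
            return False, []
        frozen.append(task["name"])
    return True, frozen
```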

Another known issue related to the freezer is that some system calls, such
as
sys_poll(), may be interrupted by fake signals sent by it to userland
tasks.

VII. Freeing memory

Step (3) of the hibernation procedure outlined in Section IV is completed
by
calling the same functions that are normally used by kswapd, but in a
slightly
different way.  The part of code responsible for that is referred to as the
memory shrinker (it may sometimes be called by the suspend code as well, so
it
can be treated as a shared piece of code).  It generally works well, but it
seems to be inefficient if there are lots of slab objects to free.

VIII. Platform support

On ACPI systems there are parts of the platform that should only be
accessed
by the kernel through the execution of so-called ACPI control methods
encoded
in the AML language.  These control methods are executed with the help of
the
AML interpreter included in the kernel's ACPI subsystem.

Since the platform is responsible for registering and acting upon events
supposed to wake up the system being in a sleep state, as well as for
passing
control back to the kernel after such an event, it requires special
handling
during every suspend.  Also, during a resume the platform has to be put
into a
state that is compatible with the system working state being entered.

The handling of an ACPI platform related to suspend and resume is done on
two
levels.  First, some global ACPI control methods need to be executed, which
is
done in steps (5), (9) and (14) of the procedure outlined in Section III,
with
the help of the information passed to the platform-handling code in step
(3).
Second, some device-specific ACPI control methods are executed while
devices are
being suspended.  The ordering of execution of different ACPI control
methods
involved in suspend and resume operations is strictly defined by the ACPI
specification and it currently is reflected by the ordering of the kernel's
suspend and resume code.

As far as system hibernation is concerned, in principle the platform
support
is optional.  However, some ACPI platforms do not work correctly after a
restore
if the appropriate ACPI control methods are not executed during transitions
to
and from the hibernation state.  For this reason, the platform support in
the
hibernation code is enabled by default, but the users can request that it
be
disabled by writing "shutdown" to the /sys/power/disk control file before
the hibernation.  By reading this file one can see if the platform support
will be used during subsequent hibernations (the active setting is shown inside
the square brackets and "[platform]" means that the platform hibernation
support is enabled).
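
A userland script could read the active setting along these lines.  This is a
sketch only: the helper names are invented and the sample strings used with
them are assumed file contents, not output captured from a real system.

```python
import re


def active_disk_mode(text):
    """Return the setting shown in square brackets, or None if there is none."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def platform_support_enabled(text):
    """True if "[platform]" is the active setting in the given contents."""
    return active_disk_mode(text) == "platform"
```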

During a hibernation-restore cycle global ACPI control methods are executed
in
steps (6), (14), (18) and (30) listed in Section IV.  Additionally, the
platform-handling code is informed of the target system sleep state (ACPI
S4) in
step (4) and the ACPI general purpose events (GPEs) are disabled in step
(22)
(if the restore fails, they are enabled during the subsequent clean-up
procedure).  The restore code in the boot kernel uses the platform support
routines if a special flag in the image header is set by the hibernation code.
Still, the current hibernation and restore code does not exactly follow the
ACPI specification.  Namely, the specification requires that the ACPI subsystem
not be enabled during a restore until the image is loaded into memory and
control is passed to the target kernel, but in our current implementation the
ACPI subsystem is already enabled in the boot kernel before loading the image.

Apart from this, in step (14) of the hibernation procedure we inform the
platform that the system will not enter the sleep state, which is not what
is
going to happen.  We do that in order to be able to resume devices needed
for
saving the image and in step (18) the platform is prepared for entering the
S4
sleep state from the start.

IX. Handling of devices

Steps (4), (8), (11), and (15) of the suspend-resume cycle outlined in
Section
III, as well as steps (5), (9), (11), (15), (21), (25), (27), and (31) of
the
hibernation-restore cycle described in Section IV are completed in a large
part
by device drivers.  Namely, each device driver supporting the suspend
and/or
resume of devices handled by it is required to define the .suspend() and
.resume() callbacks and register them with the driver model, as described
in
Documentation/power/devices.txt .  These callbacks are used by the power
management core to suspend the driver's devices in step (4) of the
suspend-resume cycle and in steps (5) and (21) of the hibernation-restore
cycle.

At present, the same callbacks are used for both suspend and hibernation. 
In
the case of a suspend they are called with the second parameter equal to
PMSG_SUSPEND, whereas for a hibernation the second parameter passed to each
of
them is equal to PMSG_FREEZE.  Moreover, the drivers' .suspend() callbacks
are
also executed in step (21) of the hibernation-restore cycle, in order to
prepare
devices for passing control to the target kernel, in which case the second
parameter passed to them is equal to PMSG_PRETHAW.  Thus, theoretically,
the
drivers can use the second parameter of their .suspend() callbacks to
distinguish between suspend, hibernation and restore operations, although
only a
few drivers actually do that.
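
What "using the second parameter" might look like is sketched below in Python
rather than C, so it is a model and not driver code.  The device is a plain
dictionary with invented keys, and the policy (quiesce always, lower the power
state only for a real suspend) is an assumption chosen for illustration.

```python
PMSG_SUSPEND = "PMSG_SUSPEND"
PMSG_FREEZE = "PMSG_FREEZE"
PMSG_PRETHAW = "PMSG_PRETHAW"


def toy_suspend(device, message):
    """Hypothetical .suspend()-like callback that inspects its second argument."""
    device["quiesced"] = True      # stop interrupts and DMA in every case
    if message == PMSG_SUSPEND:
        device["power"] = "low"    # a sleep state is about to be entered
    # For PMSG_FREEZE and PMSG_PRETHAW the power state is left alone here.
    return 0
```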

Similarly, the same .resume() callbacks are used for waking up devices in step
(15) of the suspend-resume cycle, as well as in steps (15) and (31) of the
hibernation-restore cycle.  Since these callbacks take only one parameter,
being a pointer to the device object associated with a given device, the
drivers have no means to distinguish between different reasons for which the
devices may be woken up and they need to perform basically the same actions in
each of these cases.

In order to suspend devices the power management core walks the dpm_active
list
in the reverse order.  This list is set up during the kernel initialization
and
devices are put on it in the order in which they are registered with the
driver
model.  Thus, the devices that have been registered last, are suspended
first
and so on, which guarantees that basic dependencies between devices will
not be
violated (ie. parent devices are always suspended after the devices that
depend
on them).  For each device the core checks if:
* the device's class has defined a .suspend() callback, in which case this
  callback is executed,
* the device's type has defined a .suspend() callback, in which case this
  callback is executed,
* the device's bus type has defined a .suspend() callback, in which case this
  callback is executed.
All of the .suspend() callbacks defined by device classes, types and bus
types
are always executed as long as none of them returns an error.  This means
that,
for example, if a device class has defined the .suspend() callback and a
bus
type has done that too, then both of these callbacks will be executed for
each
device belonging to this class and associated with this bus type and it is
up to
the class, bus type and driver code to cope with that correctly.  If any of
the
.suspend() callbacks listed above returns an error, the suspending of
devices is
immediately terminated and the devices that have already been suspended are
woken up.  The .suspend() callbacks defined by device drivers are executed
by
the device class, device type and bus type .suspend() callbacks.

The suspended devices are moved from the dpm_active list to the dpm_off
list in
the order in which they have been suspended (note that a device may be
regarded
as suspended even if no .suspend() callbacks have been executed for it, for
instance, when there are no such callbacks defined for it).  This list is
used
by the power management core for waking up devices.  Namely, for each
device on
it the power management core checks if:
* the device's bus type has defined a .resume() callback, in which case this
  callback is executed,
* the device's type has defined a .resume() callback, in which case this
  callback is executed,
* the device's class has defined a .resume() callback, in which case this
  callback is executed.
Again, all bus type, device type and device class .resume() callbacks that have
been defined are always executed for each device that they apply to.  Moreover,
any errors returned by them are discarded.  All devices for which they have
been executed are unconditionally moved from the dpm_off to the dpm_active
list, in such a way that the original ordering of the dpm_active list is
eventually restored.
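
The list handling described in this section can be modelled as follows.  This
is a Python sketch, not the power management core: Python lists stand in for
the kernel's linked lists, one callback per device replaces the class, type
and bus type callbacks, and the function names are invented.

```python
def suspend_devices(dpm_active, suspend, resume):
    """Suspend devices in reverse registration order.

    `suspend(dev)` returns 0 on success.  Returns (error, dpm_active, dpm_off).
    On error the devices suspended so far are woken up again.
    """
    active = list(dpm_active)
    dpm_off = []
    while active:
        error = suspend(active[-1])
        if error:
            active, dpm_off = resume_devices(active, dpm_off, resume)
            return error, active, dpm_off
        dpm_off.append(active.pop())
    return 0, active, dpm_off


def resume_devices(dpm_active, dpm_off, resume):
    """Wake devices in the reverse of the suspend order, ignoring errors."""
    active = list(dpm_active)
    for dev in reversed(dpm_off):
        resume(dev)              # return value deliberately discarded
        active.append(dev)
    return active, []
```

Devices registered as "pci", "usb", "disk" would be suspended as disk, usb,
pci and woken up as pci, usb, disk, so parents are always handled last on the
way down and first on the way up.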

Apart from "ordinary" devices, the suspending and resuming of which is
described
above, there are special devices that need some handling in steps (8) and
(11)
of the suspend-resume cycle and in steps (9), (11), (25), and (27) of the
hibernation-restore cycle.  There are two kinds of such devices:
* devices the bus types and drivers of which define .suspend_late() and/or
  .resume_early() callbacks,
* system devices (aka sysdevs)

The devices handled with the help of .suspend_late() callbacks are moved
from
the dpm_off list to the dpm_off_irq list, which is used later to check if
the
.resume_early() callbacks have been defined for them and to execute these
callbacks if that is the case.  All devices on the dpm_off_irq list are
moved
from there back to the dpm_off list before the "ordinary" waking up of
devices
described above.  It should be noted that the right ordering of devices is
always preserved by all of these operations.  Moreover, the .suspend() and
.resume() callbacks may be defined for a device for which .suspend_late()
and
.resume_early() are also defined and all of these callbacks will always be
executed in the right order.

System devices are handled in a special way, independent of the above
general
framework.  Specifically, system device classes and drivers can define
.suspend() and .resume() callbacks that are used to handle their devices.
However, these callbacks are only executed when just one CPU is on-line and
interrupts are disabled on it.  Thus, if any such device needs to be
handled
with interrupts enabled too, it is necessary to create a separate device
object
for it that will be treated in the ordinary way.  For this reason, from the
power management point of view, system devices are rather inflexible and
the use
of them is no longer recommended.  The existing ones are expected to be
gradually phased out or replaced with device objects corresponding to the
"platform" bus type.

The main problem with the current approach to the handling of devices is
that
the same callbacks are used for both suspend and hibernation, which leads
to
confusion and introduces unnecessary limitations.  For example, it
generally is
not necessary, and may even be harmful, to put devices into low power
states
before step (10) of the hibernation procedure.  In fact, it should be
sufficient
to put devices into quiescent states in step (5) of it and to put them back
into
the full power state (or into the low power states in which they were
before the
hibernation procedure was started) in step (15).  Then, the execution
of
platform-specific functions in steps (6) and (14) should not be necessary
and
the entire hibernation procedure might be simplified.  It also is generally
unnecessary to put devices into low power states in step (21), during a
restore.
Moreover, the boot kernel need not handle the same set of devices as the
target
kernel, which means that the callbacks used by the target kernel to "wake
up"
devices must be prepared to deal with the situation in which their devices
have
not been initialized or, worse yet, have been initialized by the platform
firmware in an inappropriate way.  Generally, they need not be in the same
states in which they were left in step (5).  Yet, this obviously is not the
case
during a resume, since the states of devices generally need not change
between
steps (4) and (15) of the suspend-resume cycle.  Thus, by requesting that
all of
the .resume() callbacks need to be able to deal with uninitialized devices,
we
impose an unnecessary limitation on the suspend code, which should be
avoided.

The next major limitation is related to the handling of removable storage
devices.  Namely, if some filesystems are mounted from removable devices, such
as USB storage devices or memory cards, before a suspend or hibernation,
they
will not be accessible after the corresponding resume or restore and the
users
may lose data as a result of this.  The problem is that for removable, or
rather
"hotpluggable", devices the suspend operation usually causes the device to
disconnect, as though it were physically disconnected from the system. 
There is
the kernel configuration parameter CONFIG_USB_PERSIST which allows one to
work
around this behavior, but it generally is dangerous and the use of it is
not
recommended, unless the user knows exactly what she is doing.

The third major problem with the handling of devices is related to graphics
adapters that often are not touched by the platform after it has registered
a
wake-up event and before it passes control back to the kernel during a
resume.
Usually, the kernel also does not know how to bring the graphics adapter back
to the pre-suspend state and that may lead to various undesirable effects, from
image corruption up to and including a crash of the resuming system, depending
on the type of the graphics card, platform firmware and its version and other
similar factors.  A workaround for this that seems to work in the majority of
cases is to use a userland tool able to put the graphics adapter into the right
state after a resume, given some simple instructions on how to do it, such as
s2ram (http://en.opensuse.org/s2ram).

At present, the majority of reported and tracked bugs related to suspend
and
hibernation are associated with the platform support, described in Section
VIII,
and with the handling of devices.  Unfortunately, these bugs are usually
reproducible only on a limited number of machines and hard to debug.

X. Handling of non-boot CPUs

Steps (6) and (13) of the suspend-resume cycle, as well as steps (7), (13),
(23), and (29) of the hibernation-restore cycle are completed with the help
of
the CPU hotplug infrastructure, which basically is external with respect to
the
suspend and hibernation code.  There were some problems with this mechanism
in
the past, but currently it is generally reported to work, even on 4-way
machines.

XI. Snapshotting memory and restoring its state

The snapshotting of memory, step (10) of the hibernation procedure, is
completed
by making a copy of each memory page that needs to be saved.  For this
reason,
the hibernation code needs as much as 50% of free RAM to create the image. 
This
is a serious limitation, as it generally affects the system responsiveness
after
a restore and sometimes requires quite a lot of memory to be freed in step
(3).
Still, usually there are many saveable pages in the system that will not be
accessed when userland processes are frozen, and in principle these pages
could
be included in the hibernation image without copying.  Unfortunately,
however, no efficient method of identifying these pages has been proposed yet.
If you have any ideas and/or hints, please help.
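
The 50% figure follows from simple arithmetic, which the Python sketch below
spells out.  It rests on a deliberately crude assumption that is not taken
from the kernel code: every page is either saveable or free, and each saveable
page needs one free page to hold its copy.  The function name is invented.

```python
def pages_to_free(total_pages, saveable_pages):
    """Pages that must be released so every saveable page can be copied.

    Under the stated assumption the image can hold at most half of all pages,
    so anything above that has to be freed in step (3).
    """
    max_image = total_pages // 2
    return max(0, saveable_pages - max_image)
```

With 1000 pages of RAM and 700 saveable pages, 200 pages have to be freed,
which leaves 500 pages to save and 500 free pages for their copies.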

The code that restores the memory state from the hibernation image in steps
(19)
and (26) of the hibernation-restore cycle is able to handle images much
greater
than 50% of RAM.  It practically is only limited by the amount of memory
occupied by the boot kernel and its data structures.  Thus, it would be
possible
to use hibernation images as big as 80% or even 90% of RAM if the
"snapshotting"
code could create them.

Apart from the above limitation, there are no known problems with this part
of the hibernation code.  Also, it uses data structures that are completely
independent of the rest of the kernel's memory management subsystem and are
allocated on demand, during the hibernation and restore.

XII. Saving and loading the hibernation image

The hibernation image is saved in a swap partition or in a swap file in
step
(16) and loaded from it in step (19) of the hibernation-restore cycle, with
the
help of standard block I/O callbacks and/or functions designed for
accessing
swap devices and/or swap files.  This code has not been changed for a long
time.

There are almost no problems with this part of the hibernation code.
Practically, there have not been any bugs found in it for the last year. 
Yet,
it is quite limited, since it does not support image compression that may
substantially increase the speed of saving and loading the image.  It also
is
only capable of using swap space (ie. swap partitions or swap files) for
saving
hibernation images and only one swap partition or swap file can be used at
a
time.
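
Why compression can help is easy to demonstrate in userland.  The sketch below
uses Python's zlib module purely as a stand-in, with no claim about which
algorithm any hibernation tool uses, and the 4096-byte page size is an
assumption made for the example.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def compressed_ratio(pages):
    """Return compressed size divided by original size for a list of pages."""
    raw = b"".join(pages)
    return len(zlib.compress(raw)) / len(raw)
```

Pages full of zeros shrink to a tiny fraction of their size, so less data has
to be written and read back; how much real images gain depends on their
contents.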

XIII. Userland hibernation interface

Some users of the hibernation subsystem want it to be able to perform certain
transformations of the hibernation image, such as encryption and/or compression,
before saving it.  Moreover, some of them would like the hibernation and restore
code to use splash screens and display graphical progress meters.  Still, the
idea of implementing all these things in the kernel space is questionable, so it
has been made possible to export the hibernation image out of the kernel, in
order for some userland tools to be able to carry out the desired operations and
save the image afterwards.  This is the basic role of the userland hibernation
interface, which also allows userland processes to drive the entire hibernation
and restore procedure.

The userland hibernation interface has been implemented as a special software
character device with appropriate file operations and some special ioctls.  It
is described quite thoroughly in Documentation/power/userland-swsusp.txt, so
please refer to this document for details.  A reference implementation of the
userland tools that use this interface is available at http://suspend.sf.net .
At present, this method of driving the hibernation and restore procedures is
used by default in OpenSUSE and is optionally available for the users of some
other major distributions.

One of the features provided by the userland hibernation interface is the
possibility to create and save a hibernation image and suspend to RAM right
after that.  Then, the system can be resumed with the help of the platform,
if there is still enough battery power, or its state can be restored on
the basis of the hibernation image.  This often is referred to as the
suspend-to-both capability.  To make it possible, the hibernation userland
interface includes a special ioctl allowing one to make the system enter the
"mem" sleep state if some additional conditions are met.  However, it is
strongly recommended to use this ioctl only as a part of the suspend-to-both
functionality.
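
As a rough illustration of the ordering described above, the sketch below lists
the steps a userland tool might go through, with and without suspend-to-both.
It is plain Python and only returns a list of steps; nothing is executed against
the kernel.  The device node /dev/snapshot and the ioctl names are assumed to
match Documentation/power/userland-swsusp.txt and should be verified there, and
the function name is made up for the example.

```python
def hibernation_steps(suspend_to_both=False):
    """Return the ordered userland steps as (action, detail) pairs.

    This is a dry-run description only: no device is opened and no ioctl
    is issued.  Names are assumptions to be checked against
    Documentation/power/userland-swsusp.txt.
    """
    steps = [
        ("open", "/dev/snapshot for reading"),
        ("ioctl", "SNAPSHOT_FREEZE"),           # freeze userland tasks
        ("ioctl", "SNAPSHOT_ATOMIC_SNAPSHOT"),  # create the image in memory
        ("read", "image from the device, transform it, save it"),
    ]
    if suspend_to_both:
        # The image is already saved, so entering "mem" is safe: if power
        # is lost, the state can still be restored from the image.
        steps.append(("ioctl", "SNAPSHOT_S2RAM"))
        # Reached only if the platform resumes the system from RAM.
        steps.append(("ioctl", "SNAPSHOT_UNFREEZE"))
    else:
        steps.append(("power off", "or reboot"))
    return steps


for action, detail in hibernation_steps(suspend_to_both=True):
    print(action + ":", detail)
```

The point of the ordering is that the "mem" request comes only after the image
has been created and saved, which is the condition under which the ioctl is
meant to be used.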

XIV. Debugging

Problems related to suspend and hibernation are usually difficult to debug,
since most often they are only reproducible on a limited number of systems and
it generally is difficult to obtain any diagnostic information from a system
after or during a failing resume or restore.  Nevertheless, there are some
facilities that can be used to debug suspend and hibernation issues.

First, some standard debugging techniques that can be used in such cases are
described in Documentation/power/basic-pm-debugging.txt and
Documentation/power/drivers-testing.txt .  There also is the suspend-resume
events tracing functionality, available when CONFIG_PM_TRACE is set in .config
(in addition to CONFIG_PM being set), described in
Documentation/power/s2ram.txt .

Recently, we have added a feature allowing the user to make the kernel beep
in the early phase of resume, right after it has received control from the
platform, which may help confirm that the control is really passed from the
platform to the kernel.  This feature can be activated by executing the
following command:

# r=`cat /proc/sys/kernel/acpi_video_flags` && r=`expr $r + 4` && \
> echo $r > /proc/sys/kernel/acpi_video_flags
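
The command above adds 4 to the current value of acpi_video_flags.  Assuming
that 4 is a single bit in a bit mask (an assumption of this sketch, as is the
name BEEP_FLAG), the small Python example below shows how addition and a
bitwise OR differ when the flag is already set.

```python
BEEP_FLAG = 4  # value taken from the command above; the name is hypothetical


def with_beep(flags):
    """Return flags with the beep bit set, leaving all other bits alone."""
    return flags | BEEP_FLAG


def add_four(flags):
    """What plain addition does; only equivalent when the bit is clear."""
    return flags + 4


for current in (0, 3, 4):
    print(current, "->", with_beep(current), "(OR),", add_four(current), "(add)")
```

In other words, running the addition twice would clear the beep bit and set
the next higher one, so it is worth checking the current value first.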

XV. Reporting bugs and problems

If you find a bug in the suspend/hibernation code or have a problem related to
it, please report it, preferably to [email protected] .  You
can also use the kernel bugzilla (http://bugzilla.kernel.org/) for this purpose,
in which case please file the report with the "Hibernation/Suspend" component
of "Power Management" and add the e-mail address [email protected] to its Cc
list.

The list of bugs related to suspend and hibernation being tracked at the moment
can be found at http://bugzilla.kernel.org/show_bug.cgi?id=7216 .

XVI. Future development plans

As you have certainly realized, there are some known problems and limitations
related to the suspend and hibernation code, so I do not consider these
subsystems as finished work.  Therefore I intend to work on improving them or
even redesigning them to a reasonable extent, if that is desirable.  However,
in my opinion that should be done in an organized way, so that we do not
introduce regressions and do not end up with a solution worse than the current
one.

In my opinion, the part of the suspend and hibernation code that should be
taken care of first is the handling of devices.  Namely, I think that we should
first separate the hibernation-related handling of devices from the
suspend-related handling of them in order to overcome limitations mentioned in
Section IX.  This also will be necessary if we want to try some new approaches
to hibernation, such as the kexec-based one recently discussed on the LKML.
For this reason, I think that it will be necessary to introduce some
hibernation-related callbacks to be used in steps (5), (9), (11), (15), (21),
(25), (27), and (31) of the hibernation-restore cycle instead of the existing
.suspend(), .resume(), .suspend_late() and .resume_early() callbacks, which
should only be used during suspend and resume.  We have discussed this issue
a couple of times on the linux-pm list ([email protected])
and it generally seems to be known how the hibernation-specific callbacks should
work.

The next thing that seems reasonable to do is to eliminate the freezing of
tasks, described in Section VI, from the suspend and resume code, since the
limitations related to it are regarded by many people as too restrictive.
Still, for this purpose we will need to make device drivers able to block
userland tasks on I/O after their .suspend() callbacks have been executed.
Currently, there are only a few drivers which can do that and there are drivers
which openly assume the userland tasks to be frozen in the initial phase of
suspend.  Thus, quite a lot of work needs to be done on the drivers before we
can drop the freezing of tasks from the suspend code path.

When drivers are able to block userland tasks on I/O after executing their
.suspend() callbacks, or analogous hibernation-specific callbacks (to be
introduced), we may also be able to eliminate the freezing of tasks from the
hibernation code path or leave only a much simplified and less intrusive form
of it.  In theory, that can be achieved by using a kexec-based hibernation
framework, but I think that there also are other possibilities worth
considering.  Apart from this, I think that we have not yet explored all
possibilities to improve the current framework, including the freezing of tasks,
so as long as the freezer is in use, I am going to improve it and fix reported
problems related to it.

There also is the alternative hibernation framework TuxOnIce maintained by Nigel
Cunningham, which is more feature-rich than the current in-kernel hibernation
code.  It therefore seems reasonable to incorporate at least some of the more
advanced TuxOnIce features into the in-kernel code.  I believe that by combining
TuxOnIce with the current in-kernel hibernation implementation we can obtain a
relatively simple, but powerful and solid hibernation framework, so I am going
to work in this direction, after the separation of the suspend-specific and
hibernation-specific device handling is done at the core and device class/device
type/bus type level.
 