Sunday, September 20, 2015

Re-writing to typedef for easy readability

Re-writing to typedef for easy readability :::: callback pointers:

Old implementation: pm_op - Execute the PM operation appropriate for the given PM event. @dev: Device to handle.

static int pm_op(struct device *dev, const struct dev_pm_ops *ops, pm_message_t state)

********************************************************************************
pm_op - Return the PM operation appropriate for given PM event.

static int (*pm_op(const struct dev_pm_ops *ops, pm_message_t state))(struct device *)

********************************************************************************

typedef int (*pm_callback_t)(struct device *);
********************************************************************************

static pm_callback_t pm_op(const struct dev_pm_ops *ops, pm_message_t state)

********************************************************************************
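A quick sketch of the caller side after this change (the variable names here are illustrative, not taken from the patch): the typedef lets the PM core hold the returned callback in a plain local variable and invoke it directly.

pm_callback_t callback;
int error = 0;

callback = pm_op(dev->driver->pm, state);   /* pick the callback matching this PM event */
if (callback)
        error = callback(dev);              /* run it on the device */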

https://groups.google.com/forum/#!msg/kernelarchive/4UhgVlliQhU/XTKV59UfUYoJ

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4322.html

Linux-Kernel Memory Model

ISO/IEC JTC1 SC22 WG21 N4322 - 2014-11-20
Paul E. McKenney, paulmck@linux.vnet.ibm.com

Introduction

The Linux-kernel memory model is currently defined very informally in the memory-barriers.txt and atomic_ops.txt files in the source tree. Although these two files appear to have been reasonably effective at helping kernel hackers understand what is and is not permitted, they are not necessarily sufficient for deriving the corresponding formal model. This document is a first attempt to bridge this gap.
  1. Variable Access
  2. Memory Barriers
  3. Locking Operations
  4. Atomic Operations
  5. Control Dependencies
  6. RCU Grace-Period Relationships
  7. Summary

Variable Access

Loads from and stores to normal variables should be protected with the ACCESS_ONCE() macro, for example:
r1 = ACCESS_ONCE(x);
ACCESS_ONCE(y) = 1;
ACCESS_ONCE() access may be modeled as a volatile memory_order_relaxed access. However, please note that ACCESS_ONCE() is defined only for properly aligned machine-word-sized variables. Applying ACCESS_ONCE() to a large array or structure is unlikely to do anything useful.
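For reference, the kernel's definition of ACCESS_ONCE() (from include/linux/compiler.h of that era, not part of the N4322 text) is just a cast through a volatile lvalue:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))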
At one time, gcc guaranteed that properly aligned accesses to machine-word-sized variables would be atomic. Although gcc no longer documents this guarantee, there is still code in the Linux kernel that relies on it. These accesses could be modeled as non-volatile memory_order_relaxed accesses.
An smp_store_release() may be modeled as a volatile memory_order_release store. Similarly, an smp_load_acquire() may be modeled as a memory_order_acquire load.
r1 = smp_load_acquire(x);
smp_store_release(y, 1);
Members of the rcu_dereference() family can be modeled as memory_order_consume loads. Members of this family include: rcu_dereference(), rcu_dereference_bh(), rcu_dereference_sched(), and srcu_dereference(). However, rcu_dereference() should be representative for litmus-test purposes, at least initially. Similarly, rcu_assign_pointer() can be modeled as a memory_order_release store.
The set_mb() function assigns the specified value to the specified variable, then executes a full memory barrier, which is described in the next section. This isn't as strong as a memory_order_seq_cst store because the following code fragment does not guarantee that the stores to x and y will be ordered.
smp_store_release(x, 1);
set_mb(y, 1);
That said, set_mb() provides exactly the ordering required for manipulating task state, which is the job for which it was created.

Memory Barriers

The Linux kernel has a variety of memory barriers:
  1. barrier(), which can be modeled as an atomic_signal_fence(memory_order_acq_rel) or an atomic_signal_fence(memory_order_seq_cst).
  2. smp_mb(), which does not have a direct C11 or C++11 counterpart. On an ARM, PowerPC, or x86 system, it can be modeled as a full memory-barrier instruction (dmb, sync, and mfence, respectively). On an Itanium system, it can be modeled as an mf instruction, but this relies on gcc emitting an ld,acq for an ACCESS_ONCE() load and an st,rel for an ACCESS_ONCE() store.
  3. smp_rmb(), which can be modeled (overly conservatively) as an atomic_thread_fence(memory_order_acq_rel). One difference is that smp_rmb() need not order prior loads against later stores, or prior stores against later stores. Another difference is that smp_rmb() need not provide any sort of transitivity, having (lack of) transitivity properties similar to ARM's or PowerPC's address/control/data dependencies.
  4. smp_wmb(), which can be modeled (again overly conservatively) as an atomic_thread_fence(memory_order_acq_rel). One difference is that smp_wmb() need not order prior loads against later stores, nor prior loads against later loads. Similar to smp_rmb(), smp_wmb() need not provide any sort of transitivity.
  5. smp_read_barrier_depends(), which is a no-op on all architectures other than Alpha. On Alpha, smp_read_barrier_depends() may be modeled as an atomic_thread_fence(memory_order_acq_rel) or as an atomic_thread_fence(memory_order_seq_cst).
  6. smp_mb__before_atomic(), which provides a full memory barrier before the immediately following non-value-returning atomic operation.
  7. smp_mb__after_atomic(), which provides a full memory barrier after the immediately preceding non-value-returning atomic operation. Both smp_mb__before_atomic() and smp_mb__after_atomic() are described in more detail in the later section on atomic operations.
  8. smp_mb__after_unlock_lock(), which provides a full memory barrier after the immediately preceding lock operation, but only when paired with a preceding unlock operation by this same thread or a preceding unlock operation on the same lock variable. The use of smp_mb__after_unlock_lock() is described in more detail in the section on locking.
There are some additional memory barriers, including mmiowb(); however, these cover interactions with memory-mapped I/O and so have no counterpart in C11 and C++11 (which is most likely as it should be for the foreseeable future).
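As an illustrative store-buffering litmus test (my own sketch, not from N4322) showing where smp_mb() is needed, with x and y both initially zero:

Thread 1                      Thread 2
--------                      --------
ACCESS_ONCE(x) = 1;           ACCESS_ONCE(y) = 1;
smp_mb();                     smp_mb();
r1 = ACCESS_ONCE(y);          r2 = ACCESS_ONCE(x);

assert(r1 != 0 || r2 != 0);

With both smp_mb() calls in place the assertion cannot trigger; remove either barrier and both loads may observe zero.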

Locking Operations

The Linux kernel features “roach motel” ordering on its locking primitives: Prior operations can be reordered to follow a later acquire, and subsequent operations can be reordered to precede an earlier release. The CPU is permitted to reorder acquire and release operations in this way, but the compiler is not, as compiler-based reordering could result in deadlock.
Note that a release-acquire pair does not necessarily result in a full barrier. To see this consider the following litmus test, with x and y both initially zero, and locks l1 and l3 both initially held by the threads releasing them:
Thread 1                      Thread 2
--------                      --------
y = 1;                        x = 1;
spin_unlock(&l1);             spin_unlock(&l3);
spin_lock(&l2);               spin_lock(&l4);
r1 = x;                       r2 = y;

assert(r1 != 0 || r2 != 0);
In the above litmus test, the assertion can trigger, meaning that an unlock followed by a lock is not guaranteed to be a full memory barrier. And this is where smp_mb__after_unlock_lock() comes in:
Thread 1                      Thread 2
--------                      --------
y = 1;                        x = 1;
spin_unlock(&l1);             spin_unlock(&l3);
spin_lock(&l2);               spin_lock(&l4);
smp_mb__after_unlock_lock();  smp_mb__after_unlock_lock();
r1 = x;                       r2 = y;

assert(r1 != 0 || r2 != 0);
In contrast, after addition of smp_mb__after_unlock_lock(), the assertion cannot trigger.
The above example showed how smp_mb__after_unlock_lock() can cause an unlock-lock sequence in the same thread to act as a full barrier, but it also applies in cases where one thread unlocks and another thread locks the same lock, as shown below:
Thread 1              Thread 2                        Thread 3
--------              --------                        --------
y = 1;                spin_lock(&l1);                 x = 1;
spin_unlock(&l1);     smp_mb__after_unlock_lock();    smp_mb();
                      r1 = y;                         r3 = y;
                      r2 = x;

assert(r1 == 0 || r2 != 0 || r3 != 0);
Without the smp_mb__after_unlock_lock(), the above assertion can trigger, and with it, it cannot. The fact that it can trigger without it might seem strange at first glance, but locks are only guaranteed to give sequentially consistent ordering to their critical sections. If you want an observer thread to see the ordering without holding the lock, you need smp_mb__after_unlock_lock(). (Note that there is some possibility that the Linux kernel's memory model will change such that an unlock followed by a lock forms a full memory barrier even without the smp_mb__after_unlock_lock().)
The Linux kernel has an embarrassingly large number of locking primitives, but spin_lock() and spin_unlock() should be representative for litmus-test purposes, at least initially.

Atomic Operations

Atomic operations have three sets of operations, those that are defined on atomic_t, those that are defined on atomic_long_t, and those that are defined on aligned machine-sized variables, currently restricted to int and long. However, in the near term, it should be acceptable to focus on a small subset of these operations.
Variables of type atomic_t may be stored to using atomic_set() and variables of type atomic_long_t may be stored to using atomic_long_set(). Similarly, variables of these types may be loaded from using atomic_read() and atomic_long_read(). The historical definition of these primitives has lacked any sort of concurrency-safe semantics, so the user is responsible for ensuring that these primitives are not used concurrently in a conflicting manner.
That said, many architectures treat atomic_read() and atomic_long_read() as volatile memory_order_relaxed loads, and a few architectures treat atomic_set() and atomic_long_set() as memory_order_relaxed stores. There is therefore some chance that concurrent conflicting accesses will be allowed at some point in the future, at which point their semantics will be those of volatile memory_order_relaxed accesses.
The remaining atomic operations are divided into those that return a value and those that do not. The atomic operations that do not return a value are similar to C11 atomic memory_order_relaxed operations. However, the Linux-kernel atomic operations that do return a value cannot be implemented in terms of the C11 atomic operations. These operations can instead be modeled as memory_order_relaxed operations that are both preceded and followed by the Linux-kernel smp_mb() full memory barrier, which is implemented using the DMB instruction on ARM and the sync instruction on PowerPC. Note that in the case of the CAS operations atomic_cmpxchg(), atomic_long_cmpxchg(), and cmpxchg(), the full barriers are required in both the success and failure cases. Strong memory ordering can be added to the non-value-returning atomic operations using smp_mb__before_atomic() before and/or smp_mb__after_atomic() after (a short sketch follows the table below).
The operations are summarized in the following table. An initial implementation of a tool could start with atomic_add(), atomic_sub(), atomic_xchg(), and atomic_cmpxchg().
Operation Class: Add/Subtract
  int:  void atomic_add(int i, atomic_t *v)
        void atomic_sub(int i, atomic_t *v)
        void atomic_inc(atomic_t *v)
        void atomic_dec(atomic_t *v)
  long: void atomic_long_add(long i, atomic_long_t *v)
        void atomic_long_sub(long i, atomic_long_t *v)
        void atomic_long_inc(atomic_long_t *v)
        void atomic_long_dec(atomic_long_t *v)

Operation Class: Add/Subtract, Value Returning
  int:  int atomic_inc_return(atomic_t *v)
        int atomic_dec_return(atomic_t *v)
        int atomic_add_return(int i, atomic_t *v)
        int atomic_sub_return(int i, atomic_t *v)
        int atomic_inc_and_test(atomic_t *v)
        int atomic_dec_and_test(atomic_t *v)
        int atomic_sub_and_test(int i, atomic_t *v)
        int atomic_add_negative(int i, atomic_t *v)
  long: long atomic_long_inc_return(atomic_long_t *v)
        long atomic_long_dec_return(atomic_long_t *v)
        long atomic_long_add_return(long i, atomic_long_t *v)
        long atomic_long_sub_return(long i, atomic_long_t *v)
        int atomic_long_inc_and_test(atomic_long_t *v)
        int atomic_long_dec_and_test(atomic_long_t *v)
        int atomic_long_sub_and_test(long i, atomic_long_t *v)
        int atomic_long_add_negative(long i, atomic_long_t *v)

Operation Class: Exchange
  int:  int atomic_xchg(atomic_t *v, int new)
        int atomic_cmpxchg(atomic_t *v, int old, int new)
  long: long atomic_long_xchg(atomic_long_t *v, long new)
        long atomic_long_cmpxchg(atomic_long_t *v, long old, long new)

Operation Class: Conditional Add/Subtract
  int:  int atomic_add_unless(atomic_t *v, int a, int u)
        int atomic_inc_not_zero(atomic_t *v)
  long: int atomic_long_add_unless(atomic_long_t *v, long a, long u)
        int atomic_long_inc_not_zero(atomic_long_t *v)

Operation Class: Bit Test/Set/Clear (Generic)
        void set_bit(unsigned long nr, volatile unsigned long *addr)
        void clear_bit(unsigned long nr, volatile unsigned long *addr)
        void change_bit(unsigned long nr, volatile unsigned long *addr)

Operation Class: Bit Test/Set/Clear, Value Returning (Generic)
        int test_and_set_bit(unsigned long nr, volatile unsigned long *addr)
        int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
        int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)
        int test_and_change_bit(unsigned long nr, volatile unsigned long *addr)

Operation Class: Lock-Barrier Operations (Generic)
        int test_and_set_bit_lock(unsigned long nr, unsigned long *addr)
        void clear_bit_unlock(unsigned long nr, unsigned long *addr)
        void __clear_bit_unlock(unsigned long nr, unsigned long *addr)

Operation Class: Exchange (Generic)
        T xchg(T *ptr, T v)
        T cmpxchg(T *ptr, T old, T new)
The rows marked "(Generic)" are type-generic, applying to any aligned machine-word-sized quantity supported by all architectures that the Linux kernel runs on. The set of types is currently those of size int and those of size long. The "Lock-Barrier Operations" have memory_order_acquire semantics for test_and_set_bit_lock() and _atomic_dec_and_lock(), and have memory_order_release semantics for the other primitives. Otherwise, the usual Linux-kernel rule holds: if no value is returned, memory_order_relaxed semantics apply; otherwise the operations behave as if there were an smp_mb() before and after.
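As a hedged sketch of the smp_mb__before_atomic()/smp_mb__after_atomic() pairing mentioned above (x, y, and v are illustrative variables, not from the document), full ordering can be added around a non-value-returning atomic like this:

ACCESS_ONCE(x) = 1;
smp_mb__before_atomic();
atomic_inc(&v);              /* memory_order_relaxed on its own */
smp_mb__after_atomic();
r1 = ACCESS_ONCE(y);

The two barriers make atomic_inc() behave as if it were one of the value-returning operations, that is, as if there were an smp_mb() on each side.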

Control Dependencies

The Linux kernel provides a limited notion of control dependencies, ordering prior loads against control-dependent stores in some cases. Extreme care is required to avoid control-dependency-destroying compiler optimizations. The restrictions applying to control dependencies include the following:
  1. Control dependencies can order prior loads against later dependent stores; however, they do not order prior loads against later dependent loads. (Use memory_order_consume or memory_order_acquire if you require this behavior.)
  2. A load heading up a control dependency must use ACCESS_ONCE(). Similarly, the store at the other end of a control dependency must also use ACCESS_ONCE().
  3. If both legs of a given if or switch statement store the same value to the same variable, then those stores cannot participate in control-dependency ordering.
  4. Control dependencies require at least one run-time conditional that depends on the prior load and that precedes the following store.
  5. The compiler must perceive both the variable loaded from and the variable stored to as being shared variables. For example, the compiler will not perceive an on-stack variable as being shared unless its address has been taken and exported to some other thread (or alias analysis has otherwise been defeated).
  6. Control dependencies are not transitive. In this regard, their behavior is similar to ARM or PowerPC control dependencies.
The C and C++ standards do not guarantee any sort of control dependency. Therefore, this list of restrictions is subject to change as compilers become increasingly clever and aggressive.
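A minimal sketch of a control dependency that obeys these restrictions (a and b are assumed shared variables; the pattern follows memory-barriers.txt):

q = ACCESS_ONCE(a);
if (q)
        ACCESS_ONCE(b) = 1;   /* store ordered after the load of a */
else
        ACCESS_ONCE(b) = 2;   /* different value on this leg, so ordering is preserved here too */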

RCU Grace-Period Relationships

The publish-subscribe portions of RCU are captured by the combination of rcu_assign_pointer(), which can be modeled as a memory_order_release store, and of the rcu_dereference() family of primitives, which can be modeled as memory_order_consume loads, as was noted earlier.
Grace periods can be modeled as described in Appendix D of User-Level Implementations of Read-Copy Update. There are a number of grace-period primitives in the Linux kernel, but rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() are good places to start. The grace-period relationships can be described using the following abstract litmus test:
Thread 1                      Thread 2
--------                      --------
rcu_read_lock();              S2a;
S1a;                          synchronize_rcu();
S1b;                          S2b;
rcu_read_unlock();
If either of S1a or S1b precedes S2a, then both must precede S2b. Conversely, if either of S1a or S1b follows S2b, then both must follow S2a.
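A hedged sketch of these primitives in use (gp, struct foo, newp, and the reader/updater split are illustrative names, not from the document):

struct foo { int a; };
struct foo *gp;                        /* RCU-protected pointer */

/* reader */
rcu_read_lock();
p = rcu_dereference(gp);               /* memory_order_consume load */
if (p)
        r1 = p->a;
rcu_read_unlock();

/* updater */
old = gp;
rcu_assign_pointer(gp, newp);          /* memory_order_release store */
synchronize_rcu();                     /* wait for all pre-existing readers */
kfree(old);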

Summary

This document makes a first attempt to present a formalizable model of the Linux kernel memory model, including variable access, memory barriers, locking operations, atomic operations, control dependencies, and RCU grace-period relationships. The general approach is to reduce the kernel's memory model to some aspect of memory models that have already been formalized, in particular to those of C11, C++11, ARM, and PowerPC.

NAME

       atomic_cmpxchg - atomic_cmpxchg functions.

       int atomic_cmpxchg(volatile __global int *p, int cmp, int val);

       unsigned int atomic_cmpxchg(volatile __global unsigned int *p,
                                   unsigned int cmp, unsigned int val);

       int atomic_cmpxchg(volatile __local int *p, int cmp, int val);

       unsigned int atomic_cmpxchg(volatile __local unsigned int *p,
                                   unsigned int cmp, unsigned int val);

DESCRIPTION

       Read the 32-bit value (referred to as old) stored at location pointed
       by p. Compute (old == cmp) ?  val : old and store result at location
       pointed by p. The function returns old.

       A 64-bit version of this function, atom_cmpxchg(3clc), is enabled by
       cl_khr_int64_base_atomics(3clc).

SPECIFICATION

       OpenCL Specification[1]

SEE ALSO

       atomicFunctions(3clc), atom_cmpxchg(3clc)
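A hedged OpenCL C sketch of the usual compare-and-swap retry loop built on this function (the kernel name and argument are made up for illustration):

__kernel void add_one(volatile __global int *counter)
{
        int old, seen;

        do {
                old  = *counter;
                seen = atomic_cmpxchg(counter, old, old + 1);
        } while (seen != old);         /* retry if another work-item changed *counter */
}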

Example code of atomic operation



static inline unsigned long atomic_cmpxchg(volatile void *ptr, 
                                           unsigned long old, 
                                           unsigned long new) 
{ 
    unsigned long prev; 

    /* "0" ; the same constraint with 0th output variable */
    /* RAX = old
       if RAX == *ptr
           *ptr = new
       else
           RAX = *ptr
    */
    /* If success, this returns old value (old == *ptr)
       If fail, this returns *ptr value (old != *ptr)
    */
           
    asm volatile("lock;cmpxchgq %1,%2" 
                 : "=a"(prev) 
                 : "r"(new), "m"(*(volatile long *)ptr), "0"(old) 
                 : "memory"); 
    return prev; 
} 

This is the atomic compare-and-exchange function in the Linux kernel. The "0" in the input constraints means that the 'old' operand shares the register of the 0th output operand, which has the "=a" constraint. Therefore 'old' is loaded into RAX, and the output prev is a copy of RAX after the instruction.


The sequence of this code is:


1. old is copied into RAX
2. cmpxchg does:
 - if RAX(old) and *ptr are the same, *ptr = new is done
 - otherwise RAX = *ptr is done
3. RAX is copied into prev
4. return prev


Finally, if the value of *ptr was changed, the 'old' value is returned. If changing *ptr failed, the current *ptr value is returned.


In other words, when atomic_cmpxchg succeeds it returns the old value and *ptr is set to the new value; otherwise it returns the current value of *ptr.


If atomic_cmpxchg fails, it returns a value different from old.


Therefore atomic_cmpxchg can be changed to return TRUE or FALSE as follows.


static inline unsigned long atomic_cmpxchg(volatile void *ptr, 
                                           unsigned long old, 
                                           unsigned long new) 
{ 
    unsigned long prev; 

    /* "0" -> the same constraint with 0th output variable */
    /* RAX <= old
       if RAX == *ptr
           *ptr <= new
       else
           RAX <= *ptr
    */
    /* If success, this returns old value (old == *ptr)
       If fail, this returns *ptr value (old != *ptr)
    */
           
    asm volatile("lock;cmpxchgq %1,%2" 
                 : "=a"(prev) 
                 : "r"(new), "m"(*(volatile long *)ptr), "0"(old) 
                 : "memory"); 

    if (prev == old)
        return 1;
    else
        return 0;
} 


For example, a thread calls the atomic_cmpxchg with these arguments:

val = 0x5A;   /* shared by multiple threads */
old = val;    /* local variable of the thread */
new = 0xFF;   /* local variable of the thread */
==> atomic_cmpxchg(&val, old, new);


If atomic_cmpxchg succeeds, it means that no other thread changed the val variable (ignoring the ABA problem): val was not changed, val is still equal to old, and val is then updated to the new value.

If atomic_cmpxchg fails, another thread has changed the value of val. In that case we sometimes have to retry in a loop like the following.

val = 0x5A;
do {
    /* change val from 0x5A to 0xFF only if its current value is still 0x5A */
    old = 0x5A;   /* local variable */
    new = 0xFF;   /* local variable */
} while (atomic_cmpxchg(&val, old, new) == 0);

http://gurugio.blogspot.in/2011/02/example-code-of-atomic-operation.html
http://gurugio.kldp.net/


Memory Consistency & memory barrier - The art of multiprocessor programming B.7.1
Notes from The Art of Multiprocessor Programming, B.7.1

When a processor writes a value to memory, the value is stored in the cache and marked dirty, meaning it will be written back to main memory later. Modern processors (ARM included) do not write to main memory immediately when several write requests occur; instead they collect them in a hardware queue called a write buffer (also known as a store buffer) and apply them to memory later in one batch. The write buffer exists, first, because handling several requests at once is more efficient, and second, because when the same address is written several times the earlier writes can be cancelled, so fewer writes have to go all the way to memory.

- Right now I have a problem where the device does not work correctly because values written to the memory mapped to the device's registers are getting lost. Everything I write to that memory should reach it in order, but it looks like some values disappear along the way because of the write buffer. This has to be solved with a memory barrier.

What the write buffer causes is that the reads and writes in the program are not applied to memory in program order. ...(snip)... The compiler makes it even worse: the reordering optimization only considers a single thread, and because of this reordering a multithreaded program can produce results that are hard to explain.

(Important) For example, one thread fills a buffer with data and then sets a flag saying that the buffer is full, but another thread may see the flag while not yet seeing the new data, and may therefore read wrong data.

- Is this why the device is resetting now? Don't I need a barrier between filling the data and setting the bit?

A memory barrier flushes the write buffer and makes every write issued before the barrier visible (the exact meaning of "visible to the processor that issued the barrier" is a bit ambiguous). The place where a memory barrier absolutely must be used is when a processor reads or writes shared variables outside of a critical section.
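A hedged sketch of the "fill the buffer, then set the flag" pattern described above, using the Linux barriers from the earlier section (buf, data, and flag are illustrative):

/* producer (thread 1) */
buf[0] = data;                 /* fill the buffer */
smp_wmb();                     /* make the data visible before the flag */
ACCESS_ONCE(flag) = 1;         /* announce that the buffer is full */

/* consumer (thread 2) */
while (!ACCESS_ONCE(flag))
        ;                      /* wait for the flag */
smp_rmb();                     /* do not read the data before the flag */
r1 = buf[0];                   /* now guaranteed to see the producer's data */

For device registers the MMIO barriers (wmb()/mmiowb()) are the relevant ones, but the ordering idea is the same.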

Sunday, September 6, 2015

Hey Linux Power Management !!! Demystified

Linux Power Management!!! We will discuss it by following specific use cases.

Source Framework::
  • kernel/power/ *
  • drivers/power/
  • drivers/base/power/*
  • drivers/cpuidle/*
  • drivers/cpufreq/*   <<<<<<------ 1st addition, in place since 2.6.11 (2006)
  • drivers/devfreq/*
  • include/linux/power_supply.h
  • include/linux/cpuidle.h
  • include/linux/cpufreq.h
  • include/linux/cpu_pm.h
  • include/linux/device.h
  • include/linux/pm.h
  • include/linux/pm_domain.h
  • include/linux/pm_runtime.h
  • include/linux/pm_wakeup.h
  • include/linux/suspend.h
  • Documentation/power/*.txt
#define container_of(ptr, type, member) (type *)((char *)ptr - (char *)&((type *)0)->member)

Finally, the program is changed to:

#include <stdio.h>

#define container_of(ptr, type, member) (type *)((char *)ptr - (char *)&((type *)0)->member)

typedef struct {
    int a;
    int b;
    int c;
} hehe;

int main(int argc, char *argv[])
{
    hehe hoho;
    hehe *haha;
    hehe *hihi;
    int *ptr;

    hoho.a = 1;
    hoho.b = 2;
    hoho.c = 3;

    hihi = &hoho;

    ptr = &hoho.b;

    printf("ptr = %d, hihi = %p\n", *ptr, (void *)hihi);

    haha = container_of(ptr, hehe, b);
    printf("a = %d, b = %d, c = %d \n", haha->a, haha->b, haha->c);

    return 0;
}

Compile and run:
$ ./2
ptr = 2, hihi = 0xbfe14b5c
a = 1, b = 2, c = 3
Three major layers ::
API Layer :: provides the user-space interface, used to shutdown, restart, hibernate, and suspend via sysfs.
PM Core   :: the core PM code in the kernel; see the source framework above for where the major pieces live.
PM driver :: again two layers: the architecture-dependent code and the specific driver framework.

Mainline Linux kernel shutdown and restart system calls.
Ahhh!!! After a shutdown the machine will sooner or later boot again, so restart/shutdown is a special kind of process.

/*
 * Reboot system call: for obvious reasons only root may call it,
 * and even root needs to set up some magic numbers in the registers
 * so that some mistake won't make this reboot the whole machine.
 * You can also set the meaning of the ctrl-alt-del-key here.
 *
 * reboot doesn't sync: do that yourself before calling this.
 */
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
void __user *, arg)
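A hedged user-space sketch of exercising this system call directly (it requires CAP_SYS_BOOT, i.e. normally root; the constants come from <linux/reboot.h>):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/reboot.h>

int main(void)
{
        sync();         /* reboot doesn't sync: do that yourself, as the comment says */

        if (syscall(SYS_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
                    LINUX_REBOOT_CMD_RESTART, NULL) < 0)
                perror("reboot");

        return 0;
}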


/**
* struct dev_pm_ops - device PM callbacks
*
* Several device power state transitions are externally visible, affecting
* the state of pending I/O queues and (for drivers that touch hardware)
* interrupts, wakeups, DMA, and other hardware state. There may also be
* internal transitions to various low-power modes which are transparent
* to the rest of the driver stack (such as a driver that's ON gating off
* clocks which are not in active use).
*
* The externally visible transitions are handled with the help of callbacks
* included in this structure in such a way that two levels of callbacks are
* involved. First, the PM core executes callbacks provided by PM domains,
* device types, classes and bus types. They are the subsystem-level callbacks
* supposed to execute callbacks provided by device drivers, although they may
* choose not to do that. If the driver callbacks are executed, they have to
* collaborate with the subsystem-level callbacks to achieve the goals
* appropriate for the given system transition, given transition phase and the
* subsystem the device belongs to.
*
* @prepare: The principal role of this callback is to prevent new children of
* the device from being registered after it has returned (the driver's
* subsystem and generally the rest of the kernel is supposed to prevent
* new calls to the probe method from being made too once @prepare() has
* succeeded). If @prepare() detects a situation it cannot handle (e.g.
* registration of a child already in progress), it may return -EAGAIN, so
* that the PM core can execute it once again (e.g. after a new child has
* been registered) to recover from the race condition.
* This method is executed for all kinds of suspend transitions and is
* followed by one of the suspend callbacks: @suspend(), @freeze(), or
* @poweroff(). If the transition is a suspend to memory or standby (that
* is, not related to hibernation), the return value of @prepare() may be
* used to indicate to the PM core to leave the device in runtime suspend
* if applicable. Namely, if @prepare() returns a positive number, the PM
* core will understand that as a declaration that the device appears to be
* runtime-suspended and it may be left in that state during the entire
* transition and during the subsequent resume if all of its descendants
* are left in runtime suspend too. If that happens, @complete() will be
* executed directly after @prepare() and it must ensure the proper
* functioning of the device after the system resume.
* The PM core executes subsystem-level @prepare() for all devices before
* starting to invoke suspend callbacks for any of them, so generally
* devices may be assumed to be functional or to respond to runtime resume
* requests while @prepare() is being executed. However, device drivers
* may NOT assume anything about the availability of user space at that
* time and it is NOT valid to request firmware from within @prepare()
* (it's too late to do that). It also is NOT valid to allocate
* substantial amounts of memory from @prepare() in the GFP_KERNEL mode.
* [To work around these limitations, drivers may register suspend and
* hibernation notifiers to be executed before the freezing of tasks.]
*
* @complete: Undo the changes made by @prepare(). This method is executed for
* all kinds of resume transitions, following one of the resume callbacks:
* @resume(), @thaw(), @restore(). Also called if the state transition
* fails before the driver's suspend callback: @suspend(), @freeze() or
* @poweroff(), can be executed (e.g. if the suspend callback fails for one
* of the other devices that the PM core has unsuccessfully attempted to
* suspend earlier).
* The PM core executes subsystem-level @complete() after it has executed
* the appropriate resume callbacks for all devices. If the corresponding
* @prepare() at the beginning of the suspend transition returned a
* positive number and the device was left in runtime suspend (without
* executing any suspend and resume callbacks for it), @complete() will be
* the only callback executed for the device during resume. In that case,
* @complete() must be prepared to do whatever is necessary to ensure the
* proper functioning of the device after the system resume. To this end,
* @complete() can check the power.direct_complete flag of the device to
* learn whether (unset) or not (set) the previous suspend and resume
* callbacks have been executed for it.
*
* @suspend: Executed before putting the system into a sleep state in which the
* contents of main memory are preserved. The exact action to perform
* depends on the device's subsystem (PM domain, device type, class or bus
* type), but generally the device must be quiescent after subsystem-level
* @suspend() has returned, so that it doesn't do any I/O or DMA.
* Subsystem-level @suspend() is executed for all devices after invoking
* subsystem-level @prepare() for all of them.
*
* @suspend_late: Continue operations started by @suspend(). For a number of
* devices @suspend_late() may point to the same callback routine as the
* runtime suspend callback.
*
* @resume: Executed after waking the system up from a sleep state in which the
* contents of main memory were preserved. The exact action to perform
* depends on the device's subsystem, but generally the driver is expected
* to start working again, responding to hardware events and software
* requests (the device itself may be left in a low-power state, waiting
* for a runtime resume to occur). The state of the device at the time its
* driver's @resume() callback is run depends on the platform and subsystem
* the device belongs to. On most platforms, there are no restrictions on
* availability of resources like clocks during @resume().
* Subsystem-level @resume() is executed for all devices after invoking
* subsystem-level @resume_noirq() for all of them.
*
* @resume_early: Prepare to execute @resume(). For a number of devices
* @resume_early() may point to the same callback routine as the runtime
* resume callback.
*
* @freeze: Hibernation-specific, executed before creating a hibernation image.
* Analogous to @suspend(), but it should not enable the device to signal
* wakeup events or change its power state. The majority of subsystems
* (with the notable exception of the PCI bus type) expect the driver-level
* @freeze() to save the device settings in memory to be used by @restore()
* during the subsequent resume from hibernation.
* Subsystem-level @freeze() is executed for all devices after invoking
* subsystem-level @prepare() for all of them.
*
* @freeze_late: Continue operations started by @freeze(). Analogous to
* @suspend_late(), but it should not enable the device to signal wakeup
* events or change its power state.
*
* @thaw: Hibernation-specific, executed after creating a hibernation image OR
* if the creation of an image has failed. Also executed after a failing
* attempt to restore the contents of main memory from such an image.
* Undo the changes made by the preceding @freeze(), so the device can be
* operated in the same way as immediately before the call to @freeze().
* Subsystem-level @thaw() is executed for all devices after invoking
* subsystem-level @thaw_noirq() for all of them. It also may be executed
* directly after @freeze() in case of a transition error.
*
* @thaw_early: Prepare to execute @thaw(). Undo the changes made by the
* preceding @freeze_late().
*
* @poweroff: Hibernation-specific, executed after saving a hibernation image.
* Analogous to @suspend(), but it need not save the device's settings in
* memory.
* Subsystem-level @poweroff() is executed for all devices after invoking
* subsystem-level @prepare() for all of them.
*
* @poweroff_late: Continue operations started by @poweroff(). Analogous to
* @suspend_late(), but it need not save the device's settings in memory.
*
* @restore: Hibernation-specific, executed after restoring the contents of main
* memory from a hibernation image, analogous to @resume().
*
* @restore_early: Prepare to execute @restore(), analogous to @resume_early().
*
* @suspend_noirq: Complete the actions started by @suspend(). Carry out any
* additional operations required for suspending the device that might be
* racing with its driver's interrupt handler, which is guaranteed not to
* run while @suspend_noirq() is being executed.
* It generally is expected that the device will be in a low-power state
* (appropriate for the target system sleep state) after subsystem-level
* @suspend_noirq() has returned successfully. If the device can generate
* system wakeup signals and is enabled to wake up the system, it should be
* configured to do so at that time. However, depending on the platform
* and device's subsystem, @suspend() or @suspend_late() may be allowed to
* put the device into the low-power state and configure it to generate
* wakeup signals, in which case it generally is not necessary to define
* @suspend_noirq().
*
* @resume_noirq: Prepare for the execution of @resume() by carrying out any
* operations required for resuming the device that might be racing with
* its driver's interrupt handler, which is guaranteed not to run while
* @resume_noirq() is being executed.
*
* @freeze_noirq: Complete the actions started by @freeze(). Carry out any
* additional operations required for freezing the device that might be
* racing with its driver's interrupt handler, which is guaranteed not to
* run while @freeze_noirq() is being executed.
* The power state of the device should not be changed by either @freeze(),
* or @freeze_late(), or @freeze_noirq() and it should not be configured to
* signal system wakeup by any of these callbacks.
*
* @thaw_noirq: Prepare for the execution of @thaw() by carrying out any
* operations required for thawing the device that might be racing with its
* driver's interrupt handler, which is guaranteed not to run while
* @thaw_noirq() is being executed.
*
* @poweroff_noirq: Complete the actions started by @poweroff(). Analogous to
* @suspend_noirq(), but it need not save the device's settings in memory.
*
* @restore_noirq: Prepare for the execution of @restore() by carrying out any
* operations required for thawing the device that might be racing with its
* driver's interrupt handler, which is guaranteed not to run while
* @restore_noirq() is being executed. Analogous to @resume_noirq().
*
* All of the above callbacks, except for @complete(), return error codes.
* However, the error codes returned by the resume operations, @resume(),
* @thaw(), @restore(), @resume_noirq(), @thaw_noirq(), and @restore_noirq(), do
* not cause the PM core to abort the resume transition during which they are
* returned. The error codes returned in those cases are only printed by the PM
* core to the system logs for debugging purposes. Still, it is recommended
* that drivers only return error codes from their resume methods in case of an
* unrecoverable failure (i.e. when the device being handled refuses to resume
* and becomes unusable) to allow us to modify the PM core in the future, so
* that it can avoid attempting to handle devices that failed to resume and
* their children.
*
* It is allowed to unregister devices while the above callbacks are being
* executed. However, a callback routine must NOT try to unregister the device
* it was called for, although it may unregister children of that device (for
* example, if it detects that a child was unplugged while the system was
* asleep).
*
* Refer to Documentation/power/devices.txt for more information about the role
* of the above callbacks in the system suspend process.
*
* There also are callbacks related to runtime power management of devices.
* Again, these callbacks are executed by the PM core only for subsystems
* (PM domains, device types, classes and bus types) and the subsystem-level
* callbacks are supposed to invoke the driver callbacks. Moreover, the exact
* actions to be performed by a device driver's callbacks generally depend on
* the platform and subsystem the device belongs to.
*
* @runtime_suspend: Prepare the device for a condition in which it won't be
* able to communicate with the CPU(s) and RAM due to power management.
* This need not mean that the device should be put into a low-power state.
* For example, if the device is behind a link which is about to be turned
* off, the device may remain at full power. If the device does go to low
* power and is capable of generating runtime wakeup events, remote wakeup
* (i.e., a hardware mechanism allowing the device to request a change of
* its power state via an interrupt) should be enabled for it.
*
* @runtime_resume: Put the device into the fully active state in response to a
* wakeup event generated by hardware or at the request of software. If
* necessary, put the device into the full-power state and restore its
* registers, so that it is fully operational.
*
* @runtime_idle: Device appears to be inactive and it might be put into a
* low-power state if all of the necessary conditions are satisfied.
* Check these conditions, and return 0 if it's appropriate to let the PM
* core queue a suspend request for the device.
*
* Refer to Documentation/power/runtime_pm.txt for more information about the
* role of the above callbacks in device runtime power management.
*
*/
struct dev_pm_ops {
        int (*prepare)(struct device *dev);
        void (*complete)(struct device *dev);
        int (*suspend)(struct device *dev);
        int (*resume)(struct device *dev);
        int (*freeze)(struct device *dev);
        int (*thaw)(struct device *dev);
        int (*poweroff)(struct device *dev);
        int (*restore)(struct device *dev);
        int (*suspend_late)(struct device *dev);
        int (*resume_early)(struct device *dev);
        int (*freeze_late)(struct device *dev);
        int (*thaw_early)(struct device *dev);
        int (*poweroff_late)(struct device *dev);
        int (*restore_early)(struct device *dev);
        int (*suspend_noirq)(struct device *dev);
        int (*resume_noirq)(struct device *dev);
        int (*freeze_noirq)(struct device *dev);
        int (*thaw_noirq)(struct device *dev);
        int (*poweroff_noirq)(struct device *dev);
        int (*restore_noirq)(struct device *dev);
        int (*runtime_suspend)(struct device *dev);
        int (*runtime_resume)(struct device *dev);
        int (*runtime_idle)(struct device *dev);
};
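A hedged sketch of how a driver typically fills in a few of these callbacks (foo_suspend/foo_resume and the device specifics are illustrative, not from any real driver):

static int foo_suspend(struct device *dev)
{
        /* quiesce the hardware: stop DMA, mask interrupts, save context */
        return 0;
}

static int foo_resume(struct device *dev)
{
        /* restore the saved context and restart the device */
        return 0;
}

static const struct dev_pm_ops foo_pm_ops = {
        .suspend = foo_suspend,
        .resume  = foo_resume,
};

The same pair can also be wired up with the SIMPLE_DEV_PM_OPS() helper macro from <linux/pm.h>, which reuses the two functions for the hibernation callbacks as well.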