From: David Howells <dhowells <at> redhat.com>
Subject: Linux Credentials document, first draft
Newsgroups: gmane.linux.kernel.lsm
Date: Monday 23rd June 2008 13:19:03 UTC
Here's a first draft of a document about Linux credentials that I intend to
include in Documentation/credentials.txt.

In this form, it's as it should be after _all_ the patches have been applied,
including the ones that add credentials pointers to all the VFS functions that
need them (such as vfs_mkdir()) and down, so the section on "Overriding the
VFS's use of credentials" needs to be considered in that light.


By: David Howells 


 (*) Overview.

 (*) Types of credentials.

 (*) File markings.

 (*) Task credentials.

     - Accessing task credentials.
     - Accessing another task's credentials.
     - Altering credentials.
     - Managing credentials.

 (*) Open file credentials.

 (*) Overriding the VFS's use of credentials.


Linux supports a variety of credentials, and has a variety of objects that
carry credentials, including:

	- Tasks
	- Files/inodes
	- Sockets
	- Message queues
	- Shared memory segments
	- Semaphores
	- Keys

Most of these objects are inactive: they don't operate on other objects.  Tasks
are the obvious exception.  They do stuff; they access and manipulate things.

All the objects above have credentials of some sort.  In all cases, these are,
or include, a set of credentials that indicates the ownership of the object.
This is the 'objective context'.

Tasks have an additional interpretation of their credentials.  Some or all of
their credentials form a 'subjective context'.  This is what is used when a
task acts upon some other object.  The subjective context is compared to the
objective context and a determination is made as to whether an operation is
permitted or not.

Objects frequently also carry indications of which operations particular
subjective contexts may perform on them.  This is an 'access control list' or
'ACL'.  An example of this is the mode bits on a traditional UNIX file, which
grant the owner, members of a nominated group and others specific sets of
permissions, typically read, write and execute.
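
As a rough illustration of comparing a subjective context against an objective
context, here is a minimal user-space sketch of a traditional mode-bit check.
It is not kernel code: the 'subj_ctx' and 'obj_ctx' structures, the MAY_*
constants and may_access() are names invented for this example, and
capabilities, ACLs and LSM labels are deliberately ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical request bits for this sketch (same layout as rwx). */
#define MAY_EXEC	1
#define MAY_WRITE	2
#define MAY_READ	4

/* Hypothetical subjective context: who is acting. */
struct subj_ctx {
	uid_t fsuid;
	gid_t fsgid;
	const gid_t *groups;	/* supplementary groups */
	size_t ngroups;
};

/* Hypothetical objective context: who owns the object, plus its mode bits. */
struct obj_ctx {
	uid_t uid;
	gid_t gid;
	mode_t mode;
};

/* Is gid one of the subject's supplementary groups? */
static bool in_groups(const struct subj_ctx *s, gid_t gid)
{
	for (size_t i = 0; i < s->ngroups; i++)
		if (s->groups[i] == gid)
			return true;
	return false;
}

/* Pick the owner, group or other permission triplet, then test the mask. */
bool may_access(const struct subj_ctx *s, const struct obj_ctx *o, int mask)
{
	unsigned int bits = o->mode;

	if (s->fsuid == o->uid)
		bits >>= 6;		/* owner triplet */
	else if (s->fsgid == o->gid || in_groups(s, o->gid))
		bits >>= 3;		/* group triplet */

	/* Every requested bit must be present in the chosen triplet. */
	return (mask & ~bits & 7) == 0;
}
```

Note that the owner triplet is chosen whenever the UIDs match, even if the
group or other triplet would have granted more.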


The Linux kernel supports the following types of credentials:

 (1) Traditional UNIX credentials.

	Real User ID
	Real Group ID

     The UID and GID are carried by most, if not all, Linux objects, even if in
     some cases they have to be invented (FAT or CIFS files for example, which
     are derived from Windows).  These (mostly) define the objective context of
     that object, with tasks being slightly different in some cases.

	Effective, Saved and FS User ID
	Effective, Saved and FS Group ID
	Supplementary groups

     These are additional credentials used by tasks only.  Usually, the
     EUID/EGID/GROUPS will be used as the subjective context, and the real
     UID/GID will be used as the objective.  For tasks, it should be noted that
     this is not always true.

 (2) Capabilities.

	Set of permitted capabilities
	Set of inheritable capabilities
	Set of effective capabilities
	Capability bounding set

     These are only carried by tasks.  They indicate superior capabilities
     granted piecemeal to a task that an ordinary task wouldn't otherwise have.
     These are manipulated implicitly by changes to the traditional UNIX
     credentials, but can also be manipulated directly by the capset() system
     call.

     The permitted capabilities are those caps that the process might grant
     itself to its effective or permitted sets through capset().  The
     inheritable set might also be so constrained.

     The effective capabilities are the ones that a task is actually allowed to
     make use of itself.

     The inheritable capabilities are the ones that may get passed across
     execve().

     The bounding set limits the capabilities that may be inherited across
     execve(), especially when a binary is executed that will execute as UID 0.

 (3) Secure management flags (securebits).

     These are only carried by tasks.  These govern the way the above
     credentials are manipulated and inherited over certain operations such as
     execve().  They aren't used directly as objective or subjective
     credentials.

 (4) Keys and keyrings.

     These are only carried by tasks.  They carry and cache security tokens
     that don't fit into the other standard UNIX credentials.  They are for
     making such things as network filesystem keys available to the file
     accesses performed by processes, without the necessity of ordinary
     programs having to know about security details involved.

     Keyrings are a special type of key.  They carry sets of other keys and can
     be searched for the desired key.  Each process may subscribe to a number
     of keyrings:

	Per-thread keyring
	Per-process keyring
	Per-session keyring

     When a process accesses a key, if not already present, it will normally be
     cached on one of these keyrings for future accesses to find.

     For more information on using keys, see Documentation/keys.txt.

 (5) LSM

     The Linux Security Module allows extra controls to be placed over the
     operations that a task may do.  Currently Linux supports two main
     alternate LSM options: SELinux and Smack.

     Both work by labelling the objects in a system and then applying sets of
     rules (policies) that say what operations a task with one label may do to
     an object with another label.

 (6) AF_KEY

     This is a socket-based approach to credential management for networking
     stacks [RFC 2367].  It isn't discussed by this document as it doesn't
     interact directly with task and file credentials.

Credentials that are only carried by tasks may also be carried by a file
descriptor to record the subjective context of the task that opened it at the
time that it opened it.  In such a case, some of the subjective context
recorded thus will be used instead of that of the task that is acting upon it.

Files on disk or obtained over the network may have annotations that form the
objective security context of that file.  Depending on the type of filesystem,
this may include one or more of the following:

 (*) UNIX UID, GID, mode;

 (*) Windows user ID;

 (*) Access control list;

 (*) LSM security label;

 (*) UNIX exec privilege escalation bits (SUID/SGID);

 (*) File capabilities exec privilege escalation bits.

These are compared to the task's subjective security context, and certain
operations allowed or disallowed as a result.  In the case of execve(), the
privilege escalation bits come into play, and may allow the resulting process
extra privileges, based on the annotations on the executable file.
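
The SUID/SGID markings can be examined from user space with stat().  The
sketch below is illustrative only: mode_may_escalate() and file_may_escalate()
are names invented here, and the check looks at mode bits alone.  File
capabilities, nosuid mounts and LSM policy, all of which also influence
execve(), are not considered.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Does this mode carry exec privilege escalation bits? */
bool mode_may_escalate(mode_t mode)
{
	if (!S_ISREG(mode))
		return false;
	if (mode & S_ISUID)
		return true;
	/* SGID only counts for execution when group-execute is also set. */
	return (mode & S_ISGID) && (mode & S_IXGRP);
}

/* Returns 1 or 0 for an existing file, or -1 if stat() fails. */
int file_may_escalate(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	return mode_may_escalate(st.st_mode) ? 1 : 0;
}
```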


In Linux, all of a task's credentials are held in (uid, gid) or through
(groups, keys, LSM security) a refcounted structure of type 'struct cred'.
Each task points to its credentials by a pointer called 'cred' in its
task_struct.
Once a set of credentials has been prepared and committed, it may not be
changed, barring the following exceptions:

 (1) its reference count may be changed;

 (2) the reference count on the group_info struct it points to may be changed;

 (3) the reference count on the security data it points to may be changed;

 (4) the reference count on any keyrings it points to may be changed;

 (5) any keyrings it points to may be revoked, expired or have their
     attributes changed; and

 (6) the contents of any keyrings to which it points may be changed (the whole
     point of keyrings being a shared set of credentials, modifiable by anyone
     with appropriate access).

To alter anything in the cred struct, the copy-and-replace principle must be
adhered to.  First take a copy, then alter the copy and then use RCU to change
the task pointer to make it point to the new copy.  There are wrappers to aid
with this (see below).
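
The copy-and-replace principle can be modelled in user space.  The sketch
below is only an analogy: 'cred_sim', 'task_sim' and the functions around them
are invented for this example, C11 atomics stand in for RCU pointer
publication, and the old set is freed as soon as its last reference is dropped
rather than after an RCU grace period.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Simplified stand-in for struct cred: a refcount plus two IDs. */
struct cred_sim {
	atomic_int usage;
	unsigned int uid, gid;
};

/* Simplified stand-in for a task: it only points at its credentials. */
struct task_sim {
	_Atomic(struct cred_sim *) cred;
};

/* Take an extra reference on a committed set. */
struct cred_sim *cred_sim_get(struct cred_sim *c)
{
	atomic_fetch_add(&c->usage, 1);
	return c;
}

/* Drop a reference; the set is freed when the last one goes away. */
void cred_sim_put(struct cred_sim *c)
{
	if (atomic_fetch_sub(&c->usage, 1) == 1)
		free(c);
}

/* Copy: duplicate the task's current set.  The copy is private and mutable. */
struct cred_sim *task_sim_prepare(struct task_sim *t)
{
	const struct cred_sim *old = atomic_load(&t->cred);
	struct cred_sim *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	atomic_init(&new->usage, 1);
	new->uid = old->uid;
	new->gid = old->gid;
	return new;
}

/* Replace: publish the new set and drop the task's reference on the old. */
void task_sim_commit(struct task_sim *t, struct cred_sim *new)
{
	struct cred_sim *old = atomic_exchange(&t->cred, new);

	cred_sim_put(old);
}
```

Anyone still holding a reference on the old set keeps seeing the old,
unchanged values, which is the property the principle is meant to provide.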

A task may only alter its _own_ credentials; it is no longer permitted for a
task to alter another's credentials.  This means the capset() system call is no
longer permitted to take any PID other than the one of the current process.
Also keyctl_instantiate() and keyctl_negate() functions no longer permit
attachment to process-specific keyrings in the requesting process as the
instantiating process may need to create them.


A task being able to alter only its own credentials permits the current process
to read or replace its own credentials without the need for any form of locking
- which simplifies things greatly.  It can just call:

	const struct cred *current_cred()

to get a pointer to its credentials structure, and it doesn't have to release
it afterwards.

There are convenience wrappers for retrieving specific aspects of a task's
credentials (the value is simply returned in each case):

	uid_t current_uid(void)		Current's real UID
	gid_t current_gid(void)		Current's real GID
	uid_t current_euid(void)	Current's effective UID
	gid_t current_egid(void)	Current's effective GID
	uid_t current_fsuid(void)	Current's file access UID
	gid_t current_fsgid(void)	Current's file access GID
	kernel_cap_t current_cap(void)	Current's effective capabilities
	void *current_security(void)	Current's LSM security pointer
	struct user_struct *current_user(void)  Current's user account

There are also convenience wrappers for retrieving specific associated pairs of
a task's credentials:

	void current_uid_gid(uid_t *, gid_t *);
	void current_euid_egid(uid_t *, gid_t *);
	void current_fsuid_fsgid(uid_t *, gid_t *);

which return these pairs of values through their arguments after retrieving
them from the current task's credentials.

In addition, there is a function for obtaining a reference on the current
process's current set of credentials:

	const struct cred *get_current_cred(void);

and functions for getting references to one of the credentials that don't
actually live in struct cred:

	struct user_struct *get_current_user(void);
	struct group_info *get_current_groups(void);

which get references to the current process's user accounting structure and
supplementary groups list respectively.

Once a reference has been obtained, it must be released with put_cred(),
free_uid() or put_group_info() as appropriate.


Whilst a task may access its own credentials without the need for locking, the
same is not true of a task wanting to access another task's credentials.  It
must use the RCU read lock and rcu_dereference().

The rcu_dereference() is wrapped by:

	const struct cred *__task_cred(struct task_struct *task);

This should be used inside the RCU read lock, as in the following example:

	void foo(struct task_struct *t, struct foo_data *f)
	{
		const struct cred *tcred;

		/* tcred is only valid between lock and unlock */
		rcu_read_lock();
		tcred = __task_cred(t);
		f->uid = tcred->uid;
		f->gid = tcred->gid;
		f->groups = get_group_info(tcred->groups);
		rcu_read_unlock();
	}

A function need not get RCU read lock to use __task_cred() if it is holding a
spinlock at the time as this implicitly holds the RCU read lock.

Should it be necessary to hold another task's credentials for a long period of
time, and possibly to sleep whilst doing so, then the caller should get a
reference on them using:

	const struct cred *get_task_cred(struct task_struct *task);

This does all the RCU magic inside of it.  The caller must call put_cred() on
the credentials so obtained when they're finished with.

There are a couple of convenience functions to access bits of another task's
credentials, hiding the RCU magic from the caller:

	uid_t task_uid(task)		Task's real UID
	uid_t task_euid(task)		Task's effective UID

If the caller is holding a spinlock or the RCU read lock at the time anyway,
then:

	__task_cred(task)->uid
	__task_cred(task)->euid

should be used instead.  Similarly, if multiple aspects of a task's credentials
need to be accessed, RCU read lock or a spinlock should be used, __task_cred()
called, the result stored in a temporary pointer and then the credential
aspects read from that before dropping the lock.  This prevents the
potentially expensive RCU magic from being invoked multiple times.

Should some other single aspect of another task's credentials need to be
accessed, then this can be used:

	task_cred_xxx(task, member)

where 'member' is a non-pointer member of the cred struct.  For instance:

	uid_t task_cred_xxx(task, suid);

will retrieve 'struct cred::suid' from the task, doing the appropriate RCU
magic.  This may not be used for pointer members as what they point to may
disappear the moment the RCU read lock is dropped.


As previously mentioned, a task may only alter its own credentials, and may not
alter those of another task.  This means that it doesn't need to use any
locking to alter its own credentials.

To alter the current process's credentials, a function should first prepare a
new set of credentials by calling:

	struct cred *prepare_creds(void);

This locks current->cred_replace_mutex and then allocates and constructs a
duplicate of the current process's credentials, returning with the mutex still
held if successful.  It returns NULL if not successful (out of memory).

The mutex prevents ptrace() from altering the ptrace state of a process whilst
security checks on credentials construction and changing are taking place, as
the ptrace state may alter the outcome, particularly in the case of execve().
The new credentials set should be altered appropriately, and any security
checks and hooks done.  Both the current and the proposed sets of credentials
are available for this purpose as current_cred() will return the current set
still at this point.

When the credential set is ready, it should be committed to the current process
by calling:

	int commit_creds(struct cred *new);

This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use rcu_assign_pointer() to actually
commit the new credentials to current->cred, it will release
current->cred_replace_mutex to allow ptrace() to take place, and it will notify
the scheduler and others of the changes.

This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as sys_setresuid().

Note that this function consumes the caller's reference to the new credentials.
The caller should _not_ call put_cred() on the new credentials afterwards.

Furthermore, once this function has been called on a new set of credentials,
those credentials may _not_ be changed further.

Should the security checks fail or some other error occur after prepare_creds()
has been called, then the following function should be invoked:

	void abort_creds(struct cred *new);

This releases the lock on current->cred_replace_mutex that prepare_creds() got
and then releases the new credentials.

A typical credentials alteration function would look something like this:

	int alter_suid(uid_t suid)
	{
		struct cred *new;
		int ret;

		new = prepare_creds();
		if (!new)
			return -ENOMEM;

		new->suid = suid;
		ret = security_alter_suid(new);
		if (ret < 0) {
			/* drop the mutex and the unused copy */
			abort_creds(new);
			return ret;
		}

		return commit_creds(new);
	}


There are some functions to help manage credentials:

 (*) void put_cred(const struct cred *cred);

     This releases a reference to the given set of credentials.  If the
     reference count reaches zero, the credentials will be scheduled for
     destruction by the RCU system.

 (*) const struct cred *get_cred(const struct cred *cred);

     This gets a reference on a live set of credentials, returning a pointer to
     that set of credentials.

 (*) struct cred *get_new_cred(struct cred *cred);

     This gets a reference on a set of credentials that is under construction
     and is thus still mutable, returning a pointer to that set of credentials.

When a new file is opened, a reference is obtained on the opening task's
credentials and this is attached to the file struct as 'f_cred' in place of
'f_uid' and 'f_gid'.  Code that used to access file->f_uid and file->f_gid
should now access file->f_cred->fsuid and file->f_cred->fsgid.

It is safe to access f_cred without the use of RCU or locking because the
pointer will not change over the lifetime of the file struct, and nor will the
contents of the cred struct pointed to, barring the exceptions listed above
(see the Task Credentials section).
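
The effect of pinning the opener's credentials to the file can be modelled in
user space.  This is a single-threaded sketch, not kernel code: 'cred_sim',
'task_sim', 'file_sim' and their helper functions are all invented for the
illustration, and a plain integer stands in for the real reference count.

```c
#include <stdlib.h>

/* Simplified stand-in for struct cred. */
struct cred_sim {
	int usage;
	unsigned int fsuid, fsgid;
};

struct task_sim { struct cred_sim *cred; };
struct file_sim { struct cred_sim *f_cred; };

/* Allocate a new set holding one reference, or return NULL. */
struct cred_sim *cred_sim_alloc(unsigned int fsuid, unsigned int fsgid)
{
	struct cred_sim *c = malloc(sizeof(*c));

	if (c) {
		c->usage = 1;
		c->fsuid = fsuid;
		c->fsgid = fsgid;
	}
	return c;
}

/* Drop one reference, freeing the set when none remain. */
void cred_sim_put(struct cred_sim *c)
{
	if (--c->usage == 0)
		free(c);
}

/* Opening a file pins the opener's current credentials to it. */
struct file_sim *file_sim_open(const struct task_sim *opener)
{
	struct file_sim *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	opener->cred->usage++;
	f->f_cred = opener->cred;
	return f;
}

/* Closing the file drops the pinned reference. */
void file_sim_close(struct file_sim *f)
{
	cred_sim_put(f->f_cred);
	free(f);
}

/* The task changes its fsuid by replacing its set, never by editing it. */
int task_sim_set_fsuid(struct task_sim *t, unsigned int fsuid)
{
	struct cred_sim *new = cred_sim_alloc(fsuid, t->cred->fsgid);

	if (!new)
		return -1;
	cred_sim_put(t->cred);
	t->cred = new;
	return 0;
}
```

After the task replaces its credentials, the file still points at the set
that was current when it was opened, so reading f_cred needs no locking in
this model.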


The VFS access functions, such as vfs_mkdir(), all take a pointer to the
credentials to use.  The standard system calls, such as sys_mkdirat(), just
pass current_cred() through this pointer.

However, under some circumstances it is desirable to override the credentials
used by the VFS, and that can be done by calling into such as vfs_mkdir() with
a different set of credentials.  This is done in the following places:

 (*) sys_faccessat().

 (*) do_coredump().

 (*) nfs4recover.c.

 (*) After checking that file-based I/O syscalls may be used by a process,
     file->f_cred is substituted for the current process's credentials.  This
     has the following consequences:

     (*) AIO daemons use the opener's credentials, not the daemon's.

     (*) The ext2/3/4 block allocator uses the opener's credentials rather than
     	 the caller's credentials when doing file data I/O.  Metadata
     	 operations (such as for mkdir) are still done with the caller's
     	 credentials.

     It can be looked on this way: the caller's credentials are checked to see
     whether an operation is permitted, and then the opener's credentials are
     used to actually do it.

 (*) Where filesystems don't access the credentials inside, they may use
     as the credentials pointer.

 (*) Calls to writeback code are made with a pointer to 'writeback_cred' as the
     credentials to use.  This is because the writeback may not be done by the
     caller process, and the Linux VM doesn't keep track of the file struct
     under which the original write took place.  Individual filesystems may
     improve on this by keeping track of writes themselves.

 (*) For system-level calls to VFS code, such as mounting special filesystems
     internally or issuing calls like vfs_mkdir() within them, a pointer to
     'init_cred' is used.  This is the set of credentials used by the init
     process and by all kernel daemons.
