Subject: Ptrace documentation, draft #6
Date: Wednesday 8th June 2011 12:45:04 UTC
Hello Michael,

Oleg suggested that instead of dropping the ptrace API doc into Documentation/* (where it will successfully bit-rot), it may be a better idea to put it into the ptrace manpage.

What do you think about it?

I am resending the latest draft below, slightly edited. Feel free to incorporate it into the manpage already. Even better if after reading it you have questions :) and we can improve this doc further.

======================================================================

Ptrace discussions repeatedly display a higher than average amount of misunderstanding and confusion. New ptrace users and even people who have already worked with it are repeatedly confused by details which are not documented anywhere, and knowledge of which exists mostly in the brains of strace/gdb/other_such_tools developers.

This document is meant as a brain dump of this knowledge. It assumes that the reader has a basic understanding of what ptrace is.

======================================================================

Ptrace userspace API.

Ptrace API (ab)uses standard Unix parent/child signaling over waitpid. An unfortunate effect of it is that the resulting API is complex and has subtle quirks. This document aims to describe these quirks.

Debugged processes (tracees) first need to be attached to the debugging process (tracer). Attachment and subsequent commands are per-thread: in a multi-threaded process, every thread can be individually attached to a (potentially different) tracer, or left not attached and thus not debugged. Therefore, "tracee" always means "(one) thread", never "a (possibly multi-threaded) process". Ptrace commands are always sent to a specific tracee using ptrace(PTRACE_foo, pid, ...), where pid is the TID of the corresponding Linux thread.

After attachment, each tracee can be in two states: running or stopped. There are many kinds of states when tracee is stopped, and in ptrace discussions they are often conflated. Therefore, it is important to use precise terms.
In this document, any stopped state in which tracee is ready to accept ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can be further subdivided into signal-delivery-stop, group-stop, syscall-stop and so on. They are described in detail later.

1.x Death under ptrace.

When a (possibly multi-threaded) process receives a killing signal (a signal set to SIG_DFL and whose default action is to kill the process), all threads exit. Tracees report their death to the tracer(s). This is not a ptrace-stop (because tracer can't query tracee status such as register contents, cannot restart tracee etc) but the notification about this event is delivered through waitpid API similarly to ptrace-stop.

Note that killing signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by tracer (or after it was dispatched to a thread which isn't traced), death from signal will happen on ALL tracees within multi-threaded process.

SIGKILL operates similarly, with exceptions. No signal-delivery-stop is generated for SIGKILL and therefore tracer can't suppress it. SIGKILL kills even within syscalls (syscall-exit-stop is not generated prior to death by SIGKILL). The net effect is that SIGKILL always kills the process (all its threads), even if some threads of the process are ptraced.

Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This operation is deprecated, use kill/tgkill(SIGKILL) instead.

^^^ Oleg prefers to deprecate it instead of describing (and needing to support) PTRACE_KILL's quirks.

When tracee executes exit syscall, it reports its death to its tracer. Other threads are not affected.

When any thread executes exit_group syscall, every tracee in its thread group reports its death to its tracer.

If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen before actual death.
This applies to exits on exit syscall, exit_group syscall, signal deaths (except SIGKILL), and when threads are torn down on execve in a multi-threaded process.

Tracer cannot assume that ptrace-stopped tracee exists. There are many scenarios when tracee may die while stopped (such as SIGKILL). Therefore, tracer must always be prepared to handle ESRCH error on any ptrace operation. Unfortunately, the same error is returned if tracee exists but is not ptrace-stopped (for commands which require stopped tracee), or if it is not traced by the process which issued the ptrace call. Tracer needs to keep track of stopped/running state, and interpret ESRCH as "tracee died unexpectedly" only if it knows that tracee has been observed to enter ptrace-stop. Note that there is no guarantee that waitpid(WNOHANG) will reliably report tracee's death status if ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead. IOW: tracee may be "not yet fully dead" but already refusing ptrace ops.

Tracer cannot assume that tracee ALWAYS ends its life by reporting WIFEXITED(status) or WIFSIGNALED(status).

??? or can it? Do we include such a promise into ptrace API?

1.x Stopped states.

When running tracee enters ptrace-stop, it notifies its tracer using waitpid API. Tracer should use waitpid family of syscalls to wait for tracee to stop. Most of this document assumes that tracer waits with:

        pid = waitpid(pid_or_minus_1, &status, __WALL);

Ptrace-stopped tracees are reported as returns with pid > 0 and WIFSTOPPED(status) == true.

??? Do we require __WALL usage, or will just using 0 be ok? Are the rules different if user wants to use waitid? Will waitid require WEXITED?

__WALL value does not include WSTOPPED and WEXITED bits, but implies their functionality.

Setting of WCONTINUED bit in waitpid flags is not recommended: the continued state is per-process and consuming it can confuse real parent of the tracee.
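
The bookkeeping described above (tracking stopped/running state in order to interpret ESRCH) might be sketched in C roughly as follows. This is only an illustration: struct tracee, note_wait_status(), note_restart() and esrch_means_death() are hypothetical names made up for this example, not part of any API, and the sketch assumes the status values come from waitpid(tid, &status, __WALL).

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical per-tracee record kept by the tracer. */
struct tracee {
    pid_t tid;      /* thread id, as passed to ptrace() and waitpid() */
    bool  stopped;  /* a ptrace-stop was observed and tracee was not restarted yet */
    bool  dead;     /* death was reported through waitpid */
};

enum wait_kind { WAIT_EXITED, WAIT_KILLED, WAIT_PTRACE_STOP, WAIT_OTHER };

/* Classify a status returned by waitpid() for this tracee and
 * update the tracer's view of it. */
enum wait_kind note_wait_status(struct tracee *t, int status)
{
    if (WIFSTOPPED(status)) {
        t->stopped = true;
        return WAIT_PTRACE_STOP;
    }
    if (WIFEXITED(status) || WIFSIGNALED(status)) {
        t->stopped = false;
        t->dead = true;
        return WIFEXITED(status) ? WAIT_EXITED : WAIT_KILLED;
    }
    return WAIT_OTHER;
}

/* To be called after a restarting ptrace command succeeded. */
void note_restart(struct tracee *t)
{
    t->stopped = false;
}

/* ESRCH is taken as "tracee died unexpectedly" only if the tracee
 * was known to be in ptrace-stop when the ptrace call was made. */
bool esrch_means_death(const struct tracee *t, int err)
{
    return err == ESRCH && t->stopped;
}
```

A tracer would call note_wait_status() on every waitpid result, note_restart() after every successful restarting command, and consult esrch_means_death() with errno when a ptrace call fails.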
Use of WNOHANG bit in waitpid flags may cause waitpid to return 0 ("no wait results available yet") even if tracer knows there should be a notification. Example:

        kill(tracee, SIGKILL);
        waitpid(tracee, &status, __WALL | WNOHANG);

??? waitid usage? WNOWAIT?
??? describe how wait notifications queue (or not queue)

The following kinds of ptrace-stops exist: signal-delivery-stops, group-stop, PTRACE_EVENT stops, syscall-stops [, SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP]. They all are reported as waitpid result with WIFSTOPPED(status) == true. They may be differentiated by checking (status >> 8) value, and if looking at (status >> 8) value doesn't resolve ambiguity, by querying PTRACE_GETSIGINFO. (Note: WSTOPSIG(status) macro returns ((status >> 8) & 0xff) value).

1.x.x Signal-delivery-stop

When a (possibly multi-threaded) process receives any signal except SIGKILL, kernel selects a thread which handles the signal (if signal is generated with t[g]kill, thread selection is done by user). If selected thread is traced, it enters signal-delivery-stop. By this point, signal is not yet delivered to the process, and can be suppressed by tracer. If tracer doesn't suppress the signal, it passes signal to tracee in the next ptrace request. This second step of signal delivery is called "signal injection" in this document. Note that if signal is blocked, signal-delivery-stop doesn't happen until signal is unblocked, with the usual exception that SIGSTOP can't be blocked.

Signal-delivery-stop is observed by tracer as waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If WSTOPSIG(status) == SIGTRAP, this may be a different kind of ptrace-stop - see "Syscall-stops" and "execve" sections below for details. If WSTOPSIG(status) == stopping signal, this may be a group-stop - see below.

1.x.x Signal injection and suppression.
After signal-delivery-stop is observed by tracer, tracer should restart tracee with

        ptrace(PTRACE_rest, pid, 0, sig)

call, where PTRACE_rest is one of the restarting ptrace ops. If sig is 0, then signal is not delivered. Otherwise, signal sig is delivered. This operation is called "signal injection" in this document, to distinguish it from signal-delivery-stop.

Note that sig value may be different from WSTOPSIG(status) value - tracer can cause a different signal to be injected.

Note that suppressed signal still causes syscalls to return prematurely. Restartable syscalls will be restarted (tracer will observe tracee to execute restart_syscall(2) syscall if tracer uses PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may return with -EINTR even though no observable signal is injected to the tracee.

Note that restarting ptrace commands issued in ptrace-stops other than signal-delivery-stop are not guaranteed to inject a signal, even if sig is nonzero. No error is reported, nonzero sig may simply be ignored. Ptrace users should not try to "create new signal" this way: use tgkill(2) instead.

This is a cause of confusion among ptrace users. One typical scenario is that tracer observes group-stop, mistakes it for signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0, stopsig) with the intention of injecting stopsig, but stopsig gets ignored and tracee continues to run.

SIGCONT signal has a side effect of waking up (all threads of) group-stopped process. This side effect happens before signal-delivery-stop. Tracer can't suppress this side-effect (it can only suppress signal injection, which only causes SIGCONT handler to not be executed in the tracee, if such handler is installed). In fact, waking up from group-stop may be followed by signal-delivery-stop for signal(s) *other than* SIGCONT, if they were pending when SIGCONT was delivered. IOW: SIGCONT may not be the first signal observed by the tracee after it was sent.
Stopping signals cause (all threads of) process to enter group-stop. This side effect happens after signal injection, and therefore can be suppressed by tracer.

PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which corresponds to delivered signal. PTRACE_SETSIGINFO may be used to modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t, si_signo field and sig parameter in restarting command must match, otherwise the result is undefined.

1.x.x Group-stop

When a (possibly multi-threaded) process receives a stopping signal, all threads stop. If some threads are traced, they enter a group-stop. Note that stopping signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by tracer (or after it was dispatched to a thread which isn't traced), group-stop will be initiated on ALL tracees within multi-threaded process. As usual, every tracee reports its group-stop separately to corresponding tracer.

Group-stop is observed by tracer as waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result is returned by some other classes of ptrace-stops, therefore the recommended practice is to perform

        ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)

call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU - only these four signals are stopping signals. If tracer sees something else, it can't be group-stop. Otherwise, tracer needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely a group-stop. (Other failure codes are possible, such as ESRCH "no such process" if SIGKILL killed the tracee).

As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it restarts or kills it, tracee will not run, and will not send notifications (except SIGKILL death) to tracer, even if tracer enters into another waitpid call.
Currently, the fact that a ptrace-stopped tracee neither runs nor sends notifications until it is restarted causes a problem with transparent handling of stopping signals: if tracer restarts tracee after group-stop, SIGSTOP is effectively ignored: tracee doesn't remain stopped, it runs. If tracer doesn't restart tracee before entering into next waitpid, future SIGCONT will not be reported to the tracer, which would make SIGCONT have no effect.

1.x.x PTRACE_EVENT stops

If tracer sets PTRACE_O_TRACEfoo options, tracee will enter ptrace-stops called PTRACE_EVENT stops.

PTRACE_EVENT stops are observed by tracer as waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. An additional bit is set in a higher byte of status word: value (status >> 8) will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist:

PTRACE_EVENT_VFORK - stop before return from vfork/clone+CLONE_VFORK. When tracee is continued after this, it will wait for child to exit/exec before continuing its execution (IOW: usual behavior on vfork).

PTRACE_EVENT_FORK - stop before return from fork/clone+SIGCHLD

PTRACE_EVENT_CLONE - stop before return from clone

PTRACE_EVENT_VFORK_DONE - stop before return from vfork/clone+CLONE_VFORK, but after vfork child unblocked this tracee by exiting or exec'ing.

For all four stops described above: stop occurs in parent, not in newly created thread. PTRACE_GETEVENTMSG can be used to retrieve new thread's tid.

PTRACE_EVENT_EXEC - stop before return from exec.

PTRACE_EVENT_EXIT - stop before exit (including death from exit_group), signal death, or exit caused by execve in multi-threaded process. PTRACE_GETEVENTMSG returns exit status. Registers can be examined (unlike when "real" exit happens). The tracee is still alive, it needs to be PTRACE_CONTed or PTRACE_DETACHed to finish exit.

PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP, si_code = (event << 8) | SIGTRAP.

1.x.x Syscall-stops

If tracee was restarted by PTRACE_SYSCALL, tracee enters syscall-enter-stop just prior to entering any syscall.
If tracer restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when syscall is finished, or if it is interrupted by a signal. (That is, signal-delivery-stop never happens between syscall-enter-stop and syscall-exit-stop, it happens *after* syscall-exit-stop).

Other possibilities are that tracee may stop in a PTRACE_EVENT stop, exit (if it entered exit or exit_group syscall), be killed by SIGKILL, or die silently (if it is a thread group leader, execve syscall happened in another thread, and that thread is not traced by the same tracer).

Syscall-enter-stop and syscall-exit-stop are observed by tracer as waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then WSTOPSIG(status) == (SIGTRAP | 0x80).

Syscall-stops can be distinguished from signal-delivery-stop with SIGTRAP by querying PTRACE_GETSIGINFO: si_code <= 0 if SIGTRAP was sent by usual suspects like [tg]kill/sigqueue/etc; or = SI_KERNEL (0x80) if sent by kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP | 0x80). However, syscall-stops happen very often (twice per syscall), and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat expensive.

Some architectures allow them to be distinguished by examining registers. For example, on x86 rax = -ENOSYS in syscall-enter-stop. Since SIGTRAP (like any other signal) always happens *after* syscall-exit-stop, and at this point rax almost never contains -ENOSYS, SIGTRAP looks like "syscall-stop which is not syscall-enter-stop", IOW: it looks like a "stray syscall-exit-stop" and can be detected this way. But such detection is fragile and is best avoided.

Using PTRACE_O_TRACESYSGOOD option is the recommended method, since it is reliable and does not incur performance penalty.

Syscall-enter-stop and syscall-exit-stop are indistinguishable from each other by tracer.
Tracer needs to keep track of the sequence of ptrace-stops in order to not misinterpret syscall-enter-stop as syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's death - no other kinds of ptrace-stop can occur in between.

If after syscall-enter-stop tracer uses restarting command other than PTRACE_SYSCALL, syscall-exit-stop is not generated.

PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code = SIGTRAP or (SIGTRAP | 0x80).

1.x.x SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP

??? document PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP

1.x Informational and restarting ptrace commands.

Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee to be in ptrace-stop, otherwise they fail with ESRCH.

When tracee is in ptrace-stop, tracer can read and write data to tracee using informational commands. They leave tracee in ptrace-stopped state:

        longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
        ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
        ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
        ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
        ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
        ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
        ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
        ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

Note that some errors are not reported. For example, setting siginfo may have no effect in some ptrace-stops, yet the call may succeed (return 0 and not set errno); querying GETEVENTMSG may succeed and return some random value if current ptrace-stop is not documented as returning meaningful event message.

ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags) affects one tracee. Current flags are replaced. Flags are inherited by new tracees created and "auto-attached" via active PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options.

Another group of commands makes ptrace-stopped tracee run.
They have the form:

        ptrace(PTRACE_cmd, pid, 0, sig);

where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the signal to be injected. Otherwise, sig may be ignored (recommended practice is to always pass 0 in these cases).

1.x Attaching and detaching

A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0, 0) call. This also sends SIGSTOP to this thread. If tracer wants this SIGSTOP to have no effect, it needs to suppress it. Note that if other signals are concurrently sent to this thread during attach, tracer may see tracee enter signal-delivery-stop with other signal(s) first! The usual practice is to reinject these signals until SIGSTOP is seen, then suppress SIGSTOP injection. The design bug here is that attach and concurrent SIGSTOP are racing and concurrent SIGSTOP may be lost.

??? Describe how to attach to a thread which is already group-stopped.

Since attaching sends SIGSTOP and tracer usually suppresses it, this may cause stray EINTR return from the currently executing syscall in the tracee, as described in "signal injection and suppression" section.

ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a tracee. It continues to run (doesn't enter ptrace-stop). A common practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and allow parent (which is our tracer now) to observe our signal-delivery-stop.

If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect, then children created by (vfork or clone(CLONE_VFORK)), (fork or clone(SIGCHLD)) and (other kinds of clone) respectively are automatically attached to the same tracer which traced their parent. SIGSTOP is delivered to them, causing them to enter signal-delivery-stop after they exit syscall which created them.

Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig). PTRACE_DETACH is a restarting operation, therefore it requires tracee to be in ptrace-stop.
If tracee is in signal-delivery-stop, signal can be injected. Otherwise, sig parameter may be silently ignored.

If tracee is running when tracer wants to detach it, the usual solution is to send SIGSTOP (using tgkill, to make sure it goes to the correct thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP and then detach it (suppressing SIGSTOP injection). Design bug is that this can race with concurrent SIGSTOPs. Another complication is that tracee may enter other ptrace-stops and needs to be restarted and waited for again, until SIGSTOP is seen. Yet another complication is to be sure that tracee is not already ptrace-stopped, because no signal delivery happens while it is - not even SIGSTOP.

??? Describe how to detach from a group-stopped tracee so that it doesn't run, but continues to wait for SIGCONT.

If tracer dies, all tracees are automatically detached and restarted, unless they were in group-stop. Handling of restart from group-stop is currently buggy, but "as planned" behavior is to leave tracee stopped and waiting for SIGCONT. If tracee is restarted from signal-delivery-stop, pending signal is injected.

1.x execve under ptrace.

During execve, kernel destroys all other threads in the process, and resets execve'ing thread tid to tgid (process id). This looks very confusing to tracers:

All other threads stop in PTRACE_EVENT_EXIT stop, if requested by active ptrace option. Then all other threads except thread group leader report death as if they exited via exit syscall with exit code 0. Then PTRACE_EVENT_EXEC stop happens, if requested by active ptrace option (on which tracee - leader? execve-ing one?).

The execve-ing tracee changes its pid while it is in execve syscall. (Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace calls, is tracee's tid). That is, pid is reset to process id, which coincides with thread group leader tid.
If thread group leader has reported its death by this time, for tracer this looks like dead thread group leader "reappears from nowhere". If thread group leader was still alive, for tracer this may look as if thread group leader returns from a different syscall than it entered, or even "returned from syscall even though it was not in any syscall". If thread group leader was not traced (or was traced by a different tracer), during execve it will appear as if it has become a tracee of the tracer of execve'ing tracee. All these effects are the artifacts of pid change.

PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this case. It enables PTRACE_EVENT_EXEC stop which occurs before execve syscall return.

Pid change happens before PTRACE_EVENT_EXEC stop, not after.

When tracer receives PTRACE_EVENT_EXEC stop notification, it is guaranteed that except this tracee and thread group leader, no other threads from the process are alive.

On receiving this notification, tracer should clean up all its internal data structures about all threads of this process, and retain only one data structure, one which describes single still running tracee, with pid = tgid = process id.

Currently, there is no way to retrieve former pid of execve-ing tracee. If tracer doesn't keep track of its tracees' thread group relations, it may be unable to know which tracee execve-ed and therefore no longer exists under old pid due to pid change.

Example: two threads execve at the same time:

 ** we get syscall-enter-stop in thread 1: **
 PID1 execve("/bin/foo", "foo"