1 Introduction to Kernel Debugging

Kernel debugging is a task normally performed by systems engineers writing kernel programs. A kernel program is one that is built as part of the kernel and that references kernel data structures. System administrators might also debug the kernel in the following situations:

A process is hung or stops running unexpectedly

The need arises to examine, and possibly modify, kernel parameters

The system itself hangs, panics, or crashes

This manual describes how to debug kernel programs and the kernel. It also includes information about analyzing crash dump files.

In addition to the information provided here, tracing a kernel problem can require a basic understanding of one or more of the following technical areas:

The hardware architecture
See the Alpha Architecture Handbook for an overview of the Alpha hardware architecture and a description of the 64-bit Alpha RISC instruction set.

The internal design of the operating system at a source code and data structure level
See the Alpha Architecture Reference Manual for information on how the Tru64 UNIX operating system interfaces with the hardware.

This chapter provides an overview of the following topics:

Linking a kernel image prior to debugging for systems that are running a kernel built at boot time (Section 1.1)

Debugging kernel programs (Section 1.2)

Debugging the running kernel (Section 1.3)

Analyzing a crash dump file (Section 1.4)

1.1 Linking a Kernel Image for Debugging

By default, the kernel is a statically linked image that resides in the file /vmunix. However, your system might be configured so that it is linked at bootstrap time. Rather than being a bootable image, the boot file is a text file that describes the hardware and software that will be present on the running system. Using this information, the bootstrap linker links the modules that are needed to support this hardware and software. The linker builds the kernel directly into memory.

You cannot directly debug a bootstrap-linked kernel because you must supply the name of an image to the kernel debugging tools. Without the image, the tools have no access to symbol names, variable names, and so on. Therefore, the first step in any kernel debugging effort is to determine whether your kernel was linked at bootstrap time. If the kernel was linked at bootstrap time, you must then build a kernel image file to use for debugging purposes.

The best way to determine whether your system is bootstrap linked or statically linked is to use the file command to test the type of file from which your system was booted. If your system is a bootstrap-linked system, it was booted from an ASCII text file; otherwise, it was booted from an executable image file. For example, enter the following command to determine the type of file from which your system was booted:

# /usr/bin/file `/usr/sbin/sizer -b`
/etc/sysconfigtab: ascii text

The sizer -b command returns the name of the file from which the system was booted. This file name is input to the file command, which determines that the system was booted from an ASCII text file. The output shown in the preceding example indicates that the system is a bootstrap-linked system. If the system had been booted from an executable image file named vmunix, the output from the file command would have appeared as follows:

vmunix:COFF format alpha executable or object module
 not stripped
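
The check described above can be sketched as a small shell function. This is only an illustration: classify_boot_file is a hypothetical helper name, and the two patterns it looks for ("ascii text" and "executable") are taken from the sample output shown in this section, so other file types are reported as unknown.

```shell
#!/bin/sh
# Classify one line of `file` output as described in this section.
# classify_boot_file is a hypothetical helper, not a system command.
classify_boot_file() {
    case "$1" in
        *"ascii text"*)  echo "bootstrap-linked" ;;
        *"executable"*)  echo "statically linked" ;;
        *)               echo "unknown" ;;
    esac
}

# On the system being examined (requires sizer), you might call:
#   classify_boot_file "$(/usr/bin/file "$(/usr/sbin/sizer -b)")"
classify_boot_file "/etc/sysconfigtab: ascii text"
```

Matching on the text of the file command output keeps the sketch independent of the boot file name, which differs from system to system.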

If your system is running a bootstrap-linked kernel, build a kernel image that is identical to the bootstrap-linked kernel your system is running, by entering the following command:

# /usr/bin/ld -o vmunix.image `/usr/sbin/sizer -m`

The output from the sizer -m command is a list of the exact modules and linker flags used to build the currently running bootstrap-linked kernel. This output causes the ld command to create a kernel image that is identical to the bootstrap-linked kernel running on your system. The kernel image is written to the file named by the -o flag, in this case the vmunix.image file.

Once you create this image, you can debug the kernel as described in this manual, using the dbx, kdbx, and kdebug debuggers. When you invoke the dbx or kdbx debugger, remember to specify the name of the kernel image file you created with the ld command, such as the vmunix.image file shown here.

When you are finished debugging the kernel, you can remove the kernel image file you created for debugging purposes.
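
As a recap, the build, debug, and cleanup sequence for a bootstrap-linked system might look like the following sketch. The debug_image_cmds function is a hypothetical helper that only prints the commands so that you can review them before running anything. The dbx line assumes that the -k invocation shown in Section 1.3 accepts the new image in place of /vmunix, and the rm line is an assumed cleanup step.

```shell
#!/bin/sh
# Print (do not run) the commands for the Section 1.1 workflow.
# debug_image_cmds is a hypothetical helper name.
debug_image_cmds() {
    image=${1:-vmunix.image}
    # Build an image identical to the running bootstrap-linked kernel.
    printf '%s\n' "/usr/bin/ld -o $image \`/usr/sbin/sizer -m\`"
    # Point the debugger at the new image (assumed to replace /vmunix).
    printf '%s\n' "dbx -k $image /dev/mem"
    # Remove the image once debugging is finished (assumed cleanup).
    printf '%s\n' "rm $image"
}

debug_image_cmds vmunix.image
```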

1.2 Debugging Kernel Programs

Kernel programs can be difficult to debug because you normally cannot control kernel execution. To make debugging kernel programs more convenient, the system provides the kdebug debugger. The kdebug debugger is code that resides inside the kernel and allows you to use the dbx debugger to control execution of a running kernel in the same manner as you control execution of a user space program. To debug a kernel program in this manner, follow these steps:

Build your kernel program into the kernel on a test system.

Set up the kdebug debugger, as described in Section 2.3.

Enter the dbx -remote command on a remote build system, supplying the pathname of the kernel running on the test system.

Set breakpoints and enter dbx commands as you normally would. Section 2.1 describes some of the commands that are useful during kernel debugging. For general information about using dbx, see the Programmer's Guide.

The system also provides the kdbx debugger, which is designed especially for debugging kernel code. This debugger contains a number of special commands, called extensions, that allow you to display kernel data structures in a readable format. Section 2.2 describes using kdbx and its extensions. (You cannot use the kdbx debugger with the kdebug debugger.)

Another feature of kdbx is that you can customize it by writing your own extensions. The system contains a set of kdbx library routines that you can use to create extensions that display kernel data structures in ways that are meaningful to you. Chapter 3 describes writing kdbx extensions.

1.3 Debugging the Running Kernel

When you have problems with a process or set of processes, you can attempt to identify the problem by debugging the running kernel. You might also invoke the debugger on the running kernel to examine the values assigned to system parameters. (You can modify the value of the parameters using the debugger, but this practice can cause problems with the kernel and should be avoided.)

You use the dbx or kdbx debugger to examine the state of processes running on your system and to examine the value of system parameters. The kdbx debugger provides special commands, called extensions, that you can use to display kernel data structures. (Section 2.2.3 describes the extensions.)

To examine the state of processes, you invoke the debugger (as described in Section 2.1 or Section 2.2) using the following command:

# dbx -k /vmunix /dev/mem

This command invokes dbx with the kernel debugging -k option (flag), which maps kernel addresses to make kernel debugging easier. The /vmunix and /dev/mem parameters cause the debugger to operate on the running kernel.

Once in the dbx environment, you use dbx commands to display process IDs (PIDs) and trace execution of processes. You can perform the same tasks using the kdbx debugger. The following example shows the dbx command you use to display process IDs:

(dbx) kps
  PID   COMM
00000   kernel idle
00001   init
00014   kloadsrv
00016   update

.
.
.
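
If you save the kps output to a file, you can look up a PID by command name instead of scanning the list by eye. The following is a minimal sketch that assumes the two-column layout shown above; pid_of is a hypothetical helper name, and the here-document stands in for saved output.

```shell
#!/bin/sh
# pid_of is a hypothetical helper. It reads saved kps output on stdin and
# prints the PID, without leading zeros, of the first matching command.
pid_of() {
    awk -v name="$1" '
        NR > 1 {                             # skip the "PID COMM" header
            pid = $1
            sub(/^[ \t]*[0-9]+[ \t]+/, "")   # leave only the COMM column
            if ($0 == name) { print pid + 0; exit }
        }'
}

pid_of kloadsrv <<'EOF'
  PID   COMM
00000   kernel idle
00001   init
00014   kloadsrv
00016   update
EOF
```

For the sample above, the function prints 14. Comparing against the whole COMM column, rather than a single field, lets the lookup handle names that contain spaces, such as kernel idle.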

If you want to trace the execution of the kloadsrv daemon, use the dbx set command to assign the PID of the kloadsrv daemon to the $pid symbol. Then, enter the t command to display the stack trace:

(dbx) set $pid = 14
(dbx) t
>  0 thread_block() ["/usr/sde/build/src/kernel/kern/sched_prim.c":1623, 0xfffffc0000\
43d77c]
   1 mpsleep(0xffffffff92586f00, 0x11a, 0xfffffc0000279cf4, 0x0, 0x0) ["/usr/sde/build\
/src/kernel/bsd/kern_synch.c":411, 0xfffffc000040adc0]
   2 sosleep(0xffffffff92586f00, 0x1, 0xfffffc000000011a, 0x0, 0xffffffff81274210) ["/usr/sde\
/build/src/kernel/bsd/uipc_socket2.c":654, 0xfffffc0000254ff8]
   3 sosbwait(0xffffffff92586f60, 0xffffffff92586f00, 0x0, 0xffffffff92586f00, 0x10180) ["/usr\
/sde/build/src/kernel/bsd/uipc_socket2.c":630, 0xfffffc0000254f64]
   4 soreceive(0x0, 0xffffffff9a64f658, 0xffffffff9a64f680, 0x8000004300000000, 0x0) ["/usr/sde\
/build/src/kernel/bsd/uipc_socket.c":1297, 0xfffffc0000253338]
   5 recvit(0xfffffc0000456fe8, 0xffffffff9a64f718, 0x14000c6d8, 0xffffffff9a64f8b8,\
 0xfffffc000043d724) ["/usr/sde/build/src/kernel/bsd/uipc_syscalls.c":1002,\
 0xfffffc00002574f0]
   6 recvfrom(0xffffffff81274210, 0xffffffff9a64f8c8, 0xffffffff9a64f8b8, 0xffffffff9a64f8c8,\
 0xfffffc0000457570) ["/usr/sde/build/src/kernel/bsd/uipc_syscalls.c":860,\
 0xfffffc000025712c]
   7 orecvfrom(0xffffffff9a64f8b8, 0xffffffff9a64f8c8, 0xfffffc0000457570, 0x1, 0xfffffc0000456fe8)\
 ["/usr/sde/build/src/kernel/bsd/uipc_syscalls.c":825, 0xfffffc000025708c]
   8 syscall(0x120024078, 0xffffffffffffffff, 0xffffffffffffffff, 0x21, 0x7d) ["/usr/sde\
/build/src/kernel/arch/alpha/syscall_trap.c":515, 0xfffffc0000456fe4]
   9 _Xsyscall(0x8, 0x12001acb8, 0x14000eed0, 0x4, 0x1400109d0) ["/usr/sde/build\
/src/kernel/arch/alpha/locore.s":1046, 0xfffffc00004486e4]
(dbx) exit
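
Because each frame wraps across several lines, a long trace can be easier to read when reduced to frame numbers and function names. The following sketch assumes that the trace has been saved to a file and that it follows the layout shown above; frames is a hypothetical helper name.

```shell
#!/bin/sh
# frames is a hypothetical helper. It reads a saved dbx trace on stdin and
# prints "frame-number function-name" for each stack frame.
frames() {
    # Frame lines start with an optional ">", a frame number, and a name
    # followed by "("; wrapped continuation lines do not match.
    sed -n 's/^>\{0,1\}[ ]*\([0-9][0-9]*\) \([A-Za-z_][A-Za-z0-9_]*\)(.*/\1 \2/p'
}

frames <<'EOF'
>  0 thread_block() ["/usr/sde/build/src/kernel/kern/sched_prim.c":1623, 0xfffffc0000\
43d77c]
   1 mpsleep(0xffffffff92586f00, 0x11a, 0xfffffc0000279cf4, 0x0, 0x0) ["/usr/sde/build\
/src/kernel/bsd/kern_synch.c":411, 0xfffffc000040adc0]
EOF
```

For the two frames above, the function prints "0 thread_block" and "1 mpsleep". Read from the bottom up, the condensed list for the full trace shows the kloadsrv daemon entering the kernel through a system call and sleeping while it waits to receive data on a socket.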

Often, looking at the trace of a process that is hanging or has unexpectedly stopped running reveals the problem. Once you find the problem, you can modify system parameters, restart daemons, or take other corrective actions.

For more information about the commands you can use to debug the running kernel, see Section 2.1 and Section 2.2.

1.4 Analyzing a Crash Dump File

If your system crashes, you can often find the cause of the crash by using dbx or kdbx to debug or analyze a crash dump file.

The operating system can crash because one of the following occurs:

Hardware exception

Software panic

Hung system
When a system hangs, it is often necessary to force the system to create dumps that you can analyze to determine why the system hung. The System Administration manual describes the procedure for forcing a crash dump of a hung system.

Resource exhaustion

The system crashes or hangs because it cannot continue executing. Normally, even in the case of a hardware exception, the operating system detects the problem. (For example, a machine-checking routine might discover a hardware problem and begin the process of crashing the system.) In general, the operating system performs the following steps when it detects a problem from which it cannot recover:

It calls the system panic function.
The panic function saves the contents of registers and sends the panic string (a message describing the reason for the system panic) to the error logger and the console terminal.
If the system is a symmetric multiprocessing (SMP) system, the panic function notifies the other CPUs in the system that a panic has occurred. The other CPUs then also execute the panic function and record the following panic string:
```
cpu_ip_intr: panic request
```
Once each CPU has recorded the system panic, execution continues only on the master CPU. All other CPUs in the SMP system stop execution.

It calls the system boot function.
The boot function records the stack.

It calls the dump function.
The dump function copies core memory into swap partitions. The system then either stops running or begins the reboot process. Console environment variables control whether the system reboots automatically. (The System Administration manual describes these environment variables.)

At system reboot time, the copy of core memory saved in the swap partitions is copied into a file, called a crash dump file. You can analyze the crash dump file to determine what caused the crash. By default, the crash dump is a partial (rather than full) dump and is in compressed form. For complete information about managing crash dumps and crash dump files, including how to change default settings, see the System Administration manual. For examples of analyzing crash dump files, see Chapter 4.