본문 바로가기

Programming/Linux_Kernel

paper on debugging kernel oops or hang

출처 : http://mail.nl.linux.org/kernelnewbies/2003-08/msg00347.html




Hello all,
   I recently wrote this paper for some co-workers of mine, and thought it
might be of interest to the kernelnewbies crowd.  I didn't see anything like
it on the kernelnewbies.org home page.  Forgive me if this information is
common knowledge.  The paper talks about using ksymoops and objdump to debug
kernel oops and hang problems.  I'm not planning on updating the document
again, but feel free to use it as you see fit (i.e. I don't want to maintain
this document, but someone else sure can).  This paper deals with the 2.4.x
series of kernels.  Debugging might be different on 2.6.x, I haven't spent
much time with that series of kernels.

Enjoy,
Erik Habbinga

Debugging a kernel oops or hang
Erik Habbinga
August 20, 2003

Step 1: capture the kernel state

Kernel oops
-----------
If the kernel has oopsed or BUG'd, there will be crash information left on
the
console. If you're debugging over a serial port, cut and paste the oops data
into a file.  If the system was stable enough during the crash, the oops
data
will be in /var/log/messages upon reboot.  You can even use a digital
camera to take a picture of the console and transcribe the oops data later.
In any case, you should end up with data that looks like this:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
80289912
*pde = 591be001
Oops: 0002
deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd
CPU:    1
EIP:    0010:[<80289912>]    Not tainted
EFLAGS: 00010202
eax: 00000000   ebx: f8f37138   ecx: f8b36d98   edx: 00000000
esi: 00619380   edi: f504ec00   ebp: 00000900   esp: d91bdcf8
ds: 0018   es: 0018   ss: 0018
Process spew (pid: 1230, stackpage=d91bd000)
Stack: f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050
00000000
       80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320
00609250
       00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600
09001000
Call Trace: [<80286341>]  [<80286427>]  [<80219fdc>]  [<8021a051>]
[<8013b78c>]
  [<8013b94b>]  [<80139b70>]  [<8012fd22>]  [<80130020>]  [<8013008c>]
[<80130b
1292ff>]  [<8012919c>]  [<801d8425>]  [<801d34d9>]  [<80137717>]
[<80106f27>]
Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03

This data contains a register and stack dump for the process that caused the
oops, along with a explanation of what caused the oops.  In this case, the
spew process caused a kernel NULL pointer dereference at virtual address
0x00000004.  All addresses below PAGE_SIZE are trapped by the CPU's MMU as
probable NULL pointer dereferences, and this is what happened here.

Sometimes, if the oops makes it into /var/log/messages, the syslogd daemon
might disassemble the oops automatically.  This is usually handy, unless you
don't have the correct System.map file on the system.  The wrong System.map
file will give wrong stack trace results, requiring you to generate the
system
addresses by hand into something resembling above by referring to that
incorrect System.map file.  Hopefully this won't happen to you.

Here's a quick description of a kernel BUG().  Sprinkled throughout the
kernel
code are calls to the BUG() macro.  This code is only executed at times
where
the kernel knows something is horribly wrong and the best course of action
is
to stop and report status.  On i386, the BUG() macro decodes to the "ud2a"
instruction (0x0f 0x0b).  This is an invalid instruction on i386, and the
CPU
will throw an "invalid opcode" exception upon execution.  The kernel traps
the
invalid opcode exception, prints "kernel BUG at __FILE__:__LINE!", and then
dumps the CPU state just like any other fault/trap/exception handler that
halts the machine.  So, if you see code like the following on i386, you know
you're looking at the BUG() macro:

Code;  00000000 Before first symbol
   0:   0f 0b                     ud2a

kernel hang
-----------
If the kernel has hung, you'll have to use the magic sysrq functionality to
get the stack trace for all currently running processes.  Compile the kernel
with the CONFIG_MAGIC_SYSRQ option enabled, found in the "kernel hacking"
section of your favorite "make *config" display.  From the console, doing
the
key sequence "Alt" -> "PrintScreen/SysRq" -> "t" will start the stack trace
procedure.  Hold down the keys as you type them on the console's keyboard.
If
you're running from a serial port, enable the "serial console" option under
"character devices" in "make *config" and put the appropriate options on the
kernel boot parameter line for your boot loader.  For example, the
following would be added to the "append" line in lilo.conf:

console=ttyS0,9600 console=tty0

This tells the kernel to send all console messages to serial port 0 at 9600
baud as well as the VGA console.

In the latest 2.4.x kernels, a file shows up in /proc when
CONFIG_MAGIC_SYSRQ
is configured.  Doing:

echo t > /proc/sysrq-trigger

will start the stack trace process.  Of course, if the system is hung to the
point you can't type, this won't be of help.  The serial port + console will
be your best option for getting the stack trace information.

The syslogd daemon will send the sysrq information to the /var/log/messages
file, disassembling it with the currently installed System.map file.  Once
again, if the wrong System.map file is on the system, you're in for some
serious pain.  The console might also show disassembled stack traces.

Here's the start of output from a raw SysRq-T stack trace:

SysRq : Show State
^M
^M                         free                        sibling
^M  task             PC    stack   pid father child younger older
^Minit          R 82831F2C  3780     1      0  1213               (NOTLB)
^MCall Trace: [<801138fa>]  [<80113820>]  [<801470ee>]  [<80147490>]
[<80106f27
>]
^Mkeventd       R 00010000  6068     2      1             3       (L-TLB)
^MCall Trace: [<801235d5>]  [<80105694>]
^Mksoftirqd_CPU R F7BFE000  5184     3      1             4     2 (L-TLB)
^MCall Trace: [<8011b62f>]  [<8011bb3c>]  [<80105694>]
^Mksoftirqd_CPU S F7BFC000  5924     4      1             5     3 (L-TLB)
^MCall Trace: [<8011b62f>]  [<8011bb3c>]  [<80105694>]
^Mksoftirqd_CPU R F7BFA000  6000     5      1             6     4 (L-TLB)
^MCall Trace: [<8011b62f>]  [<8011bb3c>]  [<80105694>]
^Mksoftirqd_CPU S F7BF8000  5404     6      1             7     5 (L-TLB)
^MCall Trace: [<8011b62f>]  [<8011bb3c>]  [<80105694>]
^Mkswapd        S F7BD8000  6356     7      1             8     6 (L-TLB)
^MCall Trace: [<80119cae>]  [<8012ffa6>]  [<80105694>]
^Mbdflush       R 00000286  6308     8      1             9     7 (L-TLB)
^MCall Trace: [<8011429e>]  [<8013bac3>]  [<80105694>]
< lots of lines deleted >
^Msmbd          R F435BF2C  5540  1513    612                1511 (NOTLB)
^MCall Trace: [<80146e7c>]  [<801138fa>]  [<80113820>]  [<801470ee>]
[<80147490
>]  [<80106f27>]

The call trace for every program running on the system is listed.  Note the
embedded "^M"'s.  This trace was captured off of a windows serial port
terminal program. I've written a perl script called "mapread" (see the end
of this
document) that will take a file containing the SysRq-T listing, strip out
the
"^M"'s, and translate the addresses using the System.map file located in the
current directory.  See the next section for details on mapread.

If you look at /var/log/messages after a SysRq-T call trace procedure,
you'll
see something like this in /var/log/messages.  In this case, syslogd managed
to do the addresss translation:

May  6 22:37:15 catdog4 kernel: SysRq : Show State
May  6 22:37:15 catdog4 kernel:
May  6 22:37:15 catdog4 kernel:                          free
     sibling
May  6 22:37:15 catdog4 kernel:   task             PC    stack   pid father
chil
d younger older
May  6 22:37:16 catdog4 kernel: init          S 82825F2C  3780     1      0
309
4               (NOTLB)
May  6 22:37:16 catdog4 kernel: Call Trace: [schedule_timeout+122/156]
[process
_timeout+0/96]  [do_select+458/516]  [sys_select+832/1148]
[system_call+51/56]
May  6 22:37:16 catdog4 kernel: keventd       S 00010000  4992     2      1
        3       (L-TLB)
May  6 22:37:16 catdog4 kernel: Call Trace: [context_thread+277/464]
[arch_kern
el_thread+40/56]
May  6 22:37:16 catdog4 kernel: ksoftirqd_CPU S 82872000  5064     3      1
        4     2 (L-TLB)
May  6 22:37:16 catdog4 kernel: Call Trace: [ksoftirqd+144/196]
[arch_kernel_th
read+40/56]

I'd always run /var/log/messages through a script like the following to
strip
out the date, time, server name, and kernel stuff:

#!/usr/bin/perl

@lines=<>;

for (@lines)
{
   print substr($_, 32); # highly dependent on the length of your system
name
}

Of course, there are other ways to do this.  In any case, if syslogd didn't
do
the address translation, I'd take /var/log/messages, grab the "SysRq: Show
State" lines, run them through the script above, save to a file, and run
them
through mapread (again, see the next section for details on mapread).


Step 2: disassemble the kernel state, translate the addresses

Kernel oops
-----------
For kernel oops data, the ksymoops tool will dissassemble the code into
something understandable.  You'll need the System.map generated by the
kernel
compilation process.  System.map will be in the top level of the kernel
directory after a build.

To run ksymoops, do the following:

% ksymoops -K -L -O -m System.map < oops.txt > oops.out

where the oops data above is in a file oops.txt and the corresponding
dissassembled oops will be written to the file oops.out.  Here's what you'll
get:

ksymoops 2.4.5 on i686 2.4.18-14.  Options used
     -V (default)
     -K (specified)
     -L (specified)
     -O (specified)
     -m System.map-andrew (specified)

Unable to handle kernel NULL pointer dereference at virtual address 00000004
80289912
*pde = 591be001
Oops: 0002
deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd
CPU:    1
EIP:    0010:[<80289912>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000000   ebx: f8f37138   ecx: f8b36d98   edx: 00000000
esi: 00619380   edi: f504ec00   ebp: 00000900   esp: d91bdcf8
ds: 0018   es: 0018   ss: 0018
Process spew (pid: 1230, stackpage=d91bd000)
Stack: f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050
00000000
       80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320
00609250
       00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600
09001000
Call Trace: [<80286341>]  [<80286427>]  [<80219fdc>]  [<8021a051>]
[<8013b78c>]
  [<8013b94b>]  [<80139b70>]  [<8012fd22>]  [<80130020>]  [<8013008c>]
[<80130b
1292ff>]  [<8012919c>]  [<801d8425>]  [<801d34d9>]  [<80137717>]
[<80106f27>]
Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03


>>EIP; 80289912 <lvm_snapshot_remap_block+a6/f8>   <=====

>>ebx; f8f37138 <END_OF_CODE+78b17794/????>
>>ecx; f8b36d98 <END_OF_CODE+787173f4/????>
>>esi; 00619380 Before first symbol
>>edi; f504ec00 <END_OF_CODE+74c2f25c/????>
>>ebp; 00000900 Before first symbol
>>esp; d91bdcf8 <END_OF_CODE+58d9e354/????>

Trace; 80286341 <lvm_map+3b9/490>
Trace; 80286427 <lvm_make_request_fn+f/1c>
Trace; 80219fdc <generic_make_request+11c/12c>
Trace; 8021a051 <submit_bh+65/80>
Trace; 8013b78c <sync_page_buffers+94/ac>
Trace; 8013b94b <try_to_free_buffers+1a7/1d4>
Trace; 80139b70 <try_to_release_page+44/48>
Trace; 8012fd22 <shrink_cache+24a/3c8>
Trace; 80130020 <shrink_caches+78/a8>
Trace; 8013008c <try_to_free_pages_zone+3c/5c>

Code;  80289912 <lvm_snapshot_remap_block+a6/f8>
00000000 <_EIP>:
Code;  80289912 <lvm_snapshot_remap_block+a6/f8>   <=====
   0:   89 42 04                  mov    %eax,0x4(%edx)   <=====
Code;  80289915 <lvm_snapshot_remap_block+a9/f8>
   3:   89 10                     mov    %edx,(%eax)
Code;  80289917 <lvm_snapshot_remap_block+ab/f8>
   5:   c7 01 00 00 00 00         movl   $0x0,(%ecx)
Code;  8028991d <lvm_snapshot_remap_block+b1/f8>
   b:   c7 41 04 00 00 00 00      movl   $0x0,0x4(%ecx)
Code;  80289924 <lvm_snapshot_remap_block+b8/f8>
  12:   8b 03                     mov    (%ebx),%eax

ksymoops has translated the addresses for most of the call trace section
into
the corresponding function names. Note that 10 traces are disassembled, but
16
were listed in the call trace section.  The translated addresses are given
in
"function_name + hexOffset/totalAddressesInFunction" form.  For example:

>>EIP; 80289912 <lvm_snapshot_remap_block+a6/f8>   <=====
Trace; 80286341 <lvm_map+3b9/490>

means that the instruction pointer held address 0x80289912 at the time of
the
oops, which was 0xa6 bytes past the start of the lvm_snapshot_remap_block
function, which is 0xf8 bytes long.  Similarly, one of the addresses in the
call trace was 0x3b9 bytes past the lvm_map function which is 0x490 bytes
long.

ksymoops has also disassembled the first few lines of code starting at EIP.
This helps in finding the corresponding code when tracking down the problem.

kernel hang
-----------
Mapread is a tool I wrote that does two things: translate single addresses
or
translate an entire sysrq-T call trace listing.  Mapread uses the
System.map file that exists in the current directory.  For example:

% mapread 801138fa
801138fa: 80113880 T schedule_timeout (schedule_timeout+0x7a/122d)

The address 801138fa is in the schedule_timeout function, 0x7a or 122 bytes
from the function start.

For sysrq-T listings:

 % mapread sysrq.txt > sysrq.out

sysrq.out will contain:

task: init (pid: 1)
        801138fa: 80113880 T schedule_timeout (schedule_timeout+0x7a/122d)
        80113820: 80113820 t process_timeout (process_timeout+0x0/0d)
        801470ee: 80146f24 T do_select (do_select+0x1ca/458d)
        80147490: 80147150 T sys_select (sys_select+0x340/832d)
        80106f27: 80106ef4 T system_call (system_call+0x33/51d)
task: keventd (pid: 2)
        801235d5: 801234c0 t context_thread (context_thread+0x115/277d)
        80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d)
task: ksoftirqd_CPU (pid: 3)
        8011b62f: 8011b5c0 T do_softirq (do_softirq+0x6f/111d)
        8011bb3c: 8011baac t ksoftirqd (ksoftirqd+0x90/144d)
        80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d)
task: ksoftirqd_CPU (pid: 4)
        8011b62f: 8011b5c0 T do_softirq (do_softirq+0x6f/111d)
        8011bb3c: 8011baac t ksoftirqd (ksoftirqd+0x90/144d)
        80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d)
task: ksoftirqd_CPU (pid: 5)
        8011b62f: 8011b5c0 T do_softirq (do_softirq+0x6f/111d)
        8011bb3c: 8011baac t ksoftirqd (ksoftirqd+0x90/144d)
        80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d)
< lots of lines deleted >
task: smbd (pid: 1511)
        80146e7c: 80146df4 T __pollwait (__pollwait+0x88/136d)
        801138fa: 80113880 T schedule_timeout (schedule_timeout+0x7a/122d)
        80113820: 80113820 t process_timeout (process_timeout+0x0/0d)
        801470ee: 80146f24 T do_select (do_select+0x1ca/458d)
        80147490: 80147150 T sys_select (sys_select+0x340/832d)
        80106f27: 80106ef4 T system_call (system_call+0x33/51d)

The addresses have been translated into function calls based on the
System.map
file, and the offsets from the start of the function have been given in both
hex and decimal.  For example,

        801470ee: 80146f24 T do_select (do_select+0x1ca/458d)

means that, in the call trace for smbd (pid: 1511), do_select is in the call
stack, stopping at address 0x801470ee.  That address is 0x1ca, or 458 bytes
from the beginning of the do_select function.

NOTE: syslogd, when translating oopses and sysrq-T call traces, gives all
offset
results in decimal, not hex!  Look at the first entry for init:

May  6 22:37:15 catdog4 kernel: SysRq : Show State
May  6 22:37:15 catdog4 kernel:
May  6 22:37:15 catdog4 kernel:                          free
     sibling
May  6 22:37:15 catdog4 kernel:   task             PC    stack   pid father
chil
d younger older
May  6 22:37:16 catdog4 kernel: init          S 82825F2C  3780     1      0
309
4               (NOTLB)
May  6 22:37:16 catdog4 kernel: Call Trace: [schedule_timeout+122/156]
[process
_timeout+0/96]  [do_select+458/516]  [sys_select+832/1148]
[system_call+51/56]

[schedule_timeout+122/156]

This says that the address 122 bytes past the start of schedule_timeout,
which
is 156 bytes long.  When digging through the dissassembled code, you'll want
everything in hex.  I don't know how to make syslogd to do stuff in hex, but
it sure would save me a lot of debugging time.  ksymoops gives results in
hex, gdb
gives results in decimal, objdump gives results in hex, mapread gives
results
in both decimal and hex.  Decimal sucks.


Step 3: walk the call stack, verify function chain

Now that you have a translated call trace for a particular process, you'll
need to verify the call trace.  Unfortunately, the call trace isn't always
accurate.  Somehow phantom function calls get inserted in the trace, yet
there
is no code path to those functions.  To follow a call trace, you'll need a
disassembled kernel.  The kernel build process will generate a uncompressed
kernel image called "vmlinux", and it will be in the top level kernel
directory after the compilation process.  To disassemble the vmlinux file:

% objdump -s -d vmlinux > vmlinux.disas

It is usually very helpful to re-compile the kernel with symbols and
generate
a new vmlinux file with line number information.  Change the following in
the
top level kernel Makefile to add symbols:

chroot_env (1.9-BR_NFSPERF)# diff -Nur Makefile Makefile.symbols
--- Makefile    Wed Aug 20 16:16:16 2003
+++ Makefile.symbols    Wed Aug 20 16:16:08 2003
@@ -92,7 +92,7 @@
 CPPFLAGS := -D__KERNEL__ -I$(HPATH)

 CFLAGS := $(CPPFLAGS) -Wall -Wstrict-prototypes -Wno-trigraphs -O2 \
-         -fno-strict-aliasing -fno-common
+         -fno-strict-aliasing -fno-common -g
 ifndef CONFIG_FRAME_POINTER
 CFLAGS += -fomit-frame-pointer
 endif

The objdump for disassembling symboled vmlinux files is as follows:

% objdump -s -l --source vmlinux > vmlinux.disas

Note that there will be address offsets between these two kernels due to the
debugging information.  For example, in a non-symboled vmlinux file,
kupdate will be at address 8013bd68.  In the symboled vmlinux file, it will
be
at address 8013bd58.  Here's the chunk of the vmlinux file where kupdate
starts, both without and with symbols:

no symbols:

8013bd68 <kupdate>:
8013bd68:       57                      push   %edi
8013bd69:       56                      push   %esi
8013bd6a:       53                      push   %ebx
8013bd6b:       bb 00 e0 ff ff          mov    $0xffffe000,%ebx
8013bd70:       21 e3                   and    %esp,%ebx
8013bd72:       c7 83 88 00 00 00 01    movl   $0x1,0x88(%ebx)
8013bd79:       00 00 00
8013bd7c:       c7 83 80 00 00 00 01    movl   $0x1,0x80(%ebx)
8013bd83:       00 00 00
8013bd86:       8d bb 3a 03 00 00       lea    0x33a(%ebx),%edi
8013bd8c:       be ce 94 2e 80          mov    $0x802e94ce,%esi
8013bd91:       ac                      lods   %ds:(%esi),%al

symbols:

8013bd58 <kupdate>:
kupdate():
8013bd58:       57                      push   %edi
8013bd59:       56                      push   %esi
8013bd5a:       53                      push   %ebx
/src/kernel/linux/include/asm/current.h:9

static inline struct task_struct * get_current(void)
{
        struct task_struct *current;
        __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
8013bd5b:       bb 00 e0 ff ff          mov    $0xffffe000,%ebx
8013bd60:       21 e3                   and    %esp,%ebx
/src/kernel/linux/fs/buffer.c:3112

/*
 * This is the kernel update daemon. It was used to live in userspace
 * but since it's need to run safely we want it unkillable by mistake.
 * You don't need to change your userspace configuration since
 * the userspace `update` will do_exit(0) at the first sys_bdflush().
 */
int kupdate(void *startup)
{
        struct task_struct * tsk = current;
        int interval;

        tsk->session = 1;
8013bd62:       c7 83 88 00 00 00 01    movl   $0x1,0x88(%ebx)
8013bd69:       00 00 00
/src/kernel/linux/fs/buffer.c:3113
        tsk->pgrp = 1;
8013bd6c:       c7 83 80 00 00 00 01    movl   $0x1,0x80(%ebx)
8013bd73:       00 00 00
/src/kernel/linux/include/asm/string.h:34
 */

#define __HAVE_ARCH_STRCPY
static inline char * strcpy(char * dest,const char *src)
{
8013bd76:       8d bb 3a 03 00 00       lea    0x33a(%ebx),%edi
8013bd7c:       be 4e 96 2e 80          mov    $0x802e964e,%esi
/src/kernel/linux/include/asm/string.h:36
int d0, d1, d2;
__asm__ __volatile__(
8013bd81:       ac                      lods   %ds:(%esi),%al

Here's an example of a call trace taken from mapread.  This call trace isn't
accurate, and we'll see why:

task: kupdated (pid: 9)
        40105be4: 40105b78 T __down
        40105d80: 40105d78 T __down_failed
        401d6a9f: 401d6a90 t .text.lock.page_buf_locking
        401c6599: 401c6570 T xfs_getsb
        401ca9f7: 401ca9b8 T xfs_trans_getsb
        401c95bd: 401c95a4 t xfs_trans_apply_sb_deltas
        401c9c2a: 401c9b08 T xfs_trans_commit
        401db19e: 401dac08 T xfs_strategy
        401db1cd: 401dac08 T xfs_strategy
        401db1ee: 401dac08 T xfs_strategy
        401d5ac3: 401d5980 t _page_buf_page_apply
        401db445: 401db3a4 T xfs_bmap
        401d7078: 401d7038 t map_blocks
        401d76da: 401d7550 t delalloc_convert
        401d7cc8: 401d7c6c t linvfs_writepage
        40138b2d: 40138a90 t write_some_buffers
        401d2250: 401d20a0 t xfs_inode_flush
        401ccf11: 401ccefc t xfs_sync
        401df6b3: 401df6a8 t linvfs_statfs
        4013cc9d: 4013cba8 T sync_supers
        4013bf07: 4013beec t sync_old_buffers
        4013c1cf: 4013c0c8 T kupdate
        40105694: 4010566c T kernel_thread

The call trace starts at the bottom and moves upward.  From the listing
above,
we expect the following call chain:

kernel_thread
 kupdate
  sync_old_buffers
   sync_supers
    linvfs_statfs
     xfs_sync
      xfs_inode_flush
       write_some_buffers
	...etc...

Here's what really happens, according to the disassembled vmlinux file.
First, I search for the first address, 40105694.  The line immediately
preceding the address will be the call to the next function.

40105692:       ff d2                   call   *%edx
40105694:       b8 01 00 00 00          mov    $0x1,%eax

Figuring out what happens through function pointers (as in address 40105692)
requires a visit to the source code.  Let's assume that we're actually going
into kupdate at this point.  Code like XFS that is function pointer crazy
requires more debugging.  Trying the next address:

4013c1ca:       e8 1d fd ff ff          call   4013beec <sync_old_buffers>
4013c1cf:       81 3d 00 28 3f 40 00    cmpl   $0x403f2800,0x403f2800

So far so good.  Next address:

4013bf02:       e8 89 cb ff ff          call   40138a90 <write_some_buffers>
4013bf07:       83 c4 04                add    $0x4,%esp

Whoa!  We actually don't go to sync_supers as the call trace says, but we go
to
write_some_buffers instead.  I've run into this case more often than not, so
it always pays to verify the call trace.  I don't know why the call trace is
screwed up in linux.

Once you have a verified call trace, you can start figuring out what
happened
during the crash.  Let's take the example oops I've shown above, where the
spew process caused a page fault:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
80289912
*pde = 591be001
Oops: 0002
deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd
CPU:    1
EIP:    0010:[<80289912>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000000   ebx: f8f37138   ecx: f8b36d98   edx: 00000000
esi: 00619380   edi: f504ec00   ebp: 00000900   esp: d91bdcf8
ds: 0018   es: 0018   ss: 0018
Process spew (pid: 1230, stackpage=d91bd000)
Stack: f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050
00000000
       80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320
00609250
       00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600
09001000
Call Trace: [<80286341>]  [<80286427>]  [<80219fdc>]  [<8021a051>]
[<8013b78c>]
  [<8013b94b>]  [<80139b70>]  [<8012fd22>]  [<80130020>]  [<8013008c>]
[<80130b
1292ff>]  [<8012919c>]  [<801d8425>]  [<801d34d9>]  [<80137717>]
[<80106f27>]
Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03


>>EIP; 80289912 <lvm_snapshot_remap_block+a6/f8>   <=====

>>ebx; f8f37138 <END_OF_CODE+78b17794/????>
>>ecx; f8b36d98 <END_OF_CODE+787173f4/????>
>>esi; 00619380 Before first symbol
>>edi; f504ec00 <END_OF_CODE+74c2f25c/????>
>>ebp; 00000900 Before first symbol
>>esp; d91bdcf8 <END_OF_CODE+58d9e354/????>

Trace; 80286341 <lvm_map+3b9/490>
Trace; 80286427 <lvm_make_request_fn+f/1c>
Trace; 80219fdc <generic_make_request+11c/12c>
Trace; 8021a051 <submit_bh+65/80>
Trace; 8013b78c <sync_page_buffers+94/ac>
Trace; 8013b94b <try_to_free_buffers+1a7/1d4>
Trace; 80139b70 <try_to_release_page+44/48>
Trace; 8012fd22 <shrink_cache+24a/3c8>
Trace; 80130020 <shrink_caches+78/a8>
Trace; 8013008c <try_to_free_pages_zone+3c/5c>

Code;  80289912 <lvm_snapshot_remap_block+a6/f8>
00000000 <_EIP>:
Code;  80289912 <lvm_snapshot_remap_block+a6/f8>   <=====
   0:   89 42 04                  mov    %eax,0x4(%edx)   <=====
Code;  80289915 <lvm_snapshot_remap_block+a9/f8>
   3:   89 10                     mov    %edx,(%eax)
Code;  80289917 <lvm_snapshot_remap_block+ab/f8>
   5:   c7 01 00 00 00 00         movl   $0x0,(%ecx)
Code;  8028991d <lvm_snapshot_remap_block+b1/f8>
   b:   c7 41 04 00 00 00 00      movl   $0x0,0x4(%ecx)
Code;  80289924 <lvm_snapshot_remap_block+b8/f8>
  12:   8b 03                     mov    (%ebx),%eax

The instruction at address 80289912 generated a page fault.  This code is
moving the value in eax into the memory addresss generated by adding 0x4
to the contents of edx.  The oops says edx contains 0, and the page fault
happens when we try to access memory address 0x00000004.  Some pointer is
NULL
and it shouldn't be.  Looking at the dissambled vmlinux file containing
symbols, I searched for the beginning of lvm_snapshot_remap_block, added
0xa6
to it, and found the following code:

/src/kernel/linux/drivers/md/lvm-snap.c:131
80289a7b:       39 71 08                cmp    %esi,0x8(%ecx)
80289a7e:       75 f1                   jne    80289a71
<lvm_snapshot_remap_block+0x85>
80289a80:       66 39 69 0c             cmp    %bp,0xc(%ecx)
80289a84:       75 eb                   jne    80289a71
<lvm_snapshot_remap_block+0x85>
/src/kernel/linux/drivers/md/lvm-snap.c:133
80289a86:       83 7c 24 14 00          cmpl   $0x0,0x14(%esp,1)
80289a8b:       74 23                   je     80289ab0
<lvm_snapshot_remap_block+0xc4>
/src/kernel/linux/include/linux/list.h:81
80289a8d:       8b 41 04                mov    0x4(%ecx),%eax
80289a90:       8b 11                   mov    (%ecx),%edx
/src/kernel/linux/include/linux/list.h:82
80289a92:       89 42 04                mov    %eax,0x4(%edx)
/src/kernel/linux/include/linux/list.h:83
80289a95:       89 10                   mov    %edx,(%eax)  <=== oops occurs
here
/src/kernel/linux/include/linux/list.h:94
80289a97:       c7 01 00 00 00 00       movl   $0x0,(%ecx)
/src/kernel/linux/include/linux/list.h:95
80289a9d:       c7 41 04 00 00 00 00    movl   $0x0,0x4(%ecx)
/src/kernel/linux/include/linux/list.h:40
80289aa4:       8b 03                   mov    (%ebx),%eax
/src/kernel/linux/include/linux/list.h:41
80289aa6:       89 48 04                mov    %ecx,0x4(%eax)

Because I'm looking at a vmlinux file compiled with symbols, but a non
symboled kernel was the one that crashed, my addresses will be slightly
different.  Thankfully ksymoops decodes some instructions to help us
synchronize the addresses.  The oops address 80289912 with code "move
%eax,0x4(%edx)" becomes address 80289a95 in the symboled vmlinux file.  So,
something at line 83 in list.h is crashing.  Here's the relevant code from
list.h and lvm-snap.c

static inline lv_block_exception_t *lvm_find_exception_table(kdev_t
                                                             org_dev,
                                                             unsigned long
                                                             org_start,
                                                             lv_t * lv)
{
117         struct list_head *hash_table = lv->lv_snapshot_hash_table,
*next;
< code removed >
126         for (next = hash_table->next; next != hash_table;
127              next = next->next) {
128                 lv_block_exception_t *exception;
129
130                 exception = list_entry(next, lv_block_exception_t, hash)

131                 if (exception->rsector_org == org_start &&
132                     exception->rdev_org == org_dev) {
133                         if (i) {
134                                 /* fun, isn't it? :) */
135                                 list_del(next);  <=== oops here
136                                 list_add(next, hash_table);
137                         }
138                         ret = exception;
139                         break;
140                 }

include/linux/list.h
18 struct list_head {
19         struct list_head *next, *prev;
20 };

80 static inline void __list_del(struct list_head *prev, struct list_head
*next)
81 {
82         next->prev = prev;
83         prev->next = next;
84 }

91 static inline void list_del(struct list_head *entry)
92 {
93         __list_del(entry->prev, entry->next);
94         entry->next = (void *) 0;
95         entry->prev = (void *) 0;
96 }

The last line in lvm-snap.c listed in the vmlinux file is #133.  This code
is
comparing some memory address to zero and jumping somewhere if true.  That's
the "if (i) {" line.  The next code is the "list_del(next)" code.  Some code
inspection reveals that next is put into ecx before the list_del call is
made.

list_del(next) translates to:

__list_del(next->prev, next->next)

which becomes:

   (next->next)->prev = (next->prev);
   (next->prev)->next = (next->next);

Here's the annotated assembly for __list_del:

/src/kernel/linux/include/linux/list.h:81
80289a8d:       8b 41 04                mov    0x4(%ecx),%eax
 - move next->prev into eax
80289a90:       8b 11                   mov    (%ecx),%edx
- move next->next into edx
/src/kernel/linux/include/linux/list.h:82
80289a92:       89 42 04                mov    %eax,0x4(%edx)
- move next->prev into next->next->prev
/src/kernel/linux/include/linux/list.h:83
80289a95:       89 10                   mov    %edx,(%eax)  <=== oops occurs
here
- move next->next into next->prev->next (or just next->prev, which is the
same
address as next->prev->next).

Note that eax and edx are equal to 0x00000000 in the oops.  This means that
next->prev == 0x00000000 and next->next = 0x00000000.  This is an invalid
state for the linux list.h code, which is a doubly linked list where the
prev
and next pointers are always expected to be valid.  Only list nodes that
have
been deleted have next and prev pointers of zero, as the last two lines of
list_del() do.  Somehow we're trying to delete a list node that looks like
it's already been deleted, and we page fault while doing it.




Appendix A: mapread perl script
#!/usr/bin/perl

open(MAPFILE, "System.map") || die "Couldn't open System.map: $!\n";
@Map = <MAPFILE>;
close(MAPFILE);

if (-r $ARGV[0])
{
   # we're reading a magic sysrq trace
   &mapread_sysrq($ARGV[0]);
} else {
   # we're just looking up one address
   $addr = $ARGV[0];
   $result = &mapread($addr);
   print "$result\n";
}

sub mapread
{
   local($addr) = shift(@_);
   local($lastAddr);
   local(@fields);
   local($i);
   local($offset);
   local($start);
   local($function);
   local($str);

   $lastAddr = "not in file";
   for ($i = 0; $i < $#Map; $i++)
   {
      @fields = split(' ', $Map[$i]);
      last if ($addr lt $fields[0]);
      $lastAddr = $Map[$i];
   }

   # try to figure out the offset into the function
   # lines in mapfile are startAddr, something, functionName
   @fields = split(' ', $lastAddr);
   $start = $fields[0];
   $function = $fields[2];
   $offset = hex($addr) - hex($start);
   chop($lastAddr);
   $str = sprintf("%s: %s (%s+0x%x/%dd)", $addr, $lastAddr, $function,
$offset, $offset);

   return $str;
}

sub mapread_sysrq
{
   local($sysrq_file) = shift(@_);
   local(@sysrq);
   local($state);
   local($i);
   local($j);
   local(@fields);
   local($task);
   local(@traces);
   local($pid);

   # write sysrq file to another file, just in case there are line
   # feed issues
   open(SYSRQ_FILE, $sysrq_file) || die "Couldn't open $sysrq_file: $!\n";
   @sysrq = <SYSRQ_FILE>;
   close(SYSRQ_FILE);
   open(SYSRQ_FILE_TMP, ">" . $sysrq_file . ".tmp") || die "Couldn't open
$sysrq_file.tmp: $!\n";
   for ($i = 0; $i <= $#sysrq; $i++)
   {
      $sysrq[$i] =~ s/^M//;   # enter ctrl-V ctrl-M to get a true ^M in the
code
      print SYSRQ_FILE_TMP "$sysrq[$i]";
   }
   close(SYSRQ_FILE_TMP);
   open(SYSRQ_FILE, $sysrq_file . ".tmp") || die "Couldn't open
$sysrq_file.tmp: $!\n";
   @sysrq = <SYSRQ_FILE>;
   close(SYSRQ_FILE);
   system("rm $sysrq_file.tmp");

   # state: 0 = start of file
   #        1 = getting task name
   #        2 = getting call traces

   $state = 0;
   for ($i = 0; $i <= $#sysrq; $i++)
   {
      chop($sysrq[$i]);
      @fields = split(' ', $sysrq[$i]);

      if ($state == 0)
      {
         # look for line with task
         if ($fields[0] eq "task")
         {
            $state = 1;
         }
      }
      elsif ($state == 1)
      {
         # first field is task
         $task = $fields[0];
         $pid = $fields[4];
         @traces = ();
         $state = 2;
      }
      elsif ($state == 2)
      {
         # if first field is "Call", skip to third field
         $j = 0;
         if (($fields[0] eq "Call") && ($fields[1] eq "Trace:"))
         {
           $j = 2;
         }

         # clean up the traces and put them in the array
         for (; $j <= $#fields; $j++)
         {
            $fields[$j] =~ s/\[//;
            $fields[$j] =~ s/<//;
            $fields[$j] =~ s/>//;
            $fields[$j] =~ s/\]//;
            push(@traces, &mapread($fields[$j]));
         }

         # if next line doesn't start with a [<, it's a task
         if ($i < $#sysrq)
         {
            @fields = split(' ', $sysrq[$i+1]);
            if (substr($fields[0], 0, 2) ne "[<")
            {
               # dump our trace list
               print "task: $task (pid: $pid)\n";
               for ($j = 0; $j <= $#traces; $j++)
               {
                  print "\t$traces[$j]\n";
               }
               $state = 1;
            }
         }
      }
   }
}