출처 : http://mail.nl.linux.org/kernelnewbies/2003-08/msg00347.html
- To: "'kernelnewbies@nl.linux.org'" <kernelnewbies@nl.linux.org>
- Subject: paper on debugging kernel oops or hang
- From: "HABBINGA,ERIK (HP-Loveland,ex1)" <erik.habbinga@hp.com>
- Date: Mon, 25 Aug 2003 14:30:10 -0700
- List-archive: <http://mail.nl.linux.org/kernelnewbies/>
- List-help: <mailto:listar@nl.linux.org?Subject=help>
- List-owner: <mailto:riel@nl.linux.org>
- List-post: <mailto:kernelnewbies@nl.linux.org>
- List-software: Listar version 1.0.0
- List-subscribe: <mailto:kernelnewbies-request@nl.linux.org?Subject=subscribe>
- List-unsubscribe: <mailto:kernelnewbies-request@nl.linux.org?Subject=unsubscribe>
- Original-Recipient: rfc822;kernelnewbies-archive@nl.linux.org
- Sender: kernelnewbies-bounce@nl.linux.org
Hello all, I recently wrote this paper for some co-workers of mine, and thought it might be of interest to the kernelnewbies crowd. I didn't see anything like it on the kernelnewbies.org home page. Forgive me if this information is common knowledge. The paper talks about using ksymoops and objdump to debug kernel oops and hang problems. I'm not planning on updating the document again, but feel free to use it as you see fit (i.e. I don't want to maintain this document, but someone else sure can). This paper deals with the 2.4.x series of kernels. Debugging might be different on 2.6.x, I haven't spent much time with that series of kernels. Enjoy, Erik Habbinga Debugging a kernel oops or hang Erik Habbinga August 20, 2003 Step 1: capture the kernel state Kernel oops ----------- If the kernel has oopsed or BUG'd, there will be crash information left on the console. If you're debugging over a serial port, cut and paste the oops data into a file. If the system was stable enough during the crash, the oops data will be in /var/log/messages upon reboot. You can even use a digital camera to take a picture of the console and transcribe the oops data later. In any case, you should end up with data that looks like this: Unable to handle kernel NULL pointer dereference at virtual address 00000004 80289912 *pde = 591be001 Oops: 0002 deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd CPU: 1 EIP: 0010:[<80289912>] Not tainted EFLAGS: 00010202 eax: 00000000 ebx: f8f37138 ecx: f8b36d98 edx: 00000000 esi: 00619380 edi: f504ec00 ebp: 00000900 esp: d91bdcf8 ds: 0018 es: 0018 ss: 0018 Process spew (pid: 1230, stackpage=d91bd000) Stack: f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050 00000000 80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320 00609250 00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600 09001000 Call Trace: [<80286341>] [<80286427>] [<80219fdc>] [<8021a051>] [<8013b78c>] [<8013b94b>] [<80139b70>] [<8012fd22>] [<80130020>] [<8013008c>] [<80130b 1292ff>] [<8012919c>] [<801d8425>] [<801d34d9>] [<80137717>] [<80106f27>] Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03 This data contains a register and stack dump for the process that caused the oops, along with a explanation of what caused the oops. In this case, the spew process caused a kernel NULL pointer dereference at virtual address 0x00000004. All addresses below PAGE_SIZE are trapped by the CPU's MMU as probable NULL pointer dereferences, and this is what happened here. Sometimes, if the oops makes it into /var/log/messages, the syslogd daemon might disassemble the oops automatically. This is usually handy, unless you don't have the correct System.map file on the system. The wrong System.map file will give wrong stack trace results, requiring you to generate the system addresses by hand into something resembling above by referring to that incorrect System.map file. Hopefully this won't happen to you. Here's a quick description of a kernel BUG(). Sprinkled throughout the kernel code are calls to the BUG() macro. This code is only executed at times where the kernel knows something is horribly wrong and the best course of action is to stop and report status. On i386, the BUG() macro decodes to the "ud2a" instruction (0x0f 0x0b). This is an invalid instruction on i386, and the CPU will throw an "invalid opcode" exception upon execution. The kernel traps the invalid opcode exception, prints "kernel BUG at __FILE__:__LINE!", and then dumps the CPU state just like any other fault/trap/exception handler that halts the machine. So, if you see code like the following on i386, you know you're looking at the BUG() macro: Code; 00000000 Before first symbol 0: 0f 0b ud2a kernel hang ----------- If the kernel has hung, you'll have to use the magic sysrq functionality to get the stack trace for all currently running processes. Compile the kernel with the CONFIG_MAGIC_SYSRQ option enabled, found in the "kernel hacking" section of your favorite "make *config" display. From the console, doing the key sequence "Alt" -> "PrintScreen/SysRq" -> "t" will start the stack trace procedure. Hold down the keys as you type them on the console's keyboard. If you're running from a serial port, enable the "serial console" option under "character devices" in "make *config" and put the appropriate options on the kernel boot parameter line for your boot loader. For example, the following would be added to the "append" line in lilo.conf: console=ttyS0,9600 console=tty0 This tells the kernel to send all console messages to serial port 0 at 9600 baud as well as the VGA console. In the latest 2.4.x kernels, a file shows up in /proc when CONFIG_MAGIC_SYSRQ is configured. Doing: echo t > /proc/sysrq-trigger will start the stack trace process. Of course, if the system is hung to the point you can't type, this won't be of help. The serial port + console will be your best option for getting the stack trace information. The syslogd daemon will send the sysrq information to the /var/log/messages file, disassembling it with the currently installed System.map file. Once again, if the wrong System.map file is on the system, you're in for some serious pain. The console might also show disassembled stack traces. Here's the start of output from a raw SysRq-T stack trace: SysRq : Show State ^M ^M free sibling ^M task PC stack pid father child younger older ^Minit R 82831F2C 3780 1 0 1213 (NOTLB) ^MCall Trace: [<801138fa>] [<80113820>] [<801470ee>] [<80147490>] [<80106f27 >] ^Mkeventd R 00010000 6068 2 1 3 (L-TLB) ^MCall Trace: [<801235d5>] [<80105694>] ^Mksoftirqd_CPU R F7BFE000 5184 3 1 4 2 (L-TLB) ^MCall Trace: [<8011b62f>] [<8011bb3c>] [<80105694>] ^Mksoftirqd_CPU S F7BFC000 5924 4 1 5 3 (L-TLB) ^MCall Trace: [<8011b62f>] [<8011bb3c>] [<80105694>] ^Mksoftirqd_CPU R F7BFA000 6000 5 1 6 4 (L-TLB) ^MCall Trace: [<8011b62f>] [<8011bb3c>] [<80105694>] ^Mksoftirqd_CPU S F7BF8000 5404 6 1 7 5 (L-TLB) ^MCall Trace: [<8011b62f>] [<8011bb3c>] [<80105694>] ^Mkswapd S F7BD8000 6356 7 1 8 6 (L-TLB) ^MCall Trace: [<80119cae>] [<8012ffa6>] [<80105694>] ^Mbdflush R 00000286 6308 8 1 9 7 (L-TLB) ^MCall Trace: [<8011429e>] [<8013bac3>] [<80105694>] < lots of lines deleted > ^Msmbd R F435BF2C 5540 1513 612 1511 (NOTLB) ^MCall Trace: [<80146e7c>] [<801138fa>] [<80113820>] [<801470ee>] [<80147490 >] [<80106f27>] The call trace for every program running on the system is listed. Note the embedded "^M"'s. This trace was captured off of a windows serial port terminal program. I've written a perl script called "mapread" (see the end of this document) that will take a file containing the SysRq-T listing, strip out the "^M"'s, and translate the addresses using the System.map file located in the current directory. See the next section for details on mapread. If you look at /var/log/messages after a SysRq-T call trace procedure, you'll see something like this in /var/log/messages. In this case, syslogd managed to do the addresss translation: May 6 22:37:15 catdog4 kernel: SysRq : Show State May 6 22:37:15 catdog4 kernel: May 6 22:37:15 catdog4 kernel: free sibling May 6 22:37:15 catdog4 kernel: task PC stack pid father chil d younger older May 6 22:37:16 catdog4 kernel: init S 82825F2C 3780 1 0 309 4 (NOTLB) May 6 22:37:16 catdog4 kernel: Call Trace: [schedule_timeout+122/156] [process _timeout+0/96] [do_select+458/516] [sys_select+832/1148] [system_call+51/56] May 6 22:37:16 catdog4 kernel: keventd S 00010000 4992 2 1 3 (L-TLB) May 6 22:37:16 catdog4 kernel: Call Trace: [context_thread+277/464] [arch_kern el_thread+40/56] May 6 22:37:16 catdog4 kernel: ksoftirqd_CPU S 82872000 5064 3 1 4 2 (L-TLB) May 6 22:37:16 catdog4 kernel: Call Trace: [ksoftirqd+144/196] [arch_kernel_th read+40/56] I'd always run /var/log/messages through a script like the following to strip out the date, time, server name, and kernel stuff: #!/usr/bin/perl @lines=<>; for (@lines) { print substr($_, 32); # highly dependent on the length of your system name } Of course, there are other ways to do this. In any case, if syslogd didn't do the address translation, I'd take /var/log/messages, grab the "SysRq: Show State" lines, run them through the script above, save to a file, and run them through mapread (again, see the next section for details on mapread). Step 2: disassemble the kernel state, translate the addresses Kernel oops ----------- For kernel oops data, the ksymoops tool will dissassemble the code into something understandable. You'll need the System.map generated by the kernel compilation process. System.map will be in the top level of the kernel directory after a build. To run ksymoops, do the following: % ksymoops -K -L -O -m System.map < oops.txt > oops.out where the oops data above is in a file oops.txt and the corresponding dissassembled oops will be written to the file oops.out. Here's what you'll get: ksymoops 2.4.5 on i686 2.4.18-14. Options used -V (default) -K (specified) -L (specified) -O (specified) -m System.map-andrew (specified) Unable to handle kernel NULL pointer dereference at virtual address 00000004 80289912 *pde = 591be001 Oops: 0002 deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd CPU: 1 EIP: 0010:[<80289912>] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010202 eax: 00000000 ebx: f8f37138 ecx: f8b36d98 edx: 00000000 esi: 00619380 edi: f504ec00 ebp: 00000900 esp: d91bdcf8 ds: 0018 es: 0018 ss: 0018 Process spew (pid: 1230, stackpage=d91bd000) Stack: f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050 00000000 80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320 00609250 00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600 09001000 Call Trace: [<80286341>] [<80286427>] [<80219fdc>] [<8021a051>] [<8013b78c>] [<8013b94b>] [<80139b70>] [<8012fd22>] [<80130020>] [<8013008c>] [<80130b 1292ff>] [<8012919c>] [<801d8425>] [<801d34d9>] [<80137717>] [<80106f27>] Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03 >>EIP; 80289912 <lvm_snapshot_remap_block+a6/f8> <===== >>ebx; f8f37138 <END_OF_CODE+78b17794/????> >>ecx; f8b36d98 <END_OF_CODE+787173f4/????> >>esi; 00619380 Before first symbol >>edi; f504ec00 <END_OF_CODE+74c2f25c/????> >>ebp; 00000900 Before first symbol >>esp; d91bdcf8 <END_OF_CODE+58d9e354/????> Trace; 80286341 <lvm_map+3b9/490> Trace; 80286427 <lvm_make_request_fn+f/1c> Trace; 80219fdc <generic_make_request+11c/12c> Trace; 8021a051 <submit_bh+65/80> Trace; 8013b78c <sync_page_buffers+94/ac> Trace; 8013b94b <try_to_free_buffers+1a7/1d4> Trace; 80139b70 <try_to_release_page+44/48> Trace; 8012fd22 <shrink_cache+24a/3c8> Trace; 80130020 <shrink_caches+78/a8> Trace; 8013008c <try_to_free_pages_zone+3c/5c> Code; 80289912 <lvm_snapshot_remap_block+a6/f8> 00000000 <_EIP>: Code; 80289912 <lvm_snapshot_remap_block+a6/f8> <===== 0: 89 42 04 mov %eax,0x4(%edx) <===== Code; 80289915 <lvm_snapshot_remap_block+a9/f8> 3: 89 10 mov %edx,(%eax) Code; 80289917 <lvm_snapshot_remap_block+ab/f8> 5: c7 01 00 00 00 00 movl $0x0,(%ecx) Code; 8028991d <lvm_snapshot_remap_block+b1/f8> b: c7 41 04 00 00 00 00 movl $0x0,0x4(%ecx) Code; 80289924 <lvm_snapshot_remap_block+b8/f8> 12: 8b 03 mov (%ebx),%eax ksymoops has translated the addresses for most of the call trace section into the corresponding function names. Note that 10 traces are disassembled, but 16 were listed in the call trace section. The translated addresses are given in "function_name + hexOffset/totalAddressesInFunction" form. For example: >>EIP; 80289912 <lvm_snapshot_remap_block+a6/f8> <===== Trace; 80286341 <lvm_map+3b9/490> means that the instruction pointer held address 0x80289912 at the time of the oops, which was 0xa6 bytes past the start of the lvm_snapshot_remap_block function, which is 0xf8 bytes long. Similarly, one of the addresses in the call trace was 0x3b9 bytes past the lvm_map function which is 0x490 bytes long. ksymoops has also disassembled the first few lines of code starting at EIP. This helps in finding the corresponding code when tracking down the problem. kernel hang ----------- Mapread is a tool I wrote that does two things: translate single addresses or translate an entire sysrq-T call trace listing. Mapread uses the System.map file that exists in the current directory. For example: % mapread 801138fa 801138fa: 80113880 T schedule_timeout (schedule_timeout+0x7a/122d) The address 801138fa is in the schedule_timeout function, 0x7a or 122 bytes from the function start. For sysrq-T listings: % mapread sysrq.txt > sysrq.out sysrq.out will contain: task: init (pid: 1) 801138fa: 80113880 T schedule_timeout (schedule_timeout+0x7a/122d) 80113820: 80113820 t process_timeout (process_timeout+0x0/0d) 801470ee: 80146f24 T do_select (do_select+0x1ca/458d) 80147490: 80147150 T sys_select (sys_select+0x340/832d) 80106f27: 80106ef4 T system_call (system_call+0x33/51d) task: keventd (pid: 2) 801235d5: 801234c0 t context_thread (context_thread+0x115/277d) 80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d) task: ksoftirqd_CPU (pid: 3) 8011b62f: 8011b5c0 T do_softirq (do_softirq+0x6f/111d) 8011bb3c: 8011baac t ksoftirqd (ksoftirqd+0x90/144d) 80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d) task: ksoftirqd_CPU (pid: 4) 8011b62f: 8011b5c0 T do_softirq (do_softirq+0x6f/111d) 8011bb3c: 8011baac t ksoftirqd (ksoftirqd+0x90/144d) 80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d) task: ksoftirqd_CPU (pid: 5) 8011b62f: 8011b5c0 T do_softirq (do_softirq+0x6f/111d) 8011bb3c: 8011baac t ksoftirqd (ksoftirqd+0x90/144d) 80105694: 8010566c T kernel_thread (kernel_thread+0x28/40d) < lots of lines deleted > task: smbd (pid: 1511) 80146e7c: 80146df4 T __pollwait (__pollwait+0x88/136d) 801138fa: 80113880 T schedule_timeout (schedule_timeout+0x7a/122d) 80113820: 80113820 t process_timeout (process_timeout+0x0/0d) 801470ee: 80146f24 T do_select (do_select+0x1ca/458d) 80147490: 80147150 T sys_select (sys_select+0x340/832d) 80106f27: 80106ef4 T system_call (system_call+0x33/51d) The addresses have been translated into function calls based on the System.map file, and the offsets from the start of the function have been given in both hex and decimal. For example, 801470ee: 80146f24 T do_select (do_select+0x1ca/458d) means that, in the call trace for smbd (pid: 1511), do_select is in the call stack, stopping at address 0x801470ee. That address is 0x1ca, or 458 bytes from the beginning of the do_select function. NOTE: syslogd, when translating oopses and sysrq-T call traces, gives all offset results in decimal, not hex! Look at the first entry for init: May 6 22:37:15 catdog4 kernel: SysRq : Show State May 6 22:37:15 catdog4 kernel: May 6 22:37:15 catdog4 kernel: free sibling May 6 22:37:15 catdog4 kernel: task PC stack pid father chil d younger older May 6 22:37:16 catdog4 kernel: init S 82825F2C 3780 1 0 309 4 (NOTLB) May 6 22:37:16 catdog4 kernel: Call Trace: [schedule_timeout+122/156] [process _timeout+0/96] [do_select+458/516] [sys_select+832/1148] [system_call+51/56] [schedule_timeout+122/156] This says that the address 122 bytes past the start of schedule_timeout, which is 156 bytes long. When digging through the dissassembled code, you'll want everything in hex. I don't know how to make syslogd to do stuff in hex, but it sure would save me a lot of debugging time. ksymoops gives results in hex, gdb gives results in decimal, objdump gives results in hex, mapread gives results in both decimal and hex. Decimal sucks. Step 3: walk the call stack, verify function chain Now that you have a translated call trace for a particular process, you'll need to verify the call trace. Unfortunately, the call trace isn't always accurate. Somehow phantom function calls get inserted in the trace, yet there is no code path to those functions. To follow a call trace, you'll need a disassembled kernel. The kernel build process will generate a uncompressed kernel image called "vmlinux", and it will be in the top level kernel directory after the compilation process. To disassemble the vmlinux file: % objdump -s -d vmlinux > vmlinux.disas It is usually very helpful to re-compile the kernel with symbols and generate a new vmlinux file with line number information. Change the following in the top level kernel Makefile to add symbols: chroot_env (1.9-BR_NFSPERF)# diff -Nur Makefile Makefile.symbols --- Makefile Wed Aug 20 16:16:16 2003 +++ Makefile.symbols Wed Aug 20 16:16:08 2003 @@ -92,7 +92,7 @@ CPPFLAGS := -D__KERNEL__ -I$(HPATH) CFLAGS := $(CPPFLAGS) -Wall -Wstrict-prototypes -Wno-trigraphs -O2 \ - -fno-strict-aliasing -fno-common + -fno-strict-aliasing -fno-common -g ifndef CONFIG_FRAME_POINTER CFLAGS += -fomit-frame-pointer endif The objdump for disassembling symboled vmlinux files is as follows: % objdump -s -l --source vmlinux > vmlinux.disas Note that there will be address offsets between these two kernels due to the debugging information. For example, in a non-symboled vmlinux file, kupdate will be at address 8013bd68. In the symboled vmlinux file, it will be at address 8013bd58. Here's the chunk of the vmlinux file where kupdate starts, both without and with symbols: no symbols: 8013bd68 <kupdate>: 8013bd68: 57 push %edi 8013bd69: 56 push %esi 8013bd6a: 53 push %ebx 8013bd6b: bb 00 e0 ff ff mov $0xffffe000,%ebx 8013bd70: 21 e3 and %esp,%ebx 8013bd72: c7 83 88 00 00 00 01 movl $0x1,0x88(%ebx) 8013bd79: 00 00 00 8013bd7c: c7 83 80 00 00 00 01 movl $0x1,0x80(%ebx) 8013bd83: 00 00 00 8013bd86: 8d bb 3a 03 00 00 lea 0x33a(%ebx),%edi 8013bd8c: be ce 94 2e 80 mov $0x802e94ce,%esi 8013bd91: ac lods %ds:(%esi),%al symbols: 8013bd58 <kupdate>: kupdate(): 8013bd58: 57 push %edi 8013bd59: 56 push %esi 8013bd5a: 53 push %ebx /src/kernel/linux/include/asm/current.h:9 static inline struct task_struct * get_current(void) { struct task_struct *current; __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL)); 8013bd5b: bb 00 e0 ff ff mov $0xffffe000,%ebx 8013bd60: 21 e3 and %esp,%ebx /src/kernel/linux/fs/buffer.c:3112 /* * This is the kernel update daemon. It was used to live in userspace * but since it's need to run safely we want it unkillable by mistake. * You don't need to change your userspace configuration since * the userspace `update` will do_exit(0) at the first sys_bdflush(). */ int kupdate(void *startup) { struct task_struct * tsk = current; int interval; tsk->session = 1; 8013bd62: c7 83 88 00 00 00 01 movl $0x1,0x88(%ebx) 8013bd69: 00 00 00 /src/kernel/linux/fs/buffer.c:3113 tsk->pgrp = 1; 8013bd6c: c7 83 80 00 00 00 01 movl $0x1,0x80(%ebx) 8013bd73: 00 00 00 /src/kernel/linux/include/asm/string.h:34 */ #define __HAVE_ARCH_STRCPY static inline char * strcpy(char * dest,const char *src) { 8013bd76: 8d bb 3a 03 00 00 lea 0x33a(%ebx),%edi 8013bd7c: be 4e 96 2e 80 mov $0x802e964e,%esi /src/kernel/linux/include/asm/string.h:36 int d0, d1, d2; __asm__ __volatile__( 8013bd81: ac lods %ds:(%esi),%al Here's an example of a call trace taken from mapread. This call trace isn't accurate, and we'll see why: task: kupdated (pid: 9) 40105be4: 40105b78 T __down 40105d80: 40105d78 T __down_failed 401d6a9f: 401d6a90 t .text.lock.page_buf_locking 401c6599: 401c6570 T xfs_getsb 401ca9f7: 401ca9b8 T xfs_trans_getsb 401c95bd: 401c95a4 t xfs_trans_apply_sb_deltas 401c9c2a: 401c9b08 T xfs_trans_commit 401db19e: 401dac08 T xfs_strategy 401db1cd: 401dac08 T xfs_strategy 401db1ee: 401dac08 T xfs_strategy 401d5ac3: 401d5980 t _page_buf_page_apply 401db445: 401db3a4 T xfs_bmap 401d7078: 401d7038 t map_blocks 401d76da: 401d7550 t delalloc_convert 401d7cc8: 401d7c6c t linvfs_writepage 40138b2d: 40138a90 t write_some_buffers 401d2250: 401d20a0 t xfs_inode_flush 401ccf11: 401ccefc t xfs_sync 401df6b3: 401df6a8 t linvfs_statfs 4013cc9d: 4013cba8 T sync_supers 4013bf07: 4013beec t sync_old_buffers 4013c1cf: 4013c0c8 T kupdate 40105694: 4010566c T kernel_thread The call trace starts at the bottom and moves upward. From the listing above, we expect the following call chain: kernel_thread kupdate sync_old_buffers sync_supers linvfs_statfs xfs_sync xfs_inode_flush write_some_buffers ...etc... Here's what really happens, according to the disassembled vmlinux file. First, I search for the first address, 40105694. The line immediately preceding the address will be the call to the next function. 40105692: ff d2 call *%edx 40105694: b8 01 00 00 00 mov $0x1,%eax Figuring out what happens through function pointers (as in address 40105692) requires a visit to the source code. Let's assume that we're actually going into kupdate at this point. Code like XFS that is function pointer crazy requires more debugging. Trying the next address: 4013c1ca: e8 1d fd ff ff call 4013beec <sync_old_buffers> 4013c1cf: 81 3d 00 28 3f 40 00 cmpl $0x403f2800,0x403f2800 So far so good. Next address: 4013bf02: e8 89 cb ff ff call 40138a90 <write_some_buffers> 4013bf07: 83 c4 04 add $0x4,%esp Whoa! We actually don't go to sync_supers as the call trace says, but we go to write_some_buffers instead. I've run into this case more often than not, so it always pays to verify the call trace. I don't know why the call trace is screwed up in linux. Once you have a verified call trace, you can start figuring out what happened during the crash. Let's take the example oops I've shown above, where the spew process caused a page fault: Unable to handle kernel NULL pointer dereference at virtual address 00000004 80289912 *pde = 591be001 Oops: 0002 deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd CPU: 1 EIP: 0010:[<80289912>] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010202 eax: 00000000 ebx: f8f37138 ecx: f8b36d98 edx: 00000000 esi: 00619380 edi: f504ec00 ebp: 00000900 esp: d91bdcf8 ds: 0018 es: 0018 ss: 0018 Process spew (pid: 1230, stackpage=d91bd000) Stack: f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050 00000000 80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320 00609250 00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600 09001000 Call Trace: [<80286341>] [<80286427>] [<80219fdc>] [<8021a051>] [<8013b78c>] [<8013b94b>] [<80139b70>] [<8012fd22>] [<80130020>] [<8013008c>] [<80130b 1292ff>] [<8012919c>] [<801d8425>] [<801d34d9>] [<80137717>] [<80106f27>] Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03 >>EIP; 80289912 <lvm_snapshot_remap_block+a6/f8> <===== >>ebx; f8f37138 <END_OF_CODE+78b17794/????> >>ecx; f8b36d98 <END_OF_CODE+787173f4/????> >>esi; 00619380 Before first symbol >>edi; f504ec00 <END_OF_CODE+74c2f25c/????> >>ebp; 00000900 Before first symbol >>esp; d91bdcf8 <END_OF_CODE+58d9e354/????> Trace; 80286341 <lvm_map+3b9/490> Trace; 80286427 <lvm_make_request_fn+f/1c> Trace; 80219fdc <generic_make_request+11c/12c> Trace; 8021a051 <submit_bh+65/80> Trace; 8013b78c <sync_page_buffers+94/ac> Trace; 8013b94b <try_to_free_buffers+1a7/1d4> Trace; 80139b70 <try_to_release_page+44/48> Trace; 8012fd22 <shrink_cache+24a/3c8> Trace; 80130020 <shrink_caches+78/a8> Trace; 8013008c <try_to_free_pages_zone+3c/5c> Code; 80289912 <lvm_snapshot_remap_block+a6/f8> 00000000 <_EIP>: Code; 80289912 <lvm_snapshot_remap_block+a6/f8> <===== 0: 89 42 04 mov %eax,0x4(%edx) <===== Code; 80289915 <lvm_snapshot_remap_block+a9/f8> 3: 89 10 mov %edx,(%eax) Code; 80289917 <lvm_snapshot_remap_block+ab/f8> 5: c7 01 00 00 00 00 movl $0x0,(%ecx) Code; 8028991d <lvm_snapshot_remap_block+b1/f8> b: c7 41 04 00 00 00 00 movl $0x0,0x4(%ecx) Code; 80289924 <lvm_snapshot_remap_block+b8/f8> 12: 8b 03 mov (%ebx),%eax The instruction at address 80289912 generated a page fault. This code is moving the value in eax into the memory addresss generated by adding 0x4 to the contents of edx. The oops says edx contains 0, and the page fault happens when we try to access memory address 0x00000004. Some pointer is NULL and it shouldn't be. Looking at the dissambled vmlinux file containing symbols, I searched for the beginning of lvm_snapshot_remap_block, added 0xa6 to it, and found the following code: /src/kernel/linux/drivers/md/lvm-snap.c:131 80289a7b: 39 71 08 cmp %esi,0x8(%ecx) 80289a7e: 75 f1 jne 80289a71 <lvm_snapshot_remap_block+0x85> 80289a80: 66 39 69 0c cmp %bp,0xc(%ecx) 80289a84: 75 eb jne 80289a71 <lvm_snapshot_remap_block+0x85> /src/kernel/linux/drivers/md/lvm-snap.c:133 80289a86: 83 7c 24 14 00 cmpl $0x0,0x14(%esp,1) 80289a8b: 74 23 je 80289ab0 <lvm_snapshot_remap_block+0xc4> /src/kernel/linux/include/linux/list.h:81 80289a8d: 8b 41 04 mov 0x4(%ecx),%eax 80289a90: 8b 11 mov (%ecx),%edx /src/kernel/linux/include/linux/list.h:82 80289a92: 89 42 04 mov %eax,0x4(%edx) /src/kernel/linux/include/linux/list.h:83 80289a95: 89 10 mov %edx,(%eax) <=== oops occurs here /src/kernel/linux/include/linux/list.h:94 80289a97: c7 01 00 00 00 00 movl $0x0,(%ecx) /src/kernel/linux/include/linux/list.h:95 80289a9d: c7 41 04 00 00 00 00 movl $0x0,0x4(%ecx) /src/kernel/linux/include/linux/list.h:40 80289aa4: 8b 03 mov (%ebx),%eax /src/kernel/linux/include/linux/list.h:41 80289aa6: 89 48 04 mov %ecx,0x4(%eax) Because I'm looking at a vmlinux file compiled with symbols, but a non symboled kernel was the one that crashed, my addresses will be slightly different. Thankfully ksymoops decodes some instructions to help us synchronize the addresses. The oops address 80289912 with code "move %eax,0x4(%edx)" becomes address 80289a95 in the symboled vmlinux file. So, something at line 83 in list.h is crashing. Here's the relevant code from list.h and lvm-snap.c static inline lv_block_exception_t *lvm_find_exception_table(kdev_t org_dev, unsigned long org_start, lv_t * lv) { 117 struct list_head *hash_table = lv->lv_snapshot_hash_table, *next; < code removed > 126 for (next = hash_table->next; next != hash_table; 127 next = next->next) { 128 lv_block_exception_t *exception; 129 130 exception = list_entry(next, lv_block_exception_t, hash) 131 if (exception->rsector_org == org_start && 132 exception->rdev_org == org_dev) { 133 if (i) { 134 /* fun, isn't it? :) */ 135 list_del(next); <=== oops here 136 list_add(next, hash_table); 137 } 138 ret = exception; 139 break; 140 } include/linux/list.h 18 struct list_head { 19 struct list_head *next, *prev; 20 }; 80 static inline void __list_del(struct list_head *prev, struct list_head *next) 81 { 82 next->prev = prev; 83 prev->next = next; 84 } 91 static inline void list_del(struct list_head *entry) 92 { 93 __list_del(entry->prev, entry->next); 94 entry->next = (void *) 0; 95 entry->prev = (void *) 0; 96 } The last line in lvm-snap.c listed in the vmlinux file is #133. This code is comparing some memory address to zero and jumping somewhere if true. That's the "if (i) {" line. The next code is the "list_del(next)" code. Some code inspection reveals that next is put into ecx before the list_del call is made. list_del(next) translates to: __list_del(next->prev, next->next) which becomes: (next->next)->prev = (next->prev); (next->prev)->next = (next->next); Here's the annotated assembly for __list_del: /src/kernel/linux/include/linux/list.h:81 80289a8d: 8b 41 04 mov 0x4(%ecx),%eax - move next->prev into eax 80289a90: 8b 11 mov (%ecx),%edx - move next->next into edx /src/kernel/linux/include/linux/list.h:82 80289a92: 89 42 04 mov %eax,0x4(%edx) - move next->prev into next->next->prev /src/kernel/linux/include/linux/list.h:83 80289a95: 89 10 mov %edx,(%eax) <=== oops occurs here - move next->next into next->prev->next (or just next->prev, which is the same address as next->prev->next). Note that eax and edx are equal to 0x00000000 in the oops. This means that next->prev == 0x00000000 and next->next = 0x00000000. This is an invalid state for the linux list.h code, which is a doubly linked list where the prev and next pointers are always expected to be valid. Only list nodes that have been deleted have next and prev pointers of zero, as the last two lines of list_del() do. Somehow we're trying to delete a list node that looks like it's already been deleted, and we page fault while doing it. Appendix A: mapread perl script #!/usr/bin/perl open(MAPFILE, "System.map") || die "Couldn't open System.map: $!\n"; @Map = <MAPFILE>; close(MAPFILE); if (-r $ARGV[0]) { # we're reading a magic sysrq trace &mapread_sysrq($ARGV[0]); } else { # we're just looking up one address $addr = $ARGV[0]; $result = &mapread($addr); print "$result\n"; } sub mapread { local($addr) = shift(@_); local($lastAddr); local(@fields); local($i); local($offset); local($start); local($function); local($str); $lastAddr = "not in file"; for ($i = 0; $i < $#Map; $i++) { @fields = split(' ', $Map[$i]); last if ($addr lt $fields[0]); $lastAddr = $Map[$i]; } # try to figure out the offset into the function # lines in mapfile are startAddr, something, functionName @fields = split(' ', $lastAddr); $start = $fields[0]; $function = $fields[2]; $offset = hex($addr) - hex($start); chop($lastAddr); $str = sprintf("%s: %s (%s+0x%x/%dd)", $addr, $lastAddr, $function, $offset, $offset); return $str; } sub mapread_sysrq { local($sysrq_file) = shift(@_); local(@sysrq); local($state); local($i); local($j); local(@fields); local($task); local(@traces); local($pid); # write sysrq file to another file, just in case there are line # feed issues open(SYSRQ_FILE, $sysrq_file) || die "Couldn't open $sysrq_file: $!\n"; @sysrq = <SYSRQ_FILE>; close(SYSRQ_FILE); open(SYSRQ_FILE_TMP, ">" . $sysrq_file . ".tmp") || die "Couldn't open $sysrq_file.tmp: $!\n"; for ($i = 0; $i <= $#sysrq; $i++) { $sysrq[$i] =~ s/^M//; # enter ctrl-V ctrl-M to get a true ^M in the code print SYSRQ_FILE_TMP "$sysrq[$i]"; } close(SYSRQ_FILE_TMP); open(SYSRQ_FILE, $sysrq_file . ".tmp") || die "Couldn't open $sysrq_file.tmp: $!\n"; @sysrq = <SYSRQ_FILE>; close(SYSRQ_FILE); system("rm $sysrq_file.tmp"); # state: 0 = start of file # 1 = getting task name # 2 = getting call traces $state = 0; for ($i = 0; $i <= $#sysrq; $i++) { chop($sysrq[$i]); @fields = split(' ', $sysrq[$i]); if ($state == 0) { # look for line with task if ($fields[0] eq "task") { $state = 1; } } elsif ($state == 1) { # first field is task $task = $fields[0]; $pid = $fields[4]; @traces = (); $state = 2; } elsif ($state == 2) { # if first field is "Call", skip to third field $j = 0; if (($fields[0] eq "Call") && ($fields[1] eq "Trace:")) { $j = 2; } # clean up the traces and put them in the array for (; $j <= $#fields; $j++) { $fields[$j] =~ s/\[//; $fields[$j] =~ s/<//; $fields[$j] =~ s/>//; $fields[$j] =~ s/\]//; push(@traces, &mapread($fields[$j])); } # if next line doesn't start with a [<, it's a task if ($i < $#sysrq) { @fields = split(' ', $sysrq[$i+1]); if (substr($fields[0], 0, 2) ne "[<") { # dump our trace list print "task: $task (pid: $pid)\n"; for ($j = 0; $j <= $#traces; $j++) { print "\t$traces[$j]\n"; } $state = 1; } } } } }
'Programming > Linux_Kernel' 카테고리의 다른 글
Playing with ptrace, Part I (0) | 2008.11.14 |
---|---|
Playing with ptrace, Part II (0) | 2008.11.14 |
diff 를 이용한 kernel patch 만들기 - 또다른 문서 (1) | 2008.11.12 |
diff를 이용한 kernel patch 만들기 (0) | 2008.11.12 |
Linux Kernel 에 대한 문서들 (0) | 2008.11.12 |