One day, this error appeared...
Message from syslogd@cmv7 at Jun 13 17:54:30 ...
kernel:BUG: soft lockup - CPU#0 stuck for 23s! [rcuos/2:16]
Environment
- ESXi 5.5
- CentOS 7.2 (64-bit)
- Kernel 3.10.0-327.18.2.el7.x86_64
Looking at /var/log/messages...
Jun 13 17:54:30 cmv7 kernel: BUG: soft lockup - CPU#0 stuck for 23s! [rcuos/2:16]
Jun 13 17:54:30 cmv7 kernel: Modules linked in: vmw_vsock_vmci_transport vsock coretemp crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vmw_balloon pcspkr sg vmw_vmci i2c_piix4 shpchp parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper ttm drm crct10dif_pclmul crct10dif_common ata_piix crc32c_intel libata mptspi serio_raw i2c_core scsi_transport_spi vmxnet3 mptscsih mptbase floppy dm_mirror dm_region_hash dm_log dm_mod
Jun 13 17:54:30 cmv7 kernel: CPU: 0 PID: 16 Comm: rcuos/2 Tainted: G L ------------ 3.10.0-327.18.2.el7.x86_64 #1
Jun 13 17:54:30 cmv7 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
Jun 13 17:54:30 cmv7 kernel: task: ffff880139bc2e00 ti: ffff880139bd8000 task.ti: ffff880139bd8000
Jun 13 17:54:30 cmv7 kernel: RIP: 0010:[<ffffffffa00543be>] [<ffffffffa00543be>] mpt_put_msg_frame+0x5e/0x80 [mptbase]
Jun 13 17:54:30 cmv7 kernel: RSP: 0018:ffff88013fc03bc8 EFLAGS: 00000246
Jun 13 17:54:30 cmv7 kernel: RAX: ffffc90008800000 RBX: ffff8800956530b0 RCX: 0000000000000015
Jun 13 17:54:30 cmv7 kernel: RDX: ffff880036b53800 RSI: ffff8800369dc000 RDI: 000000000000000e
Jun 13 17:54:30 cmv7 kernel: RBP: ffff88013fc03bd8 R08: 0000000000000003 R09: ffff8800369d90d8
Jun 13 17:54:30 cmv7 kernel: R10: ffff8800971d6540 R11: ffffea0004dc8f00 R12: ffff88013fc03b38
Jun 13 17:54:30 cmv7 kernel: R13: ffffffff81646e1d R14: ffff88013fc03bd8 R15: ffff8800369dc000
Jun 13 17:54:30 cmv7 kernel: FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
Jun 13 17:54:30 cmv7 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 13 17:54:30 cmv7 kernel: CR2: 00000000005643eb CR3: 0000000097221000 CR4: 00000000000007f0
Jun 13 17:54:30 cmv7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 13 17:54:30 cmv7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 13 17:54:30 cmv7 kernel: Stack:
Jun 13 17:54:30 cmv7 kernel: ffff8800369dcd30 0000000000000048 ffff88013fc03c80 ffffffffa0076729
Jun 13 17:54:30 cmv7 kernel: ffff8800369dc008 04000000ffc00400 0000000000000015 ffff8800971d6540
Jun 13 17:54:30 cmv7 kernel: 0000000000000054 ffff8800369dc188 0000006000000015 ffff8801358c3010
Jun 13 17:54:30 cmv7 kernel: Call Trace:
Jun 13 17:54:30 cmv7 kernel: <IRQ>
Jun 13 17:54:30 cmv7 kernel:
Jun 13 17:54:30 cmv7 kernel: [<ffffffffa0076729>] mptscsih_qcmd+0x249/0x820 [mptscsih]
Jun 13 17:54:30 cmv7 kernel: [<ffffffffa006f2b0>] mptspi_qcmd+0x50/0xe0 [mptspi]
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81417f3a>] scsi_dispatch_cmd+0xaa/0x230
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81420ec1>] scsi_request_fn+0x501/0x770
Jun 13 17:54:30 cmv7 kernel: [<ffffffff812c7793>] __blk_run_queue+0x33/0x40
Jun 13 17:54:30 cmv7 kernel: [<ffffffff812c7806>] blk_run_queue+0x26/0x40
Jun 13 17:54:30 cmv7 kernel: [<ffffffff8141f2e8>] scsi_run_queue+0x258/0x2f0
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81421170>] scsi_next_command+0x20/0x40
Jun 13 17:54:30 cmv7 kernel: [<ffffffff814212e5>] scsi_end_request+0x155/0x1d0
Jun 13 17:54:30 cmv7 kernel: [<ffffffff814214c3>] scsi_io_completion+0x103/0x600
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81416805>] scsi_finish_command+0xd5/0x130
Jun 13 17:54:30 cmv7 kernel: [<ffffffff8142099a>] scsi_softirq_done+0x12a/0x150
Jun 13 17:54:30 cmv7 kernel: [<ffffffff812d1a50>] blk_done_softirq+0x90/0xc0
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81084b0f>] __do_softirq+0xef/0x280
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81647adc>] call_softirq+0x1c/0x30
Jun 13 17:54:30 cmv7 kernel: <EOI>
Jun 13 17:54:30 cmv7 kernel:
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81016fc5>] do_softirq+0x65/0xa0
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81084404>] local_bh_enable+0x94/0xa0
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81124265>] rcu_nocb_kthread+0x255/0x370
Jun 13 17:54:30 cmv7 kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81124010>] ? rcu_start_gp+0x40/0x40
Jun 13 17:54:30 cmv7 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jun 13 17:54:30 cmv7 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jun 13 17:54:30 cmv7 kernel: [<ffffffff81646118>] ret_from_fork+0x58/0x90
Jun 13 17:54:30 cmv7 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jun 13 17:54:30 cmv7 kernel: Code: 8b 96 68 01 00 00 0f b7 c8 44 03 a6 b0 01 00 00 44 8b 04 8a 45 09 c4 f6 86 e0 00 00 00 04 75 10 48 8b 83 e8 00 00 00 44 89 60 40 <5b> 41 5c 5d c3 48 8d 76 08 0f b7 c8 44 89 e2 48 c7 c7 78 23 06
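When the same trace repeats many times in /var/log/messages, it helps to reduce each occurrence to the stuck CPU, the task, and the function the CPU was executing. The following is a minimal Python sketch, assuming the log lines have the same format as the excerpt above. The function name summarize_soft_lockups and both regular expressions are illustrative, not part of any existing tool.

```python
import re

# Header line printed for a soft lockup, e.g.
# "BUG: soft lockup - CPU#0 stuck for 23s! [rcuos/2:16]"
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

# Instruction pointer line, e.g.
# "RIP: 0010:[<...>] [<...>] mpt_put_msg_frame+0x5e/0x80 [mptbase]"
# The trailing "[module]" is optional (absent for built-in kernel code).
RIP_RE = re.compile(
    r"RIP: .*\] (?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    events = []
    for line in lines:
        header = SOFT_LOCKUP_RE.search(line)
        if header:
            events.append({
                "cpu": int(header.group("cpu")),
                "seconds": int(header.group("secs")),
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "symbol": None,
                "module": None,
            })
            continue
        rip = RIP_RE.search(line)
        # Attach the first RIP line that follows a header to that report.
        if rip and events and events[-1]["symbol"] is None:
            events[-1]["symbol"] = rip.group("symbol")
            events[-1]["module"] = rip.group("module")
    return events
```

Passing it the lines of /var/log/messages (for example via open("/var/log/messages").readlines(), which typically requires root) yields one entry per report. For the trace above, the entry points at mpt_put_msg_frame in the mptbase module, that is, the SCSI controller driver path.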
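What happened...
Processing slowed to a crawl, and the application all but stopped responding.
What I did...
As a first step, I changed the following parameter and rebooted. After that, the error stopped appearing and performance returned to normal. Note that /boot/config-* is a record of the options the kernel was built with, so editing it does not by itself disable the detector in the running kernel; the reboot may be what cleared the symptom.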
Before:
/boot/config-3.10.0-327.18.2.el7.x86_64
CONFIG_LOCKUP_DETECTOR=y
After:
/boot/config-3.10.0-327.18.2.el7.x86_64
CONFIG_LOCKUP_DETECTOR=n
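To check which value a config file currently holds without opening it by hand, a small read-only helper is enough. This is a minimal Python sketch, assuming the usual kernel config layout where set options appear as NAME=value and unset ones as "# NAME is not set". The helper name read_config_option is hypothetical.

```python
import platform
import re
from typing import Optional


def read_config_option(config_text: str, name: str) -> Optional[str]:
    """Return the value assigned to a kernel config option, or None.

    None covers both a missing option and the
    "# CONFIG_FOO is not set" comment form.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(name))
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # /boot/config-<release> for the kernel that is currently running.
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as handle:
            value = read_config_option(handle.read(), "CONFIG_LOCKUP_DETECTOR")
        print(path, "CONFIG_LOCKUP_DETECTOR =", value)
    except OSError as exc:
        print("could not read", path, "-", exc)
```

The helper only reads the file, so it is safe to run before and after the change to confirm what was actually written.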
In closing...
So far the same error and symptoms have not recurred, but I have not yet identified the root cause, so I will keep investigating!
References
https://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2094326
http://visualworks.dip.jp/achiral/blog/blog/2015/04/28/centos7%E3%82%B5%E3%83%BC%E3%83%90%E3%83%BC%E4%B8%8D%E8%AA%BF%E6%94%B9%E5%96%84soft-lockup/