Using ipmitool: rc.shutdown: 30 second watchdog timeout expired. Shutdown terminated.

Background: 

In this post, we diagnose a production host that frequently became unresponsive and had to be rebooted to bring it back online. The system logs showed nothing about the sudden hangs, there was no trace of an intrusion, and the hardware appeared to be fine. After a long investigation we came across the ipmitool utility, which finally let us identify the root cause of this behaviour.

About IPMItool:

IPMItool is a command-line utility for managing and configuring devices that support the Intelligent Platform Management Interface (IPMI). IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware, implemented independently of the main CPU, BIOS, and OS. The service processor, or Baseboard Management Controller (BMC), is the brain behind platform management; its primary purpose is to handle autonomous sensor monitoring and event logging.

System Log Information:

After going through all the log files and running diagnostic tools, we observed that the host was going into shutdown, which is why it appeared hung and unresponsive.
Below is the output of the `last` command, followed by the messages written to /var/log/messages when the host became unresponsive. The logs say little about why the shutdown was triggered.

=========
server-001% last
aisha       ttyp1    203.83.248.32    Mon Aug 17 10:36   still logged in
aisha       ttyp0    203.83.248.32    Mon Aug 17 08:34   still logged in
reboot         ~                         Mon Aug 17 08:11
shutdown       ~                         Sat Aug 15 07:15
---------
Messages from the system log during shutdown
====
Aug 15 07:15:30 server-001 rc.shutdown: 30 second watchdog timeout expired. Shutdown terminated.
Aug 15 07:15:30 server-001 kernel: 30 second watchdog timeout expired. Shutdown terminated.
Aug 15 07:15:30 server-001 kernel: Sat Aug 15 07:15:30 UTC 2009
Aug 15 07:15:30 server-001 init: /bin/sh on /etc/rc.shutdown terminated abnormally, going to single user mode
Aug 15 07:15:30 server-001 syslogd: exiting on signal 15 
Aug 17 08:05:47 server-001 syslogd: restart
Aug 17 08:05:47 server-001 syslogd: kernel boot file is /boot/kernel/kernel
==========

Troubleshooting using ipmitool:

We will now use ipmitool to view the BMC event log. This tells us whether any hardware fault on the host went unrecorded in the system logs.

server-001% sudo ipmitool sel list

Password:
  b4 | 05/27/2009 | 13:38:32 | Fan #0x37 | Upper Critical going high
  c8 | 05/27/2009 | 13:38:35 | Fan #0x37 | Upper Critical going high
  dc | 08/15/2009 | 07:07:50 | Fan #0x37 | Upper Critical going high
  f0 | 08/15/2009 | 07:07:52 | Fan #0x37 | Upper Critical going high
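
To work with these events programmatically, the pipe-separated lines can be split into fields. The following is a minimal Python sketch, assuming the five-column layout shown in the listing above; `SelEvent` and `parse_sel_line` are hypothetical names chosen for illustration, and other BMCs or ipmitool versions may print additional columns.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class SelEvent(NamedTuple):
    record_id: str       # hexadecimal SEL record ID, e.g. "dc"
    timestamp: datetime  # event time according to the BMC clock
    sensor: str          # e.g. "Fan #0x37"
    description: str     # e.g. "Upper Critical going high"


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Parse one line of `ipmitool sel list` output; return None if it does not fit."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None  # blank line, password prompt, or an unexpected layout
    try:
        stamp = datetime.strptime(f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # date or time column is not in the assumed format
    # Any extra columns are folded into the description.
    return SelEvent(fields[0], stamp, fields[3], " | ".join(fields[4:]))


# The four lines from the listing above.
sample = """\
  b4 | 05/27/2009 | 13:38:32 | Fan #0x37 | Upper Critical going high
  c8 | 05/27/2009 | 13:38:35 | Fan #0x37 | Upper Critical going high
  dc | 08/15/2009 | 07:07:50 | Fan #0x37 | Upper Critical going high
  f0 | 08/15/2009 | 07:07:52 | Fan #0x37 | Upper Critical going high
"""

events = [e for e in map(parse_sel_line, sample.splitlines()) if e is not None]
for event in events:
    if event.sensor.startswith("Fan"):
        print(event.timestamp.isoformat(sep=" "), event.sensor, event.description)
```

Lines that do not match the assumed layout are skipped rather than raising an error, so the parser can be fed the raw command output, prompt included.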


server-001% last -y | grep shutdown
shutdown         ~                         Sat Aug 15 2009 07:15
shutdown         ~                         Wed May 27 2009 13:43

The BMC log contains critical fan events: the fan sensor crossed its upper critical threshold. Each shutdown reported by `last -y` follows one of these events by a few minutes, which confirms that the machine was shutting down, and so becoming unresponsive, because of the fan readings. This can indicate a failing fan or, in some cases, a problem with the fan curve settings. Unfortunately, the log does not tell us which of the two it is. If the same events show up on many hosts with the same hardware configuration, the fan settings are the more likely culprit and a firmware upgrade will probably resolve it. If a bad fan is the cause, replacing it should fix the issue, which is what happened in our case.
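
The comparison between the BMC log and the shutdown times can also be scripted. The sketch below hard-codes the timestamps from the `ipmitool sel list` and `last -y` output shown earlier and reports which fan events fall shortly before each shutdown. The 10-minute window and the name `events_before` are assumptions made for illustration, and the sketch further assumes that the BMC clock and the system clock agree.

```python
from datetime import datetime, timedelta

# Fan event timestamps copied from the `ipmitool sel list` output.
fan_events = [
    datetime(2009, 5, 27, 13, 38, 32),
    datetime(2009, 5, 27, 13, 38, 35),
    datetime(2009, 8, 15, 7, 7, 50),
    datetime(2009, 8, 15, 7, 7, 52),
]

# Shutdown timestamps copied from `last -y | grep shutdown` (minute resolution).
shutdowns = [
    datetime(2009, 8, 15, 7, 15),
    datetime(2009, 5, 27, 13, 43),
]

# Assumed correlation window; adjust to how quickly the host reacts to a fan alarm.
WINDOW = timedelta(minutes=10)


def events_before(shutdown, events, window=WINDOW):
    """Return the events that occurred within `window` before `shutdown`."""
    return [e for e in events if timedelta(0) <= shutdown - e <= window]


for shutdown in sorted(shutdowns):
    preceding = events_before(shutdown, fan_events)
    if preceding:
        lead = shutdown - max(preceding)  # gap between the last fan event and the shutdown
        print(f"{shutdown}: {len(preceding)} fan event(s), last one {lead} earlier")
    else:
        print(f"{shutdown}: no fan event within {WINDOW}")
```

A shutdown with no fan event inside the window would point to a different cause and deserve a separate look.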

Conclusion:

ipmitool is a very useful tool for diagnosing hardware issues, alongside system logs and other diagnostic tools. It can also report other details such as the firmware version. You can read more about the tool at http://ipmitool.sourceforge.net/
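
As one example of those other details, `ipmitool mc info` prints information about the BMC as "Key : value" lines, including its firmware revision. The following Python sketch wraps that command; `parse_key_values` and `bmc_info` are hypothetical helper names, and the field label "Firmware Revision" is assumed to match what your ipmitool version prints.

```python
import subprocess
from typing import Dict


def parse_key_values(text: str) -> Dict[str, str]:
    """Turn 'Key : value' lines into a dictionary, skipping lines without a value."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            info[key.strip()] = value.strip()
    return info


def bmc_info() -> Dict[str, str]:
    """Run `ipmitool mc info` (usually requires root) and parse its output."""
    result = subprocess.run(
        ["ipmitool", "mc", "info"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_key_values(result.stdout)


if __name__ == "__main__":
    try:
        info = bmc_info()
    except (OSError, subprocess.CalledProcessError) as exc:
        # ipmitool is not installed, or the BMC could not be reached.
        print("could not query the BMC:", exc)
    else:
        print("BMC firmware revision:", info.get("Firmware Revision", "unknown"))
```

Collecting this value from every host with the same hardware makes it easier to tell whether fan events cluster on one firmware version.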






