Preventing and identifying hardware failures in a Linux environment

Hardware failures can be catastrophic. If the failed component is a single point of failure, the impact can be severe: data loss, delayed delivery of data, service unavailability and so on. High availability and load balancing remove single points of failure, but they come at a cost, and many setups cannot adopt them for technical or budget reasons. What we can do instead is put preventive measures in place that alert us in advance that a device is likely to fail within a few days. This is not possible for every hardware device, but it is for some, such as disk drives. Where a failure cannot be predicted, for example when a system reboots by itself or hangs because of a hardware fault, we need to identify afterwards which device actually failed. Often no trace of such incidents can be found in the system log, and we have to use standard monitoring tools or the console log to identify the failure.

We will see below how we can use standard monitoring tools to prevent/identify hardware failures. The two standards we will discuss here are IPMI and SMART.


About IPMI:

IPMI (Intelligent Platform Management Interface) is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independently of the main CPU, BIOS, and OS. The service processor, or Baseboard Management Controller (BMC), is the brain behind platform management; its primary purpose is to handle autonomous sensor monitoring and event logging.

Using ipmitool:

ipmitool is a command-line utility for managing and configuring devices that support the Intelligent Platform Management Interface (IPMI).

SEL (System Event Log) records store system event information and can be useful for debugging problems. To view the SEL, we can use the ipmitool command as shown below.

server-001% sudo /etc/init.d/ipmi start
server-001% sudo ipmitool sel elist

Examples:

Below is an example scenario of identifying a hardware failure. The host kept going unresponsive, and no trace could be seen in /var/log/messages. Using IPMI we identified that the host became unresponsive because of a fan going bad.


server-001% sudo ipmitool sel list

Password:
  b4 | 05/27/2009 | 13:38:32 | Fan #0x37 | Upper Critical going high
  c8 | 05/27/2009 | 13:38:35 | Fan #0x37 | Upper Critical going high
  dc | 08/15/2009 | 07:07:50 | Fan #0x37 | Upper Critical going high
  f0 | 08/15/2009 | 07:07:52 | Fan #0x37 | Upper Critical going high


server-001% last -y | grep shutdown
shutdown         ~                         Sat Aug 15 2009 07:15
shutdown         ~                         Wed May 27 2009 13:43


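We can see that critical fan events have been logged: the fan sensor reading crossed its upper critical threshold. Each shutdown reported by `last -y' follows the fan events in the BMC log by a few minutes (13:38 and 13:43 on May 27, 07:07 and 07:15 on August 15), which confirms that the machine becomes unresponsive because of the fan problem.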
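
The comparison above can also be scripted. The following is a minimal sketch, not a finished tool: sel_events_before is a hypothetical helper name, it assumes the "|"-separated `ipmitool sel list' format shown above, and it assumes that the BMC clock and the OS clock agree. It prints the SEL entries logged within a given number of minutes before a shutdown time taken from `last'.

```shell
# sel_events_before DATE HH:MM [WINDOW_MINUTES]
# Hypothetical helper (not part of ipmitool). Reads SEL output on stdin and
# prints the entries logged on DATE (MM/DD/YYYY) at most WINDOW_MINUTES
# (default 15) before HH:MM. Limitation: does not look across midnight.
sel_events_before() {
    awk -F'|' -v day="$1" -v at="$2" -v win="${3:-15}" '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        # Strip leading and trailing blanks from a field.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 4 && trim($2) == day {
            delta = secs(at ":00") - secs(trim($3))
            if (delta >= 0 && delta <= win * 60) print
        }'
}
```

For the host above, `sudo ipmitool sel list | sel_events_before 08/15/2009 07:15' would print the two fan events logged at 07:07.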


Another example to identify bad memory:

Below is a host that used to reboot by itself. Checking the SEL, we can see a series of memory events, one of which is uncorrectable. This indicates that the host has a bad memory stick, which needs to be identified and replaced.

148 | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
 149 | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
 14a | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
 14b | 03/08/2010 | 14:39:54 | Event Logging Disabled SBE Log Disabled | Correctable memory error logging disabled | Asserted
 14c | 03/09/2010 | 01:50:22 | Memory ECC Uncorr Err | Uncorrectable ECC | Asserted
 14d | 03/09/2010 | 01:57:21 | Memory ECC Corr Err | Correctable ECC | Asserted


Let's take another host. In the events below, all the memory-related events are correctable. Generally we can ignore these. If they continue, arrive at a more rapid rate, start affecting system uptime or performance, or start turning into uncorrectable (UC) errors, then we have an issue that needs to be fixed. A larger volume of correctable errors usually means a stick is going bad or about to, and the errors will more than likely turn into UCs pretty quickly. In such cases we need to replace the bad memory stick.

 ac8 | 02/17/2010 | 11:08:15 | Memory Memory ECC | Correctable ECC | Asserted
 adc | 02/17/2010 | 11:08:15 | Memory Memory ECC | Correctable ECC | Asserted
 af0 | 03/09/2010 | 03:05:38 | Memory Memory ECC | Correctable ECC | Asserted
 b04 | 03/09/2010 | 03:05:38 | Memory Memory ECC | Correctable ECC | Asserted
 b18 | 03/10/2010 | 04:03:19 | Memory Memory ECC | Correctable ECC | Asserted
 b2c | 03/10/2010 | 04:03:19 | Memory Memory ECC | Correctable ECC | Asserted
 b40 | 03/15/2010 | 05:01:18 | Memory Memory ECC | Correctable ECC | Asserted
 b54 | 03/15/2010 | 05:01:18 | Memory Memory ECC | Correctable ECC | Asserted
 b68 | 03/20/2010 | 06:05:19 | Memory Memory ECC | Correctable ECC | Asserted
 b7c | 03/20/2010 | 06:05:19 | Memory Memory ECC | Correctable ECC | Asserted
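
To judge whether correctable errors are becoming more frequent or turning into uncorrectable ones, it helps to count them per day. The sketch below is one possible way to do that; ecc_summary is a hypothetical helper name, and it assumes the SEL line format shown in the two listings above (sensor name in the fourth field, event description in the fifth).

```shell
# ecc_summary
# Hypothetical helper. Reads SEL output on stdin and prints, for each date,
# how many correctable and uncorrectable ECC events were logged.
ecc_summary() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Only look at entries whose sensor field mentions ECC.
        NF >= 5 && $4 ~ /ECC/ {
            day = trim($2)
            if (!(day in seen)) { seen[day] = 1; order[++n] = day }
            if ($5 ~ /Uncorrectable/)    unc[day]++
            else if ($5 ~ /Correctable/) cor[day]++
        }
        END {
            # Print the dates in the order they first appeared in the log.
            for (i = 1; i <= n; i++)
                printf "%s correctable=%d uncorrectable=%d\n", order[i], cor[order[i]], unc[order[i]]
        }'
}
```

Used as `sudo ipmitool sel elist | ecc_summary', a day-by-day count that keeps growing, or any nonzero uncorrectable count, is the signal described above that a memory stick needs replacing.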
 
 
We can write a Nagios plugin that monitors the SEL and alerts us whenever an event is logged. This can really help us reduce outages caused by hardware failures.
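
A minimal sketch of such a plugin is shown here. It is an illustration under stated assumptions, not a production check: check_sel is a hypothetical name, the SEL text is read from stdin so that the caller decides how to run ipmitool, and the rule "Critical or Uncorrectable means CRITICAL, anything else means WARNING" is our own choice. The exit codes follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
# check_sel
# Hypothetical Nagios-style check. Reads SEL output on stdin, prints one
# status line and returns the matching Nagios exit code.
check_sel() {
    awk -F'|' '
        # Every "|"-separated line with at least four fields is one SEL entry.
        NF >= 4 {
            total++
            if ($0 ~ /Uncorrectable|Critical/) crit++
        }
        END {
            if (crit > 0) {
                printf "CRITICAL - %d of %d SEL entries are critical\n", crit, total
                exit 2
            } else if (total > 0) {
                printf "WARNING - %d SEL entries logged\n", total
                exit 1
            } else {
                print "OK - SEL is empty"
                exit 0
            }
        }'
}
```

A wrapper script could call it as `sudo ipmitool sel elist | check_sel' and exit with its return code. Since old entries keep the check in an alert state, the SEL would need to be cleared (`ipmitool sel clear') once a problem has been dealt with.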

About SMART:
 
S.M.A.R.T stands for Self-Monitoring, Analysis and Reporting Technology. It is the industry-standard reliability prediction indicator for both IDE/ATA and SCSI hard disk drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests. S.M.A.R.T-enabled hard drives maintain a set of attributes, each with a threshold value that the attribute should not cross under normal operation. If an attribute crosses its threshold, there is a problem and we need to take a look at it.
 
Using S.M.A.R.T Monitoring tools  
 
Smartmontools is a set of SMART monitoring tools that supports ATA/ATAPI/SATA and SCSI disks. The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor the S.M.A.R.T system built into storage devices.

smartd is a daemon that monitors the S.M.A.R.T system built into the hard drives. The main configuration file for smartd is /etc/smartd.conf. By default smartd polls the S.M.A.R.T-enabled hard drives every 30 minutes and logs S.M.A.R.T-related errors through syslog into /var/log/messages. smartctl is a command-line control and monitoring utility for S.M.A.R.T-enabled disk drives.
 
Examples of using smartmontools 
 
$$ Knowing the health of a disk drive: 
 
 -bash-3.2$ sudo smartctl -H /dev/ad0
smartctl version 5.36 [x86_64-unknown-freebsd6.1] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

We can see that the SMART health check passed for this disk.


$$ SMART health check on a disk drive which is failing:

-bash-3.2$ sudo smartctl -H /dev/ad0
smartctl version 5.36 [x86_64-unknown-freebsd6.1] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
*Drive failure expected in less than 24 hours.* SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   036    Pre-fail  Always   FAILING_NOW 2059
196 Reallocated_Event_Count 0x0033   001   001   036    Pre-fail  Always   FAILING_NOW 2059 
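
In the output above both attributes have a normalized VALUE (001) at or below their THRESH (036), which is why they are reported as failing. The sketch below shows one way these two checks could be automated. smart_health and smart_failing_attrs are hypothetical helper names, they assume the ATA output format shown above (SCSI disks report health in a different format that this sketch does not handle), and the exit codes follow the Nagios convention.

```shell
# smart_health
# Hypothetical helper. Reads `smartctl -H' output on stdin, prints a status
# line and returns 0 (passed), 2 (failed) or 3 (no result line found).
smart_health() {
    awk '
        /overall-health self-assessment test result:/ {
            seen = 1
            if ($NF == "PASSED") ok = 1
        }
        END {
            if (!seen)   { print "UNKNOWN - no health result found"; exit 3 }
            else if (ok) { print "OK - SMART health check passed"; exit 0 }
            else         { print "CRITICAL - SMART health check failed"; exit 2 }
        }'
}

# smart_failing_attrs
# Hypothetical helper. Reads an attribute table (as printed above or by
# `smartctl -A') on stdin and prints every attribute whose normalized VALUE
# (column 4) is at or below its THRESH (column 6).
smart_failing_attrs() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 && $4 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
            if ($4 + 0 <= $6 + 0)
                print $2, "value=" ($4 + 0), "thresh=" ($6 + 0)
        }'
}
```

For example, `sudo smartctl -H /dev/ad0 | smart_health' gives a one-line status, and `sudo smartctl -A /dev/ad0 | smart_failing_attrs' lists the attributes that have reached their threshold.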

$$ Checking the disk for bad sectors:

The SMART health check shown in the first example may report PASSED even if there are some bad sectors on the disk. To identify bad sectors we can run a SMART self-test, either short or long. The short self-test usually finishes in under ten minutes. The long self-test is a longer and more thorough version of the short one. Let's see how to run a self-test.

-bash-3.00$ sudo smartctl -t short /dev/sda

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Jul 18 10:58:24 2011

Use smartctl -X to abort test.


We can see that the test will finish in about two minutes. The results are reported in the Self Test Error Log, readable with the '-l selftest' option as shown below.

$ sudo smartctl -l selftest /dev/sda
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       50%      7780         165497
# 2  Extended offline    Completed: read failure       90%      7779         165498


We can see from the results above that there are read failures, and the LBA_of_first_error column shows where the first one occurred. It is better to get this disk replaced.
 
Let's see the same results on a host where no bad sectors can be seen.
-bash-3.00$ sudo smartctl -l selftest /dev/sda
Password:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 4  Short offline       Completed without error       00%     40663         -
# 5  Extended offline    Completed without error       00%         0         -
# 6  Short offline       Completed without error       00%         0         -
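
The difference between the two self-test logs above can be checked mechanically. The following is a rough sketch: selftest_failures is a hypothetical helper name, it assumes that each log entry is on a single line starting with "# N", and it only looks for statuses containing the word "failure" (other statuses, such as aborted or in-progress tests, are ignored).

```shell
# selftest_failures
# Hypothetical helper. Reads `smartctl -l selftest' output on stdin, prints
# the log entries whose status reports a failure and returns 2 if there is
# at least one such entry, 0 otherwise.
selftest_failures() {
    awk '
        # Log entries look like "# 1  Short offline  <status> ...".
        /^# *[0-9]+ / && /failure/ {
            bad++
            print
        }
        END { if (bad > 0) exit 2 }'
}
```

A periodic job could run `sudo smartctl -l selftest /dev/sda | selftest_failures' after each scheduled self-test and raise an alert when the return code is not zero.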

 
 
