Preventing/Identifying hardware failures in a Linux environment
Hardware failures can be catastrophic. If the failed component is a single point of failure, the impact can be severe: data loss, delayed arrival of data, service unavailability and so on. To remove single points of failure we can set up high availability, load balancing etc., but that comes at a cost, and many cannot afford it for technical or budgetary reasons. What we can do instead is put preventive measures in place that at least alert us in advance that a device is going to fail within a few days. This is possible for some devices, such as disk drives, but not for every hardware device. For the remaining cases, such as a system rebooting by itself or hanging because of a hardware failure, we need to identify which device actually failed. Often no trace of such incidents can be found in the system log, and we have to use standard monitoring tools or the console log to identify the failure.
We will see below how we can use standard monitoring tools to prevent/identify hardware failures. The two standards we will discuss here are IPMI and SMART.
About IPMI:
IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independently of the main CPU, BIOS, and OS. The service processor (or Baseboard Management Controller, BMC) is the brain behind platform management, and its primary purpose is to handle autonomous sensor monitoring and event logging.
Using ipmitool:
IPMItool is a command-line utility for managing and configuring devices that support the Intelligent Platform Management Interface (IPMI).
SEL (System Event Log) records store system event information and may be useful for debugging problems. To view the SEL, we can use the ipmitool command as shown below.
server-001% sudo /etc/init.d/ipmi start
server-001% sudo ipmitool sel elist
Examples:
Below is an example scenario of identifying a hardware failure. The host used to become unresponsive, and no trace could be seen in /var/log/messages. Using IPMI we identified that the host became unresponsive because of a fan going bad.
server-001% sudo ipmitool sel list
Password:
b4 | 05/27/2009 | 13:38:32 | Fan #0x37 | Upper Critical going high
c8 | 05/27/2009 | 13:38:35 | Fan #0x37 | Upper Critical going high
dc | 08/15/2009 | 07:07:50 | Fan #0x37 | Upper Critical going high
f0 | 08/15/2009 | 07:07:52 | Fan #0x37 | Upper Critical going high
server-001% last -y | grep shutdown
shutdown ~ Sat Aug 15 2009 07:15
shutdown ~ Wed May 27 2009 13:43
We can see that critical fan events have been logged. Each shutdown time reported by `last -y' falls a few minutes after a pair of fan events in the BMC log, which confirms that the machine was becoming unresponsive because of the fan reading crossing its upper critical threshold.
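The correlation between SEL events and shutdown times can also be checked programmatically. The following is a minimal Python sketch, not a definitive tool: it assumes SEL lines in the pipe-separated layout shown above, and the helper names (parse_sel_line, events_before) and the 10-minute window are hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta

def parse_sel_line(line):
    """Parse one 'ipmitool sel list' line into (timestamp, sensor, event).

    Assumes the 'id | MM/DD/YYYY | HH:MM:SS | sensor | event' layout
    shown in the transcript above; returns None for any other line.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None
    try:
        stamp = datetime.strptime(fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, fields[3], fields[4]

def events_before(sel_text, shutdown, window_minutes=10):
    """Return SEL events logged within window_minutes before a shutdown."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in sel_text.splitlines():
        event = parse_sel_line(line)
        if event and timedelta(0) <= shutdown - event[0] <= window:
            hits.append(event)
    return hits

# SEL entries copied from the transcript above.
SEL_SAMPLE = """\
b4 | 05/27/2009 | 13:38:32 | Fan #0x37 | Upper Critical going high
c8 | 05/27/2009 | 13:38:35 | Fan #0x37 | Upper Critical going high
dc | 08/15/2009 | 07:07:50 | Fan #0x37 | Upper Critical going high
f0 | 08/15/2009 | 07:07:52 | Fan #0x37 | Upper Critical going high
"""

# Shutdown time taken from the 'last -y' output above.
for stamp, sensor, event in events_before(SEL_SAMPLE, datetime(2009, 8, 15, 7, 15)):
    print(stamp, sensor, event)
```

Feeding in the second shutdown time (May 27 2009 13:43) picks out the other pair of fan events in the same way.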
Another example to identify bad memory:
Below is a host which used to reboot on its own. Upon checking the SEL, we can see an uncorrectable memory event among the correctable ones. This indicates that the host has bad memory sticks, which need to be identified and replaced.
148 | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
149 | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
14a | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
14b | 03/08/2010 | 14:39:54 | Event Logging Disabled SBE Log Disabled | Correctable memory error logging disabled | Asserted
14c | 03/09/2010 | 01:50:22 | Memory ECC Uncorr Err | Uncorrectable ECC | Asserted
14d | 03/09/2010 | 01:57:21 | Memory ECC Corr Err | Correctable ECC | Asserted
Let's take another host. In the events from the host below, all the memory-related events are correctable. Generally we can ignore these. If they continue, arrive at a faster rate, start affecting system uptime or performance, or start turning into uncorrectable (UC) errors, then we have an issue that needs to be fixed. A larger volume of correctable errors usually means a stick is going bad or about to, and the errors will more than likely turn into UCs fairly quickly. In such cases we need to replace the bad memory stick.
ac8 | 02/17/2010 | 11:08:15 | Memory Memory ECC | Correctable ECC | Asserted
adc | 02/17/2010 | 11:08:15 | Memory Memory ECC | Correctable ECC | Asserted
af0 | 03/09/2010 | 03:05:38 | Memory Memory ECC | Correctable ECC | Asserted
b04 | 03/09/2010 | 03:05:38 | Memory Memory ECC | Correctable ECC | Asserted
b18 | 03/10/2010 | 04:03:19 | Memory Memory ECC | Correctable ECC | Asserted
b2c | 03/10/2010 | 04:03:19 | Memory Memory ECC | Correctable ECC | Asserted
b40 | 03/15/2010 | 05:01:18 | Memory Memory ECC | Correctable ECC | Asserted
b54 | 03/15/2010 | 05:01:18 | Memory Memory ECC | Correctable ECC | Asserted
b68 | 03/20/2010 | 06:05:19 | Memory Memory ECC | Correctable ECC | Asserted
b7c | 03/20/2010 | 06:05:19 | Memory Memory ECC | Correctable ECC | Asserted
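The difference between the two hosts above, correctable events only versus correctable events that turn uncorrectable, can be tracked with a small script. This is only a sketch under stated assumptions: the function name summarize_ecc is hypothetical, and it relies on the event field reading exactly 'Correctable ECC' or 'Uncorrectable ECC' as in the transcripts above.

```python
from collections import Counter

def summarize_ecc(sel_text):
    """Count correctable ECC events per day and collect uncorrectable ones.

    Assumes pipe-separated 'ipmitool sel list' lines whose fifth field is
    'Correctable ECC' or 'Uncorrectable ECC', as in the transcripts above.
    """
    correctable_per_day = Counter()
    uncorrectable = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue
        date, event = fields[1], fields[4]
        if event == "Uncorrectable ECC":
            uncorrectable.append(line.strip())
        elif event == "Correctable ECC":
            correctable_per_day[date] += 1
    return correctable_per_day, uncorrectable

# Entries copied from the first memory example above.
SAMPLE = """\
14a | 03/08/2010 | 14:39:54 | Memory ECC Corr Err | Correctable ECC | Asserted
14c | 03/09/2010 | 01:50:22 | Memory ECC Uncorr Err | Uncorrectable ECC | Asserted
14d | 03/09/2010 | 01:57:21 | Memory ECC Corr Err | Correctable ECC | Asserted
"""

per_day, uncorrectable = summarize_ecc(SAMPLE)
print(dict(per_day))   # correctable events per date
print(uncorrectable)   # any entry here calls for replacing the stick
```

A rising per-day count of correctable events is the signal to watch; what count justifies a replacement is a site-specific decision that this sketch does not make.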
We can write a Nagios plugin that monitors the SEL and alerts us whenever an event is logged. This can really help us act on failing hardware before it takes a host down.
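A minimal sketch of such a plugin in Python might look like the following. The exit codes 0/1/2/3 are the standard Nagios plugin return codes; everything else is an assumption made for illustration: the names check_sel and main are hypothetical, and so is the policy of treating events that mention 'Critical' or 'Uncorrectable' as critical and any other logged event as a warning.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_sel(sel_text):
    """Map SEL contents to an (exit_code, message) pair.

    Hypothetical policy: any event mentioning 'Critical' or
    'Uncorrectable' is CRITICAL, any other logged event is WARNING.
    """
    # SEL entries are the pipe-separated lines; everything else is ignored.
    events = [line.strip() for line in sel_text.splitlines() if "|" in line]
    if not events:
        return OK, "SEL OK - no events logged"
    severe = [e for e in events if "Critical" in e or "Uncorrectable" in e]
    if severe:
        return CRITICAL, f"SEL CRITICAL - {len(severe)} severe event(s), last: {severe[-1]}"
    return WARNING, f"SEL WARNING - {len(events)} event(s), last: {events[-1]}"

def main():
    """Run 'ipmitool sel list', print one status line, return the exit code."""
    try:
        output = subprocess.check_output(["ipmitool", "sel", "list"], text=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"SEL UNKNOWN - could not run ipmitool: {exc}")
        return UNKNOWN
    code, message = check_sel(output)
    print(message)
    return code
```

A real plugin script would end with sys.exit(main()); that call is left out here so that check_sel can be tried on saved SEL output without access to a BMC.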
About SMART:
S.M.A.R.T stands for Self-Monitoring, Analysis and Reporting Technology. It is the industry-standard reliability prediction indicator for both IDE/ATA and SCSI hard disk drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests. S.M.A.R.T enabled hard drives maintain a set of attributes and set threshold values which the attributes should not cross under normal operation. If an attribute crosses its threshold, there is some issue and we need to take a look at it.
Using S.M.A.R.T Monitoring tools
Smartmontools is a SMART monitoring toolkit that supports ATA/ATAPI/SATA and SCSI disks. The smartmontools package contains two utility programs, smartctl and smartd, to control and monitor storage systems that have S.M.A.R.T built in. smartd is a daemon that monitors the S.M.A.R.T system built into the hard drives; its main configuration file is /etc/smartd.conf. By default smartd polls the S.M.A.R.T enabled hard drives every 30 minutes and logs S.M.A.R.T related errors to /var/log/messages. smartctl is a command-line utility to control and monitor S.M.A.R.T enabled disk drives.
Examples of using smartmontools
$$ Knowing the health of a disk drive:
-bash-3.2$ sudo smartctl -H /dev/ad0
smartctl version 5.36 [x86_64-unknown-freebsd6.1] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
We can see that the SMART health check went fine for this disk.
$$ SMART health check on a disk drive which is failing:
-bash-3.2$ sudo smartctl -H /dev/ad0
smartctl version 5.36 [x86_64-unknown-freebsd6.1] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   036    Pre-fail  Always   FAILING_NOW 2059
196 Reallocated_Event_Count 0x0033   001   001   036    Pre-fail  Always   FAILING_NOW 2059
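The failed attributes above are the ones whose normalized VALUE has dropped to or below THRESH. A minimal Python sketch of that comparison follows; it assumes the ten-column attribute rows printed by smartctl as shown above, and the function name failing_attributes is hypothetical.

```python
def failing_attributes(smartctl_text):
    """Return (name, value, threshold) for attributes at or below threshold.

    Assumes attribute rows in the ten-column layout shown above:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
    """
    failing = []
    for line in smartctl_text.splitlines():
        cols = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(cols) < 10 or not cols[0].isdigit():
            continue
        if not (cols[3].isdigit() and cols[5].isdigit()):
            continue
        name, value, thresh = cols[1], int(cols[3]), int(cols[5])
        # An attribute has failed when its normalized value <= its threshold.
        if value <= thresh:
            failing.append((name, value, thresh))
    return failing

# Rows copied from the failing disk above.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 001 001 036 Pre-fail Always FAILING_NOW 2059
196 Reallocated_Event_Count 0x0033 001 001 036 Pre-fail Always FAILING_NOW 2059
"""

for name, value, thresh in failing_attributes(SAMPLE):
    print(f"{name}: value {value} <= threshold {thresh}")
```

Both rows are reported because their normalized value of 1 is below the threshold of 36.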
$$ Checking the disk for bad sectors:
The SMART health check shown in the first example may report PASSED even if there are some bad sectors on the disk. To identify bad sectors we can run a SMART self-test, either short or long. The short self-test usually finishes in under ten minutes. The long self-test is a longer and more thorough version of the short self-test. Let's see how we can run a self-test.
-bash-3.00$ sudo smartctl -t short /dev/sda
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Jul 18 10:58:24 2011
Use smartctl -X to abort test.
We can see that the test will finish after two minutes. The results are reported in the Self Test Error Log, readable with the '-l selftest' option as shown below.
$ sudo smartctl -l selftest /dev/sda
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed: read failure  50%        7780             165497
# 2  Extended offline  Completed: read failure  90%        7779             165498
We can see from the results above that there are read failures. It is better to get this disk replaced.
Let's see the same results on a host where no bad sectors can be seen.
-bash-3.00$ sudo smartctl -l selftest /dev/sda
Password:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 4  Short offline     Completed without error  00%        40663            -
# 5  Extended offline  Completed without error  00%        0                -
# 6  Short offline     Completed without error  00%        0                -
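Telling the two self-test logs above apart can be automated as well. The sketch below is an illustration rather than a complete parser: it assumes that log rows start with '#' and that a failed test carries the word 'failure' in its Status column, as in 'Completed: read failure' above. The function name read_failures is hypothetical.

```python
def read_failures(log_text):
    """Return (row, first_error_lba) for self-test rows reporting a failure.

    Assumes rows start with '#' and that failed tests contain the word
    'failure' in the Status column; the last column is LBA_of_first_error.
    """
    failures = []
    for line in log_text.splitlines():
        row = line.strip()
        if row.startswith("#") and "failure" in row:
            failures.append((row, row.split()[-1]))
    return failures

# Rows copied from the two self-test logs above.
BAD_LOG = """\
# 1 Short offline Completed: read failure 50% 7780 165497
# 2 Extended offline Completed: read failure 90% 7779 165498
"""
GOOD_LOG = """\
# 4 Short offline Completed without error 00% 40663 -
# 5 Extended offline Completed without error 00% 0 -
"""

for row, lba in read_failures(BAD_LOG):
    print("failed self-test, first error at LBA", lba)
print("failures on healthy disk:", len(read_failures(GOOD_LOG)))
```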