Using lsof to identify deleted files:

Background:

Today's post is about another interesting problem that we saw last week on our production servers. These servers mainly read data from one group of hosts and forward it to another batch of hosts. We were receiving disk space alerts for the /home partition filling up on some of them. On checking these hosts, we could see that df was reporting 90% utilisation for /home. We then used du to identify which files and directories were consuming the space, and the results surprised us: du accounted for only 42G of the 117G that df reported as used. So is df lying, or is du? In the next section we will see how we tracked down the issue and found where the rest of the disk space had gone.
Here are the disk usage results:

$ df -h /home/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sys-home 137G 117G 14G 90% /home
---
$ sudo du -sh /home/
Password:
42G /home/
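
To put a number on the mismatch, the gap between the two tools can be computed directly. This is a minimal sketch rather than anything from the original investigation: the helper names df_used_kb, du_total_kb and gap_gb are made up for illustration, and it assumes a df that accepts the POSIX -P and -k flags.

```shell
#!/bin/sh
# Read `df -Pk <mount>` output on stdin and print the Used column (in KB).
# -P keeps each filesystem on one line, even with long device names.
df_used_kb() {
    awk 'NR == 2 { print $3 }'
}

# Read `du -sk <dir>` output on stdin and print the total (in KB).
du_total_kb() {
    awk '{ print $1; exit }'
}

# gap_gb DF_KB DU_KB: print the difference in whole GB (integer division).
gap_gb() {
    echo $(( ($1 - $2) / 1024 / 1024 ))
}

# Example on a live host (du needs root to read every directory):
#   used=$(df -Pk /home | df_used_kb)
#   seen=$(sudo du -skx /home | du_total_kb)
#   gap_gb "$used" "$seen"
```

With the figures above (117G used according to df, 42G counted by du), gap_gb reports 75, that is, about 75G that no file in the directory tree accounts for.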


Using lsof to identify processes holding deleted files:

Initially we suspected that some process was not releasing disk space, so we turned to 'lsof', the utility that lists open files. We will not go into the details of the lsof command here; instead we will look at the 'L' option, which enables ('+') or disables ('-') the listing of file link counts. When +L is followed by a number, only files with a link count less than that number are listed. A specification of the form '+L1' therefore shows open files with a link count of zero, that is, files that have been unlinked (deleted) from the file system but are still held open by a process. The space they occupy is not freed until the last open descriptor is closed, which is why df counts it while du, which walks the directory tree, does not. Below is the result of running 'lsof' on one of the affected servers.

$ sudo /usr/sbin/lsof +L1 | grep deleted
Password:
COMMAND PID  USER  FD TYPE DEVICE SIZE/OFF NLINK NODE NAME


forwdata 17967 nobody 1w REG 253,1 0 0 9158660 /home/app/logs/forwdata_dbqueue.out-2009-09-27-10:00:35 (deleted)
forwdata 17967 nobody 2w REG 253,1 0 0 9158660 /home/app/logs/forwdata_dbqueue.out-2009-09-27-10:00:35 (deleted)
forwdata 18086 nobody 1w REG 253,1 549978112 0 9158661 /home/app/logs/forwdata_geo.out-2009-09-27-10:00:35 (deleted)
forwdata 18086 nobody 2w REG 253,1 549978112 0 9158661 /home/app/logs/forwdata_geo.out-2009-09-27-10:00:35 (deleted)
forwdata 18136 nobody 1w REG 253,1 41849077626 0 9158659 /home/app/logs/forwdata_bcookie.out-2009-09-27-10:00:35 (deleted)
forwdata 18136 nobody 2w REG 253,1 41849077626 0 9158659 /home/app/logs/forwdata_bcookie.out-2009-09-27-10:00:35 (deleted)
forwdata 18136 nobody 14ur REG 253,0 0 0 229420 /tmp/.ylock-named/__home__app__var__scoreuser.txt (deleted)


The output shows several files that have been deleted (marked '(deleted)') but are still held open by their processes. Process 18136, for example, is holding a deleted log file of 41849077626 bytes, or roughly 39G. Deleted-but-open files like this one are what was consuming the space on /home that du could not see. Restarting the process should release the held space. We did that and, phew, got our space back.
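
Eyeballing the SIZE/OFF column works for a handful of rows, but the totals can also be added up per process. The sketch below assumes the column layout shown in the output above (PID in column 2, TYPE in 5, DEVICE in 6, SIZE/OFF in 7, NODE in 9); deleted_bytes_by_pid is a hypothetical helper name, not part of lsof. Each file is counted once even when it is open on several descriptors, as stdout and stderr are here.

```shell
#!/bin/sh
# Read `lsof +L1` output on stdin and print "PID BYTES" for each process,
# largest first. Only regular files marked "(deleted)" are counted.
deleted_bytes_by_pid() {
    awk '
        $5 == "REG" && /\(deleted\)$/ {
            key = $6 ":" $9            # device + inode identifies one file
            if (!(key in seen)) {      # a file shared by two PIDs is
                seen[key] = 1          # attributed to the first one listed
                bytes[$2] += $7        # offsets shown as "0t..." count as 0
            }
        }
        END {
            # %.0f avoids integer overflow in awks with 32-bit %d
            for (pid in bytes) printf "%s %.0f\n", pid, bytes[pid]
        }
    ' | sort -k2,2nr
}

# Example on a live host:
#   sudo /usr/sbin/lsof +L1 | deleted_bytes_by_pid
```

Fed the output above, this lists process 18136 first with 41849077626 bytes, followed by 18086 with 549978112 bytes.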

Conclusion:

lsof is one of the Unix power tools and a useful utility for sysadmins. It has a lot of options, so learning them all takes time, but the important ones are worth knowing. Having identified deleted files with lsof, we can also restore an important file that was accidentally deleted, as long as the process has not yet released it. In another post, I will write a tutorial on using the `lsof` command.
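
As an illustration of that recovery, here is a minimal sketch that assumes a Linux host, where each open descriptor is reachable as /proc/PID/fd/FD. This mechanism is an assumption on my part and is not something lsof itself provides; recover_open_file is a hypothetical helper name. The PID and FD come from the lsof output, with the mode letters stripped from the FD column (for example, 14ur becomes 14).

```shell
#!/bin/sh
# recover_open_file PID FD DEST
# Copy the contents of a file that PID still holds open on descriptor FD
# to DEST. Works only while the process keeps the descriptor open.
recover_open_file() {
    pid=$1 fd=$2 dest=$3
    src="/proc/$pid/fd/$fd"
    [ -e "$src" ] || { echo "no such descriptor: $src" >&2; return 1; }
    cp -- "$src" "$dest"
}

# Example (usually needs root for another user's process; the destination
# path is arbitrary):
#   recover_open_file 18136 14 /tmp/recovered_scoreuser.txt
```

If the process is still writing to the file, the copy is only a snapshot taken at the moment cp runs.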
