How to analyze system calls when your disk is read-only and strace outputs "Bus error"?

1

We have a hardware problem with the disks that caused all the mount points to become read-only. Output of dmesg:

end_request: I/O error, dev sda, sector 15574609
sd 0:0:0:0: SCSI error: return code = 0x00040000

We want to analyze a program that is still running, because it should have died when it could no longer write to the file system. So we would like to use strace to debug its system calls.

But the output of strace is:

Bus error

It seems that some resources are unavailable to the machine, or that there is some low-level error. I am stuck on how to analyze the program before the sysadmins repair the disk.

unix
io
debug
strace
asked on Server Fault Oct 5, 2011 by ompemi

2 Answers

1

Your disk is (probably, in fact almost certainly) dying. It sounds like your sysadmins have already reached this conclusion.
Prepare for the funeral by dressing your backups in black and performing a restore test.


Re: the bus error - this should have been immediately lethal to the program in question. It's the signal equivalent of "WTF? That's unpossible!" (See this SO question - they're talking about memory, but the same thing can happen with disks, or any addressable component). I don't recall if you can catch SIGBUS, but if your program is doing so it shouldn't.
Further questions on how to trace/debug your software should really be asked over on StackOverflow or Programmers.
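
On the "can you catch SIGBUS" point: SIGBUS can be caught (only SIGKILL and SIGSTOP cannot), and its default action is to terminate the process. Below is a minimal Python sketch of both cases, assuming a POSIX system. The signal is sent with kill rather than produced by a real hardware fault, and the names run_child and on_sigbus are made up for illustration.

```python
import signal
import subprocess
import sys

# Child that sends itself SIGBUS without installing a handler.
UNHANDLED = "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"

# Child that installs a handler first (on_sigbus is a hypothetical name).
HANDLED = """
import os, signal, sys

def on_sigbus(signum, frame):
    print("caught SIGBUS")
    sys.exit(3)

signal.signal(signal.SIGBUS, on_sigbus)
os.kill(os.getpid(), signal.SIGBUS)
"""

def run_child(code):
    """Run code in a fresh interpreter; return (exit status, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return proc.returncode, proc.stdout.strip()

if __name__ == "__main__":
    # A negative exit status means the child was killed by that signal number.
    print(run_child(UNHANDLED))
    print(run_child(HANDLED))
```

A process that survives a real SIGBUS this way is running on borrowed time, since the fault that raised the signal is still there when the handler returns.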

answered on Server Fault Oct 5, 2011 by voretaq7 • edited May 23, 2017 by Community
1

Sounds like your system can't even load the utilities/libraries needed to do the tracing.
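
One plausible mechanism, offered as an assumption rather than a diagnosis: executables and shared libraries are mapped from disk, and a page fault on a file-backed mapping that cannot be satisfied is reported to the process as SIGBUS. The Python sketch below reproduces that class of failure on a healthy machine by truncating a file underneath its own mapping. The truncation only stands in for the failed disk read, and child_exit_status is a hypothetical helper name.

```python
import mmap
import signal
import subprocess
import sys
import tempfile

# Child: map one page of a file, shrink the file to zero, then touch the page.
CHILD = """
import mmap, os, sys
fd = os.open(sys.argv[1], os.O_RDWR)
view = mmap.mmap(fd, mmap.PAGESIZE)  # file-backed mapping of one page
os.ftruncate(fd, 0)                  # the backing data disappears
view[0]                              # this page fault cannot be satisfied
"""

def child_exit_status():
    """Return the child's exit status; negative means killed by a signal."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * mmap.PAGESIZE)
        tmp.flush()
        proc = subprocess.run([sys.executable, "-c", CHILD, tmp.name])
        return proc.returncode

if __name__ == "__main__":
    status = child_exit_status()
    print("killed by SIGBUS:", status == -signal.SIGBUS)
```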

The correct thing to do here is:

  • repair the disk (i.e. restore from backup, etc)
  • get the system back up in an optimal state
  • properly test your program in a controlled manner (by making the filesystem read-only at the right time)
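
The last step can be approximated without touching a real mount. This is a rough Python sketch that injects the error into the write path instead of remounting anything; on a disposable test machine, remounting with mount -o remount,ro is the more faithful test. append_record and simulate_failure are hypothetical names, and the path is a placeholder that is never opened.

```python
import errno
import os
from unittest import mock

def append_record(path, data):
    """Append data and force it to disk; return 0, or the errno on failure."""
    try:
        with open(path, "ab") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
    except OSError as exc:
        return exc.errno
    return 0

def simulate_failure(err):
    """Make open() fail with the given errno, as a read-only mount would."""
    failure = OSError(err, os.strerror(err))
    with mock.patch("builtins.open", side_effect=failure):
        return append_record("/hypothetical/app.log", b"record\n")

if __name__ == "__main__":
    print(simulate_failure(errno.EROFS) == errno.EROFS)  # read-only filesystem
    print(simulate_failure(errno.EIO) == errno.EIO)      # low-level I/O error
```

A program that is supposed to die when it cannot write would check this result and exit, and the same injection lets a test confirm that it does.
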
answered on Server Fault Oct 5, 2011 by MikeyB

User contributions licensed under CC BY-SA 3.0