Troubleshooting a mysteriously unstable machine

6

I have a machine with a Core i7 CPU, 12 GiB of memory, 4 hard drives and a graphics card/sound card (both add-in PCI-E). This machine is somehow unstable, and I'm wondering how to troubleshoot the remaining issues.

Originally, the machine had an ASUS P6T SE mainboard and an 8800GT, running off a 700 W PSU, with an LG DVD drive and 3 hard drives. Right after I built it, the RAM turned out to be faulty, so it got RMA'd. The sound card is a Creative X-Fi UAA. The first problem was the 8800GT breaking down, but that was easily solved by buying a new card. However, the machine would sometimes BSOD, usually at idle rather than under system load, although it BSODed once under load as well. Suspecting the RAM, I ran memcheck overnight and no issues were found. Everything was working fine most of the time.

Some months later (it would BSOD about once a month) the hard drive broke down: a classic head crash. I replaced the hard drive and restored the OS and data from backup. I then switched the disk configuration to a single system drive, 2 disks in RAID0 and one disk for backup.

A few months later, the system started to BSOD more often (three times a day while nearly idle, i.e. web browsing, RDP). Interestingly, the machine has a WLAN USB stick and it would sometimes BSOD when I started many downloads simultaneously. Once the machine started BSODing this often, I assumed that the mainboard might be faulty, as the disk drives didn't report any problems, the graphics card had only recently broken down and been replaced, and an additional memcheck showed no errors. The original BSODs all had some message and not just a stop error code (for instance, I got 0x00000116 (0xfffffa800a546010, 0xfffff8801020907c, 0x0000000000000000, 0x000000000000000d) or 0x0000003b (0x00000000c0000005, 0xfffff8800138e4c7, 0xfffff8800b96c550, 0x0000000000000000)).

I replaced the mainboard with a different one, and the machine would now suddenly turn off. This led me to the conclusion that the PSU might be faulty, so I tested with a different one. The different PSU had a cable that was too short to reach the DVD drive, so the drive was left disconnected. With the different PSU (500 W), things were working rock-solid. I put the original 700 W PSU back in, connected it to the DVD drive, and the machine would turn off again. I removed the DVD drive and tested it in a different machine, and indeed, the DVD drive was faulty. With the DVD drive removed, the machine was running stable again.

A few weeks later, during gaming, the machine BSODed with Stop Error 1E without any further information. After rebooting, everything worked fine. On the same day, I wanted to run the backup, and it failed with error 0x80070570 (files corrupted). I ran chkdsk, and indeed, on my primary system drive some index ($SSI?) or so was broken; 9 files got deleted, and after that the backup completed. In order to check the drives, I ran three instances of HD Tune concurrently, and the machine BSODed again with 1E (0x0000001e (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)). Hoping that one of the drives was faulty, I ran HD Tune on each drive sequentially overnight, and no error occurred. The machine didn't BSOD, and is running fine again. sfc /scannow also indicated that no system files are broken.
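For reference, the backup error 0x80070570 can be taken apart by hand. The following is a minimal sketch, assuming the standard HRESULT bit layout (failure bit, 11-bit facility, 16-bit code); the function name `decode_hresult` is made up for illustration. Facility 7 means the code is a wrapped Win32 error, and Win32 error 1392 is ERROR_FILE_CORRUPT ("The file or directory is corrupted and unreadable"), which matches what chkdsk found.

```python
FACILITY_WIN32 = 7          # HRESULT wraps a plain Win32 error code
ERROR_FILE_CORRUPT = 1392   # 0x570


def decode_hresult(hr):
    """Split a 32-bit HRESULT into (failed, facility, code)."""
    failed = bool((hr >> 31) & 1)    # top bit set means failure
    facility = (hr >> 16) & 0x7FF    # 11-bit facility field
    code = hr & 0xFFFF               # low 16 bits: the error code itself
    return failed, facility, code


failed, facility, code = decode_hresult(0x80070570)
print(failed, facility, code)
```

This only confirms that the backup tripped over file system corruption; it says nothing about whether the disk, the RAM or something else caused the corruption.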

As this machine has had nearly everything replaced (hard drive, graphics card, memory, motherboard, PSU) or removed (DVD drive), do you have any ideas how to troubleshoot what the heck is going on? The weirdest thing is that it now works fine under extreme load for hours straight, but I still had those two failures over the weekend (both under load, interestingly). Each part in isolation seems to work fine, but the combination somehow causes problems. I'm totally lost as to where to troubleshoot, as every time I try to check something, the pesky thing just works fine.

Update: Just got another BSOD (1E), while reading a web site. I got the screen where a memory dump was created, progress bar going up to 100%, but after the reboot, Windows is not aware that the machine crashed. The reliability log does not show a crash. However, looking into the Minidump folder I dug out the minidump from the weekend, and the call stack has a HIDPARSE in it. Can a USB keyboard (or USB mouse) produce a bluescreen?

Update2: I replaced all hard-drive cables and reinstalled Windows. Reinstall worked fine, installing applications for 6 hours straight as well. When turning off, I got a stop error 24. I'm suspecting the primary hard drive to be unreliable (Samsung HD103SJ), as I don't see what else could be causing the problems. HDTune and chkdsk however report that the drive is OK.
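To keep track of the different stop codes quoted above, a small lookup helps. This is a minimal sketch: the names are the ones Microsoft's bug check reference gives for these four codes, while the helper names `parse_stop` and `describe` are made up for illustration. For 0x3B, the first parameter is the exception code, and 0xC0000005 is an access violation.

```python
import re

# Only the codes that appear in this question are listed.
BUGCHECK_NAMES = {
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x24: "NTFS_FILE_SYSTEM",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x116: "VIDEO_TDR_FAILURE",
}

# Matches text such as "0x0000003b (0x..., 0x..., 0x..., 0x...)".
STOP_RE = re.compile(r"(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def parse_stop(text):
    """Return (code, [parameters]) parsed from a stop error string."""
    match = STOP_RE.search(text)
    if match is None:
        raise ValueError("no stop code found")
    code = int(match.group(1), 16)
    params = [int(p, 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


def describe(text):
    """Return a short label such as '0x3B SYSTEM_SERVICE_EXCEPTION'."""
    code, _params = parse_stop(text)
    return f"0x{code:X} {BUGCHECK_NAMES.get(code, 'UNKNOWN')}"


print(describe("0x00000116 (0xfffffa800a546010, 0xfffff8801020907c, "
               "0x0000000000000000, 0x000000000000000d)"))
```

Seen side by side, the codes point at a video driver timeout (0x116), generic kernel-mode exceptions (0x1E, 0x3B) and the NTFS driver (0x24). Such a spread across unrelated subsystems is more typical of a hardware fault than of a single bad driver, although the names alone do not prove that.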

windows
troubleshooting
bsod
hardware-failure
asked on Super User Apr 19, 2011 by Anteru • edited May 27, 2011 by Peter Mortensen

4 Answers

2

When this happens I try to exclude the software as well. Could be a hardware/software combination.

What happens if you boot a live Linux CD (Knoppix, Ubuntu or whatever)? Is the system able to run Linux for an extended period of time without failure? If so, you may have a software problem.

Alternatively, you could try to boot Windows in fail-safe mode (does it still exist in Windows 7? I am a Linux guy myself).

OK, those are just a few suggestions to narrow down the possible causes. Far too often I've found unstable systems to be caused by software or misconfiguration rather than actual hardware problems.

Good luck!

answered on Super User Apr 20, 2011 by Anders Hansson
1

This sounds like a heat problem to me. Did you overclock the chip? You may want to use something like http://www.techpowerup.com/realtemp/ to see how hot it is getting. You may just need a better heat sink and cooling system.

answered on Super User Apr 19, 2011 by N4TKD
1

I have had similar problems with my own computers and others that I have fixed in the past. In more or less all cases where I have had similar behaviour to your system (lots of strange, seemingly unconnected problems), it has been due to one of the following two problems:

Bad power supply

Either the PSU has been putting out fluctuating voltage, or the power supplied from the grid has fluctuated. Nowadays I never buy cheap PSUs, since I know how hard it can be to diagnose these kinds of problems. The wattage of a PSU is no guarantee that it is good, since it might still deliver fluctuating power (and stability is usually what matters). Try running some kind of monitoring program that can display the motherboard voltages on your computer (SpeedFan, for instance) and check whether they are stable and close to the nominal values. If possible, try using a UPS so that you don't get any voltage fluctuations from the grid. A bad power supply also has a tendency to damage other components in the computer, which makes it even harder to debug.
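As a rough illustration of "close to the nominal values": the ATX specification allows the +12 V, +5 V and +3.3 V rails to deviate by 5% from nominal. The sketch below assumes you have noted down readings from your monitoring program by hand; the function name `out_of_spec` and the sample readings are hypothetical, and no particular log format is assumed.

```python
# Nominal voltages of the positive ATX rails; tolerance is +/-5%.
RAILS = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}
TOLERANCE = 0.05


def out_of_spec(samples):
    """Return the (rail, volts) readings that are more than 5% off nominal.

    samples: iterable of (rail_name, volts) pairs.
    """
    bad = []
    for rail, volts in samples:
        nominal = RAILS[rail]
        if abs(volts - nominal) > nominal * TOLERANCE:
            bad.append((rail, volts))
    return bad


# Hypothetical readings, typed in by hand from a monitoring tool.
readings = [("+12V", 12.10), ("+12V", 11.20), ("+5V", 5.02), ("+3.3V", 3.31)]
print(out_of_spec(readings))
```

Keep in mind that motherboard sensors are not very accurate, so a reading slightly out of range is a hint rather than proof; a multimeter on the PSU connectors is more trustworthy.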

Using RAM that is not recommended by manufacturer

Some motherboards are extremely choosy when it comes to RAM. Check with your motherboard manufacturer; they usually give very detailed recommendations on what to use (brand, size, serial-number). I have had this trouble even on a pre-assembled computer, where the people who assembled it apparently did not check this, since the RAM in it was listed as 'Not recommended'. It took me quite some time to figure this out. For some reason, memory checks do not always detect this.

answered on Super User Apr 20, 2011 by Leo
0

Turned out to be bad RAM + HDD. The original RAM was specified at 1.65 V (6 sticks), and even though 4-5 passes of memtest would run fine, the BSODs disappeared once I switched to 1.5 V RAM (3 sticks).

The hard drive was also broken, but replacing the hard drive just reduced the number of different stop codes.

answered on Super User May 29, 2011 by Anteru

User contributions licensed under CC BY-SA 3.0