Occasional hard lock (Linux)



Myth
11-03-2009, 10:35 PM
Just been noticing lately that the computer will hard lock and become unresponsive.
The mouse light stays on (USB optical mouse), but there are no keyboard lights, and the screen shows whatever was on it when it locked. Killing X is not possible as the keyboard is unresponsive.

Specs:
Asus M2N-VM-DVI
CPU: AMD 4800+ x2
2x 1GB Kingston ValueRAM
Seagate SATA harddrives
(above parts just over a year old)

Enermax 425 PSU (maybe 2 years old)

Integrated graphics, I don't have a PCIe card to play with
OS: Linux (Gentoo)

I haven't run memtest or anything else yet. Just seeing if the Linux peoples would have a clue what to check.

Incidentally, has locked in both twm and kde in the last 24 hours.
Has locked before today, just not twice in 24 hours

Speedy Gonzales
11-03-2009, 10:49 PM
What BIOS version is on it? It's up to 0905 as of 9/01/09.

Erayd
11-03-2009, 11:10 PM
Speedy, the bios version won't make any difference - if it was running fine before on whatever version it currently has, then it obviously isn't an issue.

First thing I'd check there would be the memory (i.e. memtest). If the memory's fine, then make sure you are saving the kernel log someplace - it should have some interesting stuff in there about why it hung.

Do the same for the X log - it's entirely possible that it's X that is hanging, not the kernel.

If you have a custom kernel, make sure you haven't turned off all the debugging options. You could also use the software watchdog to help figure out what's happening (or even better a hardware watchdog, if your motherboard has one).

Speedy Gonzales
11-03-2009, 11:13 PM
Does Linux have / use UUID?

If it does, then it will make a difference, as the changelog for the latest version says:

"Fix UUID might got lost."

Erayd
11-03-2009, 11:15 PM
Speedy, what are you on about? UUID is a *very* generic acronym, can you clarify whatever feature you're trying to describe here?

Speedy Gonzales
11-03-2009, 11:16 PM
Well if I knew I would tell you.

That's all it says on the ASUS site for the BIOS.

Chilling_Silence
11-03-2009, 11:19 PM
It could be a number of things; these would be my 3 main guesses:
1) Memory error
2) CPU Overheating
3) Other software issue, potentially driver-related?

Also, if possible, and if it's happening semi-frequently, log in from another box via ssh. When X hangs, see whether that ssh session still works, or whether it froze too :)
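
If you'd rather not sit watching a terminal, here's a rough sketch of the same check in Python, assuming Python is installed on the second box. It just tries to open a TCP connection to the ssh port. The function name and the command-line usage are made up for illustration. Note that a successful connect only shows the kernel's network stack is still answering; it doesn't prove sshd or anything else in userspace is healthy.

```python
import socket
import sys


def port_answers(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_box.py <ip-of-the-locked-box>
    if len(sys.argv) > 1:
        host = sys.argv[1]
        print(host, "answers on port 22:", port_answers(host))
```

If this still returns True while the screen is frozen, X (or the graphics driver) is the more likely culprit than a full kernel lock.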

Erayd
11-03-2009, 11:20 PM
Myth, Chill has a good point about overheating - you may want to investigate that.

Speedy wrote:
> Well if I knew I would tell you.
>
> That's all it says on the ASUS site for the BIOS.

Then please refrain from giving advice about stuff you don't understand - it merely confuses the issue.

For your information, UUID normally stands for 'Universally Unique IDentifier' - you may want to read the wikipedia page here (http://en.wikipedia.org/wiki/Universally_Unique_Identifier).

Speedy Gonzales
11-03-2009, 11:35 PM
I read it

What are you? a cop?

If I wanna say something, I will

Erayd
12-03-2009, 12:09 AM
I'm not going to get into a debate here - check your PMs.

Myth
12-03-2009, 06:15 AM
Lol @ you 2

Don't worry erayd, I know what you mean ;)

Anyway... regarding overheating: it shouldn't be. The heatsink is an XP90 with SilenX fan and AM2 bracket. Just checked that the fan is spinning, and the air from the fan is quite cool (though I'm not compiling anything right now). What CLI tools does Linux have to confirm this? (I don't have conky or gkrellm installed)

Will run memtest now.
Will let you know how that went after work

Jen
12-03-2009, 07:56 AM
I would suspect the memory as well, but it wouldn't hurt to check the hard drive too.

Do you have lm-sensors installed?

Chilling_Silence
12-03-2009, 10:39 AM
Yeah, grab the lm_sensors package, then run sensors-detect. It'll run through a scanning process and set up the relevant .conf files automatically. Then just run: sensors
:)
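
For a second opinion on what sensors prints, the kernel's hwmon drivers also expose raw readings as files under /sys/class/hwmon, in millidegrees Celsius. Below is a minimal sketch that reads them directly, assuming the sensor drivers are loaded. The function name is made up for illustration, and which file corresponds to the CPU versus the motherboard depends on the chip, so treat the output as raw data rather than a definitive CPU temperature.

```python
import glob
import os


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sensor_file_path: degrees_celsius} for every temp*_input found."""
    temps = {}
    # Depending on kernel version the files sit under hwmonN/ or hwmonN/device/.
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as f:
                    millidegrees = int(f.read().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable or returned junk; skip it
            temps[path] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    for path, celsius in read_hwmon_temps().items():
        print(f"{path}: {celsius:.1f} C")
```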

Brooko
12-03-2009, 12:34 PM
+1 to Chill's reply

Also, if you want hard disk temps, install hddtemp. GSmartControl is another package that's very good for checking disk health if you have a SMART-enabled drive.

Chilling_Silence
12-03-2009, 04:46 PM
Thanks for the heads up Brooko, I might install those on a few boxes tonight and give the apps a whirl :)

Myth
12-03-2009, 08:58 PM
Ok...
Memtest reports (after 10 passes) there are no errors
Smartmontools passed both SATA drives
hddtemp reports both drives at about 48-50 degrees C
sensors reports the cpu running at a cool 17 degrees C

I don't have another Linux box for ssh, so I might use the other XP box and throw something like PuTTY on it. And wait....

erayd, You mentioned some other things... keep an eye on your messenger tomorrow ;) (too knackered tonight)

Chilling_Silence
12-03-2009, 09:27 PM
Holy Santaclaus that CPU is cool!

HDD temp is pretty average ...

Running a stock kernel?

wainuitech
12-03-2009, 10:42 PM
This is a question as I know naff all about Linux, esp when it goes wrong - but does Linux have the equivalent of Windows Event Viewer?

One thing that wouldn't change no matter what the OS is - does the HDD activity light on the case stay on solid when it locks, meaning the HDD is trying to work but can't?

beeswax34
12-03-2009, 10:49 PM
How on Earth is your CPU running that cool?

Chilling_Silence
12-03-2009, 11:17 PM
Yeah it does, it's the file /var/log/messages

However ... when the kernel hard-locks I'm not sure it gets written to :-/

Erayd
12-03-2009, 11:34 PM
Normally if you get a kernel locking up for some reason, it will tell you why in whatever log you are sending kernel messages to (or if you aren't logging them, you'll see them in tty1 by default) - hence the stuff I was saying before.
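
As a rough illustration of what to look for once you have a saved log, the sketch below filters a log file for a few well-known kernel trouble markers. The marker list is only a sample I picked (it is not exhaustive), and the function name and log path are placeholders for whatever you actually use.

```python
import re
import sys

# A handful of common kernel trouble markers; an illustrative list only.
TROUBLE = re.compile(
    r"Oops|kernel BUG|Kernel panic|soft lockup|machine check", re.IGNORECASE
)


def suspicious_lines(log_text):
    """Return the log lines that match any of the trouble markers."""
    return [line for line in log_text.splitlines() if TROUBLE.search(line)]


if __name__ == "__main__":
    # Usage (hypothetical script name): python scan_log.py /var/log/messages
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as f:
            for line in suspicious_lines(f.read()):
                print(line)
```

An empty result doesn't mean the kernel was happy; the last messages may simply never have reached the disk.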

Myth, make sure you use a synchronous mount on your log filesystem - you'll lose a lot of performance, but it means that the final death throes of the system will be written to disk, rather than getting stuck in the write cache when the system hangs. I'll keep an eye on my IM accounts - you're welcome to bug me for help on this one. Flick me a PM if I'm not online.

You say you're running gentoo, so I assume you have a custom kernel - can you post the .config? Also which kernel version & arch are you running, and what patches (if any) have you applied other than the default gentoo ones? Are you using vanilla or gentoo sources?

Erayd
12-03-2009, 11:40 PM
> This is a question as I know naff all about Linux, esp when it goes wrong - but does Linux have the equivalent of Windows Event Viewer?

Yep - almost all Unix / Linux systems have logging that puts Windows to shame! There's far, far more detail than Event Viewer would ever give you.

> One thing that wouldn't change no matter what the OS is - does the HDD activity light on the case stay on solid when it locks, meaning the HDD is trying to work but can't?

Usually true, but this can also occur on Linux if one of the critical filesystems is full, or if the filesystem is doing some re-organisation. Linux filesystems tend to do a lot more than their Windows equivalents, much of it automatically.

Myth
13-03-2009, 06:22 AM
Ok.. .config is here (http://pastebin.com/me16ad9e)

Using gentoo kernel (gentoo-sources) no other patches applied
Kernel version is in .config
arch is amd64

synchronous mount?

CPU temp is probably not correct. I know when I was mucking around with conky a few years back there was a file with accurate temps that gets written to... just can't remember the file name (but I think that was mobo temps). I just did a quick install of lm_sensors, added sensor support into the kernel, and ran 'sensors' (after sensors-detect of course). So I'm unsure how accurate it is. But as I said, the air from the CPU fan is quite cool, with minimal warmth.

Had an idea yesterday... might emerge genkernel and use that, see what happens
(I'm no longer looking for uber-quick, I just want stable and I just like the Gentoo way)

Erayd
13-03-2009, 11:01 AM
> ...synchronous mount?

mount -o sync (or you can put sync as a mount option in fstab). It basically means that everything gets flushed to disk immediately instead of queueing up in the write buffer.
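
To confirm the option actually took effect after remounting, you can look at the options column in /proc/mounts. Here's a minimal sketch of that check; the function names are made up for illustration, and /var/log is only an assumed example of where the log filesystem might be mounted.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Line format: device mount_point fstype options dump pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry overrides an earlier one
    return options


def is_sync_mounted(mounts_text, mount_point):
    """True only if mount_point is mounted and carries the 'sync' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "sync" in options


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    # /var/log is only an example; use whichever mount point holds your logs.
    print("/var/log sync-mounted:", is_sync_mounted(text, "/var/log"))
```

If the logs live on the root filesystem rather than their own partition, check "/" instead, bearing in mind that sync on the whole root filesystem will slow everything down.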

Re your comments on genkernel - while it's still pretty good in terms of speed, it does have a substantially longer boot time, and I find the featureset / implementation of it annoys me (so I don't use it).

Erayd
13-03-2009, 11:27 AM
Comments on the kernel. It looks pretty good, although there are a few weird bits - these are what jumped out. Mostly just nitpicking, I didn't see anything particularly evil although it's impossible to say for sure without seeing your system.

Why is MCE (Machine Check Exception) disabled? Enable it, unless you have a very good reason not to. It should also help with diagnosing this problem.
You have AC / battery / dock ACPI options compiled in, but you're using a desktop???
CPUFreq is enabled, but you aren't using the PowerNow! interface
You are using PCMCIA on a desktop???
Are you a ham radio fan?
Your network driver config seems a bit wacky - lots of stuff enabled in there that I seriously doubt you use (e.g. token ring, 10G ethernet etc)
Why Intel i2c on an AMD platform?
Strange ALSA driver config (things enabled that shouldn't be)
No reiserfs support (sigh.... ext3 just doesn't cut it these days - go get a proper filesystem :p)
x86_64 AES support disabled
No PRNG
Virtualisation is enabled, but none of the options under it (e.g. KVM)
No custom version string
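
If you want to check points like these without scrolling through the whole file, something along these lines would do it. It's only a sketch: the function names are made up, and the option symbols in the usage note are the names as I recall them, so verify them against your own kernel tree before trusting the output.

```python
import re
import sys

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kernel_config(text):
    """Map option name -> value ('y', 'm', a string...) or None for 'is not set'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = SET_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            options[m.group(1)] = None
    return options


def report(text, names):
    """Look up each name; options missing from the file also come back as None."""
    opts = parse_kernel_config(text)
    return {name: opts.get(name) for name in names}


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_config.py /usr/src/linux/.config
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            wanted = ["CONFIG_X86_MCE", "CONFIG_PCMCIA", "CONFIG_HAMRADIO",
                      "CONFIG_REISERFS_FS", "CONFIG_KVM"]
            for name, value in report(f.read(), wanted).items():
                print(name, "=", value)
```

A value of None for CONFIG_X86_MCE would correspond to the first point above (MCE disabled).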