This is the strangest thing, I haven't ever had a problem like this before.
Just now I came back to my PC and neither Firefox nor mail reacted. I rebooted and the errors began, screens and screens of them. I don't have it all exactly, but it was pages and pages of the form "inode (or block) is xxx, should be xxx. Fixed", ending with "unexpected inconsistency" and the option to enter the root password (to fix things) or press control-d to reboot. It is repeatable: the system won't boot.
The first thing I did was to make a complete copy of everything from my old system, which boots fine (a W2K loafing around on sda1 also booted). I was pretty well backed up anyway, but now everything is on an external drive, because I wasn't sure exactly what had happened or whether there might be a problem with the drive itself. But the PC was literally just sitting there; I hadn't even been working on it.
What does this sound like? Is this fixable? I have good images, data is backed up, and even a fresh install (maybe after formatting?) is no big deal, but I would feel better if I knew what just happened, and if I could rule out that it is the drive itself, for example.
(probably solved) System crashes
Last edited by sojurn on 10. May 2013, 16:38, edited 5 times in total.
Re: Need help to understand a crash (inodes, blocks)
This is either a filesystem crash or a hard drive crash and in rare cases it might be a problem with your RAM.
What kind of filesystem do you use on that partition? I'm guessing it's probably ext3/ext4, because I've seen similar errors with those filesystems. And which partition is it? Is it your / partition or your /home partition or any other?
You can use fsck to check the filesystems and see if anything else comes up. If it was indeed a filesystem crash and it was your / partition, it may even be possible to bring everything back by reinstalling whatever was damaged, but it may be difficult to determine exactly what that was. In that case a reinstall might be the easier solution. You might also consider XFS or JFS as an alternative filesystem. In my experience they are much more reliable than ext3/ext4.
If it's not a filesystem crash, it's most probably a hard drive crash, which means that the hard drive should not be trusted anymore and should be replaced. What do smartmontools report for that hard drive? Make sure you do a long scan.
In the rare case that this was a RAM issue that somehow created problems while writing data on the harddrive, you should check your RAM too, by letting memtest run for at least a few hours.
Re: Need help to understand a crash (inodes, blocks)
Thanks for the quick reply Gapan.
I'm sure there is a shell command to dump this out in one fell swoop, but here is my disk (I have all systems in a single partition, probably not a good idea):
sda1   3 GB  W2K    FAT32
sda2   *
sda3  18 GB  data   FAT32
sda5   4 GB  Mint   Ext3
sda6   1 GB  swap
sda7   5 GB  Salix  Ext3
sda8   8 GB  Salix  Ext3 (duplicate test system)
The partitions are all maybe half to two thirds full, but I have been doing heavy work on them recently, cleaning up.
Salix 13.37 LXDE (both), the PC is a Sempron P5600 with 2GB memory.
And what has happened since then:
At first I thought that my test system also didn't boot, but I think I just pressed the wrong button on boot, because later it booted fine (the system on sda8). And W2K and Mint (my old system) also booted fine. So I was breathing easier already.
Booted into sda8, I unmounted sda7 and ran fsck on it. It found errors all over the place. I didn't write them down, but I was asked again and again if I wanted to clone and delete. I just kept saying yes, since my data is backed up anyway (and reinstalling Salix is so fast). And that evidently worked: I could boot into my normal system after that. I could barely believe it after all those errors.
I am still uneasy though about the whole affair and I will follow up on your suggestions. I am not familiar with smartmontools but I see it is installed so I will check that out. Ditto with the ram, and I will let them run a long time, do a thorough check, I think that is a good idea.
There have been no problems at all otherwise. The only occasional error I have is that sometimes when I boot the gamin (or fam) doesn't take for some reason and I am in some other desktop. I just log out and log back in and that always seems to work. Had been meaning to ask about that actually, but it is probably not related.
So things work and I am in my normal system again, but this unnerved me a bit and I would like to make sure everything is in order (I've found the smartmontools website in the meantime).
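On the "one fell swoop" question above: tools such as `fdisk -l` or `lsblk -f` normally print the partition layout directly. As a rough sketch in Python, assuming a Linux kernel that exposes `/proc/partitions` in its usual four-column layout (major, minor, block count in 1 KiB units, name), the sizes can also be read like this. The function name `parse_proc_partitions` and the sample text are made up for illustration and are not taken from this machine; filesystem types are not in that file, so they would still have to come from elsewhere.

```python
def parse_proc_partitions(text):
    """Return (name, size_in_GiB) pairs from /proc/partitions-style text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks (1 KiB units), name.
        # The header row and the blank line after it are skipped here.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        blocks = int(fields[2])
        rows.append((fields[3], round(blocks / (1024 * 1024), 2)))
    return rows


# Hypothetical sample, only shaped like the real file.
SAMPLE = """\
major minor  #blocks  name

   8        0   41943040 sda
   8        1    3145728 sda1
   8        7    5242880 sda7
"""

for name, size_gib in parse_proc_partitions(SAMPLE):
    print(f"{name:6} {size_gib:6.2f} GiB")

# On a real system the text would come from the file itself:
#     with open("/proc/partitions") as handle:
#         parse_proc_partitions(handle.read())
```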
Last edited by sojurn on 7. May 2013, 22:09, edited 1 time in total.
Re: Need help to understand a crash (inodes, blocks)
Found an easy walkthrough for smartmontools:
http://blog.shadypixel.com/monitoring-h ... tmontools/
Things are looking up, the long test is running (15 minutes):
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error  00%        20275            -
Results are in:
# 1  Extended offline   Completed without error  00%        20275            -
# 2  Short offline      Completed without error  00%        20275            -
# 3  Short offline      Completed without error  00%        20275            -
That looks good, thanks again. I will run a long memtest also. And consider changing file systems. I am using ext3 for the simple reason that I have always used it, so habit more than anything else.
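For anyone who wants to check a longer self-test log without reading it by eye, here is a minimal sketch in Python. It assumes the log lines look like the ones pasted above (entry number, test type ending in "offline" or "captive", status, remaining percent, lifetime hours, LBA of first error); other drives or smartmontools versions may format things differently. The names `parse_selftest_log` and `not_clean` are my own, not part of smartmontools.

```python
import re

# Assumed layout, based on the log lines pasted in this post:
# "# <num> <test type> <status> <remaining>% <lifetime hours> <LBA or ->"
LINE_RE = re.compile(
    r"^#\s*(\d+)\s+(.+?(?:offline|captive))\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)$"
)


def parse_selftest_log(text):
    """Return one dict per self-test entry found in the pasted log text."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # header line or anything else that is not an entry
        num, test, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_percent": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": None if lba == "-" else lba,
        })
    return entries


def not_clean(entries):
    """Entries whose status is anything other than a clean completion."""
    return [e for e in entries if e["status"] != "Completed without error"]


LOG = """\
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 20275 -
# 2 Short offline Completed without error 00% 20275 -
# 3 Short offline Completed without error 00% 20275 -
"""

entries = parse_selftest_log(LOG)
print(len(entries), "tests parsed,", len(not_clean(entries)), "not clean")
```

Note that a test still in progress would also show up in `not_clean`, so an entry there means "look closer", not necessarily "the drive is failing".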
INIT: Id 'X1' respawning too fast
gapan wrote:This is either a filesystem crash or a hard drive crash and in rare cases it might be a problem with your RAM.
...
In the rare case that this was a RAM issue that somehow created problems while writing data on the harddrive, you should check your RAM too, by letting memtest run for at least a few hours.
Here I thought I was in the clear, and now it just happened again. I noticed it in the same way: a program launched from the panel did not respond. I rebooted (with a sinking feeling this time) and got bad errors again. I tried the test Salix system, which then also crashed, and this time I am sure I didn't mistype. I haven't tried booting the other two systems on the disk.
The last error on the screen this time was "INIT: Id 'X1' respawning too fast: disabled for 5 minutes". I booted again and it got further, but then just gave a blank graphical screen.
The quick status is: the drive passed the smartmontools tests yesterday with no errors. Just now I booted from a live PartedMagic CD and ran a long memtest (std), interrupting it after about 45 minutes, again with no errors (I will let it run for a few hours tomorrow). I did a fsck on both Salix partitions: the main system it said was fine; with the second it started the check by saying there was an error, but then found none when it ran.
I am at a loss again, I've never had a PC go bad before. And I am not sure what to do now (except that I was thinking about upgrading PCs anyway, and this might speed things up). I can restore Salix from backup (but the crazy thing is I am in my main system again, it's working), or reformat everything and install fresh. I could swap video cards. I don't know, I am at a loss for ideas. What is the next thing to check?
I have literally everything backed up now to an external drive, just in case. And I think I will get my laptop up and running in the meantime, I only have a toy system on it now.
Re: Help locating the source of system crashes
This could potentially be awfully embarrassing. What it looks like now is that an external drive was causing the problem, and that once the "problems" began, the external drive became almost a permanent fixture as I rushed to get things backed up (the irony of that!!). I use device names to boot and in fstab, and should probably consider switching to UUIDs or labels. Assuming this was the cause, I hope I am not the only one in the universe ever to have made an error like this!
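Kernel device names like /dev/sda7 can shift when another drive is attached at boot, which is why UUID= or LABEL= entries are usually the safer choice. As a small sketch in Python, assuming a standard whitespace-separated fstab, this lists the entries that still rely on /dev/sdX or /dev/hdX names. The function name `device_name_entries` and the sample fstab are invented for illustration; the sample is not copied from my system. The UUIDs to put in their place can be looked up with `blkid`.

```python
import re

# Kernel-assigned names such as /dev/sda7 or /dev/hdb1 can change when
# another drive is present at boot; UUID= and LABEL= entries do not.
DEVICE_NAME_RE = re.compile(r"^/dev/(?:sd|hd)[a-z]+\d*$")


def device_name_entries(fstab_text):
    """Return (device, mount_point) pairs that use kernel device names."""
    found = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a usable fstab entry
        if DEVICE_NAME_RE.match(fields[0]):
            found.append((fields[0], fields[1]))
    return found


# Hypothetical fstab contents, made up for this example.
SAMPLE_FSTAB = """\
# hypothetical example, not taken from the thread
/dev/sda7   /      ext3  defaults  1 1
/dev/sda6   swap   swap  defaults  0 0
UUID=0a1b2c3d-1111-2222-3333-444455556666  /home  ext3  defaults  1 2
proc        /proc  proc  defaults  0 0
"""

for device, mount_point in device_name_entries(SAMPLE_FSTAB):
    print(f"{device} mounted on {mount_point} uses a device name")
```

The same concern applies to the root= entry in the boot loader configuration, which this sketch does not look at.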
Re: (probably solved) System crashes
Once burnt, twice shy 
Welcome to the club!
