failed to read root inode


failed to read root inode

Christian Affolter-2
Hi

After a disk crash within a hardware RAID-6 controller and kernel
freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume:

Filesystem "dm-13": Disabling barriers, not supported by the underlying
device
XFS mounting filesystem dm-13
Starting XFS recovery on filesystem: dm-13 (logdev: internal)
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file
fs/xfs/xfs_alloc.c.  Caller 0xffffffff8035c58d
Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1

Call Trace:
 [<ffffffff8035c58d>] xfs_free_extent+0xcd/0x110
 [<ffffffff8035aa03>] xfs_free_ag_extent+0x4e3/0x740
 [<ffffffff8035c58d>] xfs_free_extent+0xcd/0x110
 [<ffffffff80395e9d>] xlog_recover_process_efi+0x18d/0x1d0
 [<ffffffff80397400>] xlog_recover_process_efis+0x60/0xa0
 [<ffffffff80397463>] xlog_recover_finish+0x23/0xf0
 [<ffffffff8039d36a>] xfs_mountfs+0x4da/0x680
 [<ffffffff803a9ba8>] kmem_alloc+0x58/0x100
 [<ffffffff803a9cfb>] kmem_zalloc+0x2b/0x40
 [<ffffffff803a3dbd>] xfs_mount+0x36d/0x3a0
 [<ffffffff803b4bad>] xfs_fs_fill_super+0xbd/0x220
 [<ffffffff8028b551>] get_sb_bdev+0x141/0x180
 [<ffffffff803b4af0>] xfs_fs_fill_super+0x0/0x220
 [<ffffffff8028aeb6>] vfs_kern_mount+0x56/0xc0
 [<ffffffff8028af83>] do_kern_mount+0x53/0x110
 [<ffffffff802a343b>] do_new_mount+0x9b/0xe0
 [<ffffffff802a3666>] do_mount+0x1e6/0x220
 [<ffffffff802670f5>] __get_free_pages+0x15/0x60
 [<ffffffff802a373b>] sys_mount+0x9b/0x100
 [<ffffffff8020b49b>] system_call_after_swapgs+0x7b/0x80

Filesystem "dm-13": XFS internal error xfs_trans_cancel at line 1163 of
file fs/xfs/xfs_trans.c.  Caller 0xffffffff80395eb1
Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1

Call Trace:
 [<ffffffff80395eb1>] xlog_recover_process_efi+0x1a1/0x1d0
 [<ffffffff8039fd16>] xfs_trans_cancel+0x126/0x150
 [<ffffffff80395eb1>] xlog_recover_process_efi+0x1a1/0x1d0
 [<ffffffff80397400>] xlog_recover_process_efis+0x60/0xa0
 [<ffffffff80397463>] xlog_recover_finish+0x23/0xf0
 [<ffffffff8039d36a>] xfs_mountfs+0x4da/0x680
 [<ffffffff803a9ba8>] kmem_alloc+0x58/0x100
 [<ffffffff803a9cfb>] kmem_zalloc+0x2b/0x40
 [<ffffffff803a3dbd>] xfs_mount+0x36d/0x3a0
 [<ffffffff803b4bad>] xfs_fs_fill_super+0xbd/0x220
 [<ffffffff8028b551>] get_sb_bdev+0x141/0x180
 [<ffffffff803b4af0>] xfs_fs_fill_super+0x0/0x220
 [<ffffffff8028aeb6>] vfs_kern_mount+0x56/0xc0
 [<ffffffff8028af83>] do_kern_mount+0x53/0x110
 [<ffffffff802a343b>] do_new_mount+0x9b/0xe0
 [<ffffffff802a3666>] do_mount+0x1e6/0x220
 [<ffffffff802670f5>] __get_free_pages+0x15/0x60
 [<ffffffff802a373b>] sys_mount+0x9b/0x100
 [<ffffffff8020b49b>] system_call_after_swapgs+0x7b/0x80

xfs_force_shutdown(dm-13,0x8) called from line 1164 of file
fs/xfs/xfs_trans.c.  Return address = 0xffffffff8039fd2f
Filesystem "dm-13": Corruption of in-memory data detected.  Shutting
down filesystem: dm-13
Please umount the filesystem, and rectify the problem(s)
Failed to recover EFIs on filesystem: dm-13
XFS: log mount finish failed


I tried to repair the filesystem with the help of xfs_repair many times,
without any luck:
Filesystem "dm-13": Disabling barriers, not supported by the underlying
device
XFS mounting filesystem dm-13
XFS: failed to read root inode

xfs_check output:
cache_node_purge: refcount was 1, not zero (node=0x820010)
xfs_check: cannot read root inode (117)
cache_node_purge: refcount was 1, not zero (node=0x8226b0)
xfs_check: cannot read realtime bitmap inode (117)
block 0/8 expected type unknown got log
block 0/9 expected type unknown got log
block 0/10 expected type unknown got log
block 0/11 expected type unknown got log
bad magic number 0xfeed for inode 128
[...]

Are there any other ways to fix the unreadable root inode or to restore
the remaining data?


Environment information:
Linux Kernel: 2.6.26-gentoo (x86_64)
xfsprogs:     3.0.3

Attached you'll find the xfs_repair and xfs_check output.


Thanks in advance and kind regards
Christian


Attachments: xfs_repair.log (80K), xfs_check.log (47K)

Re: failed to read root inode

Eric Sandeen-3
Christian Affolter wrote:
> Hi
>
> After a disk crash within a hardware RAID-6 controller and kernel
> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume:

Are you sure the volume is reassembled correctly?  It seems like the
fs has a ton of damage ...

One trick I often recommend is to make a metadata image of the fs
with xfs_metadump / xfs_mdrestore and run repair on that to see
what repair -would- do, but I guess you've already run it on the
real fs.
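
Roughly, that workflow looks like this (the device and file names here
are just placeholders):

  xfs_metadump /dev/dm-13 /tmp/dm-13.metadump     # metadata only, no file data
  xfs_mdrestore /tmp/dm-13.metadump /tmp/dm-13.img
  xfs_repair -n -f /tmp/dm-13.img                 # dry run against the image
  xfs_repair -f /tmp/dm-13.img                    # real repair, still only on the image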

So if repair isn't making a mountable fs, first suggestion would
be to re-try with the latest version of repair.

> Filesystem "dm-13": Disabling barriers, not supported by the underlying
> device

Honestly, that could be part of the problem too, if a bunch of
disks with write caches all lost them, in the array.

> XFS mounting filesystem dm-13
> Starting XFS recovery on filesystem: dm-13 (logdev: internal)
> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file
> fs/xfs/xfs_alloc.c.  Caller 0xffffffff8035c58d
> Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1
...

> XFS: log mount finish failed

So recovery is failing, you could try mount -o ro,norecovery at this
point to see what's still left on the fs... but:
 
> I tried to repair the filesystem with the help of xfs_repair many times,
> without any luck:
> Filesystem "dm-13": Disabling barriers, not supported by the underlying
> device
> XFS mounting filesystem dm-13
> XFS: failed to read root inode

...

> xfs_check output:
> cache_node_purge: refcount was 1, not zero (node=0x820010)
> xfs_check: cannot read root inode (117)

That's a bit of an odd root inode number, I think, which
makes me think maybe there are still serious problems.
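
Checking what root inode number the superblock actually records might
help here, e.g. (device path is a placeholder):

  xfs_db -r -c "sb 0" -c "p rootino" /dev/dm-13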

> Are there any other ways to fix the unreadable root inode or to restore
> the remaining data?
>
>
> Environment information:
> Linux Kernel: 2.6.26-gentoo (x86_64)
> xfsprogs:     3.0.3

Those are both pretty old at this point, I can't say there is anything
specific in newer xfsprogs, but I'd probably give that a shot first.

-Eric


Re: failed to read root inode

Stan Hoeppner
In reply to this post by Christian Affolter-2
Christian Affolter put forth on 5/8/2010 7:34 AM:
> Hi
>
> After a disk crash within a hardware RAID-6 controller and kernel
> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume:

What storage management operation(s) were you performing when this crash
occurred?  Were you adding, deleting, shrinking, or growing an EVMS volume
when the "crash" occurred, or was the system just sitting idle with no load
when the crash occurred?

Why did the "crash" of a single disk in a hardware RAID6 cause a kernel
freeze?  What is your definition of "disk crash"?  A single physical disk
failure should not have caused this under any circumstances.  The RAID card
should have handled a single disk failure transparently.

Exactly which make/model is the RAID card?  What is the status of each of
the remaining disks attached to the card as reported by its BIOS?  What is
the status of the RAID6 volume as reported by the RAID card BIOS?  What is
the status of each of your EVMS volumes as reported by the EVMS UI?
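
If the Areca command line utility (cli32/cli64) is installed, its disk
and event listings are a good place to start; the subcommand names below
are from memory and may differ on your CLI/firmware version:

  cli64 disk info     # per-drive status as seen by the controller
  cli64 event info    # controller event log around the time of the failure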

I'm asking all of these questions because it seems rather clear that the
root cause of your problem lies at a layer well below the XFS filesystem.

You have two layers of physical disk abstraction below XFS:  a hardware
RAID6 and a software logical volume manager.  You've apparently suffered a
storage system hardware failure, according to your description.  You haven't
given any details of the current status of the hardware RAID, or of the
logical volumes, merely that XFS is having problems.  I think a "Well duh!"
is in order.

Please provide _detailed_ information from the RAID card BIOS and the EVMS
UI.  Even if the problem isn't XFS related I for one would be glad to assist
you in getting this fixed.  Right now we don't have enough information.  At
least I don't.

--
Stan


Re: failed to read root inode

Emmanuel Florac
On Sat, 08 May 2010 17:53:17 -0500, you wrote:

> Why did the "crash" of a single disk in a hardware RAID6 cause a
> kernel freeze?  What is your definition of "disk crash"?  A single
> physical disk failure should not have caused this under any
> circumstances.  The RAID card should have handled a single disk
> failure transparently.

The RAID array may go west if the disk isn't properly set up,
particularly if it's a desktop-class drive.

--
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    | <[hidden email]>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


Re: failed to read root inode

Christian Affolter-2
In reply to this post by Eric Sandeen-3
Hi Eric

Thanks for your answer.

>> After a disk crash within a hardware RAID-6 controller and kernel
>> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume:
>
> Are you sure the volume is reassembled correctly?  It seems like the
> fs has a ton of damage ...
>
> One trick I often recommend is to make a metadata image of the fs
> with xfs_metadump / xfs_mdrestore and run repair on that to see
> what repair -would- do, but I guess you've already run it on the
> real fs.

OK, I didn't know that. I actually cloned the failed volume using dd and
ran xfs_repair on the clone.
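
Roughly this, with placeholder paths:

  dd if=/dev/dm-13 of=/dev/mapper/dm-13-clone bs=1M conv=noerror,sync
  xfs_repair /dev/mapper/dm-13-clone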


> So if repair isn't making a mountable fs, first suggestion would
> be to re-try with the latest version of repair.

OK, I will try that. Unfortunately the latest upstream version isn't
included within the distribution package repository, so I will have to
compile it first.
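
I expect it's the usual tarball build, something like this (version is a
placeholder; I'll follow the package's README for the exact steps):

  tar xzf xfsprogs-<version>.tar.gz
  cd xfsprogs-<version>
  ./configure
  make
  make install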


>> Filesystem "dm-13": Disabling barriers, not supported by the underlying
>> device
>
> Honestly, that could be part of the problem too, if a bunch of
> disks with write caches all lost them, in the array.
>
>> XFS mounting filesystem dm-13
>> Starting XFS recovery on filesystem: dm-13 (logdev: internal)
>> XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1599 of file
>> fs/xfs/xfs_alloc.c.  Caller 0xffffffff8035c58d
>> Pid: 13473, comm: mount Not tainted 2.6.26-gentoo #1
> ...
>
>> XFS: log mount finish failed
>
> So recovery is failing, you could try mount -o ro,norecovery at this
> point to see what's still left on the fs... but:

This didn't make any difference; I'm still getting the same error message.


>> I tried to repair the filesystem with the help of xfs_repair many times,
>> without any luck:
>> Filesystem "dm-13": Disabling barriers, not supported by the underlying
>> device
>> XFS mounting filesystem dm-13
>> XFS: failed to read root inode
>
> ...
>
>> xfs_check output:
>> cache_node_purge: refcount was 1, not zero (node=0x820010)
>> xfs_check: cannot read root inode (117)
>
> That's a bit of an odd root inode number, I think, which
> makes me think maybe there are still serious problems.
>
>> Are there any other ways to fix the unreadable root inode or to restore
>> the remaining data?
>>
>>
>> Environment information:
>> Linux Kernel: 2.6.26-gentoo (x86_64)
>> xfsprogs:     3.0.3
>
> Those are both pretty old at this point, I can't say there is anything
> specific in newer xfsprogs, but I'd probably give that a shot first.

Yes I'm going to try the latest version.


Thanks
Christian


Re: failed to read root inode

Stan Hoeppner
In reply to this post by Emmanuel Florac
Emmanuel Florac put forth on 5/9/2010 8:28 AM:

> On Sat, 08 May 2010 17:53:17 -0500, you wrote:
>
>> Why did the "crash" of a single disk in a hardware RAID6 cause a
>> kernel freeze?  What is your definition of "disk crash"?  A single
>> physical disk failure should not have caused this under any
>> circumstances.  The RAID card should have handled a single disk
>> failure transparently.
>
> The RAID array may go west if the disk isn't properly set up,
> particularly if it's a desktop-class drive.

By design, a RAID6 pack should be able to handle two simultaneous drive
failures before the array goes offline.  According to the OP's post he lost
one drive.  Unless it's a really crappy RAID card or if he's using a bunch
of dissimilar drives causing problems with the entire array, he shouldn't
have had a problem.

This is why I'm digging for more information.  The information he presented
here doesn't really make any sense.  One physical disk failure _shouldn't_
have caused the problems he's experiencing.  I don't think we got the full
story.

Oh, btw, when it comes to SATA drives, there is no difference between
"desktop" and "enterprise" class drives.  They're all the same.  The ones
sold as "enterprise" have merely been firmware matched and QC tested with a
given vendor's SAN/NAS box and then certified for use with it.  The vendor
then sells only that one drive/firmware combination in their arrays (maybe
two certified drives, so they have a second source in case of shortages,
price gouging, etc.).

According to the marketing droids, the only "true" "enterprise" drives
currently on the market are SAS and fiber channel.  The number of these
drives actually shipping into the server/SAN/NAS storage marketplace is
absolutely tiny compared to SATA drives.  In total unit shipments, SATA is
owning the datacenter as well as the desktop.  Browse the various storage
offerings across the big 3 and then 10 of the 2nd tier players and you'll
find at least 8 out of 10 storage arrays are SATA, the remaining two being
SAS and FC in the "high end" category, and usually over double the price of
the SATA based arrays.  This pricing of SAS/FC is what is driving SATA
adoption.  That and really large read/write caches on the SATA arrays
boosting their performance for many workloads and negating the spindle speed
advantage of the SAS and FC drives.

--
Stan


Re: failed to read root inode

Emmanuel Florac
On Sun, 09 May 2010 09:53:55 -0500, you wrote:

> Oh, btw, when it comes to SATA drives, there is no difference between
> "desktop" and "enterprise" class drives.  They're all the same.

Yes, I know that; however, there's an important difference: the default
firmware setting of desktop drives makes them retry, retry, retry and
retry on error, effectively freezing the array, while "enterprise drives"
simply fail almost instantly at the slightest error (and usually work
fine 5 minutes later).

So a desktop drive failure in a RAID array may actually block all I/O
for a very long time (as in minutes), and everything goes west if you
then decide that the system must have crashed and pull the plug,
especially if your RAID controller doesn't have a backup battery.

That's why you shouldn't ever use desktop drives in RAID arrays unless
you know how to painfully configure them, one by one, with a friggin
DOS utility ;)
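
For what it's worth, newer smartmontools releases can set the SCT error
recovery timeout from Linux on drives that support it (the drive name is
a placeholder, and support varies by drive and smartctl version):

  smartctl -l scterc,70,70 /dev/sdX   # 7.0 second read/write recovery timeouts
  smartctl -l scterc /dev/sdX         # query the current setting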

--
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    | <[hidden email]>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


Re: failed to read root inode

Christian Affolter-2
In reply to this post by Stan Hoeppner
Hi

>> After a disk crash within a hardware RAID-6 controller and kernel
>> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume:
>
> What storage management operation(s) were you performing when this crash
> occurred?  Were you adding, deleting, shrinking, or growing an EVMS volume
> when the "crash" occurred, or was the system just sitting idle with no load
> when the crash occurred?

No, there were no storage management operations in progress when the
system crashed. It's an NFS file server with random read and write
operations.


> Why did the "crash" of a single disk in a hardware RAID6 cause a kernel
> freeze?  What is your definition of "disk crash"?  A single physical disk
> failure should not have caused this under any circumstances.  The RAID card
> should have handled a single disk failure transparently.

That's a good question ;) Honestly I don't know how this could happen,
all I saw were a bunch of errors from the RAID controller driver. In the
past two other disks failed and the controller reported each failure
correctly and started to rebuild the array automatically by using the
hot-spare disk. So it did its job two times correctly.
[...]

kernel: arcmsr0: abort device command of scsi id = 1 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 2 lun = 0
kernel: arcmsr0: ccb ='0xffff8100bf819080'
isr got aborted command
kernel: arcmsr0: isr get an illegal ccb command
      done acb = '0xffff81013ec245c8'ccb = '0xffff8100bf819080' ccbacb =
'0xffff81013ec245c8' startdone = 0x0 ccboutstandingcount = 0
kernel: arcmsr0: abort device command of scsi id = 2 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: ccb ='0xffff8100bf814640'
isr got aborted command
kernel: arcmsr0: isr get an illegal ccb command
      done acb = '0xffff81013ec245c8'ccb = '0xffff8100bf814640' ccbacb =
'0xffff81013ec245c8' startdone = 0x0 ccboutstandingcount = 13
kernel: sd 0:0:4:0: [sdd] Result: hostbyte=DID_ABORT
driverbyte=DRIVER_OK,SUGGEST_OK
kernel: end_request: I/O error, dev sdd, sector 1897479440
kernel: Device dm-79, XFS metadata write error block 0x1813c0 in dm-79
kernel: Device dm-65, XFS metadata write error block 0x2c1a0 in dm-65
kernel: xfs_force_shutdown(dm-65,0x1) called from line 1093 of file
fs/xfs/xfs_buf_item.c.  Return address = 0xffffffff80374359
kernel: Filesystem "dm-65": I/O Error Detected.  Shutting down
filesystem: dm-65
kernel: Please umount the filesystem, and rectify the problem(s)
kernel: xfs_force_shutdown(dm-65,0x1) called from line 420 of file
fs/xfs/xfs_rw.c.  Return address = 0xffffffff803a9529

[...]

Afterwards most of the volumes were shut down, and after a couple of
hours the kernel froze with a kernel panic (which I can't recall in
detail, as I had no serial console attached).



> Exactly which make/model is the RAID card?

Areca Technology Corp. ARC-1160 16-Port PCI-X to SATA RAID Controller
Firmware Version   : V1.42 2006-10-13


> What is the status of each of the remaining disks attached to the card as reported by its BIOS?

After the hard reset, one disk was reported as 'failed' and the rebuild
started.


> What is the status of the RAID6 volume as reported by the RAID card BIOS?

By now the rebuild has finished, therefore the volume is in a normal,
non-degraded state.


> What is the status of each of your EVMS volumes as reported by the EVMS UI?

They're all active. Do you need more information here? There are
approximately 45 active volumes on this server.


> I'm asking all of these questions because it seems rather clear that the
> root cause of your problem lies at a layer well below the XFS filesystem.

Yes, I never blamed XFS for being the cause of the problem.


> You have two layers of physical disk abstraction below XFS:  a hardware
> RAID6 and a software logical volume manager.  You've apparently suffered a
> storage system hardware failure, according to your description.  You haven't
> given any details of the current status of the hardware RAID, or of the
> logical volumes, merely that XFS is having problems.  I think a "Well duh!"
> is in order.
>
> Please provide _detailed_ information from the RAID card BIOS and the EVMS
> UI.  Even if the problem isn't XFS related I for one would be glad to assist
> you in getting this fixed.  Right now we don't have enough information.  At
> least I don't.


Thanks for your help
Christian


Re: failed to read root inode

Emmanuel Florac
On Sun, 09 May 2010 17:35:27 +0200, you wrote:

> kernel: arcmsr0: abort device command of scsi id = 1 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 2 lun = 0
> kernel: arcmsr0: ccb ='0xffff8100bf819080'

Looks like disks 1, 2 and 4 timed out there. What model are those disks?
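
If your smartmontools build has Areca support, smartctl can usually
query the drives through the controller, e.g. (the /dev/sg0 node and the
slot numbers are assumptions, adjust for your setup):

  smartctl -i -d areca,1 /dev/sg0
  smartctl -a -d areca,4 /dev/sg0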

--
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    | <[hidden email]>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


Re: failed to read root inode

Stan Hoeppner
In reply to this post by Christian Affolter-2
Christian Affolter put forth on 5/9/2010 10:35 AM:

> That's a good question ;) Honestly I don't know how this could happen,
> all I saw were a bunch of errors from the RAID controller driver. In the
> past two other disks failed and the controller reported each failure
> correctly and started to rebuild the array automatically by using the
> hot-spare disk. So it did its job two times correctly.
> [...]
>
> kernel: arcmsr0: abort device command of scsi id = 1 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
> kernel: arcmsr0: abort device command of scsi id = 2 lun = 0
> kernel: arcmsr0: ccb ='0xffff8100bf819080'

Ok, that's not good.  Looks like the Areca driver is showing communication
failure with 3 physical drives simultaneously.  Can't be a drive problem.

I just read through about 20 Google hits, and it seems this Areca issue is
pretty common.  One OP said he had a defective card.  The rest all report
the same or similar errors, across many Areca models, using many different
drives, under moderate to high I/O load, under multiple *nix OSes, usually
lots of small file copies is the trigger.  I've read plenty of less than
flattering things about Areca RAID cards in the past.  This is just more of
the same.

From everything I've read on this, the problem is either:

A.  A defective Areca card
B.  Firmware issue (card and/or drives)
C.  Driver issue
D.  More than one of the above

> Areca Technology Corp. ARC-1160 16-Port PCI-X to SATA RAID Controller
> Firmware Version   : V1.42 2006-10-13

That's a 16 port card.  How many total drives do you have connected?

Are they all the same model/firmware rev?  If different models, do you at
least have identical models in each RAID pack? Mixing different
brands/models/firmware revs within a RAID pack is always a very bad idea.
In fact, using anything but identical drives/firmware on a single controller
card is a bad idea.  Some cards are more finicky than others, but almost all
of them will have problems of one kind or another with a mixed bag 'o
drives.  They can have problems with all identical drives if the drive
firmware isn't to the card firmware's liking (see below).

> After the hard reset, one disk was reported as 'failed' and the rebuild
> started.

Unfortunately the errors reported weren't indicative of a single bad drive,
but of multiple bad drives at once.  None of the drives is actually bad.  The
controller/firmware/driver has a problem, or has a problem with the drives'
firmware.  The Areca firmware marked one drive as bad because its logic says
something other than the card/firmware/driver _must_ be at fault.  So it
marked one of the drives as bad and started rebuilding it.

Back in the late '90s I had Mylex DAC960 cards doing exactly the same thing
due to a problem with firmware on the Seagate ST118202 Cheetah drives.  The
DAC960 would just kick a drive offline willy nilly.  This was with 8
identical firmware drives in RAID5 arrays on a single SCSI channel.  Was
really annoying.  I was at customer sites twice weekly replacing and
rebuilding drives until Seagate finally admitted the firmware bug and
advance shipped us 50 new 3 series Cheetah drives.  That was really fun
replacing drives one by one and rebuilding the arrays after each drive swap.
 We lost a lot of labor $$ over that and had some less than happy customers.
 Once all the drives were replaced with the 3 series, we never had another
problem with any of those arrays.  I'm still surprised I was able to rebuild
the arrays without issues after adding each new drive, which was a slightly
different size with a different firmware.  I was just sure the rebuilds
would puke.  I got lucky.  These systems were in production, thus the reason
we didn't restore from tape, which would have saved a lot of time.

>> What is the status of the RAID6 volume as reported by the RAID card BIOS?
>
> By now the rebuild has finished, therefore the volume is in a normal,
> non-degraded state.

That's good.

>> What is the status of each of your EVMS volumes as reported by the EVMS UI?
>
> They're all active. Do you need more information here? There are
> approximately 45 active volumes on this server.

No.  Just wanted to know if they're all reported as healthy.

>> I'm asking all of these questions because it seems rather clear that the
>> root cause of your problem lies at a layer well below the XFS filesystem.
>
> Yes, I never blamed XFS for being the cause of the problem.

I should have worded that differently.  I didn't mean to imply that you were
blaming XFS.  I meant that I wanted to help you figure out the root cause
which wasn't XFS.

>> You have two layers of physical disk abstraction below XFS:  a hardware
>> RAID6 and a software logical volume manager.  You've apparently suffered a
>> storage system hardware failure, according to your description.  You haven't
>> given any details of the current status of the hardware RAID, or of the
>> logical volumes, merely that XFS is having problems.  I think a "Well duh!"
>> is in order.
>>
>> Please provide _detailed_ information from the RAID card BIOS and the EVMS
>> UI.  Even if the problem isn't XFS related I for one would be glad to assist
>> you in getting this fixed.  Right now we don't have enough information.  At
>> least I don't.

On second read, this looks rather preachy and antagonistic.  I truly did not
intend that tone.  Please accept my apology if this came across that way.  I
think I was starting to get frustrated because I wanted to troubleshoot this
further but didn't feel I had enough info.  Again, this was less than
professional, and I apologize.

--
Stan


Re: failed to read root inode

Roger Willcocks-2
In reply to this post by Christian Affolter-2
The original xfs_repair log looks reasonably sane, and suggests that  
the block(s) containing inodes 128-191 had been zeroed out. xfs_repair  
claims to have fixed all that up, and rebuilt the root directory  
amongst others.

However xfs_check still complains, and correspondence off-list shows  
that xfs_repair -n detects the same 'bad magic number 0xfeed' on  
alternating inodes as xfs_check:

bad magic number 0xfeed on inode 128
bad version number 0x0 on inode 128
bad magic number 0x0 on inode 129
bad version number 0x0 on inode 129
bad (negative) size -8161755683656211562 on inode 129
bad magic number 0xfeed on inode 130
bad version number 0x0 on inode 130
bad magic number 0x0 on inode 131
bad version number 0x0 on inode 131
bad (negative) size -8161755683656211562 on inode 131

This pattern corresponds to the corruption described in http://oss.sgi.com/archives/xfs/2008-01/msg00696.html

xfs_check also says:

block 0/8 expected type unknown got log
block 0/9 expected type unknown got log
block 0/10 expected type unknown got log
block 0/11 expected type unknown got log

Could xfs_repair (or remounting the disk) have written log data over  
those blocks?
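
For what it's worth, the raw on-disk inode and those blocks can be
inspected read-only with xfs_db, e.g. (device path is a placeholder):

  xfs_db -r -c "inode 128" -c "p" /dev/dm-13
  xfs_db -r -c "fsblock 8" -c "type text" -c "p" /dev/dm-13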

--
Roger



Re: failed to read root inode

Eric Sandeen-3
In reply to this post by Stan Hoeppner
Stan Hoeppner wrote:

> Emmanuel Florac put forth on 5/9/2010 8:28 AM:
>> On Sat, 08 May 2010 17:53:17 -0500, you wrote:
>>
>>> Why did the "crash" of a single disk in a hardware RAID6 cause a
>>> kernel freeze?  What is your definition of "disk crash"?  A single
>>> physical disk failure should not have caused this under any
>>> circumstances.  The RAID card should have handled a single disk
>>> failure transparently.
>> The RAID array may go west if the disk isn't properly set up,
>> particularly if it's a desktop-class drive.
>
> By design, a RAID6 pack should be able to handle two simultaneous drive
> failures before the array goes offline.  According to the OP's post he lost
> one drive.  Unless it's a really crappy RAID card or if he's using a bunch
> of dissimilar drives causing problems with the entire array, he shouldn't
> have had a problem.
>
> This is why I'm digging for more information.  The information he presented
> here doesn't really make any sense.  One physical disk failure _shouldn't_
> have caused the problems he's experiencing.  I don't think we got the full
> story.

I tend to agree, something is missing here, which means my suggestions
for repair are unlikely to be terribly successful; I think more is wrong
than we know...

-Eric


Re: failed to read root inode

Christian Affolter-2
In reply to this post by Christian Affolter-2
Hi

>> So if repair isn't making a mountable fs, first suggestion would
>> be to re-try with the latest version of repair.
>
> OK, I will try that. Unfortunately the latest upstream version isn't
> included within the distribution package repository, so I will have to
> compile it first.

OK, I was able to mount the volume by first repairing it with the latest
xfs_repair version and then mounting with mount -o ro,norecovery [...]
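
For the archives, that was roughly (device and mount point are
placeholders):

  xfs_repair /dev/dm-13
  mount -o ro,norecovery /dev/dm-13 /mnt/restore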

Thank you everybody for your helpfulness and your suggestions regarding
this problem.


Regards
Christian
