
XFS corruption with Areca RAID reintegration


nortally
I set up my 24T XFS file system following the advice on the XFS website for Areca RAID controllers. Last week, one of the SATA drives timed out, so the RAID controller failed it and started reintegrating a hot spare. Within minutes the kernel logged fatal XFS errors and took the file system offline.

Note: I do understand that the RAID shouldn't chronically fail drives and promote a hot spare. I have two identical systems and one of them does this about once every six weeks. I've been working to resolve the issue. But I never had a file system corruption problem with the 12T ext4 file system I was using before the last expansion.

Here's how I experienced the problem:
1. Came into the office, the RAID alarm was going.
2. Silenced the alarm, checked that (from the RAID's point of view) the RAID was reintegrating in the background and the Volume was OK.
3. Logged into the server and discovered the file system was no longer mounted.
4. Checked the server log and discovered that XFS had reported corruption and shut the file system down within 2 minutes of the RAID event.
5. Ran xfs_repair; it said the file system had to be mounted first so the log could be replayed.
6. Tried to mount the file system, was told the log was corrupt and that I'd have to run xfs_repair with the -L option.
7. Ran xfs_repair -L and let it run for 5 days; it was still logging activity every 15 minutes (the same message saying another 20 items had been processed), but it wasn't done. I terminated the run because the data is non-essential (this device is used for backup) and I needed to get the file system back on-line.

I started over, and two nights ago the same thing happened again. I'm hoping this is a configuration issue:

Hardware Setup:
Habey 12-bay SATA enclosure
Areca 1680X RAID card
RAID 6 device built on 10 x 3T SATA hard drives + 2 Hot Spares
Single volume using entire RAID
stripe size 64k
Disk Write Cache Mode disabled (per XFS instructions)
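For RAID 6, the stripe width given to mkfs.xfs should be the number of data-bearing disks, i.e. the drives in the RAID set minus the two parity drives (hot spares excluded). A quick sanity check of the geometry above, using the values from this setup:

```shell
# Stripe geometry for the array above: RAID 6 on 10 drives, 64k stripe unit
DRIVES=10      # drives in the RAID set (hot spares not counted)
PARITY=2       # RAID 6 stores two parity blocks per stripe
SU_KB=64       # controller stripe size
SW=$((DRIVES - PARITY))           # data disks -> the sw value for mkfs.xfs
FULL_STRIPE_KB=$((SU_KB * SW))    # one full stripe's worth of data
echo "su=${SU_KB}k sw=${SW} full_stripe=${FULL_STRIPE_KB}k"
# → su=64k sw=8 full_stripe=512k
```

This matches the su=64k,sw=8 used in the format command below, so the mkfs geometry is at least self-consistent with the array layout.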

Partition command:
# parted -a optimal /dev/sdc mklabel gpt mkpart primary 0% 100%

Format command:
# mkfs.xfs -d su=64k,sw=8 /dev/sdc1
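One thing worth double-checking is the unit conversions, since the various tools report stripe alignment differently: mkfs.xfs and xfs_info show sunit/swidth in file system blocks, while the sunit/swidth mount options take 512-byte sectors. A sketch of the conversions for the su=64k,sw=8 values above, assuming the default 4096-byte block size:

```shell
# Convert su=64k,sw=8 into the units other tools use
SU_BYTES=$((64 * 1024))
BSIZE=4096                         # default XFS block size (assumption)
SUNIT_BLKS=$((SU_BYTES / BSIZE))   # sunit in fs blocks, as mkfs.xfs/xfs_info report it
SWIDTH_BLKS=$((SUNIT_BLKS * 8))    # swidth in fs blocks
SUNIT_SECT=$((SU_BYTES / 512))     # sunit in 512B sectors, as mount -o sunit= expects
SWIDTH_SECT=$((SUNIT_SECT * 8))    # swidth in 512B sectors
echo "blocks: sunit=$SUNIT_BLKS swidth=$SWIDTH_BLKS"
echo "sectors: sunit=$SUNIT_SECT swidth=$SWIDTH_SECT"
```

So xfs_info on this file system should show sunit=16, swidth=128 (blocks); if it doesn't, the on-disk geometry isn't what the mkfs command intended.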

Running Software:
Package: xfsprogs
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: XFS Development Team <xfs@oss.sgi.com>
Architecture: amd64
Version: 3.1.9ubuntu2
Filename: pool/main/x/xfsprogs/xfsprogs_3.1.9ubuntu2_amd64.deb

# uname -a
Linux server42 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/issue
Ubuntu 14.04 LTS \n \l

RAID Event Log:
2014-07-16 00:09:29  Enc#2 PHY#9      Time Out Error
2014-07-16 00:09:31  ARC-1680-VOL#010 Volume Degraded
2014-07-16 00:09:33  Raid Set # 010   RaidSet Degraded
2014-07-16 00:09:35  Enc#2 PHY#9      Device Removed
2014-07-16 07:58:02  192.168.001.073  Telnet Log In        

Syslog:
Jul 16 00:10:19 server42 kernel: [37949.044839] ffff88010d833000: 5c 63 63 9e 9b 21 32 88 05 9d a3 dc 42 9d 7a e6  \cc..!2.....B.z.
Jul 16 00:10:19 server42 kernel: [37949.044991] ffff88010d833010: 45 1a f0 65 dc 4d 0e d5 fd 22 f0 fb ad 0b 46 11  E..e.M..."....F.
Jul 16 00:10:19 server42 kernel: [37949.045122] ffff88010d833020: 66 c2 9c 91 ad 17 72 45 36 c2 77 e1 6d 4d 66 11  f.....rE6.w.mMf.
Jul 16 00:10:19 server42 kernel: [37949.045252] ffff88010d833030: ad 53 5e 9b 5e dd 55 81 7b 3e 21 97 cf 4c c4 3b  .S^.^.U.{>!..L.;
Jul 16 00:10:19 server42 kernel: [37949.045385] XFS (sdc1): Internal error xfs_allocbt_read_verify at line 362 of file /build/buildd/linux-3.13.0/fs/xfs/xfs_alloc_btree.c.  Caller 0xffffffffa015b6c5
Jul 16 00:10:19 server42 kernel: [37949.045603] CPU: 0 PID: 390 Comm: kworker/0:1H Not tainted 3.13.0-29-generic #53-Ubuntu
Jul 16 00:10:19 server42 kernel: [37949.045605] Hardware name: Supermicro X7SBi/X7SBi, BIOS 1.3a       11/03/2009
Jul 16 00:10:19 server42 kernel: [37949.045639] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045642]  0000000000000001 ffff8800c8867d78 ffffffff8171a214 ffff8802209b7000
Jul 16 00:10:19 server42 kernel: [37949.045646]  ffff8800c8867d90 ffffffffa015e53b ffffffffa015b6c5 ffff8800c8867dc8
Jul 16 00:10:19 server42 kernel: [37949.045650]  ffffffffa015e595 0000016a364f2b20 ffff8800364f2b20 ffff8800364f2a80
Jul 16 00:10:19 server42 kernel: [37949.045654] Call Trace:
Jul 16 00:10:19 server42 kernel: [37949.045661]  [<ffffffff8171a214>] dump_stack+0x45/0x56
Jul 16 00:10:19 server42 kernel: [37949.045683]  [<ffffffffa015e53b>] xfs_error_report+0x3b/0x40 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045704]  [<ffffffffa015b6c5>] ? xfs_buf_iodone_work+0x85/0xf0 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045725]  [<ffffffffa015e595>] xfs_corruption_error+0x55/0x80 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045746]  [<ffffffffa015b6c5>] ? xfs_buf_iodone_work+0x85/0xf0 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045770]  [<ffffffffa0178ba9>] xfs_allocbt_read_verify+0x69/0xd0 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045791]  [<ffffffffa015b6c5>] ? xfs_buf_iodone_work+0x85/0xf0 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045812]  [<ffffffffa015b6c5>] xfs_buf_iodone_work+0x85/0xf0 [xfs]
Jul 16 00:10:19 server42 kernel: [37949.045817]  [<ffffffff810838a2>] process_one_work+0x182/0x450
Jul 16 00:10:19 server42 kernel: [37949.045820]  [<ffffffff81084641>] worker_thread+0x121/0x410
Jul 16 00:10:19 server42 kernel: [37949.045823]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Jul 16 00:10:19 server42 kernel: [37949.045827]  [<ffffffff8108b322>] kthread+0xd2/0xf0
Jul 16 00:10:19 server42 kernel: [37949.045830]  [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0
Jul 16 00:10:19 server42 kernel: [37949.045834]  [<ffffffff8172ab3c>] ret_from_fork+0x7c/0xb0
Jul 16 00:10:19 server42 kernel: [37949.045837]  [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0
Jul 16 00:10:19 server42 kernel: [37949.045840] XFS (sdc1): Corruption detected. Unmount and run xfs_repair
Jul 16 00:10:19 server42 kernel: [37949.045948] XFS (sdc1): metadata I/O error: block 0x384ae0778 ("xfs_trans_read_buf_map") error 117 numblks 8
Jul 16 00:10:19 server42 kernel: [37949.046104] XFS (sdc1): xfs_do_force_shutdown(0x1) called from line 376 of file /build/buildd/linux-3.13.0/fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa01bcd7d
Jul 16 00:10:19 server42 kernel: [37949.046726] XFS (sdc1): I/O Error Detected. Shutting down filesystem
Jul 16 00:10:19 server42 kernel: [37949.046832] XFS (sdc1): Please umount the filesystem and rectify the problem(s)
Jul 16 00:10:34 server42 kernel: [37964.128015] XFS (sdc1): xfs_log_force: error 5 returned.
Jul 16 00:11:04 server42 kernel: [37994.208014] XFS (sdc1): xfs_log_force: error 5 returned. (repeats until dismount)