xfs_check/xfs_repair reports corruption immediately after creating filesystem

beantownguy80
I am trying to debug several anomalous behaviors with xfs. Here's some background. I have a custom block device driver that my company has been using in its product for about a year and a half. Until now, we have been using it with ext4, including some of the newer features like discard and lazy_itable_init. We have hundreds of customers, have run hours and hours of tests, and have never seen any corruption that wasn't purposely induced. We currently support RHEL 6.x with the 2.6.32 kernel (with RHEL backports), but we would like to switch to xfs because it supports much larger filesystems than ext4 on RHEL6 and is more scalable. We also test heavily with a more recent Ubuntu 12.04 running a 3.2.x kernel (3.2.0-59-generic to be exact, more on this later), but this is not in production.

We can change the block device size at load time, so for quick testing we create a 1GiB drive.
Here's what the geometry looks like, which is pretty typical for a 1GiB drive (very small by today's standards, even for a USB stick, but I digress).

$ sudo hdparm -g /dev/cbd0

/dev/cbd0:
 geometry      = 1024/64/32, sectors = 2097152, start = 0
I also have a tool that prints the results of various ioctls; the output should be self-explanatory.

$ sudo ./specs /dev/cbd0
Device: /dev/cbd0
BLKBSZGET: 512
BLKSSZGET: 512
BLKGETSIZE64: 1073741824
HDIO_GETGEO:
  cylinders: 1024
      heads: 64
    sectors: 32
      start: 0
Geometry check: geometry size matches BLKGETSIZE64 exactly
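
For reference, the tool boils down to a handful of standard block ioctls. Here's a minimal sketch of what it does (error handling trimmed and the program structure is illustrative, but BLKBSZGET, BLKSSZGET, BLKGETSIZE64 and HDIO_GETGEO are the normal Linux interfaces):

/* specs.c -- minimal sketch of the query tool above (illustrative only) */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* BLKBSZGET, BLKSSZGET, BLKGETSIZE64 */
#include <linux/hdreg.h>   /* HDIO_GETGEO, struct hd_geometry */

int main(int argc, char **argv)
{
    int fd, bsz = 0, ssz = 0;
    uint64_t bytes = 0;
    struct hd_geometry geo;

    if (argc != 2) { fprintf(stderr, "usage: %s <blockdev>\n", argv[0]); return 1; }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ioctl(fd, BLKBSZGET, &bsz);       /* soft (filesystem) block size */
    ioctl(fd, BLKSSZGET, &ssz);       /* logical sector size */
    ioctl(fd, BLKGETSIZE64, &bytes);  /* device size in bytes */
    ioctl(fd, HDIO_GETGEO, &geo);     /* legacy CHS geometry and start offset */

    printf("Device: %s\n", argv[1]);
    printf("BLKBSZGET: %d\nBLKSSZGET: %d\nBLKGETSIZE64: %llu\n",
           bsz, ssz, (unsigned long long)bytes);
    printf("HDIO_GETGEO: cylinders=%u heads=%u sectors=%u start=%lu\n",
           geo.cylinders, geo.heads, geo.sectors, geo.start);

    /* sanity check: C * H * S * 512 vs BLKGETSIZE64 */
    if ((uint64_t)geo.cylinders * geo.heads * geo.sectors * 512 == bytes)
        printf("Geometry check: geometry size matches BLKGETSIZE64 exactly\n");
    else
        printf("Geometry check: geometry size does NOT match BLKGETSIZE64\n");

    close(fd);
    return 0;
}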

What's odd is that if I create an xfs filesystem on Ubuntu 12.04, it attempts to write off the edge of the disk. The last I/O request overlaps the end of the disk: its start sector is near the end of the disk (in bounds), but its transfer length extends beyond the end of the disk. I think this indicates that mkfs.xfs has some misunderstanding about the geometry or the size of my disk, but I don't know why. Two things are very odd: 1) this doesn't happen on RHEL6 (kernel 2.6.32), but it does on Ubuntu 12.04 (kernel 3.2.0), and 2) it goes away on Ubuntu if I specify a 512-byte block size (e.g. mkfs.xfs -b size=512). Here's the output from mkfs.xfs when it fails.

$ sudo mkfs.xfs -K -f /dev/cbd0
meta-data=/dev/cbd0              isize=256    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: pwrite64 failed: Input/output error
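
This kind of overrun is easy to flag in the driver with a simple bounds check before servicing a request. Here's roughly what that looks like (a sketch against the ~2.6.32/3.2-era request API; the function name and the "cbd" prefix are placeholders):

#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>

/* Sketch: reject any request that runs past the advertised capacity. */
static int cbd_check_bounds(struct gendisk *gd, struct request *rq)
{
    sector_t start = blk_rq_pos(rq);          /* first sector of the request */
    unsigned int nsect = blk_rq_sectors(rq);  /* total 512-byte sectors in the request */
    sector_t capacity = get_capacity(gd);

    if (start + nsect > capacity) {
        pr_err("cbd: request [%llu, %llu) exceeds capacity %llu\n",
               (unsigned long long)start,
               (unsigned long long)(start + nsect),
               (unsigned long long)capacity);
        return -EIO;  /* caller completes the request with an error */
    }
    return 0;
}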

Now, once I get past the first issue (by specifying a 512-byte block size on Ubuntu, for example), xfs_check consistently reports "bad magic number 0 for inode xxx" on either platform. On Ubuntu with the forced 512-byte block size, xxx is always 44-95. It seems as though part of the disk is unintentionally zeroed out or left as zeroes, and maybe this has to do with mkfs.xfs having a completely bonkers idea about the geometry and/or size of the virtual disk drive. Another weird factoid: this fs will mount successfully, but I am worried about corruption creeping in if this is indicative of a more fundamental problem, and furthermore it will be impossible to correct a legitimate problem if xfs_repair doesn't work.

It's worth mentioning that I am formatting the device directly; there are no partitions, because we plan on supporting very large filesystems and don't want to have to support a protective MBR+GPT and resize the GPT every time we grow the underlying device and fs. This should work fine, and in fact is often recommended for md devices, so I don't think it should be an issue.

One last test I did: I used dd to copy 1GiB from /dev/zero to a file named 1G.data (on a different filesystem). I used mkfs.xfs on that file to create a filesystem, then ran xfs_check -f 1G.data, and it checks out okay (as expected). I then copied the file to this block device (using dd again), ran xfs_check against the block device, and it checks out okay. So it's something that mkfs.xfs is doing when directly accessing my device. There is no write caching in this device driver; all writes are only acknowledged once they have hit persistent storage, so I don't think it's some kind of data hazard (WAW, RAW, etc.), but I could be wrong.

I have tried the bleeding edge xfsprogs from git, but I still see the same behavior. Please advise!

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

beantownguy80
One small note: I think the issue of writing off the end of the disk is Ubuntu-specific and has something to do with their changes to "Host Protected Area" (HPA) detection and how and when it's "unlocked". If anyone has good answers about how to correct/avoid that in a kernel driver (flags I need to set regarding native capacity, or something else), that would be much appreciated. There's clearly something wrong there, but it doesn't happen on the older CentOS, so for now I want to focus on the data integrity issue after mkfs.xfs. Thanks in advance.

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

beantownguy80
I think I have a better understanding of what's going on. For some reason, when using xfs, the block I/O scheduler layer sends down write requests that have multiple segments (no surprise here) that are all writes to the same sector (a bit surprising to me). I was not expecting this, and it seems like a lot of busy work, i.e. only one of these writes (the last one in order) seems necessary. I can see why you might want to read the same sector into multiple buffers (e.g. to satisfy multiple readers), but multiple overwrites of the same sector in one request seems like a waste of effort at the very least, so I must be misunderstanding the intention of these requests (maybe there is an extra flag I am not checking).

As a comparison, when I get multiple segments sent down from ext4, they usually cover contiguous sector ranges. That is, the entire request is a read or write of a contiguous run of sectors, broken into multiple segments because the buffers are not all on contiguous pages. That's the point of the segmenting, no? With xfs I get some requests like that, but also some spurious-looking requests that appear to ask me to overwrite the same sector range multiple times. I am trying to gain a better understanding of why this might be happening and how to handle these requests by looking at the xfs driver source and xfsprogs source, but it's a bit of a learning curve for me.
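
Here's roughly the kind of logging that shows this (a sketch; it assumes a ~3.2-era kernel, where rq_for_each_segment yields a struct bio_vec * and the per-bio start sector is still called bi_sector):

#include <linux/kernel.h>
#include <linux/blkdev.h>

/* Sketch: log the apparent sector range of every segment in a request,
 * so repeated/overlapping ranges stand out in dmesg. */
static void cbd_dump_segments(struct request *rq)
{
    struct req_iterator iter;
    struct bio_vec *bvec;

    rq_for_each_segment(bvec, rq, iter) {
        sector_t start = iter.bio->bi_sector;     /* start sector of the current bio */
        unsigned int nsect = bvec->bv_len >> 9;   /* length of this segment in sectors */

        pr_info("cbd: %s segment: bio sector %llu, %u sectors, bv_offset %u\n",
                rq_data_dir(rq) == WRITE ? "write" : "read",
                (unsigned long long)start, nsect, bvec->bv_offset);
    }
}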

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

beantownguy80
Here's another tidbit of information. The symptoms are somewhat different.

Given: blk_queue_max_hw_sectors() is set to 1024 (which means 512 KiB when using typical 512-byte sectors)

I am iterating through a rq_for_each_segment(bvec, rq, iter) loop and observe some other strange phenomena.

segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903
segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 // <--- Something smells fishy here
...

Segment 1 is where the issues show up:
1) The start sector is not adjacent to the previous end sector; there is a gap from sector 903 through sector 1022.
2) Taking just these two segments into account, the I/O request spans more than 1024 sectors, violating the queue limit specified above.

All of the examples I've looked at and docs I've read (kernel docs, LDD3, Linux Kernel Development (2010), Understanding the Linux Kernel, etc.) indicate that a request should only contain reads or writes of adjacent sectors. That's been my observation with requests coming from ext4 as well. Am I misunderstanding how this is supposed to work? Why is the block layer sending down requests that seem to violate the request_queue configuration/limits?
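
One cross-check that helps here (a sketch, with the same ~3.2-era kernel assumptions as above) is to compare the request-level accounting against what the segments add up to, and against the queue limit:

/* Sketch: compare blk_rq_sectors() against the sum of segment lengths
 * and against the queue's advertised max_hw_sectors. */
static void cbd_check_accounting(struct request *rq)
{
    struct req_iterator iter;
    struct bio_vec *bvec;
    unsigned int seg_sectors = 0;

    rq_for_each_segment(bvec, rq, iter)
        seg_sectors += bvec->bv_len >> 9;

    if (seg_sectors != blk_rq_sectors(rq))
        pr_warn("cbd: segments add up to %u sectors, blk_rq_sectors() says %u\n",
                seg_sectors, blk_rq_sectors(rq));

    if (blk_rq_sectors(rq) > queue_max_hw_sectors(rq->q))
        pr_warn("cbd: request of %u sectors exceeds max_hw_sectors %u\n",
                blk_rq_sectors(rq), queue_max_hw_sectors(rq->q));
}

If the bv_len totals come out consistent while the per-bio bi_sector values look odd, that points at the iterator bookkeeping rather than at xfs itself.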

These anomalies, coupled with the other odd behavior reported in an earlier post (sometimes observing writes to the same sectors within the same request but across segments), could be construed as an indication of some kind of race or corruption. I iterate over each request while holding a mutex, and no other code in my driver accesses the same request pointer concurrently. This seems to happen only with xfs filesystems and tools (mkfs.xfs, xfs_repair, xfs_check). Why does it seem to be isolated to xfs? I'm not trying to blame xfs, so I apologize if it comes off that way; I just think there may be some behavior of ext4 that I am taking for granted that is not true in general.

The only workaround I have found is to limit the number of segments to 1, but this obviously has a negative performance impact and, again, has never been necessary with ext3 or ext4. If someone could point me in the right direction, that would be very helpful.
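
For context, the relevant queue setup boils down to something like this (a sketch; cbd_queue is a placeholder for my request_queue, and the last line is the workaround mentioned above; note that on a plain 2.6.32 the segment limit goes through blk_queue_max_phys_segments/blk_queue_max_hw_segments instead):

/* Sketch of the queue limits as the driver sets them up. */
blk_queue_logical_block_size(cbd_queue, 512);   /* matches BLKSSZGET above */
blk_queue_physical_block_size(cbd_queue, 512);
blk_queue_max_hw_sectors(cbd_queue, 1024);      /* 1024 * 512 bytes = 512 KiB per request */

/* Workaround: force one segment per request. Correct, but costs performance. */
blk_queue_max_segments(cbd_queue, 1);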

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

diama13
Did you find a solution? I think I have the same problem with xfs!

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

beantownguy80

The problem was partially in the driver I was writing and partially in the kernel, depending on how you look at it. When you use the rq_for_each_segment macro on a request that came from direct I/O (e.g. mkfs.xfs and xfs_repair both use O_DIRECT), sometimes the fields in the iterator are not updated the way you'd expect. You can only trust bv_len in the bio_vec to calculate how many blocks were transferred in that iteration. If you go back to the email thread, I explained it better earlier. Did you look at my later comments in the thread?
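
To spell it out, the iteration pattern that behaves correctly looks roughly like this (a sketch, again assuming a ~3.2-era kernel; cbd_transfer() is a hypothetical stand-in for whatever copies data to/from the backing store, and on 2.6.32 kmap_atomic/kunmap_atomic additionally take a KM_USER0 slot argument):

/* Sketch: track the position from blk_rq_pos() and advance by bv_len only;
 * do not recompute the position from iter.bio->bi_sector mid-request. */
static void cbd_do_request(struct request *rq)
{
    struct req_iterator iter;
    struct bio_vec *bvec;
    sector_t pos = blk_rq_pos(rq);              /* start sector of the whole request */
    int write = (rq_data_dir(rq) == WRITE);

    rq_for_each_segment(bvec, rq, iter) {
        unsigned int nsect = bvec->bv_len >> 9; /* the only length to trust here */
        void *kaddr = kmap_atomic(bvec->bv_page);

        cbd_transfer(pos, nsect, kaddr + bvec->bv_offset, write);

        kunmap_atomic(kaddr);
        pos += nsect;                           /* advance by bv_len, nothing else */
    }
}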

On May 12, 2014 1:23 PM, "diama13 [via Xfs]" <[hidden email]> wrote:
>
> Did you find a solution? I think I have the same problem with xfs!


Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

diama13
I do not use rq_for_each_segment in my driver. I use bio->bi_sector to find the rq.

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

beantownguy80

If you are not using rq pointers and rq_for_each_segment, then it sounds like your problem is different. Good luck.

On May 12, 2014 1:38 PM, "diama13 [via Xfs]" <[hidden email]> wrote:
I do not use rq_for_each_segment in my driver. I use bio->bi_sector to find the rq.



Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

diama13
blk_queue_max_hw_sectors() is set to 8

Re: xfs_check/xfs_repair reports corruption immediately after creating filesystem

diama13
I think it's the same. Thanks anyway.