Re: xfs Digest, Vol 53, Issue 111

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Re: xfs Digest, Vol 53, Issue 111

Sotirios Torentsis
Am 21.11.2012 02:38 schrieb <[hidden email]>:
Send xfs mailing list submissions to
        [hidden email]

To subscribe or unsubscribe via the World Wide Web, visit
        http://oss.sgi.com/mailman/listinfo/xfs
or, via email, send a message with subject or body 'help' to
        [hidden email]

You can reach the person managing the list at
        [hidden email]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of xfs digest..."

Today's Topics:

   1. [PATCH 0/2] discontiguous buffer patches (Mark Tinguely)
   2. Re: [PATCH 0/2] xfsprogs: fixes for release candidate.
      (Mark Tinguely)
   3. Re: [PATCH] xfs: Don't flush inodes when project quota
      exceeded (Jan Kara)
   4. Re: [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous
      direct I/O requests (Jan Kara)
   5. [PATCH] ext4: Reduce i_mutex usage in ext4_file_sync() (Jan Kara)
   6. Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync
      part      correctly (Jan Kara)
   7. Re: [PATCH] xfs: Don't flush inodes when project quota
      exceeded (Dave Chinner)


---------- Weitergeleitete Nachricht ----------
From: Mark Tinguely <[hidden email]>
To: [hidden email]
Cc: 
Date: Tue, 20 Nov 2012 16:41:20 -0600
Subject: [PATCH 0/2] discontiguous buffer patches
Eric Sundeen's "userspace vs. fragmented multiblock dir2", xfstest 291
commit 2a4ed, causes a xfs_buf lock hang:

[<ffffffff81065d87>] down+0x47/0x50
[<ffffffffa03a25c6>] xfs_buf_lock+0x66/0xe0 [xfs]
[<ffffffffa03a33c9>] _xfs_buf_find+0x1f9/0x320 [xfs]
[<ffffffffa03a393f>] xfs_buf_get_map+0x2f/0x170 [xfs]
[<ffffffffa03a3f7a>] xfs_buf_read_map+0x2a/0x100 [xfs]
[<ffffffffa0411630>] xfs_trans_read_buf_map+0x3b0/0x590 [xfs]
[<ffffffffa03e1c5e>] xfs_da_read_buf+0xbe/0x230 [xfs]
[<ffffffffa03e78dc>] xfs_dir2_block_addname+0x7c/0x980 [xfs]
[<ffffffffa03f1468>] xfs_dir2_sf_addname+0x3e8/0x450 [xfs]
[<ffffffffa03e634c>] xfs_dir_createname+0x17c/0x1e0 [xfs]
[<ffffffffa03b9fe2>] xfs_create+0x4c2/0x5f0 [xfs]
[<ffffffffa03b0c8a>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa03b0dce>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff8115695c>] vfs_create+0xac/0xd0
[<ffffffff811587de>] do_last+0x8be/0x960
[<ffffffff8115940c>] path_openat+0xdc/0x410
[<ffffffff81159853>] do_filp_open+0x43/0xa0
[<ffffffff8114a502>] do_sys_open+0x152/0x1e0
[<ffffffff8114a5cc>] sys_open+0x1c/0x20
[<ffffffff81423df9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

That bisect a problem to:

    commit 3605431fb9739a30ccd0c6380ae8e3c6f8e670a5
    Author: Dave Chinner <[hidden email]>
    Date:   Fri Jun 22 18:50:13 2012 +1000
    xfs: use discontiguous xfs_buf support in dabuf wrappers

xfs_trans_buf_item_match() is looking for the block number of the
buffer in the single segment map area.

Futhermore, there are a couple issue with multi-segment buffer log
format.

Patch 1 cleans up the buffer map so that XFS always uses b_maps[].

Patch 2 fixes the buffer log format issues.

--Mark.





---------- Weitergeleitete Nachricht ----------
From: Mark Tinguely <[hidden email]>
To: Dave Chinner <[hidden email]>
Cc: [hidden email]
Date: Tue, 20 Nov 2012 17:17:36 -0600
Subject: Re: [PATCH 0/2] xfsprogs: fixes for release candidate.
On 11/09/12 01:02, Dave Chinner wrote:
Hi Ben,

These are two fixes for bugs that have shown up during the -rc
phase of the release. Please consider them for inclusion.

Cheers,

Dave.

_______________________________________________
xfs mailing list
[hidden email]
http://oss.sgi.com/mailman/listinfo/xfs

This series has been committed to git://oss.sgi.com/xfs/cmds/xfsprogs master branch, commit 1adfff, 19473a and 36298.

--Mark.




---------- Weitergeleitete Nachricht ----------
From: Jan Kara <[hidden email]>
To: Dave Chinner <[hidden email]>
Cc: Jan Kara <[hidden email]>, [hidden email]
Date: Wed, 21 Nov 2012 01:24:59 +0100
Subject: Re: [PATCH] xfs: Don't flush inodes when project quota exceeded
On Tue 20-11-12 11:04:28, Dave Chinner wrote:
> On Mon, Nov 19, 2012 at 10:39:13PM +0100, Jan Kara wrote:
> > On Tue 13-11-12 01:36:13, Jan Kara wrote:
> > > When project quota gets exceeded xfs_iomap_write_delay() ends up flushing
> > > inodes because ENOSPC gets returned from xfs_bmapi_delay() instead of EDQUOT.
> > > This makes handling of writes over project quota rather slow as a simple test
> > > program shows:
> > >   fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> > >   for (i = 0; i < 50000; i++)
> > >           pwrite(fd, buf, 4096, i*4096);
> > >
> > > Writing 200 MB like this into a directory with 100 MB project quota takes
> > > around 6 minutes while it takes about 2 seconds with this patch applied. This
> > > actually happens in a real world load when nfs pushes data into a directory
> > > which is over project quota.
> > >
> > > Fix the problem by replacing XFS_QMOPT_ENOSPC flag with XFS_QMOPT_EPDQUOT.
> > > That makes xfs_trans_reserve_quota_bydquots() return new error EPDQUOT when
> > > project quota is exceeded. xfs_bmapi_delay() then uses this flag so that
> > > xfs_iomap_write_delay() can distinguish real ENOSPC (requiring flushing)
> > > from exceeded project quota (not requiring flushing).
> > >
> > > As a side effect this patch fixes inconsistency where e.g. xfs_create()
> > > returned EDQUOT even when project quota was exceeded.
> >   Ping? Any opinions?
>
> FWIW, it doesn't look like it'll apply to a current XFs tree:
>
> > > @@ -441,8 +442,11 @@ retry:
> > >    */
> > >   if (nimaps == 0) {
> > >           trace_xfs_delalloc_enospc(ip, offset, count);
> > > -         if (flushed)
> > > -                 return XFS_ERROR(error ? error : ENOSPC);
> > > +         if (flushed) {
> > > +                 if (error == 0 || error == EPDQUOT)
> > > +                         error = ENOSPC;
> > > +                 return XFS_ERROR(error);
> > > +         }
> > >
> > >           if (error == ENOSPC) {
> > >                   xfs_iunlock(ip, XFS_ILOCK_EXCL);
>
> This xfs_iomap_write_delay() looks like this now:
>
>         /*
>          * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. Retry
>          * without EOF preallocation.
>          */
>         if (nimaps == 0) {
>                 trace_xfs_delalloc_enospc(ip, offset, count);
>                 if (prealloc) {
>                         prealloc = 0;
>                         error = 0;
>                         goto retry;
>                 }
>                 return XFS_ERROR(error ? error : ENOSPC);
>         }
>
> The flushing is now way up in xfs_file_buffered_aio_write(), and the
> implementation of xfs_flush_inodes() has changed as well. Hence it
> may or may not behave differently not....
  OK, so I tested latest XFS tree and changes by commit 9aa05000 (changing
xfs_flush_inodes()) indeed improve the performace from those ~6 minutes to
~6 seconds which is good enough I believe. Thanks for the pointer! I was
thinking for a while why sync_inodes_sb() is so much faster than the
original XFS implementation and I believe it's because we don't force the
log on each sync now.

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR




---------- Weitergeleitete Nachricht ----------
From: Jan Kara <[hidden email]>
To: Jeff Moyer <[hidden email]>
Cc: [hidden email], [hidden email], "Darrick J. Wong" <[hidden email]>, [hidden email], [hidden email], [hidden email], [hidden email], [hidden email], [hidden email], Jan Kara <[hidden email]>, [hidden email]
Date: Wed, 21 Nov 2012 01:56:26 +0100
Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous direct I/O requests
On Tue 20-11-12 15:02:15, Jeff Moyer wrote:
> Jan Kara <[hidden email]> writes:
>
> >> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {
> >>    /* workqueue for dio unwritten */
> >>    struct workqueue_struct *dio_unwritten_wq;
> >>
> >> +  /* workqueue for aio+dio+o_sync disk cache flushing */
> >> +  struct workqueue_struct *aio_dio_flush_wq;
> >> +
> >   Umm, I'm not completely decided whether we really need a separate
> > workqueue. But it doesn't cost too much so I guess it makes some sense -
> > fsync() is rather heavy so syncing won't starve extent conversion...
>
> I'm assuming you'd like me to convert the names from flush to fsync,
> yes?
  Would be nicer, yes.

> >> +
> >> +  /*
> >> +   * If we are running in nojournal mode, just flush the disk
> >> +   * cache and return.
> >> +   */
> >> +  if (!journal)
> >> +          return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
> >   And this is wrong as well - you need to do work similar to what
> > ext4_sync_file() does. Actually it would be *much* better if these two
> > sites used the same helper function. Which also poses an interesting
> > question about locking - do we need i_mutex or not? Forcing a transaction
> > commit is definitely OK without it, similarly as grabbing transaction ids
> > from inode or ext4_should_journal_data() test. __sync_inode() call seems
> > to be OK without i_mutex as well so I believe we can just get rid of it
> > (getting i_mutex from the workqueue is a locking nightmare we don't want to
> > return to).
>
> Just to be clear, are you saying you would like me to remove the
> mutex_lock/unlock pair from ext4_sync_file?  (I had already factored out
> the common code between this new code path and the fsync path in my tree.)
  Yes, after some thinking I came to that conclusion. We actually need to
keep i_mutex around ext4_flush_unwritten_io() to avoid livelocks but the
rest doesn't need it. The change should be definitely a separate patch just
in case there's something subtle I missed and we need to bisect in
future... I've attached a patch for that so that blame for bugs goes my way
;) Compile tested only so far. I'll give it some more testing overnight.

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR


---------- Weitergeleitete Nachricht ----------
From: Jan Kara <[hidden email]>
To: 
Cc: 
Date: Wed, 21 Nov 2012 01:46:51 +0100
Subject: [PATCH] ext4: Reduce i_mutex usage in ext4_file_sync()
ext4_file_sync() needs i_mutex only to avoid livelocks of
ext4_flush_unwritten_io() all other code doesn't need it. In particular
syncing of inode & metadata in non-journal case is safe (writeback doesn't
hold i_mutex either) and forcing of transaction commits doesn't need i_mutex
either (there's nothing inode specific in that code apart from grabbing
transaction ids from the inode). So shorten the span where i_mutex is held.

Signed-off-by: Jan Kara <[hidden email]>
---
 fs/ext4/fsync.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index be1d89f..2268114 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)
  *
  * What we do is just kick off a commit and wait on it.  This will snapshot the
  * inode to disk.
- *
- * i_mutex lock is held when entering and exiting this function
  */

 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
@@ -133,12 +131,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
        ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
        if (ret)
                return ret;
-       mutex_lock(&inode->i_mutex);

        if (inode->i_sb->s_flags & MS_RDONLY)
                goto out;

+       mutex_lock(&inode->i_mutex);
        ret = ext4_flush_unwritten_io(inode);
+       mutex_unlock(&inode->i_mutex);
        if (ret < 0)
                goto out;

@@ -180,7 +179,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
                        ret = err;
        }
  out:
-       mutex_unlock(&inode->i_mutex);
        trace_ext4_sync_file_exit(inode, ret);
        return ret;
 }
--
1.7.1


--k+w/mQv8wyuph6w0--




---------- Weitergeleitete Nachricht ----------
From: Jan Kara <[hidden email]>
To: Jeff Moyer <[hidden email]>
Cc: [hidden email], [hidden email], "Darrick J. Wong" <[hidden email]>, [hidden email], [hidden email], [hidden email], [hidden email], [hidden email], [hidden email], Jan Kara <[hidden email]>, [hidden email]
Date: Wed, 21 Nov 2012 01:57:39 +0100
Subject: Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly
On Tue 20-11-12 15:47:54, Jeff Moyer wrote:
> Jan Kara <[hidden email]> writes:
>
> > On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:
> >> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
> >> flag so that flushes are issued /after/ the write completes, not before.
> >>
> >> Note, however, that for block devices, the DIO setup code ensures that a flush
> >> wq is attached to the superblock of the bdevfs filesystem, not the filesystem
> >> that the device node happens to reside in.  This means that unlike regular
> >> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb.  Therefore, adjust
> >> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.
> >>
> >> Signed-off-by: Darrick J. Wong <[hidden email]>
> >> ---
> >>  fs/block_dev.c |    5 +++--
> >>  fs/direct-io.c |    3 ++-
> >>  2 files changed, 5 insertions(+), 3 deletions(-)
> >>
> >>
> >> diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> index 1a1e5e3..05ff33a 100644
> >> --- a/fs/block_dev.c
> >> +++ b/fs/block_dev.c
> >> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> >>    struct inode *inode = file->f_mapping->host;
> >>
> >>    return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
> >> -                              nr_segs, blkdev_get_blocks, NULL, NULL, 0);
> >> +                              nr_segs, blkdev_get_blocks, NULL, NULL,
> >> +                              DIO_SYNC_WRITES);
> >>  }
> >>
> >>  int __sync_blockdev(struct block_device *bdev, int wait)
> >> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
> >>    percpu_down_read(&bdev->bd_block_size_semaphore);
> >>
> >>    ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
> >> -  if (ret > 0 || ret == -EIOCBQUEUED) {
> >> +  if (ret > 0) {
> >>            ssize_t err;
> >>
> >>            err = generic_write_sync(file, pos, ret);
> >> diff --git a/fs/direct-io.c b/fs/direct-io.c
> >> index b7391d4..c626c43 100644
> >> --- a/fs/direct-io.c
> >> +++ b/fs/direct-io.c
> >> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
> >>            work->ret = ret;
> >>            work->offset = offset;
> >>            work->len = bytes;
> >> -          queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
> >> +          queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
> >> +                     &work->work);
> >   This should be folded into the original patch introducing the
> > s_dio_flush_wq. And please add a comment before this line saying that block
> > devices need a dereference exactly like this... Otherwise the patch looks
> > good so you can add:
> >   Reviewed-by: Jan Kara <[hidden email]>
>
> When you say, "This," do you mean that one change, or the whole patch?
  I meant the whole patch after that hung is folded ;)

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR




---------- Weitergeleitete Nachricht ----------
From: Dave Chinner <[hidden email]>
To: Jan Kara <[hidden email]>
Cc: [hidden email]
Date: Wed, 21 Nov 2012 12:38:21 +1100
Subject: Re: [PATCH] xfs: Don't flush inodes when project quota exceeded
On Wed, Nov 21, 2012 at 01:24:59AM +0100, Jan Kara wrote:
> On Tue 20-11-12 11:04:28, Dave Chinner wrote:
> > On Mon, Nov 19, 2012 at 10:39:13PM +0100, Jan Kara wrote:
> > > On Tue 13-11-12 01:36:13, Jan Kara wrote:
> > > > When project quota gets exceeded xfs_iomap_write_delay() ends up flushing
> > > > inodes because ENOSPC gets returned from xfs_bmapi_delay() instead of EDQUOT.
> > > > This makes handling of writes over project quota rather slow as a simple test
> > > > program shows:
> > > >         fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> > > >         for (i = 0; i < 50000; i++)
> > > >                 pwrite(fd, buf, 4096, i*4096);
> > > >
> > > > Writing 200 MB like this into a directory with 100 MB project quota takes
> > > > around 6 minutes while it takes about 2 seconds with this patch applied. This
> > > > actually happens in a real world load when nfs pushes data into a directory
> > > > which is over project quota.
> > > >
> > > > Fix the problem by replacing XFS_QMOPT_ENOSPC flag with XFS_QMOPT_EPDQUOT.
> > > > That makes xfs_trans_reserve_quota_bydquots() return new error EPDQUOT when
> > > > project quota is exceeded. xfs_bmapi_delay() then uses this flag so that
> > > > xfs_iomap_write_delay() can distinguish real ENOSPC (requiring flushing)
> > > > from exceeded project quota (not requiring flushing).
> > > >
> > > > As a side effect this patch fixes inconsistency where e.g. xfs_create()
> > > > returned EDQUOT even when project quota was exceeded.
> > >   Ping? Any opinions?
> >
> > FWIW, it doesn't look like it'll apply to a current XFs tree:
> >
> > > > @@ -441,8 +442,11 @@ retry:
> > > >          */
> > > >         if (nimaps == 0) {
> > > >                 trace_xfs_delalloc_enospc(ip, offset, count);
> > > > -               if (flushed)
> > > > -                       return XFS_ERROR(error ? error : ENOSPC);
> > > > +               if (flushed) {
> > > > +                       if (error == 0 || error == EPDQUOT)
> > > > +                               error = ENOSPC;
> > > > +                       return XFS_ERROR(error);
> > > > +               }
> > > >
> > > >                 if (error == ENOSPC) {
> > > >                         xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >
> > This xfs_iomap_write_delay() looks like this now:
> >
> >         /*
> >          * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. Retry
> >          * without EOF preallocation.
> >          */
> >         if (nimaps == 0) {
> >                 trace_xfs_delalloc_enospc(ip, offset, count);
> >                 if (prealloc) {
> >                         prealloc = 0;
> >                         error = 0;
> >                         goto retry;
> >                 }
> >                 return XFS_ERROR(error ? error : ENOSPC);
> >         }
> >
> > The flushing is now way up in xfs_file_buffered_aio_write(), and the
> > implementation of xfs_flush_inodes() has changed as well. Hence it
> > may or may not behave differently not....
>   OK, so I tested latest XFS tree and changes by commit 9aa05000 (changing
> xfs_flush_inodes()) indeed improve the performace from those ~6 minutes to
> ~6 seconds which is good enough I believe. Thanks for the pointer! I was
> thinking for a while why sync_inodes_sb() is so much faster than the
> original XFS implementation and I believe it's because we don't force the
> log on each sync now.

I think it's detter because we now don't scan every inode in the
inode cache doing a mapping_tagged(PAGECACHE_TAG_DIRTY) check to see
if they are dirty or not to decide whether it needs writeback. The
overhead of doing that adds up very quickly when you have lots of
cached inodes and you are scanning them for every page that is
written to....

Cheers,

Dave.
--
Dave Chinner
[hidden email]



_______________________________________________
xfs mailing list
[hidden email]
http://oss.sgi.com/mailman/listinfo/xfs


_______________________________________________
xfs mailing list
[hidden email]
http://oss.sgi.com/mailman/listinfo/xfs