[PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part


Darrick J. Wong-2
Hi everybody,

On March 29th, Jeff Moyer posted to lkml a patchset with this note:

> Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
> are sync'd before the I/O is actually issued (everybody else).  The following
> patch series fixes this in two parts.  First, for the file systems that use
> the generic routines, Jan has provided some generic infrastructure to perform
> the syncs after the I/O is completed.  Second, for those file systems which
> require some endio processing of their own for O_DIRECT writes (xfs and
> ext4), [Jeff] implemented file system specific syncing.  This passes the
> updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
> in the aio group.  [Jeff] tested ext3, ext4, xfs, and btrfs only.

Since the original post a few months ago, this patchset doesn't seem to have
made any progress.  An internal testing team here discovered that the issue
also affects O_SYNC+AIO+DIO writes to block devices.  Worse yet, since the
flushes were being issued (and waited upon) directly in the io_submit call
graph, the io_submit calls themselves would take a very long time to complete.
Therefore, I added another patch to move the flush to the io_end processing.

The blockdev patch was written by me.  The ext4 patch had to be updated to
accommodate a rework of the ext4 endio code that landed since March.  Everything
else has been passed through from Jeff's March 30th resend, with a few changes.

This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.

Comments and questions are, as always, welcome.

--D

_______________________________________________
xfs mailing list
[hidden email]
http://oss.sgi.com/mailman/listinfo/xfs

[PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

Darrick J. Wong-2
Provide VFS helpers for handling O_SYNC AIO DIO writes.  Filesystems that want
to use the helpers have to pass DIO_SYNC_WRITES to __blockdev_direct_IO.  If the
filesystem doesn't provide its own direct IO end_io handler, the generic code
will take care of issuing the flush.  Otherwise, the filesystem's custom end_io
handler is passed a struct dio_sync_io_work pointer as the 'private' argument,
and it must call generic_dio_end_io() to finish the AIO DIO.  The generic code
then takes care of calling generic_write_sync() from a workqueue context once
the AIO DIO is complete.

Since all filesystems using blockdev_direct_IO() need O_SYNC AIO DIO handling,
and the generic implementation suffices for them, make blockdev_direct_IO()
pass the new DIO_SYNC_WRITES flag.

From: Jan Kara <[hidden email]>
Signed-off-by: Jan Kara <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
[[hidden email]: Minor style and changelog fixes]
Signed-off-by: Darrick J. Wong <[hidden email]>
---
 fs/direct-io.c     |  126 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/super.c         |    2 +
 include/linux/fs.h |   14 +++++-
 3 files changed, 137 insertions(+), 5 deletions(-)


diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..b7391d4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include <asm/cmpxchg.h>
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -112,6 +113,15 @@ struct dio_submit {
  unsigned tail; /* last valid page + 1 */
 };
 
+/* state needed for final sync and completion of O_SYNC AIO DIO */
+struct dio_sync_io_work {
+ struct kiocb *iocb;
+ loff_t offset;
+ ssize_t len;
+ int ret;
+ struct work_struct work;
+};
+
 /* dio_state communicated between submission path and end_io */
 struct dio {
  int flags; /* doesn't change */
@@ -134,6 +144,7 @@ struct dio {
  /* AIO related stuff */
  struct kiocb *iocb; /* kiocb */
  ssize_t result;                 /* IO result */
+ struct dio_sync_io_work *sync_work; /* work used for O_SYNC AIO */
 
  /*
  * pages[] (and any fields placed after it) are not zeroed out at
@@ -217,6 +228,45 @@ static inline struct page *dio_get_page(struct dio *dio,
 }
 
 /**
+ * generic_dio_end_io() - generic dio ->end_io handler
+ * @iocb: iocb of finishing DIO
+ * @offset: the byte offset in the file of the completed operation
+ * @bytes: length of the completed operation
+ * @work: work to queue for O_SYNC AIO DIO, NULL otherwise
+ * @ret: error code if IO failed
+ * @is_async: is this AIO?
+ *
+ * This is generic callback to be called when direct IO is finished. It
+ * handles update of number of outstanding DIOs for an inode, completion
+ * of async iocb and queueing of work if we need to call fsync() because
+ * io was O_SYNC.
+ */
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+ struct dio_sync_io_work *work, int ret, bool is_async)
+{
+ struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+
+ if (!is_async) {
+ inode_dio_done(inode);
+ return;
+ }
+
+ /*
+ * If we need to sync file, we offload completion to workqueue
+ */
+ if (work) {
+ work->ret = ret;
+ work->offset = offset;
+ work->len = bytes;
+ queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+ } else {
+ aio_complete(iocb, ret, 0);
+ inode_dio_done(inode);
+ }
+}
+EXPORT_SYMBOL(generic_dio_end_io);
+
+/**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
  *
@@ -258,12 +308,23 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
  ret = transferred;
 
  if (dio->end_io && dio->result) {
+ void *private;
+
+ if (dio->sync_work)
+ private = dio->sync_work;
+ else
+ private = dio->private;
+
  dio->end_io(dio->iocb, offset, transferred,
-    dio->private, ret, is_async);
+    private, ret, is_async);
  } else {
- if (is_async)
- aio_complete(dio->iocb, ret, 0);
- inode_dio_done(dio->inode);
+ /* No IO submitted? Skip syncing... */
+ if (!dio->result && dio->sync_work) {
+ kfree(dio->sync_work);
+ dio->sync_work = NULL;
+ }
+ generic_dio_end_io(dio->iocb, offset, transferred,
+   dio->sync_work, ret, is_async);
  }
 
  return ret;
@@ -1020,6 +1081,41 @@ static inline int drop_refcount(struct dio *dio)
 }
 
 /*
+ * Work performed from workqueue when AIO DIO is finished.
+ */
+static void dio_aio_sync_work(struct work_struct *work)
+{
+ struct dio_sync_io_work *sync_work =
+ container_of(work, struct dio_sync_io_work, work);
+ struct kiocb *iocb = sync_work->iocb;
+ struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
+ int err, ret = sync_work->ret;
+
+ err = generic_write_sync(iocb->ki_filp, sync_work->offset,
+ sync_work->len);
+ if (err < 0 && ret > 0)
+ ret = err;
+ aio_complete(iocb, ret, 0);
+ inode_dio_done(inode);
+}
+
+static noinline int dio_create_flush_wq(struct super_block *sb)
+{
+ struct workqueue_struct *wq =
+ alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
+
+ if (!wq)
+ return -ENOMEM;
+ /*
+ * Atomically put workqueue in place. Release our one in case someone
+ * else won the race and attached workqueue to superblock.
+ */
+ if (cmpxchg(&sb->s_dio_flush_wq, NULL, wq))
+ destroy_workqueue(wq);
+ return 0;
+}
+
+/*
  * This is a library function for use by filesystem drivers.
  *
  * The locking rules are governed by the flags parameter:
@@ -1112,6 +1208,26 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
  memset(dio, 0, offsetof(struct dio, pages));
 
  dio->flags = flags;
+ if (flags & DIO_SYNC_WRITES && rw & WRITE &&
+    ((iocb->ki_filp->f_flags & O_DSYNC) || IS_SYNC(inode))) {
+ /* The first O_SYNC AIO DIO for this FS? Create workqueue... */
+ if (!inode->i_sb->s_dio_flush_wq) {
+ retval = dio_create_flush_wq(inode->i_sb);
+ if (retval) {
+ kmem_cache_free(dio_cache, dio);
+ goto out;
+ }
+ }
+ dio->sync_work = kmalloc(sizeof(struct dio_sync_io_work),
+ GFP_KERNEL);
+ if (!dio->sync_work) {
+ retval = -ENOMEM;
+ kmem_cache_free(dio_cache, dio);
+ goto out;
+ }
+ INIT_WORK(&dio->sync_work->work, dio_aio_sync_work);
+ dio->sync_work->iocb = iocb;
+ }
  if (dio->flags & DIO_LOCKING) {
  if (rw == READ) {
  struct address_space *mapping =
@@ -1124,6 +1240,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
       end - 1);
  if (retval) {
  mutex_unlock(&inode->i_mutex);
+ kfree(dio->sync_work);
  kmem_cache_free(dio_cache, dio);
  goto out;
  }
@@ -1271,6 +1388,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 
  if (drop_refcount(dio) == 0) {
  retval = dio_complete(dio, offset, retval, false);
+ kfree(dio->sync_work);
  kmem_cache_free(dio_cache, dio);
  } else
  BUG_ON(retval != -EIOCBQUEUED);
diff --git a/fs/super.c b/fs/super.c
index 12f1237..7478202 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -244,6 +244,8 @@ static inline void destroy_super(struct super_block *s)
 #ifdef CONFIG_SMP
  free_percpu(s->s_files);
 #endif
+ if (s->s_dio_flush_wq)
+ destroy_workqueue(s->s_dio_flush_wq);
  destroy_sb_writers(s);
  security_sb_free(s);
  WARN_ON(!list_empty(&s->s_mounts));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..40dd206 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -44,6 +44,7 @@ struct vm_area_struct;
 struct vfsmount;
 struct cred;
 struct swap_info_struct;
+struct workqueue_struct;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1321,6 +1322,9 @@ struct super_block {
 
  /* Being remounted read-only */
  int s_readonly_remount;
+
+ /* Pending fsync calls for completed AIO DIO with O_SYNC */
+ struct workqueue_struct *s_dio_flush_wq;
 };
 
 /* superblock cache pruning functions */
@@ -2433,8 +2437,13 @@ enum {
 
  /* filesystem does not support filling holes */
  DIO_SKIP_HOLES = 0x02,
+
+ /* need generic handling of O_SYNC aio writes */
+ DIO_SYNC_WRITES = 0x04,
 };
 
+struct dio_sync_io_work;
+
 void dio_end_io(struct bio *bio, int error);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -2448,12 +2457,15 @@ static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
 {
  return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
     offset, nr_segs, get_block, NULL, NULL,
-    DIO_LOCKING | DIO_SKIP_HOLES);
+    DIO_LOCKING | DIO_SKIP_HOLES |
+    DIO_SYNC_WRITES);
 }
 #endif
 
 void inode_dio_wait(struct inode *inode);
 void inode_dio_done(struct inode *inode);
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+ struct dio_sync_io_work *work, int ret, bool is_async);
 
 extern const struct file_operations generic_ro_fops;
 


[PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

Darrick J. Wong-2
If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
flushed after the write completion.  Instead, it's flushed *before* the
I/O is sent to the disk (in __generic_file_aio_write).  This patch
attempts to fix that problem by marking an I/O as requiring a cache
flush in endio processing.  I'll send a follow-on patch to the
generic write code to get rid of the bogus generic_write_sync call
when EIOCBQUEUED is returned.

From: Jeff Moyer <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
[[hidden email]: Rework original patch to reflect a subsequent
                          ext4 reorganization]
Signed-off-by: Darrick J. Wong <[hidden email]>
---
 fs/ext4/ext4.h    |    9 +++++
 fs/ext4/file.c    |    2 +
 fs/ext4/inode.c   |    6 +++
 fs/ext4/page-io.c |   92 +++++++++++++++++++++++++++++++++++++++++++++--------
 fs/ext4/super.c   |   13 +++++++
 5 files changed, 106 insertions(+), 16 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c20de1..f5a0b89 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -186,6 +186,7 @@ struct mpage_da_data {
 #define EXT4_IO_END_ERROR 0x0002
 #define EXT4_IO_END_QUEUED 0x0004
 #define EXT4_IO_END_DIRECT 0x0008
+#define EXT4_IO_END_NEEDS_SYNC 0x0010
 
 struct ext4_io_page {
  struct page *p_page;
@@ -1279,6 +1280,9 @@ struct ext4_sb_info {
  /* workqueue for dio unwritten */
  struct workqueue_struct *dio_unwritten_wq;
 
+ /* workqueue for aio+dio+o_sync disk cache flushing */
+ struct workqueue_struct *aio_dio_flush_wq;
+
  /* timer for periodic error stats printing */
  struct timer_list s_err_report;
 
@@ -1335,6 +1339,11 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
  }
 }
 
+static inline int ext4_io_end_deferred(ext4_io_end_t *i)
+{
+ return i->flag & (EXT4_IO_END_UNWRITTEN | EXT4_IO_END_NEEDS_SYNC);
+}
+
 static inline ext4_io_end_t *ext4_inode_aio(struct inode *inode)
 {
  return inode->i_private;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bf3966b..dd34b81 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -155,7 +155,7 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
  ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
  mutex_unlock(&inode->i_mutex);
 
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
  ssize_t err;
 
  err = generic_write_sync(file, pos, ret);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..fc4f05e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2893,8 +2893,12 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 
  iocb->private = NULL;
 
+ /* AIO+DIO+O_SYNC I/Os need a cache flush after completion */
+ if (is_async && (IS_SYNC(inode) || (iocb->ki_filp->f_flags & O_DSYNC)))
+ io_end->flag |= EXT4_IO_END_NEEDS_SYNC;
+
  /* if not aio dio with unwritten extents, just free io and return */
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+ if (!ext4_io_end_deferred(io_end)) {
  ext4_free_io_end(io_end);
 out:
  if (is_async)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 68e896e..cda013a 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -84,6 +84,50 @@ void ext4_free_io_end(ext4_io_end_t *io)
  kmem_cache_free(io_end_cachep, io);
 }
 
+/*
+ * This function is called in the completion path for AIO O_SYNC|O_DIRECT
+ * writes, and also in the fsync path.  The purpose is to ensure that the
+ * disk caches for the journal and data devices are flushed.
+ */
+static int ext4_end_io_do_flush(ext4_io_end_t *io)
+{
+ struct inode *inode = io->inode;
+ tid_t commit_tid;
+ bool needs_barrier = false;
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ int barriers_enabled = test_opt(inode->i_sb, BARRIER);
+ int ret = 0;
+
+ if (!barriers_enabled)
+ return 0;
+
+ /*
+ * If we are running in nojournal mode, just flush the disk
+ * cache and return.
+ */
+ if (!journal)
+ return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
+
+ if (ext4_should_journal_data(inode)) {
+ ret = ext4_force_commit(inode->i_sb);
+ goto out;
+ }
+
+ commit_tid = io->iocb->ki_filp->f_flags & __O_SYNC ?
+ EXT4_I(inode)->i_sync_tid : EXT4_I(inode)->i_datasync_tid;
+ if (!jbd2_trans_will_send_data_barrier(journal, commit_tid))
+ needs_barrier = true;
+
+ jbd2_log_start_commit(journal, commit_tid);
+ ret = jbd2_log_wait_commit(journal, commit_tid);
+
+ if (!ret && needs_barrier)
+ ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
+
+out:
+ return ret;
+}
+
 /* check a range of space and convert unwritten extents to written. */
 static int ext4_end_io(ext4_io_end_t *io)
 {
@@ -96,21 +140,37 @@ static int ext4_end_io(ext4_io_end_t *io)
    "list->prev 0x%p\n",
    io, inode->i_ino, io->list.next, io->list.prev);
 
- ret = ext4_convert_unwritten_extents(inode, offset, size);
- if (ret < 0) {
- ext4_msg(inode->i_sb, KERN_EMERG,
- "failed to convert unwritten extents to written "
- "extents -- potential data loss!  "
- "(inode %lu, offset %llu, size %zd, error %d)",
- inode->i_ino, offset, size, ret);
+ if (io->flag & EXT4_IO_END_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(inode, offset, size);
+ if (ret < 0) {
+ ext4_msg(inode->i_sb, KERN_EMERG,
+ "failed to convert unwritten extents to "
+ "written extents -- potential data loss!  "
+ "(inode %lu, offset %llu, size %zd, error %d)",
+ inode->i_ino, offset, size, ret);
+ goto endio;
+ }
  }
+
+ /*
+ * This function has two callers.  The first is the end_io_work
+ * routine just below, which is an asynchronous completion context.
+ * The second is in the fsync path.  For the latter path, we can't
+ * return from here until the job is done.  Hence,
+ * ext4_end_io_do_flush is blocking.
+ */
+ if (io->flag & EXT4_IO_END_NEEDS_SYNC)
+ ret = ext4_end_io_do_flush(io);
+
+endio:
  if (io->iocb)
  aio_complete(io->iocb, io->result, 0);
 
  if (io->flag & EXT4_IO_END_DIRECT)
  inode_dio_done(inode);
  /* Wake up anyone waiting on unwritten extent conversion */
- if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
+ if (io->flag & EXT4_IO_END_UNWRITTEN &&
+    atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
  wake_up_all(ext4_ioend_wq(io->inode));
  return ret;
 }
@@ -149,8 +209,11 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
  struct workqueue_struct *wq;
  unsigned long flags;
 
- BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
- wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+ BUG_ON(!ext4_io_end_deferred(io_end));
+ if (io_end->flag & EXT4_IO_END_UNWRITTEN)
+ wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+ else
+ wq = EXT4_SB(io_end->inode->i_sb)->aio_dio_flush_wq;
 
  spin_lock_irqsave(&ei->i_completed_io_lock, flags);
  if (list_empty(&ei->i_completed_io_list)) {
@@ -180,7 +243,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
 
  while (!list_empty(&unwritten)) {
  io = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
+ BUG_ON(!ext4_io_end_deferred(io));
  list_del_init(&io->list);
 
  err = ext4_end_io(io);
@@ -192,7 +255,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
  spin_lock_irqsave(&ei->i_completed_io_lock, flags);
  while (!list_empty(&complete)) {
  io = list_entry(complete.next, ext4_io_end_t, list);
- io->flag &= ~EXT4_IO_END_UNWRITTEN;
+ io->flag &= ~(EXT4_IO_END_UNWRITTEN |
+      EXT4_IO_END_NEEDS_SYNC);
  /* end_io context can not be destroyed now because it still
  * used by queued worker. Worker thread will destroy it later */
  if (io->flag & EXT4_IO_END_QUEUED)
@@ -204,7 +268,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
  * flag, and destroy it's end_io if it was converted already */
  if (work_io) {
  work_io->flag &= ~EXT4_IO_END_QUEUED;
- if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
+ if (!ext4_io_end_deferred(work_io))
  list_add_tail(&work_io->list, &to_free);
  }
  spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
@@ -319,7 +383,7 @@ static void ext4_end_bio(struct bio *bio, int error)
      bi_sector >> (inode->i_blkbits - 9));
  }
 
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+ if (!ext4_io_end_deferred(io_end)) {
  ext4_free_io_end(io_end);
  return;
  }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 80928f7..c657c4d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -849,7 +849,9 @@ static void ext4_put_super(struct super_block *sb)
  dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
 
  flush_workqueue(sbi->dio_unwritten_wq);
+ flush_workqueue(sbi->aio_dio_flush_wq);
  destroy_workqueue(sbi->dio_unwritten_wq);
+ destroy_workqueue(sbi->aio_dio_flush_wq);
 
  if (sbi->s_journal) {
  err = jbd2_journal_destroy(sbi->s_journal);
@@ -3913,6 +3915,14 @@ no_journal:
  goto failed_mount_wq;
  }
 
+ EXT4_SB(sb)->aio_dio_flush_wq =
+ alloc_workqueue("ext4-aio-dio-flush",
+ WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+ if (!EXT4_SB(sb)->aio_dio_flush_wq) {
+ pr_err("EXT4-fs: failed to create flush workqueue\n");
+ goto failed_flush_wq;
+ }
+
  /*
  * The jbd2_journal_load will have done any necessary log recovery,
  * so we can safely mount the rest of the filesystem now.
@@ -4045,6 +4055,8 @@ failed_mount4a:
  sb->s_root = NULL;
 failed_mount4:
  ext4_msg(sb, KERN_ERR, "mount failed");
+ destroy_workqueue(EXT4_SB(sb)->aio_dio_flush_wq);
+failed_flush_wq:
  destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
 failed_mount_wq:
  if (sbi->s_journal) {
@@ -4494,6 +4506,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 
  trace_ext4_sync_fs(sb, wait);
  flush_workqueue(sbi->dio_unwritten_wq);
+ flush_workqueue(sbi->aio_dio_flush_wq);
  /*
  * Writeback quota in non-journalled quota case - journalled quota has
  * no dirty dquots


[PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

Darrick J. Wong-2
Hi,

Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
function that can be used from the I/O post-processing path.

From: Jeff Moyer <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
---
 fs/xfs/xfs_file.c  |   44 +++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_inode.h |    1 +
 2 files changed, 30 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index aa473fa..507f446 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -152,25 +152,18 @@ xfs_dir_fsync(
  return _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
 }
 
-STATIC int
-xfs_file_fsync(
- struct file *file,
- loff_t start,
- loff_t end,
+/*
+ * Returns 0 on success, -errno on failure.
+ */
+int
+do_xfs_file_fsync(
+ struct xfs_inode *ip,
+ struct xfs_mount *mp,
  int datasync)
 {
- struct inode *inode = file->f_mapping->host;
- struct xfs_inode *ip = XFS_I(inode);
- struct xfs_mount *mp = ip->i_mount;
- int error = 0;
  int log_flushed = 0;
  xfs_lsn_t lsn = 0;
-
- trace_xfs_file_fsync(ip);
-
- error = filemap_write_and_wait_range(inode->i_mapping, start, end);
- if (error)
- return error;
+ int error = 0;
 
  if (XFS_FORCED_SHUTDOWN(mp))
  return -XFS_ERROR(EIO);
@@ -222,6 +215,27 @@ xfs_file_fsync(
  return -error;
 }
 
+STATIC int
+xfs_file_fsync(
+ struct file *file,
+ loff_t start,
+ loff_t end,
+ int datasync)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ int error = 0;
+
+ trace_xfs_file_fsync(ip);
+
+ error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+ if (error)
+ return error;
+
+ return do_xfs_file_fsync(ip, mp, datasync);
+}
+
 STATIC ssize_t
 xfs_file_aio_read(
  struct kiocb *iocb,
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 94b32f9..d5bf105 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -546,6 +546,7 @@ do { \
  iput(VFS_I(ip)); \
 } while (0)
 
+int do_xfs_file_fsync(struct xfs_inode *, struct xfs_mount *, int);
 #endif /* __KERNEL__ */
 
 /*


[PATCH 7/9] ocfs2: Use generic handlers of O_SYNC AIO DIO

Darrick J. Wong-2
Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC
file.

From: Jan Kara <[hidden email]>
Signed-off-by: Jan Kara <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
---
 fs/ocfs2/aops.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)


diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 6577432..60457cc 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
  level = ocfs2_iocb_rw_locked_level(iocb);
  ocfs2_rw_unlock(inode, level);
 
- if (is_async)
- aio_complete(iocb, ret, 0);
- inode_dio_done(inode);
+ generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
 }
 
 /*
@@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
  return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
     iov, offset, nr_segs,
     ocfs2_direct_IO_get_blocks,
-    ocfs2_dio_end_io, NULL, 0);
+    ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
 }
 
 static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,



[PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

Darrick J. Wong-2
If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
flushed after the write completion for AIOs.  This patch attempts to fix
that problem by marking an I/O as requiring a cache flush in endio
processing, and then issuing the cache flush after any unwritten extent
conversion is done.

From: Jeff Moyer <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
[[hidden email]: Rework patch to use per-mount workqueues]
Signed-off-by: Darrick J. Wong <[hidden email]>
---
 fs/xfs/xfs_aops.c  |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_aops.h  |    1 +
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |    8 ++++++++
 4 files changed, 61 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e57e2da..9cebbb7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -173,6 +173,24 @@ xfs_setfilesize(
 }
 
 /*
+ * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
+ * the disk cache when the I/O is complete.
+ */
+STATIC bool
+xfs_ioend_needs_cache_flush(
+ struct xfs_ioend *ioend)
+{
+ struct xfs_inode *ip = XFS_I(ioend->io_inode);
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (!(mp->m_flags & XFS_MOUNT_BARRIER))
+ return false;
+
+ return IS_SYNC(ioend->io_inode) ||
+       (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
+}
+
+/*
  * Schedule IO completion handling on the final put of an ioend.
  *
  * If there is no work to do we might as well call it a day and free the
@@ -189,11 +207,30 @@ xfs_finish_ioend(
  queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
  else if (ioend->io_append_trans)
  queue_work(mp->m_data_workqueue, &ioend->io_work);
+ else if (ioend->io_needs_fsync)
+ queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
  else
  xfs_destroy_ioend(ioend);
  }
 }
 
+STATIC int
+xfs_ioend_force_cache_flush(
+ xfs_ioend_t *ioend)
+{
+ struct xfs_inode *ip = XFS_I(ioend->io_inode);
+ struct xfs_mount *mp = ip->i_mount;
+ int err = 0;
+ int datasync;
+
+ datasync = !IS_SYNC(ioend->io_inode) &&
+ !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
+ err = do_xfs_file_fsync(ip, mp, datasync);
+ xfs_destroy_ioend(ioend);
+ /* do_xfs_file_fsync returns -errno. our caller expects positive. */
+ return -err;
+}
+
 /*
  * IO write completion.
  */
@@ -250,12 +287,22 @@ xfs_end_io(
  error = xfs_setfilesize(ioend);
  if (error)
  ioend->io_error = -error;
+ } else if (ioend->io_needs_fsync) {
+ error = xfs_ioend_force_cache_flush(ioend);
+ if (error && ioend->io_result > 0)
+ ioend->io_error = -error;
+ ioend->io_needs_fsync = 0;
  } else {
  ASSERT(!xfs_ioend_is_append(ioend));
  }
 
 done:
- xfs_destroy_ioend(ioend);
+ /* the honoring of O_SYNC has to be done last */
+ if (ioend->io_needs_fsync) {
+ atomic_inc(&ioend->io_remaining);
+ xfs_finish_ioend(ioend);
+ } else
+ xfs_destroy_ioend(ioend);
 }
 
 /*
@@ -292,6 +339,7 @@ xfs_alloc_ioend(
  atomic_set(&ioend->io_remaining, 1);
  ioend->io_isasync = 0;
  ioend->io_isdirect = 0;
+ ioend->io_needs_fsync = 0;
  ioend->io_error = 0;
  ioend->io_list = NULL;
  ioend->io_type = type;
@@ -1409,6 +1457,8 @@ xfs_end_io_direct_write(
 
  if (is_async) {
  ioend->io_isasync = 1;
+ if (xfs_ioend_needs_cache_flush(ioend))
+ ioend->io_needs_fsync = 1;
  xfs_finish_ioend(ioend);
  } else {
  xfs_finish_ioend_sync(ioend);
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index c325abb..e48c7c2 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -47,6 +47,7 @@ typedef struct xfs_ioend {
  atomic_t io_remaining; /* hold count */
  unsigned int io_isasync : 1; /* needs aio_complete */
  unsigned int io_isdirect : 1;/* direct I/O */
+ unsigned int io_needs_fsync : 1; /* aio+dio+o_sync */
  struct inode *io_inode; /* file being written to */
  struct buffer_head *io_buffer_head;/* buffer linked list head */
  struct buffer_head *io_buffer_tail;/* buffer linked list tail */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index deee09e..ecd3d2e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -209,6 +209,7 @@ typedef struct xfs_mount {
  struct workqueue_struct *m_data_workqueue;
  struct workqueue_struct *m_unwritten_workqueue;
  struct workqueue_struct *m_cil_workqueue;
+ struct workqueue_struct *m_aio_blkdev_flush_wq;
 } xfs_mount_t;
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 26a09bd..b05b557 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -863,8 +863,15 @@ xfs_init_mount_workqueues(
  WQ_MEM_RECLAIM, 0, mp->m_fsname);
  if (!mp->m_cil_workqueue)
  goto out_destroy_unwritten;
+
+ mp->m_aio_blkdev_flush_wq = alloc_workqueue("xfs-aio-blkdev-flush/%s",
+ WQ_MEM_RECLAIM, 0, mp->m_fsname);
+ if (!mp->m_aio_blkdev_flush_wq)
+ goto out_destroy_cil_queue;
  return 0;
 
+out_destroy_cil_queue:
+ destroy_workqueue(mp->m_cil_workqueue);
 out_destroy_unwritten:
  destroy_workqueue(mp->m_unwritten_workqueue);
 out_destroy_data_iodone_queue:
@@ -877,6 +884,7 @@ STATIC void
 xfs_destroy_mount_workqueues(
  struct xfs_mount *mp)
 {
+ destroy_workqueue(mp->m_aio_blkdev_flush_wq);
  destroy_workqueue(mp->m_cil_workqueue);
  destroy_workqueue(mp->m_data_workqueue);
  destroy_workqueue(mp->m_unwritten_workqueue);



[PATCH 6/9] gfs2: Use generic handlers of O_SYNC AIO DIO

Darrick J. Wong-2
Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC
file.

From: Jan Kara <[hidden email]>
Signed-off-by: Jan Kara <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
---
 fs/gfs2/aops.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 01c4975..8b1d0f7 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1022,7 +1022,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
 
  rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
   offset, nr_segs, gfs2_get_block_direct,
-  NULL, NULL, 0);
+  NULL, NULL, DIO_SYNC_WRITES);
 out:
  gfs2_glock_dq(&gh);
  gfs2_holder_uninit(&gh);



[PATCH 5/9] btrfs: Use generic handlers of O_SYNC AIO DIO

Darrick J. Wong-2
Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC
file. Although btrfs uses its own bio->end_io function, it calls dio_end_io()
from there; and because it doesn't set a specific dio->end_io function, the
generic code ends up calling generic_dio_end_io(), which is all we need for
proper O_SYNC AIO DIO handling.

From: Jan Kara <[hidden email]>
Signed-off-by: Jan Kara <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
[[hidden email]: Don't issue flush if aio is queued]
Signed-off-by: Darrick J. Wong <[hidden email]>
---
 fs/btrfs/file.c  |    2 +-
 fs/btrfs/inode.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9ab1bed..37b5bb3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1495,7 +1495,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
  * one running right now.
  */
  BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
- if (num_written > 0 || num_written == -EIOCBQUEUED) {
+ if (num_written > 0) {
  err = generic_write_sync(file, pos, num_written);
  if (err < 0 && num_written > 0)
  num_written = err;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95542a1..c8b6049 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6579,7 +6579,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
  return __blockdev_direct_IO(rw, iocb, inode,
    BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
    iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
-   btrfs_submit_direct, 0);
+   btrfs_submit_direct, DIO_SYNC_WRITES);
 }
 
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,



[PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly

Darrick J. Wong-2
In reply to this post by Darrick J. Wong-2
When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
flag so that flushes are issued /after/ the write completes, not before.

Note, however, that for block devices, the DIO setup code ensures that a flush
wq is attached to the superblock of the bdevfs filesystem, not the filesystem
that the device node happens to reside in.  This means that unlike regular
files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb.  Therefore, adjust
Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.

Signed-off-by: Darrick J. Wong <[hidden email]>
---
 fs/block_dev.c |    5 +++--
 fs/direct-io.c |    3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)


diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..05ff33a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
  struct inode *inode = file->f_mapping->host;
 
  return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
-    nr_segs, blkdev_get_blocks, NULL, NULL, 0);
+    nr_segs, blkdev_get_blocks, NULL, NULL,
+    DIO_SYNC_WRITES);
 }
 
 int __sync_blockdev(struct block_device *bdev, int wait)
@@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
  percpu_down_read(&bdev->bd_block_size_semaphore);
 
  ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
  ssize_t err;
 
  err = generic_write_sync(file, pos, ret);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b7391d4..c626c43 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
  work->ret = ret;
  work->offset = offset;
  work->len = bytes;
- queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+ queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
+   &work->work);
  } else {
  aio_complete(iocb, ret, 0);
  inode_dio_done(inode);



[PATCH 8/9] filemap: don't call generic_write_sync for -EIOCBQUEUED

Darrick J. Wong-2
In reply to this post by Darrick J. Wong-2
Hi,

As it stands, generic_file_aio_write will call into generic_write_sync
when -EIOCBQUEUED is returned from __generic_file_aio_write.  EIOCBQUEUED
indicates that an I/O was submitted but NOT completed.  Thus, we will
flush the disk cache, potentially before the write(s) even make it to
the disk!  Up until now, this has been the best we could do, as file
systems didn't bother to flush the disk cache after an O_SYNC AIO+DIO
write.  After applying the prior two patches to xfs and ext4, at least
the major two file systems do the right thing.  So, let's go ahead and
fix this backwards logic.

From: Jeff Moyer <[hidden email]>
Signed-off-by: Jeff Moyer <[hidden email]>
---
 mm/filemap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..8e14c10 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2532,7 +2532,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
  ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
  mutex_unlock(&inode->i_mutex);
 
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
  ssize_t err;
 
  err = generic_write_sync(file, pos, ret);



Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

Darrick J. Wong-2
In reply to this post by Darrick J. Wong-2
Apologies for the garbage emails.  The corporate email server lost its mind,
and I'm well on my way to losing mine.... <grumble> <shakes fist>

At least they were so garbled that they didn't get attached to this thread.

--D

> Hi everybody,
>
> On March 29th, Jeff Moyer posted to lkml a patchset with this note:
>
> > Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
> > are sync'd before the I/O is actually issued (everybody else).  The following
> > patch series fixes this in two parts.  First, for the file systems that use
> > the generic routines, Jan has provided some generic infrastructure to perform
> > the syncs after the I/O is completed.  Second, for those file systems which
> > require some endio processing of their own for O_DIRECT writes (xfs and
> > ext4), [Jeff] implemented file system specific syncing.  This passes the
> > updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
> > in the aio group.  [Jeff] tested ext3, ext4, xfs, and btrfs only.
>
> Since the original post a few months ago, this patchset doesn't seem to have
> made any progress.  An internal testing team here discovered that the issue
> also affects O_SYNC+AIO+DIO writes to block devices.  Worse yet, since the
> flushes were being issued (and waited upon) directly in the io_submit call
> graph, the io_submit calls themselves would take a very long time to complete.
> Therefore, I added another patch to move the flush to the io_end processing.
>
> The blockdev patch was written by me.  The ext4 patch had to be updated to
> accommodate a rework of the ext4 endio code that landed since March.  Everything
> else has been passed through from Jeff's March 30th resend, with few changes.
>
> This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
> ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.
>
> Comments and questions are, as always, welcome.
>
> --D


Re: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

Jan Kara
In reply to this post by Darrick J. Wong-2
On Mon 19-11-12 23:41:31, Darrick J. Wong wrote:

> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion.  Instead, it's flushed *before* the
> I/O is sent to the disk (in __generic_file_aio_write).  This patch
> attempts to fix that problem by marking an I/O as requiring a cache
> flush in endio processing.  I'll send a follow-on patch to the
> generic write code to get rid of the bogus generic_write_sync call
> when EIOCBQUEUED is returned.
>
> From: Jeff Moyer <[hidden email]>
> Signed-off-by: Jeff Moyer <[hidden email]>
> [[hidden email]: Rework original patch to reflect a subsequent
>  ext4 reorganization]
> Signed-off-by: Darrick J. Wong <[hidden email]>
> ---
>  fs/ext4/ext4.h    |    9 +++++
>  fs/ext4/file.c    |    2 +
>  fs/ext4/inode.c   |    6 +++
>  fs/ext4/page-io.c |   92 +++++++++++++++++++++++++++++++++++++++++++++--------
>  fs/ext4/super.c   |   13 +++++++
>  5 files changed, 106 insertions(+), 16 deletions(-)
>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3c20de1..f5a0b89 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -186,6 +186,7 @@ struct mpage_da_data {
>  #define EXT4_IO_END_ERROR 0x0002
>  #define EXT4_IO_END_QUEUED 0x0004
>  #define EXT4_IO_END_DIRECT 0x0008
> +#define EXT4_IO_END_NEEDS_SYNC 0x0010
>  
>  struct ext4_io_page {
>   struct page *p_page;
> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {
>   /* workqueue for dio unwritten */
>   struct workqueue_struct *dio_unwritten_wq;
>  
> + /* workqueue for aio+dio+o_sync disk cache flushing */
> + struct workqueue_struct *aio_dio_flush_wq;
> +
  Umm, I'm not completely decided whether we really need a separate
workqueue. But it doesn't cost too much so I guess it makes some sense -
fsync() is rather heavy so syncing won't starve extent conversion...

> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 68e896e..cda013a 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -84,6 +84,50 @@ void ext4_free_io_end(ext4_io_end_t *io)
>   kmem_cache_free(io_end_cachep, io);
>  }
>  
> +/*
> + * This function is called in the completion path for AIO O_SYNC|O_DIRECT
> + * writes, and also in the fsync path.  The purpose is to ensure that the
> + * disk caches for the journal and data devices are flushed.
> + */
> +static int ext4_end_io_do_flush(ext4_io_end_t *io)
> +{
> + struct inode *inode = io->inode;
> + tid_t commit_tid;
> + bool needs_barrier = false;
> + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> + int barriers_enabled = test_opt(inode->i_sb, BARRIER);
> + int ret = 0;
> +
> + if (!barriers_enabled)
> + return 0;
  This is definitely wrong. Even if barriers are disabled, we may need to
push out some buffers or commit a transaction.

> +
> + /*
> + * If we are running in nojournal mode, just flush the disk
> + * cache and return.
> + */
> + if (!journal)
> + return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
  And this is wrong as well - you need to do work similar to what
ext4_sync_file() does. Actually it would be *much* better if these two
sites used the same helper function. Which also poses an interesting
question about locking - do we need i_mutex or not? Forcing a transaction
commit is definitely OK without it, similarly as grabbing transaction ids
from inode or ext4_should_journal_data() test. __sync_inode() call seems
to be OK without i_mutex as well so I believe we can just get rid of it
(getting i_mutex from the workqueue is a locking nightmare we don't want to
return to).

> +
> + if (ext4_should_journal_data(inode)) {
> + ret = ext4_force_commit(inode->i_sb);
> + goto out;
> + }
> +
> + commit_tid = io->iocb->ki_filp->f_flags & __O_SYNC ?
> + EXT4_I(inode)->i_sync_tid : EXT4_I(inode)->i_datasync_tid;
> + if (!jbd2_trans_will_send_data_barrier(journal, commit_tid))
> + needs_barrier = true;
> +
> + jbd2_log_start_commit(journal, commit_tid);
> + ret = jbd2_log_wait_commit(journal, commit_tid);
> +
> + if (!ret && needs_barrier)
> + ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
> +
> +out:
> + return ret;
> +}
> +
>  /* check a range of space and convert unwritten extents to written. */
>  static int ext4_end_io(ext4_io_end_t *io)
>  {
...

> @@ -149,8 +209,11 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
>   struct workqueue_struct *wq;
>   unsigned long flags;
>  
> - BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> - wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> + BUG_ON(!ext4_io_end_deferred(io_end));
> + if (io_end->flag & EXT4_IO_END_UNWRITTEN)
> + wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> + else
> + wq = EXT4_SB(io_end->inode->i_sb)->aio_dio_flush_wq;
  Umm, I'd prefer if we used aio_dio_flush_wq when EXT4_IO_END_NEEDS_SYNC
is set. That way slow syncing work will always be offloaded to a separate
workqueue.

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR


Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly

Jan Kara
In reply to this post by Darrick J. Wong-2
On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:

> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
> flag so that flushes are issued /after/ the write completes, not before.
>
> Note, however, that for block devices, the DIO setup code ensures that a flush
> wq is attached to the superblock of the bdevfs filesystem, not the filesystem
> that the device node happens to reside in.  This means that unlike regular
> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb.  Therefore, adjust
> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.
>
> Signed-off-by: Darrick J. Wong <[hidden email]>
> ---
>  fs/block_dev.c |    5 +++--
>  fs/direct-io.c |    3 ++-
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..05ff33a 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>   struct inode *inode = file->f_mapping->host;
>  
>   return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
> -    nr_segs, blkdev_get_blocks, NULL, NULL, 0);
> +    nr_segs, blkdev_get_blocks, NULL, NULL,
> +    DIO_SYNC_WRITES);
>  }
>  
>  int __sync_blockdev(struct block_device *bdev, int wait)
> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
>   percpu_down_read(&bdev->bd_block_size_semaphore);
>  
>   ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
> - if (ret > 0 || ret == -EIOCBQUEUED) {
> + if (ret > 0) {
>   ssize_t err;
>  
>   err = generic_write_sync(file, pos, ret);
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b7391d4..c626c43 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
>   work->ret = ret;
>   work->offset = offset;
>   work->len = bytes;
> - queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
> + queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
> +   &work->work);
  This should be folded into the original patch introducing the
s_dio_flush_wq. And please add a comment before this line saying that block
devices need a dereference exactly like this... Otherwise the patch looks
good so you can add:
  Reviewed-by: Jan Kara <[hidden email]>

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR


Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

Jan Kara
In reply to this post by Darrick J. Wong-2
On Mon 19-11-12 23:51:14, Darrick J. Wong wrote:

> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion for AIOs.  This patch attempts to fix
> that problem by marking an I/O as requiring a cache flush in endio
> processing, and then issuing the cache flush after any unwritten extent
> conversion is done.
>
> From: Jeff Moyer <[hidden email]>
> Signed-off-by: Jeff Moyer <[hidden email]>
> [[hidden email]: Rework patch to use per-mount workqueues]
> Signed-off-by: Darrick J. Wong <[hidden email]>
> ---
>  fs/xfs/xfs_aops.c  |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_aops.h  |    1 +
>  fs/xfs/xfs_mount.h |    1 +
>  fs/xfs/xfs_super.c |    8 ++++++++
>  4 files changed, 61 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e57e2da..9cebbb7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -173,6 +173,24 @@ xfs_setfilesize(
>  }
>  
>  /*
> + * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
> + * the disk cache when the I/O is complete.
> + */
> +STATIC bool
> +xfs_ioend_needs_cache_flush(
> + struct xfs_ioend *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> +
> + if (!(mp->m_flags & XFS_MOUNT_BARRIER))
> + return false;
> +
> + return IS_SYNC(ioend->io_inode) ||
> +       (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
> +}
  I'm not really an XFS developer but calling things "...cache_flush" when
you actually mean fsync is IMHO misleading.

> +
> +/*
>   * Schedule IO completion handling on the final put of an ioend.
>   *
>   * If there is no work to do we might as well call it a day and free the
> @@ -189,11 +207,30 @@ xfs_finish_ioend(
>   queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>   else if (ioend->io_append_trans)
>   queue_work(mp->m_data_workqueue, &ioend->io_work);
> + else if (ioend->io_needs_fsync)
> + queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
>   else
>   xfs_destroy_ioend(ioend);
>   }
>  }
>  
> +STATIC int
> +xfs_ioend_force_cache_flush(
> + xfs_ioend_t *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> + int err = 0;
> + int datasync;
> +
> + datasync = !IS_SYNC(ioend->io_inode) &&
> + !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
> + err = do_xfs_file_fsync(ip, mp, datasync);
> + xfs_destroy_ioend(ioend);
> + /* do_xfs_file_fsync returns -errno. our caller expects positive. */
> + return -err;
> +}
> +
>  /*
>   * IO write completion.
>   */
> @@ -250,12 +287,22 @@ xfs_end_io(
>   error = xfs_setfilesize(ioend);
>   if (error)
>   ioend->io_error = -error;
> + } else if (ioend->io_needs_fsync) {
> + error = xfs_ioend_force_cache_flush(ioend);
> + if (error && ioend->io_result > 0)
> + ioend->io_error = -error;
> + ioend->io_needs_fsync = 0;
>   } else {
>   ASSERT(!xfs_ioend_is_append(ioend));
>   }
>  
>  done:
> - xfs_destroy_ioend(ioend);
> + /* the honoring of O_SYNC has to be done last */
> + if (ioend->io_needs_fsync) {
> + atomic_inc(&ioend->io_remaining);
> + xfs_finish_ioend(ioend);
> + } else
> + xfs_destroy_ioend(ioend);
>  }
  Umm, I don't quite get why you do things the way you do in xfs_end_io().
Why don't you always handle the fsync there instead of offloading it to a
workqueue again in some cases? Is it so that it gets processed in the right
workqueue, or why?

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR


Re: [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

Dave Chinner
In reply to this post by Darrick J. Wong-2
On Mon, Nov 19, 2012 at 11:41:38PM -0800, Darrick J. Wong wrote:

> Hi,
>
> Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
> function that can be used from the I/O post-processing path.
>
> From: Jeff Moyer <[hidden email]>
> Signed-off-by: Jeff Moyer <[hidden email]>
> ---
>  fs/xfs/xfs_file.c  |   44 +++++++++++++++++++++++++++++---------------
>  fs/xfs/xfs_inode.h |    1 +
>  2 files changed, 30 insertions(+), 15 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index aa473fa..507f446 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -152,25 +152,18 @@ xfs_dir_fsync(
>   return _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
>  }
>  
> -STATIC int
> -xfs_file_fsync(
> - struct file *file,
> - loff_t start,
> - loff_t end,
> +/*
> + * Returns 0 on success, -errno on failure.
> + */
> +int
> +do_xfs_file_fsync(
> + struct xfs_inode *ip,
> + struct xfs_mount *mp,
>   int datasync)

xfs_do_file_fsync()

And being an internal XFS function, it should return a positive
error number.

....

> +{
> + struct inode *inode = file->f_mapping->host;
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_mount *mp = ip->i_mount;
> + int error = 0;
> +
> + trace_xfs_file_fsync(ip);
> +
> + error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> + if (error)
> + return error;
> +
> + return do_xfs_file_fsync(ip, mp, datasync);

And be negated here.

> +}
> +
>  STATIC ssize_t
>  xfs_file_aio_read(
>   struct kiocb *iocb,
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 94b32f9..d5bf105 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -546,6 +546,7 @@ do { \
>   iput(VFS_I(ip)); \
>  } while (0)
>  
> +int do_xfs_file_fsync(struct xfs_inode *, struct xfs_mount *, int);
>  #endif /* __KERNEL__ */

This should probably go in fs/xfs/xfs_vnodeops.h, like the only other
non-static function in fs/xfs/xfs_file.c (xfs_zero_eof).

Cheers,

Dave.
--
Dave Chinner
[hidden email]


Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

Dave Chinner
In reply to this post by Darrick J. Wong-2
On Mon, Nov 19, 2012 at 11:51:14PM -0800, Darrick J. Wong wrote:

> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion for AIOs.  This patch attempts to fix
> that problem by marking an I/O as requiring a cache flush in endio
> processing, and then issuing the cache flush after any unwritten extent
> conversion is done.
>
> From: Jeff Moyer <[hidden email]>
> Signed-off-by: Jeff Moyer <[hidden email]>
> [[hidden email]: Rework patch to use per-mount workqueues]
> Signed-off-by: Darrick J. Wong <[hidden email]>
> ---
>  fs/xfs/xfs_aops.c  |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_aops.h  |    1 +
>  fs/xfs/xfs_mount.h |    1 +
>  fs/xfs/xfs_super.c |    8 ++++++++
>  4 files changed, 61 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e57e2da..9cebbb7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -173,6 +173,24 @@ xfs_setfilesize(
>  }
>  
>  /*
> + * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
> + * the disk cache when the I/O is complete.
> + */

That's not quite right - we need to fsync the metadata when the data
IO is complete. Block device/disk cache flushes are irrelevant at
this level as that is wholly encapsulated inside the metadata fsync
processing.

> +STATIC bool
> +xfs_ioend_needs_cache_flush(

xfs_ioend_need_fsync()

> + struct xfs_ioend *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> +
> + if (!(mp->m_flags & XFS_MOUNT_BARRIER))
> + return false;

And regardless of whether we have barriers enabled or not, we need
to flush the dirty metadata to the log for O_SYNC/O_DSYNC+AIO+DIO
writes here. So there should be no check of the mount flags here.

> + return IS_SYNC(ioend->io_inode) ||
> +       (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
> +}
> +
> +/*
>   * Schedule IO completion handling on the final put of an ioend.
>   *
>   * If there is no work to do we might as well call it a day and free the
> @@ -189,11 +207,30 @@ xfs_finish_ioend(
>   queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>   else if (ioend->io_append_trans)
>   queue_work(mp->m_data_workqueue, &ioend->io_work);
> + else if (ioend->io_needs_fsync)
> + queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
>   else
>   xfs_destroy_ioend(ioend);
>   }
>  }
>  
> +STATIC int
> +xfs_ioend_force_cache_flush(

STATIC void
xfs_ioend_fsync()

> + xfs_ioend_t *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> + int err = 0;
> + int datasync;

        trace_xfs_ioend_fsync(ip);

> + datasync = !IS_SYNC(ioend->io_inode) &&
> + !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
> + err = do_xfs_file_fsync(ip, mp, datasync);
> + xfs_destroy_ioend(ioend);

I think this is wrong. The ioend is destroyed by the caller, so
putting it here turns all subsequent uses by the caller into
use-after-free memory corruption bugs.


> + /* do_xfs_file_fsync returns -errno. our caller expects positive. */
> + return -err;

This is why xfs_do_file_fsync() should be returning a positive
error - to avoid repeated juggling of error signs.

> +}
> +
>  /*
>   * IO write completion.
>   */
> @@ -250,12 +287,22 @@ xfs_end_io(
>   error = xfs_setfilesize(ioend);
>   if (error)
>   ioend->io_error = -error;
> + } else if (ioend->io_needs_fsync) {
> + error = xfs_ioend_force_cache_flush(ioend);
> + if (error && ioend->io_result > 0)
> + ioend->io_error = -error;
> + ioend->io_needs_fsync = 0;

So this is all use-after-free. Also, there's no need to clear
io_needs_fsync as the ioend is about to be destroyed.

>   } else {
>   ASSERT(!xfs_ioend_is_append(ioend));
>   }
>  
>  done:
> - xfs_destroy_ioend(ioend);
> + /* the honoring of O_SYNC has to be done last */
> + if (ioend->io_needs_fsync) {
> + atomic_inc(&ioend->io_remaining);
> + xfs_finish_ioend(ioend);
> + } else
> + xfs_destroy_ioend(ioend);

And requeuing work from one workqueue to the next is something that
we can avoid.  We know at IO submission time (i.e.
xfs_vm_direct_io) whether an fsync completion is going to be needed
during IO completion.  The ioend->io_needs_fsync flag can be set
then, and the first pass through xfs_finish_ioend() can queue it to
the correct workqueue.  i.e. it only needs to be queued if it's not
already an unwritten or append ioend and it needs an fsync.

As it is, all the data completion workqueues run the same completion
function so all you need to do is handle the fsync case at the end
of the existing processing - it's not an else case. i.e. the end of
xfs_end_io() becomes:

        if (ioend->io_needs_fsync) {
                error = xfs_ioend_fsync(ioend);
                if (error)
                        ioend->io_error = -error;
                goto done;
        }
done:
        xfs_destroy_ioend(ioend);

As it is, this code is going to change before these changes go in -
there's a nasty regression in the DIO code that I found this
afternoon that requires reworking this IO completion logic to
avoid. The patch will appear on the list soon....

> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -209,6 +209,7 @@ typedef struct xfs_mount {
>   struct workqueue_struct *m_data_workqueue;
>   struct workqueue_struct *m_unwritten_workqueue;
>   struct workqueue_struct *m_cil_workqueue;
> + struct workqueue_struct *m_aio_blkdev_flush_wq;

        struct workqueue_struct *m_aio_fsync_wq;

Cheers,

Dave.
--
Dave Chinner
[hidden email]


Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

Jeff Moyer
In reply to this post by Darrick J. Wong-2
"Darrick J. Wong" <[hidden email]> writes:

> Hi everybody,
>
> On March 29th, Jeff Moyer posted to lkml a patchset with this note:
>
>> Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
>> are sync'd before the I/O is actually issued (everybody else).  The following
>> patch series fixes this in two parts.  First, for the file systems that use
>> the generic routines, Jan has provided some generic infrastructure to perform
>> the syncs after the I/O is completed.  Second, for those file systems which
>> require some endio processing of their own for O_DIRECT writes (xfs and
>> ext4), [Jeff] implemented file system specific syncing.  This passes the
>> updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
>> in the aio group.  [Jeff] tested ext3, ext4, xfs, and btrfs only.
>
> Since the original post a few months ago, this patchset doesn't seem to have
> made any progress.  An internal testing team here discovered that the issue
> also affects O_SYNC+AIO+DIO writes to block devices.  Worse yet, since the
> flushes were being issued (and waited upon) directly in the io_submit call
> graph, the io_submit calls themselves would take a very long time to complete.
> Therefore, I added another patch to move the flush to the io_end processing.
>
> The blockdev patch was written by me.  The ext4 patch had to be updated to
> accommodate a rework of the ext4 endio code that landed since March.  Everything
> else has been passed through from Jeff's March 30th resend, with few changes.
>
> This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
> ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.
>
> Comments and questions are, as always, welcome.

Hi, Darrick,

I just finished testing my version of this patch set on ext4 and xfs
(btrfs is next), but you beat me to the posting!  Sorry, I should have
been more clear when I said I would see about refreshing the series.

How have you tested these patches?  You just said "lightly", and I'm
afraid I don't know what that means.  xfstests?  You should at least run
them through test 113.

Would you like to push the set in, or should I post the patches I've
got?  Doesn't matter to me, just let me know.

Cheers,
Jeff


Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

Darrick J. Wong-2
On Tue, Nov 20, 2012 at 09:23:18AM -0500, Jeff Moyer wrote:

> "Darrick J. Wong" <[hidden email]> writes:
>
> > Hi everybody,
> >
> > On March 29th, Jeff Moyer posted to lkml a patchset with this note:
> >
> >> Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
> >> are sync'd before the I/O is actually issued (everybody else).  The following
> >> patch series fixes this in two parts.  First, for the file systems that use
> >> the generic routines, Jan has provided some generic infrastructure to perform
> >> the syncs after the I/O is completed.  Second, for those file systems which
> >> require some endio processing of their own for O_DIRECT writes (xfs and
> >> ext4), [Jeff] implemented file system specific syncing.  This passes the
> >> updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
> >> in the aio group.  [Jeff] tested ext3, ext4, xfs, and btrfs only.
> >
> > Since the original post a few months ago, this patchset doesn't seem to have
> > made any progress.  An internal testing team here discovered that the issue
> > also affects O_SYNC+AIO+DIO writes to block devices.  Worse yet, since the
> > flushes were being issued (and waited upon) directly in the io_submit call
> > graph, the io_submit calls themselves would take a very long time to complete.
> > Therefore, I added another patch to move the flush to the io_end processing.
> >
> > The blockdev patch was written by me.  The ext4 patch had to be updated to
> > accommodate a rework of the ext4 endio code that landed since March.  Everything
> > else has been passed through from Jeff's March 30th resend, with few changes.
> >
> > This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
> > ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.
> >
> > Comments and questions are, as always, welcome.
>
> Hi, Darrick,
>
> I just finished testing my version of this patch set on ext4 and xfs
> (btrfs is next), but you beat me to the posting!  Sorry, I should have
> been more clear when I said I would see about refreshing the series.
>
> How have you tested these patches?  You just said "lightly", and I'm
> afraid I don't know what that means.  xfstests?  You should at least run
> them through test 113.

xfstest'd for ext4, but only sanity checked on the rest.

> Would you like to push the set in, or should I post the patches I've
> got?  Doesn't matter to me, just let me know.

Since you're the original author of most of the patches in my set anyway, would
you mind simply picking up my blockdev patch for your next submission?  It
sounds like you're a little further along in the testing than I am anyway.

--D
>
> Cheers,
> Jeff


Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

Jeff Moyer
"Darrick J. Wong" <[hidden email]> writes:

>> Would you like to push the set in, or should I post the patches I've
>> got?  Doesn't matter to me, just let me know.
>
> Since you're the original author of most of the patches in my set anyway, would
> you mind simply picking up my blockdev patch for your next submission?  It
> sounds like you're a little further along in the testing than I am anyway.

Not at all.  I'll address the comments so far and get another version
out tomorrow.

Cheers,
Jeff


Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

Jeff Moyer
In reply to this post by Dave Chinner
Dave Chinner <[hidden email]> writes:

> And requeuing work from one workqueue to the next is something that
> we can avoid.  We know at IO submission time (i.e.
> xfs_vm_direct_IO()) whether an fsync completion is going to be needed
> during IO completion.  The ioend->io_needs_fsync flag can be set
> then, and the first pass through xfs_finish_ioend() can queue it to
> the correct workqueue.  i.e. it only needs to be queued if it's not
> already an unwritten or append ioend and it needs an fsync.
>
> As it is, all the data completion workqueues run the same completion
> function so all you need to do is handle the fsync case at the end
> of the existing processing - it's not an else case. i.e the end of
> xfs_end_io() becomes:
>
>	if (ioend->io_needs_fsync) {
>		error = xfs_ioend_fsync(ioend);
>		if (error)
>			ioend->io_error = -error;
>		goto done;
>	}
> done:
>	xfs_destroy_ioend(ioend);

Works for me, that makes things simpler.

> As it is, this code is going to change before these changes go in -
> there's a nasty regression in the DIO code that I found this
> afternoon that requires reworking this IO completion logic to
> avoid. The patch will appear on the list soon....

I'm not on the xfs list, so if you haven't already sent it, mind Cc-ing
me?

>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -209,6 +209,7 @@ typedef struct xfs_mount {
>>   struct workqueue_struct *m_data_workqueue;
>>   struct workqueue_struct *m_unwritten_workqueue;
>>   struct workqueue_struct *m_cil_workqueue;
>> + struct workqueue_struct *m_aio_blkdev_flush_wq;
>
> struct workqueue_struct *m_aio_fsync_wq;

For the record, m_aio_blkdev_flush_wq is the name you chose previously.
;-)

Thanks for the review!

Cheers,
Jeff
