innodb_flush_method=O_DSYNC | ALL_O_DIRECT leads to log writes with log_sys->mutex locked
Bug #1075129 reported by
Alexey Kopytov
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona Server moved to https://jira.percona.com/projects/PS | Status tracked in 5.7 | | |
5.1 | Won't Fix | Medium | Unassigned |
5.5 | Triaged | Medium | Unassigned |
5.6 | Triaged | Medium | Unassigned |
5.7 | Triaged | Medium | Unassigned |
Bug Description
When innodb_flush_method has the default (empty) value or is O_DIRECT, InnoDB does buffered log writes with log_sys->mutex locked, and then calls fsync() after releasing the mutex, i.e. the actual I/O happens with the mutex unlocked.
With O_DSYNC or ALL_O_DIRECT, the actual I/O happens inside the lock, which makes log_sys->mutex very hot in some workloads.
We can fix this by queuing the writes inside the lock, and then processing the queue after releasing the mutex and before returning from log_write_up_to().
tags: added: xtradb
Consider the following fragment from log_write_up_to:
    group = UT_LIST_GET_FIRST(log_sys->log_groups);

    /* Do the write to the log files */
    while (group) {
        log_group_write_buf(
            group, log_sys->buf + area_start,
            area_end - area_start,
            ut_uint64_align_down(log_sys->written_to_all_lsn,
                                 OS_FILE_LOG_BLOCK_SIZE),
            start_offset - area_start);

        log_group_set_fields(group, log_sys->write_lsn);

        group = UT_LIST_GET_NEXT(log_groups, group);
    }

    mutex_exit(&(log_sys->mutex));

    if (srv_unix_file_flush_method == SRV_UNIX_O_DSYNC
        || srv_unix_file_flush_method == SRV_UNIX_ALL_O_DIRECT) {
        /* O_DSYNC means the OS did not buffer the log file at all:
        so we have also flushed to disk what we have written */
        log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
    } else if (flush_to_disk) {
        group = UT_LIST_GET_FIRST(log_sys->log_groups);

        fil_flush(group->space_id, FALSE);
        log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
    }
There is already a log_do_write check in log_group_write_buf:

    if (log_do_write) {
        log_sys->n_log_ios++;

        srv_os_log_pending_writes++;

        fil_io(OS_FILE_WRITE | OS_FILE_LOG, TRUE, group->space_id, 0,
               next_offset / UNIV_PAGE_SIZE,
               next_offset % UNIV_PAGE_SIZE, write_len, buf, group);

        srv_os_log_pending_writes--;

        srv_os_log_written += write_len;
        srv_log_writes++;
    }
However, log_do_write is unconditionally set to TRUE in non-UNIV_DEBUG builds
(and is never set to FALSE under UNIV_DEBUG either).
The same variable cannot simply be reused to skip the write, because
incrementing log_sys->n_log_ios, among other counters, requires the
log_sys mutex.
So one may want to replace the fil_io call there with in-memory
queuing, so that the counters are still updated under the mutex (the
worst that can happen after a crash is that the counters are incorrect),
and then do the I/O after mutex_exit in log_write_up_to but before the
if condition that checks SRV_UNIX_O_DSYNC.
Even though this should benefit O_DSYNC / ALL_O_DIRECT the most, it
will also benefit the normal case, since it avoids the overhead of
_fil_aio while holding the mutex.
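
The idea can be sketched outside InnoDB as a small toy model in C. This is only an illustration under stated assumptions, not a patch: toy_log_t, pending_write_t, toy_do_io and toy_log_write_up_to are hypothetical names that do not exist in the server, a pthread mutex stands in for log_sys->mutex, and the sketch assumes the queued part of the log buffer is not overwritten until the deferred I/O has finished.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 8   /* arbitrary queue capacity for the sketch */

/* One deferred write: where the data lives and where it should go. */
typedef struct {
    const char *buf;     /* points into the (toy) log buffer */
    size_t      len;     /* number of bytes to write */
    size_t      offset;  /* destination offset in the log file */
} pending_write_t;

typedef struct {
    pthread_mutex_t mutex;                /* stands in for log_sys->mutex */
    unsigned long   n_log_ios;            /* counter protected by the mutex */
    pending_write_t queue[MAX_PENDING];   /* writes queued under the mutex */
    size_t          n_pending;
} toy_log_t;

/* Stand-in for the real (slow) device write; it only counts bytes. */
static size_t toy_bytes_on_disk = 0;

static void toy_do_io(const pending_write_t *w)
{
    toy_bytes_on_disk += w->len;
}

static void toy_log_init(toy_log_t *log)
{
    pthread_mutex_init(&log->mutex, NULL);
    log->n_log_ios = 0;
    log->n_pending = 0;
}

/* Queue the write and update counters under the mutex, then perform
the I/O after releasing it and before returning to the caller. */
static void toy_log_write_up_to(toy_log_t *log, const char *buf,
                                size_t len, size_t offset)
{
    pending_write_t local[MAX_PENDING];
    size_t n, i;

    pthread_mutex_lock(&log->mutex);

    /* Inside the lock: bookkeeping only, no I/O. */
    assert(log->n_pending < MAX_PENDING);
    log->queue[log->n_pending].buf = buf;
    log->queue[log->n_pending].len = len;
    log->queue[log->n_pending].offset = offset;
    log->n_pending++;
    log->n_log_ios++;

    /* Move the queued writes to a local array so they can be
    processed without holding the mutex. */
    n = log->n_pending;
    memcpy(local, log->queue, n * sizeof(local[0]));
    log->n_pending = 0;

    pthread_mutex_unlock(&log->mutex);

    /* Outside the lock: the potentially slow part. */
    for (i = 0; i < n; i++) {
        toy_do_io(&local[i]);
    }
}
```

A caller would run toy_log_init once and then call toy_log_write_up_to for each write. The counters change only while the mutex is held, and the time spent in toy_do_io no longer extends the critical section. A real change would also have to keep flushed_to_disk_lsn from being advanced before the deferred writes complete.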