Features/BlockJob

Error handling

query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

block-stream

I would still like to add on_error to the existing block-stream command, if only to ease unit testing. Concerns about the stability of the API can be handled by adding introspection (exporting the schema), which is not hard to do. The new option is an enum with the following possible values:

'report': The behavior is the same as in 1.1. An I/O error will complete the job immediately with an error code.
'ignore': An I/O error, whether during a read or a write, will be ignored. For streaming, the job will complete with an error and the backing file will be left in place. For mirroring, the sector will be marked again as dirty and re-examined later.
'stop': The job will be paused, and the job iostatus (which can be examined with query-block-jobs) is updated.
'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

In all cases, even for 'report', the I/O error is reported as a QMP event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.
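The mapping from on_error value to behavior can be sketched as a small decision function. This is a minimal Python model, not QEMU code: the function names on_error_action and handle_io_error and the list used to collect events are hypothetical, and only the policy names and the BLOCK_JOB_ERROR event name come from the text above.

```python
import errno


def on_error_action(policy, err):
    """Map an on_error policy and an errno value to the action taken.

    Returns one of 'report', 'ignore' or 'stop' (names reused from the
    policy enum purely for this sketch).
    """
    if policy == "enospc":
        # 'enospc' behaves as 'stop' for ENOSPC and as 'report' otherwise.
        return "stop" if err == errno.ENOSPC else "report"
    if policy in ("report", "ignore", "stop"):
        return policy
    raise ValueError(f"unknown on_error policy: {policy!r}")


def handle_io_error(policy, err, events):
    """Record the event (sent in all cases), then return the action."""
    action = on_error_action(policy, err)
    events.append("BLOCK_JOB_ERROR")  # emitted even for 'report'
    return action
```

For example, handle_io_error("enospc", errno.ENOSPC, events) returns 'stop' in this model, while the same call with errno.EIO returns 'report'; both append the event.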

After cancelling a job, the job implementation MAY choose to treat the 'stop' and 'enospc' values as 'report', i.e. complete the job immediately with an error code, as long as block_job_is_cancelled(job) returns true when the completion callback is called.

Open problem: There could be unrecoverable errors in which the job will always fail as if rerror/werror were set to report (example: error while switching backing files). Does it make sense to fire an event before the point in time where such errors can happen?

block-job-pause

A new QMP command. Takes a block device (drive) and pauses an active background block operation on that device. This command returns immediately after marking the active background block operation for pausing. It is an error to call this command if no operation is in progress. The operation will pause as soon as possible (it won't pause if the job is being cancelled). No event is emitted when the operation is actually paused. Cancelling a paused job automatically resumes it.

block-job-resume

A new QMP command. Takes a block device (drive) and resumes a paused background block operation on that device. This command returns immediately after resuming a paused background block operation. It is an error to call this command if no operation is in progress.

A successful block-job-resume operation also resets the iostatus on the job that is passed.

Rationale: block-job-resume is required to restart a job that had on_error behavior set to 'stop' or 'enospc'. Adding block-job-pause makes it simpler to test the new feature.
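The pause/resume/cancel rules described above can be modelled as a tiny state machine. The sketch below is a toy Python model under stated assumptions, not QEMU's implementation: the class and method names are hypothetical, the iostatus strings are placeholders, and pausing takes effect immediately here whereas the real command only marks the job for pausing.

```python
class JobError(Exception):
    """Raised where the text says calling the command is an error."""


class BlockJobModel:
    """Toy model of the job slot of one block device."""

    def __init__(self):
        self.active = False     # is an operation in progress?
        self.paused = False
        self.cancelled = False
        self.iostatus = "ok"    # placeholder value for "no error"

    def start(self):
        self.active, self.paused, self.cancelled = True, False, False
        self.iostatus = "ok"

    def pause(self):
        if not self.active:
            raise JobError("no operation in progress")
        if not self.cancelled:  # a job being cancelled will not pause
            self.paused = True

    def resume(self):
        if not self.active:
            raise JobError("no operation in progress")
        self.paused = False
        self.iostatus = "ok"    # a successful resume resets the iostatus

    def cancel(self):
        self.cancelled = True
        self.paused = False     # cancelling a paused job resumes it

    def stop_on_error(self, status):
        """on_error 'stop' (or 'enospc' on ENOSPC): pause, set iostatus."""
        self.paused = True
        self.iostatus = status
```

In this model, a job stopped by an error stays paused with a non-ok iostatus until resume() is called, which is the reason block-job-resume is needed for the 'stop' and 'enospc' policies.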

Mirroring commands

query-block-jobs

The returned JSON object will grow an additional member, "target". The target field is a dictionary with two fields, "info" and "stats" (resembling the output of query-block and query-blockstat but for the mirroring target). Member "device" of the BlockInfo structure will be made optional.

Rationale: this allows libvirt to observe the high watermark of qcow2 mirroring targets.

If present, the target has its own iostatus. It is set when the job is paused due to an error on the target (together with sending a BLOCK_JOB_ERROR event). block-job-resume resets it.

drive-mirror

Activates mirroring to a second block device (optionally creating the image on that second block device). Compared to earlier versions, the "full" argument is replaced by an enum option "sync" with three values:

'top': copies data in the topmost image to the destination
'full': copies data from all images to the destination
'dirty': copies clusters that are marked in the dirty bitmap to the destination (see below)

block-job-complete: force completion of mirroring and switching of the device to the target, not related to the rest of the proposal. Synchronously opens backing files if needed, asynchronously completes the job.

MIRROR_STATE_CHANGE: new event, triggered every time block-job-complete becomes available or unavailable. Contains the device name (like device: 'ide0-hd0') and the state (synced: true/false).

Persistent dirty bitmap

A persistent dirty bitmap can be used by management for two reasons: 1) when mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable; 2) when mirroring is used for storage migration, to check after a management crash whether the VM must be restarted with the source or the destination.

The dirty bitmap is synchronized on every bdrv_flush (or on every I/O operation if the disk operates in writethrough or directsync mode).

The persistent dirty bitmap is created by management, but QEMU also needs a dirty bitmap for drive-mirror. The two interact as follows:

if management has not set up a persistent dirty bitmap, QEMU will use a simple non-persistent bitmap.
if management has set up a persistent dirty bitmap and later calls blockdev-dirty-disable, QEMU will delay the disabling until drive mirroring also terminates.
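The two rules above amount to deferring the disable while mirroring still uses the bitmap. A minimal Python sketch of that bookkeeping follows; the class, attribute, and method names are hypothetical and only loosely named after the QMP commands, and the case of enabling a persistent bitmap while a mirror is already running is not covered by the text and is not modelled.

```python
class DirtyBitmapState:
    """Toy model of how management and drive-mirror share the bitmap."""

    def __init__(self):
        self.persistent_file = None   # set by blockdev-dirty-enable
        self.disable_pending = False  # disable requested during mirroring
        self.mirror_active = False

    def dirty_enable(self, filename):
        self.persistent_file = filename
        self.disable_pending = False

    def dirty_disable(self):
        if self.mirror_active:
            # Delay the disabling until drive mirroring terminates.
            self.disable_pending = True
        else:
            self.persistent_file = None

    def mirror_start(self):
        """Return the kind of bitmap the mirror job will track with."""
        self.mirror_active = True
        return "persistent" if self.persistent_file else "non-persistent"

    def mirror_end(self):
        self.mirror_active = False
        if self.disable_pending:
            self.persistent_file = None
            self.disable_pending = False
```

Calling dirty_disable() while a mirror is active leaves persistent_file untouched in this model; it is cleared only by the later mirror_end().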

QMP commands

The dirty bitmap is managed by these QMP commands:

blockdev-dirty-enable: takes a file name used for the dirty bitmap, and an optional granularity. Setting the granularity will not be supported in the initial version.
query-block-dirty: returns statistics about the dirty bitmap: right now the granularity, the number of bits that are set, and whether QEMU is consuming the dirty bitmap (i.e. drive-mirror active)
blockdev-dirty-disable: disable the dirty bitmap.

The dirty bitmap can also be specified on the command-line with -drive.

Usage

The dirty bitmap can be used as follows for storage migration. To start migration:

1. blockdev-dirty-enable ide0-hd0 /var/lib/libvirt/dirty/diskname
2. management notes existence of dirty bitmap for /mnt/src/diskname.img in its private data
3. drive-mirror ide0-hd0 /mnt/dest/diskname.img
4. management notes /mnt/dest/diskname.img as the mirroring target in its private data
At this point, mirroring has taken a reference to the dirty bitmap.
To end migration:
5. blockdev-dirty-disable ide0-hd0
6. block-job-complete ide0-hd0
The dirty bitmap remains enabled until the BLOCK_JOB_COMPLETED event is sent.
7. When management receives the BLOCK_JOB_COMPLETED event, it notes the switch to /mnt/dest/diskname.img (with neither dirty bitmap nor mirroring target) in its private data.

If management crashes between (6) and (7), it can examine the dirty bitmap on disk. If it is all-zeros, management can restart the virtual machine with /mnt/dest/diskname.img. If it has even a single bit set, management can restart the virtual machine with the persistent dirty bitmap enabled, and later issue a drive-mirror command again (with sync='dirty') to restart from step 4.
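The recovery decision that management makes after a crash can be sketched as follows. This is an illustrative Python helper, not part of any management tool: the function name recovery_plan and the shape of the returned dictionary are hypothetical, and it assumes the bitmap file is a plain array of bits with no header, which the text does not specify.

```python
def recovery_plan(bitmap_bytes, source, destination):
    """Decide how to restart the VM from the on-disk dirty bitmap.

    bitmap_bytes is the raw content of the persistent bitmap file.
    """
    if any(bitmap_bytes):
        # At least one bit is set: the destination may be stale, so
        # restart on the source, keep the bitmap, and mirror again.
        return {
            "image": source,
            "dirty_bitmap": True,
            "remirror_sync": "dirty",
        }
    # All zeros: source and destination are identical.
    return {
        "image": destination,
        "dirty_bitmap": False,
        "remirror_sync": None,
    }
```

With the paths used in the example above, an all-zero bitmap selects /mnt/dest/diskname.img, and any set bit selects /mnt/src/diskname.img together with a new drive-mirror using sync='dirty'.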

Internal workings

In addition to the persistent dirty bitmap, QEMU keeps a volatile bitmap. The invariants are as follows:

both bitmaps have a bit set to one if the contents may differ on the source and destination.
the persistent bitmap must have a bit set to one on disk if the source has been fsync-ed after writing that block.
the volatile bitmap may only have a bit set to zero if the contents are the same on the source and destination
the persistent bitmap may only have a bit set to zero if the contents are the same on the source and destination, and the destination has been fsync-ed

Bitmap handling when doing I/O on the source

after writing to the source:
- set bit in both dirty bitmaps (*)
when flushing the source:
- synchronously msync the persistent bitmap to disk

Bitmap handling in the drive-mirror coroutine

before reading from the source:
- reset bit in the volatile in-flight bitmap
when the volatile bitmap becomes all clear:
- flush the target
- clear the persistent dirty bitmap (*)
- asynchronously msync the persistent bitmap to disk (optional)
periodically but only while the mirror is in copy phase (optionally):
- flush the target
- copy the volatile dirty bitmap to the persistent bitmap (*)
- asynchronously msync the persistent bitmap to disk (optional)

One possibility for periodic update of the persistent bitmap is to do it when one page of the volatile bitmap becomes all zeros.

Steps marked with (*) have to be done with no context switches (coroutines guarantee this).
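The bitmap handling above can be simulated with three bit arrays: the volatile bitmap, the in-memory copy of the persistent bitmap, and what is actually on disk. The Python model below is only a sketch under simplifying assumptions: all names are hypothetical, every operation is synchronous (so in-flight copies are not modelled), and the optional asynchronous msync is performed eagerly.

```python
class MirrorBitmaps:
    """Toy model of the volatile and persistent dirty bitmaps."""

    def __init__(self, nbits):
        self.volatile = [False] * nbits
        self.persistent = [False] * nbits  # mapped, in-memory copy
        self.on_disk = [False] * nbits     # what survives a crash

    def source_written(self, i):
        # (*) set the bit in both bitmaps with no context switch
        self.volatile[i] = True
        self.persistent[i] = True

    def source_flushed(self):
        # synchronous msync of the persistent bitmap
        self.on_disk = list(self.persistent)

    def mirror_before_read(self, i):
        # the mirror coroutine resets the bit before reading the source
        self.volatile[i] = False

    def mirror_all_clear(self, flush_target):
        """Handle the 'volatile bitmap becomes all clear' case."""
        if any(self.volatile):
            return False
        flush_target()
        self.persistent = [False] * len(self.persistent)  # (*)
        self.on_disk = list(self.persistent)  # optional msync, done eagerly
        return True

    def mirror_checkpoint(self, flush_target):
        """Periodic update while the mirror is in its copy phase."""
        flush_target()
        self.persistent = list(self.volatile)  # (*)
        self.on_disk = list(self.persistent)   # optional msync, done eagerly
```

In this model a bit reaches on_disk only after source_flushed(), and it is cleared on disk only after the target has been flushed, which matches the invariants listed under "Internal workings".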

Implementation notes: on Linux, msync(MS_ASYNC) does nothing; use posix_fadvise(POSIX_FADV_DONTNEED) instead. Also, on Linux it is cheaper to just fdatasync the file than to use msync.
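As an illustration of the fdatasync approach, the sketch below sets a bit in a memory-mapped bitmap file and then syncs the file descriptor instead of calling msync. It is written in Python for brevity and is not QEMU code: the function names are hypothetical, and the bit layout (least significant bit first within each byte) is an assumption. os.fdatasync is not available on every platform, so the sketch falls back to os.fsync.

```python
import mmap
import os


def persist_bitmap(fd):
    """Write the bitmap file's dirty pages to stable storage."""
    sync = getattr(os, "fdatasync", os.fsync)  # fall back where missing
    sync(fd)


def set_bit_and_sync(path, bit):
    """Set one bit through a shared mapping, then sync the file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            # Assumed layout: bit 0 is the lowest bit of byte 0.
            m[bit // 8] |= 1 << (bit % 8)
        persist_bitmap(f.fileno())
```

The file must already exist with a non-zero size, since an empty file cannot be mapped.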