Nguyen Anh Quynh
2015-08-21 15:04:50 +08:00
commit 344d016104
499 changed files with 266445 additions and 0 deletions


@@ -0,0 +1,24 @@
QEMU<->ACPI BIOS CPU hotplug interface
--------------------------------------
QEMU supports CPU hotplug via ACPI. This document
describes the interface between QEMU and the ACPI BIOS.
ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
-----------------------------------------
Generic ACPI GPE block. Bit 2 (GPE.2) is used to notify the ACPI BIOS of
CPU hot-add/remove events, via SCI interrupt.
CPU present bitmap for:
ICH9-LPC (IO port 0x0cd8-0x0cf7, 1-byte access)
PIIX-PM (IO port 0xaf00-0xaf1f, 1-byte access)
---------------------------------------------------------------
One bit per CPU. Bit position reflects corresponding CPU APIC ID.
Read-only.
CPU hot-add/remove notification:
-----------------------------------------------------
QEMU sets/clears corresponding CPU bit on hot-add/remove event.
CPU present map read by ACPI BIOS GPE.2 handler to notify OS of CPU
hot-(un)plug events.
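The bitmap handling described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes that bit i of byte j in the 32-byte region corresponds to APIC ID 8*j + i, which the text does not state explicitly.

```python
def present_cpus(bitmap):
    """Return the set of APIC IDs whose bit is set in the CPU present bitmap."""
    ids = set()
    for byte_index, value in enumerate(bitmap):
        for bit in range(8):
            if value & (1 << bit):
                # Assumption: least significant bit of byte 0 is APIC ID 0.
                ids.add(byte_index * 8 + bit)
    return ids


def hotplug_events(old, new):
    """Compare two bitmap snapshots; return (added, removed) APIC ID lists."""
    before, after = present_cpus(old), present_cpus(new)
    return sorted(after - before), sorted(before - after)


# Two snapshots of the 32-byte region (e.g. 0xaf00-0xaf1f on PIIX-PM).
old = bytes([0b00000011] + [0] * 31)   # CPUs 0 and 1 present
new = bytes([0b00000101] + [0] * 31)   # CPU 1 removed, CPU 2 added
added, removed = hotplug_events(old, new)
print(added, removed)   # prints [2] [1]
```

A GPE.2 handler would do the equivalent comparison against its last known state to decide which CPUs to notify the OS about.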


@@ -0,0 +1,44 @@
QEMU<->ACPI BIOS memory hotplug interface
--------------------------------------
ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add
events.
Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
---------------------------------------------------------------
0xa00:
read access:
[0x0-0x3] Lo part of memory device phys address
[0x4-0x7] Hi part of memory device phys address
[0x8-0xb] Lo part of memory device size in bytes
[0xc-0xf] Hi part of memory device size in bytes
[0x10-0x13] Memory device proximity domain
[0x14] Memory device status fields
bits:
0: Device is enabled and may be used by guest
1: Device insert event, used to distinguish device for which
no device check event to OSPM was issued.
It's valid only when bit 1 is set.
2-7: reserved and should be ignored by OSPM
[0x15-0x17] reserved
write access:
[0x0-0x3] Memory device slot selector, selects active memory device.
All following accesses to other registers in 0xa00-0xa17
region will read/store data from/to selected memory device.
[0x4-0x7] OST event code reported by OSPM
[0x8-0xb] OST status code reported by OSPM
[0xc-0x13] reserved, writes into it are ignored
[0x14] Memory device control fields
bits:
0: reserved, OSPM must clear it before writing to register
1: if set to 1 clears device insert event, set by OSPM
after it has emitted device check event for the
selected memory device
2-7: reserved, OSPM must clear them before writing to register
Selecting memory device slot beyond present range has no effect on platform:
- write accesses to memory hot-plug registers not documented above are
ignored
- read accesses to memory hot-plug registers not documented above return
all bits set to 1.
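As an illustration of the read-side layout, the sketch below decodes a 24-byte snapshot of the 0xa00-0xa17 region for the currently selected slot. The function name is hypothetical, and little-endian byte order is an assumption (the text does not state it, although it is the natural order for x86 port IO).

```python
import struct

# Lo/hi address, lo/hi size, proximity domain, status byte, 3 reserved bytes.
# Assumption: fields are little-endian.
MEM_HOTPLUG_READ = struct.Struct("<IIIIIB3x")


def decode_memory_device(regs):
    """Decode the read view of the memory hot-plug registers (24 bytes)."""
    if len(regs) != MEM_HOTPLUG_READ.size:
        raise ValueError("expected 24 bytes (0xa00-0xa17)")
    addr_lo, addr_hi, size_lo, size_hi, proximity, status = \
        MEM_HOTPLUG_READ.unpack(regs)
    return {
        "address": (addr_hi << 32) | addr_lo,
        "size": (size_hi << 32) | size_lo,
        "proximity": proximity,
        "enabled": bool(status & 0x01),       # status bit 0
        "insert_event": bool(status & 0x02),  # status bit 1
    }


# Example: a 1 GiB device at 4 GiB, proximity domain 2, enabled, insert pending.
sample = struct.pack("<IIIIIB3x", 0, 1, 0x40000000, 0, 2, 0x03)
print(decode_memory_device(sample))
```

Bits 2-7 of the status byte are deliberately masked out, since OSPM is told to ignore them.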


@@ -0,0 +1,45 @@
QEMU<->ACPI BIOS PCI hotplug interface
--------------------------------------
QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document
describes the interface between QEMU and the ACPI BIOS.
ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
-----------------------------------------
Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify the ACPI BIOS of
PCI hotplug/eject events, via SCI interrupt.
PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
---------------------------------------------------------------
Slot injection notification pending. One bit per slot.
Read by ACPI BIOS GPE.1 handler to notify OS of injection
events. Read-only.
PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
-----------------------------------------------------
Slot removal notification pending. One bit per slot.
Read by ACPI BIOS GPE.1 handler to notify OS of removal
events. Read-only.
PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
----------------------------------------
Write: Used by ACPI BIOS _EJ0 method to request device removal.
One bit per slot.
Read: Hotplug features register. Used by platform to identify features
available. Current base feature set (no bits set):
- Read-only "up" register @0xae00, 4-byte access, bit per slot
- Read-only "down" register @0xae04, 4-byte access, bit per slot
- Read/write "eject" register @0xae08, 4-byte access,
write: bit per slot eject, read: hotplug feature set
- Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot
PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
-----------------------------------------------
Used by ACPI BIOS _RMV method to indicate removability status to OS. One
bit per slot. Read-only
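The one-bit-per-slot registers can be handled as in the sketch below. The helper names are hypothetical, and the mapping "bit n corresponds to slot n" is an assumption consistent with the 4-byte (32-slot) registers described above.

```python
def slots_from_mask(mask):
    """List the slot numbers whose bit is set in a 32-bit per-slot register."""
    return [slot for slot in range(32) if mask & (1 << slot)]


def eject_mask(slot):
    """Value to write to the eject register (0xae08) to eject one slot."""
    if not 0 <= slot < 32:
        raise ValueError("slot must be in 0..31")
    return 1 << slot


up = 0x00000009     # example value read from the "up" register at 0xae00
print(slots_from_mask(up))   # prints [0, 3]
print(hex(eject_mask(5)))    # prints 0x20
```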


@@ -0,0 +1,96 @@
Device Specification for Inter-VM shared memory device
------------------------------------------------------
The Inter-VM shared memory device is designed to share a region of memory to
userspace in multiple virtual guests. The memory region does not belong to any
guest, but is a POSIX memory object on the host. Optionally, the device may
support sending interrupts to other guests sharing the same memory region.
The Inter-VM PCI device
-----------------------
*BARs*
The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
used to map the shared memory object from the host. The size of BAR2 is
specified when the guest is started and must be a power of 2 in size.
*Registers*
The device currently supports 4 registers of 32-bits each. Registers
are used for synchronization between guests sharing the same memory object when
interrupts are supported (this requires using the shared memory server).
The server assigns each VM an ID number and sends this ID number to the QEMU
process when the guest starts.
enum ivshmem_registers {
IntrMask = 0,
IntrStatus = 4,
IVPosition = 8,
Doorbell = 12
};
The first two registers are the interrupt mask and status registers. Mask and
status are only used with pin-based interrupts. They are unused with MSI
interrupts.
Status Register: The status register is set to 1 when an interrupt occurs.
Mask Register: The mask register is bitwise ANDed with the interrupt status
and the result will raise an interrupt if it is non-zero. However, since 1 is
the only value the status will be set to, it is only the first bit of the mask
that has any effect. Therefore interrupts can be masked by setting the first
bit to 0 and unmasked by setting the first bit to 1.
IVPosition Register: The IVPosition register is read-only and reports the
guest's ID number. The guest IDs are non-negative integers. When using the
server, since the server is a separate process, the VM ID will only be set when
the device is ready (shared memory is received from the server and accessible via
the device). If the device is not ready, the IVPosition will return -1.
Applications should ensure that they have a valid VM ID before accessing the
shared memory.
Doorbell Register: To interrupt another guest, a guest must write to the
Doorbell register. The doorbell register is 32-bits, logically divided into
two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
16-bits are the interrupt vector to trigger. The semantics of the value
written to the doorbell depends on whether the device is using MSI or a regular
pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
status register.
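A minimal sketch of the Doorbell encoding, with hypothetical helper names. The readiness check assumes that the -1 returned by IVPosition is seen as 0xFFFFFFFF when the register is read as an unsigned 32-bit value.

```python
def doorbell_value(peer_id, vector):
    """Pack a destination guest ID and an interrupt vector into one 32-bit write."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("guest ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    # High 16 bits: guest ID to interrupt; low 16 bits: vector to trigger.
    return (peer_id << 16) | vector


def ivposition_ready(raw):
    """Assumption: a not-ready device reads back -1, i.e. 0xFFFFFFFF unsigned."""
    return raw != 0xFFFFFFFF


print(hex(doorbell_value(3, 1)))   # prints 0x30001
```

With pin-based interrupts the vector field carries no meaning, so any value in the low 16 bits would do.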
Regular Interrupts
If regular interrupts are used (due to either a guest not supporting MSI or the
user specifying not to use them on startup) then the value written to the lower
16 bits of the Doorbell register is arbitrary; any write will trigger an
interrupt in the destination guest.
Message Signalled Interrupts
An ivshmem device may support multiple MSI vectors. If so, the lower 16 bits
written to the Doorbell register must be between 0 and the maximum number of
vectors the guest supports. The lower 16 bits written to the doorbell are the
MSI vector that will be raised in the destination guest. The number of MSI
vectors is configurable but it is set when the VM is started.
The important thing to remember with MSI is that it is only a signal, no status
is set (since MSI interrupts are not shared). All information other than the
interrupt itself should be communicated via the shared memory region. Devices
supporting multiple MSI vectors can use different vectors to indicate different
events have occurred. The semantics of interrupt vectors are left to the
user's discretion.
Usage in the Guest
------------------
The shared memory device is intended to be used with the provided UIO driver.
Very little configuration is needed. The guest should map BAR0 to access the
registers (an array of 32-bit ints allows simple writing) and map BAR2 to
access the shared memory region itself. The size of the shared memory region
is specified when the guest (or shared memory server) is started. A guest may
map the whole shared memory region or only part of it.


@@ -0,0 +1,50 @@
PCI IDs for qemu
================
Red Hat, Inc. donates a part of its device ID range to qemu, to be used for
virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
for your devices.
1af4 vendor ID
--------------
The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
Note that this allocation is separate from the virtio device IDs, which are
maintained as part of the virtio specification.
1af4:1000 network device
1af4:1001 block device
1af4:1002 balloon device
1af4:1003 console device
1af4:1004 SCSI host bus adapter device
1af4:1005 entropy generator device
1af4:1009 9p filesystem device
1af4:10f0 to 1af4:10ff
Available for experimental usage without registration. Must get
an official ID when the code leaves the test lab (i.e. when seeking
upstream merge or shipping a distro/product) to avoid conflicts.
1af4:1100 Used as PCI Subsystem ID for existing hardware devices emulated
by qemu.
1af4:1110 ivshmem device (shared memory, docs/specs/ivshmem_device_spec.txt)
All other device IDs are reserved.
1b36 vendor ID
--------------
The 0000 -> 00ff device ID range is used as follows for QEMU-specific
PCI devices (other than virtio):
1b36:0001 PCI-PCI bridge
1b36:0002 PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
1b36:0003 PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
1b36:0004 PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
All these devices are documented in docs/specs.
The 0100 device ID is used for the QXL video card device.


@@ -0,0 +1,34 @@
QEMU pci serial devices
=======================
There is one single-port variant and two multiport variants. Linux
guests work out of the box with all cards. There is a Windows inf file
(docs/qemupciserial.inf) to set up the single-port card in Windows
guests.
single-port card
----------------
Name: pci-serial
PCI ID: 1b36:0002
PCI Region 0:
IO bar, 8 bytes long, with the 16550 uart mapped to it.
Interrupt is wired to pin A.
multiport cards
---------------
Name: pci-serial-2x
PCI ID: 1b36:0003
Name: pci-serial-4x
PCI ID: 1b36:0004
PCI Region 0:
IO bar, with two/four 16550 uarts mapped one after another.
The first is at offset 0, second at offset 8, ...
Interrupt is wired to pin A.


@@ -0,0 +1,26 @@
pci-test is a device used for testing low-level IO.
The device implements up to two BARs: BAR0 and BAR1.
Each BAR can be memory or IO. Guests must detect
BAR type and act accordingly.
Each BAR size is up to 4K bytes.
Each BAR starts with the following header:
typedef struct PCITestDevHdr {
uint8_t test; <- write-only, starts a given test number
uint8_t width_type; <- read-only, type and width of access for a given test.
1,2,4 for byte,word or long write.
any other value if test not supported on this BAR
uint8_t pad0[2];
uint32_t offset; <- read-only, offset in this BAR for a given test
uint32_t data; <- read-only, data to use for a given test
uint32_t count; <- for debugging. number of writes detected.
uint8_t name[]; <- for debugging. 0-terminated ASCII string.
} PCITestDevHdr;
All registers are little endian.
device is expected to always implement tests 0 to N on each BAR, and to add new
tests with higher numbers. In this way a guest can scan test numbers until it
detects an access type that it does not support on this BAR, then stop.
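As a sketch of how a guest-side tool might read the header, the snippet below unpacks the fixed fields and the trailing name from a byte snapshot of a BAR. The function name is hypothetical; the layout follows the struct above, and the write-only `test` byte is skipped rather than interpreted.

```python
import struct

# test (skipped, write-only), width_type, pad0[2], offset, data, count
PCI_TEST_HDR = struct.Struct("<xB2xIII")
WIDTHS = {1: "byte", 2: "word", 4: "long"}


def parse_test_header(bar):
    """Decode a PCITestDevHdr from the first bytes of a BAR snapshot."""
    width_type, offset, data, count = PCI_TEST_HDR.unpack_from(bar, 0)
    # The name is a 0-terminated ASCII string directly after the fixed fields.
    name = bar[PCI_TEST_HDR.size:].split(b"\0", 1)[0].decode("ascii")
    return {
        "supported": width_type in WIDTHS,
        "width": WIDTHS.get(width_type),
        "offset": offset,
        "data": data,
        "count": count,
        "name": name,
    }


# Made-up example contents, not taken from a real device.
sample = bytes([0, 2, 0, 0]) + struct.pack("<III", 0x10, 0xABCD, 0) + b"example\0"
print(parse_test_header(sample))
```

A scanning loop would write successive test numbers, re-read the header, and stop at the first one where `supported` is false.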


@@ -0,0 +1,78 @@
When used with the "pseries" machine type, qemu-system-ppc64 implements
a set of hypervisor calls using a subset of the server "PAPR" specification
(IBM internal at this point), which is also what IBM's proprietary hypervisor
adheres to.
The subset is selected based on the requirements of Linux as a guest.
In addition to those calls, we have added our own private hypervisor
calls which are mostly used as a private interface between the firmware
running in the guest and QEMU.
All those hypercalls start at hcall number 0xf000, which corresponds
to an implementation-specific range in PAPR.
- H_RTAS (0xf000)
RTAS is a set of runtime services generally provided by the firmware
inside the guest to the operating system. It predates the existence
of hypervisors (it was originally an extension to Open Firmware) and
is still used by PAPR to provide various services that aren't performance
sensitive.
We currently implement the RTAS services in QEMU itself. The actual RTAS
"firmware" blob in the guest is a small stub of a few instructions which
calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
Arguments:
r3 : H_RTAS (0xf000)
r4 : Guest physical address of RTAS parameter block
Returns:
H_SUCCESS : Successfully called the RTAS function (RTAS result
will have been stored in the parameter block)
H_PARAMETER : Unknown token
- H_LOGICAL_MEMOP (0xf001)
When the guest runs in "real mode" (in powerpc lingo this means
with MMU disabled, i.e. guest effective == guest physical), it only
has access to a subset of memory and no IOs.
PAPR provides a set of hypervisor calls to perform cachable or
non-cachable accesses to any guest physical addresses that the
guest can use in order to access IO devices while in real mode.
This is typically used by the firmware running in the guest.
However, doing a hypercall for each access is extremely inefficient
(even more so when running KVM) when accessing the frame buffer. In
that case, things like scrolling become unusably slow.
This hypercall allows the guest to request a "memory op" to be applied
to memory. The supported memory ops at this point are to copy a range
of memory (supports overlap of source and destination) and XOR which
is used by our SLOF firmware to invert the screen.
Arguments:
r3: H_LOGICAL_MEMOP (0xf001)
r4: Guest physical address of destination
r5: Guest physical address of source
r6: Individual element size
0 = 1 byte
1 = 2 bytes
2 = 4 bytes
3 = 8 bytes
r7: Number of elements
r8: Operation
0 = copy
1 = xor
Returns:
H_SUCCESS : Success
H_PARAMETER : Invalid argument
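The argument handling can be modelled on a flat byte buffer, as in the sketch below. It is not QEMU's implementation: guest memory is a plain bytearray, return codes are represented as strings, and the "xor" operation is assumed to invert every bit of the source (an XOR with all ones), which is an interpretation of the "invert the screen" use case rather than something the text states.

```python
def logical_memop(mem, dst, src, esize, count, op):
    """Model of H_LOGICAL_MEMOP on a bytearray standing in for guest memory."""
    if esize not in (0, 1, 2, 3) or op not in (0, 1) or count < 0:
        return "H_PARAMETER"
    length = count << esize          # element size is 1 << esize bytes
    if min(dst, src) < 0 or max(dst, src) + length > len(mem):
        return "H_PARAMETER"
    # Snapshot the source first so overlapping ranges behave like memmove.
    data = bytes(mem[src:src + length])
    if op == 1:
        # Assumption: "xor" means XOR with all ones, i.e. bitwise inversion.
        data = bytes(b ^ 0xFF for b in data)
    mem[dst:dst + length] = data
    return "H_SUCCESS"


mem = bytearray(range(8))
print(logical_memop(mem, 2, 0, 0, 4, 0), list(mem))
```

In the example, four 1-byte elements are copied from offset 0 to the overlapping range at offset 2.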


@@ -0,0 +1,39 @@
PVPANIC DEVICE
==============
pvpanic device is a simulated ISA device, through which a guest panic
event is sent to qemu, and a QMP event is generated. This allows
management apps (e.g. libvirt) to be notified and respond to the event.
The management app has the option of waiting for GUEST_PANICKED events,
and/or polling for guest-panicked RunState, to learn when the pvpanic
device has fired a panic event.
ISA Interface
-------------
pvpanic exposes a single I/O port, by default 0x505. On read, the bits
recognized by the device are set. Software should ignore bits it doesn't
recognize. On write, the bits not recognized by the device are ignored.
Software should set only bits both itself and the device recognize.
Currently, only bit 0 is recognized, setting it indicates a guest panic
has happened.
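The read/write bit negotiation can be sketched as follows. The constant and function names are illustrative only and operate on plain integers rather than on real port IO.

```python
PVPANIC_PANICKED = 1 << 0   # bit 0: guest panic has happened (name is illustrative)
KNOWN_BITS = PVPANIC_PANICKED


def supported_events(port_value):
    """Keep only the bits that both the device (read value) and software know."""
    return port_value & KNOWN_BITS


def value_to_write(event, port_value):
    """Return the byte to write for an event, or 0 if the device lacks support."""
    return event & supported_events(port_value)


print(value_to_write(PVPANIC_PANICKED, 0b101))   # prints 1
print(value_to_write(PVPANIC_PANICKED, 0b100))   # prints 0
```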
ACPI Interface
--------------
pvpanic device is defined with ACPI ID "QEMU0001". Custom methods:
RDPT: To determine whether guest panic notification is supported.
Arguments: None
Return: Returns a byte, bit 0 set to indicate guest panic
notification is supported. Other bits are reserved and
should be ignored.
WRPT: To send a guest panic event
Arguments: Arg0 is a byte, with bit 0 set to indicate guest panic has
happened. Other bits are reserved and should be cleared.
Return: None
The ACPI device will automatically refer to the right port in case it
is modified.

qemu/docs/specs/qcow2.txt Normal file

@@ -0,0 +1,362 @@
== General ==
A qcow2 image file is organized in units of constant size, which are called
(host) clusters. A cluster is the unit in which all allocations are done,
both for actual guest data and for image metadata.
Likewise, the virtual disk as seen by the guest is divided into (guest)
clusters of the same size.
All numbers in qcow2 are stored in Big Endian byte order.
== Header ==
The first cluster of a qcow2 image contains the file header:
Byte 0 - 3: magic
QCOW magic string ("QFI\xfb")
4 - 7: version
Version number (valid values are 2 and 3)
8 - 15: backing_file_offset
Offset into the image file at which the backing file name
is stored (NB: The string is not null terminated). 0 if the
image doesn't have a backing file.
16 - 19: backing_file_size
Length of the backing file name in bytes. Must not be
longer than 1023 bytes. Undefined if the image doesn't have
a backing file.
20 - 23: cluster_bits
Number of bits that are used for addressing an offset
within a cluster (1 << cluster_bits is the cluster size).
Must not be less than 9 (i.e. 512 byte clusters).
Note: qemu as of today has an implementation limit of 2 MB
as the maximum cluster size and won't be able to open images
with larger cluster sizes.
24 - 31: size
Virtual disk size in bytes
32 - 35: crypt_method
0 for no encryption
1 for AES encryption
36 - 39: l1_size
Number of entries in the active L1 table
40 - 47: l1_table_offset
Offset into the image file at which the active L1 table
starts. Must be aligned to a cluster boundary.
48 - 55: refcount_table_offset
Offset into the image file at which the refcount table
starts. Must be aligned to a cluster boundary.
56 - 59: refcount_table_clusters
Number of clusters that the refcount table occupies
60 - 63: nb_snapshots
Number of snapshots contained in the image
64 - 71: snapshots_offset
Offset into the image file at which the snapshot table
starts. Must be aligned to a cluster boundary.
If the version is 3 or higher, the header has the following additional fields.
For version 2, the values are assumed to be zero, unless specified otherwise
in the description of a field.
72 - 79: incompatible_features
Bitmask of incompatible features. An implementation must
fail to open an image if an unknown bit is set.
Bit 0: Dirty bit. If this bit is set then refcounts
may be inconsistent, make sure to scan L1/L2
tables to repair refcounts before accessing the
image.
Bit 1: Corrupt bit. If this bit is set then any data
structure may be corrupt and the image must not
be written to (unless for regaining
consistency).
Bits 2-63: Reserved (set to 0)
80 - 87: compatible_features
Bitmask of compatible features. An implementation can
safely ignore any unknown bits that are set.
Bit 0: Lazy refcounts bit. If this bit is set then
lazy refcount updates can be used. This means
marking the image file dirty and postponing
refcount metadata updates.
Bits 1-63: Reserved (set to 0)
88 - 95: autoclear_features
Bitmask of auto-clear features. An implementation may only
write to an image with unknown auto-clear features if it
clears the respective bits from this field first.
Bits 0-63: Reserved (set to 0)
96 - 99: refcount_order
Describes the width of a reference count block entry (width
in bits: refcount_bits = 1 << refcount_order). For version 2
images, the order is always assumed to be 4
(i.e. refcount_bits = 16).
This value may not exceed 6 (i.e. refcount_bits = 64).
100 - 103: header_length
Length of the header structure in bytes. For version 2
images, the length is always assumed to be 72 bytes.
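The header layout above maps directly onto a big-endian struct. The sketch below is one possible reader, with hypothetical function and variable names; it fills in the documented defaults for version 2 images and performs only the most basic validation.

```python
import struct

V2_FIELDS = struct.Struct(">4sIQIIQIIQQIIQ")      # bytes 0-71
V3_FIELDS = struct.Struct(">QQQII")               # bytes 72-103
V2_NAMES = ("magic", "version", "backing_file_offset", "backing_file_size",
            "cluster_bits", "size", "crypt_method", "l1_size",
            "l1_table_offset", "refcount_table_offset",
            "refcount_table_clusters", "nb_snapshots", "snapshots_offset")
V3_NAMES = ("incompatible_features", "compatible_features",
            "autoclear_features", "refcount_order", "header_length")


def parse_qcow2_header(buf):
    """Decode the fixed qcow2 header fields from the start of an image."""
    hdr = dict(zip(V2_NAMES, V2_FIELDS.unpack_from(buf, 0)))
    if hdr["magic"] != b"QFI\xfb":
        raise ValueError("not a qcow2 image")
    if hdr["version"] not in (2, 3):
        raise ValueError("unsupported version")
    if hdr["version"] >= 3:
        hdr.update(zip(V3_NAMES, V3_FIELDS.unpack_from(buf, V2_FIELDS.size)))
    else:
        # Version 2: values assumed by the specification.
        hdr.update(incompatible_features=0, compatible_features=0,
                   autoclear_features=0, refcount_order=4, header_length=72)
    hdr["cluster_size"] = 1 << hdr["cluster_bits"]
    return hdr


# Synthetic version 2 header: 64 KiB clusters, 1 GiB virtual size.
sample_v2 = V2_FIELDS.pack(b"QFI\xfb", 2, 0, 0, 16, 1 << 30, 0, 2,
                           0x30000, 0x10000, 1, 0, 0)
print(parse_qcow2_header(sample_v2)["cluster_size"])   # prints 65536
```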
Directly after the image header, optional sections called header extensions can
be stored. Each extension has a structure like the following:
Byte 0 - 3: Header extension type:
0x00000000 - End of the header extension area
0xE2792ACA - Backing file format name
0x6803f857 - Feature name table
other - Unknown header extension, can be safely
ignored
4 - 7: Length of the header extension data
8 - n: Header extension data
n - m: Padding to round up the header extension size to the next
multiple of 8.
Unless stated otherwise, each header extension type shall appear at most once
in the same image.
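Walking the extension area could look like the sketch below. The function name is hypothetical, and the starting offset is assumed to be header_length (72 for version 2 images), following "directly after the image header".

```python
import struct

EXT_END = 0x00000000
EXT_BACKING_FORMAT = 0xE2792ACA
EXT_FEATURE_NAMES = 0x6803F857


def iter_header_extensions(buf, start):
    """Yield (type, data) for each header extension, stopping at the end marker."""
    pos = start
    while True:
        ext_type, length = struct.unpack_from(">II", buf, pos)
        if ext_type == EXT_END:
            return
        yield ext_type, buf[pos + 8:pos + 8 + length]
        # Data is padded so that the next extension starts on a multiple of 8.
        pos += 8 + ((length + 7) & ~7)


# Synthetic first cluster: 104-byte header, two extensions, end marker.
sample = (bytes(104)
          + struct.pack(">II", EXT_BACKING_FORMAT, 5) + b"qcow2" + bytes(3)
          + struct.pack(">II", EXT_FEATURE_NAMES, 48) + bytes(48)
          + struct.pack(">II", EXT_END, 0))
for ext_type, data in iter_header_extensions(sample, 104):
    print(hex(ext_type), len(data))
```

Unknown extension types are simply yielded like any other, so a caller can ignore them as the text permits. A buffer that lacks an end marker makes `struct.unpack_from` raise `struct.error`.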
If the image has a backing file then the backing file name should be stored in
the remaining space between the end of the header extension area and the end of
the first cluster. It is not allowed to store other data here, so that an
implementation can safely modify the header and add extensions without harming
data of compatible features that it doesn't support. Compatible features that
need space for additional data can use a header extension.
== Feature name table ==
The feature name table is an optional header extension that contains the name
for features used by the image. It can be used by applications that don't know
the respective feature (e.g. because the feature was introduced only later) to
display a useful error message.
The number of entries in the feature name table is determined by the length of
the header extension data. Each entry looks like this:
Byte 0: Type of feature (select feature bitmap)
0: Incompatible feature
1: Compatible feature
2: Autoclear feature
1: Bit number within the selected feature bitmap (valid
values: 0-63)
2 - 47: Feature name (padded with zeros, but not necessarily null
terminated if it has full length)
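Since every entry is 48 bytes long, the table can be decoded by slicing the extension data, as in this sketch (names are hypothetical; the example feature names are made up for illustration).

```python
ENTRY_SIZE = 48
FEATURE_KINDS = {0: "incompatible", 1: "compatible", 2: "autoclear"}


def parse_feature_names(data):
    """Decode feature name table entries into (kind, bit, name) tuples."""
    entries = []
    usable = len(data) - len(data) % ENTRY_SIZE   # ignore a trailing partial entry
    for pos in range(0, usable, ENTRY_SIZE):
        kind, bit = data[pos], data[pos + 1]
        # Name is zero-padded but may fill all 46 bytes without a terminator.
        name = data[pos + 2:pos + ENTRY_SIZE].rstrip(b"\0").decode("ascii", "replace")
        entries.append((FEATURE_KINDS.get(kind, "unknown"), bit, name))
    return entries


table = (bytes([0, 0]) + b"dirty bit".ljust(46, b"\0")
         + bytes([1, 0]) + b"lazy refcounts".ljust(46, b"\0"))
print(parse_feature_names(table))
```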
== Host cluster management ==
qcow2 manages the allocation of host clusters by maintaining a reference count
for each host cluster. A refcount of 0 means that the cluster is free, 1 means
that it is used, and >= 2 means that it is used and any write access must
perform a COW (copy on write) operation.
The refcounts are managed in a two-level table. The first level is called
refcount table and has a variable size (which is stored in the header). The
refcount table can cover multiple clusters, however it needs to be contiguous
in the image file.
It contains pointers to the second level structures which are called refcount
blocks and are exactly one cluster in size.
Given an offset into the image file, the refcount of its cluster can be obtained
as follows:
refcount_block_entries = (cluster_size * 8 / refcount_bits)
refcount_block_index = (offset / cluster_size) % refcount_block_entries
refcount_table_index = (offset / cluster_size) / refcount_block_entries
refcount_block = load_cluster(refcount_table[refcount_table_index]);
return refcount_block[refcount_block_index];
Refcount table entry:
Bit 0 - 8: Reserved (set to 0)
9 - 63: Bits 9-63 of the offset into the image file at which the
refcount block starts. Must be aligned to a cluster
boundary.
If this is 0, the corresponding refcount block has not yet
been allocated. All refcounts managed by this refcount block
are 0.
Refcount block entry (x = refcount_bits - 1):
Bit 0 - x: Reference count of the cluster. If refcount_bits implies a
sub-byte width, note that bit 0 means the least significant
bit in this context.
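The index arithmetic and the entry widths can be made concrete as follows. This is a sketch with hypothetical names; for sub-byte widths it assumes that entry 0 occupies the least significant bits of each byte, which is one reading of the note above.

```python
def refcount_location(offset, cluster_bits, refcount_order):
    """Return (refcount_table_index, refcount_block_index) for a host offset."""
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    block_entries = cluster_size * 8 // refcount_bits
    cluster_index = offset // cluster_size
    return cluster_index // block_entries, cluster_index % block_entries


def read_refcount(block, index, refcount_order):
    """Extract one refcount from a refcount block held as bytes."""
    bits = 1 << refcount_order
    if bits >= 8:
        width = bits // 8
        return int.from_bytes(block[index * width:(index + 1) * width], "big")
    # Assumption: sub-byte entries are packed starting at the low bits.
    per_byte = 8 // bits
    shift = (index % per_byte) * bits
    return (block[index // per_byte] >> shift) & ((1 << bits) - 1)


# 64 KiB clusters, 16-bit refcounts: 32768 entries per refcount block.
print(refcount_location(0x50000, 16, 4))   # prints (0, 5)
```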
== Cluster mapping ==
Just as for refcounts, qcow2 uses a two-level structure for the mapping of
guest clusters to host clusters. They are called L1 and L2 tables.
The L1 table has a variable size (stored in the header) and may use multiple
clusters, however it must be contiguous in the image file. L2 tables are
exactly one cluster in size.
Given an offset into the virtual disk, the offset into the image file can be
obtained as follows:
l2_entries = (cluster_size / sizeof(uint64_t))
l2_index = (offset / cluster_size) % l2_entries
l1_index = (offset / cluster_size) / l2_entries
l2_table = load_cluster(l1_table[l1_index]);
cluster_offset = l2_table[l2_index];
return cluster_offset + (offset % cluster_size)
L1 table entry:
Bit 0 - 8: Reserved (set to 0)
9 - 55: Bits 9-55 of the offset into the image file at which the L2
table starts. Must be aligned to a cluster boundary. If the
offset is 0, the L2 table and all clusters described by this
L2 table are unallocated.
56 - 62: Reserved (set to 0)
63: 0 for an L2 table that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
in the active L1 table.
L2 table entry:
Bit 0 - 61: Cluster descriptor
62: 0 for standard clusters
1 for compressed clusters
63: 0 for a cluster that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
in L2 tables that are reachable from the active L1
table.
Standard Cluster Descriptor:
Bit 0: If set to 1, the cluster reads as all zeros. The host
cluster offset can be used to describe a preallocation,
but it won't be used for reading data from this cluster,
nor is data read from the backing file if the cluster is
unallocated.
With version 2, this is always 0.
1 - 8: Reserved (set to 0)
9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a
cluster boundary. If the offset is 0, the cluster is
unallocated.
56 - 61: Reserved (set to 0)
Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
cluster boundary!
x+1 - 61: Compressed size of the images in sectors of 512 bytes
If a cluster is unallocated, read requests shall read the data from the backing
file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
no backing file or the backing file is smaller than the image, they shall read
zeros for all parts that are not covered by the backing file.
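The lookup indices and the standard cluster descriptor can be decoded as in the sketch below. Names are hypothetical, and compressed cluster descriptors are deliberately left undecoded.

```python
# Bits 9-55 carry the host offset in L1 entries and standard L2 descriptors.
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)


def l1_l2_index(offset, cluster_bits):
    """Return (l1_index, l2_index) for a guest offset."""
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8          # sizeof(uint64_t) == 8
    cluster_index = offset // cluster_size
    return cluster_index // l2_entries, cluster_index % l2_entries


def decode_l2_entry(entry):
    """Split a 64-bit L2 entry into its flags and, if standard, its offset."""
    info = {
        "refcount_is_one": bool((entry >> 63) & 1),
        "compressed": bool((entry >> 62) & 1),
    }
    if not info["compressed"]:
        info["reads_as_zero"] = bool(entry & 1)
        info["host_offset"] = entry & OFFSET_MASK   # 0 means unallocated
    return info


print(l1_l2_index(0x12345678, 16))               # prints (0, 4660)
print(decode_l2_entry((1 << 63) | 0x50000))
```

A reader would fall back to the backing file (or zeros) whenever `host_offset` is 0 and `reads_as_zero` is false.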
== Snapshots ==
qcow2 supports internal snapshots. Their basic principle of operation is to
switch the active L1 table, so that a different set of host clusters are
exposed to the guest.
When creating a snapshot, the L1 table should be copied and the refcount of all
L2 tables and clusters reachable from this L1 table must be increased, so that
a write causes a COW and isn't visible in other snapshots.
When loading a snapshot, bit 63 of all entries in the new active L1 table and
all L2 tables referenced by it must be reconstructed from the refcount table
as it doesn't need to be accurate in inactive L1 tables.
A directory of all snapshots is stored in the snapshot table, a contiguous area
in the image file, whose starting offset and length are given by the header
fields snapshots_offset and nb_snapshots. The entries of the snapshot table
have variable length, depending on the length of ID, name and extra data.
Snapshot table entry:
Byte 0 - 7: Offset into the image file at which the L1 table for the
snapshot starts. Must be aligned to a cluster boundary.
8 - 11: Number of entries in the L1 table of the snapshot
12 - 13: Length of the unique ID string describing the snapshot
14 - 15: Length of the name of the snapshot
16 - 19: Time at which the snapshot was taken in seconds since the
Epoch
20 - 23: Subsecond part of the time at which the snapshot was taken
in nanoseconds
24 - 31: Time that the guest was running until the snapshot was
taken in nanoseconds
32 - 35: Size of the VM state in bytes. 0 if no VM state is saved.
If there is VM state, it starts at the first cluster
described by first L1 table entry that doesn't describe a
regular guest cluster (i.e. VM state is stored like guest
disk content, except that it is stored at offsets that are
larger than the virtual disk presented to the guest)
36 - 39: Size of extra data in the table entry (used for future
extensions of the format)
variable: Extra data for future extensions. Unknown fields must be
ignored. Currently defined are (offset relative to snapshot
table entry):
Byte 40 - 47: Size of the VM state in bytes. 0 if no VM
state is saved. If this field is present,
the 32-bit value in bytes 32-35 is ignored.
Byte 48 - 55: Virtual disk size of the snapshot in bytes
Version 3 images must include extra data at least up to
byte 55.
variable: Unique ID string for the snapshot (not null terminated)
variable: Name of the snapshot (not null terminated)
variable: Padding to round up the snapshot table entry size to the
next multiple of 8.


@@ -0,0 +1,138 @@
=Specification=
The file format looks like this:
+----------+----------+----------+-----+
| cluster0 | cluster1 | cluster2 | ... |
+----------+----------+----------+-----+
The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2 table''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
All fields are little-endian.
==Header==
Header {
uint32_t magic; /* QED\0 */
uint32_t cluster_size; /* in bytes */
uint32_t table_size; /* for L1 and L2 tables, in clusters */
uint32_t header_size; /* in clusters */
uint64_t features; /* format feature bits */
uint64_t compat_features; /* compat feature bits */
uint64_t autoclear_features; /* self-resetting feature bits */
uint64_t l1_table_offset; /* in bytes */
uint64_t image_size; /* total logical image size, in bytes */
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_filename_offset; /* in bytes from start of header */
uint32_t backing_filename_size; /* in bytes */
}
Field descriptions:
* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].
* ''table_size'' must be a power of 2 in range [1, 16].
* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later.
* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
Feature bits:
* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
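A reader for this header might look like the following sketch. The function and variable names are hypothetical; the layout and the range checks follow the field descriptions above, and the magic is taken to be the bytes "QED\0" read as a little-endian 32-bit value.

```python
import struct

QED_HEADER = struct.Struct("<IIIIQQQQQII")   # 64 bytes, all little-endian
QED_NAMES = ("magic", "cluster_size", "table_size", "header_size",
             "features", "compat_features", "autoclear_features",
             "l1_table_offset", "image_size",
             "backing_filename_offset", "backing_filename_size")
QED_MAGIC = int.from_bytes(b"QED\0", "little")
QED_F_BACKING_FILE = 0x01


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def parse_qed_header(buf):
    """Decode and sanity-check a QED header from the start of an image."""
    hdr = dict(zip(QED_NAMES, QED_HEADER.unpack_from(buf, 0)))
    if hdr["magic"] != QED_MAGIC:
        raise ValueError("not a QED image")
    cs, ts = hdr["cluster_size"], hdr["table_size"]
    if not (is_power_of_two(cs) and 2 ** 12 <= cs <= 2 ** 26):
        raise ValueError("invalid cluster_size")
    if not (is_power_of_two(ts) and 1 <= ts <= 16):
        raise ValueError("invalid table_size")
    if hdr["features"] & QED_F_BACKING_FILE:
        start = hdr["backing_filename_offset"]
        end = start + hdr["backing_filename_size"]
        hdr["backing_filename"] = buf[start:end]   # not NUL-terminated
    return hdr


name = b"base.img"
sample = QED_HEADER.pack(QED_MAGIC, 65536, 4, 1, QED_F_BACKING_FILE, 0, 0,
                         65536, 1 << 30, QED_HEADER.size, len(name)) + name
print(parse_qed_header(sample)["backing_filename"])
```

The backing filename fields are always unpacked, matching the rule that they take up header space whether or not the feature bit is set.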
==Tables==
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
The tables are organized as follows:
                     +----------+
                     | L1 table |
                     +----------+
                ,------'  |  '------.
           +----------+   |    +----------+
           | L2 table |  ...   | L2 table |
           +----------+        +----------+
       ,------'  |  '------.
  +----------+   |    +----------+
  |   Data   |  ...   |   Data   |
  +----------+        +----------+
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
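As a worked example of the two formulas above, the following sketch computes the number of offsets per table and the largest image size that a two-level table can address. The function names are illustrative only.

```python
UINT64_SIZE = 8  # sizeof(uint64_t) in bytes


def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets held by one table."""
    return table_size * cluster_size // UINT64_SIZE


def max_image_size(cluster_size, table_size):
    """Largest logical image size addressable through L1 -> L2 -> data."""
    n = table_noffsets(cluster_size, table_size)
    return n * n * cluster_size
```

With cluster_size=64 KB and table_size=4, each table holds 32768 offsets and the image size limit is 2^46 bytes (64 TB).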
L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
===L2 table offsets===
* 0 - unallocated. The L2 table is not yet allocated.
===Data cluster offsets===
* 0 - unallocated. The data cluster is not yet allocated.
* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
===Unallocated L2 tables and data clusters===
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if there is no backing file) and the data being written.
===Zero data clusters===
Zero data clusters are a space-efficient way of storing zeroed regions of the image.
Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
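The read rules for unallocated and zero clusters can be summarized in a short sketch. The helper names read_source and read_backing are hypothetical, and the backing file is modeled as a bytes object for simplicity.

```python
OFFSET_UNALLOCATED = 0  # L2 table or data cluster not yet allocated
OFFSET_ZERO = 1         # data cluster reads as zeroes, nothing allocated


def read_source(l2_offset, data_offset=None):
    """Decide where a read is served from: 'backing', 'zero' or 'image'."""
    if l2_offset == OFFSET_UNALLOCATED:
        return "backing"
    if data_offset == OFFSET_UNALLOCATED:
        return "backing"
    if data_offset == OFFSET_ZERO:
        return "zero"  # stops the lookup; the backing file is not consulted
    return "image"


def read_backing(backing, pos, length):
    """Read from the backing file, zero-filling past its end."""
    if backing is None:
        return bytes(length)  # no backing file: zeroes are produced
    data = backing[pos:pos + length]
    return data + bytes(length - len(data))
```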
===Logical offset translation===
Logical offsets are translated into cluster offsets as follows:
 table_bits table_bits    cluster_bits
 <--------> <--------> <--------------->
+----------+----------+-----------------+
| L1 index | L2 index |     byte offset |
+----------+----------+-----------------+
      Structure of a logical offset
offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
    l2_offset = l1_table[l1_index]
    l2_table = load_table(l2_offset)
    cluster_offset = l2_table[l2_index] & offset_mask
    return cluster_offset + byte_offset
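The pseudocode above takes the three components of a logical offset as inputs. A sketch of how they could be extracted from a flat logical offset follows. It assumes that cluster_size and table_size are powers of two, so that the field widths are whole numbers of bits; the function name is illustrative.

```python
def split_logical_offset(pos, cluster_size, table_size):
    """Split a logical offset into (l1_index, l2_index, byte_offset)."""
    # Field widths; valid only for power-of-two sizes (assumption).
    cluster_bits = cluster_size.bit_length() - 1
    table_noffsets = table_size * cluster_size // 8
    table_bits = table_noffsets.bit_length() - 1

    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & ((1 << table_bits) - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset
```

For cluster_size=64 KB and table_size=4 this yields cluster_bits=16 and table_bits=15.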
==Consistency checking==
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
The consistency check verifies the following invariants:
# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
# Offsets must be within the image file size and must be ''cluster_size'' aligned.
# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. After the check completes with no errors other than leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
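A simplified sketch of the scan is shown below. It works on tables that have already been loaded into memory and is not the QEMU implementation: check_tables is a hypothetical name, l2_tables is an assumed mapping from L2 table offset to its list of data cluster offsets, and leak detection and the L1 table's own clusters are left out for brevity.

```python
def check_tables(l1_table, l2_tables, file_size, cluster_size, table_size):
    """Return a list of inconsistency messages (empty if none found)."""
    errors = []
    seen = set()  # cluster offsets referenced so far
    table_bytes = table_size * cluster_size

    def claim(offset, length, what):
        # Invariants 2 and 3: aligned, and the whole object fits in the file.
        if offset % cluster_size or offset + length > file_size:
            errors.append("%s offset %#x is misaligned or out of range"
                          % (what, offset))
            return
        # Invariant 1: no cluster may be referenced more than once.
        for cluster in range(offset, offset + length, cluster_size):
            if cluster in seen:
                errors.append("cluster %#x referenced more than once"
                              % cluster)
            seen.add(cluster)

    for l2_offset in l1_table:
        if l2_offset == 0:          # unallocated L2 table
            continue
        claim(l2_offset, table_bytes, "L2 table")
        for data_offset in l2_tables.get(l2_offset, []):
            if data_offset in (0, 1):  # unallocated or zero cluster
                continue
            claim(data_offset, cluster_size, "data cluster")
    return errors
```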


@@ -0,0 +1,81 @@
QEMU Standard VGA
=================
Exists in two variants, for isa and pci.
command line switches:
-vga std [ picks isa for -M isapc, otherwise pci ]
-device VGA [ pci variant ]
-device isa-vga [ isa variant ]
-device secondary-vga [ legacy-free pci variant ]
PCI spec
--------
Applies to the pci variant only for obvious reasons.
PCI ID: 1234:1111
PCI Region 0:
Framebuffer memory, 16 MB in size (by default).
Size is tunable via vga_mem_mb property.
PCI Region 1:
Reserved (so we have the option to make the framebuffer bar 64bit).
PCI Region 2:
MMIO bar, 4096 bytes in size (qemu 1.3+)
PCI ROM Region:
Holds the vgabios (qemu 0.14+).
The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
instead of PCI_CLASS_DISPLAY_VGA.
IO ports used
-------------
Doesn't apply to the legacy-free pci variant, use the MMIO bar instead.
03c0 - 03df : standard vga ports
01ce : bochs vbe interface index port
01cf : bochs vbe interface data port (x86 only)
01d0 : bochs vbe interface data port
Memory regions used
-------------------
0xe0000000 : Framebuffer memory, isa variant only.
The pci variant used to mirror the framebuffer bar here, qemu 0.14+
stops doing that (except when in -M pc-$old compat mode).
MMIO area spec
--------------
Likewise applies to the pci variant only for obvious reasons.
0000 - 03ff : reserved, for possible virtio extension.
0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
word access is supported, bytes are written
in little endian order (aka index port first),
so indexed registers can be updated with a
single mmio write (and thus only one vmexit).
0500 - 0515 : bochs dispi interface registers, mapped flat
without index/data ports. Use (index << 1)
as offset for (16bit) register access.
0600 - 0607 : qemu extended registers. qemu 2.2+ only.
The pci revision is 2 (or greater) when
these registers are present. The registers
are 32bit.
0600 : qemu extended register region size, in bytes.
0604 : framebuffer endianness register.
- 0xbebebebe indicates big endian.
- 0x1e1e1e1e indicates little endian.
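The offset arithmetic described for the MMIO bar can be sketched as below. The constant and function names are illustrative only; the base offsets and magic values are the ones listed above, and the register index range 0..10 is inferred from the 0500 - 0515 window.

```python
VGA_IOPORT_FIRST = 0x3c0
VGA_IOPORT_LAST = 0x3df
MMIO_VGA_BASE = 0x400    # vga ioports, remapped 1:1
MMIO_DISPI_BASE = 0x500  # bochs dispi registers, mapped flat
MMIO_QEXT_ENDIAN = 0x604  # framebuffer endianness register


def vga_port_to_mmio(port):
    """MMIO bar offset for a standard vga ioport."""
    if not VGA_IOPORT_FIRST <= port <= VGA_IOPORT_LAST:
        raise ValueError("not a standard vga port: %#x" % port)
    return MMIO_VGA_BASE + (port - VGA_IOPORT_FIRST)


def dispi_reg_to_mmio(index):
    """MMIO bar offset for a 16bit bochs dispi register."""
    if not 0 <= index <= 10:  # 0x500-0x515 holds 11 16bit registers
        raise ValueError("dispi index out of range: %d" % index)
    return MMIO_DISPI_BASE + (index << 1)


def decode_endianness(value):
    """Interpret the value read from the endianness register."""
    return {0xbebebebe: "big", 0x1e1e1e1e: "little"}.get(value)
```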


@@ -0,0 +1,266 @@
Vhost-user Protocol
===================
Copyright (c) 2014 Virtual Open Systems Sarl.
This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
===================
This protocol aims to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
uses communication over a Unix domain socket to share file descriptors in the
ancillary data of the message.
The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.
In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.
Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
Message Specification
---------------------
Note that all numbers are in the machine native byte order. A vhost-user message
consists of 3 header fields and a payload:
------------------------------------
| request | flags | size | payload |
------------------------------------
* Request: 32-bit type of the request
* Flags: 32-bit bit field:
- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
* Size: 32-bit size of the payload
Depending on the request type, payload can be:
* A single 64-bit integer
-------
| u64 |
-------
u64: a 64-bit unsigned integer
* A vring state description
---------------
| index | num |
---------------
Index: a 32-bit index
Num: a 32-bit number
* A vring address description
--------------------------------------------------------------
| index | flags | size | descriptor | used | available | log |
--------------------------------------------------------------
Index: a 32-bit vring index
Flags: a 32-bit vring flags
Descriptor: a 64-bit user address of the vring descriptor table
Used: a 64-bit user address of the vring used ring
Available: a 64-bit user address of the vring available ring
Log: a 64-bit guest address for logging
* Memory regions description
---------------------------------------------------
| num regions | padding | region0 | ... | region7 |
---------------------------------------------------
Num regions: a 32-bit number of regions
Padding: 32-bit
A region is:
-----------------------------------------------------
| guest address | size | user address | mmap offset |
-----------------------------------------------------
Guest address: a 64-bit guest address of the region
Size: a 64-bit size
User address: a 64-bit user address
mmap offset: 64-bit offset where region starts in the mapped memory
In QEMU the vhost-user message is implemented with the following struct:
typedef struct VhostUserMsg {
    VhostUserRequest request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
        struct vhost_vring_state state;
        struct vhost_vring_addr addr;
        VhostUserMemory memory;
    };
} QEMU_PACKED VhostUserMsg;
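For illustration, the three header fields can be packed and parsed as shown in this sketch. It is not part of QEMU: the helper names are hypothetical, and the "=" struct prefix is used to get native byte order without padding, which is assumed to match the packed C struct.

```python
import struct

# request, flags, size: three 32-bit fields, native byte order, no padding.
HEADER = struct.Struct("=III")

VERSION_MASK = 0x3   # lower 2 bits of flags
VERSION = 0x1        # current protocol version
REPLY_FLAG = 1 << 2  # bit 2, set on each reply from the slave


def pack_message(request, payload=b"", reply=False):
    """Build a complete message: header followed by the payload bytes."""
    flags = VERSION | (REPLY_FLAG if reply else 0)
    return HEADER.pack(request, flags, len(payload)) + payload


def unpack_header(data):
    """Return (request, version, is_reply, payload_size)."""
    request, flags, size = HEADER.unpack_from(data)
    return request, flags & VERSION_MASK, bool(flags & REPLY_FLAG), size
```

For example, a request with Id 2 and a u64 payload would be built with pack_message(2, struct.pack("=Q", features)).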
Communication
-------------
The protocol for vhost-user is based on the existing implementation of vhost
for the Linux Kernel. Most messages that can be sent via the Unix domain socket
implementing vhost-user have an equivalent ioctl to the kernel implementation.
The communication consists of master sending message requests and slave sending
message replies. Most of the requests don't require replies. Here is a list of
the ones that do:
* VHOST_USER_GET_FEATURES
* VHOST_USER_GET_VRING_BASE
There are several messages that the master sends with file descriptors passed
in the ancillary data:
* VHOST_USER_SET_MEM_TABLE
* VHOST_USER_SET_LOG_FD
* VHOST_USER_SET_VRING_KICK
* VHOST_USER_SET_VRING_CALL
* VHOST_USER_SET_VRING_ERR
If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.
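Passing file descriptors in the ancillary data of a Unix domain socket message relies on the SCM_RIGHTS mechanism. The following sketch shows the general technique using the Python standard library. It is not QEMU code: the helper names are hypothetical, and the demo uses a socket pair and a pipe in place of a real master, slave and eventfd.

```python
import array
import os
import socket


def send_with_fds(sock, message, fds):
    """Send message bytes with file descriptors attached via SCM_RIGHTS."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                  array.array("i", fds))]
    return sock.sendmsg([message], ancillary)


def recv_with_fds(sock, length, max_fds):
    """Receive message bytes and any file descriptors sent with them."""
    fds = array.array("i")
    space = socket.CMSG_LEN(max_fds * fds.itemsize)
    message, ancdata, _flags, _addr = sock.recvmsg(length, space)
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    return message, list(fds)


def roundtrip_demo():
    """Pass the read end of a pipe across a socket pair and use it."""
    master, slave = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_with_fds(master, b"kick", [read_end])
        message, fds = recv_with_fds(slave, 4, 1)
        os.write(write_end, b"!")
        received = os.read(fds[0], 1)  # the received fd refers to the pipe
        os.close(fds[0])
        return message, received
    finally:
        for fd in (read_end, write_end):
            os.close(fd)
        master.close()
        slave.close()
```

The received descriptor is a new descriptor in the receiving process that refers to the same open file as the one sent.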
Message types
-------------
* VHOST_USER_GET_FEATURES
Id: 1
Equivalent ioctl: VHOST_GET_FEATURES
Master payload: N/A
Slave payload: u64
Get from the underlying vhost implementation the features bitmask.
* VHOST_USER_SET_FEATURES
Id: 2
Equivalent ioctl: VHOST_SET_FEATURES
Master payload: u64
Enable features in the underlying vhost implementation using a bitmask.
* VHOST_USER_SET_OWNER
Id: 3
Equivalent ioctl: VHOST_SET_OWNER
Master payload: N/A
Issued when a new connection is established. It sets the current Master
as an owner of the session. This can be used on the Slave as a
"session start" flag.
* VHOST_USER_RESET_OWNER
Id: 4
Equivalent ioctl: VHOST_RESET_OWNER
Master payload: N/A
Issued when a connection is about to be closed. The Master will no
longer own this connection (and will usually close it).
* VHOST_USER_SET_MEM_TABLE
Id: 5
Equivalent ioctl: VHOST_SET_MEM_TABLE
Master payload: memory regions description
Sets the memory map regions on the slave so it can translate the vring
addresses. In the ancillary data there is an array of file descriptors
for each memory mapped region. The size and ordering of the fds matches
the number and ordering of memory regions.
* VHOST_USER_SET_LOG_BASE
Id: 6
Equivalent ioctl: VHOST_SET_LOG_BASE
Master payload: u64
Sets the logging base address.
* VHOST_USER_SET_LOG_FD
Id: 7
Equivalent ioctl: VHOST_SET_LOG_FD
Master payload: N/A
Sets the logging file descriptor, which is passed as ancillary data.
* VHOST_USER_SET_VRING_NUM
Id: 8
Equivalent ioctl: VHOST_SET_VRING_NUM
Master payload: vring state description
Sets the size of the vring (the number of descriptors in the ring).
* VHOST_USER_SET_VRING_ADDR
Id: 9
Equivalent ioctl: VHOST_SET_VRING_ADDR
Master payload: vring address description
Slave payload: N/A
Sets the addresses of the different aspects of the vring.
* VHOST_USER_SET_VRING_BASE
Id: 10
Equivalent ioctl: VHOST_SET_VRING_BASE
Master payload: vring state description
Sets the base offset in the available vring.
* VHOST_USER_GET_VRING_BASE
Id: 11
Equivalent ioctl: VHOST_GET_VRING_BASE
Master payload: vring state description
Slave payload: vring state description
Get the available vring base offset.
* VHOST_USER_SET_VRING_KICK
Id: 12
Equivalent ioctl: VHOST_SET_VRING_KICK
Master payload: u64
Set the event file descriptor for adding buffers to the vring. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. This signals that polling should be used
instead of waiting for a kick.
* VHOST_USER_SET_VRING_CALL
Id: 13
Equivalent ioctl: VHOST_SET_VRING_CALL
Master payload: u64
Set the event file descriptor to signal when buffers are used. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. This signals that polling will be used
instead of waiting for the call.
* VHOST_USER_SET_VRING_ERR
Id: 14
Equivalent ioctl: VHOST_SET_VRING_ERR
Master payload: u64
Set the event file descriptor to signal when error occurs. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data.
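The u64 payload layout shared by the KICK, CALL and ERR messages (vring index in bits 0-7, invalid FD flag in bit 8) can be sketched as follows. The constant and function names are illustrative only.

```python
VRING_INDEX_MASK = 0xff      # bits 0-7: vring index
VRING_INVALID_FD = 1 << 8    # bit 8: no file descriptor in ancillary data


def encode_vring_fd_payload(index, has_fd):
    """Build the u64 payload for a SET_VRING_KICK/CALL/ERR message."""
    if not 0 <= index <= VRING_INDEX_MASK:
        raise ValueError("vring index out of range: %d" % index)
    return index | (0 if has_fd else VRING_INVALID_FD)


def decode_vring_fd_payload(value):
    """Return (vring_index, has_fd) from the u64 payload."""
    return value & VRING_INDEX_MASK, not (value & VRING_INVALID_FD)
```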


@@ -0,0 +1,92 @@
General Description
===================
This document describes the VMware PVSCSI device interface specification.
Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
Based on source code of PVSCSI Linux driver from kernel 3.0.4
PVSCSI Device Interface Overview
================================
The interface is based on a memory area shared between the hypervisor and the VM.
The driver obtains the memory area as a device IO memory resource of
PVSCSI_MEM_SPACE_SIZE length.
The shared memory consists of registers area and rings area.
The registers area is used to raise hypervisor interrupts and issue device
commands. The rings area is used to transfer data descriptors and SCSI
commands from VM to hypervisor and to transfer messages produced by
hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
PVSCSI Device Registers
=======================
The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
The structure of the registers area is described by the PVSCSIRegOffset enum.
There are registers to issue device commands (with optional short data),
issue device interrupts, and control interrupt masking.
PVSCSI Device Rings
===================
There are three rings in shared memory:
1. Request ring (struct PVSCSIRingReqDesc *req_ring)
- ring for OS to device requests
2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
- ring for device request completions
3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
- ring for messages from device.
This ring is optional and the guest might not configure it.
There is a control area (struct PVSCSIRingsState *rings_state) used to control
rings operation.
PVSCSI Device to Host Interrupts
================================
The PVSCSI device supports the following interrupt types:
1. Completion interrupts (completion ring notifications):
PVSCSI_INTR_CMPL_0
PVSCSI_INTR_CMPL_1
2. Message interrupts (message ring notifications):
PVSCSI_INTR_MSG_0
PVSCSI_INTR_MSG_1
Interrupts are controlled via the PVSCSI_REG_OFFSET_INTR_MASK register.
A set bit means the interrupt is enabled; a cleared bit means it is disabled.
The supported interrupt modes are legacy, MSI and MSI-X.
In case of legacy interrupts, register PVSCSI_REG_OFFSET_INTR_STATUS
is used to check which interrupt has arrived. Interrupts are
acknowledged when the corresponding bit is written to the interrupt
status register.
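The mask and acknowledge rules can be sketched as below. The bit positions are an assumption based on the PVSCSI Linux driver header and are not stated in this document; read_status and write_status are hypothetical callables standing in for accesses to the PVSCSI_REG_OFFSET_INTR_STATUS register.

```python
# Assumed bit values (from the Linux driver header, not from this text).
PVSCSI_INTR_CMPL_0 = 1 << 0
PVSCSI_INTR_CMPL_1 = 1 << 1
PVSCSI_INTR_MSG_0 = 1 << 2
PVSCSI_INTR_MSG_1 = 1 << 3

CMPL_MASK = PVSCSI_INTR_CMPL_0 | PVSCSI_INTR_CMPL_1
MSG_MASK = PVSCSI_INTR_MSG_0 | PVSCSI_INTR_MSG_1


def interrupt_mask(messages_enabled):
    """Value for the interrupt mask register: set bit means enabled."""
    return CMPL_MASK | (MSG_MASK if messages_enabled else 0)


def handle_legacy_interrupt(read_status, write_status):
    """Check and acknowledge pending interrupts in legacy mode.

    Returns (completion_pending, message_pending).
    """
    status = read_status()
    if status:
        # Writing the pending bits back acknowledges those interrupts.
        write_status(status)
    return bool(status & CMPL_MASK), bool(status & MSG_MASK)
```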
PVSCSI Device Operation Sequences
=================================
1. Startup sequence:
a. Issue PVSCSI_CMD_ADAPTER_RESET command;
aa. Windows driver reads interrupt status register here;
b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
check status and disable device messages if error returned;
(Omitted if device messages disabled by driver configuration)
c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
as struct PVSCSICmdDescSetupRings;
d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
rings configuration as struct PVSCSICmdDescSetupMsgRing;
e. Unmask completion and message (if device messages enabled) interrupts.
2. Shutdown sequence:
a. Mask interrupts;
b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
c. Issue PVSCSI_CMD_ADAPTER_RESET command.
3. Send request
a. Fill next free request ring descriptor;
b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
4. Abort command
a. Issue PVSCSI_CMD_ABORT_CMD command.
5. Request completion processing
a. Upon arrival of a completion interrupt, process the completion ring
and, if enabled, the message ring.