Using the Img-storage Roll¶
To use the Img-Storage roll is necessary to have at least one NAS server which can store all the virtual disk images. Moreover it is required to install the full OS roll.
Overview of the Img-Storage server¶
The Img-storage roll needs several daemons to properly function. Each
daemon is implemented as an init script under /etc/init.d
and can be
restared by the service command. The tree main services are:
- rabbitmq-server this component is used to orchestrate all
comunication between the various nodes and is installed on the
frontend. It is just a standard
Rabbit MQ Server
. - img-storage-vm this is the python daemon which is in charge of
managing the iSCSI mapping on the hosting machine (the machine that
will run the virtual machine), creating the local ZFS volumes, and
converting these to local lvms. img-storage-vm is installed by default
on all VM Container appliances but it can be installed on other nodes
simply by turning to true the attribute
img_storage_vm
. You will also need KVM component on the node to properly run virtual machine, so also the attributekvm
should be set to true. To enable the volumes synchronization, the node should have attributeszfs
andimg_sync
set to true. - img-storage-nas this is the python daemon which manages the
virtual disk repository. It uses ZFS as the underlying technology for
storage. This daemon is also responsible to set up iSCSI targets for
the img-storage-vm. img-storage-nas is installed by default on all
NAS appliances, but it can be changed simply using the attribute
img_storage_nas
(the attributezfs
should also be true in order to install zfs). img-storage-nas allocates virtual disks on a zpool(s) set by frontend configuration, which should be created manually by the administrator before any virtual machine can be used.
For example, to run virtual machine on a standard compute node it is necessary to set:
/opt/rocks/bin/rocks add appliance attr compute img_storage_vm true
/opt/rocks/bin/rocks add appliance attr compute kvm true
Then reinstall all the compute nodes.
Using InfiniBand interface¶
If an infiniband network is present on the nodes, it can be used for data
transfers. To use it, set the attribute IB_net
to the name of the
InfiniBand network for all the nodes.
Image access modes¶
There are two modes supported: direct iSCSI and local synchronized disk. The
mode is set by img_sync
node attribute set to either True or False.
In direct iSCSI mode the VM performs all I/O operations to the remote zvol. The zvol is mapped to iSCSI target on NAS, which is then visible as local block device on compute node.
In sync mode it starts similar to direct iSCSI, but then synchronization is performed, and in the end the VM is using a local drive which is the copy of remove zvol on NAS. The drive is synced back to NAS when VM is terminated.
The sync mode requires zfs to be set up on VM Container nodes. The attribute
vm_container_zpool
sets the name of zpool to use and can be set both
globally for all the nodes or per-node if config is different from the rest of
the nodes.
VM boot workflow with image sync¶
The typical VM start workflow has several steps:
- The zvol is created on NAS (if didn’t exist before)
- iSCSI target is created, allowing only the compute node to connect to it
- Compute node creates local zvol for temporary write
- Compute node attaches iSCSI to local block device
- Compute node creates lvm that reads data from iSCSI and writes to temporary drive and responds success to NAS
- NAS responds success to frontend which boots the VM
- NAS starts asynchronous task whoch copies current ZVOL to compute node
- When finished, sends signal to compute node which starts merging the local zvol with temporary write-only zvol
- When done, switches the VM to merged local zvol
Enable remote virtual disk with Img-Storage¶
To enable a virtual machine to use a remote virtual disk, the name of the NAS
and the name of the zpool holding the disk image must be set up. The command
rocks set host vm nas
can be used for this, while the command rocks list
host vm nas
can be used to show the current value of the NAS name. If the
NAS name with the zpool name is not specified for a virtual host, it will use
its original disks configuration which by default uses local raw files (rocks
list host vm showdisks=1
).
Once the NAS name is configured for a Virtual host, the virtual host will use a remote iSCSI disk provided by the given NAS. For example if we have a vritual compute node we can assign its NAS name with the following commands:
# rocks add host vm vm-container-0-14 compute
added VM compute-0-14-0 on physical node vm-container-0-14
# rocks set host vm nas compute-0-14-0 nas=nas-0-0 zpool=tank
# rocks start host vm compute-0-14-0
nas-0-0:compute-0-14-0-vol mapped to compute-0-14:/dev/mapper/compute-0-14-0-vol-snap
The virtual disks are saved on the NAS specified in the rocks list host vm
nas
under the zpool specified in the zpool field. Each volume name is
created appending ‘-vol’ to the virtual machine name.
# rocks list host vm nas
VM-HOST NAS ZPOOL
compute-0-0-0: nas-0-0 tank
compute-0-14-0: nas-0-0 tank
# ssh nas-0-0
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 231G 1.61T 8.08G /tank
tank/compute-0-0-0-vol 37.1G 1.64T 3.60G -
tank/compute-0-14-0-vol 37.1G 1.64T 3.84G -
Instead of specifing the zpool each time, it is possible to set a host attribute
called img_zpools
which list the zpools (separated by a colon) that should
be assigned to each disks. If multiple zpools are specified (e.g.: rocks set
host attr nas-0-0 img_zpools value="tank1,tank2"
), they will be used randomly
each time the command rocks set host vm nas
is invoked without the zpool
paramter, so that the final distribution of virtual disk should be ballanced.
When the host is started and stopped (with rocks start/stop host vm
) the zvol will be
synced automatically between the NAS and the physical container.
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped ----
vm-hpcdev-pub03-4-vol --------------- tank ------------------------------------------------ unmapped ----
vm-hpcdev-pub03-3-vol --------------- tank ------------------------------------------------ unmapped ----
vm-hpcdev-pub03-2-vol --------------- tank ------------------------------------------------ unmapped ----
vm-hpcdev-pub03-0-vol compute-0-1 tank ------------------------------------------------ mapped ----
vm-hpcdev-pub03-1-vol compute-0-3 tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-1-vol mapped ----
vm-hpcdev-pub03-5-vol compute-0-3 tank ------------------------------------------------ mapped ----
# rocks list host storagedev compute-0-3
ZVOL LVM STATUS SIZE (GB) BLOCK DEV IS STARTED SYNCED TIME
vm-hpcdev-pub03-1-vol vm-hpcdev-pub03-1-vol-snap snapshot-merge 36 sdc 1 9099712/73400320 17760 0:32:05
vm-hpcdev-pub03-5-vol vm-hpcdev-pub03-5-vol-snap linear 36 --------- ---------- ------------------------- -------
The vm-hpcdev-pub03-1-vol is currently merging, that’s why we have the iSCSI target still established. Once it’s done, the iSCSI target will be unmapped. The 9099712/73400320 17760 numbers whow the number of blocks left for merging: the task is done when first number, which constantly decreases, is equal to the third one.
Let’s start another VM:
# rocks start host vm vm-hpcdev-pub03-3
nas-0-0:vm-hpcdev-pub03-3-vol mapped to compute-0-3:/dev/mapper/vm-hpcdev-pub03-3-vol-snap
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
vol1 --------------- ------- iqn.2001-04.com.nas-0-0-vol1 unmapped -------
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped -------
vm-hpcdev-pub03-4-vol --------------- tank ------------------------------------------------ unmapped -------
vm-hpcdev-pub03-2-vol --------------- tank ------------------------------------------------ unmapped -------
vm-hpcdev-pub03-0-vol compute-0-1 tank ------------------------------------------------ mapped -------
vm-hpcdev-pub03-1-vol compute-0-3 tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-1-vol mapped -------
vm-hpcdev-pub03-5-vol compute-0-3 tank ------------------------------------------------ mapped -------
vm-hpcdev-pub03-3-vol compute-0-3 tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-3-vol NAS->VM 0:00:04
# rocks list host storagedev compute-0-3
ZVOL LVM STATUS SIZE (GB) BLOCK DEV IS STARTED SYNCED TIME
vm-hpcdev-pub03-3-vol vm-hpcdev-pub03-3-vol-snap snapshot 35 sdd ---------- 32/73400320 32 -------
vm-hpcdev-pub03-1-vol vm-hpcdev-pub03-1-vol-snap snapshot-merge 36 sdc 1 8950592/73400320 17472 0:36:36
vm-hpcdev-pub03-5-vol vm-hpcdev-pub03-5-vol-snap linear 36 --------- ---------- ------------------------- -------
The process of VM copy to compute node started for zvol vm-hpcdev-pub03-3-vol
There are also ‘manual’ commands to list, create or remove zvol synchronization, as shown below:
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped ----
# rocks add host storagemap nas-0-0 tank vol1 compute-0-3 10
mapping nas-0-0 : tank / vol1 on compute-0-3
/dev/mapper/vol1-snap
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped -------
vol1 compute-0-3 tank iqn.2001-04.com.nas-0-0-vol1 NAS->VM 0:00:06
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped ----
vol1 compute-0-3 tank ------------------------------------------------ mapped ----
# rocks remove host storagemap nas-0-0 vol1
unmapping nas-0-0 : vol1
Success
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped -------
vol1 compute-0-3 tank ------------------------------------------------ NAS<-VM 0:00:07
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped ----
vol1 --------------- tank ------------------------------------------------ unmapped ----
# rocks remove host storageimg nas-0-0 tank vol1
removing nas-0-0 : tank / vol1
Success
# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped ----
Recovering from errors¶
Warning
The scripts will not recover the data from VM container, it will be destroyed. You should manually sync back the snapshots to NAS if needed.
There is administrator script being installed with the package on NAS and VM Container nodes called imgstorageadmin. It allows cleaning the state of VM when something went wrong and return it to usable condition.
The script asks questions in order to fully recover the VM in sync mode. User can reply y(default) to run the action or type n to skip.
Example:
On VM container:
# imgstorageadmin
Unmap iSCSI target? [y]|n: y
From which NAS? (Don't forget .ibnet if used) nas-0-0.ibnet
0 10.2.20.250:3260,1 iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-2-vol
1 10.2.20.250:3260,1 iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-4-vol
2 10.2.20.250:3260,1 iqn.2001-04.com.nas-0-0-vol1
Which target would you like to delete? (number)2
====================================
Destroy lvm? [y]|n: y
0 vol1-snap: 0 18874368 snapshot 32/18874368 32
1 vm-hpcdev-pub03-4-vol-snap: 0 73400320 snapshot 13457872/73400320 26256
2 vm-hpcdev-pub03-2-vol-snap: 0 75497472 snapshot-merge 1086176/73400320 2144
Which lvm would you like to destroy? (number)0
====================================
Remove zvol? [y]|n: y
0 tank
1 tank/vm-hpcdev-pub03-2-vol
2 tank/vm-hpcdev-pub03-2-vol-temp-write
3 tank/vm-hpcdev-pub03-4-vol
4 tank/vm-hpcdev-pub03-4-vol-temp-write
5 tank/vol1
6 tank/vol1-temp-write
Which zvol would you like to delete? (number)5
====================================
Then delete second zvol manually (‘zfs destroy tank/vol1-temp-write -r’) or rerun the script
On NAS:
# imgstorageadmin
Unmap iSCSI target? [y]|n: y
Target 1: iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol
Target 2: iqn.2001-04.com.nas-0-0-vol1
Target 3: iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-2-vol
Target 4: iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-4-vol
Which target number would you like to delete? (number) 2
Remove zvol mapping to VM in DB? [y]|n: y
0 hpcdev-pub03-vol tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol hpcdev-pub02
1 vm-hpcdev-pub03-0-vol
2 vm-hpcdev-pub03-2-vol tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-2-vol compute-0-1
3 vm-hpcdev-pub03-5-vol tank
4 vm-hpcdev-pub03-4-vol tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-4-vol compute-0-1
5 vm-hpcdev-pub03-1-vol tank
6 vm-hpcdev-pub03-3-vol tank
7 vol1 tank iqn.2001-04.com.nas-0-0-vol1 compute-0-1
Which zvol? (number) 7
Done
Unbusy the zvol? [y]|n: y
0 vm-hpcdev-pub03-4-vol amq.gen-Esp2W6XQojClmQ7APoHAvQ 1409862185.2
1 vol1 amq.gen-bT045S_sjCeNcni0V-pkkQ 1409872201.94
Which zvol? (number) 1
The vol1 is now in clean unmapped state and is ready for mapping:
[root@hpcdev-pub02 ~]# rocks list host storagemap nas-0-0
ZVOL HOST ZPOOL TARGET STATE TIME
hpcdev-pub03-vol hpcdev-pub02 tank iqn.2001-04.com.nas-0-0-hpcdev-pub03-vol mapped ----
vm-hpcdev-pub03-0-vol --------------- ------- ------------------------------------------------ unmapped ----
vm-hpcdev-pub03-2-vol compute-0-1 tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-2-vol mapped ----
vm-hpcdev-pub03-5-vol --------------- tank ------------------------------------------------ unmapped ----
vm-hpcdev-pub03-4-vol compute-0-1 tank iqn.2001-04.com.nas-0-0-vm-hpcdev-pub03-4-vol mapped ----
vm-hpcdev-pub03-1-vol --------------- tank ------------------------------------------------ unmapped ----
vm-hpcdev-pub03-3-vol --------------- tank ------------------------------------------------ unmapped ----
vol1 --------------- ------- ------------------------------------------------ unmapped ----