Thursday, October 23, 2014

Datastore inactive and device cannot be brought online

If your shared storage or volumes go offline while VMs are running, and the datastores still stay in an inactive state even after you recover the volumes and rescan the adapters, you may see the following in /var/log/vmkernel.log:

ScsiDevice: 5192: eui.3f7999dba06450376c9ce9006ec1b4eb device :Open count > 0, cannot be brought online

This may indicate that a virtual machine is stuck, or more specifically that the world (process) for the virtual machine's vCPU is still holding the device open. Since the datastore and the backing storage device/volume were not unmounted and detached properly, the device cannot be brought online again after recovery, at least not until the stuck VM is killed. Here are the steps to do that:

    1. Identify the inactive datastore and the serial number of the device behind it (it should match the one shown in vmkernel.log; see the example commands after these steps)
    2. Kill all worlds (by ID) related to the device on the ESXi host. Here's a sample script:
   
DEVICE_SERIAL=eui.3f7999dba06450376c9ce9006ec1b4eb

# List the world IDs still holding the device open (tail skips the two header lines)
for i in $(esxcli storage core device world list -d $DEVICE_SERIAL | awk '{print $2}' | tail -n +3)
do
    # Try to kill the VM process; "Unable to find a virtual machine with the
    # world ID" just means this world is not a VM process, which is OK
    if out=$(esxcli vm process kill --type=force --world-id=$i 2>&1); then
        echo "killed world id=$i successfully"
    elif ! echo "$out" | grep -q "Unable to find a virtual machine with the world ID"; then
        echo "ERROR: $out"
    fi
done

    3. Rescan the adapters and the device/datastore should come back online after the rescan completes
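
For steps 1 and 3, the following esxcli commands may help. This is a minimal sketch: it assumes you can recognize the inactive datastore by its name in the extent list and that you simply want to rescan all adapters at once.

# Map each VMFS datastore to its backing device; the "Device Name" column
# shows the eui./naa. serial to plug into the kill script above (step 1)
esxcli storage vmfs extent list

# Rescan all adapters once the stuck worlds have been killed (step 3)
esxcli storage core adapter rescan --all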

Wednesday, October 22, 2014

Purge vCenter database

In my dev lab, vCenter is heavily used by developers for building software, testing, and running automation all the time. The vCenter database keeps 180 days' worth of events and tasks by default. With a lot of activity in a small, default 10 GB database, it eventually fills up in a few months. The symptom is that vCenter operations start but then fail in Tasks & Events, or the vSphere Client keeps getting disconnected from vCenter.

If vCenter is installed on a Windows server, an MSSQL event similar to the following shows up in Event Viewer under Windows Logs > Application:


To purge the vCenter database (SQL), open the vCenter database (VIM_VCDB) in SQL Server Management Studio. Inspecting the database properties, Space Available may show only a few MB left. Since it's a dev lab, there's no need to keep 180 days' worth of events and tasks, so I shrank the retention down to 30 days with the following steps (a command-line equivalent is sketched after the list):

  1. Go to VIM_VCDB > Tables
  2. Right-click the dbo.VPX_PARAMETER table and select Edit Top 200 Rows
  3. Modify event.maxAgeEnabled to true
  4. Modify event.maxAge to 30
  5. Modify task.maxAgeEnabled to true
  6. Modify task.maxAge to 30
  7. Go to VIM_VCDB > Programmability > Stored Procedures, right-click dbo.cleanup_events_tasks_proc, and select Execute Stored Procedure
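
If you'd rather script these steps than click through Management Studio, something like the following should work from a PowerShell prompt on the vCenter server. It's only a sketch: the SQL Server instance name (SQLEXP_VIM) and the NAME/VALUE columns of VPX_PARAMETER are assumptions to check against your own VIM_VCDB before running anything.

# Assumed instance name for the bundled SQL Server Express; adjust -S as needed
sqlcmd -S .\SQLEXP_VIM -d VIM_VCDB -Q "UPDATE dbo.VPX_PARAMETER SET VALUE='true' WHERE NAME='event.maxAgeEnabled'"
sqlcmd -S .\SQLEXP_VIM -d VIM_VCDB -Q "UPDATE dbo.VPX_PARAMETER SET VALUE='30' WHERE NAME='event.maxAge'"
sqlcmd -S .\SQLEXP_VIM -d VIM_VCDB -Q "UPDATE dbo.VPX_PARAMETER SET VALUE='true' WHERE NAME='task.maxAgeEnabled'"
sqlcmd -S .\SQLEXP_VIM -d VIM_VCDB -Q "UPDATE dbo.VPX_PARAMETER SET VALUE='30' WHERE NAME='task.maxAge'"
# Run the built-in cleanup procedure (this is the long-running step)
sqlcmd -S .\SQLEXP_VIM -d VIM_VCDB -Q "EXEC dbo.cleanup_events_tasks_proc"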

Depending on your environment this can take a while; in my case it took about 40 minutes to free up 4.6 GB (150 days' worth of tasks and events) in the database. After the cleanup, the database's Space Available shows more free space:


After freeing up the vCenter database, all vCenter operations are back to normal. This is the VMware KB article with more details.

Saturday, October 4, 2014

Use thin clone in OpenStack

As a shared/block storage user in a VMware environment, I like the feature that lets you clone a virtual machine (VM) from another VM or template. It's simply a process of choosing a new VM name, host, and datastore, clicking Finish, and done. Wait... it's done from the admin's point of view. The task itself is in fact queued and executed in the "Recent Tasks" view. Depending on how big your VM is and how efficient your storage is, it can take several minutes or even hours before the newly cloned VM can be used. The complaints are:

    1. Slow - especially if you're cloning many of them at the same time
    2. No space efficiency - each clone is a full copy of the original VM
    3. Can't choose thin clone - need VMware Horizon/View license or customized app
    4. Even thick clone needs the vCenter license

Don't get me wrong. vCenter and ESX are great products and very stable. However, in a development engineering lab, I prefer it to be:

    1. Cost effective - pay as little as possible
    2. Fast - so we can try/validate more crazy ideas
    3. Easy - easy to find my template and image, and easy to clone by a click or a few API calls
    4. Efficient - share storage space as much as possible

and that's why and when I migrated my dev lab from VMware to OpenStack and Nimble. Here's how a VM/instance is launched in the lab:

First step: Launch an image instance by using a boot volume

Locate the image ID through glance image-list:

# glance image-list

My Ubuntu 14.04 image ID is ffc286d5-6fbd-44c0-8ea9-89667599c901. Launch an Ubuntu 14.04 boot-volume instance by using nova boot:


# nova boot u1404 --flavor m1.medium --block-device source=image,dest=volume,id=ffc286d5-6fbd-44c0-8ea9-89667599c901,size=2,shutdown=preserve,bootindex=0

Or use Horizon if GUI is preferred:




After clicking "Launch", an Ubuntu instance will be created. It boots from a Nimble volume, so it can take advantage of Nimble's snapshots, performance, and reliability. That's it. There's only one step in the process. However, where are the fast and storage-space-efficient parts? The image still needs to be downloaded to the boot volume, so how can subsequent Ubuntu 14.04 instances start taking advantage of thin clones?

Yes, the first image boot volume takes the hit of downloading bits from Glance to the boot volume. After it completes, the Nimble Cinder driver tags the Nimble volume with a snapshot, associating the Ubuntu image with that volume snapshot. When the next nova boot request for the same image comes in, the driver simply clones a volume out of the snapshot without downloading the image from Glance again. Imagine creating a 40 GB image instance: with this feature, it can be cloned from a Nimble volume snapshot immediately and with zero additional volume usage. All of this is achieved with the same single nova boot command or GUI, which is perfect for engineers in a development and QA lab.
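
To see this in action, boot a second instance from the same image; this time the boot volume is cloned from the tagged snapshot on the array instead of being downloaded from Glance, and cinder list should show it become available almost immediately (the instance name u1404-2 is just an example):

# nova boot u1404-2 --flavor m1.medium --block-device source=image,dest=volume,id=ffc286d5-6fbd-44c0-8ea9-89667599c901,size=2,shutdown=preserve,bootindex=0
# cinder list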

As a user I might ask: if some instances are clones, does that mean I can't delete the parent instance while cloned instances are in use? It turns out that the Nimble Cinder driver keeps track of this for you. I can delete any instance from Nova, business as usual. If the volume backing the instance is the parent of clones and I try to delete it, the Nimble Cinder driver will take the volume offline instead, and it eventually deletes the volume automatically once no clones remain.
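
For example, removing the original instance and its boot volume is just the normal workflow; the volume ID below is a placeholder for the first instance's boot volume:

# nova delete u1404
# cinder delete <parent-boot-volume-id>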

This feature in the Nimble Cinder driver is turned off by default. To enable this feature, insert the following line in the cinder.conf:

nimble_ito_enabled=true
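
Note that cinder-volume only reads cinder.conf at startup, so restart the volume service after adding the option. Exactly where the line belongs depends on your layout (the [DEFAULT] section, or the Nimble backend section if you use multiple backends), and the service name below is the Ubuntu one; adjust for your distro:

# service cinder-volume restart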

Friday, October 3, 2014

OpenStack Cinder "Copy Image to Volume"

Since the OpenStack Havana release, Cinder has enforced a set of minimum features to avoid confusion about which driver features are supported by which OpenStack release. One of the required features is "Copy Image to Volume". If your OpenStack instance is launched and booted from a volume, this is similar to AWS "Create Image (EBS AMI)" (ec2-create-image). This feature can be very useful when you'd like to modify a base image, create your own customized image, and upload it to Glance. Here's how you can achieve that through the OpenStack CLI:

1. Create a Cinder volume from a base image:


# glance image-list
# cinder create --image-id 2dfb6cd2-7a63-4c28-940e-b730969d3040 1
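
Step 2 below says "after modifying the base image"; to actually make those modifications you can boot a throwaway instance from the new volume, change it from inside the guest, then delete the instance (shutdown=preserve keeps the volume) so the volume returns to the available state before uploading. A rough sketch, reusing the block-device syntax from the earlier post; the instance name and flavor are arbitrary, and the volume ID is the one cinder create returned:

# nova boot customizer --flavor m1.small --block-device source=volume,dest=volume,id=d7d2abfa-d882-45a3-ac80-ddb24631f666,shutdown=preserve,bootindex=0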

2. After modifying the base image, create and upload the newly customized image to Glance:


# cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
|                  ID                  |   Status  | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| d7d2abfa-d882-45a3-ac80-ddb24631f666 | available |     None     |  1   |     None    |   true   |             |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+

NOTE: By default the image disk format will be "raw". If your image disk format is not "raw", specify the --disk-format option.


# cinder upload-to-image d7d2abfa-d882-45a3-ac80-ddb24631f666 myCirrosImage --disk-format qcow2
+---------------------+--------------------------------------+
|       Property      |                Value                 |
+---------------------+--------------------------------------+
|   container_format  |                 bare                 |
|     disk_format     |                qcow2                 |
| display_description |                 None                 |
|          id         | d7d2abfa-d882-45a3-ac80-ddb24631f666 |
|       image_id      | 04ab89ab-e4e7-4389-bd46-e9e6aaf6de39 |
|      image_name     |            myCirrosImage             |
|         size        |                  1                   |
|        status       |              uploading               |
|      updated_at     |      2014-10-03T20:31:21.000000      |
|     volume_type     |                 None                 |
+---------------------+--------------------------------------+

3. It may take a while before the image is uploaded to Glance completely.


# glance image-list
+--------------------------------------+---------------+-------------+------------------+----------+--------+
| ID                                   | Name          | Disk Format | Container Format | Size     | Status |
+--------------------------------------+---------------+-------------+------------------+----------+--------+
| 2dfb6cd2-7a63-4c28-940e-b730969d3040 | cirros        | qcow2       | bare             | 13147648 | active |
| 04ab89ab-e4e7-4389-bd46-e9e6aaf6de39 | myCirrosImage | qcow2       | bare             |          | queued |
+--------------------------------------+---------------+-------------+------------------+----------+--------+

# glance image-list
+--------------------------------------+---------------+-------------+------------------+----------+--------+
| ID                                   | Name          | Disk Format | Container Format | Size     | Status |
+--------------------------------------+---------------+-------------+------------------+----------+--------+
| 2dfb6cd2-7a63-4c28-940e-b730969d3040 | cirros        | qcow2       | bare             | 13147648 | active |
| 04ab89ab-e4e7-4389-bd46-e9e6aaf6de39 | myCirrosImage | qcow2       | bare             | 38005440 | active |
+--------------------------------------+---------------+-------------+------------------+----------+--------+

NOTE: By default the uploaded image is private. If you are an admin and would like to make the customized image public, use glance image-update:


# glance image-show 04ab89ab-e4e7-4389-bd46-e9e6aaf6de39 |grep is_public
| is_public        | False

# glance image-update 04ab89ab-e4e7-4389-bd46-e9e6aaf6de39 --is-public True

The default implementation of this feature is at: