import stack: VMware

Thursday, October 23, 2014

Datastore inactive and device cannot be brought online

If your shared storage or volumes go offline while VMs are running, and the datastores stay inactive even after you recover the volumes and rescan the adapter, you may see the following in /var/log/vmkernel.log:

ScsiDevice: 5192: eui.3f7999dba06450376c9ce9006ec1b4eb device :Open count > 0, cannot be brought online

This may indicate that a virtual machine is stuck, or more specifically that the world (process) for the virtual machine's vCPU is still holding on to the device. Since the datastore and the backing storage device/volume were not unmounted and detached properly, the device cannot be brought online again after recovery, at least not until the stuck VM is killed. Here are the steps to do that:

1. Identify the inactive datastore and the device serial number behind it (it will look like the one shown in vmkernel.log)
2. Kill all worlds (by ID) that hold the device open on the ESX host. Here's a sample script:

DEVICE_SERIAL=eui.3f7999dba06450376c9ce9006ec1b4eb
# List the worlds holding the device open: take the second column, skip the two header lines
for i in $(esxcli storage core device world list -d "$DEVICE_SERIAL" | awk '{print $2}' | tail -n +3)
do
    # Capture stderr as well so the error text can be inspected below
    out=$(esxcli vm process kill --type=force --world-id="$i" 2>&1)
    rc=$?
    if [ $rc -eq 0 ]; then
        echo "killed world id=$i successfully"
    elif echo "$out" | grep "Unable to find a virtual machine with the world ID" >/dev/null; then
        # world id does not belong to a VM, which is OK
        :
    else
        echo "ERROR: $out"
    fi
done

3. Rescan the adapter. Once the rescan completes, the device and datastore should come back to normal.
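For step 1, the device serial number can be pulled straight out of the log instead of copied by hand. This is only a minimal sketch: the function name stuck_devices is mine, and it assumes the log lines keep the same "ScsiDevice: <number>: <device> device" layout as the sample line above.

```shell
# Print the unique device IDs from vmkernel log lines that report
# "cannot be brought online". Reads the log text on stdin.
# ASSUMPTION: lines follow the "ScsiDevice: <n>: <device> device ..." layout
# shown in the sample log line; stuck_devices is a hypothetical helper name.
stuck_devices() {
    grep 'cannot be brought online' |
        sed -n 's/.*ScsiDevice: [0-9]*: \([^ ]*\) device.*/\1/p' |
        sort -u
}

# Usage on the host:
#   stuck_devices < /var/log/vmkernel.log
```

Each ID it prints can then be used as DEVICE_SERIAL in the script from step 2.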

Wednesday, October 22, 2014

Purge vCenter database

In my dev lab, vCenter is heavily used by developers to build software, test, and run automation all the time. By default, the vCenter database keeps 180 days' worth of events and tasks. With that much activity in a small database (10 GB by default), it fills up within a few months. The symptom is that vCenter operations start but then fail in Tasks & Events, or the vCenter client keeps getting disconnected from vCenter.

If vCenter is installed on a Windows server, a related MSSQL event also shows up in Event Viewer / Windows Logs / Application.

To purge the vCenter database (SQL), open the vCenter database (VIM_VCDB) in SQL Management Studio. If you inspect the database properties, Space Available may have only a few MB left. Since it's a dev lab, there's no need to keep 180 days' worth of events and tasks. I shrank the retention down to 30 days by doing the following:

1. Go to VIM_VCDB > Tables
2. Right-click the dbo.VPX_PARAMETER table and select Edit Top 200 Rows
3. Modify event.maxAgeEnabled to true
4. Modify event.maxAge to 30
5. Modify task.maxAgeEnabled to true
6. Modify task.maxAge to 30
7. Go to VIM_VCDB > Programmability > Stored Procedures, right-click dbo.cleanup_events_tasks_proc and select Execute Stored Procedure
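If you'd rather not click through the table editor, the same changes can be written as SQL statements and pasted into a query window against VIM_VCDB. Below is a rough sketch of a small shell helper that prints those statements for a given number of days. The helper name retention_sql is made up, and the NAME and VALUE column names are my assumption about how dbo.VPX_PARAMETER is laid out, so check them against your own table first.

```shell
# Print T-SQL that sets event/task retention to N days and runs the cleanup.
# ASSUMPTION: dbo.VPX_PARAMETER keeps settings in NAME and VALUE columns;
# verify this in your own database before running the output.
# retention_sql is a hypothetical helper name.
retention_sql() {
    days=$1
    # Reject empty or non-numeric input
    case $days in
        ''|*[!0-9]*) echo "usage: retention_sql DAYS" >&2; return 1 ;;
    esac
    for key in event task; do
        echo "UPDATE dbo.VPX_PARAMETER SET VALUE = 'true' WHERE NAME = '$key.maxAgeEnabled';"
        echo "UPDATE dbo.VPX_PARAMETER SET VALUE = '$days' WHERE NAME = '$key.maxAge';"
    done
    # Stored procedure named in the steps above
    echo "EXEC dbo.cleanup_events_tasks_proc;"
}

# Usage: print the statements for a 30-day retention
#   retention_sql 30
```

Printing the statements rather than executing them leaves a chance to review them before anything touches the database.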

In my environment, it took about 40 minutes to free up 4.6 GB (150 days' worth of tasks and events) in the database; your timing will depend on your environment. After the cleanup, the database properties show more Space Available.

After freeing up the vCenter database, all vCenter operations are back to normal. The VMware KB article has more details.