Consolidating disks – unable to access file since it is locked

Had a VM which was flagged as requiring disk consolidation.
Attempting consolidation failed with the error ‘Unable to access file since it is locked’.

The error stack showed the following: msg.fileio.lock

Solution:

Storage vMotion the disk to another datastore and reattempt consolidation; the move clears the locks and consolidation completes. A nice quick one, although not obvious.
Alternatively, you could find the host that holds the lock on the file and restart hostd on it, but depending on the environment the Storage vMotion method can be a lot faster.
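
If you prefer to script it, the Storage vMotion is a one-liner in PowerCLI. A minimal sketch, assuming a connection to vCenter and hypothetical names ‘VM01’ and ‘Datastore02’:

# Relocate the VM's files to another datastore, releasing the stale locks
# (Move-HardDisk can be used instead to move just the affected disk)
Get-VM -Name 'VM01' | Move-VM -Datastore (Get-Datastore -Name 'Datastore02')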

Enable CDP advertising – help the network team help themselves!

A customer of mine requested help in documenting which switch ports were connected to ESXi hosts. Rather than simply documenting this, which may go out of date if not maintained, I suggested we enable CDP advertising at the vSwitch level so the network team can obtain this information themselves on an ongoing basis.

By default, vSwitches come with CDP enabled in listen mode only: they can detect information about the physical switches they are connected to, but do not relay information about themselves back to those switches.

Method

To configure advertising on a standard vSwitch, you SSH onto the host and run the following, changing the vSwitch name for the relevant one:

# esxcli network vswitch standard set -v vSwitch0 -c both

If running distributed switches, you can do this in the GUI of the web console. Select your distributed vSwitch and select Manage > Settings > Properties and click Edit.

Under Discovery Protocol, change Operation to Both; the switch will then both listen for CDP information from the physical switch and advertise its own CDP information.
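
The distributed switch change can also be scripted with PowerCLI. A sketch, assuming a distributed switch named ‘dvSwitch01’ (hypothetical):

# Enable CDP in both listen and advertise mode on the distributed switch
Get-VDSwitch -Name 'dvSwitch01' | Set-VDSwitch -LinkDiscoveryProtocol CDP -LinkDiscoveryProtocolOperation Both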

vCenter PSC converge tool now available for vCenter 6.5 with U2d

VMware is spreading some holiday cheer in the form of the latest update for vCenter 6.5 – Update 2d. Buried in the release notes is:

  • vCenter Server 6.5 Update 2d adds a CLI tool to convert instances of vCenter Server Appliance with an external Platform Services Controller into vCenter Server Appliance with an embedded Platform Services Controller connected in Embedded Linked Mode. For more information, see the 6.7 vCenter Server Installation and Setup guide.
  • With vCenter Server 6.5 Update 2d, you can add VMware Platform Services Controller appliances to Active Directory 2016 domains.
  • With vCenter Server 6.5 Update 2d, you can use the new vRealize Operations Manager plug-in that provides specific metrics and high-level information about data centers, datastores, virtual machines, and ESXi hosts, to vCenter Server and vSAN. The plug-in is supported only in the vSphere Client.
  • With vCenter Server 6.5 Update 2d, the new vRealize Operations Manager plug-in adds by default the Patch method, supported by the HTTP protocol, to facilitate the online installation stage.
  • With vCenter Server 6.5 Update 2d, you can configure the property config.vpxd.macAllocScheme.method in the vCenter Server configuration file, vpxd.cfg, to allow sequential selection of MAC addresses from MAC address pools. The default option for random selection does not change. Modifying the MAC address allocation policy does not affect MAC addresses for existing virtual machines.

This feature was an eagerly awaited core feature of vCenter 6.7 U1 and has now been back-ported to 6.5.
While vCenter 6.5 Update 2 included support for Enhanced Linked Mode using embedded PSCs, this was only good for new deployments, leaving everyone with existing installations stuck with external PSCs. This feature allows us to migrate from a setup with external PSCs to embedded PSCs in a few easy steps.

Procedure

  1. Edit the converge.json and decommission_psc.json templates to include information about the managing vCenter Server Appliance. See Preparing JSON Configuration Files for Reconfiguring External to Embedded Nodes for information on preparing the converge.json template.
  2. Confirm the appliances have been backed up.
  3. Accept the thumbprint.
  4. Run the vcsa-util converge converge.json command on a client machine running Windows, Linux, or Mac OS to begin the convergence process and install and configure the new embedded Platform Services Controller. See Syntax of the Converge Command for a list of available arguments for the vcsa-util converge command, and the example invocation below.
  5. Log into the vCenter Server Appliance Management Interface (https://appliance-IP-address-or-FQDN:5480) and confirm it is now a vCenter Server with an embedded Platform Services Controller.
  6. Reconfigure any products that use the external PSC, such as the vRealize suite, NSX Manager, etc., to use the new embedded Platform Services Controller.
  7. Run the vcsa-util decommission decommission_psc.json command to decommission the original Platform Services Controller. This operation removes the external Platform Services Controller from the SSO domain.
  8. Shut down and delete the old PSC VMs.
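
For reference, the converge step looks something like the following. Treat this as illustrative – check Syntax of the Converge Command for the exact flags supported by your version:

vcsa-util converge --no-ssl-certificate-verification --backup-taken converge.json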

Worth noting:

  • You still need to update all nodes, including existing external PSCs, to 6.5 U2d before running this tool
  • This tool is only for vCenter Server Appliance (VCSA) deployments, not Windows. If you have not already migrated from Windows to the VCSA, take this as a sign to do so sooner rather than later.
  • The external PSC configuration is being deprecated by VMware, so it is worth taking the time to migrate

Links:

vCenter 6.5 Release Notes at vmware.com
vCenter converge process at vmware.com

vSphere 6.7 U1 GA released! Release notes & downloads

VMware have finally bestowed upon us their latest release of vSphere 6.7 – Update 1!
This brings with it some rather welcome new features and quality-of-life tweaks:

  • Migrate vCenter with Embedded PSC *between* vSphere domains, retaining data such as tags & licences.
  • vCenter can provide relevant links to KB articles
  • Burst filter to protect vCenter from identical alert flooding
  • HTML5 client now fully featured including new simplified workflows for VCHA
  • vCenter converge tool to migrate from external PSCs to easier-to-manage embedded PSCs
  • Provides upgrade path from 6.5 U2 to 6.7

And many more
vCenter Server 6.7 U1 :
Release Notes
Download
ESXi 6.7 U1:
Release Notes
Download
PowerCLI 11:
What’s New
Download

vCheck – Could not establish trust relationship for the SSL/TLS secure channel

Firstly, if you are not using vCheck in your vSphere environment, hop on over to Alan Renouf’s blog, where you will find vCheck – a community-driven, free set of PowerCLI vSphere environment checks that generates a nice report (now with the new Clarity HTML5 client theme!) to be emailed out to you before you reach the office each morning. It is not a replacement for proper monitoring, but it provides a level of insight you won’t get elsewhere, and I swear by it to get a heads-up on potential issues before they have a chance to become catastrophes.
I have recently found that when running vCheck, several checks would return the error:

The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.

After much troubleshooting I discovered this was down to certificate trust.  In order to resolve this issue:

Download the vCenter certificates

Browse to your vCenter address over HTTPS (https://vcenter.domain.com) and, on the bottom right, click Download Trusted Root CA Certificates
 
You’ll get a .zip file with the certificates. Unzip it and, if on Windows, browse to /certs/win and grab the CRT that has a corresponding CRL.
Import this certificate via whatever method is used on your OS of choice. On Windows, for example: Internet Options > Content > Certificates > Trusted Root Certification Authorities.
Click Import; the CA should be named CA. If it is named ssoserver, you’ve grabbed the wrong certificate.
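Alternatively, on Windows the import can be scripted from an elevated PowerShell prompt. A sketch, assuming the unzipped certificate is at C:\certs\win\cacert.crt (hypothetical path – use the CRT you identified above):

# Import the vCenter root CA into the machine's Trusted Root store
Import-Certificate -FilePath 'C:\certs\win\cacert.crt' -CertStoreLocation 'Cert:\LocalMachine\Root'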
Once imported, vCheck should now be running without errors!

Poor performance of highly specced VMs – vNUMA!

The scenario:

Database server with 10 vCPUs and 192 GB RAM
Physical host with 2 sockets of 10 cores, 256 GB RAM total
The customer was reporting poor performance of their database server, and an initial browse of the configuration seemed to show it had been configured well: multiple paravirtual SCSI adapters, enough CPU/RAM for the workload, storage performing well, etc. Yet CPU usage was almost constantly at 100%, and adding more cores did not help.
Eventually I spotted part of the problem – CPU Hot Add was enabled. While this is a useful feature for smaller VMs, allowing them to start small and grow as the workload grows, for VMs with more than 8 vCPUs it disables vNUMA and can lead to poor performance.

Why this is a problem

vNUMA is a technology introduced in vSphere 5.0 and improved upon in 6.5. It presents the physical NUMA layout to the guest OS, which means the OS can place data in memory on its local physical NUMA node – faster than using memory attached to the physical processor the VM is not scheduled on.

The Catch

vNUMA sizes its topology based on vCPU count only, not memory. This is fine if the VM uses less memory than is attached to a single processor in your host, as the vNUMA presentation will still be correct.
So on our example host above, a 10 vCPU VM with 128 GB RAM would be presented a single vNUMA node, as it fits within a single physical NUMA node. If we increased that to 12 vCPUs, it would present 2 vNUMA nodes, as the vCPU count exceeds the number of cores on one physical processor. As long as the memory stays within the bounds of the physical NUMA node on one processor, we do not have to worry.
However, our database server has 10 vCPUs and 192 GB RAM. Simply disabling CPU Hot Add and letting vNUMA take over will present 1 vNUMA node on the basis of the VM having 10 vCPUs. Because vNUMA has not taken memory into account, the VM has more memory than exists in a physical NUMA node (128 GB), and a single vNUMA node performs poorly here because the VM’s memory crosses two physical NUMA nodes.

The Solution

In this instance the solution is simple: configure the vCPU socket layout to match your physical CPU socket layout. Here, we configure 2 sockets of 5 vCPUs, which presents 2 NUMA nodes to the guest OS.
With this configuration, thanks to more optimised memory placement and using the same available resources, CPU usage dropped from a constant 100% to around 70-80%. A nice saving without having to allocate more!
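
Both changes can be made with PowerCLI while the VM is powered off. A sketch, assuming a hypothetical VM name ‘DB01’:

# Disable CPU Hot Add and lay out the 10 vCPUs as 2 sockets x 5 cores
$vm = Get-VM -Name 'DB01'
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.CpuHotAddEnabled = $false
$spec.NumCoresPerSocket = 5   # 10 vCPUs / 5 cores per socket = 2 sockets
$vm.ExtensionData.ReconfigVM_Task($spec)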

Takeaways:

  • On large (>8 vCPU) VMs – don’t enable CPU Hot Add
  • On VMs with more memory than a single NUMA node, or more vCPU cores than on a single processor – manually set your vCPU socket configuration to match the number of sockets in your host system.

CPU incompatible adding host to cluster after patching for Spectre/Meltdown

Interesting problem here. Customer was adding some new hosts to an existing cluster, but got this error:

Move host into cluster
The host’s CPU hardware should support the cluster’s current Enhanced vMotion Compatibility mode, but some of the necessary CPU features are missing from the host. Check the host’s BIOS configuration to ensure that no necessary features are disabled (such as XD, VT, AES, or PCLMULQDQ for Intel, or NX for AMD).

Usually, this would be due to different CPU hardware, or CPU features in the UEFI not being enabled to match the existing hosts.
In this case, the hardware and UEFI settings were the same. It was discovered that, as part of QA testing, the new hosts had been updated to the current patch level, which includes CPU microcode updates for Spectre/Meltdown.
This changes the available CPU features and causes a problem. While hosts with differing patch levels can coexist within the same cluster for the purposes of a rolling upgrade (and vCenter will only enable the fixes once all hosts have been updated), you cannot add NEW hosts carrying this microcode to a cluster until the existing hosts have been updated with it.

SOLUTION

In this instance the solution was simple: use the host rollback option to revert to the previous build level, which matched the other hosts in the cluster and did not display differing CPU features due to the Spectre/Meltdown microcode.
Reboot the host, and at the ESXi boot screen, press SHIFT+R
You will be presented with this warning:

Current hypervisor will permanently be replaced
with build: X.X.X-XXXXXX. Are you sure? [y/n]

Press Y to revert to the previous build.
You can read more about this process in VMware KB 1033604
 
Alternatively, you can fully patch the cluster before adding in the new hosts.
 

ESXi 6.0 hosts ‘No host data available’

For months now, many vSphere 6.0 users have had no hardware info populated from their ESXi 6.0 hosts.
The good news is this now has a patch. ESXi-6.0.0-20180704001-standard contains the following little nugget of goodness in the patch notes:

  • System identification information consists of asset tags, service tags, and OEM strings. In earlier releases, this information comes from the Common Information Model (CIM) service, but in ESXi600-201807401, it comes directly from the SMBIOS.
If you have been suffering this bug for many months, this fix should be included in your next patching round.
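
If you patch hosts directly with esxcli rather than through Update Manager, applying the image profile looks something like this (illustrative – put the host in maintenance mode first, and this assumes access to the VMware online depot):

# esxcli software profile update -p ESXi-6.0.0-20180704001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml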
 
Drop a comment if this resolves your issue

PSA: vSphere 5.5 end of life imminent!


Quick reminder that vSphere 5.5 reaches end-of-life this month – September 19th.
That old workhorse of a platform that many customers are still running for a plethora of reasons is finally reaching end of life. That means no more updates and no more support from VMware: you will be limited to the self-help portal, and VMware will not offer new hardware support, server/client/guest OS updates, new security patches, or bug fixes.
 

What are my options?

Quite simple here: you need to upgrade. There is no direct path from 5.5 to the latest and greatest vSphere 6.7, so you will need to upgrade to either 6.0 or 6.5 first.
Even if your 5.5-era hardware will not support vSphere 6.5, I would recommend you at least update vCenter to 6.5, even if you can only uplift your old hosts to ESXi 6.0 – not least so you have access to the beautiful vCenter HTML5 client. Uplevel vCenter is compatible with downlevel ESXi within the supported version matrix.
Further reading: VMware Upgrade Center

ESXi Hosts Losing Connectivity to VMFS Datastores

A customer’s environment was losing access to VMFS datastores on a regular basis.
Events in the log showed this:

Lost access to volume xxx due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Further info can be gleaned by checking the hostd log. SSH onto the host and use the following command; in this case it showed constant connects and disconnects as they were written to the log:

tail -f /var/log/hostd.log | grep "'Vimsvc.ha-eventmgr'"

A look into the vmkernel.log showed that locks were being generated:
 

2018-07-10T01:10:27.499Z cpu32:33604)NMP: nmp_ThrottleLogForDevice:3333: Cmd 0x2a (0x43be1b769a40, 38961) to dev "naa.60050768010000002000000000012345" on path "vmhba1:C0:T4:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE
2018-07-10T01:10:27.499Z cpu32:33604)ScsiDeviceIO: 2613: Cmd(0x43be1b769a40) 0x2a, CmdSN 0x8000005e from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
2018-07-10T01:10:27.499Z cpu32:33604)ScsiCore: 1609: Power-on Reset occurred on naa.60050768010000002000000000012345
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be15960ac0) 0x2a, CmdSN 0x8000000f from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be197f48c0) 0x2a, CmdSN 0xfffffa8002110f80 from world 40502 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be19c300c0) 0x2a, CmdSN 0xfffffa8002089940 from world 40502 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be1b4f6340) 0x2a, CmdSN 0x8000007d from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be1b75a300) 0x2a, CmdSN 0x1aa2c from world 32814 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:37.844Z cpu21:32874)HBX: 283: 'DS1234': HB at offset 4075520 - Reclaimed heartbeat [Timeout]:
2018-07-10T01:10:37.844Z cpu21:32874)  [HB state abcdef02 offset 4075520 gen 29 stampUS 3975149635 uuid 5b43f84b-2341c5c8-32b6-90e2baf4630c jrnl  drv 14.61 lockImpl 3]
2018-07-10T01:10:37.847Z cpu21:32874)FS3Misc: 1759: Long VMFS rsv time on 'DS1234' (held for 2714 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2018-07-10T01:12:12.584Z cpu8:38859)etherswitch: L2Sec_EnforcePortCompliance:152: client APP1421.eth0 requested promiscuous mode on port 0x6000006, disallowed by vswitch policy
2018-07-10T01:12:12.584Z cpu8:38859)etherswitch: L2Sec_EnforcePortCompliance:152: client APP1421.eth0 requested promiscuous mode on port 0x6000006, disallowed by vswitch policy
2018-07-10T01:24:08.185Z cpu39:35449 opID=9b64b9f9)World: 15554: VC opID 8de99fb1-d856-4a71-ab08-5501dfffc500-7011-ngc-d5-67-78e8 maps to vmkernel opID 9b64b9f9
2018-07-10T01:24:08.185Z cpu39:35449 opID=9b64b9f9)DLX: 3876: vol 'DS1234', lock at 63432704: [Req mode 1] Checking livenes

Solution

It turned out that this occurred when some new hosts were added to the cluster without ATS disabled. The existing hosts had ATS disabled due to a prior storage incompatibility (since resolved, but the setting was never re-enabled), while the host profiles for the new hosts did not have ATS disabled, as the storage no longer suffered the incompatibility. The cluster was therefore running mixed ATS settings against the same datastores.
In this situation, we could enable ATS on the old hosts now that the storage supported it:

# esxcli system settings advanced set -i 1 -o /VMFS3/UseATSForHBOnVMFS5

Or, if preferred, disable ATS on the new hosts to match the settings on the old ones:

# esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

 
Just ensure all hosts in the cluster are using the same ATS settings!
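
To quickly verify the setting across a cluster, a PowerCLI sketch (assuming a hypothetical cluster name ‘Cluster01’):

# Report the ATS heartbeat setting for every host in the cluster
Get-Cluster -Name 'Cluster01' | Get-VMHost |
    Get-AdvancedSetting -Name 'VMFS3.UseATSForHBOnVMFS5' |
    Select-Object @{N='Host';E={$_.Entity.Name}}, Value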
Further reading: KB2113956