ESXi Hosts Losing Connectivity to VMFS Datastores

A customer's environment was regularly losing access to its VMFS datastores.

Events in the log showed this:

Lost access to volume xxx due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

More detail is available in the hostd log. SSH to the host and run the command below. In this case it showed a constant stream of connects and disconnects as they were written to the log.

tail -f /var/log/hostd.log | grep "'Vimsvc.ha-eventmgr'"

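A look into vmkernel.log showed failed SCSI commands against the device, followed by a heartbeat timeout, a long VMFS reservation time, and lock checks on the datastore: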

2018-07-10T01:10:27.499Z cpu32:33604)NMP: nmp_ThrottleLogForDevice:3333: Cmd 0x2a (0x43be1b769a40, 38961) to dev "naa.60050768010000002000000000012345" on path "vmhba1:C0:T4:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE
2018-07-10T01:10:27.499Z cpu32:33604)ScsiDeviceIO: 2613: Cmd(0x43be1b769a40) 0x2a, CmdSN 0x8000005e from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
2018-07-10T01:10:27.499Z cpu32:33604)ScsiCore: 1609: Power-on Reset occurred on naa.60050768010000002000000000012345
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be15960ac0) 0x2a, CmdSN 0x8000000f from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be197f48c0) 0x2a, CmdSN 0xfffffa8002110f80 from world 40502 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be19c300c0) 0x2a, CmdSN 0xfffffa8002089940 from world 40502 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be1b4f6340) 0x2a, CmdSN 0x8000007d from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be1b75a300) 0x2a, CmdSN 0x1aa2c from world 32814 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:37.844Z cpu21:32874)HBX: 283: 'DS1234': HB at offset 4075520 - Reclaimed heartbeat [Timeout]:
2018-07-10T01:10:37.844Z cpu21:32874)  [HB state abcdef02 offset 4075520 gen 29 stampUS 3975149635 uuid 5b43f84b-2341c5c8-32b6-90e2baf4630c jrnl  drv 14.61 lockImpl 3]
2018-07-10T01:10:37.847Z cpu21:32874)FS3Misc: 1759: Long VMFS rsv time on 'DS1234' (held for 2714 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2018-07-10T01:12:12.584Z cpu8:38859)etherswitch: L2Sec_EnforcePortCompliance:152: client APP1421.eth0 requested promiscuous mode on port 0x6000006, disallowed by vswitch policy
2018-07-10T01:12:12.584Z cpu8:38859)etherswitch: L2Sec_EnforcePortCompliance:152: client APP1421.eth0 requested promiscuous mode on port 0x6000006, disallowed by vswitch policy
2018-07-10T01:24:08.185Z cpu39:35449 opID=9b64b9f9)World: 15554: VC opID 8de99fb1-d856-4a71-ab08-5501dfffc500-7011-ngc-d5-67-78e8 maps to vmkernel opID 9b64b9f9
2018-07-10T01:24:08.185Z cpu39:35449 opID=9b64b9f9)DLX: 3876: vol 'DS1234', lock at 63432704: [Req mode 1] Checking livenes
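
To pull only the storage-related lines out of a busy vmkernel.log, a small filter can help. The sketch below is an illustration only: the function name vmfs_hb_events is made up for this post, and the patterns are simply the message strings seen in the excerpt above, so other ESXi builds may word them differently.

```shell
# Filter a vmkernel log (read from stdin) down to the device reset,
# command failure, heartbeat and reservation messages seen in the excerpt.
# vmfs_hb_events is a hypothetical helper name, not an ESXi command.
vmfs_hb_events() {
    grep -E 'Power-on Reset occurred|failed H:0x8|Reclaimed heartbeat|Long VMFS rsv time'
}

# Example usage on a host:
#   vmfs_hb_events < /var/log/vmkernel.log
```

Reading from stdin keeps the helper usable both on the host and against a log bundle copied elsewhere.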

Solution

It turned out that this occurred after some new hosts, which did not have ATS disabled, were added to the cluster. The existing hosts had ATS disabled because of a prior storage incompatibility (since resolved, but ATS was never re-enabled on them). The host profiles for the new hosts left ATS enabled because the storage no longer suffered from the incompatibility.

In this situation, we could enable ATS on the old hosts now that the storage supported it:

esxcli system settings advanced set -i 1 -o /VMFS3/UseATSForHBOnVMFS5

Or, if preferred, disable ATS on the new hosts to match the settings on the old ones:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5


Just ensure all hosts in the cluster are using the same ATS settings!
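
One way to check this is to read the current value on each host and compare the results. The following is a rough sketch rather than a tested procedure: it assumes the output of "esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5" contains an "Int Value:" line, and the helper names ats_hb_value and ats_hb_consistent, along with the host names esx01 and esx02, are placeholders.

```shell
# Print the current integer value from the output of
#   esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
# The esxcli output is read from stdin. Assumes an "Int Value:" line exists.
ats_hb_value() {
    awk -F': ' '/^[[:space:]]*Int Value:/ { print $2; exit }'
}

# Read "host value" pairs from stdin, print each one, and report whether
# every host has the same value. Exit status is 0 on a match, 1 otherwise.
ats_hb_consistent() {
    awk '
        NF >= 2 { seen[$2]++; print $1 ": UseATSForHBOnVMFS5=" $2 }
        END {
            n = 0
            for (v in seen) n++
            if (n > 1) {
                print "MISMATCH: hosts use different ATS heartbeat settings"
                exit 1
            }
            print "OK: all hosts match"
        }'
}

# Example usage (esx01 and esx02 are placeholder host names):
#   for h in esx01 esx02; do
#       v=$(ssh root@"$h" esxcli system settings advanced list \
#             -o /VMFS3/UseATSForHBOnVMFS5 | ats_hb_value)
#       echo "$h $v"
#   done | ats_hb_consistent
```

Keeping the parsing separate from the SSH loop means the comparison can also be run against values collected some other way, such as from a host profile report.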

Further reading: KB2113956
