Module ‘MonitorLoop’ power on failed error – unable to power on VM

Came across this amazingly obtuse error today while trying to deploy a new VM:
Module ‘MonitorLoop’ power on failed
The VM refused to power on. I was surprised to find that, despite the datastore having some free space, this error actually indicates there is not enough space to inflate the swap file. The .vswp file is 0 KB while the VM is powered off, and at power-on it has to grow to roughly the VM's configured memory size (less any memory reservation), which the datastore could not accommodate.
You can confirm this by opening the details of the task and digging into the error stack.

Move the VM to a datastore with more free space, or free up space on the current datastore, and the VM will power on successfully.
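If you want to confirm the space situation from the host itself, one quick way (assuming SSH access to the host) is to list the mounted filesystems, which shows the size and free bytes for each datastore:

esxcli storage filesystem list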

ESXi Hosts Losing Connectivity to VMFS Datastores

A customer's environment was losing access to its VMFS datastores on a regular basis.
Events in the log showed this:

Lost access to volume xxx due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Further info can be gleaned by checking the hostd log. SSH onto the host and tail it with the command below, which in this case showed constant disconnects and reconnects as they were written to the log:

tail -f /var/log/hostd.log | grep "'Vimsvc.ha-eventmgr'"

A look into vmkernel.log showed device resets, failed writes, and VMFS heartbeat timeouts, with locks on the volume being reclaimed:

2018-07-10T01:10:27.499Z cpu32:33604)NMP: nmp_ThrottleLogForDevice:3333: Cmd 0x2a (0x43be1b769a40, 38961) to dev "naa.60050768010000002000000000012345" on path "vmhba1:C0:T4:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE
2018-07-10T01:10:27.499Z cpu32:33604)ScsiDeviceIO: 2613: Cmd(0x43be1b769a40) 0x2a, CmdSN 0x8000005e from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
2018-07-10T01:10:27.499Z cpu32:33604)ScsiCore: 1609: Power-on Reset occurred on naa.60050768010000002000000000012345
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be15960ac0) 0x2a, CmdSN 0x8000000f from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be197f48c0) 0x2a, CmdSN 0xfffffa8002110f80 from world 40502 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be19c300c0) 0x2a, CmdSN 0xfffffa8002089940 from world 40502 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be1b4f6340) 0x2a, CmdSN 0x8000007d from world 38961 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:35.036Z cpu25:32848)ScsiDeviceIO: 2595: Cmd(0x43be1b75a300) 0x2a, CmdSN 0x1aa2c from world 32814 to dev "naa.60050768010000002000000000012345" failed H:0x8 D:0x0 P:0x0
2018-07-10T01:10:37.844Z cpu21:32874)HBX: 283: 'DS1234': HB at offset 4075520 - Reclaimed heartbeat [Timeout]:
2018-07-10T01:10:37.844Z cpu21:32874)  [HB state abcdef02 offset 4075520 gen 29 stampUS 3975149635 uuid 5b43f84b-2341c5c8-32b6-90e2baf4630c jrnl  drv 14.61 lockImpl 3]
2018-07-10T01:10:37.847Z cpu21:32874)FS3Misc: 1759: Long VMFS rsv time on 'DS1234' (held for 2714 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2018-07-10T01:12:12.584Z cpu8:38859)etherswitch: L2Sec_EnforcePortCompliance:152: client APP1421.eth0 requested promiscuous mode on port 0x6000006, disallowed by vswitch policy
2018-07-10T01:12:12.584Z cpu8:38859)etherswitch: L2Sec_EnforcePortCompliance:152: client APP1421.eth0 requested promiscuous mode on port 0x6000006, disallowed by vswitch policy
2018-07-10T01:24:08.185Z cpu39:35449 opID=9b64b9f9)World: 15554: VC opID 8de99fb1-d856-4a71-ab08-5501dfffc500-7011-ngc-d5-67-78e8 maps to vmkernel opID 9b64b9f9
2018-07-10T01:24:08.185Z cpu39:35449 opID=9b64b9f9)DLX: 3876: vol 'DS1234', lock at 63432704: [Req mode 1] Checking livenes
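The 'Reclaimed heartbeat [Timeout]' entries were the telling part. To gauge how often the heartbeats were being lost and reclaimed, a quick grep of vmkernel.log (assuming the default log location) does the job:

grep "Reclaimed heartbeat" /var/log/vmkernel.log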

Solution

It turned out that this occurred when some new hosts were added to the cluster without ATS heartbeating disabled. The existing hosts had ATS disabled due to a prior storage incompatibility (since resolved, but the hosts never had it re-enabled), while the host profiles for the new hosts left ATS enabled because the storage no longer suffered the incompatibility. The result was a cluster with a mix of ATS and traditional SCSI heartbeating against the same datastores, which is what triggered the lost-access events.
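To see which mode a host is currently using, you can query the advanced setting on each host (the Int Value field shows 1 for ATS heartbeating, 0 for traditional SCSI reservation heartbeating):

esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5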
In this situation, we could enable ATS on the old hosts now that the storage supported it:

esxcli system settings advanced set -i 1 -o /VMFS3/UseATSForHBOnVMFS5
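Before re-enabling ATS, it is also worth confirming that the device genuinely reports ATS support now; the VAAI status of the example device from the logs above can be checked with:

esxcli storage core device vaai status get -d naa.60050768010000002000000000012345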

Or, if preferred, disable ATS on the new hosts to match the setting on the old ones:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

 
Just ensure all hosts in the cluster are using the same ATS setting (the list command above is a quick way to verify each one)!
Further reading: KB2113956