I finally got round to looking at a problem which has existed in one of my datacenters for, I guess, a few years now. It has really bothered me for a long time but I’ve just been tied up with other projects, so it remained unresolved. I have two datacenters which are more or less identical in their setup. In one of them (DC1), however, my Hewlett Packard Enterprise (HPE) BL460c blade servers running VMware ESXi took literally hours to reboot or to start up and shut down. As you can imagine this isn’t exactly ideal and I was most frustrated. The strange thing was that one of these blades would boot normally if moved to the other datacenter. When I looked into the problem, the only difference I could find between the blade chassis was that we had HPE Virtual Connect (VC) modules in DC2 and Cisco Nexus B22 fabric extender modules in DC1. The switch modules in both datacenters connect back into Cisco Nexus 5596UP switches.
Having updated the BIOS, firmware and drivers, along with installing the latest updates for ESXi 5.5, I was ready to start my troubleshooting. I started by taking screenshots and log files of everything – host setup, SAN, network and so on – so I had all the information required in case I needed to open a case with VMware or HPE. I had previously opened cases and neither vendor could provide an answer to this problem. At this point I had a stack of information, and I had also done a test reboot and gathered logs again afterwards with the host in a bad state (datastores missing, path selection policy reverted to Most Recently Used, or MRU).
Below are some example screenshots taken from the HPE iLO connection during a reboot –
It was now time for some lunch, so I took a break and did my usual when facing a technical problem: I just read the daily news and let my mind do its thing. A thought popped into my head that perhaps the fibre channel (FC) connection was being killed during the shutdown of the host and not coming back up in time during startup. I jumped onto one of the Cisco Nexus 5596UP switches which provides access to the SAN along with the Ethernet network.
Let us take a look at the Cisco Nexus 5596UP fibre channel configuration –
BSA_019_GLDP_CAB-B2_N5K_1# sh int vfc1208
vfc1208 is trunking
    Bound interface is port-channel1208
    Port description is *** BSA-GPESXI03 ***
    Hardware is Ethernet
    Port WWN is 24:b7:00:2a:6a:1c:48:3f
    Admin port mode is F, trunk mode is on
    snmp link state traps are enabled
    Port mode is TF
    Port vsan is 400
    Trunk vsans (admin allowed and active) (400)
    Trunk vsans (up)                       (400)
    Trunk vsans (isolated)                 ()
    Trunk vsans (initializing)             ()
    1 minute input rate 38384 bits/sec, 4798 bytes/sec, 21 frames/sec
    1 minute output rate 22830040 bits/sec, 2853755 bytes/sec, 1376 frames/sec
      44924214793 frames input, 0 bytes
        0 discards, 0 errors
      128537040978 frames output, 0 bytes
        0 discards, 0 errors
    last clearing of "show interface" counters never
    Interface last changed at Fri Apr 1 16:20:48 2016
We can see the FC connection is bound to the port channel (‘Bound interface is port-channel1208’) and I figured this was the problem. I think what happens is that the port channel drops during the reboot, which results in ESXi losing sight of the storage and then going through the All Paths Down (APD)/Permanent Device Loss (PDL) process. During boot the port channel hasn’t come up in time, so ESXi again tries its connections, but the FC isn’t up so it fails.
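If you want to confirm this theory from the ESXi side, the vmkernel log and esxcli give a reasonable picture of the path state after a reboot. The commands below are just a sketch of the sort of checks I mean rather than a capture from this host:

# Look for All Paths Down / Permanent Device Loss messages from the reboot window
grep -iE "APD|permanent device loss" /var/log/vmkernel.log

# List devices with their status, and the paths behind them
esxcli storage core device list | grep -E "Display Name|Status"
esxcli storage core path list | grep -E "Device|State"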
With that in mind we changed the port configuration for this test host: we bound the FC connections to the individual interfaces rather than the port channel.
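For reference, the change on the Nexus side was along these lines (interface numbers taken from the output above; treat this as a sketch of the rebinding rather than a change record, and do it in a maintenance window as the host loses its FC path while the vfc is rebound):

BSA_019_GLDP_CAB-B2_N5K_1# configure terminal
BSA_019_GLDP_CAB-B2_N5K_1(config)# interface vfc1208
BSA_019_GLDP_CAB-B2_N5K_1(config-if)# shutdown
BSA_019_GLDP_CAB-B2_N5K_1(config-if)# no bind interface port-channel1208
BSA_019_GLDP_CAB-B2_N5K_1(config-if)# bind interface Ethernet125/1/6
BSA_019_GLDP_CAB-B2_N5K_1(config-if)# no shutdown
BSA_019_GLDP_CAB-B2_N5K_1(config-if)# end
BSA_019_GLDP_CAB-B2_N5K_1# copy running-config startup-config

The vfc interface then shows the new binding: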
BSA_019_GLDP_CAB-B2_N5K_1# sh int vfc1208
vfc1208 is trunking
    Bound interface is Ethernet125/1/6
    Port description is *** BSA-GPESXI03 ***
    Hardware is Ethernet
    Port WWN is 24:b7:00:2a:6a:1c:48:3f
    Admin port mode is F, trunk mode is on
    snmp link state traps are enabled
    Port mode is TF
    Port vsan is 400
    Trunk vsans (admin allowed and active) (400)
    Trunk vsans (up)                       (400)
    Trunk vsans (isolated)                 ()
    Trunk vsans (initializing)             ()
    1 minute input rate 38384 bits/sec, 4798 bytes/sec, 21 frames/sec
    1 minute output rate 22830040 bits/sec, 2853755 bytes/sec, 1376 frames/sec
      44924214793 frames input, 0 bytes
        0 discards, 0 errors
      128537040978 frames output, 0 bytes
        0 discards, 0 errors
    last clearing of "show interface" counters Fri Apr 29 15:27:24 2016
    Interface last changed at Fri Apr 29 15:54:09 2016
Once the change was made I kicked off a reboot and practically leapt out of my seat with joy when it turned out I was right!
Now when the blade server reboots it does so within minutes and mounts all of the 3PAR datastores without issue, from both the local and remote 3PAR (we’re using Peer Persistence). Importantly, each datastore also preserved the Round Robin Path Selection Policy (PSP) rather than defaulting back to Most Recently Used; Round Robin is the PSP you need to be running when using a 3PAR.
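If you do find devices that have fallen back to MRU, they can be checked and flipped back to Round Robin with esxcli. The device ID below is just a placeholder, and the claim rule shown is the commonly published 3PAR/ALUA one, so check HPE’s current best-practice documentation before applying it to your own hosts:

# Show each device with its current path selection policy
esxcli storage nmp device list | grep -E "Device Display Name|Path Selection Policy:"

# Set Round Robin on a single device (device ID is a placeholder)
esxcli storage nmp device set --device naa.60002ac0000000000000000000000001 --psp VMW_PSP_RR

# Add a claim rule so newly presented 3PAR volumes default to Round Robin
esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -P "VMW_PSP_RR" -O "iops=1" -c "tpgs_on" -V "3PARdata" -M "VV" -e "HP 3PAR Custom Rule"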
I was so happy to see this fault resolved once and for all – it really bugs me when I have a problem which needs resolving and it just drags on and on because other work needs attention first. Hopefully anyone else who has had this problem will come across this post and find some inspiration.