HPE CLX Error – Socket connection timed out to port 2550

This week I came across an interesting problem on one of our Microsoft Failover Clusters (FOC) which utilises the HPE Cluster Extension (CLX) software to provide failover support with our HPE 3PAR arrays.

A couple of roles within one FOC showed the CLX resource as failed, with everything else still showing as online. We were made aware of the problem when users reported a major file share as being offline. When I checked the Cluster Shared Volume (CSV) disks that made up the role, they showed as online, but when expanded their details were listed as ‘Unknown’.
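
As an aside, you don’t have to rely on Failover Cluster Manager to spot this; the same information is available from the FailoverClusters PowerShell module. A quick sketch that shells out to it from Python (nothing here is CLX-specific):

```python
import subprocess

# List every cluster resource with its state and owning role (group).
# A failed CLX resource shows State = Failed while the CSV disks that
# share its role can still report Online, matching what we saw here.
ps_command = (
    "Get-ClusterResource | "
    "Select-Object Name, State, OwnerGroup, ResourceType | "
    "Format-Table -AutoSize"
)
result = subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command", ps_command],
    capture_output=True,
    text=True,
)
print(result.stdout)
```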

Obviously I tried to bring the CLX resource online, but it would not start. I also checked the 3PAR arrays and found that although the role was running from DC B, the virtual volume state was read-only. I therefore took the role offline completely, moved it over to DC A and brought it online. At this point the CSV disks went from being in an unknown state to displaying their details (e.g. NTFS formatting, capacity). This was great for our users as the file share was back online; however, I still couldn’t bring CLX online.
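
For reference, the offline/move/online dance can be scripted with the standard cluster cmdlets too. A rough sketch along the same lines – the role and node names below are made-up placeholders:

```python
import subprocess

# Take the role (cluster group) fully offline, move it to a node in the
# other datacenter, then bring it back online. "FileShareRole" and
# "NODE-DCA-01" are hypothetical names - substitute your own.
for ps_command in (
    'Stop-ClusterGroup -Name "FileShareRole"',
    'Move-ClusterGroup -Name "FileShareRole" -Node "NODE-DCA-01"',
    'Start-ClusterGroup -Name "FileShareRole"',
):
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", ps_command],
        check=True,
    )
```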

Failover Cluster Manager - Failed Roles

When I opened the properties for the CLX resource and selected the CLX tab I got the following error after a long pause –

CLX Resource Configuration Error

The properties window then loaded and I saw everything I expected –

CLX Role Properties

At this point I decided to dig into the CLX log file which can be found at this location by default – C:\Program Files\Hewlett-Packard\Cluster Extension 3PAR\log

[OUTPUT OMITTED]

OK, so at this point we can see that the CLX resource is attempting to come online, but each time it fails with a message indicating it can’t find the properties for the remote copy group. It then goes on to tell us that the socket connection has timed out to a specific port on the 3PAR.
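
One thing worth knowing is that you can test that specific TCP port yourself without involving CLX at all. A minimal sketch – the array hostname is a placeholder, and 2550 is the port from the error in the title of this post:

```python
import socket

# Attempt a plain TCP connection to the port CLX is complaining about.
# "3par-dcb.example.local" is a placeholder for your array's address.
array_address = ("3par-dcb.example.local", 2550)

try:
    with socket.create_connection(array_address, timeout=5):
        print("TCP connection succeeded - the port is reachable.")
except OSError as exc:
    print(f"TCP connection failed: {exc}")
```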

Now at this point I want to remind you that a FOC running CLX has that software installed alongside the HPE 3PAR Command Line Interface (CLI) software. The two work in tandem to achieve our goal of moving the resource between nodes and datacenters. With that said, I figured there must be some sort of connectivity issue between the node and the array, so I proceeded to ping/traceroute to both 3PAR arrays and it all seemed good. Now that I knew there was basic connectivity, I began to wonder if there was a problem with the password file in the CLX config. I logged onto the cluster node running the role, fired up the CLX configuration tool and tried a connection test, which failed.

CLX Configuration Tool

CLX Password File Generator Error

The next thing that came to mind was a possible software version issue – we had recently upgraded both 3PAR arrays to the latest 3.2.2 release, but the CLX and CLI software on the cluster nodes had not been updated. In the past this had never been an issue; we’d been through multiple array releases with the versions we were running.
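
If you want to compare versions yourself, the array side is easy to check: the 3PAR CLI is also exposed over SSH, and the showversion command reports the HPE 3PAR OS version. A rough sketch using paramiko, with placeholder hostname and credentials:

```python
import paramiko

# Run showversion over the array's SSH interface. The hostname, username
# and password are placeholders - use your own, ideally from a secrets store.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("3par-dcb.example.local", username="3parsvc", password="********")

_, stdout, _ = client.exec_command("showversion")
print(stdout.read().decode())
client.close()
```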

Getting hold of the latest 3.2.2 CLI software was an absolute nightmare if I’m being honest. Even the HPE 3PAR engineer I went to struggled for a few hours to get a copy for me. Anyway, I eventually got both the CLI and CLX software downloaded and ready for testing.

I took a single cluster node, placed it into maintenance mode and then removed the existing CLX and CLI software. You have to be very careful when removing the software: the uninstaller tries to be helpful and offers to remove it from all of the cluster nodes as well as un-register CLX from the cluster. This is not something you want to happen unless you are dismantling the cluster.

CLX Uninstall - Only Select 1 Node

Un-Register CLX from Cluster

Once both packages were removed, I gave the cluster node a reboot just to ensure no outstanding tasks or leftover processes could interfere with the install.

When the node was back online, I ran through the installation of the CLX and CLI software, again being careful to select only this node and not push any config to the other nodes in the cluster. Having installed the software, I ran through the CLX setup and tested the connection to the array – thankfully, it all worked.

Successfully Connected to array

In this instance it looks like our problem relates to the software versions we are running on our failover cluster nodes. We have quite a few clusters and many nodes to patch, so it’s going to take a while, but hopefully we will get this done before any other issues crop up.

Hopefully this has been useful!
