HPE 3PAR Remote Copy Volumes Both Read Only

I recently ‘enjoyed’ working on a fault where a problem with a 3PAR remote copy group left both virtual volumes in a read-only state. I’ve seen a similar problem where both copies go into a read-write state, which is probably more worrying when it happens. Let’s take a look at the problem, the data I had to analyse and how it was resolved.

First off what was the setup in this scenario?

OK, I’m going to ignore big chunks of the infrastructure involved as it doesn’t really matter for this fault scenario. We have two SQL cluster nodes located in different DC locations, each connected to a local 3PAR array. The 3PAR arrays are using remote copy replication (in this case synchronous), with one array holding the read-write copy (located in the DC where the SQL instance is running) while the other array holds a read-only copy of the SQL virtual volume. For simplicity I am representing all of the SQL disks as one virtual volume; in reality there are 5. The SQL cluster nodes have the HPE 3PAR Cluster Extension (CLX) plugin installed to handle the movement of instances between nodes.

3PAR Remote Copy SQL Instance Setup With Witness

When things are working as expected the CLX software allows cluster nodes to initiate a migration between DC locations which then facilitates the movement of our SQL instance. We always expect one 3PAR array to have a read-write copy and the other array to have a read-only copy which is being replicated to.

Now that we have an understanding of the setup and expected operation let’s move on to the situation I found myself in.

The Windows cluster node which had been running the SQL instance showed that a stop error/blue screen (BSOD) had occurred. I grabbed the memory dump (C:\Windows\memory.dmp) along with the minidump (C:\Windows\MiniDump\), and a quick check showed me that the Resource Hosting Subsystem (RHS) process had halted the system. If you don’t know what RHS is then check out this handy article from Microsoft – https://blogs.technet.microsoft.com/askcore/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters/

I’ve added some of the output from WinDBG below just as a reference if you’re interested.

The SQL instance should have moved to the remaining node and come online, however it did not. While the instance had been seized by the remaining node it was unable to start, and investigation showed the disks could not be brought online.

A quick check on the 3PAR itself showed that the remote copy group was stopped and the sync state indicated ‘Stale’. As ever I’ve had to obfuscate some of the information in the screenshot but you should be able to discern the important information.

Remote Copy Stopped Stale Sync State

You can also get this information from the command line as below using the ‘showrcopy groups‘ command – in this case I added the wildcard to limit the returned output.

Attempts to start the remote copy group failed, and the cluster node event logs (both Windows and CLX) showed that the system required manual intervention. Here we see an example from the Windows event log.

Event 999 3PARCLxEvents

Next let us take a look at the raw CLX log output.

Both 3PAR arrays are showing as ‘secondary’ meaning neither has a read-write copy – if the SQL disks cannot be brought up in a read-write state then obviously the instance is never going to come online.
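To make the three possible situations explicit, here is a minimal sketch that classifies a pair of remote copy roles. It is purely illustrative and is not part of CLX or the 3PAR CLI: the function name, the enum and the plain ‘primary’/‘secondary’ role strings are my own assumptions, so adapt them to whatever your tooling actually reports.

```python
from enum import Enum


class GroupHealth(Enum):
    """Hypothetical labels for the three situations described in this post."""
    HEALTHY = "one primary, one secondary"
    BOTH_READ_ONLY = "both secondary"
    BOTH_READ_WRITE = "both primary"


def classify_roles(role_a: str, role_b: str) -> GroupHealth:
    """Classify the roles reported by the two arrays for one remote copy group.

    Assumes each role is the plain string 'primary' or 'secondary'
    (case-insensitive); anything else is rejected rather than guessed at.
    """
    roles = [role_a.strip().lower(), role_b.strip().lower()]
    for role in roles:
        if role not in ("primary", "secondary"):
            raise ValueError(f"unexpected role: {role!r}")
    if roles[0] != roles[1]:
        # Exactly one read-write copy: the expected state.
        return GroupHealth.HEALTHY
    if roles[0] == "secondary":
        # Neither array holds a read-write copy: the fault in this post.
        return GroupHealth.BOTH_READ_ONLY
    # Both arrays think they hold the read-write copy.
    return GroupHealth.BOTH_READ_WRITE
```

In the situation described here, `classify_roles("secondary", "secondary")` returns `GroupHealth.BOTH_READ_ONLY`.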

A little more digging in the CLX logs (this time on the other SQL cluster node) showed me that the cause appeared to be a timeout in the CLX request to the arrays to switch over the volume from its existing array to the alternate array.

OK so at this point we know that the remaining cluster node tried to switch the 3PAR volumes to be read-write in the DC it is located in but for some reason this request timed out and failed. This left the 3PAR disks in a read-only state in both DC locations meaning the SQL instance could never come online. How do we resolve this?

The 3PAR command we need to resolve this is ‘setrcopygroup‘ and we add the ‘reverse‘ attribute followed by ‘-local‘ and then the name of the remote copy group. I’ve provided an example below based on my situation. Note it is important to run this command on the primary array in your remote copy setup. Choosing to force the replication from the ‘wrong’ array could result in data loss so you must be sure you know what you are doing!
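As a sketch of the command shape described above, the small helper below assembles the command line from a group name. It only builds a string; it does not connect to an array or run anything. The helper name and the group name `my_rc_group` are hypothetical placeholders, and the argument order simply follows the description in this post, so check it against the CLI reference for your 3PAR OS version before using it.

```python
def build_reverse_command(group_name: str) -> str:
    """Build the 'setrcopygroup reverse -local <group>' command line as text.

    Assumption: the group name is a single token with no whitespace.
    This function never executes the command.
    """
    if not group_name or any(ch.isspace() for ch in group_name):
        raise ValueError("group name must be a non-empty single token")
    return " ".join(["setrcopygroup", "reverse", "-local", group_name])


# 'my_rc_group' is a made-up name used purely for illustration.
print(build_reverse_command("my_rc_group"))
```

Building the string separately makes it easy to review exactly what will be run, and on which array, before anything is typed into a live session.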

Unfortunately this didn’t work and I was prompted with the following error. To resolve this issue I shut down my cluster and then the cluster nodes, and un-exported the volumes so they were no longer presented to the cluster nodes.

setrcopygroup reverse Command Fails

Once I had done this I was able to force the replication to ‘reverse’ and basically designate one system as primary (read-write) and the other as secondary (read-only). I then exported the volumes back to the cluster nodes. Next I brought the cluster nodes back online, then the cluster itself and finally I brought the SQL instance online. I gave SQL time to settle and for Windows to chkdsk the volumes. Once everything was green and appeared happy I triggered a migration between cluster nodes (ergo between DC locations) and was thankful to see it worked as expected.

It goes without saying that you need to be careful when forcing remote copy groups to come online and replicate data – engage with HPE support and make sure you are confident before running any commands. Arguably, in a synchronous replication setup both arrays have identical copies of the data so you wouldn’t lose data; however, you probably have a preferred site which you consider to be the native primary. This is true in my situation and therefore guided my decision on which end to run the command on. I haven’t included every screenshot, crash dump and log entry that I gathered and reviewed during this troubleshooting exercise, but hopefully the above is enough to aid you if you are in a similar situation.

2 thoughts on “HPE 3PAR Remote Copy Volumes Both Read Only”

  1. Thanks for the article. What did Microsoft say about preventing the RHS from halting the OS? That is concerning since you did everything right regarding clustering and synchronously replicating data.

    • Hi,

      RHS is actually doing exactly what it is designed to do in this scenario. When the SQL cluster disk volumes go into a read-only state the instance of course stops functioning as it cannot write to the disks. RHS will see that role resources are not responding to heartbeats correctly and, after a timeout, will crash the instance owner to force the resource onto another cluster node. Reviewing the Windows cluster log shows entries such as the below:

      [RES] Physical Disk : VolumeIsNtfs: Failed to get volume information for \\?\GLOBALROOT\Device\Harddisk6\ClusterPartition2\. Error: 1117.

      If we run this through ‘net helpmsg’ we are informed the error code translates as ‘The request could not be performed because of an I/O device error.’ So as far as Microsoft are concerned the system did what it is designed to do and killed the instance owner as the resource was not functioning on it. Unfortunately, when the role ended up on the alternate node it also found the volumes in a read-only state, resulting in our outage. Does that all make sense or do you want me to explain anything further?
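If you have a lot of cluster log to wade through, a small script can pull the device path and error code out of lines like the one quoted above. This is a minimal sketch, assuming the line format matches that single example exactly; the function name and the one-entry lookup table are my own, and the only message included is the one reported by ‘net helpmsg’ above.

```python
import re

# Pattern based solely on the single log line quoted above (an assumption).
DISK_ERROR_RE = re.compile(
    r"Failed to get volume information for (?P<path>\S+)\. Error: (?P<code>\d+)\."
)

# Only the code discussed in this post; extend as needed.
KNOWN_MESSAGES = {
    1117: "The request could not be performed because of an I/O device error.",
}


def parse_disk_error(line: str):
    """Return (path, code, message) for a matching line, or None otherwise.

    message is None when the code is not in KNOWN_MESSAGES.
    """
    match = DISK_ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return match.group("path"), code, KNOWN_MESSAGES.get(code)


sample = (
    r"[RES] Physical Disk : VolumeIsNtfs: Failed to get volume information "
    r"for \\?\GLOBALROOT\Device\Harddisk6\ClusterPartition2\. Error: 1117."
)
print(parse_disk_error(sample))
```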

