I recently ‘enjoyed’ working on an issue where a fault in a 3PAR remote copy group left both virtual volumes in a read-only state. I’ve seen a similar problem where both copies go into a read-write state, which is probably the more worrying of the two when it happens. Let’s take a look at the problem, the data I had to analyse and how it was resolved.
First off, what was the setup in this scenario?
I’m going to ignore big chunks of the infrastructure involved, as they don’t really matter for this fault scenario. We have two SQL cluster nodes located in different DC locations, each connected to a local 3PAR array. The 3PAR arrays are using remote copy replication (in this case synchronous), with one array holding the read-write copy (located in the DC where the SQL instance is running) while the other array holds a read-only copy of the SQL virtual volume. For simplicity I am representing all of the SQL disks as one virtual volume; in reality there are five. The SQL cluster nodes have the HPE 3PAR Cluster Extension (CLX) plugin installed to handle the movement of instances between nodes.
When things are working as expected the CLX software allows cluster nodes to initiate a migration between DC locations which then facilitates the movement of our SQL instance. We always expect one 3PAR array to have a read-write copy and the other array to have a read-only copy which is being replicated to.
Now that we have an understanding of the setup and expected operation let’s move on to the situation I found myself in.
The Windows cluster node which had been running the SQL instance showed that a stop error/blue screen (BSOD) had occurred. I grabbed the memory dump (C:\Windows\memory.dmp) along with the minidump (C:\Windows\MiniDump\), and a quick check showed me that the Resource Hosting Subsystem (RHS) process had halted the system. If you don’t know what RHS is then check out this handy article from Microsoft: https://blogs.technet.microsoft.com/askcore/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters/
I’ve added some of the output from WinDBG below just as a reference if you’re interested.
STACK_TEXT:
ffffd001`e6de1968 fffff801`a628e468 : 00000000`0000009e ffffe000`b4086900 00000000`000004b0 00000000`00000005 : nt!KeBugCheckEx
ffffd001`e6de1970 fffff801`a628e0f2 : 00000000`00000000 00000000`00000001 ffffd001`e6db8180 00000000`00000000 : netft!NetftProcessWatchdogEvent+0xe4
ffffd001`e6de19b0 fffff800`75032e58 : ffffd001`e6de1b20 00000000`00000000 ffffd001`e6de1ae0 ffffd001`00000001 : netft!NetftWatchdogTimerDpc+0x36
ffffd001`e6de19e0 fffff800`75154dea : ffffd001`e6db8180 ffffd001`e6db8180 ffffd001`e6dc42c0 ffffe000`b47f7080 : nt!KiRetireDpcList+0x4f8
ffffd001`e6de1c60 00000000`00000000 : ffffd001`e6de2000 ffffd001`e6ddc000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a

STACK_COMMAND: kb
THREAD_SHA1_HASH_MOD_FUNC: ceabc0ebf1b805b8b961103c5fc0900d75e10ebd
THREAD_SHA1_HASH_MOD_FUNC_OFFSET: dab219cd26072518188cc78a3d85e21a6c5740b9
THREAD_SHA1_HASH_MOD: bda3626ee178dac7f74a0a80cfdb2e5d00aa032d
FOLLOWUP_NAME: MachineOwner
IMAGE_VERSION:
FAILURE_BUCKET_ID: 0x9E_5_IMAGE_rhs.exe
BUCKET_ID: 0x9E_5_IMAGE_rhs.exe
PRIMARY_PROBLEM_CLASS: 0x9E_5_IMAGE_rhs.exe
TARGET_TIME: 2017-03-30T21:08:05.000Z
The SQL instance should have moved to the remaining node and come online, but it did not. The remaining node had seized the instance but was unable to start it, and investigation showed the disks could not be brought online.
A quick check on the 3PAR itself showed that the remote copy group was stopped and the sync state indicated ‘Stale’. As ever I’ve had to obfuscate some of the information in the screenshot but you should be able to discern the important information.
You can also get this information from the command line as below using the ‘showrcopy groups‘ command – in this case I added the wildcard to limit the returned output.
BSA-SP3PAR01 cli% showrcopy groups *ecr*

Remote Copy System Information
Status: Started, Normal

Group Information

Name                  Target       Status  Role      Mode Options
bsa_sqlecr_rcg.r29758 BSA-GP3PAR01 Stopped Secondary Sync
  LocalVV             ID   RemoteVV          ID   SyncStatus LastSyncTime
  r.bsa_sqlecr_db01   1846 bsa_sqlecr_db01   1812 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_db02   1847 bsa_sqlecr_db02   1813 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_db03   1893 bsa_sqlecr_db03   1854 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_sysdb  1894 bsa_sqlecr_sysdb  1855 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_tempdb 1848 bsa_sqlecr_tempdb 1814 Stopped    2017-03-30 21:42:16 BST
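If you have a lot of remote copy groups to check, it can be handy to script this rather than eyeball it. The following is a minimal sketch, assuming the output has been captured to text and is laid out one row per line like the ‘showrcopy groups’ output above. The function names (parse_showrcopy_groups, stopped_groups) are my own for illustration and are not part of any HPE tooling, and the parser has only been shaped around the layout shown in this post.

```python
# Sample text laid out like the 'showrcopy groups' output in this post.
SAMPLE = """\
Name                  Target       Status  Role      Mode Options
bsa_sqlecr_rcg.r29758 BSA-GP3PAR01 Stopped Secondary Sync
  LocalVV             ID   RemoteVV          ID   SyncStatus LastSyncTime
  r.bsa_sqlecr_db01   1846 bsa_sqlecr_db01   1812 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_db02   1847 bsa_sqlecr_db02   1813 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_db03   1893 bsa_sqlecr_db03   1854 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_sysdb  1894 bsa_sqlecr_sysdb  1855 Stopped    2017-03-30 21:42:16 BST
  r.bsa_sqlecr_tempdb 1848 bsa_sqlecr_tempdb 1814 Stopped    2017-03-30 21:42:16 BST
"""


def parse_showrcopy_groups(text):
    """Parse group rows and their volume rows from captured output.

    Hypothetical helper: it assumes whitespace-separated columns in the
    order shown in this post and ignores anything it does not recognise.
    """
    groups = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # banner lines, blank lines
        if parts[3].startswith(("Primary", "Secondary")):
            # Group row: Name Target Status Role Mode [Options]
            groups.append({
                "name": parts[0],
                "target": parts[1],
                "status": parts[2],
                "role": parts[3],
                "mode": parts[4],
                "volumes": [],
            })
        elif parts[1].isdigit() and parts[3].isdigit() and groups:
            # Volume row: LocalVV ID RemoteVV ID SyncStatus LastSyncTime
            groups[-1]["volumes"].append({
                "local_vv": parts[0],
                "remote_vv": parts[2],
                "sync_status": parts[4],
            })
    return groups


def stopped_groups(groups):
    """Return the names of groups whose Status column reads 'Stopped'."""
    return [g["name"] for g in groups if g["status"] == "Stopped"]


if __name__ == "__main__":
    for name in stopped_groups(parse_showrcopy_groups(SAMPLE)):
        print("Remote copy group is stopped:", name)
```

Header rows are skipped because their fourth column is neither a role nor a numeric ID, so the same function copes with the full output including the banner lines.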
Attempts to start the remote copy group failed, and the cluster node event logs (both Windows and CLX) showed that the system required manual intervention. Here we see an example from the Windows event log.
Next let us take a look at the raw CLX log output.
[03/30/17 22:08:17][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] MANUAL INTERVENTION NECESSARY: The current status information does NOT allow AUTOMATIC activation of your HP 3PAR StoreServ Storage Remote Copy disk set for this application.
[03/30/17 22:08:17][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] Takeover action returns ERROR_GLOBAL. ()
[03/30/17 22:08:17][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] Resource cannot be brought online for any host in the cluster.
[03/30/17 22:08:17][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] CLX resource started going to failed state.
[03/30/17 22:08:17][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] CLX resource failed.
[03/30/17 22:23:19][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] CLX resource started going to online state.
[03/30/17 22:23:19][CLX: CLX 3PAR SQL-MSSQLSERVER] BEGIN HP 3PAR StoreServ CLUSTER EXTENSION Version 4.02.00
[03/30/17 22:23:19][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] Using product configuration file version 4.1.0
[03/30/17 22:23:19][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] AutoPass has detected a valid license for this version of HP 3PAR StoreServ Cluster Extension.
[03/30/17 22:23:19][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] System BSA-SQLECR-GP is a member of data center A system list.
[03/30/17 22:23:24][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] Cluster node: BSA-SQLECR-SP is up
[03/30/17 22:23:24][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] Current Configuration for the Remote Copy volume group :
    Local Remote Copy Volume Group Name = bsa_sqlecr_rcg
    Remote Remote Copy Volume Group Name = bsa_sqlecr_rcg.r29758
    Local HP 3PAR StoreServ Storage Network Name = bsa-gp3par01
    Remote HP 3PAR StoreServ Storage Network Name = bsa-sp3par01
    Local HP 3PAR StoreServ Storage Serial Number = 0123456
    Remote HP 3PAR StoreServ Storage Serial Number = 9876543
    Local HP 3PAR StoreServ Storage Name = BSA-GP3PAR01
    Remote HP 3PAR StoreServ Storage Name = BSA-SP3PAR01
    Local HP 3PAR StoreServ Storage Password File = "C:\HP3PARCLIPassword\bsa-gp3par01_3parclx.pwd"
    Remote HP 3PAR StoreServ Storage Password File = "C:\HP3PARCLIPassword\bsa-sp3par01_3parclx.pwd"
    Replication Mode of the Remote Copy Volume Group = Sync
    Failsafe Mode of the Remote Copy Volume Group = no_fail_wrt_on_err
    Local Replication Role of the Remote Copy Volume Group = Secondary
    Remote Replication Role of the Remote Copy Volume Group = Secondary
    Remote Copy Link Status = Up
    Remote Copy Volume Group Status = Stopped
    Remote Copy Volume Group Virtual Volume SyncStatus = Stale
    ApplicationStartup = FASTFAILBACK
    UseNonCurrentDataOk = YES
    AutoRecover = no
Both 3PAR arrays are showing as ‘secondary’ meaning neither has a read-write copy – if the SQL disks cannot be brought up in a read-write state then obviously the instance is never going to come online.
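To make that check explicit, here is a rough sketch that pulls the two role fields out of the CLX configuration dump and classifies the result. It assumes one "key = value" pair per line, as in the CLX log above. The function names and the classification labels are mine, purely for illustration, and I am assuming a healthy side would be reported as "Primary" in the same fields, since the log in this post only ever shows "Secondary".

```python
# Key names copied from the CLX log in this post.
LOCAL_ROLE_KEY = "Local Replication Role of the Remote Copy Volume Group"
REMOTE_ROLE_KEY = "Remote Replication Role of the Remote Copy Volume Group"

# A cut-down copy of the configuration dump from the CLX log.
CLX_CONFIG = """\
    Replication Mode of the Remote Copy Volume Group = Sync
    Local Replication Role of the Remote Copy Volume Group = Secondary
    Remote Replication Role of the Remote Copy Volume Group = Secondary
    Remote Copy Link Status = Up
    Remote Copy Volume Group Status = Stopped
"""


def parse_clx_config(text):
    """Turn 'key = value' lines into a dict (hypothetical helper)."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines without a ' = ' separator
            config[key.strip()] = value.strip()
    return config


def classify_roles(config):
    """Classify the role pair reported by CLX.

    Labels are illustrative only:
      normal           - one Primary, one Secondary
      no-writable-copy - both Secondary (the situation in this post)
      both-writable    - both Primary (assumed value, see lead-in)
      unknown          - anything else, including missing fields
    """
    roles = {config.get(LOCAL_ROLE_KEY), config.get(REMOTE_ROLE_KEY)}
    if roles == {"Primary", "Secondary"}:
        return "normal"
    if roles == {"Secondary"}:
        return "no-writable-copy"
    if roles == {"Primary"}:
        return "both-writable"
    return "unknown"


if __name__ == "__main__":
    print(classify_roles(parse_clx_config(CLX_CONFIG)))
```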
A little more digging in the CLX logs (this time on the other SQL cluster node) showed me that the cause appeared to be a timeout in the CLX request to the arrays to switch the volume over from its existing array to the alternate array.
[03/30/17 21:43:15][ERROR] HP 3PAR StoreServ Storage CLI Command timeout has occurred. The command cli -sys bsa-sp3par01 -sockssl -pwf "C:\HP3PARCLIPassword\bsa-sp3par01_3parclx.pwd" -csvtable setrcopygroup reverse -f -t BSA-GP3PAR01 -stopgroups -waittask bsa_sqlecr_rcg.r29758 did not finish within the timeout value specified.
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] setrcopygroup for reverse -t BSA-GP3PAR01 -stopgroups bsa_sqlecr_rcg.r29758 reverse started with tasks: 25319 Waiting for tasks to complete
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] Reverse operation has failed for the Remote Copy volume group bsa_sqlecr_rcg.r29758.
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] Takeover action returns ERROR_LOCAL. (0-3-8-12-26-27-29 -> 3)
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] Resource cannot be brought online on host BSA-SQLECR-SP.
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] CLX resource started going to failed state.
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] CLX resource failed.
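Finding that timeout meant scrolling through a lot of log. As a minimal sketch of how you might filter for it, the snippet below parses entries in the bracketed format seen in the CLX log above (timestamp, an optional CLX resource tag, an optional level) and picks out ERROR entries that mention a timeout. The regular expression is my own reading of the lines shown in this post rather than a documented log format, the function names are hypothetical, and the sample messages are shortened.

```python
import re
from datetime import datetime

# [MM/DD/YY HH:MM:SS] then optional [CLX: resource] then optional [LEVEL].
ENTRY_RE = re.compile(
    r"^\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]"
    r"(?:\[CLX: ([^\]]+)\])?"
    r"(?:\[([A-Z]+)\])?"
    r"\s*(.*)$"
)

# Shortened entries in the same shape as the CLX log in this post.
LOG = """\
[03/30/17 21:43:15][ERROR] HP 3PAR StoreServ Storage CLI Command timeout has occurred.
[03/30/17 21:43:15][CLX: CLX 3PAR SQL-MSSQLSERVER][ERROR] Reverse operation has failed for the Remote Copy volume group bsa_sqlecr_rcg.r29758.
[03/30/17 22:23:19][CLX: CLX 3PAR SQL-MSSQLSERVER][INFO] CLX resource started going to online state.
"""


def parse_clx_log(text):
    """Parse one log entry per line; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if not match:
            continue
        stamp, resource, level, message = match.groups()
        entries.append({
            "time": datetime.strptime(stamp, "%m/%d/%y %H:%M:%S"),
            "resource": resource,  # None when the [CLX: ...] tag is absent
            "level": level,        # None when no [LEVEL] tag is present
            "message": message,
        })
    return entries


def find_timeouts(entries):
    """Return ERROR entries whose message mentions a timeout."""
    return [
        e for e in entries
        if e["level"] == "ERROR" and "timeout" in e["message"].lower()
    ]


if __name__ == "__main__":
    for entry in find_timeouts(parse_clx_log(LOG)):
        print(entry["time"].isoformat(), entry["message"])
```

Keeping the parsed timestamp as a datetime makes it easy to line the CLX entries up against the crash dump time and the LastSyncTime reported by the array.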
OK so at this point we know that the remaining cluster node tried to switch the 3PAR volumes to be read-write in the DC it is located in but for some reason this request timed out and failed. This left the 3PAR disks in a read-only state in both DC locations meaning the SQL instance could never come online. How do we resolve this?
The 3PAR command we need to resolve this is ‘setrcopygroup’, used with the ‘reverse’ subcommand and the ‘-local’ option, followed by the name of the remote copy group. I’ve provided an example below based on my situation. Note that it is important to run this command on the array that is the natural primary in your remote copy setup, in other words the one you want to hold the read-write copy. Choosing to force the replication from the ‘wrong’ array could result in data loss, so you must be sure you know what you are doing!
BSA-SP3PAR01 cli% setrcopygroup reverse -local bsa_sqlecr_rcg
Unfortunately this didn’t work and I was prompted with the following error. To resolve this issue I shut down my cluster and then the cluster nodes, and un-exported the volumes so they were no longer presented to the cluster nodes.
Once I had done this I was able to force the replication to ‘reverse’ and basically designate one system as primary (read-write) and the other as secondary (read-only). I then exported the volumes back to the cluster nodes. Next I brought the cluster nodes back online, then the cluster itself and finally I brought the SQL instance online. I gave SQL time to settle and for Windows to chkdsk the volumes. Once everything was green and appeared happy I triggered a migration between cluster nodes (ergo between DC locations) and was thankful to see it worked as expected.
It goes without saying that you need to be careful when forcing remote copy groups to come online and replicate data. Engage with HPE support and make sure you are confident before running any commands. Arguably, in a synchronous replication setup both arrays have identical copies of the data so you wouldn’t lose any; however, you probably have a preferred site which you consider to be the native primary. This is true in my situation and therefore guided my decision on which end to run the command on. I haven’t included every screenshot, crash dump and log entry that I gathered and reviewed during this troubleshooting exercise, but hopefully the above is enough to aid you if you are in a similar situation.