This is going to be a fairly short post outlining my thoughts and experiences with regard to 3PAR arrays using synchronous replication (Peer Persistence and Cluster Extensions or CLX) and the sizing of disk tiers.
My current environment consists of two 7400 (2 node) 3PAR arrays running synchronous replication using Remote Copy over Fibre Channel (RCFC). They are separated by about 17 miles and we leverage Peer Persistence for our VMware estate and Cluster Extensions (CLX) for our Hyper-V systems and Microsoft Failover Clusters.
This is a great setup that allows us to present our two datacentres as one ‘virtual’ datacentre for a truly active-active setup. We can migrate workloads and move systems around without impact, giving us disaster avoidance and load balancing. All great things; however, there is a snag, and I’m going to come onto that now.
We leverage the 3PAR Adaptive Optimisation (AO) feature and I think it does a great job. Our virtual machine datastores and SQL disks are natively mapped to the middle (FC) tier, and AO then migrates data up to the SSDs or down to the near-line (NL) disks. This works fine for volumes running locally; the remote replica receives every write the source does, but crucially the AO placement data cannot be replicated. This means that while the source may have 20% of a volume on SSD, 30% on FC and the remaining 50% on NL, at the target array all of that data sits on the FC tier.
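To get a feel for why that matters, here is a back-of-envelope sketch of the mismatch. All the numbers are assumptions for illustration – the per-tier latencies and the IO skew (AO promotes hot data, so most IO lands on the SSD-resident chunklets) are made up, not measured 3PAR figures:

```python
# Illustrative per-IO service times for each tier (assumed, not measured).
TIER_LATENCY_MS = {"ssd": 0.5, "fc": 6.0, "nl": 12.0}

def avg_latency(io_share):
    """Weighted average latency given the fraction of IO each tier serves."""
    return sum(frac * TIER_LATENCY_MS[tier] for tier, frac in io_share.items())

# Source array: AO has promoted the hot data, so (assumed) most IO hits SSD.
source = {"ssd": 0.70, "fc": 0.25, "nl": 0.05}
# Target array after failover: the replica is entirely on the FC tier.
target = {"fc": 1.00}

print(f"source avg latency: {avg_latency(source):.2f} ms")  # ≈ 2.45 ms
print(f"target avg latency: {avg_latency(target):.2f} ms")  # 6.00 ms
```

Same data, same writes, but until AO has had a chance to rebuild its placement on the target, every IO is paying the FC-tier price.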
Why is this a problem? For normal operations, e.g. moving a VM to the other DC or failing a SQL instance over, it’s not the end of the world. Sure, we don’t get the benefit of hot data being served from the highest-performing tier and cold data dropping to cheap disk, but things work OK, and if the VM/SQL/whatever is going to stay there for any length of time it will get AO’d and all is good again.
The problem comes when you move a big chunk of your workload – think a full DC outage or planned work. In this scenario the middle tier on the remote 3PAR is suddenly hit with a huge workload it was never previously required to serve. My option is to wait 30 minutes, then kick AO into gear and hope it can shift things around to relieve the pressure, but up to that point – and even during and after the rebalance – I’m likely to see thrashing.
The big takeaway for me from the above is that you MUST size your tiers wisely if you intend to run a setup like mine. I don’t mean double up on everything you have – understand what you need the array to handle. I do not expect to move my entire workload to one DC in a failure scenario; systems are prioritised and some stuff will simply shut down if its native DC isn’t available. That said, I need to know that when that event does occur my remaining 3PAR won’t go wide-eyed and wonder what it did wrong in a former life as hundreds of systems come a-knocking.
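That sizing exercise can be as simple as adding up what actually fails over against what the middle tier can serve on its own. The capacities and workload figures below are entirely hypothetical – plug in your own numbers from System Reporter or similar:

```python
# Hypothetical sustained IOPS the surviving array's FC tier can serve alone
# (i.e. before AO has redistributed anything). An assumption, not a spec.
FC_TIER_CAPACITY_IOPS = 20_000

# Workload already native to the surviving DC (assumed figure).
local_workload_iops = 12_000

# Prioritised list of remote systems: (name, IOPS, fails_over?).
# Low-priority systems simply shut down rather than move.
failover_candidates = [
    ("sql-prod",     5_000, True),
    ("vmware-tier1", 4_000, True),
    ("dev-test",     6_000, False),  # shuts down if its native DC is lost
]

failover_iops = sum(iops for _, iops, moves in failover_candidates if moves)
total = local_workload_iops + failover_iops

print(f"post-failover load: {total} IOPS "
      f"({total / FC_TIER_CAPACITY_IOPS:.0%} of FC tier capacity)")
```

With these made-up numbers the FC tier lands at 105% of capacity – exactly the “wide-eyed” scenario above, and the kind of result that tells you to beef up the middle tier or demote more systems to the shut-down list.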
What would I like to see as a future development on 3PAR?
Well, I guess it’s pretty obvious I’d love to see AO metrics replicated between arrays running a Peer Persistence/CLX setup. Is that something I think we will see any time soon? Probably not, unless a lot of people need it – and with SSDs coming down in price and growing in capacity, you could argue we eventually won’t bother with AO at all.
It’s also important to remember that the above isn’t just an issue with remote replication – if you put a workload on a tier and rely on AO to handle performance, you will see a hit until that hot data is moved up.
All of the above out of the way, I want to say that so far in my career the 3PAR has been the best SAN I’ve worked on and if anyone fancies giving me two spare 7450s with a stack of SSDs I’d be more than happy to take them!