It’s coming up to five years since the current design used in our DC setup was put together. I think it is safe to say it has been incredibly successful – so much so that we struggled to keep up with the demand for resources. When our customers (internal or external) realised how resilient, flexible and responsive our setup was, they rushed to consume every terabyte and CPU cycle possible. We certainly aren’t a huge outfit, but I remember at one point used capacity was growing by 10TB a month.
Right now we have an Active/Active DC design with a third DC used for witnesses and some site-specific stuff at that location. We load balance across the two active datacenters and have a stretched network (based on Cisco Overlay Transport Virtualisation, or OTV) and synchronously replicated storage, allowing seamless migration of resources. Sometimes a vendor or customer will dictate a design which is horrendous in terms of HA and availability, but whenever we get a choice the solution will be built to survive a complete DC outage with zero or, at very worst, minutes of impact. This relies upon load balancing the front-end application services and clustering the back ends. We have taken DCs completely offline – i.e. powered off, lights out – and proved that the seamless migration/failover works. There is something incredibly empowering about knowing your design works, and not just theoretically but in practice. I can’t say in any other job I’ve completely shut down a DC, the SAN and everything else running in it – very surreal to be stood in a silent room that is normally so deafeningly loud.
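To make the idea concrete, here is a minimal, hypothetical sketch of the active/active pattern described above: a load balancer health-checks a pool of back ends spanning two datacenters and keeps serving from whichever members pass their checks, so powering off an entire DC simply empties one half of the pool. All names (`Backend`, `DC-A`, `DC-B`) are illustrative, not anything from our actual environment.

```python
# Hypothetical sketch of health-check-driven active/active load balancing.
# Losing a whole datacenter just removes its members from rotation.
from dataclasses import dataclass


@dataclass
class Backend:
    name: str             # e.g. "app1"
    datacenter: str       # "DC-A" or "DC-B"
    healthy: bool = True  # in reality, set by periodic health probes


class LoadBalancer:
    def __init__(self, pool):
        self.pool = pool

    def pick(self):
        """Return the next healthy backend, round-robin; raise if none remain."""
        candidates = [b for b in self.pool if b.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends in either DC")
        backend = candidates[0]
        # Rotate the chosen member to the back so the next call
        # prefers a different (ideally other-DC) member.
        self.pool.remove(backend)
        self.pool.append(backend)
        return backend


pool = [Backend("app1", "DC-A"), Backend("app2", "DC-B")]
lb = LoadBalancer(pool)

# Normal operation: requests alternate between the two datacenters.
first, second = lb.pick(), lb.pick()
assert {first.datacenter, second.datacenter} == {"DC-A", "DC-B"}

# Simulate powering off DC-A: its members fail their health checks...
for b in pool:
    if b.datacenter == "DC-A":
        b.healthy = False

# ...and all traffic lands in DC-B with no manual intervention.
assert all(lb.pick().datacenter == "DC-B" for _ in range(4))
```

In the real design the "health check" is the load balancer probing the application tier, and the synchronously replicated storage is what lets the surviving DC carry on with current data rather than a stale copy.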
The current solution really has worked fantastically well (aside from some frustrations with one of the hypervisors) and as such it would be perfectly fine to renew the hardware we have, make a few adjustments and keep working with what we know does the job. Due to the nature of our business we must consider every option available to us and treat all of them fairly and equally. This can be both frustrating and helpful – sometimes it means reviewing a tender which doesn’t meet requirements or is ridiculously spec’d and having to invest time documenting and explaining why it isn’t fit for purpose. Equally it can sometimes result in a solution coming to light which you may never have considered or heard of that ticks every box.
Of course many vendors will no doubt want to pitch their ideas, and we will have to do our best to filter out those that do not fit and then determine which of the remaining we should invest in. Will it be hyperconverged, or a design similar to the one we have now with some form of separate SAN, compute and fabric? I would very much like to see us take advantage of both RDMA and NVMe in any future design, as these will certainly help accelerate our business and the level of performance we can offer customers. I can’t really talk on here about some of the other design goals and decisions, but it’s safe to say there are a lot of them! Whenever I can I try to walk to work to give my dodgy joints some exercise – every day that I do so, the journey is spent thinking through a new solution or design. I’ll factor in hardware developments, licensing and pretty much every important aspect I can possibly think of. I’m sure my dreams are now focused on comparing NUMA node considerations between the new AMD and Intel CPU architectures and many other low-level design considerations!
What does the future hold for us, what will I be designing and deploying next? Whatever it may be I’m sure the journey will be exciting, frustrating and full of learning!