My colleagues and I have been tackling a rather interesting fault over the past few weeks. I’m going to write separately about the technical issues we faced and how we worked with our suppliers to identify the cause. In this post I want to mention a few ‘lessons learnt’ about the soft skills rather than the tech stuff.
One thing that definitely came out of this work was the reminder that you should never assume anything. In this case I mean the assumption that you and somebody else are talking about the same thing. To give an example: while discussing the port configuration on a Cisco Nexus switch, I was under the impression that my colleague and I were talking about the same set of ports when in reality we were not. What really hammered this home was finding out that a similar thing had happened months back, when another colleague made a change for me but not on the ports I thought it was made on. Had they made the change on the right ports, we might have resolved the issue before it hit us as hard as it finally did.
One of the vendor engineers we worked with held a conference call with us in which he asked a range of questions, many of them ‘obvious’ or clarifying things you would normally take for granted. It can feel frustrating to have the obvious stated, but in a complex environment it’s vital that everyone is on the same page. I know I was definitely at fault for making assumptions about certain configuration changes rather than asking outright.
We spent a lot of time investigating one component within our infrastructure, to the point that we neglected other areas at first. The business was shouting about this component and the impact it was having on our users, so of course we focused all our efforts on it. When people are coming into your office demanding answers, it is easy to focus on the thing they are talking about when really you need to step back and look at the big picture. We have a complex setup with many interdependencies, and tunnel vision is not your friend.
The issue we experienced required the involvement of pretty much all our major vendors, and we had to do a lot of work with each of them to progress the issue to higher levels of support. On a number of occasions management pushed these suppliers to escalate the troubleshooting to the next line of support rather than accept the first answer we got back. This turned out to be vital, as we got some really powerful insight from the senior engineers. I think it is safe to say that their feedback and recommendations enabled us not only to monitor and report on the issue more effectively but also to implement changes which have had a profound effect and brought us incredibly close to a full resolution.
So the important message here is: be prepared to push a supplier to escalate an issue or provide a more complete answer. They may tell you the problem is ‘x’, but make sure you are satisfied there is evidence to show this. Be prepared to bat the issue back to them and push them to do more, so you can be assured no stone was left unturned.
This has been a really good learning experience for me and the team. I’m sure the business would have liked to see this issue resolved sooner, but personally I think we have done well. Our setup is very complex and we are a tiny team covering a huge range of technologies. A setup like ours is typically run in companies with whole teams of people looking after individual aspects of the solution. We don’t have that luxury, so we have to master many skills across a broad range of IT tech.
Hopefully, at the end of this we will have a meeting to discuss the journey we’ve been on and identify what went well and what could have gone better. It’s vital that we learn as much as possible from the experience. For example, we now have a much better understanding of how to baseline and report on certain performance metrics, which will allow us to improve the monitoring and alerting we have in place.
OK, you don’t have to use Microsoft OneNote, but I will simply say it has been incredibly useful to us during the troubleshooting. We created a master page in OneNote with all the case reference numbers and then created sub-pages for each task or step we were carrying out. As this OneNote notebook was a shared one, it allowed my colleagues and me to collaborate on the documentation at the same time. We added the commands we were running and their output, basically everything, to the sub-pages. Whatever you choose to use, make sure you are documenting everything you do: every command and its output. Document all the things!
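If you want some help getting into the habit of capturing every command and its output, a small wrapper script can do the recording for you. The following is only a minimal sketch and not what we actually used: the function name `run_and_log` and the notes file path are made up for illustration, and it assumes the commands are ones you can run from a machine with Python 3.8 or later. It runs a command, then appends a timestamp, the command line, the exit code and the output to a plain text file that you could paste into a OneNote sub-page later.

```python
import shlex
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_and_log(command, log_path):
    """Run a command and append the command line and its output to a notes file.

    `command` is a list of arguments, e.g. ["hostname"].
    `log_path` is a hypothetical notes file; it is created if it does not exist.
    Returns the command's standard output as a string.
    """
    # Capture stdout and stderr as text rather than letting them scroll past.
    result = subprocess.run(command, capture_output=True, text=True)

    # UTC timestamp so notes from different colleagues line up.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")

    entry = (
        f"[{stamp}]\n"
        f"$ {shlex.join(command)}\n"
        f"exit code: {result.returncode}\n"
        f"{result.stdout}{result.stderr}\n"
    )

    # Append, never overwrite: earlier troubleshooting steps must survive.
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(entry)

    return result.stdout
```

Calling something like `run_and_log(["hostname"], "case-notes.txt")` adds one entry to the file each time. Appending rather than overwriting is deliberate, since the order in which you tried things is often as useful as the output itself.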
This post isn’t my usual ‘how to’, but I feel some of these lessons are important to remember. It is easy in the heat of the moment to forget how best to approach an issue, so give yourself time to step back. Don’t let yourself get stressed; it usually doesn’t help with thinking. Breathe, relax, then get to it.