When I originally set out to create this blog, I had visions of very clear lines between documentation and isolation, as well as repair and escalation. In reality, documentation is at the heart of all of these concepts, which is why it is discussed here first. We’ve already discussed documenting the problem, but what about isolating a problem in the network? The isolation I’m talking about is less about finding the cause of the problem and more about removing the problem from the network. Accurate and relevant documentation is needed to answer questions like: if a network interface is experiencing a large number of errors and impacting performance, can it be shut down while it is repaired, with the network’s redundancy taking over? That ensures the users have a better experience. The same concept applies in a maintenance window… can a specific change be applied to the network while traffic is re-routed around it, so the users do not experience an outage? More on that later.
The challenge we often see is that all-purpose documentation may not have the right amount of detail to allow the NetOps engineer to perform this isolation. I may know the physical connectivity appears redundant, but is the routing architecture deployed in such a way that it will converge without creating a larger outage? Secondly, how current is the documentation set? What if I believe there is proper redundancy, but the document is six months old, and through the regular operation of the network something was moved, added, or changed that impacted this redundancy? Again, looking back to the finding that 45% of network outages were avoidable and caused by operator error, it is access to information, in the form of relevant documentation, that reduces these types of errors. There is nothing worse than a small issue turning into a disastrous one simply through lack of visibility.
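The kind of pre-check described above — “can I shut this down and let redundancy take over?” — can be answered mechanically if the documentation is current and machine-readable. A minimal sketch, assuming a hypothetical topology kept as an adjacency list (all device names are illustrative):

```python
from collections import deque

# Hypothetical topology pulled from current documentation:
# an adjacency list of device names. All names are illustrative.
topology = {
    "core-1": ["dist-1", "dist-2"],
    "core-2": ["dist-1", "dist-2"],
    "dist-1": ["core-1", "core-2", "access-1"],
    "dist-2": ["core-1", "core-2", "access-1"],
    "access-1": ["dist-1", "dist-2"],
}

def still_connected(topology, src, dst, failed_link):
    """BFS from src to dst, ignoring the link we plan to shut down."""
    a, b = failed_link
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nbr in topology.get(node, []):
            if {node, nbr} == {a, b}:  # skip the link being isolated
                continue
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return False

# Can we shut the core-1 <-> dist-1 link and still reach access-1?
print(still_connected(topology, "core-1", "access-1", ("core-1", "dist-1")))  # True
```

If the check returns False, the "redundancy" in the six-month-old diagram no longer exists, and shutting the interface would turn a small issue into an outage.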
Still part of documentation is the need for a strong visual aid during an isolation exercise. Again, this could be for removing a problem area from the network, or simply knowing which elements in the network are part of the application path the engineer is working on. Do you know which types of devices are along the path, where redundancy is deployed, which ACLs are applied along that path, the interfaces and addressing, whether there is an MPLS provider in the path, and so on? The challenge for most organizations is that they rely on the CLI, which is text based. This requires manually drawing the visual aid, which is time consuming and prone to error.
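Even without a drawing tool, capturing the path as structured data — rather than scattered CLI notes — makes a text rendering cheap and repeatable. A sketch under hypothetical assumptions (device names, interfaces, and ACLs below are all illustrative):

```python
from dataclasses import dataclass, field

# Illustrative model of one hop along an application path.
@dataclass
class Hop:
    device: str
    ingress: str
    egress: str
    acls: list = field(default_factory=list)
    redundant: bool = False

# A hypothetical three-hop application path, including an MPLS hand-off.
path = [
    Hop("edge-fw-1", "eth0", "eth1", acls=["PERMIT-WEB"], redundant=True),
    Hop("core-rtr-1", "Gi0/0", "Gi0/1", redundant=True),
    Hop("mpls-pe-1", "Gi0/2", "mpls0"),  # provider hand-off
]

def render(path):
    """Produce a one-line-per-hop text map of the path."""
    lines = []
    for hop in path:
        flags = []
        if hop.redundant:
            flags.append("HA")
        if hop.acls:
            flags.append("ACL:" + ",".join(hop.acls))
        tag = f" [{' '.join(flags)}]" if flags else ""
        lines.append(f"{hop.ingress} -> {hop.device}{tag} -> {hop.egress}")
    return "\n".join(lines)

print(render(path))
```

The point is not the rendering itself but that the same structured record answers all of the questions above — ACLs, redundancy, interfaces, provider hops — without redrawing anything by hand.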
CLI doesn’t cut it in real-time anymore
Even if I were to leverage CLI consoles, getting the data requires knowledge of appropriate usernames and passwords, management IPs, specific commands across multiple device types, and analysis of output formatted in different ways. For the most part, CLI consoles require serial execution of commands, and what we call “stare and compare” to pull out the relevant data and analyze it for potential errors. We also need a detailed log of activities performed to enable post-mortem improvement discussions and assist with automation. CLI consoles, unfortunately, are a raw output of each session. Correlating the different outputs requires analyzing multiple files, with multiple executions of different commands, all mixed together. And this is for a single console logging session. Add 5-10 different devices to the mix and the process becomes very complex and time consuming. Just as important, most logging requires the operator to enable it; if it isn’t enabled, all this data is lost. Time stamps of command execution let chronology be determined within a single session, but reconstructing it across multiple logging sessions is impossible. And if a NetOps engineer chooses to use a graphical user interface, there may be no logging capability at all!
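When session captures do carry timestamps, the cross-session correlation problem becomes tractable: the separate logs can be interleaved into one timeline. A minimal sketch, assuming hypothetical per-device logs of `(timestamp, device, command)` tuples (all names and commands are illustrative):

```python
from datetime import datetime

# Hypothetical per-device session logs; in practice these would come
# from timestamped console captures on each device.
session_a = [
    ("2024-05-01T10:00:05", "rtr-1", "show ip interface brief"),
    ("2024-05-01T10:02:40", "rtr-1", "show ip route"),
]
session_b = [
    ("2024-05-01T10:01:12", "sw-1", "show interfaces counters errors"),
]

def merge_sessions(*sessions):
    """Interleave multiple console logs into one chronological timeline."""
    events = [e for s in sessions for e in s]
    events.sort(key=lambda e: datetime.fromisoformat(e[0]))
    return events

for ts, device, cmd in merge_sessions(session_a, session_b):
    print(f"{ts} {device}: {cmd}")
```

This is exactly the capability raw console logs lack: without timestamps recorded at capture time, no amount of after-the-fact analysis can recover the order of operations across devices.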
During a fault isolation exercise, relevant historical data is also very handy. It would be great to know what the application path was last week when things were working well, and whether anything changed around the time the application was discovered to be degraded or failing. Most NMS platforms can display performance data in graph or chart form, but they cannot tie in what changed, or whether the application was using this path at the time of the outage. Lastly, without knowledge of the network deployment and how-to knowledge, the entire isolation process stalls. It is like speaking English to someone who only speaks French: if we don’t speak the same language, it is impossible to communicate.
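If path snapshots are recorded over time, answering “what changed since last week?” reduces to a diff. A sketch under hypothetical assumptions (the hop names and the recorded snapshots are illustrative):

```python
# Hypothetical snapshots of an application path, recorded last week
# (when the app was healthy) and again today.
path_last_week = ["edge-fw-1", "core-rtr-1", "dist-sw-1", "app-srv-1"]
path_today     = ["edge-fw-1", "core-rtr-2", "dist-sw-1", "app-srv-1"]

def diff_paths(before, after):
    """Report hop-by-hop changes between two recorded paths."""
    changes = []
    for i, (old, new) in enumerate(zip(before, after)):
        if old != new:
            changes.append(f"hop {i}: {old} -> {new}")
    if len(before) != len(after):
        changes.append(f"path length changed: {len(before)} -> {len(after)}")
    return changes

print(diff_paths(path_last_week, path_today))  # ['hop 1: core-rtr-1 -> core-rtr-2']
```

A diff like this immediately points the engineer at the hop that moved, instead of leaving them to rediscover the whole path from scratch.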
In NetOps, if I don’t understand the protocols, or how they were deployed in the network, it is almost impossible to find the problem, never mind resolve it. The challenge in many NetOps organizations is that this knowledge is held by a few people near the top of the support structure, which means it takes time to gain access to them. With all the costs associated with outages, this is simply bad news for any NetOps team.
DIRE NetOps can help.
If you find yourself struggling with your NetOps 2.0 transformation, you are not alone. Let DIRE NetOps, our parent company and professional services organization, help you find the right tools to fill the gap and transform your NetOps organization into a world-class, well-oiled machine.