As a network administrator, I have been (partly) responsible for monitoring network infrastructures, or even entire companies, for many customers. For some companies I have even redesigned the monitoring completely.
Many companies currently use tools such as Nagios, SolarWinds or a similar package. In my opinion these tools are first-generation software, mainly because setting up the triggers takes a lot of time. If you also have to deal with different vendors, it becomes even more complex: every vendor has its own event codes and descriptions, which leads to inconsistent and therefore unclear alerts.
Many parties also rely on SNMP traps. Although you will not be able to switch off SNMP traps completely, relying only on these traps is dangerous, for several reasons.
Why we cannot trust SNMP Traps
An SNMP trap is a single UDP datagram sent by the device. Consider, for example, a temperature that exceeds its limit: the switch sends a single UDP datagram, once. Unlike TCP, UDP does not check whether the datagram ever arrived. TCP has retransmissions for this; UDP has no acknowledgement or retransmission mechanism at all.
In the real world, a switch can get into trouble due to, for example, a spanning-tree recalculation, and the SNMP trap does not arrive as a result. Such a notification is not sent again and is never registered. The same applies if the connection between your monitoring host and the device is unstable. That is not something you can count on in the event of disruptions.
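To make the fire-and-forget behaviour concrete, here is a minimal Python sketch. It does not build a real SNMP trap; the function name and the payload are made up for illustration. It only shows that a UDP send reports success to the sender even when nobody is listening on the other side.

```python
import socket


def send_unacknowledged(payload: bytes, host: str, port: int) -> int:
    """Send one UDP datagram and return the number of bytes handed to the OS.

    Hypothetical helper, not a real SNMP trap sender. The return value says
    nothing about delivery: UDP has no acknowledgement and no retransmission.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Reserve a local UDP port, then close it again so nothing is listening.
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    probe.bind(("127.0.0.1", 0))
    closed_port = probe.getsockname()[1]
    probe.close()

    message = b"temperature limit exceeded"  # placeholder payload
    sent = send_unacknowledged(message, "127.0.0.1", closed_port)
    # The send "succeeds" although the datagram is dropped at the receiver.
    print(f"sent {sent} bytes, delivery unknown")
```

From the sender's point of view, a lost trap and a delivered trap look exactly the same.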
As I wrote above, I see the software packages mentioned as first-generation monitoring. This is mainly due to the way alerts are generated. For example, alerts are often configured to go off when a gigabit interface reaches 90% bandwidth utilization or when a hard disk has only 20% space left.
But is it useful to know that this much bandwidth is going through that specific interface at this moment? Maybe someone is watching a video on Netflix that first has to buffer. Is it useful to know that a hard drive only has 20% space left? If it is a small disk it might be, but on a 2 TB disk 20% is still 400 GB, which hardly seems worth mentioning.
Future-proof approach to Monitoring and Alerts
The only way to generate correct alerts, ones that actually require action, is to base them on metrics. Metrics, metrics and, I say again, metrics. You can collect these metrics by SNMP polling the devices, and a trend line can be mapped out from the collected data. Using the examples above, trend analysis tells us how often the gigabit interface actually reaches 90% utilization. We can also determine how long it will take before a disk is really full and whether action is required now.
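As a rough illustration of both questions, here is a small Python sketch. It assumes the polled values are already available as plain numbers (utilization in percent, disk usage in GB, one sample per day); the function names and the sample data are invented for this example. The trend line is a simple least-squares fit, which is only one of many possible models.

```python
from typing import Optional, Sequence, Tuple


def share_above(samples: Sequence[float], threshold: float) -> float:
    """Fraction of polled samples at or above the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for value in samples if value >= threshold) / len(samples)


def linear_trend(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares line through (time, value) points: (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("all samples have the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def days_until_full(points: Sequence[Tuple[float, float]],
                    capacity_gb: float) -> Optional[float]:
    """Days from the last sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking, so no alert is needed.
    """
    slope, intercept = linear_trend(points)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - points[-1][0])


if __name__ == "__main__":
    # Invented data: utilization in percent, disk usage as (day, used GB).
    utilization = [95.0, 40.0, 30.0, 92.0, 10.0]
    disk_usage = [(0, 1600.0), (1, 1610.0), (2, 1620.0), (3, 1630.0)]
    print(share_above(utilization, 90.0))
    print(days_until_full(disk_usage, 2000.0))
```

With numbers like these, a 2 TB disk that is 80% full but grows by only 10 GB per day still has weeks of headroom, while a small disk with the same percentage and a steeper slope may need action today.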
Collecting metrics opens doors. It gives much more insight into what is happening on the infrastructure, outages can be prevented instead of resolved, better advice can be given on capacity and growth, and it makes life easier for administrators.
The last huge difference is that all alerts are presented in the same, uniform way. We are no longer dependent on the different vendors!
Steps to improve or set up network monitoring
- Draw the network infrastructure in a real-time map view
No network administrator likes to make network drawings. Make it fun by playing with the real-time display of trends and data, and at the same time gain insight into the network and get up-to-date documentation.
- Transform from network monitoring to service and chain monitoring
By gaining insight into the end-to-end availability of the service, it also becomes clear what the impact is for the customer.
- Reclassify alerts based on impact
From the previous step we now know what the impact is for a customer. Many services are set up redundantly, which reduces the impact. Combine this with the real-time map from the first step and you have a real-time view of where the disruption occurs, without first having to search the various devices for half an hour.
- Create a performance baseline
By measuring the performance (response time) of the customer's services in combination with the availability and response time of the network, it can quickly be determined whether there is congestion.
- Work with trend analysis and forecast alerting
By making use of all collected data, many false alerts can be prevented. The same data can also be used to forecast future capacity and availability.
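Two of the steps above, reclassifying alerts by impact and comparing against a performance baseline, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the severity labels, the function names and the "mean plus three standard deviations" rule are my own choices for this example, not a standard.

```python
from statistics import mean, stdev
from typing import Sequence


def classify_impact(total_paths: int, failed_paths: int) -> str:
    """Rate an alert by what the customer notices, not by the device event.

    Hypothetical labels: a redundant service that loses one path is only
    'degraded'; it becomes an 'outage' when every path has failed.
    """
    if total_paths < 1 or not 0 <= failed_paths <= total_paths:
        raise ValueError("invalid path counts")
    if failed_paths == 0:
        return "ok"
    if failed_paths < total_paths:
        return "degraded"
    return "outage"


def deviates_from_baseline(history_ms: Sequence[float],
                           current_ms: float,
                           tolerance: float = 3.0) -> bool:
    """True when the current response time exceeds the baseline.

    The baseline is assumed to be mean + tolerance * standard deviation
    of earlier measurements; at least two measurements are required.
    """
    limit = mean(history_ms) + tolerance * stdev(history_ms)
    return current_ms > limit


if __name__ == "__main__":
    # Invented example: one of two redundant uplinks is down.
    print(classify_impact(total_paths=2, failed_paths=1))
    # Invented response times in milliseconds.
    history = [20.0, 22.0, 19.0, 21.0, 20.0]
    print(deviates_from_baseline(history, 45.0))
```

The point of both functions is the same: the alert is derived from the service and its history, so a failed redundant link or a single slow response does not page anyone unless the customer is actually affected.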