One of the buzzwords we constantly come across when answering PRTG requests is “SLA Reporting”. To keep up with demand, one of our partners created a PRTG plugin for SLA monitoring, and back in March of this year my colleague Sascha wrote a blog post on it. But what exactly is an SLA, when is it required, and what does it have to do with monitoring?
First things first, SLA is an abbreviation for service-level agreement. These agreements are usually made between a (service) provider and their customers and contain the details on what services will be provided and how the stability of the services can be ensured.
And whenever you hear that something needs to be guaranteed, you will of course want to keep an eye on it – or, in other words, monitor it. Common metrics for SLAs are the mean time between/to failures (MTBF/MTTF), the mean time to repair/recovery (MTTR), and uptime. To understand SLA monitoring a bit better, let’s dive into what these numbers are and how they differ:
The first metric measures how much time elapses, on average, before a failure occurs. If it is a system that can be repaired, the metric is referred to as “mean time between failures”, since we have to be realistic and expect more than one failure. And if we are referring to something that cannot be salvaged after a failure, we call it “mean time to failure” (Oh, and in case you were wondering: yes, my source is Wikipedia).
Once a failure has occurred, the goal is to get everything back up and running as fast as possible. The time that elapses between the outage and getting everything back up and running is the time to repair, and the average of that time across all outages is the mean time to repair. Ideally, this number is as low as possible.
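To make these two averages a bit more tangible, here is a minimal Python sketch. It assumes you already have a list of outages as (down, up) timestamp pairs for a reporting period; the function name `mean_times` and the sample data are made up for illustration and are not part of PRTG.

```python
from datetime import datetime, timedelta

def mean_times(outages, period_start, period_end):
    """Return (MTBF, MTTR) for a repairable system.

    outages: list of (down_at, up_at) datetime pairs that lie inside the
    reporting period and do not overlap (an assumption of this sketch).
    """
    if not outages:
        # No failures in the period: neither average is defined.
        return None, None
    downtime = sum((up - down for down, up in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    mtbf = uptime / len(outages)    # operational time per failure
    mttr = downtime / len(outages)  # repair time per failure
    return mtbf, mttr

# Hypothetical 30-day reporting period with two outages.
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)
outages = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),  # 1 hour
    (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 5, 0)),  # 3 hours
]

mtbf, mttr = mean_times(outages, period_start, period_end)
print("MTBF:", mtbf)
print("MTTR:", mttr)
```

With this sample data, the 720 hours of the period split into 716 hours of operation and 4 hours of outage, so the MTBF works out to 358 hours and the MTTR to 2 hours.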
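Uptime is one of our favorite topics and ultimate goals (have you heard of our Uptime Alliance program?). Many providers strive to achieve 99.999% uptime for their services within one year, which leaves room for only about five minutes of downtime per year. Uptime is basically the sum of all the periods between failures, which is why it is so closely tied to MTBF/MTTF.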
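If you want to see what an uptime target means in practice, the arithmetic is short. The sketch below assumes a 365-day year and uses the common steady-state formula availability = MTBF / (MTBF + MTTR); the function names are hypothetical and only serve as an illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year

def allowed_downtime_minutes(target_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget (in minutes) that an uptime target leaves in a period."""
    return period_minutes * (1 - target_percent / 100)

def availability_percent(mtbf_hours, mttr_hours):
    """Steady-state availability derived from MTBF and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

# Downtime budgets for a few typical SLA targets.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")

# Availability for a hypothetical MTBF of 358 hours and MTTR of 2 hours.
print(f"Availability: {availability_percent(358, 2):.3f}%")
```

For the "five nines" target, the budget comes out at a little over five minutes per year, which shows how little room such an SLA leaves for repairs.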
And if you are still asking yourself why it is so important to actually track your SLAs, here's another reason:
Even if you have services that are running smoothly, having figures to prove this is more than just a nice-to-have. You can show your customers that you are good at what you do and that the money they pay you is well invested. Additionally, proving that you can keep your promises can serve as an incentive for future clients.
As mentioned before, one of our partners built a plugin that automates the monitoring and reporting for you. But of course there are other things you can do with PRTG with regard to SLA monitoring. First of all, if you have your own tool, you can use our API to export the data (even in raw format!). And in case you do not feel like scripting anything on your own, just check out this script from one of our customers.
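As a rough idea of what such an export could look like, here is a small Python sketch that only builds the request URL for a sensor's historic data. The endpoint `/api/historicdata.csv` and the parameters `id`, `avg`, `sdate`, `edate`, `username` and `passhash` reflect my understanding of the PRTG HTTP API and should be treated as assumptions; please check them against the API documentation of your PRTG version. The host name, sensor ID and credentials are placeholders.

```python
from datetime import datetime
from urllib.parse import urlencode

def historic_data_url(base_url, sensor_id, start, end, username, passhash):
    """Build a URL for exporting a sensor's historic data as CSV.

    Endpoint and parameter names are assumptions based on the PRTG HTTP API;
    verify them for your installation before relying on this.
    """
    date_format = "%Y-%m-%d-%H-%M-%S"
    params = {
        "id": sensor_id,
        "avg": 0,  # 0 is assumed to request raw (non-averaged) values
        "sdate": start.strftime(date_format),
        "edate": end.strftime(date_format),
        "username": username,
        "passhash": passhash,
    }
    return f"{base_url.rstrip('/')}/api/historicdata.csv?{urlencode(params)}"

# Placeholder values for illustration only.
url = historic_data_url(
    "https://prtg.example.com",
    sensor_id=2001,
    start=datetime(2024, 1, 1),
    end=datetime(2024, 1, 31),
    username="report_user",
    passhash="0000000000",
)
print(url)
```

The resulting URL can then be fetched with the HTTP client of your choice and fed into your own reporting tool.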
Another useful feature of PRTG with regard to monitoring how well you comply with your SLA is the option to set thresholds as required. For example: you might be providing/managing bandwidth for your customers, and while the internet connection is theoretically still up, it may have become unacceptably slow. While this is not a classical case of failure, you might still have ‘failed’ your customer. And PRTG can make you aware of that 😉
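The difference between "up" and "good enough" is easy to show with a few numbers. The following Python sketch is not how PRTG evaluates thresholds internally; it simply assumes you have a list of bandwidth samples in Mbit/s and a minimum agreed in the SLA, both of which are invented here.

```python
def percent_meeting(samples, predicate):
    """Percentage of samples for which predicate(sample) is true."""
    if not samples:
        raise ValueError("at least one sample is required")
    return 100 * sum(1 for s in samples if predicate(s)) / len(samples)

# Hypothetical bandwidth measurements in Mbit/s; 0 means the line was down.
samples_mbps = [95, 90, 12, 8, 97, 0, 93, 96]
min_mbps = 50  # hypothetical minimum bandwidth promised in the SLA

reachable = percent_meeting(samples_mbps, lambda s: s > 0)
compliant = percent_meeting(samples_mbps, lambda s: s >= min_mbps)

print(f"Connection up:        {reachable:.1f}% of samples")
print(f"Within SLA threshold: {compliant:.1f}% of samples")
```

In this made-up data set the connection was up for 7 of 8 samples (87.5%), but only 5 of 8 (62.5%) met the agreed minimum, and that gap is exactly what a threshold brings to light.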
I hope this blog post was able to give you a little more insight into measuring SLAs with PRTG, and as always, your thoughts and comments are highly appreciated!