When I came back from lunch yesterday and wanted to prepare for my next meeting, I noticed that I couldn't access my emails, my Microsoft Teams client couldn't connect to the outside world, and even loading a simple website was painfully slow.
Just a few minutes later, our IT team reported an internal issue at our data center provider. Our servers in the data center were all up and running; we "just" could no longer reach them. It quickly became apparent that a DDoS attack had largely taken down the provider's network.
After about 3 hours of downtime, which we bridged with "analog activities" (we can implement the Clean Desk Policy in our office now!), we were able to access our resources again.
The uptime of the IT infrastructure - an important KPI in many companies and IT departments - was unaffected by the incident, since the servers never stopped running. The availability of applications and services, however, suffered.
During this time, we were all shown once again how dependent we are on the availability of IT systems. We can't imagine what would have happened if we had been unable to access our systems for even longer, perhaps even several days!
(System) Uptime != (Service) Availability
This fact leads me to the question: What is uptime, what is availability, and how do both differ?
Uptime is a measure of system reliability, expressed as the percentage of time a machine, typically a computer, has been working and available.
In other words, uptime only tells you that a system is up and running. By definition, it does not tell you that all the necessary applications and services are ready for use, or that a service such as the network delivers the expected bandwidth.
Availability is the probability that a system will work as required when required during the period of a mission.
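To make the distinction measurable, availability is often estimated from the mean time between failures (MTBF) and the mean time to repair (MTTR). Here is a minimal sketch in Python; the formula is the standard steady-state approximation, and the numbers are purely illustrative:

```python
# Steady-state availability estimated from MTBF and MTTR.
# The input values below are purely illustrative.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system works as required."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails on average every 2,000 hours and takes
# 3 hours to repair is available about 99.85% of the time:
print(f"{availability(2000, 3):.5%}")  # -> 99.85022%
```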
Looking at a production environment, the difference between uptime and availability can best be compared with the difference between OEE (Overall Equipment Effectiveness) and TEEP (Total Effective Equipment Performance), which take into account all events that bring production down.
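For readers who don't work in manufacturing: OEE is commonly calculated as Availability × Performance × Quality over the planned production time, while TEEP additionally factors in utilization, i.e. planned production time versus the full 24/7 calendar. A rough sketch with invented figures:

```python
# OEE  = Availability x Performance x Quality (within planned production time)
# TEEP = OEE x Utilization (planned production time vs. full calendar time)
# All figures below are invented for illustration.

availability = 0.90  # share of planned time the line actually ran
performance = 0.95   # actual speed vs. ideal speed
quality = 0.99       # share of good parts produced
utilization = 0.60   # planned production time vs. 24/7 calendar time

oee = availability * performance * quality
teep = oee * utilization
print(f"OEE:  {oee:.1%}")   # -> 84.6%
print(f"TEEP: {teep:.1%}")  # -> 50.8%
```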
The Five Nines
I'm sure you've heard of the "Five Nines" before. The term commonly stands for 99.999% and refers to uptime or availability.
While five nines are the optimum (short of the fabled value of 100%), the concept also covers availability levels with fewer nines. All this leads us to the Table of Nines:
| Availability Level | Uptime | Downtime per Year | Downtime per Day |
|---|---|---|---|
| 1 Nine | 90% | 36.5 days | 2.4 hours |
| 2 Nines | 99% | 3.65 days | 14.4 minutes |
| 3 Nines | 99.9% | 8.76 hours | 86.4 seconds |
| 4 Nines | 99.99% | 52.6 minutes | 8.6 seconds |
| 5 Nines | 99.999% | 5.26 minutes | 0.86 seconds |
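If you want to verify or extend this table yourself, the arithmetic is simple: the permitted downtime is just the complement of the availability, applied to the period in question. A quick sketch:

```python
# Permitted downtime per year and per day for 1 to 5 nines.

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

for nines in range(1, 6):
    avail = 1 - 10 ** -nines  # 0.9, 0.99, 0.999, ...
    down = 1 - avail          # permitted downtime as a fraction
    print(f"{nines} nine(s), {avail:.5%} uptime: "
          f"{down * SECONDS_PER_YEAR / 3600:8.2f} hours/year, "
          f"{down * SECONDS_PER_DAY:7.2f} seconds/day")
```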
As a rule of thumb: the more nines a provider's Service Level Agreement guarantees, the more money you will have to invest in the service.
100% Uptime Is Irrelevant Nowadays
Statements like this one are heard more and more often these days. They do not come from admins or operators of a data center farm, but from application administrators. In times of high availability, distributed systems, and container solutions, the administrator of a particular application no longer has to rely on a single piece of hardware. What matters far more is that the service itself, i.e. the underlying business process, is available and operational at all times.
The fabled 100% uptime is, and always has been, an unattainable objective. With today's high availability solutions, an application or service can remain available even while hardware updates are installed, because the application can be moved dynamically and without interruption to another hardware system. The physical component, however, still requires a restart, which inevitably means downtime - and with it less than 100% uptime.
Effects on Monitoring
Besides the monitoring of hardware and components, the monitoring of complex, interconnected business processes is becoming more and more important. The administrator of an email system may no longer need to know how many megabytes of RAM the hardware is currently using. For them, it is far more interesting whether the mailboxes are available, whether clients can access the server fast enough, whether the POP and SMTP services are running, and whether the Active Directory connection is stable.
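To illustrate that shift in perspective: instead of reading hardware counters, a service-level check might simply verify that the SMTP port answers and how long that takes. A minimal sketch in Python; the host name and the latency threshold are placeholders, and a real setup would of course use a monitoring tool rather than a script:

```python
import smtplib
import time

HOST = "mail.example.com"  # placeholder, use your own mail server
PORT = 25
TIMEOUT_S = 5
MAX_LATENCY_S = 1.0        # illustrative threshold

def check_smtp(host: str, port: int) -> tuple[bool, float]:
    """Return (reachable, seconds for a connect plus NOOP round trip)."""
    start = time.monotonic()
    try:
        with smtplib.SMTP(host, port, timeout=TIMEOUT_S) as smtp:
            smtp.noop()  # lightweight request/response round trip
        return True, time.monotonic() - start
    except (OSError, smtplib.SMTPException):
        return False, time.monotonic() - start

ok, latency = check_smtp(HOST, PORT)
if ok and latency <= MAX_LATENCY_S:
    print(f"SMTP OK ({latency:.2f} s)")
else:
    print(f"SMTP check failed or too slow ({latency:.2f} s)")
```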
This requires that the service processes are clearly defined and implemented as thoroughly and transparently as possible in the monitoring environment. Find out more about the PRTG Business Process Sensor.
Keep an Eye on Existing SLAs
For SLAs, for instance from data center operators or web hosting providers, I recommend taking a close look at how uptime is defined in the individual contract. Many providers limit their uptime guarantee to pure hardware availability and say nothing about service or process availability. I can imagine that your contracts also offer room for improvement in this regard. Talk to your suppliers and minimize risks wherever it makes sense!
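If you record availability yourself, checking it against the contractually guaranteed value is trivial, and it gives you hard numbers for the conversation with your supplier. A small sketch; the figures are invented, but the three-hour outage should look familiar:

```python
# Compare measured service availability against an SLA target.
# All figures are invented for illustration.

SLA_TARGET = 0.999      # "three nines" promised in the contract

minutes_in_month = 30 * 24 * 60
downtime_minutes = 180  # e.g. a three-hour outage like ours

measured = 1 - downtime_minutes / minutes_in_month
print(f"Measured availability: {measured:.4%}")  # -> 99.5833%
print("SLA met" if measured >= SLA_TARGET else "SLA breached")
```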
👉 See also: Monitor your SLAs with Paessler PRTG
Your Downtime Stories
It is always interesting to hear other admins' stories. Do you have a personal downtime experience that you would like to share? Whether it was a strange network failure or a server that disappeared into Nirvana, we are curious to read your lessons learned! You'll find the comment box just below!