Enterprise IT — which we (Paessler) loosely define as infrastructure with over 1,000 devices — is a special beast to manage, with many challenges to overcome. One of the biggest issues that IT teams in these large environments face is alert noise. This happens when you're monitoring your infrastructure, network, storage, cloud services and other elements of your IT, and generating alerts and notifications for failures or impending failures. Too much alert noise makes it downright difficult to identify serious problems, and it might even mean ignoring alerts and missing what really matters. Your monitoring efforts are compromised, and the quality of your service goes down.
If you've got the feeling that your alerting situation is out of hand, I've got some suggestions for you to try out. And if you want other ideas on how to effectively monitor your infrastructure, download our free guide to monitoring enterprise IT.
But before we get into the suggestions for reducing alert noise, let's consider why there are so many alerts in the first place.
What causes alert noise, and why it's dangerous
Enterprise IT environments are complex and heterogeneous, with hundreds or thousands of devices and applications from many different vendors. Usually you end up with a variety of different monitoring tools, each of them monitoring separate aspects of your IT and each sending a lot of notifications and alerts for every detail within its monitoring range. Because there are so many alerts from so many different tools, it’s almost impossible to match an alert to the responsible team member. All alerts end up in a central IT inbox, and it's easy to lose the overview.
Another problem is assigning the correct importance to an alert. A simple example: an alert that a server is running out of memory is helpful when you’re only running five servers in your server room. If you have 5,000 servers in your data center, the same alert is just one of many and tells you little on its own. You need to be able to prioritize and focus on the most important alerts: those for issues that will cause major failures and outages.
Another common characteristic of large infrastructure is distributed networks, where the enterprise is spread over various locations, often with more than one data center. If you're managing the networks in a central location, all alerts are probably being sent to the central team.
All of this means that it's easy for alerts to get out of hand. And the fact is: if you have too many alerts, they might become meaningless. Either that, or the noise means that important indications of failure are missed. Or both.
How to reduce alert noise
Reducing alert noise in your organization takes a combination of careful, strategic planning and the right monitoring tool. Here are four ways to do it.
1) One monitoring tool
Consolidating your monitoring into one tool has benefits beyond alerting, but for this article we'll only focus on alerting. Instead of multiple tools with different ways of alerting you when there are problems, you should aim to have a single tool. This means that when there are alerts, you have one source to go to in order to find the underlying problem.
Additionally, because different tools handle alerting and notifications differently, a single tool means that you can apply the same philosophy across the board.
2) Set the right thresholds
Alerts are based on thresholds: when a device gets hotter than a certain temperature, or when available storage drops below a certain number of gigabytes, an alert is triggered. So it stands to reason that good alert management is based on setting the right thresholds. Set them too sensitively, and you'll get inundated with alerts; set them too leniently, and you won't get notified of an issue until it's already too late. (Which direction counts as "too sensitive" depends on the metric: for temperature it is a limit that is too low, for free storage it is a limit that is too high.)
That might sound easy, but how do you manage this when you have thousands of monitored devices and applications? That’s exactly why it is crucial to have a monitoring solution that offers automation and other mechanisms like inheriting thresholds for groups of devices.
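To make the idea of inherited thresholds concrete, here is a minimal sketch in Python. It does not show how any particular monitoring product works; the `Node` class, the `_max`/`_min` naming convention, and all device names and limits are assumptions made up for illustration. A device uses its own limit if one is set, and otherwise falls back to the nearest parent group that defines one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A group or device in a hypothetical monitoring tree."""
    name: str
    parent: Optional["Node"] = None
    # Limits set directly on this node, e.g. {"temp_c_max": 70}.
    limits: dict = field(default_factory=dict)

    def effective_limit(self, key):
        """Walk up the tree until some node defines the limit."""
        node = self
        while node is not None:
            if key in node.limits:
                return node.limits[key]
            node = node.parent
        return None


def breaches(node, readings):
    """Return alert messages for readings that cross an inherited limit."""
    alerts = []
    for metric, value in readings.items():
        upper = node.effective_limit(metric + "_max")
        lower = node.effective_limit(metric + "_min")
        if upper is not None and value > upper:
            alerts.append(f"{node.name}: {metric}={value} above {upper}")
        if lower is not None and value < lower:
            alerts.append(f"{node.name}: {metric}={value} below {lower}")
    return alerts


# Limits are defined once at the top and overridden only where needed.
datacenter = Node("datacenter", limits={"temp_c_max": 70, "free_gb_min": 50})
db_group = Node("db-servers", parent=datacenter, limits={"free_gb_min": 200})
db01 = Node("db01", parent=db_group)
web01 = Node("web01", parent=datacenter)

readings = {"temp_c": 65, "free_gb": 120}
print(breaches(db01, readings))   # db01 inherits the stricter storage limit
print(breaches(web01, readings))  # web01 only has the data center defaults
```

With the same readings, the database server alerts on free storage while the web server stays quiet, because only the database group overrides the default. Changing a limit for thousands of devices then means editing one group rather than each device.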
3) Define response teams and filter alerts
This one requires that you have a monitoring tool with comprehensive rights and roles functionality. This will let you easily create roles and responsibilities for specific teams (or even individuals), and filter alerts accordingly.
For your monitoring concept, define the user groups according to the areas that they focus on. Then define notifications for failures in those areas to go to the specific teams that need to know. For example, you might have an IT team that handles your online store, and another team that handles the E-mail services. In this example, you would configure notifications so that the team handling the online store only receives alerts relevant to that area, and likewise for the team handling the E-mail services.
This way, alerts get sent only to the relevant teams.
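The routing described above can be sketched in a few lines of Python. This is only an illustration under assumed names: the area labels, team names, the `central-it` fallback, and the shape of an alert (a dict with `area` and `message`) are all hypothetical, and a real monitoring tool would handle this through its own roles and notification settings.

```python
from collections import defaultdict

# Hypothetical mapping of monitored areas to the responsible team.
TEAM_FOR_AREA = {
    "online-store": "store-team",
    "email": "mail-team",
}
# Assumed catch-all for alerts from areas nobody has claimed yet.
FALLBACK_TEAM = "central-it"


def route_alerts(alerts):
    """Group alert messages by the team responsible for their area."""
    inboxes = defaultdict(list)
    for alert in alerts:
        team = TEAM_FOR_AREA.get(alert["area"], FALLBACK_TEAM)
        inboxes[team].append(alert["message"])
    return dict(inboxes)


alerts = [
    {"area": "online-store", "message": "Checkout API response time high"},
    {"area": "email", "message": "Mail queue is growing"},
    {"area": "printing", "message": "Print server unreachable"},
]

for team, messages in route_alerts(alerts).items():
    print(team, messages)
```

The fallback team is a design choice in this sketch: it keeps unmapped alerts visible instead of silently dropping them, and a growing fallback inbox is a hint that an area still needs an owner.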
4) Define high-level alerts for management and business stakeholders
Not everyone in your organization needs to know what's going on behind the scenes of your infrastructure. Your IT teams do, sure, but quite often decision makers, management, and other business stakeholders only need to know the health of the network at a very high level.
A good strategy is to organize your infrastructure into IT services according to business processes. For example: your company’s E-mail service, the licensing system, or software build processes are all IT services provided by several connected bits of hardware, software and connectivity.
Let's take the E-mail service as an example. You would map the mail server, storage servers, and the internet connection components of your infrastructure to the "E-mail" business service. Now, if there is a minor failure in one of those components, such as a redundant mail server with performance problems, the E-mail service itself is not endangered because there are fail-over mail servers available. In this instance, only the IT teams responsible get alerted about the performance issues of the server. Management does not need to be alerted.
However, if there is a service-critical problem — maybe a crash of the core switch all mail data passes through — then the E-mail service itself is endangered, and an alert can be sent to relevant management members or stakeholders.
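As a rough sketch of this escalation logic in Python: the service is modeled as groups of redundant components, the IT team hears about any affected component, and management is only notified when an entire group is out, which is taken here to mean the service itself is endangered. The component names, the recipient labels, and the rule "a group with no working member endangers the service" are assumptions for illustration, not a description of any specific product.

```python
def recipients(service, affected):
    """Decide who gets notified about problems in a business service.

    service:  dict mapping a group name to its redundant components
    affected: set of component names that are failed or degraded
    """
    notify = set()

    # Any affected component in the service is relevant to the IT team.
    if any(c in affected for members in service.values() for c in members):
        notify.add("it-team")

    # The service is endangered once a whole group has no working member.
    endangered = any(
        members and all(c in affected for c in members)
        for members in service.values()
    )
    if endangered:
        notify.add("management")

    return notify


# Hypothetical mapping of infrastructure to the "E-mail" business service.
email_service = {
    "mail servers": ["mail01", "mail02"],
    "storage": ["stor01", "stor02"],
    "core switch": ["core-sw"],
}

print(recipients(email_service, {"mail02"}))   # one redundant server affected
print(recipients(email_service, {"core-sw"}))  # single point of failure is down
```

In the first call only the IT team is notified, because `mail01` can still carry the load. In the second, the core switch has no redundant partner, so management is included as well.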
How to monitor large-scale infrastructure
Managing alerts is just one piece of monitoring enterprise IT. The full guide includes tips on how to segment your network and a checklist for selecting a monitoring tool.
And I'd love to hear about your biggest challenge with monitoring enterprise IT! Let me know in the comments below.