The figures in the recent New Relic 2024 State of Observability report are difficult to ignore. For financial services companies, the cost of high-impact IT outages has climbed to a median of $1.8 million per hour.
That is roughly $30,000 in losses every minute the systems are down.
The data points to a familiar set of root causes: network failures (37%), software deployment issues (34%), and environment changes (32%). Yet, the industry reaction is often to throw more high-cardinality data into expensive observability backends.
Having spent years covering this space as a Research VP at Gartner and leading product strategy at several observability vendors, I have seen the "three pillars" of observability evolve firsthand. I know the power of distributed tracing and APM. But I also know that if you are using a sledgehammer to crack a nut, you are wasting budget and time.
My time at Thomson Reuters, running the global monitoring function, gave me a direct window into this reality. I can attest to the absolute criticality of downtime in these financial services environments. For businesses built on high-velocity data automation, a lack of data availability isn't just an operational hiccup; it is an existential threat to the service. When automation drives the business, even a momentary gap in data flow can cascade into the massive losses detailed in the report.
If you don't get the infrastructure monitoring right, you are missing the boat entirely.
We need to stop conflating observability with monitoring, even though they are closely related. Technically, observability is a superset of monitoring: it not only collects metrics, logs, and traces, but lets you ask and answer questions you didn't know to ask in advance. Monitoring tells you what is broken based on predefined checks; observability helps you understand why it broke by letting you explore your system's behavior from any angle. But even though one contains the other, they serve different operational needs and carry vastly different cost structures in practice.
Observability is about unknown-unknowns. It requires ingesting massive amounts of high-cardinality data (traces, granular logs, and custom metrics) to debug complex microservice logic - the kind of issues where you don't know what went wrong until you start investigating. The cost per gigabyte for this ingestion is high because the computational overhead to index and query it is massive.
Monitoring is about known-unknowns. It is checking the pulse of the environment based on predefined metrics and thresholds. Are CPU, disk, and memory usage within limits? Are the application instances and processes themselves actually running?
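The binary nature of these checks is what makes them cheap to run at scale. A minimal sketch of a threshold evaluator, purely illustrative (the metric names and limits are my own, not from any specific product):

```python
# A toy "known-unknowns" monitoring check: predefined metrics compared
# against predefined thresholds. No indexing, no query engine, just a
# cheap comparison per metric.

THRESHOLDS = {
    "cpu_percent": 90.0,        # alert if CPU use exceeds 90%
    "disk_used_percent": 85.0,  # alert if the volume is more than 85% full
    "mem_used_percent": 95.0,
}

def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return an alert string for every metric over its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value:.1f} exceeds limit {limit:.1f}")
    return alerts

sample = {"cpu_percent": 97.2, "disk_used_percent": 40.0, "mem_used_percent": 60.0}
print(evaluate(sample))  # only the CPU check fires
```

The point is not the code but the cost profile: answering "is this within limits?" needs no premium ingestion pipeline behind it.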
Using a high-cost observability platform to check if a process is running or if a disk is full is economically inefficient. You do not need to pay for premium ingestion and retention to answer a binary question about resource availability. In the financial sector, where margins are scrutinized, paying observability rates for commodity monitoring data is a failure of strategy.
The challenge isn't just the invoice; it is often the complexity of how we gather data. While observability relies heavily on instrumentation, monitoring tools offer a broader range of collection methods.
Monitoring tools excel at auto-discovery because they leverage vast libraries of known protocols and APIs. Vendors have spent decades building definitions for SNMP, WMI, Modbus, REST APIs, and more. When you point a monitoring tool at a subnet, it doesn't just see an IP address; it recognizes a Cisco switch, a Dell server, or a NetApp storage array, and it knows exactly what metrics to pull.
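To make the recognition step concrete, here is a toy sketch of how protocol-based discovery can work: the tool reads a device's SNMP sysObjectID and matches it against known IANA enterprise numbers to decide which metric definitions to apply. The table is a tiny excerpt and the sample OID is an illustrative value; real products ship thousands of such definitions.

```python
# Toy protocol-based auto-discovery: map an SNMP sysObjectID prefix to a
# vendor via well-known IANA enterprise numbers. Real monitoring tools
# then load the matching library of metrics for that device class.

ENTERPRISE_NUMBERS = {
    "1.3.6.1.4.1.9.": "Cisco",    # ciscoSystems
    "1.3.6.1.4.1.674.": "Dell",   # Dell Inc.
    "1.3.6.1.4.1.789.": "NetApp", # Network Appliance
}

def identify(sys_object_id: str) -> str:
    """Return the vendor for a sysObjectID, or 'unknown'."""
    for prefix, vendor in ENTERPRISE_NUMBERS.items():
        if sys_object_id.startswith(prefix):
            return vendor
    return "unknown"

# sysObjectID as it might come back from an SNMP GET (illustrative value)
print(identify("1.3.6.1.4.1.9.1.2694"))  # Cisco
```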
This brings us to the agent vs. agentless debate. Monitoring tools are flexible enough to use both, and there are distinct pros and cons to each:

- Agent-based collection delivers deeper, higher-frequency metrics and keeps working when the target sits behind a firewall or NAT, but every agent is software you must deploy, secure, and keep patched across the estate.
- Agentless collection queries devices over standard protocols and APIs, so there is nothing to install on the target and rollout is fast, but it depends on credentials, network reachability, and whatever depth the protocol happens to expose.
Observability tools, by contrast, almost exclusively require heavy agents or code-level instrumentation. While Prometheus exporters and OpenTelemetry are becoming ubiquitous standards to bridge this gap, they still require significant configuration. For core infrastructure, the ability to simply authenticate against a known API or protocol remains the fastest path to visibility.
There is a dissonance between the marketing narrative of a "cloud-native world" and the reality of a bank's data center. Financial services infrastructure is heavily hybrid and will remain so for the foreseeable future.
We are not just talking about Kubernetes clusters; we are talking about mainframes, IBM i systems, proprietary trading appliances, and vast physical switching fabrics. These legacy assets are the backbone of transaction processing. They are stable, compliant, and secure.
However, these systems often do not support the modern agents required by observability vendors. You cannot install a Go agent on a legacy Cisco Catalyst switch or a facility cooling unit. You need standard, protocol-based monitoring, such as SNMP, WMI, Modbus, and IPMI.
When the report cites "network failures" as the top cause of outages, it is rarely a subtle latency issue within a service mesh. More often, it is a physical layer issue, a routing loop, or a capacity limit on an on-premises device. These are infrastructure problems, not code problems.
The $1.8 million-per-hour cost is a function of MTTR (Mean Time to Recovery). To reduce MTTR, you first need to reduce MTTI, or Mean Time to Innocence.
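The arithmetic is worth making explicit. Using the report's median figure (the 90- and 70-minute recovery times below are hypothetical examples, not from the report):

```python
# Back-of-the-envelope outage cost as a function of MTTR, using the
# report's median of $1.8M per hour for financial services.

COST_PER_HOUR = 1_800_000  # USD, median high-impact outage cost

def outage_cost(mttr_minutes: float) -> float:
    """Dollar cost of a single outage lasting mttr_minutes."""
    return COST_PER_HOUR * mttr_minutes / 60

# Shaving 20 minutes off a 90-minute recovery saves $600,000 per incident.
print(outage_cost(90) - outage_cost(70))  # 600000.0
```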
When an application slows down, the first question is always: "Is it the network?"
If your team has to sift through terabytes of log data or trace IDs to answer that question, you are losing money. You need a dedicated infrastructure view that can instantly rule out (or confirm) Layer 1 through Layer 3 issues. You need to know if there is packet loss on the WAN link or high CPU wait times on the hypervisor immediately.
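The rule-out step itself is trivial once the right signals exist. A toy sketch of the "innocence" check, with signal names and limits that are entirely illustrative:

```python
# Toy mean-time-to-innocence triage: given a few infrastructure signals,
# decide whether the network and hypervisor can be ruled out before the
# team starts digging through application logs and traces.

def network_exonerated(signals: dict[str, float]) -> bool:
    """True if the infrastructure looks healthy and the hunt can move up-stack."""
    return (
        signals.get("wan_packet_loss_percent", 0.0) < 1.0
        and signals.get("interface_errors_per_min", 0.0) < 10
        and signals.get("hypervisor_cpu_wait_percent", 0.0) < 5.0
    )

print(network_exonerated({"wan_packet_loss_percent": 0.2,
                          "interface_errors_per_min": 0,
                          "hypervisor_cpu_wait_percent": 1.5}))  # True
print(network_exonerated({"wan_packet_loss_percent": 4.0}))     # False
```

If this answer takes seconds instead of a log-search session, the minutes saved go straight against that $30,000-per-minute meter.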
At Paessler, we see this distinct need daily. Our financial services customers rely on PRTG not as a replacement for APM, but as the foundational source of truth for the physical and virtual estate. They need a tool that speaks the native protocols of their hardware, not just the HTTP headers of their web apps.
The industry hype cycle pushes us toward AI-driven root cause analysis and automated remediation. These are valuable goals. But you cannot automate recovery if you do not have reliable signals from the metal up.
Observability is the roof of the house; it protects you from the complex elements. But infrastructure monitoring is the foundation. If you build a heavy roof on a cracked foundation, that $1.8 million per hour cost will become your reality very quickly.