The figures in the recent New Relic 2024 State of Observability report are difficult to ignore. For financial services companies, the cost of high-impact IT outages has climbed to a median of $1.8 million per hour.
That is roughly $30,000 in losses every minute the systems are down.
The data points to a familiar set of root causes: network failures (37%), software deployment issues (34%), and environment changes (32%). Yet, the industry reaction is often to throw more high-cardinality data into expensive observability backends.
Having spent years covering this space as a Research VP at Gartner and leading product strategy at several observability vendors, I have seen the "three pillars" of observability evolve firsthand. I know the power of distributed tracing and APM. But I also know that if you are using a sledgehammer to crack a nut, you are wasting budget and time.
My time at Thomson Reuters, running the global monitoring function, gave me a direct window into this reality. I can attest to the absolute criticality of downtime in these financial services environments. For businesses built on high-velocity data automation, a lack of data availability isn't just an operational hiccup; it is an existential threat to the service. When automation drives the business, even a momentary gap in data flow can cascade into the massive losses detailed in the report.
If you don't get the infrastructure monitoring right, you are missing the boat entirely.
We need to stop conflating observability with monitoring, even though they are closely related. Technically, observability is a superset of monitoring: it not only collects metrics, logs, and traces, but lets you ask and answer questions you didn't know to ask in advance. Monitoring tells you what is broken based on predefined checks; observability helps you understand why it broke by letting you explore your system's behavior from any angle. But even though one contains the other, they serve different operational needs and carry vastly different cost structures in practice.
Observability is about unknown-unknowns. It requires ingesting massive amounts of high-cardinality data (traces, granular logs, and custom metrics) to debug complex microservice logic - the kind of issues where you don't know what went wrong until you start investigating. The cost per gigabyte for this ingestion is high because the computational overhead to index and query it is massive.
Monitoring is about known-unknowns. It is checking the pulse of the environment based on predefined metrics and thresholds. Are CPU, disk, and memory usage within limits? Are the application instances and processes themselves actually running?
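The binary nature of these checks is what makes them cheap to run at scale. A minimal sketch of a threshold evaluator, purely illustrative (the metric names and limits are my own, not from any specific product):

```python
# A toy "known-unknowns" monitoring check: predefined metrics compared
# against predefined thresholds. No indexing, no query engine, just a
# cheap comparison per metric.

THRESHOLDS = {
    "cpu_percent": 90.0,        # alert if CPU use exceeds 90%
    "disk_used_percent": 85.0,  # alert if the volume is more than 85% full
    "mem_used_percent": 95.0,
}

def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return an alert string for every metric over its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value:.1f} exceeds limit {limit:.1f}")
    return alerts

sample = {"cpu_percent": 97.2, "disk_used_percent": 40.0, "mem_used_percent": 60.0}
print(evaluate(sample))  # only the CPU check fires
```

The point is not the code but the cost profile: answering "is this within limits?" needs no premium ingestion pipeline behind it.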
Using a high-cost observability platform to check if a process is running or if a disk is full is economically inefficient. You do not need to pay for premium ingestion and retention to answer a binary question about resource availability. In the financial sector, where margins are scrutinized, paying observability rates for commodity monitoring data is a failure of strategy.
The challenge isn't just the invoice; it is often the complexity of how we gather data. While observability relies heavily on instrumentation, monitoring tools offer a broader range of collection methods.
Monitoring tools excel at auto-discovery because they leverage vast libraries of known protocols and APIs. Vendors have spent decades building definitions for SNMP, WMI, Modbus, REST APIs, and more. When you point a monitoring tool at a subnet, it doesn't just see an IP address; it recognizes a Cisco switch, a Dell server, or a NetApp storage array, and it knows exactly what metrics to pull.
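To make the recognition step concrete, here is a toy sketch of how protocol-based discovery can work: the tool reads a device's SNMP sysObjectID and matches it against known IANA enterprise numbers to decide which metric definitions to apply. The table is a tiny excerpt and the sample OID is an illustrative value; real products ship thousands of such definitions.

```python
# Toy protocol-based auto-discovery: map an SNMP sysObjectID prefix to a
# vendor via well-known IANA enterprise numbers. Real monitoring tools
# then load the matching library of metrics for that device class.

ENTERPRISE_NUMBERS = {
    "1.3.6.1.4.1.9.": "Cisco",    # ciscoSystems
    "1.3.6.1.4.1.674.": "Dell",   # Dell Inc.
    "1.3.6.1.4.1.789.": "NetApp", # Network Appliance
}

def identify(sys_object_id: str) -> str:
    """Return the vendor for a sysObjectID, or 'unknown'."""
    for prefix, vendor in ENTERPRISE_NUMBERS.items():
        if sys_object_id.startswith(prefix):
            return vendor
    return "unknown"

# sysObjectID as it might come back from an SNMP GET (illustrative value)
print(identify("1.3.6.1.4.1.9.1.2694"))  # Cisco
```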
This brings us to the agent vs. agentless debate. Monitoring tools are flexible enough to use both, and there are distinct pros and cons to each:

- Agent-based collection delivers deeper, higher-frequency metrics and keeps working when the target sits behind a firewall or NAT, but every agent is software you must deploy, secure, and keep patched across the estate.
- Agentless collection queries devices over standard protocols and APIs, so there is nothing to install on the target and rollout is fast, but it depends on credentials, network reachability, and whatever depth the protocol happens to expose.
Observability tools, by contrast, almost exclusively require heavy agents or code-level instrumentation. While Prometheus exporters and OpenTelemetry are becoming ubiquitous standards to bridge this gap, they still require significant configuration. For core infrastructure, the ability to simply authenticate against a known API or protocol remains the fastest path to visibility.
There is a dissonance between the marketing narrative of a "cloud-native world" and the reality of a bank's data center. Financial services infrastructure is heavily hybrid and will remain so for the foreseeable future.
We are not just talking about Kubernetes clusters; we are talking about mainframes, IBM i systems, proprietary trading appliances, and vast physical switching fabrics. These legacy assets are the backbone of transaction processing. They are stable, compliant, and secure.
However, these systems often do not support the modern agents required by observability vendors. You cannot install a Go agent on a legacy Cisco Catalyst switch or a facility cooling unit. You need standard, protocol-based monitoring, such as SNMP, WMI, Modbus, and IPMI.
When the report cites "network failures" as the top cause of outages, it is rarely a subtle latency issue within a service mesh. More often, it is a physical layer issue, a routing loop, or a capacity limit on an on-premises device. These are infrastructure problems, not code problems.
The $1.8 million-per-hour cost is a function of MTTR (Mean Time to Recovery). To reduce MTTR, you first need to reduce MTTI, or Mean Time to Innocence.
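The arithmetic is worth making explicit. Using the report's median figure (the 90- and 70-minute recovery times below are hypothetical examples, not from the report):

```python
# Back-of-the-envelope outage cost as a function of MTTR, using the
# report's median of $1.8M per hour for financial services.

COST_PER_HOUR = 1_800_000  # USD, median high-impact outage cost

def outage_cost(mttr_minutes: float) -> float:
    """Dollar cost of a single outage lasting mttr_minutes."""
    return COST_PER_HOUR * mttr_minutes / 60

# Shaving 20 minutes off a 90-minute recovery saves $600,000 per incident.
print(outage_cost(90) - outage_cost(70))  # 600000.0
```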
When an application slows down, the first question is always: "Is it the network?"
If your team has to sift through terabytes of log data or trace IDs to answer that question, you are losing money. You need a dedicated infrastructure view that can instantly rule out (or confirm) Layer 1 through Layer 3 issues. You need to know if there is packet loss on the WAN link or high CPU wait times on the hypervisor immediately.
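The rule-out step itself is trivial once the right signals exist. A toy sketch of the "innocence" check, with signal names and limits that are entirely illustrative:

```python
# Toy mean-time-to-innocence triage: given a few infrastructure signals,
# decide whether the network and hypervisor can be ruled out before the
# team starts digging through application logs and traces.

def network_exonerated(signals: dict[str, float]) -> bool:
    """True if the infrastructure looks healthy and the hunt can move up-stack."""
    return (
        signals.get("wan_packet_loss_percent", 0.0) < 1.0
        and signals.get("interface_errors_per_min", 0.0) < 10
        and signals.get("hypervisor_cpu_wait_percent", 0.0) < 5.0
    )

print(network_exonerated({"wan_packet_loss_percent": 0.2,
                          "interface_errors_per_min": 0,
                          "hypervisor_cpu_wait_percent": 1.5}))  # True
print(network_exonerated({"wan_packet_loss_percent": 4.0}))     # False
```

If this answer takes seconds instead of a log-search session, the minutes saved go straight against that $30,000-per-minute meter.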
At Paessler, we see this distinct need daily. Our financial services customers rely on PRTG not as a replacement for APM, but as the foundational source of truth for the physical and virtual estate. They need a tool that speaks the native protocols of their hardware, not just the HTTP headers of their web apps.
The industry hype cycle pushes us toward AI-driven root cause analysis and automated remediation. These are valuable goals. But you cannot automate recovery if you do not have reliable signals from the metal up.
Observability is the roof of the house; it protects you from the complex elements. But infrastructure monitoring is the foundation. If you build a heavy roof on a cracked foundation, that $1.8 million per hour cost will become your reality very quickly.