Paessler Blog - All about IT, Monitoring, and PRTG

The Storage Monitoring Matrix: 10 Critical Metrics Every IT Team Must Track

Written by Sascha Neumeier | Apr 22, 2026

By the time you know there's a storage problem, it's already too late.

A LUN fills up overnight. A SAN starts throwing latency spikes nobody noticed. A backup job silently fails for three weeks. These aren't edge cases. They're what happens when IT teams rely on reactive monitoring instead of a structured approach.

That's where a storage monitoring matrix comes in. Think of it as your reference framework for comprehensive infrastructure monitoring: a structured map of the key performance metrics your storage systems should be generating, what thresholds matter, and why each one deserves a place in your dashboards. Whether you're managing on-prem storage, cloud environments, or a hybrid cloud mix of both, the right metrics give you the real-time visibility to catch problems before they become outages.

In this guide, we cover the 10 critical metrics every IT team should be tracking, and what your monitoring tools should be doing with them. For a broader look at why storage monitoring matters in the first place, start with Why You Definitely Need to Monitor Your Storage Infrastructure.

What Is a Storage Monitoring Matrix?

A storage monitoring matrix is a structured reference that maps your storage environments to the specific metrics, thresholds, and alerting rules that keep them healthy. Rather than monitoring everything at once with no clear priority, a matrix gives your team a shared framework: what to watch, what's normal, and when to act.

When it works, you stop chasing fires and start catching problems before they turn into 2 a.m. phone calls. That's the whole point of building one.

Quick Reference: Storage Monitoring Matrix

Use this table as your at-a-glance reference. Each metric, its threshold, and the corresponding PRTG sensor type are mapped below. The full detail for each is in the sections that follow.

| Metric | What It Measures | Alert Threshold | PRTG Sensor |
|---|---|---|---|
| Capacity Utilization | Used vs. available space per volume/LUN | Alert: 75% / Critical: 90% | Disk Free Sensor |
| IOPS | Read/write operations per second | Alert: >80% of rated IOPS | SNMP / Storage Sensor |
| Latency | I/O request response time (ms) | Warn: >20 ms / Critical: >50 ms | Storage Performance Sensor |
| Throughput | Data volume moved (MB/s) | Alert: >80% of rated capacity | SNMP / NetFlow Sensor |
| Disk Health / RAID | S.M.A.R.T. data + RAID array status | Any degraded array = immediate alert | SMART Sensor / SNMP |
| SAN Fabric | Port errors, link utilization, HBA health | Port error rate >0.1% | SNMP / Fibre Channel Sensor |
| NAS Availability | Share uptime + protocol response times | Any unavailability = immediate alert | NAS Sensor / Ping |
| Storage Pool Allocation | Thin provisioning ratios + pool consumption | Alert: >80% physical pool used | SNMP / REST API Sensor |
| Backup / Replication | Job status, replication lag, RPO/RTO | Any failed job = immediate alert | HTTP / Script Sensor |
| Multi-Vendor Visibility | Cross-vendor health + unified alerting | Vendor-specific thresholds | REST API / SNMP Sensor |

1. Storage Capacity Utilization — Know Before You Hit the Wall

Nothing derails an IT team faster than a data storage volume that fills up without warning. Capacity utilization tracks the percentage of used versus available space across your volumes, LUNs, storage pools, and file systems, and it's the single most important metric in any storage monitoring matrix.

The problem isn't that teams don't know capacity matters. It's that they don't monitor it with enough granularity. Tracking total storage capacity at the array level tells you very little. What you need is per-volume, per-LUN visibility with trend data, so you can project when you'll hit the wall, not just see that you already have.

What to monitor:

  • Used vs. available space per volume and LUN
  • Growth rate trends (daily, weekly, monthly)
  • Days-to-full projections based on current growth
  • Snapshot and thin-provisioned volume consumption

Threshold guidance: Set alerts at 75% utilization, critical at 90%. For fast-growing volumes, trigger alerts based on days-to-full. For example, alert when a volume is projected to fill within 14 days.
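The days-to-full projection is simple arithmetic once you track growth rate. Here's a minimal Python sketch of the threshold logic above; the function names and the linear-growth assumption are ours, not PRTG's:

```python
def days_to_full(capacity_gb, used_gb, daily_growth_gb):
    """Project days until a volume fills, assuming linear growth."""
    if daily_growth_gb <= 0:
        return None  # flat or shrinking usage: no meaningful projection
    return (capacity_gb - used_gb) / daily_growth_gb

def capacity_state(capacity_gb, used_gb, daily_growth_gb,
                   warn_pct=75, crit_pct=90, days_limit=14):
    """Apply the thresholds above: warn at 75%, critical at 90%,
    and warn early when the volume is projected to fill within 14 days."""
    pct = 100 * used_gb / capacity_gb
    projection = days_to_full(capacity_gb, used_gb, daily_growth_gb)
    if pct >= crit_pct:
        return "critical"
    if pct >= warn_pct or (projection is not None and projection <= days_limit):
        return "warning"
    return "ok"
```

Feeding this from per-volume trend data, rather than array totals, is what makes the projection actionable.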

Proactive capacity planning starts with this metric. For a deeper look at how to set it up, see Storage Capacity Planning with PRTG.

2. IOPS — The Heartbeat of Storage Performance

IOPS (Input/Output Operations Per Second) measures how many read and write operations your storage can handle per second. It's the closest thing to a heartbeat reading for your storage systems, and when it flatlines or spikes, apps feel it immediately.

IOPS saturation is probably the most common cause of application slowdowns that end up getting blamed on the network or the servers. Storage rarely gets the first call.

What to monitor:

  • Peak vs. average IOPS per disk, LUN, and volume
  • Read/write ratio (read-heavy vs. write-heavy workloads behave differently)
  • Queue depth: a rising queue depth alongside high IOPS signals saturation
  • Per-LUN IOPS breakdown to isolate which workloads are driving load

Threshold guidance: IOPS limits are vendor-specific, but a sustained load above 80% of rated IOPS is a reliable alert trigger. Don't wait for 100%. By then, latency has already degraded and users are already feeling it.
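As a rough sketch of that rule in Python: averaging over a sample window (so a single spike doesn't page anyone, but sustained load does) is an assumption on our part, not a PRTG default:

```python
def iops_status(samples, rated_iops, threshold=0.80):
    """Alert when *sustained* load, averaged over the sample window,
    exceeds 80% of rated IOPS, rather than reacting to one spike."""
    avg = sum(samples) / len(samples)
    return {
        "avg_iops": avg,
        "utilization_pct": 100 * avg / rated_iops,
        "alert": avg >= threshold * rated_iops,
    }
```

Pair the output with queue depth from the same interval: high utilization plus a growing queue is the saturation signature described above.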

Monitoring IOPS alongside latency (see metric #3) tells you a lot more about what's actually going on than either metric alone.

3. Latency — The Silent Performance Killer You Can't Ignore

Latency is the time it takes for a storage I/O request to complete, measured in milliseconds. It's the metric that tells you how your storage systems actually feel to the applications and users depending on them. And it's often the first sign of trouble, appearing well before IOPS max out or capacity runs low.

High latency is the canary in the coal mine for storage bottlenecks. A sudden spike in read or write latency usually points to a specific root cause: a saturated disk, a misconfigured RAID array, a failing drive, or a congested SAN fabric. The faster you catch it, the faster you can troubleshoot and resolve it before it cascades into an outage.

What to monitor:

  • Average read and write latency per LUN and volume
  • Latency spikes and their correlation with IOPS peaks
  • Queue depth trends (high queue depth plus high latency equals saturation)
  • Per-controller and per-port latency on SAN environments

Threshold guidance: Under 5ms is healthy for most workloads. Above 20ms warrants investigation. Above 50ms is critical: expect application impact. For latency-sensitive workloads like databases and VMs, tighten these thresholds accordingly.
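Those thresholds map to a simple severity classifier. A minimal Python sketch, with defaults taken from the guidance above; tighten the parameters for latency-sensitive workloads like databases and VMs:

```python
def latency_state(ms, healthy=5.0, warn=20.0, crit=50.0):
    """Classify a latency sample against the thresholds above."""
    if ms > crit:
        return "critical"      # expect application impact
    if ms > warn:
        return "investigate"
    if ms < healthy:
        return "healthy"
    return "acceptable"        # normal range between 5 ms and 20 ms
```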

For a practical guide to building latency observability into your storage performance monitoring with PRTG, see Storage Performance Monitoring with PRTG.


4. Throughput — Measure the Real Data Flow Across Your Storage Systems

While IOPS counts operations, throughput measures the actual volume of data moving through your storage, expressed in MB/s. Both metrics matter, but they tell different stories. A storage system can be servicing a high rate of small I/O operations while throughput stays low, or pushing massive sequential data transfers with relatively few operations.

Throughput drops are a reliable signal of network or controller bottlenecks that IOPS alone won't reveal. If your backup jobs are suddenly taking twice as long, or large file transfers are crawling, throughput is where you'll find the answer.

What to monitor:

  • Read vs. write throughput per volume and controller
  • Peak throughput periods and their correlation with scheduled jobs (backups, replication)
  • Sustained throughput vs. rated link capacity
  • Throughput trends over time to identify gradual degradation

Threshold guidance: Alert when sustained throughput consistently exceeds 80% of your storage controller's or network link's rated capacity. Sudden drops in throughput, even without hitting limits, are worth investigating as early signs of hardware degradation.
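Both signals, near-capacity and sudden-drop, fit in a few lines of Python. A sketch only: the 50% drop-from-baseline trigger is an illustrative value we chose, not part of the guidance above:

```python
def throughput_alerts(current_mbps, rated_mbps, baseline_mbps,
                      cap_threshold=0.80, drop_threshold=0.50):
    """Flag sustained use above 80% of rated capacity, and sudden
    drops well below the rolling baseline (possible degradation)."""
    alerts = []
    if current_mbps >= cap_threshold * rated_mbps:
        alerts.append("near-capacity")
    if baseline_mbps > 0 and current_mbps < drop_threshold * baseline_mbps:
        alerts.append("sudden-drop")
    return alerts
```

The baseline would come from your own trend data, e.g. a rolling average over the same time-of-day window.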

Tracking throughput alongside IOPS and latency is core to any solid storage performance monitoring setup. It's how you get a real read on what your storage environments are actually doing under load.

5. Disk Health and RAID Status — Catch Failures Before They Cascade

Individual disk failures are inevitable. What's not inevitable is being caught off guard by them. Disk health monitoring combined with real-time RAID array status is your early warning system for hardware failures that can escalate from a single degraded drive to a full data loss event faster than most teams expect.

A RAID-5 array with one failed disk is still running. It's also one drive failure away from losing everything. Without active monitoring, that degraded state can persist for days or weeks before anyone notices, and the next failure won't wait.

What to monitor:

  • Physical disk health via S.M.A.R.T. data (reallocated sectors, pending sectors, uncorrectable errors)
  • RAID array status: healthy, degraded, rebuilding, or failed
  • RAID rebuild progress and estimated completion time
  • Hot spare availability: is a spare actually ready to take over?
  • Disk error rates and read/write error counts per storage device

Threshold guidance: Any RAID array in a degraded state should trigger an immediate alert, not a warning. S.M.A.R.T. threshold breaches on any disk warrant proactive replacement before failure occurs. Most modern storage devices surface this data automatically if you have the right sensors in place.
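The alerting logic here is deliberately binary. A minimal Python sketch of the rule above; the dictionary keys are the S.M.A.R.T. counters mentioned, spelled however your collector happens to expose them:

```python
def raid_alert(array_state, smart_attrs):
    """Immediate alert on any non-healthy array state; proactive-replace
    flag when a watched S.M.A.R.T. counter is non-zero."""
    if array_state.lower() != "healthy":
        return "immediate-alert"   # degraded, rebuilding, or failed
    watched = ("reallocated_sectors", "pending_sectors",
               "uncorrectable_errors")
    if any(smart_attrs.get(key, 0) > 0 for key in watched):
        return "replace-proactively"
    return "ok"
```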

Catching these issues early is the difference between a planned maintenance window and an emergency recovery.

6. SAN Fabric Performance — Don't Let the Network Be Your Bottleneck

A Storage Area Network (SAN) is only as reliable as the fabric connecting it: the switches (from vendors such as Cisco), Host Bus Adapters (HBAs), and Fibre Channel or iSCSI links that carry data between servers and storage. SAN fabric issues are notorious for looking exactly like storage performance problems. You spend an hour digging through array logs before someone checks the switch and finds a port throwing CRC errors.

Monitoring the SAN fabric as part of your storage monitoring matrix keeps you from chasing ghosts. When latency spikes or throughput drops, you need to know immediately whether the problem is in the storage array itself or in the network connecting it.

What to monitor:

  • Port error rates on SAN switches and HBAs (CRC errors, link resets, loss of signal)
  • Link utilization per port: identify congested uplinks before they saturate
  • Queue depth at the HBA level
  • Zoning configuration changes: unexpected zoning changes can cause immediate connectivity loss
  • Inter-switch link (ISL) health and utilization

Threshold guidance: Any port error rate above 0.1% warrants investigation. Link utilization consistently above 70% on SAN uplinks is a capacity planning signal. For storage environments with VMware, latency and IOPS at the datastore level are equally critical. See Monitoring VMware vSphere Performance for a detailed guide.
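Error rate and link utilization are the two numeric triggers here. A quick Python sketch, under the assumption that your switch exposes total frame and error-frame counters (e.g. via SNMP):

```python
def port_error_rate(error_frames, total_frames):
    """Error frames as a percentage of all frames on the port."""
    if total_frames == 0:
        return 0.0
    return 100.0 * error_frames / total_frames

def fabric_flags(error_frames, total_frames, link_util_pct):
    """Apply the two thresholds above: >0.1% errors, >70% utilization."""
    flags = []
    if port_error_rate(error_frames, total_frames) > 0.1:
        flags.append("investigate-port")
    if link_util_pct > 70:
        flags.append("capacity-planning")
    return flags
```

In practice you'd compute the rate from counter deltas between polls, since SNMP counters are cumulative.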

7. NAS Performance and Availability — Keep File Services Running Smoothly

Network-Attached Storage (NAS) serves a different purpose than SAN. It provides shared file access over standard network protocols like NFS and SMB/CIFS. And because NAS shares are often accessed by every user in an organization, availability and response time directly impact day-to-day productivity.

NAS monitoring tends to get less attention than SAN monitoring, but the impact of a NAS outage is immediately visible to end users in a way that most storage failures are not. Slow share response times, dropped connections, or unavailable shares generate helpdesk tickets fast.

What to monitor:

  • Share and volume availability (is the share actually accessible?)
  • Protocol response times per NFS/SMB share
  • Concurrent connection counts and trends
  • Disk utilization per NAS volume with growth trend data
  • Network throughput to the NAS device
  • Failed authentication attempts (security signal)

Threshold guidance: Any share unavailability should trigger an immediate alert. Response times above 100ms for SMB/NFS operations warrant investigation. For a practical example of NAS metric monitoring in action, see Monitoring Disk Usage on Synology NAS with PRTG.
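One way to express the availability-plus-response-time rule is to time an actual protocol operation. A Python sketch; the `check` callable is a placeholder for whatever operation you probe, say a directory listing against a mounted SMB/NFS share:

```python
import time

def probe_share(check, warn_ms=100.0):
    """Time one protocol operation against a share. `check` returning
    False or raising OSError means the share is unavailable, which is
    an immediate alert per the guidance above."""
    start = time.monotonic()
    try:
        ok = check()
    except OSError:
        ok = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    if not ok:
        return "immediate-alert"
    return "investigate" if elapsed_ms > warn_ms else "ok"
```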

Real-time NAS monitoring gives you the observability to catch degradation before users do. Which, honestly, is the whole point.

8. Storage Pool and Volume Allocation — Optimize Before You Over-Provision

Modern storage infrastructure relies heavily on thin provisioning: allocating more virtual storage capacity than physically exists, on the assumption that not all of it will be used at once. It's an efficient approach, but it introduces a monitoring blind spot that catches teams off guard: the gap between allocated capacity and actual physical consumption.

When thin-provisioned volumes grow faster than expected, storage pools can run out of physical space even when individual volumes appear to have headroom. Managing your storage resources at the pool level, not just the volume level, is what keeps you from hitting that wall.

What to monitor:

  • Thin provisioning ratios per storage pool (allocated vs. physical)
  • Over-commitment levels: how far beyond physical capacity are you allocated?
  • Snapshot space consumption and growth rate
  • Volume reclamation opportunities (unused allocated space)
  • Storage pool free space with trend-based projections

Threshold guidance: Alert when thin-provisioned volume consumption exceeds 80% of the physical pool. Automate snapshot cleanup policies and set hard limits on snapshot retention to prevent silent pool exhaustion.
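The over-commitment math is worth making explicit. A minimal Python sketch of the pool-level view described above; the field names are our own:

```python
def pool_status(physical_tb, allocated_tb, consumed_tb, alert_pct=80):
    """Compare what's allocated (virtual) and what's consumed (physical)
    against the pool, alerting at >80% physical consumption."""
    consumed_pct = 100 * consumed_tb / physical_tb
    return {
        "overcommit_ratio": allocated_tb / physical_tb,  # >1.0 = thin
        "consumed_pct": consumed_pct,
        "alert": consumed_pct > alert_pct,
    }
```

Note the failure mode this catches: a pool 2.5x over-committed can exhaust physical space while every individual volume still reports headroom.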

Optimizing your allocation strategy requires this data. Without it, you're guessing. And you'll probably over-provision to compensate, which wastes budget and storage resources alike.

9. Replication and Backup Status — Verify Your Safety Net Is Actually There

A backup you haven't verified is not really a backup. It's just a file you're hoping is intact.

Replication and backup monitoring is the most overlooked category in most storage monitoring matrices, and it's the one that causes the most pain when it fails silently. Backup jobs fail. Replication falls behind. Retention policies get misconfigured. None of these failures announce themselves loudly. They accumulate quietly until the moment you need to restore something. And then you find out the safety net wasn't there.

What to monitor:

  • Last successful backup timestamp per system and volume
  • Backup job completion status (success, warning, failure)
  • Replication lag: how far behind is your replica from the source?
  • RPO/RTO compliance: are you actually meeting your recovery objectives?
  • Backup storage consumption and retention policy adherence
  • Job duration trends: a backup that's taking longer each night is a warning sign

Threshold guidance: Alert immediately on any failed backup job. Replication lag above 15 minutes warrants investigation for most environments; for critical systems, tighten this to 5 minutes. Automate backup verification checks where possible. Don't rely on manual review.
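Replication lag and job status reduce to timestamp arithmetic. A Python sketch of the thresholds above; the status strings are our own convention, and `max_lag` is the parameter you'd tighten to 5 minutes for critical systems:

```python
from datetime import datetime, timedelta, timezone

def replication_ok(last_sync, now=None, max_lag=timedelta(minutes=15)):
    """True if the replica is within the allowed lag window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_sync) <= max_lag

def backup_state(job_status, last_sync, now=None,
                 max_lag=timedelta(minutes=15)):
    """Any failed job alerts immediately; otherwise check lag."""
    if job_status != "success":
        return "immediate-alert"
    return "ok" if replication_ok(last_sync, now, max_lag) else "investigate"
```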

Downtime caused by data loss is the most expensive kind. Get this one wrong and everything else in your monitoring matrix becomes irrelevant.

10. Multi-Vendor Storage Visibility — One Dashboard to Rule Them All

Most enterprise storage environments aren't built from a single vendor's products. You've got a NetApp array here, a Dell EMC system there, IBM storage in the data center, NAS devices on the edge, and cloud storage like AWS and Azure growing in the background. Each has its own management interface, its own alerting system, its own data format.

The result is exactly what sysadmins describe: "We've got SAN monitoring in one tool, NAS in another, cloud storage somewhere else. It's a mess."

A complete storage monitoring matrix requires unified visibility across all of these storage systems, not a tab for each vendor's console. That means aggregating metrics, alerts, and dashboards into a single monitoring platform that speaks to all of them, whether via SNMP, REST API, or vendor-specific protocols.

What to monitor:

  • Cross-vendor health scores and availability status in a unified view
  • Unified alerting across on-premises and cloud environments
  • API-based data collection from vendor management systems (NetApp ONTAP API, Dell EMC REST API, IBM Spectrum Control)
  • Consistent metric naming and threshold management across storage types
  • Trend data and reporting across the full storage infrastructure
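The unglamorous core of multi-vendor visibility is normalization: mapping each vendor's payload into one schema before alerting on it. A Python sketch of the idea; the per-vendor field names below are illustrative assumptions, not real API responses:

```python
def normalize(vendor, payload):
    """Map a vendor-specific health payload into one unified schema,
    so thresholds and dashboards work the same across arrays."""
    if vendor == "netapp":
        return {"vendor": vendor,
                "healthy": payload.get("state") == "online",
                "used_pct": payload.get("percent_used")}
    if vendor == "dell_emc":
        size = payload.get("size", 1)
        return {"vendor": vendor,
                "healthy": payload.get("health") == "OK",
                "used_pct": 100 * payload.get("used", 0) / size}
    raise ValueError(f"no adapter for vendor: {vendor}")
```

Each new array type then costs one small adapter function, while alert rules stay written once against the unified shape.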

How PRTG helps: PRTG's native storage sensors combined with its flexible REST API sensor cover NetApp, Dell EMC, IBM, and dozens of other storage platforms from a single dashboard. No separate consoles, no alert silos. For a full overview of PRTG's storage monitoring capabilities, visit Storage Monitoring with PRTG.

Build Your Storage Monitoring Matrix Today

Storage problems rarely show up with a warning. Capacity fills up, latency creeps higher, a RAID array goes degraded, a backup job quietly fails. Most of the time, nobody notices until something breaks. The teams that do catch it early aren't smarter or luckier. They just have better visibility.

The 10 metrics in this storage monitoring matrix cover the full spectrum of what your storage infrastructure needs you to watch, and how each one should function as part of a unified monitoring approach: capacity, performance, health, availability, data protection, and visibility. Together, they give you enough signal to get ahead of problems instead of just reacting to them.

Here's a concrete first step: pull up your current monitoring dashboard and check which of the 10 metrics in the table above you're actually tracking. If you've got fewer than 7, you have blind spots. If you're missing backup status or RAID health entirely, fix those first. They're the ones that cause the worst outages and the hardest conversations.

PRTG gives you the sensors, dashboards, and real-time alerting to put this matrix into practice across SAN, NAS, RAID, cloud environments, and multi-vendor storage systems. Learn how to monitor your storage environment in 4 steps and get your first sensors running today.