By the time you know there's a storage problem, it's already too late.
A LUN fills up overnight. A SAN starts throwing latency spikes nobody noticed. A backup job silently fails for three weeks. These aren't edge cases. They're what happens when IT teams rely on reactive monitoring instead of a structured approach.
That's where a storage monitoring matrix comes in. Think of it as your reference framework for comprehensive infrastructure monitoring: a structured map of the key performance metrics your storage systems should be generating, the thresholds that matter, and why each one deserves a place in your dashboards. Whether you're managing on-prem storage, cloud environments, or a hybrid mix of both, the right metrics give you the real-time visibility to catch problems before they become outages.
In this guide, we cover the 10 critical metrics every IT team should be tracking, and what your monitoring tools should be doing with them. For a broader look at why storage monitoring matters in the first place, start with Why You Definitely Need to Monitor Your Storage Infrastructure.
A storage monitoring matrix is a structured reference that maps your storage environments to the specific metrics, thresholds, and alerting rules that keep them healthy. Rather than monitoring everything at once with no clear priority, a matrix gives your team a shared framework: what to watch, what's normal, and when to act.
When it works, you stop chasing fires and start catching problems before they turn into 2 a.m. phone calls. That's the whole point of building one.
Use this table as your at-a-glance reference. Each metric, its threshold, and the corresponding PRTG sensor type are mapped below. The full detail for each is in the sections that follow.
| Metric | What It Measures | Alert Threshold | PRTG Sensor |
|---|---|---|---|
| Capacity Utilization | Used vs. available space per volume/LUN | Alert: 75% / Critical: 90% | Disk Free Sensor |
| IOPS | Read/write operations per second | Alert: >80% of rated IOPS | SNMP / Storage Sensor |
| Latency | I/O request response time (ms) | Warn: >20ms / Critical: >50ms | Storage Performance Sensor |
| Throughput | Data volume moved (MB/s) | Alert: >80% of rated capacity | SNMP / NetFlow Sensor |
| Disk Health / RAID | S.M.A.R.T. data + RAID array status | Any degraded array = immediate alert | SMART Sensor / SNMP |
| SAN Fabric | Port errors, link utilization, HBA health | Port error rate >0.1% | SNMP / Fibre Channel Sensor |
| NAS Availability | Share uptime + protocol response times | Any unavailability = immediate alert | NAS Sensor / Ping |
| Storage Pool Allocation | Thin provisioning ratios + pool consumption | Alert: >80% physical pool used | SNMP / REST API Sensor |
| Backup / Replication | Job status, replication lag, RPO/RTO | Any failed job = immediate alert | HTTP / Script Sensor |
| Multi-Vendor Visibility | Cross-vendor health + unified alerting | Vendor-specific thresholds | REST API / SNMP Sensor |
Nothing derails an IT team faster than a data storage volume that fills up without warning. Capacity utilization tracks the percentage of used versus available space across your volumes, LUNs, storage pools, and file systems, and it's the single most important metric in any storage monitoring matrix.
The problem isn't that teams don't know capacity matters. It's that they don't monitor it with enough granularity. Tracking total storage capacity at the array level tells you very little. What you need is per-volume, per-LUN visibility with trend data, so you can project when you'll hit the wall, not just see that you already have.
What to monitor:

- Used vs. available space per volume, LUN, storage pool, and file system
- Growth trends over time, not just point-in-time utilization
- Days-to-full projections for fast-growing volumes
Threshold guidance: Set alerts at 75% utilization, critical at 90%. For fast-growing volumes, trigger alerts based on days-to-full. For example, alert when a volume is projected to fill within 14 days.
Proactive capacity planning starts with this metric. For a deeper look at how to set it up, see Storage Capacity Planning with PRTG.
IOPS (Input/Output Operations Per Second) measures how many read and write operations your storage can handle per second. It's the closest thing to a heartbeat reading for your storage systems, and when it flatlines or spikes, apps feel it immediately.
IOPS saturation is probably the most common cause of application slowdowns that end up getting blamed on the network or the servers. Storage rarely gets the first call.
What to monitor:

- Read and write IOPS per volume and LUN
- Sustained IOPS as a percentage of the vendor's rated maximum
- IOPS spikes and trends correlated with application activity
Threshold guidance: IOPS limits are vendor-specific, but a sustained load above 80% of rated IOPS is a reliable alert trigger. Don't wait for 100%. By then, latency has already degraded and users are already feeling it.
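The word "sustained" matters here: a single spike above 80% is normal, a run of samples above it is not. A minimal sketch of that rule, assuming a `window` of recent samples (the function name and sample format are this example's, not any vendor's API):

```python
def iops_alert(samples: list[float], rated_iops: float,
               threshold: float = 0.80, window: int = 5) -> bool:
    """Alert only when the last `window` samples ALL exceed
    threshold * rated IOPS -- a sustained run, not a transient spike."""
    if len(samples) < window:
        return False
    limit = threshold * rated_iops
    return all(s > limit for s in samples[-window:])

# Five consecutive samples above 8,000 IOPS on a 10,000-rated array: alert.
print(iops_alert([9000.0] * 5, rated_iops=10000.0))  # True
```

Requiring a sustained window keeps the alert actionable instead of noisy.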
Monitoring IOPS alongside latency (see metric #3) tells you a lot more about what's actually going on than either metric alone.
Latency is the time it takes for a storage I/O request to complete, measured in milliseconds. It's the metric that tells you how your storage systems actually feel to the applications and users depending on them. And it's often the first sign of trouble, appearing well before IOPS max out or capacity runs low.
High latency is the canary in the coal mine for storage bottlenecks. A sudden spike in read or write latency usually points to a specific root cause: a saturated disk, a misconfigured RAID array, a failing drive, or a congested SAN fabric. The faster you catch it, the faster you can troubleshoot and resolve it before it cascades into an outage.
What to monitor:

- Read and write latency per volume, measured in milliseconds
- Latency trends and sudden spikes, not just averages
- Latency correlated with IOPS, to separate saturation from hardware faults
Threshold guidance: Under 5ms is healthy for most workloads. Above 20ms warrants investigation. Above 50ms is critical: expect application impact. For latency-sensitive workloads like databases and VMs, tighten these thresholds accordingly.
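Those tiers translate directly into an alerting rule. A minimal sketch, with the default thresholds from above and tighter optional ones for latency-sensitive workloads (the function and its severity labels are illustrative, not a specific tool's API):

```python
def latency_severity(ms: float, warn: float = 20.0, crit: float = 50.0) -> str:
    """Classify an I/O latency sample against warning/critical thresholds."""
    if ms > crit:
        return "critical"   # expect application impact
    if ms > warn:
        return "warning"    # warrants investigation
    return "ok"             # under 5ms is healthy for most workloads

# Same 25ms sample: fine for a file share, critical for a database
# monitored with tightened thresholds.
print(latency_severity(25.0))                        # warning
print(latency_severity(25.0, warn=10.0, crit=20.0))  # critical
```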
For a practical guide to building latency observability into your storage performance monitoring with PRTG, see Storage Performance Monitoring with PRTG.
While IOPS counts operations, throughput measures the actual volume of data moving through your storage, expressed in MB/s. Both metrics matter, but they tell different stories. A storage system can be handling a high number of small IOPS while throughput remains low, or pushing massive sequential data transfers with relatively few operations.
Throughput drops are a reliable signal of network or controller bottlenecks that IOPS alone won't reveal. If your backup jobs are suddenly taking twice as long, or large file transfers are crawling, throughput is where you'll find the answer.
What to monitor:

- Read and write throughput (MB/s) per controller and network link
- Sustained throughput as a percentage of rated capacity
- Sudden drops against the recent baseline
Threshold guidance: Alert when sustained throughput consistently exceeds 80% of your storage controller's or network link's rated capacity. Sudden drops in throughput, even without hitting limits, are worth investigating as early signs of hardware degradation.
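That gives you two distinct alert conditions: saturation against rated capacity, and a sudden drop against the recent baseline. A minimal sketch combining both (the function, the 50%-of-baseline drop ratio, and the sample format are assumptions for illustration):

```python
from statistics import mean

def throughput_alerts(history_mbps: list[float], current_mbps: float,
                      rated_mbps: float) -> list[str]:
    """Check one throughput sample against both alert conditions."""
    alerts = []
    # Condition 1: sustained load above 80% of rated capacity.
    if current_mbps > 0.80 * rated_mbps:
        alerts.append("saturation")
    # Condition 2: a sudden drop to below half the recent baseline,
    # a possible early sign of hardware degradation.
    if history_mbps and current_mbps < 0.5 * mean(history_mbps):
        alerts.append("sudden_drop")
    return alerts

# Link rated at 1,000 MB/s, recently averaging ~410 MB/s, now at 100 MB/s:
# nowhere near saturation, but clearly worth investigating.
print(throughput_alerts([400.0, 420.0, 410.0], 100.0, 1000.0))  # ['sudden_drop']
```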
Tracking throughput alongside IOPS and latency is core to any solid storage performance monitoring setup. It's how you get a real read on what your storage environments are actually doing under load.
Individual disk failures are inevitable. What's not inevitable is being caught off guard by them. Disk health monitoring combined with real-time RAID array status is your early warning system for hardware failures that can escalate from a single degraded drive to a full data loss event faster than most teams expect.
A RAID-5 array with one failed disk is still running. It's also one drive failure away from losing everything. Without active monitoring, that degraded state can persist for days or weeks before anyone notices, and the next failure won't wait.
What to monitor:

- S.M.A.R.T. attributes on every disk
- RAID array status (optimal, degraded, rebuilding)
- Rebuild progress on any degraded array
Threshold guidance: Any RAID array in a degraded state should trigger an immediate alert, not a warning. S.M.A.R.T. threshold breaches on any disk warrant proactive replacement before failure occurs. Most modern storage devices surface this data automatically if you have the right sensors in place.
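The key distinction in that guidance is severity: a degraded array is critical, a S.M.A.R.T. breach on an otherwise healthy array is a warning that should trigger proactive replacement. A minimal sketch of that mapping (the state strings and function are illustrative, not a specific controller's vocabulary):

```python
def raid_severity(array_state: str, smart_breaches: int) -> str:
    """Map RAID state plus S.M.A.R.T. breach count to an alert severity."""
    state = array_state.lower()
    if state in ("degraded", "failed"):
        return "critical"   # immediate alert, not a warning
    if state == "rebuilding" or smart_breaches > 0:
        return "warning"    # at risk: plan proactive replacement
    return "ok"

# One failed disk in a RAID-5 array: still running, but critical.
print(raid_severity("Degraded", smart_breaches=0))  # critical
```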
Catching these issues early is the difference between a planned maintenance window and an emergency recovery.
A Storage Area Network (SAN) is only as reliable as the fabric connecting it: the switches (from Cisco and other vendors), Host Bus Adapters (HBAs), and Fibre Channel or iSCSI links that carry data between servers and storage. SAN fabric issues are notorious for looking exactly like storage performance problems. You spend an hour digging through array logs before someone checks the switch and finds a port throwing CRC errors.
Monitoring the SAN fabric as part of your storage monitoring matrix keeps you from chasing ghosts. When latency spikes or throughput drops, you need to know immediately whether the problem is in the storage array itself or in the network connecting it.
What to monitor:

- Port error rates, especially CRC errors, on every switch port
- Link utilization on SAN uplinks and inter-switch links
- HBA health and status on connected servers
Threshold guidance: Any port error rate above 0.1% warrants investigation. Link utilization consistently above 70% on SAN uplinks is a capacity planning signal. For storage environments with VMware, latency and IOPS at the datastore level are equally critical. See Monitoring VMware vSphere Performance for a detailed guide.
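Both fabric thresholds reduce to simple ratios over the counters your switches already expose. A minimal sketch, assuming you've pulled error and frame counters (e.g. via SNMP) into plain numbers; the function names are this example's, not a switch API:

```python
def port_error_rate(errors: int, frames: int) -> float:
    """Error frames as a fraction of total frames on a switch port."""
    return errors / frames if frames else 0.0

def fabric_alerts(errors: int, frames: int, link_util: float) -> list[str]:
    alerts = []
    if port_error_rate(errors, frames) > 0.001:  # the 0.1% threshold
        alerts.append("port_errors")
    if link_util > 0.70:  # capacity planning signal on uplinks
        alerts.append("uplink_capacity")
    return alerts

# 20 error frames out of 10,000 is 0.2%: over the 0.1% line.
print(fabric_alerts(20, 10000, link_util=0.50))  # ['port_errors']
```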
Network-Attached Storage (NAS) serves a different purpose than SAN. It provides shared file access over standard network protocols like NFS and SMB/CIFS. And because NAS shares are often accessed by every user in an organization, availability and response time directly impact day-to-day productivity.
NAS monitoring tends to get less attention than SAN monitoring, but the impact of a NAS outage is immediately visible to end users in a way that most storage failures are not. Slow share response times, dropped connections, or unavailable shares generate helpdesk tickets fast.
What to monitor:

- Share availability and uptime
- SMB/NFS protocol response times
- Dropped or failing client connections
Threshold guidance: Any share unavailability should trigger an immediate alert. Response times above 100ms for SMB/NFS operations warrant investigation. For a practical example of NAS metric monitoring in action, see Monitoring Disk Usage on Synology NAS with PRTG.
Real-time NAS monitoring gives you the observability to catch degradation before users do. Which, honestly, is the whole point.
Modern storage infrastructure relies heavily on thin provisioning: allocating more virtual storage capacity than physically exists, on the assumption that not all of it will be used at once. It's an efficient approach, but it introduces a monitoring blind spot that catches teams off guard: the gap between allocated capacity and actual physical consumption.
When thin-provisioned volumes grow faster than expected, storage pools can run out of physical space even when individual volumes appear to have headroom. Managing your storage resources at the pool level, not just the volume level, is what keeps you from hitting that wall.
What to monitor:

- Physical pool consumption vs. total allocated (thin-provisioned) capacity
- Overcommit ratios per storage pool
- Snapshot growth and retention against policy
Threshold guidance: Alert when thin-provisioned volume consumption exceeds 80% of the physical pool. Automate snapshot cleanup policies and set hard limits on snapshot retention to prevent silent pool exhaustion.
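The pool-level math is simple but easy to overlook when dashboards only show per-volume headroom. A minimal sketch of the two numbers that matter (function names and the GB-based inputs are illustrative):

```python
def pool_utilization(volumes_used_gb: list[float], pool_physical_gb: float) -> float:
    """Actual physical consumption across all volumes, as a fraction of the pool."""
    return sum(volumes_used_gb) / pool_physical_gb

def overcommit_ratio(volumes_provisioned_gb: list[float], pool_physical_gb: float) -> float:
    """How much virtual capacity has been promised per unit of physical capacity."""
    return sum(volumes_provisioned_gb) / pool_physical_gb

# Three volumes, each provisioned at 1 TB against a 1 TB physical pool:
# 3x overcommitted, and actual consumption is already past the 80% line
# even though every individual volume still shows free space.
print(overcommit_ratio([1000.0, 1000.0, 1000.0], 1000.0))   # 3.0
print(pool_utilization([300.0, 350.0, 200.0], 1000.0))      # 0.85 -> alert
```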
Optimizing your allocation strategy requires this data. Without it, you're guessing. And you'll probably over-provision to compensate, which wastes budget and storage resources alike.
A backup you haven't verified is not really a backup. It's just a file you're hoping is intact.
Replication and backup monitoring is the most overlooked category in most storage monitoring matrices, and it's the one that causes the most pain when it fails silently. Backup jobs fail. Replication falls behind. Retention policies get misconfigured. None of these failures announce themselves loudly. They accumulate quietly until the moment you need to restore something. And then you find out the safety net wasn't there.
What to monitor:

- Backup job status and completion times
- Replication lag between source and target
- RPO/RTO compliance and retention policy enforcement
Threshold guidance: Alert immediately on any failed backup job. Replication lag above 15 minutes warrants investigation for most environments; for critical systems, tighten this to 5 minutes. Automate backup verification checks where possible. Don't rely on manual review.
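Both rules are easy to automate once job results and sync timestamps land somewhere queryable. A minimal sketch, assuming a dict of job names to last-run status and a last-synced timestamp per replication pair (all names and formats are this example's):

```python
from datetime import datetime, timedelta, timezone

def replication_lagging(last_synced: datetime, now: datetime,
                        max_lag: timedelta = timedelta(minutes=15)) -> bool:
    """True when replication lag exceeds the allowed window
    (tighten max_lag to 5 minutes for critical systems)."""
    return (now - last_synced) > max_lag

def backup_alerts(jobs: dict[str, str]) -> list[str]:
    """Names of jobs whose last run did not succeed: each is an immediate alert."""
    return [name for name, status in jobs.items() if status != "success"]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(replication_lagging(now - timedelta(minutes=20), now))       # True
print(backup_alerts({"nightly-full": "success", "sql-log": "failed"}))  # ['sql-log']
```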
Downtime caused by data loss is the most expensive kind. Get this one wrong and everything else in your monitoring matrix becomes irrelevant.
Most enterprise storage environments aren't built from a single vendor's products. You've got a NetApp array here, a Dell EMC system there, IBM storage in the data center, NAS devices on the edge, and cloud storage on AWS and Azure growing in the background. Each has its own management interface, its own alerting system, its own data format.
The result is exactly what sysadmins describe: "We've got SAN monitoring in one tool, NAS in another, cloud storage somewhere else. It's a mess."
A complete storage monitoring matrix requires unified visibility across all of these storage systems, not a tab for each vendor's console. That means aggregating metrics, alerts, and dashboards into a single monitoring platform that speaks to all of them, whether via SNMP, REST API, or vendor-specific protocols.
What to monitor:

- Health status across every vendor's storage systems, from one platform
- Unified alerting with vendor-specific thresholds
- Capacity and performance metrics normalized across vendors
How PRTG helps: PRTG's native storage sensors combined with its flexible REST API sensor cover NetApp, Dell EMC, IBM, and dozens of other storage platforms from a single dashboard. No separate consoles, no alert silos. For a full overview of PRTG's storage monitoring capabilities, visit Storage Monitoring with PRTG.
Storage problems rarely show up with a warning. Capacity fills up, latency creeps higher, a RAID array goes degraded, a backup job quietly fails. Most of the time, nobody notices until something breaks. The teams that do catch it early aren't smarter or luckier. They just have better visibility.
The 10 metrics in this storage monitoring matrix cover the full spectrum of what your storage infrastructure needs you to watch: capacity, performance, health, availability, data protection, and visibility. Each one functions as part of a unified monitoring approach, and together they give you enough signal to get ahead of problems instead of just reacting to them.
Here's a concrete first step: pull up your current monitoring dashboard and check which of the 10 metrics in the table above you're actually tracking. If you've got fewer than 7, you have blind spots. If you're missing backup status or RAID health entirely, fix those first. They're the ones that cause the worst outages and the hardest conversations.
PRTG gives you the sensors, dashboards, and real-time alerting to put this matrix into practice across SAN, NAS, RAID, cloud environments, and multi-vendor storage systems. Learn how to monitor your storage environment in 4 steps and get your first sensors running today.