[Update] Ironically, as soon as this blog went live, Cloudflare went down! So I will use this opportunity to say... I told you so! Keep reading to find out why.
Someone, at some point in time (hopefully some kind of professional network engineer), designed your network. They drew a pretty picture with multicolored lines showing traffic flowing between a primary path and a secondary path. Oh, and this plan also had a budget and important management buy-in. Then they moved on to another company, country, or just a better job, and what you've been left with is a Frankenstein's monster of temporary band-aid solutions to real problems that all became permanent circa 2015. Your so-called redundant paths? One goes down, and the other follows like some kind of tragic Romeo and Juliet situation - it's heartbreaking.
Network redundancy is your safety net against network downtime and outages. It's what keeps your business running when everything goes wrong, and the costs pile up faster than an Azure bill after someone spins up 'just a small test environment.'
Network redundancy means you have another router that your server can switch to if something goes wrong. You have a backup switch that the racks can plug into if one fails. You have multiple internet connections. And you probably also have active-active load balancing so that your single point of failure is now distributed across four devices and four connections and you're all set, right?
Sort of, but probably not.
Let's dig into the details. Network redundancy, in the real world, means you have real standby equipment, real failover connections, and real alternative routes for your network traffic if something goes wrong. It's about ensuring high availability and fault tolerance across your entire network. It's not just a collection of routers, switches, and cables you bought at some point for a future rainy day you may or may not use.
Redundancy works via failover. When the primary system fails, the backup system takes over and carries the full load of your traffic, ideally without anyone noticing that anything has changed. That buys you time to fix the failure, although until you do, you are running without a safety net.
Your users and your customers don't care about your redundancy, and they don't care that your primary internet connection suddenly became unavailable. They just want to continue shopping, continue sending emails, and continue working while you move routers around, swap cables, or fix a failing device.
There are multiple types of redundancy that you need to be aware of when designing a network infrastructure. From data centers to individual network devices, understanding these redundancy types helps you protect against hardware failures and network outages. There are various tools to help you, but we're going to focus on the high-level concepts you should be concerned with.
Device redundancy involves having multiple instances of the same device, for example more than one router. The standby router takes over should your primary device fail. Device redundancy typically relies on protocols such as VRRP (Virtual Router Redundancy Protocol) or HSRP (Hot Standby Router Protocol), which allow multiple routers to work together with one router acting as the primary and the others as hot standby devices.
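To make the primary/standby idea concrete, here is a minimal sketch of how a group of routers might pick its active member. It is loosely modeled on the VRRP selection rule (highest priority wins, with the higher IP address breaking ties) and ignores the advertisements, timers, and preemption settings a real implementation uses. The `Router` class and the `elect_active` function are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Iterable, Optional


@dataclass(frozen=True)
class Router:
    """Hypothetical model of a router in a redundancy group."""
    name: str
    priority: int          # higher priority is preferred
    address: IPv4Address   # used only to break priority ties
    alive: bool = True


def elect_active(routers: Iterable[Router]) -> Optional[Router]:
    """Return the router that should carry traffic, or None if all are down."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    # Highest priority wins; on a tie, the higher IP address wins.
    return max(candidates, key=lambda r: (r.priority, r.address))


primary = Router("edge-a", 200, IPv4Address("10.0.0.2"))
standby = Router("edge-b", 100, IPv4Address("10.0.0.3"))
print(elect_active([primary, standby]).name)  # edge-a
```

If `edge-a` is marked as not alive, the same call returns `edge-b`, which is the standby taking over.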
Path redundancy offers multiple options for traffic to flow between your network devices, and if your primary route goes down, your traffic is automatically rerouted along another path. Routing protocols like BGP (Border Gateway Protocol) are an integral part of path redundancy in networks, tracking which paths are available and determining where the data should be going.
Link redundancy provides multiple physical links between your network devices. Link aggregation or trunking refers to using multiple network connections as a single logical connection. If one link goes down, the remaining links continue to operate.
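As a rough illustration of how a bundle of links can behave like one logical connection, the sketch below maps each traffic flow to one member link with a hash and simply re-hashes over whatever links are still up. Real link aggregation hashing is vendor-specific, so treat the `pick_link` function, the flow tuple, and the interface names as assumptions made for this example.

```python
import zlib
from typing import Sequence, Tuple


def pick_link(flow: Tuple, active_links: Sequence[str]) -> str:
    """Map a flow to one of the currently active member links."""
    if not active_links:
        raise ValueError("no active links in the bundle")
    # A stable hash keeps packets of the same flow on the same link.
    key = "|".join(str(part) for part in flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]


flow = ("10.0.0.5", "192.0.2.10", 443)      # source, destination, port
both_up = pick_link(flow, ["eth0", "eth1"])  # one of the two links
one_down = pick_link(flow, ["eth1"])         # always the surviving link
```

The point of the design is that losing a member link shrinks the list rather than breaking the connection: flows are redistributed over the links that remain.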
Connection redundancy typically involves multiple internet connections from different service providers, so that if one ISP is experiencing an outage, your other connection continues to operate. This protects against disruptions from your primary provider.
Power redundancy, which often gets forgotten, is just as important. At the device level it means redundant power supplies: if one supply or power feed fails, the device continues to run on the other. For example, if you lose a power feed in a rack, your redundant power supplies should kick in and your equipment will still be up and running. It also includes backup generators and uninterruptible power supplies (UPS) that keep equipment powered during outages.
Failover in network redundancy means your standby systems can take over in the event of a failure in your primary path. When that failure is detected, the system automatically reroutes traffic to the backup, and this all happens very quickly to maintain network availability.
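The detect-then-reroute step can be sketched in a few lines. The example below assumes a hypothetical `FailoverController` that is fed the result of each health check and switches to the backup after a configurable number of consecutive misses, so that a single dropped probe does not trigger a failover. The class name, the `max_missed` parameter, and the connection names are all invented for illustration, and this version deliberately does not fail back on its own.

```python
class FailoverController:
    """Hypothetical controller that switches to a backup after repeated failures."""

    def __init__(self, primary: str, backup: str, max_missed: int = 3):
        self.primary = primary
        self.backup = backup
        self.max_missed = max_missed
        self.missed = 0          # consecutive failed health checks
        self.active = primary

    def record_check(self, primary_ok: bool) -> str:
        """Record one health check result and return the active path."""
        if primary_ok:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # Fail over; returning to the primary is left as a manual step.
                self.active = self.backup
        return self.active


controller = FailoverController("isp-a", "isp-b", max_missed=3)
for ok in (False, False, False):
    active = controller.record_check(ok)
print(active)  # isp-b
```

Choosing `max_missed` is a trade-off: a low value fails over quickly but reacts to brief blips, while a high value is calmer but leaves users on a dead path for longer.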
Failover can be active-passive or active-active. In the former, a primary system carries all the network traffic and the backup system is idle, in hot standby. When a failure in the primary system is detected, the backup system takes over and becomes active. In active-active, multiple systems share the traffic load, and if one fails, the others absorb its traffic load and continue operating.
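Active-active only counts as redundancy if the survivors can actually absorb the load of the device that failed. Here is a minimal sketch of that arithmetic, assuming you know each device's usable capacity and your peak traffic; the function name and the numbers are made up for illustration.

```python
from typing import Sequence


def survives_single_failure(capacities_mbps: Sequence[float],
                            peak_load_mbps: float) -> bool:
    """True if the remaining devices can carry the peak load after any one fails."""
    if len(capacities_mbps) < 2:
        return False  # a single device has nothing to fail over to
    total = sum(capacities_mbps)
    # Check the loss of each device in turn, including the largest one.
    return all(total - c >= peak_load_mbps for c in capacities_mbps)


four_devices = [1000, 1000, 1000, 1000]
print(survives_single_failure(four_devices, 2800))  # True
print(survives_single_failure(four_devices, 3500))  # False
```

In the second case the four devices run happily at 3500 Mbps on a normal day, but losing any one of them leaves 3000 Mbps of capacity, so the "redundant" setup still causes an outage.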
Load balancing is often used alongside network redundancy. It distributes your network traffic across multiple network devices, which not only provides redundancy but also improves performance and prevents any single device from becoming overloaded.
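One simple way to picture this is a round-robin balancer that skips devices marked as down. The sketch below is an illustration only: the `RoundRobinBalancer` class and the device names are hypothetical, and real load balancers typically also weigh capacity, connection counts, and session persistence.

```python
from typing import Iterable


class RoundRobinBalancer:
    """Hypothetical round-robin balancer that skips unhealthy devices."""

    def __init__(self, devices: Iterable[str]):
        self.devices = list(devices)
        self.healthy = set(self.devices)
        self._index = 0

    def mark_down(self, device: str) -> None:
        self.healthy.discard(device)

    def mark_up(self, device: str) -> None:
        if device in self.devices:
            self.healthy.add(device)

    def next_device(self) -> str:
        """Return the next healthy device in rotation."""
        for _ in range(len(self.devices)):
            device = self.devices[self._index]
            self._index = (self._index + 1) % len(self.devices)
            if device in self.healthy:
                return device
        raise RuntimeError("no healthy devices available")


balancer = RoundRobinBalancer(["lb-1", "lb-2", "lb-3"])
balancer.mark_down("lb-2")
print([balancer.next_device() for _ in range(4)])
# ['lb-1', 'lb-3', 'lb-1', 'lb-3']
```

Traffic keeps flowing while `lb-2` is down, but each remaining device now carries a larger share, which is why capacity planning and load balancing belong together.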
There are several aspects to focus on when you are designing a network redundancy strategy. Redundancy should be built into the network design from the very beginning with scalability and disaster recovery in mind, but there are also a few key tips if you're cleaning up after the Frankenstein's monster.
Monitoring plays a critical role in network redundancy. You need visibility into uptime, network availability, and any disruptions that occur. You need to know that something failed and that the failover process kicked in, so you can respond appropriately. Monitoring all aspects of your network infrastructure helps you catch both hardware failures and signs of cyberattacks early, and this is where Paessler PRTG comes into play.
We've talked before about all the ways you can use PRTG sensors to monitor and track your network. Router and switch monitoring. Firewall monitoring. Even monitoring your internet connections. All of these can be used to track and ensure your network redundancy systems are in place and functional.
Monitoring is also essential for catching issues before they become critical failures. Maybe your primary router is on the fritz and showing signs of distress. Maybe your bandwidth usage has been creeping up because one of your two redundant paths is down, and you weren't aware. Or maybe one of your servers has crashed and now your network operations are compromised. PRTG sensors will alert you to these kinds of issues before they can become full-blown business-affecting problems.
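The "one of two paths is down and nobody noticed" case is worth modeling explicitly, because everything still works from the user's point of view. The sketch below is not a PRTG API; it is a hypothetical helper that takes the up/down state of each redundant path, however you collect it, and separates "ok" from "degraded" from "down".

```python
from typing import Mapping


def redundancy_status(path_states: Mapping[str, bool]) -> str:
    """Classify a set of redundant paths as 'ok', 'degraded', or 'down'."""
    up = sum(1 for is_up in path_states.values() if is_up)
    if up == 0:
        return "down"        # nothing left to carry traffic
    if up < len(path_states):
        return "degraded"    # still working, but redundancy is reduced
    return "ok"


print(redundancy_status({"isp-a": True, "isp-b": True}))    # ok
print(redundancy_status({"isp-a": True, "isp-b": False}))   # degraded
print(redundancy_status({"isp-a": False, "isp-b": False}))  # down
```

Alerting on "degraded" rather than only on "down" is what lets you repair the failed path before the second one goes too.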
Track everything. Network traffic patterns, bandwidth, server and network device performance, your failover events and why they are occurring, and your network performance overall so you can spot degradation and know when to take action.
Finally, a few of the most common mistakes when it comes to network redundancy.
Network redundancy is what ensures business continuity when network failures, outages, or disruptions occur. It protects against everything from hardware failures to natural disasters and cyberattacks. Network redundancy allows your users to continue working and your customers to continue shopping even when parts of your network infrastructure might be down for maintenance.
The best time to implement network redundancy is in the design phase when building a new network, but it can also be done as a retrofit project. Analyze your network. Figure out where the single points of failure are, then put redundancy in place.
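Finding single points of failure can be sketched as a small graph exercise: remove one device at a time and check whether traffic can still get from A to B. The example below assumes a simple, hand-written topology where every link is listed in both directions; the device names and the `single_points_of_failure` function are hypothetical, and a real network would need its topology exported from documentation or discovery tools first.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def reachable(links: Dict[str, List[str]], start: str,
              skip: Optional[str] = None) -> Set[str]:
    """Breadth-first search from start, pretending the 'skip' device is gone."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor != skip and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links: Dict[str, List[str]],
                             source: str, target: str) -> List[str]:
    """Devices whose loss alone cuts source off from target."""
    return sorted(
        node for node in links
        if node not in (source, target)
        and target not in reachable(links, source, skip=node)
    )


topology = {
    "server":   ["switch-1", "switch-2"],
    "switch-1": ["server", "router"],
    "switch-2": ["server", "router"],
    "router":   ["switch-1", "switch-2", "isp-a", "isp-b"],
    "isp-a":    ["router", "internet"],
    "isp-b":    ["router", "internet"],
    "internet": ["isp-a", "isp-b"],
}
print(single_points_of_failure(topology, "server", "internet"))  # ['router']
```

In this made-up topology the switches and internet connections are duplicated, yet everything still funnels through one router, which is exactly the kind of gap this analysis is meant to surface.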
Start with the most mission-critical systems and then work out from there. Pick the right solution for each element of your network. Test your failover systems so that you are sure everything works. Document, document, document. And monitor your network and all its failover activities to ensure you know what is happening, when, and why. PRTG can help here, and you can try it for 30 days for free.