Many common network issues can be minimized or avoided altogether with a mixture of proactive monitoring, anticipation of problems, and planned recovery strategies.
1. Hardware Monitoring
The first step in avoiding common network issues is proper hardware monitoring. Waiting for the help desk phone to start ringing to find out a router is down is a good way to let network issues fester to the point of failure. There is a wide array of tools and resources for monitoring networks and attached systems, from comprehensive solutions to hardware-specific tools and alerts. Using these hardware monitors properly can help uncover problems before they turn into failures.
Internal server temperatures, CPU temperature and other heat monitors need to be watched to ensure that everything keeps operating within specifications. Errors reading from or writing to hard drives, memory errors and even increasing bandwidth usage can be signs of hardware trouble in the making. Set alerts to proactively contact system administrators when readings stray outside of the proper ranges. Responding quickly to a temperature problem can be the difference between a simple hot-swap and a crashed server.
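As a rough illustration of the alerting described above, the sketch below compares a set of sensor readings against acceptable ranges and reports anything outside them. The metric names, the limits and the find_violations helper are all hypothetical; real limits should come from the hardware vendor's specifications, and the readings would come from whatever monitoring tool is already in use.

```python
from typing import Dict, List, Tuple

# Hypothetical limits for illustration only; use the vendor's specifications.
LIMITS: Dict[str, Tuple[float, float]] = {
    "cpu_temp_c": (10.0, 80.0),    # acceptable CPU temperature range
    "disk_io_errors": (0.0, 0.0),  # any read/write error is worth an alert
    "memory_errors": (0.0, 0.0),
}


def find_violations(
    readings: Dict[str, float],
    limits: Dict[str, Tuple[float, float]] = LIMITS,
) -> List[str]:
    """Return one message per reading that falls outside its allowed range."""
    alerts = []
    for name, value in readings.items():
        if name not in limits:
            continue  # no limit configured for this metric
        low, high = limits[name]
        if not low <= value <= high:
            alerts.append(f"{name}={value} is outside {low}..{high}")
    return alerts


if __name__ == "__main__":
    sample = {"cpu_temp_c": 91.5, "disk_io_errors": 0, "memory_errors": 2}
    for message in find_violations(sample):
        print("ALERT:", message)  # a real system would page an administrator
```

Keeping the limits in one table makes it easy to review them against the hardware specifications whenever equipment changes.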
2. Set System Alerts and Thresholds
Next, be sure to set up monitors and alerts for higher-level system events and resource usage. Operating system performance monitors can often detect impending disaster. Is a hard drive filling up for no reason? Is there enough disk space? Is the networking bandwidth full, or suddenly underused? Constant paging can be a sign of a system without enough memory, or of physical memory errors. These resource monitors can provide advance warning of potential hardware issues. Detecting these anomalies and finding their root cause before any problems occur is the best way to avoid common network issues.
A well-run network is a fast network. Servers without enough resources are sluggish and impede performance throughout the network. Configuring alerts that trigger when resources are running out provides a window to act before the network is impacted. An alert when storage reaches 80 percent of capacity, for example, provides plenty of time to find an optimal solution rather than hastily trying to delete files or expand storage as users howl in discontent over the diminished network.
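The 80 percent storage alert mentioned above could be sketched as follows. This is a minimal example that uses Python's standard shutil.disk_usage call; the function names and the idea of checking a single path are assumptions made for illustration, and most monitoring platforms offer an equivalent built-in check.

```python
import shutil


def percent_used(used_bytes: int, total_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def storage_alert(path: str, threshold_percent: float = 80.0) -> bool:
    """Return True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return percent_used(usage.used, usage.total) >= threshold_percent


if __name__ == "__main__":
    if storage_alert("."):
        print("ALERT: storage at or above 80 percent of capacity")
```

Run on a schedule, a check like this leaves time to plan a fix rather than react to a full disk.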
3. Regular Database Maintenance
Don't forget to monitor and troubleshoot business-critical databases. Poor database performance can be a sign of numerous problems, including hardware issues, improperly tuned applications, or even corruption. Changes in performance should be investigated immediately. Troubleshooting issues quickly, along with routine database maintenance, will help minimize database issues on the network.
4. Power Is Everything
Even if all systems are running as they should, problems can arise from outside of the network equipment and servers themselves. Don't forget to monitor the conditions around equipment. Whether it is a state-of-the-art datacenter or a couple of routers tucked in a closet, proper environmental conditions are a must for avoiding network issues. Power is the most common issue.
A power surge can fry delicate electronics, and even a quick power flicker can reboot servers or corrupt data storage. A backup generator, moreover, is only as good as the process that brings it online. Almost every sys admin with more than a few years of experience has a story about a generator with no gas, or a power circuit that was shut off during maintenance and never reconnected.
Ensure that critical systems have uninterruptible power supplies with battery backup to keep them running while the generator powers up and to protect them from power surges. Most importantly, test your power failure system regularly. Have an electrician demonstrate how to cut power to various equipment.
5. Is It Hot In Here, or Is It Just Your Router?
Room temperature is another common problem. Stopped air conditioning units, clogged vents, or a delivery unknowingly set down on top of a grate can all impede proper cooling and send temperatures out of the proper operating range. Be sure that temperatures are not only measured, but that alerts are configured as well. It does not help to find out when the first engineer shows up in the morning that the server room has been 98 degrees since midnight. By then, the damage may have already been done.
Don't forget to monitor the environment for water and moisture as well. An unfortunately placed drip or leak can damage a server or router long before it causes any visible problems.
6. Keep An Eye On That Off-Site Equipment Too
Off-site and out of sight should not mean out of mind. Virtual servers, whether in a public or private cloud, are just as critical as any other production system. Ensure that monitors and alerts are set for those systems just as they are for on-site systems, and review the monitoring and alerting that any vendors have in place at the facilities where the hardware is located. Vendors should be willing to review recent tests of cooling systems, backup power, and redundant network connections. Add an extra monitor of your own to ensure that systems are reachable from all required locations (this may take several monitors). It doesn't matter if a server is "up" if users cannot connect to it.
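One simple way to approximate the reachability monitor described above is a TCP connection test, run from each location that needs access. The sketch below uses Python's standard socket.create_connection; the host name and port in the usage section are placeholders, and a successful connection only shows that the port answers, not that the application behind it is healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


if __name__ == "__main__":
    # Placeholder target; replace with the real server and service port.
    target = ("server.example.com", 443)
    if not is_reachable(*target):
        print(f"ALERT: {target[0]}:{target[1]} is not reachable from this location")
```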
7. Proper Change Management
Another way to avoid common network issues is to practice solid change management. Proper change management is critical to avoiding those self-inflicted network issues that have nothing to do with hardware or failing services. How many times has downtime been caused by a seemingly simple change? An updated configuration file, a new login script, a changed route, new equipment and many other changes are enough to take down part or all of a network. Proper testing, vetting and implementation of any changes into the production network is essential to preventing unnecessary network issues.
8. Is Your Website Making A Good First Impression?
Some systems in the network require special attention. Company websites are crucial resources for many businesses. As such, they require additional care. Ensuring the website is up requires a monitor that tests for response not just from on-premises, but from the outside world as well. Make sure to set a remote monitor that uses the public internet to check if the website is up. Otherwise, internal routing may give a false sense of security.
Remember, being "up" isn't enough. A website must be functional and responsive enough to keep customers from leaving. Monitor how quickly pages load and set an alert to detect spikes in traffic. Such an alert may also give the company notice of a non-technical event like a public relations issue from the company appearing on the news or otherwise gaining additional publicity.
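A basic version of the website check described above might look like the following sketch, which fetches a page with Python's standard urllib and classifies the result as "ok", "slow" or "down". The two-second slowness limit, the function names and the example URL are assumptions, and the script only tests the public path if it is actually run from a host outside the company network.

```python
import http.client
import time
import urllib.request
from typing import Optional, Tuple


def fetch_status(url: str, timeout: float = 10.0) -> Tuple[Optional[int], float]:
    """Fetch `url`; return (HTTP status, or None on failure, and elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # include the body download in the timing
            status: Optional[int] = response.status
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and timeouts all derive from OSError.
        status = None
    return status, time.monotonic() - start


def classify(status: Optional[int], elapsed: float, slow_after: float = 2.0) -> str:
    """Turn a fetch result into 'ok', 'slow' or 'down'."""
    if status is None or status >= 400:
        return "down"
    if elapsed > slow_after:
        return "slow"
    return "ok"


if __name__ == "__main__":
    status, elapsed = fetch_status("https://www.example.com/")  # placeholder URL
    print(classify(status, elapsed), f"({elapsed:.2f}s)")
```

Separating the fetch from the classification keeps the thresholds easy to adjust as expectations for page speed change.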
9. Can You Hear Me Now?
In companies with Voice over IP (VoIP) phone or conferencing systems, additional attention is required to keep them running properly. Sound quality issues and video streaming problems are frustrating for users. Such interruptions are often the result of a delay or loss in packet transmission. Monitor your network for these events, and ensure that it can handle any growth.
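To make the delay and loss measurements above concrete, the sketch below summarizes a series of probe results into a packet loss percentage and a simple jitter figure. It assumes the round-trip times have already been collected by some probing tool; the function name is hypothetical, and the jitter calculation is a simplified mean of consecutive differences rather than a standards-defined estimator.

```python
from typing import List, Optional, Tuple


def loss_and_jitter(rtts_ms: List[Optional[float]]) -> Tuple[float, float]:
    """Return (packet loss percent, mean jitter in ms) for a series of probes.

    Each entry is a round-trip time in milliseconds, or None for a lost probe.
    Jitter is the mean absolute difference between consecutive received
    round-trip times.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe result")
    received = [r for r in rtts_ms if r is not None]
    loss_percent = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = sum(deltas) / len(deltas) if deltas else 0.0
    return loss_percent, jitter_ms


if __name__ == "__main__":
    probes = [20.0, 22.0, None, 21.0, 30.0]  # one lost probe out of five
    loss, jitter = loss_and_jitter(probes)
    print(f"loss={loss:.1f}% jitter={jitter:.1f}ms")
```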
10. Keep The Bad Guys Out
Avoiding common network issues means avoiding malicious actions as well. Never disregard a sudden change in service level. CPU spikes can indicate attempted attacks, as can sudden bursts of traffic or an increase in the number of login attempts. Heavy off-hours disk usage may be a sign of someone, inside or out, attempting to copy (or worse, erase) large amounts of corporate data. Sometimes, these network monitoring signals will be noticeable before any detection by security monitoring software or antimalware software.
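One way to flag the sudden changes described above is to compare the latest measurement against a recent baseline. The sketch below treats a value as a spike when it exceeds the baseline mean by more than a few standard deviations; the three-sigma rule, the min_delta floor and the function name are illustrative assumptions, and the counts could be login attempts, requests per minute or any similar metric.

```python
import statistics
from typing import Sequence


def is_spike(
    history: Sequence[float],
    latest: float,
    sigmas: float = 3.0,
    min_delta: float = 5.0,
) -> bool:
    """Return True when `latest` is unusually high compared with `history`.

    `min_delta` keeps tiny changes on a very steady baseline from alerting.
    """
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return latest - mean > max(sigmas * stdev, min_delta)


if __name__ == "__main__":
    failed_logins_per_minute = [10, 12, 11, 9, 10, 11]  # sample baseline
    if is_spike(failed_logins_per_minute, latest=40):
        print("ALERT: login attempts well above the recent baseline")
```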
11. Make Sure Your Disaster Recovery Isn't A Second Disaster
Despite the best efforts to avoid common network issues, there is no such thing as a problem-free network. No matter how great your monitoring and responses are, sooner or later, something will fail. Planning ahead and having systems in place to recover will minimize downtime. One useful trick is to automate the rebooting process. When some server services hang or fail, a reboot is all that is necessary to correct the issue. Waiting for an alert to go out, then waiting for an administrator to trigger a reboot, costs precious minutes. An automated process can trigger a reboot whenever a server or service has been down for a predetermined time, eliminating a network issue before it causes problems.
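The automated reboot described above comes down to tracking how long a service has been failing and acting once a grace period has passed. The following sketch shows that logic in isolation; the Watchdog class, the 120-second default and the injected is_healthy and restart callables are all hypothetical, and the actual health check and restart command depend on the platform.

```python
from typing import Callable, Optional


class Watchdog:
    """Trigger a restart once a service has been unhealthy for a grace period."""

    def __init__(
        self,
        is_healthy: Callable[[], bool],
        restart: Callable[[], None],
        grace_seconds: float = 120.0,
    ) -> None:
        self.is_healthy = is_healthy
        self.restart = restart
        self.grace_seconds = grace_seconds
        self.down_since: Optional[float] = None  # time of first failed check

    def tick(self, now: float) -> bool:
        """Run one check at time `now`; return True if a restart was triggered."""
        if self.is_healthy():
            self.down_since = None
            return False
        if self.down_since is None:
            self.down_since = now
        if now - self.down_since >= self.grace_seconds:
            self.restart()
            self.down_since = None  # give the service a fresh grace period
            return True
        return False


if __name__ == "__main__":
    import time

    # Stand-in callables; a real check might probe a port or a status page.
    dog = Watchdog(lambda: False, lambda: print("restarting service"), grace_seconds=0)
    dog.tick(time.monotonic())
```

Passing the current time into tick keeps the logic easy to test, and the grace period avoids restarting a service over a single failed check.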
When something does fail despite all the best efforts, backups are vital. Unfortunately, a disaster is often the first time the backups have been tested. Have you ever seen the frustration of restoring incremental backups weeks, or months, apart from the last full backup? Or, have you seen an administrator's face go white when they realize that the mirrored server has been mirroring all the errors from the original? Like backup power, backup data needs to be tested frequently.
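Part of a backup test can be automated by restoring a file to a scratch location and checking it against a checksum recorded when the backup was taken. The sketch below shows only that comparison step, using Python's standard hashlib; the function names and the practice of storing a SHA-256 value alongside each backup are assumptions, and a full restore test would also confirm that applications can actually use the restored data.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(restored: Path, expected_sha256: str) -> bool:
    """Return True if the restored file matches the checksum recorded at backup time."""
    return sha256_of(restored) == expected_sha256.lower()
```

Comparing against a checksum recorded at backup time, rather than against the live copy, helps catch the case where a mirror has faithfully copied a corrupted original.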
12. Know What To Do, and Do It Well
When it comes to avoiding common network issues, the key is anticipation. Knowing what can happen, knowing what to do if it does happen, and then doing that well is all it takes to keep a network running. However, simple network availability is not the end goal. A well-run network that avoids common issues requires proper monitoring and a timely response to the data those monitors provide. If you do that, your network will be as solid as it can be.