Someone calling in saying the network is slow is about as useful as the patient telling the doctor she feels sick.
It doesn’t provide you with a specific situation to rectify. It only lets you know that you had better do some testing and diagnosis. And in any case, the problem is unlikely to be what the patient, or the end user, thinks it is.
With a network, for example, the cables didn’t suddenly decide to take it easy and cut their bit rate in half. Rather, the fault could lie in just about any part of the infrastructure. Sometimes it can even be traced back to an uninformed user who didn’t know that downloading an innocent-looking piece of spyware would slow down his computer because it keeps banging away at the firewall trying to find a route out.
“If the lights dim, people would say it’s the network,” says Al Hofmann, director of network services for Hartford Hospital, Connecticut’s largest hospital, with 45 buildings connected on its main campus LAN and nearly 100 remote facilities linked up via ATM, frame relay, ISDN and VPNs. “But when you have a Gigabit network, nine times out of 10, it is the peripheral, not the network.”
No, the problem isn’t always the network, but that doesn’t absolve network administrators from having to shoulder the task of determining where the true cause lies.
Once Hofmann put a network management system in place, he could just point others to a screen showing that the network was working just fine so they could start looking elsewhere. He estimates that this has cut the number of unnecessary calls by two-thirds.
“We don’t have to do a lot of detective work to convince people anymore that it is not a network problem,” he says. “I don’t have to do the convincing as the tool does it for us.”
Negotiating a Connection
Network management systems, of course, are useful for more than deflecting the blame.
There are times when there is a problem with the network and it is a matter of learning to read the symptoms — utilization (bits in and out), response time in milliseconds or uptime of applications, servers, routers and switches. These things point you toward the area of the problem, but not necessarily to its cause. The next step is to dig into the logs and see what types of errors are cropping up.
Take the example of speed and duplex mismatches on a 10/100Mb Ethernet network. When you plug in a device, the switch and the device are supposed to negotiate the speed and duplex mode for the connection. But that doesn’t work 100 percent of the time. And when the two ends settle on mismatched settings, errors and retries start clogging up the connection.
“When that occurs, whether it is on the switch, the PC or a Unix box, the interface will start logging FCS (Frame Check Sequence) errors,” says Alan Rice, technical services administrator for Manatee County, Fla. “When you have it on a server, it will seem like it is running slow because of the large number of retries. The end users will start reporting the network is slow when it is really just a mismatch on the server and switch.”
Rice says that to address this problem, he set up reports on his network management system, WebNM from Somix Technologies, Inc. of Sanford, Maine, to monitor for FCS errors. He has about 6,000 reports set up in the system, half of them for port utilization and the rest for FCS errors. He also uses his log management software to track FCS error entries. When these errors exceed a threshold, the log manager issues an alert that there is a duplex mismatch on the network.
“This allows us to be proactive and resolve it before the end users notice there is a problem,” says Rice. “We can either force the switch and PC to operate at the same speed, or we can reboot and let the devices renegotiate the connection.”
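The threshold check Rice describes can be sketched in a few lines of Python. This is only an illustration, not how WebNM or his log manager works: the `InterfaceSample` record, the interface names, the sample counts and the 1 percent default threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class InterfaceSample:
    """Hypothetical per-interface counts for one polling interval."""
    name: str
    frames_received: int  # frames seen during the interval
    fcs_errors: int       # FCS errors logged during the same interval


def fcs_error_rate(sample):
    """Fraction of received frames that failed the frame check sequence."""
    if sample.frames_received == 0:
        return 0.0  # an idle port has no meaningful error rate
    return sample.fcs_errors / sample.frames_received


def flag_suspect_interfaces(samples, max_error_rate=0.01):
    """Return names of interfaces whose FCS error rate exceeds the threshold.

    The 1 percent default is an arbitrary illustrative value.
    """
    return [s.name for s in samples if fcs_error_rate(s) > max_error_rate]


# Made-up sample data for three ports.
samples = [
    InterfaceSample("switch1/port3", 200_000, 12),
    InterfaceSample("switch1/port7", 150_000, 9_000),
    InterfaceSample("switch2/port1", 0, 0),
]
suspects = flag_suspect_interfaces(samples)
print(suspects)
```

Using a rate rather than a raw error count keeps a busy server port and a lightly used desktop port on the same scale, though a real deployment would tune the threshold to its own traffic.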
Tracking Down the Device
FCS is just one of many performance aspects you can grab and analyze with a network and log management system. The exact items monitored will depend on the system architecture and what “pain points” the organization has. You can monitor every aspect of every device, but this will be more cumbersome than taking a more focused approach.
Instead, take the management aspects that are consuming the most time and automate those.
For servers, switches, routers and hubs, you can activate Simple Network Management Protocol (SNMP) and Remote Monitoring (RMON).
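SNMP interface statistics are exposed as running octet counters, so a utilization figure has to be derived from two readings taken some interval apart. The sketch below assumes the readings have already been collected by whatever poller is in use and that the counter is 32 bits wide; the function names and sample values are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # assumes a 32-bit octet counter


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter readings, allowing for one wraparound."""
    return (current - previous) % modulus


def utilization_percent(prev_octets, curr_octets, interval_seconds, link_bps):
    """Average link utilization over the polling interval, as a percentage."""
    bits_transferred = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits_transferred / (interval_seconds * link_bps)


# Made-up readings taken 60 seconds apart on a 100 Mbps port.
percent = utilization_percent(1_000_000, 76_000_000, 60, 100_000_000)
print(f"{percent:.1f}% utilized")
```

The modulo in `counter_delta` handles only a single wrap between polls, so on fast links the polling interval has to be short enough that the counter cannot wrap twice.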
For desktops and servers, keeping an eye out for Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) alerts will let you back up a hard disk and re-route the traffic before the drive crashes.
For external connections, monitoring Forward Explicit Congestion Notification (FECN) and Backward Explicit Congestion Notification (BECN) lets you know when the line is congested and packets are at risk of failing to make it through.
Firewalls and intrusion detection systems can alert you to attacks on the system, so you can shut down a port or device before it infects other parts of the network.
Having these reports and alerts doesn’t always tell you what is wrong with the network, but it does tell you where to start looking, and that cuts down repair time.
David Rosicki, Hartford Hospital’s network engineer, tells of how users at one remote site complained that their T-1 connection was slow. But the performance graphs showed that the traffic was peaking at 500 Kbps, well below its 1.544 Mbps capacity.
So, instead of trying to debug the network, they sent an analyst to the location to do a packet analysis and traced the problem to a workstation that was taking too long to write the data. Another time, however, when the hospital replaced a standalone Data Service Unit (a device that connects a serial line to a frame relay circuit) with a rack-mount unit from the same vendor, they found that the throughput did drop, so they replaced the defective device.
“With the graphing capability you can really tie down where the problem lies,” says Rosicki.
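The reasoning in the T-1 case comes down to comparing peak traffic against line capacity. A minimal sketch, using the figures Rosicki cites and an assumed 80 percent cutoff for calling a link saturated (the cutoff and the function names are illustrative, not the hospital's):

```python
T1_CAPACITY_KBPS = 1544  # 1.544 Mbps


def peak_utilization(peak_kbps, capacity_kbps):
    """Peak traffic as a fraction of line capacity."""
    return peak_kbps / capacity_kbps


def link_looks_saturated(peak_kbps, capacity_kbps, threshold=0.8):
    """True when peak traffic reaches the (assumed) saturation threshold."""
    return peak_utilization(peak_kbps, capacity_kbps) >= threshold


ratio = peak_utilization(500, T1_CAPACITY_KBPS)
print(f"Peak utilization: {ratio:.0%}")  # prints "Peak utilization: 32%"
print(link_looks_saturated(500, T1_CAPACITY_KBPS))
```

With the line peaking at roughly a third of its capacity, the link itself is an unlikely bottleneck, which is why the team looked at the workstation instead.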