In the annals of information technology, it’s not unusual for mail servers to go down. The trick, according to Greg Jackson, CIO for the University of Chicago, is to learn from the experience.
Jackson lived that lesson late last year when the university’s mail system stopped functioning. It took about a week to completely solve the problem, a time in which he and his staff learned — and re-learned — several crucial lessons.
“The first lesson we learned is to always be open to other explanations for problems,” he said.
Consider this, then, a case study not just in how things can go wrong, but in what can keep them from being fixed quickly.
Embrace Diversity, Unless…
Part of the problem, Jackson said, is that IT personnel in universities often don’t have the resources available in the corporate sector.
“We have to run very lean,” he said. He recalled that several years ago he visited a large hardware vendor and noted that they had 50 mail servers serving 15,000 accounts. “We were using three servers for the same number of accounts.”
Another factor that contributed to the University of Chicago’s e-mail meltdown was diversity. This doesn’t refer to the laudable type of human diversity for which colleges and universities are renowned, but diversity of computing systems. The problem, he said, was caused “partly by our own willingness to embrace unmanageable diversity.”
Also, in the university environment, “you can’t tell people what to do,” Jackson noted. As a result, some users demanded IMAP mail, some demanded POP and some demanded to use the ill-designed feature of POP that allows users to keep their messages on the server.
“POP was never meant to do that,” Jackson said. “It doesn’t work very well.”
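To see why, it helps to remember that "leaving mail on the server" is purely a client-side choice in POP: the client downloads each message and simply skips the delete step, so every message accumulates in the server-side mail store. The sketch below illustrates the behavior using Python's standard poplib module; the host name and credentials are placeholders, not the university's actual setup.

```python
import poplib

# Placeholder connection details -- illustrative only.
HOST = "pop.example.edu"
USER = "student"
PASSWORD = "secret"

conn = poplib.POP3_SSL(HOST)
conn.user(USER)
conn.pass_(PASSWORD)

num_messages, _mailbox_size = conn.stat()
for i in range(1, num_messages + 1):
    # retr() downloads the message; the client stores it locally.
    response, lines, octets = conn.retr(i)
    # A traditional POP client would now mark the message for deletion,
    # so the server copy disappears when the session ends.
    # "Leave mail on server" simply skips this call, so every message
    # stays in the server-side mail store indefinitely.
    # conn.dele(i)

conn.quit()
```

Multiply that ever-growing server-side mailbox across thousands of accounts, and the strain on the mail store adds up quickly.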
Yet another issue was the increasing number — and size — of mail messages that the system handles. Currently, Jackson said, the university’s mail servers handle about a half-million messages a day.
“And they’re more likely to have MIME content, attachments and images,” he said. “Then there’s all the spam, much of which has multimedia.”
So last November, the university’s mail servers (a four-processor, Solaris-powered Sun machine and a couple of adjunct servers) were already close to being overloaded.
“Six months ago, we replaced the server and, around Christmas, we were planning to swap in a 12-processor server with 20 times the capacity of the old machine,” he said. The new server was just sitting off to the side when the problem occurred.
“So we’re running a complex set of different things and we’re running lean,” Jackson summarized. “And the volume of mail is going up strikingly.”
He and his staff knew about all those problems before the meltdown occurred. However, they weren’t prepared for a big problem that they didn’t know about.
Post-Holiday Surprise
The meltdown started during the Thanksgiving break, when mail volume is very low, Jackson said. Because of the low volume, the problem didn’t become evident until the following Monday when students, staff and faculty returned.
“People are away for Thanksgiving and they come back and check their e-mail on Monday,” he said. “It happens every year that we see a spike, and sometimes it crosses the capacity of the server. Usually, the servers burp and gag a little, then they sort themselves out.”
When the predictable usage spike occurred on that Monday, he had his staff “burp” the servers: stop all incoming mail delivery, restart the machines and purge the queues. That helped a little. “By Tuesday, though, it was clear something was wrong,” Jackson said.
In fact, the slow traffic of the holiday followed by the predictable spike on the following Monday hid an unexpected problem.
“We didn’t know then that we had a major disk array that was failing in a very esoteric way,” Jackson said. “The disk didn’t fail completely. It only had a throughput that was a thousandth of what it should be.”
Mail was getting through, but service was slow and sporadic. He and his staff tried to diagnose the problem by focusing on the diversity of technology and the high volume of e-mail traffic.
“We had a couple of days of back and forth (looking for) the right strategy. We decided to switch over to the new server, which we weren’t going to do until Christmas. So we took the old disk array out of the old server and put it into the new one.
“We tried to bypass the problem, but the problem moved with us,” Jackson said.
Eventually, the IT personnel figured out that the disk array was the problem. The next step was negotiating with Sun to replace the disk arrays. Sun was reluctant at first, but after Jackson called a high-level Sun executive he knew, the company agreed to bring in additional disk arrays. After installation and a bit more tweaking, the e-mail problem was finally solved after a week of travail.
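That kind of partial failure, where a disk answers but crawls, is exactly what a periodic throughput probe against the mail store can surface. The sketch below is a generic illustration of the idea, not the university's actual tooling; the mount point and baseline figure are invented for the example.

```python
import os
import time

# Hypothetical mount point and baseline -- illustrative only.
MOUNT_POINT = "/var/mail-store"
TEST_SIZE_MB = 64
MIN_MB_PER_SEC = 5.0   # well below what a healthy array should sustain

def measure_write_throughput(path, size_mb):
    """Time a sequential write to the array and return throughput in MB/s."""
    test_file = os.path.join(path, "throughput_probe.tmp")
    chunk = b"\0" * (1024 * 1024)
    start = time.time()
    with open(test_file, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())   # force the data to disk before stopping the clock
    elapsed = time.time() - start
    os.remove(test_file)
    return size_mb / elapsed

if __name__ == "__main__":
    rate = measure_write_throughput(MOUNT_POINT, TEST_SIZE_MB)
    print(f"Sequential write: {rate:.1f} MB/s")
    if rate < MIN_MB_PER_SEC:
        print("WARNING: throughput far below baseline -- suspect the array")
```

A healthy array should sustain far more than the baseline used here; one degraded to a thousandth of its normal throughput would trip the warning immediately.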
Simplify, Simplify
While diversity of technology and heavy traffic weren’t the problems, Jackson said that considering those factors slowed down the problem-solving process. So lesson No. 1 is to simplify whenever possible, he said.
To that end, he said the university will speed the process of phasing out POP e-mail.
A second lesson, Jackson said, is to be open to explanations for problems other than the obvious ones. “That cost us two days,” he said of the time spent examining the more obvious issues that could have caused the meltdown.
A third lesson, he said, is to avoid single points of failure.
“The entire mail store was on a single T-3 disk array. Now, we have them spread out over four disk arrays.” That way, he said, if one disk array fails, only a quarter of the users can’t get their mail.
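One straightforward way to spread mailboxes across four arrays is to map each account to an array with a stable hash, so that each array ends up holding roughly a quarter of the users. The sketch below illustrates the idea; the mount-point names are invented for the example, not the university's actual layout.

```python
import hashlib

# Hypothetical layout: four separate mount points, one per disk array.
ARRAYS = ["/mail-store-0", "/mail-store-1", "/mail-store-2", "/mail-store-3"]

def array_for_user(username):
    """Map a username to one of the four arrays with a stable hash,
    so the mailboxes divide roughly evenly and a given user always
    lands on the same array."""
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    return ARRAYS[int(digest, 16) % len(ARRAYS)]

if __name__ == "__main__":
    for user in ["alice", "bob", "carol", "dave"]:
        print(user, "->", array_for_user(user))
```

With a layout like this, losing one array takes down only the accounts hashed to it, which is the quarter-of-the-users failure mode Jackson describes.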
A final lesson is that “you have to be ready to escalate with a key vendor. It’s rational for vendors to avoid sending people out to your place, but in this case, it had to happen.”