Lessons Learned From an Email Meltdown

Written By

David Haskin

Feb 25, 2003

In the annals of information technology, it’s not unusual for mail servers to go down. The trick, according to Greg Jackson, CIO for the University of Chicago, is to learn from the experience.

Jackson lived that lesson late last year when the university’s mail system stopped functioning. It took about a week to completely solve the problem, a time in which he and his staff learned — and re-learned — several crucial lessons.

“The first lesson we learned is to always be open to other explanations for problems,” he said.

Consider this, then, a case study not just about how things can go wrong, but what can prevent them from being fixed.

Embrace Diversity, Unless…

Part of the problem, Jackson said, is that IT personnel in universities often don’t have the resources available in the corporate sector.

“We have to run very lean,” he said. He recalled that several years ago he visited a large hardware vendor and noted that they had 50 mail servers serving 15,000 accounts. “We were using three servers for the same number of accounts.”

Another factor that contributed to the University of Chicago’s e-mail meltdown was diversity. This doesn’t refer to the laudable type of human diversity for which colleges and universities are renowned, but diversity of computing systems. The problem, he said, was caused “partly by our own willingness to embrace unmanageable diversity.”

Also, in the university environment, “you can’t tell people what to do,” Jackson noted. As a result, some users demanded IMAP mail, some demanded POP and some demanded to use the ill-designed feature of POP that allows users to keep their messages on the server.

“POP was never meant to do that,” Jackson said. “It doesn’t work very well.”

Yet another issue was the increasing number — and size — of mail messages that the system handles. Currently, Jackson said, the university’s mail servers handle about a half-million messages a day.

“And they’re more likely to have MIME content, attachments and images,” he said. “Then there’s all the spam, much of which has multimedia.”

So last November, the university’s mail servers — a four-processor Solaris-powered Sun server and a couple of adjunct servers — already were close to being overloaded.

“Six months ago, we replaced the server and, around Christmas, we were planning to swap in a 12-processor server with 20 times the capacity of the old machine,” he said. The new server was just sitting off to the side when the problem occurred.

“So we’re running a complex set of different things and we’re running lean,” Jackson summarized. “And the volume of mail is going up strikingly.”

He and his staff knew about all those problems before the meltdown occurred. However, they weren’t prepared for a big problem that they didn’t know about.

Post-Holiday Surprise

The meltdown started during the Thanksgiving break, when mail volume is very low, Jackson said. Because of the low volume, the problem didn’t become evident until the following Monday when students, staff and faculty returned.

“People are away for Thanksgiving and they come back and check their e-mail on Monday,” he said. “It happens every year that we see a spike and sometimes it crosses the capacity of the server. Usually, the servers burp and gag a little, then they sort themselves out.”

When the predictable usage spike occurred on that Monday, he had his staff “burp” the servers: stop all incoming mail delivery, restart the machines and purge the queues. That helped a little. “By Tuesday, though, it was clear something was wrong,” Jackson said.

In fact, the slow traffic of the holiday followed by the predictable spike on the following Monday hid an unexpected problem.

“We didn’t know then that we had a major disk array that was failing in a very esoteric way,” Jackson said. “The disk didn’t fail completely. It only had a throughput that was a thousandth of what it should be.”
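A failure of this kind is easy to miss because a simple up-or-down check still reports the disk as alive. The sketch below is an illustration only, not the university's tooling: it compares a measured throughput figure against an expected baseline and reports a third state, "degraded," between "ok" and "failed." The function name, the 50 percent threshold and the sample figures are all assumptions made for the example.

```python
from typing import Optional


def classify_disk(measured_mb_s: Optional[float],
                  expected_mb_s: float,
                  min_ratio: float = 0.5) -> str:
    """Classify a disk by comparing measured throughput with a baseline.

    measured_mb_s: observed throughput, or None if the disk did not respond.
    expected_mb_s: baseline throughput for a healthy disk (hypothetical value).
    min_ratio: assumed fraction of the baseline below which the disk
               counts as degraded.
    """
    if expected_mb_s <= 0:
        raise ValueError("expected_mb_s must be positive")
    if measured_mb_s is None or measured_mb_s <= 0:
        return "failed"
    if measured_mb_s / expected_mb_s < min_ratio:
        return "degraded"
    return "ok"


# A disk running at a thousandth of its baseline still responds,
# so an up/down check passes, but the ratio check flags it.
status = classify_disk(measured_mb_s=0.04, expected_mb_s=40.0)
print(status)
```

With made-up figures like these, the disk is reported as degraded rather than healthy, which is the distinction that was hidden during the low-traffic holiday.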

Mail was getting through, but service was slow and sporadic. He and his staff tried figuring out the problem by focusing on the diversity of technology and high volume of e-mail traffic.

“We had a couple of days of back and forth (looking for) the right strategy. We decided to switch over to the new server, which we weren’t going to do until Christmas. So we took the old disk array out of the old server and put it into the new one.

“We tried to bypass the problem, but the problem moved with us,” Jackson said.

Eventually, the IT personnel figured out that the disk array was the problem. The next step was negotiating with Sun to replace the disk arrays. Sun was reluctant at first but, after Jackson called a high-level Sun executive he knew, the company agreed to bring additional disk arrays. After installation and a bit more tweaking, the e-mail problem was finally solved after a week of travail.

Simplify, Simplify

While diversity of technology and heavy traffic weren’t the problems, Jackson said that considering those factors slowed down the problem-solving process. So lesson No. 1 is to simplify whenever possible, he said.

To that end, he said the university will speed the process of phasing out POP e-mail.

A second lesson, Jackson said, is to be open to explanations for problems other than the obvious ones. “That cost us two days,” he said of the time spent examining the more obvious issues that could have caused the meltdown.

A third lesson, he said, is to avoid single points of failure.

“The entire mail store was on a single T-3 disk array. Now, we have them spread out over four disk arrays.” That way, he said, if one disk array fails, only a quarter of the users can’t get their mail.
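The arithmetic behind that quarter can be sketched in a few lines. The article does not say how the university assigns mailboxes to arrays, so the example below assumes a stable hash of the user name; the array names and user names are hypothetical, and the 15,000 accounts echo the figure Jackson cited earlier.

```python
import hashlib

# Hypothetical names for the four disk arrays.
ARRAYS = ["array-0", "array-1", "array-2", "array-3"]


def array_for(user: str) -> str:
    """Map a mailbox to one array using a stable hash of the user name."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ARRAYS)
    return ARRAYS[index]


def affected_fraction(users, failed_array: str) -> float:
    """Return the share of users whose mailbox lives on the failed array."""
    users = list(users)
    hit = sum(1 for user in users if array_for(user) == failed_array)
    return hit / len(users)


# Made-up account names, one per mailbox.
users = [f"user{i}" for i in range(15000)]
fraction = affected_fraction(users, "array-2")
print(f"{fraction:.1%} of users lose access if array-2 fails")
```

With an even spread, the share comes out close to one in four, compared with every user when the whole mail store sits on one array.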

A final lesson is that “you have to be ready to escalate with a key vendor. It’s rational for vendors to avoid sending people out to your place, but in this case, it had to happen.”
