Distributed databases, distributed headaches

In the multiterabyte world of really big databases, some IT executives are finding that consolidating distributed datacenters into a few "megacenters" saves them money.

Chuck Shellhouse knows really big databases. Shellhouse, the Colorado Springs-based director of service operations in the information technology division of MCI, is responsible for managing more than 40 terabytes of data located in datacenters around the country. Those databases, which primarily run DB2 and Adabas on IBM and Hitachi mainframes, contain all of the business applications needed to run the telecommunications company's entire $20+ billion revenue stream.

Even though MCI's database dwarfs the databases of most corporations, MCI's computing model has become increasingly common. With companies generating and keeping more data to make better decisions on how to make money, many organizations now rely on the model of geographically dispersed, multiterabyte databases.

But today's forward-thinking companies are changing the definition of distributed computing. No longer are they managing hundreds of distributed environments running small servers. By and large, they've found such set-ups, which involve installing software and backing up data at each location, to be time-consuming and expensive.

Instead, these companies have consolidated their data into just a few datacenters, housing much larger amounts of data at each center. At MCI, Shellhouse and his staff used to manage numerous datacenters in many locations around the country. But with managerial problems and costs spiraling--the datacenters required on-site support personnel, operational personnel, and systems programmers at each location--Shellhouse and his team devised a plan to replace those datacenters with "megacenters" on the backbone of MCI's network. Today, the company has just four datacenters. They operate in an automated, lights-out environment.

Finding profits in data

MCI's progression toward its current four large megacenters mirrors that of many Fortune 1,000 companies. Cost is a primary reason.

Mentis recently surveyed U.S. banks and thrift institutions that have at least one currently operational data warehouse (or plans, with funding, for implementing a data warehouse) and asked each one to specify the expected size of its primary data warehouse database (including all data, indexes, and metadata). Nearly a third of the databases were projected to be 500 gigabytes or more.
Source: Mentis

Among other companies consolidating datacenters are Chase Manhattan and Norcross, Ga.-based CheckFree. Banc One, based in Columbus, Ohio, has gone through a major transition from decentralization to centralization, Citibank is migrating from multiple data networks to a common network, and BankBoston has consolidated from two datacenters to one.

Consolidating a dozen datacenters into a few makes a lot of sense for most large companies, says Daniel Graham, a Somers, N.Y.-based strategy and operations executive in IBM's Global Business Intelligence Solutions group.

"Having [distributed datacenters] is like having children. Two are wonderful, four are a lot of fun, six start to be a drain on your resources, and 20 are impossible," Graham says. "Every time you have another child, you have bought yourself a certain amount of overhead cost."

Marketing is driving the growth in database size. Companies are consolidating internal databases, purchasing additional market-research data, and holding onto data longer to better focus their marketing.

"The more behavioral characteristics companies can analyze about their customers, the better they can serve them," says IBM's Graham. "If I can start tracking more about Dan Graham's purchasing behavior, the kinds of things he likes and doesn't like, the stores he frequents, and the kind of clothes he buys, I can start using the company's datamining tools and data warehouses to target-market to him. I'd send him a lot less junk mail, but send him the kind of deals he cares about."

As a result of efforts such as these, MCI's 40-terabyte database won't be considered unusually large for very long, experts believe. Today, many Fortune 1,000 companies handle 300 to 500 gigabytes of raw data, which translates to about 1.5 terabytes of DASD, and the average database size is expected to more than triple in the next few years (see chart, "Coming soon to a database near you: Godzilla!").

Inadequate tools

Building very large, geographically dispersed databases that hold just the kind of information needed to focus a company's marketing is a great concept, but many IT departments have found the tools for managing such behemoths inadequate, immature, and in some cases nonexistent.

It's quite common to hear IT managers complain about the lack of adequate tools. Jim Johnson, New York City-based chief technologist and system architect in the Human Resources group at Xerox, says he prefers to use off-the-shelf tools whenever possible, but has resorted to writing his own tools for some tasks. Johnson is in charge of Xerox's 120 gigabytes of human resources and personnel data, which is housed on eight Oracle servers distributed around the country.

Johnson uses BMC Patrol to monitor the system's hardware, operating system, and Oracle database, but has written his own tools to perform tasks Patrol does not perform. For example, Johnson's team wrote a distributed load process to load large data feeds from external sources into each of its eight servers, which guarantees that they are always in sync.

"In some cases, we preferred not to use commercial tools, because we had specific needs and because we felt we could keep a better handle on things that way," Johnson says. "I always prefer to buy off-the-shelf tools rather than write things in-house, but in this case, it just wasn't possible."

Graham of IBM agrees that effective tools to manage large, geographically dispersed databases are scarce. There are tools available to do some of the necessary tasks, such as transferring data from one server to another, but there's no tool, for example, that coordinates all of the destinations for the data and synchronizes schedules in a way that recognizes all other servers on the network. In addition, there is no single tool or collection of products that provides a complete solution for managing large and geographically dispersed databases.

Shellhouse and his MCI team gave up in frustration after trying to find the tools they needed. "The tools out there today either didn't fit our needs, or they couldn't handle our volumes," he says. What he needed, he says, simply doesn't exist today. Shellhouse and his team are working to build a database-management tool that can manage large, geographically dispersed databases effortlessly, dealing with alarms, recoveries, and backups.

New toys, new challenges

Lack of tools aside, there are other pressing issues and challenges in managing large, geographically dispersed databases. In order to keep the data on his servers in sync, Xerox's Johnson has implemented a series of checks and balances to make sure the data is correctly loaded on all servers and that nobody accesses the data until it's all in sync. This stressful nightly dance promises to become even more complicated as the system loads more data from external systems, and as the database grows to more than double its current size within a few years, he says.
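The article does not describe how Johnson's load process or his checks and balances actually work. As a rough illustration only, the general idea of loading one feed onto every server and reopening access only after every copy has been verified might be sketched as follows. The Server class, the load_all function, and the checksum comparison are all hypothetical stand-ins, not Xerox's code.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(rows):
    """Return a SHA-256 digest computed over the rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


@dataclass
class Server:
    """Hypothetical in-memory stand-in for one database server."""
    name: str
    rows: list = field(default_factory=list)
    open_to_users: bool = True

    def load(self, feed):
        # A real loader would write to the database; here we just copy.
        self.rows = list(feed)


def load_all(servers, feed):
    """Load the feed on every server; reopen access only if all copies match.

    Returns the names of servers whose loaded data does not match the feed.
    """
    expected = checksum(feed)
    for server in servers:
        server.open_to_users = False  # block readers while loading
    for server in servers:
        server.load(feed)
    out_of_sync = [s.name for s in servers if checksum(s.rows) != expected]
    if not out_of_sync:
        for server in servers:
            server.open_to_users = True  # every copy verified
    return out_of_sync
```

Calling load_all on a list of eight Server objects returns an empty list when every copy matches the feed. If any server comes back out of sync, all of them stay closed to users until the problem is fixed, which is the behavior the paragraph above describes.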

Lessons Learned

Move from dozens or even hundreds of distributed computing sites to a handful of automated, lights-out database facilities to generate significant savings.
Give your vendors and suppliers the flexibility to customize your applications.
Understand that moving from a distributed environment with hundreds of locations to one with three or four lights-out locations is a big change. Give your employees time to get used to the change.
Recognize which skills you need, and hire for the future, not the past. Managing datacenters remotely takes different skills from managing older-style datacenters, so you may need to hire systems administrators with different backgrounds than in the past.
If you can't find tools to meet your needs, consider modifying an off-the-shelf tool or even writing your own. Tools to manage large, geographically dispersed databases are still immature.

Today's dispersed, hands-off databases present challenges that did not exist even a few years ago. Managing large amounts of data remotely is a culture unto itself, and takes special skills, MCI's Shellhouse says.

"In a typical datacenter in the old days, the technical-support people could see and touch the hardware. That isn't the case anymore," Shellhouse says, because of the lights-out nature of his datacenters. "The biggest challenge we have is keeping our technical people current with the changing technology when they don't have a lab where they can see it and play with it. And they have to manage new databases without hands-on experience. It is difficult to refresh and train your personnel when they have never had the opportunity to see this stuff first-hand."

Reducing head count

Consolidation of large databases often entails new investments in hardware, software, and network infrastructure, but it can improve the bottom line by reducing personnel costs.

"We've found that the cost actually goes down as the databases get larger because of centralization and consolidation, which gives us economies of scale," says Shellhouse. "Before we moved to our megacenter concept, we had people at each location responsible for the day-to-day operations and technical support. There may have been only 10 terabytes of data in those days and 100 people. Today, we have 40 terabytes of DASD that is more centrally located and one-fourth the number of people. We've seen our headcount go down consistently year after year, yet our database growth rate has been in the 30% to 40% range."

As IBM's Graham puts it: "Every time you put a distributed database out there, you have just bought at least one systems programmer. You are replicating skills you have at your central hub."
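To put Shellhouse's figures in perspective, the short calculation below works out data managed per staff member before and after consolidation, and how long 30% to 40% annual growth would take to quadruple a database. It is a back-of-the-envelope sketch. It treats his approximate numbers (10 terabytes and 100 people, then 40 terabytes and one-fourth the staff) as exact and assumes steady compound growth, neither of which the article confirms.

```python
import math


def tb_per_person(terabytes, staff):
    """Terabytes of data managed per staff member."""
    return terabytes / staff


# Shellhouse's approximate figures, treated as exact for illustration.
before = tb_per_person(10, 100)      # before the megacenters
after = tb_per_person(40, 100 / 4)   # one-fourth the number of people
ratio = after / before               # improvement in data per person


def years_to_grow(factor, annual_rate):
    """Years of steady compound growth needed to multiply size by factor."""
    return math.log(factor) / math.log(1 + annual_rate)


if __name__ == "__main__":
    print(f"Before: {before:.2f} TB per person")
    print(f"After:  {after:.2f} TB per person")
    print(f"Ratio:  {ratio:.0f}x")
    for rate in (0.30, 0.40):
        years = years_to_grow(4, rate)
        print(f"At {rate:.0%} a year, 4x growth takes {years:.1f} years")
```

Under these assumptions each person looks after about 16 times as much data as before, and quadrupling from 10 to 40 terabytes takes roughly four to five years at the quoted growth rates.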

Johnson's experience at Xerox is that the size of the database doesn't have much impact on maintenance costs. The real cost for Xerox is the price of each server added to the system. To keep costs down, Johnson tries to keep the eight production servers as similar to one another as possible so that the staff doesn't have to manage several different server configurations.

The Internet: a new paradigm

Database experts have seen the future of distributed computing, and it is the Internet. The Internet provides IT managers with an easier mechanism for distributing data to end users. By simplifying and consolidating on one universal client, they can contact their customers and work with their business partners much more easily.

The Internet changes the whole paradigm of distributed computing, says Carl Olofson, research director for database management systems at International Data Corp., the Framingham, Mass., consulting firm.

"Ultimately, instead of an organization having a fixed topology of networks that have to be connected together, they can employ a much more flexible scheme using the Internet instead of allowing regional offices to connect through their system," Olofson says. In addition, the Internet enables companies to connect to each other and create virtual enterprises, he notes.

Olofson says security, Java standards, and other issues are temporarily preventing the Internet from becoming the principal backbone for most organizations. But once those issues are resolved, companies will experience dramatic changes in the way their databases are used.

The Internet will make it simpler for organizations to centralize the management of geographically distributed databases and organizations, says Ken Jacobs, Oracle's vice president of data server marketing. "It will be extremely easy to consolidate your data into a central server and provide it to users anywhere. And you can do that in a way that preserves the integrity of the data, security, and transaction semantics of the data. The economics become compelling to move toward consolidation."

Although the Internet has not yet made a big impact on the way MCI manages its large distributed databases, Shellhouse expects that to change soon. Eventually, MCI customers will be able to accomplish a variety of tasks by accessing the company's databases via the Internet. They'll be able to obtain billing information and change the format of invoices and the way their calls are routed. Some consumer customers already can access their MCI accounts via the Internet.

Cost may be the biggest reason for companies to make their large distributed databases available via the Internet.

"A lot of organizations are interested in recentralizing the IT management operations because database administrators are so expensive," says IDC's Olofson. "Rather than have a database administrator for each regional office, you can have one DBA team in the central office that can manage all the regional databases. The Internet will move that paradigm along because in the world of the Internet, physical location is more or less irrelevant."

Karen D. Schwartz is a freelance business and technology writer based in Washington, D.C.

Banks scramble to de-Balkanize their information

Unlike pharmaceutical companies or manufacturing organizations, financial firms don't sell tangible products. In the financial industry, information is the name of the game, and the processing of information helps these institutions, perhaps more than any other, run their businesses efficiently and profitably.

Banking and investment houses came relatively late to the game of distributed computing. Because each segment of a financial institution traditionally has run independently, and because the need for security is so important, many financial databases were created as stand-alone applications, creating numerous islands of information. Eventually, with the advent of client/server computing, trading desks and other financial areas started building their own systems, but many financial institutions are still coping with the legacy of the Balkanization of information.

"Data warehousing in the financial services industry tends to be a somewhat more complex process than in other industries," says Mary Knox, research manager at Durham, N.C.-based Mentis Corp. (http://www.mentis.com), a firm specializing in evaluating information and communications technology in the financial services industry. "Within banking, information systems were originally developed along product-centric lines, so there has been very little thought given to standardization and the ability to merge data from disparate systems."

Even today, it isn't unusual for a large bank to have one host mainframe system serving as its check-processing system, a second machine for CD business, a third for credit card information, and a fourth to handle mortgage loans.

But combining data from different groups within financial institutions has become very much a necessity in today's competitive financial environment. Large databases are needed to better understand the company's relationships with its customers. That is very different from the original use of such databases, which were developed for processing transactions.

A recent survey by Mentis asked financial institutions to specify the level of approved funding for external expenditures and capital investments for the first 12 months of their data warehouse projects. The average was $1.5 million, with nearly a third of organizations spending $2 million or more. The survey was of U.S. commercial banks and thrift institutions that had at least one currently operational data warehouse or firm plans, with funding, for implementation of a data warehouse solution.
Source: Mentis

Banks today are moving aggressively to develop complete customer profiles, which Knox says is key to many banks' relationship-management strategies. But because of the complexity of the data and systems involved, this development is a long and arduous process.

Banks that are successful in merging their data and developing complete customer information will have a valuable competitive tool. Without this information, banks are limited in their ability to identify customer profitability, and thus to develop strategies and tactics for retaining and growing profitable relationships.

In a recent survey, Mentis found that funding for external expenditures and capital investments for the first 12 months of data warehouse projects ranged from $200,000 to $7.5 million, with the average being $1.5 million (see chart, "Big bucks up front"). Although these numbers seem high, banks have no choice but to take the plunge, Knox says.

Despite the cost, many major financial institutions are doing just that. First Union, for example, is in the process of developing an enterprisewide database to allow the company a full view of customer relationships across products. The system will replace previous departmental systems that had inconsistent and incomplete customer data. Capital One, a credit card issuer based in Falls Church, Va., has had great success in using large customer databases for rapid product development and marketing.

And cost, as in all large database projects, is relative. At Boston-based CS First Boston, revamping the trading database support systems saved the institution about $750,000 annually in the cost of database administrators alone, notes Sergey Fradkov, chief technology officer at UNIF/X, a consultancy that helped CS First Boston revamp its database systems.
--Karen D. Schwartz