In this article:
- Evolving into knowledge management
- The origins of Migration Architect
What would you rather move your family across the country in: a beat-up old rental trailer or a shiny new Peterbilt 18-wheeler? The obvious answer is the 18-wheeler, and that’s exactly what the Automobile Club of Southern California opted for when it chose to implement a new datamart.
AAA decided in late 1997 to liberate the data buried in the two VSAM (virtual storage access method) files–and over 200 transaction types–that contained its cashiering system records. With records dating back to 1900, 40 million members throughout the United States and Canada today, and total annual revenues of about $1.2 billion, an 18-wheeler was exactly what the automobile club needed for easier access to its data.
“We found social security numbers and alien ID numbers in the ‘fishing’ field,” recalls Ralph Nordstrom, data warehouse architect for the Automobile Club of Southern California (ACSC is the largest U.S. affiliate of AAA). “The club used to sell fishing licenses. And, of course, it has always sold travelers checks.” So when the federal government required notification of transactions worth $10,000 or more, the club decided to appropriate the then unused fishing license field to store the required social security number or alien ID information.
As data warehouses move from being storage facilities to monitoring the business, the market for data-warehouse software and tools will grow into the next century, according to GartnerGroup projections.
Trying to make sense out of fields that have taken on different meanings over the years is just one of the myriad data identification, reconciliation, and transformation problems that plague the early phase–or ETL (extract, transform, and load) tasks–of any data warehouse project. In response to these issues, today’s data warehouse implementation teams are looking for tools that tend to be fast, industry-specific, function-specific, or a combination thereof.
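Column profiling of this kind is conceptually simple. Here is a minimal Python sketch (the sample values and field name are hypothetical, not ACSC’s actual data) that collapses each value in a column to a character-class pattern; an overloaded field like the repurposed fishing license field shows up as a column with several competing patterns:

```python
import re
from collections import Counter

def signature(value: str) -> str:
    """Collapse a raw value into a character-class pattern:
    digits become '9', letters become 'A', punctuation is kept."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile_column(values):
    """Count pattern signatures down a column. Several dominant
    signatures hint that the field has taken on multiple meanings."""
    return Counter(signature(v) for v in values)

# Hypothetical sample of a repurposed "fishing license" field:
fishing_field = ["123-45-6789", "987-65-4321", "A12345678", "FISH-4411"]
print(profile_column(fishing_field))
# -> Counter({'999-99-9999': 2, 'A99999999': 1, 'AAAA-9999': 1})
```

Two SSN-shaped patterns, an alien-ID-shaped pattern, and a license-code pattern in one column would flag the field for human review, which is essentially what profiling tools automate at scale.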
ACSC’s datamart works with sales data (mainframe point of sale [POS] combined with general ledger and cost systems) and is used to generate sales reports for local ACSC offices. Prior to the datamart, reports were assembled manually and often arrived late. ACSC’s business goal is to get sales reports to branches fast.
Nordstrom, who is responsible for back-end data movement issues, remembers the painful 10 months of ACSC’s first datamart project spent going back and forth with Passport, an ETL tool from Carleton Corp. of Minnetonka, Minn., just trying to move the legacy data into a datamart. ACSC spent almost a year trying to set up extraction rules in Passport, but the source data was too “dirty” and convoluted. At one point, Nordstrom’s team had created a series of complex lookups that entailed traversing four or five large tables, all in an attempt to “fix” the data. Finally, the initial Passport project was killed. The culprits: too many data sources (customer, cashiering/POS, employee, organization, and membership files), the evolutionary use of fields (the fishing license story), and other typical data quality problems. ACSC began looking for a way to tame the twin beasts of data warehouse design and data quality.
Don’t call it artificial intelligence
Nordstrom credits colleague Francine Burick, the auto club’s data warehouse coordinator in charge of end-user information access, with discovering Evoke Software Corp.’s Migration Architect at a data warehousing conference. Typical of today’s data warehousing tools, Migration Architect does the upfront profiling and mapping, driven by data design rules, that Carleton’s Passport can’t. It adds value through the parsing it performs to “guess” dependencies and similar fields.
Migration Architect, from Evoke (formerly DBStar), embodies a process whereby the content and structure of legacy data sources are examined and understood in detail and mapping specs are produced (see textbox, “The origins of Migration Architect”). The product consists of six sequential steps: three for data profiling and three for data mapping. Instead of a group of people trying to figure out the content and structure of legacy data by poring over COBOL copybooks, outdated documentation, comments in code, and the sample data itself, Migration Architect profiles data in three dimensions: down columns, across rows, and across tables. Redundancies (duplicate copies of addresses, etc.) usually turn up in the across-tables dimension. The resulting transformation maps can be used in conjunction with “traditional” ETL tools like Passport–ACSC’s approach, since it already had Passport in house and was familiar with it. Migration Architect uses the output of the first three steps to build normalized (third normal form, or 3NF) data models. A widely accepted industry standard for “good enough” relational design, 3NF avoids the update and deletion anomalies associated with poor design.
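Dependency profiling–the “across rows” dimension–amounts to testing which columns functionally determine which others, since discovered dependencies are what drive a 3NF design. The following Python sketch (the sample membership records are invented, and this is of course not Evoke’s actual algorithm) exhaustively tests single-column functional dependencies:

```python
from itertools import permutations

def holds(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in the
    sample: each lhs value maps to exactly one rhs value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def candidate_fds(rows):
    """Test every ordered pair of columns for a dependency."""
    cols = list(rows[0].keys())
    return [(a, b) for a, b in permutations(cols, 2) if holds(rows, a, b)]

# Hypothetical membership records:
rows = [
    {"member_id": 1, "branch": "LA", "region": "South"},
    {"member_id": 2, "branch": "LA", "region": "South"},
    {"member_id": 3, "branch": "SD", "region": "South"},
]
print(candidate_fds(rows))
# -> [('member_id', 'branch'), ('member_id', 'region'), ('branch', 'region')]
```

Here the key `member_id` determines everything, and `branch` determines `region`–exactly the kind of transitive dependency that a 3NF decomposition would split into a separate branch table. Real tools must also cope with dirty data, so they typically report dependencies that hold only approximately.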
Migration Architect 2.0 performs a combination of automated discovery and interactive analysis to generate a normalized data warehouse with atomic data. “It got us 95% there, right out of the box,” according to ACSC’s Nordstrom. Although he wishes Migration Architect did a better job of “preparing” the data–getting VSAM files loaded for analysis–Nordstrom was impressed with how easy it was to do column profiling (down columns), dependency profiling (across rows), and redundancy profiling (across tables and/or data sources). Evoke plans to add utilities to help with the up-front loading of mainframe flat files in its next version, due mid-1999. It will probably also add Teradata and IMS to the list of databases currently supported (Oracle, Informix, Sybase, and DB2). In that version, customers will also be able to use a traditional relational database for Migration Architect’s datastore; version 2.0 currently uses a proprietary data structure that even Evoke’s vice president of engineering, Jack Olson, admits is “messy.”
The secret behind Migration Architect is that it does one thing very well: data profiling and mapping. It doesn’t do the data movement–users need other tools for that, and ACSC chose Passport to move its data. What Migration Architect does is help users figure out and “normalize” raw source data. This goes beyond what data cleansing software does: parsing, standardizing, and reformatting data, and often augmenting it with external demographic or financial data. “Data cleansing” means different things to different people, but it usually refers only to quality and consolidation issues.
Bruised and bloodied no more
Just over a year ago, in November 1997, the auto club began using Migration Architect to profile its data, working half days only. “I wanted a tool that would help us transform the source data into third normal form (3NF),” says Nordstrom. Less than six weeks later, Nordstrom and his team generated the desired sales report. “On New Year’s Eve, we delivered the 1997 sales report,” he recalls. “We were the knights in shining armor.”
ACSC’s new data warehouse now houses atomic data, while its datamarts serve sales and customer service. Nordstrom is proud of the auto club’s hub-and-spoke architecture, and rightly so: it is the industry-accepted architecture for “good” enterprise data warehouses and dependent datamarts. Migration Architect profiles and maps ACSC’s data while Carleton’s Passport populates both the enterprise data warehouse, residing in DB2 on the mainframe, and the sales and POS datamarts. These datamarts, which contain records of member “transactions” such as ordering maps and “TripTiks,” are Oracle 7.3 star schema databases running under HP-UX. (Prepared by AAA for its members, TripTiks are customized route maps for a specific trip, broken into a series of mini maps shaped like big tickets, hence the name.)
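The star schema behind datamarts like these can be sketched in a few lines: a central fact table of transactions keyed to surrounding dimension tables. This is a minimal, hypothetical example (the table and column names are invented, not ACSC’s actual schema) using Python’s built-in sqlite3 rather than Oracle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes for grouping and filtering.
CREATE TABLE dim_branch  (branch_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

-- Fact table: one row per sale, foreign keys out to each dimension.
CREATE TABLE fact_sales (
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    sale_date  TEXT,
    amount     REAL
);

INSERT INTO dim_branch  VALUES (1, 'Los Angeles'), (2, 'San Diego');
INSERT INTO dim_product VALUES (10, 'TripTik'), (11, 'Map');
INSERT INTO fact_sales  VALUES (1, 10, '1998-12-01', 5.00),
                               (1, 11, '1998-12-01', 3.50),
                               (2, 10, '1998-12-02', 5.00);
""")

# A branch-level sales report: join the fact table to one dimension.
for row in conn.execute("""
    SELECT b.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_branch b USING (branch_id)
    GROUP BY b.name ORDER BY b.name
"""):
    print(row)
```

The design choice is the point: queries like the branch sales report need only one join per dimension, which is why star schemas dominate reporting datamarts even though they are deliberately less normalized than the 3NF enterprise warehouse feeding them.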
Now, with not one, but two datamarts under his belt, Nordstrom is ready to tackle what will be the biggest challenge of all–the insurance datamart. A few years ago, auto club staffers estimated it would take between three and four staff years to design a normalized database from data residing in four VSAM files and 58 different record types. Using Migration Architect, Nordstrom hopes to accomplish the task in just 12 months–with three major deliverables along the way at 120-day intervals.
Like ACSC, the U.S. Department of the Interior’s Bureau of Land Management (BLM) discovered Evoke almost by accident, in its case through a consultant’s recommendation. About a year ago, says Leslie Cone, project manager at the BLM, it had become clear that the bureau’s extremely ambitious long-term reengineering project, the Automated Land and Minerals Record System (ALMRS), was bogged down in a quagmire of horrendously complex data conversions.
ALMRS, which had been initiated almost a decade ago and contracted to Computer Sciences Corp. in 1993, was to have been completed in 1997. Last summer, when it was obvious that ALMRS was taking longer than expected, BLM put the project on hold in favor of a “must complete by March 1999” rehosting project.
Why the urgency? The operating system on the aging Honeywell minicomputer holding the bulk of BLM’s data was not going to be brought into Y2K compliance. The data, residing in four legacy system databases, contained crucial “patents”–records of public land being transferred into private hands–dating back to the 1700s, when the General Land Office maintained the records. Thanks to Migration Architect’s help in profiling the data files, Cone is confident that the rehosting project, begun in July 1998, will be completed and tested by her March 1999 deadline.
Other innovative products and trends
Evoke isn’t the only innovative product or bellwether trend in the data warehousing market today. Amit Desai, a cofounder of Anubis Inc., now sells commercially a data warehouse design tool that he and his partners originally developed for their consulting clients. Like Nordstrom and Cone, Desai realized the value of any tool that could reduce the time spent on the labor-intensive design process, and Constructa was born. A tool for graphical dimensional model construction (full graphical interface, schema generation into star or snowflake schemas, plus optimization for either performance or space), Constructa also checks overall design integrity. Users can define standards and conventions and have Constructa apply them automatically via inheritance. Data warehousing vendors like Informatica Corp. and MicroStrategy Inc. have already recognized Constructa’s value proposition and negotiated the rights to distribute it with their own products.
David Wells, principal analyst at Ovum Ltd., a London-based research firm, observes that, in addition to market consolidation among data warehousing vendors, the other main trend is a growing acknowledgment that metadata integration and repositories are key end-user requirements. Repositories are like card catalogs of enterprise (or datamart/application) data. Standards efforts by the Metadata Coalition and the Object Management Group are symptomatic of the trend.
Wells cites the arena of “packaged analytical applications” as another hotbed of innovative integration. The idea, according to Wells, is that now that the decision support/data warehousing market and technologies have matured, users can buy “prebuilt” applications to perform tasks such as financial risk assessment, customer churn analysis, and so on–“off the shelf.” Analogous to the arrival of SAP R/3 from SAP AG of Walldorf, Germany, and its competitors in the OLTP [online transaction processing] market, says Wells, vendors are now encouraging users to believe that they can buy packaged analytical applications as well.
This is an immensely appealing message, notes Wells, especially for users who are nervous about data warehousing and/or have already been “bruised and bloodied” with unsuccessful data warehousing or datamart projects. In the meantime, the packaged applications market (Baan Co., SAP, etc.) has spawned a completely new market for application integration products (Constellar Corp., CrossWorlds Software Inc., and so on) in an attempt to ease the pain felt by organizations that have attempted to integrate packaged and nonpackaged applications themselves.
For instance, some firms are already building new business applications, such as customer relationship management and campaign management, as extensions of the data warehouse. These new applications are part of a controlled, closed-loop system (as opposed to standalone “stovepipe” applications). Data warehouses have moved from being read-only storage facilities to monitoring the business: they now feed the analytical engines that drive process workflow, generate information for closer customer intimacy, and supply the content needed to reorganize business processes and structures. Warehouse content must now expand to hold the resulting measures from initial analytics and segmentation, which requires (or will require) writable content based on customer dialogs and responses to the efforts of call centers, direct marketing, and sales representatives.
Other innovators are targeting vertical markets or offering specific functionality. NCR Corp., for example, offers industry-specific versions of its RightStart data warehouse package with bundles “pretuned” for retail, financial services, and data communications, and IBM has similar offerings. Marketing is another sector that has attracted a lot of attention and innovative solutions. Several firms, including Dun & Bradstreet Corp. and MarketSwitch Corp., announced focused marketing and/or CRM (customer relationship management) products at the recent National Center for Database Marketing Show.
What the tea leaves say
Companies implementing data warehousing solutions are at last finding a new crop of tools capable of migrating existing data and then analyzing that data more quickly than ever before. ACSC enjoys out-of-the-box efficiency with its choice of Migration Architect, while the BLM expects to significantly reduce the implementation time of a deadline-dependent project using Evoke’s product.
New directions are also visible. Among other things, vendor consolidations, the realization that end-user requirements now call for metadata integration and repositories, and an increased interest in packaged analytical applications, as well as application-integration products, are coming to the forefront of data warehousing.
Indeed, it’s a very exciting time in the world of data warehousing. This taste of some of the innovative products and trends shows that the tea leaves seem to say faster, more focused, more integrated–and more bottom line. //
Karen Watterson ([email protected]) is an independent San Diego-based writer and consultant specializing in database design and data warehousing issues. She has written several books, including Visual Basic Database Programming and Client/Server Technology for Managers, and is completing one on Microsoft’s SQL Server 7.0.