
Datamining poised to go mainstream



With e-commerce and CRM propelling the market forward and Microsoft on the bandwagon, datamining has finally arrived.

By Karen Watterson

October 1999

In this article:

AT A GLANCE: Just for Feet

Who’s who in datamining
Techniques used in datamining
Datamining: How it’s done

It used to be that datamining was limited to high-end database marketing firms and Global 100 firms–the kind whose online transaction processing (OLTP) systems generated millions of rows of data daily. There’s always been an aura of mystery, even magic, associated with datamining. It was a science practiced on powerful UNIX systems overseen by unsmiling statisticians and brilliant mathematicians.

Today that’s changing. Many Web sites are generating log files and e-commerce transaction files that are eminently mineable. Last month, for instance, online retail giant Amazon.com made headlines with its “purchase circles,” based on the fundamental datamining technique of affinity grouping (clustering). When retail sites suggest specific items to customers based on their past purchases, the sites are using a combination of customer relationship management (CRM) and datamining to increase their revenues.
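Affinity grouping of the kind behind those product suggestions boils down to counting which items appear together in the same shopping baskets. A minimal sketch, using made-up baskets (the item names and data here are purely illustrative, not from any vendor's product):

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction baskets; in practice these would come from
# e-commerce logs or point-of-sale records.
baskets = [
    {"running shoes", "socks", "insoles"},
    {"running shoes", "socks"},
    {"sandals", "socks"},
    {"running shoes", "insoles"},
]

# Count how often each pair of items lands in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def recommend(item, n=3):
    """Return the items most often bought together with `item`."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            related[b] += count
        elif b == item:
            related[a] += count
    return [other for other, _ in related.most_common(n)]

# Both socks and insoles co-occur twice with running shoes.
print(recommend("running shoes"))
```

Production tools add statistical safeguards (support and confidence thresholds), but the co-occurrence counting at the core is the same idea.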


Datamining is part of a process called knowledge discovery, where the goal is to better understand the organization’s data in order to resolve business problems or capitalize on opportunities.

Sizing things up

Consider retail shoe vendor Just for Feet Inc. of Birmingham, Ala. The company has approximately 160 superstores, in addition to 170 Athletic Attic, Athletic Lady, and Imperial Sports stores. Each store carries from 3,000 to 6,000 different shoe styles. Multiply the styles by all the different sizes, and you’ll start to appreciate what the shoe industry refers to as the “size explosion.” And what better way to take advantage of all that data than with a data warehouse/datamining initiative?


AT A GLANCE: Just for Feet

The company: Based in Birmingham, Ala., Just for Feet Inc. has approximately 160 superstores, plus approximately 170 Athletic Attic, Athletic Lady, and Imperial Sports stores. Each store carries between 3,000 and 6,000 different shoe styles.

The problem: Keeping up with rapidly changing shoe styles.

The solution: A 2.4-terabyte data warehouse that currently mines products and inventory. Mining customer data is on the horizon.

The IT infrastructure: Just for Feet used ICL Plc’s Fast Track Development Toolkit to generate the schema for an Informix Corp. Dynamic Server release 8.0 database and perform the initial data population. Transaction-level data in Just for Feet’s data warehouse is stored in a Sun Microsystems Inc. Enterprise E6500 server.

Each Just for Feet store functions as its own distribution center. With the “in” styles changing so fast, and with regions–even neighborhoods–having different hot styles, it’s easy to see how important it is for Just for Feet to have the right kind of shoes in stock at the right location. As a result, it made sense for the company to focus its initial datamining efforts on product rather than customer data. “You can be item-centric or customer-centric,” says David Meany, CIO, referring to alternative approaches to designing and mining Just for Feet’s terabyte-scale data warehouse. But you can’t do both at once.

Datamining purists might say that when Just for Feet generates exception reports for its buyers, that’s not genuine datamining. But the company’s buyers are thrilled with these weekly and monthly sales reports, which free them to spend more time on the more creative aspects of their jobs–predicting fashion trends and future demand. Meany explains that Just for Feet also does “real” datamining to answer business questions. For example, the company analyzes distribution practices to see how they affect product sell-through.

The first two phases of the company’s multiphase data warehousing/datamining initiative are now in production, built with the help of ICL Plc, a global IT services company based in London. Just for Feet used ICL’s Fast Track Development Toolkit to generate the schema for an Informix Corp. Dynamic Server release 8.0 database and perform the initial data population. Currently, Meany only keeps about a year’s worth of transaction-level data in Just for Feet’s data warehouse, which is stored in a Sun Microsystems Inc. Enterprise E6500 server. The system maintains aggregate data for 1997 and 1998.

Although the first stages of Just for Feet’s implementation have been inventory-focused, plans are already underway to expand the company’s analysis capabilities and better leverage the customer component of the data warehouse. Keeping up with the “in” styles is only part of the lure of customer data. Consumers can join the Just for Feet club, with the enticement of special savings. Membership is easy: all you have to do is enter a telephone number, and the system does a reverse lookup to determine the address. Is Meany looking forward to mining all of this customer data? You’d better believe it.

Who’s who in datamining

There are dozens of datamining vendors, although some industry consolidation has begun. For now, there’s no clear market leader, and most of the products are expensive and complex to use. They were typically developed for the UNIX workstation market for mathematicians or statisticians, not especially for database folks.

Herb Edelstein’s market analysis of datamining tools, “Data Mining ’99: Technology Report,” is 1999’s single best source of information about the datamining market. Edelstein provides analyses of the following vendors and their tools:

AbTech Software (ModelQuest MarketMiner)
*Angoss Software (KnowledgeSEEKER, KnowledgeSTUDIO)
Attar Software (XpertRule Miner)
Business Objects (BusinessMiner)
Cognos Software (4Thought, Scenario)
Group 1 (Model 1)
HNC Software Inc. (DataBase Mining Marksman)
Integral Solutions (Clementine, acquired by SPSS in 1998)
IBM (Intelligent Miner)
Magnify (PATTERN)
MathSoft (S-Plus)
NCR (TeraMiner)
NeoVista Software (Decision Series)
Quadstone (Decisionhouse)
Salford Systems (CART, MARS)
*SAS Institute (Enterprise Miner)
*Silicon Graphics (MineSet)
*SPSS (Base, AnswerTree, Neural Connection)
Tandem Division of Compaq
Thinking Machines (Darwin, acquired by Oracle in 1999)
Torrent Systems (Orchestrate Analytics)
Trajecta (dbProphet)
Unica Technologies (PRW)
Urban Science Applications (GainSmarts)

* These vendors collaborated with Microsoft to create the OLE DB for DM spec. Two more vendors, E.piphany and Datasage, also helped draft the initial spec.

And then there are companies like Fingerhut Companies Inc., the $2 billion firm known for its catalog, direct marketing, and telemarketing ventures, that have spent years honing the process of datamining. The Minnetonka, Minn.-based company’s marketing analytics group maintains several hundred generic models that are used to build targeted segmentation models that generate mailing lists for catalogs.

Typically, the datamining team combines four models: a response model (will the customer respond?), a purchase model (how much will the customer buy?), a return model (is the customer likely to return merchandise?), and a payment model (is the customer a credit risk?). The company maintains data (almost 1,400 variables per customer) on more than 30 million customer households in a data warehouse that tops 7 terabytes.
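Combining four model outputs of this kind typically means computing an expected value per household and ranking on it. A hedged sketch of the idea, with invented household IDs and scores (not Fingerhut's actual formula, which the article doesn't specify):

```python
# Hypothetical per-household scores, as produced by four separate models:
# response probability, expected purchase amount, return probability,
# and credit-risk probability.
households = [
    ("H1", 0.20, 120.0, 0.10, 0.05),
    ("H2", 0.05, 300.0, 0.30, 0.20),
    ("H3", 0.15, 80.0, 0.05, 0.02),
]

def expected_value(p_resp, purchase, p_return, p_risk):
    # Expected revenue, discounted for likely returns and credit losses.
    return p_resp * purchase * (1 - p_return) * (1 - p_risk)

ranked = sorted(households, key=lambda h: expected_value(*h[1:]), reverse=True)
mailing_list = [h[0] for h in ranked]
print(mailing_list)
```

The real payoff is at scale: scoring 30 million households on 1,400 variables turns a blanket mailing into a targeted one.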

The players, new and old

Although datamining isn’t new technology, it has only recently emerged from academia, research labs, and several dozen vendors. The availability of data warehouses and cheap storage have certainly contributed to the trend, but today’s keen interest in datamining is largely driven by the explosive growth of e-commerce. Sales and marketing departments want to leverage the data gleaned from Web traffic patterns to do one-to-one marketing.

If the prospect of mining customer data to increase revenues, reduce risk, or detect fraud isn’t enough to propel datamining into the mainstream, there’s always the Microsoft factor. Microsoft Corp. ventured into datamining when the Redmond, Wash., software maker announced work on the OLE DB Extensions for Data Mining specification in May 1999. The project is a joint effort between the Microsoft SQL Server group and Microsoft Research’s Data Mining & Exploration group led by Usama Fayyad in consultation with a select group of vendors (see “Who’s who in datamining”). OLE DB is a specification for a set of data access interfaces designed to enable access to heterogeneous data sources. It’s considered the successor of open database connectivity (ODBC) and has already been “extended” for online analytic processing (OLAP) and a variety of vertical markets.

Techniques used in datamining

Statistics: Identifies instances where one variable causes or influences others. It’s good for trends and confirming hunches.

Induction techniques: Generates a hypothesis

Neural networks: Sifts through large amounts of data to find unexpected patterns

Visualization techniques: Helps nontechnical people understand the meaning of the data through graphic displays

OLAP: Helps confirm hypotheses using flexible, slice-and-dice techniques

SQL and similar query languages: Answers specific questions (Purists usually don’t consider this true datamining.)
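The statistics family above is the most approachable of the six: much of it reduces to measuring how strongly two variables move together. A small sketch computing a Pearson correlation from first principles, on invented figures (the numbers are illustrative only):

```python
import statistics

# Hypothetical weekly figures: advertising spend vs. shoe sales.
ad_spend = [10, 12, 15, 18, 22, 25]
sales = [110, 118, 130, 142, 160, 171]

# Pearson correlation: covariance over the product of the
# sample standard deviations.
mean_x = statistics.fmean(ad_spend)
mean_y = statistics.fmean(sales)
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(ad_spend, sales)) / (len(sales) - 1)
r = cov / (statistics.stdev(ad_spend) * statistics.stdev(sales))
print(round(r, 3))  # near 1.0: the two series move together
```

A correlation this strong would confirm the hunch that spend and sales track each other, though, as always with statistics, it says nothing by itself about which drives which.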

The Microsoft OLE DB for DM endeavor will likely spawn compliant datamining products sometime in 2000. But that doesn’t mean you can’t do datamining against SQL Server (or any other database) today. In fact, Microsoft’s Site Server 3.0 already includes features such as an intelligent “cross-sell” based on historical sales baskets in stores, the contents of the current shopper basket, and the browsing behavior of the shopper. Site Server ranks products that are likely to be most interesting to the shopper.

Lessons learned about datamining

Don’t try to do everything at once.
Focus delivery on immediate tactical as well as long-term strategic value.
Use consultants with track records in your industry.
Make it easy for end users.

Microsoft isn’t the only firm with interdependent products. IBM Corp.’s SurfAid Analytics relies on the company’s own Intelligent Miner for Data to deliver sophisticated Web site analytics for a fixed monthly fee that ranges from under $1,000 to about $30,000. SurfAid is a small, entrepreneurial e-business within IBM Global Services, which is based in Somers, N.Y. Clients upload daily Web log files to the SurfAid FTP site. RS/6000 AIX scripts handle preprocessing, which includes “stitching back together” navigation paths of individual Web visitors. Then, one of SurfAid’s RS/6000s runs the IBM Intelligent Miner datamining tool kit against the customer file, which may contain over 150 million hits per day. The result is a daily report that customers can access at a private URL. Because IBM DB2 for OLAP is running behind the scenes, users can “slice and dice” the data starting with almost a dozen different reports.
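That “stitching” step is commonly done by grouping raw hits by visitor, sorting them by time, and splitting a visitor's hits into separate sessions wherever a long gap of inactivity occurs. A minimal sketch with invented log records and an assumed 30-minute gap threshold (SurfAid's actual preprocessing scripts are not public):

```python
from collections import defaultdict

# Hypothetical raw log hits: (visitor_id, unix_timestamp, url).
hits = [
    ("v1", 100, "/home"),
    ("v2", 105, "/home"),
    ("v1", 160, "/shoes"),
    ("v1", 4000, "/home"),   # long gap: a new visit
    ("v2", 140, "/checkout"),
]

SESSION_GAP = 1800  # 30 minutes of inactivity starts a new session

# Group hits by visitor.
by_visitor = defaultdict(list)
for visitor, ts, url in hits:
    by_visitor[visitor].append((ts, url))

# Sort each visitor's hits by time, then cut into sessions at long gaps.
sessions = []
for visitor, visitor_hits in by_visitor.items():
    visitor_hits.sort()
    current, last_ts = [], None
    for ts, url in visitor_hits:
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append((visitor, current))
            current = []
        current.append(url)
        last_ts = ts
    sessions.append((visitor, current))
```

Each reconstructed session is a navigation path, which is exactly the unit the mining tools then look for patterns in.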

IBM, by the way, shipped its first datamining tool kit in 1995. Today, the company’s Intelligent Miner for Data and Intelligent Miner for Text are used by customers with large DB2 databases. IBM has also developed a graphical query language, query by image content (QBIC), which lets users make queries of large image databases based on visual image content–properties such as color percentages, color layout, and textures occurring in the images. It is used with Digital Library to do graphical datamining.

Shortly after Microsoft parted the curtains on its datamining spec, Oracle Corp. announced its purchase of leading datamining vendor Thinking Machines Corp. and its Darwin product family. The Redwood City, Calif.-based company hasn’t made any announcements about how Darwin will be integrated into its product line. Although Oracle already has its own text mining product called Oracle ConText, it’s likely that the company will weave Darwin into its marketing campaign and Oracle Applications product line. In another significant move toward consolidation, SPSS Inc. acquired Integral Solutions Ltd. (ISL) and its popular Clementine product.

Darwin and Clementine are two of the six datamining tool suites that Stamford, Conn.-based Gartner Group, in an August 1999 report on datamining, identified as key players in the generic datamining market. The other four are Angoss’ Knowledge Suite, IBM’s Intelligent Miner for Data, SAS’s EnterpriseMiner, and SGI’s MineSet.

In the audio mining field, speech vendors such as Dragon Systems and Virage Inc. are working with all the major database vendors–including IBM–to support the technique, which is scheduled to be available later this year. Audio mining might be used to monitor call center traffic, customer service calls, or company voice mail (privacy issues aside) looking for anything from profanity to recurring customer service complaints to suspected industrial espionage.

E-commerce, CRM, and data warehousing will all help propel the datamining market forward. Standards such as extensible markup language (XML), the predictive modeling markup language (PMML), the cross-industry standard process for datamining (CRISP-DM), as well as Microsoft’s OLE DB for DM, will help, too. The evolving technology combined with such success stories as Just for Feet and Fingerhut will certainly drive the market into the mainstream. //

Karen Watterson is an independent San Diego-based consultant who specializes in database and data warehouse design. She edits industry newsletters and has just completed a book on SQL Server, “10 Projects you can do with Microsoft SQL Server.”

Datamining: How it’s done
Datamining overlaps with many fields, including statistics, artificial intelligence, data visualization, machine learning, expert systems, and neural networks. One way to demonstrate the breadth of the field is to categorize datamining into six families of techniques (see “Techniques used in datamining”).

To get a feeling for what’s involved in datamining, imagine that you’re a bank and that you want to identify your most profitable customers. In most cases, that information is buried inside reams of transaction data that’s probably spread out over multiple divisions (loans, savings, asset management, etc.). Let’s assume your bank already has a data warehouse in place.

First, you want to determine whether the data warehouse contains all the data you need–you might want to add external demographic data, for example. Once you’re satisfied with the contents of the data warehouse, you identify the data to be extracted and examine it for quality and completeness. You’re likely to find at least some data that’s incomplete or of poor quality. Then you must decide whether you have the time and money to clean up the offending data; if not, you simply eliminate it from the model.

Next you figure out the best algorithms and methods to use. You buy (or obtain an evaluation copy of) potential tools and use them to develop predictive models. After many “runs” you’ll probably uncover some trends and patterns that can be used to forecast which customers’ business would be most profitable.

Then you refine the predictive model, and run it to generate a list of profitable customers. Sales or marketing executes their campaign, and, if the system worked, you have a high return rate at reduced marketing costs. –K.W.
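The final scoring-and-ranking step of the bank scenario above can be sketched in a few lines. The customer figures, the deposit margin, and the threshold here are all invented for illustration; a real model would come out of the iterative tool runs the sidebar describes:

```python
# Hypothetical per-customer figures pulled from the warehouse after
# the data-cleansing step: (loan_interest, deposit_balance, servicing_cost).
customers = {
    "C1": (900.0, 15000.0, 200.0),
    "C2": (0.0, 2000.0, 150.0),
    "C3": (2500.0, 500.0, 400.0),
}

DEPOSIT_MARGIN = 0.02  # assumed spread the bank earns on deposits

def profitability(loan_interest, deposits, cost):
    # Interest earned plus deposit spread, minus the cost of serving
    # the customer.
    return loan_interest + deposits * DEPOSIT_MARGIN - cost

scores = {cid: profitability(*fields) for cid, fields in customers.items()}

# Keep customers above a chosen threshold, best first, for the campaign.
profitable = sorted((cid for cid, s in scores.items() if s > 500),
                    key=scores.get, reverse=True)
print(profitable)
```

The hard work in practice is everything before this step: deciding which variables belong in the model and cleaning the data that feeds them.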


