Mining for data is like mining for gold. It is a lot of work but, done right, produces a high return. At Newmont Mining Corp.’s site near Battle Mountain, Nev., for example, the company must excavate and process 30 tons of raw material to extract a single ounce of pure gold.
Knowledge workers face a similar problem — finding the hidden gold buried somewhere amidst 3 billion-plus Internet pages as well as an organization’s own data stores.
Without the right tools, employees can spend a huge amount of time prospecting for this hidden data. Anadarko Petroleum Corp. found that its engineers were spending as much as 50% of their time searching for information contained in the company’s 2-million-document intranet. Installing Convera’s RetrievalWare search engine changed all that.
“We had so much information out there in people’s heads or in complex directory structures — places where people don’t even know it exists,” says Bob Downing, Anadarko’s manager of business systems. “We estimate we save 78,000 staff hours per year in the engineering group alone by having rapid access to the documents.”
To help their employees and customers quickly access the information they need, companies spent $450 million on enterprise search engines (ESEs) last year, according to Susan Feldman, International Data Corp.’s research vice president for Content Management and Retrieval Software.
Enterprise search engines perform much the same function as Internet search engines, but they are targeted to the needs of a particular group of people rather than the broad public. While the exact feature set and methodology vary among vendors, they perform three main functions: discovery, categorization and search.
The first action, discovery, consists of finding exactly what content an enterprise has stashed away in its various datastores. On the Internet this is done through “spidering,” the process of following hyperlinks from one page to another and copying the content of those pages into a server for indexing.
Users then search against this index rather than against the actual Web content. Within an enterprise, this process means not just finding HTML documents on the company portal, but also discovering the content within all the file systems, databases and applications the company uses. Older ESEs were limited in what file formats they could search, but most now license technology from either Verity, Inc. or Stellent, Inc. that converts other file formats into text so they can be indexed and searched.
“File formats are largely a non-issue today,” says Dr. Prabhakar Raghavan, Verity’s vice president and CTO. “Most engines will search over 200 file formats.”
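The discovery step described above can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the in-memory PAGES table stands in for a real intranet, and the names spider, build_index and search are made up for this example. It follows links from a starting page, copies the text it finds into an inverted index, and then answers queries from the index alone.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "intranet": page name -> (text, outgoing links).
PAGES = {
    "home": ("welcome to the engineering portal", ["drilling", "safety"]),
    "drilling": ("drilling report for the north field", ["safety"]),
    "safety": ("safety manual for field engineers", ["home"]),
    "orphan": ("page that nothing links to", []),
}

def spider(start):
    """Follow links breadth-first from start and return the set of pages reached."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for link in PAGES[page][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

def build_index(pages):
    """Map each word to the set of pages that contain it (an inverted index)."""
    index = defaultdict(set)
    for page in pages:
        for word in PAGES[page][0].split():
            index[word].add(page)
    return index

def search(index, query):
    """Return the pages containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(spider("home"))
print(sorted(search(index, "field")))  # pages reached by the spider that mention "field"
print(sorted(search(index, "page")))   # empty: the orphan page was never discovered
```

The orphan page shows why link-following alone is not enough inside an enterprise: content that nothing links to, such as files on a shared drive, has to be discovered some other way.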
Once the data is indexed, it needs to be categorized. Categorization, also called taxonomy, groups the content so a user can browse through the content related to a particular subject, rather than doing a word search. If you are one of the 200 million-plus people who visit Yahoo each month, you have seen categorization in action. While Yahoo has employees who manually categorize all the content, ESEs generally depend upon a mix of staff input and automatic algorithms to design and maintain the content taxonomy.
“Eighty percent of the companies we surveyed were categorizing information,” says IDC’s Feldman. “This defines the most important concepts, helps people to browse and sharpens search results.”
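One simple way to picture the mix of staff input and automatic assignment is a keyword-rule categorizer. The Python sketch below is an assumption for illustration only, since the article does not say which algorithms the products use. The taxonomy, the document names and the categorize function are all hypothetical: staff would maintain the keyword lists, and the code files each document under every category it matches.

```python
# Hypothetical taxonomy: staff maintain the keyword lists for each category.
TAXONOMY = {
    "Drilling": {"drilling", "well", "rig"},
    "Safety": {"safety", "hazard", "inspection"},
    "Finance": {"budget", "invoice", "cost"},
}

def categorize(text, min_hits=1):
    """Return, in alphabetical order, every category whose keywords appear in the text."""
    words = set(text.lower().split())
    scores = {cat: len(words & keywords) for cat, keywords in TAXONOMY.items()}
    return sorted(cat for cat, hits in scores.items() if hits >= min_hits)

# Hypothetical documents, reduced to a title-like snippet of text.
docs = {
    "rig-inspection.doc": "Weekly rig inspection and hazard checklist",
    "q3-budget.xls": "Drilling budget and cost forecast",
    "lunch-menu.txt": "Cafeteria menu for Friday",
}

# Build the browsable view: category -> documents filed under it.
browse = {}
for name, text in docs.items():
    for category in categorize(text):
        browse.setdefault(category, []).append(name)

for category in sorted(browse):
    print(category, browse[category])
```

A document that matches no category, like the lunch menu here, simply stays out of the browse view and would need either a new rule or manual filing.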
Once the data has been indexed and categorized, it is ready for users to type in keywords or questions and get back a list of documents to retrieve.
These steps make it possible to locate the content one needs, but a simple keyword search is still likely to pull up so much irrelevant content that the results are unusable. On an intranet it won’t be as bad as typing in the word ‘money’ on Google and having it display the ‘First 10 of about 63,400,000 results.’ But with thousands or millions of documents in the enterprise, an ESE still requires additional tools to narrow the results to what the users actually need.
“The first generation of search engines was about the mechanics of getting a lot of stuff into an index,” explains Giga Information Group director Laura Ramos. “The second generation is understanding what users’ intentions are and more proactively delivering the content or data they are looking for.”
To begin with, search tools are incorporating a variety of new methodologies to better interpret the meaning of both the words in a document and the words a person types into the search box. These include determining what a word means based on its context in a sentence and recognizing probable misspellings, synonyms and different forms of a word.
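Three of these techniques (word forms, synonyms and misspellings) can be sketched in a few lines of Python. This is a rough illustration rather than how any commercial engine works: the vocabulary, the synonym table and the crude suffix-stripping stem function are all assumptions, and the spelling correction relies on the standard library's difflib.get_close_matches. Context-based word meaning is not covered.

```python
import difflib

# Hypothetical index vocabulary and staff-maintained synonym table.
VOCABULARY = ["drill", "report", "pipeline", "invoice", "safety"]
SYNONYMS = {"bill": "invoice", "document": "report"}

def stem(word):
    """Crudely strip a common suffix so different forms of a word match."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def interpret(query):
    """Normalize each query word: word form, then synonym, then spelling."""
    terms = []
    for raw in query.lower().split():
        word = stem(raw)
        word = SYNONYMS.get(word, word)
        if word not in VOCABULARY:
            # Fall back to the closest indexed word, if one is similar enough.
            close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
            if close:
                word = close[0]
        terms.append(word)
    return terms

print(interpret("drilling reports"))
print(interpret("bills pipline"))
```

Real engines use far more careful linguistic processing. A suffix stripper this naive will mangle many English words, which is one reason the tuning effort Ramos describes later can be substantial.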
There are also personalization tools. These include forms that users can fill out to customize the type of results they receive, engines that learn a user’s preferences over time based on which results that user clicks on, and administrative tools to designate certain types of content for different business divisions or workgroups.
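The idea of learning preferences from clicks can be shown with a small Python sketch. It assumes, purely for illustration, that each result carries a category label and that the engine simply promotes categories a user has clicked most often. The ClickPersonalizer class and its methods are hypothetical names, not part of any product mentioned here.

```python
from collections import Counter

class ClickPersonalizer:
    """Track which categories one user clicks and promote them in later results."""

    def __init__(self):
        self.clicks = Counter()  # category -> number of clicks by this user

    def record_click(self, category):
        self.clicks[category] += 1

    def rerank(self, results):
        """results: (title, category) pairs in the engine's original order.

        The sort is stable, so results in equally popular categories
        keep the order the engine gave them.
        """
        return sorted(results, key=lambda result: -self.clicks[result[1]])

user = ClickPersonalizer()
for category in ("Safety", "Safety", "Drilling"):
    user.record_click(category)

results = [
    ("Q3 budget", "Finance"),
    ("Rig inspection", "Safety"),
    ("Well log", "Drilling"),
]
for title, category in user.rerank(results):
    print(title, category)
```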
Then there are natural language search engines for those who get lost using Boolean syntax. Other engines combine search and categorization so that a person can limit a search to a particular subject area. There are even some that will connect you to an expert in the area.
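Combining search with categorization amounts to filtering by subject before matching keywords. The Python sketch below uses a hypothetical document list and function name, and it matches on titles only to stay short.

```python
# Hypothetical documents, each already assigned a category.
DOCUMENTS = [
    {"title": "Rig inspection checklist", "category": "Safety"},
    {"title": "Rig lease invoice", "category": "Finance"},
    {"title": "Pipeline inspection schedule", "category": "Safety"},
]

def search_in_category(query, category=None):
    """Return titles containing every query word, optionally within one category."""
    words = query.lower().split()
    hits = []
    for doc in DOCUMENTS:
        if category is not None and doc["category"] != category:
            continue  # outside the chosen subject area
        title_words = doc["title"].lower().split()
        if all(word in title_words for word in words):
            hits.append(doc["title"])
    return hits

print(search_in_category("rig"))                     # matches in every category
print(search_in_category("rig", category="Safety"))  # narrowed to one subject area
```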
While there are a lot of exciting new tools in the ESE field, Ramos cautions against going overboard.
“It takes more work than many companies realize if they expect and want a highly relevant, highly targeted experience,” she explains. “If you can’t figure out a practical use for a feature that will start paying back in six months, save it for the next implementation.”
Mix of Tools
A company looking to improve productivity through better search capabilities has a wide variety of tools to select from. At the top end are the full-scale enterprise search engines such as Verity’s K2, Fast Search and Transfer ASA’s (Wellesley, Mass.) FAST Data Search, Convera’s (Vienna, Va.) RetrievalWare, Autonomy, Inc.’s (San Francisco) Retrieval and Copernic Technologies, Inc.’s (Sainte-Foy, Quebec) Enterprise Search. Hummingbird, Ltd. (Toronto, Ontario) has a version of its Search Server specifically for use in law offices.
Then there are engines such as the Google (Mountain View, Calif.) Search Appliance and Verity’s Infoseek, which bring technology developed for Web search engines into the enterprise space.
Supplementing these engines are specialized tools designed to enhance one particular aspect of search. These include Endeca Technologies, Inc.’s (Reston, Va.) ProFind, which assists browsing; Stratify, Inc.’s (Mountain View, Calif.) classification tool, Discovery Engine; and iPhrase Technologies, Inc.’s (Cambridge, Mass.) Q&A tool, iPhrase.
“The more tools you combine, the better off you are in terms of technologies and algorithms,” says IDC’s Feldman.