Enterprise Search Engine Technology Strikes Gold

Written By

Drew Robb

Apr 10, 2003

Mining for data is like mining for gold. It is a lot of work but, done right, produces a high return. At Newmont Mining Corp.’s site near Battle Mountain, Nev., for example, the company must excavate and process 30 tons of raw material to extract a single ounce of pure gold.

Knowledge workers face a similar problem — finding the hidden gold buried somewhere amidst 3 billion-plus Internet pages as well as an organization’s own data stores.

Without the right tools, employees can spend a huge amount of time prospecting for this hidden data. Anadarko Petroleum Corp. found that its engineers were spending as much as 50% of their time searching for information contained in the company’s 2-million-document intranet. Installing Convera’s RetrievalWare search engine changed all that.

“We had so much information out there in people’s heads or in complex directory structures — places where people don’t even know it exists,” says Bob Downing, Anadarko’s manager of business systems. “We estimate we save 78,000 staff hours per year in the engineering group alone by having rapid access to the documents.”

Getting Ready

To help their employees and customers quickly access the information they need, companies spent $450 million on Enterprise Search Engines (ESE) last year, according to Susan Feldman, International Data Corp.’s research vice president for Content Management and Retrieval Software.

Enterprise Search Engines perform much the same function as Internet search engines, but they are targeted to the needs of a particular group of people rather than the broad public. While the exact feature set and methodology vary among vendors, they perform three main functions: discovery, categorization and search.

The first action, discovery, consists of finding exactly what content an enterprise has stashed away in its various datastores. On the Internet this is done through “spidering,” the process of following hyperlinks from one page to another and copying the content of those pages into a server for indexing.

Users then search against this index rather than against the actual Web content. Within an enterprise, this process means not just finding HTML documents on the company portal, but also discovering the content within all the file systems, databases and applications the company uses. Older ESEs were limited in what file formats they could search, but most now license technology from either Verity, Inc. or Stellent, Inc., which converts other file formats into text so they can be indexed and searched.
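To make the idea of searching an index rather than the documents themselves concrete, here is a minimal sketch in Python. It assumes the documents have already been discovered and converted to plain text; the file names, the sample text and the function names are hypothetical and do not represent any vendor's product.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, already converted from their native formats to text
documents = {
    "well-report.doc": "Drilling report for the north field well",
    "budget.xls": "Annual drilling budget and cost forecast",
    "safety.pdf": "Field safety procedures for engineers",
}
index = build_index(documents)
print(search(index, "field drilling"))
```

The index is built once, and each query only touches the entries for the words it contains, which is why lookups stay fast as the document count grows.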

“File formats are largely a non-issue today,” says Dr. Prabhakar Raghavan, Verity’s vice president and CTO. “Most engines will search over 200 file formats.”

Once the data is indexed, it needs to be categorized. Categorization, also called taxonomy, groups the content so a user can browse through the content related to a particular subject, rather than doing a word search. If you are one of the 200 million-plus people who visit Yahoo each month, you have seen categorization in action. While Yahoo has employees who manually categorize all the content, ESEs generally depend upon a mix of staff input and automatic algorithms to design and maintain the content taxonomy.

“Eighty percent of the companies we surveyed were categorizing information,” says IDC’s Feldman. “This defines the most important concepts, helps people to browse and sharpens search results.”
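One simple way to picture the mix of staff input and automatic algorithms is a keyword-rule categorizer: staff choose the categories and their keywords, and code assigns documents to them. The sketch below assumes that approach; the taxonomy, the keywords and the `categorize` function are hypothetical, and commercial engines may use quite different methods.

```python
import re

# Hypothetical taxonomy: each category maps to keywords chosen by staff
TAXONOMY = {
    "Finance": {"budget", "cost", "invoice", "forecast"},
    "Engineering": {"drilling", "well", "pipeline", "pressure"},
    "Safety": {"safety", "hazard", "incident"},
}


def categorize(text, taxonomy=TAXONOMY, min_hits=1):
    """Return the sorted categories whose keywords appear in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {name: len(words & keywords) for name, keywords in taxonomy.items()}
    return sorted(name for name, hits in scores.items() if hits >= min_hits)


print(categorize("Annual drilling budget and cost forecast"))
```

Raising `min_hits` makes the assignment stricter, at the price of leaving more documents uncategorized.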

Once the data has been indexed and categorized, it is ready for the user to type in keywords or questions and get a list of documents to retrieve.

Better Results

These steps make it possible to locate the content one needs, but at that point a simple keyword search is likely to pull up so much irrelevant content that the results are unusable. On an intranet it won’t be as bad as typing in the word ‘money’ on Google and having it display the ‘First 10 of about 63,400,000 results.’ But with thousands or millions of documents in the enterprise, an ESE still requires additional tools to narrow the results to what the users actually need.

“The first generation of search engines was about the mechanics of getting a lot of stuff into an index,” explains Giga Information Group director Laura Ramos. “The second generation is understanding what users’ intentions are and more proactively delivering the content or data they are looking for.”

To begin with, search tools are incorporating a variety of new methodologies to better interpret the meaning of both the words in a document and the words a person types into the search box. These include determining what a word means based on its context in a sentence and recognizing probable misspellings, synonyms and different forms of a word.
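Two of these techniques, catching probable misspellings and adding synonyms, can be sketched with Python's standard `difflib` module. The vocabulary, the synonym table and the 0.8 similarity cutoff are assumptions chosen for illustration; the sketch does not attempt context analysis or word forms.

```python
import difflib

# Hypothetical vocabulary taken from the index, plus a hand-built synonym table
VOCABULARY = ["drilling", "budget", "pipeline", "safety", "report"]
SYNONYMS = {"budget": {"forecast"}, "report": {"summary"}}


def correct(word, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the lowercased word if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word.lower()


def expand(word):
    """Correct the word, then add any synonyms listed for it."""
    base = correct(word)
    return {base} | SYNONYMS.get(base, set())


print(correct("drillling"))
print(expand("Budget"))
```

Each expanded term would then be looked up in the index, so a query for one word also returns documents that use its listed synonyms.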

There are also personalization tools. These include forms that a person can fill out to customize the type of results received, engines that learn a user’s preferences over time based on which results are clicked, and administrative tools to designate certain types of content for different business divisions or workgroups.
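As a rough illustration of learning from clicks, the sketch below reorders a result list so that documents a user has opened more often come first. The `rerank` function and the click counts are hypothetical; real engines are likely to blend such signals with relevance scores rather than sort on clicks alone.

```python
from collections import Counter


def rerank(results, click_counts):
    """Order results by past clicks, most first, keeping the original order for ties."""
    return sorted(results, key=lambda doc_id: -click_counts.get(doc_id, 0))


# Hypothetical click history for one user
clicks = Counter({"safety.pdf": 3, "budget.xls": 1})
print(rerank(["well-report.doc", "budget.xls", "safety.pdf"], clicks))
```

Because Python's sort is stable, documents with no click history keep the order the engine originally gave them.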

Then there are natural language search engines for those who get lost using Boolean syntax. Other engines combine search and categorization so that a person can limit a search to a particular subject area. There are even some that will connect you to an expert in the area.

While there are a lot of exciting new tools in the ESE field, Ramos cautions against going overboard.

“It takes more work than many companies realize if they expect and want a highly relevant, highly targeted experience,” she explains. “If you can’t figure out a practical use for a feature that will start paying back in six months, save it for the next implementation.”

Mix of Tools

A company looking to improve productivity through better search capabilities has a wide variety of tools to select from. At the top end are the full-scale enterprise search engines such as Verity’s K2, Fast Search and Transfer ASA’s (Wellesley, Mass.) FAST Data Search, Convera’s (Vienna, Va.) RetrievalWare, Autonomy, Inc.’s (San Francisco) Retrieval and Copernic Technologies, Inc.’s (Sainte-Foy, Quebec) Enterprise Search. Hummingbird, Ltd. (Toronto, Ontario) has a version of its Search Server specifically for use in law offices.

Then there are engines such as the Google (Mountain View, Calif.) Search Appliance and Verity’s Infoseek, which bring technology developed for Web search engines into the enterprise space.

Supplementing these engines are specialized tools designed to enhance one particular aspect of search. These include Endeca Technologies, Inc.’s (Reston, Va.) ProFind which assists browsing, Stratify, Inc.’s (Mountain View, Calif.) classification tool Discovery Engine and iPhrase Technologies, Inc.’s (Cambridge, Mass.) Q&A tool, iPhrase.

“The more tools you combine, the better off you are in terms of technologies and algorithms,” says IDC’s Feldman.

Drew Robb

Drew Robb is a contributing writer for Datamation, Enterprise Storage Forum, eSecurity Planet, Channel Insider, and eWeek. He has been reporting on all areas of IT for more than 25 years. He has a degree from the University of Strathclyde UK (USUK), and lives in the Tampa Bay area of Florida.
