Getting the Bigger Picture: Dealing with Unstructured Data

Posted September 13, 2004

By Drew Robb



Take, for example, the statements "Jim rode in his Mustang" and "Jim rode on his mustang." There is little difference in the wording but a vast difference in meaning. A human would correctly recognize that one is talking about a car and the other about a horse. That reader would also know that the first sentence must have taken place within the last forty years, since Ford started selling Mustangs in 1964, and that odds are it occurred on a paved road. The second sentence is more likely to have happened on a dirt path in the western United States in the latter half of the nineteenth century. There is also a high probability that Jim refers to an adult male human. We also recognize that the sentence might contain a typo, since the "I" and the "O" sit right next to each other on the keyboard, or that the writer simply has bad grammar. By reading other sentences in the same document, we convert these probabilities into certainties.

We constantly engage in these types of analyses and decisions when we speak or read, and they are very simple and fast for a human. But there is one problem.

"Humans are better than computers when it comes to less structured data," says Gartner, Inc. (Stamford, CT) research vice president Alexander Linden. "The problem with humans is that they can't scale well for large masses of data."

Text Mining Tools

To overcome this scaling problem, companies such as ClearForest Corporation (New York, N.Y.), Inxight Software, Inc. (Sunnyvale, Calif.), Megaputer Intelligence Inc. (Bloomington, Ind.) and SPSS Inc. (Chicago, Ill.) have created products that analyze vast quantities of text and convert it into actionable intelligence.

The first step typically involves applying "natural language processing" algorithms, which determine the meaning of sentences by taking into account context, grammar, synonyms and colloquialisms. The software can then categorize the documents and group similar ones together. Some tools allow extraction of certain types of data, such as all company names or cities. Others present the information in graphic form, making it easier to spot relationships.
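To make that pipeline concrete, here is a minimal sketch of the kinds of steps described above: keyword-based categorization and pattern-based extraction of company names. The category vocabularies, regular expression and sample sentences are invented for illustration and are not how ClearForest, Inxight, Megaputer or SPSS actually implement their products.

```python
# Illustrative sketch only: a toy version of the text mining pipeline
# described in the article, using keyword scoring and regular expressions
# rather than any vendor's real natural language processing engine.
import re
from collections import Counter

# Hypothetical category vocabularies, assumed for this example.
CATEGORIES = {
    "automotive": {"mustang", "ford", "engine", "road", "car"},
    "equestrian": {"mustang", "horse", "saddle", "trail", "ranch"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def categorize(text):
    """Score each category by counting vocabulary hits in the document."""
    tokens = Counter(tokenize(text))
    scores = {cat: sum(tokens[w] for w in vocab)
              for cat, vocab in CATEGORIES.items()}
    return max(scores, key=scores.get), scores

def extract_company_names(text):
    """Pull out capitalized phrases ending in a corporate suffix, e.g. 'SPSS Inc.'"""
    pattern = r"([A-Z][\w&]*(?: [A-Z][\w&]*)*,? (?:Inc\.|Corp(?:oration)?\.?|Co\.))"
    return re.findall(pattern, text)

if __name__ == "__main__":
    doc = "Jim rode in his Mustang down the paved road past the Ford dealership."
    best, scores = categorize(doc)
    print(best, scores)  # "automotive" wins on keyword evidence from context
    print(extract_company_names(
        "ClearForest Corporation and SPSS Inc. sell text mining tools."))
```

Crude as it is, the sketch shows why context matters: the same word "mustang" lands in different categories depending on the other words that surround it.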

Although this technology is still fairly new and not yet as accurate as traditional data mining, its use is expanding. Dow Chemical, for example, is using it to conduct patent searches, and manufacturers are using it to mine call center reports for common complaints. The Global Aviation Information Network (GAIN), an international consortium of airlines, government agencies and manufacturers, is developing tools to gather data from mechanic, pilot and flight attendant reports to spot common mechanical problems so they can be corrected before a disaster occurs, a far better method than sorting through airplane wreckage to try to determine the cause.
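As a rough illustration of the frequency analysis GAIN is after, recurring problem terms can be counted across independent reports so that individually minor events become visible as a pattern. The report texts and term list below are invented for the example; a real system would use far richer language processing than simple substring matching.

```python
# Toy sketch: count recurring problem terms across maintenance reports.
from collections import Counter

reports = [
    "Hydraulic pressure dropped briefly during taxi; reset after restart.",
    "Flight attendant noted intermittent cabin pressure warning light.",
    "Hydraulic pressure fluctuation observed on approach, within limits.",
    "Pilot reported slow landing gear retraction, hydraulic pressure low.",
]

PROBLEM_TERMS = ["hydraulic pressure", "cabin pressure", "landing gear"]

counts = Counter()
for report in reports:
    text = report.lower()
    for term in PROBLEM_TERMS:
        if term in text:
            counts[term] += 1

# Terms that recur across independent reports may point to a systemic issue.
for term, n in counts.most_common():
    print(f"{term}: {n} report(s)")
```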

"We are trying to get smarter by looking at events that happen relatively frequently, but are innocuous by themselves because of the robustness of the systems," says Christopher Hart, Systems Administrator for System Safety for the Federal Aviation Administration. "But if they are part of the links in an accident chain, we are trying to stop those links before they cause an accident."

