Getting the Bigger Picture: Dealing with Unstructured Data

Written By

Drew Robb

Sep 13, 2004

Companies have never suffered from a lack of data. They have warehouses of file boxes and terabytes of storage. What is missing is actionable intelligence that they can use to improve business results. Using data mining tools helps to convert database stores into business intelligence.

But that only gives part of the picture since 85 percent of an organization’s knowledge isn’t in databases. To get at the rest, a new generation of text mining tools allows companies to discover relationships and summarize information from large stores of previously unanalyzed data.

Structured and Unstructured

Information breaks down into two broad categories – structured and unstructured. Structured is what we find in databases. Every bit of information has an assigned format and significance.

Companies have been using data mining software for years to extract business intelligence from their structured data. Since the database fields are clearly defined, it is easy to run queries and formulas which extract meaningful information, not just raw data. Computers are great at handling massive quantities of structured information, something which people have a hard time doing.
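To illustrate why clearly defined fields make this easy, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns and its rows are invented for illustration and do not come from any company mentioned in this article.

```python
import sqlite3

# Hypothetical sales table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("East", "widget", 120.0),
        ("East", "gadget", 80.0),
        ("West", "widget", 200.0),
        ("West", "widget", 50.0),
    ],
)


def revenue_by_region(connection):
    """Return (region, total amount) pairs, largest total first."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM orders "
        "GROUP BY region ORDER BY SUM(amount) DESC"
    )
    return rows.fetchall()


print(revenue_by_region(conn))
```

Because every value sits in a named, typed column, one short query turns raw rows into a summary. Nothing comparable can be written against a folder of memos without first working out what the text means.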

Unstructured data is what we find in emails, reports, PowerPoint presentations, voice mail, phone notes, agendas and photographs. Shaku Atre, president of the Santa Cruz, CA business intelligence consultancy Atre Group, points out that much of this type of information is better referred to as semi-structured since it contains structured metadata such as the e-mail headers or revision dates in Word documents. For simplicity, we will group the entire spectrum of data that is less structured than database entries under the term “unstructured.”
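Atre's point about semi-structured data can be sketched with Python's standard email package. The message below is made up for illustration: the header fields parse cleanly into named values, while the body remains free text that still has to be interpreted.

```python
from email import message_from_string

# A made-up message used only for illustration.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 supplier complaints
Date: Mon, 13 Sep 2004 09:30:00 -0400

Bob, the valve shipments were late again and two customers called about it.
"""


def split_structured_and_free_text(raw):
    """Separate header fields (structured) from the body (unstructured)."""
    msg = message_from_string(raw)
    headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
    body = msg.get_payload().strip()
    return headers, body


headers, body = split_structured_and_free_text(RAW_MESSAGE)
print(headers["Subject"])
print(body)
```

The headers could be loaded into a database as they are. The body is where the late shipments and the customer calls are mentioned, and no field tells a program that.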

This data typically comprises about 85% of an organization’s knowledge stores, but it is not always easy to find, access, analyze or put to use.

“We are drowning in information but are starving for knowledge,” says Mani Shabrang, technical leader in research and development at The Dow Chemical Company’s business intelligence center in Midland, Michigan. “That information is only useful when it can be located and then synthesized into knowledge.”

Running full text queries to find key words is one way to locate text information but it is severely limited. It still relies on a human to then read that information, spot the relationships and convert it into useful knowledge. One problem lies in determining the true meaning and importance of language.
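A minimal sketch of that limitation, using invented memo snippets and a hypothetical `keyword_search` helper: the query finds every document that contains the words, but it cannot say which one actually reports a problem.

```python
import re

# Invented snippets standing in for a document store.
DOCUMENTS = {
    "memo-1": "The pump failed after the valve was replaced.",
    "memo-2": "Valve inventory is fine; no pump issues reported.",
    "memo-3": "Quarterly travel budget approved.",
}


def keyword_search(documents, *keywords):
    """Return ids of documents containing every keyword (case-insensitive)."""
    hits = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(keyword.lower() in words for keyword in keywords):
            hits.append(doc_id)
    return hits


print(keyword_search(DOCUMENTS, "pump", "valve"))
```

Both memo-1 and memo-2 match "pump" and "valve," yet only the first describes a failure. Someone still has to read the hits to find that out.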

Take for example the statements “Jim rode in his Mustang” and “Jim rode on his mustang.” There is little difference in the wording but a vast difference in meaning. A human would correctly recognize that one is talking about a car and the other a horse. He would also know that the first sentence must have taken place in the last forty years, since Ford started selling Mustangs in 1964, and odds are that it occurred on a paved road. The other sentence is more likely to have happened on a dirt path in the western United States in the latter half of the nineteenth century. There is also a high degree of probability that Jim refers to a male, adult human. We also recognize that the sentence might contain a typo (the “I” and the “O” are right next to each other on the keyboard) or that the writer might have bad grammar. By reading other sentences in the same document, we convert these probabilities into certainties.
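The reasoning in the Mustang example can be imitated, very crudely, by counting context cues. The sketch below is only an illustration: the cue-word lists and the `guess_mustang_sense` function are hypothetical, and real text mining products use far richer language models than a word tally.

```python
import re

# Hypothetical cue words chosen by hand; a real system would learn these.
CAR_CUES = {"in", "drove", "engine", "road", "ford", "parked"}
HORSE_CUES = {"on", "saddle", "trail", "hooves", "corral", "reins"}


def guess_mustang_sense(text):
    """Guess 'car' or 'horse' by tallying cue words found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    car_votes = sum(word in CAR_CUES for word in words)
    horse_votes = sum(word in HORSE_CUES for word in words)
    if car_votes > horse_votes:
        return "car"
    if horse_votes > car_votes:
        return "horse"
    return "unknown"


print(guess_mustang_sense("Jim rode in his Mustang"))
print(guess_mustang_sense("Jim rode on his mustang"))
# A possible typo ("on" for "in") is outvoted by the surrounding sentence.
print(guess_mustang_sense(
    "Jim rode on his Mustang. He parked it by the road and checked the engine."
))
```

With a single sentence, one preposition decides the outcome. Once a second sentence mentions parking, a road and an engine, the car reading wins despite the "on," which mirrors how other sentences in a document turn a probability into a near certainty.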

We constantly engage in those types of analyses and decisions when we speak or read. For a human, this is very simple and fast. But there is one problem.

“Humans are better than computers when it comes to less structured data,” says Gartner, Inc. (Stamford, CT) research vice president Alexander Linden. “The problem with humans is that they can’t scale well for large masses of data.”

Text Mining Tools

To overcome this scaling problem, companies such as ClearForest Corporation (New York, N.Y.), Inxight Software, Inc. (Sunnyvale, Cal.), Megaputer Intelligence Inc. (Bloomington, Ind.) and SPSS Inc. (Chicago, Ill.) have created products to analyze vast quantities of text information and convert it into actionable intelligence.

The first step typically involves applying “natural language processing” algorithms, which determine the meaning of sentences by taking into account context, grammar, synonyms and colloquialisms. The software can then categorize the documents and group similar ones. Some tools allow extraction of certain types of data, such as all company names or cities. Others present the information in graphic form, making it easier to spot relationships.
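As a rough illustration of grouping and extraction, the sketch below compares documents by word overlap (bag-of-words cosine similarity) and pulls out city names by looking them up in a fixed list. These are simple stand-ins, not the methods used by the vendors named above; the reports, the stopword list and the city list are all invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and sample reports, invented for illustration.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "after"}
REPORTS = {
    "r1": "Hydraulic pump leak reported after landing in Chicago.",
    "r2": "Crew reported hydraulic pump leak during taxi.",
    "r3": "Catering order delayed in New York.",
}
KNOWN_CITIES = ("New York", "Chicago", "Sunnyvale", "Bloomington")


def term_counts(text):
    """Count the non-stopword terms in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(target_id, documents):
    """Return the id of the document closest to the target document."""
    vectors = {doc_id: term_counts(text) for doc_id, text in documents.items()}
    others = [doc_id for doc_id in documents if doc_id != target_id]
    return max(others, key=lambda d: cosine(vectors[target_id], vectors[d]))


def extract_cities(text, cities=KNOWN_CITIES):
    """Return the known city names in the text, in order of appearance."""
    found = [(text.find(city), city) for city in cities if city in text]
    return [city for _, city in sorted(found)]


print(most_similar("r1", REPORTS))
print(extract_cities(REPORTS["r3"]))
```

The two hydraulic-leak reports end up paired because they share most of their vocabulary. A lookup list finds "New York" only because someone put it on the list, which is one reason commercial tools lean on grammar and context rather than fixed word lists alone.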

Although this technology is still fairly new and is not yet as accurate as traditional data mining, its use is expanding. Dow Chemical, for example, is using it to conduct patent searches, and manufacturers are using it to mine call center reports for common complaints. The Global Aviation Information Network (GAIN), an international consortium of airlines, government agencies and manufacturers, is developing tools to gather data from mechanic, pilot and flight attendant reports to spot common mechanical problems so they can be corrected before a disaster – a far better method than sorting through airplane wreckage to try to determine the cause.
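Mining reports for common complaints can be approximated, in the simplest case, by counting which two-word phrases recur across many reports. The sketch below assumes a handful of invented call center notes and a hypothetical `common_phrases` helper; it is a frequency count, not the natural language processing described above.

```python
import re
from collections import Counter

# Hypothetical stopword list and call center notes, invented for illustration.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}
CALL_NOTES = [
    "Customer says the battery drains fast.",
    "Battery drains overnight, customer wants refund.",
    "Screen cracked on arrival.",
    "Phone battery drains within hours.",
]


def common_phrases(notes, top_n=1):
    """Return the two-word phrases that appear in the most notes."""
    counts = Counter()
    for note in notes:
        words = [
            word
            for word in re.findall(r"[a-z]+", note.lower())
            if word not in STOPWORDS
        ]
        # Pair adjacent remaining words; count each phrase once per note.
        counts.update(set(zip(words, words[1:])))
    return counts.most_common(top_n)


print(common_phrases(CALL_NOTES))
```

Here "battery drains" surfaces because it appears in three of the four notes, even though each note words the complaint differently. The same tallying idea, applied to maintenance and crew reports, is one simple way recurring problems could be flagged for a closer look.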

“We are trying to get smarter by looking at events that happen relatively frequently, but are innocuous by themselves because of the robustness of the systems,” says Christopher Hart, Assistant Administrator for System Safety for the Federal Aviation Administration. “But if they are part of the links in an accident chain, we are trying to stop those links before they cause an accident.”

Drew Robb

Drew Robb is a contributing writer for Datamation, Enterprise Storage Forum, eSecurity Planet, Channel Insider, and eWeek. He has been reporting on all areas of IT for more than 25 years. He has a degree from the University of Strathclyde UK (USUK), and lives in the Tampa Bay area of Florida.
