Semi-structured data refers to a type of data that falls somewhere between the structured data used by traditional relational databases and unstructured data like multimedia files and images. While it does not fit neatly into tables with predefined schema, like structured data, semi-structured data does support some sort of organization or hierarchy—for example, metadata or semantic tags—which makes it searchable using queries, unlike unstructured data.
Adding structure makes semi-structured data more useful than purely unstructured data for enterprises that increasingly rely on data to fuel decision-making. For enterprises to succeed in today’s data-driven world, they need to know how to work with all types of data to ensure they’re extracting the most information and getting the most accurate insights.
Table of Contents
Featured Partners: Database Software
How Does Semi-Structured Data Work?
Semi-structured data is a hybrid of structured and unstructured data, and as such, it shares some aspects with both types. It’s not as rigidly structured as the former, but it contains identifying information or tags that make it more searchable and actionable than the latter. Organizations both collect semi-structured data and create it by adding information to unstructured data.
For example, consider a customer record in a database. It contains set fields—like name, address, and phone number—that can be searched for with queries. Compare that to a digital photograph made up of millions of pixels, which is purely unstructured data and therefore unsearchable. But if I add metadata to the image—the photographer’s name, the data it was taken, and a brief description, for example—I can then search for and find that image using that information.
A similar approach can be taken with other unstructured data like audio and video files, opening the door to processing and analysis.
Why is Semi-Structured Data Important?
Unstructured data accounts for more than 90 percent of all data generated globally, and is growing at a rate of 55 percent annually. Because enterprises collecting massive amounts of unstructured data need to make it actionable for it to be of use, they can add tags or information to turn unstructured data into semi-structured data to meet this need. Otherwise, they’d be missing out on potential insights from the lion’s share of the data they gather and store.
Semi-structured data can come from a wide range of sources, from Internet of Things (IoT) devices and sensors to web pages, emails, and more. Organizations need to be able to mine all their data, regardless of type, for information.
In customer-facing businesses, for example, being able to analyze semi-structured data can help understand customer trends and behavior by searching through purchase histories, sentiment analysis, user comments, and reviews for information.
Examples of Semi-Structured Data
Semi-structured data is found in a wide range of forms, from web data like HTML and other markup languages to configuration and log files, and it makes up the backbone of big data processing frameworks like Hadoop. Its lack of predefined schema allows a greater degree of flexibility for integrating data across sources and diverse data structures like JSON and XML extensions.
XML
Extensible Markup Language (XML) is an example of semi-structured data in wide use. This markup language lets users define tags and attributes for unstructured data so it can be stored in a hierarchical form.
JSON
JavaScript Object Notation (JSON) is a popular alternative to XML that collects semi-structured data and organizes it for use.
Hyper-Text Markup Language (HTML)
HTML used for websites is a common example of semi-structured data. It provides the hierarchy of structured data with tags like <header>, <li>, <section>, and <footer>, but lacks any of the structured needed for traditional analytics methods. Adding attributes like “meta charset,” “metaname,” and “html lang” makes it possible to identify datasets across the markup document.
Log Files
Generally speaking, log files follow a specific format—and many even include predefined fields that give them structure—but entries often lack order or data types. In the example below, the timestamp follows a specific format of YYYY/MM/DD-HH:MM:SS, the remaining log information is always different. Warning logs, for example, may include stack traces missing from informational logs.
Electronic Data Interchange (EDI)
Organizations use EDI to convert paper files into digital documents. Most offline content is not readable by machines, but EDI can transform their elements into machine readable language. EDI docs have structured segments, but include variable content—for example, a “product code” data element may contain different product codes for different items.
Emails
A typical email includes defined fields like date, subject, sender, and recipient. At the same time, the body of the email is unstructured and can incorporate anything from text to numbers, images, and hyperlinks. Email attachments can also be unstructured, depending upon the format: .pdf, .jpeg, .mp3, for example.
Characteristics Of Semi-Structured Data
While semi-structured data varies based on the type, common characteristics include the following:
- Hierarchy. Semi-structured data often has a parent-child or sibling relationship—in markup languages, attributes like <body> and <title> create that hierarchy to allow contextual information to be searched for or extracted.
- Schema-on-read approach. Schema-on-read means that structure does not have to be predefined in the file before data is inserted, as with a database—instead, the structure or schema can be applied after the fact, during querying.
- Tags and marks. Semi-structured data relies on tags, markers, and labels to identify and organize data elements—for example, in .csv or .tsv file formats, first lines are always marked as headers to include column names.
Advantages of Semi-Structured Data
Semi-structured data offers a wide range of benefits to organizations looking for ways to make unstructured data more useful.
Flexibility
Semi-structured data lacks a fixed schema and can store data that does not fit a strict mold or predefined format, giving businesses flexibility to add, process, analyze, and retrieve data.
Storability and Scalability
Semi-structured data can make large volumes of data usable, and scales readily. Because it is highly portable, it can be moved from one network or storage to another using distributed computing systems like Spark.
Data Integration
Semi-structured data like JSON and XML files are common in web development, and the ability to support different formats ensures that information is exchanged across heterogeneous sources without extra cost or effort.
Disadvantages of Semi-Structured Data
Semi-structured data can require nuance to work with, and has a few limitations or challenges for organizations to take into account.
Difficult Analysis
While semi-structured data is more usable than unstructured data, it’s less useful than structured data and it can be difficult to tag and index.
Limited Tooling
Artificial intelligence and machine learning (AI/ML) tools have not yet had a significant impact on the semi-structured data space—as such, existing tool stacks and models offer limited capability to transform datasets into actionable insights.
Data Security
It can be easy to overlook sensitive information contained in less visible or hidden parts of semi-structured data, and complacency or mistakes can make organizations vulnerable to security risks.
Bottom Line: Semi-Structured Data
Increasingly businesses need sophisticated tools to collect, store, and process data of all types—not just structured and unstructured, but semi-structured as well. As their data estates grow, the ability to seamlessly analyze massive volumes of data will become more of a necessity.
Read about the Best Data Warehouse Tools to learn more about how enterprise organizations are storing the massive volumes of data they’re increasingly collecting and processing.