Semi-structured data is one of many different types of data. From a data classification perspective, it’s one of three: structured data, unstructured data and semi-structured data. Structured data has a long history and is the type commonly used in organizational databases. But more recently, semi-structured and unstructured data have come to the fore as technology has evolved that makes it possible to harness this data and mine it for business insight.
However, much confusion exists concerning these terms. Let’s look at what each is and their overall value.
What is Structured Data?
Structured data has a high level of organization making it predictable, easy to organize and very easily searchable using basic algorithms. The information is rigidly arranged. Data is entered in specific fields containing textual or numeric data. These fields often have their maximum or expected size defined. In addition to the firm structure for information, structured data has very set rules concerning how to access it.
Examples of structured data include relational databases and other transactional data like sales records, as well as Excel files that contain customer address lists. This type of data is generally stored in tables. You end up with various columns and rows of data. One column might be customer names, and other columns would contain further attributes such as address, zip code, phone, email and credit card number. Each row then holds the values for a single customer.
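As a rough illustration of the table layout described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample rows are hypothetical and chosen only for demonstration.

```python
import sqlite3

# In-memory database; the "customers" table and its columns are
# illustrative assumptions, not taken from any real system.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        name     TEXT NOT NULL,
        address  TEXT,
        zip_code TEXT,
        email    TEXT
    )
    """
)

# Every row supplies a value for the same fixed set of columns.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("Ada Example", "1 Main St", "12345", "ada@example.com"),
        ("Bob Sample", "2 Oak Ave", "67890", "bob@example.com"),
    ],
)

# Because the structure is rigid, a basic query can search it directly.
rows = conn.execute(
    "SELECT name, email FROM customers WHERE zip_code = ?", ("12345",)
).fetchall()
print(rows)
```

The fixed schema is what makes the search simple: the query can name a column and rely on every row having it.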
What is Unstructured Data?
Unstructured data, on the other hand, is not organized in any discernible manner and has no associated data model. Some refer to data lakes as being the place where unstructured data is stored. This type of information is usually text-heavy and often includes multiple types of data. Examples of types of files generally considered to be unstructured data are: books, some health records, satellite images, Adobe PDF files, a warranty request created by a customer service representative, notes in a web form, objects from presentations, blogs, text messages, Word documents, videos, photos and other images. These files are not organized other than being placed into a file system, object store or another repository.
What is Semi-Structured Data?
Matthew Magne, Global Product Marketing for Data Management at SAS, defines semi-structured data as a type of data that contains semantic tags, but does not conform to the structure associated with typical relational databases. While semi-structured entities belong in the same class, they may have different attributes. Examples include email, XML and other markup languages.
While semi-structured data is not a natural fit for legacy databases, it is a critical source for Big Data analytics.
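To make the idea of semantic tags without a fixed schema concrete, the following sketch parses a small XML snippet with Python's standard xml.etree.ElementTree module. The XML content and tag names are invented for illustration. Both records belong to the same class (customer), yet they carry different attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: both entries are <customer> elements, but each
# carries a different set of child tags.
xml_text = """
<customers>
  <customer>
    <name>Ada Example</name>
    <email>ada@example.com</email>
  </customer>
  <customer>
    <name>Bob Sample</name>
    <phone>555-0100</phone>
  </customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <customer> into a dict keyed by its tag names.
records = [
    {child.tag: (child.text or "").strip() for child in customer}
    for customer in root.findall("customer")
]

for record in records:
    print(sorted(record.keys()), record)
```

The tags tell a program what each value means, but nothing forces every record to have the same fields, which is why such data does not map cleanly onto a single relational table.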
Where does Semi-Structured Data Fit In?
Semi-structured data falls in the middle between structured and unstructured data. It contains certain aspects that are structured, and others that are not. For example, X-rays and other large images consist largely of unstructured data – in this case, a great many pixels. It is impossible to search and query these X-rays in the same way that a large relational database can be searched, queried and analyzed. After all, all you are searching against are pixels within an image. Fortunately, there is a way around this. Although the files themselves may consist of no more than pixels, words or objects, most files include a small section known as metadata. This opens the door to being able to analyze unstructured data.
How Does Metadata Help?
Metadata can be defined as a small portion of any file that contains data about the contents of the file. This often includes how the data was created, its purpose, its time of creation, the author, file size, length, sender/recipient, and more. As a result, large amounts of unstructured or semi-structured data can be catalogued, searched, queried and analyzed via their metadata.
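Email is a convenient way to see this in practice: the headers act as metadata, while the body is free text. The sketch below uses Python's standard email package on a made-up message. The addresses, subject and date are assumptions for illustration only.

```python
from email import message_from_string

# Hypothetical raw message: structured headers, then an unstructured body.
raw = """From: ada@example.com
To: support@example.com
Subject: Warranty request
Date: Mon, 01 Jan 2024 09:30:00 +0000

Hello, my device stopped working last week...
"""

msg = message_from_string(raw)

# The headers can be read like fields in a record.
metadata = {key: msg[key] for key in ("From", "To", "Subject", "Date")}

# The body remains free text with no predefined structure.
body = msg.get_payload()

print(metadata)
print(body)
```

A catalogue built from such header fields could be filtered by sender, recipient or date without ever interpreting the free-text body.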
X-rays and other image files also contain metadata. Queries against metadata could uncover the identity of the patient or doctor, when the image was taken, the diagnosis, etc. Semi-structured data, then, is no longer useless to the business. On the contrary, it is now possible to mine great insight from it about customer habits, preferences and opportunities.
How Do Unstructured and Semi-Structured Data Differ?
If almost all unstructured data actually contains some kind of structure in the form of metadata, what’s the difference? The reality is that there is a grey area between truly unstructured data and semi-structured data. Semi-structured may lack organization and certainly is a million miles away from the rigorous organization of the information contained in a relational database. But the presence of metadata really makes the term semi-structured more appropriate than unstructured.
Very little data in the modern age has absolutely no structure and no metadata. In popular usage, therefore, most of what is termed unstructured data is really semi-structured data. Documents, images, and other files have some form of data structure. But for the sake of simplicity, data is loosely split into structured and unstructured categories. Some argue that the distinction between unstructured and semi-structured data is moot.
How Much Semi-Structured Data is Out There?
Unstructured and semi-structured data accounts for the vast majority of all data. Just consider the huge numbers of video files, audio files and social media postings being added every minute and you get an idea why the term big data originated.
Unstructured and semi-structured data represents 85% or more of all data. This percentage is only going to grow once machine learning, artificial intelligence (AI) and the Internet of Things (IoT) gain real momentum in the marketplace. That will lead to huge amounts of data flooding systems every second. For example, IoT sensors are expected to number tens of billions within the next five years. That’s going to generate a lot of unstructured and semi-structured data.
How Does Big Data Fit In?
It is not necessarily the size of the data that makes it big so much as the complexity of that data. Unstructured data is more complex and difficult to work with. Therefore, it is typically associated with Big Data. However, the reality is that Big Data contains a combination of structured, unstructured and semi-structured data. This combination adds further to the complexity.
Now factor in emerging Big Data technologies like Hadoop and NoSQL databases such as MongoDB. These relatively new technologies relax the usual data model requirements and allow the storing of data in a much more unstructured format than, for example, gathering data in a SAS dataset or an Oracle relational database.
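The relaxed data model can be sketched without any particular product. The example below is a plain-Python stand-in, not the API of Hadoop, MongoDB or any other system: it keeps JSON documents with differing fields in one collection and filters them with a hypothetical find helper. All field names and values are invented.

```python
import json

# One collection holding documents of different shapes; no schema is
# declared up front. The contents are illustrative assumptions.
documents = [
    json.loads('{"type": "sensor", "device_id": "s-1", "temperature_c": 21.5}'),
    json.loads('{"type": "tweet", "user": "ada", "text": "hello", "tags": ["iot"]}'),
    json.loads('{"type": "sensor", "device_id": "s-2", "humidity_pct": 40}'),
]


def find(docs, **criteria):
    """Return documents whose fields match every given key/value pair.

    Hypothetical helper for illustration; documents lacking a field
    simply do not match.
    """
    return [
        doc for doc in docs
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


sensors = find(documents, type="sensor")
print([doc["device_id"] for doc in sensors])
```

Note that the two sensor documents do not share the same measurement fields, something a fixed relational table would only allow with extra nullable columns or a schema change.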
But Big Data is only going to get bigger. Floods of semi-structured and unstructured data are already manifesting courtesy of the IoT, satellite imagery, digital microscopy, sonar explorations, Twitter feeds, Facebook and YouTube postings, and so on.
How Will Big Data Be Managed?
Whatever the storage mechanism, whether it is a data warehouse or a data lake, and however data is stored, Big Data entails a combination of structured and unstructured data. It all requires some level of data governance. Due to the sheer quantity of data involved, prioritization becomes vital, as well as alignment with business objectives.
Understanding Big Data
Big Data can best be understood by considering four Vs: volume, velocity, variety, and value. Big Data systems must be able to process the required volumes of data with sufficient velocity (both in terms of creation and distribution of that data). Further, systems must be able to cope with a wide variety of file types and data structures. With all of these elements in place, there is now an opportunity to extract real value from this information via analytics. The organizations that can manage all four Vs effectively stand to gain competitive advantage.
“Whatever you call the storage mechanism, be it a data warehouse or data lake, and however you store the data, there’s going to be a combination of structured and unstructured data,” said Magne. “There should be some level of data governance rigor, as well as prioritization and alignment with business value and stakeholder interests to drive decision making. This is how you create a truly data-driven business.”