Data structures are ways of organizing and storing data in memory so that a computing process can access and modify it efficiently. Data structures often build on each other to create new structures, and programmers can adopt the structures that best fit a given data access pattern. Choosing the most efficient data structure for the job can significantly improve algorithm performance, which speeds up application processing. This enables computing systems to manage massive amounts of data efficiently in large-scale indexes, large databases, and big data platforms.
Many programmers and analytics teams will never need to implement a data structure themselves because they can use standard libraries. However, understanding data structures helps DBAs and analytics teams optimize databases and choose the best analytics toolsets for their big data environment.
Types of Data Structures
There are different types of data structures that build on one another, including primitive, simple, and compound structures.
- Primitive data structures/types: The basic building blocks of simple and compound data structures: integers, floats and doubles, characters, strings, and Booleans. Integers represent whole numbers, while floats and doubles represent numbers with decimal points. A character is a single letter, digit, or symbol, and a string represents a sequence of characters. Booleans represent logical (true or false) values.
- Simple data structures: Build on primitive data types to create higher-level data structures. The most common simple data structures are arrays and linked lists.
- Compound data structures: Build on primitive and simple data structures, and may be linear or non-linear. A linear data structure forms a sequence in which each element has a unique predecessor and successor (except for the first and last elements). A non-linear data structure does not form a linear sequence; a tree, for example, is built from hierarchical relationships.
As the name suggests, compound data structures allow for a far greater level of sophistication.
Another common method of categorizing data structures is by access type.
- Set-like access: Enters and retrieves data items from structures that may contain non-unique elements: sets, linked lists, and some trees.
- Key-based access: Stores and retrieves data items using unique keys: arrays, hash tables, and some trees.
- Restricted access: Data structures that control the time and order of data item access: stacks and queues.
Simple Data Structures
The array is one of the oldest and most common types of data structures. An array consists of elements that may be values or variables. The structure identifies elements using an index or key, which enables it to compute the location of each element. The memory address of the first element is called the base address (also known as the foundation or first address). Elements are stored sequentially in contiguous memory, so the address of any element can be computed from the base address, the index, and the element size.
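The address computation described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database or runtime does it; the helper name `element_address` is hypothetical, and the standard-library `array` module is used only because it stores fixed-size elements contiguously.

```python
from array import array


def element_address(base_address: int, index: int, item_size: int) -> int:
    """Return the address of element `index` in a contiguous array.

    Hypothetical helper: address = base address + index * element size.
    """
    return base_address + index * item_size


# A typed array of signed integers, stored contiguously in memory.
values = array("i", [10, 20, 30, 40])

# buffer_info() returns (address of the first element, number of elements).
base_address, length = values.buffer_info()

# Address arithmetic for the third element (index 2).
third_address = element_address(base_address, 2, values.itemsize)
offset_bytes = third_address - base_address
```

Because the location is computed rather than searched for, reaching any element takes the same amount of work regardless of its position.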
There are many different types of array data structures. In one common example, many databases use one-dimensional linear arrays whose elements are the database records. Arrays may also be multi-dimensional, in which case each element is accessed with more than one index. Arrays are the foundation for many other data structures, including hash tables, queues, and stacks.
A linked list is the second most common type of data structure. It links elements with pointers instead of computing addresses from an index. While an array mathematically computes data item addresses, a linked list stores the location of each item within the structure itself. The structure treats each element as a separate object (a node), and each node contains the data and the reference, or address, of the next node.
There are three types of linked lists: singly linked, where each node stores the next node’s address and the last node’s next address is null; doubly linked, where each node stores both the previous and the next node’s addresses, with null at each end of the list; and circular linked, where the last node links back to the first, so there is no terminating null.
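As a rough sketch of the singly linked variant, the Python below stores in each node its data plus a reference to the next node, with `None` standing in for the terminating null. The class and method names (`Node`, `SinglyLinkedList`, `prepend`, `append`) are illustrative choices, not taken from any library.

```python
from typing import Any, Iterator, Optional


class Node:
    """One element of the list: the data plus a reference to the next node."""

    def __init__(self, data: Any, next_node: Optional["Node"] = None) -> None:
        self.data = data
        self.next = next_node  # None marks the end of the list


class SinglyLinkedList:
    def __init__(self) -> None:
        self.head: Optional[Node] = None

    def prepend(self, data: Any) -> None:
        """Insert at the front: the new node points at the old head."""
        self.head = Node(data, self.head)

    def append(self, data: Any) -> None:
        """Insert at the end by walking the links to the last node."""
        new_node = Node(data)
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = new_node

    def __iter__(self) -> Iterator[Any]:
        """Yield each node's data by following the next references."""
        current = self.head
        while current is not None:
            yield current.data
            current = current.next
```

Unlike an array, reaching the last element here requires following every link from the head, which is the trade-off for not needing contiguous memory.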
Compound Data Structures
Computing systems combine simple data structures to form compound data structures. Compound structures may be linear or non-linear.
Linear data structures
Linear data structures form sequences.
A stack is a basic linear data structure: a logical entity pictured as a physical stack or pile of objects. The data structure inserts and deletes elements at one end of the stack, called the top. Programmers can implement a stack using an array or a linked list.
A stack processes elements in Last In First Out (LIFO) order: the most recently added element is the first one removed. The primary stack operations are 1) Push, which adds an item to the top of the stack; and 2) Pop, which removes the item at the top, so items come off in the reverse of the order in which they were pushed. A stack also typically provides an isEmpty operation, which returns “true” on an empty stack and “false” if there is data.
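The Push, Pop, and isEmpty operations can be sketched on top of a Python list, which plays the role of the underlying array. The class name `Stack` and the method spellings are assumptions made for this example.

```python
from typing import Any, List


class Stack:
    """A LIFO stack backed by a Python list; the end of the list is the top."""

    def __init__(self) -> None:
        self._items: List[Any] = []

    def push(self, item: Any) -> None:
        """Add an item to the top of the stack."""
        self._items.append(item)

    def pop(self) -> Any:
        """Remove and return the item at the top of the stack."""
        if self.is_empty():
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def is_empty(self) -> bool:
        """Return True when the stack holds no items."""
        return not self._items


stack = Stack()
for job in ("first", "second", "third"):
    stack.push(job)

# Items come back in the reverse of the order they were pushed.
popped = [stack.pop(), stack.pop(), stack.pop()]
```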
Instead of the stack’s LIFO order, the queue data structure processes elements in First In First Out (FIFO) order. The insertion procedure is called Enqueue, which inserts an element at the rear, or tail, of the queue. The deletion procedure is called Dequeue, which removes an element from the front, or head, of the queue. Before a newly inserted element can reach the front of the queue and be dequeued, the queue must first remove all the elements ahead of it.
Think of this structure in terms of a printer queue, where print jobs are processed in order until they’re printed or cancelled. A queue can be built using an array, a linked list, or stacks.
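A small sketch of the printer-queue idea follows. It uses `collections.deque` from the Python standard library as the backing store, which is a convenience chosen for this example rather than one of the constructions named above; the `Queue` class and the job names are hypothetical.

```python
from collections import deque
from typing import Any, Deque


class Queue:
    """A FIFO queue: enqueue at the tail, dequeue from the head."""

    def __init__(self) -> None:
        self._items: Deque[Any] = deque()

    def enqueue(self, item: Any) -> None:
        """Insert an item at the rear (tail) of the queue."""
        self._items.append(item)

    def dequeue(self) -> Any:
        """Remove and return the item at the front (head) of the queue."""
        if self.is_empty():
            raise IndexError("dequeue from empty queue")
        return self._items.popleft()

    def is_empty(self) -> bool:
        return not self._items


print_jobs = Queue()
for job in ("report.pdf", "invoice.pdf", "photo.png"):
    print_jobs.enqueue(job)

# Jobs are served in the same order they arrived.
served = [print_jobs.dequeue(), print_jobs.dequeue(), print_jobs.dequeue()]
```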
Non-linear data structures
Non-linear structures are multi-leveled and non-sequential.
A graph data structure is a mathematical representation of a set of objects in which pairs of objects are linked. The interconnected objects are called vertices and the links are called edges. A tree is a restricted type of graph: it is connected and contains no cycles.
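One common way to hold vertices and edges in memory is an adjacency list, sketched below for an undirected graph. The `Graph` class and its method names are assumptions for this example; other representations, such as an adjacency matrix, are equally valid.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Set


class Graph:
    """An undirected graph stored as an adjacency list (vertex -> neighbors)."""

    def __init__(self) -> None:
        self._adjacency: DefaultDict[Hashable, Set[Hashable]] = defaultdict(set)

    def add_edge(self, a: Hashable, b: Hashable) -> None:
        """Link two vertices; each becomes a neighbor of the other."""
        self._adjacency[a].add(b)
        self._adjacency[b].add(a)

    def neighbors(self, vertex: Hashable) -> Set[Hashable]:
        """Return a copy of the set of vertices linked to `vertex`."""
        return set(self._adjacency.get(vertex, set()))

    def vertices(self) -> Set[Hashable]:
        """Return every vertex that appears in at least one edge."""
        return set(self._adjacency)


graph = Graph()
graph.add_edge("A", "B")
graph.add_edge("A", "C")
```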
Hashing converts keys into index values within an array. The hash table data structure uses a hash function to map each key to an index that determines where the associated value is stored, which accelerates insertions and searches.
Hash collisions, where two different keys map to the same index, are a common occurrence, especially when hashing very large data stores. Most hash tables include collision resolution, often by storing the keys (or pointers to them) alongside their associated values in a separate list for each index, a technique known as separate chaining.
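To make hashing and collision handling concrete, here is a minimal sketch of a hash table that resolves collisions by keeping a small list of (key, value) pairs in each array slot. The class name `ChainedHashTable`, the default bucket count, and the use of Python's built-in `hash` are all assumptions for this example; production tables also resize as they fill, which is omitted here.

```python
from typing import Any, Hashable, List, Tuple


class ChainedHashTable:
    """A fixed-size hash table; each slot holds a list of (key, value) pairs."""

    def __init__(self, bucket_count: int = 8) -> None:
        self._buckets: List[List[Tuple[Hashable, Any]]] = [
            [] for _ in range(bucket_count)
        ]

    def _index(self, key: Hashable) -> int:
        """Convert a key into an index within the bucket array."""
        return hash(key) % len(self._buckets)

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value, overwriting any existing entry for the same key."""
        bucket = self._buckets[self._index(key)]
        for position, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[position] = (key, value)
                return
        # New key: colliding keys simply share the same bucket list.
        bucket.append((key, value))

    def get(self, key: Hashable) -> Any:
        """Look up a value by comparing keys within the key's bucket."""
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)
```

Storing the key next to each value is what lets `get` tell colliding entries apart.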
Trees are hierarchical data structures, usually pictured top-down, where each node contains a value and references to its child nodes. In any type of tree, no node points back to the root, and no node is referenced by more than one parent.
The top node is called the root. The elements directly underneath are its children; elements that share the same parent are siblings. If there is a level below these children, then the children are also parents of the nodes beneath them. Elements occurring at the bottom of the tree, which have no children, are called leaves.
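The root, child, and leaf vocabulary can be illustrated with a small general-purpose tree. In this sketch the `TreeNode` class, the `leaves` helper, and the directory names are all hypothetical.

```python
from typing import Any, List


class TreeNode:
    """A tree node holding a value and references to its child nodes."""

    def __init__(self, value: Any) -> None:
        self.value = value
        self.children: List["TreeNode"] = []

    def add_child(self, value: Any) -> "TreeNode":
        """Create a child node under this node and return it."""
        child = TreeNode(value)
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        """A leaf is a node with no children."""
        return not self.children


def leaves(node: TreeNode) -> List[Any]:
    """Collect the values of all leaves below (or at) `node`, left to right."""
    if node.is_leaf():
        return [node.value]
    found: List[Any] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# A tiny directory-like hierarchy: "/" is the root, "home" and "etc" are siblings.
root = TreeNode("/")
home = root.add_child("home")
etc = root.add_child("etc")
alice = home.add_child("alice")
```

Here `home` is both a child of the root and the parent of `alice`, while `alice` and `etc` are leaves.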
The purpose of a tree is to store naturally hierarchical information, such as a file system. There are multiple types of trees. A binary tree is a tree data structure where each node has no more than two children, called the left child and the right child. Additional structures include the trie, often used to efficiently store routing tables for high-bandwidth routers, and the Merkle tree, in which each parent node stores a hash of its children. Merkle trees enable NoSQL databases like Apache Cassandra to efficiently verify large data contents.
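The hash-tree idea mentioned above can be sketched as a function that reduces a list of data blocks to a single root hash. This is an illustrative sketch only and does not reflect how Cassandra or any other database implements it; the function name `merkle_root`, the choice of SHA-256, and the convention of duplicating the last hash on odd-sized levels are all assumptions.

```python
import hashlib
from typing import List, Sequence


def sha256(data: bytes) -> bytes:
    """Return the SHA-256 digest of `data`."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: Sequence[bytes]) -> bytes:
    """Compute the root hash of a binary hash tree built over `blocks`."""
    if not blocks:
        raise ValueError("at least one block is required")

    # Leaf level: hash each data block.
    level: List[bytes] = [sha256(block) for block in blocks]

    # Each pass builds the parent level by hashing pairs of child hashes.
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: duplicate the last hash to complete the pair.
            level.append(level[-1])
        level = [
            sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)
        ]
    return level[0]


root_a = merkle_root([b"row-1", b"row-2", b"row-3"])
root_b = merkle_root([b"row-1", b"row-2", b"row-X"])
```

Two copies of a data set can be compared by their root hashes alone: if the roots match, the contents match, and if they differ, the mismatch can be narrowed down by comparing child hashes level by level.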