A number of storage vendors implement storage tiering with automated data migration based upon user-defined policies. The following vendors implement migration in one way or another, and while they differ in subtle ways -- such as the granularity of data being migrated -- they implement the goal of migration with performance and cost in mind.
The following discussion is not intended to argue for or against any vendor, but instead to offer a generalized comparison of each product’s interesting capabilities.
EMC has received a considerable amount of attention since its introduction of FAST, or Fully Automated Storage Tiering. EMC’s target-based solution permits the transparent and policy-driven migration of storage objects between three tiers (representing flash, FC, and SATA).
The initial release of FAST provides LUN-level migration between tiers, and EMC’s upgraded FAST technology (due mid-2010) is expected to add sub-LUN level migration in 1GB chunks of data. The upgrade also includes what’s called the “FAST Cache,” which uses high-performance flash drives to reduce response times.
Compellent’s target-based solution for data migration in a tiered storage environment is called Data Progression. In addition to offering automated migration of data, Compellent supports up to 9 tiers based upon drive type (and rotational speed) and the RAID level implemented in the tier. Compellent even implements what they call “Fast Track” for rotating media, which places the most frequently accessed data within a tier on the faster tracks of each drive. This not only provides migration across tiers, but also optimizes placement within a tier to reduce seek times for the most active data.
The granularity of the blocks migrated within Compellent’s architecture is 512KB by default, but it can be tuned up to 4MB depending upon the particular application.
3PAR implements target-based data migration using what they call Dynamic Optimization within their tiered storage architecture (called Autonomic Storage Tiering). 3PAR’s solution provides both LUN and sub-LUN migration of large blocks across tiers of SSDs, high-performance FC, and enterprise SATA drives.
This includes wide striping of data across SATA drives for the best performance in the capacity tier, and radial placement within the inner and outer tracks of rotating media for optimum performance. Within a tier, 3PAR permits the modification of RAID levels for a given volume for improved protection or performance.
FalconStor provides a network-based solution for data migration within the Network Storage Server (NSS). NSS is a SAN-based virtualization platform that can be inserted into a SAN (FC and iSCSI). FalconStor implements LUN-based data migration (with synchronous block-level mirroring occurring underneath). Migration is started at the administrator’s request, and LUNs remain available while it takes place, so there is no downtime. Further, this mirroring process can occur for legacy LUNs of arbitrary formats.
FalconStor’s NSS also works with their new HyperFS file system, which enables large capacity storage (up to 144PB) and access to billions of files.
Other Data Migration Solutions
While four vendors were discussed in this article, data migration is being implemented by a variety of vendors in their storage products. Examples include IBM, HP, Hitachi, Dot Hill, Pillar, Sun, and Fujitsu.
Data Migration Overview
The massive growth of archive data clearly identifies the need for storage tiering and the dynamic ability to migrate data between tiers for the best cost and performance benefit. From the vendor discussions, it’s easy to see that there’s a considerable amount of innovation occurring around data migration and other increasingly important storage services, because data growth is a problem today and will be a growing one in the future.
While automated storage tiers and data migration are becoming checklist items for storage vendors, research continues to identify new optimizations and benefits while seamlessly integrating with other advanced features. You can read about some of this research in the resources section at the end of this article.
Storage Tiers and Automated Data Migration
As is the case for most technology domains, change is the only constant. The storage ecosystem is a great example, where change is occurring at all levels -- from the individual storage devices to the baseline services and front-end protocols used to manipulate our growing masses of data.
In the following pages I'll explore some of those services and along the way touch on many of the related evolutions and revolutions that are happening today.
From a 30,000-foot view, automated data migration is about optimal placement of data. This optimization matches the current characteristics of the data to the desired characteristics of the storage medium: for example, placing hot data on high-performing solid-state disks (SSDs) and cold data (or archive data) on cheaper storage such as SATA drives.
What makes this a challenge is that both elements have a tendency to change over time. New data tends to be used more often, but over time as it ages, it’s used less frequently. Further, storage mediums continue to evolve and diversify, creating new opportunities for storage tiers and matching data to storage characteristics.
Data migration is also useful in the context of advancing storage systems. Data migration provides the transparent ability to migrate data from an aging storage system to a new storage system, even online while users continue to make use of the storage in motion.
Let’s begin our discussion of storage tiering and data migration with an introduction to modern storage systems and their fundamental characteristics.
Storage subsystems rely on a number of tricks to improve performance, typically based on caching and striping data, but ultimately their performance is a function of the storage medium. For this discussion, we’ll focus on a few drive examples that have different characteristics to later lead us into the purpose behind storage tiers.
Table 1 illustrates these differences in a simple way, focusing on where the category leaders exist (for Solid-State Disks, Fibre Channel, Serial Attached SCSI, and Serial ATA). For example, if the focus is low-cost capacity storage, then SATA is the way to go. On the other hand, if speed is the goal and cost is not an issue, then SSDs are the right solution.
Table 1: Characteristics of Common Drive Types.
The idea exhibited here is that storage mediums present different characteristics that can be exploited for the given data being stored. Let’s explore this topic further using the concept of storage tiering.
Note: Outside of the traditional disk interfaces, there are also storage interconnects that have no native drive interface. Examples include Infiniband, FC over Ethernet (FCoE), and iSCSI (an IP-based storage protocol). As these protocols and interfaces evolve, the solution landscape changes with them (noting that Infiniband and 10GbE create more decision points for storage systems).
The concept of storage tiers is certainly not new; it has existed for some time under the moniker “Hierarchical Storage Management” (or HSM). HSM is defined as a storage technique that provides the capability to move data between high-cost storage elements (such as FC drives contained within enclosures) and low-cost storage elements (such as optical disks).
IBM first implemented the concept in their mainframe computers, and continued to evolve HSM within their AIX operating system.
While the concept may not be new, storage technologies have evolved to make this concept even more important. Recall from Table 1 that current drive technologies and storage protocols and buses are segmenting the storage landscape and providing the means to alter the cost and speed of access to data.
Applying the concept of drive types based on performance vs. cost results in a tiered storage architecture (which is a common approach by numerous vendors, as shown in Figure 1).
Figure 1: Tiered Storage Architecture.
Ideally, we would take all of our data and place it on the fastest storage available (one example of this is the RAMClouds architecture proposed by Stanford University). But cost is a factor: a 1MB file sitting in solid-state storage costs significantly more than the same file sitting on a consumer SATA drive. Next we need to factor in the temperature of the data. If the file is one that we use frequently and require fast access to, then it’s justified to have this file on an SSD. If the file represents old data which we rarely use, then having that file on the cheaper SATA drive is ideal.
The goal then is to place “hot” data on SSDs and “cold” data on less expensive storage to optimize the overall cost of the data (to find an equilibrium of data temperature and $/GB). To meet that goal, we must first identify the temperature of the data.
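As a rough illustration of that equilibrium, the following Python sketch classifies data by access rate and computes the resulting capacity-weighted $/GB. The tier prices, the thresholds, and the names `Extent`, `choose_tier`, and `blended_cost_per_gb` are assumptions made up for this example; they do not come from any vendor.

```python
from dataclasses import dataclass

# Hypothetical per-tier prices in $/GB; illustrative only, not vendor figures.
TIER_COST_PER_GB = {"ssd": 10.0, "fc": 2.0, "sata": 0.5}


@dataclass
class Extent:
    name: str
    size_gb: float
    accesses_per_day: float  # observed access rate, used as the "temperature"


def choose_tier(extent, hot_threshold=100.0, warm_threshold=10.0):
    """Map an extent's access rate to a tier using two assumed thresholds."""
    if extent.accesses_per_day >= hot_threshold:
        return "ssd"
    if extent.accesses_per_day >= warm_threshold:
        return "fc"
    return "sata"


def blended_cost_per_gb(extents):
    """Return the capacity-weighted $/GB after placing each extent."""
    total_gb = sum(e.size_gb for e in extents)
    total_cost = sum(e.size_gb * TIER_COST_PER_GB[choose_tier(e)] for e in extents)
    return total_cost / total_gb


extents = [
    Extent("db-index", 50, 500),        # hot
    Extent("mail-store", 150, 20),      # warm
    Extent("archive-2008", 800, 0.1),   # cold
]
placement = {e.name: choose_tier(e) for e in extents}
cost = blended_cost_per_gb(extents)
print(placement)
print(f"blended cost: ${cost:.2f}/GB")
```

With these assumed prices, the mix works out to $1.20/GB, against $10.00/GB if everything sat on the SSD tier. The thresholds are where a real policy would need tuning.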
Automated Data Migration
As with many complex technologies, there are many ways to implement automated data migration. One common theme is the virtualization of the storage, which creates an abstraction between the user’s view of the storage (the LUN or logical block address, LBA) and the actual storage mapping on disk (the physical block address, PBA).
The ability to automatically and transparently migrate data within a storage system relies on this mapping so that the data can be reconstructed for the user. This reconstruction is embodied within metadata that specifies how data is distributed across the various storage subsystems.
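The mapping idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular array stores its metadata: the `VirtualVolume` class, its two tiers, and its method names are all hypothetical.

```python
class VirtualVolume:
    """Toy logical-to-physical map; all names here are hypothetical."""

    def __init__(self, num_blocks, initial_tier="sata"):
        # Backing store per tier: physical block address -> data
        self.tiers = {"ssd": {}, "sata": {}}
        self._next_pba = {"ssd": 0, "sata": 0}
        # Metadata: logical block address -> (tier, physical block address)
        self.mapping = {}
        for lba in range(num_blocks):
            self.mapping[lba] = (initial_tier, self._allocate(initial_tier))

    def _allocate(self, tier):
        """Reserve the next free physical block on a tier."""
        pba = self._next_pba[tier]
        self._next_pba[tier] += 1
        self.tiers[tier][pba] = b""
        return pba

    def write(self, lba, data):
        tier, pba = self.mapping[lba]
        self.tiers[tier][pba] = data

    def read(self, lba):
        # The caller only ever supplies an LBA; the tier is looked up here.
        tier, pba = self.mapping[lba]
        return self.tiers[tier][pba]

    def migrate(self, lba, dest_tier):
        """Copy one block to another tier, then repoint the metadata."""
        src_tier, src_pba = self.mapping[lba]
        if src_tier == dest_tier:
            return
        dest_pba = self._allocate(dest_tier)
        self.tiers[dest_tier][dest_pba] = self.tiers[src_tier][src_pba]
        self.mapping[lba] = (dest_tier, dest_pba)  # switch the mapping last
        del self.tiers[src_tier][src_pba]          # free the old location


vol = VirtualVolume(num_blocks=4)
vol.write(2, b"hot block")
vol.migrate(2, "ssd")
print(vol.read(2), vol.mapping[2])
```

The user reads block 2 the same way before and after the move; only the metadata entry changes, which is what makes the migration transparent.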
In addition to the various implementation styles (which we’ll explore next), there are a number of trade-offs in what granularity of data is to be migrated (see Figure 2). Each comes with its own advantages and disadvantages. For example, some vendors implement LUN-level migration, which is conceptually simple, but means that all content within a LUN is treated the same way.
Sub-LUN level migration is also implemented, typically moving large chunks of data and, in the extreme case, individual blocks. Sub-LUN level migration has certain advantages, as high-frequency data can be migrated to faster tiers, leaving the other data in the LUN on less expensive tiers of storage.
Sub-LUN level migration also has a cost, as metadata must be managed for the individual blocks of data (and the smaller the chunk size, the less efficient it may ultimately be). Additionally, if the migrated chunks of data are larger than a block, performance gains may be realized in the form of read-ahead (for example, if the blocks within the chunk are logically related).
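The metadata cost is easy to estimate with some arithmetic. The sketch below counts the map entries needed to track a 1TB LUN at several chunk sizes (512KB, 4MB, and 1GB are the granularities quoted for the products in this article; 4KB stands in for block-level tracking). The 16-byte entry size and the helper name `metadata_entries` are assumptions for illustration; real arrays will differ.

```python
def metadata_entries(lun_bytes, chunk_bytes):
    """Number of map entries needed to track a LUN at a given chunk size."""
    return -(-lun_bytes // chunk_bytes)  # ceiling division


KB, MB, GB, TB = 1024, 1024**2, 1024**3, 1024**4
ENTRY_BYTES = 16  # assumed size of one map entry; real arrays will differ

lun = 1 * TB
chunk_sizes = [("4KB", 4 * KB), ("512KB", 512 * KB), ("4MB", 4 * MB), ("1GB", 1 * GB)]
for label, chunk in chunk_sizes:
    n = metadata_entries(lun, chunk)
    print(f"{label:>6}: {n:>11,} entries, {n * ENTRY_BYTES / MB:,.2f} MB of metadata")
```

At 4KB the LUN needs roughly 268 million entries, while at 1GB it needs only 1,024. The finer the chunk, the more precisely hot data can be isolated, and the more metadata must be kept and searched.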
An important characteristic of a solution that incorporates data migration is efficiency. The solution should minimize any impact on storage performance. Other trade-offs include the method by which data is classified, how frequently classification is performed, and the initial placement of data (whether new data is assumed to be hot or cold).
Some implementations, for example, perform data migration as a background process (such as a nightly activity), while others perform this activity in real time. While potentially introducing latency, real-time migration provides the ability to react dynamically to how users access the data.
Figure 2: Levels of Data Migration.
Data migration can be implemented in a number of ways, but they can be categorized into three fundamental architectures: host, network, and target. Let’s begin with a short introduction to these three styles, and then explore some implementations that build on one of these three categories. Figure 3 provides a graphical visualization of these styles.
Host-based implementations integrate the tiering and migration logic into the host servers. While this can be restrictive from the perspective of single-user storage, virtualization has changed this to also support multi-user (multi-VM) configurations.
Operating systems, for example, can integrate this type of functionality into their logical volume managers (such as Linux’s LVM), and hypervisors can incorporate it into their storage stacks. VMware implements this under the product name Storage vMotion, which permits the migration of live (active) virtual machine disks between storage mediums. This is implemented efficiently using changed block tracking to migrate the virtual machine disk in the background, and in the end, suspend the VM for a short time to move any remaining blocks to the destination datastore.
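The general pattern of copying in the background while tracking changed blocks can be sketched as a simple simulation. This is not VMware’s implementation: the function `live_migrate`, its parameters, and the way guest writes are simulated are assumptions made for illustration.

```python
def live_migrate(source, dest, writes_during_pass, max_passes=5, cutover_threshold=2):
    """Copy `source` to `dest` while simulated guest writes keep arriving.

    writes_during_pass: list of dicts {block_index: new_data}, one per copy
    pass, standing in for writes the VM makes while that pass runs.
    Returns the number of blocks copied during the final paused step.
    """
    dirty = set(range(len(source)))  # the first pass copies every block
    for pass_no in range(max_passes):
        to_copy, dirty = dirty, set()
        for idx in to_copy:
            dest[idx] = source[idx]
        # Writes that land during this pass are recorded by the tracker.
        incoming = writes_during_pass[pass_no] if pass_no < len(writes_during_pass) else {}
        for idx, data in incoming.items():
            source[idx] = data
            dirty.add(idx)
        if len(dirty) <= cutover_threshold:
            break  # few enough changed blocks to finish with a short pause
    # "Suspend" the VM: no new writes arrive, so copy the remainder and cut over.
    for idx in dirty:
        dest[idx] = source[idx]
    return len(dirty)


source = [f"v0-{i}" for i in range(8)]
dest = [None] * 8
final = live_migrate(source, dest, [{1: "a", 2: "b", 3: "c"}, {2: "d"}])
print(final, dest == source)
```

Each pass copies only the blocks changed during the previous pass, so the set of remaining blocks shrinks as long as the copy outpaces the write rate. The pause at the end only has to cover whatever is left.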
Network-based implementations place an intermediary into the network between the storage users and the physical storage. This offloads the functionality from the host, but also permits a vendor-agnostic storage backend (storage from multiple vendors). Examples of network-based implementations (for both data migration, and numerous other features) include IBM’s SAN Volume Controller (SVC), HP’s SAN Virtualization Storage Platform (SVSP), and FalconStor’s Network Storage Server (NSS).
Finally, target-based implementations pull the required logic into the storage array itself. Like network-based implementations, the overhead of virtualizing the data is offloaded from the host, creating an abstraction at the target. Once this abstraction is constructed, other advanced features can be implemented, such as data reduction (as the physical placement and format of data is hidden from the host users). Many examples of target-based implementations exist, such as EMC’s FAST, Compellent’s Data Progression, 3PAR’s Dynamic Optimization, and many others.
Figure 3: Implementation Styles.
Data Migration Resources
About the Author
M. Tim Jones is a firmware and product architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His background ranges from the development of software for geosynchronous satellites to the architecture and development of storage and virtualization solutions.