Most IT leaders and many C-suite execs are thinking about, if not already planning and executing, AI-led initiatives. The top three public cloud providers alone offer dozens of AI and machine learning tools, beyond the many open-source technologies that have emerged since the launch of ChatGPT in the fall of 2022.
The potential is huge: the generative AI market is poised to grow to $1.3 trillion over the next 10 years from a market size of just $40 billion in 2022, according to a new report by Bloomberg Intelligence.
Getting AI right relies on quality data—particularly unstructured data. AI success depends upon the appropriate curation and management of this file and object data, which makes up at least 80 percent of all data in the world. This article identifies the challenges of those efforts and offers 10 tips for addressing them.
Managing Unstructured Data and ROT
Unstructured data, given its volume and the many different types of files and formats it comprises—from documents and images to sensor and instrument data, video, and more—is vexing to manage. Often distributed across multiple storage systems in the increasingly hybrid, multi-cloud enterprise, it is hard to search, segment, and move around as needed.
Due to its growth, unstructured data is expensive to store and back up. In fact, a majority (68 percent) of enterprise organizations surveyed in 2022 were spending 30 percent or more of their IT budgets on storage. These issues are compounded in data-intensive industries, where researchers and other teams rarely delete redundant, obsolete, and trivial (ROT) data when projects are completed.
Managing unstructured data for AI requires new solutions and tactics, including a data-centric approach to guide cost-effective storage and data mobility decisions across vendors and clouds.
There’s also a growing need to ensure that the right data sets are leveraged. New research from Stanford found that the performance of large language models (LLMs) “substantially decreases as the input context grows longer, even for explicitly long-context models.” In other words, depending on the project, curating the right data sets may matter more than assembling large ones.
10 Tips for Managing Unstructured Data in Generative AI
Generative AI solutions, guidelines, and practices are changing daily. But establishing a foundation for intelligent unstructured data management can help organizations flex and shift through this transformative era. Here are some tactics to consider.
Start with visibility
Data indexing is a powerful way to categorize all of the unstructured data across the enterprise and make it searchable by key metadata (data on your data) such as file size, file extension, date of file creation, and date of last access. Visibility is foundational for right-placing data to meet changing business needs for archiving, analytics, compliance and so on.
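To make the indexing idea concrete, here is a minimal sketch of the kind of metadata index such a tool builds, using only the Python standard library. The field names are illustrative, not taken from any particular product, and a production indexer would scale out across storage systems rather than walking one directory tree.

```python
from pathlib import Path
from datetime import datetime, timezone

def build_index(root):
    """Walk a directory tree and record basic metadata for each file."""
    index = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            index.append({
                "path": str(path),
                "size_bytes": stat.st_size,
                "extension": path.suffix.lower(),
                # st_mtime used as a portable stand-in for creation time
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
                "last_access": datetime.fromtimestamp(stat.st_atime, tz=timezone.utc),
            })
    return index

def large_files_by_extension(index, extension, min_bytes):
    """Example search: files of a given type over a size threshold, largest first."""
    hits = [e for e in index if e["extension"] == extension and e["size_bytes"] >= min_bytes]
    return sorted(hits, key=lambda e: e["size_bytes"], reverse=True)
```

Once metadata like this is searchable, questions such as “show me every PDF over 10 MB untouched in a year” become one-line queries rather than manual hunts.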
Understand key data characteristics
When laying a foundation for AI, more information is better. The more information you have on your data, the better prepared you’ll be to deliver it to AI and ML tools at the right time, and to ensure you have the right storage infrastructure for these new use cases. At a minimum, you’ll need to understand data volumes and growth rates, storage costs, top data types and sizes, departmental data usage statistics, and “hot” or active versus “cold” or rarely accessed data.
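The hot-versus-cold split in particular drives storage-cost decisions, and it falls out directly from access-time metadata. The sketch below assumes index entries of the hypothetical shape used above (a dict with "size_bytes" and "last_access"); the 365-day cutoff is an illustrative policy, not a standard.

```python
from datetime import datetime, timedelta, timezone

def hot_cold_split(index, cold_after_days=365):
    """Bucket indexed files into 'hot' (recently accessed) and 'cold' bytes,
    and report what share of total capacity is cold and archivable."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=cold_after_days)
    hot = cold = 0
    for entry in index:
        if entry["last_access"] >= cutoff:
            hot += entry["size_bytes"]
        else:
            cold += entry["size_bytes"]
    total = hot + cold
    return {
        "hot_bytes": hot,
        "cold_bytes": cold,
        "cold_share": cold / total if total else 0.0,
    }
```

A report like this is often the first argument for tiering: if 70 percent of capacity hasn’t been touched in a year, it likely belongs on lower-cost storage.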
Tag and segment data
Once you have a base level of understanding about your data assets, you can enrich them with metadata for additional search capabilities. For instance, you may want to search for files containing personally identifiable information (PII) or customer data, intellectual property (IP) data, experiment name, or instrument ID. Those files could be segmented for compliant storage or to feed into an analytics platform.
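As a toy illustration of tag-and-segment, the sketch below scans text for two simple PII patterns and partitions files accordingly. The regexes are deliberately naive stand-ins; real PII detection requires much more robust methods (named-entity recognition, checksum validation, context rules), and the tag names are invented for this example.

```python
import re

# Illustrative patterns only; production PII detection needs far stronger methods.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def tag_text(text):
    """Return the set of PII tag names whose pattern matches the text."""
    return {name for name, pattern in PII_PATTERNS.items() if pattern.search(text)}

def segment(files):
    """Partition {path: text} into sensitive (tagged) and general groups,
    so tagged files can be routed to compliant storage."""
    sensitive, general = {}, {}
    for path, text in files.items():
        tags = tag_text(text)
        (sensitive if tags else general)[path] = tags
    return sensitive, general
```

The same pattern generalizes to business tags like experiment name or instrument ID: enrich the index with tags once, then segment and route files by tag ever after.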
Collaborate with departments
With so many use cases across organizations today for AI and other research, central IT and department IT liaisons need to work together to design data management strategies. This ensures that users have fast access to their most important data but can also access older data archived to low-cost storage when they need it.
Be selective with training data
Don’t give an AI tool more data than is needed to run a query. This reduces leakage and security risks to organizational data, and it may also improve the odds of highly relevant, accurate outcomes.
Segregate sensitive and proprietary data
Security was the top concern for generative AI in a recent Salesforce survey of IT leaders. By moving sensitive corporate data, such as IP, PII, and customer data, into a private, secure domain, you can ensure that employees won’t be able to send it to AI tools. Some organizations are creating their own private LLMs to circumvent this issue altogether, even though this can be expensive and requires specialized skills and infrastructure.
Work closely with vendors
Data provenance and transparency around the training data used in an AI application are critical—data sources in generative AI applications can be obscure, inaccurate, libelous, and unethical, and can contain PII. Non-AI applications are also now incorporating LLMs into their platforms. Find out how vendors are protecting your organization from the risks AI poses to your data and to any external data within their LLMs. Get clear on who’s liable for what when something goes awry, and ask for transparency into the data sources behind the vendor’s LLM.
Create an AI governance plan
If you work in a regulated industry, you’ll need to demonstrate that your organization is complying with data usage rules. A healthcare organization, for instance, would need to verify that no patient PII has been leaked to an AI solution, per HIPAA rules. An AI governance framework should cover privacy, data protection, ethics and more. Create a task force spanning security, legal, HR, data science, and IT leaders. Data management solutions help by providing a means to track and monitor what data moves to AI tools and by whom.
Audit data use in AI
Related to the above, if you choose to share corporate data with a general LLM such as ChatGPT or Bard, it’s important to track the inputs and outputs and who commissioned the project in the event there are issues later. Problems can include inaccurate or erroneous results from bad data, copyright lawsuits from derivative works, or privacy and security violations. Keep in mind that LLMs not only potentially expose your company’s data to the world but the data of other organizations—and your organization could be liable for the exposure or misuse of any third-party data discovered in a derivative work.
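A minimal audit trail can be as simple as an append-only log of who sent what to which tool, and when. The sketch below is one hypothetical shape for such a log; it stores SHA-256 hashes of prompts and responses rather than raw text, so the log itself doesn’t become another copy of sensitive data while still letting you prove later exactly which content was submitted.

```python
import hashlib
from datetime import datetime, timezone

class AIAuditLog:
    """Append-only record of data sent to AI tools: who, which tool, and
    content hashes of the prompt and response for later verification."""

    def __init__(self):
        self.entries = []

    def record(self, user, tool, prompt, response):
        self.entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "tool": tool,
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
        })

    def by_user(self, user):
        """Retrieve every interaction a given user commissioned."""
        return [e for e in self.entries if e["user"] == user]
```

If a copyright or privacy issue surfaces months later, hashes let you confirm or rule out whether a specific document was ever submitted, without the log retaining the document itself.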
Choose the right tools
When your results must be factually accurate and objective, some generative AI tools may not be the best fit. Consider the recent revelations that ChatGPT’s latest version is generating significantly less accurate, lower-quality responses. Traditional machine learning systems may be a better fit when your task requires a deterministic outcome.
Despite the many concerns with AI, and especially generative AI, adoption is already gathering momentum. A survey by Upwork found that 62 percent of midsize companies and 41 percent of large companies are leveraging generative AI technology. Another study found that 72 percent of Fortune 500 leaders said their companies will incorporate generative AI within the next three years to improve employee productivity.
No matter where your organization is on the adoption curve, AI will impact your employees, customers, and product lines sooner rather than later. Be prepared by taking a proactive data management approach that encompasses visibility, analytics, segmentation, and governance so your organization can reap the benefits of AI without bringing the house down.
Krishna Subramanian is COO and President of Komprise.