Big Data can easily get out of control and become a monster that consumes you, instead of the other way around. Here are some Big Data best practices to avoid that mess.
Big Data has the potential to offer remarkable insight, or completely overwhelm you. The choice is yours, based on the decisions you make before one bit of data is ever collected. The chief problem is that Big Data is a technology solution, collected by technology professionals, but the best practices are business processes.
Thanks to an explosion of sources and input devices, more data than ever is being collected. IBM estimates that most U.S. companies have 100TB of data stored, and that the cost of bad data to the U.S. government and businesses is $3.1 trillion per year.
And yet businesses create data lakes or data warehouses and pump them full of data, most of which is never used. Your data lake can quickly become an information cesspool this way.
The most basic problem is that much of this data is handled partially or totally wrong. Data is either collected incorrectly or the means for collecting it is not properly defined. The cause can be anything from improperly defined fields to confusing metric with imperial units. Businesses, clearly, grapple with Big Data.
That’s less of a problem with the regular, routine, small volumes of data used in business databases. To really foul things up you need Big Data, with petabytes of information. As the data scales, so does the potential for gain or for confusion, so getting it right becomes even more important.
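One way to guard against the metric-versus-imperial mix-up is to make every measurement carry its unit and to convert to a single canonical unit at the point of collection. The following is a minimal sketch, not a prescribed method: the `Length` class, the `normalize` function, the record field names (`id`, `length`, `length_unit`) and the list of accepted units are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Conversion factors to one canonical unit (meters). The set of accepted
# units is an illustrative assumption, not a standard.
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


@dataclass(frozen=True)
class Length:
    """A measurement that always carries its unit."""
    value: float
    unit: str

    def to_meters(self) -> float:
        # Reject unknown units loudly instead of guessing.
        if self.unit not in TO_METERS:
            raise ValueError(f"unknown length unit: {self.unit!r}")
        return self.value * TO_METERS[self.unit]


def normalize(record: dict) -> dict:
    """Return a new record with the length stored in meters."""
    length = Length(float(record["length"]), record["length_unit"])
    return {"id": record["id"], "length_m": length.to_meters()}
```

Calling `normalize({"id": 1, "length": "10", "length_unit": "ft"})` yields a record whose `length_m` is about 3.048, while a record with an unrecognized unit raises an error at collection time rather than silently polluting the data lake.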
So what does it mean to ‘get it right’ in Big Data?
Big Data Best Practices: 8 Key Principles
The truth is, the concept of ‘Big Data best practices’ is evolving as the field of data analytics itself is rapidly evolving. Still, businesses need to compete with the best strategies possible. So we’ve distilled some best practices in the hope that you can avoid getting overwhelmed by petabytes of worthless data and drowning in your data lake.
1) Define the Big Data business goals.
IT has a bad habit of being distracted by the shiny new thing, like a Hadoop cluster. Begin your Big Data journey by clearly stating the business goal first. Start by gathering, analyzing and understanding the business requirements. Your project has to have a business goal, not a technology goal.
Understanding the business requirements and goals should be the first and the most important step that you take before you even begin the process of leveraging Big Data analytics. The business users have to make clear their desired outcome and results, otherwise you have no target for which to aim.
This is where management has to take the lead and tech has to follow. If management does not make business goals clear, then you will not gather and create data correctly. Too many organizations collect everything they can and go through it later to weed out what they don’t need. That creates a lot of unnecessary work, which you can avoid by making abundantly clear up front what you do need and not collecting anything else.
2) Assess and strategize with partners.
A Big Data project should not be done in isolation by the IT department. It must involve the data owner, which would be a line of business or department, and possibly an outsider, either a vendor providing Big Data technology to the effort or a consultancy, to bring an outside set of eyes to the organization and evaluate your current situation.
Along the way and throughout the process there should be continuous checking to make sure you are collecting the data you need and it will give you the insights you want, just as a chef checks his or her work throughout the cooking process. Don’t just collect everything and then check after you are done, because if the data is wrong, that means going all the way back to the beginning and starting the process over when you didn’t need to.
By working with those who will benefit from the insights gained from the project, you ensure their involvement along the way, which in turn ensures a successful outcome.
3) Determine what you have and what you need in Big Data.
Lots of data does not equate to good data. You might have the right data mixed in there somewhere, but it will fall to you to find it. The more haphazardly data is collected, the more likely it is to be disorganized and in varying formats.
As important as determining what you have is determining what you don’t have. Once you have collected the data needed for a project, identify what might be missing. Make sure you have everything before you start.
It’s not always possible to know what data fields you need in advance, so make sure to engineer in the flexibility to go back and adjust as you progress. This dovetails with practice number four, continuous communication and assessment.
The bottom line is that sometimes you have to test the data and review the results. You might be surprised to find you are not getting the answers you need. Best to find out before you plunge head first into the project.
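Testing the data can start with something as simple as profiling a sample for gaps before the project is under way. The sketch below is one possible approach under stated assumptions: the function name `missing_rates` and the field names in the usage note are hypothetical, and treating `None` or an empty string as "missing" is an illustrative choice.

```python
from collections import Counter


def missing_rates(records, required_fields):
    """Fraction of records in which each required field is absent or empty."""
    records = list(records)
    missing = Counter()
    for record in records:
        for field in required_fields:
            # Assumption: None and "" both count as missing values.
            if record.get(field) in (None, ""):
                missing[field] += 1
    total = len(records)
    return {
        field: (missing[field] / total if total else 0.0)
        for field in required_fields
    }
```

Run against a small sample with fields such as `customer_id` and `region`, this reports what share of records lacks each one, which is a quick signal of whether the data can answer the business question at all.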
4) Keep continuous communication and assessment going.
Effective collaboration requires ongoing communication between the stakeholders and IT. Goals can change midway through a project, and if that happens, the necessary changes must be communicated to IT. You might need to stop gathering one form of data and start gathering another, and you don’t want to keep collecting the wrong data any longer than you have to.
Draw a clear map that breaks down expected or desired outcomes at certain points. If it’s a 12-month project, check in every three months. This gives you a chance to review and change course if necessary.
5) Start slow, react fast in leveraging Big Data.
Your first Big Data project should not be overly ambitious. Start with a proof of concept or pilot project that’s relatively small and easy to manage. There is a learning curve here and you don’t want to bite off more than you can chew.
Choose an area where you want to improve your business processes, but where the impact won’t be too great if things go wrong. Also, do not force a Big Data approach if the problem does not need it.
You should also use Agile techniques and the iterative approach to implementation. Agile is a means of operation and it is not limited to development. What is Agile development, after all? You write a small piece of code, test it eight ways from Sunday, then add another piece, test thoroughly, rinse, repeat. This is a methodology that can be applied to any process, not just programming.
Use Agile and iterative implementation techniques that deliver quick solutions in short steps based on current needs instead of the all-at-once waterfall approach.
6) Evaluate Big Data technology requirements.
The overwhelming majority of data is unstructured, as high as 90% according to IDC. But you still need to look at where data is coming from to determine the best data store. You have the option of SQL or NoSQL, and many variations of each.
Do you need real-time insight or are you doing after-the-fact evaluations? You might need Apache Spark, whose streaming support handles near-real-time processing, or maybe you can get by with Hadoop MapReduce, which is a batch process. There are also geographically distributed databases, for data split over multiple locations, which may be a requirement for a company with multiple locations and data centers.
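The difference between the two processing styles can be illustrated without any cluster at all. The sketch below is plain Python, not the Spark or Hadoop API; the function names and the `(key, amount)` event shape are hypothetical. It only shows the conceptual contrast: a batch job reads the whole data set and answers once, while a streaming job updates its state and emits a fresh answer as each event arrives.

```python
from collections import defaultdict


def batch_totals(events):
    """Batch style: consume the complete data set, then return one result."""
    totals = defaultdict(float)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


def streaming_totals(events):
    """Streaming style: update running state and emit after every event."""
    totals = defaultdict(float)
    for key, amount in events:
        totals[key] += amount
        yield key, totals[key]
```

If the business only reviews totals the next morning, the batch form is enough. If someone needs to react while events are still arriving, the incremental form is the one to evaluate.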
Also, look at the specific analytics features of each database and see if they apply to you. IBM acquired Netezza, a specialist in high-performance analytics appliances. Teradata and Greenplum have embedded SAS accelerators. Oracle has its own implementation of the R language, which is used in analytics, for its Exadata systems. PostgreSQL has special programming syntax for analytics. So see how each can meet your needs.
7) Align with Big Data in the cloud.
You have to be careful when using the cloud since use is metered, and Big Data means lots of data to be processed. However, the cloud has several advantages. The public cloud can be provisioned and scaled up instantly or at least very quickly, and services like Amazon EMR and Google BigQuery allow for rapid prototyping.
The first advantage, then, is rapidly prototyping your environment. Using a data subset and the many tools offered by cloud providers like Amazon and Microsoft, you can set up a development and test environment in hours and use it as your testing platform. Then, when you have worked out a solid operating model, move it back on premises for production work.
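Prototyping on a data subset works best when the subset is reproducible, so the same records land in the sample every time the test is rerun. One common way to get that is hash-based sampling, sketched below. The function name `in_sample` and the use of a string record ID are assumptions for illustration; any stable identifier would do.

```python
import hashlib


def in_sample(record_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of IDs in the sample."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Filtering a feed with `in_sample(record_id, 10)` keeps about one record in ten, and the same IDs are selected on every run, which keeps cloud costs down while making test results comparable.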
Another advantage of the cloud is that much of the data you collect might already reside there. In that case, you have no reason to move the data on premises. Many databases and Big Data applications support a variety of data sources from both the cloud and on-premises, so if you are collecting data in the cloud, by all means, leave it there.
8) Manage your Big Data experts, as you keep an eye on compliance and access issues.
Big Data is a new, emerging field and not one that lends itself to being self-taught like Python or Java programming. A McKinsey Global Institute study estimated that by 2018 the United States would face a shortage of 140,000 to 190,000 people with the necessary expertise, and a shortage of another 1.5 million managers and analysts with the skills to make decisions based on the results of analytics.
The first thing that must be made clear is who should have access to the data, and how much access each individual should have. Data privacy is a major issue these days, especially with Europe’s very burdensome General Data Protection Regulation (GDPR), enforceable from May 2018, placing heavy restrictions on data use.
Make sure to resolve all data privacy issues and settle who has access to sensitive data. What other governance issues should you be concerned with, such as turnover? Determine what data, if any, can go into the public cloud and what data must remain on-premises, and again, who controls what.
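Decisions about who may read what, and what may leave the premises, are easier to enforce when they are written down as data rather than left as tribal knowledge. The sketch below shows one way that could look. Everything in it is hypothetical: the `DatasetPolicy` fields, the data set names, the role names and the deny-by-default rule are illustrative assumptions, not a compliance framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetPolicy:
    sensitivity: str          # e.g. "internal" or "restricted"
    allowed_roles: frozenset  # roles permitted to read the data set
    cloud_allowed: bool       # may the data be stored in a public cloud?


# Hypothetical policy table; real entries come from the data owners.
POLICIES = {
    "web_clickstream": DatasetPolicy(
        "internal", frozenset({"analyst", "engineer"}), True
    ),
    "customer_pii": DatasetPolicy(
        "restricted", frozenset({"privacy_officer"}), False
    ),
}


def can_read(role: str, dataset: str) -> bool:
    """Deny by default: unknown data sets are readable by no one."""
    policy = POLICIES.get(dataset)
    return policy is not None and role in policy.allowed_roles


def may_store_in_cloud(dataset: str) -> bool:
    """Deny by default: unknown data sets stay on premises."""
    policy = POLICIES.get(dataset)
    return policy is not None and policy.cloud_allowed
```

A table like this also answers the turnover question: when someone leaves, removing their role assignment is enough, and the policy itself does not need to change.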
Finally, while universities are adding curricula for data science, there is no standard for the course loads and each program varies slightly in emphasis and skill sets. So don’t be so quick to hire someone with a Master’s in data science because they might not know the tools you use or the industry you are in. Then again, given the skills shortage, you might need to do exactly this — and be ready to train them in your industry vertical.