Solving the Big Problem of Dirty Data

Written By

Guest Author

Oct 11, 2017

More than any year before it, 2017 has seen big data analytics go mainstream in a wide range of industries. Although it’s been a driving force for nearly a decade in fields like engineering and medicine, big data now drives marketing campaigns, shapes customer relationships and guides many other business operations for an ever-growing list of organizations worldwide.

Of course, many companies are still struggling to realize the value of the data they already possess. One 2016 survey by the Harvard Business Review found that most industries are “nowhere close” to capturing the full potential of their data. This lack of capitalization on data is an increasingly serious problem in a competitive landscape where customers expect consistent, predictive, personalized interactions with the organizations that serve them.

Aside from analytics-related hurdles such as lack of organizational policy and planning, the data itself poses its own difficulties. “Big data is dirty data,” as the saying goes. It’s filled with information that’s out-of-date, incomplete or just plain missing. In order for your organization to be able to act on that dirty data, someone first has to clean it up — and figure out how to derive actionable insights from it.

Here are some pointers for approaching your organization’s data cleanup and aligning your approach to analytics around your overall goals.

Train your algorithms on clusters of anomalies

When you run a quick scan on any large data set, you’ll likely notice groupings of unusual entries that stand out from the overall patterns. Sometimes these anomalies take the form of long lines or blocks of missing data points. Others come from entries logged inconsistently with the rest of the data set, or shifted out of alignment with the rows or columns where they belong.
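
One way to make the "blocks of missing data points" concrete is to scan a column for consecutive runs of missing values. The following is a minimal sketch, not a definitive implementation: the function name `missing_runs`, the `min_length` threshold and the sample readings are all assumptions made for illustration, and it treats both `None` and the empty string as missing.

```python
from itertools import groupby


def missing_runs(values, min_length=3):
    """Return (start_index, length) for each run of consecutive missing values."""
    runs = []
    index = 0
    # groupby splits the column into alternating present/missing stretches.
    for is_missing, group in groupby(values, key=lambda v: v is None or v == ""):
        length = len(list(group))
        if is_missing and length >= min_length:
            runs.append((index, length))
        index += length
    return runs


# Made-up readings; None and "" both stand for a missing entry.
readings = [4.1, 4.3, None, None, None, None, 4.0, "", 3.9, None, None, None]
print(missing_runs(readings))  # [(2, 4), (9, 3)]
```

The single missing entry at index 7 is ignored because it is shorter than `min_length`. Long runs are the ones most likely to point to a gap in collection rather than a one-off slip.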

In all these cases, groups of anomalies tell you something significant about the way the data was gathered and reported. They may indicate that a section of the data set is missing, or that some rows or columns need to be moved into a different alignment, or that a certain abbreviation means the same thing as another one and needs to be found-and-replaced. The more efficiently you can train your algorithms to recognize and fix these issues, the sooner you can move on to the actual analytics — and start doing something useful with that data.
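
The find-and-replace case can be sketched as a lookup table that maps every known variant to one canonical code. The table contents, the name `normalize_state` and the sample entries below are hypothetical; in practice the alias list would be built from the variants you actually find in your data.

```python
# Hypothetical alias table: each known variant maps to one canonical code.
ALIASES = {
    "n.y.": "NY",
    "ny": "NY",
    "new york": "NY",
    "calif.": "CA",
    "ca": "CA",
    "california": "CA",
}


def normalize_state(raw):
    """Map a free-text entry to its canonical code, or return it trimmed if unknown."""
    key = raw.strip().lower()
    return ALIASES.get(key, raw.strip())


entries = [" New York", "N.Y.", "calif.", "CA", "Texas"]
cleaned = [normalize_state(entry) for entry in entries]
print(cleaned)  # ['NY', 'NY', 'CA', 'CA', 'Texas']
```

Unknown values such as "Texas" pass through unchanged rather than being guessed at, so they can be reviewed and added to the table deliberately.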

A growing list of software and software-as-a-service (SaaS) companies offers expertise in pattern recognition and data cleanup. But a far more cost-effective solution is to learn to handle these tasks in-house and go on to apply that learning to other data sets.

Start by training your machine learning algorithms to group similar anomalies together. Use these similarities to look for correlations. And chain those correlations together into meaningful trends.
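
As a starting point, grouping similar anomalies does not have to involve a trained model. The sketch below is a simple rule-based stand-in, not a machine learning algorithm: it gives each record a "signature" describing what is wrong with it, then groups records that share a signature. The field names, the rules and the sample records are all assumptions for illustration. A clustering method could later replace the hand-written rules once you know which features matter.

```python
from collections import defaultdict

REQUIRED = ("customer_id", "country", "amount")  # hypothetical field names


def anomaly_signature(record):
    """Describe what is wrong with a record as a sorted tuple of issue labels."""
    issues = []
    for field in REQUIRED:
        value = record.get(field)
        if value is None or value == "":
            issues.append(f"missing:{field}")
    amount = record.get("amount")
    if isinstance(amount, str) and amount != "":
        issues.append("amount:not_numeric")
    return tuple(sorted(issues))


def group_anomalies(records):
    """Map each non-empty signature to the indexes of the records that share it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        signature = anomaly_signature(record)
        if signature:  # clean records have an empty signature
            groups[signature].append(index)
    return dict(groups)


# Made-up records for illustration only.
records = [
    {"customer_id": 1, "country": "US", "amount": 20.0},
    {"customer_id": 2, "country": "", "amount": 15.5},
    {"customer_id": 3, "country": None, "amount": 9.0},
    {"customer_id": 4, "country": "DE", "amount": "12,50"},
    {"customer_id": None, "country": "", "amount": 7.0},
]

groups = group_anomalies(records)
# Largest groups first: these are the issues worth fixing systematically.
for signature, indexes in sorted(groups.items(), key=lambda item: -len(item[1])):
    print(signature, indexes)
```

Once anomalies are grouped this way, each group can be checked against other attributes (source system, date range, reporting team) to look for the correlations described above.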

Leverage outliers to build better models

Anomaly detection algorithms save time, but it’s equally important to use human intuition to decide when to rely on them and when to step back and ask whether a given deviation might mean something unexpected. Just because your data looks dirty doesn’t always mean it is. Do a little digging of your own, and you may discover patterns that lead to original, actionable insights.

For example, one team of data analysts was combing through data from a luxury hotel chain and discovered what appeared to be a large number of inaccurate entries: dozens of teenagers were reported as staying at high-end hotel properties in a wealthy country in the Middle East. But after some cross-checking, it became clear that these high-income guests were, in fact, exactly who they appeared to be. The analysts had stumbled upon a completely untapped customer demographic, a discovery that inspired a new marketing initiative for the hotel brand.

The moral, of course, is that outliers aren’t necessarily “dirty.” Another well-known example of this is Google’s vast repository of misspelled words and phrases typed into its suite of cloud software. Instead of discarding this mountain of seemingly worthless data, Google has held onto it for decades and has used it to create some of the most accurate spell-check algorithms on earth.

Before you discard any outliers, look for patterns in them and think about how those patterns might be useful in ways you haven’t considered before. You just might make a major breakthrough.
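
A minimal sketch of "look before you discard" is shown below. It flags outliers with the common interquartile-range rule (a technique chosen here for illustration, not one named in the examples above) and then counts the flagged records by another attribute instead of deleting them. The guest records, the `segment` labels and the function name `iqr_outliers` are invented for the example.

```python
from collections import Counter
from statistics import quantiles


def iqr_outliers(records, field, k=1.5):
    """Return the records whose value lies more than k * IQR outside the quartiles."""
    values = [record[field] for record in records]
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in records if r[field] < low or r[field] > high]


# Made-up guest records: (age, segment). Segments are placeholder labels.
pairs = [
    (42, "B"), (45, "A"), (38, "C"), (51, "B"), (47, "A"), (44, "C"),
    (49, "B"), (40, "A"), (46, "C"), (43, "B"), (16, "A"), (17, "A"),
    (15, "A"), (48, "C"), (41, "B"), (50, "C"),
]
guests = [{"age": age, "segment": segment} for age, segment in pairs]

outliers = iqr_outliers(guests, "age")
by_segment = Counter(record["segment"] for record in outliers)

print(sorted(record["age"] for record in outliers))  # [15, 16, 17]
print(by_segment)  # Counter({'A': 3})
```

When the flagged records concentrate in a single segment, as they do in this toy data, that is a hint to cross-check them by hand before treating them as errors.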

Work to build a policy of disciplined data entry

Anomaly detection and outlier identification are indispensable when dealing with second- and third-party data you’ve acquired from partners and vendors. But when it comes to your own first-party data, the most effective way to get a clean, actionable data set is to train your teams to enter data correctly and consistently in the first place.

One of the most impactful ways to enforce disciplined data entry is to standardize the fields and codes used in reporting. Breaking down departmental silos is a major aspect of becoming a data-driven business. Each silo often has its own data standards and formats. As your organization moves toward a more integrated data-sharing structure, emphasize the importance of using the same set of fields, codes and idioms for reporting the same types of information, no matter who’s reporting it.
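
Shared fields and codes are easier to enforce when they are written down in a form software can check at entry time. The sketch below assumes a hypothetical standard with three fields (`region`, `status`, `owner`) and made-up allowed codes; it reports problems rather than silently correcting them.

```python
# Hypothetical shared standard: allowed fields and, where relevant, allowed codes.
SCHEMA = {
    "region": {"NA", "EMEA", "APAC"},
    "status": {"open", "closed"},
    "owner": None,  # free text, but must be present
}


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, allowed in SCHEMA.items():
        if field not in record or record[field] in (None, ""):
            problems.append(f"missing field: {field}")
        elif allowed is not None and record[field] not in allowed:
            problems.append(f"unknown code for {field}: {record[field]!r}")
    for field in record:
        if field not in SCHEMA:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"region": "EMEA", "status": "open", "owner": "sales"}
bad = {"region": "Europe", "status": "open", "rep": "kim"}

print(validate(good))  # []
for problem in validate(bad):
    print(problem)
```

Running the same check in every department’s entry form or import script is one way to keep silos from drifting back to their own formats.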

The only way to know for sure what your data is telling you is to treat each data set as unique and look for unexpected patterns in it. As helpful (and popular) as visuals and data dashboards are, they’re only as effective as the analytical capabilities of the people using them.

And despite the proliferation of companies providing data cleanup services, the fact remains that each organization needs to develop its own data-related goals, as well as its own roadmap for using big data to reach those targets.

Ilan Hertz is head of digital marketing at Sisense, the leader in simplifying business intelligence for complex data. He has close to a decade of experience in applying data-driven methodologies in senior marketing positions in the technology industry.
