by David Linthicum
Years ago I worked at a large company where one of my bosses/mentors used to routinely challenge my articles and papers. This included a paper I wrote about our ability to understand information that’s abstracted from large databases, even though that information does not explicitly exist in a database. That’s “geek speak” for: I can learn things about you that are not necessarily in the data, but determined through patterns and the use of the right analytics. The question from him was: “I know you can, but is it right?”
As data science becomes more of a ‘thing,’ we’re beginning to see this ability, and the new questions it raises, in a big way. The ability to derive personal information from seemingly impersonal information is an increasingly common goal. For example, predictions ranging from fairly to highly accurate can be made about our health status, political beliefs, income level, and other factors by mining the buying patterns on our credit cards. However, mashing our credit card data with our browsing history can reveal even more. Tie that in with GPS data from your phone, and more will be known about you than you know about yourself.
We’re all familiar with the case of a father who found out his daughter was pregnant when flyers advertising baby items arrived after she purchased a pregnancy test, and a few prenatal products, from a large department store. There was no explicit data stating that she was pregnant, but the analytics made a pretty good guess based on patterns in the data.
The analytics, however, are getting more sophisticated. Now that we have access to machine learning and other AI capabilities, the ability to discern the true meaning of data is becoming commonplace. Moreover, we now have the integration capabilities to reach out and include other data sets in our analytics, such as key economic data to mash up against our historical sales data, or petabytes of diagnostic health data and outcomes to determine when and how we’re likely to die.
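The kind of inference described above can be sketched in a few lines: join two innocuous-looking data sets on a shared key, then score weak signals that individually reveal little but together suggest something sensitive. Everything in this sketch, the customer IDs, the signal keywords, and the weights, is hypothetical and illustrative only; real systems use far more sophisticated models.

```python
# Illustrative sketch: inferring a sensitive attribute (here, "likely expecting
# a child") by combining purchase data with browsing history. All data, field
# names, and scoring weights below are hypothetical.

# Hypothetical credit-card purchase categories, keyed by customer ID.
purchases = {
    "cust-001": ["prenatal vitamins", "unscented lotion", "groceries"],
    "cust-002": ["motor oil", "groceries"],
}

# Hypothetical browsing history from a separate source, joined on the same key.
browsing = {
    "cust-001": ["nursery furniture reviews", "baby name lists"],
    "cust-002": ["motorcycle forums"],
}

# Naive weights for signals that, individually, reveal very little.
SIGNALS = {
    "prenatal": 0.4,
    "unscented": 0.2,
    "nursery": 0.3,
    "baby": 0.3,
}

def infer_score(customer_id: str) -> float:
    """Score how strongly the joined data suggests the attribute (0.0 to 1.0)."""
    events = purchases.get(customer_id, []) + browsing.get(customer_id, [])
    score = 0.0
    for event in events:
        for keyword, weight in SIGNALS.items():
            if keyword in event:
                score += weight
    return min(score, 1.0)

print(infer_score("cust-001"))  # high: several weak signals combine
print(infer_score("cust-002"))  # low: no matching signals
```

The point of the sketch is the join: neither data set alone says anything explicit, but once integrated, the pattern emerges — which is exactly the capability the ethical questions in this article are about.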
This is good if you’re in business and want to know the true meaning of gobs of sales and market data you’ve been gathering for years. But it’s not so good if you’re denied life insurance because there are too many pictures of motorcycles on your Facebook page. So, what are the ethics behind data science? Or better put, what ethics should be in place?
An article by Anne Buff, discussing the ethics behind big data analytics and data integration, frames the problem rather well. “There has been a lot of hype around the introduction of social media data and big data to the worlds of data integration and master data management. After all, isn’t more data – capable of helping us identify and understand our customers better – invaluable to the business? Perhaps, but along with its infinite value could come some highly unexpected, extensive costs and liabilities if not handled appropriately.”
Anne points out that, as we move to better data analytical capabilities, we need to understand the laws and regulations that govern how we handle that data. However, we also need to understand the responsibility of data management with ethics in mind. “We are consistently seeing more news reports about brand-damaging situations companies are facing because the ethical implications of their actions were just not considered.”
This goes beyond PR issues, and really comes down to what’s right and what’s wrong. For those of us charged with data management, these questions will become more front and center as the capabilities lead to tempting opportunities that could later be judged in the court of public opinion, or perhaps a court of law. Most often, these will be behind-closed-doors decisions that may never see the light of day, unless you have a Snowden-style whistleblower in your employ.
The lawsuits are beginning to show up on the nation’s dockets. For example, recent disclosures in a California lawsuit raise several red flags about how government data may be used. It calls into question cloud providers with business models that rely heavily upon advertising revenue and monetizing user data.
The lawsuit alleges that Google violated federal and state wiretap and privacy laws by data mining the email content of students who used Google’s Apps for Education and Google’s Gmail messaging service. The larger issue is that public-sector users of certain cloud services, including federal government employees, may not be protected from data mining and user profiling for advertising purposes.
Cloud providers, such as Google, are collecting petabytes of data each week. Today’s big data mining approaches mean those providers have the ability to understand more about you than you think.
So, now that we understand a little about what this data is able to tell us, the question is: What limitations and disclosures should exist to address privacy issues? Moreover, what are the true rights and wrongs when it comes to leveraging the power of data, including deeper and smarter analytical processes and tools that are now available?
This comes back to my boss/mentor, and his comments many years ago. Now that we can do this, should we do this? That’s the newer and more important question to ponder as we put these new capabilities in place.
David S. Linthicum is SVP at Cloud Technology Partners.
Photo courtesy of Shutterstock.