In a wide-ranging conversation with two of data analytics’ top thought leaders, we explored some key questions in analytics today. Our topics included:
1. First, current events: How do you see the current COVID-19 pandemic influencing the data analytics sector and/or the practice of data analytics?
2. An infographic from the Dean of Big Data web site is entitled Data Scientist vs. Business Intelligence. Is one approach better, or does the graphic merely lay out the difference?
3. Why is ‘dark data’ important? What should an effective strategy be toward dark data?
4. I still hear from plenty of executives that companies have numerous struggles around data analytics. Why is data analytics still so hard?
To provide insight into data analytics, I’ll speak with two top industry experts:
Bill Schmarzo, Chief Innovation Officer, Hitachi Vantara
Andi Mann, Chief Technology Advocate, Splunk
How do you see the current pandemic influencing the data analytics sector and the practice of data analytics?
Mann: “Yeah, it’s an interesting one. You’ve gotta get more juice out of the stone, right now. And one of the ways you can do that is with analytics, to try and understand where can you target your resources most effectively during a downturn. There’s a lot of people working from home, there’s a lot of people, who are actually still working just by the way, that’s really important.
“And you think about the other people who are affected in terms of retail, online services, digital services, marketing services. They’re all going flat jack, and one way that they can get better is with analytics, using it for targeted marketing and targeted engagement with customers. Certainly for non-profits and government bodies as well, being able to use data to target services to the people who need them most in the downturn: the people who have lost their jobs, who have lost employment, people who have maybe been experiencing homelessness for the first time or again.
“So using analytics to be able to target that. At Splunk we’re providing data sets and providing analytics to public service bodies. We’re working with universities to try and track spread, we’re working with businesses and governments to try and track growth of COVID-19 and other things. So, analytics is helping people work through coronavirus, but it’s also helping people work against coronavirus.
“Oh yeah, absolutely, because Splunk is a data analytics platform. We don’t create our own data, but we’re getting data from other sources and providing it to various state and federal government agencies so that they can use Splunk to do the analytics on the datasets. Yeah, it’s really powerful.”
Schmarzo: “Well, I love Andi’s quote, ‘Get more juice out of the stone.’ And that’s actually very much gonna be not only the mode of operation during the COVID-19 situation, but even more so afterwards. You think about how much money we are spending, it’s tons and tons of money that we’re spending across the world, we gotta pay that back at some point in time.
“And so I think the fact is that we’re gonna have to use data analytics to do more with less. Andi sort of referred to marketing campaigns. We’re going to have to become very micro-focused on our marketing campaigns and our treatment campaigns. Everything is gonna become highly personalized.
“Think about healthcare. Right now, we make blanket policy decisions about healthcare, we make blanket policy decisions about welfare and care overall. We’re gonna have to get away from that. We have way too much waste in the system. So this idea of getting more out of the stone or, as I would say, ‘Do more with less,’ becoming much more micro-focused, is gonna be great for the world of analytics because we’re very good at leveraging very detailed analytic profiles and digital trends to really understand the unique differences between every customer, every teacher, every student, every device.
“So, I think this whole mindset of ‘Get more out of the stone,’ this ‘Do more with less,’ is gonna be mandated for most organizations because it’s the only way that organizations are gonna be able to transform their economic value curve as we get hit with all this repayment of debt and maybe tough margin pressures, and we’re gonna see increases in taxes, probably dramatic increases in taxes, because there is no free lunch.”
Maguire: Have you heard any anecdotes about how analytics is practiced in this difficult period?
Schmarzo: “Well, the companies in the pharma space are certainly working 24/7. I was on a panel just last week with a machine learning engineer from GlaxoSmithKline and they are just humping like mad, trying to understand this COVID-19. There’s so much we don’t know about it yet.
“We’re data people, and the lack of data that’s out there about this thing is just, it’s tragic. We don’t have enough tests, and then the tests that we do have, we sometimes aren’t even confident in the results. And then we’ve seen… The whole thing that’s going on now is a classic example of how not to do data science. When you have people who are taking small data sets and extrapolating to these great projections, some way over-optimistic, some way over-negative, we just aren’t applying good data science rigor to these problems. And even with small datasets we can be thoughtful but we have to be articulate about what the constraints are and the assumptions we’re making around these datasets.
“I can’t tell you how many times somebody tells me, ‘Oh, the Princess Line numbers show that this thing is not very dangerous at all.’ Well, that’s what, a couple hundred people? That’s not statistically significant. It’s not a random sample. It doesn’t pass any of the basics of what you do in analytics. So part of this, I think, is driving me, at least, nuts, where I see people who really aren’t thinking, who are just taking a small set of numbers and then extrapolating to some extreme side. In many cases they’re doing that just because of their own personal agendas.”
Mann: “I’ll tell you, a lot of the customers that I talk to, their data scientists are being put to some use. Yeah, in healthcare, Bill, you’re actually right, there’s a lot of people working long, long hours doing number crunching just trying to figure out what we do with the virus itself. But there’s also a lot of people who are trying to figure out what to do with their business, in the virus.
“So I’m seeing a lot of people, especially in financial, just trying to understand the business. So, using data science against their business metrics, just to try and understand, as I said before, where to put resources and stuff like that.
“So certainly the other area where I’m seeing a lot of number crunching, by the way, is insurance. There’s gonna be a lot of insurance claims out of this. We’re not sure what yet, it hasn’t all come out. There’s gonna be a lot of challenging times for the insurance industry, so they’ve got a lot of actuarial number crunchers. They’re applying data science to their actuarial practices. There’s a lot of flow-on effects using data, using analytics that I think we’re not always aware of.”
Business Intelligence Expert vs. Data Scientist: Roles and Key Strengths
Schmarzo: “You need both. If you don’t have reports that tell you what’s going on, you don’t know where to focus your resources and your data science effort, so they’re very much complementary. And this infographic probably cost me more of my BI friends than anything else out there because there’s this misconception that data science is BI 3.0.
“They’re very, very different, and it’s like the difference between a shortstop and a catcher, right? You need to have both. You wouldn’t take a shortstop and put him at catcher, and God forbid, put a catcher at shortstop. You wouldn’t do that. So my intent here was to show that the BI people are really working on trying to clearly communicate the metrics and KPIs against which an organization is measuring progress and success.
“The data science, though, is trying to identify those variables and metrics that might be better predictors of performance. And that’s a very highly exploratory route, failure-centric: you’re gonna try things, you’re gonna fail, you’re gonna learn. You can’t measure progress on the data science side by how many hours you put in. You only can really measure how effective you are at building a model if you understand the costs of false positives and false negatives, so it’s really two different worlds.
“One is not better than the other, but we’ve spent a lot of time and I’ve spent a lot of my life on that left-hand side there, and I’ll be honest with you, I had a really hard transition to go from the left side to the right side. I had to unlearn lots of things that I had done in the past and had to accept the fact that my exploration process didn’t start with a schema. That’s how BI people think. You think schema first, and everything falls out.
“In the data science space, it all focuses on really understanding the hypothesis you’re trying to prove, what are the metrics against which you’re trying to measure success and progress, what are the business entities, the stakeholders, and all the things that… It’s just very, very different. By the way, I love this chart, thank you for sharing it.”
Maguire: “It’s interesting, you talk about the difference. And obviously, I think in today’s resumes, everyone who’s a BI expert or a data scientist is probably putting data scientist on their resume because it sounds so much better these days. And I think also it’s interesting you call the data science folks failure-centric, which actually could be really learning. I’m sure a lot of employers out there go, ‘Wait a second, we’re paying this high salary for this individual who is failure-centric? Now I’m nervous.’”
Schmarzo: “But James, if you’re not failing enough it means you’re not trying enough, right? You’re not pushing the edges enough if you’re not failing. Failure is how we learn. And I tell you what, on the BI side, if you build a schema that doesn’t work, that failure is not accepted. You’re looking for a new job, updating your resume. But you’re gonna constantly try different combinations of data and data elements and transformations and enrichments, to try to figure out which of these variables and combinations really do give me a better prediction.”
Mann: “Look, I agree with Bill. I’m going to throw out an international reference for you. It’s as different as an outswinger versus a googly.”
[laughter]
Schmarzo: “Googly? Well, I gotta learn what a googly is.”
Maguire: “I want to find out what an outswinger is.”
Mann: “…exactly what I’m talking about. But no, they are very different sciences. They’re both sciences to a large degree. Business intelligence has grown up with a body of knowledge which is actually really important for how you run your business. Some of the differences I see, I really agree with you, Bill, I love this graphic. By the way, I got lost for about an hour and a half just diving into all your infographics.
“But yeah, there’s some very significant differences…data science is about that innovation process, Bill. You talk about the idea that innovation is about failure. And I absolutely believe that. If you’re not failing, you’re not learning, and you’re definitely not trying enough things. And so being able to get data and understanding… And one of the things that really jumped out at me here, a couple of things… One was asking more questions, as opposed to looking for more answers.
“So data scientists seem to ask a lot of questions, and you ask more questions of your data. Every answer you get is just an opportunity to ask more questions. And so that’s a different way of thinking. It is, I think, a different mindset to think about bringing data from any source to any problem, as opposed to trying to find an answer. So there’s a really fundamental difference in the mindset of how a data scientist looks at an innovation opportunity. Looks at data as never having the final answer, but always posing more questions. Where a business intelligence analyst is going to seek an answer because that’s an important thing that they need, because their business needs to run.
“So this idea of innovation versus running the business. It’s not always that way, it’s not always cut and dry. But that’s one of the biggest differences I see. It’s brought out really well in this infographic in areas like the up-front and carefully-planned versus on-demand enriched with the data sources.
“Because in business intelligence you know what you’re gonna ask, you plan that data set. With data science, you don’t know what you’re gonna ask half of the time, so you need to be able to bring in new data sets, enrich them on the fly. So some of these things you caught there, Bill, really lock in this idea of data science as looking at innovation and questions. And I think that’s a really interesting way of looking at it.”
Schmarzo: “Thank you, thank you. Let me add a couple of things. I think you hit some really good points here, Andi. Number one, the business intelligence analyst is really about understanding what’s happened, and where the areas are. And the data science side is trying to understand why it’s happening. And when you combine those together it becomes powerful.
“The other thing, what I see happen in the business intelligence analyst side is, I’m seeing a maturation of people there around, I’m gonna call it, value engineering. These are people who really understand where and how data and analytics can drive the business. They have a much stronger business acumen flavor, and they’re great at doing value engineering, identifying, validating, prioritizing the source of the value creation.
“And then you combine them with the data science, I mean, that’s a powerful team. So again, it’s a fun slide. By the way, the origin of this slide was a customer, and this was a couple of years ago, who said… Ah, he said, ‘Schmarzo, you smart ass, you think you know so much about both of these sides? You claim you come from the BI side, now you’re on the data science side. What’s the difference?’ It really took me quite a while to really think through how I used to think and approach things, and why that’s important, and how I had to change how I thought about things and why that’s important. And then sort of the realization is, ‘Ah, you need both, you need both of these things.’”
Mann: “Yeah, absolutely. And Bill, I think one of the other things that you just made me think of, in terms of differentiation, is letting the machine do the work to a large degree. So the business analyst… I mean, James, you talked earlier on, is one more important than the other? One thing that a business analyst brings that maybe a data scientist doesn’t always bring is that deep business knowledge. And so understanding their business and using their intelligence to understand what problems they’re trying to solve.
“Whereas a data scientist will often, by virtue of things like you’re using big, massive data sets and stuff like that, will often use machine learning and what passes in this world for AI. ‘Cause we’ve talked before, humans are really bad at seeing patterns, but machines are really good at that. So when you get to huge data sets, then using machine learning becomes almost mandatory to be able to get to insight, whereas a business analyst doesn’t necessarily need machine learning, just needs to get the right data sets, and work them the right way to get the insight they need.”
Schmarzo: “James, maybe you meant to drive this full circle, but what’s interesting about this is that when we think about what’s gonna happen in the COVID world, and particularly the post-COVID world, is that we’re gonna have to be able to use these machines to help us develop very granular insights on every one of our customers, employees, our products, our services, our operations.
“It’s that level of granularity that’s gonna allow us to get more out of the stone, or to do more with less. And BI has traditionally been focused on sort of aggregated data, looking at things at an aggregate level, and making sort of blanket decisions. And that’s not gonna cut it in the post-COVID-19 world when we’re trying to do more with less, and need those machines to tell us the differences between which patients are at risk for what sort of diseases, which students are most at risk of failure, which customers are most at risk of leaving so that… You’ve brought us full circle here, James. I’m not sure if you had that planned or not.”
Why is ‘dark data’ important? What should an effective strategy be toward dark data?
Mann: “This is something we’re really interested in. We’re a company, we deal with data, our customers use Splunk to process their data. It’s a platform for data analytics. And so data is really important, and we had a theory that the more data you use, the better you get at doing business, whatever that might mean. So we worked with ESG, Enterprise Strategy Group, an independent analyst firm, and asked them to verify some of our ideas about this dark data. Find more data, your business will do better, was our essential hypothesis. You know, it proved to be true.
“So we looked at, well we didn’t, ESG, the great analysts there looked at what is it to run a good business. So, they looked at things like revenue and profitability and efficiency and stuff like that. They looked at what does it mean to use and find data.
“So they asked questions around a contribution of IT budget and spend to data analytics, your commitment to uncovering dark data, how effective you are at operationalizing it. And so when we looked at the differences between the cohorts that could use more of their data in their organization versus those who used less and were less committed to it, there were really significant empirical outcomes.
“And so when we talk about these people who use their dark data, all this hidden stuff that’s tucked away in databases or in log streams or edge devices, or all sorts of turbines, production lines, we found that when you uncover more data, you make more money and it costs you less. Like to the tune, statistically, you’re gonna get 5% either way basically. 5% more annual revenue, so topline, 5% from bottom line, cutting 5% from operational cost. That’s great, yeah, doing more with less, Bill, that fits into that.
“But they also are able to get ahead of competitors, two-and-a-half times more likely to develop and launch products than competitors. Also four and a half times more likely to outperform those we call data deliberators, the people that are less mature, over the coming couple of years. Twice as likely to exceed customer retention targets, 10 times more likely to get more than 20% of revenue from new products and services. So data, as we were talking about, data is directly driving innovation. It’s fascinating.”
Maguire: So, this is all about mining unused data, right, but the question is, if it was already unused, how to suddenly find the resources to mine that additional data?
Mann: “We actually work with our customers to do data source assessments. Where is your data, what do you have, what works, what doesn’t. And it’s not that you necessarily have to come to an external agency to deal with. You can put your data scientists on problems like this, uncovering, ’cause as we were talking about before, the data scientist’s role is about uncovering insights that you didn’t already have access to. And so being able to get your data scientists to find your dark data, and start to strategize around how can I make my business better with these unknown unknowns, then, yeah, it’s a different way of looking at the world.”
Schmarzo: “So, on this topic of dark data, you said something really interesting. How do you determine if data is of any value or not? How do you know that you should try to go back and try to find these data sources and bring them in? And what we have found is if you let the use cases drive it, the use cases will help you to differentiate what’s valuable data and what is not valuable. It’ll ultimately help me differentiate what’s signal from noise as well in the data. So a lot of our approach is very use case-centric.
“Pick a use case, understand what you’re trying to do, and then brainstorm what data sources you might wanna go look at. And that includes digging through some old ones. And of course, probably the most relevant example today of use of dark data is what happened with COVID-19 and how Taiwan and South Korea immediately went to the SARS and swine flu data. They brought that data in, they did some projections, right? That was data from 10 or so years ago; it was useless data. Who would need that data anymore, right? But it was very valuable and helped them to really make fine-grain decisions about what they needed to do.
“So, organizations have this wealth of data in their organization, buried in different parts of the organization. What we find the best way to approach it is think about what are the use cases you’re going after, and then bring together all the right different stakeholders to start thinking about what data do we have, what data might we be able to go get, and start that process. And a lot of times we find the business stakeholders, the business analysts, they have a really good feel for what data might be useful. The data scientists will actually tell you which data is useful.”
Even in the current age, why is data analytics still so hard?
Mann: “So I think there’s a lot of reasons. I think it all stems from the concept that humans are not generally really that great at numbers.
“And that’s not to say that some people aren’t great at math, but numbers are a construct, and most people think in visuals. Humans are very visual, especially we use hearing, smell, we use all our senses so well, we’re not that great at numbers, we’re not very good at patterns, we’re awful at patterns.
“We’re not very good at contradictory ideas. So when your data tells you something that you didn’t know, it’s one thing, but when data tells you something that you didn’t believe, that’s very difficult. So a lot of people will throw the data away because it doesn’t substantiate their pre-existing opinions. Interesting when we talk about COVID, the idea of uncovering more data, having more tests, using more data will change the outcomes of these models. Data will set us free, literally, as we get released from quarantine one day.
“And so I think people don’t naturally gravitate to data and analytics. They naturally gravitate to stories and ideas. And so it takes, as I said before, a unique mindset, to be a data scientist. But it also takes a unique ability to compromise and to accept new ideas from data scientists for executives to drive these programs. And these are some uncommon characteristics in humans, unfortunately.”
Schmarzo: “First off, Andi is spot on that humans are really bad at numbers and patterns, and if you need any proof of that just go to Las Vegas.
[chuckle]
“Great hotels and casinos aren’t built there because they give money away. My son likes to say that gambling is a tax for people who are bad at math…
[chuckle]
“The other thing I think you said, James, is they’re looking for magic from their data analytics. The problem, of course, is the word ‘magic.’ There’s nothing magic about data analytics. Data analytics is hard work. There’s nothing magic about what we do in data science, it’s just a lot of hard work. And it’s really about having a process and a mindset. We’re gonna explore lots of different ideas, we’re gonna try out some different things, we’re gonna fail, and we’re gonna keep iterating, keep learning through that process. That’s, and I don’t wanna self-promote, but that’s why a lot of what we do is spend time trying to teach executives how to think like a data scientist.
“We have a whole methodology we take executives through, and I teach this at San Francisco and in Galway in Ireland as well. How do you get business people to think like a data scientist and start to embrace the power of data and analytics? And that really requires them, in many cases, to unlearn things they’ve done, to let go of what they thought was the way things worked, and now be ready to embrace new learning and new processes.”
Mann: “Yeah, that’s so true, Bill. Just to add to that, I think because we’re so bad at numbers, because we’re so bad at process, and by the way, James, to your point earlier on, anyone who can use Excel today is a data scientist.
[chuckle]
“But I think the tool sets are partially at fault too. Because data scientists are hugely intellectual people, they don’t mind using complex and difficult tool sets. But to spread that capability out to other people who may not already be data scientists, I think we, as IT leaders, need to create easier tool sets. I know one thing that we’re doing is letting people plug open-source algorithms into the machine-learning toolkits. Right?
“So you don’t have to be the data scientist to use data science. I think there’s a lot that we can do as leaders in IT and in data, to be able to make data science more accessible.”