Tuesday, December 10, 2024

Optimizing the Value of Streaming Data

Edge computing, IoT and a world consumed by data are creating a challenge for businesses: many are overwhelmed by streaming data. It’s cumbersome (some would say impossible) to handle this continuous torrent of data with legacy Big Data architectures and the cloud. Instead, a new method of “continuous intelligence” is called for.

To address this, in this webcast we discussed:

  • Can value be derived from this “continuous data stream?” How can it be mined for insight?
  • Given the potential value, what’s the best way to mine value from this data stream? That is, to get value from it before it’s analyzed later by data scientists equipped with data analytics software?
  • What are some challenges that companies encounter when they try to get value from their edge computing and/or streaming data?
  • What’s the future of streaming, real-time data, and how can businesses prepare for it? How will this future influence edge computing and/or cloud computing in general?

To provide insight into streaming data, I spoke with a leading expert, Simon Crosby, CTO of Swim.ai.

An Overview: Streaming Data

Crosby: “[In streaming data], the data is infinite but also it’s of short value, short-term value. And so then you have this problem which is that if you store it, it’s probably already out of date. And also you have this desire to move to what we call continuous intelligence, which is the ability to make decisions on the fly, not wait for the next batch-run. Okay? So you have to make decisions from data of limited lifespan, quickly, which means you have to do it all the time. And that’s a big challenge.”

When will streaming data be mainstream?

  • “It is relatively early, and I think of streaming analytics as very much a top-down view, and it’s kinda being sold for some verticals. So for example, if you look at application performance management, it’s pretty good there, right? You can launch all your community gunk in AWS and track it, that’s cool. And there’s a bunch of companies who are solving that vertical problem. In general, the broader problem is much bigger than this.”
  • “So let me give you this really cool example from Dubai where we do smart city work. When a truck with bad braking behavior is approaching an inspector, [it] tells the inspector. This is not analytics in the sense of a top-down view. This is not a city manager saying, ‘How many bad trucks do I have?’ This is a need to respond, in real time, to every inspector in the city for potentially every truck. And so this notion of continuous intelligence is also based on the idea that information is situationally relevant. It’s highly contextually bound, right?”
  • “It’s time-based and geo-based, if it’s real world. And so information streams from sources that are contextually related to one another, and they are probably related in time, doing other things. And as anybody who’s in the data science domain knows, you have to find this stuff out. You have to figure it out, and now we have to figure it out on the fly.”
  • “Okay, so think of two things, like the truck and the inspector. When the truck is in range of the inspector, they link, right? And then the inspector can see where the truck is. And I’m talking digital twins, obviously. And we can then figure out what to do. And so we’re talking about graph structures, not necessarily just big data or NoSQL storage. Graph structures, which are inherently fluid and where the analysis is continuous and on the fly. And that’s kinda what we do, in summary.”
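
To make the "they link when in range" idea concrete, here is a minimal Python sketch. It is not Swim's implementation: the Twin class, the update_links function and the 100-meter range are all hypothetical names and values chosen for illustration. It only shows how a fluid graph can be rebuilt as positions change.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Twin:
    """Hypothetical digital twin: an id, a position, and a set of live links."""
    twin_id: str
    x: float
    y: float
    links: set = field(default_factory=set)


def update_links(inspector: Twin, trucks: list, range_m: float) -> None:
    """Link the inspector to every truck within range_m; unlink the rest."""
    for truck in trucks:
        distance = math.hypot(truck.x - inspector.x, truck.y - inspector.y)
        if distance <= range_m:
            inspector.links.add(truck.twin_id)
            truck.links.add(inspector.twin_id)
        else:
            inspector.links.discard(truck.twin_id)
            truck.links.discard(inspector.twin_id)


inspector = Twin("inspector-1", 0.0, 0.0)
near = Twin("truck-a", 30.0, 40.0)      # 50 m from the inspector
far = Twin("truck-b", 300.0, 400.0)     # 500 m from the inspector
update_links(inspector, [near, far], range_m=100.0)
linked = sorted(inspector.links)

# The far truck drives into range, so the graph changes on the next update.
far.x, far.y = 60.0, 80.0               # now 100 m from the inspector
update_links(inspector, [near, far], range_m=100.0)
linked_after_move = sorted(inspector.links)
```

In this sketch the links are recomputed on every position update, so the graph stays current without anything being written to storage first.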

How To Glean Value from Streaming Data?

  • “Yes, I think that is the problem. And I have some good news, and based on experience, that the problems may not be quite as hard as we think they are.”
  • “Okay. And so let me describe an application, one of the smart city applications, in which intersections in a bunch of US cities predict the future of their behavior two minutes ahead. We are in about 20 cities and if you just go to traffic.swim.ai, you can see Palo Alto. The digital twin of every intersection is predicting its own phases and everything else two minutes out.”
  • “Now, you would think, ‘Wow, that’s pretty hard’ and everything else. But in fact, it isn’t. Okay, every intersection has maybe 100 sensors, and sensors are of three or four types. There are inductive loops, there are pedestrian push buttons and there are lights, and that’s kind of it. And the problem, which is the learn and predict problem, is whenever you want a digital twin of an intersection to predict, which is about once a second, take all your own data and link to all of the neighbors, all your neighboring intersections within a thousand yards, and use their data too. And then continually guess and refine your guesses based on what actually happens in the real world. So we engage in this unsupervised learning algorithm, which is straightforward.”
  • “It matches our intuition in the sense that we think of the traffic around these intersections really just being dependent on the neighborhood. The self-training unsupervised learning algorithms work very well for small numbers of inputs, so I don’t need a huge amount of data science knowledge to go off and do this. So the same code that runs in Palo Alto runs in Las Vegas and Houston and Jacksonville and everywhere else. And I didn’t have to get a data scientist to build me a model for that city.”
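
The "guess, then refine based on what actually happens" loop can be sketched in a few lines of Python. Crosby does not name the algorithm Swim uses, so this example swaps in a simple least-mean-squares update as a stand-in. The OnlinePredictor class, the two-input stream and the learning rate are assumptions made for illustration only.

```python
class OnlinePredictor:
    """Tiny online linear model updated with the least-mean-squares rule."""

    def __init__(self, n_inputs: int, learning_rate: float = 0.1):
        self.weights = [0.0] * n_inputs
        self.learning_rate = learning_rate

    def predict(self, inputs):
        return sum(w * x for w, x in zip(self.weights, inputs))

    def refine(self, inputs, observed):
        """Nudge the weights toward what actually happened; return the error."""
        error = observed - self.predict(inputs)
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, inputs)]
        return error


# Hypothetical stream: inputs are (own loop occupied, neighbor loop occupied).
# In this toy data the value observed later equals the neighbor's reading.
stream = [((1.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
          ((1.0, 1.0), 1.0), ((0.0, 0.0), 0.0)] * 200

model = OnlinePredictor(n_inputs=2)
errors = [abs(model.refine(inputs, observed)) for inputs, observed in stream]
```

No labeled training set is prepared in advance here: each prediction is scored against the value that arrives later, and the same code would run unchanged on a different stream.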

What are some of the biggest problems with mining streaming data? What challenges will companies run into?

  • “So the code to do this, to solve this problem, is very short. It’s a couple of thousand lines of Java, not millions of lines of code. In general, what it requires is a slightly different way of thinking. So the received wisdom today is get a whole bunch of data, put it in your Cloud, or your data lake and then find data scientists and build big models in some framework.”
  • “And our approach is perhaps the exact opposite of that. That is, learn and predict on the fly. And in general, our approach, which I guess is a slightly different way of looking at the world, is one in which algorithms can be adapted to continuously process data, analyze, predict, do whatever.”
  • “Now, if you can deal with this volume of data, there are certain things you have to do. And number one is stateful computing. So Swim OS is a stateful, if you know the computer science world, it’s a stateful implementation of the actor model, where these little things called web agents, which are stateful processes, like Java objects but they’re also actually concurrent objects, they each process their own data and safely represent the memory.”
  • “So we end up building this graph, which is effectively a graph of all these things which link to each other. So an intersection links to all its sensors and it links to its neighbors. And this graph can be fluid and then linking is the process by which we get to see state and compute all the time, okay? So it’s a concurrently executing implementation of this framework.”
  • “Computing in memory is literally a million times faster than going to a database. It’s literally a million times faster.”
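
As a rough illustration of a stateful agent that keeps its state in memory and pushes changes to whatever is linked to it, consider the Python sketch below. It is not the Swim OS API: the Agent class, its mailbox and the agent names are hypothetical, and the sketch runs on a single thread, whereas a real actor runtime would execute agents concurrently.

```python
from collections import deque


class Agent:
    """Hypothetical stateful agent: owns its state, handles one message at a time."""

    def __init__(self, name: str):
        self.name = name
        self.state = {}          # held in memory, no database round trip
        self.mailbox = deque()   # pending (key, value) messages
        self.links = []          # agents that observe this agent's state changes

    def link(self, other: "Agent") -> None:
        self.links.append(other)

    def send(self, key, value) -> None:
        self.mailbox.append((key, value))

    def run(self) -> None:
        """Drain the mailbox and forward each real state change to linked agents."""
        while self.mailbox:
            key, value = self.mailbox.popleft()
            if self.state.get(key) != value:
                self.state[key] = value
                for other in self.links:
                    other.send(f"{self.name}.{key}", value)


sensor = Agent("loop-7")
intersection = Agent("intersection-12")
sensor.link(intersection)

sensor.send("occupied", True)
sensor.send("occupied", True)    # repeated reading, no state change to forward
sensor.run()
forwarded = len(intersection.mailbox)
intersection.run()
```

Because only state changes are forwarded, the intersection agent receives one message for the two identical readings, and its view of the sensor lives in its own memory.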

What’s the biggest challenge, say, sheer number of inputs? Is it managing the system?

  • “Well, I think part of the problem is that we, as an industry, are a little bit confused about it, because we’ve heard about all these wonderful things, like AWS Lambda, or Kafka or Pulsar or whatever. All these wonderful projects. And they’re all adopting a model which is based on the cloud. So the model which has made the cloud so successful, which is REST, stateless computing and databases. So what do you do? You send an event to something which is stateless, and all it can do is put that in a database.”
  • “But you know what? Good luck looking at four petabytes per day. Good luck. Seriously. And you know we’re still pretty early on in this whole process of making everything smart and everything center stage all the time. So people tend to hang on today with this idea that they’ll look at it later. They never do, they never do.”
  • “And so what’s much better to store is something which is a stateful model of the system. So for example, in the traffic scenario, instead of getting voltage fluctuation as a car goes over a loop and saving that, let’s just say I just save the fact that there was a car on the loop. Or instead of getting a register and some of the voltage from a light transition, I’ll just say it was a red light, okay? That’s a factor of ten thousand or more reduction in data volume already. So the idea here is that you’re taking raw data, and this model sensibly and continuously transforms data to state and then state to insights, and then streams those insights.”
  • “So it’s literally this: in the traffic scenario which we are supporting today, the predictions for Palo Alto and all these other cities stream continuously from the cloud to providers of routing apps. So Uber, or whoever, right? They just get predictions of what’s gonna happen in the next two minutes in Palo Alto.”
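
The step from raw data to state can be sketched as a small Python generator. The voltage values, the 1.5 threshold and the assumption that voltage rises when a car is over the loop are all invented for illustration; real loop detectors may behave differently, and this toy data does not reproduce the ten-thousand-fold reduction Crosby describes.

```python
def to_state_changes(samples, threshold):
    """Yield (index, occupied) only when the occupied/free state flips."""
    occupied = None  # unknown until the first sample arrives
    for index, voltage in enumerate(samples):
        now_occupied = voltage >= threshold
        if now_occupied != occupied:
            occupied = now_occupied
            yield index, occupied


# Hypothetical raw readings from one loop sensor.
samples = [0.1, 0.2, 0.1, 2.9, 3.1, 3.0, 3.2, 0.2, 0.1, 0.1]
changes = list(to_state_changes(samples, threshold=1.5))
```

Ten raw samples collapse to three state changes (free, occupied, free), and those changes are what would be kept or streamed onward instead of the voltage trace.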

What’s the Future of Streaming Data?

  • “People have lots of data today, they just don’t know what to do with it, and there are several problems. In general, people who are managing large pieces of real estate, say, I don’t know, an oil rig, want a better oil rig, but they don’t necessarily have the skills, which are cloud-native skills, to go off and build better, to get better insights, and so on. And so the challenge is to get people from what has been a traditional approach, which is just stick everything on a hard disk, into a more cloud-native approach where they can think about using newer technologies and tools to solve their problems and get real-time insights.”
  • “It’s a bit of a journey, but we’re on a path. We definitely cannot go any further down the big data path.”
  • “What I mentioned is that streaming analytics, in my view, is a particular use case. It tends to be top-down, tends to be manager-centric, looking down at all my assets. A key use case is the one where I have to tell every rider, ‘Your bus is about to arrive at this bus station.’ Which is millions of responses delivered in real time. And real time has a real notion here: I’m not allowed to tell somebody to go and get their bus when it’s already left. So the notion of real time here is very closely tied to the evolution of the real world. The bus will come and go, whether I tell the user or not, but I better tell the user in time. In fact, I have to tell every user in the city in the same time frame that it’s gonna work. And so concurrent processing of huge amounts of data is a requirement.”
