Observability is the hot new buzzword in IT Operations, DevOps, Agile, and Site Reliability Engineering (SRE) communities. The concept of observability originally comes from the industrial world, and is defined in Wikipedia as:
“A measure of how well internal states of a system can be inferred from knowledge of its external outputs.”
For example, in a water treatment plant with no instrumentation inside the pipes, a plant operator outside the pipes cannot determine if water is flowing, which way it is flowing, how clean it is, etc. The system lacks observability.
However, by adding flow gauges and quality sensors inside the pipes, connected (by ‘telemetry’) to meters or dashboards outside the pipes, the internal system states (flow speed, water purity, etc.) can be inferred from the external system outputs (meters, dashboards, etc.). The system has observability.
Observability for Software Applications and Services
The same principle can be applied to software. Modern developers are building measurement directly into code, delivering observable status indicators to meters and dashboards outside the application. This allows operations teams (including IT ops, sysadmins, SREs) to, for example:
· Detect, isolate, and alert sooner on critical incidents and events.
· Investigate problem root causes more accurately and efficiently.
· Fix incidents faster with real-time feedback on remediation efforts.
· Conduct more accurate post-incident reviews and post-mortems.
· Better understand problem history to prevent recurrence.
· Close feedback loops with requirements for continuous improvement.
· Use analytics and machine learning to predict and prevent problems.
· And much, much more.
Observability for the Real World
No wonder observability is becoming the norm for cloud-native businesses, which can build and deliver new code unhindered by decades of success and the ‘legacy’ of systems and applications that come with that success.
However, even a large traditional enterprise can build observability into services, even without substantial refactoring. For example:
· With no internal system changes – collect system-level data directly from servers, storage, networks, containers, cloud services, etc. (e.g. entity performance, utilization, capacity).
· With minor configuration changes – deploy collectd to measure and forward infrastructure attributes (e.g. CPU/memory utilization, network performance, storage IOPS).
· With (probably) minor code changes – deploy statsd to collect and forward metrics from inside your application (e.g. transaction response time, volume, errors etc.).
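As a rough illustration of the "minor code changes" case, the sketch below wraps a transaction and emits its response time, volume, and errors in the plain-text StatsD line format (`name:value|type`) over UDP, using only the Python standard library. The metric prefix `shop.checkout`, the helper names, and the host and port (8125 is the conventional StatsD port) are assumptions made for illustration, not part of any particular product.

```python
import socket
import time

def format_metric(name, value, metric_type):
    """Build a StatsD line: <name>:<value>|<type> (c=counter, ms=timer, g=gauge)."""
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric line over UDP; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

def timed_transaction(work, metric_prefix="shop.checkout", send=send_metric):
    """Run work() and emit an error counter on failure, then a volume counter and a timer."""
    start = time.monotonic()
    try:
        return work()
    except Exception:
        # Hypothetical metric name; the exception is still re-raised to the caller.
        send(format_metric(f"{metric_prefix}.errors", 1, "c"))
        raise
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send(format_metric(f"{metric_prefix}.count", 1, "c"))
        send(format_metric(f"{metric_prefix}.response_time", elapsed_ms, "ms"))
```

Because the metrics travel over UDP, the application does not wait for the collector, which is one reason this style of instrumentation tends to need only small code changes.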
Each approach is valuable to varying degrees. Even basic infrastructure metrics will help to detect and triage many problems, allowing IT Operations teams to answer key technology questions, such as:
· What is a normal transaction volume or resource utilization by hour, day, or month?
· Is my application performing correctly for this time of day, day of week, etc.?
· Are the application infrastructure and configuration sufficient for my current load?
· Are there transaction bottlenecks in certain applications that are causing problems?
· Are there services or systems throwing exceptions and errors that I need to fix?
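The first two questions above ("what is normal?" and "is this normal for this time of day?") can be approximated with a simple per-hour baseline. The following is a minimal sketch, assuming historical samples are available as `(hour_of_day, value)` pairs; the function names and the three-standard-deviation tolerance are illustrative assumptions, and production tooling would typically use more robust statistics.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_hourly_baseline(samples):
    """samples: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # stdev needs at least two data points, so hours with fewer are skipped.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_normal(baseline, hour, value, tolerance=3.0):
    """True if value lies within `tolerance` standard deviations of that hour's mean.

    Returns None when no baseline exists for the hour.
    """
    if hour not in baseline:
        return None
    avg, spread = baseline[hour]
    if spread == 0:
        return value == avg
    return abs(value - avg) <= tolerance * spread
```

For example, a baseline built from several weeks of 09:00 transaction counts lets `is_normal(baseline, 9, todays_count)` flag a morning that is unusually quiet or unusually busy.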
However, application activity recorded in a well-structured semantic log opens up observability into higher-order data, allowing multiple stakeholders to also answer key business questions such as:
· How long are purchases taking at different times of day, or days of the week?
· What is my click-through rate, and how does it vary by customer, transaction, product?
· Is my current revenue number normal right now – and if not, what should I do about it?
· Who is my best customer? My worst? Where should I focus my marketing?
· How many purchases are failing, and why? Which customers are affected?
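To make the idea of a well-structured semantic log concrete, here is a minimal sketch in which each purchase is written as one JSON record with named fields, and the last question above (failed purchases, reasons, affected customers) is answered directly from those records. The field names (`customer_id`, `failure_reason`, and so on) and both function names are hypothetical, chosen only for illustration.

```python
import json
from datetime import datetime, timezone

def purchase_event(customer_id, product_id, amount, duration_ms, status, reason=None):
    """Return one purchase as a single-line JSON log record with named fields."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "purchase",
        "customer_id": customer_id,
        "product_id": product_id,
        "amount": amount,
        "duration_ms": duration_ms,
        "status": status,
    }
    if reason is not None:
        event["failure_reason"] = reason
    return json.dumps(event, sort_keys=True)

def failure_summary(log_lines):
    """Count failed purchases by reason and collect the affected customers."""
    reasons, customers = {}, set()
    for line in log_lines:
        record = json.loads(line)
        if record.get("event") == "purchase" and record.get("status") == "failed":
            reason = record.get("failure_reason", "unknown")
            reasons[reason] = reasons.get(reason, 0) + 1
            customers.add(record["customer_id"])
    return reasons, customers
```

Because every record carries the same named fields, the same log can serve both operations questions (duration, errors) and business questions (customer, product, amount) without parsing free-form text.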
From Observation to Action with AIOps
Observability itself is not the end goal. More charts and dashboards will not help your business succeed per se. To be truly meaningful, observability must feed action – such as real-time problem and incident triage, closed DevOps feedback loops, or prescriptive problem prevention.
Typically, this means collecting observability data, correlating it with other monitoring outputs, and processing it with advanced analytics and machine learning, to turn ‘known good’ responses into automated actions. Combining monitoring and observability with advanced data integration, machine learning, predictive analytics, and orchestration capabilities delivers what Gartner calls “Artificial Intelligence for IT Operations,” or “AIOps.”
For example, AIOps solutions will take your raw observability data and make it meaningful and actionable by:
· Integrating it with critical system data from sources such as DCIM and APM tools, HTTP events, API outputs, device data, SNMP traps, and even RMF, SMF, or CICS data.
· Improving ‘signal to noise’ by correlating, analyzing, and filtering these integrated datasets to suppress alert storms or isolate the most notable events.
· Leveraging machine learning and predictive analytics to identify and even correct otherwise hidden anomalies to get ahead of potential problems.
· Triggering automated workflows to find, fix, and prevent both known and novel incidents by executing known solutions, even without human intervention.
· Correlating technology and business insights to enable Product Managers and DevOps teams to iterate on new ideas in real-time to achieve business goals.
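As one small example of improving ‘signal to noise’, the sketch below folds repeated alerts from the same source into a single entry within a time window, which is a very simple form of alert-storm suppression. It assumes alerts arrive as `(timestamp_seconds, source, message)` tuples sorted by time; the function name, tuple layout, and five-minute window are illustrative assumptions, and real AIOps products use far richer correlation than this.

```python
def suppress_alert_storm(alerts, window_seconds=300):
    """Keep the first alert per (source, message) in each window and count the repeats.

    alerts: iterable of (timestamp_seconds, source, message), sorted by timestamp.
    """
    last_kept = {}  # (source, message) -> index of the most recent kept entry
    kept = []
    for ts, source, message in alerts:
        key = (source, message)
        idx = last_kept.get(key)
        if idx is not None and ts - kept[idx]["first_seen"] < window_seconds:
            # Same alert inside the window: fold it into the existing entry.
            kept[idx]["repeats"] += 1
        else:
            last_kept[key] = len(kept)
            kept.append({"first_seen": ts, "source": source,
                         "message": message, "repeats": 0})
    return kept
```

An operator then sees one "disk full" entry with a repeat count instead of dozens of identical pages, while a recurrence after the window closes still surfaces as a new alert.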
Observability as practiced at (and often preached by) cloud-based startups delivering web-based services is an exciting new world of IT management – but to many traditional IT Ops teams, it does not seem achievable. However, any business can and should adopt observability techniques, including large enterprise IT. Especially as a supplement to traditional monitoring, observability changes the game in software service delivery, and moves IT closer to the nirvana of true business-technology alignment.
About the author:
Andi Mann is the Chief Technology Advocate for Splunk.