Infrastructure Monitoring: Challenges and Best Practices

Infrastructure monitoring covers an array of uses and issues, from network optimization to diagnostics to intrusion detection. It can tell you whether a router is down or a server is running at unusual utilization, and with the right security tooling, it can also spot an intruder siphoning off company data.

Infrastructure monitoring enables infrastructure management, but the two are different. Infrastructure monitoring is the constant checking of the various elements of your IT platform, from your in-house data center to your private cloud to your public cloud. Infrastructure management is the set of remedial steps you take in response to the findings from your monitoring tools.

As networks grow more complex, thanks to a widening range of devices and the advent of the hybrid cloud, network integrity becomes more important. You should have infrastructure monitoring even on a basic network inside a small business, but for the complex systems of a large enterprise, it’s a vital component of operation.

Understanding Infrastructure Monitoring

In terms of daily operation, infrastructure monitoring is the deployment of software tools to automatically diagnose performance and availability problems across the entire technology stack to catch issues before they become severe.

By “the entire stack” we mean the hardware, operating system, virtualized environment, network, storage, compute, and applications. Since most large-scale infrastructures span multiple locations and encompass both public and private cloud, it is a greater challenge for IT to get its arms around all of the moving parts, and automation becomes the key.

Because of this complexity, automation is vitally important, for the following reasons:

  • Automation can detect a problem much faster than a human.
  • It can start handling the issue immediately instead of waiting for human intervention.
  • Assuming you’ve programmed the response correctly, automation can reduce errors.
  • It runs 24/7 and needs no sleep, unlike humans.

With automation you can set thresholds on metrics such as server utilization or network bandwidth, and program responses should a metric go above or below its threshold. Should a server freeze or go down, it can be restarted automatically as well.
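A threshold rule of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the limits, and the action labels (`scale_out`, `notify_network_team`) are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    metric: str             # hypothetical metric name, e.g. "cpu_percent"
    low: Optional[float]    # trigger when the reading falls below this (None = no lower bound)
    high: Optional[float]   # trigger when the reading rises above this (None = no upper bound)
    action: str             # hypothetical label for the programmed response


def evaluate(thresholds, readings):
    """Return (metric, action) for every reading outside its allowed band."""
    triggered = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            continue  # no reading for this metric in the current sample
        too_high = t.high is not None and value > t.high
        too_low = t.low is not None and value < t.low
        if too_high or too_low:
            triggered.append((t.metric, t.action))
    return triggered


thresholds = [
    Threshold("cpu_percent", low=None, high=90.0, action="scale_out"),
    Threshold("bandwidth_mbps", low=5.0, high=900.0, action="notify_network_team"),
]
readings = {"cpu_percent": 97.2, "bandwidth_mbps": 450.0}
print(evaluate(thresholds, readings))
```

In a real deployment the action label would map to an automated response, such as restarting a service, rather than being printed.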

Infrastructure monitoring covers adding and removing devices to be monitored, performance and health monitoring, network and data movement monitoring, reporting/logs, and an alert system should an issue arise. These functions usually operate in real time, because an alert about a network bottleneck is pointless six hours after the fact.

Infrastructure monitoring is typically done through a dashboard, which presents all of your information in one place through visual elements such as meters. A dashboard presents a real-time view on one screen and can also generate reports covering a period of time.

Why You Need Infrastructure Monitoring

Infrastructure monitoring gives the manager the data necessary to understand the status of the infrastructure in real time as well as the ability to measure progress towards organizational objectives. Through the continual collection and review of data about the infrastructure, monitoring allows for measuring both the current status and how well the network is progressing.

For example, if management has set a goal of achieving a certain level of network response, monitoring tools can show where the network is in terms of responsiveness and where it has been. They can identify spikes in lag and perhaps their causes as well.

Ensuring the network runs at peak efficiency requires that you not only know what devices make up the IT infrastructure but also keep an eye on the health status and performance of those devices. Proactive analysis of your IT system means that you have a better chance of catching imminent failures before they cause a major disruption.

There are plenty of examples of what can go wrong without adequate infrastructure monitoring. The “zombie server” phenomenon, where a physical server sits idle and no one is using it, is one such example. A 2017 study by The Anthesis Group and a Stanford University researcher found that up to 30% of servers in large data centers were zombies: running and drawing power, but not doing any work. That is a failure of adequate monitoring, because an infrastructure monitor would flag servers that aren’t generating any traffic or using any cycles at all.
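As a rough sketch of how a monitor might flag such servers, the Python function below marks a host as a zombie candidate when every sample in the observation window shows negligible CPU and network activity. The host names, the sample format, and the two cutoff values are assumptions made for illustration, not figures from the study.

```python
def find_zombie_candidates(samples, cpu_ceiling=2.0, net_ceiling_kbps=1.0):
    """samples maps a host name to a list of (cpu_percent, net_kbps) readings.

    A host is a candidate only if ALL of its readings stay at or under both
    ceilings. The ceilings are illustrative defaults, not recommendations.
    """
    candidates = []
    for host, readings in samples.items():
        if not readings:
            continue  # no data is a monitoring gap, not evidence of idleness
        if all(cpu <= cpu_ceiling and net <= net_ceiling_kbps for cpu, net in readings):
            candidates.append(host)
    return sorted(candidates)


samples = {
    "app-01": [(35.0, 820.0), (41.5, 910.0)],   # busy server
    "legacy-07": [(0.4, 0.0), (0.6, 0.2)],      # idle across the whole window
    "batch-03": [(0.5, 0.1), (88.0, 40.0)],     # mostly idle, but does run jobs
    "new-12": [],                               # not reporting yet
}
print(find_zombie_candidates(samples))
```

Requiring every sample to be idle is a deliberately conservative choice: a batch server that works once a week should not be flagged, so the observation window needs to be long enough to cover such cycles.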

Another example is malware infiltration. Over the years there have been stories and research on malware that gets into corporate networks and uses the corporate network to deliver malicious payloads, fire off spam, launch Distributed Denial of Service (DDoS) attacks, or sniff network traffic for useful information. Again, this is where monitoring helps because it would notice an unknown app sending out thousands of emails or talking to a server in Russia.

Infrastructure Monitoring Best Practices

Here are several tips for making the best of your infrastructure monitoring tools:

Prioritize – Determine ahead of time which notifications are most important and rank them in descending order, from “this could cost you your job” for some issues to “here come the emails” for lesser ones.

Create a process for alert resolution – There should be a process for the best and quickest resolution of each type of alert. Again it comes down to priority, from “we need to inform the CEO” to “get an intern to handle it.”
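The two practices above, ranking alerts and attaching a resolution process to each rank, can be sketched together in Python. The severity levels, alert names, and runbook text here are hypothetical placeholders; a real organization would define its own.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower number = higher priority. These levels are illustrative only.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3


# Hypothetical resolution process for each priority level.
RUNBOOK = {
    Severity.CRITICAL: "page the on-call engineer and inform leadership",
    Severity.MAJOR: "open a ticket for the operations team",
    Severity.MINOR: "add to the weekly review queue",
}


def triage(alerts):
    """Sort (name, severity) pairs by priority and attach the agreed response."""
    ordered = sorted(alerts, key=lambda alert: alert[1])
    return [(name, RUNBOOK[severity]) for name, severity in ordered]


incoming = [
    ("disk 80% full on build server", Severity.MINOR),
    ("core router unreachable", Severity.CRITICAL),
    ("database replica lagging", Severity.MAJOR),
]
for name, response in triage(incoming):
    print(f"{name}: {response}")
```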

Buy, don’t build – It’s an age-old debate in IT, build vs. buy. Do you roll your own or risk vendor lock-in? In this case, because of the growing complexity of IT systems, you are better off buying monitoring tools. The good news is there are plenty to choose from.

Test your monitoring and alert system – The first time you see your alert system in action shouldn’t be during an actual emergency, since the system might require some tuning. Dry runs help ensure you can tune it to your needs.

Set up detailed, comprehensive alerts – Wasn’t it frustrating when the “Check Engine” light came on in your car and you had no idea what the problem was? The same applies here. A good alert needs to be comprehensive, detailed, and actionable.
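One way to picture the difference from a bare “Check Engine” light is an alert record that carries what happened, where, and what to do next. The field names and the example values in this Python sketch are assumptions for illustration, not the format of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    host: str              # where the problem is
    metric: str            # what was measured
    value: float           # what the reading was
    threshold: float       # what limit it crossed
    suggested_action: str  # what the responder should do first

    def render(self):
        """Build a message that states the problem and the next step."""
        return (
            f"{self.host}: {self.metric} is {self.value} "
            f"(threshold {self.threshold}). "
            f"Suggested action: {self.suggested_action}"
        )


alert = Alert(
    host="db-02",
    metric="disk_used_percent",
    value=97.5,
    threshold=90.0,
    suggested_action="rotate logs, then extend the data volume",
)
print(alert.render())
```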

Monitor from multiple locations – If you have multiple data centers, monitor them all from each one. If you have three, monitor data centers B and C from A, monitor A and C from B, and so on. Redundancy never hurt anyone.
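The cross-monitoring arrangement described above amounts to a simple mapping from each site to every other site. A minimal Python sketch, using the A, B, and C labels from the example:

```python
def cross_monitor_plan(sites):
    """Map each site to the list of other sites it should monitor."""
    return {observer: [target for target in sites if target != observer]
            for observer in sites}


plan = cross_monitor_plan(["A", "B", "C"])
for observer, targets in plan.items():
    print(f"{observer} monitors {', '.join(targets)}")
```

With three sites, every data center is watched by two others, so the loss of any single site still leaves the remaining ones observed.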

Get help – The vendors of monitoring tools have support staff and consultants to help you. Use them.

Mix your monitoring tools – There are both on-prem and cloud-based tools. Use them both, especially if you have a hybrid cloud environment.

All’s quiet is not always a good sign – Systems fail. That’s inevitable. They choke on bandwidth or suffer an intrusion. Sometimes the monitor misses things. Don’t assume that no alerts for weeks means nothing is wrong. The problem could be with the monitor itself.
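A common guard against a silently failed monitor is a heartbeat check: the monitor records a timestamp at regular intervals, and an independent process raises the alarm if that timestamp gets too old. The Python sketch below assumes timestamps in seconds since the epoch and a hypothetical five-minute limit.

```python
import time


def monitor_is_silent(last_heartbeat, now=None, max_silence_seconds=300.0):
    """Return True if the monitor has not checked in within the allowed window.

    last_heartbeat and now are seconds since the epoch. The 300-second
    default is an illustrative value, not a recommendation.
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat) > max_silence_seconds


# A heartbeat recorded ten minutes ago counts as silence.
ten_minutes_ago = time.time() - 600
print(monitor_is_silent(ten_minutes_ago))
```

The check only helps if it runs somewhere other than the monitor it is watching, for example from a second site.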

Review metrics periodically – Performance metrics are not fire and forget. You might set up CPU thresholds that are too high or network bandwidth alerts too stingy. Metrics should undergo regular review.
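A periodic review can start with something as simple as counting how often a threshold would have fired over recent history. In this Python sketch, the sample data, the 20% cutoff, and the wording of the verdicts are all assumptions chosen for illustration.

```python
def review_threshold(history, threshold, max_breach_ratio=0.2):
    """Classify a threshold by how often past readings exceeded it."""
    if not history:
        return "no data to review"
    breaches = sum(1 for value in history if value > threshold)
    if breaches == 0:
        return "never fires: threshold may be too high"
    if breaches / len(history) > max_breach_ratio:
        return "fires often: threshold may be too strict"
    return "ok"


cpu_history = [42.0, 55.0, 61.0, 48.0, 70.0, 66.0, 52.0, 58.0, 63.0, 49.0]
print(review_threshold(cpu_history, threshold=95.0))
print(review_threshold(cpu_history, threshold=50.0))
print(review_threshold(cpu_history, threshold=68.0))
```

A threshold that never fires is not necessarily wrong, but it deserves a second look alongside one that fires constantly.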

Infrastructure Monitoring Tools

There are two types of monitoring tools: on-premises, locally installed monitoring software and SaaS server monitoring tools that operate from outside your network. Your decision should be based on your business needs, but at this point, there are very few on-prem-only tools left. Pretty much everyone has made the move to the cloud.

CloudRadar: This puts all your servers, hosts, and services in a unified application, and when issues arise, such as outages or capacity or performance problems, the software notifies users via email, SMS, Slack, WhatsApp, Telegram, Pushover or Webhooks.

CA Technologies: CA offers a variety of enterprise-level, full-stack monitoring and management solutions for on-prem and the cloud, including DX Application Performance Management, DX App Experience Analytics, DX Infrastructure Manager and Network Operations and Analytics, to name a few.

VMware vRealize Hyperic: Collects performance data from as many as 50,000 metrics across more than 70 application technologies to monitor any component in your hardware, OS, application and middleware stack.

New Relic: Two apps, APM and Infrastructure, when combined cover system and application performance, both on-prem and in the cloud.

BMC Digital Enterprise Management: DEM is a suite of six solutions for full-stack monitoring, covering IT operations, detection of unauthorized IT activity and use, mainframe maintenance, and app monitoring.

Dynatrace: The company is wholly dedicated to creating monitoring tools for performance management, AI for operations, cloud infrastructure monitoring and digital experience management.

Opsview: Its flagship Monitor product provides a single view into all IT assets and systems as well as cloud-based services.

SolarWinds: Its flagship Server and Application Monitor (SAM) tool lets you monitor the health, availability, and performance of your applications and server infrastructure, both on-prem and in the cloud. SAM supports more than 1,200 application and systems templates and can be extended to monitor any custom or home-grown application.


