Should Servers Be Rebooted?: Page 2

(Page 2 of 2)

Too Many Changes

What we fear is that a large number of changes have been made, possibly many of them undocumented, and that a reboot then fails. At that point, identifying which change is causing the system to fail can be close to impossible: there is no single change to roll back and no known path to recovery.

This is when panic sets in. Of course, a box that is never rebooted intentionally is more likely to reboot unintentionally, meaning that a failed reboot is both more likely to occur and more likely to occur while the system is in active use.

Regular reboots are not intended to reduce the frequency of failed reboots; in fact, they actually increase the occurrence of failures. The purpose is to make those failures easily manageable from a "known change" standpoint and, more importantly, to control when those reboots occur. This helps ensure that they happen at a time when the server is designated as available for maintenance and is designed to be stressed, so that problems are found when they can be mitigated without business impact.

I've heard many a system administrator say that they avoid weekend reboots because they do not want to be stuck working on Sundays when servers fail to come back up. I have been paged many a Sunday morning because of a failed reboot myself, but every time I receive that call I feel a sense of relief.

I know that we just caught an issue at a time when the business is not impacted financially. Had that server not been restarted during off hours, it might not have been discovered to be "unbootable" until it failed during active business hours and caused a loss of revenue.

Thanks to regular weekend reboots, we can catch pending disasters safely and, because we know that we only have one week's worth of changes to investigate, we are routinely able to fix the problems. This allows us to handle servers with generally little effort and with great confidence that we understand what changes were made prior to the failure.
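To make the "one week's worth of changes" idea concrete, here is a minimal sketch in Python. It assumes a change log is kept somewhere as timestamped entries; the `Change` record, the `changes_since_last_good_reboot` function and the sample entries are all hypothetical names and data invented for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical change log."""
    applied_at: datetime
    description: str


def changes_since_last_good_reboot(changes, last_good_reboot):
    """Return the changes applied after the last successful reboot, newest first.

    Anything applied before that reboot has already survived a restart,
    so only the later entries need to be investigated.
    """
    suspects = [c for c in changes if c.applied_at > last_good_reboot]
    return sorted(suspects, key=lambda c: c.applied_at, reverse=True)


if __name__ == "__main__":
    # Illustrative data only: last Sunday's reboot succeeded at 03:00.
    last_good = datetime(2024, 1, 7, 3, 0)
    log = [
        Change(datetime(2024, 1, 3, 10, 0), "rotated TLS certificate"),
        Change(datetime(2024, 1, 9, 14, 0), "installed kernel update"),
        Change(datetime(2024, 1, 11, 9, 0), "edited /etc/fstab"),
    ]
    for change in changes_since_last_good_reboot(log, last_good):
        print(change.applied_at.isoformat(), change.description)
```

With weekly reboots the list of suspects stays short. On a server that has not been restarted in a year, the same filter would return a year of changes, which is exactly the situation described above.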

Regular reboots are about protecting the business from outages and downtime that can be mitigated through very simple and reliable processes.


Tags: datacenter, Servers & Services, hardware, servers
