Red lights of impending doom

June 22nd, 2009 by Ian

I walked in this morning and got logged in.  Shortly after that, a coworker came by to tell me the server room was really loud.  I wandered down and heard the server fans screaming away before I even got to the door.  As soon as I opened the door I was blasted by a wave of heat as hot as some of our midsummer heat waves here in eastern PA.  At first, I felt like I was walking into some crazy bizarro server room.  Every server was flashing amber or red and there was a slight odor of burning electronics.

I quickly realized the 1-year-old AC unit had crapped out. The server room has two doors, so I opened both and grabbed a few box fans to start moving air through the room. I then started to shut down most of the servers and a large chunk of the network gear. Somewhere during that, I called our facilities department and alerted them to the situation. They worked on the AC unit and determined that a fuse had blown; it has been replaced and the unit is back on. The server room is still cooling down, and I have slowly been bringing servers and other equipment back online.

There are a few issues here. Ultimately, things break, and we need some control over the environment in that room or, at the very least, a way to monitor it. First of all, I should have an independent temperature sensor that alerts me when it gets too hot in there. If someone hadn’t alerted me to the problem when they did, there could have been a shitload of issues once equipment started to fail.

The other issue is that I’ve been toying around with network monitoring systems for a long time, but I’ve never been able to settle on one. I’ve mucked around with Zenoss and Nagios but never stuck with either. I need to just find a proper system, knuckle down, and get it configured.

There are lessons to be learned from this morning!


8 comments

  1. Chris says:

    There are a couple of things you could do. Is your AC unit IP-capable? If so, you can monitor SNMP traps or just poll something that tells you whether it’s functioning properly. I have a piece of freeware that monitors hard drive temperature, which I assume rose on your servers. This freeware app also sends an e-mail whenever the temperature exceeds a certain threshold.

  2. Matt Simmons says:

    Yikes! Environmental monitoring is your friend!

    Hope everything turns out alright

  3. Legooolas says:

    I don’t think it matters too much what you pick, but I went with Nagios as there are lots of plugins, more are easy to write, and graphing is straightforward for pretty much everything (with PNP4Nagios).

    Lots of user-generated plugins (e.g., http://www.monitoringexchange.org )

    There are a lot of people using it so there are lots of pages on how to get particular things working, which is also nice.

    I’ve not found anything that I haven’t been able to monitor with it which I have wanted/needed to — just crank out simple plugins in whatever language you like and add them in :)
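
As a rough illustration of how small such a plugin can be, here is a minimal sketch of a temperature check. It follows the usual Nagios plugin convention of one status line plus an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. The function names, the 30/35 degree thresholds and the `read_temperature` placeholder are all assumptions made for the example, not anything from the post or from Nagios itself.

```python
import sys

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(temp_c, warn_c=30.0, crit_c=35.0):
    """Return (exit_code, status_line) for a reading in degrees Celsius.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if temp_c is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if temp_c >= crit_c:
        code = CRITICAL
    elif temp_c >= warn_c:
        code = WARNING
    else:
        code = OK
    # Everything after "|" is performance data that graphing add-ons can read.
    perfdata = f"temp={temp_c:.1f};{warn_c:.1f};{crit_c:.1f}"
    return code, f"TEMP {LABELS[code]} - {temp_c:.1f} C | {perfdata}"


def read_temperature():
    """Hypothetical placeholder: swap in a real sensor read here."""
    return 27.0


def main():
    code, line = check_temperature(read_temperature())
    print(line)
    sys.exit(code)
```

To run it as a plugin you would call `main()` at the bottom of the script and point a Nagios command definition at the file; how the reading is actually obtained is left open here.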

  4. Jamie Isaacs says:

    What about OpenIPMI? Then use Cacti to monitor temperature over time and use Nagios or another SNMP monitoring tool to send alerts when the temperature nears unsafe temperature thresholds.

    We use Dell rack servers, and some of the older models don’t have the nice BMC interface that you can monitor directly on the NIC over UDP. This caused me to have to monitor temperature using the ipmitool command directly on each server…

    ipmitool sdr type temp
    Temp | 02h | ok | 3.2 | 50 degrees C
    Temp | 05h | ok | 10.1 | 40 degrees C
    Ambient Temp | 08h | ok | 7.1 | 27 degrees C

    If there is any interest, I can write up an article about how to accomplish this using OpenIPMI, SNMP (pass_persist), cron, and Cacti.
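
One possible way to turn output like Jamie's into numbers a script can act on is sketched below. It assumes the five pipe-separated columns shown in the comment and readings of the form "50 degrees C"; other hardware may print something different, so treat the parsing as an assumption. The function names and the sample limit are made up for the example.

```python
import subprocess


def parse_sdr_temps(text):
    """Parse `ipmitool sdr type temp` output into (name, status, degrees_c) tuples."""
    readings = []
    for raw in text.splitlines():
        fields = [field.strip() for field in raw.split("|")]
        if len(fields) != 5:
            continue  # skip blank lines and anything with an unexpected shape
        name, _sensor_id, status, _entity, reading = fields
        parts = reading.split()
        # Keep only rows shaped like "50 degrees C"; skip e.g. "no reading".
        if len(parts) == 3 and parts[1:] == ["degrees", "C"]:
            try:
                readings.append((name, status, float(parts[0])))
            except ValueError:
                continue
    return readings


def over_limit(readings, limit_c):
    """Return the readings at or above limit_c."""
    return [r for r in readings if r[2] >= limit_c]


def read_local_temps():
    """Run ipmitool on this machine; needs ipmitool installed and usually root."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "temp"],
        capture_output=True, text=True, check=True,
    )
    return parse_sdr_temps(result.stdout)


# The sample output quoted in the comment above.
SAMPLE = """\
Temp | 02h | ok | 3.2 | 50 degrees C
Temp | 05h | ok | 10.1 | 40 degrees C
Ambient Temp | 08h | ok | 7.1 | 27 degrees C
"""
```

Calling `over_limit(parse_sdr_temps(SAMPLE), 45)` picks out only the 50 degree sensor. A cron job could call `read_local_temps()` the same way and hand anything over the limit to whatever alerting is in place.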

  5. There are probably existing temperature monitors in your equipment you could use for free. For example you can grab the temperature from the hard drive in a machine via SMART. A small shell script could poll that value and send an alert if it got too high.

    I’m not advocating an ad-hoc setup like that in lieu of comprehensive monitoring, but as a first step it’s better than nothing, and essentially free.
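
The polling idea in this comment could look roughly like the sketch below, written in Python rather than shell. It assumes an ATA drive that reports SMART attribute 194 (Temperature_Celsius) in the usual ten-column `smartctl -A` table; not every drive does. The device path, the e-mail addresses and the local mail relay are placeholders invented for the example.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def parse_smart_temp(text):
    """Return the raw value of SMART attribute 194, or None if it is absent."""
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) >= 10 and fields[0] == "194":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_drive_temp(device="/dev/sda"):
    """Run smartctl against a device (assumed path); usually requires root."""
    # smartctl encodes drive status bits in its exit code, so it is not checked.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return parse_smart_temp(result.stdout)


def send_alert(temp_c, limit_c, to_addr="admin@example.com"):
    """Mail an alert through a relay on localhost (assumed to exist)."""
    msg = EmailMessage()
    msg["Subject"] = f"Drive temperature {temp_c} C exceeds {limit_c} C"
    msg["From"] = "monitor@example.com"
    msg["To"] = to_addr
    msg.set_content("Check the server room cooling.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```

Run from cron every few minutes, a wrapper would call `read_drive_temp()` and then `send_alert()` when the value is above a limit you choose. Keep in mind that drive temperature lags behind room temperature, so this is a backstop rather than a real room sensor.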

  6. Ian says:

    Thanks for the suggestions all. This episode really showed why I need to have properly monitored systems. I’m going to give Nagios another go ASAP. Also, I ordered an environmental monitoring box from Blackbox with temperature and humidity sensors which I will be installing in addition to the Nagios setup.

    We ended up losing a server in the aftermath with some critical data on it. Thankfully I had multiple backups and we were able to build a new box and restore the data.

  7. Matt Pson says:

    It’s the right way to go, imo. The last time we wanted to be on top of things, we got some stand-alone monitoring devices that speak SNMP ( like http://www.akcpinc.com/company/products.htm ) and combined them with Nagios for alerts and Cacti for statistics, plus a little box capable of sending SMS. It’s a cheap and simple way to avoid unnecessary trouble, and we sysadmins like to live untroubled lives, right?
