Tech/Horsepower

How are your servers today? Three things your network engineering department needs.

by Chris on Apr.09, 2008, under Tech

Previously posted and imported from elsewhere

Whether you’re responsible for one website or an entire room full of servers, your job as a network engineer (or administrator, analyst, operator, whatever) is to keep it up, and to know when it’s down so you can get it back up. You should always strive for 100% uptime, but you need to be prepared: you will experience downtime at some point. These three items will ensure you know about (and are in the process of fixing) a problem before your clients do.

1) A prioritized monitoring plan.

Critical systems: Anything that should be up 24/7, such as a website, server, or critical VPN tunnel, is hard down. You need to know about this stuff immediately, so you should be checking it every minute.

Non-critical systems: These can be addressed during business hours. A server running low on disk space, or CPU at 100% because a backup is running? Your piecemeal desktop running eight virtual machines is down? Yes, they’re important. But while these alerts look cool when you’re hangin’ at a LAN party and your phone is going nuts (I can hear you now: “Nah man, it’s cool, just my monitoring server…”) or on some management report, you don’t need to be disturbed at 4AM for them.

Why separate your notifications? Your biggest enemy in emergency preparedness is you. If you constantly get text messages for everything possible, it’s no wonder you sleep through them. When you know that any text message you get may signal a critical problem, you’ll be less likely to ignore it.
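The split above can be written down as a tiny plan. This is only a sketch: the one-minute interval comes from the text, while the 15-minute interval for non-critical checks, the channel names, and the `Check` and `plan` names are assumptions made up for illustration.

```python
from dataclasses import dataclass

CRITICAL_INTERVAL_S = 60            # "every minute", as described above
NON_CRITICAL_INTERVAL_S = 15 * 60   # assumed value; pick what suits you


@dataclass
class Check:
    name: str
    critical: bool


def plan(check):
    """Return (check interval in seconds, notification channels)."""
    if check.critical:
        # Critical: check often and send a text, at any hour.
        return CRITICAL_INTERVAL_S, ["sms"]
    # Non-critical: check less often and send email that can wait
    # until business hours.
    return NON_CRITICAL_INTERVAL_S, ["email"]
```

For example, `plan(Check("public website", True))` puts the site on the one-minute, text-message tier, while a disk-space check lands on the email tier. The point is that a text message only ever means one thing.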

2) Great, unbiased monitoring.

While there are lots of internal monitoring packages, we like NAGIOS. It’s a bit of a bear to set up if you’re not a Linux’ite, but it provides nearly unlimited monitoring capability on many levels across almost all operating systems. From simple PING tests and HTTP checks to custom integrations, NAGIOS can do it with a little patience, and best of all it’s completely free.
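To give a feel for what a custom check involves, here is a minimal sketch of an HTTP check in the style of a NAGIOS plugin. NAGIOS plugins report state through their exit code (0 for OK, 1 for WARNING, 2 for CRITICAL) plus one line of text. The thresholds, the function names, and the choice to treat 4xx responses as warnings are assumptions for illustration, not anything NAGIOS prescribes.

```python
import time
import urllib.error
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # standard NAGIOS plugin exit codes


def classify(status, elapsed, warn_seconds=2.0):
    """Map an HTTP status (None = no response) and a response time
    in seconds to an (exit code, message) pair."""
    if status is None:
        return CRITICAL, "HTTP CRITICAL - no response"
    if status >= 500:
        return CRITICAL, f"HTTP CRITICAL - status {status}"
    if status >= 400 or elapsed > warn_seconds:
        return WARNING, f"HTTP WARNING - status {status} in {elapsed:.2f}s"
    return OK, f"HTTP OK - status {status} in {elapsed:.2f}s"


def check_http(url, timeout=10.0):
    """Fetch url once and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, but with an error status
    except OSError:
        status = None       # DNS failure, refused connection, timeout, ...
    return classify(status, time.monotonic() - start)
```

A wrapper script would call `check_http` with your URL, print the message, and exit with the returned code, which is how NAGIOS learns the state of the check.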

You should be monitoring externally too. This is especially true if you host websites or other external services. I spent a few months testing lots of external monitoring and uptime companies and chose Pingdom. They proved to be the most reliable, use many monitoring sites across the world, and had zero false alarms during the testing period, a record that competitors at four times their price cannot claim. They also provide a cool API for integrating uptime stats into your webpage, and I can’t wait to turn the developers loose with that in the future.

Using both external and internal monitoring will provide a comfortable level of redundancy. Even if you just use Pingdom to check and make sure your NAGIOS server is up, have an external monitor! If you’re only using NAGIOS and the internet connection it uses to send you notifications (and host all of your websites) goes down, no notification can get out and you’ve effectively just shot yourself in the foot.

Further words about external monitoring: You may be tempted to put your own NAGIOS box at your house on your cable or DSL connection for your external monitoring. Don’t. Sign up for Pingdom or another multiple-site unbiased monitoring service. We found that even with a business-class DSL connection and static IP address, notifications could be spotty due to the lackadaisical support of email pathways by most ISPs. Many short, identical-looking emails with lots of IP addresses and time stamps in them look an awful lot like spam. Not to mention that internet connections at most homes aren’t that reliable, which brings us back to point #1 above: if you’re constantly getting false notifications, that’s the same as getting meaningless ones.

3) Make your notifications annoying!

You’ve got a plan and your notifications come from both the inside and outside, great! If you’ve got a cell phone specifically for work that you never text on, getting simple text messages to this device and setting a nice, annoying and LOUD sound for them may be enough. If you’re like many engineers out there (including me) who blend their professional and personal lives and every text message may not be an emergency (or you can sleep through a text message notification), you may need a little more than that.

The single biggest improvement I have made to my network engineering operations is turning an emergency notification into a phone call.

That, my friend, is a bold statement. Enter eNotifyMe. This service is a gem, a true diamond-in-the-rough of the internet. Their single biggest feature is turning any email into a phone call. So now not only do your emergency notifications come as a text message, they come as a phone call. There has not been a better way to add urgency than this. It’s easy to not hear a few text messages, but it is not easy to miss a few phone calls. Besides this great feature, eNotifyMe provides a ridiculous number of triggers, including AND/ALL/OR conditions, and schedules for notifications to phone, text message, and more.

Excellent example: One of my clients has a phone server that can send all of its built-in problem notifications to only one email address. This is an issue for two reasons. First, not all of the notifications it sends are critical. In fact, 98% of what it sends is decidedly non-critical. Second, because it can send to only one address, I would have to set up a distribution group on our end to get the notifications where they need to go, and I still could not filter critical from non-critical. Using eNotifyMe I can filter the notifications and send them to the appropriate notification addresses (email for non-critical, phone/text/pager for critical). How cool is that!
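The filtering idea can be sketched in a few lines. This is not how eNotifyMe works internally (I have no idea how they implement it), and the keywords below are hypothetical, since a real phone server’s subject lines will differ. It only illustrates AND/OR-style rules deciding where a single stream of notifications ends up.

```python
# Hypothetical rules: a subject is critical if ANY rule matches.
CRITICAL_RULES = [
    {"all": ["trunk", "down"]},             # AND: every word must appear
    {"any": ["power failure", "offline"]},  # OR: one match is enough
]


def matches(rule, subject):
    """True if the subject satisfies the rule's 'all' and 'any' parts."""
    text = subject.lower()
    all_ok = all(word in text for word in rule.get("all", []))
    any_words = rule.get("any", [])
    any_ok = not any_words or any(word in text for word in any_words)
    return all_ok and any_ok


def destinations(subject):
    """Route one notification by its subject line."""
    if any(matches(rule, subject) for rule in CRITICAL_RULES):
        return ["phone", "sms", "pager"]  # critical
    return ["email"]                      # everything else
```

With these made-up rules, a subject like “Trunk 3 is DOWN” goes to phone, text, and pager, while “Voicemail box 80% full” quietly goes to email.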

Did I just say pager? Yep. Because it is so critical that our systems stay up, I choose to have a redundant notification device for our engineers as well. Any text message that is sent to a cell phone is also sent to our pager. If a cell phone malfunctions, has a dead battery, or gets run over by a truck, we’ll still know if our systems are down. A pager can be cheap insurance for your department (about $40 a quarter per pager), and while they’re not foolproof because service can be spotty outside of metro areas, they can be a lifeline in case of emergency.

Finally, and probably most importantly, add your client-facing staff to critical notifications. They don’t need phone calls, but make sure they’re getting emails and text messages. Chances are, if there is an emergency, your clients are going to call who they know to get the skinny, and that’s not you. They’ll pick up the phone and dial the cell of their sales representative way before they dial some 800 number into a support department. If your sales staff knows there is a problem and can soothe a client immediately, you avoid making the situation worse with everyone trying to get hold of you while you have 20 notifications going off in your face. Not to mention that your excellent preparedness and unified response will probably put you well ahead of your competition.
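Put together, the fan-out for a critical alert might look something like the sketch below. The channels per role follow the text (engineers get the call, the text, and the page; client-facing staff get email and text only). The roster and all names are hypothetical.

```python
# Channels per role for a critical alert.
CRITICAL_CHANNELS_BY_ROLE = {
    "engineer": ("phone", "sms", "pager"),
    "client_facing": ("email", "sms"),  # no phone calls for sales staff
}

# Hypothetical roster of (name, role) pairs.
STAFF = [
    ("alice", "engineer"),
    ("bob", "client_facing"),
]


def critical_fanout(staff):
    """Return the (person, channel) pairs to notify for a critical alert."""
    return [
        (name, channel)
        for name, role in staff
        for channel in CRITICAL_CHANNELS_BY_ROLE[role]
    ]
```

Calling `critical_fanout(STAFF)` yields one entry per person and channel, so a dead cell phone still leaves the pager, and sales hears about the outage at the same time you do.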

In a perfect world, you’ll never need any of this and it’ll all be a questionable line-item on a budget spreadsheet. The day you do need it (and you will), you’ll be glad you have it. Even at 4AM on a Saturday.
