SYSTEM MONITORING WITH HOBBIT

Tracy Di Marco White

Tracy will discuss system monitoring with Hobbit, demonstrating how ITS staff can use the Hobbit monitoring system now in place. In addition, she will show how it can work for other sites that wish to install it, and the benefits other ISU departments have seen from using Hobbit. Questions are encouraged.

(This is an overview of the talk... not all links are accessible to everyone.)

--------------

First, let's mention why we care about monitoring our systems.

Basic tenets of monitoring

1) If you're not monitoring a service, that service is not in production.
2) If the staff monitoring a service don't know what to do in response to what the monitoring software tells them, you're not providing that service at a production level.

ITS Monitoring Background

Way back when, AIT started using Big Brother (BB) for monitoring. We chose BB because we had a lot of different operating systems and services to monitor, and Big Brother was entirely shell and C based, and very portable. Also, it was free for our non-commercial use. There was a reasonable Windows client, and we could even gather some data from Novell servers. There was mailing list support, and an archive of 3rd party addons to do almost any monitoring check we could want. And it was easy to write our own monitoring checks.

Big Brother's license was what they called "Better than Free", which meant that the authors required payment for commercial use. Eventually the authors went to work for a company selling monitoring software, and took the commercial version of BB with them. Development on the freely available version of BB slowed down significantly, and a lot of the plugins were not being updated to more recent versions of the underlying software they used, which made upgrading BB more annoying. So annoying, in fact, that I went looking for something I could use instead. I found Hobbit, which has developed from being just a BB plugin to being a BB replacement.
It had trending built in, rather than as an external plugin, so it just worked. The largest and most important part was that it was compatible with BB, so we wouldn't have to install new clients on every machine where we already had monitoring working well, and all the plugins we had developed ourselves for BB still worked. I also had Hobbit up and working in under a weekend, as filler while waiting for other things I was working on to complete.

Hobbit's client & server are both part of NetBSD's packaging system, so the install was very simple, and setup was too. Hobbit is also available in Debian and RPM packages for Linux users, and it can use the already available Unix, Windows & Novell BB clients. There is also development underway on a Hobbit Windows client.

Enough history, let's look at Hobbit.

What can be done with Hobbit?

I'll be going in depth into what ITS does with Hobbit, but first let me go over a little of what Aerospace Engineering does with Hobbit. I'd like to thank Jim Wellman for letting me use his Hobbit install, and for the information he gave me on how it works for them.

AerE has 1 Hobbit server running on Linux, with the Hobbit client installed on various Linux systems in the department, both servers and workstations. The Big Brother Windows client, bbnt, is installed on the Windows systems, both servers and workstations.

Alerts are sent via email if any of the servers have problems with connectivity, the load average gets too high, the disk starts getting too full, or for certain memory conditions.

The Windows BB client can be set to ignore certain messages in the event log. This is done on the client, by updating the registry key 'ignoremsgs', so that alerts are only sent on messages that are known to be bad, or unexpected new ones.
Client deployment of the Windows BB client is done through either an MSI file (Group Policy), or by running a script remotely, using AutoIT, which installs the client, sets the ignore messages policy, and also installs custom scripts.

----

AerE has three custom scripts in common use.

'bbpqm' checks lpr queues, reporting green if the queue is empty, and red if there's something in the queue. An alert is sent if the printer is red for more than 30 minutes, meaning the queue is likely stuck.

'temperature' monitors the output from one of the Dell Open Manage command line tools for Windows 2003 Server (omreport chassis temps) and parses out the system temperatures. It displays them on Hobbit, using green for ok, and red for not ok, naturally.

'Reportdiskerrors' is a Windows XP script that uses psloglist (from www.sysinternals.com's pstools package) to read the event logs. If there is a disk error, such as a bad block, it sends a report to the Hobbit server. An alert is also mailed out if a bad disk is discovered.

----

http://hobbit.aere.iastate.edu/hobbit/

The graphs Hobbit generates are also useful for students deciding what servers to log into.

----

http://kosh.its.iastate.edu/

This is the public page, something everyone can see. One of the changes Hobbit brings to this page is the ability for people to mouse over the icons, and get a time estimate of how long the host/event combo has been in whichever color state it is in. All the colored icons throughout Hobbit will give you this information on mouseover.

The icon meanings: http://kosh.its.iastate.edu/hobbit/help/hobbit-tips.html#icons

Almost everything else on this page is inaccessible to people outside ITS. So, let's show you around.

http://kosh.its.iastate.edu/hobbit/bb2.html

Main critical page in the current version of Hobbit. Shows any current host/service combo that is red, yellow or purple, as well as the last 100 events that happened in the last 240 minutes.
This is the page that our operations staff watches.

If you click on the colored icons for the current host/service, you can get more information on what's wrong. If you click on the host name, and there is a 'notes' file containing information about the server, it will bring up a new window with the 'notes' file. If you click on the host name, and there is not a 'notes' file, Hobbit will pop up a new window that shows the page that the host lives on, which may at least give some context for the server.

Every host in the current non-green list will have an info and a trends button. The trends button will show all the default graphs available for a host, with a time frame of the last 48 hours. Clicking on any of the graphs will show you other time frames for that particular graph. The info button will show various information about the machine, including a link to the 'notes' page, if there is one. It also lists which checks are run and what kind of alerting is done for each check. There is also an interface for enabling or disabling the checks on a host.

If you click on an icon in the history list, you will see the host/service message that caused the change to that color. From there, you can look at the full history for that host/service combination.

At the bottom of the page is the acknowledgements section. This will show up to 25 host/event problems acknowledged within the last 240 minutes. We will be using this to improve operations' ability to track notification to staff.

-----

On to the main view.

http://kosh.its.iastate.edu/hobbit/bb.html

This contains every machine we're monitoring remotely, and every machine that is allowed to report to the central servers. Most of the servers are grouped according to different categories. We also have secondary groupings arranged by people, so we can have our own 'personal' page with all the machines we may care about.

-----

http://kosh.its.iastate.edu/hobbit/White/White.html

This is my personal page.
I don't tend to use it as much as others might use their personal pages, because for the most part I care about everything Hobbit monitors. Across the top you'll see all the test names, and clicking on a test name will bring up a short description of that test, for each of the tests that come with Hobbit. It's fairly simple to add descriptions for custom tests too.

-----

http://kosh.its.iastate.edu/hobbit/notes/afs-0.iastate.edu.html

Here's the notes page for afs-0. First off we have the contact list. The contact list links off to each person's or group's contact information, as available. Going down the page, we have general criticality information, "The afs servers are 24x7 critical machines, if down, call contact list and send hoofies up message."

Then I itemized everything that is checked on the server, what it can mean, and what to do for each of the possible colors. The list is split into things that are monitored remotely, and things that are monitored by the client that runs on the client machine. Part of the reason I differentiate between remote & local monitoring is so that if only the remote tests, those done by ulkesh, are purple, it is somewhat more obvious that it is ulkesh that is down or having problems, and not the client machine.

Further down the page I have notes about the hardware configuration, and any tips I want to have handy about this particular machine.

----

http://kosh.its.iastate.edu/hobbit/afs/afs.html

On this page you can see what disabled tests look like. afs-6 & afs-8 are not in production, and so I don't want alerts about them alarming everyone.

----

http://kosh.its.iastate.edu/hobbit/bb.html

The web server for the Hobbit monitoring system is kosh, and it has the most trending graphs, as it is actually running the Hobbit client. Let's look at some of its trending graphs.

-----

Reports

The event log report lets you look at changes over some time period.
Let's look at the last 480 minutes, or 8 hours, and we'll set the max events high, so we'll see everything.

The Availability report lets you see availability over a time period you specify. Let's look at Apr 01 2006 - May 01 2006.

The Snapshot report lets you see what things looked like at a particular point in time. Let's look at Sat Apr 01 12:00:00 2006.

The Configuration report lets you look at an overview of the configuration of Hobbit: how many hosts, what's monitored on each host, and what alerts are set.

----

Administration

Find Hosts will tell you what pages a host can be found on, so you don't have to guess. Let's type in AIT, so we'll get a few to choose from.

Acknowledge Alert lets you set an acknowledged status for an alert. In the version we're currently running, you need the code from the alert mail to acknowledge an alert. There is a time period for the ack, up to 9999 minutes, after which it reverts to unacked. The explanation can be "called Tracy, she's looking into it" so the next shift knows what's going on with things on the screen.

Enable/Disable is for things that are out of service for some period of time, and shouldn't be reported on the non-green page.

-----

The Hobbit demo page, http://www.hswn.dk/hobbit/, lets us look at an upcoming feature a little more.

Other upcoming features include:

* Client logfile monitoring has been implemented.
* A completely new "Critical Systems" view with a separate configuration file and web-based GUI for editing it.
* All configuration files now support the "include" directive. In addition, you can include entire directories with a single "directory" statement in the config file.
* Acknowledgments now stay around for a while after the status goes OK, so if a service crashes after a few minutes, the acknowledgment is automatically revived.

Multi-level acks are also in the upcoming version. Multi-level acks allow acknowledgements by different levels of the organization.
An ack done by the team monitoring the critical systems is a "level 1" ack; it won't stop any alerts from going out, but it will get the status off their webpage. Techs can acknowledge an alert as "level 2"; it stops alerts from going out to people on their level, but doesn't prevent alerts from escalating to higher-level people (managers and such).

-----

Tests we've written in house cover a lot of the special monitoring we do.

We've written an afs monitor that includes data that gets graphed, and warns us if the file server isn't running. If there are calls waiting on the server, it starts warning us early in the process.

audit warns us if any packages installed from the NetBSD packaging system have known vulnerabilities, so we can upgrade them.
inode, like the disk test, checks to see if you're using up most of the inodes on a partition.
print-acct checks to make sure that the print accounting service is accessible remotely.
printer checks to make sure that print-2's printer port is open and responding.
raid-twe monitors the 3ware Escalade raid controller we're using in several machines.
raidframe monitors the software raid we're using on several systems.
netreg checks to make sure netreg is answering.

----

We've gotten a lot of checks from www.deadcat.net, also.

http://www.deadcat.net - Big Brother Archive

They're the source of a lot of 3rd party plugins, and we've also downloaded systray alerting for Windows and for X11 apps.

There's also some development of Hobbit 3rd party plugins and addons.

http://sourceforge.net/projects/bbwin - Hobbit client for Windows, in development.
http://devmon.sf.net - allows a system administrator to monitor remote devices via SNMP (Simple Network Management Protocol), querying said devices for current status and alarms.
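
As a rough illustration of how small an in-house test like the inode check can be, here is a minimal Python sketch. It is not the script ITS runs. The function names, the 90%/95% thresholds, and the example host are all made up for illustration. The message layout ("status host.test color text", with dots in the host name turned into commas) and the default port 1984 are what I understand Big Brother compatible clients to use, so treat those as assumptions and check them against your own install.

```python
import os
import socket
import time


def inode_usage_percent(path):
    """Return the percentage of inodes in use on the filesystem holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report inode counts; treat that as 0% used.
        return 0.0
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


def color_for(percent, warn=90.0, panic=95.0):
    """Map a usage percentage to a status color (thresholds are hypothetical)."""
    if percent >= panic:
        return "red"
    if percent >= warn:
        return "yellow"
    return "green"


def build_status(host, test, color, text):
    """Build a BB-style status line; dots in the host name become commas (assumed format)."""
    stamp = time.ctime()
    return "status %s.%s %s %s %s" % (host.replace(".", ","), test, color, stamp, text)


def send_status(server, message, port=1984):
    """Send one status message to the monitoring server over TCP (assumed default port)."""
    with socket.create_connection((server, port), timeout=10) as conn:
        conn.sendall(message.encode("ascii", "replace"))
```

A client would call inode_usage_percent("/"), pass the result through color_for and build_status, and hand the resulting line to send_status with the name of the monitoring server. Keeping the color decision and the message formatting in separate functions means they can be checked without touching the network.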