Quick Start Guide to Infrastructure Monitoring with Nagios

August 9, 2009

Staying ahead of IT service issues can be frustrating when you manage several servers, or even a single server with many services. Enterprise IT Infrastructure Monitoring Solutions (a fancy term for something that is really pretty simple) attempt to remedy the problem by repeatedly checking the status of machines and services on the network and alerting the responsible administrators as soon as something goes wrong, or even before there’s a problem.

It’s hard to argue against implementing a monitoring solution within the network, as it is very much a set-up-and-forget matter that adds negligible load. The monitoring solution itself is (or at least should be) very low maintenance, yet it provides valuable insight into the health of the network.

Introducing Nagios

Nagios is an infrastructure monitoring solution that is both popular and open source. Apart from its obvious monitoring capabilities, it includes the ability to associate an event handler with an event, allowing you to fix a problem automatically. If, for example, one of your Python applications crashes, you can have Nagios run python /opt/myapp/myapp.py automatically, before any human administrators have the time to do so. Other features include the ability to create many kinds of reports, and to send notifications and alerts via email and SMS.
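
To make the event handler idea a little more concrete: a handler is just a script that Nagios calls with details about the event as arguments. The sketch below assumes it is invoked with the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros, in that order. The function name, the log path and the "third attempt" policy are hypothetical choices of mine, not something Nagios prescribes:

```shell
#!/bin/sh
# Hypothetical event handler sketch. Assumed invocation by Nagios:
#   restart-myapp.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$

# Decide whether a restart is warranted for the given state.
# Prints "restart" or "ignore" so the decision is easy to test by hand.
decide_action() {
    state=$1; state_type=$2; attempt=$3
    case "$state" in
        CRITICAL)
            # Act once the problem is confirmed (HARD) or on the third
            # soft retry, rather than on the first blip.
            if [ "$state_type" = "HARD" ] || [ "$attempt" -ge 3 ]; then
                echo restart
            else
                echo ignore
            fi
            ;;
        *)
            # OK, WARNING and UNKNOWN need no action in this sketch.
            echo ignore
            ;;
    esac
}

# Only act when called with all three arguments, as Nagios would do.
if [ "$#" -eq 3 ] && [ "$(decide_action "$1" "$2" "$3")" = "restart" ]; then
    # Application path taken from the example above; the log file is an assumption.
    python /opt/myapp/myapp.py >> /tmp/myapp-restart.log 2>&1 &
fi
```

Keeping the decision in its own function means you can try it from a shell (decide_action CRITICAL SOFT 1) without actually restarting anything.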

Nagios' web interface screenshot

Nagios is based primarily on C and shell scripts, which makes it light on resources but adds a slightly ‘hackish’ feel. It comes with a CGI-based web interface (which we’ll spice up a bit) that lets you view and manage Nagios through what are known as External Commands.

I’d like to demonstrate how to set up rudimentary Nagios monitoring on a small farm of Linux servers, with an Ubuntu/Debian server running the primary Nagios process. In the end, we’ll be monitoring the states of various services on the servers, including the ones seen in the screenshot above (Apache processes, APT, Current Load, Current Users, Disk Space, Dovecot, FTP, HTTP, MySQL, SMTP, SSH, Swap, Total Processes, and Zombie Processes). We will also receive notifications by email whenever something goes wrong:

Nagios Email Notification

Please note that this guide is meant to get you up and running quickly, and that it’s not a substitute for the official Nagios documentation. If you want to know what all of the different configuration options do (or can do), please consult the (excellent) documentation.

Setting Up The Nagios Server

The steps in this section should just be done on the main Nagios server, not the clients it will be monitoring. We’ll get to those later!

This procedure should be quite similar on other distributions if you use their package managers (yum, yast, urpmi, etc.) or install Nagios from source, but no guarantees.

Let’s become root so we don’t have to prepend sudo to everything:
```
sudo -s
```
If you want to make use of Nagios’ web interface and Apache isn’t already installed:
```
apt-get install apache2
```
It’s entirely possible to use something like nginx or lighttpd to serve the interface, but that is not covered in this guide.
Install Nagios from the package repositories:
```
apt-get install nagios3 nagios-nrpe-plugin
```

Nagios should be accessible at http://nameofnagiosserver/nagios3 already! We still have some configuration to do, though.

Stop Nagios:
```
/etc/init.d/nagios3 stop
```
Add a new user for the web interface, e.g. patrick. The default configuration grants all security permissions to the user nagiosadmin, but we’ll change that to the name of the new user, too:
```
htpasswd -c /etc/nagios3/htpasswd.users patrick
perl -p -i -e "s/nagiosadmin/patrick/g" /etc/nagios3/cgi.cfg
```

The perl command above replaces all occurrences of nagiosadmin with patrick in the file /etc/nagios3/cgi.cfg.

The users listed in /etc/nagios3/cgi.cfg are effectively global administrators. For regular users, you can still add them as users with htpasswd, but assign privileges by making them Contacts for certain hosts or hostgroups, instead. We’ll get to this later!

If you want to add more user accounts for the web interface:
```
htpasswd /etc/nagios3/htpasswd.users john
```
And if you want to give them superuser privileges:
```
perl -p -i -e "s/patrick/patrick, john/g" /etc/nagios3/cgi.cfg
```
Go through /etc/nagios3/cgi.cfg manually to see what the different security options do, and to grant more fine-grained privileges to other administrators.
Edit /etc/nagios3/nagios.cfg and change check_external_commands=0 to check_external_commands=1 to allow monitoring commands to be issued through the web interface.

On Debian/Ubuntu, run the following commands after setting check_external_commands=1:

dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
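
If you would rather script the nagios.cfg edit than do it by hand, something along these lines should work. It is only a sketch: the function name is made up, and it assumes the setting sits on its own line starting with check_external_commands=, as it does in the packaged file:

```shell
# Switch check_external_commands to 1 in the given nagios.cfg-style file,
# leaving a .bak copy of the previous version beside it.
enable_external_commands() {
    cfg=$1
    cp "$cfg" "$cfg.bak" || return 1
    sed 's/^check_external_commands=.*/check_external_commands=1/' "$cfg.bak" > "$cfg"
    # Print the resulting line so the change can be confirmed by eye.
    grep '^check_external_commands=' "$cfg"
}

# Example (run as root on the Nagios server):
#   enable_external_commands /etc/nagios3/nagios.cfg
```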

Edit /etc/nagios3/conf.d/contacts_nagios2.cfg to match your preferences. Example:

define contact{
        contact_name                    patrick
        alias                           Patrick Mylund
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           [email protected]
        }

And further down:

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 patrick
        }

Make a host definition for a server you want to monitor by creating a matching config file, e.g. for the server ‘tranquillity’, nano -w /etc/nagios3/conf.d/tranquillity_nagios2.cfg, then insert a declaration. Example:
```
define host{
        use                     generic-host            ; Name of host template to use
        host_name               tranquillity
        alias                   PatrickMylund.com Web Server
        address                 209.20.82.6
        }
```
You can put all of your host definitions in one file if you want, e.g. datacenter1_nagios2.cfg. Just remember the .cfg extension at the end of the file name, which is what tells Nagios to load the file from conf.d; the _nagios2 part simply follows the naming of the packaged files.
Repeat the step above to add a host definition for each server you want to monitor.
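
With more than a handful of servers, writing these files by hand gets tedious. Here is a rough sketch that generates one host definition file per server from a plain list of "name address alias" lines. The function name and the input format are my own invention, and it assumes every host uses the generic-host template:

```shell
# Write one host definition file per "name address alias..." line read
# from standard input. The target directory defaults to the one used
# in this guide.
write_host_defs() {
    conf_dir=${1:-/etc/nagios3/conf.d}
    while read -r name address alias_text; do
        # Skip blank lines and comments in the input list.
        case "$name" in ''|'#'*) continue ;; esac
        cat > "$conf_dir/${name}_nagios2.cfg" <<EOF
define host{
        use                     generic-host
        host_name               $name
        alias                   $alias_text
        address                 $address
        }
EOF
    done
}

# Example: write_host_defs /etc/nagios3/conf.d < hosts.txt
# where hosts.txt (a hypothetical file) contains lines such as:
#   tranquillity 209.20.82.6 PatrickMylund.com Web Server
```

Everything after the second field is treated as the alias, so aliases may contain spaces.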

Move some standard configs to make room for our configured ones:

mv /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/localhost_nagios2.cfg.old
mv /etc/nagios3/conf.d/services_nagios2.cfg /etc/nagios3/conf.d/services_nagios2.cfg.old
wget https://cdn.pmylund.com/files/misc/1202-nagios_quickstart/services_nagios2.cfg -O /etc/nagios3/conf.d/services_nagios2.cfg

Edit /etc/nagios3/conf.d/hostgroups_nagios2.cfg. List which hosts (comma-separated) should belong to which groups (debian-servers, http-servers, ssh-servers, and ping-servers), and add some extra hostgroups: db-servers, ftp-servers, and mail-servers:

define hostgroup {
        hostgroup_name  db-servers
                alias           Database servers
                members         tranquillity, singularity
        }

define hostgroup {
        hostgroup_name  ftp-servers
                alias           FTP servers
                members         tranquillity, singularity
        }

define hostgroup {
        hostgroup_name  mail-servers
                alias           IMAPS/SMTP servers
                members         tranquillity
        }

You can see which services are associated with which hostgroups by looking in /etc/nagios3/conf.d/services_nagios2.cfg.
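
If you don't feel like reading through the whole file, a small awk filter can summarise it. This is a sketch that assumes each service block uses one "key value" pair per line, with hostgroup_name and service_description keys and a closing brace on its own line; the function name is hypothetical:

```shell
# Print "hostgroup: service description" for every service definition
# in the given file that is attached to a hostgroup.
list_services_by_hostgroup() {
    awk '
        $1 == "hostgroup_name"      { group = $2 }
        $1 == "service_description" { $1 = ""; sub(/^ +/, ""); desc = $0 }
        $1 == "}" {
            # End of a definition block: report it if both keys were seen.
            if (group != "" && desc != "") print group ": " desc
            group = ""; desc = ""
        }
    ' "$1"
}

# Example:
#   list_services_by_hostgroup /etc/nagios3/conf.d/services_nagios2.cfg
```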

We’re done with the Nagios server for now. Let’s look at the settings for the Linux servers we want to monitor.

Configuring Monitored Clients

The steps in this section should be done on each Linux host that you want to monitor.

Again, let’s become root:
```
sudo -s
```
Install Nagios’ NRPE module:
```
apt-get install nagios-nrpe-server
```
Installing the NRPE module is optional, but you won’t be able to run any of Nagios’ scripts directly on the target client if you do not. This is necessary for monitoring system stats, and generally anything that cannot be probed from the outside over the network (by the main Nagios server).
See the NRPE documentation (PDF) for manual installation instructions, as well as how to get information via SSH (check_by_ssh) instead of NRPE.
Stop NRPE:
```
/etc/init.d/nagios-nrpe-server stop
```
Install a custom nrpe_local.cfg (this will save us some time later):
```
mv /etc/nagios/nrpe_local.cfg /etc/nagios/nrpe_local.cfg.old
wget https://cdn.pmylund.com/files/misc/1202-nagios_quickstart/nrpe_local.cfg -O /etc/nagios/nrpe_local.cfg
```
Go through /etc/nagios/nrpe_local.cfg to see the list of commands that Nagios will be able to execute on hosts running NRPE. By default, NRPE will only run the commands defined in this configuration file, and without any arbitrary arguments. I strongly recommend you stick to this for security purposes.
On the main Nagios server, all service commands prefixed with check_nrpe_1arg in /etc/nagios3/conf.d/services_nagios2.cfg are commands defined in /etc/nagios/nrpe_local.cfg on the monitored clients.
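
To get a quick list of the command names a client will accept, you can pull them out of the file. The sketch below relies on NRPE's command[name]=command line syntax; the function name is my own:

```shell
# Print the name of every command defined in an NRPE configuration file,
# i.e. the part between the brackets of each command[...]= line.
list_nrpe_commands() {
    sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Example (on a monitored client):
#   list_nrpe_commands /etc/nagios/nrpe_local.cfg
```

These are the names that may follow check_nrpe_1arg! in the service definitions on the main Nagios server.
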
Define what hosts are going to be allowed to probe the NRPE module for information (comma-separated). For instance, if the main Nagios server has IP 192.168.1.105:
```
perl -p -i -e "s/127\.0\.0\.1/192.168.1.105/g" /etc/nagios/nrpe_local.cfg
```
If you have a firewall (iptables, ufw, etc.), you need to allow incoming connections on TCP port 5666 (NRPE) on the clients. With Ubuntu’s Uncomplicated Firewall, and assuming the main Nagios server has IP 192.168.1.105, you could run ufw allow proto tcp from 192.168.1.105 to any port 5666 to admit only the Nagios server, or ufw allow 5666/tcp to open the port to everyone.
Start the NRPE module:
```
/etc/init.d/nagios-nrpe-server start
```
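
Before moving on, it doesn't hurt to confirm from the main Nagios server that each client actually answers. The check_nrpe plugin (installed by nagios-nrpe-plugin) can be run by hand with -H for the host and -c for the command name. The wrapper below is a sketch: the function name is hypothetical, the plugin path is where I'd expect the Debian/Ubuntu package to put it, and check_swap is assumed to be one of the commands defined in nrpe_local.cfg:

```shell
# Assumed location of check_nrpe on Debian/Ubuntu; override NRPE_PLUGIN
# if yours lives elsewhere.
NRPE_PLUGIN=${NRPE_PLUGIN:-/usr/lib/nagios/plugins/check_nrpe}

# Run one NRPE command against each host given and report the outcome.
# Usage: probe_nrpe <command> <host>...
probe_nrpe() {
    nrpe_cmd=$1; shift
    failures=0
    for host in "$@"; do
        if output=$("$NRPE_PLUGIN" -H "$host" -c "$nrpe_cmd" 2>&1); then
            echo "$host: $output"
        else
            # Non-zero covers connection problems as well as
            # WARNING/CRITICAL/UNKNOWN results from the remote check.
            echo "$host: NOT OK ($output)"
            failures=$((failures + 1))
        fi
    done
    # Succeed only if every host came back OK.
    [ "$failures" -eq 0 ]
}

# Example (run on the main Nagios server):
#   probe_nrpe check_swap tranquillity singularity
```

A "NOT OK" with a connection error usually points at the allowed hosts setting or the firewall on the client.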

We just about have a basic Nagios setup now!

Testing Nagios

Let’s see if what we’ve set up is working. On the main Nagios server, start the Nagios service:

/etc/init.d/nagios3 start

If all goes well, navigate to e.g. http://192.168.1.105/nagios3, log in with the user credentials you set up earlier, then click on Service Detail in the menu on the left. All of our services will be PENDING, meaning they’ll be checked shortly. You can speed this up by clicking on a service and clicking Re-schedule the next check of this service (this is what is called an External Command).

If any of the service states turn out to be CRITICAL or UNKNOWN, don’t panic. Take a look at the different configuration files in /etc/nagios3/conf.d; the settings and commands are pretty straightforward.
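
If Nagios refuses to start at all, the configuration itself is the likely culprit. Nagios can verify its configuration with the -v option followed by the path to the main config file. The wrapper below is a sketch; the function name is made up and the binary path is my assumption for the Debian/Ubuntu nagios3 package:

```shell
# Assumed defaults for the Debian/Ubuntu nagios3 package; override as needed.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/sbin/nagios3}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}

# Run Nagios' built-in configuration check. Stay quiet apart from a
# one-line confirmation when all is well, and show the full report
# when the check fails.
verify_nagios_config() {
    if report=$("$NAGIOS_BIN" -v "$NAGIOS_CFG" 2>&1); then
        echo "Configuration looks valid."
    else
        echo "$report"
        return 1
    fi
}

# Example (run as root on the Nagios server):
#   verify_nagios_config && /etc/init.d/nagios3 restart
```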

You can find examples of the resulting configuration files in nagios-conf-example.tar.gz. The configs are for a single server (singularity) with the IP address 192.168.2.3.

An Extra Touch

Nagios’ web interface doesn’t look very pretty. We can spice it up a little by changing the CSS. I’ve prepared a modified status.css for your convenience:

mv /etc/nagios3/stylesheets/status.css /etc/nagios3/stylesheets/status.css.old
wget https://cdn.pmylund.com/files/misc/1202-nagios_quickstart/status.css -O /etc/nagios3/stylesheets/status.css

Now hit F5 in the web interface!

Bear In Mind

The easiest way to monitor the Nagios server itself is to pretend it’s yet another server. Install NRPE, set the connection settings, and add it in the host declarations with the other servers.
The exclamation mark (!) is meant to separate command arguments in Nagios configuration files. For instance, check_nrpe_1arg!check_swap would mean you’re running check_nrpe_1arg with the argument check_swap.
All of the scripts and commands you can issue through Nagios are stand-alone scripts. When configuring Nagios, you can run each command, for instance check_smtp, manually instead of doing tons of trial-and-error with the configuration files:
```
/usr/lib/nagios/plugins/check_smtp -H 192.168.1.105
/usr/lib/nagios/plugins/check_smtp -h
```
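
When you run a plugin by hand like this, the text it prints is only half the story: Nagios decides the service state from the plugin's exit status (0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN). Here is a small sketch that makes the state visible; both function names are my own:

```shell
# Translate a plugin exit status into the state name Nagios displays.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run any plugin command line, let its own output through, then print
# the state derived from its exit status.
run_plugin() {
    "$@" && rc=0 || rc=$?
    echo "=> $(state_name "$rc")"
}

# Example:
#   run_plugin /usr/lib/nagios/plugins/check_smtp -H 192.168.1.105
```
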
All lists in Nagios configuration files are comma-separated.
You can set the contact_groups value on any service, host, or hostgroup declaration. Contact groups are defined in /etc/nagios3/conf.d/contacts_nagios2.cfg. Any person in a contact group that has a user account for the web interface (htpasswd.users) can automatically view any hosts and services associated with it.
Example:
```
define hostgroup {
        hostgroup_name  mail-servers
                alias           IMAPS/SMTP servers
                members         singularity
                contact_groups  mailadmins
        }
```

Again, the best part about what we’ve set up now is that you can go right ahead and forget about it. You’ll receive an email at the contact address specified whenever something is amiss, as well as when it gets better. If I’m right, though, you’ll want to tune your configuration a lot further. We’ve barely scratched the surface; Nagios can do much more, and everything is thoroughly documented in the official documentation.