monitoring postgresql

This should serve as yet another account of “naïve” addition of monitoring to an LCFG component.  It proved to be surprisingly straightforward: PostgreSQL on my test server is now being monitored; built-in support is now available on any host:

#include <dice/options/postgresql-monitoring.h>

though take care to follow the instructions within.

convert

First, update your component for new buildtools compatibility; this saves having to explicitly build the Nagios support:

$ lcfg-cfg2meta
$ lcfg-reltool checkmacros

(Did I mention that I recently added bash tab-completion to the lcfg-reltool command?) Review the output and make any mandatory changes to the component to ensure compatibility.  You can issue:

$ lcfg-reltool checkmacros --fix_deprecated

to save some time.

Once you’re done, you may dispense with your Makefile and add the generated lcfg.yml file before committing.  It’s probably advisable to test your component in all the usual ways, then commit, before proceeding to break it further…

prepare

Add the nagios include to your defaults file (component.def.cin):

#include "monitoring-1.def"

Create a /nagios subdirectory in your component and place the requisite nagios Perl module in it.  I don’t want to comment on module writing other than to say that you should read the guide: many services have plugins already written and will require nothing more than a call to an existing script.

Amongst others, the following components presently provide nagios modules:

  • lcfg-apacheconf
  • lcfg-flexlm
  • lcfg-postgresql
  • lcfg-rpmaccel
  • lcfg-webots

and may be a useful reference.

plugins

To view the existing plugins, include <dice/options/nagios-plugins.h>.  The plugins live in /usr/lib/nagios/plugins and tend to be Perl modules which make use of utils.pm therein, or the CPAN module Nagios::Plugin.

At its core, a plugin appears simply to be a script which exits with one of a set of predefined exit codes.  These exit codes are provided in the utility module mentioned above.  If you wanted to depart from the magical world of Perl, it would of course be slightly cheeky to simply copy the specification and return one of the codes, presently defined as:

%ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);

…but it works!
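
As a minimal sketch of that cheeky approach, assuming nothing beyond a POSIX shell: the check_threshold helper and its thresholds are hypothetical and purely illustrative; only the numeric exit codes come from the list above.

```shell
#!/bin/sh
# Exit codes copied from the %ERRORS list in utils.pm.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4

# check_threshold VALUE WARN CRIT   (hypothetical helper, not part of nagios)
# Prints a one-line status and returns the matching plugin exit code.
check_threshold() {
    # Anything that is not a plain non-negative integer is reported as UNKNOWN.
    case "$1" in
        ''|*[!0-9]*)
            echo "UNKNOWN - value '$1' is not a non-negative integer"
            return "$STATE_UNKNOWN" ;;
    esac
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value $1 >= $3"
        return "$STATE_CRITICAL"
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value $1 >= $2"
        return "$STATE_WARNING"
    fi
    echo "OK - value $1"
    return "$STATE_OK"
}
```

In a real plugin script the function's return value would become the script's exit status, for example by ending the script with: check_threshold "$value" 80 95; exit $?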

Writing a real plugin and bundling it to the server is a slightly more involved task, which I may cover later.

Testing your prospective check command on an existing plugin is important. From the nagios server (if possible and safe) execute your plugin directly, as you would do through your nagios configuration fragment. I developed my PostgreSQL test by running

/usr/lib/nagios/plugins/check_pgsql --help

and building a test string from here. Once your test command is working, note carefully your arguments (and any whitespace) for later inclusion in your LCFG nagios module.
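
When trying out candidate argument strings, it can help to see which state nagios would record for each run. The following is only a sketch: try_check and state_name are hypothetical helpers, and the host name and port in the usage comment are placeholders rather than real values.

```shell
#!/bin/sh
# state_name CODE: map a plugin exit code to its nagios state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        4) echo DEPENDENT ;;
        *) echo UNKNOWN ;;
    esac
}

# try_check COMMAND [ARGS...]: run a candidate check exactly as given,
# let its own output through, then report the resulting exit code and state.
try_check() {
    try_check_rc=0
    "$@" || try_check_rc=$?
    echo "exit code $try_check_rc ($(state_name "$try_check_rc"))"
}

# Usage (placeholder host and port):
#   try_check /usr/lib/nagios/plugins/check_pgsql -H db.example.org -P 5432 -l nagios
```

Because the arguments are passed through untouched, whatever works here is what should be copied, whitespace and all, into the module.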

my module

My nagios/postgresql.pm was simple.  The service-specific parts are emphasised:

package LCFG::Monitoring::Nagios::Translators::postgresql;

use strict;
use warnings;
use base qw(LCFG::Monitoring::Nagios::Translators::SimpleCheck);
use LCFG::Monitoring::Nagios::Fragment::Command;

sub startPass {
  my ($self) = @_;

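  my $command = LCFG::Monitoring::Nagios::Fragment::Command->new();
  $command->commandName("check_postgresql");
  # NOTE: this line is truncated in the source after "-l nagios"; only the
  # closing quote is restored here, and any further arguments are unknown.
  $command->commandLine('$USER1$/check_pgsql -H$HOSTADDRESS$ -P$ARG1$ -l nagios');
  return $command;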
}

sub checkCommand {
  my $pgport = $profile->k('pgport')->d
        or die LCFG::Monitoring::Exception::RunTime->new
                ("Unable to get pgport from profile");

  return ("check_postgresql!$pgport");
}

sub serviceDescription {
  return ("postgresql");
}

1;

Now alter your specfile to include the nagios module when built; unless you’re doing something unusual, or providing your own plugin, the below will suffice:

%package nagios
Summary: Nagios translator for @LCFG_FULLNAME@
Group: LCFG/Translators/Nagios
BuildArch: noarch
Requires: lcfg-nagios

%description nagios
Nagios translator for the LCFG @LCFG_NAME@ component.
@LCFGCONFIGMSG@

%files nagios
%defattr(-,root,root)
%doc @LCFGPOD@/LCFG::Monitoring::Nagios::Translators::@LCFG_NAME@.pod
%{perl_vendorlib}/LCFG/Monitoring/Nagios/Translators/@LCFG_NAME@.pm
%doc %{_mandir}/man3/*

Note the %{perl_vendorlib} line, which depends upon the sometimes-default %install line:

%cmake -DPERL_INSTALLDIRS:STRING=vendor

test

Test your component as documented — though note on typical modern DICE machines you’ll need to modify the command slightly:

lcfg-monitor-test --ldapserver=dir --libs=./blib/lib nagios mytesthost

It’s important to note that normal output from this script will include

limitations

To prevent nagios from having too much access to the database server, I decided to create a ‘nagios’ role, with access only to a ‘nagios’ database, again, accessible only from the nagios server.  Therefore, before any monitoring will succeed the following must be performed on the server:

# CREATE ROLE nagios LOGIN;
# CREATE DATABASE nagios OWNER postgres;
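
Once the role and database exist, it may be worth confirming from the nagios server that the restricted login actually works before blaming the translator. This is a sketch only: probe_nagios_role is a hypothetical helper, the PSQL variable exists purely so the client binary can be swapped out, and it assumes the standard psql client is installed where you run it.

```shell
#!/bin/sh
# probe_nagios_role HOST PORT   (hypothetical helper)
# Attempts a trivial query as the 'nagios' role against the 'nagios' database.
# -w stops psql from prompting for a password, so a failed login returns promptly.
probe_nagios_role() {
    if "${PSQL:-psql}" -w -h "$1" -p "$2" -U nagios -d nagios -At -c 'SELECT 1' >/dev/null 2>&1; then
        echo "OK - nagios role can connect to $1:$2"
        return 0
    fi
    echo "CRITICAL - nagios role cannot connect to $1:$2"
    return 2
}

# Usage (placeholder host and port):
#   probe_nagios_role db.example.org 5432
```

Running the same probe from any other host should fail if the access restriction is doing its job.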

This technique does not test Kerberos usage, and arguably doesn’t prove that any other authentication type is possible, either. I have not yet generalised the configuration to allow testing of any other database/role; of course this is crucial and will appear shortly. At the most basic, the ability to test template1 would be a reasonable subset of the named-database testing and will be available soon.

testing in practice

Rolling out the test code is the fiddliest part of the service since it requires installing your code onto an lcfg-master and nagios server to test. My workflow, assuming there are resource changes:

  1. Run perl nagios/<component>.pm to check for basic syntax errors.
  2. Make new dev RPMs with lcfg-reltool devrpm.
  3. Submit devrpms to the devel bucket.
  4. Install* -defaults rpm onto lcfg master (tail lcfg slave logs carefully)
  5. Install* component RPM onto development server.
  6. On dev server, add/test any new/changed resources as required.
  7. Run lcfg-monitor-test on development server to check for translator errors (noting that an empty translatorsPre.cfg section is an error!)
  8. Install -nagios rpm onto nagios server
  9. Re-run lcfg-monitor-test on nagios server
  10. If possible: reassemble check command from results, and run check command directly.
  11. Restart nagios server component (perhaps with Inf-unit’s oversight, and tailing the nagios server log carefully).

Repeat as necessary. Note that by ‘install’ I mean install manually using RPM so that an updaterpms is all that’s required to revert. Then, when satisfied:

  1. Do not change the code! Seriously, don’t change it. You can’t trust yourself not to omit a semicolon, here.
  2. lcfg-reltool *release; lcfg-reltool rpm; Submit to LCFG bucket.
  3. Update nagios translator and defaults headers.
  4. Go home.

2 thoughts on “monitoring postgresql”

  1. gdutton Post author

    Having written this a few months ago, this post is looking a little dusty and out-of-date. However, it was becoming stale in my drafts, and is better serving as a public reminder to me to document further!

  2. gdutton Post author

    Update: added a working example of the test command to take into account the fact that local machines don’t typically run slapd any more, and the error message generated isn’t very helpful (i.e. required strace to figure out why it was failing…)
