Monitor running processes with Perl

Update: This article has been updated thanks to Colin Keith’s excellent comment, which inspired me greatly.

Maintaining a large number of servers cannot be done without proper programming skills. Every good system administrator must therefore know how to automate his daily work.

Although many programming languages exist, most people write code in only one. I happen to like Perl.

In this blog post, I am going to show how to create a script that can be deployed on all the Linux servers you maintain, to check that certain services are running.

Of course, a tool such as Nagios together with NRPE and a configured event handler could also be used, but lately I was often in the situation where the NRPE daemon had crashed, Nagios was spewing a lot of errors and the event handler... well, since NRPE was down, the event handler of course couldn’t connect or do anything. So why rely on a remotely triggered action when a simple local script could be used?

The following script checks a default list of services, and a configuration file can add to or override these defaults. For each service, a regular expression is used to look for it among the running processes, and of course a startup command needs to be defined. And that is all the script will and should do.

It will use three cool CPAN modules:

The first one will be used to get a full listing of all running processes and the second one will provide us a means for using configuration files.

So let’s start our script:

Until now, no rocket science: we load the required modules and define the default services that need to be checked.
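As a rough sketch of what such a start could look like: the key names `re`, `pidfile` and `cmd`, as well as every path and command below, are assumptions made for illustration and not necessarily what the real script uses. The module names in the comments are inferred from the rest of this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The real script would also load its CPAN modules here, for example:
#   use Proc::ProcessTable;   # listing of all running processes
#   use YAML;                 # reading the configuration file

# Default services to watch (all names, paths and commands are assumptions).
my %services = (
    exim => {
        re      => '^/usr/sbin/exim',            # pattern matched against the command line
        pidfile => '/var/run/exim4/exim.pid',    # optional PID file
        cmd     => '/etc/init.d/exim4 start',    # how to start the service again
    },
    sshd => {
        re  => '^/usr/sbin/sshd',
        cmd => '/etc/init.d/ssh start',
    },
);

printf "%-5s is restarted with: %s\n", $_, $services{$_}{cmd}
    for sort keys %services;
```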

Next part: check if there is a configuration file on disk. The script looks for a hard-coded path, “/etc/default/watchdog.yaml”:

The last Perl statement actually allows you to override one or more (or even all) of the default services.
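One way to express that override, as a hedged sketch: the helper name `merge_services` and the default entry are hypothetical, and `YAML::LoadFile` is only called when the configuration file actually exists.

```perl
use strict;
use warnings;

# Merge per-host overrides into the defaults. An entry in $overrides replaces
# the default entry of the same name; unknown names are simply added.
sub merge_services {
    my ($defaults, $overrides) = @_;
    return { %{$defaults}, %{ $overrides || {} } };
}

# Assumed default entry, for illustration only.
my %services = (
    exim => { re => '^/usr/sbin/exim', cmd => '/etc/init.d/exim4 start' },
);

my $config_file = '/etc/default/watchdog.yaml';
my $overrides   = {};
if (-e $config_file) {
    require YAML;                               # CPAN module, loaded only when needed
    $overrides = YAML::LoadFile($config_file);  # expected to return a hash reference
}

%services = %{ merge_services(\%services, $overrides) };
```

Because the merge happens per service name, a host that only needs a different start command for one service still has to repeat that service's full entry in its configuration file.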

Now let’s see if these processes are actually running. The following code was hugely inspired by Colin Keith’s comment below. I have combined his examples together with my code.

Let’s first have a look at the code:

Line 2 retrieves the current process list. We save that information in two hashes that hold a little less information, because we only need the PID and the actual ‘command line’ of each process.
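The two hashes could be built along these lines. This is a sketch under assumptions: a small made-up list stands in for the process table (in the real script it would come from Proc::ProcessTable), and the service definitions are illustrative.

```perl
use strict;
use warnings;

# Stand-in for the process table: one record per process with its PID and
# full command line. The sample data is made up.
my @table = (
    { pid => 1,   cmndline => '/sbin/init' },
    { pid => 812, cmndline => '/usr/sbin/sshd -D' },
);

# Illustrative service definitions.
my %services = (
    sshd => { re => '^/usr/sbin/sshd' },
    exim => { re => '^/usr/sbin/exim' },
);

my (%procs, %matched_procs);
for my $p (@table) {
    $procs{ $p->{pid} } = $p->{cmndline};            # PID => command line
    for my $name (keys %services) {
        $matched_procs{$name} = $p->{pid}            # service => a matching PID
            if $p->{cmndline} =~ /$services{$name}{re}/;
    }
}

print "sshd matched PID $matched_procs{sshd}\n" if exists $matched_procs{sshd};
```

The process table is walked only once; every later check becomes a plain hash lookup.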

At line 16 we start looping through the services we have defined in the %services hash.
Inspired by Colin’s post, we check whether a PID file is configured for the service and whether it still exists. If it does, we verify that the PID stored in the PID file exists in the process list, which we have stored in %procs. This happens in lines 18-21.
At line 21, if the process is still running and the PID matches, we move on to the next service (the “&& next” part).
If the process is no longer running but the PID file is still in the defined path, the file is removed at line 24.
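The PID file check described above can be sketched as a small helper. The function name `running_by_pidfile` is hypothetical, and the second argument is assumed to be a hash reference keyed by PID, like %procs.

```perl
use strict;
use warnings;

# Returns 1 when $pidfile exists and the PID stored in it is a key of %$procs.
# A stale PID file (its PID is not in the process list) is removed.
sub running_by_pidfile {
    my ($pidfile, $procs) = @_;
    return 0 unless defined $pidfile && -e $pidfile;

    open my $fh, '<', $pidfile or return 0;   # unreadable: let other checks decide
    my $line = <$fh>;
    close $fh;

    my ($pid) = defined $line ? $line =~ /(\d+)/ : ();
    return 1 if defined $pid && exists $procs->{$pid};

    unlink $pidfile or warn "Could not remove stale PID file $pidfile: $!\n";
    return 0;
}
```

Inside the loop this could be used as `running_by_pidfile($services{$name}{pidfile}, \%procs) && next;`, where `pidfile` is again an assumed key name.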

Otherwise, if no PID file was found or no PID file was configured, we check the process list against the regular expression defined for that service. We have already built a hash, %matched_procs, between lines 7 and 10, which we use for this check. If the service exists in the hash, we skip it and move on to the next service.

Now, if neither check found the process (there was no PID file, or the stale PID file was removed at line 24, and the regular expression matched nothing), the process is started again. This happens at line 35.
I’ve executed it with the ‘system’ function since I want the output of this command to go directly to STDOUT. And of course, the last thing to do is to check whether the process started up correctly by checking its exit code.
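A minimal sketch of that last step, with a hypothetical helper name `start_service`. It relies only on the documented return value of Perl's `system`: -1 when the command could not be run, otherwise a wait status whose low 7 bits hold a signal number and whose high byte holds the exit code.

```perl
use strict;
use warnings;

# Run a start command and report whether it succeeded.
sub start_service {
    my ($name, $cmd) = @_;

    my $rc = system($cmd);    # the child's output goes straight to our STDOUT/STDERR
    if ($rc == -1) {
        warn "$name: could not execute '$cmd': $!\n";
        return 0;
    }
    if ($rc & 127) {
        warn sprintf("%s: start command was killed by signal %d\n", $name, $rc & 127);
        return 0;
    }

    my $exit = $rc >> 8;      # the actual exit code of the command
    warn "$name: start command exited with status $exit\n" if $exit;
    return $exit == 0 ? 1 : 0;
}
```

Keep in mind that an exit code of 0 only says the start command itself succeeded; whether the daemon stays up is something the next run of the script will show.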

Now save that script as, for instance, ‘watchdog.pl’ and configure it as a cron job.
Example:
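A crontab entry along these lines would run the check every five minutes. Both the interval and the installation path are assumptions; adjust them to your own setup.

```
# m    h  dom mon dow  command
*/5    *  *   *   *    /usr/local/bin/watchdog.pl
```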

And here’s an example of the configuration file:
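A hypothetical /etc/default/watchdog.yaml could look like the sketch below. The key names `re`, `pidfile` and `cmd` and all paths are assumptions; use whatever keys the script actually reads. The second service has no PID file configured, so it would be checked by regular expression only.

```yaml
exim:
  re: '^/usr/sbin/exim'
  pidfile: /var/run/exim4/exim.pid
  cmd: /etc/init.d/exim4 start
ntpd:
  re: '^/usr/sbin/ntpd'
  cmd: /etc/init.d/ntp start
```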

Link to script source code: https://github.com/insani4c/perl_tools/tree/master/watchdog


Comments

Monitor running processes with Perl — 7 Comments

  1. > unless( grep m#$services{$service}->{re}#, map {$_->{cmndline}} \
    @{$processes->table}

    > Now, that ”unless” statement may seem like a lot of kung-fu but actually it
    > isn’t.
    > The $processes->table call, returns an arrayref of hashrefs …

    Nice article, but can I suggest that if you need to explain something like this then it can be simplified. You’re iterating through the list of processes and getting the command line for each service that you want to check, so you’re performing this same operation 5 times.

    Here you can trade memory for speed and code clarity (there’s always a trade off) by performing the iteration and map {} once and stashing the results in an array:
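    A minimal sketch of that idea, with made-up sample data in place of the real process table and illustrative patterns:

    ```perl
    use strict;
    use warnings;

    # Made-up sample data standing in for @{ $processes->table }
    my @table    = ( { cmndline => '/sbin/init' }, { cmndline => '/usr/sbin/sshd -D' } );
    my %services = (
        sshd => { re => '^/usr/sbin/sshd' },
        exim => { re => '^/usr/sbin/exim' },
    );

    # Walk the process table once and keep only the command lines ...
    my @cmdlines = map { $_->{cmndline} } @table;

    # ... then reuse that array for every service.
    my @missing_services;
    for my $service (sort keys %services) {
        my $re = $services{$service}{re};
        push @missing_services, $service unless grep { /$re/ } @cmdlines;
    }
    print "missing: @missing_services\n";
    ```

    With this sample data only exim ends up in @missing_services.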

    You can improve the clarity of your unless() with any of the following:

    My preferred method is “|| next”s to bug out as soon as you know a situation is good/bad:

    Inline:

    Trailing conditionals are common, but personally I find them distracting:

    And I really don’t like unless() myself, but I know it is common
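    The three styles side by side, as a sketch with made-up sample data. The short-circuit form is written with && here because the test asks whether the service is already running. All three loops collect the same list.

    ```perl
    use strict;
    use warnings;

    my @cmdlines = ('/sbin/init', '/usr/sbin/sshd -D');    # made-up sample data
    my %re       = (sshd => '^/usr/sbin/sshd', exim => '^/usr/sbin/exim');

    my (@inline, @trailing, @with_unless);

    for my $service (sort keys %re) {
        # Inline short-circuit: bug out as soon as the service is known to be fine
        (grep { /$re{$service}/ } @cmdlines) && next;
        push @inline, $service;
    }

    for my $service (sort keys %re) {
        # Trailing conditional
        next if grep { /$re{$service}/ } @cmdlines;
        push @trailing, $service;
    }

    for my $service (sort keys %re) {
        # unless() block
        unless (grep { /$re{$service}/ } @cmdlines) {
            push @with_unless, $service;
        }
    }

    print "@inline | @trailing | @with_unless\n";
    ```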

    Myself I would skip the pushing into @missing_services and then looping through that, and instead merge the two blocks of code because you’re now bugging out early:

    This could also be a nested if(grep { …. }) { system()… } but that means more levels of indentation.

    Of course, if this code gets too long then move it into a subroutine and call the sub from within your for() loop.

    This is more to make the code easier to read/maintain. Otherwise this is a useful tool. One suggestion would be to anchor your patterns, and for better security and better peace of mind, my other suggestion would be to check the pids for each service.

    For example:

    This is harder for something to go wrong because you’re checking that the process is using the PID that it is meant to be using and you don’t have some weird situation in which the exim daemon actually died but it thinks it is still running because of some child process. Just more checks means slightly less chance of thinking everything is okay when it isn’t.

    You could add in tests for the process UID, etc. and also test to see if the process is eating too many resources (CPU, memory, etc.) to give you a full system. And then... you know, in all of that glorious free time that you have :-) you could daemonize it. Maybe with Proc::Daemon, so that it doesn’t rely on cron, which can be blatted too. But if you do that, then you need to be able to connect to it from the outside to test that your watchdog daemon hasn’t been blatted. Watching the watcher, as they say. Nice work though!

    • Now this comment just proves my love for Perl: there are at least a million ways to write code in Perl…

      Thanks a lot for your comment, it is well appreciated :-) I have updated this post with code inspired by your comments.

  2. I’m glad that my comments inspired you to tweak your code. I think it looks great now, and it certainly fulfills an important role, so you could just leave it like this. If you wanted to move the project forward then there are additional changes that I can think of, but in fairness you may not need them. Especially in the case of monitoring, I’ve found that simple is good because it means that it is stable. However, you did ask for some ideas, so consider:

    * Pushing the code into a code repository such as GitHub, etc. so other people can use your code and, just as importantly, can comment on, and contribute to, your project. Third parties can provide you with a fresh set of eyes and give you ideas that you simply had not thought of. Of course, if this was created at work, then ensure that your company is okay with publishing it. Most would be with something like this, but you need to check and ideally get permission in writing so that you can’t get in trouble later.

    * Adding logging so that you know when something restarted. This could be useful in diagnosing a problem. For example, if Exim is restarted each day at 4am then you would definitely look at the log rotation scripts to see how they are trying to restart it. Perhaps they are doing something odd and so Exim is down for a short while. Up as long as possible is good, so logging can help you to confirm problems – which might be required if you get a customer complaint – and to track down and eliminate the issue.

    * Turning this into a stand-alone daemon process. Cron works well enough, but what happens if the clock is off or crond is killed? There are lots of Perl modules on CPAN to help with the boilerplate daemonization steps (closing inputs, forking and setsid()’ing, etc.)

    * … plus this way your application is a long-running process and so you could let it open a network socket so that it can be polled from another box to make sure that it is still running. This is more complicated, but nothing sucks more than thinking that your watchdog process is keeping things going only to have it die silently in the night and only find out because a process dies and isn’t restarted. You might even be able to integrate it into an existing monitoring solution, such as Nagios, to allow for better reporting.

    * Do you have situations on other servers where the management is more complex? For example do you need to add support for the restart code being a Perl sub to be executed instead of a shell command line? That makes it more complicated, so don’t add it if you don’t need it.

    * Consider how best to handle notifications that a service couldn’t be restarted: email, logging to an existing monitoring tool like Nagios, etc.

    * Do you use platform management tools like Puppet or Chef? Then write a manifest to distribute the bundle with a default config.

    * Perhaps you could add support to start the processes initially. This way you would spawn them as child processes and so, if they died off, you would receive a signal, and you would know right away that they had died. This is instead of doing a periodic poll of the process table, as your process could be down for the entire time between polls. This makes it much more like a system like daemontools, supervise, or runit.

    And at this point, I’m out of ideas for expansion. Hopefully this helped, and I’m glad that you liked the feedback!

  3. Very good work. Looks very useful. One comment about your updated code though. It now only works when the processes being monitored have a pid file. It will not work with the processes defined in the config file you show at the end of the post. You still need to keep the code that checks the command line in the absence of a pid file configuration.

  4. Good example of Proc::ProcessTable, which I wasn’t familiar with.

    Hard to tell since I only saw the code after you reworked it, but it looks like the bit that was using the ‘re’ part of your structure got lost. As a result, the current version is awfully dependent on pidfiles. What if the service is running but the pidfile is gone somehow? Looks to me like your script would just happily start it up again.

    It’s always good to simplify, but you have to be careful not to go too far. There’s unnecessary complexity, and then there’s necessary complexity. ;->
