Update: This article has been updated thanks to Colin Keith's excellent comment. I was extremely inspired by it.
Maintaining a large number of servers cannot be done without proper programming skills. Every good system administrator must therefore make sure they know how to automate their daily work.
Although many programming languages exist, most people will only write code in one. I happen to like Perl.
In this blog post, I am going to show how to create a script which can be deployed on all the Linux servers you maintain, to check that certain services keep running.
Of course, a tool such as Nagios together with NRPE and a configured event handler could also be used, but lately I was often in the situation that the NRPE daemon crashed, Nagios was spewing a lot of errors and the event handler… well, since NRPE was down, the event handler of course couldn't connect or do anything. So why rely on a remotely triggered action when a simple local script can do the job?
The following script checks a default list of services and can additionally load or override these services from a configuration file. A regular expression is used to check for running processes, and of course a startup command needs to be defined for each service. And that is all the script will and should do.
The script uses three CPAN modules: Proc::ProcessTable, YAML and File::Slurp. The first one is used to get a full listing of all running processes, the second one provides a means for using configuration files, and the third one is used to read PID files.
So let’s start our script:
#!/usr/bin/env perl

use strict;
use warnings;
use utf8;

use Proc::ProcessTable;
use YAML qw/LoadFile/;
use File::Slurp;

# Default set of processes to watch
my %default_services = (
    'NRPE' => {
        'cmd'     => '/etc/init.d/nagios-nrpe-server restart',
        're'      => '/usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d',
        'pidfile' => '/var/tmp/nagios-nrpe-server.pid',
    },
    'Freshclam' => {
        'cmd'     => '/etc/init.d/clamav-freshclam restart',
        're'      => '/usr/bin/freshclam -d --quiet',
        'pidfile' => '/var/tmp/clamav-freshclam.pid',
    },
    'Syslog-NG' => {
        'cmd'     => '/etc/init.d/syslog-ng restart',
        're'      => '/usr/sbin/syslog-ng -p /var/run/syslog-ng.pid',
        'pidfile' => '/var/run/syslog-ng.pid',
    },
    'VMToolsD' => {
        'cmd'     => '/etc/init.d/vmware-tools restart',
        're'      => '/usr/sbin/vmtoolsd',
        'pidfile' => '/var/tmp/vmtoolsd.pid',
    },
    'Munin-Node' => {
        'cmd'     => '/etc/init.d/munin-node restart',
        're'      => '/usr/sbin/munin-node',
        'pidfile' => '/var/tmp/munin-node.pid',
    },
);

my %services = %default_services;
So far, no rocket science: we load the required modules and define the default services that need to be checked.
Next part: check if there is a configuration file on disk. The script looks for the hard-coded path '/etc/default/watchdog.yaml':
# Check if there is a local config file and, if so, load its services into the hash
if( -f '/etc/default/watchdog.yaml' ){
    my $local_config = LoadFile '/etc/default/watchdog.yaml';
    %services = (%default_services, %{ $local_config->{services} });
}
The last Perl statement allows you to overwrite one or more (or even all) of the default services: because the local configuration is merged in after the defaults, its keys win.
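As a minimal illustration of that merge behaviour (the 'Apache' entry here is made up):

# Defaults on the left, local config on the right: later keys win
my %defaults = ( 'NRPE' => { cmd => 'default restart cmd' } );
my %local    = (
    'NRPE'   => { cmd => 'overridden restart cmd' },      # replaces the default
    'Apache' => { cmd => '/etc/init.d/apache2 restart' }, # added on top
);
my %merged = (%defaults, %local);
# %merged now holds the overridden NRPE entry plus the new Apache entry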
Now let's see if these processes are actually running. The following code was heavily inspired by Colin Keith's comment below; I have combined his examples with my own code.
Let's first have a look at the code:
# Get the current process table
my $processes = Proc::ProcessTable->new;

my %procs;
my %matched_procs;
foreach my $p (@{ $processes->table }){
    $procs{ $p->{pid} } = $p->{cmndline};

    foreach my $s (keys %services){
        if($p->{cmndline} =~ m#$services{$s}->{re}#){
            $matched_procs{$s}++;
            last;
        }
    }
}

# Search the process table for services that are not running
foreach my $service ( keys %services ) {
    if( exists($services{$service}->{pidfile})
        && -f $services{$service}->{pidfile} )
    {
        chomp( my $pid = read_file( glob($services{$service}->{pidfile}) ) );

        # If we got a PID, ensure that it is running and that we can signal it
        $pid && exists($procs{$pid}) && kill(0, $pid) && next;

        # No running process behind this PID file, so remove the stale file
        unlink( $services{$service}->{pidfile} );
    }
    else {
        # No PID file configured: check whether the process regex matched
        if( exists($matched_procs{$service}) ){
            # The process is running, it just has no PID file
            next;
        }
    }

    # Execute the service restart command
    system( $services{$service}->{'cmd'} );

    # Check the exit code of the restart command
    if ($? == -1) {
        print "Failed to restart '$service' with '$services{$service}->{cmd}': $!\n";
    }
    elsif ($? & 127) {
        printf "Restart of '$service' died with signal %d, %s coredump\n",
            ($? & 127), ($? & 128) ? 'with' : 'without';
    }
    else {
        printf "Process '$service' successfully restarted, exit status: %d\n", $? >> 8;
    }
}
The first block retrieves the current process table. We save that information in two hashes with a little less detail, because we actually only need the PID and the command line of each process: %procs maps each PID to its command line, and %matched_procs records which configured services matched a running process.
The second foreach loop walks through the services defined in the %services hash.
Inspired by Colin's comment, we first check whether a PID file is configured for the service and whether it still exists on disk. If it does, we verify that the PID stored in the PID file exists in the process list we saved in %procs.
If the process is still running and we can signal it, we move on to the next service (the '&& next' part).
If the process is not running anymore but the PID file is still in the defined path, the stale PID file is removed.
Otherwise, if no PID file was configured, we check the process list with the regular expression defined for that service. We already populated the %matched_procs hash in the first loop for exactly this purpose: if the service exists in that hash, the process is running (just without a PID file) and we skip to the next service.
If neither check found a running process, the service is restarted with the configured command.
I execute it with the 'system' function since I want the output of this command to go directly to STDOUT. And of course, the last thing to do is to check whether the process started up correctly by inspecting its exit code in $?.
Now save the script as, for instance, 'watchdog.pl' and configure it as a cron job.
Example:
*/5 * * * * root /usr/local/bin/watchdog.pl
And here’s an example of the configuration file:
services:
  Exim-Mailserver:
    cmd: /etc/init.d/exim4 restart
    re: /usr/sbin/exim4 -bd -q30m
  Ossec-Agent:
    cmd: /etc/init.d/ossec restart
    re: !!perl/regexp '(?:ossec-agentd|ossec-logcollector|ossec-syscheckd)'
Link to script source code: https://github.com/insani4c/perl_tools/tree/master/watchdog
> unless( grep m#$services{$service}->{re}#,
>         map { $_->{cmndline} } @{ $processes->table } ){
> …

> Now, that "unless" statement may seem like a lot of kung-fu but actually it
> isn't.
> The $processes->table call returns an arrayref of hashrefs …
Nice article, but can I suggest that if you need to explain something like this, then it can be simplified? You're iterating through the list of processes and getting the command line for each service that you want to check, so you're performing this same operation five times.
Here you can trade memory for speed and code clarity (there's always a trade-off) by performing the iteration and map {} once and stashing the results in an array:
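Something like this, with the array name @cmdlines assumed:

# Walk the process table once, keeping only the command lines
my @cmdlines = map { $_->{cmndline} } @{ $processes->table };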
You can improve the clarity of your unless() with any of the following; all four are sketched together below.
My preferred method is a '|| next' to bug out as soon as you know a situation is good or bad.
The second option is an inline if.
Trailing conditionals are common too, but personally I find them distracting.
And finally there is unless() itself; I really don't like it myself, but I know it is common.
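Roughly, assuming the @cmdlines array stashed above:

# 1. Bail out with || next as soon as the service is known to be running
grep( m#$services{$service}->{re}#, @cmdlines ) || next;

# 2. Inline if
if( !grep( m#$services{$service}->{re}#, @cmdlines ) ){
    system( $services{$service}->{cmd} );
}

# 3. Trailing conditional
next if grep m#$services{$service}->{re}#, @cmdlines;

# 4. unless()
unless( grep m#$services{$service}->{re}#, @cmdlines ){
    system( $services{$service}->{cmd} );
}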
Myself, I would skip pushing into @missing_services and then looping through that, and instead merge the two blocks of code, because you're now bugging out early:
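Along these lines, again assuming @cmdlines:

foreach my $service ( keys %services ){
    # Bug out early if the service is already running
    next if grep m#$services{$service}->{re}#, @cmdlines;

    # Otherwise restart it right here; no @missing_services needed
    system( $services{$service}->{cmd} );
}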
This could also be a nested if(grep { … }) { system() … }, but that means more levels of indentation.
Of course, if this code gets too long, then move it into a subroutine and call the sub from within your for() loop.
This is more to make the code easier to read and maintain. Otherwise this is a useful tool. One suggestion would be to anchor your patterns, and for better security and peace of mind, my other suggestion would be to check the PIDs for each service.
For example:
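A sketch of the idea (read_file comes from File::Slurp; %procs maps running PIDs to command lines as in the article):

my $pidfile = $services{$service}->{pidfile};
if( defined $pidfile && -f $pidfile ){
    chomp( my $pid = read_file($pidfile) );
    # kill 0 sends no signal; it only tests that the process
    # exists and that we are allowed to signal it
    next if $pid && exists $procs{$pid} && kill 0, $pid;
}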
This makes it harder for something to go wrong, because you're checking that the process is using the PID that it is meant to be using, and you don't end up in some weird situation in which the exim daemon actually died but the script thinks it is still running because of some child process. More checks simply mean less chance of thinking everything is okay when it isn't.
You could add tests for the process UID, etc., and also check whether the process is eating too many resources (CPU, memory and so on) to give you a full system. And then… you know, in all of that glorious free time that you have 🙂 you could daemonize it, maybe with Proc::Daemon, so that it doesn't rely on cron, which can be blatted too. But if you do that, then you need to be able to connect to it from the outside to test that your watchdog daemon itself hasn't been blatted. Watching the watcher, as they say. Nice work though!
Now this comment just proves my love for Perl: there are at least a million ways to write code in Perl…
Thanks a lot for your comment, it is well appreciated 🙂 I have updated this post with code inspired by your comments.
I'm glad that my comments inspired you to tweak your code. I think it looks great now, and it certainly fulfills an important role, so you could just leave it like this. If you wanted to move the project forward then there are additional changes that I can think of, but in fairness you may not need them. Especially in the case of monitoring, I've found that simple is good, because it means that it is stable. However, you did ask for some ideas, so consider:
* Pushing the code into a code repository such as GitHub, etc. so other people can use your code and, just as importantly, can comment on, and contribute to, your project. Third parties can provide you with a fresh set of eyes and give you ideas that you simply had not thought of. Of course, if this was created at work, then ensure that your company is okay with publishing it. Most would be with something like this, but you need to check and ideally get permission in writing so that you can’t get in trouble later.
* Adding logging so that you know when something restarted. This could be useful in diagnosing a problem. For example, if Exim is restarted each day at 4am then you would definitely look at the log rotation scripts to see how they are trying to restart it. Perhaps they are doing something odd and so Exim is down for a short while. Up as long as possible is good, so logging can help you to confirm problems – which might be required if you get a customer complaint – and to track down and eliminate the issue.
* Turning this into a stand-alone daemon process. Cron works well enough, but what happens if the clock is off or crond is killed? There are lots of Perl modules on CPAN to help with the boilerplate daemonization steps (closing inputs, forking and setsid()'ing, etc.); a small sketch follows this list.
* … plus this way your application is a long-running process, so you could let it open a network socket so that it can be polled from another box to make sure that it is still running. This is more complicated, but nothing sucks more than thinking that your watchdog process is keeping things going, only to have it die silently in the night and to find out because a process dies and isn't restarted. You might even be able to integrate it into an existing monitoring solution, such as Nagios, to allow for better reporting.
* Do you have situations on other servers where the management is more complex? For example do you need to add support for the restart code being a Perl sub to be executed instead of a shell command line? That makes it more complicated, so don’t add it if you don’t need it.
* Consider how best to handle notifications that a service couldn't be restarted: email, logging to an existing monitoring tool like Nagios, etc.
* Do you use platform management tools like Puppet or Chef? Then write a manifest to distribute the bundle with a default config.
* Perhaps you could add support to start the processes initially. This way you would spawn them as child processes and so, if they died off, you would receive a signal, and you would know right away that they had died. This is instead of doing a periodic poll of the process table, as your process could be down for the entire time between polls. This makes it much more like a system like daemontools, supervise, or runit.
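As a minimal sketch of that daemonization idea, using Proc::Daemon (the run_checks() sub and the interval are placeholders):

#!/usr/bin/env perl
use strict;
use warnings;
use Proc::Daemon;

# Fork, setsid() and detach from the terminal in one call
Proc::Daemon::Init();

sub run_checks {
    # Hypothetical wrapper around the watchdog logic from the article
    warn "running service checks\n";
}

# Placeholder main loop: repeat the checks every 5 minutes
while (1) {
    run_checks();
    sleep 300;
}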
And at this point, I'm out of ideas for expansion. Hopefully this helped, and I'm glad that you liked the feedback!
Very good work, this looks very useful. One comment about your updated code though: it now only works when the processes being monitored have a PID file. It will not work with the processes defined in the config file you show at the end of the post. You still need to keep the code that checks the command line in the absence of a PID file configuration.
Good example of Proc::ProcessTable, which I wasn’t familiar with.
Hard to tell since I only saw the code after you reworked it, but it looks like the bit that was using the ‘re’ part of your structure got lost. As a result, the current version is awfully dependent on pidfiles. What if the service is running but the pidfile is gone somehow? Looks to me like your script would just happily start it up again.
It’s always good to simplify, but you have to be careful not to go too far. There’s unnecessary complexity, and then there’s necessary complexity. ;->
Yeah you’re right! I have updated this blog post once more 😉
Very nice post. 🙂 Thanks
Using the code above, I get the error “Not an ARRAY reference at wd.pl line 48”, which for me is the line stating:
foreach my $p (@{ $processes }){
Is this due to some change in the module, or in perl?
My question can be ignored. I got the source from github and it has the corrections in it that make the file work like a charm. Very useful tool. Thank you.
I ran into an issue a while back where I believe a stun event in VMware caused a few processes to die in a guest. Each group has a manager, but this group did not. I was going to copy a manager from another group, modify it, and have it manage this group. I decided to extend your idea instead.
I use databases to create a concept of “organizations”. Each daemon type will spawn one child per org. That is why I started from a manager.
I extended your code to have a default section for $services. I then build the other sections from the database. Instead of a cron job, I'm running it as a daemon. When changes are made to an org, a daemon from a group will be restarted. In your program I implemented a UNIX socket where I can connect, dump status, disable a program, enable a program, stop, etc.
Each org has an org_config table that I use to disable a program there. When building the config I check that. I also check for config changes each 120s cycle. Before, when I added an org, I had to restart the whole group. With the mods I made to your program, I just create it and it works.
I did this as a rush job for that one VMware incident and feature creep hit at light speed. It is a mess, so I'll clean it up and send it to you. You can contact me via my email if you wish.
Chris
Hi Chris
I’d be more than happy to see your suggestions or corrections 😉
I can't say I have any suggestions that would make your script better. My changes are specific to what I'm doing, and I thought you'd be interested in seeing how I applied it to the problems I'm facing.
I’ll add one more comment.
My original approach to keeping these programs up is to run each process in the foreground under a parent. On SIGCHLD the parent makes a note and starts it back up.
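A bare-bones sketch of that pattern, using a blocking waitpid() rather than a SIGCHLD handler for simplicity (the command is made up):

#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical command to keep running in the foreground
my @cmds = ('/usr/sbin/some-daemon --foreground');
my %kids;

sub spawn {
    my ($cmd) = @_;
    defined( my $pid = fork() ) or die "fork failed: $!";
    if( $pid == 0 ){ exec $cmd or die "exec failed: $!"; }
    $kids{$pid} = $cmd;    # remember which child runs what
}

spawn($_) for @cmds;

# Wait for any child to exit, note it, and start it back up
while( (my $pid = waitpid(-1, 0)) > 0 ){
    my $cmd = delete $kids{$pid} or next;
    spawn($cmd);
}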
I never considered the approach of just monitoring /proc. That can be difficult, and I never considered using a regular expression as a check either.
As a daemon I do run the risk of it dying. Your solution was cron. Since only one copy can run on a system, I typically lock $0 and check it on startup to make sure another copy is not running. Using that idea I can stay a daemon, and cron could still try to start it every hour: a second copy will just fail to lock and exit, while if the daemon died, the next cron run would bring it back up.
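A minimal sketch of that locking trick:

use strict;
use warnings;
use Fcntl ':flock';

# Take an exclusive, non-blocking lock on the script file itself;
# a second copy started by cron will fail here and simply exit
open( my $self, '<', $0 ) or die "Cannot open $0: $!";
flock( $self, LOCK_EX | LOCK_NB ) or exit 0;

# ... daemon main loop continues here, holding the lock ...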
I run a lot of mysqld slaves on one host as hot spares for many systems over direct VPN connections. The systems have a point-to-point VPN directly to the backup system. When I need to reset replication, I sometimes run into a problem shutting down the slave for that instance.
Here is a bit of test code to solve it. Of course, use mysqladmin to shut down that instance first. I only bring this up because I just reset an instance and mysqladmin was in a constant loop of connecting. The PID file was there, but the service for that one was not.
sub get_pid {
    my $re = shift;

    my $processes = Proc::ProcessTable->new();
    foreach my $p (@{ $processes->table() }){
        if($p->{'cmndline'} =~ m#$re#) {
            # Winner Winner Chicken Dinner
            return $p->{'pid'};
        }
    }
    return 0;
}
my $pid = get_pid('mysqld_safe.+15034.+');
printf "PID: %d\n", $pid;

$pid = get_pid('mysqld .+15034.+');
printf "PID: %d\n", $pid;