Friday, November 7, 2008

Neat way to prevent multiple instances of a script

Sometimes you need to ensure that you can never have more than one instance of a script running at the same time. This is especially important with scripts that modify files, and it becomes more critical if the script runs for longer than a fraction of a second. If you have many people administering a system, it also becomes more important to ensure they don't step on each other's toes.

Basically the problem is known as the multiple writers problem, and it is solved by something called "semaphores". A semaphore is a feature implemented in the kernel to guarantee that a piece of code is "mutually exclusive", meaning that only one process can execute it at a time. I don't want to go into this here, as it is already properly explained on many websites, but have a look at the Wikipedia article if you are interested in the topic.

One technique to get around the problem of multiple instances of a script is to use the existence of a specific file to signal to other instances of the script that there is already a running instance. The "touch" command does not complain about existing files, so you need to check first whether the file exists already, exit if it does, and create it otherwise. For example:

if [ -f /tmp/already_running ] 
then
   echo Can not continue - the lock file already exists!
   echo If you are sure that no other instance of this script
   echo is running, delete the file /tmp/already_running and try again.
   exit 1
else
   touch /tmp/already_running
fi
....

Near the end of this script it is common to delete the file for the next use. This, however, is not the best solution.

If two instances of the script were started at nearly the same instant, what could happen is that instance 1 checks the file, finds it does not exist, but then gets kicked off the CPU so that instance 2 can run. Instance 2 then checks the lock file and also sees it is OK to continue, and then creates the lock. Instance 1 eventually gets CPU time again and, having already previously checked the lock file, believes it is safe to continue running.

This is known as a race condition, and it is exactly what semaphores are meant to prevent. But semaphores are not easily accessible in scripts, or so it might seem.
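The window between the check and the touch can be made visible with a small experiment. The following is only a sketch: the names naive_instance, workdir and log are made up for the illustration, mktemp is assumed to be available, and the sleep is there purely to exaggerate the gap in which an instance might lose the CPU.

```shell
#!/bin/sh
# Demonstration only: the sleep widens the gap between "check" and "touch"
# so that the race shows up reliably instead of once in a blue moon.
workdir=$(mktemp -d)                 # scratch directory (assumes mktemp exists)
lock="$workdir/already_running"
log="$workdir/entered.log"

naive_instance() {
    if [ -f "$lock" ]; then
        return 1                     # lock file seen: refuse to run
    fi
    sleep 1                          # pretend we were kicked off the CPU here
    touch "$lock"                    # touch succeeds even if the file now exists
    echo "instance $1 entered" >> "$log"
}

# Start two instances at nearly the same instant and wait for both.
naive_instance 1 &
naive_instance 2 &
wait

entered=$(wc -l < "$log" | tr -d ' ')
echo "instances that got past the check: $entered"
rm -rf "$workdir"
```

Because both instances perform the check before either one has created the file, both should normally get past it, which is exactly the situation the lock was supposed to rule out.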

Now I know you are asking "but what is the chance of such precise timing of scheduled CPU time causing this kind of race condition?" Yes, the chances are probably low, especially if you are the only person using a specific script. However, there is a proper way of checking that we are the only instance running, and it is even simpler to implement than the check-file-touch-file method!

The secret lies in the mkdir command.

if ! mkdir /tmp/already_running
then
   echo "Can not continue - the lock directory already exists!"
   echo "If you are sure that no other instance of this script"
   echo "is running, remove the directory /tmp/already_running and try again."
   exit 1
fi
....

mkdir is atomic: the kernel checks whether the directory exists and creates the entry in the file system as a single operation, using its own internal locking, so no other process can slip in between the check and the creation. It fails if the directory already exists, and on a successful return the lock is already in place, so no extra commands are needed to complete the mutex locking process.
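To convince yourself that only one instance can ever win, you can let several of them fight over the same directory. This is a minimal sketch rather than a production test: try_lock, workdir and log are hypothetical names, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Ten background instances race for the same lock directory.
workdir=$(mktemp -d)                 # scratch directory (assumes mktemp exists)
lock="$workdir/already_running"
log="$workdir/winners.log"
: > "$log"                           # start with an empty log

try_lock() {
    # mkdir either creates the directory or fails; there is no gap between
    # checking and creating in which another instance could sneak through.
    if mkdir "$lock" 2>/dev/null; then
        echo "instance $1 got the lock" >> "$log"
    fi
}

i=1
while [ "$i" -le 10 ]; do
    try_lock "$i" &
    i=$((i + 1))
done
wait

winners=$(wc -l < "$log" | tr -d ' ')
echo "instances that got the lock: $winners"
rmdir "$lock"
rm -rf "$workdir"
```

However the ten instances are scheduled, exactly one mkdir succeeds and the other nine fail with "File exists".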

The second part of this is to automate the release of the lock when the script exits. Typically you want the lock to be released whether the script terminates normally, terminates on an error, is killed by someone, or is interrupted with Ctrl-C.

This is done by means of an EXIT trap. The format for using traps in Bourne shell variants is:

trap "do-something-here" EXIT

This trap must be set AFTER obtaining the lock. Otherwise a second instance of the script will inadvertently remove the lock obtained by the first instance: having failed to obtain the lock, the second instance exits, and its EXIT trap then removes a lock which it never held.
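The difference the ordering makes can be shown with two subshells standing in for a second instance of the script. Treat this as an illustrative sketch: workdir, after_good and after_bad are names invented for the example, and mktemp is assumed to be available.

```shell
#!/bin/sh
workdir=$(mktemp -d)                 # scratch directory (assumes mktemp exists)
lock="$workdir/already_running"

mkdir "$lock"                        # pretend a first instance holds the lock

# Second instance, correct order: obtain the lock first, then set the trap.
(
    mkdir "$lock" 2>/dev/null || exit 1   # fails, exits before any trap is set
    trap 'rmdir "$lock"' EXIT
) || true                            # the failure is expected here
[ -d "$lock" ] && after_good=kept || after_good=removed

# Second instance, wrong order: the trap is set before the lock is obtained.
(
    trap 'rmdir "$lock"' EXIT
    mkdir "$lock" 2>/dev/null || exit 1   # fails, but the trap still fires
) || true
[ -d "$lock" ] && after_bad=kept || after_bad=removed

echo "trap set after mkdir:  lock $after_good"
echo "trap set before mkdir: lock $after_bad"
rm -rf "$workdir"
```

With the trap set after mkdir, the failed second instance leaves the first instance's lock alone. With the trap set first, the failed instance removes the lock on its way out, and a third instance could then start while the first one is still running.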

You obviously don't have to have an if-then-fi to print a message to the user - if you are the only person using a script, you can simplify the checking of the lock as follows:

MUTEX_LOCK=/tmp/myscript_already_running
mkdir "$MUTEX_LOCK" || exit 1
trap 'rmdir "$MUTEX_LOCK"' EXIT

With the above you will simply get an error message from mkdir, which you need to interpret as "the script is already running", e.g.:

mkdir: Failed to make directory "/tmp/myscript_already_running"; File exists

Using this technique, a whole script might look like this:

#!/bin/ksh
# This is myscript v1.0.

#Set up the running environment
MUTEX_LOCK=/tmp/myscript_already_running
...

if ! mkdir "$MUTEX_LOCK"
then
   echo "Can not continue - the lock directory already exists!"
   echo "If you are sure that no other instance of this script"
   echo "is running, remove the directory $MUTEX_LOCK and try again."
   exit 1
fi
trap 'rmdir "$MUTEX_LOCK"' EXIT
...

# Doing work which requires only one instance of the script to be running
...

# THE END

Note there is no "remove lock" statement at the end of the script. This is handled by the trap, which executes on any exit, except of course a kill -9.

In any case, kill -9 should only ever be used as a last resort, because it does not allow the program to clean up after itself.
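The release on exit is easy to check for the two ordinary cases, a normal exit and an exit with an error status. The sketch below runs the lock-and-trap lines in a subshell that stands in for the whole script; run_instance and workdir are hypothetical names, mktemp is assumed to be available, and signals are deliberately left out of the experiment.

```shell
#!/bin/sh
workdir=$(mktemp -d)                 # scratch directory (assumes mktemp exists)
MUTEX_LOCK="$workdir/myscript_already_running"

# Stand-in for the script: take the lock, set the trap, then exit with
# the status given as $1. The body runs in a subshell, like a separate run.
run_instance() (
    mkdir "$MUTEX_LOCK" || exit 1
    trap 'rmdir "$MUTEX_LOCK"' EXIT
    exit "$1"
)

run_instance 0 || true               # normal termination
[ -d "$MUTEX_LOCK" ] && after_ok=held || after_ok=released

run_instance 3 || true               # termination on an error
[ -d "$MUTEX_LOCK" ] && after_err=held || after_err=released

echo "after normal exit: lock $after_ok"
echo "after error exit:  lock $after_err"
rm -rf "$workdir"
```

In both cases the lock directory is gone once the instance has finished, without any explicit "remove lock" statement at the end.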

1 comment:

Mike Gerdts said...

There is a good discussion of this topic at the shell-discuss forum.

For an example of locking code that can clean up after itself and recover from kill -9 and unexpected reboots, see the locking routine I revised/wrote for postrun as part of the fix for 6471025.