dbi Blog

Ideally, though not necessarily, Documentum processes are managed at the service level, e.g. by systemd. In my blog here, I showed how to configure init files for Documentum under systemd. But containers don’t have systemd, yet. They just run processes, often only one, sometimes more if they are closely related (e.g. the docbroker, the method server and the content servers), so how can the same functionality be replicated with containers?
The topic of stopping processes in a docker container is abundantly discussed online (see for example the excellent article here). O/S signals are the magic solution, so much so that I really should have entitled this blog “Fun with the Signals”!
Here, I’ll simply check whether the presented approach can be applied to the particular case of a dockerized Documentum server. However, in order to keep things simple and quick, I won’t test a real dockerized Documentum installation but rather use a script to simulate the Documentum processes, or any other processes for that matter, since the script is generic.
But first, why bother with this matter? During all the years that I have been administering repositories, I’ve never noticed anything going wrong after restarting a suddenly stopped server, be it after an intentional kill, a pesky crash or an unplanned database unavailability. Evidently, the content server (CS henceforth) seems quite robust in this respect. Or maybe we have simply been lucky so far. Personally, I don’t feel confident if I don’t cleanly shut down a process or service that must be stopped; some data might still be buffered in the CS’ memory and not flushing it properly might introduce inconsistencies or even corruption. The same goes when an unsuspected multi-step operation is started and then abruptly aborted in the middle; ideally, transactions, if they are used, exist for this purpose but anything can go wrong during a rollback. Killing a process is like slamming a door: it produces a lot of noise, vibrations in the walls, even damage in the long run, and always leaves a bad impression behind. Isn’t it more comforting to clean up and shut the door gently? Even then something can go wrong but at least it will be through no fault of our own.

A few Reminders

When a “docker container stop” is issued, docker sends the SIGTERM signal to the process with PID == 1 running inside the container. That process, if programmed to do so, can then react to the signal and do whatever it sees fit, typically shutting the running processes down cleanly. If the container is still running after a grace period of 10 seconds by default, docker sends SIGKILL and the container is stopped manu militari. As for the Documentum processes, to put it politely, they don’t give a hoot about signals, except of course the well-known, unceremonious SIGKILL one. Thus, a proxy process must be introduced which will accept the signal and invoke the proper shutdown scripts to stop the CS processes, usually the dm_shutdown_* and dm_stop_* scripts, or a generic one that takes care of everything, at startup and at shutdown time.
Said proxy must run with PID == 1, i.e. it must be the first process started in the container. Well, sort of: even if it is not the very first, its PID 1 parent can pass control to it by using one of the exec() family functions; unlike forking a process, those in effect allow a child program to replace its parent under the latter’s PID, kind of like in the Matrix movies the agents Smith inject themselves into someone else’s persona, if you will ;-). The main thing is that at some point the proxy becomes PID 1. Luckily for us, we don’t have to bother with this complexity, for the dockerfile’s ENTRYPOINT[] clause takes care of everything.
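The difference between exec'ing and forking can be observed from any bash shell. The following is only a minimal sketch, unrelated to any real Documentum installation, and the variable names are made up for the illustration. In the first case, the inner shell replaces the outer one and reports the same PID; in the second case, it is forked as a child and gets a PID of its own.

```bash
#!/bin/bash
# exec: the outer shell prints its PID and is then replaced by the inner
# shell, which prints its own PID; both belong to the same process;
mapfile -t exec_pids < <(bash -c 'echo "$$"; exec bash -c "echo \$\$"')
echo "exec: before=${exec_pids[0]} after=${exec_pids[1]}"

# fork: without exec, the inner shell is a child with its own PID;
# the trailing "true" prevents bash from optimizing the fork away;
mapfile -t fork_pids < <(bash -c 'echo "$$"; bash -c "echo \$\$"; true')
echo "fork: parent=${fork_pids[0]} child=${fork_pids[1]}"
```

The two numbers on the first output line are identical whereas those on the second one differ.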
The proxy will also be the one that starts the CS. In addition, since it must wait for the SIGTERM signal, it must never exit. It can wait indefinitely on a fake input (e.g. tail -f /dev/null), or wait for an illusory input in a detached container (e.g. while true; do read; done) or, better yet, do something useful like some light-weight monitoring.
While at it, the proxy process can listen to several conventional signals and react accordingly. For instance, a SIGUSR1 could mean “give me a docbroker docbase map” and a SIGUSR2 “restart the method server”. Admittedly, these actions could be done directly by just executing the relevant commands inside the container or from the outside command-line but the signal way is cheaper and, OK, more fun. So, let’s see how we can set all this up!
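A caveat applies when the proxy is a bash script: bash runs a trap handler only once the current foreground command has completed, so a tail -f /dev/null left in the foreground would postpone the handler forever, whereas the wait builtin returns as soon as a trapped signal arrives. The sketch below is a variation on the waiting techniques above that relies on this behavior; proxy and stop_services are hypothetical names, the latter standing in for the real shutdown scripts.

```bash
#!/bin/bash
# Sketch of a proxy that idles, yet reacts to SIGTERM without delay.
proxy() {
   # hypothetical placeholder for the dm_shutdown_* and dm_stop_* scripts;
   stop_services() { echo "stopping services cleanly"; }
   trap 'stop_services; kill "$sleeper" 2> /dev/null; exit 0' SIGTERM
   echo "ready"
   while true; do
      # sleep in the background and wait on it: the wait builtin is
      # interrupted by a trapped signal, a foreground sleep is not;
      sleep 1 &
      sleeper=$!
      wait "$sleeper"
   done
}

log=$(mktemp)
proxy > "$log" &
proxy_pid=$!

# give the proxy up to 5 seconds to install its trap, then stop it;
for _ in $(seq 50); do grep -q ready "$log" && break; sleep 0.1; done
kill -s TERM "$proxy_pid"
wait "$proxy_pid"
proxy_status=$?

proxy_output=$(cat "$log")
rm -f "$log"
echo "$proxy_output"
echo "proxy exited with status $proxy_status"
```

The proxy runs here as a background function only so that the demonstration can signal it; in a container it would be the entry point itself.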

The implementation

As said, in order to focus on our topic, i.e. signal trapping, we’ve replaced the CS part with a simple simulation script, dctm.sh, that starts, stops and queries the status of dummy processes. It uses the bash shell and has been written under linux. Here it is:

#!/bin/bash
# launches in the background, stops or queries the status of a command with a conventional identification string in its process name;
# the id is a random number determined during the start;
# it should be passed to query the status of the started process or to stop it;
# Usage:
#   ./dctm.sh stop  | start | status 
# e.g.:
# $ ./dctm.sh start
# | process started with pid 13562 and random value 33699963
# $ psg 33699963
# | docker   13562     1  0 23:39 pts/0    00:00:00 I am number 33699963
# $ ./dctm.sh status 33699963
# $ ./dctm.sh stop 33699963
#
# cec - dbi-services - April 2019
#
trap 'help'         SIGURG
trap 'start_all'    SIGPWR
trap 'start_one'    SIGUSR1
trap 'status_all'   SIGUSR2
trap 'stop_all'     SIGINT SIGABRT
trap 'shutdown_all' SIGHUP SIGQUIT SIGTERM

verb="sleep"
export id_prefix="I am number"

func() {
   cmd="$1"
   case $cmd in
      start)
         # do something that sticks forever-ish, min ca. 20mn;
         (( params = 1111 * $RANDOM ))
         exec -a "$id_prefix" $verb $params &
         echo "process started with pid $! and random value $params"
         ;;
      stop)
         params=" $2"
         pid=$(ps -ajxf | gawk -v params="$params" '{if (match($0, " " ENVIRON["id_prefix"] params "$")) pid = $2} END {print (pid ? pid : "")}')
         if [[ -n $pid ]]; then
            kill -9 ${pid} &> /dev/null
            wait ${pid} &> /dev/null
         fi
         ;;
      status)
         params=" $2"
         read pid gid < <(ps -ajxf | gawk -v params="$params" '{if (match($0, " " ENVIRON["id_prefix"] params "$")) pid = $2 " " $3} END {print (pid ? pid : "")}')
         if [[ -n $pid ]]; then
            echo "random value${params} is used by process with pid $pid and pgid $gid"
         else
            echo "no such process running"
         fi
         ;;
   esac
}

help() {
   echo
   echo "send signal SIGURG for help"
   echo "send signal SIGPWR to start a few processes"
   echo "send signal SIGUSR1 to start a new process"
   echo "send signal SIGUSR2 for the list of started processes"
   echo "send signal SIGINT | SIGABRT  to stop all the processes"
   echo "send signal SIGHUP | SIGQUIT | SIGTERM to shutdown the processes and exit the container"
}

start_all() {
   echo; echo "starting a few processes at $(date +"%Y/%m/%d %H:%M:%S")"
   for loop in $(seq 5); do
      func start
   done

   # show them;
   echo; echo "started processes"
   ps -ajxf | grep "$id_prefix" | grep -v grep
}

start_one() {
   echo; echo "starting a new process at $(date +"%Y/%m/%d %H:%M:%S")"
   func start
}

status_all() {
   echo; echo "status of running processes at $(date +"%Y/%m/%d %H:%M:%S")"
   for no in $(ps -ef | grep "I am number " | grep -v grep | gawk '{print $NF}'); do
      echo "showing $no"
      func status $no
   done
}

stop_all() {
   echo; echo "shutting down the processes at $(date +"%Y/%m/%d %H:%M:%S")"
   for no in $(ps -ef | grep "I am number " | grep -v grep | gawk '{print $NF}'); do
      echo "stopping $no"
      func stop $no
   done
}

shutdown_all() {
   echo; echo "shutting down the container at $(date +"%Y/%m/%d %H:%M:%S")"
   stop_all
   exit 0
}

# -----------
# main;
# -----------

# starts a few dummy processes;
start_all

# display some usage explanation;
help

# make sure the container stays up and waits for signals;
while true; do read; done

The main part of the script starts a few processes, displays a help screen and then waits for input from stdin.
The script can be first tested outside a container as follows.
Run the script:
./dctm.sh
It will start a few easily distinguishable processes and display a help screen:
starting a few processes at 2019/04/06 16:05:35
process started with pid 17621 and random value 19580264
process started with pid 17622 and random value 19094757
process started with pid 17623 and random value 18211512
process started with pid 17624 and random value 3680743
process started with pid 17625 and random value 18198180
started processes
17619 17621 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 19580264
17619 17622 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 19094757
17619 17623 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 18211512
17619 17624 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 3680743
17619 17625 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 18198180
send signal SIGURG for help
send signal SIGPWR to start a few processes
send signal SIGUSR1 to start a new process
send signal SIGUSR2 for the list of started processes
send signal SIGINT | SIGABRT to stop all the processes
send signal SIGHUP | SIGQUIT | SIGTERM to shutdown the processes and exit the container
Then, it will simply sit there and wait until it is asked to quit.
From another terminal, let’s check the started processes:
ps -ef | grep "I am number " | grep -v grep
docker 17621 17619 0 14:40 pts/0 00:00:00 I am number 19580264
docker 17622 17619 0 14:40 pts/0 00:00:00 I am number 19094757
docker 17623 17619 0 14:40 pts/0 00:00:00 I am number 18211512
docker 17624 17619 0 14:40 pts/0 00:00:00 I am number 3680743
docker 17625 17619 0 14:40 pts/0 00:00:00 I am number 18198180
Those processes could be Documentum ones or anything else; the point here is to control them from the outside, e.g. from another terminal session, in or out of a docker container. We will do that through O/S signals. The bash shell lets a script listen and react to signals through the trap command. At the top of the script, we have listed all the signals we’d like the script to react to:

trap 'help'         SIGURG
trap 'start_all'    SIGPWR
trap 'start_one'    SIGUSR1
trap 'status_all'   SIGUSR2
trap 'stop_all'     SIGINT SIGABRT
trap 'shutdown_all' SIGHUP SIGQUIT SIGTERM

It’s really a feast of traps!
The first line for example says that on receiving the SIGURG signal, the script’s function help() should be executed, no matter what the script was doing at that time, which in our case is just waiting for input from stdin.
The SIGPWR signal is interpreted as a request to start in the background another set of five processes with the same naming convention, “I am number ” followed by a random number. The function start_all() is called on receiving this signal.
The SIGUSR1 signal starts one new process in the background. Function start_one() does just this.
The SIGUSR2 signal displays all the started processes so far by invoking function status_all().
The SIGINT and SIGABRT signals shut down all the processes started so far. Function stop_all() is called for this purpose.
Finally, signals SIGHUP, SIGQUIT and SIGTERM all invoke function shutdown_all() to stop all the processes and exit the script.
Admittedly, this choice of signals is a bit of a stretch but it is for the sake of the demonstration, so bear with us. Feel free to remap the signals to the functions any way you prefer.
Now, how to send those signals? The ill-named kill command or program is here for this. Despite its name, nobody will be killed here, fortunately; signals are sent and it is up to the receiving processes to react appropriately. Here, of course, we do react appropriately.
Here is its syntax (let’s use the --long-options for clarity):
/bin/kill --signal signal_name pid
Since bash has a built-in kill command that behaves differently, make sure to call the right program by specifying its full path name, /bin/kill.
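To check which kill a plain invocation resolves to, bash's type builtin can be used. This is a small sketch; the variable names are made up, and the location of the external program depends on the distribution, so it is only displayed here, not assumed.

```bash
#!/bin/bash
# what does a plain "kill" resolve to in bash ?
kind=$(type -t kill)

# path of the external kill program found in $PATH, if any;
external=$(type -P kill || true)

echo "a plain kill is a shell $kind"
echo "external kill program: ${external:-none found in PATH}"
```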
Example of use:
/bin/kill --signal SIGURG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
# or shorter:
/bin/kill --signal SIGURG $(pgrep ^dctm.sh$)
The signal’s target is our test program dctm.sh, which is identified vis a vis kill through its PID.
Signals can be specified by their full name, e.g. SIGURG, SIGPWR, etc… or without the SIG prefix such as URG, PWR, etc … or even through their numeric value as shown below:
/bin/kill -L
 1 HUP      2 INT      3 QUIT     4 ILL      5 TRAP     6 ABRT     7 BUS
 8 FPE      9 KILL    10 USR1    11 SEGV    12 USR2    13 PIPE    14 ALRM
15 TERM    16 STKFLT  17 CHLD    18 CONT    19 STOP    20 TSTP    21 TTIN
22 TTOU    23 URG     24 XCPU    25 XFSZ    26 VTALRM  27 PROF    28 WINCH
29 POLL    30 PWR     31 SYS
or:
kill -L
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX
Thus, the following incantations are equivalent:
/bin/kill --signal SIGURG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal URG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal 23 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
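Since signal numbers vary between platforms, the mapping between names and numbers is best looked up rather than memorized. A minimal sketch with bash's built-in kill follows; the variable names are only illustrative, and the value 23 for URG is the one from the Linux listing above.

```bash
#!/bin/bash
# name -> number, with and without the SIG prefix;
num=$(kill -l URG)
num_long=$(kill -l SIGURG)

# number -> name (bash prints it without the SIG prefix);
name=$(kill -l "$num")

echo "URG is signal number $num, SIGURG is $num_long, number $num is $name"
```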
On receiving a supported signal, the related function is invoked and thereafter the script returns to its former activity, namely the loop that waits for a fake input. The loop is needed because otherwise the script would exit on returning from a trap handler. In effect, the trap is processed like a function call and, on returning, control is given to the statement that follows the point where the trap occurred. If there is none, the script terminates. Hence the loop.
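This call-and-return behavior can be verified with a script that sends a signal to itself. The following is just a sketch with made-up names: the events array records the order in which the statements and the handler are executed.

```bash
#!/bin/bash
events=()
trap 'events+=("handler")' SIGUSR1

events+=("before")
# $BASHPID is the PID of the current bash process;
kill -s USR1 "$BASHPID"     # the handler runs once this command completes;
events+=("after")           # then execution resumes here;

echo "${events[*]}"         # prints: before handler after
```

Since nothing follows the last statement, this script simply ends, which is precisely why dctm.sh needs its waiting loop.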
Here is the output after sending a few signals; for clarity, the signals sent from another terminal have been manually inserted as comments before the output they caused.
Output terminal:
# SIGUSR2:
status of running processes at 2019/04/06 16:12:46
showing 28046084
random value 28046084 is used by process with pid 29248 and pgid 29245
showing 977680
random value 977680 is used by process with pid 29249 and pgid 29245
showing 26299592
random value 26299592 is used by process with pid 29250 and pgid 29245
showing 25982957
random value 25982957 is used by process with pid 29251 and pgid 29245
showing 27830550
random value 27830550 is used by process with pid 29252 and pgid 29245
5 processes found
# SIGUSR1:
starting a new process at 2019/04/06 16:18:56
process started with pid 29618 and random value 22120010
# SIGUSR2:
status of running processes at 2019/04/06 16:18:56
showing 28046084
random value 28046084 is used by process with pid 29248 and pgid 29245
showing 977680
random value 977680 is used by process with pid 29249 and pgid 29245
showing 26299592
random value 26299592 is used by process with pid 29250 and pgid 29245
showing 25982957
random value 25982957 is used by process with pid 29251 and pgid 29245
showing 27830550
random value 27830550 is used by process with pid 29252 and pgid 29245
showing 22120010
random value 22120010 is used by process with pid 29618 and pgid 29245
6 processes found
# SIGURG:
send signal SIGURG for help
send signal SIGPWR to start a few processes
send signal SIGUSR1 to start a new process
send signal SIGUSR2 for the list of started processes
send signal SIGINT | SIGABRT to stop all the processes
send signal SIGHUP | SIGQUIT | SIGTERM to shutdown the processes and exit the container
# SIGINT:
shutting down the processes at 2019/04/06 16:20:17
stopping 28046084
stopping 977680
stopping 26299592
stopping 25982957
stopping 27830550
stopping 22120010
6 processes stopped
# SIGUSR2:
status of running processes at 2019/04/06 16:20:18
0 processes found
# SIGPWR:
starting a few processes at 2019/04/06 16:20:50
process started with pid 29959 and random value 2649735
process started with pid 29960 and random value 14971836
process started with pid 29961 and random value 14339677
process started with pid 29962 and random value 4460665
process started with pid 29963 and random value 12688731
5 processes started
started processes:
29245 29959 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 2649735
29245 29960 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 14971836
29245 29961 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 14339677
29245 29962 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 4460665
29245 29963 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 12688731
# SIGUSR2:
status of running processes at 2019/04/06 16:20:53
showing 2649735
random value 2649735 is used by process with pid 29959 and pgid 29245
showing 14971836
random value 14971836 is used by process with pid 29960 and pgid 29245
showing 14339677
random value 14339677 is used by process with pid 29961 and pgid 29245
showing 4460665
random value 4460665 is used by process with pid 29962 and pgid 29245
showing 12688731
random value 12688731 is used by process with pid 29963 and pgid 29245
5 processes found
# SIGTERM:
shutting down the container at 2019/04/06 16:21:42
shutting down the processes at 2019/04/06 16:21:42
stopping 2649735
stopping 14971836
stopping 14339677
stopping 4460665
stopping 12688731
5 processes stopped
In the command terminal:
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR1 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGURG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGINT $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGPWR $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGTERM $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
Of course, sending the untrappable SIGKILL signal will abort the process that executes dctm.sh. However, its child processes will survive and be reparented to the init process (PID 1):
...
status of running processes at 2019/04/10 22:38:25
showing 19996889
random value 19996889 is used by process with pid 24520 and pgid 24398
showing 5022831
random value 5022831 is used by process with pid 24521 and pgid 24398
showing 1363197
random value 1363197 is used by process with pid 24522 and pgid 24398
showing 18185959
random value 18185959 is used by process with pid 24523 and pgid 24398
showing 10996678
random value 10996678 is used by process with pid 24524 and pgid 24398
5 processes found
# /bin/kill --signal SIGKILL $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
Killed
ps -ef | grep number | grep -v grep
docker 24520 1 0 22:38 pts/1 00:00:00 I am number 19996889
docker 24521 1 0 22:38 pts/1 00:00:00 I am number 5022831
docker 24522 1 0 22:38 pts/1 00:00:00 I am number 1363197
docker 24523 1 0 22:38 pts/1 00:00:00 I am number 18185959
docker 24524 1 0 22:38 pts/1 00:00:00 I am number 10996678
# manual killing those processes;
ps -ef | grep number | grep -v grep | gawk '{print $2}' | xargs kill -9
ps -ef | grep number | grep -v grep
<empty>
# this works too:
kill -9 $(pgrep -f "I am number [0-9]+$")
# or, shorter:
pkill -f "I am number [0-9]+$"
Note that there is a simpler way to kill those related processes: by using their PGID, or process group id:
ps -axjf | grep number | grep -v grep
1 25248 25221 24997 pts/1 24997 S 1000 0:00 I am number 3489651
1 25249 25221 24997 pts/1 24997 S 1000 0:00 I am number 6789321
1 25250 25221 24997 pts/1 24997 S 1000 0:00 I am number 15840638
1 25251 25221 24997 pts/1 24997 S 1000 0:00 I am number 19059205
1 25252 25221 24997 pts/1 24997 S 1000 0:00 I am number 12857603
# processes have been reparented to PPID == 1;
# column 3 is the PGID;
# kill them using negative-PGID;
kill -9 -25221
ps -axjf | grep number | grep -v grep
<empty>
This is why the status command displays the PGID.
In order to tell kill that the given ID is actually a PGID, it has to be prefixed with a minus sign. Alternatively, the command:
pkill -g pgid
does that too.
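A group-wide kill reaches all the dummy processes because the background children of a non-interactive script stay in the script's process group. The following Linux-only sketch checks this by reading the PGID from /proc; pgid_of is a hypothetical helper, and it assumes that the process names contain no spaces.

```bash
#!/bin/bash
# job control off, as is the default in a non-interactive script;
set +m

# the 5th field of /proc/<pid>/stat is the process group id (Linux only);
pgid_of() {
   local _pid _comm _state _ppid pgrp _rest
   read -r _pid _comm _state _ppid pgrp _rest < "/proc/$1/stat"
   echo "$pgrp"
}

sleep 30 &
first=$!
sleep 30 &
second=$!

pgid_first=$(pgid_of "$first")
pgid_second=$(pgid_of "$second")
echo "pid $first is in group $pgid_first, pid $second is in group $pgid_second"

# clean up the two children individually so the script itself survives;
kill "$first" "$second"
wait "$first" "$second" 2> /dev/null || true
```

Both children report the same PGID, which is the value to pass, prefixed with a minus sign, to kill.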
All this looks quite promising so far!
Please join me now in part II of this article for the dockerization of the test script.