There can be several reasons why a Linux process gets killed. Of course, it could be someone using the kill command but, in this blog, I will talk about the “OOM Killer”, which the Linux kernel triggers when the system runs out of RAM. I had a case recently, at a customer, where a nginx process going wild caused such behavior. To try to restore itself, the OS invokes the OOM Killer: it analyzes the currently running processes and kills one or more of them, to urgently free enough resources to stabilize the situation.

The OOM Killer uses a set of rules to select which process(es) should be killed. It will try to kill as few processes as possible, usually starting with a process that is using a lot of RAM. It is also possible to reduce the likelihood of a process being selected for termination by adjusting its oom_score_adj value, which biases the oom_score computed by the kernel. Now, going back to the topic of this blog: when nginx went wild and started using several GBs of RAM, the OOM Killer was obviously triggered but, instead of killing nginx itself (nginx having higher oom_score/oom_score_adj values and obviously using more RAM than anything else at that time), it killed the “master” (main/top/parent) process of a Documentum Repository.
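For illustration, here is roughly how you can look at and influence those values (the PID is just an example; lowering oom_score_adj below 0 usually requires root or CAP_SYS_RESOURCE):

[dmadmin@cs-2 ~]$ cat /proc/7338/oom_score                        # score computed by the kernel (higher = more likely to be killed)
[dmadmin@cs-2 ~]$ echo -500 | sudo tee /proc/7338/oom_score_adj   # bias from -1000 (never kill) to +1000 (kill first); negative values need elevated privileges
[dmadmin@cs-2 ~]$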

While working on a Documentum Server, you will usually have several Documentum processes, all run by the same user: Docbroker, Repository(ies), Java Method Server. Unless you have a specific configuration (adjusting the oom_score), it would normally be the JMS that gets killed first because it’s usually the process using the most RAM. However, in this case, it was neither nginx nor the JMS but the master process of the Repository that got killed, and I must say that was a first for me.

Initially, the monitoring reported job execution issues on the Content Server where it happened, the first error being present in the agentexec.log file:

[dmadmin@cs-2 ~]$ tail -1 $DOCUMENTUM/dba/log/REPO1/agentexec/agentexec.log
...
Thu Jul 21 09:02:21 2022 [INFORMATION] [LAUNCHER 7734] Detected during program initialization: Version: 16.4.0260.0296  Linux64
Thu Jul 21 09:06:44 2022 [ERROR] [AGENTEXEC 8383] Detected while processing dead job dm_DataDictionaryPublisher: The job object indicated the job was in progress, but the job was not actually running.  It is likely that the dm_agent_exec utility was stopped while the job was in progress.
...
[dmadmin@cs-2 ~]$

The Repository log obviously showed the same after that time:

[dmadmin@cs-2 ~]$ cat $DOCUMENTUM/dba/log/cs-2_REPO1.log
...
2022-07-21T09:21:46.679046      8406[8406]      010f42af83387100        JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 010f42af83387100, JMS id: 080f42af80003d29, method: D2EventSenderMailMethod, host:cs-2, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2EventSenderMailMethod -launch_async T -__dm_docbase__ REPO1 -__dm_server_config__ cs-2_REPO1 -docbase_name "REPO1" -due_date "none" -event_name "Job_Failure" -message_text "[ERROR] [AGENTEXEC 8383] Detected while processing dead job dm_DataDictionaryPublisher: The job object indicated the job was in progress, but the job was not actually running.  It is likely that the dm_agent_exec utility was stopped while the job was in progress." -object_name "dm_DataDictionaryPublisher" ...
...
[dmadmin@cs-2 ~]$

Therefore, my first thought was that this was a “simple” AgentExec issue, since jobs weren’t being processed anymore. I like to use the “ps uxf” command because it gives a good and easy-to-read representation of the running processes of the current user, so I mechanically executed it without really thinking that much about it:

[dmadmin@cs-2 ~]$ ps uxf
USER       PID %CPU %MEM  START   TIME COMMAND
dmadmin  26432  0.0  0.0  Jul09   0:00 /bin/sh $JMS_HOME/server/startMethodServer.sh
dmadmin  26434  0.0  0.0  Jul09   0:00  \_ /bin/sh $JMS_HOME/bin/standalone.sh
dmadmin  26532  1.0  6.9  Jul09  14:40      \_ $JAVA_HOME/bin/java -D[Standalone] -server -XX:+UseCompressedOops -server -XX:+UseCompressedOops -Xms8g -Xmx8g -XX:MaxMetaspaceSize=512m -
dmadmin   7246  0.0  0.0  Jul09  15:16 ./dmdocbroker -port 1489 -init_file $DOCUMENTUM/dba/Docbroker.ini
dmadmin   7289  0.0  0.0  Jul09   0:41 ./dmdocbroker -port 1487 -init_file $DOCUMENTUM/dba/DocbrokerExt.ini
dmadmin   7338  0.0  0.1  Jul09   9:30 ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8224  0.0  0.0  Jul09   0:43  \_ $DM_HOME/bin/mthdsvr master 0xfd051316, 0x7feca9f4a000, 0x223000 1000776  5 7338 cs-2_GR_REPO $DOCUMENTUM/dba/log
dmadmin   8230  0.0  0.0  Jul09  21:25  |   \_ $DM_HOME/bin/mthdsvr worker 0xfd051316, 0x7feca9f4a000, 0x223000 1000776  5 0 cs-2_GR_REPO $DOCUMENTUM/dba/log
dmadmin   8253  0.0  0.0  Jul09  21:28  |   \_ $DM_HOME/bin/mthdsvr worker 0xfd051316, 0x7feca9f4a000, 0x223000 1000776  5 1 cs-2_GR_REPO $DOCUMENTUM/dba/log
dmadmin   8277  0.0  0.0  Jul09  21:06  |   \_ $DM_HOME/bin/mthdsvr worker 0xfd051316, 0x7feca9f4a000, 0x223000 1000776  5 2 cs-2_GR_REPO $DOCUMENTUM/dba/log
dmadmin   8302  0.0  0.0  Jul09  21:22  |   \_ $DM_HOME/bin/mthdsvr worker 0xfd051316, 0x7feca9f4a000, 0x223000 1000776  5 3 cs-2_GR_REPO $DOCUMENTUM/dba/log
dmadmin   8342  0.0  0.0  Jul09  21:34  |   \_ $DM_HOME/bin/mthdsvr worker 0xfd051316, 0x7feca9f4a000, 0x223000 1000776  5 4 cs-2_GR_REPO $DOCUMENTUM/dba/log
dmadmin   8264  0.0  0.0  Jul09   7:47  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8265  0.0  0.0  Jul09   0:06  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8290  0.0  0.0  Jul09   0:05  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8315  0.0  0.0  Jul09   0:06  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8355  0.0  0.0  Jul09   0:17  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8357  0.0  0.1  Jul09  23:40  \_ ./dm_agent_exec -enable_ha_setup 1 -docbase_name GR_REPO.cs-2_GR_REPO -docbase_owner dmadmin -sleep_duration 0
dmadmin   8400  0.0  0.0  Jul09   5:43  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin  17396  0.0  0.1  02:42   0:00  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   1736  0.0  0.1  07:24   0:00  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin  23044  0.0  0.1  09:18   0:00  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin  25962  0.0  0.1  09:26   0:00  \_ ./documentum -docbase_name GR_REPO -security acl -init_file $DOCUMENTUM/dba/config/GR_REPO/server_cs-2_GR_REPO.ini
dmadmin   8279  0.0  0.0  Jul09   8:51 ./documentum -docbase_name REPO1 -security acl -init_file $DOCUMENTUM/dba/config/REPO1/server_cs-2_REPO1.ini
dmadmin   8280  0.0  0.0  Jul09   0:58 ./documentum -docbase_name REPO1 -security acl -init_file $DOCUMENTUM/dba/config/REPO1/server_cs-2_REPO1.ini
dmadmin   8304  0.0  0.0  Jul09   0:59 ./documentum -docbase_name REPO1 -security acl -init_file $DOCUMENTUM/dba/config/REPO1/server_cs-2_REPO1.ini
dmadmin   8344  0.0  0.0  Jul09   0:57 ./documentum -docbase_name REPO1 -security acl -init_file $DOCUMENTUM/dba/config/REPO1/server_cs-2_REPO1.ini
dmadmin   8379  0.0  0.0  Jul09   0:17 ./documentum -docbase_name REPO1 -security acl -init_file $DOCUMENTUM/dba/config/REPO1/server_cs-2_REPO1.ini
dmadmin   8383  0.0  0.1  Jul09  23:39 ./dm_agent_exec -enable_ha_setup 1 -docbase_name REPO1.cs-2_REPO1 -docbase_owner dmadmin -sleep_duration 0
dmadmin   8406  0.0  0.0  Jul09   7:36 ./documentum -docbase_name REPO1 -security acl -init_file $DOCUMENTUM/dba/config/REPO1/server_cs-2_REPO1.ini
[dmadmin@cs-2 ~]$

Do you see that? When I noticed the outcome, I was quite surprised… A Repository that is starting will always consist of a master/main/top/parent process using the command “./documentum” (e.g.: 7338 for “GR_REPO”) and then a bunch of child processes (“\_ ./documentum”) that it launches as needed, to handle user sessions, jobs, and so on. As you can see above, this hierarchy is indeed there for “GR_REPO”, but for the Repository “REPO1”, the Documentum processes aren’t orchestrated by a master process anymore. In addition, you can see that there are no recent processes in execution (check the “START” column) for “REPO1”. Both these factors meant, to me, that the main Repository process had most probably been killed and that this was most likely the root cause of the issue. In the meantime, the JMS also started printing errors that the local Repository (RCS/CFS) isn’t available, but that wouldn’t prevent it from working because of the High Availability setup. Of course, after checking the OS logs, it was confirmed that the OOM Killer played a role in this issue (because of nginx using lots of memory). Solving it was as simple as killing the remaining “REPO1” processes (documentum & dm_agent_exec) and starting the Repository again.
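For the record, the recovery itself boiled down to something like the following (a sketch only: the pkill pattern is mine and dm_start_REPO1 is the usual start script name under $DOCUMENTUM/dba, so double-check yours before running anything similar):

[dmadmin@cs-2 ~]$ # Kill the leftover REPO1 processes (orphaned ./documentum children and the dm_agent_exec)
[dmadmin@cs-2 ~]$ pkill -u dmadmin -f "docbase_name REPO1"
[dmadmin@cs-2 ~]$ # Then start the Repository again
[dmadmin@cs-2 ~]$ $DOCUMENTUM/dba/dm_start_REPO1
[dmadmin@cs-2 ~]$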

So, I had this question coming to my mind: what would be the easiest, fastest, and least intrusive way to detect such a thing? I could think of three ways to do that.

I. The first one, of course, would be to execute an iAPI/iDQL command (same thing for a Java DFC check), to make sure that the Repository is responding, which wouldn’t be the case if the master process is killed. However, that’s a slow (several seconds) and intrusive way to check, since it will create/use a Documentum session (therefore fewer sessions usable by real users) and create a log file for that session.
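For illustration, such a connectivity check might be as simple as the below (the password is a placeholder and the timeout handling is my own addition):

[dmadmin@cs-2 ~]$ # A healthy Repository answers within a few seconds; a killed master process means no reply at all
[dmadmin@cs-2 ~]$ echo "quit" | timeout 30 iapi REPO1.cs-2_REPO1 -Udmadmin -Pxxx
[dmadmin@cs-2 ~]$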

A quick health check script would give the below outcome for this situation:

[dmadmin@cs-2 ~]$ /scripts/dbi_health_check.sh
2022-07-21 09:31:05 UTC - Starting checks

2022-07-21 09:31:06 UTC -   OK   << Successful reply from Broker:      'Docbroker' (port '1489', secure)
2022-07-21 09:31:08 UTC -   OK   << Successful reply from Broker:      'DocbrokerExt' (port '1487', secure)
2022-07-21 09:31:09 UTC -   OK   << Successful reply from Repo:        'GR_REPO.cs-2_GR_REPO' (iAPI)
2022-07-21 09:31:40 UTC - NOT OK << No reply, timeout (30s) on Repo:   'REPO1.cs-2_REPO1' (iAPI)
2022-07-21 09:31:40 UTC -   OK   << Successful reply from JMS:         'ServerApps' (port '9082', https)
2022-07-21 09:31:40 UTC -   OK   << Successful reply from JMS:         'ACS' (port '9082', https)

2022-07-21 09:31:40 UTC - Overall status >> NOT OK

[dmadmin@cs-2 ~]$

II. The second would be to check the kernel logs. When the OOM Killer terminates a process, it logs it, and it’s possible to access these messages (even without root) through the dmesg command (e.g.: dmesg -T | grep -iE 'killed process|oom-killer'). You should be able to see the same through the /var/log/messages file, but that one usually requires root access. It’s rather easy and non-intrusive to find an issue that way, but it does require a little bit of scripting: you will indeed find that a process has been killed, but is it a Documentum process and, more specifically, the master Repository process? So, why not, but I’m sure we can do better.

The below extract using dmesg is completely unrelated to the issue I’m talking about in this blog; it’s just to show what it would look like:

[dmadmin@cs-2 ~]$ dmesg -T | grep -iE 'killed process|oom-killer'
[Sun Jul 24 16:35:58 2022] pool-3-thread-9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=905
[Sun Jul 24 16:35:58 2022] Killed process 31249 (filebeat), UID 999, total-vm:1732060kB, anon-rss:23872kB, file-rss:19420kB, shmem-rss:0kB
[Sun Jul 24 16:35:58 2022] pool-3-thread-9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=905
[Sun Jul 24 16:35:58 2022] Killed process 599 (tail), UID 999, total-vm:5504kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[Sun Jul 24 16:35:58 2022] pool-3-thread-9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=905
[Sun Jul 24 16:35:58 2022] Killed process 29805 (java), UID 999, total-vm:39125556kB, anon-rss:31451436kB, file-rss:12680kB, shmem-rss:0kB
[dmadmin@cs-2 ~]$
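A little grep on top of it can narrow things down to Documentum processes; a minimal sketch, based on the message format shown above:

[dmadmin@cs-2 ~]$ # Only report OOM kills targeting Documentum binaries (the process name is between parentheses)
[dmadmin@cs-2 ~]$ dmesg -T | grep -i 'killed process' | grep -E '\((documentum|dm_agent_exec|mthdsvr|dmdocbroker)\)'
[dmadmin@cs-2 ~]$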

III. The last thing I could think of was to simply check the Documentum processes themselves. Since it’s possible to see visually that there is no master process anymore, it’s also possible to script that check easily. In addition, if you look at the above list of processes again, you might notice that there are no “mthdsvr” processes anymore for the Repository with the issue…

You might already have a quick check that makes sure that the dmdocbroker or dm_agent_exec processes are present, but that might not detect the issue, for example:

[dmadmin@cs-2 ~]$ docbroker=Docbroker
[dmadmin@cs-2 ~]$ docbase=REPO1
[dmadmin@cs-2 ~]$
[dmadmin@cs-2 ~]$ ps -ef | grep "^${USER}.*dmdocbroker.*init_file.*${docbroker}.ini" | grep -v grep
dmadmin   7246     1  0 Jul09 ?        00:15:16 ./dmdocbroker -port 1489 -init_file $DOCUMENTUM/dba/Docbroker.ini
[dmadmin@cs-2 ~]$
[dmadmin@cs-2 ~]$ ps -ef | grep "^${USER}.*dm_agent_exec.*-docbase_name ${docbase}\." | grep -v grep
dmadmin   8383     1  0 Jul09 ?        00:23:39 ./dm_agent_exec -enable_ha_setup 1 -docbase_name REPO1.cs-2_REPO1 -docbase_owner dmadmin -sleep_duration 0
[dmadmin@cs-2 ~]$

With the above commands, you will NOT detect this specific issue where the main documentum process gets killed. However, as you probably understood if you read this blog, if you either make sure that the dm_agent_exec is a child process or look at the mthdsvr processes, then you can detect the issue. If there are no problems, all commands below should return one result each; if the issue is present, then the last two commands will not return anything:

[dmadmin@cs-2 ~]$ docbroker=Docbroker
[dmadmin@cs-2 ~]$ docbase=REPO1
[dmadmin@cs-2 ~]$ dm_server_config=$(grep "^server_config_name[[:space:]]*=" $DOCUMENTUM/dba/config/${docbase}/server.ini | tail -1 | sed 's,.*=[[:space:]]*,,')
[dmadmin@cs-2 ~]$
[dmadmin@cs-2 ~]$ ps uxf | grep "^${USER}.* ./dmdocbroker.*init_file.*${docbroker}.ini" | grep -v grep
dmadmin   7246  0.0  0.0 128532 48572 ?        S    Jul09  15:16 ./dmdocbroker -port 1489 -init_file $DOCUMENTUM/dba/Docbroker.ini
[dmadmin@cs-2 ~]$
[dmadmin@cs-2 ~]$ ps uxf | grep "^${USER}.* \\\_ ./dm_agent_exec.*-docbase_name ${docbase}\." | grep -v grep
[dmadmin@cs-2 ~]$
[dmadmin@cs-2 ~]$ ps uxf | grep "^${USER}.* \\\_ .*/mthdsvr master.* ${dm_server_config}" | grep -v grep
[dmadmin@cs-2 ~]$

The above should work on RHEL Operating Systems and probably a bunch of other OSes. If you don’t want to rely on the child/not-child distinction (because the “ps uxf” display might differ), then I believe that just making sure that both the dm_agent_exec and the mthdsvr master processes are present for a single Repository should be sufficient as well. If the Repository is currently starting and not yet available, then the dm_agent_exec process won’t be present yet, and if the issue reproduces, then the mthdsvr master process will disappear on its own. I believe that’s the simplest and fastest way to detect such an issue, so that’s what I’m going to use in the future. Of course, this quick check will not replace a real Repository monitoring that uses iAPI/iDQL/DFC/whatever to connect to the Repository and retrieve data from inside it, but that’s usually something that you will execute only every 15/30 minutes, so as not to overload your Repository. This one, on the other hand, is a quick and efficient way to check that the processes are still there as they should be, in which case it is a good indication that everything still runs properly.
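To wrap things up, a minimal version of this quick check could look like the below (a hypothetical script, named and written to my own taste; the patterns are the same ones used earlier in this blog, so adapt them to your environment):

[dmadmin@cs-2 ~]$ cat /scripts/dbi_quick_repo_check.sh
#!/bin/bash
# Quick, non-intrusive check: a healthy, fully started Repository should have both
# its dm_agent_exec process and its "mthdsvr master" process running.
docbase=REPO1
dm_server_config=$(grep "^server_config_name[[:space:]]*=" $DOCUMENTUM/dba/config/${docbase}/server.ini | tail -1 | sed 's,.*=[[:space:]]*,,')
if pgrep -u "$(whoami)" -f "dm_agent_exec.*-docbase_name ${docbase}\." > /dev/null \
   && pgrep -u "$(whoami)" -f "mthdsvr master.* ${dm_server_config}" > /dev/null; then
  echo "OK     - ${docbase}: dm_agent_exec and mthdsvr master processes are present"
else
  echo "NOT OK - ${docbase}: dm_agent_exec and/or mthdsvr master process missing"
  exit 1
fi
[dmadmin@cs-2 ~]$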