At a customer running Documentum inside Kubernetes, unusually frequent restarts of xPlore containers were noticed a couple of weeks after the introduction of CPS Only pods on the different Production environments. The customer originally had a very simple Standalone xPlore setup, with Dsearch+IndexAgent but without CPS Only Nodes. The indexing and search load was reaching the limit of what that setup could handle. Therefore, to increase the processing power of xPlore, it was decided to switch to a Multi-Node setup, with a few additional CPS depending on the environment.

No issues were found on the DEV and QA environments; everything was working properly with the newly added CPS. It was only after a couple of weeks running on Production that the first unusual restart happened. At first glance, it looked like an issue with the container layers, since alerts were being received about an almost full file system for the path containing the containers’ local volumes/layers, and a few minutes later some containers would restart, mostly xPlore Dsearch/CPS pods but not only.

After some more investigation, it turned out that the issue was caused by the CPS temporary folder of xPlore being filled with a large number of ~400MB files. Since this path wasn’t using a Persistent Volume, it was using the container’s local storage, which explained the file system full alerts. Once the Kubernetes Worker’s disk for the containers’ local volumes/layers reached its maximum size of 200GB, the container engine would kill some containers to free up space, hence the unusual restarts.
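To quickly confirm where the space was going, a check like the following can be run inside the pod (a simple sketch; the paths are the ones from this setup):

# Size of the CPS temporary folder and usage of the file system hosting it
du -sh $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp
df -h $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp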

These were some of the files in question:

[xplore@ds1-0 ~]$ cd $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp
[xplore@ds1-0 temp]$ ls -ltr
total 157286400
...
-rw-r----- 1 xplore xplore  408584192 Dec 19 01:25 SSS4AE9
-rw-r----- 1 xplore xplore  409419776 Dec 19 01:27 SSSE305
-rw-r----- 1 xplore xplore  412467200 Dec 19 01:27 SSS652A
-rw-r----- 1 xplore xplore  413335552 Dec 19 01:28 SSS48F
-rw-r----- 1 xplore xplore  482869248 Dec 19 01:31 SSSC206
-rw-r----- 1 xplore xplore  397770752 Dec 19 01:32 SSSB62C
[xplore@ds1-0 temp]$ date
Mon Dec 19 10:02:02 UTC 2022
[xplore@ds1-0 temp]$

The CPS configuration used was pretty simple, similar to the out-of-the-box one. A few changes had been made, for example for the indexing of large files, for the export paths and to increase the daemon_count from 3 to 5. Moreover, the configuration file hadn’t been touched for months:

[xplore@ds1-0 temp]$ grep daemon_count $XPLORE_HOME/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml
 <daemon_count>5</daemon_count>
 <query_dedicated_daemon_count>1</query_dedicated_daemon_count>
[xplore@ds1-0 temp]$
[xplore@ds1-0 temp]$ grep temp $XPLORE_HOME/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml
    <temp_directory>temp</temp_directory>
    <keep_temp_file>false</keep_temp_file>
    <temp_file_retain_time>900</temp_file_retain_time>
                    <property name="temp_file_folder">temp</property>
[xplore@ds1-0 temp]$

In case you don’t know, you can only have a maximum of 7 threads for the CPS daemon (daemon_count + query_dedicated_daemon_count), because you always need to keep the 8th thread available for restarts (cf. KB0474159). Here, 6 are configured, so it is safely under the threshold, which means that this isn’t the issue.
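As a quick sanity check, the two values can be summed directly from the configuration file; a small sketch (standard grep/sed/awk assumed):

# Sum of daemon_count + query_dedicated_daemon_count (must stay <= 7)
grep -E "daemon_count" $XPLORE_HOME/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml \
  | sed -E 's/.*>([0-9]+)<.*/\1/' \
  | awk '{ sum += $1 } END { print "total CPS daemons: " sum }'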

While checking the CPS logs, it was seen that there were quite a few CPS self-terminations from time to time:

[xplore@ds1-0 temp]$ cdlogs
[xplore@ds1-0 logs]$ grep "2022-12-17.*Consumed memory.*larger than the threshold.*will self-terminate" cps_daemon.log | wc -l
3
[xplore@ds1-0 logs]$ grep "2022-12-18.*Consumed memory.*larger than the threshold.*will self-terminate" cps_daemon.log | wc -l
237
[xplore@ds1-0 logs]$ grep "2022-12-19.*Consumed memory.*larger than the threshold.*will self-terminate" cps_daemon.log | wc -l
242
[xplore@ds1-0 logs]$
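Instead of grepping one date at a time, the self-terminations can also be summarized per day; a possible sketch:

# Count of CPS self-terminations per day (the date is the first field of each log line)
grep "Consumed memory.*larger than the threshold.*will self-terminate" cps_daemon.log \
  | awk '{ count[$1]++ } END { for (d in count) print d, count[d] }' \
  | sort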

To avoid issues over the Christmas period, a cronjob was created to clean up the temp folder periodically (a sketch of such a job is shown after the query below). Based on the data from the logs and the temporary files on the file system, it appeared that the issue might be related to high indexing load. Therefore, to try to reproduce it, around 20 000 documents were submitted for indexing through a refeed task. These documents were selected using a query such as:

select
  r_object_id
from
  dm_document
where
  r_modify_date >= date('10.01.2023 00:00:00 UTC','dd.mm.yyyy hh:mi:ss')
  and r_modify_date < date('11.01.2023 00:00:00 UTC','dd.mm.yyyy hh:mi:ss');
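For reference, the interim clean-up mentioned above could look like the following crontab entry (a sketch only; the one-hour retention is an assumption, and $XPLORE_HOME must either be defined in the crontab or replaced with the absolute path):

# Hourly removal of SSS* temporary files older than 60 minutes
0 * * * * find $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp -maxdepth 1 -type f -name "SSS*" -mmin +60 -delete >/dev/null 2>&1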

With the refeed in progress, a few SSS* files appeared, and the CPS also crashed a few times along the way:

[xplore@ds1-0 logs]$ cd $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp
[xplore@ds1-0 temp]$ date; ls -ltr | tail -2
Thu Jan 12 08:42:41 UTC 2023
-rw-r----- 1 xplore xplore 385859584 Jan 12 08:27 SSSDA50
-rw-r----- 1 xplore xplore 383336448 Jan 12 08:29 SSSFAD5
[xplore@ds1-0 temp]$
[xplore@ds1-0 temp]$ stat SSSDA50
 File: ‘SSSDA50’
 Size: 385859584 Blocks: 756608 IO Block: 65536 regular file
Device: fbh/251d Inode: 9247321667856512778 Links: 1
Access: (0640/-rw-r-----) Uid: ( 514/ xplore) Gid: ( 514/ xplore)
Access: 2023-01-12 08:26:44.056715000 +0000
Modify: 2023-01-12 08:27:54.168722000 +0000
Change: 2023-01-12 08:27:54.168722000 +0000
 Birth: -
[xplore@ds1-0 temp]$
[xplore@ds1-0 temp]$ stat SSSFAD5
 File: ‘SSSFAD5’
 Size: 383336448 Blocks: 751664 IO Block: 65536 regular file
Device: fbh/251d Inode: 9247321667856512779 Links: 1
Access: (0640/-rw-r-----) Uid: ( 514/ xplore) Gid: ( 514/ xplore)
Access: 2023-01-12 08:27:57.640918000 +0000
Modify: 2023-01-12 08:29:01.369706000 +0000
Change: 2023-01-12 08:29:01.369706000 +0000
 Birth: -
[xplore@ds1-0 temp]$
[xplore@ds1-0 temp]$ cdlogs
[xplore@ds1-0 logs]$ grep "Consumed memory.*larger than the threshold.*will self-terminate" cps_daemon.log | tail -2
2023-01-12 08:27:53,794 WARN [Daemon3(22493)-Core-(140435641489152)] Consumed memory 4148326400, larger than the threshold 4000000000, Daemon3(22493) will self-terminate
2023-01-12 08:29:01,044 WARN [RDaemon0(23282)-Core-(140654441674496)] Consumed memory 4092612608, larger than the threshold 4000000000, RDaemon0(23282) will self-terminate
[xplore@ds1-0 logs]$

As you can see above, the timings are quite close. So close that it looks like a CPS daemon generates such an SSS* file just before crashing… The CPS daemon has a 4GB memory threshold for its processing, and the SSS* files are all around 370-440MB in size, so around 10% of that threshold. Could it be some kind of memory dump or something similar? The ticket CS0053875 has been opened with OpenText Support to try to find the root cause of these SSS* files but, as of today, it’s still unclear what these files could be.
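One way to keep an eye on the average size of these files and compare it to the 4GB threshold is a quick one-liner like this (a sketch):

# Average size of the SSS* files, to compare with the 4GB daemon memory threshold
ls -l $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp/SSS* 2>/dev/null \
  | awk '{ sum += $5; n++ } END { if (n) printf "%d file(s), average size %.0f MB\n", n, sum/n/1024/1024 }'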

As shown above, there are some parameters that control how files are created/deleted for the CPS:

  • temp_directory: Folder under which the CPS temporary files are stored; defaults to “temp” (i.e. $XPLORE_HOME/dsearch/cps/cps_daemon/temp/)
  • keep_temp_file: Whether or not to keep the CPS temporary files; defaults to “false” (i.e. don’t keep)
  • temp_file_retain_time: How long (in seconds) the CPS temporary files are kept; defaults to “900” (i.e. 15 minutes) and only applies if “keep_temp_file” is “true”
  • temp_file_folder: Folder under which the CPS temporary format and language identification files are stored; defaults to “temp” (i.e. $XPLORE_HOME/dsearch/cps/cps_daemon/temp/)

That’s the theory. These details come from the xPlore documentation itself but, in reality, they do not appear to be fully accurate… While testing, we tried to use absolute paths in the CPS configuration file instead of the default relative ones. By doing that, it looks like “temp_directory” changes the location of the “dmftcontentref*” files, while “temp_file_folder” changes the location of the “SSS*” files. But as you saw above, the SSS* files weren’t created inside “$XPLORE_HOME/dsearch/cps/cps_daemon/temp/” but inside “$XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp/” (notice the additional /bin/ in the path). So, there is already something strange with the real default value used by xPlore when it relies on a relative path.

A second strange point is that “temp_file_retain_time” appears to be linked to the “temp_directory” path but not to “temp_file_folder”: if you set “keep_temp_file” to true, then the “dmftcontentref*” files will stay for 15 minutes before being deleted, but nothing happens to the “SSS*” files (if you are using their default relative “temp” value). So, on that aspect, the retain parameter appears to be working as documented.

To continue with the testing, we tried to set both “temp_directory” and “temp_file_folder” to the same absolute folder, while keeping the default “keep_temp_file” value of “false” (meaning don’t keep temporary files, and therefore no delayed cleanup should happen since “temp_file_retain_time” is not supposed to take effect). With that configuration, no “dmftcontentref*” files were kept (expected behavior), while the “SSS*” files were created in the same folder and then removed after exactly 15 minutes… That means that, contrary to the documentation, the cleanup is performed after “temp_file_retain_time” seconds even if “keep_temp_file” is set to “false”. Therefore, the documentation statement for “temp_file_retain_time” that “This parameter is only effective when keep_temp_file is set to true” doesn’t appear to be correct.
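One simple way to observe this behaviour is to watch the folder and note when the files disappear, for example with a loop like this (a sketch):

# Print the folder content every minute to see when SSS* files get removed
while true; do
  date
  ls -ltr $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp
  sleep 60
done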

In conclusion, the temporary folder, and more specifically the “SSS*” files, can be cleaned automatically by xPlore if you set both parameters to the same absolute path, i.e. with this configuration:

[xplore@ds1-0 logs]$ cd $XPLORE_HOME/dsearch/cps/cps_daemon
[xplore@ds1-0 cps_daemon]$
[xplore@ds1-0 cps_daemon]$ # Before / Default configuration
[xplore@ds1-0 cps_daemon]$ config_file="PrimaryDsearch_local_configuration.xml"
[xplore@ds1-0 cps_daemon]$
[xplore@ds1-0 cps_daemon]$ grep temp ${config_file}
    <temp_directory>temp</temp_directory>
    <keep_temp_file>false</keep_temp_file>
    <temp_file_retain_time>900</temp_file_retain_time>
                    <property name="temp_file_folder">temp</property>
[xplore@ds1-0 cps_daemon]$
[xplore@ds1-0 cps_daemon]$
[xplore@ds1-0 cps_daemon]$ # Changing default configuration
[xplore@ds1-0 cps_daemon]$ cps_temp_path="$XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp"
[xplore@ds1-0 cps_daemon]$ sed -i "s,<temp_directory>[^<]*<,<temp_directory>${cps_temp_path}<," ${config_file}
[xplore@ds1-0 cps_daemon]$ sed -i "s,temp_file_folder\">[^<]*<,temp_file_folder\">${cps_temp_path}<," ${config_file}
[xplore@ds1-0 cps_daemon]$
[xplore@ds1-0 cps_daemon]$ grep temp ${config_file}
    <temp_directory>$XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp</temp_directory>
    <keep_temp_file>false</keep_temp_file>
    <temp_file_retain_time>900</temp_file_retain_time>
                    <property name="temp_file_folder">$XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp</property>
[xplore@ds1-0 cps_daemon]$
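Assuming the Dsearch/CPS is restarted so that the new configuration is picked up (an assumption; adapt to your own restart procedure), a quick way to verify the clean-up is to check that no SSS* file older than the retain time remains:

# After the retain time (15 minutes) has passed, this should return 0
find $XPLORE_HOME/dsearch/cps/cps_daemon/bin/temp -maxdepth 1 -type f -name "SSS*" -mmin +20 | wc -l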

It’s definitely not a final solution, since the temporary files are still being created for unknown reasons (sometimes with a CPS crash, sometimes without), but it’s a first step and it’s still better than having pods restart every few hours.