The journey begins

In a VM Oracle VirtualBox with Centos Linux 7.0, you install Documentum v7.3 and create a docbase dmtest with its schema in a database on another VM running Oracle Linux. You start the newly created docbase and successfully connect locally to it from within iapi or idql. You smile with satisfaction and go on. It’ll be a walk in the park, you think.

In another VM, a client, running under Ubuntu 16.0.3 (you like those different distributions, don’t you ?), you install the Documentum 7.3 binaries too and configure the dfc.properties file. You then attempt to connect to the docbase above from within iapi or idql but the connection takes forever (more than 20 minutes) and finally terminates miserably with the error message:

[DFC_DOCBROKER_REQUEST_FAILED] Request to Docbroker "dmtest:1489" failed
[DM_SESSION_E_RPC_ERROR]error: "Server communication failure"
java.net.NoRouteToHostException: No route to host
"

Sometimes the connection is immediate but most of the time is simply takes forever until it fails with the stupid error message above. You scratch your head and start rechecking the network interfaces, ping the hosts, verify the usual files /etc/hosts, /etc/resolv.conf, services.ora, server.ini and local dfc.properties, stop the firewalls on both machine and on the host O/S, even disable the SELinux extensions, but to no avail. You activate DFC and RPC tracing on the client but no new log file is written. You check the logs on the server side but find nothing relevant.
Close to a nervous breakdown, you even try a local docbroker on the client VM and project the docbase to it, but the connection still hangs. You do that on the VM running Oracle Linux too and the result isn’t any better. At the very least, both distributions are consistent.
Those long time-outs and the java.net errors suggest something network-related may hang the program maybe right after it invokes some java stuff. Strangely enough, no JVM process is started on the client to interpret said java code. Thus, you restart the iapi command with system calls tracing on:

strace -o iapi.trc iapi -Udmadmin -Pdmadmin

The produced iapi.trc file is very detailed and contains lots of run-time calls. Excited, you jump to the last lines:


socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 68
ioctl(68, SIOCGIFFLAGS, {ifr_name="enp0s3", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
close(68) = 0
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 68
ioctl(68, SIOCGIFHWADDR, {ifr_name="enp0s3", ifr_hwaddr=08:00:27:03:22:0d}) = 0
close(68) = 0
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 68
ioctl(68, SIOCGIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING}) = 0
close(68) = 0

ioctl(68, SIOCGIFBRDADDR, {ifr_name="enp0s3", ifr_broadaddr={AF_INET, inet_addr("192.168.56.255")}}) = 0
ioctl(68, SIOCGIFNETMASK, {ifr_name="enp0s3", ifr_netmask={AF_INET, inet_addr("255.255.255.0")}}) = 0
ioctl(68, SIOCGIFINDEX, {ifr_name="enp0s3", }) = 0
close(68) = 0
socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 68
open("/proc/net/if_inet6", O_RDONLY) = 69
fstat(69, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(69, "fe800000000000000a0027fffe03220d"..., 1024) = 108
read(69, "", 1024) = 0
read(69, "", 1024) = 0
close(69) = 0


This really looks like network stuff. You recognize the interface name and notice that IPv6 is mentioned and used too but our interface has an IP v4 address. Just to be sure, you deactivate IP v6 in both the client and server VMs for the relevant interfaces with the following commands:

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

but this does not solve the problem at hand.
Confident, and a bit upset too, you start scrutinizing this trace file line by line from the bottom up until something déjà vu catches your sharp eyes at around the 15’000th line:

...
open("/dev/random", O_RDONLY) = 65
fstat(65, {st_mode=S_IFCHR|0644, st_rdev=makedev(1, 8), ...}) = 0
read(65, "\355\347\333\217X;", 20) = 6
read(65, "\24\314\305\37JK", 14) = 6
read(65, "q\5T\23\1\250", 8) = 6
read(65, "331", 2) = 2
read(65, "`\3\256\354", 20) = 4
read(65, "\221\302\223\270\202\205", 16) = 6
read(65, "\213\222e\206j\361", 10) = 6
read(65, "N+\"H", 4) = 4
read(65, "K\215", 20) = 2
read(65, "\355P1\320\312\301", 18) = 6
read(65, "\27\232a\32\272", 12) = 6
read(65, "\3055n\221 '", 6) = 6


If you do a tail -f on the trace file, you notice unmistakenly that the execution is waiting repeatedly on read(65, …), i.e. a response from /dev/random. You heard lots of confusing tales about how this device is blocking and slows down the java stuff and how better it is to use its non-blocking counterpart /dev/urandom (see other blogs on this very site too). Could it be that this is the culprit eventhough you’re running non java executables ? How to force /dev/urandom to be used instead in our case ? You notice that the Documentum’s environment setting script, dm_set_server_env.sh, defines DEVRANDOM=/dev/urandom and this variable has been set indeed, but obviously it has no effect here. You start googling for information and find wheelbarrows of it. Many middlewares have a way or another to choose the random number generator device, but for native programs the only way seems to be something like removing /dev/random and linking it to /dev/urandom. Instead of sym linking, someone (see references below) proposed a lower-level alternative consisting in using the same node as /dev/urandom to redirect the calls to the same kernel device:

ls /dev/*random
crw-rw-rw- 1 root root 1, 9 Apr 7 14:52 /dev/urandom
crw-r--r-- 1 root root 1, 8 Apr 7 20:23 /dev/random
rm /dev/random
mknod /dev/random c 1 9
ll /dev/*random
crw-rw-rw- 1 root root 1, 9 Apr 7 14:52 /dev/urandom
crw-r--r-- 1 root root 1, 9 Apr 7 21:26 /dev/random

As nothing prevents you to do it on your toy machines, you go for it on the client VM (don’t do this in a production machine unless you know all the possible side-effects of this change !) and a little miracle happens: the response from the server arrives quickly but your spidey sense, along with the error message below, tells you something is still quite not right:

Connecting to Server using docbase dmtest
Could not connect
[DFC_SESSION_DOCBASE_UNREACHABLE] Docbase "dmtest" is unreachable

Again, this is familiar stuff, or so you think, thus you check again the dobcase projections and run dmqdocbroker against the local docbroker (it too was impacted by the slowness and now it starts up swiftly too) on the client VM and notice that the test docbase is reported here as expected:

dmadmin@dmclient:~/documentum/shared$ dmqdocbroker -i
dmqdocbroker: A DocBroker Query Tool
dmqdocbroker: Documentum Client Library Version: 7.3.0000.0205
Targeting current host
Targeting port 1489
---- dmqdocbroker: (TARGET HOST: dmclient.cec.com) ----
p) Ping (test connectivity to) the docbroker
d) Get a docbase map
s) Get a server map
n) Get next largest docbase id
l) lookup a docbase id
o) find all open servers for a docbase
h) Set the host name for the docbroker
e) exit
Enter an option (i.e. letter)> h
Enter the host name for the Docbroker: dmtest -- make it point to the docbroker on the content server VM as defined in the /etc/hosts file;
---- dmqdocbroker: (TARGET HOST: dmtest) ----
p) Ping (test connectivity to) the docbroker
d) Get a docbase map
s) Get a server map
n) Get next largest docbase id
l) lookup a docbase id
o) find all open servers for a docbase
h) Set the host name for the docbroker
e) exit
Enter an option (i.e. letter)> p -- ping the remote docbroker;
Successful reply from docbroker at host (localhost.localdomain) on port(1490) running software version (7.3.0000.0214 Linux64). <------ hum hum ...
---- dmqdocbroker: (TARGET HOST: dmtest) ----
p) Ping (test connectivity to) the docbroker
d) Get a docbase map
s) Get a server map
n) Get next largest docbase id
l) lookup a docbase id
o) find all open servers for a docbase
h) Set the host name for the docbroker
e) exit
Enter an option (i.e. letter)> d -- what docbases project on that docbroker ?
**************************************************
** D O C B R O K E R I N F O **
**************************************************
Docbroker host : localhost.localdomain -- hum hum ...
Docbroker port : 1490
Docbroker network address : INET_ADDR: 02 5d2 c0a8380c localhost.localdomain 192.168.56.12 -- hum hum again
Docbroker version : 7.3.0000.0214 Linux64
**************************************************
** D O C B A S E I N F O **
**************************************************
--------------------------------------------
Docbase name : dmtest -- our docbase is here indeed;
Docbase id : 50000
Docbase description : test docbase
Govern docbase :
Federation name :
Server version : 7.3.0000.0214 Linux64.Oracle
Docbase Roles : Global Registry
Docbase Dormancy Status :
--------------------------------------------

The dfc trace file however tells another story:

java.net.ConnectException: Connection refused
5.234 N/A N/A [main] .DEBUG: IPV4 Address is: 192.168.56.12
5.242 N/A N/A [main] .DEBUG: Updated Server Choices for 'dmtest'. 1 servers found. 0 content servers found
5.242 N/A N/A [main] .DEBUG: Updated Server Choices for 'dmtest'. 0 bad servers found.
5.244 N/A N/A [main] .DEBUG: IPV4 Address is: 192.168.56.12
5.249 N/A N/A [main] .DEBUG: Updated Server Choices for 'dmtest'. 1 servers found. 0 content servers found

5.249 N/A N/A [main] .DEBUG: Updated Server Choices for 'dmtest'. 0 bad servers found.
5.252 N/A N/A [main] .DEBUG: IPV4 Address is: 192.168.56.12
5.258 N/A N/A [main] .DEBUG: Updated Server Choices for 'dmtest'. 1 servers found. 0 content servers found
5.259 N/A N/A [main] .DEBUG: Updated Server Choices for 'dmtest'. 0 bad servers found.
5.259 N/A N/A [main] .WARN: [DFC_SESSION_RETRYING_OPEN] Retrying session establishment to "dmtest". Error was "[DFC_SESSION_DOCBASE_UNREACHABLE] Docbase "dmtest" is unreachable"

Intriguing. The localhost.localdomain reported above several times by the docbroker looks weird and prone to confusion. Local relative to what ? The client’s or the server’s host ? It is likely that after the idql client receives this information from the docbroker, it attempts a connection locally, which obviously cannot work, unless there happens to be a local dmtest docbase there too (which may explain why this error does not occur on the server side). The solution is quite self-evident: shut down the docbase and its docbroker, give a proper hostname to the content server’s VM, restart the docbroker and the docbase and attempt a new connection from the client VM. Et voilà ! This time the connection is established successfully and without delay.
You check dmqdocbroker again to see if it reflects the change:

---- dmqdocbroker: (TARGET HOST: dmtest.cec) ----
p) Ping (test connectivity to) the docbroker
d) Get a docbase map
s) Get a server map
n) Get next largest docbase id
l) lookup a docbase id
o) find all open servers for a docbase
h) Set the host name for the docbroker
e) exit
Enter an option (i.e. letter)> p
Successful reply from docbroker at host (dmtest.cec) on port(1490) running software version (7.3.0000.0214 Linux64).
---- dmqdocbroker: (TARGET HOST: dmtest.cec) ----
p) Ping (test connectivity to) the docbroker
d) Get a docbase map
s) Get a server map
n) Get next largest docbase id
l) lookup a docbase id
o) find all open servers for a docbase
h) Set the host name for the docbroker
e) exit
Enter an option (i.e. letter)> d
**************************************************
** D O C B R O K E R I N F O **
**************************************************
Docbroker host : dmtest.cec -- good !
Docbroker port : 1490
Docbroker network address : INET_ADDR: 02 5d2 c0a8380c dmtest.cec 192.168.56.12
Docbroker version : 7.3.0000.0214 Linux64
**************************************************
** D O C B A S E I N F O **
**************************************************
--------------------------------------------
Docbase name : dmtest
Docbase id : 50000
Docbase description : test docbase
Govern docbase :
Federation name :
Server version : 7.3.0000.0214 Linux64.Oracle
Docbase Roles : Global Registry
Docbase Dormancy Status :
--------------------------------------------

You pat yourself on the back and go for a well-deserved coffee.

More details

When I write that idql is slow to connect with the remote docbase, I mean it:

time idql dmtest -Udmadmin -Pdmadmin <<EoQ
quit
EoQ
1> 2> Bye
real 26m4.512s
user 0m5.672s
sys 0m0.548s

And so is dmqdocbroker:

time dmqdocbroker -i <<EoQ
e
EoQ
Enter an option (i.e. letter)>
Enter an option (i.e. letter)>
real 26m3.085s
user 0m5.286s
sys 0m0.364s

Normally, a successful connection should not take much more than few seconds:

time idql dmtest -Udmadmin -Pdmadmin <<EoQ
quit
EoQ
1> 2> Bye
real 0m5.535s
user 0m5.294s
sys 0m0.110s

Notice how the system is practically idle, even during the long execution: the command is simply waiting for something, probably some random values according to the call trace file.

The hack with the /dev/(.)?random nodes may also be applied by the shorter command below:
sudo cp -lf /dev/urandom /dev/random
However, besides the inherent uncertainty in such system tempering, it needs to be repeated after each reboot of the VM because the pseudo-devices behind /dev/.?random gets remapped to the filesystem at boot time. Although it could be added in the startup scripts, a more permanent and much cleaner solution is to instruct java (we know now that there is some java lurking behind the command-line tools) to use the non-blocking device as per the note from Oracle cited below. In short, edit the file $DOCUMENTUM/shared/java64/1.8.0_77/jre/lib/security/java.security
comment the line:
#securerandom.source=file:/dev/random
and add the line instead:
securerandom.source=file:/dev/urandom
In effect, since Documentum v6.x the commands iapi, idql and dmawk (which is invoked by dmqdocbroker) are just binary stubs that invoke the java back-end installed in $DOCUMENTUM/shared/java64. Fortunately, this slowness issue has been tackled since long for the java and the parameter securerandom.sourcefile has been introduced for this purpose. After the change has been applied, the trace file shows that this device is effectively used instead of /dev/random:

open("/dev/urandom", O_RDONLY) = 65
fstat(65, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
read(65, "Q\315s\37\35T\243\217\216\355Q\250N.\217", 20) = 20
...

And a tail -f on the trace file does not show any pause at the read() call any more.
Also, here we cannot pass the property java.security.egd (an alternative solution) because we don’t start the java subsystem ourselves; it is used invisibly behind the scene by the commands iapi and idql (actually, the library libdmcl.so they are linked with). Moreover, as said before, no JVM process gets started. Is there some sort of JIT compilation going on here ? This is a good question to ask OpenText Support.
If the command-line tools passed along the environment variable $DEVRANDOM to the java layer with the effect to override the setting securerandom.sourcefile in the security file, the issue would be solved. This too is a good suggestion for OpenText.

Interestingly, in my laptop’s VM with the Centos v7.0.1406 and the test docbase, a daemon rngd wakes up sometimes quite noisily:
rngd
Its man page says it all:

RNGD(8) System Manager's Manual NAME
rngd - Check and feed random data from hardware device to kernel random device

Lots of information are also available for this service on the web. If this daemon is stopped on the content server VM, the same issue as on the client VM raises its ugly head too. Again, /dev/random is read repeatedly with long pauses in between. Again, switching to /dev/urandom in $DOCUMENTUM/shared/java64/1.8.0_77/jre/lib/security/java.security fixes it. The presence of this daemon explains why the issue was not encountered on the server although java was using the default /dev/random device. Could such a service be used in the client VM too to feed /dev/random and avoid it to block until it collects enough entropy ? As a matter of fact, yes. When this process is started by the following test command:

sudo rngd -f -o /dev/random -v &

or installed from scratch on the client VM (remember, our client runs under Ubuntu, a Debian off-spring which explain why we are using apt-get instead of yum; also, the package installation automatically starts the rngd daemon):

sudo apt-get install rng-tools

/dev/random no longer blocks as it is visible in the command-line tools’ trace file and through the command “cat /dev/random | od -A d4” for example.
Clearly, the added overhead of setting up and running such a daemon vs. a simple parameter change in a configuration file is only acceptable in special cases, e.g. the application is not a java one, it has no choice but use /dev/random, the hack to make it point to /dev/urandom is a no-go, and it is affected by the blocking issue.

The second error message, the DFC_DOCBASE_UNREACHABLE, is a puzzling one too. In the present case, it went away quite easily but later, after my test VMs were restarted, it came back and it took me hours to expunge it. The VM with the content server was still using the default hostname & domain localhost.localdomain. Documentum recommends to give the host a proper name and domain. After this correction was made on the Centos VM using the command below:

sudo sysctl -w kernel.hostname=myhost

and verified with:

sysctl -n kernel.hostname

or simply:

hostname

and the docbase restarted, remote connections were possible again. The used command “sudo hostname dmclient” does not save the hostname into /dev/hostname so the change only lasts until the next reboot. Either edit that file manually or use the sysctl command under Centos. So, make sure that all the system changes be permanently saved otherwise they will be lost at the next reboot.

Admittedly, the two issues discussed here are very unlikely to occur in most organizations where the machines are properly set up by trained staff, from the O/S, the network and up to the java layer. It’s a different situation with local VMs, especially ones with default configurations.

Interestingly, this /dev/random issue does not occur with the old, all binary, non java version of Documentum, e.g. v5.3: it is able to connect to a v7.3 docbase right away, no question asked, which confirms that the issue is a java one. It is still helpful to have one such old release handy for those hard to diagnose cases !

References

https://security.stackexchange.com/questions/14386/what-do-i-need-to-configure-to-make-sure-my-software-uses-dev-urandom
https://docs.oracle.com/cd/E13209_01/wlcp/wlss30/configwlss/jvmrand.html