Working with Documentum is often a frustrating experience, and one of the main causes is the lack of meaningful error messages. Whether this is meant as an encouragement to the administrator or developer to consult the official knowledge base, or is simply a gap in user-friendliness, the result is the dreaded chore of tracking down the origin of the error, which can take hours if not days of laborious and tedious work with little chance of a reward, all the while end users keep complaining that nothing works any more.
The last time I was confronted with this situation, the error looked quite benign at first:

1 May 2022 11:33:23,580 ERROR [pool-3-thread-93] (DCTMNewRendition.java:98) - Unable to save document with id: 090f4517812cbdc7. Got a Dctm Exception
com.documentum.fc.client.DfIdNotFoundException: [DM_API_E_EXIST]error:  "Document/object specified by 080f451780000491 does not exist."
at com.documentum.fc.client.impl.docbase.DocbaseExceptionMapper.newException(DocbaseExceptionMapper.java:49) ~[dfc.jar:?]
at com.documentum.fc.client.impl.connection.docbase.MessageEntry.getException(MessageEntry.java:39) ~[dfc.jar:?]
at com.documentum.fc.client.impl.connection.docbase.DocbaseMessageManager.getException(DocbaseMessageManager.java:137) ~[dfc.jar:?]

Just dump the document with id 090f4517812cbdc7, look at its permissions and those of its parent folder, and check whether the current user is allowed to save it; nothing particularly esoteric, right? Wrong. Firstly, the given id 080f451780000491 is not the id of the document being saved. So where did the server pull it from? Secondly, an attempt at dumping that id confirms that it does not exist. Now the nightmare begins; it feels like entering the twilight zone. Inconsistencies such as this one are difficult to troubleshoot, and the crude error message was certainly of no help.
I first hypothesized that the given id was somehow linked to the current document, either through a relationship or a TBO somewhere, but the developers confirmed that they had configured none. The next obvious step was to doubt them and try to prove them wrong, and to do that the whole repository had to be searched for that id: somewhere in a type, deeply buried in some attribute, that id had to be present, referencing the document to be saved, either as a standalone value or embedded in some string. This was challenging enough for me to write this little one-liner, the db-crawler, to search for that string in the whole docbase (actually the whole schema, i.e. at the database level, since working from the RDBMS gives access to a considerably richer set of functions), and this is what I am going to show you in the next paragraphs.
To close this anecdotal introduction, I'll just say that I did find a few occurrences of the given id but they led nowhere. The actual root cause was a bad DFC cache that had somehow desynchronized from the repository and gone its own way. The moral of the story is that whenever one encounters weird error messages, the first thing to try is to empty the DFC cache, because it probably became stale at some point (maybe the repository was upgraded, new applications were deployed, doctypes were modified, etc.). The good thing, as there always seems to be some benefit to even the peskiest situations, is that I had the opportunity to dust off an old technique and share it with you here.

The crawler

As the RDBMS used by that repository was Oracle, we will work mainly from within the powerful command-line tool sql*plus. The trick to crawling the database schema is first to get the names of the tables of interest from the data dictionary. With Oracle, the view USER_TAB_COLS is the one to consult. Here are its most useful columns:

sqlplus scott/tiger@orcl
SQL> desc USER_TAB_COLS
 Name                                         Null?    Type
 ----------------------------------------- -------- ----------------------------
 TABLE_NAME                                   NOT NULL VARCHAR2(128)
 COLUMN_NAME                                  NOT NULL VARCHAR2(128)
...

That view contains one row for each column of each table in the user's schema. So the crawler just has to get the names of all the columns of all the relevant tables from USER_TAB_COLS and generate one SELECT statement per column, with the string to look for as the condition, e.g. the object id 080f451780000491 from the error message above. Here is how:

select 'select ' || COLUMN_NAME || ' from ' || table_name || ' where ' || COLUMN_NAME || ' like ''%080f451780000491%'';' from USER_TAB_COLS where (table_name like '%\_S' escape '\' or table_name like '%\_R' escape '\') order by 1;

The relevant tables are the ones Documentum uses to store the objects' metadata and other technical data, typically the tables ending in _S for single-valued attributes and _R for repeating attributes. All the other objects in the schema, such as the ones named like %_SP or %_SV, are views built upon those tables and can therefore be ignored, as the quick dictionary check below illustrates.
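Here is that check, run for the dm_sysobject family as an example (a standard, out-of-the-box schema is assumed):

select object_name, object_type
from user_objects
where object_name like 'DM\_SYSOBJECT%' escape '\'
order by object_name;
-- typically, DM_SYSOBJECT_S and DM_SYSOBJECT_R are reported as TABLE,
-- while DM_SYSOBJECT_SP and DM_SYSOBJECT_SV are reported as VIEW and can be skipped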
For more sophisticated searches, regular expressions are better suited, as shown below:

select 'select ' || COLUMN_NAME || ' from ' || table_name || ' where regexp_like(' || COLUMN_NAME || ', ''080f451780000491'');' from USER_TAB_COLS where (table_name like '%\_S' escape '\' or table_name like '%\_R' escape '\') order by 1;

Oracle's regular expressions are quite complete: in addition to being POSIX-compliant, they also support Perl's non-greedy syntax; see the SQL Language Reference for details. In the present case, we do not have many details about the context of the given id, so we use the widest possible criteria in our search; the traditional LIKE pattern-matching condition would be plenty sufficient, but let's switch to regular expressions for the sake of generality (too much is better than not enough, right?).
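As a small aside, here is a throwaway illustration of the non-greedy syntax just mentioned; it only queries dual and has nothing to do with the crawler itself:

select regexp_substr('<a><b>', '<.+?>') from dual;
-- returns '<a>'; the greedy pattern '<.+>' would return '<a><b>' instead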
The SELECT-generating statement above will output a list of SELECT statements, one per column to search, as illustrated below:

select ACCESSOR_NAME from DM_AUDIT_POLICY_S where regexp_like(ACCESSOR_NAME, '080f451780000491');
select ACL_CLASS from DM_ACL_S where regexp_like(ACL_CLASS, '080f451780000491');
select ACL_CLASS from DM_AUDITTRAIL_ACL_S where regexp_like(ACL_CLASS, '080f451780000491');
select ACL_DOMAIN from DMI_TYPE_INFO_S where regexp_like(ACL_DOMAIN, '080f451780000491');
select ACL_DOMAIN from DM_AUDITTRAIL_S where regexp_like(ACL_DOMAIN, '080f451780000491');
...

In order to execute them all at once, they must be wrapped into a SQL script, so let's add some sugar. Still from within sql*plus:

spool db-crawler.sql
select 'select ' || COLUMN_NAME || ' from ' || table_name || ' where ' || COLUMN_NAME || ' like ''%080f451780000491%'';' from USER_TAB_COLS where (table_name like '%\_S' escape '\' or table_name like '%\_R' escape '\') order by 1;
spool off

Finally, let’s execute the generated sql script and capture its output into a file:

spool db-crawler.out
@db-crawler.sql
spool off

The output is redirected to the file db-crawler.out and can be searched with the familiar grep/egrep commands, using the same search string as in the SQL statements.
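For instance, a trivial filter from the shell:

grep -n '080f451780000491' db-crawler.out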

Other RDBMS supported by Documentum, such as SQL Server and PostgreSQL, obviously have their own dictionaries (sys.tables and information_schema, respectively) and official administration tools (sqlcmd and psql, respectively), so the same technique can be applied to them; a possible PostgreSQL variant of the generator query is sketched below.
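For instance, on PostgreSQL the generator query could look like the following sketch; information_schema.columns replaces USER_TAB_COLS, the schema name dm_dmtest is a hypothetical repository owner, and the ::text cast is there because not all columns are of a character type:

select 'select ' || column_name || ' from ' || table_name || ' where ' || column_name || '::text like ''%080f451780000491%'';'
from information_schema.columns
where table_schema = 'dm_dmtest'
  and (table_name like '%\_s' or table_name like '%\_r')
order by 1;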

Some enhancements

As we work inside sql*plus for the Oracle RDBMS, let's make the most of this tool. sql*plus allows us to factor out the searched string into a substitution variable, as follows:

define search_string = "080f451780000491"

It will be referenced in the script through &search_string. This way, the search string can be edited easily without directly modifying the query’s code, always a good practice in scripting and programming.
Also, some formatting of the sql*plus output is needed to remove spurious headers, line-wrapping and trailing blanks; we’ll do this using set statements and the sed command-line utility.
Before executing it, the generated SQL script can be cleaned up a bit by removing non-executable lines such as echoed sql*plus commands, still from inside sql*plus:

!vi db-crawler.sql

Putting everything together plus some additional formatting instructions, we get:

define search_string = "080f451780000491"
spool db-crawler.sql
set pagesize 0
set linesize 1000
set trims on
set termout off
select 'select ' || COLUMN_NAME || ' from ' || table_name || ' where ' || COLUMN_NAME || ' like ''%&search_string%'';' from USER_TAB_COLS where (table_name like '%\_S' escape '\' or table_name like '%\_R' escape '\') order by 1;
spool off
!vi db-crawler.sql
spool db-crawler.out
@db-crawler.sql
spool off
less -S db-crawler.out

That's a lot of typing, so look further down for an easy-to-use wrapper script.

One use case: looking for IP addresses

Let's search the whole repository's schema for any IPv4 address expressed as a dot-separated list of four 1-byte decimal numbers (octets). This is useful, for example, when such addresses should be replaced by their host name counterparts for portability.
The regular expression for an IPv4 address is:

([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])
# or shorter:
(([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.){3}([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])

IPv6 addresses are similarly constructed, except the list is an octuplet of colon-separated 16-bit words expressed in hexadecimal (the simple regexp below does not cover the compressed "::" notation):

[A-F0-9]{1,4}:[A-F0-9]{1,4}:[A-F0-9]{1,4}:[A-F0-9]{1,4}:[A-F0-9]{1,4}:[A-F0-9]{1,4}:[A-F0-9]{1,4}:[A-F0-9]{1,4}
# or shorter:
([A-F0-9]{1,4}:){7}[A-F0-9]{1,4}

So, to look for any hard-coded IPv4 address in a complete repository's schema, we only need to set the variable search_string to the above regexp and use the regexp_like condition, as follows:

define search_string = "([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])"
spool db-crawler.sql
set pagesize 0
set linesize 1000
set trims on
select 'select ' || COLUMN_NAME || ' from ' || table_name || ' where regexp_like(' || COLUMN_NAME || ', ''&search_string'');' from USER_TAB_COLS where (table_name like '%\_S' escape '\' or table_name like '%\_R' escape '\') order by 1;
spool off
!vi db-crawler.sql
spool db-crawler.out
@db-crawler.sql
spool off
less -S db-crawler.out

Still in sql*plus, let's grep the output file db-crawler.out using the same regexp, directly from within the sql*plus environment:

SQL> !egrep '&search_string' db-crawler.out
6.5.0.033
6.5.0.031
6.5.0.031
6.5.0.031
6.5.0.031
6.5.0.031
6.5.0.031
6.5.0.278SP2P2400
6.5.0.037
6.5.0.037
6.5.0.150
5.3.0.0
aspose-email-4.4.0.0-jdk16.jar
castor-0.9.5.3.jar
/app/dctm/product/16.4/lib/castor-0.9.5.4-xml.jar
/app/dctm/product/16.4/install/composer/workspace/MailApp/bin/content/19/-740608919/aspose-email-4.4.0.0-jdk16[1].jar

Some manual filtering is necessary here, as the regexp also matches strings such as the ones above, which are obviously version numbers of content server components. Fortunately, in the present case, those are rare.
Grepping can also be done from the shell's command line, as follows:

$ egrep "([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])" db-crawler.out

Factoring out the search string regexp, especially complex ones like these, increases readability and lessens the likelihood of typing or copy/paste mistakes. See further down for a wrapper that takes this point into account, along with all the sql*plus statements and the final grepping.

Looking for dates

Let's now look for all the attributes containing a given date. While the format of a displayed date depends on the current locale, dates are stored using the database's internal format, which is likely some binary one, e.g. a counter of elapsed seconds since some reference date in the past. However, our search expression accesses the attribute directly, and it is silently converted to a string by the database server using session settings such as the time zone and locale. We therefore need to force the date columns into a representation that can be compared with the given search string. The easiest way to do this is to tell the sql*plus session to format dates using a conventional, common format, which we will also use in the search string, as shown below:

-- force the time zone to UTC;
ALTER SESSION SET TIME_ZONE = '0:0';
-- force the date representation to a conventional format:
alter session set NLS_DATE_FORMAT = 'DD/MM/YYYY HH24:MI:SS';
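A quick, optional check that the new settings are in effect (the result will be displayed in the format just defined):

select sysdate from dual;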

A regular expression for this date format could be:

[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}
# or, more precisely, but without checking day validity per month or during leap years:
(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2])/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}

A popular date format in IT is the basic ISO one:

alter session set NLS_DATE_FORMAT = 'YYYYMMDDHH24MISS';

but it is less user-friendly so we won’t use it here.
However, forcing the session date format only works when comparing stand-alone values in date columns; it does not always help for dates embedded in strings, where they may already have been converted to some other format by the application or the content server itself. For example, the following values are returned from my test repository:

SQL> select OBJECT_NAME from DM_AUDITTRAIL_S where regexp_like(OBJECT_NAME, '2022');
2/18/2022 19:41:29 dm_DataDictionaryPublisher
2/18/2022 19:41:29 dm_DataDictionaryPublisher
2/18/2022 19:41:29 dm_DataDictionaryPublisher
2/12/2022 16:10:48 dm_DataDictionaryPublisher
...

Here, a conversion to M/D/YYYY HH24:MI:SS was performed by the content server when it set DM_AUDITTRAIL_S.OBJECT_NAME. A search with the first date regexp would not return those values because their first field uses one digit instead of two; similarly, the second regexp would not match either because, read as DD/MM, their month field is greater than 12. So, searching dates exhaustively may require several passes with different, more relaxed search strings until all the formats used in strings have been tried out, or a single, slightly more flexible regexp such as the following one:

[0-9]{1,2}.[0-9]{1,2}(.([0-9]{2}|[0-9]{4}))?.[0-9]{1,2}.?[0-9]{1,2}.?[0-9]{1,2}

where the day and month components may have 1 or 2 digits, the date and time field separators can be any character (common ones are / and - for dates, : for times), the separator between the date and the time part can be anything (commonly a space, or the letter T as in logs produced by Java programs), and the optional year component has 2 or 4 digits; a 2-digit year is quite ambiguous, even though Oracle's "RRRR" datetime format rule can help resolve the ambiguity. A quick way to try such a pattern out is shown below.
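Here is a throwaway way to test the relaxed pattern against a few hard-coded sample strings before unleashing it on the whole schema; everything comes from dual, nothing is read from the repository:

select s from (
  select '2/18/2022 19:41:29' as s from dual union all
  select '18/02/2022 19:41:29' from dual union all
  select '2022-02-18T19:41:29' from dual union all
  select 'not a date' from dual
)
where regexp_like(s, '[0-9]{1,2}.[0-9]{1,2}(.([0-9]{2}|[0-9]{4}))?.[0-9]{1,2}.?[0-9]{1,2}.?[0-9]{1,2}');
-- the first three strings match, the last one does not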

An interesting fact is shown in the output below:

SQL> select A_NEXT_CONTINUATION from DM_JOB_S where regexp_like(A_NEXT_CONTINUATION, '[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}');
01/01/0001 00:00:00
01/01/0001 00:00:00
01/01/0001 00:00:00
01/01/0001 00:00:00
...

Those seemingly meaningless dates are actually NULLDATEs and show how the content server stores them in an Oracle database. This is RDBMS-dependent; for example, in PostgreSQL (and also in SQL Server), the content server uses 1/1/1753 for the NULLDATE. The underlying RDBMS must therefore be taken into account when searching for NULLDATEs, and the appropriate date literal used in such cases, as in the sketch below.
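For instance, on Oracle, the NULLDATEs of a given date attribute can be listed by comparing against that sentinel value directly; a small sketch using the dm_job_s table seen above:

select r_object_id, a_next_continuation
from dm_job_s
where to_char(a_next_continuation, 'DD/MM/YYYY') = '01/01/0001';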

A cool wrapper

To simplify the usage of the db-crawler and include the date considerations above, the bash wrapper script below can be used:

#!/bin/bash
# Usage:
#   db-crawler.sh connect_info search_string
# The script invokes sql*plus with the given connect_info and passes it the search_string;
# the connect_info are username[/password][@connect_string];
# if the password is omitted, sqlplus prompts for one;
# If the search string contains spaces, it must be quoted;
# a SQL file named after the current process id, $$.sql, will be produced containing the query that searches the dictionary for the _S and _R tables;
# that file will be executed and will generate another SQL file named db-crawler_$$.sql containing all the SELECT statements to search for the given string across all the _S and _R tables;
# finally, db-crawler_$$.sql will be executed and its output captured in db-crawler_$$.out;
if [[ $# -ne 2 ]]; then
   echo "Expected syntax: db-crawler.sh connect_info search_string"
   exit 1
fi
connect_info=$1
search_string="$2"
cat - <<-eoq > ./$$.sql
   define search_string = "&1"
   spool db-crawler_$$.sql
   set pagesize 0
   set linesize 1000
   set trims on
   ALTER SESSION SET TIME_ZONE = '0:0';
   alter session set NLS_DATE_FORMAT = 'DD/MM/YYYY HH24:MI:SS';
   select 'select ' || COLUMN_NAME || ' from ' || table_name || ' where regexp_like(' || COLUMN_NAME || ', ''&search_string'');' from USER_TAB_COLS where (table_name like '%\_S' escape '\' or table_name like '%\_R' escape '\') order by 1;
   spool off
   set echo on
   set trims on
   @db-crawler_$$.sql
   spool off
   quit
eoq
sqlplus ${connect_info} @$$.sql "${search_string}" | sed -e '0,/EXIT/d' -e '/quit/,$d' > db-crawler_$$.out
# browse the result output and jump to the first match;
less -S --pattern="${search_string}" db-crawler_$$.out

Let’s apply this wrapper to the use case below.

Looking for a given host name

Let's now suppose that we want to move or clone a docbase from one host to another, maybe running a different O/S with some implementation-specific differences. After applying the procedure, we suspect that the former host name is still referenced somewhere in the new repository, but we have no idea where. We changed it in the well-known places such as DM_SERVER_CONFIG and DM_MOUNT_POINT, but there may still be references to it elsewhere. Clearly, we must search all the columns of all the tables in the repository's schema for the given host name string, which is now very easy using the wrapper:

export TWO_TASK=orcl
./db-crawler.sh dmtest/dmtest hostname

Example of execution:

./db-crawler.sh dmtest/dmtest docker
egrep -B 1 "^docker" db-crawler_19815.out | less
SQL> select HOST_NAME from DM_AUDITTRAIL_S where regexp_like(HOST_NAME, 'docker');
docker
docker
...
--
SQL> select HOST_NAME from DM_CLIENT_REGISTRATION_S where regexp_like(HOST_NAME, 'docker');
docker
...
--
SQL> select HOST_NAME from DM_CLIENT_RIGHTS_S where regexp_like(HOST_NAME, 'docker');
docker
--
SQL> select HOST_NAME from DM_MOUNT_POINT_S where regexp_like(HOST_NAME, 'docker');
docker
--
SQL> select OBJECT_NAME from DM_SYSOBJECT_S where regexp_like(OBJECT_NAME, 'docker');
dockerACS1
--
SQL> select R_HOST_NAME from DM_SERVER_CONFIG_S where regexp_like(R_HOST_NAME, 'docker');
docker
--
SQL> select R_LOCK_MACHINE from DM_SYSOBJECT_S where regexp_like(R_LOCK_MACHINE, 'docker');
docker
docker
docker
--
SQL> select SET_CLIENT from DMR_CONTENT_S where regexp_like(SET_CLIENT, 'docker');
docker
docker
...
SQL> select WEB_SERVER_LOC from DM_SERVER_CONFIG_S where regexp_like(WEB_SERVER_LOC, 'docker');
docker

We can see that the host name docker was referenced in the attributes DM_AUDITTRAIL.HOST_NAME, DM_CLIENT_REGISTRATION.HOST_NAME, DM_CLIENT_RIGHTS.HOST_NAME, DM_MOUNT_POINT.HOST_NAME and DM_SERVER_CONFIG.R_HOST_NAME, but also in DM_SYSOBJECT.OBJECT_NAME and DM_SYSOBJECT.R_LOCK_MACHINE (there were likely locked documents in the source repository), DM_SERVER_CONFIG.WEB_SERVER_LOC, and DMR_CONTENT.SET_CLIENT. Some of those attributes, such as DM_MOUNT_POINT.HOST_NAME, could explain a potential issue in the new repository if not set adequately, while others, such as DMR_CONTENT.SET_CLIENT, should have no impact if carried over as-is.

A note about DM_SYSOBJECT_S

Since DM_SYSOBJECT sits at the top of the type hierarchy of most persistent objects, most repository objects have rows in its underlying tables, and search results such as the ones below may be confusing:

$ ./db-crawler.sh dmtest73/dmtest73 dmtest
...
SQL> select OBJECT_NAME from DM_SYSOBJECT_S where regexp_like(OBJECT_NAME, 'dmtest');
dmtest.cecACS1
dmtest73
dmtest73
dmtest73
ContTransferConfig_dmtest73.dmtest73
dfc_dmtest.cec_c0XP4a

Clearly, judging by their technical appearance, these rows may belong to objects of different types derived from DM_SYSOBJECT; in order to determine which ones, further digging is necessary, e.g.:

$ idql dmtest73 -Udmadmin -Pxxx
1> select r_object_id, r_object_type, OBJECT_NAME from DM_SYSOBJECT where OBJECT_NAME like '%dmtest%';
2> go
r_object_id       r_object_type                     object_name                             
----------------  --------------------------------  ----------------------------------------
0c00c35080000104  dm_cabinet                        dmtest73                               
3c00c35080000103  dm_docbase_config                 dmtest73                               
3d00c35080000102  dm_server_config                  dmtest73                               
0800c35080000495  dm_acs_config                     dmtest.cecACS1                         
0800c350800004ba  dm_cont_transfer_config           ContTransferConfig_dmtest73.dmtest73   
0800c35080000585  dm_client_rights                  dfc_dmtest.cec_c0XP4a                 
(6 rows affected)

We can see, for example, that the rows ContTransferConfig_dmtest73.dmtest73 and dfc_dmtest.cec_c0XP4a from the crawler's output correspond to objects of the repository types dm_cont_transfer_config and dm_client_rights. Thus, if something needs to be corrected in the repository, that is preferably where it should be done, e.g. through DQL against the real type as sketched below, in order to minimize the likelihood of introducing corruptions.
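Purely as an illustration of the syntax, a correction applied at the repository level through DQL, rather than through direct SQL updates, could look like this (the new value is hypothetical; check the impacted type first and take backups):

$ idql dmtest73 -Udmadmin -Pxxx
1> update dm_acs_config objects set object_name = 'newhost.cecACS1' where object_name = 'dmtest.cecACS1';
2> go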

Renaming a docbase/changing its docbase id

Another use case, albeit a far-fetched and fragile one, is when a repository must be renamed or its id changed. While the safe, traditional approach is to create a new docbase with the new name/id and export/import the content, a more acrobatic approach would use the crawler to identify all the places in the underlying tables where those values are referenced, and change them there.
Clearly, in addition to the docbase's schema, there are several configuration files where such information is referenced too and should be edited. As most of those files are text files, the find command combined with a sed invocation can do that quickly in one pass (see the sketch below). As to the schema's tables, a search with a regexp such as '[0-9a-f]{2}<docbase_id_in_hex>[0-9a-f]{8}', where <docbase_id_in_hex> stands for the 6-hexadecimal-digit docbase id embedded in every object id, can identify all the ids to change. Something similar could be used for the repository's name. Of course, a great deal of testing should be done, and maybe a call to OpenText to discuss this approach wouldn't hurt either. Moreover, since they also distribute their own docbase renaming/id changing tool, MigrationUtil (see the blog articles Documentum – MigrationUtil – 1 – Change Docbase ID and Documentum – MigrationUtil – 2 – Change Docbase Name), it could be interesting to compare both techniques.
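A minimal sketch of such a find/sed pass over the configuration files could be the following (the path and the old/new names are placeholders; take a backup and review the matched file list before editing in place):

# list the text files under the dba directory that still mention the old docbase name, then edit them in place
find $DOCUMENTUM/dba -type f -exec grep -Il 'olddocbase' {} + | xargs sed -i.bak 's/olddocbase/newdocbase/g'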

Conclusion

The db-crawler has a lot of potential to help troubleshoot puzzling problems and to identify inconsistencies and corruptions in repositories. As all RDBMS have a data dictionary, it is easy to adapt it to a different underlying database. It can even be used outside the context of Documentum repositories, directly against any database.
As a typical repository's schema contains more than 2'000 tables (more if custom doctypes have been introduced), with possibly millions of documents (each corresponding to one or more rows spread over several database tables), a possible enhancement is to speed the search up by parallelizing the execution of the generated SELECT statements, an easy task using a divide-and-conquer technique or a tool such as GNU Parallel, as sketched below.
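A possible divide-and-conquer sketch with GNU Parallel follows; the credentials, chunk size and degree of parallelism are placeholders, and it assumes db-crawler.sql contains one complete SELECT statement per line, as generated above:

# split the generated statements into chunks and run them through concurrent sqlplus sessions
split -l 200 db-crawler.sql chunk_
ls chunk_?? | parallel -j 4 '(cat {}; echo exit) | sqlplus -s scott/tiger@orcl > {}.out'
cat chunk_??.out > db-crawler.out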
Currently, the tool is very basic as it addresses a simple need but it does not have to stay this way so feel free to add useful features and optimizations.