In the 2-part blog article Adding a Documentum Extension to gawk, I showed how to extend the gawk scripting language to implement the Documentum dmAPI functions and turn it into an effective Documentum client. Thanks to that new interface, a gawk script can now connect to Documentum repositories, send them API commands and retrieve their responses. The official dmawk already does all this, but it is based on an unidentified awk interpreter and does not look like it has been maintained for a long time, e.g.:
$ /app/dctm/product/16.4/bin/dmawk -V
awk version 19990620
That program is likely an old proprietary or licensed interpreter offering a traditional implementation of the language. In contrast, by extending gawk, the GNU awk implementation, we benefit from an evolving, up to date, well maintained and extensible interpreter.
Since that article is almost 4 years old, a few things changed in the meantime so I thought it was time to revisit it and update it with the latest user experience. Also, the extension’s first version was hastily published and therefore lacked some polishing and functionalities; the select() function for example was really crude and, being one of the interface’s most important functionalities, deserves far more attention.
First of all, the code is now available from dbi services' GitHub area, great progress compared to the former cut-and-paste-code-from-the-WordPress-page approach! All the issues with possibly wrong character encoding and formatting (especially for Python code, where indentation is part of the syntax) now belong to the past. Just use the command git clone https://github.com/dbiservices/dbi-DctmApi-gawk.git or click on Code/Download ZIP and follow the instructions provided on the page dbi-DctmApi-gawk to set it up.
dctm.c, the C interface to the dmcl.so run-time library, has been cleaned up a bit and a bug corrected. The API's dmAPIInit() function is now called automatically instead of relying on the user to do it, which was often forgotten. Not that it caused a hard-to-diagnose run-time error, but calling that init function is redundant with loading the extension through gawk's @load instruction and there is no real need to keep both steps separate. Moreover, it was noticed later that dmAPIInit() is idempotent, which means that repeated invocations have no further observable effect and hence are useless; therefore, that function is no longer exported. Loading and initializing the interface is now one single, indivisible step.
In DctmAPI.awk, verbose error messages can now be better controlled through the global variable dmLogLevel; all messages are conditionally issued from the same point, function dmShow(). When troubleshooting, just set that variable to 1 to enable execution tracing, and 0 (the default) to disable it. dmShow() can also be used in custom code if no separation of the message source, either DctmAPI calls or the custom code, is necessary.
dctm.c now exports a new function dmGetPassword(). It uses unistd's getpass() to prompt for a password with a prompt string and no echo. It had to be implemented because dmAPI's own dmGetPassword() does nothing in some cases (e.g. when gawk is invoked as a coroutine through a pipe). dctm.c's dmGetPassword() has already been presented in the article A password() function for dmgawk, along with other techniques for reading a password.
The connect() function has been enhanced as described in the article Connecting to Repositories with the Same Name and/or ID. Basically, if one knows exactly where a given repository's docbroker is located, one can use the syntax docbase_name[:docbroker_host[:docbroker_port]] to specify a connect string and, if the traditional way to connect is no longer necessary, one can even remove the dfc.docbroker.host[i]=/dfc.docbroker.port[i]= pairs of entries from the dfc.properties file. The function supports both the traditional way of connecting, which uses the standard repository name resolution mechanism (i.e. sequentially querying the docbrokers in the dfc.properties file until one is found that knows the repository or, more generally, following the resolution order defined in dfc.docbroker.search_order), and the direct way when the enhanced syntax is provided. The primary reason this syntax was chosen is to work around the current limitation of the resolution mechanism when repositories with the same name or docbase id project to the docbrokers listed in the dfc.properties file.
A new connect2() function has been included. It supports the same enhanced repository syntax to solve the same issue as above, but with an alternative implementation based on the two-part article Connecting to a Repository via a Dynamically Edited dfc.properties. In short, the trick is to programmatically edit the dfc.properties file to replace all the pairs dfc.docbroker.host[i]=/dfc.docbroker.port[i]= with just the one from the connection string so that any ambiguity is removed, and then to restart the client (it has to be restarted because a further call to dmAPIInit() has no effect, as written above). Admittedly, this approach is more complex as it changes the dfc.properties configuration file and, if that file is relocated, requires a change in the java.ini file as well, as described in the Knowledge Base article How to specify a different dfc.properties with IAPI on a Linux server. Moreover, for efficiency reasons, it uses yet another gawk extension, gexec, as a replacement for gawk's standard system() function. As its name suggests, this function invokes unistd's execl() function. However, unless one wants to experiment, connect() is the function to use.
The select() function is now a family of functions containing dmSelecto(), dmSelect()/dmNext(), dmSelecta() and 2 more functions, simple_show_table() and show_table(), for displaying the result set on the text-mode terminal. Since data presentation is likely the most frequent use of the whole interface, this area has gone through extensive enhancements with the addition of colorization, truncation or wrapping of column text, and gridding; see the next paragraph for more explanations.
A few words on the programming language gawk
As the DctmAPI.awk extension is written in gawk, it is worth mentioning a few facts about the language to help understand the code.
gawk is a C-like, procedural, interpreted language; it is neither object-oriented nor functional. It is a dynamic language in that variables don't need to be declared and can change their type during their lifetime, except arrays.
There are only scalar data types (integer, floats/doubles and strings) and arrays. Regular expressions are a bit special in that their literals can be mixed with strings, whereas variables typed as regular expressions cannot; this is not a major problem for practical purposes.
Scalar parameters are passed by copy (IN) and arrays by reference (IN OUT); there is no other passing mode.
There are no local variables but there is a trick to simulate them: any unbound (unused) arguments in the function's declaration are considered local to the function. So, it is OK to call a function with fewer parameters than in its declaration, but not with more; the unmatched parameters receive the default null value, 0 or "", depending on their usage context. By convention, local variables are listed in the function declaration with a few separating spaces after the actual parameters to show that they are not used by the caller but only internally by the function.
There is no way to explicitly specify default values for parameters; new variables default to the null value, i.e. 0, "" or the empty array (but there is no literal for the empty array), when used the first time.
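The parameter-passing and local-variable rules above can be seen in a few lines. This is a minimal sketch; sum_into() and its variable names are made up for the illustration and are not part of DctmAPI.awk.

```awk
# n is a scalar (copied), arr is an array (shared with the caller);
# i and total come after the extra spaces: they are the "locals".
function sum_into(n, arr,    i, total) {
    for (i = 1; i <= n; i++)
        total += i             # total starts as the null value, i.e. 0 here
    arr["sum"] = total         # visible to the caller: arrays go by reference
    n = 0                      # invisible to the caller: scalars go by copy
    return total
}

BEGIN {
    i = 99; n = 4              # globals, untouched by the function
    sum_into(n, out)           # out becomes an array inside the function
    print n, i, out["sum"]     # prints "4 99 10"
}
```

Without the extra i in the parameter list, the loop would have overwritten the global i.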
There are no boolean types; in logical expressions, null values are evaluated to false, non null to true.
The only non-scalar data structure is the associative array; it can be multi-dimensional and nested without limit; moreover, values can be heterogeneous in type and, if a value is an array, in shape; indeed, arrays can be jagged in all their dimensions. This makes for a very powerful data structure, akin to a record/struct or an array of records/structs recursively containing arrays in more traditional languages, and it can model all the classic data structures such as sets, stacks, queues, and trees.
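As a small illustration of such heterogeneous arrays, the sketch below builds a record whose fields are either scalars or sub-arrays and tells them apart with gawk's isarray(). The helper nb_values() is hypothetical, and arrays of arrays require gawk 4.0 or later.

```awk
# Return how many values a field holds: 1 for a scalar, n for a sub-array.
function nb_values(rec, key) {
    return isarray(rec[key]) ? length(rec[key]) : 1
}

BEGIN {
    doc["r_object_id"] = "0900c350800001d0"    # scalar field
    doc["r_version_label"][0] = "CURRENT"      # multi-valued field: a sub-array
    doc["r_version_label"][1] = "1.0"
    print nb_values(doc, "r_object_id"), nb_values(doc, "r_version_label")   # prints "1 2"
}
```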
All the classic operators from C are present, even the liberally used ternary operator.
Low-level, C-like control statements such as the for and while loops are present.
In summary, when compared to other mainstream scripting languages such as Python or Perl, gawk is really simplistic; however, its purpose is not to compete with those languages but to be useful at what it was created for: text parsing and processing. In this area, it unquestionably succeeded.
For further information on the language, the reference book is GAWK: Effective AWK Programming by Arnold D. Robbins, available on-line here.
The interface’s revamped functions
Let's now see the main enhancements in more detail.
function dmConnect(docbase, user_name, password)
dmConnect connects to the given docbase with the given credentials. Its synopsis is:
dmConnect("docbase_name[:docbroker_host[:docbroker_port | 1489]]", user_name [, password])
The function takes a repository name, optionally with an enhanced syntax, a user name and an optional password.
When the enhanced syntax is used, i.e. when it specifies a docbroker host and an optional docbroker port (defaulting to 1489), dmConnect() will directly query the given docbroker for the given repository. This trick makes it possible to unambiguously reach repositories with the same name or id, which is impossible with the current resolution mechanism in the DFCs. For example, suppose you have a repository named Books which exists in several installation environments such as DEV, CTLQ and PROD, and you want to be able to access any of them from the same DFCs client running on the same machine. With this syntax, you would call dmConnect() as follows:
dmConnect("Books:dev_server", "dmadmin", "dev_password")
dmConnect("Books:ctlq_server", "dmadmin", "ctlq_password")
dmConnect("Books:prod_server", "dmadmin", "prod_password")
Technically, the repositories could even be hosted on the same machine if they projected to dedicated docbrokers (docbrokers reject subsequent projections from repositories with the same name or id as ones that already projected to them).
Without that enhancement, the program would attempt to connect to the machine returned by the first docbroker from the dfc.properties file that knows Books (not necessarily in the order they are listed in that file), and it might not be the right environment (yet another good reason to have different passwords depending on the environment). By chance, it might be the right one at first, but what if you wanted to change environment? With this enhancement there is no longer any need to list the docbrokers in the dfc.properties file. See the aforementioned article for more information.
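To make the enhanced syntax concrete, here is a minimal sketch of how such a connect string could be split into its components. parse_connect_string() is a hypothetical helper written for this illustration, not the code used by dmConnect(); the only detail taken from above is the default port 1489 when a host is given without a port.

```awk
# Split "docbase[:host[:port]]" into parts["docbase"], parts["host"], parts["port"].
# host and port stay empty when only the docbase name is given, which would
# mean "use the traditional resolution through dfc.properties".
function parse_connect_string(s, parts,    n, tmp) {
    n = split(s, tmp, ":")
    parts["docbase"] = tmp[1]
    parts["host"]    = (n >= 2) ? tmp[2] : ""
    parts["port"]    = (n >= 3) ? tmp[3] : (n == 2 ? 1489 : "")
}

BEGIN {
    parse_connect_string("Books:dev_server", p)
    print p["docbase"], p["host"], p["port"]    # prints "Books dev_server 1489"
}
```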
A DQL select query can be executed in two steps: first the compilation of the DQL statement, in dmSelect(), and then the retrieval of the resulting rows, in dmNext(), one row (or document in Documentum parlance) per call. This approach is classic in database interfaces such as ODBC and JDBC, and in APIs such as Oracle's OCI.
For efficiency, dmSelect() also invokes get_metadata() to retrieve the result set's metadata once and for all. The metadata are returned in an associative array with the following structure (see also the comment in the code).
From the result set:
metadata["nb_cols"] : the number of columns, i.e. length(result["metadata"]["col_name"]);
metadata["col_name"][0 ...] : the column names;
metadata["is_repeating"][0 ...] : whether the column is a repeating attribute;
metadata["nb_rows"] : the number of rows in the result set, -1 at this point as it is not known yet;
metadata["max_nb_repeating"][0 ... nb_cols - 1] : the maximum number of repeating values for each column;
metadata["max_col_length"][0 ... nb_cols - 1] : the maximum length of the column's value or values if the attribute is repeating;
metadata["max_length"] : the maximum of the largest column values, i.e. max(result["metadata"]["max_col_length"][0 ... nb_cols - 1]);
metadata["max_concat_col_length"][0 ... nb_cols - 1] : the maximum length of the column's value or concatenated values if the attribute is repeating;
    = metadata["max_col_length"][0 .. nb_cols - 1] if there are no repeating values or just one;
metadata["max_concat_length"] : the maximum of the largest column values, i.e. max(metadata["max_concat_col_length"][0 ... nb_cols - 1]);
    = metadata["max_length"] if there are no repeating values or just one;
metadata["total_max_length"] : the sum of the maximum length of all the column values, i.e. sum(metadata["max_concat_col_length"][0 ... nb_cols - 1]);
This last group of metadata is used to ease the presentation of the result set in the functions simple_show_table() and show_table(). They make it possible to format the result set's attributes in a nice tabular presentation. See those functions below for more details.
dmNext() receives the metadata obtained by the caller from get_metadata() and calls the dmAPI’s next function to get the next row from the result set. It then stores the received attributes, with their leading spaces stripped, into an associative array indexed by 0-based integers:
result[0 ... nb_cols -1]([0 ... nb_repeating - 1]): a mostly one-dimensional vector containing the row's attributes;
When an attribute has repeating values, element result[i] is no longer a scalar but a 0-based vector containing the i-th attribute's repeating values. In effect, gawk arrays can be irregular, or jagged, with each value being anything from a number or string to an n-dimensional sub-array, itself regular or jagged and nested without limit. The only restriction is that once a cell has received an array value, it keeps that type for the rest of its lifetime.
Thus, dmNext() only stores one row at a time in memory. The caller decides what to do with this row, either print it or accumulate it for later processing. Currently, the only caller of dmNext() in the interface is dmSelecto() and it prints the result set one row at a time. See this function for an example of how to iterate through the returned row to print the mono-valued or multi-valued attributes.
This function replaces the original dmSelect() one. The "o" suffix stands for output.
The original function was a quick and dirty way to check a SELECT query's syntax correctness and its result. The new one still outputs its result set to stdout, but repeating attributes are now returned too, formatted as a string of |-separated values. Column headers and attributes are tab-separated and rows are line-feed terminated. This easy-to-parse format makes it straightforward to import the output into a spreadsheet, for example.
dmSelecto() first calls dmSelect(), then repeatedly calls dmNext() and outputs the received row until the whole result set has been gone through.
For each value in a row, it checks against the query's metadata whether it is a repeating attribute; if so, the value is a vector of values and dmSelecto() iterates through them and prints them with a pipe (|) symbol as separator; if it is a mono-valued attribute, its value is directly output. A tab character separates the attributes. Here is an example of output:
r_object_id object_name authors r_version_label keywords i_folder_id owner_name acl_domain acl_name version_label folder_id
1: 0900c350800001d0 Default Signature Page Template CURRENT|1.0 0b00c350800001c5 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c350800001c5
2: 0900c350800001da 5/7/2019 15:10:24 dm_PostUpgradeAction CURRENT|1.0 0b00c350800001e7 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c350800001e7
3: 0900c350800001db Blank PowerPoint Pre-3.0 Presentation CURRENT|1.0 0c00c3508000012f dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0c00c3508000012f
4: 0900c350800001dc Blank WordPerfect 6 Document CURRENT|1.0 0c00c3508000012f dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0c00c3508000012f
5: 0900c350800001dd Blank WordPerfect 7 Document CURRENT|1.0 0c00c3508000012f dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0c00c3508000012f
6: 0900c350800001de Blank WordPerfect 8 Document CURRENT|1.0 0c00c3508000012f dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0c00c3508000012f
54: 0900c35080000343 BPM Runtime CURRENT|1.0 0b00c35080000331|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c35080000331|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334
55: 0900c35080000344 Castor CURRENT|1.0 0b00c35080000337|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c35080000337|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334
56: 0900c35080000345 JXPath CURRENT|1.0 0b00c35080000337|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c35080000337|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334
57: 0900c350800003ba Shme.vrf 1.0|CURRENT 0b00c35080000133|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334 dmadmin dmtest dm_acl_superusers 1.0|CURRENT 0b00c35080000133|0b00c35080000330|0b00c3508000032f|0b00c35080000333|0b00c35080000332|0b00c35080000334
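The row formatting just described can be sketched as follows. format_row() is a hypothetical helper, not dmSelecto()'s actual code, and it uses isarray() to detect multi-valued attributes where the interface consults the metadata's is_repeating entries.

```awk
# Build one output line from a row as returned by dmNext():
# attributes are tab-separated, repeating values are pipe-separated.
function format_row(row, nb_cols,    i, k, n, s, sep, line) {
    for (i = 0; i < nb_cols; i++) {
        if (isarray(row[i])) {             # multi-valued attribute
            n = length(row[i])
            s = ""
            for (k = 0; k < n; k++) {
                sep = (k > 0) ? "|" : ""
                s = s sep row[i][k]
            }
        } else                             # mono-valued attribute
            s = row[i]
        sep = (i > 0) ? "\t" : ""
        line = line sep s
    }
    return line
}

BEGIN {
    row[0] = "0900c350800001d0"
    row[1][0] = "CURRENT"
    row[1][1] = "1.0"
    row[2] = "dmadmin"
    print format_row(row, 3)    # 0900c350800001d0<TAB>CURRENT|1.0<TAB>dmadmin
}
```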
This function is more elaborate than dmSelecto(). The "a" suffix stands for array. Indeed, dmSelecta() reads the whole result set into an array in memory in order to process it later. Obviously, this is not the best way to manage billions of rows, but for small result sets it is acceptable.
dmSelecta() calls dmSelect() to execute the query but, for historical reasons, it then directly calls the API's next to iterate through the data instead of calling the interface's dmNext(). Besides storing the data, it computes a few maxima that are used later to print the data in a nicely formatted table.
The received data are stored into the more general associative array result with the following structure:
result["metadata"] : the associative array containing the metadata as previously presented; they are included in the result in order to minimize the number of variables to manage and pass around, and also because it makes sense to have everything in one place;
result["data"][0 ... nb_rows - 1][0 ... nb_cols - 1] : a mostly square matrix (i.e. a table) of ith-row result["data"][i] with its attributes result["data"][i][j];
Thus, the i-th row's data are stored into result["data"][i] and the j-th attribute of row i, with its leading spaces stripped, is stored into result["data"][i][j]. If that attribute is repeating, the value result["data"][i][j] is no longer a scalar but a vector with component k stored in result["data"][i][j][k].
When the data are to be printed in a table with neither truncation nor wrapping, independently of the physical screen size (we consider it unlimited, e.g. when they are piped into the less command), we need the maximum width of each attribute; those are stored in result["metadata"]["max_col_length"][i] for attribute i. The maximum of those maxima is also computed and stored in result["metadata"]["max_length"]; it is not used so far and, as they say, is reserved for future use (it could be used to compute the required size of the table displayed by show_table()).
Multi-valued attributes displayed in table cells by the function simple_show_table() are first concatenated and pipe-separated; thus, we need to know the maximum width of such a cell, i.e. the longest string of concatenated values for each multi-valued attribute. Those maxima are stored in result["metadata"]["max_concat_col_length"][i] for the repeating attribute i. Finally, like before, the maximum of those maxima is computed and stored in result["metadata"]["max_concat_length"] and, likewise, it is not used so far (it could be used to compute the required size of the table displayed by simple_show_table()).
In the interface, dmSelecta()'s result is used by simple_show_table() and show_table(). Refer to those functions for examples of how to iterate through that structure.
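The two kinds of per-column maxima can be computed in one pass over result["data"]. The function below is only a sketch under that assumption: compute_widths() is a hypothetical name, dmSelecta() computes these values itself while fetching the rows, and the column headers are deliberately ignored here.

```awk
# Fill result["metadata"]["max_col_length"] and ["max_concat_col_length"]
# from the rows already stored in result["data"].
function compute_widths(result,    i, j, k, n, len, concat, nb_rows, nb_cols) {
    nb_rows = length(result["data"])
    nb_cols = result["metadata"]["nb_cols"]
    for (j = 0; j < nb_cols; j++) {
        result["metadata"]["max_col_length"][j] = 0
        result["metadata"]["max_concat_col_length"][j] = 0
    }
    for (i = 0; i < nb_rows; i++)
        for (j = 0; j < nb_cols; j++) {
            if (isarray(result["data"][i][j])) {      # repeating attribute
                n = length(result["data"][i][j])
                concat = (n > 1) ? n - 1 : 0          # room for the "|" separators
                for (k = 0; k < n; k++) {
                    len = length(result["data"][i][j][k])
                    concat += len
                    if (len > result["metadata"]["max_col_length"][j])
                        result["metadata"]["max_col_length"][j] = len
                }
            } else {                                  # single value
                concat = len = length(result["data"][i][j])
                if (len > result["metadata"]["max_col_length"][j])
                    result["metadata"]["max_col_length"][j] = len
            }
            if (concat > result["metadata"]["max_concat_col_length"][j])
                result["metadata"]["max_concat_col_length"][j] = concat
        }
}

BEGIN {
    result["metadata"]["nb_cols"] = 2
    result["data"][0][0] = "0900c350800001d0"
    result["data"][0][1][0] = "CURRENT"
    result["data"][0][1][1] = "1.0"
    result["data"][1][0] = "0900c350800001da"
    result["data"][1][1][0] = "1.0"
    compute_widths(result)
    print result["metadata"]["max_col_length"][1], result["metadata"]["max_concat_col_length"][1]   # prints "7 11"
}
```

For the second column, the longest single value is CURRENT (7 characters) while the longest concatenated cell is CURRENT|1.0 (11 characters).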
# print into a table the result of a select statement stored in array result with the structure described in dmSelecta() above;
# colors is the fg/bg colors expressed as string with format fg[.bg], bg default to black; leave it empty if no colors wanted;
# colors defaults to "", i.e. no color;
# col_periods is a string containing the periodicities of the colors, expressed as string with format [-]fg_period.[-]bg_period;
# periodicities are defined as the number of lines to display in the respective color before switching to bg_color.fg_color (i.e. reversing the colors) for that many lines;
# it defaults to 1.1, also in case of syntax error;
# i.e. fg_period lines are displayed in fg_color/bg_color and then bg_period lines displayed in bg_color/fg_color, rinse, repeat;
# if fg_color or bg_color, or both, is negative, no respective colorization takes place;
# grid_type can be empty or contain one of the available values such as ascii, half-light, light, light-double-dash, etc...; see function init_grid_symbols();
function simple_show_table(title, result, colors, col_periods, grid_type    , periods, nb_fg, nb_bg, bno_color, bno_inverse, i, j, k, s)
The function prints to the character-mode terminal a nicely formatted table of the data it received through the parameter result. It has the simple prefix because it neither wraps column text inside the cells nor truncates it (see show_table() for this). Instead, it displays the whole data row, with each column as wide as its largest attribute value. If attributes have multiple values, they are all concatenated together and separated by a pipe character "|", and it is the length of this new string that determines the column's maximum width.
The flexibility of the show_table functions comes at a price: they have lots of parameters, which makes them complex to use. Unlike Python and its partial and overloaded functions, and default parameter values, gawk is very limited here and we have to resort to other tricks to keep the parameter list short, e.g. delegating special cases to other functions. This does not contain the usage complexity caused by the flexibility we want, but at least it attempts to limit the programmatic one somewhat.
r_object_id object_name authors r_version_label keywords i_folder_id owner_name acl_domain acl_name version_label folder_id
0900c350800001d0 Default Signature Page Template CURRENT|1.0 0b00c350800001c5 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c3508000
0900c350800001da 5/7/2019 15:10:24 dm_PostUpgradeAction CURRENT|1.0 0b00c350800001e7 dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0b00c3508000
0900c350800001db Blank PowerPoint Pre-3.0 Presentation CURRENT|1.0 0c00c3508000012f dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0c00c3508000
0900c350800001dc Blank WordPerfect 6 Document CURRENT|1.0 0c00c3508000012f dmadmin dmadmin dm_4500c35080000101 CURRENT|1.0 0c00c3508000
...
0900c3508000062c smart-object-impl.jar CURRENT|1.0 0b00c350800005b2 dmadmin dmtest BOF_acl CURRENT|1.0 0b00c3508000
0900c3508000062d smart-object.jar CURRENT|1.0 0b00c350800005f9|0b00c350800005b2 dmadmin dmtest BOF_acl CURRENT|1.0 0b00c3508000
0900c3508000062e smartcontainer-common-impl.jar 1.0|CURRENT 0b00c350800005fb|0b00c350800005fe|0b00c350800005f9|0b00c350800005b5|0b00c350800005a6|0b00c350800005a9|0b00c350800005b2|0b00c350800005b1|0b00c350800005b0|0b00c350800005af|0b00c350800005ae|0b00c350800005ad|0b00c350800005ac|0b00c350800005ab dmadmin dmtest BOF_acl 1.0|CURRENT 0b00c3508000
0900c3508000062f type-constraint-impl.jar CURRENT|1.0 0b00c350800005f6 dmadmin dmtest BOF_acl CURRENT|1.0 0b00c3508000
0900c35080000630 version-behavior-impl.jar CURRENT|1.0 0b00c350800005b3 dmadmin dmtest BOF_acl CURRENT|1.0 0b00c3508000
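The fixed-width layout of such a table boils down to left-justifying each cell to its column's maximum width. Here is a minimal sketch of that step only; pad_row() is a hypothetical helper and leaves out the colors, grids and titles that simple_show_table() handles.

```awk
# Left-justify each cell to its column width, two spaces between columns.
function pad_row(cells, widths, nb_cols,    j, line) {
    for (j = 0; j < nb_cols; j++)
        line = line sprintf("%-" widths[j] "s", cells[j]) "  "
    sub(/ +$/, "", line)      # drop the padding after the last column
    return line
}

BEGIN {
    widths[0] = 16; widths[1] = 11; widths[2] = 10
    header[0] = "r_object_id"; header[1] = "version"; header[2] = "owner_name"
    cells[0] = "0900c350800001d0"; cells[1] = "CURRENT|1.0"; cells[2] = "dmadmin"
    print pad_row(header, widths, 3)
    print pad_row(cells, widths, 3)
}
```

Building the format string by concatenation ("%-" widths[j] "s") keeps the sketch independent of printf's dynamic-width syntax.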
See Part II for the rest of the article.