Documentum usually works pretty well with xPlore when it comes to managing its indexing queue, at least on well-configured environments and/or with “simple” repositories. Basically, as long as you don’t deal only with very big/complex documents (e.g. 500MB Excel files with hundreds of thousands of rows), a lot of poorly optimized customizations, or intensive components like a full Life Sciences Suite stack, then xPlore should be able to do its work without much trouble. However, it might still happen that some activities like data fixes (adding/updating metadata on your documents) or migrations of documents (and therefore running the ApplyD2Config on them) generate millions of dm_save events and therefore the same number of items being added to the dmi_queue_item table for indexing.

It’s not something that will happen every day for sure, but I work on such things a few times a year. When this kind of activity happens, there will usually be several dm_save events for the same object, which means duplicate item_id values inside the dmi_queue_item table. If it’s just a few thousand documents, it should be fine because xPlore will quickly process them anyway. But what if there are 5 million items to be processed and, out of these, only 500k unique ones (meaning an average of 10 duplicates per item)? In such cases, letting xPlore process its queue will be quite inefficient and it will probably take several hours, if not days, for the queue to be emptied.
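
Before deciding how to handle it, it can be useful to get an idea of how big the queue really is and which fulltext user(s) it belongs to (helpful in HA setups). Below is a minimal sketch through iapi, assuming the usual ‘dm_fulltext_index_user%’ naming; if your DQL version doesn’t accept COUNT(DISTINCT …), the same figure can be obtained later from the exported file with a simple ‘sort -u’:

[dmadmin@cs-0 ~]$ ## Rows still to be processed, per fulltext queue user
[dmadmin@cs-0 ~]$ iapi $REPO_NAME -Udmadmin -Pxxx << EOC
?,c,select name, count(*) from dmi_queue_item where name like 'dm_fulltext_index_user%' and delete_flag=0 group by name;
?,c,select count(distinct item_id) from dmi_queue_item where name='dm_fulltext_index_user' and delete_flag=0;
EOC
[dmadmin@cs-0 ~]$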

That’s where the queue cleanup is needed. There are, from my point of view, two main solutions:

  1. Extracting all distinct item_id (either through SQL or DQL/API), then removing all the rows (again, either through SQL or DQL/API) and finally resubmitting only the unique items for indexing (either through the dmi_queue_item table or through the IndexAgent refeed tasks)
  2. Removing only the duplicate rows and keeping the oldest (or alternatively the newest) entry for each item_id

Of course, it’s also possible to just unregister the events before the activity so that the dmi_queue_item table won’t be flooded, but that’s not fun, right? When I’m brought in once the activity is already completed and I need to do the cleanup, I will usually apply the first solution above (using SQL commands) if there are 5-10 million duplicates or more to remove, because it will be much faster to delete these through the DB. Same thing if these items aren’t very important and can be processed in the background through the IndexAgent refeed tasks (the dmi_queue_item queue takes precedence over refeed tasks). However, in this blog, I will talk about the second option, which I usually apply in all other cases.
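
For reference, the DB part of the first solution could look like the rough sketch below. It assumes an Oracle repository, the usual dmi_queue_item_s underlying table for the single-valued attributes, and that the extracted unique item_id list will then be resubmitted through an IndexAgent refeed (ids.txt); the exact SQL, and whether touching the DB directly is acceptable at all, obviously needs to be validated for your environment:

SQL> -- Keep the unique item_id list aside, to be resubmitted later via an IndexAgent refeed
SQL> set heading off pagesize 0 feedback off trimspool on
SQL> spool unique_item_ids.txt
SQL> select distinct item_id from dmi_queue_item_s where name='dm_fulltext_index_user' and delete_flag=0;
SQL> spool off
SQL>
SQL> -- Then remove all pending rows for that fulltext user in one shot
SQL> delete from dmi_queue_item_s where name='dm_fulltext_index_user' and delete_flag=0;
SQL> commit;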

The second solution could also be done via SQL, but I tend to use DQL/API there, because there isn’t that much to do, so it’s usually fast enough. In addition to that, I also tend to parallelize the execution on several Content Servers (or several DQL/API sessions on the same CS). The first thing to do is to temporarily stop the IndexAgent through its UI so that it’s not running during the export. Then, as mentioned, we need to extract the list of item_id with their associated r_object_id from the dmi_queue_item table, so that we have all the rows present at that point in time:

[dmadmin@cs-0 ~]$ mkdir tmp
[dmadmin@cs-0 ~]$ cd tmp/
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Extract items from the dmi_queue_item table
[dmadmin@cs-0 tmp]$ iapi $REPO_NAME -Udmadmin -Pxxx << EOC > object_list.txt
?,c,select r_object_id, item_id from dmi_queue_item where name='dm_fulltext_index_user' and delete_flag=0 order by date_sent asc;
EOC
[dmadmin@cs-0 tmp]$

This iapi command will take a few seconds to execute, depending on how many rows are present. In this case, I used the ‘dm_fulltext_index_user‘ name, meaning that I want to do that for the first IndexAgent of this repository (in case of HA, there might be another user like ‘dm_fulltext_index_user_02‘). I don’t usually put a filter on the task_state, because even if the status is warning, failed or acquired, I want to retrieve all item_id that haven’t been completely processed yet (delete_flag=0) and remove all the duplicates. In addition to that, I order by date_sent in ascending order, meaning that the oldest entries will be at the top of the file. Once the export is done, the IndexAgent can be restarted through its UI, in normal mode, so that it resumes the processing of the queue. It’s not mandatory to do it at this point in time, but it won’t cause any issue either, so you will gain some time in the processing of the queue if you restart it sooner rather than later. The next step is to clean up the file, to remove the headers/footers coming from the iapi execution, and to generate some lists:

[dmadmin@cs-0 tmp]$ ## Backup the original file
[dmadmin@cs-0 tmp]$ cp object_list.txt object_list.txt.orig
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Remove all lines not starting with "1b", i.e. everything except the lines containing "r_object_id  item_id"
[dmadmin@cs-0 tmp]$ sed -i '/^1b/!d' object_list.txt
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Generate sorted and unique lists
[dmadmin@cs-0 tmp]$ sort -k2 object_list.txt > object_list.txt.sorted
[dmadmin@cs-0 tmp]$ sort -k2 -u object_list.txt > object_list.txt.unique
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ wc -l *
  3118245 object_list.txt
  3118264 object_list.txt.orig
  3118245 object_list.txt.sorted
   280124 object_list.txt.unique
  9634878 total
[dmadmin@cs-0 tmp]$

With the sort commands above, we first generated a “sorted” list of the item_id (based on the second column). This list won’t be used by the steps below; it’s there mostly in case you want to check things later. Remember that the oldest rows were at the top of the source file, so in the sorted file (sorted on item_id ascending), you can easily find the r_object_id of all rows that are duplicates of a given item_id, and they will still be ordered by date_sent within the same item_id. The second sort command generates the list of “unique” item_id, the ones that will be kept; since it’s based on the same source file, it means that ONLY the oldest row of each item_id will show up in this file, BUT it will also be sorted by item_id ascending, obviously. Like the other one, this list won’t be used either, because sorting on the item_id instead of the date_sent makes it less optimized for the later processing (the removal). I still like to keep it around, just in case I need it for something later.
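
As a small example of what this sorted file is good for, here is how to spot the worst offenders and how to look at all the queue rows behind one specific item_id (the id used in the grep below is just a made-up placeholder):

[dmadmin@cs-0 tmp]$ ## Top 5 item_id with the most duplicates (the file is sorted on the 2nd column, so uniq -c works)
[dmadmin@cs-0 tmp]$ awk '{print $2}' object_list.txt.sorted | uniq -c | sort -rn | head -5
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## All rows (r_object_id + item_id) behind one specific item_id, oldest first
[dmadmin@cs-0 tmp]$ grep "090f123450001234" object_list.txt.sorted
[dmadmin@cs-0 tmp]$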

Now that we have a good base and all the needed details for further investigations later (if something goes wrong), it’s time to really start with generating the lists of items that we will keep and remove:

[dmadmin@cs-0 tmp]$ ## Generate list by keeping only the 1st occurrence seen of the 2nd column
[dmadmin@cs-0 tmp]$ awk '!seen[$2]++' object_list.txt > object_list.txt.tokeep
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Generate list by keeping all except the 1st occurrence seen of the 2nd column
[dmadmin@cs-0 tmp]$ awk 'seen[$2]++' object_list.txt > object_list.txt.toremove
[dmadmin@cs-0 tmp]$

The first awk command generates a list very similar to the “object_list.txt.unique” one, except that it’s not sorted by item_id; it’s still sorted by date_sent ascending, same as the initial source file. As mentioned, this is because, from my point of view, it’s better to work on items based on their date_sent. The second awk command generates exactly the opposite of the first one. Therefore, the first list contains all rows (r_object_id) that will be kept inside the dmi_queue_item table, while the second one contains all rows (r_object_id) that will be removed.
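
A quick sanity check never hurts at this point: the “tokeep” and “toremove” lists should add up to the full extract, and “tokeep” should have the same number of lines as the “unique” file (a small sketch, with the counts of this example):

[dmadmin@cs-0 tmp]$ ## kept + removed should be equal to the total number of extracted rows
[dmadmin@cs-0 tmp]$ echo "$(( $(wc -l < object_list.txt.tokeep) + $(wc -l < object_list.txt.toremove) )) vs $(wc -l < object_list.txt)"
3118245 vs 3118245
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## and tokeep should match the number of unique item_id
[dmadmin@cs-0 tmp]$ wc -l object_list.txt.tokeep object_list.txt.unique
  280124 object_list.txt.tokeep
  280124 object_list.txt.unique
  560248 total
[dmadmin@cs-0 tmp]$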

From this point on, I will usually reverse the content of the list of items to remove, so that the newest items appear at the top and not at the bottom. This is so that we can safely start by removing the newest rows (as of the time of the export!) from the dmi_queue_item table while the IndexAgent is still processing the oldest ones (if you indeed restarted it in normal mode earlier). If there are 5 million rows to remove, depending on how many parallel sessions you start (see later in this blog), it might take 1/2/4 hours of processing, and during that time, the IndexAgent might process a sizeable amount of items. Since the IndexAgent processes items based on date_sent ascending, I think it makes sense to remove items based on date_sent descending. Please remember that this list was extracted while the IndexAgent wasn’t running, so items submitted after that won’t be touched at all; otherwise, there could be issues with a wrong version of a document being indexed. Once the list of items to be removed has been reversed, we can generate an API script to process them:

[dmadmin@cs-0 tmp]$ ## Reverse the items to remove
[dmadmin@cs-0 tmp]$ tac object_list.txt.toremove > object_list.txt.toremove.rev
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Generate the API script
[dmadmin@cs-0 tmp]$ sed -e "s,[[:space:]]\+.*,';," -e "s/^/?,c,delete dmi_queue_item objects where r_object_id='/" object_list.txt.toremove.rev > object_list.api
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ wc -l *
  2838121 object_list.api
  3118245 object_list.txt
  3118264 object_list.txt.orig
  3118245 object_list.txt.sorted
   280124 object_list.txt.tokeep
  2838121 object_list.txt.toremove
  2838121 object_list.txt.toremove.rev
   280124 object_list.txt.unique
 18429365 total
[dmadmin@cs-0 tmp]$

From this point on, we could execute the API script as is, but if you followed the blog, there might be millions of rows to be deleted and therefore millions of API commands in this script… That’s why I usually split the API script into smaller batches of ~50k items:

[dmadmin@cs-0 tmp]$ ## Split the big API script into batches of 50k
[dmadmin@cs-0 tmp]$ split -l 50000 object_list.api object_list.api_p
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ls -l object_list.api_p*
-rw-r----- 1 dmadmin dmadmin 3600000 Nov 26 12:44 object_list.api_paa
-rw-r----- 1 dmadmin dmadmin 3600000 Nov 26 12:44 object_list.api_pab
-rw-r----- 1 dmadmin dmadmin 3600000 Nov 26 12:44 object_list.api_pac
...
-rw-r----- 1 dmadmin dmadmin 3600000 Nov 26 12:44 object_list.api_pcc
-rw-r----- 1 dmadmin dmadmin 3600000 Nov 26 12:44 object_list.api_pcd
-rw-r----- 1 dmadmin dmadmin 1428984 Nov 26 12:44 object_list.api_pce
[dmadmin@cs-0 tmp]$

Once done, it’s possible to start the below command X times from different shell sessions, and it will make sure that each batch is processed only once:

[dmadmin@cs-0 tmp]$ ## Session 1
[dmadmin@cs-0 tmp]$ for file in `ls object_list.api_p*`; do
  if [[ ! -f log.${file} ]]; then
    echo "  -- ${file}"
    iapi $REPO_NAME -Udmadmin -Pxxx -r${file} > log.${file}
  fi
done
  -- object_list.api_paa
  -- object_list.api_pac
  ...
  -- object_list.api_pcc
  -- object_list.api_pce
[dmadmin@cs-0 tmp]$
####
####
[dmadmin@cs-0 tmp]$ ## Session 2
[dmadmin@cs-0 tmp]$ for file in `ls object_list.api_p*`; do
  if [[ ! -f log.${file} ]]; then
    echo "  -- ${file}"
    iapi $REPO_NAME -Udmadmin -Pxxx -r${file} > log.${file}
  fi
done
  -- object_list.api_pab
  -- object_list.api_pad
  ...
  -- object_list.api_pcb
  -- object_list.api_pcd
[dmadmin@cs-0 tmp]$

As you can see above, this command loops through all the batches and, for each of them, if there is already a log file present, it skips it and checks the next one. Otherwise, it executes the iapi command needed to process the batch, creating the corresponding log file at the same time so that the batch is executed only once. That is what makes it possible to parallelize the execution across several shell sessions/Content Servers. Let it run for some time and, afterwards, the dmi_queue_item table will be clean, without any duplicated items.
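
To keep an eye on the progress while it runs, counting the log files and the remaining rows in the queue is usually enough (a small sketch, reusing the same names as above):

[dmadmin@cs-0 tmp]$ ## Number of batches already picked up (one log file per batch started)
[dmadmin@cs-0 tmp]$ ls log.object_list.api_p* 2> /dev/null | wc -l
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Rows still in the queue for this IndexAgent
[dmadmin@cs-0 tmp]$ iapi $REPO_NAME -Udmadmin -Pxxx << EOC
?,c,select count(*) from dmi_queue_item where name='dm_fulltext_index_user' and delete_flag=0;
EOC
[dmadmin@cs-0 tmp]$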

Note 1: Because the extract didn’t care about warning/failed, once the cleanup is done, you will need to resubmit the warning/failed items (if any), so that you are certain that these items still end up being indexed. Alternatively, you can filter out the warning/failed items during the extract, so that the IndexAgent will try again to process them (if there are duplicates of these warning/failed items in the awaiting/acquired queues). To quickly check whether this applies to you, see the first sketch after these notes.
Note 2: If, after some time, you want to also remove the unique items (because the processing is too slow, because there are too many of them, or because more important documents are coming soon), you can remove the “tokeep” items in the exact same way… Also, to generate an “ids.txt” file that the IndexAgent will be able to ingest (put the file here: “$IA_HOME/deployments/IndexAgent.war/WEB-INF/classes/ids.txt“), you can use something like: 'sed -e "s,[0-9a-f]*[[:space:]]\+,," object_list.txt.tokeep > ids.txt'. This command will only keep the 2nd column (the item_id column); an awk alternative is shown in the second sketch after these notes.
Note 3: It is obviously possible to compress these steps a lot, so that a single command would do everything, but for the clarity of the blog, I kept all the steps separated.
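
Regarding Note 1, the task_state distribution of what is left in the queue can be checked with something like this (sketch):

[dmadmin@cs-0 tmp]$ ## Distribution of the remaining rows per task_state (warning/failed/acquired/...)
[dmadmin@cs-0 tmp]$ iapi $REPO_NAME -Udmadmin -Pxxx << EOC
?,c,select task_state, count(*) from dmi_queue_item where name='dm_fulltext_index_user' and delete_flag=0 group by task_state;
EOC
[dmadmin@cs-0 tmp]$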
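
Regarding Note 2, the same ids.txt can also be produced with awk and dropped directly into place ($IA_HOME standing, as in the note, for the IndexAgent installation root):

[dmadmin@cs-0 tmp]$ ## Keep only the item_id column of the "tokeep" list
[dmadmin@cs-0 tmp]$ awk '{print $2}' object_list.txt.tokeep > ids.txt
[dmadmin@cs-0 tmp]$ wc -l ids.txt
[dmadmin@cs-0 tmp]$
[dmadmin@cs-0 tmp]$ ## Put it where the IndexAgent expects it
[dmadmin@cs-0 tmp]$ cp ids.txt $IA_HOME/deployments/IndexAgent.war/WEB-INF/classes/ids.txt
[dmadmin@cs-0 tmp]$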