{"id":12178,"date":"2019-01-18T09:26:46","date_gmt":"2019-01-18T08:26:46","guid":{"rendered":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/"},"modified":"2025-10-24T09:32:19","modified_gmt":"2025-10-24T07:32:19","slug":"two-techniques-for-cloning-a-repository-filestore-part-i","status":"publish","type":"post","link":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/","title":{"rendered":"Two techniques for cloning a repository filestore, part I"},"content":{"rendered":"<p>I must confess that my initial thought for the title was &#8220;An optimal repository filestore copy&#8221;. Optimal, really ? Relative to what ? Which variable(s) define(s) the optimality ? Speed\/time to clone ? Too dependent on the installed hardware and software, and the available resources and execution constraints. Simplicity to do it ? Too simple a method can result in a very long execution time while complexity can give a faster solution but be fragile, and vice-versa. Besides, simplicity is a relative concept; a solution may look simple to someone and cause nightmares to some others. Beauty ? I like that one but no, too fuzzy too. Finally, I settled for the present title for it is neutral and to the point. I leave it up to the reader to judge if the techniques are optimal or simple or beautiful. I only hope that they can be useful to someone.<br \/>\nThis article has two parts. In each, I&#8217;ll give an alternative for copying a repository&#8217;s filestores from one filesystem to another. Actually, both techniques are very similar, they just differ in the way the content files&#8217; path locations are determined. But let&#8217;s start.<\/p>\n<h3>A few simple alternatives<\/h3>\n<p>Cloning a Documentum repository is a well-known procedure nowadays. 
Broadly, it involves first creating a placeholder docbase for the clone and then copying the meta-data stored in the source database, e.g. through an export\/import, usually while the docbase is stopped, plus the document contents, generally stored on disks. If a special storage peripheral is used, such as a Centera CAS or a NAS, there might be a fast, low-level way to clone the content files directly at the device level; check with the manufacturer.<br \/>\nIf all we want is an identical copy of the whole contents&#8217; filesystem, the command dd could be used, e.g. supposing that the source docbase and the clone docbase both use for their contents a dedicated filesystem residing on \/dev\/mapper\/vol01 and \/dev\/mapper\/vol02 respectively:<br \/>\n<code><br \/>\nsudo dd if=\/dev\/mapper\/vol01 bs=1024K of=\/dev\/mapper\/vol02<br \/>\n<\/code><br \/>\nThe clone&#8217;s filesystem \/dev\/mapper\/vol02 could be mounted temporarily on the source docbase&#8217;s machine for the copy and later unmounted. If this is not possible, dd can be used over the network, e.g.:<br \/>\n<code><br \/>\n# from the clone machine;<br \/>\nssh root@source 'dd if=\/dev\/mapper\/vol01 bs=1024K | gzip -1 -' | zcat | dd of=\/dev\/mapper\/vol02<br \/>\n&nbsp;<br \/>\n# from the source machine as root;<br \/>\ndd if=\/dev\/mapper\/vol01 bs=1024K | gzip -1 - | ssh dmadmin@clone 'zcat | dd of=\/dev\/mapper\/vol02'<br \/>\n<\/code><br \/>\ndd, and other partition imaging utilities, perform a block by block mirror copy of the source, which is much faster than working at the file level, although it depends on the percentage of used space (if the source filesystem is almost empty, dd will spend most of its time copying unused blocks, which is wasted effort; in that case, a simple file by file copy would be more effective).<br \/>\nIf it is not possible to work at this low level, e.g. 
filesystems are not dedicated to repositories&#8217; contents or the types of the filesystems differ, then a file by file copy is required. Modern disks are quite fast, especially for deep-pocket companies using SSD, and so file copy operations should be acceptably quick. A naive command such as the one below could even be used to copy the whole repository&#8217;s content (we suppose that the clone will be on the same machine):<br \/>\n<code><br \/>\nmy_docbase_root=\/data\/Documentum\/my_docbase<br \/>\ndest_path=\/some\/other\/or\/identical\/path\/my_docbase<br \/>\ncp -rp ${my_docbase_root} ${dest_path}\/..\/.<br \/>\n<\/code><br \/>\nIf $dest_path differs from $my_docbase_root, don&#8217;t forget to edit the dm_filestore&#8217;s dm_locations of the clone docbase accordingly.<\/p>\n<p>If confidentiality is requested and\/or the copy occurs across a network, scp is recommended as it also encrypts the transferred data:<br \/>\n<code><br \/>\nscp -rp ${my_docbase_root} dmadmin@${dest_machine}:${dest_path}\/..\/.<br \/>\n<\/code><br \/>\nThe venerable tar command could also be used, on-the-fly and without creating an intermediate archive file, e.g.:<br \/>\n<code><br \/>\n( cd ${my_docbase_root}; tar cvzf - * ) | ssh dmadmin@${dest_machine} \"(cd ${dest_path}\/; tar xvzf - )\"<br \/>\n<\/code><br \/>\nEven better, the command rsync could be used as it is much more versatile and efficient (and still secure too if configured to use ssh, which is the default), especially if the copy is done live several times in advance during the days preceding the kick-off of the new docbase; such copies will be performed incrementally and will execute quickly, providing an easy way to synchronize the copy with the source. 
Example of use:<br \/>\n<code><br \/>\nrsync -avz --stats ${my_docbase_root}\/ dmadmin@${dest_machine}:${dest_path}<br \/>\n<\/code><br \/>\nThe trailing \/ in the source path means copy the content of ${my_docbase_root} but not the directory itself as we assumed it already exists.<br \/>\nAlternatively, we can include the directory too:<br \/>\n<code><br \/>\nrsync -avz --stats ${my_docbase_root} dmadmin@${dest_machine}:${dest_path}\/..\/.<br \/>\n<\/code><br \/>\nIf we run it from the destination machine, the command changes to:<br \/>\n<code><br \/>\nrsync -avz --stats dmadmin@source_machine:${my_docbase_root}\/ ${dest_path}\/.<br \/>\n<\/code><br \/>\nThe &minus;&minus;stats option is handy to obtain a summary of the transferred files and the resulting performance.<\/p>\n<p>Still, if the docbase is large and contains millions to hundreds of millions of documents, copying them to another machine can take some time. If rsync is used, repeated executions will just copy over modified or new documents and optionally remove the deleted ones if the &minus;&minus;delete option is added (the archive mode, i.e. the -a option above, does not delete anything on the destination by itself) but the first run will take time anyway. Logically, taking reasonable advantage of the available I\/O bandwidth by having several rsync processes running at once should reduce the time to clone the filestore, shouldn&#8217;t it ? Is it possible to apply here a divide-and-conquer technique and process each part simultaneously ? 
It is, and here is how.<\/p>\n<h3>How Documentum stores the contents on disk<\/h3>\n<p>Besides the sheer volume of documents and possibly the limited network and disk I\/O bandwidth, one reason the copy can take a long time, independently from the tools used, is the peculiar way Documentum stores its contents on disk, with all the content files exclusively at the bottom of a 6-level deep sub-tree with the following structure (assuming $my_docbase_root has the same value as above; the letter at column 1 means d for directory and f for file):<br \/>\n<code><br \/>\ncd $my_docbase_root<br \/>\nd filestore_01<br \/>\nd    &lt;docbase_id&gt;, e.g. 0000c350<br \/>\nd      80                     starts with 80 and increases by 1 up to ff, for a total of 2^7 = 128 sub-trees;<br \/>\nd        00                   up to 16^2 = 256 directories directly below 80, from 00 to ff<br \/>\nd          00                 again, up to 256 directories directly below, from 00 to ff, up to 64K total subdirectories at this level; let's call these innermost directories \"terminal directories\";<br \/>\nf            00[.ext]         files are here, up to 256 files, from 00 to ff per terminal directory<br \/>\nf            01[.ext]<br \/>\nf            ...<br \/>\nf            ff[.ext]<br \/>\nd          01<br \/>\nf            00[.ext]<br \/>\nf            01[.ext]<br \/>\nf            ...<br \/>\nf            ff[.ext]<br \/>\nd          ...<br \/>\nd          ff<br \/>\nf            00[.ext]<br \/>\nf            01[.ext]<br \/>\nf            ...<br \/>\nf            ff[.ext]<br \/>\nd        01<br \/>\nd        ...<br \/>\nd        ff<br \/>\nd      81<br \/>\nd        00<br \/>\nd        01<br \/>\nd        ...<br \/>\nd        ff<br \/>\nd      ...<br \/>\nd      ff<br \/>\nd        00<br \/>\nd        01<br \/>\nd        ...<br \/>\nd        ff<br \/>\nf          00[.ext]<br \/>\nf          01[.ext]<br \/>\nf          ...<br \/>\nf          ff[.ext]<br \/>\n<\/code><br \/>\nThe 
content files on disk may have an optional extension and are located exclusively at the extremities of their sub-tree; there are no files in the intermediate sub-directories. In other words, the files are the leaves of a filestore directory tree.<br \/>\nAll the nodes have a 2-digit lowercase hexadecimal name, from 00 to ff, possibly with holes in the sequences when sub-directories or deleted documents have been swept by the DMClean job. With such a layout, each filestore can store up to (2^7).(2^8).(2^8).(2^8) files, i.e. 2^31 files or a bit more than 2.1 billion files. Any number of filestores can be used for virtually an &#8220;unlimited&#8221; number of content files. However, since each content object has a 16-digit hexadecimal id of which only the last 8 hexadecimal digits are really distinctive (and directly map to the filesystem path of the content file, see below), a docbase can effectively contain &#8220;only&#8221; up to 16^8 files, i.e. 2^32 content files or slightly more than 4.2 billion files distributed among all the filestores. Too few ? There is hope, aside from spawning new docbases. The knowledge base note <a title=\"KB7712268 (What happens when unique r_object_id exceeds 4 billion in a repository per datatype)\" href=\"https:\/\/knowledge.opentext.com\/knowledge\/cs.dll\/kcs\/kbarticle\/view\/KB7712268\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a> explains how &#8220;galactic&#8221; r_object_id are allocated if more than 4 billion documents are present in a repository, so it should be possible to have literally gazillions of documents in a docbase. 
It is not clear though whether this galactic concept is implemented yet or whether it has ever been triggered once, so let us stay with our feet firmly on planet Earth for the time being.<\/p>\n<p>It should be emphasized that such a particular layout in no way causes a performance penalty in accessing the documents&#8217; contents from within the repository because their full path can easily be computed by the content server out of their dmr_content.data_ticket attribute (a signed decimal number), e.g. as shown with the one-liner:<br \/>\n<code><br \/>\ndata_ticket=-2147440384<br \/>\necho $data_ticket | gawk '{printf(\"%x\\n\", $0 + 4294967296)}'<br \/>\n8000a900<br \/>\n<\/code><br \/>\nor, more efficiently entirely in the bash shell:<br \/>\n<code><br \/>\nprintf \"%x\\n\" $(($data_ticket + 4294967296))<br \/>\n8000a900<br \/>\n<\/code><br \/>\nThis value is now split apart by groups of 2 hex digits with a slash as separator: 80\/00\/a9\/00<br \/>\nTo compute the full path, the filestore&#8217;s location and the docbase id still need to be prepended to the above partial path, e.g. ${my_docbase_root}\/filestore_01\/0000c350\/80\/00\/a9\/00. Thus, knowing the r_object_id of a document, we can find its content file on the filesystem as shown (or, preferably, using the getpath API function) and knowing the full path of a content file makes it possible to find the document (or documents, as the same content can be shared among several documents) in the repository it belongs to. To be complete, the explanation still needs the concept of filestore and its relationship with a location but let&#8217;s stop digressing (check paragraph 3 below for a hint) and return to the subject at hand. We have now enough information to get us started.<\/p>\n<p>As there are no files in the intermediate levels, it is necessary to walk the entire tree to reach them and start their processing, which is very time-consuming. 
Depending on your hardware, a &#8216;ls -1R&#8217; command can take hours to complete on a set of fairly dense sub-trees. Conversely, this is an advantage for processing through rsync because rsync is able to create all the necessary sub-path levels (aka &#8220;implied directories&#8221; in rsync lingo) if the -R|&minus;&minus;relative option is provided, as if a &#8220;mkdir -p&#8221; were issued; thus, in order to optimally copy an entire filestore, it would be enough to rsync only the terminal directories, once identified, and the whole sub-tree would be recreated implicitly. In the illustration above, the rsync commands for those paths are:<br \/>\n<code><br \/>\ncd ${my_docbase_root}<br \/>\nrsync -avzR --stats filestore_01\/80\/00\/00 dmadmin@${dest_machine}:${dest_path}<br \/>\nrsync -avzR --stats filestore_01\/80\/00\/01 dmadmin@${dest_machine}:${dest_path}<br \/>\nrsync -avzR --stats filestore_01\/80\/00\/ff dmadmin@${dest_machine}:${dest_path}<br \/>\nrsync -avzR --stats filestore_01\/ff\/ff\/00 dmadmin@${dest_machine}:${dest_path}<br \/>\n<\/code><br \/>\nIn rsync &ge; v2.6.7, it is even possible to restrict the part within the source full path that should be copied remotely, so no preliminary cd is necessary, e.g.:<br \/>\n<code><br \/>\nrsync -avzR --stats ${my_docbase_root}<strong>\/.\/<\/strong>filestore_01\/80\/00\/00 dmadmin@${dest_machine}:${dest_path}<br \/>\n<\/code><br \/>\nNote the \/.\/ path component: it marks the start of the relative path to reproduce remotely. 
This command will create the directory ${dest_path}\/filestore_01\/80\/00\/00 on the remote host and copy its content there.<br \/>\nPath specification can be quite complicated, so use the -n|&minus;&minus;dry-run and -v|&minus;&minus;verbose (or even -vv for more details) options to have a peek at rsync&#8217;s actions before they are applied.<\/p>\n<p>With the -R option, we get to transfer only the terminal sub-directories AND their original relative paths, efficiency and convenience !<br \/>\nWe potentially replace millions of file by file copy commands with only up to 128 * 64K directory copy commands per filestore, which is much more concise and efficient.<\/p>\n<p>However, if there are N content files to transfer, at least \u2308N\/256\u2309 such rsync commands will be necessary, e.g. a minimum of 3,907 (roughly 3,900) commands for 1 million files, subject of course to their distribution in the sub-tree (some less dense terminal sub-directories can contain fewer than 256 files so more directories and hence commands are required). It is not documented how the content server distributes them over the sub-directories and there is no balancing job that relocates the content files in order to reduce the number of terminal directories by increasing the density of the remaining ones and removing the emptied ones. 
All this is quite sensible because, while it may matter to us, it is a non-issue to the repository.<br \/>\nNevertheless, on the bright side, since a tree is by definition acyclic, rsync transfers won&#8217;t overlap and therefore can be parallelized without synchronization issues, even when the creation of the same intermediate &#8220;implied directories&#8221; is requested simultaneously by 2 or more rsync commands (if rsync performs an operation equivalent to &#8220;mkdir -p&#8221;, possible errors due to race conditions during concurrent execution can be simply ignored since the operation is idempotent).<\/p>\n<p>Empty terminal paths can be skipped without fear because they are not referenced in the docbase (only content files are) and hence their absence from the copy cannot introduce inconsistencies.<\/p>\n<p>Of course, in the example above, those hypothetical 3900 rsync commands won&#8217;t be launched at once but in groups of some convenient size depending on the load that the machines can endure or the application&#8217;s response time degradation if the commands are executed during the normal work hours. Since we are dealing with files and possibly the network, the biggest bottleneck will be the I\/Os and care should be exercised not to saturate them. When dedicated hardware such as a high-speed networked NAS with SSD drives is used, this is less of a problem and more parallel rsync instances can be started, but such expensive resources are often shared across a company so that someone might still be impacted at one point. I for one remember one night as I was running one data-intensive DQL query in ten different repositories at once and users were suddenly complaining that they couldn&#8217;t work on their Word documents any more because response time had slowed to a crawl. How was that possible ? What was the relationship between DQL queries in several repositories and a desktop program ? 
Well, the repositories used Oracle databases whose datafiles were stored on a NAS also used as a host for networked drives mounted on desktop machines. Evidently, that configuration was somewhat sloppy but one never knows how things are configured at a large company, so be prepared for the worst and set up the transfers so that they can be easily suspended, interrupted and resumed. The good thing with rsync in archive mode is that the transfers can be resumed where they left off at a minimal cost just by relaunching the same commands with no need to compute a restarting point.<\/p>\n<p>It goes without saying that setting up public key authentication or ssh connection sharing is mandatory when rsync-ing to a remote machine in order to suppress the thousands of authentication requests that will pop up during the batch execution. <\/p>\n<h3>But up to 128 * 64K rsync commands per filestore : isn&#8217;t that a bit too much ?<\/h3>\n<p>rsync performs mostly I\/O operations, reading and writing the filesystems, possibly across a network. The time spent in those slow operations (mostly waiting for them to complete, actually) by far outweighs the time spent launching the rsync processes and executing user or system code, especially if the terminal sub-directories are densely filled with large files. Moreover, if the DMClean job has been run ahead of time, this 128 * 64K figure is a maximum; it is only reached if none of the terminal sub-directories is empty.<br \/>\nStill, rsync has some cleverness of its own in processing the files to copy so why not let it do its job ? Is it possible to reduce the number of commands ? Of course, by just stopping one level before the terminal sub-directories, at their parents&#8217; level. From there, up to 128 * 256 rsync commands are necessary, down from 128 * 64K commands. rsync would then itself explore the up to 64K terminal directories below, hopefully more efficiently than when explicitly told so. 
For sparse sub-directories or small docbases, this could be more efficient. If so, what would be the cut-off point ? This is a complex question depending on so many factors that there is no other way to answer it than to experiment with several situations. A few informal tests show that copying terminal sub-directories with up to 64K rsync commands is about 40% faster than copying their parent sub-directories. If optimality is defined as speed, then the &#8220;64k&#8221; variant is the best; if it is defined as compactness, then the &#8220;256&#8221; variant is the best. One explanation for this could be that the finer and simpler the tasks to perform, the quicker they terminate and free up processes to be reused. Or maybe rsync is overwhelmed by the up to 64K sub-directories to explore and is not so good at that and needs some help. The scripts in this article allow experimenting with the &#8220;256&#8221; variant.<\/p>\n<p>To summarize up to this point, we will address the cost of navigating the special Documentum sub-tree layout by walking the location sub-trees down to the last directory level (or stopping up to 2 levels above if so requested) and generating efficient rsync commands that can easily be parallelized. But before we start, how about asking the repository about its contents ? As it keeps track of it, wouldn&#8217;t this alternative be much easier and faster than navigating complex directory sub-trees ? Let&#8217;s see.<\/p>\n<h3>Alternate solution: just ask the repository !<\/h3>\n<p>Since a repository obviously knows where its content files are stored on disks, it makes sense to get this information directly from it. In order to be sure to include all the possible renditions as well, we should query dmr_content instead of dm_document(all) (note that the DQL function MFILE_URL() returns those too, so that a &#8220;select MFILE_URL('') from dm_document(all)&#8221; could also be used here). 
Also, unless the dm_DMClean job is run beforehand, dmr_content includes orphan contents as well, so this point must be clarified beforehand. Anyway, by querying dmr_content we are sure not to omit any content, orphans or not.<br \/>\nThe short python\/bash-like pseudo-code shows how we could do it:<\/p>\n<pre class=\"brush: python; gutter: true; first-line: 1; highlight: [4,5,7]\">\nfor each filestore in the repository:\n(\n   for each content in the filestore:\n      get its path, discard its filename;\n      add the path to the set of paths to transfer;\n   for each path in the set of paths:\n      generate an rsync -avzR --stats command;\n) &amp;\n<\/pre>\n<p>Line 4 just gets the terminal sub-directories, while line 5 ensures that they are unique in order to avoid rsync-ing the same path multiple times. We use sets here to guarantee distinct terminal path values (set elements are unique in the set).<br \/>\nLine 7 outputs the rsync commands for the terminal sub-directories and the destination.<br \/>\nEven though all filestores are processed concurrently, there could be millions of contents in each filestore and such queries could take forever. However, if we run several such queries in parallel, each working on its own partition (i.e. a non-overlapping subset of the whole such that their union is equal to the whole), we could considerably speed it up. Constraints such as &#8220;where r_object_id like '%0'&#8221;, &#8220;where r_object_id like '%1'&#8221;, &#8220;where r_object_id like '%2'&#8221;, .. &#8220;where r_object_id like '%f'&#8221; can slice up the whole set of documents into 16 more or less equally-sized subsets (since the r_object_id is essentially a sequence, its modulo 16 or 256 distribution is uniform), which can then be worked on independently and concurrently. Constraints like &#8220;where r_object_id like '%00'&#8221; .. 
&#8220;where r_object_id like &#8216;%ff'&#8221; can produce 256 slices, and so on.<br \/>\nHere is a short python 3 program that does all this:<\/p>\n<pre class=\"brush: python; gutter: true; first-line: 1; highlight: [10,42,47,51,154,171,175]\">\n#!\/usr\/bin\/env python\n\n# 12\/2018, C. Cervini, dbi-services;\n \nimport os\nimport sys\nimport traceback\nimport getopt\nfrom datetime import datetime\nimport DctmAPI\nimport multiprocessing\nimport multiprocessing.pool\n\ndef Usage():\n   print(\"\"\"Purpose:\nConnects as dmadmin\/xxxx to a given repository and generates rsync commands to transfer the given filestores' contents to the given destination;\nUsage:\n   .\/walk_docbase.py -h|--help | -r|--repository  [-f|--filestores [{,}]|all] [-d|--dest ] [-l|--level ]\nExample:\n   .\/walk_docbase.py --docbase dmtest\nwill list all the filestores in docbase dmtest and exit;\n   .\/walk_docbase.py --docbase dmtest --filestore all --dest dmadmin@remote-host:\/documentum\/data\/cloned_dmtest\nwill transfer all the filestores' content to the remote destination in \/documentum\/data, e.g.:\n   dm_location.path = \/data\/dctm\/dmtest\/filestore_01 --&gt; \/documentum\/data\/cloned_dmtest\/filestore_01\nif dest does not contain a file path, the same path as the source is used, e.g.:\n   .\/walk_docbase.py --docbase dmtest --filestore filestore_01 --dest dmadmin@remote-host\nwill transfer the filestore_01 filestore's content to the remote destination into the same directory, e.g.:\n   dm_location.path = \/data\/dctm\/dmtest\/filestore_01 --&gt; \/documentum\/dctm\/dmtest\/filestore_01\nIn any case, the destination root directory, if any is given, must exist as rsync does not create it (although it creates the implied directories);\nGenerated statements can be dry-tested by adding the option --dry-run, e.g.\nrsync -avzR --dry-run \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0c dmadmin@dmtest:\/home\/dctm\/dmtest\nCommented out SQL 
statements to update the dm_location.file_system_path for each dm_filestore are output if the destination path differs from the source's;\nlevel is the starting sub-directory level that will be copied by rsync;\nAllowed values for level are 0 (the default, the terminal directories level), -1 and -2;\nlevel -1 means the sub-directory level above the terminal directories, and so on;\nPractically, use 0 for better granularity and parallelization;\n\"\"\")\n\ndef print_error(error):\n   print(error)\n\ndef collect_results(list_paths):\n   global stats\n   for s in list_paths:\n      stats.paths = stats.paths.union(s)\n\ndef explore_fs_slice(stmt):\n   unique_paths = set()\n   lev = level\n   try:\n      for item in DctmAPI.select_cor(session, stmt):\n         fullpath = DctmAPI.select2dict(session, f\"execute get_path for '{item['r_object_id']}'\")\n         last_sep = fullpath[0]['result'].rfind(os.sep)\n         fullpath = fullpath[0]['result'][last_sep - 8 : last_sep]\n         for lev in range(level, 0):\n            last_sep = fullpath.rfind(os.sep)\n            fullpath = fullpath[ : last_sep]\n         unique_paths.add(fullpath)\n   except Exception as e:\n      print(e)\n      traceback.print_stack()\n   DctmAPI.show(f\"for stmt {stmt}, unique_paths={unique_paths}\")\n   return unique_paths\n\n# --------------------------------------------------------\n# main;\nif __name__ == \"__main__\":\n   DctmAPI.logLevel = 0\n \n   # parse the command-line parameters;\n   # old-style for more flexibility is not needed here;\n   repository = None\n   s_filestores = None\n   dest = None\n   user = \"\"\n   dest_host = \"\"\n   dest_path = \"\"\n   level = 0\n   try:\n       (opts, args) = getopt.getopt(sys.argv[1:], \"hr:f:d:l:\", [\"help\", \"docbase=\", \"filestore=\", \"destination=\", \"level=\"])\n   except getopt.GetoptError:\n      print(\"Illegal option\")\n      print(\".\/walk_docbase.py -h|--help | [-r|--repository ] [-f|--filestores [{,}]|all] [-d|--dest 
][-l|--level ]\")\n      sys.exit(1)\n   for opt, arg in opts:\n      if opt in (\"-h\", \"--help\"):\n         Usage()\n         sys.exit()\n      elif opt in (\"-r\", \"--docbase\"):\n         repository = arg\n      elif opt in (\"-f\", \"--filestore\"):\n         s_filestores = arg\n      elif opt in (\"-d\", \"--destination\"):\n         dest = arg\n         p_at = arg.rfind(\"@\")\n         p_colon = arg.rfind(\":\")\n         if -1 != p_at:\n            user = arg[ : p_at]\n            if -1 != p_colon:\n               dest_host = arg[p_at + 1 : p_colon]\n               dest_path = arg[p_colon + 1 : ]\n            else:\n               dest_host = arg[p_at + 1 : ]\n         elif -1 != p_colon:\n            dest_host = arg[ : p_colon]\n            dest_path = arg[p_colon + 1 : ]\n         else:\n            dest_path = arg\n      elif opt in (\"-l\", \"--level\"):\n         try:\n            level = int(arg)\n            if -2 &gt; level or level &gt; 0:\n               print(\"raising\")\n               raise Exception()\n         except:\n            print(\"level must be a non positive integer inside the interval [-2, 0]\")\n            sys.exit()\n   if None == repository:\n      print(\"the repository is mandatory\")\n      Usage()\n      sys.exit()\n   if None == dest or \"\" == dest:\n      if None != s_filestores:\n         print(\"the destination is mandatory\")\n         Usage()\n         sys.exit()\n   if None == s_filestores or 'all' == s_filestores:\n      # all filestores requested;\n      s_filestores = \"all\"\n      filestores = None\n   else:\n      filestores = s_filestores.split(\",\")\n \n   # global parameters;\n   # we use a typical idiom to create a cheap namespace;\n   class gp: pass\n   gp.maxWorkers = 100\n   class stats: pass\n\n   # connect to the repository;\n   DctmAPI.show(f\"Will connect to docbase(s): {repository} and transfer filestores [{s_filestores}] to destination {dest if dest else 'None'}\")\n\n   status = 
DctmAPI.dmInit()\n   session = DctmAPI.connect(docbase = repository, user_name = \"dmadmin\", password = \"dmadmin\")\n   if session is None:\n      print(\"no session opened, exiting ...\")\n      sys.exit(1)\n\n   # we need the docbase id in hex format;\n   gp.docbase_id = \"{:08x}\".format(int(DctmAPI.dmAPIGet(\"get,c,docbaseconfig,r_docbase_id\")))\n\n   # get the requested filestores' dm_locations;\n   stmt = 'select fs.r_object_id, fs.name, fs.root, l.r_object_id as \"loc_id\", l.file_system_path from dm_filestore fs, dm_location l where {:s}fs.root = l.object_name'.format(f\"fs.name in ({str(filestores)[1:-1]}) and \" if filestores else \"\")\n   fs_dict = DctmAPI.select2dict(session, stmt)\n   DctmAPI.show(fs_dict)\n   if None == dest:\n      print(f\"filestores in repository {repository}\")\n      for s in fs_dict:\n         print(s['name'])\n      sys.exit()\n\n   # filestores are processed sequentially but inside each filestore, the contents are queried concurrently;\n   for storage_id in fs_dict:\n      print(f\"# rsync statements for filestore {storage_id['name']};\")\n      stats.paths = set()\n      stmt = f\"select r_object_id from dmr_content where storage_id = '{storage_id['r_object_id']}' and r_object_id like \"\n      a_stmts = []\n      for slice in range(16):\n         a_stmts.append(stmt + \"'%{:x}'\".format(slice))\n      gp.path_pool = multiprocessing.pool.Pool(processes = 16)\n      gp.path_pool.map_async(explore_fs_slice, a_stmts, len(a_stmts), callback = collect_results, error_callback = print_error)\n      gp.path_pool.close()\n      gp.path_pool.join()\n      SQL_stmts = set()\n      for p in stats.paths:\n         last_sep = storage_id['file_system_path'].rfind(os.sep)\n         if \"\" == dest_path:\n            dp = storage_id['file_system_path'][ : last_sep] \n         else:\n            dp = dest_path\n         # note the dot in the source path: relative implied directories will be created from that position; \n         
print(f\"rsync -avzR {storage_id['file_system_path'][ : last_sep]}\/.{storage_id['file_system_path'][last_sep : ]}\/{str(gp.docbase_id)}\/{p} {(user + '@') if user else ''}{dest_host}{':' if dest_host else ''}{dp}\")\n         if storage_id['file_system_path'][ : last_sep] != dp:\n            # dm_location.file_system_path has changed, give the SQL statements to update them in clone's schema;\n            SQL_stmts.add(f\"UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '{storage_id['file_system_path'][ : last_sep]}', '{dp}') WHERE r_object_id = '{storage_id['loc_id']}';\")\n      # commented out SQL statements to run before starting the repository clone;\n      for stmt in SQL_stmts:\n         print(f\"# {stmt}\")\n\n   status = DctmAPI.disconnect(session)\n   if not status:\n      print(\"error while disconnecting\")\n<\/pre>\n<p>On line 10, the module DctmAPI is imported; this module was presented in a previous article (see <a title=\"Adding a Documentum extension into python\" href=\"https:\/\/www.dbi-services.com\/blog\/adding-a-documentum-extension-into-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>) but I include an updated version at the end of the present article.<br \/>\nNote the call to DctmAPI.select_cor() on line 51; this is a special version of DctmAPI.select2dict() where _cor stands for coroutine; actually, in python, it is called a generator but it looks very much like a coroutine from other programming languages. Its main interest is to separate navigating through the result set from consuming the returned data, for more clarity; also, since the data are consumed one row at a time, there is no need to read them all into memory at once and pass them to the caller, which is especially efficient here where we can potentially have millions of documents. DctmAPI.select2dict() is still available and used when the expected result set is very limited, as for the list of dm_locations on line 154. 
By the way, DctmAPI.select2dict() invokes DctmAPI.select_cor() from within a list constructor, so they share that part of the code.<br \/>\nOn line 171, the function map_async is used to start, for each filestore, 16 concurrent calls to explore_fs_slice (defined on line 47), each in its own process and each working on one slice of the contents (r_object_id % 16, expressed in DQL as r_object_id like &#8216;%0&#8217; .. &#8216;%f&#8217;). That function repeatedly gets an r_object_id from the generator above and calls the administrative method get_path on it (we could query dmr_content.data_ticket and compute the file path ourselves, but would it be any faster?); it returns a set of unique paths for its slice of ids. Since map_async itself does not block, the script then waits for all the processes to terminate through the pool&#8217;s close() and join() calls. The results are collected by the callback collect_results starting on line 42; its parameter, list_paths, is a list of sets received from map_async (which gathered the sets returned by the concurrent invocations of explore_fs_slice into a list); they are further made unique by union-ing them into a global set. 
Starting on line 175, this set is iterated to generate the rsync commands.<br \/>\nExample of execution:<br \/>\n<code><br \/>\n.\/walk-docbase.py -r dmtest -f all -d dmadmin@dmtest:\/home\/dctm\/dmtest | tee clone-content<br \/>\n# rsync statements for filestore filestore_01;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0c dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/02 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/06 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0b dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/04 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/09 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/01 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/03 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/07 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/00 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/05 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0a dmadmin@dmtest:\/home\/dctm\/dmtest<br 
\/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/08 dmadmin@dmtest:\/home\/dctm\/dmtest<br \/>\n# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '\/home\/dmadmin\/documentum\/data\/dmtest', '\/home\/dctm\/dmtest') WHERE r_object_id = '3a00c3508000013f';<br \/>\n# rsync statements for filestore thumbnail_store_01;<br \/>\n# rsync statements for filestore streaming_store_01;<br \/>\n# rsync statements for filestore replicate_temp_store;<br \/>\n# rsync statements for filestore replica_filestore_01;<br \/>\n<\/code><br \/>\nThis execution generated the rsync commands to copy all the dmtest repository&#8217;s filestores to the remote host dmtest&#8217;s new location \/home\/dctm\/dmtest. As the filestores&#8217; dm_location has changed, an SQL statement has also been generated to reflect the new path (take it as an example only: it is written for an Oracle RDBMS and the syntax may differ in another RDBMS). We do this in SQL because the docbase clone will still be down at this time and the change must be done at the database level.<br \/>\nThe other default filestores in the test docbase are empty and so no rsync commands are necessary for them; normally, the placeholder docbase has already initialized their sub-trees.<br \/>\nThose rsync commands could be executed in parallel, say, 10 at a time, by launching them in the background with &#8220;wait&#8221; commands inserted in between, like this:<br \/>\n<code><br \/>\n.\/walk-docbase.py -r dmtest -f all -d dmadmin@dmtest:\/home\/dctm\/dmtest | gawk -v nb_rsync=10 'BEGIN {<br \/>\n   print \"#!\/bin\/bash\"<br \/>\n   nb_dirs = 0<br \/>\n}<br \/>\n{<br \/>\n   print $0 \" &amp;\"<br \/>\n   if (!$0 || match($0, \/^#\/)) next<br \/>\n   if (0 == ++nb_dirs % nb_rsync)<br \/>\n      print \"wait\"<br \/>\n}<br \/>\nEND {<br \/>\n   if (0 != nb_dirs % nb_rsync)<br \/>\n      print \"wait\"<br \/>\n}' | tee migr.sh<br \/>\n#!\/bin\/bash<br \/>\n# rsync statements 
for filestore filestore_01; &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/08 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/02 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/04 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/03 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/06 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/01 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0a dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/07 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0c dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/05 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nwait<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/00 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/09 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00\/0b dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\n# 
UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '\/home\/dmadmin\/documentum\/data\/dmtest', '\/home\/dctm\/dmtest') WHERE r_object_id = '3a00c3508000013f';<br \/>\n# rsync statements for filestore thumbnail_store_01;<br \/>\n# rsync statements for filestore streaming_store_01;<br \/>\n# rsync statements for filestore replicate_temp_store;<br \/>\n# rsync statements for filestore replica_filestore_01;<br \/>\nwait<br \/>\n<\/code><br \/>\nEven though there may be a count difference of up to 255 files between some rsync commands, they should complete at roughly the same time so that nb_rsync commands should be running at any time. If not, i.e. if the transfers frequently wait for a few long-running rsync commands to complete (it could happen with huge files), it may be worth using a task manager that makes sure the requested degree of parallelism is maintained throughout the whole execution.<br \/>\nLet&#8217;s now make the generated script executable and launch it:<br \/>\n<code><br \/>\nchmod +x migr.sh<br \/>\ntime .\/migr.sh<br \/>\n<\/code><br \/>\nThe parameter level lets one choose the level of the sub-directories that rsync will copy: 0 (the default) for terminal sub-directories, -1 for the level right above them and -2 for the level above those. As discussed, the lower the level, the fewer rsync commands are needed, e.g. 
up to 128 * 64K for level = 0, up to 128 * 256 for level = -1 and up to 128 for level = -2.<br \/>\nExample of execution:<br \/>\n<code><br \/>\n.\/walk-docbase.py -r dmtest -d dd --level -1<br \/>\n# rsync statements for filestore filestore_01;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80\/00 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\n# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '\/home\/dmadmin\/documentum\/data\/dmtest', 'dd') WHERE r_object_id = '3a00c3508000013f';<br \/>\n# rsync statements for filestore thumbnail_store_01;<br \/>\n# rsync statements for filestore streaming_store_01;<br \/>\n# rsync statements for filestore replicate_temp_store;<br \/>\n# rsync statements for filestore replica_filestore_01;<br \/>\n<\/code><br \/>\nAnd also:<br \/>\n<code><br \/>\n.\/walk-docbase.py -r dmtest -d dd --level -2<br \/>\n# rsync statements for filestore filestore_01;<br \/>\nrsync -avzR \/home\/dmadmin\/documentum\/data\/dmtest\/.\/content_storage_01\/0000c350\/80 dmadmin@dmtest:\/home\/dctm\/dmtest &amp;<br \/>\n# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '\/home\/dmadmin\/documentum\/data\/dmtest', 'dd') WHERE r_object_id = '3a00c3508000013f';<br \/>\n# rsync statements for filestore thumbnail_store_01;<br \/>\n# rsync statements for filestore streaming_store_01;<br \/>\n# rsync statements for filestore replicate_temp_store;<br \/>\n# rsync statements for filestore replica_filestore_01;<br \/>\n<\/code><\/p>\n<p>So, this first solution looks quite simple even though it initially puts a small, yet tolerable, load on the docbase. 
The python script connects to the repository and generates the required rsync commands (with a user-selectable compactness level) and a gawk filter prepares an executable with those statements launched in parallel, N (user-selectable) at a time.<br \/>\nPerformance-wise, it&#8217;s not so good because all the contents must be queried for their full path, and that&#8217;s a lot of queries for a large repository.<\/p>\n<p>All this being said, let&#8217;s see now if a direct and faster copy procedure, working on the filesystem outside of the repository, can be devised. Please follow the rest of this article in <a title=\"Two techniques for cloning a repository filestore, part II\" href=\"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-ii\" target=\"_blank\" rel=\"noopener noreferrer\">part II<\/a>. The next paragraph just lists the latest version of the module DctmAPI.py.<\/p>\n<h3>DctmAPI.py revisited<\/h3>\n<pre class=\"brush: python; gutter: true; first-line: 1; highlight: [151,223]\">\n\"\"\"\nThis module is a python - Documentum binding based on ctypes;\nrequires libdmcl40.so\/libdmcl.so to be reachable through LD_LIBRARY_PATH;\nC. Cervini - dbi-services.com - december 2018\n\nThe binding works as-is for both python2 and python3; no recompilation required; that's the good thing with ctypes compared to e.g. 
distutils\/SWIG;\nUnder a 32-bit O\/S, it must use the libdmcl40.so, whereas under a 64-bit Linux it must use the java backed one, libdmcl.so;\n\nFor compatibility with python3 (where strings are now unicode ones and no longer arrays of bytes), ctypes string parameters are always converted to bytes, either by prefixing them\nwith a b if literal or by invoking their encode('ascii', 'ignore') method; to get back to text from bytes, b.decode() is used; this works in python2 as well as in python3 so the source is compatible with both versions of the language;\n\"\"\"\n\nimport os\nimport ctypes\nimport sys, traceback\n\n# use foreign C library;\n# use this library in eContent server = v6.x, 64-bit Linux;\ndmlib = 'libdmcl.so'\n\ndm = 0\n\nclass getOutOfHere(Exception):\n   pass\n\ndef show(mesg, beg_sep = False, end_sep = False):\n   \"displays the message mesg if allowed\"\n   if logLevel &gt; 0:\n      print((\"\\n\" if beg_sep else \"\") + repr(mesg), (\"\\n\" if end_sep else \"\"))\n\ndef dmInit():\n   \"\"\"\n   initializes the Documentum part;\n   returns True if successful, False otherwise;\n   dmAPI* are global aliases on their respective dm.dmAPI* for some syntactic sugar;\n   since they already have an implicit namespace through their dm prefix, dm.dmAPI* would be redundant so let's get rid of it;\n   returns True if no error, False otherwise;\n   \"\"\"\n\n   show(\"in dmInit()\")\n   global dm\n\n   try:\n      dm = ctypes.cdll.LoadLibrary(dmlib);  dm.restype = ctypes.c_char_p\n      show(\"dm=\" + str(dm) + \" after loading library \" + dmlib)\n      dm.dmAPIInit.restype    = ctypes.c_int;\n      dm.dmAPIDeInit.restype  = ctypes.c_int;\n      dm.dmAPIGet.restype     = ctypes.c_char_p;      dm.dmAPIGet.argtypes  = [ctypes.c_char_p]\n      dm.dmAPISet.restype     = ctypes.c_int;         dm.dmAPISet.argtypes  = [ctypes.c_char_p, ctypes.c_char_p]\n      dm.dmAPIExec.restype    = ctypes.c_int;         dm.dmAPIExec.argtypes = [ctypes.c_char_p]\n      status  
= dm.dmAPIInit()\n   except Exception as e:\n      print(\"exception in dminit(): \")\n      print(e)\n      traceback.print_stack()\n      status = False\n   finally:\n      show(\"exiting dmInit()\")\n      return True if 0 != status else False\n   \ndef dmAPIDeInit():\n   \"\"\"\n   releases the memory structures in documentum's library;\n   returns True if no error, False otherwise;\n   \"\"\"\n   status = dm.dmAPIDeInit()\n   return True if 0 != status else False\n   \ndef dmAPIGet(s):\n   \"\"\"\n   passes the string s to dmAPIGet() method;\n   returns a non-empty string if OK, None otherwise;\n   \"\"\"\n   value = dm.dmAPIGet(s.encode('ascii', 'ignore'))\n   return value.decode() if value is not None else None\n\ndef dmAPISet(s, value):\n   \"\"\"\n   passes the string s to dmAPISet() method;\n   returns TRUE if OK, False otherwise;\n   \"\"\"\n   status = dm.dmAPISet(s.encode('ascii', 'ignore'), value.encode('ascii', 'ignore'))\n   return True if 0 != status else False\n\ndef dmAPIExec(stmt):\n   \"\"\"\n   passes the string stmt to dmAPIExec() method;\n   returns TRUE if OK, False otherwise;\n   \"\"\"\n   status = dm.dmAPIExec(stmt.encode('ascii', 'ignore'))\n   return True if 0 != status else False\n\ndef connect(docbase, user_name, password):\n   \"\"\"\n   connects to given docbase as user_name\/password;\n   returns a session id if OK, None otherwise\n   \"\"\"\n   show(\"in connect(), docbase = \" + docbase + \", user_name = \" + user_name + \", password = \" + password) \n   try:\n      session = dmAPIGet(\"connect,\" + docbase + \",\" + user_name + \",\" + password)\n      if session is None or not session:\n         raise(getOutOfHere)\n      else:\n         show(\"successful session \" + session)\n         show(dmAPIGet(\"getmessage,\" + session).rstrip())\n   except getOutOfHere:\n      print(\"unsuccessful connection to docbase \" + docbase + \" as user \" + user_name)\n      session = None\n   except Exception as e:\n      print(\"Exception 
in connect():\")\n      print(e)\n      traceback.print_stack()\n      session = None\n   finally:\n      show(\"exiting connect()\")\n      return session\n\ndef execute(session, dql_stmt):\n   \"\"\"\n   execute non-SELECT DQL statements;\n   returns TRUE if OK, False otherwise;\n   \"\"\"\n   show(\"in execute(), dql_stmt=\" + dql_stmt)\n   try:\n      query_id = dmAPIGet(\"query,\" + session + \",\" + dql_stmt)\n      if query_id is None:\n         raise(getOutOfHere)\n      err_flag = dmAPIExec(\"close,\" + session + \",\" + query_id)\n      if not err_flag:\n         raise(getOutOfHere)\n      status = True\n   except getOutOfHere:\n      show(dmAPIGet(\"getmessage,\" + session).rstrip())\n      status = False\n   except Exception as e:\n      print(\"Exception in execute():\")\n      print(e)\n      traceback.print_stack()\n      status = False\n   finally:\n      show(dmAPIGet(\"getmessage,\" + session).rstrip())\n      show(\"exiting execute()\")\n      return status\n\ndef select2dict(session, dql_stmt, attr_name = None):\n   \"\"\"\n   execute the DQL SELECT statement passed in dql_stmt and returns an array of dictionaries (one per row) into result;\n   attributes_names is the list of extracted attributes (the ones in SELECT ..., as interpreted by the server); if not None, attribute namea are appended to it, otherwise nothing is returned;\n   \"\"\"\n   show(\"in select2dict(), dql_stmt=\" + dql_stmt)\n   return list(select_cor(session, dql_stmt, attr_name))\n\ndef select_cor(session, dql_stmt, attr_name = None):\n   \"\"\"\n   execute the DQL SELECT statement passed in dql_stmt and return one row at a time;\n   coroutine version;\n   if the optional attributes_names is not None, it contains an appended list of attributes returned by the result set, otherwise no names are returned;\n   return True if OK, False otherwise;\n   \"\"\"\n   show(\"in select_cor(), dql_stmt=\" + dql_stmt)\n\n   status = False\n   try:\n      query_id = dmAPIGet(\"query,\" + 
session + \",\" + dql_stmt)\n      if query_id is None:\n         raise(getOutOfHere)\n\n      # iterate through the result set;\n      row_counter = 0\n      if None == attr_name:\n         attr_name = []\n      width = {}\n      while dmAPIExec(\"next,\" + session + \",\" + query_id):\n         result = {}\n         nb_attrs = dmAPIGet(\"count,\" + session + \",\" + query_id)\n         if nb_attrs is None:\n            show(\"Error retrieving the count of returned attributes: \" + dmAPIGet(\"getmessage,\" + session))\n            raise(getOutOfHere)\n         nb_attrs = int(nb_attrs) \n         for i in range(nb_attrs):\n            if 0 == row_counter:\n               # get the attributes' names only once for the whole query;\n               value = dmAPIGet(\"get,\" + session + \",\" + query_id + \",_names[\" + str(i) + \"]\")\n               if value is None:\n                  show(\"error while getting the attribute name at position \" + str(i) + \": \" + dmAPIGet(\"getmessage,\" + session))\n                  raise(getOutOfHere)\n               attr_name.append(value)\n               if value in width:\n                  width[value] = max(width[attr_name[i]], len(value))\n               else:\n                  width[value] = len(value)\n\n            is_repeating = dmAPIGet(\"repeating,\" + session + \",\" + query_id + \",\" + attr_name[i])\n            if is_repeating is None:\n               show(\"error while getting the arity of attribute \" + attr_name[i] + \": \" + dmAPIGet(\"getmessage,\" + session))\n               raise(getOutOfHere)\n            is_repeating = int(is_repeating)\n\n            if 1 == is_repeating:\n               # multi-valued attributes;\n               result[attr_name[i]] = []\n               count = dmAPIGet(\"values,\" + session + \",\" + query_id + \",\" + attr_name[i])\n               if count is None:\n                  show(\"error while getting the arity of attribute \" + attr_name[i] + \": \" + 
dmAPIGet(\"getmessage,\" + session))\n                  raise(getOutOfHere)\n               count = int(count)\n\n               for j in range(count):\n                  value = dmAPIGet(\"get,\" + session + \",\" + query_id + \",\" + attr_name[i] + \"[\" + str(j) + \"]\")\n                  if value is None:\n                     value = \"null\"\n                  #result[row_counter] [attr_name[i]].append(value)\n                  result[attr_name[i]].append(value)\n            else:\n               # mono-valued attributes;\n               value = dmAPIGet(\"get,\" + session + \",\" + query_id + \",\" + attr_name[i])\n               if value is None:\n                  value = \"null\"\n               width[attr_name[i]] = len(attr_name[i])\n               result[attr_name[i]] = value\n         if 0 == row_counter:\n            show(attr_name)\n         yield result\n         row_counter += 1\n      err_flag = dmAPIExec(\"close,\" + session + \",\" + query_id)\n      if not err_flag:\n         show(\"Error closing the query collection: \" + dmAPIGet(\"getmessage,\" + session))\n         raise(getOutOfHere)\n\n      status = True\n\n   except getOutOfHere:\n      show(dmAPIGet(\"getmessage,\" + session).rstrip())\n      status = False\n   except Exception as e:\n      print(\"Exception in select_cor():\")\n      print(e)\n      traceback.print_stack()\n      status = False\n   finally:\n      return status\n\ndef select(session, dql_stmt, attribute_names):\n   \"\"\"\n   execute the DQL SELECT statement passed in dql_stmt and outputs the result to stdout;\n   attribute_names is a list of attributes to extract from the result set;\n   return True if OK, False otherwise;\n   \"\"\"\n   show(\"in select(), dql_stmt=\" + dql_stmt)\n   try:\n      query_id = dmAPIGet(\"query,\" + session + \",\" + dql_stmt)\n      if query_id is None:\n         raise(getOutOfHere)\n\n      s = \"\"\n      for attr in attribute_names:\n         s += \"[\" + attr + \"]\\t\"\n   
   print(s)\n      resp_cntr = 0\n      while dmAPIExec(\"next,\" + session + \",\" + query_id):\n         s = \"\"\n         for attr in attribute_names:\n            value = dmAPIGet(\"get,\" + session + \",\" + query_id + \",\" + attr)\n            if \"r_object_id\" == attr and value is None:\n               raise(getOutOfHere)\n            s += \"[\" + (value if value else \"None\") + \"]\\t\"\n            show(str(resp_cntr) + \": \" + s)\n         resp_cntr += 1\n      show(str(resp_cntr) + \" rows iterated\")\n\n      err_flag = dmAPIExec(\"close,\" + session + \",\" + query_id)\n      if not err_flag:\n         raise(getOutOfHere)\n\n      status = True\n   except getOutOfHere:\n      show(dmAPIGet(\"getmessage,\" + session).rstrip())\n      status = False\n   except Exception as e:\n      print(\"Exception in select():\")\n      print(e)\n      traceback.print_stack()\n      print(resp_cntr); print(attr); print(s); print(\"[\" + value + \"]\")\n      status = False\n   finally:\n      show(\"exiting select()\")\n      return status\n\ndef walk_group(session, root_group, level, result):\n   \"\"\"\n   recursively walk a group hierarchy with root_group as top parent;\n   \"\"\"\n\n   try:\n      root_group_id = dmAPIGet(\"retrieve,\" + session + \",dm_group where group_name = '\" + root_group + \"'\")\n      if 0 == level:\n         if root_group_id is None:\n            show(\"Cannot retrieve group [\" + root_group + \"]:\" + dmAPIGet(\"getmessage,\" + session))\n            raise(getOutOfHere)\n      result[root_group] = {}\n\n      count = dmAPIGet(\"values,\" + session + \",\" + root_group_id + \",groups_names\")\n      if \"\" == count:\n         show(\"error while getting the arity of attribute groups_names: \" + dmAPIGet(\"getmessage,\" + session))\n         raise(getOutOfHere)\n      count = int(count)\n\n      for j in range(count):\n         value = dmAPIGet(\"get,\" + session + \",\" + root_group_id + \",groups_names[\" + str(j) + \"]\")\n         
if value is not None:\n            walk_group(session, value, level + 1, result[root_group])\n\n   except getOutOfHere:\n      show(dmAPIGet(\"getmessage,\" + session).rstrip())\n   except Exception as e:\n      print(\"Exception in walk_group():\")\n      print(e)\n      traceback.print_stack()\n\ndef disconnect(session):\n   \"\"\"\n   closes the given session;\n   returns True if no error, False otherwise;\n   \"\"\"\n   show(\"in disconnect()\")\n   try:\n      status = dmAPIExec(\"disconnect,\" + session)\n   except Exception as e:\n      print(\"Exception in disconnect():\")\n      print(e)\n      traceback.print_stack()\n      status = False\n   finally:\n      show(\"exiting disconnect()\")\n      return status\n<\/pre>\n<p>The highlighted lines mark the main changes: the new generator-based select_cor() function, which yields one dictionary per row of the result set, and select2dict(), which now uses it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I must confess that my initial thought for the title was &#8220;An optimal repository filestore copy&#8221;. Optimal, really ? Relatively to what ? Which variable(s) define(s) the optimality ? Speed\/time to clone ? Too dependent on the installed hardware and software, and the available resources and execution constraints. Simplicity to do it ? 
Too simple [&hellip;]<\/p>\n","protected":false},"author":40,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[525],"tags":[],"type_dbi":[],"class_list":["post-12178","post","type-post","status-publish","format-standard","hentry","category-enterprise-content-management"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.8 (Yoast SEO v27.8) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Two techniques for cloning a repository filestore, part I - dbi Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Two techniques for cloning a repository filestore, part I\" \/>\n<meta property=\"og:description\" content=\"I must confess that my initial thought for the title was &#8220;An optimal repository filestore copy&#8221;. Optimal, really ? Relatively to what ? Which variable(s) define(s) the optimality ? Speed\/time to clone ? Too dependent on the installed hardware and software, and the available resources and execution constraints. Simplicity to do it ? 
Too simple [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/\" \/>\n<meta property=\"og:site_name\" content=\"dbi Blog\" \/>\n<meta property=\"article:published_time\" content=\"2019-01-18T08:26:46+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-10-24T07:32:19+00:00\" \/>\n<meta name=\"author\" content=\"Middleware Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Middleware Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"38 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/\"},\"author\":{\"name\":\"Middleware Team\",\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/#\\\/schema\\\/person\\\/8d8563acfc6e604cce6507f45bac0ea1\"},\"headline\":\"Two techniques for cloning a repository filestore, part I\",\"datePublished\":\"2019-01-18T08:26:46+00:00\",\"dateModified\":\"2025-10-24T07:32:19+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/\"},\"wordCount\":3767,\"commentCount\":0,\"articleSection\":[\"Enterprise content 
management\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/\",\"url\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/\",\"name\":\"Two techniques for cloning a repository filestore, part I - dbi Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/#website\"},\"datePublished\":\"2019-01-18T08:26:46+00:00\",\"dateModified\":\"2025-10-24T07:32:19+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/#\\\/schema\\\/person\\\/8d8563acfc6e604cce6507f45bac0ea1\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/two-techniques-for-cloning-a-repository-filestore-part-i\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Accueil\",\"item\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Two techniques for cloning a repository filestore, part I\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/\",\"name\":\"dbi 
Blog\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/#\\\/schema\\\/person\\\/8d8563acfc6e604cce6507f45bac0ea1\",\"name\":\"Middleware Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/ddcae7ba0f9d1a0e7ae707f0e689e4a9c95bb48ec49c8e6d9cc86d43f4121cb6?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/ddcae7ba0f9d1a0e7ae707f0e689e4a9c95bb48ec49c8e6d9cc86d43f4121cb6?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/ddcae7ba0f9d1a0e7ae707f0e689e4a9c95bb48ec49c8e6d9cc86d43f4121cb6?s=96&d=mm&r=g\",\"caption\":\"Middleware Team\"},\"url\":\"https:\\\/\\\/www.dbi-services.com\\\/blog\\\/author\\\/middleware-team\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Two techniques for cloning a repository filestore, part I - dbi Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/","og_locale":"en_US","og_type":"article","og_title":"Two techniques for cloning a repository filestore, part I","og_description":"I must confess that my initial thought for the title was &#8220;An optimal repository filestore copy&#8221;. Optimal, really ? Relatively to what ? Which variable(s) define(s) the optimality ? Speed\/time to clone ? 
Too dependent on the installed hardware and software, and the available resources and execution constraints. Simplicity to do it ? Too simple [&hellip;]","og_url":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/","og_site_name":"dbi Blog","article_published_time":"2019-01-18T08:26:46+00:00","article_modified_time":"2025-10-24T07:32:19+00:00","author":"Middleware Team","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Middleware Team","Est. reading time":"38 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/#article","isPartOf":{"@id":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/"},"author":{"name":"Middleware Team","@id":"https:\/\/www.dbi-services.com\/blog\/#\/schema\/person\/8d8563acfc6e604cce6507f45bac0ea1"},"headline":"Two techniques for cloning a repository filestore, part I","datePublished":"2019-01-18T08:26:46+00:00","dateModified":"2025-10-24T07:32:19+00:00","mainEntityOfPage":{"@id":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/"},"wordCount":3767,"commentCount":0,"articleSection":["Enterprise content management"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/","url":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/","name":"Two techniques for cloning a repository filestore, part I - dbi 
Blog","isPartOf":{"@id":"https:\/\/www.dbi-services.com\/blog\/#website"},"datePublished":"2019-01-18T08:26:46+00:00","dateModified":"2025-10-24T07:32:19+00:00","author":{"@id":"https:\/\/www.dbi-services.com\/blog\/#\/schema\/person\/8d8563acfc6e604cce6507f45bac0ea1"},"breadcrumb":{"@id":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.dbi-services.com\/blog\/two-techniques-for-cloning-a-repository-filestore-part-i\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Accueil","item":"https:\/\/www.dbi-services.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Two techniques for cloning a repository filestore, part I"}]},{"@type":"WebSite","@id":"https:\/\/www.dbi-services.com\/blog\/#website","url":"https:\/\/www.dbi-services.com\/blog\/","name":"dbi Blog","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.dbi-services.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.dbi-services.com\/blog\/#\/schema\/person\/8d8563acfc6e604cce6507f45bac0ea1","name":"Middleware Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/ddcae7ba0f9d1a0e7ae707f0e689e4a9c95bb48ec49c8e6d9cc86d43f4121cb6?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/ddcae7ba0f9d1a0e7ae707f0e689e4a9c95bb48ec49c8e6d9cc86d43f4121cb6?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/ddcae7ba0f9d1a0e7ae707f0e689e4a9c95bb48ec49c8e6d9cc86d43f4121cb6?s=96&d=mm&r=g","caption":"Middleware 
Team"},"url":"https:\/\/www.dbi-services.com\/blog\/author\/middleware-team\/"}]}},"_links":{"self":[{"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/posts\/12178","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/users\/40"}],"replies":[{"embeddable":true,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/comments?post=12178"}],"version-history":[{"count":1,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/posts\/12178\/revisions"}],"predecessor-version":[{"id":41192,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/posts\/12178\/revisions\/41192"}],"wp:attachment":[{"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/media?parent=12178"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/categories?post=12178"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/tags?post=12178"},{"taxonomy":"type","embeddable":true,"href":"https:\/\/www.dbi-services.com\/blog\/wp-json\/wp\/v2\/type_dbi?post=12178"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}