At a customer, I recently had a case where a mass-import job was executed on an interface that, in the background, uses Alfresco for document and metadata storage. From the point of view of the interface team, there was no problem as documents were properly being created in Alfresco (although performance wasn’t exceptional). However, after some time, our monitoring started sending us alerts that Solr indexing nearly stopped / was very slow. I might talk about the Solr part in a future blog but what happened is that the interface was configured to import documents into Alfresco in a way that caused too many documents in a single folder.
Too many documents in the same folder of Alfresco
The interface was trying to import documents in the folder “YYYY/MM/DD/HH” (YYYY being the year, MM the month, DD the day and HH the hour). This might be fine for Business-As-Usual (BAU), when the load isn’t too high, but when mass-importing documents, that meant several thousand documents per folder (5’000, 10’000, 20’000, …), the limit being what Alfresco can ingest in an hour or what the interface manages to send. As you probably know, Alfresco definitively doesn’t like folders with much more than a thousand nodes inside (in particular because of associations and indexing design)… When I saw that, I asked the interface team to stop the import job, but unfortunately, it wasn’t stopped right away and almost 190 000 documents were already imported into Alfresco.
Alfresco APIs for the win?
You cannot really let Alfresco in this state since Solr would heavily be impacted by this kind of situation and any change to a document in such folder could result in heavy load. Therefore, from my point of view, the best is to remove the documents and execute a new/correct import with a better distribution of documents per folder.
A first solution could be to restore the DB to a point in time before the activity started, but that means a downtime and anything else that happened in the meantime would be lost. A second option would be to find all the documents imported and remove them through API. As you might know, Share UI will not really be useful in this case since Share will either crash or just take way too long to open the folder, so don’t even try… And even if it is able to somehow open the folder containing XX’XXX nodes, you probably shouldn’t try to delete it because it will take forever, and you will not be able to know what’s the status of this process that runs in the background. Therefore, from my point of view, the only reasonable solution is through API.
Finding documents to delete
As mentioned, Solr indexing was nearly dead, so I couldn’t rely on it to find what was imported recently. Using REST-API could be possible but there are some limitations when working with huge set of results. In this case, I decided to go with a simple DB query (if you are interested in useful Alfresco DB queries), listing all documents created since the start of the mass-import by the interface user:
SQL> SELECT n.id AS "Node ID",
n.store_id AS "Store ID",
n.uuid AS "Document ID (UUID)",
n.audit_creator AS "Creator",
n.audit_created AS "Creation Date",
n.audit_modifier AS "Modifier",
n.audit_modified AS "Modification Date",
n.type_qname_id
FROM alfresco.alf_node n,
alfresco.alf_node_properties p
WHERE n.id=p.node_id
AND p.qname_id=(SELECT id FROM alf_qname WHERE local_name='content')
AND n.audit_created>='2023-11-23T19:00:00Z'
AND n.audit_creator='itf_user'
AND n.audit_created is not null;
In case the interface isn’t using a dedicated user for the mass-import process, it might be a bit more difficult to find the correct list of documents to be removed, as you would need to take care not to remove the BAU documents… Maybe using a recursive query based on the folder on which the documents were imported or some custom type/metadata or similar. The result of the above query was put in a text file for the processing:
alfresco@acs01:~$ cat alfresco_documents.txt
Node ID Store ID Document ID (UUID) Creator Creation Date Modifier Modification Date TYPE_QNAME_ID
--------- -------- ------------------------------------ --------- ------------------------- --------- ------------------------- -------------
156491155 6 0f16ef7a-4cf1-4304-b578-71480570c070 itf_user 2023-11-23T19:01:02.511Z itf_user 2023-11-23T19:01:03.128Z 265
156491158 4 2f65420a-1105-4306-9733-210501ae7efb itf_user 2023-11-23T19:01:03.198Z itf_user 2023-11-23T19:01:03.198Z 265
156491164 6 a208d56f-df1a-4f2f-bc73-6ab39214b824 itf_user 2023-11-23T19:01:03.795Z itf_user 2023-11-23T19:01:03.795Z 265
156491166 4 908d385f-d6bb-4b94-ba5c-6d6942bb75c3 itf_user 2023-11-23T19:01:03.918Z itf_user 2023-11-23T19:01:03.918Z 265
...
159472069 6 cabf7343-35c4-4e8b-8a36-0fa0805b367f itf_user 2023-11-24T07:50:20.355Z itf_user 2023-11-24T07:50:20.355Z 265
159472079 4 1bcc7301-97ab-4ddd-9561-0ecab8d09efb itf_user 2023-11-24T07:50:20.522Z itf_user 2023-11-24T07:50:20.522Z 265
159472098 6 19d1869c-83d9-449a-8417-b460ccec1d60 itf_user 2023-11-24T07:50:20.929Z itf_user 2023-11-24T07:50:20.929Z 265
159472107 4 bcd0f8a2-68b3-4cc9-b0bd-2af24dc4ff43 itf_user 2023-11-24T07:50:21.074Z itf_user 2023-11-24T07:50:21.074Z 265
159472121 6 74bbe0c3-2437-4d16-bfbc-97bfa5a8d4e0 itf_user 2023-11-24T07:50:21.365Z itf_user 2023-11-24T07:50:21.365Z 265
159472130 4 f984679f-378b-4540-853c-c36f13472fac itf_user 2023-11-24T07:50:21.511Z itf_user 2023-11-24T07:50:21.511Z 265
159472144 6 579a2609-f5be-47e4-89c8-daaa983a314e itf_user 2023-11-24T07:50:21.788Z itf_user 2023-11-24T07:50:21.788Z 265
159472153 4 7f408815-79e1-462a-aa07-182ee38340a3 itf_user 2023-11-24T07:50:21.941Z itf_user 2023-11-24T07:50:21.941Z 265
379100 rows selected.
alfresco@acs01:~$
The above Store ID of ‘6’ is for the ‘workspace://SpacesStore‘ (live document store) and ‘4’ is for the ‘workspace://version2Store‘ (version store):
SQL> SELECT id, protocol, identifier FROM alf_store;
ID PROTOCOL IDENTIFIER
--- ---------- ----------
1 user alfrescoUserStore
2 system system
3 workspace lightWeightVersionStore
4 workspace version2Store
5 archive SpacesStore
6 workspace SpacesStore
Looking at the number of rows for each Store ID gives the exact same number and confirms there are no deleted documents yet:
alfresco@acs01:~$ grep " 4 " alfresco_documents.txt | wc -l
189550
alfresco@acs01:~$
alfresco@acs01:~$ grep " 5 " alfresco_documents.txt | wc -l
0
alfresco@acs01:~$
alfresco@acs01:~$ grep " 6 " alfresco_documents.txt | wc -l
189550
alfresco@acs01:~$
Therefore, there is around 190k docs to remove in total, which is roughly the same number seen in the filesystem. The Alfresco ContentStore has a little bit more obviously since it also contains the BAU documents.
REST-API environment preparation
Now that the list is complete, the next step is to extract the IDs of the documents, so that we can use these in REST-API calls. The IDs are simply the third column from the file (Document ID (UUID)):
alfresco@acs01:~$ grep " 6 " alfresco_documents.txt | awk '{print $3}' > input_file_6_id.txt
alfresco@acs01:~$
alfresco@acs01:~$ wc -l alfresco_documents.txt input_file_6_id.txt
379104 alfresco_documents.txt
189550 input_file_6_id.txt
568654 total
alfresco@acs01:~$
Now, to be able to execute REST-API calls, we will also need to define the username/password as well as the URL to be used. I executed the REST-API calls from the Alfresco server itself, so I didn’t really need to think too much about security, and I just used a BASIC authorization method using localhost and HTTPS. If you are executing that remotely, you might want to use tickets instead (and obviously keep the HTTPS protocol). To prepare for the removal, I defined the needed environment variables as follow:
alfresco@acs01:~$ alf_user=admin
alfresco@acs01:~$ read -s -p "Enter ${alf_user} password: " alf_passwd
Enter admin password:
alfresco@acs01:~$
alfresco@acs01:~$ auth=$(echo -n "${alf_user}:${alf_passwd}" | base64)
alfresco@acs01:~$
alfresco@acs01:~$ alf_base_url="https://localhost:8443/alfresco"
alfresco@acs01:~$ alf_node_url="${alf_base_url}/api/-default-/public/alfresco/versions/1/nodes"
alfresco@acs01:~$
alfresco@acs01:~$ input_file="$HOME/input_file_6_id.txt"
alfresco@acs01:~$ output_file="$HOME/output_file_6.txt"
alfresco@acs01:~$
With the above, we have our authorization string (base64 encoding of ‘username:password‘) as well as the Alfresco API URL. In case you wonder, you can find the definition of the REST-APIs in the Alfresco API Explorer. I also defined the input file, which contains all document IDs and an output file, which will contain the list of all documents processed, with the outcome of the command, to be able to check for any issues and follow the progress.
Deleting documents with REST-API
The last step is now to create a small command/script that will execute the deletion of the documents in REST-API. Things to note here is that I’m using ‘permanent=true‘ so that the documents will not end-up in the trashcan but will be completely and permanently deleted. Therefore, you need to make sure the list of documents is correct! You can obviously set that parameter to false if you really want to, but please be aware that it will impact the performance quite a bit… Otherwise the command is fairly simple, it loops on the input file, execute the deletion query, get its output and log it:
alfresco@acs01:~$ while read -u 3 line; do
out=$(curl -k -s -X DELETE "${alf_node_url}/${line}?permanent=true" -H "accept: application/json" -H "Authorization: Basic ${auth}" | sed 's/.*\(statusCode":[0-9]*\),.*/\1/')
echo "${line} -- ${out}" >> "${output_file}"
done 3< "${input_file}"
The above is the simplest way/form of removal, with a single thread executed on a single server. You can obviously do multi-threaded deletions by splitting the input file into several and triggering commands in parallel, either on the same host or even on other hosts (if you have an Alfresco Cluster). In this example, I was able to get a consistent throughput of ~3130 documents deleted every 5 minutes, which means ~10.4 documents deleted per second. Again, that was on a single server with a single thread:
alfresco@acs01:~$ while true; do
echo "$(date) -- $(wc -l output_file_6.txt)"
sleep 300
done
Fri Nov 24 09:57:38 CET 2023 -- 810 output_file_6.txt
...
Fri Nov 24 10:26:55 CET 2023 -- 18920 output_file_6.txt
Fri Nov 24 10:31:55 CET 2023 -- 22042 output_file_6.txt
Fri Nov 24 10:36:55 CET 2023 -- 25180 output_file_6.txt
Fri Nov 24 10:41:55 CET 2023 -- 28290 output_file_6.txt
...
Since the cURL output (‘statusCode‘) is also recorded in the log file, I was able to confirm that 100% of the queries were successfully executed and all my documents were permanently deleted. With multi-threading and offloading to other members of the Cluster, it would have been possible to increase that by a lot (x5? x10? x20?) but that wasn’t needed in this case since the interface job needed to be updated before a new import could be triggered.
Luis
05.06.2024Thank you for sharing your knowledge. My 2 cents, you may use:
curl -k -s -o /dev/null -w "%{http_code}"
to tell curl to show just the response HTTP code and avoid "seding" the output.
Morgan Patou
05.06.2024Hi Luis,
Thanks for the feedback. Yes I guess it's more or less the same thing, the main difference being that with the "sed", I can catch what's the reason for the error ("errorKey"), for example "framework.exception.EntityNotFound" or the summary field ("briefSummary") that Alfresco send, for example "The entity with id xxx was not found". For simplicity in the blog I only kept the "statusCode" but at the customer, I got both the code and the message, to look at it later, if needed.
Cheers,
Morgan