I guess you all already know what a lifecycle is. We born, we live and we die… In fact, it’s exactly the same for Aflresco Nodes! Well, at least from an end user point of view. But what is really going on behind that? This is what I will try to explain in this post.
First of all what is an Alfresco Node? For most people, a Node is just a document stored in Alfresco but in reality it’s much more than that: everything in Alfresco is a Node! A Node has a type that defines its properties and it also have some associations with other Nodes. This is a very short and simplified description but we don’t need to understand what exactly a Node is to understand the lifecycle process. So as said above, Alfresco Nodes have their own lifecycle but it’s a little bit more complicated than just three simple steps.
Please note that in this post, I will use $ALF_HOME as a reference to the location where alfresco has been installed (e.g. /opt/alfresco-4.2.c) and $ALF_DATA as a reference to the alf_data location. By default the alf_data folder is $ALF_HOME/alf_data/.
In this post I will use a document as an Alfresco Node to easily understand the process. So this lifecycle start with the creation of a new document named “Test_Lifecycle.docx” with the following creation date: Febrary the 1st,2015 at 16:45.
When a document is created in Alfresco, three things are done:
- File System: the content of this file is stored on the Alfresco “Content Store”. The Content Store is by default under $ALF_DATA/contentstore. This file is actually put somewhere under this folder that depends on the creation time and an ID is given to this file. For our file, it would be: $ALF_DATA/contentstore/2015/2/1/16/45/408a6980-237e-4315-88cd-6955053787c3.bin.
- Database: the medatada of this file are stored on the Alfresco Database. In fact in the DB, this document is mainly referenced using its NodeRef or NodeID. This NodeRef is something we can see on the Alfresco Web Interface from the document details page (web preview of a document): http://HOSTNAME:PORT/share/page/document-details?nodeRef=workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6. Please be aware that the NodeRef contains an UUID but it’s not the same that the ID on the Content Store side… Moreover, there is a property in the DB that link the NodeRef to the Content Store’s ID for Alfresco to be able to retrieve the content of a file.
- Index: an index is created for this file in the Search engine (can be Lucene for older versions of Alfresco or Solr for newer versions). This index is in the “workspace” store.
II. Update, Review, Approve, Publish, aso…
Once the document is created, his life really begins. You can update/review/approve/publish it manually or automatically with different processes. All these actions are part of the active life of a document. From an administration point of view, there isn’t that much to say here.
III. Deletion – User level
When a document isn’t needed anymore, for any reason, a user with sufficient permissions is able to delete it. For our example, let’s say that a user deleted our file “Test_Lifecycle.docx” using the Alfresco Share Web Interface on Febrary the 20th, 2015 at 15:30 (19 days after creation). When using the Web Interface or Web Services to delete a document, the “nodeService.deleteNode” method is called. So what happened to our “three things”?
- FS: nothing changed on the Content Store. The file content is still here.
- DB: on the DB side, the NodeRef changed from workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6 to archive://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6 (the “store_id” field changed on the “alf_node” table).
- Index: same thing for the search index: the index is moved from the “workspace” store to the “archive” store.
Actually, when a user deletes a document from a Web Interface of Alfresco, the document is just moved to a “global trashcan”. By default all users have access to this global trashcan in Alfresco Explorer to restore the documents they may have deleted by mistake. Of course they can’t see all documents but only the ones related to them. On Alfresco Share, the access to this global trashcan is configured in the share-config.xml file and by default on most versions, only administrators have access to it.
The only way to avoid this global trashcan is to programmatically delete the document by applying the aspect “cm:temporary” to the document and then call the “nodeService.deleteNode” on it. In that way, the document is removed from the UI and isn’t put in the global trashcan.
IV. Deletion – Administration level
What I describe here as an “Administration level” is the second level of deletion that happen by default. This level is the deletion of the document from the global trashcan. If the document is still in the global trashcan, Administrators (or users if you are using Alfresco Explorer) can still restore the document. If the document is “un-deleted”, then it will return exactly where it was before and of course metadata & index of this document will be moved from the “archive” store to the “workspace” store to return in an active life.
On April the 1st, 2015 at 08:05 (40 days after deletion at user level), an administrator decides to remove “Test_Lifecycle.docx” from the global trashcan. This can be done manually or programmatically. Moreover, there are also some existing add-ons that can be configured to automatically delete elements in the trashcan older than XX days. This time, the “NodeArchiveService.purgeArchiveNode” method is called (archiveService.purge in some older Alfresco versions). So what happened to our “three things” this time?
- FS: still nothing changed on the Content Store. The file content is still here.
- DB: on the DB side, the document is still there but some references/fields (not all) are removed. All references on the “alf_content_data” table are removed when only some fields are emptied on the “alf_node” table. For Alfresco 4.0 and below, the “node_deleted” field on the table “alf_node” is changed from 0 to 1. On newer versions of Alfresco, the “node_deleted” doesn’t exist anymore but the QNAME of the node (field “type_qname_id”) on the “alf_node” table is changed from 51 (“content”) to 140 (“deleted”). So the Node is now deleted from the global trashcan and Alfresco knows that this node can now be safely deleted but this will not be done now… Once the “node_deleted” or “type_qname_id” is set, the “orphan_time” field on the “alf_content_url” table for this document is also changed from NULL to the current unix timestamp (+ gmt offset). In our case it will be orphan_time=1427875500540.
- Index: the search index for this Node is removed.
As you can see, there are still some remaining elements on the File System and in the DB. That’s why there is a last step in our lifecycle…
V. Deletion – One more step.
As you saw before, the document “Test_Lifecycle.docx” is now considered as an “orphaned” Node. On Alfresco, by default, all orphaned Nodes are protected for 14 days. That means that during this period, the orphaned Nodes will NOT be touched at all. Of course this value can be changed easily on Alfresco configuration files. So what happen after 14 days? Well in fact every day at 4am (again, by default… can be changed), a scheduled job (the “contentStoreCleaner”) scan Alfresco for orphaned Nodes older than 14 days. Therefore on April the 15th, 2015 at 04:00, the scheduled job runs and here is what it does:
- FS: the content file is moved from $ALF_DATA/contentstore/ to $ALF_DATA/contentstore.deleted/
- DB: on the DB side, the document is still here but the line related to this document on the “alf_content_url” table (this table contains the orphan_time and reference to the FS location) is removed.
- Index: nothing to do, the search index was already removed.
You can avoid this step where documents are put on the “contentstore.deleted” folder by setting “system.content.eagerOrphanCleanup=true” in the alfresco-global.properties configuration file. If you do so, after 14 days, the document on the File System is not moved but will be deleted instead.
VI. Deletion – Still one more step…!
As said before, there are still some references to the “Test_Lifecycle.docx” document on the Alfresco Database (especially on the “alf_node” table). Another scheduled job, the nodeServiceCleanup runs every day at 21:00 to clean everything that is related to Nodes that has been deleted (orphaned nodes) for more than 30 days. So here is the result:
- FS: the content file is still on the $ALF_DATA/contentstore.deleted/ folder
- DB: the DB is finally clean!
- Index: nothing to do, the search index was already removed.
VII. Deletion – Oh you must be kidding me!?
So many steps, isn’t it! As saw before, the only remaining thing to do is to remove the content file from the $ALF_DATA/contentstore.deleted/ folder. You probably think that there is also a job that do that for you after XX days but it’s not the case, there is nothing in Alfresco that deletes the content file from this location. In consequences, if you want to clean the File System, you will have to do it by yourself.
On Unix for example, you can simply create a crontab entry:
50 23 * * * $ALF_HOME/scripts/cleanContentStoreDeleted.sh
Then create this script. Below is an example of content that will do the trick but you can put what you want/need in this script to remove the content as per your requirements or create a backup or … :
#!/bin/bash CS_DELETED=$ALF_DATA/contentstore.deleted/ # Remove all files from contentstore.deleted older than 30 days find $CS_DELETED -type f -mtime +30 | xargs rm 2> /dev/null # Remove all empty folders from contentstore.deleted older than 60 days find $CS_DELETED -type d -mtime +60 -empty | xargs rm -r 2> /dev/null
Please be aware that you should be sure that the folder “$ALF_DATA/contentstore.deleted” exists… And when I say that, I mean YOU MUST ABSOLUTELY BE SURE that it exists. Please also never remove anything under the “contentstore” folder and never remove the “contentstore.deleted” folder itself!
You may wonder why the DB isn’t cleaned automatically when the trashcan is cleaned and why the file content is also kept 14 days… Well I can assure you that there are several reasons and the principal one is for backup/restore performance concerns. I will not explain it in details but basically as the file content isn’t touched for 14 days by default, that means that you can restore your database up to 14 days in the past and your database will still be consistent with the File System without to restore the FS! Of course if you just do that you will lose the documents uploaded/changed in the last 14 days because your database wasn’t aware of these files 14 days ago. But you can just backup/restore the content files created in the last 14 days with an incremental backup.
E.g.: Today (05-May-2015), I want to backup the FS (only the last 14 days), then I will have to backup all folders inside $ALF_DATA/contentstore/2015/5 AND I will also have to backup all folders inside $ALF_DATA/contentstore/2015/4 with a folder name bigger or equal than 30 (days in Apr) + 5 (days in May) – 14= 21.
I hope this post was clear enough because it’s true that it can be hard to understand everything regarding the lifecycle of Alfresco Node and to deal with it properly.