At the beginning of the week, while working on our ServiceDesk (SLA support for our customers), I saw a few dozen emails generated by our monitoring over the weekend about a Production Alfresco 7.x Cluster doing the yo-yo in terms of RAM and disk space. Nothing was down, just some strange behavior where 20GB of free space would be gone and then re-appear after a few minutes, and the same thing for the RAM/SWAP.

The first thing I checked was the disk space mentioned in the alerts. We received alerts from all members of the cluster one by one, in an almost perfect round-robin manner. On the second node, I saw the issue occurring in real time, so I looked into what exactly was generating all the noise:

alfresco@alf-p2:~# date; df -h /tmp
Mon Feb 12 07:27:41 UTC 2024
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb2        19G    7G   12G  35% /tmp
alfresco@alf-p2:~#
alfresco@alf-p2:~# date; df -h /tmp
Mon Feb 12 07:28:20 UTC 2024
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb2        19G    9G    9G  49% /tmp
alfresco@alf-p2:~#
alfresco@alf-p2:~# du -sm /tmp/
9427    /tmp/
alfresco@alf-p2:~#
alfresco@alf-p2:~# du -sm /tmp/
9484    /tmp/
alfresco@alf-p2:~#
alfresco@alf-p2:~# du -sm /tmp/
9541    /tmp/
alfresco@alf-p2:~#

In less than a minute, around 2-3GB of temporary files were generated, which doesn’t seem very healthy:

alfresco@alf-p2:~# cd /tmp
alfresco@alf-p2:/tmp#
alfresco@alf-p2:/tmp# ls -ltr
total 480
...
-rw-r-----   1 alfresco  alfresco    115 Feb 11 21:26 scheduler.json
drwxr-x---   2 alfresco  alfresco   4096 Feb 12 07:28 Alfresco/
drwxrwxrwt 117 root      root      12288 Feb 12 07:28 ./
alfresco@alf-p2:/tmp#
alfresco@alf-p2:/tmp# cd Alfresco/
alfresco@alf-p2:/tmp/Alfresco# ls -l
total 10553428
drwxr-x---   2 alfresco alfresco        4096 Feb 12 07:29 ./
drwxrwxrwt 117 root     root           12288 Feb 12 07:29 ../
-rw-r-----   1 alfresco alfresco     1897650 Feb 12 07:23 source_11877384286747332767_tmp.pdf
-rw-r-----   1 alfresco alfresco 10804789248 Feb 12 07:29 target_18121744399232974935_tmp.txt
alfresco@alf-p2:/tmp/Alfresco#
alfresco@alf-p2:/tmp/Alfresco#
alfresco@alf-p2:/tmp/Alfresco# ls -l
total 10686460
drwxr-x---   2 alfresco alfresco        4096 Feb 12 07:29 ./
drwxrwxrwt 117 root     root           12288 Feb 12 07:29 ../
-rw-r-----   1 alfresco alfresco     1897650 Feb 12 07:23 source_11877384286747332767_tmp.pdf
-rw-r-----   1 alfresco alfresco 10941014016 Feb 12 07:29 target_18121744399232974935_tmp.txt
alfresco@alf-p2:/tmp/Alfresco#

At that point in time, it looked like Alfresco was doing something that was causing the disk space issue, at least. Here, we can see a PDF file that is a “source” and a TXT file that appears to be under generation, as a “target”. So of course, my first thought was that the Alfresco Transformation Service (ATS) was probably causing this issue, trying to transform a PDF into TXT, most probably to index the content of this file.

Looking at the RAM/SWAP usage on this server showed the same thing: the Java process of the ATS was using 100% CPU (fortunately, the host has multiple CPUs) and going overboard with its RAM, forcing the host to swap.
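A quick way to pin the resource usage on the ATS is a snapshot of the process table and overall memory (a sketch; the match on “transform-core-aio” is an assumption based on the jar name of the All-In-One T-Engine in this setup):

```shell
# Snapshot CPU/memory of the ATS java process; the "[t]" trick keeps grep
# itself out of the matches (jar name assumed: transform-core-aio)
ps -eo pid,pcpu,pmem,rss,args | grep '[t]ransform-core-aio' || echo "no ATS process found on this host"
# Overall RAM/SWAP pressure on the host
free -h
```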

Therefore, I looked at the ATS logs and saw two types of errors. The first was a few IOExceptions from PDFBox (“Error: End-Of-File: expected line”), but there weren’t many of those… The second, much more frequent, was a consequence of the file system being full:

alfresco@alf-p2:~# cat $ATS_HOME/logs/transform-core-aio.log
...
2024-02-12 07:18:37.380 ERROR 23713 --- [o-8090-exec-141] o.a.transformer.TransformController      : Error writing: Seite 1

org.alfresco.transform.exceptions.TransformException: Error writing: Seite 1
        at org.alfresco.transformer.executors.Transformer.transform(Transformer.java:83) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
        at org.alfresco.transformer.AIOController.transformImpl(AIOController.java:118) ~[classes!/:2.5.3]
        at org.alfresco.transformer.AbstractTransformerController.transform(AbstractTransformerController.java:173) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
        at jdk.internal.reflect.GeneratedMethodAccessor75.invoke(Unknown Source) ~[na:na]
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
        at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
        ...
Caused by: java.lang.IllegalStateException: Error writing: Seite 1
        at org.alfresco.transformer.executors.Tika.transform(Tika.java:697) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
        at org.alfresco.transformer.executors.Tika.transform(Tika.java:673) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
        at org.alfresco.transformer.executors.Tika.transform(Tika.java:617) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
        at org.alfresco.transformer.executors.TikaJavaExecutor.call(TikaJavaExecutor.java:141) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
        at org.alfresco.transformer.executors.TikaJavaExecutor.transform(TikaJavaExecutor.java:131) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
        at org.alfresco.transformer.executors.Transformer.transform(Transformer.java:70) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
        ... 55 common frames omitted
Caused by: org.xml.sax.SAXException: Error writing: Seite 1
        at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:110) ~[tika-core-1.26.jar!/:1.26]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.26.jar!/:1.26]
        at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136) ~[tika-core-1.26.jar!/:1.26]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.26.jar!/:1.26]
        ...
        at org.alfresco.transformer.executors.Tika.transform(Tika.java:693) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
        ... 60 common frames omitted
        Suppressed: java.io.IOException: No space left on device
                at java.base/java.io.FileOutputStream.writeBytes(Native Method) ~[na:na]
                at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354) ~[na:na]
                at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:233) ~[na:na]
                at java.base/sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:337) ~[na:na]
                at java.base/sun.nio.cs.StreamEncoder.close(StreamEncoder.java:161) ~[na:na]
                at java.base/java.io.OutputStreamWriter.close(OutputStreamWriter.java:255) ~[na:na]
                at java.base/java.io.BufferedWriter.close(BufferedWriter.java:269) ~[na:na]
                at org.alfresco.transformer.executors.Tika.transform(Tika.java:684) ~[alfresco-transform-tika-2.5.3.jar!/:2.5.3]
                ... 60 common frames omitted
Caused by: java.io.IOException: No space left on device
        at java.base/java.io.FileOutputStream.writeBytes(Native Method) ~[na:na]
        at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354) ~[na:na]
        at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:233) ~[na:na]
        at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:303) ~[na:na]
...
alfresco@alf-p2:~#

As you can see above, at 07:18 the file system /tmp was 100% full, and when I checked 5 minutes later, at 07:23, a new transformation was already producing a 10GB text file that was still growing. So it was clear that this was happening repeatedly, most probably for the same document. According to the monitoring, the issue started just before the weekend. Looking at the first occurrences of the full file system in the ATS logs gave the following:

alfresco@alf-p2:~# grep '2024.*Error writing' $ATS_HOME/logs/transform-core-aio.log
2024-02-09 19:20:51.628 ERROR 23713 --- [o-8090-exec-166] o.a.transformer.TransformController      : Error writing:
2024-02-09 19:41:29.954 ERROR 23713 --- [o-8090-exec-156] o.a.transformer.TransformController      : Error writing: Seite 1
2024-02-09 20:02:11.764 ERROR 23713 --- [o-8090-exec-160] o.a.transformer.TransformController      : Error writing: Seite 1
2024-02-09 20:23:08.828 ERROR 23713 --- [o-8090-exec-163] o.a.transformer.TransformController      : Error writing:
2024-02-09 20:44:05.313 ERROR 23713 --- [o-8090-exec-141] o.a.transformer.TransformController      : Error writing: Seite 1
2024-02-09 21:04:52.642 ERROR 23713 --- [o-8090-exec-162] o.a.transformer.TransformController      : Error writing: Seite 1
...
2024-02-12 07:18:37.380 ERROR 23713 --- [o-8090-exec-152] o.a.transformer.TransformController      : Error writing: Seite 1
alfresco@alf-p2:~#

The above pretty much confirms that it’s always the same document failing, since it blocks on “Seite 1”, which is German for “Page 1”.

To find which document was causing the issue in Alfresco, there aren’t a lot of details available, since the ATS doesn’t really tell you much about what it is doing. All I had was a temporary file name (which obviously doesn’t trace back to anything in the Repository) and a size. Therefore, I searched the Alfresco Data (“alf_data”) for documents with a size equal to that of “/tmp/Alfresco/source_11877384286747332767_tmp.pdf” (i.e. 1897650 bytes), created in the last few days. I expected it to have been created on 9-Feb, a little before 19:20, and I indeed found one:

alfresco@alf-p2:~# find /alf_data/contentstore/2024/2/ -type f -ls | grep 1897650
 34508512  1856 -rw-r----- 1 alfresco alfresco 1897650 Feb 9 19:02 /alf_data/contentstore/2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin
alfresco@alf-p2:~#
alfresco@alf-p2:~# md5sum /tmp/Alfresco/source_11877384286747332767_tmp.pdf /alf_data/contentstore/2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin
45ed40bd5f84b7c68e246885f2b6a55f  /tmp/Alfresco/source_11877384286747332767_tmp.pdf
45ed40bd5f84b7c68e246885f2b6a55f  /alf_data/contentstore/2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin
alfresco@alf-p2:~#
alfresco@alf-p2:~# diff /tmp/Alfresco/source_11877384286747332767_tmp.pdf /alf_data/contentstore/2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin
alfresco@alf-p2:~#

Therefore, this is the same content file. There is of course the possibility that a duplicate node was using the same content before February (since I searched only inside /2024/2, meaning February 2024), but as the issue appeared only over the weekend, it’s pretty safe to assume it’s this document/node.

alfresco@alf-p2:~# stat /alf_data/contentstore/2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin
  File: /alf_data/contentstore/2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin
  Size: 1897650         Blocks: 3712       IO Block: 262144 regular file
Device: 34h/52d Inode: 34508512    Links: 1
Access: (0640/-rw-r-----)  Uid: (  113/alfresco)   Gid: (  116/alfresco)
Access: 2024-02-09 19:02:12.153002964 +0000
Modify: 2024-02-09 19:02:12.157983495 +0000
Change: 2024-02-09 19:02:12.157983635 +0000
 Birth: -
alfresco@alf-p2:~#

From that point, I had the “content_url” of a node. I could therefore have used the database (see useful database queries) to find the NodeRef of this Alfresco node, but at this customer I don’t have easy access to the DB, so I went through Share instead.
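For reference, on installations where the DB is reachable, the lookup from a “content_url” to a node UUID is a standard join across the content tables (a sketch against the stock Alfresco schema; table and column names may vary slightly between versions, and note that content URLs are stored with a “store://” prefix):

```sql
-- Sketch: resolve a content_url to its node UUID (stock Alfresco schema;
-- verify table/column names against your Alfresco version)
SELECT n.uuid
  FROM alf_node n
  JOIN alf_node_properties p ON p.node_id = n.id
  JOIN alf_content_data d    ON p.long_value = d.id
  JOIN alf_content_url u     ON d.content_url_id = u.id
 WHERE u.content_url = 'store://2024/2/9/19/02/174f569e-93a3-4829-8ad5-bd3d6e78447b.bin';
```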

I knew the node was created (or modified) at 19:02:12 (+/- 1s) on 9-Feb, and even if the content isn’t indexed, its metadata should still be searchable. Therefore, I simply performed a search in Alfresco Share for documents created (or modified) at that exact time, i.e. cm:created:’2024-02-09T19:02:12′.

That gave me 4 results, out of which only one had a size of around 2MB. To validate that this was indeed the document causing the issue, I simply used the JavaScript Console to dump this file, and it gave me the exact same “content_url”. I could also confirm in Share that this specific file hadn’t been content-indexed yet (despite being in the repository for 2.5 days).

As a temporary workaround, to stop the OS from going crazy, I set this document to be metadata-indexed only (no content), using the “Index Control” aspect. If you don’t know how this works, it’s pretty simple for a node:

  • Click on “Manage Aspect”
  • From the list of “Available to Add”, find “Index Control (cm:indexControl)”
  • Click on “+” to add it to the list of “Currently Selected”
  • Click on “Apply changes”
  • Click on “Edit Properties”
  • Uncheck the “Is Content Indexed” option

After doing that, you should see something like this in the node’s properties:

Alfresco Index Control

If a transformation of this document is already in progress, you will need to wait for the file system to fill up for the ATS (java) to remove its temporary file and realize that this document doesn’t need to be transformed anymore. You can probably also restart the process, if you prefer.
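The same workaround can also be scripted against the repository’s v1 REST API, which is handy if several documents are affected (a sketch with placeholder host, credentials and node id; be careful that “aspectNames” in a PUT replaces the node’s full aspect list, so any aspects the node already has must be merged in):

```shell
# Build the update payload: add cm:indexControl and disable content indexing.
# Caution: "aspectNames" replaces the node's full aspect list, so merge in the
# aspects the node already carries before sending this for real.
cat <<'EOF' > /tmp/indexcontrol.json
{
  "aspectNames": ["cm:indexControl"],
  "properties": { "cm:isContentIndexed": false }
}
EOF
# Then (placeholder host/credentials/node id -- adjust to your repository):
# curl -u admin:admin -X PUT \
#   "http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/<nodeId>" \
#   -H 'Content-Type: application/json' -d @/tmp/indexcontrol.json
```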

That’s only a workaround of course, not a real solution. Therefore, even if I already suspected the issue was around “Seite 1”, I replicated it in TEST by uploading the same file into the TEST environment and then looked inside the generated TXT content to validate that assumption:

alfresco@alf-t1:/tmp/Alfresco# ls -l
total 23960
drwxr-x---  2 alfresco alfresco      4096 Feb 12 09:10 ./
drwxrwxrwt 25 root     root         36864 Feb 12 09:10 ../
-rw-r-----  1 alfresco alfresco   1897650 Feb 12 09:10 source_2995534351432950419_tmp.pdf
-rw-r-----  1 alfresco alfresco  22593536 Feb 12 09:10 target_7429882841367188802_tmp.txt
alfresco@alf-t1:/tmp/Alfresco#
alfresco@alf-t1:/tmp/Alfresco# wc -l target_7429882841367188802_tmp.txt
2509490 target_7429882841367188802_tmp.txt
alfresco@alf-t1:/tmp/Alfresco#
alfresco@alf-t1:/tmp/Alfresco# grep -v "^[[:space:]]*Seite 1$" target_7429882841367188802_tmp.txt | wc -l
1913
alfresco@alf-t1:/tmp/Alfresco#
alfresco@alf-t1:/tmp/Alfresco# sleep 30
alfresco@alf-t1:/tmp/Alfresco#
alfresco@alf-t1:/tmp/Alfresco# wc -l target_7429882841367188802_tmp.txt
83418233 target_7429882841367188802_tmp.txt
alfresco@alf-t1:/tmp/Alfresco#
alfresco@alf-t1:/tmp/Alfresco# grep -v "^[[:space:]]*Seite 1$" target_7429882841367188802_tmp.txt | wc -l
1913
alfresco@alf-t1:/tmp/Alfresco#

As shown above, there are 1913 lines of actual text, and then the rest of the millions of lines are exactly “ Seite 1”. This text actually comes from page 34 of the PDF (it seems to be a merge of multiple PDFs). By removing page 34 from the document, it can be indexed properly. In the end, the “quick” solution for this customer is to fix the PDF (e.g. convert page 34 into an image, then back into a PDF, and OCRize it so it remains indexed and searchable).
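The counting trick used above is easy to sanity-check on synthetic data: the anchored pattern only discards lines consisting of optional whitespace followed by exactly “Seite 1”:

```shell
# Build a small sample file mixing real text and "Seite 1" filler lines
printf 'real content line\n  Seite 1\nSeite 1\nSeite 1 with trailing text\n' > /tmp/seite_demo.txt
# Count the lines that are NOT just "Seite 1"; the anchors ^ and $ matter,
# so "Seite 1 with trailing text" survives the filter
grep -vc "^[[:space:]]*Seite 1$" /tmp/seite_demo.txt
# prints 2
```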