In Apache Tomcat, there is a Stuck Thread Detector Valve which can be enabled pretty easily through the "server.xml" file. I already talked about it in a previous blog and provided an AWK parser to extract important details from the logs. However, Tomcat does not have, as far as I know, any notion of Hogging Threads. That's what I will talk about in this blog.

Hogging Threads

If we take a WebLogic Server, as an example, there is a notion of both Hogging and Stuck Threads. There is a pretty short (and not really helpful) explanation of what a Hogging Thread is in the Doc ID 780411.1 and another one in the Doc ID 1517608.1. In summary, WebLogic has Work Managers that keep track of the requests currently being executed and of their execution time. Every two seconds, WebLogic calculates the average time it took to complete requests over the last couple of minutes. It then considers as a Hogging Thread any request that is currently taking more than 7 times that average. Therefore, if the calculated average is 10 seconds, Hogging Threads are requests that take more than 7*10 = 70 seconds. If such a request completes before 600 seconds, then that's it. However, if it stays in execution for more than 600s, it becomes a Stuck Thread. In short, you can consider Hogging Threads as unusually slow requests.
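
To make that rule concrete, here is a minimal bash sketch of the classification, using purely hypothetical values (the 600s limit corresponds to the default WebLogic Stuck Thread Max Time):

# Minimal sketch of the WebLogic classification, with hypothetical values
avg_completion=10                              # average completion time (s) over the last 2 minutes
hogging_threshold=$(( avg_completion * 7 ))    # 7 times the average, i.e. 70s here
stuck_threshold=600                            # default Stuck Thread Max Time
request_age=95                                 # how long the current request has been running (s)
if [[ ${request_age} -ge ${stuck_threshold} ]]; then
  echo "Stuck Thread"
elif [[ ${request_age} -ge ${hogging_threshold} ]]; then
  echo "Hogging Thread"
else
  echo "Normal request"
fi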

Usual Tomcat monitoring

If you look at most monitoring solutions for Tomcat, they usually use JMX, either directly or indirectly through an application deployed on Tomcat (or a JavaAgent). This allows extracting existing metrics and using these Out-Of-The-Box values in Prometheus, Grafana, Zabbix, etc… However, the ThreadPool (or Executor) MBean doesn't really provide much useful information. Tomcat only includes details about how many threads are currently busy (doing something) and what the maximum number of threads is (as defined in the "server.xml"). But there is no runtime information or state on the requests… In another blog (damn, that was 8 years ago already!), I talked about a WLST script I wrote to be able to monitor a lot of things from WebLogic Servers, and that included much more than what Tomcat gives you. The WebLogic JMX MBean is "ThreadPoolRuntimeMBean" and you can find details about all the metrics that WebLogic provides here (example for WebLogic 12c).
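
As an illustration, these busy/maximum counters can be fetched without any extra tooling through the JMXProxyServlet of the Tomcat manager application, assuming the manager is deployed and a user with the manager-jmx role exists (the user below is hypothetical and the ThreadPool name must match the connector of your "server.xml"):

# List the ThreadPool MBean(s); currentThreadsBusy and maxThreads are part of the returned attributes
curl -s -u monitoring:'********' \
  "https://localhost:8080/manager/jmxproxy/?qry=Catalina:type=ThreadPool,*"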

Simple logic to implement

If you would like to have Hogging Thread monitoring, you will need to build it yourself. I assume it should be possible to develop/recreate something very close to the WebLogic Work Managers for Tomcat, but you would need Java and Tomcat knowledge to extend the HTTP Servlet or use a custom valve. However, it would modify the behavior of Tomcat and it would most probably have some performance impact as well… To stay simple and still provide something close to it, I thought about the following: since you will need to enable the Stuck Thread Detector Valve anyway to obtain details related to Stuck Threads, you could set the threshold of this valve to 70/100/120s for example. Then you could consider all detected threads as Hogging from the moment they appear in the Tomcat logs and until they complete OR until they become Stuck Threads (600s if we keep it aligned with WebLogic).

WebLogic considers user requests to calculate the average and ignores anything related to internal processing. On Tomcat, you could simulate something close to that by looking at the access log and taking the values from '%T' (time taken to process the request, in seconds) or '%F' (time taken to commit the response, in milliseconds), c.f. the documentation. For example, take all lines from the last 2 minutes, calculate the average time taken across these requests, and multiply that by 7. That would give you a behavior close enough to WebLogic's. But how do you find threads running for longer than that (with this average changing every time it gets re-calculated!) without extending Tomcat? The access log only contains completed requests, not on-going ones… To do that, you will anyway need to extend the HTTP Servlet or use a custom valve, from my point of view.
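
For illustration only, here is a sketch of that calculation, assuming GNU awk, an access log pattern that ends with '%T', the default daily rotation of the access log and timestamps in the server's local timezone; field positions would obviously need to be adapted to your own pattern:

# Hedged sketch: 7 times the average '%T' (last field) of the requests completed in the last 2 minutes
access_log="$TOMCAT_HOME/logs/localhost_access_log.$(date +%Y-%m-%d).txt"
cutoff=$(date -d '2 minutes ago' +%s)
tail -5000 "${access_log}" | awk -v cutoff="${cutoff}" '
  {
    # Rebuild an epoch from the [dd/Mon/yyyy:HH:MM:SS +0000] timestamp (field 4, offset ignored)
    gsub(/\[/,"",$4);
    split($4,d,"[/:]");
    m=(index("JanFebMarAprMayJunJulAugSepOctNovDec",d[2])+2)/3;
    ts=mktime(d[3]" "m" "d[1]" "d[4]" "d[5]" "d[6]);
    if (ts >= cutoff) { sum+=$NF; count++ }
  }
  END {
    if (count > 0) printf("Hogging threshold: %.1f seconds\n", 7*sum/count);
  }'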

Example of detection for Hogging Threads

In this section, I will use the simple approach I mentioned above, which uses a fixed amount of time to decide whether a thread is Hogging or not. Depending on your application, the value to use might be 50, 90, 120 or whatever makes sense to you. Here, I will use 120s as a baseline. Therefore, my Tomcat "server.xml" content is as follows:

[tomcat@d2-0 ~]$ grep -A1 Stuck $TOMCAT_HOME/conf/server.xml
        <Valve className="org.apache.catalina.valves.StuckThreadDetectionValve"
               threshold="120" />
[tomcat@d2-0 ~]$

If I take the AWK parser I shared on this blog, then I would get this kind of outcome:

[tomcat@d2-0 ~]$ cat $TOMCAT_HOME/logs/catalina.out | awk -f parse_stuck.awk
============================================
========== COMPLETED LONG THREADS ==========
============================================
THREAD                                      STIME           DTIME     DURATION           CTIME    NUM    REQUEST
https-jsse-nio-8080-exec-41     [6/15/24 5:44 AM]    05:46:22,268    [131,487]    05:46:32,271      0    [https://dctm-env.domain.com/D2/x3_portal/loginService]
https-jsse-nio-8080-exec-26     [6/15/24 7:48 AM]    07:50:33,630    [128,939]    07:50:43,633      0    [https://dctm-env.domain.com/D2/x3_portal/creationService]
https-jsse-nio-8080-exec-27    [6/15/24 10:17 AM]    10:19:26,921    [131,451]    10:19:36,925      0    [https://dctm-env.domain.com/D2/x3_portal/doclistService]
https-jsse-nio-8080-exec-1     [6/15/24 10:19 AM]    10:21:36,952    [129,760]    10:21:46,954      0    [https://dctm-env.domain.com/D2/x3_portal/searchManagerService]
https-jsse-nio-8080-exec-26    [6/15/24 11:49 AM]    11:51:43,233    [124,429]    11:51:46,368      0    [https://dctm-env.domain.com/D2/servlet/ExportContent?uid=...]

============================================
======= CURRENT LONG RUNNING THREADS =======
============================================
THREAD                                      STIME           DTIME     DURATION    REQUEST
https-jsse-nio-8080-exec-34    [6/15/24 12:18 AM]    12:20:24,530    [122,150]    [https://dctm-env.domain.com/D2/x3_portal/loginService]
https-jsse-nio-8080-exec-66    [6/15/24 12:25 AM]    12:27:10,305    [120,300]    [https://dctm-env.domain.com/D2/x3_portal/searchManagerService]
https-jsse-nio-8080-exec-2     [6/15/24 12:30 AM]    12:32:05,982    [121,495]    [https://dctm-env.domain.com/D2/x3_portal/creationService]
[tomcat@d2-0 ~]$

In that output, there are 5 threads that took around 130s but that are already completed. However, 3 threads haven't completed yet. In terms of Stuck and Hogging Threads, what matters is what is still currently running; therefore, we shouldn't care about the completed ones. What is important for monitoring is, at least, to receive the number of Stuck/Hogging Threads. You might want to include the request URL as well, so you can detect specific issues in case a lot of requests are Stuck/Hogging on the same service. In this blog, I will only care about the number.
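
Before dropping the URLs, here is a quick hedged illustration of that idea, simply counting the currently running detections per request URL from the parser output shown above (assuming thread names keep the "https-" prefix of the connector):

# Count the current long running threads per request URL (based on the output format above)
cat $TOMCAT_HOME/logs/catalina.out | awk -f parse_stuck.awk \
  | sed -n '/CURRENT LONG RUNNING/,$p' | grep '^https-' \
  | awk '{print $NF}' | sort | uniq -c | sort -rn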

For that purpose, let's simplify the AWK parser a bit, creating a new one with this content (as a reminder, this parser depends on the log format):

[tomcat@d2-0 ~]$ cat thread.awk
# Drop the "UTC" timezone marker to keep the date/time fields easy to parse
/./ {
  sub("UTC","");
}

# A long running thread was detected: record its detection time and current duration
/org.apache.catalina.valves.StuckThreadDetectionValve.notifyStuckThreadDetected/ {
  # Thread name is the 7th field, enclosed in square brackets
  thread=substr($7,2,length($7)-2);
  # Detection date and time are the first two fields of the log line
  dtime[thread]=$1"_"$2;
  # Current duration is the bracketed value following "active for "
  tmp=substr($0,index($0,"active for ")+11);
  duration[thread]=substr(tmp,1,index(tmp," ")-1);
  not_ended[thread]=0;
}

# The same thread completed in the meantime: forget about it
/org.apache.catalina.valves.StuckThreadDetectionValve.notifyStuckThreadCompleted/ {
  thread=substr($7,2,length($7)-2);
  delete dtime[thread];
  delete duration[thread];
  delete not_ended[thread];
}

# Print the detection time and current duration of the threads that are still running
END {
  for ( i in not_ended ) {
    if (not_ended[i]==0) {
      printf("%s\t%10s\n",dtime[i],duration[i]);
    }
  }
}
[tomcat@d2-0 ~]$

To validate it, I executed it on the exact same log file, expecting only 3 rows, each containing only the detection time and the current duration of the thread. I got exactly what I expected:

[tomcat@d2-0 ~]$ cat $TOMCAT_HOME/logs/catalina.out | awk -f thread.awk
2024-06-15_12:20:24,530  [122,150]
2024-06-15_12:27:10,305  [120,300]
2024-06-15_12:32:05,982  [121,495]
[tomcat@d2-0 ~]$

My goal from here was to find the exact start time of the request and compare it with the current date/time. If the resulting difference is lower than 600 seconds, then I would consider the thread as Hogging, otherwise as Stuck. This works because whatever appears in the log has already been running for at least 120s, as defined in the threshold of the Stuck Thread Detector Valve. Doing that in bash is pretty easy:

[tomcat@d2-0 ~]$ ls -l thread.*
-rw-------. 1 tomcat tomcat  731 Jun 15 11:28 thread.awk
-rwxr-x---. 1 tomcat tomcat 1200 Jun 15 11:50 thread.sh
[tomcat@d2-0 ~]$
[tomcat@d2-0 ~]$ cat thread.sh
#!/bin/bash

# Variables
script_folder=`which ${0}`
script_folder=`dirname ${script_folder}`
parse_awk="${script_folder}/thread.awk"
input_log="$TOMCAT_HOME/logs/catalina.out"
nb_stuck_threads=0
nb_hogging_threads=0
current_time=`date +%s`

# Parse log file to retrieve long running threads
while read line; do
  thread_detection=`echo ${line} | awk -F, '{print $1}' | sed 's,_, ,'`
  thread_duration=`echo ${line} | awk -F\[ '{print $2}' | awk -F, '{print $1}'`
  thread_detection_epoch=`date -d "${thread_detection} UTC" +%s`
  thread_start="$(( ${thread_detection_epoch} - ${thread_duration} ))"
  thread_age="$(( ${current_time} - ${thread_start} ))"
  if [[ ${thread_age} -ge 600 ]]; then
    nb_stuck_threads=$(( nb_stuck_threads + 1 ))
  else
    nb_hogging_threads=$(( nb_hogging_threads + 1 ))
  fi
done < <(cat ${input_log} | awk -f ${parse_awk})

echo "$(date --utc) -- ${nb_stuck_threads} Stuck // ${nb_hogging_threads} Hogging"
[tomcat@d2-0 ~]$

And its execution:

[tomcat@d2-0 ~]$ ./thread.sh
Sat Jun 15 12:34:40 UTC 2024 -- 1 Stuck // 2 Hogging
[tomcat@d2-0 ~]$
[tomcat@d2-0 ~]$ ./thread.sh
Sat Jun 15 12:35:21 UTC 2024 -- 2 Stuck // 1 Hogging
[tomcat@d2-0 ~]$

Is that the expected outcome? Yes, there are 3 running threads that were detected between 12:20 and 12:32. At detection time, these threads had already been running for 120 to 122 seconds, meaning their execution started at, respectively, 12:18:22, 12:25:10 and 12:30:04. Therefore, with the first execution of the shell script happening at 12:34:40, only the 1st of the 3 threads is over 10 minutes old (16min 18s exactly), while the 2 others are still below that (9min 30s and 4min 36s respectively). With the second execution happening at 12:35:21, the middle thread is now 10min 11s old: it's not a Hogging Thread anymore, but a Stuck Thread.

That's just a quick example and, as mentioned, it could be improved by not only giving the number of threads but also providing a mapping of threads vs URLs to detect trends. The output format would also depend on the monitoring system that fetches this data.
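
For example, if the monitoring relies on the Prometheus node_exporter, a minimal sketch would be to replace the final echo of "thread.sh" with a write to a file consumed by the textfile collector (the path and metric names below are assumptions, not an existing standard):

# Hypothetical target of the node_exporter textfile collector
prom_file="/var/lib/node_exporter/textfile_collector/tomcat_threads.prom"
{
  echo "# HELP tomcat_hogging_threads Requests running longer than the valve threshold but less than 600s"
  echo "# TYPE tomcat_hogging_threads gauge"
  echo "tomcat_hogging_threads ${nb_hogging_threads}"
  echo "# HELP tomcat_stuck_threads Requests running for 600s or more"
  echo "# TYPE tomcat_stuck_threads gauge"
  echo "tomcat_stuck_threads ${nb_stuck_threads}"
} > "${prom_file}.tmp" && mv "${prom_file}.tmp" "${prom_file}"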