As part of a MSSQL consolidation project, with application servers located in one domain and the databases in another, we received an alert indicating that some databases were not available.
After starting the investigation, we discovered that the Availability Group (AAG), to which these databases belong, was temporary taken in RESOLVING_NORMAL:
The state of the local availability replica in availability group 'xxxxxxxx has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired.
Since this state does not allow databases to accept connections to maintain High Availability consistency, this explains why we received such an alert. In other words, it is the symptom and not the root-cause.
When we continue the investigation, the lease has expired because the following major issue has appeared at the instance level:
Stack Signature for the dump is 0x000000000000030A Process 0:0:0 (0x744) Worker 0x0000019C97866160 appears to be non-yielding on Scheduler 15. Thread creation time: 13281025838286. Approx Thread CPU Used: kernel 171 ms, user 35062 ms. Process Utilization 18%. System Idle 47%. Interval: 70427 ms. Timeout waiting for external dump process 18424.
In other words, this thread did no respond in time and SQL Server considered it blocked. This explains the previous behavior, and also the generation of a Mini Dump.
For information, the instance is installed in SQL Server 2019 Enterprise and had the latest Cumulative Update when the issue occurred.
A review of the Mini Dump, only the publicly readable portion, did not allow to find the root-cause.
We also did not find any KBs or opened tickets which could explain the error we were facing. The investigation of other logs (OS, Cluster, XE) did not bring additional clues.
Requiring more information to understand what happened, we opened a ticket at the Microsoft SQL Server Support. We provided the dump file and also relative logs of the incident.
After two days, we got an answer from the Support: the Security caches located in the Memory Clerk grow significantly and implied memory pressure. This was due to a high number of distinct Ad Hoc queries running on the consolidated instance.
But how is it possible?
As a short reminder, Access Check Results (ACRs) are evaluated when database objects are accessed by SQL Server. These results are stored in cache, called Access Check Cache.
Therefore, ACRs can be reused for the same query which allows to not recalculate them each time ; reducing cpu overhead and improving execution time at a cost of some memory.
As described in this Microsoft documentation, in rare circumstances we may want to either reduce the size of the cache if too much memory is used or increase the size of the cache if a high CPU usage is detected (because of ACRs calculation of course).
In our scenario, the goal was to limit the number of entries in TokenAndPermUserStore to avoid future memory pressures, which implied the Mini Dump.
That is why the Microsoft Support recommended to either:
– Apply Trace Flag 4618 – Limits the number of entries in TokenAndPermUserStore cache store to 1024
– Apply Trace Flag 4618 + Trace Flags 4610 – Limits the number of entries in TokenAndPermUserStore cache store to 8192
We decided to apply Trace Flags 4618 and 4610 to set the TokenAndPermUserStore cache store to 8192 entries. Of course these changes had to be applied on all replicas which belong to the same AAG.
To apply changes dynamically without restarting the instance, we used DBCC TRACEON. To persist the change, we added the trace flags in Startup Parameters of the service.
For more information, please refer to the Microsoft Documentation.
Regarding the detection of the Mini Dump, we were quite lucky the first time because we were alerted through the availability of the databases. But this was because the databases were members of an AAG and primary on the node. In other words, we were informed indirectly about the generation of the dump.
Today, to be informed directly, we are using the PowerShell module dbachecks to detect new SQL dumps. This module offers plenty of other relevant checks, that is why I highly recommend to check it out!
So far, we did not notice new dumps and we consider this topic as solved. Hoping this blog will help you!