By Franck Pachot
.
sqlplus / as sysdba
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump ashdumpseconds 30
oradebug dump systemstate 266
oradebug tracefile_name
Your Oracle database – production DB, of course – is hanging. All users are blocked. You quickly check the obvious suspects (archivelog destination full, system swapping, etc.) but it’s something else. Even you, the Oracle DBA, cannot do anything: any select is hanging. And maybe you’re even not able to connect with a simple ‘sqlplus / as sysdba’.
What do you do ? There may be several ways to investigate deeper (strace or truss for example) but it will take time. And your boss is clear: the only important thing is to get the production running again as soon as possible. No time to investigate. SHUTDOWN ABORT and restart.
Ok, but now that everything is back to normal, your boss rules has changed: the system was down for 15 minutes. We have to provide an explanation. Root Cause Analysis.
But how will you investigate now ? You have restarted everything, so all V$ information is gone. You have Diagnostic Pack ? But the system was hanged: no ASH information went to disk. You can open an SR but what information will you give?
Hang Analyze
The next time it happens, you need to have a way to get some information that can be analyzed post mortem. But you need to be able to do that very quickly just before your boss shouts ‘shutdown abort now’. And this is why I’ve put it at the begining of the post, so that you can find it quickly if you need it…
That takes only a few seconds to generate all post-mortem necessary information. If you can take 1 more minute, you will even be able to read the first lines of hanganalyze output, and you will be able to identify a true hanging situation and maybe just kill the root of the blocking sessions instead of a merciless restart.
In order to show you the kind of output you get, I’ve run a few jobs locking the same resources (TM locks) – which is not a true hanging situation because the blocking session can resolve the situation.
Here is the first lines from the oradebug hanganalyze:
Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'PL/SQL lock timer'Systemstate has all information about System Objects (sessions, processes, ...) but you have to navigate into it in order to understand the wait chain. In my example:
SO: 0x914ada70, type: 4, owner: 0x91990478, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x91990478, name=session, file=ksu.h LINE:13580, pg=0 conuid=0
(session) sid: 23 ser: 7 trans: 0x8ea8e3e8, creator: 0x91990478
...
service name: SYS$USERS
client details:
O/S info: user: oracle, term: UNKNOWN, ospid: 7929
machine: vboxora12c program: oracle@vboxora12c (J002)
Current Wait Stack:
0: waiting for 'enq: TM - contention'
name|mode=0x544d0003, object #=0x1737c, table/partition=0x0
wait_id=10 seq_num=11 snap_id=1
wait times: snap=15.991474 sec, exc=15.991474 sec, total=15.991474 sec
wait times: max=40.000000 sec, heur=15.991474 sec
wait counts: calls=6 os=6
in_wait=1 iflags=0x15a0
There is at least one session blocking this session.
Dumping 1 direct blocker(s):
inst: 1, sid: 254, ser: 5
Dumping final blocker: inst: 1, sid: 256, ser: 5
This is a session that is waiting, and we have the final blocker: inst: 1, sid: 256, ser: 5
Then we get to the final blocker by searching the sid: 256:
SO: 0x9168a408, type: 4, owner: 0x9198d058, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x9198d058, name=session, file=ksu.h LINE:13580, pg=0 conuid=0
(session) sid: 256 ser: 5 trans: 0x8ea6b618, creator: 0x9198d058
...
service name: SYS$USERS
client details:
O/S info: user: oracle, term: UNKNOWN, ospid: 7925
machine: vboxora12c program: oracle@vboxora12c (J000)
Current Wait Stack:
0: waiting for 'PL/SQL lock timer'
duration=0x0, =0x0, =0x0
wait_id=0 seq_num=1 snap_id=1
wait times: snap=25.936165 sec, exc=25.936165 sec, total=25.936165 sec
wait times: max=50.000000 sec, heur=25.936165 sec
wait counts: calls=1 os=9
in_wait=1 iflags=0x5a0
There are 5 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 254, ser: 5
wait event: 'enq: TM - contention'
p1: 'name|mode'=0x544d0004
p2: 'object #'=0x1737c
p3: 'table/partition'=0x0
row_wait_obj#: 95100, block#: 0, row#: 0, file# 0
min_blocked_time: 19 secs, waiter_cache_ver: 44
Analysing the System State takes much longer than the hanganalyze, but it has more information.
V$WAIT_CHAINS
When the blocking situation is not so desesperate, but you just want to see what is blocking, the hanganalyze information is also available online in V$WAIT_CHAINS. The advantage over ASH is that you see all processes (not only foreground, not only active ones).
Here is an example:
CHAIN_ID | CHAIN | CHAIN_SIGNATURE | INSTANCE | OSID | PID | SID | BLOCK |
---|---|---|---|---|---|---|---|
1 | FALSE | ‘PL/SQL lock timer’ <=’enq: TM – contention’ <=’enq: TM – contention’ |
1 | 7929 | 42 | 23 | TRUE |
1 | FALSE | ‘PL/SQL lock timer’ <=’enq: TM – contention’ <=’enq: TM – contention’ |
1 | 7927 | 41 | 254 | TRUE |
1 | FALSE | ‘PL/SQL lock timer’ <=’enq: TM – contention’ <=’enq: TM – contention’ |
1 | 7925 | 39 | 256 | FALSE |
2 | FALSE | ‘PL/SQL lock timer’ <=’enq: TM – contention’ <=’enq: TM – contention’ |
1 | 7933 | 46 | 25 | TRUE |
3 | FALSE | ‘PL/SQL lock timer’ <=’enq: TM – contention’ <=’enq: TM – contention’ |
1 | 7931 | 45 | 260 | TRUE |
4 | FALSE | ‘PL/SQL lock timer’ <=’enq: TM – contention’ <=’enq: TM – contention’ |
1 | 7935 | 47 | 262 | TRUE |
ASH Dump
There is something else that you can get if you have Diagnostic Pack. The ASH information can be dumped to trace file even if it cannot be collected in the database.
oradebug dump ashdumpseconds 30
that will gather ASH from latest 30 seconds, and the trace file will even have the sqlldr ctl file to load it in an ASH like table.
sqlplus -prelim
But what can you do if you can’t even connect / as sysdba ?
There is the ‘preliminary connection’ that does not create a session:
sqlplus -prelim / as sysdba
With that you will be able to get a systemstate. You will be able to get a ashdump.
But unfortunately, since 11.2.0.2 you cannot get a hanganalyze:
ERROR: Can not perform hang analysis dump without a process state object and a session state object.
But there is a workaround for that (from Tanel Poders’s blog): try to use a session that is already connected.
For exemple I use the DIAG background process (it’s better not to use vital processes for that)
SQL> oradebug setorapname diag
Oracle pid: 8, Unix process pid: 7805, image: oracle@vboxora12c (DIAG)
Core message
Even in hurry,
- Always check an hanganalyze to understand the problem.
- Always get a systemstate before a shutdown abort.
and you will have information to investigate later, or to provide to Oracle Support.