dbi services Blog
Oracle is hanging? Don't forget hanganalyze and systemstate!
sqlplus / as sysdba
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump ashdumpseconds 30
oradebug dump systemstate 266
oradebug tracefile_name
Your Oracle database - production DB, of course - is hanging. All users are blocked. You quickly check the obvious suspects (archivelog destination full, system swapping, etc.) but it's something else. Even you, the Oracle DBA, cannot do anything: any select is hanging. And you may not even be able to connect with a simple 'sqlplus / as sysdba'.
What do you do? There are several ways to investigate further (strace or truss, for example) but they take time. And your boss is clear: the only thing that matters is getting production running again as soon as possible. No time to investigate. SHUTDOWN ABORT and restart.
OK, but now that everything is back to normal, your boss's rules have changed: the system was down for 15 minutes, and we have to provide an explanation. Root Cause Analysis.
But how will you investigate now? You have restarted everything, so all V$ information is gone. You have Diagnostic Pack? But the system was hung: no ASH information went to disk. You can open an SR, but what information will you give?
The next time it happens, you need a way to capture information that can be analyzed post mortem. But you must be able to do that very quickly, just before your boss shouts 'shutdown abort now'. This is why I've put the commands at the beginning of the post, so that you can find them quickly if you need them...
It takes only a few seconds to generate all the information needed for a post-mortem analysis. If you can take one more minute, you will even be able to read the first lines of the hanganalyze output, identify a true hanging situation, and maybe just kill the root of the blocking sessions instead of doing a merciless restart.
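If you want those commands ready before the next incident, one option is to keep them in a small shell wrapper. This is only a sketch under a few assumptions: the function names and the HANG_LOG_DIR variable are hypothetical, sqlplus must be in the PATH with ORACLE_SID set, and the systemstate dump is requested with the 'oradebug dump systemstate 266' form.

```shell
#!/bin/sh
# Print the oradebug command sequence from this post, one command per line.
hang_dump_commands() {
  cat <<'EOF'
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump ashdumpseconds 30
oradebug dump systemstate 266
oradebug tracefile_name
EOF
}

# Feed the commands to a SYSDBA session and keep a timestamped log.
# HANG_LOG_DIR is a hypothetical variable; the log goes to /tmp if it is unset.
collect_hang_dumps() {
  log="${HANG_LOG_DIR:-/tmp}/hang_$(date +%Y%m%d_%H%M%S).log"
  hang_dump_commands | sqlplus -S / as sysdba > "$log" 2>&1
  echo "$log"
}
```

Calling collect_hang_dumps prints the path of the log file, which should end with the trace file name reported by 'oradebug tracefile_name'.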
In order to show you the kind of output you get, I've run a few jobs locking the same resources (TM locks) - which is not a true hanging situation because the blocking session can resolve the situation.
Here are the first lines from the oradebug hanganalyze output:
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'PL/SQL lock timer'

Systemstate has all information about System Objects (sessions, processes, ...) but you have to navigate into it in order to understand the wait chain. In my example:
SO: 0x914ada70, type: 4, owner: 0x91990478, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x91990478, name=session, file=ksu.h LINE:13580, pg=0 conuid=0
(session) sid: 23 ser: 7 trans: 0x8ea8e3e8, creator: 0x91990478
...
service name: SYS$USERS
client details:
  O/S info: user: oracle, term: UNKNOWN, ospid: 7929
  machine: vboxora12c program: oracle@vboxora12c (J002)
Current Wait Stack:
 0: waiting for 'enq: TM - contention'
    name|mode=0x544d0003, object #=0x1737c, table/partition=0x0
    wait_id=10 seq_num=11 snap_id=1
    wait times: snap=15.991474 sec, exc=15.991474 sec, total=15.991474 sec
    wait times: max=40.000000 sec, heur=15.991474 sec
    wait counts: calls=6 os=6
    in_wait=1 iflags=0x15a0
There is at least one session blocking this session.
  Dumping 1 direct blocker(s):
    inst: 1, sid: 254, ser: 5
  Dumping final blocker:
    inst: 1, sid: 256, ser: 5
This is a session that is waiting, and we have the final blocker: inst: 1, sid: 256, ser: 5
Then we get to the final blocker by searching for sid: 256:
SO: 0x9168a408, type: 4, owner: 0x9198d058, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9198d058, name=session, file=ksu.h LINE:13580, pg=0 conuid=0
(session) sid: 256 ser: 5 trans: 0x8ea6b618, creator: 0x9198d058
...
service name: SYS$USERS
client details:
  O/S info: user: oracle, term: UNKNOWN, ospid: 7925
  machine: vboxora12c program: oracle@vboxora12c (J000)
Current Wait Stack:
 0: waiting for 'PL/SQL lock timer'
    duration=0x0, =0x0, =0x0
    wait_id=0 seq_num=1 snap_id=1
    wait times: snap=25.936165 sec, exc=25.936165 sec, total=25.936165 sec
    wait times: max=50.000000 sec, heur=25.936165 sec
    wait counts: calls=1 os=9
    in_wait=1 iflags=0x5a0
There are 5 sessions blocked by this session.
Dumping one waiter:
  inst: 1, sid: 254, ser: 5
  wait event: 'enq: TM - contention'
    p1: 'name|mode'=0x544d0004
    p2: 'object #'=0x1737c
    p3: 'table/partition'=0x0
  row_wait_obj#: 95100, block#: 0, row#: 0, file# 0
  min_blocked_time: 19 secs, waiter_cache_ver: 44
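The hexadecimal values in these wait stacks can be decoded by hand. For an enqueue wait, the two high-order bytes of name|mode are the ASCII lock type and the low 16 bits are the requested mode, while object # is simply the object id in hexadecimal. Here is a small sketch in shell; decode_enq_p1 and hex_to_dec are hypothetical helper names, not Oracle tools.

```shell
#!/bin/sh
# Decode the P1 value of an enqueue wait ("name|mode") into lock type and mode.
# Bits 24-31 and 16-23 hold the two ASCII characters of the lock type,
# bits 0-15 hold the requested lock mode.
decode_enq_p1() {
  p1=$(( $1 ))
  c1=$(( (p1 >> 24) & 255 ))
  c2=$(( (p1 >> 16) & 255 ))
  mode=$(( p1 & 65535 ))
  # An octal escape in the printf format turns each byte back into a character
  printf "\\$(printf '%03o' "$c1")\\$(printf '%03o' "$c2") mode=%d\n" "$mode"
}

# Convert a hexadecimal value such as the "object #" parameter to decimal.
hex_to_dec() {
  echo $(( $1 ))
}
```

With the values from the dumps above, 'decode_enq_p1 0x544d0003' prints 'TM mode=3' (row exclusive), 'decode_enq_p1 0x544d0004' prints 'TM mode=4' (share), and 'hex_to_dec 0x1737c' prints 95100, which matches the row_wait_obj# reported for the waiter. That object id can then be looked up in DBA_OBJECTS.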
Analysing the System State takes much longer than the hanganalyze, but it has more information.
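Since navigating the systemstate by hand is slow, a first pass can be scripted. The sketch below counts how often each session is reported as the final blocker in a trace file. It assumes that the blocker identification (inst, sid, ser) is either on the 'Dumping final blocker:' line itself or on the line right after it, as in my example; final_blockers is a hypothetical helper name.

```shell
#!/bin/sh
# Count how many times each session appears as final blocker in a
# systemstate trace file, most frequent first.
final_blockers() {
  awk '
    /Dumping final blocker/ {
      line = $0
      # If the session is not on this line, take the next line instead
      if (line !~ /sid:/ && (getline line) <= 0) next
      sub(/^.*inst:/, "inst:", line)   # drop everything before "inst:"
      sub(/[ \t]+$/, "", line)         # drop trailing whitespace
      print line
    }
  ' "$1" | sort | uniq -c | sort -rn
}
```

Run it as 'final_blockers /path/to/tracefile.trc', using the file name returned by 'oradebug tracefile_name'. The session at the top of the list is the first candidate to look for in the dump.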
When the blocking situation is not so desperate and you just want to see what is blocking, the hanganalyze information is also available online in V$WAIT_CHAINS. The advantage over ASH is that you see all processes (not only foreground ones, and not only active ones).
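As a sketch of how to read that view, the function below prints a query on V$WAIT_CHAINS that you can pipe into sqlplus. The column selection and ordering are my own choice, and wait_chains_sql is a hypothetical name.

```shell
#!/bin/sh
# Print a SQL*Plus script that lists each wait chain with waiter and blocker.
# The heredoc is quoted so that $ and # are passed through unchanged.
wait_chains_sql() {
  cat <<'EOF'
set linesize 200 pagesize 100
select chain_id, chain_is_cycle, chain_signature,
       sid, sess_serial#, blocker_sid, blocker_sess_serial#,
       wait_event_text, in_wait_secs, num_waiters
  from v$wait_chains
 order by chain_id, num_waiters desc;
EOF
}
```

To run it: wait_chains_sql | sqlplus -S / as sysdba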
Here is an example:
| 1 | FALSE | 'PL/SQL lock timer'

If you cannot even connect with a simple 'sqlplus / as sysdba', try a preliminary connection:

sqlplus -prelim / as sysdba
With that you will be able to get a systemstate. You will be able to get an ashdump. But unfortunately, since 11.2.0.2 you cannot get a hanganalyze:
ERROR: Can not perform hang analysis dump without a process state object and a session state object.
But there is a workaround for that (from Tanel Poder's blog): try to use a session that is already connected.
SQL> oradebug setorapname diag
Oracle pid: 8, Unix process pid: 7805, image: oracle@vboxora12c (DIAG)
Core message
Even in a hurry,