When using ZDM for a physical online migration, the switchover fails, even with the latest version at the time of writing, 21.4.5.0.0. In this blog, I will explain the cause and show how to resolve the problem so that the migration can move forward.

Problem description

We are using the latest version of ZDM:

[zdmuser@zdmhost migration]$ /u01/app/oracle/product/zdm/bin/zdmcli -build
version: 21.0.0.0.0
full version: 21.4.0.0.0
patch version: 21.4.5.0.0
label date: 221207.30
ZDM kit build date: Mar 21 2024 22:07:12 UTC
CPAT build version: 23.12.0

The ZDM response file has been set up to use the Data Guard broker during the migration: ZDM_USE_DG_BROKER=TRUE
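
Before starting a job, it can be worth confirming what the response file really sets. The following is a minimal sketch, assuming the response file uses plain KEY=VALUE lines. The helper name rsp_get is hypothetical and is not part of ZDM.

```shell
#!/usr/bin/env bash
# rsp_get FILE KEY: print the value of KEY from a KEY=VALUE response file.
# Hypothetical helper, not part of ZDM. Comment lines (#) and lines without
# "=" are ignored; the last assignment wins if the key appears twice.
rsp_get() {
  local file=$1 key=$2
  awk -F= -v k="$key" '
    /^[ \t]*#/ { next }
    index($0, "=") == 0 { next }
    {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      if (name == k) {
        val = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = 1
      }
    }
    END { if (found) print val; else exit 1 }
  ' "$file"
}
```

With the response file used in this blog, it could be called as: rsp_get /home/zdmuser/migration/zdm_ONPR_physical_online.rsp ZDM_USE_DG_BROKER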

This problem might not happen if the broker is not used to run the switchover operation. We decided to use the broker so that we can easily check and monitor the synchronisation between our on-premise database and the ExaCC one.

The on-premise database is called ONPR here, and the final PDB that ZDM needs to create and migrate to will be ONPRZ_APP_001T. We configured ZDM to include the non-CDB to CDB conversion.

We ran the ZDM migration and paused it just after Data Guard was configured.

[zdmuser@zdmhost migration]$ /u01/app/oracle/product/zdm/bin/zdmcli query job -jobid 75
zdmhost.balgroupit.com: Audit ID: 1164
Job ID: 75
User: zdmuser
Client: zdmhost
Job Type: "MIGRATE"
Scheduled job command: "zdmcli migrate database -sourcesid ONPR -rsp /home/zdmuser/migration/zdm_ONPR_physical_online.rsp -sourcenode vmonpr -srcauth zdmauth -srcarg1 user:oracle -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode ExaCC-cl01n1 -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -tdekeystorepasswd -tgttdekeystorepasswd -pauseafter ZDM_CONFIGURE_DG_SRC"
Scheduled job execution start time: 2024-03-22T15:29:56+01. Equivalent local time: 2024-03-22 15:29:56
Current status: PAUSED
Current Phase: "ZDM_CONFIGURE_DG_SRC"
Result file path: "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.log"
Metrics file path: "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.json"
Job execution start time: 2024-03-22 15:30:13
Job execution end time: 2024-03-22 16:32:20
Job execution elapsed time: 19 minutes 44 seconds
ZDM_GET_SRC_INFO ................ COMPLETED
ZDM_GET_TGT_INFO ................ COMPLETED
ZDM_PRECHECKS_SRC ............... COMPLETED
ZDM_PRECHECKS_TGT ............... COMPLETED
ZDM_SETUP_SRC ................... COMPLETED
ZDM_SETUP_TGT ................... COMPLETED
ZDM_PREUSERACTIONS .............. COMPLETED
ZDM_PREUSERACTIONS_TGT .......... COMPLETED
ZDM_VALIDATE_SRC ................ COMPLETED
ZDM_VALIDATE_TGT ................ COMPLETED
ZDM_DISCOVER_SRC ................ COMPLETED
ZDM_COPYFILES ................... COMPLETED
ZDM_PREPARE_TGT ................. COMPLETED
ZDM_SETUP_TDE_TGT ............... COMPLETED
ZDM_RESTORE_TGT ................. COMPLETED
ZDM_RECOVER_TGT ................. COMPLETED
ZDM_FINALIZE_TGT ................ COMPLETED
ZDM_CONFIGURE_DG_SRC ............ COMPLETED
ZDM_SWITCHOVER_SRC .............. PENDING
ZDM_SWITCHOVER_TGT .............. PENDING
ZDM_POST_DATABASE_OPEN_TGT ...... PENDING
ZDM_NONCDBTOPDB_PRECHECK ........ PENDING
ZDM_NONCDBTOPDB_CONVERSION ...... PENDING
ZDM_POST_MIGRATE_TGT ............ PENDING
TIMEZONE_UPGRADE_PREPARE_TGT .... PENDING
TIMEZONE_UPGRADE_TGT ............ PENDING
ZDM_POSTUSERACTIONS ............. PENDING
ZDM_POSTUSERACTIONS_TGT ......... PENDING
ZDM_CLEANUP_SRC ................. PENDING
ZDM_CLEANUP_TGT ................. PENDING

Pause After Phase: "ZDM_CONFIGURE_DG_SRC"
[zdmuser@zdmhost migration]$

This gives us the opportunity to complete all the preparation without any downtime and then wait for the migration maintenance window.
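
When a job is paused and resumed several times, reading the full query output each time gets tedious. Here is a small sketch that extracts the job status and the state of one phase from the output of zdmcli query job. It only assumes the text layout shown above; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# zdm_job_status: read "zdmcli query job" output on stdin and print the
# value of the first "Current status:" line.
zdm_job_status() {
  awk '/^Current status:/ && !seen {
         sub(/^Current status: */, ""); print; seen = 1
       }'
}

# zdm_phase_state PHASE: read the same output on stdin and print the state
# of PHASE (COMPLETED, PENDING, FAILED, ...). Returns non-zero if PHASE is
# not listed.
zdm_phase_state() {
  awk -v p="$1" '
    $1 == p && $2 ~ /^\.+$/ { print $3; found = 1 }
    END { exit !found }
  '
}
```

For example: /u01/app/oracle/product/zdm/bin/zdmcli query job -jobid 75 | zdm_phase_state ZDM_CONFIGURE_DG_SRC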

The standby on the ExaCC is synchronised with the on-premise primary database. There is no lag.

oracle@vmonpr:/home/oracle/ [ONPR] dgmgrl
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Mar 22 16:40:23 2024
Version 19.22.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.

DGMGRL> connect /
Connected to "ONPR"
Connected as SYSDG.

DGMGRL> show configuration lag

Configuration - ZDM_onpr

  Protection Mode: MaxPerformance
  Members:
  onpr           - Primary database
    onprz_app_001t - Physical standby database
                     Transport Lag:      0 seconds (computed 1 second ago)
                     Apply Lag:          0 seconds (computed 1 second ago)

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 22 seconds ago)
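
The lag check can also be scripted before resuming the job. This is a sketch only: it parses the "show configuration lag" text as displayed above and understands only lags reported as a plain number of seconds. Anything else is treated as "not in sync". The function names are hypothetical.

```shell
#!/usr/bin/env bash
# dg_lag_seconds KIND: read "show configuration lag" output on stdin and
# print the lag in seconds for KIND ("Transport" or "Apply"), one line per
# standby. Returns non-zero if no such line in "N seconds" form is found.
dg_lag_seconds() {
  awk -v kind="$1" '
    $1 == kind && $2 == "Lag:" && $4 ~ /^seconds?$/ { print $3; found = 1 }
    END { exit !found }
  '
}

# dg_no_lag: succeed only if every reported transport and apply lag is 0.
dg_no_lag() {
  local out t a
  out=$(cat)
  t=$(printf '%s\n' "$out" | dg_lag_seconds Transport) || return 1
  a=$(printf '%s\n' "$out" | dg_lag_seconds Apply) || return 1
  # Fail as soon as one value is something other than a bare 0.
  ! printf '%s\n%s\n' "$t" "$a" | grep -qv '^0$'
}
```

The dgmgrl output can be saved to a file and passed on stdin, for example: dg_no_lag < lag.txt && echo "in sync"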

We now resume the ZDM migration job, adding a pause just after the switchover has been completed.

[zdmuser@zdmhost migration]$ /u01/app/oracle/product/zdm/bin/zdmcli resume job -jobid 75 -pauseafter ZDM_SWITCHOVER_TGT
zdmhost.balgroupit.com: Audit ID: 1165

As we can see, the job failed during the ZDM_SWITCHOVER_SRC phase.

[zdmuser@zdmhost migration]$ /u01/app/oracle/product/zdm/bin/zdmcli query job -jobid 75
zdmhost.balgroupit.com: Audit ID: 1166
Job ID: 75
User: zdmuser
Client: zdmhost
Job Type: "MIGRATE"
Scheduled job command: "zdmcli migrate database -sourcesid ONPR -rsp /home/zdmuser/migration/zdm_ONPR_physical_online.rsp -sourcenode vmonpr -srcauth zdmauth -srcarg1 user:oracle -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode ExaCC-cl01n1 -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -tdekeystorepasswd -tgttdekeystorepasswd -pauseafter ZDM_CONFIGURE_DG_SRC"
Scheduled job execution start time: 2024-03-22T15:29:56+01. Equivalent local time: 2024-03-22 15:29:56
Current status: FAILED
Result file path: "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.log"
Metrics file path: "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.json"
Job execution start time: 2024-03-22 15:30:13
Job execution end time: 2024-03-22 16:43:34
Job execution elapsed time: 22 minutes 4 seconds
ZDM_GET_SRC_INFO ................ COMPLETED
ZDM_GET_TGT_INFO ................ COMPLETED
ZDM_PRECHECKS_SRC ............... COMPLETED
ZDM_PRECHECKS_TGT ............... COMPLETED
ZDM_SETUP_SRC ................... COMPLETED
ZDM_SETUP_TGT ................... COMPLETED
ZDM_PREUSERACTIONS .............. COMPLETED
ZDM_PREUSERACTIONS_TGT .......... COMPLETED
ZDM_VALIDATE_SRC ................ COMPLETED
ZDM_VALIDATE_TGT ................ COMPLETED
ZDM_DISCOVER_SRC ................ COMPLETED
ZDM_COPYFILES ................... COMPLETED
ZDM_PREPARE_TGT ................. COMPLETED
ZDM_SETUP_TDE_TGT ............... COMPLETED
ZDM_RESTORE_TGT ................. COMPLETED
ZDM_RECOVER_TGT ................. COMPLETED
ZDM_FINALIZE_TGT ................ COMPLETED
ZDM_CONFIGURE_DG_SRC ............ COMPLETED
ZDM_SWITCHOVER_SRC .............. FAILED
ZDM_SWITCHOVER_TGT .............. PENDING
ZDM_POST_DATABASE_OPEN_TGT ...... PENDING
ZDM_NONCDBTOPDB_PRECHECK ........ PENDING
ZDM_NONCDBTOPDB_CONVERSION ...... PENDING
ZDM_POST_MIGRATE_TGT ............ PENDING
TIMEZONE_UPGRADE_PREPARE_TGT .... PENDING
TIMEZONE_UPGRADE_TGT ............ PENDING
ZDM_POSTUSERACTIONS ............. PENDING
ZDM_POSTUSERACTIONS_TGT ......... PENDING
ZDM_CLEANUP_SRC ................. PENDING
ZDM_CLEANUP_TGT ................. PENDING

Pause After Phase: "ZDM_SWITCHOVER_TGT"

The ZDM log displays the following information, showing that the switchover fails when starting the old primary on-premise database.

####################################################################
PRGZ-3605 : Oracle Data Guard Broker switchover to database "ONPRZ_APP_001T" on database "ONPR" failed.
ONPRZ_APP_001T
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Mar 22 15:43:19 2024
Version 19.22.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "ONPR"
Connected as SYSDG.
DGMGRL> Performing switchover NOW, please wait...
Operation requires a connection to database "onprz_app_001t"
Connecting ...
Connected to "ONPRZ_APP_001T"
Connected as SYSDBA.
New primary database "onprz_app_001t" is opening...
Operation requires start up of instance "ONPR" on database "onpr"
Starting instance "ONPR"...
ORA-01017: invalid username/password; logon denied


Please complete the following steps to finish switchover:
        start up and mount instance "ONPR" of database "onpr"

DGMGRL>
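
To spot such errors quickly in a long log, the distinct ORA codes can be listed. This is a trivial sketch and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# ora_errors: read log text on stdin and print each distinct ORA-nnnnn
# code once, sorted.
ora_errors() {
  grep -o 'ORA-[0-9][0-9]*' | sort -u
}
```

It can be fed with the result file reported by zdmcli query job, for example: ora_errors < "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.log"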

Root cause analysis

We can see that the temporary database, created on the ExaCC by ZDM, now has the primary role and is open READ WRITE.

oracle@ExaCC-cl01n1:/u02/app/oracle/zdm/zdm_ONPR_RZ2_70/zdm/log/ [ONPR1 (CDB$ROOT)] ps -ef | grep [p]mon | grep -i onprz
oracle    51392      1  0 14:23 ?        00:00:00 ora_pmon_ONPRZ_APP_001T1

oracle@ExaCC-cl01n1:/u02/app/oracle/zdm/zdm_ONPR_RZ2_70/zdm/log/ [ONPR1 (CDB$ROOT)] export ORACLE_SID=ONPRZ_APP_001T1

oracle@ExaCC-cl01n1:/u02/app/oracle/zdm/zdm_ONPR_RZ2_72/zdm/log/ [ONPRZ_APP_001T1 (CDB$ROOT)] sqh

SQL*Plus: Release 19.0.0.0.0 - Production on Fri Mar 22 16:45:32 2024
Version 19.22.0.0.0

Copyright (c) 1982, 2023, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c EE Extreme Perf Release 19.0.0.0.0 - Production
Version 19.22.0.0.0

SQL> select open_mode, database_role from v$database;

OPEN_MODE            DATABASE_ROLE
-------------------- ----------------
READ WRITE           PRIMARY

But the on-premise database is stopped, when it should be started in MOUNT state with the standby role.

oracle@vmonpr:/u00/app/oracle/zdm/zdm_ONPR_75/zdm/log/ [ONPR] ONPR
********* dbi services Ltd. *********
STATUS          : STOPPED
*************************************

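This happens because ZDM sets up the broker connection incorrectly: it uses local OS authentication (dgmgrl /) rather than connecting through the listener, which is mandatory for a switchover operation, since the broker must be able to log in again to restart the old primary instance (the step that fails with ORA-01017 above). This can be seen in the ZDM logs.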

oracle@vmonpr:/u00/app/oracle/zdm/zdm_ONPR_75/zdm/log/ [ONPR] grep dgmgrl zdm_switchover_src_24209.log
[jobid-75][2024-03-22T15:43:19Z][mZDM_Queries.pm:9597]:[DEBUG] Will be running following dgmgrl statements as user: oracle:
                  /u00/app/oracle/product/19.22.0.0.240116.EE/bin/dgmgrl /
[jobid-75][2024-03-22T15:43:19Z][mZDM_Utils.pm:3450]:[DEBUG] run_as_user2InMem: Running /u00/app/oracle/product/19.22.0.0.240116.EE/bin/dgmgrl /
[jobid-75][2024-03-22T15:43:34Z][mZDM_Utils.pm:3473]:[DEBUG] Remove /u00/app/oracle/zdm/zdm_ONPR_75/zdm/log/zdm_dgmgrl_out_c7pQRGXf
[jobid-75][2024-03-22T15:43:34Z][mZDM_Utils.pm:3482]:[DEBUG] /u00/app/oracle/product/19.22.0.0.240116.EE/bin/dgmgrl / successfully executed
[jobid-75][2024-03-22T15:43:34Z][mZDM_Queries.pm:9544]:[DEBUG] Successfully executed dgmgrl script 'switchover to 'onprz_app_001t';'
oracle@vmonpr:/u00/app/oracle/zdm/zdm_ONPR_75/zdm/log/ [ONPR]
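
The same check can be narrowed down to the calls that matter. Below is a sketch that keeps only the debug lines where dgmgrl is run with "/" as its sole argument. It relies on the log line format shown above, which may differ in other ZDM versions; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# dgmgrl_local_calls: read ZDM debug log lines on stdin and print those
# where dgmgrl is started with "/" as its only argument, that is, with
# local OS authentication and no connect identifier.
dgmgrl_local_calls() {
  grep -E 'Running [^ ]*/dgmgrl /[[:space:]]*$'
}
```

For example: dgmgrl_local_calls < zdm_switchover_src_24209.log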

Solution

We first start the on-premise database in MOUNT state.

oracle@vmonpr:/u00/app/oracle/zdm/zdm_ONPR_75/zdm/log/ [ONPR] sqh

SQL*Plus: Release 19.0.0.0.0 - Production on Fri Mar 22 16:45:51 2024
Version 19.22.0.0.0

Copyright (c) 1982, 2023, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup mount
ORACLE instance started.

Total System Global Area 2147479664 bytes
Fixed Size                  8941680 bytes
Variable Size            1224736768 bytes
Database Buffers          905969664 bytes
Redo Buffers                7831552 bytes
Database mounted.

We check the Data Guard configuration to make sure the on-premise database, now the standby, is synchronised with the new primary database. There should be no gap.

oracle@vmonpr:/u00/app/oracle/zdm/zdm_ONPR_75/zdm/log/ [ONPR] dgmgrl
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Mar 22 16:47:29 2024
Version 19.22.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect /
Connected to "ONPR"
Connected as SYSDG.

DGMGRL> show configuration lag

Configuration - ZDM_onpr

  Protection Mode: MaxPerformance
  Members:
  onprz_app_001t - Primary database
    onpr           - Physical standby database
                     Transport Lag:      0 seconds (computed 1 second ago)
                     Apply Lag:          0 seconds (computed 1 second ago)

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 16 seconds ago)

DGMGRL>
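
Before touching the ZDM metadata, we want to be sure the roles have really been switched. Here is a sketch that reads the role of one member from the "show configuration" text, assuming the "name - role" layout shown above. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# dg_member_role NAME: read "show configuration" output on stdin and print
# the role text shown after "NAME - " (for example "Primary database").
# The name comparison is case-insensitive. Returns non-zero if NAME is
# not a member.
dg_member_role() {
  awk -v n="$1" '
    tolower($1) == tolower(n) && $2 == "-" {
      sub(/^[ \t]*[^ \t]+[ \t]+-[ \t]+/, "")
      print; found = 1
    }
    END { exit !found }
  '
}
```

With the output above saved in a file, dg_member_role onprz_app_001t < config.txt prints "Primary database" and dg_member_role onpr < config.txt prints "Physical standby database".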

Now, we cannot simply resume the ZDM job, because it would retry the last failed phase, ZDM_SWITCHOVER_SRC, and attempt the switchover again. That would fail, as the on-premise database is no longer the primary: the roles have already been switched.

This is why we have no other choice than to update the ZDM metadata XML file and change the state of this phase to SUCCESS, knowing that we have resolved and completed it manually.

The XML metadata file can be found in the directory GHcheckpoints/<SOURCE_HOST>+<ORACLE_SID>+<TARGET_HOST>. The XML file name is <SOURCE_HOST>+<ORACLE_SID>+<TARGET_HOST>.xml.

We need to update the file as follows.

[zdmuser@zdmhost vmonpr+ONPR+ExaCC-cl01n1]$ pwd
/u01/app/oracle/chkbase/GHcheckpoints/vmonpr+ONPR+ExaCC-cl01n1

[zdmuser@zdmhost vmonpr+ONPR+ExaCC-cl01n1]$ cp -p vmonpr+ONPR+ExaCC-cl01n1.xml vmonpr+ONPR+ExaCC-cl01n1.xml.20240322_prob_broker

[zdmuser@zdmhost vmonpr+ONPR+ExaCC-cl01n1]$ vi vmonpr+ONPR+ExaCC-cl01n1.xml

[zdmuser@zdmhost vmonpr+ONPR+ExaCC-cl01n1]$ diff vmonpr+ONPR+ExaCC-cl01n1.xml vmonpr+ONPR+ExaCC-cl01n1.xml.20240322_prob_broker
106c106
<    <CHECKPOINT LEVEL="MAJOR" NAME="ZDM_SWITCHOVER_SRC" DESC="ZDM_SWITCHOVER_SRC" STATE="SUCCESS"/>
---
>    <CHECKPOINT LEVEL="MAJOR" NAME="ZDM_SWITCHOVER_SRC" DESC="ZDM_SWITCHOVER_SRC" STATE="START"/>
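
The same backup, edit, and diff sequence can be wrapped in a small function to avoid typos in vi. This is a sketch under assumptions: it knows nothing about the XML beyond the CHECKPOINT line visible in the diff above, it expects each CHECKPOINT element on a single line, and it expects phase names made only of letters, digits, and underscores. The function name is hypothetical, and this remains a manual, unsupported change to ZDM metadata.

```shell
#!/usr/bin/env bash
# mark_checkpoint_success FILE PHASE: back up FILE, then set STATE="SUCCESS"
# on the CHECKPOINT line whose NAME is PHASE. Prints the resulting diff.
# Fails, leaving FILE untouched, if nothing would change.
mark_checkpoint_success() {
  local file=$1 phase=$2 backup tmp
  backup="$file.$(date +%Y%m%d_%H%M%S).bak"
  cp -p "$file" "$backup" || return 1
  tmp=$(mktemp) || return 1
  sed "/<CHECKPOINT [^>]*NAME=\"$phase\"/ s/STATE=\"[^\"]*\"/STATE=\"SUCCESS\"/" \
    "$backup" > "$tmp" || return 1
  if cmp -s "$backup" "$tmp"; then
    rm -f "$tmp"
    echo "no change made for $phase" >&2
    return 1
  fi
  # Write through the existing file so its owner and permissions are kept.
  cat "$tmp" > "$file" && rm -f "$tmp"
  diff "$backup" "$file"
  return 0
}
```

For example, from the GHcheckpoints directory: mark_checkpoint_success vmonpr+ONPR+ExaCC-cl01n1.xml ZDM_SWITCHOVER_SRC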

We can now resume the job with the same command as before the problem.

[zdmuser@zdmhost vmonpr+ONPR+ExaCC-cl01n1]$ /u01/app/oracle/product/zdm/bin/zdmcli resume job -jobid 75 -pauseafter ZDM_SWITCHOVER_TGT
zdmhost.balgroupit.com: Audit ID: 1167

We can see that the switchover phases have now completed successfully and that the job is waiting for the next resume: the job status is set to PAUSED.

[zdmuser@zdmhost vmonpr+ONPR+ExaCC-cl01n1]$ /u01/app/oracle/product/zdm/bin/zdmcli query job -jobid 75
zdmhost.balgroupit.com: Audit ID: 1168
Job ID: 75
User: zdmuser
Client: zdmhost
Job Type: "MIGRATE"
Scheduled job command: "zdmcli migrate database -sourcesid ONPR -rsp /home/zdmuser/migration/zdm_ONPR_physical_online.rsp -sourcenode vmonpr -srcauth zdmauth -srcarg1 user:oracle -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode ExaCC-cl01n1 -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -tdekeystorepasswd -tgttdekeystorepasswd -pauseafter ZDM_CONFIGURE_DG_SRC"
Scheduled job execution start time: 2024-03-22T15:29:56+01. Equivalent local time: 2024-03-22 15:29:56
Current status: PAUSED
Current Phase: "ZDM_SWITCHOVER_TGT"
Result file path: "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.log"
Metrics file path: "/u01/app/oracle/chkbase/scheduled/job-75-2024-03-22-15:30:13.json"
Job execution start time: 2024-03-22 15:30:13
Job execution end time: 2024-03-22 16:53:21
Job execution elapsed time: 24 minutes 42 seconds
ZDM_GET_SRC_INFO ................ COMPLETED
ZDM_GET_TGT_INFO ................ COMPLETED
ZDM_PRECHECKS_SRC ............... COMPLETED
ZDM_PRECHECKS_TGT ............... COMPLETED
ZDM_SETUP_SRC ................... COMPLETED
ZDM_SETUP_TGT ................... COMPLETED
ZDM_PREUSERACTIONS .............. COMPLETED
ZDM_PREUSERACTIONS_TGT .......... COMPLETED
ZDM_VALIDATE_SRC ................ COMPLETED
ZDM_VALIDATE_TGT ................ COMPLETED
ZDM_DISCOVER_SRC ................ COMPLETED
ZDM_COPYFILES ................... COMPLETED
ZDM_PREPARE_TGT ................. COMPLETED
ZDM_SETUP_TDE_TGT ............... COMPLETED
ZDM_RESTORE_TGT ................. COMPLETED
ZDM_RECOVER_TGT ................. COMPLETED
ZDM_FINALIZE_TGT ................ COMPLETED
ZDM_CONFIGURE_DG_SRC ............ COMPLETED
ZDM_SWITCHOVER_SRC .............. COMPLETED
ZDM_SWITCHOVER_TGT .............. COMPLETED
ZDM_POST_DATABASE_OPEN_TGT ...... PENDING
ZDM_NONCDBTOPDB_PRECHECK ........ PENDING
ZDM_NONCDBTOPDB_CONVERSION ...... PENDING
ZDM_POST_MIGRATE_TGT ............ PENDING
TIMEZONE_UPGRADE_PREPARE_TGT .... PENDING
TIMEZONE_UPGRADE_TGT ............ PENDING
ZDM_POSTUSERACTIONS ............. PENDING
ZDM_POSTUSERACTIONS_TGT ......... PENDING
ZDM_CLEANUP_SRC ................. PENDING
ZDM_CLEANUP_TGT ................. PENDING

Pause After Phase: "ZDM_SWITCHOVER_TGT"

To wrap up

ZDM gives us the flexibility to complete a phase manually and update the ZDM metadata in order to move forward. This solution needs to be applied carefully. If you are not fully confident, I would recommend opening an SR. Of course, this method cannot be used to skip only part of a phase: the whole phase has to be completed manually, running every one of its operations by hand. That is easy for a phase that only runs a switchover, but it might be more complex for another phase.