While applying what seemed to be a minor update, we managed to “break” our cloud-init automated OCI compute instance deployment process. This blog describes how to quickly troubleshoot & debug your #cloud-config on Oracle Cloud Infrastructure (not so different from other hyperscalers)

safe harbor statement

The below troubleshooting scenario has been simplified to ease the understanding of the root cause analysis

Step 1, Identify the error message

Let’s start our investigation at the roots, with logfiles.

Python modules log: /var/log/cloud-init.log

By default, all python module vent are logged with the standard hierarchy levels : WARNING, DEBUG, aso ..

[opc@bastion-vpn-server ~]$ sudo tail -f -n2 /var/log/cloud-init.log
2023-05-31 20:15:44,198 - util.py[DEBUG]: cloud-init mode 'modules' took 8.091 seconds
2023-05-31 20:15:44,198 - handlers.py[DEBUG]: finish: modules-final: FAIL: running modules for final

Hum, interesting! last line shows we ran until final stage (aka. our user-defined #cloud-config). But it’s still unclear what really caused the failure.

Let’s move forward to the next log.

cloud-init configuration output: /var/log/cloud-init-output.log

Now, we just moved to the “main” logfile which protocols all #cloud-config configuration output (aka. STDOUT and STDERR)

[opc@bastion-vpn-server ~]$ sudo tail -f  /var/log/cloud-init-output.log
...
... <snippet truncated>
...
/var/lib/cloud/instance/scripts/runcmd: line 2: /usr/bin: Is a directory
...
... <snippet truncated>
...
No match for argument: htop
Error: Unable to find a match: htop
...
... <snippet truncated>
...
2023-05-31 20:15:44,107 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-05-31 20:15:44,107 - util.py[WARNING]: Running module scripts-user () failed


Hum, we get closer! Looks like a command was not interpreted correctly but do we have still access to the commands executed? Yes, of course check the generated scripts in /var/lib/cloud/instance/scripts:

[opc@bastion-vpn-server ~]$ sudo cat /var/lib/cloud/instance/scripts/runcmd
#!/bin/sh
'/usr/bin' 'dnf' '-y' 'install' 'https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm'
'/usr/bin/sleep' '5'
'/usr/bin/dnf' '-y' 'install' 'htop'

From now on, a Linux expert will immediate recognize a typo ! Are we sure we have only a single error? Let’s move forward and try to validate the syntax

Step 2, Validate cloud-init user-data deployed on the system

Now, here we (almost) cheated ! Unfortunately, on OCI cloud-init version isn’t up2date (Release 22.1 GA date Feb. 2022). And no, Oracle is not lagging behind others large hyperscalers (at least compared to the one with the smile). Indeed, cloud-init evolves very quickly and the documentation is not very easy to follow especially across releases.

Validate the syntax as follows (At the time of writing this post the cloud-init release 22.1 is available on OCI):

[opc@bastion-vpn-server ~]# sudo cloud-init --version
/usr/bin/cloud-init 22.1-6.0.4.el8_7.2
[opc@bastion-vpn-server ~]$ sudo cloud-init devel schema --system
Error:
Cloud config schema errors: runcmd.1: ['/usr/bin/sleep', 5] is not valid under any of the given schemas

Since cloud-init release 22.2 (GA date May 2022) the command to validate cloud-init user-data evolved to:

[opc@bastion-vpn-server opt]$ sudo cloud-init --version
$/bin/cloud-init 23.2
[opc@bastion-vpn-server opt]$ sudo cloud-init schema --system
Invalid cloud-config /var/lib/cloud/instances/ocid1.instance.oc1.eu-zurich-1.an5heljr4p5qcaac5uh7vhwsfmycazwru55ywyfrpljox76mn6edhyvfyfza/cloud-config.txt
Error: Cloud config schema errors: runcmd.1: ['/usr/bin/sleep', 5] is not valid under any of the given schemas
Error: Invalid cloud-config schema: user-data

The last snippet is an example after latest cloud-init release was installed from source code (https://github.com/canonical/cloud-init/releases)


But wait a minute? this is not really the problem identified earlier, isn’t it?

/var/lib/cloud/instance/scripts/runcmd: line 2: /usr/bin: Is a directory

Indeed, we have introduced not only one typo but also a second issue (syntax inconsistency).

Let’s move forward and retrieve the “raw/original” #cloud-config

Step3, retrieve (and validate) raw user-data #cloud-config

The cloud-init user-datas is passed to the compute instance through RestAPIs. On OCI you can retrieve the #cloud-config user data YAML file as follows:

[opc@bastion-vpn-server ~]$ curl --silent -H "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/metadata/user_data -o - | base64 --decode > /tmp/cloud-init_user-data.yml

Now, we can use the aforementioned syntax validation commands to get rid of the error(s):

[opc@bastion-vpn-server opt]$ sudo cloud-init devel schema --config-file /tmp/cloud-init_user-data.yml
Invalid cloud-config /tmp/cloud-init_user-data.yml
Error: Cloud config schema errors: runcmd.1: ['/usr/bin/sleep', 5] is not valid under any of the given schemas
Error: Invalid cloud-config schema: user-data
[opc@bastion-vpn-server opt]$ vi /tmp/cloud-init_user-data.yml
...
<fix the typo>
...
[opc@bastion-vpn-server opt]$ sudo cloud-init devel schema --config-file /tmp/cloud-init_user-data.yml
Valid cloud-config: /tmp/cloud-init_user-data.yml

Sounds good? Might be too easy to be true but who knows 😉
Now, we identified, fixed error and tested the fixes by re-running the whole provisioning process.

Step 4, if needed re-run #cloud-config stages/modules

Working with cloud-init can sometimes be really annoying as it’s a very complex tool with orchestrated stages, modules (ie: runcmd, ansible, write_files, aso…) which are customizable and customized by each hyperscaler differently.

Follows, the cheat sheet to get through if further troubleshooting is needed

# Remove logs and artifacts so cloud-init can re-run
sudo cloud-init clean --logs

# Detect local #cloud-config source (ie: cloud platform settings, IPs, routes)
sudo cloud-init init
sudo cloud-init init --local

# Run #cloud-config modules defined in /etc/cloud/cloud.cfg (ie: timezone)
sudo cloud-init modules --mode=config 

# Re-run user accustomed #cloud-config
sudo cloud-init modules --mode=final

# Alternatively, re-run only a single user accustomed #cloud-config module
sudo cloud-init single --name runcmd

Well, depending on /etc/cloud/cloud.cfg settings most of the modules if executed won’t be re-executed without tweaking the config.file if they already ran (whatever the outcome).

[opc@bastion-vpn-server ~]$ sudo tail -f  /var/log/cloud-init.log
...
... <snippet truncated>
...
2023-06-01 21:15:44,968 - helpers.py[DEBUG]: config-runcmd already ran (freq=once-per-instance)

Please refer to the documentation to

Have fun on #OCI with #cloud-init

Hope this post was worth reading 😉