Rollback framework documentation
balenaCloud and balenaOS support host OS updates (HUP). Rollbacks is a framework designed to roll back an OS update if something goes wrong.
There are two rollback mechanisms in the OS, covering different update failure modes: rollback-health, which is based on health checks, and rollback-altboot, which recognizes when the new system is unbootable. Their detailed operation is explained below.
rollback-health
The new OS gets to userspace but something is unhealthy. Userspace is functional and we can use systemd services and bash scripts in this case.
- This state is checked by a systemd service: rollback-health.service.
- During a HUP, a flag file rollback-health-breadcrumb is left in the state partition to enable the rollback-health systemd service on next boot.
- rollback-health.service runs rollback-health, which runs rollback-tests. Two things are checked to establish whether balenaOS is healthy:
  - balenaEngine is not working. The balenaEngine healthcheck is run.
  - The VPN is not connecting, but it used to connect in the previous OS.
- These tests are run once every minute for 15 minutes, which is the default value of the ROLLBACK_HEALTH_TIMEOUT variable.
- If the OS is considered healthy, rollback-health clears the flag files left in the state partition. This service won't run again.
- If a rollback is triggered by a health check failure, the previous OS boot hooks are run to restore the previous boot files, resin_root_part is updated in resinOS_uEnv.txt in the boot partition to point to the previous OS partition, a flag file rollback-health-triggered is left in the state partition, and a reboot is triggered.
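
The polling behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the actual rollback-health script (which, per the text, is a bash script run by systemd). The function names `wait_for_healthy`, `vpn_check` and `next_actions` are hypothetical, and the sketch assumes the OS is declared healthy as soon as every check passes within the timeout window, which the text does not state explicitly.

```python
import time
from typing import Callable, Iterable, List

# Defaults taken from the text: 15 minutes, checked once per minute.
ROLLBACK_HEALTH_TIMEOUT_S = 15 * 60
CHECK_INTERVAL_S = 60


def vpn_check(connected_now: bool, connected_in_previous_os: bool) -> bool:
    """The VPN only counts as unhealthy if it used to connect in the previous OS."""
    return connected_now or not connected_in_previous_os


def wait_for_healthy(
    checks: Iterable[Callable[[], bool]],
    timeout_s: float = ROLLBACK_HEALTH_TIMEOUT_S,
    interval_s: float = CHECK_INTERVAL_S,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once every check passes, False when the timeout is used up.

    Assumption: a single fully passing round is enough to declare the OS healthy.
    """
    checks = list(checks)
    elapsed = 0.0
    while True:
        if all(check() for check in checks):
            return True
        if elapsed >= timeout_s:
            return False
        sleep(interval_s)
        elapsed += interval_s


def next_actions(healthy: bool) -> List[str]:
    """List the steps the text describes for each outcome (descriptive only)."""
    if healthy:
        return ["clear flag files in the state partition"]
    return [
        "run previous OS boot hooks to restore previous boot files",
        "point resin_root_part in resinOS_uEnv.txt at the previous OS partition",
        "leave rollback-health-triggered in the state partition",
        "reboot",
    ]
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to elapse; in normal use the default `time.sleep` applies.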
rollback-altboot
The new OS is unbootable and does not get to Linux userspace (for example, a kernel panic, or something crashes before the OS reaches userspace and is able to run systemd). This requires the bootloader and userspace to work together. The bootloader needs to count the number of boots, and userspace needs to reset the bootcount if the OS is functional.
- During a HUP, the variable upgrade_available is set in resinOS_uEnv.txt in the boot partition.
- resinOS_uEnv.txt is read by the bootloader and bootcount is incremented if upgrade_available=1.
- Bootcount is saved in the boot partition: grubenv for grub and bootcount.env for u-boot.
- During a boot, the bootloader checks the value of the bootcount variable. If it is higher than 1, this means nothing in the OS cleared the bootcount. It is assumed that the new OS failed to reach userspace and the bootloader is supposed to boot the previous rootfs. For example, if resin_root_part=3 in resinOS_uEnv.txt, the bootloader will try to boot assuming resin_root_part=2.
- The bootloader has done its job and booted the previous OS. However, the boot files (e.g. dt overlay files) in the boot partition are still those of the new broken rootfs, as we don't have multiple copies of them in the boot partition.
- We need to copy the previous boot files into the boot partition. These files are available in the root partition in the resin-boot folder.
- During a HUP, a flag file rollback-altboot-breadcrumb is left in the state partition.
- rollback-altboot.service is the systemd service that runs if rollback-altboot-breadcrumb is present.
- rollback-altboot.service checks if we are running the previous root, i.e. resin_root_part=3 in resinOS_uEnv.txt, but the current OS is actually mounted and running from resin_root_part=2.
- If rollback-altboot detects that the bootloader has booted the previous rootfs:
  - rollback-altboot then runs boot hooks and copies over the currently running rootfs boot files from resin-boot into the boot partition.
  - If rollback-altboot fails to clear the state and reboot the board for whatever reason, rollback-health will attempt to clear rollback state and reboot the board after 15 minutes.
- If rollback-altboot.service detects that the bootloader has booted the correct rootfs, this script does nothing and lets rollback-health.service function. The rollback-altboot-breadcrumb file is cleared by the rollback-health.service.
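
The interplay between the bootloader's bootcount and the userspace check can be modelled with a small Python sketch. It is not the bootloader or rollback-altboot code; `bootloader_step` and `altboot_needs_restore` are hypothetical names, and the sketch assumes the bootloader increments the bootcount first and compares it to 1 afterwards, an ordering the text implies but does not spell out.

```python
from typing import NamedTuple


class BootDecision(NamedTuple):
    root_part: int   # partition the bootloader will boot from
    bootcount: int   # value saved back to the boot partition


def bootloader_step(
    configured_part: int,
    previous_part: int,
    upgrade_available: bool,
    bootcount: int,
) -> BootDecision:
    """Model one boot: count it if an upgrade is pending, then choose a rootfs."""
    if not upgrade_available:
        # No HUP in progress: bootcount is untouched, boot the configured root.
        return BootDecision(configured_part, bootcount)
    bootcount += 1
    if bootcount > 1:
        # Nothing in the OS cleared the bootcount, so the new OS is assumed
        # to have failed before reaching userspace: boot the previous rootfs.
        return BootDecision(previous_part, bootcount)
    return BootDecision(configured_part, bootcount)


def altboot_needs_restore(configured_part: int, running_part: int) -> bool:
    """True when resinOS_uEnv.txt names one root but the OS runs from another."""
    return configured_part != running_part


# First boot after a HUP: the new rootfs (3) is tried.
first = bootloader_step(3, 2, upgrade_available=True, bootcount=0)
# The new OS never reaches userspace, so bootcount is not reset.
second = bootloader_step(3, 2, upgrade_available=True, bootcount=first.bootcount)
```

In this model `second.root_part` is 2 while resinOS_uEnv.txt still says 3, which is exactly the mismatch rollback-altboot.service looks for before restoring the boot files from resin-boot.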